Front cover

Power Systems for AIX IV:


Performance Management
(Course code AN51)

Instructor Guide
ERC 2.1


Trademarks
IBM and the IBM logo are registered trademarks of International Business Machines
Corporation.
The following are trademarks of International Business Machines Corporation, registered in
many jurisdictions worldwide:
DB2, HACMP, System i, System p, System x, System z
Windows is a trademark of Microsoft Corporation in the United States, other countries, or both.
Linux is a trademark of Linus Torvalds in the United States, other countries, or both.
Other company, product, or service names may be trademarks or service marks of others.

November 2010 edition


The information contained in this document has not been submitted to any formal IBM test and is distributed on an "as is" basis without any warranty either express or implied. The use of this information or the implementation of any of these techniques is a customer responsibility and depends on the customer's ability to evaluate and integrate them into the customer's operational environment. While each item may have been reviewed by IBM for accuracy in a specific situation, there is no guarantee that the same or similar results will be obtained elsewhere. Customers attempting to adapt these techniques to their own environments do so at their own risk.

© Copyright International Business Machines Corporation 2010.


This document may not be reproduced in whole or in part without the prior written permission of IBM.
Note to U.S. Government Users: Documentation related to restricted rights. Use, duplication, or disclosure is subject to restrictions set forth in the GSA ADP Schedule Contract with IBM Corp.


Contents
Trademarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
Instructor course overview. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
Course description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
Agenda . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv
Unit 1. Performance analysis and tuning overview . . . . . . . . . . . . . . . . . . . . . . . . . . 1-1
Unit objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-2
What exactly is performance? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-4
What is a performance problem? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-8
What are benchmarks? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-10
Components of system performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-14
Factors that influence performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-17
Performance metrics and baseline measurement . . . . . . . . . . . . . . . . . . . . . . . . . 1-19
Trade-offs and performance approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-23
Performance analysis flowchart . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-26
Impact of virtualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-29
The performance management team . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-32
Performance analysis tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-35
Performance tuning tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-38
AIX tuning commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-41
Types of tunables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-43
Tunable parameter categories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-45
Tunables command options and files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-47
Tuning commands -L option . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-51
Stanza file format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-54
File control commands for tunables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-57
Checkpoint (1 of 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-62
Checkpoint (2 of 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-64
Exercise 1: Work with tunables files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-66
Unit summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-68
Unit 2. Data collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-1
Unit objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-2
Performance problem description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-4
Collecting performance data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-7
Installing PerfPMR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-9
Capturing data with PerfPMR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-12
PerfPMR report types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-17
Generic report contents (1 of 3) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-20
Generic report contents (2 of 3) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-27
Generic report contents (3 of 3) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-29


Formatting PerfPMR raw traces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-31


When to run PerfPMR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-35
The topas command . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-37
The nmon and nmon_analyser tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-41
The AIX nmon command . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-44
Checkpoint . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-46
Exercise 2: Data collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-48
Unit summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-50
Unit 3. Monitoring, analyzing, and tuning CPU usage. . . . . . . . . . . . . . . . . . . . . . . . .3-1
Unit objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-3
CPU monitoring strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-5
Processes and threads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-9
The life of a process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-12
Run queues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-15
Process and thread priorities (1 of 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-19
Process and thread priorities (2 of 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-22
nice/renice examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-25
Viewing process and thread priorities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-29
Boosting an important process with nice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-33
Usage penalty and decay rates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-35
Priorities: What to do? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-40
AIX workload partitions (WPAR): Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-42
System WPAR and application WPAR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-46
Target shares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-49
Limits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-52
WPAR resource management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-55
wlmstat command syntax . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-58
Context switches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-62
User mode versus system mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-65
Timing commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-68
Monitoring CPU usage with vmstat . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-71
sar command . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-74
Locating dominant processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-77
tprof output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-80
What is simultaneous multi-threading? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-82
SMT scheduling and CPU utilization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-87
System wide CPU reports (old and new) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-90
Viewing CPU statistics with SMT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-92
POWER7 CPU statistics with SMT4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-94
Processor virtualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-96
Performance management with virtualization . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-102
CPU statistics in an SPLPAR (1 of 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-106
CPU statistics in an SPLPAR (2 of 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-109
Checkpoint . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-111
Exercise 3: Monitoring, analyzing, and tuning CPU usage . . . . . . . . . . . . . . . . . .3-113
Unit summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-115


Unit 4. Virtual memory performance monitoring and tuning . . . . . . . . . . . . . . . . . . 4-1


Unit objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-3
Memory hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-5
Virtual and real memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-10
Major VMM functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-13
VMM terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-16
Free list and page replacement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-19
When to steal pages based on free pages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-22
Free list statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-25
Displaying memory usage (1 of 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-28
Displaying memory usage (2 of 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-32
What type of pages are stolen? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-34
Values for page types and classifications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-38
What types of pages are in real memory? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-41
Is memory over committed? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-44
Memory leaks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-47
Detecting a memory leak with vmstat . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-50
Detecting a memory leak with ps gv . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-52
Active memory sharing: Hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-54
Active memory sharing: Loaning and stealing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-56
Displaying memory usage with AMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-59
Active Memory Expansion (AME) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-62
AME statistics (1 of 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-65
AME statistics (2 of 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-68
Active Memory Expansion tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-72
Recommendations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-75
Managing memory demand . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-78
Checkpoint (1 of 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-82
Checkpoint (2 of 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-84
Exercise 4: Virtual memory analysis and tuning . . . . . . . . . . . . . . . . . . . . . . . . . . 4-86
Unit summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-88
Unit 5. Physical and logical volume performance . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-1
Unit objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-2
Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-4
I/O stack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-7
Individual disks versus disk arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-10
Disk groups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-12
LVM attributes that affect performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-15
LVM mirroring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-20
LVM mirroring scheduling policies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-25
Displaying LV fragmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-28
Using iostat . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-31
What is iowait? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-36
LVM pbufs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-39
Viewing and changing LVM pbufs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-41
I/O request disk queuing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-45
Using iostat -D . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-48

sar -d . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-51
Using filemon (1 of 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-54
Using filemon (2 of 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-58
Managing uneven disk workloads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-61
Adapter and multipath statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-66
Monitoring adapter I/O throughput . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-69
Monitoring multiple paths (1 of 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-72
Monitoring multiple paths (2 of 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-75
Checkpoint . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-77
Exercise 5: I/O Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-79
Unit summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-81
Unit 6. File system performance monitoring and tuning . . . . . . . . . . . . . . . . . . . . . .6-1
Unit objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-3
File system I/O layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-5
File system performance factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-7
How to measure file system performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-11
How to measure read throughput . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-14
How to measure write throughput . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-16
Using iostat . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-19
Using filemon (1 of 3) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-22
Using filemon (2 of 3) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-25
Using filemon (3 of 3) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-27
Fragmentation and performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-30
Determine fragmentation using fileplace . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-33
Reorganizing the file system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-36
Using defragfs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-39
JFS and JFS2 logs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-42
Creating additional JFS and JFS2 logs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-45
Sequential read-ahead . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-49
Tuning file syncs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-54
Sequential write-behind . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-57
Random write-behind . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-60
JFS2 random write-behind example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-63
File system buffers and VMM I/O queue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-66
Tuning file system buffers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-69
VMM file I/O pacing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-72
The pros and cons of VMM file caching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-77
JFS and JFS2 release-behind . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-80
Normal I/O versus direct I/O (DIO) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-84
Using direct I/O (DIO) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-86
Checkpoint (1 of 3) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-89
Checkpoint (2 of 3) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-91
Checkpoint (3 of 3) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-93
Exercise 6: File system performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-95
Unit summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-97


Unit 7. Network performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-1


Unit objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-3
What affects network performance? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-5
Document your environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-8
Measuring network performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-11
Network services processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-15
Network memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-19
Memory statistics with netstat -m . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-23
Socket flow control (TCP) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-26
TCP acknowledgement and retransmission . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-29
TCP flow control and probes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-33
netstat -p tcp . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-36
TCP socket buffer tuning (1 of 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-41
TCP socket buffer tuning (2 of 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-44
Interface specific network options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-49
Nagle's algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-52
UDP buffer overflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-56
netstat -p udp . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-59
Fragmentation and segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-62
Intermediate network MTU restrictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-64
TCP maximum segment size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-66
Fragmentation and IP input queue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-70
netstat -p ip . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-74
Interface and hardware flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-79
Transmit queue overflows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-82
Adapter configuration conflicts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-87
Receive pool buffer errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-92
Network traces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-95
Network trace examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-98
Checkpoint (1 of 3) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-100
Checkpoint (2 of 3) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-102
Checkpoint (3 of 3) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-104
Exercise 7: Network performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-106
Unit summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-108
Unit 8. NFS performance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-1
Unit objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-3
NFS tuning concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-5
NFS versions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-9
Transport layers used by NFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-13
NFS request path . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-15
NFS performance related daemons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-18
nfsstat -s . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-23
NFS statistics using netpmon -O nfs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-26
Server tuning with nfso . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-28
nfsstat -c. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-33
nfsstat -m . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-37
Client commit-behind tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-39

Client attribute cache tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .8-42


NFS I/O pacing, release-behind, and DIO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .8-45
Checkpoint (1 of 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .8-48
Checkpoint (2 of 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .8-50
Exercise 8: NFS performance tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .8-52
Unit summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .8-54
Unit 9. Performance management methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9-1
Unit objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9-2
Factors that can affect performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9-4
Determine type of problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9-7
Trade-offs and performance approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9-10
Performance analysis flowchart . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9-13
CPU performance flowchart . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9-15
Memory performance flowchart . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9-18
Disk/File system performance flowchart . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9-22
Network performance flowchart (1 of 3) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9-25
Network performance flowchart (2 of 3) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9-28
Network performance flowchart (3 of 3) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9-31
NFS performance flowchart: Servers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9-34
NFS performance flowchart: Clients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9-37
Checkpoint . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9-40
Exercise 9: Summary exercise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9-42
Unit summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9-44
Appendix A. Checkpoint solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-1


Trademarks
The reader should recognize that the following terms, which appear in the content of this
training document, are official trademarks of IBM or other companies:
IBM is a registered trademark of International Business Machines Corporation.
The following are trademarks of International Business Machines Corporation in the United
States, or other countries, or both:
Active Memory, AIX, AIX 5L, DB2, eServer, Enterprise Storage Server, GPFS, HACMP, Iterations, Lotus, Lotus Notes, Micro-Partitioning, Notes, POWER, POWER Hypervisor, POWER5, POWER6, POWER7, PowerVM, Redbooks, System Storage, Tivoli, WebSphere, 1-2-3, 400

Java and all Java-based trademarks and logos are trademarks of Sun Microsystems, Inc.
in the United States, other countries, or both.
Linux is a registered trademark of Linus Torvalds in the United States, other countries, or
both.
Microsoft and Windows are trademarks of Microsoft Corporation in the United States, other
countries, or both.
UNIX is a registered trademark of The Open Group in the United States and other
countries.
Other product and service names might be trademarks of IBM or other companies.


Instructor course overview


This course emphasizes performance problem determination and is a must for anyone involved in support for performance-related issues. Particular emphasis is placed on solving problems that may occur in real-life customer systems through the use of classroom examples and lab exercises. The course will benefit those without any performance management experience as well as those with many years of it.

Course strategy
The course strategy is to use lecture, checkpoint questions, and lab
exercises.


Course description
Power Systems for AIX IV: Performance Management
Duration: 5 days
Purpose
Develop the skills to measure, analyze, and tune common
performance issues on IBM POWER systems running AIX 6.
Learn about performance management concepts and techniques and how to use basic AIX tools to monitor, analyze, and tune an AIX 6 system. The course covers how virtualization technologies such as the
PowerVM environment and workload partitions affect AIX performance
management. Monitoring and analyzing tools discussed in this course
include vmstat, iostat, sar, tprof, svmon, filemon, netstat,
lvmstat, and topas. Tuning tools include schedo, vmo, ioo, no, and
nfso.
The course also covers how to use Performance Problem Reporting
(PerfPMR) to capture a variety of performance data for later analysis.
Each lecture is reinforced with extensive hands-on lab exercises which
provide practical experience.

Audience
AIX technical support personnel
Performance benchmarking personnel
AIX system administrators

Prerequisites
Students attending this course are expected to have basic AIX system
administration skills. These skills can be obtained by attending the
following courses:
- AU14/Q1314 AIX 5L System Administration I: Implementation
or
- AN12 Power Systems for AIX II: Implementation and
Administration


It is very helpful to have a strong background in TCP/IP networking to support the network performance portion of the course. These skills can be built or reinforced by attending:
- AU07/Q1307 AIX 5L Configuring TCP/IP
or
- AN21 TCP/IP for AIX Administrators
It is also very helpful to have a strong background in PowerVM
(particularly micro-partitioning and the role of the virtual I/O server).
These skills can be built or reinforced by attending:
- AU73 System p LPAR and Virtualization I: Planning and
Configuration
or
- AN30 Power Virtualization I: Implementing Dual VIOS & IVE

Objectives
On completion of this course, students should be able to:
- Define performance terminology
- Describe the methodology for tuning a system
- Identify the set of basic AIX tools to monitor, analyze, and tune
a system
- Use AIX tools to determine common bottlenecks in the Central
Processing Unit (CPU), Virtual Memory Manager (VMM),
Logical Volume Manager (LVM), internal disk Input/Output (I/O),
and network subsystems
- Use AIX tools to demonstrate techniques to tune the
subsystems


Agenda
Day 1
(1:00) Unit 1 - Performance analysis and tuning overview
(0:25) Exercise 1
(0:45) Unit 2 - Data collection
(0:30) Exercise 2
(2:00) Unit 3 - Monitoring, analyzing, and tuning CPU usage
(0:50) Exercise 3 parts 1 and 2

Day 2
(1:20) Exercise 3 parts 3, 4 and 5
(2:30) Unit 4 - Virtual memory performance monitoring and tuning
(1:15) Exercise 4
Students' choice: optional exercise from Exercise 3 or 4

Day 3
(2:30) Unit 5 - Physical and logical volume performance
(1:15) Exercise 5
(1:00) Unit 6 - File system performance, topic 1
(1:00) Exercise 6, parts 1, 2, and 3

Day 4
(1:00) Unit 6 - File system performance, topic 2
(0:30) Exercise 6, part 4
(2:30) Unit 7 - Network performance
(0:45) Exercise 7
Students' choice: optional exercise from Exercises 3, 4, 5, or 6

Day 5
(1:00) Unit 8 - NFS performance
(0:20) Exercise 8
(0:30) Unit 9 - Performance management methodology
(1:00) Exercise 9
Students' choice: optional exercises from Exercises 3, 4, 5, 6, or 7


Unit 1. Performance analysis and tuning overview


Estimated time
1:25 (1:00 Unit; 0:25 Exercise)

What this unit is about


This unit defines performance terminology and gives you a set of tools
to analyze and tune a system. It also discusses the process for tuning
a system.

What you should be able to do


After completing this unit, you should be able to:
Describe the following performance terms:
- Throughput, response time, benchmark, metric, baseline,
performance goal
List performance components
Describe the performance tuning process
List tools available for analysis and tuning

How you will check your progress


Accountability:
Checkpoint
Machine exercises

References
AIX Version 6.1 Performance Management
AIX Version 6.1 Performance Tools Guide and Reference
AIX Version 6.1 Commands Reference, Volumes 1-6
SG24-6478 AIX 5L Practical Performance Tools and Tuning Guide (Redbook)


Unit objectives
After completing this unit, you should be able to:
Describe the following performance terms:
Throughput, response time, benchmark, metric, baseline, performance goal
List performance components
Describe the performance tuning process
List tools available for analysis and tuning

Figure 1-1. Unit objectives (AN512.0)

Notes:
Introduction
The objectives in the visual above state what you should be able to do at the end of this
unit.


Instructor notes:
Purpose Review the objectives for this unit.
Details Explain what we will cover and what the students should be able to do at the
end of the unit.
Since this is the students' first unit, point out the references listed on the front page of this
unit. Each unit will have its own list of references.
Additional information
Transition statement Let's start by defining system performance.


What exactly is performance?


Performance is the major factor on which the productivity of a
system depends
Performance is dependent on a combination of:
Throughput
Response time

Acceptable performance is based on expectations:


Expectations are the basis for quantitative performance goals
[Graphic: a typical daily system load curve from 7 a.m. onward, with its busy and quiet periods labeled Morning Crunch, Lunch Dip, 4 o'clock Panic, and 5 o'clock Cliff]

Figure 1-2. What exactly is performance? (AN512.0)

Notes:
Introduction
Performance of a computer system is different from performance of something else
such as a car or an actor, and so forth. The performance of a computer system is
related to how well the system responds to user requests or how much work the system
can do in a certain amount of time. So we can say that performance is dependent on a
combination of throughput and response time. The performance is also affected by
outside factors such as the network, other machines, and even the environment.
The graphic in the visual above illustrates that the performance of the system will likely have a pattern, and if you understand this pattern, it will make performance management easier. For example, the 4 o'clock Panic, as shown in the visual above, is the busiest time in this system's day. This is not a good time to schedule additional workload, but it is a great time to monitor the system for potential bottlenecks.


Throughput
Throughput is a measure of the amount of work over a period of time. Examples include
database transactions per minute, kilobytes of a file transferred per second, kilobytes of
a file read or written per second, and Web server hits per minute.

Response time
Response time is the elapsed time between when a request is submitted to when the
response from that request is returned. Examples include how long a database query
takes, how long it takes to echo characters to the terminal, or how long it takes to
access a Web page.
Throughput and response time are related. Sometimes you can have higher throughput
at the cost of response time or better response time at the cost of throughput. So,
acceptable performance is based on reasonable throughput combined with reasonable
response time. Sometimes a decision has to be made as to which is more important:
throughput or response time. Typically, user response time is more important since we
humans are probably more impatient than a computer program.
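To make the relationship concrete, here is a minimal sketch using the standard time and dd commands (the file name, block size, and count are arbitrary choices for illustration). The elapsed (real) time of the copy is its response time; the bytes moved divided by that time is its throughput.

   # Copy 100 MB to a scratch file; 'real' is the response time of the request
   time dd if=/dev/zero of=/tmp/testfile bs=1024k count=100

   # If the copy reports, say, real 0m2.50s, then:
   #   throughput = work / elapsed time = 100 MB / 2.5 s = 40 MB per second
   rm /tmp/testfile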

Expectations
Acceptable performance is based on our expectations. Our expectations can be based
on benchmarks (custom written benchmarks or industry standard benchmarks),
computer systems modeling, prior experience with the same or similar systems, or
maybe even wishful thinking. Acceptable response times are relative to the system, the
application, and the expectations of the users. For example, a 5-second response time to initiation of a transaction might seem slow in a fast-paced retail environment but quite normal in a small-town bank.

Setting performance goals


Determining expectations is typically the starting point for setting performance goals.
Performance goals are often stated in terms of specific throughput and response times.
There may be different goals for different applications on the same system.
Example performance goals stated in terms of throughput or response times:
- The average database transaction response time should always be less than 3
seconds.
- The nightly backup job must finish by 6:00 a.m.
- The nightly accounting batch job must finish all 48 runs by 6:00 a.m. each day.

Performance goals are being met; now what?


When performance goals are being met, system administrators must still monitor the
systems to determine if there are any upward trends which show that at some point in

the future the goals will not be met. Know your performance goals and your baseline
(what the system is doing now), and then you can spot a trend and estimate when a
problem will occur.


Instructor notes:
Purpose Define system performance.
Details Emphasize that performance is a combination of throughput and response time
which may depend on outside factors.
Review the performance goals examples in the student notes when explaining the
concepts of throughput and response time.
You can mention the term baseline at this point, but it is covered in more detail in a few
pages. The baseline statistics allow you to see whether performance is better or worse at the moment than when you took the baseline measurements.
Mention that even when performance goals are technically being met, you should monitor the system regularly and compare against the system's baseline performance to see if there's a troubling trend. The best time to tune a system is not when you're in crisis mode!
Additional information
Transition statement Let's look at areas where performance skills are needed.


What is a performance problem?


Functional problem:
An application, hardware, or network is not behaving correctly
Performance problem:
The functions of the application, hardware or network are
being achieved, but the speed of the functions are slow
A functional problem can lead to a performance problem:
Networks or name servers that are down (functional problem)
can slow down communication (performance problem)
A memory leak (functional problem) can cause a paging
problem (performance problem)


Figure 1-3. What is a performance problem? (AN512.0)

Notes:
Overview
Support personnel need to determine when a reported problem is a functional problem
or a performance problem. When a system is not producing the correct results or if a
system or network is down, then this is a functional problem. An application or a system
with a memory leak has a functional problem.
Sometimes functional problems lead to performance problems. In these cases, rather
than tune the system, it is more important to determine the root cause of the problem
and fix it.


Instructor notes:
Purpose To distinguish a performance problem from a functional problem.
Details Usually, when someone thinks there is a performance problem, they think
tuning may be the solution. However, in some cases, it will require a functional problem to
be fixed. But, since functional problems may have performance side-effects, those who do
performance problem support will still have to deal with functional problems. In some
cases, a workaround can be done to improve the performance issue. For example, a
system that has multiple nameserver entries in its /etc/resolv.conf file may find that the
first name server is currently down. If it is, then the secondary name server is used after a
short time-out. However, this time-out can present itself as a performance problem. While
this is in reality a functional problem, it has a performance side-effect and can be worked
around by changing the order of the name server entries temporarily. In other cases, there
may not be an easy workaround and it could be that there is a real functional problem
without a performance side-effect. For example, if an NFS server goes down, the users will
not get any response at all when accessing NFS file systems.
Additional information
Transition statement


What are benchmarks?


Benchmarks are standardized, repeatable tests
Unlike real production workloads which change constantly
Benchmarks use a representative set of programs and data
Benchmarks serve as a basis for:
Evaluation
Comparison
Benchmarks include:
Industry standard benchmarks
Customer benchmarks


Figure 1-4. What are benchmarks? (AN512.0)

Notes:
Introduction
Benchmarks are used to evaluate and compare the performance of computer systems.
They are useful because they remove other variables which might make results
unreliable.
Benchmark tests must be similar to a customer application to predict the performance
of the application or to use a benchmark result as the basis for sizing. For example, a Java SPECjbb benchmark result is specific to Java performance and does not give any
information about the NFS server performance of a computer system.
Benchmarks are also used in software development to identify regression in
performance after code changes or enhancements.


Industry standard benchmarks


Industry standard benchmarks use a representative set of programs and data designed
to evaluate and compare computer and software performance for a specific type of
workload, like CPU-intensive applications or database workloads. Each industry
standard benchmark has rules and requirements to which all platforms have to adhere.
The following lists some of the industry standard benchmarks and the type of application each is used for:
- SPECint, SPECfp (single-user technical): Updated in certain years, so you will see SPECint95, SPECfp95, and so forth. These are CPU-intensive applications with a heavy emphasis on integer or floating-point calculations.
- TPC-C (online transaction processing): Simulates network environments with a large number of attached terminals running complex workloads. Typically used as a database benchmark.
- TPC-D, TPC-H (decision support): Executes sets of queries against a standard database with large volumes of data and a high degree of complexity for answering critical business questions. TPC-D is obsolete as of 4/6/99.
- SPECjbb (Java): Evaluates the performance of server-side Java. It emulates a 3-tier system focusing on the middle tier.
- PLBwire, PLBsurf, Xmark, Viewperf, X11perf (graphics and CAD): Demonstrates relative performance across platforms/systems using real applications (2-D design, 3-D wireframe, 3-D solid modeling, 3-D animation, and low-end simulations).
- SPEC SFS (NFS): Measures the throughput supported by an NFS server for a given response time.
- SPECweb, WebStone (Web server): Focuses on server performance and measures the ability of the server to service HTTP requests.
- NotesBench (Lotus Notes): Measures the maximum number of users supported, the average response time, and the number of Notes transactions per minute.
- AIM (general commercial): Tests real office automation applications, memory management, integer and floating-point calculations, disk I/O, and multitasking.

Customer benchmarks
Customer benchmarks include customer-specific applications which are not measured through industry standard benchmarks, as well as simple benchmarks like network or file system throughput tests done with standard UNIX commands.
Since industry benchmarks often do not accurately match a customer's workload characteristics or mix, the best way to determine how well a particular combination of hardware, software, and tuning changes will affect the performance of their applications is to run a standardized mix of the customer's own unique workload.
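For example, here is a minimal sketch of such a simple test (the file system path and sizes are hypothetical). A repeatable file system throughput test can be built from nothing more than time and dd; what makes it a benchmark is that the same file, size, and commands are used for every run so that results can be compared:

   #!/usr/bin/ksh
   # Hypothetical repeatable file system throughput test
   TESTFILE=/bigfs/bench.dat        # file system under test (assumed path)

   # Sequential write test: 512 MB
   time dd if=/dev/zero of=$TESTFILE bs=1024k count=512

   # Sequential read test: read the same file back
   # (cached reads inflate results; unmount and remount the file system
   # between runs for repeatable cold-cache numbers)
   time dd if=$TESTFILE of=/dev/null bs=1024k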


Instructor notes:
Purpose Explain benchmarks.
Details Explain what a benchmark is, the different types of benchmarks, what type of
workload/performance is measured with it, and why benchmarks are done. Point out those
benchmarks currently being done by IBM on AIX.
More importantly, explain that while industry benchmarks can be a good first-pass method to compare different machines in the marketplace, they are not a predictor of how well a given company's workload will run. The ultimate and correct evaluation is to run a benchmark created by the customer using a typical application mix being fed a typical workload.
Additional information
Transition statement Lets look at the components of system performance.


Components of system performance


Central processing unit (CPU) resources
Processor speed and number of processors
Performance of software that controls CPU scheduling
Memory resources
Random access memory (RAM) speed, amount of
memory, and caches
Virtual Memory Manager (VMM) performance
I/O resources
Disk latencies, number of disks and I/O adapters
Device driver and kernel performance
Network resources
Network adapter performance and physical network itself
Software performance of network applications

Figure 1-5. Components of system performance (AN512.0)

Notes:
Introduction
The performance of a computer system depends on four main components: CPU,
memory, I/O, and network.
Both hardware and software contribute to the entire system performance. You should
not depend on very fast hardware as the sole contributor of system performance. Very
efficient software on average hardware can cause a system to perform much better
(and probably be less costly) than poor software on very fast hardware.

CPU resources
The speed of a processor, more commonly known as clock speed in megahertz, as well
as the number of processors, have an impact on performance. Kernel software that
controls the use of the CPU plays a large role in performance.

1-14 AIX Performance Management


Course materials may not be reproduced in whole or in part
without the prior written permission of IBM.

Copyright IBM Corp. 2010

V5.4
Instructor Guide

Uempty

Memory resources
Memory, both real and virtual, is sometimes the biggest factor in an application's performance. The memory latencies (RAM speeds), the design of the memory
subsystem, size of memory caches, and the Virtual Memory Manager (VMM) kernel
software contribute to the performance of a computer system.

I/O resources
I/O performance contributes heavily to system performance as well. I/O resources are
referred to here as I/O related to disk activities, including disks and disk adapters.

Network resources
While not all systems rely on network performance, some systems main performance
component is network related: the network media used (adapters and wiring) as well as
the networking software.

Logical resources
Sometimes the constraining resource is not anything physical. There are logical resources in the software design that can become a bottleneck. Examples are queues and buffers, which are limited in size, and pools of control blocks. While the AIX defaults for these are usually large enough for most systems, there are situations where these may need to be further tuned.
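Before analyzing any of these components, it helps to inventory what resources the system actually has. The following is a minimal sketch using standard AIX commands (output varies by system):

   # CPU: list processors and the SMT configuration
   lsdev -Cc processor
   smtctl

   # Memory: real memory size (in KB) and paging space
   lsattr -El sys0 -a realmem
   lsps -a

   # Disk I/O: disks and adapters known to the system
   lsdev -Cc disk
   lsdev -Cc adapter

   # Network: configured interfaces and their addresses
   netstat -in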


Instructor notes:
Purpose Describe the components of system performance.
Details Explain the four main components of system performance: CPU, memory, I/O,
and network. Point out that both hardware and software contribute to overall system
performance. Point out that there are logical resources that sometimes can be adjusted
through the operating system tuning commands.
Additional information
Transition statement Let's review some basics about how programs run.


Factors that influence performance


Detecting the bottleneck(s) within a server system depends on a range of factors such as:
- Software application(s) workload
- Speed and amount of available resources
- Configuration of the server hardware
- Configuration parameters of the operating system
- Network configuration and topology

[Slide graphic: a pipe whose throughput is throttled at several bottlenecks]

Figure 1-6. Factors that influence performance


Notes:
As server performance is distributed throughout each server component and type of
resource, it is essential to identify the most important factors or bottlenecks that affect
the performance for a particular activity. Detecting the bottleneck within a server
system depends on a range of factors such as those shown in the visual.
A bottleneck is a term used to describe a particular performance issue that is throttling
the throughput of the system. It could be in any of the subsystems: CPU, memory, or I/O,
including network I/O. The graphic in the visual illustrates that there may be
several performance bottlenecks on a system, and some may not be discovered until
other, more constraining, bottlenecks are found and solved.


Instructor notes:
Purpose
Details
Additional information
Transition statement


Performance metrics and baseline measurement

- Performance is measured through analysis tools
- System utilization versus single program performance
- Metrics that are measured include:
  - CPU utilization
  - Memory utilization and paging
  - Disk I/O
  - Network I/O
- Each metric can be subdivided into finer details
- Create a baseline measurement to compare against in the future


Figure 1-7. Performance metrics and baseline measurement


Notes:
Introduction
One way to gauge performance is perception. For example, you might ask the question,
"Does the system respond to us in a reasonable amount of time?" But if the system
does not, then what do we do? That is where performance analysis tools play a role.
These tools are programs that collect and report on various performance metrics.
Whatever system components the application touches, the corresponding metrics must
be analyzed.
There is a difference between the overall system utilization and the performance of a
given application. An objective to fully utilize a system may be in conflict with an
objective to optimize the response time of a critical application. There can be spare
CPU capacity and yet an individual application can be CPU constrained. Sometimes a
low utilization is a sign of trouble; for example, an application locking mechanism may
be constraining use of the physical resources.


CPU utilization metrics


CPU utilization can be split into %user, %system, %idle, and %IOwait. Other CPU
metrics can include the length of the run queues, process/thread dispatches, interrupts,
and lock contention statistics.
The main CPU metric is the percent utilization. High CPU utilization is not necessarily a
bad thing, as some might think. However, the reason for the CPU utilization must be
investigated to see if the utilization can be lowered. In the case of %idle and %IOwait,
the CPU is really idle. The CPU is actually being utilized only in the first two cases
(user + system).

Memory paging metrics


Memory metrics include virtual memory paging statistics, file paging statistics, and
cache and TLB miss rates.
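
A hedged sketch of commands that expose some of these memory metrics (availability
depends on the filesets listed later in this unit):
# vmstat -s      (cumulative counters, including paging space page-ins and page-outs)
# lsps -a        (paging space size and percent used)
# svmon -G       (global snapshot of real memory and paging space usage)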

Disk I/O metrics


Disk metrics include disk throughput (kilobytes read/written), disk transactions
(transactions per second), disk adapter statistics, disk queues (if the device driver and
tools support them), and elapsed time caused by various disk latencies. The type of
disk access, random versus sequential, can also have a big impact on response times.
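
For example, iostat reports several of these disk metrics; a minimal sketch:
# iostat 2 3     (per-disk tps and kilobytes read/written for each interval)
# iostat -a 2 3  (adds adapter-level statistics)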

Network I/O metrics


Network metrics include network adapter throughput, protocol statistics, transmission
statistics, network memory utilization, and much more.
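
For example (ent0 here is just a sample adapter name):
# netstat -i        (per-interface packet and error counters)
# entstat -d ent0   (detailed adapter statistics, including transmission errors)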

Baseline measurement
You should create a baseline measurement when your system is running well and
under a normal load. This will give you a guideline to compare against when your
system seems to have performance problems.
Performance problems are usually reported right after a change to system hardware or
software. Unless there is a baseline measurement from before the change to compare
against, quantifying the problem is impossible.


System changes can affect performance


Changes to any of the following can affect performance:
- Hardware configuration - Adding, removing, or changing configurations such as how
  the disks are connected
- Operating system - Installing or updating a fileset, installing PTFs, and changing
  parameters
- Applications - Installing new versions and fixes, or configuring or changing data
  placement
- Application tuning - Tuning options in the operating system, database, or an application
You should measure the performance before and after each change. A change may be
to a single tuning parameter or to multiple parameters that must be changed together
as a group.
Another option is to run the measurements at regular intervals (for example, once a
month) and save the output. When a problem is found, the previous capture can be
used for comparison.
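
A minimal sketch of such a periodic capture, written as a ksh script that could be run
from cron; the /perf/baseline directory and the choice of tools are assumptions to adapt
to your environment:

#!/usr/bin/ksh
# Capture a dated performance baseline (hypothetical /perf/baseline directory).
D=/perf/baseline/$(date +%Y%m%d)
mkdir -p $D
vmstat 5 12 > $D/vmstat.out    # one minute of CPU and paging samples
iostat 5 12 > $D/iostat.out    # disk throughput and transactions per second
netstat -i  > $D/netstat.out   # network interface counters
lsps -a     > $D/lsps.out      # paging space utilization

When a problem is later reported, the most recent capture gives you the "before" picture.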


Instructor notes:
Purpose Explain performance metrics.
Details Explain how performance measurement is done, the metrics that we measure
and how each metric can be subdivided into finer details.
Stress the importance of having a baseline measurement.
Additional information
Transition statement Next, we'll take a look at the performance analysis flowchart.


Trade-offs and performance approach

- Trade-offs must be considered, such as:
  - Cost versus performance
  - Conflicting performance requirements
  - Speed versus functionality
- Performance may be improved using a methodical approach:
  1. Understanding the factors which can affect performance
  2. Measuring the current performance of the server
  3. Identifying a performance bottleneck
  4. Changing the component which is causing the bottleneck
  5. Measuring the new performance of the server to check for improvement

Figure 1-8. Trade-offs and performance approach


Notes:
Trade-offs
There are many trade-offs related to performance tuning that should be considered.
The key is to ensure there is a balance between them.
The trade-offs are:
- Cost versus performance
In some situations, the only way to improve performance is by using more or faster
hardware. But ask the question: "Does the additional cost result in a proportional
increase in performance?"
- Conflicting performance requirements
If there is more than one application running simultaneously, there may be
conflicting performance requirements.
- Speed versus functionality
Resources may be increased to improve a particular area, but serve as an overall
detriment to the system. Also, you may need to make choices when configuring your
system for speed versus maximum scalability.

Methodology
Performance tuning is one aspect of performance management. The definition of
performance tuning sounds simple and straightforward, but it is actually a complex
process.
Performance tuning involves managing your resources. Resources could be logical
(queues, buffers, and so forth) or physical (real memory, disks, CPUs, network
adapters, and so forth). Resource management involves the various tasks listed here.
We will examine each of these tasks later.
Tuning always must be done based on performance analysis. While there are
recommendations as to where to look for performance problems, what tools to use, and
what parameters to change, what works on one system may not work on another. So
there is no cookbook approach available for performance tuning that will work for all
systems.
Experiences with tuning may range from the informal to the very formal where reports
and reviews are done prior to changes being made. Even for informal tuning actions, it
is essential to plan, gather data, develop a recommendation, implement, and document.


Instructor notes:
Purpose Explain the performance tuning process.
Details It is important that the performance tuning process is understood before starting
any analysis or system tuning.
The first step in the performance tuning process is to identify performance bottlenecks. A
good way to start is to answer the question "What aspect is slow?" The answer might be
"everything is slow," or that a specific operation, like a database query or a copy
command, takes too long to complete.
Based on the information of what aspect is considered to be slow, one or more goals can
be defined and prioritized. Sometimes it turns out that a specific operation already performs
in the best possible way.
Functional issues in hardware or software can result in performance issues that might
not be fixed or circumvented through resource management.
Additional information
Transition statement Let's look at the tools available for performance tuning.


Performance analysis flowchart


[Slide graphic: performance analysis flowchart. Normal operations feed into "Monitor
system performance and check against requirements". If there is no performance problem,
monitoring continues. If there is, the flow checks in turn: CPU bound? Memory bound?
I/O bound? Network bound? A "yes" at any step leads to actions for that resource; if
none apply, additional tests are run. After actions, the question "Does performance
meet stated goals?" either returns the flow to normal operations or loops back into
further analysis.]

Figure 1-9. Performance analysis flowchart


Notes:
Tuning is a process
The flowchart in the visual above can be used for performance analysis and it illustrates
that tuning is an iterative process. We will be following this flowchart throughout our
course.
The starting point for this flowchart is the Normal Operations box. The first piece of data
you need is a performance goal. Only by having a goal, or a set of goals, can you tell if
there is a performance problem. The goals may be something like a specific response
time for an interactive application or a specific length of time in which a batch job needs
to complete. Tuning without a specific goal could in fact lead to the degradation of
system performance.
Once you decide there is a performance problem and you analyze and tune the system,
you must then go back to the performance goals to evaluate whether more tuning
needs to occur.


Additional tests
The additional tests that you perform at the bottom right of the flowchart relate to the
four previous categories of resource contention. If the specific bottleneck is well hidden,
or you missed something, then you must keep testing to figure out what is wrong. Even
when you think you've found a bottleneck, it's a good idea to do additional tests to
identify more detail or to make sure one bottleneck is not masquerading as another. For
example, you may find a disk bottleneck, but in reality it's a memory bottleneck causing
excessive paging.


Instructor notes:
Purpose Introduce a performance flowchart that emphasizes the four major resource
categories.
Details This flowchart highlights the organization of this course. Identify the
constraining resource and use the tools and analysis that the course covers for that
category. On the other hand, the course also emphasizes that these are interrelated; for
example, memory management affects I/O and I/O affects memory.
The only way to tell if you have a performance problem is to have a specific performance
goal, or set of goals, and to monitor your system to see if you're meeting them.
Additional information
Transition statement Let's look at how virtualization affects the performance
management process.


Impact of virtualization
[Slide graphic: logical partitions with virtual Ethernet and virtual SCSI connections and
dedicated or shared processors and memory, mapped through the Power Hypervisor and the
Virtual I/O Server to the physical network, physical processors, physical memory, and
physical storage]

Virtualization affects how you manage AIX performance:
- The memory and processor capacity is determined by the Power Hypervisor
- Memory and processors may be shared
- Physical adapters to storage or to the network may be in the virtual I/O server

Figure 1-10. Impact of virtualization


Notes:
Working with an AIX operating system in a logical partition changes how we approach
performance management.
On a traditional single operating system server, all of the resources are local to and
dedicated to that OS.
When AIX runs in a logical partition, the resources may all be virtualized. While this
virtualization and sharing of resources can provide better utilization and lower costs, it also
requires an awareness of the factors beyond the immediate AIX operating system.
Some network and I/O tuning requires you to examine and tune physical adapters. In a
virtualized environment, those adapters could reside at the virtual I/O server (VIOS), and
you would have to go to that VIOS partition to complete that part of the work.
Each of the resources could be shared with other LPARs. Thus workloads on these other
LPARs could affect the resource availability in your LPAR (depending on how the
virtualization facilities are configured).


Additional training, beyond this course, in PowerVM configuration and performance tuning
is strongly recommended for administrators who are working with virtualized LPARs.


Instructor notes:
Purpose Develop awareness about the effect of virtualization on the performance
management process.
Details
Additional information
Transition statement You may not have all of the skills and component access that are
needed to do the full performance management job. This is not just for the PowerVM
aspects but also for other aspects. Let us look at who else you may need to include on
your team to do the full job.


The performance management team


[Slide graphic: the AIX admin at the center of a team that includes Virtualization
Management (LPARs, HMC), Facility Management (heat, power), Network Facility (switches,
routers), Application Design, Network Services (DNS, NFS, NIM), Physical Upgrades
(memory, cores, adapters), AIX Support Line, and Storage Subsystem (SAN, storage arrays)]

Figure 1-11. The performance management team


Notes:
Overview
Managing application performance is not something an AIX administrator can do in
isolation. Some studies have shown that the greatest performance improvements can
be found in improving the design and coding of the application or the manner in which
the data is organized. Increasing physical resources requires working with capital
planning. The performance may be constrained by the components that are controlled
by the network administrator or storage subsystem administrator. Performance
bottlenecks may be isolated to the performance of other servers that you depend upon
such as name servers or file servers. An upgrade of equipment may require changes to
the power and cooling in the machine room. Newer Power-processor based systems
can suppress performance in order to stay within designated heat and power limits.
Most significantly, with the PowerVM environment in which most AIX systems run, the
resources on which performance depends are virtualized. The amount and manner in
which processor, memory, and I/O capacity is provisioned to the logical partition that is
running AIX has a great influence on the performance of the applications in that
partition.
Performance management is an area requiring partnerships with many other areas and
the personnel who administer those areas.


Instructor notes:
Purpose Discuss other staff that might need to be involved in the performance
management process.
Details
Additional information
Transition statement Let us move on to a brief survey of the available tools, most of
which will be covered in the course.


Performance analysis tools


CPU: vmstat, iostat, ps, sar, tprof, gprof, prof, time, timex, netpmon, locktrace,
emstat, alstat, topas, nmon, performance toolbox, trace, trcrpt, curt, splat, truss,
cpupstat, lparstat, mpstat, smtctl

Memory System: vmstat, lsps, svmon, filemon, topas, nmon, performance toolbox, trace,
trcrpt, truss, lparstat

I/O Subsystem: iostat, vmstat, lsps, lsattr, lsdev, lspv, lslv, lsvg, fileplace,
filemon, lvmstat, topas, nmon, performance toolbox, trace, trcrpt, truss

Network Subsystem: lsattr, netstat, entstat, nfsstat, netpmon, ifconfig, iptrace,
ipreport, tcpdump, nfs4cl, topas, nmon, performance toolbox, trace, trcrpt, truss


Figure 1-12. Performance analysis tools


Notes:
CPU analysis tools
CPU metrics analysis tools include:
- vmstat, iostat, sar, lparstat and mpstat which are packaged with bos.acct
- ps which is in bos.rte.control
- cpupstat which is part of bos.rte.commands
- gprof and prof which are in bos.adt.prof
- time (built into the various shells) or timex which is part of bos.acct
- emstat and alstat are emulation and alignment tools from bos.perf.tools
- netpmon, tprof, locktrace, curt, splat, and topas are in bos.perf.tools
- trace and trcrpt which are part of bos.sysmgt.trace
- truss is in bos.sysmgt.ser_aids

- smtctl is in bos.rte.methods
- Performance toolbox tools such as xmperf, 3dmon which are part of perfmgr

Memory subsystem analysis tools


Some of the memory metric analysis tools are:
- vmstat which is packaged with bos.acct
- lsps which is part of bos.rte.lvm
- topas, svmon and filemon are part of bos.perf.tools
- Performance toolbox tools such as xmperf, 3dmon which are part of perfmgr
- trace and trcrpt which are part of bos.sysmgt.trace
- lparstat is part of bos.acct

I/O subsystem analysis tools


I/O metric analysis tools include:
- iostat and vmstat are packaged with bos.acct
- lsps, lspv, lsvg, lslv and lvmstat are in bos.rte.lvm
- lsattr and lsdev are in bos.rte.methods
- topas, filemon, and fileplace are in bos.perf.tools
- Performance toolbox tools such as xmperf, 3dmon which are part of perfmgr
- trace and trcrpt which are part of bos.sysmgt.trace

Network subsystem analysis tools


Network metric analysis tools include:
- lsattr and netstat which are part of bos.net.tcp.client
- nfsstat and nfs4cl as part of bos.net.nfs.client
- topas and netpmon are part of bos.perf.tools
- ifconfig as part of bos.net.tcp.client
- iptrace and ipreport are part of bos.net.tcp.server
- tcpdump which is part of bos.net.tcp.server
- Performance toolbox tools such as xmperf, 3dmon which are part of perfmgr
- trace and trcrpt which are part of bos.sysmgt.trace


Instructor notes:
Purpose List the performance analysis tools.
Details Provide an overview of the available performance tools based on the metrics
they are used for.
The following tools are listed on the visual but will not be discussed in this course:
- The lparstat command reports logical partition (LPAR) related information and
  statistics
- The mpstat command collects and displays performance statistics for all logical CPUs
  in the system
- The cpupstat command detects configurations that could cause a CPU DR operation
  to fail
- The smtctl command controls the enabling and disabling of processor simultaneous
  multi-threading mode
If asked, to find which fileset a command is packaged with, use the following command,
where cmdname is the actual name of the command:
# lslpp -w $(whence cmdname)
Additional information Other tools not written by IBM also exist. Some must be
purchased and others are publicly available on the Internet.
The course listed at the end of the student notes has different course codes depending on
where in the world it is scheduled. Students should search for courses related to
virtualization.
Transition statement Next, we'll take a look at the tuning process.


Performance tuning tools


CPU: nice, renice, schedo, bindprocessor, bindintcpu, chdev, wlm, wpar

Memory System: vmo, ioo, chps, mkps, chdev, wlm, wpar

I/O Subsystem: vmo, ioo, lvmo, chlv, chdev, migratepv, reorgvg

Network Subsystem: no, nfso, ifconfig, chdev

The most important tool is matching resources to demand:
- Spreading workload (over time and between components or systems)
- Allocating the correct additional resource
- Managing the demand

Figure 1-13. Performance tuning tools


Notes:
CPU tuning tools
CPU tuning tools include:
- nice, renice, and setpri modify priorities.
nice and renice are in the bos.rte.control fileset.
setpri is a command available with the perfpmr package.
- schedo modifies scheduler algorithms (in the bos.perf.tune fileset).
- bindprocessor binds processes to CPUs (in the bos.mp fileset).
- chdev modifies certain system tunables (in the bos.rte.methods fileset).
- bindintcpu can bind an adapter interrupt to a specific CPU (in the
devices.chrp.base.rte fileset).
- procmon is in bos.perf.gtools.


Memory tuning tools


Memory tuning tools include:
- vmo and ioo for various VMM, file system, and LVM parameters (in bos.perf.tune
fileset)
- chps and mkps modify paging space attributes (in bos.rte.lvm fileset)
- fdpr rearranges basic blocks in an executable so that memory footprints become
smaller and cache misses are reduced (in perfagent.tools fileset)
- chdev modifies certain system tunables (in bos.rte.methods fileset)

I/O tuning tools


I/O tuning tools include:
- vmo and ioo modify certain file system and LVM parameters (in bos.perf.tune
fileset).
- chdev modifies system tunables such as disk and disk adapter attributes (in
bos.rte.methods fileset)
- migratepv moves logical volumes from one disk to another (in bos.rte.lvm fileset)
- lvmo displays or sets pbuf tuning parameters (in bos.rte.lvm fileset)
- chlv modifies logical volume attributes (in bos.rte.lvm fileset)
- reorgvg moves logical volumes around on a disk (in bos.rte.lvm fileset)

Network tuning tools


Network tuning tools include:
- no modifies network options (in bos.net.tcp.client fileset)
- nfso modifies NFS options (in bos.net.nfs.client fileset)
- chdev modifies network adapter attributes (in bos.rte.methods fileset)
- ifconfig modifies network interface attributes (in bos.net.tcp.client fileset)


Instructor notes:
Purpose List the performance tuning tools.
Details Provide an overview of the available performance tuning tools based on the
performance metrics they are used for.
Do not go into details here about the commands, schedo, vmo and ioo. They will be
discussed in the next pages.
The lvmo command does not work exactly the same as the other tuning options
commands. It does not preserve tunables across boots. And because it cannot be run to
tune a VG that is not varied on, some values cannot be decreased.
Additional information
AIX 5L V5.2 and earlier versions of AIX are no longer supported levels. The old tuning
commands for these versions are mentioned in the student notes. The schedtune and
vmtune commands were replaced by schedo, vmo and ioo in AIX 5L V5.2. schedtune and
vmtune were still available as scripts in AIX 5L V5.2 (calling the newer commands). In AIX
5L V5.2, AIX pre-5.2 tuning could be done by changing the system attribute pre520tune.
The schedtune and vmtune scripts are no longer available in AIX 5L 5.3.
If anyone asks, the rmss command might be considered by some to be a memory tuning
tool. It does change some memory parameters but it is of limited value and should be used
with great care because it does not change the memory structure that would happen if the
system reconfigured with the lower amount of memory.
Transition statement Let us look at the tuning commands.


AIX tuning commands


- Tunable commands include:
  - vmo: manages Virtual Memory Manager tunables
  - ioo: manages I/O tunables
  - schedo: manages CPU scheduler/dispatcher tunables
  - no: manages network tunables
  - nfso: manages NFS tunables
  - raso: manages reliability, availability, serviceability tunables
- Tunables are the parameters the tuning commands manipulate
- Tunables can be managed from:
  - SMIT
  - Web-based System Manager
  - Command line
- All tunable commands have the same syntax

Figure 1-14. AIX tuning commands


Notes:
Overview
The tuning options are actually kept in structures in kernel memory. To assist in
reestablishing these kernel values at each system reboot, the tunable values are stored
in files in the directory /etc/tunables. The tunables commands can update this file,
update the kernel, or both. For ease of use, these tunable commands can be invoked via
SMIT (smitty tuning), Web-based System Manager, or pconsole.
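
As a brief, hedged illustration of that common syntax (minfree and tcp_sendspace are
just sample tunables):
# vmo -o minfree                (display the current value of one tunable)
# no -o tcp_sendspace=65536     (set a tunable in the running kernel only)
# schedo -a                     (summary list of all scheduler tunables)
The same flags work across all of the commands listed above; the individual options are
detailed on the following pages.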


Instructor notes:
Purpose Enumerate the tuning commands and how they have a common syntax.
Details Cover commands (at a high level) and explain how they use the same syntax
and the same tunables file (for persistence across reboots).
Review the options for the commands.
Additional information
In AIX 5L V5.2, the vmtune command was replaced with vmo and ioo. The I/O related
vmtune parameters are now tuned with ioo while the rest are tuned with vmo.
The schedtune command was replaced with schedo.
The commands are part of the bos.perf.tune fileset.
In AIX 5L V5.2, the vmtune and schedtune commands were shipped for compatibility
reasons, but as shell scripts that invoke vmo, ioo, and schedo. In AIX 5L V5.2, a new
system parameter was added: pre520tune, a sys0 attribute that defined whether
system tuning was done the old way, which required adding tuning changes into
/etc/inittab or /etc/rc.boot, or the newer method through /etc/tunables.
In AIX 5L V5.2, the bos.adt.samples needed to be installed to make vmtune and
schedtune available. These were located under /usr/samples/kernel.
In AIX 5L V5.3, the vmtune and schedtune commands are no longer available.

Transition statement Let us discuss the two types of tunables: restricted and
unrestricted.


Types of tunables
There are two types of tunables (AIX 6.1):
- Restricted tunables
  - Restricted tunables should NOT be changed without approval from AIX development
    or AIX Support!
  - Dynamic change will show a warning message
  - Permanent change must be confirmed
  - Permanent changes will cause an error log entry at boot time
- Non-restricted tunables
  - Can have restricted tunables as dependencies
- Migration from AIX 5.3 to AIX 6.1 will keep the old tunable values
  - Recommend reviewing them and considering a change to the AIX 6.1 defaults

Figure 1-15. Types of tunables


Notes:
Beginning with AIX 6.1, many of the tunables are considered restricted. Restricted
tunables should not be modified unless told to do so by AIX development or support
professionals.
The restricted tunables are not displayed, by default.
When migrating to AIX 6.1, the old tunable values will be kept. However, any restricted
tunables that are not at their default AIX 6.1 value will cause an error log entry.
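
For example, the -F flag (covered with the command options later in this unit) forces
restricted tunables into the listings:
# vmo -a       (restricted tunables are omitted from the default summary)
# vmo -F -a    (restricted tunables are displayed as well)
Changing a restricted tunable dynamically produces a warning, and making the change
permanent requires an explicit confirmation.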


Instructor notes:
Purpose Discuss restricted tunables.
Details
Additional information
Transition statement Let's next explain the categories of the tunables.


Tunable parameter categories


The tunable parameters manipulated by the tuning commands have been classified into
the following categories:
- Dynamic
- Static
- Reboot
- Bosboot
- Mount
- Incremental
- Connect


Figure 1-16. Tunable parameter categories


Notes:
Types of tunable parameters
All the tunable parameters manipulated by the tuning commands (vmo, ioo, schedo, no,
nfso and raso) have been classified into these categories:
Dynamic      The parameter can be changed at any time
Static       The parameter can never be changed
Reboot       The parameter can only be changed during boot
Bosboot      The parameter can only be changed by running bosboot and rebooting
             the machine
Mount        Changes to the parameter are only effective for future file system
             or directory mounts
Incremental  The parameter can only be incremented, except at boot
Connect      Changes to the parameter are only effective for future socket
             connections

Instructor notes:
Purpose Discuss tunable categories.
Details
Additional information
Transition statement Let's look at the details of using the tuning commands and the
files that are used when booting the system.


Tunables command options and files


/etc/tunables directory contains:
  nextboot      (overrides to default)
  lastboot      (values established at last boot)
  lastboot.log  (log of last boot actions)

To list tunables:
  # command -a             (summary list of tunables)
  # command -L             (long listing of tunables)
  # command -h tunable     (full description of a tunable)
  (Note: use of the -F option forces display of restricted tunables)

To change tunables:
  # command -o tunable=value       (update kernel only)
  # command -o tunable=value -r    (update nextboot only)
  # command -o tunable=value -p    (update kernel and nextboot)
  # command -d tunable             (reset a single tunable to default)
  # command -D                     (reset all tunables to the defaults)


Figure 1-17. Tunables command options and files


Notes:
Introduction
The parameter values tuned by vmo, schedo, ioo, no, and nfso are stored in files in
/etc/tunables.
Tunables files currently support six different stanzas: one for each of the tunable
commands (schedo, vmo, ioo, no and nfso), plus a special info stanza. The five
stanzas schedo, vmo, ioo, no and nfso contain tunable parameters managed by the
corresponding command (see the command's man pages for the complete parameter
lists).
The value can either be a numerical value or the literal word DEFAULT, which is
interpreted as this tunable's default value. It is possible that some stanzas contain
values for non-existent parameters (in the case a tunable file was copied from a
machine running a different version of AIX and one or more tunables do not exist on the
current system).


nextboot file
This file is automatically applied at boot time and only contains the list of tunables to
change. It does not contain all parameters. The bosboot command also gets the value
of Bosboot type tunables from this file. It contains all tunable settings made permanent.

lastboot
This file is automatically generated at boot time. It contains the full set of tunable
parameters, with their values after the last boot. Default values are marked with
# DEFAULT VALUE. Static parameters are marked STATIC in the file.
This file can be very useful as a problem determination tool. For example, it will identify
an error that prevented a requested change from being effective at reboot.

lastboot.log
This should be the only file in /etc/tunables that is not in the stanza format described
here. It is automatically generated at boot time, and contains the logging of the creation
of the lastboot file, that is, any parameter change made is logged. Any change which
could not be made (possible if the nextboot file was created manually and not validated
with tuncheck) is also logged. (tuncheck will be covered soon.)

Tuning command syntax


The vmo, ioo, schedo, no and nfso commands have similar syntax:

command [-p|-r] {-o Tunable[=Newvalue]}
command [-p|-r] {-d Tunable}
command [-p|-r] -D
command [-p|-r] -a {-F}
command -h Tunable
command -L [Tunable] {-F}
command -x [Tunable] {-F}

The descriptions of the flags are:

-p   Makes the change apply to both current and reboot values
-r   Forces the change to go into effect on the next reboot
-o   Displays or sets individual parameters
-d   Resets individual Tunable to default value
-D   Resets all tunables to default values
-a   Displays all parameters
-h   Displays help information for a Tunable
-L   Lists attributes of one or all tunables; includes current value, default
     value, value to be set at next reboot, minimum possible value, maximum
     possible value, unit, type, and dependencies
-x   Lists characteristics of one or all tunables, one per line, using a
     spreadsheet-type format
-F   Forces the display of restricted parameters
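
Putting these flags together, a short hedged example session (minfree is just a sample
tunable):
# vmo -L minfree            (current, default, boot, minimum, and maximum values)
# vmo -o minfree=1024       (update the running kernel only; lost at reboot)
# vmo -p -o minfree=1024    (update the kernel and the nextboot file)
# vmo -p -d minfree         (reset the tunable to its default, now and at reboot)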


Instructor notes:
Purpose Describe the common options of the tunables commands and the files in the
/etc/tunables directory.
Details
Additional information
Transition statement Now, let's look at listing the tunables.


Tuning commands -L option


# vmo -L
NAME           CUR     DEF     BOOT    MIN   MAX     UNIT        TYPE  DEPENDENCIES
...
<part of output omitted>
...
maxfree        1088    1088    1088    16    367001  4KB pages   D     minfree
                                                                       memory_frames
--------------------------------------------------------------------------------
maxperm        386241          386241                            S
--------------------------------------------------------------------------------
maxpin         370214          370214                            S
--------------------------------------------------------------------------------
maxpin%        80      80      80      1     100     % memory    D     pinnable_frames
                                                                       memory_frames
--------------------------------------------------------------------------------
memory_frames  448K            448K                  4KB pages   S
--------------------------------------------------------------------------------
minfree        960     960     960     8     367001  4KB pages   D     maxfree
                                                                       memory_frames
--------------------------------------------------------------------------------
...
<rest of output omitted>

Figure 1-18. Tuning commands -L option


Notes:
Overview of the -L option
The -L option of the tunable commands (vmo, ioo, schedo, no and nfso) can be used to
print out the attributes of a single tunable or all the tunables.
The output of the command with the -L option shows the current value, default value,
value to be set at next reboot, minimum possible value, maximum possible value, unit,
type, and dependencies.

Types of tunable parameters


All the tunable parameters manipulated by the tuning commands (no, nfso, vmo, ioo,
and schedo) have been classified into these categories:

Dynamic      The parameter can be changed at any time
Static       The parameter can never be changed
Reboot       The parameter can only be changed during boot
Bosboot      The parameter can only be changed by running bosboot and rebooting
             the machine
Mount        Changes to the parameter are only effective for future file system
             or directory mounts
Incremental  The parameter can only be incremented, except at boot time
Connect      Changes to the parameter are only effective for future socket
             connections

For parameters of type Bosboot, whenever a change is performed, the tuning commands
automatically prompt the user and ask whether to execute the bosboot command. For
parameters of type Connect, the tuning commands automatically restart the inetd daemon.
The following table shows each command and the tunable types that it supports:

[Table: the commands vmo, ioo, schedo, no, and nfso, each marked with the tunable
types it supports: Dynamic (D), Static (S), Reboot (R), Bosboot (B), Mount (M),
Incremental (I), and Connect (C)]
Tunable flag and type issues


Any change (with -o, -d or -D) to a parameter of type Mount will result in a message
being displayed to warn the user that the change is only effective for future mount
operations.
Any change (with -o, -d or -D flags) to a parameter of type Connect will result in inetd
being restarted, and a message displaying a warning to the user that the change is only
effective for future socket connections.
Any attempt to change (with -o, -d or -D) a parameter of type Bosboot or Reboot
without -r, will result in an error message.
Any attempt to change (with -o, -d or -D but without -r) the current value of a
parameter of type Incremental with a new value smaller than the current value, will
result in an error message.


Instructor notes:
Purpose Display the attributes of tunables with the -L option.
Details The -L option of the tuning commands lists the current values of the tunables
and the type of each tunable. Review the types of tunables.
Additional information
Transition statement Let's describe the files used to store tunable values to be
applied during the boot process.


Stanza file format


Example of a nextboot file:

info:
    Description = "Tuning changes made July 2009"
    AIX_level = "6.1.2.3"
    Kernel_type = "MP64"
    Last_validation = "2009-07-28 15:31:24 CDT (current)"
vmo:
    minfree = "4000"
    maxfree = "4128"
    nokilluid = "4"
ioo:
    j2_maxPageReadAhead = "128"
    j2_nRandomCluster = "4"
    j2_nRandomWrite = "8"
    j2_nPagesPerWriteBehindCluster = "64"
no:
    tcp_nodelayack = "1"
    tcp_sendspace = "65536"
    tcp_recvspace = "65536"
nfso:
    nfs_v3_vm_bufs = "15000"

Figure 1-19. Stanza file format


Notes:
Stanza file format (nextboot and lastboot)
The tunables files contain one or more sections, called stanzas. A stanza is started by
a line containing the stanza name followed by a colon (:). There is no marking for the
end of a stanza. It simply continues until another stanza starts. Each stanza contains a
set of parameter/value pairs; one pair per line. The values are surrounded by double
quotes ("), and an equal sign (=) separates the parameter name from its value. A
parameter/value pair must necessarily belong to a stanza. It has no meaning outside of
a stanza. Two parameters sharing the same name but belonging to different stanzas
are considered to be different parameters. If a parameter appears several times in a
stanza, only its first occurrence is used. Following occurrences are simply ignored.
Similarly, if a stanza appears multiple times in the file, only the first occurrence is used.
Everything following a number sign (#) is considered a comment and ignored. Heading
and trailing blanks are also ignored.


There are six possible stanzas for each file:
- info
- schedo
- vmo
- ioo
- no
- nfso
The info stanza is used to store information about the purpose of the tunable file and
the level of AIX on which it was validated. Any parameter is acceptable in this stanza;
however, some fields have a special meaning:
Field             Meaning
Description       A character string describing the tunable file. SMIT displays
                  this field in the file selection box.
Kernel_type       Possible values are:
                  - UP: Uniprocessor kernel, N/A on AIX 5L V5.3 and later
                  - MP: Multiprocessor kernel, N/A on AIX 6 and later
                  - MP64: 64-bit multiprocessor kernel
                  This field is automatically updated by tunsave and tuncheck
                  (on success only).
Last_validation   The most recent date and time this file was validated, and the
                  type of validation. Possible values are:
                  - current: File has been validated against the current context
                  - reboot: File has been validated against the nextboot context
                  This field is automatically updated by tunsave and tuncheck
                  (on success only).
Logfile_checksum  The checksum of the lastboot.log file matching this tunables
                  file. This field is present only in the lastboot file.


Instructor notes:
Purpose Describe the format of the files in the /etc/tunables directory.
Details
Additional information
Transition statement Next, we'll look at the five commands that are used to
manipulate the tunables files.


File control commands for tunables


Commands to manipulate the tunables files in /etc/tunables are:
- tuncheck: Used to validate the parameter values in a file
- tunrestore: Changes tunables based on parameters in a file
- tunsave: Saves tunable values to a stanza file
- tundefault: Resets tunable parameters to their default values
- tunchange: Unconditionally updates values in a file


Figure 1-20. File control commands for tunables


Notes:
Introduction
There are five commands which are used to control files that contain tunables. These
commands take as an argument the filename to use, and they assume that the filename
is relative to the /etc/tunables directory.

tuncheck command
The tuncheck command validates a tunables file. All tunables listed in the specified file
are checked for range and dependencies. If a problem is detected, a warning is issued.
There are two types of validation:
- Against the current context: This checks to see if the file could be applied
immediately. Tunables not listed in the file are interpreted as current values. The
checking fails if a tunable of type Incremental is listed with a smaller value than its


current value; it also fails if a tunable of type Bosboot or Reboot is listed with a
different value than its current value.
- Against the next boot context: This checks to see if the file could be applied
during a reboot, that is, if it could be a valid nextboot file. Decreasing a tunable of
type Incremental is allowed. If a tunable of type Bosboot or Reboot is listed with a
different value than its current value, a warning is issued but the checking does not
fail.
Additionally, warnings are issued if the file contains unknown stanzas, or unknown
tunables in a known stanza. However, that does not make the checking fail.
Upon success, the AIX_level, Kernel_type and Last_validation fields in the info
stanza of the checked file are updated.
The syntax for the tuncheck command is:
tuncheck [-p | -r] -f Filename
where:
-f Filename  Specifies the name of the tunable file to be checked. If it does not
             contain the '/' (forward slash) character, the name is relative to
             the /etc/tunables directory.
-p           Checks Filename in both current and boot contexts. This is
             equivalent to running tuncheck twice, one time without any flag and
             one time with the -r flag.
-r           Checks Filename in a boot context.
If -p or -r are not specified, Filename is checked according to the current context.
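
For example, after hand-editing a hypothetical file /etc/tunables/mytunes, you might
validate it for the next reboot like this:
# tuncheck -r -f mytunes    (validate mytunes against the next boot context)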

tunrestore command
The tunrestore command is used to change all tunable parameters to values stored in
a specified file.
The syntax for the tunrestore command is:
tunrestore [-r] -f Filename
tunrestore -R
where:
-f Filename     Immediately applies Filename. All tunables listed in Filename are
                set to the value defined in this file. Tunables not listed in
                Filename are kept unchanged. Tunables explicitly set to DEFAULT
                are set to their default value.
-r -f Filename  Applies Filename for the next boot. This is achieved by checking
                the specified file for inconsistencies (the equivalent of running
                tuncheck on it) and copying it over to /etc/tunables/nextboot.
                If bosboot is necessary, the user will be offered to run it.
-R              Is only used during reboot. All tunables that are not already set
                to the value defined in the nextboot file are modified. Tunables
                not listed in the nextboot file are forced to their default
                value. All actions, warnings, and errors are logged into
                /etc/tunables/lastboot.log. tunrestore -R can only be called
                from /etc/inittab.

Additionally, a tunable file called /etc/tunables/lastboot is automatically generated.
That file has all the tunables listed with numerical values. The values representing
default values are marked with the comment DEFAULT VALUE. Its info stanza includes the
checksum of the /etc/tunables/lastboot.log file to make sure pairs of
lastboot/lastboot.log files can be identified.

tunsave command
The tunsave command saves the current state of the tunables parameters in a file.
The syntax for the tunsave command is:
tunsave [ -a | -A ] -f | -F Filename [ -d Description ]
where:
-a              Saves all tunable parameters, including those that are currently
                set to their default value. These parameters are saved with the
                special value DEFAULT.
-A              Saves all tunable parameters, including those that are currently
                set to their default value. These parameters are saved
                numerically, and a comment, # DEFAULT VALUE, is appended to the
                line to flag them.
-d Description  Specifies the text to use for the Description field. Special
                characters must be escaped or quoted inside the Description
                field.
-f Filename     Specifies the name of the tunable file where the tunable
                parameters are saved. If Filename already exists, an error
                message is displayed. If it does not contain the '/' (forward
                slash) character, the Filename is relative to /etc/tunables.
-F Filename     Specifies the name of the tunable file where the tunable
                parameters are saved. If Filename already exists, the existing
                file is overwritten. If it does not contain the '/' (forward
                slash) character, the Filename is relative to /etc/tunables.

If Filename does not already exist, a new file is created. If it already exists, an error
message prints unless the -F flag is specified, in which case, the existing file is
overwritten.

tundefault command
The tundefault command resets all tunable parameters to their default values. It
launches all the tuning commands (ioo, vmo, schedo, no and nfso) with the -D flag. This
resets all the AIX tunable parameters to their default value, except for parameters of
type Bosboot and Reboot, and parameters of type Incremental set at values bigger
than their default value, unless -r was specified. Error messages are displayed for any
parameter change impossible to make.
The syntax for the tundefault command is:
tundefault [ -r | -p ]
where:
-r  Defers the reset to their default value until the next reboot. This clears
    stanza(s) in the /etc/tunables/nextboot file, and if necessary, proposes
    bosboot and warns that a reboot is needed.
-p  Makes the changes permanent: resets all the tunable parameters to their
    default values and updates the /etc/tunables/nextboot file.
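
Tying the file control commands together, a hedged end-to-end sketch (mytunes is a
hypothetical file name):
# tunsave -a -f mytunes -d "before July changes"   (snapshot all current values)
# vi /etc/tunables/mytunes                         (edit the saved stanza file)
# tuncheck -r -f mytunes                           (validate it for the next boot)
# tunrestore -r -f mytunes                         (copy it to /etc/tunables/nextboot)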


Instructor notes:
Purpose List the commands that are used to control the tunables files.
Details tunrestore does not apply changes if its input file contains any incorrect
settings.
Additional information
Transition statement Now, it's time for a checkpoint.


Checkpoint (1 of 2)
1. Use these terms with the following statements:
   benchmarks, metrics, baseline, performance goals, throughput, response time
   a. Performance is dependent on a combination of ____________ and ___________________ .
   b. Expectations can be used as the basis for _______________ .
   c. These are standardized tests used for evaluation. ________________________
   d. You need to know this to be able to tell if your system is performing
      normally. _______________________
   e. These are collected by analysis tools. ___________________


Figure 1-21. Checkpoint (1 of 2)


Notes:


Instructor notes:
Purpose Review and test the students' understanding of this unit.
Details A suggested approach is to give the students about five minutes to answer the
questions on this page. Then, go over the questions and answers with the class.

Checkpoint solutions (1 of 2)
1. Use these terms with the following statements:
   benchmarks, metrics, baseline, performance goals, throughput, response time
   a. Performance is dependent on a combination of throughput and response time.
   b. Expectations can be used as the basis for performance goals.
   c. These are standardized tests used for evaluation. benchmarks
   d. You need to know this to be able to tell if your system is performing
      normally. baseline
   e. These are collected by analysis tools. metrics


Additional information
Transition statement The next page has more checkpoint questions.


Checkpoint (2 of 2)
2. The four components of system performance are:

3. After tuning a resource or system parameter and monitoring the outcome, what is
   the next step in the tuning process?
   __________________________________________________________

4. The six tuning options commands are:

Figure 1-22. Checkpoint (2 of 2)


Notes:


Instructor notes:
Purpose Review and test the students' understanding of this unit.
Details A suggested approach is to give the students about five minutes to answer the
questions on this page. Then, go over the questions and answers with the class.

Checkpoint solutions (2 of 2)
2. The four components of system performance are:
   CPU
   Memory
   I/O
   Network
3. After tuning a resource or system parameter and monitoring the outcome,
   what is the next step in the tuning process? Determine if the performance
   goal(s) have been met.
4. The six tuning options commands are:
   schedo
   vmo
   ioo
   lvmo
   no
   nfso

Additional information
Transition statement Let's move on to the exercise.


Exercise 1: Work with tunables files
List the attributes of tunables
Validate the tunable parameters
Examine the tunables files
Reset tunables to their default values

Figure 1-23. Exercise 1: Work with tunables files

Notes:


Instructor notes:
Purpose
Details
Additional information
Transition statement


Unit summary
This unit covered:
The following performance terms:
- Throughput, response time, benchmark, metric, baseline, performance goal
Performance components
The performance tuning process
Tools available for analysis and tuning

Figure 1-24. Unit summary

Notes:


Instructor notes:
Purpose Summarize the unit.
Details
Additional information
Transition statement


Unit 2. Data collection

Estimated time
01:15 (0:45 Unit; 0:30 Exercise)

What this unit is about
This unit describes how to define a performance problem, then use tools such as the
PerfPMR utility, topas, and nmon to collect performance data.

What you should be able to do
After completing this unit, you should be able to:
Describe a performance problem
Install PerfPMR
Collect performance data using PerfPMR
Describe the use of the following tools:
- topas
- nmon

How you will check your progress
Accountability:
Checkpoint
Machine exercises

References
AIX Version 6.1 Performance Management
AIX Version 6.1 Performance Tools Guide and Reference
AIX Version 6.1 Commands Reference, Volumes 1-6
SG24-6478 AIX 5L Practical Performance Tools and Tuning Guide (Redbook)
Unit objectives
At the end of this unit, you should be able to:
Describe a performance problem
Install PerfPMR
Collect performance data using PerfPMR
Describe the use of the following tools:
- topas
- nmon

Figure 2-1. Unit objectives

Notes:


Instructor notes:
Purpose List the objectives for this unit.
Details Set the students' expectation that explaining the details of how to analyze
everything is beyond the scope of this course.
Additional information
Transition statement So, what exactly is a performance problem?


Performance problem description
When someone reports a performance problem:
It is not enough to just gather data and analyze it
You must know the nature of the problem
Otherwise, you may waste a lot of time analyzing data which may have nothing to do with the problem being reported
How can you find out the nature of the problem?
Ask many questions regarding the performance problem

Figure 2-2. Performance problem description

Notes:
What should a customer do?
If a performance problem exists, the customer should contact their local support center
to open a Problem Management Report (PMR). They should include as much
background of the problem as possible. Then, collect and analyze the data.

What typically happens?


It is quite common for support personnel to receive a problem report that says only
that someone has a performance problem on the system, with some data attached for
analysis. That limited information is not enough to accurately determine the nature
of a performance problem.
An analogy would be a patient that visits a doctor, tells the doctor that she or he is sick,
and then expects an immediate diagnosis. The doctor could run many tests on the
patient gathering data such as blood tests, x-rays, and so forth, and may even find
interesting results. However, these results may have nothing to do with the problem that
the patient is reporting.
As such, a performance problem is the same. The data could show 100% CPU
utilization and a high run queue, but that may have nothing to do with the cause of the
performance problem. Take, for example, a system where users are logged in from
remote terminals over a network that goes over several routers. The users may report
that the system is slow. Data could show that the CPU is very heavily utilized. But the
real problem could be that the characters get echoed after long delays on their
terminals due to packets getting lost on the network (which could be caused by failing
routers or overloaded networks) and may have nothing to do with the CPU utilization on
the machine. If, on the other hand, the complaint was that a batch job on the system
was taking a long time to run, then CPU utilization or I/O bandwidth may be related. It is
very important to get as much detail as possible before even attempting to collect or
analyze data.

Questions to ask
Ask many questions regarding the performance problem:
- Can the problem be demonstrated with the execution of a specific command or
sequence of events? (that is, ls /slow/fs, or ping xxxx, and so forth). If not, describe the
least complex example of the problem.
- Is the slow performance intermittent? Does the system get slow and then run
normally for a while? Does it occur at certain times of the day or in relation to some
specific activity?
- Is everything slow or just some things?
- What aspect is slow? For example, time to echo a character or elapsed time to
complete a transaction or time to paint the screen?
- When did the problem start occurring? Was it that way since the system was first
installed or went into production? Did anything change on the system before the
problem occurred (such as adding more users or migrating additional data to the
system)?
- If the problem involves a client/server, can the problem be demonstrated when run
just locally on the server (network versus server issue).
- If network related, what are the network segments like (including media information
such as 100 Mbps, half-duplex, and so forth) and what routers are between the
client/server application?
- What vendor applications are running on the system and are they involved in the
performance issue?
- What is the impact of the performance problem to the users?
- Are there any entries in the error log?

Instructor notes:
Purpose List the questions to ask when a performance problem is reported.
Details
Additional information
Transition statement The next step after getting the description of the problem is to
collect performance data.


Collecting performance data
The data may be from just one system or from multiple systems
Gather a variety of data
To make this simple, a set of tools supplied in a package called PerfPMR is available
PerfPMR is downloadable from a public website:
ftp://ftp.software.ibm.com/aix/tools/perftools/perfpmr
Choose the appropriate version based on the AIX release
PerfPMR may be updated for added functionality on an ongoing basis
Download a new copy if your copy is back level
Be sure to collect the performance data while the problem is occurring!

Figure 2-3. Collecting performance data

Notes:
Overview
It is important to collect a variety of data that show statistics regarding the various
system components. In order to make this easy, a set of tools supplied in a package
called PerfPMR is available on a public ftp site. The following URL can be used to
download your version using a Web browser:
ftp://ftp.software.ibm.com/aix/tools/perftools/perfpmr

The goal
The goal is to collect a good base of information that can be used by AIX technical
support specialists or development lab programmers to get started in analyzing and
solving the performance problem. This process may need to be repeated after analysis
of the initial set of data is completed.


Instructor notes:
Purpose To explain that data needs to be collected and where to get the data collection
tools.
Details The public ftp site at ftp.software.ibm.com should be checked every few months
to see if a newer version is available. Currently, there are versions for AIX V3.2.5, AIX V4,
AIX V4.3.3, AIX 5L V5.1, V5.2, V5.3, and AIX 6.1.
Additional information
Transition statement Now, let's see how to install PerfPMR.


Installing PerfPMR
Download the latest PerfPMR version from the website
Read about the PerfPMR process in the README file
Install PerfPMR:
Login as root
Create the directory: /tmp/perf61
Extract the shell scripts out of the compressed tar file
Install the shell scripts (using the Install script)

Figure 2-4. Installing PerfPMR

Notes:
Download PerfPMR
Obtain the latest version of PerfPMR from the Web site
ftp://ftp.software.ibm.com/aix/tools/perftools/perfpmr.
The PerfPMR package is distributed as a compressed tar file.

Install PerfPMR
The following assumes you are installing the PerfPMR version for AIX 6.1, the tar file is
in /tmp, and the tar file is named perf61.tar.Z.
1. Login as root or use the su command to obtain root authority

2. Create a perf61 directory and change to that directory (this example assumes
the directory created is under /tmp):
# mkdir /tmp/perf61
# cd /tmp/perf61
3. Extract the shell scripts out of the compressed tar file:
# zcat /tmp/perf61.tar.Z | tar -xvf -
4. Install the shell scripts:
# sh ./Install
A link will be created in /usr/bin to the perfpmr.sh script.
The PerfPMR process is described in a README file provided in the PerfPMR
package.
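As a quick sanity check after installation, you can verify that the link and the extracted
scripts are in place (a minimal example; /tmp/perf61 matches the directory used above):
# ls -l /usr/bin/perfpmr.sh
# ls /tmp/perf61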


Instructor notes:
Purpose To go over the steps to download and install PerfPMR.
Details
Additional information
Transition statement Now that you have PerfPMR installed, how do you use it to
collect the information?


Capturing data with PerfPMR
Create a directory to collect the PerfPMR data
Run perfpmr.sh 600 to collect the standard data
It will run considerably longer than 600 seconds
Do not terminate it before it finishes
perfpmr.sh runs specialized scripts to collect the data
perfpmr.sh will collect information by:
Running a kernel trace (trace.sh) for 5 seconds
Gathering 600 seconds of general system performance data (passed to the monitor.sh script)
Collecting hardware and software configuration information
Running trace-based utilities (for example: filemon, tprof)
Running network traces
Lengths of execution controlled by perfpmr.cfg
Answer the questions in PROBLEM.INFO

Figure 2-5. Capturing data with PerfPMR

Notes:
Data collection directory
Create a data collection directory and cd into this directory. Allow at least
12 MB/processor of unused space in whatever file system is used. Use the df
command to verify the file system has at least 30000 blocks in the Free column (30000
512 byte blocks = 15 MB).
Do not collect data in a remotely mounted file system since iptrace may hang.
If there is not enough space in the file system, perfpmr.sh will print a message similar
to:
perfpmr.sh: There may not be enough space in this filesystem
perfpmr.sh: Make sure there is at least 44 Mbytes
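For example, to set up a collection directory and confirm the available space before
starting (a minimal sketch; the /tmp/perfdata directory name is just an illustration):
# mkdir /tmp/perfdata
# cd /tmp/perfdata
# df -k .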


Preparing for PerfPMR


The following filesets should be installed before running perfpmr.sh:
- bos.acct
- bos.sysmgt.trace
- bos.perf.tools
- bos.net.tcp.server
- bos.adt.include
- bos.adt.samples
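One simple way to confirm that these filesets are present (using the standard lslpp
command):
# lslpp -l bos.acct bos.sysmgt.trace bos.perf.tools bos.net.tcp.server bos.adt.include bos.adt.samples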

Running PerfPMR
To run PerfPMR, type in the command perfpmr.sh. One of the scripts perfpmr.sh
calls is monitor.sh. monitor.sh calls several scripts to run performance monitoring
commands. By default, each of these performance monitoring commands called by
monitor.sh will collect data for 10 minutes (600 seconds). This default time can be
changed by specifying the number of seconds to run as the first parameter to
perfpmr.sh. For example, perfpmr.sh 300 will collect data for 5 minutes (300
seconds). The minimum time is 60 seconds.
Some of the flags for perfpmr.sh are:
-P      Preview only. Show scripts to run and disk space needed
-Q      Do not run lsattr, lslv, or lspv commands in order to save time
-I      Get lock instrumented trace also
-g      Do not collect gennames output
-f      If gennames is run, specify gennames -f
-n      Used if no netstat or nfsstat desired
-p      Used if no pprof collection desired while monitor.sh running
-s      Used if no svmon desired
-c      Used if no configuration information is desired
-d sec  sec is time to wait before starting collection period (default is 0)
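For instance, to preview what a run would collect and then start a 10-minute collection
that skips the svmon snapshots (a sketch combining the flags described above):
# perfpmr.sh -P 600
# perfpmr.sh -s 600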

Data collected
By default, the perfpmr.sh script provided will:
- Immediately collect a 5 second trace (trace.sh 5)
- Collect 600 seconds of general system performance data using interval tools such
as vmstat, iostat, emstat, and sar (monitor.sh 600)
- Collect hardware and software configuration information using commands such as
uname -m, lsps -a, lsdev -C, mount, and df (config.sh)
In addition, if it finds the following programs available in the current execution path, it
will:
- Collect 10 seconds of tcpdump information (tcpdump.sh 10)
- Collect 10 seconds of iptrace information (iptrace.sh 10)
- Collect 60 seconds of filemon information (filemon.sh 60)
- Collect 60 seconds of tprof information (tprof.sh 60)
You can also run the PerfPMR scripts individually. If you run them as an argument to
perfpmr.sh with the -x flag (for example, perfpmr.sh -x tprof.sh), you do not need
to know where PerfPMR was installed and give it the full path name. The perfpmr.sh
command is automatically known to the system.
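For instance, to collect just a fresh 60-second tprof profile using this mechanism (a
minimal example based on the -x usage described above):
# perfpmr.sh -x tprof.sh 60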
For HACMP users, it is generally recommended that the HACMP deadman switch
interval be lengthened while performance data is being collected to avoid accidental
failovers.

PROBLEM.INFO
The text file in the data collection directory, PROBLEM.INFO, asks many questions that
help give a more complete picture of the problem.This background information about
the problem gives the person trying to solve the problem a better understand what is
going wrong.
Some examples of the questions in PROBLEM.INFO are:
- Can you append more detail on the simplest, repeatable example of the problem?
That is, can the problem be demonstrated with the execution of a specific command
or sequence of events? (that is, ls /slow/fs takes 60 seconds, or a binary mode ftp
put from one specific client only runs at 20 KB/second.)
If not, describe the least complex example of the problem.
Is the execution of AIX commands also slow?
- Is this problem a case of something that had worked previously (that is, before an
upgrade) and now does not run properly?
If so, describe any other recent changes (that is, workload, number of users,
networks, configuration, and so forth).
- Is this a case of an application/system/hardware that is being set up for the first time?
If so, what performance is expected and on what is it based?


More PerfPMR information


To learn more about using PerfPMR and where to send the data, read the README file
that comes in the PerfPMR tar file. Also, the beginning of each script file contains a
usage message explaining the parameters for that script.


Instructor notes:
Purpose To explain the data that PerfPMR collects.
Details Emphasize that PerfPMR will run longer than the time specified in the
perfpmr.sh command.
Do not go into detail about monitor.sh. monitor.sh and the scripts it calls will be
discussed soon.
Additional information
Transition statement PerfPMR produces several types of reports.


PerfPMR report types
The primary report types are:
.int     Data collected at intervals over time
.sum     Averages or differences during the time
.out     One-time output from various commands
.before  Data collected before the monitoring time
.after   Data collected after the monitoring time
.raw     Binary files for input into other commands
Most report file names are self explanatory.
For example: vmstat.int or sar.int
More generic names are not as obvious.
For example: monitor.sum or config.sum

Figure 2-6. PerfPMR report types

Notes:
PerfPMR output files
PerfPMR collects its data into many different files. The types of files created are:
- *.int files are from commands that collect the data at intervals over time. For
example, data collected from vmstat, iostat, sar, lparstat, mpstat, netstat and
nfsstat.
- *.sum files contain data that is collected once. There is also a file called
monitor.sum that contains statistics that are averaged from the monitor.int files.
- *.out files contain the output from a command just run once
- *.before files contain information for commands run at the beginning of the
monitoring period. One file that does not follow this convention is psb.elfk that
contains the ps -elfk output before the monitoring period.

Copyright IBM Corp. 2010

Unit 2. Data collection


Course materials may not be reproduced in whole or in part
without the prior written permission of IBM.

2-17

Instructor Guide

- *.after files contain information for commands run at the end of the monitoring
period. One file that does not follow this convention is psa.elfk that contains the
ps -elfk output after the monitoring period.
- *.raw files are binary files from utilities like trace, iptrace, and tcpdump. These
files can be processed to create the ASCII report file by using the -r flag with the
shell program. For example, iptrace.sh -r.
The .int data is most useful for metrics analysis. The .sum data is most useful for
overall or configuration type of data. The .before and .after data are metrics before the
testcase begins and those at the end of the test interval. These are good for determining a
starting point and a delta value for what occurred over the life of the test interval.


Instructor notes:
Purpose Define the different types of PerfPMR reports.
Details
Additional information
Transition statement While the contents for many report files are obvious from the
name of the file, others are not so clear. Let us first look at what you will find in the
monitor.int and monitor.sum files.


Generic report contents (1 of 3)
monitor.int:
ps -elfk listings (before and after)
sar -A interval report
iostat interval report
vmstat interval report
emstat interval report
monitor.sum:
ps -elfk deltas from before to after
sar interval reports
vmstat interval averages
vmstat -s deltas from before to after

Figure 2-7. Generic report contents (1 of 3)

Notes:
Overview
AIX Support frequently changes and enhances the PerfPMR tool. It is recommended that
you periodically download a new copy (at least every three months) and before using it to
document an open PMR.
The description provided in this course may not be up to date.

monitor.sum contents
monitor.sum contains output from the following files:
- ps.sum
- sar.sum
- iostat.sum
- vmstat.sum

Additional reports from monitor.sh


The following are additional reports from monitor.sh:
- netstat.int contains output from various netstat commands
- nfsstat.int contains output from nfsstat -m and nfsstat -csnr
- lsps.before and lsps.after contain output from lsps -a and lsps -s
- vmstati.before and vmstati.after contain output from vmstat -i
- vmstat_v.before and vmstat_v.after contain output from vmstat -v
- svmon.before and svmon.after contain output from svmon -G and svmon -Pns
- svmon.before.S and svmon.after.S contain output from svmon -lS

Capturing before data


The monitor.sh script captures initial data by invoking the following commands and
scripts:
- lsps -a and lsps -s output into the lsps.before file
- vmstat -i output into the vmstati.before file
- vmstat -v output into the vmstat_v.before file
- svmon.sh (for output see the svmon.sh section below)

Capturing after data


The following commands and scripts capture the data after the measurement period:
- lsps -a and lsps -s output into the lsps.after
- vmstat -i output into the vmstati.after
- vmstat -v output into the vmstat_v.after
- svmon.sh (for output see the svmon.sh section below)

svmon.sh
The svmon command captures and analyzes a snapshot of virtual memory. The svmon
commands that the svmon.sh script invokes are:
- svmon -G which gathers general memory usage information.
- svmon -Pns which gathers memory usage statistics for all active processes. It
includes non-system segments (n) and system segments (s).

- svmon -lS which gathers memory usage statistics for defined segments (S). For
each displayed segment (l), the list of process identifiers that use the segment and,
according to the type of report, the entity name to which the process belongs. For
special segments, a label is displayed instead of the list of process identifiers.
The following files are created:
- svmon.before contains the svmon -G and svmon -Pns information at the
beginning of data collection
- svmon.before.S contains the svmon -lS information at the beginning of data
collection
- svmon.after contains the svmon -G and svmon -Pns information at the end of data
collection
- svmon.after.S contains the svmon -lS information at the end of data collection

Starting system monitors


The monitor.sh script invokes the following scripts to monitor system data for the
amount of time given in the perfpmr.sh or monitor.sh command.
- nfsstat.sh (unless the -n flag was used)
- netstat.sh (unless the -n flag was used)
- ps.sh
- vmstat.sh
- emstat.sh (unless the -e flag was used)
- mpstat.sh (unless the -m flag was used)
- lparstat.sh (unless the -l flag was used)
- sar.sh
- iostat.sh
- pprof.sh (unless the -p flag was used)

netstat.sh
The netstat subcommand symbolically displays the contents of various
network-related data structures for active connections.
The netstat.sh script builds a report on network configuration and use called
netstat.int containing tokstat -d of the token-ring interfaces, entstat -d of the
Ethernet interfaces, netstat -in, netstat -m, netstat -rn, netstat -rs,
netstat -s, netstat -D, and netstat -an before and after monitor.sh was run. You

can reset the Ethernet and token-ring statistics and re-run this report by running
netstat.sh -r 60. The time parameter must be greater than or equal to 60.

nfsstat.sh
The nfsstat command displays statistical information about the Network File System
(NFS) and Remote Procedure Call (RPC) calls.
The nfsstat.sh script builds a report on NFS configuration and use called nfsstat.int
containing nfsstat -m and nfsstat -csnr before and after nfsstat.sh was run. The
time parameter must be greater than or equal to 60.

ps.sh
The ps command shows current status of processes.
The ps.sh script builds reports on process status (ps). The following files are created:
- psa.elfk contains a ps -elfk listing after ps.sh was run.
- psb.elfk contains a ps -elfk listing before ps.sh was run.
- ps.int contains the active processes before and after ps.sh was run.
- ps.sum contains a summary of the changes between when ps.sh started and
finished. This is useful for determining what processes are consuming resources.
The time parameter must be greater than or equal to 60.

vmstat.sh
The vmstat subcommand displays virtual memory statistics.
The vmstat.sh script builds three reports with vmstat:
- Interval report called vmstat.int
- Summary report called vmstat.sum
- Report with the absolute count of paging activities called vmstat_s.out (vmstat -s)
The time parameter must be greater than or equal to 60.

emstat.sh
The emstat command shows emulation exception statistics.
The emstat.sh script builds a report called emstat.int on emulated instructions. The
time parameter must be greater than or equal to 60.

mpstat.sh
The mpstat command collects and displays performance statistics for all logical CPUs
in the system.
The mpstat.sh script builds a report called mpstat.int with performance statistics for all
logical CPUs in the system.
The time parameter must be greater than or equal to 60.

lparstat.sh
The lparstat command reports logical partition (LPAR) related information and
statistics.
The lparstat.sh script builds two reports on logical partition (LPAR) related
information and statistics:
- Interval report called lparstat.int
- Summary report called lparstat.sum
The time parameter must be greater than or equal to 60.

sar.sh
The sar command collects, reports, or saves system activity information.
The sar.sh script builds reports using sar. The following files are created:
- sar.int contains output of commands sadc 10 7 and sar -A
- sar.sum is a sar summary over the period sar.sh was run
The time parameter must be greater than or equal to 60.

iostat.sh
The iostat command reports CPU statistics, asynchronous input/output (AIO) and
input/output statistics for the entire system, adapters, tty devices, disks and CD-ROMs.
The iostat.sh script builds two reports on I/O statistics:
- Interval report called iostat.int
- Summary report called iostat.sum
The time parameter must be greater than or equal to 60.

pprof.sh
The pprof command reports CPU usage of all kernel threads over a period of time.

The pprof.sh script builds a file called pprof.trace.raw that can be formatted with
the pprof.sh -r command. The time parameter does not have any restrictions.


Instructor notes:
Purpose Provide an overview of the monitor.sh script.
Details
Additional information
Transition statement Let us next look at what we will find in the config.sum file.


Generic report contents (2 of 3)
config.sum:
uname -m
lscfg -l mem\*; lscfg -vp
lsps -a; lsps -s
ipcs -Smqsa
lsdev -C
LVM information:
lspv; lspv -l
lsvg rootvg; lsvg -l rootvg
lslv (for each LV)
lsattr -E for many devices, including:
Adapters
Interfaces
Logical volumes
Disks
Volume groups
sys0

Figure 2-8. Generic report contents (2 of 3)

Notes:
The purpose of the config.sum file is to provide static information about the configured
environment. It identifies information about the adapters and devices being used, the LVM
and file systems defined, the paging spaces, inter-process communications, network
configuration, the current tuning parameters, memory environment, error log contents, and
more.


Instructor notes:
Purpose Identify the monitor.sh reports that produce before and after data.
Details
Additional information
Transition statement Let us continue to survey what you will find in the config.sum
file.


Generic report contents (3 of 3)
config.sum (continued):
Filesystem information:
mount, uname -m, lsfs -q, df
Network information:
netstat reports, ifconfig -a
Tunables listings:
no, nfso, schedo, vmo, ioo, lvmo, raso
vmstat -v
errctrl -q
kdb information:
Memory, filesystems, and more
System auditing status
Environment variables
Error report
And more

Figure 2-9. Generic report contents (3 of 3)

Notes:
This visual continues the summary of the config.sum file information.


Instructor notes:
Purpose Identify the monitor.sh reports that produce interval-type data.
Details Describe the function of each of the scripts on the visual. Be sure the students
understand that each of these scripts can be run individually.
Additional information
Transition statement While we will not cover how to analyze kernel trace reports or
network trace reports in this course, you may be asked to generate one of these reports. Let
us look at how that is done with the PerfPMR raw trace files.


Formatting PerfPMR raw traces
Kernel trace files:
Creates one raw trace file per CPU (trace.raw-1, trace.raw-2, and so on).
To merge kernel trace files together:
# trcrpt -C all -r trace.raw > trace.r
To get a trace report:
# ./trace.sh -r
Network trace files:
To create a readable IP trace report file:
# /tmp/perf61/iptrace.sh -r
To create a readable tcpdump report file:
# /tmp/perf61/tcpdump.sh -r

Figure 2-10. Formatting PerfPMR raw traces

Notes:
Kernel trace
Because trace can collect huge amounts of data, the trace executed in perfpmr.sh will
only run for five seconds (by default). This may not be enough time to collect trace data
if the problem is not occurring during that time. In this case, you should run the trace by
itself for a period of 15 seconds when the problem is present.
The command trace.sh 15 will run a trace for 15 seconds.
The trace.sh script issues a trcstop command to stop any trace that may already be
running. Remember that only one trace can be running at a time.

Trace files created


The trace.sh script creates one raw trace file per CPU. The files are called
trace.raw-0, trace.raw-1, and so forth for each CPU. Another raw trace file called
trace.raw is also generated. This is a master file that has information that ties in the
other CPU-specific traces. To merge the trace files together to form one raw trace file,
run the following command:
trcrpt -C all -r trace.raw > trace.r
The -C all flag specifies that all CPUs should be used. The -r flag outputs
unformatted (raw) trace entries and writes the contents of the trace log to standard
output, by default. The example redirects the output to a file named trace.r. The trace.r
file can be used as input into other trace-based utilities such as curt and splat.
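As an illustration, after merging, the file might be fed to curt like this (a hedged
sketch; it assumes the trace.nm file produced by PerfPMR, described below, is in the
same directory):
# trcrpt -C all -r trace.raw > trace.r
# curt -i trace.r -m trace.nm -o curt.out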

Creating a trace report with trace.sh

An ASCII trace report can be generated by running perfpmr.sh -x trace.sh -r. This
command creates a file called trace.int that contains the readable trace used for
analyzing performance problems. The trace.sh -r command will produce a report for
all trace events. The trcrpt command can also be used.
The trace.nm file is the output of the trcnm command which is needed to postprocess
the trace.raw file on a different system. The trace.fmt file is a copy of /etc/trcfmt.
There are additional trace files which are used to list the contents of the i-node table
and the listing of /dev so that i-node lock contention statistics can be viewed.

The iptrace utility


The iptrace utility provides interface-level packet tracing for Internet protocols.

iptrace.sh
When perfpmr.sh is run, it checks to see if the iptrace command is installed. If it is,
then the iptrace.sh script is invoked. iptrace will run for a default of 10 seconds. The
iptrace.sh script can also be run directly.
The iptrace.sh script builds a raw Internet Protocol (IP) trace report on network I/O
called iptrace.raw. You can convert the iptrace.raw file to a readable IP report file
called iptrace.int using the perfpmr.sh -x iptrace.sh -r command.

The tcpdump utility


The tcpdump utility dumps information of network traffic. It prints headers of packets on
a specified network interface.

tcpdump.sh
When perfpmr.sh is run, it checks to see if the tcpdump command is installed. If it is,
then the tcpdump.sh script is invoked. The tcpdump command will run for a default of
10 seconds. The tcpdump.sh script can also be run directly.
The tcpdump.sh script creates a raw trace file of a TCP/IP dump called tcpdump.raw.
To produce a readable tcpdump.int file, use the tcpdump.sh -r command.


Instructor notes:
Purpose Describe the trace.sh script.
Details This course does not cover kernel trace analysis. Nor does it cover the analysis
of network traces. Focus on the mechanics of working with the perfpmr output.
Emphasize to the students that they should not format the trace manually when sending to
IBM Support. Use trace.sh -r. Then, everyone looking at the trace will have the same
references to look at and the exact same data.
The perfpmr.sh script does not format the traces. If students already have the skills to work
with these traces, they may want to format the raw traces. This visual provides the
technique for generating the readable reports.
Additional information
Transition statement Now that you have some idea of what is provided by PerfPMR,
when should you run it?


When to run PerfPMR
OK. So now that I know all about PerfPMR and the data it collects, when do I need to run it?
When your system is running under load and is performing correctly, so you can get a baseline
Before you add hardware or upgrade your software
When you think you have a performance problem
It is better to have PerfPMR installed on a system before you need it rather than try to install it after the performance problem starts!

Figure 2-11. When to run PerfPMR

Notes:
Overview
PerfPMR should be installed when the system is initially set up and tuned. Then, you
can get a baseline measurement from all the performance tools. When you suspect a
performance problem, PerfPMR can be run again and the results compared with the
baseline measurement.
It is also recommended that you run PerfPMR before and after hardware and software
changes. If your system is performing fine, and then you upgrade your system and
begin to have problems, it is difficult to identify the problem without a baseline to
compare against.


Instructor notes:
Purpose Identify when PerfPMR should be run.
Details Stress one more time, the importance of baseline measurements!
Additional information
Transition statement Enough about PerfPMR for now. Let's take a look at an
interactive tool that gives us a system-wide view of performance data: the topas command.


The topas command

The figure shows a sample full-screen topas display for host sys144_lpar4 (interval:
2 seconds). The left portion shows CPU utilization (Kernel, User, Wait, Idle, Physc,
%Entc), Network, Disk, and FileSystem activity, and the list of top processes (Name,
PID, CPU%, PgSp, Owner). The right portion shows the EVENTS/QUEUES, FILE/TTY,
PAGING, MEMORY, PAGING SPACE, NFS, and WPAR subsections. Press "h" for help,
"q" to quit.

Figure 2-12. The topas command

Notes:
Overview
The topas command reports selected statistics about the activity on the local system.
Why does AIX have the topas command? Because similar tools that provide these
capabilities are available on other operating systems and on the Internet, but they are
not supported on AIX.
The topas tool is in the bos.perf.tools fileset. The path to the tool is /usr/bin/topas.
This tool can be used to provide a full screen of a variety of performance statistics.
The topas tool displays a continually changing screen of data rather than a sequence of
interval samples, as displayed by such tools as vmstat and iostat. Therefore, topas is
most useful for online monitoring and the other tools are useful for gathering detailed
performance monitoring statistics for analysis.

If you're running topas in a partition and issue a dynamic LPAR command which changes
the system configuration, then topas must be stopped and restarted to view accurate
data.

Output sections
The topas command can show many performance statistics at the same time. The
output consists of two fixed parts and a variable section.
The top several lines at the left of the display show the name of the system topas runs
on, the date and time of the last observation, and the monitoring interval.
The second fixed part fills the rightmost 25 positions of the display. It contains six
subsections of statistics: EVENTS/QUEUES, FILE/TTY, PAGING, MEMORY, PAGING
SPACE, and NFS.
The variable part of the topas display can have one, two, three, four, or five
subsections. If more than one subsection displays, they are always shown in the
following order: CPU utilization, network interfaces, physical disks, workload
management classes, and processes.
When the topas command is started, it displays all subsections that are to be
monitored. The exception to this is the workload management (WLM) classes
subsection, which is displayed only when WLM is active. These subsections can be
displayed or not by using the appropriate subcommand to toggle them on and off.

Syntax and options

The topas options are:
-d  Specifies the maximum number of disks shown. If this number exceeds the
    number of disks installed, the latter is used. If this argument is omitted, a
    default of 2 is used. If a value of zero is specified, no disk information is
    displayed.
-h  Displays help information.
-i  Sets the monitoring interval in seconds. The default is 2 seconds.
-n  Specifies the maximum number of network interfaces shown. If this number
    exceeds the number of network interfaces installed, the latter is used. If this
    argument is omitted, a default of 2 is assumed. If a value of zero is specified,
    no network information will be displayed.
-p  Specifies the maximum number of processes shown. If this argument is
    omitted, a default of 20 is assumed. If a value of zero is specified, no process
    information will be displayed. Retrieval of process information constitutes the
    majority of the topas overhead.
-w  Specifies the number of monitored Workload Manager classes. If this
    argument is omitted, a default of 2 is assumed.
-c  Specifies the number of monitored CPUs. If this argument is omitted, a
    default of 2 is assumed.
-P  Shows a full screen of processes.
-W  Shows only WLM data on the screen.
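For example, to refresh every 5 seconds and suppress the process list, which avoids
most of the topas overhead noted above (a minimal illustration):
# topas -i 5 -p 0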

Subcommands
While topas is running, it accepts one-character subcommands. Each time the
monitoring interval elapses, the program checks for one of the following subcommands
and responds to the action requested.
The subcommands are:
a  Show all the variable subsections being monitored. Pressing the a key always
   returns topas to the main initial display.
c  Pressing the c key repeatedly toggles the CPU subsection between the
   cumulative report, off, and a list of busiest CPUs.
d  Pressing the d key repeatedly toggles the disk subsection between busiest
   disks list, off, and total disk activity for the system.
f  Moving the cursor over a WLM class and pressing f shows the list of top
   processes in the class on the bottom of the screen (WLM display only).
h  Toggles between help screen and main display.
n  Pressing the n key repeatedly toggles the network interfaces subsection
   between busiest interfaces list, off, and total network activity.
p  Pressing the p key toggles the hot processes subsection on and off.
P  Toggle to the full screen process display.
q  Quit the program.
r  Refresh the screen.
w  Pressing the w key toggles the workload management (WLM) classes
   subsection on and off.
W  Toggle to the full screen WLM class display.

Instructor notes:
Purpose To describe the topas command output.
Details If the students are interested, demonstrate (or have them try) some of the
subcommands, such as c (for CPU), d (for disks), and the -p 0 flag (to not show processes).
Additional information The default output of topas shows CPU utilization, some
sar-type statistics, the most heavily used disks, the most heavily used network interfaces,
the top processes in terms of CPU usage, memory statistics, and NFS statistics.
Sometimes, you may not care about the process activity; and since on a system with a lot
of processes, getting process statistics causes more overhead, it may be useful to specify
the -p 0 flag in topas. This way it will not collect the process statistics.
If you want to see the individual CPU utilization, type the c key on the keyboard while topas
is running and it will dynamically change the display.
Transition statement Let's look at another tool that is similar to topas: nmon.


The nmon and nmon_analyser tools
nmon (Nigel's Monitor)
Similar in concept to topas
nmon not supported by IBM
nmon functionality integrated into topas (AIX 6, AIX 5.3 TL9, VIOS 1.2)
topas_nmon fully supported by AIX Support
nmon can be run in the following modes:
Interactive
Data recording (good for trends and capacity planning)
nmon_analyser
Graphing using Excel spreadsheets
Uses topasout or nmon output
Not supported by IBM (no warranty)
Obtained from www.ibm.com/developerworks/aix

Figure 2-13. The nmon and nmon_analyser tools

Notes:
Introduction
Like topas, the nmon tool is helpful in presenting important performance tuning
information on one screen and dynamically updating it.
Another tool, the nmon_analyser, takes files produced by nmon and turns them into
spreadsheets containing high quality graphs ready to cut and paste into performance
reports. The tool also produces analysis for ESS and EMC subsystems. It is available
for both Lotus 1-2-3 and Microsoft Excel.
The nmon tool and the nmon_analyser tool are free, but are NOT SUPPORTED by IBM.
No warranty is given or implied, and you cannot obtain help with it from IBM.

Obtaining the nmon tools


The nmon functionality which is incorporated into topas is not exactly the same as the
nmon tool supported by Nigel Griffiths. Below is information on obtaining Nigel's nmon
tool and also on obtaining the nmon_analyser.
The nmon tool can be obtained from:
http://www.ibm.com/developerworks/eserver/articles/nmon.html
The nmon_analyser tool can be obtained from:
http://www.ibm.com/developerworks/eserver/articles/nmon_analyser/index.html
You can FTP the tool (nmonXX.tar.Z) from the agreement and download page.
Read the README.txt file for more information about which version of nmon to run on
your particular operating system version. You also need to know if your AIX kernel is
32-bit or 64-bit. If you use the wrong one, nmon will simply tell you or fail to start (no
risk).
The README.txt file also has information on how to run and use the nmon tool.
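As a simple illustration of the data-recording mode mentioned above (a sketch using
nmon's standard recording flags; adjust the interval and count to your needs):
# nmon -f -s 60 -c 1440
This records 1440 snapshots at 60-second intervals (roughly 24 hours) into a
hostname_date_time.nmon file, which can then be loaded into the nmon_analyser
spreadsheet.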
The nmon_analyser tool is designed to work with the latest version of nmon but is also
tested with older versions for backwards compatibility. The tool is updated whenever
nmon is updated and at irregular intervals for new functionality.


Instructor notes:
Purpose Describe the nmon and nmon_analyser tools.
Details Emphasize that this is an unsupported tool, and its data is not recognized by
the IBM support team. No warranty is given or implied, and you cannot obtain help from
IBM. You can contact the author, Nigel Griffiths nag@uk.ibm.com, for assistance. Nigel will
place you on a list for nmon updates if you email him, using an email title explaining you
want list membership. You can also request to be a nmon beta tester for the next version.
Additional information See the README.txt file or the help while running nmon to get
detailed information on all the functions and flags that can be used.
Transition statement Let's examine what the AIX nmon display looks like.


The AIX nmon command

The figure shows a sample topas_nmon display for host sys144_lpar4 (refresh:
2 seconds) with four selected panels: CPU-Utilization-Small-View (per-logical-CPU
User%, Sys%, Wait%, and Idle%, plus entitled-capacity and virtual-CPU summary
lines), Memory (real memory and paging space usage, paging rates, file system cache
percentages, min/maxperm, min/maxfree, min/maxpgahead), Network (per-interface
receive and transmit KB/s, packet rates, MTU, and errors), and Disk-KBytes/second
(per-disk busy%, read KB/s, and write KB/s).

Figure 2-14. The AIX nmon command

Notes:
The graphic shows an example of what the nmon display can look like. This example
shows four panels that were selected for display: CPU, memory, network, and disk. As with
topas, the displayed values are updated dynamically on an interval.
nmon has wide variety of statistics panels which can be individually selected (or
deselected) for display through the use of single key strokes.
Pressing the h key will provide a list of the nmon subcommands.
This nmon mode of the topas command can be accessed either by executing the nmon
command, or by using the ~ (tilde) key to toggle between topas mode and nmon mode.


Instructor notes:
Purpose Explain basic nmon interactive functionality.
Details
Additional information
Transition statement Let's next cover a few checkpoint questions.


Checkpoint
1. What is the difference between a functional problem and a performance problem? _____________________________
2. What is the name of the supported tool used to collect reports with a wide variety of performance data? ________________
3. True / False You can individually run the scripts that perfpmr.sh calls.
4. True / False You can dynamically change the topas and nmon displays.

Figure 2-15. Checkpoint

Notes:


Instructor notes:
Purpose Review and test the students' understanding of this unit.
Details A suggested approach is to give the students about five minutes to answer the
questions on this page. Then, go over the questions and answers with the class.

Checkpoint solutions
1. What is the difference between a functional problem and a
performance problem? A functional problem is when an
application, hardware, or network is not behaving correctly. A
performance problem is when the function is working, but the
speed at which it performs is slow.
2. What is the name of the supported tool used to collect
reports with a wide variety of performance data? PerfPMR
3. True / False You can individually run the scripts that
perfpmr.sh calls.
4. True / False You can dynamically change the topas and
nmon displays.


Additional information
Transition statement The next page introduces the machine exercise.


Exercise 2: Data collection

Install PerfPMR
Collect performance data using
PerfPMR
Use topas and nmon to monitor the
system


Figure 2-16. Exercise 2: Data collection


Notes:


Instructor notes:
Purpose Introduce the exercise.
Details
Additional information
Transition statement Let's summarize the unit.


Unit summary
This unit covered:
Defining a performance problem
Installing PerfPMR
Collecting performance data using PerfPMR
The use of the following tools:
topas
nmon


Figure 2-17. Unit summary


Notes:


Instructor notes:
Purpose Summarize the unit.
Details
Additional information
Transition statement


Unit 3. Monitoring, analyzing, and tuning CPU usage
Estimated time
4:20 (2:00 Unit; 2:20 Exercise)
Materials before WPAR - 50 minutes
WPAR materials - 20 minutes
General statistics - 24 minutes
SMT and SPLPAR - 24 minutes

What this unit is about


This unit identifies the tools to help determine CPU bottlenecks. It also
demonstrates techniques to tune CPU-related issues on your system.

What you should be able to do


After completing this unit, you should be able to:
Describe processes and threads
Describe how process priorities affect CPU scheduling
Manage process CPU utilization with either
- nice and renice commands
- WPAR resource controls
Use the output of the following AIX tools to determine symptoms of
a CPU bottleneck:
- vmstat, sar, ps, topas, tprof, nmon
Correctly interpret CPU statistics in various environments including
where:
- Simultaneous multi-threading (SMT) is enabled
- LPAR is using a shared processor pool


How you will check your progress


Accountability:
Checkpoint
Machine exercises

References
AIX Version 6.1 Performance Management
AIX Version 6.1 Performance Tools Guide and Reference
AIX Version 6.1 Commands Reference, Volumes 1-6
SG24-5977   AIX 5L Workload Manager (WLM) (Redbook)
SG24-6478   AIX 5L Practical Performance Tools and Tuning Guide (Redbook)
SG24-7940   Introduction to Advanced POWER Virtualization on IBM p5 Servers: Introduction and Basic Configuration (Redbook)
SG24-5768   IBM eServer p5 Virtualization Performance Considerations (Redbook)
CPU monitoring and tuning article:
http://www-128.ibm.com/developerworks/eserver/articles/aix5_cpu/


Unit objectives
After completing this unit, you should be able to:
Describe processes and threads
Describe how process priorities affect CPU scheduling
Manage process CPU utilization with either
nice and renice commands
WPAR resource controls
Use the output of the following AIX tools to determine
symptoms of a CPU bottleneck:
vmstat, sar, ps, topas, tprof
Correctly interpret CPU statistics in various environments
including where:
Simultaneous multi-threading (SMT) is enabled
LPAR is using a shared processor pool

Figure 3-1. Unit objectives


Notes:
Introduction
The objectives in the visual above state what you should be able to do at the end of this
unit.


Instructor notes:
Purpose Review the objectives for this unit.
Details Explain what we'll cover and what the students should be able to do at the end
of the unit.
Additional information
Transition statement Next is a flowchart summarizing a basic CPU monitoring
strategy.


CPU monitoring strategy

(Flowchart)
Monitor CPU usage and compare with goals
  -> High CPU usage?
       Yes -> Locate dominant process(es)
                -> Is process behavior normal?
                     No  -> Kill abnormal processes
                     Yes -> Tune applications / operating system
       No  -> CPU supposed to be idle?
                No -> Determine cause of idle time by tracing
                        -> Fix or tune the app or system

Figure 3-2. CPU monitoring strategy


Notes:
Overview
This flowchart illustrates the CPU-specific monitoring and tuning strategy. If the system
is not meeting the CPU performance goal, you need to find the root cause for why the
CPU subsystem is constrained. It may be simply that the system needs more physical
CPUs, but it could also be because of errant applications or processes gone awry. If the
system is behaving normally but is still showing signs of a CPU bottleneck, tuning
strategies may help to get the most out of the CPU resources.

Monitoring usage and comparing with goal(s)


For any tuning strategy it is important to know the baseline performance statistics for a
particular system and what the performance goals are. Then you can compare current
statistics to see if they are abnormal or not meeting the goal(s). Be sure to take baseline
measurements over time to spot any troubling trends.
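For example, a simple way to capture a baseline over time (a sketch; the
interval, count, and file names are arbitrary choices):

  # Collect one hour of statistics at 60-second intervals into dated files:
  vmstat 60 60 > /tmp/baseline.vmstat.$(date +%Y%m%d)
  sar -u 60 60 > /tmp/baseline.sar.$(date +%Y%m%d)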


High CPU usage


If you spot unusually high CPU usage when monitoring, the next questions to ask are:
Which processes are accumulating CPU time? Are they supposed to be accumulating
so much CPU time? If they are, then perhaps there are some tuning strategies you can
use to tune the application or the operating system to make sure that important
processes get the CPU they need to meet the performance goal.

Idle CPU
Another scenario is that you are not meeting performance goals and the CPUs are fairly
idle or not working as much as they should. This points to a bottleneck in another area
of the computer system.


Instructor notes:
Purpose Describe the specific strategy for discovering CPU performance issues.
Details Touch on each of the boxes in the flowchart to show the overall strategy. This
flowchart gives students the structure for this unit.
At the end of this unit, we'll cover this flowchart again and all the tools used in this unit will
be shown where they would be used.
Additional information
Some terminology which is often used in discussing the CPU performance of various
computer designs.
CPU clock speed
The most common terminology used when referring to CPU performance is the speed
of the CPU, otherwise known as the CPU clock speed. This is in units of Megahertz
(MHz) or Gigahertz (GHz). One Megahertz is equal to one million CPU cycles per
second.
To find the CPU clock speed in hertz, use the lsattr -E -l proc0 -a frequency
command. This example shows a CPU clock speed of 1.5 GHz:
# lsattr -El proc0 -a frequency
frequency 1498500000 Processor Speed False
These days, CPU clock speed is becoming less of an indicator of raw throughput. The
design of the CPU can be more important than clock speed. Several parts of the CPU
are used for execution simultaneously, which is called pipelining. For example, if there
an upcoming branch, the CPU will make an educated guess as to which way it will go
and then it will prefetch the associated instructions.
CPU cycle
The CPU cycle is a basic unit of time used in executing instructions. Any instruction
requires one or more CPU clock cycles to complete an operation. This instruction is
known as a language instruction (machine or assembly). An example of an instruction is
a load, store, compare, or branch.
CPU execution time
The CPU execution time of a program can be determined by multiplying the average
number of instructions that will be executed by the number of CPU cycles per
instruction and dividing by the CPU clock speed. Different instructions consume
different numbers of CPU cycles.
Let's say you have a program which you just compiled. From looking at the instruction
listing produced by the compiler (with -qlist), you see that it has 2000 instructions.
Assuming for the sake of simplicity that on average each instruction takes two clock
cycles, you can then predict the CPU execution time needed to execute this program.
Let's also say that the CPU runs at 1 GHz; then the formula would be:

CPU time = 2000 instructions * 2 cycles per instruction / 1,000,000,000 cycles per second
         = 0.000004 seconds, or 4 microseconds
Path length
The number of elementary operations needed to complete a program is a result of the
compilation. This is also known as the path length of that program. The number of
cycles per instruction depends on the complexity of the instructions. The more
complicated the instructions are, the higher the number of cycles consumed.
Superscalar architecture means you can execute multiple instructions concurrently.
CPU performance
Faster processors have shorter clock cycles, but they are more expensive to produce.
Higher clock speeds usually result in higher silicon temperatures which then require
special technology to control the heat.
Keep in mind that time is also needed to fetch an instruction or data item from real
memory. The simple execution time formula assumes that the instructions or data are in
the registers. How long it takes to fetch something from memory will be dependent on
the particular hardware model. In general, the latency for accessing memory on a
multi-processor machine will be higher than on a uni-processor machine.
Transition statement Next, we'll introduce terms related to CPU performance.


Processes and threads

(Diagram: a program stored on disk is loaded into memory to run as a process.
A single-threaded process has one thread (Thread 1) dispatched to a single
CPU (CPU 0). A multi-threaded process has several threads (Threads 1-3), each
of which can be dispatched to a different CPU (CPU 0, 1, 2).)

Figure 3-3. Processes and threads


Notes:
Process
A process is the entity that the operating system uses to control the use of system
resources. A process is started by a command, shell program or another process.
Process properties include the process ID, process group ID, user ID, group ID,
environment, current working directory, file descriptors, signal actions, and statistics
such as resource usage. These properties are defined in /usr/include/sys/proc.h.

Thread
Each process is made up of one or more kernel threads. A thread is a single sequential
flow of control. A single-threaded process can only handle one operation at a time,
sequentially. Multiple threads of control allow an application to overlap operations, such
as reading from a terminal and writing to a file. AIX schedules and dispatches CPU

resources at the thread level. In general, when we refer to threads in this course, we will
be referring to the kernel threads within a process.
An application could also be designed to have user-level threads (also known as
pthreads) which are scheduled for work by the application itself or by the pthreads
scheduler in the pthreads shared library (libpthreads.a). These user threads may be
mapped to one or more kernel threads depending on the libpthreads thread policy used.
Multiple threads of control also allow an application to service requests from multiple
users at the same time. Threads provide these capabilities without the added overhead
of multiple processes such as those created through fork and exec system calls.
Rather than duplicating the environment of a parent process, as is done via fork and
exec, all threads within a process use the same address space and can communicate
with each other through variables. Threads synchronize their operations via mutex
(mutual exclusion) variables.
Kernel thread properties are: stack, scheduling policy, scheduling priority, pending
signals, blocked signals, and some thread-specific data. These thread properties are
defined in /usr/include/sys/thread.h.
AIX Version 4 introduced the use of threads to control processor time consumption, but
most of the system management tools still refer to the process in which a thread is
running, rather than the thread itself.


Instructor notes:
Purpose Define what is meant by processes and threads.
Details Differentiate processes and threads. Threads were introduced with AIX Version
4 and allow programs to execute more efficiently. You'll see, though, that most monitoring
tools show reports at the process level.
We use the term pthreads and user-level threads interchangeably in this course. The p in
pthreads stands for POSIX.
Additional information The fork routine called f_fork (fast fork) is very useful for
multi-threaded applications that will call exec() immediately after they would have called
fork(). The fork() system call is slower because it has to call fork handlers to acquire all
the library locks before actually forking and letting the child run all child handlers to initialize
all the locks. The f_fork() system call bypasses these handlers and calls the kfork
system call directly. Web servers are a good example of an application that can use
f_fork().
Transition statement Lets look at the life of a process.


The life of a process

(State diagram: before creation a process occupies the SNONE and then SIDL
idle states ("I"). Once created it is active ("A"), and its threads move
among the ready-to-run ("R"), running, sleeping ("S"), and stopped ("T")
states. When the process exits it becomes a zombie ("Z").)

Figure 3-4. The life of a process


Notes:
Introduction
A process can exist in a number of states during its lifetime.

I (Idle) state
Before a process is created, it needs a slot in the process and thread tables; at this
stage it is in the SNONE state.
While a process is undergoing creation, waiting for resources (memory) to be allocated,
it is in the SIDL state.


A (active) state
When a process is in an A state, one or more of its threads are in the R (ready-to-run)
state. Threads of a process in this state have to contend for the CPU with all other
ready-to-run threads.
Only one thread can have the use of the CPU at a time; this is the running thread for
that processor. With SMP models, there are several processors, each of which would
be running a different thread, as part of the same process, or as independent threads of
different processes.
A thread will be in an S state if it is waiting on an event or I/O. Instead of wasting
CPU time, it sleeps and relinquishes control of the CPU. When the I/O is completed, the
thread is awakened and placed in the ready-to-run state, where it must again compete
with other ready-to-run threads for the CPU.
A thread may be stopped via the SIGSTOP signal, and started again via the SIGCONT
signal; while suspended it is in the T state. This has nothing to do with performance
management.

Z (zombie) state
The Z state: When a process dies (exits) it becomes a zombie. A zombie occupies a
slot in the process table, and thread table, but no other resources. As such, zombies
are seldom a performance issue; they exist for a very short time until their parent
process receives a signal to say they have terminated. Parent processes which are
programmed in such a way that they ignore this signal, or even die before the child
processes they have created do, can leave zombies on the system. Such an
application, if long running, can with time fill up the process table to a unacceptable
level. One way to remove zombies is to reboot the system, but this is not always a
solution. You should investigate why the parent process is not cleaning up its zombies.
The application developer may need to modify program code to be sure to have a
SIGCHLD handler to read the exit status of their child processes.
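To look for zombies, you can search for processes in the Z state in the ps
output (a minimal sketch; with ps -ekl, the state is the second column):

  # List defunct (zombie) processes; the PPID column identifies the parent
  # that is failing to reap them:
  ps -ekl | awk '$2 == "Z"'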


Instructor notes:
Purpose To show the possible states of processes and threads.
Details Describe that as a process is created and ends, it moves through several
states. For any particular process you can see its state in the output of the ps command. If,
during the monitoring of a system you spot a process that may not be behaving properly,
you can check its current state to help determine what it might be doing. Trace-based tools
can then be used to see more details about the process.
Additional information
Transition statement When there are multiple threads that are runnable, they are
queued up in a run queue. The kernel dispatcher will choose a thread from a run queue
based on its position in the priority-ordered queues. Lets discuss the run queues.


Run queues

(Diagram: one global run queue plus a local run queue for each CPU (CPU 0,
CPU 1, ...). Each run queue contains prioritized threads in 256
priority-ordered queues, numbered 0 through 255. Threads are placed on the
global run queue at initial dispatch. "schedo -o fixed_pri_global" and
"export RT_GRQ=ON" keep threads on the global run queue.)

Figure 3-5. Run queues


Notes:
Run queues
When there are multiple threads ready to run but not enough CPUs to go around, the
threads are queued up in run queues. The run queue is divided further into queues
that are priority ordered (one queue per priority number). However, when we discuss
run queues, we shall refer to a run queue as the queue that contains all of the
priority-ordered queues.
Each CPU has its own run queue. Additionally, there is another run queue called the
global run queue.
There are 256 priority levels (for each run queue). Prior to AIX 5L V5.1, AIX had 128
queues.

Copyright IBM Corp. 2010

Unit 3. Monitoring, analyzing, and tuning CPU usage


Course materials may not be reproduced in whole or in part
without the prior written permission of IBM.

3-15

Instructor Guide

Global run queue


The global run queue is searched before a local run queue to see which thread has the
best priority. When a thread is created (assuming it is not bound to a CPU), it is placed
on the global run queue.
A thread can be forced to stay on a global run queue if the environment variable RT_GRQ
is set to ON. This can improve performance for threads that are running
SCHED_OTHER (the default scheduling policy) and are interrupt driven. However, this
could also be detrimental because of the cache misses, so use this feature with caution.
Threads that are running fixed priority will be placed on the global run queue if
schedo -o fixed_pri_global=1 is run.
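For example (a minimal sketch; myapp is a hypothetical program name):

  # Keep one program's threads on the global run queue:
  export RT_GRQ=ON
  ./myapp &

  # Place all fixed-priority threads on the global run queue (root, dynamic):
  schedo -o fixed_pri_global=1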

CPU run queues


There is a run queue structure for each CPU as well as a global run queue. The
per-CPU run queues are called local run queues. When a thread has been running on a
CPU, it will tend to stay on that CPUs run queue. If that CPU is busy, then the thread
can be dispatched to another idle CPU and will be assigned to that CPUs run queue.
This is because idle CPUs look for more work to do and will check the global run queue
and then the other local run queues for a thread with a favored priority.
The dispatcher picks the best priority thread in the run queue when a CPU is available.
When a thread is first created, it is assigned to the global run queue. It stays on that
queue until assigned to a local run queue (when it's dispatched to a CPU). If all CPUs
are busy, the thread stays on the local run queue even if there are worse priority threads
on other CPUs.

Run queue statistics


The average number of threads in the run queue can be seen in the first column of
vmstat output. If you divide this number by the number of CPUs, you will get the
average number of threads runnable on each CPU. If this value is greater than one,
then these threads will have to wait their turn for the CPU. Having runnable threads in
the queue does not necessarily mean that performance delays will be noticed because
timeslicing between threads in the queue is normal. It may be perfectly normal on your
system to see several runnable threads per CPU. The number of runnable threads should be
only one factor in your analysis. If performance goals are being met, having
many runnable threads in the queue may simply mean your system has a lot of threads
but is working through them efficiently.
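For instance, the r column of vmstat reports the average number of runnable
kernel threads (an illustrative sketch; the values shown are made up):

  $ vmstat 2 3
  kthr    memory              page              faults        cpu
  ----- ----------- ------------------------ ------------ -----------
   r  b   avm   fre  re  pi  po  fr   sr  cy  in   sy  cs us sy id wa
   3  0 65324  2371   0   0   0   0    0   0 121  345 101 10  5 80  5
   ...
  # With two logical CPUs, r=3 averages 1.5 runnable threads per CPU.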

CPU scheduling policies


The default CPU scheduling policy is SCHED_OTHER. There are other CPU
scheduling policies that can be set on a per-thread basis. SCHED_OTHER penalizes
high-CPU usage processes and is a non-fixed priority policy. This means that the

priority value changes over time (and quickly) based on the CPU usage penalty and the
nice value.

Ratio of runnable threads to logical CPUs


It is important to note that priority values only affect which thread is dispatched off of a
given dispatching queue. If there are two threads, each running on a different CPU, and
you want one of them to obtain a higher percentage of cycles on the system, then
priority values will have no effect. A single-threaded application cannot use more
than one logical CPU. It is only when there are more runnable threads competing for
cycles than there are logical CPUs to serve them that the priority values have a
significant effect.


Instructor notes:
Purpose Describe what a run queue is. Describe how the global and local run queues
are used.
Details Point out that there is one global run queue and one run queue per CPU. Each
queue is further divided by priority.
These priorities can be seen in the output of monitoring commands and the number of
runnable processes/threads can also be monitored. Well see this in a few pages.
The priorities are used to determine which threads will be run first on any particular
physical CPU.
Mention that SCHED_OTHER is the default scheduling policy for AIX and tuning may
involve setting different scheduling policies such as SCHED_RR or one of the
SCHED_FIFO policies. All of the policies except SCHED_OTHER are fixed priority
scheduling policies.
Additional information A run queue is a logical structure pointing to lists of threads
ready to run (linked lists of thread structures). Each structure has space for 256
(PMASK+1) such list pointers, one for each possible priority.
The structure also contains a bit mask indicating which priorities have ready to run threads
and an item containing the current best priority (rq_best_run_pri). The dispatcher uses
this latter item to select the next thread to run. If there are multiple threads in the list for this
priority, they are scheduled in a round robin fashion as long as they remain runnable. This
round-robin algorithm is modified slightly under certain FIFO scheduling policies.
When a CPU becomes available, the dispatcher checks the global run queue first and then
that CPUs local run queue. By forcing a thread to always be on a global run queue, it
ensures that this thread will be dispatched as soon as possible. If it was on a local run
queue, then if all other CPUs are busy, this thread stays on that CPU even if other CPUs
have worse priority threads. Setting the environment variable RT_GRQ=ON forces a process
and its threads to be on the global run queue. However, this can cause a thread to bounce
around quite a bit from CPU to CPU, so its cache affinity may not be as good. Therefore, it
may negatively affect performance in most cases. However, some applications can benefit
from being on the global run queue.
Transition statement Once a thread is scheduled to run, its priority will determine
when it can run. What are priorities?


Process and thread priorities (1 of 2)

(Diagram: priority values range from 0 at the top (high, best, most favored)
to 255 at the bottom (low, worst, least favored). Values below 40 are
real-time priorities; values of 40 and above are user priorities; 255 is
reserved for the wait kernel thread.)

The priority value of a SCHED_OTHER thread is:
    Initial priority
  + CPU usage penalty
  = Effective priority value

Initial priority:
- Has a base of 40
- Amount over 40 depends upon the nice value

CPU usage penalty:
- Increases with CPU usage
- Some CPU usage forgiven each second (default is by half)

Figure 3-6. Process and thread priorities (1 of 2)


Notes:
What is a priority?
A priority is a number assigned to a thread used to determine the order of scheduling
when multiple threads are runnable. A process priority is the most favored priority of any
one of its threads. The initial process/thread priority is inherited from the parent
process.
The kernel maintains a priority value (sometimes termed the scheduling priority) for
each thread. The priority value is a positive integer and varies inversely with the
importance of the associated thread. That is, a smaller priority value indicates a more
important thread. When the scheduler is looking for a thread to dispatch, it chooses the
dispatchable thread with the smallest priority value.
A thread can be fixed-priority or nonfixed-priority. The priority value of a fixed-priority
thread is constant, while the priority value of a nonfixed-priority thread can change
depending on its CPU usage.


Priority values
Priority numbers range from 0-255 in AIX. Priority 255 is reserved for the wait/idle
kernel thread.
Real-time thread priorities are lower than 40. Real-time applications should run with a
fixed priority and a numerical value less than 40 so that they are more favored than
other applications.


Instructor notes:
Purpose Describe the significance of the priority values.
Details Describe that as the priority numbers go up, the process or thread's priority
worsens and becomes less favored in the run queue.
CPU usage increases the priority value (thus making the process or thread less favored).
The lowest or worst priority is reserved for the idle or wait kproc which runs when nothing
else needs to run.
The process or thread priority is inherited from its parent process.
Additional information The first process created (init) is started with a base user
priority of 40. However, its effective priority is 60 since it also has a nice value of 20. And so
all children of init are started with this default priority.
Transition statement Let's look at how the priority is set for a process and its threads.


Process and thread priorities (2 of 2)


Priorities control which threads get cycles:
- If there are more runnable threads than CPUs
A process or thread can have a fixed or variable priority:
- Fixed priorities can only be set by a root process
- Variable priority is the default scheduling policy
  - Called SCHED_OTHER
  - Penalizes compute-intensive threads to prevent the
    starvation of other threads
  - New threads and woken daemons have a brief advantage
    over running processes
Initial priority can be changed by a user:
- nice command can be used when a process is started
- renice command can be used for a running process
- Default nice value is 20 (foreground), 24 (background)



Figure 3-7. Process and thread priorities (2 of 2)


Notes:
Thread changing its priority
There are two system calls that allow individual processes or threads to
be scheduled with fixed priority. The setpri() system call is process-oriented and
thread_setsched() is thread-oriented. Only a root-owned thread can change the
priority to fixed priority (or to a more favored priority).

Priority changed by a user


A user can use the nice and renice commands to change the priority of a process and
its associated threads. A user can also use a program that calls thread_setsched() or
setpri() system calls to change the priority. Only the root user can change the priority
to a more favored priority.


Other methods to change priority


The kernel scheduler can change the priorities over time through its scheduling
algorithms. The Workload Manager can also change the priorities of processes and
threads in order to fit the requirements of the workload classes.

Nice value
The nice value is a priority adjustment factor used by the system to calculate the current
priority of a running process. The nice value is added to the base user priority of 40 for
non-fixed priority threads and is irrelevant for fixed priority threads. The nice value of a
thread is set when the thread is created and is constant over the life of the thread
unless changed with a system call or the renice command.
You can use the ps command with the -l flag to view a command's nice value. The nice
value appears under the NI heading in the ps command output. If the nice value in ps is
--, the process is running at a fixed priority.
The default nice value is 20, which gives an effective priority of 60. This is because the
nice value is added to the user base priority value of 40.
Some shells (such as ksh) will automatically add a nice value of 4 to the default nice
value if a process is started in the background (using &). For example, if you executed
program & from a ksh, this program will automatically be started with a nice value of 24.
For example, if a program was preceded by a nice command such as the following, it
will be started with a nice value of 34:
nice -n 10 program &
The at command automatically adds a nice value of 2 to the programs it executes.
With the use of multiple processor run queues and their load balancing mechanism,
nice or renice values might not have the expected effect on thread priorities because
less favored priorities might have equal or greater run time than favored priorities.
Threads requiring the expected effects of nice or renice should be placed on the global
run queue.


Instructor notes:
Purpose To explain how to change the priorities of processes.
Details Explain that priorities can be changed by a thread or process itself or by a user
by using the nice or renice commands. Only root can improve the priority (that is,
decrease the priority value).
Point out that programs started in the foreground have a default value of 20 and if
programs started in the background from a ksh have a default nice value of 24.
Additional information The Workload Manager is integrated into the kernel so that
WLM's priority changes are handled by the kernel's scheduling algorithms as well.
A simple way to code a program to change the priority is through setpri(). The setpri()
system call uses the process ID (0 for current process) and the priority value as
parameters.
Transition statement Before looking at how to monitor priorities, let's talk about one
other tool that can change the priority of a thread or process.


nice/renice examples
nice examples:

Command             Action                                Relative Priority
nice -10 foo        Add 10 to current nice value          Lower priority (disfavored)
nice -n 10 foo      Add 10 to current nice value          Lower priority (disfavored)
nice --10 foo       Subtract 10 from current nice value   Higher priority (favored)
nice -n -10 foo     Subtract 10 from current nice value   Higher priority (favored)

renice examples:

Command               Action                                Relative Priority
renice 10 -p 563      Add 10 to default nice value          Lower priority (disfavored)
renice -n 10 -p 563   Add 10 to current nice value          Lower priority (disfavored)
renice -10 -p 563     Subtract 10 from default nice value   Higher priority (favored)
renice -n -10 -p 563  Subtract 10 from current nice value   Higher priority (favored)


Figure 3-8. nice/renice examples


Notes:
nice command
The nice command lets you run a command at a priority lower (or higher) than the
command's normal priority.
The syntax of the nice command is:
nice [ - Increment| -n Increment ] Command [ Argument ... ]
The Command parameter is the name of any executable file on the system. For the
Increment, you can specify a positive or negative number. Positive increment values
reduce priority. Negative increment values increase priority. Only users with root
authority can specify a negative increment. If you do not specify an Increment value,
the nice command defaults to an increment of 10.
The nice value can range from 0 to 39, with 39 being the lowest priority. For example, if
a command normally runs at a priority of 20, specifying an increment of 10 runs the
command at a lower priority, 30, and the command will probably run slower. The nice

command does not return an error message if you attempt to increase a command's
priority without the appropriate authority. Instead, the command's priority is not
changed, and the system starts the command as it normally would. Specifying a nice
value larger than the maximum allowed by nice causes the effective nice value to be
the maximum value allowed by nice.
Examples:

Command             Action                                Relative Priority
nice -10 foo        Add 10 to current nice value          Lower priority (disfavored)
nice -n 10 foo      Add 10 to current nice value          Lower priority (disfavored)
nice --10 foo       Subtract 10 from current nice value   Higher priority (favored)
nice -n -10 foo     Subtract 10 from current nice value   Higher priority (favored)

renice command
The renice command alters the nice value of a specific process, all processes with a
specific user ID, or all processes with a specific group ID.
The syntax of the renice command is:
renice [[-n Increment] | Increment]] [-g|-p|-u] ID...
If you do not have root user authority, you can only reset the priority of processes you
own and can only increase their priority within the range of 0 to 20, with 20 being the
lowest priority. If you have root user authority, you can alter the priority of any process
and set the increment to any value in the range -20 to 20. The specified Increment
changes the priority of a process in the following ways:
 1 to 20     Runs the specified processes with worse priority than the base priority
 0           Sets the priority of the specified processes to the base scheduling priority
-20 to -1    Runs the specified processes with better priority than the base priority

The way the increment value is used depends on whether the -n flag is specified. If -n
is specified, then the increment value is added to the current nice value. If the -n flag is
not specified, then the increment value is added to the default value of 20 to get the
effective nice value.
Nice values are reduced by using negative increment values and increased by using
positive increment values.
Examples:

Command               Action                                         Relative Priority
renice 10 -p 5632     Add 10 to the default nice value (20)          Lower priority (disfavored)
renice -n 10 -p 5632  Add 10 to the current nice value               Lower priority (disfavored)


renice -10 -p 5632    Subtract 10 from the default nice value (20)   Higher priority (favored)
renice -n -10 -p 5632 Subtract 10 from the current nice value        Higher priority (favored)
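A quick way to see these effects (a sketch; sleep stands in for a real
workload, and the PID will differ on your system):

  $ nice -n 10 sleep 600 &      # started with nice value 20 + 10 = 30
  $ ps -l -p $!                 # NI column shows 30
  $ renice -n 5 -p $!           # add 5 to the current nice value (now 35)
  $ ps -l -p $!                 # verify the new NI value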


Instructor notes:
Purpose Review how to work with the nice and renice commands.
Details
Additional information
Transition statement Let us look at how we can monitor the priorities of the processes in
the system.


Viewing process and thread priorities


$ ps -ekl
     F S UID   PID PPID   C PRI NI     ADDR  SZ            WCHAN TTY  TIME CMD
   303 A   0     0    0 120  16 -- 19004190 384                    -  1:08 swapper
200003 A   0     1    0   0  60 20 21298480 720                    -  0:04 init
   303 A   0  8196    0   0 255 -- 1d006190 384                    - 10:08 wait
   303 A   0 12294    0   2  17 --  1008190 448                    -  0:00 sched
   303 A   0 16392    0   0  16 --  500a190 704 f100080009791c70   -  0:17 lrud

$ ps -L 483478 -l
     F S UID    PID   PPID C PRI NI     ADDR  SZ            WCHAN   TTY TIME CMD
200001 A   0 295148 483478 0  68 24 177a5480 176 f100060006358fb0 pts/1 0:00 sleep
200001 A   0 438352 483478 0  68 24 13847480 176 f1000600063589b0 pts/1 0:00 sleep
200001 A   0 442538 483478 0  60 20 2b7db480 740                  pts/1 0:00 ps
240005 A   0 483478 356546 0  60 20 1d840480 836                  pts/1 0:00 ksh

$ ps -kmo THREAD -p 16392
USER     PID PPID    TID ST CP PRI SC            WCHAN    F TT BND COMMAND
root   16392    0      -  A  0  16  4 f100080009791c70  303  -   - lrud
   -       -    -  16393  S  0  16  1 f100080009791c70 1004  -   - -
   -       -    -  45079  S  0  16  1                - 1004  -   - -
   -       -    -  49177  S  0  16  1                - 1004  -   - -
   -       -    -  53275  S  0  16  1                - 1004  -   - -


Figure 3-9. Viewing process and thread priorities


Notes:
Viewing process priorities
To view the process priorities of all processes, simply run the command: ps -el.
To view the process priorities of all processes including kernel processes: ps -elk.
The -L <PIDlist> option generates a list of descendants of each and every PID that
has been passed to it in the Pidlist variable. The list of descendants from all of the
given PID is printed in the order in which they appear in the process table.
The priority is listed under the PRI column. If the value under NI is --, this indicates that
it is a fixed priority process.
The processes in the visual above with a PRI of 16 are the most important: swapper
and lrud. Notice the process with the least important priority: wait.


CPU usage column


Another column in the ps output is important: the C (CPU usage) column. This
represents the CPU utilization of the process or thread, incremented each time the system
clock ticks and the process or thread is found to be running. How this value is used to
calculate a process's ongoing priority is covered in a few pages.

Viewing thread priorities


To view the thread priorities of all threads, simply run the command:
ps -emo THREAD
To view the thread priorities of all threads including kernel threads:
ps -ekmo THREAD
To view the thread priorities of all threads within a specific process:
ps -mo THREAD -p <PID>
The priority is listed under the PRI column.
Process IDs are even numbers and thread IDs are odd numbers.

CPU usage column


The CP or CPU usage column in the visual above is the same as the C column on the
last visual. This represents the CPU utilization of the thread, incremented each time the
system clock ticks and the thread is found to be running. How this value is used to
calculate a thread's ongoing priority is covered in a few pages.


Instructor notes:
Purpose Describe how to look at current priorities of processes.
Details To look at the current priorities of all processes on the system, the ps -elk
command can be used where the k option will also list the kernel processes. The value
under the PRI column is the numerical priority. The value under the NI column is the nice
value which is used in the calculation of the PRI value. If the NI value is --, this indicates
that the priority is fixed and does not change as a function of its CPU usage and/or its nice
value.
In the second example on the visual, point out that the -L option shows the list of
decedents for a particular PID. Notice that there is one process (ksh) with the PID specified
in the command line, and the other 3 have that PID as their parent PID.
Show how the ps command can also be used to view thread priorities.

Additional information Students may ask about the BND column for the lrud
processes. This shows that some kernel threads or processes are assigned to particular
CPUs. You would see the same for any wait processes. There may or may not be the
equivalent number of CPUs on the system. For example, on some LPARs that can have
CPUs be dynamically reconfigured, there can be a wait process for each possible CPU
rather than the actual number of CPUs.
Point out the CP column which shows the current CPU usage. This is described in more
detail on the next visual.
If asked about the other ps columns, point the students to the man page for ps. Some
descriptions are:
F - Flag fields (see /usr/include/sys/proc.h and thread.h)
S - State of the process (example: A is Active, S is sleeping)
ADDR - Segment number of the process stack for normal processes or the address of the
preprocess data area for kernel processes
SZ - Size in 1 KB units of the core image of the process
WCHAN - Event on which process is waiting (address on system)
Other column descriptions are:
SC - Suspended count of the process or kernel thread for a process. The suspend count
is defined as the sum of the kernel threads suspend count.
BND - Logical processor number of the processor to which the kernel thread is bound.
For the process, it shows if all threads are bound to the same processor. A -- indicates
an unbound thread.
TT - If students ask why only ksh in the second example on the visual has a tty
specified, it's because all the rest are daemons.


Transition statement Let's look at how to view the current priorities of threads.


Boosting an important process with nice

(Graph: priority value, from 35 to 90, plotted against clock ticks, from 0 to
700, for two threads: one with nice=20 and one reduced to nice=0. Upward
sloping segments indicate an increasing CPU usage penalty, and thus the
thread getting cycles; the angle of the slope is affected by sched_R. The
amount the penalty drops each second is affected by sched_D.)


Figure 3-10. Boosting an important process with nice


Notes:
The visual shows a graph generated by applying the AIX scheduling algorithm to two
threads with different nice numbers; one had a default foreground nice of 20 while the other
had a preferred nice of zero.
The preferred thread runs without interference from the first thread until the CPU usage
penalty raises the preferred thread's PRI value to equal the other thread's. At that point
they start taking turns executing until the one-second timer expires. Once a second, the
CPU usage statistics are adjusted by the sched_D factor. By default, this reduces the CPU
usage by half. This, in turn, reduces the penalty. At the beginning of the next one-second
interval, the preferred thread once again has an advantage and runs without interference
from the other thread. But this time it takes fewer ticks for the CPU usage penalty to
increase the running thread's PRI value to match the other thread's. Once again they take
turns until the one-second period ends.
How quickly the penalty accumulates is affected by the sched_R tunable.
Tuning the sched_R and sched_D tunables will affect how long a running thread maintains
its initial priority advantage and how much the CPU penalty is forgiven each second.

Instructor notes:
Purpose Illustrate the effect of nice numbers on performance with a graphic example.
Details
Additional information
Transition statement The graphic discusses the effect of the sched_D and sched_R
values on the graph. Let's take a closer look at these scheduling tunables.


Usage penalty and decay rates


Rate at which a thread is penalized is proportional to:
CPU usage (incremented when the thread is running at a clock tick)
CPU-penalty-to-recent-CPU-usage ratio (R/32, Default R value is 16)

Rate at which CPU usage is decayed (once per second):


CPU usage * D/32 (Default D value is 16)

Tuning penalty rate:


schedo -o sched_R=value
Increasing will magnify penalty for dominant threads
Decreasing allows dominant thread to run longer (R=0 : no penalty)

Tuning decay rate:


schedo -o sched_D=value
Increasing will decay less (D=1: no decay at all)
Decreasing will decay more (D=0: zeros out the usage each second)

Remember: This affects all threads (global in impact)


Figure 3-11. Usage penalty and decay rates


Notes:
Overview
As non-fixed priority threads accumulate CPU ticks, their priorities will worsen so that a
system of fairness is enforced. Threads that are new or have not run recently can obtain
the CPU before threads that are dominating the CPU. This system of fairness is
implemented by using the priority, the nice value, the CPU usage of the thread, and
some tunable kernel parameters (sched_R and sched_D). These parameters are
represented as R and D in the visual and in the following text. The CPU usage value
can be seen in the output of ps -ekmo THREAD as seen on the last visual.
As the units of CPU time increase, the priority decreases (the PRI value increases). You
can give additional control over the priority calculation by setting new values for R and
D.


Priority calculation process


The details of the formula are less important than understanding that there is a penalty
for CPU usage, and that penalty has more impact if the nice value is greater than 20. In
fact, the impact of the penalty is proportional to how far the nice value deviates from the
default value of 20.
Here is the actual priority value formula:
Priority = x_nice + (Current CPU ticks * R/32 * (x_nice + 4)/64)
Where:
p_nice = nice value + base priority
If p_nice > 60
then x_nice = (p_nice * 2) - 60
else x_nice = p_nice
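For example, here is a worked calculation using the defaults (a nice value of
20 and R of 16) and an assumed 100 accumulated CPU ticks:

p_nice   = 20 + 40 = 60, so x_nice = 60
Priority = 60 + (100 * 16/32 * (60 + 4)/64)
         = 60 + (100 * 0.5 * 1.0)
         = 110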

CPU penalty
The CPU penalty is calculated by multiplying the CPU usage by the
CPU-penalty-to-recent-CPU-usage ratio. This is represented by R/32. The default value
of R is 16, so by default the CPU penalty will be the CPU usage times a ratio of 1/2.
The first part of the priority value formula (Current CPU ticks * R/32) represents the
penalty part of the calculation.
The CPU usage value of a given thread is incremented by 1 each time that thread is in
control of the CPU when the timer interrupt occurs (every 10 milliseconds). Its initial
value is 0. Priority is calculated on a per-thread basis. To see a thread's CPU usage
penalty, use the ps -emo THREAD command and look at the CP column. The priority is
shown in the PRI column.
The CPU usage value for a process is displayed as the C column in the ps command
output. The maximum value of CPU usage is 120. Note that a process's CPU usage
can exceed 120 since it is the sum of the CPU usage of its threads.

Tuning the CPU-penalty-to-recent-CPU-usage factor


The CPU penalty ratio is expressed as R/32 where R is 16 by default and the values for
R can range from 0 to 32.
This factor can be changed dynamically by a root user through the command
schedo -o sched_R=<value> . Smaller values of R will make the nice value a bigger
factor in the equation (that is, the nice value has a bigger impact on the priority of the
thread). This will make it easier for foreground processes to compete. Larger values of
R will make the nice value have less of an impact on the priority of the thread.

Aging or decaying the CPU usage


As the CPU usage increases for a process or thread, the numerical priority also
increases, thus making its scheduling priority worse. Over time, a thread's priority can
get so bad that on a system with a lot of runnable threads, it may never get to run
unless the threads priority is increased. The mechanism which allows the thread to
eventually become more favored again is known as CPU aging (or the usage decay
factor).
Once per second, a kernel process called swapper wakes up and ages the CPU usage
for all threads in the system. It then recalculates priorities according to the algorithm
described in the visual above.
The recent-CPU-usage-decay factor is expressed as D/32 where D is 16 by default.
The values for D can range from 0 to 32. The formula for recalculation is as follows:
CPU usage = old_CPU_usage * D/32

Tuning the recent-CPU-usage-decay factor


You can have additional control over the priority calculation by setting new values for
both R and D. Decreasing the D value enables foreground processes to avoid
competition with background processes for a longer time. Higher values of D penalize
CPU intensive threads more and can be useful in an environment which has a mix of
interactive user threads and CPU-intensive batch job threads.
The default for D is 16 which decays short-term CPU usage by 1/2 (16/32) every
second.
This factor can be changed dynamically by a root user through the command
schedo -o sched_D=<value> .
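For instance (a sketch; the values are arbitrary, and these tunables are
global in effect, so test changes carefully):

  # Decay usage faster: only 8/32 (1/4) of the usage is kept each second:
  schedo -o sched_D=8

  # Penalize accumulated CPU usage less: penalty ratio 8/32 instead of 16/32:
  schedo -o sched_R=8

  # Add -p to make a change apply now and persist across reboots:
  schedo -p -o sched_D=8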


Instructor notes:
Purpose
Details
It may be helpful to describe p_nice as the initial priority, and x_nice as the adjusted initial
priority.
The non-fixed priority threads will have their priorities changed by the kernel based on the
CPU usage of these threads. The priority number increases as the thread's CPU usage
increases (which means the priority is getting worse).

Since the CPU usage is added to the priority of a thread, its priority can get so bad over
time that it may never be able to run unless the numerical priority is lowered. This is done
by aging the CPU usage. Every second, the swapper kernel thread wakes up and
processes the thread table. The amount that the CPU usage ages/decays is tunable.
The R and D concepts can be confusing. Here are some review points you can use:
sched_D = CPU decaying/aging (The idea is that as a process gets older, the priority
value goes up and it becomes less favored. Eventually it might not be able to run. The
D factor is used to provide some correction every second for long running processes.)
sched_R = CPU usage penalty (As the process uses CPU, priority value goes up and it
becomes less favored. How much less favored it gets is determined by the R value.)
You can tune the R and D values with the schedo command.
A small R value gives the nice value more impact.
A small D value lessens the penalty factor.
A large D value hurts CPU intensive threads more.
Smaller values of D decay CPU usage at a faster rate and can cause CPU-intensive
threads to be scheduled sooner.
Larger values of D decay CPU usage at a slower rate and penalize CPU-intensive
threads more (thus favoring interactive-type threads).
You may find it useful to ask the students the following questions to review the concepts
that have just been covered:
1) What would the effect be if R=0?
Answer: There would be no penalty for accumulating CPU usage. In other words the initial
priority determines the prioritization and there is no effort to give lower priority threads a fair
chance at getting cycles when a better priority thread is dominating.
2) What would the effect be if R=32?

Answer: The penalty for accumulating cycles would be maximized which gives the lower
priority threads a better chance at getting cycles, but at the expense of the higher priority
thread.
3) What would the effect be if D=0?
Answer: The CPU usage will be totally forgiven (set back to 0) every second. (You can use
an analogy of someone being penalized for the crime of accumulating cycles, so we are
forgiving them for their crime.) The result is to diminish the effect of the penalty, hurting the
lower priority threads, but helping the higher priority threads.
4) What would the effect be if D=32?
Answer: The crime of CPU usage is never forgiven; the CPU usage keeps incrementing
until it reaches the maximum value.
Control over how close threads come to these extremes is based on how far they
deviate from the default values of 16.
Although you can tune the R and D values, it is not typical to do so. Under certain
circumstances, however, such as application dead-lock scenarios or a workload with a mix
of interactive and batch jobs, it can be useful.
Additional information
Transition statement Having discussed the AIX scheduling mechanisms, what can you
do if you are having CPU-related performance problems?

Priorities: What to do?

If CPU resources are already constrained, the setting of priorities can help allocate
more CPU resources to the more important processes:
- Decrease the priority value of the most important processes
- Increase the priority value of the least important processes
If the most important processes use a lot of CPU time, you could change the CPU
usage priority decay rate:
- Configure the CPU aging/decay (sched_D) and the CPU usage penalty (sched_R)
  options with schedo
Consider using WLM or WPARs to manage CPU resources

Figure 3-12. Priorities: What to do?

Notes:
Overview
The visual gives some guidelines when tuning the priorities of processes.
In addition to the suggestions above, tuning the workload could be as easy as using the
at, cron, or batch commands to schedule less important jobs for off-shift hours.
Note: It is possible to use the renice command to make threads so unfavored that they
will never be able to run.
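For example, a minimal sketch (the job name and PID are hypothetical):
# nice -n 15 ./batch_report &    (start a less important job with a worse initial priority)
# renice -n -5 -p 12345          (make process 12345 more favored; only root may improve a priority)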

Instructor notes:
Purpose Describe priority tuning guidelines.
Details This visual is a summary of the last few pages. We covered how priorities are
calculated and how they can be manipulated with the nice and renice commands and by
controlling the CPU aging/decay factor (D or sched_D) and the CPU usage penalty (R or
sched_R).
Point out that there is a difference between the AIX environment being CPU constrained
and an application being CPU constrained. There are still applications being run that are
single process and single threaded. If that is the system's critically important application,
then the best that priorities can accomplish is to ensure that application runs on a single
CPU without competition. The AIX environment may have 16 logical CPUs, with most of
them idle, but the critical application can be CPU constrained. There is a limit to what
prioritization can do in this situation. The application may need to be rewritten as a
multi-process or multi-threaded application to be able to scale up and use more of the
available CPUs.
Additional information
Transition statement Let's look at how we can use workload partitions to control
application resource usage.


AIX workload partitions (WPAR): Review

WPARs reduce administration:
- By reducing the number of AIX images to maintain
Each WPAR is isolated:
- Appears as a separate instance of AIX
- Regulated share of system resources
- May have unique network and file systems
- Separate administrative and security domain
WPARs can be relocated:
- Load balancing
- Server maintenance
(Diagram: a single AIX 6 instance hosting several workload partitions, such as Billing,
Application Server, Web Server, Test, and BI.)

Figure 3-13. AIX workload partitions (WPAR): Review

Notes:
Introduction
Workload Partition (WPAR) is a software-based virtualization capability of AIX 6 that
reduces the number of AIX operating system images that need to be maintained when
consolidating multiple workloads on a single server. WPARs provide a way for clients to
run multiple applications inside the same instance of an AIX operating system while
providing security and administrative isolation between applications. WPARs
complement logical partitions and can be used in conjunction with logical partitions if
desired. WPARs can improve administrative efficiency by reducing the number of AIX
operating system instances that must be maintained, can increase the overall utilization
of systems by consolidating multiple workloads on a single system, and are designed to
improve cost of ownership.
WPARs allow users to create multiple software-based partitions on top of a single AIX
instance. This approach enables high levels of flexibility and capacity utilization for
applications executing heterogeneous workloads, and simplifies patching and other
operating system maintenance tasks.

WPARs provide unique partitioning values:
- Smaller number of OS images to maintain
- Performance-efficient partitioning through sharing of application text and kernel data
  and text
- Fine-grain partition resource controls
- Simple, lightweight, centralized partition administration
WPARs enable multiple instances of the same application to be deployed across
partitions:
- Many WPARs running DB2, WebSphere, or Apache in the same AIX image
- A different capability from other partitioning technologies
- Greatly increases the ability to consolidate workloads, because often the same
  application is used to provide different business services
- Enables the consolidation of separate discrete workloads that require separate
  instances of databases or applications into a single system or LPAR
- Reduces costs through optimized placement of workloads between systems to yield
  the best performance and resource utilization
WPAR technology enables the consolidation of diverse workloads on a single server,
increasing server utilization rates:
- Hundreds of WPARs can be created, far exceeding the capability of other partitioning
  technologies
- WPARs support fast provisioning and fast resource adjustments in response to both
  normal and unexpected demands; WPARs can be created and resource controls
  modified in seconds
- WPAR resource controls enable the over-provisioning of resources; if a WPAR is below
  allocated levels, the unused allocation is automatically available to other WPARs
- WPARs support the live migration of a partition in response to normal or unexpected
  demands
- All of the above capabilities enable more consolidation on a single server or LPAR
WPARs enable development, test, and production cycles of one workload to be placed
on a single system:
- Different levels of applications (production1, production2, test1, test2) may be deployed
  in separate WPARs
- Quick and easy roll out and roll back to production environments
- Reduced costs through the sharing of hardware resources
- Reduced costs through the sharing of software resources such as the operating system,
  databases, and tools

A WPAR supports the control and the management of its resources: CPU, memory, and
processes. That means you can assign specific fractions of CPU and memory to each
WPAR, and this is done by WLM running on the partition.
Most resource controls are similar to those supported by the Workload Manager. You can
specify shares_CPU, which is the number of processor shares available for a workload
partition, or you can specify minimum and maximum percentages. The same is true for
memory utilization. There are also WPAR limits for run-away situations (for example: total
processes).
When you create a WPAR, a WLM class is created (having the same name as the WPAR).
All processes running in the partition inherit this classification. You can see the statistics
and classes using the wlmstat command, which has been enhanced to display WPAR
statistics; wlmstat -@ 2 shows the WPAR classes. Also, you cannot use WLM inside the
WPAR to manage its resources.


Instructor notes:
Purpose Review what a WPAR provides.
Details
Additional information
Transition statement Next we need to be clear about the difference between a system
WPAR and an application WPAR.


System WPAR and application WPAR

System WPAR:
- Autonomous virtual system environment
  - Shared file systems (with the global environment): /usr and /opt
  - Private file systems for the WPAR's own use: /, /var, and /tmp
  - Unique set of users, groups, and network addresses
- Can be accessed via:
  - Network protocols (for example: telnet or ssh)
  - Login from the global environment using the clogin command
- Can be stopped and restarted

Application WPAR:
- Isolates an individual application
- Lightweight; quick to create and remove
  - Created with the wparexec command
  - Stopped when the application finishes
  - Removed when stopped
- Shares file systems and devices with the global environment
- No user login capabilities

Figure 3-14. System WPAR and application WPAR

Notes:
System workload partition
System workload partitions are autonomous virtual system environments with their own
private root file systems, users and groups, login, network space, and administrative
domain.
A system WPAR represents a partition within the operating system isolating runtime
resources such as memory, CPU, user information, or file system to specific application
processes. Each system WPAR has its own unique set of users, groups and network
addresses. The system administrator accesses the WPAR via the administrator console
or via regular network tools such as telnet or ssh. Inter-process communication for a
process in a WPAR is restricted to those processes in the same WPAR.
System workload partitions provide a complete virtualized OS environment, where multiple
services and applications run. It takes longer to create a system WPAR compared to an
application WPAR as it builds its file systems. The system WPAR is removed only when
requested. It has its own root user, users, and groups, and own system services like inetd,
cron, syslog, and so forth.

A system WPAR does not share writable file systems with other workload partitions or the
global environment. It is integrated with the role-based access control (RBAC).
Application workload partition
- Normal WPAR, except that there is no file system isolation
- Login not supported
- Internal mounts not supported
- Target: a lightweight process group for mobility
Application workload partitions do not provide the highly virtualized system environment
offered by system workload partitions, rather they provide an environment for segregation
of applications and their resources to enable checkpoint, restart and relocation at the
application level.
The application WPAR represents a shell or an envelope around a specific application
process or processes that leverage shared system resources. It is lightweight (that is,
quick to create and remove, and does not take a lot of resources) since it uses the global
environment's file system and device resources. Once the application process or
processes finish, the WPAR is stopped. The user cannot log in to the application
WPAR using telnet or ssh from the global environment; if you need to access the
application in some way, this must be achieved by some application-provided mechanism.
All file systems are shared with the global environment. If an application is using devices, it
uses global environment devices.
The wparexec command builds and starts an application workload partition, or creates a
specification file to simplify the creation of future application workload partitions.
An application workload partition is an isolated execution environment that might have its
own network configuration and resource control profile. Although the partition shares the
global environment file system space, the processes running therein are only visible to
other processes in the same partition. This isolated environment allows process
monitoring, gathering of resource, accounting, and auditing data for a predetermined
cluster of applications.
The wparexec command invokes and monitors a single application within this isolated
environment. The wparexec command returns synchronously with the return code of this
tracked process only when all of the processes in the workload partition terminate. For
example, if the tracked process creates a daemon and exits with a 0 return code, the
wparexec command blocks until the daemon and all of its children terminate, and then exits
with a 0 return code, regardless of the return code of the daemon or its children.
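A minimal sketch of running a workload in an application WPAR (the WPAR name and the
workload are hypothetical):
# wparexec -n tempwpar /usr/bin/sleep 300
The command blocks while sleep runs; when it exits, the application WPAR is stopped and
removed automatically.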


Instructor notes:
Purpose Review the difference between a system WPAR and an application WPAR.
Details
Additional information
Transition statement Let us take a look at how we can control resource usage for
WPARs. We will start with setting target shares.


Target shares
- Shares are a relative amount of resource entitlement
- The target percentage is calculated based on active shares only

WPAR share assignments: W1 = 10 shares, W2 = 20 shares, W3 = 30 shares

All WPARs active:   W1 = 16.6%   W2 = 33.3%   W3 = 50%
WPAR W3 inactive:   W1 = 33.3%   W2 = 66.6%

Figure 3-15. Target shares

Notes:
Shares determine the target (or desired) amount of resource allocation that the WPARs are
entitled to (calculated as a percentage of total system resource). The shares represent how
much of a particular resource a WPAR should get, relative to the other active WPARs.
Shares are not coded as the absolute percentages of the total system resources, but each
of the share values indicates the relative proportion of the resource usage.
For example, in the upper graphic of this visual (all WPARs active), the total of entitled
shares is 60. The intended target for the W1 WPAR is then 1/6 (or 10/60), for the W2
WPAR it is 1/3 (or 20/60), and so forth.
If a WPAR is the only active WPAR, its target is 100% of the amount of resource available
to the LPAR.
A WPAR's target percentage for a particular resource is simply its number of shares
divided by the total number of active shares.
If limits are also being used, the target is limited to the configured range [minimum, soft
maximum]. If the calculated target is outside this range, it is set to the appropriate upper or
lower bound (see Resource Limits). The number of active shares is the total number of
shares of all WPARs that have at least one active process in them. Since the number of
active shares is dynamic, so is the target.
Each share value can be between 1 and 65535.
By default, a WPAR's resource shares are not defined, which effectively means that the
WPAR does not have any WLM-based target percentage and (unless limits are defined) is
effectively unregulated by WLM.
Shares are automatically self-adjusted percentages. For example, in the lower graphic of
this visual, although the total of entitled shares is 60, the actual sum of the active WPARs'
shares is 30, since the W3 WPAR with a share value of 30 is inactive. The adjusted
proportions for the classes are then 1/3 (or 10/30) for W1 and 2/3 (or 20/30) for W2.
Within a WPAR, individual processes compete for the resources of the WPAR using
traditional AIX mechanisms. For example, with the CPU resource, the process priority
value determines which process thread is dispatched to a CPU, and the process nice
number affects the initial priority of the process and its threads. These same mechanisms
are used for resource contention between WPARs if no WLM-based resource usage
control has been implemented in the WPAR definitions.


Instructor notes:
Purpose Define shares and how they affect the target resource percentage.
Details
Additional information
Transition statement Resource allocation can be controlled by shares, but it can also
be controlled by limits.


Limits
- Maximum limits restrict resources:
  - Soft maximum
  - Hard maximum
- Minimum limits guarantee resources
- Limits take precedence over shares
(Diagram: CPU and MEM bars, each divided into a minimum limit, a normal range, a soft
maximum limit, and a hard maximum limit.)

Figure 3-16. Limits

Notes:
The resource allocation can also be controlled by limits. The WPAR resource limits define
the maximum and the minimum amount of resource that can be allocated to a WPAR as a
percentage of the total system resources.
Resource limits allow the administrator to have more control over resource allocation.
These limits are specified as percentages and are relative to the amount of resource
available to LPAR.
There are three types of limits for percentage-based regulation:
Minimum
This is the minimum amount of a resource that should be made available to the WPAR.
If the actual WPAR consumption is below this value, the WPAR is given highest priority
access to the resource. The possible values are 0 to 100, with 0 being the default (if
unspecified).

3-52 AIX Performance Management


Course materials may not be reproduced in whole or in part
without the prior written permission of IBM.

Copyright IBM Corp. 2010

V5.4
Instructor Guide

Uempty

Soft maximum
This is the maximum amount of a resource that a WPAR can consume when there is
contention for that resource. If the WPAR consumption exceeds this value, the WPAR is
given the lowest priority. If there is no contention for the resource (from other WPARs),
the WPAR is allowed to consume as much as it wants. The possible values are 1 to
100, with 100 being the default (if unspecified).
Hard maximum
This is the maximum amount of a resource that a WPAR can consume, even when
there is no contention. If the WPAR reaches this limit, it is not allowed to consume any
more of the resource until its consumption percentage falls below the limit. The possible
values are 1 to 100, with 100 being the default (if unspecified).
Class resource limits follow some basic rules. The following are the only constraints that
WLM places on resource limit values:
- Resource limits take precedence over WPAR share values.
- The minimum limit must be less than or equal to the soft maximum limit.
- The soft maximum limit must be less than or equal to the hard maximum limit.
- The sum of the minimum limits of all WPARs cannot exceed 100.
When a WPAR with a hard memory limit has reached this limit and requests more pages,
the VMM page replacement algorithm (LRU) is initiated and steals pages from the limited
WPAR, thereby lowering its number of pages below the hard maximum, before handing out
new pages. This behavior is correct, but extra paging activity, which can take place even
where there are plenty of free pages available, impacts the general performance of the
system. Minimum memory limits for other WPARs are recommended before imposing a
hard memory maximum for any WPAR.
Since WPARs under their minimum have the highest priority, the sum of the minimums
should be kept to a reasonable level, based on the resource requirements of the other
WPARs.
For physical memory, setting a minimum memory limit provides some protection for the
memory pages of the WPAR's processes. A WPAR should not have pages stolen when it is
below its minimum limit unless all the active WPARs are below their minimum limit and one
of them requests more pages. Setting a memory minimum limit for a WPAR with primarily
interactive jobs helps make sure that their pages will not all have been stolen between
consecutive interactions (even when memory is tight) and improves response time.
Attention: Using hard maximum limits can have a significant impact on system or
application performance if not used appropriately. Since imposing hard limits can result in
unused system resources, in most cases, soft maximum limits are more appropriate.


Instructor notes:
Purpose Explain how resource limits control resource use.
Details This visual breaks down the function of limits into all of its parts and definitions.
Additional information
Transition statement Let us look at how we can set these resource controls.


WPAR resource management

Define at WPAR creation, or later change WPAR attributes:
  wparexec -R attribute=value ...
  mkwpar -R attribute=value ...
  chwpar -R attribute=value ...
Common attribute keywords:
  active={ yes | no }
  shares_CPU=<number of shares>
  shares_memory=<number of shares>
  CPU=m%-SM%;HM% (default: 0%-100%;100%)
  memory=m%-SM%;HM% (default: 0%-100%;100%)

Figure 3-17. WPAR resource management

Notes:
Resource controls can be established at WPAR creation or they can be modified later.
These attributes can be set either through the command line (as shown) or through SMIT or
using the WPAR Manager GUI. Here are some common command line attribute keywords:
- Active: Even if you have set resource controls, you can enable or disable their
enforcement using this attribute
- shares_CPU: This is the number of shares for calculating the target percentage for
CPU usage.
- shares_memory: This is the number of shares used to calculate the target
percentage for memory usage.
- CPU: This value has three fields (note use of semicolon to delimit last field):
The first is the minimum percentage (default is 0%)
The second is the soft maximum (default is 100%)
The third is the hard maximum (default is 100%)
- Memory: This value has three fields with the same format and defaults as CPU
limits.
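A minimal sketch of applying these attributes to an existing WPAR (the WPAR name and
values are hypothetical; the value is quoted so the shell does not interpret the semicolon
in the format shown on the visual):
# chwpar -R shares_CPU=20 'CPU=5%-50%;80%' mywpar
# lswpar -R mywpar     (verify the resource control settings)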

WLM overview
WPARs use the AIX Workload Manager (WLM) mechanisms to control resource usage
without having to understand the complexities of defining and configuring WLM classes.
Each WPAR is treated as a special WLM class.
WLM gives system administrators more control over how the scheduler allocates
resources to processes. Using WLM, you can prevent different classes of jobs from
interfering with each other and you can allocate resources based on the requirements
of different groups of users.
Typically, WLM is used on a system where the CPU resources are constrained or at
least occasionally constrained. If there is an overabundance of CPU resources for all of
the systems workload, then prioritization of workload is not important unless there are
other factors such as user support agreements that have specific resource
requirements.
With WLM, you can create different classes of service for jobs, as well as specify
attributes for those classes. These attributes specify minimum and maximum amounts
of CPU, physical memory, and disk I/O throughput to be allocated to a class. WLM then
assigns jobs automatically to classes using class assignment rules provided by a
system administrator. These assignment rules are based on the values of a set of
attributes for a process. The system administrator or a privileged user can also
manually assign jobs to classes, overriding the automatic assignment.

Classes
A WLM class is a collection of processes and their associated threads. A class has a
single set of resource-limitation values and target shares.

CPU resource control


WLM allows management of resources in two ways: as a percentage of available
resources or as total resource usage. Threads of type SCHED_OTHER in a class can
be controlled on a percentage basis. Fixed-priority threads are non-adjustable.
Therefore, they cannot be altered, and they can exceed the processor usage target.
If processor time is the only resource that you are interested in regulating, you can
choose to run WLM in active mode for processor and passive mode for all other
resources. This mode is called cpu only mode.


Instructor notes:
Purpose Explain how to define and change resource controls for a WPAR.
Details
Additional information The rset attribute is intentionally not covered.
rset: a resource set is a definition of specific hardware resources such as a particular
processor or memory card. Specifying an rset constrains the WPAR to only use these
resources and no others. This is not recommended.
If familiar with WLM, it is noteworthy that WPARs cannot be classified into WLM tiers and
that disk I/O management is not supported (since WPARs only have access to storage
through file system mounts).
Transition statement Since WPARs are a type of WLM class, we can use the wlmstat
command to display their relative resource consumption. Let's see how we use the wlmstat
command.


wlmstat command syntax

Syntax (adjusted for WPAR-relevant options):
  wlmstat -@ [-c | -m | -b] [-B device] [-T] [-w] [interval] [count]

Output from wlmstat:
  # wlmstat -@ 3 2
  CLASS     CPU   MEM  DKIO
  wpar11   0.02  9.37  0.00
  TOTAL    0.02  9.37  0.00

  CLASS     CPU   MEM  DKIO
  wpar11   0.01  9.37  0.00
  TOTAL    0.01  9.37  0.00

Figure 3-18. wlmstat command syntax

Notes:
The syntax options for the wlmstat command are:
-c
- Shows only CPU statistics.
-m
- Shows only physical memory statistics.
-b
- Shows only disk I/O statistics.
-B device
- Displays statistics for the given disk I/O device. Statistics for all the disks accessed
by the class are displayed by passing an empty string (-B "").
-T

3-58 AIX Performance Management


Course materials may not be reproduced in whole or in part
without the prior written permission of IBM.

Copyright IBM Corp. 2010

V5.4
Instructor Guide

Uempty

- Returns the total numbers for resource utilization since each class was created (or
WLM started). The units are:
- Number of CPU ticks per CPU (seconds) used by each class
- Number of memory pages multiplied by the number of seconds used by each class
- Number of 512-byte blocks sent/received by a class for all the disk devices accessed
-a
- Delivers absolute figures (relative to the total amount of the resource available to the
whole system) for subclasses, with a 0.01 percent resolution. By default, the figures
shown for subclasses are a percentage of the amount of the resource used by the
superclass, with a 1 percent resolution. For instance, if a superclass has a CPU
target of 7 percent and the CPU percentage shown by wlmstat without -a for a
subclass is 5 percent, wlmstat with -a will show the CPU percentage for the subclass
as 0.35 percent.
-w
- Displays the memory high-water mark, that is the maximum number of pages that a
class had in memory since the class was created (or WLM started).
-v
- Shows most of the attributes concerning the class. The output includes internal
parameter values purposed for AIX support persons. Here is a list of some attributes
which may be interesting for users:
- CLASS - Class name.
- tr - tier number from 0...9.
- i - Value of the inheritance attribute: 0 = no, 1 = yes.
- #pr - Number of processes in the class. If a class has no processes assigned
to it, the value in the other columns may not be significant.
- CPU - CPU utilization of the class in percent.
- MEM - Physical memory utilization of the class in percent.
- DKIO - Disk I/O bandwidth utilization for the class in percent.
- sha - Number of shares. If no shares are defined (displayed as -), then sha = -1.
- min - Resource minimum limit in percent.
- smx - Resource soft maximum limit in percent.
- hmx - Resource hard maximum limit in percent.
- des - Desired percentage target calculated by WLM from the share numbers, in
percent.
- npg - number of memory pages owned by the class.


- The other columns are for internal use only and bear no meaning for the
administrator or end users. This format is better suited for use with a resource
selection (-c, -m, or -b); otherwise, the lines might be too long to fit on a
display terminal.
interval
- Specifies an interval in seconds (defaults to 1).
count
- Specifies how many times wlmstat prints a report (defaults to 1).
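For example, to display only the CPU statistics of the WPAR classes every 2 seconds for
10 reports, combining the flags described above:
# wlmstat -@ -c 2 10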


Instructor notes:
Purpose To explain how wlmstat is used in a WPAR environment.
Details Note that we are not showing the full wlmstat command syntax, with all of its
options. Instead, the visual is only showing those flags which are relevant to how WLM is
implemented with WPARs.
Additional information
Transition statement We next want to examine the CPU usage breakdown seen in
many performance statistics reports. But first we need to understand the difference
between mode switching and context switching, and how these relate to performance.


Context switches
- A context switch is when one thread is taken off a CPU and another thread is
  dispatched onto the same CPU.
- Context switches are normal for multi-processing systems:
  - What is abnormal? Check against the baseline
  - A high context switch rate is often an indication of lock contention
- Use vmstat, sar, or topas to see context switches

Example:
  # vmstat 1 5
  System configuration: lcpu=2 mem=1024MB ent=0.35
  kthr    memory              page                     faults             cpu
  ----- ------------- ------------------------ ----------------- ---------------------
   r  b    avm   fre re pi po   fr    sr cy   in    sy    cs us sy id wa   pc    ec
   2  2 198332  8637  0  0  0 6064  6994  0 2708 43494 12767 10 78  3  9 0.34  97.5
   0  2 198337  8458  0  0  0 8159  8563  0 2800 24281 13703 10 80  2  8 0.37 106.8
   0  2 198337  8057  0  0  0 6757  5112  2 1217 12276  6283 10 69  3 17 0.29  83.6
   0  2 198337  8101  0  0  0 7869  7891  0  816 14836  4747  7 49  5 39 0.21  58.9
   0  2 198337  8097  0  0  0 6298 10914  1  617  8112  2654  6 42 23 29 0.18  50.1
   0  2 198337  8059  0  0  0 7104  8946  0  886  6440  3952  9 47 18 26 0.21  59.3

Figure 3-19. Context switches

Notes:
Overview
A context switch (also known as process switch or thread switch) is when a thread is
dispatched to a CPU and the previous thread on that CPU was a different thread from
the one currently being dispatched. Context switches occur for various reasons. The
most common reason is where a thread has used up its timeslice or has gone to sleep
waiting on a resource (such as waiting on an I/O to complete or waiting on a lock) and
another thread takes its place.
The context switch statistics are available through multiple tools including: sar, nmon,
topas, and vmstat. For sar, the -w flag provides the context switch statistics.

What to look for


High context switch rates may be an indication of a resource contention issue such as
application or kernel lock contention.
3-62 AIX Performance Management
Course materials may not be reproduced in whole or in part
without the prior written permission of IBM.

Copyright IBM Corp. 2010

V5.4
Instructor Guide

Uempty

The rate is given in switches per second. It's not uncommon to see the context switch
rate be approximately the same as the device interrupt rate (the in column in vmstat).
The scheduler performs a context switch when:
- A thread has to wait for a resource (voluntarily)
- A higher priority thread wakes up (involuntarily)
- The thread has used up its timeslice (10 ms by default)

vmstat and the initial interval report line


In AIX 5L V5.3 and later, vmstat displays a system configuration line, which appears as
the first line displayed after the command is invoked. Following the configuration
information, vmstat command only reports current intervals. As a result, the first line of
output is not written until the end of the first interval and is meaningful.
Prior to AIX 5L V5.3, when running vmstat in interval mode, the first interval of the
report provided statistics accumulated since the boot of the operating system. As such,
it did not represent the current problem situation since it was diluted by a long prior
period of normal operation. As a result the administrator running the script would ignore
this non-meaningful first interval data and many scripts also would filter out the first
period reported by vmstat.


Instructor notes:
Purpose Describe context switches and how to monitor.
Details The context switch rate is given in the output of vmstat under the cs column, in
the output of sar under the cswch/s column or in the output of topas next to the Cswitch
column. The rates are always given as context switches per second. The value could be
good or bad. You do not know until you compare it against the baseline context switch
value, and if higher, it is simply another symptom to help you determine the root cause for a
performance issue. To fully understand the cause of the context switches a kernel trace
analysis is needed.
A context switch occurs when a thread is dispatched to a CPU and the previous thread on
that CPU was a different thread ID. High context switch rates may be an indication of a
locking issue. Kernel lock contention may also increase CPU system time.
Additional information
Transition statement Some people may think of a context switch as when a thread
switches from user mode to kernel mode. However, we distinguish this latter case by
calling it a mode switch. Let's discuss modes and mode switches.


User mode versus system mode

User mode:
- User mode is when a thread is executing its own application code or shared library
  code
- Time spent in user mode is reflected as %user time in the output of commands such
  as vmstat, topas, iostat, and sar
System mode:
- System mode is when the CPU is executing code in the kernel
- CPU time spent in kernel mode is reflected as system time in the output of commands
  such as vmstat, topas, iostat, and sar
- Context switch time, system calls, device interrupts, NFS I/O, and anything else in the
  kernel is counted as system time

Figure 3-20. User mode versus system mode

Notes:
Modes overview
User time is simply the percentage of time the CPUs are spending executing code in
the applications or shared libraries. System time is the percentage of time the CPUs
execute kernel code. System time can be because the applications are executing
system calls which enter the applications into the kernel, or can be because there are
kernel threads running that only execute in kernel mode, or can be because interrupt
handler code is currently being run. When using monitoring tools, add up the user and
the system CPU utilization percentage to see the total CPU utilization.
The use of a system call by a user mode process allows a kernel function to be called
from user mode. This is considered a mode switch. Mode switching is when a thread
switches from user mode to kernel or system mode. Switching from user to system
mode and back again is normal for applications. System mode does not just represent
operating system housekeeping functions.

Copyright IBM Corp. 2010

Unit 3. Monitoring, analyzing, and tuning CPU usage


Course materials may not be reproduced in whole or in part
without the prior written permission of IBM.

3-65

Instructor Guide

Mode switches should be differentiated from the context switches seen in the output of
vmstat (the cs column) and sar (the cswch/s column).


Instructor notes:
Purpose Explain the difference between user mode and system mode so that we can
understand what user time and system time is in the output of performance tools.
Details Describe system mode and user mode. Emphasize that system time is not just
operating system administrative operations, it could also be applications in kernel mode.
The reason we bring this subject up is that commands often report CPU usage time in
terms of user and system time. Add these two numbers up to see the total CPU usage
percentage.
Additional information
Transition statement Let's look at some timing commands.


Timing commands
Time commands show:
- Elapsed time
- CPU time spent in user mode
- CPU time spent in system mode

  # /usr/bin/time <command> <command arguments>
  real   9.30
  user   3.10
  sys    1.20

  # /usr/bin/timex <command> <arguments>
  real  26.08
  user  26.02
  sys    0.06

  # time <command> <arguments>
  real   0m10.07s
  user   0m3.00s
  sys    0m2.07s

Figure 3-21. Timing commands

Notes:
Timing commands
Use the timing commands to understand the performance characteristics of a single
program and its synchronous children. The output from /usr/bin/time and timex is
in seconds. The output of the Korn shell's built-in time command is in minutes and
seconds. The C shell's built-in time command is in yet another format.
The output of /usr/bin/timex with no parameters is identical to that of
/usr/bin/time. However, with additional parameters /usr/bin/timex is capable of
displaying process accounting data for the command and its children. The -p and -s
options on timex allow data from accounting (-p) and sar (-s) to be accessed and
reported. A -o option reports on blocks read or written.
The timex command is available through SMIT on the Analysis Tools menu, found
under Performance and Resource Scheduling.

3-68 AIX Performance Management


Course materials may not be reproduced in whole or in part
without the prior written permission of IBM.

Copyright IBM Corp. 2010

V5.4
Instructor Guide

Uempty

If you do not invoke time with the full path, then you could be executing your shell's
built-in time command, so its output could be in a different format than that of
/usr/bin/time. Since the ksh and csh built-in time commands have less overhead
(saving the fork/exec of a separate time command), the built-in time commands are
preferred.

Interpreting the output


Comparing the user+sys CPU time to the real time may give you an idea if the
application is CPU bound or I/O bound. The difference between the real and the sum of
user+sys is how much time the application spent sleeping (either waiting on I/O, for
locks, or for some other resource like the CPU). The sum of user+sys may exceed the
real time if a process is multi-threaded. The reason is because the real time is the time
from start to finish of the process, but the user+sys is the sum of the CPU time of each
of its threads.
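As a worked example using the /usr/bin/time numbers from the visual: user + sys is
3.10 + 1.20 = 4.30 seconds of CPU time against 9.30 seconds of real time, so the command
spent roughly 9.30 - 4.30 = 5.00 seconds blocked (on I/O, locks, or waiting for a CPU) and
is unlikely to be purely CPU bound.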


Instructor notes:
Purpose Describe the use of the timing commands.
Details The sum of user + sys is total CPU cost of executing the program.
The difference between the real time and the total CPU time, that is, real - (user + sys), is
the sum of all the factors that can delay the program, plus the program's own unattributed
costs. These factors may include:
- I/O required to bring in the programs text and data
- I/O required to acquire real memory for the programs use
- CPU time consumed by other programs
- CPU time consumed by the operating system
Emphasize that the csh and ksh built-in time commands have less overhead, so it's better
to use those than the /usr/bin/time and /usr/bin/timex commands.
The timex -s command uses sar to acquire additional statistics. Since sar is intrusive,
timex -s is too. The data reported by timex -s may not precisely reflect the behavior of a
program in an unmonitored system, especially for brief runs.
Additional information
Transition statement Let's look at a commonly used monitoring tool in AIX called
vmstat.


Monitoring CPU usage with vmstat

  # vmstat 5 3    (dedicated processor LPAR)
  System configuration: lcpu=2 mem=512MB
  kthr     memory              page                  faults         cpu
  ----- ------------- ---------------------- --------------- -----------
   r  b    avm    fre re pi po fr sr cy   in    sy   cs us sy id wa
  19  2 127005 758755  0  0  0  0  0  0 1692 10464 1070 48 52  0  0
  19  2 127096 758662  0  0  0  0  0  0 1397 71452 1059 28 72  0  0
  19  2 127100 758656  0  0  0  0  0  0 1361 72624 1001 28 72  0  0

The r column shows the total number of runnable threads:
- A high number could simply mean your system is efficiently running lots of threads;
  compare it to the lcpu count
- If the high number is abnormal, look at what processes are running and whether total
  CPU utilization is higher than normal
If us + sy approaches 100%, then there may be a system CPU bottleneck:
- Compare interrupt, system call, and context switch rates to the baseline
- Identify the code that is dominating CPU usage

Figure 3-22. Monitoring CPU usage with vmstat

Notes:
Overview
Using vmstat with intervals during the execution of a workload will provide information
on paging space activity, real memory use, and CPU utilization. vmstat data can be
retrieved from the PerfPMR monitor.int file.
A vmstat -t flag will cause the report to show timestamps. An example is:
  # vmstat -t 5 3
  System configuration: lcpu=2 mem=512MB
  kthr    memory             page                  faults         cpu      time
  ----- ----------- --------------------- ----------- ---------- --------
   r  b   avm    fre re pi po fr sr cy  in   sy  cs us sy id wa hr mi se
   1  1 62247 845162  0  0  0  0  0  0 327 9511 401  0  1 98  0 22:31:35
   8  0 62254 845155  0  0  0  0  0  0 329  811 633 99  0  0  0 22:31:40
   8  0 62353 845056  0  0  0  0  0  0 331 1387 637 99  0  0  0 22:31:45

CPU related information


Pertinent vmstat column headings and their descriptions for CPU usage are:
r - Average number of kernel threads runnable during the interval
b - Average number of kernel threads placed in the wait queue (waiting for I/O)
in - Device interrupts per second
sy - System calls per second
cs - Context switches per second
us - % of CPU time spent in user mode
sy - % of CPU time spent in system mode
id - % of time CPUs were idle
wa - % of time CPUs were idle and there was at least one I/O in progress

What to look for


If the user time (us) is abnormally high, then application profiling may need to be done.
If the system time (sy) is abnormally high, then kernel profiling (trace and/or tprof)
may need to be done.
If idle (id) or wait time (wa) is high, then you must determine if that is to be expected or
not.
Use vmstat -I to also see file in and file out rates which can show how quickly free
pages are being used and give some idea of the demand for free pages.
Since the r column stands for the number of runnable threads, in order to determine
what a high value is, you must know the number of CPUs. An r value of 2 on a 12-way
system means that 10 of the CPUs are probably idle. If r divided by the number of CPUs
is very large, it means threads are waiting for CPUs. This is not necessarily bad if the
performance goals are being met and the system is running the threads quickly. It is
important not to get too concerned about the actual number of runnable threads; it's a
statistic that, by itself, does not necessarily point to a bottleneck. Remember that the
count of runnable threads includes both the currently running threads and the threads
waiting in the dispatch queue.
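A small sketch for watching total CPU utilization (us + sy) over time with standard tools;
the field positions assume the default vmstat layout shown above, and the loop runs until
interrupted with Ctrl-C:
# vmstat 5 | awk '/^ *[0-9]/ { print "CPU busy: " ($14 + $15) "%" }'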


Instructor notes:
Purpose Show the vmstat command and explain how to monitor CPU usage.
Details Point out the r, b, in, sy, cs, us, sy, id, and wa columns and explain
each one.
Explain that you need to run vmstat specifying intervals and examine the values at each
interval.

The -l option shows large page accesses, -t shows timestamps, and -I shows the p, fi
and fo columns.
Additional information The following will be discussed in the Memory unit of this
course. But, if students ask about other columns in the vmstat output, here is some
information:
The fre column may be used to monitor the amount of free page frames available. If the
fre column is at the low threshold (minfree) and the pi/po rates are non-zero most of the
time, then it is quite likely your memory is over-committed. A high page scan (sr) to page
steal (fr) ratio also indicates a more active memory subsystem. More in-depth analysis will
tell us why. The re column will always show 0 since true reclaims are not supported.
The vmstat -l, -t, and -I (upper case i) options are only available in AIX 5L and later.
There are new commands for obtaining statistics specific to a logical partition. These give
statistics for POWER Hypervisor activity or for tracking real CPU utilization in a
simultaneous multi-threading or shared processor (Micro-Partition) environment. A new
register was added called the Processor Utilization Resource Register (PURR) to track
logical and virtual processor activity. Commands such as sar and topas will automatically
use the new PURR statistics when in a simultaneous multi-threading or shared processor
(Micro-Partition) environment and you will see new columns reporting partition statistics in
those environments. Trace-based commands now have new hooks for viewing PURR data.
Some commands such as lparstat, mpstat, and smtctl are new for AIX 5L V5.3 and
work in a partitioned environment.

Transition statement Let's look at another command to monitor CPU utilization (sar).
The sar command can be very useful on SMP systems because of its per-processor
statistic reporting capability.


sar command
Reports system activity information from selected cumulative activity counters.

  # sar -P ALL 5 1    (4 processors, SMT off)
  System configuration: lcpu=4

  15:01:19 cpu  %usr  %sys  %wio  %idle
  15:01:24   0     0     2     0     98
             1     0     5     0     95
             2   100     0     0      0
             3   100     0     0      0
             -    50     2     0     48

  # sar -q 5 3
  System configuration: lcpu=4

  19:31:42 runq-sz %runocc swpq-sz %swpocc
  19:31:47     1.0     100     1.0     100
  19:31:52     2.0     100     1.0     100
  19:31:57     1.0     100     1.0     100

  Average      1.3      95     1.0      95

Figure 3-23. sar command

Notes:
Introduction
The sar command is the System Activity Report tool and is standard for UNIX systems.
The sar command can collect data in real-time and postprocess the data in real-time or
after the fact. sar data can be retrieved from the PerfPMR monitor.int file.

sar -P command
The syntax of the sar command using the -P flag is:
sar [-P CPUID [,...] | ALL] <Interval> <Count>
If the -P flag is given, the sar command reports activity which relates to the specified
processor or processors. If -P ALL is given, the sar command reports statistics for each
individual processor, followed by system-wide statistics in the row that starts with the
hyphen. Without the -P flag, the sar command reports system-wide (global among all

3-74 AIX Performance Management


Course materials may not be reproduced in whole or in part
without the prior written permission of IBM.

Copyright IBM Corp. 2010

V5.4
Instructor Guide

Uempty

processors) statistics, which are calculated as averages for values expressed as
percentages or sums.

sar -q command
sar -q reports queue statistics. The following values are displayed:
- runq-sz: Reports the average number of kernel threads in the run queue
- %runocc: Reports the percentage of the time the run queue is occupied (this field is
subject to error)
- swpq-sz: Reports the average number of kernel threads waiting to be paged in
- %swpocc: Reports the percentage of the time the swap queue is occupied (this field
is subject to error)
A blank value in any column indicates that the associated queue is empty.
The -q option can indicate whether you just have many jobs running (runq-sz) or have
a potential paging bottleneck. If paging is the problem, run vmstat. Large swap queue
lengths indicate significant competing disk activity or a lot of paging due to insufficient
memory.
A large number of runnable threads does not necessarily indicate a CPU bottleneck. If
the performance goals are being met and the system is running the threads quickly,
then it does not matter if this number seems high.
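Since sar can also record data for later post-processing (as noted in the introduction
above), a minimal sketch (the file name is hypothetical):
# sar -o /tmp/sar.data 60 10 > /dev/null   (sample every 60 seconds, 10 times, into a binary file)
# sar -f /tmp/sar.data                     (replay the recorded statistics)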

sar initial interval report line


In AIX 5L V5.3 and later, sar displays a system configuration line, which appears as the
first line displayed after the command is invoked. If a configuration change is detected
during a command execution iteration, a warning line will be displayed before the data
which is then followed by a new configuration line and the header.
The first interval of the command output is now meaningful and does not represent
statistics collected from system boot. Internal to the command, the first interval is never
displayed, and therefore there may be a slightly longer wait for the first displayed
interval to appear. Scripts that discard the first interval should function as before.

The topas command


The topas command output is a convenient way to see many different system statistics
in one view. It will display top processes by CPU-usage, the CPU usage statistics,
context switches (Cswitch), the run queue value, and the wait queue value. Once
topas has started, press lowercase c twice to see per-CPU statistics.


Instructor notes:
Purpose Explain the per-CPU option of sar.
Details Since most tools show CPU statistics as an average across all CPUs, the -P
option in sar is valuable in that it can show statistics on a per CPU basis. To get all CPUs,
run sar -P ALL interval count. The other sar options can be combined with the -P flag.
sar can collect data in real-time and postprocess the data in real-time or after the fact. This
unit shows how to sample and display statistics. Point the students to the sar man page
and Performance Management Guide for more information on the sar command and its
options.
The topas command is listed in the student notes. Just mention that this is another way to
get CPU-related statistics. This command was covered in Unit 2.
Additional information Since the output of sar has a new configuration line beginning
with AIX 5L V5.3, scripts written before AIX 5L V5.3 that parse command output may need
to be modified to detect and handle this new output.
Unlike the accounting package, sar is relatively unintrusive. The system maintains a series
of system activity counters which record various activities and provide the data that sar
reports. sar does not cause these counters to be updated or used. This is done
automatically regardless of whether sar runs or not. sar merely extracts the data in the
counters and saves it, based on the sampling rate and number of samples specified to sar.
If you're asked what is meant by a processor in the output of sar -P, it depends on the
environment:
On a non-logical partitioned (LPAR) system, or an LPAR using dedicated processors
without simultaneous multi-threading (SMT), the number of CPUs listed will be equal to
the number of physical CPUs.
On an LPAR using dedicated processors with SMT enabled, the CPUs listed will be the
logical CPUs and will be twice the number of physical CPUs.
On an LPAR that uses shared processors without SMT, the number shown will be the
number of virtual processors.
On a system that uses shared processors with SMT enabled, the number shown will be
the number of logical processors, which in this case will be twice the number of virtual
processors.
Point the students to the AIX 5L Virtualization Performance Management course or one of
the virtualization Redbooks for more information on how to interpret monitoring tools in
LPAR environments.
Transition statement Another useful monitoring tool is the ps command.


Locating dominant processes

What processes are currently using the most CPU time?
- Run the ps command periodically:
  # ps aux
  USER     PID %CPU %MEM   SZ  RSS    TTY STAT    STIME  TIME COMMAND
  root   31996 15.4  0.0  188  468 pts/12 A    10:41:31  0:04 -ksh
  user3  36334  3.0  0.0  320  456 pts/19 A    10:40:50  0:02 tstprog
  user2  47864  1.4  3.0 2576 5676 pts/23 A    08:41:16  1:40 /usr/sbin/re
  user5  63658  0.2  3.0 2036 5120 pts/23 A    09:18:11  0:11 /usr/bin/dd
  user1  35108  0.2  4.0 4148 6584 pts/17 A      Jul 26 16:24 looper
  root   60020  0.1  0.0  324  680 pts/14 A      Jul 26 16:24 looper
- Run tprof over a time period:
  # tprof -x sleep 60
- Use other tools such as topas
The problem may not be one or a few processes dominating the CPU; it could be the
sum of many processes.

Figure 3-24. Locating dominant processes                                 AN512.0

Notes:
Overview
To locate the processes dominating CPU usage, there are tools such as the standard
ps and the AIX-specific tool, tprof.

Using the ps command


The ps command, run periodically, will display the CPU time under the TIME column and
the ratio of CPU time to real time under the %CPU column. Keep in mind that the CPU
usage shown is the average CPU utilization of the process since it was first created.
Therefore, if a process consumes 100% of the CPU for five seconds and then sleeps for
the next five seconds, the ps report at the end of ten seconds would report 50% CPU
time. This can be misleading because right now the process is not actually using any
CPU time.
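One illustrative way to see whether a process is consuming CPU right now, rather than on
average, is to sample it repeatedly and watch whether its cumulative TIME advances
between samples (the PID here is taken from the example on the visual):
# while true
> do
>   ps -p 31996 -o pid,pcpu,time,comm
>   sleep 5
> done
If %CPU remains high but TIME stops increasing, the process is no longer actively using
the CPU.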


The example on the visual uses the ps aux flags which will display:
- a Information about all processes with terminals
- u User-oriented information
- x Processes without a controlling terminal in addition to processes with a
controlling terminal
Thread-related information can also be shown. The -m flag displays threads associated
with processes using extra lines. You must use the -o flag with the THREAD field specifier
to display extra thread-related columns. Examples are:
# ps [-e][-k] -mo THREAD [-p <pid>]          # thread is not bound
USER    PID  PPID   TID ST  CP PRI SC WCHAN      F    TT BND COMMAND
root  20918 20660     -  A   0  60  1     - 240001 pts/1   - -ksh
    -     -     - 20005  S   0  60  1     -    400     -   - -

# ps [-e][-k] -mo THREAD [-p <pid>]          # thread is bound
USER    PID  PPID   TID ST  CP PRI SC WCHAN      F    TT BND COMMAND
root  21192 20918     -  A  86  64  1 8b06c 200001 pts/1   0 bndprog
    -     -     - 20279  S   0  64  1 8b06c    420     -   0 -
It is very common to see a kproc using CPU time. When there are no threads that are
runnable for a time slice, the scheduler assigns the CPU time for that time slice to this
kproc, which is known as the idle or wait kproc. SMP systems have an idle kproc for
each processor.
A more accurate way of gauging CPU usage is with the tprof command.

Using the tprof command

The ps command takes a snapshot. To gather data over a time period, use tprof. It
can be used to locate the CPU-dominant processes, and then allow you to further
analyze which portion of a particular program is using the most CPU time. The -x option
specifies a program to execute at the start of the trace period; when that program
stops, the trace stops. While this can be used to measure a particular program
execution, in most cases it is simply used to control the trace period. For this purpose
the sleep command works well. For example, to monitor the system for five minutes,
use a sleep command with an argument of 300 seconds as the value of the -x option
(-x sleep 300). After this period is completed, tprof generates a file called sleep.prof
(AIX 5L V5.2 and later) or __prof.all (prior to AIX 5L V5.2) which shows the most
dominant processes listed in order of the highest CPU percentage (starting with AIX 5L
V5.2) or using the most CPU ticks (before AIX 5L V5.2).
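A minimal illustration of this usage (the five-minute duration is arbitrary):
# tprof -x sleep 300
# more sleep.prof
The first command profiles the entire system for 300 seconds; the second views the
resulting report.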


Instructor notes:
Purpose List ways to locate dominant processes.
Details There are various ways to determine which processes are using the most CPU.
The accounting facility can give you a historical perspective since it can show how much
CPU time processes have used since they were created.
The ps command can also show the current dominant processes by running the
command periodically and noting which processes dominate CPU usage.
Additional information A CPU tick is equivalent to 10 ms of CPU time. If the CPU is
idle or waiting on I/O, then the CPU ticks are assigned to the wait process.
The ps command can be invoked in various ways, such as the following:
ps aux
ps xv
ps -ekF %c %u %p %C | sort -rn +3
The ps options listed above are:
- a - Displays information about all processes with terminals (ordinarily only the user's
  own processes are displayed)
- u - Displays user-oriented output. This includes the USER, PID, %CPU, %MEM, SZ,
  RSS, TTY, STAT, STIME, TIME, and COMMAND fields
- x - Displays processes without a controlling terminal in addition to processes with a
  controlling terminal
- v - Displays the PGIN, SIZE, RSS, LIM, TSIZ, TRS, %CPU, %MEM fields
- e - Displays the environment as well as the parameters to the command, up to a limit
  of 80 characters
- k - Lists kernel processes
- F - Gives the format
The netpmon command (which will be discussed in detail in a later unit) can also be used to
profile the system for one minute.
Transition statement Let's look at the tprof output to see how it's useful for monitoring.


tprof output

Process               Freq  Total Kernel   User Shared  Other
=======               ====  ===== ======   ==== ======  =====
cpuprog                  1  50.29  47.77   0.35   2.17   0.00
wait                     2  49.50  49.50   0.00   0.00   0.00
/usr/sbin/syncd          1   0.14   0.14   0.00   0.00   0.00
/usr/bin/tprof           1   0.02   0.00   0.00   0.02   0.00
/usr/bin/trcstop         1   0.02   0.02   0.00   0.00   0.00
/usr/bin/sleep           1   0.01   0.01   0.00   0.00   0.00
IBM.ERrmd                1   0.01   0.01   0.00   0.00   0.00
rmcd                     1   0.01   0.01   0.00   0.00   0.00
=======               ====  ===== ======   ==== ======  =====
Total                    9 100.00  97.46   0.35   2.20   0.00

Process             PID    TID  Total Kernel   User Shared  Other
=======             ===    ===  ===== ======   ==== ======  =====
cpuprog           16378  33133  50.29  47.77   0.35   2.17   0.00
wait                516    517  44.75  44.75   0.00   0.00   0.00
wait                774    775   4.75   4.75   0.00   0.00   0.00
/usr/sbin/syncd    6200   8257   0.14   0.14   0.00   0.00   0.00
/usr/bin/tprof    15306  32051   0.02   0.00   0.00   0.02   0.00
/usr/bin/trcstop  14652  32975   0.02   0.02   0.00   0.00   0.00
IBM.ERrmd          9922  24265   0.01   0.01   0.00   0.00   0.00
rmcd               6718   8009   0.01   0.01   0.00   0.00   0.00
/usr/bin/sleep    14650  32973   0.01   0.01   0.00   0.00   0.00
=======             ===    ===  ===== ======   ==== ======  =====
Total                          100.00  97.46   0.35   2.20   0.00

Total Samples = 12381                    Total Elapsed Time = 61.90s

Figure 3-25. tprof output                                        AN512.0

Notes:
Overview
This output lists the processes/threads that were running when the clock interrupt
occurred. tprof uses the trace facility to record the instruction address register value
whenever the clock interrupt occurs (every 10 ms). The report lists processes in
descending order of CPU usage.
The file generated will be command.prof where command is the command given with
the -x flag.

Report format
The top part of the report contains a summary of all the processes on the system. This
is useful for characterizing CPU usage of a system according to process names when
there are multiple copies of a program running. The second part of the report shows
each thread that executed during the monitoring period.

Instructor notes:
Purpose Explain the output of tprof.
Details The tprof command can easily be used to show who's using the CPUs at this
very moment. Simply pass in the -x flag with a sleep command where the argument to
sleep is the amount of time you want to profile.
If asked, the Shared column on the visual means that shared libraries were in use, and the
Other column is not used.
Additional information A cooked file is a preprocessed version of the trace and
symbol name files that allows tprof to read the data very quickly, since everything in the
file is already formatted and sorted exactly as tprof needs. The default method of running
tprof now generates a command.prof file rather than a __prof.all file, where command is
the command specified by the -x flag. Also, the default output shows percent of CPU time
rather than ticks, but the ticks can be displayed using the -z flag.
Transition statement The default SMT environment complicates how CPU usage
statistics are presented. It is important to know how to correctly read the reports. Let's
start with a review of SMT.


What is simultaneous multi-threading?

- Multiple hardware threads can run on one physical processor at the same time.
  - A processor appears as two or four logical CPUs (lcpu) to AIX.
- Beneficial for most commercial environments
  - Computing intensive applications often do not benefit
- SMT affects how we read performance statistic reports.
- SMT is enabled by default with the maximum number of logical CPUs
  - Can disable or enable:                            smtctl -m {on|off}
  - Can change between SMT2 and SMT4 (POWER7 only):   smtctl -t #SMT

(Diagram: at the AIX layer, logical CPU0 and logical CPU1 map to hardware thread0
and hardware thread1 of one physical CPU at the physical layer.)

Figure 3-26. What is simultaneous multi-threading?               AN512.0

Notes:
Introduction
Simultaneous multi-threading is the ability of a single physical processor to concurrently
dispatch instructions from more than one hardware thread. There are multiple hardware
threads per processor. Instructions from any of the threads can be fetched by the
processor in a given cycle.
The number of hardware threads per processor depends upon the version of the
processor chip. The POWER5 and POWER6 chips support two hardware threads per
core. The POWER7 chips support four hardware threads per core.
Simultaneous multi-threading also allows instructions from one thread to utilize all the
execution units if the other thread encounters a long latency event. For instance, when
one of the threads has a cache miss, another thread can continue to execute.
Each hardware thread is supported as a separate logical processor by the operating
system. So, a system with one physical processor is configured by AIX as a logical
two-way. For POWER5 or POWER6-based systems with N physical processors and SMT
enabled, many performance tools will report N*2 logical processors. With
POWER7-based systems, many performance tools will report N*4 logical processors.
Simultaneous multi-threading is enabled by default on POWER5-based and later
systems, and some monitoring tools in this environment will show two or four times as
many processors as physical processors in the system. For example, sar -P ALL will
show the logical processors in this configuration, which will be two or four times the
number of physical processors installed.
For most commercial environments, simultaneous multi-threading can be slightly to
greatly beneficial to performance. For a specific workload environment, test
performance with it enabled and compare it to when it is disabled to see if simultaneous
multi-threading will be a benefit. Highly compute-intensive environments may not see
a gain, and in fact could see a slight degradation in performance, particularly with
workloads where multiple threads are competing for the same CPU execution units.

Modifying simultaneous multi-threading with the smtctl command

The smtctl command provides privileged users and applications the ability to control
utilization of processors with simultaneous multi-threading support. With this command,
you can enable or disable simultaneous multi-threading system-wide, either
immediately or the next time the system boots.
The smtctl command syntax is:
smtctl [ -m off | on [ -w boot | now ]]
smtctl [ -t #SMT [ -w boot | now ]]
where:
-m off     Sets simultaneous multi-threading mode to disabled.
-m on      Sets simultaneous multi-threading mode to enabled.
-t #SMT    Sets the number of simultaneous threads per processor.
-w boot    Makes the simultaneous multi-threading mode change effective on the
           next and subsequent reboots. (You must run the bosboot command
           before the next system reboot.)
-w now     Makes the simultaneous multi-threading mode change immediately, but
           it will not persist across reboot.
If neither the -w boot nor the -w now option is specified, then the mode change is
made now and when the system is rebooted.
Note: the smtctl command does not rebuild the boot image. If you want your change to
persist across reboots, the bosboot -a command must be used to rebuild the boot
image. The boot image has been extended to include an indicator that controls the
default simultaneous multi-threading mode.
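For example, the following illustrative sequence would set two threads per processor as
the persistent mode (assuming a processor type that supports selecting the thread count):
# smtctl -t 2 -w boot
# bosboot -a
Without the bosboot -a step, the new mode would not survive a reboot.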

Issuing the smtctl command with no options will display the current simultaneous
multi-threading settings.

Modifying simultaneous multi-threading (SMT) with SMIT


Start the smit command with no options, and then use the following menu path to get to
the main simultaneous multi-threading panel: Performance & Resource Scheduling
-> Simultaneous Multi-Threading Processor Mode.
The fastpath to this screen is smitty smt.
There are two options on this screen:
List SMT Mode Settings
Change SMT Mode
The Change SMT Mode screen gives the following options:
SMT Mode
Options are: enable and disable
SMT Change Effective:
Options are: Now and subsequent boots, Now, and Only on subsequent
boots
(At the time of this writing, SMIT has not been updated to provide a dialogue panel
capable of changing the number of simultaneous threads).


smtctl command output (no parameters)

(The following example is from an LPAR configured to use two hardware threads per
processor.)
# smtctl
This system is SMT capable.
SMT is currently enabled.
SMT boot mode is set to enabled.
SMT threads are bound to the same physical processor.

proc0 has 2 SMT threads.
Bind processor 0 is bound with proc0
Bind processor 1 is bound with proc0
proc2 has 2 SMT threads.
Bind processor 2 is bound with proc2
Bind processor 3 is bound with proc2

The smtctl command with no options reports the following information:
SMT Capability   Indicates whether the processors in the system are capable of
                 simultaneous multi-threading
SMT Mode         Shows the current runtime simultaneous multi-threading mode
                 (disabled or enabled)
SMT Boot Mode    Shows the current boot time simultaneous multi-threading mode
                 (disabled or enabled)
SMT Bound        Indicates whether the simultaneous multi-threading threads are
                 bound on the same physical or virtual processor
SMT Threads      Shows the number of simultaneous multi-threading threads per
                 physical or virtual processor


Instructor notes:
Purpose Describe how simultaneous multi-threading works.
Details Give an overview of simultaneous multi-threading. This is covered in the AN30
course, which is a recommended prerequisite to this course.
Emphasize that it is unusual to have a system where it is not beneficial to have SMT
enabled. If they want to verify, they can measure with SMT enabled and disabled to
compare. This is not something that most system administrators would disable. The details
for this are in the notes.
The only reason to spend time on this is that it affects the way they read the statistic
command reports.
Point students to the IBM eServer p5 Virtualization Performance Considerations redbook
for more information about performance benefits with simultaneous multi-threading. Test
results documented in the redbook show performance improvements up to 43% with
simultaneous multi-threading enabled. Be sure to emphasize the "up to" so they do not
think it is guaranteed to be 43%. In certain environments, performance degradation of up to
-11% was recorded. Most types of workloads saw performance improvements.
Additional information When in simultaneous multi-threading mode, instructions from
either thread can use the eight instruction pipelines in a given clock cycle. By duplicating
portions of logic in the instruction pipeline and increasing the capacity of the register
rename pool, the POWER5 or POWER6 processors can execute two instruction streams,
or threads, concurrently. With POWER7, that increases to four instruction streams.
If asked how this relates to virtual processors on an LPAR with shared processors, in that
case the logical processors will be two times (or four times) the number of virtual, not
physical, processors.
Transition statement Now let's see how we analyze the CPU statistics in the SMT
environment.


SMT scheduling and CPU utilization

- Processor has multiple hardware (H/W) threads
- AIX will tend to first dispatch software (S/W) threads on the low order H/W threads
  of all processors
  - primary is most preferred, next secondary, and so forth
- Runs wait kproc on (or snoozes) idle H/W threads
  - If tertiary and quaternary threads are snoozed (POWER7), mode is dynamically
    reduced to SMT2.
- For system wide reports, such as lparstat and vmstat:
  - Prior to AIX6 TL4: if only a single H/W thread was busy, the processor was
    reported as 100% utilized. This could be misleading since idle H/W threads have
    capacity.
  - AIX6 TL4 and later: potential capacity of unused H/W threads is reported as idle
    time for the processor.
- Best picture of H/W thread utilization given by per-lcpu reports, such as sar -P
  and mpstat

Figure 3-27. SMT scheduling and CPU utilization                  AN512.0

Notes:
The additional throughput offered by SMT is not needed until there are dispatchable
threads waiting for a processor. If AIX scheduling is not concerned with processor affinity
issues (as when we initially dispatch a process), it will avoid having two software threads
on the same processor. For example, it will tend to dispatch work to all of the primary
hardware threads before it dispatches any work to the secondary hardware threads. An
individual thread will generally run better without a second thread to compete with,
because it avoids the occasional contention for the same execution unit within the
processor that would happen if both hardware threads were being used.
As a result of this AIX scheduling preference, when using multiple dedicated processors,
you will tend to see a pattern where (for SMT2) every other logical CPU is almost idle,
when looking at a system that has a moderate CPU load.
AIX runs a wait kproc (kernel process) thread on any logical CPU which does not have any
other dispatchable threads. When the primary thread has some real work to do and the
secondary thread has only the wait kproc, there can still be some contention for execution
units. To avoid that, AIX will snooze the secondary thread after a short time of only
running the wait kproc. When this happens the processor is running only one hardware
thread and there is no contention for execution units. A similar mechanism occurs with
SMT4. If not utilized, the third and fourth hardware threads will be snoozed, and the
scheduling restricted to the primary and secondary threads.
Keeping a processor busy with a single thread when SMT is enabled is the same
utilization as a single thread keeping the processor busy when SMT is not enabled; in
other words, it is 100% utilized. But with SMT enabled, we can actually do much more
work by dispatching a second thread. Before AIX6 TL4, performance statistics reports
that only show overall system utilization, such as iostat or vmstat, would report 100%
utilization when only the primary hardware threads were fully utilized, even though there
was actually spare capacity on the secondary hardware threads that could be used. Since
AIX6 TL4, the system CPU utilization factors in the potential capacity of the unused
hardware threads. This is reflected in higher idle statistics (idle plus iowait) and lower
in-use statistics (user plus sys).
In order to see the complete CPU utilization picture, we need to use a report that shows the
utilization of each individual logical CPU. For this you can use either sar -P ALL or
mpstat. If you see that there are logical CPUs which are underutilized, we know that there
is extra capacity available to use.
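For example (illustrative invocations; any interval and count can be used):
# sar -P ALL 5 3
# mpstat -s 5 3
The -s flag of mpstat produces the simultaneous multi-threading utilization report, which
groups the logical CPUs under their physical (or virtual) processor and makes
underutilized hardware threads easy to spot.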
Later, we will see that AIX in a micro-partition LPAR will show a slightly different SMT
behavior.


Instructor notes:
Purpose Explain how AIX schedules for SMT and the effect upon utilization reporting.
Details
Additional information
Transition statement Let's look at examples of sar reports in the SMT environment.

System wide CPU reports (old and new)

Example of lparstat in AIX6 TL3:
# lparstat 2 3
System configuration: type=Shared mode=Capped smt=On lcpu=2 mem=768MB psize=4 ent=0.30

%user %sys %wait %idle physc %entc lbusy vcsw phint
----- ---- ----- ----- ----- ----- ----- ---- -----
 98.2  0.8   0.0   1.0  0.30  99.9  51.4  618     0
 97.9  0.8   0.0   1.3  0.30  99.9  48.7  602     1
 98.8  0.6   0.0   0.7  0.30  99.9  50.0  450     1

Example of lparstat in AIX6 TL5:
# lparstat 2 3
System configuration: type=Shared mode=Capped smt=On lcpu=2 mem=768MB psize=4 ent=0.30

%user %sys %wait %idle physc %entc lbusy vcsw phint
----- ---- ----- ----- ----- ----- ----- ---- -----
 77.3  2.7   0.0  20.0  0.30  99.8  53.2 1196     0
 76.7  2.8   0.0  20.5  0.30  99.8  51.8 1216     0
 77.0  2.6   0.0  20.4  0.30  99.8  52.2 1206     0

Figure 3-28. System wide CPU reports (old and new)               AN512.0

Notes:
The visual illustrates the difference between AIX6 technology level 3 and AIX6 technology
level 5. Both examples show a single significant CPU-intensive thread running on a
system configured for two hardware threads per processor, with only one processor
provided.
Before AIX6 TL4, the lparstat report shows there to be very little idle capacity, leading us to
believe that we will need to add more processor capacity. Yet, we know that we can still run
an additional thread on that same processor, providing even more throughput.
That same situation in AIX6 TL4 (and later) shows a significant amount of idle capacity.
This reflects the potential capacity of the idle hardware thread on that processor.
It must be remembered that this extra capacity can only be used by additional threads. If
the application is designed with a single process and a single thread, then that application
will not be able to use that potential extra capacity.
In either method of reporting CPU utilization, it is informative to examine a report showing
the utilization of each individual hardware thread.

Instructor notes:
Purpose Illustrate SMT reporting before and after AIX6 TL4.
Details
Additional information
Transition statement Let's look at examples of per-CPU utilization reports in the SMT
environment.


Viewing CPU statistics with SMT

Example of sar -P ALL with SMT disabled:
# sar -P ALL 2 2
AIX frodo21 3 5 00C30BFE4C00    06/11/06

System configuration: lcpu=2

16:40:30 cpu  %usr  %sys  %wio  %idle
16:40:32   0     0     0     0    100
           1    24    76     0      0
           -    12    38     0     50

Example of sar -P ALL with SMT enabled:
# sar -P ALL 2 2
AIX frodo21 3 5 00C30BFE4C00    06/11/06

System configuration: lcpu=4

16:40:43 cpu  %usr  %sys  %wio  %idle  physc
16:40:45   0     4    12     0     84   0.56
           1     0     1     0     99   0.05
           2     0     0     0    100   0.44
           3    27    69     0      4   0.96
           -    14    36     0     50   2.01

Figure 3-29. Viewing CPU statistics with SMT                     AN512.0

Notes:
Introduction
The visual above shows a system running with the same workload, first with SMT
disabled, then with it enabled. Notice that the logical CPU count doubled with SMT
enabled. Also notice the new statistic, physc (physical CPU consumed), with SMT
enabled. This shows how much of a CPU was consumed by the logical processor.
In both sar examples in the visual above, we see activity on two physical processors.
The physc column is misleading in a way: in this environment, with dedicated
processors in a logical partition, the two logical processors which make up one physical
processor have a physical processor consumption which always adds up to 100%, or
1.00 processors. Looking at the second sar output, there is user and system activity on
logical CPU 3, and the other three logical processors are mostly idle. Just by looking at
the output of this second sar, you can tell that logical CPUs 1 and 3 are on the same
physical CPU (0.05 + 0.96 add up to approximately 1.00) and logical CPUs 0 and 2 are
on the same physical CPU.

Instructor notes:
Purpose Describe how monitoring commands may change when simultaneous
multi-threading is enabled.
Details Compare the sar outputs and show the difference for an environment where
simultaneous multi-threading is enabled.
SMT is enabled by default. SMT is only available on POWER5 and later systems with AIX
(5L V5.3 or later) or one of the Linux versions based on the 2.6 (or later) kernel.
Additional information
Transition statement With POWER7-based systems, we can now have four hardware
threads per processor. Let us see what this looks like.


POWER7 CPU statistics with SMT4

# smtctl
This system is SMT capable.
SMT is currently enabled.
SMT boot mode is set to enabled.
SMT threads are bound to the same virtual processor.

proc0 has 4 SMT threads.
Bind processor 0 is bound with proc0
Bind processor 1 is bound with proc0
Bind processor 2 is bound with proc0
Bind processor 3 is bound with proc0

# sar -P ALL 2 1
AIX sys304_114 1 6 00F606034C00    05/20/10

System configuration: lcpu=4 ent=0.30 mode=Capped

22:08:13 cpu  %usr  %sys  %wio  %idle  physc  %entc
22:08:15   0     0    23     0     76   0.02    6.3
           1   100     0     0      0   0.13   43.1
           2    66     1     0     32   0.04   13.1
           3    97     0     0      3   0.11   37.4
           -    88     2     0     10   0.30   99.9

Figure 3-30. POWER7 CPU statistics with SMT4                     AN512.0

Notes:
The visual shows examples of both the smtctl and sar commands on a POWER7-based
system.
The smtctl report clearly shows the four SMT threads bound to proc0.
The sar report illustrates how we can view the utilization of each of these logical CPUs.
The principles are basically the same as with SMT2. Remember that the objective is to
increase throughput by running more threads in parallel, rather than to improve the
performance of any single thread.


Instructor notes:
Purpose Illustrate the ability to have four hardware threads per processor.
Details While the example was taken in a micro-partitioning environment, the focus
needs to be on the number of logical CPUs being displayed with their own statistics. If a
student asks about the %entc, use that as a segue to the next topic, where we cover
micro-partitioning.
Additional information
Transition statement Up to this point we have focused on an environment where we had
a dedicated processor allocated. Let's next look at how processor virtualization, or
micro-partitioning, affects how we analyze CPU utilization.


Processor virtualization

(Diagram: eight physical processors; six form a shared pool of processors and two are
dedicated processors. Two shared pool LPARs each have four virtual processors, with
logical CPUs on top, and one dedicated processor LPAR owns the two dedicated
processors.)

Figure 3-31. Processor virtualization                            AN512.0

Notes:
Introduction
This visual gives an overview of many aspects of processor virtualization that we need
to consider when tuning AIX CPU performance.
In this example, there are 8 physical processors. Starting from left to right, there are six
processors in the shared processor pool and two processors dedicated to a partition.
Moving up in the visual, we see that there are two shared pool LPARs (SPLPAR), each
with four virtual processors, and one dedicated LPAR with two physical processors
allocated.

Optional features
Many of the processor concepts in this unit are optional features that must be
purchased. For example, the ability to have Capacity on Demand (CoD) processors is a
separate, orderable feature. Also, the PowerVM standard edition (Advanced POWER
Virtualization) feature must be purchased to use Micro-Partitioning and shared and


virtual processors.

Shared processors
Shared processors are physical processors, which are allocated to partitions on a time
slice basis. Any physical processor in the Shared Processor Pool can be used to meet
the execution needs of any partition using the Shared Processor Pool. There is only one
Shared Processor Pool for POWER5 processor-based systems. With POWER6 and
later, multiple Shared Processor Pools can be configured.
A partition may be configured to use either dedicated processors or shared processors,
but not both.

Processing units
When a partition is configured, you assign it an amount of processing units. This is
referred to as the processor entitlement for the LPAR. A partition must have a minimum
entitlement of one tenth of a processor; after that requirement has been met, you can
configure processing units at the granularity of one hundredth of a processor.

Capped versus uncapped shared pool LPARs


When a partition is configured as using the shared processor pool, it can have its
entitlement either capped or uncapped. When capped, the LPAR cannot use more than
its current entitlement. When uncapped, it is allowed to use processor capacity above its
current entitlement as long as other LPARs do not need those cycles to do work within
their own entitlement.

Benefits to using shared processors


Here are some benefits of using shared processors:
The processing power from a number of physical processors can be utilized
simultaneously, which can increase performance for multiple partitions.
Processing power can be allocated in sub-processor units in as little as
one-hundredths of a processor for configuration flexibility.
Uncapped partitions can be used to take advantage of excess processing power
not being used by other partitions.

Disadvantage of using shared processors


A disadvantage of using shared processors is that because multiple partitions use the
same physical processors, there is overhead from context switches on the processors.
A context switch occurs when a process or thread that is running on a processor is
interrupted (or finishes), and a different process or thread runs on that processor. The
overhead is in the copying of each job's data from memory into the processor cache.
This overhead is normal and even happens at the operating system level within a
partition. However, there is added context switch overhead when the Hypervisor
dispatches virtual processors onto physical processors in a time-slice manner between
partitions.

Micro-partitions
The term micro-partition is used to refer to partitions that are using the shared
processor pool. This is because the partition does not use processing power in whole
processor units, but can be assigned a fractional allocation in units equivalent to
hundredths of a processor.

Shared processor logical partition (SPLPAR)


In documentation, you might see the acronym SPLPAR for shared processor logical
partition, and it simply means a partition utilizing shared processors.

Virtual processors
The virtual processor setting allows you to control the number of threads your partition
can run simultaneously. The example shows six physical processors in the shared pool,
and there are eight virtual processors configured in the two partitions.
The number of virtual processors is what the operating system thinks it has for physical
processors. The number of virtual processors is independently configurable for each
partition using shared processors.

Dedicated processor versus shared processor partition performance


Dedicated processors provide improved performance over capped shared processors
because of reduced processor cache misses and reduced latency. Dedicated processor
partitions have the added advantage of memory affinity; that is, when the partition is
activated, there is an attempt made to assign physical memory that is local to the
dedicated processors, thereby reducing latency issues.
However, a partition using dedicated processors cannot take advantage of using
excess shared pool capacity as you can with an uncapped partition using the shared
processor pool. Performance could be better with the uncapped processors if there is
excess capacity in the shared pool that can be used.
Configuring the virtual processor number on shared processor partitions is one way to
increase (or reduce) the performance for a partition.
The virtual processor setting for a partition can be changed dynamically.
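From within the partition, one convenient way to check the current entitlement and virtual
processor allocation is lparstat -i (output abbreviated; the values shown are illustrative):
# lparstat -i | grep -E 'Entitled Capacity|Online Virtual CPUs'
Entitled Capacity                          : 0.35
Online Virtual CPUs                        : 2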


Virtual processor folding


Starting with AIX V5.3 maintenance level 3, the kernel scheduler has been enhanced to
dynamically increase and decrease the use of virtual processors in conjunction with the
instantaneous load of the partition, as measured by the physical utilization of the
partition. This is a function of the AIX V5.3 operating system (also of AIX 6.1) and not a
Hypervisor call.
If there are too many virtual processors for the load on the partition, every time slice, the
Hypervisor will cede excess cycles. This works well, but it only works within a dispatch
cycle. At the next dispatch cycle, the Hypervisor distributes entitled capacity and must
cede the virtual processor again if there is no work. The VP folding feature, which puts
the virtual processor to sleep across dispatch cycles, improves performance by
reducing the Hypervisor workload, by decreasing context switches, and by improving
cache affinity.
When virtual processors are deactivated, they are not dynamically removed from the
partition as with DLPAR. The virtual processor is no longer a candidate to run on or
receive unbound work; however, it can still run bound jobs. The number of online logical
processors and online virtual processors that are visible to the user or applications does
not change. There are no impacts to the middleware or the applications running on the
system because the active and inactive virtual processors are internal to the system.

Enable/Disable VP folding
The schedo command is used to dynamically enable, disable, or tune the VP folding
feature. It is enabled (set to 0) by default.
Typically, this feature should remain enabled. The disable function is available for
comparison reasons and in case any tools or packages encounter issues due to this
feature.

Configuring vpm_xvcpus
Every second, the kernel scheduler evaluates the number of virtual processors in a
partition based on their utilization. If the number of virtual processors needed to
accommodate the physical utilization of the partition is less than the current number of
enabled virtual processors, one virtual processor is disabled. If the number of virtual
processors needed is greater than the current number of enabled virtual processors,
one or more (disabled) virtual processors are enabled. Threads that are attached to a
disabled virtual processor are still allowed to run on it.
To determine if folding is enabled (0=enabled; -1=disabled):
# schedo -a | grep vpm_xvcpus
vpm_xvcpus = 0
To disable, set the value to -1 (To enable, set it to 0.):


# schedo -o vpm_xvcpus=-1

Tuning VP folding
Besides enabling and disabling virtual processor folding, the vpm_xvcpus parameter
can be set to an integer to tune how the VP folding feature will react to a change in
workload. For example, the following command sets the vpm_xvcpus parameter to 1:
# schedo -o vpm_xvcpus=1
Now when the system determines the correct number of virtual processors needed, it
will add one more to that amount. So if the partition needs four processors, setting
vpm_xvcpus to 1 causes the number of virtual processors to be set to 5.

Dedicated LPAR processor donation


POWER6 and POWER7-based systems allow a dedicated processor logical partition to
donate its idle processor cycles to the shared processor pool.
This processor function provides the ability for partitions that normally run as dedicated
processor partitions to contribute unused processor capacity to the shared processor
pool. This support allows that unneeded capacity to be donated to uncapped partitions
instead of being wasted as idle cycles in the dedicated partition. This feature ensures
the opportunity for maximum processor utilization throughout the system.


Instructor notes:
Purpose Provide an overview or review of basic processor virtualization concepts.
Details The course description encourages students to first attend AN30 before they
attend this course. For them this is all review. For the students without this background, this
should act as an overview and introduction to these concepts.
Do not try to teach all of the content in the student notes. The most important points are:
- Entitled allocations out of the shared processor pool
- Difference between capped and uncapped
- Unused cycles can be donated back to the pool for use by uncapped LPARs
- SP-LPARs will have virtual processors instead of dedicated processors
- The SP-LPAR should only have as many VPs as are needed to handle the workload;
  an LPAR that needs the processing capacity of 3.5 cores (3.5 processing units) only
  needs 4 VPs.
- VPs beyond what is needed by AIX will be folded by AIX, meaning that they will not
  be used by AIX; no threads will be scheduled on them.
- The logical CPUs relate to the SMT topic just covered.
Additional information While extra VPs beyond what AIX can use will not cause a
significant CPU problem (context switching overhead) due to folding, those extra VPs will
add to the memory load. In a low memory LPAR, excessive numbers of VPs will make the
situation worse than it needs to be.
Transition statement Let us take a look at the ways in which micro-partitioning can
affect our CPU performance management.


Performance management with virtualization

- Processor allocation can be very dynamic
  - Work with HMC administrator to adjust capacity and VP allocations or other LPAR
    virtualization attributes
- AIX actual processor usage varies
  - AIX can cede or donate cycles that it cannot use
  - If uncapped, AIX LPAR can use more than its entitlement
- Traditional usr, sys, wait, idle percentages are not stable
  - Calculated as percentage of actual cycles used (including wait kproc)
  - The denominator for the calculation can constantly change
- Need to factor in actual processor utilization (physc, %entc)
- Uncapped execution above entitlement:
  - Better system resource utilization
  - Performance can vary due to other LPARs' processor demands

Figure 3-32. Performance management with virtualization          AN512.0

Notes:
Partnering with the managed system administrator
A major component in performance management is providing the correct amount and type
of resource for the demand. In a single operating system server, you would do this by
adding more physical resource. In an LPAR environment, you may be able to provide the
additional resource by having it dynamically allocated via the HMC. When using shared
pool LPARs, you can have the processor entitlement and the number of virtual processors
increased. Often the number of virtual processors is already set to the largest value you
would practically want, so that often does not need to be changed. You might also want to
run uncapped, if not already in that mode. Remember that the server administrator sees a
larger picture, involving optimizing the performance of all LPARs and taking into
consideration the characteristics and priority of the various applications.


Physical processor utilization variability


In a traditional non-partitioned server, the consumed processor cycles stay constant; either
a useful thread is running or the wait kproc is running, and the processor is kept busy. A
dedicated processor LPAR behaves the same way. The traditional UNIX statistics assume
that this is the situation and are focused on how these cycles are used: user mode, system
mode, or wait kproc (with or without threads waiting on a resource). The four categories
always add up to 100%, and this 100% matches the actual execution time on the physical
processor.
When we use micro-partitions the situation changes significantly. If AIX detects that it is just
wasting cycles with the wait kproc spinning in a loop, it will cede that VP (and thus the
underlying physical processor) back to the hypervisor, which can then dispatch a different
LPAR on that processor. Even though AIX has a processor entitlement, it may not be
running anything on the physical processor for a period of time, not even a wait kproc. On
the other hand, if an AIX partition is uncapped, it may execute for many more cycles than
its entitlement would appear to provide. The traditional CPU statistics (user, sys, wait, idle)
will still add up to 100%, but this is 100% of the time this LPAR used a physical processor;
and, as you know, this time can vary.

Problems with focusing on only the traditional CPU statistics


The traditional CPU utilization statistics are calculated by dividing the amount of time a
logical CPU is executing in a given mode (for example, user mode) by the total execution
time (including the wait kproc execution). If you are examining statistics with AIX running in
a shared processor LPAR (or a dedicated processor LPAR with the ability to donate while
active), then the denominator of this calculation (the time spent executing on the physical
processor) can change from one interval to the next. Even if the workload in AIX is
constant, we may see these percentages fluctuating as other LPARs demand their
entitlement and deny this LPAR capacity above its own entitlement. In a lower demand
situation, the denominator of this calculation can be so small that very small thread
executions can appear as fairly large percentages. Without knowing the actual physical
processor utilization, these percentages can be very confusing and even misleading.

The physical processor utilization statistics


To assist with this situation, the statistics commands were modified to display the actual
physical processor utilization by a logical CPU. The two additional statistics are physc
and %entc. The physc statistic identifies how much time a logical CPU spent executing
on a physical processor. This is expressed as a number of processing units, where 1
represents the capacity of one physical processor. The %entc statistic reports essentially
the same information, except it is expressed as a percentage of the LPAR's entitled
capacity. It is important to examine these statistics to know the true processor utilization
and to place the traditional statistics in context.
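As a worked example of the relationship %entc = (physc / entitled capacity) x 100: in the
sar reports on the following pages, the LPAR is entitled to 0.35 processing units, so a
logical CPU consuming a full physical processor (physc = 1.00) reports an %entc of
roughly 1.00 / 0.35, or about 285%.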


Better utilization but possibly inconsistent performance


The main advantage of running uncapped SP-LPARs is to better utilize the processor
resources. With dedicated processors, if one LPAR is not utilizing its one-processor
allocation while another LPAR needs more than its one-processor allocation, nothing can
be done to transfer that capacity; the few capabilities which may be of help (such as
shutting down an LPAR which is fairly inactive, or using DLPAR to transfer entire
processors) are not good for relatively short-term transient situations. With SP-LPARs, the
excess capacity that is not used by one LPAR is immediately available to another LPAR
that needs it.
The problem you may face is that the users can get spoiled with excellent performance
which depends upon this excess capacity (above and beyond the entitlement which the
application is assigned). Later, when other LPARs start to utilize their entitlements, the
excess capacity disappears, and users then see performance which is more appropriate
for the defined processor entitlement. This can be a trend over time as other LPARs slowly
increase their workload, or it could be seen as fluctuations in performance as different
LPARs hit their peak demand at different times.
The key here is to clearly identify the acceptable performance and request an LPAR
processor entitlement which should provide that level of performance. The user should be
made to understand that you may, at times, provide much better performance than stated
in the service level agreement, but that this is not guaranteed.


Instructor notes:
Purpose Discuss the impact of micro-partitioning on CPU performance management.
Details
Additional information Even a dedicated processor LPAR can see some of these
effects when configured as dedicated-donating.
Transition statement Let us take a look at some sar reports that illustrate how to read
CPU statistics in this environment.


CPU statistics in an SPLPAR (1 of 2)

No jobs running:
# sar -P ALL 2 2
System configuration: lcpu=4 ent=0.35 mode=Uncapped

22:28:15 cpu  %usr  %sys  %wio  %idle  physc  %entc
22:28:17   0    16    55     0     29   0.00    0.9
           1     0    25     0     75   0.00    0.3
           2     0    23     0     77   0.00    0.3
           3     0    25     0     75   0.00    0.3
           U     -     -     0     98   0.34   98.3
           -     0     1     0     99   0.01    1.7

One job running:
# sar -P ALL 2 2
System configuration: lcpu=4 ent=0.35 mode=Uncapped

22:39:20 cpu  %usr  %sys  %wio  %idle  physc  %entc
22:39:22   0    11    53     0     35   0.00    0.8
           1     0    25     0     75   0.00    0.3
           2    13    46     0     42   0.00    0.4
           3     0    21    79      0   1.00  285.1
           -     0    21    79      0   1.00  286.5

Figure 3-33. CPU statistics in an SPLPAR (1 of 2)                AN512.0

Notes:
Overview
The examples shown are from an SPLPAR which has an entitlement of 0.35 processing
units and an allocation of two virtual processors. This LPAR is the only LPAR which has
any significant work, so the shared pool of eight processors has plenty of excess capacity.
The examples illustrate how the CPU statistics change as the number of single-threaded
jobs is increased.

Idle LPAR
Of course, there is no such thing as a totally idle LPAR, but there are no user jobs running
in this LPAR. It may at first glance appear that lcpu 0 is fairly busy, but a glance at the
physc values shows that the processor utilization is so low that it shows as zero. Even the
%entc is a fraction of a percent of the LPAR's entitlement. The %usr + %sys may be 71%,
but that is 71% of almost nothing.


One job running


Looking at just the traditional statistics, it would appear that lcpu 0 and lcpu 2 are busy, with
64% and 59% utilization. Once again, you need to look at the physc statistic to see the true
situation. Those two logical CPUs once again have very low utilization, so low that they are
reported as zero processing units.
The lcpu which is reported as using an entire processing unit is lcpu 3. Furthermore, since
an entire processing unit is much more than the allocated processor entitlement for the
LPAR, the %entc is far above 100%. The execution is divided between system mode
thread execution and the wait kproc.

Instructor notes:
Purpose Continue the explanation of reading CPU statistics in an SMT-enabled SPLPAR.
Details
Additional information
Transition statement Let us look at some more examples of reading the sar reports.


CPU statistics in an SPLPAR (2 of 2)

Two jobs running:
# sar -P ALL 2 2
System configuration: lcpu=4 ent=0.35 mode=Uncapped

00:50:02 cpu  %usr  %sys  %wio  %idle  physc  %entc
00:50:04   0     8    47     0     45   0.00    0.6
           1     0    16    84      0   1.00  284.7
           2    13    45     0     41   0.00    0.4
           3    18    82     0      0   1.00  285.1
           -     9    49    42      0   2.00  570.8

Three jobs running:
# sar -P ALL 2 2
System configuration: lcpu=4 ent=0.35 mode=Uncapped

00:52:26 cpu  %usr  %sys  %wio  %idle  physc  %entc
00:52:28   0    16    51     0     33   0.00    0.6
           1     0    14    86      0   1.00  284.7
           2    16    84     0      0   0.50  142.3
           3    16    84     0      0   0.50  143.2
           -     8    49    43      0   2.00  570.8

Figure 3-34. CPU statistics in an SPLPAR (2 of 2)                AN512.0

Notes:
Two jobs running
Staying focused on the physc statistic, you can see that both lcpu 1 and lcpu 3 are each
fully utilizing a physical processor, while the other two logical CPUs are almost entirely idle
(despite one of them having significant %usr + %sys). The selection of logical CPUs is not
random. Because SMT is enabled, these logical CPUs are mapped to the primary
hardware threads of the two processors.

Three jobs running

With the addition of one more job, and with the primary hardware threads busy, the AIX
scheduler starts to use the secondary hardware threads. The processor is still 100% busy,
but now the usage is prorated between the two threads sharing it, using SMT.


Instructor notes:
Purpose Continue the explanation of reading CPU statistics in an SMT SPLPAR.
Details
Additional information
Transition statement Let's review what we have covered with some checkpoint
questions.


Checkpoint
1. What is the difference between a process and a thread?
___________________________________________________
___________________________________________________
2. The default scheduling policy is called: _________________
3. The default scheduling policy applies to fixed or non-fixed priorities?
_________________
4. Priority numbers range from ____ to ____.
5. True/False The higher the priority number the more favored the thread
will be for scheduling.
6. List at least two tools to monitor CPU usage:

7. List at least two tools to determine what processes are using the CPUs:


Figure 3-35. Checkpoint

AN512.0

Notes:


Instructor notes:
Purpose Review and test the students' understanding of this unit.
A suggested approach is to give the students about five minutes to answer the questions
on this page. Then, go over the questions and answers with the class.

Checkpoint solutions
1. What is the difference between a process and a thread?
A process is an activity within the system that is started by a
command, shell program or another process. A thread is what is
dispatched to a CPU and is part of a process. A process can have
one or more threads.
2. The default scheduling policy is called: SCHED_OTHER
3. The default scheduling policy applies to fixed or non-fixed priorities?
non-fixed
4. Priority numbers range from 0 to 255.
5. True/False The higher the priority number the more favored the thread
   will be for scheduling. (False: lower priority numbers are more favored.)
6. List at least two tools to monitor CPU usage:
vmstat, sar, topas, nmon
7. List at least two tools to determine what processes are using the CPUs:
ps, tprof, topas, nmon

Details
Additional information
Transition statement It's now time for an exercise.


Exercise 3: Monitoring, analyzing, and tuning CPU usage
Observing the run queue
Use nice numbers to control process priorities
Analyze CPU statistics in multiple environments, including SMT and
SPLPAR, and locate a dominant process
Use WPAR resource controls (optional)
Use schedo to modify the scheduler algorithms (optional)
Use PerfPMR data to examine CPU usage


Figure 3-36. Exercise 3: Monitoring, analyzing, and tuning CPU usage


Notes:


Instructor notes:
Purpose Introduce the exercise.
Details Describe the major steps in the lab exercise.
Additional information
Transition statement Now, let's summarize the key points from this unit.


Unit summary
This unit covered:
Processes and threads
How process priorities affect CPU scheduling
Managing process CPU utilization with either:
  the nice and renice commands
  WPAR resource controls
Using the output of the following AIX tools to determine symptoms of a
CPU bottleneck:
  vmstat, sar, ps, topas, tprof
Correctly interpreting CPU statistics in various environments, including
where:
  Simultaneous multi-threading (SMT) is enabled
  The LPAR is using a shared processor pool

Figure 3-37. Unit summary


Notes:


Instructor notes:
Purpose Summarize the unit.
Details Describe what we covered in this unit.
Additional information
Transition statement The next unit covers virtual memory management.


Unit 4. Virtual memory performance monitoring and tuning

Estimated time
3:45 (2:30 Unit; 1:15 Exercise)

What this unit is about


This unit describes virtual memory concepts including page
replacement. It also explains how to analyze and tune the virtual
memory manager (VMM).

What you should be able to do


After completing this unit, you should be able to:
Define basic virtual memory concepts and what issues affect
performance
Describe, analyze, and tune page replacement
Identify memory leaks
Use the virtual memory management (VMM) monitoring and tuning
tools
Analyze memory statistics in an active memory sharing (AMS)
environment
Describe the role of Active Memory Expansion (AME) and interpret the
related AME statistics

How you will check your progress


Accountability:
Checkpoints
Machine exercises

References
AIX Version 6.1 Performance Management
AIX Version 6.1 Performance Tools Guide and Reference
AIX Version 6.1 Commands Reference, Volumes 1-6
AIX 5L Practical Performance Tools and Tuning Guide, SG24-6478 (Redbook)
Active Memory Expansion: Overview and Usage Guide (Whitepaper)


Unit objectives
After completing this unit, you should be able to:
Define basic virtual memory concepts and what
issues affect performance
Describe, analyze, and tune page replacement
Identify memory leaks
Use the virtual memory management (VMM)
monitoring and tuning tools
Analyze memory statistics in Active Memory
Sharing (AMS) and Active Memory
Expansion (AME) environments

Figure 4-1. Unit objectives


Notes:


Instructor notes:
Purpose To list the objectives for this unit.
Details One of the main focuses of this unit is to explain how AIX manages memory
frames and how to observe this and react to problem situations.
Additional information
Transition statement Let us first look at the hierarchy of memory storage.


Memory hierarchy

(Diagram: the memory hierarchy, fastest and most expensive at the top)
Registers
Cache: L1, L2, and L3
Real Memory (RAM)
Disk Drives (Persistent Storage)

Figure 4-2. Memory hierarchy


Notes:
Registers
The instructions and data that the CPU processes are fetched from memory. Memory
comes in several layers with the top layers being the most expensive but the fastest.
The top layer consists of registers which are high speed storage cells that can contain
32-bit or 64-bit instructions or data. However, there is a limited number of registers on
each CPU chip.

Caches
Caches are at the next level and themselves can be split into multiple levels. Level 1
(L1) cache is the fastest and smallest (due to cost) and is usually on the CPU chip. If the
CPU can find the instruction or data it needs from the L1 cache, then access time can
be as little as 1 clock cycle. If it's not in L1, then the CPU can attempt to find the
instruction or data in the L2 cache (if it exists) but this could take 7-10 CPU cycles. The
advantage is that L2 caches can be megabytes in size whereas the L1 is typically
32-256 KB. L3 caches are even less expensive while not as fast as L2 cache, but
significantly faster than main memory access.

Real memory (RAM)


Once the virtual address is found in random access memory (RAM), the item is fetched,
typically at a cost of 200-500 CPU cycles.

Disk
If the address is not in RAM, then a page fault occurs and the data is retrieved from the
hard disk. This is the slowest method but the cheapest. It's the slowest for the following
reasons:
- The disk controller must be directed to access the specified blocks (queuing delay)
- The disk arm must seek to the correct cylinder (seek latency)
- The read/write heads must wait until the correct block rotates under them (rotational
latency)
- The data must be transmitted to the controller (transmission time) and then
conveyed to the application program (interrupt-handling time)
Cached storage arrays also have an influence on the performance level of persistent
storage. Through the storage subsystem's caching of data, the apparent response time
on some I/O requests can reflect a memory to memory transfer over the fibre channel,
masking any mechanical delays (seek and rotational latency) in accessing the data.
Newer storage subsystems are offering solid state drives (SSD), also referred to as
flash storage. While still much slower than the system memory and more expensive
than traditional disk drives, the performance provided is significantly better than drives
requiring mechanical movement to access a spinning disk. These can be used either as
an alternative to disk drives (for predetermined file systems which require optimal
access times) or as another layer in the memory hierarchy by using hierarchical storage
systems that automatically keep frequently used data on the SSD.
A disk access can cost hundreds of thousands of CPU cycles.
If the CPU is stalled because it is waiting on a memory fetch of an instruction or data
item from real memory, then the CPU is still considered as being in busy state. If the
instruction or data is being fetched from disk or a remote machine, then the CPU is in
I/O wait state (I/O wait also includes waits for network I/O).

Hardware hierarchy overview


When a program runs, it makes its way up the hardware and operating system
hierarchies, more or less in parallel. Each level on the hardware side is scarcer and
more expensive than the one below it. There is contention for resources among
programs and time spent in transition from one level to the next. Usually, the time
required to move from one hardware level to another consists primarily of the latency of
the lower level, that is, the time from the issuing of a request to the receipt of the first
data.

Disks are the slowest hardware operation


By far the slowest operation that a running program does (other than waiting on a
human keystroke) is to obtain code or data from a disk. Disk operations are necessary
for read or write requests for programs. System tuning activities frequently turn out to be
hunts for unnecessary disk I/O or searching for disk bottlenecks since disk operations
are the slowest operations. For example, can the system be tuned to reduce paging? Is
one disk too busy causing higher seek times because it has multiple filesystems which
have a lot of activity?

Real memory
Random access memory (RAM) access is fast compared to disk, but much more
expensive per byte. Operating systems try to keep program code and data that are in
use in RAM. When the operating system begins to run out of free RAM, it needs to
make decisions about what types of pages to write out to disk. Virtual memory is the
ability of a system to use disk space as an extension of RAM to allow for more efficient
use of RAM.

Paging and page faults


If the operating system needs to bring a page into RAM that has been written to disk or
has not been brought in yet, a page fault occurs, and the execution of the program is
suspended until the page has been read in from disk. Paging is a normal part of the
operation of a multi-processing system. Paging becomes a performance issue when
free RAM is short and pages which are in memory are paged-out and then paged back
in again causing process threads to wait for slower disk operations. How virtual memory
works will be covered in another unit of this course.

Caches
To minimize the number of times the program has to experience the RAM latency,
systems incorporate caches for instructions and data. If the required instruction or data
is already in the cache (a cache hit), it is available to the processor on the next cycle
(that is, no delay occurs); otherwise, a cache miss occurs. If a given access is both a
TLB miss and a cache miss, both delays occur consecutively.
Depending on the hardware architecture, there are two or three levels of cache, usually
called L1, L2, and L3. If a particular storage reference results in an L1 miss, then L2 is
checked. If L2 generates a miss, then the reference goes to the next level, either L3, if it
is present, or RAM.


Pipeline and registers


A pipelined, superscalar architecture allows for the simultaneous processing of multiple
instructions, under certain circumstances. Large sets of general-purpose registers and
floating-point registers make it possible to keep considerable amounts of the program's
data in registers, rather than continually storing and reloading the data.


Instructor notes:
Purpose Describe the memory hierarchy.
Details Instructions and associated data targeted for the CPU comes from some type of
memory. Memory has several layers with each layer becoming more expensive and
scarcer.
Registers are at the top layer which are high speed storage cells inside the CPU.
Caches are high speed memory containing a subset of the main memory. There are
different types of caches:
Level 1 cache is the fastest (and more expensive)
Level 2 cache is slower but can be several megabytes in size
Level 3 cache can be found on some systems
Real memory (RAM) is the most familiar memory level.
Disk storage is also a type of memory, also known as persistent memory.
Additional information
Translation Lookaside Buffers (TLBs)
One of the ways that programmers are insulated from the physical limitations of the
system is the implementation of virtual memory. The programmer designs and codes
the program as though the memory were very large, and the system takes responsibility
for translating the program's virtual addresses for instructions and data into real
addresses that are needed to get the instructions and data from RAM. Since this
address-translation process is time-consuming, the system keeps the real addresses of
recently accessed virtual memory pages in a cache called the Translation Lookaside
Buffer (TLB).
As long as the running program continues to access a small set of program and data
pages, the full virtual-to-real page-address translation does not need to be redone for each
RAM access. When the program tries to access a virtual-memory page that does not have
a TLB entry, called a TLB miss, dozens of processor cycles, called the TLB-miss latency
are required to perform the address translation.
Transition statement Let's look at the relationship between virtual memory, segments,
and physical memory.


Virtual and real memory

(Diagram: virtual memory segments, divided into pages, mapped to real memory page frames and to disk storage)

Virtual memory is mapped to: real memory, disk storage, or both
Virtual address space is divided into segments (Segment 0 through Segment n), each 256 MB in size
Page sizes of 4 KB and 64 KB; configurable pools of 16 MB and 16 GB pages

Figure 4-3. Virtual and real memory


Notes:
Overview
Virtual memory is a method by which the real memory appears larger than its true size.
The virtual memory system is composed of the real memory plus physical disk space
where portions of a file that are not currently in use are stored.

Virtual memory segments


Virtual address space is divided into segments. A segment is a 256 MB, contiguous
portion of the virtual memory address space into which a data object can be mapped.
Process addressability to data is managed at the segment (or object) level so that a
segment can be shared between processes or maintained as private. For example,
processes share code segments yet have separate private data segments.


Pages, page frames, and pages on disk


Virtual memory segments are divided into pages. AIX supports four different page
sizes:
4 KB - traditional and most commonly used
64 KB - used mostly by the AIX, but easily used by applications
16 MB - mostly used in HPC environments; requires AIX allocation of a pool
16 GB - mostly used in HPC environments; requires server configuration of a
pool
You will mostly see only the 4 KB and 64 KB page sizes. This course will not cover the
configuration or use of the larger page sizes.
Similarly, physical memory is divided (by default) into 4096 byte (4 KB) page frames. A
64 KB page frame is, essentially, 16 adjacent 4KB page frames managed as a single
unit. The system hardware and firmware is designed to support physical access of
entire 64 KB page frames.
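
As a quick check, the page sizes a given system supports can be listed with the pagesize command. This is a minimal illustration; the sizes reported (in bytes) depend on the hardware and AIX level:

# pagesize -a
4096
65536
16777216
17179869184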
Each virtual memory page which has been touched is mapped to either a memory page
frame or a location on disk.
For file caching, the segment's pages are mapped to an opened file in a file system. As
the process reads a file, memory page frames are allocated and the file data is paged-in
to memory. When a process writes to the file, by default, a memory page frame is
allocated, the data is copied to memory, and eventually the memory contents are
paged-out to the file to which the process was writing. Note that the virtual memory
page could have its data on disk only (in the file), in memory only (written but not yet
paged-out to the file), or be stored both in memory and on disk.
For application private memory work space, the application stores data in a given virtual
memory page and this gets written to a page frame in memory. If AIX needs this
memory for other uses, it may steal that memory page frame. This would require that
the contents be saved by paging it out to paging space on disk, since it does not have a
persistent location in a file in a file system. Later, if the process references that virtual
segment page, the paging space page will be paged-in to a memory page frame. Note
that the paging space page is not freed when the contents are paged-in to memory. The
data is kept in the paging space; this way, if that page needs to be stolen again we
already have an allocation in paging space to hold it, and if the page is not modified, we
will not need to page-out at all since the paging space already has the data. Once
again, the virtual memory page may be stored in memory only, or in paging space only,
or in both.


Instructor notes:
Purpose Describe the structure of virtual memory, segments, and physical memory.
Details Focus on the basic concepts of virtual segment pages and how they relate to
real memory page frames and, possibly, disk storage (either a file system file or paging
space).
Additional information Some systems also support a larger page size, typically
accessed only through the shmat system call. This topic is beyond the scope of this
course.
Transition statement Let us discuss the function of the Virtual Memory Manager
(VMM).


Major VMM functions

To manage memory, the virtual memory manager (VMM):
  Manages the allocation of page frames
  Resolves references to virtual memory pages that are not currently in RAM
To accomplish these functions, the VMM:
  Maintains a free list of available page frames
  Uses a page replacement algorithm to determine which allocated real
  memory page frames will be stolen to add to the free list
The page replacement daemon is called lrud:
  A multi-threaded kernel process, also referred to as the page stealer
  Some memory (such as the 16 MB and 16 GB page pools) is not LRU-able
  (not managed by lrud) and can't be stolen
Memory is divided into one or more memory pools:
  There is one lrud thread for each memory pool
  Each memory pool has its own free list managed by that lrud

Figure 4-4. Major VMM functions


Notes:
Overview
The virtual memory manager (VMM) coordinates and manages all the activities
associated with the virtual memory system. It is responsible for allocating real memory
page frames and resolving references to pages that are not currently in real memory.

Free list
The VMM maintains a list of unallocated page frames that it uses to satisfy page faults,
called the free list.
In most environments, the VMM must occasionally add to the free list by stealing some
page frames owned by running processes. The virtual memory pages whose page
frames are to be reassigned are selected by the VMMs page stealer. The VMM
thresholds determine the number of frames reassigned.


When a process exits, its working storage is freed up immediately and its associated
memory frames are put back on the free list. However, any files the process may have
opened can stay in memory.
When a file system is unmounted, any cached file pages are freed.

Intent of the page replacement algorithm


The main intent of the page replacement algorithm is to ensure that there are enough
pages on the free list to satisfy memory allocation requests. The next most important
intent is to try to select page frames, to be stolen, which are unlikely to be referenced
again. It also ensures that computational pages are given fair treatment. For example,
the sequential reading of a long data file into memory should not cause the loss of
program text pages that are likely to be used again soon.

Least Recently Used Daemon (lrud)


On a multiprocessor system, page replacement is done through the lrud kernel
process. Page stealing occurs when the VMM page replacement algorithm selects a
currently allocated real memory page frame to be placed on the free list. Since that
page frame is currently associated with a virtual memory segment of a process, we
refer to this as stealing the page frame from that segment.
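
The lrud kernel process itself can be seen by listing kernel processes with ps. A small illustration; the PID and accumulated time shown here are arbitrary:

# ps -ek | grep lrud
   16392      -  0:42 lrud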

Memory pools
The lrud kernel process is multi-threaded with one thread per memory pool. Real
memory is split into memory pools based on the number of CPUs and the amount of
RAM.
While the memory statistics and memory tuning parameters often use global values,
internally the lrud daemon threads use per-memory-pool values, with the statistic
reports often showing values that are totals from all of the memory pools. This will
become significant later when we start to analyze the performance statistics.


Instructor notes:
Purpose Discuss the primary purpose of the VMM.
Details Explain how the page replacement algorithm maintains a free list for memory
allocation. Do not cover too much detail here, since the following visuals will provide the
details. Students should understand that they may see multiple memory pools and that each
memory pool has a separate free list for 4 KB and 64 KB pages. The free list management
that we will be covering occurs for each one of these free lists. Which pages are chosen to
be stolen is important, and we will be explaining the tunables that allow students to
influence that decision.
Additional information
Transition statement Lets define the types and classifications of segments.


VMM terminology

Segment types:
  Persistent: file caching for JFS
  Client: file caching for all other file systems, such as JFS2 or NFS
  Working: private memory allocations

(Diagram: a process and its threads using segments for program text (persistent or client), data files (persistent or client), process private stack and data (working), and shared library data (working))

Segment classification:
  Computational:
    Working segments
    Program text (binary object executables)
  Non-computational (file memory):
    Persistent segments
    Client segments

Figure 4-5. VMM terminology


Notes:
Overview
Virtual memory is divided into three types of segments that reflect where the data is
being stored:
- Persistent segments (for a local JFS filesystem)
- Client segments (for remote, Enhanced JFS, or CD-ROM filesystems)
- Working segments (backed by paging space)
The segment types differ mainly in the function they fulfill and in the way they are
backed to external storage when paging occurs.
Segments of a process include program text and process private. The program text
segments can be persistent or client, depending on the program executable segment
type. If the executable is on a JFS filesystem, then the type will be persistent.
Otherwise, it will be client. The private segments are working segments containing the
data for the process. For example, global variables, allocated memory and the stack.

Segments can be shared by multiple processes. For example, processes can share
code segments yet have private data segments.

Persistent segments
The pages of a persistent segment have permanent storage locations on disk. Files
containing data or executable programs are mapped to persistent segments.
When the VMM needs to reclaim a page from a persistent segment that has been
modified, it writes the modified information to the permanent disk storage location.
If the VMM chooses to steal a persistent segment page frame which has not been
modified, then no I/O is required. If the page is referenced again later, then a new copy
is read in from its permanent disk storage location.

Client segments
The client segments are used for all filesystem file caching except for JFS and GPFS.
(GPFS uses its own mechanism.) Examples of filesystems cached in client segments
are remote file pages (NFS), CD-ROM file pages, Enhanced JFS file pages and Veritas
VxFS file pages. Compressed file systems use client pages for the compress and
decompress activities.

Working segments
Working segments are transitory and exist only during their use by a process. They
have no permanent disk storage location and are therefore stored on disk paging space
if their page frames are stolen. Process stack and data regions are mapped to working
segments, as are the kernel text segment, the kernel extension text segments, as well
as the shared library text and data segments. The term text here refers to the binary
object code for an executable program; it does not cover the source code or the
executable files which are interpreted, such as executable shell scripts.

Computational versus file memory


Computational memory, also known as computational pages, consists of the pages that
belong to working storage segments or program text (executable files) segments.
File memory, also known as file pages or non-computational memory, consists of the
remaining pages. These are usually pages from permanent data files in persistent
storage (persistent or client segments).
The classification of memory as computational or non-computational becomes important
in the next few slides when we look at how VMM decides which page-frames to steal
when the free list runs low.
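
To see which segment types a particular process is using, the svmon process report can be consulted. A sketch, using the shell's own PID as an arbitrary example; the per-segment lines of the report include a Type column distinguishing work, pers, and clnt segments:

# svmon -P $$        (per-segment lines show a Type of work, pers, or clnt)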


Instructor notes:
Purpose Describe the types of segments and classification of segments.
Details
Additional information
Transition statement In a memory constrained environment, VMM must occasionally
replenish the free list by removing some of the current data from real memory. This page
stealing is discussed next.


Free list and page replacement

(Diagram: real memory holds non-pinned working storage, file cache (persistent: JFS; client: JFS2, NFS, and others), pinned memory (cannot be stolen), and the free list; working storage pages out to paging space, file cache pages out to the file system)

Memory requests are satisfied off the free list
Memory frames are stolen from allocated pages to replenish the free list
  Recently referenced and pinned pages are not stealable
If a stolen page frame is not backed with matching contents on disk (dirty):
  Working segment page saved to paging space (paging space page-out)
  Persistent or client segment page saved to file system (file page-out)
Later access requires page-in from disk

Figure 4-6. Free list and page replacement


Notes:
Introduction
A process requires real memory pages to execute.
Memory is allocated out of the free list. Memory is allocated either as working storage or
as file caching. As the free list gets short, VMM steals page frames that are already
allocated. Applications with root authority can pin critical pages, preventing them from
being stolen. Before a page is stolen, VMM needs to be sure that the data is stored in
persistent storage (ultimately a disk, even when using an NFS file system).
If the page frame to be stolen was read from a file system and never modified, then
nothing has to be saved. If it is working storage that was previously paged out to paging
space and has not been modified since then, once again, it does not need to be paged
out again. Note that when we page-in from paging space we do not delete the data from
the paging space.


If there is no matching copy of the data on disk, the contents of the page frame need to be
paged-out to either paging space (for working segment pages) or to a file system (for
persistent or client segment pages). The pages that do not have a copy on disk are
often referred to as dirty pages.
The most common way for frames to be placed on the free list is either for the owning
process to explicitly free the allocated memory or for VMM to free the frames allocated
to a process when the process terminates.
For file caching, once allocated in memory, the memory tends to stay allocated until
stolen to fill the free list. This is discussed in more detail later.
When a process references a virtual memory page that is on disk (because it either has
been paged out or has yet to be read in), the referenced page must be paged in.
When a process allocates and updates a working segment page, if there is not much
memory on the free list, this may require the stealing of currently allocated page frames
to maintain the free list. This can result in one or more pages being paged out. If an
application later accesses the page frame, it will need to be paged in from disk. This
requires I/O traffic which, as you know, is much slower than our normal memory access
speeds. If the free list is totally empty, the process will hang waiting on the memory
request (which in turn is waiting on the I/O request), seriously delaying the
progress of the process.
VMM uses the page stealer to steal page frames that have not been recently
referenced, and thus would be unlikely to be referenced in the near future. Page frames
which have recently been referenced are not considered to be stealable and are
skipped when scanning for pages to steal.
A successful page stealer allows the operating system to keep enough processes
active in memory to keep the CPU busy.
There are some page frames that cannot be stolen. These are called pinned page
frames or pinned memory. In the visual, the term pinned memory covers both
non-LRUable memory (reserved memory which the lrud daemon does not manage) and
memory that has been allocated and pinned by a process to prevent it from being
paged out by the lrud daemon.
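
One way to observe file caching filling memory: create a large file and compare the free pages and file pages counts before and after. This is only a sketch; the file name and size are arbitrary, and the exact vmstat -v labels may vary slightly by AIX level.

# vmstat -v | grep -E "free pages|file pages"
# dd if=/dev/zero of=/tmp/bigfile bs=64k count=4096
# vmstat -v | grep -E "free pages|file pages"

After the dd completes, file pages should have grown by roughly the size of the file (in 4 KB pages) while free pages shrank, and the file remains cached even though no process has it open.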


Instructor notes:
Purpose Describe the page replacement algorithm.
Details Focus on the main concepts in the visual.
Do not go into any detail on when the page stealer starts and stops, or what types of pages
it tries to free. These will be discussed in a later visual.
Additional information
The following information, used to be taught in previous versions of the course. It is not
formally part of the material because the internal details of how PFT is constructed and
used is not really needed to understand the basic principles and tuning for memory
performance issues. The most that should be said is that the lrud daemon threads try to
select page frames to be stolen using many criteria, only one of which is to attempt to
select page frames which are not likely to require an immediate repage. A simply (if not
perfect) predictor of this is whether that page as been recently accessed; this is reflected in
the use of the term: Least Recently Used (LRU).
Page Frame Table (PFT)
The VMM uses a Page Frame Table (PFT) to keep track of what page frames are in
use. The PFT includes flags to signal which pages have been referenced and which
have been modified. If the page stealer encounters a page that has been referenced,
then it does not steal that page, but instead resets the reference flag for that page. The
next time the page stealer considers that page for stealing and the reference bit is still
off, that page is stolen. A page that was not referenced in the first pass is immediately
stolen.
Rather than scanning the entire page frame list of a memory pool to find pages to steal,
the page frame list is divided into buckets of page frames. The page replacement
algorithm will scan the frames in the bucket and then start over on that bucket for a
second scan. Then, it will move on to the next bucket.
The modify flag indicates that the data on that page has been changed since it was
brought into memory. When a page is to be stolen, if the modify flag is set, then a
pageout call is made before stealing the page. Pages that are part of working segments
are written to paging space; persistent segments are written to disk.
Transition statement Let's look at how a short free list triggers page stealing to replenish
the list.


When to steal pages based on free pages

Begin stealing when the number of free pages in a mempool is less than minfree (default: 960)
Stop stealing when the number of free pages in a mempool is equal to maxfree (default: 1088)

(Diagram: the free list shrinking below minfree triggers page stealing; the free list growing back to maxfree stops it)

Figure 4-7. When to steal pages based on free pages


Notes:
Stealing based on the number of free pages
The VMM tries to keep the free list size greater than or equal to minfree so it can
supply page frames to requestors immediately, without forcing them to wait for page
steals and the accompanying I/O to complete. When the free list size falls below
minfree, the page stealer runs.
When the free list reaches maxfree number of pages, the page stealer stops.
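
If tuning is warranted, the thresholds can be inspected and changed with vmo. The values below are purely illustrative, not recommendations; a common guideline is to keep the gap between maxfree and minfree at least as large as the maximum read-ahead in use.

# vmo -o minfree -o maxfree                (display the current values)
# vmo -o minfree=1024 -o maxfree=1152      (change dynamically)
# vmo -p -o minfree=1024 -o maxfree=1152   (also persist across reboots)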


Instructor notes:
Purpose Describe when page stealing starts and stops, based only on the number of
free pages.
Details
Additional information The following information is from a visual that was in a previous
version of the course. It was dropped because the AIX6 (and later) tuning makes it very
rare and the tuning of it is restricted (in AIX6 and later).

maxclient triggered page stealing


There is another mechanism for triggering page stealing, but with AIX6 default tuning it
is not often invoked. This is because the threshold is set fairly high and the preference
to steal file pages before computational usually keeps the file cache below this
threshold. This will be explained in more detail on the following slide. The
strict_maxclient mechanism is covered here because some AIX 5L V5.3 systems,
which are poorly tuned, may see this mechanism being invoked. In AIX 6 and later, the
tuning parameters which control this are restricted (as indicated by the R icon next to
them on the visual).

Stealing based on the number of client pages


With the restricted tunable strict_maxclient=1 (the default), the page stealer may
start before the free list reaches minfree number of pages. With a single mempool,
when the number of client pages is less than the difference between the values of the
maxclient and minfree parameters, page stealing starts. For example, if maxclient is
743207 pages and minfree is 960 pages, then page stealing will start when number of
client pages (numclient) reaches 742248 pages.
Page stealing stops when the number of client pages is greater than the difference
between the values of the maxclient and maxfree parameters. For example, if
maxclient is 743207 pages and maxfree is 1088 pages, then page stealing will stop
when the number of client pages is down to 742118 pages.
This is implemented on a per-mempool basis, with each lrud thread comparing the
per-mempool minfree to its own maxclient threshold. The vmstat reported maxclient value
is the sum of the individual mempools' maxclient thresholds.

Stealing based on the number of persistent pages


There is also a matching mechanism to steal pages when persistent pages approach a
maxperm value. This mechanism is disabled by default and the strict_maxperm tuning
parameter that controls this is restricted.
Attention: The strict_maxperm option is a restricted tuning parameter and should only be
enabled for those cases that require a hard limit on the persistent file cache. Improper use
of the strict_maxperm option can cause unexpected system behavior because it changes
the VMM method of page replacement.
Transition statement How can we tell if a short free list is triggering page stealing?


Free list statistics

vmo reports the free list thresholds (per mempool and page size):
  minfree (default 960 pages)
  maxfree (default 1088 pages)
vmstat reports:
  The global size of the free list (all mempools):
    vmstat interval report, fre column
    vmstat -v, free pages statistic
  The free list size for each page size:
    vmstat -P ALL; vmstat -P 4KB; vmstat -P 64KB
  The number of mempools:
    vmstat -v, memory pools statistic
Multiply minfree and maxfree by the number of memory pools
before comparing to the per page size free list statistics.
A short free list does not prove there is a shortage of available memory.
  Memory may be filled with easily stolen file cache pages

Figure 4-8. Free list statistics


Notes:
minfree and maxfree vmo parameters
The following vmo parameters are used to make sure there are at least a minimum
number of pages on the free list:
- minfree
Minimum acceptable number of real memory page frames on the free list. When the
size of the free list falls below this number, the VMM begins stealing pages.
- maxfree
Maximum size to which the free list will grow by VMM page stealing. The size of the
free list may exceed this number as a result of processes terminating and freeing
their working segment pages or the deletion of files that have pages in memory.
The minfree and maxfree values are for each memory pool. Prior to AIX 5L V5.3, the
minfree and maxfree values were divided up among the memory pools.


In addition, the thresholds can be triggered by only one page size free list getting short.
Thus you should use vmstat -P ALL to see the free list values for each of the pages
sizes.

vmstat statistics
System administrators often ask if their free list is short. Both the vmstat iterative
report and the vmstat -v report provide a statistic on the size of the free list. It is
common to compare these to the minfree and maxfree thresholds. The correct way to
do this is first multiply the thresholds by the number of memory pools to obtain the total
of the thresholds for all memory pools. The report that should be used is: vmstat -P
ALL.
While there is a formula for how many mempools AIX will create given the number of
CPUs and the amount of real memory, this formula can change and it is better to ask
the system how many memory pools it has. The vmstat -v report provides a count of
the number of memory pools.
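
For example (hypothetical numbers): if vmstat -v reports 2 memory pools, then with the default thresholds the system-wide trigger points are 2 x 960 = 1920 pages (minfree) and 2 x 1088 = 2176 pages (maxfree), to be compared against the fre column of vmstat -P 4KB.

# vmstat -v | grep "memory pools"
                 2 memory pools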

What does a short free list mean?


In AIX, you should not use the free list size as your primary indication of memory
availability. This is because AIX, by default, uses memory for caching file system files.
And since the file stays cached in memory even after the program that used it
terminates, AIX memory quickly fills up with page frames of files that may not be
referenced again for a long time. This memory is easily stolen. If these file pages were
never modified, then VMM does not even have to page out the contents before stealing
them to be placed on the free list. As a result it is common to see the free list size down
near the minfree and maxfree thresholds.
Some system administrators will add the file pages statistic to the free list statistic to get
a better idea of how much memory is truly available. The current version of the svmon
global report provides a statistic that is designed to fill this need; it is labeled the
available value.
If the free list is constantly below minfree and even approaches or reaches zero, then
that may indicate that VMM is having trouble stealing page frames fast enough to
maintain the free list. That may be a trigger to look at other statistics to fully understand
the situation.


Instructor notes:
Purpose Describe how to obtain and interpret the free list statistics and thresholds.
Details
Additional information
Transition statement Let's look at the commands that allow us to see the situation with
the free list and any related page stealing.


Displaying memory usage (1 of 2)

# vmstat 5
System configuration: lcpu=4 mem=512MB ent=0.80

kthr     memory               page                     faults              cpu
----- ------------- ------------------------- --------------- ------------------------
 r  b    avm   fre  re pi po    fr    sr cy    in    sy   cs  us sy id wa   pc   ec
 2  1 152282  2731   0  0  0 14390 50492  1   479 10951 4525   1 34 52 14 0.29 36.7
 1  1 152283  2669   0  0  0 13843 45696  1   599  9910 4872   1 34 52 14 0.29 36.2
 1  1 152283  2738   0  0  0 14616 49573  1   503 10445 4716   1 34 52 13 0.29 36.6
 0  1 152280  2639   0  0  0 13802 46128  1   375 11108 7984   1 38 49 11 0.33 40.9

# svmon -G -O pgsz=off
Unit: page
--------------------------------------------------------------------------------------
              size      inuse       free        pin    virtual  available   mmode
memory      131072     128431       2641      82554     159754       4993     Ded
pg space    131072      49897

              work       pers       clnt      other
pin          73930          0          0       8624
in use      115268          0      13163

Figure 4-9. Displaying memory usage (1 of 2)


Notes:
The vmstat -I and svmon -G commands
The vmstat command reports virtual memory statistics. The -I option includes I/O
oriented information, including fi (file page ins/second) and fo (file page outs/second).
The svmon -G command gives an overall picture of memory use.

Breakdown of real memory


The size field of the svmon -G output shows the total amount of real memory on the
system. The following svmon -G fields show how the memory is being used:
- The free field displays the number of free memory frames
- The work field in the in use row displays the number of memory frames containing
working segment pages


- The pers field in the in use row displays the number of memory frames containing
persistent segment pages
- The clnt field in the in use row displays the number of memory frames containing
client segment pages
These four fields add up to the total real memory.
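For example, from the svmon -G output on the visual: 2641 (free) + 115268 (work) + 0 (pers) + 13163 (clnt) = 131072 pages, which matches the size field exactly.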

Computational memory
In the vmstat output, avm stands for active virtual memory and not available memory.
The avm value in the vmstat output and the virtual value in the svmon -G output
indicate the active number of 4 KB virtual memory pages in use at that time. (Active
meaning that the virtual address has a page frame assigned to it.) The vmstat avm
column will give the same figures as the virtual column of svmon except in the case
where deferred page space allocation is used. In that case, svmon shows the number of
pages actually paged out to paging space, whereas vmstat shows the number of virtual
pages accessed but not necessarily paged out.
In the svmon -G report, if no paging has occurred, then the virtual value will be equal
to the work field in the in use row. But if paging has occurred, then you cannot make
that assertion.
The avm (vmstat) and virtual (svmon -G) numbers will grow as more processes get
started and/or existing processes allocate more working storage. Likewise, the numbers
will shrink as working segment pages are released. They can be released in two ways:
- Owning process can explicitly free them
- Kernel will automatically free them when the process terminates
The avm (vmstat) and virtual (svmon -G) statistics do not include file pages.
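As a worked example using the numbers from the visual: avm = 152282 pages x 4 KB, or roughly 595 MB of active virtual memory, which exceeds the 512 MB of real memory (mem=512MB). Some working pages must therefore reside only in paging space, which is consistent with the nonzero pg space inuse value (49897 pages) in the svmon -G output.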

Free frames
The fre value in the vmstat output and the free field in the svmon -G output indicate
the average amount of memory (in units of 4KB) that is currently on the free list. When
an application terminates, all of its working pages are immediately returned to the free
list. Its persistent pages (files), however, remain in RAM and are not added back to the
free list until they are stolen by the VMM for use by other programs. Persistent pages
are also freed if the corresponding file is deleted or the file system is unmounted.
For these reasons, the fre value may not indicate all the real memory that can be
readily available for use by processes. If a page frame is needed, then persistent pages
previously referenced by terminated applications are among the first to be stolen and
placed on the free list.


Paging rates
The fi and fo fields show the file page ins and file page outs per second. This
represents I/O to and from a filesystem.
The pi and po fields show the paging space page ins and paging space page outs for
working pages.

Scanning rates
The number of pages scanned is shown in the vmstat sr field. The number of pages
stolen (or freed) is shown in the vmstat fr field. The ratio of scanned to freed
represents relative memory activity. The ratio will start at 1 and increase as memory
contention increases. It is interpreted as having to scan sr pages in order to find fr
pages to free.
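For instance, using the first interval of the earlier vmstat output: sr=50492 and fr=14390, so roughly 50492 / 14390 = 3.5 pages were scanned for every page freed, suggesting moderate memory contention.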


Instructor notes:
Purpose Show how memory pages are being used.
Details
Additional information With the current deferred allocation policy, the paging space
disk blocks are allocated at the time the pages are actually paged out. With the previous
policy of late page space allocation, avm was equivalent to paging space blocks since the
VMM would allocate one paging space disk block for each working page that was
accessed. The paging space policies will be discussed later in this unit.
Transition statement The vmstat report also has the ability to display statistics for
each different page size. Let's see what that looks like and why it is important.


Displaying memory usage (2 of 2)

# vmstat -P 64KB 5
System configuration: mem=512MB

pgsz           memory                          page
----- ---------------------- --------------------------------------
        siz    avm   fre   re   pi   po    fr    sr   cy
 64K   2046   1992    91    0    0    0     0     0    0
 64K   2046   1992    91    0    0    0     0     0    0
 64K   2046   1992    91    0    0    0     0     0    0
 64K   2046   1992    91    0    0    0     0     0    0

# vmstat -P 4KB 5
System configuration: mem=512MB

pgsz           memory                          page
----- ---------------------- --------------------------------------
        siz     avm    fre   re   pi   po    fr    sr   cy
  4K  98336  120801   1107    0    0    0 10887 16625    0
  4K  98336  120799   1141    0    0    0 14754 25080    0
  4K  98336  120798   1145    0    0    0 11466 17164    0
  4K  98336  120798   1139    0    0    0 14808 25154    0

Figure 4-10. Displaying memory usage (2 of 2)


Notes:
The vmstat command has an option (-P) which accepts a value of 4KB, 64KB, or ALL. This
option will show the memory related statistics broken down by page size.
This is important since the lrud threads will trigger page stealing whenever either page size's
free amount (in a given mempool) falls below the minfree threshold. In most cases, it will be
the 4KB page size statistics that show the low free amount and the page stealing activity.


Instructor notes:
Purpose Introduce per page-size statistics.
Details
Additional information
Transition statement Once page stealing has been triggered, there are various factors
which influence which pages are stolen. Obviously the system wants to select pages based
on the least recently used principle, but there are other factors. Let us look at an important
one.


What types of pages are stolen?

lru_file_repage = 0 (AIX6 default; restricted tunable):
  numperm > minperm: tries to steal only file pages
    (non-computational, either persistent or client)
  numperm < minperm: steals the least recently used pages

lru_file_repage = 1:
  numperm > maxperm: tries to steal only file pages
  numperm > minperm AND numperm < maxperm:
    If file repage rate > computational repage rate,
    then steal computational pages, else steal file pages
  numperm < minperm: steals the least recently used pages

Thresholds: maxperm (from maxperm%, default=90%);
            minperm (from minperm%, default=3%)
File cache location optimized: page_steal_method=1 (list-based; restricted tunable)

Figure 4-11. What type of pages are stolen?


Notes:
Overview
The decision of which page frames to steal when the free list is low is crucial to the
system performance. VMM wants to select page frames which are not likely to be
referenced in the near future. When the page stealer is running frequently (due to a low
free list), the record of what has been recently referenced is very short term. Just
because a working page frame has not been referenced since the last lrud scan does
not mean it will not soon be referenced again.
There are usually many more memory pages being used for file cache that are unlikely to
be re-referenced than there are computational memory pages that will not be
re-referenced. There is also likely to be a higher cost to stealing a working segment
page frame as compared to a persistent or client segment page, because file cache
contents (often read from disk but not modified) frequently do not need to be paged
out. Also, if file cache pages are paged out due to page stealing, this is I/O that
eventually would be needed anyway (to flush the application write to disk).

The default AIX6 behavior aggressively chooses to steal from file cache rather than
computational memory.
Note that the numeric thresholds (maxperm and minperm) are calculated from tuning
parameters which are percentages. As such, the numeric thresholds are not modified
using the vmo command; they are classified as static tunables. Of the discussed
percentages, you should only modify minperm%. The maxperm% and lru_file_repage
parameters are restricted in AIX 6 and later.

numperm < minperm


If the percentage of RAM occupied by file pages falls below minperm, then any page
(file or computational) that has not been referenced can be selected for free list
replacement.

lru_file_repage=0 (default in AIX6 and later) and numperm > minperm


The page stealer tries to steal only file pages when the file page count is above the
minperm threshold.
Since the main purpose of the page stealer page scan is to locate file cache pages, the
AIX6 default method for locating file cache pages is to use a list-based method
(page_steal_method=1). All of the file cache pages are on a list. The alternative
(and default in AIX 5L V5.3) is to sequentially search through the page frame table
(page_steal_method=0) looking for pages which are file cache. This alternative is
less efficient.
Note
The lru_file_repage parameter is an AIX6 Restricted tunable. It should not be
changed unless instructed to do so by AIX Support.
If working in an AIX 5L V5.3 environment, it is generally recommended that you set

lru_file_repage=0.

lru_file_repage=1 and numperm > maxperm


If the percentage of RAM occupied by file pages rises above maxperm, then the
preference is to try and steal file pages.
Note: In the two above cases, it is stated that only file pages will be stolen, but there are
some circumstances where computational pages can and will be stolen when numperm is
in these ranges. One example is the situation where there are no file pages in a stealable
state. By this we mean that the file pages are so heavily referenced, that we are unable to
find any with the reference bit turned off. This will drive the free list to 0 and we MAY start
stealing computational pages. This is why setting minfree at a high enough number to start
reclaiming pages is so important.

lru_file_repage=1 and numperm is between minperm and maxperm


If the percentage of RAM occupied by file pages is between minperm and maxperm, then
page replacement may steal file or computational pages depending on the repage rates
of computational versus non-computational (file) pages.
There are two types of page faults:
- New page fault which occurs when there is no record of the page having been
referenced
- Repage fault which occurs when a page is stolen using lrud and then is
re-referenced and has to be read back in
When lru_file_repage=1, if the value of the file repage counter is higher than the
value of the computational repage counter, then computational pages (which are the
working storage) are selected for replacement. If the value of the computational repage
counter exceeds the value of the file repage counter, then file pages are selected for
replacement.
Experiments have shown that the effective result is that both file and computational
pages are stolen somewhat equally when numperm is in this range.

4-36 AIX Performance Management


Course materials may not be reproduced in whole or in part
without the prior written permission of IBM.

Copyright IBM Corp. 2010

V5.4
Instructor Guide

Uempty

Instructor notes:
Purpose Identify the types of pages that will be stolen.
Details
Additional information
Transition statement The various memory statistics that we have been discussing can get confusing. Let us clearly define the meaning of the memory statistics reported by vmstat and svmon.

Values for page types and classifications

- JFS pages are in persistent type segments and reported by:
  - svmon -G as the pers value
- JFS2 and NFS pages are in client type segments and reported by:
  - svmon -G as the clnt value
- Page frames which are classified as non-computational are reported by:
  - vmstat -v as the file pages and numperm percentage values, regardless of segment type
  - vmstat -v as the client pages and numclient percentage values, if in client segments

Figure 4-12. Values for page types and classifications

Notes:
The file pages value
The file pages value, in the vmstat report, is the number of non-computational (file
memory) pages in use. This is not the number of persistent pages in memory because
persistent pages that hold program text (executable files) are considered computational
pages.

The numperm percentage value


The numperm percentage value is the file pages value divided by the amount of
manageable memory (the lruable memory value) and expressed as a percentage.
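As a worked example, using the vmstat -v output shown on an upcoming visual: 23737 file pages / 109312 lruable pages ≈ 0.217, which vmstat reports as a numperm percentage of 21.7.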

The client pages value


The client pages value, in the vmstat report, is the number of non-computational (file
memory) pages in use, which are in client segments. This is not the number of client
segment pages in memory because client pages that hold program text (executable
files) are considered computational pages.

The numclient percentage value


The numclient percentage value is the client pages value divided by the amount of
manageable memory (the lruable memory value) and expressed as a percentage.

The pers value

The pers value, in the svmon report, is the number of persistent segment pages. It includes both computational and non-computational pages that are in persistent segments (JFS).

The clnt value

The clnt value, in the svmon report, is the number of client pages. It includes both computational and non-computational pages that are in client segments (such as JFS2).

Instructor notes:
Purpose Describe some of the file caching statistics in the vmstat and svmon reports.
Details Point out that because of the differences described, the system administrator
should not expect the values in the svmon report to match the values in the vmstat report,
even if the system could be frozen with no changes to memory.
Additional information
Transition statement Let's look at an example of the vmstat -v report that we have been discussing.

What types of pages are in real memory?


# vmstat -v
131072 memory pages
109312 lruable pages
2625 free pages
1 memory pools
82309 pinned pages
80.0 maxpin percentage
3.0 minperm percentage
90.0 maxperm percentage
21.7 numperm percentage
23737 file pages
0.0 compressed percentage
0 compressed pages
21.7 numclient percentage
90.0 maxclient percentage
23737 client pages
0 remote pageouts scheduled
28 pending disk I/Os blocked with no pbuf
47304 paging space I/Os blocked with no psbuf
2484 filesystem I/Os blocked with no fsbuf
0 client filesystem I/Os blocked with no fsbuf
215 external pager filesystem I/Os blocked with no fsbuf

Figure 4-13. What types of pages are in real memory?

Notes:
Type of workload
In a particular workload, it might be worthwhile to emphasize the avoidance of stealing
file cache memory. In another workload, keeping computational segment pages in
memory might be more important. To get the file cache (and other statistics), use the
vmstat -v command. If PerfPMR was run, the output is in vmstat_v.before and
vmstat_v.after.

What to look for


If your system is primarily I/O intensive, you will want to have more file caching, as long
as it does not result in computational pages being stolen and paged to paging space. In
the displayed example, the 512 MB memory is used by:
- Pinned memory (about 320 MB, leaving 192 MB for lrud to manage)


- The free list (for both page sizes) needs a minimum of almost 7.5 MB and will attempt to increase to about 8.5 MB
- File cache is only about 92 MB
- The rest (about 92 MB) is being used by computational pages
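To verify these numbers yourself (each page frame is 4 KB): 131072 memory pages × 4 KB = 512 MB total; 82309 pinned pages × 4 KB ≈ 321 MB; 23737 file pages × 4 KB ≈ 93 MB of file cache.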
Note that the numclient percentage does not come near the maxclient threshold, thus
any page stealing is the result of a short free list.
If you are seeing page stealing (as in the previous vmstat iterative report), this must be
because the memory is overcommitted. The example has I/O intensive processes trying
to do massive amounts of file caching with only 92 MB of memory available to them.
Remember that the priority use of memory is in the following order:
i. Non-LRUable memory
ii. LRUable but pinned memory
iii. Free list between minfree and maxfree
iv. Computational memory
v. File cache memory
Due to lru_file_repage=0, the last two items are only equal in priority when file
caching is less than 3% of lruable memory.
The real problem here is that we do not have enough real memory. The minimum
memory for AIX is 512 MB, which is the allocation seen in this example. If more memory
were to be added, the rate of page stealing would likely go down and the amount of file
cache memory would go up.

Instructor notes:
Purpose Identify the statistics for file pages.
Details Do not get into the blocked I/O statistics here. They will be discussed in the next
unit.
Additional information
Transition statement Let us also look at the svmon report which we have been
discussing. We will also use this report to identify if memory is over committed.

Is memory over committed?

Memory is considered overcommitted if the number of pages currently in use exceeds the real memory pages available. The number of pages currently in use is the sum of the:
- Virtual pages
- File cache pages

Example:
# svmon -G -O unit=MB
Unit: MB
---------------------------------------------------------------------------------
              size      inuse       free        pin    virtual  available  mmode
memory      512.00     501.79       10.2     320.82     596.55       26.9    Ded
pg space    512.00     197.62

              work       pers       clnt      other
pin         287.13          0          0       33.7
in use      445.98          0       55.8

  Virtual pages      = 596.55 MB
+ File cache pages   =  55.8 MB
--------------------------------------------------------
  Total pages in use = 652.35 MB vs. real memory = 512 MB

Figure 4-14. Is memory over committed?

Notes:
What happens when memory is over committed?
A successful page replacement algorithm keeps the memory pages of all currently
active processes in RAM, while the memory pages of inactive processes are paged out.
However, when RAM is overcommitted, it becomes difficult to choose pages to page out, because many of them will be re-referenced in the near future by currently running processes. The result is that pages that will soon be referenced still get paged out and
then paged in again later. When this happens, continuous paging in and paging out may
occur. This is referred to as paging space thrashing or simply page thrashing. The
system spends most of its time paging in and paging out instead of executing useful
instructions, and none of the active processes make any significant progress.

How do you know if memory is over committed?


If the vmstat reports are showing a high volume of paging space page-ins and
page-outs, then it is quite clear that memory is overcommitted. But sometimes memory
is overcommitted and a high volume of file cache page stealing is impacting I/O performance. Or neither is happening yet, but you are at high risk of one or the other happening. Examination of the svmon report can be helpful.
Use the svmon -G command to get the amount of memory being used and compare that
to the amount of real memory. To do this:
- The total amount of real memory is shown in the memory size field.
- The amount of memory being used is the total of:
  - The virtual pages shown in the memory virtual field
  - The persistent pages shown in the in use pers field
  - The client pages shown in the in use clnt field
- Officially, if the amount of memory being used is greater than the amount of real
memory, your memory is overcommitted.
The example in the visual is officially overcommitted.
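This check can be roughly automated. The following is a minimal sketch (not an official tool) that assumes the svmon -G -O unit=MB layout shown in the visual; the awk field positions may need adjusting on other svmon releases:

# svmon -G -O unit=MB | awk '
  $1 == "memory"            { size = $2; virtual = $6 }
  $1 == "in" && $2 == "use" { filecache = $4 + $5 }    # pers + clnt columns
  END {
    inuse = virtual + filecache
    printf("Total in use = %.2f MB vs. real memory = %.2f MB (%s)\n",
      inuse, size, ((inuse > size) ? "overcommitted" : "not overcommitted"))
  }'

With the numbers in the visual, this reports a total of 652.35 MB in use against 512.00 MB of real memory: overcommitted.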
There is also an available field. This statistic is intended to identify how much
memory might be available. Because of the tendency for AIX to cache as much file
contents as possible and thus leaving the free list fairly small, this available statistic
is a much better single statistic measurement of the memory situation, then using the
size of the free list. While it is not clearly documented how this value is calculated, it is
affected by the amount of cache memory. In situations where the system has little new
memory demand and memory is filled with file caching, the available statistic could
show a rather large number, even though the free list might be fairly short.

Instructor notes:
Purpose Determine if memory is over committed.
Details Go over the example with the students. In this example, even if we were not
using file cache, memory would be over committed. The virtual space itself is more than
real memory.
Additional information
Transition statement One source of unnecessary memory demand is a memory leak.
Let's look at what this is.

Memory leaks

- A memory leak is a program error that consists of repeatedly allocating memory, using it, and then neglecting to free it
- Systems have been known to run out of paging space because of a memory leak in a single program
- Tools to help detect a memory leak include:
  - vmstat
  - ps gv
  - svmon -P
- Periodically stopping and starting the program will free up the memory
Figure 4-15. Memory leaks

Notes:
What is a memory leak?
A memory leak occurs when a process allocates memory, uses it, but never releases it.
Memory leaks typically occur in long running programs. Over time, the process will
either:
- Allocate all of its addressable virtual memory which may cause the process to abort
- Fill up memory with unused computational pages, resulting in increased stealing of
non-computational memory, until the numperm is reduced below minperm at which
point some computational memory is paged to paging space.
- Use up the paging space, causing the kernel to take protective measures to
preserve the integrity of the operating system by killing processes to avoid running
out of paging space.
- Cause pinned memory (if a pinned memory leak) to grow to the maxpin threshold.
Even before that happens the system may experience significant page thrashing as
processes fight over the remaining unpinned memory. If the maxpin threshold is
reached, the results are unpredictable - since it depends on what processes are
requesting pinned memory. It could result in a system hang or crash.

Detecting a memory leak


Three commands that help to detect a potential memory leak are vmstat, ps gv, and
svmon -P.

Dealing with a memory leak


The best solution is to fix the coding errors in the program that is leaking memory. That
is not always possible, especially in the short term.
The common solution is to periodically quiesce and stop the program, and then start it back up. This will cause all computational memory allocated by that program to be freed and placed back on the free list. All related paging space will also be freed.

4-48 AIX Performance Management


Course materials may not be reproduced in whole or in part
without the prior written permission of IBM.

Copyright IBM Corp. 2010

V5.4
Instructor Guide

Uempty

Instructor notes:
Purpose Define a memory leak.
Details Do not go into detail on the commands to detect a memory leak. Those will be
described on the next visuals.
Additional information
Transition statement What is our initial clue that a memory leak may be occurring?

Detecting a memory leak with vmstat

By using the vmstat command to monitor virtual memory usage, a memory leak would be detected by a continual increase in the avm number over time:

# vmstat 3 10
System configuration: lcpu=2 mem=3792MB

kthr     memory               page                     faults         cpu
----- --------------- ------------------------- ---------------- -----------
 r  b    avm     fre  re pi po fr  sr cy   in  sy  cs  us sy id wa
 0  0 136079  817842   0  0  0  0   0  0   81 518 191   2  1 97  0
 0  0 137402  816518   0  0  0  0   0  0   50 172 179   0  0 99  0
 0  0 139322  814598   0  0  0  0   0  0   65 176 182   1  1 98  0
 0  0 141190  812730   0  0  0  0   0  0   65 477 183   1  0 99  0
 0  0 143350  810570   0  0  0  0   0  0   82 174 194   2  0 98  0
 0  0 145513  808407   0  0  0  0   0  0   88 172 180   1  0 99  0
 0  0 146313  807607   0  0  0  0   0  0   50 161 173   0  1 98  0
 0  0 146319  807601   0  0  0  0   0  0    4 459 172   0  0 99  0
 0  0 146319  807601   0  0  0  0   0  0    4 146 169   0  0 99  0
 0  0 146319  807601   0  0  0  0   0  0    2 232 202   0  0 99  0

Figure 4-16. Detecting a memory leak with vmstat

Notes:
What to look for with vmstat
The classic indicator of a memory leak is a steady increase in the active virtual memory
(avm column for vmstat).
In addition, you may notice a steady increase in the amount of paging space being used. However, on a large memory system it may take some time to notice this effect.
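A minimal monitoring sketch (assuming ksh or sh; avm is the third column of the vmstat data line, as in the report above) that timestamps samples so growth over hours or days is easy to spot:

# while true
> do
>   echo "$(date '+%Y-%m-%d %H:%M:%S') $(vmstat 1 1 | tail -1 | awk '{print $3}')" >> /tmp/avm.log
>   sleep 300
> done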


Instructor notes:
Purpose Use vmstat to detect a memory leak.
Details The increase in avm, as shown by vmstat, can indicate a potential memory leak.
However, emphasize to the students that an increase in avm could be normal activity!
Additional information
Transition statement If we think there is a memory leak somewhere on the system,
then the next step is to try to identify which process is the culprit.

Detecting a memory leak with ps gv

After a suspected memory leak has been established using the vmstat command, the next step is to identify the offending process:
- Capture the output of a ps gv command
- Let some time go by
- Capture a second set of output with the ps gv command
- Compare the SIZE columns from the two sets of data to see which programs' heaps have grown

# ps vg
   PID    TTY STAT  TIME PGIN  SIZE   RSS   LIM TSIZ  TRS %CPU %MEM COMMAND
...
315632  pts/0    A  0:00    0  9008  9016 32768    8    8  0.0  1.0 ./exmem
...

<some time later>

# ps vg
   PID    TTY STAT  TIME PGIN  SIZE   RSS   LIM TSIZ  TRS %CPU %MEM COMMAND
...
315632  pts/0    A  0:00    0 51324 51332 32768    8    8  0.0  8.0 ./exmem
...

Figure 4-17. Detecting a memory leak with ps gv

Notes:
Using ps gv to find the offending process
Isolating a memory leak can be a difficult task because the programming error may
exist in an application program, a kernel process or the kernel (for example, kernel
extension, device driver, filesystem, and so forth).
To find the offending process, look for a growing delta in the SIZE field between multiple
ps vg runs. The SIZE column is the virtual size of the data section of the process (in 1
KB units), which represents the private process memory requirements. If this number is
increasing over time, then this is a memory leak process candidate.
This is not an absolute rule. The growth in the virtual size may be a normal trend of
increased workload.
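The before-and-after comparison can be scripted. Here is a minimal sketch of such a helper (hypothetical, not a shipped tool; a similar script is used in the lab exercise for this unit). It relies on SIZE being the sixth column of ps gv output:

#!/usr/bin/ksh
# Compare two ps gv snapshots and report processes whose SIZE grew.
ps gv > /tmp/ps1.out
sleep 600                                   # let some time go by
ps gv > /tmp/ps2.out
awk 'NR == FNR { size[$1] = $6; next }      # pass 1: remember SIZE by PID
     ($1 in size) && ($6 > size[$1]) {      # pass 2: report growth
       printf("PID %s SIZE grew from %s KB to %s KB (%s)\n",
         $1, size[$1], $6, $NF)
     }' /tmp/ps1.out /tmp/ps2.out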


Instructor notes:
Purpose Use ps gv to detect a potential memory leak.
Details Highlight that the increase in the SIZE field may be normal. But, it could also
indicate a potential memory leak.
Additional information A script can be written that will take the before and after ps gv
output and merge it into a comparison report. The students will use a script like this in a lab
exercise for this unit.
Transition statement Up to this point, we have been working with an environment
where the memory allocated to AIX is dedicated. With PowerVM enhancements, it is
possible to share memory with other LPARs. Let us look at this shared memory
environment.

Active memory sharing: Hierarchy

The visual contrasts dedicated memory LPARs, whose memory is allocated directly from physical memory, with shared memory (AMS) LPARs. Three AMS LPARs, each defined with realmem=2GB, are allocated with over-commitment from a 4 GB shared memory pool managed by the Power Hypervisor (phype). AMS paging spaces, served by a VIOS partition, back the shared pool.

Figure 4-18. Active memory sharing: Hierarchy

Notes:
With POWER6 or POWER7 servers and the proper level of firmware and software,
PowerVM Enterprise Edition allows the creation of a shared memory pool. An LPAR can be
created to either use dedicated memory (allocated memory out of physical memory at
activation) or use shared memory (allocated memory out of the shared pool at activation).
The physical memory allocated to the shared memory pool is not available to be allocated
to any LPAR using dedicated memory. The function and management of the shared
memory pool is referred to as Active Memory Sharing (AMS).
AMS allows the memory sharing LPARs to overcommit their allocations. Thus, even though the example in the visual has only 4 GB in the shared pool, the total allocation of the AMS LPARs is 6 GB. If the three LPARs simultaneously need more logical memory than the physical memory of the shared pool, some memory contents will need to be paged out, either to the individual AIX LPARs' paging spaces, or to the AMS paging spaces maintained in the VIOS partition. AMS works best when the partitions complement each other in their patterns of memory usage; in other words, when one LPAR has high memory demand, another LPAR has low memory demand.


Instructor notes:
Purpose Provide an overview of active shared memory concepts.
Details
Additional information
Transition statement Let's take a closer look at the mechanisms used in AMS.

Active memory sharing: Loaning and stealing

The visual diagrams the flow between AIX real memory states and the AMS mechanisms:
- An AIX partition that needs more physical memory requests a physical page from the Power Hypervisor and receives one.
- An AIX partition that has memory available either loans pages from its free list when the hypervisor requests a loan (inuse+free is decreased), or has pages stolen by the hypervisor (such pages appear to be free or in use, but are actually stolen).
- Pages that are neither loaned nor stolen remain backed by physical memory.

svmon -G report: size = inuse + free + loaned

Figure 4-19. Active memory sharing: Loaning and stealing

Notes:
Shared memory is expected to be overcommitted. Yet, the total of all the AMS LPARs' entitlements cannot be simultaneously backed by physical memory. The real memory that is backed by physical memory is reported in the vmstat -h report in the pmem field. In the following discussion, an AIX LPAR which needs more physical memory will be referred to as the requesting LPAR, and the AIX LPAR from which physical memory is taken will be referred to as the donating LPAR.
As a result of real memory not being backed by physical memory, when an AIX process accesses a page, there may not be a physical page frame to assign to it.
- To avoid that (or in response to that situation), AIX will request that the Power Hypervisor (phype) provide physical memory frames for the logical memory frames it is using as its real memory.
- The hypervisor will then assign physical memory frames to the requesting AIX partition.


- In order for the Power Hypervisor to fulfill these types of requests, it may need to
request that some other partition loan it some physical memory frames which are
currently backing AIX real memory frames.
- The donating LPAR will likely take frames off the free list to satisfy the request and
may need to steal memory to replenish the free list. It may even need to page-out
current memory contents to its own paging space to do this. These are considered
loaned page frames.
- If the donating LPAR does not loan frames, then the hypervisor may steal page frames from the donating AIX LPAR. AIX provides hints to the hypervisor to help the phype choose which page frames to steal. If the frame chosen is dirty (modified contents not stored on disk), then the hypervisor will save the contents to a paging space provided for this purpose in the virtual I/O server.
AIX collaborates with the Power Hypervisor to help with hypervisor paging. In response to the hypervisor requests, AIX checks once a second to determine if the hypervisor needs memory. In the case where the hypervisor needs memory, AIX will free up logical memory pages (which become loaned pages) and give them to the hypervisor. The policy used to free up logical memory is tunable via the vmo ams_loan_policy tunable in AIX. The default is to loan frames. One could configure AIX to not loan frames, but that is not generally advisable; AIX can be much more intelligent about which page frames to loan than the phype can be about which frames to steal, even with the AIX provided hints.
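To display the current setting on a given partition (a quick check; vmo -o with no value assignment shows a tunable's current value):

# vmo -o ams_loan_policy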

In an AIX LPAR which is using AMS, there are three possible situations for any given page frame:
- The page frame is backed by physical memory. This is included in the (vmstat -h) pmem value.
- AIX has loaned the frame off of the free list. This may have required lrud page stealing. The (svmon) free+inuse total would decrease and the loaned value would increase. The (vmstat -h) pmem value would decrease.
- AIX provided hints to the Power Hypervisor (phype) about which page frames are least critical, and the phype stole the page frame. The (svmon) free, inuse, and loaned values would be unaffected. The (vmstat -h) pmem value would decrease.
In a dedicated memory LPAR, the (svmon) size field equals the total of inuse and free. In an active memory sharing LPAR, the (svmon) size field equals the total of inuse, free, and loaned.

Instructor notes:
Purpose Provide an overview of shared memory loaning and stealing mechanisms.
Details
Additional information
Transition statement Let us look at examples of the vmstat and svmon reports in an AMS environment.

Displaying memory usage with AMS

# vmstat -h 2
System configuration: lcpu=4 mem=1024MB ent=0.30 mmode=shared mpsz=1.50GB

kthr    memory             page               faults      cpu            hypv-page
----- ------------- ---------------------- ----------- ------------ ---------------------
 r  b    avm    fre re pi po fr sr cy   in  sy  cs us sy id wa   pc  ec  hpi hpit pmem loan
 0  0 130019  35592  0  0  0  0  0  0    1  85 220  0  1 98  0 0.01 3.0   0    7 0.60 0.26
 0  0 130020  35591  0  0  0  0  0  0    0  15 208  0  1 99  0 0.01 2.2   0    0 0.60 0.26
 0  0 130020  35674  0  0  0  0  0  0    0  19 198  0  1 99  0 0.01 2.2   0    0 0.60 0.26
 0  0 130021  35673  0  0  0  0  0  0    3  66 207  0  1 99  0 0.01 2.6   0    0 0.60 0.26

Note: pmem is in units of gigabytes; loan is a percentage of real memory

# vmstat -hv | egrep -i 'loan|virtual'
             2612 Virtualized Partition Memory Page Faults
             7756 Time resolving virtualized partition memory page faults
            64674 Number of 4k page frames loaned
               24 Percentage of partition memory loaned

# svmon -G -O unit=MB,pgsz=on
Unit: MB
------------------------------------------------------------------------------------
             size     inuse      free       pin   virtual available   loaned  mmode
memory    1024.00    624.05    140.13    258.11    509.06    211.14   259.82   Shar
pg space  1536.00      4.09

             work      pers      clnt     other
pin        209.94         0         0      48.2
in use     509.06         0    114.99

Figure 4-20. Displaying memory usage with AMS

Notes:
The vmstat and svmon commands have new fields to display AMS related information.
For vmstat there is a new flag (-h) to request hypervisor page information. When the new
option is used with an iterative monitoring mode, vmstat shows four new fields on each
iteration:
hpi - Number of hypervisor page-ins.
hpit - Time spent in hypervisor page-ins in milliseconds.
pmem - Amount of physical memory that is backing the logical memory of
partitions. The value is measured in gigabytes.
loan - The percentage memory loaned
In addition, the initial system configuration line of the vmstat report has two new fields:
mmode - The memory mode (dedicated or shared).
mpsz - The size of the shared memory pool

When the -h option is used in combination with the -v option, there are four new counters:
Time resolving virtualized partition memory page faults - The total time that the
virtual partition is blocked to wait for the resolution of its memory page fault. The
time is measured in seconds, with millisecond granularity.
Virtualized partition memory page faults - The total number of virtual partition
memory page faults that are recorded for the virtualized partition.
Number of 4 KB page frames loaned - The number of the 4 KB pages of the
memory that is loaned to the hypervisor in the partition.
Percentage of partition memory loaned - The percentage of the memory loaned
to the hypervisor in the partition.
When you request the svmon global report in an AMS environment (and you specify any
-O option), you get two new fields:
loaned - The amount of memory loaned
mmode - The mode of memory, in this case: Shar
Traditionally, the inuse plus free statistics added up to the size statistics. Memory was
either in use or it was free. (Of course the svmon collects statistics at different points in time
and thus it is possible for the displayed statistics to not add up to exactly equal the total real
memory).
With the ability of Active Memory Sharing, we have a new category of memory which is
neither inuse nor on the free list: loaned memory. Thus, on a POWER6-based machine the
new formula is:
size = inuse + free + loaned
If the formula is applied to the example in the visual, we get:
1024 = 624.05 + 140.13 + 259.82,
which is correct.

Instructor notes:
Purpose Show and explain the AMS fields on the vmstat and svmon reports.
Details Relate the fields to the concepts just covered.
Additional information When the phype steals memory pages, the hpi and hpit values
will increase to reflect that activity.
Transition statement Another POWER server innovation for memory management is
the ability to provide expanded memory. Let's examine how this works.

Active Memory Expansion (AME)

Key points from the visual (which shows an LPAR with 20 GB of true memory, an expansion factor of 1.5, a 30 GB target expansion, and 28 GB of achieved expanded memory, leaving a memory deficit):
- Memory allocated to the LPAR is the logical or true memory
- True memory is divided into a compressed pool (comprsd) and an uncompressed pool (ucomprsd)
- Data in the true ucomprsd pool is paged out (co) to the comprsd pool to compress it; data in the true comprsd pool is paged in (ci) to the ucomprsd pool to uncompress it
- Data in pinned and file cache pages is not compressed
- AME works better when a smaller percentage of allocated and touched memory is reaccessed
- Expanded memory is what applications see; the target size for expanded memory is the AIX real memory
- The HMC administrator sets the expansion factor in the partition profile: exp_factor * true_mem = target_exp_mem, based on recommendations of the amepat planning tool during a normal system load
- Deficit (dxm): expanded memory does not reach the target; a poor compression ratio or insufficient data to compress can result in a deficit

Figure 4-21. Active Memory Expansion (AME)

Notes:
Active Memory Expansion (AME) is a separately licensed feature of the POWER7-based servers. By compressing part of the virtual memory data, more effective memory is made available. How aggressively memory is compressed is determined by the expansion factor, which is a characteristic of the logical partition. The system administrator initially selects an expansion factor recommended by the amepat planning tool. The expansion factor is either defined in the partition profile or modified dynamically using DLPAR. Multiplying the partition's allocated memory by the expansion factor gives the target amount of expanded memory. This target amount of expanded memory is what AIX sees as its real memory amount.
divided into an uncompressed pool and a compressed pool. The sizes of these pools are
dynamically adjusted by AIX depending on the situation.
To compress a page of data, it is paged-out to the compressed pool. When accessed, the
compressed page is paged-in from the compressed pool. The only virtual memory which is


eligible for paging to the compressed pool are pages which are unpinned and in working
segments.
There is CPU overhead to compressing and uncompressing the memory. An application which is constantly accessing all of its data will generate much more of this compression and decompression overhead than one which is only re-accessing a small portion of that memory during a short period.
While AIX sees the target expansion as the amount of real memory, that amount of
memory may not be effectively available. When the sum of the real uncompressed memory
and the expansion of compressed memory is less than the target expanded memory, the
difference is referred to as the deficit.
Different circumstances can result in a deficit:
- The application data may not compress well
- The amount of memory with data which is not pinned and not used for file caching may not be enough to support the target
- A system with low memory load will not have enough working storage memory to compress. In that situation, a deficit is normal and not a problem.
If AMS is used in combination with AME, AIX may use memory compression as a method
to free up some true memory (logical memory) to loan to the shared memory environment.


Instructor notes:
Purpose Explain how AME works.
Details
Additional information
Transition statement Let's look at what AME related information is available in the statistics reports.

AME statistics (1 of 2)

# vmstat -c 2
System Configuration: lcpu=4 mem=1536MB tmem=768MB ent=0.30 mmode=dedicated-E

kthr   memory                                     page
------ ------------------------------------- ----------------------
 r  b     avm     fre    csz   cfr    dxm     ci    co  pi  po
 0  0  194622  195970   6147  4425      0      1     0   0   0
 4  1  215937  172723  14477  2725      0     25 11856   0   0
 2  1  225693  144092  20630  2619  27947      9  7472   0   0
 0  0  225693  143013  20630  2552  29021    115   127   0   0
 0  0  225693  143006  20630  2554  29022      5     0   0   0

# lparstat -c 2 1000
System configuration: type=Shared mode=Capped mmode=Ded-E smt=4 lcpu=4 mem=1536MB tmem=768MB psize=16 ent=0.30

%user %sys %wait %idle physc %entc lbusy  vcsw phint %xcpu xphysc  dxm
----- ---- ----- ----- ----- ----- ----- ----- ----- ----- ------ ----
  0.5  1.7   0.0  97.8  0.01   4.1   0.0   194     0   3.0 0.0004   85
  0.1  1.4   0.0  98.5  0.01   3.0   0.0   195     0   0.0 0.0000   30
 11.2 25.7   0.6  62.4  0.13  44.9  22.7   294     0  53.7 0.0723    0
  5.0 26.2   6.0  62.8  0.12  41.3  19.0   629     0  59.9 0.0741    0
  0.4  1.7   3.4  94.5  0.01   4.8   0.2   608     0   1.7 0.0002    0

Figure 4-22. AME statistics (1 of 2)

Notes:
The vmstat command has a -c option for displaying information about memory expansion
and compression. In the header line, it identifies both the real memory (expansion target)
and the true memory (logical memory allocated to the partition). The header line also
identifies, in the memory mode field, that the partition is using expanded memory.
The vmstat iterative report lines, when using the -c option, provide five new columns.
- csz - the true size of the compressed memory pool
- cfr - the true size of the compressed memory pool which is currently available (does
not hold compressed data)
- dxm - the size of the memory deficit
- ci - the number of page-ins per second from the compressed memory pool
- co - the number of page-outs per second to the compressed memory pool
The lparstat command has a -c option for displaying information about memory expansion
and compression. In the header line, it identifies both the real memory (expansion target)
and the true memory (logical memory allocated to the partition). The header line also
identifies, in the memory mode field, that the partition is using expanded memory.
The lparstat iterative report lines, when using the -c option, provide three new columns.
- %xcpu - the xphysc value divided by the physc value; in other words, how much of the logical partition's total processor utilization is used for the AME overhead.
- xphysc - the amount of processor capacity that is used to execute data compression
and decompression for AME.
- dxm - the size of the memory deficit
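For example, in the third lparstat sample above, xphysc 0.0723 divided by the unrounded physc yields the reported %xcpu of 53.7; dividing by the displayed two-decimal physc of 0.13 gives roughly 56%, the small difference being display rounding.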


Instructor notes:
Purpose Explain AME related fields in the vmstat and lparstat reports.
Details
Additional information
Transition statement Let's continue with a look at AME statistics in the svmon command output.

AME statistics (2 of 2)

# svmon -G -O summary=ame,unit=MB
Unit: MB
--------------------------------------------------------------------------------------
                size     inuse      free       pin   virtual available  mmode
memory       1536.00   1532.06      3.94    282.36   1607.78      0.19  Ded-E
  ucomprsd               584.77
  comprsd                947.29
pg space     2048.00      58.6

                work      pers      clnt     other
pin           246.14         0         0      36.2
in use       1530.13         0      1.93
  ucomprsd    582.84
  comprsd     947.29
--------------------------------------------------------------------------------------
True Memory: 768.00

                CurSz    %Cur   TgtSz    %Tgt   MaxSz   %Max  CRatio
ucomprsd       588.56   76.64  590.22   76.85       -      -       -
comprsd        179.44   23.36  177.78   23.15  416.84  54.28    5.37

AME               txf     cxf     dxf     dxm
                 2.00    2.00    0.00       0

# svmon -G -O summary=longame,unit=MB -i 5
(Allows long single-line iterations)

Figure 4-23. AME statistics (2 of 2)

Notes:
The svmon global report has an option (summary=ame) which provides AME related
details.
Below the memory line of real memory global statistics, two new rows are displayed which
show the breakdown of real inuse memory into compressed and uncompressed
categories. These are measurements of the amount of expanded (or effective) memory, as
would be seen by the applications. They add up to the total real inuse memory.
In the section which provides columns by type of segment, the working storage column
shows a breakdown into compressed and uncompressed categories. The uncompressed
value is only for working storage; it obviously does not include file caching such as client
segment storage.
A new section on True Memory statistics is provided.
Separate statistics for compressed and uncompressed memory are provided under the
following columns:


- CurSz - current true sizes of compressed and uncompressed, which added together
will equal the total True Memory size.
- %Cur - CurSz expressed as a percentage of total True Memory
- TgtSz - target sizes of true compressed and uncompressed memory pools which are
calculated to be needed in order to reach the target expanded memory size.
- %Tgt - TgtSz expressed as a percentage of total True Memory
- MaxSz - maximum allowed size of true compressed memory (there are vmo
command tunables which affect this)
- %Max - MaxSz expressed as a percentage of total True Memory
- CRatio - current compression ratio
- txf - target memory expansion factor
- cxf - current memory expansion factor
- dxf - deficit factor to reach the target expansion factor (txf - cxf)
- dxm - deficit memory to reach the target expansion

The svmon command also has a summary=longame option which provides AME related statistics in a single long line that is good for iterative monitoring. Below is example output (the line is so long that a very small font is needed to fit the page):

# svmon -G -O summary=longame,unit=MB -i 5
Unit: MB
--------------------------------------------------------------------------------------------------------
                                      Active Memory Expansion
--------------------------------------------------------------------------------------------------------
    Size    Inuse     Free  DXMSz  UCMInuse  CMInuse    TMSz  TMFr    CPSz  CPFr   txf   cxf    CR
 1536.00  1218.76   317.24      0    618.19   600.57  768.00  4.64  145.18  11.6  2.00  2.00  4.49
 1536.00  1339.77   196.23      0    595.04   744.72  768.00  3.76  169.20  17.3  2.00  2.00  4.90
 1536.00  1339.81   196.19      0    594.90   744.91  768.00  3.91  169.20  17.5  2.00  2.00  4.91
 1536.00  1339.18   196.82      0    595.51   743.67  768.00  3.29  169.20  17.3  2.00  2.00  4.89
 1536.00  1459.82     76.2      0    578.64   881.18  768.00  4.14  185.21  16.6  2.00  2.00  5.22
 1536.00  1460.50     75.5      0    579.02   881.48  768.00  3.77  185.21  16.4  2.00  2.00  5.22
 1536.00  1459.86     76.1      0    578.21   881.65  768.00  4.58  185.21  16.3  2.00  2.00  5.21
 1536.00  1532.50     3.50      0    571.44   961.05  768.00  3.34  193.22  14.4  2.00  2.00  5.37

The svmon longame statistics are:


- DXMSz - size of the memory deficit
- UCMInuse - size of uncompressed memory which is in use.
- CMInuse - size of compressed memory which is in use (measured as an amount of
expanded memory)
- TMSz - size of true memory pool (logical, allocated memory)
- TMFr - size of true memory which is free
- CPSz - true size of compressed memory pool


- CPFr - true size of compressed memory pool which is free (The AIX Performance
Tools manual states this field to be the size of the uncompressed pool, but this
author believes that to be a mistake in the manual).
With the ability of Active Memory Expansion, we have a new category of memory which is
neither inuse, free, nor loaned: deficit memory. Thus, on a POWER7-based machine the
new formula is:
size = inuse + free + loaned + deficit
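Checking this against the svmon output on the earlier visual: 1536.00 = 1532.06 (inuse) + 3.94 (free) + 0 (loaned) + 0 (deficit, dxm).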

Instructor notes:
Purpose Explain svmon AME statistics.
Details
Additional information
Transition statement Now that we have the basic concepts and know what statistics
are available, how might we use this information?

Active Memory Expansion tuning

- Actual performance needs to be monitored
  - Planning tool recommendations are modeled estimates
  - Application memory characteristics may have changed:
    - Amount of memory that is file cache or pinned memory
    - Proportion of allocated memory that is repeatedly accessed
    - Compressibility of the data
- Monitor CPU overhead:
  - AME is a trade-off between memory and CPU resources
  - May need a larger CPU allocation to support a large expansion factor
  - May need a less aggressive expansion factor to avoid excessive CPU load
- Monitor memory deficit:
  - If seeing a consistent deficit while under load, notify the HMC administrator
  - May be appropriate to reduce the expansion factor until the deficit is eliminated
  - A deficit under light loads is normal
- Once the deficit is zero at an appropriate expansion factor, follow traditional memory management methods:
  - May need to increase true memory
  - May need to manage memory demand

Figure 4-24. Active Memory Expansion tuning

Notes:
The amepat planning tool tries to model the Active Memory Expansion behavior, given
data collected during a selected time span. That modeling is not a perfect prediction and
the character of the system activity can change from what was collected for planning. As a
result, it is a good practice to monitor how AME is working after deployment and make any
adjustments.
AME is designed to allow a trade-off between memory and CPU. If memory is the bottleneck and there is excess CPU capacity, then that excess processing capacity can be used to relieve some of the memory constraint. It is also possible for the CPU overhead of AME to cause CPU capacity to become the major performance constraint. Monitoring the overall CPU utilization, and how much of it is used for AME compression, can identify situations where either the target expansion factor needs to be reduced, or where the partition might benefit from a larger processor entitlement.
AME should not show a persistent memory deficit while the partition is under heavy memory demand load. A deficit is an indication that AME is unable to effectively reach the configured memory expansion factor (given the amount of compressible memory and the
compressibility of the data). In that circumstance, it is recommended to reduce the memory expansion factor until there is no deficit displayed. If AIX needs more memory than AME can effectively provide (resulting in paging space activity), then you need to use the traditional methods of memory management: either increase the allocated (true) memory or reduce the memory demand.
Note that a deficit under light loads is normal.

Instructor notes:
Purpose Explain how they might tune using AME statistics.
Details
Additional information
Transition statement Let's review what we have covered in this unit with some recommendations for managing memory related performance factors.

Overall Recommendations

- If memory is overcommitted and impacting performance:
  - Add logical memory (for example, use DLPAR to increase the allocation)
  - Reduce demand (especially wasteful demand)
  - Consider sharing memory using AMS
  - Consider implementing expanded memory with AME
- The primary tuning recommendation is: if it is not broken, do not fix it!
  - The AIX6 vmo tuning defaults are already well tuned for most systems
  - Use of outdated tuning recommendations can cause problems
  - If back-leveled at AIX 5L V5.3, use the AIX6 default vmo parameter values as a starting point
- If the free list is driven to zero or sustained below minfree, increasing minfree and maxfree may be beneficial:
  - maxfree = minfree + (maxpgahead or j2_maxPageReadAhead)
- Increasing minperm% may be beneficial when working segments dominate, if:
  - Computational allocations are due to wasteful application memory management, perhaps even a memory leak
  - and I/O performance is being impacted

Figure 4-25. Recommendations

Notes:
Initial recommendations
These recommendations are starting points for tuning. Additional tuning may be
required. The AIX defaults work well for over 95% of the installed systems.
The objectives in tuning these limits are to ensure the following:
- Any activity that has critical response time objectives can always get the page
frames it needs from the free list.
- The system does not experience unnecessarily high levels of I/O because of
premature stealing of pages to expand the free list.
The best recommendation is that unless you can identify a memory performance
problem, do not tune anything!
The second recommendation involves the changes made to the defaults for vmo tunables in AIX6 (especially the lru_file_repage, minperm, and maxperm changes). If on AIX 5.3, a good starting point is to set the vmo values to the AIX6
defaults. If at AIX 6 and later, be very careful of applying outdated tuning recommendations; they will likely degrade your performance.

minfree and maxfree parameters


If bursts of activity are frequently driving the free list to well below the minfree value
(and even to zero), then it can be beneficial to increase these thresholds.
The difference between the maxfree and minfree parameters should always be equal
to or greater than the value of the maxpgahead ioo parameter, if you are using JFS. For
Enhanced JFS, the difference between the maxfree and minfree parameters should
always be equal to or greater than the value of the j2_maxPageReadAhead parameter. If
you are using both JFS and Enhanced JFS, then you should set the value of the
minfree parameter to a number that is greater than or equal to the larger page-ahead
value of the two filesystems.
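To check the read-ahead values that constrain the maxfree minus minfree difference (a quick sketch; ioo -o with no value assignment displays a tunable's current value):

# ioo -o maxpgahead -o j2_maxPageReadAhead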

Making parameter changes


Some changes should be strung together to avoid possible error messages and run in
the order given:
- minfree and maxfree
For example, vmo -p -o minfree=1024 -o maxfree=1536.

Instructor notes:
Purpose Discuss memory tuning recommendations.
Details Emphasize that tuning or changing parameters just to change them is not a
good idea. There needs to be a defined problem that changing the parameters will address.
Additional information
Before the lru_file_repage parameter was available, the recommendation for VMM
tuning to avoid paging was to reduce maxperm so that numperm was always above maxperm.
The following calculation was used:
maxperm = 100% - avm% - 10%
then, maxclient would be set equivalent to maxperm. This allows avm to grow another 10%
before overrunning maxperm or maxclient.
If the number of CPUs * default minfree < fi (from vmstat) then minfree was set equal to
fi, otherwise minfree was set to number of CPUs * default minfree.
Transition statement Let's continue with some specifics for managing memory demand.

Managing memory demand

- Shift high memory load applications to:
  - a lower memory demand time slot
  - a machine with underutilized memory
- Adjust applications to:
  - Only allocate memory that is really needed
  - Only pin memory when necessary
- Fix memory leaks or periodically cycle the leaking application
- Consider use of direct I/O (no file system caching), if:
  - the application does its own caching
  - the application does random access or optimizes its own sequential access
- Consider using file system release-behind mount options, if:
  - files in the file system are not soon reaccessed, thus not benefiting from sustained caching in memory
- Unmount file systems when not being used

Figure 4-26. Managing memory demand

Notes:
Overview
As with any resource type, the most important factor in performance management is
balancing resource with demand. If you cannot afford more memory then you need to
reduce demand. Remember that the resource and demand balance varies from one
time period to the next and from server to server. Peak demand on one machine may be
in the middle of the afternoon, while peak demand on another machine may be early in
the morning. If possible, shift work to off-peak periods. If that is not possible, shift the
work to a server which has its peak demand at a different time. The AIX6 ability to
relocate live workload partitions to a different server makes this easy. The PowerVM
based Live Partition Mobility is another way to do this dynamically. Both methods also
support static relocation, which involve first shutting down either the workload partition
or the logical partition.


Tuning applications
Some applications can be configured to adjust the amount of computational memory
they allocate. The most common example of this is databases. Databases often
allocate massive amounts of computational memory to cache database storage
contents. Sometimes the amount is excessive; given enough computational memory
they will cache data that has a low hit ratio (not accessed often). If the system is
memory constrained, this is a wasteful use of memory. Note that this application
managed caching is considered computational storage by AIX. You should work with
the application administrators (in our example that would be the database
administrator) to determine the appropriate amount of computational memory to be
allocated.

Memory leaks
Another way in which applications can waste memory is by having a coding bug which
allocates memory, loses track of that allocation, and does not free that memory. The
next time the application needs memory it allocates new memory and then loses track
of that allocation. This is referred to as a memory leak. The application, with each
iteration of its logic, keeps allocating memory but never frees any (and is not using it).
This tends to fill up memory. Over the long term, the application needs to have the
coding error fixed. In the short term, you need to periodically quiesce the application,
terminate the process (all threads), and then restart the application. This is often
referred to as cycling the application. When the application is terminated, AIX will then
free up all the memory owned by the application.

Direct I/O
AIX caches file contents in memory for two main reasons:
- It assumes that the file will be re-referenced, and wants to avoid disk I/O by keeping it in memory
- It allows sequential disk operations to be optimized (this involves read-ahead and write-behind mechanisms, which will be covered later).
For applications which do not need these benefits, you can tell the file system to do no
automatic memory caching. The best example, again, is the database. Most database
engines will do their own caching of data. Since the database engine has a better understanding of data access patterns, it can use this intelligence to manage what data is kept in its private cache. There is no reason to both cache it for the filesystem (persistent or client segment) and to also cache it in the database engine's computational memory. In addition, database access tends to be random access of relatively small blocks rather than sequential. As such, it does not benefit as much from
the file system sequential access mechanisms.

These applications can usually be configured to request Direct I/O which eliminates the
file system memory caching. Details on Direct I/O will be covered later.
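For example (a sketch, using a hypothetical mount point /db01 on a JFS2 file system; the dio mount option requests Direct I/O for all files in that file system):

# mount -o dio /db01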

Unmounting and release-behind


Some applications benefit greatly from the file system I/O optimization that depends on
file caching, but do not reaccess file data blocks very frequently. They write a file but
then do not reaccess it for a long time (it may be the next day, or perhaps not until the
end of an accounting period). Or they access a large file just once to generate an
end-of-month report, but do not read it again until the end of the year.
AIX does not know that the file is not likely to be reaccessed, so it tries to keep it cached
in memory even after the application terminates. If the files which follow this access
pattern can be placed in one or more special file systems, then we can tell the kernel
file system services not to keep these files cached in memory once the immediate
application access is complete. If the file system is mounted with a release-behind
option, then once a file's data is delivered to the application (on a read) or written to disk
(on a write), AIX frees that memory to be placed on the free list.
Another alternative, for applications that only run at certain times, is to place all of that
application's files in a separate file system. Then, only mount the file system when the
application is running. When the application completes its run, unmount the file system.
AIX frees all file cache memory contents related to that file system when it is
unmounted. Both techniques are sketched below.
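A minimal sketch of both techniques, assuming placeholder file system names:
# Release-behind: free cached pages as soon as the sequential I/O
# completes (rbr = reads only, rbw = writes only, rbrw = both).
# mount -o rbrw /monthly/reports

# Time-bounded caching: mount only while the application runs, then
# unmount to free every cached page belonging to the file system.
# mount /batch/data
#   (run the application)
# umount /batch/data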


Instructor notes:
Purpose Discuss some techniques for managing memory demand.
Details Do not get into details of DIO and release-behind mechanisms. Tell the students
these will be covered in the file systems unit.
Additional information
Transition statement Let us review what we have covered with some checkpoint
questions.


Checkpoint (1 of 2)
1. What are the three virtual memory segment types?
_____________, _____________, and _____________
2. What type of segments are paged out to paging space?
__________________
3. What are the two classifications of memory (for the
purpose of choosing which pages to steal)?
__________________ and ___________________
4. What is the name of the kernel process that implements
the page replacement algorithm? _______

Figure 4-27. Checkpoint (1 of 2)

Notes:


Instructor notes:
Purpose Checkpoint questions.
Details A suggested approach is to give the students about five minutes to answer the
questions on this page. Then, go over the questions and answers with the class.

Checkpoint solutions (1 of 2)
1. What are the three virtual memory segment types?
persistent, client, and working
2. What type of segments are paged out to paging space?
working
3. What are the two classifications of memory (for the
purpose of choosing which pages to steal)?
computational memory and non-computational (file)
memory
4. What is the name of the kernel process that implements
the page replacement algorithm? lrud


Additional information
Transition statement Let's look at more checkpoint questions.


Checkpoint (2 of 2)
5. List the vmo parameter that matches the description:
a. Specifies the minimum number of frames on the free list when the
VMM starts to steal pages to replenish the free list _______
b. Specifies the number of frames on the free list at which page
stealing stops ______________
c. Specifies the point below which the page stealer will steal file or
computational pages regardless of repaging rates ___________
d. Specifies whether or not to consider repage rates when deciding
what type of page to steal ________________

Figure 4-28. Checkpoint (2 of 2)

Notes:


Instructor notes:
Purpose Checkpoint questions.
Details A suggested approach is to give the students about five minutes to answer the
questions on this page. Then, go over the questions and answers with the class.

Checkpoint solutions (2 of 2)
5. List the vmo parameter that matches the description:
a. Specifies the minimum number of frames on the free list when the
VMM starts to steal pages to replenish the free list minfree
b. Specifies the number of frames on the free list at which page
stealing stops maxfree
c. Specifies the point below which the page stealer will steal file or
computational pages regardless of repaging rates minperm%
d. Specifies whether or not to consider repage rates when deciding
what type of page to steal lru_file_repage


Additional information
Transition statement Let's move on to the unit exercise.


Exercise 4:
Virtual memory analysis and tuning
Use VMM monitoring and tuning tools to
analyze memory over-commitment, file
caching, and page stealing
Use PerfPMR to examine memory data
Identify a memory leak
Use WPAR manager resource controls
Examine statistics in an active memory
sharing (AMS) environment

Figure 4-29. Exercise 4: Virtual memory analysis and tuning

Notes:


Instructor notes:
Purpose Introduce the exercise.
Details
Additional information
Transition statement Summarize the key points from this unit.


Unit summary
This unit covered:
Basic virtual memory concepts and what issues
affect performance
Describing, analyzing, and tuning page
replacement
Identifying memory leaks
Using the virtual memory management (VMM)
monitoring and tuning tools
Analyzing memory statistics in Active Memory
Sharing (AMS) and Active Memory Expansion
(AME) environments

Figure 4-30. Unit summary


Notes:


Instructor notes:
Purpose Summarize the unit.
Details
Additional information
Transition statement On to the next unit.


Unit 5. Physical and logical volume performance


Estimated time
3:45 (2:30 Unit; 1:15 Exercise)

What this unit is about


This unit describes the issues related to physical and logical volume
performance. It shows you how to use performance tools and how to
configure your disks, adapters, and logical volumes for optimal
performance.

What you should be able to do


After completing this unit, you should be able to:
Identify factors related to physical and logical volume performance
Use performance tools to identify I/O bottlenecks
Configure logical volumes for optimal performance

How you will check your progress


Accountability:
Checkpoint
Machine exercises

References
AIX Version 6.1 Performance Management
AIX Version 6.1 Performance Tools Guide and Reference
AIX Version 6.1 Commands Reference, Volumes 1-6
SG24-6478 AIX 5L Practical Performance Tools and Tuning Guide (Redbook)
SG24-6184 IBM eServer Certification Study - AIX 5L Performance and System Tuning (Redbook)


Unit objectives
After completing this unit, you should be able to:
Identify factors related to physical and logical volume
performance
Use performance tools to identify I/O bottlenecks
Configure logical volumes for optimal performance

Figure 5-1. Unit objectives

Notes:


Instructor notes:
Purpose This unit explains the concepts and characteristics of physical and logical
volumes and their relationship to performance.
Details We will not cover SAN environments in this course. However, there are a few
references to caching disk subsystems. The term caching disk subsystem is used instead
of SAN to differentiate between many fibre channel connected disks with little or no cache
and a storage subsystem like 2105, DS8000s, FastT and EMCs. These systems are more
than just disks connected to a fibre controller.
Additional information
Transition statement Here's the performance flowchart that we will be following
throughout the course, with the I/O Bound box highlighted.


Overview
Does I/O response and throughput meet expectations?
If not, what is the cause of the degraded performance?

Disk subsystem underperforming to specifications


Saturated disk subsystem, storage adapter, or adapter bus
Lack of I/O overlap processing (serialization)
Shortage of LVM logical resources
Fragmentation and improper block sizes
File system issues
Shortage of CPU or memory resources

Can the workload be managed?

Load balancing workload across adapters or disks


Isolate contending workloads to different resources
I/O pacing
Shift of where and when certain programs run

Figure 5-2. Overview

Notes:
As with any performance analysis, the bottom line is whether the application is performing
to the expectations (or needs) of the application users. For this, there are often metrics
provided by the application which are more appropriate than the operating system metrics.
If the performance is not what you expect, then you need to look at where you can make
improvements. An obvious starting point is the devices that hold the data. What is the
expected response time and throughput for the disk drive or storage subsystem? Do the
actual measurements meet those expectations? If the storage is performing well, are there
better and faster storage solutions you can invest in?
Every device has its limits on how many requests per second or how much data per second
it can handle. The problem may be that the storage device is saturated. For that, you either
need to get a better device or see if you can shift some of that load to another device. It is
not uncommon to find expensive new hardware underutilized while the older hardware is
being pushed to its limits. The same principle can be applied to any component in the path.
It may be that the storage adapter or the PCI bus is saturated.

Again, shifting an adapter to a different bus, or spreading traffic across more adapters,
can often have a significant payback.
Within AIX, there are layers to the storage architecture, and each layer has queues and
control blocks which may be in shorter supply than what is needed. These pools and
queues can often be increased, taking care to understand the consequences.
One of the common principles of performance is that the bigger the unit of work, the more
efficient the processing. The size of the application's read and write requests, LVM stripe
unit sizes, the configuration of the file system block sizes and mechanisms, and the logical
track group size of the volume group can all have an effect. Even if we do all of this
correctly, fragmentation can break up the flow of processing the data requests.
There are many file system mechanisms that can affect performance, and these will be
covered in the next unit.
Always remember that overall performance can be affected by memory and CPU
constraints. It is necessary to look at the whole picture. Some efforts to improve I/O
performance can have a negative effect on these other factors.

Moving to the demand side of the situation, sometimes you need to manage the workload
within the constraints of the resources. The techniques of spreading workloads across
more buses, more adapters, or more disks are part of this. But sometimes you need to
identify which programs are generating the I/O load and decide what should continue to run
on this server at this time and which work should be shifted. The workload might be moved
to another server or delayed until a time slot when the resources are not saturated. Some
applications are designed to work in clustered environments with transactions
automatically load balanced between the nodes.
I/O pacing is a file system technique which can pace (slow down) batch I/O loads to
give interactive requests better response time, as sketched below.
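A minimal sketch of enabling system-wide I/O pacing; the high- and low-water mark
values here are illustrative assumptions, not recommendations:
# Processes that exceed maxpout pending write I/Os against a file are
# suspended until the count drops to minpout.
# chdev -l sys0 -a maxpout=8193 -a minpout=4096
# lsattr -E -l sys0 -a maxpout -a minpout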


Instructor notes:
Purpose Provide an overview of some of the principles and techniques for I/O
performance management.
Details
Additional information
Transition statement It is important to understand how the various components relate
to one another. Let us look at the layers involved in I/O processing.


I/O stack
Application
Logical file system
File systems (JFS, JFS2, NFS, others)
VMM (file caching) <- bypassed by Direct I/O (DIO)
LVM (logical volume) <- entry point for raw LV I/O
Disk device drivers (physical volume) <- entry point for raw disk I/O
Adapter device drivers
Disk subsystem (optional)
Disk
Figure 5-3. I/O stack

Notes:
Overview
The Logical Volume Manager provides the following features:
- Translation of logical disk addresses to physical disk addresses
- Dynamic allocation of disk space
- Mirroring of data (at the logical volume level) for increased availability
- Bad block relocation
The heart of the LVM is the Logical Volume Device Driver (LVDD).
When a process requests a disk read or write, the operation involves the file system,
VMM and LVM. But, these layers are transparent to the process.


File system I/O


When an application issues an I/O, it may be to a file in a file system. If so, then the file
system may either send the I/O to the VMM or directly to the LVM. The VMM will send
the I/O to the LVM. The LVM sends the I/O to the disk driver which will communicate
with the adapter layer. With the file system I/O path, the application can benefit from the
VMM read-ahead and write-behind algorithms but may run into performance issues
because of inode locking at the file system level.

Raw LVM I/O


The application can also access the logical volume directly. This is called raw LVM I/O.
Many database applications do this for performance reasons. With this type of
access, you avoid two layers (VMM and the file system). You also avoid having to
acquire the inode lock on a file.

Raw disk I/O


An application can bypass the Logical Volume Manager altogether by issuing the I/O
directly to the disk device. This is normally not recommended since you lose the easier
system management provided by the LVM. However, this does avoid going through one
more layer (LVM). Issuing I/O this way can be a good test of a disk's capabilities, since
it takes the VMM, file system, and LVM out of the picture.
The disk device itself may be a simple physical disk drive, or it could be a logical hdisk
that is comprised of several physical disks (such as in a RAID array), or the disk device
could be just a path specification to get to the actual logical or physical disk (as is the
case for vpath or powerpath devices).

Using a raw device


When using a raw device (LVM or disk I/O), the mechanism bypasses the normal AIX I/O
processing. To use this, the character device must be used. Failure to use the character
device will cause the I/O to be buffered by LVM and can result in up to a 90% reduction in
throughput. The character device name always begins with an r. So, accessing /dev/hdisk1
will be buffered by LVM, but /dev/rhdisk1 will bypass all buffering. This is the only correct
way to access a raw device for performance reasons.
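As a sketch of such a capability test, assuming a placeholder device name and transfer
size (reading a raw disk is harmless, but never write to a disk that is in use):
# Time a 100 MB sequential read from the character (raw) device;
# /dev/rhdisk1 bypasses LVM buffering, /dev/hdisk1 would not.
# timex dd if=/dev/rhdisk1 of=/dev/null bs=256k count=400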


Instructor notes:
Purpose To explain LVM terminology.
Details LVM also handles inter and intra policies and scheduling policies (sequential or
parallel). The logical volume manager controls the placement of file systems and page
spaces on the hard drive which can significantly affect the amount of seek latency.
The disk device drivers control the order in which I/O requests are acted upon.
The logical volume device driver is a pseudo device driver that operates on logical volumes
through the /dev/LV or /dev/rLV special file (where LV is the name of the logical volume).
Like the physical disk device driver, it provides character and block entry points. For
example, if you are accessing a raw logical volume named raw_lv, the correct way to
access it is /dev/rraw_lv. (Note the extra r in the name.)
Additional information
Transition statement It is common to hear that I/O performance is now totally in the
hands of the SAN and disk storage subsystem administrators. While there are certain
traditional aspects of AIX I/O tuning which are irrelevant in a SAN storage environment,
there are others which are still very important. Let's first discuss the differences between
traditional isolated disks and cached disk arrays.


Individual disks versus disk arrays


Disks: AIX controls position of data on the platter
(diagram: the five disk bands: outer edge, outer middle, center,
inner middle, inner edge)

Disk arrays: array controller spreads the data; AIX sees each LUN as
an hdisk (diagram: LUN 1 and LUN 2 allocated from the same array)

Figure 5-4. Individual disks versus disk arrays

Notes:
For traditional disks, AIX controls where data is located through the AIX Logical Volume
Manager (LVM) intra-policies. Poor placement of data can result in increased access arm
movement and rotational delays getting to the desired sector.
For disk arrays, the controller decides where to place the data, most commonly striping
the data across the disks with distributed parity (RAID5). In this environment, it makes no
difference whether the logical volume is on the outer edge or in the center, since the disk
array administrator is controlling where it is actually placed.
Another major difference is that almost all disk arrays have their own data caching. This
allows the controller to collect many write requests and then optimize how and when the
data is written to physical disk. It also allows the controller to recognize patterns of
sequential access and to anticipate the next request by sequentially reading ahead. When
the next host read request comes in, the data is already in the cache and ready to be
transferred to the host.


Instructor notes:
Purpose Provide a basic description of the differences between traditional disks and
storage array based disk subsystems.
Details
Additional information Check with the students to see if they need a refresher on the
components of disk access: seek latency, rotational latency, and transfer latency. But
remember that when using a caching disk subsystem, much of this may be hidden,
because the requests involve transfers between the subsystem's cache and AIX memory.
Seek latency
A seek is the physical movement of the head at the end of the disk arm from one track
to another. The time for a seek is the necessary time for the disk arm to:
- Accelerate
- Travel over the tracks to be skipped
- Decelerate
- Settle down and wait for the vibrations to stop while hovering over the target track

The total time the seeks take is variable. The average seek time is used to measure the
disk capabilities, and it is generally lower than 15 ms.
Rotational latency
Rotational latency is the time that the disk arm has to wait while the disk is rotating
underneath it until the target sector approaches. Rotational latency is, for all practical
purposes except sequential reading, a random function with values uniformly between
zero and the time required for a full revolution of the disk (less than 10 ms). The
average rotational latency is taken as the time of a half revolution, and it is generally
lower than 5 ms. To determine the average latency, you must know the number of
revolutions per minute (RPM) of the drive. Convert the RPM to revolutions per second,
take the reciprocal to get the time for one revolution, and divide by 2 to get the
average rotational latency.
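As a worked example: a 10,000 RPM drive turns at about 167 revolutions per second, so
one revolution takes about 6 ms, and the average rotational latency is half of that,
about 3 ms.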
Transfer latency
The data transfer time is determined by the time it takes for the requested data block to
pass under the read/write head. It is linear with respect to the block size. For a 4 KB
page transfer, this time is typically near 1 ms.
Transition statement Here our focus was on policies involving the location of data
within a disk (LVM intra-policies). Let's also examine the implications of spreading data
across multiple disks (LVM inter-policies, LVM striping, and LVM mirroring).


Disk groups
Common to balance the workload between hdisks
A disk group is a group of LUNs on the same disk array
Similar in concept to AIX mirror groups for availability
If hdisks are on the same disk group, there is no real
workload balancing (example: LUN1 and LUN2)
To balance workload with SANs, choose disks in different
disk groups (for example: LUN1 and LUN3)

(diagram: two disk arrays, each presented to AIX as hdisks; LUN 1 and
LUN 2 are on the first array, LUN 3 and LUN 4 are on the second)

Figure 5-5. Disk groups

Notes:
In the traditional disk environment it is common to spread data between disks, both for
availability and for performance. I/O requests sent to one disk do not interfere with requests
sent to another disk. The requests can be processed in parallel with less workload on each
of the disks. This is a great way to deal with a single saturated disk.
In the disk array environment, it is possible that two hdisks are actually LUNs allocated out
of the same disk array. If this were an LVM mirroring situation, this would make nonsense
out of the availability planning: the disk array crashes and both hdisks are gone. To help
avoid this, AIX allows you to formally define mirror groups. A mirror group is a group of
hdisks which actually reside on the same disk array. The mirror group functionality allows
the system to reject attempts to mirror where copies would be in the same mirror group.
Even when not mirroring, it is important to be aware of these relationships. To convey this,
the course will use the term disk group to describe hdisks which are LUNs from the same
disk array. There is no need or benefit to defining a formal mirror group if not doing LVM
mirroring.


Spreading data between hdisks in the same disk group would have no benefit. A single
disk array would still be a single point of resource contention. But spreading data between
hdisks which are in different disk groups can still be very beneficial.
A similar concept can occur in a virtual SCSI environment with the LPAR's virtual disks
being backed by logical volumes that are on a single disk at the VIOS server. They appear
as two hdisks but, in reality, they are carved out of the same physical disk, which acts as a
single point of resource contention. This is why most VIOS administrators back virtual
disks with physical volumes. But it is important in that situation to coordinate with the VIOS
administrator and the SAN storage administrator to understand ultimately which virtual
disks are in the same disk group and which are not.


Instructor notes:
Purpose Explain factors in spreading data to avoid single points of resource contention.
Details
Additional information
Transition statement Now that we have explained the different storage environments
you might be working with, let us go over some LVM performance factors and discuss
which ones apply to which environments.


LVM attributes that affect performance


Disk band locality issues affecting only individual disks
Position on physical volume (intra-policy)
Active mirror write consistency (cache on outer edge)
Logical volume fragmentation (for random I/O loads)

Managing location across hdisks and disk groups


Range of physical volumes (inter-policy)
Maximum number of physical volumes to use
Number of copies of each logical partition (LVM mirroring)

Extra I/O traffic affects all I/O environments


Enabling write verify
Active Mirror Write Consistency (extra write to update MWC
cache before each data write)

Figure 5-6. LVM attributes that affect performance

Notes:
Inter-physical volume allocation policy
The Inter-Physical Volume Allocation policy specifies which strategy should be used for
choosing physical devices to allocate the physical partitions of a logical volume. The
choices are:
- MINIMUM (the default)
- MAXIMUM

The MINIMUM option indicates the number of physical volumes used to allocate the
required physical partitions. This is generally the policy to use to provide the greatest
reliability, without having copies, to a logical volume. The MINIMUM option can be
interpreted in one of two different ways, based on whether the logical volume has
multiple copies or not:
- Without Copies:
The MINIMUM option indicates one physical volume should contain all the physical
partitions of this logical volume. If the allocation program must use two or more
physical volumes, it uses the minimum number possible, remaining consistent with
the other parameters.
- With Copies:
The MINIMUM option indicates that as many physical volumes as there are copies
should be used. Otherwise, the minimum number of physical volumes possible are
used to hold all the physical partitions. At all times, the constraints imposed by other
parameters such as the strict option are observed. (The strict allocation policy
allocates each copy of a logical partition on a separate physical volume.)
These definitions are applicable when extending or copying an existing logical volume.
For example, the existing allocation is counted to determine the number of physical
volumes to use in the minimum with copies case.
The MAXIMUM option indicates the number of physical volumes used to allocate the
required physical partitions. The MAXIMUM option intends, considering other constraints,
to spread the physical partitions of this logical volume over as many physical volumes
as possible. This is a performance-oriented option and should be used with copies to
improve availability. If an uncopied logical volume is spread across multiple physical
volumes, the loss of any physical volume containing a physical partition from that logical
volume is enough to cause the logical volume to be incomplete.
To specify an inter-physical policy use the -e argument of the mklv command. The
options are:
- x - Allocate across the maximum number of physical volumes
- m - Allocate the logical partitions across the minimum number of physical volumes
The mklv -u UpperBound argument sets the maximum number of physical volumes for
new allocation. The value of the Upperbound variable should be between one and the
total number of physical volumes in the volume group.
For example, to create a logical volume with 4 logical partitions that are spread across
at most three physical volumes:
# mklv -u 3 -e x datavg 4

Intra-physical volume allocation policy


The Intra-Physical Volume Allocation policy specifies what strategy should be used for
choosing physical partitions on a physical volume. The five general strategies are:
- EDGE
- INNER EDGE
- MIDDLE
- INNER MIDDLE
- CENTER

The Intra-Physical Volume Allocation policy has no effect on a caching disk subsystem.
It only applies to real physical drives. It also applies when setting up VIO devices from
the server.
Physical partitions are numbered consecutively, starting with number one, from the
outer-most edge to the inner-most edge.
The EDGE and INNER EDGE strategies specify allocation of partitions to the edges of
the physical volume. These partitions have the slowest average seek times, which
generally result in longer response times for any application that uses them. Outer edge
(EDGE) on disks produced since the mid 1990s can hold more sectors per track so that
the outer edge is faster for sequential I/O.
The MIDDLE and INNER MIDDLE strategies specify to stay away from the edges of the
physical volume and out of the center when allocating partitions. These strategies
allocate reasonably good locations for partitions with reasonably good average seek
times. Most of the partitions on a physical volume are available for allocation using this
strategy.
The CENTER strategy specifies allocation of partitions to the center section of each
physical volume. These partitions have the fastest average seek times, which generally
result in the best response time for any application that uses them. There are fewer
partitions on a physical volume that satisfy the CENTER strategy than any other
general strategy.
To specify an intra-physical policy use the -a argument of the mklv command. The
options are:
- e - Edge (Outer edge)
- m - Middle (Outer middle)
- c - Center
- im - Inner middle
- ie - Inner edge
For example, to create a logical volume with 4 logical partitions allocated at the center
of the disk:
# mklv -a c datavg 4
The Intra-Physical Volume Allocation policy may or may not matter depending on
whether a disk or disk subsystem is being used for the storage. This is because AIX has
no real control over what part of the real disk in the subsystem is actually used.

Other miscellaneous LV attributes:


Various other factors can be controlled when creating a logical volume:
- Allocate each logical partition copy on a separate PV specifies the strictness
policy to follow. A value of 'yes' means the strict allocation policy will be used, which
means no copies of a logical partition are permitted to reside on the same physical
volume. A value of 'no' means the nonstrict policy is used, which allows copies of
logical partitions to reside on the same physical volume. A value of superstrict
uses an allocation policy that ensures that no partition from one mirror copy resides
on the same physical volume that has partitions from another mirror copy of the
logical volume.
- Relocate LV during reorganization specifies whether to allow the relocation of the
logical volume during reorganization. For striped logical volumes, the relocate
parameter must be set to no (the default for striped logical volumes). Depending on
your installation you may want to relocate your logical volume.
- Write verify sets an option for the disk to do whatever its write verify procedure is.
Exactly what happens is up to the disk vendor. This is implemented completely in
the disk. This will negatively impact performance.
- Logical volume serialization serializes overlapping I/Os. When serialization is
enabled, it forces serialization of concurrent writes to the same disk block. Most
applications, such as file systems and databases, do their own serialization, so
serialization should be turned off. The default for new logical volumes is off. Enabling
this parameter can degrade I/O performance. Operations are serialized only if they
are issued to the same data block.
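As a hedged sketch tying the strictness policy described above to the mklv command,
assuming placeholder volume group, size, and disk names:
# Two-copy mirrored LV with superstrict allocation (-s s): no partition
# of one copy may share a physical volume with the other copy.
# mklv -c 2 -s s datavg 8 hdisk2 hdisk3 hdisk4 hdisk5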


Instructor notes:
Purpose To explain LVM policies.
Details If you select the MINIMUM Inter-Disk setting (Range =m), the original physical
partitions assigned to the logical volume are located on a single disk to enhance
availability. If you select the MAXIMUM Inter-Disk setting (Range =x), the original partitions
are located on multiple disks to enhance performance. The allocation of mirrored copies of
the original partitions is discussed in the following section. For non-mirrored logical
volumes, use the MINIMUM setting to provide the greatest availability (availability is having
access to data in case of hardware failure). The MINIMUM setting indicates that one physical
volume should contain all the physical partitions of this logical volume if possible. If the
allocation program must use two or more physical volumes, it uses the minimum number
possible, remaining consistent with the other parameters.
If there are mirrored copies of the logical volume, the MINIMUM setting causes the physical
partitions containing the first copy of the logical volume to be allocated on a single physical
volume if possible. Then, depending on the setting of the strict option, the additional copy
or copies are allocated on the same or on separate physical volumes. In other words, the
algorithm uses the minimum number of physical volumes possible, within the constraints
imposed by other parameters such as the strict options, to hold all the physical partitions.
The closer a given physical partition is to the center of a physical volume, the lower the
average seek time, assuming a uniform distribution of disk I/Os. However, if the logical
volume is the only logical volume on the disk, then placing it on the outer edge may help
with sequential performance since the outer edge may have more sectors per track. Also, if
active Mirror Write Consistency is enabled, then the mirrored logical volume would best be
served by placing it on the outer edge of the disk.
Additional information The Intra-Physical Volume Allocation also applies when setting
up VIO devices from the server.
Transition statement Let us look at performance factors in using LVM mirroring.


LVM mirroring
LVM mirroring provides software mirroring for either individual
logical volumes or all logical volumes in a volume group
Mirror write consistency (MWC)
Ensures all copies are the same after a crash
Active MWC records mirror write activity in a MWC cache
(MWCC) on the outer edge of the hdisk.
The logical volume location can cause excessive access arm
movement between the LV and the MWCC on every write.
Passive MWC (big VG only) does not use a MWC cache
Mirroring may benefit read performance but could degrade write
performance if mirror write consistency is on or active

Figure 5-7. LVM mirroring

Notes:
Introduction
LVM mirroring is a form of disk mirroring. Rather than an entire disk being mirrored, the
individual logical volume can be mirrored. LVM mirroring is turned on for a logical
volume when the number of copies is greater than one.

Number of copies
The number of copies could be:
- One: No mirror
- Two: Double mirror which protects against a single point of failure
- Three: Triple mirror which protects against multiple disk failures
Mirroring helps with high-availability because in case a disk fails, there would be
another disk with the same data (the copy of the mirrored logical volume).

Copies on a physical volume


When creating mirrored copies of logical volumes, use only one copy per disk. If you
had the copies on the same disk and the disk fails, mirroring would not have helped with
high-availability. In a SAN storage environment, use only one copy per mirror group.

Scheduling policies
There are several scheduling policies that are available when a logical volume is
mirrored. The appropriate policy is chosen based on the availability requirements and
the performance characteristics of the workloads accessing the logical volume.

Performance impact
Mirroring may have a performance impact since it does involve writing two to three
copies of the data. Mirroring also adds to the cost because of the necessity for
additional physical disk drives. In the case of reads, mirroring may help performance.

Mirror write consistency


The LVM always ensures mirrored copies of a logical volume are consistent during
normal I/O processing. To ensure consistency of mirrored copies, for every write to a
logical volume, the LVM generates a write request for every mirror copy. A problem can
occur if the system crashes before all the copies are written. If Active Mirror Write
Consistency recovery is requested for a logical volume, the LVM keeps additional
information to allow recovery of these inconsistent mirrors. Mirror Write Consistency
recovery should be performed for most mirrored logical volumes. Logical volumes, such
as paging space, should not have MWC on since the data in the paging space logical
volume is not reused when the system is rebooted.

Caching disk subsystems


When mirroring on a caching disk subsystem, AIX does not know how to make sure
each copy of the logical volume is on a separate storage system. It is the responsibility
of the administrator to take care of this, by requesting that the related LUNs be
allocated out of separate disk arrays. Defining mirror groups will allow AIX to assist with
this, since it will refuse to allocate two copies in the same mirror group.
Because the writes are cached, MWC has less effect on a caching disk subsystem. If
the disk subsystem writes are slow (over 5 ms), MWC may have a significant impact on
the performance of the system. This is because the MWC writes are synchronous and
must be completed before the writing of the actual data.


Mirror write consistency


The Mirror Write Consistency (MWC) record consists of one sector. It identifies which
logical partitions may be inconsistent if the system is not shut down correctly. When the
volume group is varied back on-line, this information is used to make the logical
partitions consistent again. Because the MWC control sector is on the outer edge of the
disk, performance may be improved if the mirrored logical volume is also on the edge.

Active MWC
With active MWC, mirrored writes do not return until the Mirror Write Consistency check
cache has been updated. This can have a negative effect on write performance. This
cache holds approximately 62 entries. Each entry represents a logical track group
(LTG). The LTG size is a configurable attribute of the volume group.
When a write occurs to a mirrored logical volume, the cache is checked to see if the
write is in the same Logical Track Group (LTG) as one of the LTGs in the cache. If that
LTG is there, then no consistency write is done. If not, then the cache is updated with
this LTG entry and a consistency check record is written to each disk in the volume
group that contains this logical volume. The MWC does not guarantee that the absolute
latest write is made available to the user. MWC only guarantees that the images on the
mirrors are identical.
You may choose to turn off MWC as long as the system administrator sets auto_varyon
to false and does a syncvg -f on the volume group after rebooting. However,
recovery from a crash will take much longer since ALL partitions will have to be
resyncd. MWC gives the advantage of fast recovery when it is turned on.

Passive MWC
Passive MWC is available for logical volumes that are part of a big volume group.
A normal volume group is a collection of 1 to 32 physical volumes of varying sizes and
types. A big volume group can have from 1 to 128 physical volumes. The mkvg -B
option is used to create a big volume group. Most large volume groups are configured
as scalable volume groups and thus cannot use passive MWC.
Active MWC's disadvantage is the write performance penalty (which can be substantial
in the case of random writes). However, it provides for fast recovery at reboot time if a
crash occurred. By disabling active MWC, the write penalty is eliminated, but after boot
the entire logical volume has to be resynced by hand (using syncvg -f) before users
can use that logical volume (autovaryon must be off). With passive MWC, however, not
only is the write penalty eliminated, but the administrator does not have to do the syncvg
or set autovaryon off. Instead, the system will automatically resync the entire logical
volume if it detects that the system was not shut down properly. This resyncing is done
in the background. The disadvantage is that reads may be slower until the partitions
are resynced.

The passive option may be chosen in SMIT or with the mklv or chlv commands when
creating or changing a logical volume. Just like with active MWC, these options take
effect only if the logical volume is mirrored.
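A minimal sketch, assuming placeholder volume group, logical volume, and size values:
# Create a two-copy mirrored LV with passive MWC (-w p); -w y selects
# active MWC and -w n turns MWC off.
# mklv -c 2 -w p datavg 224
# Switch an existing mirrored LV to passive MWC:
# chlv -w p lv01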


Instructor notes:
Purpose To explain performance aspects of LVM mirroring.
Details Be fairly brief with the mirroring basics; the students should already be aware of
mirroring from previous AIX System Administration courses. Focus here should be pointing
out that MWC can degrade performance if the LV being mirrored is not near the MWC
Cache which is stored on the outer edge. And that is only a concern when using individual
disks. Remind them that mirroring is mostly an availability tool. Use the point that mirroring
can aid in read performance as a segue to the next visual.
Additional information MWC can hurt random I/O performance more than sequential
I/O performance. For example, when a cp command is executed on a mirrored logical
volume, it would take 32 writes before a MWCC write is done because the default LTG size
is 128 KB (cp writes in 4 KB chunks). With random writes, every write could be in a different
LTG and therefore cause a MWCC write every time. Mirror Write Consistency overhead for
writes can be avoided by using the passive MWC feature. The way passive MWC works is
that there is a bit that is set to mark the logical volume as being dirty. When a logical
volume is closed cleanly, the bit is cleared. If a system crashes, the bit is still set. Therefore
at reboot time, the LVM knows that the logical volume needs to be syncd. It will then start a
syncing of the entire logical volume in the background. This may take a long time
depending on how large the logical volume is. The logical volume can be used while the
resyncing is occurring. If a partition is read that has not been syncd yet, then it will be
syncd and then the read request will return.
Transition statement Let us look at the impact that the LVM mirroring scheduling
policies can have on performance.


LVM mirroring scheduling policies


Parallel (default):
Read I/Os will be sent to the least busy disk that has a
mirrored copy
Write I/Os will be sent to all copies concurrently
Sequential:
Read I/Os will only be sent to the primary copy
Write I/Os will be done in sequential order, one copy at a
time
Parallel write/sequential read:
Read I/Os will only be sent to the primary copy
Write I/Os will be sent to all copies concurrently
Parallel write/round-robin read:
Read I/Os are round-robined between the copies
Write I/Os will be sent to all copies concurrently
Figure 5-8. LVM mirroring scheduling policies

Notes:
Introduction
The scheduling policy determines how reads and writes to mirrored logical volumes are
handled.

Parallel
The default policy is parallel which balances reads between the disks. When a read
occurs, the LVM will initiate the read from the primary copy if the disk which contains
that copy is not busy. If that disk is busy, then the disk with the secondary copy is
checked. If that disk is also busy, then the read is initiated on the disk which has the
least number of outstanding I/Os. In the parallel policy, the writes are written out in
parallel (concurrently). The LVM write is not considered complete until all copies have
been written.

Copyright IBM Corp. 2010

Unit 5. Physical and logical volume performance


Course materials may not be reproduced in whole or in part
without the prior written permission of IBM.

5-25

Instructor Guide

In a caching disk subsystem, this works well for highly random data access with
relatively small data transfers (128 KB or less).

Sequential
With the sequential policy, all reads will only go to the primary copy. If the read
operation is unsuccessful, the next copy is read, and then the primary copy is fixed by
turning the read operation into a write operation with hardware relocation specified on
the call to the physical device driver.
Writes will occur serially with the primary copy updated first and when that is completed,
the secondary copy is updated, and then the tertiary copy (if triple mirroring is enabled).

Parallel write/sequential read


With the parallel write/sequential read policy, writes will go to the copies concurrently
(just like the parallel policy) but the reads are only from the primary copy (like the
sequential policy). This policy tends to get better performance when you're doing
sequential reads from the application.
This is highly beneficial when dealing with caching disk subsystems that are intelligent
enough to perform their own internal read ahead. By using this, all the reads go to one
copy of the disk and the subsystem sees them as sequential and internally performs
read ahead. Without this, the reads can be sent to different copies of the logical volume
and the disk system is unlikely to see much sequential read activity. Therefore, it will not
perform its own internal read ahead. The same problem can occur with fragmented files
as you cross over the boundary between fragments.

Parallel write/round-robin read


With the parallel write/round-robin read policy, writes will go to the copies concurrently
(just like in parallel policy). The reads, however, are initiated from a copy in round-robin
order. The first time it could be from the primary, the next read from the secondary copy,
the next read back to the primary, the next to the secondary copy, and so on. This
results in equal utilization across the copies in the case of reads when there is never
more than one outstanding I/O at a time. This policy can hurt sequential read
performance however since reads are broken up between the copies.
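A minimal sketch of selecting a policy with the mklv -d flag (p = parallel, the default;
ps = parallel write/sequential read; pr = parallel write/round-robin read; s = sequential),
assuming placeholder volume group, logical volume, and size values:
# Create a two-copy mirrored LV with parallel write/sequential read:
# mklv -c 2 -d ps datavg 224
# Change the policy on an existing logical volume:
# chlv -d ps lv01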


Instructor notes:
Purpose To explain LVM mirroring scheduling policies.
Details Once again keep it fairly brief. The first main point is that the default policy of
parallel generally helps with read performance since it reads from whichever disk has the
shortest queue. The second main point is that this is not true for situations where they are
mirroring between disk groups. In that situation the Parallel Write and Sequential Read
policy is better. This is because sequential read allows the disk array controller to recognize
a sequential read pattern and implement its own read-ahead mechanisms.
Additional information
Transition statement If your application does mostly random I/O to a logical volume
which is large enough to be spread across many logical partitions, then the fragmentation
of those logical partitions can have a negative impact on performance.


Displaying LV fragmentation
# lslv -l lv01
lv01:/mydata
PV                COPIES           IN BAND    DISTRIBUTION
hdisk0            024:000:000      95%        000:001:023:000:000

# lslv -p hdisk0 lv01
hdisk0:lv01:/mattfs
(Allocation map: one entry per physical partition, printed in rows that
cover partitions 1-66, 67-131, and 132-171. Each entry is USED, FREE, or
the logical partition number for lv01. In this example, logical
partitions 0001-0036 occupy one contiguous run of physical partitions,
while logical partitions 0153-0161 are scattered among USED partitions
in a separate area of the disk, showing that the logical volume is
fragmented.)
Figure 5-9. Displaying LV fragmentation

Notes:
Using lslv -l
The lslv -l output shows several characteristics of the logical volume. The PerfPMR
config.sum file lists the output of lslv for each logical volume.
The COPIES column shows the disks where the physical partitions reside. There are
three columns, the first column is for the primary copy, the second column is for the
secondary copy (if mirroring is enabled) and the third column is for the tertiary copy (if
mirroring is enabled).
The IN BAND column shows the percentage of the partitions that met the intra-policy
criteria.
The DISTRIBUTION column shows the locations of the physical partitions of this logical
volume as numbers separated by a colon (:). Each of these numbers represents an
intra-policy location. For example, the first column is edge, then middle, then center,
then inner-middle, and then inner-edge. Of the remaining percentage in the IN BAND
value, the rest may be on a different part of the disk and may be fragmented. On the
other hand, even if the partitions were all in-band, that does not guarantee that they are
not fragmented. Therefore, the lslv -p data should be looked at next.

Using lslv -p
Logical volume fragmentation occurs if logical partitions are not contiguous across the
disk. The lslv -p command shows the logical volume allocation map for the physical
volume given.
The state of the partition is listed as one of the following:
- USED indicates that the physical partition at this location is used by a logical volume
other than the one specified with lslv -p.
- FREE indicates that this physical partition is not used by any logical volume.
- STALE indicates that the specified partition is no longer consistent with other
partitions. The system lists the logical partition number with a question mark if the
partition is stale.
- Where it shows a number, this indicates the logical partition number of the logical
volume specified with the lslv -p command.

Logical volume intra-policy:


The intra policy that the logical volume will use in allocating storage can be seen in the
lslv listing of the logical volume attributes:
# lslv lv01
LOGICAL VOLUME:     lv01                   VOLUME GROUP:   datavg
LV IDENTIFIER:      0001d2ba00004c00000000f98ba97636.5 PERMISSION: read/write
VG STATE:           active/complete        LV STATE:       opened/syncd
TYPE:               jfs2                   WRITE VERIFY:   off
MAX LPs:            32512                  PP SIZE:        32 megabyte(s)
COPIES:             1                      SCHED POLICY:   parallel
LPs:                224                    PPs:            224
STALE PPs:          0                      BB POLICY:      relocatable
INTER-POLICY:       minimum                RELOCATABLE:    yes
INTRA-POLICY:       center                 UPPER BOUND:    32
MOUNT POINT:        /mydata                LABEL:          /mydata
MIRROR WRITE CONSISTENCY: on/ACTIVE
EACH LP COPY ON A SEPARATE PV ?: yes
Serialize IO ?:     NO


Instructor notes:
Purpose Discuss the impact of and detection of LVM fragmentation.
Details Be sure the students understand that LV fragmentation:
- Only significantly affects random I/O performance when the requests range across
the entire large LV (many LPs), rather than being closely clustered.
- Has much less of an impact on cached storage arrays, depending upon the storage
array's stripe size, amount of caching, and the level of activity for other LUNs in the
array
- Develops due to periodic LV expansions where contiguous PPs are no longer
available.
Use the displayed fragmentation to talk about random requests having short seeks if the
PPs for the LV are close together (001-036) but having much larger seeks if the PPs are
vastly discontiguous (jumping to 153-161).
When lslv -p is used with a disk name and a logical volume name, it will show how the
physical partitions on that disk are used. Each word represents a physical partition. If the
logical partition numbers are not in sequential order on that physical disk, then the logical
volume is fragmented across that disk.
Additional information
Transition statement Now, let's change topics away from the impact of logical volume
characteristics to other causes of poor I/O performance and how to collect the related
performance statistics. We will start with the most common I/O statistics tool: the iostat
command.


Using iostat
# iostat 5
System configuration: lcpu=2 drives=3 paths=2 vdisks=0
tty:      tin         tout    avg-cpu:  % user  % sys  % idle  % iowait
          0.0         86.8              56.8    43.2    0.0     0.0

Disks:        % tm_act     Kbps      tps    Kb_read   Kb_wrtn
hdisk0          99.7     7676.0    248.9      4260     72500
hdisk1           0.0        0.0      0.0         0         0
cd0              0.0        0.0      0.0         0         0

tty:      tin         tout    avg-cpu:  % user  % sys  % idle  % iowait
          0.0         97.1              57.7    42.3    0.0     0.0

Disks:        % tm_act     Kbps      tps    Kb_read   Kb_wrtn
hdisk0          98.5     8381.7    261.8      4420     79648
hdisk1           0.0        0.0      0.0         0         0
cd0              0.0        0.0      0.0         0         0
Figure 5-10. Using iostat

Notes:
Introduction
The iostat command is used for monitoring system input/output device load by
observing the time the physical disks are active in relation to their average transfer
rates. It does not provide data for file systems or logical volumes. The iostat
command generates reports that can be used to change the system configuration to
better balance the input/output load between physical disks and adapters.

Data collection
iostat displays a system configuration line as the first line after the command is
invoked. If a configuration change is detected during an execution interval, a warning
line is displayed before the data, followed by a new configuration line and the header.


The iostat command only reports current intervals, so the first interval of the command
output is now meaningful and does not represent statistics collected from system boot.
Internal to the command, the first interval is never displayed, and therefore there may
be a slightly longer wait for the first displayed interval to appear. Scripts that discard the
first interval should function as before.
Disk I/O statistics since last reboot are not collected by default (this is configurable
using the iostat attribute of the sys0 device). When iostat is run without an interval, it
attempts to show only the statistics since last reboot; because they are not being
collected, it will display this message:
Disk History Since Boot Not Available

To check current settings, enter the following command:
# lsattr -E -l sys0 -a iostat
To enable this data collection, enter the following command:
# chdev -l sys0 -a iostat=true

Report data
There are two sections to the iostat report. By default, both are displayed. You can
restrict the report to only one of the sections. The sections are:
- tty and CPU utilization (iostat -t specifies the tty/CPU utilization report only)
- Disk utilization (iostat -d specifies the disk utilization report only)
The columns in the disk utilization report are:
- Disks lists the disk name.
- %tm_act specifies the percentage of time during that interval that the disk had at
least one I/O in progress. A drive is active during data transfer and command
processing, such as seeking to a new location.
- Kbps indicates the throughput of that disk during the interval in kilobytes per second.
This is the sum of Kb_read plus Kb_wrtn, divided by the number of seconds in the
reporting interval.
- tps indicates the number of physical disk transfers per second during that
monitoring period. A transfer is an I/O request to the physical disk. Multiple logical
requests can be combined into a single I/O request to the disk. A transfer is of
indeterminate size.
- Kb_read indicates the kilobytes of data read during that interval.
- Kb_wrtn indicates the kilobytes of data written on that disk during the interval.

When running PerfPMR (in the iostat.sh script), this information is put in the
monitor.int file.
The flag, -D, provides the following additional information:
- Metrics related to disk transfers
- Disk read service metrics
- Disk write service metrics
- Disk wait queue service metrics
The -l flag (lowercase L) can be used with the -D flag to provide a long listing, which
makes it easier to read.
iostat -Dl data is collected with PerfPMR (in the iostat.sh script) and put in the
iostat-Dl.out file.

What to look for


Taken alone, there is no unacceptable value for any of the fields because statistics are
too closely related to application characteristics, system configuration, and types of
physical disk drives and adapters. Therefore, when evaluating data, you must look for
patterns and relationships. The most common relationship is between disk utilization
and data transfer rate.
To draw any valid conclusions from this data, you must understand the application's
disk data access patterns (sequential, random, or a combination), and the type of
physical disk drives and adapters on the system.
For example, if an application reads and writes sequentially, you should expect a high
disk transfer rate when you have a high disk busy rate. (Kb_read and Kb_wrtn can
confirm an understanding of an application's read and write behavior but they provide
no information on the data access patterns.)
Generally, you do not need to be concerned about a high disk busy rate as long as the
disk transfer rate is also high. However, if you get a high disk busy rate and a low data
transfer rate, you may have a fragmented logical volume, file system, or individual file.
The average physical I/O size can be calculated by dividing the Kbps value by the tps
value.
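For example, in the first interval of the Figure 5-10 report, hdisk0 shows 7676.0 Kbps
at 248.9 tps, which works out to an average physical I/O size of roughly 31 KB
(7676.0 / 248.9).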
What is a high data-transfer rate? That depends on the disk drive and the effective
data-transfer rate for that drive.

What can you do?


The busier a disk is, the more likely it is that I/Os to that disk will have to wait longer. You
may get higher performance by moving some of that disk's activity to another disk or
spreading the I/O across multiple disk drives. In our example, hdisk0 is receiving the
majority of the workload. Perhaps hdisk1 is under-utilized and some disk

I/O performance can be gained by allocating more logical volumes and file system
blocks to hdisk1, but you should first examine what is on each of the disks. It is also
important to find out what kind of disk I/O is taking place on hdisk0. The filemon tool
will help us determine this.


Instructor notes:
Purpose To explain the use of iostat.
Details Point out the two parts of the report, tty and avg-cpu line and disk utilization.
State that we will discuss the significance of the %iowait statistic on the next visual.
Then, point out the statistics per disk. Explain the use of the %tm_act for detecting load
imbalances. Explain the use of %tm_act in combination with the Kbps statistic to get a
sense of the throughput achieved. Be sure it is clear that the rate per second only
represents the load presented by the application if the %tm_act is not 100%.
State that we will show them another excellent report for identifying if a disk is overloaded
(sar -d) in the visual after that. Emphasize that we always need to analyze on a disk by
disk basis (looking to rebalance the I/O load), or identify the application which is driving the
I/O load (looking to re-schedule that work, or study the design of the application). It is not
the entire system which is I/O bound, it is individual programs and the particular I/O
resources which they depend upon.
Additional information
Prior to AIX 5L V5.3, when running iostat in interval mode, the first interval provided
statistics accumulated since the system was booted. As such, it did not represent the
current situation since it was diluted by a long prior period of normal operation. As a result,
administrators typically ignored this non-meaningful first-interval data. Many scripts would
filter out the first period reported by iostat. Each subsequent report covered the time
since the previous report. Prior to AIX 5L V5.3, the first line of interval output will display the
message:
Disk History Since Boot Not Available
Transition statement Now that we've seen the iowait statistic, what exactly does it
mean?


What is iowait?
- iowait is a form of idle time.
- The iowait statistic is simply the percentage of time the CPU is idle AND there is at
least one I/O still in progress (started from that CPU).
- The iowait value seen in the output of commands like vmstat, iostat, and topas is
the average of the iowait percentages across all CPUs.
- High I/O wait does not mean that there is definitely an I/O bottleneck.
- Zero I/O wait does not mean that there is not an I/O bottleneck.
- A CPU in I/O wait state can still execute threads if there are any runnable threads.
Figure 5-11. What is iowait?

Notes:
Introduction
To summarize it in one sentence, iowait is the percentage of time the CPU is idle AND
there is at least one I/O in progress. At any point in time, each CPU can be in one of
four states:
- user
- sys
- idle
- iowait
Performance tools such as vmstat, iostat, and sar print out these four states as a
percentage. The sar tool can print out the states on a per CPU basis (-P flag) but most
other tools print out the average values across all the CPUs. Since these are
percentage values, the four state values should add up to 100%.
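For a per-CPU view, a minimal sketch using the -P flag (illustrative output only; the
exact columns vary by AIX level and partition type):
# sar -P ALL 1 1
17:30:01 cpu    %usr    %sys    %wio   %idle
17:30:02   0      12      20      35      33
           1       5       8       0      87
           -       9      14      17      60
The last row (-) is the average across all CPUs, which is what tools like vmstat and
iostat report.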


Example
For a single CPU system with one thread running that does exactly 5 ms of computation
and then a read that takes 5 ms, the I/O wait would be 50%.
If we were to add a second thread (with the same mix of computation and I/O) onto the
system, then we would have 100% user, 0% system, 0% idle and 0% I/O wait.
If we were to next add a third thread, nothing in the statistics would change (the CPU is
already 100% busy), but because of the contention, we might see a drop in overall
throughput.

Instructor notes:
Purpose To explain the meaning of the iowait statistic.
Details The iowait statistic is often the most misinterpreted statistic. Many people
think that if the iowait value is high, then that means there is definitely an I/O bottleneck.
However, this is not always true. Emphasize that you cannot draw any conclusions about
the system in terms of it being I/O or CPU bound with only the iowait statistic. It must be
considered with other statistics as well. What we can conclude from a high iowait value
(since it represents CPU idle time), is that CPU capacity is not currently a bottleneck
requiring investment.
Additional information
Transition statement Let us take a look at a logical resource that can constrain I/O
performance: the pbuf pools for your volume groups.


LVM pbufs
- LVM pbufs are used to hold I/O requests and control pending disk I/O requests at
the LVM layer.
- One LVM pbuf is used for each individual I/O.
- Insufficient pbufs can result in blocked I/O.
- LVM pbufs use pinned memory.
- LVM pbuf pool:
  - One pbuf pool per volume group (AIX 5L V5.3 and later)
  - Automatically scales as disks are added to the volume group
  - pbufs per disk is tunable:
    - For each volume group
    - With a global default
Figure 5-12. LVM pbufs

Notes:
What are LVM pbufs?
The pbufs are pinned memory buffers used to hold I/O requests and control pending
disk I/O requests at the LVM layer. One pbuf is used for each individual I/O request,
regardless of the amount of data that is going to be transferred. If the system runs out of
pbufs, then the LVM I/O waits until one of these buffers has been released due to
another I/O being completed.

LVM pbuf pool


Prior to AIX 5L V5.3, the pbuf pool was a system wide resource. Starting with AIX 5L
V5.3, the LVM assigns and manages one pbuf pool per volume group. AIX creates extra
pbufs when a new physical volume is added to a volume group.


Instructor notes:
Purpose Describe pbufs and their role in LVM performance.
Details The size of an LVM pbuf is approximately 200 bytes. If LVM uses up all of its
pbufs and, as a result, I/Os are blocked due to the lack of free pbufs, this does not
necessarily mean that they are sized incorrectly. Typically, an LVM pbuf pool which is
larger than 2 or 3 times the queue depth of the underlying devices provides little if any
increase in overall throughput. A pbuf pool that is too large could adversely influence the
performance of other volume groups on the same adapters. The pbuf pool comes from the
kernel heap, so you have to be careful not to make it too large.
Additional information
Transition statement Let's take a look at how to view and change the number of pbufs.


Viewing and changing LVM pbufs

Viewing LVM pbuf information:
# lvmo -v <vg_name> -a
vgname = rootvg
pv_pbuf_count = 512
total_vg_pbufs = 1024
max_vg_pbuf_count = 16384
pervg_blocked_io_count = 0
global_blocked_io_count = 0

Changing LVM pbufs:
Global tuning:
# ioo -o pv_min_pbuf=<new value>
Tuning for each volume group:
# lvmo -v <vg_name> -o pv_pbuf_count=<new value>
# lvmo -v <vg_name> -o max_vg_pbuf_count=<new value>
Figure 5-13. Viewing and changing LVM pbufs

Notes:
Viewing and changing pbufs
The lvmo command provides support for pbuf pool-related administrative tasks.
The syntax for the lvmo command is:
lvmo [-a] [-v VGName] -o Tunable [ =NewValue ]


The lvmo -a command is used to display pbuf and blocked I/O statistics and the
settings for pbuf tunables (system wide or volume group specific). Sample output:
# lvmo -a
vgname = rootvg
pv_pbuf_count = 512
total_vg_pbufs = 1024
max_vg_pbuf_count = 16384
pervg_blocked_io_count = 0
global_blocked_io_count = 0
If the vgname is not provided as an option to the lvmo command, it defaults to rootvg.

The definitions for the fields in this report are:


- vgname: Volume group name specified with the -v option.
- pv_pbuf_count: The number of pbufs that are added when a physical volume is
added to the volume group.
- total_vg_pbufs: Current total number of pbufs available for the volume group.
- max_vg_pbuf_count: The maximum number of pbufs that can be allocated for the
volume group.
- pervg_blocked_io_count: Number of I/Os that were blocked due to lack of free pbufs
for the volume group.
- pv_min_pbuf: The minimum number of pbufs that are added when a physical
volume is added to any volume group.
- global_blocked_io_count: Number of I/Os that were blocked due to lack of free
pbufs for all volume groups.

When changing settings, the lvmo command can only change the LVM pbuf tunables
that are for specific volume groups. These are:
- pv_pbuf_count - The number of pbufs that will be added when a physical volume
is added to the volume group. Takes effect immediately at run time. The default
value is 256 for the 32-bit kernel and 512 for the 64-bit kernel.
- max_vg_pbuf_count - The maximum number of pbufs that can be allocated for
the volume group. Takes effect after the volume group has been varied off and
varied on again.
The system wide parameter pv_min_pbuf is tunable with the ioo command. It sets
the minimum number of pbufs that will be added when a physical volume is added to
any volume group.


If both the volume group specific parameter pv_pbuf_count, and the system wide
parameter pv_min_pbuf are configured, the larger value takes precedence over the
smaller.

Making changes permanent


To make changes permanent with the ioo command, use the -p flag.
The lvmo command does not have any options to make changes permanent.
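Putting this together, a minimal sketch of tuning pbufs (hypothetical volume group name
and values):
# lvmo -v datavg -o pv_pbuf_count=1024
# ioo -p -o pv_min_pbuf=1024
The lvmo change takes effect immediately but is not persistent, so it would have to be
reapplied after each reboot; the ioo -p form makes the global minimum permanent.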


Instructor notes:
Purpose Explain tuning of pbufs.
Details As with all of the AIX pools of control blocks, the system administrator needs to
be careful about unnecessarily increasing the pool size. These use pinned memory and too
large a pool can aggravate memory over-commitment. If more pbufs are truly needed, it
may be necessary to add more memory to properly support I/O performance.
Additional information
Viewing and changing pbufs prior to AIX 5L V5.3
The best way to determine if a pbuf bottleneck is occurring system wide is to examine
an LVM variable called hd_pendqblked. In AIX 5L V5.2, vmstat -v shows the value
of hd_pendqblked, and ioo -o hd_pbuf_cnt=<value> tunes the number of pbufs.

Transition statement Before we look at more detailed statistics, let us discuss the
components and queues involved in disk I/O.


I/O request disk queuing

(Diagram: within the AIX host, I/O requests queue in the disk device driver's wait queue
and then move to its service queue, which cannot exceed the disk queue_depth
attribute; sent requests pass through the adapter, which can handle at most
num_cmd_elements requests, to the disk, and the I/O results return along the same path.)

- Check with the vendor to find out the recommended queue depth.
- Best starting point: install the vendor-provided filesets before discovering the disks;
the queue_depth default will then be appropriately set.
- If not using the vendor fileset, the disk description will contain "Other" and the default
queue_depth may not be appropriate.
Figure 5-14. I/O request disk queuing

Notes:
Understanding the queues in I/O processing is essential to understanding the I/O statistics
and their significance.
Some storage devices can only handle one request at a time while others can handle a
large number of overlapping storage requests. If a host sends more overlapping requests
than the storage device can handle, the device will likely reject the extra requests. This is
very undesirable because the host then has to go through error recovery and retransmit
those requests.
In order to avoid this, the AIX disk definition has an attribute of queue_depth, which
identifies the limit of how many overlapping I/O requests can be sent to the disk. If more
requests than this number arrive at the disk device driver, they are queued up on the wait
queue. The requests that are sent to the storage adapter remain queued on the disk
device driver's service queue (until each request is completed). The service queue
cannot grow any larger than the queue_depth limit.
The requests that arrive at the storage adapter device driver are handled by command
elements. The adapter has a limited number of command elements (configured by the

num_cmd_elements attribute of the adapter definition). If more requests arrive at the
adapter than this limit, they are rejected.
The requests are transmitted to the storage device, where they are processed and the
results returned to the host. The time that it takes for a transmitted request to be
completed is an important measurement of the storage device performance and the
connection it flows over. A completed response is handled by the storage adapter, which
notifies the disk device driver. The adapter command element is freed up to service another
request. The disk device driver processes the request completion and notifies LVM
(assuming an application is not doing direct raw disk I/O). The control block representing
the request on the service queue is then freed.
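A minimal sketch of inspecting and adjusting these limits (hypothetical device names and
value):
# lsattr -E -l hdisk5 -a queue_depth
# lsattr -E -l fcs0 -a num_cmd_elements
# chdev -l hdisk5 -a queue_depth=16 -P
The -P flag defers the change until the device is next configured (for example, at reboot),
since queue_depth cannot be changed while the disk is in use.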


Instructor notes:
Purpose Cover the basic components in disk I/O statistics
Details
Additional information
Transition statement There are statistics reports that can show us the details on the
components we just discussed. Let us first look at the iostat disk I/O detail report.

Copyright IBM Corp. 2010

Unit 5. Physical and logical volume performance


Course materials may not be reproduced in whole or in part
without the prior written permission of IBM.

5-47

Instructor Guide

Using iostat -D
# iostat -D hdisk1 5

hdisk1     xfer:  %tm_act      bps      tps     bread     bwrtn
                      0.0      0.0      0.0       0.0       0.0
           read:      rps  avgserv  minserv  maxserv  timeouts     fails
                      0.0      0.0      0.0      0.0         0         0
           write:     wps  avgserv  minserv  maxserv  timeouts     fails
                      0.0      0.0      8.7      8.7         0         0
           queue:  avgtime  mintime  maxtime  avgwqsz  avgsqsz    sqfull
                      0.0      0.0      0.0      0.0       0.0       0.0
--------------------------------------------------------------------------------
hdisk1     xfer:  %tm_act      bps      tps     bread     bwrtn
                     33.9    25.6M    101.2      6.3M     19.2M
           read:      rps  avgserv  minserv  maxserv  timeouts     fails
                     24.3     11.8      3.5     20.4         0         0
           write:     wps  avgserv  minserv  maxserv  timeouts     fails
                     76.9      9.3      3.9     19.1         0         0
           queue:  avgtime  mintime  maxtime  avgwqsz  avgsqsz    sqfull
                     41.8      0.0    258.8      4.0       1.0      99.4
--------------------------------------------------------------------------------
hdisk1     xfer:  %tm_act      bps      tps     bread     bwrtn
                    100.0    70.3M    353.2     19.4M     50.9M
           read:      rps  avgserv  minserv  maxserv  timeouts     fails
                     79.5      9.8      0.2     20.4         0         0
           write:     wps  avgserv  minserv  maxserv  timeouts     fails
                    273.8      8.0      2.2     26.0         0         0
           queue:  avgtime  mintime  maxtime  avgwqsz  avgsqsz    sqfull
                     33.2      0.0    258.8     12.0       2.0     347.0
--------------------------------------------------------------------------------

Figure 5-15. Using iostat -D

Notes:
The iostat -D report gives more detail than the iostat default disk information.
For the read and write metrics it provides:
- rps: Indicates the number of read transfers per second.
- avgserv: Indicates the average service time per read transfer. Different suffixes are
used to represent the unit of time. Default is in milliseconds.
- minserv: Indicates the minimum read service time. Different suffixes are used to
represent the unit of time. Default is in milliseconds.
- maxserv: Indicates the maximum read service time. Different suffixes are used to
represent the unit of time. Default is in milliseconds.
- timeouts: Indicates the number of read timeouts per second.
- fails: Indicates the number of failed read requests per second.
For the queue metrics it provides:


- avgtime: Indicates the average time spent by a transfer request in the wait queue.
Different suffixes are used to represent the unit of time. Default is in milliseconds.
- mintime: Indicates the minimum time spent by a transfer request in the wait queue.
Different suffixes are used to represent the unit of time. Default is in milliseconds.
- maxtime: Indicates the maximum time spent by a transfer request in the wait queue.
Different suffixes are used to represent the unit of time. Default is in milliseconds.
- avgwqsz: Indicates the average wait queue size.
- avgsqsz: Indicates the average service queue size.
- sqfull: Indicates the number of times the service queue becomes full (that is, the disk
is not accepting any more service requests) per second.
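In the Figure 5-15 example, the third interval shows sqfull at 347.0 per second with an
average wait queue size of 12.0, while the average read and write service times stay
under about 10 ms: requests are arriving faster than the queue_depth limit allows them
to be issued, so most of the delay is being spent in the wait queue rather than at the
device itself.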


Instructor notes:
Purpose Discuss the use of the iostat -D report.
Details The main points are:
- The separate service times for read and write requests
- Counter of the number of failed requests
- The queue statistics showing the sizes of the wait and service queues and how
often (per second) arriving requests had to wait due to a full queue.
The wait queue size shows how bad the backlog is: how persistently the I/O requests
arrive faster than they can be processed. The service queue size provides a window
into the amount of command overlapping that is achieved. This can be related to the
command queuing discussion on the previous visual.
Additional information
Transition statement The sar command provides a slightly different look at the disk
activity. Let's take a look at that.


sar -d
# sar -d 1 3

AIX train43 3 5 0009330F4C00    11/05/04

System configuration: lcpu=2 drives=3

22:54:39   device    %busy    avque    r+w/s    Kbs/s   avwait   avserv

22:54:40   hdisk1        5      1.4       18      807      9.5      7.5
           hdisk0        0      0.0        0        0      0.0      0.0
           cd0           0      0.0        0        0      0.0      0.0

22:54:41   hdisk1      100    151.7      405    26039     91.2      7.4
           hdisk0        0      0.0        0        0      0.0      0.0
           cd0           0      0.0        0        0      0.0      0.0

22:54:42   hdisk1       66    104.9      224    16740     22.9      9.0
           hdisk0        0      0.0        0        0      0.0      0.0
           cd0           0      0.0        0        0      0.0      0.0

Average    hdisk1       42     64.5      161    10896     30.9      6.0
           hdisk0        0      0.0        0        0      0.0      0.0
           cd0           0      0.0        0        0      0.0      0.0

Figure 5-16. sar -d

Notes:
Overview
The -d option of sar provides real time disk I/O statistics.
The fields listed by sar -d are:
- %busy - Reports the portion of time device was busy servicing a transfer request.
- avque - Reports the average number of requests waiting to be sent to the disk. This
statistic is a good indicator if an I/O bottleneck exists. (Before AIX 5L V5.3, it
reported the instantaneous number of requests sent to disk but not completed yet.)
- r+w/s - The number of read/write transfers from or to device.
- Kbs/s - The number of kilobytes transferred per second. (Earlier versions reported
this column as blks/s, the number of blocks transferred in 512-byte units.)
- avwait - The average time (in milliseconds) that transfer requests waited idly on
the queue for the device. Prior to AIX 5L V5.3, this was not supported.


If you see large numbers in the avwait column, try to distribute the workload onto
other disks.
- avserv - The average time (in milliseconds) to service each transfer request
(includes seek, rotational latency, and data transfer times) for the device. Prior to
AIX 5L V5.3, this was not supported.
Note: %busy is the same as %tm_act in iostat. r+w/s is equal to tps in iostat.
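As with iostat, the average request size can be derived from this report: in the
22:54:41 interval, hdisk1 moved 26039 Kbs/s with 405 r+w/s, an average of about
64 KB per request.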


Instructor notes:
Purpose To explain the -d option in sar.
Details As with the iostat report, the sar -d report can be used to identify an imbalance
in disk usage and the actual data rate. The important statistics to focus on are:
(1) avque, where a large number indicates a backlog of I/O requests waiting to be sent to
the device.
(2) avwait, which gives an indication of how long requests needed to wait before being
sent.
(3) avserv, which gives a measurement of how quickly the storage device responded to the
request.
Additional information If the disk supports the queue_depth statistic, then this is
also reported. This shows the average number of I/Os queued to that disk during the
interval.
Transition statement Let's now look at the filemon tool.


Using filemon (1 of 2)
# filemon -O lv,pv -o fmon.out
# dd if=/dev/rhdisk0 of=/dev/null bs=32k count=100
# dd if=/dev/zero of=/tmp/junk bs=32k count=100
# trcstop
# more fmon.out

Fri Nov 5 23:16:10 2004
System: AIX ginger Node: 5 Machine: 00049FDF4C00
Cpu utilization: 5.7%

Most Active Logical Volumes
------------------------------------------------------------------
  util  #rblk  #wblk     KB/s  volume       description
------------------------------------------------------------------
  0.28      0   6144  14238.3  /dev/hd3     /tmp
  0.06      0      8     18.5  /dev/hd8     jfs2log

Most Active Physical Volumes
------------------------------------------------------------------
  util  #rblk  #wblk     KB/s  volume       description
------------------------------------------------------------------
  0.78   6400   3592  23155.8  /dev/hdisk0  N/A

Figure 5-17. Using filemon (1 of 2)

Notes:
Overview
The filemon command uses the trace facility to obtain a detailed picture of I/O activity
during a time interval on the various layers of file system utilization, including the logical
file system, virtual memory segments, LVM, and physical disk layers. Data can be
collected on all the layers, or some of the layers. The default is to collect data on the
virtual memory segments, LVM, and physical disk layers.
The report begins with a summary of the I/O activity for each of the levels (the Most
Active sections) and ends with detailed I/O activity for each level (Detailed sections).
Each section is ordered from most active to least active.
The logical file I/O includes reads, writes, opens, and seeks, which may or may not result
in actual physical I/O depending on whether or not the files are already buffered in
memory. Statistics are kept by file.


The virtual memory data contains physical I/O (paging) between segments and disk.
Statistics are per segment.
Since it uses the trace facility, the filemon command can be run only by the root user
or by a member of the system group. Note that if filemon shows dropped events, the
data is not reliable; filemon should be re-run specifying larger buffer sizes.
When running PerfPMR, the filemon data is in the filemon.sum file.
Only data for those files opened after the filemon command was started will be
collected, unless you specify the -u flag.

Running filemon
Data can be collected on all the layers, or layers can be specified with the -O layer
option. Valid -O options are:
- lf - Monitor file I/O
- vm - Monitor virtual memory I/O
- lv - Monitor logical volume I/O
- pv - Monitor physical volume I/O
- all - Monitor all (lf, vm, lv, and pv)
By default, filemon runs in the background while other applications are running and
being monitored. When the trcstop command is issued, filemon stops and generates
its report. You may want to issue nice -n -20 trcstop to stop filemon since filemon
is currently running at priority 40.
Only the top 20 logical files and segments are reported unless the -v (verbose) flag is
used.
To produce filemon output from a previously collected AIX trace, use the -i option with
an AIX trace file and the -n option with a file that contains gennames output (gennames
-f must be used for filemon offline profiling).

Example
The visual shows an example of:
1. Starting filemon and redirecting the output to the fmon.out file
2. Issuing some I/O intensive commands
3. Stopping filemon with trcstop
4. Examining the fmon.out file
This report shows logical volume activity information and expands on the physical disk
information from iostat by displaying a description of the disk. This is useful for
determining whether the greatest activity is on your fastest disks.

The reason that the logical volume utilizations are no more than 28% but the hdisk
utilization is 78% is because the first dd command is reading directly from the hdisk and
bypassing the LVM.


Instructor notes:
Purpose To review the filemon example.
Details For your reference, some of the filemon flags are:
-d      You can defer the start of tracing by using the -d flag. In this case, filemon
        will start when the trcon command is issued. filemon can be suspended via
        the trcoff command.
-o      By default, filemon prints its report to standard out, but the report can be
        redirected to a file with the -o option.
-i      filemon can also read an existing trace report file via the -i flag.
-P      A -P option is available to pin the filemon process in memory.
-T num  The -T number flag can be used to set the kernel's trace buffer size. The
        default size is 32000 bytes. Trace buffers are pinned in memory. Increasing
        their size can negatively affect paging and I/O. However, if they are not large
        enough, trace events will be dropped, rendering the data collected useless. Try
        the defaults, then double the values each time dropped events are logged in the
        output.
If the trace facility is already in use, the message /dev/systrace: device busy
appears. The -i option is used to specify an AIX trace file that was previously collected.
This is known as off-line profiling. If using off-line profiling, the -n option must be used with
the output of gennames. If gennames is run, the -f option on gennames must be used if it's to
be used with filemon. The -f option collects information on the devices in /dev.
If remote file I/O were involved, it would show up in the volume column with the remote
system name.
Often, when the message filemon: Reporting completed followed by filemon: 1.871
secs in measured interval appears, signaling that the filemon report has been written,
you will need to press the <Enter> key to get your prompt back.
Additional information
Transition statement Let's look at more detail of the filemon example.


Using filemon (2 of 2)
------------------------------------------------------------------------
Detailed Physical Volume Stats   (512 byte blocks)
------------------------------------------------------------------------
VOLUME: /dev/hdisk0  description: N/A
reads:                  100     (0 errs)
  read sizes (blks):    avg    64.0  min      64  max      64  sdev     0.0
  read times (msec):    avg   0.952  min   0.518  max  12.906  sdev   1.263
  read sequences:       1
  read seq. lengths:    avg  6400.0  min    6400  max    6400  sdev     0.0
writes:                 15      (0 errs)
  write sizes (blks):   avg   239.5  min       8  max     256  sdev    61.9
  write times (msec):   avg   5.572  min   3.716  max  12.736  sdev   2.618
  write sequences:      2
  write seq. lengths:   avg  1796.0  min       8  max    3584  sdev  1788.0
seeks:                  2       (1.7%)
  seek dist (blks):     init       0,
                        avg 7284988.0  min  324392  max 14245584  sdev 6960596.0
  seek dist (%tot blks):init 0.00000,
                        avg 20.49320  min 0.91254  max 40.07386  sdev 19.58066
time to next req(msec): avg   1.565  min   0.581  max  28.526  sdev   3.090
throughput:             23155.8 KB/sec
utilization:            0.78

Figure 5-18. Using filemon (2 of 2)

Notes:
What to look for
The physical volume statistics can be used to determine physical disk access patterns.
The seek distance shows how far the disk had to seek. The longer the distance, the
longer it takes. If the majority of the reads and writes required seeks, you may have
fragmented files and/or more than one busy file system on the same physical disk. If the
number of reads and writes approaches the number of sequences, physical disk access
is more random than sequential.
As the number of seek operations approaches the number of reads or writes (compare
against the corresponding operation type), the data access becomes less sequential and
more random.


Example
This visual shows that there were 100 reads where the average size was 64 blocks or
32 KB (since each block is 512 bytes). It also gives the average time in milliseconds to
complete the disk read (as well as min, max, and standard deviation). The number of
sequences compared to the number of reads indicates how sequential the I/Os were.
One sequence with 100 reads means it was fully sequential. A large number of seeks
indicates either fragmentation or random I/O.


Instructor notes:
Purpose To examine physical volume details of the filemon example.
Details The fields to focus on are:
- Read and write times (how fast is the disk?)
- Throughput (can be related to fragmentation, queue depth, and so forth).
The next fields can be useful in understanding why the throughput or response times
are different:
- Read and write block sizes (much larger block sizes take longer to read, focus on
the time per KB; can also represent a situation where I/O is being done in small
block, impacting throughput)
- Seeks and sequences (indication of randomness of I/O or extent of fragmentation)
- Remember that what AIX thinks is a large seek may actually be accessing data that
is in the storage subsystem cache, requiring no actual delay in obtaining the data.
When benchmarking equipment or experimenting with tuning options, it is best to
either clear the storage subsystem cache or appropriately pre-cache the data
each time, to have a fair comparison.
Set the students' expectations that we will teach them the most useful fields to get them
started. The more advanced fields will only come into play when they do very advanced
performance analysis. Remember, the goal is to teach them to narrow down on the
performance bottleneck that they experience and to enable students to collect data when
the bottleneck is present. We are NOT teaching them to fix all problems or do a complete
analysis in this course. We will just get them started and running in the right direction and
have them tune and fix the most common bottlenecks.
Additional information A SCSI drive can give you anywhere from 4 ms to 8 ms
response time on reads and writes. On a fibre-attached ESS system, I have seen anywhere
from 1 ms to 2 ms response time on reads and 2 ms to 5 ms on writes.
Transition statement One of the most common I/O bottleneck tuning is to isolate hot
spots and resolve them by migrating partitions or logical volumes. A tool called lvmstat
can be used to show hot partitions.


Managing uneven disk workloads
- Using the previous monitoring tools:
  - Identify saturated hdisks.
  - Identify underused or unused hdisks (separate disk groups).
- Use the migratepv command to move logical volumes from one physical disk to
another, to even the load:
  migratepv -l lvname source_disk destination_disk
- Set the LV range to maximum with an upper bound and reorganize:
  chlv -e m -u upperbound logicalvolume
  reorgvg VolumeGroup LogicalVolume
- Convert to using LVM striping across the candidate disks:
  - Use best practices in defining the LVM striping.
  - Back up the data, redefine the LVs, and restore the data.
- Micro-manage the position of hot physical partitions:
  - Use lvmstat to identify hot logical partitions.
  - Use migratelp to move individual logical partitions to even the load.
  - Once you go down this path, you may need to continually monitor and shift the
hotspots manually.
Figure 5-19. Managing uneven disk workloads

Notes:
Moving a logical volume to another physical disk
One way to solve an I/O bottleneck is to see if placement of different logical volumes
across multiple physical disks is possible. First, you would have to determine if a
particular physical disk is being heavily utilized using the iostat or filemon
commands. Second, determine if there are multiple logical volumes being accessed on
that same physical disk. If so, then you can move one logical volume from that physical
disk to another physical disk. Of course, if you move it to a much slower disk, your
performance may be worse than having two logical volumes on the same fast disk. The
moving of a logical volume can be easily accomplished by using the migratepv
command. A logical volume can be moved or migrated even while it's in use. The
syntax of migratepv for moving a logical volume is:
migratepv -l lvname source_disk destination_disk
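For example (hypothetical names), to move the logical volume datalv from hdisk0 to
hdisk1 while it remains in use:
# migratepv -l datalv hdisk0 hdisk1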


Spreading traffic with LV inter-policies


Sometimes you need to spread traffic for a single logical volume, in which case moving
the entire LV to another disk may not be the right solution. In that case, you can use what
is called poor man's striping. This involves spreading the logical partitions evenly
across multiple disks. It works well if the random I/O demand is evenly spread across
the logical volume.
To do this, you need to set the particular logical volume's inter-policies as follows:
- Set the range to maximum ( -e m)
- Set the upper bound to the number of disks you wish to use (-u #).
Then, you run reorgvg against that logical volume. It will move the logical partitions to
spread them as equally as possible between the candidate disks.
The problem here is that you cannot name which particular disks to use. If you need to
constrain it to particular disk, then you need to have the logical volume in its own
volume group with only those disks. That may require a backup and restore. Or you
may be able to copy it to a new volume group using the cplv command.
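A minimal sketch of this sequence (hypothetical names, spreading the logical volume
across four disks in its volume group):
# chlv -e m -u 4 datalv
# reorgvg datavg datalv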

Spreading traffic with LVM striping


If the previous methods do not properly do the job, it is likely because the LV has hot
spots in the distribution of data and the hot LVM physical partitions are still mostly on
one disk. The easiest way to spread this load is to use LVM striping across the disks.
This works because the LVM stripe unit is so much smaller than the physical partition
size. But if doing striping, it needs to be done efficiently.
- The more physical volumes used the more spread the load.
- Avoid the adapters from being the single point of contention.
- Avoid other uses for the disks. A good way to do this is to create a separate volume
group for striped logical volumes.
- Set a stripe-unit size of 64 KB. Setting too small a stripe unit size will fragment the
I/O and impact performance. 64 KB has been found to be optimal in most situations.
- If doing sequential reads or random reads of a very large size, set the maximum
page ahead (see read-ahead discussion in the next unit) to 16 times the number of
disk drives. This causes page-ahead to be done in units of the stripe-unit size (64
KB) times the number of disk drives, resulting in the reading of one stripe unit from
each disk drive for each read-ahead operation.
- Have the application use read and write sizes which are a multiple of the stripe unit
size, or even better (if practical) the sizes equal to the full stripe (64 KB times the
number of disk drives)
- Modify maxfree, using the ioo command, to accommodate the change in the
maximum page ahead value (maxfree = minfree + <max page ahead>).

Moving logical partitions


There may also be the case where a disk has a single very large logical volume on it. In
this case, moving the entire logical volume to an equivalent disk would not help. You
could check to see if individual partitions are accessed heavily. For example, with a
large partition size and a database on a raw logical volume with too small of a database
buffer cache, the individual physical partition may be accessed heavily. The command
lvmstat in AIX can be used to check for this. To move an individual partition, a
command called migratelp is available.
The syntax of migratelp is:
migratelp lvname/lpartnum[/copynum] destpv[/ppartnum]
migratelp moves the specified logical partition lpartnumber of the logical volume
lvname to the destpv physical volume. If the destination physical partition ppartnum is
specified it will be used. Otherwise, a destination partition is selected using the
intra-allocation policy of the logical volume. By default, the first mirror copy of the logical
partition in question is migrated. A value of 1, 2 or 3 can be specified for copynum to
migrate a particular mirror copy.
Examples:
- To move the first logical partition of logical volume lv00 to hdisk1:
# migratelp lv00/1 hdisk1
- To move second mirror copy of the third logical partition of logical volume hd2 to
hdisk5:
# migratelp hd2/3/2 hdisk5

Migrating physical partitions


Rather than migrating entire logical volumes from one disk to another in an attempt to
rebalance the workload, if we can identify the individual hot logical partitions, then we
can focus on migrating just those to another disk. The lvmstat utility can be used to
monitor the utilization of individual logical partitions of a logical volume. By default,
statistics are not kept on a per partition basis. These statistics can be enabled with the
lvmstat -e option. You can enable statistics for:
- All logical volumes in a volume group with lvmstat -e -v vgname
- Per logical volume basis with lvmstat -e -l lvname
The first report generated by lvmstat provides statistics concerning the time since the
system was booted. Each subsequent report covers the time since the previous report.
All statistics are reported each time lvmstat runs. The report consists of a header row
followed by a line of statistics for each logical partition or logical volume depending on
the flags specified.
Copyright IBM Corp. 2010

Unit 5. Physical and logical volume performance


Course materials may not be reproduced in whole or in part
without the prior written permission of IBM.

5-63

Instructor Guide

lvmstat syntax
The syntax of lvmstat is:
lvmstat {-l|-v} Name [-e|-d][-F][-C][-c Count][-s][Interval [Iterations]]
If the -l flag is specified, Name is the logical volume name, and the statistics are for the
physical partitions of this logical volume. The mirror copies of the logical partitions are
considered individually for the statistics reporting. They are listed in descending order of
number of I/Os (iocnt) to the partition.
The Interval parameter specifies the amount of time, in seconds, between each
report. The first report contains statistics for the time since the volume group startup.
Each subsequent report contains statistics collected during the interval since the
previous report.
If the Count parameter is specified, only the top Count lines of the report are generated.
For a logical volume if Count is 10, only the 10 busiest partitions are identified.
If the Iterations parameter is specified in conjunction with the Interval parameter,
then only that many iterations are run. If no Iterations parameter is specified,
lvmstat generates reports continuously.
If Interval is used to run lvmstat more than once, no reports are printed if the
statistics did not change since the last run. A single period (.) is printed instead.
Statistics can be disabled using the -d option.
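A minimal sketch of the lvmstat workflow (hypothetical volume group and logical volume
names):
# lvmstat -e -v datavg
# lvmstat -l datalv -c 10 5 3
# lvmstat -d -v datavg
The first command enables per-partition statistics for all logical volumes in datavg, the
second reports the 10 busiest partitions of datalv every 5 seconds for 3 iterations, and
the last disables collection when finished.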


Instructor notes:
Purpose Discuss tools in spreading I/O load across disks or disk groups.
Details The options for spreading the data are listed in order of preference.
Emphasize the importance of avoiding single points of contention. Spreading across more
disks will not help if the storage adapter is the same and it is saturated. It also will not help
is the disks are in the same disk group (in fact it may hurt).
Additional information
Transition statement Even if a disk is not a bottleneck the storage adapters may be a
bottleneck. Let us look at how the adapters are related to I/O performance.


Adapter and multipath statistics

(Diagram: an AIX host with storage adapters fcs0, fcs1, and scsi3 and disks including
hdisk4, hdisk5, and hdisk8; the two fibre channel adapters connect through a pair of
SAN switches to a storage subsystem presenting LUN 1 and LUN 2, so the same LUNs
are reachable over multiple paths.)
Figure 5-20. Adapter and multipath statistics

Notes:
There are two scenarios where the storage adapters figure into the I/O performance
analysis:
- There are multiple disks connected to the same adapter
- There are multiple paths to the same disk.
While none of the individual disks may be saturated by the I/O workload, the total I/O for all
disks may overload the storage adapter. If that occurs, you may want to add another
adapter and move some of the disks to that adapter. Remember that too many adapters
with too much traffic can, in turn, overload the PCI bus to which they are connected.
Alternately, you might migrate the data to other disks that already use a different storage
adapter. Remember to avoid setting queue_depths so high that the total of all the queue
depths exceed the number of command elements on the adapter.
If we can have multiple disks using a single adapter in a single disk environment, that is
even more true for the SAN storage environment, where you could expect to have dozens
of hdisks having a single fibre channel adapter as their parent. Beyond that, it is common to

have multiple FC adapters zoned to access the same LUNs. In that environment, there is an
extra layer of path management software to handle adapter load balancing and failover.
Rather than having one fibre channel adapter in idle standby to handle failover
situations, many installations configure both adapters to carry I/O traffic, thus
increasing the throughput capacity. The path control software should load balance properly
to ensure this. In that situation, if one of the adapters is not functioning correctly, or the
load balancing software is not configured correctly, the desired bandwidth will not be
realized.


Instructor notes:
Purpose Discuss performance issues related to storage adapters.
Details
Additional information
Transition statement Let us look at some statistics reports that allow us to investigate
these situations, starting with the iostat adapter report.


Monitoring adapter I/O throughput


iostat -a shows adapter throughput
Disks are listed following the adapter to which they are
attached
# iostat -a
System configuration: lcpu=2 drives=3
tty:

tin
0.2

tout
22.6

avg-cpu:

Adapter:
scsi0

% user
8.6

Kbps
131.7

Disks:
hdisk1
hdisk0

% tm_act
0.0
1.2

% sys
45.2

tps
4.2

Kbps
0.2
131.6

% idle
45.8

Kb_read
128825
tps
0.0
4.2

% iowait
0.4

Kb_wrtn
2618720

Kb_read
3194
125631

Kb_wrtn
0
2618720

Figure 5-21. Monitoring adapter I/O throughput

Notes:
Adapter throughput
The -a option to iostat combines the disks' statistics under the adapter to which they
are connected. The adapter throughput is simply the sum of the throughput of each
of its connected devices. With the -a option, the adapter will be listed first, followed by
its devices and then followed by the next adapter, followed by its devices, and so on.
The adapter throughput values can be used to determine if any particular adapter is
approaching its maximum bandwidth or to see if the I/O is balanced across adapters.

System throughput
In addition, there is also a -s flag that shows system throughput. This is the sum of all
the adapters' throughputs. The system throughput numbers can be used to see if you
are approaching the maximum throughput for the system bus.

Copyright IBM Corp. 2010

Unit 5. Physical and logical volume performance


Course materials may not be reproduced in whole or in part
without the prior written permission of IBM.

5-69

Instructor Guide

For example:
# iostat -s 1
System configuration: lcpu=2 drives=3

tty:      tin         tout    avg-cpu:  % user  % sys  % idle  % iowait
          0.0        583.0              48.5     6.0     0.0    45.5

System: train33.beaverton.ibm.com
            Kbps      tps    Kb_read   Kb_wrtn
         15984.0    339.0          0     15984

Disks:        % tm_act     Kbps      tps    Kb_read   Kb_wrtn
hdisk1           0.0        0.0      0.0          0         0
hdisk0         100.0    15984.0    339.0          0     15984
cd0              0.0        0.0      0.0          0         0


Instructor notes:
Purpose To explain the -a option in iostat (and also -s).
Details
Additional information
Transition statement Next let's see how we can observe the distribution of traffic
across multiple adapters in a multipath environment.


Monitoring multiple paths (1 of 2)


# iostat -md hdisk5 5 1
System configuration: lcpu=2 drives=22 ent=0.30 paths=82 vdisks=13

Disks:        % tm_act     Kbps      tps    Kb_read   Kb_wrtn
hdisk5          99.8    76090.7    399.2     382064    377892

Paths:        % tm_act     Kbps      tps    Kb_read   Kb_wrtn
Path7            0.0        0.0      0.0          0         0
Path6            0.0        0.0      0.0          0         0
Path5            0.0        0.0      0.0          0         0
Path4            0.0        0.0      0.0          0         0
Path3            0.0        0.0      0.0          0         0
Path2            0.0        0.0      0.0          0         0
Path1            0.0        0.0      0.0          0         0
Path0           99.6    76077.9    399.2     382064    377764
Copyright IBM Corporation 2010

Figure 5-22. Monitoring multiple paths (1 of 2)    AN512.0

Notes:
The iostat command's -m option allows us to see the I/O traffic, for one or more disks,
broken down by path. The example shown has two Fibre Channel adapters, both zoned to
have access to the same LUNs. Path0 through Path3 are for one adapter and Path4
through Path7 are for the other adapter.
In the example, it is clear that all of the traffic is being sent over just one of the adapters. An
investigation, in this case, would show that the disk's algorithm attribute was set to
fail_over instead of round_robin.
Note that vendor-specific path management software often has better tools for
examining this than generic tools such as iostat.
Each adapter will have multiple paths for different routing options in the SAN switch fabric.
To correlate the path IDs shown in this report to the available adapters, we would
need to use the lspath command:
# lspath -F "name parent path_id status" -l hdisk5
hdisk5 fscsi0 0 Enabled
hdisk5 fscsi0 1 Enabled
hdisk5 fscsi0 2 Enabled
hdisk5 fscsi0 3 Enabled
hdisk5 fscsi1 4 Enabled
hdisk5 fscsi1 5 Enabled
hdisk5 fscsi1 6 Enabled
hdisk5 fscsi1 7 Enabled
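
To check and correct the path selection behavior described in the notes above, the disk's
algorithm attribute can be examined with lsattr and changed with chdev. A minimal
sketch using the hdisk5 from this example (valid attribute values depend on the driver,
and the change may require the disk to be closed, or chdev -P plus a reboot):
# lsattr -El hdisk5 -a algorithm             (show the current path selection algorithm)
# chdev -l hdisk5 -a algorithm=round_robin   (spread I/O across all enabled paths)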


Instructor notes:
Purpose Show an example of the iostat multipath display.
Details
Additional information
Transition statement Let's take another look at the iostat adapter report, this time in a
multiple HBA environment.


Monitoring multiple paths (2 of 2)


# iostat -ad hdisk5 hdisk12 10
System configuration: lcpu=4 drives=23 paths=163 vdisks=1 tapes=0

Adapter:                   Kbps      tps    Kb_read   Kb_wrtn
fcs0                    76407.9    628.1    385904    377984

Disks:        % tm_act     Kbps      tps    Kb_read   Kb_wrtn
hdisk12          0.0        0.0      0.0         0         0
hdisk5          99.6    76407.9    628.1    385904    377984

Adapter:                   Kbps      tps    Kb_read   Kb_wrtn
fcs1                        0.0      0.0         0         0

Disks:        % tm_act     Kbps      tps    Kb_read   Kb_wrtn
hdisk12          0.0        0.0      0.0         0         0
hdisk5           0.0        0.0      0.0         0         0

Figure 5-23. Monitoring multiple paths (2 of 2)    AN512.0

Notes:
This report looks very similar to the one displayed earlier, except that the same disks are
shown under both the fcs0 adapter and the fcs1 adapter. This report is a little easier to
understand, since you do not need to figure out which adapter is associated with which
path ID.


Instructor notes:
Purpose Show an example of iostat -a in a multipath situation.
Details
Additional information
Transition statement Let's go on to some checkpoint questions to review.


Checkpoint
1. True/False When you see two hdisks on your system, you
know they represent two separate physical disks.
2. List two commands that will provide real time disk I/O
statistics.

3. Identify and define the default mirroring scheduling policy.


_____________________________________________
4. What tools allow you to observe the time the physical disks
are active in relation to their average transfer rates by
monitoring system input/output device loads?
_____________________________________________

Figure 5-24. Checkpoint    AN512.0

Notes:


Instructor notes:
Purpose Review and test the students' understanding of this unit.
Details A suggested approach is to give the students about five minutes to answer the
questions on this page. Then, go over the questions and answers with the class.

Checkpoint solutions
1. True/False When you see two hdisks on your system, you
know they represent two separate physical disks.
False. For example, two hdisks can be separate LUNs that
reside on the same physical disks in a storage subsystem.
2. List two commands that will provide real time disk I/O
statistics.
iostat
sar -d
topas or nmon
3. Identify and define the default mirroring scheduling
policy.
Parallel policy - sends read requests to the least busy
copy and write requests to all copies concurrently
4. What tools allow you to observe the time the physical disks
are active in relation to their average transfer rates by
monitoring system input/output device loads?
iostat and sar

Additional information
Transition statement It's now time for an exercise.


Exercise 5: I/O Performance


Use the filemon command
Locate and fix I/O bottlenecks with the
following tools:
vmstat
iostat
sar
lvmstat
filemon

Figure 5-25. Exercise 5: I/O Performance    AN512.0

Notes:


Instructor notes:
Purpose Summarize the unit.
Details
Additional information
Transition statement The next topic is file system performance.


Unit summary
This unit covered:
Identifying factors related to physical and logical volume
performance
Using performance tools to identify I/O bottlenecks
Configuring logical volumes for optimal performance

Figure 5-26. Unit summary    AN512.0

Notes:


Instructor notes:
Purpose Summarize the unit.
Details
Additional information
Transition statement


Unit 6. File system performance monitoring and tuning
Estimated time
3:00 (2:00 Unit; 1:30 Exercise)

What this unit is about


This unit describes the issues related to file system I/O performance. It
shows you how to use performance tools to monitor and tune file
system I/O performance.

What you should be able to do


After completing this unit, you should be able to:
List characteristics of the file systems that apply to performance
Describe how file fragmentation affects file system I/O performance
Use the filemon tool to evaluate file system performance
Tune:
- JFS logs
- Release-behind
- Read-ahead
- Write-behind
Identify resource bottlenecks for file systems

How you will check your progress


Accountability:
Checkpoints
Machine exercises

References
AIX Version 6.1 Performance Management
AIX Version 6.1 Performance Tools Guide and Reference
AIX Version 6.1 Commands Reference, Volumes 1-6
AIX 5L Practical Performance Tools and Tuning Guide (Redbook), SG24-6478



Unit objectives
After completing this unit, you should be able to:
Describe guidelines for accurate file system measurements
Describe how file fragmentation affects file system I/O
performance
Use the filemon tool to evaluate file system performance
Tune:
JFS and JFS2 logs
Release-behind
Read-ahead
Write-behind
Identify resource bottlenecks for file systems
Figure 6-1. Unit objectives    AN512.0

Notes:


Instructor notes:
Purpose This unit explains the concepts and characteristics of file system I/O and their
relationship to performance.
Details
Additional information
Transition statement Let's start with an overview of the file system I/O layers.


File system I/O layers


Logical File System       (local or NFS)
Virtual Memory Manager    (paging)
Logical Volume Manager    (disk space management)
Physical Disk I/O         (hardware dependent)

Figure 6-2. File system I/O layers    AN512.0

Notes:
Overview
There are a number of layers involved in file system storage and retrieval. It's important
to understand what performance issues are associated with each layer. The
management tools used to monitor file system activity can provide data on each of
these layers.
The effect of a file's physical disk placement on I/O performance diminishes when the
file is buffered in memory. When a file is opened in AIX, it is mapped to a persistent
(JFS) or client (JFS2) data segment in virtual memory. The segment represents a virtual
buffer for the file. The file's blocks map directly to segment pages. The VMM manages
the segment pages, reading file blocks into segment pages upon demand (as they are
accessed). There are several circumstances that cause the VMM to write a page back
to its corresponding block in the file on disk.


Instructor notes:
Purpose To explain file system I/O layers.
Details
Additional information
Transition statement Let's look at file system performance factors.


File system performance factors


Proper performance management at lower layers:
- LVM logical volume and physical volume
- Adapters, paths, and storage subsystem
Large reads and writes at all layers:
- Large application read and write sizes
- Multiple of file system block size
- Manage file fragmentation
- Avoid small discontiguous reads at the LV layer
Avoid significant impacts from journal logging:
- Concurrent file access locking and serialization
- Physical seeks to log on each write
Manage file caching if overcommitted memory
Avoid JFS file compression option

Figure 6-3. File system performance factors    AN512.0

Notes:
Overview
There is a theory that anything that starts out with perfect order will, over time, become
disordered due to outside forces. This concept certainly applies to file systems. The
longer a file system is used, the more likely it will become fragmented. Also, the
dynamic allocation of resources (for example, extending a logical volume) contributes to
the disorder. File system performance is also affected by physical considerations like
the:
- Types of disks and number of adapters
- Amount of memory for file buffering
- Amount of local versus remote file access
- Pattern and amount of file access by applications


Issues of fragmentation
With fragmentation, sequential file access will no longer find contiguous physical disk
blocks. Random access may not find physically contiguous logical records and will have
to access more widely dispersed data. In both cases, seek time for file access grows.
Both JFS and JFS2 attach a virtual memory segment to do I/O. As a result, file data
becomes cached in memory and disk fragmentation does not affect access to the VMM
cached data.

File system CPU overhead


Each read or write operation on a file system is done through system calls. System calls
for reads and writes define the size of the operation, that is, number of bytes. The
smaller the operation the more system calls are needed to read or write the entire file.
Therefore, more CPU time is spent making the system calls. The read or write size
should be a multiple of the file system block size to reduce the amount of CPU time
spent per system call.

Fragment size
The following discussion uses JFS fragments to illustrate the concept; but, the same
principles apply equally to small JFS2 block sizes.
As many whole fragments (or blocks) as necessary are used to store a file or directory's
data.
Consider that we have chosen to use a JFS fragment size of 4 KB, and we are
attempting to store file data which only partially fills a JFS fragment. Potentially, the
amount of unused or wasted space in the partially filled fragment can be high. For
example, if only 500 bytes are stored in this fragment, then 3596 bytes will be wasted.
However, if a smaller JFS fragment size (for example 512 bytes) was used, the amount
of wasted disk space would be greatly reduced to only 12 bytes. Therefore, it is better to
use small fragment sizes if efficient use of available space is required.
Although small fragment sizes can be beneficial in reducing disk space wastage, this
can have an adverse effect on disk I/O activity. For a file with a size of 4 KB stored in a
single fragment of 4 KB, only one disk I/O operation would be required to either read or
write the file. If the choice of the fragment size was 512 bytes, eight fragments would be
allocated to this file, and for a read or write to complete, several additional disk I/O
operations (disk seeks, data transfers, and allocation activity) would be required.
Therefore, for file systems which use a fragment size of 4 KB, the number of disk I/O
operations will be far less than for file systems which employ a smaller fragment size.
Fragments are allocated contiguously or not at all.


Compression
Compression can be used for JFS file systems with a fragment size less than 4 KB. It
uses the Lempel-Ziv (LZ) algorithm, which replaces subsequent occurrences of a given
string with a pointer to the first occurrence. On average, a 50% savings in disk space
is realized.
Compression can be specified when creating the file system through SMIT:
System Storage Management (Physical & Logical Storage) ->
File Systems -> Add / Change / Show / Delete File Systems ->
Journaled File Systems -> Add a Journaled File System ->
Add a Compressed Journaled File System.
Or, use one of the following commands:
- crfs -a compress=LZ <other options>
- mkfs -o compress=LZ <other options>
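As a concrete illustration, a compressed JFS file system could be created and mounted
as follows. This is only a sketch: the volume group (rootvg), mount point (/compfs),
size, and fragment size are assumed values, and the fragment size must be smaller than
4096 for compression to be allowed:
# crfs -v jfs -g rootvg -m /compfs -a size=512M -a frag=2048 -a compress=LZ
# mount /compfs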

JFS compression performance considerations


In addition to increased disk I/O activity and free space fragmentation problems, file
systems using data compression have the following performance considerations:
- Degradation in file system usability arising as a direct result of the data
compression/decompression activity
- All logical blocks in a compressed file system, when modified for the first time, will
be allocated 4096 bytes of disk space, and this space will subsequently be
reallocated when the logical block is written to disk
- In order to perform data compression, approximately 50 CPU cycles per byte are
required, and about 10 CPU cycles per byte for decompression
- The JFS compression kproc (jfsc) runs at a fixed priority of 30 so that while
compression/decompression is occurring, the CPU that this kproc is running on
may not be available to other processes unless they run at a better priority
Compression is not supported for the JFS2 (J2) file systems.


Instructor notes:
Purpose To describe file system performance factors.
Details
Additional information When the policies are initially set for a logical volume, all the
partitions that the logical volume is mapped to may be fairly sequential and in the proper
area of the disk (center, middle, edge). But as the volumes are extended, it may not always
be possible to put the partitions in the optimal place. The same goes for individual files.
Allocation groups do improve performance, but the only tuning aspect is that certain JFS
fragment sizes are only supported within a specific range of allocation group sizes.
Each allocation group contains a static number of contiguous disk i-nodes that occupy
some of the group's fragments. Allocation groups allow the JFS resource allocation policies
to use effective methods to achieve optimum file system I/O performance. These allocation
policies try to cluster disk blocks and disk i-nodes for related data to achieve good locality
for the disk. Files are often read and written sequentially, and files within a directory are
often accessed together. Also, these allocation policies try to distribute unrelated data
throughout the file system in an attempt to minimize free-space fragmentation.
Transition statement How do we measure file I/O performance when experimenting
with alternate configurations and tuning parameters?


How to measure file system performance


General guidelines for accurate measurements
System has to be idle
System management tools like Workload Manager
should be turned off
Storage subsystems should not be shared with other
systems
Files must not be in AIX file cache or storage
subsystem cache for read throughput measurement
Writes must go to the file system disk and not just
written to AIX memory

Figure 6-4. How to measure file system performance    AN512.0

Notes:
Idle system
File system operations require system resources such as CPU, memory, and I/O. The
result of a file system performance measurement will not be accurate if one or more of
these resources are in use by other applications.

System management tools


The same applies if one or more of these resources is managed and/or the statistics are
gathered with system management tools like Workload Manager (WLM). Those tools
should be turned off.

I/O subsystems
I/O subsystems, such as Enterprise Storage Server (ESS), can share disk space
among several systems. The available bandwidth might not be enough to achieve
maximum file system performance if the I/O subsystem is used by other systems during
the performance measurement; thus, it should not be shared.

Read measurement
When a file is cached in memory, a read throughput measurement does not give any
information about the file system throughput since no physical operation on the file
system takes place. The best way to assure that a file is not cached in memory is to
unmount and then mount the file system on which the file is located. You may need to
work with the storage subsystem administrator to ensure that the subsystem cache is
empty of the data you will be reading.
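A minimal sketch of that approach (the /fs file system and file1 are example names):
# umount /fs; mount /fs
# time dd if=/fs/file1 of=/dev/null bs=1024k
After the remount, the file is no longer in AIX memory, so the elapsed time is not
distorted by the AIX file cache.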

Write measurement
A write throughput measurement does not give any information about file system
performance if nothing is written out to disk. Unless the application opens files in such a
way that it does not use file system buffers (such as direct I/O), then each write to a file
is done in memory and is written out to disk by either a syncd or a write-behind
algorithm. The write-behind algorithm should always be used and tuned for a write
throughput measurement.


Instructor notes:
Purpose Explain the basics of file system performance measurements.
Details The basics of file system performance measurements need to be understood
before starting any measurement.
Explain the general guidelines for file system performance measurements. These
guidelines are also valid for ANY other performance measurement.
Explain the difference in CPU, memory, and file system performance measurements.
Remind the students that the ultimate measure of performance is the actual applications
being run. These applications often have their own metrics of throughput or response time.
Additional information We'll discuss write-behind in the next section.
Transition statement Now, lets see how we can measure read throughput.


How to measure read throughput


Useful tools for file system performance measurements
are dd and time
Example of a read throughput measurement with dd
command:
# time dd if=/fs/file1 of=/dev/null bs=1024k
1000+0 records in
1000+0 records out

real    0m0.44s
user    0m0.01s
sys     0m0.45s

Figure 6-5. How to measure read throughput    AN512.0

Notes:
Utilities
The dd command is a good utility to measure the throughput of a file system since it
allows you to specify the exact size for reads or writes as well as the number of
operations. When the dd command is started, it creates a second dd process. One dd
process is used to read and the other to write. This allows dd to provide a continuous
data flow on an SMP machine.

Example
The time command shows the amount of time it took to complete the read.
The read throughput in this example is about 2272 MB per second (1000 MB / 0.44
seconds of real time).


Instructor notes:
Purpose Explain how to measure read throughput using dd command.
Details Tuning the page read-ahead algorithm, discussed later in this unit, can improve
the read throughput of sequential reads.
Additional information
Transition statement Now, lets see how we can measure write throughput.


How to measure write throughput


Ensure sequential write_behind is enabled (default)
Use large amounts of data
Example of a write throughput measurement with dd
command:
# sync; sync; date; dd if=/dev/zero of=/fs/file1 bs=1024k count=800; date; sync; date
Mon Mar 15 16:48:35 CET 2010
800+0 records in.
800+0 records out.
Mon Mar 15 16:48:46 CET 2010
Mon Mar 15 16:48:46 CET 2010

Figure 6-6. How to measure write throughput    AN512.0

Notes:
Overview
Writes to a file are done in memory (unless direct I/O, asynchronous I/O, or
synchronous I/O is used) and will be written out to disk through syncd or the
write-behind algorithm. If the application is not issuing fsync() periodically, then it is
necessary that the file system sequential write-behind mechanism be enabled (the
default). Otherwise, the process could complete with a large amount of data still in
memory and not yet written to disk.
With write-behind, most of the data will be written to disk, but up to 128 KB (by default)
of data could be left unwritten; thus, a large amount of data should be used so
that the 128 KB is a small percentage of the measurement. Write-behind will be
discussed in more detail later.
Placing a sync command before the final date command neither helps nor hurts the
measurement in the default sequential write-behind environment. But, the final sync
command is necessary if you disable sequential write behind. Note that the sync
command is asynchronous; it returns without waiting for the data to be confirmed as


written to disk. But if processing a large amount of data, the unrecorded amount will not
be significant.

Example
The first set of sync commands flushes all modified file pages in memory to disk.
The time between the first and the second date commands is the amount of time the dd
command took to write the file into memory and to process the disk I/O triggered by the
write-behind mechanism. The last time period is what it took to write and commit almost
all of the data to disk.
The time between the second and the third date commands is the time it took the sync
command to schedule any remaining dirty pages to be written to disk. Note that the
sync command will terminate without waiting for all its writes to be committed. For the
default write-behind environment, this is a very short amount of time.
If the write-behind mechanism had been disabled, and there was sufficient memory to
cache the written data, the dd command elapsed time would have been much shorter
(perhaps 5 seconds for this example) and the elapsed time for the final sync command
would have been much longer (around 12 seconds).
The time between the first and third date command is the total amount of time it took to
write the file to disk.
In this example, dd completed after 11 seconds (16:48:46 - 16:48:35) and wrote 72.7
MB per second.


Instructor notes:
Purpose Explain how to measure write throughput.
Details
Additional information
Transition statement Let's see what iostat can show.


Using iostat
# iostat -f /test 1 > ios.out &
[1]     245906
# dd if=/test/file2 bs=1024k of=/dev/null
100+0 records in.
100+0 records out.
# kill %1
# egrep "/test|FS Name" ios.out
FS Name:      % tm_act     Kbps      tps    Kb_read   Kb_wrtn
/test            0.0        0.0      0.0         0         0
FS Name:      % tm_act     Kbps      tps    Kb_read   Kb_wrtn
/test           12.0     3268.0    410.0      3268         0
FS Name:      % tm_act     Kbps      tps    Kb_read   Kb_wrtn
/test           72.0    17308.0   2163.0     17308         0
FS Name:      % tm_act     Kbps      tps    Kb_read   Kb_wrtn
/test           69.0    17248.0   2155.0     17248         0
FS Name:      % tm_act     Kbps      tps    Kb_read   Kb_wrtn
/test           53.0    14576.0   1598.0     14576         0

Figure 6-7. Using iostat    AN512.0

Notes:
Example
The iostat command might help you see if something is going wrong.
The example uses the -f option, which adds per-file-system statistics. This allows you to
see which file systems have the heaviest traffic and to see the statistics for individual
file systems.
The iostat output taken during the dd read operation shows a higher number of
transactions per second (tps) than you would expect. The average block size in this
sample was about 8 KB (calculated as Kbps / tps).
Both the high number of transactions per second and the small block size point to a
problem. You would expect to see larger I/Os and fewer transactions per second,
specifically with a sequential read. This is due to the file system read-ahead
mechanism that will be covered later.
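For example, taking the third sample shown above: 17308.0 Kbps / 2163.0 tps is
approximately 8 KB per transfer, far smaller than the 1 MB reads that dd is issuing at
the application level.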


Since iostat gives us only a general overview of the I/O activity, you need to continue
your analysis with a tool like filemon, which provides more detailed information.

Uempty

Instructor notes:
Purpose Explain how to use the iostat command to determine file system
bottlenecks.
Details Treat this and the next few visuals as a case study. It walks the students
through a scenario where data had been previously cached in memory but page stealing
had fragmented the file cache. This resulted in an inefficient reading of data from disk. It is
hard for them to fully understand exactly what is going on until we cover the read-ahead
mechanisms. So treat it as a mystery as to why the reads are fragmented. And then come
back to it with the answer when we cover sequential read-ahead.
Note that the read size is 1 MB, and the number of records read was 100, indicating the
input file was 100 MB in size.
The iostat output in this example was taken while the dd command read a file that was
partially cached in memory. Sequential page read-ahead resets whenever it finds individual
pages in a previous read-ahead sequence that need to be paged in.
Additional information Note that we are not using the iostat report to measure device
throughput in this example. There are two reasons for this:
- The %tm_act statistic is not 100%, and thus the Kbps statistic cannot represent
device throughput.
- The stats are only for the /test file system. Thus we may not be looking at all of the
traffic to the disk.
Transition statement Let's look at the information that filemon gives us.


Using filemon (1 of 3)
# filemon -u -O lf,pv -o fmon.out
# dd if=/test/file2 bs=1024k of=/dev/null
# trcstop
# more fmon.out
Wed Nov 10 13:24:34 2004
System: AIX train21 Node: 5 Machine: 0001D2BA4C00
Cpu utilization:  6.9%

Most Active Files
------------------------------------------------------------------------
 #MBs  #opns  #rds  #wrs  file                  volume:inode
------------------------------------------------------------------------
101.0      1   101     0  file2                 /dev/jfslv:23
100.0      1     0   100  null
  3.0      0   385     0  pid=270570_fd=20960
  0.2      1    62     0  unix                  /dev/hd3:10
  0.0      0    60    51  pid=208964_fd=14284
  0.0      0   205   107  pid=249896_fd=17736
  0.0      0     0   102  pid=282802_fd=20162

Figure 6-8. Using filemon (1 of 3)    AN512.0

Notes:
Example
This example demonstrates how to use the filemon command to analyze the file
system performance issue as seen with the iostat command in the last visual. The
visual on this page shows the logical file output (lf) from the filemon report. Output is
ordered by #MBs read and written to a file.
By default, the logical file reports are limited to the top 20. If the verbose flag (-v) is
added, activity for all files would be reported. The -u flag is used to generate reports on
files opened prior to the start of the trace daemon.
Look for the most active files to see usage patterns. If they are dynamic files, they may
need to be backed up and restored. The Most Active Files section shows the file2
file (read by the dd command) as the most active file, with one open and 101 reads.
The number of writes (#wrs) is 1 less than the number of reads (#rds), because
end-of-file has been reached.
If the trace does not capture the open call, it does not know what name the file was
opened with. So, it just records the file descriptor number. When the trace does not
have the process name, then it saves the PID.


Instructor notes:
Purpose Explain how to use the filemon command to determine file system
bottlenecks.
Details The filemon data was taken on the same partially cached file used for the
iostat example. Explain the logical file section first then continue to answer the question
about small I/Os.
Explain why the filemon output shows 101 reads although the file is exactly 100 MB in
size.
Remind the students that if filemon reports that it has dropped trace events, the output is
not reliable.
Additional information
Transition statement Let's see what more we can find out.


Using filemon (2 of 3)
------------------------------------------------------------------------
Detailed File Stats
------------------------------------------------------------------------

FILE: /test/file2  volume: /dev/jfslv (/test)  inode: 23
opens:                  1
total bytes xfrd:       105906176
reads:                  101     (0 errs)
  read sizes (bytes):   avg 1048576.0 min 1048576 max 1048576 sdev   0.0
  read times (msec):    avg    30.401 min   0.005 max  38.883 sdev 3.681

FILE: /dev/null
opens:                  1
total bytes xfrd:       104857600
writes:                 100     (0 errs)
  write sizes (bytes):  avg 1048576.0 min 1048576 max 1048576 sdev   0.0
  write times (msec):   avg     0.005 min   0.004 max   0.022 sdev 0.002

Figure 6-9. Using filemon (2 of 3)    AN512.0

Notes:
Detailed File Stats report
The Detailed File Stats report is based on the activity on the interface between the
application and the file system. As such, the number of calls and the size of the reads or
writes reflects the application calls. The read sizes and write sizes will give you an idea
of how efficiently your application is reading and writing information.
In this example, the report shows the average read size is approximately 1 MB, which
matches the block size specified on the dd command on the previous visual.
The size used by an application has performance implications. For sequentially reading
a large file, a larger read size will result in fewer read requests and thus lower CPU
overhead to read the entire file. When specifying an application's read or write block
size, using values that are a multiple of the page size (which is 4 KB) is
recommended.


Instructor notes:
Purpose Explain how to use the filemon command to determine file system
bottlenecks.
Details This is the detailed file output of filemon. It does not show any reason for the
physical I/O behavior caused by the partially cached file. Explain what we can see with the
detailed file output.
Additional information
Transition statement Lets see what more we can find out.


Using filemon (3 of 3)
------------------------------------------------------------------------
Detailed Physical Volume Stats    (512 byte blocks)
------------------------------------------------------------------------

VOLUME: /dev/hdisk1  description: N/A
reads:                  6326    (0 errs)
  read sizes (blks):    avg    16.6 min       8 max        64 sdev     9.5
  read times (msec):    avg   0.301 min   0.100 max     6.057 sdev   0.172
  read sequences:       3125
  read seq. lengths:    avg    33.5 min      32 max      4832 sdev    85.9
seeks:                  3125    (49.4%)
  seek dist (blks):     init 4634960,
                        avg   243.2 min      32 max    659488 sdev 11796.7
  seek dist (%tot blks):init 13.03848,
                        avg 0.00068 min 0.00009 max   1.85519 sdev 0.03318
time to next req(msec): avg   2.148 min   0.191 max 10516.272 sdev 132.204
throughput:             2574.3 KB/sec
utilization:            0.09

Figure 6-10. Using filemon (3 of 3)    AN512.0

Notes:
Detailed Physical Volume Stats report
In contrast with the Detailed File Stats report, the Detailed Physical Volume Stats
report shows the activity at the disk device driver level.
size of the reads and writes to the disk device driver. The file system uses VMM
caching. The default unit of work in VMM is the 4 KB page. But, rather than writing or
reading one page at a time, the file system tends to group work together to read or write
multiple pages at a time. This grouping of work can be seen in the physical volume read
and write sizes provided in this report.
Note that the sizes are expressed in blocks, where a block is the traditional UNIX block
size of 512 bytes. To translate the sizes to KBs, divide the number by 2.


Example
In this report, the minimum read size was 4 KB (8 blocks of 512 bytes), which matches
the VMM page size. The average size approximately matches the 8 KB size that was
calculated from the iostat report. The iostat report and this filemon report are
both reporting the disk device driver activity. The maximum size was 32 KB. Generally,
more work per read is better.
The example in the visual shows 3125 seeks and 6326 reads on hdisk1. You would not
expect to see any seeks here, since dd reads the data sequentially. The file could be
fragmented on the file system or partially cached in real memory (partial caching will
defeat the sequential read-ahead mechanism). A simple test for this would be an
unmount, a mount, and another dd to see whether the behavior changes. Generally, the
fewer the seeks, the better the performance.
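A sketch of that re-test, reusing the commands from this example (fmon2.out is a
hypothetical output file name):
# umount /test; mount /test
# filemon -u -O lf,pv -o fmon2.out
# dd if=/test/file2 bs=1024k of=/dev/null
# trcstop
Compare the seeks and read sizes in fmon2.out with the earlier report: if the seeks
disappear after the remount, partial caching (rather than on-disk fragmentation) was
defeating the sequential read-ahead.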

Uempty

Instructor notes:
Purpose Explain how to use the filemon command to determine file system
bottlenecks.
Details You may want to ask why the average read size is not closer to the maximum
read size. This situation will be explained later in this unit.
Additional information You may be wondering why the minimum read size does not
match the j2_minPageReadAhead value of 2 blocks (this is JFS2). When the application
read size is large, the file system does not attempt to read the entire application read size,
but instead starts a sequential memory read, fast-started with 8 blocks. As it sequentially
reads the memory-mapped segment for the file, it triggers the read-ahead algorithm and
starts doubling the range that it reads. But, it is this fast start amount of 8 blocks which
accounts for the minimum read size of 8 blocks. The maximum read size of 64 blocks is
likely due to the memory fragmentation. Without the memory fragmentation, you would
expect the maximum to match the j2_maxPageReadAhead value of 128 blocks.
A similar statistic would have been seen in the LV detailed report. The author chose
the PV detail report to illustrate.
Transition statement Let's talk about file fragmentation.


Fragmentation and performance


(Diagram: a logical file is mapped through the physical file system's i-nodes to its
physical disk allocation; logically contiguous file blocks may end up scattered across
the disk.)

Figure 6-11. Fragmentation and performance    AN512.0

Notes:
Overview
While an operating system's file is conceptually a sequential and contiguous string of
bytes, the physical reality might be very different. File fragmentation arises from
appending to a file while other applications are also writing to files in the same area.
A file system is considered fragmented when its available space consists of large
numbers of small chunks of space, making it impossible to write out a new file in
contiguous blocks.
Access to fragmented files may result in a large number of seeks and longer I/O
response times (seek latency dominates I/O response time). For example, if the file is
accessed sequentially, a file placement that consists of many, widely separated extents
requires more seeks than a placement that consists of one or a few large contiguous
extents. If the file is accessed randomly, a placement that is widely dispersed requires
longer seeks than a placement in which the file's blocks are close together.


The i-nodes and indirect blocks are part of the file system. They are placed at various
points throughout the file system. This is desirable since it helps keep i-nodes physically
close to data blocks. The disadvantage is that the i-nodes and indirect blocks contribute
to the file fragmentation.


Instructor notes:
Purpose To review fragmentation and performance.
Details With the dynamic allocation of resources, file blocks become more and more
scattered, logically contiguous files become fragmented, and logically contiguous logical
volumes become fragmented.
Access to fragmented files may result in a large number of seeks and longer I/O response
time. At some point, the system administrator may choose to reorganize the placement of
files within logical volumes and the placement of logical volumes within physical volumes,
to reduce fragmentation and to more evenly distribute the total I/O load.
Additional information Do not confuse sparse files with fragmented files. A sparse file
intentionally leaves logical extents filled with nulls (hex 00), due to application requested
seeks when writing to the file. This can look like fragmentation. A sparse file will actually
use more disk space when copied. That is because the file system does not allocate disk
space for file blocks that are skipped over with application seeks. It just notes that the
extent has null content and generates a sequence of nulls when an application reads those
blocks. When copied, the copy utility explicitly writes those hex zeros into new data blocks,
requiring use of disk storage. When the file system is defragmented using the backup and
restore utilities, the restore utility actively re-sparses the file when it sees data blocks
filled with nulls.
Transition statement How can we measure file fragmentation?


Determine fragmentation using fileplace


# fileplace -pv file1
File: file1  Size: 1048576000 bytes  Vol: /dev/hd1
Blk Size: 4096  Frag Size: 4096  Nfrags: 256000
Inode: 28834  Mode: -rw-r--r--  Owner: root  Group: system

  Physical Addresses (mirror copy 1)                             Logical Extent
  ----------------------------------                             ---------------
  04075296-04076063  hdisk8    768 frags    3145728 Bytes,  0.3%  00077056-00077823
  04077600-04082719  hdisk8   5120 frags   20971520 Bytes,  2.0%  00079360-00084479
  04084512-04085023  hdisk8    512 frags    2097152 Bytes,  0.2%  00086272-00086783
  04088864-04089119  hdisk8    256 frags    1048576 Bytes,  0.1%  00090624-00090879
  04089632-04172831  hdisk8  83200 frags  340787200 Bytes, 32.5%  00091392-00174591
  04173088-04173855  hdisk8    768 frags    3145728 Bytes,  0.3%  00174848-00175615
  04175648-04176671  hdisk8   1024 frags    4194304 Bytes,  0.4%  00177408-00178431
  04202784-04219423  hdisk8  16640 frags   68157440 Bytes,  6.5%  00204544-00221183
  04223264-04224287  hdisk8   1024 frags    4194304 Bytes,  0.4%  00225024-00226047
  04260384-04407071  hdisk8 146688 frags  600834048 Bytes, 57.3%  00262144-00408831

  256000 frags over space of 331776 frags:  space efficiency = 77.2%
  10 extents out of 256000 possible:        sequentiality = 100.0%

Figure 6-12. Determine fragmentation using fileplace    AN512.0

Notes:
Overview
The fileplace tool displays the placement of a file's blocks within the logical or physical
volume(s). fileplace expects an argument containing the name of the file to examine.
This tool can be used to detect file fragmentation.
By default, fileplace sends its output to the display, but the output can be redirected
to a file via normal shell redirection.
fileplace accepts the following options:
-l   Displays the file's placement in terms of logical volume blocks (default).
-p   Displays the file's placement in terms of physical volume blocks for the physical
     volumes that contain the file. Mirroring data is included if the logical volume is
     mirrored. The -p flag is mutually exclusive with the -l flag.
-i   Displays the indirect blocks (if any) for the file. This option is not available
     for JFS2 files.
-v   Displays more information, such as space efficiency and sequentiality.

A logical fragment is now composed of a number of fragments.

Example
The example in the visual demonstrates how to use fileplace to determine whether a
file is fragmented.
The report generated by the -pv options displays the file's placement in terms of
physical volume blocks for the physical volumes. The verbose part of the report is one
of the most important sections since it displays the efficiency and sequentiality of the
file.
Range of fragments (R) is calculated as the (highest assigned address - lowest
assigned address + 1).
Number of fragments (N) is the total number of fragments.
File space efficiency is calculated as the number of non null fragments (N) divided by
the range of fragments (R) assigned to the file and multiplied by 100, or (N/R) * 100.
Sequential efficiency is defined as 1 minus the number of gaps (nG) divided by the
number of possible gaps (nPG) or (1- (nG/nPG)) * 100.
The number of possible gaps (nPG) is calculated as nPG = (N -1).
In this example, the file is not very fragmented.
Higher sequentiality provides better sequential file access.
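Applying these formulas to the example above: N = 256000 and R = 331776, so space
efficiency = (256000 / 331776) * 100 = 77.2%. With 10 extents there are nG = 9 gaps
out of nPG = 255999 possible gaps, so sequentiality = (1 - 9/255999) * 100 = 99.996%,
which the report rounds to 100.0%.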

Uempty

Instructor notes:
Purpose To interpret a fileplace report.
Details Higher sequentiality will probably provide better sequential file access. The
efficiency is not a performance factor.
Additional information The output in the visual is from a JFS2 file. With a JFS file, you
can also use the -i option to show any indirect blocks.
Transition statement Let us next look at how we can correct both file fragmentation
and free space fragmentation.


Reorganizing the file system


After identifying a fragmented file system, reduce the
fragmentation by:
1. Backing up the files (by name) in that file system
2. Deleting the contents of the file system (or
recreating it with mkfs)
3. Restoring the contents of the file system
Some file systems should not be reorganized because
the data is either transitory (for example, /tmp), or does
not change that much (for example, / and /usr)

Figure 6-13. Reorganizing the file system    AN512.0

Notes:
Overview
File system fragmentation can be alleviated by backing up the problem files, deleting
them, and then restoring them, provided that there is enough contiguous free space.
This loads the files sequentially and reduces fragmentation.
Using the copy command to attempt problem file defragmentation can be dangerous,
due to the possibility of making significant inode changes such as ownership,
permission, and date stamps. In addition, use of the copy command can result in the
un-sparsing of sparse files.
If the file system has very little free space, or only fragmented free space, then the entire
file system needs to be backed up, its contents deleted, and the contents restored. This
will defragment both the free space and the individual files.


Some file systems or logical volumes should not be reorganized because the data is
either transitory (that is, /tmp), does not change much (that is, /usr and /), or not in a file
system format (log).

Backing up the file system


Back up the file system by file name. If you back up the file system by i-node instead of
by name, the restore command puts the files back in their original places, which would
not solve the problem. The commands to backup the file system are:
1. # cd /filesystem
2. # find . -print | backup -ivf backup_filename
This command creates a backup file (in a different file system), containing all of
the files in the file system that is to be reorganized. If disk space on the system is
limited, you can use tape to back up the file system.
3. # cd /
4. # unmount /filesystem
5. # mkfs /filesystem
You can also use tar or pax (rather than backup/restore) and back up by name.

Restoring the file system


To restore the contents, run the following:
1. # mount /filesystem
2. # cd /filesystem
3. Restore the data, as follows:
# restore -xvf backup_filename >/dev/null
Standard output is redirected to /dev/null to avoid displaying the name of each
of the files that were restored, which is time-consuming.


Instructor notes:
Purpose To explain reorganizing the file system.
Details
Additional information When remaking the file system with mkfs, you are prompted to
destroy (remove all contents from) the old one.
Transition statement Sometimes we would like to defragment a file system with
fragmented free space, without having to backup and restore the file system.


Using defragfs
If the file system has only fragmented free space, then
new file allocations are automatically fragmented.
This usually occurs when the file system is almost full.
To attempt online defragmentation of a file system, use
one of the following:
smit dejfs
smit dejfs2
defragfs command

This may be only partially effective if there is not enough free space to work with.
In JFS, online defragmentation is primarily intended for
scattered free fragments when using small fragment sizes.

Figure 6-14. Using defragfs    AN512.0

Notes:
Overview
If a JFS file system has been created with a fragment size smaller than 4 KB, it
becomes necessary after a period of time to query the amount of scattered unusable
fragments. If many small fragments are scattered, it makes it difficult to find available
contiguous free space.
To recover these small, scattered spaces, use smit or the defragfs command. Some
free space must be available for the defragmentation procedure to be used. The file
system must be mounted for read-write.
For JFS2, the defragfs command focuses on the number of free runs (the number of
contiguous free space extents) in used allocation groups.


defragfs syntax
The defragfs command line syntax is:
defragfs /fs       (to perform)
defragfs -q /fs    (to query)
defragfs -r /fs    (to report)
A query will display, for JFS, the current state of the file system:
- Number of free fragments
- Number of allocated fragments
- Number of free spaces shorter than a block
- Number of free fragments in short free spaces
A query will display, for JFS2, the current state of the file system:
- Total allocation groups
- Allocation groups skipped - entirely free
- Allocation groups that are candidates for defragmenting
- Average number of free runs in candidate allocation groups
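A typical workflow is to query first and act only if the query shows significant
free-space fragmentation; a minimal sketch (/home is just an example mount point):
# defragfs -q /home    (query the current free-space state)
# defragfs -r /home    (report what a defragmentation pass could accomplish)
# defragfs /home       (perform the defragmentation)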


Instructor notes:
Purpose To explain how attempt on-line defragmentation of free space.
Details
Additional information
Transition statement Let's next look at how JFS and JFS2 logs work.


JFS and JFS2 logs


A special logical volume called the log device records
modifications to file system metadata, prior to writing the
metadata changes to disk. After the changes are written
to disk, commit records are written to the log.
By default, each volume group has a single JFS log and a
single JFS2 log, shared by all file systems in that VG.
Log device updates can:
Serialize file updates due to locking for the log update
Introduce extra seeks into the disk write pattern
I/O statistics from the filemon utility can identify heavy
device log volumes and increased seeks.
Can mount -o log=NULL, if integrity is not a concern.
Figure 6-15. JFS and JFS2 logs    AN512.0

Notes:
Overview
JFS and JFS2 use a database journaling technique to maintain a consistent file system
structure. This involves duplicating transactions that are made to file system metadata
to the circular file system log. File system metadata includes the superblock, i-nodes,
indirect data pointers, and directories.
When pages in memory are actually written to disk by a sync() or fsync() system call,
commit records are written to the log to indicate that the data is now on disk. All I/Os to
the log are synchronous. Log transactions occur in the following situations:
- File is created or deleted
- write() occurs for a file opened with O_SYNC and the write causes a new disk block
allocation
- fsync() or sync() is called
- Write causes an indirect or double-indirect block to be allocated (JFS)
Location of the log


File system logs enable rapid and clean recovery of file systems if a system goes down.
However, there may be a performance trade-off here. If an application is doing
synchronous I/O or is creating and/or removing many files in a short amount of time,
then there may be a lot of I/O going to the log logical volume. If both the log logical
volume and the file system logical volume are on the same physical disk, then this could
cause an I/O bottleneck. The recommendation would be to migrate the log device to
another physical disk (this is especially useful for NFS servers).
JFS2 file systems have an option to have an inline log. An inline log allows you to create
the log within the same data logical volume. With an inline log, each JFS2 file system
can have its own log device without having to share this device. The space used can be
much less than the physical partition size and the location is implicitly in close proximity
with the file system using it. Inline logs are used mainly for availability and
manageability; performance is further improved if you have a dedicated log volume on a
different disk.

Recording statistics about I/Os to the log


Information about I/Os to the log can be recorded using the filemon command. If you
notice that a file system and its log device are both heavily utilized, it may be better to
put each one on a separate physical disk (assuming that there is more than one disk in
that volume group). This can be done using the migratepv command or via SMIT.
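A minimal sketch of that workflow, assuming an illustrative log volume loglv00 on
hdisk0, with hdisk1 available in the same volume group:
# filemon -O lv -o /tmp/filemon.out    (start tracing logical volume activity)
# sleep 60; trcstop                    (collect for one minute, then stop the trace)
# grep -p "Most Active Logical Volumes" /tmp/filemon.out
# migratepv -l loglv00 hdisk0 hdisk1   (move the busy log volume to the other disk)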

Avoiding file system journal logging


In most situations you need to have file system journal logging for the integrity of your
file system. But there are a few situations where integrity is not a concern and the I/O
can run much faster without the logging. One example would be when the file system is
being recovered from a backup. If there is a failure, you would simply repeat the
recovery. Another example is compile scratch space; if there is a failure you would just
rerun the compile.
For these situations, you may choose to mount the JFS2 filesystem with an option of
log=NULL. Just remember to remount without this option before using the filesystem for
a purpose that requires integrity!
JFS also has a mount option that provides the same capability:
mount -o nointegrity.
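For example, with illustrative mount points for recovery or scratch work:
# mount -o log=NULL /scratch       (JFS2: no journal logging)
# mount -o nointegrity /jfswork    (JFS equivalent)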

Instructor notes:
Purpose To discuss JFS and JFS2 logs.
Details
For AIX 5L 5.2 and later, the default JFS2 inline log size depends on the size of the logical
volume, as follows:
Logical Volume Size        Default Inline Log Size
< 32 MB                    256 KB
32 MB up to 64 MB          512 KB
64 MB up to 128 MB         1 MB
128 MB                     2 MB
128 MB to 1 GB             1/128 of size rounded to MB boundary
1 GB to 2 GB               8 MB
2 GB to 128 GB             1/256 of size rounded to MB boundary
128 GB up to 512 GB        512 MB
> 512 GB                   1/1024 of size rounded to MB boundary

Additional information The fsync() subroutine causes all modified data in the open
file specified to be saved to permanent storage. On return from the fsync() system call,
all updates have been saved on permanent storage.
Applications that must know whether the write was successful will use the fsync()
system call. However, applications that do not need to know when the write completes can
use the sync() system call and the performance benefits can be significant.
Transition statement How can we create additional log logical volumes?

Creating additional JFS and JFS2 logs


Additional log devices in a volume group, for a given file
system type, may improve performance if multiple file
systems are competing for the default log device.
Placing a file system and its log on different disks may
also reduce costly physical seeks.
What to do:
Create a new JFS or JFS2 log logical volume
Unmount the file system
Format the log:
logform -V vfstype /dev/LogName
Change the file system to use the new log device:
chfs -a log=/dev/LogName <FileSystem>
Mount the file system
Copyright IBM Corporation 2010

Figure 6-16. Creating additional JFS and JFS2 logs

AN512.0

Notes:
Overview
In the following discussion, references to the journal log apply to both JFS and JFS2
journal log devices.
Placing the log logical volume on a physical volume different from your most active file
system's logical volume will increase parallel resource usage, assuming that the I/O
pattern on that file system causes journal log transactions. If there is more than one file
system in the same volume group causing journal log transactions, you may get better
performance by creating a separate journal log for each of these file systems. The
downside is that with one journal log for each file system you are potentially faced with
storage waste, since the smallest a journal log can be is one physical partition.
The performance of disk drives differs. So, try to create a logical volume for a hot file
system on a fast drive (possibly one with fast write cache). If using a caching storage
subsystem, the seek effects may be less of a concern due to the implementation of
write caching.

Creating a new log logical volume


Create a new file system log logical volume, as follows:
# mklv -t jfslog -y LVname VGname 1 PVname
or
# mklv -t jfs2log -y LVname VGname 1 PVname
or
# smitty mklv
Another way to create the log on a separate volume is to:
i. Initially define the volume group with a single physical volume
ii. Define a logical volume within the new volume group (this causes the allocation
of the volume group JFS log to be on the first physical volume)
iii. Add the remaining physical volumes to the volume group
iv. Define the high-utilization file systems (logical volumes) on the newly added
physical volumes
The default journal log size is 1 logical partition. For small partition sizes, this size may
be insufficient for certain file systems (such as very large file systems or file systems
with a lot of files being created and/or deleted). If there is a high rate of journal log
transactions, then a small log could actually degrade performance because I/O
activities will be stopped until the transactions can be committed. In this case, an error
log entry regarding JFS log wait is recorded. If you want to increase the size of a journal
log, you must first unmount the file systems that use the log device. You can then
increase the size of the logical volume used by the log device, then format it using the
logform command before mounting the file systems again.

Formatting the log


Format the log as follows:
# /usr/sbin/logform -V vfstype /dev/LogName
For JFS2 logs, the logical volume type used is jfs2log instead of jfslog.
Also, when using logform on a JFS2 log, specify logform -V jfs2.

Modifying /etc/filesystems and the LVCB


You can use the chfs command to modify the file system stanza in /etc/filesystems
and also the logical volume control block and specify the new log volume for that file
system. For example: chfs -a log=/dev/LVname /filesystemname.
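Putting the steps together, here is a sketch of the complete procedure for a
hypothetical JFS2 file system /data in volume group datavg (all names are illustrative):
# mklv -t jfs2log -y datalog datavg 1 hdisk2   (create the log logical volume)
# umount /data                                 (the file system must be unmounted)
# logform -V jfs2 /dev/datalog                 (format the new log)
# chfs -a log=/dev/datalog /data               (switch the file system to the new log)
# mount /data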

Instructor notes:
Purpose To discuss adding JFS/JFS2 logs.
Details
Additional information
Transition statement Let's look at sequential read-ahead and the parameters
associated with it.

Sequential read-ahead
JFS: minpgahead=2 (default; Restricted tunable)
JFS2: j2_minPageReadAhead=2 (default)
JFS: maxpgahead=8 (default)
JFS2: j2_maxPageReadAhead=128 (default)

(Figure: sequential read-ahead doubling with the JFS defaults. Page 0 accessed:
page 0 is read in from disk. Page 1 accessed: page 1 is read, and pages 2-3 are
read ahead. Page 2 accessed: pages 4-7 are read ahead. Page 4 accessed: pages
8-15 are read ahead. Page 8 accessed: pages 16-23 are read ahead.)

Set via ioo: increase for striped logical volumes and large sequential I/O
Copyright IBM Corporation 2010

Figure 6-17. Sequential read-ahead

AN512.0

Notes:
Overview
The VMM tries to anticipate the future need for pages of a sequential file by observing
the pattern in which a program is accessing the file. When the program accesses two
successive pages of the file, the VMM assumes that the program will continue to access
the file sequentially, and the VMM schedules additional sequential reads of the file.
These reads are overlapped with the program processing, and will make the data
available to the program sooner than if the VMM had waited for the program to access
the next page before initiating the I/O.
The visual uses JFS as the example, but the same principles apply to JFS2.

Sequential read-ahead thresholds


The number of pages to be read ahead in a JFS file system is determined by the two
VMM thresholds:
- minpgahead
Note: this is an AIX6 Restricted tunable.
Number of pages read ahead when the VMM first detects the sequential access
pattern. If the program continues to access the file sequentially, the next read-ahead
will be for 2 times minpgahead, the next for 4 times minpgahead, and so on until the
number of pages reaches maxpgahead.
- maxpgahead
Maximum number of pages the VMM will read ahead in a sequential file.
The number of pages to read ahead on a JFS2 file system is determined by the two
thresholds:
- j2_minPageReadAhead
- j2_maxPageReadAhead
The distance between minfree and maxfree relative to maxpgahead or
j2_maxPageReadAhead should also take into account the number of threads that might
be doing maxpgahead or j2_maxPageReadAhead reads at a time. IBM's current policy is
that maxfree = minfree + (#_of_CPUs * maxpgahead or j2_maxPageReadAhead).
Without this, it is too easy to drive the free list to 0 and start paging working storage
pages.

How sequential read-ahead works


The first access to a file causes:
- The first page to be read in for JFS file systems
- Two pages to be read in for JFS2 file systems
When the next page is accessed, that page plus minpgahead additional pages are
read in. Subsequent accesses of the first page of a group of read-ahead pages result
in a doubling of the pages read in, up to maxpgahead (JFS) or j2_maxPageReadAhead
(JFS2).
If the program were to deviate from the sequential-access pattern and access a page of
the file out of order, sequential read-ahead would be terminated. It would be resumed
with minpgahead (JFS) or j2_minPageReadAhead (JFS2) pages if the VMM detected
that the program resumed sequential access.
Another situation where the pattern would be broken is when the file was previously
cached in memory and memory overcommit has caused random blocks of the cached
file contents to be stolen.
JFS example
The visual shows an example of sequential read-ahead for a JFS file system.
In this example, minpgahead is 2 and maxpgahead is 8 (the defaults). The program is
processing the file sequentially.
Following is the sequence of steps in the example:
1. The first access to the file causes the first page (page 0) of the file to be read. At
this point, the VMM makes no assumptions about random or sequential access.
2. When the program accesses the first byte of the next page (page 1), with no
intervening accesses to other pages of the file, the VMM concludes that the
program is accessing sequentially. It schedules minpgahead (2) additional
pages (pages 2 and 3) to be read. Thus the access causes a total of 3 pages to
be read.
3. When the program accesses the first byte of the next page that has been read
ahead (page 2), the VMM doubles the page-ahead value to 4 and schedules
pages 4 through 7 to be read.
4. When the program accesses the first byte of the next page that has been read
ahead (page 4), the VMM doubles the page-ahead value to 8 and schedules
pages 8 through 15 to be read.
5. When the program accesses the first byte of the next page that has been read
ahead (page 8), the VMM determines that the page-ahead value is equal to
maxpgahead and schedules pages 16 through 23 to be read.
The VMM continues reading maxpgahead pages when the program accesses the first byte
of the previous group of read-ahead pages until the file ends.

Changing the sequential read-ahead thresholds


If you are thinking of changing the read-ahead values, keep in mind:
- The values should be powers of 2 (from the set: 0, 1, 2, 4, 8, 16, and so on). The use
of other values may have adverse performance or functional effects.
Values should be powers of 2 because of the doubling algorithm of the VMM.
If the max page ahead value exceeds the capabilities of a disk device driver, the
largest read size stays at 64 KB (16 pages).
Higher values of the maximum page ahead can be used in systems where the
sequential performance of striped logical volumes is of paramount importance.
- A j2_minPageReadAhead value of 0 effectively turns off the mechanism. This can
adversely affect performance. However, it can be useful in some cases where I/O is
random, but the size of the I/Os cause the VMM's read-ahead algorithm to take
effect. For example, with a database block size of 8 KB and a read pattern that is

Copyright IBM Corp. 2010

Unit 6. File system performance monitoring and tuning


Course materials may not be reproduced in whole or in part
without the prior written permission of IBM.

6-51

Instructor Guide

purely random, the j2_minPageReadAhead value of 2 would cause a total of 4 pages


to be read for each 8 KB block instead of two 4 KB pages. Another case where
turning off page-ahead is useful is the case of NFS reads on files that are locked. On
these types of files, read-ahead pages are typically flushed by NFS so that reading
ahead is not helpful. NFS and the VMM will automatically turn off VMM read-ahead if
it is operating on a locked file.
- The buildup of the read-ahead value from the j2_minPageReadAhead to
j2_maxPageReadAhead is quick enough that for most file sizes there is no advantage
to increasing j2_minPageReadAhead.
- When an application uses a large read size, the pattern fast-starts with reading 8
blocks at a time before starting the doubling pattern.
For JFS, the minpgahead and maxpgahead values can be changed with:
- ioo (-o minpgahead and -o maxpgahead)
- But minpgahead is a restricted tunable; do not modify without the direction of AIX
Support.
For JFS2, the j2_minPageReadAhead and j2_maxPageReadAhead values can be
changed with:
- ioo (-o j2_minPageReadAhead and -o j2_maxPageReadAhead)
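For example (the value 256 is illustrative; validate any change against your own
workload):
# ioo -o j2_maxPageReadAhead           (display the current value)
# ioo -o j2_maxPageReadAhead=256       (change it until the next reboot)
# ioo -p -o j2_maxPageReadAhead=256    (change it now and persist it across reboots)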

Instructor notes:
Purpose To explain tuning sequential read-ahead.
Details Note that the example in the visual is for a JFS file system. The difference for a
JFS2 file system would be the parameter names (j2_minPageReadAhead and
j2_maxPageReadAhead).
Additional information
Transition statement Looking at writes, how can we optimize file syncs?

Tuning file syncs


JFS file writes will be stored in memory, but not written to
disk, until any one of the following happens:
Free list page replacement steals a dirty file page forcing
a page-out to disk for that one page
The syncd daemon flushes pages at scheduled intervals
The sync command or fsync() call is issued
Write-behind mechanism is triggered
An i-node lock is obtained and held while dirty pages are
being written to disk, which can be a performance issue.
Tuning options:
Tune sequential write-behind (on by default)
Turn on and tune random write-behind
Increase the frequency of the syncd daemon

Copyright IBM Corporation 2010

Figure 6-18. Tuning file syncs

AN512.0

Notes:
Overview
When an application writes to a file, the data is stored in a memory segment which is
mapped to the file. The memory page frames are marked as being modified, until they
have been written to disk. These are referred to as dirty pages. When the VMM
chooses to steal a modified page frame, it needs to page-out (write) the modified
contents to disk. In the case of a file, the modified contents are paged to the file it is
mapped to. If the VMM waits until a page frame is stolen to write the changes to the file,
it might result in a very long delay and the writes would be individual pages scattered
throughout the file. So, there are some other mechanisms which are used to force these
dirty pages to be written to disk.
If too many pages accumulate before one of these conditions occur, then when pages
do get flushed by the syncd daemon, the i-node lock is obtained and held until all dirty
pages have been written to disk. During this time, threads trying to access that file will
get blocked because the i-node lock is not available. Remember that the syncd daemon
currently flushes all dirty pages of a file, but one file at a time. On systems with a large
amount of memory and large numbers of pages getting modified, high peaks of I/Os can
occur when the syncd daemon flushes the pages.

Tunable options
There are three options to tune file syncs:
- Tune the sequential write-behind.
- Enable and tune random write-behind.
- This blocking effect can also be minimized by increasing the frequency of syncs in
the syncd daemon. Using the write-behind mechanisms is a much better solution
than increasing the syncd frequency. To modify the syncd frequency, change
/sbin/rc.boot where it invokes the syncd daemon, then reboot the system for it to
take effect. For the current system, kill the syncd daemon and restart it with the
new seconds value, as shown in the sketch below.
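As a sketch, assuming the default 60 second interval appears in /sbin/rc.boot and an
illustrative new value of 30 seconds:
# grep syncd /sbin/rc.boot                       (locate the line to edit for future boots)
nohup /usr/sbin/syncd 60 > /dev/null 2>&1 &
# ps -ef | grep /usr/sbin/syncd                  (find the PID of the running daemon)
# kill <PID>
# nohup /usr/sbin/syncd 30 > /dev/null 2>&1 &    (restart with the new interval)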

Caution
Caution should be exercised when changing the syncd time on systems with more than
16-24 GB of memory being used for the file system cache and not running AIX 5L V5.3
or later. On AIX 5L V5.1 and V5.2, syncd looks at each page in the file system cache to
determine if it has been modified. As the file system cache grows large, this can cause
other additional problems. With AIX 5L V5.3, a linked list of dirty pages is kept. On AIX
5L V5.1 and V5.2, because we do not have a linked list of dirty pages, syncd will use
more and more system CPU to scan for dirty pages. It then sleeps for the syncd sleep
time and starts again. This can and does consume a huge amount of CPU on systems
with large amounts of memory used for the file cache.

Instructor notes:
Purpose Explain how to tune file syncs.
Details
Additional information
The students may ask you about a tuning option called sync_release_ilock. If set, it
will cause a sync() to flush all I/O to a file without holding the i-node lock, and then use
the i-node lock to do the commit. In the past, this used to be offered as an option to
reduce the impact of the syncd triggered flushing of dirty pages in memory.
The default is that it is off. To turn on this option, use:
ioo -o sync_release_ilock=1
This is an AIX6 Restricted tunable; do not modify unless directed to do so by AIX
Support.
The use of sync_release_ilock can cause a leak of inodes when a large number of files
are created and/or deleted during a short time period. This can be managed by periodically
unmounting the file system and running fsck on it.
Transition statement Let's look at sequential write-behind.

Sequential write-behind
Files are divided into clusters
4 pages (16 KB) for JFS (fixed cluster size)
32 pages (128 KB) for JFS2 (default; tunable)
Dirty pages of a file are not written to disk until the
program writes the first byte beyond the threshold
Tuning JFS file systems:
Threshold number of clusters is tunable with the
ioo -o numclust parameter
Tuning JFS2 file systems:
Single cluster threshold
Number of pages per cluster is tunable with the
ioo -o j2_nPagesPerWriteBehindCluster
Copyright IBM Corporation 2010

Figure 6-19. Sequential write-behind

AN512.0

Notes:
Overview
To increase write performance, limit the number of dirty file pages in memory, reduce
system overhead, and minimize disk fragmentation, the file system implements a
mechanism called write-behind. The file system organizes each file into clusters. The
size of a cluster is 16 KB (4 pages) for JFS and 128 KB (32 pages), by default, for JFS2.
In JFS, the written pages are cached until the numclust number of 16KB clusters have
been accumulated. That cached numclust number of clusters are written to disk as
soon as the application writes to the next sequential cluster. Note that the data that
triggers this write-behind of cached data is not immediately written, but has to wait for
the numclust threshold to be passed again.
In JFS2, the concept is similar, except that the threshold amount to be cached before
being written to disk is a single cluster of a tunable size:
(j2_nPagesPerWriteBehindCluster).

To distribute the I/O activity more efficiently than either doing synchronous writes or
waiting for the syncd to run, sequential write-behind is enabled by default. Without this
feature, pages would stay in memory until the syncd daemon runs. This could cause I/O
bottlenecks and possibly increased fragmentation of the file.
The write-behind threshold is on a per-file basis, which causes pages to be written to
disk before the syncd daemon runs. The I/O is spread more evenly throughout the
workload.
There are two types of write-behind: sequential and random.

Tuning sequential write-behind


For JFS, the size of a cluster is 16 KB (4 pages). The number of clusters that the VMM
uses as a threshold is tunable. The default is one cluster. You can delay write-behind by
increasing the numclust parameter. This will allow small writes to get coalesced into
larger batches of writes so that you can get better disk write throughput. By setting
numclust to a larger value, this allows for coalescing of smaller logical I/Os into larger
physical I/Os. Change the numclust parameter using:
ioo -o numclust
For JFS2, the number of pages per cluster is the tunable value (rather than the number
of clusters with JFS). The default is 32 pages (128 KB). This can be changed by using:
ioo -o j2_nPagesPerWriteBehindCluster
To disable write-behind for JFS2, set j2_nPagesPerWriteBehindCluster to 0.
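For example (the values are illustrative; larger values delay write-behind so more I/O
can be coalesced):
# ioo -o numclust=4                           (JFS: accumulate four 16 KB clusters)
# ioo -o j2_nPagesPerWriteBehindCluster=64    (JFS2: use 256 KB clusters)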

Instructor notes:
Purpose To discuss sequential write-behind.
Details
Additional information
Transition statement How does random write-behind work?

Random write-behind
Prevents too many random-write dirty pages from
accumulating in RAM, so that when syncd does a flush
there is not a large burst of I/O sent to the disks
This is disabled by default
Random write-behind writes modified pages in memory to disk
after reaching tunable thresholds:
JFS threshold:
maxrandwrt (Maximum number of dirty pages)
JFS2 thresholds:
j2_nRandomCluster (Separation in number of clusters)
j2_maxRandomWrite (Maximum number of dirty pages)
The random cluster size is fixed at 16 KB (4 pages)

Copyright IBM Corporation 2010

Figure 6-20. Random write-behind

AN512.0

Notes:
Overview
There may be applications that perform a lot of random I/O, that is, the I/O pattern does
not meet the requirements of the sequential write-behind algorithm and thus the dirty
pages do not get written to disk until the syncd daemon runs. If the application has
modified many pages in memory, this could cause a very large number of pages to be
written to disk when the syncd daemon issues a sync() system call.
The write-behind feature provides a mechanism such that when the number of
randomly written dirty pages in memory for a given file exceeds a defined threshold,
these pages are then scheduled to be written to disk.

JFS threshold
The parameter, maxrandwrt, specifies a threshold (in 4 KB pages) for random writes to
accumulate in RAM before subsequent pages trigger them to be flushed to disk by the
write-behind algorithm. The random write-behind threshold is on a per-file basis. The
default value is 0 indicating that random write-behind is disabled.
Increasing this value to 128 would mean that once 128 random page writes have
occurred, any subsequent random write causes the previous write to be written to the
disk. The first set of pages and the last page written will be flushed after a sync.
This threshold is tunable by using: ioo -o maxrandwrt

JFS2 thresholds
In the JFS2 random write-behind algorithm, writes are considered random if two
consecutive writes are separated by more than a tunable number of clusters.
There are two thresholds for JFS2 file systems:
- The parameter, j2_nRandomCluster, specifies the distance apart (in clusters) that
writes have to exceed in order for them to be considered as random by JFS2's
random write-behind algorithm. The cluster size in this context is a fixed 16
kilobytes. The default is 0 which means that any non-sequential writes to different
clusters are considered random writes. The default of 0 is with bos.mp.5.1.0.15 and
later. The threshold is tunable by using:
ioo -o j2_nRandomCluster
- The parameter, j2_maxRandomWrite, specifies a threshold for random writes to
accumulate in RAM before subsequent pages are flushed to disk by JFS2's
write-behind algorithm. The random write-behind threshold is on a per-file basis.
The default value is 0 indicating that random write-behind is disabled. The threshold
is tunable by using:
ioo -o j2_maxRandomWrite
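For example, to enable random write-behind with illustrative thresholds:
# ioo -o maxrandwrt=128           (JFS: flush after 128 random dirty pages per file)
# ioo -o j2_maxRandomWrite=128    (JFS2: the equivalent per-file threshold)
# ioo -o j2_nRandomCluster=2      (JFS2: writes more than 2 clusters apart are random)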

Instructor notes:
Purpose To discuss random write-behind.
Details
Additional information
Transition statement Let's look at a random write-behind example.

JFS2 random write-behind example


j2_nRandomCluster=4
j2_maxRandomWrite=1

(Figure: a page-number line marked at 0, 16, 24, 32, and so on, showing the
first two consecutive writes, the second two consecutive writes, and the
separation threshold of j2_nRandomCluster * 4 pages.)

The first two consecutive writes are not considered to be random since they
are not j2_nRandomCluster clusters apart.
The second two consecutive writes are considered to be random since they are
more than j2_nRandomCluster clusters apart.
Copyright IBM Corporation 2010

Figure 6-21. JFS2 random write-behind example

AN512.0

Notes:
Example
The example in the visual demonstrates when a write is considered to be random.
In this example:
- j2_nRandomCluster is set to 4
- j2_maxRandomWrite is set to 1
Thus, two consecutive writes must be more than 16 pages apart to be considered random.
This 16 page separation comes from the following calculation:
j2_nRandomCluster * 16 KB; which is 4 * 16 KB = 64 KB = 16 pages
The first two consecutive writes, one in page number 4 and the other in page number
12, are not considered to be random because the actual separation is 8 pages (12 - 4),
which does not exceed the j2_nRandomCluster requirement.

The second two consecutive writes, one in page number 7 and the other in page
number 28 are considered to be random because the actual separation is 21 pages (28
- 7), which is more than 16 pages. In this case, page number 7 will be written out to
disk.

Instructor notes:
Purpose To discuss JFS2 random write-behind.
Details The intent of the j2_nRandomCluster tuning is to allow tightly clustered
random writes to accumulate until a sync request (either syncd or an application-issued
fsync() call) flushes them. While random, much if not all of the cluster will be touched in
memory before being flushed, thus providing great benefit in disk I/O unit size and
sequentiality of physical writing.
Additional information
Transition statement Before we look at ways to manage file caching, let us first look at
the advantages and disadvantages of having the file system use VMM file caching.

File system buffers and VMM I/O queue


(Figure: flowchart of file system I/O. read(), write(), VMM page fault, and
VMM write-behind requests all check: is a file system buffer available? If
yes, the request goes to the pager and is queued to the LVM and the disk
device driver. If no, the request waits on the VMM I/O queue until a file
system buffer becomes available.)
Copyright IBM Corporation 2010

Figure 6-22. File system buffers and VMM I/O queue

AN512.0

Notes:
Overview
Each read or write to a file system requires resources such as file system buffers. One
or more file system buffers are needed for a single request. The number of file system
buffers needed mainly depends on the number of pages to read or write and the file
system itself:
- JFS usually needs a single file system buffer per request, which can consist of
multiple pages
- JFS2 needs one file system buffer per page
When enough file system buffers are available, read or write requests can be sent to
the file system pager device which will queue the I/Os to the LVM.
If the system runs out of file system buffers, read and write requests will be queued on
the VMM I/O queue. The VMM then will queue the request to the file system pager once
enough file system buffers become available.
Performance issue
The number of read/write requests on the VMM I/O queue can become quite large in an
environment with heavy file system activity. As a result of this, the average response
time can increase significantly.
For example, if there are already 1000 write requests to one single file system on the
VMM I/O queue, and the disk subsystem can perform 100 writes per second, a single
read queued at the end of the VMM I/O queue would return after more than 10 seconds.
Note: It is possible to fill up the entire memory available for file system caching with
pages that have outstanding physical I/Os (queued in VMM). The system appears to
hang or has a very long response time for any command that requires file access. Such
a situation can be avoided by the proper use of I/O pacing.

Instructor notes:
Purpose To discuss the relationship between file system buffers and VMM I/O queue.
Details I/O pacing will be discussed in a couple of pages.
Additional information
Transition statement Let's see how we can tune file system buffers.

Tuning file system buffers


To determine if there is a file system buffer shortage, use
vmstat v and note the rate of change between displays:
# vmstat -v
<...output omitted>
0 pending disk I/Os blocked with no pbuf
7801 paging space I/Os blocked with no psbuf
2740 filesystem I/Os blocked with no fsbuf
794 client filesystem I/Os blocked with no fsbuf
0 external pager filesystem I/Os blocked with no fsbuf

Increasing the number of file system buffers can increase


performance, if there is a high rate of blocked I/Os
Do not increase without good reason; uses pinned pages.
To change the number of file system buffers use ioo:
JFS: numfsbufs=<#buffers>
JFS2: j2_dynamicBufferPreallocation=<#buffers>
Copyright IBM Corporation 2010

Figure 6-23. Tuning file system buffers

AN512.0

Notes:
Overview
JFS file system buffers are allocated when a file system is mounted. The number of file
system buffers allocated is defined by the ioo parameter:
- numfsbufs
Increasing the initial number of JFS file system buffers can increase performance if
there are many simultaneous or large I/Os to a single file system. However, if there is
mainly write activity to a file system, increasing the number of file system buffers might
not avoid a file system buffer shortage. I/O pacing should be used in such a case.
JFS2 file system buffers have an initial allocation when a file system is mounted, but
then dynamically increase on demand by a tunable amount. The related ioo parameters
are:
- j2_nBufferPerPagerDevice (for the initial allocation; Restricted tunable)
- j2_dynamicBufferPreallocation (for the increase amount; not restricted)
If there are bursts of JFS2 file system activity, the normal rate of fsbuf increase may not
react fast enough. A larger j2_dynamicBufferPreallocation may help in that
situation.

Determining a file system buffer shortage


Use vmstat -v. Look for the following lines in the output:
- file system I/Os blocked with no fsbuf
Refers to the number of waits on JFS file system buffers
- client file system I/Os blocked with no fsbuf
Refers to the number of waits on NFS and VxFS (Veritas) file system buffers
- external pager filesystem I/Os blocked with no fsbuf
Refers to the number of waits on JFS2 file system buffers
It is normal and acceptable to have periodic transitory fsbuf shortages, so the displayed
count may be fairly large without representing a problem. If the system has
unsatisfactory I/O performance, compare two displays of vmstat -v and calculate the
rate of change over the intervening time period. If it is a high rate of change, then
modifying the discussed tunable might help. Do not increase without good reason. They
take up valuable pinned memory.

Changing the number of file system buffers


To change the number of file system buffers use:
For JFS: ioo -o numfsbufs=<# buffers>
This tunable requires an unmount and mount of the file system to be effective.
For JFS2: ioo -o j2_dynamicBufferPreallocation=<# buffers>
This tunable is effective immediately; there is no need to unmount and mount the file
system. (The initial allocation tunable, j2_nBufferPerPagerDevice, is Restricted; tune
the dynamic preallocation instead.)
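One way to calculate the rate of change is to capture two snapshots of the counters
(the interval and file names are illustrative):
# vmstat -v | grep fsbuf > /tmp/fsbuf.before
# sleep 600
# vmstat -v | grep fsbuf > /tmp/fsbuf.after
# diff /tmp/fsbuf.before /tmp/fsbuf.after    (counter growth over the 10 minutes)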

Instructor notes:
Purpose Explain how to tune file system buffers.
Details
Additional information Note: File system buffers are allocated out of the kernel heap.
Changing the number of file system buffers to too high a value might exhaust the kernel
heap on a 32-bit kernel machine, but AIX 5L has the 64-bit kernel as the default and
AIX6 only runs a 64-bit kernel.
Transition statement Let's see what we can do with I/O pacing.

VMM file I/O pacing


A VMM file cache algorithm which paces the amount of write
I/O to a file, trading throughput for response time
Prevents any one thread from dominating system resources
Tuned with system wide sys0 object attributes: maxpout
and minpout:
AIX 5L defaults: minpout=0, maxpout=0 (disabled)
AIX6 defaults: minpout=4096, maxpout=8193
Can be specified per file system via the mount options:
maxpout and minpout
(Figure: a number line of pending I/O requests to a file, marking the minpout
threshold and maxpout - 1, with the delta between them. A thread writing to
the file sleeps at maxpout and resumes at minpout.)
Copyright IBM Corporation 2010

Figure 6-24. VMM file I/O pacing

AN512.0

Notes:
Overview
Disk I/O pacing is used to prevent I/O-intensive programs, that generate very large
amounts of output, from dominating the system's I/O facilities and causing the response
times of less demanding programs to deteriorate. Disk I/O pacing enforces per segment
(which effectively means per-file) high and low-water marks on the sum of all pending
I/Os.
When a process tries to write to a file that already has pending writes request at or
above the high-water mark, the process is put to sleep until enough I/Os have
completed to make the number of pending writes less than or equal to the low-water
mark. The logic of I/O request handling does not change. The output from high volume
processes is slowed down somewhat.

Controlling system wide parameters


There are two parameters that control the system wide I/O pacing:
- maxpout: High-water mark that specifies the maximum number of pending I/Os to a
file
- minpout: Low-water mark that specifies the point at which programs that have
reached maxpout can resume writing to the file
The high- and low-water marks can be set by:
- smit -> System Environments -> Change / Show Characteristics of
Operating System (smitty chgsys) and then entering the number of 4KB pages
for the high- and low-water marks
- chdev -l sys0 -a maxpout=NewValue
chdev -l sys0 -a minpout=NewValue

Controlling per file system options


In AIX 5L V5.3, and later, I/O pacing can be tuned on a per file system basis. There are
cases when some file systems, like database file systems, require different values than
other file systems, like temporary file systems.
This tuning is done when using the mount command:
# mount -o minpout=40,maxpout=60 /fs
Another way to do this is to use SMIT or edit the /etc/filesystems.

Default and recommended values


In AIX6, the default value for the high-water mark is 8193 and for low-water mark is
4096. (Prior to AIX6, these were both defaulted to 0, thus disabling I/O pacing).
Changes to the maxpout and minpout values take effect immediately and remain in
place until they are explicitly changed.
While, in AIX6, I/O pacing is enabled by default, the values are set rather large to
manage only the worst situations of an over-dominant batch I/O job. Depending on
your situation you may benefit by making them smaller.
It is a good idea to make the value of maxpout (and also the difference between
maxpout and minpout) large enough so that they are greater than the write-behind
amounts. This way sequential write-behind will not be suspended due to I/O pacing. For
example, for JFS the maxpout number of pages should be greater than (4*numclust).
For JFS2, the maxpout number of pages should be greater than j2_maxRandomWrite.
Using JFS as an example, the recommended value for maxpout should be (a multiple of
4) + 1 so that it works well with the VMM write-behind feature. The reason this works
well is for the following interaction:
1. The write-behind feature sends the previous four pages to disk when a logical
write occurs to the first byte of the fifth page (JFS with default numclust=1).
2. If the pacing high-water mark (maxpout) were a multiple of 4 (say, 8), a process
would hit the high-water mark when it requested a write that extended into the
ninth page. It would be then put to sleep before the write-behind algorithm had a
chance to detect that the fourth dirty page is complete and the four pages were
ready to be written.
3. The process would then sleep with four full pages of output until its outstanding
writes fell below the pacing low-water mark (minpout).
4. If on the other hand, the high-water mark had been set to 9, write-behind would
get to schedule the four pages for output before the process was suspended.
While enabling VMM I/O pacing may improve response time for certain workloads, the
workloads generating the large amounts of I/O will be slowed down because the
processes are put to sleep periodically instead of continuously streaming the I/Os.
Disk-I/O pacing can improve interactive response time in some situations where
foreground or background programs that write large volumes of data are interfering with
foreground requests. If not used properly, however, it can reduce throughput
excessively.

Example 1
The figure on the visual presents the minpout and maxpout VMM file I/O pacing values.
A thread writing to the file goes to sleep once the number of outstanding write I/Os
(this includes pages that have been sent to the disk and those queued in the VMM I/O
queue) reaches the maxpout threshold. The thread is woken up when the number of
outstanding I/Os is minpout or less.
VMM file I/O pacing should be used to avoid a large number of read/write requests on
the VMM I/O queue which can cause new read/write requests to take many seconds to
complete, since they are put at the end of the VMM I/O queue.

Example 2
The effect of pacing on performance can be demonstrated with an experiment that
consists of starting a vi editor session on a new file while another process is writing a
very large file with the dd command. If the high-water mark were set to 0, the logical
writes from the dd command could consistently arrive faster than they could be
physically written, and a large queue would build up.
Each I/O started by the vi session must wait its turn in the queue before the next I/O
can be issued, and thus the vi command is not able to complete its needed I/O until
after the dd command finishes. The following table shows the elapsed seconds for dd
execution and vi initialization with different pacing parameters.
High Water    Low Water    Throughput (sec)    vi (sec)
0             0            49.8                finished after dd
33            24           23.8                no delay
129           64           37.6                no delay
257           128          44.8                no delay
513           256          48.5                no delay
769           640          48.3                <3
1025          256          49.0                <1
1025          384          49.3                3
1025          896          47.8                3 to 10
It is important to notice that the dd duration is always longer when pacing is set. Pacing
sacrifices some throughput on I/O-intensive programs to improve the response time of
other programs.
Remember that these results are for one particular environment. It requires
experimentation in the actual target environment with your actual applications to find out
what values work best for you.
The challenge for a system administrator is to choose settings that result in a
throughput and response-time trade-off that is consistent with the organization's
priorities. It may be that a 3 second response time is acceptable and you need to
optimize the batch processing. For a different organization, the response time is
paramount and they can accept some delay in batch job completion.

Instructor notes:
Purpose Explain VMM file I/O pacing.
Details The second example shows how the various values for maxpout and minpout
can affect performance. These were on an older machine and the newer faster boxes may
have very different types of numbers.
One limitation of pacing is that it does not offer as much control when a process writes
buffers larger than 4 KB. When a write is sent to the VMM and the high-water mark has not
been met, the VMM performs start I/Os on all pages in the buffer, even if that results in
exceeding the high-water mark. Pacing works well on the cp command because the cp
command writes 4 KB at a time. But, if the cp command wrote larger buffers, the times
shown in the previous table for starting the vi session would increase.
Additional information
Transition statement Let's look at the benefits and exposures of using file caching.

The pro and con of VMM file caching


Pro:
Later reads do not require disk I/O
Sequential read-ahead mechanism provides:
Overlapping of disk I/O (without application AIO)
Larger disk reads from grouping of read requests.
Write-behind mechanism allows larger and more
efficient writes to disk due to coalescing.

Con:
CPU overhead
Longer path length of kernel logic
Extra load is a concern if CPU constrained
Memory overhead
Large amount of noncomputational memory
Page stealing disrupts and diminishes benefits
Copyright IBM Corporation 2010

Figure 6-25. The pro and con of VMM file caching

AN512.0

Notes:
The main advantage to using file caching is the avoidance of costly re-reads to disk. Once
a file block has been cached in memory, an application file read for that file is quickly
handled through a memory read.
Even if there is no expected re-read of that data in the near future, the file caching is
necessary for the read-ahead and write-behind mechanisms which were just covered. Both
mechanisms result in the coalescing of many smaller I/O requests into fewer larger
requests for contiguous (or at least clustered) requests. For sequential read-ahead, there is
the additional benefit of generating overlapping disk I/O requests, without requiring an
application to be re-written to use asynchronous I/O processing. This provides a significant
throughput improvement.
The main disadvantage is the memory and CPU overhead. The path length of the kernel
logic for read and write service calls and for interrupt handlers is significantly increased to
handle the file caching. Normally, this is an acceptable trade-off for the listed benefits. With
the current memory defaults, a large file cache overcommit of memory does not usually
result in any computational pages being paged out, but it can trigger a high volume of file
cache page stealing. This page stealing can then significantly diminish the potential
advantages of the file caching.

Instructor notes:
Purpose Summarize the pros and cons of VMM file caching.
Details
Additional information
Transition statement Let's look at a couple of ways to avoid some file caching
overhead. Let's first look at release-behind mechanisms.

JFS and JFS2 release-behind


Over-committed memory results in page stealing:
Inefficient single page writes
Disruption of sequential read-ahead

When you know the data will not be re-read in the near
future, use release-behind to free file cache memory
Reduced page stealing and less disruption

With release-behind, sequential I/O pages are freed as


soon as:
They are committed to permanent storage (by writes)
They are delivered to an application (by reads)

Enabled by mounting a file system with one of the


following options:
rbr Release-behind when reading
rbw Release-behind when writing
rbrw Release-behind when reading and writing
Copyright IBM Corporation 2010

Figure 6-26. JFS and JFS2 release-behind

AN512.0

Notes:
Overview
Release-behind is a mechanism under which pages are freed as soon as they are
either committed to permanent storage (by writes) or delivered to an application (by
reads). This solution addresses a scaling problem when performing large amounts of
sequential I/O on files whose pages may not need to be re-accessed in the near future.
Release-behind only applies to sequential I/O, so random I/O pages are cached. In
addition, the pages used for read-ahead are also cached until they are delivered to the
application.
When writing a large file without using release-behind, writes will go very fast whenever
there are available pages on the free list. When the number of pages drops to minfree,
VMM uses its Least Recently Used (LRU) algorithm to find candidate pages to release
and reuse. Because the LRU daemon examines frames one at a time, acquiring and
releasing multiple locks, it can take too long to release enough frames to allow the I/O to

continue at full speed. This lock contention can cause a sharp performance
degradation.
Furthermore, when file cache is stolen, the sequential read-ahead and write-behind
mechanisms can be disrupted. Dirty pages can be paged out one page at a time, rather
than being allowed to accumulate and be written in longer sequences. The read-ahead
mechanism is disrupted when previously read-ahead pages are stolen; the file read is
forced to re-read the stolen block from disk, which resets the read-ahead algorithm.
Random I/O will not be impacted as much as sequential I/O.

Enabling release-behind
You enable this mechanism by specifying one of the following flags to the mount
command:
- rbr
Mount file system with the release-behind-when-reading capability. When sequential
reading of a file in this file system is detected, the real memory pages used by the
file will be released once the pages are copied to internal buffers.
- rbw
Mount file system with the release-behind-when-writing capability. When sequential
writing of a file in this file system is detected, the real memory pages used by the file
will be released once the pages have been written to disk.
- rbrw
Mount file system with both the release-behind-when-reading and
release-behind-when-writing capabilities.
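For example, for hypothetical file systems whose contents are written or read
sequentially just once (the mount points are illustrative):
# mount -o rbw /batchout    (release-behind for writes only)
# mount -o rbrw /stage      (release-behind for both reads and writes)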

Release-behind side-effect
A side-effect of using the release-behind mechanism is that you will notice an increase
in CPU utilization for the same read or write throughput rate without release-behind.
This is due to the work of freeing pages, which would normally be handled at a later
time by the LRU daemon. On the other hand, the overhead of freeing the pages must be
paid at some point; release-behind simply shifts when that occurs.
Note that, with release-behind, all file page accesses result in disk I/O since file data is
only briefly cached by VMM before being released. The exception to this is if the pages
are the result of the read-ahead mechanism. In that case, they are cached by VMM until
the application reads them into its private segment; then release-behind frees those
pages.
Files with contents that are expected to be read soon after writing should not use
release-behind-when-writing. Files with contents that are expected to be re-read within
a relatively short period of time should not use release-behind-when-reading. Since this
is managed through mount options, you should plan to segregate your files into file
systems where the mount options can be appropriate to the files in each file system.
Alternatively you may remount the file system with different options according to how
the file system will be used. For example, a normal mount during the day and a
release-behind mount on third shift during batch report processing (if appropriate).

Instructor notes:
Purpose To discuss JFS/JFS2 release-behind.
Details
Additional information
Transition statement Let's next see how we can reduce both memory load and CPU
load by using direct I/O.

Normal I/O versus direct I/O (DIO)


(Figure: two I/O stacks side by side. Normal I/O: Application -> File System ->
VMM -> LVM -> Physical Disk I/O. DIO: Application -> File System -> LVM ->
Physical Disk I/O; the VMM file caching layer is skipped.)

Copyright IBM Corporation 2010

Figure 6-27. Normal I/O versus direct I/O (DIO)

AN512.0

Notes:
Direct I/O uses all the facilities of the file system and the logical volume manager, except
that it does not do any file caching. The entire VMM file caching layer is skipped.
Normal I/O writes will store the data in the VMM file cache and tell the application the write
is complete. The application can then do something else asynchronously while the write is
being processed. Later actions, such as the syncd running, flush the data to the disk.
Normal I/O reads attempt to read the file from the file cache. If there is a cache miss,
then it triggers a read to disk. But if the data is already in memory from an earlier read
or write, the read request is quickly completed through a memory-to-memory transfer.
The read-ahead mechanism provides asynchronous or overlapped processing of reads.
Cached files tend to stay in memory until the pages are stolen or the file system is
unmounted.
With direct I/O, the write request needs to be processed all the way to a commit to disk
before the application is told it is completed. The data goes directly from the
application's private memory to the storage adapter for writing to disk. A DIO read
request is immediately sent out to the disk; when completed, the data gets stored
directly in the application's private memory without first being stored in the file
system's file cache.

Instructor notes:
Purpose Introduce the concepts of direct I/O.
Details
Additional information
Transition statement Let us examine the issues in using direct I/O.


Using direct I/O (DIO)


- I/O requests are between disk and application's private memory; no VMM caching
  - Reduces memory load
  - Reduces CPU load; shortens kernel service path length
- Loses the significant benefits of VMM caching
  - All normal reads and writes are synchronous
- Application should be specially written to use DIO:
  - Files opened with the O_DIRECT flag
  - Complies with DIO block size and alignment rules
  - Data caching (if needed) in application computational memory
  - Large read and write sizes for sequential I/O
  - Asynchronous I/O with AIO service calls
- Non-DIO access demotes DIO requests
  - Commonly copy or backup commands
  - Solved by using the dio mount option
  - Only use DIO mounts with DIO-compliant applications

Figure 6-28. Using direct I/O (DIO)

Notes:
Since the data flows directly from and to the application's private memory, without VMM
caching, there is a great reduction in memory usage. This is a common reason for
database engines to use DIO, since they maintain their own cache of data in computational
storage. Having eliminated the entire VMM caching layer, the amount of code to be
executed in handling file system I/O requests is greatly reduced. Both of these benefits can
be valuable, especially if the system is memory or CPU constrained. On the other hand, all
the benefits of file caching that were previously discussed are lost: quick access to
previously read data, read-ahead overlap processing and request coalescing, and finally the
write-behind data coalescing and asynchronous completions.
Direct I/O should only be used with applications which are known to be DIO compliant. DIO
has rules that require the block sizes and file positions to be multiples of the page size
(4 KB). If these rules are violated, the DIO request is demoted, which is a very bad thing.
DIO demotion means the I/O request that violated the rules reverts to file caching and all of
its overhead, but without the read-ahead and write-behind benefits.


It is best to use applications which are specifically written for using DIO. In addition to being
compliant with the DIO rules, these applications will compensate for the loss of the
read-ahead and write-behind mechanisms. For example, they will use very large reads and
writes for sequential processing (rather than small reads and writes that rely on read-ahead
or write-behind to coalesce them into larger disk I/O requests). If the application expects to
re-read data, it will manage its own intelligent caching of data, often with much better hit
ratios than what the file system VMM caching can achieve. To improve effective
throughput, these applications will often use the AIX Asynchronous I/O (AIO) facility by
issuing aio_read and aio_write calls, which tell the kernel service how to notify the application
of later completion, while immediately returning control to that application to do work
asynchronous to the processing of the I/O request. Applications which are written for
DIO will open the file in DIO mode. As a result, the system administrator does not need to
do anything special to enable this capability.
A problem will sometimes develop where the administrator wants to work with a file that is
currently being processed by an application using DIO. The administrator will use a utility
such as the cp command or the backup command to read from the file. The utility opens
the file normally and does not request DIO processing. When any process requests normal
I/O processing of a file, then all processing is done normally (VMM file caching). For the
process that is requesting DIO, this is referred to as a demotion. When DIO requests are
demoted, they incur all of the costs of file caching with little of the benefit; the I/O will not use
read-ahead or write-behind.
A common solution to this situation is to use the dio option of the mount command. When
the file system is mounted with this option, all I/O requests to the files will be treated as DIO
requests. That can allow the utility to run without causing demotions. The catch is that the
utility must be DIO compliant.
The danger of doing a DIO mount is that there may be other files in that file system which were
never intended to be processed with DIO. In other words, the programs that use these files
are not even DIO compliant, much less written to run well with DIO. For these other files,
performance can be seriously impacted. One of the worst things you can do is mount file
systems containing executable programs and libraries using DIO.
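As an illustration of applying the dio option only to a dedicated file system (the mount
point /db01 is hypothetical; dio is the documented mount option):
# umount /db01
# mount -o dio /db01
# mount | grep db01    (the options column should now include dio)
A DIO-compliant backup utility could then read the files without demoting the database
engine's DIO requests.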


Instructor notes:
Purpose Cover the issues involved with DIO.
Details
Additional information
Transition statement Let's finish up with some checkpoint questions before we start on
the lab exercise.


Checkpoint (1 of 3)
1. True/False File fragmentation can result in a sequential read pattern of many small
reads with seeks between them.
2. True/False When measuring file system performance, I/O subsystems should not be
shared.
3. Two commands to measure read throughput are: _________ and __________
4. The _____________ command can be used to determine if there is fragmentation.

Figure 6-29. Checkpoint (1 of 3)

Notes:


Instructor notes:
Purpose Review and test the students' understanding of this topic.
Details A suggested approach is to give the students about five minutes to answer the
questions on this page. Then, go over the questions and answers with the class.

Checkpoint solutions (1 of 3)
1. True. File fragmentation can result in a sequential read pattern of many small reads
with seeks between them.
2. True. When measuring file system performance, I/O subsystems should not be
shared.
3. Two commands to measure read throughput are: dd and time.
4. The fileplace command can be used to determine if there is fragmentation.

Additional information
Transition statement


Checkpoint (2 of 3)
5. What tunable functions exist to flush out modified file pages, based on a threshold of
the number of dirty pages in memory?
________________________________________
6. What is the difference between JFS and JFS2 random write-behind?
________________________________________
________________________________________

Figure 6-30. Checkpoint (2 of 3)

Notes:


Instructor notes:
Purpose Review and test the students' understanding of this topic.
Details A suggested approach is to give the students about five minutes to answer the
questions on this page. Then, go over the questions and answers with the class.

Checkpoint solutions (2 of 3)
5. What tunable functions exist to flush out modified file pages, based on a threshold of
the number of dirty pages in memory?
- Sequential write-behind
- Random write-behind
6. What is the difference between JFS and JFS2 random write-behind?
The threshold for random writes in JFS is simply the number of random pages. In JFS2,
in addition to using the number of random writes as a threshold, it has a definition of
what is considered a random write, based upon the separation between the writes.

Additional information
Transition statement


Checkpoint (3 of 3)
7. List factors that may impact performance when files are fragmented:
8. What commands can be used to determine if there is a file system performance
problem?
9. What is the relationship between file system buffers and the VMM I/O queue?
___________________________________________
___________________________________________

Figure 6-31. Checkpoint (3 of 3)

Notes:


Instructor notes:
Purpose Review and test the students' understanding of this unit.
Details A suggested approach is to give the students about five minutes to answer the
questions on this page. Then, go over the questions and answers with the class.

Checkpoint solutions (3 of 3)
7. List factors that may impact performance when files are fragmented:
- Sequential access is no longer sequential
- Random access is affected (by having to access more widely dispersed data)
- Access time is dominated by longer seek time
8. What commands can be used to determine if there is a file system performance
problem?
- iostat
- filemon
9. What is the relationship between file system buffers and the VMM I/O queue?
Read/write requests will be queued on the VMM I/O queue once the system runs out of
file system buffers.

Additional information
Transition statement Let's move on to the exercise.


Exercise 6: File system performance
- Monitor and fix file fragmentation
- Using release-behind
- Using DIO (optional)

Figure 6-32. Exercise 6: File system performance

Notes:


Instructor notes:
Purpose Introduce the exercise.
Details
Additional information
Transition statement Let's summarize what we've learned.


Unit summary
This unit covered:
- Guidelines for accurate file system measurements
- How file fragmentation affects file system I/O performance
- Using the filemon tool to evaluate file system performance
- Tuning:
  - JFS and JFS2 logs
  - Release-behind
  - Read-ahead
  - Write-behind
- Identifying resource bottlenecks for file systems

Figure 6-33. Unit summary

Notes:


Instructor notes:
Purpose To summarize file system performance tuning.
Details
Additional information
Transition statement


Unit 7. Network performance

Estimated time
3:15 (2:30 Unit; 0:45 Exercise)

What this unit is about
This unit describes the issues related to network performance. It shows you how to use
performance tools to monitor and tune network performance.

What you should be able to do
After completing this unit, you should be able to:
- Identify the network components that affect network performance
- List the network tools that can be used to measure, monitor, and tune network
performance
- Monitor and tune UDP and TCP transport mechanisms
- Monitor and tune for IP fragmentation mechanisms
- Monitor and tune network adapter and interface mechanisms

How you will check your progress
Accountability:
- Checkpoint
- Machine exercises

References
- AIX Version 6.1 Performance Management
- AIX Version 6.1 Performance Tools Guide and Reference
- AIX Version 6.1 Commands Reference, Volumes 1-6
- AIX Version 6.1 System Management Guide: Communications and Networks
- SG24-6478 AIX 5L Practical Performance Tools and Tuning Guide (Redbook)
- SG24-6184 IBM eServer Certification Study - AIX 5L Performance and System Tuning
(Redbook)

Unit objectives
After completing this unit, you should be able to:
- Identify the network components that affect network performance
- List the network tools that can be used to measure, monitor, and tune network
performance
- Monitor and tune UDP and TCP transport mechanisms
- Monitor and tune for IP fragmentation mechanisms
- Monitor and tune network adapter and interface mechanisms

Figure 7-1. Unit objectives

Notes:


Instructor notes:
Purpose Explain the objectives of the unit.
Details Explain what the students can expect to learn in this unit.
This unit is long and is broken up into four topics. There will be a hands-on lab exercise at
the end that will include activities for all the topics.
Additional information
Transition statement Let us start with what can affect your network performance.


What affects network performance?
- Network infrastructure
- Type of session or connection
- Session parameters
- Resource availability
- AIX settings

Figure 7-2. What affects network performance?

Notes:
Overview
There are a number of things that can affect network performance. Among them are:
- Type of interface
- Capacity of hubs, switches, and routers
- Host architecture
- Types of connections using the network at the same time
- Settings of AIX parameters
Some of these factors are external to AIX and have to be managed separately.
However, there are a number of parameters that are internal to AIX that can be used to
manage and even improve network performance.
As with so many areas of performance, it is often possible to improve the performance
by simply increasing the amount of the constraining resource. In the case of networking,
the primary resource is network bandwidth. Thus upgrading to a faster network can
often improve network performance. For example, one might upgrade the network from
10 Mbps to 100 Mbps (or even gigabit Ethernet). But the bandwidth of the physical
network is not the only factor and there can be logical resources that are the actual
constraining factor. Thus, this unit will focus on many of these logical resources such as
buffer sizes, queue sizes, delay timers, and so forth.


Instructor notes:
Purpose Discuss some of the factors that affect network performance.
Details Cover the items in the visual. Provide brief examples for each. This can be used
to prepare them for what we will be covering. For example, you can say that we will be
discussing the differences between tuning UDP sessions and TCP connections.
Be sure they understand that even with the fastest network hardware, there are protocols
and software restrictions which can cause the network performance to fall way short of the
potential.
Additional information
Transition statement How do we go about identifying and tuning for network
performance?


Document your environment


# netstat -in
Name Mtu   Network    Address            ZoneID  Ipkts Ierrs  Opkts Oerrs  Coll
en0  1500  link#4     0.1a.64.91.85.fe            1406     0    209     4     0
en0  1500  192.168.2  192.168.2.1                 1406     0    209     4     0
lo0  16896 link#1                                  354     0    366     0     0
lo0  16896 127        127.0.0.1                    354     0    366     0     0
lo0  16896 ::1                               0     354     0    366     0     0

# lsdev -Cc adapter | grep 'ent[0-9]'
ent0 Available 01-08 2-Port 10/100/1000 Base-TX PCI-X Adapter (14108902)
ent1 Available 01-09 2-Port 10/100/1000 Base-TX PCI-X Adapter (14108902)

# entstat -d ent0 | more
-------------------------------------------------------------
ETHERNET STATISTICS (ent0) :
Device Type: 2-Port 10/100/1000 Base-TX PCI-X Adapter (14108902)
Hardware Address: 00:1a:64:91:85:fe
Elapsed Time: 0 days 7 hours 35 minutes 53 seconds

Transmit Statistics:                 Receive Statistics:
--------------------                 -------------------
Packets: 205                         Packets: 1410
Bytes: 19786                         Bytes: 136274
Interrupts: 0                        Interrupts: 1404
Transmit Errors: 0                   Receive Errors: 0
Packets Dropped: 0                   Packets Dropped: 0

Figure 7-3. Document your environment

Notes:
netstat -i
It is hard to analyze a network performance problem if you do not know what you are
working with. The netstat -in command lists the configured interfaces on the system.
This includes those interfaces which are currently in a down state (shown with a leading
* symbol in its name). The configured MTU size is shown for each interface (this will be
covered later).
It also shows some statistics related to those interfaces. The network interface statistics
provide a quick overview about which network interfaces show the most load.
To look at a single interface, use capital I and the name of the interface. For example:
# netstat -I en0
To output the address as numeric IP address, add a -n; to see the symbolic name
translation, do not use the -n flag.

lsdev
The lsdev listing identifies the available Ethernet adapters. This can identify currently
unconfigured adapters that might be used, either as alternative network adapters or
combined with another adapter to form an EtherChannel (aggregate adapter) for
additional bandwidth.

entstat
The entstat listing identifies additional adapter characteristics and configuration
information. Later, the course covers using the detailed network statistics provided in the
entstat report.


Instructor notes:
Purpose Explain some basic commands for collecting network information.
Details The emphasis here is on taking an inventory of what they are working with.
Later parts of the course will focus on statistics.
Additional information
Analyzing errors
A few errors are nothing to worry about. Compare the level of errors (Oerrs or Ierrs)
with the total traffic (Opkts or Ipkts). If the percentage of errors is high (some would
consider anything larger than 1% to be high), then investigate the cause. The
netstat -i report does not provide enough information for us to know the cause of
the errors. The netstat -v or entstat report on the adapter, and also information from
the network switch administrator, will provide the needed details. We will discuss this in
more detail later, but we will list some possible causes here.
Input errors (Ierrs) can be caused by:
- Malformed packets (damaged by electrical problems)
- Bad checksums (may indicate a host has a network interface problem and is
sending corrupted packets, or cable is damaged)
- Insufficient buffer space in the device driver
- Receiving packets of a type that it was not registered for (DecNet, Netware,...)
Output errors (Oerrs) can be caused by:
- A fault in the local host connection to the network
- Prolonged collisions
- Transmit queue overflows.
Changing queue sizes
The queue sizes can be changed via SMIT or the chdev command. The MTU size is
changed by the ifconfig command, chdev, or SMIT. Set this to the maximum for that
interface, but remember to change it on all the machines on the network. You will see
an example of recommended sizes later in the lecture.
Collision counts
The netstat -i command does not support the collision count for Ethernet interfaces.
Use the entstat or the netstat -v command to see this collision information.
Transition statement Let us next look at how we can set up a simple benchmark for
measuring network performance.

Measuring network performance
- Network throughput:
  - Number of bytes transferred / total transfer time
  - The ftp command is a common tool:
    ftp> put "|dd if=/dev/zero bs=32k count=100" /dev/null
  - Compare with:
    - Theoretical bandwidth (Ex. Fast Ethernet = 94.8 Mbps)
    - Baseline measurements
    - Performance goals
- Network response time:
  - Round-trip delay of transaction
  - Common tools: ping, netperf (transaction rate)
  - Application processing is often a major factor in total response time

Figure 7-4. Measuring network performance

Notes:
Network throughput
Network performance has two main aspects. One is throughput and the other is
response time.
When transferring large amounts of data, we are concerned with throughput. This is
measured in bytes per second. Any application can measure the amount of data
transferred and how long it took. A standard file transfer application which automatically
reports this to us is FTP.
One common mistake made in measuring network throughput is to accidentally include
non-network factors, such as retrieval of the data from disk at the source or storing of
the data at the destination. A good way to avoid this is to use special device files such
as /dev/zero and /dev/null which have no disk activity. Other utilities may give you an
option to do memory transfers.
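As a worked example (the elapsed time is assumed for illustration): the put subcommand
on the visual sends 100 x 32 KB = 3,276,800 bytes of zeros. If ftp were to report that the
transfer took 0.28 seconds, the throughput would be 3,276,800 / 0.28, roughly
11.7 MB per second, or about 94 Mbps, which is close to the 94.8 Mbps practical ceiling
for Fast Ethernet.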


Another common tool that is used is the spray command. This is an RPC-based
command which allows you to specify the number, size, and delay between packets
sent. It then reports the average transaction rate and data transfer rate. It depends on
configuring inetd to enable the service at the destination.
Next, the measurement needs to be evaluated. One way to do this is to compare it to
the theoretical bandwidth that can be achieved on the type of network you have. One
needs to be careful here. Do not assume that a 100 Mbps transfer is actually possible on
a 100 Mbps Ethernet. Even without contention for the bandwidth, there is overhead in
the form of interframe gap delays, frame headers and trailers, IP headers, and TCP
headers. If not transferring on a common LAN, there are other components in the path
which can affect the transfer.
The following table lists some of the network interface types:

Interface Name                             Speed
Ethernet (en)                              10 Mb/sec - 10 Gb/sec
IEEE 802.3 (et)                            10 Mb/sec - 10 Gb/sec
Token-Ring (tr)                            4 Mb/sec or 16 Mb/sec
X.25 protocol (xt)                         64 Kb/sec
Serial Line Internet Protocol, SLIP (sl)   64 Kb/sec
loopback (lo)                              -
FDDI (fi)                                  100 Mb/sec
SOCC (so)                                  220 Mb/sec
ATM (at)                                   155 Mb/sec or 622 Mb/sec
To avoid a mismatch in speed or duplex mode at 10/100 Mb speed, it is often
recommended to disable auto-negotiation for Ethernet cards and set them to the fixed
media speed and duplex mode. It is also imperative that the same be done at the switch
port. This does not apply if you are running at Gb speed.
As with performance analysis in all the major resource categories, it is important to
obtain a benchmark of what is achievable under normal conditions when performance
seems satisfactory. When concerned with degraded performance, comparing current
measurements to the baseline can tell you if network throughput is a cause of the
current degradation.
Comparing against the baseline is also important for normal monitoring of the system to
spot trends in performance degradation, which in time may result in performance goals
not being met.
Utilization is the percentage of time a device is in use. General queueing theory holds
that devices with utilizations greater than 70% will see response time increases
because incoming requests have to wait for previous requests to complete. Maximum
acceptable utilizations are a trade-off of response time for throughput. In interactive
systems, utilizations of devices should generally not exceed 70-80% for acceptable
response times.

Network response time


Many applications do not have that much data to transfer, but need to obtain a fairly
quick reply. There are many networking elements which may have little effect on
throughput, but would have a great effect on response time. Response time is
determined by measuring the time between when a request is sent and a reply is
received.
While we will focus here on network response time, it is important to realize that, to the
end user, this network response time may be only a small part of the overall perceived
response time. The client software may do processing even before sending a
transaction; the server may need to wait for a program to be loaded, and once loaded it
may take some time to process the request (including disk I/O delays).
To isolate the network response time from the non-network elements, we use tools
which have very little processing delay at either end. A universal tool of this sort is the
ping command. The ping command will provide a report giving the delay between
sending the ICMP echo request and the ICMP echo reply. Because it only operates at
Layer 3 and is built into the kernel, it has very little additional delay in the processing.
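A minimal sketch (server1 is a hypothetical host name and the timings are sample
values; the summary line is what you would record as a response time baseline):
# ping -c 5 server1
...
round-trip min/avg/max = 0/0/1 ms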
Another way to measure response time is to record a rapid transaction rate. For this you
need a client/server application which serializes a large number of transactions. As
soon as the reply to one transaction returns, it issues another one. If we invert the
transaction rate (divide it into 1) we will obtain the average response time per
transaction. This has the dual benefit of smoothing out any short term variances during
the test and the ability to measure sub second response times more accurately. A
common tool for this purpose is netperf. (www.netperf.org)
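For example (an assumed rate, purely for illustration): if a netperf request/response test
reports a steady 500 transactions per second, the average network response time is
1/500 second, or 2 ms per transaction.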
It is more difficult to identify an optimal theoretical response time because the network
latencies (often measured in hundreds of microseconds) are only a small part of the
network response time. The other factors depend very much on the network topology
and machine configurations. Once again, obtaining a baseline measurement under
normal loads when performance is acceptable is important for later comparisons.
For those who are not familiar with the ftp command's put subcommand syntax in the
visual, here is an extract from the BSD man page for ftp:
If the first character of the file name is |, the remainder of the argument is interpreted
as a shell command. ftp then forks a shell, using popen(3) with the argument
supplied, and reads (writes) from the stdin (stdout). If the shell command includes
spaces, the argument must be quoted; for example, "| ls -lt".

Instructor notes:
Purpose Explain how to measure network performance.
Details Make a clear distinction between throughput and response time. Emphasize the
importance of having an objective measurement of performance to know if your changes
have improved network performance. Explain the common tools used for this purpose.
Remind the students that an application's performance may depend on other factors
besides the network component. Always go back to a measurement of that application and
re-evaluate the total picture.
Additional information
Transition statement Now we are ready to start examining the various components
and protocols that affect network performance. Let us start with an overview of the data
flow through the layers.


Network services processing


[Figure: network data flow through the protocol stack. On the send side, the application's
write buffer (application layer) feeds the socket send buffer (socket layer, held as an mbuf
chain), then the TCP/UDP layer (MTU compliance), the IP layer (MTU enforcement and
fragmentation), the Demux and IF layer with its transmit queues, the device driver, the
adapter, and the media. On the receive side the path is reversed: adapter receive queues,
device driver, Demux layer, IP input queues, TCP/UDP layer, socket receive buffer, and
the application's read buffer.]

Figure 7-5. Network services processing

Notes:
Introduction
The visual shows the flow of network data from a sending application down through the
protocol layers and out on to the physical network. It then shows the flow of network
data into a receiving application. The data arrives across the physical network and
makes its way up the protocol stack.

Maximum Transmission Unit (MTU)


The MTU indicates the established maximum size for data transfer for a given network
interface. The MTU parameter is tunable and must be set to the same value for all hosts
on a given network interface. If the amount of data sent by a process is larger than the
MTU, it is divided into separate packets that comply to the MTU by the layers of protocol
illustrated above.


Sending
The interface layer (IF) is used on output and is at the same level as the Demux layer
(used for input) in AIX. It places transmit requests on a transmit queue, where the
requests are then serviced by the network interface device driver. The size of the
transmit queue is tunable, as described later in this lecture. The loopback interface
still uses the IF layer both on input and output.

Receiving
The network interface device driver places incoming packets on a receive queue. If the
receive queue is full, packets are dropped and lost, resulting in the sender needing to
retransmit. The receive queue is tunable using SMIT or the chdev command. The
maximum queue size is specific to each type of communication adapter.
The IP layer also has an input queue. The Demux layer places incoming packets on this
queue. Once again, if the queue is full, packets are dropped and never reach the
application. If packets are dropped at the IP layer, a statistic called "ipintrq overflows"
in the output of netstat -s -f inet is incremented. If this statistic increases in value,
then you should tune the ipqmaxlen tunable using the no command. In AIX, in general,
interfaces will not do queuing and will directly call the IP input queue routine to process
the packet. The loopback interface will still do queuing.
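A minimal sketch of checking for this symptom and raising the queue limit (250 is only an
example value; ipqmaxlen is a reboot tunable, hence the -r flag; sample output shown):
# netstat -s -f inet | grep ipintrq
        0 ipintrq overflows
# no -o ipqmaxlen
ipqmaxlen = 100
# no -r -o ipqmaxlen=250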

Communication steps
Applications that use TCP/IP protocols across a network use sockets to facilitate
communication. A socket is similar to the file access mechanism. On the send side, the
following things happen:

Step  Action
1     A program that needs network communication opens a TCP or UDP type socket.
2     The program writes data to the socket.
3     Data is copied to a socket send buffer made up of mbufs and clusters.
4     The socket calls the TCP/UDP layer, passing a pointer to the linked list of mbufs
      or clusters.
5     TCP/UDP allocates a new mbuf for header information.
6     TCP/UDP copies data from the socket send buffer to the header mbuf or a new
      mbuf chain (maximum size of chain governed by MTU). UDP copies the data to
      one or more clusters, with the remainder allocated to mbufs, then adds a UDP
      header.
7     TCP/UDP checksums the data and calls the IP layer.
8     IP determines the correct interface, fragments UDP packets larger than the MTU
      size, and updates and checksums the IP part of the header.
9     The mbuf chain is passed to the interface (Demux layer).
10    Demux prepends link layer header information, checks format, and calls the
      device driver write routine.
11    At the device driver layer, the mbuf chain containing the data is enqueued on the
      transmit queue and the adapter is signaled to start transmission operations.
12    After all data is sent, control returns to the application, the transmit queues are
      adjusted, and the mbufs are freed.

The receive side works in reverse, stripping off headers and passing along pointers to
mbufs. Queue sizes, buffer sizes, and MTU sizes are all tunable parameters.
The processing sequence shown does not show the various alternative flows that could
result from changing various tunables. It is only an example of the default processing
sequence that is traditionally implemented.


Instructor notes:
Purpose To provide an overview of how data flows through the layers of the stack.
Details Explain how data moves through the various buffers and queues.
Point out the tunable components of the visual (socket send buffer size, application write
buffer size, queue length) and mention that tuning parameters are discussed later in this
lecture.
Describe what happens if there are no mbufs or queue spaces available when an
application transmits data (the transmit packet is discarded, resulting in time out and
retransmit at a higher level).
Be careful not to pre-teach the course here. But the students need to have the big picture
as we start with the higher layers and work our way down to the physical layer. For example,
they need to understand the role of the adapter's MTU restriction and the basic function of
fragmenting or segmenting the data before we start discussing the tuning of segmentation
at the TCP layer later on.
Additional information - Once the data is copied from the application buffer into mbufs in
the kernel, the data is no longer physically moved from protocol layer to protocol layer.
Rather, pointers to the mbufs are transferred from one protocol layer to the next.
Transition statement You may have noticed that what was processed from layer to
layer within the AIX kernel was an mbuf chain. Let us take a closer look at how AIX
manages the storage of network data in memory.


Network memory
- AIX provides a network buffer pool for network operations
  - Dynamically allocated and deallocated based on demand
  - Cluster sizes can range from 32 bytes up to 128 KB
  - mbufs anchor data
  - Data can be in an mbuf or a chain of clusters
- thewall is the maximum amount of network pinned memory
  - It is not tunable
  - It is determined by the amount of real memory
- If mbufs or clusters are not available, performance may suffer because packets may be
dropped or delayed

Figure 7-6. Network memory

Notes:
Overview
AIX provides a network memory pool to service network communication. The maximum
size of the network memory pool is that defined by the thewall network option. Network
buffers are automatically allocated and pinned out of this network memory pool when
they are needed.

Size of network pool


The thewall value for maximum size of the network memory pool cannot be changed.
It will be calculated at boot time by the following formula:
- 32-bit kernel is one half of RAM or 1 GB (whichever is smaller)
- 64-bit kernel is one half of RAM or 65 GB (whichever is smaller)

Copyright IBM Corp. 2010

Unit 7. Network performance


Course materials may not be reproduced in whole or in part
without the prior written permission of IBM.

7-19

Instructor Guide

The following command shows the amount of real memory that can be used for the
network memory pool on a machine:
no -o thewall
The sys0 ODM object (attributes of the operating system) maxmbuf attribute, if set to a
non-zero value, will be used instead of thewall. The default value is zero.
Note: The sys0 maxmbuf parameter cannot be set to a value that is greater than the
system-determined thewall value.
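A quick sketch of inspecting both values (the numbers are sample output from a
hypothetical machine; thewall is reported in KB):
# no -o thewall
thewall = 524288
# lsattr -El sys0 -a maxmbuf
maxmbuf 0 Maximum Kbytes of real memory allowed for MBUFS True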

Memory buffers
AIX network services manages memory that is allocated and pinned for use by network
services. The amount of memory allocated and managed by network kernel services
dynamically increases and decreases based upon the utilization pattern, but will never
exceed the thewall.
The allocated memory is organized into memory clusters of various sizes. These
clusters are used for many purposes including various control blocks that are necessary
to represent sockets, connections, interfaces, routes, and other network components.
The cluster sizes can vary from 32 bytes to 1024 KB, in sizes that are always a power of
2 (Ex. 32, 64, 128, 256, 512,...).
Network operations require buffers to transfer data. These buffers use clusters matched
to the amount of data that needs to be stored. On a multiprocessor system, each CPU
is assigned its own network memory pool with buckets of buffers from 32 bytes to
16384 bytes, and there is a set of global buckets for sizes from 16 KB to 1024 KB.
Every network datagram that is sent or received must be represented by a control block
called an mbuf. The mbuf is stored in the appropriately sized memory cluster. If the data
is small enough, then it is stored right in the mbuf. Otherwise, the mbuf will point to an
mbuf cluster that can hold the data. An mbuf cluster is a control block which is allocated
a matching size memory cluster. In many cases, the mbufs are chained together to
keep track of them.
Since the mbuf clusters only come in certain sizes, the memory used may be almost
twice as much as the amount of data being transmitted. For example, if an application
sends 1460 bytes of data, then the smallest cluster that can contain it would be a 2048
byte cluster.
The term buffer is used in many different ways in computer jargon and, in fact has many
different meanings in networking. For example, later we will talk about a socket buffer
which is really a different concept than the clusters we are discussing here to hold an
individual datagram or message sent by an application or received from the network
adapter. When someone uses the term buffer, be sure to clarify what they are referring
to.

Network buffer limit


The maximum amount of real memory that can be used for the network memory pool is
limited by thewall. The value of thewall cannot be changed and changing the sys0
maxmbuf attribute can only be used to further restrict the maximum amount of memory
used for network buffer pool, not to increase it. The value of thewall is calculated at
boot time:
- 32-bit kernel is 1/2 of RAM or 1 GB (whichever is smaller)
- 64-bit kernel is 1/2 of RAM or 65 GB (whichever is smaller)

Options for a network buffer shortage


When netstat -m shows mbuf allocation failures, you have the following options:
- Add more RAM, if the machine runs 32-bit kernel and has less than 2 GB RAM
- Add more RAM, if the machine runs 64-bit kernel and thewall is less than 65 GB
- Change from 32-bit kernel to 64-bit kernel if possible and add RAM if needed
- Check the size of the socket send and receive buffers to determine whether they
can be reduced
- It is possible that an mbuf or cluster memory leak by a kernel component is causing
the mbuf or cluster shortage. A steady increase in allocations of a particular cluster
size or of a particular control block usage could be an indicator of an mbuf leak. A full
analysis of a kernel memory leak is outside the scope of this class.

Limited components
The limit on the network memory pool (thewall) is also used to limit how much memory
can be used for STREAMS. The tunable parameter called strthresh (default value of
85% of thewall) specifies that once the total amount of allocated memory has reached
85%, no more memory can be given to STREAMS.
Similarly, another threshold called sockthresh (also defaults to 85%) specifies that
once the total amount of allocated memory has reached 85% of thewall, no new
socket connections can occur (socket() and socketpair() system calls return with
ENOBUFS). These thresholds are tunable via the no command.
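For example, to display the current thresholds (the values shown are the defaults):
# no -o strthresh -o sockthresh
strthresh = 85
sockthresh = 85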


Instructor notes:
Purpose Explain network memory and buffers.
Details Stay focused on explaining basic memory concepts and terminology. The
discussion of network memory related problems and the management of those problems is
addressed on later visuals.
Emphasize the range of cluster sizes. They will see this illustrated later when we cover the
netstat -m report.
Point out the difference between the amount of data and the amount of memory consumed.
Most of the student notes are additional detail for their reading. Avoid getting into too much
detail here.
If not using the 64-bit kernel in AIX 5L, emphasize the importance of changing to the 64-bit
kernel. In AIX 6, the only kernel is 64-bit. In either case, providing enough memory to meet
the demand is important.
Emphasize reducing the number of clients and the workload per client coming into this
single server to deal with the demand side. Note the automatic mechanisms used by AIX to
control the level of demand and avoid shortages.
Additional information Network memory tuning prior to AIX 5L was different. The
thewall tunable used to be changeable with the no command.
The students may be confused by the variety of terminology used in the documentation.
The following terms are used almost interchangeably when referring to what we are calling
the network memory pool: mbuf pool, network kernel buffers, network buffers, mbuf buffer
space, network memory buffers, or just network memory.
Transition statement Let us look at how we can identify a problem involving a network
pinned memory shortage.


Memory statistics with netstat -m


# netstat -m
Kernel malloc statistics:

******* CPU 0 *******
By size    inuse   calls  failed  delayed   free  hiwat  freed
64           171    5452       0        2     21   1884      0
128         2032    2477       0       63     16    942      0
256          810    5189       0       50     22   1884      0
512         2108  175570       0      258     20   2355      0
1024         188    4428       0       48      8    942      0
2048         557    1694       0      261      3   1413      0
4096         133     139       0        4     25    471      0
8192           4      10       0        1      0    117      0
16384        128     128       0       16      0     58      0
32768         24      24       0        6      0     29      0
65536         59      59       0       30      0     29      0
131072         3       3       0        0     36     73      0

******* CPU 1 *******
By size    inuse   calls  failed  delayed   free  hiwat  freed
64            32     326       0        1     96   1884      0
128           54     646       0        2     42    942      0
256           36     998       0        2     12   1884      0
512           26   30441       0        0     78   2355      0
. . .

Figure 7-7. Memory statistics with netstat -m

Notes:
What to look for
The main thing to look for in the output of netstat -m is non-zero values in the
failed and delayed columns. If these values are non-zero, you need to identify
whether you can reduce the mbuf usage (by reducing the socket buffer sizes, discussed
later), add more memory (on a 64-bit kernel), or move to a 64-bit kernel, if possible.
The maximum size for the network buffer pool cannot be increased beyond the system-defined
default.
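A quick way to check whether a system is already running the 64-bit kernel (sample
output from a 64-bit system):
# bootinfo -K
64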
Once a cluster (such as an mbuf) is allocated and pinned, it can be freed by the
network services routine. Instead of unpinning this buffer and giving it back to the
system, it is left on a free list based on the size of this buffer. The next time a buffer is
requested, it can be taken off this free list in order to avoid the overhead of pinning.
Once the number of buffers on the free list reaches the high water mark, buffers less
than 4096 will be coalesced together into page-sized units so that they can be unpinned
and given back to the system. When the buffers are given back to the system, the
freed column is incremented. If the freed value consistently increases, this should
indicate that the high water mark is too low. There is no shipped tool to increase the
high water mark; however, the thresholds scale with the maximum amount of memory
available for network buffers.

Allocating and deallocating buffers


When a network service needs to transport data, it can call a kernel service such as
m_get() to obtain a memory buffer. If the buffer is already available and pinned, it can
get it right away. If the upper limit has not been reached and the buffer is not pinned,
then a buffer is allocated and pinned. Once pinned, the memory stays pinned when it is
freed back to the network pool. If the number of free buffers reaches a high water mark
(not tunable), then a certain number are unpinned and given back to the system for
general use. This unpinning is done by the netm kproc. The caller of m_get() can
specify whether or not to wait for a network memory buffer. If M_DONTWAIT is specified
and no pinned buffers are available at that time, then a failed counter is incremented. If
M_WAIT is specified, then the process is put to sleep until the buffer can be allocated and
pinned by the netm kproc. If the failed counter is not incremented, M_WAIT was
specified. The larger size buffers can only be allocated if M_WAIT is specified.
The low water mark and high water mark for mbufs scale with the size of the network
buffer pool.
If mbufs or clusters are not available, performance may suffer because packets may be
dropped or delayed.


Instructor notes:
Purpose Explain the netstat -m output.
Details Explain the netstat -m output. Note that the normal output breaks down
network memory pool requests by the various cluster sizes previously discussed. Also note
that they get a separate report for each CPU's bucket of buffers. Focus on the failed and
delayed columns as an indicator of a shortage. Avoid a discussion of the free, hiwat, and
freed columns, which reflect the internal management of the mbuf clusters, which the
administrator has no control over.
Additional information The report has additional information provided when
extendednetstats is enabled. It has always been emphasized that this should not be
turned on unless it is absolutely necessary. With AIX 6 and later, this is a restricted tunable.
Transition statement Let us move on to a discussion of tuning the transport layer of
the TCP/IP stack.


Socket flow control (TCP)


[Figure: the application's data buffer resides in user memory; in kernel memory, the
socket structure anchors a send buffer and a receive buffer, each built from mbufs.]

Figure 7-8. Socket flow control (TCP)

Notes:
Buffers
Sockets hold transient data in two buffers:
- Send space buffer
- Receive space buffer
The system implements limits on the sizes of these buffers on both a per socket and
system level. Separate limits are defined for TCP and UDP buffers.

TCP reliable transport


The TCP send buffer is used to hold onto data that has been sent, until
acknowledgement is received that the other side has successfully received the data. If
acknowledgement is not received, then TCP can retransmit what it has in the buffer.
The size of the buffer limits how much unacknowledged data can be buffered. If it fills,
then a send by the application will be blocked until there is free space in the send buffer.
7-26 AIX Performance Management
Course materials may not be reproduced in whole or in part
without the prior written permission of IBM.

Copyright IBM Corp. 2010

V5.4
Instructor Guide

Uempty

At the destination socket, the receive buffer is used to hold onto arriving data until it can
be matched to an application receive request and moved to the application's private
memory. Due to TCP flow control, this buffer should never overflow or discard data.

TCP flow control


TCP implements flow control by using a sliding window mechanism which is described
in detail on the next visual. This allows data to be transmitted and received without
having to worry about exceeding the size of the socket buffers. The no command
parameters tcp_sendspace and tcp_recvspace are global limits on the TCP send and
receive socket buffers. Applications can use the setsockopt() system call to override
these limits. The default value for tcp_sendspace and tcp_recvspace is 16384.

Ultimate limit for TCP and UDP buffers


Another parameter, sb_max, controls the upper limit for any of these buffers. All of these
parameters are tuned with the no command.
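A hedged sketch of raising these limits system-wide (262144 and 1048576 are only
example values; sb_max must be at least as large as the largest socket buffer you intend
to set):
# no -o sb_max=1048576
# no -o tcp_sendspace=262144
# no -o tcp_recvspace=262144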


Instructor notes:
Purpose Explain TCP socket flow control.
Details Explain how the socket interface works (at a very high level) and point out
where performance problems can occur. Note how the application request can be blocked
if the send buffer is full, and that freeing up the memory in the send buffer depends on
acknowledgements from the receiving end of the connection. Point out the role of the
receive buffer in holding data until it can be copied to the application's own address space.
The receiving end will need a flow control mechanism to tell the sender to stop if the
receive buffer fills up.
Point out that the administrator can control the size of these buffers.
Use the two issues of acknowledging packets (frees up space in the send buffer) and flow
control (prevents receive buffer overflow) as segues to the next two visuals.
The bottom line is that if we fill up these buffers, the sending application's send requests
will be blocked and our performance will be slowed.
Additional information
Transition statement Let's look at the role of TCP acknowledgements in confirming
the successful receipt of TCP session data by the destination connection partner.


TCP acknowledgement and retransmission


[Figure: three acknowledgement scenarios on a timeline. First, segments 5 and 6 are
sent; the arrival of the second segment triggers an immediate acknowledgement of
segment 6. Second, segment 7 is sent alone; the receiver waits up to the 200 ms
delayed-acknowledgement timer (fasttimo) before acknowledging segment 7. Third,
segment 8 is sent but no acknowledgement arrives; after the retransmission timeout
(RTO) the sender retransmits segment 8, which is then acknowledged.]

Figure 7-9. TCP acknowledgement and retransmission

Notes:
Overview
When the destination socket receives a segment, it does not immediately send an
acknowledgement. Instead, it waits to see if there is a datagram being sent in the other
direction on which to piggyback the acknowledgement. This is to reduce the number of
ack-only packets using the network capacity. (The TCP protocol designers use the term
piggyback to refer to the practice of signalling the acknowledgement in a data packet
that is being sent anyway.)
If there are no datagrams going in the other direction, the destination socket waits for
another segment to arrive and will then acknowledge both segments. If, after waiting for
200 ms, there is neither a datagram to piggyback on nor a second segment to
trigger an acknowledgement, then the socket sends an acknowledgement for the one
segment. This 200 ms timer can be tuned using the fasttimo network option.
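For instance, to display the timer or shorten it (100 is just an illustrative value, in
milliseconds):
# no -o fasttimo
fasttimo = 200
# no -o fasttimo=100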
At the socket which sent the segments, it holds them in the send space buffer until it
receives the acknowledgement, at which point it frees up that space in the buffer.
Copyright IBM Corp. 2010

Unit 7. Network performance


Course materials may not be reproduced in whole or in part
without the prior written permission of IBM.

7-29

Instructor Guide

If after waiting for a kernel calculated Retransmission Time Out (RTO) period, the
sending socket does not receive an acknowledgement, it retransmits the segment on
the assumption that it was discarded in the network.
If the host sending the data has sent additional packets after the unacknowledged
packet, then that sending host needs to retransmit not only the unacknowledged
packet, but all the packets that were sent after that point. This further adds load to the
network and additional overhead to the hosts on the connection.
Technically, the receiving host acknowledges receipt of everything up to a byte position
in a stream of bytes that are numbered from the first byte transmitted in the connection,
but the acknowledged byte position almost always correlates to the last byte of a
segment; thus it is common to talk about the segments that were acknowledged. The
details about acknowledged bytes in the stream of transmission are mainly important
when doing analysis of network traces.


Instructor notes:
Purpose Explain TCP acknowledgement and retransmission protocols at a very basic
level.
Details Explain the basic mechanisms of the acknowledgement and retransmit
protocols. Cover the three basic situations shown in the visual.
The first case has two packets sent, with the second packet triggering immediate
acknowledgement.
The second case has a solitary packet being sent. Delayed acknowledgment protocol at
the receiving end will wait for a second arriving packet up to 200 ms before acknowledging.
The third case shows a transmitted packet being discarded. The sender waits for the RTO
period for the acknowledgement and then retransmits.
Relate the last two cases to the potential impact on performance if these happen often.
Additional information This explanation is intentionally simplified. For the purposes of
this class it is not necessary to get into the detail related to the use of byte positions in the
transmission stream.
We also do not, in this course, go into the use of selective acknowledgement (SACK) or the
use of fast retransmit protocols or the related newreno variation. The following information
is being provided to you, the instructor, in case the students should ask about SACK or fast
retransmit. Avoid getting into a detailed discussion of SACK, fasttimeo and the newreno
variation. These are outside the scope of this class. Instead, emphasize the importance of
reducing packet delays and discards.
Starting with AIX V4.3.3, a new feature called TCP Selective Acknowledgment (SACK)
allows TCP to recover from multiple losses within the window and can provide for better
performance in congested networks. SACK is disabled by default but can be enabled by
running /usr/sbin/no -o sack=1.
While SACK can avoid having to retransmit segments which did not get discarded, it is not
intended for situations that have high volumes of discards and its use can, in some
situations, result in a hung session. It also requires both sides to agree to use it at session
establishment.
With the default newreno fast retransmission, when packets that were transmitted after
the discarded packet arrive at the destination, the receiving host acknowledges that it is
still waiting for the discarded packet. These are seen by the sending host as duplicate
acknowledgements. Multiple duplicate acknowledgments trigger the retransmission of just
the next unacknowledged packet. Even in this situation, frequent discards can reduce
performance due to each packet (in a sequence of discarded packets) needing to be sent
with an RTT delay between each individual transmission.
The lesson is that there is no great way to handle frequent discards. Find out what is
causing the discards and fix it.

Transition statement Next, let us look at how this mechanism works with flow control
to prevent the sender from exceeding the socket receive buffer at the destination.


TCP flow control and probes

[The visual shows a sender and a receiving server exchanging: Packets C and D (8 KB) answered by "acknowledge D win=8 KB"; Packets E and F (8 KB) answered by "acknowledge F win=0 KB" as the server becomes congested; after a timeout the sender issues a window probe and again receives "acknowledge F win=0 KB"; after a further timeout and window probe, the server is ready for more data and answers "acknowledge F win=16 KB"; the sender then resumes transmitting with Packet G (4 KB), and so on.]

Figure 7-10. TCP flow control and probes

Notes:
Overview
In order to prevent TCP receive buffer overflows, TCP implements a flow control
mechanism in which the receiving socket controls how much data can be transmitted by
the sending socket. The amount of un-acknowledged data that the sending side may
transmit is called the window size. If the transmitting socket has a full window of
transmitted but unacknowledged traffic, it has to stop and wait for data to be
acknowledged before it can continue transmitting.
The receiving side advertises a window size based upon its ability to receive that data.
The greatest factor used to determine the window size is the size of the TCP socket
receive buffer. If the TCP receive buffer is too small, it can artificially constrain the
throughput of the connection. Even when the TCP receive buffer is very large, if the
receiving server is experiencing congestion (buffer filling faster than it can process the
data), it can reduce the window to protect itself from overflow.


How it works
The receiver advertises a window size back to the sender as part of an ACK packet.
This tells the sender how much room the receiver has in its buffer to accept packets.
The sender will send out all the segments within its window (the sequence of
packet segments waiting to be sent).
The receiver can acknowledge multiple packets instead of sending back an ACK for
each packet. As long as the receiver is acknowledging packets fast enough and is
advertising a large enough window, the sender will continue transmitting packets.
Larger window sizes allow more time to transmit data while unacknowledged data
travels to the destination and the acknowledgement returns.
When a server gets overloaded the receive socket buffer may fill up faster than the
application can receive the data. In response to this the receiving host will reduce the
advertised window size. If the socket buffer is completely filled, then the advertised
window can be reduced to zero, preventing any new transmission on the connection.
When the receiving socket buffer empties, the receiving host will send an unsolicited
acknowledgement with the non-zero window size. If there is a long delay in the sending
host receiving the new window size, it will send a window probe. This is because it does
not know if a non-zero window advertisement was discarded in the network. Without a
window probe, the connection could be in a permanent deadlock with the sending host
waiting for a non-zero window and the receiver waiting for the next data packet.
A statistic of the number of window probes usually indicates how long and how often the
advertised window was closed to zero. This, in turn, can be an indication of server
overload.


Instructor notes:
Purpose Explain the significance of window probes.
Details
Additional information TCP uses a method of flow control called sliding windows. The
term refers to the implementation, which identifies a range of bytes in the stream of
transmission between the last acknowledged byte and the maximum byte position that can
be transmitted. The last position is equal to the last acknowledged byte plus the current
window size. Since each acknowledgement shifts the position of the first byte in the window
and, in turn, also shifts the last byte in this window, the entire window slides forward. This is
what the term sliding window refers to.
Transition statement Let's see what TCP protocol statistics we can examine with the
netstat command.


netstat -p tcp
# netstat -p tcp
tcp:
6899764 packets sent
4436476 data packets (2943162856 bytes)
3208 data packets (3788499 bytes) retransmitted
1813815 ack-only packets (500199 delayed)
1 URG only packet
389 window probe packets
484658 window update packets
161217 control packets
0 large sends
0 bytes sent using largesend
0 bytes is the biggest largesend
7861688 packets received
3535095 acks (for 2943219325 bytes)
82344 duplicate acks
0 acks for unsent data
5906529 packets (950111507 bytes) received in-sequence
4165 completely duplicate packets (376089 bytes)
1 old duplicate packet
140 packets with some dup. data (1611 bytes duped)
67997 out-of-order packets (15386274 bytes)
105 packets (139969 bytes) of data after window
0 window probes
Figure 7-11. netstat -p tcp

Notes:
Highlighted statistics
Statistics of interest are:
- Packets sent
- Data packets
- Data packets retransmitted
- Window probe packets
- Packets received
- Completely duplicate packets
- Window probes
- Retransmit timeouts


Packets sent and packets retransmitted


For the TCP statistics, compare the number of packets sent to the number of data
packets retransmitted. If the number of packets retransmitted is over 10-15 percent of
the total packets sent, TCP is experiencing timeouts indicating that network traffic may
be too high for acknowledgments (ACKs) to return before a timeout. A bottleneck on the
receiving node or general network problems can also cause TCP retransmissions,
which will increase network traffic, further adding to any network performance
problems.
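As a worked check against the sample output in the visual (its numbers are only illustrative): 3208 retransmitted data packets out of 4436476 data packets sent is roughly 0.07 percent, far below the 10-15 percent threshold, so retransmission would not be a concern on that system.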

Packets received and completely duplicate packets


Compare the number of packets received with the number of completely duplicate
packets. If TCP on a sending node times out before an ACK is received from the
receiving node, it will retransmit the packet. Duplicate packets occur when the receiving
node eventually receives all the retransmitted packets. If the number of duplicate
packets exceeds 10-15 percent, the problem may again be too much network traffic or a
bottleneck at the receiving node. Duplicate packets increase network traffic.

Retransmit timeouts
Another important statistic in the report is the value for retransmit timeouts, which
occurs when TCP sends a packet but does not receive an ACK in time. It then resends
the packet. This value is incremented for any subsequent retransmittals. These
continuous retransmittals drive CPU utilization higher, and if the receiving node does
not receive the packet, it eventually will be dropped. (This is not shown on the visual,
but is highlighted in the rest of the output below.)

Window probe packets and window probes


A large value for sent window probe packets indicates that either the socket receive
space of the remote receiving sockets is too small or the applications on the remote
receiving host are not reading the data fast enough.
A large value for received window probes indicates that either the socket receive buffer
on the local receiving socket is too small or the applications on the local receiving host
are not reading the data quickly enough.


The rest of the output


The output in the visual is the beginning of the total output. Following is the remainder of
the report:
5441 window update packets
51 packets received after close
0 packets with bad hardware assisted checksum
0 discarded for bad checksums
0 discarded for bad header offset fields
0 discarded because packet too short
53 discarded by listeners
1123560 ack packet headers correctly predicted
4109607 data packet headers correctly predicted
53780 connection requests
56147 connection accepts
109101 connections established (including accepts)
130872 connections closed (including 1623 drops)
0 connections with ECN capability
0 times responded to ECN
826 embryonic connections dropped
3418078 segments updated rtt (of 3087065 attempts)
0 segments with congestion window reduced bit set
0 segments with congestion experienced bit set
0 resends due to path MTU discovery
24 path MTU discovery terminations due to retransmits
940 retransmit timeouts
4 connections dropped by rexmit timeout
1402 fast retransmits
223 when congestion window less than 4 segments
646 newreno retransmits
42 times avoided false fast retransmits
0 persist timeouts
0 connections dropped due to persist timeout
8929 keepalive timeouts
8893 keepalive probes sent
35 connections dropped by keepalive
0 times SACK blocks array is extended
0 times SACK holes array is extended
0 packets dropped due to memory allocation failure
3 connections in timewait reused
0 delayed ACKs for SYN
0 delayed ACKs for FIN
0 send_and_disconnects
0 spliced connections
0 spliced connections closed
0 spliced connections reset
0 spliced connections timeout
0 spliced connections persist timeout
0 spliced connections keepalive timeout


Instructor notes:
Purpose Explain what to look at within the TCP protocol statistics shown by
netstat -p tcp.
Details Point out the most important counters to look at and relate back to what was
discussed on the last few visuals.
Additional information
Transition statement Lets see what we can change to affect TCP performance.


TCP socket buffer tuning (1 of 2)

- TCP send buffer size: how much data can be buffered before the application is blocked
- TCP receive buffer size: how much data the receiving system can buffer until the application reads it
- Buffer sizes can be set in a hierarchy:
  - Application setsockopt(): SO_SNDBUF, SO_RCVBUF
  - Interface attributes: tcp_sendspace, tcp_recvspace
  - Network tunables: tcp_sendspace, tcp_recvspace
- Effective window size based on minimum of:
  - Transmitter's send buffer
  - Receiver's advertised receive window size

Figure 7-12. TCP socket buffer tuning (1 of 2)

Notes:
TCP send buffer
The TCP socket send buffer is used to buffer the application data in the kernel using
mbufs and clusters before it is sent beyond the socket and TCP layer. The default size
of this buffer is specified by the no parameter tcp_sendspace, but can be overridden by
the application using the setsockopt() system call.
The send buffer can hold both data waiting to be transmitted (queued due to a full
window) and data that has already been transmitted but is waiting for
acknowledgement. When data is acknowledged, this frees up space in the buffer, which
in turn allows the application to send more data. If a send buffer fills up, the application
cannot send any more data. A larger send buffer allows an application to have more
transmitted data unacknowledged and to send data that is being queued due to the
window being full. A small send buffer quickly fills up and causes the application's send
requests to be blocked, even though the window is not full.

If an application does non-blocking I/O (specified O_NDELAY or O_NONBLOCK on the
socket) and the send buffer fills up, the send call will return with an
EWOULDBLOCK/EAGAIN error rather than being put to sleep. Applications need to be
coded to handle this condition. A suggested solution is to sleep for a short period of time
and try to send again. When changing sendspace or recvspace values, in some cases it
is necessary to stop and restart the inetd process (stopsrc -s inetd; startsrc -s
inetd).

TCP receive buffer


The TCP receive buffer is used to accommodate incoming data. When the data is read
by the TCP layer, TCP can send back an acknowledgment for that packet immediately
or it can delay before sending the ACK. TCP tries to piggyback the ACK if a data packet
was being sent back anyway. If multiple packets are coming in and can be stored in the
receive buffer, TCP can ACK all of these packets with one ACK. Along with the ACK,
TCP will send back a window advertisement to the sending system telling it how much
room there is left in the receive buffer. If there's not enough room left, the sender will be
blocked until the application has read the data. The default size of this buffer is
specified by the no parameter tcp_recvspace.
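As an illustration of the application-level override mentioned above, here is a minimal C sketch (not from the course materials; the 256 KB value is an assumption and error handling is omitted) that requests larger buffers before connecting:

#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

int open_tuned_socket(void)
{
    int s = socket(AF_INET, SOCK_STREAM, 0);
    int size = 262144;  /* hypothetical 256 KB request */

    /* Override the tcp_sendspace/tcp_recvspace defaults for this
       socket only; done before connect() so the values can influence
       the window advertised at session establishment. */
    setsockopt(s, SOL_SOCKET, SO_SNDBUF, &size, sizeof(size));
    setsockopt(s, SOL_SOCKET, SO_RCVBUF, &size, sizeof(size));
    return s;
}

Remember that sb_max caps what such a request can actually obtain.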


Instructor notes:
Purpose Introduce the students to socket buffer tuning.
Details Define the send and receive socket buffer sizes and identify how the
administrator can tune them either globally, interface by interface or socket by socket.
Be clear that there are two effects. These values determine the space available for
buffering data, but they are also used as input to determining the initial window size on a
session.
Additional information
Transition statement How do we know what to use for our initial window size?


TCP socket buffer tuning (2 of 2)

- Tune initial window size:
  - Big enough to avoid blocking application sends: how much data can be sent during round trip time (RTT)?
  - Small enough to avoid receiver or network problems
- Experiment for optimal throughput:
  - Try different sizes and measure the effect
- sb_max limits the maximum size for any socket buffer:
  - Set sb_max to at least twice the size of the largest socket buffer
- TCP window size is limited to 64 KB but can be set higher (rfc1323=1):
  - Maximum window size = 2 ** 30 (1 GB)
  - Adds 12 additional bytes of overhead to each packet

Figure 7-13. TCP socket buffer tuning (2 of 2)

Notes:
Selecting the best socket buffer size
If we set the initial window size too small we may be unnecessarily blocking application
send requests. On the other hand, if the sender is on a much faster machine and
network than the destination, it could be adding to a congestion situation (resulting in
packet discards). In that case, we may want to de-tune the window size to restrain
how fast we transmit.
One way to determine how large the window should be is to calculate the
bandwidth-delay product. Basically this is the amount of data we can transmit during
the Round Trip Time (RTT). The RTT is the time between when we transmit a segment
and when we receive the matching acknowledgement. The trick is in determining the
transmission rate and the RTT.
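As a worked example with assumed numbers: a link that can carry roughly 100 MB per second with an RTT of 2 ms gives a bandwidth-delay product of about 100 MB/s x 0.002 s = 200 KB, so a window much smaller than 200 KB would throttle that connection.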
A different and common approach is to determine the best window size through
experimentation. By trying a variety of different window sizes and measuring the
throughput on each, we can determine at what point increasing the window size leads
to diminishing returns.

TCP window size


TCP uses a 16-bit value for its window size, by default. This provides for a maximum
window of 65535 bytes. If data is being sent through adapters that have large MTU sizes
(32 KB or 64 KB for example), TCP streaming performance may not be optimal since the
packet or packets will get sent and the sender will have to wait for an acknowledgment. By
enabling the RFC1323 option using no -o rfc1323=1, TCP's window size can be set
as high as about 1 GB (2**30 bytes). After setting this option, you can increase the
tcp_recvspace parameter to something much larger, such as 10 times the size of the
MTU. An alternative option would be to reduce the MTU size if the receiving system does
not support RFC1323.

Maximum memory for socket buffers


A socket's send buffer memory usage plus that socket's receive buffer memory usage
can never exceed the value of sb_max bytes. sb_max is a ceiling on buffer space
consumption. In addition, no individual socket buffer size maximum (for example,
tcp_sendspace) is allowed to exceed the sb_max value. The two quantities (socket
buffer size versus sb_max) are not measured in the same way, however. The socket
buffer size limits the amount of data that can be held in the socket buffers. The sb_max
value limits the number of bytes of mbufs that can be in the socket buffer at any given
time. In an Ethernet environment, for example, each 2048 byte mbuf cluster might hold
just 1500 bytes of data. In that case, sb_max would have to be 1.37 times larger than
the specified socket buffer size to allow the buffer to reach its specified capacity. The
guideline is to set sb_max to at least twice the size of the largest socket buffer.
To change the sb_max value, use the command:
# no -o sb_max=<new_value>
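For example, if the largest socket buffer planned is a tcp_sendspace or tcp_recvspace of 262144 bytes, the guideline suggests a value such as (an illustrative number):
# no -o sb_max=524288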

Large MTU issues


On adapters that have 64 KB MTUs, TCP streaming performance can be seriously
degraded if the receive buffer is 64 KB or less. The two main protocols which have this
concern are the High Performance Switch (Federation switch) and ATM since both can
be configured with a 64 KB MTU.
The problem is that as soon we transmit a segment, we have filled the window. We then
have to stop and wait the RTT until we receive an acknowledgement that allows us to
slide the window forward. We have to wait between each and every MTU transmission.
To avoid this, we need to have a window size that is at least twice the MTU size and
preferably larger.
Copyright IBM Corp. 2010

Unit 7. Network performance


Course materials may not be reproduced in whole or in part
without the prior written permission of IBM.

7-45

Instructor Guide

If we send large segments (for example, 32 KB) which are less than these large MTUs
(64 KB), we also have interactions with Nagle's algorithm, which can result in waiting for
200 ms between each transmit, reducing the throughput to 5 packets per second. We will
cover Nagle's algorithm later in this unit.
If the receiving machine is not an AIX system or does not support RFC1323, then
reducing the MTU size is one way to improve streaming performance in this situation.

RFC 1323
RFC1323 is designed to modify the standard TCP protocol to support networks which
have a large bandwidth, use large MTUs, and are very fast. The original protocol was
designed for the 10 Mbps Ethernet with a 1500 byte MTU. There are several changes to
the protocol specified in RFC 1323. For example, with the rate at which packets could
be sent on a 10 Mbps Ethernet, it would take a very long time for the sequence number
field in the TCP header to overflow, but on a Gigabit Ethernet connection this could
happen much more quickly. So RFC 1323 defines a protocol for handling the wraparound
when the sequence number field starts counting from the beginning after reaching its limit.
The most commonly cited benefit is the ability to modify the use of the TCP header field
for advertising the window size. The original field had a maximum value of 64 KB. With
modern networks that became a major performance constraint. RFC 1323 provides a
mechanism to multiply the value in the TCP header advertised window size field by
powers of two, thus allowing much larger window sizes. For the High Performance
Switch (HPS), use of RFC 1323 is really a requirement.
To enable RFC 1323, use the command:
# no -o rfc1323=1
The downside of RFC 1323 is that every TCP header needs a 12 byte extension, which
adds to the overhead of using the protocol. So you do not want to turn this on unless
you need the benefits it provides.

Table of suggested buffer sizes


The following table shows some suggested socket buffer sizes based on the type of
adapter and the MTU size. The general rule of thumb is for TCP send and receive
space to be at least 10 times the MTU size. MTU sizes above 16 KB should use
rfc1323=1 to allow larger tcp_recvspace values. For high-speed adapters, larger
TCP send and receive space values help performance.
The window size is the receiver's window size. rfc1323 only affects the receiver.
In benchmark tests with gigabit Ethernet using a 9000 byte MTU, it was found that the
performance was the same for both the given sets of buffer sizes.

Device         Speed      MTU    tcp_sendspace  tcp_recvspace  rfc1323
Ethernet       10/100 Mb  1500   16384          16384          0
Ethernet       Gb         1500   131072         65536          0
Ethernet       Gb         9000   131072         65536          0
Ethernet       Gb         9000   262144         131072         1
Ethernet       10 Gb      1500   131072         65536          0
Ethernet       10 Gb      9000   262144         131072         1
ATM            155 Mb     1500   16384          16384          0
ATM            155 Mb     9180   65536          65536          0
ATM            155 Mb     65527  655360         655360         1
Fibre Channel  2 Gb       65280  65536          65536          1
FDDI           100 Mb     4352   45056          45056          0

Many of the faster adapters set ISNO options, making it unnecessary for you to tune
based on that adapter. But remember, this is only a starting point - different sessions
have different requirements.


Instructor notes:
Purpose Present the basic principles of TCP buffer tuning.
Details Explain the objective of allowing application streaming without causing
congestion problems.
Explain the two listed approaches to determining what the window size should be. Explain
how to calculate the bandwidth delay product. Emphasize that they ultimately need to
monitor and adjust.
Explain why there is a need for larger window sizes, especially on Gigabit speed networks.
Remind them that they always need to increase the sb_max to be larger than the socket
buffer sizes, before they increase the socket send spaces or receive spaces.
Explain the role of RFC 1323 in allowing larger socket buffer sizes.
Additional information
Transition statement While we can set these TCP performance factors at a global
level with the no command, one size does not necessarily fit all. Some connections need to
be tuned differently than other connections. Let's see how we can customize to a particular
interface or even a particular connection.


Interface specific network options

- AIX supports a subset of network tuning attributes that can be set on each network interface
- Tunable options include:
  - tcp_sendspace and tcp_recvspace
  - rfc1323
  - tcp_mssdflt
  - tcp_nodelay
- These options are tuned at the interface level using SMIT or chdev
- The no option use_isno defaults to 1 (enabled)
- ISNO values automatically configured for some adapters
- All of these can be overridden by application setsockopt()

Figure 7-14. Interface specific network options

Notes:
Tunable options
The Interface Specific Network Options (ISNO) is enabled by default and can be
disabled by setting the no option (use_isno) to 0.
For each network interface, there are five ISNO parameters:
- rfc1323
- tcp_nodelay
- tcp_sendspace
- tcp_recvspace
- tcp_mssdflt
These correspond to the same options with the no command.


Changing the options


If these values are set for a specific interface, then they will override the system no
default value. This allows different network adapters to be tuned for the best
performance. The application can override any of these options using setsockopt().
These values can be displayed via the lsattr -E -l interface command. They can
be changed via the chdev -l interface -a attribute=value command. For
example:
chdev -l en0 -a tcp_recvspace=65536 -a tcp_sendspace=65536
sets the tcp_recvspace and tcp_sendspace to 64 KB for en0 interface.
Using the chdev command will change the value in the ODM database so it will be
saved between system reboots. If you want to set a value for testing or temporarily, use
the ifconfig command. For example:
ifconfig en0 hostname tcp_recvspace 65536 tcp_sendspace 65536 tcp_nodelay 1

sets the tcp_recvspace and tcp_sendspace to 64 KB and enables tcp_nodelay.
These values are also displayed via the ifconfig interface command.
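For example, one quick way to review the ISNO settings for en0 might be (an illustration; the exact attribute list varies by adapter and AIX level):
# lsattr -E -l en0 | egrep 'rfc1323|tcp_'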

Default ISNO option values


For some high speed adapters, the ISNO parameters are defaulted in the ODM
predefined database.

Adapter Type  MTU    RFC1323  tcp_sendspace  tcp_recvspace
GigaE         1500   0        131072         65536
GigaE         9000   1        262144         131072
ATM           9180   0        65536          65536
ATM           65527  1        65536          65536
FDDI          4352   0        45046          45046


Instructor notes:
Purpose Explain network interface tuning.
Details Explain the mechanics and benefits of setting the tunables on an interface by
interface basis, instead of using the global network options.
Additional information
Transition statement For some applications, the TCP mechanisms that are designed to
provide better network utilization and overall better network performance can create
serious performance problems. Let us take a look at the Nagle algorithm.


Nagle's algorithm

- To prevent congestion of networks, Nagle's algorithm states that a TCP connection can have only one outstanding small segment that has yet to be acknowledged:
  - A small segment is defined to be smaller than the MSS
  - Packets are collected until there is enough data to meet the MSS requirement or until the 200 ms TCP timer expires
- May hinder the performance of certain types of applications such as request/response applications
- Packet transmission does not get delayed by Nagle's algorithm if any of the following are set:
  - tcp_nodelay socket option through setsockopt() system call in the application
  - tcp_nagle_limit to 1
  - tcp_nodelay to 1 in the Interface Specific Network Options

Figure 7-15. Nagle's algorithm

Notes:
Overview
While a local area network (LAN) can handle many small sized packets (defined to be
smaller than the maximum segment size), a wide area network (WAN) could get
congested. To reduce the congestion problem, TCP implements the Nagle algorithm
which states that a TCP connection can have no more than one outstanding small
segment that has yet to be acknowledged. This means the first small segment can be
sent right away, but no more small segments can be sent until an acknowledgement
(ACK) is received for the previous one. Instead, subsequent small segments are
collected together by TCP until TCP deems there is enough to meet the MSS value or
until the TCP 200 ms timer expires.


Disabling Nagle's algorithm

Since some applications may not stream packets (such as an application that sends a
small packet and cannot do anything until it gets back a response), these applications
may actually suffer serious performance problems due to this algorithm. In such cases,
the applications can do a setsockopt() on the socket (after the connect or accept) and
set the tcp_nodelay flag.
If the send buffer size is less than or equal to the maximum segment size (ATM and SP
switches can have 64 KB MTUs), then the application's data will be sent immediately,
but the application will have to wait for an ACK before sending another packet, due to
Nagle's algorithm. This prevents TCP streaming and could reduce throughput. To
maintain a steady stream of packets, increase the socket send buffer size so that it is
greater than the MTU (3-10 times the MTU size could be used as a rule of thumb).
A system administrator can also allow all TCP connections on an interface to behave as
if tcp_nodelay was set by setting the interface specific option tcp_nodelay to 1
on the interface (not all interfaces support this yet). For details on tcp_nodelay tuning,
see the later material under Network interface tuning.
Another no parameter is called tcp_nagle_limit. This value defaults to the largest
packet size (65535). TCP disables the Nagle algorithm for packets of size greater than
or equal to the value of tcp_nagle_limit. So, this means you can essentially disable
the algorithm altogether for all packets by setting the value to 0 or 1.
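For reference, a minimal C sketch of the application-side approach described above (not from the course materials; it assumes an already connected socket descriptor s and omits error handling):

#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>   /* defines TCP_NODELAY */

void disable_nagle(int s)
{
    int on = 1;

    /* Disable Nagle's algorithm on this connection so that small
       segments are transmitted without waiting for prior ACKs. */
    setsockopt(s, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
}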

Delayed packet transmission


If the amount of data that the application wants to send has all of the following
attributes:
- Smaller than the send buffer size
- Smaller than the maximum segment size
- tcp_nodelay is not set
Then TCP will delay up to 200 ms (fasttimeo tunable) until one of the following
conditions is met before transmitting the packets:
- There's enough data to fill the send buffer
- The amount of data is greater than or equal to the maximum segment size
The MSS value is computed by TCP based on the MTU size or the tcp_mssdflt value,
depending on whether it is a local or remote network. If tcp_nodelay is set, then the
data is sent immediately. This is useful for request/response type of applications.
Most network interfaces support the tcp_nodelay option which can be set with the
chdev command. If you have a connection through a network interface that does not
support tcp_nodelay, and you cannot change the application, set the no parameter
tcp_nagle_limit to 1.
Copyright IBM Corp. 2010

Unit 7. Network performance


Course materials may not be reproduced in whole or in part
without the prior written permission of IBM.

7-53

Instructor Guide

Delayed acknowledgements
Operating systems like Windows NT/2000 have difficulties with data streaming when
they get delayed acknowledgements (acknowledgement packets are always sent
delayed on AIX). You can disable delayed acknowledgement transmission by setting
the no parameter tcp_nodelayack to 1.
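For example:
# no -o tcp_nodelayack=1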

Idle connection
When a connection goes idle (after 0.5 seconds without any data traffic), the initial
window size is set to one MTU. When an application then sends more data than the
MTU size, one packet is sent that needs to be acknowledged before sending the rest of
the data. The receiver might be expecting more than one packet and does not send the
acknowledgement packet immediately (usually the receiver will send it after 200 ms).
You can increase the initial window size by changing the no parameter
tcp_init_window so that more than one packet is sent when sending data through an
idle connection.
In order to do this, the rfc2414 option must be on. For example, the following will set
the initial window size to 4*MSS (Maximum Segment Size for this connection):
# no -o rfc2414=1
# no -o tcp_init_window=4


Instructor notes:
Purpose Explain the impact and management of Nagle's algorithm.
Details Explain how Nagle's algorithm works and why it is generally good for reducing
network overhead. Explain how it can sometimes be a problem.
Explain how to disable Nagle's algorithm.
Note that most applications which need Nagle disabled will do so via their socket interface,
thus requiring no action from the system administrator.
Additional information
Transition statement While most of our traffic uses TCP, you might have some situations
where you will need to tune for UDP traffic. Let's take a brief look at the most common
problem with UDP.


UDP buffer overflow

- Datagrams arriving faster than receives are issued result in buffer overflow and packet discard
- Solutions:
  - Increase udp_recvspace to at least 10 x udp_sendspace
  - Increase CPU cycles
  - Decrease number of clients or workload

[The visual shows datagrams filling the udp_recvspace socket buffer faster than the application empties it with recvfrom().]

Figure 7-16. UDP buffer overflow

Notes:
Adjusting for UDP buffer overflows
Without flow control services, it is possible to have socket receive buffer overflows.
This will happen when UDP datagrams arrive faster than the UDP application can issue
receive requests. Unable to transfer the data from the socket buffer to the application's
private memory, the buffer fills up. The next datagram to arrive is discarded for lack of
space.
While it is hard to predict how large the receive buffer should be, a commonly
recommended starting value is 10 times larger than the sendspace being used by the
transmitting socket. Some environments can get by with less, while others will need even
more. The only way to know for sure is to monitor the occurrence of receive buffer
overflows.
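For example, if the transmitting sockets use a udp_sendspace of 65536, a starting point following this guideline would be (illustrative values; make sure sb_max is large enough first):
# no -o udp_recvspace=655360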
One way to handle the situation is to decrease demand by reducing the number of clients
transmitting to a server (perhaps by spreading the workload across more individual
servers). Another technique is to reduce the size of the records being sent, though that is
not always an option for the given application.
Sometimes the problem is that the receiving host is CPU bound. When this happens, it is
possible that while the network adapter interrupt handlers are able to get cycles (they get a
preferred fixed priority), the application may be starved for cycles and thus delayed in
issuing new receives. The solution is to tune the CPU situation.


Instructor notes:
Purpose Explain UDP receive buffer overflows and what the causes may be.
Details Explain the UDP receive buffer situation. A good way to present it is as a
classical demand and resource imbalance. They can either increase the resource (size of
the udp_recvspace) or they can reduce the demand (how quickly the buffer is filling up).
The rate at which the buffer is filling is another balancing act between the rate at which the
datagrams are arriving and the rate at which the application is able to issue receive
requests. They can either reduce the demand (datagram size and arrival rate) or they can
improve the resource (the ability of the application to get CPU cycles).
Additional information
Transition statement Let's see how we can detect this situation using the netstat
command.


netstat -p udp
# netstat -p udp
udp:
1309238 datagrams received
0 incomplete headers
0 bad data length fields
0 bad checksums
139 dropped due to no socket
521435 broadcast/multicast datagrams dropped due to no socket
0 socket buffer overflows
787664 delivered
1283000 datagrams output

Figure 7-17. netstat -p udp

Notes:
Highlighted statistics
Statistics of interest are:
- Dropped due to no socket
- Socket buffer overflows

Socket buffer overflows


A large socket buffer overflow count indicates that either the UDP socket
receive buffer on the local machine is too small or the applications are not reading the
data fast enough. The result is that the packet is dropped. You want to avoid ANY
dropped packets in UDP protocol since it has a severe impact on performance.
Socket buffer overflows could be due to insufficient transmit and receive UDP sockets,
too few nfsd daemon threads (we will cover NFS in the next unit), or too small
nfs_socketsize, udp_recvspace, and sb_max values.

Check the affected system for CPU or I/O saturation, and verify the recommended
setting for the other communication layers by using the no -a command. If the system
is saturated, you must either reduce its load or increase its resources.

Dropped due to no socket


The dropped due to no socket counter is an important statistic. It indicates that
there was no open socket matching the destination port number on the arriving
datagram. The application on this host is either not running or is in a state where it is not
ready to accept requests. A well written UDP client/server design would have the two
sides do a handshake before starting a flow of requests. The sending side may attempt
this over and over in a polling fashion until the server replies. As such, this counter is
likely to represent a functional problem. If this value is high, investigate how the
application is handling sockets.


Instructor notes:
Purpose Explain what to look at within the UDP protocol statistic shown by
netstat -p udp.
Details Emphasize the importance of the buffer overflows and relate back to the
discussion on the previous visual.
Additional information
Transition statement The network components have limitations on the size of a packet
that they can handle. To avoid overrunning these devices, TCP/IP has mechanisms for
breaking large sends into smaller packets. These mechanisms can have an effect on
network performance. Let us first see what these mechanisms are.


Fragmentation and segmentation

[The visual traces user DATA from the application into the kernel: the TCP layer breaks it into segments no larger than the MSS and prepends the TCP header; the IP layer adds the IP and link headers to build each frame within the MTU, so no IP fragmentation occurs.]

MSS: Maximum Segment Size
MTU: Maximum Transmit Unit

Figure 7-18. Fragmentation and segmentation

Notes:
Segment Size
When TCP takes data from the sendspace buffer to pass to the IP layer, it first prepends
a 20 byte header with such information as source and destination port numbers, byte
position in the stream and other transport layer management information. The data
carried in this datagram is referred to as a segment.
Because TCP is connection oriented it has better visibility to what an optimal
transmission unit size should be. This MTU is then converted into the maximum size of
the segment. The Maximum Segment Size (MSS) is (MTU - TCP header - IP header).
The IP header is 20 bytes and the TCP header is 20 bytes. Thus, for standard Ethernet
the MSS would be ( 1500 - 40 = 1460 ).
TCP will never send a segment larger than the MSS. As a result, IP at the originating
host will never have to fragment. If the MSS is optimal for the entire path, then none of
the intermediate routers will need to fragment the datagram, either. As a result, the
destination host will not need to do fragment reassembly using the IP input queue.
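As a worked example: for a jumbo frame Ethernet with a 9000 byte MTU, the MSS would be ( 9000 - 40 = 8960 ), the value that appears in an upcoming visual; options such as RFC 1323 shrink this further by the size of their header extensions (for example, 8960 - 12 = 8948).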

Instructor notes:
Purpose Explain segmentation and how it avoids fragmentation.
Details Be sure to stay focused on the basic definition and purpose of segmentation
and on the size of a segment compared to the size of a transmission unit as it regards the
additional IP header and TCP header.
Additional information It should be noted that there are various TCP options which will
add optional header extensions to the 20 byte TCP header. The TCP layer is aware of what
options are in use (such as RFC1323) and will automatically adjust the MSS for the
additional header extensions. Students may point out that while the notes say the Ethernet
MSS would be 1460, that they have seen values of 1448. That would be due to the
additional 12 bytes of RFC1323 header information.
Also note that the only way for UDP to avoid fragmentation is to send datagrams small
enough that they do not require fragmentation. Fortunately most UDP based applications
send small datagrams. For UDP applications which could exceed the MTU restrictions,
they either need to allow the administrator to code the MTU restriction in a configuration file
or be coded to query the Path MTU restriction.
Transition statement Let us examine the role of TCP connection establishment in
setting the MSS and why we need additional mechanisms when dealing with remote
networks.


Intermediate network MTU restrictions

[The visual shows two hosts, each on a gigabit Ethernet, agreeing on "mss=8960" at connection establishment, with an intermediate fast Ethernet (MTU 1500) between them; a router must therefore break each 9000 byte packet into 1500 byte fragments (Fragment 1, Fragment 2, Fragment 3, and so on).]

Best performance: largest segments which will not be fragmented

Figure 7-19. Intermediate network MTU restrictions

Notes:
Introduction
At TCP connection establishment, both sides communicate what their MTU restrictions
are, expressed as an MSS value. Unfortunately, this only identifies the local MTU
restrictions, which is fine if both sides are on the same network, but will cause problems
if they are remote.
If TCP used these values based upon the local MTU, we could end up with intermediate
routers fragmenting the packets.
In this example, we have two gigabit Ethernets using jumbo frames (9000 byte MTU)
with an intermediate fast Ethernet network which has an MTU of 1500. If we were to use
the local MTU values of 9000, then the first router would be forced to break those large
packets into smaller ones with transmission units no more than 1500 bytes in size.
The router does not need this extra burden and we want to avoid creating fragments.
We need some way to communicate the MTU restriction of the intermediate networks.

Instructor notes:
Purpose Explain why we need a way to identify intermediate network restrictions.
Details Explain how at connection establishment the sockets communicate their local
MTU restrictions, expressed as a maximum segment size.
Explain how this does not take into account the intermediate network restrictions and how
the routers in the path would be required to fragment.
Explain that this is not desirable and then segue to the next visual.
Additional information
Transition statement Let us see how the effective MSS value is controlled.


TCP maximum segment size

- TCP Maximum Segment Size (MSS):
  - Largest segment TCP will transmit to the other end
  - MSS = (MTU - TCP/UDP Headers - IP Header)
  - The goal is to avoid any fragmentation
  - A larger size allows less overhead per byte of data
- Local connection: MSS uses the local interface MTU sizes
- For a remote connection the MSS is tunable:
  # no -o tcp_mssdflt=1460 (default value)
- Overridden with the path MTU mechanism:
  - Discovers the best MTU value for a given path
  - To disable (it is enabled by default):
    # no -o tcp_pmtu_discover=0
  - To display or manage table entries:
    # pmtu display
    # pmtu delete -dst 192.168.1.5

Figure 7-20. TCP maximum segment size

Notes:
Introduction
The Maximum Segment Size (MSS) is the largest chunk of data that TCP will transmit
to the other end. When a connection is established, each end has the option of
announcing an MSS it expects to receive, based upon its local MTU restriction. If one
end of a connection does not receive an MSS option from the other end, a default of
536 bytes is (typically) assumed. This allows for a 20 byte IP header plus 20 byte TCP
header to fit into a 576 byte IP datagram. In practice, this small default size is rarely
experienced.

Size and fragmentation


In general, the larger the MSS the better, until fragmentation occurs. A larger segment
size allows more data per segment, reducing the TCP and IP header cost per byte of
data.

The MSS allows a host to limit the datagram size that is sent by the other side. If the
MSS size is small enough, there will be no need to fragment.
When establishing a connection, TCP can announce an MSS value up to the outgoing
interface MTU minus the size of the fixed TCP and IP headers. If the destination IP
address specified in a connection is nonlocal, the protocol default for MSS is 536 (AIX
uses a default value of 512 bytes). In practice this is overridden by the value of the no
option tcp_mssdflt which defaults to 1460 and is tunable.
Note that, in AIX 5L V5.3 and later, the tcp_mssdflt is ignored when path MTU
discovery (PMTU) is enabled. This will be covered in more detail later in this unit.

Network route with MTU attribute


There may be situations where the smallest MTU that you need to anticipate with
tcp_mssdflt is only on some routes but not on others. In that situation, you might
want to use a different MSS for some connection than others.
One way to improve this is to add a static route that forces the default MTU to a different
size. Let's say that the local system is on the 129.35.46.1-126 subnet (the router's
address is 129.35.46.1). If you wanted to send data to the 9.3 network with a 1500 size
MTU, then you can specify this on the local system with the route command:
/usr/sbin/route add -net 9.3.0.0 -netmask 255.255.0.0 129.35.46.1 -mtu
1500

Path MTU
The global tcp_mssdflt may not be optimal for all paths. The alternative of manually
defining routes each with an associated mtu attribute would be difficult to manage. To
provide a more customized approach with low administrative overhead, AIX implements
a mechanism which automatically discovers the optimal MTU size for each connection
destination and stores it in a Path MTU (PMTU) table. The contents of the table can be
displayed with the pmtu command.

How path MTU is discovered


The discovery mechanism simply sends segments which comply with the local MTU
restriction, but with an IP header bit set to forbid fragmentation. If there is a router in the
path which has an interface with a tighter restriction, it sends back an ICMP error
packet. Using this information, TCP discovers the largest MSS that can be successfully
routed to the destination without requiring fragmentation.

Path MTU timeout


Since changes could occur in the network (such as a failover resulting in a path with a
smaller MTU), the entries periodically expire and then require rediscovery. If
administrators know that the path MTU has changed and do not wish to wait for the 10
minute (default) entry expiration, then they can manually delete an entry.
This PMTU table is used to identify the MSS to be used for each TCP transmission.

UDP PMTU discovery


There is also a udp_pmtu_discover network option, which is also enabled by
default. The catch is that this is only used if UDP applications are coded to query the
PMTU size and then use that information to restrict the size of their sends.


Instructor notes:
Purpose Explain TCP maximum segment size.
Details Remind the students of the definition of a segment. Explain that we want it as
large as possible without causing the IP layer (anywhere in the path) to fragment.
Explain how local connections use the value negotiated at connection establishment for the
MSS and the remote connections use either tcp_mssdflt, the route mtu or the PMTU value
for the MSS.
Explain what PMTU discovery does.
Explain how to enable and disable the facility.
Explain how to manage the PMTU table and control the expiration period for entries.
Additional information
Prior to AIX 5L V5.3, the path MTU value was stored in the routing table by creating cloned
routes for each destination host. These cloned routes would cause problems with the
workload balancing of multipath routing and would cause routing table performance and
management problems due to the increased size of the table. As a result, many system
administrators made it a policy to disable path MTU discovery.
In AIX 5L V5.3 and later, these problems are avoided with the use of the separate PMTU
table and it is recommended that PMTU discovery be left enabled (default).
Additional information
Transition statement If traffic is fragmented in the IP layer, then we can potentially
have problems with the fragment reassembly at the destination host. Let us examine how
this can affect performance and how to manage it.


Fragmentation and IP input queue

[The visual shows a datagram fragmented at the sending IP layer because it exceeds the interface MTU; the fragments cross the adapters and land on the destination's IP input queue for reassembly, where they can be discarded or delayed. The no options ipqmaxlen and ipfragttl govern this queue.]

Figure 7-21. Fragmentation and IP input queue

Notes:
Introduction
A datagram will be fragmented by the IP layer whenever the transmission unit would
otherwise exceed the MTU for the interface. If this did not happen, the interface would
have to discard the datagram to avoid overflowing the adapter's transmission buffer.
Fragmentation may occur at the source host or at an intermediate router's IP layer. One
of the major reasons to avoid fragmentation is IP Input Queue overflows.

IP input queue processing


When fragments arrive at the destination host's IP layer, they are placed on the IP Input
Queue. IP will not pass the data to the transport layer until it has reassembled the
original datagram. It must receive all the fragments before it can do this. If one of the
fragments was discarded in the lower network layers, then IP will never be able to
reassemble the original datagram. Rather than have these dead fragments fill up the
queue, it periodically checks to see how long they have been in the queue. If fragments
have been in the queue longer than the time to live specified by the ipfragttl
network option value, they are discarded by the IP layer. The ipfragttl option is
coded in half seconds with a default of 60 (that is 30 seconds).
These fragment discards are shown under the netstat -s and the netstat -p ip
statistic: fragments dropped after timeout.
If the total number of fragments in the IP Input Queue reach the number specified by the
ipqmaxlen network option, any newly arriving fragments are discarded by the IP layer.
These fragment discards are shown under the netstat -s and the netstat -p ip
statistic: ipintrq overflows.

Performance impact of IP input queue discards


All discards are bad, since that requires the higher layers to detect the loss through
timeouts and then retransmit. There are several scenarios in which discards can occur
at the IP Input Queue.
- There were no discards or significant delays in the network, but the arrival rate of
fragments is so high that it overflows the IP Input Queue anyway. This would require
a very high traffic rate in combination with insufficient CPU cycles to process the
fragments.
- There are discards in the network (lower layers). This will cause the IP Input Queue
to eventually discard the dead fragments. Since a fragment from the original
datagram was lost anyway, there is no additional timeout and retransmit penalty for
these fragments. The problem is that holding onto the dead fragments until
ipfragttl expires will increase the chance that the IP Input Queue will overflow.
When that happens, fragments from other datagrams, which would otherwise arrive
and be reassembled, will be discarded. To the extent that the remaining fragments
(of the original datagram) do arrive and get placed on the IP Input Queue, they will
wait for ipfragttl before being discarded. This can further aggravate an IP Input
Queue overflow problem and require timeout and retransmission of the other
datagrams.
- There is a long delay in the arrival of a fragment. Even though waiting longer for the
late fragment would have allowed reassembly of the datagram, the queued
fragments for that datagram will exceed their time to live and they will be discarded.
Then the late fragment will arrive and sit for ipfragttl after which it will be
discarded. The impact is that the datagram for the late fragment needs to be
retransmitted. In addition, the situation can contribute to an IP Input Queue overflow
with the result of discarding fragments for other datagrams, which then also need to
be retransmitted.

Tuning to avoid IP input queue discards


Overflows occur because fragments arrive faster than they can be reassembled
and removed from the queue. Reducing the volume of fragments arriving and
solving any network problems that may cause delay or discard of in-transit fragments
is the best way to solve the problem.
Shortening the ipfragttl will discard incomplete fragment chains sooner and
possibly avoid a queue overflow situation, but that may also force otherwise avoidable
retransmissions of datagrams with delayed fragments.
Increasing the ipfragttl may help if delayed fragments are the main cause of
discards due to timeouts, rather than overflows of the queue. But this will hold onto
dead fragments longer and could cause an input queue overflow.
Increasing the ipqmaxlen will help avoid transitory overflows, but will not help if there
is a sustained high fragment arrival rate with delayed or discarded fragments.
Again, the best way to avoid IP Input Queue overflows is to reduce fragmentation and
eliminate packet discards and delays.
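As a sketch of the mechanics only (the values are illustrative, not
recommendations), the current settings can be displayed and changed with the
no command. On recent AIX levels ipqmaxlen is a reboot-type tunable, so it
must be staged with -r, while ipfragttl takes effect immediately:
# no -o ipqmaxlen -o ipfragttl
# no -r -o ipqmaxlen=512
# no -o ipfragttl=30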

Instructor notes:
Purpose Explain IP Input Queue overflows.
Details Explain the purpose of the IP Input Queue. Explain discards due to overflow.
Discuss how delayed or discarded fragments aggravate the situation. Explain the role of
ipqmaxlen and ipfragttl in the management of the queue and how they might be
used to help alleviate problems resulting in discards of packets in the queue. Emphasize
the importance of reducing fragmentation and delayed or discarded packets in the network
layers below the IP layer.
Additional Information - This is not central to our performance discussion, but a student
may bring it up. One of the possible reasons for avoiding fragmentation is the increasing
use of firewalls. Many firewalls will only let packets through which match a list of allowed
ports. Since only the first fragment of a datagram has a UDP header with the port number,
the remaining fragments are blocked or discarded. This prevents reassembly and forces
repeated retransmissions until the retransmission limit is reached and the session is
terminated.
Transition statement How can we monitor what is happening with IP fragmentation?


netstat -p ip

# netstat -p ip
ip:
        9892501 total packets received
        . . .
        189901 fragments received
        0 fragments dropped (dup or out of space)
        5 fragments dropped after timeout
        64443 packets reassembled ok
        9222159 packets for this host
        . . .
        8260713 packets sent from this host
        . . .
        10494 output datagrams fragmented
        206446 fragments created
        . . .
        0 ipintrq overflows
        . . .

Figure 7-22. netstat -p ip

AN512.0

Notes:
What to look for
Our focus here is to examine how much traffic there is, how much of it is fragmented,
and any discards related to IP Input Queue management.
Statistics should generally be assessed in terms of their significance as a percentage of
the total traffic (a small script for computing such percentages appears after the lists
below).
Items to look for relative to the total packets received:
- A high percentage of fragments received: As we will see, using TCP with the
proper MSS value should avoid this situation
- A high percentage of fragments dropped after timeout: These are
ipfragttl timeouts.
- A high percentage of fragments dropped (dup or out of space) and a
high percentage of ipintrq overflows: These are due to overflowing the IP Input
Queue.

Items to look for relative to the packets sent from this host:
- A high percentage of output datagrams fragmented and a high percentage of
fragments created: Again, using TCP with the proper MSS value should avoid
this situation.
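As a rough sketch of how these percentages might be computed directly from the
counters (the awk patterns assume the counter labels shown in the visual):
# netstat -p ip | awk '
      /total packets received/      { recv  = $1 }
      /fragments received/          { frag  = $1 }
      /packets sent from this host/ { sent  = $1 }
      /output datagrams fragmented/ { ofrag = $1 }
      END { if (recv) printf "inbound fragments:   %.2f%%\n", 100*frag/recv
            if (sent) printf "outbound fragmented: %.2f%%\n", 100*ofrag/sent }'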


Additional output
A complete listing of the output would be:
9892501 total packets received
0 bad header checksums
0 with size smaller than minimum
0 with data size < data length
0 with header length < data size
0 with data length < header length
0 with bad options
0 with incorrect version number
189901 fragments received
0 fragments dropped (dup or out of space)
5 fragments dropped after timeout
64443 packets reassembled ok
9222159 packets for this host
12408 packets for unknown/unsupported protocol
0 packets forwarded
532466 packets not forwardable
0 redirects sent
8260713 packets sent from this host
0 packets sent with fabricated ip header
0 output packets dropped due to no bufs, etc.
0 output packets discarded due to no route
10494 output datagrams fragmented
206446 fragments created
0 datagrams that can't be fragmented
10 IP Multicast packets dropped due to no receiver
42608 successful path MTU discovery cycles
8422 path MTU rediscovery cycles attempted
8070 path MTU discovery no-response estimates
8773 path MTU discovery response timeouts
7 path MTU discovery decreases detected
60158 path MTU discovery packets sent
0 path MTU discovery memory allocation failures
0 ipintrq overflows
0 with illegal source
0 packets processed by threads
0 packets dropped by threads
0 packets dropped due to the full socket receive buffer
0 dead gateway detection packets sent
0 dead gateway detection packet allocation failures
0 dead gateway detection gateway allocation failures


Broadcast traffic
A high value for packets for unknown/unsupported protocol points to machines
sending broadcast messages using non-IP protocols. If this number is increasing
rapidly, the machines sending the packets should be identified and possibly moved to
another subnet or network segment (sometimes the router mistakenly forwards those
packets). On the other hand, this traffic may be normal, and the extra load on your host
(even if it is not participating in these broadcasts) may not be a problem.

Indicators of corrupted or truncated packets


Non-zero values for the following counters, though rare, can point to possible network
problems, such as a defective or misconfigured adapter or switch port, or poor
cabling practices:
- bad header checksums
- fragments dropped (dup or out of space)
  (Non-zero values for either of these indicate either a network that is corrupting
  packets or device driver receive queues that are not large enough.)
- with size smaller than minimum
- with data size < data length
- with header length < data size
- with data length < header length
- with bad options


Instructor notes:
Purpose Explain what to look at within the IP protocol statistics shown by
netstat -p ip.
Details Focus on the counters related to fragmentation and the IP Input Queue.
There are additional notes after the full example output. It is suggested that you not open a
discussion on these, but be prepared to discuss if a student asks.
Additional information
Transition statement Let's next talk about problems that can happen at the interface
and adapter.

Uempty

Interface and hardware flow

(Visual: the interface layer hands packets to the device driver transmit
queues; the adapter moves them through its buffer onto the network via a
switch/hub. On the receive side, the adapter fills receive queues that are
drained by the interrupt handler.)

Figure 7-23. Interface and hardware flow

AN512.0

Notes:
Overview
To handle the transfer of data between AIX and the network adapters, AIX provides
queues where data can be placed until the other party can remove it and process it. If
packets are placed on the queue faster than they are removed, the queue will fill up and
packets will be discarded.

Transmit queue processing


The device drivers may provide a transmit queue limit which may be both hardware
queue and software queue limits. Some drivers only have a hardware queue whereas
others can have both hardware and software queues. Some drivers only allow the
software queue limits to be modified.

Copyright IBM Corp. 2010

Unit 7. Network performance


Course materials may not be reproduced in whole or in part
without the prior written permission of IBM.

7-79

Instructor Guide

The interface receives a pointer to the mbufs holding the packet to be sent and
prepends the link layer frame headers. It then places a pointer to the mbufs on the
transmit queue and signals the adapter.
The adapter will access the queue, locate the data and transfer it to its hardware buffer,
thus freeing up the entry in the transmit queue. The adapter then uses the link protocols
to transmit the data on the cabling. The cabling could be point-to-point to another host
adapter (crossover cable), daisy-chained through other host adapters
(Ethernet BNC), wired to a central repeater hub, or (more commonly)
cabled to a switch.

Receive queue processing


AIX pre-allocates mbufs and mbuf clusters which are large enough to hold the largest
transmission unit it expects to receive (MTU) and stores pointers to them on the receive
queue. These are referred to as the receive pool buffers.
The receiving adapter listens to the transmissions on the wire looking for the frames
with its own hardware address (or a broadcast address). It records the data in its
hardware buffer and locates a free buffer on the receive queue, to which it transfers the
data it has received. The adapter then signals AIX which runs an interrupt handler to
process the data.
The interrupt handler locates the data on the receive queue and processes it.
The mbufs with the data typically end up either on the IP Input Queue or on a transport
layer socket receive queue. The entry on the adapter receive queue is freed, and a
new mbuf cluster is allocated and placed on the queue to receive future data.


Instructor notes:
Purpose Provide an overview of interface and adapter hardware mechanisms.
Details Explain the basic mechanism of queuing up the packets and signaling the
adapter which reads the memory to transmit over the hardware network.
On the receiving side, explain how the network buffers are already queued up on receive
queue and how the adapter stores the incoming packets there and signals AIX.
Explain that we will be discussing two things: the overflow of these queues and the
interaction between AIX and the adapter.
Note that there is actually a wide variety of mechanisms for managing the interface
between an adapter and the host operating system. What we cover in this unit is only one
example. The adapter attributes and how they are used may be different for other types of
network adapters.
Remind the students that when using virtual ethernet adapters, the physical network
adapter may be at the Virtual I/O server partition (VIOS), associated with Shared Ethernet
Adapter (SEA) which bridges between the virtual ethernet and the physical ethernet. In that
situation the analysis being described here would need to be done at the VIOS.
Additional information
Transition statement First we will look at the interface to the adapter at the transmitting
host.


Transmit queue overflows

Packets arrive faster than the adapter can remove them
# entstat -d ent1 | grep -i "transmit queue"
Max Packets on S/W Transmit Queue: 210
S/W Transmit Queue Overflow: 0
Current S/W+H/W Transmit Queue Length: 1

If bursty traffic, increasing the queue size may help
If sustained high traffic rate:
- Improve adapter speed or bandwidth
- Reduce number of parallel application threads

Change the queue size on the adapter in the ODM:
# chdev -P -l ent1 -a tx_que_size=2048
# shutdown -Fr

To make effective without a reboot:
# ifconfig en1 detach
# chdev -l ent1 -a tx_que_size=2048
# /etc/rc.net

Figure 7-24. Transmit queue overflows

AN512.0

Notes:
Introduction
The device driver queues a transmit packet directly to the adapter hardware queue. If
the CPU is fast relative to the speed of the network, or if there are multiple CPUs, the
system may produce transmit packets faster than they can be transmitted on the
network. This will cause the hardware queue to fill. Once the hardware queue is full, the
driver can queue packets to the software queue (if it exists). If the software transmit
queue limit is reached, then the transmit packets are discarded. This can affect
performance because the upper level protocols will have to retransmit these packets.
If there are transient situations where this happens due to a burst of activity or a
temporary delay in transmitting by the adapter, a larger queue may be able to handle
the situation.
If on the other hand there is a sustained situation where the packet transmission rate is
too much for the adapter to keep up with, then one needs to either reduce that rate

(reschedule or distribute the workload, reduce the window size) or get more adapter
bandwidth.
Upgrading from 10 Mbps to 100 Mbps Ethernet may be all that is needed. An alternative
solution is to group multiple adapters in an aggregate (etherchannel). That provides
increased bandwidth and availability.
An aggregate will only help the situation if the high packet transmission rate is over
several different connections. If there is only one connection, then all the traffic for that
connection will go over the same single physical adapter. Workload balancing for
aggregates works best when there are many connections to spread over the physical
adapters which participate in the aggregate.
A similar solution would be to spread the connections over different interfaces using
multi-path routing.

Tuning the transmit queue


The transmit queue size is an attribute of the adapter. Adapter attributes cannot be
changed while the device driver is open. Since a configured interface will be using the
adapter (even if in a down state), you must detach the interface from the adapter before
you can change the attribute. The chdev command will normally update the ODM object
for the adapter and also make the change effective in the kernel. Since detaching the
interface removes all interface configuration information from the kernel, you will need
to reconfigure the interface after making your change. Note that this will be disruptive to
all communication over that interface. If you are using the interface to remotely access
the platform you lose your connections and you need to use a procedure that allows
you to reconnect. The alternative method is to have the chdev command only update
the ODM object. This does not require that the interface be detached. But the ODM
change will not become effective until you reboot the system. During reboot the change
is made effective in the kernel when the adapter object is changed to an available state
during cfgmgr processing.

Max packets on S/W transmit queue


The Max Packets on S/W Transmit Queue shows the maximum number of outgoing
packets ever queued to the software transmit queue.
An indication of an inadequate queue size is when the maximum number of transmits
queued equals the current queue size (tx_que_size). This indicates that the queue was
full at some point.
To check the current size of the queue, use the lsattr -El adapter command (where
adapter is, for example, tok0 or ent0). Because the queue is associated with the device
driver and adapter for the interface, use the adapter name, not the interface name. Use
the SMIT or the chdev command to change the queue size.

S/W transmit queue overflow


The S/W Transmit Queue Overflow shows the number of outgoing packets that have
overflowed the software transmit queue. A value other than zero requires the same
actions as would be needed if the Max Packets on S/W Transmit Queue reaches the
tx_que_size. The transmit queue size must be increased.
The Max Packets on S/W Transmit Queue field will show the high water mark for the
transmit queue, and the S/W Transmit Queue Overflow field will show the number of
software queue overflows. Note, these values may represent the hardware queue if the
adapter does not support a software transmit queue. If there are Transmit Queue
Overflows, then the hardware or software queue limits for the driver should be
increased using the chdev command or SMIT.

Changing the attribute


Different adapters have different names for the attribute that controls the transmit queue
size. So you need to first display the attributes of the adapter to find out the name.
# lsattr -E -l ent1
You also need to know what values are acceptable, so you next need to list the range of
values for that attribute name
# lsattr -R -l ent1 -a tx_que_size
The value is the number of entries in the queue. The cost for setting a large number is
not too great, since the queue itself does not require much storage, being basically an
array of pointers to the buffers that were already allocated by the higher layers. For
UDP traffic, a larger queue could allow more datagrams to accumulate that would
otherwise have been discarded; since their buffers are freed only on successful
transmission, this uses more memory.

Reconfiguring the interface


If you try to configure the adapter while a related interface is configured, you will get an
error message stating that the device is in use. The solution is to change the attribute in
the ODM object for the adapter, without making it effective yet.
You then have two options. You can either:
- Delay making the change effective until the next reboot. On reboot, the init
process will run /etc/rc.net which will read the ODM and configure the interface.
- Make the change effective now by detaching and then reconfiguring the interface.
If using the second option, remember that it is disruptive to connections using that
interface. If you are doing it remotely, there is only one interface to connect through,
and you are using it, so you need to make sure you can reconnect. One
way to do this is to code the procedure in a script and run it with nohup.
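For example, a minimal sketch of such a script (the device names and attribute
name are illustrative; adjust them for your system):
#!/bin/ksh
# reconfig_txq.sh - hypothetical helper script
ifconfig en1 detach                  # close the driver so the attribute can be changed
chdev -l ent1 -a tx_que_size=2048    # update the ODM and the running kernel
/etc/rc.net                          # reconfigure the interfaces from the ODM
It could then be launched with: nohup ./reconfig_txq.sh &  so the change
completes even if your remote session is dropped when the interface detaches.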

Example
The following example shows:
- Detection of adapter transmit queue overflow.
- Change to the transmit queue size for the adapter. This change will not take effect
until reboot. (lsattr says it has been made, but that change is only in the ODM, not
the current value). The change should reduce transmit queue overflows.
Note that the transmit and receive queue overflows can also be indicated by the
netstat -i output under Ierrs and Oerrs columns.
# entstat -d ent0
-------------------------------------------------------------
ETHERNET STATISTICS (ent0) :
Device Type: 10/100 Mbps Ethernet PCI Adapter II (1410ff01)
Hardware Address: 00:02:55:6f:1b:aa
Elapsed Time: 1 days 19 hours 14 minutes 2 seconds
. . .
S/W Transmit Queue Overflow: 20
. . .

# lsattr -El ent0
. . .
tx_que_sz      8192    Software TX Queue Size              True
txdesc_que_sz  512     TX Descriptor Queue Size            True
use_alt_addr   no      Enable Alternate Ethernet Address   True

# chdev -P -a tx_que_sz=16384 -l ent0
ent0 changed

# lsattr -El ent0
. . .
tx_que_sz      16384   Software TX Queue Size              True
txdesc_que_sz  512     TX Descriptor Queue Size            True
use_alt_addr   no      Enable Alternate Ethernet Address   True

# shutdown -Fr


Instructor notes:
Purpose Explain the causes of and solutions for transmit queue overflows.
Details Explain the basic cause of overflows.
Again you can use the basic demand versus constrained resource model. The resource is
the transmit queue. The demand is the rate at which we fill the queue. Then we can look at
the rate we fill the queue as the balance between the transmission rate for all sessions
across that adapter (demand) and the ability of the adapter (resource) to remove and
transmit them on the network. In each case we can either increase the resource or
constrain the demand.
Explain the procedure on the visual. Emphasize that to make the change effective they
need to detach the interface which will be disruptive. Point out that most administrators will
use the reboot procedure.
Additional information
Transition statement One possible reason for an Ethernet adapter not quickly
removing data from the queue could be collisions. Even without queue overflows, a high
percentage of collisions will impact performance. Let's look at one common cause of
excessive collisions: a configuration conflict between the adapter and the switch.

Uempty

Adapter configuration conflicts

Switch port and adapter configuration must match
# entstat -d ent0 | egrep -i "media speed"
Media Speed Selected: Auto negotiation
Media Speed Running: 100 Mbps Full Duplex

A duplex mode mismatch results in a high level of collisions
# entstat -d ent0 | egrep -i "collision|deferred"
Max Collision Errors: 0         No Resource Errors: 0
Late Collision Errors: 0        Receive Collision Errors: 0
Deferred: 0                     Packet Too Short Errors: 0
Timeout Errors: 0               Packets Discarded by Adapter: 0
Single Collision Count: 0       Receiver Start Count: 0
Multiple Collision Count: 0

Configuration options:
- Configure both sides for auto-negotiation
- Configure both sides for the fastest speed and full duplex

Do not set auto-negotiation on one side only:
- Defaults to half-duplex at auto-negotiation end
- If other side is coded full-duplex: mode mismatch

Figure 7-25. Adapter configuration conflicts

AN512.0

Notes:
Adapter configuration problems
A common reason for network performance problems is the misconfiguration of the
adapter or switch port.
Speed mismatches between the switch and the port will become obvious because the
adapter simply will not work.
A less obvious misconfiguration problem is a duplex mode mismatch. The adapter
communicates, but performance is severely impacted. The classic symptom is a high
collision rate, with multiple collisions, late collisions, and CRC errors.
Note that this will only be seen on the side using half-duplex. If that is the Ethernet
switch side, then these errors would be visible to the Ethernet switch administrator. The
AIX side might see higher TCP retransmission without knowing why.


It is recommended either to configure both sides to auto-negotiate, which should
result in the highest common speed and full-duplex mode, or to hard code the
configuration on both sides to the fastest common speed and full-duplex.
A common error is to code auto-negotiate on one side and not on the other. This will
likely result in a mode mismatch.
Gigabit Ethernet only supports full-duplex, so it is impossible to have a mode mismatch
when configured for 1000 Mbps.

What is a collision?
When discussing collisions, you may wish to think of a telephone party line. A
system checks for a transmission in progress before trying to send a packet (carrier
sense). If two machines transmit at exactly the same time, there is a collision because
neither senses the other. When a host recognizes a collision, it backs off and waits a
short random interval before retransmitting. The more machines on the network, the
greater the chance of collision. A collision rate can be calculated as the number of
collisions divided by the number of output packets. If the result is greater than 10%,
there is high network utilization, and the network may need to be reorganized or
partitioned.
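A rough sketch of that check (ent0 and the counts are illustrative; note that
entstat prints transmit and receive statistics in paired columns, so read the
transmit-side values on the left):
# entstat ent0 | egrep -i "packets:|collision count"
Packets: 1200                       Packets: 3415
Single Collision Count: 120         Receiver Start Count: 0
Multiple Collision Count: 35
Collision rate = (120 + 35) / 1200 = 12.9%, which is above the roughly 10%
guideline.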
If your network environment requires half-duplex operation, then collisions are normal.
It is through collision detection, back-offs, and retries that the original Ethernet
standard allowed the sharing of a common wire. If you are using BNC wiring or are
cabled into a repeater hub (instead of a switching hub), then you will be using the
half-duplex protocol. But even in this environment you want to see only single
collisions (too many busy adapters on the common wire will cause multiple collisions
and bad performance), and even with half-duplex, we do not expect to see late
collisions.

Max collision errors


Max Collision Errors is the number of unsuccessful transmissions due to too many
collisions. The number of collisions encountered exceeded the number of retries on the
adapter.

Late collision errors


Late Collision Errors is the number of unsuccessful transmissions due to the late
collision error. A late collision error is one that occurs after the start of transmission.
Normally when two adapters try to start a transmission at the same time, they detect a
collision and retry at random time intervals. Or if one adapter detects that another has a
transmission in progress, it will retry later. Late Collision is usually caused by either
misconfigured or defective Ethernet adapters (they failed to detect a transmission in
progress and transmitted anyway), or by incorrect cabling where the two adapters are
so far apart that they do not hear the other until after they have started their
transmissions.

Deferred
Deferred counts indicate packets that could not be sent immediately because the media
was half-duplex and the adapter sensed that a packet was already coming down the
media, so it deferred sending until the line was free. With full-duplex, this is not a
problem because both sides can send and receive at the same time. If the adapter
sensed that the line was free and sent the packet, but the other side did the same thing
at the same time, that's where collisions occur.
Ethernet collisions can be avoided by running full-duplex (make sure both sides are
correctly set to full-duplex; otherwise the performance will be even worse than having
collisions).

Collision errors
The types of collision errors reported are:
- Single Collision Count is the number of outgoing packets with single (only one)
collision encountered during transmission.
- Multiple Collision Count is the number of outgoing packets with multiple (2 - 15)
collisions encountered during transmission attempts.
- Receive Collision Errors is the number of incoming packets with collision errors
during reception.
Collision errors should be considered since they can decrease performance. Multiple
collision errors are even worse because that means the same packet was sent multiple
times and each time it had a collision. With full-duplex (available on switches and
crossover cables), you should not see any collisions; so it is best to use a switch with
both ends running full-duplex.
If one end is half-duplex and the other is full-duplex, then collisions are almost
guaranteed (since the full-duplex side is not even listening for collisions) and performance
may be terrible (look for CRC (cyclic redundancy check) errors in this case). The errors
and collisions will be seen on the half-duplex side of the mismatch.

Additional details
Collisions do not occur in a full-duplex environment. Since the connection between the
adapter and the switch port is analogous to an Ethernet crossover cable joining two
computers, they are the only ones talking. If the connection is full-duplex, then there is
never a conflict over who gets to talk when. In fact, an adapter or port that is configured
for full-duplex does not even bother to detect the other side transmitting when it wants
to transmit.
If one side is configured to half-duplex and the other is configured to full-duplex, then
the half-duplex side is assuming that the other side can only handle half-duplex
communications while the full-duplex side is going to transmit any time it wants to.

On the half-duplex side, when it wants to transmit, it keeps hearing the other side
transmitting, does its random delay retries and keeps having collisions, because the
full-duplex side is not using the back off and retry protocol. And when the half-duplex
side gets no initial collision and starts a transmit, it will likely get late collisions and
immediately terminate that transmission because the full-duplex side does not care if
the half-duplex side is talking and it will transmit anyway. This leads to high collision
rates and very poor performance.
Even with both sides set to auto-negotiate there can still be mode mismatches. This
depends on both sides having properly implemented the auto-negotiation standard.
If there was confusion about how to implement the standard, or a bug in the
implementation, then the two sides may fail to negotiate correctly. In this situation,
hard coding on both sides is the expedient solution.
Another mistake that can affect performance is the misplacement of adapters in the bus
slots. Overloading a PCI bus can lead to performance problems. For example, a single
Gigabit Ethernet adapter can consume the bandwidth of a PCI bus; two gigabit
adapters on the same bus could overload that bus. Even if you do not
overload a bus, placing a PCI-X adapter in a PCI slot could reduce the MHz rate of the
adapter's PCI interface.
It is strongly recommended that administrators plan their adapter placement using the
manual: Adapter Placement Reference for AIX.


Instructor notes:
Purpose Explain adapter and switch configuration problems.
Details Explain the consequences of mismatched configuration between an adapter
and a switch port. Focus on the problem of duplex mode mismatch and the resulting
collisions.
Explain how auto-negotiation works. Emphasize that both sides must be configured to
auto-negotiation. Point out that even when configured correctly, there are implementations
of auto-negotiation that have had problems correctly negotiating, in which case they should
hard code the configuration on both sides.
Additional information
Transition statement Let's move on to the network adapter at the receiving side of the
transmission.


Receive pool buffer errors

An adapter may receive packets faster than the interrupt handler can remove them
Check entstat -d for a count of errors
# entstat -d ent0 | grep -i "receive pool"
Receive Pool Buffer Size: 2048
Free Receive Pool Buffers: 767
No Receive Pool Buffer Errors: 5746361

If bursty traffic, a larger pool may reduce the errors:
- Some adapters support configurable receive pools

If sustained high traffic, may throttle traffic:
- Reduce window size(s)
- Reduce number of clients transmitting to server

Adapter and switch port problems can cause errors

Figure 7-26. Receive pool buffer errors

AN512.0

Notes:
Introduction
A full receive pool can result in the adapter discarding packets. Some adapters allow
configuring the number of resources used for receiving packets from the network. This
might include the number of receive buffers (and even their size) or may simply be a
receive queue parameter (which indirectly controls the number of receive buffers). The
receive resources may need to be increased to handle peak bursts on the network.
The entstat -d <adapter-name> command will give you a count of the number of
receive pool buffer errors. A small percentage is not a great concern. But, a large
percentage will have a significant impact on performance.


Increasing the queue size


The name of the adapter attribute that controls the adapter's receive queue size varies
from adapter to adapter. List the attributes of the adapter to find the correct name,
and then update the value using the procedure we covered with the transmit queue.
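A minimal sketch of that procedure (the attribute name rxbuf_pool_sz is only an
example; check the lsattr output for your adapter's actual receive attribute
names):
# lsattr -El ent0 | grep -i rx
# lsattr -R -l ent0 -a rxbuf_pool_sz
# chdev -P -l ent0 -a rxbuf_pool_sz=3072
# shutdown -Fr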

Throttling the traffic


If there is a sustained high level of traffic, you may need to throttle it back to avoid the
errors. While the discards themselves should automatically slow down the sliding
window and even place the TCP session into slowstart mode, it might be better to
reduce the tcp_recvspace to further reduce the arrival rate of packets. You may
elect to do this globally, or be more selective by using the ISNO options, or even by
configuring the application to set the socket buffer sizes using setsockopt().
More commonly, the total traffic is a combination of many clients transmitting streams of
data at the same time. For example, multiple concurrent backups of client
systems to Tivoli Storage Manager (TSM) may overload the TSM server.

Adapter and switch port problems


In some cases, we have found that the cause is a defective switch port or a defective
adapter. Changing the port in use may solve it. Upgrading the adapter microcode to the
current level may provide a fix. Replacing the adapter may be needed.


Instructor notes:
Purpose Explain the cause and solution of receive pool buffer errors.
Details Explain what receive pool buffer errors are and how they are recorded in the
netstat -d report. Discuss the possible solutions.
Additional information - A student may ask about device specific buffers.
AIX supports device specific buffers for some adapters. This allows a driver to allocate its
own private set of buffers and have them already setup for DMA. This can provide
additional performance because the overhead to set up the DMA mapping is done all at
one time instead of every time. Also, the adapter can allocate buffer sizes that are best
suited to its MTU size. For example, the SP2 switch supports a 64 KB MTU. The maximum
system mbuf size is 16 KB bytes. By allowing the adapter to have 64 KB byte buffers, large
64 KB writes from applications can be copied directly into the 64 KB buffers owned by the
adapter, instead of copying them into multiple 16 KB buffers (which has more overhead to
allocate and free the extra buffers). Some SP2 high-speed switch adapters support these
device specific buffers. The system administrator would need to use device specific
commands to view the statistics related to these adapter buffers and then change adapter
parameters as necessary. Refer to a SP2 switch tuning guide for information on tuning the
SP2 switch adapters.
Transition statement Sometimes network performance problems can be so obscure or
complex that someone needs to analyze the flow of packets. Let's look at the AIX tools for
tracing and reporting packet flows.


Network traces

Capture and report details of network packets
- Useful for analyzing difficult situations
- Requires detailed understanding of network protocols and header fields

tcpdump
- Good summary of header information
- Easy to read
- PerfPMR creates tcpdump.raw file

iptrace and ipreport
- More detailed report
- PerfPMR creates iptrace.raw file

Figure 7-27. Network traces

AN512.0

Notes:
tcpdump
The tcpdump command prints out the header information of the packets captured on a
network interface. It can be used to trace all packets that go through a single network
interface or to trace a specific protocol, such as TCP.
By default, tcpdump sends its output to stdout and does not require any post
processing. However, it also allows data collection in raw format (without any packet
parsing) into a file. This file can be used as input for post processing with tcpdump.
If you ran the perfpmr.sh script, then a tcpdump.raw file will be in the output
directory. You may then either use the PerfPMR script tcpdump.sh -r to format the
tcpdump or (if you wish more control over the formatting and record selection) use the
tcpdump -r command to format it. tcpdump defaults to the first configured interface. If
you need control over what interface is being traced, you need to run tcpdump directly
with the -i <interface> option.
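A minimal sketch of a direct capture and its formatting (the interface and port
are illustrative):
# tcpdump -i en0 -w /tmp/trace.raw port 21
# tcpdump -r /tmp/trace.raw -n -ttt | more
Here -w saves the raw (unparsed) packets, -r reads them back, -n suppresses
host name lookups, and -ttt prints the delta time stamps used to spot delays.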

Supported interfaces
Only a limited number of network interfaces are supported by tcpdump: Ethernet,
FDDI, token-ring and loopback. Interfaces like the css interface (SP2 high performance
switch) are not supported by tcpdump. For those interfaces, the iptrace command
must be used to capture the packets.

iptrace command
The iptrace command is an interface-level packet tracing tool for Internet protocols.
Unlike tcpdump, it captures the entire contents of the packets and writes them into a
logfile. The filename must be specified when the iptrace command is invoked. Thus,
the size of the logfile can become quite large, sometimes several hundreds of
megabytes in just a few seconds, depending on the speed of the network adapter and
level of network traffic.
Post processing of the logfile is done with the ipreport command.
iptrace loads a kernel extension for the packet capturing. This kernel extension does
not get unloaded when iptrace is stopped with kill -9. Thus, iptrace should be
stopped with kill -15. You can unload the kernel extension with iptrace -u if you
mistakenly killed iptrace with kill -9.

ipreport command
The ipreport command is the post processing tool for iptrace data.
To generate a report on an iptrace logfile run: ipreport <logfile> | more
It is advisable to generate an ipreport output with packet numbering, decoded RPC
calls, and protocol information (-s, -r, and -n flags).
The PerfPMR generated iptrace.raw file can be formatted by running the PerfPMR
script: iptrace.sh -r.
The ipreport command does a hostname resolution for all packets in the logfile. The
processing of the data can take a very long time if the hostnames are not known (that
usually happens when you post-process iptrace data from customer machines). You
can reduce the processing time by using the -N option, which bypasses the host name
resolution. For example: ipreport -srnN <logfile> | more
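A minimal sketch of the whole cycle (the interface and file names are
illustrative):
# iptrace -i en0 /tmp/iptrace.raw
(reproduce the problem traffic)
# ps -ef | grep iptrace
# kill -15 <pid>
# ipreport -srnN /tmp/iptrace.raw > /tmp/iptrace.rpt
The kill -15 lets the daemon unload its kernel extension cleanly, as noted
above.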


Instructor notes:
Purpose Introduce the use of network traces.
Details Emphasize that we are not teaching them enough about network protocol
analysis to make effective use of network traces in most cases. But they should be aware
of network traces as a tool and the basic differences between tcpdump and iptrace.
We will not go into the details of tcpdump data analysis. This would take more time than we
have in this class. Refer the students to the class:
QT055, AIX 5L TCP/IP and Network Services: Problem Determination
Point out the ability to use the PerfPMR generated trace file and its need to be formatted.
There are other trace tools that can be used. Network sniffers are popular, Ethernet
switches often have their own tracing capabilities, and some administrators prefer
using the GNU tool Ethereal. This course only covers the trace tools that are part of the
AIX installation.
Explain that the trace file is in the PerfPMR output and how it can be formatted using the
iptrace.sh -r script.
Additional information tcpdump.sh has an interface selection default which may be
a problem; your traffic may be on a different interface. Thus you may need to run a trace
identifying the interface your traffic is on.
Transition statement Lets look at examples of reports from these two trace facilities.


Network trace examples

tcpdump example:
000015 IP client.33100 > server.14000: P 4061:4121(60) ack 21 win 65535
000182 IP server.14000 > client.33100: P 21:41(20) ack 4121 win 17520
000059 IP client.33100 > server.14000: . 4121:5581(1460) ack 41 win 65535
000010 IP client.33100 > server.14000: P 5581:6121(540) ack 41 win 65535
209793 IP server.14000 > client.33100: . ack 6121 win 17520

ipreport example:
Packet Number 2
ETH: ====( 77 bytes transmitted on interface en0 )==== 09:48:01.954494310
ETH:  [ 00:06:29:c3:0a:1c -> 00:06:29:ec:00:64 ] type 800 (IP)
IP:   < SRC =  9.3.104.19 > (ginger.austin.ibm.com)
IP:   < DST =  9.41.90.25 > (idefix.austin.ibm.com)
IP:   ip_v=4, ip_hl=20, ip_tos=0, ip_len=63, ip_id=50365, ip_off=0 DF
IP:   ip_ttl=60, ip_sum=a5a3, ip_p = 6 (TCP)
TCP:  <source port=23(telnet), destination port=34919 >
TCP:  th_seq=e4d92b8a, th_ack=43bc31c5
TCP:  th_off=5, flags<PUSH | ACK>
TCP:  th_win=17520, th_sum=e5b8, th_urp=0
TCP: 00000000  67696e67 65722e61 75737469 6e2e6962  |ginger.austin.ib|
TCP: 00000010  6d2e636f 6d203a                      |m.com :         |

Figure 7-28. Network trace examples

AN512.0

Notes:
The strength of the tcpdump report is the ability to see many packets on a single page,
because it can use one line per packet. The tool can be customized to present the trace
information in different ways. For example, this example requested that the IP addresses
not be translated into their symbolic names, and that the time stamps be printed only as
a delta (in microseconds) between the current and previous line. The ability to see time
stamps can be helpful in seeing where delays occurred. The source and destination
fields are important for identifying the connection and the direction of flow. You can also
see the amount of data (in parentheses) and what bytes are being acknowledged as
received in the acknowledgements.
The strength of the ipreport is the great amount of detail provided in its breakdown of the
header fields. But this can also be a weakness, since it can be difficult to see the big
picture through all that detail.


Instructor notes:
Purpose Show examples of the tcpdump and ipreport output.
Details Explain the differences between iptrace and tcpdump. Emphasize the pros and
cons of the level of detail.
The example is to just demonstrate how a formatted report looks. Do not spend much time
on it. We will not go into how to analyze the trace contents in this course.
One reason the network trace content is included in the lecture material is that the
exercise steps for recognizing a performance problem due to Nagle use tcpdump to show
the 200 ms delays using the time stamps. Thus it is important that, at a minimum, they
know how to find the time stamps and the direction of flow.
Additional information
Transition statement Let's review what we have covered with some checkpoint
questions.

Copyright IBM Corp. 2010

Unit 7. Network performance


Course materials may not be reproduced in whole or in part
without the prior written permission of IBM.

7-99

Instructor Guide

Checkpoint (1 of 3)

1. Interactive users are more concerned with measurements of __________,
   while users of batch data transfers are more concerned with
   measurements of _______________.
2. True/False: The thewall maximum amount of network pinned memory can
   be increased in AIX6 only by increasing the amount of real memory.
3. When sending a single TCP packet, an acknowledgement can, by default,
   be delayed as long as ______ milliseconds.

Figure 7-29. Checkpoint (1 of 3)

AN512.0

Notes:


Instructor notes:
Purpose Review the unit and test the students understanding of the topic.
Details You may give the students time to figure out all the answers first or ask for
volunteer answers one question at a time. Either way, be sure to review each one, validate
that they understand the answer, and use the opportunity to discuss the relevant subject.

Checkpoint solutions (1 of 3)

1. Interactive users are more concerned with measurements of response
   time, while users of batch data transfers are more concerned with
   measurements of throughput.
2. True/False: The thewall maximum amount of network pinned memory can
   be increased in AIX6 only by increasing the amount of real memory.
   (True)
3. When sending a single TCP packet, an acknowledgement can, by default,
   be delayed as long as 200 milliseconds.

Additional information
Transition statement


Checkpoint (2 of 3)

4. True/False: Increasing the tcp_recvspace at the receiving host will
   always increase the effective window size for the connections.
   _____________________________________________________________
5. What network option must be enabled to allow window sizes greater
   than 64 KB? _______________________________________
6. List two ways in which Nagle's Algorithm can be disabled:

Figure 7-30. Checkpoint (2 of 3)

AN512.0

Notes:


Instructor notes:
Purpose Review the unit and test the students understanding of the topic.
Details You may give the students time to figure out all the answers first or ask for
volunteer answers one question at a time. Either way, be sure to review each one, validate
that they understand the answer, and use the opportunity to discuss the relevant subject.

Checkpoint solutions (2 of 3)

4. True/False: Increasing the tcp_recvspace at the receiving host will
   always increase the effective window size for the connections.
   False. If the tcp_sendspace at the transmitting host is smaller than
   the tcp_recvspace at the receiving host, it will become the
   controlling factor. Both ends would need to be increased.
5. What network option must be enabled to allow window sizes greater
   than 64 KB? rfc1323
6. List two ways in which Nagle's Algorithm can be disabled:
   - Specify tcp_nodelay either from the application (setsockopt) or as
     an Interface Specific Network Option
   - Specify tcp_nagle_limit=1 as a network option

Additional information
Transition statement


Checkpoint (3 of 3)

7. If you saw a large count for ipintrq in the netstat report, which
   actions would help reduce the overflows?
   a) Increase memory and CPU capacity at the receiving host
   b) Increase ipqmaxlen and decrease ipfragttl
   c) Decrease ipqmaxlen and increase ipfragttl
   d) Eliminate the cause of delayed and dropped fragments
8. A high percentage of collisions in an Ethernet full duplex switch
   environment is an indication of:
   _____________________________________________________________

Figure 7-31. Checkpoint (3 of 3)

AN512.0

Notes:


Instructor notes:
Purpose Review the unit and test the students understanding of the topic.
Details You may give the students time to figure out all the answers first or ask for
volunteer answers one question at a time. Either way, be sure to review each one, validate
that they understand the answer, and use the opportunity to discuss the relevant subject.

Checkpoint solutions (3 of 3)

7. If you saw a large count for ipintrq in the netstat report, which
   actions would help reduce the overflows?
   a) Increase memory and CPU capacity at the receiving host
   b) Increase ipqmaxlen and decrease ipfragttl
   c) Decrease ipqmaxlen and increase ipfragttl
   d) Eliminate the cause of delayed and dropped fragments
   Answer: b and d
8. A high percentage of collisions in an Ethernet full duplex switch
   environment is an indication of:
   Either a defective adapter or switch port, or a duplex mode
   configuration mismatch between the adapter and switch port.

Additional information
Transition statement Let's practice some of what we discussed with some lab
exercises.


Exercise 7: Network performance

- Window size tuning
- FTP case study
- Packet throughput (optional)
- Transmit queue overflows (optional)

Figure 7-32. Exercise 7: Network performance

AN512.0
Notes:


Instructor notes:
Purpose Begin the lab exercise.
Details Have the students open their exercise guides and provide a lab introduction.
Additional information
Transition statement Having finished the exercise, let us summarize what we have
learned.


Unit summary

This unit covered:
- Identifying the network components that affect network performance
- Listing the network tools that can be used to measure, monitor, and
  tune network performance
- Monitoring and tuning UDP and TCP transport mechanisms
- Monitoring and tuning IP fragmentation mechanisms
- Monitoring and tuning network adapter and interface mechanisms

Figure 7-33. Unit summary

AN512.0

Notes:


Instructor notes:
Purpose To summarize this unit.
Details Remember, there are trade-offs! Any time you change an attribute or network
parameter, monitor the system to determine the effect.
Many of these parameters can be changed using SMIT and the communication screens.
Additional information
Transition statement This is the end of this unit. Next we will discuss NFS.


Unit 8. NFS performance


Estimated time
1:20 (1:00 Unit; 0:20 Exercise)

What this unit is about


This unit describes the factors that influence the performance of the
Network File System, more commonly known as NFS. It covers the
tools for monitoring NFS activity and for tuning performance in an NFS
environment.

What you should be able to do


After completing this unit, you should be able to:
Define the basic NFS tuning concepts
List the differences between NFS V2, V3 and V4
Monitor and tune NFS servers
Monitor and tune NFS clients

How you will check your progress


Accountability:
Checkpoint
Machine exercises

References
AIX Version 6.1 Performance Management
AIX Version 6.1 Performance Tools Guide and Reference
AIX Version 6.1 Commands Reference, Volumes 1-6
AIX Version 6.1 System Management Guide: Communications and Networks
SG24-6478  AIX 5L Practical Performance Tools and Tuning Guide (Redbook)
SG24-6184  IBM eServer Certification Study - AIX 5L Performance and System Tuning (Redbook)


Unit objectives
After completing this unit, you should be able to:
- Define the basic Network File System (NFS) tuning concepts
- List the differences between NFS V2, V3, and V4
- Use nfsstat and netpmon to monitor NFS
- Use nfso and mount options to tune NFS


Figure 8-1. Unit objectives


Notes:


Instructor notes:
Purpose Explain the objectives of this unit.
Details Go over the objectives with the students, explaining what they will be learning in
this unit.
Additional information
Transition statement Let's take a look at the first topic.

Uempty

NFS tuning concepts


NFS allows one or NFS clients to mount file systems from
an NFS server
Cumulative load of many NFS clients can overload server
May need to limit number of clients or detune clients
Reduce number of client biod threads
Reduce client read and write sizes

NFS performance depends on basic memory, CPU, I/O,


and network performance management
Do not misuse NFS
Stable files should be replicated to user systems


Figure 8-2. NFS tuning concepts


Notes:
Overview
The Network File System (NFS) is a distributed file system. It is independent of machine
type or operating system. A typical NFS environment consists of one or more client
machines and at least one NFS server. The NFS server can export its local file systems
to the client machines so that the clients can have access to these file systems as if
they were local on the clients. Applications can access the file systems transparently
(normal file semantics can be used).

NFS misuse
A common error is to assume that, because remote access to a file system acts
functionally like a local file system, it can be conveniently used for all sorts of remote
data access without concern. While it works functionally, it is not always the most efficient
way to handle the situation. Here are two examples of when it might not be a good idea
to use NFS:

- The first example is a one-time access of an entire file. It will be faster to use FTP
to transfer the file to your local platform than to retrieve the entire file over NFS. The
issue would be different if you were only accessing a few portions of the file.
- A second example is when an application is designed for performance assuming
that all the files are local. There is a big difference between the application looking
through a 5,000 entry directory file locally versus doing the same thing over NFS. Or
an application that does a lot of dynamic linking to modules in what it expects to be a
local library. Place that same library across an NFS mount and you could experience
significant degradation of performance.
In many situations, it is better to allow the client to have its own local copies of the file.

Client and server tuning


NFS tuning can be quite complex since it involves tuning every component we have
discussed.
The NFS file system accesses may be concentrated on one or a few disks. Check for
disk bottlenecks using the physical I/O monitoring techniques mentioned previously
(using tools such as iostat, topas, or filemon). Disk and/or LVM tuning may have to
be done.
On both the NFS client and NFS server, this includes tuning CPU usage, memory,
logical and physical I/O (physical I/O on client if cacheFS is used), mount options,
network options, network adapter tuning, and NFS options. Sometimes, the client may
have to be de-tuned. That is, limit the amount of I/O sent to the NFS server because
either the network or the NFS server cannot keep up with the data rate.
Increasing biod and nfsd threads can improve throughput unless the network or server
cannot keep up. Then, decreasing biod threads or reducing read/write buffer sizes may
help. The biod and nfsd threads will be created dynamically up to a tunable maximum
value (mount option for biod and nfso option nfs_max_threads for nfsd).
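As a sketch of the mechanics (the server name nfssrv and mount point /data are hypothetical), the per-mount biod thread limit is set at mount time, and the server-wide nfsd ceiling can be displayed with nfso:
# mount -o biods=8 nfssrv:/export/data /data    (more biod threads for this mount)
# nfso -o nfs_max_threads                       (display the nfsd thread ceiling)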

Network tuning
The network itself may have to be tuned (switches, routers, network media). Check for
network media settings such as speed and duplex modes. Also, check the switch
statistics or router statistics to see if packets are getting dropped.

Detuning a client
If dropped and significantly delayed packets are mainly due to the load being presented
by your platform as a client, and you cannot improve the ability of the network or the
NFS server to handle that load, then you may have to de-tune your NFS client.
De-tuning means slowing down the NFS client.


If the network or server congestion is due to the accumulated load of many clients, then
you would need to either reduce the number of clients or find a way to detune all of
them. If you do not control the client platforms, many of the client detuning methods will
not be practical.
It is counter-intuitive, but detuning may actually improve performance. This is because
the performance impact of pacing the traffic is often less than the performance impact of
discarded packets due to congestion.
De-tuning should only be done as a last resort since this will decrease performance if
the network or server was not a bottleneck.

How to de-tune
A common detuning technique is to reduce the number of biod threads down to a value
of 1 (mount -o biods=1). This sets the biod threads to 1 per mount.
The read/write size can be reduced by specifying a small value for rsize or wsize
(such as 1024), where rsize and wsize are mount options.
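For example (nfssrv and /data are hypothetical names), a de-tuned mount could combine both techniques:
# mount -o biods=1,rsize=1024,wsize=1024 nfssrv:/export/data /data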

Enabling commit-behind
Sometimes enabling commit-behind can also improve performance even though this
will cause more commits to occur than if commit-behind was not needed. While
commit-behind causes more commits, it actually can reduce the number of commits if
page-replacement is occurring on modified client pages.
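A minimal sketch, assuming the AIX combehind mount option and hypothetical names:
# mount -o combehind nfssrv:/export/data /data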


Instructor notes:
Purpose Explain the basic function of NFS and identify the versions of NFS.
Details Review the basic concepts of NFS as a distributed file system with an emphasis
on the fact the networking is transparent to the application. Also explain that there are
different versions of the NFS protocol with significant performance implications.
Explain to the students the major components involved in NFS tuning. Remind them that
many of these components are the same as we have been covering during the rest of the
class. Both client and server can be impacted through CPU or memory constraints. The
communications between them goes through all the network layers we discussed in the
previous unit. The server side will be doing disk I/O through a local file system. Emphasize
that they need to follow the full performance analysis methodology at the client and server
platforms, not just focus on the NFS tunables.
Remind them that performance is determined by the balance between demand and
resource constraints. If the total demand of many clients overwhelms a server, the
performance will degrade quickly. If you can neither spread the workload out over more
servers nor increase the capacity of the constrained server, then it may be beneficial to
throttle back the demand by either restricting the number of clients or de-tuning the clients
to reduce the level of demand per client.
Most of this unit will focus on the NFS tunables, but we must be sure the students
understand the bigger picture.
Draw for the students the big picture of many clients using a single overburdened server.
The cumulative load can cause congestion at the server resulting in delays and discards,
which in turn cause retransmissions. The solution is either to control the number of clients
or to reduce the demand per client.
Constraining the demand per client will reduce what they can do in a perfect environment.
So it is not an ideal solution. But point out that, in this scenario with an overburdened
server, the situation is not ideal. Reducing the size and slowing the pace of the client
requests is preferable to the performance impact of retransmissions. A much preferred
solution would be to increase the capacity of the NFS server and provide even better
service to the clients.
Explain the principles behind the detuning mechanisms.
Additional information NFS comes with AIX. It is not a separately orderable product.
Transition statement The version of NFS protocol can be requested on a mount by
mount basis. Let us look at the different versions and how they relate to performance.


NFS versions
- NFS V2
  - Maximum and default read/write size is 8 KB
  - All writes are synchronous
  - File offsets are limited to 32 bits
- NFS V3 (best performing)
  - Default read/write size 32 KB, maximum 64 KB
  - Reliable asynchronous writes
  - Attributes on replies and readdirplus reduce getattr overhead
  - File offsets can be up to 64 bits
- NFS V4
  - Only supports TCP
  - rpc.lockd, rpc.statd, and rpc.mountd merged into nfsd
  - Benefits large-scale distributed file sharing environments
  - Better security
  - Requires more processing overhead

Figure 8-3. NFS versions


Notes:
NFS V2
NFS Version 2 (V2) was the only version of NFS available until AIX V4.2.1 at which
point NFS Version 3 (V3) was implemented. Currently on AIX, NFS V2, V3 and V4 are
supported simultaneously on both the AIX NFS client and the AIX NFS server. Writes in
NFS V2 are much slower than in V3 or V4 because all writes are synchronous on the
server. The clients write request does not complete until the write has reached the disk
on the NFS server. Also, the maximum read or write size is 8 KB in NFS V2. Another
limitation of NFS V2 is that the file offsets are 32 bits which means the maximum file
size that can be supported is 4 GB. On AIX, the default maximum number of biod
threads is 7 per NFS V2 mount, but this can be overridden with the biods mount option.
NFS V2 supports both UDP and TCP (TCP is the default on AIX).


NFS V3
NFS Version 3 was introduced in AIX V4.2.1. AIX can support both NFS Version 2 and
Version 3 on the same machine. NFS V3 is the default on AIX. NFS V3 improves
performance in many ways:
- NFS V3 removes the 8 KB read/write size limit in V2. The default read/write size is
32 KB on AIX with NFS V3.
- The maximum NFS I/O size in AIX NFS V3 is 64 KB. Other vendors' NFS V3
implementations may have different maximum and default values.
- NFS V2 required that writes to the NFS server did not return back to the client until
the data was committed to disk. NFS V3 provides for reliable asynchronous writes
so that writes can be written to disk asynchronously. The write goes to memory and
returns. If the write has not been committed to disk, the client marks that write as a
smudged write. Eventually the write will have to be committed (due to page
replacement on the client or due to a sync). When the write is committed, the
contents are sent to the disk and then the write on the client is considered no longer
in a smudged state.
NFS V3 operations return the attributes of the file with every operation so that the
number of attribute calls can be decreased from the client. A directory read can result in
a readdirplus call (an operation not available in NFS V2) which not only reads the
directory but can also return the attributes of multiple files in the directory.
Since NFS V3 uses 64-bit offsets, the file size can be considerably larger than 4 GB.

NFS V4
NFS Version 4 is described by RFC 3530.
NFS V4 only supports TCP. UDP is not supported.
While NFS V4 is similar to prior versions of NFS (primarily NFS V3), NFS V4 provides
many new functional enhancements in areas such as security, scalability, and back-end
data management. These characteristics make NFS V4 a better choice for large-scale
distributed file sharing environments.
The additional functionality and complexity of NFS V4 result in more processing
overhead. Therefore, NFS V4 performance might be slower than with NFS V3 for many
applications.
The performance impact of NFS V4 varies significantly depending on which new
functions you use. For example, if you use the same security mechanisms on NFS V4
and NFS V3, your system might perform slightly slower with NFS V4. However, you
might notice a significant degradation in performance when comparing the performance
of NFS V3 using traditional UNIX authentication (AUTH_SYS) to that of NFS V4 using
Kerberos 5 with privacy, which means full user data encryption.

A client that does an NFS V3 mount of a file system that has been NFS exported to allow
NFS V4 mounts by other clients will experience performance degradation compared with
mounting a file system from a server that has exported it without allowing NFS V4
mounts.

rsize/wsize option
The rsize value specifies the maximum read size on the mount, while the wsize value
specifies the maximum write size on the mount. If the application issues I/Os larger
than these, they are broken down by NFS into rsize/wsize chunks.
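For example (hypothetical server and mount point), an NFS V3 mount that limits NFS I/Os to 32 KB might look like:
# mount -o vers=3,rsize=32768,wsize=32768 nfssrv:/export/data /data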


Instructor notes:
Purpose Explain how the different NFS versions compare in terms of performance.
Details Explain the performance characteristics of NFS V2. Emphasize the restrictions
on read/write size and the synchronous nature of write requests. The functional restriction
on the maximum file size is another reason to avoid NFS V2. Point out that it is not the
default in AIX, but there may be back-level clients which require NFS V2. AIX will support
this protocol if requested.
Explain the advantages of NFS V3 over NFS V2. Emphasize the larger read/write size, the
asynchronous write capability, and the reduction in individual requests to get attributes due
to the readdirplus capability. Also note that the 64 bit file offset means that NFS can
address any size file that the mounted file system can support. If asked about fewer biod
threads, explain that 4 is usually sufficient, but if not we will show them how to increase that
number.
Explain how NFS V4 differs from previous versions of NFS. Note that they likely will not
be using NFS V4 unless they need the functional enhancements and both client and server
are NFS V4 capable. Point out that NFS V4 might give them worse performance due to the
additional functionality. Just because it is newer does not necessarily mean it is better for
their purposes. Be careful to avoid getting into an extensive discussion of the enhanced
NFS V4 functionality; it is outside the scope of this course.
Additional information NFS V4 has extensive functionality enhancements which we
lack time to cover in detail in this course. In the area of security, it is Kerberos
authentication capable. It also has the ability to define replicas of served file systems, thus
being able to spread the workload across multiple platforms and a referral service to
simplify administration of the replicas. The replicas can also be used for improved
availability and NFS V4 provides a failover capability to help automate this.
Transition statement Which transport layer is being used can also vary on a mount by
mount basis. Let us consider the performance implications of these.


Transport layers used by NFS


- Client can select the transport protocol for each mount via mount options
- TCP is the default protocol in NFS V3
  - Provides reliable delivery with flow control, an in-order stream of data, and error detection with retransmission
  - Provides for larger throughput due to 64 KB I/O sizes
  - Recommended for wide-area networks or on networks with high congestion or inefficient NFS servers
- UDP is recommended in efficient environments where the network and server are able to keep up with client requests


Figure 8-4. Transport layers used by NFS


Notes:
Selecting TCP or UDP
In NFS V3, the client machine can select TCP or UDP as a transport protocol for a
particular mount. The default is TCP in AIX V4.3 and higher. Prior to AIX V4.3, the
default was UDP for both NFS V2 and NFS V3. A mount option (proto) can be used to
select TCP or UDP (example: mount -o proto=udp).
UDP works efficiently over clean or efficient networks and responsive servers. For wide
area networks or for busy networks or networks with slower servers, TCP may provide
better performance since its inherent flow control can minimize retransmits on the
network. Also, since the maximum UDP packet size is 64 KB (which includes the IP
header), the maximum NFS I/O size used with UDP is less than 64 KB (depending on
the size of the IP header and UDP header).
UDP is not available in NFS V4.


Instructor notes:
Purpose Explain to the students the TCP versus UDP transport layers trade-offs.
Details Explain that AIX can support the protocol requested by the client. Explain that,
in most cases they should allow the protocol to default to TCP, but in a very clean local LAN
environment where neither the client nor the server experiences congestion (discards and
delays), the UDP protocol can provide better performance due to less overhead.
Additional information
Transition statement Next, we will go over a diagram that shows the path a client
request takes when accessing an NFS file system.


NFS request path


[Diagram: On the client side, application system calls pass through the virtual file
system (vnode) layer to the NFS kernel extension; a daemon thread hands the request to
RPC, which sends it across the network. On the server side, an RPC daemon thread
receives the request and passes it to the NFS kernel extension, which goes through the
virtual file system (vnode) layer to the AIX file system and the disk.]

Figure 8-5. NFS request path


Notes:
Client side
NFS is just one of the possible file systems that can be accessed transparently on a
client machine. When an application thread issues a system call to access a file or
directory in an NFS file system on the client, the system call will go to the kernel's virtual
file system layer which will determine what type of file system it is. If it is NFS, then the
kernel calls the NFS kernel extension. The NFS kernel extension will, for certain
requests (for example, reads and writes) use a daemon thread (discussed later) to send
the request to the server. For other requests, the NFS kernel extension will send the
requests itself. Before sending the requests, External Data Representation (XDR) will
be used to convert the data to a common form in order to support platforms with
different data representations. (We will not cover the details of XDR in this course and it
is not a performance tuning component.) Once the data is in the XDR format, the
Remote Procedure Call (RPC) library routines are used to handle the actual network
communications between the client and the server. It is important to understand that it is


RPC which is handling the UDP or TCP socket calls. When a daemon thread is used,
that thread is dedicated to that request until it receives the reply from the server.

Server side
At the NFS server, the request is received by a daemon thread (discussed later) using
the RPC routines. XDR is used to convert the data to the local data representation and
the request is then handled by the NFS kernel extension. The NFS kernel extension
accesses the local file system to execute the requests such as: open, close, read, write,
get attributes, and so forth. The local file system access goes to the kernel's virtual file
system layer which determines the type of file system (for example, JFS or JFS2) and
then invokes the kernel extension for that file system type. Assuming a normal JFS or
JFS2 mount, the local file system kernel extension will use VMM mechanisms to handle
the files. The local file system I/O processing is normal with caching mechanisms
available. Writes may remain in memory until a write-behind, a sync, or a page steal,
flushes them to disk. Reads may be satisfied without physical disk I/O, if already cached
in memory. When a daemon thread is used, that thread is dedicated to that request until
it is completed and the reply is sent back to the client.


Instructor notes:
Purpose To explain the path taken by an NFS request from a client.
Details Explain the flow of requests through the layers in the diagram. Major points to
emphasize will be:
Transparency maintained by using the Virtual File System.
How the NFS kernel extension implements the distributed network processing.
That some requests require separate daemon threads to be processed while others are
implemented in the kernel extension.
That ultimately the network communication is handled by the RPC layer. That will
become significant when we look at the nfsstat reports later.
That the TCP/IP layers underneath RPC are an important component.
That the server side still has to handle the actual local file system I/O.
Since both the client and server sides of NFS utilize the file system VMM caching
mechanisms, having sufficient memory can be important.
Additional information RPC is a library of procedures. The procedures allow the client
process to direct the server process to run procedure calls as if the client process had run
the calls in its own address space. Because the client and the server are two separate
processes, they need not exist on the same physical system (although they can). NFS is
implemented as a set of RPC calls in which the server services certain types of calls made
by the client. The client makes such calls based on the file system operations that are done
by the client process. NFS, in this sense, is an RPC application. Because the server and
client processes can reside on two different physical systems which may have completely
different architectures, RPC must address the possibility that the two systems might not
represent data in the same way. For this reason, RPC uses data types defined by the
eXternal Data Representation (XDR) protocol.
Transition statement The main NFS operations are handled by daemon threads that
run on the NFS clients and the NFS server.


NFS performance related daemons


- Too few daemon threads could restrict performance
- nfsd (server only)
  # nfso -o nfs_max_threads (restricted tunable)
- rpc.biod (client only, threads per mount)
  # mount -o biods=## (default 4)
- rpc.mountd (server only, mainly automounter concern)
  rpc.mountd -h ## (default 16)
- rpc.lockd (client and server)
  rpc.lockd -a ## (default 33)
- Procedure for the rpc.mountd and rpc.lockd subsystems:
  # chssys -s <subsysname> -a <option>
  # stopsrc -s <subsysname>
  # startsrc -s <subsysname>

Figure 8-6. NFS performance related daemons


Notes:
Overview
Additional NFS daemon threads are not expensive when it comes to memory usage, so
it is normally okay to have many NFS daemon threads running. The rpc.mountd,
rpc.lockd, nfsd, and biod daemons are multi-threaded.

nfsd daemon
nfsd daemons are the active agents providing NFS services from the NFS server. The
receipt of any one NFS protocol request from a client requires the dedicated attention of
an nfsd daemon until that request is satisfied and the results of the request processing
are sent back to the client.


rpc.mountd daemon
rpc.mountd is a server daemon and an RPC service that answers a client request to mount a
server's exported file system or directory. The rpc.mountd daemon finds out which file
systems are available by reading the /etc/xtab file. The rpc.mountd daemon is not in
NFS V4 because the operation is moved into the main NFS V4 protocol.

rpc.lockd daemon
The rpc.lockd daemon handles the file locking requests for files in NFS file systems prior to
NFS V4. The rpc.lockd daemon is not in NFS V4 because the operation is moved into
the main NFS V4 protocol.

rpc.statd daemon
The rpc.statd coordinates file lock recovery if a system crashes. The rpc.statd
daemon is not in NFS V4 because the operation is moved into the main NFS V4
protocol.

portmap daemon
The portmap daemon is a network service daemon that provides clients with a standard
way of looking up a port number associated with a specific program.

nfsrgyd daemon
The nfsrgyd daemon is new in NFS V4. It provides a user name and group name
translation service for NFS servers and clients. This daemon must be running in order
to perform translations between NFS string attributes and UNIX numeric identities.

biod daemon
The biod daemon is the block input/output daemon. The biod is used on the client to
submit open/read/write/close requests from the client. It also performs read-ahead
and write-behind requests, as well as directory reads. The biod daemon threads
improve NFS performance by filling or emptying the buffer cache on behalf of the NFS
client applications. When a user on a client system wants to read from or write to a file
on a server, the biod threads send the requests to the server.
The following NFS operations are sent
directly to the server from the NFS client kernel extension and do not require the use of
biods: getattr, setattr, lookup, readlink, create, remove, rename, link, symlink,
mkdir, rmdir, readdir, and fsstat. The default number of biods is seven per V2
mount or four per V3 and V4 mounts and can be increased or decreased as necessary
for performance.

The nfsd and biod user level processes are not used by NFS V4. They have been
replaced by kernel processes called nfsd and kbiod. Use the -k flag of the ps
command to see kernel processes.

biods option
The biods option specifies the number of biod threads for the mount. The default is 4
per NFS V3 or NFS V4 mount and 7 per NFS V2 mount. The maximum value is 128
for each type of mount.

Tuning the network lock manager in NFS V2 and V3


On clients and servers where there is heavy file locking activity, the rpc.lockd daemon
may become a bottleneck. If so, it can be tuned so that there are more rpc.lockd
threads created. This is done by passing in the number of threads as an argument to
rpc.lockd. The chssys/stopsrc/startsrc commands can be used to change this
number. The NFS server should have enough rpc.lockd threads to handle all of its
clients (which means have more threads than what the client runs). The default value is
33 and the maximum value is 511.
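For example (100 threads is an arbitrary illustration, not a recommendation), the rpc.lockd thread count could be raised with:
# chssys -s rpc.lockd -a "-a 100"
# stopsrc -s rpc.lockd
# startsrc -s rpc.lockd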

File locking in NFS V4


There are significant changes in NFS V4 for file locking when compared to earlier NFS
versions. The RPC operations for file locking have been moved into the main NFS
protocol. The separate Network Lock Manager and status monitor protocols in earlier
NFS versions are eliminated along with the corresponding rpc.lockd and rpc.statd
daemons in NFS V4.

mountd
On servers that are handling large numbers of mount requests from clients, the
rpc.mountd may not be able to keep up. It could be that the clients are using
automount and the automount timeout is too low. In any case, the number of
rpc.mountd threads on the server can be increased by specifying the -h flag to
rpc.mountd. The new value goes into effect after rpc.mountd is stopped and restarted
using:
# stopsrc -s rpc.mountd
# startsrc -s rpc.mountd
The rpc.mountd daemon is not used in NFS V4; its function is part of the protocol.


Instructor notes:
Purpose To explain the daemon processes used in an NFS environment.
Details Explain the roles of each of the NFS daemons. Explain the differences between
NFS V3 and prior versions of NFS, but be sure to explain that NFS V3 is the default in AIX.
Explain that some will only be of concern in tuning if there is a high volume of mount
requests, while others can be involved in every request on an already-mounted file system.
Additional information The rpc.lockd, rpc.statd and rpc.mountd daemons are not
used by NFS V4. Their function is built into the NFS V4 protocol. Note however, that
although AIX (5L V5.3 and later) supports NFS V4, it also supports NFS V2 and V3, so the
NFS V3 daemons (rpc.lockd, rpc.mountd, and so forth) will still be present. Indeed the
default is that file systems will be exported and mounted using NFS V3 unless V4 (or V2) is
explicitly requested.
The question may come up of how to determine if there is heavy NFS lock usage. There
are multiple ways. One way is to use kdb to identify how many rpc.lockd threads are
running. Another way is to analyze a network trace and look for the RPC lock requests. We
will not teach either of these in this course. So, the best we could say is to experiment and
see if increasing the maximum number of threads improves the performance.
The students may ask about chssys command and the illustrated usage. System
administrators do not usually define subsystems to the SRC; that is normally done under
the covers by installation and configuration procedures. Briefly explain that we are
changing the way in which the SRC starts the subsystem. The first -a is a chssys
command option which says to replace the current SRC rpc.lockd subsystem definition
options and arguments with what we are supplying as a value to the -a option. The
options to pass to the subsystem follows in quotes. For example, in:
# chssys -s rpc.lockd -a -a 100
a following value of -a 100 would be passed as an option and value for the rpc.lockd
command.
Explain how the number of mount daemon threads can become a bottleneck in certain
environments. Cover the procedures for increasing the maximum number of threads. Place
the discussion in the proper context. We normally do not have a high rate of mount
requests in a short period of time. Ultimately, they will want to investigate where these
numerous mount requests are coming from. Point out a common source could be an
automounter environment. Automounter is covered in the prerequisite TCP/IP
implementation course. Avoid reteaching that topic here. The instructor needs to
understand it and to be prepared to discuss with students who are experiencing this
problem, how the client side automounter mechanism can be tuned so as to control
repeated and frequent mount and unmounts.
Additional information Once the topic of automounter comes up, the students may
ask about the load balancing capabilities of automounter. Again, we do not cover that topic
in this course, but the instructor should study up on automounter and be prepared to
discuss it with a student off-line.

The rpc.mountd -h option is not documented in the product documentation at the time
of this course revision.
Transition statement Let's look at what statistics are available at an NFS server.


nfsstat -s
# nfsstat -rs
Server rpc:
Connection oriented:
calls      badcalls   nullrecv   badlen     xdrcall    dupchecks  dupreqs
100256     0          0          0          0          29999      0
Connectionless:
calls      badcalls   nullrecv   badlen     xdrcall    dupchecks  dupreqs
721845     0          0          0          0          94082      0

# nfsstat -ns
Server nfs:
calls      badcalls   public_v2  public_v3
822019     2          0          0
<version 2 calls not shown>
Version 3: (613184 calls)
null       getattr    setattr    lookup     access     readlink   read
28 0%      50447 8%   4883 0%    23692 3%   17356 2%   0 0%       344852 56%
write      create     mkdir      symlink    mknod      remove     rmdir
111758 18% 3856 0%    191 0%     0 0%       0 0%       1297 0%    18 0%
rename     link       readdir    readdir+   fsstat     fsinfo     pathconf
349 0%     52 0%      629 0%     1650 0%    4241 0%    16 0%      8 0%
commit
47861 7%


Figure 8-7. nfsstat -s


Notes:
nfsstat command
By default, the nfsstat command prints out NFS client and server statistics and
statistics on NFS and remote procedure calls (RPC).
The flags for /usr/sbin/nfsstat are:
-c client information
-s server information
-n NFS information only
-r RPC information only
-z reset statistics (root only)
-m mount statistics
The default if no flags are given is: nfsstat -csnr.

Using the -s flag on an NFS client does not show server statistics for any machine other
than itself (an NFS client may also be an NFS server to another machine). The statistics
are cumulative statistics since system boot (or since the counters were reset to 0 with
nfsstat -z).
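As a sketch of a simple before/after workflow on the server:
# nfsstat -z     (reset the counters; root only)
  ... run or wait for the workload of interest ...
# nfsstat -s     (capture the after statistics)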

Too many nfsd threads?


If you think you need more nfsd threads on your server, and proceed to add some,
watch the nullrecv column in the nfsstat -rs output. If the number starts to grow, it
may mean you have too many nfsd threads. However, this is usually not the case on
AIX NFS servers as much as it could be the case on other platforms. The reason for
that is that all nfsd threads are not woken up at the same time when an NFS request
comes into the server. Instead, the first nfsd thread wakes up, and if there is more work
to do, this daemon will wake up the second nfsd thread, and so on. You can adjust the
maximum number of nfsd threads in the system by using the nfs_max_threads
parameter of the nfso command. The default is 3891, which is the maximum.
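The mechanics look like this (2000 is an arbitrary illustration; because nfs_max_threads is a restricted tunable on AIX 6.1, nfso asks for confirmation before changing it):
# nfso -o nfs_max_threads          (display the current value)
# nfso -o nfs_max_threads=2000     (set a new value)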

Duplicate checks
Duplicate checks are performed for non-idempotent operations (that is, those that can
not be performed twice with the same result). The classic example is rm. The first rm will
succeed, but if the reply is lost the client will re-transmit it. You want duplicate requests
like these to succeed, so the duplicate request cache is consulted, and if it is a duplicate
request the same (successful) result is returned on the duplicate request as was
generated on the initial request. Depending on whether the Connection oriented or
the Connectionless dupreq counters are increasing, the appropriate NFS duplicate
cache size may need to be increased (nfso -o nfs_tcp_duplicate_cache_size or
nfso -o nfs_udp_duplicate_cache_size). The default duplicate cache size is 1000
and the maximum is 10000. The type of requests that can be stored in the duplicate
cache are: setattr(), write(), create(), remove(), rename(), link(), symlink(),
mkdir(), rmdir().
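For example (5000 is an arbitrary illustration), the TCP duplicate cache could be grown with:
# nfso -o nfs_tcp_duplicate_cache_size=5000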

PerfPMR reports
The PerfPMR nfsstat.sh script runs all the nfsstat reports both before and after
the specified measurement period. The results can be found in the nfsstat.int file in the
PerfPMR output directory.
If you do not already have a separate baseline report, do not reset the nfsstat
statistics before running perfpmr.sh or nfsstat.sh. That way, the before reports
can be used as the baseline.


Instructor notes:
Purpose Explain the output of nfsstat -rs.
Details Explain that it is common to override the default action of giving all categories of
statistics. For example, there is usually no point in looking at client side statistics at a
server. Review the options in the student notes. Explain that there are flags that control
whether the stats are client side or server side. Explain that there are flags that control
whether we show NFS layer statistics or RPC layer statistics. Point out the ability to reset or
zero out the statistics to make before and after type comparisons easier.
Point out that the top report is RPC layer statistics. Ultimately, it is RPC that issues network
requests to the UDP or TCP socket.
It is not necessary to explain each and every field in the report. Other than the volume of
traffic (demand) displayed as the number of calls and whether it is UDP or TCP traffic,
there is not much here the students will use for performance management. It is unlikely that
we will see a high nullrecv count in AIX, due to the way the kernel manages the nfsd
threads. And if it is high, there is nothing the system administrator can do about it; it would
be an AIX kernel design issue. The section in the student notes on the size of the duplicate
request cache is really a functional problem. If the cache is not large enough, the client
application may receive misleading errors. The client-side NFS statistics are actually much
more useful for performance analysis.
Note that the visual does not show the NFS Version 2 statistics.
Additional information
Transition statement Let's now look at the NFS statistics reported by netpmon.


NFS statistics using netpmon -O nfs

# netpmon -O nfs -o netpmon.out; sleep 60; trcstop
# more netpmon.out
=====================================================================
NFS Server Statistics (by Client):
                       ------- Read -------  ------ Write ------  Other
Client                  Calls/s    Bytes/s    Calls/s    Bytes/s  Calls/s
---------------------------------------------------------------------
aixclient1                 0.48       2115       0.22        162     0.32
aixclient2                 0.28       1228       0.25        356     0.32
aixclient3                 0.30       1296       0.27        264     0.38
aixclient4                 0.28       1228       0.23        261     0.30
aixclient5                 0.22        887       0.33        216     0.53
aixclient6                 0.17        751       0.08        107     0.03
aixclient7                 0.02         68       0.05          5     0.02
---------------------------------------------------------------------
Total (all clients)        1.75       7574       1.43       1371     1.90
=====================================================================


Figure 8-8. NFS statistics using netpmon -O nfs


Notes:
This netpmon report shows the number of reads and writes and their rates that each
client is sending to the NFS server. It is requested by using the -O nfs option when
running netpmon.
The netpmon report also shows file activity on the server due to NFS mounts. Each row
describes the amount of NFS activity handled by this server on behalf of a particular
client. At the bottom of the report, call for all clients are totaled.
This information can be used to identify where the demand is coming from. This can in turn
be used to investigate the situation for a particular client or to load balance the demand
among alternative servers.


Instructor notes:
Purpose Explain the netpmon tool's NFS report.
Details Explain that the netpmon tool is a kernel trace based tool. Here we are using
the -O nfs option to request that it collect and analyze NFS activity during the monitored
period.
Discuss with the students what they might do if they saw most of the activity coming from
one or two clients out of 20 different clients.
Additional information
Transition statement There are some system wide nfso tunables that the system
administrator could still use, though most nfso tunables are now restricted. Let us take a
look at these.


Server tuning with nfso


Tuning options with nfso on an NFS server:
- nfs_server_base_priority: fixes the priority of the NFS daemon threads to this value
- nfs_server_clread: default value of 1 enables aggressive file read-ahead on the NFS server
- nfs_v3_server_readdirplus: default value of 1 enables NFS V3 readdirplus ability

Figure 8-9. Server tuning with nfso


Notes:
Overview
In addition to the nfs_max_threads, nfs_rfc1323, and the
nfs_tcp_duplicate_cache_size or nfs_udp_duplicate_cache_size parameters
already discussed, there are a few more that affect performance on the NFS server.

nfso command
The nfso command is used to configure NFS tuning parameters. The nfso command
sets or displays current or next boot values for NFS tuning parameters. This command
can also make permanent changes or defer changes until the next reboot. Whether the
command sets or displays a parameter is determined by the accompanying flag. The -o
flag performs both actions. It can either display the value of a parameter or set a new
value for a parameter.


Extreme care should be taken when using this command. If used incorrectly, the nfso
command can make your system inoperable.

nfs_server_base_priority
The nfs_server_base_priority parameter will fix the priority (scheduling policy is
round-robin) to the value specified. A value of 0 means that the default non-fixed or
floating priority (scheduling policy is SCHED_OTHER) is used.

nfs_server_clread
The nfs_server_clread option allows the NFS server to be very aggressive about the
reading of a file. This may be useful in cases where the client is reading sequentially but
the JFS/JFS2 read-ahead parameters are at default values. If value is 1 (default), then
aggressive read-ahead is done. If the value is 0, normal system default read-ahead
methods are used. Normal system read-ahead is controlled by VMM. The more
aggressive top-half JFS read-ahead is less susceptible to read-ahead breaking down
due to out-of-order requests (which are typical in the NFS server case). When the
mechanism is activated, it will read an entire cluster (128 KB, the LVM logical track
group size).

nfs_v3_server_readdirplus
The nfs_v3_server_readdirplus option enables the server to automatically return file
handle and file attribute information along with directory entries instead of the client
having to separately request this information. This greatly improves performance and is
enabled by default. There may be rare situations where almost all the server traffic is for
a client which does not need the additional information and turning the option off may
improve performance. It is recommended to leave this enabled.
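As a sketch of the mechanics (the priority value 60 is an arbitrary illustration, not a recommendation):
# nfso -o nfs_server_base_priority=60    (fix the nfsd thread priority)
# nfso -o nfs_server_base_priority=0     (restore the default floating priority)
# nfso -a | grep nfs_server              (review the current server settings)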

The following NFS options have either been discontinued or are restricted tunables:
nfs_device_specific_bufs (discontinued)
The nfs_device_specific_bufs parameter when set to 1 means that the NFS server
will use memory allocations from network devices if the network device supports such a
feature. Use of these special memory allocations by the NFS server can positively
affect the overall performance of the NFS server. The default of 1 means the NFS
server is allowed to use the special network device memory allocations. These are
buffers managed by a network interface that result in improved performance (over
regular mbufs) because no setup for DMA is required on these. Two adapters that
support this include the Micro Channel ATM adapter and the SP2 switch adapter. If the
value of 0 is used, the NFS server will use the traditional memory allocations for its
processing of NFS client requests.

nfs_max_connections (discontinued)
The nfs_max_connections parameter value can be used to limit the number of NFS
clients that use TCP mounts. The default value of 0 means that no limit is enforced.
Tuning this can be used to reduce the load on the NFS server by denying access to
some clients.


nfs_socketsize (restricted)
The nfs_socketsize parameter, when tuned on the NFS server, will specify the size
of the UDP send/receive socket buffers used for each UDP request. The default is
60000 bytes.

nfs_tcp_socketsize (restricted)
The nfs_tcp_socketsize parameter when tuned on the client machine will specify the
size of the TCP send/receive socket buffer used for each NFS server connection. The
default is 60000 bytes.

nfs_iopace_pages (restricted)
The nfs_iopace_pages specifies the number of pages that will be sent to a file on the
NFS server before the application issuing I/Os to that file is put to sleep. The default
value is 0 which means that 1/8th of the file will be sent. The VMM I/O pacing can also
be used. But, in the case of NFS flushes from shared memory, the writes may bypass
the VMM I/O pacing code. This parameter can also be used to keep one process from
using up all of the bufstructs from a paging device table (PDT).

nfs_dynamic_retrans (restricted)
The nfs_dynamic_retrans value if set to 1 (the default) will adjust the timeout value
dynamically after each retransmit. This value can be doubled each time but may vary

based on a feedback mechanism. The maximum timeout value is 20 seconds. If setting


the timeo mount option, then the timeo value is used for each retransmit if
nfs_dynamic_retrans=0. Otherwise, timeo is only used for the initial retransmit.

Tuning bufstruct pools (restricted)


The number of pools is tunable with the nfso command (nfs_v2_pdts, nfs_v3_pdts or
nfs_v4_pdts). The number of bufstructs in each pool is also tunable. It can go up to a
maximum value of 5000 per pool. This is tunable also with nfso (nfs_v2_vm_bufs,
nfs_v3_vm_bufs or nfs_v4_vm_bufs). When tuning the vm_bufs values, make sure this
is set before the pdts value is set. Both the pdts and vm_bufs values must be set
before the NFS file systems are mounted.
When issuing reads/writes to files in an NFS mounted file system, each NFS I/O uses a
bufstruct that is obtained from a pre-allocated pool. By default, there is one pool that is
created for each NFS version and all NFS mount points of the same version share that
pool. This pool can differ in size depending on the amount of RAM on the client machine
(though for most clients, the pool will have 1000 bufstructs).
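To illustrate the required ordering only (these are restricted tunables, and the values are arbitrary), set the vm_bufs value first, then the pdts value, both before the NFS file systems are mounted:
# nfso -o nfs_v3_vm_bufs=2000
# nfso -o nfs_v3_pdts=2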

nfs_auto_rbr_trigger (restricted)
The nfs_auto_rbr_trigger nfso option can be used to specify the number of
megabytes to initially cache in memory. File contents sequentially read after this
threshold will be released after the application has read the data. This option is
defaulted to 0, which currently has AIX perform no release-behind-on-read processing.
By coding the option you can enable release-behind-on-read and determine how much
initial data to cache.


Instructor notes:
Purpose Discuss the other nfso options that can be tuned on the server that affect
performance.
Details You generally want to leave nfs_server_clread and nfs_v3_server_readdirplus
enabled. Note that these are enabled by default and there are rare situations were they
would disable them. There are some situations where readdirplus can interact with a
limited attributes cache to prematurely flush out existing cache entries. Be very careful
about setting a fixed priority for the nfsd threads. If the NFS server is getting high traffic
rates, you could end up preventing other important threads from running. If using fixed
priorities, it is recommended that the SCHED_OTHER base priority of 40 be the very
lowest that you set it to.
Additional information
Transition statement Let's switch from NFS server statistics and tunables to the client
side statistics and mount options.


nfsstat -c
# nfsstat -rc
Client rpc:
Connection oriented
calls      badcalls   badxids    timeouts   newcreds   badverfs   timers
1          0          0          0          0          0          0
nomem      cantconn   interrupts
0          0          0
Connectionless
calls      badcalls   retrans    badxids    timeouts   newcreds   badverfs
1448       0          12         0          12         0          0
timers     nomem      cantsend
22         0          0

# nfsstat -nc
Client nfs:
calls      badcalls   clgets     cltoomany
1437       0          0          0
<nfsv2 calls not shown>
Version 3: (1 calls)
null       getattr    setattr    lookup     access     readlink   read
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
write      create     mkdir      symlink    mknod      remove     rmdir
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
rename     link       readdir    readdir+   fsstat     fsinfo     pathconf
0 0%       0 0%       0 0%       0 0%       0 0%       1 100%     0 0%
commit
0 0%


Figure 8-10. nfsstat -c


Notes:
Connection oriented versus connectionless
The report has two sections. The connection oriented section is for TCP mounts. The
connectionless section is for UDP mounts. The statistics are the total for all mounts of a
given type.

Dropped packets
For performance monitoring, nfsstat -rc will give information on whether the network
is dropping packets. A network may drop a packet if it cannot handle it. Dropped
packets may be the result of the response time of the network hardware or software, or
an overloaded CPU on the server. A dropped packet is not actually lost, as the request
is retransmitted (normally successfully). Packets are rarely dropped on the client.
Usually, packets are dropped on either the network or on the server.


Retransmissions and timeouts


If using UDP mounts, the retrans column in the rpc section displays the number of
times requests were retransmitted due to a timeout in waiting for a response. This is
either related to dropped packets or a delayed reply from the server. If the retrans
number consistently increases, then it indicates a problem with the server or network
keeping up with demand. Use vmstat, netpmon, and iostat on the server machine to
check the load.
With TCP mounts, retransmissions are handled by the transport layer.
With soft mounts, when the major timeout period has expired and there has been no
reply from the server, the application is informed that the request has failed and the
timeouts counter is incremented. For UDP, the major timeout period is after RPC has
reached the retransmit limit.

Delayed server replies and excessive retransmissions


Generally the analysis of the following statistics is mainly a concern for UDP mounts,
where the RPC protocol needs to handle error detection and retransmission.
A high badxid count implies that requests are reaching the various NFS servers, but
the servers are too loaded to send replies before the local host's RPC calls time out and
are retransmitted. The badxid value is incremented each time a duplicate reply is
received for a transmitted request (an RPC request retains its XID through all
transmission cycles). Excessive retransmissions place an additional strain on the
network or server, further degrading response time.
If the server is CPU-bound, it will affect NFS and its daemons. To improve the situation,
the server must be tuned or upgraded, or the user can localize the application's files. If
the server is I/O-bound, the server file systems can be reorganized, or localized files
can be used.
If the server does not appear overloaded and the badxid column in nfsstat is much
lower than the timeout column, then there may be network hardware problems. A
network analyzer can help pinpoint this.

Other report fields


The timers field shows the number of times the calculated time-out value was greater
than or equal to the minimum specified timed-out value for a call.
The cantconn field shows the number of times the call failed due to a failure to make a
connection to the server.
The nomem field shows the number of times the calls failed due to a failure to allocate
memory.
The interrupts field shows the number of times the call was interrupted by a signal
before completing.

The output listed using nfsstat -nc on the client machine can be used to determine
what type of requests are being made. If there are a high number of getattr requests,
attribute cache tuning may help. If there are a high number of read requests, then more
memory may help. If there are a high number of commits, then tuning commit-behind
may help.
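As a hedged example of attribute cache tuning (hypothetical names; the actimeo mount option sets all four attribute cache timeouts to the given number of seconds):
# mount -o actimeo=60 nfssrv:/export/data /data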

Additional NFS V4 output


When you run nfsstat -4nc on an AIX 5L V5.3 system, the following additional
information is displayed:
Version 4: (0 calls)
null       getattr    setattr    lookup     access     readlink   read
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
write      create     mkdir      symlink    mknod      remove     rmdir
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
rename     link       readdir    statfs     finfo      commit     open
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
confirm    downgrade  close      lock       locku      lockt      setclid
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
renew      clid_cfm   secinfo    release_lo replicate
0 0%       0 0%       0 0%       0 0%       0 0%

Instructor notes:
Purpose Explain the use of nfsstat -rc.
Details Explain how to use the report. Be very clear on the difference between the UDP
and TCP statistics. Note that only UDP mounts have RPC manage the retransmissions.
Emphasize the ability in UDP to distinguish between discarded packets versus delayed
RPC acknowledgements (high badxid count) causing retransmissions. Delayed
acknowledgements are usually caused by an overloaded server, though there could be a
network problem.
Additional information
Transition statement Let's now look at the client-side nfsstat -m report.


nfsstat -m
# nfsstat -m
/nfs/retain/pmrs from /nfs/retain/pmrs:cia
 Flags: vers=2,proto=udp,auth=unix,soft,intr,dynamic,rsize=8192,wsize=8192,retrans=5
 Lookups: srtt=2 (5ms), dev=2 (10ms), cur=1 (20ms)
 All:     srtt=2 (5ms), dev=2400 (12000ms), cur=600 (12000ms)
/nfs/retain/bin from /nfs/retain/bin:cia
 Flags: vers=2,proto=udp,auth=unix,hard,intr,dynamic,rsize=8192,wsize=8192,retrans=5
 Lookups: srtt=14 (35ms), dev=6 (30ms), cur=4 (80ms)
 Reads:   srtt=18 (45ms), dev=5 (25ms), cur=4 (80ms)
 All:     srtt=15 (37ms), dev=7 (35ms), cur=5 (100ms)
/nfs/cust from /nfs/cust:l3perf
 Flags: vers=3,proto=tcp,auth=unix,soft,intr,link,symlink,rsize=32768,wsize=32768,retrans=5
 All:   srtt=0 (0ms), dev=2400 (12000ms), cur=600 (12000ms)


Figure 8-11. nfsstat -m


Notes:
Overview
The nfsstat -m option shows statistics for each NFS mount on the client. First it shows
the NFS version, protocol, and mount options. In addition, it provides (for UDP mounts
only) the smoothed round trip times and the current NFS RPC timeout value:
- srtt is the smoothed round-trip time
- dev is the estimated deviation
- cur is the current timeout value
RPC uses an exponential back-off for the time-out. A large current timeout indicates
slow RPC acknowledgements.
The numbers in parentheses are the actual times in milliseconds. The other values are
un-scaled values kept by the operating system kernel. You can ignore the un-scaled
values. Response times are shown for lookups, reads, writes and a combination of all
of these operations (all).

Instructor notes:
Purpose Explain the use of the output from nfsstat -m.
Details Emphasize that this report allows us to identify which mounts a client has the
highest activity on. Again this can help us narrow down our investigation. Also point out that
it also provides us with the actual protocol options in use by these mounts and current
performance measurements statistics.
As with many performance measurements, the meaning of these will depend on having a
baseline to which we can compare.
Additional information
Transition statement Let's discuss commit-behind tuning.


Client commit-behind tuning


A high number of commits could be due to VMM page replacement:
- When NFS client pages in a smudged state are stolen, they will be committed to the NFS server
- Page-by-page commits are inefficient and can overload the server

The combehind mount option enables commit-behind:
- After numclust clusters of pages are modified and a cluster boundary is crossed, smudged pages in previously modified clusters are committed to the server using a single commit
- The default numclust value is 128 clusters (each cluster has 4 pages)
- A smaller numclust mount option value will make the commits more aggressive (use when page-by-page commits are still high)

Figure 8-12. Client commit-behind tuning


Notes:
Overview
With NFS V2, for each write, the page is synched to disk on the server. Therefore, it is
considered committed once the write is completed.
With NFS V3 and V4, a write may just go to the memory cache on the NFS server and
return. In this case, the write is considered as being in a smudged state on the NFS
client. If the file is synched, then a commit occurs and all dirty pages are flushed to disk.
At this point, the pages are no longer in smudged state. If page-replacement occurs on
the client and a smudged page is stolen, then the VMM drives an NFS commit for this
page. This page-replacement activity can cause a high rate of commits to the NFS
server as there can be a commit per page.


Enabling commit-behind
To increase the NFS client and server performance in this case, the combehind mount
option can be used to enable commit-behind. Commit-behind uses the NFS numclust
value (defaults to 128 and can be overridden with the numclust mount option) to
determine when to send commits. An NFS cluster contains 4 pages. After 4*128 pages
by default, if another page is modified, then these pages are committed with a single
commit call. If page replacement is running faster than the commit-behind algorithm
(commits continue to increase in nfsstat -c), then make commit-behind more
aggressive by reducing the numclust value (a suggested value is 32, 64 or 128).
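
As an illustration only (the server name, export, and mount point are placeholders, and numclust=64 is just an example value), commit-behind might be enabled on a mount like this:

# Enable commit-behind with a more aggressive numclust value
mount -o combehind,numclust=64 server1:/export/data /mnt/data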

VMM read cache impact


A side effect of enabling commit-behind is that VMM caching is effectively disabled.
This could have a negative impact when doing large sequential reads on the same NFS
mount.


Instructor notes:
Purpose Explain NFS client commit-behind/numclust tuning.
Details Be sure to place this in context of the memory management situation on the
client. If page stealing on the client results in individual NFS file pages in a smudged
state driving commits at the server, this can be very inefficient. You may want to relate
back to the similar discussion in JFS tuning. Ask them what the trade-off is between a
large numclust value versus a smaller numclust value. Again, this is similar to
write-behind tuning in JFS.
Additional information These mount options are not documented in the Commands
Reference manual as of this writing.
Transition statement Now, let's look at attribute cache tuning.


Client attribute cache tuning


If the NFS client is looking up file attributes at a high rate AND the attributes don't
change often, then tuning the attribute cache values may increase performance.

File attribute cache mount options:
- actimeo
- acregmin
- acregmax
- acdirmin
- acdirmax
- noac

Figure 8-13. Client attribute cache tuning


Notes:
File attribute cache tunables
NFS maintains a cache on each client of the attributes of recently accessed directories
and files. Five parameters, beginning with ac, can be set to control how long an entry is
kept in cache. These are all mount options and can be set in /etc/filesystems or
specified at the mount command line. The mount options are:
- actimeo is the absolute time for which file and directory entries are kept in the file
attribute cache after an update. If specified, this value overrides the following *min
and *max values, effectively setting them all to the actimeo value.
- acregmin is the minimum time after an update that file entries will be retained. The
default is 3 seconds.
- acregmax is the maximum time after an update that file entries will be retained. The
default is 60 seconds.

8-42 AIX Performance Management


Course materials may not be reproduced in whole or in part
without the prior written permission of IBM.

Copyright IBM Corp. 2010

V5.4
Instructor Guide

Uempty

- acdirmin is the minimum time after an update that directory entries will be retained.
The default is 30 seconds.
- acdirmax is the maximum time after an update that directory entries will be retained.
The default is 60 seconds.
- noac specifies that this mount performs no attribute or directory caching.
Each time the file or directory is updated, its removal is postponed for at least acregmin
or acdirmin seconds. If this is the second or subsequent update, the entry is kept at
least as long as the interval between the last two updates, but not more than acregmax
or acdirmax seconds.
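
For illustration (the host, paths, and timeout values are placeholders; appropriate values depend on how often the file system content changes), the attribute cache lifetimes might be lengthened for a rarely modified file system:

# Lengthen the attribute cache lifetimes for a read-mostly file system
mount -o acregmin=10,acregmax=120,acdirmin=60,acdirmax=120 \
   server1:/export/docs /mnt/docs
# Or set one absolute timeout for all attribute cache entries
mount -o actimeo=120 server1:/export/docs /mnt/docs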


Instructor notes:
Purpose Explain the attribute cache tuning capabilities on an NFS client.
Details Point out that if a mounted filesystem is very stable (contents are rarely
modified during NFS client access), then these values can be made larger. If the filesystem
is frequently modified, then a shorter time-out period might be appropriate even though this
may reduce performance. It also depends on the tolerance of the application for out-of-date
attributes. If the filesystem is modified frequently and the application has no tolerance for
stale attributes, then you might even consider disabling attribute caching.
Ask the students why caching of file attributes improves performance. Relate this back to
the nfsstat -nc report and the getattr request count. Point out that actimeo overrides
the other attribute cache related mount options. Clearly delineate the regular versus
directory file options and explain how the min and max values are used.
Additional information
Transition statement Next are more NFS mount options that can be set on an NFS
client.


NFS I/O pacing, release-behind, and DIO


Pace NFS reads and writes on open files:
- minpout and maxpout mount options
- Suspends I/O until the number of outstanding pageouts is low

Release-behind conserves NFS client memory:
- For reads only
- Similar to JFS release-behind tuning
- rbr mount option
- Next page read triggers release of the previous cluster
- Cluster size = numclust * MAX(wsize,rsize)

Direct I/O and Concurrent I/O are available as NFS mount options: dio and cio
- The application needs to be properly designed

Figure 8-14. NFS I/O pacing, release-behind, and DIO


Notes:
Avoiding shortages of resources
A sudden large increase in NFS requests can exhaust the number of bufstructs on the
client and strain network or server resources. One common cause is the flushing of
unwritten file pages when an application closes (typically with an fsync) or when the
syncd daemon runs. Another common cause is a single application which is writing out
a large file, which could hog these resources and thus affect other applications.

Pacing the flushing of cached file writes


The nfs_iopace_pages nfso option specifies the maximum number of dirty pages
that can be written to the server at one time. The default value is 0, which indicates that
the kernel dynamically adjusts the maximum depending upon the write sizes (has a
starting value of 32 pages). Coding a non-zero value for this option allows the
administrator to force a particular maximum to be used.
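
A sketch of such a change (the value 64 is purely illustrative):

# Force a fixed maximum number of dirty pages written to the server at
# one time; 0 (the default) lets the kernel adjust the limit dynamically
nfso -o nfs_iopace_pages=64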


Pacing application I/O


The maxpout and minpout mount options control the outstanding pageouts
thresholds at which additional I/O to the NFS file system will be suspended and when it
will be resumed. These options for NFS mounts were introduced with AIX 5L V5.3
ML03.
When the outstanding number of pageouts reaches the maxpout value, I/Os to that file
system are blocked until the number of outstanding pageouts has been reduced to the minpout
value. This helps prevent a single application from dominating the I/O, but also tends to
smooth out the request load, thus avoiding a transient shortage of bufstructs.
By default, if not coded on the mount, AIX will still use this pacing mechanism, but with
kernel determined values for maxpout and minpout.
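
A hypothetical mount invocation (the host, paths, and threshold values are placeholders):

# Apply I/O pacing to one NFS mount: block new I/O when 33 pageouts
# are outstanding and resume it when the count falls to 16
mount -o maxpout=33,minpout=16 server1:/export/data /mnt/data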

Conserving NFS client memory


If there is very little likelihood that the application sequentially reading a file (or any
other application on this client) will re-read what has been cached in memory, then that
data is unnecessarily competing with other uses of that memory. Starting with AIX 5L
V5.3 ML03, NFS has the ability (as JFS has had for some time) to free up this
memory after the application has read the data.

Global automatic release behind on read


The nfs_auto_rbr_trigger nfso option can be used to specify the number of
megabytes to initially cache in memory. File contents sequentially read after this
threshold will be released after the application has read the data. This option is
defaulted to 0, which currently has AIX perform no release-behind-on-read processing.
By coding the option you can enable release-behind-on-read and determine how much
initial data to cache.
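
For example (the 64 MB trigger is illustrative, not a recommendation):

# Enable global release-behind-on-read after the first 64 MB of a
# sequentially read file have been cached
nfso -o nfs_auto_rbr_trigger=64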

Using the rbr mount option


You can enable release-behind-on-read for individual mounts by coding the rbr mount
option. This option will cause AIX to release previously read pages when the next page
is sequentially read beyond the current cluster. The size of the cluster is determined by
the kernel, based upon the current value for numclust and the read or write sizes that
are specified for the mount.
The mount option overrides the global automatic mechanism.

Using the dio and cio mount options


Direct I/O for NFS has the same consideration as discussed in the filesystem unit.
Concurrent I/O is DIO with file locking disabled (the application has to handle the data
locking instead).
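
Hypothetical mount examples for these options (hosts and paths are placeholders):

# Release-behind-on-read for one mount (overrides the global mechanism)
mount -o rbr server1:/export/bigfiles /mnt/bigfiles
# Direct I/O, for applications designed to manage their own caching
mount -o dio server1:/export/db /mnt/db
# Concurrent I/O, for applications that also handle their own locking
mount -o cio server1:/export/db2 /mnt/db2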

Instructor notes:
Purpose Explain the NFS I/O pacing capabilities.
Details Set the context for this discussion. Part of this is pacing the demand on the
network and on the NFS server. A short burst can result in packet discards. Another impact
might be that other NFS clients to the same server see uneven performance. Also, point
out that many bufstruct shortages are caused by short bursts of activity, often from a single
application. By pacing the issuance of NFS requests we can smooth this out. Keeping the
queue of requests waiting on bufstructs can provide a fairer environment when many
applications on the same client are trying to use NFS. Pacing one clients demand on the
network and on the server should improve performance for the other clients.
Set the context in terms of a client that has filled memory with these cached file pages
when that application does not expect to re-read that data.
Additional information Just in case someone challenges the description of the default
behavior: Technically, the default threshold value of 0 means that the kernel will
dynamically determine the threshold. The current implementation of the method is to not do
release-behind-on-read at all. In the future, if this default behavior changes to a method for
dynamically determining the threshold for doing release-behind-on-read, then the system
administrators would code a -1 value if they wish to disable the mechanism.
Additional information
Transition statement Let's review what we have covered with some checkpoint
questions.


Checkpoint (1 of 2)
1. True / False A large number of concurrent NFS clients
can overload an NFS server
2. The ________ daemons are the block input/output
daemons and are required in order to perform remote
I/O requests at an NFS client.
3. On clients and servers where there is heavy file locking
activity, the _________ daemon may become a
bottleneck (for NFS V2 and NFS V3).


Figure 8-15. Checkpoint (1 of 2)


Notes:


Instructor notes:
Purpose To test the students to see if they were paying attention.
Details A suggested approach is to give the students about five minutes to answer the
questions on this page. Then, go over the questions and answers with the class.

Checkpoint solutions (1 of 2)
1. True / False A large number of concurrent NFS clients
can overload an NFS server
2. The biod daemons are the block input/output daemons
and are required in order to perform remote I/O
requests at an NFS client.
3. On clients and servers where there is heavy file locking
activity, the rpc.lockd daemon may become a
bottleneck (for NFS V2 and NFS V3).


Additional information
Transition statement Let's look at the remaining checkpoint questions.


Checkpoint (2 of 2)
4. The __________ command can be used to look at per
mount statistics at the NFS client
5. The _________ utility can identify which NFS clients
present the greatest workload at the NFS server.
6. If the NFS client has overcommitted memory, the
___________ mount option can be used to improve NFS
I/O efficiency and the _______ mount option can be
used to release file cache memory once the application
receives the data.


Figure 8-16. Checkpoint (2 of 2)


Notes:


Instructor notes:
Purpose To test the students to see if they were paying attention.
Details A suggested approach is to give the students about five minutes to answer the
questions on this page. Then, go over the questions and answers with the class.

Checkpoint solutions (2 of 2)
4. The nfsstat -m command can be used to look at per
mount statistics at the NFS client
5. The netpmon utility can identify which NFS clients
present the greatest workload at the NFS server.
6. If the NFS client has overcommitted memory, the
combehind mount option can be used to improve NFS
I/O efficiency and the rbr mount option can be used to
release file cache memory once the application
receives the data.


Additional information
Transition statement It's time for the exercise.


Exercise 8: NFS performance tuning

Examine nfsstat and netpmon reports
View the performance differences between NFS versions

Figure 8-17. Exercise 8: NFS performance tuning


Notes:


Instructor notes:
Purpose Provide a transition to the lab exercise.
Details Have the students open their exercise lab guides to the NFS exercise and give
them an introduction to what they will be doing in lab.
Additional information
Transition statement After the lab, summarize what they have learned.


Unit summary
This unit covered:
The basic Network File Systems (NFS) tuning
concepts
Differences between NFS V2, V3 and V4
Using nfsstat and netpmon to monitor NFS
Using nfso and mount options to tune NFS


Figure 8-18. Unit summary


Notes:


Instructor notes:
Purpose Summarize what students have learned in this unit.
Details
Additional information
Transition statement This completes the NFS performance unit. Let's move on to
the next unit, on performance management methodology.



Unit 9. Performance management methodology


Estimated time
01:30 (0:30 Unit; 1:00 Exercise)

What this unit is about


This unit reviews performance monitoring methodology and
summarizes the tools and procedures covered in this course. The
emphasis is on finding bottlenecks using standard AIX monitoring
tools.

What you should be able to do


After completing this unit, you should be able to:
List the steps to approach performance analysis
Describe the distinct areas of performance that need to be
investigated and how to go about monitoring those areas
Use tools that will aid with performance monitoring and tuning on
partitioned systems

How you will check your progress


Accountability:
Checkpoint
Machine exercises

References
AIX Version 6.1 Performance Management
AIX Version 6.1 Performance Tools Guide and Reference
AIX Version 6.1 Commands Reference, Volumes 1-6
SG24-6478 AIX 5L Practical Performance Tools and Tuning Guide (Redbook)
SG24-6184 IBM eServer Certification Study - AIX 5L Performance and System Tuning (Redbook)


Unit objectives
After completing this unit, you should be able to:
List the steps to a methodical approach to performance
analysis
Describe the distinct areas of performance that need to be
investigated and how to go about monitoring those areas
Use tools that will aid with performance monitoring and
tuning on partitioned systems


Figure 9-1. Unit objectives


Notes:


Instructor notes:
Purpose Go over the objectives for this unit.
Details Explain at a high level what will be covered and what the students should be
able to do by the end of the unit.
This unit starts with specifics on performance management and then expands on the
Performance Analysis Flowchart used throughout this course by covering each subsystem
in more detail. Use the subsystem flowcharts section of this unit to review key tuning
decisions and actions that were covered earlier in this class.
Set expectations that this unit is to review the big picture of performance analysis for each
subsystem, while reminding them of the most important steps and tools. Students should
refer to the appropriate units in this course for the detail.
Additional information
Transition statement Let's start by reviewing the factors that can affect performance.


Factors that can affect performance


Detecting the bottleneck(s) within a server system depends on a range of factors such as:
- Configuration of the server hardware
- Software application(s) workload
- Configuration parameters of the operating system
- Network configuration and topology

[Graphic: system throughput constrained by a series of bottlenecks]

Figure 9-2. Factors that can affect performance


Notes:
Introduction
Technological improvements in microprocessors, disks, and networking equipment
have dramatically changed the look of server computing. While those improvements
have more often than not reduced the incidence of performance problems, they have
also increased the capabilities of systems such that more complex problems need to be
solved. Thus, performance tuning has tended to change in nature from simple hardware
and software bottleneck analysis toward evaluation of more complex interactions.

Important factors that can affect performance


As server performance is distributed throughout each server component and type of
resource, it is essential to identify the most important factors or bottlenecks that will
affect the performance for a particular activity. Detecting the bottleneck within a server
system depends on a range of factors such as:
- Configuration of the server hardware
- Software application(s) workload
- Configuration parameters of the operating system
- Network configuration and topology

File servers need fast network adapters and fast disk subsystems. In contrast, database
server environments typically produce high processor and disk utilization, requiring fast
processors or multiple processors and fast disk subsystems. Both file and database
servers require large amounts of memory for caching by the operating system or the
application.

Bottlenecks
A bottleneck is a term used to describe a particular performance issue which is throttling
the throughput of the system. It could be in any of the subsystems: CPU, memory, or I/O
including network I/O. The graphic in the visual above illustrates that there may be
several performance bottlenecks on a system and some may not be discovered until
other, more constraining, bottlenecks are discovered and solved.


Instructor notes:
Purpose Describe the factors that affect performance. Define the term bottleneck.
Details Give examples for each of the factors such as:
Configuration of hardware: There are rules for adapter placement. Too many of certain
types of adapters on a bus can affect performance. See the adapter placement
documentation. In addition, sometime there are trade-offs between scalability and
performance. These are business decisions that must be made. An example is the RIO2
drawers on POWER4- and POWER5-processor based systems. They can be installed in
multiple ways depending if maximum scalability or maximum performance is desired.
Software application workload: There may be bottlenecks due to the application itself. Can
it be made to take advantage of more CPUs? Are there enough daemons to handle client
requests? Are log files and other heavily used files distributed in the disk subsystem?
We've seen examples of operating system parameters (such as virtual memory options
tuned with schedo) and potential network configuration issues (such as network adapter
speed mismatches) throughout this course.
Mention that bottlenecks can mask other bottlenecks and solving a bottleneck could cause
an additional bottleneck. For example, if there's a CPU bottleneck which is solved, this
might create more workload, causing a memory or disk I/O performance issue.
Additional information
Transition statement An important step in solving performance issues is determining
the type of problem. Lets look at some important questions to ask.


Determine type of problem


Determine the type of problem:
- Is it a functional problem or purely a performance problem?
- Is it a trend or a sudden issue?
- Is the problem only at certain times (of the day, week, and so forth)?

You'll need to know baseline performance statistics and what your performance goals are:
- Use AIX tools
- Use PerfPMR
- Document statistics regularly to spot trends for capacity planning
- Document statistics during high workloads

Figure 9-3. Determine type of problem


Notes:
Types of problems
In addition to the questions in the visual above to discover the type of problem, you can
ask questions such as: Is this problem new or could it have always been there? What
has changed since you've noticed the problem? Is it a functional problem that is causing
the performance problem?
Determining the type of problem (functional or performance) is half the battle in
determining how to solve the issue. A functional problem is typically more
straightforward to solve; you find one or more things to fix. With a performance problem
you need to theorize what could help, try it, and see if it causes the performance to be
better or worse.


Creating a baseline
Because you need something to compare current statistics to, you need to have
baseline statistics documented. You may have several baselines documented
depending on the cyclical nature of the workload. For example, a separate baseline
may be need for the end-of-month batch processing workload versus the rest of the
month.


Instructor notes:
Purpose List the types of performance related problems. Emphasize the importance of
documenting baseline statistics.
Details The visual lists a few questions to determine the type of problem. You might ask
the students what other questions would be good to ask to determine the type of problem.
As stated throughout this course, you need a baseline to compare current statistics to;
otherwise you will not know if today's performance is better or worse.
And of course, you need to have a performance goal or set of goals so that you can tell if
you're meeting those goals. You may be meeting the goal; however, by comparing
current results with the baseline you may spot a troubling trend.
Additional information
Transition statement Typically, there are trade-offs to be made.


Trade-offs and performance approach


Trade-offs must be considered, such as:
- Cost versus performance
- Conflicting performance requirements
- Speed versus functionality

Performance may be improved using a methodical approach:
1. Understanding the factors which can affect performance
2. Measuring the current performance of the server
3. Identifying a performance bottleneck
4. Changing the component which is causing the bottleneck
5. Measuring the new performance of the server to check for improvement

Figure 9-4. Trade-offs and performance approach


Notes:
Trade-offs
There are many trade-offs related to performance tuning that should be considered.
The key is to ensure there is a balance between them.
The trade-offs are:
- Cost versus performance
In some situations, the only way to improve performance is by using more or faster
hardware. But ask the question: Does the additional cost result in a proportional
increase in performance?
- Conflicting performance requirements
If there is more than one application running simultaneously, there may be
conflicting performance requirements.


- Speed versus functionality


Resources may be increased to improve a particular area, but serve as an overall
detriment to the system. Also, you may need to make choices when configuring your
system for speed versus maximum scalability.

Methodical approach
Using a methodical approach, you can obtain improved server performance. For
example:
- Understanding the factors which can affect server performance, for the specific
server functional requirements and for the characteristics of the particular system
- Measuring the current performance of the server
- Identifying a performance bottleneck
- Upgrading/tuning the component which is causing the bottleneck
- Measuring the new performance of the server to check for improvement


Instructor notes:
Purpose Discuss the trade-offs and an approach to performance analysis.
Details Discuss the trade-off decisions that you may need to make. One example is
configuring the remote I/O (RIO) drawers on POWER4- and POWER5-processor based
systems. For maximum performance, the drawers are configured differently (dual loop
cabling) than if they are configured for maximum scalability (single loop cabling).
Review the approach that you tune one thing at a time, then monitor again. Emphasize that
one thing could mean several actual tuning actions since some tuning parameters must be
changed in conjunction with other tuning parameters.
Additional information
Transition statement Let's look at a general flowchart to analyze performance.


Performance analysis flowchart


[Flowchart: Monitor system performance and check against requirements. If there is a
performance problem, test each subsystem in turn (CPU bound? Memory bound? I/O
bound? Network bound?) and take the corresponding actions; if no subsystem is bound,
run additional tests. After taking actions, check whether performance meets the stated
goals; if not, iterate; once the goals are met, return to normal operations.]

Figure 9-5. Performance analysis flowchart


Notes:
Introduction
This is a flowchart that some performance analysts use. Keep in mind it is an iterative
process. The rest of this unit will look at the subsystems in more detail.


Instructor notes:
Purpose Use this flowchart to provide an overview of performance analysis.
Details Remember this flowchart? We'll look at the pieces in more detail through the
rest of this unit.
Point out that continuous monitoring should be done, as well as responding to customer
complaints.
Additional information
Transition statement Is the system CPU bound?


CPU performance flowchart


[Flowchart: Monitor CPU usage and compare with goals (vmstat, sar, topas, time). If
CPU usage is not high, determine whether the CPU is supposed to be idle; if it is not,
determine the cause of the idle time by tracing, and check the memory and disk
subsystems. If CPU usage is high, locate the dominant process(es) (ps, tprof, topas)
and decide whether the process behavior is normal. If it is not normal, fix or tune the
application or OS, or kill abnormal processes (kill). If it is normal, tune the
applications or the operating system (nice/renice, bindprocessor, smtctl, schedo,
or make the application multi-threaded).]

Figure 9-6. CPU performance flowchart


Notes:
CPU bound system
A CPU bound system means that all the processors are nearing 100% busy, with
processes which want to run but cannot (or cannot as quickly), causing you not to meet
your performance goals.

Locate dominant processes


In order to understand why a system is CPU bound, you have to determine which
processes are using the most CPU time by using a command like ps. The %CPU column
gives the percentage of time the process has used the CPU since the process started.
You then must verify if those processes are running correctly and if they are using the
same amount of CPU as usual. You may find processes which are not behaving
normally (spinning, using up CPU time, but not doing any work) and you may be able to
kill these.
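
For example (a sketch; tprof flags and report formats vary by AIX level):

# Show per-process CPU consumption; the %CPU column is the percentage
# of time the process has used the CPU since it started
ps aux
# Profile CPU usage for 60 seconds
tprof -ske -x sleep 60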


Tune applications or operating system


The application itself may be able to be tuned. Is it single-threaded? Can you increase
the number of its threads or processes? You may also be able to tune the operating
system. Is simultaneous multi-threading enabled? Can you change the priorities of
processes or threads such that the most important processes and threads receive most
favored status?


Instructor notes:
Purpose Discuss what to look for to see if the system is CPU bound.
Details The purpose for this visual and the following flowcharts is to summarize
decisions and actions if you have a resource-bound system. All of the flowcharts attempt
the impossible, which is to summarize and yet give enough detail to be useful. Use them to
remind the students of the big picture for each subsystem, and to remind them of the major
steps and tools involved. Point the students back to the individual units as a reference for
exact commands and steps.
One of the questions to ask is, is this system really CPU bound, or are the processors not
being utilized effectively? If the system is truly CPU bound, then reduce the workload or
add processors.
Example of a misbehaving process: Consider a thread that just spins doing yield() calls.
The run queue will go up, and %idle and %wait will drop. yield() does no real work, but it
can contribute to high utilization of processors AND high context switching.
Additional information
Transition statement Let's look at the steps to analyze a potential memory bottleneck.


Memory performance flowchart


[Flowchart: Monitor memory usage and compare with goals (lsps -s, topas, vmstat -I,
PerfPMR). If there is paging, page stealing, or repaging, tune the memory parameters
(vmo). Otherwise, if there is a memory leak (vmstat, ps gv, svmon -P), determine the
responsible process and kill or debug it. Otherwise, if memory is overcommitted
(svmon -G), reduce the workload or add memory.]

Figure 9-7. Memory performance flowchart


Notes:
Memory bound system
A system is memory bound if it has high memory occupancy and high paging space or
file paging activity. The activity of the paging space is given by the number of pages
read from disk to memory (page ins) and number of pages written to disk (page out).
Examples of memory parameters to tune with vmo are the minfree, maxfree, minperm%,
maxperm%, maxclient%, strict_maxclient, strict_maxperm, and lru_file_repage
tuning options.
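
Before changing anything, the current, default, and range information for a tunable can be listed (a sketch):

# Display the characteristics of selected VMM tunables
vmo -L minfree
vmo -L lru_file_repage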

Example of determining a memory bound system


For example, you might use topas and notice that memory is 100% consumed (Comp,
Noncomp), paging space is 61% consumed (% Used), a lot of pages are written to disk
(PgspOut) and you see higher VMM page steals (Steals). Because the system is using
all the memory and asking for more, this system is memory bound. Note that the page
stealing is a normal behavior in AIX and depending on the application may not be an
issue. Now you need to determine why the system is memory bound. Is it because of a
memory leak? Perhaps you need to reduce the workload to free up memory, or add
more physical memory. You may be able to change tuning options to make more
efficient use of memory if adding memory is not an option.

Determine which processes are causing the problem


One action to take is to determine which processes are making the system memory
bound. Use the ps or svmon commands to look for processes that are consuming a lot of
memory. In the following example, perl is the largest memory consumer application
with a total number of pages in real memory of 218933 (around 855 MB) and total
number of pages reserved or used on paging space of 97963 (nearly 400 MB). The
second application vpross uses only 48 MB of memory and less than 7 MB of paging
space. So the perl application is the root cause of this memory problem.
# svmon -P
-------------------------------------------------------------------------------
     Pid Command          Inuse     Pin    Pgsp  Virtual  64-bit  Mthrd  LPage
  332008 perl            218933    4293   97963   318471       N      N      N

    Vsid  Esid Type Description            LPage   Inuse   Pin   Pgsp Virtual
    7380     6 work working storage            -   65536     0      0   65536
    1383     3 work working storage            -   47302     0  18215   65515
   15389     7 work working storage            -   44717     0      0   44717
   17388     4 work working storage            -   38021     0  27528   65536
   21393     5 work working storage            -   15031     0  50528   65536
       0     0 work kernel segment             -    6843  4290   1621    8454
   3f8bd     d work loader segment             -    1352     0     71    3062
   29397     f work shared library data        -      83     0      0      83
    d385     2 work process private            -      32     3      0      32
    3362     1 clnt code,/dev/hd2:12435        -      16     0      -       -
   2f374     a work working storage            -       0     0      0       0
   3f37c     9 work working storage            -       0     0      0       0
   3d37d     8 work working storage            -       0     0      0       0
-------------------------------------------------------------------------------
     Pid Command          Inuse     Pin    Pgsp  Virtual  64-bit  Mthrd  LPage
  303122 vpross           12193    4293    1696    15465       N      N      N

    Vsid  Esid Type Description            LPage   Inuse   Pin   Pgsp Virtual
       0     0 work kernel segment             -    6843  4290   1621    8454
   17368     2 work process private            -    3913     3      4    3917
   3f8bd     d work loader segment             -    1352     0     71    3062
   1936f     1 pers code,/dev/lv00:83977       -      53     0      -       -
   1136b     f work shared library data        -      32     0      0      32
   1336a     - pers /dev/lv00:83969            -       0     0      -       -
-------------------------------------------------------------------------------


Instructor notes:
Purpose Discuss what to look for to see if the system is memory bound.
Details Point out the major steps and tools used to determining why a system is
memory bound and a few actions to take. Point the students back to the VMM unit in this
course for details.
Additional information
Transition statement Let's look at the steps to analyze a potential disk bottleneck.


Disk/File system performance flowchart


[Flowchart: Monitor disk usage and compare with goals (vmstat -I and -v, svmon -G).
If an adapter is overloaded (iostat -a), distribute the load. If a disk is overloaded
(iostat, topas, sar -d, filemon), distribute the load. If it is a file system or LVM
issue (fileplace, lvmstat, lslv, lspv; svmon will show if there is enough memory to
cache files), de-fragment, change the fragment size, check the compression setting, and
distribute logical or physical volumes if there are hotspots.]

Figure 9-8. Disk/File system performance flowchart


Notes:
Disk and file system performance issues
A system may be disk bound if at least one disk is busy and cannot fulfill other requests,
and processes are blocked waiting for their I/O operations to complete. The
limitation can be either physical or logical. The physical limitation involves hardware,
such as the bandwidth of disks, adapters, and the system bus. The logical limitations
involve the organization of the logical volumes on disks and Logical Volume Manager
(LVM) tunings and settings, such as striping or mirroring.

Example of determining a disk or file system bound system


A system might show a high wait I/O at 86.6% (Wait), a percentage of time that hdisk0
was active at 98.7% (Busy%) and more than 5 processes which are waiting for an I/O
operation to complete (Waitqueue). This system is waiting for write operation on
hdisk0, which may be an indication that it is disk bound.


Disk I/O analysis


When a system has been identified having disk I/O performance problems, the next
point is to find out where the problem comes from. Check the adapter throughput and
the disk throughput. The activity of a disk adapter is given by the iostat -a command.
Because the maximum bandwidth of an adapter depends on its type and technology,
compare the statistics given by iostat to the published bandwidth for the adapter to
know the load percentage of the adapter. If the adapter is overloaded, try to move some
data to another disk on a distinct adapter, move a physical disk to another adapter or
add a disk adapter.
The disk may be bound just because the data is not well organized. Verify the
placement of logical volumes on the disk with the lspv command. If logical volumes are
fragmented across the disk, reorganize them with the reorgvg or migratepv
commands.
If logical volumes are well organized in the disks, the problem may come from the file
distribution in the file system. The fileplace command displays the file organization. If
space efficiency is near 100%, this means that the file does not have many fragments
and they are contiguous. If necessary, you can use the defragfs command to
increases a file system's contiguous free space by reorganizing allocations to be
contiguous rather than scattered across the disk.
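
Hypothetical invocations of the commands named above (device names and file paths are placeholders):

# Adapter and per-disk throughput, 5-second intervals, 3 reports
iostat -a 5 3
# Logical volume placement on a suspect disk
lspv -l hdisk0
# Placement and space efficiency of one file
fileplace -v /data/bigfile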


Instructor notes:
Purpose Discuss what to look for to see if the system is disk/file system bound.
Details Discuss that a high I/O Wait (Wait) does not necessarily mean that the system
is disk bound. It is a special instance of idle processor time where nothing is on the run
queue AND there are outstanding I/Os. As processors get faster, I/O wait increases.
Consider two systems with the only difference being the CPU speed. The application is
single-threaded and runs by performing X amount of work and then reading from the disk.
The disk I/O on both systems takes 5 ms to complete the read (same disks).
System1 has a 1.1 GHz CPU and System2 has a 1.9 GHz CPU.
On System1, the work takes 5 ms to complete. So, we have 5 ms of user time, then
5 ms of I/O time with nothing else to run. This gives us 50% I/O wait.
On System2, the work takes approximately 2.5 ms. So, we work for 2.5 ms and wait
for 5 ms. I/O wait is now 66% (5 of every 7.5 ms), but the actual work getting done has
increased! High I/O wait may indicate a slow disk as well.
Consider an overloaded disk subsystem in the above example. It is a shared disk and an
external system adds load to it. The I/Os now take 15 ms to complete. I/O wait went up,
but there is nothing wrong in AIX or the application.
Additional information
Transition statement Let's look at the steps to analyze a potential network bottleneck.


Network performance flowchart (1 of 3)


[Flowchart: Monitor network usage and compare with goals (ping, netstat, netperf).
If there are hardware configuration problems (netstat -v, entstat -d), fix the
configuration (chdev). If there are adapter transmit queue or receive pool overflows
(netstat -v, entstat -d), increase the queue or pool, or decrease the workload
(chdev). If there is a network buffer shortage (netstat -m), add RAM, use the 64-bit
kernel, or decrease the workload.]

Figure 9-9. Network performance flowchart (1 of 3)


Notes:
Monitoring performance and compare to goals
With network I/O, one of the things that needs to be done is to document the network
topology and identify the transmit and receive hosts for the major applications (or at
least the ones that use the network).

Hardware configuration problems


Adapter to switch port link configuration problems could cause corrupted frames, late or
multiple collisions. In addition to using netstat -v and entstat -d on the hosts, check
the statistics on the Ethernet switch(es). If the configuration looks fine, check the rest of
the hardware such as cables, switch ports, and the adapters.


Are there adapter transmit (xmit) queue or receive (rcv) pool overflows?
If there are overflows in these areas, increase the size of the transmit queue or the
receive pool with the chdev command for the adapter or decrease the network load.

Are there network buffer shortages?


One configuration option if you are running short on network buffers is to use the 64-bit
kernel. You could also add more memory or find out what is using all the network buffers
and try to decrease the network load.
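
Sample checks (the adapter name is a placeholder; the exact counter names in the entstat report vary by adapter type):

# Look for transmit queue overflows and receive errors on ent0
entstat -d ent0 | grep -i -E 'overflow|error'
# Check mbuf (network memory) usage and failures
netstat -m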


Instructor notes:
Purpose Discuss what to look for to see if the system is network bound.
Details This page reviews what to look for if you're trying to find a network bottleneck.
Additional information
Transition statement Let's look at the second part of this flowchart.


Network performance flowchart (2 of 3)


[Flowchart, continuing from the Meeting goals? decision: If there are IP input queue
overflows (netstat -p ip), increase the queue size (no), avoid fragmentation,
eliminate discards and delays, or decrease the workload. If there are UDP receive
buffer overflows (netstat -p udp), increase the buffer size (no, setsockopt()), check
the CPU, or decrease the workload. If there are TCP retransmits (netstat -p tcp), find
and fix the drops or delays.]

Figure 9-10. Network performance flowchart (2 of 3)


Notes:
Are there IP input queue overflows?
If there are IP input queue overflows, try to solve this by avoiding fragmentation, and/or
by eliminating network discards and delays. You could also increase the queue size
with the no command or decrease network load.

Are there UDP receive buffer overflows?


If there are UDP receive buffer overflows, try to solve this by increasing the size of the
buffer with the no command. Also, check that the CPU subsystem is not constrained.
Another solution is to decrease the network load.


Are there TCP retransmits?


If you see TCP retransmits, you will need to identify where the packets are being
dropped or delayed and fix the problem at the source. The problem could be caused by
any of the above network issues, or could be somewhere else in the network.
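
The corresponding protocol counters can be checked as shown below (a sketch; the counter wording varies by AIX level):

# IP statistics, including ipintrq overflows
netstat -p ip
# UDP statistics, including socket buffer overflows
netstat -p udp
# TCP statistics, including retransmitted packets
netstat -p tcp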


Instructor notes:
Purpose Discuss the steps to analyze a potential network bottleneck.
Details Continue working down the network performance flowchart.
Additional information
Transition statement Let's look at the last part of this network flowchart.


Network performance flowchart (3 of 3)


[Flowchart, continuing from the Meeting goals? decision: If the TCP initial window size
is not optimal (no, lsdev), size the TCP send/receive buffers and possibly enable
rfc1323 (no, chdev, setsockopt()). If there are 200 ms transmit pauses (netstat -p tcp,
tcpdump or iptrace), disable Nagle's Algorithm (setsockopt()). If demand is simply too
high (netstat), decrease or distribute the traffic.]

Figure 9-11. Network performance flowchart (3 of 3)


Notes:
Use the optimal TCP initial window size
The TCP initial window size is controlled by the TCP send and receive buffer sizes. You
may need to experiment with window sizes to find the optimal setting. Tune the
ISNO parameters with the chdev command. You may need to enable rfc1323.
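
A hypothetical ISNO adjustment for one interface (the interface name and buffer sizes are placeholders; ISNO must be enabled on the system for these to take effect):

# Set interface-specific rfc1323 and TCP buffer sizes on en0
chdev -l en0 -a rfc1323=1 -a tcp_sendspace=262144 -a tcp_recvspace=262144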

Do you see 200 ms transmit pauses?


These transmit pauses are due to waiting for delayed acknowledgements. Solve by
disabling Nagle's Algorithm.

Is the network demand simply too high?


Identify the source(s) of the demand and eliminate what you can or slow it down. Or,
even better, try to redistribute the traffic to other adapters or servers, or over time. To
slow down demand, you can decrease window size, message sizes, or the maximum
number of connections.
The source of the demand may be:
- An application that either has a bug, or is poorly designed, or simply has a lot of
valid work to do
- The sum of many applications that cumulatively overload the queues, memory, or
adapter
- One remote session partner which is either sending too much data or requesting too
much data to be sent back
- The sum of many session partners that in total are overloading this host


Instructor notes:
Purpose Discuss the steps to analyze a potential network bottleneck.
Details Continue working down the network performance flowchart.
Additional information
Transition statement Let's look at the NFS performance flowchart.


NFS performance flowchart: Servers

- START: Monitor NFS usage. Meeting goals? If yes, done; if no, work down the decisions below.
- Decision: Non-standard mount protocols?
  Monitoring tool: nfsstat
  Yes: Use NFSv3 and TCP if appropriate
- Decision: Any resource bottlenecks?
  Monitoring tools: all CPU, memory, I/O, and network analysis tools
  Yes: Fix the system bottleneck; use NFS-specific tuning tools (nfso), then try and look for improvement
- Decision: NFS threads need tuning?
  Yes: The maximum number of threads may be too low for lockd, mountd, or nfsd

Figure 9-12. NFS performance flowchart: Servers

Notes:
NFS performance tuning
Part of the difficulty with tuning for NFS is that it is a client/server application. So its
tuning involves everything we have taught in the entire course plus some items that are
unique to NFS. The flowchart in the visual above is a basic methodology outline for
servers. The next visual will show a flowchart for clients.

Use the correct protocols for NFS mounts


Are you using the default and recommended NFSv3 and TCP protocols for the NFS
mounts? Other options (NFSv2, UDP, NFSv4) may impact performance, and should be
treated as special cases when it comes to tuning. If you can, change back to NFSv3
and TCP.
The rest of the server tuning decision points in the flowchart assume these protocols
are in use.
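On the client, a quick way to verify which protocols each mount is actually using is the
per-mount report; the Flags line shows the vers= and proto= settings:

  # nfsstat -m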

Tune the resource subsystems on the NFS server


On the NFS server, watch for bottlenecks in all subsystems: CPU, memory, I/O, and
networking as previously discussed in this course. Any of these could affect NFS
performance.
Specifically for NFS processor performance, you can set the server's priority with nfso.
For I/O, try setting aggressive read-ahead with nfso. For networking, you can request
setsockopt() values for socket buffer size and configure the use of rfc1323 with nfso.
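As a sketch of these server-side nfso tunables (names are as listed by nfso -a on recent
AIX levels; the socket size value is illustrative, so verify the names and defaults on
your level):

  # nfso -o nfs_server_clread=1
  # nfso -o nfs_rfc1323=1
  # nfso -o nfs_tcp_socketsize=600000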

Tune NFS threads


The NFS server has some logical resources which may need tuning. You can configure the
maximum number of threads for lockd, mountd, and nfsd. Increase these if your server
can handle the load; decrease these if you need to throttle back NFS traffic.
Additional ways to throttle the workload include using nfso to reduce the maximum read
and write sizes, the maximum number of connections, and the socket size. Also look at
redistributing the workload by spreading it over more NFS servers or by controlling
when certain clients request services. Instead of de-tuning the server, you could
de-tune the clients.
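For example, the number of nfsd threads and the maximum read/write sizes might be
adjusted as follows (the counts and sizes are illustrative only):

  # chnfs -n 64
  # nfso -o nfs_max_read_size=32768 -o nfs_max_write_size=32768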


Instructor notes:
Purpose: Discuss the steps to analyze performance issues on NFS servers.
Details: The visual shows an overview of the major NFS server performance issues, the tools to use to monitor them, and suggestions for how to solve them.
Additional information:
Transition statement: Now, let's look at NFS clients.


NFS performance flowchart: Clients

- START: Monitor NFS usage (monitoring tools: nfsstat, netpmon). Meeting goals? If yes, done; if no, work down the decisions below.
- Decision: Any resource bottlenecks?
  Monitoring tools: all CPU, memory, I/O, and network analysis tools
  Yes: Fix the system bottleneck; use NFS-specific tuning tools (nfso, or mount options), then try and look for improvement
- Decision: NFS threads need tuning?
  Yes: Set the number of biod threads, size the bufstruct pools (nfso, or mount options)
- Decision: High commits or attribute requests?
  Monitoring tool: nfsstat -nc
  Yes: Tune the combehind option, tune the attribute cache (mount options)

Figure 9-13. NFS performance flowchart: Clients

Notes:
Tune the resource subsystems on the NFS clients
On the NFS client, just like on the NFS server, watch for bottlenecks in all subsystems:
CPU, memory, I/O, and networking as previously discussed in this course. Any of these
could affect NFS performance.
Specifically for NFS client memory performance, you can use nfso or mount options to
enable release-behind-on-read (the rbr mount option). For networking, you can request
setsockopt() values for socket buffer size and configure the use of rfc1323 with nfso.
You can also set read and write options with mount options.
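For example, a client mount requesting NFSv3 over TCP with release-behind-on-read might
look like this (the server and path names are placeholders):

  # mount -o vers=3,proto=tcp,rbr server1:/export /mnt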

Tune NFS threads


The NFS client has some logical resources which may need tuning. You can configure the
maximum number of threads for biod. Increase this if your server can handle the
load; decrease it if you need to throttle back NFS traffic from this client. Also check
the number and size of the bufstruct pools with nfso. These could be artificially
constraining performance.
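A sketch of the client-side thread tuning (the count is illustrative; the mechanism
varies by AIX level, so verify on your system):

  # chnfs -b 32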

High number of commits and/or a high number of attribute requests


If nfsstat -nc shows a high number of commits, consider tuning the combehind
setting. This is a mount option.
If nfsstat -nc shows a high number of attribute requests, consider tuning the attribute
cache. This is a mount option.
Also, if one application is unfairly dominating in a constrained environment, look at
tuning the I/O pacing mount options, both minpout and maxpout.
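For example, a mount combining these options might look like this (the values and names
are placeholders, not recommendations):

  # mount -o combehind,actimeo=60,minpout=24,maxpout=33 server1:/export /mnt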

Special case when using UDP (server and client)


For systems using the UDP protocol, if netstat -p udp on the receiving side shows buffer
overruns, either:
- Tune the NFS system as previously discussed (increase the socket size or tune for a
constrained CPU). Also try setting the UDP buffer size via setsockopt() (using
nfso), or
- Possibly increase the maximum number of nfsd threads using nfso.
Use nfsstat -cr on the client to see the RPC statistics. NFS/RPC handles error
detection and retransmission. If retrans and badxid are both high, there are delays (or
ack discards) either in the network or at a congested server. Investigate and fix the
problem (using the tools covered in the networking unit of this course).
If retrans is high but badxid is low, this indicates that requests are being discarded
either in the network or at the server. Investigate and fix the problem (using the tools
covered in the networking unit of this course).
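The two reports referenced above can be gathered as follows:

  # netstat -p udp        (receiving side; look for socket buffer overflows)
  # nfsstat -cr           (client; look at the retrans and badxid counts)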


Instructor notes:
Purpose: Discuss the steps to analyze performance issues on NFS clients.
Details: The visual shows an overview of the major NFS client performance issues, the tools to use to monitor them, and suggestions for how to solve them.
Additional information:
Transition statement: Time for our final checkpoint.


Checkpoint
1. These are the steps for a methodological approach to performance analysis. Put them in the correct order:
   __ Identifying a performance bottleneck
   __ Measuring the current performance of the server
   __ Changing the component which is causing the bottleneck
   __ Understanding the factors which can affect performance
   __ Measuring the new performance of the server to check for improvement
2. What are the distinct areas or subsystems to analyze for performance?

Figure 9-14. Checkpoint

Notes:


Instructor notes:
Purpose: Checkpoint questions.
Details: A suggested approach is to give the students about five minutes to answer the questions on this page. Then, go over the questions and answers with the class.

Checkpoint solutions
1. These are the steps for a methodological approach to performance analysis, in the correct order:
   1. Understanding the factors which can affect performance
   2. Measuring the current performance of the server
   3. Identifying a performance bottleneck
   4. Changing the component which is causing the bottleneck
   5. Measuring the new performance of the server to check for improvement
2. What are the distinct areas or subsystems to analyze for performance? CPU, memory, disks, file systems, network, NFS


Additional information:
Transition statement: It's now time for an exercise.


Exercise 9: Summary exercise

- Use PerfPMR reports to determine symptoms of performance issues
- Recommend tuning actions

Figure 9-15. Exercise 9: Summary exercise

Notes:
Introduction
Use all of the tools and knowledge learned in this course to find and fix the performance
issues.


Instructor notes:
Purpose: Describe the exercise scenario.
Details:
Additional information:
Transition statement: Let's summarize the unit.


Unit summary
- A system could be CPU bound if all of the following are true:
  - Processors are nearing 100% busy
  - Many jobs are waiting for a CPU in the run queue, and performance has degraded
- A system could be memory bound if it has:
  - High memory occupancy, high paging space activity, or high file paging activity
- A system could be disk bound if it has:
  - At least one disk so busy that it cannot fulfill other requests
  - Processes blocked and waiting for I/O operations to complete
- A system could be network I/O bound if it has:
  - The bandwidth of at least one network adapter totally (or almost totally) used
  - Run out of buffers or memory, or is configured incorrectly

Figure 9-16. Unit summary

Notes:


Instructor notes:
Purpose: Summarize the unit.
Details:
Additional information:
Transition statement: This is the end of the course!


Appendix A. Checkpoint solutions


Unit 1 - Performance analysis and tuning overview

Checkpoint solutions (1 of 2)
1. Use these terms with the following statements: benchmarks, metrics, baseline, performance goals, throughput, response time
   a. Performance is dependent on a combination of throughput and response time.
   b. Expectations can be used as the basis for performance goals.
   c. These are standardized tests used for evaluation. benchmarks
   d. You need to know this to be able to tell if your system is performing normally. baseline
   e. These are collected by analysis tools. metrics


Unit 1 - Performance analysis and tuning overview (cont.)

Checkpoint solutions (2 of 2)
2. The four components of system performance are: CPU, memory, I/O, and network.
3. After tuning a resource or system parameter and monitoring the outcome, what is the next step in the tuning process? Determine if the performance goal(s) have been met.
4. The six tuning options commands are: schedo, vmo, ioo, lvmo, no, nfso


Unit 2 - Data collection

Checkpoint solutions
1. What is the difference between a functional problem and a performance problem? A functional problem is when an application, hardware, or network is not behaving correctly. A performance problem is when the function is working, but the speed at which it performs is slow.
2. What is the name of the supported tool used to collect reports with a wide variety of performance data? PerfPMR
3. True/False You can individually run the scripts that perfpmr.sh calls. (True)
4. True/False You can dynamically change the topas and nmon displays. (True)


Unit 3 - Monitoring, analyzing, and tuning CPU usage

Checkpoint solutions
1. What is the difference between a process and a thread? A process is an activity within the system that is started by a command, a shell program, or another process. A thread is what is dispatched to a CPU and is part of a process. A process can have one or more threads.
2. The default scheduling policy is called: SCHED_OTHER
3. The default scheduling policy applies to fixed or non-fixed priorities? Non-fixed
4. Priority numbers range from 0 to 255.
5. True/False The higher the priority number, the more favored the thread will be for scheduling. (False: in AIX, a lower priority number is more favored.)
6. List at least two tools to monitor CPU usage: vmstat, sar, topas, nmon
7. List at least two tools to determine what processes are using the CPUs: ps, tprof, topas, nmon

Unit 4 - Virtual memory performance monitoring and tuning

Checkpoint solutions (1 of 2)
1. What are the three virtual memory segment types?
persistent, client, and working
2. What type of segments are paged out to paging space?
working
3. What are the two classifications of memory (for the
purpose of choosing which pages to steal)?
computational memory and non-computational (file)
memory
4. What is the name of the kernel process that implements
the page replacement algorithm? lrud


Unit 4 - Virtual memory performance monitoring and tuning (cont.)

Checkpoint solutions (2 of 2)
5. List the vmo parameter that matches the description:
   a. Specifies the minimum number of frames on the free list at which the VMM starts to steal pages to replenish the free list: minfree
   b. Specifies the number of frames on the free list at which page stealing stops: maxfree
   c. Specifies the point below which the page stealer will steal file or computational pages regardless of repaging rates: minperm%
   d. Specifies whether or not to consider repage rates when deciding what type of page to steal: lru_file_repage


Unit 5 - Physical and logical volume performance

Checkpoint solutions
1. True/False When you see two hdisks on your system, you know they represent two separate physical disks. (False: an hdisk may be a logical LUN backed by shared physical storage.)
2. List two commands that will provide real-time disk I/O statistics: iostat, sar -d, topas or nmon
3. Identify and define the default mirroring scheduling policy. Parallel policy: sends read requests to the least busy copy and write requests to all copies concurrently.
4. What tools allow you to observe the time the physical disks are active in relation to their average transfer rates by monitoring system input/output device loads? iostat and sar

Unit 6 - File system performance monitoring and tuning

Checkpoint solutions (1 of 3)
1. True/False File fragmentation can result in a sequential read pattern of many small reads with seeks between them. (True)
2. True/False When measuring file system performance, I/O subsystems should not be shared. (True)
3. Two commands to measure read throughput are: dd and time.
4. The fileplace command can be used to determine if there is fragmentation.


Unit 6 - File system performance monitoring and tuning (cont.)

Checkpoint solutions (2 of 3)
5. What tunable functions exist to flush out modified file pages, based on a threshold of the number of dirty pages in memory?
   - Sequential write-behind
   - Random write-behind
6. What is the difference between JFS and JFS2 random write-behind? The threshold for random writes in JFS is simply the number of random pages. JFS2, in addition to using the number of random writes as a threshold, has a definition of what is considered a random write based upon the separation between the writes.


Unit 6 - File system performance monitoring and tuning (cont.)

Checkpoint solutions (3 of 3)
7. List factors that may impact performance when files are fragmented:
   - Sequential access is no longer sequential
   - Random access is affected (by having to access more widely dispersed data)
   - Access time is dominated by longer seek times
8. What commands can be used to determine if there is a file system performance problem? iostat and filemon
9. What is the relationship between file system buffers and the VMM I/O queue? Read/write requests will be queued on the VMM I/O queue once the system runs out of file system buffers.

Unit 7 - Network performance

Checkpoint solutions (1 of 3)
1. Interactive users are more concerned with measurements of response time, while users of batch data transfers are more concerned with measurements of throughput.
2. True/False thewall maximum amount of network pinned memory can be increased in AIX 6 only by increasing the amount of real memory. (True)
3. When sending a single TCP packet, an acknowledgement can, by default, be delayed as long as 200 milliseconds.


Unit 7 - Network performance (cont.)

Checkpoint solutions (2 of 3)
4. True/False Increasing the tcp_recvspace at the receiving host will always increase the effective window size for the connections. (False)
   If the tcp_sendspace at the transmitting host is smaller than the tcp_recvspace at the receiving host, it will become the controlling factor. Both ends would need to be increased.
5. What network option must be enabled to allow window sizes greater than 64 KB? rfc1323
6. List two ways in which Nagle's algorithm can be disabled:
   - Specify tcp_nodelay either from the application (setsockopt) or as an interface-specific network option (ISNO)
   - Specify tcp_nagle_limit=1 as a network option


Unit 7 - Network performance (cont.)

Checkpoint solutions (3 of 3)
7. If you saw a large count for ipintrq in the netstat
report, which actions would help reduce the overflows?
a) Increase memory and CPU capacity at the receiving host
b) Increase ipmaxqlen and decrease ipfragttl
c) Decrease ipmaxqlen and increase ipfragttl
d) Eliminate the cause of delayed and dropped fragments.
Answer: b and d
8. A high percentage of collisions in an Ethernet full duplex
switch environment is an indication of:
Either a defective adapter or switch port, or a duplex mode
configuration mismatch between the adapter and switch
port.


Unit 8 - NFS performance

Checkpoint solutions (1 of 2)
1. True/False A large number of concurrent NFS clients can overload an NFS server. (True)
2. The biod daemons are the block input/output daemons and are required in order to perform remote I/O requests at an NFS client.
3. On clients and servers where there is heavy file-locking activity, the rpc.lockd daemon may become a bottleneck (for NFS V2 and NFS V3).


Unit 8 - NFS performance (cont.)

Checkpoint solutions (2 of 2)
4. The nfsstat -m command can be used to look at per-mount statistics at the NFS client.
5. The netpmon utility can identify which NFS clients present the greatest workload at the NFS server.
6. If the NFS client has overcommitted memory, the combehind mount option can be used to improve NFS I/O efficiency, and the rbr mount option can be used to release file cache memory once the application receives the data.


Unit 9 - Performance management methodology

Checkpoint solutions
1. These are the steps for a methodological approach to performance analysis, in the correct order:
   1. Understanding the factors which can affect performance
   2. Measuring the current performance of the server
   3. Identifying a performance bottleneck
   4. Changing the component which is causing the bottleneck
   5. Measuring the new performance of the server to check for improvement
2. What are the distinct areas or subsystems to analyze for performance? CPU, memory, disks, file systems, network, NFS
