
Solaris System Performance Management

SA-400

Student Guide With Instructor Notes

Sun Microsystems, 500 Eldorado Boulevard, MS: BRM05-104, Broomfield, Colorado 80021 U.S.A.

Revision C.1

Copyright 2001 Sun Microsystems, Inc., 901 San Antonio Road, Palo Alto, California 94303, U.S.A. All rights reserved. This product or document is protected by copyright and distributed under licenses restricting its use, copying, distribution, and compilation. No part of this product or document may be reproduced in any form by any means without prior written authorization of Sun and its licensors, if any. Third-party software, including font technology, is copyrighted and licensed from Sun suppliers. Parts of the product may be derived from Berkeley BSD systems, licensed from the University of California. UNIX is a registered trademark in the U.S. and other countries, exclusively licensed through X/Open Company, Ltd. Sun, Sun Microsystems, the AnswerBook, CacheFS, Forte, Java, OpenBoot, OpenWindows, Solaris, StarOffice, Solstice DiskSuite, Solaris Resource Manager, Sun Enterprise 10000, Sun Enterprise SyMON, Sun Logo, SunOS, Sun Management Center, Sun StorEdge, Sun StorEdge A3000, Sun StorEdge A5000, Sun StorEdge A7000, Sun Trunking, Ultra, and Ultra Enterprise are trademarks or registered trademarks of Sun Microsystems, Inc. in the U.S. and other countries. All SPARC trademarks are used under license and are trademarks or registered trademarks of SPARC International, Inc. in the U.S. and other countries. Products bearing SPARC trademarks are based upon an architecture developed by Sun Microsystems, Inc. ORACLE is a registered trademark of Oracle Corporation. The OPEN LOOK and Sun Graphical User Interface was developed by Sun Microsystems, Inc. for its users and licensees. Sun acknowledges the pioneering efforts of Xerox in researching and developing the concept of visual or graphical user interfaces for the computer industry. Sun holds a non-exclusive license from Xerox to the Xerox Graphical User Interface, which license also covers Sun's licensees who implement OPEN LOOK GUIs and otherwise comply with Sun's written license agreements.

U.S. Government approval required when exporting the product. RESTRICTED RIGHTS: Use, duplication, or disclosure by the U.S. Government is subject to restrictions of FAR 52.227-14(g)(2)(6/87) and FAR 52.227-19(6/87), or DFAR 252.227-7015(b)(6/95) and DFAR 227.7202-3(a). DOCUMENTATION IS PROVIDED "AS IS" AND ALL EXPRESS OR IMPLIED CONDITIONS, REPRESENTATIONS, AND WARRANTIES, INCLUDING ANY IMPLIED WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE OR NON-INFRINGEMENT, ARE DISCLAIMED, EXCEPT TO THE EXTENT THAT SUCH DISCLAIMERS ARE HELD TO BE LEGALLY INVALID.

THIS MANUAL IS DESIGNED TO SUPPORT AN INSTRUCTOR-LED TRAINING (ILT) COURSE AND IS INTENDED TO BE USED FOR REFERENCE PURPOSES IN CONJUNCTION WITH THE ILT COURSE. THE MANUAL IS NOT A STANDALONE TRAINING TOOL. USE OF THE MANUAL FOR SELF-STUDY WITHOUT CLASS ATTENDANCE IS NOT RECOMMENDED.

Please Recycle

Table of Contents
About This Course .......... Preface-xxi
    Course Goal .......... Preface-xxi
    Course Overview .......... Preface-xxii
    Course Map .......... Preface-xxiii
    Module-by-Module Overview .......... Preface-xxiv
    Course Objectives .......... Preface-xxviii
    Skills Gained by Module .......... Preface-xxix
    Guidelines for Module Pacing .......... Preface-xxx
    Topics Not Covered .......... Preface-xxxi
    How Prepared Are You? .......... Preface-xxxiii
    Introductions .......... Preface-xxxv
    How to Use the Course Materials .......... Preface-xxxvi
    Course Icons and Typographical Conventions .......... Preface-xxxviii
        Icons .......... Preface-xxxviii
        Typographical Conventions .......... Preface-xxxix
    Notes to the Instructor .......... Preface-xl
        Philosophy .......... Preface-xl
        Course Tools .......... Preface-xli
    Changes for Revision C.1 .......... Preface-xliv
    Instructor Setup Notes .......... Preface-xlv
        Purpose of This Guide .......... Preface-xlv
    Course Files .......... Preface-xlvi
        Course Components .......... Preface-xlvi
    Course Labs: An Explanation .......... Preface-xlviii
    Course Labs Explanation (Introducing Performance Management) .......... Preface-xlix
    Course Labs Explanation (Processes) .......... Preface-l
    Course Labs Explanation (CPU Scheduling) .......... Preface-li
    Course Labs Explanation (Caches) .......... Preface-lii
    Course Labs Explanation (Memory Tuning) .......... Preface-liii
    Course Labs Explanation (System Buses) .......... Preface-liv
    Course Labs Explanation (I/O Tuning) .......... Preface-lv
    Course Labs Explanation (UFS Tuning) .......... Preface-lvi

Copyright 2001 Sun Microsystems, Inc. All Rights Reserved. Enterprise Services, Revision C.1

    Course Labs Explanation (Network Tuning) .......... Preface-lvii
    Course Labs Explanation (Summary) .......... Preface-lviii
Introducing Performance Management .......... 1-1
    Objectives .......... 1-1
    Relevance .......... 1-2
    Additional Resources .......... 1-2
    What Is Tuning? .......... 1-4
    Why Should a System Be Tuned? .......... 1-7
    Basic Tuning Procedure .......... 1-8
    Conceptual Model of Performance .......... 1-10
    Where to Tune First .......... 1-12
    Where to Tune? .......... 1-14
    Performance Measurement Terminology .......... 1-15
    Performance Graphs .......... 1-17
    Displaying Graphs of Performance Metrics .......... 1-18
    Sample Tuning Task .......... 1-20
    The Tuning Cycle .......... 1-22
    SunOS Components .......... 1-23
    How Data Is Collected .......... 1-24
    The kstat Kernel Data Structure .......... 1-25
    How the Tools Process the Data .......... 1-26
    Monitoring Tools Provided With the Solaris OE .......... 1-27
    The sar Command .......... 1-28
    The vmstat Command .......... 1-29
    The iostat Command .......... 1-30
    The mpstat Command .......... 1-31
    The netstat Command .......... 1-32
    The nfsstat Command .......... 1-33
    The Sun Enterprise SyMON System Monitor .......... 1-34
    Sun Management Center .......... 1-35
    Other Utilities .......... 1-37
    The prex Command .......... 1-38
    The tnfxtract and tnfdump Commands .......... 1-39
    The busstat Command .......... 1-40
    The cpustat and cputrack Commands .......... 1-41
    The /usr/proc/bin Directory .......... 1-42
    The Process Manager .......... 1-43
    The memtool Package .......... 1-45
    The SE Toolkit .......... 1-47
    System Performance Monitoring With the SE Toolkit .......... 1-48
    SE Toolkit Example Tools .......... 1-49
    System Accounting .......... 1-50
    Viewing Tuning Parameters .......... 1-51
    Setting Tuning Parameters .......... 1-52
    The /etc/system File .......... 1-53


    Check Your Progress .......... 1-54
    Think Beyond .......... 1-55
System Monitoring Tools .......... L1-1
    Objectives .......... L1-1
    Using the Tools .......... L1-2
    Enabling System Activity Data Capture .......... L1-4
    Installing the SE Toolkit .......... L1-5
    Installing the memtool Package .......... L1-6
    Using the prstat Command .......... L1-7
    Using the sar Command to Read Data From a Saved File .......... L1-8
Processes and Threads .......... 2-1
    Objectives .......... 2-1
    Relevance .......... 2-2
    Additional Resources .......... 2-2
    Process Lifetime .......... 2-3
    Process States .......... 2-5
    Process Performance Issues .......... 2-7
    Process Lifetime Performance Data .......... 2-8
    Process-Related Tunable Parameters .......... 2-11
    Tunable Parameters Specific to the Solaris 8 OE .......... 2-13
    Multithreading .......... 2-14
    Application Threads .......... 2-17
    Process Execution .......... 2-18
    Thread Execution .......... 2-19
    Process Thread Examples .......... 2-20
    Performance Issues .......... 2-21
    Thread and Process Performance .......... 2-22
    Locking .......... 2-23
    Locking Problems .......... 2-25
    The lockstat Command .......... 2-26
    Interrupt Levels .......... 2-27
    The clock Routine .......... 2-29
    Clock Tick Processing .......... 2-31
    Process Monitoring Using the ps Command .......... 2-32
    Process Monitoring Using System Accounting .......... 2-35
    Process Monitoring Using SE Toolkit Programs .......... 2-36
        The msacct.se Program .......... 2-36
        The pea.se Program .......... 2-36
        The ps-ax.se Program .......... 2-38
        The ps-p.se Program .......... 2-38
        The pwatch.se Program .......... 2-38
    Check Your Progress .......... 2-39
    Think Beyond .......... 2-40


Processes Lab .......... L2-1
    Objectives .......... L2-1
    Using the maxusers Parameter .......... L2-2
    Using the lockstat Command .......... L2-5
    Monitoring Processes .......... L2-6
CPU Scheduling .......... 3-1
    Objectives .......... 3-1
    Relevance .......... 3-2
    Additional Resources .......... 3-2
    Scheduling States .......... 3-3
    Scheduling Classes .......... 3-4
    Class Characteristics .......... 3-5
        Timesharing/Interactive .......... 3-5
        System Class .......... 3-5
        Real-time Class .......... 3-5
    Dispatch Priorities .......... 3-6
        Example .......... 3-7
    Timesharing/Interactive Dispatch Parameter Table .......... 3-11
        CPU Starvation .......... 3-13
        Example of Dispatch Table: Timeshare .......... 3-14
    Dispatch Parameter Table Issues .......... 3-15
    Timesharing Dispatch Parameter Table (Batch) .......... 3-16
    The dispadmin Command .......... 3-17
    dispadmin Example .......... 3-18
    The Interactive Scheduling Class .......... 3-19
    Real-Time Scheduling .......... 3-20
    Real-Time Issues .......... 3-21
    Real-Time Dispatch Parameter Table .......... 3-23
    The priocntl Command .......... 3-24
    Kernel Thread Dispatching .......... 3-26
        Old Thread .......... 3-26
        New Thread .......... 3-27
    Processor Sets .......... 3-28
        Dynamic Reconfiguration Interaction .......... 3-29
    Using Processor Sets .......... 3-30
    The Run Queue .......... 3-32
    CPU Activity .......... 3-33
    CPU Performance Statistics .......... 3-35
    Available CPU Performance Data .......... 3-39
    CPU Control and Monitoring .......... 3-40
    The mpstat Command .......... 3-41
    mpstat Data .......... 3-43
    Solaris Resource Manager .......... 3-44
    Solaris Resource Manager Hierarchical Shares Example .......... 3-46
    SRM Limit Node .......... 3-47


    CPU Monitoring Using SE Toolkit Programs .......... 3-48
        The cpu_meter.se Program .......... 3-48
        The vmmonitor.se Program .......... 3-48
        The mpvmstat.se Program .......... 3-48
    Check Your Progress .......... 3-49
    Think Beyond .......... 3-50
CPU Scheduling Lab .......... L3-1
    Objectives .......... L3-1
    CPU Dispatching .......... L3-2
    Process Priorities .......... L3-4
    The Dispatch Parameter Table .......... L3-8
    Real-Time Scheduling .......... L3-10
    Processor Sets .......... L3-11
System Caches .......... 4-1
    Objectives .......... 4-1
    Relevance .......... 4-2
    Additional Resources .......... 4-2
    What Is a Cache? .......... 4-3
    Hardware Caches .......... 4-4
        Static Random Access Memory (SRAM) and Dynamic RAM (DRAM) .......... 4-5
    CPU Caches .......... 4-6
    Cache Operation .......... 4-7
    Cache Miss Processing .......... 4-8
    Cache Data Replacement .......... 4-9
    Cache Operation Flow .......... 4-10
    Cache Hit Rate .......... 4-11
    Effects of CPU Cache Misses .......... 4-13
    Cache Characteristics .......... 4-14
    Virtual Address Cache .......... 4-15
    Physical Address Cache .......... 4-16
    Direct-Mapped and Set-Associative Caches .......... 4-17
        Harvard Caches .......... 4-18
    Write-Through and Write-Back Caches .......... 4-19
        Write Cancellation .......... 4-20
    CPU Cache Snooping .......... 4-21
    Cache Thrashing .......... 4-23
    System Cache Hierarchies .......... 4-25
    Relative Access Times .......... 4-27
    Cache Performance Issues .......... 4-28
    Items You Can Tune .......... 4-30
    Cache Monitoring Command Summary .......... 4-31
    Check Your Progress .......... 4-33
    Think Beyond .......... 4-34


Cache Lab .......... L4-1
    Objectives .......... L4-1
    Calculating Cache Miss Costs .......... L4-2
    Displaying Cache Statistics .......... L4-3
Memory Tuning .......... 5-1
    Objectives .......... 5-1
    Relevance .......... 5-2
    Additional Resources .......... 5-2
    Virtual Memory .......... 5-3
    Process Address Space .......... 5-4
    Types of Segments .......... 5-6
    Virtual Address Translation .......... 5-7
    Page Descriptor Cache .......... 5-8
    Virtual Address Lookup .......... 5-10
    The Memory-Free Queue .......... 5-11
    The Paging Mechanism .......... 5-13
    Page Daemon Scanning .......... 5-14
    Page Scanner Processing .......... 5-15
    An Alternate Approach: The Clock Algorithm .......... 5-17
    Default Paging Parameters .......... 5-19
    The maxpgio Parameter .......... 5-20
    File System Caching .......... 5-22
    The Buffer Cache .......... 5-24
    The Page Cache .......... 5-26
    Available Paging Statistics .......... 5-28
    Additional Paging Statistics .......... 5-34
    Swapping .......... 5-36
    When Is the Swapper Desperate? .......... 5-37
    What Is Not Swappable .......... 5-38
    Swapping Priorities .......... 5-39
    Swap In .......... 5-40
    Swap Space .......... 5-41
    The tmpfs File System .......... 5-43
    Available Swapping Statistics .......... 5-45
    Shared Libraries .......... 5-48
    Paging Review .......... 5-49
    Memory Utilization .......... 5-50
    Total Physical Memory .......... 5-51
    File Buffering Memory .......... 5-52
    Kernel Memory .......... 5-53
    Identifying an Application's Memory Requirements .......... 5-55
    Free Memory .......... 5-57
    Identifying a Memory Shortage .......... 5-58
    Using the memtool Graphical User Interface (GUI) .......... 5-59


    Items You Can Tune .......... 5-63
        Scanning Example .......... 5-65
    Check Your Progress .......... 5-66
    Think Beyond .......... 5-67
Memory Tuning Lab .......... L5-1
    Objectives .......... L5-1
    Process Memory .......... L5-2
    Page Daemon Parameters .......... L5-3
        Determining Initial Values .......... L5-3
        Changing Parameters .......... L5-4
    Scan Rates .......... L5-11
System Buses .......... 6-1
    Objectives .......... 6-1
    Relevance .......... 6-2
    Additional Resources .......... 6-2
    What Is a Bus? .......... 6-3
    General Bus Characteristics .......... 6-4
    System Buses .......... 6-5
        MBus and XDbus .......... 6-6
        System Buses: UPA and Gigaplane Bus .......... 6-7
    Gigaplane XB Bus .......... 6-8
    The Gigaplane XB Bus .......... 6-9
    Sun System Backplanes Summary .......... 6-10
    Peripheral Buses .......... 6-11
    SBus Compared to PCI Bus .......... 6-12
    PIO and DVMA .......... 6-13
        Programmed I/O (PIO) .......... 6-14
        Dynamic Virtual Memory Access (DVMA) .......... 6-14
        Which Is Being Used? .......... 6-14
    The prtdiag Utility .......... 6-16
    prtdiag CPU Section .......... 6-17
    prtdiag Memory Section .......... 6-18
        Memory Interleaving .......... 6-18
    prtdiag I/O Section .......... 6-20
    Bus Limits .......... 6-21
    Bus Bandwidth .......... 6-22
    Diagnosing Bus Problems .......... 6-23
    Avoiding Bus Problems .......... 6-24
    SBus Overload Examples .......... 6-26
    CPU Load on the System Bus .......... 6-28
    Dynamic Reconfiguration Considerations .......... 6-30
    Tuning Reports .......... 6-31
    Bus Monitoring Commands .......... 6-32
    Check Your Progress .......... 6-34
    Think Beyond .......... 6-35


System Buses Lab .......... L6-1
    Objectives .......... L6-1
    Testing the Hardware Configuration of a System .......... L6-2

I/O Tuning .......... 7-1
    Objectives .......... 7-1
    Relevance .......... 7-2
    Additional Resources .......... 7-2
    SCSI Bus Overview .......... 7-3
    SCSI Bus Characteristics .......... 7-5
    SCSI Speeds .......... 7-6
    SCSI Widths .......... 7-8
    SCSI Lengths .......... 7-9
    SCSI Properties Summary .......... 7-10
    SCSI Interface Addressing .......... 7-11
    SCSI Bus Addressing .......... 7-13
    SCSI Target Priorities .......... 7-14
    Fibre Channel .......... 7-15
    Disk I/O Time Components .......... 7-18
        Access to the I/O Bus .......... 7-18
        Bus Transfer Time .......... 7-20
        Seek Time .......... 7-20
        Rotation Time .......... 7-20
        Internal Throughput Rate (ITR) Transfer Time .......... 7-21
        Reconnection Time .......... 7-21
        Interrupt Time .......... 7-21
    Disk I/O Time Calculation .......... 7-22
        Bus Data Transfer Timing .......... 7-23
        Latency Timing .......... 7-23
    Disk Drive Features .......... 7-24
    Multiple Zone Recording .......... 7-25
    Drive Caching and Fast Writes .......... 7-27
        Fast Writes .......... 7-27
    Tagged Queueing .......... 7-29
        Ordered Seeks .......... 7-30
    Mode Pages .......... 7-31
    Device Properties From prtconf .......... 7-32
    Device Properties From scsiinfo .......... 7-34
    I/O Performance Planning .......... 7-36
        Decision Support Example (Sequential I/O) .......... 7-37
        Transaction Processing Example (Random I/O) .......... 7-37
    Tape Drive Operations .......... 7-38
    Storage Arrays and JBODs .......... 7-39
    Storage Array Architecture .......... 7-40
    RAID .......... 7-41
    RAID-0 .......... 7-42

Solaris System Performance Management



    RAID-0 Stripe Sizing .......... 7-44
    RAID-1 .......... 7-46
    RAID-0+1 .......... 7-48
    RAID-3 .......... 7-50
    RAID-5 .......... 7-52
    Tuning the I/O Subsystem .......... 7-54
    Available Tuning Data .......... 7-56
    Available I/O Statistics .......... 7-57
    Available I/O Data .......... 7-60
    iostat Service Time .......... 7-61
    Disk Monitoring Using SE Toolkit Programs .......... 7-62
        The siostat.se Program .......... 7-62
        The xio.se Program .......... 7-62
        The xiostat.se Program .......... 7-63
        The iomonitor.se Program .......... 7-63
        The iost.se Program .......... 7-63
        The xit.se Program .......... 7-64
        The disks.se Program .......... 7-64
    Check Your Progress .......... 7-65
    Think Beyond .......... 7-66

I/O Tuning Lab .......... L7-1
    Objectives .......... L7-1
    Disk Device Characteristics .......... L7-2
    System Behavior Characteristics .......... L7-5
    Reading iostat Report Output .......... L7-7

UFS Tuning .......... 8-1
    Objectives .......... 8-1
    Relevance .......... 8-2
    Additional Resources .......... 8-2
    The vnode Interface .......... 8-3
    The Local File System .......... 8-5
    Fast File System Layout .......... 8-6
        Disk Label .......... 8-6
        Boot Block .......... 8-6
        Superblock .......... 8-7
        Cylinder Groups .......... 8-7
    The Directory .......... 8-9
    Directory Name Lookup Cache (DNLC) .......... 8-10
    The inode Cache .......... 8-12
    DNLC and inode Cache Management .......... 8-14
    The inode Structure .......... 8-16
        inode Timestamp Updates .......... 8-17
    The inode Block Pointers .......... 8-18
    UFS Buffer Cache .......... 8-19
    Hard Links .......... 8-21


    Symbolic Links .......... 8-22
    Allocating Directories and inodes .......... 8-23
    The Veritas File System .......... 8-24
    Comparison of UFS and VxFS .......... 8-26
    Allocation Performance .......... 8-28
    Allocating Data Blocks .......... 8-30
    Quadratic Rehash .......... 8-31
        The Algorithm .......... 8-33
    Fragments .......... 8-34
    Logging File Systems .......... 8-35
    Application I/O .......... 8-36
    Working With the segmap Cache .......... 8-39
    File System Performance Statistics .......... 8-40
    The fsflush Daemon .......... 8-44
    Direct I/O .......... 8-46
        Restrictions .......... 8-47
    Accessing Data Directly With mmap .......... 8-48
    Using the madvise Call .......... 8-49
    File System Performance .......... 8-50
        Comparing Data-Intensive and Attribute-Intensive Applications .......... 8-50
        Comparing Sequential and Random Access Patterns .......... 8-51
    Cylinder Groups .......... 8-52
    Sequential Access Workloads .......... 8-54
    UFS File System Read Ahead .......... 8-56
    Setting Cluster Sizes for RAID Volumes .......... 8-58
    Commands to Set Cluster Size .......... 8-60
        Tuning for Sequential I/O .......... 8-61
        Tuning for Sequential I/O in a RAID Environment .......... 8-61
    File System Write Behind .......... 8-62
    UFS Write Throttle .......... 8-64
    Random Access Workloads .......... 8-66
    Tuning for Random I/O .......... 8-68
    Post-Creation Tuning Parameters .......... 8-70
        Fragment Optimization .......... 8-70
        The minfree Parameter .......... 8-71
        The maxbpg Parameter .......... 8-71
    Application I/O Performance Statistics .......... 8-72
    Check Your Progress .......... 8-74
    Think Beyond .......... 8-75

UFS Tuning Lab .......... L8-1
    Objectives .......... L8-1
    File Systems .......... L8-2
    Directory Name Lookup Cache .......... L8-4
    UFS Parameter Controls .......... L8-8


    The mmap and read/write Programs .......... L8-15
    Using the madvise Program .......... L8-18

Network Tuning .......... 9-1
    Objectives .......... 9-1
    Relevance .......... 9-2
    Additional Resources .......... 9-2
    Networks .......... 9-3
    Network Bandwidth .......... 9-4
    Full Duplex 100BASE-T Ethernet .......... 9-5
        ndd Examples .......... 9-6
        Forcing 100BASE-T Full Duplex .......... 9-7
    IP Trunking .......... 9-8
        Sun Trunking Software .......... 9-9
    TCP Connections .......... 9-10
    TCP Stack Issues .......... 9-12
    TCP Tuning Parameters .......... 9-13
    Using the ndd Command .......... 9-15
    Network Hardware Performance .......... 9-17
    NFS .......... 9-18
    NFS Client Retransmissions .......... 9-19
    NFS Server Daemon Threads .......... 9-21
    The Cache File System .......... 9-23
    Cache File System Characteristics .......... 9-24
    CacheFS Benefits and Limitations .......... 9-26
        Creating a Cache .......... 9-28
        Displaying Information .......... 9-28
    Mounting With CacheFS .......... 9-29
        Using /etc/vfstab .......... 9-29
        Using the Automounter .......... 9-29
    The cachefs Operation .......... 9-30
        Writes to Cached Data .......... 9-30
    Managing cachefs .......... 9-31
        The cachefsstat Command .......... 9-31
        The cachefslog Command .......... 9-31
        The cachefswssize Command .......... 9-31
        The cachefspack Command .......... 9-32
    Monitoring Network Performance .......... 9-33
        The ping Command .......... 9-34
        The spray Command .......... 9-35
        The snoop Command .......... 9-36
    Isolating Problems .......... 9-37
    Tuning Reports .......... 9-38
    Network Monitoring Using SE Toolkit Programs .......... 9-39
        The net.se Program .......... 9-39
        The netstatx.se Program .......... 9-39


        The nx.se Program .......... 9-39
        The netmonitor.se Program .......... 9-40
        The nfsmonitor.se Program .......... 9-40
        The tcp_monitor.se Program .......... 9-40
    Check Your Progress .......... 9-41
    Think Beyond .......... 9-42

Network Tuning Lab .......... L9-1
    Objectives .......... L9-1
    The nscd Naming Service Cache Daemon .......... L9-2
    autofs Tracing .......... L9-4
    cachefs Tracing .......... L9-8

Performance Tuning Summary .......... 10-1
    Objectives .......... 10-1
    Relevance .......... 10-2
    Additional Resources .......... 10-2
    Some General Guidelines .......... 10-3
    What You Can Do .......... 10-4
    Application Source Code Tuning .......... 10-5
    Compiler Optimization .......... 10-6
    Application Executable Tuning Tips .......... 10-7
    Performance Bottlenecks .......... 10-8
    Memory Bottlenecks .......... 10-10
    Memory Bottleneck Solutions .......... 10-11
    I/O Bottlenecks .......... 10-13
    I/O Bottleneck Solutions .......... 10-15
    CPU Bottlenecks .......... 10-17
    CPU Bottleneck Solutions .......... 10-19
    Ten Tuning Tips .......... 10-21
    Check Your Progress .......... 10-23
    Think Beyond .......... 10-24
    Case Study .......... 10-25
        Case Study: System Configuration .......... 10-27
        Case Study: Process Evaluation .......... 10-30
        Case Study: Disk Bottlenecks .......... 10-31
        Case Study: CPU Utilization .......... 10-34
        Case Study: CPU Evaluation .......... 10-35
        Case Study: Memory Evaluation .......... 10-36
        Case Study: Memory Tuning .......... 10-41
        Case Study: File System Tuning .......... 10-47

Performance Tuning Summary Lab .......... L10-1
    Objectives .......... L10-1
    vmstat Information .......... L10-2
    sar Analysis I .......... L10-4
    sar Analysis II .......... L10-8
    sar Analysis III .......... L10-11

System Monitoring Tools Lab .......... LF1-1
    Objectives .......... LF1-1
    Tasks .......... LF1-2
        Using the Tools .......... LF1-2
        Enabling System Activity Data Capture .......... LF1-4
        Installing the SE Toolkit .......... LF1-5
        Installing memtool .......... LF1-6
        Using the prstat Command .......... LF1-7
        Using the sar Command to Read Data From a Saved File .......... LF1-8

Processes and Threads Lab .......... LF1-11
    Objectives .......... LF1-11
    Tasks .......... LF1-12
        Using the maxusers Parameter .......... LF1-12
        Using the lockstat Command .......... LF1-15
        Monitoring Processes .......... LF1-16

CPU Scheduling Lab .......... LF1-21
    Objectives .......... LF1-21
    Tasks .......... LF1-22
        CPU Dispatching .......... LF1-22
        Process Priorities .......... LF1-23
        The Dispatch Parameter Table .......... LF1-27
        Real-Time Scheduling .......... LF1-29
        Processor Sets .......... LF1-30

Cache Lab .......... LF1-33
    Objectives .......... LF1-33
    Tasks .......... LF1-34
        Calculating Cache Miss Costs .......... LF1-34
        Displaying Cache Statistics .......... LF1-35

Memory Tuning Lab .......... LF1-39
    Objectives .......... LF1-39
    Tasks .......... LF1-40
        Process Memory .......... LF1-40
        Page Daemon Parameters .......... LF1-41
        Scan Rates .......... LF1-49

System Buses Lab .......... LF1-51
    Objectives .......... LF1-51
    Tasks .......... LF1-52
        Testing the Hardware Configuration of a System .......... LF1-52


I/O Tuning Lab .......... LF1-57
    Objectives .......... LF1-57
    Tasks .......... LF1-58
        Disk Device Characteristics .......... LF1-58
        System Behavior Characteristics .......... LF1-61
        Reading iostat Report Output .......... LF1-62

UFS Tuning Lab .......... LF1-65
    Objectives .......... LF1-65
    Tasks .......... LF1-66
        File Systems .......... LF1-66
        Directory Name Lookup Cache .......... LF1-68
        UFS Parameter Controls .......... LF1-71
        mmap and read/write .......... LF1-77
        Using madvise .......... LF1-80

Network Tuning Lab .......... LF1-83
    Objectives .......... LF1-83
    Tasks .......... LF1-84
        nscd Naming Service Cache .......... LF1-84
        autofs Tracing .......... LF1-86
        cachefs Tracing .......... LF1-90

Performance Tuning Summary Lab .......... LF1-93
    Objectives .......... LF1-93
    Tasks .......... LF1-94
        vmstat Information .......... LF1-94
        sar Analysis I .......... LF1-96
        sar Analysis II .......... LF1-100
        sar Analysis III .......... LF1-103

Accounting .......... A-1
    What Is Accounting? .......... A-2
    What Is Accounting Used For? .......... A-3
    Types of Accounting .......... A-4
        Connection Accounting .......... A-4
        Process Accounting .......... A-4
        Disk Accounting .......... A-5
        Charging .......... A-5
    How Does Accounting Work? .......... A-6
        Location of Files .......... A-6
        Types of Files .......... A-6
        What Happens at Boot Time .......... A-6
        Programs That Are Run .......... A-7
        Location of ASCII Report Files .......... A-8
    Starting and Stopping Accounting .......... A-9


    Generating the Data and Reports .......... A-11
        Raw Data .......... A-11
        rprtMMDD Daily Report File .......... A-13
        Generation of the rprtMMDD File .......... A-13
        The runacct Program .......... A-18
        runacct States .......... A-19
        Periodic Reports .......... A-21
        Looking at the pacct File With acctcom .......... A-22
        acctcom Options .......... A-23
    Generating Custom Analyses .......... A-25
        Writing Your Own Programs .......... A-25
        Raw Data Formats .......... A-25
        The /var/adm/wtmp File .......... A-26
        The /var/adm/pacct File .......... A-27
        The /var/adm/acct/nite/disktacct File .......... A-28
    Summary of Accounting Programs and Files .......... A-29

IPC Tuneables .......... B-1
    Additional Resources .......... B-2
    Overview .......... B-3
        How to Determine Current IPC Values .......... B-3
        How to Change IPC Parameters .......... B-4
        Tuneable Parameter Limits .......... B-5
    Shared Memory .......... B-6
    Semaphores .......... B-8
    Message Queues .......... B-11

Performance Monitoring Tools .......... C-1
    Additional Resources .......... C-2
    The sar Command .......... C-3
        Display File Access Operation Statistics .......... C-3
        Display Buffer Activity Statistics .......... C-3
        Display System Call Statistics .......... C-4
        Display Disk Activity Statistics .......... C-6
        Display Pageout and Memory Freeing Statistics (in Averages) .......... C-7
        Display Kernel Memory Allocation (KMA) Statistics .......... C-8
        Display Interprocess Communication Statistics .......... C-9
        Display Pagein Statistics .......... C-9
        Display CPU and Swap Queue Statistics .......... C-10
        Display Available Memory Pages and Swap Space Blocks .......... C-11
        Display CPU Utilization (the Default) .......... C-11
        Display Process, Inode, File, and Shared Memory Table Sizes .......... C-12
        Display Swapping and Switching Statistics .......... C-13
        Display Monitor Terminal Device Statistics .......... C-14


        Display All Statistics .......... C-15
        Other Options .......... C-16
    The vmstat Command .......... C-17
        Display Virtual Memory Activity Summarized at Some Optional, User-Supplied Interval (in Seconds) .......... C-17
        Display Cache Flushing Statistics .......... C-20
        Display Interrupts Per Device .......... C-20
        Display System Event Information, Counted Since Boot .......... C-21
        Display Swapping Activity Summarized at Some Optional, User-Supplied Interval (in Seconds) .......... C-22
    The iostat Command .......... C-23
        Display CPU Statistics .......... C-23
        Display Disk Data Transferred and Service Times .......... C-23
        Display Disk Operations Statistics .......... C-24
        Display Device Error Summary Statistics .......... C-24
        Display All Device Errors .......... C-25
        Display Counts Rather Than Rates, Where Possible .......... C-25
        Report Only on the First n Disks .......... C-25
        Display Data Throughput in Mbytes/sec Rather Than Kbytes/sec .......... C-26
        Display Device Names as Mount Points or Device Addresses .......... C-26
        Display Per-Partition Statistics as Well as Per-Device Statistics .......... C-26
        Display Per-Partition Statistics Only .......... C-27
        Display Characters Read and Written to Terminals .......... C-27
        Display Extended Disk Statistics in Tabular Form .......... C-28
        Break Down Service Time, svc_t, Into Wait and Active Times and Put Device Name at the End of the Line .......... C-29
    The mpstat Command .......... C-30
    The netstat Command .......... C-31
        Show the State of All Sockets .......... C-31
        Show All Interfaces Under Dynamic Host Configuration Protocol (DHCP) Control .......... C-33
        Show Interface Multicast Group Memberships .......... C-33
        Show State of All TCP/IP Interfaces .......... C-34
        Show Detailed Kernel Memory Allocation Information .......... C-35
        Show STREAMS Statistics .......... C-36
        Show Multicast Routing Tables .......... C-36
        Show Multicast Routing Statistics .......... C-37
        Show Network Addresses as Numbers, Not Names .......... C-37
        Show the Address Resolution Protocol (ARP) Tables .......... C-38

xviii

Solaris System Performance Management



Show the Routing Tables ............................................... C-38
Show Per Protocol Statistics .......................................... C-39
Show Extra Socket and Routing Table Information ....................... C-40
The nfsstat Command ................................................... C-41
Show Client and Server Access Control List (ACL) Information .......... C-41
Show Client Information Only .......................................... C-42
Show NFS Mount Options ................................................ C-44
Show NFS Information, Both Client and Server .......................... C-45
Show RPC Information .................................................. C-46
Show Server Information Only .......................................... C-47

xix

About This Course


Course Goal
This course introduces performance tuning for most Sun server environments. It presents information relating to most of the major areas of system performance, concentrating on physical memory and input/output (I/O) management.

The Sun environment is quite complex. There are thousands of possible hardware configurations and tens of thousands of software configurations. Because it is not possible to give cookbook-style tuning information for all situations, this course examines how the components operate and how components might affect each other. You can use the knowledge gained from this course to determine what types of problems can arise, how to identify them, and how to alleviate them.
 

Use this module to get the students excited about this course. The strategy of About This Course is to introduce students to the course before they introduce themselves to you and one another. Familiarizing them with the content of the course first gives their introductions more meaning in relation to the course prerequisites and objectives. Use this introduction to the course to determine how well students are equipped with the prerequisite knowledge and skills. The pacing chart on page xxx enables you to determine what adjustments you need to make to accommodate the learning needs of students. This module also states how each student should approach the lab exercises. Make sure that the students are aware of the self-motivation and analysis required during the lab exercises, and that they are willing to discuss issues openly.

xxi

Course Overview
This course begins with an overview of performance monitoring techniques and continues with information about specific operating system and hardware components. This gives you an understanding of what effect a particular tuning change has and whether it is necessary. This course is intended for system administrators and system designers with a background in Solaris Operating Environment (OE) system administration. The course presents the topics of tuning as follows:
- Theory: How it works
- Tools: Monitoring activity
- Rules: Thresholds and comparative analysis
- Adjustments: What can be changed

Note: Recommendations for adjustments are not always included.

The course relies on student participation in the discussions, the practical exercises, and the analysis of various system situations. This approach mirrors the real-life situations in which you will probably find yourself, where you must analyze system activity to determine where performance problems might lie.

xxii


Course Map
The following course map enables you to see the general topics and the modules that support each topic area.

Overview
Introducing Performance Management

CPU
Processes and Threads CPU Scheduling

Memory
System Caches Memory Tuning

I/O
System Buses I/O Tuning UNIX File System (UFS) Tuning

Networks
Network Tuning

Summary
Performance Tuning Summary


xxiii

Module-by-Module Overview
This course contains the following modules:
Preface: About This Course

The preface introduces the typographic conventions of the materials, states the goals and objectives for the course, and, most importantly, provides guidance to students on how to use the course notes and how to apply themselves during the course.

Module 1: Introducing Performance Management

This module introduces the concepts of tuning, tuning tasks, terminology, and the steps involved in tuning a system. It also provides a list of tuning references, describes the various performance monitoring tools available with the Solaris 8 OE and the information they provide, and explains how to inspect and set system tuning parameters.

xxiv


Module 2: Processes and Threads

This module describes the behavior of processes and threads in the Solaris OE. It presents information on their creation and execution, on related tuning issues, and on system timing and the clock thread.

Module 3: CPU Scheduling

This module describes how the system's central processing units (CPUs) set dispatch priorities, how to invoke real-time processing, and how to use processor sets.

Module 4: System Caches

This module describes the generic operation of caches and the specific operation of the hardware caches. Because many system components use caches or operate like them, understanding caches is fundamental to being able to manage system performance.

Module 5: Memory Tuning

This module describes the operation of the virtual and real memory subsystems. Because memory is perhaps the most important component of system performance, its operation is described in detail.


xxv

Module 6: System Buses

This module shows how buses operate, what kinds of buses there are in the system, and what bus characteristics are important to tuning.

Module 7: I/O Tuning

This module describes the characteristics of the small computer system interface (SCSI) bus and disk drives and their interactions. It presents information about redundant array of inexpensive disks (RAID) levels and storage arrays, and looks at the cost of I/O operations to enable the system administrator to quickly identify and locate the source of an I/O problem.

xxvi


Module 8: UFS Tuning

This module explains the structure and operation of the Solaris platform UNIX file system (UFS). It discusses the various UFS disk structures, caches, and allocation algorithms. It also provides a description of the various UFS configuration, logging, and tuning parameters. In addition, this module describes alternative file-system types, specifically the Veritas File System.

Module 9: Network Tuning

This module describes the tuning required for a network, both the physical network and the network applications. It describes tuning the Transmission Control Protocol/Internet Protocol (TCP/IP) stack and NFS.

Module 10: Performance Tuning Summary

This module summarizes the material presented in this course. It offers examples of poor performance and their causes. It also provides some general guidelines for tuning, a brief description of application optimization, and suggestions for dealing with common system performance bottlenecks.


xxvii

Course Objectives
Upon completion of this course, you should be able to:

- Describe the purpose and goals of tuning
- Describe the common characteristics of system hardware and software components
- Explain how the major system components operate
- Explain the tuning report elements and how they relate to system performance
- Identify tuning parameters and understand when and how to set them
- Describe the different types of performance graphs
- Determine whether a problem exists in system hardware configurations
- Identify the primary memory problems and their solutions
- Identify I/O hardware and file system performance problems and their solutions
- Identify network performance problems and their solutions
- Explain CPU scheduling and thread function in the Solaris OE

xxviii


Skills Gained by Module


The skills for Solaris System Performance Management are shown in Column 1 of the following matrix. The black boxes indicate the main coverage for a topic; the gray boxes indicate that the topic is briefly discussed.

Skills Gained (mapped against Modules 1-10):
- Describe the purpose and goals of tuning
- Describe the common characteristics of system hardware and software components
- Understand how the major system components operate
- Understand the tuning report elements and how they relate to system performance
- Identify tuning parameters and understand when and how to set them
- Describe the different types of performance graphs
- Determine whether a problem exists in system hardware configurations
- Identify the primary memory problems and their solutions
- Identify I/O hardware and file system performance problems and their solutions
- Identify network performance problems and their solutions
- Understand CPU scheduling and thread function in the Solaris OE

Refer students to this matrix as you progress through the course to show them the progress they are making in learning the skills advertised for this course.


xxix

Guidelines for Module Pacing


The following table provides a rough estimate of pacing for this course:

Module                               Day     Session
Introducing Performance Management   Day 1   A.M./P.M.
Processes and Threads                Day 2   A.M.
CPU Scheduling                       Day 2   P.M.
System Caches                        Day 3   A.M.
Memory Tuning                        Day 3   A.M.
System Buses                         Day 3   P.M.
I/O Tuning                           Day 4   A.M.
UFS Tuning                           Day 4   P.M.
Network Tuning                       Day 5   A.M.
Performance Tuning Summary           Day 5   P.M.

xxx


Topics Not Covered


This course does not cover the topics shown on the overhead. Many of these topics are covered in other courses offered by Sun Educational Services.
- Capacity planning: Covered in ES-330: Sun Enterprise Cluster Administration and ES-255: Sun Hardware RAID Storage Systems Administration
- General system administration: Covered in SA-238: Solaris 8 System Administration I, SA-288: Solaris 8 System Administration II, and SA-350A: Solaris 2.x Server Administration
- Network administration and management: Covered in SA-389: Solaris TCP/IP Network Administration
- General Solaris OE problem resolution: Covered in ST-350: Sun Systems Fault Analysis Workshop


xxxi

- Data center operations: Covered in SOE-8000: Solaris 8 Operating Environment Essentials
- Disk storage management: Covered in ES-310: Volume Manager With Sun StorEdge

Refer to the Sun Educational Services catalog for specific information and registration.

xxxii


How Prepared Are You?


To be sure you are prepared to take this course, can you answer yes to the following questions?
- Do you have Solaris OE system administration training or experience, and are you able to install software, edit cron tables, start daemons, and perform the lab exercises?
- Do you understand the function of the kernel, and are you able to load kernel modules and modify kernel tunable parameters?
- Do you have basic knowledge of Sun server platforms, and do you understand the role of the CPU, memory, and buses and their relationship to system performance?
- Do you have an understanding of system buses, peripheral buses, types of I/O devices, and the UNIX file system, and are you able to recognize inappropriate configurations and tune the I/O subsystem properly for a given application?
- Do you understand network protocols, and are you able to tune them to maximize the network bandwidth?


xxxiii

Notes:


If any students indicate that they do not meet these requirements and feel uncomfortable about remaining in the course, meet with them at the first break to decide how to proceed. Perhaps they want to take the class at a later date, or perhaps there is some way to get them the extra help needed during the week. You should attempt to resolve this situation with the center manager. It might be appropriate here to recommend resources from the Sun Educational Services catalog that provide training for topics not covered in this course.

xxxiv


Introductions
Now that you have been introduced to the course, introduce yourselves to one another and to the instructor, addressing the items shown on the overhead.


xxxv

How to Use the Course Materials


To enable you to succeed in this course, these course materials use a learning model that is composed of the following components:
- Course map: An overview of the course content appears in the About This Course module so you can see how each module fits into the overall course goal.
- Objectives: What you should be able to accomplish after completing each module is listed here.
- Relevance: This section, which appears in every module, provides scenarios or questions that introduce you to the information contained in the module and provoke you to think about how the module content relates to optimizing system performance.
- Additional resources: Where you can look for more detailed information, or information on configurations, options, or other capabilities, on the topics covered in the module or appendix.

xxxvi


- Overhead images: Reduced overhead images for the course are included in the course materials to help you easily follow where the instructor is at any point in time. Overheads do not appear on every page.
- Lecture: The instructor will present information specific to the topic of the module. This information should help you learn the knowledge and skills necessary to succeed with the exercises.
- Exercise: Lab exercises give you the opportunity to practice your skills and apply the concepts presented in the lecture.
- Check your progress: Module objectives are restated, sometimes in question format, so that before moving to the next module you are sure that you can accomplish the objectives of the current module.
- Think beyond: Thought-provoking questions are posed to help you apply the content of the module or predict the content in the next module.

In addition to the Student Guide materials, a set of lab files and reference HTML documents are provided on each system.
 

IMPORTANT: The HTML pages, which are to be used as reference material during the course, are in the lab1 subdirectory of the course labs directory. If the students launch a browser and open the index.html page in the lab1 directory, it provides the appropriate links to all of the HTML reference pages. Also, advise the students not to read ahead in the HTML pages; they are intended to be read during the relevant lab exercises.


xxxvii

Course Icons and Typographical Conventions


The following icons and typographical conventions are used in this course to represent various training elements and alternative learning resources.

Icons
Additional resources: Indicates additional reference materials are available.

Discussion: Indicates a small-group or class discussion on the current topic is recommended at this time.

Note: Additional important, reinforcing, interesting, or special information.

Caution: A potential hazard to data or machinery.

Warning: Anything that poses personal danger or irreversible damage to data or the operating system.

xxxviii


Typographical Conventions
Courier is used for the names of commands, files, and directories, as well as on-screen computer output. For example:

Use ls -al to list all files.
system% You have mail.

Courier bold is used for characters and numbers that you type. For example:

system% su
Password:

Courier italic is used for variables and command-line placeholders that are replaced with a real name or value. For example:
To delete a file, type rm filename.

Palatino italic is used for book titles, new words or terms, or words that are emphasized. For example:

Read Chapter 6 in User's Guide. These are called class options. You must be root to do this.


xxxix

Notes to the Instructor


Philosophy
The Solaris System Performance Management course has been created to allow for interactions between the instructor and the students as well as among the students themselves. To enable you to accomplish the course objectives easily, and in the time frame given, a set of tools has been developed and support materials created for your discretionary use.

A consistent structure has been used throughout this course; this structure is outlined in the Course Goal section. The suggested flow for each module is:

1. Module objectives
2. Context questions and module rationale
3. Lecture information with appropriate overheads
4. Lab exercises
5. Discussion, either as a whole class or in small groups

To allow the instructor flexibility and give time for meaningful discussions during the lectures and the small-group discussions, a timing table is included in the Course Tools section.

The UNIX operating system is quite complex. This is not always recognized by students attending a tuning class, who often expect a cookbook list of all system tuning parameters and a series of rules for setting them. There are, however, too many options, system configurations, and workloads for any single approach to work in all situations. (Besides, there is no complete list of rules or tuning parameters.) Given the complexity of the system, the large number of interactions and interdependencies, and the vast variety of available hardware, no approach that emphasizes individual situations will be productive.

As a result, the design of this course is concepts over commands. This means that general concepts are introduced and explained, then illustrated with specific examples. The two primary concepts introduced are the bus and the cache. Most system components that affect performance can be modeled after one or both of these. Examples of caches are main memory, the segmap

xl


cache, and a storage array's cache. Examples of buses are the SCSI bus, networks, and system internal buses. When students recognize which component of the system they are concerned with, they can understand tuning reports for that component, and understand the effects of the various tuning changes that could be made to it. The course material explains general concepts, which are applicable to any computing system, and uses examples specific to Sun products and terminology to illustrate them.

When teaching, it is very easy to add more detail to almost every topic. Be careful that you do not begin to run late if you do this. The course material has been selected and sized to fit the available time.

This version of the course material supports the Solaris 8 OE, with references to releases 2.6 and 7.
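The cache concept that anchors the course can be illustrated with the standard effective-access-time model (a generic weighted-average formula, not figures from the course text; the costs below are illustrative assumptions):

```shell
# Generic cache model (illustrative numbers only):
# effective time = hit_ratio * hit_cost + (1 - hit_ratio) * miss_cost
hit_ratio=0.99
cache_ns=100          # cost of a hit, e.g., a memory-cache access, in ns
backing_ns=10000000   # cost of a miss, e.g., a 10 ms disk read, in ns

# Compute the weighted average with awk and round to whole nanoseconds.
eff=$(awk -v h="$hit_ratio" -v c="$cache_ns" -v b="$backing_ns" \
      'BEGIN { printf "%.0f", h*c + (1-h)*b }')
echo "effective access time: $eff ns"
```

With these numbers the result is dominated by the miss cost: dropping the hit ratio from 0.99 to 0.95 multiplies the effective time roughly fivefold, which is why cache-like components figure so heavily in tuning.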

Course Tools
To enable you to follow this structure, the following supplementary materials are provided with this course:
- Relevance: These questions or scenarios set the context of the module. It is suggested that the instructor ask these questions and discuss the answers. The answers are provided only in the instructor's guide.

- Course map: The course map allows the students to get a visual picture of the course. It also helps students know where they have been, where they are, and where they are going. The course map is presented in the About This Course module of the student's guide.


xli

- Lecture overheads: Overheads for the course are provided in two formats. The paper-based format can be copied onto standard transparencies and used on a standard overhead projector; these overheads are also provided in the student's guide. The Web browser-based format is in both HTML and PDF and can be projected using a projection system that displays from a workstation. This format allows the students to view the overhead information on individual workstations, and also allows better random access to the overheads.

- Small-group discussion: After the lab exercises, it is a good idea to debrief the students. Gather them back into the classroom and have them discuss their discoveries, problems, and issues in small groups of four or five, one-on-one, or one-on-many.

- General timing recommendations: Each module contains a Relevance section. This section may present a scenario relating to the content of the module, or it may present questions that stimulate students to think about the content that will be presented. Engage the students in relating experiences or posing possible answers to the questions. Spend no more than 10-15 minutes on this section.

xlii


Module                               Lecture (Min)   Exercise (Min)   Total (Min)
About This Course                    40              -                40
Introducing Performance Management   100             180              280
Processes and Threads                70              90               160
CPU Scheduling                       120             90               210
System Caches                        40              50               90
Memory Tuning                        70              100              170
System Buses                         40              45               95
I/O Tuning                           75              100              175
UFS Tuning                           90              100              190
Network Tuning                       60              80               140
Performance Tuning Summary           60              60               120

- Module self-check: Each module contains a checklist for students under Check Your Progress. Give them a little time to read through this checklist before going on to the next lecture. Ask them to see you about items they do not feel comfortable checking off.


xliii

Changes for Revision C.1


The following changes have been made to this revision:

- The instructor notes have been updated to reflect the Solaris 8 OE.
- Some diagrams have been cleaned up and increased in size to improve legibility.
- Miscellaneous typographical errors have been corrected.
- Several sections have been reworded or expanded for clarity.
- The index has been expanded.
- Solaris 8 OE support has been completed.
- Lab files have been added.
- More instructor notes have been added throughout the material, and some existing ones clarified.
- Options for the different monitoring tools, as well as sample output, have been removed from Module 2 and placed in Appendix C. The default reports for each tool have been more clearly specified, where appropriate.
- HTML reference pages are supplied in the labs/html directory in addition to the lab files.
- Lab exercises include new shell scripts, Java technology programs, and compiled programs.
- PDF documents are supplied as online reference material.
- The emphasis has been placed on the lab exercises, which now include descriptions of the purpose of the lab, what the student is expected to see during the exercise, and questions related to the lab exercises.

xliv


Instructor Setup Notes


Purpose of This Guide
This guide provides general information about setting up the classroom. Refer to the SA400_revC.1_setup.txt file for specific information about how to set up this course.

Projection System and Workstation


If you have a projection system for projecting PDF slides and are planning to use the PDF slides, you need to do the following:

- Install the PDF overheads on the workstation connected to the projection system so you can display them with a browser during lecture. To install them, copy the PDF directory, provided in the SA400_OH directory, to any directory on the overhead workstation machine. Display the overheads in the browser by choosing Open File and typing the following in the Selection field of the pop-up window: /location_of_PDF_directory/SA400_RevC_OH.pdf
- Set up an overhead-projection system that can project instructor workstation screens.

Note: This document does not describe the steps necessary to set up an overhead-projection system, because what is available varies from one training center to another. This setup is the responsibility of each training center.


xlv

Course Files
All of the course files are available from the education.central server. You can use ftp or the education.central Web site, http://education.central/Released.html, to download the files. Either method requires you to know the user ID and password for FTP access; see your manager for these if you have not done this before.

Course Components
This course consists of the following components:
- Instructor Guide: The SA400_IG directory contains the FrameMaker files for the Instructor Guide (the Student Guide with instructor notes). The SA400_ART directory is required for printing this guide.
- Student Guide: The SA400_SG directory contains the FrameMaker files for the Student Guide. The SA400_ART directory is required for printing this guide.
- Art: The SA400_ART directory contains the supporting images and artwork for the Student Guide and Instructor Guide. This directory is required for printing both guides and should be located in the same directory as SA400_IG and SA400_SG.
- Overheads: The SA400_OH directory contains the instructor overheads. There are PDF and FrameMaker versions of the overheads.

xlvi


- Lab files: The SA400_LF directory contains the labfiles subdirectory, a directory containing a compressed tar file of the class student lab files. You will need approximately 50 Mbytes of disk space for the uncompressed lab files directory.


xlvii

Course Labs: An Explanation


The following pages describe the course lab files and how they should be presented to students. The relevance of the lab exercises, the required approach, and the possible outcomes of the exercises are presented. These explanations assist instructors in their course preparation and, specifically, in management of the lab exercise periods.

The design of the course allows instructors to add their own personal lab exercises and anecdotes. In fact, it is expected that each instructor will compile a series of their own lab exercises to complement those provided in the course materials. However, personal lab exercises should only be used to extend the range of exercises, not to replace those provided in the course files. It is, therefore, expected that every instructor will spend time learning how the lab files work (in association with the following documented notes).

Answers to the lab questions are provided in the course materials. If certain tasks are system-specific, this is noted in the Answers module.

xlviii


Course Labs Explanation (Introducing Performance Management)


The lab for this module involves the students in the installation of the SE Toolkit files, setting up the sys crontab file, installing memtool, using a range of performance-related utilities, and answering a series of questions. This lab introduces the students to the manner in which they are expected to work during the course. The instructor should promote the following approaches:

1. Students are expected to be self-motivated. That is to say, they should be willing to refer to man pages, write their own scripts to perform certain tests, and try additional tests during the lab sessions.

2. Instructors are expected to promote the idea of self-motivation. This is partly to assist during the teaching of the class, but also to help the students realize that performance management in a work situation requires a high degree of self-motivation. (There are no rules and very few guidelines to assist them in their daily work.)

3. Students are expected to ask questions. The instructor should make the students aware that questions are expected from each and every student.

4. Not all tools work on all systems. Problems might be encountered, especially when using third-party packages. Because performance-related software might have been developed over several revisions of the Solaris OE, there are no guarantees that third-party software (such as the SE Toolkit programs) will work on all platforms. The instructor should make the students aware that the core set of utilities provided with the operating system (OS) release (such as sar, vmstat, iostat, and netstat) should always work; these are therefore the predominant utilities used within the labs.

5. This lab does not enable the system accounting facilities. However, the sys crontab file is to be uncommented so that the sa utility can capture data for later viewing with sar.
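As a sketch of what uncommenting the sys crontab involves (the entries below follow the standard Solaris layout for /var/spool/cron/crontabs/sys, but verify them on the classroom systems; the example works on a copy, so it is safe to run anywhere):

```shell
# Hypothetical copy of the stock sys crontab
# (the real file lives in /var/spool/cron/crontabs/sys on Solaris).
cat > /tmp/sys.crontab <<'EOF'
#ident  "@(#)sys        1.5     92/07/14 SMI"
# The sys crontab should be used to do performance collection.
#0 * * * 0-6 /usr/lib/sa/sa1
#20,40 8-17 * * 1-5 /usr/lib/sa/sa1
#5 18 * * 1-5 /usr/lib/sa/sa2 -s 8:00 -e 18:01 -i 1200 -A
EOF

# Strip the leading # only from lines that begin with a digit field,
# that is, the sa1/sa2 schedule entries, not the other comments.
sed 's/^#\([0-9]\)/\1/' /tmp/sys.crontab > /tmp/sys.crontab.new

# On a real system you would then load the edited file as the sys user:
#   su sys -c 'crontab /tmp/sys.crontab.new'
grep '/usr/lib/sa' /tmp/sys.crontab.new
```

Once the entries are active, sa1 snapshots accumulate under /var/adm/sa for later viewing with sar.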

About This Course



Course Labs Explanation (Processes)


The purpose of this lab is to help the student understand the relationship between kernel parameters and the operation of the system. This lab also makes use of shell scripts and Java technology programs.

The first part of the lab makes use of the adb and mdb utilities, which are controlled using shell scripts. Changes are made to the /etc/system file, and the changes are monitored using the scripts. The purpose of this part of the lab is to show that shell scripting is a useful skill to have or to learn because it can be applied to repetitive tasks.

A lab using lockstat demonstrates that locking occurs when processes are active. The locks in use are usually not accessible to users but are controlled from within the program being executed. This lab shows how the locking mechanisms can be viewed using the lockstat command.

Monitoring of processes using the ps command is also presented in this lab. Derivatives of the ps command, as provided by the SE Toolkit, are used to compare and contrast with the standard ps command output. During the ps-related labs, the students are requested to view certain values that help to identify how programs can differ in their execution. One important difference, presented during the lab, is the behavior of the Bourne and Korn shells when performing numeric calculations. The fork/exec activity produced by the Bourne shell is a possible cause of performance degradation.

To demonstrate the use of the ps -L command option, a Java program (called ThreadMon) is used. This is invoked by a shell script that also invokes the ps -L command to monitor the Java program's thread activity. This program should help to reinforce the concept of multithreaded programming techniques.

Finally, the lab closes with the catman command to demonstrate the range of process ID (PID) numbers rising to a maximum value and then PID numbers being reused from the lower ranges.
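The shell-arithmetic difference can be shown with a minimal sketch: the backquoted expr form forks and execs a new process on every iteration, while POSIX/Korn built-in arithmetic stays inside the shell. The loop bound here is arbitrary:

```shell
# Bourne-style counting: every iteration forks and execs /usr/bin/expr
i=0
while [ "$i" -lt 100 ]; do
    i=`expr $i + 1`
done
echo "expr loop finished at $i"

# Korn/POSIX built-in arithmetic: no extra processes are created
j=0
while [ "$j" -lt 100 ]; do
    j=$((j + 1))
done
echo "builtin loop finished at $j"
```

Timing the two loops with time(1) on a live system typically shows the expr version consuming far more system CPU, which is the effect the lab asks the students to observe with ps and sar.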

Solaris System Performance Management



Course Labs Explanation (CPU Scheduling)


This lab demonstrates how the Solaris OE schedules processes and how types of processing can affect the system as a whole or other processes. The instructor should try to get the students to think about what processing is done by their own systems. They can then try to relate the methodology presented in the class to their own specific system scenarios.

The lab starts by using two programs, dhrystone and io_bound, to determine the effect of compute-intensive and I/O-intensive operations and how each type of operation might affect the other. Although this lab can be time-consuming, it is a good introduction to the level of analysis that might be required. Also, because the students can only guess at what the programs are doing, this provides a real-world "working in the dark" scenario. The students should be encouraged to use the utilities (sar, vmstat, and so on) to help them deduce what the programs are, in fact, doing. This should help the students become more familiar with the utility output and should show them how to read the values that are displayed.

In this lab, the students also get to change the dispatch parameter table. The students should execute the dhrystone and io_bound programs after changing the dispatch table to see how the changes can affect the processing. Also, the results should be documented (possibly using the script utility or an alternative) so that the values can be compared before and after the change of the dispatch table values. By encouraging the students to document their work, you are promoting good practice that will benefit them in their workplace.

A lab is also provided related to processor sets. This lab can only be attempted if the appropriate hardware is available in the classroom.
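Viewing and replacing the timesharing dispatch table is done with dispadmin. A typical sequence looks like the following; this is an illustrative sketch (run as root), and the file name ts.cfg is only an example:

```
dispadmin -l                      # list the scheduling classes configured
dispadmin -c TS -g > ts.cfg       # dump the current timesharing (TS) table
vi ts.cfg                         # edit time quanta or priority values
dispadmin -c TS -s ts.cfg         # load the modified table into the kernel
```

Saving the original dump before editing makes it easy to restore the default table at the end of the exercise.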




Course Labs Explanation (Caches)


This lab looks at the topic of caches and concentrates on the DNLC cache. One useful addition is to bring to the students' attention that the DNLC cache values are cumulative. That is, if the vmstat output shows a drop in value from 97 percent to 96 percent, this could be interpreted as a 1 percent miss when searching the cache. However, if previous searches had looked up 10,000 file names and the current search was for 100 file names, those 100 file names represent 1 percent of the (accumulated) total. So, a 1 percent drop could actually mean a 0 percent success rate when searching the DNLC cache. Always try to get the students to look behind the values being displayed and to determine a more realistic picture of the information that is being provided to them by utilities such as vmstat.

The lab also presents the use of the prtconf command to display details of the CPU caches that are available. If possible, the instructor should produce prtconf output (as used in the lab exercise) on different hardware platforms. This helps to demonstrate differences, for example, between Ultra workstations and Enterprise servers.
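The arithmetic behind that warning can be sketched with the numbers from the text (all values hypothetical):

```shell
prev_lookups=10000   # cumulative DNLC lookups before the interval
prev_hits=9700       # 97 percent cumulative hit rate
new_lookups=100      # lookups during the interval
new_hits=0           # every one of them missed

total_lookups=$((prev_lookups + new_lookups))
total_hits=$((prev_hits + new_hits))

cum_rate=$((100 * total_hits / total_lookups))   # what cumulative output shows
int_rate=$((100 * new_hits / new_lookups))       # what actually happened

echo "cumulative hit rate: ${cum_rate}%, interval hit rate: ${int_rate}%"
```

The reported figure drops only one point (97 to 96 percent), even though not a single lookup in the interval hit the cache.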




Course Labs Explanation (Memory Tuning)


This lab is one of the most important labs in the course. As such, every effort has been made to make the lab exercises interesting and informative. During the lab, the students should make use of shell scripts, utilities, and a specially written Java technology program.

If the students are not familiar with adb or mdb, this lab helps to demonstrate how these debugging utilities can be applied. Also, the pmap utility is used. The instructor should, therefore, be familiar with these utilities and be willing to demonstrate their use, if required.

It is important, when using the Memory Monitor (a Java technology program), that a number of these programs are invoked. If just one invocation occurs, the program is never switched out by the scheduler. If, however, multiple copies of the program are executing simultaneously, each one is switched in and out of the scheduler table, thereby allowing its process address space to be paged in and out. This produces a good example of memory paging and also demonstrates just how effective the Solaris 8 OE can be at memory management. While the lab is in progress, the students should be running the vmstat command so that output can be monitored as the Memory Monitor programs are executing.

Be careful when changing the scan-related kernel parameters. The instructor should monitor the student activity as this lab progresses.
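A minimal monitoring loop for the paging part of the lab is a plain vmstat interval run; the 5-second interval here is arbitrary:

```
vmstat 5        # first line is a since-boot summary; later lines are live
                # watch: free (free memory, Kbytes), sr (pages scanned/s),
                # pi/po (Kbytes paged in/out per second)
```

A sustained non-zero sr value while the Memory Monitor copies run is the signal that the page scanner has started reclaiming memory.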




Course Labs Explanation (System Buses)


The purpose of this lab is to help students become more familiar with their hardware architecture. The lab requires the use of a serial cable to connect two systems together. If this is not practical or cannot be achieved due to the layout of the class, the instructor should make a decision as to whether this lab should be attempted in full. There are sections of the lab that can be carried out without the use of the serial connection. You might want to limit the lab to these exercises.




Course Labs Explanation (I/O Tuning)


This lab demonstrates how to determine hardware characteristics that affect I/O performance. Also, there is a question session in which the student must match the characteristics of a system to a type (for example, a Web server). The instructor should read through the answers and be prepared to discuss why those characteristics match the particular type of use. For example, many people are surprised when told that a Web server system has both high network activity and high disk activity. However, most Web servers keep logs of who has accessed the Web pages being provided. This log information is written to disk in a non-sequential manner (depending upon which Web pages are being read).

The instructor can also bring backward-referencing into play in this lab. For example, because the Web server is high-intensity read/write, there are many I/O operations occurring on the Web server. How would this affect processes in the scheduling table? The instructor should prepare a series of example questions and answers that bring together the process-related information presented in earlier course modules and the I/O-related information.

The iostat command features heavily in this module. The instructor might want to present extra examples of iostat output (preferably with explanations of what was happening on the system at the time that the report was produced).

There are no lab exercises relating to RAID. However, the instructor is expected to open a discussion relating to RAID and provide examples or references regarding the application of RAID.
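For producing extra examples, a convenient invocation on recent Solaris releases is the extended, named-device report; the 5-second interval is arbitrary:

```
iostat -xn 5      # extended statistics, one line per device, every 5 seconds
                  # watch: %b (percent of time busy), asvc_t (active service
                  # time, ms), and wait/actv (queued and active requests)
```

Capturing a run with the script utility while a known workload executes gives the instructor annotated output to discuss in class.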




Course Labs Explanation (UFS Tuning)


The purpose of this lab is to allow the students to become more familiar with the UNIX File System (UFS) controls and layout. Allowing the students to come to terms with the strategy used by UFS in data storage lets the instructor discuss alternative methods of long-term data storage, why certain file system types might be preferred for certain storage requirements, and how file system-specific options can be applied successfully to enhance performance.

This is one lab for which the instructor should have carried out extensive research into file system design and implementation. It is almost inevitable that students will ask questions such as "Which file system is best for. . ." and the instructor should be prepared to answer such questions.

The lab requires that a disk be available that can be partitioned and that file systems be created in those partitions. Although the students are expected to be familiar with the format command and Solaris disk partitioning, it is advisable that the instructor carefully supervise this set of lab exercises.

The fstyp command produces output related to the file system creation. Not all of the values displayed are described in the man pages for fstyp. The instructor must, therefore, spend some time preparing a list of the items that are not explained in the fstyp man pages.

Depending upon the hardware available in the classroom, the results of this lab could be astounding (in the different values that show for each of the file systems) or inconclusive (where the values are pretty much the same for each file system). In either case, the results are still useful in showing how much the hardware affects performance and that, with good hardware, changes to the file system controls might not have any major effect. Be prepared for disappointment among the students if the results are inconclusive.
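A typical create-and-inspect sequence for the lab looks like this; the device name is hypothetical and the option values are only examples, not recommendations:

```
newfs -i 16384 -m 5 /dev/rdsk/c0t1d0s4   # example: 16384 bytes per inode,
                                         # 5 percent minfree
fstyp -v /dev/rdsk/c0t1d0s4 | head -20   # show the superblock values
                                         # that newfs chose
```

Comparing the fstyp -v output of two file systems built with different newfs options is the quickest way to show the students which superblock fields the options actually change.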




Course Labs Explanation (Network Tuning)


This lab looks at naming service caching and the use of the NFS automounter trace facility. The instructor should advise the students to take readings from the netstat utility at intervals during the execution of the lab. This should help them to identify periods of heavy network activity in the class, and these readings could become the topic of an instructor-led discussion.

Instructors should use their discretion to decide whether to use pairs of systems (one acting as NFS server and the other acting as NFS client) or whether to set up a classroom man-page server that all student hosts will use (as clients). NFS caching is also presented in the final part of this lab.

The use of the audio feature of the snoop utility is intentionally included to bring a little bit of humor into the classroom. The instructor might want to extend the range of lab exercises. However, caution should be applied due to the scope of questioning and investigation that students might want to undertake.




Course Labs Explanation (Summary)


The lab exercises related to the Summary module have been included to allow each student to reinforce what has been learned in the course. A number of sar data files have been included. The student is expected to view the data stored in these files, using the sar utility. The instructor should prepare by trying out each lab exercise prior to running the course so that the instructor is aware of the potential values being displayed and of the answers required to the questions that are posed to the student.

Because this is the final lab of the course, it can also act as a point of reference so that students can assess how much they have learned by attending the course.

A set of case studies is available in Module 10. These case studies are useful material for the student. The case studies are taken from real-life systems. However, the details relate to systems running releases of the Solaris OE prior to the Solaris 8 OE. The content of the case studies is still relevant in terms of the approach taken and the deduction of the performance-related problems.
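Reading a saved sar data file back is a one-line exercise; the file name below is hypothetical (the standard location is /var/adm/sa/saDD, where DD is the day of the month):

```
sar -u -f /var/adm/sa/sa15    # replay CPU utilization from a saved binary file
sar -d -f /var/adm/sa/sa15    # the same file, disk activity report
```

The -s and -e options can be added to restrict the replay to a start and end time of day, which is useful when a question targets a specific interval in the data.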




Introducing Performance Management


Objectives
Upon completion of this module, you should be able to:
- Explain what performance tuning really means
- Characterize the type of work performed on a system to make meaningful measurements
- Explain performance measurement terminology
- Describe a typical performance tuning task
- Describe the purpose of baseline measurements
- Describe how performance data is collected in the system
- Describe how the monitoring tools retrieve the data
- List the data collection tools provided by the Solaris OE
- Explain how to view and set tuning parameters
- Describe the system accounting functions
- Describe the use of the Sun Management Center software

Relevance


Present the following questions to stimulate the students and get them thinking about the issues and topics presented in this module. While they are not expected to know the answers to these questions, the answers should be of interest to them and inspire them to learn the content presented in this module.

Discussion The following questions are relevant to understanding the content of this module:
- Why tune a system?
- How do you know that a system requires tuning?
- How do you know what to tune?

Additional Resources
The following references provide additional details on the topics discussed in this module:

- Cockcroft, Adrian and Bill Walker. Sun Blueprints: Capacity Planning for Internet Services. Palo Alto: Sun Microsystems Press/Prentice Hall, 2001.
- McDougall, Cockcroft, Hoogendoorn, Vargas, and Bialaski. Sun Blueprints: Resource Management. Palo Alto: Sun Microsystems Press/Prentice Hall, 1999.
- Cockcroft, Adrian and Richard Pettit. Sun Performance and Tuning. Palo Alto: Sun Microsystems Press/Prentice Hall, 1998.
- Wong, Brian L. Configuration and Capacity Planning for Solaris Servers. Palo Alto: Sun Microsystems Press/Prentice Hall, 1997.
- Benner, Alan F. Fibre Channel. McGraw-Hill, 1996.
- Catanzaro, Ben. Multiprocessor System Architectures. SunSoft Press/Prentice Hall, 1994.
- Cervone, Frank. Solaris Performance Administration. McGraw-Hill, 1998.
- Goodheart, Berny and James Cox. The Magic Garden Explained. Prentice Hall, 1994.
- Graham, John. Solaris 2.x Internals. McGraw-Hill, 1995.
- Gunther, Neil. The Practical Performance Analyst. McGraw-Hill, 1998.
- Jain, Raj. The Art of Computer Systems Performance Analysis. John Wiley & Sons, 1991.
- Ridge, Peter M. and David Deming. The Book of SCSI. No Starch Press, 1995.
- Solaris 2 Security, Performance and Accounting in the Solaris System Administration AnswerBook.
- SMCC NFS Server Performance Tuning Guide in the Hardware AnswerBook on the SMCC Supplements CD-ROM.
- The Sun Web site http://www.sun.com/sun-onnet/performance.html and related pages.
- Man pages for the various commands (adb, fstyp, kstat, lockstat, mkfs, newfs, ndd, prtconf, prtdiag, ps, sysdef, tunefs) and performance tools (iostat, mpstat, netstat, nfsstat, sar, and vmstat).


What Is Tuning?
As shown in the overhead image, tuning means many things. In general, however, it means making the most efficient use of the resources that the system has for the workload that runs on it. Tuning is aimed at two primary activities:

- Eliminating unnecessary work performed by the system
- Taking advantage of system options to tailor the system to its workload

Almost every tuning activity can be placed into one of these two categories.


Discuss the bullet points with the students. Try to establish a buy-in as to what tuning is. Remind the students that this course is primarily biased towards system monitoring. This is because it is important to first establish whether there is a requirement for tuning to take place.

Tuning can mean different things to users, managers, programmers and system administrators. Tuning is a way to:
- Improve resource usage

  System users might work in an inefficient manner, using inappropriate commands or overloading the system unnecessarily. To identify whether the system resources are being used effectively, constant monitoring might need to be performed. One of the easiest ways to tune the system could involve educating the users about appropriate commands to use. For example, when the ls -l command is used, this can involve disk input/output. If there is no requirement to view the files' permissions, ownership, and so on, then it is more efficient to use the ls command without the -l option.

- Improve user perception of the system

  Users might believe that the system response time is inadequate. However, the users might be unaware of the tasks involved in the processes they are running. You might be required to explain how the system functions and why users are experiencing such response times.

- Make more efficient use of system resources

  The CPU might be idle much of the time. Executing programs might not be sharing memory with other processes on the system.

Discussion Can you think of any other ideas for improving resource usage?

Discussion Can you think of other reasons why user perception might need to be improved?

Discussion Can you suggest why system resources might not be being used efficiently?


- Increase system capacity

  Management might believe that the system should be capable of performing a set of tasks. It might be your responsibility to determine whether the system can actually perform the required tasks, or you might be required to suggest possible system upgrades to achieve the required level of performance. You might, therefore, be required to produce reports for submission to management.

- Adapt the system to the workload

  You might be required to reconfigure the system hardware or set operating system parameters to better match the system's workload. Alternatively, the system might not be used correctly, and you might have to alter the scheduling of programs, user activities, or software development to match the system.

- Lower costs

  All organizations want to see the best return on investment (ROI) in their computer systems. Management personnel might see ROI as a critical factor in their computing strategy, so budgetary constraints will probably affect the choices open to you.

Discussion What skills and knowledge do you need to produce the necessary reports for management?

Discussion How could you prove that the system requires reconfiguration? Why might it be better to alter the workload instead of the system?

Discussion In your opinion, is spending money the simplest tuning option?


Why Should a System Be Tuned?


There are several reasons why a system might need to be tuned. Some of the most important are:

- Meeting service level agreement (SLA) target values
- Making system resources available at specified levels
- Consolidating the system
- Planning the capacity

It is vital that you know what level of performance the system is capable of providing and that you know how to monitor performance to show that agreed-upon levels of performance are being met.


This slide reinforces the need to understand system monitoring. The skill to monitor appropriately and interpret command output correctly cannot and should not be underestimated. Service level agreements are extensively described in the book Sun Blueprints: Capacity Planning for Internet Services by Adrian Cockcroft and Bill Walker, ISBN 0-13-089402-8.


Basic Tuning Procedure


A set of measurements is taken. These act as the baseline against which all future measurements will be compared. The basic tuning procedure is quite simple: locate a bottleneck and remove it. In practice, however, when you remove one bottleneck, another bottleneck will subsequently affect system performance. This process is repeated until performance is satisfactory when the new measurements are compared to the baseline measurements. Generally, that is to say that the current system bottleneck is having minimal impact on the performance of the system. You will never be finished tuning because there will always be a bottleneck. If there were none, infinite work could be completed in zero time. Tuning provides a compromise between cost (the adding of extra hardware) and performance (the ability to get the system's work done efficiently).

Before beginning a tuning project, you should have a realistic goal in mind regarding the definition of satisfactory performance. Establish that goal first; otherwise, expectations can be set that cannot be met, causing even the most successful tuning project to be perceived as a failure.


Conceptual Model of Performance


There are many factors that make a system perform the way it does. These include:
- The workload itself. The workload might be input/output (I/O) bound, central processing unit (CPU) bound, or network intensive. System activity could be a mix of the following: short jobs, long jobs, simple data queries, complex queries, simple text editing, or complex computer-aided design (CAD) simulations. These might occur in any combination, and the mix of activity might change during the day or week.

- Random fluctuations and noise. These are the tasks that cannot be accurately measured or defined. They include system tasks, such as file system management and updating, or intermittent intrusive tasks, such as users running the sar command to analyze system performance.

- Changes to the system itself. These include new hardware components, new network access, new software, new algorithms from patches, or different media.

All of these factors and more work together to make the system a very dynamic entity, constantly changing and acting differently over time. When tuning, you need as much information as possible to make correct decisions about the system's current and future behavior, and about where efficiencies can be obtained.


Where do you start? These initial questions need to be answered before any performance problems can be solved. Although these have been presented earlier in this module, some additional performance questions are:

- What is the business function of the system? What is the system's primary application? Is the system a file server, a database server, or a Web server?
- Who and where are the users? How many users are there? How do they use the system? What kind of work patterns do the users have?
- Who says there is a performance problem? What is slow? Are the end users complaining, or do you have some objective business measure, like batch jobs not completing quickly enough?
- What is the system configuration? How many machines are involved? What version of the Solaris OE is running? What relevant patches are loaded?
- What are the busy processes on the system doing? Which processes are busy? Who started them? How much of the CPU resources are they using? How much memory are they using? How long have they been running?
- What are the CPU and disk utilization levels? How busy is the CPU? What is the proportion of user and system CPU time? How busy are the disks? Which disks have the highest load? What is making the disks busy?
- What is the network name service configuration? How much network activity is there? A slow machine might be waiting for some other system to respond.
- Is there enough memory? Is there a lot of scanning? Are the swap disks busy?
- What changed recently? What changes are on the way? Are you planning on adding more users? Are you planning on upgrading an application?


Where to Tune First


The table in the overhead identifies several aspects (areas) of a system that can be tuned. Use the Theory and Realistic columns to create two lists of priority values. List the priority values from 1 (most improvement possible) to 4 (least improvement).


The order of the lines is 4, 2, 1, 3. This is surprising to most students, but the reason is simple: the closer you get to the application, the more you know about what should be done and the less general the response to a given request needs to be. Most efficiencies can be obtained from the areas about which you are most knowledgeable. In tuning, however, system administrators rarely have access to the database management system (DBMS) and applications, so they tend to concentrate on the other areas exclusively.

One list should be the theoretical list, which rates the priority from the theoretical perspective. The second list should be the realistic list, which is prioritized based on your own personal circumstances and experience. In the second list, for example, if you have no personal experience with application programming and cannot influence the application programmers, the application software will probably come lowest in the priority order, whereas in the first list, the theoretical list, you might want to rate the application software as a high-priority candidate for performance gain.


Where to Tune?
Tuning will be most effective closest to where the work is performed. As system administrators, however, access to application programmers might be restricted. Also, the ability to tune a database management system (DBMS) might be limited or restricted to the use of special tools provided by the DBMS supplier. System administrators invariably are limited to changing the operating system (OS) controls or system hardware.

Tuning at the lower levels is usually an attempt to make the system support the applications as efficiently as possible. This usually involves making trade-offs. Changes that benefit one type of work might have no effect on, or could hinder, another type of work. If you tune the system specifically for a particular type of work, you must be prepared to accept these trade-offs. You also need to understand the characteristics of the applications to know how they use the system before you can tune the system properly to support them.


Performance Measurement Terminology


Terminology used in the tuning literature includes:

- Bandwidth
  - The peak capacity that cannot be exceeded
  - A measurement that ignores overhead
  - The best (perfect) case numbers
  - Something that is never achieved in practice; there is always some inefficiency and overhead

- Throughput
  - How much work gets done in a specified time interval
  - What you really get; that is, how much of the bandwidth is actually used
  - A measurement that is dependent on many factors, including hardware, software, human intervention, and random events
  - A number that is usually impossible to calculate, but can be approximated
  - Maximum throughput, if you are measuring how much work can be done at 100 percent utilization

- Response time
  - How long the user or requester waits for a request to complete
  - Elapsed time for an operation to complete
  - Usually the same thing as latency

- Service time
  - Actual request processing time
  - How long a request takes to perform, ignoring any queueing delays
  - The same as response time when there is no queueing

- Utilization
  - How much of a resource, either aggregate or individual, was used to do the measured work
  - The amount of throughput accomplished by a resource
  - A percentage, usually expressed as a busy percentage

- Baseline
  - A set of measurements that will be used as the basis of future comparisons to determine where performance has changed
  - An agreed-upon starting point that shows system performance measurement levels
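The relationship between response time, service time, and queueing delay can be shown with a one-line calculation (all numbers hypothetical):

```shell
# Hypothetical timings for one request, in milliseconds
service_time=20      # time the resource actually spends on the request
queue_time=30        # time the request waits for the resource first

# Response time is queueing delay plus service time; when nothing
# queues (queue_time is 0), response time and service time are equal.
response_time=$((queue_time + service_time))
echo "response time: ${response_time} ms"
```

In this sketch the user waits 50 ms even though the resource only worked for 20 ms, which is why a busy resource can make the system feel slow without its service time changing.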


Performance Graphs
The three graphs in the overhead show the most common appearance of data obtained from performance measurements. The first two graphs indicate a problem but not its source. You can see that the work performed, measured either as throughput or response time, improves to a certain level, and then begins to drop off. The drop-off might occur quickly or slowly, but the system is no longer able to perform more work efficiently. A resource constraint has been reached. The third graph, a utilization graph, shows how much capacity is available for a given resource. When fully used, the resource is no longer available, and requesters have to queue for access to that resource. When there is a drop-off, as measured by the first two types of graphs, there is a resource that is fully used, as measured by the third graph. Locating this resource and increasing its availability is the goal of tuning.
Introducing Performance Management

Displaying Graphs of Performance Metrics


Producing graphs of performance metrics helps others to understand the data that is being presented to them. Most Solaris OE performance utilities produce large amounts of data in text format; a good example of this is sar. Rather than presenting volumes of data that might be difficult to read and interpret, a graph conveys the details produced by the utility in an easy-to-understand format. Because most utilities produce report output in columns of text, it should be easy to filter this information into a spreadsheet (such as a StarOffice spreadsheet) from which a graph can be produced.
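Filtering columnar tool output into a spreadsheet-friendly form can be as simple as an awk one-liner. The sample lines below are made-up stand-ins for a timestamp, device, and busy-percentage column; the column positions are illustrative, not a fixed sar layout.

```shell
# Convert whitespace-separated report columns to CSV for a spreadsheet.
# The printf lines are canned sample data (time, device, %busy).
printf '%s\n' \
    '09:34:28 sd1 45' \
    '09:35:28 sd1 52' \
    '09:36:28 sd1 38' |
awk 'BEGIN { print "time,device,pct_busy" } { print $1 "," $2 "," $3 }'
```

The resulting CSV loads directly into a spreadsheet for charting.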


Shown below is a portion of the output from the sar command (the -d disk activity report):

09:34:28   device   %busy   avque   r+w/s   blks/s   avwait   avserv
           fd0          0     0.0       0        0      0.0      0.0
           nfs1         0     0.0       0        0      0.0      0.0
           sd1         45     0.6      28     4311      0.0     22.8
           sd1,a        0     0.0       0        0      0.0      0.0
           sd1,c        0     0.0       0        0      0.0      0.0
           sd1,g        2     0.1       0        7      0.0    122.6
           sd1,h       45     0.6      27     4305      0.0     21.1
           sd3         38     0.4       4     4324      0.0     96.7
           sd3,a        1     0.0       0        3      0.0    103.8
           sd3,c        0     0.0       0        0      0.0      0.0
           sd3,d        0     0.0       0        0      0.0      0.0
           sd3,f        0     0.0       0        0      0.0      0.0
           sd3,h       38     0.4       4     4321      0.0     96.4
           sd6          0     0.0       0        0      0.0      0.0

In this format, the data would not make any sense to someone who was unfamiliar with the output from sar. However, when the appropriate portions of the output from sar have been converted into a graph, the information makes more sense. The graph for this data is shown on the preceding page. The graph shows disk throughput for two disks, at one-minute intervals, over a period of five minutes.

On its own, the graph is not meaningful; supportive documentation should always accompany graph results. The documentation should, at a minimum, state:

-  What was being observed
-  Why it was being observed
-  How the data was produced (that is, which utility)
-  A brief analysis of the utility output
-  If necessary, a plan of action to be taken as a follow-up


Sample Tuning Task


Most tuning tasks begin with the complaint, "The system is too slow." This is a very general statement, and it means different things to different people. Before beginning a tuning project, you must identify the problem: Precisely what is too slow? It might be that there is no problem, that the user has unrealistic expectations, or that there really is some type of problem.

When you have confirmed the problem, you must gather relevant system performance information. It might not be possible to quickly identify the source of the problem; in this case, gather and analyze general information to locate likely causes. When this has been done, you must evaluate and choose various solutions for implementation. Some might be as simple as changing a tuning parameter, some might require recompiling the application, and others might require the purchase of additional hardware.


The more that you know about the system and its applications, the more options you have for performance improvements, and the more cost-efficient you can be. Spending money on a new system, software, or components might be the easiest solution, but it is not always the best.

When you have identified the source of the problem, develop and implement a plan to remove the cause. Make sure that you only fix what is reasonable to fix; requests to software vendors to redesign their product might not result in action being taken by the vendor. If you are not sure what a parameter does, do not change it, or be prepared to change it back if it damages your performance. Be very careful; some parameters can affect system integrity, and you should not change them without understanding what they do.

You should also make as few changes to the system at a time as possible. Reevaluate the performance after each change. Some changes might have a negative impact, which might be masked if multiple changes are made at the same time. By making a few changes at a time, you can quickly identify ineffective changes and gather more information on the effects of certain changes.

Remember to document as you go. Documentation is absolutely vital to any performance-management project. The documentation should always include the reasons why decisions were made as well as the technical data.

Finally, after a change, make sure that you continue to monitor the system. You need to ensure that performance is now satisfactory, as well as check that a new bottleneck (and there will always be one) is not causing new problems in another part of the system.


The Tuning Cycle


It has been said that tuning is like peeling an onion; there is always another layer and more tears. As shown in the overhead, the tuning cycle is continuous. For example, fixing a CPU problem might reveal (or create) an I/O problem. Fixing that might reveal a new CPU problem, or perhaps a memory problem. You always move from one bottleneck to another, until the system is (for the moment) performing acceptably. As described earlier, there will always be a bottleneck. When one bottleneck, or utilization problem, has been relieved, another will take its place. Your job is to reduce the impact of the current system bottleneck to such a level that its impact is minimal and the system is performing to acceptable levels.


SunOS Components
The diagram in the overhead image shows the significant components of the SunOS operating system and some of their interactions. Many of these components are examined during the course to see how problems with them can be identified and eliminated. There will also be an in-depth look at some aspects of the hardware, because there are many similarities in the operation of the hardware and the software components. Understanding one OS component can help in understanding and working with the others.


This diagram is intended to show some of the main components of the Solaris kernel. You (the instructor) are expected to explain some of the workings of the kernel, based on this diagram, to the best of your knowledge. Also, you might want to reference this diagram when the appropriate topics are described in later modules. Advise the students if you intend to do so.


How Data Is Collected


Performance monitors report two types of data: samples and counts. The data that they collect is usually not gathered directly by the monitor but by the various OS components. At regular intervals, the monitors access the component data structures and save the measurement information. Depending on the information, the previous measurement might be subtracted from the current one to identify the activity during the interval.

The first output line reported by the Solaris OE tools provides the current (cumulative) values of the counters. This first line should be ignored because it is not reliable summary data. Subsequent lines are accurate because each reports the difference from the previous values.

The monitors have no way to collect information themselves. Therefore, if the kernel routines do not collect it, the information cannot be reported.
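The subtraction step can be sketched with made-up counter snapshots; the numbers below are hypothetical, not real kernel counters.

```shell
# Cumulative counter snapshots (hypothetical), one per sampling interval.
# Per-interval activity = current snapshot - previous snapshot.
# The first snapshot has no predecessor, which is why the first
# reported line of a monitoring tool is unreliable summary data.
printf '%s\n' 10000 10450 10450 11800 |
awk 'NR > 1 { print "interval " NR - 1 ": " $1 - prev " events" } { prev = $1 }'
```

This prints 450, 0, and 1350 events for the three intervals; a flat counter (the repeated 10450) correctly shows as zero activity.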


The kstat Kernel Data Structure


The kstat is a generic kernel data structure that is used by system components to report measurement data. The system might have dozens or even hundreds of kstat structures, with various formats for the different components and reported data. The kstats all have names and can be read either by name or by following a chain. A series of kstat manipulation functions (such as kstat_lookup) is documented in the man pages. The owning system component is responsible for keeping the contents of the kstat current, usually by incrementing the appropriate counters as the measured event occurs.
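On Solaris 8 and later, the kstat chain can also be browsed from the command line with kstat -p, which prints one module:instance:name:statistic line per counter. The lines below are canned stand-ins with hypothetical values, so the parsing can be demonstrated anywhere; on a live system you would pipe the real command output instead.

```shell
# kstat -p output has the form "module:instance:name:statistic<TAB>value".
# Canned sample data standing in for:  kstat -p cpu_stat:0
printf 'cpu_stat:0:cpu_stat0:idle\t873412\ncpu_stat:0:cpu_stat0:user\t120034\n' |
awk -F'\t' '{ split($1, k, ":"); print k[4] " = " $2 }'
```

This prints "idle = 873412" and "user = 120034", pulling the statistic name out of the colon-separated key.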


How the Tools Process the Data


As mentioned earlier, the tools accumulate data over each measurement interval and scale it as appropriate. The difference between the current interval and the previous interval is calculated. When this has been done for all of the data elements being recorded, the data is sent to the specified destination, such as a file.

When processing this data, you can specify intervals that are multiples of the tool's internal recording interval. Be careful when specifying a large interval: spikes and other unusual readings can be smoothed out and lost, causing you to see no problems, even though the problems become visible when shorter intervals are examined. Remember that intervals reported by the different tools, even if they are of the same length, will start at different times, making it difficult to correlate events between reports.

Most tools are themselves intrusive processes; when they access the kernel's process data, they might affect the current system performance.
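The smoothing effect is easy to demonstrate with made-up numbers: a one-interval spike that is obvious in per-second samples disappears when the same data is averaged over a single longer interval.

```shell
# Five 1-second samples of some rate (hypothetical data): one clear spike.
samples='10 12 95 11 9'
# Reported per second, the spike (95) stands out immediately.
# Averaged over one 5-second interval, it is smoothed away:
avg=$(echo "$samples" | awk '{ for (i = 1; i <= NF; i++) s += $i; printf "%.1f", s / NF }')
echo "5-second average: $avg"
```

The 5-second average is 27.4, which looks unremarkable even though one of the seconds ran at nearly four times that rate.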


Monitoring Tools Provided With the Solaris OE


The Solaris OE provides a number of monitoring tools. Each of these tools is briefly introduced in this module. Use of specific command options, and ways of interpreting the data reported by these tools, is presented in the remaining modules where applicable. Other tools are also presented in subsequent modules. When more than one tool reports similar data, a table comparing the data produced by the different tools is provided.


The tools are briefly introduced in this module. Advise the students that the course cannot teach extensive understanding of each tool and its output. The labs provide many opportunities to analyze how the tools are used. Also, there are examples, case studies, and samples in the Hypertext Markup Language (HTML) reference pages that are installed along with the lab files.


The sar Command


The system activity reporter, sar, provides a wide range of measurement information covering most areas of the system. There are 16 different reports, which can be combined, that collect large quantities of simultaneously reported data. The sar command with no options provides CPU activity information. The sar data is usually recorded to a file and then processed later, due to the large amount of available data. The sar command is probably the most commonly used performance monitoring command of all those provided with the Solaris OE.


The vmstat Command


The vmstat (virtual memory statistics) report focuses on memory and CPU activity. It has five different reports, including one that covers a limited set of disk activity numbers. The information that vmstat reports is similar to that reported by sar, but in a few cases, vmstat reports data not provided by sar. It displays averaged rates of kernel counters, which are calculated as the difference between two snapshots of a number. The first line of output is the difference between the current value of the counters and zero. As is explained later, paging I/O has several distinct uses. These are reported by the vmstat tool.


The iostat Command


The iostat (I/O statistics) command reports I/O performance data for disk and terminal (TTY) devices. It also provides CPU utilization information. The iostat command was significantly extended for the Solaris 2.6 OE release and has been further enhanced for the Solaris 8 OE release. As devices, network file systems (NFS file systems), or CPUs are added or removed, iostat displays messages accordingly. The -p option is used to report data at the individual partition level. If you are using the Volume Manager, iostat might not show you enough information about the volumes. Instead, use the vxstat command, which is provided with Volume Manager.


A complete description of the vxstat command is beyond the scope of this course.


The mpstat Command


The mpstat (multiprocessor statistics) command provides activity information for individual CPUs. It reports CPU utilization information and gives the frequency of occurrence for events, such as interrupts, page faults, and locking. The data from mpstat are not usually needed during routine tuning, but can be very helpful if there is unusual behavior or if the other monitoring tools show no indication of the source of a problem. Output from the mpstat report displays one line for each CPU in the system. For the Solaris 8 OE, mpstat has been improved and now provides information on processor set activity. This facility was not available in earlier releases of the Solaris OE.


The netstat Command


The netstat (network statistics) command reports network utilization information, including socket usage, Dynamic Host Configuration Protocol (DHCP) information, ARP information, and traffic activity. The netstat command has been updated for the Solaris 8 OE to include IPv6 protocols. While netstat information can be critical to tuning your network, it can also be difficult to interpret for a system with many network interfaces. The default report provides active socket and window size information. The -s option shows extremely detailed counts for very low-level Transmission Control Protocol (TCP), User Datagram Protocol (UDP), and other protocol messages. This allows for quick identification of problems or unusual usage patterns.


The nfsstat Command


The nfsstat (NFS statistics) command has six options that provide detailed information on a system's NFS activity. It provides both host and client information, which can be used to locate the source of a problem. Also, because remote procedure call (RPC) information is provided, it can report on the network activity of some DBMSs. You can use the nfsstat command to re-initialize the statistical information about the NFS and RPC interfaces that is kept in the kernel. If no options are given, the default command is nfsstat -cnrs, which displays everything and re-initializes nothing. Without options, nfsstat provides a report that lists the various types of messages received by NFS. Every NFS message type is reported.

Note – Refer to the nfsstat man page for details of the available command options.


The Sun Enterprise SyMON System Monitor


The Sun Enterprise SyMON system monitor is available for use on desktop systems and servers running the Solaris 7 OE or earlier. It provides for the remote monitoring of system configuration, performance, and failure notification. It uses the Simple Network Management Protocol (SNMP) to receive much of the data that it reports. SyMON exception reporting is managed by rules, which can be changed and added to by the user. The user can add monitoring and notification of exception conditions specific to an individual site or system. SyMON has been superseded by the Sun Management Center software for server hosts and is no longer relevant for Enterprise systems running the Solaris 8 OE release.
 

This information relates to the SyMON 2.0.1 and earlier releases. If the Sun Management Center software is installed, the installation routine will de-install any previous SyMON packages.


Sun Management Center


As with SyMON, the Sun Management Center software can monitor both local and remote hosts. You can set rules that cause Sun Management Center to report on various states of the systems being monitored whenever the rules apply. The software has the ability to view the current state of a variety of system attributes, such as:

-  User statistics
-  Disk statistics
-  CPU statistics
-  Memory usage statistics


Figure 1-1  The Sun Management Center Window


Other Utilities
There are several other utilities and performance monitors provided by Sun that are useful in identifying performance problems. Each of the following is explained in this module:

-  prex
-  tnfxtract and tnfdump
-  busstat
-  cpustat and cputrack
-  memtool
-  The proc tool and /usr/proc/bin utilities
-  Process Manager
-  System accounting

The SE Toolkit is also described.


The prex Command


According to the man page, the prex command is the part of the Solaris tracing architecture that controls probes in a process or in the kernel itself.


Although working at kernel-level, the prex command is very useful for obtaining in-kernel information that can aid the understanding of higher-level processes.

This means that a command is available to you that enables the reading of process or kernel data with absolute minimum overhead, because those probes would exist anyway. The prex command enables the buffering of the probe data for extraction, at a later time, by the tnfxtract utility. The prex command deals with low-level data, such as the opening of a program thread, and can be used as an alternative to programs such as apptrace and truss. You will probably encounter problems when tracing a 32-bit program, or a script that uses a 32-bit interpreter, in a 64-bit Solaris OE; the prex command will almost certainly dump core under those circumstances.


The tnfxtract and tnfdump Commands


You use the tnfxtract and tnfdump commands to read the buffer information that has been collected by prex and to produce human-readable reports of the traced information or events. If no file name argument is supplied, the information is read, by default, from the /dev/tnfmap file. This file name is symbolically linked to a file in the /devices directory, /devices/pseudo/tnf@0:tnfmap, which is readable and writable only by the root user. These permission settings prevent non-root users from using the tnfxtract and tnfdump commands.


The output of the tnfdump command might appear too complex for some students to comprehend. However, those students with some performance or programming backgrounds might find the tnfdump output extremely useful data with which to work. Ensure that the students are made aware that they do not, necessarily, have to work with data at such a low-level, but that it might be helpful if they work towards the goal of understanding such low-level data. A graphical viewing tool for TNF data, called tnfview, is available for download from the Sun Developer (http://soldc.sun.com) Web site pages.


The busstat Command


The busstat command provides access to bus-related counters. Bus events can be monitored and the relevant Performance Instrumentation Counter (PIC) information can be output. The command is limited to use by the root user and produces output only on supported system architectures. To understand the output of busstat, you need to understand system hardware architectures and device characteristics, so this utility is of most benefit to system engineers rather than system administrators. You use the busstat -l command to see whether any supported devices exist on your system.


The cpustat and cputrack Commands


You use the cpustat command to monitor the behavior of each CPU on a multiprocessor system.


The cpustat command does not function on single-processor workstation systems.

You use the cputrack command to monitor individual processes (or families of processes) running on the system. Both cpustat and cputrack can run simultaneously. Due to the potential sampling conflicts that could arise, the cpustat command is limited to root's use only. Both commands only function on supported hardware architectures.


The /usr/proc/bin Directory


The Solaris 2.5.1 OE added a new set of utilities to the /usr/proc/bin directory. These utilities interface with the /proc file system, allowing detailed process information to be reported. For compatibility reasons, the Solaris 8 OE still provides the /usr/proc/bin directory. All file names contained in that directory are symbolically linked to their counterpart files in /usr/bin. Among the commands provided are ptree, which prints an indented listing showing the relationship of a process's ancestors and children, and pfiles, which provides some (rather cryptic) information on the files used by a process. All of the /usr/proc/bin commands are documented on the proc man page.


Ask the students to use the ls -li command on /usr/proc/bin and /usr/bin/p* to see the differences between the two versions of the commands.


The Process Manager


You can use the Process Manager to quickly locate a problem process and track current system resource usage. It has been supplied as a standard utility since the Solaris 7 OE release. The Process Manager is started through the Tools submenu of the Common Desktop Environment (CDE) Workspace menu. The Process Manager displays data retrieved from the ps command and the /proc file system. It displays:

-  CPU usage
-  I/O usage
-  Memory usage
-  Process credentials
-  Tracebacks and traces
-  Scheduling parameters

The Process Manager can invoke several of the proc(1) commands for a process, as well as apptrace and kill. Process information can be sorted by any available category. The Process Manager replaces proctool, a utility that was provided for a long time as an unsupported tool by the Sun Enterprise Services organization.


The memtool Package


The memtool package was developed to provide a more in-depth look at memory allocation on a Solaris OE system. It is not included with the Solaris 8 OE. Earlier versions were named bunyip.


The current version is included with the class lab files for this course.

The memtool package provides real-time graphic reports on the current memory usage by processes, files, and shared libraries. The displayed information can be filtered by different criteria.

Several command-line, character, and graphical user interface (GUI) tools are provided with the memtool package. Table 1-1 lists the tools.

Table 1-1  memtool Utilities

Tool      Interface                     Description
pmem      Command-line                  Process memory map and usage
memps     Command-line                  Dumps process summary and UFS
memtool   GUI                           Provides a comprehensive GUI for UFS and process memory
mem       Curses user interface (CUI)   Provides a curses interface for UFS and process memory

Output of pmem is similar to the pmap command, which is supplied as a standard Solaris OE utility (see pfiles page 1-42 and /usr/proc/bin page 1-42).


The SE Toolkit
The SE Toolkit is a customized C interpreter that contains a large number of scripts. The scripts are written in the SE language, which is also known as SymbEL, and provide many examples of how to write simple performance tools. The SE Toolkit is a rules-based software package that provides a set of easily modifiable performance monitoring tools. The package is available from the http://www.setoolkit.com Web site. When installed, the SE tools can be configured to offer notices and repair suggestions as they detect problems. If run by root, one of the utilities (virtual_adrian.se) will actually change some of the system tuning parameters as conditions require. Remember that it might be necessary to make changes to the default rules provided with the SE Toolkit to adapt it to your system configuration and workload.


The SE Toolkit is provided with the course lab files, in the lab1 directory.

System Performance Monitoring With the SE Toolkit
One of the SE Toolkit programs is a GUI front end that displays the state of your system. The program is called zoom.se and a typical display is shown in Figure 1-2.

Figure 1-2  The zoom.se Display Window

The zoom.se monitor displays the status of the primary components of a system using color-coded states and action strings. The status colors and action strings are based on a set of performance rules that are coded into the SE Toolkit.

Note – It is recommended that you include the /opt/RICPse/bin directory in your PATH variable in order to execute the se command.


Color-coded states, action strings, and rule construction are explained in Sun Performance and Tuning, by Adrian Cockcroft and Richard Pettit.

SE Toolkit Example Tools
The SE Toolkit has several tools. Source code for each of the tools is provided. Table 1-2 lists some of the example tools by function.

Table 1-2  SE Programs Grouped by Function

Function            Example SE Programs
Rule monitors       cpg.se, live_test.se, monlog.se, mon_cm.se, percollator.se, virtual_adrian.se, virtual_adrian_lite.se, zoom.se
Disk monitors       disks.se, iomonitor.se, iost.se, siostat.se, xio.se, xiostat.se, xit.se
CPU monitors        cpu_meter.se, cpus.se, mpvmstat.se, uptime.se, vmmonitor.se
Process monitors    msacct.se, nproc.se, pea.se, ps-ax.se, ps-p.se, pwatch.se
Network monitors    net.se, netmonitor.se, netstatx.se, nfsmonitor.se, nx.se, tcp_monitor.se
Clones              iostat.se, nfsstat.se, perfmeter.se, uname.se, vmstat.se, xload.se
Data browsers       aw.se, infotool.se, multi_meter.se
Contributed code    anasa, dfstats, kview, syslog.se, systune, watch
Test programs       collisions.se, dumpkstats.se, kvmname.se, net_example, pure_test.se


System Accounting
The system performance monitors provide information on the state of the system resources, not on the activity of individual processes. To get individual process information, such as memory and CPU usage, run times, and I/O counts, you must use system accounting. Accounting is not enabled by default, due to its processing overhead, and must be enabled specifically when desired. The accounting system collects large amounts of data, which then must be processed to extract and format the information of interest. Accounting data can be very useful in determining the most frequently used or the most expensive (in terms of resource usage) programs in your environment, to help focus your tuning efforts.


Inform the students that system accounting is not to be enabled in the course labs. The accounting system would create too much noise on the system and possibly have a negative impact on the running of the lab exercises.


Viewing Tuning Parameters


Several utilities provide the ability to look at the current values of system tuning parameters. The sysdef command, near the end of its output, provides a list of approximately 60 of the most commonly used tuning parameters. The debugger utilities, mdb and adb, enable you to look at (and alter) the values of the tuning parameters. While not very user-friendly, they allow the inspection of the running system. Use the mdb utility for debugging in a Solaris 8 OE. The ndd utility enables you to look at and change the settings of network device driver and protocol stack parameters. Most of these parameters are undocumented, so use caution when changing them. Changes to these parameters can also be made in /etc/system. The /usr/ccs/bin/nm utility displays the current name list in an object file.


Setting Tuning Parameters


Changes to tuning parameters are usually made in the /etc/system file. The /etc/system file is read just after the kernel has been loaded into memory. The parameters specified in it are set in the appropriate symbol location in the kernel (and other kernel-resident modules), and system initialization continues. You must reboot the system for changes to /etc/system to take effect.

Changes can also be made to a running system using mdb. Not all parameters can be changed this way, however. If a parameter is used at boot time to specify the size of a table, for instance, changing it after the boot will have no effect. Without a knowledge of system internals, it can be difficult to determine whether the change will be seen. Network device and protocol stack parameters can be dynamically changed using ndd and changed permanently by using the /etc/system file.


The /etc/system File


You use the /etc/system file to set the values for kernel-resident tuning and option parameters. Values can be specified for parameters in modules other than the kernel by specifying the proper module name in front of the parameter name. If no module name is given, a parameter in the kernel is assumed. It does not matter in which of the kernel files (unix or genunix) the parameter resides.

If the /etc/system file is changed incorrectly, the system might be unable to boot. Non-existent tuning parameters cause warning messages, and syntax errors can prevent the system from booting. If this occurs, boot the system with the -as option, and specify a file, such as /dev/null, as the location for the /etc/system file. When you reach single-user mode, correct the errors and reboot.
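A minimal sketch of the file's syntax follows; the parameter names are real Solaris tunables, but the values are placeholders chosen for illustration, not recommendations:

```
* Comment lines in /etc/system begin with an asterisk.
* A kernel parameter (no module prefix, so the kernel is assumed):
set maxusers = 256
* A parameter in a named module, using the module:parameter form:
set ufs:ufs_ninode = 10000
```

A typo in a set line can prevent the system from booting, which is why the -as recovery procedure is worth remembering.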


Check Your Progress
Before continuing on to the next module, check that you are able to accomplish or answer the following:

- Explain what performance tuning really means
- Characterize the type of work performed on a system to make meaningful measurements
- Explain performance measurement terminology
- Describe a typical performance tuning task
- Describe the purpose of baseline measurements
- Describe how performance data is collected in the system
- Describe how the monitoring tools retrieve the data
- List the data collection tools provided by the Solaris OE
- Explain how to view and set tuning parameters
- Describe the system accounting functions
- Describe the use of the Sun Management Center software


Think Beyond
Ask yourself the following questions:
- Are there areas that will never cause a bottleneck?

  No, all system resources, including the CPU, memory, and the I/O subsystem, can cause a bottleneck.

- Why bother to tune a system at all?

  You tune a system to get the most benefit from available system resources.

- Are default system parameters geared toward desktop systems or servers? Why?

  Default parameters are usually geared toward servers with the Solaris 8 OE release. Earlier releases were geared toward desktop systems.

- How long should a tuning project take? How do you know?

  Tuning is constant. A system that performs well today might not perform well tomorrow due to additional users, additional applications, or changes in system workloads. You need to continually monitor your system.

- How can you determine a realistic and satisfactory performance goal ahead of time?

  Let the users determine what is satisfactory. Measure throughput and utilization levels to ensure you are getting the most out of your system resources.

- What other tools might the Solaris OE need?

  Software analysis, long-term data collection, network analyzers, and so on.

- Why do the tools seem to overlap so much?

  Because the tools have been developed and enhanced over a number of years. Older programs are still provided, for compatibility reasons, when new tools are introduced.

- Can you write your own tools?

  Yes, although there are many tools already available.

- Can an industry-standard set of tools be developed?

  Yes. There are many third-party products available on the market that apply to the Solaris OE and other operating environments.


System Monitoring Tools


Objectives
Upon completion of this lab, you should be able to:
- Execute performance monitoring tools native to the Solaris OE
- Start system activity data capture
- Install the SE Toolkit
- Install the memtool package
- Use the prstat command
- Use the sar command to read data saved to a file

Appendix C, Performance Monitoring Tools, contains details of some of the utilities and how to read the data output by those utilities. You might want to ask the students to read through Appendix C before attempting these labs.


Using the Tools
This lab exercise relates to the use of Solaris tools. During the lab, you are expected to be logged in as the root user. If you experience any difficulty, bring this to the attention of the instructor.

Complete the following steps:

1. Change to the lab1 directory for this lab.

2. List which tools you would use to look at the following data elements (tick the appropriate boxes in Table L1-1):

Table L1-1 Data Elements and Tools

   Data Element                 sar   iostat   vmstat   netstat   nfsstat
   Tape device response time
   Free memory
   Per CPU interrupt rates
   Network collision rates
   UFS logical reads
   NFS read response time
   I/O device error rates
   Semaphore activity
   Active sockets

3. Read the man pages for sar and netstat to see how many options exist for both commands. Do not worry about understanding all the options.

4. Open two terminal windows on your workstation. Start iostat in one window and vmstat in the second window. Use a 15-second interval, and let the output display on the screen.


5. Open another terminal window, and run the load3.sh script, which is in your lab1 directory. When the script finishes, look at the script. Based on what it does, does the output from the tuning tools that you were running in Step 4 make sense?

6. Stop the tuning tools that you started in Step 4.


Enabling System Activity Data Capture
The Solaris OE allows system accounting to take place. The volume of data that is collected by system accounting can be extremely large, and the intrusive nature of the accounting processes affects other lab exercises. However, as part of the accounting package, a crontab file for the sys user is installed. This file is preconfigured to invoke system activity collection that can, at a later time, be displayed using the system activity reporting (sar) utility.

1. As the root user, issue the following commands:

   # EDITOR=vi export EDITOR
   # crontab -e sys

   This starts a vi session with the sys account's crontab file. Uncomment each of the command lines in this crontab file, and then save and exit from the vi session. The crontab file is passed to the cron utility when the file has been saved.

2. Now, issue the following commands:

   # vi /etc/init.d/perf
   ...<file edit takes place>...
   # /etc/init.d/perf start

   This starts a vi session with the performance run-control script. At the end of the file, there are lines that should be uncommented so that the /usr/lib/sa/sadc command is invoked at boot time.


Installing the SE Toolkit
The SE Toolkit is provided in the perf directory of your student account. To install it on your lab system, as root:

1. Change to the lab1 directory.

2. Uncompress the toolkit installation files using:

   # unzip setoolkit.zip

3. Add the toolkit package using:

   # pkgadd -d .

   Add the RICHPse package using appropriate options.

4. Modify your PATH variable, in the $HOME/.profile file, to include the path name /opt/RICHPse/bin.

5. Test the installation by executing the following series of commands:

   # . $HOME/.profile
   # cd /opt/RICHPse/examples
   # se zoom.se

Note – The first of these three command lines is used to source the change made to the PATH variable in the $HOME/.profile file.


Installing the memtool Package
The memtool package is provided in the memtool directory of your student account. To install it on your lab system, as root:

1. Change to the memtool directory.

2. Uncompress and untar the toolkit installation files using:

   # zcat RMCmem3.9.1Beta1.tar.Z | tar -xvf -

3. Add the memtool package using:

   # pkgadd -d .

4. Respond y to all prompts about commands to be started at boot time. When asked if you want to load the kernel module now, respond yes. When asked if you want to support the development efforts, you should respond no.

5. Verify that the kernel module was successfully loaded:

   # modinfo | grep -i memtool
   209 102bff10 3659 1 bunyipmod (MemTool Module ..<date & time>..)


Using the prstat Command
The prstat command is provided with the Solaris 8 OE. Unlike the ps command, it runs continually until terminated by the user, and displays a list of all processes on the system. The listing is restricted to the current terminal display. The prstat command can display the process listing in sorted order and provides a number of options to sort by appropriate values. To view the information produced by prstat, follow these steps:

1. Issue the command:

   # prstat

   Observe the current list of processes in the display. After a period of 30 seconds, quit from the program by typing the letter q.

2. Read the man page for prstat using the command:

   # man prstat

   Evaluate which options should be used to sort the display output by using the information in Table L1-2:

Table L1-2 Sort Key Values and Options

   Sort Key Value             Options
   CPU usage
   Size of process image
   Execution time

   When you have evaluated which options should be used, invoke the command using each of the options. Observe the differences in the display.

   Which process is using (or has used) most CPU resource?
   __________________________________________________

   Which process has the largest process image size?
   __________________________________________________

Using the sar Command to Read Data From a Saved File
The sar command can save data to a file for reporting at a later time. In the lab1 directory is a file called sar.out. This file contains a set of data captured by the sar program. First, read the entire contents of the file, using sar, to see how that data is displayed. Second, extract specific types of data, such as CPU or buffer activity, to show that subsets of the data can be reported from the data contained in the file.

1. Change the directory to the lab1 directory.

2. Invoke the sar command to read all of the data in the file, using the command:

   # sar -A -f sar.out | more

3. Use the appropriate keyboard instructions for more to page through the data that is being displayed.

4. After you have seen the entire display output, issue the following commands, individually, to display subsets of the sar data:

   # sar -u -f sar.out | more
   # sar -b -f sar.out | more
   # sar -q -f sar.out | more

   Was the CPU busy at the time when the run queue was occupied?

   Were the buffers busy at the time when the run queue was occupied?


Processes and Threads


Objectives
Upon completion of this module, you should be able to:
- Describe how a process commences its execution cycle
- Describe the life cycle of a process in the SunOS environment
- Explain multithreading and its benefits
- State the limits of the SunOS process scheduler
- Describe locking and its performance implications
- List the functions of the clock thread
- Describe how system interrupts affect process execution
- Use appropriate utilities to display process-related and thread-related information
- Describe the related tuning parameters


Relevance


Present the following questions to stimulate the students and get them thinking about the issues and topics presented in this module. While they are not expected to know the answers to these questions, the answers should be of interest to them and inspire them to learn the content presented in this module.

Discussion – The following questions are relevant to understanding the content of this module:

- What is a thread? Why is it used?
- What is tunable for processes and threads?
- Can any tuning be done without programmer intervention?

Additional Resources
Additional resources – The following references provide additional details on the topics discussed in this module:

- Graham, John. Solaris 2.x Internals. McGraw-Hill, 1995.
- Cockcroft, Adrian, and Richard Pettit. Sun Performance and Tuning. Palo Alto, California: Sun Microsystems Press/Prentice Hall, 1998.
- McDougall, Richard, Adrian Cockcroft, Evert Hoogendoorn, Enrique Vargas, and Tom Bialaski. Sun BLUEPRINTS Resource Management. Palo Alto, California: Sun Microsystems Press/Prentice Hall, 1999.
- Man pages for the various commands (acctcom, clock, lastcomm, lockstat, and ps) and a performance monitoring tool (sar).
- Mauro, Jim, and Richard McDougall. Solaris Architecture and Internals. Sun Microsystems Press/Prentice Hall, 1999.


Process Lifetime
Most processes in the system are created by the fork system call. The fork call makes a copy (the child) of the calling process (the parent) in a new address space in virtual memory.


Students might point out that there is more than one fork call (fork, fork1, and vfork). This information is too advanced to explain at this point in the course. However, if a student does mention the alternative fork calls, you are now ready for them. For details on the fork system calls, refer students to the Solaris internals book by Jim Mauro and Richard McDougall.

The child then usually issues an exec system call, which replaces the newly created process address space with a new executable file. The child continues to run until it completes. On completion, the child process issues an exit system call, which returns the child's resources to the system. While the child is running, the parent either waits for the child to complete, by issuing a wait system call, or continues to execute. If it continues, it can check occasionally for the child's completion, or the system can notify it when the child exits.


If the parent process is waiting for the child to complete, it usually forces itself into a WAIT or SLEEP state. When the child process exits, the kernel process manager sends an appropriate signal to the parent process so that it can be put back into a RUN state in the process queue. At the time the child process exits, the child process changes from a system RUN state (SRUN) into a ZOMBIE (defunct) state. It stays in this state until the parent acknowledges receipt of the appropriate signal or the parent process terminates. There is no practical limit to the number of fork requests a process can issue (except where governed by system kernel parameters or availability of resources). The child process can also issue fork requests.

2-4

Solaris System Performance Management


Copyright 2001 Sun Microsystems, Inc. All Rights Reserved. Enterprise Services, Revision C.1

Process States
Processes undergo many changes of state while they exist in the process table. For example, a process might be running and find it has to wait for data to arrive in an I/O operation. The process either puts itself into a wait state or it might be forced into a wait state. Such a change of state is referred to as a context switch.


The concept of a context switch is an important concept for the students to understand. The slide shows the various states of a process and should help to lead discussion into what context switches are and when they could occur during a process execution.

While a process exists in the process table, that process might not be in a runnable state for much of the time. A good example of this is a process that requires data from disk storage. Much of the time that the process is in the process table can be spent waiting for I/O activity to occur.


This slide is not intended as a complete view of all possible processing states. There are other states that might not be shown but the most important are included in the diagram. For complete details regarding Solaris-specific process states, there are Solaris internals books listed in the Additional Resources entries that provide excellent reference materials.


A context switch can be caused by any of the following:
- A wait on I/O is required
- The process thread runs out of allocated CPU time
- An interrupt signal is received from another process
- The process spawns a child process
- The process receives a wake signal on the exit of a child process
- Notification of a hardware interrupt is received
- A sleep state is invoked by the process itself

When a process is context switched, this incurs some time overhead within the kernel's process scheduler and consumes a small amount of central processing unit (CPU) time while the scheduler is processing the context switch appropriately. If the number of context switches on the system is very high, this can cause a significant loss of actual processing time, because the CPU spends a large proportion of its processing time handling context switches rather than executing actual processes.


Inform the students that they should determine, as part of their system baseline measurement, what a reasonable number of context switches is considered to be.


Process Performance Issues


The system, by default, allows a maximum of 30,000 simultaneous processes. (Setting the number of processes and the use of pidmax is explained later in this module.) You can also save memory by remembering that executables that use shared libraries require fewer system resources (memory, swap, and so on). This is described in Module 5, Memory Tuning. If your program needs multiple processes, consider using threads. As explained later in this module, threads can save resources and improve performance. Do not be concerned about the resource use of zombie processes. Other than occupying an available process table slot, they consume no other resources. However, if you have a large number of zombie processes in the process table, this could limit the capability of new processes to start execution.


Process Lifetime Performance Data


The sar command is an easy-to-use performance monitor that provides process lifetime performance data. The most useful of the available information is shown in the overhead. Sample output of the sar -v command is shown below. The -v option reports the status of process, inode, and file tables.

# sar -v
SunOS trng450 5.8 Generic sun4u    09/05/00

10:20:00  proc-sz   ov  inod-sz       ov  file-sz  ov  lock-sz
10:40:00  83/30000   0  46029/128512   0  651/651   0  0/0
11:00:00  89/30000   0  46115/128512   0  728/728   0  0/0
...
13:00:00  87/30000   0  46178/128512   0  711/711   0  0/0
13:20:00  85/30000   0  46182/128512   0  704/704   0  0/0
13:40:01  85/30000   0  46183/128512   0  704/704   0  0/0
14:00:00  331/30000  0  46183/128512   0  947/947   0  0/0
...
#


This output contains the following fields related to processes:

- proc-sz – The number of process entries (proc structs) currently being used and the number of processes allocated in the kernel (max_nprocs).

- inod-sz – The total number of current inodes in memory compared to the maximum number of inodes allocated in the kernel. This maximum value is not a strict high-water mark; it can overflow.

- file-sz – The number of files that have data stored in the Directory Name Lookup Cache (DNLC) in kernel space. Like the inod-sz column, the value in this column grows until it reaches a (soft) maximum number.

Note – The second value in the proc-sz column is the highest PID number that can be assigned in the process scheduler table. After that number is reached, new processes start with the first unused (freed) PID number starting from minpid.

Sample output of the sar -c command is shown below. The -c option reports system call information.

# sar -c
SunOS trng450 5.8 Generic sun4u    09/05/00

10:20:00  scall/s  sread/s  swrit/s  fork/s  exec/s  rchar/s  wchar/s
10:40:00      535       33       28    4.72    4.69     4762     3692
11:00:00     1355       61       32    0.78    0.77   111768    72860
11:20:00     5918       43      396    0.54    0.51     4007   210661
11:40:00     2027       55      168    0.55    0.52     5274   167337
12:00:00     2306      120      462    1.82    2.90   246505   716240
12:20:00     1634       38      125    0.52    0.49   128115   146710
12:40:00      155       10        7    0.51    0.48     2345     1570
13:00:00      519       56       46    0.58    0.55    24537    14388
13:20:00      382       56       35    0.57    0.57     5311    14949
13:40:01      721       76      116    0.67    0.61   193913   187606
14:00:00      325       39       26    0.74    0.70     6458     5273

Average      1443       53      131    1.09    1.16    66640   140127


This output contains the following fields related to system calls:

- scall/s – All types of system calls per second.

- fork/s – The number of fork system calls per second. This number varies in your baseline depending on the type of system (architecture/hardware) being monitored and the processing that is taking place on the system. However, you should investigate fluctuations from the baseline value.

- exec/s – The number of exec system calls per second. If exec/s divided by fork/s is greater than three, look for an inefficient PATH variable or improper invocation of programs.

The example shows high values in the fork/s and exec/s columns. This example was taken when the catman program was being run. This program creates numerous forked subprocesses during its execution. If you are aware that the catman program is executing, you can put these resulting values into perspective.

# sar -c 5 5
SunOS earth 5.8 Generic sun4m    09/07/00

14:47:04  scall/s  sread/s  swrit/s  fork/s  exec/s  rchar/s  wchar/s
14:47:09      692       55       61    4.59    4.39   122271    94089
14:47:14      732       56       54    4.99    4.99   126507    88047
14:47:19      317       41       68    0.60    0.80   115335   119367
14:47:24      748       56       52    4.99    4.99   103078    64548
14:47:29     1106       68       38    8.78    8.78   108070    42921

Average       719       55       55    4.79    4.79   115052    81794
#

You can see that the calculation of exec/s divided by fork/s does not produce a value greater than 3, even though the system is actively creating new processes.


Process-Related Tunable Parameters


The majority of tuning parameters related to the process lifetime are scaled from the maxusers parameter by default. The problem of scaling the size of several kernel tables and buffers is addressed by this single sizing variable. In the past, the maxusers parameter was related to the number of time-sharing terminal users that the system could support. Today, there is no direct relationship between the number of users a system supports and the value of maxusers; increases in memory size and the complexity of applications require much larger kernel tables and buffers for the same number of users. The maxusers kernel parameter, in the Solaris 8 OE, is automatically set using the physmem variable. The minimum limit for the maxusers parameter is 8, and the maximum default setting is 2048, corresponding to systems with 2 Gbytes or more of RAM. You can manually set maxusers in /etc/system, but this is checked and limited to a maximum of 4096.

Processes and Threads


Copyright 2001 Sun Microsystems, Inc. All Rights Reserved. Enterprise Services, Revision C.1

2-11

2
The max_nprocs parameter is scaled from maxusers, and specifies the number of processes that the system might have simultaneously active. While setting max_nprocs too high does not waste memory, lowering it might provide some protection against system overutilization. The maxuprc parameter is the maximum number of processes a single user can create. The root user is not limited by this parameter. The value is usually set far too large, but unless you have a runaway program, there should be no need to change it. The ndquot parameter specifies the number of disk quota structures to build. You can ignore this value if you are not using disk quotas; if you are using disk quotas, check the quot man page to determine its proper value. The ncsize and ufs_ninode parameters are used to specify the number of entries in the DNLC and UFS inode cache, respectively. These parameters are explained in Module 8, UFS Tuning. For Solaris 2.6 and 7, use the npty and pt_cnt parameters to specify the number of SunOS 4 Berkeley Software Distribution (BSD) and SunOS 5 (System V) pseudo-terminals, respectively, to create on the system. Over 3,000 have been tested and more than 8,000 have been created; there is no practical limit to these devices. If you increase the number of either, you must use the boot -r command before the new pseudo-terminals are usable. By default, the value of these two parameters is scaled from the maxpid value. However, in the Solaris 8 OE, the npty and pt_cnt parameters should be set to zero, as they are automatically scaled appropriately for the system.

2-12

Solaris System Performance Management


Copyright 2001 Sun Microsystems, Inc. All Rights Reserved. Enterprise Services, Revision C.1

Tunable Parameters Specific to the Solaris 8 OE


The majority of tuning parameters related to the process lifetime are scaled from the maxusers parameter by default. In addition, for the Solaris 8 OE, there are new kernel parameters that relate to process controls. These are:
- The reserved_procs parameter specifies the number of process slots that must be reserved for processes running with a user ID of 0.

- The pidmax parameter specifies the largest possible process ID value used by the system. Use this parameter to set the value for the maxpid kernel parameter. The maxpid parameter cannot be set directly by controls in the /etc/system file. The default is 30,000, but values can range from a minimum of 266 to a maximum of 999,999.

Note – The max_nprocs parameter is included because it can be affected by the pidmax parameter.
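As a sketch, raising the PID ceiling would be done through pidmax in /etc/system; the value shown is a hypothetical illustration within the documented 266 to 999,999 range, not a recommendation:

```
* Raise the largest possible process ID; maxpid is derived from this
* value at boot (illustrative value only):
set pidmax=50000
```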


Multithreading
A thread is a logical sequence of instructions in a program that accomplishes a particular task. In a multithreaded program, the programmer creates multiple threads, each performing its assigned task. This technique allows concurrent execution in the same process, thereby making more efficient use of system resources. Take, for example, a program that opens its own processing window in which data is displayed. The program also provides a window button. The user can select the button to suspend or restart processing within the program's window. The program has two distinct components:
- The processing of data that is output in the window
- An event handler (the button)


The characteristics of these two components are vastly different from each other and require different sequences of program instruction. Also, the button has to act as though it is in real time because the button could be selected by a mouse click at any time, whereas the data output occurs whenever it is ready to display. Two threads could be created: one thread to handle the data output and the other thread to manage the button selection. The program could be extended to allow multiple window regions for data output with a suspend/restart button for each window region. The number of threads in the program is equal to the number of window regions plus the number of buttons. Each window region would be processed independently of the other window regions. Each button would uniquely refer to a specic window region, allowing control of any window region at any time. You treat the program as one process and have just one, unique, Process ID (PID) number in the kernel process table. The program manages each thread of the program. Creating a process, using fork, can take much more CPU time than creating a new thread. The kernel is also multithreaded, allowing multiple tasks to execute at once. This dramatically improves performance, because systems with multiple CPUs do not have to execute kernel code sequentially. When using threads, a process still has only one executable. Each thread has its own stack, but they share the same executable le and the address space allocated to that process (PID). Because application threads share their process address space and data, you must take care to synchronize access to shared data. The threads are invisible to the operating system, so there is no system protection between application threads. Some amount of lightweight process (LWP) information is available using the ps -L command.


While there are significant resource savings and performance benefits to using threads, multithreading does require special programming techniques with which many programmers are not comfortable. However, similar techniques are required to synchronize multiple processes, so multithreading is not difficult to learn.


Application Threads
Each thread has its own stack and some local storage. All of the address space is visible to all of the threads. Unless programmed that way, threads are not aware that there are other threads in the process. Threads are invisible outside of the process in the sense that there are no commands (excluding the Trace Normal Format commands) that create, kill, change the priority of, or even show the existence of application threads. Threads are scheduled independently of each other and can execute concurrently. The programmer controls the number of threads that can execute concurrently. Much of the state of the process is shared by all of the threads, such as open les, the current directory, and signal handlers.


Process Execution
The system provides dispatch queues (described in Module 4, System Caches) that queue the work to be executed by the CPUs in priority order. As CPU slots become free, the system takes kernel threads from the dispatch queue and runs them.


A kernel thread is associated with a lightweight process (LWP). This is described in greater detail later in this module.


Thread Execution
Application thread execution operates similarly to that of the system CPUs. Application threads are queued on a single dispatch queue in the application process. They are not queued directly to the system CPU dispatch queues. The system provides logical CPUs, called lightweight processes (LWPs), that perform the function of the system CPUs for the application threads. The application threads attach to an available LWP and are then queued, through a kernel thread connected to the LWP, in the system's CPU dispatch queue. This model has two symmetrical levels: a dispatch queue and CPUs for the application threads, and dispatch queues and real CPUs for the kernel threads. Because the number of LWPs is under programmer control, programmers can control the concurrency level of the threads in their application. An LWP is the smallest processing unit that you can view using the standard Solaris utilities, such as with the ps -L command.


Process Thread Examples


The overhead shows examples (proc1 through proc5) of multithreaded applications. The top layer is in the user process. Only the application sees the application threads. The middle layer is in the kernel, which includes the LWP and kernel thread structures. The kernel thread is the entity that the kernel schedules. The bottom layer represents the hardware, that is, the physical CPUs. The kernel threads that are not attached to an LWP represent system daemons such as the page daemon, the callout thread, and the clock thread. These kernel threads are known as system threads.


Performance Issues
Multithreading an application allows it to execute more efficiently and provides many performance benefits. The application:

- Is broken into separate tasks that can be scheduled and executed independently
- Takes advantage of multiprocessors with less overhead than multiple processes
- Shares memory without going through the overhead and complexity of the interprocess communication (IPC) mechanisms
- Takes advantage of asynchronous I/O with low-overhead locks
- Uses a cleaner programming model for certain types of applications

Note: Depending on how the program is written and the current use of processor sets, it might not be possible for a process to use multiple processors.


Thread and Process Performance


The performance numbers in the overhead were obtained on a SPARCstation 5 system (sun4m). The measurements were made using the built-in microsecond-resolution, real-time timer with the prex utility. Using the prex command, you can find that a fork of a subprocess takes approximately 1700 microseconds to complete. Creating a bound thread, however, takes approximately 300 microseconds, and creating an unbound thread takes approximately 150 microseconds. The preceding table shows the creation times for unbound threads, bound threads, and processes. It measures the time consumed to create a thread using the default stack that is cached by the threads package; it does not include the initial context switch to the thread. The Ratio column gives the ratio of the second and third rows with respect to the first row (the creation time of unbound threads).


Locking
In a multithreaded environment, it is quite likely that two different threads will try to update the same data structure at the same time. This can cause serious problems because the update made by one thread can overwrite the update made by another. Programmers use locks to protect critical data structures from simultaneous update access. Locks can also provide a synchronization mechanism for threads.

Use locks only when threads share writable data. Programs often do not need to lock data that is only being read, and avoiding unnecessary locks can improve performance. However, frequent locking is required when a program must write shared data. Several locking mechanisms are available to handle different locking situations.


The Solaris OE provides four different types of locks:

- Mutexes
- Condition variables
- Counting semaphores (basic and System V)
- Multiple-reader, single-writer locks

A bad locking design, such as protecting too much data with one lock, can cause performance problems by serializing the threads for too long. Identifying these problems can be difficult, and repairing them usually requires a significant redesign by the programmer.

These locks are defined as follows:

- Mutex (mutual exclusion) lock: The most basic lock. Only one thread can own the lock at a time, and the thread that acquires the lock must be the thread that releases it.
- Condition variable: Associated with a condition that a thread might want to wait on, such as a parent process waiting for a child process to exit.
- Counting semaphore: Controls thread access to several resources; when a thread acquires a resource, the number of available resources is decremented.
- Multiple-reader, single-writer lock: Enables several readers to access the data concurrently; if a thread's task is to write the data, only that thread can own the lock.


Locking Problems
Locking problems are difficult to identify and usually appear as programs running slower than expected for no obvious reason. Locking problems are almost always caused by programming errors, such as those listed in the overhead, and the programmer must fix them. In some rare cases, they can be triggered by extremely heavy system loads. For example:

- Contention caused by inappropriate use of locking mechanisms by the programmer
- A deadlock caused by two operations competing with each other
- A lock control that has become lost to the process (possibly caused by the removal of a lock file)
- A race for resources, which is similar to a deadlock situation
- Incomplete implementation of the lock controls within the program


The lockstat Command


You can use the lockstat command to identify locking problems in a specific executable. In the Solaris 8 OE, only the root user can use the lockstat command to identify the source of a locking problem. This is because of the default permission settings applied to the /devices/pseudo/lockstat@0:lockstat file.

When run like apptrace, the lockstat command runs the specified command and monitors the locks that the command is using within its own process (for example, with multithreading), as well as locks obtained for the process in the kernel, such as those acquired during system call processing. Large counts (indv) with a slow locking time (nsec), or very large locking times, can identify the source of the problem. Although the vendor must fix locking problems in the kernel, the lock name is usually prefixed with the module name. With the module name, you can search for known locking problems in the module and, perhaps, obtain a patch.


Interrupt Levels
Just as systems have priorities for threads to access the CPU and for I/O devices, interrupts coming into the system also have priorities. These priorities, usually assigned by the device driver, govern the order in which the system processes device interrupts. Interrupts are generally handled by a CPU on the system board to which the I/O interface board is attached.

Do not try to change the interrupt priority for a particular device. At certain interrupt levels there are programming restrictions that lower-priority drivers might not follow. Changing priorities can also cause devices to get slower service than they require, causing overruns and retries, which slows down system throughput.

Interrupts can have a temporary, yet severe, effect on system performance. For example, moving a window around the graphics screen causes multiple, sequential interrupts to be sent on the serial port that monitors mouse activity.


Because the serial port interrupt has such a high priority, mouse movement, in effect, causes disk, tape, and network activity to slow down while the mouse interrupts are collected and acted upon. As the user drags a window around the screen with the mouse, the graphics interrupts compound the effect on the system. To see this effect at work, try the following exercise:

1. Issue the command:

   # /usr/bin/ptime cat /usr/share/lib/dict/words

2. Observe the time taken (Real, User, and System) to complete the task. These are the three values displayed when the command line has ended its execution.

3. Next, issue the same command (as shown in Step 1), but use the appropriate mouse button to drag the window around the screen while the command executes. Keep dragging the window around the screen display for approximately 10 seconds, and then drop the window and leave it in its new position, allowing the command to continue execution.

4. Observe the time taken to complete the task. The Real time has increased significantly, but the User and System times remain roughly the same. The high interrupt rate temporarily prevents the user's process from running and, consequently, increases the response time of the cat program.


Ask the students to attempt this short exercise. It should help to demonstrate that high numbers of interrupts (from any I/O device) cause delays in response time of executing programs. As an alternative to using the terminal window, you can try using the command: lockstat /usr/sadm/admin/bin/printmgr to demonstrate lockstat.


The clock Routine


The clock routine executes at interrupt level 10, a high priority. There is one system thread per system that manages the clock, and it is associated with a particular CPU. Each execution of the clock routine is called a tick. For SPARC processors (except sun4c systems), there are 100 ticks per second, as shown by the vmstat -i command. You can change this to 1000 ticks per second by requesting the high-resolution rate, which is used primarily for real-time processing. This is described in more detail in Module 3, CPU Scheduling.


# vmstat -i
interrupt         total     rate
--------------------------------
clock            930270      100
esp0             218274       23
zsc0                  2        0
zsc1              79459        8
cgsixc0             760        0
lec0              23914        2
fdc0                 16        0
--------------------------------
Total           1252695      134
#


The clock tick resolution can be modified if the Solaris Resource Manager software has been installed. The scheduler is configured, by default, for a time quantum of 11 ticks and can be reconfigured. However, it is recommended that the new quantum value not be a divisor of the system clock rate of 100 ticks per second. This helps to prevent possible beat effects with the system's per-second process scheduler and per-4-second user scheduler.

New in the Solaris 8 OE, if a real-time process is allocated to a single processor, the real-time controls can take over the timing of that individual processor and bypass the normal timing rates. (Refer to the priocntl man page.)


Clock Tick Processing


For every clock tick, the system performs numerous checks for timer-related work, some of which are shown in the overhead. The system clock tends to drift slowly during operation; there are not always exactly 100 clock interrupts every second, due to interference from other system operations, and systems drift at different rates. One possible cause of drift in the system clock is a high interrupt rate on the serial ports. This could be caused by busy modem connections, terminal servers respawning too quickly, or active serially connected devices. To synchronize the system clocks with each other, and perhaps with an exact external time source, use the Network Time Protocol (NTP), provided with the Solaris 2.6 OE and later. Refer to the xntpd(1M) man page for more information.


The topic of NTP is beyond the scope of this course. However, it should have been covered in the Solaris 8 System Administration courses.


Process Monitoring Using the ps Command


Several methods exist for monitoring the processes running on your system. The lab exercises for this section provide examples of some of the different methods and also describe output fields when necessary. The ps command is one of the simplest commands to use when monitoring processes. The Solaris OE supplies two versions of the ps program:

- /usr/bin/ps
- /usr/ucb/ps

In fact, they are the same program, which you can invoke using either absolute path name. Depending on the path name you use, the command options differ and the output is also different. Of all the available options, the best performance-related summary comes from the BSD version, /usr/ucb/ps -aux, which collects all of the process data at one time, sorts the output by CPU usage, and


then displays the result. The unsorted versions of ps loop through the processes, printing as they go. This spreads out the time at which the processes are measured, so the last process measured is sampled significantly later than the first. By collecting the data at one time, the sorted version gives a more consistent view of the system. You can obtain a quick look at the most active processes using this command:

# /usr/ucb/ps -aux | head
USER       PID %CPU %MEM     SZ   RSS TT    S  START     TIME COMMAND
craigm     260  0.7 56.0 150496 32672 ?     S  Jun 07   18:34 /usr/openwin/bin/X
craigm    4234  0.4 15.0  21168  8744 pts/3 S  08:06:56  2:49 /usr/dist/share/fr
craigm    2549  0.4  8.7  35936  5048 ?     S  Jun 09    6:28 /usr/dist/share/ne
root      4901  0.2  1.9   1464  1096 pts/5 O  13:38:55  0:00 /usr/ucb/ps -aux
root         3  0.1  0.0      0     0 ?     S  Jun 07    1:16 fsflush
craigm    4370  0.1  6.4   7280  3728 ?     S  09:04:14  0:00 /usr/dt/bin/dtterm
root       183  0.0  2.7   2384  1544 ?     S  Jun 07    0:01 /usr/sbin/nscd
root       258  0.0  2.7   2288  1560 ?     S  Jun 07    1:14 mibiisa -r -p 3282
root         0  0.0  0.0      0     0 ?     T  Jun 07    0:00 sched
#

This summary immediately tells you who is running the most active processes.

- USER: User who started the process.
- PID: Process ID number assigned by the system.
- %CPU: Percentage of CPU time used by the process.
- %MEM: Percentage of the total random access memory (RAM) in your system in use by the process. The values do not add up to 100 percent because some RAM is shared by several processes.
- SZ: Amount of nonshared virtual memory, in Kbytes, allocated to the process. It does not include the program's text size, which is normally shared.
- RSS: Resident set size of the process. It is the basis for %MEM and is the amount of RAM in use by the process.
- TT: Which teletype the user is logged in on.


- S: The status of the process. The possible states are:

  - S: Sleeping
  - O: Running (on a CPU)
  - R: Capable of running and waiting for a CPU to become free
  - Z: Terminating but has not died (zombie)
  - P: Waiting for page-in
  - D: Waiting for disk I/O

  Check disk performance and memory usage if P or D appears frequently. This could indicate an overload of those subsystems, with processes spending much of their time waiting for I/O to occur.

- START: Time the process started.
- TIME: Total amount of CPU time the process has used so far.
- COMMAND: The command being executed.

The same basic information is displayed when you use the freeware utilities top and proctool. The Solstice SyMON system monitor, Sun Management Center, and most commercial performance tools also display this data.

Note: Remember that ps takes a snapshot of the current process list. By the time the data is displayed, the process list might have changed significantly. In the Solaris 8 OE, the prstat command provides an alternative to the ps command.


Process Monitoring Using System Accounting


Many processes live very short lives. You cannot see them with ps, but they occur so frequently that they dominate the load on your system. The only way to catch them is to have the system keep a record of every process that it runs: who ran it, what it was, when it started and ended, and how much resource it used. The system accounting subsystem keeps this record.

Accounting data is most useful when measured over a long period of time. This can be useful on a network of workstations as well as on a single time-shared server. From this data you can identify how often programs run; how much CPU time, I/O, and memory each program uses; and what work patterns throughout the week look like.

Note: System accounting can consume large amounts of disk space on busy systems. Use system accounting only where it is absolutely necessary to discover usage statistics of the system resources.


Process Monitoring Using SE Toolkit Programs
The msacct.se Program
The msacct.se program uses microstate accounting to show the detailed sleep time and causes for a process. This program must be executed by root.
# /opt/RICHPse/bin/se /opt/RICHPse/examples/msacct.se
Elapsed time        67:37:19.092  Current time    Thu Jun 10 11:19:15 1999
User CPU time              0.000  System call time        67:37:19.092
System trap time           0.000  Text pfault sleep              0.000
Data pfault sleep          0.000  Kernel pfault sleep            0.000
User lock sleep            0.000  Other sleep time               0.000
Wait for CPU time          0.000  Stopped time                   0.000
Elapsed time        67:37:29.088  Current time    Thu Jun 10 11:19:25 1999
User CPU time              0.000  System call time        67:37:29.088
System trap time           0.000  Text pfault sleep              0.000
Data pfault sleep          0.000  Kernel pfault sleep            0.000
User lock sleep            0.000  Other sleep time               0.000
Wait for CPU time          0.000  Stopped time                   0.000
^C#

The pea.se Program


The pea.se program also uses microstate accounting. This program runs continuously and reports on the average data for each active process in the measured interval (10 seconds by default). The data includes all processes and shows their average data since they were created.
# /opt/RICHPse/bin/se /opt/RICHPse/examples/pea.se
tmp = pp.lasttime<9.29046e+08> tmp_tm = localtime(&tmp<929045620>) debug_off()
14:13:40       name lwmx   pid  ppid  uid  usr%  sys% wait% chld%  size   rss   pf
               init    1     1     0    0  0.00  0.00  0.00  0.00   736   152  0.0
              utmpd    1   219     1    0  0.00  0.00  0.00  0.00   992   536  0.0
             ttymon    1   259   252    0  0.00  0.00  0.00  0.00  1696   960  0.0
                sac    1   252     1    0  0.00  0.00  0.00  0.00  1624   936  0.0
            rpcbind    1   108     1    0  0.00  0.00  0.00  0.00  2208   840  0.0
         automountd   13   158     1    0  0.00  0.00  0.00  0.00  2800  1808  0.0


Some of the fields of this output are:

- name: The process name
- lwmx: The number of LWPs for the process, which helps to determine which processes are multithreaded
- pid, ppid, uid: The process ID, parent process ID, and user ID
- usr%, sys%: The user and system CPU percentages, measured accurately by microstate accounting
- wait%: The percentage of time the process spent waiting in the run queue or for page faults to complete
- chld%: The CPU percentage accumulated from child processes that have exited
- size: The size of the process address space
- rss: The resident set size of the process; the amount of RAM in use by the process
- pf: The page-fault-per-second rate for the process over the interval

The pea.se program also has a wide mode that displays additional information. Use the -DWIDE option to the se command: /opt/RICHPse/bin/se -DWIDE /opt/RICHPse/examples/pea.se. The additional data includes:

- Input and output blocks per second
- Characters transferred by read and write calls
- System calls per second over the interval
- Voluntary context switches, in which the process slept for a reason
- Involuntary context switches, in which the process was interrupted by higher-priority work or exceeded its time slice


The ps-ax.se Program
The ps-ax.se program is a /usr/ucb/ps -ax clone. The program monitors only processes that you own.
# /opt/RICHPse/bin/se /opt/RICHPse/examples/ps-ax.se
  PID TT     S  TIME COMMAND
    0 ?      T  0:00 sched
    1 ?      S  0:01 /etc/init
    2 ?      S  0:00 pageout
    3 ?      S  1:16 fsflush
  219 ?      S  0:00 /usr/lib/utmpd
  ...
 1282 pts/2  S  0:00 dtpad -server
 1433 pts/4  S  0:00 -sh
 4370 ?      S  0:01 /usr/dt/bin/dtterm
 2529 ?      S  0:00 /usr/dist/pkgs/cam,v1.5/5bin.sun4/cam netscape http://h
#

The ps-p.se Program


The ps-p.se program is a ps -p clone (using /usr/ucb/ps output format). The program monitors only processes that you own.
# /opt/RICHPse/bin/se /opt/RICHPse/examples/ps-p.se 4234
  PID TT     S  TIME    COMMAND
 4234 pts/3  S  4:12.73 /usr/dist/share/framemaker,v5.5.6/bin/sunxm.s5.sparc/ma
#

The pwatch.se Program


The pwatch.se program watches a process as it consumes CPU time. It defaults to Process ID 3 to watch the system process fsflush. It monitors only processes that you own.
# /opt/RICHPse/bin/se /opt/RICHPse/examples/pwatch.se
fsflush has used 77.3 seconds of CPU time since boot
fsflush used 0.0 %CPU, 0.0 %CPU average so far
fsflush used 0.2 %CPU, 0.1 %CPU average so far
^C#


Check Your Progress
Before continuing on to the next module, check that you are able to accomplish or answer the following:

- Describe how a process commences its execution cycle
- Describe the life cycle of a process in the SunOS environment
- Explain multithreading and its benefits
- State the limits of the SunOS process scheduler
- Describe locking and its performance implications
- List the functions of the clock thread
- Describe how system interrupts affect process execution
- Use appropriate utilities to display process- and thread-related information
- Describe the related tuning parameters


Think Beyond
Some questions to ponder are:

- Why does the Solaris OE need threads when processes have worked for so long?

  Because threaded programs generally work much more efficiently in a multiprocessor environment.

- Why are LWPs used in the thread model? Are they really needed?

  To allow the kernel to manage the assignment of CPU processing across multiple processors, if necessary. No, LWPs are not strictly needed, but they do provide an effective means of allocating processors to multithreaded programs.

- Why are locks difficult for programmers to work with?

  Because the programmer must completely understand the manner in which one or more users will work with the program. Also, locks must not conflict with other operations that might be occurring outside of that process.


Processes Lab
Objectives
Upon completion of this lab, you should be able to:

- Understand the relationship of maxusers to other tuning parameters
- Understand output from the lockstat command
- Understand various methods of displaying process information

Using the maxusers Parameter
Many of the kernel parameters derive their values from predetermined formulas, and the maxusers parameter appears in many of those formulas. You use the maxusers parameter to set the size of several system tables. Some of the system parameters that maxusers modifies directly are:

- max_nprocs: The maximum number of processes
- ufs_ninode: The size of the inode cache
- maxuprc: The maximum number of processes per user
- ncsize: The size of the directory name lookup cache

First, to see the current values of these system parameters, you make use of the debugging tools adb and mdb. The reason for using both tools is to show that they can both produce the same output, if required. The mdb program is new to the Solaris 8 Operating Environment (OE) and is the recommended debugging tool for this release.

In the first part of the lab, you display and alter the system parameters by changing the value of maxusers. In the second part of the lab, you learn how the lockstat command displays the locks that have been applied to an executing process. During this lab, you observe the effect of the priority of the serial port interrupt. The final part of the lab concentrates on displaying process information. For this part of the lab, you use several tools, some of which are supplied by default with the Solaris 8 OE.

Note: For this lab, the manual pages should have been installed on your system. If not, consult with the instructor.


Complete the following steps:

1. Log in as the root user, working in a Common Desktop Environment (CDE) window environment.

2. Change to the lab2 directory for this lab.

3. Type ./maxusers.adb.sh. This lists the current value of maxusers and some of the parameters it affects. Record the values:

   physmem     ____________
   pidmax      ____________
   maxusers    ____________
   max_nprocs  ____________
   ufs_ninode  ____________
   maxuprc     ____________
   ncsize      ____________

4. Type ./maxusers.mdb.sh, and compare the output to that shown by the preceding command.

5. Run the sysdef command, and save the output to a file:

   # sysdef > /var/tmp/sysdef1

6. Make a copy of the /etc/system file using the following command:

   # cp /etc/system /etc/system.original

7. In the /etc/system file, insert a line so that 50 is added to the current value of maxusers as discovered in Step 3 (where new_value = current_value + 50):

   set maxusers=new_value

8. Reboot the system.


Be prepared to assist the users with the adb and mdb utilities, if required.


9. Verify the changes by executing the ./maxusers.mdb.sh script.

10. Run the sysdef command again, saving the output to a different file from that used in Step 5. Run diff on the two files to identify any changes in the sysdef output:

    # sysdef > /var/tmp/sysdef2
    # diff /var/tmp/sysdef1 /var/tmp/sysdef2 | more

11. Restore the /etc/system file to its original form using the copy made in Step 6, and reboot the system.


Using the lockstat Command
The purpose of this lab is to show that processes may use locking mechanisms during their lifetime. In this lab, you open a dtterm window and observe the locks that are invoked when certain actions occur. The lock information is displayed only when the dtterm window is closed.

Complete the following steps:

1. To view output from lockstat, start a desktop terminal window under lockstat using the command:

   # lockstat /usr/dt/bin/dtterm &

   Remember, you do not see any output from lockstat until the dtterm window exits.

2. After the window opens, change the size of the display font to 16 point using the Options -> Font Size window controls.

3. Resize the window by dragging one corner.

4. Continue moving the dtterm window around the screen for a period of three seconds.

5. Close the dtterm window using an appropriate window-menu option. Now that you are back in the window in which you issued the lockstat command, press the [Return] key.


Monitoring Processes
Complete the following steps:

1. Display the most active processes on your system using the command:

   # /usr/ucb/ps -aux | head

2. Answer the following questions:

   - What is the most active process on your system?

     __________________________________________

   - What percentage of CPU time is used by the most active process on your system?

     __________________________________________

   - What percentage of the total RAM in your system is used by the most active process?

     __________________________________________

   - Are there any processes that are blocked waiting for disk I/O? How can you tell?

     __________________________________________
     __________________________________________


   - Is your system doing any unnecessary work? Can you disable any unused daemons? How would you prevent unused daemons from starting at boot time?

     __________________________________________
     __________________________________________

   - What is the advantage of using system accounting to monitor processes over the ps command?

     __________________________________________
     __________________________________________

3. Monitor processes using the following SE Toolkit programs:

   - msacct.se
   - pea.se
   - ps-ax.se
   - ps-p.se
   - pwatch.se


4. List processes using /usr/bin/ps by issuing the following command:

   # /usr/bin/ps -eL | more

   Identify processes that are using lightweight processes (LWPs) and list some of them below:

   __________________________________________
   __________________________________________
   __________________________________________

5. Verify the time allocated to the entire process and the time allocated to each LWP of that process using the following two commands:

   # /usr/bin/ps -e | grep mib
   # /usr/bin/ps -eL | grep mib

   Compare the time shown in the third column of the output of the first command with the total of the time in the fourth column of output from the second command. Do they add up to the same (total) value? Is time allocated evenly across all of the LWP entries, or does one entry have more time than the other LWPs?

6. To see how the two versions of ps differ in their output, use the counter.sh script. This script invokes the counter script, which increments an integer variable from 1 through 100,000. While the counter is being incremented, the /usr/ucb and /usr/bin versions of ps display the details of the current terminal's processes. Execute the script, in the lab2 directory, using the following command:

   # ./counter.sh


After the script has completed, determine the PID value of the current shell by issuing the following command:

   # echo $$

While the script was running, output will have been produced by both versions of the ps command. Compare the time values for the current PID (as determined by the echo command). Is there a difference in the TIME column of each output for the current shell?

7. In the lab2 directory, there are two shell scripts called bourne and korn. The purpose of these scripts is to show how context switching and the fork/exec operation can severely affect processing. Because the Korn shell has a built-in arithmetic capability, it is much more efficient than the Bourne shell at performing arithmetic calculations. To analyze the number of system calls and the number of fork/exec operations, in a separate terminal window, issue the following command:

   # sar -c 1 120

Then, in another terminal window, execute the Korn shell script and the Bourne shell script by issuing the following two command lines:

   # ./korn
   # ./bourne

While the commands are running, watch the values displayed in the scall/s, fork/s, and exec/s columns.

Processes Lab
8. In the lab2 directory, there is a shell script called threadmon. The purpose of this script is to show how window-based programs might require several threads. Some of the threads are used for event handling, such as the selection of a button. Other threads are associated with the program operations. First, change to the lab2 directory. Invoke the threadmon script using the following command:

# ./threadmon

After a few seconds, the following window is displayed (Figure L2-1):

Figure L2-1

The Thread Monitor Window

You start either of the program's two operations by clicking the Start button. After an operation has started, the button changes to read Pause. Click the Pause button to pause the thread of the program (related to the appropriate operation). While the threadmon script is running, the terminal window displays output from the ps -L command, showing the threads of the process.

The terminal window shown in Figure L2-2 shows the output.

Figure L2-2

The Thread Monitor LWP Listings

Observe the amount of time allocated to each thread by clicking the Start or Pause button for the two Thread Monitor window areas.

Which thread is associated with prime number calculations?
Which thread is associated with the Reset button?
Are new threads created by events in the program?

When you feel that you have observed the operation of the threadmon script, terminate the script (running in the terminal window) as instructed on screen.

Note - The following lab exercise could take up to 30 minutes to complete. Before starting this lab exercise, make sure that you have at least four terminal windows open on your CDE desktop.

9. In the lab2 directory is the catman.sh script. This shell script executes the catman command to create a whatis database of the manual page entries. The catman program performs many context switches as it forks subprocesses to perform its work. To invoke this shell script, issue the command:

# ./catman.sh

Follow the instructions displayed on the screen.
 

To allow the catman command to work as required, the manual pages should have been previously installed on the classroom systems. The purpose of this final lab exercise is to show the wrap-around of PID values. Because catman causes so many subprocesses to be forked and executed, there is a strong possibility that PID values will reach the default limit of 30,000 and start again with unused, lower, PID values.


CPU Scheduling
Objectives
Upon completion of this module, you should be able to:
- Describe the Solaris Operating Environment (OE) scheduling classes
- Explain the use of dispatch priorities
- Describe the time-sharing scheduling class
- Explain Solaris OE real-time processing
- List the commands that manage scheduling
- Describe the purpose and use of processor sets
- Introduce Solaris Resource Manager

3-1
Relevance


Present the following questions to stimulate the students and get them thinking about the issues and topics presented in this module. While they are not expected to know the answers to these questions, the answers should be of interest to them and inspire them to learn the content presented in this module.

Discussion - The following questions are relevant to understanding the content of this module:

- How do you control how the system prioritizes its work?
- Can central processing units (CPUs) be dedicated to certain work?
- Do the tuning tasks differ with large numbers of CPUs?

Additional Resources
Additional resources - The following references provide additional details on the topics discussed in this module:

- Cockcroft, Adrian, and Richard Pettit. Sun Performance and Tuning. Palo Alto: Sun Microsystems Press/Prentice Hall, 1998.
- Graham, John. Solaris 2.x Internals. McGraw-Hill, 1995.
- Man pages for the various commands (dispadmin, priocntl, prtconf, prtdiag, psradm, psrinfo, psrset, sysdef) and performance tools (sar, vmstat, mpstat).
- Solaris Resource Manager, Controlling System Resources Effectively. Whitepaper, Sun Microsystems, Inc., 1995.


Scheduling States
A kernel thread can exist in one of three states:

- Running - Actually executing on a CPU
- Ready - Waiting for access to a CPU
- Sleeping - Waiting for an event, such as an I/O interrupt, wait time period expiration (such as a sleep), or lock availability

A thread moves between these states depending on the work being performed by the program that it is supporting. Application threads do not queue for the CPU directly, although system threads do. They are attached to a lightweight process (LWP) and kernel thread pair. The kernel thread is then assigned to a dispatching queue when there is work for the application thread to do.


Scheduling Classes
SunOS provides four scheduling classes by default. These are:

- TS - The timesharing class, for normal user work. Priorities are adjusted based on CPU usage.
- IA - The interactive class, derived from the timesharing class, provides better performance for the task in the active window in OpenWindows or the Common Desktop Environment (CDE).
- SYS - The system class, also called the kernel priorities, is used for system threads, such as the page daemon and clock thread.
- RT - The real-time class has the highest priority in the system, except for interrupt handling. It is even higher than the system class.

Like device drivers and other kernel modules, the scheduling class modules are loaded when they are needed. Scheduling class behavior is controlled by root-adjustable tables.

Class Characteristics
Timesharing/Interactive
This class relates to the programs that a user invokes, either from the command line or from a Graphical User Interface (GUI) tool. The TS and IA classes have the following characteristics:

- Time sliced - The CPU is shared in rotation between all threads at the same priority.
- Mean time to wait scheduling - Priorities of CPU-bound threads are lowered; priorities of I/O-bound threads rise.

System Class
This class relates to system calls and kernel-related activity. The SYS class has the following characteristics:

- Not time sliced - Threads run until they finish their work or are preempted by a higher priority thread.
- Fixed priorities - The thread always retains the priority it was initially assigned.

Real-time Class
By default, the real-time class is not invoked. However, once invoked by a user, that class remains in the system as a valid class until the next reboot. The RT class has the following characteristics:

- Time sliced
- Fixed priorities


Dispatch Priorities
System threads are assigned a priority within their class. Based on that priority, a system-wide dispatching priority is also assigned to the thread. Timesharing (TS) and Interactive (IA) threads are assigned an initial priority level within their class range. Unlike system threads, the TS and IA thread priority can change during the lifetime of the process. Threads are assigned to a specic class and, without user intervention, they remain in that class for the life of their process. Priority bands are assigned in a hierarchy based on class order, which cannot be changed. If a new scheduling class is activated, such as real-time (RT), it is inserted into its proper place in the hierarchy and all other existing classes are adjusted accordingly.

Example
A find process was started from a Korn shell environment. Every second, a ps -cle command was issued, within a loop, to list the find and sh processes. The ps process was also included in the listing. Both the find and sh processes were long-lived. The ps process, however, was invoked for each process listing, so it was very short-lived. Here are the lines of output relating to each of the three processes. Observe the values in the PRI column, where that process's current priority is shown.

Note - The output has been slightly modified to allow it to fit in the pages more easily.


Although this is a simple example, it demonstrates the following points:

- The find process is frequently blocked on I/O during its life cycle. This means that it will, inevitably, be subject to fluctuations in its priority.
- The sh process is executing a loop, within which the ps command is being invoked and a sleep is occurring. Because the shell is continually active but not I/O bound, the process priority is fairly static.
- The ps process is invoked for each cycle of the shell's loop. Each time the ps process appears in the listing, therefore, it will have been allocated a relevant priority. Because each ps execution is a new process, it is understandable that each ps listing has a very similar priority. However, there are a couple of exceptions. These must have been started at times when more (or less) processing was in the run queue.

The following is the output from the find process:

 F S UID PID PPID CLS PRI ADDR  SZ TTY   TIME CMD
 8 R   0 657  655  TS  44    ? 121 pts/7 0:04 find
 8 R   0 657  655  TS  42    ? 121 pts/7 0:04 find
 8 R   0 657  655  TS  31    ? 121 pts/7 0:05 find
 8 R   0 657  655  TS  40    ? 121 pts/7 0:05 find
 8 R   0 657  655  TS  20    ? 121 pts/7 0:05 find
 8 R   0 657  655  TS  11    ? 121 pts/7 0:05 find
 8 R   0 657  655  TS   0    ? 121 pts/7 0:06 find
 8 R   0 657  655  TS  38    ? 121 pts/7 0:06 find
 8 R   0 657  655  TS  31    ? 121 pts/7 0:06 find
 8 R   0 657  655  TS  31    ? 121 pts/7 0:06 find
 8 R   0 657  655  TS  21    ? 121 pts/7 0:06 find
 8 R   0 657  655  TS  10    ? 121 pts/7 0:06 find
 8 R   0 657  655  TS  30    ? 121 pts/7 0:06 find
 8 R   0 657  655  TS  23    ? 121 pts/7 0:07 find
 8 R   0 657  655  TS  43    ? 121 pts/7 0:07 find
 8 S   0 657  655  TS  60    ? 121 pts/7 0:07 find
 8 S   0 657  655  TS  60    ? 121 pts/7 0:07 find
 8 S   0 657  655  TS  60    ? 121 pts/7 0:07 find
 8 S   0 657  655  TS  60    ? 121 pts/7 0:07 find
 8 R   0 657  655  TS  46    ? 121 pts/7 0:07 find
 8 S   0 657  655  TS  60    ? 121 pts/7 0:07 find
 8 R   0 657  655  TS  46    ? 121 pts/7 0:07 find
 8 S   0 657  655  TS  60    ? 121 pts/7 0:07 find
 8 S   0 657  655  TS  60    ? 121 pts/7 0:07 find
 8 S   0 657  655  TS  60    ? 121 pts/7 0:07 find
 8 S   0 657  655  TS  60    ? 121 pts/7 0:07 find
 8 S   0 657  655  TS  60    ? 121 pts/7 0:07 find
 8 S   0 657  655  TS  60    ? 121 pts/7 0:07 find

The priority value of the command varies over time, while the process identity (PID) remains the same (657), showing that this is the same process being reported on. The CPU time allocated to the process (TIME) grows over time. The priority of this process (PRI) varies in the range 0 to 60.

The following is the output from the sh process:

 F S UID PID PPID CLS PRI ADDR  SZ TTY   TIME CMD
 8 S   0 603  525  TS  48    ? 229 pts/6 0:01 ksh
 8 S   0 603  525  TS  48    ? 229 pts/6 0:01 ksh
 8 S   0 603  525  TS  48    ? 229 pts/6 0:01 ksh
 8 S   0 603  525  TS  48    ? 229 pts/6 0:01 ksh
 8 S   0 603  525  TS  48    ? 229 pts/6 0:01 ksh
 8 S   0 603  525  TS  48    ? 229 pts/6 0:01 ksh
 8 S   0 603  525  TS  48    ? 229 pts/6 0:01 ksh
 8 S   0 603  525  TS  48    ? 229 pts/6 0:01 ksh
 8 S   0 603  525  TS  48    ? 229 pts/6 0:01 ksh
 8 S   0 603  525  TS  48    ? 229 pts/6 0:01 ksh
 8 S   0 603  525  TS  48    ? 229 pts/6 0:01 ksh
 8 S   0 603  525  TS  48    ? 229 pts/6 0:01 ksh
 8 S   0 603  525  TS  48    ? 229 pts/6 0:01 ksh
 8 S   0 603  525  TS  48    ? 229 pts/6 0:01 ksh
 8 S   0 603  525  TS  48    ? 229 pts/6 0:01 ksh
 8 S   0 603  525  TS  48    ? 229 pts/6 0:01 ksh
 8 S   0 603  525  TS  48    ? 229 pts/6 0:01 ksh
 8 S   0 603  525  TS  48    ? 229 pts/6 0:01 ksh
 8 S   0 603  525  TS  48    ? 229 pts/6 0:01 ksh
 8 S   0 603  525  TS  48    ? 229 pts/6 0:01 ksh
 8 S   0 603  525  TS  48    ? 229 pts/6 0:01 ksh
 8 S   0 603  525  TS  48    ? 229 pts/6 0:01 ksh
 8 S   0 603  525  TS  48    ? 229 pts/6 0:01 ksh
 8 S   0 603  525  TS  48    ? 229 pts/6 0:01 ksh
 8 S   0 603  525  TS  48    ? 229 pts/6 0:01 ksh
 8 S   0 603  525  TS  48    ? 229 pts/6 0:01 ksh
 8 S   0 603  525  TS  48    ? 229 pts/6 0:01 ksh
 8 S   0 603  525  TS  48    ? 229 pts/6 0:01 ksh

The priority value (PRI) of the command remains constant over the elapsed time. This is because the shell process is in a SLEEP state while the find command is executing. The allocation of CPU time (TIME) does not, therefore, increase over the period of time that find is executing.

The following is the output from the ps -cle process:

 F S UID  PID PPID CLS PRI  SZ TTY   TIME CMD
 8 O   0 1095  655  TS  58 235 pts/7 0:00 ps
 8 O   0 1098  655  TS  52 235 pts/7 0:00 ps
 8 O   0 1101  655  TS  52 235 pts/7 0:00 ps
 8 O   0 1104  655  TS  52 235 pts/7 0:00 ps
 8 O   0 1107  655  TS  52 235 pts/7 0:00 ps
 8 O   0 1110  655  TS  52 235 pts/7 0:00 ps
 8 O   0 1113  655  TS  52 235 pts/7 0:00 ps
 8 O   0 1116  655  TS  52 235 pts/7 0:00 ps
 8 O   0 1119  655  TS  52 235 pts/7 0:00 ps
 8 O   0 1122  655  TS  52 235 pts/7 0:00 ps
 8 O   0 1125  655  TS  52 235 pts/7 0:00 ps
 8 O   0 1128  655  TS  52 235 pts/7 0:00 ps
 8 O   0 1131  655  TS  52 235 pts/7 0:00 ps
 8 O   0 1134  655  TS  52 235 pts/7 0:00 ps
 8 O   0 1137  655  TS  49 235 pts/7 0:00 ps
 8 O   0 1140  655  TS  54 235 pts/7 0:00 ps
 8 O   0 1143  655  TS  52 235 pts/7 0:00 ps
 8 O   0 1146  655  TS  52 235 pts/7 0:00 ps
 8 O   0 1149  655  TS  52 235 pts/7 0:00 ps
 8 O   0 1152  655  TS  52 235 pts/7 0:00 ps
 8 O   0 1155  655  TS  42 235 pts/7 0:00 ps
 8 O   0 1158  655  TS  52 235 pts/7 0:00 ps
 8 O   0 1161  655  TS  52 235 pts/7 0:00 ps
 8 O   0 1164  655  TS  52 235 pts/7 0:00 ps
 8 O   0 1167  655  TS  52 235 pts/7 0:00 ps
 8 O   0 1170  655  TS  52 235 pts/7 0:00 ps
 8 O   0 1173  655  TS  52 235 pts/7 0:00 ps
 8 O   0 1176  655  TS  52 235 pts/7 0:00 ps

The priority value (PRI) of the command is fairly constant. At times, the ps command reads the scheduling table while it is still in user mode (before reading the system tables); this is when the command has a priority of less than 59.


Students might ask why the priority value changes. This is because there are other system processes running at the same time, and the ps command must be allocated a priority to allow appropriate access to the system's process table at that time (possibly having a slightly higher priority than other, running, system processes to be able to obtain CPU time in which to run). The TIME column shows whole seconds of CPU time, so it generally shows zero time for the short-lived ps.


Timesharing/Interactive Dispatch Parameter Table


The process scheduler (or dispatcher) is the portion of the kernel that controls allocation of the CPU to processes. The timesharing dispatch parameter table consists of an array of entries, one for each priority. It is indexed by the thread's priority and provides the scheduling parameters for the thread. Initially, the thread is assigned the amount of CPU time specified in the quantum, in clock ticks.


The word "quantum" is defined as "a share, portion, or allowed amount." So, in this context, the term quantum refers to the amount of time allowed on the CPU.

If a process is using a lot of CPU time, it exceeds its assigned CPU time quantum (ts_quantum) and is given the new priority in ts_tqexp. The new priority is usually 10 less than its previous priority, but in return, it gets a longer time slice. A thread that blocks waiting for a system resource is assigned the priority ts_slpret when it returns to the user mode. The value of ts_slpret in the 50s allows even the most I/O-bound job at least short access to the CPU after returning.

The following is an example output from the dispadmin command. This shows details of the time-sharing component of the system's dispatch table.

# dispadmin -g -c TS
# Time Sharing Dispatcher Configuration
RES=1000

# ts_quantum  ts_tqexp  ts_slpret  ts_maxwait  ts_lwait    PRIORITY LEVEL
       200         0        50          0         50       #     0
       200         0        50          0         50       #     1
       200         0        50          0         50       #     2
       200         0        50          0         50       #     3
       200         0        50          0         50       #     4
       200         0        50          0         50       #     5
       200         0        50          0         50       #     6
       200         0        50          0         50       #     7
       200         0        50          0         50       #     8
       200         0        50          0         50       #     9
       160         0        51          0         51       #    10
       ...
        40        43        58          0         59       #    53
        40        44        58          0         59       #    54
        40        45        58          0         59       #    55
        40        46        58          0         59       #    56
        40        47        58          0         59       #    57
        40        48        58          0         59       #    58
        20        49        59      32000         59       #    59

Ask the students to execute this command on their systems so that they can see the complete table.

CPU Starvation
If a thread does not get any CPU time within ts_maxwait seconds, its priority is raised to ts_lwait. (A value of 0 for ts_maxwait means one second, so by default the wait is one second.) The new ts_lwait priority in the 50s should allow the thread to get some amount of CPU time if any timesharing threads are able to run at all. The dispadmin command is used to make temporary changes to the timesharing/interactive dispatch table. The dispadmin command is described later in this module. Refer to the man page for ts_dptbl(4) to make permanent changes to the table.

Example of Dispatch Table Timeshare
A process starts executing with a priority value of 59 and is given a time quantum of 20. It is still running when the time slice expires, and the process has to be taken off the CPU and put into the run queue.

# ts_quantum  ts_tqexp  ts_slpret  ts_maxwait  ts_lwait    PRIORITY LEVEL
        20        49        59      32000         59       #    59

Because of the expiration, the next time the process is given CPU time, the priority will have changed to 49 (ts_tqexp), but the time quantum will have been increased to 40. Again, the time slice is used up and the process is removed from the CPU. The next time it is processed, the priority will have fallen to 39.

# ts_quantum  ts_tqexp  ts_slpret  ts_maxwait  ts_lwait    PRIORITY LEVEL
        40        39        58          0         59       #    49

The next time the process is given CPU time, the quantum allocated will be 80, and the priority value will drop to 29 if the process again uses up its time quantum.

# ts_quantum  ts_tqexp  ts_slpret  ts_maxwait  ts_lwait    PRIORITY LEVEL
        80        29        54          0         54       #    39

This reduction in priority will continue until the priority value reaches 0. At that point, the time quantum applied to this process will be 200.

# ts_quantum  ts_tqexp  ts_slpret  ts_maxwait  ts_lwait    PRIORITY LEVEL
       200         0        50          0         50       #     0

However, because this is such a low-priority process, it is highly likely that it will not be able to compete with other (higher priority) processes for CPU time if there is a large run queue. This process, therefore, will only complete when there is a vacant slot on the CPU and no other processes are competing for that slot.
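The walkthrough above can be replayed mechanically. The following Python fragment is a teaching sketch, not Solaris code: it encodes only the four ts_dptbl rows quoted above (the intermediate rows between priority 39 and 0 are elided, as in the example) and follows a CPU-bound thread that always exhausts its quantum.

```python
# Minimal sketch of TS priority decay for a CPU-bound thread, using only
# the four dispatch-table rows quoted above: priority -> (ts_quantum, ts_tqexp).
# This illustrates the table mechanics; it is not Solaris kernel code.
TS_TABLE = {
    59: (20, 49),
    49: (40, 39),
    39: (80, 29),
    0: (200, 0),
}

def expire_quantum(priority):
    """Return (quantum_used, new_priority) when a thread at this
    priority runs through its entire time quantum."""
    quantum, tqexp = TS_TABLE[priority]
    return quantum, tqexp

history = []
pri = 59
while pri in TS_TABLE:
    quantum, nxt = expire_quantum(pri)
    history.append((pri, quantum))
    if nxt == pri:          # priority 0 maps to itself: decay has bottomed out
        break
    pri = nxt if nxt in TS_TABLE else 0   # rows between 39 and 0 are elided here

print(history)
```

Each step trades priority for a longer quantum, exactly as the table rows above describe: (59, 20), (49, 40), (39, 80), and finally (0, 200).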


Dispatch Parameter Table Issues


When you change the size of the time quantum for a particular priority, you change how the threads at that priority use the CPU. For example, by increasing the size of the quantum, you help ensure that the data the thread is using remains in the CPU caches, thereby providing better performance. By decreasing the size of the quantum, you favor I/O-bound work that gets on and off of the CPU quickly. With very heavy CPU utilization, the CPU starvation fields ensure that all threads get some CPU time. Generally, if you are relying on starvation processing, you need more CPUs for the system. If your workload changes slowly between I/O bound and CPU bound, you can alter the ts_tqexp and ts_slpret fields to raise and lower priorities more slowly.


Timesharing Dispatch Parameter Table (Batch)


The table portion, shown in the overhead, is from the SunOS 5.2 operating system, where the table favored batch work. The quanta are five times larger, and priorities drift up and down more slowly as the quantum is or is not used up. The table allows batch work to use the CPU caches, such as the page descriptor cache (PDC) and Level 1 and 2 caches, more effectively. It also forces the thread to earn its way to I/O-bound status, as the thread works its way up over several dispatches. Using the nice and priocntl commands might change the calculated priorities for a thread. A hybrid table could be built, using the upper half of the current table and the lower half of this table, that allows I/O-bound work to move up quickly and CPU-bound work to use the hardware more efficiently.


The dispadmin Command


The dispatcher administration command, dispadmin, enables the system administrator to inspect and change the dispatch parameter table for the timesharing or real-time classes. (The interactive class uses the timesharing (TS) table, and the system class has only a list of priorities.) The dispadmin command provides a list of the active, loaded classes (-l), or allows you to inspect or change the current table. For example, to change the TS table, you would:

1. Save the current table to a file:
# dispadmin -c TS -g > tfile

2. Edit tfile to change the table.

3. Make the changed table active for the life of the boot by running:
# dispadmin -c TS -s tfile


dispadmin Example
The overhead shows how to inspect a dispatch table. The resolution for the quanta defaults to milliseconds; the quantum unit is the inverse of the resolution field (RES=1000). You can specify a different resolution by using the -r option. If you want the resolution to be displayed in hundredths of a second, use the command:

# dispadmin -r 100 -c TS -g

If you change the table, be sure to set the quanta in units that correspond to the RES value.

Note - Refer to the ts_dptbl(4) man page for instructions on how to make permanent changes to the table.
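The relationship between RES and the printed quanta can be shown with a short sketch. The values below assume the priority-59 quantum of 20 milliseconds from the table shown earlier; the function name is invented for illustration.

```python
# Sketch: a dispadmin quantum value is expressed in units of 1/RES second,
# so the same 20 ms quantum is printed differently at different resolutions.
def quantum_in_units(seconds, res):
    """Convert a quantum in seconds to dispadmin units at resolution RES."""
    return round(seconds * res)

# The priority-59 quantum of 20 ms:
assert quantum_in_units(0.020, 1000) == 20   # default RES=1000 (milliseconds)
assert quantum_in_units(0.020, 100) == 2     # dispadmin -r 100 -c TS -g
```

This is why quanta edited at one resolution must be written back in units matching the RES value in the file.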


The Interactive Scheduling Class


The interactive scheduling class is derived from the timesharing class, using most of its code and data structures. It is used to enhance interactive performance. The interactive scheduling class is the default scheduling class for processes running in OpenWindows or CDE. It automatically adds 10 to the calculated priority for the task in focus; that is, the one running in the active window. This gives better performance to the job for which a user is most likely to be waiting. When the active window changes, the priority boost is transferred to the new focus window. If you adjust the in-focus process's priority by nice or priocntl, the boost is not applied.
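The focus-boost rules described above can be sketched as follows. This is an illustration of the behavior, not kernel code; the function name and the sample priority value are invented.

```python
# Teaching sketch of the IA-class focus boost: the in-focus process gets
# +10 added to its calculated TS priority, the boost moves when focus
# changes, and it is withheld if the user adjusted the priority with
# nice or priocntl. Names and numbers are illustrative only.
BOOST = 10

def effective_priority(calculated, in_focus, user_adjusted=False):
    if in_focus and not user_adjusted:
        return calculated + BOOST
    return calculated

assert effective_priority(48, in_focus=True) == 58    # active window
assert effective_priority(48, in_focus=False) == 48   # background window
# nice/priocntl adjustment suppresses the boost:
assert effective_priority(48, in_focus=True, user_adjusted=True) == 48
```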


Real-Time Scheduling
Hard real-time systems must respond to an interrupt or other external event within a certain period of time (the latency), or problems arise. While SunOS does not provide a hard real-time facility, it does provide soft real-time capability, where the system tries to run the real-time thread as soon as possible, typically in less than a millisecond. There have been problems with real-time, general-purpose systems in the past, which the Solaris OE real-time facility tries to avoid, as shown in the overhead. Page faults are avoided by locking all possible pages that the process could use in memory. To avoid delays from lower priority jobs using the CPU, a thread can be interrupted at any time in favor of a real-time thread. Real-time work can have a significant impact on the system and should be used sparingly: it has higher priority than the system threads. Never put a general-purpose program, such as a database management system (DBMS), into the real-time class to try to improve its performance.


Real-Time Issues
Threads running in real time have an impact on the rest of the processes competing for system resources, especially if they are:

- Large - All of their pages are locked into memory, reserving potentially large amounts of main storage.
- Compute intensive - They prevent system threads from accessing the CPU if they use too much CPU time.
- Forking subprocesses - Child processes inherit their scheduling class from their parent unless priocntl is used.
- Stuck in a loop - It can be difficult to stop a runaway real-time thread on a small system because timesharing threads might be locked out.

If you do real-time processing:

- Remember that real-time threads do not change priorities like timesharing threads. Their priority is fixed for the duration.
- Consider using application threads. A bound thread with its LWP set into the real-time class by the priocntl system call has less impact on the system and can segment real-time and normal work more easily.
- If you are concerned about real-time dispatch latency, set hires_tick to 1 in /etc/system. This causes the system clock to run at 1000 Hz instead of 100 Hz. It causes more overhead on the CPU processing clock interrupts, but real-time work is dispatched more quickly.

Real-time processes that are I/O-bound relinquish control of the CPU while in a wait-on-I/O state. As soon as data becomes available, they are processed on the CPU unless preempted by a higher-priority process.

Note - A real-time process's memory address space could grow to consume all available virtual memory.

In the Solaris 8 OE, high-resolution timers (HRTs) can bypass the traditional 10-millisecond clock interface to expose the granularity of the physical clock interrupt from the hardware. Thus, the HRT interface allows a real-time process to take control of one processor (in a multiprocessor system) and operate to any degree of precision in timing events.


This feature is new in the Solaris 8 OE. You might want to discuss the implications of this with your students.


Real-Time Dispatch Parameter Table


The overhead lists the default values for the real-time dispatch parameter table. When a thread is put into the real-time class by priocntl(), the priority and the time quantum are specified at that time. No thread or process is put into the real-time class by default. Because the real-time priorities are fixed, the time quantum and global priority do not change for the life of the thread unless they are modified using priocntl. This also means that the fields from the timesharing dispatch parameter table that manage the priority changes are not needed.

Note - For instructions on how to make permanent changes to this table, refer to the man page for rt_dptbl(4).


The priocntl Command


The priocntl command is used to change the scheduling behavior of a process or, by using the priocntl(2) system call, an individual LWP. Using the priocntl command or call, you can change:

- Scheduling class
- Relative priority within the class
- Time quantum size

The priocntl command can also be used to display current configuration information for the system's process scheduler or to execute a command with specified scheduling parameters. You must be the root user to place a process into the real-time class.


For example, the priocntl command allows a process to be placed in real time with a specified priority and time quantum. The command shown in the overhead runs the command at a global real-time priority of 150 (the real-time (RT) relative priority of 50 plus the base priority of the RT class) and with a time quantum of 500 milliseconds. You can specify an infinite time quantum with the priocntl(2) system call by using the special value RT_TQINF for the time quantum.

Note - To change the scheduling parameters of a process using priocntl, the real or effective user ID of the user invoking priocntl() must match the real or effective user ID of the receiving process, or the effective user ID of the user must be superuser.
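The global-priority arithmetic in the example can be spelled out in a few lines. The RT base value of 100 used here is an assumption, chosen because it is consistent with the global priority of 150 quoted above for a relative priority of 50.

```python
# Sketch of the global-priority arithmetic from the priocntl example above.
# Assumption: the RT class global priorities start at a base of 100
# (consistent with the global priority of 150 quoted in the text).
RT_BASE = 100

def rt_global_priority(relative):
    """Global priority = RT class base + relative priority within the class."""
    return RT_BASE + relative

# RT relative priority 50, as in the overhead's priocntl example:
assert rt_global_priority(50) == 150
```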


Kernel Thread Dispatching


The overhead shows a simplified flow chart of the operation of kernel thread dispatching.

Old Thread
The switch request (pre-empt) routine is called either when a higher priority thread is put on the dispatch queue or when a thread has used up its time slice. It has two responsibilities: remove the current thread from the CPU and assign a new thread in its place. The first call is made to a scheduling class routine. The routine allows the scheduling class to perform any necessary work and checks to see if the departing thread should go to the front or the back of the dispatch queue.

A thread goes to the front of the dispatch queue if it has not used up its time slice.


The previous paragraph appears to over-simplify the round-robin approach to allocating CPU time. In fact, processes frequently change their position in the round-robin queue.

Because each CPU has its own dispatch queue, it is necessary to determine on which CPU's queue the thread should be placed. For a timesharing thread, this choice is made as follows:

- If a thread has been out of execution for less than three ticks, it is placed on the dispatch queue of the CPU on which it was last executed. (This interval can be changed with rechoose_interval; set it to the desired number of ticks.)
- Otherwise, the CPU that is currently running the lowest priority thread is chosen.

For a real-time thread, the dispatch queue of the CPU running the lowest priority thread is selected.

New Thread
After the replaced thread has been placed on a dispatch queue, a new thread must be chosen to run on the CPU. Candidates are considered in the following order, from highest priority (1) to lowest (5):

1. Real-time threads on the current CPU's dispatch queue
2. Real-time threads on any CPU's dispatch queue
3. Other threads on the current dispatch queue
4. Other threads on any dispatch queue
5. The idle thread (priority -1)

The chosen thread is then loaded onto the CPU (for example, its registers restored) and run. If the thread selected is the same thread that was just taken off of the CPU, it runs immediately because it is already loaded on the CPU.
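The selection order described above can be sketched as a small function. This is a simplified illustration, not the kernel's dispatcher; the thread representation and names are invented.

```python
# Sketch of the new-thread selection order: real-time work anywhere beats
# other work anywhere, the current CPU's queue is preferred within each
# group, and the idle thread runs only when every queue is empty.
def pick_next(current_queue, other_queues):
    def first_rt(queue):
        return next((t for t in queue if t["cls"] == "RT"), None)

    all_others = [t for q in other_queues for t in q]
    for candidate in (
        first_rt(current_queue),                      # 1. RT, current CPU
        first_rt(all_others),                         # 2. RT, any CPU
        current_queue[0] if current_queue else None,  # 3. other, current CPU
        all_others[0] if all_others else None,        # 4. other, any CPU
    ):
        if candidate is not None:
            return candidate
    return {"cls": "SYS", "name": "idle"}             # 5. the idle thread

ts = {"cls": "TS", "name": "shell"}
rt = {"cls": "RT", "name": "sensor"}
assert pick_next([ts], [[rt]]) is rt      # a remote RT thread beats local TS
assert pick_next([], [])["name"] == "idle"
```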


Processor Sets

Processor sets, a feature provided since the SunOS 5.6 release, add the ability to fence or isolate groups of CPUs for use by specially designated processes. This allows these processes to have guaranteed access to CPUs that other processes cannot use. A processor may belong to only one processor set at a time.

Processor sets differ significantly from processor binding (which is done using the pbind command or the processor_bind(2) system call). The pbind command designates a single specific processor on which the process must run, but other processes can continue to use that processor. With processor sets, the CPUs are dedicated to the processes that are assigned to the set. Work that is not assigned to a specific processor set must run on the CPUs that are not part of any processor set, so you must ensure that enough CPU capacity remains available for this non-associated work.


System CPUs can be grouped into one or more processor sets with the psrset -c command. These processors remain idle until processes (technically, LWPs) are assigned to them by the psrset -b command. Processors can be added to and removed from a processor set at any time with the psrset -a and psrset -r commands, respectively. Processor set definitions can be viewed with the psrset -p command.

Dynamic Reconfiguration Interaction

If a processor set is completely removed from the system by a Dynamic Reconfiguration (DR) detach, the processes using the processor set are unbound and able to run on any CPU in the system. If a large part of a processor set is removed, you might need to use psrset -a to add additional processors to the set. The system does not retain processor set configuration across DR operations. If you use DR attach to connect a board to a system with one or more processor sets, use psrset -a to add the CPUs back to the appropriate set. System-defined processor sets are automatically restored.


Using Processor Sets

When a processor set has been created, processes are assigned to it with the psrset -b command. These processes can run only on the processors in the specified processor set, but they can run on any processor in the set. Any child processes or new LWPs created by a process in a processor set run in the same processor set. Processes are unbound from a processor set with the psrset -u command. Only the root user may create, manage, and assign processes to processor sets. The psrset -q command displays the processes bound to each processor set, and the psrset -i command shows the characteristics of each processor set.

In addition to manually created processor sets, the system can also provide automatically defined system processor sets. These processor set definitions cannot be modified or deleted. Any user can assign processes to system-defined processor sets, but other processes are not excluded from these CPUs. A processor can be in a system or non-system processor set, but not both at the same time. Processor sets can also be managed from within a program by using the psrset(2) system call.


The Run Queue

The performance monitoring tools report the length of the CPU run queue. The run queue is the number of kernel threads waiting to run on all system dispatch queues; it is not scaled by the number of CPUs. A depth of three to five is normally acceptable, but if heavily CPU-bound work is constantly being run, this might be too high. It is not possible to tell what work is being delayed or how long it has been delayed. Remember, higher-priority work can constantly preempt lower-priority work, subject to any CPU starvation processing done by the thread's scheduling class.
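The bookkeeping is simply a sum across all dispatch queues (a sketch with made-up queue contents):

```python
# The reported run queue is a total over every CPU's dispatch queue, not a
# per-CPU figure (illustrative only).

def run_queue_length(dispatch_queues):
    """Total threads waiting across all dispatch queues."""
    return sum(len(q) for q in dispatch_queues)

queues = [["t1", "t2"], [], ["t3"]]  # three CPUs' dispatch queues
print(run_queue_length(queues))      # 3 waiting threads, regardless of CPU count
```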


CPU Activity
The performance monitoring tools report four different kinds of CPU activity. These types are reported as percentages.

• User: Programming in the user process is being run. This is the amount of time the CPU spends running the user's program, and this time is under the programmer's control. It does not include system calls but does include shared library routines.

• System: System time is reported when programming resident in the kernel is being run. This is the amount of time the CPU spends executing kernel code on behalf of the user program. This code can be executed because of interrupts, system thread dispatches, and application system calls.

• Wait I/O: All CPUs are idle, but at least one disk device is active. This is the amount of time spent waiting for blocked I/O operations to complete. If this is a significant portion of the total, the system is I/O bound.

• Idle: None of the CPUs are active, and no disk devices are active. If the CPU spends a lot of time idle, a faster processor only increases idle time.

If the CPU is not spending a lot of time idle (less than 15 percent), then threads that are runnable have a higher likelihood of being forced to wait before being put into execution. In general, if the CPU is spending more than 70 percent of its time in user mode, the application load might need some balancing. In most cases, 30 percent is a good high-water mark for the amount of time that should be spent in system mode.
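The rules of thumb above can be collected into a small helper (a sketch only; the 15/70/30 percent figures are the guidelines from the text, not hard limits, and as noted below they depend on the workload):

```python
# Rule-of-thumb interpretation of CPU utilization percentages (illustrative).

def assess_cpu(usr, sys, wio, idle):
    """Return advisory notes for a set of CPU utilization percentages."""
    notes = []
    if idle < 15:
        notes.append("little idle time: runnable threads may be kept waiting")
    if usr > 70:
        notes.append("heavy user-mode load: the application mix may need balancing")
    if sys > 30:
        notes.append("high system-mode time: investigate kernel activity")
    return notes

print(assess_cpu(usr=80, sys=12, wio=0, idle=8))   # two warnings
print(assess_cpu(usr=10, sys=5, wio=0, idle=85))   # no warnings
```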


Obviously, the amount of time spent in USR and SYS mode depends on the workload characteristics of the system being monitored. There might be instances where 70 percent or more in USR mode demonstrates good performance. You might want to discuss this with the students. It is worth mentioning that NFS servers usually exhibit a high amount of %sys time. This is due to the high usage of NFS program code resident in the kernel.

Some reporting tools, such as vmstat, add Wait I/O and Idle time together, usually reporting it as Idle time. To locate the devices causing a high Wait I/O percentage, check the %w column, reported by iostat -x. The amount of CPU time used by an individual process can be determined by using /usr/ucb/ps -au or system accounting.


CPU Performance Statistics

The table in the overhead compares the data available from vmstat and sar relevant to CPU activity. Sample outputs of the sar command with CPU-related options are shown. The -u option reports CPU utilization.

# sar -u
SunOS earth 5.8 Generic sun4m    11/08/00

19:00:01    %usr    %sys    %wio   %idle
20:00:00       3       1       0      96
21:00:00       5       2       0      93

Average        4       2       0      94
#

The fields in this output are:

• %usr: Lists the percentage of time that the processor is in user mode
• %sys: Lists the percentage of time that the processor is in system mode
• %wio: Lists the percentage of time the processor is idle and waiting for I/O completion
• %idle: Lists the percentage of time the processor is idle and is not waiting for I/O
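As a quick way to work with such reports, a few lines of Python can parse sar -u data lines and recompute the interval average (a sketch that assumes the four-column layout shown above; the sample values come from the output sample):

```python
# A small parser for sar -u data lines (illustrative).

def parse_sar_u(lines):
    """Parse 'time %usr %sys %wio %idle' lines into dicts."""
    rows = []
    for line in lines:
        time, usr, sy, wio, idle = line.split()
        rows.append({"time": time, "usr": int(usr), "sys": int(sy),
                     "wio": int(wio), "idle": int(idle)})
    return rows

sample = ["20:00:00 3 1 0 96",
          "21:00:00 5 2 0 93"]
rows = parse_sar_u(sample)
avg_usr = sum(r["usr"] for r in rows) / len(rows)
print(avg_usr)  # 4.0, matching the Average line in the sample
```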

The sar -q command reports the average run queue length while occupied and the percentage of time the queue is occupied. Sample output is shown as follows.

# sar -q
SunOS earth 5.8 Generic sun4m    11/08/00

19:00:01  runq-sz %runocc swpq-sz %swpocc
20:00:00      1.1       1
21:00:00      1.0       3

Average       1.1       2
#

The fields in this output are:

• runq-sz: The average number of kernel threads in memory waiting for a CPU to run. Typically, this value should be less than 2, although peaks may occur in the 3-5 range. Sustained higher values mean that the system might be CPU bound.
• %runocc: The percentage of time a thread was on any dispatch queue.

The sar -w command reports system swapping and switching activity. Sample output is shown.

# sar -w
SunOS earth 5.8 Generic sun4m    11/08/00

19:00:01 swpin/s bswin/s swpot/s bswot/s pswch/s
20:00:00    0.00     0.0    0.00     0.0      93
21:00:00    0.00     0.0    0.00     0.0     138

Average     0.00     0.0    0.00     0.0     116
#

The applicable field in this output is:

• pswch/s: The number of kernel thread switches per second

The vmstat command also reports data relevant to CPU activity. However, on multiprocessor systems, the CPU values are the average of all CPUs. Sample output of the vmstat command is shown.
# vmstat 2
 procs     memory            page             disk           faults      cpu
 r b w   swap  free  re   mf  pi  po  fr de  sr f0  s0 s6 --  in    sy  cs us sy id
 0 0 0  30872  3688   0    2   8  13  24  0   5  0   1  0  0 184   370 147  1  0 98
 0 0 0 164696   976   0    3   0  40 124  0 102  0   1  0  0 175   430 170  1  0 98
 0 0 0 164696   976   0    1  12   0   0  0   0  0   1  0  0 314   766 203  2  2 96
 1 0 0 164592   840   1  289   4  40 140  0  93  0   1  0  0 244  5362 250  9 14 77
 1 0 0 164464   984   0 1098   0   0   0  0   0  0   0  0  0 291 20707 282 33 67  0
 1 0 0 164464   992   0 1190  16   0   0  0   0  0   2  0  0 385 21714 332 32 68  0
 1 0 0 164512   960   0 1151 116  48 108  0   0  0 115 16  0 388 20748 360 30 70  0
 1 0 0 164520   976   0 1161   8   0   0  0   0  0   0  1  0 174 21588 288 27 73  0
 1 0 0 164512   952   0 1160   4   0   0  0   0  0   0  0  0 329 21575 312 32 68  0
 1 0 0 164512   928   0 1160   8   4   4  0   0  0   0  1  0 356 21156 344 31 69  0
 0 0 0 164504   920   0 1098   0   0   0  0   0  0   0  0  0 201 19975 263 38 62  0
 0 0 0 164720  1120   0    3   4   0   0  0   0  0   0  1  0 182  1052 186  2  1 96
 0 0 0 164712  1112   0    0   0   0   0  0   0  0   0  0  0 182   611 191  0  0 99
 0 0 0 164712  1104   0    0   4   0   0  0   0  0   0  1  0 174   489 181  1  1 98
 0 0 0 164712  1096   0   16   8   0   0  0   0  0   0  0  0 266   589 204  1  2 97
 0 0 0 164720  1096   0    0   0   0   0  0   0 21   0  0  0 276   548 187  2  0 98
 0 0 0 164728   960   0   34 100   0   8  0   0  0  13  0  0 298   668 270  0  4 96
^C#

The fields in this output are:

• r: The number of kernel threads in the dispatch (run) queue; the number of runnable processes that are not running
• b: The number of blocked kernel threads waiting for resources (I/O, paging, and so forth); the number of processes waiting for a block device transfer (disk I/O) to complete
• sy: Percentage of system time
• cs: CPU context switch rate
• us: Percentage of user time
• id: Percentage of idle time


Available CPU Performance Data


The overhead shows the origin and meaning of the data reported in the CPU monitoring reports.


CPU Control and Monitoring


There are a number of commands that provide CPU control and status information:

• mpstat: Shows individual CPU activity statistics
• psradm: Enables and disables CPUs
• psrinfo: Displays which CPUs are enabled and their speeds
• prtconf: Shows device configuration
• /usr/platform/sun4X/sbin/prtdiag: Prints system hardware configuration and diagnostic information
• psrset: Manages processor sets
• sysdef: Shows software configuration


The mpstat Command

The mpstat performance monitor displays per-processor performance statistics in a manner similar to vmstat. It reports per-CPU load and load-balancing behavior, and this data can reveal load imbalances between processors. You do not need mpstat for normal tuning tasks, but it is very helpful when the cause of bad performance, especially if it is CPU related, cannot be located. It can also point to locking problems, which can be identified with the lockstat command.
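Detecting the kind of imbalance mpstat exposes can be sketched like this (the 25-point spread threshold is an arbitrary choice for illustration, not an mpstat convention):

```python
# Flag per-CPU load imbalance from mpstat-style usr+sys busy percentages
# (illustrative sketch).

def find_imbalance(per_cpu_busy, spread_limit=25):
    """Return (busiest, idlest) CPU ids if the busy spread exceeds the limit."""
    busiest = max(per_cpu_busy, key=per_cpu_busy.get)
    idlest = min(per_cpu_busy, key=per_cpu_busy.get)
    if per_cpu_busy[busiest] - per_cpu_busy[idlest] > spread_limit:
        return busiest, idlest
    return None

print(find_imbalance({0: 95, 1: 20, 2: 30}))  # CPU 0 overloaded vs. CPU 1
print(find_imbalance({0: 40, 1: 45}))         # balanced: None
```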

Sample output of the mpstat command is shown.

# mpstat
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    0     0    0   35    0    2    1    0    96    0   0   0 100
  2    0   0    4   208    8   34    0    2    1    0    81    0   0   0 100
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    2   0    0     0    0   30    0    1    0    0    54    0   0   0 100
  2    0   0    4   212   11   21    0    1    0    0    25    0   0   0 100
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    0     0    0   31    0    1    0    0    53    0   0   2  98
^C#

The output contains the following fields:

• CPU: Processor ID
• minf: Minor faults
• mjf: Major faults
• xcal: Inter-processor cross-calls (which occur when one CPU wakes up another CPU by interrupting it)
• intr: Interrupts
• ithr: Interrupts as threads (not counting the clock interrupt)
• csw: Context switches
• icsw: Involuntary context switches
• migr: Thread migrations (to another processor)
• smtx: Spins on mutual exclusion locks (mutexes) not acquired on the first try, and the number of times the CPU failed to obtain a mutex immediately
• srw: Spins on readers and writer locks (lock not acquired on first try)
• syscl: System calls
• usr: Percent user time
• sys: Percent system time
• wt: Percent wait time
• idl: Percent idle time


mpstat Data
The overhead shows the data reported by the mpstat command and its origin. mpstat also provides CPU utilization data for each CPU.


Solaris Resource Manager

Solaris Resource Manager is Sun's software tool for making resources available to users, groups, and enterprise applications. Solaris Resource Manager is an add-on product, not shipped with the Solaris 8 OE, that offers the ability to control and allocate CPU time, processes, virtual memory, connect time, and logins. Solaris Resource Manager allocates resources according to a defined policy and prevents resource allocation to users and applications that are not entitled to them. When a resource policy is set, mission-critical applications get the resources they demand. Solaris Resource Manager prevents runaway processes from consuming all of the available CPU power or virtual memory.

Because Solaris Resource Manager is share based, not percentage based, it can re-allocate system resources based on the current mix of shares. As users log on and off, and as processes start and terminate, the Solaris Resource Manager software recalculates how much resource to allocate to all current processes and users based on policy rules. This helps ensure that no process is completely starved of resources and that no single process can consume all available CPU and virtual memory.


Solaris Resource Manager Hierarchical Shares Example

CPU time is allocated based on an assigned number of shares rather than on a flat percentage basis. When applications are idle or are consuming less than their designated CPU allotment, other applications can take advantage of the unused processing power. The granularity of control can be as deep or flat as the system administrator desires. Shares can be allotted to a department, and then to individuals in the department.

In the overhead, the system administrator divides system resources between three groups of users: Financial Database (60 shares), Web Server (10 shares), and Operations Department (30 shares). Resources allocated to the Operations Department are then further distributed to Kelvin (6 shares), Susan (5 shares), and Brian (19 shares). The 30 shares of CPU resources allocated to the Operations Department represent a minimum of 30 percent (30 of 100 shares) if the Financial Database and Web Server are also in operation. If the Financial Database were not running, Operations would receive 75 percent of CPU resources (30 of 40 shares). The ratio of shares is important, not the actual number of shares.
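The share arithmetic in this example is simple enough to verify directly (a sketch; SRM itself performs this recalculation continuously as the set of active users and processes changes):

```python
# Share arithmetic from the hierarchical shares example (illustrative).

def entitlement_pct(name, active_shares):
    """Minimum CPU percentage guaranteed to `name`, given active share counts."""
    return 100.0 * active_shares[name] / sum(active_shares.values())

all_active = {"financial_db": 60, "web_server": 10, "operations": 30}
print(entitlement_pct("operations", all_active))  # 30.0 (30 of 100 shares)

db_idle = {"web_server": 10, "operations": 30}    # Financial Database not running
print(entitlement_pct("operations", db_idle))     # 75.0 (30 of 40 shares)
```

The same ratio rule applies recursively inside the Operations Department for Kelvin, Susan, and Brian.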


SRM Limit Node

By default, the Solaris OE includes system parameters that impose limits on resources; for example, the number of processes per user. However, these system parameters are defined within the kernel for all users and do not allow different values for different users. Solaris Resource Manager defines a resource control policy for each individual user or for a group of users. Thus, the Solaris Resource Manager system administrator has a great deal of flexibility in defining the resource control policy for a system.

The fundamental building block in Solaris Resource Manager is a per-user persistent object called an lnode (limit node). An lnode exists for every user and group of users on the system. The lnode object contains system and user attributes that correspond to the resource control policy for that user. The kernel accesses lnodes to determine a user's specific resource limits and to track that user's resource usage. When a user starts a process, the user's lnode records the process and its activity. As the process runs and requests resources, the kernel determines whether the process is entitled to the requested resources based on the lnode information.

CPU Monitoring Using SE Toolkit Programs
The cpu_meter.se Program
The cpu_meter.se program (Figure 3-1) is a basic meter that shows usr, sys, wait, and idle CPU as separate bars.

# /opt/RICHPse/bin/se /opt/RICHPse/examples/cpu_meter.se

Figure 3-1

cpu_meter.se Program Results

The vmmonitor.se Program


The vmmonitor.se program is an old and oversimplified monitor script, based on vmstat, that looks for low RAM and swap space.

The mpvmstat.se Program


The mpvmstat.se program provides a modied vmstat-like display. It prints one line per CPU.

Check Your Progress

Before continuing on to the next module, check that you are able to accomplish or answer the following:

• Describe the Solaris OE scheduling classes
• Explain the use of dispatch priorities
• Describe the time-sharing scheduling class
• Explain Solaris OE real-time processing
• List the commands that manage scheduling
• Describe the purpose and use of processor sets
• Introduce Solaris Resource Manager

Think Beyond

Consider the following questions:

• Why are there four scheduling classes? What was the point of adding the interactive scheduling class?
• Why would you want to change a dispatch parameter table?
• When might you want to use the SunOS real-time capability?
• What types of work might you use processor sets for?
• Why might you use Solaris Resource Manager instead of processor sets?


CPU Scheduling Lab

Objectives

Upon completion of this lab, you should be able to:

• Read a dispatch parameter table
• Change a dispatch parameter table
• Examine the behavior of CPU-bound and I/O-bound processes, both individually and together
• Run and observe a process in the real-time scheduling class

CPU Dispatching

This lab provides an opportunity to alter the dispatch table on your system. When the table has been altered, you observe changes to the manner in which processes are executed, differences in the time taken to complete, and possible differences in the expected display output from monitoring tools. You will need to be logged in as the root user to complete these exercises.

Complete the following steps:

1. Display the configured scheduling classes by typing:

   dispadmin -l

   What classes are displayed?
   ___________________________________________________________

2. Use dispadmin to display the time-sharing dispatch parameter table by typing:

   dispadmin -c TS -g

   What is the longest time quantum for a time-share thread?
   ___________________________________________________________

3. Use the dispadmin command to display the time-share dispatch parameter table so that the time quantums are displayed in clock ticks. What command did you use?
   ___________________________________________________________

4. Display the real-time dispatch parameter table by typing:

   dispadmin -c RT -g

   What is the smallest time quantum for a real-time thread?
   ___________________________________________________________

5. Display the configured scheduling classes again. What has changed?
   ___________________________________________________________

Process Priorities

Process priorities are assigned in a variety of ways:

• A process is assigned a priority within its class (timeshare process priorities range from 0 through 59, with 59 being the highest priority). Each priority has an associated quantum, which is the amount of CPU time allocated, in ticks.

• If a process exceeds its assigned quantum before it has finished executing, it is assigned a new priority, tqexp. The new priority, tqexp, is lower, but the associated quantum is higher, so the process gets more time.

• If a process (or thread) blocks waiting for I/O, it is assigned a new priority, slpret, when it returns to user mode. slpret is a higher priority (in the 50s), allowing faster access to the CPU.

• If a process (or thread) does not use its time quantum in maxwait seconds (a maxwait of 0 equals 1 second), its priority is raised to lwait, which is higher and ensures that the thread will get some CPU time.
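The priority adjustments described above can be modeled with a miniature dispatch parameter table (the table values below are invented for illustration; real values come from dispadmin -c TS -g):

```python
# Toy model of time-share priority adjustment (illustrative only).

TABLE = {
    # pri: quantum in ticks, tqexp (after using quantum), slpret (after sleep)
    0:  {"quantum": 200, "tqexp": 0,  "slpret": 50},
    10: {"quantum": 160, "tqexp": 0,  "slpret": 51},
    20: {"quantum": 120, "tqexp": 10, "slpret": 52},
    30: {"quantum": 80,  "tqexp": 20, "slpret": 53},
    40: {"quantum": 40,  "tqexp": 30, "slpret": 55},
    50: {"quantum": 40,  "tqexp": 40, "slpret": 58},
    59: {"quantum": 20,  "tqexp": 50, "slpret": 59},
}

def used_quantum(pri):
    """CPU hog: drop to tqexp (lower priority, but a bigger quantum)."""
    return TABLE[pri]["tqexp"]

def woke_from_sleep(pri):
    """I/O-bound thread returning to user mode: boosted to slpret."""
    return TABLE[pri]["slpret"]

pri = 59
for _ in range(3):               # a CPU-bound thread repeatedly burns its quantum...
    pri = used_quantum(pri)
print(pri)                       # ...and sinks to 30 in this toy table
print(woke_from_sleep(pri))      # an I/O wait boosts it back into the 50s: 53
```

This is the behavior you should observe in the lab: dhrystone sinks to low priorities while io_bound stays in the 50s.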

The purpose of this portion of the lab is to observe the process priorities during execution of a program. The two programs that you will be executing are dhrystone and io_bound. The dhrystone program is an integer benchmark that is widely used for benchmarking UNIX systems; it provides an estimate of the raw performance of processors. (You might also be familiar with a floating-point benchmark called whetstone that was popular for benchmarking FORTRAN programs.) The dhrystone program is entirely CPU bound and runs very fast on machines with high-speed caches. Although it is a good measure for programs that spend most of their time in some inner loop, it is a poor benchmark for I/O-bound applications.

The io_bound program, as the name implies, is an I/O-intensive application. I/O-bound processes spend most of their time waiting for I/O to complete. Whenever an I/O-bound process wants the CPU, it should be given the CPU immediately. The scheduling scheme gives I/O-bound processes higher priority for dispatching.

Complete the following steps, as root, using the lab3 directory.

Note: If the dhrystone program runs too quickly (this is system-dependent) to be monitored correctly, read the README file in the lab3 directory for directions on how to compensate for its actions.

1. Type:

   /usr/proc/bin/ptime ./io_bound

   Record the results.
   ___________________________________________________________
   ___________________________________________________________

2. Type:

   /usr/proc/bin/ptime ./dhrystone

   Record the results.
   ___________________________________________________________
   ___________________________________________________________

3. Using the elapsed time of both results, calculate the factor of time between dhrystone and io_bound. For example, if io_bound took ten times longer to execute (in real elapsed time) than dhrystone, then the factor is 10.

4. Run the dhrystone command. Pass a parameter of 500000 times the factor (shown as value in the command line below).

   ./dhrystone value
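The arithmetic behind the factor calculation is straightforward (the timings here are hypothetical; yours come from ptime's "real" figures):

```python
# Factor calculation for the dhrystone parameter (example timings made up).

io_elapsed = 20.0     # seconds, from ptime's "real" figure for ./io_bound
dhry_elapsed = 2.0    # seconds, for ./dhrystone
factor = io_elapsed / dhry_elapsed   # 10.0
value = int(500000 * factor)         # parameter to pass: ./dhrystone value
print(value)  # 5000000
```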

5. Use prstat 1 | egrep "(PID|dhry)" in another terminal window to view the priority of the dhrystone process. What priority has it been assigned?
   ___________________________________________________________

   Why is it so low? (Get several snapshots of the process priority using the ps -cle command.)
   ___________________________________________________________
   ___________________________________________________________

6. When dhrystone has completed, type:

   ./io_bound

   This runs an I/O-intensive application.

7. Type prstat 1 | egrep "(PID|boun)" in another terminal window to view the priority of this process. What priority has it been assigned?
   ___________________________________________________________

   Why? (Again, get several snapshots of the priority using ps -cle.)
   ___________________________________________________________
   ___________________________________________________________

8. Type:

   ./test.sh value

   where value is the value used in the dhrystone command in step 4. This shell script runs both dhrystone and io_bound and stores the results in two files, io_bound.out and dhrystone.out. It then waits for both of them to complete. Record the results.

   • io_bound results
   _______________________________________________________
   _______________________________________________________

   • dhrystone results
   _______________________________________________________
   _______________________________________________________

   Why did the dhrystone process take longer to execute when it was run at the same time as the io_bound process?
   ___________________________________________________________
   ___________________________________________________________

The Dispatch Parameter Table

Using the lab4 directory, perform the following steps as root:

1. Type:

   dispadmin -g -c TS > ts_config

   This saves the original timeshare dispatch parameter table.

2. Type:

   dispadmin -s ts_config.new -c TS

   You have just installed a new timeshare dispatch parameter table. Do not look at the new table at this time.

3. Type:

   ./test.sh

   Note the results. Given the results, can you make an educated guess as to the nature of the ts_config.new dispatch parameter table?
   _________________________________________________________
   _________________________________________________________

   Look at ts_config.new to confirm.

4. Type:

   ps -cle

   Do you notice anything strange about some of the priorities you see? Which processes have very low priorities? Why?
   _________________________________________________________
   _________________________________________________________

5. Type:

   dispadmin -s ts_config -c TS

   This restores the original table.

Real-Time Scheduling

Run this exercise as root from the lab4 directory. Use the following steps:

1. Run test.sh as a real-time process. Type:

   priocntl -e -c RT ./test.sh

   Note: Running these processes as real-time might cause your system to temporarily hang.

2. What is different from the timeshare case?
   __________________________________________________________
   __________________________________________________________

   Can you explain this?
   __________________________________________________________
   __________________________________________________________

3. Record the results for io_bound and dhrystone.

   • io_bound results
   ______________________________________________________
   ______________________________________________________

   • dhrystone results
   ______________________________________________________
   ______________________________________________________
Processor Sets

This lab is designed to illustrate how you set up and populate processor sets with CPUs and processes. The lab also demonstrates the potential benefits on a moderately loaded system. Document the results that you obtain during the lab. The results will vary depending on the number and speed of CPUs on your system.

Note: If your system is a single-processor system, do not attempt this lab. Best results are obtained on a system with more than two CPUs.

Run this exercise as root. Use the following steps:

1. Open two terminal windows, and note their process ID values. In each of the windows, type the following command lines:

   # exec csh
   % while 1
   ? end

2. In a third window, issue the following command:

   # mpstat 5

   Observe the output in the smtx column. On systems with more and faster CPUs, the values in the smtx column tend to be higher than on systems with fewer or slower CPUs. If there is a mix of systems in the classroom, compare the values between systems.

3. Execute zoom.se to see if a mutex problem has been detected.

4. Create a processor set using the following command:

   # psrset -c

   If the command is successful, you will see a message similar to:

   created processor set 1

5. Add a processor to the processor set:

   # psrset -a SID PROCID

   where SID represents the processor set ID and PROCID represents the processor ID. To obtain the values for SID and PROCID, respectively, issue the following commands:

   # psrset -i
   # psrinfo -v

   When the processor has been added to the processor set, observe the values output by mpstat. Is there any activity on the CPU that has been added to the processor set? Can you explain this?

6. Migrate the two C shell windows to the processor set:

   # psrset -b 1 PID1 PID2

   where PID1 and PID2 represent the process ID values for the two windows created in Step 1. After the two processes have been migrated to the processor set, is there a reduction in the smtx activity across the remaining processors in the mpstat output?


System Caches
Objectives
Upon completion of this module, you should be able to:
q Describe the purpose of a cache
q Describe the basic operation of a cache
q List the various types of caches implemented in Sun systems
q Identify the important performance issues for a cache
q Identify the tuning points and parameters for a cache
q Understand the memory system's hierarchy of caches and their relative performances
q Use standard tools to display cache-related statistics


Relevance


Present the following questions to stimulate the students and get them thinking about the issues and topics presented in this module. While they are not expected to know the answers to these questions, the answers should be of interest to them and inspire them to learn the content presented in this module.

Discussion The following questions are relevant to understanding the content of this module:
q What is a cache?
q Why are caches used at all?
q What areas of the system might benefit from the use of a cache?
q How do you tune a cache?

Additional Resources
The following references provide additional details on the topics discussed in this module:

q Wong, Brian L. Configuration and Capacity Planning for Solaris Servers. Palo Alto: Sun Microsystems Press/Prentice Hall, 1997.
q The SPARC Architecture Manual, Version 9. Prentice Hall.
q Catanzaro, Ben. Multiprocessor System Architectures. SunSoft Press/Prentice Hall, 1994.


What Is a Cache?
Caches are used to keep a (usually small) subset of active data near where it is being used. Just as the reference books you use tend to remain on your desk, the system tries to keep data being used near the point of use. This provides the ability to access the data more quickly and to avoid a time-consuming retrieval of the data from a more remote location. Caches are used by the hardware, in the CPU and I/O subsystem, and by the software, especially for file system information. The hardware and software contain dozens of caches. A cache usually can hold only a small subset of the data available, sometimes less than 0.01 percent, so it is critical that this subset contain as much of the data in use as possible. While there are many different ways to manage a cache, only a few are commonly used.


Hardware Caches
The most commonly known hardware caches are the CPU caches. Modern CPUs contain at least two data caches: Level 1 and Level 2. The Level 1 cache is also called the on-chip or internal cache. It is usually approximately 32 Kbytes in size. The Level 2 cache is called the external cache; it is most commonly known as the Ecache. In addition, main storage can be considered a cache because it holds data from disks and the network that the CPU might need to access. CPUs usually also contain a cache of virtual address translations, called the Translation Lookaside Buffer (TLB), Dynamic Lookaside Table (DLAT), or Page Descriptor Cache (PDC), which is described later. In addition, the hardware I/O subsystem has various caches that hold data moving between the system and the various peripheral devices.


Static Random Access Memory (SRAM) and Dynamic RAM (DRAM)


Most caches are made from static random access memory (SRAM). The system's main storage is made from dynamic RAM (DRAM). Static RAM uses four transistors per bit, while DRAM uses only one. It would seem that using DRAM everywhere would be a good idea; it takes one-fourth the space, so there could be four times the memory. But DRAM has one problem: the electrical charge that gives it its value fades (quickly) over time. Usually every 10 or so system clock cycles, the contents of the DRAM must be refreshed. Refresh involves reading the data into the memory controller and writing it back fresh. Because the memory being refreshed is busy during this process, data cannot be read from or written to it. Given the cost of accessing main memory, this cost is usually hidden. SRAM does not require a refresh. As a result, the data it contains is always available, making it ideal for locations that require high-performance access, such as caches.


CPU Caches
As mentioned previously, there are usually two levels of CPU cache, Level 1 and Level 2. The Level 1 cache is usually 4-40 Kbytes in size, and is constructed as part of the CPU chip. Of the three million transistors on the UltraSPARC-II CPU chip, two million are used for the cache. The Level 2 cache is usually 0.5-8 Mbytes in size, and is located near, but not on, the CPU. The Level 1 cache operates at CPU speeds because it needs to provide data to the CPU as rapidly as possible. Data must be in the Level 1 cache before the CPU can use it. You can use several commands to list the sizes of the Level 1 and Level 2 caches. These commands are described later in this module.


Cache Operation
All caches, hardware and software, operate in a similar fashion. A request is made to the cache controller or manager for a particular piece of data. That data is identified by its tag, which could be its address or some other uniquely identifying attribute. If the data is found in the cache, it is returned to the requester. This is known as a cache hit. If the data cannot be located in the cache, it must be requested from the next level of storage up the chain. This is known as a cache miss. The requester must wait, or process other work if it can, until the requested data becomes available. The length of time that the requester must wait depends on the construction of and activity in the system.
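The lookup flow described above can be sketched in a few lines of Python. This is an illustrative model only; the class and variable names are invented for the example and do not come from any real cache implementation.

```python
# Sketch of the generic lookup flow: data is identified by a tag; a hit
# returns it immediately, a miss fetches it from the next level up.

class Cache:
    def __init__(self, backing_store):
        self.lines = {}            # tag -> data
        self.backing = backing_store
        self.hits = 0
        self.misses = 0

    def read(self, tag):
        if tag in self.lines:      # cache hit: return to requester at once
            self.hits += 1
            return self.lines[tag]
        self.misses += 1           # cache miss: requester waits while the
        data = self.backing[tag]   # data comes from the next level up
        self.lines[tag] = data     # move in for future hits
        return data

memory = {0x40: "a", 0x80: "b"}
c = Cache(memory)
c.read(0x40)                       # miss: fetched from memory
c.read(0x40)                       # hit: served from the cache
print(c.hits, c.misses)            # -> 1 1
```

The same tag-lookup pattern applies whether the "next level up" is the Level 2 cache, main memory, or a disk.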


Cache Miss Processing


For efficiency reasons, caches are usually kept 100 percent full. This way, as much data as possible is available for high-speed access. When a cache miss occurs, new data must be put in the cache. This process, called a move in, must replace data already in the cache. This replacement process requires that a location for the incoming data be found, at the expense of some data already present in the cache. Depending on the design of the cache, there might be only one place that the new data can go, or there might be many. If there is more than one, a location least likely to hurt performance is chosen. The algorithm used to find the new location almost always assumes that the oldest unreferenced candidate location (the longest unused) is the least valuable and will choose it as the location. This type of algorithm is known as least recently used (LRU).
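The LRU replacement policy can be sketched as follows. This is a simplified, fully associative model with invented names, intended only to show the eviction choice the text describes.

```python
# Minimal sketch of LRU replacement in a fixed-size cache: on a miss with
# the cache full, the longest-unused line is the move-out victim.
from collections import OrderedDict

class LRUCache:
    def __init__(self, nlines):
        self.nlines = nlines
        self.lines = OrderedDict()          # oldest entry first

    def access(self, tag):
        if tag in self.lines:               # hit: mark as most recently used
            self.lines.move_to_end(tag)
            return True
        if len(self.lines) >= self.nlines:  # full: evict the LRU line
            self.lines.popitem(last=False)
        self.lines[tag] = "data"            # move in the new line
        return False

c = LRUCache(nlines=2)
c.access("A"); c.access("B")
c.access("A")          # hit; "B" is now the least recently used
c.access("C")          # miss; evicts "B", not "A"
print("B" in c.lines)  # -> False
```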


Ask what the most common miss of this type is. The answer is a page fault. It is important that the students begin to see the operational similarities between all the different system caches as soon as possible.


Cache Data Replacement


When a location for the incoming data is found, the system cannot just copy the new data into this location; the incoming data could overlay the only current copy of the existing data. In this case, the existing data must be moved out, or flushed, and copied out to the next level. If the next level up already has a current copy of the data, it can be abandoned, and the new data copied directly in.
 

When data is abandoned, this is called a write cancellation. If the data can be abandoned, the cache access will be much faster.

Depending on the hardware design, the move in can begin before the move out has finished. The amount of data that is moved is called a cache line. A cache line in the CPU caches is usually 64 bytes, but sometimes is only 16. In other types of caches, it can be as small as 1 byte or as large as several Kbytes, but CPU caches usually have a 16- to 64-byte line size. Level 2 cache line size depends on both the total cache size and the system architecture.


Cache Operation Flow


In summary, when data is requested from a cache:

q If the data is present, the cache line (or portion of the line) is returned to the requester.
q If the data is not present, the requester must wait until the data has been received from the next level up. When the data is available, a cache location is found for it, and the existing data in that location is either abandoned or moved out, depending on its state at the next level up. A full cache line of data will be transferred.
q If the next-level cache cannot find the data, it requests the data from its next level up, until the data is found and moved in or an error is generated.


Cache Hit Rate


The cache hit rate is the percentage of memory accesses that result in cache hits, or finding the requested data in cache. The cache hit rate depends on:

q The size of the cache: how much room there is
q The cache's fetch rate: how much data is placed in the cache for each memory access
q Locality of reference: how close newly referenced data is to recently used data
q Cache structure: how the cache is built and managed


Hits impose a cost on the system, as time is taken to search the contents of the cache.


A unit, as mentioned in the overhead, is just a unit of measurement. On page 4-26, there is a comparison chart. The ratio of Level 1 and Level 2 cache is 1:6. The values 1 and 6 have been multiplied by 20 to make the calculations more emphatic.

Cache misses can be very expensive because of the much higher cost of getting data from the next level up. Even a miss rate of just a few percent can make a noticeable difference in a system's performance.
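The instructor note's illustrative unit costs (a 1:6 hit-to-miss ratio, scaled by 20 to give 20 and 120 units) make this point numerically: average access time is the hit rate times the hit cost plus the miss rate times the miss cost.

```python
# Average cost per access: hit_rate*hit_cost + miss_rate*miss_cost,
# using the instructor note's illustrative 20-unit hit / 120-unit miss.

def avg_access_time(hit_rate, hit_cost=20, miss_cost=120):
    """Average cost per access for a given hit rate (0.0 to 1.0)."""
    return hit_rate * hit_cost + (1.0 - hit_rate) * miss_cost

print(avg_access_time(1.00))   # perfect hit rate: 20.0 units
print(avg_access_time(0.97))   # 3 percent misses: 23.0 units, 15% slower
print(avg_access_time(0.90))   # 10 percent misses: 30.0 units, 50% slower
```

A drop of only 10 percentage points in the hit rate makes every access cost half again as much on average.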


Effects of CPU Cache Misses


Performance degrades very quickly as the hardware cache miss rate increases. There are many ways to improve the hit rate, and most involve the design characteristics of the cache or the software design. Different characteristics can be used depending on system requirements and cost considerations.


System administrators do not have many options to improve the cache hit rate. Two ways would be to improve file system access and to alter the process load of the system.

The hit rate of a cache is determined by its characteristics as much as by its overall size. With the Level 1 and Level 2 caches, these characteristics are predetermined. However, you can alter the characteristics of main memory by modifying kernel parameters. This is described in more detail in Module 5, Memory Tuning.


Cache Characteristics
There are several major cache design characteristics that can be chosen, depending on the system's requirements. Most of these characteristics apply to software caches as well, although they might be implemented differently than they are in hardware caches. After you identify a cache and its characteristics, you can write programs that use the cache for better performance.


Virtual Address Cache


A virtual address cache can be found in some early SPARC desktop designs (the sun4c architecture systems) and the UltraSPARC systems. A virtually addressed cache uses virtual addresses to determine which location in the cache has the requested data in it. The advantage to this scheme is that the memory management unit (MMU) translation can be avoided if the data is found in cache. This means that the address translation mechanism can be bypassed. The disadvantage is that after a context switch, the addresses in the cache are invalid because they apply to the previously running process. This makes the cache inefficient on systems with high numbers of user processes because of the high rate of context switching.


Physical Address Cache


A physically addressed cache is one that uses the physical address of a piece of data to determine where in cache it could be located. A disadvantage to this type of cache is that for every address accessed by the CPU, an MMU translation must be performed, regardless of whether the data is in cache. This can be quite expensive, especially in a very busy CPU, but these translations are also cached. The Page Descriptor Cache (PDC) holds recent MMU translations.


Direct-Mapped and Set-Associative Caches


In a direct-mapped cache, each higher-level memory location maps directly to only one cache location. In a set-associative cache, a line can exist in the cache in several possible locations. If there are n possible locations, the cache is known as n-way set associative. While set-associative caches are more complex than direct-mapped caches, they allow for a more selective choice of data to be replaced. A set-associative cache reduces contention for cache resources. When data must be stored in the cache, it can be stored in one of several locations, which allows other addresses that map to the same cache lines to remain in cache. In a direct-mapped cache, an address can only map to a single cache line. Two addresses are, therefore, unable to share the line without flushing each other from the cache.
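The contention difference can be sketched with a toy simulation. All the parameters here (8 sets, 64-byte lines, the address trace) are invented for illustration; the point is that two colliding addresses evict each other in a direct-mapped (1-way) cache but coexist in a 2-way set-associative one.

```python
# Toy model: two addresses whose index bits collide thrash a direct-mapped
# cache but can share a set in a 2-way set-associative cache.

NSETS = 8  # number of cache sets; index = (address // line_size) % NSETS

def simulate(addresses, ways, line_size=64):
    sets = {i: [] for i in range(NSETS)}    # per-set list of resident tags
    misses = 0
    for addr in addresses:
        index = (addr // line_size) % NSETS
        tag = addr // (line_size * NSETS)
        if tag in sets[index]:
            sets[index].remove(tag)         # hit: refresh LRU position
        else:
            misses += 1
            if len(sets[index]) >= ways:    # set full: evict the LRU tag
                sets[index].pop(0)
        sets[index].append(tag)
    return misses

# Two addresses exactly NSETS * line_size apart map to the same set.
trace = [0x0000, 0x0200, 0x0000, 0x0200, 0x0000, 0x0200]
print(simulate(trace, ways=1))  # direct-mapped: every access misses (6)
print(simulate(trace, ways=2))  # 2-way: only the two initial misses (2)
```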


Direct-mapped caches are normally used for Level 2 or external caches. Because the logic used for set-associative caches is much more complex than that for direct-mapped caches, set-associative caches are normally used for Level 1 or on-chip caches, where the size of the cache is more limited but the complexity of the cache logic can be enhanced more easily. The extra performance benefit is more noticeable in this type of cache.

Harvard Caches
A Harvard cache (which is named after the university where it was invented) separates a cache into two parts. The first part, named the I-cache (I$), contains only instructions. The second part, named the D-cache (D$), contains only data. The two parts are usually four-way set associative, and do not have to be the same size (or type).


For example, one part could be virtual (I$) and the other physical (D$), like the UltraSPARC.

The idea behind this divided cache is that instructions will not lose their cache residence when large amounts of data are passed through the cache. This allows better performance for the system as a whole, although there may be slight performance degradation for a few applications. Harvard caches are usually used in CPU Level 1 caches. Caches that do not divide the cache are called unified. Following these naming conventions, the unified (non-Harvard) external cache is named the E$.


The Harvard cache concept is applied to main memory as priority paging.


Write-Through and Write-Back Caches


A write-through cache always writes a cache line to the next level up any time its data is modified. A write-back cache writes only to cache. It writes to the next level only when the line is flushed or another system component needs the data. The advantage to a write-through cache is that it ensures that the cache contents are consistent with those of main memory. The disadvantage is that there is a lot of extra memory traffic because of the loss of write cancellation.


Write Cancellation
A write-back cache takes advantage of a system behavior called write cancellation. The system assumes that if a particular cache line or data block has been written to once, it can be written to again. If the intermediate writes can be avoided, I/O or memory traffic will be reduced and, ultimately, the system will not be doing useless work moving out data that will be replaced before it is used again. This is the advantage of a write-back cache. Write cancellation occurs when the system holds modified data without writing it up to the next level, thus potentially cancelling unnecessary writes.
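The traffic difference between the two policies can be sketched by counting upstream writes for the same store trace. The function and trace below are invented for illustration.

```python
# Sketch: a write-through cache sends every store to the next level, while
# a write-back cache cancels intermediate writes and flushes each dirty
# line once.

def writes_to_next_level(store_trace, policy):
    dirty = set()
    upstream_writes = 0
    for line in store_trace:
        if policy == "write-through":
            upstream_writes += 1       # every store goes up immediately
        else:
            dirty.add(line)            # write-back: just mark the line dirty
    if policy == "write-back":
        upstream_writes += len(dirty)  # dirty lines written once, at flush
    return upstream_writes

trace = ["L1"] * 10 + ["L2"] * 5       # 15 stores to two cache lines
print(writes_to_next_level(trace, "write-through"))  # -> 15
print(writes_to_next_level(trace, "write-back"))     # -> 2
```

Fifteen stores cost fifteen upstream writes under write-through, but only two under write-back: the thirteen intermediate writes are cancelled.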


CPU Cache Snooping


Write-back caches create an interesting problem in a multiprocessor system: cache coherency must be maintained across all of the system's CPUs. This means that each CPU must be able to obtain current, modified data, even if the lower-level cache has not written it to main storage, where it would be directly available to the other parts of the system. The system uses a mechanism called cache snooping to avoid getting incorrect data. All system write-back caches are responsible for making sure that any component that requests modified data gets the current data and not obsolete data from main storage or elsewhere.


The cache controllers do this by monitoring every memory request made by every system component. When they observe a request or operation that involves data contained in their caches, they respond accordingly. If a CPU tries to access a line not present in its cache, it needs to read it from a higher level; the CPU requests (through the system bus) the most recent cache line from any cache or from memory. If a CPU modifies a line, it causes all other caches to either discard the block (called the write-invalidate protocol) or update their copy of the cache line (the write-update protocol). Which protocol is used is processor architecture-dependent.
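A write-invalidate snooping interaction can be sketched as a toy model. The classes and the shared "bus" list here are invented for illustration and omit the states a real coherency protocol tracks; they only show stale copies being discarded and the current copy being supplied on the next read.

```python
# Toy write-invalidate model: a write broadcast on the bus discards the
# line from every peer cache; a later read obtains the current copy.

class SnoopingCache:
    def __init__(self, bus):
        self.lines = {}              # line -> value
        self.bus = bus
        bus.append(self)             # register on the shared bus

    def read(self, line, memory):
        if line not in self.lines:   # miss: snoop peers for the newest copy
            for peer in self.bus:
                if line in peer.lines:
                    memory[line] = peer.lines[line]
            self.lines[line] = memory.get(line)
        return self.lines[line]

    def write(self, line, value):
        for peer in self.bus:        # broadcast: peers invalidate the line
            if peer is not self:
                peer.lines.pop(line, None)
        self.lines[line] = value

bus, memory = [], {"X": 0}
cpu0, cpu1 = SnoopingCache(bus), SnoopingCache(bus)
cpu0.read("X", memory)         # both caches load X
cpu1.read("X", memory)
cpu0.write("X", 42)            # cpu1's copy is invalidated
print(cpu1.read("X", memory))  # -> 42, supplied by cpu0, not stale memory
```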


This is very similar in concept to NFS file attribute checking, although the client polls for changes instead of noticing broadcast change messages. This is more efficient for data that is less likely to be shared.


Cache Thrashing
Cache thrashing occurs when the same location or locations in the cache are repeatedly used, causing a move in essentially every time the cache is referenced. When this occurs, the hit rate drops to near zero, which corresponds to an immense performance degradation. Cache thrashing normally occurs when arrays are accessed in memory with a stride, or spacing, that corresponds to the cache location mapping algorithm. The stride in this case is almost invariably a power of two.
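The stride effect can be sketched by counting how many cache sets an access pattern touches. The cache geometry below is an invented example (a 16-Kbyte direct-mapped cache with 64-byte lines); a stride equal to the cache size lands every access in the same set, while a stride offset by one line spreads the accesses out.

```python
# Sketch: in a direct-mapped cache the index is (address / line_size) mod
# nsets, so a power-of-two stride equal to the cache size maps every array
# element to the same set and thrashes it.

LINE_SIZE = 64
NSETS = 256                       # a 16-Kbyte direct-mapped cache

def sets_touched(stride, accesses=64):
    return {((i * stride) // LINE_SIZE) % NSETS for i in range(accesses)}

print(len(sets_touched(stride=LINE_SIZE * NSETS)))      # -> 1: thrashing
print(len(sets_touched(stride=LINE_SIZE * NSETS + 64))) # -> 64: spread out
```

This is why padding a power-of-two structure size by one cache line, as the text suggests, can eliminate the problem.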


Diagnosing a cache thrashing problem is usually very simple. By examining the source code for structures that are a power of two in size, the problem can be quickly identified. To solve it, change the structure size. Further, cache thrashing can occur on one type of system or CPU, but not on another type. If you suspect a problem of this type and the program's source code is not available, move the program to a different type of system to see if the problem disappears.


System Cache Hierarchies


When viewed linearly, almost every component of the system that contains data is part of a hierarchy of caches.


The further from the CPU the cache is, the larger and slower it is. These components might have different behaviors, but they all operate according to the basic cache principles. The table in the overhead shows the relative speeds of the different levels of the cache hierarchy. The further away from the CPU the data is when it is required, the higher the cost of moving that data closer to the CPU so that it can be used. In a later module, you will look at how UFS parameters can be set so that sequential access to file system data can be optimized. To avoid unnecessary and frequent access to disk, a read-ahead mechanism is employed to populate main memory in advance of actual data requirements.


Relative Access Times


It is difficult for humans to comprehend a nanosecond, or one billionth of a second. By converting the nanosecond times from the previous table into human times, it is possible to get a better understanding of the cost of a cache miss. As you might imagine from looking at the relative access times in the table in the overhead, the system tries very hard to avoid doing I/O operations. (You will see how this is done in later modules.) Caches figure prominently in these processes. The CPU tolerates Level 1 and Level 2 cache misses but suspends the current process if main memory does not provide a cache hit. Based on the data in the table, it is easy to see why the CPU switches the running process. While the suspended process waits for disk I/O to complete, the CPU is free to perform other tasks.
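The human-time conversion can be sketched by scaling each latency so that one nanosecond becomes one second. The latencies below are illustrative guesses for hardware of this era, not the figures from the course table.

```python
# Scale latencies so 1 ns of machine time reads as 1 s of human time.
# These latency values are illustrative assumptions, not measured data.

latencies_ns = {
    "Level 1 cache": 2,
    "Level 2 cache": 20,
    "Main memory": 200,
    "Disk I/O": 10_000_000,   # roughly 10 milliseconds
}

for level, ns in latencies_ns.items():
    seconds = ns                      # 1 ns scaled up to 1 s
    if seconds >= 86400:
        human = f"{seconds / 86400:.0f} days"
    elif seconds >= 60:
        human = f"{seconds / 60:.0f} minutes"
    else:
        human = f"{seconds} seconds"
    print(f"{level}: {ns} ns, or {human} in human time")
```

On this scale a disk I/O that the CPU could wait out in 10 milliseconds becomes months of human time, which is why the scheduler suspends the process rather than letting the CPU idle.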


Cache Performance Issues


Some performance issues of caches are:

q Locality of reference - Data used together should be stored together to take advantage of cache presence and pre-fetch (bringing data into cache before it is requested). Poor data distribution (bad locality) causes expensive cache misses.

  w Linked lists are bad for locality. Use arrays whenever possible.
  w Forte Developer Workshop analysis can show access patterns for a program's routines and data.

q Cache thrashing - If a job runs extremely slowly, and there are no other obvious causes, suspect cache thrashing. Inspect the source code or try the program on a processor with a different CPU architecture.

q Cache flushing statistics - The -c option of vmstat gives you detailed information on hardware cache flushes. This is of limited use, however, because neither cache hits nor the reason for the flush are reported.


# vmstat -c
flush statistics: (totals)
     usr     ctx     rgn     seg     pag     par
     821     960       0       0  123835      97

The output from the vmstat -c command contains the following fields:

q usr - User cache type
q ctx - Context cache type
q rgn - Region cache type
q seg - Segment cache type
q pag - Page cache type
q par - Partial-page cache type

Although the output may not be particularly useful, the change in values between two successive executions within a reasonable time interval will show how much cache flushing activity is taking place.


Items You Can Tune


Most of the things that you can tune to affect hardware cache performance can only be done at the application level. Unlike the Operating System (OS) software caches, there are no controls available to change the configuration, size, or behavior of the hardware caches. You can, however, alter the behavior of main memory, the cache for your disks. This is described in Module 5, Memory Tuning.


Cache Monitoring Command Summary


This summary briefly outlines how to obtain cache size information and, where relevant, cache performance data from your system. Some of the commands listed here are described in more detail in later modules:

q prtdiag
q prtconf

On sun4u and sun4d architectures, you use the prtdiag and prtconf commands to display the size of the Level 1 and Level 2 cache, on a per-CPU basis. Combining the prtconf -v and -p options, you see details of the Level 1 and Level 2 cache sizes, the cache line sizes, associativity, and whether the cache is split or unified.


q vmstat

You use the -s option to display the total name lookups and cache hit percentage for the Directory Name Lookup Cache (DNLC). You use vmstat with no options to see the re (reclaim) column showing some page cache hits.
q sar

Use the -a and -v options to show DNLC and inode cache data, respectively. The -b option shows buffer cache statistics. The -p and -g options show memory page-in and page-out activity, respectively. The atch/s (attaches per second) column shows some page cache hits, and the %ufs_ipf column shows inode cache flushes.
q netstat

Use the -k option to illustrate the use of the inode cache. This includes the initial, current, and high-water mark values.
q prtmem

Use the prtmem command to show the current size of the page/file cache.
q sysdef

Use the sysdef command to display the configured size of the buffer cache (bufhwm).
q nscd -g

Use the nscd -g command to display the current statistics of naming service-related caches.


Have the students try the nscd -g command now. This command shows naming service caches, such as the passwd and group cache. If the students issue the command ls -lR /usr, and then refer to the nscd cache again, they will see how that cache is being used to determine the user and group names from the file's UID and GID inode entries.


Check Your Progress
Before continuing on to the next module, check that you are able to accomplish or answer the following:

u Describe the purpose of a cache
u Describe the basic operation of a cache
u List the various types of caches implemented in Sun systems
u Identify the important performance issues for a cache
u Identify the tuning points and parameters for a cache
u Understand the memory system's hierarchy of caches and their relative performances
u Use standard tools to display cache-related statistics


Think Beyond
Consider the following:
q How could you prove that you were having hardware caching problems?
q Are there workloads that are insensitive to caches?
q How big is too big for a cache?
q How many caches might be too many to coordinate among CPUs? Why?
q Can you find software equivalents for each of the hardware cache types?


Cache Lab
Objectives
Upon completion of this lab, you should be able to:
q Calculate the cost of a cache miss, in both time and percentage of possible performance
q View cache-related statistics using standard Solaris 8 Operating Environment (OE) tools


Calculating Cache Miss Costs
Use the following figures to answer the questions in this exercise:

q Cache hit cost: 12 nanoseconds
q Cache miss cost: 540 nanoseconds

1. What speed will 100 accesses run at if all are found in the cache (100 percent hit rate)?

   ___________________________________________________

2. What is the cost in time of a 1 percent miss rate? In percentage performance?

   ___________________________________________________

3. If a job was running three times longer than it should, and this was completely due to cache miss cost, what percentage hit rate does it achieve?

   ___________________________________________________

Note - To calculate the hit rate, you can use the following formula:

   Execute-time = (Hit-cost * n) + (Miss-cost * (100 - n))

where n is the hit rate being achieved.
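The formula can be expressed directly in code, using the lab's figures of 12 nanoseconds per hit and 540 nanoseconds per miss. This sketch is a calculator for checking your working, not a substitute for it.

```python
# Execute-time = (Hit-cost * n) + (Miss-cost * (100 - n)), where n is the
# hit rate in percent and the time covers 100 accesses. Costs are the lab's
# figures: 12 ns per hit, 540 ns per miss.

HIT_COST, MISS_COST = 12, 540   # nanoseconds per access

def execute_time(n):
    """Time in ns for 100 accesses at a hit rate of n percent."""
    return HIT_COST * n + MISS_COST * (100 - n)

print(execute_time(100))  # -> 1200 ns: every access hits
print(execute_time(0))    # -> 54000 ns: every access misses
```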


Displaying Cache Statistics
The following lab exercises involve the use of the tools mentioned in this module. The purpose of this section of the lab is to allow you to become familiar with the output of the commands and to help you determine what information, contained in the command output, will be of use to you.

Another aspect of the lab is to demonstrate that there will always be times when a cache is not working as efficiently as you would like. The mix of system activities determines how much data is cached and available for use. Sometimes, a performance gain is obtained by altering the system workload rather than altering kernel parameters or adding memory. During the lab exercise, you see how standard commands, such as find, can have a negative effect on the system's caches.

You should be logged in to the Common Desktop Environment (CDE) windowing environment as the root user for this exercise. The lab4 directory should be your current working directory.

4. In a terminal window, issue the following command:

   # vmstat -s

   Write down the value associated with total name lookups.

5. In the same terminal window, issue the following command:

   # vmstat 5 | tee /tmp/vmstat_lab

6. In another terminal window, issue the following command:

   # ./finder

   Observe the page-in (pi) column of the vmstat output while the find command is executing. After the find command has completed, terminate the vmstat process that you started in Step 5.


7. In a terminal window, issue the following command:

   # vmstat -s

   Compare the current value associated with total name lookups to the previous value (from Step 4). Has the cache hit percentage increased or decreased?

   ___________________________________________________

   Why?

   ___________________________________________________

8. In the same terminal window, issue the same command as before:

   # vmstat 5

9. In another terminal window, again issue the command:

   # find / > /dev/null

   Observe the page-in (pi) and minor-fault (mf) columns of the vmstat output while the find command is executing. Does it appear to be the same as before?

   ___________________________________________________

   Is the CPU spending as much time in an idle (id) state?

   ___________________________________________________

   Is the find command reading data from disk or from the cache (or both)?

   ___________________________________________________


Solaris System Performance Management



10. If your system is sun4u or sun4d architecture, run the prtdiag command as shown:

    # /usr/platform/`uname -m`/sbin/prtdiag

    Write down the size of the caches for each of your CPUs. (The size is displayed in the Ecache columns.) If you are uncertain of your system architecture, try either of these two commands:

    # uname -m
    # arch -k

11. Issue the following command:

    # prtconf -vp | grep -i cache

    The values are output in hexadecimal. Determine the decimal values for the following information:

    Level 1 Data Cache:
        dcache-line-size:
        dcache-nlines:
        Total cache size (line-size x nlines):
        dcache-associativity:

    Level 2 Cache:
        ecache-line-size:
        ecache-nlines:
        Total cache size (line-size x nlines):
        ecache-associativity:

    Level 1 Instruction Cache:
        icache-line-size:
        icache-nlines:
        Total cache size (line-size x nlines):
        icache-associativity:

    Note – You can use the bc command to convert hexadecimal values to decimal. (Use the instruction ibase=16.)
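The conversion in step 11 can also be done with shell arithmetic. A minimal sketch, using made-up example values (a 64-byte line size and an 8192-line external cache), not output from any particular machine:

```shell
# Example values only: hex 40 (line size) and hex 2000 (nlines) are
# assumptions, not real prtconf output. POSIX shell arithmetic accepts
# C-style 0x constants; bc with ibase=16 (as in the note above) gives
# the same results.
line_size=$((0x40))      # hex 40 -> 64 bytes per cache line
nlines=$((0x2000))       # hex 2000 -> 8192 lines
echo "Total cache size: $((line_size * nlines)) bytes"
```

With these example values the total works out to 524288 bytes, that is, a 512-Kbyte external cache.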



12. If there are systems in your classroom with different architectures, compare the results found on your system architecture to those on other system architectures.





Memory Tuning
Objectives
Upon completion of this module, you should be able to:
- Describe the structure of virtual memory
- Describe how the contents of main memory are managed by the page daemon
- Understand why the swapper runs
- Interpret the measurement information from these components
- Understand the tuning mechanisms for these components


Relevance


Present the following questions to stimulate the students and get them thinking about the issues and topics presented in this module. While they are not expected to know the answers to these questions, the answers should be of interest to them and inspire them to learn the content presented in this module.

Discussion – The following questions are relevant to understanding the content of this module:

- Why do real and virtual memory need to be managed at all?
- What might you need to manage?
- How could you perform this management?
- What other parts of the system depend on the memory system working correctly?

Additional Resources
Additional resources – The following references provide additional details on the topics discussed in this module:

- Cockcroft, Adrian and Richard Pettit. Sun Performance and Tuning. Palo Alto: Sun Microsystems Press/Prentice Hall, 1998.
- Wong, Brian L. Configuration and Capacity Planning for Solaris Servers. Palo Alto: Sun Microsystems Press/Prentice Hall, 1997.
- Man pages for sar and vmstat.





Virtual Memory
Virtual memory is a technique used to provide the appearance of more memory to a program than the system really has. It uses physical main memory as a cache for the total amount of memory in use. The memory that is visible to the program, that is, the amount it has available to use, is the virtual memory. Virtual memory is created by the operating system (OS) for the program as it is needed, and destroyed when it is freed (or the process using it ends). If there is no room for a certain section of virtual memory in real, physical memory, it is paged out.

Each process gets its own expanse of virtual memory to use, up to a maximum of 3.75 Gbytes (on a 32-bit system). This area, called the process's address space, is where the program runs. Sections of it, called shared memory, can be shared with other programs.

Virtual memory is managed in segments, which are variously sized stretches of virtual memory with a related use, such as a shared library, the heap, the stack, and so on.



Process Address Space


When an executable program is loaded into memory, the virtual random access memory (RAM) that is assigned to that process is called the process address space. The process address space is divided into segments. In a normal address space there are usually at least 15 different segments. Its maximum size is 4 Gbytes in a 32-bit environment.

The text segment is the executable instructions from the executable file, which are generated by the compiler. The data segment contains the global variables, constants, and static variables from the program. The process's initial form also comes from the executable file. The heap is memory that is used by the program for some types of working storage; it is allocated with the malloc call.





The heap starts as virtual memory. Only when the first reference to a page is made will physical memory be allocated. Pages will, subsequently, be allocated in physical memory as each is referenced. Pages will remain in memory until the process exits or the page is stolen by the page scanner.

Following the heap is the hole. The hole is that section of the address space that is unallocated and not in use. For most processes, it is the vast majority of the address space. It takes up no resources because it has not been allocated.


The hole is shown in the 64-bit Sun4u diagram.

Above the heap are the shared libraries used by the program. These libraries have the same basic structure as a normal executable file; that is, they contain text, data, and BSS segments. The difference is how the code is compiled. A shared library is compiled as position independent so that it can become a section of shared memory that is accessed by every program that wants to use it. Shared libraries, such as libc.so, are connected to the program executable when it is loaded into memory at startup.

Above the shared libraries, at the end of the address space area that the program can address, is the stack. The stack is used by the program for variables and storage while a specific routine is running. The stack grows and contracts in size, depending on which routines have been called and what kind of storage requirements each routine has. The stack is normally 8 Mbytes in size.

Located above the stack is the kernel. While it is a part of the address space, the kernel cannot be referenced by an application. When a program needs access to kernel data, it must be able to run as root or make a system call request. The actual size of the kernel changes, depending on many factors, such as peripheral device configuration and the amount of system real memory. Essentially, the entire kernel is locked in memory. The amount of real memory used is the same as the amount of virtual memory allocated in the kernel.

Note – You can use the /usr/ccs/bin/size command to determine the size of the text, data, and bss (uninitialized data) segments of a program file. For example:

# /usr/ccs/bin/size /usr/bin/ls
15426 + 1261 + 1959 = 18646
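The segment layout described above can also be inspected live with the pmap command. A minimal sketch, using the current shell's process ID as an example; the exact segment list and sizes vary by program and release:

```shell
# Show the address space of the current shell, segment by segment.
# Expect to see the text segment, heap, shared libraries such as
# libc.so, and the stack, with a size for each mapping.
pmap -x $$
```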



Types of Segments


Present this information quickly. It is here for those students who are interested in it; it is not referenced again.

Segments are managed by segment drivers, just as devices are managed by device drivers. The segment driver is responsible for creating, copying (for fork), and destroying the segment, as well as for handling page faults for the segment.


Segments provide a convenient method for allowing a process to share portions of its address space with other processes. Segment drivers monitor for locking contention on those shared segments.

There are several different segment drivers; the majority are used to manage kernel memory. The most common segment driver for user processes is the segvn driver. The segvn driver maps files (represented by vnodes) into the address space. Almost every segment in the address space is backed by a file (an executable, swap, or data file), and the segvn driver supports them all. The usual exception is for a device segment (segdev), which represents the frame buffer.





Virtual Address Translation


Because memory acts like a fully associative cache for a disk, a directory mechanism must allow the system to determine whether a page is in memory and, if so, where it is in real memory. This process is called address translation.

The address translation is performed (in SPARC systems) as a three-stage table lookup. Parts of the virtual address to be translated are used to index a series of tables, which eventually provide the real address corresponding to the virtual address. The translation is performed in the CPU's memory management unit (MMU) by the hardware.

If any part of the translation cannot be performed, an interrupt occurs, indicating a page fault, which the OS must then resolve. If the address being translated is legitimate, the OS must locate the page containing the requested data.



Page Descriptor Cache


The address translation process is very expensive: It requires three memory accesses before the data can be requested, and it must be performed for every instruction or operand reference. If resolving the address of every instruction and operand required three memory references, or even three cache references, system performance would be exceptionally slow. The system avoids this by using a cache.

Translated addresses are saved in a special cache, usually with 64 entries, called the page descriptor cache (PDC). It is also known as the translation lookaside buffer (TLB) or dynamic lookaside table (DLAT). After an address is translated, it is placed in the cache for future reference. Access to the PDC is so fast that it is actually part of the Level 1 cache access time. As usual, the cache is managed on an LRU basis. It is fully associative.





Each CPU (and I/O interface) has its own PDC, located on the CPU chip. The most recent 64 pages referenced by the CPU are contained in each. In a multiprocessor system, each PDC's contents are different, although there is some overlap for shared pages from the kernel and shared libraries.

PDC entries are normally replaced on an LRU basis: If a page is not referenced, its entry is eventually replaced by a new page. If the page is referenced again, it is retranslated and reentered into the cache.


If a page is invalidated (for example, stolen or freed by the program), its PDC entry must be invalidated as well. The SPARC chip has a special instruction for this. The page must be invalidated in all of the CPUs' PDCs, not just one. Because the validity of the page is checked only in the long translation, if the PDC entries were not purged, the old translation would continue to be valid, even if the real pages were reused. A crosscall, a message to all of the other CPUs, is always issued when a page is invalidated, just in case the page is also in another CPU's PDC.



Virtual Address Lookup


The overhead illustrates the process that occurs when a virtual address must be translated. The PDC is checked first; then a long translation is done. If the page is invalid at the end of the translation, the hardware cannot proceed and asks for help from the software by issuing a page fault interrupt.

The OS passes the request to the proper segment driver, which checks memory to see if the page is available. If it is, it attaches the page to the process. (For a shared page, there might already be hundreds of users of the one page.) If the page is not in memory, it requests the appropriate I/O operation to bring in the page (for example, from NFS or the UNIX file system, UFS).





The Memory-Free Queue


Main memory is a fully associative cache for the contents of the disks. As such, it has the usual attributes of cache operations, such as move-ins and move-outs. For main memory, move-ins and move-outs are called paging.

The process for main memory starts the same way: A cache miss occurs (a page fault), and the OS determines that it must read in the page from disk. In the hardware caches, a location is identified, a move-out is done, and then a move-in (pagein) is done. This is not done for main memory. Remember, hardware cache operations are quite fast. Pageouts for main memory are disk I/O operations, which are extremely expensive.



To avoid the I/O, the virtual memory (VM) system keeps a quantity of free pages that are available to processes needing a new page. This avoids the need to wait for a pageout followed by a pagein; only the pagein is needed.

The Solaris 8 OE differs from earlier releases because it maintains a list of the system's free pages. Not only does this reduce the need to invoke the page scanning daemon, the system should also be able to allocate address space more effectively. The page scanning daemon must still be used at times of severe memory shortage.

Prior to Solaris 8, file content caching in virtual memory caused programs, such as vmstat, to display incorrect free memory values. This was caused by the file system data being treated as memory that was in use. Solaris 8 memory management routines treat memory used to store file data in virtual memory as free memory, but will not make use of such memory until the demand for memory is high and there is insufficient free RAM available to supply the demand.


The PDF file, vmcyclops.pdf, in the lab directories, contains an explanation of the workings of the Solaris 8 memory management. You may wish to inform the students and have them read through this PDF file for a fuller explanation.





The Paging Mechanism


Pages are not replaced immediately after they are used. Instead, the page daemon wakes up every 0.25 second and checks to see if there are enough free pages. If there are, it does nothing. If there are not, it attempts to replenish the free page queue in a process called page replacement. The move-outs are asynchronous in this case.

Because main memory is a cache, the algorithm used by the page daemon to replenish the free queue is the usual LRU process. The page daemon inspects, over time, all of the pages in memory and takes those that have not been used. The page daemon LRU process is called scanning. The more pages needed, the faster the page daemon scans.

The scan rate, how many pages are being checked each second, is the best indicator of stress (shortage) in the memory subsystem. If the page daemon cannot keep up with the demand for free pages, the swapper may run, which will temporarily move some memory-heavy work to disk.



Page Daemon Scanning


The page daemon performs the same checks on every page eligible to be taken. There are many types of pages that cannot be taken, such as kernel pages, which are skipped over.

The page daemon checks each eligible page to see if it can be taken. To take the page, it must be unreferenced, and, if modified, the maxpgio limit cannot have been reached. If the page has been referenced, the reference bit is set and the daemon locates the next page to check.





Page Scanner Processing


Page daemon processing is conceptually simple. If there is nothing to do, the page daemon does nothing. This determination is made if the number of free pages is greater than lotsfree. In the Solaris 8 OE, the kernel maintains a list of free pages at all times until the lotsfree threshold is breached.

If there are fewer than lotsfree pages left, the page daemon begins scanning at the slowscan rate. The fewer pages left on the free queue, the faster the page daemon scans. At the desfree value, the scanner starts to increase the scan rate to free as much memory as possible as quickly as possible. When a minimum number (minfree) are left, the page daemon scans as fast as possible, at the fastscan rate. The scan rate is determined by the number of pages available, as shown in the graph in the overhead.
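The ramp between slowscan and fastscan can be approximated as a linear interpolation on the free-page count. This is a sketch only: the parameter values below (lotsfree, minfree, slowscan, fastscan) are made-up examples, not defaults, and the kernel computes the real rate internally with breakpoints that vary by release.

```shell
# Linear interpolation: scan at slowscan when freemem == lotsfree,
# ramping toward fastscan as freemem approaches minfree.
# All parameter values are illustrative assumptions.
lotsfree=256
minfree=16
slowscan=100
fastscan=8192
freemem=64
scanrate=$((slowscan + (lotsfree - freemem) * (fastscan - slowscan) / (lotsfree - minfree)))
echo "scan rate: $scanrate pages/sec"
```

With 64 free pages against a lotsfree of 256, the interpolated rate is already most of the way to fastscan, which matches the graph in the overhead.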



5
The only special case occurs when too many pages are taken at once, causing a big drop in the free queue. When this occurs, the system temporarily adds to lotsfree a quantity (deficit) designed to keep this from happening again. The deficit value decays to zero over a few seconds.

If the value of free pages falls below the minfree threshold, the system stops scanning and starts to swap entire process address spaces out to swap space on the disk. Any page of RAM that is shared by more than eight processes, such as a shared library page, is locked in memory and cannot be paged or swapped out.





An Alternate Approach – The Clock Algorithm


The page daemon also has an alternate mechanism for controlling paging, using the parameter handspreadpages. The page daemon always uses both the lotsfree and handspreadpages parameters when it does its work. However, the operation of the page daemon can be tuned through both. By default, the value of handspreadpages is set to the fastscan value. It is always better to choose one method or the other because it becomes difficult to predict system behavior if both are tuned.

The front hand of the clock walks through memory marking pages as unused. The back hand walks through memory some distance behind the front hand, checking to see if the page is still unused. If so, the page is subject to reclaim.

This parameter requires that the kernel variable reset_hands is set to a nonzero value. When the value of handspreadpages has been recognized, reset_hands is set back to zero.



The drawback is that the page daemon could take pages that the program is still using, thus causing the program to quickly take them back. This causes memory thrashing, which uses CPU time and does not refill the free queue. Performance usually degrades rapidly at this point, especially if the pages are modified and being written out.

Do not try to adjust both handspreadpages and the scan parameters: It becomes too difficult to determine how fast pages are being stolen, and it might cause thrashing.





Default Paging Parameters


The table in the overhead shows the defaults for the page daemon parameters. Prior to the release of the Solaris 2.6 OE, these parameters were not scaled for multiprocessor systems. This caused larger systems to default to too small a free queue and also caused unnecessarily high scan rates. With Solaris 2.6 and higher, these parameters are properly scaled.

With deficit processing, it is no longer necessary to tune these parameters. If you have values in /etc/system from an earlier release, remove them when migrating to the Solaris 8 OE.

To keep scanning from taking over the CPU on a very memory-stressed system, the Solaris 8 OE limits scanning to 4 percent of available CPU power, regardless of the calculated scan rate.



The maxpgio Parameter


When the page daemon steals pages, some of those pages might be modified (or dirty). Just like any other cache, when a modified cache line is being replaced, it must be written back to the next higher level cache. In the case of main memory, this means written back to disk. If a large number of dirty pages were taken every quarter second, this would cause significant I/O spikes, as well as large amounts of nonproductive I/O not related to the system's execution. To limit the level of these spikes, the system provides the maxpgio parameter.

The maxpgio parameter tells the page daemon how many modified pages it may queue. It is divided by four to obtain the limit for each invocation of the page daemon. If the page daemon has taken maxpgio divided by 4 pages, any other pages it takes must be unmodified; if an unused but modified page is found, it is ignored.





The maxpgio parameter defaults to 40 pages. This means that only 40 writes of modified pages may be placed in the page daemon queue. (More than 40 pages may be written, however, depending on the number of pages per I/O.) In a small system, this is quite reasonable because 40 input/output operations can be a large part of its capacity. In a larger system, perhaps with hundreds of disks, this default is too small.

If the number of modified (dirty) pages allowed is set too low, the system scan rate will increase, as otherwise acceptable modified pages cannot be taken and the page daemon must continue to look for unmodified pages. The page daemon will not get enough pages, and the scan rate will increase the next time it runs.

To prevent an unnecessary increase of the scan rate, you must set maxpgio properly. Remember, it covers all stolen and modified pages, whether from a process address space (such as the heap or stack) or from files. If set too high, it can cause I/O spikes or periodic overloads. If multiple swap disks are configured, you should try setting maxpgio to 40 times the number of swap disks. This setting might be too low to handle file pages in some cases. You might also try 120 times the number of disk controllers to avoid the focus on swap devices.
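The two rules of thumb above reduce to simple arithmetic. A sketch with assumed hardware counts (four swap disks and two controllers here are examples, not recommendations for any particular system):

```shell
# maxpgio starting points from the text:
#   40 x number of swap disks, or 120 x number of disk controllers.
# The counts below are assumptions for illustration only.
swap_disks=4
controllers=2
echo "maxpgio (swap-disk rule):  $((40 * swap_disks))"
echo "maxpgio (controller rule): $((120 * controllers))"
```

For this hypothetical machine the two rules suggest 160 and 240 respectively; as the text notes, the controller-based figure avoids focusing only on swap devices.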



File System Caching


File systems, specifically UFSs, are described in detail in Module 8, "UFS Tuning." However, the file system cache is not implemented in the file system. In the Solaris OE, the file system cache is implemented in the virtual memory system. Therefore, an explanation of how Solaris OE file caching works, and of the interactions between the file system cache and the virtual memory system, is relevant to this module.

The Solaris method of caching file system data is known as the page cache. The page cache is dynamically sized and can use all memory that is not being used by applications. The Solaris OE page cache caches file blocks rather than disk blocks. The key difference is that the page cache is a virtual file cache, rather than a physical block cache. This allows the Solaris OE to retrieve file data by simply looking up the file reference and seek offset, rather than invoking the file system to look up the physical disk block number.

The overhead shows the Solaris OE page cache. When a Solaris OE process reads a file the first time, file data is read from disk through the file system into memory in page-size chunks and returned to the user.





The next time the same segment of file data is read, it can be retrieved directly from the page cache without having to do a logical-to-physical lookup through the file system. The buffer cache is used only for file system data that is known only by physical block numbers. This data consists only of metadata items (direct blocks, indirect blocks, and inodes). All file data is cached through the page cache.


Ask the students to draw a dotted-line from the process to the page cache to represent subsequent data lookup operations.

The overhead is somewhat simplified. The file system is still involved in the page cache lookup, but the amount of work the file system needs to do is dramatically reduced.

The page cache is implemented in the virtual memory system. In fact, the virtual memory system is built around the page cache principle, and each page of physical memory is identified the same way, by file and offset. Pages associated with files point to regular files, while pages of memory associated with the private memory space of processes point to the swap device. The important thing to remember is that the file cache is just like process memory. File caching shares the same paging dynamics as the rest of the memory system.



The Buffer Cache


The buffer cache is used in the Solaris OE to cache inodes and file metadata; it is also dynamically sized. The buffer cache grows as needed until it reaches a ceiling specified by the bufhwm kernel parameter. By default, the buffer cache is allowed to grow until it uses 2 percent of physical memory. The upper limit for the buffer cache can be displayed by using the sysdef command.

# sysdef
*
* Hostid
*
...
*
* Tunable Parameters
*
 1196032        maximum memory allowed in buffer cache (bufhwm)
     922        maximum number of processes (v.v_proc)
...
      60        maximum time sharing user priority (TSMAXUPRI)
     SYS        system class name (SYS_NAME)
#





Because only inodes and metadata are stored in the buffer cache, it does not need to be very large. In fact, you only need 300 bytes per inode, and about 1 Mbyte per 2 Gbytes of files that are expected to be accessed concurrently.

Note – A buffer cache size of 300 bytes per inode, and 1 Mbyte per 2 Gbytes of files expected to be accessed concurrently, is a rule of thumb for the UFS.

For example, if you have a database system with 100 files totaling 100 Gbytes of storage space, and only 50 Gbytes of files are accessed at the same time, then at most you would need 30 Kbytes (100 x 300 bytes = 30 Kbytes) for the inodes, and about 25 Mbytes ([50 / 2] x 1 Mbyte = 25 Mbytes) for the metadata.

On a system with 5 Gbytes of physical memory, the default for bufhwm would be 102 Mbytes (2 percent of physical memory), which would be more than sufficient for the buffer cache. You can limit bufhwm to 30 Mbytes to save memory by setting the bufhwm parameter in the /etc/system file. To set bufhwm to 30 Mbytes, add the following line to the /etc/system file (units for bufhwm are Kbytes):

set bufhwm=30000

You can monitor the buffer cache hit rate using sar -b. The statistics for the buffer cache show the number of logical reads and writes into the buffer cache, the number of physical reads and writes out of the buffer cache, and the read/write hit ratios. Try to obtain a read cache hit ratio of 100 percent on systems with few, but very large, files, and a hit ratio of 90 percent or better for systems with many files.
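The sizing example above can be reproduced as shell arithmetic. A sketch using the same numbers as the text (100 files, 50 Gbytes accessed concurrently):

```shell
# Rule of thumb: 300 bytes per inode, plus 1 Mbyte per 2 Gbytes of
# concurrently accessed file data.
files=100
concurrent_gbytes=50
inode_bytes=$((files * 300))                 # 30000 bytes, about 30 Kbytes
metadata_mbytes=$((concurrent_gbytes / 2))   # 25 Mbytes
echo "inode space:    ${inode_bytes} bytes"
echo "metadata space: ${metadata_mbytes} Mbytes"
```

Both results are far below the 102-Mbyte default ceiling from the example, which is why capping bufhwm at 30 Mbytes is safe there.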



The Page Cache


There is no equivalent parameter to bufhwm for the page cache. The page cache grows to consume all available memory, which includes all memory that is not currently being used by applications.





To observe how much memory is being used for page caching, use the mdb command to display the current kernel buffer information:


There are pre-prepared system adb and mdb macros in the /usr/lib/adb directory. You might mention these macros at this time.

# mdb -k
> bfreelist$<buf
bfreelist:
bfreelist:      flags 0            forw f02783a0      back f02783a0
bfreelist+0xc:  av_forw f02783a0   av_back f02783a0   bcount 327
bfreelist+0x40: bufsize 3072       error 0            edev 0
bfreelist+0x1c: un.b_addr 0        _b_blkno 0         resid 0
bfreelist+0x34: proc 0             iodone 0           vp 0
bfreelist+0x38: pages 0
> $q
#

The size of the page cache buffer is shown as a number of pages. In the above example, the size is 3072 pages. These values were taken on a SPARCStation 5 system, which uses a 4-Kbyte page size. The page cache is, therefore, using 12 Mbytes of RAM.

You use the kstat command to display the page cache buffer input/output statistics:

# kstat -n biostats
module: unix                          instance: 0
name:   biostats                      class:    misc
        buffer_cache_hits             476044
        buffer_cache_lookups          486870
        buffers_locked_by_someone     118
        crtime                        0.8695065
        duplicate_buffers_found       0
        new_buffer_requests           0
        snaptime                      37430.74045
        waits_for_buffer_allocs       0
#
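The 12-Mbyte figure comes from multiplying the page count by the page size. A sketch of that arithmetic, using the 3072 pages and the 4-Kbyte SPARCStation 5 page size from the example:

```shell
# 3072 pages x 4096 bytes per page = 12 Mbytes of RAM in the page cache.
pages=3072
pagesize=4096
echo "page cache: $((pages * pagesize / 1048576)) Mbytes"
```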



Available Paging Statistics


The table in the overhead shows the data most commonly used in tuning the memory subsystem. Note the different units reported by the tools. Make sure that you know the memory page size for your system. (Use the pagesize command if necessary.)

# sar -g
SunOS earth 5.8 Generic sun4m    10/13/00

08:00:00  pgout/s ppgout/s pgfree/s pgscan/s %ufs_ipf
...
11:00:01     0.46     1.43     0.16     0.00     0.10
11:20:01     8.37     6.57     4.51     0.01     1.05
...
16:20:01    10.54     7.00     5.87     0.01     1.15
...
18:00:00    22.77     4.57    17.57     0.00     1.92

Average      0.00     0.00     0.11     0.00     0.10
#





The following fields appear in this output:

- pgscan/s – The number of pages, per second, scanned by the page daemon. If this value is high, the page daemon is spending a lot of time checking for free memory. This implies that more memory might be needed.
- pgout/s – The number of pageout requests per second.
- ppgout/s – The actual number of pages that are paged out, per second. (A single pageout request might involve paging out multiple pages.)
- pgfree/s – The number of pages, per second, that are placed on the free list by the page daemon.



# sar -p
SunOS earth 5.8 Generic sun4m    10/13/00

08:00:00   atch/s  pgin/s ppgin/s  pflt/s  vflt/s slock/s
08:20:00     2.03    0.34    0.98    9.08    7.09    0.00
08:40:01     1.56    0.13    0.35    8.75    7.23    0.00
...
11:00:01     2.47    4.54    5.30    3.65   11.88    0.00
11:20:01     3.42    2.96    5.43   10.52   16.64    0.00
16:00:00     0.38    0.10    0.27    1.59    1.35    0.00
16:20:01    56.75   11.36   19.90  186.55  485.64    0.00
...
18:00:00     1.05    0.41    0.67    1.42    6.29    0.00

Average      3.35    1.43    2.33   10.54   22.38    0.00
#

The following fields appear in this output:

- atch/s – The number of page faults, per second, that are satisfied by reclaiming a page currently in memory (attaches per second). Instances of this include reclaiming an invalid page from the free list and sharing a page of text currently being used by another process (for example, two or more processes accessing the same program text).
- pgin/s – The number of times per second that file systems receive pagein requests.
- ppgin/s – The number of pages paged in per second. A single pagein request, such as a soft-lock request or a large block size, can involve paging in multiple pages.

# sar -r
SunOS earth 5.8 Generic sun4m    10/13/00

08:00:00 freemem freeswap
08:20:00    6476   343076
08:40:01    5061   333080
...
11:00:01    3184   306781
11:20:01    2481   237785
...
16:00:00    4813   243598
16:20:01    5768   226064
...
18:00:00    7237   223395

Average     4171   263954
#

This output includes the following field:

- freemem: The average number of memory pages available to user processes over the intervals sampled by the command. Page size is machine dependent.

# vmstat 2
 procs     memory            page             disk          faults      cpu
 r b w   swap  free  re  mf pi  po  fr de  sr f0 s1 s2 s3   in  sy  cs us sy id
 0 0 0 134248 18280   3  22 10   3   4  0   1  0  2  0  1  149 492  93  3  3 94
 0 0 0  83100  2088   0   8  2   0   0  0   0  0  0  0  1  119 254  69  3  0 97
 0 0 0  83084  2028   0   6  0   0   0  0   0  0 16  0  5  200 247  70  1  4 94
 0 0 0  83044  2196   0  17  0 372 386  0 133  0  0  0  3  143 246  69  3  1 96
 0 0 0  83008  2736   0   6  0 186 186  0   0  0  0  0  2  167 459 111  2  1 98
 0 0 0  83008  2936   0   6  0   0   0  0   0  0  0  0  0  189 593 135  2  2 96
^C#

The following fields appear in this output:

- free: The size of the free memory list in Kbytes. It includes the pages of RAM that are immediately ready to be used whenever a process starts up or needs more memory.
- re: The number of pages reclaimed from the free list. The page had been stolen from a process but had not yet been reused by a different process, so it can be reclaimed by the process, thus avoiding a full page fault that would need I/O.
- mf: The number of minor faults, in pages. If the page is already in memory, then a minor fault simply reestablishes the mapping to it. A minor fault is caused by an address space or hardware address translation fault that can be resolved without performing a pagein. It is fast and uses only a little CPU time. The process does not have to stop and wait, as long as a page is available for use from the free list.
- pi: The number of Kbytes paged in per second. A pagein occurs whenever a page is brought back in from the swap device, or into the new buffer cache. A pagein causes a process to stop execution until the page is read from disk and adversely affects the process's performance.
- po: The number of Kbytes paged out per second. A pageout is counted whenever a page is written and freed. Often this is a result of the pageout scanner, fsflush, or file close.
- fr: The number of Kbytes freed per second. This is the rate at which memory is being put onto the free list. Pages are usually freed by the page scanner daemon, but other mechanisms, such as processes exiting, also free pages.

- sr: The number of pages scanned by the page daemon as it looks for pages to steal from processes that are not using them often. This number is the key memory shortage indicator if it stays above 200 pages per second for long periods. This is the most important value reported by vmstat.
- de: The deficit, or anticipated short-term memory shortfall, needed by recently swapped-in processes. If the value is nonzero, then memory was recently consumed quickly, and extra free memory will be reclaimed in anticipation of being needed.
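The 200 pages-per-second sr threshold can be watched with a small filter. This is a sketch that assumes sr is the 12th column, as in the vmstat output shown earlier; column positions can differ between releases, so verify against your own header line:

```shell
#!/bin/sh
# check_sr: read vmstat output on stdin and flag intervals where the
# page scan rate (sr, assumed to be the 12th field) exceeds a threshold.
# The first two lines of vmstat output are headers, so they are skipped.
check_sr() {
  awk -v thr="${1:-200}" 'NR > 2 && $12 + 0 > thr {
    print "memory pressure: sr=" $12
  }'
}

# Live usage would be:  vmstat 5 | check_sr 200
```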


Additional Paging Statistics

The Solaris 7 OE and later releases provide an extended set of paging counters that allow you to see what type of paging is occurring. You can now see the difference between paging caused by an application memory shortage and paging through the file system. The new paging counters are visible with the memstat command, which can be downloaded from ftp://playground.sun.com/pub/rmc/memstat. The output from the memstat command is similar to that of vmstat, but with extra fields to break down the different types of paging. In addition to the regular paging counters (sr, po, pi, fr), the memstat command shows the three types of paging broken out as executable, anonymous, and file. Sample output follows, as well as field descriptions.

The current version of memstat is included with the class lab files.

# memstat 3
memory  ---------- paging ----------  - executable -  - anonymous -  -- filesys --  --- cpu ---
  free  re  mf  pi  po   fr   de  sr  epi epo  epf    api apo apf    fpi fpo fpf   us sy wt id
  1960   0 359  50 537 1276 1060 413    1   0  658      0 496 496     49  41 121   70 30  0  0
  1976   0 255  58 496 1012  780 318    0   0  410      0 496 496     58   0 105   72 28  0  0
  2172   3 287 146 578 1042  540 419    5   0  325     20 578 578    121   0 138   70 30  0  0
  1936  45 366 630 881 1306  300 460  409   0  324     17 785 785    204  96 197   67 33  0  0
^C#

The fields in this output are:

- pi: Total pageins per second
- po: Total pageouts per second
- fr: Total pagefrees per second
- sr: Page scan rate in pages per second
- epi: Executable pageins per second
- epf: Executable pages freed per second
- api: Application (anonymous) pageins per second from the swap device
- apo: Application (anonymous) pageouts per second to the swap device
- apf: Application pages freed per second
- fpi: File pageins per second
- fpo: File pageouts per second
- fpf: File pagefrees per second
- us: CPU user mode
- sy: CPU system mode
- wt: CPU wait on I/O
- id: CPU idle


Swapping

If the page daemon cannot keep up with the demand for memory, then the demand must be reduced. This is done by swapping out memory users and writing them to disk. Obviously this does not get the system's work done, and so it is only done when there is no other choice. This is usually called desperation swapping. The system uses an algorithm to choose lightweight processes (LWPs) for swapping. When the last LWP of a process is swapped out, the process itself is removed from memory and all of its pages are freed. Before that point, only pages left unused by the swapped-out LWPs are freed. Do not try to tune swapping; the number of swaps is not important. The fact that the system is swapping at all means that there is a severe load on main memory. Swaps should be eliminated, not reduced.


When Is the Swapper Desperate?

The swapper checks once every second to see if it must act. It swaps out if free memory has been low for a sustained period of time.

avefree30 is the average amount of free memory over the last 30 seconds. This value is used to ensure that the shortage is not merely temporary.

The paging rate is considered excessive if the total paging rate (pginrate plus pgoutrate) is more than maxpgio. Also, the run queue must be occupied by at least two processes on a consistent basis for the swapper to assume a state of desperation. If all of the above are true, then the Solaris system performs hard swapping, where even kernel modules may be swapped out of RAM along with process address space.


What Is Not Swappable

Only LWPs in the timesharing and interactive classes are swappable, and not all of those are. The system checks for conditions that make swapping an LWP unsafe (such as a fork in progress), unwise (the process is exiting), or undesirable (a higher-priority thread waiting on a lock would be blocked), and exempts those LWPs. Those that pass this screening are then ranked, and the LWP with the highest ranking is swapped out.


Swapping Priorities
The system calculates the effective priority of an LWP to be swapped out (epri) using the following calculation:

epri = swapin_time - rss / (maxpgio / 2) - pri


The chosen LWP is then swapped out. The swapping of the LWP itself usually returns little memory. Only when the last LWP of a process is swapped out, and the memory used by the entire process is taken, does the amount of memory freed become significant.
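As a rough illustration of the effective-priority calculation, here is a sketch with hypothetical values; the units and weighting are simplified, and the actual kernel computation differs in detail:

```shell
#!/bin/sh
# Sketch of the swap-out priority calculation with hypothetical values.
# The ranking described above uses this value to select a candidate LWP.
swapin_time=30    # time since this LWP was swapped in (hypothetical)
rss=4096          # resident set size in pages (hypothetical)
maxpgio=40        # maximum pageout rate (hypothetical)
pri=55            # thread priority (hypothetical)

epri=$(( swapin_time - rss / (maxpgio / 2) - pri ))
echo "epri=$epri"
```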


Swap In
Swapped-out LWPs are swapped in when RAM is available. The priority calculation is very similar to the swap-out priority calculation. LWPs are brought back in reverse priority order. If memory pressure continues after an LWP is brought back in, further LWP swap-outs might occur. Usually these involve other LWPs, because the LWPs just brought back in have lower swap priorities due to the presence of the swapin_time term in the swap-out priority calculation.


Swap Space

The amount of virtual memory available to the system to allocate is the total of real memory (less the kernel) and swap space allocated on disk. The system can provide virtual memory to processes until this limit is reached. Swap space is provided in two steps: reservation and allocation. The pages written to swap space are those that have no other place to be written to. This includes the stack, heap, BSS (uninitialized data), and modified data segment pages. These are called anonymous pages because they have no file to return to. Virtual memory, and thus (eventually) swap space, is reserved when a new virtual memory segment is created. Each segment driver requests an appropriate amount of virtual memory to be reserved for it. If space is not available, the system call that requested the operation fails. The usual calls that reserve virtual memory are fork, exec, and (indirectly) malloc, which would usually receive an ENOMEM error.


The amount of reservable swap space (available virtual memory) is the total of all swap disk space plus physical memory, less kernel locked pages and a safety zone (in case of under-reservation). The safety zone is the larger of 3.5 Mbytes or 1/8 of main memory. A page of swap is not allocated until the page has to be written out. This allows the system to group related pages, limiting the number of writes and perhaps the number of reads necessary to bring the pages back in. Because the system can reserve more swap space than the amount of disk space used for swap, when a page needs to be written out there might not be any available disk space for the page. If this is the case, the page is left as is in memory. Systems with large amounts of physical memory might need little, if any, swap space because they can do all of their processing in memory. Remember, always have a primary swap partition large enough to hold a panic dump, even if there is no other need for swap space. There is no performance difference between swap files and swap partitions, although swap files are easier to manage.


The tmpfs File System

Another user of virtual memory is tmpfs. The tmpfs file system allows the creation of virtual memory file systems, such as /tmp. The use of a tmpfs file system allows virtual memory to be used in place of real disk space, providing significant speedups for programs that use temporary files, such as compilers and sorting utilities. The tmpfs file system uses virtual memory as any other user does, allocating it when files are written and freeing it when the files are deleted or truncated. The tmpfs file system does not reserve virtual memory. Virtual memory includes main memory, so the use of /tmp causes the use of equivalent quantities of main memory, at least initially. Overuse can lead to a memory shortage and swapping. You can limit the amount of virtual memory to be used by specifying the size option on the tmpfs lines in /etc/vfstab. You can also create multiple tmpfs file systems if you want, but all share the system's virtual memory.
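For example, a /tmp entry in /etc/vfstab capped with the size option might look like this; the 512-Mbyte value is illustrative and should be chosen for your own workload:

```
#device    device    mount  FS     fsck  mount    mount
#to mount  to fsck   point  type   pass  at boot  options
swap       -         /tmp   tmpfs  -     yes      size=512m
```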


You might want to mention the /var/run directory, specific to the Solaris 8 OE.


Some sites have moved /tmp to a physical drive because of concerns about the use of real memory by files in /tmp, or the possible exhaustion of virtual memory by a job using too much /tmp space. Given the vast performance difference between real disk and virtual memory, programs that use a disk-based /tmp encounter a significant performance degradation. It is usually best to allow the page daemon to manage the contents of main memory (so that truly unneeded pages can be written out) and use the size option to control use. There are no monitors that report on the use of tmpfs space. You need to use the df -k command.


Available Swapping Statistics

The table in the overhead shows the most commonly used data for monitoring the swap subsystem. Remember, you do not tune swaps, you eliminate them. The presence of any swaps at all indicates stress on the memory system, because it is so hard to induce swapping.

# vmstat -S 2
 procs     memory           page            disk          faults      cpu
 r b w   swap  free  si  so pi po fr de sr f0 s0 s6 --   in   sy  cs us sy  id
 0 0 0   1640  3656   0   0  8  8 15  0  5  0  1  0  0  187  556 200  1  1  98
 0 0 0 174328  1288   0   0  4  0  0  0  0  0  0  0  0  191  470 202  0  0 100
 0 0 0 174328  1288   0   0  0  0  0  0  0  0  0  0  0  177  447 188  0  0 100
 0 0 0 174328  1288   0   0  0  0  0  0  0  0  0  0  0  180  447 190  0  0 100
 0 0 0 174328  1288   0   0  0  0  0  0  0  0  0  0  0  186  447 186  0  0 100
^C#

This output contains the following fields:

- swap: The amount of swap space currently available, in Kbytes. If this number goes to zero, your system hangs and you are unable to start more processes.
- so: The number of Kbytes per second swapped out.
- w: The swap queue length; that is, the number of swapped-out LWPs waiting for processing resources to finish.
- si: The number of Kbytes per second swapped in.

# sar -r
SunOS earth 5.8 Generic sun4m    11/14/00

05:00:01 freemem freeswap
19:00:00    6908   342147
20:00:01    4774   332221
21:00:00    3537   320859
22:00:00    2790   316324

Average     4031   325094
#

This output contains the following field:

- freeswap: The number of 512-byte disk blocks available for page swapping.

# sar -w
SunOS earth 5.8 Generic sun4m    11/14/00

05:00:01 swpin/s bswin/s swpot/s bswot/s pswch/s
19:00:00    0.00     0.0    0.00     0.0      72
20:00:01    0.00     0.0    0.00     0.0     122
21:00:00    0.00     0.0    0.00     0.0     119
22:00:00    0.00     0.0    0.00     0.0      91

Average     0.00     0.0    0.00     0.0     107
#

This output contains the following fields:

- swpin/s: The number of LWP transfers into memory per second.
- bswin/s: The number of blocks (512-byte units) transferred for swap-ins per second.
- swpot/s: The average number of processes swapped out of memory per second. If the number is greater than 1, you might need to increase memory.

- bswot/s: The number of blocks (512-byte units) transferred for swap-outs per second.

# sar -q
SunOS earth 5.8 Generic sun4m    11/14/00

05:00:01 runq-sz %runocc swpq-sz %swpocc
19:00:00     1.5       2
20:00:01     1.1       3
21:00:00     1.1       2
22:00:00     1.1       1

Average      1.1       2
#

This output contains the following fields:

- swpq-sz: The average number of swapped-out LWPs
- %swpocc: The percentage of time LWPs are swapped out


Shared Libraries

Shared libraries can help reduce your memory usage by allowing multiple processes to share the same copy, and thus the same physical memory, of library routines. Indeed, most of a program can be treated as a shared library. Shared libraries take a small amount of extra time to initialize, but their benefits to the system can be significant. I/O operations to access them on disk are usually eliminated because they remain in memory, physical memory is saved because only one copy is needed, and tuning, such as compiler settings and locality-of-reference tuning, needs to be done only once. Creating a shared library requires only the use of special compiler and linker options. No changes to the code are necessary. There are also maintenance benefits; if a bug is found in the library, only the library needs to be replaced. Programs that use it do not need to be relinked.


Paging Review

The diagram in the overhead shows where memory segments that are paged out are written to or originate from. Unmodified pages, such as text segment pages and unmodified data segment pages, are never written out. A new copy is read in from the executable file when needed.

Device space has no arrow because it communicates with devices, taking up no physical memory, and so does not participate in paging.


Memory Utilization

The amount of memory needed for a system depends on the amount of memory required to run your applications. But before an application's memory requirements can be determined, memory usage must be observed. Memory in a system is used for file buffering, by the kernel, and by the applications running on the system. So the first thing to observe in a system is where the memory has been allocated:

- The total amount of physical memory available
- How much memory is being used for file buffering
- How much memory is being used for the kernel
- How much memory the applications are using
- How much memory is free


Total Physical Memory

The amount of total physical memory can be determined from the output of the prtconf command.

# prtconf
System Configuration:  Sun Microsystems  sun4m
Memory size: 128 Megabytes
System Peripherals (Software Nodes):
...
#

As you can see, the physical memory size of this (SPARCstation-5) system is 128 Mbytes.


File Buffering Memory

The page cache uses available free memory to buffer files on the file system. On most systems, the amount of free memory is almost zero as a direct result of this. To look at the amount of page cache, use the MemTool package. A summary of the page cache memory is available with the MemTool prtmem command.

# /opt/RMCmem/bin/prtmem
Total memory:          56 Megabytes
Kernel Memory:         14 Megabytes
Application memory:    19 Megabytes
Executable memory:     17 Megabytes
Buffercache memory:     1 Megabytes
Free memory:            4 Megabytes
#

The page cache size appears in the output of the prtmem command as Buffercache memory. On this system, the amount is 1 Mbyte.


Kernel Memory

Kernel memory is allocated to hold the initial kernel code at boot time, then grows dynamically as new device drivers and kernel modules are used. The dynamic kernel memory allocator (KMA) grabs memory in large slabs, and then allocates smaller blocks more efficiently. If there is a severe memory shortage, the kernel unloads unused kernel modules and devices, and frees unused slabs. The amount of memory allocated to the kernel can be found by using the sar -k command (kernel memory allocation statistics) and totaling all of the alloc columns. The output is in bytes.

# sar -k 1
SunOS earth 5.8 Generic sun4m    11/14/00

22:21:54 sml_mem   alloc fail   lg_mem    alloc fail ovsz_alloc fail
22:21:55 4657344 3853080    0 30806016 28656468    0    2752512    0
#

The fields in this output are:

- sml_mem: The amount of memory, in bytes, that the KMA has available in the small memory request pool. A small request is less than 256 bytes.
- alloc: The amount of memory, in bytes, that the KMA has allocated from its small memory request pool to small memory requests.
- fail: The number of requests for small amounts of memory that failed.
- lg_mem: The amount of memory, in bytes, that the KMA has available in the large memory request pool. A large request is from 512 bytes to 4 Kbytes.
- alloc: The amount of memory, in bytes, that the KMA has allocated from its large memory request pool to large memory requests.
- fail: The number of failed requests for large amounts of memory.
- ovsz_alloc: The amount of memory allocated for oversized requests (those greater than 4 Kbytes); these requests are satisfied by the page allocator, so there is no pool.
- fail: The number of failed requests for oversized amounts of memory.

Totaling the small, large, and oversize allocations, the kernel has 35,463,360 bytes of memory in its pools and is actually using 35,262,060 bytes at present.
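The totals quoted above can be checked directly from the sample sar -k line:

```shell
#!/bin/sh
# Totals from the sample sar -k output above (all values in bytes).
sml_mem=4657344   lg_mem=30806016
sml_alloc=3853080 lg_alloc=28656468 ovsz_alloc=2752512

pool_total=$(( sml_mem + lg_mem ))               # pool memory held by the KMA
in_use=$(( sml_alloc + lg_alloc + ovsz_alloc ))  # bytes actually allocated
echo "pool_total=$pool_total in_use=$in_use"
```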


Identifying an Application's Memory Requirements

Knowing how much memory an application uses allows you to predict the memory requirements when running more users. Use the pmap command or the pmem functionality of MemTool to determine how much memory a process is using and how much of that memory is shared.

# pmap -x 10062
10062:  /bin/csh
Address   Kbytes Resident Shared Private Permissions       Mapped File
00010000     144      104    104       - read/exec         csh
00042000      16        8      -       8 read/write/exec   csh
00046000     248        -      -       - read/write/exec   [ heap ]
...
FF3DC000       8        8      -       8 read/write/exec   ld.so.1
FFBE2000      56        8      -       8 read/write/exec   [ stack ]
--------  ------   ------ ------  ------
total Kb    1544      912    760     152
#

The output of the pmap command shows that the csh process is using 912 Kbytes of real memory. Of this, 760 Kbytes is shared with other processes on the system through shared libraries and executables. The pmap command also shows that the csh process is using 152 Kbytes of private (non-shared) memory. Another instance of csh therefore consumes only about 152 Kbytes of additional memory (assuming its private memory requirements are similar).
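This sizing rule generalizes: total resident memory for N similar processes is roughly the shared portion counted once plus N copies of the private portion. A quick sketch using the figures above:

```shell
#!/bin/sh
# Estimate resident memory for n instances of csh, using the pmap
# figures above: shared memory is counted once, private memory per instance.
shared_kb=760
private_kb=152
n=10

total_kb=$(( shared_kb + n * private_kb ))
echo "estimated total for $n instances: ${total_kb} Kbytes"
```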


Free Memory

Free memory is maintained so that when a new process starts up or an existing process needs to grow, the additional memory is immediately available. Just after boot, or if large processes have just exited, free memory can be quite large. As the file system page cache grows, free memory settles down to a level set by the kernel variable lotsfree. When free memory falls below lotsfree, memory is scanned to find pages that have not been referenced recently. These pages are moved to the free list. Free memory can be measured with the vmstat command. The first line of output from vmstat is an average since boot, so the real free memory figure is available on the second line. The output is in Kbytes.

# vmstat 2
 procs     memory           page             disk          faults       cpu
 r b w   swap  free  re  mf pi  po  fr de sr f0 s0 s6 --   in   sy  cs us sy  id
 0 0 0 210528  2808   0  19 49  38  73  0 24  0  5  0  0  247 1049 239  4  2  94
 0 0 0 199664  1904   0   4 12   0   0  0  0  0  2  0  0  224  936 261  7  2  92
 0 0 0 199696  1056   7   0  0 628 620  0 20  0  0  0  0  401 1212 412 27  9  64
 0 0 0 199736  1144   0   0  0   0   0  0  0  0  0  0  0  165  800 188  0  0 100
^C#


Identifying a Memory Shortage

To determine if there is a memory shortage, it is necessary to:

- Determine if the applications are paging excessively because of a memory shortage
- Determine if the system could benefit by making more memory available for file buffering

Use vmstat to see if the system is paging. If the system is not paging, there is no chance of a memory shortage. Excessive paging is evident as constant nonzero values in the scan rate (sr) and pageout (po) columns. Look at the swap device for activity. If there is application paging, then the swap device will have I/Os queued to it. This is an indicator of a memory shortage. Use the MemTool to measure the distribution of memory in the system. If there is an application memory shortage, the file system's page cache size will be very small (less than 10 percent of the total memory available).
Using the memtool Graphical User Interface (GUI)

The memtool graphical user interface (GUI) provides an easy way of invoking most of the functionality of the MemTool kernel interfaces. Invoke the GUI as the root user to see all process and file information.

# /opt/RMCmem/bin/memtool &

Figure 5-1  memtool GUI Buffer Cache Memory

There are three basic modes in the memtool GUI: VFS Memory (Buffer Cache Memory), Process Memory, and a Process/Buffer cache mapping matrix. The initial screen (Figure 5-1) shows the contents of the Buffer Cache memory. Fields are described as follows:

- Resident: The amount of physical memory that this file has associated with it.
- Inuse: The amount of physical memory that this file has mapped into a process segment or SEGMAP. Generally, the difference between this and the Resident figure is what is on the free list associated with this file.

- Shared: The amount of memory that this file has in memory that is shared with more than one process.
- Pageins: The number of minor and major pageins for this file.
- Pageouts: The number of pageouts for this file.
- Filename: The file name of the vnode or, if not known, the device and inode number in the format 0x0000123:456.

The GUI only displays the largest 250 files. A status panel at the top of the display shows the total number of files and the number that have been displayed. The second mode of the memtool GUI is the Process Memory display (Figure 5-2). Click the Process Memory check box to select this mode.

Figure 5-2  memtool GUI Process Memory

The Process Memory display (Figure 5-2 on page 5-60) shows the process table with a memory summary for each process. Fields are described as follows:

- PID: The process ID of the process.
- Virtual: The virtual size of the process, including swapped-out and unallocated memory.
- Resident: The amount of physical memory that this process has, including shared binaries, libraries, and so forth.
- Shared: The amount of memory that this process is sharing with another process, such as shared libraries, shared memory, and so forth.
- Private: The amount of resident memory that this process has that is not shared with other processes. This figure is essentially Resident - Shared and does not include the application binaries.
- Process: The full process name and arguments.

The third mode shows the Process Matrix, which is the relationship between processes and mapped files (Figure 5-3 on page 5-62). Click the Process Matrix check box to select this mode.


Figure 5-3  memtool GUI Process Matrix

Across the top of the Process Matrix display is the list of processes that were viewed in the Process Memory display. The first column lists the files that are mapped into these processes. Each column of the matrix shows the amount of memory mapped into that process for each file, with an extra row for the private memory associated with that process. The matrix shows the total memory usage of a group of processes. By default, the summary box at the top-right corner shows the memory used by all of the processes displayed.


Items You Can Tune

The list in the overhead contains the most commonly adjusted tuning parameters for the memory subsystem. In addition, the list provides some actions that you can take to improve the efficiency of your memory system. The memstat command reveals whether you are paging to the swap device. If you are, the system is short of memory. If you have high file system activity, the default scanner parameters are insufficient and will limit file system performance. To compensate for this, set some of the scanner parameters so that the scanner can scan at a high enough rate to keep up with the file system.

By default, the scanner is limited by the fastscan parameter, which reflects the number of pages per second that the scanner can scan. It defaults to scanning a quarter of memory every second, but is limited to 64 Mbytes/sec. The scanner runs at 100 pages/second when memory is at lotsfree, with a limit of 4 percent of CPU power being available for the scanning process. If only one in three physical memory pages scanned can be freed, the scanner will only be able to put about 11 Mbytes (32 / 3 = 11 Mbytes) of memory on the free list, thus limiting file system throughput.

You need to increase fastscan so that the page scanner works faster. A recommended setting is one quarter of memory, with an upper limit of 1 Gbyte per second. This translates to an upper limit of 131,072 for the fastscan parameter. The handspreadpages parameter should also be increased along with fastscan, to the same value.

Another limiting factor for the file system is maxpgio. The maxpgio parameter is the maximum number of pages the page scanner can push. It can also limit the number of file system pages that are pushed, which in turn limits the write performance of the file system. If your system has sufficient memory, set maxpgio to something large, such as 65,536. On E10000 systems, this is the default for maxpgio.

Remember, a poorly performing memory subsystem can cause problems with CPU performance (from excessive scanning) and the I/O subsystem (unnecessary I/O operations). If you have memory problems, work on them first. You might be able to eliminate or reduce other problems by correcting the memory subsystem problems.
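These parameters are set in /etc/system and take effect at the next boot. The entries below use the values recommended above; validate them against your system's memory size before applying them:

```
* /etc/system entries (values from the recommendations above)
set fastscan=131072
set handspreadpages=131072
set maxpgio=65536
```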

Scanning Example

Here is a set of calculations that provide an example of how much memory can be reclaimed on a system. If the slowscan rate is 100 pages per second and a page size is 8 Kbytes, this means that 800 Kbytes/sec can be examined by the scanning procedure at the slowscan rate. If you assume that one third of the pages that are being scanned can be reclaimed, the reclaimed memory at the slowscan rate would be:

800 Kbytes / 3 per second = 266 Kbytes/sec = 0.25 Mbytes/sec

If maxpgio is set to 65536 pages (of 8 Kbytes each) per second and you assume that one third of the pages can be reclaimed, this would limit the reclaimed memory at the fastscan rate to:

65536 pages / 3 per second = 21845 pages/sec
21845 pages/sec * 8 Kbytes = 174760 Kbytes/sec = 170.6 Mbytes/sec

So, this system would have a reclaim rate somewhere between 0.25 Mbytes/sec and 170 Mbytes/sec when the page scanner is invoked.
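The same arithmetic as a quick shell sketch (integer division, so the results are truncated):

```shell
#!/bin/sh
# Reclaim-rate bounds from the scanning example above.
pagesize_kb=8
slowscan=100          # pages/sec at the slow scan rate
maxpgio=65536         # pages/sec at the upper limit

slow_kb=$(( slowscan * pagesize_kb / 3 ))   # one third of scanned pages reclaimed
fast_pages=$(( maxpgio / 3 ))
fast_kb=$(( fast_pages * pagesize_kb ))
echo "slow=${slow_kb}KB/s fast=${fast_kb}KB/s"
```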


Check Your Progress

Before continuing on to the next module, check that you are able to accomplish or answer the following:

- Describe the structure of virtual memory
- Describe how the contents of main memory are managed by the page daemon
- Understand why the swapper runs
- Interpret the measurement information from these components
- Understand the tuning mechanisms for these components


Think Beyond

Consider the following:

- Why is the use of main memory modeled so closely after a cache?
- What is the benefit of managing virtual memory in segments?
- If the memory subsystem is performing poorly, where else would you see performance problems? Why?
- Why are tuning parameters other than the scan rate generally not very useful?


Memory Tuning Lab


Objectives
Upon completion of this lab, you should be able to:

- Use the pmap command to view the address space structure
- Identify the contents of a process address space
- Describe the effect of changing the lotsfree, fastscan, and slowscan parameters on a system
- Recognize the appearance of memory problems in memory tuning reports

For the Solaris 8 Operating Environment (OE) in 64-bit mode, all tuning parameters are now 64-bit fields. To change them to work on 32-bit systems with adb or mdb, use the following steps:

1. For adb or mdb, use the appropriate command letter:

   J (64-bit) becomes X (32-bit)
   E (64-bit) becomes D (32-bit)
   Z (64-bit) becomes W (32-bit)

2. If you are unsure of your system's architecture, issue the command:

   # isainfo -b


Process Memory
Complete the following steps:

1. Using the ps command or the ptree command, locate the process identifier (PID) of your current session's Xsun process. Write the PID value of that process in the space provided below:

___________________________________________________

2. Run:

# pmap -x Xsun-pid

where Xsun-pid is the value you wrote down in Step 1.

How much real (resident) memory is the Xsun process using? (Totals are in Kbytes.)

___________________________________________________

How much of this real memory is shared with other processes on the system?

___________________________________________________

How much private (non-shared) memory is the Xsun process using?

___________________________________________________

Note When the pmap -x command shows that a region is private, this can mean that no child process has yet been invoked to share that region. So, the term private implies that it is currently private but it can become shared with a subsequent child process.


Page Daemon Parameters
Determining Initial Values
Use the lab5 directory for these exercises.

1. Observe the effects of modifying various kernel paging parameters on a loaded system. The first step is to dump all of the current paging parameters. Type:

./paging.sh

Record the output.

lotsfree _______________________ pages
desfree ________________________ pages
minfree ________________________ pages
slowscan _______________________ pages
fastscan _______________________ pages
physmem ________________________ pages
pagesize _______________________ bytes
2. Convert lotsfree from pages to bytes. You use these values later in this lab.

_________________________________________________________

Divide this number by 1024 to get lotsfree in Kbytes.

___________________________________________
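As a worked example of this conversion (the lotsfree value of 1,970 pages below is invented; substitute the value you recorded, along with the page size reported by ./paging.sh):

```shell
# Convert a lotsfree value from pages to bytes and Kbytes.
# The page count here is a made-up example; use your recorded value.
lotsfree_pages=1970
pagesize=8192                         # bytes per page, from ./paging.sh

lotsfree_bytes=$((lotsfree_pages * pagesize))
lotsfree_kb=$((lotsfree_bytes / 1024))

echo "lotsfree = ${lotsfree_bytes} bytes = ${lotsfree_kb} Kbytes"
```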


Changing Parameters
In this portion of the lab, you put a load on the system by running a program called Memory Monitor, which locks down a portion of physical memory. Then you run a shell script called load5.sh, which puts a load on the system's virtual memory subsystem. This script should force the system to page and, perhaps, swap. You run Memory Monitor and load5.sh several times, each with a different set of paging parameters, to observe their effect.

1. Change to the lab5 directory.

2. Observe the system under load with the default parameters:

vmstat -S 5

3. To obtain the amount of physical memory on your machine, in another terminal window, type:

prtconf | grep Memory

4. Exit from any Netscape and Acrobat sessions before continuing with this lab, as these programs can interfere with the running of the Java-based Memory Monitor program.


5. Start one Memory Monitor program for each 28 Mbytes of RAM in your system. To start each Memory Monitor program, type:

java -jar MemoryMonitor.jar &

and wait for the Memory Monitor windows to open. Each window should look like Figure L5-1.

Figure L5-1    Memory Monitor Window

Click the Start button to start the Memory Monitor program.

Note – You are required to start a number of instances of the Memory Monitor program. The value of 28 Mbytes of RAM was selected as an appropriate basis for calculating how many instances to execute. However, if you wish to place more load on your system, try starting one instance of Memory Monitor for every 15 Mbytes of RAM.


As memory is being allocated, the pie chart shown in Figure L5-2 displays how much RAM has been allocated, and the values in the window change accordingly.

Figure L5-2    Memory Monitor Pie Chart

At intervals, you can pause and then resume the Memory Monitor program by selecting the Pause or Resume button as shown in Figure L5-3.

Figure L5-3    Memory Monitor Buttons

Try experimenting by entering different values for the quantity (where the default is 10,000) and the delay time (where the default is 2,000 milliseconds).

Note – Each time you change one of the two values (quantity or delay time), you must click the Pause button and then the Resume button for the change to take effect.

Java technology allows for garbage collection of redundant areas of RAM. Try selecting and de-selecting the Garbage Collection button to see what, if any, difference garbage collection makes.

6. In a third window, type:

# ./load5.sh

This puts a load on the system. It takes about 3 to 5 minutes to run.

Note – load5.sh produces an invalid file type error as well as other job completion messages; this is normal. Ignore the error.


Make the following observations while the vmstat command is running:

- The value of free memory (remember, free memory is in Kbytes and should hover around the value of lotsfree)
- The scan rate and the number of Kbytes being freed per second (most of this activity is due to the page daemon)
- Disk activity
- The time spent in user and system mode
- Data that is being paged in and paged out

When load5.sh is finished, press Control-C to exit vmstat.

7. Record the same data in a sar file. Type:

sar -A -o sar.norm 5 60

8. In another window, type:

./load5.sh

Wait for load5.sh to complete. If sar is still running, use Control-C to stop the output of sar.

9. You can leave Memory Monitor running if it does not impact your system performance too much.

10. In a new window, type:

mdb -kw

11. Try changing lotsfree to be one-tenth of main memory. In mdb, type:

lotsfree/Z 0tnew_lotsfree

(Remember, the value you use for new_lotsfree should be in pages.)


12. To verify, in mdb, type:

lotsfree/E

You should see the value you just set.

13. Increase slowscan to half of fastscan. In mdb, type:

slowscan/Z 0tnew_slowscan

14. To verify the new value for slowscan, type:

slowscan/E

Note – Type $q to exit mdb. For this lab, you might want to keep this window open.

Given these alterations, what sort of behavior do you expect to see on the system with respect to the following:

- The amount of time the CPU spends in user and system mode

_______________________________________________________

- The average amount of free memory on the system

_______________________________________________________

- The number of pages being scanned and put on the free list

_______________________________________________________

15. If Memory Monitor is not currently running, invoke it by typing:

java -jar MemoryMonitor.jar

16. In another window, type:

sar -A -o sar.1 5 60

17. In another window, type:

./load5.sh

Wait for load5.sh to complete.


18. Exit Memory Monitor.

19. In one window, type:

sar -f sar.norm -grqu

20. In another window, type:

sar -f sar.1 -grqu

Did you see the changes you expected? Explain why or why not.

___________________________________________________________
___________________________________________________________
___________________________________________________________
___________________________________________________________

21. Restore lotsfree to its original value. In the mdb window, type:

lotsfree/Z 0toriginal_value_of_lotsfree


Scan Rates
To examine the scan rates:

1. Lower slowscan. In the mdb window, type:

slowscan/Z 0t10

2. Lower fastscan. In the mdb window, type:

fastscan/Z 0t100

What effect do you think this will have on the system's behavior with respect to:

- The number of pages scanned per second

_______________________________________________________

- The number of pages freed per second

_______________________________________________________

- The average amount of free memory

_______________________________________________________

- The size of the swap queue

_______________________________________________________

3. If necessary, generate paging activity by running Memory Monitor. Type:

java -jar MemoryMonitor.jar

4. In another window, type:

sar -A -o sar.2 5 60

5. In another window, type:

./load5.sh

Wait for load5.sh to complete.

6. Exit Memory Monitor.


7. In one window, type:

sar -f sar.norm -grqu

8. In another window, type:

sar -f sar.2 -grqu

Did you see the changes you expected? Explain why or why not.

___________________________________________________________
___________________________________________________________
___________________________________________________________
___________________________________________________________
___________________________________________________________
___________________________________________________________
___________________________________________________________
___________________________________________________________

9. Restore slowscan to its original value. In the mdb window, type:

slowscan/Z 0toriginal_value

10. Restore fastscan to its original value. In the mdb window, type:

fastscan/Z 0toriginal_value

11. Exit mdb using the $q command.


System Buses
Objectives
Upon completion of this module, you should be able to:

- Define a bus
- Explain general bus operation
- Describe the processor buses: UltraSPARC Port Architecture (UPA), Gigaplane, and Gigaplane XB
- Describe the data buses: SBus and Peripheral Component Interconnect (PCI)
- Explain how to calculate bus loads
- Describe how to recognize problems, and configure the system to avoid bus problems
- Determine Dynamic Reconfiguration considerations


Relevance


Present the following questions to stimulate the students and get them thinking about the issues and topics presented in this module. While they are not expected to know the answers to these questions, the answers should be of interest to them and inspire them to learn the content presented in this module.

Discussion – The following questions are relevant to understanding the content of this module:

- What is the purpose of a bus?
- Where in the system are the buses?
- What could happen if a bus does not operate correctly?
- What symptoms might you see if a bus was overloaded?

Additional Resources
Additional resources – The following references provide additional details on the topics discussed in this module:

- Cockcroft, Adrian, and Richard Pettit. Sun Performance and Tuning. Palo Alto: Sun Microsystems Press/Prentice Hall, 1998.
- Wong, Brian L. Configuration and Capacity Planning for Solaris Servers. Palo Alto: Sun Microsystems Press/Prentice Hall, 1997.
- Catanzaro, Ben. Multiprocessor System Architectures. SunSoft Press/Prentice Hall, 1994.
- SPARC Architecture Reference Manual.
- Mauro, Jim, and Richard McDougall. Solaris Internals. Prentice Hall/SMI Press, ISBN 0-13-022496-0.
- Man pages for the prtdiag command.


What Is a Bus?


Technically the term is buss, but popular industry usage demands bus.

A bus is conceptually a cable that connects two or more components, much like a network. The bus may actually be implemented as a cable, as traces on a circuit board, as a wireless link, or as some combination. The common thread is that all buses transfer data and control signals from one location to another. Almost every data transfer mechanism in the system has the characteristics of a bus: a fixed or maximum bandwidth, one active transmitter at a time, a contention resolution scheme, and bidirectional transfer capability. Very high-speed buses may be serviced and supported by slower buses, much as an asynchronous transfer mode (ATM) network may be de-multiplexed into multiple 10BASE-T or 100BASE-T links. Because so many buses share so many common characteristics, if you understand the basic principles of a bus, you can apply them to many different situations.


General Bus Characteristics


There are two major categories of buses: circuit switched and packet switched.

A circuit-switched bus operates like a telephone, where a connection must be made at each end, and the bus remains busy for the entire operation. When a transaction is initiated, it must be rejected or completed before any other transaction can be started. Circuit-switched buses are generally easier to implement but have significantly slower throughput in most applications.

Packet-switched buses operate like a network, where a message is sent to a specific destination requesting an operation. The bus is then released to other traffic. When the requested operation is complete, the remote destination sends a message (or messages) back with the results of the operation. Most buses today are packet switched for performance reasons, even though they are more complicated to implement.


System Buses
The system bus connects the various sections of the hardware, such as system boards, central processing units (CPUs), and memory banks, to one another. There are many different types of system buses, depending on cost and performance requirements. Some systems have multiple layers of buses, with slower buses connecting to faster buses, to meet performance, flexibility, design, and cost requirements. System buses change and evolve over the years. Sun has used many different buses in its SPARC systems, and each new processor family generally uses a new system bus.


MBus and XDbus


The MBus (module bus) was developed for the first Sun SPARC multiprocessor systems, the Sun4/600 series, and was also used in desktop uniprocessor and multiprocessor systems. These systems made up the sun4m architecture family.

The XDbus (Xerox data bus) was used in Sun's first-generation, large, datacenter systems, the sun4d architecture family. Unlike the general-purpose systems, the family members had different numbers of parallel buses to provide additional bandwidth without the need for faster or wider buses.


System Buses: UPA and Gigaplane Bus


In the Sun UltraSPARC family, two new system buses were introduced. The UPA bus is a crossbar (all components directly connected) bus designed to connect a few elements together. It is used as the main system bus in the UltraSPARC desktop systems. It is very fast and very efficient.

The Gigaplane bus was introduced with the Enterprise X000 family of systems. In these servers, each system board has a UPA bus that connects the board's components to a Gigaplane backbone bus. The Gigaplane bus has the ability to electrically and logically connect to and disconnect from the UPA buses while it is running, giving the X000 and X500 servers the ability to do Dynamic Reconfiguration (the ability to add and remove system boards from a running system).


Gigaplane XB Bus
The Gigaplane XB (crossbar) bus is an extension of the Gigaplane bus, currently used only in Sun Enterprise 10000 systems. The Gigaplane XB bus consists logically of many individual Gigaplane buses, connected point-to-point between the 16 system boards in the E10000. This provides the appearance of a Gigaplane bus dedicated to connecting each pair of system boards. Just as with the Gigaplane bus, the Gigaplane XB bus provides the ability to perform Dynamic Reconfiguration. In addition, the Gigaplane XB is constructed in a modular fashion that allows the system to continue to operate if sections of it have failed.


The Gigaplane XB Bus


The diagram in the overhead shows the crossbar connections between each system board in the Sun Enterprise 10000's Gigaplane XB bus. Each line represents a logically separate Gigaplane bus.


Sun System Backplanes Summary
Characteristics of recent Sun system backplanes are summarized in Table 6-1.

Table 6-1  System Backplane Characteristics

System                       Bus                Speed, Width            Bandwidth Burst (Sustained)
SPARCstation 20              MBus               50 MHz, 64 bits         400 MB/sec (130 MB/sec)
SPARCstation 20 Model 514    MBus               40 MHz, 64 bits         320 MB/sec (105 MB/sec)
SPARCserver 1000             XDBus              40 MHz, 64 bits         320 MB/sec (250 MB/sec)
SPARCcenter 2000             XDBus              40 MHz, 64 bits (* 2)   640 MB/sec (500 MB/sec)
SPARCserver 1000E            XDBus              50 MHz, 64 bits         400 MB/sec (312 MB/sec)
SPARCcenter 2000E            XDBus              50 MHz, 64 bits (* 2)   800 MB/sec (625 MB/sec)
Cray SuperServer 6400        XDBus              55 MHz, 64 bits (* 4)   1.8 GB/sec (1.4 GB/sec)
Ultra-1/140                  UPA                72 MHz, 128 bits        1.15 GB/sec (1.0 GB/sec)
Ultra-1/170                  UPA                83.5 MHz, 128 bits      1.3 GB/sec (1.2 GB/sec)
Ultra-2/2200                 UPA                100 MHz, 128 bits       1.5 GB/sec (1.44 GB/sec)
Ultra Enterprise x000        Gigaplane/UPA      83.5 MHz, 256 bits      2.6 GB/sec (2.5 GB/sec)
Ultra Enterprise 10000       Gigaplane XB/UPA   100 MHz                 12.8 GB/sec
Sun Fire 3800, 4800, 6800    Sun Fireplane      150 MHz                 67.2 GB/sec (9.6 GB/sec)
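The burst figures in Table 6-1 are essentially clock rate multiplied by bus width. As a sketch of that calculation for the Gigaplane entry (83.5 MHz, 256 bits), scaling the clock by 10 to stay in integer shell arithmetic:

```shell
# Burst bandwidth = clock rate * bus width (in bytes).
clock_tenths_mhz=835     # 83.5 MHz, scaled by 10 for integer math
width_bits=256

width_bytes=$((width_bits / 8))
burst_mb=$((clock_tenths_mhz * width_bytes / 10))   # Mbytes/sec (10^6 bytes)
echo "Gigaplane burst = ${burst_mb} MB/sec"         # 2672, i.e. about 2.6 GB/sec
```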


Peripheral Buses
I/O devices generally run at a much lower bandwidth than the system bus, and connecting them directly to the system bus would be very expensive. As a result, many different peripheral buses have evolved to connect I/O devices to the system bus, usually through interface cards. These buses can be either circuit switched or packet switched, depending on cost and performance considerations. Although they are not usually even seen, these buses can be a hidden bandwidth limiter on the system and can cause not only performance problems but also the appearance of hardware problems. Sun now provides two different peripheral interconnect buses: the SBus and the PCI bus. Server systems may contain a mixture of both. All new systems from Sun, both desktop and server, feature PCI as the only or the preferred peripheral interconnect bus.


SBus Compared to PCI Bus


The SBus is a 25-MHz circuit-switched bus with a peak theoretical total bandwidth of around 200 Mbytes/sec in its fastest implementations. There are many SBus cards available today to perform most functions required by a Sun system.

The PCI bus was originally seen in personal computer (PC) systems. The PCI bus can run at 32-bit and 64-bit data widths, and at 33-MHz, 66-MHz, and 100-MHz clock rates. This provides a maximum theoretical total bandwidth for a 64-bit, 66-MHz card of over 500 Mbytes/sec. With hundreds of existing PCI cards, support for a Sun PCI system requires only a device driver. There are several 100 Mbytes/sec I/O interface cards available today, so the appeal of the PCI bus is obvious.
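The "over 500 Mbytes/sec" figure follows directly from clock rate times data width:

```shell
# Peak theoretical PCI bandwidth = clock rate (Hz) * data width (bytes).
clock_hz=66000000       # 66 MHz
width_bits=64

width_bytes=$((width_bits / 8))
peak_mb=$((clock_hz * width_bytes / 1000000))   # Mbytes/sec (10^6 bytes)
echo "64-bit, 66-MHz PCI peak = ${peak_mb} Mbytes/sec"   # 528
```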


PIO and DVMA


When peripheral devices have been attached to the system, the system must be able to transfer data to and from them. Mediated by the device driver, this is done over the system bus, which is connected to the peripheral bus. The exact mechanism of these transfers is beyond the scope of this course, but their essential feature is that the peripheral card appears to be resident in physical memory; that is, writes to a section of what looks like main memory are actually writes to the interface card. This connection allows two types of data transfer to be performed between the peripheral card (representing the device) and the system: programmed I/O (PIO) and direct virtual memory access (DVMA).


Ask the students to use the busstat -l command to see if they have any devices that can be analyzed using busstat.


Programmed I/O (PIO)
PIO forces the CPU to do the actual copies between the device memory and main memory. Moving 8 bytes (64 bytes on UltraSPARC systems) at a time, the CPU copies the data from one location to the other. High-speed I/O devices can swamp the CPU with PIO transfers, limiting the bandwidth at which the device can operate and reducing the availability of the CPU for other work. PIO is quite easy to use in a device driver and is normally used for slow devices and small transfers.


The busstat command is used to see how PIO devices are working. However, you would need to understand how hardware functions to make sense of the output of busstat. Have the students refer to the man page for busstat.

Direct Virtual Memory Access (DVMA)


DVMA allows the interface card, on behalf of the device, to bypass the CPU and transfer data directly to memory. DVMA is capable of using the full device bandwidth to provide much faster transfers. DVMA operations, however, require a much more complicated setup than PIO. The break-even point is at about 2,000 bytes in a transfer. Not all I/O cards support DVMA, and if they do, the driver might not have implemented support for it.

Which Is Being Used?


In most cases, it is not possible to tell whether PIO or DVMA is being used. The driver implements whichever mechanism it chooses. In drivers that support both, the driver might use PIO in older, slower machines and DVMA in newer, faster machines. Usually no message is issued to specify which data transfer type is being used.


This ability to switch between PIO and DVMA helps explain how a 100-Mbyte/sec interface card, such as a scalable coherent interconnect (SCI) card (used with clustering), can operate on a 40-Mbyte/sec SBus. PIO is used to lower the transfer rate to what the system can tolerate (by definition). Throughput calculations that assume the full bandwidth is available will therefore be overestimates in these cases, usually by a factor of 2 or 3.


The prtdiag Utility


The prtdiag utility, first provided with the SunOS 5.5.1 operating system, provides a complete listing of a system's hardware configuration, down to the I/O interface card level. It is supported for the sun4d and sun4u architecture systems and is key to locating hardware configuration problems. It is intended for use by people familiar with Sun's hardware nomenclature, so you might need to consult a reference, such as the Field Engineer's Handbook, to identify some of the interface cards. You can get similar information from the prtconf command, but it is much harder to read and interpret. The prtdiag command is found in the /usr/platform/sun4[du]/sbin directory, or in the /usr/platform/sun4u1/sbin directory on the Sun Enterprise 10000.


prtdiag CPU Section


The first section of the prtdiag report describes the CPUs present in the system. This report is taken from an Enterprise 10000 system. It shows eight 250-MHz UltraSPARC-II CPUs on two system boards, each CPU with 1 Mbyte of external (Level 2) cache. (The mask version is the revision level of the processor chip.)


prtdiag Memory Section


The next section of the prtdiag report shows the physical memory configuration. This report shows two system boards, each with two banks of 256 Mbytes, for a total system memory size of 1 Gbyte. According to the report, four memory banks are possible on each system board, but only two are filled. This arrangement usually allows what is called two-way memory interleaving.

Memory Interleaving
When writing 64 bytes to memory (the normal transfer or line size), the system is not able to write it all at once. A memory bank might, for example, be able to accept only 8 bytes at a time. This means that eight memory cycles, or transfers, are required to write the 64 bytes into the bank. Note that the number of interleaved memory banks must be a power of 2; for example, you cannot interleave three memory banks.


If you could spread the data over two memory banks, you could transfer all of the data in four memory cycles, taking half the time required by one bank. If you could use four banks, it would take only two transfers. If there were a way to use eight banks, it would take only one cycle to move the 64 bytes to or from memory.
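These cycle counts follow from dividing the 64-byte line evenly across the banks, with the banks transferring in parallel:

```shell
# Memory cycles needed to move a 64-byte line when each bank accepts
# 8 bytes per cycle and the banks operate in parallel.
line_bytes=64
bank_width=8

for banks in 1 2 4 8; do
    cycles=$((line_bytes / bank_width / banks))
    echo "${banks} bank(s): ${cycles} cycle(s)"
done
# Prints 8, 4, 2, and 1 cycles respectively, matching the text.
```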


Do not confuse memory cycles with CPU cycles here; they are quite different.

While four 64-byte requests would take the same total amount of transfer time, interleaved or not, interleaving does provide significant performance benefits. Aside from the shorter elapsed time for a single request, requests are always spread across multiple banks, providing a sort of memory redundant array of independent disks (RAID) capability. This helps avoid hot spots (overuse) of a particular bank when a program consistently uses data in the same area. For some types of memory-intensive applications, speed increases of 10 percent or more have been seen just by using interleaving.

Systems configure interleaving silently. Without knowing the characteristics of a system, you cannot always be sure that it is interleaving properly. For example, the desktop maintenance manuals specify the sequence in which memory single in-line memory modules (SIMMs) should be installed in the system. The installation order is not always in physical order on the board. While the memory will still be used (in most cases), there will be no interleave, and it is not always obvious what the proper installation order should be. Some systems, when properly configured, are capable of 16-way (or more) interleave.

Check your systems to ensure that the memory has been properly installed for maximum interleave, and remember to take the possibility of interleave into account when configuring a system. The prtdiag command reports the level of memory interleaving achieved by the system, given the actual layout of the memory modules. An even number of system boards produces better memory interleaving.


prtdiag I/O Section


The prtdiag I/O section shows, by name and model number, the interface cards installed in each peripheral I/O slot. By using the information in this section, you can calculate the requested bandwidth for the bus, to determine whether you are causing I/O bottlenecks by installing interface cards with more bandwidth capability than the peripheral bus can handle.


Bus Limits
A bus is like a freeway: it has a speed limit. Unlike a freeway, however, the bus speed cannot be exceeded. The problem is that as you add more traffic, the bus tends to get congested and not operate as well. When the bus reaches capacity, no more traffic can flow over it. Any requests to use the bus must wait until there is room, delaying the requesting operation. When this occurs, you do not get any specific messages; the system does not measure or report on this condition.


Bus Bandwidth
There is an old joke that goes, If I have checks, I must have money. The connection between a check and real money is lost. The same situation can occur with bus bandwidth: If I have slots, I must have bandwidth. Like checks and money, these two are not necessarily related. If you have installed more cards into a peripheral bus than you have bandwidth for, some of the I/O requests are delayed. The greater the delay, the more problems become evident. Problems from this situation, such as bad I/O response times, dropped network packets, and so on, occur intermittently but consistently and almost invariably during the busiest part of the day.
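The kind of back-of-the-envelope check this implies can be sketched as follows. All figures here are hypothetical: an invented bus rating, an invented card list, and the rule of thumb that only about 60 percent of rated bandwidth is achievable in practice:

```shell
# Compare the summed peak demand of the cards on one peripheral bus
# against the bus's realistically usable bandwidth. All figures are
# hypothetical examples.
bus_peak_mb=200                          # e.g., a fast SBus implementation
usable_mb=$((bus_peak_mb * 60 / 100))    # ~60% of rated bandwidth is achievable

# Peak ratings of the installed cards, in Mbytes/sec
demand=0
for card_mb in 100 40 25; do
    demand=$((demand + card_mb))
done

echo "usable bus bandwidth: ${usable_mb} Mbytes/sec"   # 120
echo "summed card demand:   ${demand} Mbytes/sec"      # 165
if [ "$demand" -gt "$usable_mb" ]; then
    echo "bus is oversubscribed"
fi
```

In this invented configuration the cards request 165 Mbytes/sec from a bus that can realistically deliver 120, so some I/O requests would be delayed at busy times.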


Diagnosing Bus Problems


With these sorts of problems, it is not uncommon to have the service people replace every I/O board, system board, CPU, memory bank, and power supply, sometimes multiple times, and still have the problem persist. This might be a hint that the problem is not due to a hardware fault. Check your configuration with prtdiag. Given the maximum bus bandwidths for your system, and the bandwidths necessary for the interface cards, it should be easy to tell if you have exceeded the available bandwidth for the peripheral bus. Note – You will need access to the appropriate specifications to determine the bandwidth figures for your hardware and peripheral cards. Just because the slot is available does not mean that it can be used. If it seems unreasonable to support certain interface cards in the same bus, check the configuration. These problems are not widely known, and it is possible for both the sales and support staff to overlook a problem.


Avoiding Bus Problems


When calculating throughput, realize that you will usually at best get 60 percent of the rated bandwidth of a peripheral device. It is not usually necessary to use actual transfer rates when checking a configuration; you will almost invariably get the same percentage of capacity either way.
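As a sketch of that rule of thumb (the 40-Mbytes/sec rated figure below is invented purely for illustration):

```shell
# Derate a rated bandwidth to the ~60 percent you can realistically
# plan on sustaining.
rated=40                          # hypothetical card rating, Mbytes/sec
usable=$(( rated * 60 / 100 ))
echo "plan on ${usable} Mbytes/sec, not ${rated}"
```

The same derating applies whether you compare rated figures or measured figures, which is why either works for a configuration check.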


To avoid bus problems, follow the suggestions listed on the overhead. There are many ways to create the problem, but only a few to solve it. There is plenty of detailed information available about the capabilities and requirements of I/O interface cards and system capabilities, including white papers on the specific systems and devices and some excellent books. Unexpected and unexplained problems (all the patches are on, all the hardware has been replaced) might be strong indications of configuration problems. Remember the basics when tuning and configuring a system: Almost everything is a bus or a cache and should be tuned accordingly. Monitor the system carefully, and try to understand what it is doing. Be prepared to leave empty slots in the system. You might not have the bandwidth for some of those slots.


SBus Overload Examples


Without knowledge of the SBus management of the server slots, it would be possible to put Serial Optical Channel+ (SOC+) cards (100 Mbytes/sec Fibre Channel-Arbitrated Loop [FC-AL] interfaces) into Slots 2 and 3 and overload SBus B. In this case, assuming that all onboard devices (those shown) are in use, the best choice would be to put the SOC+ cards in Slots 1 and 2 and leave Slot 3 empty.


The example shown in the overhead illustrates the possible overloading of the system bus. Remember that the system bus has a fixed bandwidth, and it is possible to overconfigure a system such that the system bus itself becomes overloaded. Table 6-2 shows how this is done.

Table 6-2  SC2000E System Bus Loads

    Component   Quantity   Bandwidth        Total
    CPU         20         25 Mbytes/sec    500 Mbytes/sec
    SBus        10         50 Mbytes/sec    500 Mbytes/sec
    Total                                   1000 Mbytes/sec

With a peak request for 1 Gbyte/sec of bandwidth and at most 600 Mbytes/sec available, delays are inevitable for the system bus traffic as usage grows. Unfortunately, this kind of overload often appears as watchdog resets. Hardware is often replaced even though the configuration is at fault.
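The Table 6-2 arithmetic can be re-derived in a couple of lines; the 600-Mbytes/sec figure is the approximate sustainable rate quoted above:

```shell
# Peak request: 20 CPUs at 25 Mbytes/sec each plus 10 SBuses at
# 50 Mbytes/sec each, against roughly 600 Mbytes/sec sustainable.
requested=$(( 20 * 25 + 10 * 50 ))
sustained=600
echo "requested ${requested} Mbytes/sec, sustainable ${sustained}"
[ "$requested" -gt "$sustained" ] && echo "overcommitted: delays are inevitable"
```

The same sum-and-compare check works for any system: total the worst-case demand of every bus client and compare it against the sustained (not burst) system bus rate.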


CPU Load on the System Bus


In the example shown in the overhead, the 20 CPUs on the SPARCcenter 2000E were quoted as loading the dual XD buses by 20-30 Mbytes/sec each. To calculate whether the system bus can cope with high levels of activity, you need to know how much load each CPU is likely to produce. The load that a CPU imposes on a system bus depends on how much external cache it has and how efficiently the cache is being used. The CPU's external cache (L2) sits between the CPU and the system bus and can reduce the load on the system bus by working at high hit rates. In an ideal world, it can reduce the CPU's access to the bus to negligible amounts if the cache is operating at near 100 percent efficiency. This is a general principle for all caches and buses on the system: The more efficient a cache is, the less the adjacent bus is needed to load information to the next level down.


Because the bus load greatly depends on cache efficiency which, in turn, depends on the type of activity the entire system is involved in, it is difficult to produce figures for CPU loads on a system bus. Brian Wong, in Configuration and Capacity Planning for Solaris Servers (see Additional Resources, page 6-2), produces some figures that are worth quoting here: A SPARCcenter 2000E with sixteen 85-MHz CPUs operating will consume approximately 480-540 Mbytes/sec of an available sustained rate of 625 Mbytes/sec. A SPARCcenter 1000E with eight 85-MHz CPUs will load the 312 Mbytes/sec system bus by 240-272 Mbytes/sec. Although the figures scale directly, in the latter case a bus overload could occur if the CPUs are heavily used for non-I/O activity and there is a simultaneous large amount of I/O activity. Also, the 2000E CPUs have 2-Mbyte L2 caches (as opposed to 1 Mbyte of L2 for the 1000E), which points to a lower corresponding CPU load on the larger system.
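Dividing those quoted figures out gives a rough per-CPU load, which is often all you need for a sanity check (integer arithmetic; the quoted figures are approximate anyway):

```shell
# Sixteen CPUs consuming 480-540 Mbytes/sec of a 625 Mbytes/sec
# sustained rate works out to roughly 30-33 Mbytes/sec per CPU.
ncpu=16
low=$(( 480 / ncpu ))
high=$(( 540 / ncpu ))
echo "per-CPU bus load: ${low}-${high} Mbytes/sec"
echo "bus utilization: $(( 480 * 100 / 625 ))-$(( 540 * 100 / 625 )) percent"
```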


Dynamic Reconfiguration Considerations


Dynamic Reconfiguration (DR) enables you to add and remove I/O boards (Enterprise xx00 servers) and system boards (Enterprise 10000) in a running system. New configuration performance problems will not occur when you remove boards because you are decreasing the demand for system resources. Problems can occur, however, when you add boards. Follow the guidelines given in the overhead when you add boards or use DR to add new components to existing boards. Make sure that you have the bandwidth in the peripheral and system buses to support the components you are adding. Consider reorganizing the existing components and boards to provide better performance for the new components. If you know that you will be adding components at a later time, plan for it now to avoid problems then.


Tuning Reports
Tuning reports provide very little information about configuration problems, largely because system configuration is almost always static and, therefore, not interesting to a monitor. Monitors do not even report on the current configuration when they start. The best protection is using prtdiag on current systems and doing a bandwidth calculation before adding boards to an existing system. Keep in mind:

•  An empty slot does not imply a usable slot.

•  Strange behavior, especially behavior that continues after hardware has been replaced, might be due to configuration problems.

•  Failures due to configuration problems invariably appear when the system is most heavily loaded.


Bus Monitoring Commands


In the Solaris 8 OE, a range of bus-related performance analysis utilities is provided as new or enhanced utilities. These include:

•  prtdiag – Shows the system bus clock speed plus the number and type of I/O buses on sun4d and sun4u architectures.

•  busstat – A new tool, introduced with the Solaris 8 OE. Use this tool to monitor bus-related performance counters on hardware based on SPARC technology. The -l option shows any devices on your system that support performance counters. The busstat command only works on Enterprise-level systems, not on desktop workstations.

•  prtconf – Although this tool produces output less easily read than prtdiag, it is available on all architectures. Use the -v and -p options to display the best details available relating to system hardware.


Appropriate OpenBoot PROM (OBP) commands can be issued at the ok prompt. Use the commands available in the revision of OBP on your system. Sun Management Center is a graphical user interface (GUI) tool that provides facilities to view the system's hardware configuration in addition to monitoring events at hardware and Operating System (OS) levels.


Check Your Progress
Before continuing on to the next module, check that you are able to accomplish or answer the following:

❑  Define a bus
❑  Explain general bus operation
❑  Describe the processor buses: UPA, Gigaplane, and Gigaplane XB
❑  Describe the data buses: SBus and PCI
❑  Explain how to calculate bus loads
❑  Describe how to recognize problems, and configure the system to avoid bus problems
❑  Determine Dynamic Reconfiguration considerations


Think Beyond
Consider these questions:
•  Why are there so many different types of buses?
•  Why does not every new system from Sun use the Gigaplane XB bus?
•  Why do some systems use several different buses to connect their components?
•  How do you tell if a device is using PIO or DVMA?
•  How do you confirm whether you have a hardware or a configuration problem?


System Buses Lab


Objectives


Upon completion of this lab, you should be able to assess hardware components and layout.


Testing the Hardware Conguration of a System
The purpose of this lab is to demonstrate how you can determine the hardware layout of a system. Work with another student for this lab. For the duration of the lab, the notes refer to System-A and System-B. You must decide which of the pair of hosts will be System-A and System-B and carry out the tasks appropriately. For the remainder of this lab, you will be working as a paired team. Work carried out on System-A must use the root login. Note This lab requires that you have access to the serial and keyboard ports on your system and that you have the appropriate serial cables to connect two systems together. If you do not have appropriate access or the serial connector cable, do not attempt this part of the lab. If you are in doubt, consult with your instructor.


This lab requires the student to connect their system to a partner system, using the serial port. You will require a cross-over serial cable to connect Serial Port B on one system to Port A on the partner system. During the lab, the student will also be required to remove the keyboard connector from their system. If you do not have the appropriate resources, do not attempt this lab. However, you might want to provide a verbal walk-through of what the lab would prove.

1. On System-B, execute the prtdiag command (on sun4u systems), and document the output below: ________________________________________________ ________________________________________________ ________________________________________________ ________________________________________________ ________________________________________________ ________________________________________________


2. Using the serial cable provided by your instructor, connect the cable to Serial Port A on System-A and System-B so that both hosts can communicate with each other through the serial port.

3. On System-A, verify that there is an entry in the /etc/remote file that is similar to the following:

hardwire:\
        :dv=/dev/term/a:br#9600:el=^C^S^Q^U^D:ie=%$:oe=^D:

Ensure that the dv= entry reads /dev/term/a. If the entry does not exist, add the entry for hardwire.

4. On System-A, issue the following command:

# tip hardwire

If the serial connection between the two systems is working correctly, you should see the message:

connected

If this is not the case, consult with the instructor.


If the serial connection is not valid, be sure that the cables are connected correctly. If they are, you can assume that there is a problem with the serial cable and abandon the lab on that pair of hosts.

5. On System-B, to halt the system and take the system to the programmable read-only memory (PROM) command level, issue the command: # init 0


6. When System-B has shut down to Level 0, issue the following commands: ok setenv diag-switch? true ok setenv auto-boot? false ok banner If the version of the PROM is 3.x, issue the following command: ok reset-all If the version of the PROM is 2.x, issue the following command: ok reset


On some systems, such as Ultra 10, a reset is not sufficient. You will have to power-cycle the system.

7. On System-A, the hardware configuration of System-B is listed as the system boots. Answer the following questions:

Do you see a listing of the attached hardware?

On a sun4u system, does the listing match the output of the prtdiag command (as issued in Step 1 of this lab)?

How many memory banks are there on System-B?

What is the distribution of memory modules across the memory banks?

Can you see the size of the Level 2 cache (look for Ecache Size)?

8. On System-A, terminate the tip session by typing the following:

~.

9. On System-A, record all of the data into a file using the following command:

# script /var/tmp/hardware_list.script
Script started, file is /var/tmp/hardware_list.script
# tip hardwire

Note – If you are unfamiliar with the script command, consult with the instructor.


10. When System-B has booted to the PROM command level, issue the following commands on System-B: ok setenv diag-switch? false If the version of the PROM is 3.x, issue the following command: ok reset-all If the version of the PROM is 2.x, issue the following command: ok reset Before System-B starts to boot, remove the keyboard cable from its socket. 11. As System-B boots, the ok prompt should appear in the terminal window on System-A. If so, issue the following command and answer the questions below: ok boot -v

What is the name of the graphics adapter on System-B, and how much memory is on the graphics card?

What is the name and speed of the network card?

What types of system and I/O buses are used on System-B?

12. When System-B has booted, on System-A issue the following commands:

# ~.
# exit
Script done, file is /var/tmp/hardware_list.script
#

The output produced by System-B during its boot cycle is now stored in the /var/tmp/hardware_list.script file. Compare the details in this file to the listing produced in Steps 2 to 7. If you are using sun4d or sun4u architecture systems, use the prtdiag command to determine the clock speed of the system bus. Use Table 6-1 on page 6-10 to determine the width of your system bus.


Calculate the burst bandwidth of the bus by multiplying the clock speed of the bus by the width value. Then, divide the total by 8 to obtain a value in Mbytes/sec. Write down your answer: ___________Mbytes/sec 13. If a system with OBP 2.x is available, try the following commands at the ok prompt: ok cpu-info ok show-sbus 14. If a system with OBP 3.x is available, try the following commands at the ok prompt: ok .speed 15. Boot the system using the command: ok boot
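The burst-bandwidth calculation in this lab reduces to a one-line formula. The 83-MHz clock and 64-bit width below are placeholder values; substitute whatever prtdiag and Table 6-1 report for your own system bus:

```shell
# burst (Mbytes/sec) = clock (MHz) * width (bits) / 8
clock_mhz=83
width_bits=64
echo "$(( clock_mhz * width_bits / 8 )) Mbytes/sec"
```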


I/O Tuning
Objectives
Upon completion of this module, you should be able to:
•  Demonstrate how I/O buses operate
•  Describe the configuration and operation of a Small Computer System Interface (SCSI) bus
•  Explain the configuration and operation of a Fibre Channel bus
•  Demonstrate how disk and tape operations proceed
•  Explain how a storage array works
•  List the Redundant Array of Independent Disks (RAID) levels and their uses
•  Describe the available I/O tuning data


Relevance


Present the following questions to stimulate the students and get them thinking about the issues and topics presented in this module. While they are not expected to know the answers to these questions, the answers should be of interest to them and inspire them to learn the content presented in this module.

Discussion – The following questions are relevant to understanding the content of this module:

•  What parts of I/O tuning are related to concepts you have already seen?
•  What makes I/O tuning so important?
•  Why must you understand your peripheral devices to properly tune them?

Additional Resources
Additional resources – The following references can provide additional details on the topics discussed in this module:

•  Cockcroft, Adrian, and Richard Pettit. Sun Performance and Tuning. Palo Alto: Sun Microsystems Press/Prentice Hall, 1998.
•  Wong, Brian L. Configuration and Capacity Planning for Solaris Servers. Palo Alto: Sun Microsystems Press/Prentice Hall, 1997.
•  Ridge, Peter M., and David Deming. The Book of SCSI. No Starch Press, 1995.
•  Benner, Alan F. Fibre Channel. McGraw-Hill, 1996.
•  Mauro, Jim, and Richard McDougall. Solaris Architecture and Internals. Prentice Hall/SMI Press, ISBN 0-13-022496-0.
•  Musciano, Chuck. "0, 1, 0+1... RAID Basics." SunWorld, June 1999.
•  Man pages for the prtconf command and performance tools (sar, iostat).


SCSI Bus Overview


The small computer system interface (SCSI) bus was designed in the early 1980s to provide a common I/O interface for the many small computer systems then available. Each required its own proprietary interface, which made development of standard peripherals impossible. The SCSI bus helped solve this problem by providing a standard I/O interface. The SCSI bus was originally shipped in 1982, and its initial standard (SCSI-1) was approved in 1986. The current level of standardization is SCSI-3, with further work continuing. The SCSI standard specifies support for many different devices, not just disk and tape, but diskette drives, network devices, scanners, media changers, and others. The SCSI bus, probably the most common peripheral interconnect today, is fully specified by the standards in terms of cabling, connectors, commands, and electrical signaling.


Conceptually, it operates like Ethernet, acting as a network between devices, although the implementation details are quite different. It is very similar to the IBM mainframe parallel channels in operation. Indeed, most I/O buses follow similar principles. Although the SCSI bus has bandwidth limitations, you can use SCSI devices on other bus mechanisms, such as Fibre Channel. This has helped SCSI devices to remain widely used, especially with regard to disk storage.


SCSI Bus Characteristics


There are several different types of SCSI bus, distinguished by speed, length (distance it can support), and width (number of bytes transported at once). A SCSI bus might be able to support many different sets of characteristics simultaneously, depending on the attached devices. The specific characteristics supported by a given device are negotiated with the attaching system's device drivers. The SCSI bus operates like a packet-switched bus: It connects to the device and sends a message, and then it disconnects and waits for a response. The SCSI bus usually does not have to remain connected during an entire I/O operation, such as a read (although it can). This behavior is determined by the device.


SCSI Speeds
The SCSI bus supports four different speeds (the rate at which bytes are transferred along the cables). All SCSI devices support the asynchronous speed. Faster speeds are negotiated with the device by the system device driver and can usually be determined with the prtconf command (explained later in this module). The same bus might have devices operating at different speeds; there is no problem in doing this. Bus arbitration (the process of determining which device may use the bus) always occurs at an asynchronous rate. However, it might not be a good idea to put slow devices on the bus because they can slow the response of faster devices on the bus. Serial SCSI, also known as Firewire, takes a different approach to boost the data throughput of Fast/Wide SCSI. Managing only one data line instead of the multiple (16) SCSI bus lines, it is possible to increase the speed from the maximum 20 MHz of Ultra SCSI to values over 1 GHz. Even using just one of the 16 bus lines, 1 GHz divided by 16 still provides a throughput value of 64 MBytes/sec.


Keep in mind that the SCSI speeds given in the table in the previous overhead are maximums. Like a speed limit, these are speeds that are not to be exceeded. They do not require the device to operate at that speed; it can negotiate and run at a slower rate. You can limit the speed at which the bus will transfer, and the use of some SCSI capabilities, such as tagged queueing, by setting the scsi_options parameter in the /etc/system file. Normally, this would not be necessary, but for some devices it might be required. It can also be set (through the OpenBoot PROM (OBP)) on an individual SCSI controller basis. With the introduction of Fibre Channel, the speed of data transfer can be significantly increased. Transfer rates of up to 320 Mbytes/sec have been achieved, but rates could double, on average, every 18 months (following Moore's principle).
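For reference, such an override is a one-line entry in /etc/system. The value below is purely illustrative (the individual option bits are system-dependent, so check your Solaris documentation before setting it), and the change takes effect at the next boot:

```
* Hypothetical example only -- limits global SCSI option
* negotiation; consult your documentation for real bit values.
set scsi_options=0x58
```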


SCSI Widths
The width of a SCSI bus specifies how many bytes can be transferred in parallel down the bus. Multiplied by the speed, this gives the bandwidth of the bus. Just like speed, width is negotiated by the driver and device at boot time and can be mixed on the same bus. Again, prtconf provides the current setting for a device. Each width uses a different cable. It is possible to use an adapter cable to mix narrow and wide devices on a bus, regardless of whether the host adapter is narrow or wide. Keep in mind that after a wide bus has been connected to narrow devices, it cannot be made wide again.
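The width-times-speed relationship is simple enough to sketch; the 10-MHz rate and 2-byte width below are the classic fast/wide combination, used here only as an example:

```shell
# bandwidth (Mbytes/sec) = transfer rate (MHz) * width (bytes)
speed_mhz=10       # "fast" SCSI clock
width_bytes=2      # "wide" (16-bit) bus
echo "$(( speed_mhz * width_bytes )) Mbytes/sec"
```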


SCSI Lengths
Unlike speed and width, the length capability of a SCSI device is manufactured in and cannot be changed or negotiated. The electrical characteristics of single-ended and differential buses are different, so never mix device types on a bus. If you are unsure of the device type or bus type, look for the identifying logo on the device or interface board. While there are also specific terminators, cables will work in either environment. As long as you use an active terminator with the light-emitting diode (LED), the terminator has no influence on total bus length. (A passive terminator can greatly limit the total length.) The SCSI bus is limited, in either case, to the distance that the signals can travel before enough voltage is lost to make distinguishing a 0 from a 1 unreliable. The lengths given in the table are maximums; every device on the bus and every cable join (butt) takes an additional 18 inches (0.5 meter) off the total length. Remember that a shorter cable always provides better performance than a longer one.
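A sketch of that length budget, working in tenths of a meter so the arithmetic stays integral. The 6-meter limit, device count, and join count here are example figures, not the spec for any particular bus type:

```shell
# usable cable = bus limit minus 0.5 m per device and per join
limit=60       # assumed 6.0 m bus limit, in tenths of a meter
devices=4
joins=2
usable=$(( limit - 5 * (devices + joins) ))
echo "usable cable: $(( usable / 10 )).$(( usable % 10 )) meters"
```

With four devices and two joins, half the nominal length is already gone, which is why crowded buses fail at cable lengths that look legal on paper.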


SCSI Properties Summary


Every SCSI device on a bus operates with a set of characteristics designed into it and negotiated with the driver. While there are common mixtures, such as fast-wide, uncommon mixtures are acceptable. Usually no message is given (nor is information made available) by the system describing the mode in which a device is operating. Because you might have devices of different capabilities on the same bus, it is common to find performance problems resulting from this. A slower device might prevent a faster device from performing an operation, sometimes for a significant period of time. This is described later in this module.


SCSI Interface Addressing


On a narrow SCSI bus, there are 57 possible devices, and on a wide bus there are 121. It is commonly thought that there are only eight, which is an understandable misconception. A SCSI target is not a device; it is a device controller. It specifies an attachment point on the bus where as many as eight devices (called LUNs, or logical unit numbers) may be attached. This is reflected in the tXdX notation that is used for SCSI devices, showing the target and device LUN addresses. Embedded SCSI devices support only one LUN per target. Most SCSI devices available today are embedded, causing the confusion about the number of devices supported on the bus. A narrow bus supports eight targets, a wide bus 16. (This is because of the use of cable data lines to arbitrate bus usage.) The host adapter has only one LUN.
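The 57 and 121 figures fall out of the target/LUN arithmetic, since the host adapter occupies one target and presents only a single LUN:

```shell
# devices = (targets - 1 for the host adapter) * 8 LUNs
#           + the host adapter itself
narrow=$(( (8 - 1) * 8 + 1 ))
wide=$(( (16 - 1) * 8 + 1 ))
echo "narrow bus: ${narrow} devices, wide bus: ${wide} devices"
```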


SCSI commands are issued by an initiator, a device on the bus that can initiate commands. The most common initiator is the host adapter, the system's interface to the SCSI bus. Although it is rare, the standard does allow any device to be an initiator. While host adapter and initiator are used synonymously, they technically are different.


It is possible to have more than one initiator on a SCSI bus. This is known as multi-initiator SCSI (MIS) and is described later in this module.


SCSI Bus Addressing


The overhead shows a SCSI bus configured with embedded devices. The devices can be attached in any order, although more important devices (with higher priority LUN values) should, if possible, be attached closer to the host adapter. The SCSI terminator does not take a bus target position. It is invisible to the host adapter and devices and is used to control the electrical characteristics of the bus.


SCSI Target Priorities


Most networks do not have a priority mechanism; whichever device needs to transmit can do so if the network medium is free. This is not the case on a SCSI bus. Each target has a priority. When a device on a target wants to transmit and the bus is free, an arbitration process is performed. Lasting about 1 millisecond (ms), the arbitration phase requests all devices that need to transmit to identify themselves, and then grants the use of the bus to the highest priority device requesting it. Higher priority devices get better service. The priority scheme is not obvious. On a narrow bus, priorities run from 7 (highest) to 0 (lowest). On a wide bus, for compatibility reasons, priorities run 7-0, then 15-8. The host adapter is usually 7, the highest priority target. With multi-initiator SCSI, the initiators should be the highest priority devices on the bus.
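The wide-bus ordering can be captured as a small ranking function, purely a sketch of the priority rule rather than anything the system itself exposes:

```shell
# Rank a wide-SCSI target ID: 0 is the highest priority.
# IDs 7..0 (the narrow-compatible targets) outrank IDs 15..8.
rank() {
    if [ "$1" -le 7 ]; then
        echo $(( 7 - $1 ))
    else
        echo $(( 23 - $1 ))
    fi
}
echo "ID 7 rank: $(rank 7)"     # host adapter: highest priority
echo "ID 15 rank: $(rank 15)"   # best of the wide-only targets
echo "ID 8 rank: $(rank 8)"     # lowest priority on the bus
```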


Fibre Channel
Fibre Channel (FC) is a reliable, scalable, gigabit interconnect technology allowing concurrent communications among workstations, mainframes, servers, data storage systems, and other peripherals. Fibre Channel attempts to combine the best of channels and networks into a single I/O interface. The network features provide the required connectivity, distance, and protocol multiplexing. The channel features provide simplicity and performance. Fibre Channel operates on both fiber cable and copper wire and can be used for more than just disk I/O. The Fibre Channel specification supports high-speed system and network interconnects using a wide variety of popular protocols, including:

•  SCSI – Small computer system interface
•  IP – Internet Protocol
•  ATM – Adaptation layer for computer data (AAL5)
•  IEEE 802.2 – Token Ring


Fibre Channel has been adopted by the major computer systems and storage manufacturers as the next technology for enterprise storage. It eliminates the distance, bandwidth, scalability, and reliability issues of SCSI. Fibre Channel-Arbitrated Loop (FC-AL) was developed specifically to meet the needs of storage interconnects. Using a simple loop topology, FC-AL supports both simple configurations and sophisticated arrangements of hubs, switches, servers, and storage systems. With the ability to support up to 126 FC-AL nodes (device objects) on a single loop, cost and implementation complexity are greatly reduced. The FC-AL development effort is part of the American National Standards Institute/International Standards Organization (ANSI/ISO) accredited SCSI-3 standard, which helps to avoid the creation of non-conforming, incompatible implementations. Virtually all major system vendors are implementing FC-AL, as are all major disk drive and storage system vendors. Gigabit Fibre Channel's maximum data rate is 100 Mbytes/sec per loop, after accounting for overhead, with up to 400 Mbytes/sec envisioned for the future. This is faster than any Ultra-SCSI-3, serial storage architecture, or P1394 (Firewire).


In a Fibre Channel network, legacy SCSI storage systems are interfaced using a Fibre Channel-to-SCSI bridge. IP is used over FC for server-to-server and client-to-server communications. Table 7-1 shows the characteristics of Fibre Channel.
 

Much of the original Fibre Channel development work was done by Seagate. For details of fibre channel developments, visit the following URL: http://www.fibrechannel.com

Table 7-1  Fibre Channel Characteristics

  Description                      Characteristic
  Technology application           Storage, network, video, clusters
  Topologies                       Point-to-point, loop hub, switched
  Standard baud rate               1.06 Gigabits/second
  Scalability to higher rates      2.12 Gigabits/second or 4.24 Gigabits/second
  Guaranteed delivery mechanism?   Yes
  Congestion data loss             None
  Frame size                       Variable: 0 to 2 Kbytes
  Flow control                     Credit based
  Physical media supported         Copper and fiber cable
  Protocols supported              Network (IP, IEEE 802.2), SCSI, ATM
                                   Adaptation Layer (AAL5), Fibre Channel
                                   Link Encapsulation (FC-LE)
  Maximum distance                 Up to 10 kilometers


Disk I/O Time Components


A SCSI I/O operation goes through many steps. To determine where bottlenecks are, or to decide what must be done to improve response time, an understanding of these steps is important.

Access to the I/O Bus


An I/O operation cannot begin until the request has been transferred to the device. The request cannot be transferred until the initiator can access the bus. As long as the bus is busy, the initiator must wait to start the request. During this period, the request is queued by the device driver. A queued request must also wait for any other requests ahead of it. A large number of queued requests suggests that the device or bus is overloaded. The queue can be reduced by several techniques: use of tagged queueing, a faster bus, or moving slow devices to a different bus.

Queue information is reported by the sar -d report and the iostat -x report.

# iostat -x
                          extended device statistics
device   r/s   w/s   kr/s   kw/s  wait  actv  svc_t   %w   %b
fd0      0.0   0.0    0.0    0.0   0.0   0.0    0.0    0    0
sd1      1.0   0.5    9.6    9.3   0.0   0.1   32.2    0    2
sd3      0.7   0.2    4.1    1.1   0.0   0.0   19.3    0    1
sd6      0.0   0.0    0.0    0.0   0.0   0.0    0.0    0    0
nfs1     0.0   0.0    0.0    0.0   0.0   0.0    0.0    0    0

The fields in this output are:

  wait   The average number of transactions waiting for service (queue
         length). In the device driver, a read or write command is issued
         to the device driver and sits in the wait queue until the SCSI
         bus and disk are both ready.

  %w     Percentage of time the queue is not empty; that is, the
         percentage of time transactions wait for service.

# sar -d 5 3
SunOS earth 5.8 Generic sun4m    10/13/00

08:44:41   device   %busy   avque   r+w/s   blks/s   avwait   avserv
08:44:46   fd0          0     0.0       0        0      0.0      0.0
           nfs1         0     0.0       0        0      0.0      0.0
           sd1         25     0.3      26      225      0.0      9.9
           ...
           sd1,g       25     0.3      26      221      0.0      9.9
           sd1,h        0     0.0       0        4      0.0      9.2
           sd3         11     0.1      11       69      0.0     10.3
           sd3,a       11     0.1      11       69      0.0     10.3
           sd3,c        0     0.0       0        0      0.0      0.0
           ...
           sd6          0     0.0       0        0      0.0      0.0
           ...

In this output, one field is avwait, which is the average time, in milliseconds, that transfer requests wait idly in the queue. (This is measured only when the queue is occupied.)

Bus Transfer Time
Bus transfer time includes several components: arbitration time, during which it is determined which device can use the bus (the host adapter always wins), and the time required to transfer the command to the device over the bus. This takes approximately 1.5 ms. In the case of a write, this time also includes the data transfer time.

Seek Time
Seek time is the amount of time required by the disk drive to position the arm to the proper disk cylinder. It is usually the most costly part of the disk I/O operation. Seek times quoted by disk manufacturers are average seek times, or the time that the drive takes to seek across one-half of the (physical) cylinders. With tagged queueing active, this time is not representative of the actual cost of a seek.

Rotation Time
Rotation time, or latency, is the amount of time that it takes for the desired data to rotate under the read/write heads (Figure 7-1). It is usually quoted as one-half of the time a complete rotation takes. It is not added into the seek time when a manufacturer quotes disk speeds, although a figure from which it can be derived is provided as the rotational speed, in revolutions per minute (RPM).

Figure 7-1  Rotation Time (illustrating a 9 ms seek time and a 4.1 ms rotation time)

Internal Throughput Rate (ITR) Transfer Time
ITR is the rate at which data is transferred between the device's cache and the actual device media. It is usually much slower than the bus speed. ITRs are not always published for devices, although they can be found in the manufacturer's detailed specifications. ITRs are usually between 5 and 10 Mbytes/sec. The ITR is the limiting factor on performance for sequential disk access. For random access patterns, I/O time is dominated by head-seek and disk-rotation time.

Reconnection Time
When the data has been transferred to or from the cache, the request must be completed with the host adapter. An arbitration and request status transfer are performed, identical to those done at the start of the request. The only difference is that the arbitration is done at the device's priority, not that of the host adapter. If the request is a read, the data is transferred to the system as part of this operation. Timing is similar to host adapter arbitration, about 1.5 ms, plus any data transfer time.

Interrupt Time
This is the amount of time that it takes for the completion interrupt for the device to be handled. It includes interrupt queueing time (if higher priority interrupts are waiting), time in the interrupt handler, and time in the device driver interrupt handler. It is very difficult to measure the interrupt time. However, if a high interrupt rate is observed on the CPUs of the system board to which the device is attached, and no other cause for delay can be seen, interrupt time might be part of the problem. Move the host adapter or higher interrupt priority devices to another system board, if necessary.


Disk I/O Time Calculation


This overhead shows the calculation of the total cost of a write I/O operation. Remember, response time is the total elapsed time for an operation to complete. Service time is the actual request processing time, ignoring any queueing delays. Some of the costs are fixed, such as arbitration and command/status transfer; some are device specific, such as ITR, seek time, and latency; and some are unpredictable, such as queueing time and interrupt time. The cost of an operation on a particular device can be calculated, given knowledge of the device characteristics and the average data transfer size. The following tables can help in the cost-of-operation calculations.
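The components described on the preceding pages can be added up in a short sketch. The figures below are illustrative assumptions drawn from this module (a fast/wide bus, a 7200-RPM drive, a 9 ms average seek, an ITR in the 5 to 10 Mbytes/sec range, and roughly 1.5 ms each for command and reconnection overhead), not measured values:

```python
# Sketch: total service time for one write I/O, summing the components
# described above. All parameter defaults are illustrative assumptions.

def write_service_time_ms(transfer_kb, bus_mb_per_sec=20.0,
                          avg_seek_ms=9.0, rpm=7200,
                          itr_mb_per_sec=7.5, command_ms=1.5,
                          reconnect_ms=1.5):
    bus_xfer = transfer_kb / 1024.0 / bus_mb_per_sec * 1000.0  # data over the bus
    latency = 0.5 * 60000.0 / rpm                 # half a rotation, on average
    media_xfer = transfer_kb / 1024.0 / itr_mb_per_sec * 1000.0  # cache to media
    return (command_ms + bus_xfer + avg_seek_ms +
            latency + media_xfer + reconnect_ms)

print(round(write_service_time_ms(8), 1))   # an 8-Kbyte write, in ms
```

Note that the seek and latency terms dominate; this is why ordered seeks and drive caching matter far more than raw bus speed for random I/O.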

Bus Data Transfer Timing
Table 7-2 provides details regarding the timings associated with a variety of SCSI devices.

Table 7-2  Bus Data Transfer Timing

  Type                  Speed           1 Sector    8 Kbytes
  Asynchronous/narrow    4 Mbytes/sec   0.128 ms    2.05 ms
  Synchronous/narrow     5 Mbytes/sec   0.102 ms    1.63 ms
  Fast/narrow           10 Mbytes/sec   0.051 ms    0.82 ms
  Fast/wide             20 Mbytes/sec   0.025 ms    0.41 ms
  Ultra/wide            40 Mbytes/sec   0.013 ms    0.21 ms
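The entries in Table 7-2 are simple divisions; the table's figures appear to treat 1 Mbyte as 10^6 bytes, an assumption made explicit here:

```python
# Sketch: bus transfer time in milliseconds for a payload of nbytes
# at a given bus speed, matching the Table 7-2 style of calculation.

def bus_transfer_ms(nbytes, mbytes_per_sec):
    # Assumes decimal Mbytes (10**6 bytes), as Table 7-2 appears to
    return nbytes / (mbytes_per_sec * 1_000_000) * 1000.0

print(round(bus_transfer_ms(512, 10), 3))       # one sector, fast/narrow
print(round(bus_transfer_ms(8 * 1024, 4), 2))   # 8 Kbytes, asynchronous/narrow
```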

Latency Timing
Table 7-3 provides details of latency timing for disks with various revolution rates.

Table 7-3  Latency Timing Calculation

  Revolutions per   Revolutions per   Milliseconds per   Milliseconds per
  Minute (RPM)      Second (RPS)      Revolution         One-half Revolution
  3600              60                16.7               8.33
  5400              90                11.1               5.55
  7200              120               8.33               4.17
  10,000            167               5.98               2.99
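The columns of Table 7-3 follow directly from the spindle speed. A minimal sketch (results may differ from the table in the last decimal place due to rounding):

```python
# Sketch: derive rotation time and average rotational latency from RPM.

def rotational_latency(rpm):
    """Return (ms per revolution, ms per half revolution) for a spindle speed."""
    ms_per_rev = 60_000.0 / rpm      # 60,000 ms in a minute
    return ms_per_rev, ms_per_rev / 2.0

for rpm in (3600, 5400, 7200, 10_000):
    rev, half = rotational_latency(rpm)
    print(f"{rpm:>6} RPM: {rev:5.2f} ms/rev, {half:5.2f} ms average latency")
```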


Disk Drive Features


Disk drives have become complex devices. Drives today contain upwards of 8 Mbytes of DRAM and custom logic with Intel 80286 microprocessor cores. They perform many functions in addition to basic data retrieval. Disk technology evolves at an ever-increasing rate due to the high demand for storage solutions. With the growth of Internet-related usage, the need to centralize data within an organization, and greater use of large-scale databases, new disk storage techniques and features are launched to market every month. The features described over the next few pages are just some of the storage features available to you.


Multiple Zone Recording


Disk drives are presented to the operating system (OS) as having a particular geometry. For example, the device might report 2200 cylinders, 16 heads, and 64 sectors per track. A simple examination of the device shows that this cannot be the real geometry: there are not 16 recording surfaces in a drive that is only 1 inch high. Manufacturers are very creative when it comes to increasing the capacity of their devices. A drive listed at 1.02 billion bytes is a 1.0-Gbyte drive, because a Gbyte has been defined as 1 billion bytes, not 1024 Mbytes. This, however, is only good for a 2 percent increase. The technical approach used to increase the capacity is called multiple zone recording (MZR), also known as zone bit recording (ZBR). This technique takes advantage of the drive's ability to vary the write density between the inner, smaller cylinders and the outer, larger ones.

With the high-capability microprocessor in the drive, this problem can be overcome. A nominal geometry is reported to the OS, which the OS uses in formulating requests to the drive. The final request is sent as an absolute sector number from the beginning of the drive.

The MZR drive is divided into zones, usually 12 to 14, each of which has a constant density. As the zones move inward, the number of sectors on each track decreases. The drive contains a table of which sectors are in each zone. There might be 60 sectors per track in an inner zone and 120 or more in an outer zone. The drive takes the requested sector number, translates it using the zone table, and determines the actual physical location on the drive to be accessed. The varying, true geometry of the drive is hidden from the OS.

This puts denser cylinders on the outside of the drive, so more data can be accessed without moving the arm. The difference is noticeable: when accessing data sequentially, the performance difference from the outermost to the innermost cylinders can be as much as 33 percent. MZR also improves the ITR for the outer area of the disk. When partitioning a ZBR disk, cylinder 0 is usually faster than the last cylinder, because cylinder 0 is most often located on the outside edge of the platter (though this can vary depending upon the manufacturer).
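The zone-table translation the drive performs can be sketched as follows. The three-zone layout is hypothetical (real drives have 12 to 14 zones), but the mechanism - subtracting whole zones, then dividing by that zone's sectors per track - is the one described above:

```python
# Sketch of MZR address translation: absolute sector number -> physical
# (track, sector) location. The zone layout here is made up for
# illustration: (tracks in zone, sectors per track), outermost first.

ZONES = [(1000, 120), (1000, 90), (1000, 60)]

def locate(abs_sector):
    track_base = 0
    for tracks, spt in ZONES:
        zone_sectors = tracks * spt
        if abs_sector < zone_sectors:
            return track_base + abs_sector // spt, abs_sector % spt
        abs_sector -= zone_sectors      # skip past this whole zone
        track_base += tracks
    raise ValueError("sector beyond end of drive")

print(locate(0))        # first track of the outer zone
print(locate(120_000))  # first track of the second zone
```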


Explain that, within the product lifetime of a disk, the zoning details might be altered to enhance the storage capacity of the disk. Therefore, if the same disk type is purchased 12 months later, its zoning might not match the disk type bought earlier. A good Web site that explains disk zoning is: http://www.pcguide.com/ref/hdd/tracksZBR-c.html A good Web site that explains disk terminology is: http://www.seagate.com:80/support/glossary


Drive Caching and Fast Writes


Disk drives and most SCSI devices manufactured today contain significant amounts of data cache. When an I/O request is made, data is transferred in and out of the cache on the device before the actual device recording mechanism is used. This allows the device to transfer data at the speed of the bus rather than at its more limited ITR. Individual SCSI devices today have from 0.5 to 8 Mbytes or more of cache; some storage arrays have multiple Gbytes. In association with the tagged queueing feature (described on page 7-29), multiple requests may be active concurrently on the device.

Fast Writes
Fast writes allow the device to signal completion of a write operation when the data has been received in the cache, before it is written to the media. (The disk drive's cache becomes write-back instead of write-through.)

While providing significant speed increases (as described later in this module), the use of fast-write capability is very unsafe unless combined with an uninterruptible power supply (UPS). A loss of power can cause the data to be lost before it is written to the media, even though the OS has already been told that the write was successful. Fast writes can be set on an individual drive basis (although an OS interface to do so is not usually available) or, more commonly, can be set in storage arrays.


In storage arrays, data is usually written to an NVRAM cache within the array. This means that data is much more secure when written to an array device if fast-write has been enabled, as the data will be safely stored in NVRAM should a power outage occur.


Tagged Queueing
Tagged queueing is a mechanism that allows the device to support multiple concurrent I/O operations. Its use is negotiated between the device and the driver. Not all devices support tagged queueing. (The prtconf and scsiinfo commands can be used to determine whether a specific device is using tagged queueing.) Solaris 2.4 OE and higher operating environments have tagged queueing enabled by default, at a depth of 32. In addition to the regular SCSI command, an extra ID field is added that identifies the request to the OS device driver. When the request completes, the tag is returned with the request status, identifying to the driver which command has completed. In and of itself, support for multiple outstanding commands does not provide a significant performance advantage. The advantage comes from the ability of the device to optimize its operation, which is done through ordered seeks.

Ordered Seeks
With tagged queueing, multiple requests are sent to the disk for execution. If they were executed in the order in which they were received, the total motion of the disk arm (the seeks) would be the sum of all distances between the requests. With ordered seeks, sometimes called elevator seeks, the arm makes passes back and forth over the drive. As the location for a request is reached, the request is executed. While the requests are fulfilled out of order, the arm motion is greatly reduced, and the average execution time for a request is significantly less than it would be otherwise.

As the number of simultaneous requests to the disk increases, average request response time improves (to a certain point) because the average arm movement for a request decreases. The device operates more efficiently under load. To avoid long delays for requests at the opposite end of the drive, requests that arrive after a pass has started are queued for the return pass. Response time improvements of up to 66 percent have been measured.

Here is an example portion of the prtconf output:

# prtconf -v
System Configuration: Sun Microsystems sun4m
...
esp, instance #0
    Driver properties:
        name <target3-sync-speed> length <4>
            value <0x00002710>.
        name <target3-TQ> length <0>
            -- <no value>.
        name <scsi-selection-timeout> length <4>
            value <0x000000fa>.
        name <scsi-options> length <4>
            value <0x00001ff8>.
        name <scsi-reset-delay> length <4>
            value <0x00000bb8>.
...
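The elevator policy described above can be sketched as a reordering of the queue into one sweep of the arm. The cylinder numbers and starting position below are made up for illustration, but the comparison shows why ordered seeks cut total arm travel so sharply:

```python
# Sketch of elevator (ordered) seeks: compare total arm travel for
# FIFO service against one sweep up and a return pass. Cylinder
# numbers and the starting position (500) are illustrative.

def arm_travel(start, cylinders):
    pos, total = start, 0
    for c in cylinders:
        total += abs(c - pos)   # distance seeked to reach this request
        pos = c
    return total

queue = [900, 100, 850, 150, 800]   # arrival order
# One pass upward from cylinder 500, then back down for the rest:
sweep = sorted(c for c in queue if c >= 500) + \
        sorted((c for c in queue if c < 500), reverse=True)

print(arm_travel(500, queue))   # FIFO service order
print(arm_travel(500, sweep))   # elevator order: far less travel
```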


Mode Pages
You might have noticed that it is no longer necessary to provide format.dat entries for your disk drives. This is because SCSI devices provide a descriptive interface called mode pages. Mode pages are blocks of data that are returned by a SCSI device in response to the appropriate request. The SCSI standard defines several specific mode pages that must be supported by every SCSI device, and some additional ones for certain device types. Device manufacturers usually add their own vendor-specific mode pages to the device. Specific mode pages are identified by number. By interrogating the mode pages, the driver or OS can determine many of the device characteristics. In general, mode pages are not visible to the user, although they can be viewed by using the expert mode of the format command (the -e option) and then invoking the scsi command. Mode pages can also be written to, in order to change device behavior.


Device Properties From prtconf


The prtconf command is used to format the OBP device tree. (Explanation of the OBP device tree is beyond the scope of this course.) The OBP contains a node for each device that the OBP finds or that is attached through a peripheral bus. By using the prtconf -v command, the attributes of these device nodes can be seen. For SCSI devices, this includes properties such as bus speed (targetX-sync-speed, a hexadecimal number in Kbytes/sec), support for wide buses (the targetX-wide property is present), and tagged queueing (the targetX-TQ property is present). This information can also be retrieved from the device mode pages.

Table 7-4 shows the speeds reported by prtconf and the translated bus speed.

Table 7-4  prtconf Bus Speeds

  prtconf Speed   Bus Speed
  9c40            40 Mbytes/sec
  4e20            20 Mbytes/sec
  2710            10 Mbytes/sec
  1388             5 Mbytes/sec
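The translation is just hexadecimal Kbytes/sec converted to decimal Mbytes/sec, as this small sketch shows:

```python
# Sketch: decode a targetX-sync-speed property value. prtconf reports
# the negotiated speed as a hexadecimal count of Kbytes/sec.

def sync_speed_mbytes(hex_value):
    return int(hex_value, 16) / 1000.0   # Kbytes/sec -> Mbytes/sec

for v in ("9c40", "4e20", "2710", "1388"):
    print(v, "->", sync_speed_mbytes(v), "Mbytes/sec")
```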


Device Properties From scsiinfo


The scsiinfo command can provide the same type of SCSI configuration information as prtconf; however, it is specifically designed for use with SCSI devices, and its output is more detailed and easier to read. The scsiinfo command is not provided with the SunOS operating system, but can be downloaded from the World Wide Web. The uniform resource locator (URL) is ftp://ftp.cs.toronto.edu/pub/jdd/scsiinfo/. The ISP, FAS, and GLM controllers are supported under the Solaris OE.

Examples of output from scsiinfo are as follows:

# ./scsiinfo -o
esp0: sd0,0 tgt 0 lun 0: Synchronous(10.0MB/sec) Clean CanReconnect
esp0: sd2,0 tgt 2 lun 0: Synchronous(10.0MB/sec) Clean CanReconnect
esp0: sd3,0 tgt 3 lun 0: Synchronous(10.0MB/sec) Clean CanReconnect
esp0: sd6,0 tgt 6 lun 0: Synchronous(4.445MB/sec) Clean CanReconnect
esp0: st4,0 tgt 4 lun 0: Synchronous(5.715MB/sec) Noisy CanReconnect
esp0: st5,0 tgt 5 lun 0: Asynchronous Noisy CannotReconnect
#
# ./scsiinfo -r /dev/rdsk/c0t3d0s0
Vendor: SEAGATE
Model: ST31200W SUN1.05
Revision: 8724
Serial: 00440781
Device Type: Disk
Data Transfer Width(s): 16, 8
Supported Features: SYNC_SCSI TAG_QUEUE
Formatted Capacity: 2061108 sectors (1.03 GB)
Sector size: 512 bytes
Physical Cylinders: 2700
Heads: 9
Sectors per track (Avg.): 85
Tracks per zone: 9
Alternate Sectors per zone: 9
Alternate Tracks per volume: 18
Rotational speed: 5411 rpm
Cache: Read-Only
#


I/O Performance Planning


The characteristics of the I/O operations can have a dramatic impact on the throughput that the bus can sustain. Sequential operations run at the drive's ITR, limiting the bus performance to that of the drive. Random operations run at bus speed, allowing many more operations. Because of recent advances in disk technology, you will probably be limited more by the system bus capacity than by the disk drive. Better layout of file systems, use of data striping, and a balanced ratio of drives to controllers can have more impact on performance than altering features such as tagged queueing. When you are planning a disk configuration, you must design for the type of load the system will create.

Decision Support Example (Sequential I/O)
For a decision-support or data-warehousing application, 700 Mbytes/sec of throughput is required. Intelligent SCSI processor (ISP) controllers will be used. Use the following calculations:

700 Mbytes/sec ÷ 14 Mbytes/sec per controller = 50 ISP controllers
700 Mbytes/sec ÷ 5.5 Mbytes/sec per RSM drive = 128 drives

The designed configuration would need 50 ISP controllers, with only two or three devices per controller.

Transaction Processing Example (Random I/O)


For a transaction-processing application, 20,000 I/O operations/sec of throughput is required. SOC controllers for SPARCstorage arrays (SSAs) will be used. Use the following calculations:

20,000 I/O operations/sec ÷ 2400 operations/sec per SOC controller = 8 SOC controllers
20,000 I/O operations/sec ÷ 75 operations/sec per SSA RSM drive = 266 RSM drives

For a better spread of disk loading, the designed configuration could double the number of SOC controllers and use 16 SOC controllers, with 16 to 18 devices per controller. These figures must be verified against the capacities of the various controllers, motherboard slots, and so on.
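Both examples are the same division, rounded up to whole devices. A minimal sketch using the per-controller and per-drive figures quoted above (note that rounding up pushes the two random-I/O figures one higher than the truncated values in the text; always round capacity requirements up):

```python
# Sketch of the sizing arithmetic above. The throughput figures per
# controller and per drive are the ones quoted in the examples; verify
# them against the actual hardware before committing to a design.

import math

def units_needed(required, capacity_per_unit):
    return math.ceil(required / capacity_per_unit)

# Sequential (decision support): 700 Mbytes/sec target
print(units_needed(700, 14))      # ISP controllers at 14 Mbytes/sec each
print(units_needed(700, 5.5))     # RSM drives at 5.5 Mbytes/sec each

# Random (transaction processing): 20,000 I/O operations/sec target
print(units_needed(20_000, 2400))  # SOC controllers at 2400 ops/sec each
print(units_needed(20_000, 75))    # RSM drives at 75 ops/sec each
```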


Point out to the students that the calculations are based on required throughput. The system capacity must also be verified. For example, what is the maximum number of devices that can be connected to a SOC controller? If this maximum is used, would this saturate or overload the SOC controller? The point being made in these examples is that calculations should be made first and then the values should be verified to see if they can be achieved with the hardware available.


Tape Drive Operations


Remember that bus speed is a maximum, not a requirement. Tape drives, even though put on a fast bus, usually cannot transfer faster than their tape motion allows. If a tape drive transfers slowly, say at synchronous speed, it holds the bus for 1 ms for an 8-Kbyte transfer, even if it is on a fast/wide bus. Because the tape must be a high-priority device (missed requests are expensive to restart), its effect on other devices on the bus can be dramatic. This kind of effect is known as SCSI starvation, where a higher priority device, whether very active or just slow, prevents lower priority devices from receiving adequate service. The best solution is to move the slow devices to another bus or try to split the load of highly active devices.


Storage Arrays and JBODs


A JBOD (just a bunch of disks) is a storage device that provides a compact package for a number of disk drives, usually with a common power source. A JBOD does not provide any other functions, such as caching or RAID. The devices are individually visible on the bus. Examples are the Sun Multi-Pack and the SPARCstorage array. A storage array provides a compact package with additional management for the devices. It may also provide caching, RAID, transparent failed-drive replacement, or special diagnostic capabilities. Drives might not be individually visible outside of the array. Examples are the Sun StorEdge A5000, the Sun StorEdge A3000, and the Sun StorEdge A7000.


Storage Array Architecture


Conceptually, a storage array is quite simple: it is a box of SCSI buses. The array provides a single interface (the front end) into a series of SCSI buses accessing the array drives (the back end). The front end is configured as any individual storage device and assigned an appropriate target number. (Customarily, the array is the only device on its bus due to its high I/O rate.) The back end is tuned as a series of SCSI buses. Use of the drives is planned just as if they were drives configured on buses outside an array. Target priorities apply as on a regular SCSI bus. A storage array, depending on its caching ability, might be slower than the equivalent number of individual drives, especially on reads, due to controller overhead. Also, the devices reported by a storage array in the performance tools might not correspond to actual physical devices. Be sure you know what they represent physically.


RAID
As defined by the original University of California at Berkeley research papers, a redundant array of independent disks (RAID) is a simple concept: several disks are bound together to appear to the host system as a single, large storage device. The goal of RAID units in general is to overcome the inherent limitations of a single disk drive and to improve I/O performance and data storage reliability. There are six levels of RAID, numbered from 0 through 5, designed to address either performance or reliability. Some RAID levels are fast, some are inexpensive, and some are safe; none of the RAID levels deliver more than two of the three. Today, RAID Levels 2 through 4 are rarely, if ever, seen. The most common implementations provide, through hardware or software (or both), RAID Levels 0, 1, or 5. As with any technology, there are cases where the use of a certain RAID level is appropriate and cases where it is not. The incorrect use of a particular RAID level, especially RAID-5, can destroy the system's I/O performance.


RAID-0
RAID-0 provides significant performance benefits, but no redundancy benefits. Instead of concentrating the data on one physical volume, RAID-0 implementations stripe the data across multiple volumes. This allows a single file or partition to be accessed through multiple drives. Striping improves sequential disk access by permitting multiple member disks to operate in parallel. Striping also improves random access I/O performance, though not directly. Striping does not benefit individual random access requests, but it can dramatically improve overall disk I/O throughput in environments dominated by random access by reducing disk utilization and disk queues. When more I/O requests are submitted than the disk can handle, a queue forms and requests must wait their turn. Striping spreads the data evenly across all of the volume's member disks, providing a perfect mechanism for avoiding long queues.

When you initiate many physical I/O operations for each logical I/O operation, you can easily saturate the controller or bus to which the drives are attached. You can avoid this problem by distributing the drives across multiple controllers. An even better solution is to put each drive on a private controller. None of the drives on a single SCSI bus should be part of the same RAID group. Adding more controllers and managing I/O distribution across those controllers is critical to achieving optimal performance from a RAID configuration. The drawback to RAID-0 is that it is not redundant. The loss of any physical volume in the RAID-0 configuration usually means that the entire contents of the configuration need to be restored. When considering striping, you should consider a hardware RAID solution. This allows the striping to be performed by the RAID controller and cache rather than being managed by the host computer system. This relieves the load on the host's CPU and RAM, which should, in turn, provide a performance gain.


RAID-0 Stripe Sizing


Tuning the stripe size is very important. If the stripe size is too large, many I/O operations fit in a single stripe and are restricted to a single drive. If the stripe is too small, many physical operations take place for each logical operation, saturating the bus controller.

Performance in random access environments is maximized by including as many member disks as possible. Disk utilization, disk response latency, and disk queues are minimized when the number of disks is maximized. This optimal result is accomplished with a relatively large stripe size, such as 64 Kbytes or 128 Kbytes.

Performance in sequential access environments is maximized when the size of the I/O request is equal to the stripe size. Match the stripe size to a multiple of the system's page size, or to a common application I/O parameter. Oracle, for example, blocks all I/O into 8-Kbyte operations; a four-drive RAID-0 group with a 2-Kbyte stripe size would balance each Oracle read or write across all four drives in the RAID group.
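The trade-off above can be made concrete by counting how many member disks one logical request touches. A minimal sketch, assuming the request starts on a stripe boundary:

```python
# Sketch: member disks touched by a single logical I/O of request_kb,
# given the stripe unit size and the number of member disks. Assumes
# the request is aligned to a stripe boundary.

import math

def disks_touched(request_kb, stripe_kb, members):
    return min(members, math.ceil(request_kb / stripe_kb))

# 8-Kbyte (Oracle-style) I/O on a four-drive RAID-0 group:
print(disks_touched(8, 2, 4))    # 2-Kbyte stripe: spread across all 4 drives
print(disks_touched(8, 64, 4))   # 64-Kbyte stripe: confined to 1 drive
```

A small stripe spreads each sequential request across every member; a large stripe keeps each random request on one drive so the members can service different requests in parallel.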

Each request to a member disk requires the generation of a SCSI command packet with all of its associated overhead. For small requests, the effort of transferring the data is trivial compared to the amount of work required to set up and process the I/O. Typically, the break-even point is a stripe size of 2 Kbytes to 4 Kbytes.


RAID-1
RAID-1 is disk mirroring. Every time data is written to disk, one or more copies of the data are written to alternate locations. This provides data redundancy in the event of the loss of a disk device. When a drive fails, the system ignores it, reading and writing data to its remaining partner in the mirror. Properly configured, a RAID-1 drive group can survive the loss of an entire I/O bus and even a storage array unit. A RAID-1 group running with one or more failed or missing drives is said to be running in degraded mode. Data availability is not interrupted, but the RAID group is now open to failure if one of the remaining partner drives fails. The failed drive needs to be replaced quickly to avoid further problems. When the failed drive is replaced, the data on the good drive must be copied to the new drive. This operation is known as synching the mirror, and it can take some time, especially if the affected drives are large. Again, data access is uninterrupted, but the I/O operations needed to copy the data to the new drive steal bandwidth from user data access, possibly decreasing overall performance.

By its nature, RAID-1 can double the amount of physical I/O in your system, especially in a write-intensive environment. For this reason, the use of multiple controllers is an absolute necessity for effective RAID-1 configurations. Some RAID-1 implementations provide the ability to read from either mirror, thus distributing the read workload for better performance. A common misconception is that writes to a mirror are half the speed of writes to an individual disk. This is rarely the case, because the two writes do not need to be serviced sequentially. Most RAID-1 implementations handle mirrors the same way they handle stripes: the writes are dispatched sequentially to both members, but the physical writes are serviced in parallel. As a result, typical writes to mirrors take only about 15 to 20 percent longer than writes to a single member. Unlike striping, where hardware RAID is preferred, good mirroring performance can be achieved using software RAID. This allows the host system to verify that both mirrors have been written, rather than relying on the RAID controller to inform the I/O subsystem in the host. The main drawback to RAID-1, besides the duplicate write overhead, is cost. All RAID-1 storage is duplicated, thus doubling the physical disk storage cost. Considering the costs of downtime, reconstruction of missing data, and extensive logging and backup for non-mirrored data, RAID-1 can easily pay for itself.

I/O Tuning
Copyright 2001 Sun Microsystems, Inc. All Rights Reserved. Enterprise Services, Revision C.1

7-47

RAID-0+1
Pure RAID-1 suffers from the same problems as pure RAID-0: Data is written sequentially across the volumes, potentially making one drive busy while the others on that side of the mirror are idle. You can avoid this problem by striping within your mirrored volumes, just like you striped the volumes of your RAID-0 group. Theoretically, RAID-1 mirrors physical devices on a one-for-one basis. In reality, you want to build big logical devices using RAID-0 groups and then mirror them for redundancy. This configuration is known as RAID 0+1 because it combines the concatenation features of RAID-0 with the mirroring features of RAID-1. Unlike RAID-0, RAID-1 and RAID 0+1 are often implemented in hardware using smart mirroring controllers. The controller manages multiple physical devices and presents a single logical device to the host system. The host system sends a single I/O request to the controller, which, in turn, creates appropriate requests to multiple devices to satisfy the operation.
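One way to picture the address math is a sketch that maps a logical block to a stripe member and offset, then duplicates the write on both mirror sides. The interlace size, member count, and side names here are hypothetical, not any particular volume manager's layout:

```python
# Sketch of RAID 0+1 addressing: a logical block is striped across the
# members of one side, and every physical write is duplicated on the
# other mirror side.  All numbering here is invented for illustration.

def raid01_targets(logical_block, members, interlace_blocks=16):
    stripe = logical_block // interlace_blocks            # stripe unit index
    member = stripe % members                             # member within a side
    offset = (stripe // members) * interlace_blocks + logical_block % interlace_blocks
    # The same member index and offset apply on both mirror sides.
    return [("side-A", member, offset), ("side-B", member, offset)]

print(raid01_targets(40, members=4))
```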


Higher-end mirrored disk subsystems also offer hot-swap drives and some level of drive management tools. Hot-swap is an important feature that enables you to remove and replace a failed drive without having to shut down the bus to which the drive is attached. RAID-0 and RAID-1 exist at different ends of the RAID spectrum. RAID-0 offers large volumes with no redundancy or failure immunity at the lowest possible cost. RAID-1 provides complete data redundancy and robust failure immunity at a high cost. Both configurations can be tuned for performance by adding controllers and using striping to distribute the I/O load across as many drives as possible. Sequential reads improve through striping, and random reads improve through the use of many member disks. Writes typically approach the speed of RAID-0 stripes, but because of the overhead of writing two copies for mirroring, RAID-0+1 is about 30 percent slower. RAID-0+1 takes the best of both configurations, providing large volumes, high reliability, and failure immunity. It does not, however, solve the price problem.


RAID-3
RAID-3 provides both the striping benefits of RAID-0 and the redundancy benefits of RAID-1 in less disk space than RAID-1. The idea behind RAID-3 is to keep one striped copy of the data, but with enough extra data (called parity data) to re-create the data in any stripe. This means that one physical volume could be lost, and the system would continue to operate by re-creating the lost data from the parity stripe (or by re-creating the lost parity from the data). The process is very similar to that of error checking and correction (ECC) used in main storage. RAID-3 adds an additional disk to the member group. The additional disk is used to store the parity data. The parity disk contains corresponding blocks that are the computed parity for the data blocks residing on the non-parity members of the RAID group. Parity computation is quite fast. For a typical 8-Kbyte file system I/O, parity generation takes much less than a millisecond.
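The parity described above is a bytewise XOR across the data members, which is what makes single-member reconstruction possible. A minimal sketch, using toy four-byte blocks in place of real stripe units:

```python
# Parity is the bytewise XOR of the corresponding data blocks, so any
# one lost block can be rebuilt from the survivors plus the parity.

from functools import reduce

def parity(blocks):
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

data = [b"AAAA", b"BBBB", b"CCCC"]       # three data members
p = parity(data)                         # the parity member

# Lose member 1 and rebuild it from the other members plus parity.
rebuilt = parity([data[0], data[2], p])
print(rebuilt)
```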


When servicing large quantities of single-stream sequential I/O, RAID-3 delivers efficiency and high performance at a relatively low cost. The problem with RAID-3 is that it does not work well for random access I/O, including most relational database systems and timesharing workloads. Random access suffers under RAID-3 because of the physical organization of a RAID-3 stripe. Every write to a RAID-3 group involves accessing the parity disk. RAID-3 uses only one parity disk, which becomes the bottleneck for the entire group. In most implementations, reads do not require access to the parity data unless there is an error reading from a member disk. Thus, from a performance standpoint, RAID-3 reads are equivalent to RAID-0 levels. In practice, few vendors actually implement RAID-3 because RAID-5 addresses the shortcomings of RAID-3 while retaining its strengths.


RAID-5
The primary weakness of RAID-3 is that writes overuse the parity disk. RAID-5 overcomes this problem by distributing the parity across all of the member disks. All member disks contain some data and some parity. Whereas RAID-3 performs best when accessing large files sequentially, RAID-5 is optimized for random access I/O patterns. A simple write operation results in two reads and two writes, a fourfold increase in the I/O load. These multiple writes for a single I/O operation introduce data integrity issues. To maintain data integrity, most RAID software or firmware uses a commit protocol similar to that used by database systems:
- Member disks are read in parallel for parity computation.
- The new parity block is computed.
- The modified data and new parity are written to a log.
- The modified data and parity are written to the member disks in parallel.
- The parts of the log associated with the write operation are removed.
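For a small write that touches one data block, the new parity can be computed from just the old data and the old parity, which is where the extra reads come from. A sketch with XOR on toy four-byte blocks:

```python
# The RAID-5 small-write shortcut: read the old data and old parity,
# compute new parity = old parity ^ old data ^ new data, then write the
# new data and new parity.  Blocks here are toy stand-ins.

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def raid5_small_write(old_data, old_parity, new_data):
    new_parity = xor(xor(old_parity, old_data), new_data)
    return new_data, new_parity          # the two blocks to rewrite

d = [b"1111", b"2222", b"3333"]          # one stripe of a 3+1 group
p = xor(xor(d[0], d[1]), d[2])           # full-stripe parity
new_d1, new_p = raid5_small_write(d[1], p, b"9999")

# The shortcut parity matches a full-stripe recomputation.
print(new_p == xor(xor(d[0], new_d1), d[2]))
```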

Many sites use RAID-5 in preference to RAID-1 because it does not require the hardware cost of fully duplicated disks. However, RAID-5 is not suited for certain types of applications. The cost of a simple write operation involves a read to obtain parity data, the parity computation, two writes to manage the two-phase commit log, one write for the parity disk, and one or two writes to the data disks. All told, this represents a six-fold increase in actual I/O load and typically about a 60 percent degradation in actual I/O speed. Nonvolatile memory is frequently used to accelerate the write operations. Buffering the modied data and parity blocks is faster than writing them to physical disks. Using nonvolatile memory also allows write cancellation.


Tuning the I/O Subsystem


The best way to tune the I/O subsystem is to cut the number of I/O operations that are done. This can be done in several ways: requesting larger blocks (transfer time is a small part of the overall cost of an I/O operation), tuning main memory properly, tuning the UFS caches appropriately, and using write cancellation. Remember, the I/O subsystem pays the price for memory problems because problems with main memory (improper cache management) generate more I/O operations due to the need to move data in and out of storage. Do not spend a lot of time tuning the I/O subsystem if you have significant memory performance problems. Spread the I/O activity across as many devices and buses as possible to try to avoid hotspots or overloaded components. The iostat command measures how well disk loads are spread across each disk on a server system. sar -d captures information about partition usage.


Determine whether your system is performing properly. You might spend time trying to tune poor I/O performance only to find that it is performing as well as it can. Calculate what your response times should be to see how close you are to achieving them. Determine what your critical data is and place it appropriately in your I/O subsystem. Take into account bus speeds; SCSI targets; and device speeds and capabilities, such as tagged queueing, tape drive placement, and data placement on the drive itself. Put your important data where you can get at it quickly. Remember, as you eliminate CPU and memory bottlenecks, you might see new I/O bottlenecks.


Available Tuning Data


There are a number of sources for data on the performance and configuration of your I/O subsystem. The utilities listed in the overhead should provide sufficient data to allow for a reasonably precise view of I/O performance.


Available I/O Statistics


The overhead shows some of the most important I/O subsystem data available from sar and iostat. The data, while coming from the same origin in the kernel, might be reported differently by the tools. Be sure you take this into account when comparing data from the different tools.

# sar -d
08:58:48   device    %busy   avque   r+w/s   blks/s   avwait   avserv
08:58:50   fd0           0     0.0       0        0      0.0      0.0
           nfs1          0     0.0       0        0      0.0      0.0
           nfs2          0     0.0       0        7      0.5     15.5
           nfs3          0     0.0       0       18      0.0      7.6
           sd0           6     0.1       4       97      0.0     32.1
           sd0,a         2     0.1       1       21      0.0     46.4
           sd0,b         4     0.1       3       76      0.0     26.7
           sd0,c         0     0.0       0        0      0.0      0.0
           sd0,d         0     0.0       0        0      0.0      0.0


Some fields in this output are:

- device – The name of the disk device being monitored.
- %busy – The percentage of time the device spent servicing a transfer request.
- r+w/s – The number of read and write transfers to the device per second.
- blks/s – The number of 512-byte blocks transferred to or from the device per second.
- avwait – The average time, in milliseconds, that transfer requests wait idly in the queue (measured only when the queue is occupied).
- avserv – The average time, in milliseconds, for a transfer request to be completed by the device (for disks, this includes seek, rotational latency, and data transfer times).
- avque – The average number of outstanding requests in the sample time period.

# sar -u
00:00:00    %usr    %sys    %wio   %idle
08:00:00       0       0       0      99
08:20:02       3       2      13      82
08:40:02       3       2      12      83
09:00:02       1       1       5      93
09:20:02       4       2       9      85

Average        1       0       1      97
#

The I/O statistic shown in this output is %wio, which is the percentage of time the processor is idle and waiting for I/O completion.


# iostat -x
                         extended device statistics
device     r/s   w/s   kr/s   kw/s  wait  actv  svc_t  %w  %b
fd0        0.0   0.0    0.0    0.0   0.0   0.0    0.0   0   0
sd0        0.6   0.3    4.7    5.0   0.0   0.0   49.3   0   1
sd6        0.0   0.0    0.0    0.0   0.0   0.0    0.0   0   0
nfs1       0.0   0.0    0.0    0.0   0.0   0.0    0.0   0   0
nfs2       0.0   0.1    0.4    2.9   0.0   0.0   25.3   0   0
nfs3       0.1   0.0    1.2    0.0   0.0   0.0    7.6   0   0
nfs545     0.0   0.0    0.0    0.0   0.0   0.0    0.0   0   0
#

This output contains the following fields:

- r/s – The number of reads/sec.
- w/s – The number of writes/sec.
- kr/s – The number of Kbytes read/sec.
- kw/s – The number of Kbytes written/sec.
- svc_t – The average I/O service time. Normally, this is the time taken to process the request when it has reached the front of the queue; however, in this case, it is actually the response time of the disk.
- %w – The percentage of time that the queue is not empty; that is, the percentage of time that transactions wait for service.
- %b – The percentage of time the disk is busy; that is, transactions are in progress and commands are active on the drive.
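A quick way to work with these columns in scripts is to split each data line and pair the values with the field names. This little parser is a sketch keyed to the sample output shown in this module; other releases may add columns:

```python
# Parse one iostat -x data line into a field dictionary.  The column
# order below matches this module's sample output (an assumption, since
# different Solaris releases can emit different column sets).

FIELDS = ["r/s", "w/s", "kr/s", "kw/s", "wait", "actv", "svc_t", "%w", "%b"]

def parse_iostat_line(line):
    parts = line.split()
    device, values = parts[0], parts[1:]
    return device, dict(zip(FIELDS, map(float, values)))

dev, stats = parse_iostat_line("sd0  0.6  0.3  4.7  5.0  0.0  0.0  49.3  0  1")
print(dev, stats["svc_t"], stats["%b"])
```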


Available I/O Data


The overhead shows the definition of the kernel fields reported in the tuning reports.


iostat Service Time
The iostat service time is not what it appears to be. It is the sum of the time taken by a number of items, only some of which are due to disk activity. The iostat service time is measured from the moment that the device driver receives the request until the moment it returns the information. This includes driver queueing time as well as device activity time (true service time). If the drive does not support tagged queuing, multiple requests are not reordered, and each request is dealt with serially. Time spent by one request waiting for any previous requests to complete affects its reported service time. For example, suppose an application sent five requests almost simultaneously to a disk. Service times and response times might be:

Request Number   Service Time (Each Request)   Response Time
      1                   29.65 ms                29.65 ms
      2                   15.67 ms                45.32 ms
      3                   13.81 ms                59.13 ms
      4                   15.05 ms                74.18 ms
      5                   14.52 ms                88.70 ms

You can see a range of reasonable service times, as seen by the device driver. As the reported service times gradually increase, later requests must wait for the earlier ones to complete. The average iostat service time is the sum of all response times divided by five: (29.65 + 45.32 + 59.13 + 74.18 + 88.70) / 5 = 59.40 ms. The iostat report includes in its measure the time Request 5 spent waiting for Requests 1 to 4 to clear. In this particular scenario, it is possible to get from the reported figure back to an average service time for each request using the following formula: Mean true service time is 2 x (reported service time) / (n + 1), where n is the number of simultaneous requests. Thus, the mean service time in the example is 2 x 59.40 / 6 = 19.80 ms.
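The arithmetic above can be checked directly: responses are the running sum of service times, the reported value is their mean, and 2 x reported / (n + 1) recovers an approximate per-request service time:

```python
# Response times are the running sum of the service times; iostat
# reports the mean response time; for n nearly simultaneous requests,
# 2 * reported / (n + 1) estimates the per-request service time.  The
# estimate is only exact when all the service times are equal.

service = [29.65, 15.67, 13.81, 15.05, 14.52]

response, elapsed = [], 0.0
for s in service:
    elapsed += s
    response.append(elapsed)       # 29.65, 45.32, 59.13, 74.18, 88.70

reported = sum(response) / len(response)
estimated = 2 * reported / (len(service) + 1)
print(round(reported, 2), round(estimated, 2))
```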


Disk Monitoring Using SE Toolkit Programs
The siostat.se Program
The siostat.se program is a version of iostat that shows the real definition of service times, which is the time it takes for the first command in line to be processed.
# /opt/RICHPse/bin/se /opt/RICHPse/examples/siostat.se
16:37:21   ------throughput------   -----wait queue-----    ----active queue----
disk        r/s  w/s  Kr/s  Kw/s    qlen  res_t svc_t %ut   qlen  res_t svc_t %ut
c0t6d0      0.0  0.0   0.0   0.0    0.00   0.00  0.00   0   0.00   0.00  0.00   0
c0t0d0      7.0  0.0  49.0   0.0    0.00   0.00  0.00   0   0.05   7.49  7.49   5
c0t0d0s0    7.0  0.0  49.0   0.0    0.00   0.00  0.00   0   0.05   7.48  7.48   5
c0t0d0s1    0.0  0.0   0.0   0.0    0.00   0.00  0.00   0   0.00   0.00  0.00   0
c0t0d0s2    0.0  0.0   0.0   0.0    0.00   0.00  0.00   0   0.00   0.00  0.00   0
c0t0d0s3    0.0  0.0   0.0   0.0    0.00   0.00  0.00   0   0.00   0.00  0.00   0
fd0         0.0  0.0   0.0   0.0    0.00   0.00  0.00   0   0.00   0.00  0.00   0
^C#

The xio.se Program


The xio.se program is a modified xiostat.se that tries to determine whether accesses are mostly random or sequential by looking at the transfer size and typical disk access time values. Based on the expected service time for a single random access read of the average size, results above 100 percent indicate random access, and those under 100 percent indicate sequential access.
# /opt/RICHPse/bin/se /opt/RICHPse/examples/xio.se
                           extended disk statistics
disk        r/s  w/s  Kr/s  Kw/s  wtq  actq  wtres  actres  %w  %b  svc_t  %rndsvc
c0t6d0      0.0  0.0   0.0   0.0  0.0   0.0    0.0     0.0   0   0    0.0      0.0
c0t0d0      9.0  7.0  72.0  56.0  0.0   0.4    0.0    25.5   0  16   10.1     65.4
c0t0d0s0    4.0  0.0  32.0   0.0  0.0   0.0    0.0     7.5   0   3    7.5     48.7
c0t0d0s1    5.0  7.0  40.0  56.0  0.0   0.4    0.0    31.4   0  13   10.9     70.7
c0t0d0s2    0.0  0.0   0.0   0.0  0.0   0.0    0.0     0.0   0   0    0.0      0.0
c0t0d0s3    0.0  0.0   0.0   0.0  0.0   0.0    0.0     0.0   0   0    0.0      0.0
fd0         0.0  0.0   0.0   0.0  0.0   0.0    0.0     0.0   0   0    0.0      0.0
^C#


The xiostat.se Program
The xiostat.se program is a basic iostat -x clone that prints the name of each disk in a usable form.

# /opt/RICHPse/bin/se /opt/RICHPse/examples/xiostat.se
                        extended disk statistics
disk        r/s  w/s  Kr/s  Kw/s  wait  actv  svc_t  %w  %b
c0t6d0      0.0  0.0   0.0   0.0   0.0   0.0    0.0   0   0
c0t0d0      4.0  3.0  32.0  32.0   0.0   0.1   13.9   0   6
c0t0d0s0    4.0  0.0  32.0   0.0   0.0   0.0    7.8   0   3
c0t0d0s1    0.0  3.0   0.0  32.0   0.0   0.1   21.9   0   3
c0t0d0s2    0.0  0.0   0.0   0.0   0.0   0.0    0.0   0   0
c0t0d0s3    0.0  0.0   0.0   0.0   0.0   0.0    0.0   0   0
fd0         0.0  0.0   0.0   0.0   0.0   0.0    0.0   0   0
^C#

The iomonitor.se Program


The iomonitor.se program is like iostat -x, but it does not print anything unless it sees a slow disk. An additional column headed delay prints the total number of I/Os multiplied by the service time for the interval. This is an aggregate measure of the amount of time spent waiting for I/Os to the disk on a per-second basis. High values could be caused by a few very slow I/Os or many fast ones.

The iost.se Program


The iost.se program is iostat -x output that does not show inactive disks. The default is to skip disks that are 0.0 percent busy.

# /opt/RICHPse/bin/se /opt/RICHPse/examples/iost.se
                        extended disk statistics
disk        r/s  w/s  Kr/s  Kw/s  wait  actv  svc_t  %w  %b
c0t0d0      0.0  1.0   0.0   8.0   0.0   0.0   20.1   0   2
c0t0d0s0    4.0  1.0  32.0   8.0   0.0   0.0    8.2   0   4
TOTAL       4.0  2.0  32.0  16.0
^C#


The xit.se Program
The xit.se program is the extended iostat tool. It produces the same output as running iostat -x in a cmdtool (Figure 7-2). It uses a sliding bar to interactively set the update rate and to start, stop, or clear the output.

Figure 7-2   Output From the xit.se Program

The disks.se Program


The disks.se program prints out the disk instance to controller/target mappings and partition usage.

# /opt/RICHPse/bin/se /opt/RICHPse/examples/disks.se
se thinks MAX_DISK is 8
kernel -> path_to_inst -> /dev/dsk   part_count [fstype mount]
sd6    -> sd6    -> c0t6d0     0
sd0    -> sd0    -> c0t0d0     3   c0t0d0s0 ufs /
                                   c0t0d0s1 swap swap
                                   c0t0d0s3 ufs /cache
sd0,a  -> sd0,a  -> c0t0d0s0   0
sd0,b  -> sd0,b  -> c0t0d0s1   0
sd0,c  -> sd0,c  -> c0t0d0s2   0
sd0,d  -> sd0,d  -> c0t0d0s3   0
fd0    -> fd0    -> fd0        0
#


Check Your Progress
Before continuing on to the next module, check that you are able to accomplish or answer the following:

- Demonstrate how I/O buses operate
- Describe the configuration and operation of a SCSI bus
- Explain the configuration and operation of a Fibre Channel bus
- Demonstrate how disk and tape operations proceed
- Explain how a storage array works
- List the RAID levels and their uses
- Describe the available I/O tuning data


Think Beyond
Some questions to ponder are:
- Why are there only eight LUNs on a SCSI target?
- What would happen if the SCSI initiator was not the highest priority device on the bus? How could you tell?
- What devices besides a disk drive might tagged queueing benefit?
- What special considerations are there in looking at tuning reports from a storage array?


I/O Tuning Lab


Objectives
Upon completion of this lab, you should be able to:
- Identify the properties of small computer system interface (SCSI) devices on a system
- Calculate the time an I/O operation should take
- Compare calculated I/O times with actual times


Disk Device Characteristics
Below is example output from the prtvtoc command:

# prtvtoc /dev/dsk/c0t2d0s0
* /dev/dsk/c0t2d0s0 partition map
*
* Dimensions:
*     512 bytes/sector
*     159 sectors/track
*       4 tracks/cylinder
*     636 sectors/cylinder
*    6485 cylinders
*    6483 accessible cylinders
...

# prtvtoc /dev/dsk/c0t3d0s0
* /dev/dsk/c0t3d0s0 partition map
*
* Dimensions:
*     512 bytes/sector
*      80 sectors/track
*      19 tracks/cylinder
*    1520 sectors/cylinder
*    3500 cylinders
*    2733 accessible cylinders
...


Given the information, try to answer these questions:

1. Assume that both disks write data at the same speed and the heads are positioned over the appropriate starting track for a write operation. If 2 Mbytes of data is transferred to each disk, and each disk has a seek time of 4 ms and a rotation speed of 10,000 RPM, provide answers to each of these questions:

   - Which disk would take less time to write the data?
   - Why?
   - What would be the time difference (in percentage terms)?
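Under a simplified model (one average seek, half a rotation of latency, then sequential transfer of one full track per revolution, with head-switch and cylinder-change time ignored), the comparison can be sketched as follows. The model itself is an assumption for illustration, not the lab's answer key:

```python
# Question-1 arithmetic under a simplified disk model: seek, then half a
# rotation of latency, then one track transferred per revolution.
# Head-switch and cylinder-change times are deliberately ignored.

def write_time_ms(mbytes, sectors_per_track, seek_ms=4.0, rpm=10_000):
    rev_ms = 60_000.0 / rpm                        # 6 ms per revolution
    track_bytes = sectors_per_track * 512
    revolutions = (mbytes * 1024 * 1024) / track_bytes
    return seek_ms + rev_ms / 2 + revolutions * rev_ms

t1 = write_time_ms(2, 159)        # the 159 sectors/track disk
t2 = write_time_ms(2, 80)         # the 80 sectors/track disk
print(round(t1, 1), round(t2, 1), round((t2 - t1) / t1 * 100), "%")
```

With these numbers the denser disk finishes in roughly half the time, because it moves about twice as many bytes past the head per revolution.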

Use the following steps to determine I/O subsystem characteristics:

2. Locate your boot drive and CD-ROM drive. Use prtconf -v to identify the following characteristics for each (if they are SCSI devices). For details, see Table L7-1 on page L7-4 and the example output.

   Target            ____________
   Device type       ____________
   Speed             ____________
   Tagged queueing?  ____________
   Wide SCSI?        ____________

   Target            ____________
   Device type       ____________
   Speed             ____________
   Tagged queueing?  ____________
   Wide SCSI?        ____________


Table L7-1 shows the speeds reported by prtconf and the translated bus speed.

Table L7-1   prtconf Bus Speeds

prtconf Speed    Bus Speed
9c40             40 Mbytes/sec
4e20             20 Mbytes/sec
2710             10 Mbytes/sec
1388              5 Mbytes/sec
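The table values look like the bus speed in Kbytes/sec encoded in hexadecimal (0x9c40 = 40000). That reading is inferred from the table itself, not from any documented prtconf output format, so treat the conversion below as a convenience sketch:

```python
# Convert a prtconf speed code to Mbytes/sec, assuming (from Table L7-1)
# that the code is the speed in Kbytes/sec written in hexadecimal.

def bus_speed_mbytes(prtconf_hex):
    return int(prtconf_hex, 16) / 1000

for code in ("9c40", "4e20", "2710", "1388"):
    print(code, "->", bus_speed_mbytes(code), "Mbytes/sec")
```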

If tagged queuing (TQ) is enabled, the length value associated with TQ will be a number greater than zero (0). In this example, tagged queuing is not enabled or available on this drive: name <target3-TQ> length <0> -- <no value>.

3. In a terminal window, start a find process in the background, which searches all directories on the system, using the following command:

   # ./finder8 &

4. In the same window, start the iostat command, and observe the activity per disk slice:

   # iostat -xp 3 15

   Why is the value in the %b column not showing 100 percent?


System Behavior Characteristics
The question relates to system characteristics. See if you can determine the characteristics (with minimal background information) of three different system types.

5. Match the best configuration to the application type, based on the details shown below. Application types:

   - Web hosting server (multiple Web sites)
   - Computer-aided design (CAD)/computer-aided manufacturing (CAM) workstation
   - Database server

System 1 has high network activity and high numbers of disk writes. The data read from the disk and written to the disk is not sequential, and the data throughput value varies with each read or write.

System 2 has high central processing unit (CPU) utilization with very little disk I/O. When disk I/O does take place, it is usually sequential access (read or write) with large amounts of data throughput.

System 3 has high memory utilization with low disk-access values. When the disk is accessed, the data is mostly disk-read and sequential in nature. This system has a large amount of random access memory (RAM) that is in use by a low number of processes.


Match the system to the type (tick the appropriate box in Table L7-2):

Table L7-2   System Types

            Web Server    CAD/CAM    Database
System 1
System 2
System 3


Reading iostat Report Output
6. The lab7 lab directory includes a file called iostat_example.txt. The file contains partial output from iostat, showing active metadevices. Which of the metadevices provides:

   - The best service time?
   - The worst service time?
   - The best average write time?
   - The best average read time?

   You should use the following formulas:

   Average write service time: kw/s / (kw/s + kr/s) * svc_t
   Average read service time:  kr/s / (kw/s + kr/s) * svc_t
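The question-6 formulas apportion the measured svc_t between reads and writes by each one's share of the data moved. Transcribed directly, with made-up kr/s, kw/s, and svc_t inputs:

```python
# Split a device's svc_t into average read and write service times using
# the lab's formulas.  The sample inputs below are illustrative numbers,
# not values from iostat_example.txt.

def split_service_time(kr_s, kw_s, svc_t):
    total = kr_s + kw_s
    avg_read = kr_s / total * svc_t       # kr/s / (kw/s + kr/s) * svc_t
    avg_write = kw_s / total * svc_t      # kw/s / (kw/s + kr/s) * svc_t
    return avg_read, avg_write

r, w = split_service_time(kr_s=32.0, kw_s=8.0, svc_t=20.0)
print(r, w)
```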


UFS Tuning
Objectives
Upon completion of this module, you should be able to:
- Describe the vnode interface layer to the file system
- Explain the layout of a UNIX file system (UFS) partition
- Describe the contents of an inode and a directory entry
- Describe the function of the superblock and cylinder group blocks
- Understand how allocation is performed in the UFS
- List three differences between the UFS and Veritas File Systems
- List the UFS-related caches
- Tune the UFS caches
- Describe how to use the UFS tuning parameters


Relevance


Present the following questions to stimulate the students and get them thinking about the issues and topics presented in this module. While they are not expected to know the answers to these questions, the answers should be of interest to them and inspire them to learn the content presented in this module.

Discussion – The following questions are relevant to understanding the content of this module:

- Why would you want to tune a UFS partition?
- What might be tunable in a partition?
- Is it more important to tune the partition or the drive?
- Does UFS satisfy all types of file storage requirements?
- Does the use of storage arrays change the way a partition should be tuned?

Additional Resources
Additional resources – The following references can provide additional details on the topics discussed in this module:

- M. McKusick, W. Joy, S. Leffler, and R. Fabry, "A Fast File System for UNIX," ACM Transactions on Computer Systems, 2(3), August 1984.
- Cockcroft, Adrian, and Richard Pettit. Sun Performance and Tuning. Palo Alto: Sun Microsystems Press/Prentice Hall, 1998.
- Wong, Brian L. Configuration and Capacity Planning for Solaris Servers. Palo Alto: Sun Microsystems Press/Prentice Hall, 1997.
- Graham, John. Solaris 2.x Internals. McGraw-Hill, 1995.
- Mauro and McDougall. Solaris Architecture and Internals. Sun Microsystems Press/Prentice Hall, 1999.
- Man pages for the various commands (fstyp, mkfs, newfs, tunefs) and performance tools (sar, vmstat, iostat, netstat).
- White papers found at http://www.veritas.com.


The vnode Interface


One of the major design paradigms of the UNIX operating system is that everything looks like a file. In SunOS, this is true even in most of the kernel. The SunOS operating system has the ability to access SunOS file systems on local disks, remote file systems using NFS, and many other types of file systems, such as the High Sierra File System and MS-DOS file systems. This is accomplished using vnodes. The vnode provides a generic file interface at the kernel level, so the calling component does not know the type of file system with which it is dealing. Because it is this transparent in the kernel, it is also transparent to the user.


Point out that the Solaris 8 OE also supports the Universal Disk Format (UDF) file system for use with DVD and CD-ROM media. (UDF is distinct from the ISO 9660 format that the High Sierra File System supports.)


The vnode interface:

- Provides an architecture that multiple file system implementations can use within SunOS
- Separates the file system dependent functionality from the file system independent functionality
- Specifies a well-defined interface for the file system dependent and file system independent interface layers

The different file systems are manipulated through the vfs (virtual file system) structure, which presents the mounted file system list as one logical directory hierarchy. A vfs is like a vnode for a partition. All files, directories, named pipes, device files, doors, and so on are manipulated using the vnode structure.
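The dispatch idea behind the vnode can be illustrated with a toy sketch: callers invoke generic operations, and a per-file-system operations table supplies the implementation. In the kernel this is a C structure of function pointers; the names below are invented for the sketch:

```python
# Toy illustration of vnode-style dispatch: the caller uses a generic
# operation, and the per-file-system ops table picks the implementation.
# ufs_ops/nfs_ops and vop_read are invented names for this sketch.

ufs_ops = {"read": lambda v: "ufs read of " + v["name"]}
nfs_ops = {"read": lambda v: "nfs read of " + v["name"] + " over the wire"}

def vop_read(vnode):
    # File-system-independent layer: dispatch through the ops table
    # without knowing which file system implements the call.
    return vnode["ops"]["read"](vnode)

local = {"name": "/etc/passwd", "ops": ufs_ops}
remote = {"name": "/home/user/notes", "ops": nfs_ops}
print(vop_read(local))
print(vop_read(remote))
```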


The Local File System


The SunOS UFS file system is implemented as a derivative of the Berkeley BSD Fast File System. It has the following characteristics:

- A disk drive is divided into one or more slices or partitions. Each slice can contain one and only one file system. Logical volumes can also make use of UFS.
- Files are accessed in fixed-length blocks on the disk, almost invariably 8 Kbytes long. While 4-Kbyte blocks are an option, they are not supported on UltraSPARC systems.
- Information about a file is kept in an inode. A directory references an inode for each named object on disk.

From the user level, any file is viewed as a stream of bytes. The underlying hardware manages the file data as blocks of bytes. The minimum size the hardware can write or read from disk is a sector of 512 bytes.


Fast File System Layout


The diagram in the overhead shows the structure in which mkfs (called by newfs) lays out the file system on disk.

Disk Label
The disk label is the first sector of a partition. It is only present for the first partition on a drive. It contains the drive ID string, a description of the drive geometry, and the partition table, which describes the start and extent of the drive's partitions or slices.

Boot Block
The remainder of the first 8-Kbyte block in the partition contains the boot block if this is a bootable partition. Otherwise, the space is unused.


Superblock
The superblock is the second 8-Kbyte block in the partition, starting at sector 16. It contains two types of data about the partition: static, descriptive data and dynamic, allocation status data. It describes the current configuration and state of the partition to UFS.

Cylinder Groups
The remainder of the disk slice is divided into cylinder groups. A cylinder group (cg) has a default size of 16 cylinders regardless of the disk geometry. The number of cylinders assigned to a cylinder group will be automatically scaled according to the size of the disk slice in which the file system is being created. It is possible to set the cylinder group size using the newfs -c option. Each cylinder group contains four components: backup superblock, cylinder group block, inode table, and data blocks.

Backup Superblock
Each cylinder group contains a backup superblock, usually as the first 8-Kbyte block in the cg. The backup superblock is written when the partition is created and is never written again. It contains the same static descriptive data that the primary superblock does, which is used to rebuild the primary superblock if it becomes damaged. The first backup superblock starts at Sector 32.

Cylinder Group Block


The cylinder group block (cgb) is the 8-Kbyte block immediately following the backup superblock. It contains bit maps and summaries of the available and allocated resources, such as inodes and data blocks, in the cylinder group. Each cylinder group's summary information (16 bytes) is also kept in the superblock for use during allocation.


The inode Table
The next part of the cylinder group contains a share of the partition's inodes, just after the cgb. The total number of inodes in the partition is calculated based on the newfs -i operand and the size of the partition. The resulting total number of inodes is then divided by the number of cylinder groups, and that number of 128-byte inodes is then created in each cylinder group. A bit map in the cgb tracks which inodes are allocated and which are free.
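That sizing rule can be written out as simple integer arithmetic. The slice size, bytes-per-inode (newfs -i) value, and cylinder-group count below are illustrative numbers, not defaults quoted from this module:

```python
# The inode-table sizing described above: the partition size divided by
# the newfs -i bytes-per-inode value gives the total inode count, which
# is then shared evenly among the cylinder groups.

def inodes_per_cylinder_group(partition_bytes, nbpi, cylinder_groups):
    total_inodes = partition_bytes // nbpi        # scaled by newfs -i
    return total_inodes // cylinder_groups        # share given to each cg

# Example: a 2-Gbyte slice, 2048 bytes per inode, 100 cylinder groups.
print(inodes_per_cylinder_group(2 * 1024**3, 2048, 100))
```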

Data Blocks
The remainder of the partition is divided into data blocks. Their availability is also tracked in bit maps in the cgb.

Viewing UFS Settings


Current partition configuration settings can be determined by using fstyp -v. This command reads and displays the contents of the superblock in addition to displaying per-cylinder-group information. The output can be extensive, so it is recommended that you pipe it through more.

Solaris System Performance Management

The Directory
The directory is the fundamental data structure used to locate files in a partition. Each directory is itself described by an inode, and its entries point to other inodes. Directories are allocated in 512-byte (one sector) blocks. Each directory entry consists of four fields: the first is the inode number of the file the entry represents, the second is the total length of the entry, the third is the length of the file name, and the fourth is the variable-length file name itself. The directory structure is described in /usr/include/sys/dirent.h. Any holes in the directory caused by deletion, and the leftover space at the end of a directory block, are accounted for by having the previous entry's record length include the unused bytes. Deleted entries at the beginning of a 512-byte block are recognized by an inode number of 0. New entries are allocated on a first-fit basis using the holes in the directory structure.


Directory Name Lookup Cache (DNLC)


Searching for a name in a directory is expensive: many small blocks might need to be read, and each must be searched. The DNLC is used by both UFS and NFS. To save time, each used directory entry is cached in the DNLC. Prior to the Solaris 7 Operating Environment (OE), a directory component name (between the slashes) had to be less than 32 bytes long to be cached (the actual field is 32 characters long, but the last character is a termination character). If a name was too long to cache, the system had to go to disk every time that name was requested. The Solaris 8 OE has no length restriction. The ncsize parameter determines the size of the DNLC. It is scaled from the maxusers parameter and should be the same size as the UFS inode cache described later in this module.


The DNLC caches the most recently referenced directory entry names and the associated vnode for each directory entry. Names are searched for in the cache by using a hashing algorithm. The hashing algorithm uses three values: the entry name, the length of the name, and a pointer to the vnode of the parent directory. If the requested entry is in the cache, the vnode is returned. The number of name lookups per second is reported as namei/s by the sar -a command.

# sar -a
SunOS earth 5.8 Generic sun4m    11/03/00

16:00:01    iget/s  namei/s  dirbk/s
16:20:00         0        4        3
16:40:00         0        3        2

Average          0        3        2
#

One field in this output is namei/s, which represents the number of file system path searches per second. If namei does not find a directory name in the DNLC, it calls iget to get the inode for either a file or a directory. Most iget calls are the result of DNLC misses. If files are being accessed on a random basis, anywhere within a file system, the efficiency of the cache is minimal. This is especially true in the case of the find command, which searches through the entire directory hierarchy, as shown below:

# sar -a
SunOS earth 5.8 Generic sun4m    11/03/00

16:00:01    iget/s  namei/s  dirbk/s
16:20:00         0        4        3
16:40:00         0        3        2
17:00:01        57       78       64

Average         19       28       23
#


The inode Cache


The inode cache and the DNLC are closely related. For every entry in the DNLC, there is an entry in the inode cache. This allows the quick location of the inode entry, without needing to read the disk. As with almost any cache, the DNLC and inode caches are managed on a least recently used (LRU) basis; new entries replace the old. But, just as in main storage, entries that are not currently being used are left in the cache on the chance that they might be used again. If the cache is too small, it will grow to hold all of the open entries. Otherwise, closed (unused) entries remain in the cache until they are replaced. This is beneficial, because it allows any pages from the file to remain available in memory. This allows files to be opened and read many times with no disk I/O occurring. The ufs_ninode parameter determines the size of the inode cache. It is scaled from maxusers and should be the same size as the DNLC.


You can see inode cache statistics by using netstat -k to display the raw kernel statistics information.
# netstat -k
kstat_types:
raw 0 name=value 1 interrupt 2 i/o 3 event_timer 4
segmap:
fault 197134 faulta 0 getmap 144584 get_use 2 get_reclaim 135706
get_reuse 6828 ...
...
inode_cache:
size 11372 maxsize 10492 hits 7303 misses 81025 kmem allocs 12843
kmem frees 0 maxsize reached 12852 puts at frontlist 52775
puts at backlist 2853 queues to free 0 scans 533423
thread idles 52441 lookup idles 0 vget idles 0 cache allocs 81025
cache frees 70145 pushes at close 0
...
semaphore:
mem_inuse 0 mem_import 0 mem_total 60 vmem_source 0 alloc 0 free 0
wait 0 fail 0 lookup 0 search 0 populate_wait 0 populate_fail 0
contains 0 contains_search 0
#


DNLC and inode Cache Management


The DNLC and inode caches should be the same size because each DNLC entry has an inode cache entry. The default sizes for both caches are 4 * (maxusers + max_nprocs) + 320 entries. Cache sizes as large as 80,000 entries or more are not unreasonable for NFS servers. Entries removed from the inode cache are also removed from the DNLC. Because the inode cache holds only UFS data (inodes) while the DNLC caches both NFS (rnodes) and UFS (inodes) data, the ratio of NFS to UFS activity should be considered when modifying ncsize. vmstat -s displays the DNLC hit rate and the number of lookups that failed because the name was too long. The hit rate should be 90 percent or better. The sar options used with the DNLC and inode caches are described near the end of this module.
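That default sizing rule can be sketched as follows. The max_nprocs default of 10 + 16 * maxusers is drawn from the standard Solaris tunables, not from this module, so treat it as an assumption; a maxusers value of 149 reproduces the proc-sz limit (2394) and inod-sz limit (10492) that appear in the sar -v output later in this module.

```python
# Sketch of the default sizing rule quoted above:
#   ncsize = ufs_ninode = 4 * (maxusers + max_nprocs) + 320
# Assumption: max_nprocs defaults to 10 + 16 * maxusers.
def default_cache_size(maxusers):
    max_nprocs = 10 + 16 * maxusers
    return 4 * (maxusers + max_nprocs) + 320

# maxusers of 149 gives max_nprocs 2394 and a cache size of 10492,
# matching the sar -v figures shown later in this module.
print(default_cache_size(149))  # 10492
```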


# vmstat -s
        0 swap ins
        0 swap outs
        0 pages swapped in
        0 pages swapped out
    89870 total address trans. faults taken
    12136 page ins
      110 page outs
    20769 pages paged in
     1137 pages paged out
    23391 total reclaims
    23058 reclaims from free list
        0 micro (hat) faults
    89870 minor (as) faults
    11399 major faults
    16402 copy-on-write faults
    25861 zero fill page faults
        0 pages examined by the clock daemon
        0 revolutions of the clock hand
      798 pages freed by the clock daemon
      648 forks
       20 vforks
      647 execs
   399288 cpu context switches
  1076485 device interrupts
   127026 traps
  2151736 system calls
   184851 total name lookups (cache hits 57%)
    16941 user   cpu
    12771 system cpu
   400087 idle   cpu
    10972 wait   cpu
#

The reason for the poor cache hit percentage in the preceding example is that the output was produced immediately after a find command had searched the entire directory hierarchy. List the processes running on the system before producing the vmstat output to ensure that no processes are adversely affecting the cache hit values.


The inode Structure


The diagram in the overhead image shows most of the contents of the on-disk inode structure. The file size is an 8-byte integer (long) field. Since the release of the SunOS 5.6 operating system, the largest file size supported by UFS is 1 Tbyte. To store more than 2 Gbytes in an individual file, programming changes are needed in most applications. (See the man page for largefiles(5) for more information.) Because UFS supports sparse, or "holey," files, where disk space has not been allocated for the entire file, the inode stores the number of sectors actually used to hold file data as well as the file size. The actual amount of disk space used by a file can be seen with the ls -s command. The shadow inode pointer is used with access control lists (ACLs).


inode Timestamp Updates

There are three timestamp fields in the inode:

- File access time (atime): Every time a file is accessed, whether for read or write, the access time stamp is updated. Normally, this time stamp is used by backup programs to ensure that files that have been accessed are backed up to tape. For some applications, such as Web and database servers, this access time stamp may not be of any specific use. To turn off access time updates in the Solaris 8 OE, mount the partition with the noatime option. No access time updates will be made for any file in the partition. See the man page for mount_ufs(1M) for more details.

- File modification time (mtime): Every time the file is modified, the file modification time stamp is updated. This allows incremental backups to be performed: it is easy to tell which files need to be backed up. Some files, such as database management system (DBMS) tablespaces, are constantly written to but are backed up using other criteria. Updates of the modification time stamp occur during the fsflush operation. There is no available mechanism to turn off this update procedure.

- inode modification time (ctime): The inode modification time is set when the inode is changed, such as by chmod or chown. These updates cannot be turned off.

Many reference books refer to these fields by their actual names: atime, mtime, and ctime.


The inode Block Pointers


The inode contains two sets of block pointers, direct and indirect. Each direct pointer points to one data block, usually 8 Kbytes. With 12 direct block pointers, this accounts for the first 96 Kbytes of the file. The inode's size value helps UFS to determine how many block fragments are in use. When the size of the file reaches 96 Kbytes, that file's data will always use whole 8-Kbyte blocks. If the file is larger than 96 Kbytes, the indirect block pointers are used. The first indirect pointer points to a data block that has been divided into 2048 pointers to data blocks. The use of this indirect block accounts for an additional 16 Mbytes of data. If the file is larger than 16 Mbytes, the second indirect pointer is used. It points to a block of 2048 addresses, each of which points to another indirect block. This allows for an additional file size of 32 Gbytes. The third pointer points to a third-level indirect block, which allows an additional 64 Tbytes to be addressed. However, remember that the maximum file size is limited to 1 Tbyte at this time.
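The capacities quoted above follow directly from 8-Kbyte blocks and 4-byte block addresses (8192 / 4 = 2048 pointers per indirect block), and the arithmetic can be checked in a few lines:

```python
# Arithmetic behind the capacities quoted above, assuming 8-Kbyte
# blocks and 4-byte block addresses.
BLOCK = 8 * 1024
PTRS = BLOCK // 4                       # 2048 addresses per block

direct = 12 * BLOCK                     # 12 direct pointers
single = PTRS * BLOCK                   # one level of indirection
double = PTRS ** 2 * BLOCK              # two levels
triple = PTRS ** 3 * BLOCK              # three levels

print(direct // 1024)                   # 96 (Kbytes)
print(single // 1024 ** 2)              # 16 (Mbytes)
print(double // 1024 ** 3)              # 32 (Gbytes)
print(triple // 1024 ** 4)              # 64 (Tbytes)
```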


UFS Buffer Cache


The UFS buffer cache (also known as the metadata cache) holds inodes, cylinder groups, and indirect blocks only (no file data blocks).


This cache was introduced in Module 6, System Buses.

The default buffer cache size allows up to 2 percent of main memory to be used for caching the file system metadata just listed. The size can be adjusted with the bufhwm parameter, which is specified in Kbytes. The default might be too much on systems with very large amounts of main memory; you can reduce it to a few Mbytes with an entry in the /etc/system file. Use the sysdef command to check the size of the cache. Read and write information is reported by sar -b, as lreads and lwrites. The UFS buffer cache is sometimes confused with the segmap cache, which is used for managing application data file requests.
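The 2 percent default can be sketched as follows (the helper name is hypothetical; note that bufhwm in /etc/system is set in Kbytes, while sysdef reports the resulting limit in bytes):

```python
# Illustrative sketch of the bufhwm default: 2 percent of main
# memory, expressed in Kbytes. The function name is hypothetical.
def default_bufhwm_kbytes(mem_mbytes):
    return mem_mbytes * 1024 * 2 // 100

# For a 128-Mbyte system such as the one in the sysdef output below:
print(default_bufhwm_kbytes(128))  # 2621 Kbytes, about 2.5 Mbytes
```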


The following partial output of the sysdef command displays the size of the buffer cache (bufhwm):

# sysdef
...
*
* Tunable Parameters
*
 2646016   maximum memory allowed in buffer cache (bufhwm)
...
#
# prtconf
System Configuration:  Sun Microsystems  sun4m
Memory size: 128 Megabytes
System Peripherals (Software Nodes):
...
#

In this output, the value of bufhwm represents 2 percent of the memory size (128 Mbytes). The following displays partial output of the sar -b command:

# sar -b
SunOS earth 5.8 Generic sun4m    11/01/00

17:00:00  bread/s  lread/s  %rcache  bwrit/s  lwrit/s  %wcache  pread/s  pwrit/s
17:20:00        0        0      100        0        0       63        0        0
17:40:01        0        0      100        0        0       54        0        0
18:00:00        0        0      100        0        0       55        0        0
19:00:00        0        0      100        0        0       57        0        0
20:00:00        0        4       99        0        3       91        0        0

Average         0        2       99        0        1       90        0        0
#

Two of the fields in this output are:

- lread/s: The average number of logical reads per second from the buffer cache.
- lwrit/s: The average number of logical writes per second to the buffer cache.


Hard Links
The inode number contained in the directory entry creates a hard link between the file name and the inode. A file always has a single inode, but multiple directory entries can point to the same inode. The reference count in the inode structure is the number of hard links referencing the inode. When the last link is deleted (unlinked), the inode can be deallocated and the associated file space freed. There are no back pointers from the inode to any directory entries. An entry in the DNLC is made for each hard link, but the same inode cache entry is used for each. Because hard links use inode numbers, they can be used only within a partition. To reference a file in another partition, symbolic or soft links must be used.


Symbolic Links
SunOS also supports a link that is implemented by pointing to a file that contains a path name, either absolute or relative. This is a soft, or symbolic, link, and it can point to a file or directory on a different file system and file system type. If the number of symbolic links referenced during a translation is greater than 20, it is assumed that a loop has been created and an error is returned. A fast symbolic link is used when the path name being linked to is less than or equal to 56 characters. The 56 characters represent the fourteen 4-byte unused direct and indirect block pointers, where the path name referenced by the link is placed.


Allocating Directories and inodes


UFS tries to allocate directories, inodes, and data blocks evenly across the cylinder groups. This provides consistent performance and a better chance that related information is kept close together. The determination of which cylinder group to use is made from the cylinder group allocation summaries kept in the primary superblock. Allocated data is clustered near the front of the cylinder group.


The Veritas File System


An alternative to the UFS file system is the Veritas File System, often referred to as VxFS. Unlike UFS, which allocates in 8-Kbyte file system blocks, VxFS allocates in extents. The size of the extent can be configured; by default, the extent size is set to 4 Mbytes. Data extents are stored in allocation units (AUs), which are typically 32 Mbytes in size. This means that eight extents can be written to one AU area. VxFS is biased towards large files. The use of extents within AUs allows contiguous disk space, in which the file can grow, to be allocated. In VxFS, the data that is to be written to disk storage is first written to an intent log, which is also stored on disk. The intent log is periodically flushed to the appropriate extents in data space.


8
The inode information for the file (metadata) is stored at the front of an AU. Small groups of inodes are created dynamically when required, unlike UFS, where inodes are generated at the time the file system is created. The extent approach allows sequential data writes to occur to one area of the disk drive (the intent log) in a continuous stream. The data can be flushed from the intent log at any time, allowing the drive to be heavily used without incurring large amounts of read/write head movement. The use of the intent log also allows data relating to canceled write operations to be discarded more easily: the data in the intent log can be ignored whenever a data write (from intent log to data area) occurs on disk.

Note: You can find more information about VxFS characteristics and functionality at http://www.veritas.com.


Comparison of UFS and VxFS


Each file system type offers a method of data storage that might be better suited to a particular application. Although UFS is designed as a multipurpose file system, it might be preferable to make use of a third-party file system, such as VxFS, for certain types of storage requirements. For example, mail files tend to grow in 2-Kbyte chunks. With such small amounts of data, UFS is well suited to storing mail files because of its block and fragment allocation and its ability to handle large numbers of relatively small files. Web server systems would also be better served by UFS because most Web sites consist of many small Hypertext Markup Language (HTML), Graphics Interchange Format (GIF), and Joint Photographic Experts Group (JPEG) files spread over a series of directory hierarchies. Careful management of the creation of the UFS directories could ensure that all files related to one Web site (or portion of a Web site) are stored within one cylinder group on disk.


Graphic image files tend to be large and contain data that is read or written contiguously. VxFS would seem to be better suited to store this type of file. Consider the application's requirements before deciding on which file system to apply. Table 8-1 shows some of the differences between UFS and VxFS. However, care should be taken whenever assessing file system requirements.

Table 8-1  Comparison of UFS and VxFS

  UFS                                        VxFS
  Better suited to the storage of large      Better suited to the storage of small
  numbers of small files spread over         numbers of large files
  many directories
  Data blocks are stored across the          Data blocks gradually fill up the
  entire partition, based on the             partition in contiguous extents
  directories in use
  Logging is available                       Uses the intent log as standard
  Slow boot time due to fsck activity        Fast boot time due to availability
  over many cylinder groups                  of the file system log

Note: The Solaris OE UFS supports logging. If enabled, the slow boot time caused by fsck activity over multiple cylinder groups is eliminated.


Allocation Performance
Many non-UFS file systems allocate from the front of the partition to the rear, regularly compressing the partition to move data to the front. This compacted approach was rejected during the design of UFS in favor of allocation across cylinder groups.

Note: The effect of the performance spikes for VxFS has been heavily accentuated in this example graphic. The graphic is not intended to represent actual scale; rather, it is intended to draw attention to the differences in performance over a long-term period as disk storage increases.

The benefit of the UFS approach is that the disk arm moves, on average, across half of the partition regardless of how full the partition is. This provides a consistent response time for random file access activity. If the data is accessed contiguously, UFS appears to improve its performance slightly.


In a compacted file system, the arm traverses the platter surface further and further, taking longer and longer as the partition fills up. While still fast, this creates the perception that the system is slowing as the file system fills. Over time, it inevitably leads to degradation of performance. With VxFS, the use of the intent log at the front of the partition allows large amounts of data to be written with minimal degradation. When the intent log becomes full, however, the data must be flushed to the data storage area, and this can cause spikes of poor disk performance if the VxFS file system is being heavily accessed. If large amounts of data are being contiguously written to or read from the disk, the spikes are much smaller in their impact. The disk head movement for UFS remains fairly constant, hence the static line in the graphic, but the lower performance overall. With VxFS, head movement increases proportionately to the volume of stored data if random-access reads and writes are occurring with large numbers of small files. However, with sequential access of fewer large files, VxFS performance is significantly greater than that of UFS. VxFS also has additional software packages that allow for online defragmentation and online backup. The decision to use VxFS might, therefore, be based on these extra facilities rather than on performance issues.


Allocating Data Blocks


Allocation of data blocks in a UFS partition is done in five steps, as shown in the overhead. The quadratic rehash algorithm is described in the next section. The reason that the last step invokes a panic is that the file system allocation data has been found to be incorrect. If the partition is known to be full, the allocation request fails immediately. Because it is not, the file system indicates that there is available space. However, if no space is found, the file system is treated as being inconsistent. The panic causes the system to run fsck during the reboot to correct the information.


Quadratic Rehash
The quadratic rehash algorithm was developed to quickly locate random blocks of free space in a nearly full partition. It makes the assumption that if a cylinder group is full, the cylinder groups around it will be full too, so farther locations are more likely to have free space. When the ideal cylinder group is full, the quadratic algorithm is used to try to locate a cylinder group with free space. The amount of free space is not considered. The faster the quadratic rehash algorithm finds free space, the sooner it stops; this is one reason the partition has a certain amount of space set aside (as seen with the df command) that can be allocated only by the root user. In a large, nearly full partition, the algorithm can be very time consuming.


Until SunOS 5.6, the free space amount was 10 percent of the available space in the partition. In the Solaris 2.6 OE and later releases, the amount reserved depends on the size of the partition, running from 10 percent in small partitions down to 1 percent in large partitions.

# df -k
Filesystem            kbytes    used   avail capacity  Mounted on
/dev/dsk/c0t3d0s0     246167   40801  180750    19%    /
...

Used (40801) plus avail (180750) produces a total of 221551. The kbytes total value is 246167, leaving a shortfall of 24616 Kbytes. This represents approximately 10 percent of the total space in this disk slice (c0t3d0s0).

# df -k
Filesystem            kbytes    used   avail capacity  Mounted on
...
/dev/dsk/c0t3d0s7    1103879  233618  815068    23%    /spare
...

Used (233618) plus Avail (815068) produces a total of 1048686. The Kbytes total value is 1103879, leaving a shortfall of 55193 Kbytes. This (55193 Kbytes) represents approximately 5 percent of the total space in this disk slice (c0t3d0s7).
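The arithmetic in both df examples above can be checked with a short sketch (the helper name is hypothetical):

```python
# The reserved-space arithmetic from the two df examples:
# total kbytes minus (used + avail) gives the reserve.
def reserve_percent(kbytes, used, avail):
    shortfall = kbytes - (used + avail)
    return shortfall, round(100 * shortfall / kbytes)

print(reserve_percent(246167, 40801, 180750))    # (24616, 10)
print(reserve_percent(1103879, 233618, 815068))  # (55193, 5)
```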


The Algorithm

Assuming a partition of 33 cylinder groups and a current cylinder group of 13, the algorithm checks other cylinder groups as shown in Table 8-2:

Table 8-2  Cylinder Group Algorithm

  Cylinder Group    Next Increment
              13                 1
              14                 2
              16                 4
              20                 8
              28                16
              11                32
              10                64
  ...

The algorithm continues until it encounters a cylinder group that it has already checked. Then, the unchecked cylinder groups are checked serially (the brute-force step).
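The probe sequence in Table 8-2 can be reproduced with a short sketch. This illustrates only the search order, not the kernel implementation:

```python
# Sketch of the search order from Table 8-2: starting at the ideal
# cylinder group, the offset doubles on each probe, wrapping modulo
# the number of cylinder groups, until a group repeats.
def quadratic_rehash_order(start, ncg):
    seen, cg, step = set(), start, 1
    while cg not in seen:
        seen.add(cg)
        yield cg
        cg = (cg + step) % ncg
        step *= 2
    # any remaining groups would then be checked serially (brute force)

print(list(quadratic_rehash_order(13, 33))[:7])
# [13, 14, 16, 20, 28, 11, 10]
```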


Fragments
For very small files, the allocation of an entire 8-Kbyte block is a significant waste of space. UFS therefore supports the suballocation of data blocks, called fragments. The number of fragments per block can be set to 1, 2, 4, or 8 (the default) by using newfs. Fragments are allocated either in a whole new 8-Kbyte block every time (optimization for time) or in the best-fit space available in an existing fragmented block (optimization for space). The optimization type can be changed by the tunefs utility. The rules for fragment allocation are shown in the overhead.


Logging File Systems


The ability of UFS to allocate space as required and to support files of practically unlimited size is the key to its flexibility. The drawback of this flexibility is the amount of overhead required when allocating space. Every data block or fragment allocated requires at least three writes, which should all be done before the block is safe to use. In a busy system, this generates a large amount of I/O. To avoid this I/O overhead, a logging file system is used. The logging file system writes a record of the allocation transaction to the file system log; updates to the disk structures are done at a later time. This also provides a significant benefit in recovery time: the entire file system does not need to be checked. Only the log transactions need to be checked and rewritten if necessary.


Application I/O
UFS application I/O is not usually done directly to a process address space. Instead, data is copied into a cache (the segmap cache) in the kernel, then copied into the user address space. The segmap cache is a fully associative, write-back cache of variable size. It is at the same kernel virtual address for every process, but the contents differ depending on the process's file usage pattern. The file system pages in the segmap cache might be shared with other processes' segmap caches, as well as with other access mechanisms, such as mmap. Because the segmap cache is pageable, a block shown in the cache might not be present in memory. It will be page faulted back in if it is required again.


This cache is called the segmap cache after the name of its segment driver.


Application I/O
Access to the segmap cache is done by the read and write system calls. The system call checks the cache to see if the data is present. If it is not, the segmap driver requests that the page be located and allocated to the selected cache's virtual address. If it is not already in memory, it is page faulted into the system. (For a full block write, the read is not necessary.) The specified data is then copied between the user region and the cache. This can generate a significant amount of overhead: one 8-Kbyte block, read 80 bytes at a time, requires 102 system calls. The number of disk I/O operations related to UFS and NFS operations to and from cache is reported by sar -b as breads and bwrits.
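The system-call count quoted above is simple arithmetic (a final short read for the leftover 32 bytes would add one more call):

```python
# Reading one 8-Kbyte block 80 bytes at a time: roughly 102 read()
# calls, as quoted in the text.
BLOCK = 8 * 1024
print(BLOCK // 80)  # 102
```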


The following displays partial output of the sar -b command:

# sar -b
SunOS earth 5.8 Generic sun4m    11/03/00

16:00:01  bread/s  lread/s  %rcache  bwrit/s  lwrit/s  %wcache  pread/s  pwrit/s
16:20:00        0        1       92        0        1       70        0        0
16:40:00        0        1       96        0        0       71        0        0
17:00:01        2       61       97        2        9       82        0        0
17:20:00        0        2       98        0        1       86        0        0
17:40:00        0        2       99        0        1       81        0        0
18:00:00        0        4       99        0        3       90        0        0
19:00:00        0        1      100        0        1       90        0        0

Average         0        8       97        0        2       84        0        0
#

Some of the fields in this output are:

- bread/s: The average number of reads per second submitted to the buffer cache from the disk.
- bwrit/s: The average number of physical blocks (512-byte blocks) written from the buffer cache to disk per second.


Working With the segmap Cache


Because of the overhead involved in accessing the segmap cache, the system provides several mechanisms to bypass it. Remember, the segmap cache is write back, not write through: the program is told that I/O is complete, even though it has not started. When an application must wait until the data is on disk before continuing, it can open the file with the O_SYNC or O_DSYNC flags. O_DSYNC requests a data synchronous operation; the data must be on the drive before the request is signaled complete. O_SYNC additionally requests that metadata updates be complete, and performs the O_DSYNC processing. The use of these flags can significantly slow the application and generate extra I/O because write cancellation cannot be used, even though their use may be required by an application (such as for a log file). Asynchronous I/O allows the program to continue running while the I/O occurs. Standard UNIX I/O is synchronous to the application programmer.


File System Performance Statistics


sar options -a, -g, and -v provide data relevant to file system caching efficiency and the cost of cache misses. Any nonzero values for %ufs_ipf, as reported by sar -g, indicate that the inode cache is too small for the current workload. The DNLC hit rate is calculated as (1 - (iget/s / namei/s)) * 100.
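Applied to an Average line such as the one in the sar -a output shown in this section (iget/s of 6, namei/s of 10), the formula gives:

```python
# The DNLC hit-rate formula quoted above:
#   (1 - (iget/s / namei/s)) * 100
def dnlc_hit_rate(iget, namei):
    return (1 - iget / namei) * 100

print(dnlc_hit_rate(6, 10))  # 40.0 percent, well below the 90
                             # percent target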


# sar -a
SunOS earth 5.8 Generic sun4m    11/03/00

16:00:01    iget/s  namei/s  dirbk/s
16:20:00         0        4        3
16:40:00         0        3        2
17:00:01        57       78       64
17:20:00         0        1        1
17:40:00         0        2        1
18:00:00         0        2        1
19:00:00         0        1        0

Average          6       10        8
#

Some of the numbers reported are:

- iget/s: The number of requests made for inodes that were not in the DNLC.
- namei/s: The number of file system path searches per second. If namei does not find a directory name in the DNLC, it calls iget to get the inode for either a file or a directory. Most iget calls are the result of DNLC misses.
- dirbk/s: The number of directory block reads issued per second.


8
# sar -g
SunOS earth 5.8 Generic sun4m    11/03/00

16:00:01  pgout/s  ppgout/s  pgfree/s  pgscan/s  %ufs_ipf
16:20:00     0.00      0.00      0.00      0.00      0.00
16:40:00     0.00      0.00      0.00      0.00      0.00
17:00:01     0.08      0.91      0.63      0.00      0.46
17:20:00     0.00      0.06      0.06      0.00      0.00
17:40:00     0.03      0.16      0.16      0.00      0.00
18:00:00     0.02      0.16      0.16      0.00      0.00
19:00:00     0.01      0.05      0.05      0.00      0.00

Average      0.02      0.16      0.13      0.00      0.46
#

The one field that is relevant to file system caching in this output is %ufs_ipf, which is the percentage of UFS inodes taken off the free list by iget that had reusable pages associated with them. These pages are flushed and cannot be reclaimed by processes. Thus, this is the percentage of iget calls that cause page flushes. A high value indicates that the free list of inodes is page bound and that the number of UFS inodes might need to be increased.

# sar -v
SunOS earth 5.8 Generic sun4m    11/03/00

16:00:01  proc-sz    ov  inod-sz      ov  file-sz  ov  lock-sz
16:20:00  63/2394     0  2096/10492    0  450/450   0  0/0
16:40:00  64/2394     0  2411/10492    0  473/473   0  0/0
17:00:01  65/2394     0  11365/11365   0  478/478   0  0/0
17:20:00  65/2394     0  11447/11447   0  479/479   0  0/0
17:40:00  65/2394     0  11524/11524   0  479/479   0  0/0
18:00:00  65/2394     0  11547/11547   0  479/479   0  0/0
19:00:00  65/2394     0  11567/11567   0  479/479   0  0/0
#

Included in this output are:

- inod-sz – The total number of inodes in memory compared to the maximum number of inodes allocated in the kernel. This is not a strict high-water mark; it can overflow.
- file-sz – The number of entries in, and the size of, the open system file table.


The fsflush Daemon


Main memory is treated as a write-back cache, and UFS treats the segmap cache the same way, implying that modified file data can remain in memory indefinitely. This is certainly not acceptable. In the event of a system crash, the amount and identity of lost data would never be known, and complete file system recovery from backups might be required. UFS tries to balance the performance gained by write-back caching with the need to prevent data loss. The result is the fsflush daemon. The fsflush daemon wakes up every tune_t_fsflushr seconds (the default is 5) and searches for modified file system pages. It must complete an entire cycle through main storage in autoup seconds (the default is 30). This means that on each awakening, fsflush must check 5/30, or 1/6, of main memory. This can cause I/O spikes in large systems, so tune_t_fsflushr can be made smaller to even out the I/O load.

In systems with a large amount of physical memory, you might want to increase the value of autoup, along with tune_t_fsflushr. This reduces the amount of CPU time used by fsflush by increasing its cycle time. Be aware that the trade-off is more potential data loss if a system failure occurs.
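For example, both parameters can be set in /etc/system. The values below are illustrative only, not recommendations; both parameters are in seconds:

```shell
* Lengthen the fsflush cycle on a large-memory system (illustrative values)
set tune_t_fsflushr = 5
set autoup = 240
```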


Direct I/O
Direct I/O is application I/O that does not pass through the segmap cache. Because all application I/O passes through the segmap cache by default, direct I/O must be requested explicitly. It is requested in two ways: per file, by the application, using the directio library call, and as the default for an entire partition, using the forcedirectio mount option. Using direct I/O on an entire partition allows the use of a UFS file system for data that would otherwise require raw partitions, such as database table spaces.

Restrictions
There are restrictions on the use of direct I/O. All of the following criteria must be true for direct I/O to be used. If they are not met, regular I/O is used, without notice.

- The file is not memory mapped in any other process.
- The file is not on a UFS logging file system.
- The file does not have holes (is not sparse).
- The specific read or write request is disk sector aligned (on a 512-byte seek boundary), and the request is for an integer number of sectors.

Also note:

- Direct I/O does not do read ahead or write behind. This must be managed by the application.
- All I/O to direct files must be complete before the application exits.
- Direct I/O cannot coexist with regular use of the file.
- The direct I/O facility is disabled by setting ufs:directio_enabled to 0.


Accessing Data Directly With mmap


Another mechanism to bypass the segmap cache is mmap. The mmap system call allows the programmer to memory map a file. This means that the file, or a portion of it (called a window), is treated as if it were a segment of the process's address space. The file thus looks like an array resident in the process's virtual memory and is accessed as if it were an array, referenced by pointers or array element subscripts. Changes to the file are made by writing to the proper memory location; the fsflush daemon then causes them to be moved to disk (or the programmer can request a write explicitly). To control the system's management of the mmap segment, or any virtual memory segment, the programmer uses the madvise call.


Using the madvise Call


madvise specifies how the application intends to reference pages of virtual memory. The program can specify:

- MADV_NORMAL – Normal access (perform read ahead).
- MADV_RANDOM – Random access; no read ahead.
- MADV_SEQUENTIAL – Sequential access (use read ahead and free behind).
- MADV_WILLNEED – A range of virtual addresses that the application intends to reference. These pages are asynchronously faulted into memory (read ahead).
- MADV_DONTNEED – A range of virtual addresses that are no longer needed by the application. The kernel will free the system resources associated with these addresses. Any modified data is written to disk.
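These advice flags can be exercised from any language that exposes madvise. The sketch below uses Python's mmap module on a hypothetical scratch file; it is an illustration of the calls above, not a course lab, and assumes a platform and Python version (3.8 or later) that define MADV_SEQUENTIAL.

```python
import mmap
import os
import tempfile

# Create a small 4-page scratch file (hypothetical path).
path = os.path.join(tempfile.mkdtemp(), "data")
with open(path, "wb") as f:
    f.write(b"x" * mmap.PAGESIZE * 4)

with open(path, "r+b") as f:
    m = mmap.mmap(f.fileno(), 0)        # map the whole file into the address space
    m.madvise(mmap.MADV_SEQUENTIAL)     # declare sequential access: read ahead, free behind
    first = m[:1]                       # reading memory reads the file
    m[:1] = b"y"                        # writing memory modifies the file
    m.flush()                           # push the change to disk now, rather than
                                        # waiting for the fsflush daemon
    m.close()

print(first, open(path, "rb").read(1))
```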


File System Performance


File system performance is heavily dependent on the nature of the application generating the load. Before configuring a file system, you need to understand the characteristics of the workload that is going to use it.

Comparing Data-Intensive and Attribute-Intensive Applications


A data-intensive application accesses very large files. Large amounts of data are moved around without creating or deleting many files. An example of a data-intensive workload is an image processing application or a scientific batch program that creates Gbyte-size files.

An attribute-intensive application accesses many small files. An application that creates and deletes many small files and reads and writes only small amounts of data in each is considered attribute intensive. Perhaps the best example is the software development shop.

Comparing Sequential and Random Access Patterns
The access pattern of an application also needs to be considered when planning an I/O configuration. Applications read and write through a file either sequentially or in random order. Sequential workloads are the easiest to tune because I/Os can be grouped before being issued to the storage device.

Online transaction processing (OLTP) is characterized by small I/O sizes (2 Kbytes to 16 Kbytes) and is predominantly random access. Decision support systems (DSSs) are characterized by large I/O sizes (64 Kbytes to 1 Mbyte) and are predominantly sequential access. High-performance computing (HPC) is characterized by very large I/O sizes (1 Mbyte to 16 Mbytes) and is predominantly sequential access. Internet service provider (ISP) workloads are characterized by small I/O sizes (1 Kbyte to 128 Kbytes) and are predominantly random access.

NFS version 2 is characterized by 8-Kbyte synchronous writes; it should be avoided whenever NFS version 3 is available. NFS version 3 is highly flexible and can be tailored to the needs of your application; tuning is based on the data-intensive versus attribute-intensive disposition of the workload.

Timeshare workloads are characterized by being nearly impossible to characterize. Determine which activities share the system, and then configure the most balanced configuration possible.


Cylinder Groups
Using newfs in a default way causes the system to allocate 16 cylinders per cylinder group. Recall that a cylinder is the collection of all tracks at the same position across the disk platters. Each track is composed of sectors, and each sector is 512 bytes. Every disk has a different number of sectors per track. Given this information from the disk vendor (or from the mkfs -m command), the size of a cylinder group (cg) can be computed as follows:

Number of bytes in a cg = (512 bytes/sector) x (number of sectors per track) x (number of tracks per cylinder) x (16 cylinders/cg)

For attribute-intensive file systems, this is probably a sufficient number of cylinders per cylinder group because the default is optimized for that purpose. For data-intensive file systems, this value might be too small. If the average file size is 15 Mbytes, consider increasing the number of cylinders per cylinder group, based on the previous computation. Remember, the file system tries to keep a file in a single cylinder group, so make sure that the cylinder groups are big enough.

Consider the following example:

Number of sectors per track = 224
Number of tracks per cylinder = 7

Using this computation:

Number of bytes in a cg = 512 x 224 x 7 x 16 = 12,845,056

In this example, a file system with an average file size of 15 Mbytes would benefit from increasing the number of cylinders per cylinder group to 32 (256 is the maximum):

# newfs -c 32 /dev/rdsk/xxx
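The computation above is easy to script. A small sketch (the helper name is ours, not a system utility):

```python
def cg_bytes(sectors_per_track, tracks_per_cylinder,
             cylinders_per_cg=16, bytes_per_sector=512):
    """Bytes in one cylinder group, per the formula above."""
    return (bytes_per_sector * sectors_per_track
            * tracks_per_cylinder * cylinders_per_cg)

# The example disk: 224 sectors/track, 7 tracks/cylinder, default 16 cylinders/cg
print(cg_bytes(224, 7))        # 12845056, about 12.25 Mbytes
print(cg_bytes(224, 7, 32))    # 25690112 - doubling to 32 cylinders/cg holds 15-Mbyte files
```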


Sequential Access Workloads


Sequential workloads read or write sequential file blocks. Sequential I/O typically occurs when moving large amounts of data, such as copying a file or reading or writing a large portion of a file. Use the truss command to investigate the nature of an application's access pattern by looking at the read, write, and lseek system calls.

# truss -p 20432
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0".., 512) = 512
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0".., 512) = 512
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0".., 512) = 512
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0".., 512) = 512
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0".., 512) = 512
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0".., 512) = 512

Mention that the apptrace command can be used as an alternative to the truss command.

By tracing the system calls generated by an application, you can see that the application is reading 512-byte blocks. The absence of lseek system calls between the reads indicates that the application is reading sequentially. When the application requests the first read, the file system reads in the first file system block, 8 Kbytes, of the file. This operation requires a physical disk read, which takes a few milliseconds. The second 512-byte read reads the next 512 bytes from the 8-Kbyte file system block that is still in memory. This takes only a few hundred microseconds, which is a major benefit to the performance of the application. Reading the physical disk in 8-Kbyte blocks means that the application only needs to wait for a disk I/O every 16 reads rather than on every read, reducing the amount of time spent waiting for I/O.

Waiting for a disk I/O every sixteen 512-byte reads is still terribly inefficient. Most file systems therefore implement a read-ahead algorithm in which the file system initiates a read for the next few blocks as it reads the current block. Because the access pattern is repeating, the file system can predict that the next block is very likely to be read, given the sequential order of the reads.


UFS File System Read Ahead


The UFS file system determines when to initiate read ahead by keeping track of the last read operation. If the last read and the current read are sequential, a read ahead of the next sequential series of file system blocks is initiated. The following criteria must be met to engage UFS read ahead:

- The last file system read must be sequential with the current one.
- There must be only one concurrent reader of the file (reads from other processes break the sequential access pattern for the file).
- The blocks of the file being read must be sequentially laid out on the disk.
- The file must be read or written using the read and write system calls; memory-mapped files do not use UFS read ahead.

The UFS file system uses the cluster size (the maxcontig parameter) to determine the number of blocks that are read ahead. In the Solaris 8 OE, the default is the maximum transfer size supported by the underlying device, which defaults to sixteen 8-Kbyte blocks (128 Kbytes) on most storage devices. For larger disk drives, this value can be much higher.

The default values for read ahead are often too low and should be increased to allow optimal read rates. The cluster size should be set very large for high-performance sequential access to take advantage of modern I/O systems. The behavior of the 512-byte read example is displayed by iostat.

# iostat -x 5
device    r/s   w/s    kr/s    kw/s  wait  actv  svc_t  %w  %b
fd0       0.0   0.0     0.0     0.0   0.0   0.0    0.0   0   0
sd6       0.0   0.0     0.0     0.0   0.0   0.0    0.0   0   0
ssd11     0.0   0.0     0.0     0.0   0.0   0.0    0.0   0   0
ssd12     0.0   0.0     0.0     0.0   0.0   0.0    0.0   0   0
ssd13    49.0   0.0  6272.0     0.0   0.0   3.7   73.7   0  93
#

Issuing 49 read operations per second to disk ssd13 and averaging 6,272 Kbytes per second from the disk results in an average transfer size of 128 Kbytes. This confirms that the default 128-Kbyte cluster size is grouping the 512-byte read requests into 128-Kbyte transfers.

The cluster size of a UFS file system (the maxcontig parameter) is displayed by using the fstyp -v command:

# fstyp -v /dev/rdsk/c0t3d0s7 | grep maxcontig
maxcontig 80  rotdelay 0ms  rps 90
#

For this file system, you can see that the cluster size (maxcontig) is 80 blocks of 8,192 bytes, or 640 Kbytes.
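The average transfer size implied by iostat is simply throughput divided by operation rate. A sketch of that check (the helper name is ours):

```python
def avg_transfer_kb(kbytes_per_s, ops_per_s):
    """Average transfer size implied by iostat: throughput / operation rate."""
    return kbytes_per_s / ops_per_s

# ssd13 in the read example: 49.0 reads/s at 6272.0 Kbytes/s
print(avg_transfer_kb(6272.0, 49.0))   # 128.0 - the default cluster size in Kbytes
```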
 

As an alternative, you can use the mkfs -m command on some file systems to see the maxcontig parameter. However, this will not always produce output. Inform the students that the maxcontig parameter is fixed to 80 by the newfs command. If this value required changing, the change would have to be applied using the tunefs command immediately following the creation of the file system.


Setting Cluster Sizes for RAID Volumes


SCSI drivers in the Solaris OE limit the maximum size of a SCSI transfer to 128 Kbytes by default. Even if the file system is configured to issue 512-Kbyte requests, the SCSI driver breaks the requests into smaller 128-Kbyte chunks. The same limit applies to volume managers such as Solstice DiskSuite and Veritas Volume Manager (VxVM). Whenever you use a device that requires larger cluster sizes, you need to set the SCSI and volume manager parameters in the /etc/system configuration file to allow bigger transfers. The following changes in the /etc/system file provide the necessary configuration for larger cluster sizes.

To allow larger SCSI I/O transfers (the parameter is in bytes), use:

set maxphys = 1048576

To allow larger Solstice DiskSuite I/O transfers (the parameter is in bytes), use:

set md:md_maxphys = 1048576

To allow larger VxVM I/O transfers (the parameter is in 512-byte units), use:

set vxio:vol_maxio = 2048

Because RAID devices and volumes often consist of several physical devices arranged as one larger volume, you need to pay attention to what happens to these large I/O requests when they arrive at the RAID volume. For example, a simple RAID level 0 stripe can be constructed from seven disks in a Sun A5200 storage subsystem. I/Os are then interlaced across each of the seven devices according to the interlace size, or stripe size. By interlacing I/O operations across separate devices, you can potentially break up the I/O that comes from the file system into several requests that can occur in parallel. With seven separate disk devices in the RAID volume, you have the ability to perform seven I/O operations in parallel. Ideally, you should also have the file system issue requests that initiate I/Os on all seven devices at once. To split the I/O into exactly seven components when it is broken up, you must initiate an I/O the size of the entire stripe width of the RAID volume. This requires you to issue I/Os that are 128 Kbytes x 7, or 896 Kbytes, each. To do this, you must set the cluster size to 896 Kbytes.

RAID-5 is similar, but you only have an effective space of n - 1 devices because of the parity stripe. Therefore, an eight-way RAID-5 stripe has the same stripe width and cluster size as a seven-way RAID-0 stripe.

The guidelines for cluster sizes on RAID devices are as follows:
- RAID-0 striping – The cluster size is equal to the number of stripe members multiplied by the stripe size.
- RAID-1 mirroring – The cluster size is the same as for a single disk.
- RAID-0+1 striping and mirroring – The cluster size is equal to the number of stripe members per mirror multiplied by the stripe size.
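These guidelines reduce to simple arithmetic. A sketch that converts stripe geometry into the newfs -C value (a count of 8-Kbyte blocks); the helper name is ours:

```python
UFS_BLOCK = 8192  # UFS file system block size in bytes

def newfs_C_value(stripe_members, stripe_size_kbytes):
    """Cluster size in 8-Kbyte blocks: stripe width divided by the block size."""
    stripe_width = stripe_members * stripe_size_kbytes * 1024
    return stripe_width // UFS_BLOCK

print(newfs_C_value(7, 128))   # 112 - the seven-way, 128-Kbyte example (896 Kbytes)
print(newfs_C_value(4, 16))    # 8 - a four-way stripe with 16-Kbyte stripe size (64 Kbytes)
```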


Commands to Set Cluster Size


Use the newfs command to set the cluster size at the time the file system is created. Cluster size can be modified after the file system is created by using the tunefs command. To create a UFS file system with an 896-Kbyte cluster size, issue the following command:

# newfs -C 112 /dev/rdsk/xxx

The parameter for the -C option is the number of 8-Kbyte blocks; it defaults to sixteen 8-Kbyte blocks (128 Kbytes) on most storage devices. One hundred and twelve 8-Kbyte blocks result in the desired 896-Kbyte cluster size.

To modify the cluster size after the file system has been created, use the tunefs -a command:

# tunefs -a 112 /dev/rdsk/xxx

Note – The UFS read-ahead algorithms do not differentiate between multiple readers; thus, two processes reading the same file will break the read-ahead algorithms.

Tuning for Sequential I/O
Tuning for sequential I/O involves maximizing the size of the I/Os from the file system to disk, which also reduces the amount of CPU time required and the number of disk I/Os. To maximize throughput, the file system needs to be configured to write I/Os that are as large as possible out to disk. The maxcontig parameter is also used to cluster blocks together for I/Os.

Tuning for Sequential I/O in a RAID Environment


maxcontig is set so that the cluster size is the same as the stripe width in a RAID configuration.

In a RAID configuration with four disks in a stripe and a stripe size of 16 Kbytes, the stripe width is 64 Kbytes. To optimize I/O in this case, set maxcontig to 8 (64 Kbytes / 8 Kbytes). To configure this when creating the file system, use the command:

# newfs -C 8 /dev/rdsk/xxx

To modify an existing file system, use the command:

# tunefs -a 8 /dev/rdsk/xxx


File System Write Behind


UNIX uses an efficient method to process writes. Rather than writing each I/O synchronously, UNIX passes the writes to the operating system, allowing the application to continue processing. This method of delayed asynchronous writes is the UNIX file system's default method for writing data blocks; synchronous writes are used only when a special file option is set. Delayed asynchronous writes allow the application to continue processing without having to wait for each I/O; they also allow the operating system to delay writes long enough to group adjacent writes together. When writing sequentially, this allows a few large writes to be issued, which is far more efficient than issuing many small writes.

The UFS file system uses the same cluster size parameter, maxcontig, to control how many writes are grouped together before a physical write is performed. The guidelines used for read ahead should also be applied to write behind. Again, if a RAID device is used, care should be taken to align the cluster size to the stripe width of the underlying device.

The following example shows the I/O statistics for writes generated by the mkfile command, which issues sequential 8-Kbyte writes:

# mkfile 500m testfile &
# iostat -x 5
device    r/s   w/s    kr/s    kw/s  wait  actv  svc_t  %w   %b
sd3       0.0   0.0     0.0     0.0   0.0   0.0    0.0   0    0
ssd49     0.0   0.0     0.0     0.0   0.0   0.0    0.0   0    0
ssd50     0.0   0.0     0.0     0.0   0.0   0.0    0.0   0    0
ssd64     0.0  39.8     0.0  5097.4   0.0  39.5  924.0   0  100
#

The iostat command shows 39.8 write operations per second to disk ssd64, averaging 5097.4 Kbytes per second to the disk. This results in an average transfer size of 128 Kbytes. Changing the cluster size to 1 Mbyte (1024 Kbytes) requires setting maxcontig to 128, which represents one hundred and twenty-eight 8-Kbyte blocks. The maxphys parameter must also be increased, as described earlier.

# tunefs -a 128 /ufs
# mkfile 500m testfile &
# iostat -x 5
device    r/s   w/s    kr/s    kw/s  wait  actv  svc_t  %w  %b
sd3       0.0   0.0     0.0     0.0   0.0   0.0    0.0   0   0
ssd49     0.0   0.0     0.0     0.0   0.0   0.0    0.0   0   0
ssd50     0.0   0.0     0.0     0.0   0.0   0.0    0.0   0   0
ssd64     0.2   6.0     1.0  6146.0   0.0   5.5  804.4   0  99
#

Now iostat shows 6.0 write operations per second, averaging 6146 Kbytes per second to the disk. Dividing the transfer rate by the number of I/Os per second results in an average transfer size of approximately 1024 Kbytes.

Note – The UFS clustering algorithm only works properly if one process or thread writes to the file at a time. If more than one process or thread writes to the same file concurrently, the delayed write algorithm in UFS begins breaking up the writes into random sizes.


UFS Write Throttle


The UFS file system limits the amount of outstanding I/O (I/O waiting to complete) to 384 Kbytes per file. If a process has a file that reaches this limit, the entire process is suspended until the resume limit has been reached. This prevents the file system from flooding system memory with outstanding I/O requests. The default parameters for the UFS write throttle prevent you from using the full sequential write performance of most disks and storage systems. If you ever have trouble getting a disk, stripe, or RAID controller to show up as 100 percent busy when writing sequentially, the UFS write throttle is the likely cause.

Two parameters control the write throttle: the high-water mark, ufs_HW, and the low-water mark, ufs_LW. The UFS file system suspends writing when the number of outstanding write bytes grows larger than ufs_HW; it resumes writing when the number falls below ufs_LW.

A third parameter, ufs_WRITES, determines whether the file system watermarks are enabled. They are enabled by default because the value of ufs_WRITES is 1. The ufs_WRITES parameter should not be set to 0 (disabled) because this could allow the entire system memory to be consumed by I/O requests. ufs_HW and ufs_LW have meaning only if ufs_WRITES is not equal to 0.

ufs_HW and ufs_LW should be changed together to avoid needless churning when processes awake and find that they either cannot issue a write (the parameters are too close together) or have waited longer than necessary (the parameters are too far apart).

The write throttle can be permanently set in /etc/system. Set the write throttle high-water mark to 1/64 of the total memory size and the low-water mark to 1/128 of the total memory size. For a 1-Gbyte machine, you would set ufs_HW to 16,777,216 and ufs_LW to 8,388,608:

set ufs:ufs_HW=16777216
set ufs:ufs_LW=8388608


Random Access Workloads


Use the truss command to investigate the nature of an application's access pattern. In the following output, the truss command traces the system calls generated by an application whose process ID is 19231:

# truss -p 19231
lseek(3, 0x0D780000, SEEK_SET)  = 0x0D780000
read(3, 0xFFBDF5B0, 8192)       = 0
lseek(3, 0x0A6D0000, SEEK_SET)  = 0x0A6D0000
read(3, 0xFFBDF5B0, 8192)       = 0
lseek(3, 0x0FA58000, SEEK_SET)  = 0x0FA58000
read(3, 0xFFBDF5B0, 8192)       = 0
lseek(3, 0x0F79E000, SEEK_SET)  = 0x0F79E000
read(3, 0xFFBDF5B0, 8192)       = 0
lseek(3, 0x080E4000, SEEK_SET)  = 0x080E4000
read(3, 0xFFBDF5B0, 8192)       = 0
lseek(3, 0x024D4000, SEEK_SET)  = 0x024D4000

Use the arguments from the read and lseek system calls to determine the size of each I/O and the seek offset at which each read is performed.

The lseek system call shows the offset within the file in hexadecimal. In this example, the first two seeks are to offsets 0x0D780000 and 0x0A6D0000, respectively. These two addresses appear to be random, and further inspection of the remaining offsets shows that the reads are completely random. Look at the third argument to the read system call to see the size of each read. In this example, every read is exactly 8,192 bytes, or 8 Kbytes. Thus, you can see that the seek pattern is completely random and that the file is being read in 8-Kbyte blocks.
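The distinction between the two truss traces can be captured in a few lines. A sketch (the function name is ours) that flags a trace as sequential only when every read starts where the previous one ended:

```python
def is_sequential(offsets, read_size):
    """True if each read begins exactly where the previous read ended."""
    return all(b - a == read_size for a, b in zip(offsets, offsets[1:]))

sequential = [0, 512, 1024, 1536, 2048]                        # the 512-byte read trace
random_io = [0x0D780000, 0x0A6D0000, 0x0FA58000, 0x0F79E000]   # offsets from the truss output above
print(is_sequential(sequential, 512), is_sequential(random_io, 8192))   # True False
```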


Tuning for Random I/O


There are several factors that should be considered when configuring a file system for random I/O:

- Try to match the I/O size and the file system block size.
- Choose an appropriately large file system cache.
- Disable prefetch and read ahead, or limit read ahead to the size of each I/O.
- Disable file system caching when the application does its own caching (for example, in databases).

Try to match the file system block size to a multiple of the I/O size for workloads that include a large proportion of writes. A write that is not a multiple of the block size requires the old block to be read, the new contents to be updated, and the whole block to be written out again. Applications that perform odd-sized writes should be modified to pad each record to the nearest possible block size multiple to eliminate this read-modify-write cycle. The -b option of the newfs command sets the block size in bytes.

A random I/O workload often accesses data in very small blocks (2 Kbytes to 8 Kbytes), and each access to the storage device requires a seek and an I/O because only one file system block is read at a time. Each disk I/O takes a few milliseconds, and, while the I/O is occurring, the application must wait for it to complete. This can represent a large proportion of the application's response time. Caching file system blocks in memory can make a big difference to an application's performance because many of those expensive and slow I/Os can be avoided.

For example, consider a database that does three reads from a storage device to retrieve a customer record from disk. If the database takes 500 microseconds of CPU time to retrieve the record and spends 5 milliseconds on each of the three disk reads, it spends a total of 15.5 milliseconds to retrieve the record; 97 percent of that time is spent waiting for disk reads. You can dramatically reduce the amount of time spent waiting for I/Os by using memory to cache previously read disk blocks. If a cached disk block is needed again, it is retrieved from memory, eliminating the need to go to the storage device again.

Setting the cluster size, maxcontig, to 1 reduces the size of the I/Os from the file system to disk. Use newfs -C 1 /dev/rdsk/xxx when creating the file system and tunefs -a 1 /dev/rdsk/xxx when modifying an existing file system.

When configuring a RAID device for random access workloads, use many small disks. Setting the stripe size of a RAID device to a small number helps reduce latency and the number of physical blocks that must be transferred to read the small blocks of random access workloads.
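The customer-record arithmetic above can be checked directly. A sketch (the function name is ours):

```python
def record_fetch(cpu_ms, disk_reads, ms_per_read):
    """Total service time and the percentage of it spent waiting on disk."""
    disk_ms = disk_reads * ms_per_read
    total_ms = cpu_ms + disk_ms
    return total_ms, round(100 * disk_ms / total_ms)

total_ms, disk_pct = record_fetch(0.5, 3, 5.0)   # 500 us of CPU, three 5-ms reads
print(total_ms, disk_pct)                        # 15.5 97
```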


Post-Creation Tuning Parameters


Once a partition has been created, many of its characteristics are fixed until it is re-created. However, several post-creation strategies can be used to affect the performance of the partition using tunefs.

Fragment Optimization
As described earlier, if a file system is optimized for time and there is not enough room in the last block of a file to hold another fragment when the file needs to grow, the kernel selects an empty block and copies the file's fragments from the old block to the new block. If a file system is optimized for space and there is not enough room in the last block to hold another fragment when the file needs to grow, the kernel looks for a block with a number of free fragments equal to exactly what the file needs and copies the fragments to this new block. In other words, it uses a best-fit policy.

The minfree Parameter
The minfree parameter forces a percentage of blocks in the file system to always remain unallocated. When a file system is almost full, the performance of the allocation routines for data blocks and inodes drops because of the quadratic rehash process. In file systems created before SunOS release 5.6, minfree is most likely to be large. Using the algorithm given earlier in the quadratic rehash description, you can adjust the value of minfree in your existing partitions. In a file system that is read only, minfree can be dropped from the default 10 percent to 0 percent because data block and inode allocation are not a concern.

The maxbpg Parameter


The UFS file allocation system allows only a certain number of blocks from one file in each cylinder group. The default is 16 Mbytes, which is usually larger than the default size of a cylinder group. To ensure that a large file can use an entire cylinder group (keeping the file more contiguous), increase this limit. The limit is set with the tunefs -e parameter and cannot be set with newfs. A recommended value that should work for all cylinder group sizes is 4,000,000. Do not change this value if the performance of small files in the file system is important.


Application I/O Performance Statistics


The sar -b command reports on the segmap and metadata caches. The overhead shows the available information.

# sar -b
SunOS earth 5.8 Generic sun4m 11/03/00

16:00:01 bread/s lread/s %rcache bwrit/s lwrit/s %wcache pread/s pwrit/s
16:20:00       0       1      92       0       1      70       0       0
16:40:00       0       1      96       0       0      71       0       0
17:00:01       2      61      97       2       9      82       0       0
...
19:00:00       0       1     100       0       1      90       0       0
...
Average        0       6      97       0       2      84       0       0
#

Fields in this output include:

- bread/s: The average number of reads per second submitted to the buffer cache from the disk
- lread/s: The average number of logical reads per second from the buffer cache
- %rcache: The fraction of logical reads found in the buffer cache (100 percent minus the ratio of bread/s to lread/s)
- bwrit/s: The average number of physical blocks (512-byte blocks) written from the buffer cache to disk per second
- lwrit/s: The average number of logical writes to the buffer cache per second
- %wcache: The fraction of logical writes found in the buffer cache (100 percent minus the ratio of bwrit/s to lwrit/s)
- pread/s: The average number of physical reads, per second, using character device interfaces
- pwrit/s: The average number of physical write requests, per second, using character device interfaces
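The cache-hit percentages can be recomputed from the raw columns. For example, using the 17:00:01 sample shown earlier (bread/s = 2, lread/s = 61):

```shell
# Recompute %rcache = (1 - bread/lread) * 100 for one sar -b sample.
awk 'BEGIN { b = 2; l = 61; printf "%%rcache = %.0f\n", (1 - b / l) * 100 }'
```

This agrees with the 97 percent that sar reported for that interval.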

Check Your Progress
Before continuing on to the next module, check that you are able to accomplish or answer the following:
- Describe the vnode interface layer to the file system
- Explain the layout of a UFS partition
- Describe the contents of an inode and a directory entry
- Describe the function of the superblock and cylinder group blocks
- Understand how allocation is performed in the UFS
- List three differences between the UFS and Veritas file systems
- List the UFS-related caches
- Tune the UFS caches
- Describe how to use the UFS tuning parameters

Think Beyond
Consider these questions:
- What will the effect on UFS be if there is not enough main storage?
- Why does the UltraSPARC not support the 4-Kbyte UFS block size?
- When would the use of a logging file system be critical to performance?
- What are the most expensive UFS operations? Why must they always be used?


UFS Tuning Lab


Objectives
Upon completion of this lab, you should be able to:
- Determine how a partition was created
- See the performance benefit of the DNLC
- Determine how many inodes a partition needs
- See the performance benefits of sequential read/write access
- Determine the cylinder group to which a file belongs
- Alter the default UFS creation parameters

File Systems
This lab involves the use of the fstyp command to view the characteristics of a UFS file system on hard disk. Look at the newfs/mkfs parameters used to create your root partition and record the following information:

fstyp -v /dev/rdsk/cXtXdXsX | more

size ________________
ncg ________________
minfree ________________
maxcontig ________________
optim ________________
nifree ________________
ipg ________________
ncyl ________________

The values will be system-dependent.
size is the size of the partition in numbers of file system blocks (see bsize).
ncg is the number of cylinder groups in the file system.
minfree is the percentage of space in the partition to reserve for root use only.
maxcontig is the number of contiguous disk blocks to be allocated.

optim is the optimization method (space or time).
nifree is the number of inode entries free for use.
ipg is the number of inodes per cylinder group.
ncyl is the number of cylinders in the partition.

The values for the questions below will be system-dependent.

How many cylinders exist in the last cylinder group in this partition?
___________________________________________________________

Using fstyp, determine how many inodes are present in the last cylinder group in the partition.
___________________________________________________________

Compare the values above with the output of df -Fufs -o i.
___________________________________________________________

At 128 bytes per inode, how much space is being allocated to unused inodes? What percentage of the partition space is this?
___________________________________________________________
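The unused-inode arithmetic in the last question is straightforward; the nifree value below is an assumed example, not one taken from your system:

```shell
# Space held by unused inodes at 128 bytes each (nifree is assumed).
nifree=50000
awk -v n="$nifree" \
    'BEGIN { printf "%.1f Mbytes in unused inodes\n", n * 128 / (1024 * 1024) }'
```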

Directory Name Lookup Cache
Use the files in the lab8 directory for these exercises.

1. Examine the effects of running a system with a very small DNLC and inode cache. ncsize is the size of the DNLC, and ufs_ninode is the size of the inode cache. Type:

   ./dnlc.sh

   Record the size of both caches.
   ncsize ________________
   ufs_ninode ________________

2. Type:

   vmstat -s

   Record the total name lookups (cache hits ##%).
   total name lookups ________________
   cache hits% ________________

3. Edit /etc/system. Add the following lines:

   set ncsize = 50
   set ufs_ninode = 50

4. Reboot your system.

5. After it has rebooted, type:

   vmstat -s

   Record the total name lookups (cache hits ##%).
   total name lookups ________________ (Treat this as value X in the equation in step 8.)
   cache hits% ________________ (Treat this as value A in the equation in step 8.)

6. To determine how much disk activity is taking place in step 7, start the sar program running in a separate window:

   sar -d 2 100

   Observe how much disk activity takes place as the script is executing. You may also wish to use:

   sar -q 2 100

   Calculate a percentage hit rate using the following formula:

   (namei - iget) / namei * 100 = Hit-Rate-Percentage

   Be aware that there may be other disk activity, such as daemon logging and file metadata synchronization, which could affect the output.

7. Type:

   ./finder

   This test takes a minute or so to run.
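The hit-rate arithmetic treats lookups that did not need an inode fetch as hits. With assumed counter values of namei = 12000 and iget = 600 (illustrative figures, not from a real run):

```shell
# DNLC hit rate from namei/iget style counters (values are assumed).
namei=12000; iget=600
awk -v n="$namei" -v i="$iget" \
    'BEGIN { printf "hit rate = %.0f%%\n", (n - i) / n * 100 }'
```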

8. When the test has completed, record the results. What has happened to the DNLC hit rate? Why? What can you deduce from just the vmstat output?
   total name lookups ________________ (Treat this as value Y in the equation below.)
   cache hits ________________ (Treat this as value B in the equation below.)

Now calculate the following (using the values from step 5 and step 8, above):

   (B - A) / (Y - X)

What is the actual percentage of cache hits?
______________________________________________________
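The interval calculation can be checked mechanically; the four snapshot values below are assumed for illustration:

```shell
# Interval DNLC hit rate between two vmstat -s snapshots (values assumed).
X=100000; A=90000    # lookups and hits at the first snapshot
Y=180000; B=130000   # lookups and hits at the second snapshot
awk -v a="$A" -v b="$B" -v x="$X" -v y="$Y" \
    'BEGIN { printf "interval hit rate = %.0f%%\n", (b - a) / (y - x) * 100 }'
```

Because only the deltas are used, the lookups and hits that accumulated before the test began drop out of the result.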

9. Type:

   sar -v 1

   The size of the inode table corresponds to the value you set for ufs_ninode; note that the total number of inodes exceeds the size of the table. This is not a bug. ufs_ninode acts as a soft limit: the actual number of inodes in memory may exceed this value, but the system attempts to keep the size below ufs_ninode.

10. Edit /etc/system. Remove the lines that were added earlier.

11. Reboot your system.

12. After the system has rebooted, change to the lab8 directory and type:

    ./finder

13. When the test has completed, record the results. What has happened to the DNLC hit rate? Why? What can you deduce from the vmstat output?
    total name lookups ________________
    cache hits ________________
    Are these values what you expected?

UFS Parameter Controls
For this lab, you will be using the spare disk drive that is currently not in use by your system. First, verify that there is a spare disk drive (if not, consult with your instructor) and, second, ensure that there are no current mounts of file systems from that drive.


Verify that there is a disk on which it is safe to carry out the following steps.

Note: Do not alter any settings for your system boot disk drive.

1. Use the format command to determine the size of the spare disk drive (in cylinders).

2. When you have determined the size of the disk (in cylinders), partition the disk using the information shown below, where the allocated values are as close to the percentages as possible. In addition to Slice 2, three slices are used. These slices are:
   - Slice 0: Allocate 5% of the total disk cylinders
   - Slice 6: Allocate 50% of the total disk cylinders
   - Slice 7: Allocate 45% of the total disk cylinders
   The total number of cylinders allocated to these slices equals the number of cylinders of Slice 2.

   Each of the slices listed above must be write-enabled and mountable. There should be no overlapping of any slices (except for Slice 2).
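As a sketch of the arithmetic, the 5 percent and 45 percent allocations on an assumed 2000-cylinder disk work out to:

```shell
# Cylinder counts for the 5% and 45% slices (disk size is assumed).
total=2000
awk -v t="$total" 'BEGIN {
    printf "slice 0: %d cylinders\n", t * 5  / 100
    printf "slice 7: %d cylinders\n", t * 45 / 100
}'
```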

3. When the partitions have been defined, label the disk using the appropriate command.

4. Quit from the format command using the appropriate program instructions.

5. Create a new file system in Slice 0 (on the drive that you have just used with the format command) using:

   newfs /dev/rdsk/cXtXdXs0

   where the device path name is appropriate for the disk slice being used in this lab.

   Note: For the remainder of this portion of the lab, whenever a device path name is used, you should assume that it is appropriate to that task, as for step 5, above. If you are unsure, consult with your instructor.

6. Create a new file system in Slice 6 (on the drive that you have just used with the format command) using:

   newfs -c 48 -i 6144 -m 3 -o space /dev/rdsk/cXtXdXs6

   Note: You may find that this command does not produce the expected results. This could be because the disk slice is over 4 Gbytes in size. Further, the space option may be ignored by the newfs command.


If the space option is ignored and the newfs defaults to time optimization, inform the students that space optimization can be set using the tunefs command.

7. Create a new file system in Slice 7 (on the drive that you have just used with the format command) using:

   newfs /dev/rdsk/cXtXdXs7

8. Compare the details of the file systems in Slice 0 and Slice 6 using the following commands:

   fstyp -v /dev/rdsk/cXtXdXs0 | head -30
   fstyp -v /dev/rdsk/cXtXdXs6 | head -30

9. Are there any differences in the control values for these two slices? Write down the differences (if any are found):
   minfree: ________________
   cpg: ________________
   bpg: ________________
   ipg: ________________
   maxbpg: ________________
   optim: ________________
   If there are differences, why do you think these differences occurred?

10. Compare the details of the file systems in Slice 0 and Slice 7 using the following commands:

    fstyp -v /dev/rdsk/cXtXdXs0 | head -30
    fstyp -v /dev/rdsk/cXtXdXs7 | head -30

11. Are there any differences in the control values for these two slices? Write down the differences (if any are found):
    minfree: ________________
    cpg: ________________
    bpg: ________________
    ipg: ________________
    maxbpg: ________________
    optim: ________________

12. Create the following directories using the commands listed below:

    mkdir /newdir0
    mkdir /newdir6
    mkdir /newdir7

13. Mount the slices (from the non-system disk) as follows:

    mount /dev/dsk/cXtXdXs0 /newdir0
    mount /dev/dsk/cXtXdXs6 /newdir6
    mount /dev/dsk/cXtXdXs7 /newdir7

14. In a new terminal window, issue the command:

    iostat 2

    Observe the values being displayed as you proceed through the remainder of this lab.

15. When the disk slices are mounted, issue the following commands and record (in the spaces provided) the amount of time taken to perform the tasks:

    Note: These tasks relate to sequential disk access times. Observe the kps column for the disk drive in the output of iostat when each command is executed.

    ptime mkfile 60m /newdir0/file60mb
    Real: ________  User: ________  Sys: ________

    ptime mkfile 60m /newdir6/file60mb
    Real: ________  User: ________  Sys: ________

    ptime mkfile 60m /newdir7/file60mb
    Real: ________  User: ________  Sys: ________
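The ptime figures can be turned into a rough sequential-write rate; the 15-second elapsed time below is an assumed example, not a measured result:

```shell
# Rough write throughput: 60 Mbytes written in an assumed 15.0 seconds.
awk 'BEGIN { printf "%.1f Mbytes/sec\n", 60 / 15.0 }'
```

The result should agree roughly with the kps column that iostat reports while mkfile runs.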

16. Remove the files created in the preceding step by issuing the following commands:

    rm /newdir0/file60mb
    rm /newdir6/file60mb
    rm /newdir7/file60mb

17. In the lab8 directory is a shell script called ufscreate1. This shell script creates a series of directories and subdirectories (in one of the file systems mounted earlier). When the directories have been created, a file is copied into each of the new subdirectories. Issue each of the commands listed below and record the amount of time taken to complete:

    ptime ./ufscreate1 /newdir0
    Real: ________  User: ________  Sys: ________

    ptime ./ufscreate1 /newdir6
    Real: ________  User: ________  Sys: ________

    ptime ./ufscreate1 /newdir7
    Real: ________  User: ________  Sys: ________

Were there any differences when /newdir0 and /newdir6 were used? If so, why?

The /newdir7 slice is nine times the size of the /newdir0 slice. Did the directory and file creation in /newdir7 take nine times longer than the directory and file creation in /newdir0? If not, why not?

How many directories were created in /newdir0?

How many directories were created in /newdir7?

The mmap and read/write Programs
Use the files in the lab8 directory for these exercises. The purpose of this exercise is to demonstrate that programming technique can affect file access. If a program uses a poor technique, it can have a severe impact on file system access.

Note: If you do not have a background in programming, you might want to skip this exercise and move on to the lab exercise on page L8-18.

1. Use your favorite editor to view the source for the program map_file_f_b.c. This program sequentially scans a file, either forward or backward.

   What system call is the program using to access the file?
   ___________________________________________________________

   Why is msync() called at the end of the program?
   ___________________________________________________________
   ___________________________________________________________

   Which should take more time: scanning the file forward or scanning the file backward?

Why?

While students are executing this lab exercise, suggest that they run iostat -x to display the data being transferred to and from the disk.

2. There is a file called test_file in the labfiles directory. It is used by some of the programs you will be running. Use df -n to verify that this file resides in a (local) UFS file system. If it is not in a UFS file system, perform the following steps:

   a. Copy the files map_file_f_b, mf_f_b.sh, and test_file to a directory on a UFS file system. map_file_f_b and mf_f_b.sh are in your lab8 directory. test_file is in the parent directory of the lab8 directory. You will be looking at some statistics related to disk I/O; therefore, test_file, map_file_f_b, and mf_f_b.sh need to be on a local file system.

   b. Change (cd) to the directory containing test_file, map_file_f_b, and mf_f_b.sh.

3. Type:

   mf_f_b.sh f

   This script runs map_file_f_b several times and prints out the average execution time. The f argument tells map_file_f_b to sequentially scan the file (forward), starting at the first page and finishing on the last page. Note the results.

   System total ________  Average ________
   User total ________  Average ________
   Real-time total ________  Average ________

4. Type:

   mf_f_b.sh b

   This runs map_file_f_b several times, but this time scanning (backward) begins on the last page of the file and works back to the first page. Note the results.

   System total ________  Average ________
   User total ________  Average ________
   Real-time total ________  Average ________

Which option took a shorter amount of time? Why?

5. Change (cd) back to the lab8 directory if you are not already in that directory.

Using the madvise Program
Experiment with madvise as follows:

1. Look at file_advice.c. What does this program do?

What options does it accept and what effect do they have on the program?

2. Copy file_advice to the directory containing test_file.

3. Change (cd) to the directory containing test_file.

4. In one window, type:

   vmstat 5

   When file_advice has finished accessing the mapped file, it prompts you to press Return to exit. Do not press Return. Instead, note the change in free memory from when the program was started until file_advice prompted you to press Return.

5. Type:

   ./file_advice n

   The n option tells file_advice to use normal paging advice when accessing its mappings. Note the pagein activity. What caused this?

After file_advice walked through the mapping, were the pages it used freed up? (Did you see a major drop in freemem in the vmstat output?) Explain.

6. Press Return so file_advice can exit.

7. Type:

   ./file_advice s

   The s option tells file_advice to use sequential advice when accessing its mappings. When file_advice prompts you to press Return, was any of the memory that had been used to access the mapping freed? (Did you see a smaller drop in freemem in the vmstat output?) Explain.

8. Press Return so file_advice can exit.

9. Type:

   ./file_advice s

   Note the number of I/O requests that were posted to the disk test_file resides on (for example, s0).

10. Press Return so file_advice can exit.

11. Type:

    ./file_advice r

    The r option tells file_advice to use random advice when accessing its mappings. Note the number of I/O requests that were posted to the disk that test_file resides on (for example, s0).

Is there a difference between the sequential case and the random case?

12. Press Return so file_advice can exit.


Network Tuning
Objectives
Upon completion of this module, you should be able to:
- Describe the different types of network hardware
- List the performance characteristics of networks
- Describe the use of Internet Protocol (IP) trunking
- Discuss the NFS performance issues
- Briefly describe the functions, features, and benefits of the CacheFS file system
- Discuss how to isolate network performance problems
- Determine what can be tuned

Relevance


Present the following questions to stimulate the students and get them thinking about the issues and topics presented in this module. While they are not expected to know the answers to these questions, the answers should be of interest to them and inspire them to learn the content presented in this module.

Discussion: The following questions are relevant to understanding the content of this module:
- What can be tuned in a network?
- How important is tuning the network servers and clients?
- How do you know what to tune?

Additional Resources
Additional resources: The following references provide additional details on the topics discussed in this module:

- Stevens, W. Richard. TCP/IP Illustrated, Volume 1. Reading, MA: Addison-Wesley, 1998. ISBN: 0-201-63346-9.
- Cockcroft, Adrian, and Richard Pettit. Sun Performance and Tuning. Palo Alto: Sun Microsystems Press/Prentice Hall, 1998.
- Wong, Brian L. Configuration and Capacity Planning for Solaris Servers. Palo Alto: Sun Microsystems Press/Prentice Hall, 1997.
- SMCC NFS Server Performance Tuning Guide in the Hardware AnswerBook on the SMCC Supplements CD-ROM.
- Solaris 8 AnswerBook, System Administration Guide.
- Man pages for the various commands (ndd, ping, snoop, spray) and performance tools (netstat, nfsstat).


Networks
The network system has three areas to tune, just like the I/O subsystem, although most people concentrate on just one. A network operates like any other bus, with bandwidth and access issues; just like any bus, speed and capacity play an important role in performance. The network is generally not susceptible to tuning at the individual system level because it cannot be adjusted through operating system (OS) options. Tuning the enterprise network is well beyond the scope of this course. The other aspects of tuning the network, as in the I/O subsystem, involve making efficient use of the network. Caching where appropriate, exploiting write-cancellation opportunities, and using network protocol tuning options can greatly increase the efficiency and capacity of the network. Each system connected to the network must be tuned differently for the network, depending on its workload.


Network Bandwidth
There are many different network technologies, which provide a wide range of bandwidths depending on traffic needs and system capabilities. The figures in the overhead are bandwidth values. Many of the networks are contention based, unlike the small computer system interface (SCSI) bus, and are unable to achieve practical throughput beyond 30 to 60 percent of these speeds.
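For example, applying that 30 to 60 percent range to a 100-Mbit/sec Ethernet gives the practical throughput band:

```shell
# Practical throughput band for 100BASE-T at 30-60% of wire speed.
awk 'BEGIN {
    printf "low:  %.0f Mbits/sec\n", 100 * 0.30
    printf "high: %.0f Mbits/sec\n", 100 * 0.60
}'
```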


Full Duplex 100BASE-T Ethernet


100BASE-T Ethernet (Fast Ethernet) interfaces, by default, automatically adjust their operational characteristics based on what they sense in the networking device to which they are attached (this is called autosensing). These characteristics include:
- Speed (10 or 100 Mbits/sec)
- Mode (full duplex or half duplex)

It is common for the interface to sense the network incorrectly and to operate with less capability than it can. Usually it operates at the correct speed, but it might be operating in half duplex mode instead of full duplex. You can use the ndd command to check (and set) the operational characteristics of a network interface.

ndd Examples
There are several 100BASE-T full-duplex parameters. Only those beginning with adv_ should be adjusted. To check the actual mode and speed of operation of an hme card, use the following command:

# ndd /dev/hme link_mode

The link_mode would be either 1 (full duplex) or 0 (half duplex).


Due to possible incompatibilities with older network devices or network cards, switching to full duplex might cause severe network performance problems. If a student is experiencing network problems, switching to half duplex might remove the problem.

# ndd /dev/hme link_speed

The link_speed would be either 1 (100 Mbits/sec) or 0 (10 Mbits/sec). If the values are not what you require because of auto-negotiation, you can force the interface to operate with fixed parameter values. The following ndd commands could be placed in a shell script that is used to force an hme card to work at 100-Mbit full duplex:

ndd -set /dev/hme instance 0
ndd -set /dev/hme adv_100T4_cap 0
ndd -set /dev/hme adv_100fdx_cap 1
ndd -set /dev/hme adv_100hdx_cap 0
ndd -set /dev/hme adv_10fdx_cap 0
ndd -set /dev/hme adv_10hdx_cap 0
ndd -set /dev/hme adv_autoneg_cap 0

Forcing 100BASE-T Full Duplex
To force 100BASE-T full-duplex operation using ndd, set adv_autoneg_cap and adv_100hdx_cap to 0, and set adv_100fdx_cap to 1. To set them using /etc/system, use:

set hme:hme_adv_autoneg_cap=0
set hme:hme_adv_100hdx_cap=0
set hme:hme_adv_100fdx_cap=1

You can also set these parameters using ndd or by editing the /kernel/drv/hme.conf file. If the parameters are altered using the /etc/system file, all instances of hme cards have the parameters applied. However, if you use ndd, each instance of a card must be nominated individually.

Note: For alternative network card types, such as qfe, use the information above, substituting qfe for hme.

The following settings could be placed in the /etc/system file to make an hme card work at 100-Mbit full duplex when the system is booted:

set hme:adv_100T4_cap=0
set hme:adv_100fdx_cap=1
set hme:adv_100hdx_cap=0
set hme:adv_10fdx_cap=0
set hme:adv_10hdx_cap=0
set hme:adv_autoneg_cap=0


IP Trunking
IP trunking is a capability provided by the 100BASE-T Quad Fast Ethernet (qfe) cards and the Gigabit Ethernet cards. IP trunking allows either two or all four interfaces to be combined as one higher-speed interface, somewhat like a network RAID or interleaving capability. Load balancing over the interfaces, failure recovery, and full Ethernet capabilities are provided.


Point out here that because disk striping and interleaving have been explained, you can now recognize those principles in use in other places.

Generally, the IP trunking connection goes to an Ethernet switch, although there are other uses for it. It presents only one media access control (MAC) address for the aggregate connection. To enable and configure IP trunking, you must install the Sun Trunking software.


IP Trunking is not the same as IP Multi-Pathing (IPMP), although there are some similarities. You may wish to also discuss IPMP.

Sun Trunking Software
Sun Trunking software multiplies network bandwidth in existing Gigabit Ethernet and Fast Ethernet networks. Sun Trunking software doubles the throughput of Gigabit Ethernet ports. The software also increases the throughput of Fast Ethernet networks by eight times on a single, logical port. Sun Trunking software aggregates, or trunks, multiple network ports to create high-bandwidth network connections that alleviate bottlenecks in existing Ethernet networks. This level of network throughput eases network bandwidth constraints in applications requiring the transport of large data files in a short period of time, such as data warehousing, data mining, database backup, digital media creation, and design automation. The technology is also beneficial in applications that require an extremely reliable and resilient network, such as financial services. The Sun Trunking software works in existing Gigabit Ethernet and Fast Ethernet networks and is compatible with switches from leading switch vendors, including 3Com Corporation, Nortel Networks, Cabletron, Cisco, Extreme Networks, and Foundry Networks. Sun Trunking software helps protect customers' current equipment investments while providing a smooth migration path to next-generation networks. Sun Trunking software allows the aggregation of multiple ports, which increases the reliability and resiliency of network connections. In the event that a single link fails, traffic can be smoothly redistributed over the remaining links. In addition, the software's load balancing feature more fully uses the aggregated links to provide more throughput over multiple links.


WARNING! Sun Trunking might cause problems in a cluster environment.


TCP Connections
Transmission Control Protocol (TCP) connections require more system processing and resources than User Datagram Protocol (UDP) connections. You can think of UDP connections as packet switched, whereas TCP connections are circuit switched. This implies that more overhead and processing are required for a TCP connection. Just setting up a TCP connection requires several round-trip messages between the two participating systems, even before the actual work starts. On the receiving side, one connection request, taken from the listen queue of pending requests, is processed at a time. In addition, because the reliability of the new connection is unknown, TCP transmits only a few messages at a time until it discovers the stability of the connection; then it begins sending more, creating more efficiency. There are TCP/IP stack tuning parameters that affect all of these areas.


TCP performs lost-message recovery as part of its reliability protocols. This means that, after a message has been sent, if a response is not received within a certain amount of time, the message is assumed to have been lost and is retransmitted. If the message has been transmitted over high-latency media, such as a satellite link, messages can be retransmitted unnecessarily. This leads to significant bandwidth waste, as well as unnecessary server overhead. Also, because connection and state information must be kept, each open TCP connection uses memory. Hypertext Transfer Protocol (HTTP) persistent connections can also add to the overhead, even though they allow several operations to be transmitted over the same connection. The volume of the connections can lead to significant resource requirements.
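The satellite case is easy to quantify: a geostationary satellite orbits at roughly 35,786 km, so one trip up and down takes about a quarter of a second at the speed of light (these are textbook figures, not values from the course):

```shell
# One-way ground-satellite-ground propagation delay for a geostationary
# link: 2 x 35,786 km at ~299,792 km/sec.
awk 'BEGIN { printf "%.2f seconds\n", (2 * 35786) / 299792.458 }'
```

A retransmit timer tuned for LAN latencies would fire long before a reply could possibly arrive over such a link.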


TCP Stack Issues


The overhead image gives a number of tuning guidelines for the TCP/IP stack. Warn means that this is a condition that requires further investigation. The tuning parameters for these conditions are discussed on the next pages. There are many other tuning parameters for the TCP stack; do not adjust them. They are intended for development use or for conformance with rarely used parts of the TCP specification (RFC [Request for Comments] 793) and should not be used. Make sure you have the latest patches for your TCP stack and network drivers. This area of the system changes constantly, and new features, especially performance features, are added regularly. Be sure to read the README files for details of the changes.


TCP Tuning Parameters


As mentioned before, there are many parameters that can be tuned in the TCP/IP stack. (For a complete list, run the ndd /dev/tcp \? command.) Similar parameters might be available for network adapters. The close-time interval (closed socket reuse delay), with a default of 240 seconds (4 minutes), specifies how long a socket must remain idle before it can be reused. When it is changed to 1 minute on systems that use many sockets, such as Web servers, significant response time improvements can be seen. Do not set it to less than 60,000 ms (1 minute). The request queue limit, or the number of system-wide pending connection requests, is 1024 by default. Unless you are experiencing dropped connection requests, which might occur on a heavily loaded HTTP server, do not adjust this parameter. The pending listen queue can have problems similar to the incomplete listen queue and should be tuned under the same conditions.
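As a sketch, the two adjustments described above would look like the following. The parameter names are those of the Solaris 8 OE ndd interface (earlier releases call the close-time interval tcp_close_wait_interval); confirm them on your release with ndd /dev/tcp \? before changing anything.

```
# Reduce the closed-socket reuse delay from the 240,000 ms default
# to the 60,000 ms (1 minute) floor recommended above.
ndd -set /dev/tcp tcp_time_wait_interval 60000

# Inspect the pending connection request queue limit (default 1024)
# before considering any change.
ndd /dev/tcp tcp_conn_req_max_q0
```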


The TCP slow start parameter specifies how many packets TCP sends when testing the reliability of a new connection. As mentioned before, as the connection proves reliable, more packets are sent at once. While the standard specifies one initial packet, most vendors send two packets (the Solaris 8 OE sets the default value to 4). You can adjust this value to bypass the initial testing procedure. This change is especially important when connecting to Windows 98 and Windows NT client systems. The window size specifies how much data the client can accept at one time; think of it as the input buffer size. On a slow network, the 16-Kbyte default is fine, but on faster networks (100BASE-T and up), you can increase it to 32 Kbytes.

Note – For further details of TCP/IP-related tuning parameters and recommended settings, refer to the Solaris Tunable Parameters reference manual at http://docs.sun.com.
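A hedged sketch of the corresponding ndd settings (tcp_slow_start_initial and tcp_recv_hiwat are the Solaris 8 OE parameter names; verify them against your release with ndd /dev/tcp \?):

```
# Start new connections with 4 segments instead of 1
# (already the Solaris 8 OE default).
ndd -set /dev/tcp tcp_slow_start_initial 4

# Raise the default receive window (input buffer) to 32 Kbytes
# for 100BASE-T and faster networks.
ndd -set /dev/tcp tcp_recv_hiwat 32768
```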


Using the ndd Command


The ndd command enables you to inspect and change the settings on any driver involved in the TCP/IP stack. For example, you can look at parameters from /dev/tcp, /dev/udp, and /dev/ip. To see what parameters are available (some are read-only), use the ndd /dev/ip \? command. (The ? is escaped to avoid shell substitution.) To see the value of a parameter, specify the parameter name:

ndd /dev/ip ip_forwarding

Changes can be made by using ndd -set, such as:

ndd -set /dev/ip ip_forwarding 0

which disables IP forwarding until the system is rebooted. (This has the same effect as creating /etc/notrouter.) A persistent change can be made using /etc/system. The change made in /etc/system would look like:

set ip:ip_forwarding=0


The ndd command can be used to inspect parameters, but only root can use ndd to set parameters. If you are changing parameters for a network interface, remember to specify the proper interface instance number. For example, to see the current parameter settings for interface hme3, select the instance and then list the parameters:

ndd -set /dev/hme instance 3
ndd /dev/hme \?


Network Hardware Performance


Networks act like packet-switched buses and have the same performance characteristics as other packet-switched buses, such as the SCSI bus. The main issue with these buses is access to transmit requests and responses. While SCSI and some network topologies (such as Token Ring) have strict contention management protocols, others, such as Ethernet, operate on a first-come, first-served basis. Ethernet, for example, begins transmitting when the network appears quiet. If another message is received during the transmission, the transmission is stopped and the other message continues through. This event, a collision, requires that the transmitter delay and try again. For each collision that occurs, the transmitter delays more. To avoid collisions, it is important that the network be available when a node needs to transmit. This is done by using a higher speed network, but another solution is to investigate the network traffic to see if unnecessary messages are being sent.
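The collision figure that matters here is collisions as a share of output packets. A sketch of the calculation, using sample counters in the style of netstat -i output (the numbers are illustrative, not measured):

```shell
#!/bin/sh
# Sample interface counters in the style of `netstat -i`.
opkts=1077416    # output packets on the interface
colls=5387       # collisions seen while transmitting

# Collision rate as a percentage of output packets.
rate=`echo "$opkts $colls" | awk '{printf "%.2f", ($2 / $1) * 100}'`
echo "collision rate: ${rate}%"
```

A sustained rate above a few percent is the usual sign that the segment is saturated and needs either a faster network or less traffic.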


NFS
Before tuning NFS specifically, be sure that the client and server are properly tuned. Issues such as directory name lookup cache (DNLC) and inode caches that are too small can result in increased network traffic because the information has to be reread. Check for heavy and uneven loads on the server's disks or memory shortages on the server. When the client and server are tuned, it becomes easier to identify specific problems as being NFS-related. The first step in network tuning is to try to eliminate as much message traffic as possible. Messages that are not sent provide the greatest performance and capacity improvement. Excessive traffic is eliminated by using cachefs and the actimeo parameter, as well as by properly sizing the client and server I/O caches. Also, NFS version 3 has some network message efficiencies and uses larger block sizes, which can add significant improvements to a very busy NFS environment. The nfsstat -m command from the client can identify which version of NFS you are using.
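For example (the server name and paths here are hypothetical), a read-mostly documentation tree can be mounted with a long attribute cache timeout to suppress repeated attribute refreshes:

```
# actimeo is in seconds; 600 means cached attributes are
# rechecked against the server at most every ten minutes.
mount -F nfs -o vers=3,ro,actimeo=600 server:/export/docs /docs
```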


NFS Client Retransmissions


The NFS distributed file service uses a remote procedure call (RPC) facility that translates local commands into requests for the remote host. The RPCs are synchronous. That is, the client application is blocked or suspended until the server has completed the call and returned the results. One of the major factors affecting NFS performance is the retransmission rate. If the file server cannot respond to a client's request, the client retransmits the request a specified number of times before it quits. Each retransmission imposes system overhead and increases network traffic. Excessive retransmissions can cause network performance problems. If the retransmission rate is high, look for:
q Overloaded servers that take too long to complete requests
q An Ethernet interface dropping packets
q Network congestion that slows the packet transmission


Use the nfsstat -c command to show client statistics. The number of client retransmissions is reported by the retrans field.
# nfsstat -c
Client rpc:
Connection oriented:
calls      badcalls   badxids    timeouts   newcreds   badverfs
156274     4          0          0          0          0
timers     cantconn   nomem      interrupts
0          4          0          0
Connectionless:
calls      badcalls   retrans    badxids    timeouts   newcreds
9          1          0          0          0          0
badverfs   timers     nomem      cantsend
0          4          0          0

Client nfs:
calls      badcalls   clgets     cltoomany
153655     1          153655     0
Version 2: (6 calls)
null        getattr     setattr     root        lookup      readlink
0 0%        5 83%       0 0%        0 0%        0 0%        0 0%
read        wrcache     write       create      remove      rename
0 0%        0 0%        0 0%        0 0%        0 0%        0 0%
link        symlink     mkdir       rmdir       readdir     statfs
0 0%        0 0%        0 0%        0 0%        0 0%        1 16%
Version 3: (151423 calls)
null        getattr     setattr     lookup      access      readlink
0 0%        44259 29%   5554 3%     26372 17%   18802 12%   765 0%
read        write       create      mkdir       symlink     mknod
20639 13%   24486 16%   3120 2%     190 0%      4 0%        0 0%
remove      rmdir       rename      link        readdir     readdirplus
1946 1%     5 0%        1019 0%     17 0%       998 0%      764 0%
fsstat      fsinfo      pathconf    commit
73 0%       33 0%       280 0%      2097 1%

Client nfs_acl:
Version 2: (1 calls)
null        getacl      setacl      getattr
0 0%        0 0%        0 0%        1 100%
Version 3: (2225 calls)
null        getacl      setacl      access
0 0%        2225 100%   0 0%        0 0%
#

NFS Server Daemon Threads


NFS servers have several system daemons that perform operations requested by clients, such as mountd for mounting a shared file system to the client. Normal client I/O requests, however, are handled by NFS I/O system threads. The NFS daemon, nfsd, services requests from the network. The NFS I/O system threads are created by the NFS server startup script, /etc/init.d/nfs.server. The default number of I/O system threads is 16. To increase the number of threads, change the nfsd line found about halfway into the file:

/usr/lib/nfs/nfsd -a 16


One of the most common causes of poor NFS server performance is too few I/O daemon server threads. A large NFS server can require hundreds. You can intentionally supply too few I/O system threads to limit the server's NFS workload. To determine the proper number of server threads, use the (one) largest value calculated from the following:
 

Choose only one of the values. That value should be the highest value resulting from all of the calculations shown in the list. Not all of these calculations will be relevant to all students' NFS servers.
q Two NFS threads per active client process
q Sixty-four NFS threads per client SuperSPARC processor
q Two hundred NFS threads per client UltraSPARC processor
q Sixteen threads per 10 Mbits of network capacity
q One hundred and sixty threads per fiber distributed data interface (FDDI)

Each nfsd system thread takes approximately 8 Kbytes of kernel memory.
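The sizing rules above reduce to taking the maximum of several products. A sketch in shell arithmetic, with made-up counts for one hypothetical server (substitute your own):

```shell
#!/bin/sh
# Example inputs -- replace with the counts for your server.
client_procs=100     # active client processes
ultra_cpus=4         # UltraSPARC processors
net_10mbits=10       # network capacity in units of 10 Mbits

r1=`expr $client_procs \* 2`    # 2 threads per active client process
r2=`expr $ultra_cpus \* 200`    # 200 threads per UltraSPARC processor
r3=`expr $net_10mbits \* 16`    # 16 threads per 10 Mbits of capacity

# Take the single largest value, as the guidelines direct.
threads=$r1
[ $r2 -gt $threads ] && threads=$r2
[ $r3 -gt $threads ] && threads=$r3

# Each thread costs approximately 8 Kbytes of kernel memory.
echo "nfsd -a $threads  (about `expr $threads \* 8` Kbytes of kernel memory)"
```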


The Cache File System


The Cache File System (CacheFS) works with NFS to allow the use of high-speed, high-capacity, local disk drives to store frequently used data from a remote file system. As a result, subsequent access to this data is much quicker than having to access the remote file system each time.

Note – CacheFS works similarly with ISO-9660 and High Sierra File System-based (HSFS-based) media (such as CD-ROMs).

CacheFS is a file system type (cachefs) that is available to the system as a loadable kernel module. It uses the administration command cfsadmin and the existing mount and fsck commands using the -F cachefs option.


Cache File System Characteristics


CacheFS has these characteristics:

q File data is cached – This relieves the burden on the server to locate the file contents on the server's disk subsystem or page cache. Use of the local cache helps reduce large data transfers over the network when that file's data is in the cache.
q File attributes are cached – Ownership and permission details are also stored in the local cache. If several users, working on the client, attempt to access the file, the local cache can provide the necessary attribute details without having to access the remote NFS server.
q Directory contents are cached – If more files are to be accessed in the same directory location, the local cache can provide more precise details to the NFS server, speeding up access on the NFS server's disk subsystem.
q Symbolic links are cached – This should also assist in providing more precise details to the remote server.


q Local disk resources are reclaimed on a Least Recently Used (LRU) basis – When the local cache fills, disk resources are reclaimed on an LRU basis. The LRU algorithms ensure that the most recently used data is maintained in the cache.
q Cache contents are verified periodically – The client will, periodically, request updates from the server. The update requests will relate to both the file contents and the file metadata, such as permission and ownership details.


CacheFS Benefits and Limitations


The benefits of CacheFS are:

q Improved performance – CacheFS improves user-perceived performance due to local caching. Network latencies are improved because the use of CacheFS reduces network traffic. CacheFS also enhances read performance when used with HSFS (CD-ROM) file systems.
q Increased server capacity – CacheFS supports higher ratios of NFS clients per server (up to twice as many as without CacheFS).
q Increased network capacity – Network segmentation might be deferred because the use of CacheFS reduces network loads and extends network capacity.
q Non-volatile storage – The cached data survives system crashes and reboots (clients can quickly resume work using cached data).


CacheFS has the following limitations:

q No caching of root file systems
q No benefit for write-mostly file systems
q No benefit for applications using read-once file systems (for example, news)
q No benefit if data is frequently modified on the server
q Initial reads of cached file systems take slightly longer than without a cache because the data is being copied to the local cache directory on disk (subsequent reads are much faster)


Creating a Cache
CacheFS caches are created with the cfsadmin command. The -c option specifies the cache name.

# cfsadmin -c cachedirname

Options to cfsadmin are specified after the -o option, in the form option=value.

Option        Default  Description
maxblocks     90%      The amount of the partition available
minblocks     0%       The minimum use amount
threshblocks  85%      When minblocks begins
maxfiles      90%      The maximum percentage of inodes available
minfiles      0%       The minimum percentage of inodes
threshfiles   85%      When minfiles begins

Displaying Information
Use the -l option to cfsadmin to view the current cache information, including the file system ID.

cfsadmin -l


Mounting With CacheFS
To mount an NFS file system on a client using CacheFS, use the normal NFS mount command format, specifying a file system type of cachefs instead of nfs. The mount options specify the remainder of the required CacheFS information.
mount -F cachefs -o backfstype=nfs,cachedir=cachedirname merlin:/docs /nfsdocs

Using /etc/vfstab
CacheFS mounts can be placed in /etc/vfstab like any other mount request. The device to fsck is the cache itself. Remember to include the proper mount options.
merlin:/doc /cache/doc /nfsdocs cachefs 2 yes rw,backfstype=nfs,cachedir=cachedirname

Using the Automounter


Add the appropriate options to the master map statement:
/home auto-home -fstype=cachefs,backfstype=nfs,cachedir=cachedirname

In the master map, these details provide defaults for all the individual mounts done under the /home directory. The options can also be used on direct maps.


The cachefs Operation
While running, CacheFS checks the consistency and status of all files it is caching on a regular basis. File attributes (last modification time) are checked on every file access. This can be adjusted using the standard actimeo parameter.

Writes to Cached Data


By default, writes to a cached file result in the entire file being deleted from the cache. This ensures consistency with other users of the NFS file. This is the default CacheFS behavior, and it can be specified explicitly by using the write_around mount option. If the files in the mount point are used only by the system that they are mounted on, the non_shared option prevents the deletion of an updated file. Both the cache and the NFS server are updated.

Warning – Do not use the non_shared option with shared files. Data corruption could occur.
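For instance (the server name and paths are hypothetical), a scratch area used only by this client could be mounted as:

```
# non_shared: writes update both the cache and the server without
# invalidating the cached copy -- safe only for unshared data.
mount -F cachefs -o backfstype=nfs,cachedir=/var/cachedir,non_shared \
    merlin:/export/scratch /scratch
```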


Managing cachefs
The cachefsstat Command
The cachefsstat command displays statistics about a particular local cachefs mount point. This information includes cache hit rates, writes, file creates, and consistency checks.

cachefsstat [-z] [path ...]

If no path is given, statistics for all mount points are listed. The -z option re-initializes the counters to zero for the specified (or all) paths.

The cachefslog Command


The cachefslog command specifies the location of the cachefs statistics log for a given mount point and starts logging, or halts logging altogether.

cachefslog [-f logfile] [-h] local_mount_point

If neither -f nor -h is specified, the current state of logging and the logfile are listed. -f starts logging and specifies the new logfile, and -h terminates logging. To inspect the logfile, use the cachefsstat command.
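A typical logging session, start to finish (the log file path and mount point are examples):

```
# Begin logging for the cache behind /nfsdocs
cachefslog -f /var/tmp/cfslog /nfsdocs
# ... run a representative workload, then stop logging
cachefslog -h /nfsdocs
# Report the working-set size recorded in the log
cachefswssize /var/tmp/cfslog
```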

The cachefswssize Command


Using the information from the cachefs log, which is managed by cachefslog, cachefswssize calculates the amount of cache disk space required for each mounted file system. The command syntax is:

cachefswssize logfile

# cachefswssize /var/tmp/cfslog
/home/tlgbs
          end size:      128k
   high water size:      128k
/mswre
          end size:    10688k
   high water size:    10704k


/usr/dist
          end size:     1472k
   high water size:     1472k
total for cache
      initial size:   110960k
          end size:    12288k
   high water size:    12304k

The cachefspack Command


The cachefspack command provides the ability to control and order the files in the cachefs disk cache for better performance. The rules for packing files are given in the packingrules(4) man page. In addition to packing the files, in association with the filesync(1) command, the packing list specifies files that the user can synchronize with a server.


Monitoring Network Performance


Commands available for monitoring network performance include:

q ping – Looks at the response of hosts on the network.
q spray – Tests the reliability of your packet sizes. It can tell you whether packets are being delayed or dropped.
q snoop – Captures packets from the network and traces the calls from each client to each server.
q netstat – Displays network status, including state of the interfaces used for TCP/IP traffic, the IP routing table, and the per-protocol statistics for UDP, TCP, Internet Control Message Protocol (ICMP), and Internet Group Management Protocol (IGMP).
q nfsstat – Displays a summary of server and client statistics that can be used to identify NFS problems.
q traceroute – Displays the route that packets take to reach the network host.


The ping Command
Check the response of hosts on the network with the ping command. If you suspect a physical problem, you can use ping to find the response time of several hosts on the network. If the response from one host is not what you would expect, investigate that host. Physical problems can be caused by:

q Loose cables or connectors
q Improper grounding
q Missing termination
q Signal reflection

The simplest version of ping sends a single packet to a host on the network. If it receives the correct response, it prints the message "host is alive".

# ping proto198
proto198 is alive
#

With the -s option, ping sends one datagram per second to a host. It then prints each response and the time it took for the round trip.

# ping -s proto198
PING proto198: 56 data bytes
64 bytes from proto198 (72.120.4.98): icmp_seq=0. time=0. ms
64 bytes from proto198 (72.120.4.98): icmp_seq=1. time=0. ms
64 bytes from proto198 (72.120.4.98): icmp_seq=2. time=0. ms
64 bytes from proto198 (72.120.4.98): icmp_seq=3. time=0. ms
^C
----proto198 PING Statistics----
5 packets transmitted, 4 packets received, 20% packet loss
round-trip (ms) min/avg/max = 0/0/0
#


The spray Command
Test the reliability of your packet sizes with the spray command. The syntax of the command is:

# spray [-c count] [-d interval] [-l packet_size] hostname

Options are:

q -c count – The number of packets to send.
q -d interval – The number of microseconds to pause between sending packets. If you do not use a delay, you might run out of buffers.
q -l packet_size – The packet size.
q hostname – The system to send packets to.

The following example sends 100 packets to a host (-c 100); each packet contains 2048 bytes (-l 2048). The packets are sent with a delay time of 20 microseconds between each burst (-d 20).

# spray -c 100 -d 20 -l 2048 proto198
sending 100 packets of length 2048 to proto198 ...
        no packets dropped by proto198
        4433 packets/sec, 9078819 bytes/sec
#


The snoop Command
To capture packets from the network and trace the calls from each client to each server, use snoop. The snoop command provides accurate time stamps that allow some network performance problems to be isolated quickly. Dropped packets can be caused by insufficient buffer space or an overloaded CPU. Captured packets can be displayed as they are received, or saved to a file for later inspection. Output from snoop can quickly overwhelm available disk space with the sheer amount of data captured. Tips to avoid information overload include:

q Capture output to a binary file:
# snoop -o packet.file
^C#
q Display the contents of the binary file:
# snoop -i packet.file
q Capture only relevant information, such as the packet header contained in the first 120 bytes:
# snoop -s 120
q Capture a specific number of packets:
# snoop -c 1000
q Capture output to tmpfs to avoid dropping packet information due to disk bottlenecks:
# snoop -o /tmp/packet.file
q Use filters to capture specific types of network activity:
# snoop broadcast


Isolating Problems
The overhead shows the four sources of network problems. By properly tuning the client and server systems, network and NFS performance can be significantly improved even before tuning the network itself. When the servers are operating correctly, and NFS activity has been tuned (for example, with cachefs and the actimeo parameter), look at physical network issues. Just as system memory problems can cause I/O and CPU problems, system problems can cause network problems. By fixing any system problems first, you can get a much clearer look at the cause of the network problems, if they still exist. You can use ndd, ping, spray, snoop, netstat, and nfsstat to get detailed information about your network and the network traffic.


Tuning Reports
The netstat monitor provides information on the physical network's performance as well as message and error counts from the TCP/IP stack. The nfsstat monitor provides statistics on NFS data movement, file system usage, and NFS message types being transmitted. Beginning with the Solaris 2.6 OE, iostat reports I/O performance data for NFS-mounted file systems. Samples of the netstat and nfsstat reports are in Appendix C, "Performance Monitoring Tools."
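A minimal monitoring pass with these reports might look like the following (the intervals are arbitrary choices):

```
netstat -i 5        # interface packet, error, and collision counts
netstat -s -P tcp   # per-protocol counters from the TCP/IP stack
nfsstat -s          # server-side NFS call mix
iostat -xn 5        # includes NFS-mounted file systems (Solaris 2.6 OE and later)
```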


Network Monitoring Using SE Toolkit Programs
The net.se Program
The net.se program output is like that of netstat -i plus collision percentage.
# /opt/RICHPse/bin/se /opt/RICHPse/examples/net.se
Name     Ipkts    Ierrs   Opkts    Oerrs   Colls   Coll-Rate
hme0     1146294  0       1077416  0       0       0.00 %
#

The netstatx.se Program


The netstatx.se program is like iostat -x in format. It shows per-second rates like netstat.
# /opt/RICHPse/bin/se /opt/RICHPse/examples/netstatx.se
Current tcp minimum retransmit timeout is 200
Thu Jul 29 11:15:29 1999
Name   Ipkt/s Ierr/s Opkt/s Oerr/s Coll/s Coll%   tcpIn  tcpOut DupAck %Retran
hme0    162.0    0.0  151.6    0.0    0.0  0.00  145695  128576      0    0.00
^C#

The nx.se Program


The nx.se program prints netstat and tcp class information. NoCP means nocanput, which is a packet that is discarded because it lacks IP-level buffering on input. It can cause a TCP connection to time out and retransmit the packet from the other end. Defr is defers, an Ethernet metric that counts the rate at which output packets are delayed before transmission.
# /opt/RICHPse/bin/se /opt/RICHPse/examples/nx.se
Current tcp RtoMin is 200, interval 5, start Thu Jul 29 13:18:20 1999
13:18:40 Iseg/s Oseg/s InKB/s OuKB/s  Rst/s  Atf/s  Ret%  Icn/s  Ocn/s
tcp         1.6    1.6   0.19   0.18   0.00   0.00   0.0   0.00   0.00
Name   Ipkt/s Opkt/s InKB/s OuKB/s IErr/s OErr/s  Coll% NoCP/s Defr/s
hme0      2.4    2.2   0.33   0.30  0.000  0.000    0.0   0.00   0.00
^C#


The netmonitor.se Program
The netmonitor.se program waits for a slow network, then prints out the same basic data as netstatx.se. Data displayed is in per-second rates.

The nfsmonitor.se Program


The nfsmonitor.se program is the same as nfsstat-m.se except that it is built as a monitor. It prints out only slow NFS client mount points, ignoring the ones that are working fine. You can only run it as the root user. For the Solaris 2.5 OE, nfsmonitor.se reliably reports data only for NFS version 2 (V2). For NFS version 3 (V3), it always reports zeroes (as does the regular nfsstat -m command).

The tcp_monitor.se Program


The tcp_monitor.se program is a GUI that is used to monitor TCP and has a set of sliders that enable you to tune some of the parameters when run as root (Figure 9-1).

Figure 9-1   tcp_monitor.se Program


Check Your Progress
Before continuing on to the next module, check that you are able to accomplish or answer the following:

u Describe the different types of network hardware
u List the performance characteristics of networks
u Describe the use of IP trunking
u Discuss the NFS performance issues
u Briefly describe the functions, features, and benefits of the CacheFS file system
u Discuss how to isolate network performance problems
u Determine what can be tuned


Think Beyond
Consider the following:

q What other similarities are there between system buses, the small computer system interface (SCSI) bus, and networks?
q Why does the TCP/IP stack have so many tuning parameters?
q Why is tuning the NFS client and server more likely to improve performance than tuning the network?
q How can you be sure that you have a network problem and not a system issue?


Network Tuning Lab


Objectives
Upon completion of this lab, you should be able to:

q Identify the hit statistics for Naming Services caches
q Trace NFS server access by your host's automount daemon
q Use the CacheFS
q Use netstat reports to determine network performance


The nscd Naming Service Cache Daemon
This lab demonstrates the use of cached information that is referenced by the Naming Service Cache Daemon (nscd). Use the lab9 directory for these exercises. You will require the following CDE windows for this lab:

1 x Console window (which will be referred to as Console-Window during this lab).
3 x Terminal windows (which will be referred to as Terminal-1, Terminal-2, and Terminal-3 during this lab).

Note – You should have only one Console window open during this lab.

Complete the following steps:

1. In Terminal-3, view the contents of the /etc/nsswitch.conf file. Write down the source of the data for the following:

   passwd: ___________________________
   group: __________
   hosts: _______________
   ipnodes: __________________________

2. In Terminal-3, view the current network statistics using the following command:

   # netstat -s

Note – You will be comparing these statistics at the end of this lab.


3. You should alter the font size of Terminal-2 so that the window has a width of at least 40 characters and a depth of at least 47 lines. In Terminal-2, execute the following command:

   # ./netcache

4. In the Terminal-1 window, execute the following commands and observe the values being output by the netcache script.

   # ls -lR /etc /usr/share > /dev/null

If you are using NIS in the classroom, try the following commands:

   # ypcat group
   # ypcat hosts

If you are using NIS+ in the classroom, try the following commands:

   # niscat group.org_dir
   # niscat hosts.org_dir

When you have completed the steps and observed the changes in value in Terminal-2, you should terminate the command running in Terminal-2.

Note – Read the man pages for nscd(1M) and nscd.conf(4) to determine how you can change the cache characteristics.


autofs Tracing
In the Solaris OE implementation, the NFS automounter facility is provided with a trace facility. This trace facility allows the administrator to observe the actions of the automountd daemon. The trace facility is useful in determining whether NFS servers in a network segment are overloaded with requests, whether they are responding correctly to client requests, and whether the version of NFS has been set correctly or if negotiation is taking place between client and server. The following steps show how to use the automounter's trace facility. For this exercise, you will need access to an NFS server.

1. In the Console-Window, type the following command:

   # touch /net/=9

This will invoke the trace facility with the highest level of tracing enabled (level 9).


Your Console-Window should look something like Figure L9-1.

Figure L9-1   Console-Window Output From the touch /net/=9 Command


2. In the Terminal-2 window, issue the following command:

   # cd /net/moon

where moon is the hostname of a system that is not acting as an NFS server. The output you should see in the Console-Window will look similar to Figure L9-2.

Figure L9-2   Console-Window Output From the cd /net/moon Command

3. In the Terminal-2 window, issue the following command:

   # cd /net/star

where star is the hostname of a system that is acting as an NFS server. The output in the Console-Window will show how the NFS automount was successful.

4. When you have successfully mounted from the appropriate server, you might want to try accessing directories and files that are now automounted. For each action, observe the trace details being displayed in the Console-Window.

5.	Change to the root directory:

	# cd /

6.	Wait for a period of more than 5 minutes, performing no keyboard activity. This allows the automounter to terminate the mount. Observe the trace values being displayed in the Console-Window at the time of the unmount of the NFS server directories.

7.	To terminate the trace facility, issue the following command in the Console-Window:

	# touch /net/=0

	where zero (0) terminates the trace capability.

cachefs Tracing
The cachefs file system allows remote (NFS) files to be cached on the local disk subsystem. Use of cachefs should, therefore, relieve the traffic on the network and speed up access to the file data for the local user. The following lab shows how cachefs can reduce network traffic. For this lab, there should be one host nominated as a manual page server. You will become a client of this (NFS) manual page server. Unless specified, all commands should be issued in the Terminal-1 window.

1.	Make a backup copy of the /etc/auto_master file.

2.	Edit the /etc/auto_master file to read:

	/-	/etc/auto_cache

3.	Create the /etc/auto_cache file, containing the following text:

	/usr/share/man -ro,nosuid,noconst,fstype=cachefs,\
	backfstype=nfs,cachedir=/var/cachedir\
	server:/usr/share/man

	Note - For readability, the line has been shown as being split over three text lines. It is more effective to have just one continuous line in the auto_cache file.

	where server is the name of the manual page server for the class.

4.	Create the cache directory:

	# cfsadmin -c /var/cachedir

5.	Update the automount daemon:

	# automount -v
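As the note says, the map entry is best kept as one continuous line. Written out in full (server remains a placeholder for the class manual-page server's hostname), the /etc/auto_cache file would contain:

```
# /etc/auto_cache -- the same entry as above, as one continuous line;
# "server" is a placeholder for the class manual-page server.
/usr/share/man -ro,nosuid,noconst,fstype=cachefs,backfstype=nfs,cachedir=/var/cachedir server:/usr/share/man
```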

6.	Now, make use of the automounter connection to the man page server by giving the command:

	# man ls

7.	Verify that the mount took place:

	# mount

	Is the man page server mounted?

8.	In the Terminal-2 window, issue the following command:

	# snoop -a server

	where server is the name of the man page (NFS) server.

9.	The man page for ls should now be stored in the cache. Try reading the man page for ls again, using the command:

	# man ls

	Did you notice much noise from the snoop session?

10.	Now, read the man page for ksh using the following command:

	# man ksh

	Did you notice much noise from the snoop session?

11.	Again, read the man page for ksh using the following command:

	# man ksh

	Did you notice much noise from the snoop session this time?

12.	Terminate the snoop session that is running in Terminal-2.

13.	Revert the /etc/auto_master file to its original state using the backup copy you made earlier in this lab.


Performance Tuning Summary


Objectives
Upon completion of this module, you should be able to:
-	List four major performance bottlenecks
-	Describe the characteristics and solutions for each
-	Describe several different ways to detect a performance problem
-	Identify the type of performance problem that you have, and what tools are available to resolve it

Relevance


Present the following questions to stimulate the students and get them thinking about the issues and topics presented in this module. While they are not expected to know the answers to these questions, the answers should be of interest to them and inspire them to learn the content presented in this module.

Discussion - The following questions are relevant to understanding the content of this module:

-	What general assessments can you quickly make of a poorly tuned system?
-	Is there an order to tuning solutions?
-	Are there any common problems to look out for?

Additional Resources
Additional resources - The following references provide additional details on the topics discussed in this module:

-	Cockcroft, Adrian, and Richard Pettit. Sun Performance and Tuning. Palo Alto: Sun Microsystems Press/Prentice Hall, 1998.
-	Wong, Brian L. Configuration and Capacity Planning for Solaris Servers. Palo Alto: Sun Microsystems Press/Prentice Hall, 1997.


Some General Guidelines


The overhead gives some basic guidelines for tuning a system.


What You Can Do


To tune a system, you can:

-	Use accounting to monitor the workload. Enable accounting by using the following commands:

	# ln /etc/init.d/acct /etc/rc0.d/K22acct
	# ln /etc/init.d/acct /etc/rc2.d/S22acct
	# /etc/init.d/acct start
	# crontab -l adm

-	Collect long-term system utilization data using the sys crontab:

	# crontab -l sys

-	Graph your data. It is much easier to see trends and utilization problems in a graph than in columns of numbers. A good package to build on, mrtg, can be obtained from http://ee-staff.ethz.ch/~oetiker/webtools/mrtg/mrtg.html. It can be easily modified to report on more than network data.
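As a sketch of the graphing step, the following reduces a day of sar -u output to comma-separated values that any plotting package (mrtg, a spreadsheet) can read. The column layout matches the Solaris sar -u report; in real use you would feed the function sar -u -f /var/adm/sa/saNN, and the captured sample below, including its numbers, is hypothetical.

```shell
# Reduce "sar -u" output to CSV: time,%usr,%sys,%wio,%idle.
# Data rows are recognized by a numeric second field, which skips the
# banner and column-heading lines; the Average row is dropped too.
sar_to_csv() {
    awk '$1 != "Average" && $2 ~ /^[0-9]+$/ {
        printf "%s,%s,%s,%s,%s\n", $1, $2, $3, $4, $5
    }'
}

sar_to_csv <<'EOF'
SunOS earth 5.8 Generic sun4u    05/25/01

00:00:00    %usr    %sys    %wio   %idle
00:20:01       4       2       5      90
00:40:00      12       6       9      73
Average        8       4       7      81
EOF
```

The sample run emits one CSV line per 20-minute sample, ready to append to a long-term data file.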


Application Source Code Tuning


It is usually much more productive to tune applications than to change Operating System (OS) parameters. (But you can do both.) Application performance is affected by many items. Review these, if possible, if you are experiencing application performance problems. You might not be able to fix them, but at least you will be able to identify the source of the problem, perhaps for future repair.


Compiler Optimization
The majority of performance benefit comes from tuning the application. One of the simplest and most overlooked application performance improvements is the use of compiler optimization. Most applications are compiled extremely inefficiently, for debugging purposes, and put into production that way. If compiled with basic optimization, performance gains of several hundred percent or more are possible, depending on the application. Also, if you use analytical libraries, or shared libraries at all, be sure to compile them with optimization because their use affects many other programs. The slightly increased difficulty in determining the location of a failure should be outweighed by significantly better performance.
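To illustrate the difference, compare a debug build with a basic optimized build. The program and file names are hypothetical; -g and -xO2 are standard Sun WorkShop C compiler flags.

```shell
# Debugging build, as often shipped to production by mistake:
cc -g -o report report.c

# The same program with basic optimization enabled:
cc -xO2 -o report report.c
```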


Application Executable Tuning Tips


The overhead image gives a number of suggestions for locating problems with applications. There might be difficulty initially determining which application is causing the problem, which system accounting can help with. To determine what the application is doing, you should use truss, apptrace, or prex to see kernel-level data. There might be problems caused by NFS, incorrect tuning applied to system parameters, or inefficient or inappropriate resource usage. Further, applications, DBM, and system maintenance can introduce new functions or parameters. These might also cause changes in system performance. Indeed, almost anything could be wrong, which makes it important to be able to identify possible causes of the problem.


Performance Bottlenecks
This section reviews the four basic performance bottlenecks: memory, I/O, central processing unit (CPU), and the network. This module summarizes the detection and removal of these bottlenecks. The primary monitors used to record performance data are sar, iostat, and vmstat. These performance monitors collect and display performance data at regular intervals. Selecting the sample interval and duration to use depends partly on the area of performance that is being analyzed. Long sample durations are useful for determining general trends over an extended period of time. For example, to determine disk load, it might be more effective to sample over a longer period of time using larger intervals between samplings. If too short a sample period is used, the data might not be representative of the general case (that is, the sample period might have occurred when the machine was abnormally busy).

Short sample durations are useful when the effects on the system of a specific application, device, and so on, are being determined. A shorter interval should be used so that more detailed data is recorded. Remember, the data reported by these performance monitors is an average of the activity over the interval. The longer the interval, the less chance that a brief spike in performance will be noticed. However, the shorter the interval, the more often the performance monitor runs and, therefore, the more it skews the performance data being displayed.
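As a worked example of that averaging effect, consider a disk that is 100 percent busy for a 6-second burst but idle for the rest of a 60-second sample interval; the monitor reports it as only 10 percent busy. The figures below are illustrative only.

```shell
# Interval averaging: a 6-second spike of 100% utilization inside a
# 60-second sample interval is reported as (100 * 6) / 60 = 10%.
spike_busy=100 spike_len=6 interval=60
avg=$(( (spike_busy * spike_len) / interval ))
echo "reported utilization: ${avg}%"    # prints: reported utilization: 10%
```

Halving the interval to 30 seconds would double the reported figure to 20 percent, which is why short intervals are needed to catch brief spikes.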


Memory Bottlenecks
If memory is a bottleneck on a system, this means that there is not enough memory for all of the active threads to keep their working sets resident. When the amount of free memory drops below lotsfree, the page daemon begins trying to free up pages that are not being used. If the amount of free memory drops too low, the swapper begins swapping lightweight processes (LWPs) and their associated processes. When the page daemon and the swapper need to execute, this decreases the amount of execution time the application receives. If the page daemon is placing pages that have been modified on the free list, these pages are written out to disk. This causes a heavier load on the I/O system, up to the maxpgio limit. The only utility that provides statistics on the hardware cache is vmstat -c. Cache misses are a potential bottleneck if they occur frequently enough. However, even when using vmstat -c, they are extremely difficult to detect.


Memory Bottleneck Solutions


There are many possible solutions to a memory bottleneck, depending on the cause. Your main goal is to make more efficient use of existing memory. If a system is consistently low on memory, there are a few potential remedies that can be applied. If several large processes are in execution at the same time, these processes might be contending with each other for free memory. Running these at separate times might lower the demand. For example, running several applications sequentially might put less of a load on the system than running all of them concurrently. You can use memcntl (or related library calls such as madvise) to more effectively manage memory that is used by an application. These routines are used to notify the kernel that the application is referencing a set of addresses sequentially or randomly. They are also used to release memory back to the system that is no longer needed by the application.

lotsfree, minfree, slowscan, and fastscan are used to control when the kernel begins paging and swapping and at what rate. They also modify the impact the page daemon and swapper have on system performance.
Applications that use shared libraries require less physical memory than applications that have been statically linked. Functions that call one another within an application should be placed as close together as possible. If these functions are located on the same page, then only one physical page of memory is needed to access both functions. setrlimit can be used to force applications to keep their stack and data segments within a specific size limit. Adding more physical memory, or at least more and faster swap drives, is always a solution as well.
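At the shell level, the same setrlimit mechanism is exposed through the ulimit built-in, which applies the limits to every process subsequently started from that shell. A minimal sketch follows; the limit value is an example only, and the subshell keeps it from affecting your login shell.

```shell
# Constrain the stack segment (in Kbytes) of commands started from
# this subshell; ulimit calls setrlimit on the shell's behalf.
(
    ulimit -S -s 8192      # set the soft stack limit to 8192 Kbytes
    ulimit -S -s           # report it back: prints 8192
    # ...start the application here; it inherits the limit
)
```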


I/O Bottlenecks
You can use sar, vmstat, and iostat to detect a possible I/O bottleneck. One of the goals in this area is to spread the load out evenly among the disks on a system. This might not always be possible, but in general, if one disk is much more active than other disks on the system, there is a potential performance problem. Although the threshold for determining an uneven disk load might vary, depending on the system and expectations, 20 percent is normally used: if the number of operations per disk varies by more than 20 percent, the system might benefit by having the disk load balanced. If several threads are blocked waiting for I/O to complete, the kernel might not be able to service the requests fast enough to keep them from piling up. The time a thread spends waiting for an I/O request to complete is time that the thread could be in execution if the requested data were available. There is no field that records this information in sar; however, the b column in vmstat displays this information.

Disks that are busy a large percentage of the time are also an indication of a potential performance problem. There are two ways to measure this. If the number of reads per second plus the number of writes per second is greater than 65 percent of the disk's capacity, the disk might be overloaded. If the disk is busy more than 85 percent of the time, this might also indicate a disk load problem. Because block allocation dramatically slows when a disk is nearly full, active disks with very little free space (near or less than the partition minfree value) might be slowing I/O due to space allocation overhead.
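The first rule of thumb can be written as a quick check. The 90-ops/sec capacity rating and the sample read/write rates below are assumed figures for illustration; real rates would come from iostat.

```shell
# Flag a disk whose r/s + w/s exceeds 65% of its rated capacity.
# Both sides are multiplied by 100 to stay in integer arithmetic.
rps=40 wps=25 capacity=90
if [ $(( (rps + wps) * 100 )) -gt $(( capacity * 65 )) ]; then
    echo "disk may be overloaded: $((rps + wps)) of $capacity ops/sec"
fi
```

With these sample figures the disk is running at about 72 percent of capacity, so the message is printed.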


I/O Bottleneck Solutions


Several actions can be taken to avoid I/O creating a performance bottleneck on a system. A balanced disk load is very effective. The goal is to create a configuration where, on average, all disks are equally busy. If a system has some partitions that are heavily accessed, these partitions can be placed on separate disks. The number of disks per controller should also be considered when evaluating disk load. Several extremely fast disks attached to a slow controller are not very effective. Placing too many drives on a bus can overload that bus, causing poor response times that are not caused by the drives themselves. Spreading swap partitions and files out among the disks is also effective in balancing the disk load. When allocating swap space, the kernel works in round-robin fashion, allocating 1 Mbyte of space from a swap device or file before moving to the next swap device.

A large disk will, by virtue of its size, have more requests to service than a smaller disk. By placing busy file systems on smaller disks, the load between large and small disks can be more easily balanced. mmap requires less overhead to do I/O than read and write, as does direct I/O. Applications can be modified to be more efficient with respect to their I/O requests. Instead of making many smaller, non-contiguous requests, an application should attempt to make fewer, larger requests. This is because, in most cases, the overhead to actually find the data on disk is much larger than the overhead required to transfer the data, as you have seen.


CPU Bottlenecks
The determination of whether the CPU is a performance bottleneck is largely a subjective one, based on the desired performance from the system. The user who wants rapid response time uses different criteria than the user who is more interested in supporting a large number of jobs that are not interactive. If the CPU is not spending much time idle (less than 15 percent), then threads that are runnable have a higher likelihood of being forced to wait before being put into execution. In general, if the CPU is spending more than 70 percent of its time in user mode, the application load might need some balancing. In most cases, 30 percent is a good high-water mark for the amount of time that should be spent in system mode.
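These three thresholds can be applied mechanically to a CPU utilization sample. The numbers below (%usr %sys %wio %idle) are hypothetical; in practice they would come from the Average line of a sar -u report.

```shell
# Apply the rules of thumb: idle below 15%, user above 70%, and
# system above 30% each suggest a CPU problem worth investigating.
echo "74 18 3 5" | awk '{
    if ($4 < 15) print "little idle time: runnable threads may be waiting"
    if ($1 > 70) print "user time over 70%: consider balancing the load"
    if ($2 > 30) print "system time over 30%: investigate kernel activity"
}'
```

With the sample figures, the first two warnings are printed and the system-time check stays quiet.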

Runnable threads waiting for a CPU are placed on the dispatch queue (or run queue). If threads are consistently being forced to wait on the dispatch queue before being put into execution, then the CPU is impeding performance to some degree. For a uniprocessor, more than two threads waiting on the dispatch queue might signal a CPU bottleneck. A run queue more than four times the number of CPUs indicates processes are waiting too long for a slice of CPU time. If this is the case, adding more CPU power to the system would be beneficial. A slower response time from applications might also indicate that the CPU is a bottleneck. These applications might be waiting for access to the CPU for longer periods of time than normal. Consider locking and cache effects if no other cause for the performance problem can be found.
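A sketch of the four-times rule follows. In practice, runq would come from the r column of vmstat and ncpu from counting psrinfo output; both are hard-coded sample values here.

```shell
# Compare the run-queue depth against four times the CPU count.
runq=9 ncpu=2
if [ "$runq" -gt $(( 4 * ncpu )) ]; then
    echo "run queue $runq exceeds 4 x $ncpu CPUs: possible CPU bottleneck"
fi
```

With a queue of 9 against a threshold of 8, the warning is printed; a queue of 8 or less on this two-CPU sample would stay quiet.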


CPU Bottleneck Solutions


You can use the priocntl and nice commands to affect a thread's priority. Because threads are selected for execution based on priority, threads that need faster access to the CPU can have their priority raised. You can also use processor sets to guarantee or limit access to CPUs. The dispatch parameter tables can be modified to favor threads that are CPU intensive or I/O bound. They could also be modified to favor threads whose priorities fall within a specific range of priorities. The CPU bottleneck might be the result of some other problem on the system. For example, a memory shortfall causes the page daemon to execute more often. Other lower priority threads are not able to run while the page daemon is in execution. Checking for system daemons that are using an abnormally large amount of CPU time might indicate where the real problem lies.

Use vmstat -i to determine if a device is interrupting an abnormally large number of times. Each time a device interrupts, the kernel must run the interrupt handler for that device. Interrupt threads run at an extremely high priority. Lowering the process load by forcing some applications to run at different times might help. Modifying max_nprocs reduces the total number of processes that are allowed to exist on the system at one time. Any custom device drivers or third-party software should be checked for possible inefficiencies, especially unnecessary programmed I/O (PIO) usage with faster CPUs. Lastly, if all else fails, adding more CPUs might alleviate some of the CPU load.


Ten Tuning Tips


Some of the most useful tuning reminders are given on the overheads on the next two pages. Do not overreact to a high value in a tuning report. There are always transient periods when demands are higher on the system than usual, and usually it is not cost effective to handle them. Make sure, however, that you have enough system resources to support your regular workload with room for growth.


Do not hesitate to call your application vendor if you suspect problems with an application system. They are the experts on their system and might be able to quickly suggest a solution that might take days to find any other way.

Check Your Progress
Check that you are able to accomplish or answer the following:

-	List four major performance bottlenecks
-	Describe the characteristics and solutions for each
-	Describe several different ways to detect a performance problem
-	Identify the type of performance problem that you have, and what tools are available to resolve it

Think Beyond
Things to consider:
-	What will you look at first when a performance problem is reported? Why?
-	What type of problems do large applications pose for system tuning?
-	How do you know when your system is well tuned?
-	What types of performance problems might your systems have that require special monitors or tuning approaches?

Case Study
The following case study involves a customer who was experiencing less than satisfactory system performance with a statistical analysis software (SAS) package running on Sun equipment. The customer asked Sun personnel to evaluate the system and recommend solutions to the performance problems. Although the case study relates to systems running earlier versions of the Solaris OE, the methods applied are still relevant to Solaris 8 OE systems. Even though output from utilities such as sar might differ from the Solaris 8 OE versions of the program, these case study notes still provide useful background information relating to the methods applied to a performance monitoring project.

When attempting to performance tune a system, you need to gather information regarding the system configuration; the applications that are running on the system; and the user community, such as the number of users, where the users are located, and what type of work patterns they have. Some initial questions need to be answered. Data needs to be collected to confirm a performance problem, and then solutions must be implemented. You should make one change at a time and then reevaluate performance before making any other changes.

This particular customer has three Sun Enterprise 6000 servers. Two of the servers are running an Oracle relational database management system, and the third server is running an SAS application. Each of the three servers is connected to two different storage arrays. Each of the disk arrays has 1.25 Tbytes of mirrored storage capacity. Presently, 200-plus users have been created on the SAS server. The users have remote access to the server over 100BASE-T Ethernet. Each user is connected to the server through a Fast Ethernet switch, which means that each user has a dedicated 100-Mbit/sec pipe into the local area network.

Each of the 200 users has cron jobs that query the Oracle database for data, perform some statistical analysis on the data, and then generate reports. The system is quiet at night, and the cron jobs start at around 5:30 a.m. each morning and run throughout the day. Initially, 50 users were set up, and performance was fine at that point. System performance degraded as more users were added. Queries to the Oracle database were still satisfactory, but performing the statistical analysis on the SAS server was taking longer as more and more users were added to the system. It was the SAS server that was running slow.

Case Study System Configuration
The prtdiag utility, first provided with the SunOS 5.5.1 operating system, provides a complete listing of a system's hardware configuration, up to the I/O interface cards. Partial output of the prtdiag command follows. This particular system is a Sun Enterprise 6000 with eight CPUs and 6 Gbytes of memory.
# /usr/platform/sun4u/sbin/prtdiag
System Configuration:  Sun Microsystems  sun4u 16-slot Sun Enterprise 6000
System clock frequency: 84 MHz
Memory size: 6144Mb

========================= CPUs =========================

                     Run   Ecache   CPU    CPU
Brd  CPU   Module    MHz     MB    Impl.   Mask
---  ---  -------   ----   ------  -----   ----
 0     0      0      336     4.0   US-II    2.0
 0     1      1      336     4.0   US-II    2.0
 2     4      0      336     4.0   US-II    2.0
 2     5      1      336     4.0   US-II    2.0
 4     8      0      336     4.0   US-II    2.0
 4     9      1      336     4.0   US-II    2.0
 6    12      0      336     4.0   US-II    2.0
 6    13      1      336     4.0   US-II    2.0

========================= Memory =========================

                                              Intrlv.  Intrlv.
Brd  Bank    MB    Status   Condition  Speed  Factor   With
---  ----  -----  -------   ---------  -----  -------  -------
 0     0    1024   Active      OK       60ns   4-way      A
 0     1    1024   Active      OK       60ns   2-way      B
 2     0    1024   Active      OK       60ns   4-way      A
 2     1    1024   Active      OK       60ns   2-way      B
 4     0    1024   Active      OK       60ns   4-way      A
 6     0    1024   Active      OK       60ns   4-way      A
...
...
#
The prtdiag I/O section shows, by name and model number, the interface cards installed in each peripheral I/O slot.
# /usr/platform/sun4u/sbin/prtdiag
...
...
========================= IO Cards =========================

           Bus   Freq
Brd  Slot  Type  MHz   Name                     Model
---  ----  ----  ----  -----------------------  -----------------------
 1     0   SBus   25   nf                       SUNW,595-3444,595-3445+
 1     1   SBus   25   fca                      FC
 1     2   SBus   25   fca                      FC
 1     3   SBus   25   SUNW,hme
 1     3   SBus   25   SUNW,fas/sd (block)
 1    13   SBus   25   SUNW,socal/sf (scsi-3)   501-3060
 3     0   SBus   25   QLGC,isp/sd (block)      QLGC,ISP1000
 3     1   SBus   25   fca                      FC
 3     2   SBus   25   fca                      FC
 3     3   SBus   25   SUNW,hme
 3     3   SBus   25   SUNW,fas/sd (block)
 3    13   SBus   25   SUNW,socal/sf (scsi-3)   501-3060
 5     0   SBus   25   QLGC,isp/sd (block)      QLGC,ISP1000
 5     3   SBus   25   SUNW,hme
 5     3   SBus   25   SUNW,fas/sd (block)
 5    13   SBus   25   SUNW,socal/sf (scsi-3)   501-3060
 7     0   SBus   25   fca                      FC
 7     2   SBus   25   QLGC,isp/sd (block)      QLGC,ISP1000
 7     3   SBus   25   SUNW,hme
 7     3   SBus   25   SUNW,fas/sd (block)
 7    13   SBus   25   SUNW,socal/sf (scsi-3)   501-3060

No failures found in System ===========================
No System Faults found ======================
#
The most serious bottlenecks are usually I/O related: either the system bus is overloaded, the peripheral bus is overloaded, the disks themselves are overloaded, or some combination of these.

From the information in the I/O section of the prtdiag command, the requested bandwidth for the bus can be calculated. This allows you to determine if I/O bottlenecks are caused by installing interface cards with more bandwidth capability than the peripheral bus can handle. Given the maximum bus bandwidths for the system, and the bandwidths necessary for the interface cards, it should be easy to tell if the available bandwidth for the peripheral bus has been exceeded. On the Enterprise 6000, each board actually has two SBuses for an aggregate bandwidth of 200 Mbytes/sec. The bandwidth of the Gigaplane bus on the E6000 is 2.6 Gbytes/sec. With the four boards (1, 3, 5, and 7) each having two SBuses, aggregate bandwidth of the eight SBuses is 800 Mbytes/sec. This is well below the available bandwidth of the 2.6-Gbyte backplane. Because the SBuses on Board 1 have the most devices, this board will be evaluated for an overload condition. Following are the bandwidths for the devices on Board 1:
-	nf (FDDI) at 100 Mbps (or 12.5 Mbytes/sec)
-	fca (Fibre Channel controller) at 40 Mbytes/sec
-	fca (Fibre Channel controller) at 40 Mbytes/sec
-	SUNW,hme (Fast Ethernet) at 100 Mbps (or 12.5 Mbytes/sec)
-	SUNW,fas/sd (Fast/Wide SCSI) at 20 Mbytes/sec
-	SUNW,socal/sf (Serial Optical Channel) at 100 Mbytes/sec

Totaling the bandwidths results in 225 Mbytes/sec, which exceeds the maximum bandwidth of the SBuses. With a peak request for 225 Mbytes/sec of bandwidth and, at most, 200 Mbytes/sec available, delays are inevitable. The customer was advised to move an fca to Board 5. The I/O cards on Board 5, as currently configured, have an aggregate bandwidth of only 152.5 Mbytes/sec. Moving one fca to Board 5 results in a bandwidth of 185 Mbytes/sec for Board 1 and 192.5 Mbytes/sec on Board 5.
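The total can be checked mechanically; the six figures below are the Board 1 device bandwidths in Mbytes/sec from the list above.

```shell
# Sum the Board 1 device bandwidths (Mbytes/sec) and compare the
# total with the 200 Mbytes/sec the board's two SBuses can deliver.
echo "12.5 40 40 12.5 20 100" | awk '{
    for (i = 1; i <= NF; i++) total += $i
    printf "requested %.1f of 200 Mbytes/sec available\n", total
}'
# prints: requested 225.0 of 200 Mbytes/sec available
```

The same one-liner, fed a proposed layout's figures, shows immediately whether a card move brings each board back under its bus limit.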

Case Study Process Evaluation
When you identify which processes are active, you can narrow the search down to the ones causing any delays and determine which system resources are the bottleneck for each process. The BSD version of the ps command displays the most active processes in order.
# /usr/ucb/ps -aux
USER       PID  %CPU  %MEM    SZ    RSS  TT      S  START     TIME    COMMAND
user1     3639  11.1   0.1  20376   4280 pts/9   O  06:42:22  115:37  sas
user2    22396   8.2   0.1   6104   4208 ?       O  09:04:13   11:51  /usr/local/sas612/
user3    24306   5.0   0.5  30864  28976 ?       S  09:14:57    6:10  /usr/local/sas612/
user4    21821   4.9   0.1   7656   4608 ?       S  09:00:07    4:56  /usr/local/sas612/
user5    17093   2.7   0.1   8328   5080 ?       S  08:25:55    0:46  /usr/local/sas612/
user6    20156   2.7   0.2  18896   7480 pts/14  O  08:47:05    4:26  sas
user7    12982   2.2   0.1   5576   3680 ?       O  08:00:01   11:02  /usr/local/sas612/
user8    25108   2.0   0.1   5896   4152 ?       O  09:18:45    0:26  /usr/local/sas612/
user9    24315   1.7   0.4  33696  22920 ?       S  09:15:01    2:21  /usr/local/sas612/
#

You knew that SAS would be the most active process on the system; however, the report still provides some useful information. The %CPU column is the percentage of the CPU time used by the process. As you can see, the first SAS process is using a large percentage of CPU time, 11.1 percent.

Note - User IDs were changed in the ps output to protect the privacy of the actual users.

The status column, S, is also useful. A status of P indicates that the process is waiting for a page-in. A status of D indicates that the process is waiting for disk I/O. Either of these statuses could indicate an overload of the disk and memory subsystems. This is not the case on this system. The processes have a status of O, indicating that the process is on the CPU, or running. A status of S indicates that the process is sleeping. The process identifier (PID) of the active processes is also useful. The memory requirements of a process are determined by using the /usr/proc/bin/pmap -x PID command.

Case Study Disk Bottlenecks
An I/O operation cannot begin until the request has been transferred to the device. The request cannot be transferred until the initiator can access the bus. As long as the bus is busy, the initiator must wait to start the request. During this period, the request is queued by the device driver. A read or write command is issued to the device driver and sits in the wait queue until the SCSI bus and disk are both ready. Queued requests must also wait for any other requests ahead of them. A large number of queued requests suggests that the device or bus is overloaded. The queue can be reduced by several techniques: use of tagged queueing, a faster bus, or moving slow devices to a different bus.

Queue information is reported by the iostat -x report. The queue length, that is, the average number of transactions waiting for service, is reported by the wait column. The percentage of time the queue is not empty, that is, the percentage of time there are transactions waiting for service, is reported by the %w column. If a system spends a good percentage of time waiting for I/O, this could indicate a disk bottleneck. To locate the devices causing a high wait I/O percentage, check the %w column.

# iostat -x
                            extended device statistics
device    r/s   w/s   kr/s    kw/s  wait  actv  svc_t  %w  %b
sd0       9.4   2.1   79.2    57.0   0.4   0.5   75.0   1   9
sd1       2.5   3.1   62.3   140.4   0.0   0.2   34.0   0   5
sd2       3.1   3.8  121.4   166.9   0.1   1.6  246.8   1  21
sd3       1.4   0.2   71.0     8.4   0.0   0.0   25.6   0   3
sd4       3.1   3.6  120.0   164.0   0.0   1.0  156.2   1  18
sd5       1.4   0.2   69.9     8.4   0.0   0.0   25.6   0   3
sd6       3.0   3.7  118.6   163.9   0.1   1.6  258.9   2  21

As you can see from this random sampling of I/O devices, none of the devices report a high number of transactions waiting for service. The percentage of time that there are transactions waiting for service is also quite acceptable. This suggests that the devices are distributed properly and that the bus is not overloaded.
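The wait-queue check described above can be sketched in a few lines. This is an illustrative parser, not part of the course tools; the field positions assume the extended device statistics layout of iostat -x (device, r/s, w/s, kr/s, kw/s, wait, actv, svc_t, %w, %b), and the sample lines are taken from the report above.

```python
# Hypothetical helper: scan iostat -x data lines for devices whose
# wait queue is non-empty (wait > threshold).
SAMPLE = """\
sd0  9.4 2.1  79.2  57.0 0.4 0.5  75.0 1  9
sd2  3.1 3.8 121.4 166.9 0.1 1.6 246.8 1 21
sd3  1.4 0.2  71.0   8.4 0.0 0.0  25.6 0  3
"""

def queued_devices(report, wait_threshold=0.0):
    """Return (device, wait, %w) for devices with queued transactions."""
    flagged = []
    for line in report.strip().splitlines():
        fields = line.split()
        device, wait, pct_w = fields[0], float(fields[5]), int(fields[8])
        if wait > wait_threshold:
            flagged.append((device, wait, pct_w))
    return flagged

# sd0 and sd2 occasionally queue, though the text judges these
# wait and %w values acceptable.
print(queued_devices(SAMPLE))
```

With a higher threshold, only the noisier devices are listed; on this system nothing crosses a worrying level.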

Performance Tuning Summary


You can also use iostat -x to look for disks that are more than 5 percent busy, %b, and have average response times, svc_t, of more than 30 ms. Careful configuration, that is, spreading the load and using a disk array controller that includes nonvolatile random access memory (NVRAM), can keep average response time under 10 ms.

# iostat -x
                        extended device statistics
device   r/s  w/s   kr/s   kw/s  wait  actv  svc_t  %w  %b
sd0      9.4  2.1   79.2   57.0   0.4   0.5   75.0   1   9
sd1      2.5  3.1   62.3  140.4   0.0   0.2   34.0   0   5
sd2      3.1  3.8  121.4  166.9   0.1   1.6  246.8   1  21
sd3      1.4  0.2   71.0    8.4   0.0   0.0   25.6   0   3
sd4      3.1  3.6  120.0  164.0   0.0   1.0  156.2   1  18
sd5      1.4  0.2   69.9    8.4   0.0   0.0   25.6   0   3
sd6      3.0  3.7  118.6  163.9   0.1   1.6  258.9   2  21

From the iostat report, you can see that average response times are high for all disks. Also, sd0, sd2, sd4, and sd6 are busy more than 5 percent of the time. A better distribution of load across these disks would help, as would reducing the number of I/Os.

UFS tries to balance the performance gained by write-back caching with the need to prevent data loss. Main memory is treated as a write-back cache, and UFS treats the segmap cache the same way. The fsflush daemon periodically wakes up and writes modified file data to disk. Without this mechanism, modified file data could remain in memory indefinitely. Two tuning parameters control the fsflush daemon:
- tune_t_fsflushr: How often to wake up the fsflush daemon. The default is 5 seconds.
- autoup: How long a full cycle should take; that is, the longest that modified data will remain unwritten. The default is 30 seconds.

The SAS application reads data from the Oracle database tables, performs some statistical analysis, and produces reports. Therefore, it is not necessary to activate the fsflush daemon as often. SAS requires a working file system for each user to write the results of computations to disk, but these writes are not that critical because the original Oracle data can always be read again in the event of a system crash.
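The relationship between the two fsflush tunables can be sketched with a little arithmetic of my own (a rough model, not a course tool): each wakeup, fsflush scans roughly tune_t_fsflushr/autoup of memory, so a full pass over memory completes in one autoup cycle.

```python
# Fraction of memory fsflush must cover on each wakeup so that a
# complete pass finishes within autoup seconds.
def fsflush_per_wakeup_fraction(tune_t_fsflushr, autoup):
    return tune_t_fsflushr / autoup

# Defaults: wake every 5 s, cover memory every 30 s.
default = fsflush_per_wakeup_fraction(5, 30)
# Tuned as recommended below: wake every 10 s, cover every 120 s.
tuned = fsflush_per_wakeup_fraction(10, 120)

print(default, tuned)  # the tuned settings halve the scan work per wakeup
```

Fewer wakeups and less work per wakeup is exactly why the tuned values free CPU cycles for SAS.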

Increasing both tuning variables for the fsflush daemon decreases the number of CPU cycles required to run fsflush, resulting in an increase in the number of CPU cycles available to SAS. Increasing both tuning variables also decreases the number of I/Os required to write file data back to disk. Recommended entries in /etc/system are:

set tune_t_fsflushr=10
set autoup=120

A blocked process is a sign of a disk bottleneck. The number of blocked processes should be less than the number of processes in the run queue. Whenever there are any blocked processes, all CPU idle time is treated as wait-for-I/O time. Use the vmstat command to display the run queue (r), blocked processes (b), and CPU idle time (id).
# vmstat 2
 procs      memory             page             disk           faults       cpu
 r b  w    swap   free  re   mf    pi    po    fr de   sr s0 s1 s2 s3   in   sy   cs us sy id
 1 1 34  121984  41312  13 2612 11218  2439 14303  0 1528 12  0  0  6 1425 2872  926 15 21 63
 0 0 71 6866280  97728  37 4056 22808 11392 44380  0 4220  2  0  0  0 2391 5459  969 22 46 32
 0 1 71 6866280  94832  49 5049 28232 12948 47268  0 4347  6  1  0  0 2345 5559 1076 24 53 22
 0 2 71 6868128  92984   5 4935 31444 20432 55256  0 4467 13  0  0  0 2364 5530 1056 27 44 29
 0 1 71 6868416  95480   5 4621 30732 20100 53460  0 4287  4  0  0  0 2637 5138 1175 24 44 32
^C#

In this case, the number of blocked processes exceeds the number in the run queue, indicating that the disk subsystem requires additional tuning. Continue to balance devices across the bus, balance the load across disk devices, and decrease the number of I/Os. Increasing the inode cache size (ufs_ninode) may also help reduce the number of disk I/Os.
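The blocked-versus-runnable comparison can be sketched as a small check (my own illustration; the (r, b) pairs below are taken from the vmstat samples above).

```python
# (r, b) pairs: run-queue length and blocked-process count per sample.
SAMPLES = [
    (1, 1), (0, 0), (0, 1), (0, 2), (0, 1),
]

def disk_bottleneck_suspected(samples):
    """True when blocked processes meet or exceed the run queue overall,
    which the course treats as a disk-bottleneck indicator."""
    total_r = sum(r for r, b in samples)
    total_b = sum(b for r, b in samples)
    return total_b >= total_r and total_b > 0

print(disk_bottleneck_suspected(SAMPLES))
```

Here blocked processes (5 in total) outnumber runnable ones (1), matching the conclusion that the disk subsystem needs more tuning.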

Case Study: CPU Utilization
To collect data on all processes, system accounting was enabled. Performance monitoring data could then be collected on a regular basis and system resource usage analyzed. The sar -u command reports CPU utilization. Output from sar -u confirms the users' complaints that the system begins to be sluggish at around 8:00 or 9:00 in the morning. As you can see from the report, %idle (the percentage of time the processor is idle) was zero. As has been discussed previously, zero idle time is a good sign that a system is overloaded. %wio (the percentage of time the processor is idle and waiting for I/O completion) is also high after 9:00. Once again, this indicates that the I/O devices are a bottleneck.

# sar -u
SunOS pansco-sdm 5.6 Generic_105181-13 sun4u    07/15/99

00:00:01    %usr    %sys    %wio   %idle
01:00:01       5       8      62      26
02:00:01       0       3      13      84
03:00:01       1       3       1      95
04:00:00       4       3       1      91
05:00:01       3       4       2      91
06:00:01      10      13       5      72
07:00:03      37      53       4       6
08:00:05      33      61       5       0
08:20:02      33      63       5       0
08:40:01      34      66       0       0
09:00:02      32      66       2       0
09:20:02      33      54      12       1
09:40:01      32      48      20       0
10:00:01      35      48      17       0
10:20:01      26      32      39       3

Average       16      26      12      45
#
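The two symptoms called out above — zero idle time and double-digit %wio — can be flagged mechanically. This is an illustrative sketch only; the rows are a subset of the sar -u report, hand-copied as (time, %usr, %sys, %wio, %idle) tuples.

```python
# Selected sar -u intervals from the report above.
ROWS = [
    ("07:00:03", 37, 53,  4, 6),
    ("08:00:05", 33, 61,  5, 0),
    ("09:20:02", 33, 54, 12, 1),
    ("09:40:01", 32, 48, 20, 0),
]

# Intervals where the CPU never idled (saturation).
saturated = [t for (t, usr, sys, wio, idle) in ROWS if idle == 0]
# Intervals spending 10% or more waiting on I/O.
io_bound = [t for (t, usr, sys, wio, idle) in ROWS if wio >= 10]

print(saturated)
print(io_bound)
```

The morning intervals show up in both lists, which is the combination of CPU saturation and I/O wait the case study is tracking down.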

Case Study: CPU Evaluation
On a multiprocessor system (which this system is), the mpstat utility displays per-processor performance statistics. Use this command to detect load imbalances between CPUs. On this system, the load distribution between CPUs is satisfactory, although CPUs 4 and 5 appear to be working harder than the others. Adding more CPUs would not help improve performance, however, because each of the CPUs is already spending at least some of its time idle (idl). The CPUs are spending a significant amount of time waiting for I/O (wt), which again indicates an I/O bottleneck. The number of interrupts (intr) is high, which could also be attributed to I/O device interrupts. Using processor sets, changing process priorities by modifying dispatch tables, running applications in real time, or using the Solaris Resource Manager are all options to improve the efficiency of the CPUs, but these all require some trial and error. The customer, at this time, has elected not to experiment with any of these options.
# mpstat 2
CPU minf mjf  xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0  411   1 17351    61   40  106   20   12   85    0   418   19  21  21  39
  1  467   1   757    21    0  110   20   11   89    0   469   22  22  19  36
  4  431   1 17997    20    0  105   19   11   87    0   431   20  21  21  38
  5  467   1    59   206  183   92   18   10   91    0   452   23  22  19  37
  8  218   1 14856   245  207  126   19   14  106    0   273   10  20  26  44
  9  213   1 15165   249  210  130   19   14  112    0   282   10  20  26  44
 12  189   1 14071   259  218  131   20   13  105    0   257    9  19  27  45
 13  215   1 14944   461  226  123   19   13  107    0   288   10  24  23  42
CPU minf mjf  xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0  591   3 30545    14    0   94   12   18   82    0   480   27  27  28  18
  1  717   5 33174    11    1  108    9   18  120    0   399   12  32  38  19
  4  803   1 26498    13    0   50   12    8   85    0   687   67  24  10   0
  5  580   6 27443    96   80   89   15   18   66    0   397   46  24  28   3
  8  317  19 20689   426  386  199   14   26  145    1   411   10  20  54  16
  9   72  13 14290   411  368  140   14   24  121    0   249   20  14  44  22
 12  135   2 12222   409  371  133    8   20  102    0   183    4   8  62  26
 13  417   0 25322   580  349   93   10   17  124    0   166   14  36  34  16
^C#
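The imbalance check can be sketched as follows (my own illustration, not a course tool). The busy percentages are usr+sys for each CPU in the first mpstat sample.

```python
# usr+sys per CPU, first mpstat sample above.
BUSY = {0: 40, 1: 44, 4: 41, 5: 45, 8: 30, 9: 30, 12: 28, 13: 34}

def busiest(busy):
    """CPUs whose busy time exceeds the average across the set."""
    avg = sum(busy.values()) / len(busy)
    return sorted(cpu for cpu, pct in busy.items() if pct > avg)

# In the second sample, CPUs 4 and 5 stand out even more
# (67+24 and 46+24 percent busy).
print(busiest(BUSY))
```

The spread here is modest, which supports the text's conclusion that the distribution is satisfactory even though some CPUs run hotter than others.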

Case Study: Memory Evaluation
The amount of total physical memory can be determined from the output of the prtdiag command (as shown previously) and also from the prtconf command. For this system, physical memory is 6 Gbytes.

# prtconf
System Configuration:  Sun Microsystems  sun4u
Memory size: 6144 Megabytes
System Peripherals (Software Nodes):
...
...
#

The amount of memory allocated to the kernel can be found by using the sar -k command and totaling all of the alloc columns. The output is in bytes.
# sar -k
SunOS pansco-sdm 5.6 Generic_105181-13 sun4u    07/15/99

00:00:01   sml_mem     alloc  fail     lg_mem      alloc  fail  ovsz_alloc  fail
01:00:01  68993024  51091032     0  228032512  150674684     0    29458432     0
02:00:01  69902336  44424744     0  228089856  114794740     0    29589504     0
03:00:01  70803456  37963860     0  228384768   79356972     0    29655040     0
04:00:00  64528384  32227740     0  139239424   71898524     0    29622272     0
05:00:01  64176128  31738052     0  138182656   72442176     0    29589504     0
06:00:01  63840256  32549792     0  134479872   74934108     0    30179328     0
07:00:03  62021632  33294448     0  118980608   76080604     0    30801920     0
08:00:05  61349888  33669428     0  117096448   75169864     0    31293440     0
08:20:02  61218816  34200836     0  116613120   75783280     0    31555584     0
08:40:01  61571072  38224384     0  115449856   97785736     0    31686656     0
09:00:02  61325312  38983276     0  114278400   98494340     0    31260672     0
09:20:02  60047360  36846884     0  114139136   86480376     0    31326208     0
09:40:01  58433536  34901304     0  113737728   76232028     0    32120832     0
10:00:01  58023936  34809740     0  113131520   74745224     0    31653888     0
10:20:01  57958400  33979004     0  112869376   74450408     0    31588352     0

Average   62946236  36593635     0  142180352   86621538     0    30758775     0
#

Totaling the averages of small, large, and oversize allocations, the kernel has grabbed 153 Mbytes of memory.
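The 153-Mbyte figure can be checked by summing the Average row of the three alloc columns (a quick arithmetic sketch; values are in bytes, and the total is expressed in decimal Mbytes as the text counts them).

```python
# Average alloc values from the sar -k report above (bytes).
sml_alloc = 36593635
lg_alloc = 86621538
ovsz_alloc = 30758775

total_bytes = sml_alloc + lg_alloc + ovsz_alloc
total_mbytes = total_bytes // 10**6  # decimal Mbytes

print(total_mbytes)  # → 153
```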

Knowing how much memory an application uses allows you to predict the memory requirements when running more users. Use the pmap command or Memtools pmem functionality to determine how much memory a process is using and how much of that memory is shared. The PID of the process, 3639, was obtained using the ps command.
# /usr/proc/bin/pmap -x 3639
3639:   sas
Address   Kbytes Resident Shared Private Permissions        Mapped File
00002000       8        8      -       8 read               [ anon ]
00010000    2048     2000   1848     152 read/exec          sas
0021E000      56       32      8      24 read/write/exec    sas
0022C000    1296       16      -      16 read/write/exec    [ heap ]
ED870000     336      336    336       - read/exec/shared   dev:158,34006 ino:3073
ED8C4000       8        8      -       8 read/write/exec    dev:158,34006 ino:3073
ED8C6000      16        8      -       8 read/write         dev:158,34006 ino:3073
ED8D0000      16       16     16       - read/exec/shared   dev:158,34006 ino:3241
ED8D4000       8        -      -       - read/write/exec    dev:158,34006 ino:3241
ED8E0000     296       24      8      16 read/exec/shared   dev:158,34006 ino:3169
ED92A000       8        -      -       - read/write/exec    dev:158,34006 ino:3169
ED92C000      24        -      -       - read/write         dev:158,34006 ino:3169
ED940000     256       32      -      32 read/write/exec    [ anon ]
ED990000     200      152    152       - read/exec/shared   dev:158,34006 ino:3063
ED9C2000       8        -      -       - read/write/exec    dev:158,34006 ino:3063
ED9C4000      24       24      -      24 read/write         dev:158,34006 ino:3063
ED9D0000     184       96     16      80 read/exec/shared   dev:158,34006 ino:3156
ED9FE000       8        -      -       - read/write/exec    dev:158,34006 ino:3156
EDA00000       8        -      -       - read/write         dev:158,34006 ino:3156
EDA10000     280      208     56     152 read/exec/shared   dev:158,34006 ino:3157
EDA56000       8        -      -       - read/write/exec    dev:158,34006 ino:3157
EDA58000      96       80      -      80 read/write         dev:158,34006 ino:3157
EDA80000     512       16      -      16 read/write/exec    [ anon ]
EDB10000       8        8      8       - read/exec/shared   dev:158,34006 ino:3116
EDB12000       8        -      -       - read/write/exec    dev:158,34006 ino:3116
EDB20000      96        8      8       - read/exec/shared   dev:158,34006 ino:3061
EDB38000       8        -      -       - read/write/exec    dev:158,34006 ino:3061
EDB3A000       8        -      -       - read/write         dev:158,34006 ino:3061
...
...
EF7D0000     112      112    112       - read/exec          ld.so.1
EF7FA000       8        8      8       - read/write/exec    ld.so.1
EFFF4000      48       16      -      16 read/write/exec    [ stack ]
--------  ------   ------ ------  ------
total Kb   20376     9904   6712    3192

The output of the pmap command shows that the SAS process is using 9904 Kbytes of real memory. Of this, 6712 Kbytes is shared with other processes on the system, using shared libraries and executables.
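One way to turn these pmap numbers into a sizing estimate (a sketch of the reasoning, not a course tool): count the shared portion once and the private portion once per concurrent process.

```python
# From the pmap -x totals above (Kbytes).
shared_kb = 6712
private_kb = 3192

def projected_kb(nusers, shared=shared_kb, private=private_kb):
    """Rough resident-memory estimate for nusers concurrent SAS runs,
    assuming each instance has similar private requirements."""
    return shared + nusers * private

# A single instance reproduces the 9904-Kbyte resident figure.
print(projected_kb(1))    # → 9904
# 200 users stay well under the 6 Gbytes of installed memory.
print(projected_kb(200))  # → 645112 Kbytes
```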

The pmap command also shows that the SAS process is using 3192 Kbytes of private (non-shared) memory. Another instance of SAS will consume another 3192 Kbytes of memory (assuming its private memory requirements are similar). Memory requirements for the 200 users the customer has set up to run SAS would be 638,400 Kbytes (or about 638 Mbytes). This system has 6 Gbytes of memory, which is more than enough to support 200 simultaneous SAS users.

The DNLC caches the most recently referenced directory entry names and the associated vnode for each directory entry. The number of name lookups per second is reported as namei/s by the sar -a command.

# sar -a
SunOS pansco-sdm 5.6 Generic_105181-13 sun4u    07/15/99

00:00:01   iget/s  namei/s  dirbk/s
01:00:01        1      240     1079
02:00:01        0      167     1108
03:00:01        0      167     1100
04:00:00        0      162     1079
05:00:01        0      176     1080
06:00:01        1      204     1063
07:00:03        1      231      906
08:00:05        3      238      836
08:20:02        1      223      788
08:40:01       58      296      609
09:00:02        4      281      734
09:20:02        2      245      910
09:40:01        1      208      844
10:00:01        0      227      842
10:20:01        0      233      962

Average         3      209      982
#

If namei does not find a directory name in the Directory Name Lookup Cache (DNLC), it calls iget to get the inode for either a file or a directory. Most iget calls are the result of DNLC misses. iget/s reports the number of requests made for inodes that were not in the DNLC.

The vmstat -s command displays the DNLC hit rate and lookups that failed because the name was too long. The hit rate should be 90 percent or better. As you can see, on this system, the hit rate is only 78 percent.

# vmstat -s
      1156 swap ins
       916 swap outs
      2312 pages swapped in
     16226 pages swapped out
 591157504 total address trans. faults taken
  53993307 page ins
   8768923 page outs
 317283000 pages paged in
  69002475 pages paged out
   3089577 total reclaims
   3077302 reclaims from free list
         0 micro (hat) faults
 591157504 minor (as) faults
   2286240 major faults
   4181274 copy-on-write faults
  22339837 zero fill page faults
 345756783 pages examined by the clock daemon
       452 revolutions of the clock hand
 404559907 pages freed by the clock daemon
    300274 forks
      9104 vforks
    252881 execs
 209518071 cpu context switches
 345103090 device interrupts
 623248924 traps
 649934363 system calls
 156855213 total name lookups (cache hits 78%)
    181605 toolong
  27810894 user   cpu
  38464345 system cpu
  73582949 idle   cpu
  41121115 wait   cpu

For every entry in the DNLC, there is an entry in the inode cache. This allows the quick location of the inode entry, without having to read the disk.

If the cache is too small, it grows to hold all of the open entries. Otherwise, closed (unused) entries remain in the cache until they are replaced. This is beneficial, for it allows any pages from the file to remain available in memory; a file can be opened and read many times with no disk I/O occurring. The inode cache size is set by the kernel parameter ufs_ninode. You can see inode cache statistics by using netstat -k to dump out the raw kernel statistics information. maxsize is the variable equal to ufs_ninode. A maxsize reached value higher than maxsize indicates that the number of active inodes has exceeded the cache size.
# netstat -k
kstat_types:
raw 0 name=value 1 interrupt 2 i/o 3 event_timer 4

segmap:
fault 1646083218 faulta 0 getmap 1531102064 get_use 8868192
get_reclaim 534564036 get_reuse 988319526
...
...
inode_cache:
size 3607 maxsize 17498 hits 770740 misses 591785
kmem allocs 95672 kmem frees 92065 maxsize reached 21216
puts at frontlist 847788 puts at backlist 316742 queues to free 0
scans 8873146 thread idles 587283 lookup idles 0 vget idles 0
cache allocs 591785 cache frees 666538 pushes at close 0
...
...
#
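The maxsize-reached test described above can be sketched as a tiny parser (a hypothetical helper of my own; the input line is copied from the inode_cache statistics above).

```python
# One inode_cache line from netstat -k output.
LINE = ("inode_cache: size 3607 maxsize 17498 hits 770740 misses 591785 "
        "kmem allocs 95672 kmem frees 92065 maxsize reached 21216")

def inode_cache_undersized(line):
    """True when 'maxsize reached' exceeds maxsize, i.e. active inodes
    have exceeded the configured cache size."""
    tokens = line.split()
    maxsize = int(tokens[tokens.index("maxsize") + 1])   # first occurrence
    reached = int(tokens[tokens.index("reached") + 1])
    return reached > maxsize

print(inode_cache_undersized(LINE))  # 21216 > 17498, so the cache is undersized
```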

Case Study: Memory Tuning
Any nonzero values for %ufs_ipf, as reported by sar -g, indicate that the inode cache is too small for the current workload. %ufs_ipf reports flushed inodes, that is, the percentage of (closed) UFS inodes flushed from the cache that had reusable file pages still in memory.

# sar -g
SunOS pansco-sdm 5.6 Generic_105181-13 sun4u    07/15/99

00:00:01  pgout/s  ppgout/s  pgfree/s  pgscan/s  %ufs_ipf
01:00:01     0.16      0.26   1087.63   1055.39      0.00
02:00:01     0.16      0.27     77.66     83.49      0.00
03:00:01     0.15      0.27     82.27     72.65      0.00
04:00:00     0.27      1.27    134.35    137.28      0.00
05:00:01     0.87      5.98    100.98     84.64      0.00
06:00:01     1.89     11.58    539.67    525.53      0.00
07:00:03    33.47    262.36   2816.19   2584.70      0.00
08:00:05   176.68   1383.28   6086.32   4931.94      0.25
08:20:02   205.35   1614.82   6049.20   4643.43      0.22
08:40:01   188.46   1497.26   5477.24   4238.56      0.17
09:00:02   192.56   1521.26   5507.22   4215.31      0.18
09:20:02   113.15    883.76   4027.47   3246.40      0.10
09:40:01   127.03   1004.41   4296.40   3380.89      0.10
10:00:01    68.85    545.83   3400.88   2861.87      0.00
10:20:01    15.34    120.68   2090.83   1996.26      0.00

Average     50.05    392.96   2052.27   1709.92      0.07
#

Increasing the size of the DNLC cache, ncsize, and the size of the inode cache, ufs_ninode, improves the cache hit rate and helps reduce the number of disk I/Os. The inode cache should be the same size as the DNLC. Recommended entries in /etc/system are:

set ufs_ninode=20000
set ncsize=20000

The UFS buffer cache (also known as the metadata cache) holds only inodes, cylinder groups, and indirect blocks (no file data blocks). The default buffer cache size allows up to 2 percent of main memory to be used for caching the file system metadata. The default is too high for systems with large amounts of memory.

Use the sysdef command to check the size of the cache, bufhwm.

# sysdef | grep bufhwm
126189568	maximum memory allowed in buffer cache (bufhwm)
#

Because only inodes and metadata are stored in the buffer cache, it does not need to be very large. In fact, you need only about 300 bytes per inode, and about 1 Mbyte per 2 Gbytes of files that are expected to be accessed concurrently. The size can be adjusted with the bufhwm kernel parameter, specified in Kbytes, in the /etc/system file. For example, if you have a database system with 100 files totaling 100 Gbytes of storage space and you estimate that only 50 Gbytes of those files will be accessed at the same time, then at most you would need 30 Kbytes (100 x 300 bytes = 30 Kbytes) for the inodes, and about 25 Mbytes (50 / 2 x 1 Mbyte = 25 Mbytes) for the metadata. Making the following entry into /etc/system to reduce the size of the buffer cache is recommended:

set bufhwm=28000

You can monitor the buffer cache hit rate using sar -b. The statistics for the buffer cache show the number of logical reads and writes into the buffer cache, the number of physical reads and writes out of the buffer cache, and the read/write hit ratios. Try to obtain a read cache hit ratio of 100 percent on systems with a few very large files, and a hit ratio of 90 percent or better for systems with many files.

# sar -b
SunOS pansco-sdm 5.6 Generic_105181-13 sun4u    07/15/99

00:00:01  bread/s  lread/s  %rcache  bwrit/s  lwrit/s  %wcache  pread/s  pwrit/s
01:00:01        0       24       99        3       10       74        5        7
02:00:01        0        7      100        1        6       79        6        2
03:00:01        0        7      100        1        7       83        5        0
04:00:00        0        6       99        1        8       82        5        0
05:00:01        0        6       97        3       12       79        5        0
06:00:01        1        9       91        6       30       79        5        0
07:00:03        1       32       97       12       56       78        4        0
08:00:05        0       37       99       14       57       76        4        0
08:20:02        1       28       98       14       52       74        4        0
08:40:01        3       69       95       15       66       78        3        0
09:00:02       14       31       55       13       55       76        3        0
09:20:02        1       20       97       12       58       79        4        0
09:40:01        0       24       99        9       36       75        4        0
10:00:01        0       16       99        9       41       77        4        0
10:20:01        0       12       98        8       43       81        5        0

Average         1       19       95        7       29       78        5        1
#
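The buffer cache sizing rule given earlier (about 300 bytes per inode plus 1 Mbyte of metadata per 2 Gbytes of concurrently accessed file data) can be sketched as arithmetic. This is my own illustration; the worked example reads the concurrent-access estimate as 50 Gbytes, which is the interpretation that makes the 25-Mbyte metadata figure come out.

```python
def bufhwm_estimate_kb(nfiles, concurrent_gb):
    """Rough bufhwm sizing: ~300 bytes per inode plus
    ~1 Mbyte of metadata per 2 Gbytes accessed concurrently."""
    inode_bytes = nfiles * 300
    metadata_kb = (concurrent_gb / 2) * 1024  # 1 Mbyte per 2 Gbytes
    return inode_bytes / 1024 + metadata_kb

# The worked example: 100 files, ~50 Gbytes accessed at once.
# bufhwm=28000 then leaves a little headroom over the estimate.
print(round(bufhwm_estimate_kb(100, 50)))  # → 25629 (Kbytes)
```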

The best RAM shortage indicator is the scan rate (sr) output from vmstat. A scan rate above 200 pages per second for long periods (30 seconds) indicates memory shortage. Make sure you have the most recent OS version or, at least, the most recent kernel patch installed.
# vmstat 2
 procs      memory             page             disk           faults       cpu
 r b  w    swap   free  re   mf    pi    po    fr de   sr s0 s1 s2 s3   in   sy   cs us sy id
 1 1 34  121984  41312  13 2612 11218  2439 14303  0 1528 12  0  0  6 1425 2872  926 15 21 63
 0 0 71 6866280  97728  37 4056 22808 11392 44380  0 4220  2  0  0  0 2391 5459  969 22 46 32
 0 1 71 6866280  94832  49 5049 28232 12948 47268  0 4347  6  1  0  0 2345 5559 1076 24 53 22
 0 2 71 6868128  92984   5 4935 31444 20432 55256  0 4467 13  0  0  0 2364 5530 1056 27 44 29
 0 1 71 6868416  95480   5 4621 30732 20100 53460  0 4287  4  0  0  0 2637 5138 1175 24 44 32
^C#

Excessive paging is evidenced by consistently nonzero values in the scan rate (sr) and page-out (po) columns. The po column reports the number of Kbytes paged out per second. As you can see from the output of the vmstat command, pageouts are quite large and the scan rate is well over 200 pages per second, indicating a memory shortage.
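The sustained-scan-rate rule can be sketched as a simple check (illustrative only; the sr values are taken from the vmstat samples above).

```python
# Scan-rate (sr) samples from the vmstat output above, pages/second.
SR = [1528, 4220, 4347, 4467, 4287]

def memory_shortage(sr_samples, threshold=200):
    """True when every sample stays above the 200 pages/second rule
    of thumb, i.e. the scanning is sustained rather than a spike."""
    return all(sr > threshold for sr in sr_samples)

print(memory_shortage(SR))
```

Every sample here is an order of magnitude over the threshold, which matches the memory-shortage reading in the text.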

By looking at the memory requirements of the SAS application (using /usr/proc/bin/pmap -x PID) and at the amount of memory allocated to the kernel (sar -k), you have determined that the system does have adequate memory. Recall that the page daemon begins scanning when the number of free pages drops below lotsfree. The free column of the vmstat command reports the size of the free memory list in Kbytes. These are pages of RAM that are immediately ready to be used whenever a process starts up or needs more memory. Increasing the size of lotsfree is appropriate in a system with large amounts of memory. This reduces the amount of scanning because scanning does not take place until the number of free pages falls below lotsfree. Increasing lotsfree to 256 Mbytes is recommended on this system. Make the following entry in the /etc/system file:

set lotsfree=0x10000000

Many of the sar command options also evaluate memory. sar -g reports the number of pages per second scanned by the page daemon (pgscan/s). If this value is high, the page daemon is spending a lot of time checking for free memory. This implies that more memory may be needed (or the kernel parameter lotsfree should be increased). The number of page-out requests per second (pgout/s) is also displayed, as well as the actual number of pages paged out per second (ppgout/s). The number of pages per second placed on the free list by the page daemon (pgfree/s) is also displayed.

# sar -g
SunOS pansco-sdm 5.6 Generic_105181-13 sun4u    07/15/99

00:00:01  pgout/s  ppgout/s  pgfree/s  pgscan/s  %ufs_ipf
01:00:01     0.16      0.26   1087.63   1055.39      0.00
02:00:01     0.16      0.27     77.66     83.49      0.00
03:00:01     0.15      0.27     82.27     72.65      0.00
04:00:00     0.27      1.27    134.35    137.28      0.00
05:00:01     0.87      5.98    100.98     84.64      0.00
06:00:01     1.89     11.58    539.67    525.53      0.00
07:00:03    33.47    262.36   2816.19   2584.70      0.00
08:00:05   176.68   1383.28   6086.32   4931.94      0.00
08:20:02   205.35   1614.82   6049.20   4643.43      0.00
08:40:01   188.46   1497.26   5477.24   4238.56      0.00
09:00:02   192.56   1521.26   5507.22   4215.31      0.18
09:20:02   113.15    883.76   4027.47   3246.40      0.00
09:40:01   127.03   1004.41   4296.40   3380.89      0.00
10:00:01    68.85    545.83   3400.88   2861.87      0.00
10:20:01    15.34    120.68   2090.83   1996.26      0.00

Average     50.05    392.96   2052.27   1709.92      0.04
#

It is normal to have high paging and scan rates with heavy file system usage, as is the case in this customer's environment. However, this can have a very negative effect on application performance. The effect can be noticed as poor interactive response, thrashing of the swap disk, and low CPU utilization due to the heavy paging. This happens because the page cache is allowed to grow to the point where it steals memory pages from important applications. Priority paging addresses this problem. Recall, priority paging is a new paging algorithm that allows main memory to operate as a sort of Harvard cache, providing preferential treatment for instructions (non-file pages) over data (file pages). The priority paging algorithm allows the system to place a boundary around the file cache, so that file system I/O does not cause paging of applications.

By default, priority paging is disabled. To use priority paging, customers need the Solaris 7 OE, the Solaris 2.6 OE with kernel patch 105181-13 or higher, or the Solaris 2.5.1 OE with kernel patch 103640-25 or higher. This customer is running the Solaris 2.6 OE, so the first step is to load kernel patch 105181-13 before priority paging can be enabled. Then, to enable priority paging, set the following in the /etc/system file:

set priority_paging=1

Priority paging introduces an additional watermark, cachefree. The default value for cachefree is twice lotsfree, or 512 Mbytes for this customer's system.

Case Study: File System Tuning
File system performance is heavily dependent on the nature of the application generating the load. Before configuring a file system, you need to understand the characteristics of the workload that is going to use it. The SAS application is an example of a data-intensive application. Data-intensive applications access very large files; large amounts of data are moved around without creating or deleting many files. SAS is essentially a decision support system (DSS). DSS workloads are characterized by large I/O sizes (64 Kbytes to 1 Mbyte) and predominantly sequential access.

For data-intensive file systems, the default of 16 cylinders per cylinder group is too small. Remember, the file system attempts to keep a file in a single cylinder group, so make sure that the cylinder groups are big enough. With the large file sizes of a DSS, the number of cylinders per cylinder group should be increased to the maximum of 32.

# newfs -c 32 /dev/rdsk/xxx

Most file systems implement a read-ahead algorithm in which the file system can initiate a read for the next few blocks as it reads the current block. Given the repeating nature of the sequential access patterns of a DSS, the file system should be tuned to increase the number of blocks that are read ahead. The UFS file system uses the cluster size (the maxcontig parameter) to determine the number of blocks that are read ahead. This defaults to seven 8-Kbyte blocks (56 Kbytes) in Solaris OE versions before Solaris 2.6. In the Solaris 2.6 OE, the default changed to the maximum transfer size supported by the underlying device, which defaults to sixteen 8-Kbyte blocks (128 Kbytes) on most storage devices. The default values for read ahead are often too low and should be set very large to allow optimal read rates.
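The read-ahead arithmetic above can be stated directly (a sketch of the calculation only, not a tuning tool): maxcontig counts 8-Kbyte file system blocks.

```python
BLOCK_KB = 8  # UFS file system block size in Kbytes

def read_ahead_kb(maxcontig_blocks):
    """Read-ahead (cluster) size implied by a maxcontig setting."""
    return maxcontig_blocks * BLOCK_KB

print(read_ahead_kb(7))   # pre-Solaris 2.6 default: 56 Kbytes
print(read_ahead_kb(16))  # Solaris 2.6 default: 128 Kbytes
```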

The customer has decided to use RAID 0+1 (striping plus mirroring) on the file systems. Sequential reads improve through striping, and the customer also wants the high reliability and failure immunity of mirroring. RAID 0+1 combines the striping features of RAID 0 with the mirroring features of RAID 1. RAID 0+1 also avoids the problem of writing data sequentially across the volumes, potentially making one drive busy while the others on that side of the mirror are idle. Cluster size should equal the number of stripe members per mirror multiplied by the stripe size. Two stripe members with a stripe size of 128 Kbytes results in an optimal cluster size of 512 Kbytes. The -C option to newfs or the -a option to tunefs sets the cluster size.

# newfs -C 512 /dev/rdsk/xxx

SCSI drivers in the Solaris OE limit the maximum size of a SCSI transfer to 128 Kbytes by default. Even if the file system is configured to issue 512-Kbyte requests, the SCSI drivers break the requests into smaller 128-Kbyte chunks. Cluster sizes larger than 128 Kbytes also require the maxphys parameter to be set in /etc/system. The following entry in /etc/system provides the necessary configuration for larger cluster sizes:

set maxphys = 16777216


Performance Tuning Summary Lab L10


Objectives
Upon completion of this lab, you should be able to:
- Identify the stress on system memory
- Use vmstat reports to determine the state of the system
- Determine the presence and cause of swapping
- Use sar reports to determine system performance

vmstat Information
Use the lab10 directory for these exercises. Complete the following steps:

1. Type: vmstat -S 5

   Use this output to answer the following questions.

   How many threads are runnable? ___________________________

   How many threads are blocked waiting on disk I/O? __________

   How many threads are currently swapped out? _______________

   How much swap space is available? __________________________

   How much free memory is available? _________________________

   Is the system paging in? _____________________________________

   Is the system paging out? ____________________________________

   Where is the system spending most of its time (user mode, kernel mode, or idle)?
   _______________________________________________________

L10-2

Solaris System Performance Management


Copyright 2001 Sun Microsystems, Inc. All Rights Reserved. Enterprise Services, Revision C.1

L10
2. Type: ./load11.sh

   Note - load11.sh may take 3 to 4 minutes to complete.

   Note any changes in the vmstat output for the following areas:

   Pagein activity ________________________________

   Pageout activity _______________________________

   Amount of free memory _________________________

   Disk activity __________________________________

   Interrupts _____________________________________

   System calls ___________________________________

   CPU (us/sy/id) ________________________________

sar Analysis I
Complete the following steps. In this part of the lab, use sar to read files containing performance data. This data was recorded by sar using the -o option. The data was recorded in binary format. Each sar file was generated using a different benchmark.

1. Type: sar -u -f sar.file1

   What kind of data is sar showing you?
   ___________________________________________________________

   In what mode was the CPU spending most of its time?
   ___________________________________________________________

   Look at the %wio field. What does this field represent? Is there a pattern?
   ___________________________________________________________
   ___________________________________________________________
   ___________________________________________________________

2. Type the following command to display the CPU load statistics:

   sar -q -f sar.file1

   Look at the size of the run queue as the benchmark progresses. What trend do you see? What does this indicate about the load on the CPU?
   ___________________________________________________________
   ___________________________________________________________
   ___________________________________________________________

   Note the amount of time the run queue is occupied (%runocc). Does this support your conclusions from the previous step?
   ___________________________________________________________
   ___________________________________________________________

   What is the average length of the swap queue? The size of the swap queue indicates the number of LWPs that were swapped out of main memory. What might the size of the swap queue tell you with respect to the demand for physical memory? (If this field is blank, then the swap queue was empty.)
   ___________________________________________________________
   ___________________________________________________________
   ___________________________________________________________

Performance Tuning Summary Lab


3. Type the following command to format the statistics pertaining to disk I/O:
sar -d -f sar.file1
Was the disk busy during this benchmark? (Did the benchmark load the I/O subsystem?) Use the %busy field to make an initial determination.
___________________________________________________________
Was the disk busy with many requests? (See the avque field.) Did the number of requests increase as the benchmark progressed?
___________________________________________________________
As the benchmark progressed, were processes forced to wait longer for their I/O requests to be serviced? (Use avwait and avserv.) What trend do you notice in these statistics?
___________________________________________________________
___________________________________________________________
Was the I/O subsystem under a load? What was the average size of a request?
___________________________________________________________
4. Type the following command to display memory statistics:
sar -r -f sar.file1
What unit is the freeswap field in? ____________________________
What was the average value of freemem? ______________________
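As a reminder when answering the units question: freemem is reported in pages and freeswap in 512-byte disk blocks. Converting both to Mbytes takes only shell arithmetic. The values below are hypothetical samples; on a live system the page size comes from the pagesize command (8192 bytes on many UltraSPARC systems).

```shell
freemem_pages=1520      # hypothetical average freemem from sar -r
freeswap_blocks=934736  # hypothetical average freeswap from sar -r
pagesize=8192           # bytes per page; on a live system run: pagesize

# 1048576 bytes per Mbyte; integer division truncates.
freemem_mb=$((freemem_pages * pagesize / 1048576))
freeswap_mb=$((freeswap_blocks * 512 / 1048576))
echo "freemem is about ${freemem_mb} Mbytes, freeswap about ${freeswap_mb} Mbytes"
```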

5. Type the following command to format more memory usage statistics:
sar -g -f sar.file1
Was the paging activity steady? (Did the system page at a steady rate?) Explain.
___________________________________________________________
___________________________________________________________
___________________________________________________________
___________________________________________________________
You have just checked three major areas in the SunOS operating system that can affect performance (CPU, I/O, and memory use). You can use this approach to gather initial data on a suspect performance problem. Given the data you have just collected, what can you conclude about the load on the system?
___________________________________________________________
___________________________________________________________

sar Analysis II
Perform the same analysis as in the previous section on the sar.file2 file.
1. Type:
sar -u -f sar.file2
What kind of data is sar showing you?
___________________________________________________________
In what mode was the CPU spending most of its time?
___________________________________________________________
Look at the %wio field. What does this field represent? Is there a pattern?
___________________________________________________________
___________________________________________________________
___________________________________________________________
2. Type the following command to display the CPU load statistics:
sar -q -f sar.file2
Look at the size of the run queue as the benchmark progresses. What trend do you see? What does this indicate about the load on the CPU?
___________________________________________________________
___________________________________________________________
___________________________________________________________

Note the amount of time the run queue is occupied (%runocc). Does this support your conclusions from the previous step?
___________________________________________________________
___________________________________________________________
What is the average length of the swap queue? The size of the swap queue indicates the number of LWPs that were swapped out of main memory. What might the size of the swap queue tell you about the demand for physical memory? (If this field is blank, then the swap queue was empty.)
___________________________________________________________
___________________________________________________________
___________________________________________________________
3. Type the following command to format the statistics pertaining to disk I/O:
sar -d -f sar.file2
Was the disk busy during this benchmark? (Did the benchmark load the I/O subsystem?) Use the %busy field to make an initial determination.
___________________________________________________________
Was the disk busy with many requests? (See the avque field.) Did the number of requests increase as the benchmark progressed?
___________________________________________________________
As the benchmark progressed, were processes forced to wait longer for their I/O requests to be serviced? (Use avwait and avserv.) What trend do you notice in these statistics?
___________________________________________________________
___________________________________________________________
Was the I/O subsystem under a load? What was the average size of a request?
___________________________________________________________

4. Type the following command to display memory statistics:
sar -r -f sar.file2
What unit is the freeswap field in? ____________________________
What was the average value of freemem? ______________________
5. Type the following command to format more memory usage statistics:
sar -g -f sar.file2
Was the paging activity steady? (Did the system page at a steady rate?) Explain.
___________________________________________________________
___________________________________________________________
___________________________________________________________
___________________________________________________________
What can you conclude about the load on the system? Which areas, if any, does this benchmark load?
___________________________________________________________
___________________________________________________________
___________________________________________________________
___________________________________________________________
___________________________________________________________

sar Analysis III
Perform the same analysis as in the previous sections on the sar.file3 file.
1. Type:
sar -u -f sar.file3
What kind of data is sar showing you?
___________________________________________________________
In what mode was the CPU spending most of its time?
___________________________________________________________
Look at the %wio field. What does this field represent? Is there a pattern?
___________________________________________________________
___________________________________________________________
___________________________________________________________
2. Type the following command to display the CPU load statistics:
sar -q -f sar.file3
Look at the size of the run queue as the benchmark progresses. What trend do you see? What does this indicate with respect to the load on the CPU?
___________________________________________________________
___________________________________________________________
___________________________________________________________

Note the amount of time the run queue is occupied (%runocc). Does this support your conclusions from the previous step?
___________________________________________________________
___________________________________________________________
What is the average length of the swap queue? The size of the swap queue indicates the number of LWPs that were swapped out of main memory. What might the size of the swap queue tell you with respect to the demand for physical memory? (If this field is blank, then the swap queue was empty.)
___________________________________________________________
___________________________________________________________
___________________________________________________________
3. Type the following command to format the statistics pertaining to disk I/O:
sar -d -f sar.file3
Was the disk busy during this benchmark? (Did the benchmark load the I/O subsystem?) Use the %busy field to make an initial determination.
___________________________________________________________
Was the disk busy with many requests? (See the avque field.) Did the number of requests increase as the benchmark progressed?
___________________________________________________________
As the benchmark progressed, were processes forced to wait longer for their I/O requests to be serviced? (Use avwait and avserv.) What trend do you notice in these statistics?
___________________________________________________________
___________________________________________________________
Was the I/O subsystem under a load? What was the average size of a request?
___________________________________________________________

4. Type the following command to display memory statistics:
sar -r -f sar.file3
What unit is the freeswap field in? ____________________________
What was the average value of freemem? ______________________
5. Type the following command to format more memory usage statistics:
sar -g -f sar.file3
Was the paging activity steady? (Did the system page at a steady rate?) Explain.
___________________________________________________________
___________________________________________________________
___________________________________________________________
___________________________________________________________
What can you conclude about the load on the system? Which areas, if any, does this benchmark load?
___________________________________________________________
___________________________________________________________
___________________________________________________________
___________________________________________________________
___________________________________________________________


System Monitoring Tools


Objectives
Upon completion of this lab, you should be able to:
- Execute performance monitoring tools native to the Solaris Operating Environment (OE)
- Start system accounting
- Install the SE Toolkit
- Install the memtool package
- Use the prstat command
- Use the sar command to read data saved to a file
Tasks
Using the Tools
Complete the following steps, logging in at the command line as root.
1. Change to the lab1 directory for this lab.
2. List which tools you would use to look at the following data elements (tick the appropriate boxes in Table LF1-1):

Table LF1-1 Data Elements and Tools

Data Element                 sar   iostat   vmstat   netstat   nfsstat
Tape device response time    [ ]    [ ]      [ ]       [ ]       [ ]
Free memory                  [ ]    [ ]      [ ]       [ ]       [ ]
Per CPU interrupt rates      [ ]    [ ]      [ ]       [ ]       [ ]
Network collision rates      [ ]    [ ]      [ ]       [ ]       [ ]
UFS logical reads            [ ]    [ ]      [ ]       [ ]       [ ]
NFS read response time       [ ]    [ ]      [ ]       [ ]       [ ]
I/O device error rates       [ ]    [ ]      [ ]       [ ]       [ ]
Semaphore activity           [ ]    [ ]      [ ]       [ ]       [ ]
Active sockets               [ ]    [ ]      [ ]       [ ]       [ ]

3. Read the man pages for sar and netstat to see how many options exist for both commands. Do not worry about understanding all the options.
4. Open two terminal windows on your workstation. Start iostat in one window and vmstat in the second window. Use a 15-second interval, and let the output display on the screen.

5. Open another terminal window, and run the load3.sh script, which is in your lab1 directory. When the script finishes, look at the script. Based on what it does, does the output from the tuning tools that you were running in Step 4 make sense?
There should have been a noticeable amount of disk I/O, as well as file system activity, and an amount of RAM would have been allocated to the find process.
6. Stop the tuning tools that you started in Step 4.

Enabling System Activity Data Capture
The Solaris OE allows system accounting to take place. The volume of data that is collected by system accounting can be extremely large, and the intrusive nature of the accounting processes affects other lab exercises. However, as part of the accounting package, a crontab file for the sys user is installed. This file is preconfigured to invoke system activity collection that can, at a later time, be displayed using the system activity reporting (sar) utility.
1. As the root user, issue the following commands:
# EDITOR=vi
# export EDITOR
# crontab -e sys
This starts a vi session with the sys account's crontab file. Uncomment each of the command lines in this crontab file, and then save and exit from the vi session. The crontab file is passed to the cron utility when the file has been saved.
2. Now, issue the following commands:
# vi /etc/init.d/perf
...<file edit takes place>...
# /etc/init.d/perf start
This starts a vi session with the performance run-control script. At the end of the file, there are lines that should be uncommented so that the /usr/lib/sa/sadc command is invoked at boot time.
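After the edit, the sys crontab entries typically look like the following (these are the default entries shipped with Solaris; the exact times on your system may differ). sa1 takes periodic snapshots into the daily data file, and sa2 writes out a report once a day.

```
0 * * * 0-6 /usr/lib/sa/sa1
20,40 8-17 * * 1-5 /usr/lib/sa/sa1
5 18 * * 1-5 /usr/lib/sa/sa2 -s 8:00 -e 18:01 -i 1200 -A
```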

Installing the SE Toolkit
The SE Toolkit is provided in the perf directory of your student account. To install it on your lab system, as root:
1. Change to the lab1 directory.
2. Uncompress the toolkit installation files using:
unzip setoolkit.zip
3. Add the toolkit package using:
pkgadd -d .
Add the RICHPse package using appropriate options.
4. Modify your PATH variable, in the $HOME/.profile file, to include the path name /opt/RICHPse/bin.
5. Test the installation by executing the following series of commands:
. $HOME/.profile
cd /opt/RICHPse/examples
se zoom.se
Note: The first of these three command lines is used to source the change made to the PATH variable in the $HOME/.profile file.

Installing memtool
The memtool package is provided in the memtool directory of your student account. To install it on your lab system, as root:
1. Change to the memtool directory. Uncompress and untar the toolkit installation files using:
zcat RMCmem3.7.5Beta1.tar.Z | tar -xvf -
2. Add the memtool package using:
pkgadd -d .
3. Respond y to all prompts about commands to be started at boot time. When asked if you want to load the kernel module now, respond yes. When asked if you wish to support the development efforts, you should respond no.
4. Verify that the kernel module was successfully loaded:
# modinfo | grep -i memtool
209 102bff10 3659 1 bunyipmod (MemTool Module %I% %E%)

Using the prstat Command
The prstat command is provided with the Solaris 8 OE. Unlike the ps command, it runs continually until terminated by the user, and displays a list of all processes on the system. The listing is restricted to the current terminal display. The prstat command can display the process listing in sorted order and provides a number of options to sort by appropriate values. To view the information produced by prstat, follow these steps:
1. Issue the command:
prstat
Observe the current list of processes in the display. After a period of 30 seconds, quit from the program by typing the letter q.
2. Read the man page for prstat using the command:
man prstat
Evaluate which options should be used to sort the display output by using the information in Table LF1-2:

Table LF1-2 Sort Key Values and Options

Sort Key Value            Options
CPU usage                 -s cpu
Size of process image     -s size
Execution time            -s time

When you have evaluated which options should be used, invoke the command using each of the options. Observe the differences in the display.
Which process is using (or has used) the most CPU resource?
It will probably be the Xsun process. Because this process manages all of the CDE (Motif) graphical controls, it will probably be the most heavily utilized.
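You cannot easily script prstat's screen output, but the sort order it applies is simple to reproduce. The sketch below sorts a hypothetical snapshot of PID, image size (in Kbytes), and command name the way prstat -s size orders its display (largest image first); the process names and sizes are invented for illustration.

```shell
# Hypothetical "pid size(K) command" snapshot, invented for illustration.
cat > /tmp/proc_sizes.txt <<'EOF'
213 24864 Xsun
401 61752 java
188 2312 vold
EOF

# Numeric, descending sort on the size column, as prstat -s size would apply.
sort -k2,2 -n -r /tmp/proc_sizes.txt
biggest=$(sort -k2,2 -n -r /tmp/proc_sizes.txt | awk 'NR==1 {print $3}')
echo "largest image: $biggest"
```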

Which process has the largest process image size? Use prstat -s size to display the information sorted into size order. The largest process size will probably be a java process or the Xsun process.

Using the sar Command to Read Data From a Saved File


The sar command can save data to a file for reporting at a later time. In the lab1 directory is a file called sar.out. This file contains a set of data captured by the sar program. First, read the entire contents of the file, using sar, to see how the data is displayed. Second, extract specific types of data, such as CPU or buffer statistics, to show that subsets of the data contained in the file can be reported.
1. Change the directory to the lab1 directory.
2. Invoke the sar command to read all of the data in the file, using the command:
sar -A -f sar.out | more
3. Use the appropriate keyboard instructions for more to page through the data that is being displayed.
4. After you have seen the entire display output, issue the following commands, individually, to display subsets of the sar data:
sar -u -f sar.out | more
sar -b -f sar.out | more
sar -q -f sar.out | more
Was the CPU busy at the time when the run queue was occupied?
Not really. At 10:25:57, there is a run queue value but the CPU, at that time, shows 74 percent idle. Use the sar -q and sar -u options to view the appropriate results.

Were the buffers busy at the time when the run queue was occupied?
The buffers were being used by logical reads and logical writes in the few seconds just before the run queue value is displayed. But the buffers are not active at the time of the run queue entry (10:25:57). The queued entry, therefore, might not be related to file system activity.


Processes and Threads Lab


Objectives
Upon completion of this lab, you should be able to:
- Understand the relationship of maxusers to other tuning parameters
- Understand output from the lockstat command
- Understand various methods of displaying process information

Tasks
Using the maxusers Parameter
You use the maxusers parameter to set the size of several system tables. Some of the system parameters that the maxusers parameter modifies directly are:
- max_nprocs: the maximum number of processes
- ufs_ninode: the size of the inode cache
- maxuprc: the maximum number of processes per user
- ncsize: the size of the directory name lookup cache

Firstly, to see the current values of these system parameters, you will make use of the debugging tools adb and mdb. The reason for using both tools is to show that they can both produce the same output, if required. The mdb program is new to the Solaris 8 Operating Environment (OE) and is the recommended debugging tool for this release.
In the first part of the lab, you display and alter the system parameters by changing the value of maxusers. In the second part of the lab, you learn how the lockstat command displays the locks that have been applied to an executing process. During this lab, you observe the effect of the priority of the serial port communication. The final part of the lab concentrates on displaying process information. For this part of the lab, you use several tools, some of which are supplied by default with the Solaris 8 OE.
Note: For this lab, the manual pages should have been installed on your system. If not, consult with the instructor.

Complete the following steps:
1. You must be logged in as the root user and be working in a Common Desktop Environment (CDE) session.
2. Change to the lab2 directory for this lab.
3. Type ./maxusers.adb.sh. This lists the current value of maxusers and some of the parameters it affects. Record the values.
pidmax      ____________
maxusers    ____________
physmem     ____________
max_nprocs  ____________
ufs_ninode  ____________
maxuprc     ____________
ncsize      ____________

The output will vary depending on the size of RAM for your system.
4. Type ./maxusers.mdb.sh, and compare the output to that shown by the preceding command.
5. Run the sysdef command, and save the output to a file:
# sysdef > /var/tmp/sysdef1
6. Make a copy of the /etc/system file using the following command:
# cp /etc/system /etc/system.original
7. In the /etc/system file, insert a line so that 50 is added to the current value of maxusers as discovered in Step 3 (where new_value = current_value + 50):
set maxusers=new_value
8. Reboot the system.
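The defaults you recorded in Step 3 are derived from maxusers by simple formulas, documented in the Solaris Tunable Parameters Reference Manual: max_nprocs = 10 + 16 * maxusers; maxuprc = max_nprocs - reserved_procs (reserved_procs defaults to 5); and ufs_ninode and ncsize both default to 4 * (max_nprocs + maxusers) + 320. A quick sketch in shell arithmetic, using a hypothetical maxusers of 99, shows how the values relate:

```shell
maxusers=99                                   # hypothetical value for this sketch
max_nprocs=$((10 + 16 * maxusers))            # maximum number of processes
maxuprc=$((max_nprocs - 5))                   # reserved_procs defaults to 5
ncsize=$((4 * (max_nprocs + maxusers) + 320)) # directory name lookup cache
ufs_ninode=$ncsize                            # inode cache defaults to ncsize
echo "max_nprocs=$max_nprocs maxuprc=$maxuprc ncsize=$ncsize"
```

Adding 50 to maxusers therefore raises max_nprocs by 16 * 50 = 800, which is exactly the change you should see in the sysdef output after the reboot.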

9. Verify the changes by executing the ./maxusers.mdb.sh script.
10. Run the sysdef command again, saving the output to a file different from that used in Step 5. Run diff on the two files to identify any changes in the sysdef output.
# sysdef > /var/tmp/sysdef2
# diff /var/tmp/sysdef1 /var/tmp/sysdef2 | more
The output should look similar to this:
# diff /var/tmp/sysdef1 /var/tmp/sysdef2 | more
80d79
< lockstat, instance #0
347c346
<     1594    maximum number of processes (v.v_proc)
---
>     2394    maximum number of processes (v.v_proc)
349c348
<     1589    maximum processes per user id (v.v_maxup)
---
>     2389    maximum processes per user id (v.v_maxup)
#
11. Restore the /etc/system file to its original form using the copy made in Step 6, and reboot the system.

Using the lockstat Command
Complete the following steps:
1. To view output from lockstat, start a desktop terminal window under lockstat using the command:
# lockstat /usr/dt/bin/dtterm &
Remember, you do not see any output from lockstat until the dtterm window exits.
2. After the window opens, change the size of the display font to 16 point using the Options -> Font Size window controls.
3. Resize the window by dragging one corner.
4. Continue moving the dtterm window around the screen for a period of three seconds.
5. Close the dtterm window using an appropriate window-menu option. The output should look similar to this:
# lockstat /usr/dt/bin/dtterm &
[1] 789
#
Adaptive mutex block: 10 events in 27.214 seconds (0 events/sec)

Count indv cuml rcnt     nsec Lock       Caller
----------------------------------------------------------------------
    5  50%  50% 1.00    82800 0xf5c030d0 clock+0x204
    2  20%  70% 1.00    97250 0xf59a3000 callout_schedule_1+0x8
    1  10%  80% 1.00    37000 0xf59a0e28 esp_intr+0xa8
    1  10%  90% 1.00   100500 0xf59a0510 esp_intr+0x1c
    1  10% 100% 1.00    86000 0xf5c031c8 clock+0x204
----------------------------------------------------------------------
[1] + Done    lockstat /usr/dt/bin/dtterm &
6. Press the Return key.

Monitoring Processes
Complete the following steps:
1. Display the most active processes on your system using the command:
# /usr/ucb/ps -aux | head
2. Answer the following questions:

What is the most active process on your system? This is the process that is consuming the most CPU time. The most active process on the system will probably be the /usr/openwin/bin/Xsun process. This process will probably have consumed most CPU time.

What percentage of CPU time is used by the most active process on your system? This is shown in the %CPU column. The percentage of CPU time is determined as the percentage of CPU time allocated while the ps command was obtaining data. The %CPU column does not show a cumulative value. However, the prstat command will show an averaged %CPU value.

What percentage of the total RAM in your system is in use by the most active process? This is shown in the %MEM column. This will be system-dependent, based on the amount of RAM that your system has installed. Are there any processes that are blocked waiting for disk I/O? How can you tell? Processes that have very low TIME values yet have a start time (START) that occurred several minutes or hours ago are probably blocked on I/O.


Is your system doing any unnecessary work? Can you disable any unused daemons? How would you prevent unused daemons from starting at boot time? You prevent unused daemons from starting at boot time by disabling the run-control scripts in the /etc/init.d (or related) directory.

What is the advantage of using system accounting to monitor processes over the occasional use of the ps command? System accounting allows all events to be monitored and recorded over a period of time. The ps command, however, only takes a snapshot of the current process table and might not include processes that have exited recently.

3. Monitor processes using the following SE Toolkit programs:


- msacct.se
- pea.se
- ps-ax.se
- ps-p.se
- pwatch.se

4. List processes using /usr/bin/ps by giving the following command:
# /usr/bin/ps -eL | more
Identify processes that are using LWPs and list some of them below:
Here are some examples of multithreaded processes that may be running LWPs on your system: dwhttpd, snmpXdmid, dmispd, mibiisa, dtlogin, dtwm, ttsession, powerd, vold, statd, syslogd, and devfseventd.
5. Verify the time allocated to the entire process and the time allocated to each LWP of that process using the following two commands:
# /usr/bin/ps -e | grep mib
# /usr/bin/ps -eL | grep mib
Compare the times in the third and fourth columns for each of the command outputs. Do they add up to the same (total) value?
They may not. The reason is that the time value for the process is rounded up or down to the nearest complete second. When LWPs are displayed in the list, each LWP time value is rounded up or down.
Is time allocated evenly across all of the LWP entries, or does one entry have more time than the other LWPs?
If each LWP and the associated threads are performing varying tasks, then there could be varying amounts of time allocated to each LWP.
6. To see how the two versions of ps differ in their output, use the counter.sh script. This script invokes the counter script, which increments an integer variable from the value 1 through to 100,000. While the counter is being incremented, the /usr/ucb and /usr/bin versions of ps display the details of the current terminal's processes. Execute the script in the lab2 directory, using the following command:
# ./counter.sh
After the script has completed, determine the PID value of the current shell by issuing the following command:
# echo $$
While the script was running, output will have been produced by both versions of the ps command. Compare the time values for the current PID (as determined by the echo command). Is there a difference in the TIME column of each output for the current shell?
Yes. The /usr/ucb/ps command accumulates the time of the current shell's child processes, whereas the /usr/bin/ps command shows time on a per-process basis only.

7. In the lab2 directory, there are two shell scripts called bourne and korn. The purpose of these scripts is to show how context switching and the fork/exec operation can severely affect processing. Because the Korn shell has an in-built arithmetic capability, it is much more efficient than the Bourne shell at performing arithmetic calculations.
To analyze the number of system calls and the number of fork/exec operations, issue the following command in a separate terminal window:
# sar -c 1 120
Then, in another terminal window, execute the Korn shell script and the Bourne shell script by issuing the following two command lines:
# ./korn
# ./bourne
While the commands are running, note the values displayed in the scall/s, fork/s, and exec/s columns.
Note: The following lab exercise could take up to 30 minutes to complete. Before starting this lab exercise, make sure that you have at least four terminal windows open on your CDE desktop.
8. Reboot the system.
9. In the lab2 directory is the catman.sh script. This shell script executes the catman command to create a whatis database of the manual page entries. The catman program performs many context switches when it forks subprocesses to perform its work. To invoke this shell script, issue the command:
# ./catman.sh
Follow the instructions displayed on the screen.
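The efficiency difference between the bourne and korn scripts can be sketched as follows (the loop bodies below are illustrative, not the lab scripts themselves): incrementing with expr forks a new process on every iteration, while the shell's built-in arithmetic does the same work without a single fork/exec, which is exactly the difference the fork/s and exec/s columns of sar -c make visible.

```shell
# Bourne-shell style: each increment fork/execs /usr/bin/expr.
i=0
while [ "$i" -lt 5 ]; do
    i=`expr $i + 1`
done

# Korn-shell style: in-shell arithmetic, no subprocess created.
j=0
while [ "$j" -lt 5 ]; do
    j=$((j + 1))
done

echo "bourne-style counted to $i, korn-style counted to $j"
```

Scaled to 100,000 iterations, the first loop performs 100,000 fork/exec pairs; the second performs none.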


CPU Scheduling Lab


Objectives
Upon completion of this lab, you should be able to:
- Read a dispatch parameter table
- Change a dispatch parameter table
- Examine the behavior of CPU-bound and I/O-bound processes, both individually and together
- Run and observe a process in the real-time scheduling class
Tasks
CPU Dispatching
Complete the following steps:
1. Display the configured scheduling classes by typing:
dispadmin -l
What classes are displayed?
The classes displayed should be:
SYS (System Class)
TS (Timesharing)
IA (Interactive)

2. Use dispadmin to display the timesharing dispatch parameter table by typing:
dispadmin -c TS -g
What is the longest time quantum for a timeshare thread?
By default, this should be the value 200 (with RES=1000), and the value should be associated with priority values 0 (zero) through 9.
3. Use the dispadmin command to display the timeshare dispatch parameter table so that the time quantums are displayed in clock ticks. What command did you use?
dispadmin -g -r 100 -c TS
The resolution value of 100 assumes 100 clock ticks per second. Notice that each time quantum is now one tenth of the original value displayed in Step 2.
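The -r option only rescales the reported values: a quantum shown as 200 at RES=1000 (millisecond units) is the same quantum shown as 20 at RES=100, because quantum_in_units = quantum_ms * RES / 1000. A one-line check of that arithmetic:

```shell
quantum_ms=200   # longest TS quantum as reported with RES=1000
res=100          # resolution requested with dispadmin -g -r 100
quantum_ticks=$((quantum_ms * res / 1000))
echo "200 ms is $quantum_ticks clock ticks at 100 ticks/second"
```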

4. Display the real-time dispatch parameter table by typing:
dispadmin -c RT -g
What is the smallest time quantum for a real-time thread?
By default, the value of 100 is the smallest time quantum (based on a RES value of 1000). This value will be associated with the priority value of 59 (equivalent to system priority 159).
5. Display the configured scheduling classes again. What has changed?
The Real-Time class is now displayed in the list because it has been used with earlier dispadmin commands. Once used, it is assumed to be current in the kernel's process controls.

Process Priorities
Recall how process priorities get assigned:
•  A process gets assigned a priority within its class (timeshare process priorities range from 0 through 59, with 59 being the highest priority). Each priority has an associated quantum, which is the amount of CPU time allocated, in ticks.

•  If a process exceeds its assigned quantum before it has finished executing, it is assigned a new priority, tqexp. The new priority, tqexp, is lower, but the associated quantum is higher, so the process gets more time.

•  If a process (or thread) blocks waiting for I/O, it is assigned a new priority, slpret, when it returns to user mode. slpret is a higher priority (in the 50s), allowing faster access to the CPU.

•  If a process (or thread) does not use its time quantum in maxwait seconds (a maxwait of 0 equals 1 second), its priority is raised to lwait, which is higher and ensures that the thread will get some CPU time.
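These per-priority parameters (tqexp, slpret, maxwait, lwait) are the columns of the TS dispatch table itself, and each data row that dispadmin -c TS -g prints ends with a "# priority" comment. A hedged sketch of pulling one row's tqexp out of such a table; the table fragment and its values here are illustrative placeholders, not an exact Solaris default table:

```shell
# Simulated fragment of a TS dispatch table (RES=1000), columns as
# printed by dispadmin -c TS -g:
#   quantum tqexp slpret maxwait lwait # PRIORITY
table='200 0 50 0 50 # 0
160 0 51 0 51 # 10
40 30 58 0 59 # 50'

# Pick out tqexp (column 2) for a given priority (trailing "# n" makes
# the priority field 7). The numbers above are illustrative only.
prio=50
tqexp=$(printf '%s\n' "$table" | awk -v p="$prio" '$7 == p { print $2 }')
echo "priority $prio: on quantum expiry, new priority (tqexp) = $tqexp"
```

On a live system you would pipe dispadmin -c TS -g through the same awk filter instead of the here-string.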

The purpose of this portion of the lab is to observe the process priorities during execution of a program. The two programs that you will be executing are dhrystone and io_bound.

The dhrystone program is an integer benchmark that is widely used for benchmarking UNIX systems. It provides an estimate of the raw performance of processors. (You might also be familiar with a floating-point benchmark called whetstone that was popular for benchmarking FORTRAN programs.) The dhrystone program is entirely CPU bound and runs very fast on machines with high-speed caches. Although it is a good measure for programs that spend most of their time in some inner loop, it is a poor benchmark for I/O-bound applications.

The io_bound program, as the name implies, is an I/O-intensive application. I/O-bound processes spend most of their time waiting for I/O to complete. Whenever an I/O-bound process wants the CPU, it should be given the CPU immediately. The scheduling scheme gives I/O-bound processes higher priority for dispatching.

Complete the following steps, as root, using the lab3 directory.

Note: If the dhrystone program runs too quickly to be monitored correctly (this is system-dependent), read the README file in the lab3 directory for directions on how to compensate.

1. Type:

   /usr/proc/bin/ptime ./io_bound

   Record the results.
   ___________________________________________________________
   ___________________________________________________________

2. Type:

   /usr/proc/bin/ptime ./dhrystone

   Record the results.
   ___________________________________________________________
   ___________________________________________________________

3. Using the elapsed time of both results, calculate the factor of time between dhrystone and io_bound. For example, if io_bound took ten times longer to execute (in real elapsed time) than dhrystone, then the factor is 10.

4. Run the dhrystone command. Pass a parameter of 500000 times the factor (shown as value in the command line below).

   ./dhrystone value

5. Use prstat 1 | egrep "(PID|dhry)" in another terminal window to view the dhrystone process's priority. What priority has it been assigned?

   This will be system-dependent. The priority value is probably in the SYS range.

   Why is it so low? (Get several snapshots of the process priority by using ps -cle.)

   The value probably varies somewhere in the range between 50 and 99.

6. When dhrystone has completed, type:

   ./io_bound

   This runs an I/O-intensive application.

7. Type prstat 1 | egrep "(PID|io_bound)" in another terminal window to view the priority of this process. What priority has it been assigned?

   This will be system-dependent but will probably be a low priority.

   Why? (Again, get several snapshots of the priority using ps -cle.)

   Because the process must wait on I/O, the priority stays much the same, except when it has data to process in RAM.
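The factor in Step 3 is just the ratio of the two real (elapsed) times, truncated to an integer. A minimal sketch; the two elapsed-time values are hypothetical sample readings you would replace with your own ptime output:

```shell
# Ratio of elapsed times: factor = io_bound elapsed / dhrystone elapsed.
# The two values below are hypothetical sample readings, in seconds.
io_elapsed=12.4
dhry_elapsed=1.2

factor=$(awk -v a="$io_elapsed" -v b="$dhry_elapsed" \
    'BEGIN { printf "%d\n", a / b }')       # truncate to an integer
echo "factor = $factor"
echo "dhrystone parameter for Step 4 = $(( factor * 500000 ))"
```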

8. Type:

   ./test.sh value

   where value is the value used in the dhrystone command in Step 4. This shell script runs both dhrystone and io_bound and stores the results in two files, io_bound.out and dhrystone.out. It then waits for both of them to complete. Record the results.
•  io_bound results
   _______________________________________________________
   _______________________________________________________

•  dhrystone results
   _______________________________________________________
   _______________________________________________________

Why did the dhrystone process take longer to execute when it was run at the same time as the io_bound process?
___________________________________________________________
___________________________________________________________
___________________________________________________________
___________________________________________________________

The system time is the time taken by system calls, kernel activity, and system-level processing. The user time is the time taken to process information in RAM, display information on-screen, and so on. The values you record will be system-dependent.

The Dispatch Parameter Table
Using the lab3 directory, perform the following steps as root:

1. Type:

   dispadmin -g -c TS > ts_config

   This saves the original timeshare dispatch parameter table.

2. Type:

   dispadmin -s ts_config.new -c TS

   You have just installed a new timeshare dispatch parameter table. Do not look at the new table at this time.

3. Type:

   ./test.sh

   Note the results. Given the results, can you make an educated guess as to the nature of the ts_config.new dispatch parameter table?

   The contents of the dispatch table are biased towards batch jobs. Further, as processes are given a higher priority after a maxwait period, the priority increases back up the table rather than jumping to a high (in the 50s) priority value. Look at ts_config.new to confirm.

4. Type:

   ps -cle

   Do you notice anything strange about some of the priorities you see? Which processes have very low priorities? Why?

   This varies from system to system. You must evaluate the impact of one process on another given the mix of active processes on your system. CPU-intensive processes will be most heavily affected by this scheduler table.

5. Type:

   dispadmin -s ts_config -c TS

   This restores the original table.

Real-Time Scheduling
Run this exercise as root from the lab3 directory. Use the following steps:

1. Run test.sh as a real-time process. Type:

   priocntl -e -c RT ./test.sh

   Note: Running these processes as real-time might cause your system to hang temporarily.

2. What is different from the timeshare case?

   Processes may hang for periods of time.

   Can you explain this?

   By making a process real-time, if that process is CPU-intensive, it will demand most (if not all) of the CPU's processing time until it has completed.

3. Record the results for io_bound and dhrystone.
•  io_bound results

   This varies from system to system.

•  dhrystone results

   This varies from system to system.

Processor Sets
This lab is designed to illustrate how you set up and populate processor sets with CPUs and processes. The lab also demonstrates the potential benefits on a moderately loaded system. Document the results that you obtain during the lab. The results vary depending on the number and speed of CPUs on your system.

Note: If your system is a single-processor system, do not attempt this lab. Best results are obtained on a system with more than two CPUs.

Run this exercise as root. Use the following steps:

1. Open two terminal windows, and note their process ID values. In each of the windows, type the following command lines:

   # exec csh
   % while 1
   ? end

2. In a third window, issue the following command:

   # mpstat 5

   Observe the output in the smtx column. On systems with more and faster CPUs, the values in the smtx column tend to be higher than on systems with fewer or slower CPUs. If there is a mix of systems in the classroom, compare the values between systems.

3. Execute zoom.se to see if a mutex problem has been detected.

4. Create a processor set using the following command:

   # psrset -c

   If the command is successful, you will see a message similar to:

   created processor set 1

5. Add a processor to the processor set:

   # psrset -a SID PROCID

   where SID represents the processor set ID, and PROCID represents the processor ID. To obtain the values for SID and PROCID, respectively, issue the following commands:

   # psrset -i
   # psrinfo -v

   When the processor has been added to the processor set, observe the values output by mpstat. Is there any activity on the CPU that has been added to the processor set? Can you explain this?

6. Migrate the two C shell windows to the processor set:

   # psrset -b 1 PID1 PID2

   where PID1 and PID2 represent the process ID values for the two windows created in Step 1. After the two processes have been migrated to the processor set, is there a reduction in the smtx activity across the remaining processors in the mpstat output?

Cache Lab
Objectives
Upon completion of this lab, you should be able to:
•  Calculate the cost of a cache miss, in both time and percentage of possible performance
•  View cache-related statistics using standard Solaris 8 Operating Environment (OE) tools
Tasks
Calculating Cache Miss Costs
Use the following figures to answer the questions in this exercise:

•  Cache hit cost: 12 nanoseconds
•  Cache miss cost: 540 nanoseconds

1. What speed will 100 accesses run at if all are found in the cache (100 percent hit rate)?

   1200 nanoseconds (for all 100 cache hits)

2. What is the cost in time of a 1 percent miss rate? In percentage performance?

   99 x 12 = 1188
   1 x 540 = 540
   1188 + 540 = 1728
   1728 - 1200 = 528
   528 / 1200 = 0.44

   Performance will degrade by 44 percent.

3. If a job was running three times longer than it should, and this was due completely to cache miss cost, what percentage hit rate does it achieve?

   The calculation would be as follows:

   (12 x n) + (540 x (100 - n)) = 3600

   This equates to:

   12n + 54000 - 540n = 3600
   12n - 540n = 3600 - 54000
   -528n = -50400
   528n = 50400
   n = 50400 / 528
   n = 95.45

   Therefore, hit rate (n) = 95.45%
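The arithmetic above generalizes to any hit/miss cost pair. A small sketch that recomputes the numbers from this exercise (12 ns hit, 540 ns miss, 100 accesses):

```shell
# Cache-miss cost arithmetic from the exercise above.
hit_ns=12 miss_ns=540 accesses=100

# Question 1 - all hits: the baseline time.
base=$(( accesses * hit_ns ))                      # 1200 ns

# Question 2 - 1 percent miss rate: 99 hits + 1 miss.
t=$(( 99 * hit_ns + 1 * miss_ns ))                 # 1728 ns
slowdown=$(awk -v t="$t" -v b="$base" \
    'BEGIN { printf "%.0f", (t - b) / b * 100 }')
echo "1% miss rate: ${t} ns, ${slowdown}% degradation"

# Question 3 - hit rate n for a 3x run time: solve 12n + 540(100-n) = 3600.
n=$(awk -v h="$hit_ns" -v m="$miss_ns" -v target=$(( 3 * base )) \
    'BEGIN { printf "%.2f", (m * 100 - target) / (m - h) }')
echo "hit rate for 3x run time: ${n}%"
```

This reproduces the answers worked out above: 1728 ns (a 44 percent degradation) and a 95.45 percent hit rate.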

Displaying Cache Statistics
The following lab exercises involve the use of the tools mentioned in this module. The purpose of this section of the lab is to allow you to become familiar with the output of the commands and to help you determine what information, contained in the command output, will be of use to you.

Another aspect of the lab is to demonstrate that there will always be times when a cache is not working as efficiently as you would like. The mix of system activities determines how much data is cached and available for use. Sometimes, a performance gain can be obtained by altering the system workload rather than altering kernel parameters or adding memory. During the lab exercise, you see how standard commands, such as find, can have a negative effect on the system's caches.

You should be logged in to the CDE windowing environment as the root user for this exercise. The lab4 directory should be your current working directory.

4. In a terminal window, issue the following command:

   # vmstat -s

   Write down the value associated with total name lookups.

5. In the same terminal window, issue the following command:

   # vmstat 5 | tee /tmp/vmstat_lab

6. In another terminal window, issue the following command:

   # ./finder

   Observe the page-in (pi) column of the vmstat output while the find command is executing. After the find command has completed, terminate the vmstat process you started in Step 5.

7. In a terminal window, issue the following command:

   # vmstat -s

   Compare the current value associated with total name lookups to the previous value (from Step 4). Has the cache hit percentage increased or decreased?

   It has probably decreased.

   Why?

   Because the file names that were being used are not currently stored in the cache. As the find process was executing, names of files and directories were accessed that had either never been used or had not been used in a long time. This, in turn, caused the cache-hit percentage value to fall.

8. In the same terminal window, issue the same command as before:

   # vmstat 5

9. In another terminal window, again issue the command:

   # find / > /dev/null

   Observe the page-in (pi) and minor-fault (mf) columns of the vmstat output while the find command is executing. Does it appear to be the same as before?

   The percentage might have changed, but probably only slightly. This is because most of the file and directory names are now stored in the cache.

   Is the CPU spending as much time in an idle (id) state?

   Probably not. The CPU should spend most of its time in a wait state rather than idle.

   Is the find command reading data from disk or from the cache (or both)?

   Both.

10. If your system is sun4u or sun4d architecture, run the prtdiag command as shown:

    # /usr/platform/`uname -m`/sbin/prtdiag

    Write down the size of the caches for each of your CPUs. (The size is displayed in the Ecache columns.) If you are uncertain of your system architecture, try either of these two commands:

    # uname -m
    # arch -k

11. Issue the following command:

    # prtconf -vp | grep -i cache

    The values are output in hexadecimal. Determine the decimal values for the following information:

    Level 1 Data Cache:
        dcache-line-size: ____________
        dcache-nlines: ____________
        Total cache size (line-size x nlines): ____________
        dcache-associativity: ____________

    Level 2 (External) Cache:
        ecache-line-size: ____________
        ecache-nlines: ____________
        Total cache size (line-size x nlines): ____________
        ecache-associativity: ____________

    Level 1 Instruction Cache:
        icache-line-size: ____________
        icache-nlines: ____________
        Total cache size (line-size x nlines): ____________
        icache-associativity: ____________

Note: You can use the bc command to convert hexadecimal values to decimal. (Use the instruction ibase=16.)

12. If there are systems in your classroom with different architectures, compare the results found on your system architecture to those on other system architectures.
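For example, converting a hypothetical ecache-nlines value of 0x8000 (note that with ibase set to 16, bc expects uppercase hex digits; printf gives the same result in plain shell):

```shell
# Hexadecimal-to-decimal conversion, two equivalent ways.
# 0x8000 is a hypothetical ecache-nlines value, not a real reading.
command -v bc >/dev/null && echo 'ibase=16; 8000' | bc   # 32768, where bc exists
printf '%d\n' 0x8000                                     # also 32768

# Total cache size = line-size x nlines, e.g. with 0x40-byte lines:
printf '%d\n' $(( 0x40 * 0x8000 ))   # 2097152 bytes (2 Mbytes)
```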


Memory Tuning Lab


Objectives
Upon completion of this lab, you should be able to:
•  Use the pmap command to view the address space structure
•  Identify the contents of a process address space
•  Describe the effect of changing the lotsfree, fastscan, and slowscan parameters on a system
•  Recognize the appearance of memory problems in memory tuning reports

For the Solaris 8 Operating Environment (OE) in 64-bit mode, all tuning parameters are now 64-bit fields. To change them to work on 32-bit systems with adb or mdb, use the following steps:

1. For adb/mdb, substitute the following verbs:

   J (64-bit) to X (32-bit)
   E (64-bit) to D (32-bit)
   Z (64-bit) to W (32-bit)

2. If you are unsure of your system's architecture, issue the command:

   # isainfo -b

Tasks
Process Memory
Complete the following steps:

1. Using the ps command or ptree command, locate the process identifier (PID) of your current session's Xsun process. Write the PID value of that process in the space provided below:

   ___________________________________________________

2. Run:

   # pmap -x Xsun-pid

   where Xsun-pid is the value you wrote down in Step 1.

   How much real (resident) memory is the Xsun process using? (Totals are in Kbytes.)

   The value should be approximately 12400 Kbytes.

   How much of this real memory is shared with other processes on the system?

   Probably around 1780 Kbytes of the total will be shared with other processes.

   How much private (non-shared) memory is the Xsun process using?

   Assuming around 1780 Kbytes is shared, this would leave 10620 Kbytes of RAM in private use by this process.

Page Daemon Parameters
Determining Initial Values
Use the lab5 directory for these exercises.

1. Observe the effects of modifying various kernel paging parameters on a loaded system. The first step is to dump all of the current paging parameters. Type:

   ./paging.sh

   Record the output. These values are system-dependent.

   lotsfree _______________________ pages
   desfree ________________________ pages
   minfree ________________________ pages
   slowscan _______________________ pages
   fastscan _______________________ pages
   physmem ________________________ pages
   pagesize (on an Ultra system): 8 Kbytes

2. Convert lotsfree from pages to bytes. You use these values later in this lab.

   _________________________________________________________

   Divide this number by 1024 to get lotsfree in Kbytes.

   ___________________________________________
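The conversion in Step 2 is pages multiplied by the page size. A small sketch with assumed values (a lotsfree of 1960 pages and the 8-Kbyte Ultra page size; substitute your own paging.sh output):

```shell
# Convert lotsfree from pages to bytes and Kbytes.
# 1960 pages is a hypothetical reading; 8192 is the Ultra page size in bytes.
lotsfree_pages=1960
pagesize=8192

bytes=$(( lotsfree_pages * pagesize ))
kbytes=$(( bytes / 1024 ))
echo "lotsfree = ${bytes} bytes = ${kbytes} Kbytes"
```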

Changing Parameters
In this portion of the lab, you put a load on the system by running a program called Memory Monitor, which locks down a portion of physical memory. Then you run a shell script called load5.sh, which puts a load on the system's virtual memory subsystem. This script should force the system to page and, perhaps, swap. You will run Memory Monitor and load5.sh several times, each with a different set of paging parameters, to observe their effect.

1. Change the directory to the lab5 directory.

2. Observe the system under load with the default parameters:

   vmstat -S 5

3. To obtain the amount of physical memory on your machine, in another terminal window, type:

   prtconf | grep Memory

4. Exit from any Netscape and Acrobat sessions before continuing with this lab, as these programs might interfere with the running of the Java-based Memory Monitor program.

5. You should start a Memory Monitor program for each 28 Mbytes of RAM in your system. To start each Memory Monitor program, type:

   java -jar MemoryMonitor.jar &

   and wait for the Memory Monitor windows to open. Each window should look like Figure LF5-1.

   Figure LF5-1  Memory Monitor Window

   Click on the Start button to start the Memory Monitor program.

   Note: You are required to start a number of instances of the Memory Monitor program. The value of 28 Mbytes of RAM was selected as an appropriate value for calculating how many instances to execute. However, if you want to place more load on your system, try starting one instance of Memory Monitor for every 15 Mbytes of RAM.

As memory is being allocated, the pie chart displays how much RAM has been allocated and the values also change in the window as shown in Figure LF5-2.

Figure LF5-2  Memory Monitor Pie Chart

At intervals, you can pause and then resume the Memory Monitor program by selecting the Pause or Resume button, as shown in Figure LF5-3.

Figure LF5-3 Memory Monitor Buttons

Try experimenting by entering different values for the quantity (where the default is 10,000) and the delay time (where the default is 2,000 milliseconds).

Note: Each time you change one of the two values (quantity or delay time), you must click the Pause button and then the Resume button for the change to take effect.

Java technology allows for garbage collection of redundant areas of RAM. Try selecting and de-selecting the Garbage Collection button to see what difference, if any, garbage collection makes. When garbage collection is used, RAM is freed. However, this freed RAM then becomes consumed by other processes very quickly.

6. In a third window, type:

   ./load5.sh

   This puts a load on the system. It takes about 3 to 5 minutes to run.

   Note: load5.sh produces an invalid file type error as well as other job completion messages; this is normal. Ignore the error.

Make the following observations while the vmstat command is running:
•  The value of freemem (remember, freemem is reported in Kbytes and should hover around the value of lotsfree)
•  The scan rate and number of Kbytes being freed per second; most of this activity is due to the page daemon
•  Disk activity
•  The time spent in user and system mode
•  Data that is being paged in and paged out

When load5.sh is finished, press Control-C to exit vmstat.

7. Record the same data in a sar file. Type:

   sar -A -o sar.norm 5 60

8. In another window, type:

   ./load5.sh

   Wait for load5.sh to complete. If sar is still running, use Control-C to stop the output of sar.

9. You can leave Memory Monitor running if it does not impact your system performance too much.

10. In a new window, type:

    mdb -kw

11. Try changing lotsfree to be one-tenth of main memory. In mdb, type:

    lotsfree/Z 0tnew_lotsfree

    (Remember, the value you use for new_lotsfree should be in pages.)

12. To verify, in mdb, type:

    lotsfree/E

    You should see the value you just set.
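Step 11 asks for one-tenth of main memory, in pages. A sketch of the arithmetic with an assumed physmem value (use the physmem figure you recorded from paging.sh earlier):

```shell
# new_lotsfree = physmem / 10, both values in pages.
# 65536 pages (512 Mbytes at 8 Kbytes per page) is a placeholder value.
physmem=65536
new_lotsfree=$(( physmem / 10 ))
echo "lotsfree/Z 0t${new_lotsfree}"    # the mdb command to apply it
```

The 0t prefix tells mdb the value that follows is decimal.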

13. Increase slowscan to half of fastscan. In mdb, type:

    slowscan/Z 0tnew_slowscan

14. To verify the new value for slowscan, type:

    slowscan/E

    Note: Type $q to exit mdb. For this lab, you may want to keep this window open.

Given these alterations, what sort of behavior do you expect to see on the system with respect to the following:

•  The amount of time the CPU spends in user and system mode
   _______________________________________________________

•  The average amount of free memory on the system
   _______________________________________________________

•  The number of pages being scanned and put on the freelist
   _______________________________________________________

15. If Memory Monitor is not still running, invoke Memory Monitor by typing:

    java -jar MemoryMonitor.jar

16. In another window, type:

    sar -A -o sar.1 5 60

17. In another window, type:

    ./load5.sh

    Wait for load5.sh to complete.

18. Exit Memory Monitor.

19. In one window, type:

    sar -f sar.norm -grqu

20. In another window, type:

    sar -f sar.1 -grqu

    Did you see the changes you expected? Explain why or why not.

    ___________________________________________________________
    ___________________________________________________________
    ___________________________________________________________
    ___________________________________________________________

21. Restore lotsfree to its original value. In the mdb window, type:

    lotsfree/Z 0toriginal_value_of_lotsfree

Scan Rates
To examine the scan rates:

1. Lower slowscan. In the mdb window, type:

   slowscan/Z 0t10

2. Lower fastscan. In the mdb window, type:

   fastscan/Z 0t100

What effect do you think this will have on the system's behavior with respect to:

•  The number of pages scanned per second

   Will be lower

•  The number of pages freed per second

   Will be lower

•  The average amount of free memory

   Will be less than previously

•  The size of the swap queue

   Will be higher
3. If necessary, generate paging activity by running Memory Monitor. Type:

   java -jar MemoryMonitor.jar

4. In another window, type:

   sar -A -o sar.2 5 60

5. In another window, type:

   ./load5.sh

   Wait for load5.sh to complete.

6. Exit Memory Monitor.

7. In one window, type:

   sar -f sar.norm -grqu

8. In another window, type:

   sar -f sar.2 -grqu

   Did you see the changes you expected? Explain why or why not.

   ___________________________________________________________
   ___________________________________________________________
   ___________________________________________________________
   ___________________________________________________________

9. Restore slowscan to its original value. In the mdb window, type:

   slowscan/Z 0toriginal_value

10. Restore fastscan to its original value. In the mdb window, type:

    fastscan/Z 0toriginal_value

11. Exit mdb using the $q command.


System Buses Lab


Objectives


Upon completion of this lab, you should be able to assess hardware components and layout.

Tasks
Testing the Hardware Configuration of a System
The purpose of this lab is to demonstrate how you can determine the hardware layout of a system. You will partner with another student for this lab. For the duration of the lab, the notes refer to System-A and System-B. You must decide which of the pair of hosts will be System-A and which will be System-B, and carry out the tasks appropriately. For the remainder of this lab, you will be working as a paired team. Work carried out on System-A must use the root login.

Note: This lab requires that you have access to the serial and keyboard ports on your system and that you have the appropriate serial cables to connect two systems together. If you do not have appropriate access or the serial connector cable, do not attempt this part of the lab. If you are in doubt, consult with your instructor.

The answers for this lab are all system-dependent or based on your own interpretation of the lab exercise.

1. On System-B, execute the prtdiag command (on sun4u systems) and document the output below:

   ________________________________________________
   ________________________________________________
   ________________________________________________
   ________________________________________________
   ________________________________________________
   ________________________________________________

2. Using the serial cable provided by your instructor, connect the cable to serial port B on System-A and System-B so that both hosts can communicate with each other through the serial port.

3. On System-A, verify that there is an entry in the /etc/remote file that is similar to the following:

   hardwire:\
       :dv=/dev/term/a:br#9600:el=^C^S^Q^U^D:ie=%$:oe=^D:

   Ensure that the dv= entry reads /dev/term/a. If the entry does not exist, then add the entry for hardwire.

4. On System-A, issue the following command:

   # tip hardwire

   If the serial connection between the two systems is working correctly, you should see the message:

   connected

   If this is not the case, please consult with the instructor.

5. On System-B, to halt the system and take it to the PROM command level, issue the command:

   # init 0

6. When System-B has shut down to level 0, issue the following commands:

   ok setenv diag-switch? true
   ok setenv auto-boot? false
   ok banner

   If the revision of the PROM is 3.x, issue the following command:

   ok reset-all

   If the revision of the PROM is 2.x, issue the following command:

   ok reset

7. On System-A, the hardware configuration of System-B is listed as the system boots. Answer the following questions:

   Do you see a listing of the attached hardware?

   On a sun4u system, does the listing match the output of the prtdiag command (as issued in Step 1 of this lab)?

   How many memory banks are there on System-B?

   What is the distribution of memory modules across the memory banks?

   Can you see the size of the Level 2 cache (look for Ecache Size)?

8. On System-A, terminate the tip session by typing the following:

   ~.

9. On System-A, record all of the data into a file using the following command:

   # script /var/tmp/hardware_list.script
   Script started, file is /var/tmp/hardware_list.script
   # tip hardwire

   Note: If you are unfamiliar with the script command, consult with the instructor.

10. When System-B has booted to the PROM command level, issue the following commands on System-B:

    ok setenv diag-switch? false

    If the version of the PROM is 3.x, issue the following command:

    ok reset-all

    If the version of the PROM is 2.x, issue the following command:

    ok reset

    Before System-B starts to boot, remove the keyboard cable from its socket.

11. As System-B boots, the ok prompt should appear in the terminal window on System-A. If so, issue the following command and answer the questions below:

    ok boot -v

    What is the name of the graphics adapter on System-B, and how much memory is on the graphics card?

    What is the name and speed of the network card?

    What types of system and I/O buses are used on System-B?

12. When System-B has booted, issue the following commands on System-A:

    # ~.
    # exit
    Script done, file is /var/tmp/hardware_list.script
    #

    The output produced by System-B during its boot cycle is now stored in the /var/tmp/hardware_list.script file. Compare the details in this file to the listing produced in Step 1.

13. If you are using sun4d or sun4u architecture systems, use the prtdiag command to determine the clock speed of the system bus. Use Table 6-1 on page 6-10 to determine the width of your system bus. Calculate the burst bandwidth of the bus by multiplying the clock speed of the bus by the width value. Then, divide the total by 8 to obtain a value in Mbytes/second. Write down your answer:

    ___________ Mbytes/sec

14. If a system with OBP 2.x is available, try the following commands at the ok prompt:

    Note: These PROM commands may not be available, depending upon the type of system in the classroom.

    ok cpu-info
    ok show-sbus

15. If a system with OBP 3.x is available, try the following command at the ok prompt:

    ok .speed

16. Boot the system using the command:

    ok boot
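The Step 13 arithmetic, sketched with assumed numbers (a 100-MHz clock and a 64-bit bus width are placeholders; substitute your own prtdiag and Table 6-1 values):

```shell
# Burst bandwidth = clock (MHz) x width (bits) / 8  ->  Mbytes/sec.
# 100 MHz and 64 bits are placeholder values, not real readings.
clock_mhz=100
width_bits=64

bandwidth=$(( clock_mhz * width_bits / 8 ))
echo "burst bandwidth = ${bandwidth} Mbytes/sec"
```

Dividing by 8 converts bits per clock into bytes per clock, so a 100-MHz, 64-bit bus bursts at 800 Mbytes/sec.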


Solaris System Performance Management



I/O Tuning Lab


Objectives
Upon completion of this lab, you should be able to:
- Identify the properties of small computer system interface (SCSI) devices on a system
- Calculate the time an I/O operation should take
- Compare calculated I/O times with actual times
Tasks
Disk Device Characteristics
Below is example output from the prtvtoc command:
# prtvtoc /dev/dsk/c0t2d0s0
* /dev/dsk/c0t2d0s0 partition map
*
* Dimensions:
*     512 bytes/sector
*     159 sectors/track
*       4 tracks/cylinder
*     636 sectors/cylinder
*    6485 cylinders
*    6483 accessible cylinders
...
# prtvtoc /dev/dsk/c0t3d0s0
* /dev/dsk/c0t3d0s0 partition map
*
* Dimensions:
*     512 bytes/sector
*      80 sectors/track
*      19 tracks/cylinder
*    1520 sectors/cylinder
*    3500 cylinders
*    2733 accessible cylinders
...
Given this information, try to answer the following questions:
1. Assume that both disks write data at the same speed and the heads are positioned over the appropriate starting track for a write operation. If 2 Mbytes of data is transferred to each disk, and each disk has a seek time of 4 ms and a rotation speed of 10,000 RPM, answer each of these questions:
- Which disk would take less time to write the data?
The second disk in the list shown above will take less time.

- Why?
The second disk has 19 tracks of 80 sectors per cylinder, allowing 760 Kbytes to be written per cylinder. The first disk writes only 318 Kbytes per cylinder, so it must perform more head movements to write the same amount of data. The extra head movements make the first disk slower.
- What would be the time difference (in percentage terms)?
In percentage terms, the difference would be (760 - 318) / 760, which results in a 58 percent difference.
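The per-cylinder figures quoted in the answers above follow directly from the prtvtoc geometry; a minimal sketch of the calculation:

```python
SECTOR_BYTES = 512

def kbytes_per_cylinder(sectors_per_track, tracks_per_cylinder):
    """Data that can be written to one cylinder before a head seek."""
    return sectors_per_track * tracks_per_cylinder * SECTOR_BYTES // 1024

disk1 = kbytes_per_cylinder(159, 4)    # first prtvtoc example
disk2 = kbytes_per_cylinder(80, 19)    # second prtvtoc example
print(disk1, disk2)                              # 318 760
print(round((disk2 - disk1) / disk2 * 100))      # 58 percent
```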

Use the following steps to determine I/O subsystem characteristics:
2. Locate your boot drive and CD-ROM drive. Use prtconf -v to identify the following characteristics for each (if they are SCSI devices). For details, see Table L7-1 on page L7-4 and the example output.
Target ____________
Device type ____________
Speed ____________
Tagged queueing? ____________
Wide SCSI? ____________
Target ____________
Device type ____________
Speed ____________
Tagged queueing? ____________
Wide SCSI? ____________

3. In a terminal window, start a find process in the background that searches all directories on the system, using the following command:
# ./finder8 &
4. In the same window, start the iostat command and observe the activity per disk slice:
# iostat -xp 3 15
Why is the value in the %b column not showing 100 percent?
Because bus and I/O arbitration are blocking data transfer. Also, the find process is being context-switched between runnable and wait-on-I/O states, so it cannot handle a continuous stream of data input. The disk, therefore, cannot be transferring data to the find process 100 percent of the time. There might also be other reasons.

System Behavior Characteristics
This question relates to system characteristics. See if you can determine the characteristics (with minimal background information) of three different system types.
5. Match the best configuration to the application type, based on the details shown below:
Application Type
- Web hosting server (multiple Web sites)
- Computer-aided design (CAD)/computer-aided manufacturing (CAM) workstation
- Database server
System 1 has high network activity and high numbers of disk writes. The data read from and written to the disk is not sequential, and the data throughput value varies with each read or write.
System 2 has high CPU utilization with very little disk I/O. When disk I/O does take place, it is usually sequential access (read or write) with large amounts of data throughput.
System 3 has high memory utilization with low disk-access values. When the disk is accessed, the data is mostly disk-read and sequential in nature. This system has a large amount of random access memory (RAM) in use by low numbers of processes.
There is not one correct answer to this question. In fact, two types of systems could display the same characteristics, so the table containing the answers shows which are most likely.

Match the system to the type (tick the appropriate box in Table LF7-1):
Table LF7-1 System Types
            Web Server    CAD/CAM    Database
System 1    [ ]           [ ]        [ ]
System 2    [ ]           [ ]        [ ]
System 3    [ ]           [ ]        [ ]

Reading iostat Report Output


6. The lab7 lab directory includes a file called iostat_example.txt. The file contains partial output from iostat, showing active metadevices. Which of the metadevices provides:
The best service time?
md24 has the best actual and averaged service time value recorded in the output.
The worst service time?
md0 has the worst service time for one sample period. However, in that sample period, the amount of data read and written is quite high. md4 has the worst averaged service time, where it took almost 11 ms to write just 3.1 Kbytes of data.
The best average write time?
md24 has the best write service time.
The best average read time?
md0 shows the best average read time in the entry:
md0   0.2   4.3   0.5   34.4   0.1   0.4   91.9   6   5

You should use the following formulas:
Average write service time = kw/s / (kw/s + kr/s) * svc_t
Average read service time  = kr/s / (kw/s + kr/s) * svc_t
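Applying these formulas to the md0 sample line quoted earlier (kr/s = 0.5, kw/s = 34.4, svc_t = 91.9) can be sketched as:

```python
def split_service_time(kr_s, kw_s, svc_t):
    """Apportion iostat's svc_t between reads and writes by the
    fraction of data moved in each direction (the formulas above)."""
    total = kr_s + kw_s
    avg_read = kr_s / total * svc_t
    avg_write = kw_s / total * svc_t
    return avg_read, avg_write

# Values taken from the md0 sample line in the lab output.
r, w = split_service_time(0.5, 34.4, 91.9)
print(round(r, 2), round(w, 2))   # 1.32 90.58
```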


UFS Tuning Lab


Objectives
Upon completion of this lab, you should be able to:
- Determine how a partition was created
- See the performance benefit of the directory name lookup cache (DNLC)
- Determine how many inodes a partition needs
- See the performance benefits of sequential read/write access
- Determine the cylinder group to which a file belongs
- Alter the default UFS creation parameters
Tasks
File Systems
Complete the following steps:
1. Look at the newfs/mkfs parameters used to create your root partition, and record the following information:
fstyp -v /dev/rdsk/cXtXdXsX | more
size __________________________
ncg ___________________________
minfree _______________________
maxcontig _____________________
optim _________________________
nifree ________________________
ipg ___________________________
ncyl __________________________
The values are system dependent.
size is the size of the partition in numbers of file system blocks (see bsize).
ncg is the number of cylinder groups in the file system.
minfree is the percentage of space in the partition reserved for root use only.
maxcontig is the number of contiguous disk blocks to be allocated.
optim is the optimization method (space or time).
nifree is the number of inode entries free for use.

ipg is the number of inodes per cylinder group.
ncyl is the number of cylinders in the partition.
The values for the following questions are system dependent.
How many cylinders exist in the last cylinder group in this partition?
___________________________________________________________
Using fstyp, determine how many inodes are present in the last cylinder group in the partition.
___________________________________________________________
Compare the previous values with the output of df -Fufs -o i.
___________________________________________________________
At 128 bytes per inode, how much space is being allocated to unused inodes? What percentage of the partition space is this?
___________________________________________________________
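The unused-inode question above is a simple calculation once you have nifree and the partition size. The 250,000 free inodes and 2-Gbyte partition below are assumed example figures, not values from your system:

```python
INODE_BYTES = 128   # on-disk size of one UFS inode

def unused_inode_overhead(nifree, partition_bytes):
    """Space tied up in free (unused) inodes, and the percentage
    of the partition that this represents."""
    wasted = nifree * INODE_BYTES
    return wasted, wasted / partition_bytes * 100

# Assumed sample figures: 250,000 free inodes on a 2-Gbyte partition.
wasted, pct = unused_inode_overhead(250_000, 2 * 1024**3)
print(wasted, round(pct, 2))   # 32000000 bytes, about 1.49 percent
```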

Directory Name Lookup Cache
Use the files in the lab8 directory for these exercises.
1. Examine the effects of running a system with a very small DNLC and inode cache. ncsize is the size of the DNLC, and ufs_ninode is the size of the inode cache. Type:
./dnlc.sh
Record the size of both caches.
ncsize __________________________________________
ufs_ninode ______________________________________
2. Type:
vmstat -s
Record the total name lookups (cache hits ##%).
total name lookups ___________________________
cache hits% ____________________________________
3. Edit /etc/system. Add the following lines:
set ncsize = 50
set ufs_ninode = 50
4. Reboot your system.
5. After it has rebooted, type:
vmstat -s
Record the total name lookups (cache hits ##%).
total name lookups _____________________________ (Treat this as value X in the equation in Step 8.)
cache hits% ____________________________________ (Treat this as value A in the equation in Step 8.)

6. To determine how much disk activity is taking place in Step 7, start the sar program running in a separate window:
sar -d 2 100
Observe how much disk activity takes place as the script is executing. You may also wish to use:
sar -q 2 100
Calculate a percentage hit rate using the following formula:
(namei - iget) / namei * 100 = Hit-Rate-Percentage
Be aware that there might be other disk activity, such as daemon logging and file metadata synchronization, which could affect the output.
7. Type:
../finder
This test takes a minute or so to run.
8. When the test has completed, record the results. What has happened to the DNLC hit rate? Why? What can you deduce just from the vmstat output?
total name lookups _____________________________ (Treat this as value Y in the following equation.)
cache hits ____________________________________ (Treat this as value B in the following equation.)
Now calculate the following (using the values from Step 5 and Step 8):
(B - A) / (Y - X)
What is the actual percentage of cache hits?
______________________________________________________
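Because vmstat -s reports cumulative counters since boot, the (B - A) / (Y - X) formula isolates the hit rate for just the test interval. A minimal sketch, using invented sample counter values:

```python
def interval_hit_rate(lookups_before, hits_before, lookups_after, hits_after):
    """DNLC hit rate over the test interval only: (B - A) / (Y - X),
    using hit and lookup counts, not the cumulative percentages."""
    return (hits_after - hits_before) / (lookups_after - lookups_before) * 100

# Assumed sample counters: X=120000, A=108000 before the test;
# Y=200000, B=124000 after the test.
print(round(interval_hit_rate(120_000, 108_000, 200_000, 124_000)))  # 20
```

A cache that was 90 percent effective since boot can still score very poorly over the interval, which is exactly what shrinking ncsize is expected to show.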

9. Type:
sar -v 1
Notice that the size of the inode table corresponds to the value you set for ufs_ninode and that the total number of inodes exceeds the size of the table. This is not a bug. ufs_ninode acts as a soft limit; the actual number of inodes in memory might exceed this value. However, the system attempts to keep the size below ufs_ninode.
10. Edit /etc/system. Remove the lines that were added earlier.
11. Reboot your system.
12. After the system has rebooted, change to the lab8 directory and type:
./finder
13. When the test has completed, record the results. What has happened to the DNLC hit rate? Why? What can you deduce just from the vmstat output?
total name lookups _____________________________________
cache hits ______________________________________________
Are these values what you expected?

UFS Parameter Controls
For this lab, you use the spare disk drive that is currently not in use by your system. First, verify that there is a spare disk drive (if not, consult with your instructor) and, second, ensure that there are no current mounts of file systems from that drive.
Verify that there is a disk on which it is safe to perform the following steps.
Note - Do not alter any settings for your system boot disk drive.
1. Use the format command to determine the size of the spare disk drive (in cylinders).
2. When you have determined the size of the disk (in cylinders), partition the disk using the information shown below, where the allocated values are as close to the percentages as possible:
- In addition to Slice 2, four slices will be used. These slices are:
Slice 0 - Allocate 5 percent of the total disk cylinders
Slice 1 - Allocate 45 percent of the total disk cylinders
Slice 6 - Allocate 5 percent of the total disk cylinders
Slice 7 - Allocate 45 percent of the total disk cylinders
The total number of cylinders allocated to these four slices will equal the number of cylinders of Slice 2.
Each of the four slices listed previously must be write-enabled and mountable. There should be no overlapping of any slices (except for Slice 2).
3. When the partitions have been defined, label the disk using the appropriate command.
4. Quit from the format command using the appropriate program instructions.

5. Create a new file system in Slice 0 (on the drive that you have just used with the format command) using:
newfs /dev/rdsk/cXtXdXs0
where the device path name is appropriate for the disk slice being used in this lab.
Note - For the remainder of this portion of the lab, whenever a device path name is used, you should assume that it is appropriate to that task, as for Step 5. If you are unsure, please consult with your instructor.
6. Create a new file system in Slice 6 (on the drive that you have just used with the format command) using:
newfs -c 48 -i 6144 -m 3 -o space /dev/rdsk/cXtXdXs6
Note - You may find that this command does not produce the expected results. This could be due to the disk slice being over 4 Gbytes in size. Also, the space option may be ignored by the newfs command.
7. Create a new file system in Slice 7 (on the drive that you have just used with the format command) using:
newfs /dev/rdsk/cXtXdXs7
8. Compare the details of the file systems in Slice 0 and Slice 6 using the following commands:
fstyp -v /dev/rdsk/cXtXdXs0 | head -30
fstyp -v /dev/rdsk/cXtXdXs6 | head -30
9. Are there any differences in the control values for these two slices? Write down the differences (if any are found):
minfree: ______________________________________
cpg: _________________________________________
bpg: _________________________________________
ipg: _________________________________________

maxbpg: _____________________________________
optim: _______________________________________
Why are there differences?
Differences should be found. Although both slices are identical in size, the values used by newfs for the two slices were different. One slice had a file system created with the default values, while the other had manually specified values. These should be where differences are found. Also, the space option for Slice 6 was probably ignored. This is a feature of the Solaris 8 version of newfs.
10. Compare the details of the file systems in Slice 0 and Slice 7 using the following commands:
fstyp -v /dev/rdsk/cXtXdXs0 | head -30
fstyp -v /dev/rdsk/cXtXdXs7 | head -30
11. Are there any differences in the control values for these two slices?
If differences are found, it is because the sizes of the slices are different and the newfs command has reevaluated its default settings. The values are changed to fit the size of the slice in which the file system is being created.
Write down the differences (if any are found):
minfree: ______________________________________
cpg: _________________________________________
bpg: _________________________________________
ipg: _________________________________________
maxbpg: _____________________________________
optim: _______________________________________

12. Create the following directories, using the commands listed below:
mkdir /newdir0
mkdir /newdir6
mkdir /newdir7
13. Mount the slices (from the non-system disk) as follows:
mount /dev/dsk/cXtXdXs0 /newdir0
mount /dev/dsk/cXtXdXs6 /newdir6
mount /dev/dsk/cXtXdXs7 /newdir7
14. In a new terminal window, issue the command:
iostat 2
Observe the values being displayed as you proceed through the remainder of this lab.
15. When the disk slices are mounted, issue the following commands and record (in the spaces provided) the times taken to perform the tasks.
Note - These tasks relate to sequential disk access times. Observe the kps column for the disk drive in the output of iostat when each command is executed.
ptime mkfile 60m /newdir0/file60mb
Real: _____________
User: _____________
Sys: ______________
ptime mkfile 60m /newdir6/file60mb
Real: _____________
User: _____________
Sys: ______________

ptime mkfile 60m /newdir7/file60mb
Real: _____________
User: _____________
Sys: ______________
16. Remove the files created in the preceding step by issuing the following commands:
rm /newdir0/file60mb
rm /newdir6/file60mb
rm /newdir7/file60mb
17. In the lab8 directory is a shell script called ufscreate1. This shell script creates a series of directories and subdirectories (in one of the file systems mounted earlier). When the series of directories has been created, a file is copied into each of the new subdirectories. Issue each of the commands listed below and record the time taken to complete each one:
ptime ./ufscreate1 /newdir0
Real: _____________
User: _____________
Sys: ______________
ptime ./ufscreate1 /newdir6
Real: _____________
User: _____________
Sys: ______________

ptime ./ufscreate1 /newdir7
Real: _____________
User: _____________
Sys: ______________
Were there any differences when /newdir0 and /newdir6 were used? If so, why?
If any differences are noticed, it is because Slice 0 is probably on the outer region of the disk surface, where the density of disk reads/writes is at its highest. Therefore, Slice 0 should produce a slightly faster time.
The /newdir7 slice is 9 times the size of the /newdir0 slice. Did the directory and file creation in /newdir7 take 9 times longer than the directory and file creation in /newdir0? If not, why not?
Probably not, because of disk zoning and disk caching.
How many directories were created in /newdir0?
This is disk dependent.
How many directories were created in /newdir7?
This is disk dependent.

mmap and read/write
Use the files in the lab8 directory for these exercises. The purpose of this exercise is to demonstrate that programming technique might affect file access. If a program uses an incorrect technique, this can have a severe impact on file system access.
Note - If you do not have a background in programming, you might wish to skip this exercise and move to the exercise on page L9-71.
1. Use your favorite editor to view the source for the program map_file_f_b.c. This program sequentially scans a file, either forward or backward.
What system call is the program using to access the file?
___________________________________________________________
Why is msync() called at the end of the program?
___________________________________________________________
___________________________________________________________
Which should take more time: scanning the file forward or scanning the file backward?
Scanning the file backward.
Why?
Data is read, by default, from the start of the file. If the data is read backward, it is actually being read forward from disk, then backward from cached data.
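The source of map_file_f_b.c is not reproduced in this guide; as a rough illustration of the access pattern it exercises, here is a Python sketch (using Python's mmap module rather than the C mmap() call) that maps a scratch file and touches each page forward, then backward. The file size and names are invented for the example:

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE

# Build a small temporary file to stand in for test_file.
fd, path = tempfile.mkstemp()
os.write(fd, b"x" * (64 * PAGE))

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
    pages = len(mm) // PAGE
    # Forward scan: touch the first byte of each page in ascending order.
    fwd = sum(mm[i * PAGE] for i in range(pages))
    # Backward scan: touch the same pages in descending order.
    bwd = sum(mm[i * PAGE] for i in reversed(range(pages)))
    mm.close()

os.close(fd)
os.remove(path)
print(fwd == bwd)   # True: same data touched, only the order differs
```

Both scans touch identical data; any timing difference comes from how read-ahead and caching favor the forward direction, which is the point the lab's f and b runs demonstrate.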

2. There is a file called test_file in the labfiles directory. It is used by some of the programs you will be running. Use df -n to verify that this file resides in a (local) UFS file system. If it is not in a UFS file system, perform the following steps:
a. Copy the files map_file_f_b, mf_f_b.sh, and test_file to a directory on a UFS file system. The files map_file_f_b and mf_f_b.sh are in your lab8 directory. The test_file file is in your home directory. You will be looking at some statistics related to disk I/O; therefore, test_file, map_file_f_b, and mf_f_b.sh need to be on a local file system.
b. Change (cd) to the directory containing test_file, map_file_f_b, and mf_f_b.sh.
3. Type:
mf_f_b.sh f
This script runs map_file_f_b several times and prints out the average execution time. The f argument tells map_file_f_b to sequentially scan the file (forward), starting at the first page and finishing on the last page.
Note the results.
System total ________ Average ___________
User total ________ Average ___________
Real-time total ________ Average ___________

4. Type:
mf_f_b.sh b
This runs map_file_f_b several times, but this time scanning (backward) begins on the last page of the file and works back to the first page.
Note the results.
System total ________ Average ___________
User total ________ Average ___________
Real-time total ________ Average ___________
Which option took a shorter amount of time?
The f option.
Why?
Because reading forward is the default method used in the Solaris OE.
5. Change (cd) back to the lab8 directory, if you are not already in that directory.

Using madvise
Experiment with madvise as follows:
1. Look at file_advice.c. What does this program do?
The program accesses a file sequentially and touches the file's pages.
What options does it accept, and what effect do they have on the program?
Users can specify n if they do not wish to set memory advice, s if they want to set memory advice to sequential, or r if they want to set random advice.
2. Copy file_advice to the directory containing test_file.
3. Change (cd) to the directory containing test_file.
4. In one window, type:
vmstat 5
When file_advice has finished accessing the mapped file, it prompts you to press Return to exit. Do not press Return. Instead, note the change in free memory from when the program was started until file_advice prompted you to press Return.
5. Type:
./file_advice n
The n option tells file_advice to use normal paging advice when accessing its mappings.
Note the pagein activity. What caused this?
Pages were being read in by the file_advice program.
After file_advice walked through the mapping, were the pages it used freed up? (Did you see a major drop in freemem in the vmstat output?) Explain.
The pages were not freed up, so the free memory value displayed by vmstat was reduced.
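The source of file_advice.c is not listed here; the following Python sketch shows the same madvise idea using Python's mmap module (available from Python 3.8 on Unix). The scratch file and sizes are invented for the example, and the s and r cases correspond to the program's sequential and random advice options:

```python
import mmap
import os
import tempfile

# Create a scratch file standing in for test_file.
fd, path = tempfile.mkstemp()
os.write(fd, b"\0" * (256 * mmap.PAGESIZE))

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
    # 's' case: advise the kernel that the mapping will be read
    # sequentially, so pages behind the read point can be freed early.
    mm.madvise(mmap.MADV_SEQUENTIAL)
    seq_total = sum(mm[i] for i in range(0, len(mm), mmap.PAGESIZE))
    # 'r' case: declare random access, which disables read-ahead.
    mm.madvise(mmap.MADV_RANDOM)
    mm.close()

os.close(fd)
os.remove(path)
print(seq_total)   # 0: the scratch file contains only zero bytes
```

The advice changes paging behavior, not the data returned, which is why the lab asks you to watch vmstat and the disk I/O counts rather than the program's output.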

6. Press Return so file_advice can exit.
7. Type:
./file_advice s
The s option tells file_advice to use sequential advice when accessing its mappings.
When file_advice prompts you to press Return, was any of the memory that had been used to access the mapping freed? (Did you see a smaller drop in freemem in the vmstat output?) Explain.
After pressing Return, the memory becomes available for use. That is because the file_advice process has terminated and the kernel returns the address space to the free memory list.
8. Press Return so file_advice can exit.
9. Type:
./file_advice s
Note the number of I/O requests that were posted to the disk on which test_file resides (for example, s0).
This is system dependent.
10. Press Return so file_advice can exit.
11. Type:
./file_advice r
The r option tells file_advice to use random advice when accessing its mappings.
Note the number of I/O requests that were posted to the disk on which test_file resides (for example, s0).
This is system dependent.

Is there a difference between the sequential case and the random case?
There should be.
What might account for this difference?
The sequential operation reads through pages one after another, so those pages are paged in once. With the random read, individual pages may be paged in more than once to allow random access to subsequent pages in the test file.
12. Press Return so file_advice can exit.


Network Tuning Lab


Objectives
Upon completion of this lab, you should be able to:
- Identify the hit statistics for naming service caches
- Trace NFS server access by your host's automount daemon
- Use the cache file system (CacheFS)
- Use netstat reports to determine network performance
Tasks
nscd Naming Service Cache
Use the lab9 directory for these exercises. You need the following CDE windows for this lab:
- One Console window (referred to as the Console window during this lab)
- Three terminal windows (referred to as the Terminal 1, Terminal 2, and Terminal 3 windows during this lab)
Note - You should have only one Console window open during this lab.
Complete the following steps:
1. In the Terminal 3 window, view the contents of the /etc/nsswitch.conf file. Write down the source of the data for the following:
The values are system dependent.
passwd: ___________________________
group: ____________________________
hosts: ____________________________
ipnodes: __________________________
2. In the Terminal 3 window, view the current network statistics using the following command:
# netstat -s
Note - You compare these statistics at the end of this lab.

3. Alter the font size of the Terminal 2 window so that the window has a width of at least 40 characters and a depth of at least 47 lines. In the Terminal 2 window, execute the following command:
# ./netcache
4. In the Terminal 1 window, execute the following commands and observe the values being output by the netcache script:
# ls -lR /etc /usr/share > /dev/null
If you are using Network Information Service (NIS) in the classroom, try the following commands:
# ypcat group
# ypcat hosts
If you are using Network Information Service Plus (NIS+) in the classroom, try the following commands:
# niscat group.org_dir
# niscat hosts.org_dir
When you have completed the preceding steps and observed the changes in value in the Terminal 2 window, terminate the command running in the Terminal 2 window.
Note - Read the man pages for nscd(1M) and nscd.conf(4) to determine how you can change the cache characteristics.

autofs Tracing
In the Solaris Operating Environment (OE) implementation, the NFS automounter facility is provided with a trace facility. This trace facility allows the administrator to observe the actions of the automount daemon. The trace facility is useful in determining whether NFS servers in a network segment are overloaded with requests, whether they are responding correctly to client requests, and whether the version of NFS has been set correctly or negotiation is taking place between client and server.
The following steps show how to use the automounter's trace facility. For this exercise, you need access to an NFS server.
1. In the Console window, type the following command:
# touch /net/=9
This invokes the trace facility with the highest level of tracing enabled (Level 9).

Your Console window should look something like Figure LF9-1:
Figure LF9-1 Console Window Output From the touch /net/=9 Command
2. In the Terminal 2 window, issue the following command:
# cd /net/moon
where moon is the host name of a system that is not acting as an NFS server. The output you should see in the Console window looks similar to Figure LF9-2:

Figure LF9-2 Console Window Output From the cd /net/moon Command
3. In the Terminal 2 window, issue the following command:
# cd /net/star
where star is the host name of a system that is acting as an NFS server. The output in the Console window shows how the NFS automount was successful.
4. When you have successfully mounted from the appropriate server, try accessing directories and files that are now automounted. For each action, observe the trace details being displayed in the Console window.
5. Change directory to the root directory:
# cd /
6. Wait for a period of more than five minutes, performing no keyboard activity. This allows the automounter to terminate the mount.

Observe the trace values being displayed in the Console window at the time of the unmount of the NFS server directories.
7. To terminate the trace facility in the Console window, issue the following command:
# touch /net/=0
where 0 (zero) terminates the trace capability.

cachefs Tracing
The cache file system (CacheFS) allows remote (NFS) files to be cached on the local disk subsystem. Use of CacheFS should, therefore, relieve the traffic on the network and speed up access to the file data for the local user. The following lab shows how CacheFS can reduce network traffic.
For this lab, there should be one host nominated as a manual page server. You will become a client of this (NFS) manual page server. Unless specified, all commands should be issued in the Terminal 1 window.
1. Make a backup copy of the /etc/auto_master file.
2. Edit the /etc/auto_master file to read:
/- /etc/auto_cache

3. Create the /etc/auto_cache file, containing the following text:
/usr/share/man -ro,nosuid,noconst,fstype=cachefs,\
backfstype=nfs,cachedir=/var/cachedir \
server:/usr/share/man
where server is the name of the manual page server for the class.
Note - For readability, the line has been shown split over three text lines. It is more effective to have just one continuous line in the auto_cache file.
4. Create the cache directory:
# cfsadmin -c /var/cachedir
5. Update the automount daemon:
# automount -v

6. Now, make use of the automounter connection to the man page server by giving the command:
# man ls
7. Verify that the mount took place:
# mount
Is the man page server mounted?
8. In the Terminal 2 window, issue the following command:
# snoop -a server
where server is the name of the man page (NFS) server.
9. The man page for ls should now be stored in the cache. Try reading the man page for ls again, using the command:
# man ls
Did you notice much noise from the snoop session?
There should be a short burst of noise when the man page is first accessed. This shows that the client host is requesting file-handle data from the server. However, the man page data should be stored in the cache, so there should be very little noise.
10. Now, read the man page for ksh using the following command:
# man ksh
Did you notice much noise from the snoop session?
There should be a long burst of noise when the man page is accessed. This shows that the client host is requesting file-handle and content data from the server. Because the data is not in the cache, the noise will be fairly substantial.
11. Again, read the man page for ksh using the following command:
# man ksh
Did you notice much noise from the snoop session this time? Hopefully not.

Network Tuning Lab


12. Terminate the snoop session that is running in the Terminal 2 window.

13. Revert the /etc/auto_master file back to its original state using the backup copy you made earlier in this lab.


Performance Tuning Summary Lab
Objectives
Upon completion of this lab, you should be able to:
- Identify the stress on system memory
- Use vmstat reports to determine the state of the system
- Determine the presence and cause of swapping
- Use sar reports to determine system performance

Tasks
vmstat Information
Use the lab10 directory for these exercises. Complete the following steps:

1. Type:

vmstat -S 5

Use this output to answer the following questions. The values are system dependent.

- How many threads are able to run? ___________________________
- How many threads are blocked waiting on disk I/O? __________
- How many threads are currently swapped out? _______________
- How much swap space is available? __________________________
- How much free memory is available? _________________________
- Is the system paging in? _____________________________________
- Is the system paging out? ____________________________________
- Where is the system spending most of its time (user mode, kernel mode, or idle)? _______________________________________________________
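As a sketch of how these questions map onto the vmstat -S columns, the following awk fragment pulls the relevant fields out of a single data line. The sample line and its values are hypothetical; substitute a line captured on your own system.

```shell
# Hypothetical vmstat -S data line; with -S the page group starts with
# si/so instead of re/mf, so the 22 fields are:
#   r b w  swap free  si so pi po fr de sr  <4 disk cols>  in sy cs  us sy id
sample='0 0 0 468376 23512 0 2 4 1 0 0 0 0 1 0 0 148 362 91 2 3 95'

report=$(echo "$sample" | awk '{
    printf "runnable (r): %s  blocked (b): %s  swapped (w): %s\n", $1, $2, $3
    printf "free swap: %s Kbytes  free memory: %s Kbytes\n", $4, $5
    printf "swap-ins/s: %s  swap-outs/s: %s  page-ins/s: %s  page-outs/s: %s\n", $6, $7, $8, $9
    printf "CPU us/sy/id: %s/%s/%s\n", $(NF-2), $(NF-1), $NF
}')
printf '%s\n' "$report"
```

With these hypothetical numbers, the system has no threads queued or swapped out, is paging lightly, and is spending almost all of its time idle.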

2. Type:

./load11.sh

Note - load11.sh may take three to four minutes to complete.

Note any changes in the vmstat output for the following areas. The values are system dependent.

- Pagein activity ________________________________
- Pageout activity _______________________________
- Amount of free memory _________________________
- Disk activity __________________________________
- Interrupts _____________________________________
- System calls ___________________________________
- CPU (us/sy/id) ________________________________

The pagein values will increase, and then pageout values should start to appear. The central processing unit (CPU) will spend more time in system mode. Other values will also change.

sar Analysis I
Complete the following steps. In this part of the lab, use sar to read files containing performance data. This data was recorded by sar using the -o option, in binary format. Each sar file was generated using a different benchmark.

1. Type:

sar -u -f sar.file1

What kind of data is sar showing you?

CPU activity

In what mode was the CPU spending most of its time?

System mode

Look at the %wio field. What does this field represent? Is there a pattern?

The system is not waiting for I/O to occur, so the processing must be CPU intensive.

2. Type the following command to display the CPU load statistics:

sar -q -f sar.file1

Look at the size of the run queue as the benchmark progresses. What trend do you see? What does this indicate with respect to the load on the CPU?

The run queue size is minimal until half-way through the sample data. This indicates that swapping is preventing the CPU from working effectively.

Note the amount of time the run queue is occupied (%runocc). Does this support your conclusions from the previous step?

Yes

What is the average length of the swap queue? The size of the swap queue indicates the number of lightweight processes (LWPs) that were swapped out of main memory. What might the size of the swap queue tell you with respect to the demand for physical memory? (If this field is blank, then the swap queue was empty.)

This indicates that there is a very high demand for memory, as the swap queue is almost constantly at 100 percent.
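The run-queue trend can be quantified by averaging the runq-sz and %runocc columns of the report. A sketch, using hypothetical sar -q sample values:

```shell
# Hypothetical sar -q output lines: time runq-sz %runocc swpq-sz %swpocc
sample='12:05:01 1.0 20 32.0 100
12:10:01 4.5 90 32.0 100
12:15:01 5.2 95 32.0 100'

summary=$(echo "$sample" | awk '{
    rq += $2; occ += $3; n++
} END {
    printf "average runq-sz: %.1f  average %%runocc: %.0f\n", rq / n, occ / n
}')
printf '%s\n' "$summary"
```

A rising runq-sz together with a %runocc near 100 is the signature of a CPU that cannot drain its queue of runnable threads.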

3. Type the following command to format the statistics pertaining to disk I/O:

sar -d -f sar.file1

Was the disk busy during this benchmark? (Did the benchmark load the I/O subsystem?) Use the %busy field to make an initial determination.

No. The disk is not busy.

Was the disk busy with many requests? (See the avque field.) Did the number of requests increase as the benchmark progressed?

Not really.

As the benchmark progressed, were processes forced to wait longer for their I/O requests to be serviced? (Use avwait and avserv.) What trend do you notice in these statistics?

The trend shows that there is a variable demand on the disk. Processes were not particularly forced to wait longer to be serviced.

Was the I/O subsystem under a load? What was the average size of a request?

No. The average size of a request was 13-14 blocks.

4. Type the following command to display memory statistics:

sar -r -f sar.file1

What unit is the freeswap field in?

Disk blocks (512 bytes)

What was the average value of freemem?

631
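The unit conversions behind these answers can be sketched as follows. The freeswap and freemem values are hypothetical, and the 8192-byte page size is an assumption; the pagesize(1) command prints the real value on your system.

```shell
# freeswap is counted in 512-byte disk blocks; convert to Mbytes.
freeswap=186204                               # hypothetical sar -r value
swap_mb=$(( freeswap * 512 / 1024 / 1024 ))
echo "free swap: $swap_mb Mbytes"

# freemem is counted in pages; the page size is system dependent
# (8192 bytes is assumed here -- check with pagesize(1)).
freemem=631
pagesize=8192
mem_kb=$(( freemem * pagesize / 1024 ))
echo "free memory: $mem_kb Kbytes"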

5. Type the following command to format more memory usage statistics:

sar -g -f sar.file1

Was the paging activity steady? (Did the system page at a steady rate?) Explain.

No, paging took place in this sample.

You have just checked three major areas in the SunOS operating system that can affect performance (CPU, I/O, and memory use). You can use this approach to gather initial data on a suspect performance problem. Given the data you have just collected, what can you conclude with respect to the load on the system?

The load is very light with regard to disk I/O, but the CPU is heavily loaded. The presumption is that there is a CPU problem.

sar Analysis II
Perform the same analysis as in the previous section on the sar.file2 file.

1. Type:

sar -u -f sar.file2

What kind of data is sar showing you?

The data shows a varied amount of CPU activity.

In what mode was the CPU spending most of its time?

The CPU spent half of the time mostly in user mode and the other half mostly in idle mode.

Look at the %wio field. What does this field represent? Is there a pattern?

There is no wait on I/O displayed.

2. Type the following command to display the CPU load statistics:

sar -q -f sar.file2

Look at the size of the run queue as the benchmark progresses. What trend do you see? What does this indicate about the load on the CPU?

There is no run queue indicated.

Note the amount of time the run queue is occupied (%runocc). Does this support your conclusions from the previous step?

No values are indicated.

What is the average length of the swap queue? The size of the swap queue indicates the number of LWPs that were swapped out of main memory. What might the size of the swap queue tell you with respect to the demand for physical memory? (If this field is blank, then the swap queue was empty.)

32.0

3. Type the following command to format the statistics pertaining to disk I/O:

sar -d -f sar.file2

Was the disk busy during this benchmark? (Did the benchmark load the I/O subsystem?) Use the %busy field to make an initial determination.

No.

Was the disk busy with many requests? (See the avque field.) Did the number of requests increase as the benchmark progressed?

No.

As the benchmark progressed, were processes forced to wait longer for their I/O requests to be serviced? (Use avwait and avserv.) What trend do you notice in these statistics?

The trend shows a random disk access workload.

Was the I/O subsystem under a load? What was the average size of a request?

No. The average size of a request was 6 blocks.

4. Type the following command to display memory statistics:

sar -r -f sar.file2

What unit is the freeswap field in?

Disk blocks (512 bytes)

What was the average value of freemem?

627

5. Type the following command to format more memory usage statistics:

sar -g -f sar.file2

Was the paging activity steady? (Did the system page at a steady rate?) Explain.

Paging activity was at 0 (zero).

What can you conclude about the load on the system? Which areas, if any, does this benchmark load?

The load on the system was mostly memory based. The signs are that the processing was managing data in RAM that was displayed to screen, with very little disk I/O taking place.

sar Analysis III
Perform the same analysis as in the previous sections on the sar.file3 file.

1. Type:

sar -u -f sar.file3

What kind of data is sar showing you?

There is a high wait on I/O, so this shows a system that is I/O dependent.

In what mode was the CPU spending most of its time?

Wait on I/O

Look at the %wio field. What does this field represent? Is there a pattern?

The wait on I/O starts after the first few readings and then stays at a high level.

2. Type the following command to display the CPU load statistics:

sar -q -f sar.file3

Look at the size of the run queue as the benchmark progresses. What trend do you see? What does this indicate with respect to the load on the CPU?

The CPU is not being heavily loaded, as there is no queue shown.

Note the amount of time the run queue is occupied (%runocc). Does this support your conclusions from the previous step?

Yes

What is the average length of the swap queue? The size of the swap queue indicates the number of LWPs that were swapped out of main memory. What might the size of the swap queue tell you with respect to the demand for physical memory? (If this field is blank, then the swap queue was empty.)

32.0

3. Type the following command to format the statistics pertaining to disk I/O:

sar -d -f sar.file3

Was the disk busy during this benchmark? (Did the benchmark load the I/O subsystem?) Use the %busy field to make an initial determination.

Yes. The disk was busy.

Was the disk busy with many requests? (See the avque field.) Did the number of requests increase as the benchmark progressed?

Yes. The load on the disk increased as the sample progressed.

As the benchmark progressed, were processes forced to wait longer for their I/O requests to be serviced? (Use avwait and avserv.) What trend do you notice in these statistics?

Yes. The more load was placed on the disks, the longer processes had to wait to be serviced.

Was the I/O subsystem under a load? What was the average size of a request?

The average size of a request was 210 Kbytes (approximately 420 blocks).

4. Type the following command to display memory statistics:

sar -r -f sar.file3

What unit is the freeswap field in?

Disk blocks (512 bytes)

What was the average value of freemem?

419
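The per-request figures above come from combining the sar -d columns: avwait plus avserv approximates the time a request spends queued and being serviced, and blks/s divided by r+w/s gives the mean request size in 512-byte blocks. A sketch with hypothetical values:

```shell
# Hypothetical sar -d data line: device %busy avque r+w/s blks/s avwait avserv
line='sd0 85 4.2 120 53760 38.1 49.8'

stats=$(echo "$line" | awk '{
    printf "response time (avwait + avserv): %.1f ms\n", $6 + $7
    printf "mean request size: %.0f blocks (%.0f Kbytes)\n", $5 / $4, $5 / $4 / 2
}')
printf '%s\n' "$stats"
```

Dividing blocks by two converts to Kbytes, since each block is 512 bytes. Large sequential requests with rising avwait are consistent with the queue build-up this benchmark shows.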

5. Type the following command to format more memory usage statistics:

sar -g -f sar.file3

Was the paging activity steady? (Did the system page at a steady rate?) Explain.

The paging occurred in two rows of the data rather than at a steady rate. This is because the disks were busy, and paging occurred when the demand on disk reads and writes fell.

What can you conclude about the load on the system? Which areas, if any, does this benchmark load?

The system is I/O intensive, but it can cope with most of the demand. Occasionally, queues develop, which delays the processing.
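Whether paging was steady or bursty can be read off the ppgout/s column. A sketch that computes its mean and peak from hypothetical sar -g samples:

```shell
# Hypothetical sar -g sample: time pgout/s ppgout/s pgfree/s pgscan/s %ufs_ipf
sample='12:05:01 0.00 0.00 0.00 0.00 0.00
12:10:01 0.00 12.40 31.76 0.00 0.00
12:15:01 0.00 0.00 0.00 0.00 0.00'

paging=$(echo "$sample" | awk '{
    sum += $3; if ($3 > max) max = $3; n++
} END {
    printf "ppgout/s mean %.2f max %.2f over %d samples\n", sum / n, max, n
}')
printf '%s\n' "$paging"
```

A peak far above the mean, as here, indicates bursty paging rather than a steady rate.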

Accounting
This appendix describes the following:
- Ways in which accounting assists system administrators
- Four types of accounting mechanisms
- The three types of files accounting maintains
- The steps needed to start accounting
- The commands used to generate raw data and the files these commands create
- The commands used to generate reports from the raw data collected

What Is Accounting?
In the terminal server context, the term accounting originally implied the ability to bill users for system services. More recently, the term is used when referring to the monitoring of system operations. Accounting resources comprise a number of programs and shell scripts that collect data about system use and provide several analysis reports.

What Is Accounting Used For?
Accounting assists system administrators in the following functions:
- Monitoring system usage
- Troubleshooting
- Monitoring system capacity and performance
- Ensuring data security

Types of Accounting
There are four types of accounting.

Connection Accounting
Connection accounting includes all information associated with the logging in and out of users. This element of the accounting mechanism does not have to be installed for data to be collected. But if it is not installed, no analysis can be performed (however, with the last command you can print out the protocol file). When connection accounting is installed, the following data is collected in the /var/adm/wtmp file:

- The times at which users logged in
- Shutdowns, reboots, and system crashes
- Terminals and modems used
- The times at which accounting was turned on and off

Process Accounting
As each program terminates, the kernel (the exit() function) places an entry in /var/adm/pacct. The entry contains the following information:

- The user's user identifier (UID) number and group identifier (GID) number
- Process start and end times
- The central processing unit (CPU) time, split between user and kernel space
- The amount of memory used
- The number of characters read and written
- The command name (eight characters)
- The process's controlling terminal

Disk Accounting
Disk accounting enables you to monitor the amount of data a user has stored on disk. The dodisk command, available for this purpose, should be used once per day. The following information is stored:

- The user name and user ID
- The number of data blocks used by a user's files

Charging
The charging mechanism enables the system administrator to levy charges for specific services (for example, restoring deleted files). These entries are stored in /var/adm/fee and are displayed in the accounting analysis report. The following information is stored:

- The user name and user ID
- The fee to charge

How Does Accounting Work?
Location of Files
The shell scripts and binaries are located in the /usr/lib/acct directory, and the data and report analyses are stored in /var/adm/acct.

Types of Files
Accounting maintains three types of files:

- Collection files: Files containing raw data that is appended to by the current process or kernel.
- Reports: Data in the form of user reports.
- Summary files: A large number of files containing only summaries. Reports are created from these files.

What Happens at Boot Time


At boot time, the kernel is informed (through the script /etc/rc2.d/S22acct) that an entry must be created in /var/adm/pacct for each process. The script consists primarily of calls to the following commands:

- startup: Starts collection of process information. The data collects in /var/adm/pacct, which can grow rapidly if there is a lot of system activity.
- shutacct: Stops data collection and creates an entry in /var/adm/wtmp.

Programs That Are Run
Programs started by crontab summarize and analyze the collected data and delete the /var/adm/pacct file after analysis.

- ckpacct: Prevents excessive growth (more than 500 Kbytes) of the /var/adm/pacct file, to avoid unnecessarily long compute times during summarizing in the event of an error. In this case, the pacct file is renamed [1], and the collection of data is continued in a new pacct file. Additionally, accounting is halted when free space drops below 500 Kbytes in /var/adm.
- dodisk: Scans the disk and generates a report on disk space currently in use.
- runacct: Summarizes the raw data collected over the day, deletes the raw data file, and generates analyses as described in the following paragraphs.
- monacct: Generates an overall total from the days' totals and creates a summary report for the entire accounting period. Additionally, the daily reports are deleted, and the summary files are reset to zero.
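The size check and rename step that ckpacct performs can be sketched in plain shell. This is an illustration only, run against a scratch directory created with mktemp; the real script operates on /var/adm/pacct.

```shell
# Sketch of the ckpacct rename step, against a scratch directory.
dir=$(mktemp -d)
limit=500                                        # Kbytes, as in ckpacct

# Simulate an oversized pacct file (600 Kbytes of data).
dd if=/dev/zero of="$dir/pacct" bs=1024 count=600 2>/dev/null

size=$(du -k "$dir/pacct" | awk '{ print $1 }')
if [ "$size" -gt "$limit" ]; then
    n=1
    while [ -e "$dir/pacct$n" ]; do n=$(( n + 1 )); done
    mv "$dir/pacct" "$dir/pacct$n"               # pacct -> pacct1, pacct2, ...
    : > "$dir/pacct"                             # collection continues in a new pacct
fi
ls "$dir"
```

After the run, the oversized data sits in pacct1 and an empty pacct is ready for new records, which mirrors the numbering scheme described in the footnote.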

A number of commands are started by runacct and monacct; they can also be started manually.
- lastlogin: Lists all known user names and the date on which these users last logged in
- nulladm: Creates empty files with correct permissions
- prdaily: Generates the daily report from the accounting files

Additionally, the chargefee program is available to the system administrator to register additional fees.

[1] To create the new file name, a digit is added to the old file name. The digit increases: /var/adm/pacct becomes /var/adm/pacct1, /var/adm/pacct2, and so on.

Location of ASCII Report Files
The analysis and summary files are stored in the /var/adm/acct/sum and /var/adm/acct/fiscal directories.

- /var/adm/acct/sum: Contains the daily summary files and daily reports. These files are created by runacct and the scripts and programs runacct invokes.
- /var/adm/acct/fiscal: Contains the monthly analyses and summary files. These files are created by monacct.

Starting and Stopping Accounting
Accounting is started using the following steps:

1. Install the SUNWaccr and SUNWaccu packages using the pkgadd or swmtool command.

2. Install the /etc/init.d/acct script as the start script in run level 2:

# ln /etc/init.d/acct /etc/rc2.d/S22acct

3. Install the /etc/init.d/acct script as the stop script in run level 0:

# ln /etc/init.d/acct /etc/rc0.d/K22acct

4. Modify the crontab files for users root and adm to start the programs dodisk, ckpacct, runacct, and monacct automatically:

# EDITOR=vi; export EDITOR
# crontab -e
30 22 * * 4 /usr/lib/acct/dodisk
# crontab -e adm
0 * * * * /usr/lib/acct/ckpacct
30 2 * * * /usr/lib/acct/runacct 2> \
/var/adm/acct/nite/fd2log
30 7 1 * * /usr/lib/acct/monacct

5. Modify the file /etc/acct/holidays, which determines the prime work times. (The reports generated by the accounting programs show a separate total for Prime [prime work time] and Nonprime [other times].) A line beginning with an asterisk (*) is a comment line. This file must be modified each year to update the system with the changing national or local holidays. An example holidays file is shown on the next page.

6. Reboot the system, or type:

# /etc/init.d/acct start

Following is an example of the /etc/acct/holidays file:

# vi /etc/acct/holidays
* @(#)holidays 2.0 of 1/1/99
* Prime/Nonprime Table for UNIX Accounting System
*
* Curr   Prime   Non-Prime
* Year   Start   Start
1999     0800    1800
*
* only the first column (month/day) is significant.
*
* month/day   Company Holiday
*
* Attention! Holidays with annually changing dates
* have to be updated each year.
1/1      New Years Day
1/16     Martin Luther Kings Birthday
2/20     Presidents Day
5/29     Memorial Day
7/4      Independence Day
9/4      Labor Day
11/23    Thanksgiving Day
11/24    Day After Thanksgiving
12/25    Christmas Day
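A quick structural check of a holidays file can be sketched with awk: comment lines start with an asterisk, the first data line carries the year and prime/non-prime start times, and the remaining data lines are month/day entries. The file content below is a hypothetical fragment.

```shell
# Hypothetical holidays-file fragment for the check.
holidays='* Prime/Nonprime Table for UNIX Accounting System
1999 0800 1800
1/1 New Years Day
7/4 Independence Day
12/25 Christmas Day'

check=$(echo "$holidays" | awk '
    /^\*/ { next }                             # skip comment lines
    !year { year = $1; prime = $2; next }      # first data line: year, prime, nonprime
    { count++ }                                # remaining data lines are holidays
    END { printf "year %s, prime start %s, %d holiday entries\n", year, prime, count }')
printf '%s\n' "$check"
```

Running something like this each January is a cheap way to confirm the annual update actually happened.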

Generating the Data and Reports
Raw Data
Raw data is stored in four separate files:

- /var/adm/pacct
- /var/adm/wtmp
- /var/adm/acct/nite/disktacct
- /var/adm/fee

The /var/adm/pacct File


Following the startup of accounting, the kernel (the exit() function) writes an entry in the /var/adm/pacct file whenever a process terminates. This is the only logging function that needs to be started explicitly. The definition of the format for entries in this file is found in <sys/acct.h> [2], in the structure named acct.

The /var/adm/wtmp File


The /var/adm/wtmp file contains information on logins and logouts, and on boot operations and shutdowns. The defining structure for the format of this file is called struct utmp and is defined in <utmp.h>. Entries in the /var/adm/wtmp file are written by the following programs:

- init: When the system is booted and stopped
- startup and shutacct: When process accounting is started and stopped
- ttymon: The port monitor that monitors the serial port for service requests such as login
- login: At login

[2] Angle brackets refer to files in the /usr/include directory. Therefore, <sys/acct.h> means /usr/include/sys/acct.h. These structures are described later in this appendix.

The /var/adm/acct/nite/disktacct File
Entries are generated once a day by dodisk. The format of the tacct structure is described in the acct(4) man page. It is also illustrated on page A-28.

The /var/adm/fee File


Only the chargefee command places data in the /var/adm/fee file. The structure consists of the UID, user name, and a number specifying the fee the user pays for a service. This number has a relative rather than an absolute value, which is determined only when the daily and monthly reports are processed.

UID   Username   Fillbytes             Fee
101   otto       0 0 0 0 0 0 0 0 0 0   1500
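Because the fee field is relative, per-user totals become meaningful only once the records are summed, which is what the daily merge does. A sketch with hypothetical fee records:

```shell
# Hypothetical ASCII fee records: UID, user name, ten fill bytes, fee.
fees='101 otto 0 0 0 0 0 0 0 0 0 0 1500
101 otto 0 0 0 0 0 0 0 0 0 0 250
104 anna 0 0 0 0 0 0 0 0 0 0 300'

# Sum the last field per user name; sort for a stable listing.
totals=$(echo "$fees" | awk '{ fee[$2] += $NF } END { for (u in fee) print u, fee[u] }' | sort)
printf '%s\n' "$totals"
```

Multiplying each total by a site-defined cost factor (which, as noted above, the system does not store) turns these relative numbers into a bill.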

rprtMMDD Daily Report File
A daily report file, /var/adm/acct/sum/rprtMMDD, is generated each night by runacct from the raw data that has been collected. MM represents the month (two digits), and DD represents the day. This file is composed of five parts (subreports): Daily Report, Daily Usage Report, Daily Command Summary, Monthly Total Command Summary, and Last Login Report. The runacct program generates daily summary reports, deletes the raw data, and then calls the script prdaily to generate the five subreports in the file rprtMMDD. To bill users, the reports must be processed further, because no cost factors are stored in the system (for example, 1 second CPU = $10). This also applies to charges logged by chargefee.

Generation of the rprtMMDD File


Figure A-1 shows the generation of the rprtMMDD file: login, init, and acctwtmp write wtmp; the kernel writes the pacctn files; chargefee writes fee; and dodisk writes disktacct. The runacct program reads these four raw data files and produces the reboots, lineuse, daytacct, daycms, cms, and loginlog files, from which prdaily (invoked by runacct) generates the rprtMMDD file.

Figure A-1  rprtMMDD File

Daily Report Connections
The Daily Report shows the use of the system interfaces and the basic availability.

Apr 16 12:28 1998  DAILY REPORT FOR jobe  Page 1

from Mon Dec  1 12:15:15 1997
to   Wed Apr 15 13:57:15 1998

24   system boot
3    run-level 3
3    acctg on
2    run-level 6
2    acctg off
1    acctcon

As shown in this output, the following fields are displayed:

- from and to: The time period to which the report applies.
- acctcon, runacct, and so on: More specific activities, such as booting, shutting down, and starting and stopping of accounting, that are stored in /var/adm/wtmp.
- TOTAL DURATION: The time for which the system was available to the user during this accounting period.
- LINE: The interface described by the next line.
- MINUTES: The number of minutes the interface was active; that is, when someone was logged in.
- PERCENT: The percentage of minutes in the total duration.
- # SESS: The number of times login was started by the port monitor.

- # ON: The same number as # SESS, because it is no longer possible to invoke login directly.
- # OFF: The number of logouts at which control was returned (voluntarily or involuntarily) to the port monitor. Interrupts at the interface are also shown. If the number is significantly higher than the number of # ONs, a hardware error is probably indicated. The most common cause of this is an unconnected cable dangling from the multiplexer.
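PERCENT is simply MINUTES expressed as a share of TOTAL DURATION. A sketch with hypothetical values:

```shell
# Hypothetical values: the interface was active for 130 minutes out of a
# 1440-minute (24-hour) accounting period.
minutes=130
duration=1440
percent=$(awk -v m="$minutes" -v d="$duration" 'BEGIN { printf "%.1f", m / d * 100 }')
echo "PERCENT: $percent"
```

Checking a report line this way is a quick sanity test when the MINUTES and TOTAL DURATION figures look inconsistent.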

Daily Usage Report


The Daily Usage Report shows system usage by users.

Apr 16 12:28 1998  DAILY USAGE REPORT FOR jobe  Page 1

                  CPU (MINS)      KCORE-MINS      CONNECT (MINS)   DISK     # OF    # OF   # DISK
UID   LOGIN NAME  PRIME  NPRIME   PRIME  NPRIME   PRIME   NPRIME   BLOCKS   PROCS   SESS   SAMPLES   FEE
0TOTAL6112619614791023848782431500
0root160131514561000976205110
5uucp001200056010
7987otto45124726002322872521311500
...

As shown in this output, the following fields are displayed:

- UID: The user identification number.
- LOGIN NAME: The user name.
- CPU (MINS): The number of CPU minutes used. This information is split into PRIME and NPRIME (non-prime), where PRIME is normal work hours and NPRIME is other times, such as nights and weekends. The split is determined by the file /etc/acct/holidays, which is set up when accounting is installed.
- KCORE (MINS): A cumulative measure of the amount of memory, in Kbyte segments per minute, that a process uses while running. It is divided into PRIME and NPRIME utilization.
- CONNECT (MINS): The actual time the user was logged in.

- DISK BLOCKS: The number of 512-byte disk blocks in use at the time dodisk was run.
- # OF PROCS: The number of processes invoked by the user. If large numbers appear, a user might have a shell procedure that has run out of control.
- # OF SESS: The number of times the user logged in and out during the reporting period.
- # DISK SAMPLES: The number of times dodisk was started to obtain the information in the DISK BLOCKS column.
- FEE: The fees collected by the chargefee command (for example, for restoring a file).

Daily Command Summary


This statistic enables system bottlenecks to be identified. It shows which commands were started and how often, how much CPU time was used, and so on.

Apr 16 12:28 1998  DAILY COMMAND SUMMARY  Page 1

                                TOTAL COMMAND SUMMARY
COMMAND    NUMBER   TOTAL      PRIME TOTAL   PRIME TOTAL   MEAN     MEAN      HOG      CHARS     BLOCKS
NAME       CMDS     KCOREMIN   CPU-MIN       REAL-MIN      SIZE-K   CPU-MIN   FACTOR   TRNSFD    READ
TOTALS78252.86379.5521430.150.140.590.021447776734
sendmail0 8.970.244.8321.460.000.00158130106
grep 74 5.05 0.11 2.50 44.690.000.0550447117
...

As shown in this output, the following fields are displayed:

- COMMAND NAME: The name of the command. Unfortunately, all shell procedures are lumped together under the name sh, because only object modules are reported by the process accounting system. You should monitor the frequency of programs called a.out or core or any other unexpected name. acctcom determines who executed an oddly named command and whether superuser privileges were used.
- NUMBER CMDS: Each command called increments this number by one.
- TOTAL KCOREMIN: The total memory used by these commands, in Kbytes per minute.
- PRIME TOTAL CPU-MIN: The total CPU time used by these programs.
- PRIME TOTAL REAL-MIN: The actual elapsed time.
- MEAN SIZE-K: The average memory requirements.
- MEAN CPU-MIN: The average CPU time.
- HOG FACTOR: The ratio of CPU time to actual elapsed time.
- CHARS TRNSFD: The number of characters transferred by read() and write() system calls.
- BLOCKS READ: The number of disk blocks read or written by the program.
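The derived columns can be recomputed from the raw totals. A sketch using hypothetical figures for a single command:

```shell
# Hypothetical raw totals for one command line of the summary.
cmds=74        # NUMBER CMDS
kcoremin=5.05  # TOTAL KCOREMIN
cpumin=0.11    # PRIME TOTAL CPU-MIN
realmin=2.50   # PRIME TOTAL REAL-MIN

derived=$(awk -v n="$cmds" -v k="$kcoremin" -v c="$cpumin" -v r="$realmin" 'BEGIN {
    printf "MEAN SIZE-K: %.2f\n", k / c      # Kbyte-minutes per CPU minute
    printf "MEAN CPU-MIN: %.4f\n", c / n     # CPU minutes per invocation
    printf "HOG FACTOR: %.2f\n", c / r       # CPU time / elapsed time
}')
printf '%s\n' "$derived"
```

A hog factor near 1 marks a CPU-bound command; a low value, as here, marks a command that spends most of its elapsed time waiting.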

Monthly Total Command Summary
The Monthly Total Command Summary report (Subreport 4) contains the same fields as the Daily Command Summary report. However, the monthly summary numbers represent accumulated totals since the last execution of monacct.

Last Login
Summary information shows when a user last logged in. It enables unused accounts to be easily identified, because the entries are displayed in time sequence.

Apr 16 12:28 1998  LAST LOGIN  Page 1

00-00-00  uucp
00-00-00  adm
00-00-00  bin
00-00-00  sys
95-03-25  root
95-04-21  otto
95-04-22  dmpt
93-10-04  guest
92-12-15  tjm
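Because the report uses sortable two-digit dates, stale accounts are easy to extract. A sketch with hypothetical loginlog data, selecting accounts with no login since 1995:

```shell
# Hypothetical loginlog entries: last-login date, user name.
loginlog='00-00-00 uucp
92-12-15 tjm
93-10-04 guest
95-04-21 otto'

# String comparison on the YY-MM-DD field keeps everything before "95".
stale=$(echo "$loginlog" | awk '$1 < "95" { print $2 }')
printf '%s\n' "$stale"
```

Entries of 00-00-00 mark accounts that have never logged in at all, so they surface first in any such filter.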

The runacct Program


The primary daily accounting shell script, runacct, is normally invoked by cron outside of prime-time hours. The runacct shell script processes connect, fee, disk, and process accounting files. It also prepares daily and cumulative summary files for use by prdaily and monacct for billing purposes.

The runacct shell script takes care not to damage files if errors occur. A series of protection mechanisms attempt to recognize an error, provide intelligent diagnostics, and complete processing in such a way that runacct can be restarted with minimal intervention. It records its progress by writing descriptive messages into the active file. (Files used by runacct are assumed to be in the /var/adm/acct/nite directory, unless otherwise noted.) All diagnostic output during the execution of runacct is written into fd2log.

When runacct is invoked, it creates the files lock and lock1. These files prevent simultaneous execution of runacct. The runacct program prints an error message if these files exist when it is invoked. The lastdate file contains the month and day runacct was last invoked and prevents more than one execution per day. If runacct detects an error, a message is written to the console, mail is sent to root and adm, locks can be removed, diagnostic files are saved, and execution is ended.

To allow runacct to be restartable, processing is broken down into separate re-entrant states. The file statefile is used to keep track of the last state completed. When each state is completed, statefile is updated to reflect the next state. After processing for the state is complete, statefile is read and the next state is processed. When runacct reaches the CLEANUP state, it removes the locks and ends. The states are executed as follows.

runacct States
SETUP
The command turnacct switch is executed to create a new pacct file. The process accounting files in /var/adm/pacctn (except for the pacct file) are moved to /var/adm/Spacctn.MMDD. The /var/adm/wtmp file is moved to /var/adm/acct/nite/wtmp.MMDD (with the current time record added on the end), and a new /var/adm/wtmp is created. closewtmp and utmp2wtmp add records to wtmp.MMDD and the new wtmp to account for users currently logged in.

WTMPFIX
The wtmpfix program checks the wtmp.MMDD file in the nite directory for accuracy. Because some date changes cause acctcon to fail, wtmpfix attempts to adjust the timestamps in the wtmp file if a record of a date change appears. It also deletes any corrupted entries from the wtmp file. The fixed version of wtmp.MMDD is written to tmpwtmp.

Accounting
CONNECT
The acctcon program records connect accounting records in the file ctacct.MMDD. These records are in tacct.h format. In addition, acctcon creates the lineuse and reboots files. The reboots file records all the boot records found in the wtmp file.

PROCESS
The acctprc program converts the process accounting files, /var/adm/Spacctn.MMDD, into total accounting records in ptacctn.MMDD. The Spacct and ptacct files are correlated by number so that if runacct fails, the Spacct files are not reprocessed.

MERGE


The MERGE program merges the process accounting records with the connect accounting records to form daytacct.

FEES
The MERGE program merges ASCII tacct records from the fee file into daytacct.

DISK


If the dodisk procedure has been run, producing the disktacct file, the DISK program merges the file into daytacct and moves disktacct to /tmp/disktacct.MMDD.

MERGETACCT


The MERGETACCT program merges daytacct with sum/tacct, the cumulative total accounting file. Each day, daytacct is saved in sum/tacct.MMDD, so that sum/tacct can be re-created if it is corrupted or lost.

CMS
The acctcms program is run several times. The acctcms program is first run to generate the command summary using the Spacctn files and write the summary to sum/daycms. The acctcms program is then run to merge sum/daycms with the cumulative command summary file sum/cms. Finally, acctcms is run to produce the ASCII command summary files, nite/daycms and nite/cms, from the sum/daycms and sum/cms files, respectively. The lastlogin program creates the /var/adm/acct/sum/loginlog log file, the report of when each user last logged in. (If runacct is run after midnight, the dates showing the time last logged in by some users will be incorrect by one day.)

USEREXIT
Any installation-dependent (local) accounting program can be included at this point. runacct expects it to be called /usr/lib/acct/runacct.local.

CLEANUP
This state cleans up temporary files, runs prdaily and saves its output in sum/rpt.MMDD, removes the locks, and then exits. Remember, when restarting runacct in the CLEANUP state, remove the last ptacct file because it will not be complete.

Periodic Reports
Monthly reports (generated by monacct) follow the same formats as the daily reports. Further custom analysis is required to use these reports for true accounting. Certain features should, however, be noted:

- The report includes all activities since monacct was last run.
- There is no interface report.
- The report is in /var/adm/acct/fiscal/fiscrptMM, where MM represents the current month.
- Monthly summary files are held in /var/adm/acct/fiscal.
- Daily summary files are deleted following monthly reporting.
- Daily reports are deleted following monthly reporting.

Looking at the pacct File With acctcom
At any time, you can examine the contents of the /var/adm/pacctn files, or any file with records in the acct.h format, by using the acctcom program. If you do not specify any files and do not provide any standard input when you run this command, acctcom reads the pacct file. Each record read by acctcom represents information about a dead process. (Active processes can be examined by running the ps command.) The default output of acctcom provides the following information:

- The command name (a pound [#] sign if it was executed with superuser privileges)
- The user
- The terminal identifier (TTY) name (listed as ? if unknown)
- The starting time
- The ending time
- The real time (in seconds)
- The CPU time (in seconds)
- The mean size (in Kbytes)

The following information can be obtained by using options to acctcom:

- The state of the fork/exec flag (1 for fork without exec)
- The system exit status
- The hog factor
- The total kcore minutes
- The CPU factor
- The characters transferred
- The blocks read

acctcom Options
Some options that can be used with acctcom are:

- -a  Shows some average statistics about the processes selected. (The statistics are printed after the output is recorded.)
- -b  Reads the files backward, showing latest commands first. (This has no effect if reading standard input.)
- -f  Prints the fork/exec flag and system exit status columns. (The output is an octal number.)
- -h  Instead of mean memory size, shows the hog factor, which is the fraction of total available CPU time consumed by the process during its execution (hog factor = total_CPU_time / elapsed_time).
- -i  Prints the columns containing the I/O counts in the output.
- -k  Shows the total kcore minutes instead of memory size.
- -m  Shows the mean core size (this is the default).
- -q  Does not print output records, just prints average statistics.
- -r  Shows the CPU factor: user_time / (system_time + user_time).
- -t  Shows the separate system and user CPU times.
- -v  Excludes the column headings from the output.
- -C sec  Shows only processes with total CPU time (system plus user) exceeding sec seconds.
- -e time  Shows the processes existing at or before time, given in the format hr[:min[:sec]].
- -E time  Shows the processes starting at or before time, given in the format hr[:min[:sec]]. Using the same time for both -S and -E shows the processes that existed at that time.
- -g group  Shows only processes belonging to group.
- -H factor  Shows only processes that exceed factor, where factor is the hog factor (see the -h option).
- -I chars  Shows only processes transferring more characters than the cutoff number specified by chars.
- -l line  Shows only the processes belonging to the terminal /dev/line.
- -n pattern  Shows only the commands matching pattern (a regular expression except that + means one or more occurrences).
- -o ofile  Instead of printing the records, copies them in acct.h format to ofile.
- -O sec  Shows only processes with CPU system time exceeding sec seconds.
- -s time  Shows the processes existing at or after time, given in the format hr[:min[:sec]].
- -S time  Shows the processes starting at or after time, given in the format hr[:min[:sec]].
- -u user  Shows only the processes belonging to user.
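The hog factor (-h, -H) and CPU factor (-r) are simple ratios. As an illustration (Python, not part of acctcom; the helper names are invented):

```python
def hog_factor(total_cpu_time, elapsed_time):
    """Hog factor (-h): fraction of elapsed (real) time spent on CPU."""
    return total_cpu_time / elapsed_time

def cpu_factor(user_time, system_time):
    """CPU factor (-r): user_time / (system_time + user_time)."""
    return user_time / (system_time + user_time)

# A process that used 2 CPU seconds over 8 real seconds has hog factor 0.25.
h = hog_factor(2.0, 8.0)
# 3 seconds of user time against 1 second of system time gives 0.75.
c = cpu_factor(3.0, 1.0)
```

A hog factor near 1 means the process was rarely off the CPU while it ran, which is what the -H cutoff filters on.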

Generating Custom Analyses
Writing Your Own Programs
You can write your own programs to process the raw data collected by the accounting programs and generate your own reports. For your programs to process the raw data, they must correctly use the format of the entries in the raw data files. The following pages show the structures that define the format of the entries in the raw data files. Where custom analysis programs are used, runacct and monacct should not be run.

Raw Data Formats


Different structures define the format of the entries in the following raw data files: /var/adm/wtmp, /var/adm/pacct, and /var/adm/acct/nite/disktacct.

The /var/adm/wtmp File
An entry is created for each login and logout in the /var/adm/wtmp file. These entries can be identified by their ut_pid and ut_line values. The following is an extract from <utmp.h> that defines the format for entries in this file:

    struct utmp {
            char    ut_user[8];         /* User login name */
            char    ut_id[4];           /* /etc/inittab id (usually line #) */
            char    ut_line[12];        /* device name (console, lnxx) */
            short   ut_pid;             /* short for compat. - process id */
            short   ut_type;            /* type of entry */
            struct exit_status ut_exit; /* The exit status of a process */
                                        /* marked as DEAD_PROCESS. */
            time_t  ut_time;            /* time entry was made */
    };

    /* Definitions for ut_type */
    #define EMPTY           0
    #define RUN_LVL         1
    #define BOOT_TIME       2
    #define OLD_TIME        3
    #define NEW_TIME        4
    #define INIT_PROCESS    5   /* Process spawned by init */
    #define LOGIN_PROCESS   6   /* A getty process waiting for login */
    #define USER_PROCESS    7   /* A user process */
    #define DEAD_PROCESS    8
    #define ACCOUNTING      9
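Raw wtmp records can be decoded by any language that can unpack fixed-size binary records. The following Python sketch assumes the layout above with a 32-bit time_t, big-endian (SPARC) byte order, and no compiler padding; on other platforms the field sizes and byte order may differ, so treat the format string as an assumption.

```python
import struct

# Record layout from the <utmp.h> extract above. This assumes a 32-bit
# time_t, big-endian (SPARC) byte order, and no compiler padding.
UTMP_FMT = ">8s4s12shhhhl"   # user, id, line, pid, type, exit (2 shorts), time
USER_PROCESS = 7             # ut_type value for a user process

def parse_utmp(record):
    """Unpack one raw utmp record into a dict of named fields."""
    user, ut_id, line, pid, ut_type, e_term, e_exit, when = struct.unpack(UTMP_FMT, record)
    return {
        "ut_user": user.rstrip(b"\0").decode(),
        "ut_line": line.rstrip(b"\0").decode(),
        "ut_pid": pid,
        "ut_type": ut_type,
        "ut_time": when,
    }

# Build a sample USER_PROCESS record and parse it back.
raw = struct.pack(UTMP_FMT, b"adm", b"co", b"console", 1234, USER_PROCESS, 0, 0, 927500000)
rec = parse_utmp(raw)
```

A real analysis program would read the wtmp file in fixed-size chunks of struct.calcsize(UTMP_FMT) bytes and decode each chunk this way.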

The /var/adm/pacct File
The /var/adm/pacct file is the process accounting file. The format of the information contained in this file is defined in the <sys/acct.h> file. The following is an extract from <sys/acct.h>:

    struct acct {
            char    ac_flag;    /* Accounting flag */
            char    ac_stat;    /* Exit status */
            uid_t   ac_uid;     /* Accounting user ID */
            gid_t   ac_gid;     /* Accounting group ID */
            dev_t   ac_tty;     /* control typewriter */
            time_t  ac_btime;   /* Beginning time */
            comp_t  ac_utime;   /* acctng user time in clock ticks */
            comp_t  ac_stime;   /* acctng system time in clock ticks */
            comp_t  ac_etime;   /* acctng elapsed time in clock ticks */
            comp_t  ac_mem;     /* memory usage */
            comp_t  ac_io;      /* chars transferred */
            comp_t  ac_rw;      /* blocks read or written */
            char    ac_comm[8]; /* command name */
    };
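The comp_t fields (ac_utime, ac_stime, ac_etime, and so on) use a compact pseudo-floating-point encoding: conventionally, the top 3 bits of the 16-bit value hold a base-8 exponent and the low 13 bits hold the mantissa. A small Python sketch of the decoding (illustrative, not part of the accounting package):

```python
def decode_comp_t(value):
    """Decode a comp_t: the top 3 bits are a base-8 exponent and the
    low 13 bits are the mantissa, so result = mantissa * 8**exponent."""
    exponent = (value >> 13) & 0x7
    mantissa = value & 0x1FFF
    return mantissa << (3 * exponent)

small = decode_comp_t(100)               # 100 ticks stored directly
large = decode_comp_t((1 << 13) | 100)   # mantissa 100, exponent 1 -> 800
```

Any custom program that reads pacct records must apply this decoding before converting clock ticks to seconds.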

The /var/adm/acct/nite/disktacct File
The /var/adm/acct/nite/disktacct file is in tacct format and is created by dodisk. However, dodisk is relatively inefficient because it processes file systems using find. Conversion to quot and awk should present no problems to an experienced shell-script programmer. The actual tacct structure is defined by the acctdisk program as follows:

    /*
     * total accounting (for acct period), also for day
     */
    struct tacct {
            uid_t   ta_uid;        /* userid */
            char    ta_name[8];    /* login name */
            float   ta_cpu[2];     /* cum. cpu time, p/np (mins) */
            float   ta_kcore[2];   /* cum kcore-minutes, p/np */
            float   ta_con[2];     /* cum. connect time, p/np, mins */
            float   ta_du;         /* cum. disk usage */
            long    ta_pc;         /* count of processes */
            unsigned short ta_sc;  /* count of login sessions */
            unsigned short ta_dc;  /* count of disk samples */
            unsigned short ta_fee; /* fee for special services */
    };
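As with the other raw files, tacct records can be unpacked from their binary form. The following Python sketch assumes a 32-bit uid_t and long, big-endian byte order, and no padding; the real on-disk layout can differ by platform, so treat the format string and the helper name as assumptions for illustration only.

```python
import struct

# tacct layout from the extract above. Assumes 32-bit uid_t and long,
# big-endian byte order, and no padding; the on-disk layout may differ.
TACCT_FMT = ">i8s7fl3H"  # uid, name, cpu[2], kcore[2], con[2], du, pc, sc/dc/fee

def parse_tacct(record):
    """Unpack one raw tacct record into a dict of selected fields."""
    fields = struct.unpack(TACCT_FMT, record)
    return {
        "ta_uid": fields[0],
        "ta_name": fields[1].rstrip(b"\0").decode(),
        "ta_cpu": fields[2:4],   # prime/non-prime CPU minutes
        "ta_du": fields[8],      # cumulative disk usage
        "ta_pc": fields[9],      # count of processes
    }

# Pack a sample record (uid 1001, name adm, 17 processes) and parse it back.
raw = struct.pack(TACCT_FMT, 1001, b"adm",
                  1.5, 0.5,    # ta_cpu, prime/non-prime
                  2.0, 1.0,    # ta_kcore
                  3.0, 0.0,    # ta_con
                  42.0,        # ta_du
                  17,          # ta_pc
                  2, 1, 0)     # ta_sc, ta_dc, ta_fee
rec = parse_tacct(raw)
```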

Summary of Accounting Programs and Files
Table A-1 summarizes the accounting scripts, programs, raw data files, and report files described in this appendix.

Table A-1  Accounting Commands and Files

    Commands                    Generated Raw Data Files          Generated ASCII Report Files
    init                        /var/adm/wtmp
    startup/shutacct            /var/adm/wtmp
    ttymon                      /var/adm/wtmp
    login                       /var/adm/wtmp
    kernel (exit() function)    /var/adm/pacct
    dodisk                      /var/adm/acct/nite/disktacct
    chargefee                   /var/adm/fee
    runacct (a)                                                   /var/adm/acct/sum/rprtMMDD
    monacct                                                       /var/adm/acct/fiscal/fiscrptMM

a. Actually, runacct calls the prdaily script to create the rprtMMDD file.

IPC Tuneables

This appendix discusses the UNIX System V interprocess communication (IPC) facilities. It also provides the meanings and default values for the configuration parameters for the IPC functionality. These are shared memory, semaphores, and message queues.

Additional Resources
The following references provide additional details on the topics discussed in this appendix:

- Wong, Brian L. Configuration and Capacity Planning for Solaris Servers. Palo Alto: Sun Microsystems Press/Prentice Hall, 1997.
- Cockcroft, Adrian and Richard Pettit. Sun Performance and Tuning. Palo Alto: Sun Microsystems Press/Prentice Hall, 1998.

Overview
The Solaris 2.x Operating Environment (OE) implements a multithreaded version of the System V IPC facility. The IPC facility is often heavily used by databases, middleware, and client/server tools. IPC is a well understood, portable facility.

How to Determine Current IPC Values


One of the important aspects of the flexibility of the Solaris OE is its dynamic kernel. If a driver or service is not needed, it is not loaded. When it is first used, it is loaded by the Operating System (OS) and stays in memory for the life of the boot. This is true for the IPC services. The three dynamically loadable kernel modules for the IPC facilities are:

- shmsys  Shared memory facility
- semsys  Semaphore facility
- msgsys  Message queue facility

IPC status is reported by the ipcs -a command, which lists detailed information about the current IPC facility usage. If a service has not been loaded, ipcs returns a "facility not in system" message. This means that a program that uses this IPC facility has not been run. The ipcs command shows current IPC facility usage and status, but it does not show the tuneable parameter settings. The sysdef command lists all of the loaded kernel modules and IPC tuneable parameter settings. However, if a facility has not yet been loaded, these values are zero. As soon as an IPC facility is loaded, sysdef reports its tuneable parameter settings.

An unloaded kernel module can be manually loaded with the modload command. Any user can run the ipcs and sysdef commands, but only the root user can run the modload command. The modload command takes the filename of the module to be loaded as its argument. Typically, the IPC modules reside in the /kernel/sys directory. For example, modload /kernel/sys/shmsys loads the shared memory facility. When shmsys is loaded, ipcs and sysdef show shared memory status and tuneable values.

How to Change IPC Parameters


The IPC tuneable values are set at boot time when the kernel reads the system specification file, /etc/system. Even though the IPC facility might not be loaded until later, the tuneables are set at boot time. The default IPC parameters are set to values that made sense many years ago when IPC was designed. It is highly probable that the default values will not be suitable for modern applications. To change these values, edit the /etc/system file and add the appropriate set statements. All of the IPC values are integers; for example:

    set shmsys:shmmax=2097152

There are no specific recommended IPC values because SunOS does not use the IPC facility itself. However, application and database vendors may recommend IPC values. Consult the vendor's installation instructions for recommended IPC values, especially for databases. After setting the tuneables in /etc/system, a reboot is required to allow the kernel to read the new settings.

Tuneable Parameter Limits
Parameter maximum values can be deceiving because almost all the IPC tuneables are declared as ints, which on a SPARC have a maximum value of 2,147,483,647. In many cases, the data item itself must fit into something smaller, such as a 2-byte integer field embedded in a data structure, which reduces the theoretical maximum significantly. In any event, the maximum values listed here are strictly a technical limitation based on the data type, and in no way should be construed as something attainable in practice. It is often impossible to provide a realistic maximum value. Some of the tuneables cause the kernel to use memory for storing static structures. Operating system version, processor architecture, and other kernel modules impact the amount of available kernel memory, making it impossible to define a general absolute maximum. Without enough kernel memory space, the facility may not be loadable. Such an obvious error can be easily rectified by reducing the value of the tuneable.

Shared Memory
Shared memory (Table B-1) provides one or more segments of virtual memory that can be shared between programs. Normal shared memory is treated no differently from regular virtual memory: It is pageable, has page table entries for each process that uses it, and keeps separate entries in the translation lookaside buffer (TLB) for each user. The most common use of shared memory is for database management system (DBMS) caches, where high performance is important.

To provide better performance for users of shared memory, the Solaris OE provides intimate shared memory (ISM). ISM allows the different processes to share page tables for the shared memory segments, using 4-Mbyte page table entries. More importantly, these ISM segments are locked in memory. Because ISM is non-pageable, it performs much better. Because it is locked down, there must be enough swap space to support the remainder of the virtual memory required by the system. If there is not enough, normal shared memory is used. ISM is available by default, but it is not used unless the calling program specifically requests it (using the SHM_SHARE_MMU flag). For better performance on a Solaris 2.6 platform, with patch 105181-05 installed, set enable_grp_ism to 1 in /etc/system. This is not necessary for the Solaris 7 OE. The use of ISM can be disallowed altogether by setting use_ism in /etc/system to 0.

Table B-1  Tuneable Shared Memory Parameters

shmmax (int; default 131072; maximum 2147483647)
    The maximum size of a shared memory segment. This is the largest value a program can use in a call to shmget(2). Setting this tuneable very high does not hurt anything because kernel resources are not allocated based on this value alone.

shmmin (int; default 1; maximum 2147483647)
    The smallest possible size of a shared memory segment. This is the smallest value that a program can use in a call to shmget(2). Because the default is 1 byte, there should never be any reason to change this; making this value too large could potentially break programs that allocate shared segments smaller than shmmin.

shmmni (int; default 100; maximum 65535)
    The maximum number of shared memory identifiers that can exist in the system at any point in time. Every shared segment has an identifier associated with it; this is the value that shmget(2) returns. The required number of these is completely dependent on the needs of the application, that is, how many shared segments are needed. Setting this randomly high can cause problems because the system uses this value during initialization to allocate kernel resources. The kernel memory allocated totals shmmni x 112 bytes.

shmseg (int; default 6; maximum 65535)
    The maximum number of shared memory segments per process. Resources are not allocated based on this value. It is a limit that is checked to ensure that a program has not allocated or attached to more than shmseg shared memory segments before allowing another attach to complete. The maximum value should be the current value of shmmni because setting it greater than that will not have any effect.

Semaphores
Semaphores (Table B-2) are best described as counters used to control access to shared resources by multiple processes. They are most often used as a locking mechanism to prevent processes from accessing a particular resource while another process is performing operations on it. Semaphores are often dubbed the most difficult to grasp of the three types of System V IPC objects. The name semaphore is actually an old railroad term referring to the crossroad arms that prevent cars from crossing the tracks at intersections. The same can be said about a simple semaphore set. If the semaphore is on (the arms are up), then a resource is available (cars may cross the tracks). However, if the semaphore is off (the arms are down), then resources are not available (the cars must wait). Semaphores are actually implemented as sets, rather than as single entities. Of course, a given semaphore set might only have one semaphore, as in this railroad analogy.

Table B-2  Tuneable Semaphore Parameters

semmap (int; default 10; maximum 2147483647)
    The number of entries in the semaphore map. The memory space given to the creation of semaphores is taken from semmap, which is initialized with a fixed number of map entries based on the value of semmap. semmap should never be larger than semmni. If the number of semaphores per semaphore set used by the application is known (n), then you can use: semmap = (((semmni + n - 1) / n) + 1). If semmap is too small for the application, you will see "WARNING: rmfree map overflow" messages on the console. Set it higher and reboot.

semmni (int; default 10; maximum 65535)
    The maximum number of semaphore sets, system-wide, or the number of semaphore identifiers. Every semaphore set in the system has a unique identifier and control structure. During initialization, the system allocates kernel memory for semmni control structures. Each control structure is 84 bytes.

semmns (int; default 60; maximum 65535)
    The maximum number of semaphores in the system. A semaphore set may have more than one semaphore associated with it, and each semaphore has a sem structure. During initialization, the system allocates (semmns x size of (struct sem)) bytes of kernel memory. Each sem structure is only 16 bytes, but they should not be over-specified. This number is usually semmns = semmni x semmsl.

semmnu (int; default 30; maximum 2147483647)
    The system-wide maximum number of undo structures. This value should be equal to semmni, which provides an undo structure for every semaphore set. Semaphore operations done using semop(2) can be undone if the process that used them terminates for any reason, but an undo structure is required to guarantee this.

semmsl (int; default 25; maximum 2147483647)
    The maximum number of semaphores per unique ID (per set). As mentioned previously, each semaphore set may have one or more semaphores associated with it. This parameter defines the maximum number of semaphores per set.

semopm (int; default 10; maximum 2147483647)
    The maximum number of individual semaphore operations that can be performed in one semop(2) call.

semume (int; default 10; maximum 2147483647)
    The maximum number of per-process undo structures. Make sure that the application is setting the SEM_UNDO flag when it gets a semaphore; otherwise, undo structures are not needed. semume should be less than semmnu but sufficient for the application. A reasonable value would be semopm times the average number of processes that will be doing semaphore operations at any point in time.

semusz (int; default 96; maximum 2147483647)
    Listed as the size in bytes of undo structures, this is the number of bytes required for the maximum number of configured per-process undo structures. During initialization, it is properly set to (size of (undo) + (semume x size of (undo))), so it should not be changed.

semvmx (int; default 32767; maximum 65535)
    The maximum value of a semaphore. Due to the interaction with undo structures (and semaem), this parameter should not exceed its default value of 32767 (half of 65535) unless you can guarantee that SEM_UNDO is never being requested.

semaem (short; default 16384; maximum 32767)
    The maximum adjust-on-exit value. This is a signed, short integer because semaphore operations can increase the value of a semaphore, even though the actual value of a semaphore can never be negative. If semvmx is 65535, semaem would need to be at least 17 bits to represent the range of values possible in a single semaphore operation. In general, semvmx or semaem should not be changed unless you really understand how the applications will be using the semaphores. And even then, semvmx should be left at its default value.
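The sizing rules quoted for semmap (semmap = (((semmni + n - 1) / n) + 1)) and semmns (semmns = semmni x semmsl) are straightforward to compute before editing /etc/system. An illustrative Python transcription (the helper names are invented for the example):

```python
def semmap_suggested(semmni, sems_per_set):
    """semmap = (((semmni + n - 1) / n) + 1), using integer division,
    where n is the number of semaphores per semaphore set."""
    n = sems_per_set
    return ((semmni + n - 1) // n) + 1

def semmns_suggested(semmni, semmsl):
    """Usual sizing rule from the table: semmns = semmni x semmsl."""
    return semmni * semmsl

m = semmap_suggested(10, 5)   # 10 sets of 5 semaphores -> semmap of 3
n = semmns_suggested(10, 25)  # defaults: 10 sets x 25 semaphores each = 250
```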

Message Queues
System V message queues (Table B-3) provide a standardized message-passing system that is used by some large commercial applications.

Table B-3  Tuneable Message Queue Parameters

msgmap (int; default 100; maximum 2147483647)
    The number of entries in the message allocation map. This is similar to semmap for semaphores.

msgmax (int; default 2048)
    The maximum size of a message.

msgmnb (int; default 4096)
    The maximum number of bytes on the message queue.

msgmni (int; default 50)
    The maximum number of message queue identifiers.

msgssz (int; default 8)
    The maximum message segment size.

msgtql (int; default 40)
    The number of system message headers.

msgseg (short; default 1024; maximum 32767)
    The number of message segments.


Performance Monitoring Tools

This appendix is to be used as a reference for the following tools:

- sar
- vmstat
- iostat
- mpstat
- netstat
- nfsstat

Additional Resources
The following references provide additional details on the topics discussed in this appendix:

- Cockcroft, Adrian and Richard Pettit. Sun Performance and Tuning. Palo Alto: Sun Microsystems Press/Prentice Hall, 1998.
- Wong, Brian L. Configuration and Capacity Planning for Solaris Servers. Palo Alto: Sun Microsystems Press/Prentice Hall, 1997.
- Man pages for the various performance tools (sar, vmstat, iostat, mpstat, netstat, nfsstat).
- Solaris 8 AnswerBook.

The sar Command
Display File Access Operation Statistics
sar -a

SunOS grouper 5.6 Generic_105181-03 sun4u    05/24/99

14:45:15    iget/s  namei/s  dirbk/s
14:45:17         0        0        0

where:

- iget/s  The number of requests made for inodes that were not in the DNLC.
- namei/s  The number of file system path searches per second. If namei does not find a directory name in the DNLC, it calls iget to get the inode for either a file or directory. Most iget calls are the result of DNLC misses.
- dirbk/s  The number of directory block reads issued per second.

Display Buffer Activity Statistics


sar -b

SunOS grouper 5.6 Generic_105181-03 sun4u    05/25/99

08:49:23  bread/s  lread/s  %rcache  bwrit/s  lwrit/s  %wcache  pread/s  pwrit/s
08:49:25        0        4      100        0        2       67        0        0

where:

- bread/s  The average number of reads per second submitted to the buffer cache from the disk.
- lread/s  The average number of logical reads per second from the buffer cache.
- %rcache  The fraction of logical reads found in the buffer cache (100 percent minus the ratio of bread/s to lread/s).
- bwrit/s  The average number of physical blocks (512-byte blocks) written from the buffer cache to disk per second.
- lwrit/s  The average number of logical writes to the buffer cache per second.
- %wcache  The fraction of logical writes found in the buffer cache (100 percent minus the ratio of bwrit/s to lwrit/s).
- pread/s  The average number of physical reads per second using character device interfaces.
- pwrit/s  The average number of physical write requests per second using character device interfaces.
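The %rcache and %wcache columns are derived from the physical and logical counts: 100 percent minus the ratio of physical to logical operations. An illustrative Python helper (not part of sar; the function name and the zero-activity convention are invented for the example):

```python
def cache_hit_pct(logical_per_sec, physical_per_sec):
    """%rcache / %wcache as described above: 100 percent minus the
    ratio of physical to logical operations per second."""
    if logical_per_sec == 0:
        return 100.0  # no logical activity; report a fully effective cache
    return 100.0 * (1.0 - physical_per_sec / logical_per_sec)

# With 4 logical reads and 0 physical reads, every read hit the cache.
rcache = cache_hit_pct(4, 0)
```

Low %rcache or %wcache values over sustained intervals suggest the buffer cache is too small for the workload.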

Display System Call Statistics


sar -c

SunOS grouper 5.6 Generic_105181-03 sun4u    05/25/99

08:55:01  scall/s  sread/s  swrit/s  fork/s  exec/s  rchar/s  wchar/s
08:55:03      306       27       25    0.00    0.00     4141     3769

where:

- scall/s  All types of system calls per second.
- sread/s  The number of read system calls per second.
- swrit/s  The number of write system calls per second.
- fork/s  The number of fork system calls per second (about 0.5 per second on a four- to six-user system). This number increases if shell scripts are running.
- exec/s  The number of exec system calls per second. If exec/s divided by fork/s is greater than three, look for:
    - Inefficient PATH variables
    - Older interpretive programs, such as the Bourne shell. Update shell scripts to the Korn shell if they require arithmetic functionality.
    - Poor program structure
- rchar/s  The characters (bytes) transferred by read system calls per second.
- wchar/s  The characters (bytes) transferred by write system calls per second.
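The exec/fork rule of thumb is easy to automate when post-processing sar -c output. An illustrative Python check (the function name and threshold default are invented for the example):

```python
def exec_fork_ratio_high(execs_per_sec, forks_per_sec, threshold=3.0):
    """True when exec/s divided by fork/s exceeds the threshold, the
    symptom described above (inefficient PATH, shell-heavy workloads)."""
    if forks_per_sec == 0:
        return False  # no forks, so the ratio is undefined; nothing to flag
    return execs_per_sec / forks_per_sec > threshold

flagged = exec_fork_ratio_high(4.0, 1.0)  # ratio 4 exceeds 3
ok = exec_fork_ratio_high(2.0, 1.0)       # ratio 2 does not
```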

Display Disk Activity Statistics
sar -d

SunOS grouper 5.6 Generic_105181-03 sun4u    05/25/99

08:58:48   device   %busy   avque   r+w/s   blks/s   avwait   avserv
08:58:50   fd0          0     0.0       0        0      0.0      0.0
           nfs1         0     0.0       0        0      0.0      0.0
           nfs2         0     0.0       0        7      0.5     15.5
           nfs3         0     0.0       0       18      0.0      7.6
           sd0          6     0.1       4       97      0.0     32.1
           sd0,a        2     0.1       1       21      0.0     46.4
           sd0,b        4     0.1       3       76      0.0     26.7
           sd0,c        0     0.0       0        0      0.0      0.0
           sd0,d        0     0.0       0        0      0.0      0.0
           sd6          0     0.0       0        0      0.0      0.0

where:

- device  The name of the disk device being monitored.
- %busy  The percentage of time the device spent servicing a transfer request.
- avque  The sum of the average wait time plus the average service time.
- r+w/s  The number of read and write transfers to the device per second.
- blks/s  The number of 512-byte blocks transferred to the device per second.
- avwait  The average time, in milliseconds, that transfer requests wait idly in the queue (measured only when the queue is occupied).
- avserv  The average time, in milliseconds, for a transfer request to be completed by the device. (For disks, this includes seek, rotational latency, and data transfer times.)

Display Pageout and Memory Freeing Statistics (in Averages)
sar -g

SunOS grouper 5.6 Generic_105181-03 sun4u    05/25/99

09:05:20  pgout/s  ppgout/s  pgfree/s  pgscan/s  %ufs_ipf
09:05:22     0.00      0.00      0.00      0.00      0.00

where:

- pgout/s  The number of pageout requests per second.
- ppgout/s  The actual number of pages that are paged out per second. (A single pageout request may involve paging out multiple pages.)
- pgfree/s  The number of pages per second that are placed on the free list by the page daemon.
- pgscan/s  The number of pages per second scanned by the page daemon. If this value is high, the page daemon is spending a lot of time checking for free memory. This implies that more memory might be needed.
- %ufs_ipf  The percentage of UFS inodes taken off the free list by iget that had reusable pages associated with them. These pages are flushed and cannot be reclaimed by processes. Thus, this is the percentage of igets with page flushes. A high value indicates that the free list of inodes is page-bound and the number of UFS inodes may need to be increased.

Display Kernel Memory Allocation (KMA) Statistics
sar -k
SunOS grouper 5.6 Generic_105181-03 sun4u    05/25/99

09:10:11  sml_mem    alloc  fail   lg_mem    alloc  fail  ovsz_alloc  fail
09:10:13  3301376  1847756     0  6266880  5753576     0     3661824     0

where:

    sml_mem      The amount of memory, in bytes, that the kernel memory allocation (KMA) has available in the small memory request pool. (A small request is less than 256 bytes.)
    alloc        The amount of memory, in bytes, that the KMA has allocated from its small memory request pool to small memory requests.
    fail         The number of requests for small amounts of memory that failed.
    lg_mem       The amount of memory, in bytes, that the KMA has available in the large memory request pool. (A large request is from 512 bytes to 4 Kbytes.)
    alloc        The amount of memory, in bytes, that the KMA has allocated from its large memory request pool to large memory requests.
    fail         The number of failed requests for large amounts of memory.
    ovsz_alloc   The amount of memory allocated for oversized requests (those greater than 4 Kbytes). These requests are satisfied by the page allocator; thus, there is no pool.
    fail         The number of failed requests for oversized amounts of memory.

Display Interprocess Communication Statistics
sar -m
SunOS grouper 5.6 Generic_105181-03 sun4u    05/25/99

09:13:55    msg/s   sema/s
09:13:57     0.00     0.00

where:

    msg/s    The number of message operations (sends and receives) per second.
    sema/s   The number of semaphore operations per second.

Display Pagein Statistics


sar -p
SunOS grouper 5.6 Generic_105181-03 sun4u    05/25/99

09:15:56   atch/s   pgin/s  ppgin/s   pflt/s   vflt/s  slock/s
09:15:58     0.00     0.00     0.00     3.98     3.48     0.00

where:

    atch/s    The number of page faults per second that are satisfied by reclaiming a page currently in memory (attaches per second). Instances of this include reclaiming an invalid page from the free list and sharing a page of text currently being used by another process (for example, two or more processes accessing the same program text).
    pgin/s    The number of times per second that file systems receive pagein requests.
    ppgin/s   The number of pages paged in per second. A single pagein request, such as a soft-lock request (slock/s) or a large block size, may involve paging in multiple pages.
    pflt/s    The number of page faults from protection errors. Instances of protection faults are illegal access to a page and copy-on-writes. Generally, this number consists primarily of copy-on-writes.
    vflt/s    The number of address translation page faults per second. These are known as validity faults, and occur when a valid process table entry does not exist for a given virtual address.
    slock/s   The number of faults per second caused by software lock requests requiring physical I/O. An example of the occurrence of a soft-lock request is the transfer of data from a disk to memory. The system locks the page that is to receive the data so that it cannot be claimed and used by another process.

Display CPU and Swap Queue Statistics


sar -q
SunOS grouper 5.6 Generic_105181-03 sun4u    05/25/99

09:19:48  runq-sz  %runocc  swpq-sz  %swpocc
09:19:50     22.0      100

where:

    runq-sz   The number of kernel threads in memory waiting for a CPU to run. Typically, this value should be less than two. Consistently higher values mean that the system may be CPU bound.
    %runocc   The percentage of time the dispatch queues are occupied.
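The "less than two" guideline above can be applied mechanically. This sketch assumes the `sar -q` layout shown above (timestamp, then runq-sz in the second column); the function name check_runq is this example's own.

```shell
# Sketch: flag intervals whose run-queue length suggests a CPU-bound system.
# Reads `sar -q`-style lines on stdin; the threshold of 2 comes from the
# guideline in this module.
check_runq() {
  awk '$1 ~ /^[0-9][0-9]:/ && $2 ~ /^[0-9.]+$/ {
         if ($2 + 0 >= 2)
           printf "%s: runq-sz=%s - system may be CPU bound\n", $1, $2
         else
           printf "%s: runq-sz=%s - ok\n", $1, $2
       }'
}

# For example:  sar -q 5 12 | check_runq
printf '09:19:50 22.0 100\n' | check_runq
# -> 09:19:50: runq-sz=22.0 - system may be CPU bound
```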

Display Available Memory Pages and Swap Space Blocks
sar -r
SunOS rafael 5.7 Generic sun4u    06/11/99

00:00:00  freemem  freeswap
01:00:00      535    397440
02:00:00      565    397538

where:

    freemem    The average number of memory pages available to user processes over the intervals sampled by the command. Page size is machine dependent.
    freeswap   The number of 512-byte disk blocks available for page swapping.

Display CPU Utilization (the Default)


sar -u
SunOS grouper 5.6 Generic_105181-03 sun4u    05/25/99

09:32:28     %usr     %sys     %wio    %idle
09:32:30        0        2        0       98

where:

    %usr    The percentage of time that the processor is in user mode.
    %sys    The percentage of time that the processor is in system mode.
    %wio    The percentage of time the processor is idle and waiting for I/O completion.
    %idle   The percentage of time the processor is idle and is not waiting for I/O.
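When reviewing many intervals, it is often the combined busy figure (%usr plus %sys) that matters. This is a sketch that assumes the `sar -u` layout shown above; the function name cpu_summary is this example's own.

```shell
# Sketch: condense `sar -u`-style lines into a one-line busy/wio/idle summary
# per interval. Columns assumed: timestamp, %usr, %sys, %wio, %idle.
cpu_summary() {
  awk '$1 ~ /^[0-9][0-9]:/ && $2 ~ /^[0-9]+$/ {
         printf "%s: busy=%d%% wio=%d%% idle=%d%%\n", $1, $2 + $3, $4, $5
       }'
}

# For example:  sar -u 5 12 | cpu_summary
printf '09:32:30 0 2 0 98\n' | cpu_summary
# -> 09:32:30: busy=2% wio=0% idle=98%
```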

Display Process, Inode, File, and Shared Memory Table Sizes
sar -v
SunOS grouper 5.6 Generic_105181-03 sun4u    05/25/99

09:35:19  proc-sz  ov  inod-sz    ov  file-sz  ov  lock-sz
09:35:21  77/1962   0  5178/8656   0  490/490   0  0/0

where:

    proc-sz   The number of process entries currently being used, and the process table size allocated in the kernel.
    inod-sz   The total number of inodes in memory compared to the maximum number of inodes allocated in the kernel. This is not a strict high-water mark; it can overflow.
    file-sz   The number of entries and the size of the open system file table.
    lock-sz   The number of shared memory record table entries currently being used or allocated in the kernel. The sz is given as 0 because space is allocated dynamically for the shared memory record table.
    ov        Overflows that occur between sampling points for each table.

Display Swapping and Switching Statistics
sar -w
SunOS grouper 5.6 Generic_105181-03 sun4u    05/25/99

09:39:03  swpin/s  bswin/s  swpot/s  bswot/s  pswch/s
09:39:05     0.00      0.0     0.00      0.0      118

where:

    swpin/s   The number of LWP transfers into memory per second.
    bswin/s   The number of blocks (512-byte units) transferred for swap-ins per second.
    swpot/s   The average number of processes swapped out of memory per second. If the number is greater than one, you might need to increase memory.
    bswot/s   The number of blocks transferred for swap-outs per second.
    pswch/s   The number of kernel thread switches per second.

Display Monitor Terminal Device Statistics
sar -y
SunOS grouper 5.6 Generic_105181-03 sun4u    05/25/99

10:28:10  rawch/s  canch/s  outch/s  rcvin/s  xmtin/s  mdmin/s
10:28:12        0        0       58        0        0        0

where:

    rawch/s   The number of input characters per second
    canch/s   The number of input characters processed by canon (canonical queue) per second
    outch/s   The number of output characters per second
    rcvin/s   The number of receiver hardware interrupts per second
    xmtin/s   The number of transmitter hardware interrupts per second
    mdmin/s   The number of modem interrupts per second

Display All Statistics
sar -A
SunOS rafael 5.7 Generic sun4u    06/07/99

The -A report is the concatenation of all of the per-option reports shown in the preceding sections, printed for each sampling interval: CPU utilization (%usr %sys %wio %idle), disk activity (device %busy avque r+w/s blks/s avwait avserv), run and swap queues, buffer activity, swapping, system calls and file access, TTY activity, process, inode, file, and lock table sizes, IPC, paging, pageout and memory freeing, free memory and swap, and KMA statistics.

Other Options
    -e time       Select data up to time. The default is 18:00.
    -f filename   Use filename as the data source. The default is the current daily data file /var/adm/sa/sadd (where dd are digits representing the day of the month).
    -i sec        Select data at intervals as close as possible to sec seconds.
    -o filename   Save output in filename in binary format.
    -s time       Select data later than time in the form hh[:mm]. The default is 08:00.

The vmstat Command
Display Virtual Memory Activity Summarized at Some Optional, User-Supplied Interval (in Seconds)
vmstat 2
 procs     memory            page            disk          faults      cpu
 r b  w   swap   free  re  mf pi po fr de sr aa dd f0 s1   in   sy  cs us sy id
 0 0  5  99256  73832   0   2  0  1  1  0  1  0  0  0  0  113   96  53  0  0 100
 0 0 21 141688   2000   0   1  0  0 16  0  8  0  0  0  0  143  214  95  0  0 100
 0 0 21 141688   2000   0   0  0  4  4  0  0  0  0  0  0  168  264 101  0  0  99
 0 0 21 141688   2008   0   0  0  0  0  0  0  0  0  0  0  138  212  96  0  0 100
 0 0 21 141688   2008   0   0  0  0  0  0  0  0  0  0  0  137  196  89  0  0 100
where:
    procs   The number of processes in each of the three following states:

        r   The number of kernel threads in the dispatch (run) queue; the number of runnable processes that are not running.
        b   The number of blocked kernel threads waiting for resources (I/O, paging, and so forth); the number of processes waiting for a block device transfer (disk I/O) to complete.
        w   The number of swapped-out lightweight processes (LWPs) waiting for processing resources to finish; in other words, the swap queue length. (This is not explained properly in the man pages, because the measure used to be the number of active swapped-out processes waiting to be swapped back in.)

    memory   Usage of real and virtual memory:

        swap   The amount of swap space currently available, in Kbytes. If this number ever goes to 0, your system hangs and is unable to start more processes.
        free   The size of the free memory list, in Kbytes; the pages of RAM that are immediately ready to be used whenever a process starts up or needs more memory.

    page   Page faults and paging activity, in units per second:

        re   The number of pages reclaimed from the free list. The page had been stolen from a process but had not yet been reused by a different process, so it can be reclaimed by the process, avoiding a full page fault that would need I/O.
        mf   The number of minor faults in pages. A minor fault is caused by an address space or hardware address translation fault that can be resolved without performing a pagein. It is fast and uses only a little CPU time; the process does not have to stop and wait, as long as a page is available for use from the free list.
        pi   The number of Kbytes paged in per second.
        po   The number of Kbytes paged out per second.
        fr   The number of Kbytes freed per second; the rate at which memory is being put onto the free list. Pages are usually freed by the page scanner daemon, but other mechanisms, such as processes exiting, also free pages.
        de   The deficit: the anticipated, short-term memory shortfall needed by recently swapped-in processes. If the value is nonzero, memory was recently being consumed quickly, and extra free memory will be reclaimed in anticipation that it will be needed.
        sr   The number of pages scanned by the page daemon as it looks for pages to steal from processes that are not using them often. This number is the key memory shortage indicator if it stays above 200 pages per second for long periods. This is the most important value reported by vmstat.

    disk   The number of disk operations per second. There are slots for up to four disks, labeled with a single letter and number. The letter indicates the type of disk (s is SCSI, i is IPI, and so forth); the number is the logical unit number.

    faults   The trap/interrupt rates:

        in   The number of interrupts per second.
        sy   The number of system calls per second.
        cs   The number of CPU context switches per second. This value is useful in assessing whether CPU time-slice allocations are imbalanced.

    cpu   A breakdown of percentage usage of CPU time. On multiprocessor systems, this is an average across all processors:

        us   The percentage of time that the processor is in user mode.
        sy   The percentage of time that the processor is in system mode.
        id   The percentage of time that the processor is idle. Unlike sar, this value is idle time plus wait-on-I/O time.
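Since sr is the most important value reported by vmstat, it is useful to average it over a run. The sketch below assumes the exact 22-column layout shown above (four disk slots); a machine with a different number of disk slots would shift the column positions. The function name avg_sr is this example's own, and in real use the first line of output (averages since boot) should be discarded.

```shell
# Sketch: average the scan rate (sr, column 12) across `vmstat`-style data
# lines. Assumes the 22-column layout shown above; skip the first sample of
# a real run because it reports averages since boot.
avg_sr() {
  awk 'NF == 22 && $1 ~ /^[0-9]+$/ { sum += $12; n++ }
       END { if (n) printf "avg sr = %.1f pages/sec over %d samples\n", sum / n, n }'
}

# For example:  vmstat 2 10 | avg_sr
printf '0 0 5 99256 73832 0 2 0 1 1 0 1 0 0 0 0 113 96 53 0 0 100\n0 0 21 141688 2000 0 1 0 0 16 0 8 0 0 0 0 143 214 95 0 0 100\n' | avg_sr
# -> avg sr = 4.5 pages/sec over 2 samples
```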

Display Cache Flushing Statistics
vmstat -c
flush statistics: (totals)
    usr    ctx    rgn    seg     pag    par
    821    960      0      0  123835     97

where:

    usr   The number of user cache flushes since boot time.
    ctx   The number of context cache flushes since boot time.
    rgn   The number of region cache flushes since boot time.
    seg   The number of segment cache flushes since boot time.
    pag   The number of page cache flushes since boot time.
    par   The number of partial-page cache flushes since boot time.

Display Interrupts Per Device


vmstat -i
interrupt         total     rate
--------------------------------
clock           7734432      100
zsc0             218486        2
zsc1             422997        5
hmec0            127107        1
fdc0                 10        0
--------------------------------
Total           8503032      109

Display System Event Information, Counted Since Boot
vmstat -s
       40 swap ins
       37 swap outs
       40 pages swapped in
    26071 pages swapped out
  1007708 total address trans. faults taken
    20024 page ins
    12205 page outs
    27901 pages paged in
    52838 pages paged out
     6847 total reclaims
     5841 reclaims from free list
        0 micro (hat) faults
  1007708 minor (as) faults
    19106 major faults
   208321 copy-on-write faults
   197883 zero fill page faults
   640512 pages examined by the clock daemon
       41 revolutions of the clock hand
    65328 pages freed by the clock daemon
     9791 forks
     3022 vforks
    15138 execs
 18611577 cpu context switches
 74053744 device interrupts
  1245736 traps
 33533770 system calls
  2000912 total name lookups (cache hits 96%)
      188 toolong
    73340 user cpu
    58410 system cpu
 34399437 idle cpu
    82829 wait cpu

where:

    micro (hat) faults   The number of hardware address translation faults.
    minor (as) faults    The number of address space faults.
    total name lookups   The directory name lookup cache (DNLC) hit rate.

Display Swapping Activity Summarized at Some Optional, User-Supplied Interval (in Seconds)
This option reports on swapping activity rather than the default, paging activity. It changes two fields in vmstat's paging display; rather than the re and mf fields, vmstat -S reports si and so.
vmstat -S 2
 procs     memory            page            disk          faults      cpu
 r b  w   swap   free  si  so pi po fr de sr aa dd f0 s1   in   sy  cs us sy id
 0 0  5    104  73704   0   0  0  1  1  0  1  0  0  0  0  113   97  53  0  0 100
 0 0 21 141024   2072   0   0  0  0  0  0  0  0  0  0  0  141  226  94  0  0 100
 0 0 21 141024   2072   0   0  0  0  0  0  0  0  0  0  0  139  415  98  1  2  96
 0 0 21 141008   2056   0   0  0  4  4  0  0  0  0  0  0  134  464 108  3  6  91
 0 0 21 140992   2048   0   0  0  0  0  0  0  0  0  0  0  138  201  93  0  0 100

where:

    si   The average number of LWPs swapped in per second; the number of Kbytes per second swapped in.
    so   The number of whole processes swapped out; the number of Kbytes per second swapped out.

The iostat Command
Display CPU Statistics
iostat -c
          cpu
 us sy wt id
 35  9  1 55

where:

    us   The percentage of time the system has spent in user mode
    sy   The percentage of time the system has spent in system mode
    wt   The percentage of time the system has spent waiting for I/O
    id   The percentage of time the system has spent idle

Display Disk Data Transferred and Service Times


iostat -d
          fd0           sd0           sd6          nfs1
kps tps serv  kps tps serv  kps tps serv  kps tps serv
  0   0    0   64   1   57    0   0    0    0   0    0

For each disk:

    kps    The number of Kbytes transferred per second
    tps    The number of transfers per second
    serv   The average service time in milliseconds

Display Disk Operations Statistics
iostat -D
          fd0           sd0           sd6          nfs1
rps wps util  rps wps util  rps wps util  rps wps util
  0   0  0.0    4   0  6.9    0   0  0.0    0   0  0.0

For each disk:

    rps    The number of reads per second
    wps    The number of writes per second
    util   The percentage of disk utilization

Display Device Error Summary Statistics


iostat -e
          ---- errors ---
device  s/w  h/w  trn  tot
fd0       0    0    0    0
sd0       0    0    0    0
sd6       0    1    0    1
nfs1      0    0    0    0
nfs2      0    0    0    0
nfs3      0    0    0    0

where:

    s/w   The number of soft errors
    h/w   The number of hard errors
    trn   The number of transport errors
    tot   The total number of all types of errors

Display All Device Errors
iostat -E
sd0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: SEAGATE  Product: ST32430W SUN2.1G  Revision: 0508  Serial No: 00155271
Size: 2.13GB <2127708160 bytes>
Media Error: 0  Device Not Ready: 0  No Device: 0  Recoverable: 0
Illegal Request: 0  Predictive Failure Analysis: 0
sd6   Soft Errors: 0 Hard Errors: 1 Transport Errors: 0
Vendor: TOSHIBA  Product: XM-5401TASUN4XCD  Revision: 1036  Serial No: 04/12/95
Size: 18446744073.71GB <-1 bytes>
Media Error: 0  Device Not Ready: 0  No Device: 1  Recoverable: 0
Illegal Request: 0  Predictive Failure Analysis: 0

Display Counts Rather Than Rates, Where Possible


iostat -I
   tty        fd0            sd0            sd6           nfs1          cpu
 tin tout  kpi tpi serv  kpi tpi serv  kpi tpi serv  kpi tpi serv  us sy wt id
   0   80    0   0    0   64   8   59    0   0    0    0   0    0   0  0  0 100

Report Only on the First n Disks


iostat -l 3
   tty        fd0            sd0            sd6          cpu
 tin tout  kps tps serv  kps tps serv  kps tps serv  us sy wt id
   0   33    0   0    0   48   6   92    0   0    0   0  0  0 100

Display Data Throughput in Mbytes/sec Rather Than Kbytes/sec
iostat -M
   tty        fd0            sd0            sd6           nfs1          cpu
 tin tout  Mps tps serv  Mps tps serv  Mps tps serv  Mps tps serv  us sy wt id
   0   40    0   0    0    0   7   77    0   0    0    0   0    0   0  0  2 98

Display Device Names as Mount Points or Device Addresses


iostat -n
   tty        fd0          c0t0d0         c0t6d0        rafael:v       cpu
 tin tout  kps tps serv  kps tps serv  kps tps serv  kps tps serv  us sy wt id
   0   40    0   0    0   68   9  121    0   0    0    0   0    0   0  0  2 98

Display Per-Partition Statistics as Well as Per-Device Statistics


iostat -p
   tty        fd0            sd0           sd0,a         sd0,b         cpu
 tin tout  kps tps serv  kps tps serv  kps tps serv  kps tps serv  us sy wt id
   0  118    0   0    0   53   8   86   53   8   86    0   0    0   0  1  1 98

Display Per-Partition Statistics Only
iostat -P
   tty       sd0,a         sd0,b         sd0,c         sd0,d         cpu
 tin tout  kps tps serv  kps tps serv  kps tps serv  kps tps serv  us sy wt id
   0   40   48   7   90    0   0    0    0   0    0    0   0    0   0  0  1 98

Display Characters Read and Written to Terminals


iostat -t
   tty
 tin tout
   0   16

where:

    tin    The number of characters in the terminal input queue
    tout   The number of characters in the terminal output queue

Display Extended Disk Statistics in Tabular Form
iostat -x
                              extended device statistics
device    r/s   w/s   kr/s   kw/s  wait  actv  svc_t  %w  %b
fd0       0.0   0.0    0.0    0.0   0.0   0.0    0.0   0   0
sd0      18.5   2.8  180.1  122.6   0.0   1.1   52.8   0  25
sd6       0.0   0.0    0.0    0.0   0.0   0.0    0.0   0   0
nfs1      0.0   0.0    0.0    0.0   0.0   0.0    0.0   0   0
nfs2      0.4   0.5    6.4    9.3   0.0   0.0   22.7   0   2
nfs3      2.0   0.0   43.3    0.0   0.0   0.0    7.8   0   1
nfs5      0.0   0.0    0.0    0.0   0.0   0.0    0.0   0   0

where:

    r/s     The number of reads per second.
    w/s     The number of writes per second.
    kr/s    The number of Kbytes read per second.
    kw/s    The number of Kbytes written per second.
    wait    The average number of transactions waiting for service (queue length). In the device driver, a read or write command is issued to the device driver and sits in the wait queue until the SCSI bus and disk are both ready.
    actv    The average number of transactions actively being serviced (active queue); the average number of commands actively being processed by the drive.
    svc_t   The average I/O service time. Normally this is the time taken to process the request after it has reached the front of the queue; in this case, however, it is actually the response time of the disk.
    %w      The percentage of time the queue is not empty; that is, the percentage of time there are transactions waiting for service.
    %b      The percentage of time the disk is busy; that is, transactions are in progress and commands are active on the drive.
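A common use of this report is to pick out disks that are both busy and slow. The sketch below assumes the ten-column `iostat -x` data layout shown above; the function name busy_disks and the thresholds (%b over 20 percent with svc_t over 30 ms) are this example's own illustrative choices, not fixed rules.

```shell
# Sketch: flag devices that are both busy (%b, column 10) and slow
# (svc_t, column 8) in `iostat -x`-style output. Thresholds are
# illustrative, not limits defined by iostat.
busy_disks() {
  awk '$2 ~ /^[0-9.]+$/ && NF == 10 {
         if ($10 + 0 > 20 && $8 + 0 > 30)
           printf "%s: svc_t=%sms %%b=%s%% - investigate\n", $1, $8, $10
       }'
}

# For example:  iostat -x 5 12 | busy_disks
printf 'sd0 18.5 2.8 180.1 122.6 0.0 1.1 52.8 0 25\n' | busy_disks
# -> sd0: svc_t=52.8ms %b=25% - investigate
```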

Break Down Service Time, svc_t, Into Wait and Active Times and Put Device Name at the End of the Line
iostat -xn
                               extended device statistics
  r/s   w/s   kr/s   kw/s  wait  actv  wsvc_t  asvc_t  %w  %b  device
  0.0   0.0    0.0    0.0   0.0   0.0     0.0     0.0   0   0  fd0
 10.8   2.2   97.2   68.7   0.0   0.7     0.1    57.1   0  15  c0t0d0
  0.0   0.0    0.0    0.0   0.0   0.0     0.0     0.0   0   0  c0t6d0
  0.0   0.0    0.0    0.0   0.0   0.0     0.0     0.0   0   0  rafael:vold(pid223)
  0.2   0.5    3.7    9.7   0.0   0.0     0.3    21.8   0   1  ra:/export/home13/bob
  1.0   0.0   20.6    0.0   0.0   0.0     0.2     7.4   0   1  shasta:/usr/dist

The mpstat Command
# mpstat 2
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    0     0    0   35    0    2    1    0    96    0   0   0 100
  2    0   0    4   208    8   34    0    2    1    0    81    0   0   0 100
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    2   0    0     0    0   30    0    1    0    0    54    0   0   0 100
  2    0   0    4   212   11   21    0    1    0    0    25    0   0   0 100
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    0     0    0   31    0    1    0    0    53    0   0   2  98
  2    0   0    4   248   48   28    0    1    0    0    35    0   0   0 100
^C#

where:

    CPU     The processor ID.
    minf    The number of minor faults.
    mjf     The number of major faults.
    xcal    The number of inter-processor cross-calls. These occur when one CPU wakes up another CPU by interrupting it.
    intr    The number of interrupts.
    ithr    The number of interrupts as threads (not counting the clock interrupt).
    csw     The number of context switches.
    icsw    The number of involuntary context switches (probably caused by higher priority processes).
    migr    The number of thread migrations to another processor.
    smtx    The number of spins on mutual exclusion locks (mutexes); that is, the number of times the CPU failed to obtain a mutex on the first try.
    srw     The number of spins on readers/writer locks; that is, the number of times the lock was not acquired on the first try.
    syscl   The number of system calls.
    usr     The percentage of time spent in user mode.
    sys     The percentage of time spent in system mode.
    wt      The percentage of time spent waiting on I/O.
    idl     The percentage of time spent idle.
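The smtx column is often watched per CPU as a sign of kernel lock contention. The sketch below assumes the 16-column mpstat layout shown above; the function name check_smtx and the threshold of 500 spins per second are this example's own assumptions (a commonly quoted rule of thumb, not a value defined by mpstat).

```shell
# Sketch: flag CPUs whose mutex spin count (smtx, column 10) is high in
# `mpstat`-style output. The threshold of 500 is an assumed rule of thumb.
check_smtx() {
  awk '$1 ~ /^[0-9]+$/ && NF == 16 {
         if ($10 + 0 > 500)
           printf "CPU %s: smtx=%s - heavy mutex contention\n", $1, $10
         else
           printf "CPU %s: smtx=%s - ok\n", $1, $10
       }'
}

# For example:  mpstat 5 12 | check_smtx
printf '0 0 0 0 0 0 35 0 2 612 0 96 0 0 0 100\n' | check_smtx
# -> CPU 0: smtx=612 - heavy mutex contention
```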

The netstat Command
Show the State of All Sockets
# netstat -a
UDP
   Local Address        Remote Address      State
-------------------- -------------------- -------
*.sunrpc                                   Idle
*.*                                        Unbound
*.32771                                    Idle
*.32773                                    Idle
*.lockd                                    Idle
...
TCP
   Local Address        Remote Address     Swind Send-Q Rwind Recv-Q   State
-------------------- -------------------- ----- ------ ----- ------ -------
*.*                  *.*                       0      0     0      0 IDLE
*.sunrpc             *.*                       0      0     0      0 LISTEN
*.*                  *.*                       0      0     0      0 IDLE
*.32771              *.*                       0      0     0      0 LISTEN
*.32772              *.*                       0      0     0      0 LISTEN
*.ftp                *.*                       0      0     0      0 LISTEN
*.telnet             *.*                       0      0     0      0 LISTEN
Active UNIX domain sockets
Address     Type       Vnode       Conn     Local Addr          Remote Addr
3000070d988 stream-ord 30000a67708 00000000 /var/tmp/aaa3daGZa
3000070db20 stream-ord 300000fe7b0 00000000 /tmp/.X11-unix/X0
3000070dcb8 stream-ord 00000000    00000000
#
The display for each active socket shows the local and remote addresses; the send (Send-Q) and receive (Recv-Q) queue sizes, in bytes; the send (Swind) and receive (Rwind) windows, in bytes; and the internal state of the protocol. The symbolic format normally used to display socket addresses is either hostname.port when the name of the host is specified, or network.port if a socket address specifies a network but no specific host. Unspecified, or wildcard, addresses and ports appear as *.

The possible state values for Transmission Control Protocol (TCP) sockets are as follows:

    BOUND          Bound, ready to connect or listen
    CLOSED         Closed; the socket is not being used
    CLOSING        Closed, then remote shutdown; awaiting acknowledgment
    CLOSE_WAIT     Remote shutdown; waiting for the socket to close
    ESTABLISHED    Connection has been established
    FIN_WAIT_1     Socket closed; shutting down connection
    FIN_WAIT_2     Socket closed; waiting for shutdown from remote
    IDLE           Idle, opened but not bound
    LAST_ACK       Remote shutdown, then closed; awaiting acknowledgment
    LISTEN         Listening for incoming connections
    SYN_RECEIVED   Initial synchronization of the connection underway
    SYN_SENT       Actively trying to establish connection
    TIME_WAIT      Wait after close for remote shutdown retransmission
Show All Interfaces Under Dynamic Host Configuration Protocol (DHCP) Control
# netstat -d
msg 1:  group = 263  mib_id = 0  length = 16
msg 2:  group = 263  mib_id = 5  length = 980
...
--- Entry 1 ---
Group = 263, mib_id = 0, length = 16, valp = 0x2cc60
--- Entry 2 ---
Group = 263, mib_id = 5, length = 980, valp = 0x2da50
49 records for udpEntryTable:
...
TCP
   Local Address        Remote Address     Swind Send-Q Rwind Recv-Q       State
-------------------- -------------------- ----- ------ ----- ------ -----------
rafael.1023          ra.nfsd               8760      0  8760      0 ESTABLISHED
localhost.32805      localhost.32789      32768      0 32768      0 ESTABLISHED
...
Active UNIX domain sockets
Address     Type       Vnode       Conn     Local Addr          Remote Addr
3000070d988 stream-ord 30000a67708 00000000 /var/tmp/aaa3daGZa
3000070db20 stream-ord 300000fe7b0 00000000 /tmp/.X11-unix/X0
3000070dcb8 stream-ord 00000000    00000000
#

Show Interface Multicast Group Memberships


# netstat -g
Group Memberships
Interface Group                RefCnt
--------- -------------------- ------
lo0       224.0.0.1                 1
hme0      224.0.0.1                 1
#

Show State of All TCP/IP Interfaces
# netstat -i
Name  Mtu   Net/Dest  Address     Ipkts   Ierrs  Opkts   Oerrs  Collis  Queue
lo0   8232  loopback  localhost   20140   0      20140   0      0       0
hme0  1500  rafael    rafael      860070  0      809700  0      0       0
#

where:

    Ipkts    The number of packets received since the machine was booted.
    Ierrs    The number of input errors (should be less than 1 percent of Ipkts). If the input error rate is high (over 0.25 percent), the host might be dropping packets.
    Opkts    The number of packets transmitted since the machine was booted.
    Oerrs    The number of output errors (should be less than 1 percent of Opkts).
    Collis   The number of collisions experienced while sending packets. This is interpreted as a percentage of Opkts. A normal value is less than 1 percent. A network-wide collision rate greater than 5 to 10 percent can indicate a problem.

A machine with active network traffic should show both Ipkts and Opkts continually increasing.
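Because Collis is meant to be read as a percentage of Opkts, the division is worth scripting. This sketch assumes the ten-column `netstat -i` layout shown above; the function name coll_rate is this example's own, and the 5 percent cutoff follows the guideline given above.

```shell
# Sketch: compute the collision rate as a percentage of output packets from
# `netstat -i`-style lines (Opkts is column 7, Collis is column 9), and flag
# interfaces above the 5 percent guideline.
coll_rate() {
  awk '$1 != "Name" && NF == 10 && $5 ~ /^[0-9]+$/ {
         rate = ($7 > 0) ? 100 * $9 / $7 : 0
         verdict = (rate > 5) ? "investigate" : "ok"
         printf "%s: collisions=%.2f%% of Opkts - %s\n", $1, rate, verdict
       }'
}

# For example:  netstat -i | coll_rate
printf 'hme0 1500 rafael rafael 860070 0 809700 0 0 0\n' | coll_rate
# -> hme0: collisions=0.00% of Opkts - ok
```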

Show Detailed Kernel Memory Allocation Information
# netstat -k
kstat_types: raw 0 name=value 1 interrupt 2 i/o 3 event_timer 4

segmap:
fault 494755 faulta 0 getmap 728460 get_use 154 get_reclaim 656037
get_reuse 65235 get_unused 0 get_nofree 0 rel_async 15821 rel_write 16486
rel_free 376 rel_abort 0 rel_dontneed 15819 release 711598 pagecreate 64294
...
lo0:
ipackets 20140 opackets 20140

hme0:
ipackets 873597 ierrors 0 opackets 831583 oerrors 0 collisions 0 defer 0
framing 0 crc 0 sqe 0 code_violations 0 len_errors 0 ifspeed 100 buff 0
oflo 0 uflo 0 missed 0 tx_late_collisions 0 retry_error 0 first_collisions 0
nocarrier 0 inits 8 nocanput 0 allocbfail 0 runt 0 jabber 0 babble 0
tmd_error 0 tx_late_error 0 rx_late_error 0 slv_parity_error 0
tx_parity_error 0 rx_parity_error 0 slv_error_ack 0 tx_error_ack 0
rx_error_ack 0 tx_tag_error 0 rx_tag_error 0 eop_error 0 no_tmds 0
no_tbufs 0 no_rbufs 0 rx_late_collisions 0 rbytes 543130524 obytes 637185145
multircv 34120 multixmt 0 brdcstrcv 47166 brdcstxmt 100 norcvbuf 0
noxmtbuf 0 phy_failures 0
...
lm_config:
buf_size 80 align 8 chunk_size 80 slab_size 8192 alloc 1 alloc_fail 0
free 0 depot_alloc 0 depot_free 0 depot_contention 0 global_alloc 1
global_free 0 buf_constructed 0 buf_avail 100 buf_inuse 1 buf_total 101
buf_max 101 slab_create 1 slab_destroy 0 memory_class 0 hash_size 0
hash_lookup_depth 0 hash_rescale 0 full_magazines 0 empty_magazines 0
magazine_size 7 alloc_from_cpu0 0 free_to_cpu0 0 buf_avail_cpu0 0
#

Show STREAMS Statistics
# netstat -m
streams allocation:
             current  maximum  cumulative total  allocation failures
streams          287      312             15176                    0
queues           722      765             34903                    0
mblk             532     1651            691067                    0
dblk             518     2693           6814776                    0
linkblk            8      169                18                    0
strevent           9      169            215517                    0
syncq             14       67                77                    0
qband              0        0                 0                    0

659 Kbytes allocated for streams data
#

Show Multicast Routing Tables


# netstat -M
Virtual Interface Table is empty

Multicast Forwarding Cache
 Origin-Subnet      Mcastgroup      # Pkts  In-Vif  Out-vifs/Forw-ttl

Total no. of entries in cache: 0

Show Multicast Routing Statistics
# netstat -Ms
multicast routing:
        0 hits - kernel forwarding cache hits
        0 misses - kernel forwarding cache misses
        0 packets potentially forwarded
        0 packets actually sent out
        0 upcalls - upcalls made to mrouted
        0 packets not sent out due to lack of resources
        0 datagrams with malformed tunnel options
        0 datagrams with no room for tunnel options
        0 datagrams arrived on wrong interface
        0 datagrams dropped due to upcall Q overflow
        0 datagrams cleaned up by the cache
        0 datagrams dropped selectively by ratelimiter
        0 datagrams dropped - bucket Q overflow
        0 datagrams dropped - larger than bkt size

Show Network Addresses as Numbers, Not Names


# netstat -n
TCP
   Local Address        Remote Address     Swind Send-Q Rwind Recv-Q       State
-------------------- -------------------- ----- ------ ----- ------ -----------
129.147.11.213.1023  129.147.4.111.2049    8760      0  8760      0 ESTABLISHED
127.0.0.1.32805      127.0.0.1.32789      32768      0 32768      0 ESTABLISHED
127.0.0.1.32789      127.0.0.1.32805      32768      0 32768      0 ESTABLISHED
...
129.147.11.213.34574 129.147.4.36.45430    8760      0  8760      0 CLOSE_WAIT
129.147.11.213.974   129.147.4.43.2049     8760      0  8760      0 ESTABLISHED
Active UNIX domain sockets
Address     Type       Vnode       Conn     Local Addr          Remote Addr
3000070d988 stream-ord 30000a67708 00000000 /var/tmp/aaa3daGZa
3000070db20 stream-ord 300000fe7b0 00000000 /tmp/.X11-unix/X0
3000070dcb8 stream-ord 00000000    00000000
#

Show the Address Resolution Protocol (ARP) Tables
# netstat -p
Net to Media Table
Device   IP Address            Mask             Flags   Phys Addr
------ -------------------- ---------------    -----   ---------------
hme0   mandan-qfe1          255.255.255.255            08:00:20:93:9e:2d
hme0   129.147.11.248       255.255.255.255            00:00:0c:07:ac:00
hme0   micmac-qfe1          255.255.255.255            08:00:20:92:9f:c0
hme0   ssdlegal             255.255.255.255            08:00:20:8f:6d:eb
hme0   rafael               255.255.255.255    SP      08:00:20:78:54:90
hme0   224.0.0.0            240.0.0.0          SM      01:00:5e:00:00:00
#

Show the Routing Tables


# netstat -r

Routing Table:
  Destination           Gateway              Flags  Ref    Use   Interface
-------------------- -------------------- ----- ----- ------ ---------
129.147.11.0         rafael                U        3    165  hme0
224.0.0.0            rafael                U        3      0  hme0
default              129.147.11.248        UG       0   6097
localhost            localhost             UH       0   9591  lo0
#

The first column shows the destination network, while the second shows the router through which packets are forwarded. The U flag indicates that the route is up, while the G flag indicates that the route is to a gateway. The H flag indicates that the destination is a fully qualified host address, rather than a network. A D flag indicates that the route was created dynamically by an ICMP redirect. The Ref column shows the number of active uses of each route, and the Use column shows the number of packets sent over each route. The Interface column shows the network interface that the route uses.
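Because the Flags column is positional, routes of interest can be filtered mechanically. The following is a small illustrative sketch, with a sample table embedded so it is self-contained; on a live system you would pipe `netstat -rn` into the same awk program instead.

```shell
# Sketch: select gateway (G-flag) routes from netstat -r style output.
# The sample table below mirrors the output shown above; on a live
# system, replace the printf with:  netstat -rn
routing_table='Destination          Gateway              Flags  Ref    Use  Interface
129.147.11.0         rafael               U        3    165  hme0
224.0.0.0            rafael               U        3      0  hme0
default              129.147.11.248       UG       0   6097  hme0
localhost            localhost            UH       0   9591  lo0'

# Skip the header line, then print any route whose flags include G.
gateways=$(printf '%s\n' "$routing_table" |
    awk 'NR > 1 && $3 ~ /G/ { print $1 " via " $2 }')
printf '%s\n' "$gateways"
```

Only the default route carries the G flag in this sample, so the filter prints a single line.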

Show Per Protocol Statistics
# netstat -s

UDP     udpInDatagrams      = 30568     udpInErrors         =     0
        udpOutDatagrams     = 30537

TCP     tcpRtoAlgorithm     =     4     tcpRtoMin           =   200
        tcpRtoMax           = 60000     tcpMaxConn          =    -1
        tcpActiveOpens      =  1177     tcpPassiveOpens     =   338
        tcpAttemptFails     =     4     tcpEstabResets      =     5
        tcpCurrEstab        =    28     tcpOutSegs          =834720
        tcpOutDataSegs      =636268     tcpOutDataBytes     =612298133
        tcpRetransSegs      =    57     tcpRetransBytes     = 53036
        tcpOutAck           =198438     tcpOutAckDelayed    = 26757
        ...

IP      ipForwarding        =     2     ipDefaultTTL        =   255
        ipInReceives        =809310     ipInHdrErrors       =     0
        ipInAddrErrors      =     0     ipInCksumErrs       =     0
        ipForwDatagrams     =     0     ipForwProhibits     =     0
        ipInUnknownProtos   =     0     ipInDiscards        =     0
        ipInDelivers        =822773     ipOutRequests       =848209
        ...

ICMP    icmpInMsgs          =     6     icmpInErrors        =     0
        icmpInCksumErrs     =     0     icmpInUnknowns      =     0
        icmpInDestUnreachs  =     3     icmpInTimeExcds     =     0
        ...

IGMP:
     3237 messages received
        0 messages received with too few bytes
        0 messages received with bad checksum
     3237 membership queries received
        0 membership queries received with invalid field(s)
        0 membership reports received
        0 membership reports received with invalid field(s)
        0 membership reports received for groups to which we belong
        0 membership reports sent
#

Show Extra Socket and Routing Table Information
# netstat -v

TCP
Local/Remote Address              Swind  Snext    Suna     Rwind  Rnext    Rack     Rto   Mss   State
--------------------------------- ----- -------- -------- ----- -------- -------- ----- ----- -----------
rafael.1023      ra.nfsd           8760 b6e2967a b6e2967a  8760 09c399a6 09c399a6   405  1460 ESTABLISHED
localhost.32805  localhost.32789  32768 02adf2a5 02adf2a5 32768 02afd90f 02afd90f  4839  8192 ESTABLISHED
localhost.32789  localhost.32805  32768 02afd90f 02afd90f 32768 02adf2a5 02adf2a5  4818  8192 ESTABLISHED
localhost.32808  localhost.32802  32768 02b68d0b 02b68d0b 32768 02b71347 02b71347   433  8192 ESTABLISHED
localhost.32802  localhost.32808  32768 02b71347 02b71347 32768 02b68d0b 02b68d0b   406  8192 ESTABLISHED
localhost.32811  localhost.32810  32768 02b7d3ce 02b7d3ce 32768 02b86f76 02b86f76   462  8192 ESTABLISHED
localhost.32810  localhost.32811  32768 02b86f76 02b86f76 32768 02b7d3ce 02b7d3ce  3375  8192 ESTABLISHED
localhost.32814  localhost.32802  32768 02c1ae7e 02c1ae7e 32768 02c3aa4d 02c3aa4d   407  8192 ESTABLISHED
localhost.32802  localhost.32814  32768 02c3aa4d 02c3aa4d 32768 02c1ae7e 02c1ae7e   462  8192 ESTABLISHED
localhost.32817  localhost.32816  32768 02c37344 02c37344 32768 02c43cf7 02c43cf7   462  8192 ESTABLISHED
...
localhost.32816  localhost.32847  32768 02fb576a 02fb576a 32768 02fa9745 02fa9745  3375  8192 ESTABLISHED
rafael.32879     bast.imap         8760 03935a31 03935a31  8760 3de1ed9d 3de1ed9d   452  1460 ESTABLISHED
rafael.telnet    proto198.Central.Sun.COM.32831
                                   8760 5f0321d6 5f0321d6  8760 029be11a 029be11a   457  1460 ESTABLISHED
rafael.34069     proto198.Central.Sun.COM.6000
                                   8760 5f6577c1 5f6577c1  8760 02c75d3c 02c75d3c   457  1460 ESTABLISHED
rafael.34574     bast.45430        8760 85bc6ae2 85bc6ae2  8760 1496dc76 1496dc76  4778  1460 CLOSE_WAIT
rafael.973       shasta.nfsd       8760 affeda04 affeda04  8760 00c090b0 00c090b0  3719  1460 ESTABLISHED

Active UNIX domain sockets
Address          Type          Vnode        Conn      Local Addr          Remote Addr
3000070d988      stream-ord    30000a67708  00000000  /var/tmp/aaa3daGZa
3000070db20      stream-ord    300000fe7b0  00000000  /tmp/.X11-unix/X0
3000070dcb8      stream-ord    00000000     00000000
#

The nfsstat Command
Show Client and Server Access Control List (ACL) Information
# nfsstat -a
Server nfs_acl:
Version 2: (0 calls)
null       getacl     setacl     getattr    access
0 0%       0 0%       0 0%       0 0%       0 0%
Version 3: (0 calls)
null       getacl     setacl
0 0%       0 0%       0 0%

Client nfs_acl:
Version 2: (1 calls)
null       getacl     setacl     getattr    access
0 0%       0 0%       0 0%       1 100%     0 0%
Version 3: (0 calls)
null       getacl     setacl
0 0%       0 0%       0 0%
#

Show Client Information Only
# nfsstat -c
Client rpc:
Connection oriented:
calls      badcalls   badxids    timeouts   newcreds   badverfs
149404     4          0          0          0          0
timers     cantconn   nomem      interrupts
0          4          0          0
Connectionless:
calls      badcalls   retrans    badxids    timeouts   newcreds
9          1          0          0          0          0
badverfs   timers     nomem      cantsend
0          4          0          0

Client nfs:
calls      badcalls   clgets     cltoomany
146831     1          146831     0
Version 2: (6 calls)
null       getattr    setattr    root       lookup     readlink
0 0%       5 83%      0 0%       0 0%       0 0%       0 0%
read       wrcache    write      create     remove     rename
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
link       symlink    mkdir      rmdir      readdir    statfs
0 0%       0 0%       0 0%       0 0%       0 0%       1 16%
Version 3: (144706 calls)
null       getattr    setattr    lookup     access     readlink
0 0%       42835 29%  5460 3%    25355 17%  18010 12%  742 0%
read       write      create     mkdir      symlink    mknod
18888 13%  23273 16%  2985 2%    188 0%     3 0%       0 0%
remove     rmdir      rename     link       readdir    readdirplus
1870 1%    5 0%       944 0%     17 0%      990 0%     744 0%
fsstat     fsinfo     pathconf   commit
73 0%      32 0%      275 0%     2017 1%

Client nfs_acl:
Version 2: (1 calls)
null       getacl     setacl     getattr    access
0 0%       0 0%       0 0%       1 100%     0 0%
Version 3: (2118 calls)
null       getacl     setacl
0 0%       2118 100%  0 0%
#

where:

●   calls – The total number of calls sent.

●   badcalls – The total number of calls rejected by the remote
    procedure call (RPC) layer.

●   retrans – The total number of retransmissions.

●   badxid – The number of times that a duplicate acknowledgment was
    received for a single NFS request. If it is approximately equal to
    timeout and above 5 percent, look for a server bottleneck.

●   timeout – The number of calls that timed out waiting for a reply
    from the server. If the value is more than 5 percent, then RPC
    requests are timing out. A badxid value of less than 5 percent
    indicates that the network is dropping parts of the requests or
    replies. Check that intervening networks and routers are working
    properly. Consider reducing the NFS buffer size parameters. (Refer
    to the mount_nfs command and the rsize and wsize options.)
    However, reducing these parameters will reduce peak throughput.

●   wait – The number of times a call had to wait because no client
    handle was available.

●   newcred – The number of times the authentication information had
    to be refreshed.

●   timers – The number of times the time-out value was greater than
    or equal to the specified time-out value for a call.

●   readlink – The number of times a read was made to a symbolic
    link. If this number is high (over 10 percent), it could mean that
    there are too many symbolic links.

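The 5 percent rules above reduce to simple arithmetic on the client RPC counters. A minimal sketch, using the counter values from the sample output earlier in this section (on a live system the numbers would be parsed from a fresh `nfsstat -c` run):

```shell
# Sketch: apply the 5 percent rules to client RPC counters.
# These values come from the sample nfsstat -c output above; parsing
# them from a live run is omitted for clarity.
calls=149404
timeout=0      # calls that timed out waiting for a reply
badxid=0       # duplicate acknowledgments received

# Percentages, computed with awk since sh has no floating point.
timeout_pct=$(awk -v t="$timeout" -v c="$calls" 'BEGIN { printf "%.2f", 100 * t / c }')
badxid_pct=$(awk -v b="$badxid" -v c="$calls" 'BEGIN { printf "%.2f", 100 * b / c }')

echo "timeout ${timeout_pct}%  badxid ${badxid_pct}%"
# timeout > 5% with badxid close to timeout -> suspect a slow server.
# timeout > 5% with badxid near zero        -> suspect a lossy network.
```

For this healthy client, both ratios are zero and neither rule fires.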
Show NFS Mount Options
# nfsstat -m
/home/craigm from ra:/export/home13/craigm
 Flags: vers=3,proto=tcp,sec=sys,hard,intr,link,symlink,acl,rsize=32768,wsize=32768,retrans=5

/usr/dist from shasta,seminole,otero,wolfcreek:/usr/dist
 Flags: vers=3,proto=tcp,sec=sys,hard,intr,llock,link,symlink,acl,rsize=32768,wsize=32768,retrans=5
 Failover: noresponse=0, failover=0, remap=0, currserver=shasta

/home/downhill from ra:/export/home19/downhill
 Flags: vers=3,proto=tcp,sec=sys,hard,intr,link,symlink,acl,rsize=32768,wsize=32768,retrans=5
#

Output includes the server name and address, mount flags, current read and write sizes, the retransmission count, and the timers used for dynamic retransmission.
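Because the flag string is comma-separated, individual options are easy to pull out in a pipeline. A minimal sketch, with the flag string copied from the sample output above so it runs standalone:

```shell
# Sketch: extract rsize and wsize from an nfsstat -m style flag string.
# The string is copied from the sample output above; on a live system
# it would come from `nfsstat -m` itself.
flags='vers=3,proto=tcp,sec=sys,hard,intr,link,symlink,acl,rsize=32768,wsize=32768,retrans=5'

# Split on commas, then match the option name before the equals sign.
rsize=$(printf '%s\n' "$flags" | tr ',' '\n' | awk -F= '$1 == "rsize" { print $2 }')
wsize=$(printf '%s\n' "$flags" | tr ',' '\n' | awk -F= '$1 == "wsize" { print $2 }')

echo "read size ${rsize}, write size ${wsize}"
```

The same pattern works for any `name=value` option in the flag list, such as `retrans` or `vers`.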

Show NFS Information for Both Client and Server
# nfsstat -n
Server nfs:
calls      badcalls
0          0
Version 2: (0 calls)
null       getattr    setattr    root       lookup     readlink
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
read       wrcache    write      create     remove     rename
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
link       symlink    mkdir      rmdir      readdir    statfs
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
Version 3: (0 calls)
null       getattr    setattr    lookup     access     readlink
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
read       write      create     mkdir      symlink    mknod
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
remove     rmdir      rename     link       readdir    readdirplus
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
fsstat     fsinfo     pathconf   commit
0 0%       0 0%       0 0%       0 0%

Client nfs:
calls      badcalls   clgets     cltoomany
159727     1          159727     0
Version 2: (6 calls)
null       getattr    setattr    root       lookup     readlink
0 0%       5 83%      0 0%       0 0%       0 0%       0 0%
read       wrcache    write      create     remove     rename
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
link       symlink    mkdir      rmdir      readdir    statfs
0 0%       0 0%       0 0%       0 0%       0 0%       1 16%
Version 3: (157404 calls)
null       getattr    setattr    lookup     access     readlink
0 0%       45862 29%  5622 3%    27178 17%  19391 12%  784 0%
read       write      create     mkdir      symlink    mknod
21439 13%  26273 16%  3224 2%    190 0%     4 0%       0 0%
remove     rmdir      rename     link       readdir    readdirplus
2008 1%    5 0%       1073 0%    17 0%      1012 0%    771 0%
fsstat     fsinfo     pathconf   commit
73 0%      35 0%      288 0%     2155 1%
#

Show RPC Information
# nfsstat -r
Server rpc:
Connection oriented:
calls      badcalls   nullrecv   badlen     xdrcall    dupchecks  dupreqs
0          0          0          0          0          0          0
Connectionless:
calls      badcalls   nullrecv   badlen     xdrcall    dupchecks  dupreqs
2          0          0          0          0          0          0

Client rpc:
Connection oriented:
calls      badcalls   badxids    timeouts   newcreds   badverfs
162588     4          0          0          0          0
timers     cantconn   nomem      interrupts
0          4          0          0
Connectionless:
calls      badcalls   retrans    badxids    timeouts   newcreds
9          1          0          0          0          0
badverfs   timers     nomem      cantsend
0          4          0          0
#

Show Server Information Only
# nfsstat -s
Server rpc:
Connection oriented:
calls      badcalls   nullrecv   badlen     xdrcall    dupchecks  dupreqs
0          0          0          0          0          0          0
Connectionless:
calls      badcalls   nullrecv   badlen     xdrcall    dupchecks  dupreqs
2          0          0          0          0          0          0

Server nfs:
calls      badcalls
0          0
Version 2: (0 calls)
null       getattr    setattr    root       lookup     readlink
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
read       wrcache    write      create     remove     rename
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
link       symlink    mkdir      rmdir      readdir    statfs
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
Version 3: (0 calls)
null       getattr    setattr    lookup     access     readlink
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
read       write      create     mkdir      symlink    mknod
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
remove     rmdir      rename     link       readdir    readdirplus
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
fsstat     fsinfo     pathconf   commit
0 0%       0 0%       0 0%       0 0%

Server nfs_acl:
Version 2: (0 calls)
null       getacl     setacl     getattr    access
0 0%       0 0%       0 0%       0 0%       0 0%
Version 3: (0 calls)
null       getacl     setacl
0 0%       0 0%       0 0%
#

where:

●   calls – The total number of RPC calls received. NFS is just one
    RPC application.

●   badcalls – The total number of RPC calls rejected. If this value
    is nonzero, then RPC requests are being rejected. Reasons include
    having a user in too many groups, attempts to access an unexported
    file system, or an improper secure RPC configuration.

●   nullrecv – The number of times an RPC call was not there when one
    was thought to be received.

●   badlen – The number of calls with a length shorter than the RPC
    minimum.

●   xdrcall – The number of RPC calls whose header could not be
    decoded by the external data representation (XDR) translation.

●   readlink – If this value is more than 10 percent of the mix, then
    client machines are making excessive use of symbolic links on
    NFS-exported file systems. Replace the symbolic link with a
    directory, perhaps using a loopback mount on both server and
    clients.

●   getattr – If this value is more than 60 percent of the mix, then
    check that the attribute cache value on the NFS clients is set
    correctly. It may have been reduced or set to 0. Refer to the
    mount_nfs command and the actimeo option.

●   null – If this value is more than 1 percent, then the automounter
    time-out values are set too short. Null calls are made by the
    automounter to locate a server for the file system.

●   writes – If this value is more than 5 percent, then configure
    NVRAM in the disk subsystem or a logging file system on the
    server.
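Each of these thresholds is a share of the total call mix, so checking them is a matter of dividing one counter by the total. A minimal sketch, using the NFS Version 3 counts from the sample client output earlier in this appendix (a live check would parse the counters from nfsstat directly):

```shell
# Sketch: check an operation's share of the NFS call mix against the
# thresholds above. Counts come from the sample Version 3 client
# statistics shown earlier in this appendix.
total=144706     # total Version 3 calls
getattr=42835    # getattr calls (60% threshold)
readlink=742     # readlink calls (10% threshold)

# Percentage helper; awk does the floating-point division.
pct() { awk -v n="$1" -v t="$2" 'BEGIN { printf "%.1f", 100 * n / t }'; }

echo "getattr $(pct "$getattr" "$total")%  readlink $(pct "$readlink" "$total")%"
# getattr is well under 60% and readlink is well under 10%,
# so neither tuning rule applies to this workload.
```

The same helper applies to the null and writes rules; only the counter names and thresholds change.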
