SA400 - Solaris System Performance Management - Oh - 0898

Sun Educational Services
Enterprise Server
Performance Management
SA-400
Enterprise Server Performance Management August 1998

About This Course

This course is designed to provide an introduction to
performance tuning to students who are responsible for the
maintenance and tuning of their systems.
It provides an overview of the different areas of SunOS and
the hardware that are relevant to tuning, explaining the
concepts and available tuning data and parameters for each
area.
The course is intended for experienced system administrators
and hardware specialists with a background in Sun SPARC
system administration.

Copyright 1998 Sun Microsystems, Inc. All Rights Reserved. SunService August 1998
Course Map

Module-by-Module
Overview
• Introduction
• Processes and Threads

• Tuning Tools
• The System as a Cache
• Memory Tuning
• System Buses
• I/O Tuning
• UFS Tuning
• Network Tuning
• CPU Scheduling
• Summary

Appendices
• Interface Card Characteristics
• SyMON Installation
• Accounting

Course Objectives
• Describe the purpose and goals of tuning
• Describe the common characteristics of system
hardware and software components
• Understand how the major system components operate
• Understand the meanings of tuning report elements
and how they relate to system performance
• Identify the usual tuning parameters and understand
when and how to set them

Topics Not Covered

• Capacity Planning
• General system administration
• Network administration and management
• General Solaris problem resolution
• Data center operations
• Disk storage management

Course Prerequisites
• Basic Sun system administration as demonstrated in
Solaris 2.X System Administration Essentials (SA-135), and
Solaris 2.X System Administration (SA-285)
• Basic organization and operation of the SPARCcenter™
2000, SPARCserver™ 1000, Ultra Enterprise 3000, 4000,
5000, and 6000, and storage arrays

Introductions
1. Who are you?
2. What do you do?
3. What is your background and experience as it relates
to this course?
4. What do you hope to accomplish this week?

How to Use the Course Materials

• Course Map
• Relevance
• Additional Resources
• Overhead Image
• Lecture
• Exercise
• Check Your Progress
• Think Beyond

Module 1
Introduction

Module Overview
• Course map
• Relevance
• Objectives

Course Map

Objectives
• Explain what performance tuning really means
• Characterize a workload to make meaningful
measurements
• Explain performance measurement terminology
• Describe a typical performance tuning task

What is Tuning?
• Improving resource usage
• Improving user perception of the system
• Making more efficient use of system resources
• Increasing system capacity
• Adapting the system to your workload
• Lowering costs

Basic Tuning Procedure
1. Get baseline measurements
2. Find a bottleneck
3. Remove it
4. Is performance satisfactory? If yes, stop; if no, repeat the procedure

Conceptual Model of Performance
[Figure: a workload drives the computer system under test, which produces performance measurements; configuration changes and random fluctuations and noise also act on the system]

Where to Tune First?

Priority Area Example
OS Parameters ncsize
DBMS Shared memory
Application File access pattern
Server and Network Hardware Disk layout

Where to Tune?
• The closer to the application you tune, the more
leverage you have
• System administrators have little control at the
application or DBMS levels
• For the OS and hardware, more general things are
usually done that benefit all applications
• You can tune the system specifically for one
application if you accept the trade-offs
• You may need to understand the applications’
characteristics to properly tune them

Performance Measurement
Terminology
• Bandwidth
• Throughput
• Latency
• Utilization

Sample Tuning Task

• Confirm what the problem appears to be
• Collect appropriate tuning data
• Where are the problems?
• Generic system problems
• Application-specific problems
• Develop a plan to fix them
• Fix what’s reasonable
• Don’t fix what’s not broken
• Verify performance is acceptable

Performance Graphs
[Figure: throughput, response time, and utilization plotted against load level or user count. The graphs mark the saturation point and the regions of gradual and fast response degradation.]

The Tuning Cycle
[Figure: a cycle linking memory, CPU, and I/O]

SunOS Components
User level
• User applications and libraries
Kernel
• System calls
• Process management
• File system and mmap()
• Memory management
• Scheduling
• IPC
• Buffer cache, segmap, and page structures
• Character and block device drivers
• Machine-dependent code
Hardware

Check Your Progress

• Explain what performance tuning really means
• Characterize a workload to make meaningful
measurements
• Explain performance measurement terminology
• Describe a typical performance tuning task

Module 2
Processes and Threads

Module Overview
• Course map
• Relevance
• Objectives

Course Map

Objectives
• Describe the lifetime of a process in SunOS
• Explain multithreading and its benefits
• Describe locking and its performance implications
• Describe the functions of the clock thread
• Describe the process and thread tuning data
• Describe the related tuning parameters

Process Lifetime
Stage              Parent   Child
Creation           fork()
Execution                   exec()
Termination                 exit()
Status collection  wait()

fork with exec Example

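The shell repeats the following loop:
1. Print the prompt
2. Read an input line from the terminal (fget()); on EOF, the shell is done
3. Parse the line
4. Call pid = fork()
• If pid < 0, report an error
• If pid == 0 (child), exec() the input command
• Otherwise (parent), wait() on the pid of the child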
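
The fork, exec, and wait steps of the shell example can be sketched in a few lines. This is a minimal illustration rather than how any real shell is written; the function name run_command is hypothetical, and Python reports a failed fork by raising OSError instead of returning a negative pid.

```python
import os
import sys


def run_command(argv):
    """Fork a child, exec argv in it, and return the child's exit status."""
    pid = os.fork()  # raises OSError on failure (the "pid < 0" branch)
    if pid == 0:
        # Child: replace this process image with the requested command.
        try:
            os.execvp(argv[0], argv)
        except OSError:
            os._exit(127)  # exec failed; leave without running parent cleanup
    # Parent: collect the child's status so it does not linger as a zombie.
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1


if __name__ == "__main__":
    code = run_command([sys.executable, "-c", "raise SystemExit(3)"])
    print("child exit status:", code)
```

The parent blocks in waitpid() until the child terminates, which corresponds to the status collection stage of the process lifetime.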

Process Performance Issues

• The maximum number of processes allowed is 30,000
• You waste real memory if you specify more than you
need
• Executables that use shared libraries take fewer system
resources (disk space, memory, I/O, and so on)
• Threads are more efficient than multiple processes
• We’ll discuss threads later in this module
• Zombie processes cause no performance problems
• Don’t overallocate IPC resources like shared memory

Process Lifetime Performance Data

The sar -v and -c options provide data related to the process
lifetime:
• scall/s - Total number of system calls made by all
processes in the system, user and system
• fork/s - Total number of fork system calls
• exec/s - Total number of exec system calls
• proc-sz - Percentage of the total number of allowed
processes that are in use

Process-Related Tunable Parameters

• All are scaled from maxusers
Name        Default                        Min   Max
maxusers    Number of MB of real memory    8     1024 (2048)
max_nprocs  16 * maxusers + 10             138   30000
maxuprc     max_nprocs - 5                 133   29995
ndquot      (maxusers * 10) + max_nprocs   213   50480
npty        48                             48    3000+
pt_cnt      48                             48    3000+
ncsize      max_nprocs + maxusers + 80     226   34906
ufs_ninode  max_nprocs + maxusers + 80     226   34906
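
The scaling from maxusers can be checked with a short calculation. The sketch below applies three of the default formulas from the table; the function name is hypothetical, and the Min and Max limits are deliberately not modeled.

```python
def process_tunable_defaults(maxusers):
    """Apply the default formulas from the table for a given maxusers value.

    Illustrative only: the kernel also enforces the Min and Max columns,
    which this sketch does not model.
    """
    max_nprocs = 16 * maxusers + 10
    return {
        "max_nprocs": max_nprocs,
        "maxuprc": max_nprocs - 5,
        "ncsize": max_nprocs + maxusers + 80,
    }


if __name__ == "__main__":
    for name, value in process_tunable_defaults(8).items():
        print(name, value)
```

With maxusers at its minimum of 8, the results match the Min column for these three parameters.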

Multithreading
• Thread - a logical sequence of instructions in a program
• The kernel is multithreaded
• Multiple tasks may be running in the kernel
simultaneously and independently
• A user process may have many application threads that
execute independently of each other
• There is still only one executable in the process
• Fewer system resources are used than with multiple processes
• Special programming techniques are required

Application Threads
proc
Threads
LWP
• Each thread has its own stack

• Threads share the process address space with each other
• The threads execute independently (and concurrently)
• Threads are completely invisible outside of the process
• Threads cannot be controlled from the command line

Process Execution
[Figure: processes with priorities 87, 62, 45, and 15 are dispatched onto the CPUs]

Thread Execution
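[Figure: within the process, threads with priorities 73, 44, 12, and 3 are placed on LWPs; system-wide, LWPs with priorities 87, 62, 45, and 15 are dispatched onto the CPUs]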

Process Thread Examples

[Figure: five processes (proc1 through proc5) in the user layer, CPU assignment in the kernel layer, and CPUs in the hardware layer. The legend distinguishes application threads, LWPs, CPUs, and kernel threads.]

Performance Issues
Multithreading an application allows it to:
• Be broken into separate tasks that can be scheduled and
executed independently
• Take advantage of multiprocessors with less overhead
than multiple processes
• Share memory without going through the overhead and
complexity of IPC mechanisms
• Use a cleaner programming model for certain types of
applications
• Extend the program more easily

Thread and Process Performance

Creation Primitives
Primitive              Microseconds   Ratio
Unbound thread create  52             -
Bound thread create    350            6.7
fork()                 1700           32.7
• fork is actually worse due to memory management issues
• Timings were obtained on a SPARCstation 2

Locking
• Locks are used to synchronize threads by serialization
• They protect critical data from simultaneous write
access
• Locks must be used when threads share writable data
• Solaris provides several different types of locks
• Four types, used depending on requirements

• Bad locking design can cause performance problems
• Usually requires a significant programming repair
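
As a small illustration of serializing write access to shared data, the sketch below protects a shared counter with a mutual-exclusion lock. It uses Python's threading module rather than the Solaris lock types, and the names Counter and run_workers are hypothetical.

```python
import threading


class Counter:
    """Shared writable data protected by a lock."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Only one thread at a time may execute this critical section.
        with self._lock:
            self.value += 1


def run_workers(threads=4, per_thread=10000):
    counter = Counter()

    def worker():
        for _ in range(per_thread):
            counter.increment()

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return counter.value


if __name__ == "__main__":
    print(run_workers())
```

Every increment takes the same lock, so heavily contended code like this also shows how a coarse lock serializes threads that could otherwise run in parallel.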

Locking Problems
• Lock contention
• Granularity
• Inappropriate lock type
• Deadlock
• ’Lost’ locks
• Race Conditions
• Incomplete implementation

The lockstat Command

• Identifies lock use in the kernel
• Unidentified delays may be caused by lock contention
• Excessive counts may indicate a problem
# lockstat lpstat
Adaptive mutex block: 2 events
Count indv cuml rcnt nsec Lock Caller

----------------------------------------------------------------------------
1 50% 50% 1.00 87500 0xf5ae43d0 esp_poll_loop+0xcc
1 50% 100% 1.00 151000 0xf5ae3ab8 esp_poll_loop+0x8c
----------------------------------------------------------------------------

Interrupt Levels
Level  Source
15     Nonmaskable interrupt
       Real-time CPU clock
       Serial communications
10     clock
       Video vertical interrupt
       graphics, ALM
       Tape, Ethernet
       Disks
       Deferred interrupts
0      All interrupts enabled
Priority decreases from level 15 (highest) to level 0 (lowest)

The clock Routine

• The clock executes at interrupt level 10
• Most system timing is run off this clock
• Each execution of the clock routine is a tick
• For most processors, there are 100 ticks per second
• The HZ environment variable shows how many ticks occur per second
• Can be changed to 1000 - Discussed later
• This is the limit of normal timing resolution
• The time of day clock will run slow

Clock Tick Processing

Every clock tick the system checks for:
• Timer time-outs
• Sleep and wait times
• I/O timeouts
• Network timeouts
• Profiling timers
• Active thread has used up its time slice
• Process CPU resource limit exceeded

Check Your Progress

• Describe the lifetime of a process in SunOS
• Explain multithreading and its benefits
• Describe locking and its performance implications
• Describe the functions of the clock thread
• Describe the process and thread tuning data
• Describe the related tuning parameters

Module 3
Tuning Tools

Module Overview
• Course map
• Relevance
• Objectives

Course Map

Objectives
• Describe how performance data is collected in the
system
• Describe how the tools retrieve the data
• List the SunOS-provided tuning data collection tools
• Explain how to view and set tuning parameters
• Describe the system accounting functions
• Describe the use of SyMON

How Data is Collected

• The data collection tools read counters and other data
structures
• Wait for the specified interval
• Reread the data structures
• Calculate and scale the counter differences
• The first reading uses zeros
• Ignore the first line in tuning reports
• If the kernel routines don’t collect the data, it cannot be
reported
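
The read, wait, reread, and difference cycle can be sketched as follows. This is an illustration of the technique only: read_counters stands in for whatever returns the current counter values, and the names are hypothetical rather than part of any SunOS tool.

```python
import time


def sample_rates(read_counters, interval, samples, sleep=time.sleep):
    """Yield per-second rates computed from successive counter snapshots."""
    previous = {}  # the first reading is compared against zeros
    for i in range(samples):
        if i > 0:
            sleep(interval)  # wait for the specified interval
        current = read_counters()
        # Difference the counters and scale by the interval length.
        yield {name: (value - previous.get(name, 0)) / interval
               for name, value in current.items()}
        previous = current


if __name__ == "__main__":
    fake = iter([{"fork": 1000}, {"fork": 1050}, {"fork": 1150}])
    for rates in sample_rates(lambda: next(fake), 10, 3, sleep=lambda s: None):
        print(rates)
```

The first line printed is computed against zeros, so it reflects the counters' totals rather than recent activity, which is why the first line of a report is ignored.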

The kstat
• Created by a system component
• One for every thing the component wants to measure
• Specific format for each component
• Read by the ioctl system call
• A copy is made into the requesting process
• A whole chain of kstats may be read
• The component updates the kstat as needed
• Usually as the recorded event occurs

How the Tools Process the Data

• Data is accumulated each interval
• The change from the previous interval is calculated
• The information is written:
• To the screen (stdout)
• To a graph (such as by SyMON)
• To a file for post-processing
• Post-processing can combine intervals
• Be careful of too much smoothing

Tuning Tools Provided with SunOS

• sar
• vmstat
• iostat
• mpstat
• netstat
• nfsstat
• SyMON
• Other utilities

sar
• sar does general data collection – a bit of everything
• Has 15 different collection types
• Covers:
• File system, system calls
• Physical memory usage, paging, swap, kernel
memory
• Block device activity, TTY activity
• Process table status
• SV IPC activity, CPU usage

sar Sample Report

# sar -abquv 10
SunOS e5-node1 5.6 Generic_105181-05 sun4u 07/26/98
11:15:55 iget/s namei/s dirbk/s

bread/s lread/s %rcache bwrit/s lwrit/s %wcache pread/s pwrit/s
runq-sz %runocc swpq-sz %swpocc
%usr %sys %wio %idle
proc-sz ov inod-sz ov file-sz ov lock-sz
11:16:05 0 0 0
0 0 100 0 0 100 0 0
0 0 0 100
32/3530 0 1594/15320 0 213/213 0 0/0
#

vmstat
• vmstat reports primarily processor-related statistics
• Has 4 operands
• Covers:
• Process, interrupt, CPU cache and CPU activity
• Virtual memory and swapping activity
• Disk activity
• Information similar to sar

vmstat Sample Report

# vmstat -i
interrupt total rate
--------------------------------
clock 15575472 100
zsc0 5 0
zsc1 19456 0
hmec0 0 0
hmec1 0 0
hmec2 0 0
hmec3 29727 0
--------------------------------
Total 15624660 100
# vmstat
procs memory page disk faults cpu
r b w swap free re mf pi po fr de sr s0 s1 s6 sd in sy cs us sy id
0 0 0 31840 186520 0 0 0 0 0 0 0 0 0 0 0 109 2 36 0 0 100
#

iostat
• iostat reports I/O device performance data
• Includes NFS mount points and raw devices
• Has 12 operands
• Covers:
• CPU utilization, TTY activity
• Disk activity, device errors
• Hints:
• Use -M to see Mbytes per second instead of Kbytes
• Use -n to get cXtXdX device names

iostat Sample Report

# iostat -nxp
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
0.0 0.0 0.1 0.3 0.0 0.0 0.0 50.6 0 0 c1t0d0
0.0 0.0 0.1 0.3 0.0 0.0 0.0 50.6 0 0 c1t0d0s0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c1t0d0s1
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c1t0d0s2
...
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c0t18d0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c0t18d0s2
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c0t18d0s3
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c0t18d0s4
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 e5-node1:vold(pid244)
#

mpstat
• mpstat reports CPU-related activity
• Reports CPU-specific statistics
• No operands
• Reports:
• Page faults, interrupts, cross-calls
• Locking, context switches, CPU utilization
• Usually not needed

mpstat Sample Report

# mpstat
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 8 6 5 21 0 0 0 0 1 0 0 0 100
1 0 0 19 203 2 15 0 0 0 0 1 0 0 0 100
#

netstat
• netstat reports network-related activity
• Has 14 operands
• Covers:
• Socket state, interface state, DHCP information
• arp and routing tables
• STREAMS statistics
• Gives you excellent visibility on the use of your network
• Can be complicated if you have many network
interfaces

netstat Sample Report

# netstat -i
Name Mtu Net/Dest Address Ipkts Ierrs Opkts Oerrs Collis Queue
lo0 8232 loopback localhost 68 0 68 0 0 0
hme3 1500 e5-node1 e5-node1 29727 0 481 0 612 0
# netstat -s
UDP
udpInDatagrams = 274 udpInErrors = 0
udpOutDatagrams = 66
TCP tcpRtoAlgorithm = 4 tcpRtoMin = 200
tcpRtoMax =240000 tcpMaxConn = -1
tcpActiveOpens = 0 tcpPassiveOpens = 0
...

nfsstat
• nfsstat reports nfs-related activity
• Has 6 operands
• Covers:
• Client-side information
• NFS-mounted file system information
• NFS and RPC information
• Can help isolate the exact source of an NFS problem
• Can be useful with DBMS network problems as well
• Many use RPCs to transfer requests and data

nfsstat Sample Report

# nfsstat -m
/SEC from turbo:/SEC
Flags: vers=3,proto=tcp,sec=sys,hard,intr,link,symlink,acl,rsize=32768,wsize=32768,retrans=5
...
# nfsstat
Server rpc:
Connection oriented:
calls badcalls nullrecv badlen xdrcall dupchecks
0 0 0 0 0 0
dupreqs
0
Connectionless:
calls badcalls nullrecv badlen xdrcall dupchecks
0 0 0 0 0 0
dupreqs
0
...

SyMON
• Provided on the SMCC Supplements CD-ROM
• Provides remote monitoring of:
• Performance data
• System failures
• System configuration
• Syslog information
• Uses SNMP to transfer data
• Rules can signal exception conditions

Other Utilities
• /usr/proc/bin
• proctool
• System accounting

/usr/proc/bin
• This directory was added in SunOS 5.5
• It contains 10 commands that format data from /proc
• Commands include:
• ptree – A process's ancestors and children
• pmap – A process's virtual memory segments
• pfiles – Open file information for a process
• The output can be cryptic

proctool
• proctool can be used to continuously display the
following information on a per-process basis:
• Resource usage
• I/O usage
• Memory map
• Process credentials
• Traceback
• Scheduling parameters
• proctool is not part of Solaris and must be
downloaded separately

Viewing Tuning Parameters

• There are several utilities that will provide current
settings of system variables
• sysdef
• adb
• ndd

Setting Tuning Parameters

• Parameters are usually set in /etc/system
• The system must be rebooted for changes to take
effect
• Some changes can be made with adb while the system is
running
• Be careful - you could take the system down
• Not all parameters can be changed with adb
• Network device parameters can be set and inspected
with ndd
• The changes can also be made in /etc/system

/etc/system
• This file can be used to modify kernel-tunable variables.
• The basic format is:
set [modulename:]variablename = value
• /etc/system can also be used to force modules to be
loaded at boot time, to specify a root device other than
the default, and so on.
• The system must be rebooted for changes to take effect.
• Care should be used when updating /etc/system; this
file modifies the operation of the kernel.
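
To make the set [modulename:]variablename = value format concrete, the sketch below parses a few such lines. The sample values are hypothetical and chosen only for illustration, not as recommendations; the parser handles only set lines and skips comment lines, which begin with an asterisk in /etc/system.

```python
import re

# set [modulename:]variablename = value
SET_LINE = re.compile(r"^set\s+(?:(\w+):)?(\w+)\s*=\s*(\S+)$")


def parse_set_lines(text):
    """Return {(module or None, variable): integer value} for each set line."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("*"):  # blank line or comment
            continue
        match = SET_LINE.match(line)
        if match:
            module, name, value = match.groups()
            settings[(module, name)] = int(value, 0)  # decimal or 0x hex
    return settings


SAMPLE = """\
* Hypothetical example values, for illustration only
set maxusers = 64
set ufs:ufs_ninode = 10000
set maxpgio=0x78
"""

if __name__ == "__main__":
    for key, value in parse_set_lines(SAMPLE).items():
        print(key, value)
```

A check like this can catch a malformed line before the file is used at the next reboot.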

System Accounting
• Accounting is not enabled by default
• Accounting allows you to track what is executed and
how much resource it uses
• It can be very useful in determining what is using your
system
• You must run the reports to reduce the data
• It does add system overhead
• If you have a varying environment it is the easiest way
to decide where to spend your time
• Correlate the accounting data with your tuning reports

Check Your Progress

• Describe how performance data is collected in the
system
• Describe how the tools retrieve the data
• List the SunOS-provided tuning data collection tools
• Explain how to view and set tuning parameters
• Describe the system accounting functions

Module 4
The System as A Cache

Module Overview
• Course map
• Relevance
• Objectives

Course Map

Objectives
• Describe the purpose of a cache
• Describe the basic operation of a cache

• Describe the various types of caches implemented in
Sun systems
• Identify the important performance issues for a cache
• Identify the tuning points and parameters for a cache
• Understand the memory system’s hierarchy of caches
and their relative performance

What is a Cache?
• Keeps accessed data near its user
• Must provide high-speed access
• Holds a small subset of the available data
• Caches are used by hardware and software
• Many different caches around the system
• Critical for performance
• Many different ways to manage a cache

Hardware Caches
• There are hardware caches all over the system
• CPU level 1 and level 2
• PDC/TLB
• Main store
• I/O components
• Each system has different caches with different
characteristics

SRAM and DRAM

• Hardware caches are usually SRAM (Static RAM)
• Static RAM does not need refresh
• DRAM data fades after about 10 cycles
• Refresh rewrites it, but this delays access

• SRAM does not fade, so no refresh is needed
• SRAM takes 4 times the transistors of DRAM
• It will always be more costly and take more power

CPU Caches
• Usually two levels of cache in a CPU
• Level 1 (Internal) - Usually 4-40 Kbytes in size
• Level 2 (External) - Usually 0.5-8 Mbytes in size

• Level 1 cache operates at CPU speeds
• Data must be in the L1 cache before the CPU can use it
• Many choices for cache behavior
• We will discuss these soon

Cache Operation
The cache’s user requests data
• If found, this is a cache hit. The data is returned.
• If not found, this is a cache miss
• Data is requested from the next level up
• The requester may wait until the data arrives

Cache Miss Processing

• Caches are usually kept 100% full
• On a cache miss, new data must be put in the cache
• This process is called a move in
• A location for the incoming data must be chosen
• Almost always a LRU algorithm is used
• Least Recently Used - The oldest data is replaced

Cache Replacement
• Data to be replaced may have to be copied back
• Changes are not always reflected up
• A move out or flush copies modified data back to the
next level up
• Once the move out is complete, the move in starts
• The amount of data moved is a cache line
• A hardware cache line is usually 64 bytes
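
Hits, misses, move ins, and move outs can be demonstrated with a small software model. This is a sketch under simplifying assumptions: the class name is hypothetical, a Python dictionary plays the next level up, each key stands for one cache line, and a write miss allocates the line without first fetching it.

```python
from collections import OrderedDict


class WriteBackLRUCache:
    """Toy write-back cache with least-recently-used replacement."""

    def __init__(self, capacity, backing):
        self.capacity = capacity
        self.backing = backing        # the next level up
        self.lines = OrderedDict()    # key -> (value, modified flag)
        self.hits = self.misses = self.move_outs = 0

    def _move_in(self, key, value, modified):
        if len(self.lines) >= self.capacity:
            # Choose the least recently used line as the victim.
            victim, (old, was_modified) = self.lines.popitem(last=False)
            if was_modified:
                self.backing[victim] = old  # move out before the move in
                self.move_outs += 1
        self.lines[key] = (value, modified)

    def read(self, key):
        if key in self.lines:
            self.hits += 1
            self.lines.move_to_end(key)  # mark as most recently used
            return self.lines[key][0]
        self.misses += 1
        value = self.backing[key]        # request from the next level up
        self._move_in(key, value, False)
        return value

    def write(self, key, value):
        if key in self.lines:
            self.hits += 1
            self.lines[key] = (value, True)
            self.lines.move_to_end(key)
        else:
            self.misses += 1
            self._move_in(key, value, True)


if __name__ == "__main__":
    memory = {"a": 0, "b": 0, "c": 0}
    cache = WriteBackLRUCache(2, memory)
    cache.read("a"); cache.read("b"); cache.write("a", 1)
    cache.read("c")   # replaces clean line "b": no move out needed
    cache.read("b")   # replaces modified line "a": move out first
    print(cache.hits, cache.misses, cache.move_outs, memory["a"])
```

Note that the change to "a" is not reflected in the backing store until the line is replaced, which is the write-back behavior described for changes that are not always reflected up.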

Cache Operation
[Figure: on a cache hit, the request is satisfied from the cache. On a cache miss, the request goes to the next level cache; a move out may precede the move in.]

Cache Hit Rate

• The performance of a cache depends on its hit rate
• The cache hit rate is how often requested data is
available
• The cache hit rate depends on
• Size of the cache
• Fetch rate
• Locality of data references
• Cache structure

Cache Hit Rate

• Misses are very expensive
• For example:
• Hit cost is 20 units, miss cost is 600 units
• At 100% hit rate, cost = 100 * 20 = 2000
• 99% hit cost = (99 * 20) + (1 * 600) = 1980 + 600 = 2580
• 1% miss cost is 580/2000 or 29% degradation!
• The exact numbers will depend on the system
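
The arithmetic in this example can be reproduced with a short calculation, which also makes it easy to try other hit rates. The function names are hypothetical; the costs of 20 and 600 units are the ones used in the example.

```python
def access_cost(hits, misses, hit_cost=20, miss_cost=600):
    """Total cost, in units, of a mix of cache hits and misses."""
    return hits * hit_cost + misses * miss_cost


def degradation_percent(hits, misses, hit_cost=20, miss_cost=600):
    """Extra cost relative to the same number of accesses all hitting."""
    ideal = access_cost(hits + misses, 0, hit_cost, miss_cost)
    actual = access_cost(hits, misses, hit_cost, miss_cost)
    return 100 * (actual - ideal) / ideal


if __name__ == "__main__":
    for hits in (100, 99, 95, 90):
        misses = 100 - hits
        print(hits, access_cost(hits, misses), degradation_percent(hits, misses))
```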

Effects of CPU Cache Misses

[Graph: performance (%) plotted against cache hit rate (%), for hit rates from 100% down to 86%]
• There are many ways to try to improve the hit rate

Cache Characteristics
• There are several pairs of mutually exclusive cache
characteristics
Virtual Physical
Write Through Write Back
Direct Mapped Set Associative
Harvard Unified
Snooping Cache

Virtual Address Cache
[Figure: virtual address cache. The CPU presents virtual addresses to the cache; the MMU sits between the cache and physical memory and supplies physical addresses.]
Physical Address Cache
[Figure: physical address cache. The MMU sits between the CPU and the cache, so the cache is accessed with physical addresses.]

Direct Mapped and Set Associative Caches
• Direct-mapped cache – One possible location for the data
• Set-associative cache – Multiple possible locations

• All possible locations are checked in parallel
• Number of locations is referred to as ways: 4-way
• Set associative caches are more complex, but provide
better contention management
• Harvard caches separate instructions and data
• Usually used in CPU level 1 caches

Write-Through and Write-Back Caches
• Write-through caches always update the next level up
when their data changes
• Write-back caches usually update only when the line is
flushed
• Write-back caches assume a line modified once will be
modified again
• This may allow write cancellation
• Unnecessary updates are avoided, saving bus capacity

Cache Thrashing
• Picking the wrong memory stride essentially turns off
the cache
• Accessing data at a regular address interval can map to
the same cache location every time
• This will cause a 100% cache miss rate
• The performance degradation is easily noticeable
• Can be hard to diagnose; everything is working
correctly
• You must inspect the application and compare it with
the cache characteristics
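
The effect can be reproduced with a small model of a direct-mapped cache. This is a sketch under assumed parameters (a 16-Kbyte cache with 64-byte lines and 8-byte array elements); the function names are hypothetical, and the trace imitates a loop that copies one array to another.

```python
def miss_rate(addresses, cache_bytes=16 * 1024, line_bytes=64):
    """Fraction of accesses that miss in a direct-mapped cache."""
    lines = cache_bytes // line_bytes
    tags = [None] * lines
    misses = 0
    for addr in addresses:
        block = addr // line_bytes
        index = block % lines        # the one possible location for this block
        if tags[index] != block:
            misses += 1
            tags[index] = block      # move in, replacing whatever was there
    return misses / len(addresses)


def copy_trace(src_base, dst_base, count, element_bytes=8):
    """Addresses touched by: for i in range(count): dst[i] = src[i]."""
    trace = []
    for i in range(count):
        trace.append(src_base + i * element_bytes)
        trace.append(dst_base + i * element_bytes)
    return trace


if __name__ == "__main__":
    cache = 16 * 1024
    print(miss_rate(copy_trace(0, cache, 1024)))        # arrays one cache size apart
    print(miss_rate(copy_trace(0, cache + 64, 1024)))   # offset by one cache line
```

When the two arrays are exactly one cache size apart, each access evicts the line the other array needs next, so every access misses. Shifting one array by a single cache line removes the conflict, and only the first access to each line misses.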

Cache Thrashing
[Figure: two virtual addresses, array_in[] and array_out[], are translated through the MMU page table to physical pages that compete for the same location in the cache]

CPU Cache Snooping

• Write-back caches are great for CPU performance
• But what about hardware caches in an MP system?
• Other CPUs might want our modified data
• To avoid write-through caching, we must snoop
• All requests to memory are watched by the cache
controller
• When a request is made for modified data in its
cache, the controller provides it

CPU Cache Snooping
[Figure: two CPU modules (Module 0 and Module 1), each with a CPU, snooping cache, MMU, and write buffers, share a bus with main memory and the I/O interfaces]

System Cache Hierarchies
[Figure: the hierarchy runs from the CPU registers through the L1 cache, L2 cache, main memory, disk, and network to people]

System Cache Hierarchies

Cache        Size                    Speed    Managed by
Registers    1 Kbyte                 3 ns     Program
L1 CPU       ~32 Kbytes              10 ns    Hardware
L2 CPU       0.5-8 Mbytes            60 ns    Hardware
Main memory  16 Mbytes - 64 Gbytes   600 ns   Software
Disk         1 Gbyte - 11 Tbytes     20 ms    Software, people
Network      N/A                     200 ms   Software, people

Relative Access Times

• If one nanosecond were one second
Level           Nanoseconds   Scaled time
CPU cycle time  3             3 seconds
Level 1 cache   10            10 seconds
Level 2 cache   60            1 minute
Main memory     600           10 minutes
Disk            20,000,000    7.7 months
Network         200,000,000   About 6.3 years
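
The scaling behind this comparison is easy to reproduce. The sketch below treats each nanosecond as one second and picks a readable unit; the function name is hypothetical, and 30-day months and 365-day years are assumed, so results are approximate. With those assumptions the 200 ms network time comes out near 6.3 years.

```python
def scaled_time(nanoseconds):
    """Describe an access time as if each nanosecond lasted one second."""
    seconds = nanoseconds
    units = (
        ("years", 365 * 86400),   # assumes 365-day years
        ("months", 30 * 86400),   # assumes 30-day months
        ("days", 86400),
        ("minutes", 60),
    )
    for name, size in units:
        if seconds >= size:
            return f"{seconds / size:.1f} {name}"
    return f"{seconds} seconds"


if __name__ == "__main__":
    for level, ns in (("CPU cycle time", 3), ("Level 1 cache", 10),
                      ("Level 2 cache", 60), ("Main memory", 600),
                      ("Disk", 20_000_000), ("Network", 200_000_000)):
        print(level, scaled_time(ns))
```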

Cache Performance Issues

• Cache and locality of reference are important
• Linked lists are bad for locality

• SPARCworks™ Analyze can help locate problems
• Cache thrashing
• Check algorithms for power of 2 sizes
• Run on a different type of system
• vmstat -c reports cache flushes (but not hits)

Things You Can Tune

• Application
• Locality of reference
• Memory strides
• Algorithms
• Compiler Optimization
• Main memory management
• Covered in the next module

Check Your Progress

• Describe the purpose of a cache
• Describe the basic operation of a cache

• Describe the various types of caches implemented in
Sun systems
• Identify the important performance issues for a cache
• Identify the tuning points and parameters for a cache
• Understand the memory system’s hierarchy of caches
and their relative performance

Module 5
Memory Tuning

Module Overview
• Course map
• Relevance
• Objectives

Course Map

Objectives
• Describe the structure of virtual memory
• Describe how the contents of main memory are
managed by the page daemon
• Understand why the swapper runs
• Interpret the measurement information from these
components
• Understand the tuning mechanisms for these
components

Virtual Memory
• What the programs see
• Moves between main memory and disk
• Created and deleted as necessary
• Managed by the OS
• Solaris acts as the main memory cache manager
• Contained in address spaces (processes)

• Managed in segments of contiguous bytes

Process Address Space

ffffffff
Kernel
Kernel base
Stack
Shared libraries
Hole
Heap
Data and BSS
Text
0

Types of Segments
Region                 Segment type
DVMA                   segkmem
Kernel stacks          segkp
Buffer cache           segmap
Kernel data            segkmem
Kernel text            segkmem
(Kernel base)
Stack                  segvn
Device (frame buffer)  segdev
Data                   segvn
Text                   segvn

Virtual Address Translation

[Figure: the CPU issues virtual addresses for a process's text, data, and stack; the MMU translates them to physical addresses in physical memory]
• Translations are performed by the MMU using a three-stage table lookup process

Page Descriptor Cache

• Also called TLB or DLAT
• Caches virtual to real address translations
• Entries created by MMU table lookups
• Usually between CPU and L1 cache
• Limited size - usually 256 pages
• Fully associative
• One per CPU; entries are based on the CPU’s activity
• Invalid entries are purged by crosscalls

Virtual Address Lookup

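1. A request is made for a virtual address
2. Is the translation in the cache? If yes (PDC hit), done
3. If not (cache miss), wait for the MMU to do a table lookup to find the pte for the page being accessed
4. Is the valid bit set in the pte? If yes, put the translation in the cache and return to step 1
5. If not (page fault), is the page anywhere in main memory? If yes (minor page fault), load the pte
6. If not (major page fault), get a free page and do I/O; if the I/O is successful, load the pte, otherwise report an error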

The Memory Free Queue

• Main memory is a fully-associative disk cache managed
by the OS
• Move-ins and move-outs occur as in any cache
• For memory we call this paging
• Moves involving disk are very expensive
• A move out followed by a move in is too costly
• Free memory pages are kept available to avoid the need
for move outs before a move in
• We always need to have ’enough’ pages on the free
queue

The Paging Mechanism

• Every quarter second a check is made to see if the
amount of free memory is less than the quantity
specified by the lotsfree parameter
• If it is, the page daemon is run to replenish the free
memory queue
• Pages are stolen from their users if they haven’t been
used recently
• Pages are scanned to determine which can be stolen
• The more needed, the faster the scan rate
• If the page daemon can’t keep up, swapping may occur

Page Scanner Processing

fastscan
Scan rate
deficit
slowscan
0 minfree desfree lotsfree
Free Memory

Default Paging Parameters

Variable         Purpose                              Default
lotsfree         Start scanning, at slowscan rate     1/64 of mainstore
desfree          Swapping control                     1/2 lotsfree
minfree          Scan at fastscan rate                1/2 desfree
fastscan         Fastest scan rate                    1/2 available memory
slowscan         Slowest scanning rate                100
maxpgio          Limit on pages to write per second   60
handspreadpages  Alternate tuning mechanism           fastscan
• Scanning is capped at a fixed percentage of the CPU
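
The default thresholds and the way the scan rate rises as free memory falls can be sketched as follows. The threshold fractions come from the table; the linear ramp between slowscan and fastscan is an assumption made for illustration, as is the floor parameter, which sets the free-memory level at which fastscan is reached (zero by default, or minfree to follow the table's description). The function names are hypothetical and all quantities are in pages.

```python
def default_thresholds(total_pages):
    """Default paging thresholds, in pages, from the table's fractions."""
    lotsfree = total_pages // 64
    desfree = lotsfree // 2
    minfree = desfree // 2
    return {"lotsfree": lotsfree, "desfree": desfree, "minfree": minfree}


def scan_rate(free, lotsfree, slowscan, fastscan, floor=0):
    """Pages scanned per second for a given amount of free memory.

    ASSUMPTION: the rate ramps linearly from slowscan at lotsfree
    up to fastscan at 'floor' pages of free memory.
    """
    if free >= lotsfree:
        return 0                       # enough free memory: no scanning
    if free <= floor:
        return fastscan
    fraction = (lotsfree - free) / (lotsfree - floor)
    return slowscan + (fastscan - slowscan) * fraction


if __name__ == "__main__":
    t = default_thresholds(65536)
    print(t)
    for free in (1024, 768, 512, 256, 0):
        print(free, scan_rate(free, t["lotsfree"], 100, 8100))
```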

An Alternate Approach
The Clock Algorithm
[Figure: pages arranged in a circle, each with a reference bit (0 or 1). Two hands, fronthand and backhand, sweep the circle separated by handspreadpages.]

maxpgio
• Modified pages need to be written back
• If every page was modified there would be periodic
I/O spikes
• This would not help performance
• We can’t take only unmodified pages
• Some modified pages must be taken
• But how many is safe?
• Adjust maxpgio to a realistic value to set the limit
• Try starting with number of disk controllers * 60

Swapping
• Swapping is a last resort – ’desperation swapping’
• If the page daemon consistently cannot keep up with the
demand for memory, memory use must be cut
• The number of swaps is not important; the fact that
there are swaps is
• A process will be swapped when its last LWP is
swapped out
• Its memory will not be freed until then
• Do not try to tune swaps, try to eliminate them

When is the Swapper Desperate?

The swapper is considered desperate for memory when:
• There is more than one runnable thread
AND
• avefree and avefree30 are both less than desfree
AND
• The paging rate is excessive OR avefree is less than
minfree
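
The three conditions combine into a single test, sketched below. The function name is hypothetical, and because the slide does not define what makes the paging rate excessive, that judgment is passed in as a flag.

```python
def swapper_is_desperate(runnable_threads, avefree, avefree30,
                         desfree, minfree, paging_rate_excessive):
    """Combine the three conditions for desperation swapping."""
    return (runnable_threads > 1
            and avefree < desfree and avefree30 < desfree
            and (paging_rate_excessive or avefree < minfree))


if __name__ == "__main__":
    # Free memory is below desfree but above minfree, paging is not excessive.
    print(swapper_is_desperate(3, 400, 450, 512, 256, False))
    # Same situation once avefree drops below minfree.
    print(swapper_is_desperate(3, 200, 450, 512, 256, False))
```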

What is Not Swappable

A thread is not swappable if:
1. It is in the system (SYS) or real-time (RT) scheduling
class
2. It is part of a process that is being forked
3. It is in execution or stopped by a signal
4. It is exiting or is a zombie
5. It is a system thread
6. It is blocking a higher priority thread
A priority is calculated for each thread left eligible

Swapping Priorities
• The higher the calculated priority, the more likely the
thread will be swapped
• The priority calculation is:
epri = swapin_time - rss/maxpgio/2 - pri
• Where:
• swapin_time – How long since the thread was last
swapped
• rss – Amount of memory used by the thread’s process
• pri – The thread’s dispatching priority

Swap In
• Once the pressure is off the memory subsystem,
swapped out threads will be brought back in
• A swapping priority is calculated again
• The lowest priority swapped-out thread will be brought
back in
• The process continues until all threads are back
• Memory pressure may cause new swaps

Swap Space
• Virtual memory includes most of main memory
• Reservable virtual memory is disk swap space plus
physical memory, minus kernel pages, minus a
3.5-Mbyte safety factor.
• Systems with a large amount of physical memory
may not require much disk space for swap
• Remember to allocate enough to hold a panic dump
• Swap space is not allocated until the page is written out
• If disk space is not available, the page stays in
memory

tmpfs
• /tmp is a virtual memory ramdisk
• It uses virtual memory like any other user
• It uses real memory and swap space
• Heavy use of /tmp can fill main memory

• You can limit use of /tmp
• Use the size option in the vfstab entry
• Do not move it to a hard drive partition

• You can allocate other tmpfs directories if you wish

Performance Data
• Reporting is performed by vmstat and sar
• They don’t report the same data
• Many of the numbers should be ignored
• The scan rate is the most useful value
• Everything else can be misleading

Available Paging Statistics

Activity                       Source (cpu_vminfo.)  vmstat (unit)  sar (unit)
Pages scanned                  scan                  sr (pages)     pgscan/s (pages)
Scanner deficit                deficit               de (pages)     -
Page in requests               pgin                  -              pgin/s
Pages paged in                 pgpgin                pi (kb)        ppgin/s (pages)
Page out requests              pgout                 -              pgout/s
Pages paged out                pgpgout               po (kb)        ppgout/s (pages)
Pages freed                    dfree                 fr (kb)        pgfree/s (pages)
Page reclaimed from free list  pgfrec                re (pages)     -
Average amount of free memory  freemem               free (kb)      freemem (pages)
Minor page faults              hat_fault             mf             -
Pages becoming shared          -                     -              atch/s (pages)

Available Paging Data

cpu_vminfo.as_fault - Address translation faults (page faults)
cpu_vminfo.deficit - Average amount of memory a new process needs
cpu_vminfo.dfree - The page daemon has freed a page
cpu_vminfo.hat_fault - Number of page faults resolved with pages already valid
cpu_vminfo.pgfrec - A requested page was found on the freelist
cpu_vminfo.pgin - Page in requests
cpu_vminfo.pgout - Page-out requests
cpu_vminfo.pgpgin - Number of pages paged in
cpu_vminfo.pgpgout - Number of pages paged out
cpu_vminfo.pgrec - Pages reclaimed from the freelist
cpu_vminfo.scan - Number of pages checked by the page-out scanner
vminfo.freemem - Amount of memory on the free queue

Available Swapping Statistics

Activity                            Source      vmstat (unit)               sar (unit)
Average amount of free swap space   swap_avail  swap (unreserved swap, K)
Average amount of free swap space   swap_free                               freeswap (unallocated swap, sectors)
Process swap outs                   swapout     so                          swpout/s
Pages swapped out                   pgswapout                               bswout/s (sectors)
Average number of swapped out LWPs  swpque                                  swpq-sz
Percentage of the interval any LWP  swpocc                                  %swpocc
was swapped out
LWP swap ins                        swapin      si                          swpin/s
Pages swapped in                    pgswapin                                bswin/s (sectors)

Available Swapping Data

cpu_vminfo.pgswapin - Number of pages read in with LWPs swapped in
cpu_vminfo.pgswapout - Number of pages written out with LWPs swapped out
cpu_vminfo.swapin - Number of LWPs swapped in
cpu_vminfo.swapout - Number of processes swapped out
sysinfo.swpocc - Percentage of the interval that any LWP was swapped out
sysinfo.swpque - Number of LWPs currently swapped out
vminfo.swap_avail - Maximum swap space less reserved swap space
vminfo.swap_free - Unallocated swap space: unreserved plus reserved but not
allocated

Shared Libraries
Shared libraries have a longer startup time when a routine is
used; however:
• Once the routine is in memory, the system can avoid
subsequent I/Os.
• There are huge savings in both main memory and disk
space.
• Tools such as Analyze from SPARCworks can be used
to improve locality of reference.

Paging Review
[Figure: address space segments and where each one is paged]
• Kernel – Kernel pages are fixed in memory.
• Stack and heap – Anonymous memory is paged out to swap
space (for example, /dev/rdsk/c0t0d0s1).
• mmapped files and device space – File data is paged in and
out to its file (for example, /home/me/myfile).
• Library data and library code – Dynamically linked
libraries are paged in from the filesystem (for example,
/usr/lib/libc.so).
• Program data and program code – Program code is paged in
from the filesystem (for example, /bin/cp).

Things You Can Tune

• Page daemon parameters
• lotsfree, minfree, desfree
• fastscan, slowscan
• maxpgio
• /tmp usage
• Use of shared libraries
• Swap space placement
• Amount of memory
• Do not try to tune swaps

Check Your Progress

• Describe the structure of virtual memory
• Describe how the contents of main memory are
managed by the page daemon
• Understand why the swapper runs
• Interpret the measurement information from these
components
• Understand the tuning mechanisms for these
components

Module 6
System Buses

Module Overview
• Course map
• Relevance
• Objectives

Course Map

Objectives
• Define a bus
• Explain general bus operation
• Describe the processor buses: UPA, Gigaplane,
Gigaplane XB
• Describe the data buses: SBus, PCI
• Explain how to calculate bus loads
• Describe how to recognize and configure to avoid bus
problems
• Dynamic reconfiguration considerations

What is a Bus?
• A mechanism to transfer data, just like a network
• It usually connects more than two components
• It runs at a fixed speed
• It is usually bi-directional and broadcasts messages
• It is like a highway
• Speed limit is a maximum, not always a required
speed
• It has lower speed ’on and off ramps’
• Higher speeds mean more traffic capacity

General Bus Characteristics

• Circuit switched
• Operates like a telephone
• Bus is busy for the whole operation
• Example: SBus
• Packet switched
• Operates like a network
• Request is sent and bus is free for the next request
• Example: Gigaplane, PCI
Most buses are packet switched

System Buses
• Used to connect sections of the system to each other
• CPUs, peripheral controllers and memory
• Many different kinds, depending on technology and
needs
• Smaller systems usually use simpler buses
• Cost issues are important
• Large systems need throughput
• Performance outweighs cost
• Sun has used many different kinds over the years

System Buses
MBus and XDbus
• MBus – for small desktop multiprocessors (sun4m)
• Separate I/O bus from CPU and memory
• 135 Mbytes/second at 50 MHz
• Looks like a current Pentium system design
• XDBus – SS1000, SC2000, CS6400 (sun4d)
• 1 on 1000, 2 on 2000, 4 on 6400
• 250-300 Mbytes/second at 40 MHz
• Easy to overload with a full configuration

System Buses
UPA and Gigaplane Bus
• UPA – used by the Ultra systems
• Crossbar connects CPUs, memory and I/O
• Very efficient and very fast
• Gigaplane – Used in the Ultra servers
• Connects multiple UPA buses (on system boards)
• Very fast – ~2.75 GBytes/second (X500 servers)
• Configurable while you’re running
• This is what dynamic reconfiguration (DR) does

System Buses - Gigaplane XB Bus

• Only on the Sun Enterprise 10000
• Based on the Gigaplane bus
• Point-to-point connections to all boards
• Allows addition and removal of whole boards from
domains
• Contains multiple parallel sub-buses
• Can run without most of them
• Runs at 83.5 MHz
• Can do two operations per cycle

The Gigaplane XB Bus
[Figure: system boards 0 through 15 arranged around the
Gigaplane XB interconnect]

Peripheral Buses
• Connect peripherals to the system buses
• Slower than the system buses, usually by a lot
• Circuit or packet switched
• Can easily be the (hidden) limiting factor in I/O
performance
• Sun provides two different ones: PCI and SBus
• Can mix PCI and SBus in Enterprise Servers
• All new systems are PCI

SBus vs PCI Bus

• SBus is 20-25MHz, 32 and 64 bits wide
• Peak bandwidth 200 MBytes/sec
• PCI Bus runs at 33 or 66 MHz, 32 or 64 bits wide
• Peak bandwidth is 528 MBytes/sec
• Provided on all new Sun hardware
• There are many 100 MByte/sec interface cards today
• SOC+, SCI, ATM OC-12, and so on
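
The peak figures quoted above follow from clock rate times bus width. A minimal sketch of that arithmetic; the function name is illustrative, and the 528 MBytes/sec PCI figure is taken to correspond to a 66 MHz clock on a 64-bit path.

```python
def peak_mbytes_per_sec(clock_mhz, width_bits):
    """Peak bandwidth: one transfer of width_bits per clock cycle."""
    return clock_mhz * width_bits / 8


print(peak_mbytes_per_sec(25, 64))  # SBus at 25 MHz, 64 bits wide
print(peak_mbytes_per_sec(66, 64))  # PCI at 66 MHz, 64 bits wide
print(peak_mbytes_per_sec(33, 32))  # PCI at 33 MHz, 32 bits wide
```

These are peak numbers only; sustained throughput is normally well below them.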

PIO and DVMA

• PIO cards require the CPU to perform all data transfer
• The CPU can only move 8 or 64 bytes at a time
• Uses lots of CPU resource and overhead
• For lower performance or ’naturally’ slow cards
• DVMA bypasses the CPU and goes direct to memory
• Much more complicated setup
• Break-even with PIO is at about 2048 bytes
• Many cards can do both
• The driver chooses which is used (usually silently)

prtdiag
• Provided for Sun’s sun4d and sun4u systems
• SS1000, SC2000 and UltraSPARC systems
• Shows the hardware configuration of the system
• Ideal for locating configuration problems
• You do need to know your part identification
• Use the FE Handbook
• prtconf provides similar data but is harder to read

prtdiag CPU Section

# /usr/platform/sun4u1/sbin/prtdiag
System Configuration: Sun Microsystems sun4u SUNW,Ultra-Enterprise-10000
System clock frequency: 83 MHz
Memory size: 1024 Megabytes
========================= CPUs =========================
                    Run   Ecache   CPU    CPU
Brd  CPU   Module   MHz     MB    Impl.   Mask
---  ---  -------  -----  ------  ------  ----
 0     0     0      250     1.0   US-II    1.1
 0     1     1      250     1.0   US-II    1.1
 0     2     2      250     1.0   US-II    1.1
 0     3     3      250     1.0   US-II    1.1
 1     4     0      250     1.0   US-II    1.1
 1     5     1      250     1.0   US-II    1.1
 1     6     2      250     1.0   US-II    1.1
 1     7     3      250     1.0   US-II    1.1

prtdiag Memory Section

========================= Memory =========================
Memory Units: Size
          0: MB   1: MB   2: MB   3: MB
          -----   -----   -----   -----
Board 0     256     256       0       0
Board 1     256     256       0       0
• Can determine memory interleaving from this
• Proper interleaving can provide 10% or more
performance boost
• Systems configure this automatically – and silently
• Some systems can do up to 16-way interleave

prtdiag I/O section

========================= IO Cards =========================
Bus Freq
Brd Type MHz Slot Name Model
--- ---- ---- ---- -------------------------------- ----------------------
0 SBus 25 0 qec/qe (network) SUNW,595-3198
0 SBus 25 0 SUNW,soc/SUNW,pln 501-2069
0 SBus 25 1 QLGC,isp/sd (block) QLGC,ISP1000
1 SBus 25 0 qec/qe (network) SUNW,595-3198
1 SBus 25 0 SUNW,soc 501-2069
1 SBus 25 1 QLGC,isp/sd (block) QLGC,ISP1000
• You can calculate bus loading
• Move things if you’re overloaded

Bus Limits
• A bus is like a freeway
• More traffic = more congestion
• Too much traffic, things back up
• Things get delayed, you get problems
• Going faster takes more skill and money
• Bus saturation = problems
• You don’t get any specific messages
• How do these problems occur?

If I have checks...
• I must have money!
• If I have slots, I must have bandwidth!
• Both may be wrong
• Symptoms (of the second)
• Bad response time
• Dropped network packets
• Intermittent hardware problems
• Will usually be irregular but constant

Diagnosing Bus Problems

• Common set of symptoms:
• Every board has been replaced and the problem’s
still there
• Maybe the hardware wasn’t the problem!

• Check your configuration (prtdiag)
• Correlate with system usage
• Trust the specs, not the sales guy
• This can be mysterious to support personnel as well
• The hardware and software are fine

Avoiding the Problem

• You will usually get 60% of advertised bandwidth (at
best)
• Watch for old devices and cards
• They can silently slow a bus down
• Spread things out as much as possible
• Consider usage patterns

• Symmetry may be a bad idea - “It looks good”
• Appearance is not important
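
The 60% rule can be turned into a quick bus loading check. This is a sketch only: the card names and the Mbytes/second loads are borrowed from the figures on the SBus Overload Example slide, and placing all four cards on one 200 Mbytes/second SBus is an assumption made for illustration.

```python
USABLE_PERCENT = 60  # "60% of advertised bandwidth (at best)"


def bus_headroom(bus_peak, card_loads):
    """Usable bandwidth minus total card demand, in Mbytes/second."""
    usable = bus_peak * USABLE_PERCENT / 100
    return usable - sum(card_loads.values())


# Assumed placement: every card below shares a single SBus.
cards = {
    "100base-T": 12,
    "fast/wide SCSI": 20,
    "Fibre Channel 1": 50,
    "Fibre Channel 2": 50,
}
headroom = bus_headroom(200, cards)
print("overloaded" if headroom < 0 else "ok", headroom)
```

A negative result suggests moving a card to another bus, even though the slots themselves are available.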

Avoiding the Problem

• Know how your system works
• There are plenty of good white papers
• Watch for unexpected problems

• Understand the basics
• Everything is either a bus or a cache
• Tune and monitor it accordingly
• Smarter is better than bigger
• Be prepared to leave empty slots

SBus Overload Example

[Figure: Enterprise XX00 Server. The Gigaplane bus connects
through UPA to SBus A and SBus B, 200 Mbytes/second each.
Card loads in Mbytes/second: 100T 12, FW SCSI 20, Slots 1
through 3 unknown (?), Fibre Channel 50 and 50.]

System Bus Overload Example

[Figure: SC 2000E. Labels shown: 20 CPUs at 20-30
Mbytes/second; 300 Mbyte/second buses; 10 SBuses; memory of
5 Gbytes in 20 256-Mbyte banks; 50 Mbytes/second.]

Dynamic Reconfiguration
Considerations
• Problems occur when you add, not when you remove
• Make sure you have bandwidth when you add
components
• Be careful replacing slower cards or components
• Consider reorganizing to spread the load
• Be careful of controller and component renumbering
• Plan ahead

Tuning Reports
• Not very much available in this area
• Vigilance is the best prevention
• Check prtdiag when you add new cards
• Remember that empty slots do not mean available
bandwidth
• Strange behavior in tuning reports may be related to bus
problems
• It usually occurs when the system is most heavily
loaded

Check Your Progress

• Define a bus
• Explain general bus operation
• Describe the processor buses: UPA, Gigaplane,
Gigaplane XB
• Describe the data buses: SBus, PCI
• Explain how to calculate bus loads
• Describe how to recognize and configure to avoid bus
problems
• Dynamic reconfiguration considerations

Module 7
I/O Tuning

Module Overview
• Course map
• Relevance
• Objectives

Course Map

Objectives
• Describe how the I/O buses operate
• Describe the configuration and operation of the SCSI
bus
• Describe the configuration and operation of a Fibre
Channel bus
• Describe how disk and tape operations proceed
• Describe how a storage array works
• Describe the RAID levels and their uses
• Describe the available I/O tuning data

SCSI Bus Overview

• Originally shipped in 1982; SCSI-1 in 1986
• Standardized now at SCSI-3
• Upward compatible evolution
• Supports disk, tape, network devices, media changers,
CD-ROMs and so on
• The most common computer peripheral connection
• Defines message traffic and bus operation
• Very similar in concept to Ethernet
• Similar principles apply to most I/O buses

SCSI Bus Characteristics

• There are several types of SCSI bus
• They can be distinguished by their mix of characteristics
• Speed – How fast the signals move
• Width – How much data moves at a time
• Length – How long the bus can be
• Buses can support several different sets of
characteristics concurrently
• Device drivers negotiate a device’s behavior on the bus
• Behaves like a circuit switched bus

SCSI Speeds
Name              Maximum Speed
Asynchronous      4 MBytes/sec
Synchronous       5 MBytes/sec
Fast              10 MBytes/sec
Fast-20 (Ultra)   20 MBytes/sec
• The SCSI committees are working on fast-40 and fast-80

SCSI Widths
Name     Width                 Cable
Narrow   1 byte                50 conductor cable
Wide     2 bytes in parallel   68 conductor cable
Quad     4 bytes in parallel   2 cables, 104 conductors
• Quad SCSI is not usually available

• Speed and width together determine the bandwidth
• Multiply speed by width
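
The "multiply speed by width" rule can be tabulated directly from the two tables above. A minimal sketch; the dictionary and function names are illustrative.

```python
# Values copied from the SCSI Speeds and SCSI Widths tables.
SPEED_MBYTES_PER_SEC = {"Asynchronous": 4, "Synchronous": 5, "Fast": 10, "Fast-20": 20}
WIDTH_BYTES = {"Narrow": 1, "Wide": 2, "Quad": 4}


def scsi_bandwidth(speed, width):
    """Peak bus bandwidth in MBytes/sec for one speed/width combination."""
    return SPEED_MBYTES_PER_SEC[speed] * WIDTH_BYTES[width]


print(scsi_bandwidth("Fast", "Wide"))     # fast, wide SCSI
print(scsi_bandwidth("Fast-20", "Wide"))  # Ultra wide SCSI
```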

SCSI Lengths
Name           Limit
Single-ended   20', 6 m
Differential   82', 25 m
• Subtract 18" (0.5 m) for each device on the bus

Do not mix differential and single-ended on the same bus

SCSI Properties Summary

Speed     Width    Length
Async     Narrow   Single-ended
Sync      Wide     Differential
Fast      Quad
Fast-20
• Each device runs with one combination
• Choose one from each column
• Multiple combinations are supported on one bus
• May not mix single-ended and differential

SCSI Interface Addressing

• A SCSI target is a device controller, not a device
• 8 unique targets on a narrow bus, 16 on a wide
• Each target supports 8 luns, or devices
• Embedded SCSI has one lun per target
• The txdx scheme identifies these
• SCSI commands are issued by initiators
• The most common initiator is the host adapter
• Host adapters do not have luns
• There are 57 (narrow) or 121 (wide) positions on the bus
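
The 57 and 121 position counts can be reproduced from the numbers above: every target except the host adapter contributes 8 luns, and the host adapter contributes one position. A small sketch with illustrative names:

```python
def bus_positions(targets, luns_per_target=8, initiators=1):
    """Addressable positions: luns on device targets plus the initiators."""
    device_targets = targets - initiators
    return device_targets * luns_per_target + initiators


print(bus_positions(8))   # narrow bus
print(bus_positions(16))  # wide bus
```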

SCSI Bus Addressing
[Figure: one SCSI bus with the following devices]
• Host adapter (ESP, SBE/S), target 7
• CD-ROM, target 6, lun 0
• Tape, target 5, lun 0
• Tape, target 4, lun 0
• Disk, target 0, lun 0

SCSI Target Priorities

• The SCSI bus needs an arbitration scheme
• This decides which device goes first
• Higher priority devices get better service
• On a narrow bus, the priorities run 7 -> 0
• On a wide bus they run 7 -> 0 then 15 -> 8
• The host adapter is usually 7
• Multi-initiator SCSI allows multiple initiators
• The host adapters are numbered 7 and 6
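
The arbitration order described above can be written out as a list, highest priority first. A minimal sketch; the function name is illustrative.

```python
def arbitration_order(wide=False):
    """Target IDs from highest to lowest priority."""
    order = list(range(7, -1, -1))        # 7 -> 0
    if wide:
        order += list(range(15, 7, -1))   # then 15 -> 8
    return order


print(arbitration_order())
print(arbitration_order(wide=True))
```

On a wide bus this places target 8 last, which is worth remembering when deciding where to put busy or latency-sensitive devices.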

Fibre Channel
• General purpose networking mechanism
• Looks like a packet-switched bus
• Supports up to 128 targets
• Supports more than SCSI – ATM, video, and so on
• Three types
• FC-25 – SPARCstorage Arrays
• FC-AL – A5000
• FC Fabric – Full network
• All devices have a unique WWN, like an IP address

Disk Drive Features

• Multiple Zone recording
• Drive caching and fast writes
• Mode pages
• Tagged queueing
• Properties from prtconf and scsiinfo

Multiple Zone Recording

[Figure: disk surface divided into Zone 0 through Zone 3]

Zone Table
Sector    Density   Cylinder
0         75        0
350,000   70        88
685,000   65        176
...       60        ...
...       55        ...

Drive Caching and Fast Writes

• Disk drives today have from .5 Mbyte to 8+Mbytes of
cache
• Cache is used with tagged queueing to manage requests
• Buffers the drive’s internal transfer rate (ITR) to bus speed
• Fast writes signal completion when the data reaches the
cache
• Should have the drive on UPS
• Can be set in arrays or on individual drives

Tagged Queueing
• A special ID field is added to the device command
• Allows the drive to handle multiple simultaneous
requests
• Requests can be executed out of order
• Takes advantage of ordered requests to limit arm
motion
• Must be negotiated with the device
• Not all devices support it
• Can provide noticeable performance improvements

Mode Pages
• Mode pages are provided for all SCSI devices
• Identified by number
• They describe device properties and configuration
• Several standard, required, pages
• Additional device-specific pages
• Can also have vendor-specific pages
• Can be viewed using format -e
• Header files are in /usr/include/sys/scsi

Device Properties from prtconf

• prtconf -v reports some device information
name <target1-TQ> length <4>
value <0x00000001>.
name <target1-wide> length <4>
value <0x00000001>.
name <target1-sync-speed> length <4>
value <0x00004e20>.
• What is reported depends on the device.

• Device properties are listed under the controller
• This device is using a fast, wide SCSI bus and Tagged
Queueing
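
The property values above can be decoded with a few lines of code. This is a sketch under one loud assumption: the sync-speed value is taken to be in KBytes/sec, so 0x4e20 (20000) would be 20 MBytes/sec, which matches a fast, wide bus. The parsing pattern only covers the output shape shown on this slide.

```python
import re

SAMPLE = """\
name <target1-TQ> length <4>
value <0x00000001>.
name <target1-wide> length <4>
value <0x00000001>.
name <target1-sync-speed> length <4>
value <0x00004e20>.
"""


def parse_properties(text):
    """Return {property name: integer value} for name/value pairs."""
    pattern = re.compile(r"name <([^>]+)> length <\d+>\s*value <(0x[0-9a-fA-F]+)>")
    return {name: int(value, 16) for name, value in pattern.findall(text)}


props = parse_properties(SAMPLE)
# Assumption: sync-speed is reported in KBytes/sec.
print(props["target1-sync-speed"] / 1000, "MBytes/sec")
print("tagged queueing:", bool(props["target1-TQ"]))
```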

Device Properties from scsiinfo

• scsiinfo can read (and set) mode page properties
• (show)

Disk I/O Time Components

• Waiting for access to the I/O bus
• Waiting for other queued disk requests
• Bus transfer time
• Seek time (to find correct track in cylinder)
• Rotation time (to find correct sector in track)
• ITR transfer time
• Reconnection time
• Interrupt time

Disk I/O Time Calculation

For a 9 ms 7200 rpm disk device
[Figure: timeline of one disk I/O across three layers]
OS:      Queue time (?), interrupt, driver (?)
Bus:     Arbitration and command transfer 1.5 ms, data
         transfer to cache 1 ms (or less), arbitration and
         status 1.5 ms
Device:  Seek 9 ms, rotation 4 ms, internal transfer 2 ms
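
The figure's numbers can be reproduced as a rough estimate. A sketch, assuming the components simply add without overlap and leaving out the OS queue, interrupt, and driver times that the slide marks as unknown; the names are illustrative.

```python
def rotational_latency_ms(rpm):
    """Average rotational delay: half of one revolution, in milliseconds."""
    return 60_000 / rpm / 2


components_ms = {
    "arbitration + command": 1.5,
    "seek": 9.0,
    "rotation": rotational_latency_ms(7200),
    "internal transfer": 2.0,
    "data transfer to cache": 1.0,
    "arbitration + status": 1.5,
}

total_ms = sum(components_ms.values())
print(round(components_ms["rotation"], 2), "ms rotation")
print(round(total_ms, 2), "ms per I/O, excluding OS time")
```

A measured service time far above an estimate like this suggests queueing or bus contention rather than a slow drive.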

Tape Drive Operations

• Bus speed is a speed limit, not a mandatory speed
• Tapes can seriously hurt your performance
• They transfer very slowly
• Other devices have to wait until the tape is done
• Tapes should be on a separate bus from production
disks
• Backups can seriously degrade bus performance

Storage Arrays and JBODs

• JBOD – "Just a Box of Disks" (A5000, MultiPack)
• Looks like a bus in a box
• No controller, no special control
• May have special power and reliability support
• Storage arrays (SPARCstorage Array, A7000)
• Has a controller and cache in front of multiple buses
• Must be slower than plain disks if cache fills
• Provides many additional features

Storage Array Architecture

Host interfaces
Cache CPU Controller
Bus
Controllers
Drives

The RAID Levels

• There are RAID levels defined from 0 through 5
• 2, 3 and 4 are not used
• RAID 0 is striping (spreading data across multiple
drives)
• RAID 1 is mirroring (multiple data copies)
• RAID 0+1 is both mirroring and striping
• RAID 5 is striping with parity
• One redundant drive for each set of drives
• Performs poorly in some applications

Tuning the I/O Subsystem

• "The fastest I/O you can do is the one you don’t do"
• Main store problems cause I/O problems
• Start by tuning main store and UFS caches
• Spread activity across boards, buses and devices
• Ensure devices are operating as they are supposed to
• Calculate ’good’ I/O times
• Eliminate hot spots
• Use target and device priorities to place data

Available Tuning Data

• iostat
• sar
• scsiinfo and prtconf
• Summary

Available I/O Statistics

Activity                            Source       iostat (unit)   sar -d (unit)
% of sample time disk was active    (calculated) %b              %busy
Total disk I/O time                 rtime        svc_t (ms)      avwait+avserv (ms)
# of reads and writes               reads        r/s             r+w/s
                                    writes       w/s
# of blocks read and written        nread        Kr/s (K)        blks/s (sectors)
per second                          nwritten     Kw/s (K)
% of time waiting for I/O                        %w              %wio

Available I/O Data

nread - number of bytes a device has read
nwritten - number of bytes a device has written
reads - number of read operations
rlentime - average length of time a transfer request has to spend being serviced
(actual device time)
rtime - The amount of time the device is active
wlentime - average length of time a transfer request has to spend waiting to be
serviced (pre-interface time)
writes - number of write operations

Check Your Progress

• Describe how the I/O buses operate
• Describe the configuration and operation of the SCSI
bus
• Describe the configuration and operation of a Fibre
Channel bus
• Describe how disk and tape operations proceed
• Describe how a storage array works
• Describe the RAID levels and their uses
• Describe the available I/O tuning data

Module 8
UFS Tuning

Module Overview
• Course map
• Relevance
• Objectives

Course Map

Objectives
• Describe the vnode interface layer to the file system
• Explain the layout of a UFS partition
• Describe the contents of an inode and a directory entry
• Describe the function of the superblock and cylinder
group blocks
• Understand how allocation is performed in the UFS
• List the UFS caches and how to tune them
• Describe how to use the other UFS tuning parameters

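The vnode Interface

[Figure: system calls enter the vnode layer, which dispatches
to the individual file system types]
File system           Backing object
/proc                 Process address space
PCFS                  Diskette
HSFS                  CD-ROM
UFS                   Disk
swapfs                Anonymous memory
TMPFS, NFS, cachefs   (not labeled in the figure)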

Local File System

• A disk drive must be divided into one or more
partitions or slices
• Each partition can contain only one file system
• Files are accessed in fixed-length blocks on the disk
• 4Kbyte and 8Kbyte blocks are available
• 4Kbyte blocks are not supported on UltraSPARC systems
• Information about a file is kept in an inode
• A directory points to an inode for every named object
in a partition

Fast File System Layout

[Figure: partition layout, in order from the start of the disk]
• Label
• Boot block
• Superblock
• Cylinder group 0: backup superblock, cylinder group block,
inodes, data
• Cylinder group 1: data, backup superblock, cylinder group
block, inodes, data

The Directory
[Figure: layout of directory entries within a directory block]
• Each entry holds an inode number, entry length, name
length, and name
• A deleted entry is shown with only its entry length, name
length, and name

Directory Name Lookup Cache (DNLC)

• Searching for a name in a directory is expensive
• DNLC - cache of most recently referenced directory
entry names and their associated vnodes
• Used by UFS and NFS
• ncsize determines size of DNLC
• Maximum name size is 31 characters
• vmstat -s and sar display DNLC statistics

The inode Cache

• Contains an entry for every open inode
• Also holds as many closed inodes as possible
• Entries are located through the DNLC
• Cache size will grow to hold all open files
• May exceed your setting
• If it does, no closed inodes will be cached
• When closed inodes are cached, their data pages can be
reused
• If flushed (removed from cache), the pages are lost

DNLC and inode Cache Management

• DNLC and inode caches should be the same size
• Each DNLC entry has an inode entry
• Default sizes are ( 4 * ( maxuprc + maxusers ) ) + 320

• inode cache size is set by ufs_ninode
• DNLC cache size is set by ncsize
• Set ncsize high (5,000 or more) on NFS servers
• Try to keep the DNLC hit rate above 90%
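
The default size formula and the 90% hit rate target can be checked with a little arithmetic. A sketch only: the formula is the one quoted on this slide, the function names are illustrative, and the sample parameter values and counters are hypothetical.

```python
def default_cache_entries(maxuprc, maxusers):
    """Default DNLC and inode cache size, per the slide's formula."""
    return 4 * (maxuprc + maxusers) + 320


def dnlc_hit_rate(lookups, misses):
    """Percentage of name lookups satisfied from the DNLC."""
    return 100.0 * (lookups - misses) / lookups


print(default_cache_entries(1000, 64))  # hypothetical settings
rate = dnlc_hit_rate(lookups=1000, misses=50)
print(rate, "ok" if rate > 90 else "consider raising ncsize")
```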

The inode
inode type and permissions
Number of hard links
Owner and group id
Size in bytes
Time file last accessed
Time file last modified
Time inode last modified
12 direct block pointers
3 indirect block pointers
Number of sectors
Shadow inode location

The inode Block Pointers

[Figure: tree of block pointers leading from the inode to data]
• 12 direct block pointers – each addresses a data block
• Single indirect – a block of 2048 addresses, each
addressing a data block
• Double indirect – a block of 2048 addresses, each
addressing a block of 2048 addresses
• Triple indirect – a further level of 2048-address blocks
above the double indirect structure
• The pointers are sector addresses
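
The reach of each pointer level can be worked out from the figure's numbers. A sketch, assuming 8-Kbyte blocks and 2048 addresses per indirect block; this is what the pointers can address, not a statement about the maximum file size the operating system enforces.

```python
BLOCK_BYTES = 8 * 1024      # assumed 8-Kbyte file system block
ADDRS_PER_BLOCK = 2048      # addresses held by one indirect block
DIRECT_POINTERS = 12


def addressable_bytes(indirect_levels):
    """Bytes reachable using the direct pointers plus N indirect levels."""
    blocks = DIRECT_POINTERS + sum(
        ADDRS_PER_BLOCK ** level for level in range(1, indirect_levels + 1)
    )
    return blocks * BLOCK_BYTES


for levels in range(4):
    print(levels, addressable_bytes(levels))
```

With no indirect levels the result is 96 Kbytes, so any file larger than that needs at least the single indirect block.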

Hard Links
[Figure: two directory entries, each with its own name and the
same inode number, point to one inode with ic_nlink = 2]
• Work only in the same partition
• No limit to the number
• No back pointers
• Must use find -inum to locate the other names

Symbolic Links
[Figure: a regular symbolic link compared with a fast symbolic
link. The /usr directory entry (inode number, entry length 12,
name length 4, name "tmp\0") leads to /usr/tmp’s in-core
inode. Data block 1033 holds the path
/preposterously_long_directory_name/extremely_long_filename.
Data block 543 holds ../var/tmp, and the fast symbolic link
shows ../var/tmp stored in the in-core inode.]

Allocating Directories and inodes

[Figure: cylinder groups holding a mix of directories (d),
inodes (i), and data blocks (b)]
UFS tries to provide even performance regardless of partition
utilization
• Directories are allocated in the cylinder group
containing the fewest directories
• Ties are broken by the most free space
• File inodes are allocated to the same CG as its directory
• Data blocks are allocated in the same CG as their inode

Allocation Performance
[Figure: allocation performance (faster is higher) plotted
against partition percentage full, 0 to 100, for a ’compacted’
UFS and a normal UFS]

Allocating Data Blocks

Data block allocation is a five step process
1. Request the ’ideal’ block
2. Take a block in the cylinder group
3. Do quadratic rehash
4. Look at all remaining cylinder groups
5. Panic

Quadratic Rehash
• Tries to avoid clumps of full cylinder groups
• Searches twice as far each time another cg is found full
• Stopping this quickly is important for performance
• Free space is reserved to try to stop quickly
• This is the "10%" reserved seen in df
• With 2.6 this amount depends on partition size
• Formula is (64 Mbytes / file system size) * 100
• Runs from 10% (<= 640 Mbytes) to 1% (>= 6.4
GBytes)
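
The 2.6 reserve formula can be evaluated for a few partition sizes. A minimal sketch of the arithmetic on this slide; the function name is illustrative, and whether the real tools round the result to a whole percent is not covered here.

```python
def default_minfree_percent(fs_mbytes):
    """(64 Mbytes / file system size) * 100, held between 1% and 10%."""
    percent = 64 / fs_mbytes * 100
    return max(1.0, min(10.0, percent))


for size in (320, 640, 2048, 6400, 20000):
    print(size, "Mbytes ->", default_minfree_percent(size), "%")
```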

Fragments
• A full 8Kbyte block is a waste for small files
• Fragments allocate 1/8 block increments to a file
• Fragments can be allocated by first-fit or best-fit
• Fragment rules:
1. Only the last block of the file can be fragmented
2. Files larger than 96 Kbytes cannot use fragments
3. The fragments must be contiguous within their block
4. Fragments may be moved to remain contiguous
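
The fragment rules can be illustrated by computing how a file of a given size would be laid out. A sketch, assuming 8-Kbyte blocks and 1-Kbyte fragments (1/8 block); it models only rules 1 and 2 and ignores placement.

```python
BLOCK_BYTES = 8192
FRAG_BYTES = BLOCK_BYTES // 8
FRAG_LIMIT_BYTES = 96 * 1024  # larger files cannot use fragments


def allocation(size_bytes):
    """Return (full blocks, fragments) needed to hold size_bytes."""
    full, tail = divmod(size_bytes, BLOCK_BYTES)
    if tail == 0:
        return full, 0
    if size_bytes > FRAG_LIMIT_BYTES:
        return full + 1, 0            # last block must be a full block
    frags = -(-tail // FRAG_BYTES)    # round up to whole fragments
    if frags == 8:
        return full + 1, 0            # eight fragments equal one block
    return full, frags


print(allocation(2500))        # small file: fragments only
print(allocation(9000))        # one block plus one fragment
print(allocation(100 * 1024 + 1))  # over 96 Kbytes: full blocks only
```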

Logging File Systems

• Allocation requires lots of I/O
• Cylinder group block, inode, superblock updates are
needed for one data block allocation
• Logging file systems write to the log, other writes are
done later
• Takes advantage of write cancellation
• Faster recovery – All changes are known and logged
• The log is usually a disk hot spot
• Examples: VxFS, SunOS 5.7 UFS

Application I/O
• Application I/O usually goes through the segmap cache
• The segmap cache is a kernel memory segment
• The segmap cache is 4 MBytes in size
• It is a fully-associative, write-back cache
• The cache contents are different for each process
• The virtual address is the same for each
• Blocks in cache may not be in memory

• Data access requires a system call and a copy of the data

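Application I/O
[Figure: user read() and write() calls cross into the kernel
as lread and lwrite operations on the segmap cache; bread and
bwrite operations fill and drain the segmap cache]
• One system call and (at least) one copy per request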

Working with the segmap Cache

• You can bypass the cache by doing direct I/O
• I/O goes direct to and from disk
• There are restrictions
• You can make the cache write-through using dsync and
osync
• You can continue to execute during I/O operations
using async I/O
• This allows non-blocking application I/O

fsflush
• Remember that the segmap cache is a write-back cache
• Data could stay there unwritten indefinitely, which is
not good in case of a system crash
• fsflush writes modified data back to disk on a regular
basis for UFS and NFS
• There are two tuning parameters:
• tune_t_fsflushr – How often to wake up
• autoup – How long a cycle should take
• autoup is the longest that modified data will remain
unwritten

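Accessing Data Directly with mmap

[Figure: a process address space (stack, mapping, data, text).
The mapping starts at paddr, spans len bytes, and corresponds
to offset off in a file (or file-like object).]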

Using madvise
Used to control more precisely an application’s use of main
storage. Can be used for most virtual memory areas.
• MADV_NORMAL – normal (read ahead only) access
• MADV_RANDOM – random access; no read ahead
• MADV_SEQUENTIAL – sequential access (read ahead
and free behind)
• MADV_WILLNEED – Virtual memory the application
intends to use soon. The data will be read in by read
ahead.
• MADV_DONTNEED – Unneeded virtual memory. Any
associated physical memory will be freed.
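
The advice values can be tried from a script. A minimal sketch using Python's mmap module, which exposes madvise() and the MADV_* constants on Python 3.8 and later where the platform supports them; the function name and the temporary file are illustrative, and the advice is skipped when it is unavailable.

```python
import mmap
import os
import tempfile


def byte_sum_sequential(path, chunk=65536):
    """Map a file read-only, advise sequential access, and sum its bytes."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # Only advise when this Python build and OS provide it.
            if hasattr(m, "madvise") and hasattr(mmap, "MADV_SEQUENTIAL"):
                m.madvise(mmap.MADV_SEQUENTIAL)
            total = 0
            for offset in range(0, len(m), chunk):
                total += sum(m[offset:offset + chunk])
            return total


# Demonstration on a small temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\x01" * 10000)
print(byte_sum_sequential(tmp.name))
os.unlink(tmp.name)
```

The advice is a hint only; the result is the same with or without it, and any benefit shows up in paging behavior rather than in the program's output.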

Post-Creation Tuning Parameters

Parameters on tunefs and newfs/mkfs
• Optimization for time or space
• space - Allocate fragments best-fit
• time – Allocate a new block for each fragment section
• minfree – Space to cut quadratic rehash short
• Formula is (64 Mbytes / file system size) * 100
• Minimum 1%, maximum 10%
• fs_maxcontig – Size of a cluster

UFS Metadata Cache

• Holds inodes, cylinder group blocks and indirect blocks
only
• bufhwm is the only parameter, specified in Kbytes

• The default of 0 allows up to 2% of main memory to be
used
• This may be too much on very large systems

• You can usually reduce it to a few Mbytes
• Use sysdef to see the current setting
• Usually you don’t need to tune this

Application I/O Performance Statistics

From sar -b you can get segmap cache behavior
• bread/s, bwrite/s – Disk-segmap cache I/O operations
• lread/s, lwrite/s – segmap cache-user I/O operations
• %rcache, %wcache – segmap cache hit ratio
• pread/s, pwrite/s – physio() transfers. physio() is
used when accessing raw devices

File System Performance Statistics

From sar -a, -v, and -g:
• namei/s – DNLC pathname component requests
• iget/s – DNLC cache misses
• dirblk/s – Directory block reads per second
• file_sz - Total number of open files in the system
• inode_sz – Current percentage of inode cache in use
• ufs_ipf - Flushed inodes: percentage of (closed) UFS
inodes flushed from the cache which had reusable file
pages still in memory

Check Your Progress

• Describe the vnode interface layer to the file system
• Explain the layout of a UFS partition
• Describe the contents of an inode and a directory entry
• Describe the function of the superblock and cylinder
group structures
• Understand how allocation is performed in the UFS
• List the UFS caches and how to tune them
• Describe how to use the other UFS tuning parameters

Module 9
Network Tuning

Module Overview
• Course map
• Relevance
• Objectives

Course Map

Objectives
• Describe the different types of network hardware
• Describe the performance characteristics of networks
• Describe the use of IP trunking
• Discuss the NFS performance issues
• Discuss how to isolate network performance problems
• What can be tuned?

Networks
• Networks are generally packet switched buses
• There are several places to tune:
• The application
• The OS
• The network itself
• You will tune differently depending on your workload

Network Bandwidth
Network Connection       Bandwidth (bits/s)
28800 baud modem         28.8K
ISDN 1 BRI               64K
ISDN 2 BRI               128K
T1                       1.5M
Ethernet                 10M
T3                       45M
FastEthernet             100M
Full Duplex Fast Enet    100+100M
FDDI                     100M
ATM (OC3)                155+155M
ATM (OC12)               622+622M

IP Trunking
• Like network RAID
• Data is striped over multiple network interfaces
• Allows multiple interfaces to look like one
• Works only with qfe cards
• 100base-T quad fast Ethernet
• Can combine 2 or all 4 interfaces
• Usually goes to an Ethernet switch
• Special software required

TCP Connections
TCP Connections are expensive
• TCP is optimized for reliable data on long lived
connections
• Making a connection uses a lot more CPU resource than
moving data
• Connection setup protocol involves several round trip
delays
• Pending connections can cause “listen queue” issues
• Each new connection goes through a “slow start” ramp
up as reliability is determined

TCP Connections
• TCP windows can limit throughput on high-latency,
high-speed links
• Lost or delayed data causes timeouts and
retransmissions, which add overhead
• Each open connection consumes memory
• HTTP persistent connections carry several operations
on one connection

TCP Tuning Parameters

• tcp_close_wait_interval (actually
tcp_time_wait_interval)
• Default 240000 ms. Reduced to 60000 for SPECweb96
runs
• tcp_conn_req_max - obsolete per-process listen queue
limit
• tcp_conn_req_max_q0 - incomplete listen connection
limit
• Default is 1024 - no need to tune unless
tcpListenDropQ0 is non-zero

TCP Tuning Parameters

• tcp_conn_req_max_q - pending completed connection
limit
• Default 128 - no need to tune it unless tcpListenDrop
seen
• tcp_slow_start_initial - number of initial packets
to send
• Default 1 - set to 2 to be the same as other vendors
• tcp_xmit_hiwat and tcp_recv_hiwat - window size
• Default 8192 - try setting to 32768 if using fast WAN
links
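
The effect of the window size on a fast WAN link can be estimated with the usual window / round-trip-time bound. A sketch, assuming one full window is sent per round trip and ignoring loss and slow start; the 100 ms round-trip time is a hypothetical value.

```python
def max_throughput_bytes_per_sec(window_bytes, rtt_seconds):
    """Upper bound on one connection: a full window per round trip."""
    return window_bytes / rtt_seconds


RTT = 0.1  # hypothetical 100 ms round trip
for window in (8192, 32768):
    print(window, "byte window ->", max_throughput_bytes_per_sec(window, RTT), "bytes/s")
```

Under these assumptions the default 8192-byte window caps a single connection at roughly 80 Kbytes/second regardless of link speed, which is why the larger setting is suggested for such links.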

TCP Stack Issues

• Ignore everything if TCP throughput is less than 2 KB/s
• Warn if retransmit rate is over 15%, problem if over 25%
• Warn if listen drops are seen, problem if over 0.5/s
• Warn if connection refused - sending reset packets at
0.5/s
• Warn if attempted outgoing connection fails over 2/s
• Warn if incoming duplicate packets over 15%
• Problem if duplicates at 25% - remote is
retransmitting at you
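
The retransmission thresholds above can be expressed as a simple rule. A sketch covering only the retransmit-rate and low-throughput rules; the function name and the way the counters are obtained are assumptions.

```python
def retransmit_status(retransmitted, sent, throughput_kb_per_sec):
    """Classify a TCP retransmit rate using the slide's thresholds."""
    if throughput_kb_per_sec < 2:
        return "ignore"          # too little traffic to judge
    rate = 100.0 * retransmitted / sent
    if rate > 25:
        return "problem"
    if rate > 15:
        return "warn"
    return "ok"


print(retransmit_status(5, 1000, 50))
print(retransmit_status(200, 1000, 50))
print(retransmit_status(300, 1000, 50))
```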

Network Hardware Performance

• Networks are packet switched buses
• They have identical performance issues
• Access to the wire is important, so low utilization is
important
• Collisions imply not enough available access
• You can cut traffic or speed up the network

• You may have a lot of unnecessary traffic
• NFS attribute messages

Using ndd
• Network driver parameters can be changed in
/etc/system
• They can also be reported on and changed by ndd
• Change lasts until reboot

NFS
• Tune the client and server first
• Eliminate non-NFS-specific performance issues
• DNLC and inode cache
• Server I/O and memory performance issues
• Network performance issues
• An attribute cache timeout (actimeo) that is too low
generates lots of traffic
• cachefs
• NFS V.3 added some protocol efficiencies

Isolating Problems
• Four sources of problems
• Client
• Server
• Network
• NFS itself
• Ensure the individual pieces are tuned, then look at NFS
specifically
• Use NFS V.3 if possible to cut overhead

Tuning Reports
• netstat
• nfsstat

Check Your Progress

• Describe the different types of network hardware
• Describe the performance characteristics of networks
• Describe the use of IP trunking
• Discuss the NFS performance issues
• Discuss how to isolate network performance problems
• What can be tuned?

Module 10
CPU Scheduling

Module Overview
• Course map
• Relevance
• Objectives

Course Map

Objectives
• Describe the Solaris scheduling classes
• Explain the use of dispatch priorities
• Describe the time-sharing scheduling class
• Describe Solaris real-time processing
• Describe the commands that manage scheduling
• Describe the purpose and use of Processor Sets

Scheduling Classes
• There are four scheduling classes provided by SunOS
• Timesharing - TS
• Interactive - IA
• System - SYS
• Real-time - RT
• The classes are loaded dynamically as they are used
• Scheduling algorithms are table-driven

Dispatch Priorities
Dispatch queues without realtime:
  100-109  Interrupt thread priorities
   60-99   Kernel priorities
    0-59   Timesharing/Interactive priorities
Dispatch queues with realtime:
  160-169  Interrupt thread priorities
  100-159  Real-time priorities
   60-99   Kernel priorities
    0-59   Timesharing/Interactive priorities

Scheduling States
[Figure: thread state transitions. Running -> (blocked) ->
Sleeping -> (awakened) -> Ready -> (scheduled) -> Running]

The Interactive Scheduling Class

• Used to enhance interactive performance
• This is the default scheduling class for processes in CDE
and OpenWindows sessions
• Uses most of the time-sharing class facilities
• Bumps up the priority of the task in the active window
by 10 points
• Priority reset when no longer the active window
• Niced or otherwise changed processes are not boosted

Timesharing/Interactive
Dispatch Parameter Table
globpri quantm tqexp slpret mxwait lwait
0 20 0 50 0 50
1 20 0 50 0 50
2 20 0 50 0 50
3 20 0 50 0 50
4 20 0 50 0 50
5 20 0 50 0 50
6 20 0 50 0 50
7 20 0 50 0 50
8 20 0 50 0 50
9 20 0 50 0 50
10 16 0 51 0 51
11 16 1 51 0 51
12 16 2 51 0 51
13 16 3 51 0 51
14 16 4 51 0 51
15 16 5 51 0 51
16 16 6 51 0 51
17 16 7 51 0 51
18 16 8 51 0 51
19 16 9 51 0 51
20 12 10 52 0 52
21 12 11 52 0 52
22 12 12 52 0 52
23 12 13 52 0 52
24 12 14 52 0 52
25 12 15 52 0 52
26 12 16 52 0 52
27 12 17 52 0 52
28 12 18 52 0 52
29 12 19 52 0 52

Timesharing/Interactive
Dispatch Parameter Table
globpri quantm tqexp slpret mxwait lwait
30 8 20 53 0 53
31 8 21 53 0 53
32 8 22 53 0 53
33 8 23 53 0 53
34 8 24 53 0 53
35 8 25 54 0 54
36 8 26 54 0 54
37 8 27 54 0 54
38 8 28 54 0 54
39 8 29 54 0 54
40 4 30 55 0 55
41 4 31 55 0 55
42 4 32 55 0 55
43 4 33 55 0 55
44 4 34 55 0 55
45 4 35 56 0 56
46 4 36 57 0 57
47 4 37 58 0 58
48 4 38 58 0 58
49 4 39 58 0 58
50 4 40 59 0 59
51 4 41 59 0 59
52 4 42 59 0 59
53 4 43 59 0 59
54 4 44 59 0 59
55 4 45 59 0 59
56 4 46 59 0 59
57 4 47 59 0 59
58 4 48 59 0 59
59 2 49 59 32000 59

Dispatch Table Issues

For time-sharing class processes:
• Reducing time quanta favors interactive processes
• Raising time quanta favors compute-bound and large
processes
• The ts_maxwait and ts_lwait fields can be used to
control CPU starvation
• By slightly raising the values of ts_tqexp, you can
expect the priority of compute-bound processes to drop
more slowly
• Change the table if necessary to fit your workload
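
The effect of ts_tqexp can be traced through the default table shown earlier. The sketch below is a simplification under stated assumptions: the two helper functions encode only the quantm and tqexp columns as they appear in that table (tqexp is 10 below the current level, floored at 0), and the thread is assumed to use every quantum in full and never sleep, so slpret, mxwait, and lwait are ignored. All names are hypothetical.

```python
def quantum(pri):
    """quantm column of the default TS/IA table, in the table's own units."""
    if pri == 59:
        return 2
    return (20, 16, 12, 8, 4, 4)[pri // 10]


def tqexp(pri):
    """tqexp column: the priority assigned when the quantum expires."""
    return max(0, pri - 10)


def cpu_bound_decay(start_pri):
    """Priorities visited by a thread that always uses up its whole quantum."""
    path = [start_pri]
    pri = start_pri
    while tqexp(pri) != pri:
        pri = tqexp(pri)
        path.append(pri)
    return path


path = cpu_bound_decay(59)
print("priority path:", path)
# CPU time consumed before the thread settles at the last level.
print("quanta used on the way down:", sum(quantum(p) for p in path[:-1]))
```

Raising the tqexp values in the table would lengthen this path, which is the slower priority drop that the slide describes.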

Timesharing Dispatch
Parameter Table (Batch)
globpri quantum tqexp slpret maxwait lwait
0 100 0 10 5 10
1 100 0 11 5 11
2 100 1 12 5 12
3 100 1 13 5 13
4 100 2 14 5 14
5 100 2 15 5 15
6 100 3 16 5 16
7 100 3 17 5 17
8 100 4 18 5 18
9 100 4 19 5 19
10 80 5 20 5 20
11 80 5 21 5 21
12 80 6 22 5 22
13 80 6 23 5 23
14 80 7 24 5 24
15 80 7 25 5 25
16 80 8 26 5 26
17 80 8 27 5 27
18 80 9 28 5 28
19 80 9 29 5 29
20 60 10 30 5 30
21 60 11 31 5 31
22 60 12 32 5 32
23 60 13 33 5 33
24 60 14 34 5 34
25 60 15 35 5 35
26 60 16 36 5 36
27 60 17 37 5 37
28 60 18 38 5 38
29 60 19 39 5 39

Timesharing Dispatch
Parameter Table (Batch)
globpri quantum tqexp slpret maxwait lwait
30 40 20 40 5 40
31 40 21 41 5 41
32 40 22 42 5 42
33 40 23 43 5 43
34 40 24 44 5 44
35 40 25 45 5 45
36 40 26 46 5 46
37 40 27 47 5 47
38 40 28 48 5 48
39 40 29 49 5 49
40 20 30 50 5 50
41 20 31 50 5 50
42 20 32 51 5 51
43 20 33 51 5 51
44 20 34 52 5 52
45 20 35 52 5 52
46 20 36 53 5 53
47 20 37 53 5 53
48 20 38 54 5 54
49 20 39 54 5 54
50 10 40 55 5 55
51 10 41 55 5 55
52 10 42 56 5 56
53 10 43 56 5 56
54 10 44 57 5 57
55 10 45 57 5 57
56 10 46 58 5 58
57 10 47 58 5 58
58 10 48 59 5 59
59 10 49 59 32000 59

The dispadmin Command

• Displays or changes scheduler parameters
• Options:
• -l - lists currently loaded scheduling classes
• -c - class to use
• -g - display configured parameters
• Output is in the format that -s reads, so it is a simple
way to produce a control file
• -s file - set parameters from a file

dispadmin Example
# dispadmin -c TS -g
#Time Sharing Dispatcher Configuration
RES=1000
#ts_quantum ts_tqexp ts_slpret ts_maxwait ts_lwait PRIORITY LEVEL

200 0 50 0 50 # 0
200 0 50 0 50 # 1
200 0 50 0 50 # 2
200 0 50 0 50 # 3
200 0 50 0 50 # 4
200 0 50 0 50 # 5
200 0 50 0 50 # 6
...
40 44 58 0 59 # 54
40 45 58 0 59 # 55
40 46 58 0 59 # 56
40 47 58 0 59 # 57
40 48 58 0 59 # 58
20 49 59 32000 59 # 59
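
The RES line gives the resolution of the time quantum column: values are in units of 1/RES seconds, so with RES=1000 they are milliseconds. The following Python sketch reads output in this format and converts quanta to milliseconds. It is an illustration only: SAMPLE is two rows copied from the listing above, and the function names are hypothetical.

```python
SAMPLE = """\
#Time Sharing Dispatcher Configuration
RES=1000
#ts_quantum ts_tqexp ts_slpret ts_maxwait ts_lwait PRIORITY LEVEL
200 0 50 0 50 # 0
20 49 59 32000 59 # 59
"""


def parse_ts_table(text):
    """Return (res, rows) where rows maps priority level to its five fields."""
    res = None
    rows = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("RES="):
            res = int(line.split("=", 1)[1])
        elif line and not line.startswith("#"):
            # Data row: five numbers, then "# level" as a trailing comment.
            fields, _, level = line.partition("#")
            quantum, tqexp, slpret, maxwait, lwait = (int(f) for f in fields.split())
            rows[int(level)] = {"quantum": quantum, "tqexp": tqexp,
                                "slpret": slpret, "maxwait": maxwait,
                                "lwait": lwait}
    return res, rows


def quantum_ms(quantum, res):
    """Convert a quantum given in 1/res-second units to milliseconds."""
    return quantum * 1000.0 / res


res, rows = parse_ts_table(SAMPLE)
for level in sorted(rows):
    print("level", level, "quantum", quantum_ms(rows[level]["quantum"], res), "ms")
```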

Real-Time Scheduling
Real-time systems have rigid response time or dispatch
latency requirements
• Failure to meet the time constraint can cause problems
General purpose systems have problems with real-time:
• Page faults – Unknown time to resolve
• Remedy: all pages are locked in memory
• Including shared libraries and shared memory
• Non-preemptable kernel – Potential CPU delays
• Remedy: preemptable kernel

Real-Time Issues
• Real-time processes can hurt the system if they:
• Use lots of memory – All pages are locked in memory
• Are compute-intensive - Other tasks may be locked
out and system daemons may suffer
• fork other processes – Children inherit real-time
• Get stuck in a loop - May be hard to stop
• Make sure that you really need real-time before using it
• Consider using a real-time thread instead of a whole
process
• Set hires_tick to 1 to set clock interrupts to 1000 Hz

Real-Time Dispatch Parameter Table

globpri quantm globpri quantm
100 100 130 40
101 100 131 40
102 100 132 40
103 100 133 40
104 100 134 40
105 100 135 40
106 100 136 40
107 100 137 40
108 100 138 40
109 100 139 40
110 80 140 20
111 80 141 20
112 80 142 20
113 80 143 20
114 80 144 20
115 80 145 20
116 80 146 20
117 80 147 20
118 80 148 20
119 80 149 20
120 60 150 10
121 60 151 10
122 60 152 10
123 60 153 10
124 60 154 10
125 60 155 10
126 60 156 10
127 60 157 10
128 60 158 10
129 60 159 10

The priocntl Command

• Works only at the process level
• Options
• -l - Lists currently loaded scheduling classes
• -d - Display a process’ scheduling parameters
• -s - Set a process’ scheduling parameters
• -e [-c class] - Execute a command as a new process in
the given class
• The priocntl(2) system call is available for programs
• Will work on threads and LWPs

The priocntl Command

Example
# priocntl -e -c RT -p 50 -t 500 command args
• Run command with operand args

• In the real-time scheduling class
• At real-time class relative priority 50
• Global priority 150 (50 + base RT class priority)
• Use a time quantum of 500 milliseconds
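
The priority arithmetic and the command line in this example can be sketched in Python. The helper names are hypothetical. The base of 100 and the 0-59 range are taken from the real-time dispatch table earlier in this module, and the sketch only builds the argument list rather than running priocntl.

```python
RT_BASE_GLOBAL_PRIORITY = 100  # lowest real-time global priority in the RT table


def rt_global_priority(rt_pri):
    """Map a real-time class relative priority (0-59) to a global priority."""
    if not 0 <= rt_pri <= 59:
        raise ValueError("real-time relative priority must be 0-59")
    return RT_BASE_GLOBAL_PRIORITY + rt_pri


def priocntl_rt_argv(rt_pri, quantum_ms, command):
    """Build the priocntl -e argument list used in the example above."""
    return ["priocntl", "-e", "-c", "RT",
            "-p", str(rt_pri), "-t", str(quantum_ms)] + list(command)


print(rt_global_priority(50))
print(" ".join(priocntl_rt_argv(50, 500, ["command", "args"])))
```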

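Thread Dispatching
[Figure: thread dispatching flowchart. A switch request calls
the scheduling class routine. Old thread: if it is finished
with its time slice it is queued at its new priority, otherwise
it goes to the front of its queue. New thread: pick the new
thread, pick a CPU, set it up, run it.]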

CPU Performance Statistics

Activity | Origin | vmstat -? (unit) | sar -? (unit)
Time CPU is executing in user mode | cpu_sysinfo.cpu[CPU_USER] | us (%) | %usr (%)
Time CPU is executing in kernel mode | cpu_sysinfo.cpu[CPU_KERNEL] | sy (%) | %sys (%)
Time CPU is waiting for I/O | cpu_sysinfo.cpu[CPU_WAIT] | summed as id (%) | %wio (%)
Time CPU is idle | cpu_sysinfo.cpu[CPU_IDLE] | summed as id (%) | %idle (%)
Average number of threads blocked on I/O | sysinfo.waiting | b | N/A
Average # of runnable threads over the interval | sysinfo.runque | r | runq-sz
% of time the run queue was occupied | sysinfo.runocc | N/A | %runocc (%)
Average number of thread context switches | cpu_sysinfo.pswitch | cs | pswch/s

Available CPU Performance Data

• cpu_sysinfo.cpu[CPU_USER] – User mode CPU
• cpu_sysinfo.cpu[CPU_KERNEL] – Kernel mode CPU
• cpu_sysinfo.cpu[CPU_WAIT] – CPU idle, block I/O
active
• cpu_sysinfo.cpu[CPU_IDLE] – CPU idle
• cpu_sysinfo.pswitch - Number of thread switches
• sysinfo.runque - Average number of runnable
threads
• sysinfo.runocc - Percentage of time a thread was on
any dispatch queue
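
The CPU state counters are cumulative, so reporting tools work from the difference between two samples. The Python sketch below shows that calculation and how the same deltas appear under vmstat and sar names, with vmstat folding wait time into id. The dictionary keys, function name, and sample tick counts are hypothetical.

```python
STATES = ("user", "kernel", "wait", "idle")


def cpu_percentages(prev, curr):
    """Turn two samples of cumulative per-state tick counts into percentages."""
    delta = {s: curr[s] - prev[s] for s in STATES}
    total = sum(delta.values())
    if total <= 0:
        raise ValueError("no ticks elapsed between samples")
    pct = {s: 100.0 * delta[s] / total for s in STATES}
    vmstat = {"us": pct["user"], "sy": pct["kernel"],
              "id": pct["wait"] + pct["idle"]}  # vmstat sums wait into id
    sar = {"%usr": pct["user"], "%sys": pct["kernel"],
           "%wio": pct["wait"], "%idle": pct["idle"]}
    return vmstat, sar


# Made-up samples taken one interval apart.
before = {"user": 1000, "kernel": 500, "wait": 200, "idle": 8300}
after = {"user": 1300, "kernel": 600, "wait": 300, "idle": 8800}
vmstat_view, sar_view = cpu_percentages(before, after)
print("vmstat:", vmstat_view)
print("sar:   ", sar_view)
```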

CPU Control and Monitoring

Processor control and information are provided by:
• mpstat - CPU usage statistics
• psrinfo - Which CPUs are enabled
• psradm - Enable and disable individual CPUs
• /usr/kvm/prtdiag - print system configuration and
diagnostic information
• prtconf - show device configuration
• sysdef - show software configuration

The mpstat Monitor

• Utility to display per-processor performance statistics
• Similar to vmstat in operation
• Used to check per-CPU loads and load balancing
• Usually not needed in normal tuning
• Can be useful with erratic performance
• Use lockstat to look at locking problems

mpstat Data
• xcal - number of interprocessor cross calls
(cpu_sysinfo.xcalls)
• intr - number of interrupts received on each CPU
(cpu_sysinfo.intr)
• migr - number of threads which migrated from another
CPU (cpu_sysinfo.cpumigrate)
• smtx - number of failed adaptive mutex_enter() calls
(cpu_sysinfo.mutex_adenters)
• srw - number of times readers/writer locks were not
acquired on the first try (cpu_sysinfo.rw_wrfails,
cpu_sysinfo.rw_rdfails)

Processor Sets
• Allows exclusive use of groups of processors by certain
processes
• Also known as CPU fencing
• Very different from pbind(1M)
• Managed by the psrset(1M) command
• Controlled by root
• System-defined processor sets may be used by any user
• DR will release bindings if necessary

Using Processor Sets

• You must leave one processor for the system
• Make sure you leave enough
• A CPU may be in only one processor set
• The system will not adjust dynamically
• The tools do not show queues by processor set
• Queueing on a processor set could be missed
• You can change CPU membership dynamically
• You’ll need to watch fenced processes carefully
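
The first two membership rules above can be checked before a planned layout is applied. This is a Python sketch of such a check, not part of psrset. The function name, the set names, and the CPU numbers are hypothetical.

```python
def check_processor_sets(all_cpus, planned_sets):
    """Return a list of rule violations for a planned processor set layout."""
    problems = []
    owner = {}  # CPU id -> name of the set that already claimed it
    for name, cpus in planned_sets.items():
        for cpu in cpus:
            if cpu not in all_cpus:
                problems.append("CPU %d in set %s does not exist" % (cpu, name))
            elif cpu in owner:
                problems.append("CPU %d is in both %s and %s"
                                % (cpu, owner[cpu], name))
            else:
                owner[cpu] = name
    if not set(all_cpus) - set(owner):
        problems.append("no processor is left outside the sets for the system")
    return problems


# Hypothetical 4-CPU system with two planned sets.
print(check_processor_sets([0, 1, 2, 3], {"db": [1, 2], "batch": [2, 3]}))
print(check_processor_sets([0, 1, 2, 3], {"db": [1, 2], "batch": [3]}))
```

The check only confirms that at least one CPU remains unassigned. Whether that is enough for the rest of the workload still has to be judged from measurements.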

Check Your Progress

• Describe the Solaris scheduling classes
• Explain the use of dispatch priorities
• Describe the time-sharing scheduling class
• Describe Solaris real-time processing
• Describe the commands that manage scheduling
• Describe the purpose and use of Processor Sets

Module 11
Performance Tuning Summary

Module Overview
• Course map
• Relevance
• Objectives

Course Map

Objectives
• List four major performance bottlenecks
• Describe characteristics and solutions for each
• Describe several different ways to detect a performance
problem
• Identify the type of performance problem that you have,
and what tools are available to resolve it

Some General Guidelines

• Tune one thing at a time
• Spend the most time tuning the things which have the
largest effect; expect trade-offs
• One size does not fit all
• What works on a small system may not work on a
larger system
• Many parameters have to be scaled or adjusted
• Don’t spend money first

What You Can Do

• Use accounting to see what is being run
• Start accounting during boot
• Use the adm crontab to process it
• Collect performance data regularly with the sys
crontab entry
• Generate special purpose measurements with sar,
vmstat, etc.
• Graph the data to see exceptions and trends
• Consider using 3-D graphs

Application Source Code Tuning

It’s usually much more productive to tune applications than
to change OS parameters! (But do both.)
Application performance is affected by:
• Programming Model
• Algorithms
• Choice of programming language
• Compiler & Optimizations
• Effective use of provided system functions

Application Executable Tuning Tips

• Use accounting to see what’s using resources
• Trace system calls (truss) to see what is going on
• Use cachefs to speed up read-mostly NFS
• Set the proper values for your basic tuning parameters
• Be aware that parameters change from release to release
• Review your parameters with each new release
• Watch the README files for patches

Compiler Optimization
• Can provide performance improvements of several
hundred percent
• Usually not done by the programmer
• Makes programs a bit harder to debug
• Even better optimization is available for specific CPUs
• Basic compile flags to use are -xO3 -xdepend
• Compiles will run longer
• Use optimized libraries if available

Performance Bottlenecks
• Memory bottleneck
• I/O bottleneck
• CPU bottleneck
• Lock contention bottleneck

Memory Bottlenecks
Detection symptoms:
• Steady page-out activity
• Scan rate is non-zero (page daemon is active)
• Swapper is active
• Free memory is at or below lotsfree
• Hardware cache misses

Memory Bottleneck Solutions

• Adjust paging parameters such as lotsfree
• Decrease parameters scaled from maxusers
• Use shared libraries
• Use madvise() to use memory more efficiently
• Analyze locality of reference in applications
• Set memory usage limits with ulimit, limit, and
setrlimit(2)
• Modify the process load if possible
• Add more physical memory

I/O Bottlenecks
Detection symptoms:
• Uneven workload
• Many threads blocked waiting on I/O
• High disk utilization rate
• Active disk with no free space

I/O Bottleneck Solutions

• Adjust the DNLC and inode cache sizes
• Set appropriate UFS partition parameters
• Balance your disk load
• Use RAID 0 (striping)
• Use mmap instead of read and write
• Use shared libraries and compiler optimization
• Organize I/O requests to be more contiguous
• Move slow devices off of production buses
• Add more or faster disks or buses

CPU Bottlenecks
Detection symptoms:
• CPU idle time is low
• Threads waiting on run queue
• Slower response time or interactive performance

CPU Bottleneck Solutions

• Modify applications to use system calls more efficiently
• Use mmap instead of read and write
• Use compiler optimization
• Don’t run unnecessary system daemons
• Use priocntl and nice to modify process priorities
• Modify or limit the process load
• Modify the dispatch parameter tables
• Check old custom device drivers for PIO usage
• Add more or faster CPUs

Ten Tuning Tips

1. Remember the interrelationships in the system. The
source of a problem is not always obvious.
2. Sustained high scan rates indicate a memory shortage
3. Ignore free memory numbers; they should be about
lotsfree.
4. Ignore paging I/O statistics
5. Always check for disk bottlenecks

Ten Tuning Tips

6. Use nfsstat -m on slow NFS clients to find the
source of the problem
7. Make sure that your file systems are tuned correctly
8. Graph your data to see exceptions quickly
9. Base your tuning on the characteristics of the
component (bus or cache)
10. Things change: check patches and new release
documentation

Check Your Progress

• List four major performance bottlenecks
• Describe characteristics and solutions for each
• Describe several different ways to detect a performance
problem
• Identify the type of performance problem that you have,
and what tools are available to resolve it

