

Roger Snowden, Center of Expertise, Oracle Support
November 14, 2007

Although modern server platforms use virtual memory managers to provide resilience and
robustness of operation, severe shortages of physical memory can have negative and even
catastrophic consequences for production servers. In order to provide sufficient physical memory for
optimal operation, it is necessary to have a basic understanding of how virtual memory is used, and a
means to measure current and historic memory usage.
This article provides a brief, elementary explanation of virtual memory managers, as typically
implemented on Unix and Linux systems, and introduces readers to simple methods and tools to
ascertain memory usage. The reader is also offered simple techniques to detect and diagnose
problems associated with physical memory shortfalls.
This article is intended for system administrators, database administrators, and managers who wish to
determine virtual and physical memory utilization on Unix and Linux platforms and to conduct basic
capacity planning.
In very simple terms, virtual memory is a technique whereby an application program is able to use
large amounts of memory, which can exceed the physical memory on a machine. Essentially, physical
memory used by a process is extended transparently, by using disk resources.
In order for a process to access more memory than exists on a machine, physical memory is divided
into uniform-sized pages. When a process needs to allocate memory, it obtains that memory from the
virtual memory manager. The virtual memory manager obtains a reference to a page of physical
memory from a pool of memory reserved for that purpose by the operating system, and places that
reference in the page table.
The application process does not need to do anything special to access or manage the memory other
than use the appropriate operating system call to make the allocation request. Memory use via virtual
memory is meant to be entirely transparent to application processes. A diagram of processes and
virtual memory components follows, with some discussion of those components and their functions.
[Diagram: Application 1 and Application 2 processes, the page table, and the swap file]

In the diagram above, in addition to application processes that consume memory, physical memory is
shown, along with the swap file and the page table. When more memory pages are allocated than
physically exist, some previously allocated page of memory must be reused. A page of memory from
the page table is chosen, based on its relatively non-recent use by its owning application process.
When a process attempts to access memory, the actual page lookup and address translation are
performed by a memory management unit (MMU). The MMU is a hardware device that makes virtual
memory feasible and transparent to application processes.
In order to preserve the contents of that reused page, the original page is written to a unique location
within the swap file, for later retrieval. This is known as a page-out operation. The paging mechanism
may involve physical I/O, but permits application programs to allocate nearly unlimited amounts of
memory, transparently. The event of requesting a page from the page table, when the page is not
present, is known as a page fault.
Once the page-out operation is complete, the memory page in the page table is then granted to the
process that requested memory, that is, the process that incurred the page fault. That requesting
process can then modify the memory. By having been written to the swap file, the paged-out
memory becomes “clean” and safe to modify. Neither the allocating process nor the process whose
page was written to the swap file is aware of the page-out operation.
Because the virtual memory management code introduces some execution overhead, paging costs
some process execution time. Transparency to the application program is the point of virtual
memory, but the flexibility and resilience gained by use of virtual memory are not “free”.
When the process that owns the previously paged-out memory needs to access that memory page
again, the page from the swap file must then be read back into physical memory, into an available
entry from the page table. If an unused page is not available, another least recently used page of
memory from another process must now be paged out, and the cycle continues.
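The page fault, page-out, and page-in cycle described above can be illustrated with a toy model. This is only a minimal sketch, assuming strict least-recently-used replacement and assuming every evicted page is written to the swap file; the class name TinyPager and its counters are hypothetical and do not correspond to any real kernel interface.

```python
from collections import OrderedDict


class TinyPager:
    """Toy model of demand paging with least-recently-used replacement."""

    def __init__(self, physical_frames):
        self.physical_frames = physical_frames
        self.resident = OrderedDict()  # (process, page) keys, oldest use first
        self.swap_file = set()         # pages whose contents were paged out
        self.page_faults = 0
        self.page_outs = 0
        self.page_ins = 0

    def access(self, process, page):
        key = (process, page)
        if key in self.resident:
            # Page is in physical memory: just record the recent use.
            self.resident.move_to_end(key)
            return "hit"

        # Page is not resident: this is a page fault.
        self.page_faults += 1
        if len(self.resident) >= self.physical_frames:
            # No free frame: page out the least recently used page.
            victim, _ = self.resident.popitem(last=False)
            self.swap_file.add(victim)
            self.page_outs += 1
        if key in self.swap_file:
            # The page was paged out earlier, so read it back in.
            self.swap_file.remove(key)
            self.page_ins += 1
        self.resident[key] = True
        return "fault"


pager = TinyPager(physical_frames=2)
for process, page in [("app1", 0), ("app2", 0), ("app1", 1), ("app1", 0)]:
    pager.access(process, page)
print(pager.page_faults, pager.page_outs, pager.page_ins)  # prints: 4 2 1
```

With only two physical frames, the third access forces a page-out of the first page, and the fourth access has to page that same page back in at the expense of another process, which is the cycle the text describes.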
Some paging is normal in a busy system. However, when memory demands become heavy, and free
unused physical memory becomes exceptionally low, then more drastic measures must be taken to
make physical memory available to processes.
Unix and Linux systems have kernel-owned processes responsible for monitoring overall free
physical memory. Known as swappers, these processes will detect situations when free physical
memory drops below a predetermined threshold. At that point, those swappers, one per CPU,
will begin to grab multiple pages from entire processes and write those pages to disk in order to free
up large chunks of memory. When this happens, all other CPU activity is typically suspended until
some higher threshold of memory becomes available. System administrators set these threshold
values at system configuration time.
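The two-threshold behavior described above (start swapping below a low threshold, stop only once a higher threshold is reached) is a form of hysteresis. The following is a minimal sketch of that decision logic only; the function name, parameter names, and threshold values are hypothetical, and real kernels use their own tunables.

```python
def swapper_active(free_pages, low_mark, high_mark, currently_active):
    """Return True if the swapper should be running after this sample."""
    if free_pages < low_mark:
        return True           # below the low threshold: start swapping
    if free_pages >= high_mark:
        return False          # enough memory recovered: stop swapping
    return currently_active   # in between: keep doing what we were doing


active = False
history = []
for free in [500, 150, 80, 120, 250, 400]:
    active = swapper_active(free, low_mark=100, high_mark=300,
                            currently_active=active)
    history.append(active)
print(history)  # prints: [False, False, True, True, True, False]
```

Note that swapping continues at 120 and 250 free pages even though both values are above the low threshold, because the higher threshold has not yet been reached.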
While light paging activity is considered normal, and not necessarily performance impairing,
swapping results in severe performance degradation. This is not only because of the extreme and
time-consuming I/O involved, but also because swapped processes cannot run until their memory is
swapped back into the page table, which often means another process must then be swapped out.
Moreover, the swapper process dominates CPU resources, noticeably blocking other processes from
execution during the time of extreme swapping activity.
In Linux, all pages of memory will be in one of five states, shown in the diagram below. Other
operating system memory states will vary, but are similar to this:


[Diagram: Linux memory page states, with transitions labeled accessed/kscand, page out, and kupdated/bdflush]
During the lifecycle of a memory page, each page will move from one state to another as needed.
Those states are:
Free: A free page is not being used, and is available for allocation to a process.
Active: An active page is in use by a process.
Inactive Dirty: When a page is unused for a particular period of time, it is marked as inactive dirty,
and is a candidate for reuse by another process. A kernel process periodically scans all memory pages
and tracks how recently each page has been used. A busy page is left in the active state, while an
unused page is moved to the inactive laundry list.
Inactive Laundered: A page on the inactive laundry list has its contents written to disk for
preservation, should the owning process need to access it later. Once the write operation is complete,
the page enters the inactive laundered state, which is a transitional state.
Inactive Clean: An inactive laundered page is moved to the inactive clean state to indicate that page is
now eligible for reuse. It may be deallocated or overwritten as needed.
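The page lifecycle listed above can be summarized as a small state machine. This is a simplified sketch that encodes only the transitions named in the text; the event names (allocate, go_idle, and so on) are hypothetical labels chosen for illustration, and a real kernel has additional paths, such as an inactive page becoming active again when it is accessed.

```python
from enum import Enum


class PageState(Enum):
    FREE = "free"
    ACTIVE = "active"
    INACTIVE_DIRTY = "inactive dirty"
    INACTIVE_LAUNDERED = "inactive laundered"
    INACTIVE_CLEAN = "inactive clean"


# (current state, event) -> next state; event names are illustrative only.
TRANSITIONS = {
    (PageState.FREE, "allocate"): PageState.ACTIVE,
    (PageState.ACTIVE, "go_idle"): PageState.INACTIVE_DIRTY,
    (PageState.INACTIVE_DIRTY, "write_complete"): PageState.INACTIVE_LAUNDERED,
    (PageState.INACTIVE_LAUNDERED, "mark_clean"): PageState.INACTIVE_CLEAN,
    (PageState.INACTIVE_CLEAN, "reclaim"): PageState.FREE,
}


def step(state, event):
    """Apply one event, rejecting transitions the text does not describe."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(
            f"{event!r} is not valid in state {state.value!r}"
        ) from None


state = PageState.FREE
for event in ["allocate", "go_idle", "write_complete", "mark_clean", "reclaim"]:
    state = step(state, event)
print(state)  # prints: PageState.FREE
```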
When free memory is drawn down to some predetermined critical level, the operating system will
move inactive memory pages to disk in order to satisfy increased memory demands. This is the
swapping process described earlier in this article. The determination of “critical” and the mechanism
for dealing with the situation vary by operating system, but generally, an operating system process
known as the swapper will begin to swap the memory of entire processes to disk, such that the swapped
process enters a suspended state. If that process is not entirely idle, then when it gets its next
execution opportunity (time slice) and wakes up, its memory is then reclaimed from disk, as perhaps
another process is then forced to have its memory swapped out.
When a system is in a state where far more memory is demanded by processes than is physically
available, and those processes must alternately become swapped, the system begins to thrash. This is a
desperately serious state in which overall machine performance is severely impaired, since it is
spending more time managing memory than executing application code. Therefore, it is essential that
enough physical memory be available on a system to avoid swapping. Swapping, if it continues, can
lead to complete memory exhaustion and a system halt, at which time the machine becomes
unavailable altogether, until rebooted.
Swappers, also known as swap daemons, are operating system processes that exist for each CPU on the
machine. When they are actively trying to reclaim memory by swapping other processes to disk, they
usually run at a sufficiently high priority such that normal application processes have to wait in a run
queue for CPU time to become available once the swapper has resolved the temporary memory shortage.
When a process is waiting for CPU in a run queue, it is not executing. For time-critical services, such
as the cssd daemon of Oracle Portable Clusterware (also known as CRS), this situation can be fatal for a
cluster node, since it may be unable to respond to the heartbeat messages from other nodes in the
cluster. This situation can lead to unexpected node evictions in an Oracle RAC cluster.
To avoid critical performance and availability issues for servers, some commonly available utilities
can be helpful. On Unix and Linux systems, vmstat provides a useful picture of the current memory
situation. Vmstat operates by taking samples of operating system information at regular intervals,
settable as a command line parameter. While a thorough discussion of vmstat is outside the scope of
this article, a sample of vmstat obtained from a server undergoing memory exhaustion is included for
illustration:

[root@ceintcb-14 proc]# vmstat 2 30
procs                       memory       swap         io    system           cpu
 r  b   swpd    free   buff  cache   si    so   bi    bo   in   cs us  sy  id wa
 1  0 142120  312400  19468 142540    0     1    1     2   18   10  2   3   7  0
 1  0 142120  244800  19468 142540    0     0    0    24  109   43  0 100   0  0
 1  0 142120  195196  19476 142540    0     0    0    44  109   44  0 100   0  0
 2  0 142120  137240  19476 142540    0     0    0     0  106   31  0 100   0  0
 1  0 142120   61688  19480 142540    0     0    0    32  108   44  0 100   0  0
 2  0 241896   20904  14256 134448    0   404    0   428  124  164  0 100   0  0
 2  0 271212   20376  13572 120260    0   448    0   448  118   31  0 100   0  0
 4  0 365144   19956  12800 114376    0  6800    2  6838  140   48  1  99   0  0
 4  0 442024   18904  10888 109484    0 10768    0 10768  113   98  0 100   0  0
 5  2 587060   18860  10648 105564   42     0   48    12  188   35  0 100   0  0
 1  0 177624 1874148  10268 102684   40     0   76    18  118   47  0  95   5  0
 0  0 177624 1874340  10268 102696    0     0    6    24  118   51  0   2  98  0
 0  0 177624 1874340  10272 102720    6     0   18    64  111   42  0   0 100  0

In the example above, the leftmost “r” column represents CPU run queue length, a symptom of
processes waiting for CPU, and thus CPU resource exhaustion. Together with the “id” (CPU percent
idle) column to the far right, we can tell this machine is CPU-bound. The run queue average spikes
upward suddenly, while idle percentage drops to zero.
While a casual observer might conclude the worst bottleneck on this system is CPU and not memory,
the reason CPU waits are high is because of memory exhaustion. As discussed earlier, under duress, a
system’s swapper kernel process will dominate process execution time until enough memory is free
for current memory demand. All other processes must wait in a run queue until the swapper’s task is
complete.
Note the “swpd” and “free” memory columns, representing total system swapped and free memory
respectively. The free memory drops rapidly until a critical threshold is reached, at which time the
swapping activity begins. The “so” column indicates memory pages swapping out to disk, while “si”
indicates memory being swapped back in. As one might expect, after swapping out much memory,
the “free” value jumps significantly. However, this is not always apparent as other processes may be
consuming that memory as soon as it becomes free.
The “si” activity burst following the “so” swapping-out activity is the result of some processes whose
memory was previously swapped out, now being swapped back in. Those processes are using some
of the memory freed up by the swap-out operation.
As for the CPU run queue and percent idle values, note the run queue size drops to zero and the
percent idle increases quickly to 100 percent idle as the swapping increases the amount of free
memory and the memory-starvation crisis is resolved. The swapper daemon no longer dominates
CPU and other processes can get sufficient execution time for the run queue length to become zero.
The amount of memory swapped to disk will remain high until the swapped out processes need to
run again, and access memory pages that were previously swapped.
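The reading of the vmstat columns described above can also be automated. The following is a minimal sketch, assuming the 16-column layout shown in this article (newer vmstat versions may print additional columns, such as "st"); the function names are hypothetical, and the sample rows are copied from the output above.

```python
VMSTAT_COLUMNS = "r b swpd free buff cache si so bi bo in cs us sy id wa".split()

SAMPLE = """\
r b swpd free buff cache si so bi bo in cs us sy id wa
1 0 142120 61688 19480 142540 0 0 0 32 108 44 0 100 0 0
2 0 241896 20904 14256 134448 0 404 0 428 124 164 0 100 0 0
4 0 442024 18904 10888 109484 0 10768 0 10768 113 98 0 100 0 0
0 0 177624 1874340 10268 102696 0 0 6 24 118 51 0 2 98 0
"""


def parse_vmstat(text):
    """Turn vmstat data rows into dictionaries keyed by column name."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Skip header lines and anything that is not a full row of integers.
        if len(fields) != len(VMSTAT_COLUMNS):
            continue
        if not all(field.isdigit() for field in fields):
            continue
        rows.append(dict(zip(VMSTAT_COLUMNS, map(int, fields))))
    return rows


def swapping_samples(rows, min_swap_out=1):
    """Return the rows in which memory was swapped out to disk."""
    return [row for row in rows if row["so"] >= min_swap_out]


rows = parse_vmstat(SAMPLE)
flagged = swapping_samples(rows)
for row in flagged:
    print(f"free={row['free']} so={row['so']} r={row['r']} id={row['id']}")
```

Any sample with a nonzero "so" value is reported, together with the free memory, run queue length, and idle percentage at that moment, so the memory shortage and the CPU symptoms can be read side by side.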
In cases where sudden and extreme memory consumption leads to swapping, it may not be obvious
to the system administrator what the precise cause of the problem is. In such cases, a tool such as top
may be invoked to analyze relative memory usage among processes.
Top collects information from processes consuming either CPU or virtual memory resources, with
some useful details. A complete discussion of top is outside the scope of this article, and the reader is
encouraged to read appropriate Unix or Linux documentation to fully understand top, and similar
utilities. Here is a test case example, designed to deliberately “leak” memory, to illustrate use of top:

top - 17:16:58 up 12 days, 17:45,  3 users,  load average: 0.94, 0.77, 0.42
Tasks: 143 total,   1 running, 142 sleeping,   0 stopped,   0 zombie
Cpu(s):  3.1% us,  0.5% sy,  0.2% ni, 95.3% id,  1.0% wa,  0.0% hi,  0.0% si
Mem:   4072172k total,  4055640k used,    16532k free,      920k buffers
Swap:  4144760k total,   452076k used,  3692684k free,  1838948k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+ SWAP COMMAND
19586 root      15   0  394m 391m  348 S    0  9.9   0:07.76 2256 yyksd
19585 root      15   0  328m 325m  348 S    0  8.2   0:08.40 2760 yyksd
19588 root      15   0  306m 305m  348 S    1  7.7   0:07.11 1376 yyksd
19587 root      15   0  284m 283m  348 S    0  7.1   0:07.89 1228 yyksd
19584 root      15   0  270m 268m  348 S    2  6.8   0:08.29 1648 yyksd
19589 root      15   0  236m 234m  348 S    2  5.9   0:09.11 1840 yyksd
20708 oracle    16   0  596m 111m 108m S    0  2.8   1:02.49 484m oracle
30959 oracle    15   0  595m 109m 106m S    0  2.7   1:15.14 486m oracle
30965 oracle    16   0  604m 103m  95m S    0  2.6   0:26.17 500m oracle
13297 oracle    15   0  595m 103m 100m S    0  2.6   1:58.85 492m oracle
20703 oracle    16   0  596m  88m  84m S    0  2.2   2:11.92 508m oracle
20674 oracle    16   0  610m  83m  11m S    0  2.1   5:29.31 526m java

In this test case example, we see a single process is currently running and 142 processes are sleeping,
or suspended. In the tabular part of the output, we see the “COMMAND” column on the right,
listing process names in order of physical memory consumption (“RES”, the resident memory column). The top
memory-consuming processes are all named “yyksd”. As mentioned, this is a contrived case to
illustrate this specific memory diagnostic technique.
As we can see, the first “yyksd” has consumed 391 megabytes of physical memory, which is 9.9
percent of all memory on the system. The “S” column indicates process state, which is suspended in
this case. All other processes in the list are also suspended.
Note the “SWAP” column, which shows all processes listed as having at least some memory
swapped to disk, but in particular the processes, in this partial display of output, following the
“yyksd”, starting with the “oracle” processes, have hundreds of megabytes of memory swapped out.
A logical starting point for diagnosing this situation would be to investigate the nature of “yyksd”
and determine why it is consuming so much physical memory, forcing others to be swapped.
A full diagnostic discussion is out of the scope of this article, but such efforts might include truss or
strace capture, perhaps some process stack trace captures with pstack or a similar utility, and a detailed
analysis of memory consumed by this process, as contained within the /proc filesystem for the
process in question. For such purposes, the PID column shows the process id of each process listed.
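As one example of what the /proc filesystem offers on Linux, the file /proc/<pid>/status reports a process's resident memory (VmRSS) and, on kernels that provide the field, its swapped-out memory (VmSwap). The following is a minimal sketch; the function names are hypothetical, and the sample text is made up for illustration rather than captured from the test case above.

```python
def memory_from_status(status_text):
    """Extract resident and swapped sizes (in kB) from /proc/<pid>/status text."""
    wanted = {"VmRSS": "resident_kb", "VmSwap": "swapped_kb"}
    result = {}
    for line in status_text.splitlines():
        name, _, rest = line.partition(":")
        if name in wanted:
            # Lines look like "VmRSS:    400384 kB"; keep only the number.
            result[wanted[name]] = int(rest.split()[0])
    return result


def read_process_memory(pid):
    """Read the live values for one process (Linux only)."""
    with open(f"/proc/{pid}/status") as handle:
        return memory_from_status(handle.read())


# Made-up excerpt in the /proc/<pid>/status layout, so the sketch runs anywhere.
EXAMPLE_STATUS = "Name:\tyyksd\nVmRSS:\t  400384 kB\nVmSwap:\t    2256 kB\n"
usage = memory_from_status(EXAMPLE_STATUS)
print(usage)  # prints: {'resident_kb': 400384, 'swapped_kb': 2256}
```

On a live Linux system, read_process_memory can be called with a PID taken from the top output; fields the kernel does not report are simply absent from the result.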
For further clarification of the problem case, here is top output sorted by CPU consumption at some
earlier point during this test case:

top - 17:14:53 up 12 days, 17:43,  3 users,  load average: 0.79, 0.60, 0.32
Tasks: 143 total,   2 running, 141 sleeping,   0 stopped,   0 zombie
Cpu(s):  2.7% us,  9.8% sy,  0.0% ni, 44.8% id, 42.7% wa,  0.0% hi,  0.0% si
Mem:   4072172k total,  4055808k used,    16364k free,      268k buffers
Swap:  4144760k total,    24012k used,  4120748k free,  2148344k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+ SWAP COMMAND
   56 root      15   0     0    0    0 D    8  0.0   0:22.80    0 kswapd0
19587 root      15   0  223m 222m  348 S    5  5.6   0:05.21 1356 yyksd
19584 root      15   0  236m 235m  348 S    3  5.9   0:06.27 1616 yyksd
19586 root      15   0  315m 313m  348 S    3  7.9   0:05.78 2256 yyksd
19585 root      15   0  265m 263m  348 S    2  6.6   0:06.16 2856 yyksd
19589 root      15   0  201m 199m  348 S    2  5.0   0:06.80 1744 yyksd
19588 root      15   0  243m 242m  348 S    1  6.1   0:05.34 1376 yyksd
   36 root       5 -10     0    0    0 S    0  0.0   0:02.62    0 kblockd/1
  332 root      16   0     0    0    0 S    0  0.0   7:58.64    0 kjournald
14113 oracle    25  10 39128  22m 9.9m S    0  0.6   4:52.29  16m rhn-applet-gui
30957 oracle    16   0  594m  18m  16m S    0  0.5  16:09.45 575m oracle

Note the top CPU-consuming process is kswapd0, the “swapper” daemon for one of the CPUs. The
presence of a swapper process as the top CPU-consumer is clear evidence of heavy swap activity, and
of potential performance problems resulting from the inability of other processes to run while
waiting for the swapper to complete its high priority task. The process state (“S” column, with the
“D” value) indicates the swapper is in uninterruptible sleep state. This means it is sleeping, most likely
because it is waiting for completion of an I/O request from its most recent swap file write or read. It
cannot be interrupted. So, since it is itself waiting for I/O, the entire machine is indirectly waiting for
that same I/O. Not a good thing for performance, obviously.
The rest of the processes in the list are in “S” state, which means they are suspended, or sleeping.
While vmstat and top are excellent tools for monitoring overall virtual memory health, they need to be
run at appropriate intervals and their output captured for later analysis. In cases where a problem is
reported after-the-fact, such tools are not helpful unless they were running at the time the problem
occurred, and their output archived for later analysis. To address such situations, Oracle Support’s
Center of Expertise has developed OSWatcher, a script-based tool for Unix and Linux systems that
runs and archives output from a number of operating system monitoring utilities, such as vmstat, top,
iostat, mpstat and ps.
OSWatcher is available from Metalink as note 301137.1. It is a shell script tool and will run on Unix
and Linux servers. It operates as a background process and runs the native operating system utilities
at user-settable intervals, by default 30 seconds, and retains an archive of the output for a user-
settable period, defaulting to 48 hours. This value may be increased in order to retain more
information when evaluating performance, and to capture baseline information during important
cycle-end periods.
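To put the default settings in perspective, a quick calculation shows how many samples per utility an archive holds at a given interval and retention period. This is only arithmetic on the figures quoted above, not a description of how OSWatcher is implemented; the function name is hypothetical.

```python
def snapshots_retained(interval_seconds, retention_hours):
    """Number of samples per utility kept in the archive at steady state."""
    return (retention_hours * 3600) // interval_seconds


print(snapshots_retained(30, 48))   # defaults: prints 5760
print(snapshots_retained(30, 168))  # one week of retention: prints 20160
```

Lengthening the retention period increases the archive size proportionally, which is worth considering when capturing baselines across a full business cycle.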
Oracle recommends customers download and install OSWatcher on all production and test servers
that need to be monitored.
For upgrade and migration planning, as well as informal capacity planning, the vmstat archive files
from OSWatcher can be gathered from production and test systems and analyzed for symptoms such
as illustrated above. If any sign of swapping is noted, more memory needs to be made available, or
some analysis of memory-consuming processes made in order to control critical memory