TLB, TSB Misses and Page Faults

I was trying to explain TLB misses and page faults to my teammates when I realized I was not 100% confident about the topic myself. I spent some time reading up on Solaris Internals, and I also wrote to our Sun contact and got first-hand information from Sun. What follows is a rather simplified description of TLB/TSB misses and page faults.

Basics

Memory is divided into page-sized chunks. The supported page sizes depend on the hardware platform and the operating system. The current UltraSparc platform running Solaris 10 supports 8K, 64K, 512K and 4M pages. The cool threads servers (T2000 and newer) running Solaris 10 also support 256M pages (512K is not supported).
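
You can check which page sizes a given system actually supports with the pagesize command. The output below is illustrative only (values are in bytes and correspond to the 8K, 64K, 512K and 4M sizes mentioned above); your platform may differ.

pagesize -a
8192
65536
524288
4194304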

The Terminology - TLB, TSB, HPT, HME and TTE

When a process requests memory, only virtual memory is allocated; physical memory is not allocated yet. The first time the process accesses a page within that allocated virtual memory, a page fault occurs. As a result, a physical page (taken from the free lists) is mapped to the virtual page of the process.

This mapping is created by the virtual memory system in software and stored in the HPT (Hash Page Table) in the form of HAT Mapping Entries (HME). A copy of the entry is also inserted into the TLB and the TSB as Translation Table Entries (TTE).

The TLB or Translation Lookaside Buffer is an on-CPU cache of the most recently used virtual-to-physical memory mappings, or Translation Table Entries (TTE). There are multiple TLBs on the CPU: the iTLB stores entries for text/libraries, and the dTLB stores entries for data (heap/stack).

The number of entries in either TLB is limited and depends on the CPU. For example, the UltraSparc IV+ CPU has an iTLB that can store 512 entries and two dTLBs, each of which can store 512 entries.

Since the number of entries in the TLB is limited, there is a bigger cache of the TTEs in physical
RAM called the TSB (Translation Storage Buffer). Each process has its own dedicated TSB.

In Solaris 10, both the default size and the maximum size (up to 1MB per user process) of a user process TSB can be changed. The TSB grows and shrinks as needed, and each process has two TSBs: one for 8K, 64K and 512K pages and the other for 4M pages. The maximum memory that can be allocated to all user TSBs combined can also be specified. Finally, an entry in the TSB requires 16 bytes, so it is easy to work out the TSB size needed to hold a given number of entries.
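
As a quick sanity check of that arithmetic, here is a minimal sketch (ksh/bash shell arithmetic; the 12GB figure is just an example, reused in the TSB sizing discussion further below):

# entries a 64KB TSB can hold (each entry is 16 bytes)
echo $((64 * 1024 / 16))              # -> 4096
# 4M pages needed to map a 12GB segment
echo $((12 * 1024 / 4))               # -> 3072
# TSB bytes needed for those 3072 entries
echo $((12 * 1024 / 4 * 16))          # -> 49152, i.e. a 48KB TSB
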
Page Faults

The CPU first checks the TLB for the TTE; if it is not found (a TLB miss), it checks the TSB. If the entry is not present in the TSB (a TSB miss), the HPT is searched for the HME. If it is not present in the HPT, the result is a page fault.

A Minor page fault happens when the HME is not present in the HPT but the contents of the requested page are already in physical memory. The mapping needs to be re-established in the HPT, and the TSB and TLB reloaded with the entries.

A Major page fault happens when the HME is not present in the HPT and the contents of the requested page have been paged out to the swap device. The requested page needs to be mapped to a free page in physical memory and its contents copied back from swap into that physical page. The entry is stored in the HPT, and the TSB and TLB are again reloaded with the entries.

Swap and Page in/Page out

Each physical memory page has a backing store, identified by a file and offset. A page-out occurs when the contents of a physical page are migrated to the backing store; a page-in is the reverse.

Anonymous memory (heap and stack) uses swap as the backing store. For file caching, Solaris uses the file on disk itself as the backing store.

Swap is a combination of the swap device (on disk) and free physical memory.

Why and when do I need to worry about TLB/TSB misses and Page Faults?

As RAM gets cheaper, it is commonplace to see entry-level systems starting with 16GB of memory or more. This is true for both x86 and proprietary Unix systems. With more physical memory available, a DBA configures Oracle with bigger SGA and PGA sizes to take advantage of it.

While the discussion above is focused entirely on the Sparc platform, the concepts of pages, the TLB and page tables apply to all systems. With 8K pages (Solaris) and 16GB of memory, roughly 2 million mappings are needed to address the entire physical memory. With 4K pages (Linux), the number of mappings is roughly 4 million.
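
A back-of-the-envelope check of those numbers (ksh/bash arithmetic; 16GB expressed in bytes):

echo $((16 * 1024 * 1024 * 1024 / 8192))   # 8K pages -> 2097152 (~2 million mappings)
echo $((16 * 1024 * 1024 * 1024 / 4096))   # 4K pages -> 4194304 (~4 million mappings)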

For maximum efficiency, the relevant entries must be accessible to the CPU with minimal delay: preferably in the TLB, or at worst in the TSB.

However, we know the number of entries the TLB can hold is limited by the hardware. The TSB for a single user process (in Solaris 10) can grow to a maximum of 1MB (65,536 entries), so it is limited too. Searching the HPT on every TLB/TSB miss would be expensive, since it costs CPU cycles to walk the hash mappings for the required entries. And page faults must be avoided as much as possible.

From an Oracle perspective, if CPU wait is one of your top waits, you have ruled out other issues such as available CPUs and CPU scheduling, and you are seeing a significant increase in page faults, then it probably makes sense to look deeper into TLB/TSB misses. As always, it pays to work on the area that can potentially deliver the biggest impact to the customer experience. In my experience, the impact of TLB/TSB misses on an Oracle instance (on Solaris platforms) is sometimes over-emphasized, so you are the best judge of whether this requires further analysis.

What do I need to measure?

Okay, so we get the idea that more RAM and bigger working set sizes mean more mappings, and it is not possible to cache all the entries in the TLB/TSB. So TLB/TSB misses, and possibly page faults, are inevitable. But how do I put a price on them? How costly is a miss? How much time is spent servicing these misses?

The answer lies in using trapstat to check the percentage of time the CPU spends servicing TLB/TSB misses. Unfortunately, the tool does not give an estimate of the time spent servicing major/minor faults. To identify the number of page faults, one uses vmstat or kstat.

How do I measure and analyze the impact?

Running trapstat -T shows TLB/TSB misses broken down by page size. trapstat needs to be run as root. As you can see below, it shows the percentage of time (%tim) spent in user mode (u) and kernel mode (k), for both TLB and TSB misses, per page size.
cpu m size| itlb-miss %tim itsb-miss %tim | dtlb-miss %tim dtsb-miss %tim |%tim
----------+-------------------------------+-------------------------------+----
  0 u   8k|         0  0.0         0  0.0 |         1  0.0         0  0.0 | 0.0
  0 u  64k|         0  0.0         0  0.0 |         0  0.0         0  0.0 | 0.0
  0 u 512k|         0  0.0         0  0.0 |         0  0.0         0  0.0 | 0.0
  0 u   4m|         0  0.0         0  0.0 |         0  0.0         0  0.0 | 0.0
- - - - - + - - - - - - - - - - - - - - - + - - - - - - - - - - - - - - - + - -
  0 k   8k|         0  0.0         0  0.0 |       146  0.0         3  0.0 | 0.0
  0 k  64k|         0  0.0         0  0.0 |         0  0.0         0  0.0 | 0.0
  0 k 512k|         0  0.0         0  0.0 |         0  0.0         0  0.0 | 0.0
  0 k   4m|         0  0.0         0  0.0 |         0  0.0         0  0.0 | 0.0
==========+===============================+===============================+====
      ttl |       619  0.0         0  0.0 |      4137  0.0       300  0.0 | 0.0

The last (ttl) line gives the overall statistics for all the CPUs. If you are seeing around 20% or more of the time (%tim) spent servicing TLB/TSB misses, then it probably makes sense to revisit the page sizing for your instance.
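
trapstat also accepts an interval and a count, which is more useful than a single snapshot. A minimal example (run as root during a busy period; a 10-second interval with six samples is just a suggestion):

trapstat -T 10 6

Judge the %tim figures from the ttl line across several samples rather than from one reading.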

Page faults can be observed through vmstat (minor), vmstat -s (major and minor) and kstat (major and minor). The statistics from vmstat -s and kstat (reported per CPU) are cumulative in nature.
mkrishna@OCPD:> vmstat 1
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr s0 s1 s6 s9   in   sy   cs us sy id
 0 0 0 41966192 34216352 0 5063 0 0 0 0 0 0 0 0 0 559 794 811 7 1 92
 0 0 0 41925744 34175896 0 4995 0 0 0 0 0 0 0 0 0 517 767 745 7 1 92

mkrishna@OCPD:> vmstat -s
         0 micro (hat) faults
2377024933 minor (as) faults
  16504390 major faults

mkrishna@OCPD:> kstat |egrep 'as_fault|maj_fault'
        as_fault                        142199182
        maj_fault                       984358
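
Since these counters are cumulative, the rate of change matters more than the absolute values. A minimal sketch (ksh/bash) that differences two vmstat -s snapshots taken 60 seconds apart, using the "major faults" line shown above:

m1=$(vmstat -s | awk '/major faults/ {print $1}')
sleep 60
m2=$(vmstat -s | awk '/major faults/ {print $1}')
echo "major faults in the last 60 seconds: $((m2 - m1))"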

A dTSB miss results in a search of the HPT for the relevant HME. If the entry is not found in the HPT, the result is a page fault. So perhaps a percentage of the time spent on dTSB misses can be assumed to be spent servicing page faults (minor and major)? I do not know for sure and could not find out from Sun either.

Since there will always be a page fault when a virtual memory page is accessed for the first time, page faults cannot be eliminated completely.

By definition, major page faults are bad; minor page faults are better than major page faults but should still be avoided. Ideally, minor faults should far outnumber major faults. In well-configured environments I have seen the ratio of major to minor faults at less than 0.5%.
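
That ratio can be computed directly from the cumulative vmstat -s counters; a rough sketch (the patterns match the output lines shown earlier):

vmstat -s | awk '/minor \(as\) faults/ {minor=$1} /major faults/ {major=$1} END {printf("major/minor = %.3f%%\n", major * 100 / minor)}'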

Major faults can occur when there is a memory shortage and heavy page-out/swap-out activity. I have also seen a higher number of major faults when there is extensive file system data caching or double buffering happening on Oracle databases.

How do I reduce TLB/TSB misses and Page Faults from an Oracle perspective?

Theoretically, to reduce the incidence of TLB/TSB misses and page faults, one would use larger pages to reduce the number of entries required to map a segment, and an optimally sized TSB to prevent TSB misses (the TLB being fixed in size). This assumes that you have configured the instance correctly to fit within the available physical memory. The following is a practical way to implement it.

1. Reduce thread migrations (Harden affinity to CPUs) - Thread affinity ensures a thread is executed on the same CPU as before, which improves the chances that the entries for the running thread are already present in that CPU's TLB. Thread migrations can be seen using mpstat (migr column). Thread affinity is tuned with the system parameter rechoose_interval (the sketch after this list shows the /etc/system entry). The default value of rechoose_interval is 3; for a data warehouse system, I normally set it to 150.

2. Oracle Shared Memory - Oracle uses shared memory (SGA) and private anonymous memory (PGA). On Solaris, Oracle uses ISM for shared memory. ISM, among other benefits, enables the use of 4M pages, so it already uses the biggest page size available on the UltraSparc IV+ platform running Solaris 10. Also, for processes sharing the same segments, the TSB is shared. So by default, when using ISM for the SGA, Oracle is already well optimized for minimal TLB/TSB misses. On the cool threads platform (Solaris 10), a mix of 256M and 4M page sizes is used for ISM segments, which is even better.

3. Oracle PGA - For the PGA, or private memory, the page size is controlled by the parameter _realfree_heap_pagesize_hint (10g). The default value is 64K, so it should use a 64K page size; however, that does not seem to be the case. I have observed that when set at 64K it uses 8K pages only, whereas setting it to 512K or 4M does indeed change the page size used for the PGA to 512K or 4M. Setting this parameter causes memory to be allocated in _realfree_heap_pagesize_hint-sized chunks (64K, 512K, 4M), which can potentially waste memory and starve other sessions/applications of physical memory. Setting it to 512K/4M also reduces your page faults considerably.

4. TSB Configuration - Increase the default startup TSB size (Solaris 10) to prevent TSB misses. One entry in the TSB requires 16 bytes, so depending on your memory allocation to the SGA and PGA, you can set the default TSB size accordingly. Each process can have up to two TSBs, with one of them dedicated to servicing 4M page entries. There are several configuration parameters that can be set in /etc/system (see the sketch after this list).

a. default_tsb_size – The default value is 0 (8KB); an 8KB TSB holds 512 entries. For Oracle, you have to consider both PGA and SGA usage. Let us assume you have configured 12GB for your SGA (using ISM with 4M pages as the default) and 6GB of PGA (using a 4M page size). 12GB of SGA would require 3072 entries, or a 48KB TSB. 6GB of PGA would result in a global memory bound of ~700MB for serial operations (175 pages of 4M each) or ~2100MB for parallel operations (525 pages of 4M each). So in this case a default TSB size of 8KB would be too small and would get resized frequently; a default of 32KB (default_tsb_size = 2), which can then grow as needed (to a maximum of 1MB), would be preferable. The drawback of a bigger default size is that it consumes physical memory, although total TSB usage is capped by tsb_alloc_hiwater_factor.

b. tsb_alloc_hiwater_factor – Default is 32. This setting ensures that total TSB usage on the system for user processes does not exceed 1/32 of physical memory. So if you have 32GB of memory, then total TSB usage will be capped at 1GB. If you have lots of memory to spare and expect a high number of long-lived sessions connecting to the instance, then this can be reduced.
c. tsb_rss_factor – Default is 384. The value of tsb_rss_factor/512 is the TSB utilization threshold beyond which the TSB is resized; the default setting is 75% (384/512). It probably makes sense to reduce this to 308 so that the TSB is resized at 60% utilization.

d. tsb_sectsb_threshold – In Solaris 10, each process can have up to two TSBs: one for 8K, 64K and 512K pages and one for 4M pages. This setting controls the number of 4M mappings a process must have before the second TSB for 4M pages is initialized. It varies by CPU; for UltraSparc IV, the default is 8 pages.

5. To reduce page faults from user sessions, change _realfree_heap_pagesize_hint from 64K to either 512K or 4M. Also use ODM or direct I/O, and avoid file system buffering for Oracle data files.

6. Also ensure that the memory requirements of Oracle can be met entirely within physical memory.
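
Pulling the settings from this list together, below is a minimal sketch of the /etc/system entries. The values are taken from the examples above (a data warehouse workload with a 12GB SGA and 6GB PGA), not general recommendations, and the exact syntax should be verified against your Solaris 10 release; a reboot is required for /etc/system changes to take effect.

* harden thread affinity (item 1)
set rechoose_interval = 150
* start each user process TSB at 32KB instead of 8KB (item 4a)
set default_tsb_size = 2
* cap total TSB usage at 1/16 of physical memory instead of 1/32 (item 4b)
set tsb_alloc_hiwater_factor = 16
* resize a TSB at ~60% utilization instead of 75% (item 4c)
set tsb_rss_factor = 308

On the Oracle side, _realfree_heap_pagesize_hint (items 3 and 5) is set in the spfile like any other hidden parameter and takes effect at the next instance restart. After the changes, running pmap -xs against an Oracle server process shows a Pgsz column, which lets you confirm the page sizes the SGA and PGA allocations actually ended up on.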
