
NUMA & Oracle

Yuri Pudovchenko, Alexey Selin


Contents

• SMP, MPP, NUMA


• NUMA & HW
• NUMA & SW
• NUMA & VM
• NUMA & Oracle
• NUMA & Exadata
• NUMA & Troubleshooting
• Recommendations
Symmetric MultiProcessing (SMP)
http://www.intel.com/content/www/us/en/chipsets/5400-chipset-memory-controller-hub-datasheet.html
Theory

• SMP – Symmetric MultiProcessing (UMA)
• NUMA – Non-Uniform Memory Access
• MPP – Massively Parallel Processing
Past – Present – Future

Gulftown
•NUMA & HW
NUMA = Non-Uniform Memory Access
http://software.intel.com/en-us/articles/intel-64-architecture-processor-topology-enumeration/

• All cores have access to all RAM


Exadata X2-2 server architecture
http://www.oracle.com/technetwork/articles/systems-hardware-architecture/sfx4170m2-x4270m2arch-163860.pdf
Exadata X2-8
http://www.oracle.com/technetwork/articles/systems-hardware-architecture/sf4800g5-architecture-163848.pdf
Exadata X2-8
Exadata X2-8
Sun Fire X4800
•NUMA & SW
NUMA-principles

• AFFINITY: Processor and memory binding

• NUMA Policy API


node distances:
node 0 1 2 3
0: 10 16 16 22
1: 16 10 22 16
2: 16 22 10 16
3: 22 16 16 10
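
The distance table can also be read programmatically. The following is a minimal Python sketch, assuming the four-node matrix from this slide is pasted in as text; `parse_distances` and `remote_penalty` are illustrative names, not part of any NUMA library. Distances are relative values in which 10 denotes local access.

```python
def parse_distances(text):
    """Parse the 'node distances' rows printed by `numactl --hardware`."""
    matrix = {}
    for line in text.strip().splitlines():
        head, _, rest = line.partition(":")
        # Data rows look like "0: 10 16 16 22"; the header row has no colon.
        if rest and head.strip().isdigit():
            matrix[int(head)] = [int(v) for v in rest.split()]
    return matrix


def remote_penalty(matrix, src):
    """Return {dst: distance relative to local access} for every other node."""
    local = matrix[src][src]
    return {dst: d / local for dst, d in enumerate(matrix[src]) if dst != src}


# Sample copied from the slide above.
SAMPLE = """
node 0 1 2 3
0: 10 16 16 22
1: 16 10 22 16
2: 16 22 10 16
3: 22 16 16 10
"""

distances = parse_distances(SAMPLE)
print(remote_penalty(distances, 0))
```

For node 0 this reports nodes 1 and 2 as one hop away and node 3 as the most distant node.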
NUMA Policies
http://manpages.ubuntu.com/manpages/intrepid/man3/numa.3.html

• The NUMA API currently supports four policies:
• default = Allocate on the local node.
• bind = Allocate on a specific set of nodes.
• interleave = Interleave memory allocations across a set of nodes.
• preferred = Try to allocate on a specified node first, falling back to other nodes.

• Two types of Policy


• per process,
• per memory region

• Linux: From 2003, kernel 2.5, RHEL 4, SUSE 9


OS ready?
• # rpm -qa | grep numa
numactl-devel-0.9.8-11.el5
numactl-0.9.8-11.el5

• # ls -l /usr/lib64/libnuma.so
lrwxrwxrwx 1 root root 12 Jul 7 15:43 /usr/lib64/libnuma.so

• numactl, numastat, numademo


Is NUMA working?
•#/proc/…/node*/numastat
•# numastat
• node1 node0
•numa_hit 729789959 2359365014
•numa_miss 670076216 180873526
•numa_foreign 180873526 670076216
•interleave_hit 1194727 846662
•local_node 728648241 2358571294
•other_node 671217934 181667246
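
One way to read these counters is as a per-node ratio of intended to unintended allocations. The sketch below is a rough illustration only, using the numbers from this slide; `local_hit_ratio` is a hypothetical helper, and it assumes `numa_hit` counts allocations that landed on the node they were intended for while `numa_miss` counts allocations that ended up on this node although another node was preferred.

```python
def local_hit_ratio(numa_hit, numa_miss):
    """Fraction of allocations on a node that were actually intended for it."""
    return numa_hit / (numa_hit + numa_miss)


# Counters copied from the numastat output above.
counters = {
    "node1": {"numa_hit": 729789959, "numa_miss": 670076216},
    "node0": {"numa_hit": 2359365014, "numa_miss": 180873526},
}

ratios = {node: local_hit_ratio(**c) for node, c in counters.items()}
for node, ratio in ratios.items():
    print(f"{node}: {ratio:.1%} of allocations were intended for this node")
```

A low ratio on one node, as for node1 here, suggests that a large share of its memory is being used by allocations that overflowed from the other node.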
Example
• # numactl --hardware
available: 2 nodes (0-1)
node 0 size: 12278 MB
node 0 free: 113 MB
node 1 size: 12288 MB
node 1 free: 13 MB
node distances:
node 0 1
0: 10 20
1: 20 10

node distances (4-node system):
node 0 1 2 3
0: 10 16 16 22
1: 16 10 22 16
2: 16 22 10 16
3: 22 16 16 10
How to see NUMA properties?
• taskset (linux)
–# ps -ef|grep lgwr
–oracle 9883 1 0 Oct06 ? 00:05:59 ora_lgwr_orcl
–# taskset -c -p 9883
–pid 9883's current affinity list: 6-11,18-23

• /proc/<pid>/numa_maps - contains information
about each memory area used by a given
process allowing the determination of which
nodes were used for the pages.
• N<node>=<pages> - The number of pages used
on a node.
/proc/$PID/numa_maps
• # ps -ef|grep lgwr
• oracle 27792 1 0 Aug12 ? 00:57:39 ora_lgwr_ora11g

• cat /proc/27792/numa_maps
00400000 default file=/ora11g/ora11g/db_1/bin/oracle mapped=3156 mapmax=260 N0=127 N1=3029
0b602000 default file=/ora11g/ora11g/db_1/bin/oracle anon=18 dirty=18 mapped=66 mapmax=260 N1=66
0b7c9000 default anon=49 dirty=49 active=48 N0=1 N1=48
0bb50000 default heap anon=68 dirty=68 active=59 N0=9 N1=59
60000000 interleave:0-1 file=/SYSV8759fb1c\040(deleted) huge dirty=567 N0=287 N1=280
http://jcole.us/blog/files/numa-maps-summary.pl

N0 : 834 ( 0.00 GB)


N1 : 4173 ( 0.02 GB)
active : 137 ( 0.00 GB)
anon : 756 ( 0.00 GB)
dirty : 1323 ( 0.01 GB)
mapmax : 7148 ( 0.03 GB)
mapped : 3704 ( 0.01 GB)
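
The core of such a summary is simply adding up the `N<node>=<pages>` fields. Below is a minimal Python sketch of that idea, not a port of the Perl script; `pages_per_node` is an illustrative name, and the sample contains only the five `numa_maps` lines shown on the previous slide, so its totals are smaller than the full-process summary above.

```python
import re
from collections import Counter

# Matches fields such as "N0=127" or "N1=3029".
NODE_FIELD = re.compile(r"\bN(\d+)=(\d+)\b")


def pages_per_node(numa_maps_text):
    """Sum the N<node>=<pages> counters over all mappings."""
    totals = Counter()
    for node, pages in NODE_FIELD.findall(numa_maps_text):
        totals[int(node)] += int(pages)
    return dict(totals)


# The five mappings quoted on the numa_maps slide.
SAMPLE = """\
00400000 default file=/ora11g/ora11g/db_1/bin/oracle mapped=3156 mapmax=260 N0=127 N1=3029
0b602000 default file=/ora11g/ora11g/db_1/bin/oracle anon=18 dirty=18 mapped=66 mapmax=260 N1=66
0b7c9000 default anon=49 dirty=49 active=48 N0=1 N1=48
0bb50000 default heap anon=68 dirty=68 active=59 N0=9 N1=59
60000000 interleave:0-1 file=/SYSV8759fb1c\\040(deleted) huge dirty=567 N0=287 N1=280
"""

totals = pages_per_node(SAMPLE)
print(totals)
```

On a live system the text would come from reading `/proc/<pid>/numa_maps`. Mappings marked `huge` are likely counted in huge-page units, so a plain sum mixes page sizes and is only a rough indicator.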
Difference measuring tool
• numademo – a Linux tool for measuring NUMA imbalance and memory performance
# numactl --cpunodebind=0 numademo 256m memset

2 nodes available
memory with no policy memset Avg 7353.27 MB/s Min 7401.85 MB/s Max 7295.63 MB/s
local memory memset Avg 7371.89 MB/s Min 7427.66 MB/s Max 7345.14 MB/s
memory interleaved on all nodes memset Avg 6934.56 MB/s Min 6966.20 MB/s Max 6864.65 MB/s
memory on node 0 memset Avg 8977.47 MB/s Min 9038.23 MB/s Max 8860.43 MB/s
memory on node 1 memset Avg 5788.59 MB/s Min 5828.20 MB/s Max 5731.39 MB/s
memory interleaved on 0 1 memset Avg 6870.66 MB/s Min 6968.73 MB/s Max 6552.32 MB/s
setting preferred node to 0
memory without policy memset Avg 7222.30 MB/s Min 7358.43 MB/s Max 7071.53 MB/s
setting preferred node to 1
memory without policy memset Avg 5724.52 MB/s Min 5779.02 MB/s Max 5651.27 MB/s
manual interleaving to all nodes memset Avg 6660.17 MB/s Min 6859.39 MB/s Max 6328.94 MB/s
manual interleaving on node 0/1 memset Avg 6672.88 MB/s Min 6938.47 MB/s Max 6320.29 MB/s
Difference Measuring Tool
                                  Min    Max    Avg
numactl --cpunodebind=0 numademo 256m memset
memory on node 0 memset           8860   9038   8977
memory on node 1 memset           5731   5828   5788
Ratio                             1.55   1.55   1.55

numactl --cpunodebind=1 numademo 256m memset
memory on node 0 memset           5589   5632   5609
memory on node 1 memset           8251   8521   8390
Ratio                             1.48   1.51   1.50
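
The ratio rows are local throughput divided by remote throughput. A small Python sketch of the arithmetic follows, using the rounded numademo figures in MB/s (lowest, highest and average throughput per run); `local_remote_ratio` is a hypothetical helper name.

```python
def local_remote_ratio(local_mb_s, remote_mb_s):
    """Local-to-remote memset throughput ratio, rounded to two decimals."""
    return round(local_mb_s / remote_mb_s, 2)


# (lowest, highest, average) throughput in MB/s, rounded from the numademo runs.
bound_to_node0 = {"local": (8860, 9038, 8977), "remote": (5731, 5828, 5788)}
bound_to_node1 = {"local": (8251, 8521, 8390), "remote": (5589, 5632, 5609)}

for name, run in (("node 0", bound_to_node0), ("node 1", bound_to_node1)):
    ratios = [local_remote_ratio(l, r) for l, r in zip(run["local"], run["remote"])]
    print(f"CPU bound to {name}: {ratios}")
```

In both runs local memory is roughly 1.5 times faster than remote memory for this memset test.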


•NUMA & Oracle
NUMA & Oracle ages ago
http://kevinclosson.files.wordpress.com/2007/04/oracle8i.pdf

• Since 8.1.7 Oracle has had the parameter _enable_NUMA_optimization:
•bash-3.00$ sqlplus '/as sysdba';

SQL*Plus: Release 8.1.7.0.0 - Production on Fri Oct 7 14:31:19 2011

•PARAMETER SESSION_VALUE INSTANCE_VALUE


-------------------------- --------------- ---------------
_NUMA_pool_size Not specified Not specified
_enable_NUMA_optimization TRUE TRUE
_NUMA_instance_mapping Not specified Not specified
_db_block_numa 1 1
NUMA & 11g
•PARAMETER SESSION_VALUE INSTANCE_VALUE
•-------------------------- --------------- ---------------
•_NUMA_pool_size Not specified Not specified
•_enable_NUMA_optimization FALSE FALSE
•_enable_NUMA_support FALSE FALSE
•_enable_NUMA_interleave TRUE TRUE
•_NUMA_instance_mapping Not specified Not specified
•_numa_trace_level 0 0
•_rm_numa_simulation_pgs 0 0
•_rm_numa_simulation_cpus 0 0
•_rm_numa_sched_enable FALSE FALSE
•_db_block_numa 1 1
•_numa_buffer_cache_stats 0 0
Changes in init.ora

• 11gR2 changes:

_enable_NUMA_optimization is deprecated and replaced by

_enable_NUMA_support = TRUE|FALSE
NUMA & Oracle locality (MOS Notes 1053332.1, 399261.1)

• The goal of NUMA optimization is to localize memory access as much as possible. Oracle handles:
• SGA, Background processes, Shared, Parallel, DBRM
NUMA & Oracle locality
• Oracle handles:
• SGA cell = SGA / number of cells
• DBWR
• LGWR
• Shared Servers
• Parallel Slaves
• DBRM
• AMM

• OS handles: user processes


NUMA & 11g AMM
• AMM:
memory_target
memory_max_target

• 60% MT = NUMA(SGA)
• 40% MT = Interleaved (to grow SGA)

• On some platforms setting memory_target


disables NUMA optimization.
NUMA-aware Listener
• Multiple Listeners:
numactl --cpunodebind=1 lsnrctl start listener1
numactl --cpunodebind=2 lsnrctl start listener2

• One Listener:
numactl --interleave=1,2,3,4 lsnrctl start

• Services, DBs - consolidation


Example
• Starting ORACLE instance (normal)
• ***************** Huge Pages Information *****************
• Huge Pages memory pool detected (total: 6370 free: 6370)
• DFLT Huge Pages allocation successful (allocated: 224)
• NUMA Huge Pages allocation on node (1) (allocated: 2976)
• NUMA Huge Pages allocation on node (0) (allocated: 2960)
• DFLT Huge Pages allocation successful (allocated: 1)
• *********************************************************
• SGA Local memory support enabled
• …
• NUMA system found and support enabled (2 domains - 12,12)
Breaking SGA
• $ ipcs -m
• ------ Shared Memory Segments --------
• key shmid owner perms bytes nattch
• 0x00000000 23724037 oracle 660 469762048 256
• 0x00000000 23756806 oracle 660 6241124352 256
• 0x00000000 23789575 oracle 660 6207569920 256
• 0x8759fb1c 23822344 oracle 660 2097152 256

• 469762048/2048/1024 = 224 (allocated: 224)


• 6241124352/2048/1024 = 2976 (allocated: 2976)
• 6207569920/2048/1024 = 2960 (allocated: 2960)
• 2097152/2048/1024 = 1 (allocated: 1)
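
The same check can be scripted. This is a minimal Python sketch assuming 2 MB huge pages, as in the arithmetic above; `huge_pages` is an illustrative helper, and the byte counts are the `ipcs -m` values from this slide.

```python
HUGE_PAGE_BYTES = 2048 * 1024  # 2 MB huge pages (assumption matching the slide)


def huge_pages(segment_bytes, page_bytes=HUGE_PAGE_BYTES):
    """Convert a shared memory segment size into a whole number of huge pages."""
    pages, remainder = divmod(segment_bytes, page_bytes)
    if remainder:
        raise ValueError("segment is not a whole number of huge pages")
    return pages


# Segment sizes in bytes, copied from the ipcs -m output above.
segments = [469762048, 6241124352, 6207569920, 2097152]
pages = [huge_pages(b) for b in segments]
print(pages, "total:", sum(pages))
```

The per-segment results match the "allocated" figures in the alert log, and the total is the number of huge pages the instance needs from the pool.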
Background Processes
• # ps -ef|grep dbw
• oracle 9871 1 0 Oct06 ? 00:01:34 ora_dbw0_ora11g
• oracle 9875 1 0 Oct06 ? 00:02:12 ora_dbw1_ora11g

• oracle 9879 1 0 Oct06 ? 00:01:31 ora_dbw2_ora11g

• # taskset -c -p 9871
• pid 9871's current affinity list: 0-5,12-17
• # taskset -c -p 9875
• pid 9875's current affinity list: 6-11,18-23
• # taskset -c -p 9879
• pid 9879's current affinity list: 0-5,12-17
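
To compare affinity lists it helps to expand them into CPU sets. The following Python sketch parses the `taskset -c` list format; `parse_cpu_list` is a hypothetical helper, and the two lists are the ones reported for the DBWR processes above.

```python
def parse_cpu_list(spec):
    """Expand a taskset-style list such as '0-5,12-17' into a set of CPU ids."""
    cpus = set()
    for part in spec.split(","):
        first, _, last = part.partition("-")
        # A single CPU ("7") has no upper bound, so reuse the lower one.
        cpus.update(range(int(first), int(last or first) + 1))
    return cpus


dbw0 = parse_cpu_list("0-5,12-17")    # ora_dbw0 and ora_dbw2
dbw1 = parse_cpu_list("6-11,18-23")   # ora_dbw1

print(len(dbw0), len(dbw1), dbw0.isdisjoint(dbw1))
```

Each writer is confined to 12 CPUs and the two sets do not overlap, which is consistent with the "2 domains - 12,12" line in the startup log. On Linux, Python's `os.sched_getaffinity(pid)` returns the same information directly as a set.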
New pool – NUMA Pool
• SQL> select pool, sum(bytes) as BYTES
from v$sgastat group by pool;

• POOL BYTES
• ------------ --------------
• 10544926728
• java pool 201326592
• streams pool 67108864
• shared pool 1241513984
• large pool 268435456
• numa pool 469762048
New pool – NUMA Pool
• SQL> select * from v$sgainfo;
• NAME BYTES RES
• ------------------------------- ---------- ---
• Fixed SGA Size 2244616 No
• Redo Buffers 6590464 No
• …
• Granule Size 33554432 No
• Maximum SGA Size 1.2827E+10 No
• Startup overhead in Shared Pool 402653184 No
• Startup NUMA Shared Pool memory 469762048 No
select * from v$sgastat where pool='numa pool' order by bytes asc
• numa pool bloom filter 5552
• numa pool temporary table lock 5552
• numa pool analytic workspace 5808
• numa pool statement queuing 5808
• numa pool sort segment handle 11720
• numa pool Result Cache: State Objs 151160
• numa pool call 593848
• numa pool txncallback 1221648
• numa pool Temporary Tables State Ob 1627600
• numa pool constraints 1697376
• numa pool DML lock 8320896
• numa pool enqueue 13572112
• numa pool transaction 15084352
• numa pool private strands 78991360
select * from v$sgastat where pool='numa pool' order by bytes asc
• numa pool OS file lock 8944
• numa pool object queue hash buckets 3538944
• numa pool Checkpoint queue 4927488
• numa pool buffer handles 7840000
• numa pool file state object 8587488
• numa pool dirty object counts array 12582912
• numa pool dbwriter coalesce buffer 12595200
• numa pool write state object 15370752
• numa pool FileOpenBlock 63746784
• numa pool db_block_hash_buckets 93323264
Granule_size [947152.1]
RDBMS                                          SGA_MAX_SIZE (or memory_max_target)   GRANULE SIZE
9.2                                            <= 128 MB                             4 MB
                                               > 128 MB                              16 MB
10.2                                           <= 1 GB                               4 MB
                                               > 1 GB                                16 MB
11gR1                                          <= 1 GB                               4 MB
                                               > 1 GB  <= 4 GB                       16 MB
                                               > 4 GB  <= 16 GB                      64 MB
                                               > 16 GB <= 64 GB                      256 MB
                                               > 64 GB                               512 MB
11gR2 (and 11gR1 with patch 8813366 applied)   <= 1 GB                               4 MB
                                               > 1 GB  <= 8 GB                       16 MB
                                               > 8 GB  <= 16 GB                      32 MB
                                               > 16 GB <= 32 GB                      64 MB
                                               > 32 GB <= 64 GB                      128 MB
                                               > 64 GB <= 128 GB                     256 MB
                                               > 128 GB                              512 MB
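
The 11gR2 rows of the table can be expressed as a simple lookup. This is an illustrative Python sketch only; `granule_size_11gr2` is a hypothetical function that encodes the thresholds listed above and does not query the database.

```python
GB = 1024 ** 3
MB = 1024 ** 2

# (upper bound of SGA_MAX_SIZE, granule size) pairs for 11gR2, from the table above.
GRANULES_11GR2 = [
    (1 * GB, 4 * MB),
    (8 * GB, 16 * MB),
    (16 * GB, 32 * MB),
    (32 * GB, 64 * MB),
    (64 * GB, 128 * MB),
    (128 * GB, 256 * MB),
]


def granule_size_11gr2(sga_bytes):
    """Return the granule size in bytes for a given SGA_MAX_SIZE."""
    for limit, granule in GRANULES_11GR2:
        if sga_bytes <= limit:
            return granule
    return 512 * MB  # anything above 128 GB


# Maximum SGA Size from the v$sgainfo slide, roughly 1.2827E+10 bytes.
print(granule_size_11gr2(12_827_000_000) // MB, "MB")
```

For the instance shown earlier this gives 32 MB, which agrees with the Granule Size of 33554432 bytes reported by v$sgainfo.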
Error Example
• ***************** Huge Pages Information ****************
• Huge Pages memory pool detected (total: 6146 free: 6146)
• DFLT Huge Pages allocation successful (allocated: 224)
• NUMA Huge Pages allocation on node (1) (allocated: 2976)
• Huge Pages allocation failed (free: 2946 required: 2960)
• ******************************************************
• NUMA Huge Pages allocation on node (0) (allocated: 1488)
• Huge Pages allocation failed (free: 1458 required: 1472)
• ******************************************************
• NUMA Huge Pages allocation on node (0) (allocated: 736)
• NUMA Huge Pages allocation on node (0) (allocated: 368)
• NUMA Huge Pages allocation on node (0) (allocated: 192)
• NUMA Huge Pages allocation on node (0) (allocated: 96)
• NUMA Huge Pages allocation on node (0) (allocated: 48)
• NUMA Huge Pages allocation on node (0) (allocated: 16)

• 16+48+96+192+368+736= 1456
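
The failure follows from simple bookkeeping on the huge page pool. The Python sketch below replays the first three requests from the log against a pool of 6146 pages; it models only the arithmetic visible in the alert log, not Oracle's actual allocator, and `allocate` is a hypothetical helper.

```python
def allocate(pool_free, requests):
    """Replay requests in order; return (label, required, free_before, ok) tuples."""
    log = []
    for label, required in requests:
        ok = required <= pool_free
        log.append((label, required, pool_free, ok))
        if ok:
            pool_free -= required
    return log


# Page counts taken from the successful startup shown earlier.
requests = [("DFLT", 224), ("node 1", 2976), ("node 0", 2960)]

for row in allocate(6146, requests):
    print(row)
```

With 6146 pages the node 0 request finds only 2946 free pages against 2960 required, as in the log; with the 6370 pages of the earlier example every request fits.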
How to disable NUMA
• For versions 10.1-11.1 (MOS Note 761065.1):
«Apply the Patch 8199533 to disable NUMA.»
•NUMA & VM
NUMA & VM
• NUMA awareness does not work in guest systems:

ORA-600: internal error code, arguments: [kskreconfignuma2], [kskcpucc], [], [], [], [], [], []
ORA-27300: OS system dependent operation:mpctl_ldomspus failed with status: 22
ORA-27301: OS failure message: Invalid argument
ORA-27302: failure occurred at: skgsnnprocs
Restrictions
• Oracle depends on its startup environment
• Best practice: a static environment
• Oracle doesn't allocate shared memory in a new cell
• Oracle needs to be isolated from cell reconfigurations
NUMA & RAC
• One cell = One RAC node:
• Allows cells to be removed or added on demand
•NUMA & Exadata
Exadata X2-2
PARAMETER SESSION_VALUE INSTANCE_VALUE
-------------------------- --------------- ---------------
_NUMA_pool_size Not specified Not specified
_enable_NUMA_optimization FALSE FALSE
_enable_NUMA_support FALSE FALSE
_enable_NUMA_interleave TRUE TRUE
_NUMA_instance_mapping Not specified Not specified
_numa_trace_level 0 0
_rm_numa_simulation_pgs 0 0
_rm_numa_simulation_cpus 0 0
_rm_numa_sched_enable FALSE FALSE
_db_block_numa 1 1
_numa_buffer_cache_stats 0 0
Exadata X2-2
• # uname -a
Linux edbmf01db01.fors.ru 2.6.18-194.3.1.0.4.el5 #1 SMP Sat Feb 19 03:38:37 EST 2011 x86_64 x86_64 x86_64 GNU/Linux

On a similar server where NUMA actually works:

Linux ora1.telecom.net 2.6.32-100.26.2.el5 #1 SMP Tue Jan 18 20:11:49 EST 2011 x86_64 x86_64 x86_64 GNU/Linux

• # rpm -qa|grep -i numa


numactl-devel-0.9.8-11.el5
numactl-0.9.8-11.el5

# numactl --hardware
available: 1 nodes (0)
node 0 size: 96917 MB
node 0 free: 535 MB
No distance information available

# numastat
                node0
numa_hit        29017048616
numa_miss       0
numa_foreign    0
interleave_hit  442682
local_node      29017048616
other_node      0
Exadata X2-2
# cat grub.conf

kernel /vmlinuz-2.6.18-194.3.1.0.3.el5 … numa=off
«Disable NUMA on database servers to improve performance of Linux file system utilities» (MOS Note 1053332.1)

• X2-8 Database servers should have NUMA on

• X2-2 Database servers should have NUMA off


Oracle Sun Database Machine X2-2
Setup/Configuration Best Practices [1274318.1]
• X2-2 Database servers in Oracle Exadata Database Machine by default are booted with operating
system NUMA support enabled. Commands that manipulate large files without using direct I/O
on ext3 file systems will cause low memory conditions on the NUMA node (Xeon 5500
processor) currently running the process.
• By turning NUMA off, a potential local node low memory condition and subsequent
performance drop is avoided.
• X2-8 Database servers should have NUMA on
• The impact of turning NUMA off is minimal.
• Once local node memory is depleted, system performance as a whole will be severely impacted.

• Follow the instructions in MOS Note 1053332.1 to turn NUMA off in the kernel for database
servers.

• NUMA is configured to be off in the storage servers and should not be changed.
NUMA to Oracle path
• HW is NUMA-aware
• BIOS
• Kernel Loader
# cat grub.conf
kernel /vmlinuz-2.6.18-194.3.1.0.3.el5 … numa=off …
# dmesg | grep -i numa
NUMA turned off

• RPM numactl installed ?


• numactl, numastat

• Oracle: _enable_NUMA_support = TRUE
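
The kernel-loader step of this path can be checked from a script. A minimal Python sketch follows, assuming the boot parameters are available as a string (on Linux they can be read from `/proc/cmdline`); `numa_disabled` and the sample command lines are hypothetical.

```python
def numa_disabled(cmdline):
    """Return True if the kernel command line contains the numa=off parameter."""
    return "numa=off" in cmdline.split()


# Hypothetical command lines for illustration; real ones come from /proc/cmdline.
with_numa_off = "ro root=LABEL=/ numa=off"
default_boot = "ro root=LABEL=/ quiet"

print(numa_disabled(with_numa_off), numa_disabled(default_boot))
```

If this check reports that NUMA is off, the later steps (numactl, numastat, the Oracle parameter) cannot take effect, since the operating system then presents a single node.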
Side effect
• Many shared memory segments
• More swapping
• Out-of-Memory errors
http://www.pythian.com/news/1324/oracle-performance-issue-high-kernel-mode-cpu-usage/

• The solution
• Ten shared memory segments were created in
order to exploit NUMA technology. NUMA is an
excellent technology and it is a pity that we are
suffering a side-effect of NUMA. We need to
resolve this issue and we have a handful of
options at this point. We can:
• disable NUMA completely, by setting the
_enable_numa_optimization to false.
• reduce NUMA nodes using _db_block_numa.
(Interestingly, this throws an ORA-600 error
during startup.)
Recommendations
• For critical applications (Clusterware, ASM)
• For processes that are on CPU about 100% of the time
• For very unbalanced/uneven memory access

• Oracle: for processes that handle a lot of data in the PGA: in-memory sorts, joins.

• X2-8 consolidation
Reference
• The Oracle Database on HP Integrity servers, http://h20195.www2.hp.com/v2/GetPDF.aspx/4AA2-0547ENW.pdf
• Christoph Lameter, «Local and Remote Memory: Memory in a Linux/NUMA System»
• Oracle8i for NUMA-Q 2000, http://kevinclosson.wordpress.com/kevin-closson-index/oracle-on-opteron-k8l-numa-etc/
• http://www.pythian.com/news/1324/oracle-performance-issue-high-kernel-mode-cpu-usage/
• http://jcole.us/blog
Questions and Answers

ypudovchenko@fors.ru
NUMA Pool content
POOL         NAME                         BYTES
------------ -------------------------- ----------
numa pool    bloom filter                 5552
numa pool    cp connection                5280
numa pool    property service SO          8624
numa pool    constraints                  1697376
numa pool    ASM scan context             7088
numa pool    ASM kfk state object         5552
numa pool    name-service entry           7088
numa pool    temporary table lock         5552
numa pool    Temporary Tables State Ob    1627600
numa pool    branch                       2254944
numa pool    ksunfy : array of SSO fre    4032
numa pool    fencing reid                 18288
numa pool    KFD extent enqueue obj       43120
numa pool    FileOpenBlock                63746784
numa pool    locator state object         810384
numa pool    sched job queue              6064
numa pool    object queue hash buckets    3538944
numa pool    name-service recovery        6064
numa pool    FileIdentificatonBlock       745840
numa pool    call                         593848
numa pool    Online Datafile Move sess    4784
numa pool    private strands              78991360
numa pool    analytic workspace           5808
numa pool    ASM rollback operations      5296
numa pool    block media rcv state obj    5040
numa pool    change tracking state cha    5168
numa pool    ASM generic network state    5168
numa pool    dbwriter coalesce buffer     12595200
numa pool    dbwr actual working sets     576
numa pool    txncallback                  1221648
numa pool    invalid low rba queue        12288
numa pool    kqf runtime defined table    458448
numa pool    ksunfy : SSO free list       64373760
numa pool    sort segment handle          11720
numa pool    channel handle               1388688
numa pool    sched job slv                14320
numa pool    object queue                 1580800
numa pool    dbwriter coalesce bitmap     768
numa pool    KFM state obj                5808
numa pool    temp lob duration state o    7856
numa pool    resumable                    7344
numa pool    ksir PrivOp State Object     4784
numa pool    ASM KFFD SO                  5296
numa pool    per_pg_set_descriptor_arr    344064
numa pool    reservation state object     6112
numa pool    KFG state obj                7344
numa pool    kss unit test                7856
numa pool    dirty object counts array    12582912
numa pool    ksz parent                   50976
numa pool    KFG SO child                 8496
numa pool    free memory                  37078712
numa pool    transaction                  15084352
numa pool    procs: ksunfy                14775552
numa pool    DML lock                     8320896
numa pool    media recovery state obje    14176
numa pool    Result Cache: State Objs     151160
numa pool    KSFQ buffer pool             6064
numa pool    ksbxic obj                   655568
numa pool    file state object            8587488
numa pool    write state object           15370752
numa pool    Checkpoint queue             4927488
numa pool    statement queuing            5808
numa pool    object queue hash table d    96768
numa pool    ASM map operations           5808
numa pool    name-service request         4784
numa pool    dummy                        814096
numa pool    quiescing session            7856
numa pool    enqueue                      13572112
numa pool    ksv slave                    13872
numa pool    osp allocation               146552
numa pool    temporary foreign ref        7472
numa pool    ksir State Object            6064
numa pool    ksv reaper                   13296
numa pool    buffer handles               7840000
numa pool    OS proc request holder       64784
numa pool    OS file lock                 8944
numa pool    db_block_hash_buckets        93323264
numa pool    DBWR event stats array       648
