Tuning your Red Hat System for Databases
Sanjay Rao, Principal Software Engineer, Red Hat. May 06, 2011
Objectives of this session
Share tuning tips
- Bare metal: aspects of tuning, tuning parameters, results of the tuning
- Virtual machines (RHEL)
Tools
Reading Graphs
Green arrow shows direction of best results
What To Know About Tuning
- Proactive or reactive
- Understand the trade-offs
- No silver bullet
- You get what you pay for
What To Tune
- I/O, Memory, CPU, Network
- This session will cover I/O and Memory extensively
I/O Tuning - Hardware
Know Your Storage
- SAS or SATA? Fibre Channel, Ethernet or SSD?
- Bandwidth limits
Multiple HBAs
- Device-mapper multipath: provides multipathing capabilities and LUN persistence
How To
- Low-level I/O tools: dd, iozone, dt, etc.
- Make the test I/O representative of the database implementation (see the sketch below)
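For example, a rough low-level baseline run might look like the following; the device and file paths are placeholders, and the 8K record size is chosen to mimic a typical database block size:

    # WARNING: the dd run overwrites the target device; use only a scratch LUN
    dd if=/dev/zero of=/dev/mapper/mpathb bs=8k count=100000 oflag=direct   # O_DIRECT write test
    # iozone with O_DIRECT (-I), 8K records, sequential write and random read/write passes
    iozone -I -r 8k -s 1g -i 0 -i 2 -f /mnt/dbtest/iozone.tmp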
I/O Tuning Understanding I/O Elevators
Deadline
- Two queues per device, one for reads and one for writes
- I/Os dispatched based on time spent in the queue
CFQ
- Per-process queues
- Each process queue gets a fixed time slice (based on process priority)
Noop
- FIFO
- Simple I/O merging
- Lowest CPU cost
I/O Tuning Configuring I/O Elevators
Boot-time
- Grub command line: elevator=deadline|cfq|noop
Dynamically, per device
- echo deadline > /sys/class/block/sda/queue/scheduler (see the example below)
ktune service (RHEL 5) / tuned (RHEL 6 utility)
- tuned-adm profile throughput-performance
- tuned-adm profile enterprise-storage
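For example, to check and switch the elevator at runtime on a single device (sda here; the active scheduler is shown in brackets):

    cat /sys/class/block/sda/queue/scheduler             # e.g. noop deadline [cfq]
    echo deadline > /sys/class/block/sda/queue/scheduler
    cat /sys/class/block/sda/queue/scheduler             # now shows [deadline]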
CFQ vs Deadline 1 thread per device (4 devices)
Comparison CFQ vs Deadline
1 thread per multipath device (4 devices)
[Chart: throughput for CFQ vs Deadline on multipath devices (series: CFQ MP dev, Deadline MP dev, % diff), across 8K-64K sequential and random reads and writes]
CFQ vs Deadline 4 threads per device (4 devices)
Comparison CFQ vs Deadline
4 threads per multipath device (4 devices)
[Chart: throughput for CFQ vs Deadline with 4 threads per multipath device (series: CFQ, Deadline, % diff), across 8K-64K sequential and random reads and writes]
I/O Tuning Elevators OLTP - Sybase
Sybase IO scheduler testing - RHEL 5.5
OLTP transactional throughput on a quad-core, 4-socket, 2.5 GHz system with 96G physical memory
[Bar chart: OLTP transactions/min with the Deadline, CFQ, and NOOP elevators, roughly 162K-165K each]
Impact of I/O Elevator - OLTP Workload
[Chart: transactions/min vs number of users (10U-60U) for CFQ, Deadline, and Noop]
DSS Workload
Comparison CFQ vs Deadline
Oracle DSS Workload (with different thread count)
[Chart: elapsed time and % difference for CFQ vs Deadline at parallel degrees 16 and 32; % difference roughly 48-58%]
I/O Tuning - File Systems
Direct I/O
- Avoid double caching
- Predictable performance
- Reduce CPU overhead
Asynchronous I/O
- Eliminate synchronous I/O stall
- Critical for I/O intensive applications
Configure read ahead
- Database (parameters to configure read ahead)
- Block devices (blockdev --getra / --setra)
Turn off I/O barriers (RHEL6 and enterprise storage only; see the sketch below)
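A minimal sketch of the read-ahead and barrier settings; the device (sda) and mount point (/data) are placeholders, and nobarrier should only be used with a battery-backed or non-volatile write cache:

    blockdev --getra /dev/sda          # current read-ahead, in 512-byte sectors
    blockdev --setra 8192 /dev/sda     # raise read-ahead to 4 MB
    mount -o remount,nobarrier /data   # RHEL6 ext4/xfs: disable write barriers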
I/O Tuning Effect of Direct I/O, Asynch I/O
OLTP Workload - 4 Socket 2 cores - 16G mem
Mid-level Fibre channel storage
[Chart: trans/min for the Setall, DIO only, AIO only, and no AIO/DIO configurations]
I/O Tuning Database Layout
- Separate files by I/O type (data, logs, undo, temp)
- OLTP: data files / logs
- DSS: data files / temp files
- Use low-latency / high-bandwidth devices for hot spots (see below for a quick way to find them)
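One quick way to spot hot devices while the workload runs is extended iostat output (sysstat package); devices with consistently high await and %util are candidates for the low-latency storage:

    iostat -dxk 5        # extended per-device statistics every 5 seconds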
I/O Tuning OLTP Logs - Fusion-IO
OLTP workload - Logs on FC vs Fusion-IO
Single Instance
[Chart: trans/min at 10U, 40U, and 80U with logs on Fusion-io vs logs on FC; Fusion-io roughly 21-24% higher]
I/O Tuning Storage (OLTP database)
OLTP workload - Fibre channel vs Fusion-IO
4 database instances
[Chart: trans/min with 4G-FC vs Fusion-io storage]
I/O Tuning OLTP Database (vmstats)
4 database instances - Fibre channel
[vmstat excerpt: many blocked processes (b column), ~9-14% I/O wait, user CPU roughly 17-43%, writes (bo) ~136K-156K]
4 database instances - Solid State devices (PCI) - Fusion-io
[vmstat excerpt: near-zero I/O wait, run queue (r) in the 54-82 range, user CPU ~88-90%, writes (bo) ~267K-574K]
I/O Tuning DSS - Temp
DSS Workload - Sort-Merge table create - Time Metric - Smaller is better
[Chart: elapsed time for the sort-merge table create on 4G-FC vs Fusion-IO]
I/O Tuning RHEL 6 DSS Workload Sybase IQ 15.2
[Chart: Sybase IQ 15.2 DSS results with 2 FC arrays vs Fusion-io, across different measurement metrics]
What To Tune
- I/O, Memory, CPU, Network (this part focuses on Memory)
Memory Tuning
- Dense memory, based on architecture
- NUMA
- Huge Pages
Understanding NUMA (Non Uniform Memory Access)
- Multi-socket, multi-core architecture
- NUMA required for scaling
- RHEL 5 / 6 completely NUMA aware
- Additional performance gains by enforcing NUMA placement
How to enforce NUMA placement
- numactl: CPU and memory pinning
- taskset: CPU pinning
- cgroups (RHEL6 only)
- libvirt (KVM guests): CPU pinning
(see the sketch below)
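A minimal sketch of enforcing placement; the database start command and the PID are placeholders:

    numactl --hardware                                  # show node / CPU / memory topology
    numactl --cpunodebind=0 --membind=0 db_start_cmd    # keep CPUs and memory on node 0
    taskset -pc 0-7 12345                               # restrict a running PID to CPUs 0-7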
Memory Tuning Huge Pages
- 2M pages vs the standard 4K Linux page
- Virtual-to-physical page map is 512 times smaller
- TLB can map more physical pages, resulting in fewer misses
- Traditional Huge Pages are always pinned
- Transparent Huge Pages in RHEL6
- Most databases support Huge Pages
How to configure Huge Pages (16G)
- echo 8192 > /proc/sys/vm/nr_hugepages
- vi /etc/sysctl.conf (vm.nr_hugepages=8192)
(verification sketch below)
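The 8192 figure is simply 16 GB divided by the 2 MB huge page size; a quick verification after setting it:

    grep Huge /proc/meminfo     # HugePages_Total: 8192, Hugepagesize: 2048 kB
    sysctl vm.nr_hugepages      # confirm the value picked up from /etc/sysctl.conf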
Memory Tuning Huge Pages Sybase - OLTP
Sybase Huge Pages Testing - RHEL 5.5
OLTP transactional throughput on a quad-core, 4-socket, 2.5 GHz system with 96G physical memory
[Bar chart: OLTP transactions/min with the default 4K pages vs huge pages]
OLTP Workload Effect of NUMA and Huge Pages
OLTP workload - Multi Instance
[Chart: trans/min for Non-NUMA, NUMA, non-NUMA + Huge Pages, and NUMA + Huge Pages; % improvement over the Non-NUMA baseline roughly 8%, 12%, and 18%]
NUMA and Huge Pages
- Huge page allocation takes place uniformly across NUMA nodes
- Make sure that database shared segments are sized to fit
- Workaround: allocate huge pages / start DB / de-allocate huge pages (see the sketch below)
Example: 128G physical memory, 4 NUMA nodes
- Huge Pages 80G: 20G in each NUMA node
  - 24G DB shared segment using Huge Pages fits
  - 24G DB shared segment using NUMA and Huge Pages does not fit in a single node's 20G
- Huge Pages 100G: 25G in each NUMA node, so the 24G segment fits within one node
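A sketch of the workaround with 2 MB huge pages (80G = 40960 pages, 100G = 51200 pages); shrinking nr_hugepages afterwards only releases pages that are not in use, so the database keeps the pages it has already mapped:

    echo 51200 > /proc/sys/vm/nr_hugepages    # over-allocate so every node has enough
    # ... start the database instance here ...
    echo 40960 > /proc/sys/vm/nr_hugepages    # shrink the pool; in-use pages stay pinned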
Tuning Memory Flushing Caches
Drop unused cache
- Frees unused memory (file cache)
- If the DB uses the file cache, it may notice a slowdown
Free pagecache
- echo 1 > /proc/sys/vm/drop_caches
Free slabcache
- echo 2 > /proc/sys/vm/drop_caches
Free pagecache and slabcache
- echo 3 > /proc/sys/vm/drop_caches
(see the note below)
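A common sequence is to sync first so dirty pages are written back before the clean caches are dropped (run as root):

    sync
    echo 3 > /proc/sys/vm/drop_caches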
CPU Tuning
CPU performance
- Clock speed
- Multiple cores
- Power savings mode
cpuspeed governors: off, performance, ondemand, powersave
How To
echo "performance" > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor best of both worlds cron jobs to configure the governor mode ktune (RHEL5) tuned-adm profile server-powersave (RHEL6)
Tuning CPU Impact of Power Settings
RHEL6 - Database - OLTP workload
cpuspeed settings (range: 2.27 GHz down to 1.06 GHz)
[Chart: trans/min at 10U and 20U with cpuspeed off, performance, ondemand, and powersave governors]
Tuning CPU Effect of Power Settings - DSS
DSS workload (I/O intensive)
Time Metric (Lower is better)
[Chart: elapsed time with the performance, ondemand, and powersave governors]
vmstat output during the test:
[vmstat excerpt: CPU mostly idle (us ~4-5%, id ~86-90%), I/O wait ~5-7%, writes (bo) ~168K-248K]
Network Tuning
Network Performance
- Separate networks for different functions
  - If on the same network, use arp_filter to prevent ARP flux
  - echo 1 > /proc/sys/net/ipv4/conf/all/arp_filter
- 10GigE / Infiniband
  - Supports RDMA with the RHEL6 high performance networking package
- Packet size (see the sketch below)
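A hedged example of both settings; the interface name eth2 is a placeholder for a dedicated storage or interconnect NIC, and jumbo frames also require switch support:

    echo "net.ipv4.conf.all.arp_filter = 1" >> /etc/sysctl.conf
    sysctl -p                        # persist and apply arp_filter
    ip link set dev eth2 mtu 9000    # larger packet size (jumbo frames)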
Database Performance
Application tuning
- Design
- Reduce locking / waiting
- Database tools (optimize regularly)
Tuning - Virtualization KVM (RHEL guests)
Virtualization Tuning - Caching
Cache = none
- I/O from the guest is not cached on the host
Cache = writethrough
- I/O from the guest is cached and written through on the host
- Works well on large systems (lots of memory and CPU)
- Potential scaling problems with this option with multiple guests (host CPU used to maintain cache)
- Can lead to swapping on the host
How To: configure the I/O cache per disk, in the qemu command line or in libvirt
Effect of I/O Cache Settings on Guest Performance
[Chart: trans/min with cache=none vs cache=writethrough for 1 guest and 4 guests; % difference roughly 6% and 32% respectively]
Configurable per device:
- Virt-Manager: drop-down option under Advanced Options
- Libvirt XML file: driver name='qemu' type='raw' cache='writethrough' io='native' (see the sketch below)
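One way to apply this from the shell, sketched with a placeholder guest name (dbguest); the change takes effect the next time the guest starts:

    virsh dumpxml dbguest > dbguest.xml
    # edit the disk's driver element in dbguest.xml, e.g.:
    #   <driver name='qemu' type='raw' cache='none' io='native'/>
    virsh define dbguest.xml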
AIO Native vs Threaded (default)
[Chart: trans/min at 10U and 20U with AIO native vs AIO default (threaded)]
Configurable per device (only via the XML configuration file): driver name='qemu' type='raw' cache='writethrough' io='native'
Virtualization Tuning I/O Elevators - OLTP
Host Running Deadline
Trans / min - Higher is better
[Chart: trans/min for guests using CFQ, Deadline, and Noop, with 1, 2, and 4 guests]
Virtualization Tuning I/O elevators - DSS
Host Running Deadline
Time metric - Lower is better
[Chart: elapsed time for guests using CFQ, Deadline, and Noop, with 1, 2, and 4 guests]
Virtualization Tuning Using NUMA
[Stacked chart: trans/min of Guests 1-4 for 4Guest-24vcpu-56G vs 4Guest-24vcpu-56G-NUMA; roughly 28.6% improvement with NUMA placement]
Virtualization Tuning - Network
VirtIO
- VirtIO drivers for network
- vhost_net: bypasses the qemu layer (low latency, close to line speed; see the check below)
PCI pass-through
- Bypass the host and pass the PCI device to the guest
- Can be passed to only one guest
SR-IOV (Single Root I/O Virtualization)
- Pass-through to the guest
- Can be shared among multiple guests
- Limited hardware support
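A quick check that the vhost-net backend is available on the host before relying on it:

    modprobe vhost_net
    lsmod | grep vhost_net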
Latency Comparison RHEL 6
Network Latency by Guest Interface Method
Guest Receive (Lower is better)
[Chart: receive latency (usecs) vs message size (bytes) for host RX, virtio RX, vhost RX, and SR-IOV RX]
vhost-net improves latency, bringing it close to bare metal
Performance Setting Tool
tuned for RHEL6
Configure system for different performance profiles
- default
- latency-performance
- throughput-performance
- enterprise-storage
- server-powersave
- desktop-powersave
- laptop-ac-powersave
- laptop-battery-powersave
- spindown-disk
(usage example below)
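Typical tuned-adm usage on RHEL6:

    tuned-adm list                          # show available profiles
    tuned-adm profile enterprise-storage    # apply a profile
    tuned-adm active                        # confirm the active profile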
Performance Monitoring Tools
Monitoring tools: top, vmstat, ps, iostat, netstat, sar, perf
Kernel tools: /proc, sysctl, AltSysRq
Networking: ethtool, ifconfig
Profiling: oprofile, strace, ltrace, systemtap, perf
Wrap Up Bare Metal
I/O
- Choose the right elevator
- Eliminate hot spots
- Direct I/O or Asynchronous I/O
- Virtualization: caching
Memory
- NUMA
- Huge Pages
- Swapping
- Managing caches
RHEL has many tools to help with debugging / tuning
Wrap Up Bare Metal (cont.)
CPU
- Check cpuspeed settings
Network
- Separate networks
- arp_filter
- Packet size
Wrap Up Virtualization
- VirtIO drivers
- aio (native)
- NUMA
- Cache options (none, writethrough)
- Network (vhost-net)