
Linux

Filesystems and MySQL


Ammon Sutherland
April 23, 2013

Friday, April 26, 13

Preface...
"Who is it?" said Arthur.
"Well," said Ford, "if we're lucky it's just the Vogons
come to throw us into space."
"And if we're unlucky?"
"If we're unlucky," said Ford grimly, "the captain might
be serious in his threat that he's going to read us some of
his poetry first ..."


Background
3

Long-time Linux System Administrator turned DBA

University systems
Managed Hosting
Online Auctions
E-commerce, SEO, marketing, data-mining

A bit of an optimization junkie


Once in a while I share: http://shamallu.blogspot.com/


Agenda
4

Basic Theory
Directory structure
LVM
RAID
SSD
Filesystem concepts

Filesystem choices

MySQL Tuning

Benchmarks
IO tests
FS maintenance
OLTP

AWS EC2
Conclusions

Basic Theory
deadlock detected
we rollback transaction two
err one two one three
- A MySQL Haiku -


Directory Structure
6

Things that must be stored on disk


Data files (.ibd or .MYD and .MYI) - Random IO
Main InnoDB data file (ibdata1) - Random IO
InnoDB log files (ib_logfile0, ib_logfile1) - Sequential IO (one at a time)
Binary logs and relay logs - Sequential IO
General query log and slow query log - Sequential IO
master.info - technically Random IO
Error log - Infrequent Sequential IO
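These access patterns suggest putting the sequential-IO files on a different device than the randomly-accessed data files. A minimal my.cnf sketch of that split; the /data and /logs paths are hypothetical:

```ini
[mysqld]
# Random IO: .ibd / .MYD / .MYI files and ibdata1
datadir                   = /data/mysql
# Sequential IO: ib_logfile0 / ib_logfile1
innodb_log_group_home_dir = /logs/mysql
# Sequential IO: binary logs
log-bin                   = /logs/mysql/mysql-bin
# Sequential IO: slow query log
slow_query_log_file       = /logs/mysql/slow.log
```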


Linux IO Sub-System
7


Hard Drives
8

Rotating platters
SAS vs. SATA

SAS 6 Gb/s connectors can handle SATA 3 Gb/s drives


SAS typically costs more (much more at larger sizes)
SAS often supports higher rotation rates (10k, 15k rpm)
SAS has more logic on the drives
SAS has more data-consistency and error-reporting logic than SATA S.M.A.R.T.
SAS uses higher voltages, allowing external arrays with longer signal runs
SAS does TCQ vs. SATA NCQ (similar effect)
Both use 8b/10b encoding (25% encoding overhead)


SSD
9

Pros:
Very fast random reads and writes
Handle high concurrency very well

Cons:
Cost per GB
Lifespan and performance depend on write cycles. Beware write amplification
Requires care with RAID cards
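Write amplification is the ratio of bytes the flash actually writes internally to bytes the host asked to write. A toy illustration of the arithmetic (the sizes are made up, not measurements):

```python
# SSDs erase in large blocks; a small random page update can force the
# drive to rewrite a whole erase block internally.
ERASE_BLOCK = 512 * 1024   # hypothetical 512 KiB erase block
HOST_WRITE = 16 * 1024     # host updates one 16 KiB InnoDB page

def write_amplification(flash_bytes, host_bytes):
    return flash_bytes / host_bytes

wa = write_amplification(ERASE_BLOCK, HOST_WRITE)
print(f"worst-case write amplification: {wa:.0f}x")  # 32x
```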


RAID
10

Typical RAID Modes:


RAID-0: Data striped, no redundancy (2+ disks)
RAID-1: Data mirrored, 1:1 redundancy (2+ disks)
RAID-5: Data striped with parity (3+ disks)
RAID-6: Data striped with double parity (4+ disks)
RAID-10: Data striped and mirrored (4+ disks)
RAID-50: RAID-0 striping of multiple RAID-5 groups (6+ disks)
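The capacity and fault-tolerance tradeoffs above can be sketched as a small helper, assuming identical disks and ignoring controller overhead and hot spares:

```python
# Usable capacity and guaranteed-survivable failures for common RAID levels.
def raid_layout(level, disks, disk_tb=1.0):
    """Return (usable_tb, guaranteed_survivable_failures)."""
    if level == 0:
        return disks * disk_tb, 0
    if level == 1:
        return disk_tb, disks - 1
    if level == 5:
        return (disks - 1) * disk_tb, 1
    if level == 6:
        return (disks - 2) * disk_tb, 2
    if level == 10:
        # striped mirrors: half the raw capacity; only one failure is
        # guaranteed survivable (a second must hit a different mirror pair)
        return (disks // 2) * disk_tb, 1
    raise ValueError(f"unsupported RAID level: {level}")

for level, disks in [(0, 2), (1, 2), (5, 3), (6, 4), (10, 4)]:
    usable, failures = raid_layout(level, disks)
    print(f"RAID-{level:<2} {disks} disks: {usable:.1f} TB usable, "
          f"survives {failures} failure(s)")
```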

RAID (cont.)
11

Typical RAID Benefits and risks:


RAID-0 - Scales reads and writes, multiplies space (risky: no disk can fail)
RAID-1 - Scales reads, not writes; no additional space (data intact with only one disk remaining, can rebuild)
RAID-5 - Scales reads and some writes (parity penalty; can survive one disk failure and rebuild)
RAID-6 - Scales reads and fewer writes than RAID-5 (double parity penalty; can survive two disk failures and rebuild)
RAID-10 - Scales reads 2x vs. writes (can lose up to two disks in particular combinations)
RAID-50 - Scales reads and writes (can lose one disk per RAID-5 group and still rebuild)


RAID Cards
12

Purpose:
Offload RAID calculations from CPU, including parity
Routine disk consistency checks
Cache

Tips:
Controller cache mostly benefits writes
Write-back cache is good - beware of battery learn cycles
Disk cache - best disabled on SAS drives; SATA drives frequently use it for NCQ
Stripe size - should be at least the size of the basic block being accessed; bigger is usually better for larger files
Read-ahead - depends on access patterns
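The stripe-size tip can be made concrete: partition offsets should be a multiple of the stripe, and mkfs can be told the array geometry. A sketch of the arithmetic (the 64 KiB stripe and 4-disk RAID-5 are just examples):

```python
STRIPE = 64 * 1024  # bytes

def is_stripe_aligned(offset_bytes, stripe=STRIPE):
    """A partition starting here won't split blocks across stripe segments."""
    return offset_bytes % stripe == 0

def ext4_geometry(stripe_kib, block_kib, data_disks):
    """ext4 mkfs hints: stride (blocks per stripe segment) and stripe-width."""
    stride = stripe_kib // block_kib
    return stride, stride * data_disks

print(is_stripe_aligned(1024 * 1024))  # modern 1 MiB partition offset: True
print(is_stripe_aligned(63 * 512))     # legacy 63-sector offset: False

# 4-disk RAID-5 has 3 data-bearing disks; 64 KiB stripe, 4 KiB blocks
stride, width = ext4_geometry(64, 4, 3)
print(f"mkfs.ext4 -E stride={stride},stripe-width={width} /dev/sdX1")
```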


LVM
13

Why use it?


Ability to easily expand disk
Snapshots (easy for dev, proof of concept, backups)

Cost?
Straight usage usually 2-3% performance penalty
With 1 snapshot 40-80% penalty
Additional snapshots are only 1-2% additional penalty each


IO Scheduler
14

Goal - minimize seeks, prioritize process io


CFQ - multiple queues, priorities, sync and async
Anticipatory - anticipatory pauses after reads; not useful with RAID or TCQ
Deadline - "deadline" contract for starting all requests; best with many-disk RAID or TCQ
Noop - tries not to interfere; simple FIFO; recommended for VMs and SSDs
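On a running box the active scheduler shows up bracketed in /sys/block/&lt;dev&gt;/queue/scheduler. A small parser for that format (the sample string is illustrative):

```python
def active_scheduler(sysfs_line):
    """Extract the bracketed (active) scheduler from the sysfs format."""
    for word in sysfs_line.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    return None

# On a real system: open("/sys/block/sda/queue/scheduler").read()
sample = "noop anticipatory deadline [cfq]"
print(active_scheduler(sample))  # cfq
```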


Filesystem Concepts
15

Inode - stores block pointers and metadata of a file or directory


Block - stores data
Superblock - stores filesystem metadata
Extent - contiguous "chunk" of free blocks
Journal - record of pending and completed writes
Barrier - safety mechanism when dealing with RAID or disk caches
fsck - filesystem check
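Some of these concepts are visible from userspace through stat(2): st_ino is the inode number, and st_blocks counts allocated space in 512-byte units regardless of the filesystem block size. A quick demo:

```python
import os
import tempfile

# Create a small file and inspect its inode metadata
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * 8192)
    path = f.name

st = os.stat(path)
print("inode number:", st.st_ino)
print("size:", st.st_size, "bytes,",
      st.st_blocks, "512-byte block units allocated")
os.unlink(path)
```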


VFS Layer
16

API layer between system calls and filesystems, similar to the MySQL storage engine API layer


Linux IO Sub-System
17



18

Filesystem Choices
In the style of Edgar Allan Poe's The Raven
Once upon a SQL query
While I joked with Apple's Siri
Formatting many a logical volume on my quad core
Suddenly there came an alert by email
as of some threshold starting to wail
wailing like my SMS tone
"Tis just Nagios" I muttered,
"sending alerts unto my phone,
Only this - I might have known."

Ext filesystems
19

ext2 - no journal
ext3 - adds journal, some enhancements like directory hashes, online
resizing

ext4 - adds extents, barriers, journal checksum, removes inode locking


common features - block groups, reserved blocks
ext2/3 max FS size=32 TiB, max file size=2 TiB
ext4 max FS size=1 EiB, max file size=16 TiB


XFS
20

extents, data=writeback style journaling, barriers, delayed allocation, dynamic inode creation, online growth (cannot be shrunk)
max FS size=16 EiB, max file size=8 EiB


Btrfs
21

extents, data and metadata checksums, compression, subvolumes, snapshots, online b-tree rebalancing and defrag, SSD TRIM support
max FS size=16 EiB, max file size=16 EiB


ZFS*
22

volume management, RAID-Z, continuous integrity checking, extents, data and metadata checksums, compression, subvolumes, snapshots, encryption, ARC cache, transactional writes, deduplication
max FS size=16 EiB, max file size=16 EiB
* note that not all of these features are yet supported natively on Linux


Filesystem Maintenance
23

FS Creation (732 GB)
[Bar chart: creation time, less is better, for btrfs, xfs, ext4, ext3, ext2; axis 0-100]

FSCK
[Bar chart: fsck time, less is better, for btrfs, xfs, ext4, ext3, ext2; axis 0-300]


24

MySQL Tuning Options


Continuing in the style of The Raven
Ah distinctly I remember
as I documented for each member
of the team just last Movember
in the wiki that we keep
write and keep and nothing more
When my query thus completed
Fourteen duplicate rows deleted
All my replicas then repeated
repeated the changes as before
I dumped it all to a shared disk
kept as a backup forever more.

MySQL Tuning Options for IO


25

innodb_flush_log_at_trx_commit
innodb_flush_method
innodb_buffer_pool_size
innodb_io_capacity
innodb_adaptive_flushing
innodb_change_buffering
innodb_log_buffer_size
innodb_log_file_size
innodb_max_dirty_pages_pct
innodb_max_purge_lag
innodb_open_files
table_open_cache
innodb_page_size
innodb_random_read_ahead
innodb_read_ahead_threshold
innodb_read_io_threads
innodb_write_io_threads
sync_binlog
general_log
slow_query_log
tmp_table_size, max_heap_table_size


InnoDB Flush Method


26

Applies to InnoDB log and data file writes


O_DIRECT - "Try to minimize cache effects of the I/O to and from this file. In general this will degrade performance, but it is useful in special situations, such as when applications do their own caching. File I/O is done directly to/from user space buffers." - Applies to data files, follows up with fsync; avoids double-buffering in the OS page cache

O_DSYNC - "Write I/O operations on the file descriptor shall complete as defined by synchronized I/O data integrity completion." - Applies to log files; data files get fsync

fdatasync - (deprecated as an explicit option in 5.6) Default mode. fdatasync on every flush to log or data files

O_DIRECT_NO_FSYNC - (5.6 only) O_DIRECT without fsync (not suitable for XFS)
fsync - flush all data and metadata for a file to disk before returning
fdatasync - flush all data, and only the metadata necessary to read the file properly, to disk before returning
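At the syscall level the distinctions look like this, sketched with Python's os module (Linux-specific; O_DSYNC availability varies by platform, and O_DIRECT is omitted because it needs sector-aligned buffers, which is awkward in pure Python):

```python
import os
import tempfile

fd0, path = tempfile.mkstemp()
os.close(fd0)

# fdatasync/fsync style: buffered write, then an explicit flush
fd = os.open(path, os.O_WRONLY)
os.write(fd, b"redo log record\n")
os.fdatasync(fd)  # data + only the metadata needed to read it back
os.fsync(fd)      # data + all metadata (e.g. timestamps too)
os.close(fd)

# O_DSYNC style: each write() returns only after synchronized data I/O
fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_DSYNC)
os.write(fd, b"synchronous write\n")
os.close(fd)

data = open(path).read()
print(data)
os.unlink(path)
```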


InnoDB Flush Method - Notes


27

O_DIRECT - The thing that has always disturbed me about O_DIRECT is that the whole interface
is just stupid, and was probably designed by a deranged monkey on some serious mind-controlling
substances. --Linus Torvalds

O_DIRECT - The behaviour of O_DIRECT with NFS will differ from local file systems. Older kernels,

or kernels configured in certain ways, may not support this combination. The NFS protocol does not
support passing the flag to the server, so O_DIRECT I/O will only bypass the page cache on the client;
the server may still cache the I/O. The client asks the server to make the I/O synchronous to preserve
the synchronous semantics of O_DIRECT. Some servers will perform poorly under these
circumstances, especially if the I/O size is small. Some servers may also be configured to lie to clients
about the I/O having reached stable storage; this will avoid the performance penalty at some risk to
data integrity in the event of server power failure. The Linux NFS client places no alignment
restrictions on O_DIRECT I/O.

O_DSYNC - POSIX provides for three different variants of synchronized I/O, corresponding to the
flags O_SYNC, O_DSYNC, and O_RSYNC. Currently (2.6.31), Linux only implements O_SYNC, but glibc
maps O_DSYNC and O_RSYNC to the same numerical value as O_SYNC. Most Linux file systems
don't actually implement the POSIX O_SYNC semantics, which require all metadata updates of a
write to be on disk on returning to user space, but only the O_DSYNC semantics, which require only
actual file data and metadata necessary to retrieve it to be on disk by the time the system call
returns.



28

Benchmarks
There once was a small database program

It had InnoDB and MyISAM

One did transactions well,

and one would crash like hell
Between the two they used all of my RAM
- A database Limerick -


Testing Setup...
29

Dell PowerEdge 1950


2x quad-core Intel Xeon 5150 @ 2.66 GHz
16 GB RAM
4 x 300 GB SAS disks at 10k rpm (RAID-5, 64KB
stripe size)
Dell Perc 6/i RAID Controller with 512MB cache
CentOS 6.4 (sysbench io tests done with Ubuntu
12.10)
MySQL 5.5.30


Testing Setup (cont)


30

my.cnf settings:
log-error
skip-name-resolve
key_buffer = 1G
max_allowed_packet = 1G
query_cache_type=0
query_cache_size=0
slow_query_log=1
long_query_time=1
log-bin=mysql-bin
max_binlog_size=1G
binlog_format=MIXED
innodb_buffer_pool_size = 4G # or 14G, see tests
innodb_additional_mem_pool_size = 16M
innodb_log_file_size = 1G
innodb_file_per_table = 1
innodb_flush_method = O_DIRECT # Unless specified as fdatasync or O_DSYNC
innodb_flush_log_at_trx_commit = 1
### innodb_doublewrite=0 # for zfs tests only


IO Tests - Sysbench - Sequential Reads


31

500"

MB/s
Higher is better

450"
400"
350"

ext2"

300"

ext3"

250"

ext4"

200"
150"

xfs"

100"

btrfs"

50"
0"
1"thread" 2"thread" 4"thread" 8"thread" 16"thread"32"thread"64"thread"

Friday, April 26, 13

IO Tests - Sysbench - Sequential Writes


32

300"

MB/s
Higher is better

250"
ext2"

200"

ext3"

150"

ext4"

100"

xfs"
btrfs"

50"
0"
1"thread" 2"thread" 4"thread" 8"thread" 16"thread"32"thread"64"thread"

Friday, April 26, 13

IO Tests - Sysbench - Random Reads


33

30"

MB/s
Higher is better

25"
ext2"

20"

ext3"

15"

ext4"

10"

xfs"
btrfs"

5"
0"
1"thread" 2"thread" 4"thread" 8"thread" 16"thread" 32"thread" 64"thread"

Friday, April 26, 13

IO Tests - Sysbench - Random Writes


34

10"

MB/s
Higher is better

9"
8"
7"

ext2"

6"

ext3"

5"

ext4"

4"

xfs"

3"

btrfs"

2"
1"
0"
1"thread" 2"thread" 4"thread" 8"thread" 16"thread" 32"thread" 64"thread"

Friday, April 26, 13

Mount Options
35

ext2: noatime
ext3: noatime
ext4: noatime,barrier=0
xfs: inode64,nobarrier,noatime,logbufs=8
btrfs: noatime,nodatacow,space_cache
zfs: noatime (recordsize=16k, compression=off, dedup=off)
all - noatime - Do not update access time (atime) metadata on files after reading or writing them
ext4 / xfs - barrier=0 / nobarrier - Do not use barriers to pause and receive assurance when writing (i.e., trust the hardware)
xfs - inode64 - use 64-bit inode numbering; became the default in recent kernel trees
xfs - logbufs=8 - Number of in-memory log buffers (between 2 and 8, inclusive)
btrfs - space_cache - Btrfs stores the free space data on disk to make the caching of a block group much quicker (kernel 2.6.37+). It's a persistent change and is safe to boot into old kernels
btrfs - nodatacow - Do not copy-on-write data. datacow ensures the user has access to either the old version of a file or the newer version; it makes sure we never have partially updated files written to disk. nodatacow gives a slight performance boost by directly overwriting data (like ext[234]), at the expense of potentially getting partially updated files on system failures. The gain is usually < 5% unless the workload is random writes to large database files, where the difference can become very large
btrfs - compress=zlib - Better compression ratio. It's the default and safe for older kernels
btrfs - compress=lzo - Fastest compression. btrfs-progs 0.19 or older will fail with this option. The default in kernel 2.6.39 and newer
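As /etc/fstab entries, the tested options would look roughly like this (device names and mount points are hypothetical):

```
/dev/vg0/lv_ext4   /data/ext4    ext4    noatime,barrier=0                     0 0
/dev/vg0/lv_xfs    /data/xfs     xfs     inode64,nobarrier,noatime,logbufs=8   0 0
/dev/vg0/lv_btrfs  /data/btrfs   btrfs   noatime,nodatacow,space_cache         0 0
```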


iobench with mount options


2500"

MB/s
Higher is better
ext2"

2000"

ext2"+"op6ons"
ext3"

1500"

ext3"+"op6ons"
ext4"
ext4"+"op6ons"

1000"

xfs"
xfs"+"op6ons"

500"

btrfs"
btrfs"+"op6ons"

0"
Read"MB/s"

Friday, April 26, 13

Write"MB/s"


37

IO Scheduler Choices
Round and round the disk drive spins
but SSD sits still and grins.
It is randomly fast
for data current and past.
My database upgrade begins


SQLite
160"

Seconds
lower is better

140"
120"
CFQ"

100"

An5cipatory"

80"

Deadline"

60"

Noop"

40"
20"
0"
ext2"

Friday, April 26, 13

ext3"

ext4"

xfs"

btrfs"

aio-stress
1000"

MB/s
Higher is better

900"
800"
700"

CFQ"

600"
500"

An8cipatory"

400"

Deadline"

300"

Noop"

200"
100"
0"
ext2"

Friday, April 26, 13

ext3"

ext4"

xfs"

btrfs"

iozone read
[Bar chart: iozone read throughput in MB/s, higher is better, for CFQ, Anticipatory, Deadline, and Noop on ext2, ext3, ext4, xfs, btrfs; y-axis 2150-2450 MB/s]

iozone write
250"
MB/s
Higher is Better

200"
CFQ"

150"

An4cipatory"
Deadline"

100"

Noop"
50"
0"
ext2"

Friday, April 26, 13

ext3"

ext4"

xfs"

btrfs"

Real World Workloads


Flush local tables
Make an LVM snapshot
Backup with rsync
- A Haiku on easy backups -


Data Loading Performance
43

Load time, 115 GB
[Bar chart: time in seconds to load a 115 GB dataset, lower is better, for each innodb_flush_method (O_DIRECT, fdatasync, O_DSYNC) on ext2, NFS (ext2), ext3, ext4, xfs, zfs, and btrfs; y-axis roughly 1000-16000 s]

OLTP Performance - 1 thread


44

[Bar chart: OLTP test time in seconds, lower is better, 1 thread, with buffer pool at 1/4 vs. 7/8 of RAM, for each innodb_flush_method (O_DIRECT, fdatasync, O_DSYNC) on ext2, NFS (ext2), ext3, ext4, xfs, zfs, and btrfs; y-axis 1200-2400 s]

OLTP Performance - 16 thread


45

4000"

3500"
Time in Seconds
Lower is Better

3000"

2500"

2000"

1500"

1000"
16"thread"1/4"ram"

500"
16"thread,"7/8"ram"


46

AWS Cloud Options


Performance, uptime,
Consistency and scale-up:
No, this is a cloud
- A haiku on clouds -


Cloud Performance
47

EC2 - Slightly unpredictable

*Note: not my research or graphs. See blog.scalyr.com for benchmarks and writeup



48

Conclusions
Oracle is Red,
IBM is Blue,
I like stuff for free
MySQL will do.


Conclusions
49

IO Schedulers - Deadline or Noop


Filesystem - ext3 is usually slowest. Btrfs is not quite there yet but looking better. ZFS on Linux is cool, but performance is sub-par.
InnoDB Flush Method - O_DIRECT is not always best
Filesystem mount options make a difference
Artificial benchmarks are fun, but like most things, comparative speed is very workload dependent

Further Reading...
50

For more information please see these great resources:


Wikipedia:
http://en.wikipedia.org/wiki/Ext2 and http://en.wikipedia.org/wiki/Ext3 and http://en.wikipedia.org/wiki/Ext4 and http://en.wikipedia.org/wiki/XFS and http://en.wikipedia.org/wiki/Btrfs
MySQL Performance Blog:
http://www.mysqlperformanceblog.com/2009/02/05/disaster-lvm-performance-in-snapshot-mode/
http://www.mysqlperformanceblog.com/2012/05/22/btrfs-probably-not-ready-yet/
http://www.mysqlperformanceblog.com/2013/01/03/is-there-a-room-for-more-mysql-io-optimization/
http://www.mysqlperformanceblog.com/2012/03/15/ext4-vs-xfs-on-ssd/
http://www.mysqlperformanceblog.com/2011/12/16/setting-up-xfs-the-simple-edition/
MySQL at Facebook (and dom.as blog):
http://dom.as/2008/11/03/xfs-write-barriers/
http://www.facebook.com/note.php?note_id=10150210901610933
Dimitrik:
http://dimitrik.free.fr/blog/archives/2012/01/mysql-performance-linux-io.html
http://dimitrik.free.fr/blog/archives/02-01-2013_02-28-2013.html#159
http://dimitrik.free.fr/blog/archives/2011/01/mysql-performance-innodb-double-write-buffer-redo-log-size-impacts-mysql-55.html


...Further Reading
51

For more information please see these great resources:


Phoronix:
http://www.phoronix.com/scan.php?page=article&item=ubuntu_1204_fs&num=1
http://www.phoronix.com/scan.php?page=article&item=linux_39_fs&num=1
http://www.phoronix.com/scan.php?page=article&item=fedora_15_lvm&num=3
Misc:
http://erikugel.wordpress.com/2011/04/14/the-quest-for-the-fastest-linux-filesystem/
https://raid.wiki.kernel.org/index.php/Performance
http://uclibc.org/~aldot/mkfs_stride.html
http://indico.cern.ch/getFile.py/access?contribId=3&sessionId=0&resId=1&materialId=paper&confId=13797
http://linux.die.net/man/2/open
http://linux.die.net/man/2/fsync
http://blog.scalyr.com/2012/10/16/a-systematic-look-at-ec2-io/
http://docs.openstack.org/trunk/openstack-object-storage/admin/content/filesystem-considerations.html
https://btrfs.wiki.kernel.org/index.php/Main_Page
http://zfsonlinux.org/
https://blogs.oracle.com/realneel/entry/mysql_innodb_zfs_best_practices


Parting thought
Do you like MyISAM?
I do not like it, Sam-I-am.
I do not like MyISAM.
Would you use it here or there?
I would not use it here or there.
I would not use it anywhere.
I do not like MyISAM.
I do not like it, Sam-I-am.
Would you like it in an e-commerce site?
Would you like it in the middle of the night?
I do not like it for an e-commerce site.
I do not like it in the middle of the night.
I would not use it here or there.
I would not use it anywhere.
I do not like MyISAM.
I do not like it Sam-I-am.
Would you could you for foreign keys?
Use it, use it, just use it please!
You may like it, you will see
Just convert these tables three
Not for foreign keys, not for those tables three!
I will not use it, you let me be!

