•Many bits
•Lots of bits
•Huge bits
•It's gunna be great
•Very excited
•We have the best filesystems
•People tell me this is true
•Except the fake media, they didn't tell me this
5 PostgreSQL and ZFS: It's about the bits and storage, stupid.
Too soon?
6 PostgreSQL and ZFS
Some FS minutiae may have been harmed in the making of this talk.
Nit-pick as necessary (preferably after).
7 PostgreSQL - A Storage Administrator's View
pg_start_backup('my_backup', true)
a.k.a. an aggressive, immediate checkpoint (vs. the default behavior of
spreading the checkpoint out over checkpoint_completion_target, default 0.5,
of the checkpoint interval)
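In practice the call above is issued from psql; a sketch (the label name comes from the slide, everything else is standard PostgreSQL — note that PostgreSQL 15 renamed these functions to pg_backup_start/pg_backup_stop):

```shell
# Start a base backup; the second argument (true) requests an immediate
# checkpoint instead of one spread over checkpoint_completion_target.
psql -U postgres -c "SELECT pg_start_backup('my_backup', true);"

# ... take the filesystem-level copy / ZFS snapshot here ...

# Finish the backup (pg_backup_stop() on PostgreSQL 15+).
psql -U postgres -c "SELECT pg_stop_backup();"
```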
14 Quick ZFS Primer
[Diagram: a userland application sits above the kernel's VFS layer]
20 Block Filesystems: Top-Down
[Diagram: the application calls write(fd, buffer, cnt); the buffer passes through the VFS layer into system buffers]
22 Block Filesystems: Top-Down
[Diagram: a system buffer is flushed out to disk block #7014]
24 Block Filesystems: PostgreSQL Edition
[Diagram: PostgreSQL in userland above the kernel's VFS layer and system buffers]
25 Block Filesystems: PostgreSQL Edition
[Diagram: four page buffers (0–3) held in the kernel's system buffers]
26 Block Filesystems: PostgreSQL Edition
[Diagram: the four page buffers map to scattered disk blocks: 0 → #8884, 1 → #7014, 2 → #9971, 3 → #0016]
27 Quiz Time
[Diagram: a single 8K PostgreSQL page, holding up to ~182 foo_table tuples, is written with one write(fd, buffer, cnt) call]
29 ZFS Tip: postgresql.conf: full_page_writes=off
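Why this is safe on ZFS: records are written copy-on-write and atomically, so an 8K page can never be half-written. A sketch of the two settings involved (dataset name follows the examples later in the deck; verify on your own system before disabling torn-page protection):

```shell
# Align the ZFS record with PostgreSQL's 8K page so each page write
# maps to exactly one atomically-written record.
zfs set recordsize=8K rpool/db/pgdb1

# With atomic 8K records, full-page images in the WAL are redundant.
psql -U postgres -c "ALTER SYSTEM SET full_page_writes = off;"
psql -U postgres -c "SELECT pg_reload_conf();"
```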
[Diagram: kernel buffers can be 4K, so a single 8K write(fd, buffer, cnt) from the application spans two buffers (cnt = 2) in the VFS layer]
31 ZFS Filesystem Storage Abstraction
Physical storage is decoupled from filesystems.
VFS
Dataset Name Mountpoint
tank/ROOT /
tank/db /db
canmount=off
tank/ROOT/usr /usr
tank/local none
tank/local/etc /usr/local/etc
34 Offensively Over Simplified Architecture Diagram
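The mountpoint table above can be built with zfs create; which dataset carries canmount=off is not visible in the slide, so placing it on the unmounted tank/local container here is an assumption:

```shell
# Mountpoints are dataset properties, decoupled from pool layout.
zfs create -o mountpoint=/db tank/db

# A container dataset that is never mounted itself (canmount=off assumed):
zfs create -o canmount=off -o mountpoint=none tank/local
zfs create -o mountpoint=/usr/local/etc tank/local/etc
```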
[Diagram: a zpool exposes two kinds of datasets: filesystems and zvols]
[Diagram: the block tree from the uberblock down to a disk block holding a foo_table tuple]
39 ZFS: User Data + File dnode
[Diagram: block tree after transaction t1]
40 ZFS: Object Set
[Diagram: block tree after transactions t1–t2]
41 ZFS: Meta-Object Set Layer
[Diagram: block tree after transactions t1–t3]
42 ZFS: Uberblock
[Diagram: block tree after transactions t1–t4]
43 At what point did the filesystem become inconsistent?
[Diagram: block tree after transactions t1–t4]
44 At what point could the filesystem become inconsistent?
[Diagram: block tree after transactions t1–t4; answer: at t1]
45 How? I lied while explaining the situation. Alternate Truth.
ZFS is Copy-On-Write
What's not been deleted and is on disk is immutable.
[Diagram: block tree after transaction t1]
48 At what point did the filesystem become inconsistent?
[Diagram: block tree after transactions t1–t2]
49 At what point did the filesystem become inconsistent?
[Diagram: block tree after transactions t1–t3]
50 At what point did the filesystem become inconsistent?
[Diagram: block tree after transactions t1–t4]
51 At what point could the filesystem become inconsistent?
[Diagram: block tree after transactions t1–t4; answer: NEVER]
52 TIL about ZFS: Transactions and Disk Pages
I have yet to see compression slow down benchmarking results or real world
workloads. My experience is with:
•spinning rust (7.2K RPM, 10K RPM, and 15K RPM)
•fibre channel connected SANs
•SSDs
•NVMe
55 ZFS Tip: ALWAYS enable compression
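Enabling it is one property set per dataset; a sketch using the rpool/db dataset from the examples later in the deck:

```shell
# lz4 compresses and decompresses faster than most disks can move data,
# so enabling it is effectively free (and often a net speedup).
zfs set compression=lz4 rpool/db

# Watch the achieved ratio as data lands:
zfs get compression,compressratio rpool/db
```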
•Data written at the same time is stored near each other because it's frequently
part of the same record
•Data can now pre-fault into kernel cache (ZFS ARC) by virtue of the temporal
adjacency of the related pwrite(2) calls
•Write locality + compression=lz4 + pg_repack == PostgreSQL Dream Team
57 ZFS Perk: Data Locality
If you don't know what pg_repack is, figure out how to move into a database
environment that supports pg_repack and use it regularly.
•This is not just my recommendation, it's also from the community and author
of the feature.
•These are not the droids you are looking for
•Do not pass Go
•Do not collect $200
•Go straight to system unavailability jail
•The feature works, but you run the risk of bricking your ZFS server.
•TL;DR: 4.2% to 34% of SSDs see at least one uncorrectable bit error (UBER) per year
64 TIL: Bitrot Roulette
• Temperature
• Bus Power Consumption
• Data Written by the System Software
• Workload changes due to SSD failure
In a datacenter no one can hear your bits scream...
...except maybe they can.
68 Take Care of your bits
•Fibre Channel
•SAS
•SATA
•Tape
•SANs
•Cloud Object Stores
70 So what about PostgreSQL?
VDEV | vē-dēv
noun
a virtual device
zpool | zē-po͞ol
noun
an abstraction of physical storage made up of a set of VDEVs
ZPL | zē-pē-el
noun
ZFS POSIX Layer
ZIL | zil
noun
ZFS Intent Log
ARC | ärk
noun
Adaptive Replacement Cache
The pool should never consume more than 80% of the available space
82 Storage Management
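Checking pool fill level is one command; keep the CAP column under 80%, because copy-on-write allocation slows down as free space fragments:

```shell
# CAP is the percentage of pool space in use; stay below 80%.
zpool list -o name,size,allocated,free,capacity
```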
•Disable atime
•Enable compression
•Tune the recordsize
•Consider tweaking the primarycache
84 ZFS Dataset Tuning
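The first three knobs are one zfs set each (dataset name taken from the examples later in the deck):

```shell
zfs set atime=off rpool/db/pgdb1        # no access-time writes on every read
zfs set compression=lz4 rpool/db/pgdb1  # cheap, always-on compression
zfs set recordsize=8K rpool/db/pgdb1    # match PostgreSQL's 8K page size
```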
• metadata instructs ZFS's ARC to only cache metadata (e.g. dnode entries),
not page data itself
•Default: cache all data
•Double-caching happens
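To avoid the double-caching described above, the ARC can be restricted to metadata so page data lives only in PostgreSQL's shared_buffers; whether this wins depends on how large shared_buffers is relative to RAM, so treat it as something to measure rather than a default:

```shell
# Cache only ZFS metadata (dnodes, indirect blocks) in the ARC;
# PostgreSQL's shared_buffers then owns the page data.
zfs set primarycache=metadata rpool/db/pgdb1
zfs get primarycache rpool/db/pgdb1
```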
2-4µs/pwrite(2)!!
89 Performance Wins
PSA: The "Compressed ARC" feature was added to catch checksum errors in RAM
# zfs list
NAME USED AVAIL REFER MOUNTPOINT
rpool 819M 56.8G 96K none
rpool/db 160K 48.0G 96K /db
rpool/root 818M 56.8G 818M /
# zfs create rpool/db/pgdb1
# chown postgres:postgres /db/pgdb1
# zfs list
NAME USED AVAIL REFER MOUNTPOINT
rpool 819M 56.8G 96K none
rpool/db 256K 48.0G 96K /db
rpool/db/pgdb1 96K 48.0G 96K /db/pgdb1
rpool/root 818M 56.8G 818M /
# zfs set reservation=1G rpool/db/pgdb1
# zfs list
NAME USED AVAIL REFER MOUNTPOINT
rpool 1.80G 55.8G 96K none
rpool/db 1.00G 47.0G 96K /db
rpool/db/pgdb1 96K 48.0G 12.0M /db/pgdb1
rpool/root 818M 55.8G 818M /
96 initdb like a boss
# su postgres -c 'initdb --no-locale -E=UTF8 -n -N -D /db/pgdb1'
Running in noclean mode. Mistakes will not be cleaned up.
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.
postgres=# \q
# rm -rf *
# ls -1 | wc -l
0
# psql -U postgres
psql: FATAL: could not open relation mapping file "global/pg_filenode.map":
No such file or directory
Guilty Pleasure During Demos
98 Backups: Has Them
$ psql
psql: FATAL: could not open relation mapping file "global/pg_filenode.map": No such file or directory
# cat postgres.log
LOG: database system was shut down at 2017-03-03 21:08:05 UTC
LOG: MultiXact member wraparound protections are now enabled
LOG: database system is ready to accept connections
LOG: autovacuum launcher started
FATAL: could not open relation mapping file "global/pg_filenode.map": No such file or directory
LOG: could not open temporary statistics file "pg_stat_tmp/global.tmp": No such file or directory
LOG: could not open temporary statistics file "pg_stat_tmp/global.tmp": No such file or directory
...
LOG: could not open temporary statistics file "pg_stat_tmp/global.tmp": No such file or directory
LOG: could not open file "postmaster.pid": No such file or directory
LOG: performing immediate shutdown because data directory lock file is invalid
LOG: received immediate shutdown request
LOG: could not open temporary statistics file "pg_stat/global.tmp": No such file or directory
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit,
because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
LOG: database system is shut down
# ll
total 1
drwx------ 2 postgres postgres 2 Mar 3 21:40 ./
drwxr-xr-x 3 root root 3 Mar 3 21:03 ../
99 Restores: As Important as Backups
Position: since ZFS will never be inconsistent, and therefore PostgreSQL will
never lose integrity, 1s of actual data loss is a worthwhile tradeoff for a ~10x
performance boost in write-heavy applications.
# cat /sys/module/zfs/parameters/zfs_txg_timeout
5
# echo 1 > /sys/module/zfs/parameters/zfs_txg_timeout
# echo 'options zfs zfs_txg_timeout=1' >> /etc/modprobe.d/zfs.conf
# psql -c 'ALTER SYSTEM SET synchronous_commit=off'
ALTER SYSTEM
# zfs set logbias=throughput rpool/db
QUESTIONS?