
Tuning PostgreSQL WAL Synchronization

This is an introduction to the low-level implementation of the write logging used in PostgreSQL 8.2. The database engine assures data integrity by a process that involves writing out data to a Write Ahead Log (WAL), then writing to the main database. Because there are so many operating systems PostgreSQL runs on, the exact way the WAL is handled varies considerably from platform to platform. Accurately tuning how writes to the WAL are done without impacting the data integrity requirements of the database requires understanding quite a few low-level implementation details; that’s what’s covered here in excruciating detail.

What’s a WAL?

Much of this won’t make sense unless you’re already familiar with the terminology of the Write-Ahead Log. There are two sections of the PostgreSQL documentation that are prerequisites here. I would recommend reading the sections from the latest documentation. First read chapter 27, “Reliability and the Write-Ahead Log”, then skim section 17.5, “Write Ahead Log”, to see what parameters can be adjusted. The section we’re focusing on here is 17.5.1, specifically wal_sync_method (yes, this entire document is about one parameter).

PostgreSQL lets you specify which method it should use when writing to a WAL file via the configuration parameter wal_sync_method. From the documentation:
· open_datasync (write WAL files with open() option O_DSYNC)
· fdatasync (call fdatasync() at each commit)
· fsync_writethrough (call fsync() at each commit, forcing write-through of any disk write cache)
· fsync (call fsync() at each commit)
· open_sync (write WAL files with open() option O_SYNC)

Not all of these choices are available on all platforms. The default is the first method in the above list that is supported by the platform.
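The selection rule quoted above ("the first method in the list that is supported") can be sketched in a few lines of Python. This is only an illustration of the rule, not PostgreSQL's own code: the function name default_wal_sync_method and the example set of supported methods are assumptions made up for the example.

```python
# Order in which the documentation lists the wal_sync_method choices.
WAL_SYNC_METHODS = [
    "open_datasync",
    "fdatasync",
    "fsync_writethrough",
    "fsync",
    "open_sync",
]


def default_wal_sync_method(supported):
    """Return the first documented method that appears in `supported`.

    `supported` is any collection of method names a platform offers
    (a hypothetical input used only for this illustration).
    """
    for method in WAL_SYNC_METHODS:
        if method in supported:
            return method
    raise ValueError("no supported wal_sync_method")


# Example: a platform offering only fdatasync, fsync and open_sync.
print(default_wal_sync_method({"fdatasync", "fsync", "open_sync"}))
```

Because the list is ordered by preference, adding or removing a capability for a platform changes its default without any other logic.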

This is certainly cryptic if you don’t know something about UNIX file mechanics. To get started, you can determine which sync method is currently being used by your database (helpful when you’re letting the server pick automatically) inside psql with:

    show wal_sync_method;

Here’s a summary of which options are available on various popular platforms. Look at the line for your platform and read from left to right; the first one that says Yes or Direct will be your default:

    Platform         open_datasync  fdatasync  fsync_writethrough  fsync  open_sync
    Linux            No             Yes        No                  Yes    Direct
    Mac OS X/Darwin  No             No         Yes                 Yes    Yes (Direct?)
    Solaris          Yes            Yes        No                  Yes    Yes
    Windows          Yes (Direct?)  No         Yes                 Yes    Yes
    BSD/OS 4.3       No             No         No                  Yes    Yes (Direct?)
    FreeBSD 4.9      No             No         No                  Yes    Yes (Direct?)

There seems to be a problem with O_DIRECT not working right on Solaris. The above table will all make more sense after each method is explained. But first we need a diversion into hard disk technology and UNIX file writing techniques.

Write Caches and Disk Arrays

To understand why this decision is complicated at all, you need to know something about how hard drives deal with disk writes. At the lowest levels, hard drives have their own internal cache that’s used for reads and writes. PostgreSQL expects that the disks connected to it can be trusted to correctly report when data has been written to them. Inexpensive IDE and SATA drives are widely criticized for PostgreSQL use because they often “lie”, reporting that data has been written when in fact it’s just been put into the disk cache. A good introduction to this subject is at http://www.jasonbrome. … /04/03/writecache_enabled.html and comments on how this interacts with Postgres are at http://www.thescripts. … .html (it’s also a regular topic of discussion on the PostgreSQL performance mailing list at http://archives.postgresql. … ). See section 11. … .5 at http://www.freebsd. … ISO8859-1/books/handbook/configtuning-disk.html for comments on how FreeBSD has been challenged by this issue.

SCSI disks are considered much more accurate in terms of actually having written data when they say they have. This is one of the reasons the performance difference between IDE and SCSI is perceived as narrowing: write caching is artificially inflating the results so that the better SCSI hardware doesn’t look as impressive as it is. Unfortunately, the IDE drives are so dependent on this feature that turning it off gives unreasonably low results.

So the first rule here is that if your hardware isn’t actually writing when it says it is, your performance numbers will be inflated. It’s extra performance at the expense of reliability, and eventually you’ll get database corruption.

If you’re using any disk array or intelligent controller card, again you’re back to needing to make sure that disk writes are being completed when they say they are. Disk performance at all levels is greatly affected by disk caching. Better controllers and arrays have their own cache, and in those cases you normally need to make sure it’s configured in write-through mode for proper Postgres integrity.

One approach regularly debated on the PostgreSQL performance list uses a card or array that has a battery-backed disk cache on it. That allows you to get the instant fsync() response that allows extremely high database disk write rates, while still ensuring that data that has been reported written will properly make it to the disks eventually even if there’s a system failure. The LSI MegaRAID SCSI cards, which have a write-caching policy that’s adjustable in the BIOS, are a good reference implementation of this for Linux. Critics of the battery backup approach suggest that if you run such a system under load, eventually you’ll have a failure in this relatively complicated cache method that will corrupt your database in a way that’s nearly impossible to detect or recover from. Fans of a battery-backed cache suggest they can’t achieve their target workloads without it, and that a good backup strategy is the cure for an event as rare as this one.

WAL writing

In order to completely tune how your system performs, it’s necessary to learn a bit more than you probably wanted to know about how UNIX does file I/O operations.

The PostgreSQL Write-Ahead Log (WAL) does only a few basic operations to the logs it works on:

1) Open a log file
2) Write a series of records to the log
3) Close the log file
4) Switch to or create a new log file
5) Repeat

Normally, when you ask to write a disk block to a drive, your operating system caches that information and writes it when it gets a chance. This write-caching greatly improves performance for normal applications. In order for the WAL mechanism to work correctly, Postgres needs to know when the data written has actually been committed to the disk drive.

Synchronous Writes

Situations where the application can’t continue until the data has been completely written out are referred to by UNIX as synchronous writes. When you open a UNIX file using the POSIX standard calls (see http://www.opengroup. … .html) you can ask that all writes to that file be done in one of two synchronous modes: O_DSYNC and O_SYNC. The official spec doesn’t tell you a lot about the difference between these two. A better explanation comes from IBM’s discussion of how these settings impact journaling filesystems at http://publib16.boulder. … .htm#wq222. From that document:

O_DSYNC: When a file is opened using the O_DSYNC open mode, the write () system call will not return until the file data and all file system meta-data required to retrieve the file data are both written to their permanent storage locations.

O_SYNC: In addition to items specified by O_DSYNC, O_SYNC specifies that the write () system call will not return until all file attributes relative to the I/O are written to their permanent storage locations, even if the attributes are not required to retrieve the file data.

The main thing you need to walk away from this description with is that O_DSYNC can be a much smaller operation than O_SYNC is. O_SYNC requires all filesystem meta-data to be written, while O_DSYNC just requires that the writing of your data be completed. Since database systems like PostgreSQL have their own method for dealing with metadata within the WAL, it really doesn’t need to wait for all the file-system information to be written; it just needs the data written.

The other way to cope with a synchronized writing situation is to open the file normally, write your data, and then ask the operating system to flush that data out to disk. There are two ways to do this under UNIX:

· fdatasync: http://www.opengroup. … .html
· fsync: http://www.opengroup. … .html

These two operate similarly to the above, with fdatasync being like a write done with O_DSYNC, while fsync is like a write with O_SYNC.
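To make the two families of calls concrete, here is a minimal Python sketch (not PostgreSQL code) that writes one 8 kB block using each of the four approaches described above. The helper names and the scratch file name are assumptions made for illustration. os.O_DSYNC, os.O_SYNC and os.fdatasync only exist on platforms that provide them, so the sketch checks for each before using it.

```python
import os
import tempfile

BLOCK = b"\0" * 8192  # one 8 kB block of data to write


def write_with_open_flag(path, flag):
    """Open with a synchronous flag (O_DSYNC or O_SYNC); write() itself waits."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | flag, 0o600)
    try:
        return os.write(fd, BLOCK)
    finally:
        os.close(fd)


def write_then_flush(path, flush):
    """Open normally, write, then ask the OS to flush (fdatasync or fsync)."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        written = os.write(fd, BLOCK)
        flush(fd)
        return written
    finally:
        os.close(fd)


def run_all(directory):
    """Try each method this platform exposes; return {name: bytes written}."""
    path = os.path.join(directory, "sync_demo.out")  # hypothetical scratch file
    results = {}
    if hasattr(os, "O_DSYNC"):
        results["open_datasync"] = write_with_open_flag(path, os.O_DSYNC)
    if hasattr(os, "O_SYNC"):
        results["open_sync"] = write_with_open_flag(path, os.O_SYNC)
    if hasattr(os, "fdatasync"):
        results["fdatasync"] = write_then_flush(path, os.fdatasync)
    results["fsync"] = write_then_flush(path, os.fsync)
    return results


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for name, written in run_all(tmp).items():
            print(name, written)
```

Which names appear in the output depends on the platform, which mirrors the point that not every method is available everywhere. Wrapping each call in a timer would turn this into a rough speed comparison, though results from a scratch directory say little about the disk your WAL actually lives on.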

For a description of what’s going on at the programmer level, recommended reading is the second edition of “Advanced Programming in the UNIX Environment” by Stevens/Rago. Chapter 3, “File I/O”, covers all these concepts, and section 3.14 (“fcntl Function”) even includes a simple benchmark showing the difference in speed among these methods for Linux and Mac OS X.

fsync_writethrough

We’ve now covered four of the methods available for synchronizing the WAL. Just what is fsync_writethrough then? Any time there’s a caching write situation in the actual drive, you can’t necessarily assume fsync will actually make sure data makes it to the disk. Some operating systems recognize this, and have a way to call fsync in a way that forces the drive’s cache to flush. Right now, fsync_writethrough works on two platforms: Windows (Win32), where it maps to _commit, and Mac OS X (Darwin) or other platforms with F_FULLFSYNC, where it maps to fcntl(fd, F_FULLFSYNC, …). On other platforms, fsync_writethrough returns an error (unless fsync is turned off altogether). The option in Mac OS X that forces the drives to behave is commented on at … .mail-archive. … .html, and it was added as part of the fsync_writethrough code; see http://archives.postgresql. … .php for notes on that addition.

The default method for writing under Windows is open_datasync, which assumes write caching is not occurring on your drives; if the drives are caching writes, you don’t have proper data integrity. If you want to keep caching on but have PostgreSQL work as intended, switching to fsync_writethrough is one option. This used to be the preferred way to write under Windows, but now PostgreSQL uses a method to simulate O_DSYNC that acts similarly to how that type of write works on other platforms. http://archives.postgresql. … /msg00390.php has additional notes on this.

Direct I/O

Since we can’t use the operating system’s disk cache usefully, why not bypass it altogether? As of PostgreSQL 8.1, writing with O_DSYNC/O_SYNC will also use O_DIRECT when the platform supports it, which keeps WAL writes from using the operating system’s buffer cache. Since the WAL is never read, this can be a big benefit, especially for write-heavy loads. More information about O_DIRECT in a database context is at http://www.mgogala. … .pdf and the actual details of O_DIRECT writes are at http://publib.boulder. … .jsp?topic=/com. … aix.genprogc/doc/genprogc/fileio. …

Back to the beginning

Repeating where we started from, PostgreSQL lets you specify which of these methods it should use when writing to a WAL file via the configuration parameter wal_sync_method. Repeating the documentation again, which should make a lot more sense now:

· open_datasync (write WAL files with open() option O_DSYNC)
· fdatasync (call fdatasync() at each commit)
· fsync_writethrough (call fsync() at each commit, forcing write-through of any disk write cache)

· fsync (call fsync() at each commit)
· open_sync (write WAL files with open() option O_SYNC)

Not all of these choices are available on all platforms. The default is the first method in the above list that is supported by the platform.

The developers have put the possible write methods in order by how efficient they usually are. The database server prefers that it write with O_DSYNC, but it will use various fsync-type calls if it’s not available. Should all other methods not be available, it falls back to writing with O_SYNC. These defaults are good ones, and most setups will never need to touch this parameter (with the Windows case covered in the fsync_writethrough section being one glaring exception). The problem comes when you need to optimize use of PostgreSQL on hardware that does some of this work for you, like when using a storage array that actually has reliable disk caching. A basic introduction to fsync concepts from a performance perspective is at http://developer.osdl. … .pdf.

test_fsync

If you want to learn exactly what the capabilities of your platform are, the first thing you should do is compile and test out the test_fsync tool. It comes with the PostgreSQL source code, found under the src/tools/fsync directory. You use it like this:

    cd postgres/src/tools/fsync
    make
    ./test_fsync -f /databasedirectory/fsync_test.out

If you get results that are all basically 0, that probably means that you don’t really have correct permissions to do all operations necessary to run this tool in that directory. Results from this tool are measured in milliseconds. Here is a sample run:

    Simple write timing:
            write                   0.009100

    Compare fsync times on write() and non-write() descriptor:
    (If the times are similar, fsync() can sync data written
     on a different descriptor.)
            write, fsync, close     4.090127
            write, close, fsync     4.092272

    Compare one o_sync write to two:
            one 16k o_sync write    4.204583
            two 8k o_sync writes    8.308203

    Compare file sync methods with one 8k write:
            (o_dsync unavailable)
            write, fdatasync        4.293409
            write, fsync,           4.301195
            …                       4.325998

    Compare file sync methods with 2 8k writes:
            (o_dsync unavailable)
            open o_sync, write      8.…
            …                       4.295135
            …                       4.280166

This is from a Linux system, a platform that doesn’t have O_DSYNC. You can compare these results against the theoretical maximum performance based on the RPM of this disk drive, which is a 15K rpm drive:

    RPM     Rotations/sec   Rotation (ms)
    5400     90.0           11.1
    7200    120.0            8.3
    10000   166.7            6.0
    15000   250.0            4.0

At around 4.3 ms to write and sync, we’re pretty close to the maximum speed the drive is capable of (4 ms). If the results were better than predicted by an analysis based on RPM, you would have reason to believe there is some write caching at work. Write-caching that PostgreSQL doesn’t know about is deadly to its ability to ensure data integrity for your database.

So all you need to do here is pick the method with the fastest speed, and you’re done, right? Not necessarily. This is a test that writes a small amount of data. In the real world, when writing an enormous amount of data, you may discover that the implementation quirks of your hardware or the filesystem your database is stored on strongly prefer one mode over the others, regardless of what the low-level test suggests. The overhead of using the fsync and fdatasync calls gets larger as the amount of data you’re passing through a filesystem goes up, and it’s possible for writing in O_SYNC mode to outperform them under load, even though this simple test suggests O_SYNC is the slowest method to use. This is particularly true for platforms that support Direct I/O, where fsync-type methods have a very negative impact on the operating system cache.

Filesystem Characteristics

In order to completely understand what you’re working with, you’ll also need to take a look at the writing characteristics of the filesystem your WAL is on.

Linux ext3

If you’re running a journaling filesystem, and you’re also using the WAL, understand that in some ways you’re paying for data integrity twice.

Current generation ext3 defaults to flushing updates to disk before committing the update to the main filesystem, what ext3 calls ordered mode. This is overkill for the disk the WAL is on. Assuming you’ve put the WAL on a separate disk (or at least partition), you can reduce Linux’s overhead by changing /etc/fstab to mount the filesystem in writeback mode.

In addition, by default an ext3 filesystem will update the last access time attribute every time a file is written to. This is also unneeded for the WAL. That can be turned off as well. Combining these two, an optimal fstab for the WAL might look like this:

    /dev/hda2 /var ext3 defaults,data=writeback,noatime 1 2

While it’s obsolete at this point, http://developer.osdl. … -09-29.html is an interesting table showing how changing the ext3 options can affect performance in the various sync modes discussed here.

Veritas VxFS Filesystems

Veritas’s VxFS product is one of the better journaling filesystems available. In addition to good performance, it has enormous flexibility in terms of tuning. The best documentation on this subject is in the man page for mount_vxfs; further documentation from Veritas is at http://ftp.veritas.com … .pdf.

The first setting that normally affects data integrity is the option for writing the VxFS intent log. There are three options available: log, delaylog, and templog. Normally, using delaylog or templog can introduce the possibility of data corruption during a crash. But this is not the case for data written synchronously. From the man page:

In all cases, VxFS is fully POSIX compliant. The persistence guarantees for data or metadata modified by write(2), writev(2), or pwrite(2) are not affected by the logging mount option. The effects of these system calls are guaranteed to be persistent only if the O_SYNC, O_DSYNC, VX_DSYNC, or VX_DIRECT flag, as modified by the convosync= mount option, has been specified for the file descriptor.

This means that in the case of PostgreSQL, using either the open_datasync or open_sync WAL methods (which translate into O_DSYNC and O_SYNC writes) will push the burden of maintaining disk integrity to the VxFS software, which guarantees that data will survive. On platforms that don’t support O_DSYNC (like Linux and BSD), you will normally default to using fdatasync for synchronizing writes. It may be the case that with VxFS, using open_sync instead will be faster, because you’re letting the VxFS driver handle data integrity at its level rather than requiring the more complicated operating system level fdatasync transaction.

In fact, it may be possible to get even better PostgreSQL performance out of VxFS by using the convosync=dsync option when mounting the filesystem. Since Postgres only requires DSYNC level integrity for the WAL, using this conversion would serve to emulate the more efficient DSYNC behavior on operating systems that don’t necessarily support it. Because this convert option makes VxFS break POSIX compliance, the main danger here is that other applications doing synchronous writes to that disk might misbehave. Another possibility is convosync=direct, which will bypass the buffer cache (utterly appropriate as WAL data isn’t going to be read again later).

Relying on VxFS to enforce data integrity is not entirely without risk. For example, pages 5-6 of the HP-UX Performance Cookbook ( … .pdf) describe one failure mode where relying on the intent log can introduce application integrity issues. They recommend mounting VxFS with options delaylog, nodatainlog in order to keep that from happening. Here is what the VxFS File System Administrator’s Guide has to say on this issue (see http://ftp.veritas. … for this documentation, with http://ftp.veritas. … .pdf being the Linux version):

Use the nodatainlog mode on systems with disks that do not support bad block revectoring. Usually, a VxFS file system uses the intent log for synchronous writes. The inode update and the data are both logged in the transaction, so a synchronous write only requires one disk write instead of two. When the synchronous write returns to the application, the file system has told the applications that the data is already written. If a disk error causes the metadata update to fail, then the file must be marked bad and the entire file is lost. If a disk supports bad block revectoring, then a failure on the data update is unlikely, so logging synchronous writes should be allowed. If the disk does not support bad block revectoring, then a failure is more likely, so the nodatainlog mode should be used. A nodatainlog mode file system is approximately 50 percent slower than a standard mode VxFS file system for synchronous writes. Other operations are not affected.

The “revectoring” of blocks here is alternately referred to as bad block remapping or reallocated sectors, depending on who you’re talking to; those all mean the same thing. The revectoring case described here shouldn’t impact PostgreSQL, as it doesn’t care about the metadata when it needs to rebuild using the WAL. Database use of VxFS should certainly consider whether the possible performance hit of mounting with nodatainlog is worth the small possibility of corruption from drive failure.

Copyright 2007 Gregory Smith. Last update 5/15/