
Tuning PostgreSQL WAL Synchronization

This is an introduction to the low-level implementation of the write-ahead logging used in PostgreSQL 8.2. The database engine assures data integrity by a process that involves writing out data to a Write-Ahead Log (WAL), then writing to the main database. Because there are so many operating systems PostgreSQL runs on, the exact way the WAL is handled varies considerably from platform to platform. Accurately tuning how writes to the WAL are done, without impacting the data integrity requirements of the database, requires understanding quite a bit of low-level implementation detail; that's what's covered here in excruciating detail.

What's a WAL?

Much of this won't make sense unless you're already familiar with the terminology of the Write-Ahead Log. There are two sections of the PostgreSQL documentation that are prerequisites here, and I would recommend reading them from the latest documentation. First read chapter 27, Reliability and the Write-Ahead Log, then skim section 17.5, Write Ahead Log, to see what parameters can be adjusted. The section we're focusing on here is 17.5.1, specifically wal_sync_method (yes, this entire document is about one parameter). PostgreSQL lets you specify which method it should use when writing to a WAL file via the wal_sync_method configuration setting. From the documentation:

open_datasync (write WAL files with open() option O_DSYNC)
fdatasync (call fdatasync() at each commit)
fsync_writethrough (call fsync() at each commit, forcing write-through of any disk write cache)
fsync (call fsync() at each commit)
open_sync (write WAL files with open() option O_SYNC)

Not all of these choices are available on all platforms. The default is the first method in the above list that is supported by the platform.
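If the platform default isn't what you want, the method can be set explicitly in postgresql.conf. The value shown here is just an example choice, not a recommendation:

```
# postgresql.conf -- override the platform's default WAL sync method
wal_sync_method = fdatasync
```

After reloading the server configuration, running show wal_sync_method; from psql confirms which setting is active.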

This is certainly cryptic if you don't know something about UNIX file mechanics. To get started, you can determine which sync method is currently being used by your database (helpful when you're letting the server pick automatically) inside psql with:

show wal_sync_method;

Here's a summary of which options are available on various popular platforms. Look at the line for your platform and read from left to right; the first one that says Yes or Direct will be your default:

Platform          open_datasync   fdatasync   fsync_writethrough   fsync   open_sync
Linux             No              Yes         No                   Yes     Direct
Mac OS X/Darwin   No              No          Yes                  Yes     Yes (Direct?)
Solaris           Yes             Yes         No                   Yes     Yes
Windows           Yes (Direct?)   No          Yes                  Yes     Yes
BSD/OS 4.3        No              No          No                   Yes     Yes (Direct?)
FreeBSD 4.9       No              No          No                   Yes     Yes (Direct?)

There seems to be a problem with O_DIRECT not working right on Solaris. The above table will all make more sense after each method is explained. But first we need a diversion into hard disk technology and UNIX file writing techniques.

Write Caches and Disk Arrays

To understand why this decision is complicated at all, you need to know something about how hard drives deal with disk writes. Disk performance at all levels is greatly affected by caching. At the lowest level, hard drives have their own internal cache that's used for reads and writes. PostgreSQL expects that the disks connected to it can be trusted to correctly report when data has been written to them. Inexpensive IDE and SATA drives are widely criticized for PostgreSQL use because they often lie, reporting that data has been written when in fact it's just been put into the disk cache. A good introduction to this subject is at /04/03/writecache_enabled.html, and how this interacts with Postgres is a regular topic of discussion on the PostgreSQL performance mailing list. So the first rule here is that if your hardware isn't actually writing when it says it is, your performance numbers will be inflated--it's extra performance at the expense of reliability, and eventually you'll get database corruption. SCSI disks are considered much more accurate in terms of actually having written data when they say they have. This is one of the reasons the perceived performance gap between IDE and SCSI has narrowed recently; write caching artificially inflates the IDE results, so the better SCSI hardware doesn't look as impressive as it is. Unfortunately, the IDE drives are so dependent on this feature that turning it off gives unreasonably low results; FreeBSD in particular has been challenged by this issue.
If you're using any disk array or intelligent controller card, again you're back to needing to make sure that disk writes are being completed when they say they are. Better controllers and arrays have their own cache, and in those cases you normally need to make sure it's configured in write-through mode for proper Postgres integrity. A good reference implementation here for Linux is the LSI MegaRAID SCSI card line, which has a write-caching policy that's adjustable in the BIOS. One approach regularly debated on the PostgreSQL performance list uses a card or array that has a battery-backed disk cache on it. That allows you to get the instant fsync() response that allows extremely high database write rates, while still ensuring that data reported as written will properly make it to the disks eventually, even if there's a system failure. Critics of the battery backup approach suggest that if you run such a system under load, eventually you'll have a failure in this relatively complicated cache mechanism that will corrupt your database in a way that's nearly impossible to detect or recover from. Fans of a battery-backed cache suggest they can't achieve their target workloads without it, and that a good backup strategy is the cure for an event deemed as extremely rare as this possibility.

WAL writing

In order to completely tune how your system performs, it's necessary to learn a bit more than you probably wanted to know about how UNIX does file I/O operations.

The PostgreSQL Write-Ahead Log (WAL) does only a few basic operations to the logs it works on:

1) Open a log file
2) Write a series of records to the log
3) Close the log file
4) Switch to or create a new log file
5) Repeat
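As a rough illustration, that cycle can be sketched with the POSIX calls involved. This is a minimal Python sketch using a hypothetical file name and a made-up helper, not PostgreSQL's actual code; real WAL segments are fixed-size files managed internally by the server:

```python
import os

# Minimal sketch of the WAL write cycle described above.
# The path is hypothetical; this is only an illustration.
def write_wal_segment(path, records):
    # 1) Open a log file
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        # 2) Write a series of records, forcing each "commit" to disk
        for rec in records:
            os.write(fd, rec)
            os.fdatasync(fd)   # one of the sync methods discussed below
    finally:
        # 3) Close the log file
        os.close(fd)

# Steps 4) and 5) would switch to a fresh segment file and repeat.
write_wal_segment("/tmp/wal_demo.seg", [b"record1", b"record2"])
```

The interesting part for this document is the sync call after each record: that is the operation wal_sync_method selects an implementation for.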

Normally, when you ask to write a disk block to a drive, your operating system caches that information and writes it when it gets a chance. This write-caching greatly improves performance for normal applications. In order for the WAL mechanism to work correctly, Postgres needs to know when the data written has actually been committed to the disk drive.

Synchronous Writes

Situations where the application can't continue until the data has been completely written out are referred to in UNIX as synchronous writes. When you open a UNIX file using the POSIX standard calls, you can ask that all writes to that file be done in one of two synchronous modes: O_DSYNC and O_SYNC. The official spec doesn't tell you a lot about the difference between these two. A better discussion comes from IBM's documentation of how these settings impact journaling filesystems. From that document:

O_DSYNC: When a file is opened using the O_DSYNC open mode, the write() system call will not return until the file data and all file system meta-data required to retrieve the file data are both written to their permanent storage locations.

O_SYNC: In addition to items specified by O_DSYNC, O_SYNC specifies that the write() system call will not return until all file attributes relative to the I/O are written to their permanent storage locations, even if the attributes are not required to retrieve the file data.

The main thing you need to walk away from this description with is that O_DSYNC can be a much smaller operation than O_SYNC. O_SYNC requires all filesystem meta-data writes to be completed, while O_DSYNC just requires that the writing of your data be completed. Since database systems like PostgreSQL have their own method for dealing with metadata within the WAL, the database really doesn't need to wait for all the file-system information to be written; it just needs the data written.
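These two open flags can be exercised from any language that exposes the POSIX interface. Here is a minimal sketch in Python (the file name is hypothetical), assuming a platform such as Linux where both flags are defined:

```python
import os

path = "/tmp/sync_demo.dat"  # hypothetical demo file

# O_DSYNC: write() returns once the data, plus whatever metadata is
# needed to retrieve it, has reached stable storage.
fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_DSYNC, 0o600)
os.write(fd, b"data")
os.close(fd)

# O_SYNC: write() additionally waits for all other file attributes
# (ones not needed to retrieve the data) to reach stable storage.
fd = os.open(path, os.O_WRONLY | os.O_SYNC)
os.write(fd, b"sync")   # overwrites the first four bytes
os.close(fd)
```

Functionally both writes end up on disk; the difference is purely in how much metadata work each one waits for.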
The other way to cope with a synchronized writing situation is to open the file normally, write your data, and then ask the operating system to flush that data. There are two ways to do this under UNIX: fdatasync and fsync. These two operate similarly to the above, with fdatasync being like a write done with O_DSYNC, while fsync is like a write with O_SYNC. For a description of what's going on at the programmer level, recommended reading is the second edition of Advanced Programming in the UNIX Environment by Stevens/Rago. Chapter 3, File I/O, covers all these concepts, and section 3.14 (fcntl Function) even includes a simple benchmark showing the difference in speed among these methods for Linux and Mac OS X.

Direct I/O

Since we can't use the operating system's disk cache usefully, why not bypass it altogether? As of PostgreSQL 8.1, writing with O_DSYNC/O_SYNC will also use O_DIRECT when available, which keeps WAL writes from going through the operating system's buffer cache. Since the WAL is never read during normal operation, this can be a big benefit, especially for write-heavy loads.

fsync_writethrough

We've now covered four of the methods available for synchronizing the WAL. Just what is fsync_writethrough then? Any time there's a caching write situation in the actual drive, you can't necessarily assume fsync will actually make sure data makes it to the disk. Some operating systems recognize this, and have a way to call fsync so that it forces the drive's cache to flush. Right now, fsync_writethrough works on two platforms:

Windows (Win32), where it maps to _commit. This used to be the preferred way to write under Windows, but now PostgreSQL uses a method to simulate O_DSYNC that acts similarly to how that type of write works on other platforms.

Mac OS X (Darwin), or other platforms with F_FULLFSYNC, where it maps to fcntl(fd, F_FULLFSYNC). This is the option in Mac OS X that forces the drives to behave, and it was added as part of the fsync_writethrough code in PostgreSQL; see /msg00390.php for notes on that addition.

On other platforms, fsync_writethrough returns an error (unless fsync is turned off altogether). The default method for writing under Windows is open_datasync, which assumes write caching is not occurring on your drives; if the drives do cache, you don't have proper data integrity.
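The Darwin behavior can be sketched as follows. This is not PostgreSQL's actual code; the function name and file path are hypothetical, and the sketch simply falls back to a plain fsync() on platforms without F_FULLFSYNC (such as Linux):

```python
import fcntl
import os

def fsync_writethrough(fd):
    # On Mac OS X/Darwin, F_FULLFSYNC asks the drive itself to flush
    # its write cache, not just the operating system's buffers.
    if hasattr(fcntl, "F_FULLFSYNC"):
        fcntl.fcntl(fd, fcntl.F_FULLFSYNC)
    else:
        # No full-flush primitive available; plain fsync() only pushes
        # data as far as the drive's cache on lying hardware.
        os.fsync(fd)

fd = os.open("/tmp/writethrough_demo.dat",
             os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
os.write(fd, b"commit record")
fsync_writethrough(fd)
os.close(fd)
```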
If you want to keep caching on, but have PostgreSQL work as designed, switching to fsync_writethrough is one option.

Back to the beginning

Repeating where we started from, PostgreSQL lets you specify which of these methods it should use when writing to a WAL file via the wal_sync_method configuration setting. Repeating the documentation again, which should make a lot more sense now:

open_datasync (write WAL files with open() option O_DSYNC)
fdatasync (call fdatasync() at each commit)
fsync_writethrough (call fsync() at each commit, forcing write-through of any disk write cache)
fsync (call fsync() at each commit)
open_sync (write WAL files with open() option O_SYNC)

Not all of these choices are available on all platforms. The default is the first method in the above list that is supported by the platform.

The developers have put the possible write methods in order by how efficient they usually are. The database server prefers to write with O_DSYNC, but it will use the various fsync-type calls if that's not available. Should all other methods be unavailable, it falls back to writing with O_SYNC. These defaults are good ones, and most setups will never need to touch this parameter (with the Windows case covered in the fsync_writethrough section being one glaring exception). The problem comes when you need to optimize use of PostgreSQL on hardware that does some of this work for you, like when using a storage array that actually has reliable disk caching.

test_fsync

If you want to learn exactly what the capabilities of your platform are, the first thing you should do is compile and test out the test_fsync tool. It comes with the PostgreSQL source code, found under the src/tools/fsync directory. You use it like this:

cd postgres/src/tools/fsync
make
./test_fsync -f /databasedirectory/fsync_test.out

If you get results that are all basically 0, that probably means you don't have the permissions needed to do all the operations necessary to run this tool in that directory. Results from this tool are measured in milliseconds. Here is a sample run:

Simple write timing:
        write                   0.009100

Compare fsync times on write() and non-write() descriptor:
(If the times are similar, fsync() can sync data written on a different descriptor.)
        write, fsync, close     4.090127
        write, close, fsync     4.092272

Compare one o_sync write to two:
        one 16k o_sync write    4.204583
        two 8k o_sync writes    8.308203

Compare file sync methods with one 8k write:
(o_dsync unavailable)
        write, fdatasync        4.295135
        write, fsync            4.325998

Compare file sync methods with 2 8k writes:
(o_dsync unavailable)
        open o_sync, write      8.301195
        write, fdatasync        4.293409
        write, fsync            4.280166

This is from a Linux system, a platform that doesn't have O_DSYNC. You can compare these results against the theoretical maximum performance based on the rotation speed of the disk drive involved, which here is a 15K RPM drive:

RPM     Rotations/sec   Rotation (ms)
5400     90.0           11.1
7200    120.0            8.3
10000   166.7            6.0
15000   250.0            4.0
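The rotation column follows directly from the RPM figure: one full platter rotation takes 60000/RPM milliseconds, and a synchronous write that must wait for the platter to come back around can't do better than roughly one rotation. A quick sketch of the arithmetic:

```python
# Reproduce the RPM table above: rotations per second and the time
# for one full rotation in milliseconds.
def rotation_ms(rpm):
    return 60000.0 / rpm

for rpm in (5400, 7200, 10000, 15000):
    print(rpm, round(rpm / 60.0, 1), round(rotation_ms(rpm), 1))
```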

At around 4.3ms to write and sync, we're pretty close to the maximum speed the drive is capable of (4ms). If the results were better than predicted by an analysis based on RPM, you would have reason to believe there is some write caching at work. Write caching that PostgreSQL doesn't know about is deadly to its ability to ensure data integrity for your database. So all you need to do here is pick the method with the fastest speed, and you're done, right? Not necessarily. This is a test that writes a small amount of data. In the real world, when writing an enormous amount of data, you may discover that the implementation quirks of your hardware or the filesystem your database is stored on strongly prefer one mode over the others, regardless of what the low-level test suggests. The overhead of using the fsync and fdatasync calls gets larger as the amount of data you're passing through a filesystem goes up, and it's possible for writing in O_SYNC mode to outperform them under load, even though this simple test suggests O_SYNC is the slowest method to use. This is particularly true for platforms that support Direct I/O, where fsync-type methods have a very negative impact on the operating system cache.

Filesystem Characteristics

In order to completely understand what you're working with, you'll also need to take a look at the writing characteristics of the filesystem your WAL is on. If you're running a journaling filesystem and you're also using the WAL, understand that in some ways you're paying for data integrity twice.

Linux ext3

Current generation ext3 defaults to flushing updates to disk before committing the update to the main filesystem, what ext3 calls ordered mode. This is overkill for the disk the WAL is on. Assuming you've put the WAL on a separate disk (or at least partition), you can reduce Linux's overhead by changing /etc/fstab to mount the filesystem in writeback mode. Similarly, by default an ext3 filesystem will update the last access time attribute every time a file is accessed. This is also unneeded for the WAL, and can be turned off as well. Combining these two, an optimal fstab entry for the WAL might look like this:

/dev/hda2 /var ext3 defaults,data=writeback,noatime 1 2

While it's obsolete at this point, -09-29.html is an interesting table showing how changing the ext3 options can affect performance in the various sync modes discussed here.

Veritas VxFS Filesystems

Veritas's VxFS product is one of the better journaling filesystems available. In addition to good performance, it has enormous flexibility in terms of optimizing for performance. The first setting that normally affects data integrity is the option for writing the VxFS intent log. The best documentation on this subject is in the man page for mount_vxfs. There are three options available: log, delaylog, and tmplog. Normally, using delaylog or tmplog can introduce the possibility of data corruption during a crash. But this is not the case for data written synchronously. From the man page:

In all cases, VxFS is fully POSIX compliant. The persistence guarantees for data or metadata modified by write(2), writev(2), or pwrite(2) are not affected by the logging mount option. The effects of these system calls are guaranteed to be persistent only if the O_SYNC, O_DSYNC, VX_DSYNC, or VX_DIRECT flag, as modified by the convosync= mount option, has been specified for the file descriptor.

This means that in the case of PostgreSQL, using either the open_datasync or open_sync WAL methods (which translate into O_DSYNC and O_SYNC writes) will push the burden of maintaining disk integrity onto the VxFS software, which guarantees that the data will survive. On platforms that don't support O_DSYNC (like Linux and BSD), you will normally default to using fdatasync for synchronizing writes. It may be the case that with VxFS, using open_sync instead will be faster, because you're letting the VxFS driver handle data integrity at its level rather than requiring the more complicated operating system level fdatasync transaction.

In fact, it may be possible to get even better PostgreSQL performance out of VxFS by using the convosync=dsync option when mounting the filesystem. Since Postgres only requires DSYNC-level integrity for the WAL, using this conversion would serve to emulate the more efficient DSYNC behavior on operating systems that don't natively support it. Because this convert option makes VxFS break POSIX compliance, the main danger here is that other applications doing synchronous writes to that disk might misbehave. Another possibility is convosync=direct, which will bypass the buffer cache (utterly appropriate, as WAL data isn't going to be read again later); further documentation is available from Veritas. Relying on VxFS to enforce data integrity is not entirely without risk. For example, pages 5-6 of the HP-UX Performance Cookbook describe one failure mode where relying on the intent log can introduce application integrity issues. They recommend mounting VxFS with the options delaylog,nodatainlog in order to keep this from happening. Here is what the VxFS File System Administrator's Guide has to say on this issue:

Use the nodatainlog mode on systems with disks that do not support bad block revectoring. Usually, a VxFS file system uses the intent log for synchronous writes. The inode update and the data are both logged in the transaction, so a synchronous write only requires one disk write instead of two. When the synchronous write returns to the application, the file system has told the application that the data is already written. If a disk error causes the metadata update to fail, then the file must be marked bad and the entire file is lost. If a disk supports bad block revectoring, then a failure on the data update is unlikely, so logging synchronous writes should be allowed. If the disk does not support bad block revectoring, then a failure is more likely, so the nodatainlog mode should be used.

A nodatainlog mode file system is approximately 50 percent slower than a standard mode VxFS file system for synchronous writes. Other operations are not affected.

The revectoring of blocks here is alternately referred to as bad block remapping or reallocated sectors, depending on who you're talking to; those all mean the same thing. Database users of VxFS should certainly consider whether the possible performance hit of mounting with nodatainlog is worth the small possibility of corruption from drive failure. The revectoring case described here shouldn't impact PostgreSQL, as it doesn't care about the filesystem metadata when it needs to rebuild using the WAL.
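Putting these VxFS recommendations together, an fstab entry for a dedicated WAL volume might look like the following. The device and mount point names are hypothetical, and whether convosync=dsync is acceptable depends on what else writes synchronously to that filesystem:

```
/dev/vx/dsk/waldg/walvol /wal vxfs delaylog,nodatainlog,convosync=dsync 0 2
```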
Copyright 2007 Gregory Smith. Last update 5/15/2007.