• INTRODUCTION
• HISTORY
• FEATURES
  o STORAGE POOLS
  o CAPACITY
  o COPY-ON-WRITE TRANSACTIONAL MODEL
  o SNAPSHOTS AND CLONES
  o DYNAMIC STRIPING
  o VARIABLE BLOCK SIZES
  o LIGHTWEIGHT FILESYSTEM CREATION
  o ADDITIONAL CAPABILITIES
  o CACHE MANAGEMENT
  o ADAPTIVE ENDIANNESS
• LIMITATIONS
  o SOLARIS IMPLEMENTATION ISSUES
• PLATFORMS
  o OPENSOLARIS
  o FREEBSD
  o MAC OS X
  o LINUX
• REFERENCES
In computing, ZFS is a file system designed by Sun Microsystems for the Solaris Operating System. The features of ZFS include support for high storage capacities, integration of the concepts of filesystem and volume management, snapshots and copy-on-write clones, on-line integrity checking and repair, and RAID-Z. ZFS is implemented as open-source software, licensed under the Common Development and Distribution License (CDDL).
HISTORY
ZFS was designed and implemented by a team at Sun led by Jeff Bonwick. It was announced on September 14, 2004. Source code for ZFS was integrated into the main trunk of Solaris development on October 31, 2005, and released as part of build 27 of OpenSolaris on November 16, 2005, one year after the opening of the OpenSolaris community. Sun announced that ZFS was included in the 6/06 update to Solaris 10 in June 2006. The name originally stood for "Zettabyte File System", but is now an orphan acronym.
FEATURES

STORAGE POOLS
Unlike traditional file systems, which reside on single devices and thus require a volume manager to use more than one device, ZFS filesystems are built on top of virtual storage pools called zpools. A zpool is constructed of virtual devices (vdevs), which are themselves constructed of block devices: files, hard drive partitions, or entire drives, with the last being the recommended usage. Block devices within a vdev may be configured in different ways, depending on needs and space available: non-redundantly (similar to RAID 0), as a mirror (RAID 1) of two or more devices, as a RAID-Z group of three or more devices, or as a RAID-Z2 group of four or more devices. The storage capacity of all vdevs is available to all of the file system instances in the zpool. A quota can be set to limit the amount of space a file system instance can occupy, and a reservation can be set to guarantee that space will be available to a file system instance.
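As an illustration, such a pool can be assembled and divided with the standard zpool and zfs utilities; the pool, device, and filesystem names below are hypothetical:

    # Create a pool named "tank" from a two-way mirror of whole drives
    zpool create tank mirror c0t0d0 c0t1d0

    # Grow the pool by adding a RAID-Z vdev of three more drives
    zpool add tank raidz c0t2d0 c0t3d0 c0t4d0

    # Carve out a filesystem instance, then cap and guarantee its space
    zfs create tank/home
    zfs set quota=10G tank/home
    zfs set reservation=2G tank/home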
CAPACITY
ZFS is a 128-bit file system, so it can store 18 billion billion (1.84 × 10^19) times more data than current 64-bit systems. The limitations of ZFS are designed to be so large that they will not be encountered in practice for some time. Some theoretical limits in ZFS are:
• 2^64 — Number of snapshots of any file system
• 2^48 — Number of entries in any individual directory
• 16 EiB (2^64 bytes) — Maximum size of a file system
• 16 EiB — Maximum size of a single file
• 16 EiB — Maximum size of any attribute
• 256 ZiB (2^78 bytes) — Maximum size of any zpool
• 2^56 — Number of attributes of a file (actually constrained to 2^48 for the number of files in a ZFS file system)
• 2^56 — Number of files in a directory (actually constrained to 2^48 for the number of files in a ZFS file system)
• 2^64 — Number of devices in any zpool
• 2^64 — Number of zpools in a system
• 2^64 — Number of file systems in a zpool
Project leader Bonwick said, "Populating 128-bit file systems would exceed the quantum limits of earth-based storage. You couldn't fill a 128-bit storage pool without boiling the oceans." Later he clarified:

"Although we'd all like Moore's Law to continue forever, quantum mechanics imposes some fundamental limits on the computation rate and information capacity of any physical device. In particular, it has been shown that 1 kilogram of matter confined to 1 litre of space can perform at most 10^51 operations per second on at most 10^31 bits of information. A fully populated 128-bit storage pool would contain 2^128 blocks = 2^137 bytes = 2^140 bits; therefore the minimum mass required to hold the bits would be (2^140 bits) / (10^31 bits/kg) = 136 billion kg.

To operate at the 10^31 bits/kg limit, however, the entire mass of the computer must be in the form of pure energy. By E = mc^2, the rest energy of 136 billion kg is 1.2 × 10^28 J. The mass of the oceans is about 1.4 × 10^21 kg. It takes about 4,000 J to raise the temperature of 1 kg of water by 1 degree Celsius, and thus about 400,000 J to heat 1 kg of water from freezing to boiling. The latent heat of vaporization adds another 2 million J/kg. Thus the energy required to boil the oceans is about 2.4 × 10^6 J/kg × 1.4 × 10^21 kg = 3.4 × 10^27 J. Thus, fully populating a 128-bit storage pool would, literally, require more energy than boiling the oceans."
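Restated as equations, the figures in the quote check out (same numbers, rounded to two significant figures):

\[ m_{\min} = \frac{2^{140}\ \text{bits}}{10^{31}\ \text{bits/kg}} \approx 1.4 \times 10^{11}\ \text{kg} \]
\[ E_{\text{rest}} = m_{\min} c^{2} \approx 1.4 \times 10^{11}\ \text{kg} \times (3 \times 10^{8}\ \text{m/s})^{2} \approx 1.2 \times 10^{28}\ \text{J} \]
\[ E_{\text{boil}} \approx 2.4 \times 10^{6}\ \text{J/kg} \times 1.4 \times 10^{21}\ \text{kg} \approx 3.4 \times 10^{27}\ \text{J} < E_{\text{rest}} \]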
COPY-ON-WRITE TRANSACTIONAL MODEL
ZFS uses a copy-on-write transactional object model. All block pointers within the filesystem contain a 256-bit checksum of the target block, which is verified when the block is read. Blocks containing active data are never overwritten in place; instead, a new block is allocated, modified data is written to it, and then any metadata blocks referencing it are similarly read, reallocated, and written. To reduce the overhead of this process, multiple updates are grouped into transaction groups, and an intent log is used when synchronous write semantics are required.

SNAPSHOTS AND CLONES
An advantage of copy-on-write is that when ZFS writes new data, the blocks containing the old data can be retained, allowing a snapshot version of the file system to be maintained. ZFS snapshots are created very quickly, since all the data composing the snapshot is already stored; they are also space efficient, since any unchanged data is shared among the file system and its snapshots. Writeable snapshots ("clones") can also be created, resulting in two independent file systems that share a set of blocks. As changes are made to any of the clone file systems, new data blocks are created to reflect those changes, but any unchanged blocks continue to be shared, no matter how many clones exist.
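For instance, with the standard zfs utility (dataset names hypothetical), a snapshot and a clone are each a single command:

    # A snapshot is near-instantaneous; it shares all blocks with the live filesystem
    zfs snapshot tank/home@monday

    # A clone is a writable filesystem that shares unchanged blocks with its origin snapshot
    zfs clone tank/home@monday tank/home-test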
DYNAMIC STRIPING
Dynamic striping across all devices maximizes throughput: as additional devices are added to the zpool, the stripe width automatically expands to include them. Thus all disks in a pool are used, which balances the write load across them.

VARIABLE BLOCK SIZES
ZFS uses variable-sized blocks of up to 128 kilobytes. The currently available code allows the administrator to tune the maximum block size used, as certain workloads do not perform well with large blocks; automatic tuning to match workload characteristics is contemplated. If data compression (LZJB) is enabled, variable block sizes are used: if a block can be compressed to fit into a smaller block size, the smaller size is used on disk to use less storage and improve IO throughput (though at the cost of increased CPU use for the compression and decompression operations).
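For example, both the maximum block size and compression are per-filesystem properties (pool and filesystem names hypothetical):

    # Match the maximum block size to an 8 KB database page size
    zfs set recordsize=8K tank/db

    # Enable LZJB compression on the same filesystem
    zfs set compression=on tank/db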
LIGHTWEIGHT FILESYSTEM CREATION
In ZFS, filesystem manipulation within a storage pool is easier than volume manipulation within a traditional filesystem; the time and effort required to create or resize a ZFS filesystem is closer to that of making a new directory than it is to volume manipulation in some other systems.
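For instance, creating and "resizing" filesystems reduces to one-line commands (names hypothetical):

    # Creating a filesystem is about as cheap as creating a directory
    zfs create tank/projects
    zfs create tank/projects/alpha

    # "Resizing" is just changing a property; there is no volume to grow or shrink
    zfs set quota=50G tank/projects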
ADDITIONAL CAPABILITIES
• Explicit I/O priority with deadline scheduling.
• Claimed globally optimal I/O sorting and aggregation.
• Multiple independent prefetch streams with automatic length and stride detection.
• Parallel, constant-time directory operations.
• End-to-end checksumming, allowing data corruption detection (and recovery if you have redundancy in the pool).
• Intelligent scrubbing and resilvering.
• Load and space usage sharing between disks in the pool.
• Ditto blocks: metadata is replicated inside the pool, two or three times (according to metadata importance). If the pool has several devices, ZFS tries to replicate over different devices. So a pool without redundancy can lose data if you find bad sectors, but metadata should be fairly safe even in this scenario.
• When entire disks are added to a ZFS pool, ZFS automatically enables their write cache. This is not done when ZFS only manages discrete slices of the disk, since it doesn't know if other slices are managed by non-write-cache-safe filesystems, like UFS. The ZFS design (copy-on-write + superblocks) is safe when using disks with write cache enabled, if they support the cache flush commands issued by ZFS. This feature provides safety and a performance boost compared with some other filesystems.
• Filesystem encryption is supported, though it is currently in a beta stage.

CACHE MANAGEMENT
ZFS also uses the ARC, a new method for cache management, instead of the traditional Solaris virtual memory page cache.
ADAPTIVE ENDIANNESS
Pools and their associated ZFS file systems can be moved between different platform architectures, including systems implementing different byte orders. The ZFS block pointer format stores filesystem metadata in an endian-adaptive way: individual metadata blocks are written with the native byte order of the system writing the block. When reading, if the stored endianness doesn't match the endianness of the system, the metadata is byteswapped in memory. This does not affect the stored data itself; as is usual in POSIX systems, files appear to applications as simple arrays of bytes, so applications creating and reading data remain responsible for doing so in a way independent of the underlying system's endianness.
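In practice, moving a pool across architectures is just an export on one host and an import on the other, with no conversion step; the pool name is hypothetical:

    # On the SPARC (big-endian) host
    zpool export tank

    # On the x86 (little-endian) host; metadata is byteswapped in memory as it is read
    zpool import tank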
LIMITATIONS
• ZFS doesn't support per-user or per-group quotas. Instead, it is possible to create user-owned filesystems, each with its own size limit (a sketch of this approach follows the list). Intrinsically, there is no practical quota solution for file systems shared among several users (such as team projects, for example), where the data cannot be separated per user, although it could be implemented on top of the ZFS stack.
• Capacity expansion is normally achieved by adding groups of disks as a vdev (stripe, RAID-Z, RAID-Z2, or mirrored). Newly written data will dynamically start to use all available vdevs. It is also possible to expand the array by iteratively swapping each drive in the array with a bigger drive and waiting for ZFS to heal itself; the heal time will depend on the amount of stored information, not the disk size. If a snapshot is taken during this process, it will cause the heal to be restarted.
• It is currently not possible to reduce the number of vdevs in a pool nor otherwise reduce pool capacity. This capability is being worked on by the ZFS team, but is still not available as of Solaris 10 05/08 (AKA update 5).
• It is not possible to add a disk to a RAID-Z or RAID-Z2 vdev. This feature appears very difficult to implement. You can, however, create a new RAID-Z vdev and add it to the zpool.
• You cannot mix vdev types in a zpool. For example, if you had a striped ZFS pool consisting of disks on a SAN, you cannot add the local disks as a mirrored vdev.
• Reconfiguring storage requires copying data offline, destroying the pool, and recreating the pool with the new policy.
• ZFS is not a native cluster, distributed, or parallel file system and cannot provide concurrent access from multiple hosts, as ZFS is a local file system. Sun's Lustre distributed filesystem will adapt ZFS as back-end storage for both data and metadata in version 1.8, which will be released in Q2 2008.
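As a sketch of the workarounds above (all names hypothetical): per-user filesystems stand in for user quotas, and capacity grows by replacing drives one at a time:

    # One filesystem per user approximates a per-user quota
    zfs create tank/home/alice
    zfs set quota=20G tank/home/alice

    # Swap a drive for a bigger one; wait for the resilver before the next swap
    zpool replace tank c0t2d0 c0t9d0
    zpool status tank    # shows resilver progress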
SOLARIS IMPLEMENTATION ISSUES
The current ZFS implementation (Solaris 10 11/06) has some issues administrators should know about before deploying it. Many of these issues are scheduled to be addressed in future releases.
• ZFS root filesystem support is currently set to off on Solaris 10 default installations, since the standard installer still does not fully support ZFS roots. The ZFS Boot project successfully added boot support to the OpenSolaris project in March 2007; bootable ZFS file systems are available for x86 systems in Solaris Indiana, and the Solaris Nevada installer supports ZFS boot on both SPARC and x86 platforms as of build 90. Originally targeted for Solaris 10 update 5, this support did not make that release, so it is unlikely that this feature will appear in any Solaris 10 release until at least 2009. Refer to this zfs-discuss thread on OpenSolaris for updates.
• If a Solaris Zone is put on ZFS, the system cannot be upgraded; the OS will need to be reinstalled. This issue was planned to be addressed in a Solaris 10 update in 2007, but is still not fixed in Solaris Nevada (up to build 95).
• New vdevs can be added to a storage pool, but they cannot be removed. A vdev can be exchanged for a bigger one, but it cannot be removed (even if the size to be removed is less than the pool's unused space). The ability to shrink a zpool is a work in progress, but is said to be a very challenging problem.
• A file "fsync" will commit to disk all pending modifications on the filesystem. That is, an "fsync" on a file will flush out all deferred (cached) operations to the filesystem (not the pool) in which the file is located. This can make some fsync() calls slow when running alongside a workload which writes a lot of data to filesystem cache. The issue is currently fixed in the OpenSolaris code base.
• ZFS uses a lot of CPU when doing small writes (for example, a single byte). There are two root causes, currently being worked on: a) translating from znode to dnode is slower than necessary because ZFS doesn't use translation information it already has, and b) the current partial-block update code is very inefficient.
• ZFS encourages the creation of many filesystems inside the pool (for example, for quota control), but importing a pool with thousands of filesystems is a slow operation (it can take minutes).
• ZFS blocksize is configurable per filesystem, currently 128 KB by default. Reads or writes which are smaller than the block size suffer a performance penalty. If your workload reads/writes data in fixed sizes (blocks), for example a database, you should (manually) configure the ZFS blocksize equal to the application blocksize, for better performance and to conserve cache memory and disk bandwidth (see the recordsize example under VARIABLE BLOCK SIZES above).
• ZFS copy-on-write operation can degrade on-disk file layout (file fragmentation) when files are modified, decreasing performance.
• Swapping over ZVOL pseudo-devices can hang the system. This was resolved in the OpenSolaris code base at the same time that ZFS root/boot support was added for SPARC: a separate zvol is required for swap and dump, and zvols can also be used as the system crash dump device.
• Current ZFS compression/decompression code is very fast, but the compression ratio is not comparable to gzip or similar algorithms. The gzip compression algorithm was added in Solaris Nevada as part of 6536606 and is planned for a Solaris 10 update in Spring 2008.
• Not all symbolic links are protected by ditto blocks. This is fixed in Solaris Nevada via 6520519.
• ZFS only offlines a faulty hard disk if it can't be opened. Read/write errors or slow/timed-out operations do not currently cause a disk to be marked as faulty.
• If a snapshot is taken or destroyed while the zpool is scrubbing/resilvering, the process will be restarted from the beginning. This bug is now resolved in the OpenSolaris code base.
• There is work in progress to provide automatic and periodic disk scrubbing, in order to provide corruption detection and early disk-rotting detection. Currently, data scrubbing must be done manually with the "zpool scrub" command (see the example after this list).
• If a non-redundant disk in a zpool goes offline, the entire operating system will panic on the next read or write. This can be a problem when, for example, a large server has multiple filesystems used for different purposes; one filesystem failure shouldn't cause the entire system to go down. This is fixed with the zpool "failmode" option added in Nevada b77 (see the example after this list).
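Assuming a pool named "tank", the two mitigations mentioned above look like this:

    # Manual scrub, until automatic periodic scrubbing arrives
    zpool scrub tank

    # Nevada b77 and later: return errors instead of panicking when the pool fails
    zpool set failmode=continue tank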
PLATFORMS
ZFS is part of Sun's own Solaris operating system and is thus available on both SPARC and x86-based systems. Since the code for ZFS is open source, a port to other operating systems and platforms can be produced without Sun's involvement.

OPENSOLARIS
OpenSolaris 2008.05 uses ZFS as its default filesystem. There are a half dozen third-party distributions. Nexenta OS, a complete GNU-based open source operating system built on top of the OpenSolaris kernel and runtime, includes a ZFS implementation, added in version alpha1. More recently, Nexenta Systems announced NexentaStor, their ZFS storage appliance providing NAS/SAN/iSCSI capabilities and based on Nexenta OS; NexentaStor includes a GUI that simplifies the process of utilizing ZFS. In February 2008, Nexenta announced a significant release, NexentaCore Platform 1.0, of their operating system, which serves as a basis for software appliances from Nexenta and other distributions. As of June 15, 2008, "NexentaCore 2.0 Alpha1 Release (unstable)" is available from Nexenta downloads.
FREEBSD
Pawel Jakub Dawidek has ported and committed ZFS to FreeBSD in an experimental capacity for inclusion in FreeBSD 7.0, released on February 28, 2008. The current recommendation is to use it only on amd64 platforms with sufficient memory, but there is a newer port (not yet committed) which fixes the memory issue. As a part of the 2007 Google Summer of Code, a ZFS port was started for NetBSD.

MAC OS X
In a post on the opensolaris.org zfs-discuss mailing list, Apple Inc. announced that it was porting ZFS to its Mac OS X operating system. From Mac OS X 10.5 Developer Seed 9A321, support for ZFS has been included, but it lacks the ability to act as a root partition, as noted above. Also, attempts to format local drives using ZFS were unsuccessful; this is a known bug. On June 6, 2007, Sun's CEO Jonathan I. Schwartz announced that Apple would make ZFS "the" filesystem in Mac OS 10.5 Leopard. Marc Hamilton, VP for Solaris Marketing, later wrote to clarify that, in his opinion, Apple is planning to use ZFS in future versions of Mac OS X, but not necessarily as the default filesystem for Mac OS X 10.5 Leopard.
1 for Leopard" is currently only working on version 10. . The current Mac OS Forge release of the Mac OS X ZFS project is version 117 and synchronized with the OpenSolaris ZFS SVN version 72 .5. Apple has also unveiled support for ZFS in its development version of Mac OS X "Snow Leopard".clarify that.0. but Apple has also released the "ZFS Read/Write Developer Preview 1. The installer for the "ZFS Read/Write Developer Preview 1. In the release version of Mac OS X 10. and has not been updated for version 10. Apple provides read-write binaries and source.5 Leopard. ZFS is available in read-only mode from the command line.which allows read-write access and the creation of zpools. Apple is planning to use ZFS in future versions of Mac OS X. As of January 2008.5. which lacks the possibility to create zpools or write to them.1 for Leopard". in his opinion. Alex Blewitt put together an installer for the 102-A binaries. which doesn't need any handholding to install. but they must be installed by hand. but not necessarily as the default filesystem for Mac OS X 10. an OS optimized for machines with multi-core processors.5.1 and above.
LINUX
Porting ZFS to Linux is complicated by the fact that the GNU General Public License, which governs the Linux kernel, prohibits linking with code under certain licenses, such as the CDDL, the license ZFS is released under. One solution to this problem is to port ZFS to Linux's FUSE system, so the filesystem runs in userspace instead. A project to do this was sponsored by Google's Summer of Code program in 2006, and is in beta stage as of March 2008; the ZFS on FUSE project is available here. Running a file system outside the kernel on traditional Unix-like systems can have a significant performance impact. However, NTFS-3G (another file system driver built on FUSE) performs well when compared to traditional file system drivers, which suggests that excellent performance is possible with ZFS on Linux after proper optimization. Sun Microsystems has stated that a Linux port is being investigated.
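A minimal ZFS on FUSE session might look like the following, assuming the userspace daemon is installed as "zfs-fuse" (packaging details vary; the pool and device names are hypothetical):

    # Start the userspace daemon, then use the familiar tools
    zfs-fuse &
    zpool create tank /dev/sdb
    zfs create tank/data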
REFERENCES
• "OpenSolaris.org". Sun Microsystems. Retrieved on 2006-04-30.
• "Sun Celebrates Successful One-Year Anniversary of OpenSolaris". Sun Microsystems (June 20, 2006).
• Jeff Bonwick (October 31, 2005). "ZFS: The Last Word in Filesystems". Jeff Bonwick's Blog. Retrieved on 2006-04-30.
• "ZFS: the last word in file systems". Sun Microsystems (September 14, 2004). Retrieved on 2006-09-08.
• Jeff Bonwick (2006-05-04). "You say zeta, I say zetta". Jeff Bonwick's Blog. Retrieved on 2007-10-02.
• "Solaris ZFS Administration Guide". Sun Microsystems. Retrieved on 2007-10-02.
• "Solaris ZFS Administration Guide". Sun Microsystems. Retrieved on 2007-10-21.
• "Solaris ZFS Administration Guide". Sun Microsystems. Retrieved on 2007-10-05.
• "ZFS Best Practices Guide". Solaris Performance Wiki. Retrieved on 2007-10-05.
• Seth Lloyd (2000). "Ultimate physical limits to computation". Nature 406: 1047-1054. doi:10.1038/35023282.
• Jeff Bonwick (September 25, 2004). "128-bit storage: are you high?". Sun Microsystems. Retrieved on 2006-07-12.
• "Smokin' Mirrors". Jeff Bonwick's Weblog (2006-05-02). Retrieved on 2007-02-23.
• "ZFS Block Allocation". Jeff Bonwick's Weblog (2006-11-04). Retrieved on 2007-02-23.
• "Ditto Blocks - The Amazing Tape Repellent". Flippin' off bits Weblog (2006-05-12). Retrieved on 2007-03-01.
• "Architecture ZFS for Lustre". Sun Microsystems. Retrieved on 2008-02-18.