
University of Wisconsin - Madison

RELIABILITY ANALYSIS OF ZFS
CS 736 Project
Reliability Analysis of ZFS: Summary

• Perform a reliability analysis of ZFS
• Test existing reliability claims
• Layered driver interface: simulate transient block corruptions at various levels of the ZFS on-disk hierarchy
• Results
  • Classes of faults handled by ZFS
  • Measure of the robustness of ZFS
  • Lessons on building a reliable, robust file system



Coming Up: Outline of the talk

 ZFS Organization
 ZFS On Disk format
 ZFS features and specs regarding reliability.
 Experimental Setup and Experiments
 Results and Conclusions
 Future Work



ZFS Organization: Pooled Storage Model

[Figure: several ZFS file systems sharing one ZFS pool]

- Pooled storage model
- A disk is a ZFS pool comprising many file systems.



ZFS Organization: Object based

• Transaction-based object file system
• Every structure is an object.
• An operation on one or more objects is a transaction.
• Transactions are grouped into transaction groups.
• All data and metadata blocks are checksummed.
  • No silent corruption
• Modifications are always copy on write (COW).
  • Always consistent on disk
• All metadata, and optionally data, is compressed.

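The copy-on-write point above can be illustrated with a toy block store. This is a minimal sketch, not ZFS code: the `BlockStore` class, its integer addresses, and the two-level root-to-leaf layout are hypothetical names and simplifications chosen for the example.

```python
class BlockStore:
    """Toy append-only store: blocks are never overwritten in place."""

    def __init__(self):
        self.blocks = {}      # address -> stored value
        self.next_addr = 0

    def write(self, value):
        """Always allocate a fresh address; existing blocks stay untouched."""
        addr = self.next_addr
        self.next_addr += 1
        self.blocks[addr] = value
        return addr

    def read(self, addr):
        return self.blocks[addr]


def cow_update(store, root_addr, index, new_data):
    """Replace leaf `index` under the root by copying, not overwriting.

    The root block is a tuple of leaf addresses (an assumption of this
    sketch). The new leaf and a new root go to fresh addresses, so the
    old root still describes the old, fully intact tree.
    """
    leaves = list(store.read(root_addr))
    leaves[index] = store.write(new_data)   # new leaf; old leaf untouched
    return store.write(tuple(leaves))       # new root pointing at new leaf


def tree_contents(store, root_addr):
    """Read every leaf reachable from the given root."""
    return [store.read(addr) for addr in store.read(root_addr)]


store = BlockStore()
leaf_a = store.write(b"alpha")
leaf_b = store.write(b"beta")
old_root = store.write((leaf_a, leaf_b))

new_root = cow_update(store, old_root, 1, b"BETA v2")

print(tree_contents(store, old_root))   # old tree is still readable
print(tree_contents(store, new_root))   # new tree sees the update
```

Because the old root is never modified, a reader holding it always sees a consistent tree; switching to the new version amounts to publishing the new root address.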


ZFS Structures

• The entire file system is represented as
  • Objects: dnode_phys_t
  • Object sets: dnode_phys_t[]
• P/L analogy: each object is a template. The bonus buffer describes specific attributes.



ZFS Structures: Blocks and block pointers

• Data is transferred to disk in units of blocks.
• Block pointers (blkptr_t) are used to locate, verify, and describe blocks.
• They contain checksum and compression information.
• The physical size of a block can differ from its logical size.
• Gang blocks

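The difference between physical and logical block size follows from compression. The sketch below shows the idea only; it uses Python's `zlib` as a stand-in compressor (an assumption of this example, not the algorithm ZFS applies), and the variable names `lsize` and `psize` simply mirror the fields named on the slide.

```python
import zlib

# A repetitive payload, standing in for a block that compresses well.
logical_block = b"ZFS metadata tends to be repetitive. " * 100

# What would actually be written to disk in this toy model.
physical_block = zlib.compress(logical_block)

lsize = len(logical_block)    # logical size: bytes the reader gets back
psize = len(physical_block)   # physical size: bytes stored on disk

# Reading reverses the transformation and yields the logical block again.
restored = zlib.decompress(physical_block)

print(f"lsize={lsize} psize={psize}")
```

A pointer that records both sizes lets the reader know how many bytes to fetch from disk and how large a buffer the decompressed result needs.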


ZFS Structures: Block pointers

• Data Virtual Address (DVA): the combination of fields in blkptr_t that locates a block on disk.
• Wideness: a blkptr_t can store up to three copies of the data, each pointed to by a unique DVA. These copies are called "ditto blocks".
  • Three for pool-wide metadata
  • Two for file-system-wide metadata
  • One for data (configurable)

[Figure: blkptr_t layout with three DVAs (vdev1/offset1/asize, vdev2/offset2/asize, vdev3/offset3/asize), followed by the lvl, type, cksum, comp, psize, and lsize fields]
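The block pointer layout and the 3-2-1 ditto policy described above can be modeled roughly as follows. This is a simplified sketch, not the real blkptr_t: the `DVA` and `BlockPointer` classes, the `DITTO_COPIES` table, and the placement rule in `allocate_copies` are all hypothetical and keep only the fields named on the slide.

```python
from dataclasses import dataclass, field
from typing import List

MAX_DVAS = 3  # a block pointer holds at most three DVAs

# Number of copies per block category, following the 3-2-1 rule on the slide.
DITTO_COPIES = {"pool_metadata": 3, "fs_metadata": 2, "data": 1}


@dataclass
class DVA:
    vdev: int     # device holding this copy
    offset: int   # location of the copy on that device
    asize: int    # allocated size of the copy


@dataclass
class BlockPointer:
    dvas: List[DVA] = field(default_factory=list)  # one DVA per copy
    checksum: bytes = b""
    compression: str = "off"
    psize: int = 0
    lsize: int = 0


def allocate_copies(kind: str, size: int) -> BlockPointer:
    """Build a pointer with as many DVAs as the policy asks for.

    Placement is invented for the sketch: all copies sit on vdev 0
    (matching a single-disk setup) at widely separated offsets.
    """
    copies = DITTO_COPIES[kind]
    assert copies <= MAX_DVAS
    dvas = [DVA(vdev=0, offset=i * 1_000_000, asize=size) for i in range(copies)]
    return BlockPointer(dvas=dvas, psize=size, lsize=size)


for kind in DITTO_COPIES:
    pointer = allocate_copies(kind, 512)
    print(kind, [dva.offset for dva in pointer.dvas])
```

Keeping the copy count in the pointer itself means a reader needs no extra lookup to find the alternates when the first copy turns out to be bad.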



ZFS Structures: Wideness



ZFS Structures: Attributes on disk

• ZAP (ZFS Attribute Processor)
  • ZAP objects handle arbitrary (name, object) associations within an object set (objset).
  • Most commonly used to implement directories
  • Also used extensively throughout the DSL (Dataset and Snapshot Layer)

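Since ZAP objects store (name, object) associations and are most commonly used for directories, path lookup reduces to a chain of name lookups. The sketch below is illustrative only: `ZapObject`, `resolve`, and the object numbers are hypothetical, and a plain dictionary stands in for the on-disk ZAP format.

```python
class ZapObject:
    """Toy stand-in for a ZAP object: maps names to object numbers."""

    def __init__(self):
        self._entries = {}

    def add(self, name, object_number):
        if name in self._entries:
            raise KeyError(f"{name!r} already exists")
        self._entries[name] = object_number

    def lookup(self, name):
        return self._entries[name]


def resolve(path, root_object, directories):
    """Walk a path one component at a time.

    `directories` maps the object number of each directory to the
    ZapObject that holds its entries.
    """
    current = root_object
    for name in path.strip("/").split("/"):
        current = directories[current].lookup(name)
    return current


# Object 1 is the root directory, object 2 is the "docs" directory.
root_dir = ZapObject()
docs_dir = ZapObject()
root_dir.add("docs", 2)
docs_dir.add("report.txt", 7)
directories = {1: root_dir, 2: docs_dir}

print(resolve("/docs/report.txt", 1, directories))
```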


Putting it all together: Objects

• Everything in ZFS is an object.
• A dnode describes and organizes a collection of blocks making up an object.

[Figure: objects]



Putting it all together: Object Sets

• Related objects are grouped to form objsets.
• File systems, volumes, clones, and snapshots are objsets.

[Figure: an object set containing objects]



Putting it all together: Datasets

• A dataset encapsulates an objset and provides
  • Space usage
  • Snapshot information

[Figure: a dataset containing an object set (with its objects), a space map, and snapshot information]



Putting it all together: Dataset directories

• Groups datasets
• Properties such as quotas and compression
• Dataset relationships

[Figure: a dataset directory containing a dataset (object set, objects, space map, snapshot information), properties, and a child map]

A road less travelled: From vdev label to data



To sum up: Moving forward

• Layers of indirection
• End-to-end checksums, stored separately from the data they cover
• Wideness (ditto blocks): 3-2-1
• Compression
• Copy on write
• Scrub facility



Experimental Setup

• Corruption framework
  • Corrupter driver
    • Modifies physical disk blocks
  • Analyzer app
    • Understands on-disk ZFS structures
  • Consumer app
    • Monitors ZFS responses and error codes
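The corrupter's job, overwriting chosen physical blocks, can be approximated in user space against a file standing in for a disk. This is only a sketch of the idea and not the layered driver used in the project: `corrupt_block`, `read_block`, the 512-byte `BLOCK_SIZE`, and the garbage pattern are assumptions made for the example.

```python
import os
import tempfile

BLOCK_SIZE = 512  # assumed block size for this sketch


def read_block(image_path, block_number):
    """Return the raw bytes of one block from the image file."""
    with open(image_path, "rb") as image:
        image.seek(block_number * BLOCK_SIZE)
        return image.read(BLOCK_SIZE)


def corrupt_block(image_path, block_number, garbage=b"\xde\xad\xbe\xef"):
    """Overwrite the start of one block and return its original contents."""
    original = read_block(image_path, block_number)
    with open(image_path, "r+b") as image:
        image.seek(block_number * BLOCK_SIZE)
        image.write(garbage[:BLOCK_SIZE])
    return original


# Build a four-block, zero-filled image to act as the "disk".
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(4 * BLOCK_SIZE))
    image_path = tmp.name

before = corrupt_block(image_path, 2)   # damage block 2 only
after = read_block(image_path, 2)
untouched = read_block(image_path, 1)   # neighbouring block for comparison
os.remove(image_path)

print(before[:4], after[:4], untouched[:4])
```

Returning the original bytes makes it possible to restore the block afterwards, which helps when each experiment should start from the same on-disk state.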



Experimental Setup - Simplification

 Setup on Solaris 10 VM
 Only one physical vdev (disk)
 No striping, mirror, raid…
 Initial target – Pointer Corruption
 Reduced Sample Space
 Interesting Cases
 Disable compression as much as possible



Initial Finding
 All metadata compressed
 Cannot disable metadata compression
 Pointer Corruption not feasible
 Perform corruptions on compressed objects
 Representative of effects of disk faults on ZFS



Corruption Experiments

• TYPE:
  • Type-aware object corruptions
• TARGET (targeted on-disk objects):
  • Vdev labels [@Pool]
  • Uberblocks [@Pool]
  • Object sets
    • Meta Object Set [@Pool]
      • objset_phys_t (describing the object set)
      • Object array
    • Myfs Object Set [@FS]
      • objset_phys_t
      • Indirect blkptr objects
      • Object array
  • ZIL [@FS]
  • File data [@FS]
  • Directory data [@FS]

Results

Object               Detection     Recovery         Correction
vdev label           YES/Checksum  YES/Replica      NO/COW
uberblock            YES/Checksum  YES/Replica      NO/COW
MOS Object           YES/Checksum  YES/Replica      NO/COW
MOS Object Set       YES/Checksum  YES/Replica      NO/COW
FS Object            YES/Checksum  YES/Replica      NO/COW
FS Indirect Objects  YES/Checksum  YES/Replica      NO/COW
FS Object Set        YES/Checksum  YES/Replica      NO/COW
ZIL                  YES/Checksum  NO               NO
Directory Data       YES/Checksum  NO/Configurable  NO/Configurable
File Data            YES/Checksum  NO/Configurable  NO/Configurable

Summary (using IRON Taxonomy)

• Detection
  • Checksums in parent blkptrs
• Recovery
  • Replication in parent blkptrs (ditto blocks)
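Detection through the parent's checksum and recovery through ditto copies can be combined into a single read path. The following is a minimal sketch under stated assumptions: the dictionary `disk`, the `parent_pointer` layout, and `read_verified` are hypothetical, and SHA-256 from `hashlib` is used for convenience rather than to mirror any particular ZFS checksum setting.

```python
import hashlib


class ChecksumError(Exception):
    """Raised when no copy matches the checksum held by the parent."""


def checksum(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def read_verified(disk, addresses, expected_checksum):
    """Try each copy in turn; return the first one that verifies."""
    for addr in addresses:
        data = disk[addr]
        if checksum(data) == expected_checksum:
            return data          # detection passed for this copy
    raise ChecksumError("all copies failed verification")


payload = b"object set metadata"
disk = {10: payload, 20: payload, 30: payload}   # three ditto copies
parent_pointer = {"addresses": [10, 20, 30], "checksum": checksum(payload)}

# Damage the first copy: the mismatch is detected, the second copy is used.
disk[10] = b"corrupted!"
recovered = read_verified(disk, parent_pointer["addresses"],
                          parent_pointer["checksum"])

# Damage every copy: detection still works, recovery does not.
disk[20] = disk[30] = b"corrupted!"
try:
    read_verified(disk, parent_pointer["addresses"], parent_pointer["checksum"])
    all_bad_detected = False
except ChecksumError:
    all_bad_detected = True

print(recovered, all_bad_detected)
```

The checksum lives in the parent rather than next to the data, so a damaged block cannot vouch for itself. The second case models as many failures as there are copies, where only detection remains.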



Conclusion

• Integration of file system and volume manager
  • Saves an additional translation
• Use of one generic block pointer for checksums and replication
• Merkle tree provides robustness
• Use of replication/compression in a commodity file system is viable
• COW can be used effectively



Observations/Questions

• No correction of ditto blocks: relies on COW
  • Consecutive (n = wideness) failures without a transaction group commit?
  • Snapshot corruption?
• Explicit scrubbing corrects ditto blocks in place
  • Potential for corruption?
• Space/performance hit due to redundancy/compression
  • 2% hit in terms of space/IO? (Banham & Nash)
• No page cache; uses ARC

Future Work

• Snapshot corruptions
• Multiple-device configurations
  • Striping
  • Mirror
  • RAID-Z
