Draft

ZFS On-Disk Specification

Sun Microsystems, Inc. 4150 Network Circle Santa Clara, CA 95054 U.S.A


© 2006 Sun Microsystems, Inc. 4150 Network Circle Santa Clara, CA 95054 U.S.A.

This product or document is protected by copyright and distributed under licenses restricting its use, copying, distribution, and decompilation. No part of this product or document may be reproduced in any form by any means without prior written authorization of Sun and its licensors, if any. Third-party software, including font technology, is copyrighted and licensed from Sun suppliers.

Parts of the product may be derived from Berkeley BSD systems, licensed from the University of California.

Sun, Sun Microsystems, the Sun logo, Java, JavaServer Pages, Solaris, and StorEdge are trademarks or registered trademarks of Sun Microsystems, Inc. in the U.S. and other countries.

U.S. Government Rights Commercial software. Government users are subject to the Sun Microsystems, Inc. standard license agreement and applicable provisions of the FAR and its supplements.

DOCUMENTATION IS PROVIDED AS IS AND ALL EXPRESS OR IMPLIED CONDITIONS, REPRESENTATIONS AND WARRANTIES, INCLUDING ANY IMPLIED WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE OR NON-INFRINGEMENT, ARE DISCLAIMED, EXCEPT TO THE EXTENT THAT SUCH DISCLAIMERS ARE HELD TO BE LEGALLY INVALID.

Unless otherwise licensed, use of this software is authorized pursuant to the terms of the license found at: http://developers.sun.com/berkeley_license.html

Table of Contents

Introduction
Chapter One – Virtual Devices (vdevs), Vdev Labels, and Boot Block
    Section 1.1: Virtual Devices
    Section 1.2: Vdev Labels
        Section 1.2.1: Label Redundancy
        Section 1.2.2: Transactional Two Staged Label Update
    Section 1.3: Vdev Technical Details
        Section 1.3.1: Blank Space
        Section 1.3.2: Boot Block Header
        Section 1.3.3: Name-Value Pair List
        Section 1.3.4: The Uberblock
    Section 1.4: Boot Block
Chapter Two: Block Pointers and Indirect Blocks
    Section 2.1: DVA – Data Virtual Address
    Section 2.2: GRID
    Section 2.3: GANG
    Section 2.4: Checksum
    Section 2.5: Compression
    Section 2.6: Block Size
    Section 2.7: Endian
    Section 2.8: Type
    Section 2.9: Level
    Section 2.10: Fill
    Section 2.11: Birth Transaction
    Section 2.12: Padding
Chapter Three: Data Management Unit
    Section 3.1: Objects
    Section 3.2: Object Sets
Chapter Four – DSL
    Section 4.1: DSL Infrastructure
    Section 4.2: DSL Implementation Details
    Section 4.3: Dataset Internals
    Section 4.4: DSL Directory Internals
Chapter Five – ZAP
    Section 5.1: The Micro Zap
    Section 5.2: The Fat Zap
        Section 5.2.1: zap_phys_t
        Section 5.2.2: Pointer Table
        Section 5.2.3: zap_leaf_phys_t
        Section 5.2.4: zap_leaf_chunk
Chapter Six – ZPL
    Section 6.1: ZPL Filesystem Layout
    Section 6.2: Directories and Directory Traversal
    Section 6.3: ZFS Access Control Lists
Chapter Seven – ZFS Intent Log
    Section 7.1: ZIL header
    Section 7.2: ZIL blocks
Chapter Eight – ZVOL (ZFS volume)


Introduction
ZFS is a new filesystem technology that provides immense capacity (128-bit), provable data integrity, an always-consistent on-disk format, self-optimizing performance, and real-time remote replication.

ZFS departs from traditional filesystems by eliminating the concept of volumes. Instead, ZFS filesystems share a common storage pool consisting of writeable storage media. Media can be added or removed from the pool as filesystem capacity requirements change. Filesystems dynamically grow and shrink as needed without the need to re-partition underlying storage.

ZFS provides a truly consistent on-disk format by using a copy on write (COW) transaction model. This model ensures that on-disk data is never overwritten and all on-disk updates are done atomically.

The ZFS software is composed of seven distinct pieces: the SPA (Storage Pool Allocator), the DSL (Dataset and Snapshot Layer), the DMU (Data Management Unit), the ZAP (ZFS Attribute Processor), the ZPL (ZFS Posix Layer), the ZIL (ZFS Intent Log), and ZVOL (ZFS Volume). The on-disk structures associated with each of these pieces are explained in the following chapters: SPA (Chapters 1 and 2), DMU (Chapter 3), DSL (Chapter 4), ZAP (Chapter 5), ZPL (Chapter 6), ZIL (Chapter 7), ZVOL (Chapter 8).


Chapter One – Virtual Devices (vdevs), Vdev Labels, and Boot Block

Section 1.1: Virtual Devices
ZFS storage pools are made up of a collection of virtual devices. There are two types of virtual devices: physical virtual devices (sometimes called leaf vdevs) and logical virtual devices (sometimes called interior vdevs). A physical vdev is a writeable media block device (a disk, for example). A logical vdev is a conceptual grouping of physical vdevs.

Vdevs are arranged in a tree with physical vdevs existing as leaves of the tree. All pools have a special logical vdev called the "root" vdev which roots the tree. All direct children of the "root" vdev (physical or logical) are called top-level vdevs. The illustration below shows a tree of vdevs representing a sample pool configuration containing two mirrors. The first mirror (labeled "M1") contains two disks, represented by "vdev A" and "vdev B". Likewise, the second mirror "M2" contains two disks represented by "vdev C" and "vdev D". Vdevs A, B, C, and D are all physical vdevs. "M1" and "M2" are logical vdevs; they are also top-level vdevs since they originate from the "root vdev".

    "root vdev"                              Internal/Logical vdev
        "M1" vdev (Mirror A/B)               Top-level vdev
            "A" vdev (disk)                  Physical/Leaf vdev
            "B" vdev (disk)                  Physical/Leaf vdev
        "M2" vdev (Mirror C/D)               Top-level vdev
            "C" vdev (disk)                  Physical/Leaf vdev
            "D" vdev (disk)                  Physical/Leaf vdev

Illustration 1 vdev tree sample configuration

Section 1.2: Vdev Labels
Each physical vdev within a storage pool contains a 256KB structure called a vdev label. The vdev label contains information describing this particular physical vdev and all other vdevs which share a common top-level vdev as an ancestor. For example, the vdev label structure contained on vdev "C", in the previous illustration, would contain information describing the following vdevs: "C", "D", and "M2". The contents of the vdev label are described in greater detail in section 1.3, Vdev Technical Details.

The vdev label serves two purposes: it provides access to a pool's contents and it is used to verify a pool's integrity and availability. To ensure that the vdev label is always available and always valid, redundancy and a staged update model are used. To provide redundancy, four copies of the label are written to each physical vdev within the pool. The four copies are identical within a vdev, but are not identical across vdevs in the pool. During label updates, a two staged transactional approach is used to ensure that a valid vdev label is always available on disk. Vdev label redundancy and the transactional update model are described in more detail below.

Section 1.2.1: Label Redundancy
Four copies of the vdev label are written to each physical vdev within a ZFS storage pool. Aside from the small time frame during label update (described below), these four labels are identical and any copy can be used to access and verify the contents of the pool. When a device is added to the pool, ZFS places two labels at the front of the device and two labels at the back of the device. The drawing below shows the layout of these labels on a device of size N; L0 and L1 represent the front two labels, L2 and L3 represent the back two labels.

    Offset:  0       256K     512K    ...    N-512K    N-256K    N
             |  L0   |   L1   |              |   L2    |   L3    |

Illustration 2 Vdev Label layout on a block device of size N

Based on the assumption that corruption (or accidental disk overwrites) typically occurs in contiguous chunks, placing the labels in non-contiguous locations (front and back) provides ZFS with a better probability that some label will remain accessible in the case of media failure or accidental overwrite (e.g. using the disk as a swap device while it is still part of a ZFS storage pool).

Section 1.2.2: Transactional Two Staged Label Update
The locations of the vdev labels are fixed at the time the device is added to the pool. Thus, the vdev label does not have copy-on-write semantics like everything else in ZFS. Consequently, when a vdev label is updated, the contents of the label are overwritten. Any time on-disk data is overwritten, there is a potential for error. To ensure that ZFS always has access to its labels, a staged approach is used during update. The first stage of the update writes the even labels (L0 and L2) to disk. If, at any point in time, the system comes down or faults during this update, the odd labels will still be valid. Once the even labels have made it out to stable storage, the odd labels (L1 and L3) are updated and written to disk. This approach has been carefully designed to ensure that a valid copy of the label remains on disk at all times.
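The label placement and update order described above can be sketched in C. This is a minimal illustration rather than the ZFS implementation: the names vdev_label_offset and label_update_order are hypothetical, and the sketch assumes the device size N is given in bytes and is at least 1MB so the four labels do not overlap.

```c
#include <stdint.h>

#define VDEV_LABEL_SIZE (256ULL * 1024)   /* each vdev label is 256KB */

/*
 * Byte offset of label L<index> (0..3) on a device of dev_size bytes.
 * L0 and L1 sit at the front of the device, L2 and L3 at the back.
 * Returns UINT64_MAX for an index outside 0..3.
 * (Hypothetical helper; assumes dev_size >= 4 * VDEV_LABEL_SIZE.)
 */
static uint64_t vdev_label_offset(uint64_t dev_size, int index)
{
    switch (index) {
    case 0:  return 0;
    case 1:  return VDEV_LABEL_SIZE;
    case 2:  return dev_size - 2 * VDEV_LABEL_SIZE;
    case 3:  return dev_size - VDEV_LABEL_SIZE;
    default: return UINT64_MAX;
    }
}

/*
 * Two staged update order: the even labels (L0, L2) are written first,
 * then, once they are on stable storage, the odd labels (L1, L3).
 */
static const int label_update_order[4] = { 0, 2, 1, 3 };
```

For a 1GB device, for example, L2 would start 512KB before the end of the device and L3 256KB before the end, so a contiguous overwrite at the front leaves both back labels intact.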

Section 1.3: Vdev Technical Details
The contents of a vdev label are broken up into four pieces: 8KB of blank space, 8KB of boot header information, 112KB of name-value pairs, and 128KB of 1KB sized uberblock structures. The drawing below shows an expanded view of the L0 label. A detailed description of each component follows: blank space (section 1.3.1), boot block header (section 1.3.2), name/value pair list (section 1.3.3), and uberblock array (section 1.3.4).

    Offset:  0             8K            16K                 128K               256K
             | Blank Space | Boot Header | Name/Value Pairs  | Uberblock Array  |

Illustration 3 Components of a vdev label (blank space, boot block header, name/value pairs, uberblock array)

Section 1.3.1: Blank Space
ZFS supports both VTOC (Volume Table of Contents) and EFI disk labels as valid methods of describing disk layout [1]. While EFI labels are not written as part of a slice (they have their own reserved space), VTOC labels must be written to the first 8KB of slice 0. Thus, to support VTOC labels, the first 8KB of the vdev_label is left empty to prevent potentially overwriting a VTOC disk label.

[1] Disk labels describe disk partition and slice information. See fdisk(1M) and/or format(1M) for more information on disk partitions and slices. It should be noted that disk labels are a completely separate entity from vdev labels and, while their naming is similar, they should not be confused as being similar.

Section 1.3.2: Boot Block Header
The boot block header is an 8KB structure that is reserved for future use. The contents of this block will be described in a future appendix of this paper.

Section 1.3.3: Name-Value Pair List
The next 112KB of the label holds a collection of name-value pairs describing this vdev and all of its related vdevs. Related vdevs are defined as all vdevs within the subtree rooted at this vdev's top-level vdev. For example, the vdev label on device "A" (seen in the illustration below) would contain information describing the subtree highlighted: including vdevs "A", "B", and "M1" (top-level vdev).

Illustration 4 vdev tree showing related vdevs in highlighted circle (the tree of Illustration 1, with vdevs "A", "B", and "M1" highlighted)

All name-value pairs are stored in XDR encoded nvlists. For more information on XDR encoding or nvlists, see the libnvpair(3LIB) and nvlist_free(3NVPAIR) man pages. The following name-value pairs are contained within this 112KB portion of the vdev_label.

Version:
    Name: "version"
    Value: DATA_TYPE_UINT64
    Description: On disk format version. Current value is "1".

Name:
    Name: "name"
    Value: DATA_TYPE_STRING
    Description: Name of the pool in which this vdev belongs.

State:
    Name: "state"
    Value: DATA_TYPE_UINT64
    Description: State of this pool. The following table shows all existing pool states.

    State                    Value
    POOL_STATE_ACTIVE        0
    POOL_STATE_EXPORTED      1
    POOL_STATE_DESTROYED     2

Table 1 Pool states and values.

Transaction:
    Name: "txg"
    Value: DATA_TYPE_UINT64
    Description: Transaction group number in which this label was written to disk.

Pool Guid:
    Name: "pool_guid"
    Value: DATA_TYPE_UINT64
    Description: Global unique identifier (guid) for the pool.

Top Guid:
    Name: "top_guid"
    Value: DATA_TYPE_UINT64
    Description: Global unique identifier for the top-level vdev of this subtree.

Guid:
    Name: "guid"
    Value: DATA_TYPE_UINT64
    Description: Global unique identifier for this vdev.

Vdev Tree:
    Name: "vdev_tree"
    Value: DATA_TYPE_NVLIST
    Description: The vdev_tree is an nvlist structure which is used recursively to describe the hierarchical nature of the vdev tree as seen in illustrations one and four. The vdev_tree recursively describes each "related" vdev within this vdev's subtree. The illustration below shows what the "vdev_tree" entry might look like for "vdev A" as shown in illustrations one and four earlier in this document.

    vdev_tree
        type='mirror'
        id=1
        guid=16593009660401351626
        metaslab_array = 13
        metaslab_shift = 22
        ashift = 9
        asize = 519569408
        children[0]
            vdev_tree
                type='disk'
                id=2
                guid=6649981596953412974
                path='/dev/dsk/c4t0d0'
                devid='id1,sd@SSEAGATE_ST...7404E4NS/a'
        children[1]
            vdev_tree
                type='disk'
                id=3
                guid=3648040300193291405
                path='/dev/dsk/c4t1d0'
                devid='id1,sd@SSEAGATE_ST...7404D6MN/a'

Illustration 5 vdev tree nvlist entry for "vdev A" as seen in Illustrations 1 and 4

Each vdev_tree nvlist contains the following elements as described in the section below. Note that not all nvlist elements are applicable to all vdev types. Therefore, a vdev_tree nvlist may contain only a subset of the elements described below.

    Name: "type"
    Value: DATA_TYPE_STRING
    Description: String value indicating type of vdev. The following vdev_types are valid.

    Type           Description
    "disk"         Leaf vdev: block storage
    "file"         Leaf vdev: file storage
    "mirror"       Interior vdev: mirror
    "raidz"        Interior vdev: raidz
    "replacing"    Interior vdev: a slight variation on the mirror vdev; used by ZFS when replacing one disk with another
    "root"         Interior vdev: the root of the vdev tree

Table 2 Vdev Type Strings

    Name: "id"
    Value: DATA_TYPE_UINT64
    Description: The id is the index of this vdev in its parent's children array.

    Name: "guid"
    Value: DATA_TYPE_UINT64
    Description: Global Unique Identifier for this vdev_tree element.

    Name: "path"
    Value: DATA_TYPE_STRING
    Description: Device path. Only used for leaf vdevs.

    Name: "devid"
    Value: DATA_TYPE_STRING
    Description: Device ID for this vdev_tree element. Only used for vdevs of type disk.

    Name: "metaslab_array"
    Value: DATA_TYPE_UINT64
    Description: Object number of an object containing an array of object numbers. Each element of this array (ma[i]) is, in turn, an object number of a space map for metaslab 'i'.

    Name: "metaslab_shift"
    Value: DATA_TYPE_UINT64
    Description: Log base 2 of the metaslab size.

    Name: "ashift"
    Value: DATA_TYPE_UINT64
    Description: Log base 2 of the minimum allocatable unit for this top level vdev. This is currently '10' for a RAIDZ configuration, '9' otherwise.

    Name: "asize"
    Value: DATA_TYPE_UINT64
    Description: Amount of space that can be allocated from this top level vdev.

    Name: "children"
    Value: DATA_TYPE_NVLIST_ARRAY
    Description: Array of vdev_tree nvlists for each child of this vdev_tree element.

Section 1.3.4: The Uberblock
Immediately following the nvpair lists in the vdev label is an array of uberblocks. The uberblock is the portion of the label containing information necessary to access the contents of the pool [2]. Only one uberblock in the pool is active at any point in time. The uberblock with the highest transaction group number and valid SHA-256 checksum is the active uberblock.

To ensure constant access to the active uberblock, the active uberblock is never overwritten. Instead, all updates to an uberblock are done by writing a modified uberblock to another element of the uberblock array. Upon writing the new uberblock, the transaction group number and timestamps are incremented, thereby making it the new active uberblock in a single atomic action. Uberblocks are written in a round robin fashion across the various vdevs within the pool. The illustration below has an expanded view of two uberblocks within an uberblock array.

[2] The uberblock is similar to the superblock in UFS.

    uberblock_phys_t
        uint64_t    ub_magic
        uint64_t    ub_version
        uint64_t    ub_txg
        uint64_t    ub_guid_sum
        uint64_t    ub_timestamp
        blkptr_t    ub_rootbp

Illustration 6 Uberblock array showing uberblock contents (the uberblock array occupies the last 128KB of each label; one element of the array is the active uberblock)

Uberblock Technical Details
The uberblock is stored in the machine's native endian format and has the following contents:

ub_magic
The uberblock magic number is a 64 bit integer used to identify a device as containing ZFS data. The value of the ub_magic is 0x00bab10c (oo-ba-bloc). The following table shows the ub_magic number as seen on disk.

    Machine Endianness    Uberblock Value
    Big Endian            0x00bab10c
    Little Endian         0x0cb1ba00

Table 3 Uberblock values per machine endian type.

ub_version
The version field is used to identify the on-disk format in which this data is laid out. The current on-disk format version number is 0x1. This field contains the same value as the "version" element of the name/value pairs described in section 1.3.3.

ub_txg

All writes in ZFS are done in transaction groups. Each group has an associated transaction group number. The ub_txg value reflects the transaction group in which this uberblock was written. The ub_txg number must be greater than or equal to the "txg" number stored in the nvlist for this label to be valid.

ub_guid_sum
The ub_guid_sum is used to verify the availability of vdevs within a pool. When a pool is opened, ZFS traverses all leaf vdevs within the pool and keeps a running sum of all the GUIDs it encounters (a vdev's guid is stored in the guid nvpair entry, see section 1.3.3). This computed sum is checked against the ub_guid_sum to verify the availability of all vdevs within this pool.

ub_timestamp
Coordinated Universal Time (UTC) when this uberblock was written, in seconds since January 1st 1970 (GMT).

ub_rootbp
The ub_rootbp is a blkptr structure containing the location of the MOS. Both the MOS and blkptr structures are described in later chapters of this document: Chapters 4 and 2 respectively.

Section 1.4: Boot Block
Immediately following the L0 and L1 labels is a 3.5MB chunk reserved for future use. The contents of this block will be described in a future appendix of this paper.

    Offset:  0      256K    512K           4M      ...    N-512K    N-256K    N
             |  L0  |  L1   |  Boot Block  |              |   L2    |   L3    |

Illustration 7 Vdev label layout including boot block reserved space.
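The selection rule for the active uberblock and the guid sum check can be sketched as follows. This is a hedged illustration, not the ZFS source: ub_candidate_t, find_active_uberblock, and guid_sum are hypothetical names, the SHA-256 verification is assumed to have been done elsewhere and is represented by a flag, and the sum is assumed to wrap on 64 bit overflow.

```c
#include <stddef.h>
#include <stdint.h>

#define UB_MAGIC 0x00bab10cULL   /* ub_magic value given in the text */

/* Simplified, hypothetical view of one element of the uberblock array. */
typedef struct ub_candidate {
    uint64_t magic;        /* ub_magic, already in native byte order */
    uint64_t txg;          /* ub_txg */
    int      checksum_ok;  /* nonzero if the SHA-256 check passed (done elsewhere) */
} ub_candidate_t;

/*
 * Return the index of the active uberblock: the one with the highest
 * transaction group number among those with a valid magic number and
 * checksum, and whose txg is not older than the label's "txg" nvpair.
 * Returns -1 if no element qualifies.
 */
static int find_active_uberblock(const ub_candidate_t *ub, size_t n,
                                 uint64_t label_txg)
{
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (ub[i].magic != UB_MAGIC || !ub[i].checksum_ok)
            continue;
        if (ub[i].txg < label_txg)
            continue;
        if (best < 0 || ub[i].txg > ub[(size_t)best].txg)
            best = (int)i;
    }
    return best;
}

/*
 * Running sum of the leaf vdev guids, to be compared with ub_guid_sum.
 * Assumption: the sum wraps modulo 2^64 (unsigned overflow is well
 * defined in C).
 */
static uint64_t guid_sum(const uint64_t *guids, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += guids[i];
    return sum;
}
```

Because a new uberblock is always written to a different array element, an interrupted write leaves an element that fails the checksum test, and the scan simply falls back to the previous highest valid transaction group.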

Chapter Two: Block Pointers and Indirect Blocks

Data is transferred between disk and main memory in units called blocks. A block pointer (blkptr_t) is a 128 byte ZFS structure used to physically locate, verify, and describe blocks of data on disk.

The 128 byte blkptr_t structure layout is shown in the illustration below.

    64-bit word    Contents (most significant bits first)
    0              vdev1 | GRID | ASIZE
    1              G | offset1
    2              vdev2 | GRID | ASIZE
    3              G | offset2
    4              vdev3 | GRID | ASIZE
    5              G | offset3
    6              E | lvl | type | cksum | comp | PSIZE | LSIZE
    7              padding
    8              padding
    9              padding
    a              birth txg
    b              fill count
    c              checksum[0]
    d              checksum[1]
    e              checksum[2]
    f              checksum[3]

Illustration 8 Block pointer structure showing byte by byte usage.

Section 2.1: DVA – Data Virtual Address
The data virtual address is the name given to the combination of the vdev and offset portions of the block pointer; for example, the combination of vdev1 and offset1 makes up a DVA (dva1). ZFS provides the capability of storing up to three copies of the data pointed to by the block pointer, each pointed to by a unique DVA (dva1, dva2, or dva3). The data stored in each of these copies is identical. The number of DVAs used per block pointer is purely a policy decision and is called the "wideness" of the block pointer: single wide block pointer (1 DVA), double wide block pointer (2 DVAs), and triple wide block pointer (3 DVAs).

The vdev portion of each DVA is a 32 bit integer which uniquely identifies the vdev ID containing this block. The offset portion of the DVA is a 63 bit integer value holding the offset (starting after the vdev labels (L0 and L1) and boot block) within that device where the data lives. Together, the vdev and offset uniquely identify the block address of the data it points to.

The value stored in offset is the offset in terms of sectors (512 byte blocks). To find the physical block byte offset from the beginning of a slice, the value inside offset must be shifted over (<<) by 9 (2^9 = 512) and this value must be added to 0x400000 (size of two vdev_labels and boot block).

    physical block address = (offset << 9) + 0x400000 (4MB)

Section 2.2: GRID
Raid-Z layout information, reserved for future use.

Section 2.3: GANG
A gang block is a block whose contents contain block pointers. Gang blocks are used when the amount of space requested is not available in a contiguous block. In a situation of this kind, several smaller blocks will be allocated (totaling up to the size requested) and a gang block will be created to contain the block pointers for the allocated blocks. A pointer to this gang block is returned to the requester, giving the requester the perception of a single block.

Gang blocks are identified by the "G" bit.

    "G" bit value    Description
    0                non-gang block
    1                gang block

Table 4 Gang Block Values

Gang blocks are 512 byte sized, self checksumming blocks. A gang block contains up to 3 block pointers followed by a 32 byte checksum. The format of the gang block is described by the following structures.

    typedef struct zio_gbh {
        blkptr_t            zg_blkptr[SPA_GBH_NBLKPTRS];
        uint64_t            zg_filler[SPA_GBH_FILLER];
        zio_block_tail_t    zg_tail;
    } zio_gbh_phys_t;

zg_blkptr: array of block pointers. Each 512 byte gang block can hold up to 3 block pointers.

zg_filler: The filler field pads out the gang block so that it is nicely byte aligned.

    typedef struct zio_block_tail {
        uint64_t       zbt_magic;
        zio_cksum_t    zbt_cksum;
    } zio_block_tail_t;

zbt_magic: ZIO block tail magic number. The value is 0x210da7ab10c7a11 (zio-data-bloc-tail).

    typedef struct zio_cksum {
        uint64_t    zc_word[4];
    } zio_cksum_t;

zc_word: four 8 byte words containing the checksum for this gang block.

Section 2.4: Checksum
By default ZFS checksums all of its data and metadata. ZFS supports several algorithms for checksumming including fletcher2, fletcher4 and SHA-256 (256-bit Secure Hash Algorithm in FIPS 180-2, available at http://csrc.nist.gov/cryptval). The algorithm used to checksum this block is identified by the 8 bit integer stored in the cksum portion of the block pointer. The following table pairs each integer with a description and algorithm used to checksum this block's contents.

    Description    Value    Algorithm
    on             1        fletcher2
    off            2        none
    label          3        SHA-256
    gang header    4        SHA-256
    zilog          5        fletcher2
    fletcher2      6        fletcher2
    fletcher4      7        fletcher4
    SHA-256        8        SHA-256

Table 5 Checksum Values and associated checksum algorithms.

A 256 bit checksum of the data is computed for each block using the algorithm identified in cksum. If the cksum value is 2 (off), a checksum will not be computed and checksum[0], checksum[1], checksum[2], and checksum[3] will be zero. Otherwise, the 256 bit checksum computed for this block is stored in the checksum[0], checksum[1], checksum[2], and checksum[3] fields.

Note: The computed checksum is always of the data, even if this is a gang block. Gang blocks (see above) and zilog blocks (see Chapter 7) are self checksumming.

Section 2.5: Compression
ZFS supports several algorithms for compression. The type of compression used to compress this block is stored in the comp portion of the block pointer.

    Description    Value    Algorithm
    on             1        lzjb
    off            2        none
    lzjb           3        lzjb

Table 6 Compression Values and associated algorithm.

Section 2.6: Block Size
The size of a block is described by three different fields in the block pointer: psize, lsize, and asize.

lsize: Logical size. The size of the data without compression, raidz or gang overhead.

psize: Physical size of the block on disk after compression.

asize: Allocated size. The total size of all blocks allocated to hold this data, including any gang headers or raid-Z parity information.

If compression is turned off and ZFS is not on Raid-Z storage, lsize, asize, and psize will all be equal.

All sizes are stored as the number of 512 byte sectors (minus one) needed to represent the size of this block.

Section 2.7: Endian
ZFS is an adaptive-endian filesystem (providing the restrictions described in Chapter One) that allows for moving pools across machines with different architectures: little endian vs. big endian. The "E" portion of the block pointer indicates which format this block has been written out in. Blocks are always written out in the machine's native endian format.

    Endian           Value
    Little Endian    1
    Big Endian       0

Table 7 Endian Values

If a pool is moved to a machine with a different endian format, the contents of the block are byte swapped on read.

Section 2.8: Type
The type portion of the block pointer indicates what type of data this block holds. The type can be one of the following values. More detail is provided in chapter 3 regarding object types.

    Type                            Value
    DMU_OT_NONE                     0
    DMU_OT_OBJECT_DIRECTORY         1
    DMU_OT_OBJECT_ARRAY             2
    DMU_OT_PACKED_NVLIST            3
    DMU_OT_NVLIST_SIZE              4
    DMU_OT_BPLIST                   5
    DMU_OT_BPLIST_HDR               6
    DMU_OT_SPACE_MAP_HEADER         7
    DMU_OT_SPACE_MAP                8
    DMU_OT_INTENT_LOG               9
    DMU_OT_DNODE                    10
    DMU_OT_OBJSET                   11
    DMU_OT_DSL_DATASET              12
    DMU_OT_DSL_DATASET_CHILD_MAP    13
    DMU_OT_OBJSET_SNAP_MAP          14
    DMU_OT_DSL_PROPS                15
    DMU_OT_DSL_OBJSET               16
    DMU_OT_ZNODE                    17
    DMU_OT_ACL                      18
    DMU_OT_PLAIN_FILE_CONTENTS      19
    DMU_OT_DIRECTORY_CONTENTS       20
    DMU_OT_MASTER_NODE              21
    DMU_OT_DELETE_QUEUE             22
    DMU_OT_ZVOL                     23
    DMU_OT_ZVOL_PROP                24

Table 8 Object Types

Section 2.9: Level
The level portion of the block pointer is the number of levels (number of block pointers which need to be traversed to arrive at this data). See Chapter 3 for a more complete definition of level.

Section 2.10: Fill
The fill count describes the number of non-zero block pointers under this block pointer. The fill count for a data block pointer is 1, as it does not have any block pointers beneath it.

The fill count is used slightly differently for block pointers of type DMU_OT_DNODE. For block pointers of this type, the fill count contains the number of free dnodes beneath this block pointer. For more information on dnodes see Chapter 3.

Section 2.11: Birth Transaction
The birth transaction stored in the "birth txg" block pointer field is a 64 bit integer containing the transaction group number in which this block pointer was allocated.

Section 2.12: Padding
The three padding fields in the block pointer are space reserved for future use.
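Two pieces of arithmetic from this chapter, the DVA offset conversion of Section 2.1 and the "sectors minus one" size encoding of Section 2.6, can be sketched in C. The function and macro names below are hypothetical and chosen for illustration; only the constants (a shift of 9 and the 0x400000 byte reserved area) come from the text.

```c
#include <stdint.h>

#define SECTOR_SHIFT      9             /* 2^9 = 512 byte sectors */
#define LABEL_BOOT_BYTES  0x400000ULL   /* two front labels plus boot block (4MB) */

/*
 * Byte offset, from the beginning of the slice, of the data addressed by
 * the offset portion of a DVA (which counts 512 byte sectors).
 */
static uint64_t dva_physical_offset(uint64_t offset_sectors)
{
    return (offset_sectors << SECTOR_SHIFT) + LABEL_BOOT_BYTES;
}

/*
 * Decode an lsize, psize, or asize field: each stores the number of
 * 512 byte sectors minus one.
 */
static uint64_t size_field_to_bytes(uint64_t field)
{
    return (field + 1) << SECTOR_SHIFT;
}

/*
 * Encode a byte count as a size field. Assumes bytes is a nonzero
 * multiple of 512.
 */
static uint64_t bytes_to_size_field(uint64_t bytes)
{
    return (bytes >> SECTOR_SHIFT) - 1;
}
```

Under this encoding a stored size of 0 means one 512 byte sector, and a 128KB block is stored as 255, so a zero sized block cannot be represented and no field value is wasted on it.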

Chapter Three: Data Management Unit

The Data Management Unit (DMU) consumes blocks and groups them into logical units called objects. Objects can be further grouped by the DMU into object sets. Both objects and object sets are described in this chapter.

Section 3.1: Objects

With the exception of a small amount of infrastructure, described in chapters 1 and 2, everything in ZFS is an object. The following table lists existing ZFS object types; many of these types are described in greater detail in future chapters of this document.

    Type                            Description
    DMU_OT_NONE                     Unallocated object
    DMU_OT_OBJECT_DIRECTORY         DSL object directory ZAP object
    DMU_OT_OBJECT_ARRAY             Object used to store an array of object numbers.
    DMU_OT_PACKED_NVLIST            Packed nvlist object.
    DMU_OT_SPACE_MAP                SPA disk block usage list.
    DMU_OT_INTENT_LOG               Intent Log
    DMU_OT_DNODE                    Object of dnodes (metadnode)
    DMU_OT_OBJSET                   Collection of objects.
    DMU_OT_DSL_DATASET_CHILD_MAP    DSL ZAP object containing child DSL directory information.
    DMU_OT_DSL_OBJSET_SNAP_MAP      DSL ZAP object containing snapshot information for a dataset.
    DMU_OT_DSL_PROPS                DSL ZAP properties object containing properties for a DSL dir object.
    DMU_OT_BPLIST                   Block pointer list, used to store the "deadlist" (list of block pointers deleted since the last snapshot) and the "deferred free list" used for sync to convergence.
    DMU_OT_BPLIST_HDR               BPLIST header: stores the bplist_phys_t structure.
    DMU_OT_ACL                      ACL (Access Control List) object
    DMU_OT_PLAIN_FILE               ZPL Plain file
    DMU_OT_DIRECTORY_CONTENTS       ZPL Directory ZAP Object
    DMU_OT_MASTER_NODE              ZPL Master Node ZAP object: head object used to identify root directory, delete queue, and version for a filesystem.
    DMU_OT_DELETE_QUEUE             The delete queue provides a list of deletes that were in-progress when the filesystem was force unmounted or as a result of a system failure such as a power outage. Upon the next mount of the filesystem, the delete queue is processed to remove the files/dirs that are in the delete queue. This mechanism is used to avoid leaking files and directories in the filesystem.
    DMU_OT_ZVOL                     ZFS volume (ZVOL)
    DMU_OT_ZVOL_PROP                ZVOL properties

Table 9 DMU Object Types

Objects are defined by 512 byte structures called dnodes [3]. A dnode describes and organizes a collection of blocks making up an object. The dnode (dnode_phys_t structure), seen in the illustration below, contains several fixed length fields and two variable length fields. Each of these fields is described in detail below.

    dnode_phys_t
        uint8_t   dn_type;                 fixed length fields
        uint8_t   dn_indblkshift;
        uint8_t   dn_nlevels;
        uint8_t   dn_nblkptr;
        uint8_t   dn_bonustype;
        uint8_t   dn_checksum;
        uint8_t   dn_compress;
        uint8_t   dn_pad[1];
        uint16_t  dn_datablkszsec;
        uint16_t  dn_bonuslen;
        uint8_t   dn_pad2[4];
        uint64_t  dn_maxblkid;
        uint64_t  dn_secphys;
        uint64_t  dn_pad3[4];
        blkptr_t  dn_blkptr[N];            variable length fields
        uint8_t   dn_bonus[BONUSLEN];

Illustration 9 dnode_phys_t structure

dn_type

An 8-bit numeric value indicating an object's type. See Table 8 for a list of valid object types and their associated 8 bit identifiers.

dn_indblkshift and dn_datablkszsec

ZFS supports variable data and indirect (see dn_nlevels below for a description of indirect blocks) block sizes ranging from 512 bytes to 128 Kbytes.

[3] A dnode is similar to an inode in UFS.

dn_indblkshift: 8-bit numeric value containing the log (base 2) of the size (in bytes) of an indirect block for this object.

dn_datablkszsec: 16-bit numeric value containing the data block size (in bytes) divided by 512 (size of a disk sector). This value can range between 1 (for a 512 byte block) and 256 (for a 128 Kbyte block).

dn_nblkptr and dn_blkptr

dn_blkptr is a variable length field that can contain between one and three block pointers. The number of block pointers that the dnode contains is set at object allocation time and remains constant throughout the life of the dnode.

dn_nblkptr: 8 bit numeric value containing the number of block pointers in this dnode.

dn_blkptr: block pointer array containing dn_nblkptr block pointers.

dn_nlevels

dn_nlevels is an 8 bit numeric value containing the number of levels that make up this object. These levels are often referred to as levels of indirection.

Indirection

A dnode has a limited number (dn_nblkptr, see above) of block pointers to describe an object's data. For a dnode using the largest data block size (128KB) and containing the maximum number of block pointers (3), the largest object size it can represent (without indirection) is 384 KB: 3 x 128KB = 384KB. To allow for larger objects, indirect blocks are used. An indirect block is a block containing block pointers. The number of block pointers that an indirect block can hold is dependent on the indirect block size (represented by dn_indblkshift) and can be calculated by dividing the indirect block size by the size of a blkptr (128 bytes). The largest indirect block (128KB) can hold up to 1024 block pointers. As an object's size increases, more indirect blocks and levels of indirection are created. A new level of indirection is created once an object grows so large that it exceeds the capacity of the current level. ZFS provides up to six levels of indirection to support files up to 2^64 bytes long.

The illustration below shows an object with 3 levels of blocks (level 0, level 1, and level 2). This object has triple wide block pointers (dva1, dva2, and dva3) for metadata and single wide block pointers for its data (see Chapter Two for a description of block pointer wideness). The blocks at level 0 are data blocks.
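The arithmetic in the Indirection discussion can be sketched in C as follows. This is an illustrative sketch only: the function names are hypothetical and are not part of the ZFS source, and the constants (128 byte block pointers, 512 byte sectors) are the ones stated in the text.

```c
#include <stdint.h>

#define BLKPTR_SIZE 128u /* size of one block pointer in bytes (from the text) */
#define SECTOR_SIZE 512u /* size of a disk sector in bytes (from the text) */

/* Number of block pointers that fit in one indirect block of 2^indblkshift bytes. */
uint64_t blkptrs_per_indirect(uint8_t indblkshift)
{
    return ((uint64_t)1 << indblkshift) / BLKPTR_SIZE;
}

/* Data block size in bytes, recovered from dn_datablkszsec. */
uint64_t data_block_size(uint16_t datablkszsec)
{
    return (uint64_t)datablkszsec * SECTOR_SIZE;
}

/* Largest object, in bytes, that a dnode can describe with no indirect blocks. */
uint64_t max_size_without_indirection(uint16_t datablkszsec, uint8_t nblkptr)
{
    return data_block_size(datablkszsec) * nblkptr;
}
```

With dn_indblkshift = 17 (a 128KB indirect block) the first function yields 1024, and with dn_datablkszsec = 256 and dn_nblkptr = 3 the last yields 393216 bytes (384KB), matching the figures above.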

[Illustration 10: Object with 3 levels. A dnode_phys_t with dn_nlevels = 3 and dn_nblkptr = 3 whose dn_blkptr[3] array points to level 2 indirect blocks, which point to level 1 indirect blocks, which point to the level 0 data blocks. Triple wide block pointers (dva1, dva2, dva3) are used for metadata; single wide block pointers are used for data.]

dn_maxblkid

An object's blocks are identified by block ids. The blocks in each level of indirection are numbered from 0 to N, where the first block at a given level is given an id of 0, the second an id of 1, and so forth.

The dn_maxblkid field in the dnode is set to the value of the largest data (level zero) block id for this object.

Note on Block Ids: Given a block id and level, ZFS can determine the exact branch of indirect blocks which contain the block. This calculation is done using the block id, block level, and number of block pointers in an indirect block. For example, take an object which has 128KB sized indirect blocks. An indirect block of this size can hold 1024 block pointers. Given a level 0 block id of 16360, it can be determined that block 15 (block id 15) of level 1 contains the block pointer for level 0 blkid 16360.

    level 1 blkid = 16360 / 1024 = 15    (integer division)

This calculation can be performed recursively up the tree of indirect blocks until the top level of indirection has been reached.
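The block id calculation in the note above amounts to repeated integer division. The following C sketch shows one way to express it; the function names are hypothetical and the slot calculation (the remainder) is an inference from the layout described here, not something the text states explicitly.

```c
#include <stdint.h>

/*
 * Block id, at the given level, of the indirect block on the path to
 * level 0 block blkid0. ptrs_per_indirect is the number of block
 * pointers one indirect block can hold (1024 for a 128KB indirect block).
 */
uint64_t ancestor_blkid(uint64_t blkid0, unsigned level, uint64_t ptrs_per_indirect)
{
    uint64_t id = blkid0;
    for (unsigned i = 0; i < level; i++)
        id /= ptrs_per_indirect; /* one division per level climbed */
    return id;
}

/* Assumed: index of the pointer to blkid within its parent indirect block. */
uint64_t slot_in_parent(uint64_t blkid, uint64_t ptrs_per_indirect)
{
    return blkid % ptrs_per_indirect;
}
```

For the example in the text, ancestor_blkid(16360, 1, 1024) gives 15, and calling it with level 2 gives 0, the single level 2 block covering that range.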

dn_secphys

The sum of all asize values for all block pointers (data and indirect) for this object.

dn_bonus, dn_bonuslen, and dn_bonustype

The bonus buffer (dn_bonus) is defined as the space following a dnode's block pointer array (dn_blkptr). The amount of space is dependent on object type and can range between 64 and 320 bytes.

dn_bonus: dn_bonuslen sized chunk of data. The format of this data is defined by dn_bonustype.

dn_bonuslen: Length (in bytes) of the bonus buffer.

dn_bonustype: 8 bit numeric value identifying the type of data contained within the bonus buffer. The following table shows valid bonus buffer types and the structures which are stored in the bonus buffer. The contents of each of these structures will be discussed later in this specification.

    Bonus Type                   Description                                                                                                   Metadata Structure    Value
    DMU_OT_PACKED_NVLIST_SIZE    Bonus buffer type containing size in bytes of a DMU_OT_PACKED_NVLIST object.                                  uint64_t              4
    DMU_OT_SPACE_MAP_HEADER      Spa space map header.                                                                                         space_map_obj_t       7
    DMU_OT_DSL_DIR               DSL Directory object used to define relationships and properties between related datasets.                    dsl_dir_phys_t        12
    DMU_OT_DSL_DATASET           DSL dataset object used to organize snapshot and usage static information for objects of type DMU_OT_OBJSET.  dsl_dataset_phys_t    16
    DMU_OT_ZNODE                 ZPL metadata                                                                                                  znode_phys_t          17

Table 10 Bonus Buffer Types and associated structures.

Section 3.2: Object Sets

The DMU organizes objects into groups called object sets. Object sets are used in ZFS to group related objects, such as objects in a filesystem, snapshot, clone, or volume.

Object sets are represented by a 1K byte objset_phys_t structure. Each member of this structure is defined in detail below.

    objset_phys_t
        dnode_phys_t  metadnode
        zil_header_t  os_zil_header
        uint64_t      os_type

Illustration 11 objset_phys_t structure

os_type

The DMU supports several types of object sets, where each object set type has its own well defined format/layout for its objects. The object set's type is identified by a 64 bit integer, os_type. The table below lists available DMU object set types and their associated os_type integer value.

    Object Set Type    Description                       Value
    DMU_OST_NONE       Uninitialized Object Set          0
    DMU_OST_META       DSL Object Set, See Chapter 4     1
    DMU_OST_ZFS        ZPL Object Set, See Chapter 6     2
    DMU_OST_ZVOL       ZVOL Object Set, See Chapter 8    3

Table 11 DMU Object Set Types

os_zil_header

The ZIL header is described in detail in Chapter 7 of this document.

metadnode

As described earlier in this chapter, each object is described by a dnode_phys_t. The collection of dnode_phys_t structures describing the objects in this object set are stored as an object pointed to by the metadnode. The data contained within this object is formatted as an array of dnode_phys_t structures (one for each object within the object set).

Each object within an object set is uniquely identified by a 64 bit integer called an object number. An object's "object number" identifies the array element, in the dnode array, containing this object's dnode_phys_t.

The illustration below shows an object set with the metadnode expanded. The metadnode contains three block pointers, each of which has been expanded to show its contents. Object number 4 has been further expanded to show the details of the dnode_phys_t and the block structure referenced by this dnode.

[Illustration 12: Object Set. An objset_phys_t (dnode_phys_t metadnode, zil_header_t os_zil_header, uint64_t os_type, char os_pad[376]) whose metadnode (dn_type = DMU_OT_DNODE, dn_nlevels = 1, dn_nblkptr = 3) points through dn_blkptr[3] to the array of dnode_phys_t structures. One dnode in that array (dn_nlevels = 3, dn_nblkptr = 3) is expanded to show its level 2, level 1, and level 0 blocks.]
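Because the metadnode's data is an array of 512 byte dnode_phys_t structures indexed by object number, locating a dnode reduces to simple arithmetic. The sketch below is illustrative only: the type dnode_loc_t and the function locate_dnode are hypothetical names, and the data block size of the dnode array is passed in as a parameter because its value depends on the metadnode's dn_datablkszsec.

```c
#include <stdint.h>

#define DNODE_SIZE 512u /* bytes per dnode_phys_t, as stated in Section 3.1 */

typedef struct {
    uint64_t blkid;  /* level 0 block id within the metadnode's object */
    uint64_t offset; /* byte offset of the dnode inside that block */
} dnode_loc_t;

/* Locate the dnode for an object number, given the dnode array's data block size in bytes. */
dnode_loc_t locate_dnode(uint64_t object, uint64_t datablksz)
{
    uint64_t byte_off = object * DNODE_SIZE; /* position in the flat dnode array */
    dnode_loc_t loc = { byte_off / datablksz, byte_off % datablksz };
    return loc;
}
```

For example, assuming a 16KB data block size, object number 4 would sit at byte offset 2048 of level 0 block 0. The level 0 block id can then be resolved through the metadnode's indirect blocks as described under dn_maxblkid.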

Chapter Four – DSL

The DSL (Dataset and Snapshot Layer) provides a mechanism for describing and managing relationships between, and properties of, object sets. Before describing the DSL and the relationships it describes, a brief overview of the various flavors of object sets is necessary.

Object Set Overview

ZFS provides the ability to create four kinds of object sets: filesystems, clones, snapshots, and volumes.

ZFS filesystem: A filesystem stores and organizes objects in an easily accessible, POSIX compliant manner.

ZFS clone: A clone is identical to a filesystem with the exception of its origin. Clones originate from snapshots and their initial contents are identical to those of the snapshot from which they originated.

ZFS snapshot: A snapshot is a read-only version of a filesystem, clone, or volume at a particular point in time.

ZFS volume: A volume is a logical volume that is exported by ZFS as a block device.

ZFS supports several operations and/or configurations which cause interdependencies amongst object sets. The purpose of the DSL is to manage these relationships. The following is a list of such relationships.

Clones: A clone is related to the snapshot from which it originated. Once a clone is created, the snapshot from which it originated can not be deleted unless the clone is also deleted.

Snapshots: A snapshot is a point-in-time image of the data in the object set in which it was created. A filesystem, clone, or volume can not be destroyed unless its snapshots are also destroyed.

Children: ZFS supports hierarchically structured object sets: object sets within object sets. A child is dependent on the existence of its parent. A parent can not be destroyed without first destroying all children.

Section 4.1: DSL Infrastructure

Each object set is represented in the DSL as a dataset. A dataset manages space consumption statistics for an object set, contains object set location information, and keeps track of any snapshot inter-dependencies.

Datasets are grouped together hierarchically into collections called Dataset Directories.

Dataset Directories manage a related grouping of datasets and the properties associated with that grouping. A DSL directory always has exactly one "active dataset". All other datasets under the DSL directory are related to the "active" dataset through snapshots, clones, or child/parent dependencies.

The following picture shows the DSL infrastructure including a pictorial view of how object set relationships are described via the DSL datasets and DSL directories. The top level DSL Directory can be seen at the top/center of this figure. Directly below the DSL Directory is the "active dataset". The active dataset represents the live filesystem. Originating from the active dataset is a linked list of snapshots which have been taken at different points in time. Each dataset structure points to a DMU Object Set which is the actual object set containing object data. To the left of the top level DSL Directory is a child ZAP [4] object containing a listing of all child/parent dependencies. To the right of the DSL directory is a properties ZAP object containing properties for the datasets within this DSL directory. A listing of all properties can be seen in Table 12 below.

A detailed description of Datasets and DSL Directories is given in the Dataset Internals and DSL Directory Internals sections below.

[Illustration 13: DSL Infrastructure. A top level DSL Directory with a DSL Child Dataset ZAP Object (child dataset information) on its left, leading to DSL Directory (child1) and DSL Directory (child2), and a DSL Properties ZAP Object on its right. Below the DSL Directory is the DSL Dataset (active), followed by a chain of DSL Dataset (snapshot) structures. Each dataset points to its own DMU Object Set (active or snapshot).]

[4] The ZAP is explained in Chapter 5.

Section 4.2: DSL Implementation Details

The DSL is implemented as an object set of type DMU_OST_META. This object set is often called the Meta Object Set, or MOS. There is only one MOS per pool and the uberblock (see Chapter One) points to it directly.

There is a single distinguished object in the Meta Object Set. This object is called the object directory and is always located in the second element of the dnode array (index 1). All objects, with the exception of the object directory, can be located by traversing through a set of object references starting at this object.

The object directory

The object directory is a ZAP object (an object containing name/value pairs; see Chapter 5 for a description of ZAP objects) containing three attribute pairs (name/value) named: root_dataset, config, and sync_bplist.

root_dataset: The "root_dataset" attribute contains a 64 bit integer value identifying the object number of the root DSL directory for the pool. The root DSL directory is a special object whose contents reference all top level datasets within the pool. The "root_dataset" directory is an object of type DMU_OT_DSL_DIR and will be explained in greater detail in Section 4.4: DSL Directory Internals.

config: The "config" attribute contains a 64 bit integer value identifying the object number for an object of type DMU_OT_PACKED_NVLIST. This object contains XDR_ENCODED name value pairs describing this pool's vdev configuration. Its contents are similar to those described in section 1.3.3: name/value pairs list.

sync_bplist: The "sync_bplist" attribute contains a 64 bit integer value identifying the object number for an object of type DMU_OT_SYNC_BPLIST. This object contains a list of block pointers which need to be freed during the next transaction.

The illustration below shows the meta object set (MOS) in relation to the uberblock and label structures discussed in Chapter 1.

[Illustration 14: Meta Object Set. A vdev label (L0, L1, L2, L3, with Blank Space, Boot, name/value pairs, and the uberblock_phys_t array) whose active uberblock (ub_magic, ub_version, ub_txg, ub_guid_sum, ub_timestamp, ub_rootbp) points to the MOS objset_phys_t (os_type = DMU_OST_META). The metadnode (dn_type = DMU_OT_DNODE) points to the dnode array, in which the object directory (dn_type = DMU_OT_OBJECT_DIRECTORY, dn_nlevels = 1, dn_nblkptr = 1) holds the ZAP entries root_dataset = 2, config = 4, sync_bplist = 1023.]

Section 4.3: Dataset Internals

Datasets are stored as an object of type DMU_OT_DSL_DATASET. This object type uses the bonus buffer in the dnode_phys_t to hold a dsl_dataset_phys_t structure. The contents of the dsl_dataset_phys_t structure are shown below.

uint64_t ds_dir_obj: Object number of the DSL directory referencing this dataset.

uint64_t ds_prev_snap_obj: If this dataset represents a filesystem, volume, or clone, this field contains the 64 bit object number for the most recent snapshot taken; this field is zero if no snapshots have been taken. If this dataset represents a snapshot, this field contains the 64 bit object number for the snapshot taken prior to this snapshot. This field is zero if there are no previous snapshots.

uint64_t ds_prev_snap_txg: The transaction group number when the previous snapshot (pointed to by ds_prev_snap_obj) was taken.

uint64_t ds_next_snap_obj: This field is only used for datasets representing snapshots. It contains the object number of the dataset which is the most recent snapshot. This field is always zero for datasets representing clones, volumes, or filesystems.

uint64_t ds_snapnames_zapobj: Object number of a ZAP object (see Chapter 5) containing name value pairs for each snapshot of this dataset. Each pair contains the name of the snapshot and the object number associated with its DSL dataset structure.

uint64_t ds_num_children: Always zero if not a snapshot. For snapshots, this is the number of references to this snapshot: 1 (from the next snapshot taken, or from the active dataset if no snapshots have been taken) + the number of clones originating from this snapshot.

uint64_t ds_creation_time: Seconds since January 1st 1970 (GMT) when this dataset was created.

uint64_t ds_creation_txg: The transaction group number in which this dataset was created.

uint64_t ds_deadlist_obj: The object number of the deadlist (an array of blkptr's deleted since the last snapshot).

uint64_t ds_used_bytes: unique bytes used by the object set represented by this dataset

uint64_t ds_compressed_bytes: number of compressed bytes in the object set represented by this dataset

uint64_t ds_uncompressed_bytes: number of uncompressed bytes in the object set represented by this dataset

uint64_t ds_unique_bytes: When a snapshot is taken, its initial contents are identical to that of the active copy of the data. As the data changes in the active copy, more and more data becomes unique to the snapshot (the data diverges from the snapshot). As that happens, the amount of data unique to the snapshot increases. The amount of unique snapshot data is stored in this field: it is zero for clones, volumes, and filesystems.

uint64_t ds_fsid_guid: 64 bit ID that is guaranteed to be unique amongst all

currently open datasets. Note, this ID could change between successive dataset opens.

uint64_t ds_guid: 64 bit global id for this dataset. This value never changes during the lifetime of the object set.

uint64_t ds_restoring: The field is set to "1" if ZFS is in the process of restoring to this dataset through 'zfs restore' [5]

blkptr_t ds_bp: Block pointer containing the location of the object set that this dataset represents.

Section 4.4: DSL Directory Internals

The DSL Directory object contains a dsl_dir_phys_t structure in its bonus buffer. The contents of this structure are described in detail below:

uint64_t dd_creation_time: Seconds since January 1st, 1970 (GMT) when this DSL directory was created.

uint64_t dd_head_dataset_obj: 64 bit object number of the active dataset object

uint64_t dd_parent_obj: 64 bit object number of the parent DSL directory.

uint64_t dd_clone_parent_obj: For cloned object sets, this field contains the object number of snapshot used to create this clone.

uint64_t dd_child_dir_zapobj: Object number of a ZAP object containing name-value pairs for each child of this DSL directory.

uint64_t dd_used_bytes: Number of bytes used by all datasets within this directory: includes any snapshot and child dataset used bytes.

uint64_t dd_compressed_bytes: Number of compressed bytes for all datasets within this DSL directory.

uint64_t dd_uncompressed_bytes: Number of uncompressed bytes for all datasets within this DSL directory.

uint64_t dd_quota: Designated quota, if any, which can not be exceeded by the datasets within this DSL directory.

uint64_t dd_reserved: The amount of space reserved for consumption by the datasets within this DSL directory.

[5] See the ZFS Admin Guide for information about the zfs command.

uint64_t dd_props_zapobj: 64 bit object number of a ZAP object containing the properties for all datasets within this DSL directory. Only the non-inherited / locally set values are represented in this ZAP object. Default, inherited values are inferred when there is an absence of an entry.

The following table shows valid property values.

    Property        Description                                                                    Values
    aclinherit      Controls inheritance behavior for datasets.                                    discard = 0, noallow = 1, passthrough = 3, secure = 4 (default)
    aclmode         Controls chmod and file/dir creation behavior for datasets.                    discard = 0, groupmask = 2 (default), passthrough = 3
    atime           Controls whether atime is updated on objects within a dataset.                 off = 0, on = 1 (default)
    checksum        Checksum algorithm for all datasets within this DSL Directory.                 on = 1 (default), off = 0
    compression     Compression algorithm for all datasets within this DSL Directory.              on = 1, off = 0 (default)
    devices         Controls whether device nodes can be opened on datasets.                       devices = 0, nodevices = 1 (default)
    exec            Controls whether files can be executed on a dataset.                           exec = 1 (default), noexec = 0
    mountpoint      Mountpoint path for datasets within this DSL Directory.                        string
    quota           Limits the amount of space all datasets within a DSL directory can consume.    quota size in bytes, or zero for no quota (default)
    readonly        Controls whether objects can be modified on a dataset.                         readonly = 1, readwrite = 0 (default)
    recordsize      Block size for all objects within the datasets contained in this DSL Directory.    recordsize in bytes
    reservation     Amount of space reserved for this DSL Directory, including all child datasets and child DSL Directories.    reservation size in bytes
    setuid          Controls whether the set-UID bit is respected on a dataset.                    setuid = 1 (default), nosetuid = 0
    sharenfs        Controls whether the datasets in a DSL Directory are shared by NFS.            string: any valid nfs share options
    snapdir         Controls whether .zfs is hidden or visible in the root filesystem.             hidden = 0, visible = 1 (default)
    volblocksize    For volumes, specifies the block size of the volume. The blocksize cannot be changed once the volume has been written, so it should be set at volume creation time.    between 512 and 128K, powers of two. Defaults to 8K
    volsize         Volume size, only applicable to volumes.                                       volume size in bytes
    zoned           Controls whether a dataset is managed through a local zone.                    on = 1, off = 0 (default)

Table 12 Editable Property Values stored in the dd_props_zapobj

Chapter Five – ZAP

The ZAP (ZFS Attribute Processor) is a module which sits on top of the DMU and operates on objects called ZAP objects. A ZAP object is a DMU object used to store attributes in the form of name-value pairs. The name portion of the attribute is a zero-terminated string of up to 256 bytes (including terminating NULL). The value portion of the attribute is an array of integers whose size is only limited by the size of a ZAP data block.

ZAP objects are used to store properties for a dataset, navigate filesystem objects, store pool properties and more. The following table contains a list of ZAP object types.

    Object Type
    DMU_OT_OBJECT_DIRECTORY
    DMU_OT_DSL_DIR_CHILD_MAP
    DMU_OT_DSL_DS_SNAP_MAP
    DMU_OT_DSL_PROPS
    DMU_OT_DIRECTORY_CONTENTS
    DMU_OT_MASTER_NODE
    DMU_OT_DELETE_QUEUE
    DMU_OT_ZVOL_PROP

Table 13 ZAP Object Types

ZAP objects come in two forms: microzap objects and fatzap objects. Microzap objects are a lightweight version of the fatzap and provide a simple and fast lookup mechanism for a small number of attribute entries. The fatzap is better suited for ZAP objects containing large numbers of attributes.

The following guidelines are used by ZFS to decide whether or not to use a fatzap or a microzap object.

A microzap object is used if all three conditions below are met:

• All name-value pair entries fit into one block. The maximum data block size in ZFS is 128KB and this size block can fit up to 2047 microzap entries.
• The value portion of all attributes are of type uint64_t.
• The name portion of each attribute is less than or equal to 50 characters in length (including NULL terminating character).

If any of the above conditions are not met, a fatzap object is used.

The first 64 bit word in each block of a ZAP object is used to identify the type of ZAP contents contained within this block. The table below shows these values.

    Identifier    Description                                                                                                           Value
    ZBT_MICRO     This block contains microzap entries                                                                                  (1ULL << 63) + 3
    ZBT_HEADER    This block is used for the fatzap. This identifier is only used for the first block in a fatzap object.               (1ULL << 63) + 1
    ZBT_LEAF      This block is used for the fatzap. This identifier is used for all blocks in the fatzap with the exception of the first.    (1ULL << 63) + 0

Table 14 ZAP Object Block Types

Section 5.1: The Micro Zap

The microzap implements a simple mechanism for storing a small number of attributes. A microzap object consists of a single block containing an array of microzap entries (mzap_ent_phys_t structures). Each attribute stored in a microzap object is represented by one of these microzap entry structures.

A microzap block is laid out as follows: the first 128 bytes of the block contain a microzap header structure called the mzap_phys_t. This structure contains a 64 bit ZBT_MICRO value indicating that this block is used to store microzap entries. Following this value is a 64 bit salt value that is stirred into the hash so that the hash function is different for each ZAP object. The next 48 bytes of this header are intentionally left blank and the last 64 bytes contain the first microzap entry (a structure of type mzap_ent_phys_t). The remaining bytes in this block are used to store an array of mzap_ent_phys_t structures. The illustration below shows the layout of this block.

    first 128 bytes:    microzap block type | salt | padding | mzap_ent_phys_t
    remaining bytes:    mzap_ent_phys_t array

Illustration 15 Microzap block layout

The mzap_ent_phys_t structure and associated #defines are shown below.

    #define MZAP_ENT_LEN   64
    #define MZAP_NAME_LEN  (MZAP_ENT_LEN - 8 - 4 - 2)

    typedef struct mzap_ent_phys {
        uint64_t mze_value;
        uint32_t mze_cd;
        uint16_t mze_pad;
        char     mze_name[MZAP_NAME_LEN];
    } mzap_ent_phys_t;
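The microzap layout implies a simple capacity calculation, and the MZAP_NAME_LEN definition implies the name length limit. The following C sketch works both out under the layout described above (a 128 byte header whose last 64 bytes hold the first entry). The function names mzap_max_entries and mzap_name_fits, and the MZAP_HDR_LEN constant, are hypothetical and introduced here only for illustration.

```c
#include <stdint.h>
#include <string.h>

#define MZAP_ENT_LEN  64
#define MZAP_NAME_LEN (MZAP_ENT_LEN - 8 - 4 - 2)
#define MZAP_HDR_LEN  128 /* assumed name for the 128 byte header size */

/* Number of microzap entries that fit in one block of block_size bytes. */
uint64_t mzap_max_entries(uint64_t block_size)
{
    /* The header takes two 64 byte slots but carries one entry itself. */
    return (block_size - MZAP_HDR_LEN) / MZAP_ENT_LEN + 1;
}

/* Nonzero if name, including its terminating NUL, fits in mze_name. */
int mzap_name_fits(const char *name)
{
    return strlen(name) + 1 <= MZAP_NAME_LEN;
}
```

For a 128KB block this gives 2047 entries, which agrees with the limit quoted in the microzap conditions, and MZAP_NAME_LEN evaluates to 50.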

mze_value: 64 bit integer

mze_cd: 32 bit collision differentiator ("CD"): associated with an entry whose hash value is the same as another entry within this ZAP object. When an entry is inserted into the ZAP object, the lowest CD which is not already used by an entry with the same hash value is assigned. In the absence of hash collisions, the CD value will be zero.

mze_pad: reserved for future use

mze_name: NULL terminated string less than or equal to 50 characters in length

Section 5.2: The Fat Zap

The fatzap implements a flexible architecture for storing large numbers of attributes, and/or attributes with long names or complex values (not uint64_t). This section begins with an explanation of the basic structure of a fatzap object and is followed by a detailed explanation of each component of a fatzap object.

All entries in a fatzap object are arranged based on a 64 bit hash of the attribute's name. The hash is used to index into a pointer table (as can be seen on the left side of the illustration below). The number of bits used to index into this table (sometimes called the prefix) is dependent on the number of entries in the table. The number of entries in the table can change over time. As policy stands today, the pointer table will grow if the number of entries hashing to a particular bucket exceeds the capacity of one leaf block (explained in detail below). The pointer table entries reference a chain of fatzap blocks called leaf blocks, represented by the zap_leaf_phys structure. Each leaf block is broken up into some number of chunks (zap_leaf_chunks) and each attribute is stored in one or more of these leaf chunks. The illustration below shows the basic fatzap structures; each component is explained in detail in the following sections.

[Illustration 16: fatzap structure overview. The zap_phys_t (first block in the ZAP object) holds the pointer table, whose entries reference zap_leaf_phys_t blocks, each containing zap leaf chunks.]

Section 5.2.1: zap_phys_t

The first block of a fatzap object contains a 128KB zap_phys_t structure. Depending on the size of the pointer table, this structure may contain the pointer table. If the pointer table is too large to fit in the space provided by the zap_phys_t, some information about where it can be found is stored in the zap_table_phys portion of this structure. The definitions of the zap_phys_t contents are as follows:

    zap_phys_t
        uint64_t zap_block_type
        uint64_t zap_magic
        struct zap_table_phys {
            uint64_t zt_blk
            uint64_t zt_numblks
            uint64_t zt_shift
            uint64_t zt_nextblk
            uint64_t zt_blk_copied
        } zap_ptrtbl;
        uint64_t zap_freeblk
        uint64_t zap_num_leafs
        uint64_t zap_num_entries
        uint64_t zap_salt
        uint64_t zap_pad[8181]
        uint64_t zap_leafs[8192]

Illustration 17 zap_phys_t structure

zap_block_type: 64 bit integer identifying type of ZAP block. Always set to ZBT_HEADER (see Table 14) for the first block in the fatzap.

zap_magic: 64 bit integer containing the ZAP magic number: 0x2F52AB2AB (zfs-zap-zap)

zap_table_phys: structure whose contents are used to describe the pointer table

zt_blk: Blkid for the first block of the pointer table. This field is only used when the pointer table is external to the zap_phys_t structure; zero otherwise.

zt_numblks: Number of blocks used to hold the pointer table. This field is only used when the pointer table is external to the zap_phys_t structure; zero otherwise.

zt_shift: Number of bits used from the hash value to index into the pointer table. If the pointer table is contained within the zap_phys, this value will be 13.

uint64_t zt_nextblk:
uint64_t zt_blks_copied:

#int65At lhrAprefi1. obGect" )ap%n"m%entries: D#mber of attrib#tes stored in this Z-.ointer Table *he pointer table is a hash table which #ses a chaining method to handle collisions" <ach hash b#c et contains a 65 bit integer which describes the level !ero bloc id (see 'hapter 4 for a description of bloc ids) of the first element in the chain of entries hashed here" -n entries hash b#c et is determined by #sing the first few bits of the 65 bit Z-. and some n#mber of ch#n s" typedef str#ct !apAleafAphys Q str#ct !apAleafAheader Q #int65At lhrAbloc Atype. bloc that can be #sed to allocate a new !apAleaf" )ap%n"m%leafs: D#mber of !apAleafAphysAt str#ct#res (described below) contained within this Z-. #int42At lhrAmagic. so that the hash f#nction is different for each Z-. #int16At lhrAprefi1Alen. this field is #n#sed" ection 5"#"#$ . #int8At lhApad2H12I. obGect" )ap%leafs01(234+ *he !apAleaf array contains 214 (81J2) slots" $f the pointer table has fewer than 214 entries. R lAhdr. @W 2 25-byte ch#n s W@ 41 . a hash table. #int65At lhrAne1t. entry hash comp#ted from the attrib#te>s name" *he val#e #sed to inde1 into the pointer table is called the pre&iA and is the 't%shi&t high order bits of the 65 bit comp#ted hash" ection 5"#"'$ 6ap7leaf7ph&s7t *he !apAleafAphysAt is the str#ct#re referenced by the pointer table" 'ollisions in the pointer table res#lt in !apAleafAphysAt str#ct#res being str#ng together in a lin list fashion" *he !apAleafAphysAt str#ct#re contains a header. #int16At lhrAnentries. #int16At lhAfreelist. obGect" )ap%salt: *he salt val#e is a 65 bit integer that is stirred into the hash f#nction. the pointer table will be stored here" $f not. #int16At lhrAnfree.*he above two fields are #sed when the pointer table changes si!es" )ap%freeblk: 65 bit integer containing the first available Z-.

str#ct !apAleafAfree Q #int8At lfAtype. #int16At leAnameAlength. #int16At leAval#eAch#n .A/<-FA-BB-EA:E*<SI. 5eader *he header for the Z-. leaf is stored in a !apAleafAheader str#ct#re" $t>s description is as follows+ l$r%block%t*pe: always Z:*A/<-F (see *able 15 for val#es) l$r%ne't: 65 bit integer bloc id for the ne1t leaf in a bloc chain" l$r%prefi' and l$r%prefi'%len: <ach leaf (or chain of leafs) stores the Z-.A/<-FAD0%'K0D=SI. R lAentry. #int8At leApadH2I. #int8At laAarrayHZ-. R !apAleafAphysAt.A/<-FA-BB-EA:E*<SI. R lAch#n HZ-. #int16At lfAne1t. entries whose first lhrAprefi1len bits of their hash val#e e&#als lhrAprefi1" lhrAprefi1len can be e&#al to or less than !tAshift (the n#mber of bits #sed to inde1 into the pointer table) in which case m#ltiple pointer table b#c ets reference the same leaf" l$r%magic: leaf magic n#mber OO 0x2A'1 A/ (!ap-leaf) l$r%nfree: n#mber of free ch#n s in this leaf (ch#n s described below) l$r%nentries: n#mber of Z-.#int16At lAhashHZ-. #int16At leAnameAch#n . R lAfree. #int16At leAne1t. #int16At laAne1t. str#ct !apAleafAarray Q #int8At laAtype. #int8At lfApadHZ-. #int8At leAintAsi!e. R lAarray. #int16At leAcd. #int65At leAhash. #int16At leAval#eAlength. 16 bit integer #sed to inde1 into the !apAleafAch#n array 4 . #nion !apAleafAch#n Q str#ct !apAleafAentry Q #int8At leAtype.A/<-FAK-SKAD0%<D*B$<SI. entries stored in this leaf l$r%freelist: head of a list of free ch#n s.
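The prefix computation of Section 5.2.2 and the lhr_prefix match described in the leaf header can be sketched in C as follows. This is a minimal illustration under stated assumptions, not code taken from the ZFS sources: the function names zap_ptrtbl_index and zap_leaf_covers are hypothetical, and the hash is assumed to be already computed.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Pointer table index ("prefix"): the zt_shift high order bits of the
 * 64 bit hash. Hypothetical helper name, for illustration only.
 */
uint64_t zap_ptrtbl_index(uint64_t hash, unsigned zt_shift)
{
    assert(zt_shift >= 1 && zt_shift <= 64);
    return hash >> (64 - zt_shift);
}

/*
 * Returns 1 if a leaf with the given lhr_prefix / lhr_prefix_len stores
 * entries with this hash, i.e. the first lhr_prefix_len bits of the hash
 * equal lhr_prefix. Hypothetical helper name.
 */
int zap_leaf_covers(uint64_t hash, uint64_t lhr_prefix, unsigned lhr_prefix_len)
{
    assert(lhr_prefix_len <= 64);
    if (lhr_prefix_len == 0)
        return 1;               /* an empty prefix matches every hash */
    return (hash >> (64 - lhr_prefix_len)) == lhr_prefix;
}
```

With zt_shift set to 4 and a leaf whose prefix is binary 101 (lhr_prefix_len of 3), the pointer table buckets 1010 and 1011 both resolve to that same leaf, which is the "multiple buckets reference the same leaf" case described above.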

Leaf Hash

The next 8KB of the zap_leaf_phys_t is the zap leaf hash table. The entries in the hash table reference chunks of type zap_leaf_entry. Twelve bits of the attribute's hash value (the twelve following the lhr_prefix_len bits used to uniquely identify this block) are used to index into this table. Hash table collisions are handled by chaining entries. Each bucket in the table contains a 16 bit integer which is the index into the zap_leaf_chunk array.

Section 5.2.4: zap_leaf_chunk

Each leaf contains an array of chunks. There are three types of chunks: zap_leaf_entry, zap_leaf_array, and zap_leaf_free. Each attribute is represented by some number of these chunks: one zap_leaf_entry and some number of zap_leaf_array chunks. The illustration below shows how these chunks are arranged. A detailed description of each chunk type follows the illustration.

[Illustration: each ZAP entry is one zap_leaf_entry chunk (le_type = 252) whose le_name_chunk and le_value_chunk fields reference chains of zap_leaf_array chunks (la_type = 251). Entry chains are linked through le_next and array chains through la_next; both end with a value of 0xffff.]

Illustration 18 zap leaf structure

zap_leaf_entry: The leaf hash table (described above) points to chunks of this type. This entry contains pointers to chunks of type zap_leaf_array which hold the name and value for the attributes being stored here.

    le_type: ZAP_LEAF_ENTRY == 252

    le_int_size: Size of integers in bytes for this entry.

    le_next: Next entry in the zap_leaf_chunk chain. Chains occur when there are collisions in the hash table. The end of the chain is designated by a le_next value of 0xffff.

    le_name_chunk: 16 bit integer identifying the chunk of type zap_leaf_array which contains the first 21 characters of this attribute's name.

    le_name_length: The length of the attribute's name, including the NULL character.

    le_value_chunk: 16 bit integer identifying the first chunk (type zap_leaf_array) containing the first 21 bytes of the attribute's value.

    le_value_length: The length of the attribute's value, in integer increments (le_int_size).

    le_cd: The collision differentiator ("CD") is a value associated with an entry whose hash value is the same as another entry within this ZAP object. When an entry is inserted into the ZAP object, the lowest CD which is not already used by an entry with the same hash value is assigned. In the absence of hash collisions, the CD value will be zero.

    le_hash: 64 bit hash of this attribute's name.

zap_leaf_array: Chunks of type zap_leaf_array hold either the name or the value of the ZAP attribute. These chunks can be strung together to provide for long names or large values. zap_leaf_array chunks are pointed to by a zap_leaf_entry chunk.

    la_type: ZAP_LEAF_ARRAY == 251

    la_array: 21 byte array containing the name or the value. Values of type "integer" are always stored in big endian format, regardless of the machine's native endianness.

    la_next: 16 bit integer used to index into the zap_leaf_chunk array; it references the next zap_leaf_array chunk for this attribute. A value of 0xffff (CHAIN_END) is used to designate the end of the chain.

zap_leaf_free: Unused chunks are kept in a chained free list. The root of the free list is stored in the leaf header.

    lf_type: ZAP_LEAF_FREE == 253

    lf_next: 16 bit integer pointing to the next free chunk.
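The leaf hash lookup and the zap_leaf_array chaining described in this section can be sketched in C as follows. This is a simplified illustration under stated assumptions: the chunk array is modelled as a plain array of zap_leaf_array structures rather than the union used on disk, and the helper names zap_leaf_hash_index and zap_array_read are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ZAP_LEAF_ARRAY_BYTES 21      /* bytes of payload per array chunk */
#define CHAIN_END            0xffff  /* la_next value ending a chain */

typedef struct zap_leaf_array {
    uint8_t  la_type;
    uint8_t  la_array[ZAP_LEAF_ARRAY_BYTES];
    uint16_t la_next;
} zap_leaf_array_t;

/* Leaf hash bucket: the twelve hash bits that follow the leaf's prefix. */
unsigned zap_leaf_hash_index(uint64_t hash, unsigned lhr_prefix_len)
{
    assert(lhr_prefix_len <= 52);    /* 12 bits must remain after the prefix */
    return (unsigned)((hash >> (64 - 12 - lhr_prefix_len)) & 0xfff);
}

/*
 * Copy up to len bytes of a name or value out of a chain of array chunks,
 * starting at chunk index first. Returns the number of bytes copied, which
 * is less than len if the chain ends (or leaves the array) early.
 */
size_t zap_array_read(const zap_leaf_array_t *chunks, size_t nchunks,
                      uint16_t first, uint8_t *buf, size_t len)
{
    size_t copied = 0;
    uint16_t idx = first;

    assert(chunks != NULL && buf != NULL);
    while (copied < len && idx != CHAIN_END && idx < nchunks) {
        size_t n = len - copied;
        if (n > ZAP_LEAF_ARRAY_BYTES)
            n = ZAP_LEAF_ARRAY_BYTES;
        memcpy(buf + copied, chunks[idx].la_array, n);
        copied += n;
        idx = chunks[idx].la_next;   /* follow the chain */
    }
    return copied;
}
```

To read an attribute's name, a caller would pass le_name_chunk as first and le_name_length as len; for a value, le_value_chunk and le_value_length multiplied by le_int_size.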

Chapter Six – ZPL

The ZPL, ZFS POSIX Layer, makes DMU objects look like a POSIX filesystem. POSIX is a standard defining the set of services a filesystem must provide. ZFS filesystems provide all of these required services.

The ZPL represents filesystems as an object set of type DMU_OST_ZFS. All snapshots, clones and filesystems are implemented as an object set of this type. The ZPL uses a well defined format for organizing objects in its object set. The section below describes this layout.

Section 6.1: ZPL Filesystem Layout

A ZPL object set has one object with a fixed location and fixed object number. This object is called the "master node" and always has an object number of 1. The master node is a ZAP object containing three attributes: DELETE_QUEUE, VERSION, and ROOT.

Name: DELETE_QUEUE
Value: 64 bit object number for the delete queue object
Description: The delete queue provides a list of deletes that were in progress when the filesystem was force unmounted or as a result of a system failure such as a power outage. Upon the next mount of the filesystem, the delete queue is processed to remove the files/dirs that are in the delete queue. This mechanism is used to avoid leaking files and directories in the filesystem.

Name: VERSION
Value: Currently a value of "1".
Description: ZPL version used to lay out this filesystem.

Name: ROOT
Value: 64 bit object number
Description: This attribute's value contains the object number for the top level directory in this filesystem, the root directory.

Section 6.2: Directories and Directory Traversal

Filesystem directories are implemented as ZAP objects (object type DMU_OT_DIRECTORY). Each directory holds a set of name-value pairs which contain the names and object numbers for each directory entry. Traversing through a directory tree is as simple as looking up the value for an entry and reading that object number.

All filesystem objects contain a znode_phys_t structure in the bonus buffer of their dnode. This structure stores the attributes for the filesystem object. The znode_phys_t structure is shown below.

    typedef struct znode_phys {
        uint64_t zp_atime[2];
        uint64_t zp_mtime[2];
        uint64_t zp_ctime[2];
        uint64_t zp_crtime[2];
        uint64_t zp_gen;
        uint64_t zp_mode;
        uint64_t zp_size;
        uint64_t zp_parent;
        uint64_t zp_links;
        uint64_t zp_xattr;
        uint64_t zp_rdev;
        uint64_t zp_flags;
        uint64_t zp_uid;
        uint64_t zp_gid;
        uint64_t zp_pad[4];
        zfs_znode_acl_t zp_acl;
    } znode_phys_t

zp_atime: Two 64 bit integers containing the last file access time in seconds (zp_atime[0]) and nanoseconds (zp_atime[1]) since January 1st 1970 (GMT).

zp_mtime: Two 64 bit integers containing the last file modification time in seconds (zp_mtime[0]) and nanoseconds (zp_mtime[1]) since January 1st 1970 (GMT).

zp_ctime: Two 64 bit integers containing the last file change time in seconds (zp_ctime[0]) and nanoseconds (zp_ctime[1]) since January 1st 1970 (GMT).

zp_crtime: Two 64 bit integers containing the file's creation time in seconds (zp_crtime[0]) and nanoseconds (zp_crtime[1]) since January 1st 1970 (GMT).

zp_gen: 64 bit generation number; contains the transaction group number of the creation of this file.

zp_mode: 64 bit integer containing file mode bits and file type. The lower 9 bits of the mode contain the access mode bits, for example 755 (octal). The 10th bit is the sticky bit and can be a value of zero or one. Bits 13-16 are used to designate the file type. The file types can be seen in the table below.
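Before the table of file types, the zp_mode layout described above can be pulled apart with a few shifts and masks. The sketch below assumes that zp_mode follows the conventional POSIX mode layout (nine permission bits, sticky bit at octal 01000, file type in bits 13-16); the function names are hypothetical and are not taken from the ZFS sources.

```c
#include <assert.h>
#include <stdint.h>

/* File type nibble stored in bits 13-16 of zp_mode (e.g. 0x8 = regular file). */
unsigned zp_mode_file_type(uint64_t zp_mode)
{
    return (unsigned)((zp_mode >> 12) & 0xF);
}

/* Access permission bits, e.g. 0755. Assumes the POSIX nine-bit layout. */
unsigned zp_mode_perms(uint64_t zp_mode)
{
    return (unsigned)(zp_mode & 0777);
}

/* Nonzero if the sticky bit (octal 01000, an assumption from POSIX) is set. */
int zp_mode_sticky(uint64_t zp_mode)
{
    return (zp_mode & 01000) != 0;
}

/* Small self-check: a regular file with mode 755. */
void zp_mode_selfcheck(void)
{
    uint64_t mode = 0x8000 | 0755;
    assert(zp_mode_file_type(mode) == 0x8);
    assert(zp_mode_perms(mode) == 0755);
    assert(!zp_mode_sticky(mode));
}
```

A world-writable directory with the sticky bit set, octal 041777, would decode to file type 0x4, permissions 0777, and a sticky flag of one.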

    Type        Description                 Value in bits 13-16
    S_IFIFO     Fifo                        0x1
    S_IFCHR     Character Special Device    0x2
    S_IFDIR     Directory                   0x4
    S_IFBLK     Block special device        0x6
    S_IFREG     Regular file                0x8
    S_IFLNK     Symbolic Link               0xA
    S_IFSOCK    Socket                      0xC
    S_IFDOOR    Door                        0xD
    S_IFPORT    Event Port                  0xE

Table 15 File Types and their associated mode bits

zp_size: size of file in bytes

zp_parent: object id of the parent directory containing this file

zp_links: number of hard links to this file

zp_xattr: object ID of a ZAP object which is the hidden attribute directory. It is treated like a normal directory in ZFS, except that it is hidden and an application will need to "tunnel" into the file via openat() to get to it.

zp_rdev: dev_t for files of type S_IFCHR or S_IFBLK

zp_flags: Persistent flags set on the file. The following are valid flag values.

    Flag               Value
    ZFS_XATTR          0x1
    ZFS_INHERIT_ACE    0x2

Table 16 zp_flag values

zp_uid: 64 bit integer (uid_t) of the file's owner.

zp_gid: 64 bit integer (gid_t) owning group of the file.

zp_acl: zfs_znode_acl structure containing any ACL entries set on this file. The zfs_znode_acl structure is defined below.

Section 6.3: ZFS Access Control Lists

Access control lists (ACL) serve as a mechanism to allow or restrict user access privileges on a ZFS object. ACLs are implemented in ZFS as a table containing ACEs (Access Control Entries). The znode_phys_t contains a zfs_znode_acl structure. This structure is shown below.

    #define ACE_SLOT_CNT 6

    typedef struct zfs_znode_acl {
        uint64_t z_acl_extern_obj;
        uint32_t z_acl_count;
        uint16_t z_acl_version;
        uint16_t z_acl_pad;
        ace_t    z_ace_data[ACE_SLOT_CNT];
    } zfs_znode_acl_t;

z_acl_extern_obj: Used for holding ACLs that won't fit in the znode. In other words, it is for ACLs with more than 6 ACEs. The object type of an extern ACL is DMU_OT_ACL.

z_acl_count: number of ACE entries that make up an ACL

z_acl_version: reserved for future use.

z_acl_pad: reserved for future use.

z_ace_data: Array of up to 6 ACEs.

An ACE specifies an access right to an individual user or group for a specific object.

    typedef struct ace {
        uid_t    a_who;
        uint32_t a_access_mask;
        uint16_t a_flags;
        uint16_t a_type;
    } ace_t;

a_who: This field is only meaningful when the ACE_OWNER, ACE_GROUP and ACE_EVERYONE flags (set in a_flags, described below) are not asserted. The a_who field contains a UID or GID. If the ACE_IDENTIFIER_GROUP flag is set in a_flags (see below), the a_who field will contain a GID. Otherwise, this field will contain a UID.

a_access_mask: 32 bit access mask. The table below shows the access attribute associated with each bit.

    Attribute                 Value
    ACE_READ_DATA             0x00000001
    ACE_LIST_DIRECTORY        0x00000001
    ACE_WRITE_DATA            0x00000002
    ACE_ADD_FILE              0x00000002
    ACE_APPEND_DATA           0x00000004
    ACE_ADD_SUBDIRECTORY      0x00000004
    ACE_READ_NAMED_ATTRS      0x00000008
    ACE_WRITE_NAMED_ATTRS     0x00000010
    ACE_EXECUTE               0x00000020
    ACE_DELETE_CHILD          0x00000040
    ACE_READ_ATTRIBUTES       0x00000080
    ACE_WRITE_ATTRIBUTES      0x00000100
    ACE_DELETE                0x00010000
    ACE_READ_ACL              0x00020000
    ACE_WRITE_ACL             0x00040000
    ACE_WRITE_OWNER           0x00080000
    ACE_SYNCHRONIZE           0x00100000

Table 17 Access Mask Values

a_flags: 16 bit integer whose value describes the ACL entry type and inheritance flags.

    ACE flag                          Value
    ACE_FILE_INHERIT_ACE              0x0001
    ACE_DIRECTORY_INHERIT_ACE         0x0002
    ACE_NO_PROPAGATE_INHERIT_ACE      0x0004
    ACE_INHERIT_ONLY_ACE              0x0008
    ACE_SUCCESSFUL_ACCESS_ACE_FLAG    0x0010
    ACE_FAILED_ACCESS_ACE_FLAG        0x0020
    ACE_IDENTIFIER_GROUP              0x0040
    ACE_OWNER                         0x1000
    ACE_GROUP                         0x2000
    ACE_EVERYONE                      0x4000

Table 18 Entry Type and Inheritance Flag Value

a_type: The type of this ACE. The valid types are listed in the table below.

    Type                           Value     Description
    ACE_ACCESS_ALLOWED_ACE_TYPE    0x0000    Grants access as described in a_access_mask.
    ACE_ACCESS_DENIED_ACE_TYPE     0x0001    Denies access as described in a_access_mask.
    ACE_SYSTEM_AUDIT_ACE_TYPE      0x0002    Audit the successful or failed accesses (depending on the presence of the successful/failed access flags) as defined in the a_access_mask. [6]
    ACE_SYSTEM_ALARM_ACE_TYPE      0x0003    Alarm the successful or failed accesses as defined in the a_access_mask. [7]

Table 19 ACE Types and Values

[6] The action taken as an effect of triggering an audit is currently undefined in Solaris.
[7] The action taken as an effect of triggering an alarm is currently undefined in Solaris.

Chapter Seven – ZFS Intent Log

The ZFS intent log (ZIL) keeps, in memory, transaction records of system calls that change the file system, with enough information to be able to replay them. These records are held in memory until either the DMU transaction group (txg) commits them to the stable pool and they can be discarded, or they are flushed to the stable log (also in the pool) due to an fsync, O_DSYNC or other synchronous requirement. In the event of a panic or power failure, the log records (transactions) are replayed.

There is one ZIL per file system. Its on-disk (pool) format consists of 3 parts:

- ZIL header
- ZIL blocks
- ZIL records

A log record holds a system call transaction. Log blocks can hold many log records and the blocks are chained together. Each ZIL block contains a block pointer (blkptr_t) in its trailer to the next ZIL block in the chain. Log blocks can be different sizes. The ZIL header points to the first block in the chain. Note there is not a fixed place in the pool to hold blocks. They are dynamically allocated and freed as needed from the blocks available. The illustration below shows the ZIL structure, with log blocks and log records of different sizes:

[Illustration: the ZIL header points to a chain of log blocks; each log block holds one or more log records followed by a trailer.]

Illustration 19 Overview of ZIL Structure

More details of the current ZIL on-disk structures are given below.

Section 7.1: ZIL header

There is one of these per ZIL and it has a simple structure:

    typedef struct zil_header {
        uint64_t zh_claim_txg;   /* txg in which log blocks were claimed */
        uint64_t zh_replay_seq;  /* highest replayed sequence number */
        blkptr_t zh_log;         /* log chain */
    } zil_header_t;

Section 7.2: ZIL blocks

ZIL blocks contain ZIL records. The blocks are allocated on demand and are of a variable size according to need. The size field is part of the blkptr_t which points to a log block. Each block is filled with records and contains a zil_trailer_t at the end of the block:

ZIL Trailer

    typedef struct zil_trailer {
        blkptr_t zit_next_blk;     /* next block in chain */
        uint64_t zit_nused;        /* bytes in log block used */
        zio_block_tail_t zit_bt;   /* block trailer */
    } zil_trailer_t;

ZIL records

ZIL record common structure

ZIL records all start with a common section followed by a record (transaction) specific structure. The common log record structure and record types (values for lrc_txtype) are:

    typedef struct {              /* common log record header */
        uint64_t lrc_txtype;      /* intent log transaction type */
        uint64_t lrc_reclen;      /* transaction record length */
        uint64_t lrc_txg;         /* dmu transaction group number */
        uint64_t lrc_seq;         /* intent log sequence number */
    } lr_t;

    #define TX_CREATE    1   /* Create file */
    #define TX_MKDIR     2   /* Make directory */
    #define TX_MKXATTR   3   /* Make XATTR directory */
    #define TX_SYMLINK   4   /* Create symbolic link to a file */
    #define TX_REMOVE    5   /* Remove file */
    #define TX_RMDIR     6   /* Remove directory */
    #define TX_LINK      7   /* Create hard link to a file */
    #define TX_RENAME    8   /* Rename a file */
    #define TX_WRITE     9   /* File write */
    #define TX_TRUNCATE  10  /* Truncate a file */
    #define TX_SETATTR   11  /* Set file attributes */
    #define TX_ACL       12  /* Set acl */

ZIL record specific structures

For each of the record (transaction) types listed above there is a specific structure which embeds the common structure. Within each record enough information is saved in order to be able to replay the transaction (usually one VOP call). The VOP layer will pass in-memory pointers to vnodes. These have to be converted to stable pool object identifiers (oids). When replaying the transaction the VOP layer is called again. To do this we reopen the object and pass its vnode.

Some of the record specific structures are used for more than one transaction type. The lr_create_t record specific structure is used for TX_CREATE, TX_MKDIR, TX_MKXATTR and TX_SYMLINK, and lr_remove_t is used for both TX_REMOVE and TX_RMDIR. All fields (other than strings and user data) are 64 bits wide. This provides for a well defined alignment which allows for easy compatibility between different architectures, and easy endianness conversion if necessary. Here is the definition of the record specific structures:

    typedef struct {
        lr_t     lr_common;     /* common portion of log record */
        uint64_t lr_doid;       /* object id of directory */
        uint64_t lr_foid;       /* object id of created file object */
        uint64_t lr_mode;       /* mode of object */
        uint64_t lr_uid;        /* uid of object */
        uint64_t lr_gid;        /* gid of object */
        uint64_t lr_gen;        /* generation (txg of creation) */
        uint64_t lr_crtime[2];  /* creation time */
        uint64_t lr_rdev;       /* rdev of object to create */
        /* name of object to create follows this */
        /* for symlinks, link content follows name */
    } lr_create_t;

    typedef struct {
        lr_t     lr_common;     /* common portion of log record */
        uint64_t lr_doid;       /* obj id of directory */
        /* name of object to remove follows this */
    } lr_remove_t;

    typedef struct {
        lr_t     lr_common;     /* common portion of log record */
        uint64_t lr_doid;       /* obj id of directory */
        uint64_t lr_link_obj;   /* obj id of link */
        /* name of object to link follows this */
    } lr_link_t;

    typedef struct {
        lr_t     lr_common;     /* common portion of log record */
        uint64_t lr_sdoid;      /* obj id of source directory */
        uint64_t lr_tdoid;      /* obj id of target directory */
        /* 2 strings: names of source and destination follow this */
    } lr_rename_t;

    typedef struct {
        lr_t     lr_common;     /* common portion of log record */
        uint64_t lr_foid;       /* file object to write */
        uint64_t lr_offset;     /* offset to write to */
        uint64_t lr_length;     /* user data length to write */
        uint64_t lr_blkoff;     /* offset represented by lr_blkptr */
        blkptr_t lr_blkptr;     /* spa block pointer for replay */
        /* write data will follow for small writes */
    } lr_write_t;

    typedef struct {
        lr_t     lr_common;     /* common portion of log record */
        uint64_t lr_foid;       /* object id of file to truncate */
        uint64_t lr_offset;     /* offset to truncate from */
        uint64_t lr_length;     /* length to truncate */
    } lr_truncate_t;

    typedef struct {
        lr_t     lr_common;     /* common portion of log record */
        uint64_t lr_foid;       /* file object to change attributes */
        uint64_t lr_mask;       /* mask of attributes to set */
        uint64_t lr_mode;       /* mode to set */
        uint64_t lr_uid;        /* uid to set */
        uint64_t lr_gid;        /* gid to set */
        uint64_t lr_size;       /* size to set */
        uint64_t lr_atime[2];   /* access time */
        uint64_t lr_mtime[2];   /* modification time */
    } lr_setattr_t;

    typedef struct {
        lr_t     lr_common;     /* common portion of log record */
        uint64_t lr_foid;       /* obj id of file */
        uint64_t lr_aclcnt;     /* number of acl entries */
        /* lr_aclcnt number of ace_t entries follow this */
    } lr_acl_t;
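Because every record starts with the common lr_t header, the records packed into one log block can be walked using lrc_reclen alone. The sketch below illustrates this under several assumptions that the text above does not state explicitly: records start at offset zero of the block and are stored back to back, lrc_reclen counts the whole record including the common header, zit_nused gives the number of bytes occupied by records, and the buffer is already in host byte order. The names zil_block_foreach and lr_visit_fn are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {              /* common log record header */
    uint64_t lrc_txtype;      /* intent log transaction type */
    uint64_t lrc_reclen;      /* transaction record length */
    uint64_t lrc_txg;         /* dmu transaction group number */
    uint64_t lrc_seq;         /* intent log sequence number */
} lr_t;

/* Hypothetical callback invoked once per record found in the block. */
typedef void (*lr_visit_fn)(const lr_t *hdr, const uint8_t *rec, void *arg);

/*
 * Visit each record in buf[0 .. nused). Returns the number of records
 * visited, or -1 if a record header is truncated or its length is
 * inconsistent with the bytes remaining.
 */
int zil_block_foreach(const uint8_t *buf, uint64_t nused,
                      lr_visit_fn fn, void *arg)
{
    uint64_t off = 0;
    int count = 0;

    assert(buf != NULL);
    while (off < nused) {
        lr_t hdr;

        if (nused - off < sizeof hdr)
            return -1;                        /* truncated header */
        memcpy(&hdr, buf + off, sizeof hdr);  /* copy avoids unaligned reads */
        if (hdr.lrc_reclen < sizeof hdr || hdr.lrc_reclen > nused - off)
            return -1;                        /* record overruns the block */
        if (fn != NULL)
            fn(&hdr, buf + off, arg);
        off += hdr.lrc_reclen;
        count++;
    }
    return count;
}
```

A replay loop built on this would switch on hdr->lrc_txtype inside the callback and cast rec to the matching record specific structure, for example lr_remove_t for TX_REMOVE and TX_RMDIR.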

Chapter Eight – ZVOL (ZFS volume)

ZVOL (ZFS Volumes) provides a mechanism for creating logical volumes. ZFS volumes are exported as block devices and can be used like any other block device. ZVOLs are represented in ZFS as an object set of type DMU_OST_ZVOL (see Table 11). A ZVOL object set has a very simple format consisting of two objects: a properties object and a data object, of object type DMU_OT_ZVOL_PROP and DMU_OT_ZVOL respectively. Both objects have statically assigned object IDs. Each object is described below.

ZVOL Properties Object

Type: DMU_OT_ZVOL_PROP
Object #: 2
Description: The ZVOL property object is a ZAP object containing attributes associated with this volume. A particular attribute of interest is the "volsize" attribute. This attribute contains the size, in bytes, of the volume.

ZVOL Data

Type: DMU_OT_ZVOL
Object #: 1
Description: This object stores the contents of this virtual block device.
