ZFS On-Disk Format
Sun Microsystems, Inc. 4150 Network Circle Santa Clara, CA 95054 U.S.A
© 2006 Sun Microsystems, Inc. 4150 Network Circle Santa Clara, CA 95054 U.S.A.

This product or document is protected by copyright and distributed under licenses restricting its use, copying, distribution, and decompilation. No part of this product or document may be reproduced in any form by any means without prior written authorization of Sun and its licensors, if any. Third-party software, including font technology, is copyrighted and licensed from Sun suppliers.

Parts of the product may be derived from Berkeley BSD systems, licensed from the University of California.

Sun, Sun Microsystems, the Sun logo, Java, JavaServer Pages, Solaris, and StorEdge are trademarks or registered trademarks of Sun Microsystems, Inc. in the U.S. and other countries.

U.S. Government Rights: Commercial software. Government users are subject to the Sun Microsystems, Inc. standard license agreement and applicable provisions of the FAR and its supplements.

DOCUMENTATION IS PROVIDED AS IS AND ALL EXPRESS OR IMPLIED CONDITIONS, REPRESENTATIONS AND WARRANTIES, INCLUDING ANY IMPLIED WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE OR NON-INFRINGEMENT, ARE DISCLAIMED, EXCEPT TO THE EXTENT THAT SUCH DISCLAIMERS ARE HELD TO BE LEGALLY INVALID.

Unless otherwise licensed, use of this software is authorized pursuant to the terms of the license found at: http://developers.sun.com/berkeley_license.html
Table of Contents
Introduction
Chapter One: Virtual Devices (vdevs), Vdev Labels, and Boot Block
    Section 1.1: Virtual Devices
    Section 1.2: Vdev Labels
        Section 1.2.1: Label Redundancy
        Section 1.2.2: Transactional Two Staged Label Update
    Section 1.3: Vdev Technical Details
        Section 1.3.1: Blank Space
        Section 1.3.2: Boot Block Header
        Section 1.3.3: Name-Value Pair List
        Section 1.3.4: The Uberblock
    Section 1.4: Boot Block
Chapter Two: Block Pointers and Indirect Blocks
    Section 2.1: DVA - Data Virtual Address
    Section 2.2: GRID
    Section 2.3: GANG
    Section 2.4: Checksum
    Section 2.5: Compression
    Section 2.6: Block Size
    Section 2.7: Endian
    Section 2.8: Type
    Section 2.9: Level
    Section 2.10: Fill
    Section 2.11: Birth Transaction
    Section 2.12: Padding
Chapter Three: Data Management Unit
    Section 3.1: Objects
    Section 3.2: Object Sets
Chapter Four: DSL
    Section 4.1: DSL Infrastructure
    Section 4.2: DSL Implementation Details
    Section 4.3: Dataset Internals
    Section 4.4: DSL Directory Internals
Chapter Five: ZAP
    Section 5.1: The Micro Zap
    Section 5.2: The Fat Zap
        Section 5.2.1: zap_phys_t
        Section 5.2.2: Pointer Table
        Section 5.2.3: zap_leaf_phys_t
        Section 5.2.4: zap_leaf_chunk
Chapter Six: ZPL
    Section 6.1: ZPL Filesystem Layout
    Section 6.2: Directories and Directory Traversal
    Section 6.3: ZFS Access Control Lists
Chapter Seven: ZFS Intent Log
    Section 7.1: ZIL header
    Section 7.2: ZIL blocks
Chapter Eight: ZVOL (ZFS volume)
Introduction
ZFS is a new filesystem technology that provides immense capacity (128-bit), provable data integrity, an always-consistent on-disk format, self-optimizing performance, and real-time remote replication.

ZFS departs from traditional filesystems by eliminating the concept of volumes. Instead, ZFS filesystems share a common storage pool consisting of writeable storage media. Media can be added to or removed from the pool as filesystem capacity requirements change. Filesystems dynamically grow and shrink as needed without the need to re-partition underlying storage.

ZFS provides a truly consistent on-disk format by using a copy on write (COW) transaction model. This model ensures that on-disk data is never overwritten and all on-disk updates are done atomically.

The ZFS software is composed of seven distinct pieces: the SPA (Storage Pool Allocator), the DSL (Dataset and Snapshot Layer), the DMU (Data Management Unit), the ZAP (ZFS Attribute Processor), the ZPL (ZFS Posix Layer), the ZIL (ZFS Intent Log), and ZVOL (ZFS Volume). The on-disk structures associated with each of these pieces are explained in the following chapters: SPA (Chapters 1 and 2), DMU (Chapter 3), DSL (Chapter 4), ZAP (Chapter 5), ZPL (Chapter 6), ZIL (Chapter 7), ZVOL (Chapter 8).
Chapter One: Virtual Devices (vdevs), Vdev Labels, and Boot Block
Section 1.1: Virtual Devices
ZFS storage pools are made up of a collection of virtual devices. There are two types of virtual devices: physical virtual devices (sometimes called leaf vdevs) and logical virtual devices (sometimes called interior vdevs). A physical vdev is a writeable media block device (a disk, for example). A logical vdev is a conceptual grouping of physical vdevs.

Vdevs are arranged in a tree with physical vdevs existing as leaves of the tree. All pools have a special logical vdev called the "root" vdev which roots the tree. All direct children of the "root" vdev (physical or logical) are called top-level vdevs. The illustration below shows a tree of vdevs representing a sample pool configuration containing two mirrors. The first mirror (labeled "M1") contains two disks, represented by "vdev A" and "vdev B". Likewise, the second mirror "M2" contains two disks represented by "vdev C" and "vdev D". Vdevs A, B, C, and D are all physical vdevs. "M1" and "M2" are logical vdevs; they are also top-level vdevs since they originate from the "root" vdev.
[Illustration: vdev tree for the sample pool, showing the "M2" vdev (Mirror C/D) above its physical/leaf vdevs]
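The tree relationships described above can be sketched in a few lines of Python. This is only an illustration, not ZFS code: the `Vdev` class and the `leaf_vdevs` helper are hypothetical names, and the sample tree mirrors the two-mirror pool from the text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Vdev:
    """Hypothetical in-memory stand-in for a vdev; not an on-disk structure."""
    name: str
    children: List["Vdev"] = field(default_factory=list)

    def is_physical(self) -> bool:
        # Physical (leaf) vdevs have no children.
        return not self.children


def leaf_vdevs(vdev: Vdev) -> List[Vdev]:
    """Collect the physical vdevs found at the leaves of the tree."""
    if vdev.is_physical():
        return [vdev]
    leaves = []
    for child in vdev.children:
        leaves.extend(leaf_vdevs(child))
    return leaves


# Sample pool from the text: a root vdev with two mirrors of two disks each.
m1 = Vdev("M1", [Vdev("A"), Vdev("B")])
m2 = Vdev("M2", [Vdev("C"), Vdev("D")])
root = Vdev("root", [m1, m2])

top_level = [v.name for v in root.children]    # direct children of the root
physical = [v.name for v in leaf_vdevs(root)]  # leaves of the tree
```

Here `top_level` holds the two mirrors and `physical` holds the four disks, matching the classification given in the text.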
The vdev label serves two purposes: it provides access to a pool's contents and it is used to verify a pool's integrity and availability. To ensure that the vdev label is always available and always valid, redundancy and a staged update model are used. To provide redundancy, four copies of the label are written to each physical vdev within the pool. The four copies are identical within a vdev, but are not identical across vdevs in the pool. During label updates, a two staged transactional approach is used to ensure that a valid vdev label is always available on disk. Vdev label redundancy and the transactional update model are described in more detail below.
[Illustration: vdev labels L0 through L3 laid out on a physical vdev]
Based on the assumption that corruption (or accidental disk overwrites) typically occurs in contiguous chunks, placing the labels in non-contiguous locations (front and back) provides ZFS with a better probability that some label will remain accessible in the case of media failure or accidental overwrite (e.g. using the disk as a swap device while it is still part of a ZFS storage pool).
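[Illustration: one vdev label divided into blank space (0 to 8K), boot block header (8K to 16K), name/value pairs (16K to 128K), and uberblock array (128K to 256K)]
Illustration 3: Components of a vdev label (blank space, boot block header, name/value pairs, uberblock array)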
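As a rough sketch, the byte ranges of the four label components can be derived from the boundaries marked in the illustration (8K, 16K, 128K, 256K). The region names below are hypothetical identifiers chosen for this example, and the offsets are relative to the start of a single label.

```python
KIB = 1024

# Region sizes inside one vdev label, in the order the caption lists them.
# Sizes follow the 8K / 16K / 128K / 256K boundaries marked in the figure.
LABEL_REGIONS = [
    ("blank_space", 8 * KIB),
    ("boot_block_header", 8 * KIB),
    ("name_value_pairs", 112 * KIB),
    ("uberblock_array", 128 * KIB),
]


def region_offsets(regions):
    """Return {name: (start, end)} byte offsets relative to the label start."""
    offsets = {}
    start = 0
    for name, size in regions:
        offsets[name] = (start, start + size)
        start += size
    return offsets


offsets = region_offsets(LABEL_REGIONS)
label_size = sum(size for _, size in LABEL_REGIONS)
```

The name/value pair region comes out as 112 KB, which agrees with the size quoted for that portion of the label.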
1 Disk labels describe disk partition and slice information. See fdisk(1M) and/or format(1M) for more information on disk partitions and slices. It should be noted that disk labels are a completely separate entity from vdev labels and, while their naming is similar, they should not be confused as being similar.
All name-value pairs are stored in XDR encoded nvlists. For more information on XDR encoding or nvlists, see the libnvpair(3LIB) and nvlist_free(3NVPAIR) man pages. The following name-value pairs are contained within this 112KB portion of the vdev_label.

Version:
    Name: "version"
    Value: DATA_TYPE_UINT64
    Description: On-disk format version. Current value is "1".

Name:
    Name: "name"
    Value: DATA_TYPE_STRING
    Description: Name of the pool in which this vdev belongs.

State:
    Name: "state"
    Value: DATA_TYPE_UINT64
    Description: State of this pool. The following table shows all existing pool states.

    Value: 0, 1, 2
    Table 1: Pool states and values.

Transaction:
    Name: "txg"
    Value: DATA_TYPE_UINT64
    Description: Transaction group number in which this label was written to disk.

Pool Guid:
    Name: "pool_guid"
    Value: DATA_TYPE_UINT64
    Description: Global unique identifier (guid) for the pool.

Top Guid:
    Name: "top_guid"
    Value: DATA_TYPE_UINT64
    Description: Global unique identifier for the top-level vdev of this subtree.

Guid:
    Name: "guid"
    Value: DATA_TYPE_UINT64
    Description: Global unique identifier for this vdev.

Vdev Tree:
    Name: "vdev_tree"
    Value: DATA_TYPE_NVLIST
    Description: The vdev_tree is an nvlist structure which is used recursively to describe the hierarchical nature of the vdev tree as seen in illustrations one and four. The vdev_tree recursively describes each "related" vdev within this vdev's subtree. The illustration below shows what the "vdev_tree" entry might look like for "vdev A" as shown in illustrations one and four earlier in this document.
vdev_tree
    type='mirror'
    id=1
    guid=16593009660401351626
    metaslab_array=13
    metaslab_shift=22
    ashift=9
    asize=519569408
    children[0]
        vdev_tree
            type='disk'
            id=2
            guid=6649981596953412974
            path='/dev/dsk/c4t0d0'
            devid='id1,sd@SSEAGATE_ST373453LW_3HW0J0FJ00007404E4NS/a'
    children[1]
        vdev_tree
            type='disk'
            id=3
            guid=3648040300193291405
            path='/dev/dsk/c4t1d0'
            devid='id1,sd@SSEAGATE_ST373453LW_3HW0HLAW0007404D6MN/a'

Illustration 5: vdev tree nvlist entry for "vdev A" as seen in Illustrations 1 and 4
Each vdev_tree nvlist contains the following elements as described in the section below. Note that not all nvlist elements are applicable to all vdev types. Therefore, a vdev_tree nvlist may contain only a subset of the elements described below.

Name: "type"
Value: DATA_TYPE_STRING
Description: String value indicating type of vdev. The following vdev_types are valid.

    Type          Description
    "disk"        Leaf vdev: block storage
    "file"        Leaf vdev: file storage
    "mirror"      Interior vdev: mirror
    "raidz"       Interior vdev: raidz
    "replacing"   Interior vdev: a slight variation on the mirror vdev; used by ZFS when replacing one disk with another
    "root"        Interior vdev: the root of the vdev tree

Name: "id"
Value: DATA_TYPE_UINT64
Description: The id is the index of this vdev in its parent's children array.

Name: "guid"
Value: DATA_TYPE_UINT64
Description: Global Unique Identifier for this vdev_tree element.

Name: "path"
Value: DATA_TYPE_STRING
Description: Device path. Only used for leaf vdevs.

Name: "devid"
Value: DATA_TYPE_STRING
Description: Device ID for this vdev_tree element. Only used for vdevs of type disk.

Name: "metaslab_array"
Value: DATA_TYPE_UINT64
Description: Object number of an object containing an array of object numbers. Each element of this array (ma[i]) is, in turn, an object number of a space map for metaslab 'i'.

Name: "metaslab_shift"
Value: DATA_TYPE_UINT64
Description: Log base 2 of the metaslab size.

Name: "ashift"
Value: DATA_TYPE_UINT64
Description: Log base 2 of the minimum allocatable unit for this top level vdev. This is currently '10' for a RAIDZ configuration, '9' otherwise.

Name: "asize"
Value: DATA_TYPE_UINT64
Description: Amount of space that can be allocated from this top level vdev.

Name: "children"
Value: DATA_TYPE_NVLIST_ARRAY
Description: Array of vdev_tree nvlists for each child of this vdev_tree element.
overwritten. Instead, all updates to an uberblock are done by writing a modified uberblock to another element of the uberblock array. Upon writing the new uberblock, the transaction group number and timestamps are incremented, thereby making it the new active uberblock in a single atomic action. Uberblocks are written in a round robin fashion across the various vdevs within the pool. The illustration below has an expanded view of two uberblocks within an uberblock array.
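One way to picture the consequence of this update model: since each new uberblock carries a larger transaction group number and timestamp, a reader can find the active uberblock by scanning the array for the newest entry. The sketch below assumes that rule; the `Uberblock` tuple and `active_uberblock` helper are hypothetical, the magic value is the one quoted in this document, and checksum verification is deliberately left out.

```python
from typing import List, NamedTuple, Optional

UB_MAGIC = 0x00BAB10C  # magic value quoted in this document


class Uberblock(NamedTuple):
    """Hypothetical decoded view of a few uberblock fields."""
    magic: int
    txg: int
    timestamp: int


def active_uberblock(array: List[Uberblock]) -> Optional[Uberblock]:
    """Pick the entry with the highest (txg, timestamp) among valid entries."""
    valid = [ub for ub in array if ub.magic == UB_MAGIC]
    if not valid:
        return None
    return max(valid, key=lambda ub: (ub.txg, ub.timestamp))


slots = [
    Uberblock(UB_MAGIC, 40, 1000),
    Uberblock(UB_MAGIC, 42, 1010),  # most recently written slot
    Uberblock(0, 0, 0),             # never-written slot
    Uberblock(UB_MAGIC, 41, 1005),
]
best = active_uberblock(slots)
```

Because older slots are never modified in place, an interrupted write leaves the previous highest-txg entry intact and still selectable.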
[Illustration: uberblock array within a vdev label (after the blank space, boot header, and name/value pairs), with the active uberblock expanded as an uberblock_phys_t: uint64_t ub_magic; uint64_t ub_version; uint64_t ub_txg; uint64_t ub_guid_sum; uint64_t ub_timestamp; blkptr_t ub_rootbp]
Uberblock Technical Details
The uberblock is stored in the machine's native endian format and has the following contents:

ub_magic
The uberblock magic number is a 64 bit integer used to identify a device as containing ZFS data. The value of the ub_magic is 0x00bab10c (oo-ba-block). The following table shows the ub_magic number as seen on disk.

    Machine Endianness    Uberblock Value
    Big Endian            0x00bab10c
    Little Endian         0x0cb1ba00
    Table 3: Uberblock values per machine endian type.

ub_version
The version field is used to identify the on-disk format in which this data is laid out. The current on-disk format version number is 0x1. This field contains the same value as the "version" element of the name/value pairs described in section 1.3.3.

ub_txg
All writes in ZFS are done in transaction groups. Each group has an associated transaction group number. The ub_txg value reflects the transaction group in which this uberblock was written. The ub_txg number must be greater than or equal to the "txg" number stored in the nvlist for this label to be valid.

ub_guid_sum
The ub_guid_sum is used to verify the availability of vdevs within a pool. When a pool is opened, ZFS traverses all leaf vdevs within the pool and totals a running sum of all the GUIDs it encounters (a vdev's guid is stored in the guid nvpair entry, see section 1.3.3). This computed sum is checked against the ub_guid_sum to verify the availability of all vdevs within this pool.

ub_timestamp
Coordinated Universal Time (UTC) when this uberblock was written, in seconds since January 1st 1970 (GMT).

ub_rootbp
The ub_rootbp is a blkptr structure containing the location of the MOS. Both the MOS and blkptr structures are described in later chapters of this document: Chapters 4 and 2 respectively.
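The three checks described for ub_magic, ub_txg, and ub_guid_sum can be sketched as follows. This is an illustration under stated assumptions rather than the ZFS implementation: it assumes ub_magic occupies the first eight bytes of the uberblock, that the guid sum wraps at 64 bits, and all function names are hypothetical. The two sample guids are the disk guids from the sample vdev_tree listing.

```python
import struct

UB_MAGIC = 0x00BAB10C
MASK64 = (1 << 64) - 1


def detect_endianness(first8: bytes) -> str:
    """Report which byte order makes the first 8 bytes equal ub_magic."""
    if struct.unpack(">Q", first8)[0] == UB_MAGIC:
        return "big"
    if struct.unpack("<Q", first8)[0] == UB_MAGIC:
        return "little"
    return "unknown"


def txg_is_valid(ub_txg: int, label_txg: int) -> bool:
    """ub_txg must be greater than or equal to the label's "txg" value."""
    return ub_txg >= label_txg


def guid_sum(leaf_guids) -> int:
    """Sum leaf vdev guids; 64-bit wraparound is an assumption of this sketch."""
    total = 0
    for guid in leaf_guids:
        total = (total + guid) & MASK64
    return total


little = struct.pack("<Q", UB_MAGIC)  # as written by a little endian machine
big = struct.pack(">Q", UB_MAGIC)     # as written by a big endian machine
guids = [6649981596953412974, 3648040300193291405]
expected_sum = guid_sum(guids)
```

A pool missing one of its disks would produce a different running sum than the stored ub_guid_sum, which is how the mismatch signals an unavailable vdev.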
[Illustration: device layout showing labels L0 and L1, the boot block, and label L3]
single wide block pointer (1 DVA), double wide block pointer (2 DVAs), and triple wide block pointer (3 DVAs).

The vdev portion of each DVA is a 32 bit integer which uniquely identifies the vdev ID containing this block. The offset portion of the DVA is a 63 bit integer value holding the offset (starting after the vdev labels (L0 and L1) and boot block) within that device where the data lives. Together, the vdev and offset uniquely identify the block address of the data it points to.

The value stored in offset is the offset in terms of sectors (512 byte blocks). To find the physical block byte offset from the beginning of a slice, the value inside offset must be shifted over (<<) by 9 (2^9 = 512) and this value must be added to 0x400000 (size of two vdev_labels and boot block).

    physical block address = (offset << 9) + 0x400000 (4MB)
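The address calculation above translates directly into code. The constant and function names in this short sketch are hypothetical; the shift and the 0x400000 bias are the values given in the text.

```python
SECTOR_SHIFT = 9           # 2**9 = 512 byte sectors
LABEL_AND_BOOT = 0x400000  # two front vdev labels plus boot block (4MB)


def dva_physical_address(offset_sectors: int) -> int:
    """Convert a DVA offset (in sectors) to a byte offset within the slice."""
    return (offset_sectors << SECTOR_SHIFT) + LABEL_AND_BOOT
```

For example, a DVA offset of 0 refers to the first byte after the 4MB reserved at the front of the device, and an offset of 1 refers to the sector 512 bytes after that.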
Section 2.2: GRID
Section 2.3: GANG
A gang block is a block whose contents contain block pointers. Gang blocks are used when the amount of space requested is not available in a contiguous block. In a situation of this kind, several smaller blocks will be allocated (totaling up to the size requested) and a gang block will be created to contain the block pointers for the allocated blocks. A pointer to this gang block is returned to the requester, giving the requester the perception of a single block. Gang blocks are identified by the "G" bit.

    G bit value    Description
    0              non-gang block
    1              gang block
    Table 4: Gang Block Values

Gang blocks are 512 byte sized, self checksumming blocks. A gang block contains up to 3 block pointers followed by a 32 byte checksum. The format of the gang block is described by the following structures.
zg_blkptr: array of block pointers. Each 512 byte gang block can hold up to 3 block pointers.
zg_filler: The filler field pads out the gang block so that it is nicely byte aligned.

    typedef struct zio_block_tail {
        uint64_t zbt_magic;
        zio_cksum_t zbt_cksum;
    } zio_block_tail_t;

zbt_magic: ZIO block tail magic number. The value is 0x210da7ab10c7a11 (zio-data-bloc-tail).

    typedef struct zio_cksum {
        uint64_t zc_word[4];
    } zio_cksum_t;

zc_word: four 8 byte words containing the checksum for this gang block.
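The sizes quoted above imply how much of a gang block is filler. The sketch below works that out, assuming a 128 byte block pointer (the blkptr size this document gives when discussing indirect blocks) and assuming the block tail sits at the very end of the 512 byte block. The constant and function names are hypothetical.

```python
import struct

GANG_BLOCK_SIZE = 512
BLKPTR_SIZE = 128     # size of one block pointer, per this document
MAX_GANG_BLKPTRS = 3

# zio_block_tail: one uint64_t magic plus four uint64_t checksum words.
TAIL_SIZE = struct.calcsize("=Q4Q")

# Whatever remains after the pointers and the tail must be padding (zg_filler).
FILLER_SIZE = GANG_BLOCK_SIZE - MAX_GANG_BLKPTRS * BLKPTR_SIZE - TAIL_SIZE


def split_gang_block(block: bytes):
    """Split a raw 512 byte gang block into (blkptrs, filler, tail) slices."""
    if len(block) != GANG_BLOCK_SIZE:
        raise ValueError("gang blocks are 512 bytes")
    ptr_end = MAX_GANG_BLKPTRS * BLKPTR_SIZE
    blkptrs = [block[i:i + BLKPTR_SIZE] for i in range(0, ptr_end, BLKPTR_SIZE)]
    filler = block[ptr_end:ptr_end + FILLER_SIZE]
    tail = block[ptr_end + FILLER_SIZE:]
    return blkptrs, filler, tail


blkptrs, filler, tail = split_gang_block(bytes(GANG_BLOCK_SIZE))
```

Under these assumptions the tail is 40 bytes (8 byte magic plus 32 byte checksum), leaving 88 bytes of filler.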
Value: 1 2 3 4 5 6 7 8
A 256 bit checksum of the data is computed for each block using the algorithm identified in cksum. If the cksum value is 2 (off), a checksum will not be computed and checksum[0], checksum[1], checksum[2], and checksum[3] will be zero. Otherwise, the 256 bit checksum computed for this block is stored in the checksum[0], checksum[1], checksum[2], and checksum[3] fields.

Note: The computed checksum is always of the data, even if this is a gang block. Gang blocks (see above) and zilog blocks (see Chapter 7) are self checksumming.
asize: allocated size, total size of all blocks allocated to hold this data including any gang headers or raid-Z parity information.

If compression is turned off and ZFS is not on Raid-Z storage, lsize, asize, and psize will all be equal. All sizes are stored as the number of 512 byte sectors (minus one) needed to represent the size of this block.
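The "sectors minus one" encoding can be illustrated with a pair of helpers. The function names are hypothetical, and the sketch only covers the arithmetic, not how the fields are packed into the block pointer.

```python
SECTOR_SIZE = 512


def encode_size(size_bytes: int) -> int:
    """Store a byte size as (number of 512 byte sectors) - 1."""
    if size_bytes <= 0 or size_bytes % SECTOR_SIZE:
        raise ValueError("size must be a positive multiple of 512")
    return size_bytes // SECTOR_SIZE - 1


def decode_size(field: int) -> int:
    """Recover the byte size from a stored size field."""
    return (field + 1) * SECTOR_SIZE
```

With this encoding a stored value of 0 means one sector (512 bytes), so a zero-length block cannot be expressed and no field value is wasted on it.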
If a pool is moved to a machine with a different endian format, the contents of the block are byte swapped on read.
    Value  Type                           Description
    0      DMU_OT_NONE                    Unallocated object
    1      DMU_OT_OBJECT_DIRECTORY        DSL object directory ZAP object
    2      DMU_OT_OBJECT_ARRAY            Object used to store an array of object numbers.
    3      DMU_OT_PACKED_NVLIST           Packed nvlist object.
    4      DMU_OT_NVLIST_SIZE
    5      DMU_OT_BPLIST                  Block pointer list, used to store the "deadlist" (list of block pointers deleted since the last snapshot) and the "deferred free list" used for sync to convergence.
    6      DMU_OT_BPLIST_HDR              BPLIST header: stores the bplist_phys_t structure.
    7      DMU_OT_SPACE_MAP_HEADER
    8      DMU_OT_SPACE_MAP               SPA disk block usage list.
    9      DMU_OT_INTENT_LOG              Intent Log
    10     DMU_OT_DNODE                   Object of dnodes (metadnode)
    11     DMU_OT_OBJSET                  Collection of objects.
    12     DMU_OT_DSL_DATASET
    13     DMU_OT_DSL_DATASET_CHILD_MAP   DSL ZAP object containing child DSL directory information.
    14     DMU_OT_OBJSET_SNAP_MAP         DSL ZAP object containing snapshot information for a dataset.
    15     DMU_OT_DSL_PROPS               DSL ZAP properties object containing properties for a DSL dir object.
    16     DMU_OT_DSL_OBJSET
    17     DMU_OT_ZNODE
    18     DMU_OT_ACL                     ACL (Access Control List) object
    19     DMU_OT_PLAIN_FILE_CONTENTS     ZPL Plain file
    20     DMU_OT_DIRECTORY_CONTENTS      ZPL Directory ZAP Object
    21     DMU_OT_MASTER_NODE             ZPL Master Node ZAP object: head object used to identify root directory, delete queue, and version for a filesystem.
    22     DMU_OT_DELETE_QUEUE            The delete queue provides a list of deletes that were in-progress when the filesystem was force unmounted or as a result of a system failure such as a power outage. Upon the next mount of the filesystem, the delete queue is processed to remove the files/dirs that are in the delete queue. This mechanism is used to avoid leaking files and directories in the filesystem.
    23     DMU_OT_ZVOL                    ZFS volume (ZVOL)
    24     DMU_OT_ZVOL_PROP               ZVOL properties

the number of free dnodes beneath this block pointer. For more information on dnodes see Chapter 3.
Objects are defined by 512 byte structures called dnodes (see footnote 3). A dnode describes and organizes a collection of blocks making up an object. The dnode (dnode_phys_t structure), seen in the illustration below, contains several fixed length fields and two variable length fields. Each of these fields is described in detail below.
    dnode_phys_t
        uint8_t   dn_type;
        uint8_t   dn_indblkshift;
        uint8_t   dn_nlevels
        uint8_t   dn_nblkptr;
        uint8_t   dn_bonustype;
        uint8_t   dn_checksum;
        uint8_t   dn_compress;
        uint8_t   dn_pad[1];
        uint16_t  dn_datablkszsec;
        uint16_t  dn_bonuslen;
        uint8_t   dn_pad2[4];
        uint64_t  dn_maxblkid;
        uint64_t  dn_secphys;
        uint64_t  dn_pad3[4];
        blkptr_t  dn_blkptr[N];
        uint8_t   dn_bonus[BONUSLEN]
dn_type
An 8-bit numeric value indicating an object's type. See Table 8 for a list of valid object types and their associated 8 bit identifiers.

dn_indblkshift and dn_datablkszsec
ZFS supports variable data and indirect (see dn_nlevels below for a description of indirect blocks) block sizes ranging from 512 bytes to 128 Kbytes.

dn_indblkshift: 8-bit numeric value containing the log (base 2) of the size (in bytes) of an indirect block for this object.
dn_datablkszsec: 16-bit numeric value containing the data block size (in bytes) divided by 512 (size of a disk sector). This value can range between 1 (for a 512 byte block) and 256 (for a 128 Kbyte block).

dn_nblkptr and dn_blkptr
dn_blkptr is a variable length field that can contain between one and three block pointers. The number of block pointers that the dnode contains is set at object allocation time and remains constant throughout the life of the dnode.

dn_nblkptr: 8 bit numeric value containing the number of block pointers in this dnode.
dn_blkptr: block pointer array containing dn_nblkptr block pointers.

dn_nlevels
dn_nlevels is an 8 bit numeric value containing the number of levels that make up this object. These levels are often referred to as levels of indirection.

Indirection
A dnode has a limited number (dn_nblkptr, see above) of block pointers to describe an object's data. For a dnode using the largest data block size (128KB) and containing the maximum number of block pointers (3), the largest object size it can represent (without indirection) is 384 KB: 3 x 128KB = 384KB. To allow for larger objects, indirect blocks are used. An indirect block is a block containing block pointers. The number of block pointers that an indirect block can hold is dependent on the indirect block size (represented by dn_indblkshift) and can be calculated by dividing the indirect block size by the size of a blkptr (128 bytes). The largest indirect block (128KB) can hold up to 1024 block pointers. As an object's size increases, more indirect blocks and levels of indirection are created. A new level of indirection is created once an object grows so large that it exceeds the capacity of the current level. ZFS provides up to six levels of indirection to support files up to 2^64 bytes long.

The illustration below shows an object with 3 levels of blocks (level 0, level 1, and level 2). This object has triple wide block pointers (dva1, dva2, and dva3) for metadata and single wide block pointers for its data (see Chapter Two for a description of block pointer wideness). The blocks at level 0 are data blocks.

3 A dnode is similar to an inode in UFS.
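The capacity figures in this section (1024 pointers per 128KB indirect block, 384KB without indirection) follow from simple arithmetic, sketched below. The helper names are hypothetical, and `levels_needed` reflects one plausible reading of "a new level is created once the current level is exceeded"; it is not taken from the ZFS source.

```python
BLKPTR_SIZE = 128  # bytes per block pointer
SECTOR_SIZE = 512


def blkptrs_per_indirect(dn_indblkshift: int) -> int:
    """Indirect block size (2**shift bytes) divided by the blkptr size."""
    return (1 << dn_indblkshift) // BLKPTR_SIZE


def data_block_size(dn_datablkszsec: int) -> int:
    """Data block size in bytes from the sector count stored in the dnode."""
    return dn_datablkszsec * SECTOR_SIZE


def max_size_without_indirection(dn_datablkszsec: int, dn_nblkptr: int) -> int:
    return data_block_size(dn_datablkszsec) * dn_nblkptr


def levels_needed(num_data_blocks: int, dn_nblkptr: int, dn_indblkshift: int) -> int:
    """Smallest level count whose top level fits in dn_nblkptr pointers."""
    fanout = blkptrs_per_indirect(dn_indblkshift)
    levels, blocks = 1, num_data_blocks
    while blocks > dn_nblkptr:
        blocks = -(-blocks // fanout)  # ceiling division
        levels += 1
    return levels
```

For instance, with dn_indblkshift = 17 (128KB indirect blocks) and three block pointers, an object of up to 3 data blocks needs one level, up to 3 x 1024 data blocks needs two, and anything larger needs a third.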
[Illustration: a dnode_phys_t with dn_nlevels = 3 and dn_nblkptr = 3; its dn_blkptr[] entries are triple wide block pointers (dva1, dva2, dva3) leading through Level 2 and Level 1 indirect blocks down to Level 0 data blocks]
Illustration 11: Object with 3 levels. Triple wide block pointers used for metadata; single wide block pointers used for data.
dn_maxblkid
An object's blocks are identified by block ids. The blocks in each level of indirection are numbered from 0 to N, where the first block at a given level is given an id of 0, the second an id of 1, and so forth. The dn_maxblkid field in the dnode is set to the value of the largest data (level zero) block id for this object.

Note on Block Ids: Given a block id and level, ZFS can determine the exact branch of indirect blocks which contain the block. This calculation is done using the block id, block level, and number of block pointers in an indirect block. For example, take an object which has 128KB sized indirect blocks. An indirect block of this size can hold 1024 block pointers. Given a level 0 block id of 16360, it can be determined that block 15 (block id 15) of level 1 contains the block pointer for level 0 blkid 16360.

    level 1 blkid = 16360 / 1024 = 15 (integer division)

This calculation can be performed recursively up the tree of indirect blocks until the top level of indirection has been reached.
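The recursive calculation in the note on block ids can be sketched as a short loop. The function names are hypothetical; the slot value (the index of the pointer inside the parent indirect block) is an addition of this sketch that falls out of the same division.

```python
def parent_blkid(blkid: int, ptrs_per_indirect: int) -> int:
    """Block id, one level up, of the indirect block that points at blkid."""
    return blkid // ptrs_per_indirect


def indirect_path(level0_blkid: int, ptrs_per_indirect: int, nlevels: int):
    """Return [(level, blkid, slot)] from level 1 up to the top level."""
    path = []
    blkid = level0_blkid
    for level in range(1, nlevels):
        slot = blkid % ptrs_per_indirect  # index within the parent block
        blkid = parent_blkid(blkid, ptrs_per_indirect)
        path.append((level, blkid, slot))
    return path


# Worked example from the text: 1024 pointers per indirect block, 3 levels.
path = indirect_path(16360, 1024, 3)
```

For the example in the text, the level 1 block id comes out as 15, and one more step places that block under block id 0 at level 2.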
dn_secphys
The sum of all asize values for all block pointers (data and indirect) for this object.

dn_bonus, dn_bonuslen, and dn_bonustype
The bonus buffer (dn_bonus) is defined as the space following a dnode's block pointer array (dn_blkptr). The amount of space is dependent on object type and can range between 64 and 320 bytes.

dn_bonus: dn_bonuslen sized chunk of data. The format of this data is defined by dn_bonustype.
dn_bonuslen: Length (in bytes) of the bonus buffer.
dn_bonustype: 8 bit numeric value identifying the type of data contained within the bonus buffer. The following table shows valid bonus buffer types and the structures which are stored in the bonus buffer. The contents of each of these structures will be discussed later in this specification.

    Value  Bonus Type                  Metadata Structure   Description
    4      DMU_OT_PACKED_NVLIST_SIZE   uint64_t             Bonus buffer type containing uint64_t size in bytes of a DMU_OT_PACKED_NVLIST object.
    7      DMU_OT_SPACE_MAP_HEADER     space_map_obj_t      Spa space map header.
    12     DMU_OT_DSL_DIR              dsl_dir_phys_t       DSL Directory object used to define relationships and properties between related datasets.
    16     DMU_OT_DSL_DATASET          dsl_dataset_phys_t   DSL dataset object used to organize snapshot and usage static information for objects of type DMU_OT_OBJSET.
    17     DMU_OT_ZNODE                znode_phys_t         ZPL metadata
os_type
The DMU supports several types of object sets, where each object set type has its own well defined format/layout for its objects. The object set's type is identified by a 64 bit integer, os_type. The table below lists available DMU object set types and their associated os_type integer value.

    Value  Object Set Type   Description
    0      DMU_OST_NONE      Uninitialized Object Set
    1      DMU_OST_META      DSL Object Set, See Chapter 4
    2      DMU_OST_ZFS       ZPL Object Set, See Chapter 6
    3      DMU_OST_ZVOL      ZVOL Object Set, See Chapter 8
os_zil_header
The ZIL header is described in detail in Chapter 7 of this document.

metadnode
As described earlier in this chapter, each object is described by a dnode_phys_t. The collection of dnode_phys_t structures describing the objects in this object set is stored as an object pointed to by the metadnode. The data contained within this object is formatted as an array of dnode_phys_t structures (one for each object within the object set).

Each object within an object set is uniquely identified by a 64 bit integer called an object number. An object's "object number" identifies the array element, in the dnode array, containing this object's dnode_phys_t.

The illustration below shows an object set with the metadnode expanded. The metadnode contains three block pointers, each of which has been expanded to show its contents. Object number 4 has been further expanded to show the details of the dnode_phys_t and the block structure referenced by this dnode.
[Illustration 12: Object Set. The figure shows an objset_phys_t whose metadnode (dn_type = DMU_OT_DNODE, dn_nlevels = 1, dn_nblkptr = 3) points to blocks holding an array of dnode_phys_t structures. One dnode_phys_t (dn_nlevels = 3, dn_nblkptr = 3) is expanded to show the Level 2, Level 1, and Level 0 blocks reached through dn_blkptr[].]
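The object-number lookup described above is plain array indexing. The following is a minimal sketch in C, not the actual implementation: it assumes a 512-byte dnode_phys_t (a size not stated in this section), takes the data block size as a caller-supplied parameter, and uses the hypothetical names dnode_loc_t and dnode_locate.

```c
#include <stdint.h>

/* Assumed size of one dnode_phys_t in bytes (not stated in this section). */
#define DNODE_SIZE 512u

/* Hypothetical result type: where an object's dnode sits in the dnode array. */
typedef struct {
    uint64_t blkid;   /* logical data block of the metadnode object */
    uint32_t offset;  /* byte offset of the dnode inside that block */
} dnode_loc_t;

/* Map an object number to its slot in the array of dnode_phys_t. */
static dnode_loc_t
dnode_locate(uint64_t object, uint32_t blksz)
{
    uint32_t per_block = blksz / DNODE_SIZE;   /* dnodes per data block */
    dnode_loc_t loc;

    loc.blkid = object / per_block;
    loc.offset = (uint32_t)(object % per_block) * DNODE_SIZE;
    return loc;
}
```

With a hypothetical 16 KB data block, object number 4 falls in block 0 at byte offset 2048.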
Chapter Four - DSL
The DSL (Dataset and Snapshot Layer) provides a mechanism for describing and managing relationships between, and properties of, object sets. Before describing the DSL and the relationships it describes, a brief overview of the various flavors of object sets is necessary.

Object Set Overview
ZFS provides the ability to create four kinds of object sets: filesystems, clones, snapshots, and volumes.
ZFS filesystem: A filesystem stores and organizes objects in an easily accessible, POSIX compliant manner.
ZFS clone: A clone is identical to a filesystem with the exception of its origin. Clones originate from snapshots, and a clone's initial contents are identical to those of the snapshot from which it originated.
ZFS snapshot: A snapshot is a read-only version of a filesystem, clone, or volume at a particular point in time.
ZFS volume: A volume is a logical volume that is exported by ZFS as a block device.

ZFS supports several operations and/or configurations which cause interdependencies amongst object sets. The purpose of the DSL is to manage these relationships. The following is a list of such relationships.
Clones: A clone is related to the snapshot from which it originated. Once a clone is created, the snapshot from which it originated cannot be deleted unless the clone is also deleted.
Snapshots: A snapshot is a point-in-time image of the data in the object set in which it was created. A filesystem, clone, or volume cannot be destroyed unless its snapshots are also destroyed.
Children: ZFS supports hierarchically structured object sets; object sets within object sets. A child is dependent on the existence of its parent. A parent cannot be destroyed without first destroying all children.
Dataset Directories manage a related grouping of datasets and the properties associated with that grouping. A DSL directory always has exactly one "active dataset". All other datasets under the DSL directory are related to the "active" dataset through snapshots, clones, or child/parent dependencies.

The following picture shows the DSL infrastructure, including a pictorial view of how object set relationships are described via the DSL datasets and DSL directories. The top level DSL Directory can be seen at the top/center of this figure. Directly below the DSL Directory is the "active dataset". The active dataset represents the live filesystem. Originating from the active dataset is a linked list of snapshots which have been taken at different points in time. Each dataset structure points to a DMU Object Set, which is the actual object set containing object data. To the left of the top level DSL Directory is a child ZAP object containing a listing of all child/parent dependencies. To the right of the DSL directory is a properties ZAP object containing properties for the datasets within this DSL directory. A listing of all properties can be seen in Table 12 below.

Datasets and DSL Directories are described in detail in the Dataset Internals and DSL Directories Internals sections below.
[Illustration: DSL Infrastructure. The figure shows the top level DSL Directory, its child dataset information, its properties (name/value pairs), the active dataset, and the linked list of snapshots.]
[Illustration: the uberblock (ub_magic, ub_version, ub_txg, ub_vdev_sum, ub_timestamp, ub_rootbp) points to an object set whose metadnode (dn_type = DMU_OT_DNODE) describes an array of dnode_phys_t structures. One of these, of type DMU_OT_OBJECT_DIRECTORY (dn_nlevels = 1, dn_nblkptr = 1), is the object_directory, with the entries root_dataset, config, and sync_bplist.]
snapshots.

uint64_t ds_prev_snap_txg: The transaction group number when the previous snapshot (pointed to by ds_prev_snap_obj) was taken.
uint64_t ds_next_snap_obj: This field is only used for datasets representing snapshots. It contains the object number of the dataset which is the most recent snapshot. This field is always zero for datasets representing clones, volumes, or filesystems.
uint64_t ds_snapnames_zapobj: Object number of a ZAP object (see Chapter 5) containing name value pairs for each snapshot of this dataset. Each pair contains the name of the snapshot and the object number associated with its DSL dataset structure.
uint64_t ds_num_children: Always zero if not a snapshot. For snapshots, this is the number of references to this snapshot: 1 (from the next snapshot taken, or from the active dataset if no snapshots have been taken) + the number of clones originating from this snapshot.
uint64_t ds_creation_time: Seconds since January 1st 1970 (GMT) when this dataset was created.
uint64_t ds_creation_txg: The transaction group number in which this dataset was created.
uint64_t ds_deadlist_obj: The object number of the deadlist (an array of blkptr's deleted since the last snapshot).
uint64_t ds_used_bytes: Unique bytes used by the object set represented by this dataset.
uint64_t ds_compressed_bytes: Number of compressed bytes in the object set represented by this dataset.
uint64_t ds_uncompressed_bytes: Number of uncompressed bytes in the object set represented by this dataset.
uint64_t ds_unique_bytes: When a snapshot is taken, its initial contents are identical to that of the active copy of the data. As the data changes in the active copy, more and more data becomes unique to the snapshot (the data diverges from the snapshot). As that happens, the amount of data unique to the snapshot increases. The amount of unique snapshot data is stored in this field; it is zero for clones, volumes, and filesystems.
uint64_t ds_fsid_guid: 64 bit ID that is guaranteed to be unique amongst all currently open datasets. Note, this ID could change between successive dataset opens.
uint64_t ds_guid: 64 bit global id for this dataset. This value never changes during the lifetime of the object set.
uint64_t ds_restoring: This field is set to "1" if ZFS is in the process of restoring to this dataset through 'zfs restore'.
blkptr_t ds_bp: Block pointer containing the location of the object set that this dataset represents.
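The ds_num_children rule above reduces to a one-line formula. As a minimal sketch in C (the function name and parameters are hypothetical and only restate the field description):

```c
#include <stdint.h>

/*
 * Expected value of ds_num_children as described in the text:
 * zero for anything that is not a snapshot; for a snapshot, one
 * reference from the next snapshot (or the active dataset) plus
 * one per clone originating from it.
 */
static uint64_t
ds_num_children_expected(int is_snapshot, uint64_t nclones)
{
    if (!is_snapshot)
        return 0;
    return 1 + nclones;
}
```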
uint64_t dd_props_zapobj: 64 bit object number of a ZAP object containing the properties for all datasets within this DSL directory. Only the non-inherited / locally set values are represented in this ZAP object. Default, inherited values are inferred when there is an absence of an entry. The following table shows valid property values.

Property | Description | Values
aclinherit | Controls inheritance behavior for datasets. | discard = 0, noallow = 1, passthrough = 3, secure = 4 (default)
aclmode | Controls chmod and file/dir creation behavior for datasets. | discard = 0, groupmask = 2 (default), passthrough = 3
atime | Controls whether atime is updated on objects within a dataset. | off = 0, on = 1 (default)
checksum | Checksum algorithm for all datasets within this DSL Directory. | on = 1 (default), off = 0
compression | Compression algorithm for all datasets within this DSL Directory. | on = 1, off = 0 (default)
devices | Controls whether device nodes can be opened on datasets. | devices = 0, nodevices = 1 (default)
exec | Controls whether files can be executed on a dataset. | exec = 1 (default), noexec = 0
mountpoint | Mountpoint path for datasets within this DSL Directory. | string
quota | Limits the amount of space all datasets within a DSL directory can consume. | quota size in bytes, or zero for no quota (default)
readonly | Controls whether objects can be modified on a dataset. | readonly = 1, readwrite = 0 (default)
recordsize | Block size for all objects within the datasets contained in this DSL Directory. | recordsize in bytes
reservation | Amount of space reserved for this DSL Directory, including all child datasets and child DSL Directories. | reservation size in bytes
setuid | Controls whether the set-UID bit is respected on a dataset. | setuid = 1 (default), nosetuid = 0
sharenfs | Controls whether the datasets in a DSL Directory are shared by NFS. | string: any valid nfs share options
snapdir | Controls whether .zfs is hidden or visible in the root filesystem. | hidden = 0, visible = 1 (default)
volblocksize | For volumes, specifies the block size of the volume. The blocksize cannot be changed once the volume has been written, so it should be set at volume creation time. | between 512 and 128K, powers of two; defaults to 8K
volsize | Volume size, only applicable to volumes. | volume size in bytes
zoned | |
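Because only locally set values are stored, a property lookup has to fall back to the default or inherited value when no entry exists. The sketch below illustrates that rule in C; the prop_ent_t array is a hypothetical stand-in for the properties ZAP object, and prop_get is not a real ZFS function.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for one name/value pair of the properties ZAP object. */
typedef struct {
    const char *name;
    uint64_t    value;
} prop_ent_t;

/*
 * Return the locally set value of `name` if an entry exists;
 * otherwise return the default/inherited value supplied by the caller.
 */
static uint64_t
prop_get(const prop_ent_t *local, size_t n, const char *name,
    uint64_t inherited)
{
    for (size_t i = 0; i < n; i++) {
        if (strcmp(local[i].name, name) == 0)
            return local[i].value;
    }
    return inherited;   /* absence of an entry means "not set here" */
}
```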
ZAP objects come in two forms: microzap objects and fatzap objects. Microzap objects are a lightweight version of the fatzap and provide a simple and fast lookup mechanism for a small number of attribute entries. The fatzap is better suited for ZAP objects containing large numbers of attributes.

The following guidelines are used by ZFS to decide whether to use a fatzap or a microzap object. A microzap object is used if all three conditions below are met:
All name-value pair entries fit into one block. The maximum data block size in ZFS is 128KB, and a block of this size can fit up to 2047 microzap entries.
The value portion of all attributes is of type uint64_t.
The name portion of each attribute is less than or equal to 50 characters in length (including the NULL terminating character).
If any of the above conditions are not met, a fatzap object is used.

The first 64 bit word in each block of a ZAP object is used to identify the type of ZAP contents contained within this block. The table below shows these values.
Identifier | Description | Value
 | This block contains microzap entries. |
ZBT_HEADER | This block is used for the fatzap. This identifier is only used for the first block in a fatzap object. |
ZBT_LEAF | This block is used for the fatzap. This identifier is used for all blocks in the fatzap with the exception of the first. | (1ULL << 63) + 0
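The three microzap conditions can be checked mechanically. The sketch below is illustrative only: it assumes each microzap entry occupies 64 bytes (8 + 4 + 2 + 50, from the mzap_ent_phys_t layout) and that the first 64 bytes of the block are taken by a header, which is how 128KB yields the 2047 entries quoted in the text. The function and macro names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

#define MZAP_ENT_LEN   64u              /* assumed size of one microzap entry */
#define MZAP_NAME_MAX  50u              /* name limit, terminating NUL included */
#define MAX_BLOCK_SIZE (128u * 1024u)   /* maximum data block size */

/* Entries that fit in the largest block, minus one slot assumed for the header. */
static uint32_t
mzap_max_entries(void)
{
    return MAX_BLOCK_SIZE / MZAP_ENT_LEN - 1;
}

/* Return 1 if all three microzap conditions hold, 0 if a fatzap is needed. */
static int
can_use_microzap(const char *const *names, size_t nentries,
    int all_values_uint64)
{
    if (!all_values_uint64)
        return 0;
    if (nentries > mzap_max_entries())
        return 0;
    for (size_t i = 0; i < nentries; i++) {
        if (strlen(names[i]) + 1 > MZAP_NAME_MAX)
            return 0;
    }
    return 1;
}
```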
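[Illustration: microzap block. The block begins with header fields (including the salt and padding), followed by an array of mzap_ent_phys_t entries.]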
typedef struct mzap_ent_phys {
        uint64_t mze_value;
        uint32_t mze_cd;
        uint16_t mze_pad;
        char mze_name[MZAP_NAME_LEN];
} mzap_ent_phys_t;
mze_value: 64 bit integer
mze_cd: 32 bit collision differentiator ("CD"): associated with an entry whose hash value is the same as another entry within this ZAP object. When an entry is inserted into the ZAP object, the lowest CD which is not already used by an entry with the same hash value is assigned. In the absence of hash collisions, the CD value will be zero.
mze_pad: reserved for future use
mze_name: NULL terminated string less than or equal to 50 characters in length
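The collision differentiator rule ("lowest CD not already used by an entry with the same hash") can be sketched as a simple search. This is an illustration under assumptions, not the ZFS code: ent_t is a hypothetical in-memory view of existing entries, and pick_cd is a made-up name.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical in-memory view of an existing entry: its hash and its CD. */
typedef struct {
    uint64_t hash;
    uint32_t cd;
} ent_t;

/* Return the lowest CD not yet used by any entry sharing `hash`. */
static uint32_t
pick_cd(const ent_t *ents, size_t n, uint64_t hash)
{
    uint32_t cd = 0;

    for (;;) {
        int used = 0;

        for (size_t i = 0; i < n; i++) {
            if (ents[i].hash == hash && ents[i].cd == cd) {
                used = 1;
                break;
            }
        }
        if (!used)
            return cd;   /* zero when there is no collision */
        cd++;
    }
}
```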
[Illustration: fatzap object. The pointer table references zap_leaf_phys_t blocks, each of which contains zap leaf chunks.]
size of the pointer table, this structure may contain the pointer table. If the pointer table is too large to fit in the space provided by the zap_phys_t, some information about where it can be found is stored in the zap_table_phys portion of this structure. The definitions of the zap_phys_t contents are as follows:
zap_phys_t
        uint64_t zap_block_type;
        uint64_t zap_magic;
        struct zap_table_phys {
                uint64_t zt_blk;
                uint64_t zt_numblks;
                uint64_t zt_shift;
                uint64_t zt_nextblk;
                uint64_t zt_blk_copied;
        } zap_ptrtbl;
        uint64_t zap_freeblk;
        uint64_t zap_num_leafs;
        uint64_t zap_num_entries;
        uint64_t zap_salt;
        uint64_t zap_pad[8181];
        uint64_t zap_leafs[8192];
zap_block_type: 64 bit integer identifying the type of ZAP block. Always set to ZBT_HEADER (see Table 14) for the first block in the fatzap.
zap_magic: 64 bit integer containing the ZAP magic number: 0x2F52AB2AB (zfs-zap-zap)
zap_table_phys: structure whose contents are used to describe the pointer table
zt_blk: Blkid for the first block of the pointer table. This field is only used when the pointer table is external to the zap_phys_t structure; zero otherwise.
zt_numblks: Number of blocks used to hold the pointer table. This field is only used when the pointer table is external to the zap_phys_t structure; zero otherwise.
zt_shift: Number of bits used from the hash value to index into the pointer table. If the pointer table is contained within the zap_phys, this value will be 13.
uint64_t zt_nextblk:
uint64_t zt_blks_copied:
The above two fields are used when the pointer table changes sizes.
zap_freeblk: 64 bit integer containing the first available ZAP block that can be used to allocate a new zap_leaf.
zap_num_leafs: Number of zap_leaf_phys_t structures (described below) contained within this ZAP object.
zap_salt: The salt value is a 64 bit integer that is stirred into the hash function, so that the hash function is different for each ZAP object.
zap_num_entries: Number of attributes stored in this ZAP object.
zap_leafs[8192]: The zap_leaf array contains 2^13 (8192) slots. If the pointer table has fewer than 2^13 entries, the pointer table will be stored here. If not, this field is unused.
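To make the role of zt_shift concrete, the sketch below derives a pointer table index from a 64 bit hash. It assumes the index is taken from the high-order zt_shift bits of the hash, which this section does not state explicitly; the function name is hypothetical.

```c
#include <stdint.h>

/*
 * Index into the pointer table using zt_shift bits of the hash.
 * Assumption: the most significant bits are used. zt_shift is
 * expected to be in the range 0..64.
 */
static uint64_t
zap_ptrtbl_index(uint64_t hash, uint64_t zt_shift)
{
    if (zt_shift == 0)
        return 0;                     /* a single bucket */
    return hash >> (64 - zt_shift);
}
```

With zt_shift = 13, the result always lies in 0..8191, matching the 8192 slots of the embedded table.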
        uint16_t l_hash[ZAP_LEAF_HASH_NUMENTRIES];
        union zap_leaf_chunk {
                struct zap_leaf_entry {
                        uint8_t le_type;
                        uint8_t le_int_size;
                        uint16_t le_next;
                        uint16_t le_name_chunk;
                        uint16_t le_name_length;
                        uint16_t le_value_chunk;
                        uint16_t le_value_length;
                        uint16_t le_cd;
                        uint8_t le_pad[2];
                        uint64_t le_hash;
                } l_entry;
                struct zap_leaf_array {
                        uint8_t la_type;
                        uint8_t la_array[ZAP_LEAF_ARRAY_BYTES];
                        uint16_t la_next;
                } l_array;
                struct zap_leaf_free {
                        uint8_t lf_type;
                        uint8_t lf_pad[ZAP_LEAF_ARRAY_BYTES];
                        uint16_t lf_next;
                } l_free;
        } l_chunk[ZAP_LEAF_NUMCHUNKS];
} zap_leaf_phys_t;
Header
The header for the ZAP leaf is stored in a zap_leaf_header structure. Its description is as follows:
lhr_block_type: always ZBT_LEAF (see Table 14 for values)
lhr_next: 64 bit integer block id for the next leaf in a block chain.
lhr_prefix and lhr_prefix_len: Each leaf (or chain of leafs) stores the ZAP entries whose first lhr_prefix_len bits of their hash value equal lhr_prefix. lhr_prefix_len can be equal to or less than zt_shift (the number of bits used to index into the pointer table), in which case multiple pointer table buckets reference the same leaf.
lhr_magic: leaf magic number == 0x2AB1EAF (zap-leaf)
lhr_nfree: number of free chunks in this leaf (chunks described below)
lhr_nentries: number of ZAP entries stored in this leaf
lhr_freelist: head of a list of free chunks, 16 bit integer used to index into the zap_leaf_chunk array
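The relationship between lhr_prefix, lhr_prefix_len and zt_shift can be sketched as follows. The code assumes "first bits" means the most significant bits of the hash and that prefix_len does not exceed zt_shift; both function names are hypothetical.

```c
#include <stdint.h>

/* Return 1 if a leaf with the given prefix stores entries with this hash. */
static int
leaf_covers_hash(uint64_t hash, uint64_t prefix, uint32_t prefix_len)
{
    if (prefix_len == 0)
        return 1;                                /* leaf covers every hash */
    return (hash >> (64 - prefix_len)) == prefix;
}

/*
 * Number of pointer table buckets that reference one leaf:
 * each prefix bit fewer than zt_shift doubles the count.
 */
static uint64_t
leaf_bucket_count(uint32_t zt_shift, uint32_t prefix_len)
{
    return 1ULL << (zt_shift - prefix_len);
}
```

For example, with zt_shift = 13 and a leaf whose prefix length is 11, four adjacent pointer table buckets reference that leaf.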
Leaf Hash
The next 8KB of the zap_leaf_phys_t is the zap leaf hash table. The entries in the hash table reference chunks of type zap_leaf_entry. Twelve bits of the attribute's hash value (the twelve following the lhr_prefix_len bits used to uniquely identify this block) are used to index into this table. Hash table collisions are handled by chaining entries. Each bucket in the table contains a 16 bit integer which is the index into the zap_leaf_chunk array.
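The bucket selection described above can be written as a shift and a mask. This sketch assumes the prefix occupies the most significant bits of the hash, so the twelve index bits are the next twelve below it; the names are hypothetical and prefix_len must not exceed 52.

```c
#include <stdint.h>

#define LEAF_HASH_SHIFT 12u   /* 8KB table / 2 bytes per bucket = 4096 = 2^12 */

/* Select the leaf hash table bucket for a hash, skipping the prefix bits. */
static uint32_t
leaf_hash_bucket(uint64_t hash, uint32_t prefix_len)
{
    uint64_t shifted = hash >> (64 - LEAF_HASH_SHIFT - prefix_len);

    return (uint32_t)(shifted & ((1u << LEAF_HASH_SHIFT) - 1));
}
```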
[Illustration: zap leaf chunks. zap_leaf_entry chunks (le_type = 252) are chained through le_next, with 0xffff marking the end of a chain. Each entry references zap_leaf_array chunks (la_type = 251, la_array[21]) through le_name_chunk and le_value_chunk; these are chained through la_next, again ending with 0xffff.]
zap_leaf_entry: The leaf hash table (described above) points to chunks of this type. This entry contains pointers to chunks of type zap_leaf_array which hold the name and value for the attributes being stored here.
le_type: ZAP_LEAF_ENTRY == 252
le_int_size: Size of integers in bytes for this entry.
le_next: Next entry in the zap_leaf_chunk chain. Chains occur when there are collisions in the hash table. The end of the chain is designated by a le_next value of 0xffff.
le_name_chunk: 16 bit integer identifying the chunk of type zap_leaf_array which contains the first 21 characters of this attribute's name.
le_name_length: The length of the attribute's name, including the NULL character.
le_value_chunk: 16 bit integer identifying the first chunk (type zap_leaf_array) containing the first 21 bytes of the attribute's value.
le_value_length: The length, in integer increments (le_int_size).
le_cd: The collision differentiator ("CD") is a value associated with an entry whose hash value is the same as another entry within this ZAP object. When an entry is inserted into the ZAP object, the lowest CD which is not already used by an entry with the same hash value is assigned. In the absence of hash collisions, the CD value will be zero.
le_hash: 64 bit hash of this attribute's name.

zap_leaf_array: Chunks of the zap_leaf_array hold either the name or the value of the ZAP attribute. These chunks can be strung together to provide for long names or large values. zap_leaf_array chunks are pointed to by a zap_leaf_entry chunk.
la_type: ZAP_LEAF_ARRAY == 251
la_array: 21 byte array containing the name or the value. Values of type "integer" are always stored in big endian format, regardless of the machine's native endianness.
la_next: 16 bit integer used to index into the zap_leaf_chunk array; it references the next zap_leaf_array chunk for this attribute. A value of 0xffff (CHAIN_END) is used to designate the end of the chain.

zap_leaf_free: Unused chunks are kept in a chained free list. The root of the free list is stored in the leaf header.
lf_type: ZAP_LEAF_FREE == 253
lf_next: 16 bit integer pointing to the next free chunk.
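Reading a name or value therefore means walking a chain of 21-byte chunks and, for integers, decoding big endian bytes. The following sketch shows both steps; leaf_array_t is a simplified stand-in for the zap_leaf_array chunk, and the function names are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

#define ARRAY_BYTES 21        /* payload bytes per zap_leaf_array chunk */
#define CHAIN_END   0xffff    /* la_next value that terminates a chain */

/* Simplified stand-in for a zap_leaf_array chunk. */
typedef struct {
    uint8_t  la_type;
    uint8_t  la_array[ARRAY_BYTES];
    uint16_t la_next;
} leaf_array_t;

/* Copy up to `len` bytes from the chain starting at chunk `first`. */
static size_t
chain_read(const leaf_array_t *chunks, uint16_t first, uint8_t *out,
    size_t len)
{
    size_t done = 0;
    uint16_t c = first;

    while (c != CHAIN_END && done < len) {
        size_t n = len - done;

        if (n > ARRAY_BYTES)
            n = ARRAY_BYTES;
        for (size_t i = 0; i < n; i++)
            out[done + i] = chunks[c].la_array[i];
        done += n;
        c = chunks[c].la_next;
    }
    return done;   /* fewer than len if the chain ended early */
}

/* Decode one 8-byte big endian integer, independent of host endianness. */
static uint64_t
be64_decode(const uint8_t b[8])
{
    uint64_t v = 0;

    for (int i = 0; i < 8; i++)
        v = (v << 8) | b[i];
    return v;
}
```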
Chapter Six - ZPL
The ZPL, ZFS POSIX Layer, makes DMU objects look like a POSIX filesystem. POSIX is a standard defining the set of services a filesystem must provide. ZFS filesystems provide all of these required services.

The ZPL represents filesystems as an object set of type DMU_OST_ZFS. All snapshots, clones and filesystems are implemented as an object set of this type. The ZPL uses a well defined format for organizing objects in its object set. The section below describes this layout.
typedef struct znode_phys {
        uint64_t zp_atime[2];
        uint64_t zp_mtime[2];
        uint64_t zp_ctime[2];
        uint64_t zp_crtime[2];
        uint64_t zp_gen;
        uint64_t zp_mode;
        uint64_t zp_size;
        uint64_t zp_parent;
        uint64_t zp_links;
        uint64_t zp_xattr;
        uint64_t zp_rdev;
        uint64_t zp_flags;
        uint64_t zp_uid;
        uint64_t zp_gid;
        uint64_t zp_pad[4];
        zfs_znode_acl_t zp_acl;
} znode_phys_t;
zp_atime: Two 64 bit integers containing the last file access time in seconds (zp_atime[0]) and nanoseconds (zp_atime[1]) since January 1st 1970 (GMT).
zp_mtime: Two 64 bit integers containing the last file modification time in seconds (zp_mtime[0]) and nanoseconds (zp_mtime[1]) since January 1st 1970 (GMT).
zp_ctime: Two 64 bit integers containing the last file change time in seconds (zp_ctime[0]) and nanoseconds (zp_ctime[1]) since January 1st 1970 (GMT).
zp_crtime: Two 64 bit integers containing the file's creation time in seconds (zp_crtime[0]) and nanoseconds (zp_crtime[1]) since January 1st 1970 (GMT).
zp_gen: 64 bit generation number; contains the transaction group number of the creation of this file.
zp_mode: 64 bit integer containing file mode bits and file type. The lower 8 bits of the mode contain the access mode bits, for example 755. The 9th bit is the sticky bit and can be a value of zero or one. Bits 13-16 are used to designate the file type. The file types can be seen in the table below.
Type | Description | Value in bits 13-16
S_IFIFO | Fifo | 0x1
S_IFCHR | Character Special Device | 0x2
S_IFDIR | Directory | 0x4
S_IFBLK | Block special device | 0x6
S_IFREG | Regular file | 0x8
S_IFLNK | Symbolic Link | 0xA
S_IFSOCK | Socket | 0xC
S_IFDOOR | Door | 0xD
S_IFPORT | Event Port | 0xE
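Extracting the file type from zp_mode is a shift and a mask over bits 13-16 (counting from 1). A minimal sketch, with hypothetical function names and the type values taken from the table above:

```c
#include <stdint.h>

/* Return the 4-bit file type stored in bits 13-16 of zp_mode. */
static unsigned
zp_file_type(uint64_t zp_mode)
{
    return (unsigned)((zp_mode >> 12) & 0xF);
}

/* Map the file type value to the description used in the table. */
static const char *
zp_file_type_name(uint64_t zp_mode)
{
    switch (zp_file_type(zp_mode)) {
    case 0x1: return "Fifo";
    case 0x2: return "Character Special Device";
    case 0x4: return "Directory";
    case 0x6: return "Block special device";
    case 0x8: return "Regular file";
    case 0xA: return "Symbolic Link";
    case 0xC: return "Socket";
    case 0xD: return "Door";
    case 0xE: return "Event Port";
    default:  return "unknown";
    }
}
```

For instance, an octal mode of 0100644 decodes to type 0x8, a regular file.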
zp_size: size of file in bytes
zp_parent: object id of the parent directory containing this file
zp_links: number of hard links to this file
zp_xattr: object ID of a ZAP object which is the hidden attribute directory. It is treated like a normal directory in ZFS, except that it is hidden and an application will need to "tunnel" into the file via openat() to get to it.
zp_rdev: dev_t for files of type S_IFCHR or S_IFBLK
zp_flags: Persistent flags set on the file. The following are valid flag values.
Flag | Value
ZFS_XATTR | 0x1
ZFS_INHERIT_ACE | 0x2
Table 16 zp_flag values
zp_uid: 64 bit integer (uid_t) of the file's owner.
zp_gid: 64 bit integer (gid_t) of the owning group of the file.
zp_acl: zfs_znode_acl structure containing any ACL entries set on this file. The zfs_znode_acl structure is defined below.
z_acl_extern_obj: Used for holding ACLs that won't fit in the znode. In other words, it is for ACLs with more than 6 ACEs. The object type of an extern ACL is DMU_OT_ACL.
z_acl_count: number of ACE entries that make up an ACL
z_acl_version: reserved for future use.
z_acl_pad: reserved for future use.
z_ace_data: Array of up to 6 ACEs. An ACE specifies an access right to an individual user or group for a specific object.
typedef struct ace {
        uid_t a_who;
        uint32_t a_access_mask;
        uint16_t a_flags;
        uint16_t a_type;
} ace_t;
a_who: This field is only meaningful when the ACE_OWNER, ACE_GROUP and ACE_EVERYONE flags (set in a_flags, described below) are not asserted. The a_who field contains a UID or GID. If the ACE_IDENTIFIER_GROUP flag is set in a_flags (see below), the a_who field will contain a GID. Otherwise, this field will contain a UID.
a_access_mask: 32 bit access mask. The table below shows the access attribute associated with each bit.
Attribute | Value
ACE_READ_DATA | 0x00000001
ACE_LIST_DIRECTORY | 0x00000001
ACE_WRITE_DATA | 0x00000002
ACE_ADD_FILE | 0x00000002
ACE_APPEND_DATA | 0x00000004
ACE_ADD_SUBDIRECTORY | 0x00000004
ACE_READ_NAMED_ATTRS | 0x00000008
ACE_WRITE_NAMED_ATTRS | 0x00000010
ACE_EXECUTE | 0x00000020
ACE_DELETE_CHILD | 0x00000040
ACE_READ_ATTRIBUTES | 0x00000080
ACE_WRITE_ATTRIBUTES | 0x00000100
ACE_DELETE | 0x00010000
ACE_READ_ACL | 0x00020000
ACE_WRITE_ACL | 0x00040000
ACE_WRITE_OWNER | 0x00080000
ACE_SYNCHRONIZE | 0x00100000
Table 17 Access Mask Values
a_flags: 16 bit integer whose value describes the ACL entry type and inheritance flags.
ACE flag | Value
ACE_FILE_INHERIT_ACE | 0x0001
ACE_DIRECTORY_INHERIT_ACE | 0x0002
ACE_NO_PROPAGATE_INHERIT_ACE | 0x0004
ACE_INHERIT_ONLY_ACE | 0x0008
ACE_SUCCESSFUL_ACCESS_ACE_FLAG | 0x0010
ACE_FAILED_ACCESS_ACE_FLAG | 0x0020
ACE_IDENTIFIER_GROUP | 0x0040
ACE_OWNER | 0x1000
ACE_GROUP | 0x2000
ACE_EVERYONE | 0x4000
Table 18 Entry Type and Inheritance Flag Value
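The interplay of a_flags, a_who and a_access_mask can be illustrated with two small helpers. This is a sketch under assumptions rather than the ZFS access check: the enum and function names are hypothetical, only a few mask bits from the tables above are defined, and the check of ACE_OWNER, ACE_GROUP and ACE_EVERYONE before ACE_IDENTIFIER_GROUP follows the a_who description.

```c
#include <stdint.h>

/* A few values copied from the access mask and flag tables. */
#define ACE_READ_DATA        0x00000001u
#define ACE_WRITE_DATA       0x00000002u
#define ACE_EXECUTE          0x00000020u

#define ACE_IDENTIFIER_GROUP 0x0040u
#define ACE_OWNER            0x1000u
#define ACE_GROUP            0x2000u
#define ACE_EVERYONE         0x4000u

/* Hypothetical classification of whom an ACE applies to. */
typedef enum {
    ACE_WHO_OWNER,      /* a_who not meaningful */
    ACE_WHO_GROUP,      /* a_who not meaningful */
    ACE_WHO_EVERYONE,   /* a_who not meaningful */
    ACE_WHO_GID,        /* a_who holds a GID */
    ACE_WHO_UID         /* a_who holds a UID */
} ace_who_kind_t;

static ace_who_kind_t
ace_who_kind(uint16_t a_flags)
{
    if (a_flags & ACE_OWNER)
        return ACE_WHO_OWNER;
    if (a_flags & ACE_GROUP)
        return ACE_WHO_GROUP;
    if (a_flags & ACE_EVERYONE)
        return ACE_WHO_EVERYONE;
    if (a_flags & ACE_IDENTIFIER_GROUP)
        return ACE_WHO_GID;
    return ACE_WHO_UID;
}

/* Return 1 if every bit in `wanted` is present in the ACE's mask. */
static int
ace_mask_covers(uint32_t a_access_mask, uint32_t wanted)
{
    return (a_access_mask & wanted) == wanted;
}
```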
a_type: The type of this ACE. The valid types are listed in the table below.
Type | Description | Value
 | Grants access as described in a_access_mask. |
 | Denies access as described in a_access_mask. |
 | Audit the successful or failed accesses (depending on the presence of the successful/failed access flags) as defined in the a_access_mask. (6) |
ACE_SYSTEM_ALARM_ACE_TYPE | Alarm the successful or failed accesses as defined in the a_access_mask. (7) | 0x0003

6 The action taken as an effect of triggering an audit is currently undefined in Solaris.
7 The action taken as an effect of triggering an alarm is currently undefined in Solaris.
...
More details of the current ZIL on disk structures are given below.
ZIL records

ZIL record common structure
ZIL records all start with a common section followed by a record (transaction) specific structure. The common log record structure and record types (values for lrc_txtype) are:
typedef struct {                        /* common log record header */
        uint64_t lrc_txtype;            /* intent log transaction type */
        uint64_t lrc_reclen;            /* transaction record length */
        uint64_t lrc_txg;               /* dmu transaction group number */
        uint64_t lrc_seq;               /* intent log sequence number */
} lr_t;

#define TX_CREATE       1       /* Create file */
#define TX_MKDIR        2       /* Make directory */
#define TX_MKXATTR      3       /* Make XATTR directory */
#define TX_SYMLINK      4       /* Create symbolic link to a file */
#define TX_REMOVE       5       /* Remove file */
#define TX_RMDIR        6       /* Remove directory */
#define TX_LINK         7       /* Create hard link to a file */
#define TX_RENAME       8       /* Rename a file */
#define TX_WRITE        9       /* File write */
#define TX_TRUNCATE     10      /* Truncate a file */
#define TX_SETATTR      11      /* Set file attributes */
#define TX_ACL          12      /* Set acl */
ZIL record specific structures
For each of the record (transaction) types listed above there is a specific structure which embeds the common structure. Within each record, enough information is saved to be able to replay the transaction (usually one VOP call). The VOP layer will pass in-memory pointers to vnodes. These have to be converted to stable pool object identifiers (oids). When replaying the transaction, the VOP layer is called again. To do this we reopen the object and pass its vnode.

Some of the record specific structures are used for more than one transaction type. The lr_create_t record specific structure is used for TX_CREATE, TX_MKDIR, TX_MKXATTR and TX_SYMLINK, and lr_remove_t is used for both TX_REMOVE and TX_RMDIR. All fields (other than strings and user data) are 64 bits wide. This provides for a well defined alignment which allows for easy compatibility between different architectures, and easy endianness conversion if necessary. Here's the definition of the record specific structures:
typedef struct {
        lr_t lr_common;         /* common portion of log record */
        uint64_t lr_doid;       /* object id of directory */
        uint64_t lr_foid;       /* object id of created file object */
        uint64_t lr_mode;       /* mode of object */
        uint64_t lr_uid;        /* uid of object */
        uint64_t lr_gid;        /* gid of object */
        uint64_t lr_gen;        /* generation (txg of creation) */
        uint64_t lr_crtime[2];  /* creation time */
        uint64_t lr_rdev;       /* rdev of object to create */
        /* name of object to create follows this */
        /* for symlinks, link content follows name */
} lr_create_t;

typedef struct {
        lr_t lr_common;         /* common portion of log record */
        uint64_t lr_doid;       /* obj id of directory */
        /* name of object to remove follows this */
} lr_remove_t;

typedef struct {
        lr_t lr_common;         /* common portion of log record */
        uint64_t lr_doid;       /* obj id of directory */
        uint64_t lr_link_obj;   /* obj id of link */
        /* name of object to link follows this */
} lr_link_t;

typedef struct {
        lr_t lr_common;         /* common portion of log record */
        uint64_t lr_sdoid;      /* obj id of source directory */
        uint64_t lr_tdoid;      /* obj id of target directory */
        /* strings: names of source and destination follow this */
} lr_rename_t;

typedef struct {
        lr_t lr_common;         /* common portion of log record */
        uint64_t lr_foid;       /* file object to write */
        uint64_t lr_offset;     /* offset to write to */
        uint64_t lr_length;     /* user data length to write */
        uint64_t lr_blkoff;     /* offset represented by lr_blkptr */
        blkptr_t lr_blkptr;     /* spa block pointer for replay */
        /* write data will follow for small writes */
} lr_write_t;

typedef struct {
        lr_t lr_common;         /* common portion of log record */
        uint64_t lr_foid;       /* object id of file to truncate */
        uint64_t lr_offset;     /* offset to truncate from */
        uint64_t lr_length;     /* length to truncate */
} lr_truncate_t;

typedef struct {
        lr_t lr_common;         /* common portion of log record */
        uint64_t lr_foid;       /* file object to change attributes */
        uint64_t lr_mask;       /* mask of attributes to set */
        uint64_t lr_mode;       /* mode to set */
        uint64_t lr_uid;        /* uid to set */
        uint64_t lr_gid;        /* gid to set */
        uint64_t lr_size;       /* size to set */
        uint64_t lr_atime[2];   /* access time */
        uint64_t lr_mtime[2];   /* modification time */
} lr_setattr_t;

typedef struct {
        lr_t lr_common;         /* common portion of log record */
        uint64_t lr_foid;       /* obj id of file */
        uint64_t lr_aclcnt;     /* number of acl entries */
        /* lr_aclcnt number of ace_t entries follow this */
} lr_acl_t;
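Two points from the text lend themselves to a short illustration: several transaction types share one record structure, and because fixed fields are 64 bits wide, endianness conversion is a uniform per-word swap. The sketch below is not the ZIL replay code. The helper names are hypothetical, the mapping for TX_LINK through TX_ACL is inferred from the structure names, and the swap is meant only for the fixed 64 bit fields, not for trailing strings or user data.

```c
#include <stdint.h>
#include <stddef.h>

enum {
    TX_CREATE = 1, TX_MKDIR, TX_MKXATTR, TX_SYMLINK, TX_REMOVE, TX_RMDIR,
    TX_LINK, TX_RENAME, TX_WRITE, TX_TRUNCATE, TX_SETATTR, TX_ACL
};

/* Name of the record specific structure used for a transaction type. */
static const char *
lr_struct_for(uint64_t txtype)
{
    switch (txtype) {
    case TX_CREATE: case TX_MKDIR: case TX_MKXATTR: case TX_SYMLINK:
        return "lr_create_t";
    case TX_REMOVE: case TX_RMDIR:
        return "lr_remove_t";
    case TX_LINK:     return "lr_link_t";      /* inferred from the name */
    case TX_RENAME:   return "lr_rename_t";    /* inferred from the name */
    case TX_WRITE:    return "lr_write_t";     /* inferred from the name */
    case TX_TRUNCATE: return "lr_truncate_t";  /* inferred from the name */
    case TX_SETATTR:  return "lr_setattr_t";   /* inferred from the name */
    case TX_ACL:      return "lr_acl_t";       /* inferred from the name */
    default:          return NULL;
    }
}

/* Reverse the byte order of one 64 bit word. */
static uint64_t
lr_bswap64(uint64_t v)
{
    v = ((v & 0x00ff00ff00ff00ffULL) << 8) | ((v >> 8) & 0x00ff00ff00ff00ffULL);
    v = ((v & 0x0000ffff0000ffffULL) << 16) | ((v >> 16) & 0x0000ffff0000ffffULL);
    return (v << 32) | (v >> 32);
}

/* Swap the fixed 64 bit fields of a record in place, one word at a time. */
static void
lr_byteswap_words(uint64_t *words, size_t nwords)
{
    for (size_t i = 0; i < nwords; i++)
        words[i] = lr_bswap64(words[i]);
}
```

The four-word lr_t header, for example, could be converted with lr_byteswap_words(header, 4) before lrc_txtype is inspected.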