ZFS Brown-bag sessions: 1 ZFS On-Disk Structures and Concepts


Nagaraj Yedathore nagaraj.yedathore@oracle.com
RPE – ZFS
March-April 2014
Agenda
• ZFS Overview
• ZFS Standout Features
• ZFS Configurations
• The 'zfs' Properties: Overview
• The 'zpool' Properties: Overview
• ZFS on-disk concepts

Oracle Confidential 3
ZFS Features
Why ZFS?

• Flexible configuration. Simplified Storage Solutions.


• Variable Block-sizes, catering to different applications
• Data checksum – No silent data corruption
• In-built Compression – Better space utilization
• In-built Encryption – More security
• Self-healing in mirror/raid-z configurations
• COW – Copy On Write. The on-disk state is always consistent, so no fsck(1M) is needed.

FS/Volume Model vs. Pooled Storage

Traditional Volumes               | ZFS Pooled Storage
----------------------------------+-----------------------------------
Abstraction: virtual disk         | Abstraction: malloc/free
Partition/volume for each FS      | No partitions to manage
Grow/shrink by hand               | Grow/shrink automatically
Each FS has limited bandwidth     | All bandwidth always available
Storage is fragmented, stranded   | All storage in the pool is shared

[Figure: three file systems, each on its own volume, next to three ZFS file systems sharing one storage pool]

Self-Healing Data in ZFS
1. Application issues a read. ZFS mirror tries the first disk. Checksum reveals that the block is corrupt on disk.
2. ZFS tries the second disk. Checksum indicates that the block is good.
3. ZFS returns known good data to the application and repairs the damaged block.

[Figure: an application on top of a ZFS mirror, one panel per step]
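
The three steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not ZFS code: the mirror is modelled as a list of in-memory copies, SHA-256 stands in for whatever checksum the block pointer records, and the names checksum and self_healing_read are made up for this example.

```python
import hashlib


def checksum(data):
    # Stand-in checksum for the sketch; ZFS would use the algorithm
    # recorded in the block pointer (for example fletcher4 or sha256).
    return hashlib.sha256(bytes(data)).hexdigest()


def self_healing_read(mirror, expected):
    """Return known-good data and repair the copies that failed on the way.

    mirror   -- list of bytearray, one copy of the block per mirror side
    expected -- checksum stored with the pointer to this block
    """
    failed = []
    for idx, copy in enumerate(mirror):
        if checksum(copy) == expected:
            good = bytes(copy)
            # Step 3: overwrite every copy that failed verification.
            for bad_idx in failed:
                mirror[bad_idx][:] = good
            return good
        # Steps 1 and 2: this side is corrupt, remember it and try the next.
        failed.append(idx)
    raise IOError("no copy of the block matches its checksum")
```

The read stops at the first copy that verifies, so only the sides that were tried and failed get rewritten, which mirrors the order of events on the slide.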

ZFS Configurations

Single/multiple devices linear/striped configuration
• Pool created from one or more disks, concatenated
• Dynamic striping across all the disks
• No redundancy, no parity – one disk failure fails the pool
• Normal/improved performance, no fault tolerance
• Pool capacity is the sum of all the disks' sizes

[Figure: two storage pools, each shared by three ZFS file systems]

Mirror (single/multiple) devices configuration
• Pool with mirrored disks
• Data redundancy – if one disk fails, the other provides the data.
ZFS repairs the damaged data automatically
• Normal performance, better fault tolerance
• Pool capacity is the size of the smallest disk in the mirror

[Figure: two storage pools, each shared by three ZFS file systems]

RAID-Z configuration
• Pool with striping across disks with distributed parity
• Data redundancy – as many disks as the parity level (1, 2 or 3 for raidz1, raidz2, raidz3) can fail; the remaining disks provide the data.
ZFS repairs the damaged data automatically
• Normal performance, great fault tolerance
• Pool capacity is the size of all disks minus the space taken by parity

[Figure: two storage pools, each shared by three ZFS file systems]
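
The capacity rules of the three configurations can be compared with a rough calculation. The sketch below is a back-of-the-envelope estimate only: it assumes a single top-level vdev, treats every RAID-Z member as being as large as the smallest disk, and ignores labels, the boot area and metadata overhead. The function names are invented for this example.

```python
def stripe_capacity(disks):
    # Striped/concatenated pool: the sum of all disk sizes.
    return sum(disks)


def mirror_capacity(disks):
    # One mirror: limited by the smallest side.
    return min(disks)


def raidz_capacity(disks, parity=1):
    # RAID-Z estimate: all disks minus the equivalent of `parity` disks,
    # with every member counted at the size of the smallest one.
    if parity not in (1, 2, 3):
        raise ValueError("parity must be 1, 2 or 3")
    if len(disks) <= parity:
        raise ValueError("need more disks than parity devices")
    return (len(disks) - parity) * min(disks)
```

For example, three 100 GB disks would give roughly 300 GB striped, 100 GB as a three-way mirror and 200 GB as raidz1 under these assumptions.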

The 'zpool' Properties

Properties of zfs pool
/usr/sbin/zpool get all ( RO readonly; C create-time, I import-time )
• allocated (RO): Amount of space allocated
• capacity (RO): Percentage of the total capacity that is used
• free (RO): Size corresponding to blocks not allocated in this pool.
• size (RO): Total size of pool.
• guid (RO): Globally unique ID for the pool
• health (RO): Pool status, for example "ONLINE", "DEGRADED" or "FAULTED"
• altroot (C/I): path. Alternate root. The value is prefixed to the mount-points. A pool with altroot set gets no cache file entry.
• readonly (C/I): [on|off]. Whether to import read-only.

Properties of zfs pool - 2
/usr/sbin/zpool get all ( RO readonly; C create-time, I import-time, A Anytime )
• bootfs: <dataset name>. Bootable dataset for a BE (boot environment). Used by beadm/LU programs.
• cachefile (C/I): path. The file to be used as cache file for storing the pool config. Default is /etc/zfs/zpool.cache. Use zdb -U <cachefile path> for a non-default one.
• failmode (A): [wait|continue|panic]. What to do in case of catastrophic pool failure.
• autoexpand (A): [on|off]. Adjust pool size upon LUN/slice expansion.
• listsnaps (A): [on|off]. Whether 'zfs list' lists snapshots.
• version (A): [current version <= num <= software SPA version]. Pool version, also called SPA version. It cannot exceed the software SPA version. The preferred way to change it is 'zpool upgrade'.

The 'zfs' Properties

Properties of zfs dataset
/usr/sbin/zfs get all --- RO readonly; C create-time, M Mount-time, A Anytime
• available (RO): Amount of space available from the pool for this dataset
• compressionratio (RO): Ratio of compression. See compression.
• mounted (RO): Whether mounted
• type (RO): If this dataset is filesystem / snapshot / volume / clone.
• origin (RO): For a clone (ZFS/ZVOL), the snapshot it originated from.
• referenced (RO): Amount of space accessible by this dataset
• used (RO): Amount of space used by this dataset and its children.
• usedby* (RO): Breakdown of used: usedbychildren, usedbydataset, usedbyrefreservation, usedbysnapshots.
• volblocksize (C): size. Block-size for volumes. It can only be set when the volume is created.
• recordsize (A): size. Block-size for filesystems.
• readonly (C/M): [on|off]. Whether to mount readonly.

Properties of zfs dataset - 2
/usr/sbin/zfs get all --- RO readonly; C create-time, M Mount-time, A Anytime
• sharenfs (A): Publish the dataset as NFS Share
• sharesmb (A): Publish the dataset as smb Share
• snapdir (A): [hidden|visible]. Default 'hidden'. Whether the .zfs directory, which holds the snapshots under .zfs/snapshot, is hidden or visible at the root of the file system.
• zoned (A): Whether or not managed by zone.
• encryption (C): [on|off|<value>]. Which encryption to adopt, if any (on | aes-128-ccm | aes-192-ccm | aes-256-ccm | aes-128-gcm | aes-192-gcm | aes-256-gcm). on means aes-128-ccm. The encryption policy is fixed when the dataset is created.
• compression (A): [on|off|<value>]. Which compression to adopt (on | off | lzjb | gzip | gzip-N | zle). on means lzjb; gzip means gzip-6. N is 1-9, 1 being the fastest and 9 giving the best compression.
• checksum (A): [on|off|<value>]. Which checksum algorithm to adopt (on | off | fletcher2 | fletcher4 | sha256 | sha256+mac). on currently means fletcher4. When off, only user data goes unverified; the metadata is always checksum-verified.
• atime (A): [on|off]. Whether to update access time on file access.
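
To give a feel for the fletcher4 default mentioned above, here is a small sketch of the Fletcher-4 scheme as it is commonly described: four 64-bit running sums over the 32-bit words of the block. It assumes little-endian words and has not been checked against zdb output, so treat it as an illustration of the idea rather than a bit-exact reimplementation.

```python
import struct

MASK64 = (1 << 64) - 1  # the four accumulators wrap at 64 bits


def fletcher4(data):
    """Return the four running sums (a, b, c, d) over 32-bit LE words."""
    if len(data) % 4:
        raise ValueError("length must be a multiple of 4 bytes")
    a = b = c = d = 0
    for (word,) in struct.iter_unpack("<I", data):
        a = (a + word) & MASK64
        b = (b + a) & MASK64
        c = (c + b) & MASK64
        d = (d + c) & MASK64
    return a, b, c, d
```

The four sums correspond to the four colon-separated words of a cksum= field in zdb output.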

ZFS on-disk concepts

Virtual Devices
• Pool is a collection of virtual devices (VDEVs).
• Two types of VDEVs
– Physical or Leaf
  • Writable media block device
  • Disk or files.
– Logical or Internal
  • Conceptual grouping of physical VDEVs
• Organized as a tree
• Direct children of Root VDEV (logical or physical) are Top-Level VDEVs

[Figure: VDEV tree. The ROOT VDEV (internal/logical) has two top-level VDEVs, mirror M0 (A/B) and mirror M1 (C/D). Below them are the physical or leaf VDEVs: A (disk), B (disk), C (file), D (file).]

Physical VDEV Layout
• VDEV size is a multiple of 256K
• Each physical VDEV has 4 labels of 256K each
– Two at the beginning and two at the end
• Boot Area of 14 x 256K
• 2 Stage Label Update
– No COW
– Stage 1. L0 and L2
– Stage 2. L1 and L3
• While reading, any label can be read
• Allocatable space starts at 4M (16 x 256K)
• All offsets in the rest of ZFS I/O are relative to this (4M)

On-disk layout, from the start to the end of the device:
Label L0                256K      (VDEV Configuration 112K, Array of 128 Uberblocks 128K)
Label L1                256K
Boot Area               3.5M      (14 x 256K)
Allocatable Data Area   X * 256K
Label L2                256K
Label L3                256K
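
The layout above boils down to simple arithmetic. The following sketch derives the four label offsets and the size of the allocatable area for a device of a given size. The constant and function names are chosen for this example only; the numbers come straight from the slide.

```python
LABEL_SIZE = 256 * 1024                        # each of the 4 labels
BOOT_AREA_SIZE = 14 * LABEL_SIZE               # 3.5M boot area
DATA_START = 2 * LABEL_SIZE + BOOT_AREA_SIZE   # 16 x 256K = 4M


def label_offsets(vdev_size):
    """Byte offsets of L0, L1, L2, L3 on a device of vdev_size bytes."""
    if vdev_size % LABEL_SIZE:
        raise ValueError("VDEV size must be a multiple of 256K")
    return [
        0,                           # L0
        LABEL_SIZE,                  # L1
        vdev_size - 2 * LABEL_SIZE,  # L2
        vdev_size - LABEL_SIZE,      # L3
    ]


def allocatable_size(vdev_size):
    """Bytes between the end of the boot area and the start of L2."""
    return vdev_size - DATA_START - 2 * LABEL_SIZE
```

For a 64M file-backed vdev this gives labels at 0, 256K, 63.5M and 63.75M, with 59.5M left for data.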

Label Layout: VDEV Configuration
• Each VDEV Label has 4 parts.
– Unused area to protect VTOC (8K)
– Boot Block reserved area (8K)
– VDEV configuration – Packed nvlist (112K)
– Uberblock Ring (128K)
• VDEV configuration is a packed nvlist
• It is compiled from the configuration of all the related vdevs, that is, all children of the top-level vdev
• It is tallied with the one from the zpool cache, or with the requested one during import.

Label Layout: Uberblock – uberblock_t
• Array of self-checksummed Uberblock structures
– Each entry aligned to sector size.
– 128 uberblocks in case of ASHIFT=9, 512 B sector size
– 32 uberblocks in case of ASHIFT=12, 4K sector size
• Active (best) Uberblock: valid sum, highest TXG and most recent timestamp
– zio takes care of checksum verification
– vdev_uberblock_load takes care of the rest
• Has a rootbp – a blkptr to the objset_phys_t of type Meta Object Set

uberblock_t {
    ub_magic = 0x00bab10c
    ub_version = SPA_VERSION
    ub_txg = SYNC_TXG
    ub_guid_sum
    blkptr_t ub_root_bp
    ub_pool_guid
}
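
The selection rule for the active uberblock can be written down compactly. The sketch below is a simplified model, not the vdev_uberblock_load code: the Uberblock tuple, its checksum_ok flag and the function names are invented here, and the slot size is assumed to be the sector size with a 1K minimum, which is what makes ASHIFT=9 yield the 128 entries quoted above.

```python
from collections import namedtuple

UB_MAGIC = 0x00bab10c
RING_SIZE = 128 * 1024  # uberblock ring inside each label

# Hypothetical in-memory view of one ring entry.
Uberblock = namedtuple("Uberblock", "magic txg timestamp checksum_ok")


def ring_slots(ashift):
    # Assumption: one slot per sector, but never smaller than 1K.
    slot_size = max(1 << ashift, 1024)
    return RING_SIZE // slot_size


def best_uberblock(ring):
    """Pick the valid entry with the highest TXG, then latest timestamp."""
    valid = [ub for ub in ring if ub.magic == UB_MAGIC and ub.checksum_ok]
    if not valid:
        return None
    return max(valid, key=lambda ub: (ub.txg, ub.timestamp))
```

Comparing (txg, timestamp) tuples keeps the TXG as the primary criterion and uses the timestamp only to break ties.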

Block Pointer – blkptr_t
Locates a block of data
• ZFS deals with data in blocks.
• A blkptr locates a block on a vdev.
• Parameters to locate a block
– vdev id, asize and offset
– Called DVA, 16 bytes structure
– First 48 bytes form 3 DVAs
• Checksum type, compression type, obj type, block level, endianness, psize / lsize.
• Birth TXG, Physical Birth TXG.
• Fill count – No. of alloc'd blks accounted for
• Checksum of the content of the block

blkptr_t layout, one 64-bit word per row (bit positions 64 56 48 40 32 24 16 8):
word  contents
0     VDEV 1   ncpy|L4T   ASIZE
1     G| OFFSET 1
2     VDEV 2   ncpy|L4T   ASIZE
3     G| OFFSET 2
4     VDEV 3   ncpy|L4T   ASIZE
5     G| OFFSET 3
6     BDE| LVL   TYPE   CKSUM   COMP   PSIZE   LSIZE
7     PADDING
8     PADDING
9     PHYSICAL BIRTH TXG
A     BIRTH TXG
B     FILL COUNT
C     CHECKSUM[0]
D     CHECKSUM[1]
E     CHECKSUM[2]
F     CHECKSUM[3]

Block Pointer
to see them laid out for a plain file
Creating a file with 5 blocks of 128K (default in ZFS) each
# dd if=/dev/urandom of=/tp/file bs=131072 count=5
5+0 records in
5+0 records out
# ls -li /tp/file
8 -rw-r--r-- 1 root root 655360 Feb 5 11:51 /tp/file

# zdb -ddddd tp 8
Dataset tp [ZPL], ID 18, cr_txg 1, 673K, 8 objects, rootbp DVA[0]=<0:b2c00:200:STD:1>
DVA[1]=<0:18012c00:200:STD:1> [L0 DMU objset] fletcher4 lzjb LE contiguous unique unencrypted 2-copy
size=800L/200P birth=37129L/37129P fill=8 cksum=157453ee91:765706dc8f2:15983eda56fd3:2c3a4cdaee8af9

Object lvl iblk dblk dsize lsize %full type
8 2 16K 128K 642K 640K 100.00 ZFS plain file
168 bonus System attributes
dnode flags: USED_BYTES USERUSED_ACCOUNTED
dnode maxblkid: 4
path /file
...
Indirect blocks:
0 L1 0:b0200:400 4000L/400P F=5 B=37129/37129
0 L0 0:10200:20000 20000L/20000P F=1 B=37129/37129
20000 L0 0:50200:20000 20000L/20000P F=1 B=37129/37129
40000 L0 0:30200:20000 20000L/20000P F=1 B=37129/37129
60000 L0 0:70200:20000 20000L/20000P F=1 B=37129/37129
80000 L0 0:90200:20000 20000L/20000P F=1 B=37129/37129

segment [0000000000000000, 00000000000a0000) size 640K

Notes on the output:
• The inode field shown by ls -li (8) is the obj id in ZFS.
• The first lines of the zdb output ("Dataset tp ...") are a summary of the dataset.
• The L1 line is the indirect block pointer; it points to the data blocks listed below it. In 0:b0200:400, 0 is the vdev id and b0200 is the offset from 4M.
• The fill count (F=5) of the L1 block is the number of allocated blocks below it.
• B=37129/37129 is Logical Birth / Physical Birth.
Block Pointer - 2
to see them laid out for a plain file
Read (-R) the block at vdev id 0, offset b0200 (+4M), of size 400;
decompress (d) and interpret it as an indirect block (i), aka blkptr array
# zdb -R tp 0:b0200:400:di
Found vdev: /var/tmp/disk_tp
DVA[0]=<0:b0200:400:STD:1> [L0 unallocated] off lzjb LE contiguous unique unencrypted 1-copy size=4000L/400P
birth=4L/4P fill=0 cksum=0:0:0:0
bp[ 0] = DVA[0]=<0:10200:20000:STD:1> [L0 ZFS plain file] fletcher4 uncompressed LE contiguous unique
unencrypted 1-copy size=20000L/20000P birth=37129L/37129P fill=1
cksum=400dc802f421:ff607f81b069fd8:32bf283821f5fb10:cb5cfe8d8455c099
bp[ 1] = DVA[0]=<0:50200:20000:STD:1> [L0 ZFS plain file] fletcher4 uncompressed LE contiguous unique
unencrypted 1-copy size=20000L/20000P birth=37129L/37129P fill=1
cksum=400534f6fc4a:1001819e4f77083b:58e301be5d223d00:2073512e6d52e78e
bp[ 2] = DVA[0]=<0:30200:20000:STD:1> [L0 ZFS plain file] fletcher4 uncompressed LE contiguous unique
unencrypted 1-copy size=20000L/20000P birth=37129L/37129P fill=1
cksum=3ff5f10b2aa7:1001cfafb9cfee8b:886165e63c729471:9c27c5bb0691d18b
bp[ 3] = DVA[0]=<0:70200:20000:STD:1> [L0 ZFS plain file] fletcher4 uncompressed LE contiguous unique
unencrypted 1-copy size=20000L/20000P birth=37129L/37129P fill=1
cksum=400a337d885d:1006c19934bf0014:9c73a13282d3dd6b:fee02c1a25b05975
bp[ 4] = DVA[0]=<0:90200:20000:STD:1> [L0 ZFS plain file] fletcher4 uncompressed LE contiguous unique
unencrypted 1-copy size=20000L/20000P birth=37129L/37129P fill=1
cksum=401b23024113:10087f4f9d11da8e:f668c686c23815be:e555bff81ca14651
bp[ 5] = <hole>
bp[ 6] = <hole>
bp[ 7] = <hole>
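
The DVA triples in this output can be turned into device offsets by hand or with a few lines of code. The sketch below parses a vdev:offset:asize string as printed by zdb and adds the 4M that precedes the allocatable area. It assumes that the vdev id is printed in decimal and the other two fields in hex; the helper names are made up for this example.

```python
from collections import namedtuple

DATA_START = 4 * 1024 * 1024  # allocatable space starts at 4M

DVA = namedtuple("DVA", "vdev offset asize")


def parse_dva(text):
    """Parse '0:b0200:400' or '<0:b0200:400:STD:1>' into a DVA tuple."""
    fields = text.strip("<>").split(":")
    if len(fields) < 3:
        raise ValueError("expected vdev:offset:asize")
    # Assumption: vdev id is decimal, offset and asize are hexadecimal.
    return DVA(int(fields[0]), int(fields[1], 16), int(fields[2], 16))


def device_offset(dva):
    # DVA offsets are relative to the start of the allocatable area.
    return DATA_START + dva.offset
```

For the L1 block above, 0:b0200:400 resolves to byte offset 0x4b0200 on vdev 0, with an allocated size of 0x400 (1K).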

DMU Object
zdb -dd pool/dataset / zdb -MM objset pool
• Almost everything is stored on disk as objects (the contents of labels are an exception)
• Examples: files, directories, datasets, lists of snapshots and arrays of objects
• A dnode of 512B defines an object.
• It defines the type of object, block-size, levels of indirection, max. allocated blocks, etc.
• It has up to 3 block pointers, each pointing to either an indirect block or a data block.
• It optionally has one bonus block that contains additional information about the object.
• Type DMU_OT_DNODE (0xa) is a special object pointing to an array of dnodes

The dnode – dnode_phys_t
Identifies and defines each object
• dn_type and dn_bonustype, of type dmu_object_type_t, define what the object and the bonus block contain
• dn_indblkshift is the shift for the indirect block size (16K = 14)
• Can support up to 7 levels.
• Can have up to 3 block pointers that point to either an indirect or a data block.
• dn_datablkszsec is the block size in multiples of 512 B sectors.
• dn_maxblkid is the id of the last block in L0
• Object size in bytes = (dn_maxblkid + 1) * (dn_datablkszsec * 512)
• The bonus area is shared between dn_blkptr[ ] and dn_bonus
• dn_bonuslen defines the length of the bonus buffer

dnode_phys_t fields:
    uint8_t dn_type
    uint8_t dn_indblkshift
    uint8_t dn_nlevels
    uint8_t dn_nblkptr
    uint8_t dn_bonustype
    uint16_t dn_datablkszsec
    uint16_t dn_bonuslen
    uint64_t dn_maxblkid
    Variable size: dn_blkptr[0] dn_blkptr[1] dn_blkptr[2]
    uint64_t dn_bonus[BONUSLEN]
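
The size formula above can be checked against the 5 x 128K file from the zdb example (dnode maxblkid 4, dblk 128K). The helper names below are made up for this sketch; only the field names and the formula come from the slide.

```python
SECTOR = 512  # dn_datablkszsec counts 512-byte sectors


def data_block_size(dn_datablkszsec):
    return dn_datablkszsec * SECTOR


def indirect_block_size(dn_indblkshift):
    return 1 << dn_indblkshift


def object_size(dn_maxblkid, dn_datablkszsec):
    # (dn_maxblkid + 1) * (dn_datablkszsec * 512)
    return (dn_maxblkid + 1) * data_block_size(dn_datablkszsec)
```

A 128K data block is 256 sectors, so object_size(4, 256) gives 655360 bytes, the size ls -li reported for /tp/file, and a shift of 14 gives the 16K indirect block.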

The dnode – dnode_phys_t – dn_type / dn_bonus_type
type of an object – what it holds
• Some dn_type values that are commonly encountered:
– ZAP object (many object types implemented as ZAP)
• DMU_OT_OBJECT_DIRECTORY
• DMU_OT_DSL_PROPS
• DMU_OT_DSL_DATASET_CHILD_MAP
– DSL Directory / DSL Dataset (only part of MOS)
• DMU_OT_DSL_DIR, Describes a DSL Directory
• DMU_OT_DSL_DATASET, Describes a DSL Dataset
– ZFS Plain File / ZFS Directory (not part of MOS)
• DMU_OT_PLAIN_FILE, regular files in file-system
• DMU_OT_DIRECTORY_CONTENTS, Directories in file-system, A ZAP Object
– Config (packed nvlist)
• DMU_OT_PACKED_NVLIST

The dnode – dnode_phys_t – bonus buffer
additional information about the object
• Additional information, like notes written on the back of a sheet.
– Usually the on-disk structure pertinent to the object
– Identified by a dmu_object_type_t
– If it fits, it is embedded in the dnode_phys_t
– Else, the bonus blkptr points to where the bonus buffer is.
– For regular files and directories it holds the file/directory stat information, such as znode_phys_t or sa_header_phys_t
• dn_bonuslen defines how long the bonus buffer is
– Valid only if the bonus buffer is embedded

Object-set – objset_phys_t
FS / Snap / Clone / Vol / MOS
• Set of objects belonging to a group.
• Example: MOS, FS, Volume, Clone, Snap …
• On-disk structure objset_phys_t (1KB)
• os_type: if it is MOS or FS or Volume or ...
• os_meta_dnode is meta dnode (MDN)
– dn_type is DMU_OT_DNODE
– Data block size is 16K
– data is array of dnode_phys_t
– Can have indirect level too

objset_phys_t {
    os_meta_dnode
    os_zil_header
    os_type
    …
}

[Figure: os_meta_dnode, the meta dnode (obj id 0), leads to rows of dnodes: obj ids 1 to 31, 32 to 63 and 64 to 95]
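
With 512-byte dnodes packed into 16K data blocks, each block of the meta dnode holds 32 dnodes, which is why the figure groups object ids in rows of 32. A minimal sketch of the lookup, using names invented for this example:

```python
DNODE_SIZE = 512                 # one dnode_phys_t
DNODE_BLOCK_SIZE = 16 * 1024     # data block size of the meta dnode
DNODES_PER_BLOCK = DNODE_BLOCK_SIZE // DNODE_SIZE  # 32


def locate_dnode(object_id):
    """Return (L0 block id, byte offset within that block) for an object."""
    if object_id < 0:
        raise ValueError("object id must be non-negative")
    blkid, slot = divmod(object_id, DNODES_PER_BLOCK)
    return blkid, slot * DNODE_SIZE
```

Object 8 from the earlier zdb example would sit in block 0 at byte offset 4096 of the dataset's dnode array.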

Object-set – objset_phys_t - 2
FS / Snap / Clone / Vol / MOS
• MOS is meta object-set
– os_type DMU_OST_META
– One per pool
– Contains only meta data objects
• os_type DMU_OST_ZFS is ZFS filesystem / snap / clone
• os_type DMU_OST_ZVOL is ZFS Volume

typedef enum dmu_objset_type {
    DMU_OST_NONE,
    DMU_OST_META,
    DMU_OST_ZFS,
    DMU_OST_ZVOL,
    DMU_OST_OTHER, /* For testing only! */
    DMU_OST_ANY, /* Be careful! */
    DMU_OST_NUMTYPES
} dmu_objset_type_t;

[Figure: objset_phys_t (os_meta_dnode, os_zil_header, os_type, …) with the meta dnode (obj id 0) leading to rows of dnodes: obj ids 1 to 31, 32 to 63 and 64 to 95]

Bigger Picture
VDEV to Label to active uberblock to rootbp to objset to MDN to dnodes

[Figure: the chain drawn across the structures introduced so far]
• Physical VDEV: Label L0 256K (VDEV Configuration 112K, Array of 128 Uberblocks 128K), Label L1 256K, Boot Area 3.5M (14 x 256K), Allocatable Data Area X * 256K, Label L2 256K, Label L3 256K
• uberblock_t { ub_magic = 0x00bab10c, ub_version = SPA_VERSION, ub_txg = SYNC_TXG, ub_guid_sum, blkptr_t ub_root_bp, …, ub_pool_guid }
• objset_phys_t { os_meta_dnode (Meta Dnode, obj id 0), os_zil_header, os_type = DMU_OST_META }
• Dnode array: obj ids 1 to 31, 32 to 63, 64 to 95

Bigger Picture (extension)
leading to DSL
• Object id 1 of MOS is an object directory ZAP object
• Its name "root_dataset" indicates the root dataset for the pool. The value indicates the object of type DMU_OT_DSL_DIR
• This object contains all information about the DSL Directory.
• dsl_dir_phys_t's "child_dir_zapobj" of type ZAP establishes the parent-children relationship
• dsl_dir_phys_t's "head dataset" has the dsl_dataset for the dir's active FS

Object Directory (MOS obj id 1):
    root_dataset = 2
    config = 24
    creation_version = 33
    ...

[Figure: the label, uberblock_t and objset_phys_t (os_type = DMU_OST_META) chain, ending at the MOS dnodes obj id 1 and obj id 2]

DSL Layer
Dataset and Snapshot Layer
• DSL Dataset: Represents an object set
• DSL Directory: Provides a hierarchical framework to fit
– related Datasets
– their properties inheritance
– space accounting, estimation and enforcement
• Each DSL dir has one Active Dataset
• May have snapshot datasets
• One DSL dataset for each object-set
• DSL Pool: Provides overall in-memory state of the pool

[Figure: DSL Infrastructure. A DSL Directory with two ZAP objects (Child Dataset Information and Properties), child DSL Directories (child1) and (child2), and DSL Datasets (active), (snapshot), (snapshot), each pointing to its DMU Object Set (active), (snapshot), (snapshot).]

DSL dir (dsl_dir_phys_t)
Dataset and Snapshot Layer
typedef struct dsl_dir_phys {
    uint64_t dd_creation_time;        /* timestamp of creation of this DD */
    uint64_t dd_head_dataset_obj;     /* object containing the dsl dataset for active FS */
    uint64_t dd_parent_obj;           /* object containing parent dsl dir of this DD */
    uint64_t dd_origin_obj;           /* (just for clones) obj containing origin DS */
    uint64_t dd_child_dir_zapobj;     /* ZAP obj containing map of names of children DDs */
    uint64_t dd_used_bytes;           /* bytes used by all DS's of this DD */
    uint64_t dd_compressed_bytes;     /* compressed bytes used by all DS's of this DD */
    uint64_t dd_uncompressed_bytes;   /* uncompressed bytes used by all DS's of this DD */
    uint64_t dd_quota;                /* quota, if any, set for all DS's of this DD */
    uint64_t dd_reserved;             /* reservation, if any, set for all DS's of this DD */
    uint64_t dd_props_zapobj;         /* ZAP object containing the non-default properties */
    uint64_t dd_deleg_zapobj;
    uint64_t dd_flags;
    uint64_t dd_used_breakdown[DD_USED_NUM];
    uint64_t dd_clones;
    uint64_t dd_keychain_obj;
    uint64_t dd_pad[12];
} dsl_dir_phys_t;

typedef enum dd_used {
    DD_USED_HEAD,
    DD_USED_SNAP,
    DD_USED_CHILD,
    DD_USED_CHILD_RSRV,
    DD_USED_REFRSRV,
    DD_USED_NUM
} dd_used_t;

DSL Dataset (dsl_dataset_phys_t)
Dataset and Snapshot Layer
typedef struct dsl_dataset_phys {
    uint64_t ds_dir_obj;              /* object containing the dsl directory for this DS */
    uint64_t ds_prev_snap_obj;        /* object containing the dsl dataset for previous snap */
    uint64_t ds_prev_snap_txg;        /* TXG the previous snap was created in. Important. */
    uint64_t ds_next_snap_obj;        /* object containing the dsl dataset for next snap */
    uint64_t ds_snapnames_zapobj;     /* ZAP obj containing map of snapname to dsl dataset obj */
    uint64_t ds_num_children;         /* the next snap or the active FS, plus one for each clone */
    uint64_t ds_creation_time;        /* timestamp of creation of this DS */
    uint64_t ds_creation_txg;         /* TXG this DS was created in */
    uint64_t ds_deadlist_obj;         /* object containing the dsl deadlist for this DS */
    uint64_t ds_used_bytes;           /* bytes used by the objset of this DS */
    uint64_t ds_compressed_bytes;     /* compressed bytes used by the objset of this DS */
    uint64_t ds_uncompressed_bytes;   /* uncompressed bytes used by the objset of this DS */
    uint64_t ds_unique_bytes;         /* (just for snaps) extent of divergence from active */
    uint64_t ds_fsid_guid;
    uint64_t ds_guid;
    uint64_t ds_flags;
    blkptr_t ds_bp;                   /* where the objset_phys_t is located */
    uint64_t ds_next_clones_obj;      /* ZAP obj containing list of clone dsl dataset objs */
    uint64_t ds_props_obj;
    uint64_t ds_userrefs_obj;
    uint64_t ds_shares_obj;
} dsl_dataset_phys_t;
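
The ds_prev_snap_obj field links a dataset to its previous snapshot, so the snapshots of a file system form a chain that can be walked from the head dataset backwards. The sketch below models the MOS as a plain dictionary and assumes that a value of 0 means "no previous snapshot"; both the model and the object ids in the usage note are hypothetical.

```python
def snapshot_chain(datasets, head_obj):
    """Return the snapshot object ids reachable from head_obj, newest first.

    datasets -- dict mapping object id to a dict with a 'ds_prev_snap_obj' key
    head_obj -- object id of the head (active) dsl dataset
    """
    chain = []
    seen = set()
    obj = datasets[head_obj]["ds_prev_snap_obj"]
    while obj != 0:  # assumption: 0 terminates the chain
        if obj in seen:
            raise ValueError("cycle in ds_prev_snap_obj links")
        seen.add(obj)
        chain.append(obj)
        obj = datasets[obj]["ds_prev_snap_obj"]
    return chain
```

With a made-up head dataset 18 whose previous snapshot is object 40, which in turn follows object 30, the function returns [40, 30].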
Bigger Picture (extension)
leading to DSL and FS Objset
• MOS: objset_phys_t { os_meta_dnode (Meta Dnode, obj id 0), os_zil_header, os_type = DMU_OST_META, … }
• MOS obj id 1, Object Directory (ZAP): root_dataset = 2, config = 24, creation_version = 33, ...
• MOS obj id 2, DSL Directory (DSL_DIR): head_dataset = 18, ...
• MOS obj id 18, DSL Dataset (DSL_DATASET): ds_bp = location of the file system's objset_phys_t, ...
• FS: objset_phys_t { os_meta_dnode (Meta Dnode, obj id 0), os_zil_header, os_type = DMU_OST_ZFS, … }
• FS dnodes: obj id 1, obj id 2, obj id 3, ... obj id 30, obj id 31. These objects belong to the objset of type DMU_OST_ZFS
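
The walk from the MOS to the root file system's object set can be traced as three lookups. The sketch below replays it on a toy dictionary that stands in for the MOS; the object ids (1, 2, 18) and entry names are the ones shown on the slides, while the dictionary layout, the placeholder ds_bp value and the function name are assumptions made for illustration.

```python
# Toy stand-in for the MOS: object id -> contents.
MOS = {
    1: {"type": "DMU_OT_OBJECT_DIRECTORY",
        "zap": {"root_dataset": 2, "config": 24, "creation_version": 33}},
    2: {"type": "DMU_OT_DSL_DIR", "dd_head_dataset_obj": 18},
    18: {"type": "DMU_OT_DSL_DATASET", "ds_bp": "blkptr-of-fs-objset"},
}


def find_root_fs_objset(mos):
    """Follow object directory -> DSL dir -> head DSL dataset -> ds_bp."""
    directory = mos[1]                       # obj id 1: object directory ZAP
    dir_obj = directory["zap"]["root_dataset"]
    dsl_dir = mos[dir_obj]
    if dsl_dir["type"] != "DMU_OT_DSL_DIR":
        raise ValueError("root_dataset does not name a DSL directory")
    ds_obj = dsl_dir["dd_head_dataset_obj"]
    dataset = mos[ds_obj]
    if dataset["type"] != "DMU_OT_DSL_DATASET":
        raise ValueError("head dataset object is not a DSL dataset")
    return dir_obj, ds_obj, dataset["ds_bp"]
```

In a real pool each lookup would mean reading a dnode from the MOS and decoding its ZAP or bonus contents; here the dictionary hides that step.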

DSL
Dataset and Snapshot Layer: DSL Layout for a typical ZFS hierarchy

[Figure: DSL layout for a typical ZFS hierarchy, with every node labeled by its DSL object id. DSL directories (dsl_dir) shown: rpool (the root dir), ROOT, S11u10, S11u11, $ORIGIN, $MOS and $FREE, each linked to a head dataset (dsl_dataset); the head datasets of $MOS and $FREE are NULL. Snapshot labels shown: @one next to rpool, @one and @two next to ROOT, @Dec next to S11u10, @Aug and @Mar next to S11u11, and @ORIGIN next to $ORIGIN. One dataset is marked "Clone of S11u11@Aug". Legend: DSL object id, DSL dataset links, DSL dir links, snapshots, DSL directories, DSL head dataset.]

Q&A

Appendix
