
[How Things Work]: Linux

File systems & Block Devices


@khenidak

Nov 12th, 2021


Where we are. Where we are going.

Series roadmap: Processes + IPC → FS + I/O + Block Dev (you are here) → Memory + VMM → Scheduling + Threading → Virtualization → Containers → Extending Kernel → Boot + Security → Roundup (Jan++)


Refresher
• Processes are identified by pid.
• Each process has a unique address space.
• The kernel maps itself into every user space process's address space, "downward" from the top of the address space.
• Syscalls are the kernel's interface to user space programs. A syscall is executed by 1) writing input to CPU registers [usually int variants &&/|| bit masks + mem address(es)], then 2) raising a soft irq.
• Data crosses the kernel/userspace cut line via the copy_to_user()/copy_from_user() kernel functions.
• Oops from last session:
• We didn't talk about: socketpair(2) as a bi-directional pipe for IPC (sketch below).
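Since socketpair(2) came up, here is a minimal sketch of it as a bi-directional IPC channel (a parent/child ping-pong; error handling trimmed):

/* socketpair(2): unlike pipe(2), both ends are read/write. */
#include <stdio.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int sv[2];
    char buf[8];

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
        return 1;

    if (fork() == 0) {            /* child talks on sv[1] */
        close(sv[0]);
        read(sv[1], buf, sizeof(buf));
        write(sv[1], "pong", 5);  /* 5 bytes: includes the NUL */
        _exit(0);
    }
    close(sv[1]);                 /* parent talks on sv[0] */
    write(sv[0], "ping", 5);
    read(sv[0], buf, sizeof(buf));
    printf("parent got: %s\n", buf);
    wait(NULL);
    return 0;
}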
How are we going to tell today’s story?

Block Devices → How Linux Does "Files" → File Systems → IO Flow
Block Devices

• Block devices are "well named".
• A BD is a long array of bytes addressable by "sector". Sector size is a multiple of 512b.
• The BD (driver) executes I/O as "requests". Each request is one contiguous extent aligned to multiples of the sector size, and has max, min, and preferred request sizes, in addition to a max in-flight request count. Requests are vectored IO.
• A BD also defines a "write back throttle": a choke point that throttles page cache write-back to the BD if the latency of actual in-flight requests > xx usec*.
• A BD may have one or more H/W request processing queues (e.g., R/W). The driver may register one or multiple s/w queues (blk-mq)**.
• BDs are identified by major:minor number (historically a u16: high 8 bits major, low 8 minor; the modern dev_t packs 12 major / 20 minor bits into a u32). See the sketch after this slide.
• A BD driver may register multiple software queues (e.g., per-CPU I/O queue(s)).
• Software queues are operated on by the I/O scheduler. The I/O scheduler may re-order and merge requests as needed (i.e., the elevator algo).
• A BD driver may execute the request any way it sees fit: H/W***, DMA, …

[Diagram, logical view of the I/O layer: software queues of requests → IO scheduler → driver → block device (single / multi H/W queue)]
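To see the major:minor pair from userspace, a small sketch using stat(2) (the /dev/sda path is just illustrative):

/* Read a block device's major:minor out of st_rdev. */
#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>   /* major(), minor() */

int main(void) {
    struct stat st;
    if (stat("/dev/sda", &st) == -1)   /* any block device node works */
        return 1;
    printf("major=%u minor=%u\n", major(st.st_rdev), minor(st.st_rdev));
    return 0;
}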
But.. But! The "block device" behind the driver does not have to be local hardware. The same I/O layer (software queues → IO scheduler → driver) can sit in front of:

• H/W specific / controller
• Network endpoint (nbd || iscsi)
• vmbus (hyper-v)
• virtio (kvm/qemu)
• custom
Ok then, what is this ioctl thing?
• Block devices (and char devices) need to offer userspace some sort of "additional control knobs". Think of it as a "command" interface.
• What they do is up to the device and how it interprets the ioctl request.
• Most devices use it for runtime configuration. It is a major PITA to read/write files inside the kernel (just look up how to read files inside the kernel). It also eliminates the need to watch for file changes etc.
• ioctl itself is a syscall that looks like "int ioctl(int fd, unsigned long request, long param);". Each ioctl has a 64-bit request (id) and a single 64-bit argument (remember how syscalls are executed?). See the sketch below.
• Note: the first argument is an fd. Devices register a set of "file operations", e.g. (seek, open, close etc). A user can then create a "device node file" in any file system pointing to the device's major:minor. When open(2) etc. is called, the device's ops are invoked. And yes, there is one mapped to ioctl(2).
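As a concrete example, the block layer's BLKGETSIZE64 ioctl returns a device's size in bytes. A sketch (device path illustrative; needs read permission on the node):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>        /* BLKGETSIZE64 */

int main(void) {
    uint64_t size = 0;
    int fd = open("/dev/sda", O_RDONLY);
    if (fd == -1)
        return 1;
    /* fd selects the device; the request id selects the "command". */
    if (ioctl(fd, BLKGETSIZE64, &size) == -1)
        return 1;
    printf("device size: %llu bytes\n", (unsigned long long)size);
    close(fd);
    return 0;
}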
How Linux Does “Files”
Everything is a File, But what is a file?
attr
• Owner user, owner group, other.
• USER | GROUP | OTHER permissions
• Sticky bits U | G | T*
• 3 dates:
• Access time
• Content modify time
• Content || attr change time
• (no creation TS in linux)

xattr
• Not supported everywhere**, ***
• namespace.key: val map, with variable sizes for both depending on the file system (max 64k).
• Namespaces: security, system, trusted (CAP_SYS_ADMIN), user. A sketch follows below.

content
• Some are just plain old files with content.
• Some files are firehose style (i.e., sockets, pipes, fifos); they support neither seeking nor mapping.
• Some are special files such as device "nodes".
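A sketch of xattrs in action via setxattr(2)/getxattr(2) (the file name is illustrative; the "user." namespace needs no special capability on most file systems):

#include <fcntl.h>
#include <stdio.h>
#include <sys/xattr.h>
#include <unistd.h>

int main(void) {
    char val[64];
    int fd = open("somefile", O_CREAT | O_WRONLY, 0644);  /* make sure it exists */
    close(fd);

    if (setxattr("somefile", "user.origin", "demo", 4, 0) == -1)
        return 1;   /* fails on FSs without xattr support */
    ssize_t n = getxattr("somefile", "user.origin", val, sizeof(val));
    if (n == -1)
        return 1;
    printf("user.origin=%.*s\n", (int)n, val);
    return 0;
}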

But really what can be a file? A: Everything


• Files, directories.
• Sockets || fifos.
• Devices (block and chars).
• Links to others.

**Links, TO WHAT?!
• Symbolic links (symlinks) are a lot like an href: they point to a file or directory by its exact location (path).
• A hard link is a mirror of the file that remains even after the original file is deleted (how do these things work? see the sketch below).
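A sketch that makes the difference visible: a hard link shares the inode, a symlink is just a path reference (names illustrative):

#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    struct stat a, b;
    close(open("orig", O_WRONLY | O_CREAT, 0644));

    link("orig", "hard");      /* hard link: a second name for the same inode */
    symlink("orig", "soft");   /* symlink: an "href" holding the path "orig" */

    stat("orig", &a);
    stat("hard", &b);
    printf("same inode? %s (%lu == %lu)\n",
           a.st_ino == b.st_ino ? "yes" : "no",
           (unsigned long)a.st_ino, (unsigned long)b.st_ino);

    unlink("orig");            /* "hard" still resolves; "soft" now dangles */
    return 0;
}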
Let us do a design experiment
[Diagram: Process A and Process B each hold a per-process FD table in memory; several FDs point to one shared File Object*, which points to a dentry, which points to an inode.]
Links are to the inode itself. A dentry holds the name:inode mapping and parent/child links for child files/dirs. dentries and inodes are frequently accessed, hence a large cache in the kernel colloquially referred to as the dentry/inode cache. Note: the inode cache is *not* file content. A sketch of the FD/file-object split follows below.
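A sketch of the FD vs file object split: dup(2) makes a new FD pointing at the same file object, so the file offset is shared (file name illustrative):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int fd1 = open("demo", O_RDWR | O_CREAT | O_TRUNC, 0644);
    int fd2 = dup(fd1);        /* new FD, same file object (and offset) */

    write(fd1, "abc", 3);
    /* The offset lives in the file object, so fd2 reports 3 too. */
    printf("fd2 offset: %ld\n", (long)lseek(fd2, 0, SEEK_CUR));
    return 0;
}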
File Systems
Linux VFS
• inode and dentry objects are part of the Linux Virtual File System (VFS) interface.
• VFS was created to enable Linux to "talk" to multiple file systems. Each file system can implement things any way it sees best.
• When you mount, you are not "mounting block devices", you are "mounting file systems".
• Linux has a single "tree" that file systems mount onto (remember, each process can have a different view of it using mount namespaces*).

[Diagram: an example Linux mount tree. Note: the diagram does not show pseudo file systems. Source: https://techviewleo.com/mounting-and-unmounting-filesystems-on-linux/]
If everything is a file, then filesystems are:
Real
• Backed by a block device (or a partition of one).
• I/Os are eventually persisted to disk (or forced by fsync(2)).

Pseudo: kernel's own data structures
• /dev /sys /proc are all kernel data structures offered as files.
• … (can you name a more modern example?)
• debugfs (allows module devs to expose debug data).

Something in between, something else
• Union file systems such as aufs and overlay. They present a union, layered view of other file systems, usually with the top layer writable as COW. Used heavily in the container verse.
• Bind mounts, while not exactly a VFS implementation (they work by mapping the dentry of source -> destination).
• 9p, nfs, ceph, cifs, and tmpfs are file systems that are not backed by a block device but write to a remote endpoint or elsewhere.
• Fuse works in tandem with a user space process that can do <whatever>.
• These are usually implemented by a kernel module that registers the file system type.

mount command has the following format:


mount -t <..> <what> <to-mount-point> -o <options passed to whatever implements the filesystem>

t: type of VFS. Type is the name the VFS implementation used to register itself with the kernel.
o: options, passed as-is to whoever implements the fs.
What: a name that identifies an inode (a device or a directory), or an arbitrary string (example: <host>:<path> for nfs).
Destination: always a mount point, always an inode.

The VFS implementation returns a dentry (the root of its tree); this dentry is then added to the kernel's main fs tree. A mount(2) sketch follows below.
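The same operation via the mount(2) syscall, sketched with a tmpfs (mount point illustrative; requires CAP_SYS_ADMIN):

#include <stdio.h>
#include <sys/mount.h>

int main(void) {
    /* "tmpfs" is the name the FS registered with the kernel;
       "size=64m" goes as-is to the tmpfs implementation. */
    if (mount("none", "/mnt/demo", "tmpfs", 0, "size=64m") == -1) {
        perror("mount");
        return 1;
    }
    return 0;
}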
Where is the cut line between block dev and fs?
• inode space is allocated once per FS, typically at 1% (optionally increased when running mkfs). The inode data structure does not include the "start offset(s)" of where the file is saved on the block device, only the number of blocks. The where/how belongs to the FS, not the kernel.
• inode and dentry data is saved in "super blocks". The file system is required to save them (typically starting at block zero) and report them to the kernel. File systems typically save them multiple times in different places on the disk; the entire "fsck dance" is mostly about finding a good copy of the super blocks and restoring them to the primary location. Use "dumpe2fs <dev> | grep super" to learn about them.
• Therefore, hard links can't span FSs.
• The file system implementation has control over:
• Where files are saved on the disk and the mapping between blocks and inodes.
• File systems don't even need to track "blocks"; they usually use extents (ranges of blocks), which minimizes the tracking on a per-file basis. This data is managed by the FS implementation; the kernel does not know about it. (A FIEMAP sketch follows below.)
• A file system can be journaled via jbd (ext{3,4}) or another implementation-specific mechanism (e.g., btrfs and xfs have their own journaling).
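To see extents from userspace, a sketch using the FS_IOC_FIEMAP ioctl (file name illustrative; not every FS implements it):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

int main(void) {
    int fd = open("somefile", O_RDONLY);
    if (fd == -1)
        return 1;

    /* fiemap header + room for up to 32 extent records. */
    struct fiemap *fm = calloc(1, sizeof(*fm) + 32 * sizeof(struct fiemap_extent));
    fm->fm_length = ~0ULL;        /* map the whole file */
    fm->fm_extent_count = 32;

    if (ioctl(fd, FS_IOC_FIEMAP, fm) == -1)
        return 1;

    for (unsigned i = 0; i < fm->fm_mapped_extents; i++)
        printf("extent %u: logical=%llu physical=%llu len=%llu\n", i,
               (unsigned long long)fm->fm_extents[i].fe_logical,
               (unsigned long long)fm->fm_extents[i].fe_physical,
               (unsigned long long)fm->fm_extents[i].fe_length);
    free(fm);
    close(fd);
    return 0;
}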
IO Flow
[Diagram: write(fd, …) → FD → VFS (inode + dentry LRU cache) → FS → Request (offset + pages) → page cache / blkio → s/w queues → device driver → device. Write-back of the page cache is triggered by the VMM freeing pages, fsync, or O_SYNC; O_DIRECT bypasses the page cache.]

• Read Path
• The FD has the offset the file is at.
• The FD is tied to a file object + inode (permission assertion).
• The FS is called to execute the read, where file offsets are translated to disk offsets.
• An IO request is created by the FS and submitted to blkio.
• If the request's content is in the page cache*, the result is returned immediately.
• If not, the request is put in a s/w queue where the driver handles it and ends it when it is done (the request has empty pages waiting for results).
• Write Path
• … (similar to the read steps until the request is created)
• If the I/O was done as O_DIRECT, the page cache is not written to and the request goes directly to the s/w queues, then the driver. The call might return before the I/O completes**.
• If the I/O was done w/ O_SYNC, the request goes to the page cache (the caller is still waiting); then a request is created following the normal path. The call will not return until the device acks the request's completion.
• A normal request goes to the page cache until fsync is called (or other conditions, such as the VMM freeing pages, trigger the device write).
(A sketch contrasting the three write paths follows below.)
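A sketch contrasting the three write paths (file names illustrative; O_DIRECT wants a sector-aligned buffer/offset/length):

#define _GNU_SOURCE           /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    void *buf;
    posix_memalign(&buf, 4096, 4096);   /* aligned buffer for O_DIRECT */
    memset(buf, 'x', 4096);

    /* 1. O_DIRECT: bypasses the page cache entirely. */
    int fd = open("direct.bin", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    write(fd, buf, 4096);
    close(fd);

    /* 2. O_SYNC: goes through the page cache; returns only on device ack. */
    fd = open("sync.bin", O_WRONLY | O_CREAT | O_SYNC, 0644);
    write(fd, buf, 4096);
    close(fd);

    /* 3. Normal: lands in the page cache; fsync forces the write-back. */
    fd = open("cached.bin", O_WRONLY | O_CREAT, 0644);
    write(fd, buf, 4096);
    fsync(fd);
    close(fd);

    free(buf);
    return 0;
}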
What have we not talked about?
• You don't even need an FS: Linux supports raw devices, where I/O is done without a file system. IOs are done using the device's own offsets.
• The request/bio/bio_vec/page object graph: IOs are requests. Requests consist of lists of bios, each bio has vectors, and each vector has a page (page cache) and an offset within that page. This mess is needed to allow the IO scheduler to merge requests as needed. It means multiple aligned writes touching many files can be executed in one request.
• IO schedulers in detail: the docs cover them quite well, but without understanding the crud around them they are a bit too confusing.
• Stacked blkio devices: lvm and device mapper are examples of devices stacked on top of each other. For example, disk encryption is done by dm-crypt by presenting an encrypting blkdev on top of the actual blkdev:
• User -> (blk requests to encryption blkdev) -> encrypt -> (blk requests to actual device) -> actual device
• Char devices: devices that don't have addressable storage (e.g., serial ports).
• Slabs: we will talk about them in detail in the vmm session. For now, it is a method of allocating a large memspace once and reusing it, instead of malloc()/free() for every mem use.
Thank You
