
File Systems

Operating Systems
File systems
● Way to organize and (persistently) store
information
● Abstraction over storage devices
○ Hard disk, SSD, network, …
● Organized in files and (typically) directories.
● Examples:
○ FAT12/FAT16: MS-DOS
○ NTFS: Windows
○ Ext4: Linux
○ APFS: macOS/iOS
Overview
Storage

Operating System

I want to read /home/koen/shell.c

User program
Overview
● Storage: HDD, SSD, network, RAM, …
● Operating System:
○ Buffer cache
○ File system drivers: FAT driver, NTFS driver, Ext4 driver, ...
○ Virtual File System, page cache
● Syscall: open, read, write, readdir, ...
● User program
Overview
● Storage: HDD, SSD, network, RAM, … (Chapter 5 (I/O), next week)
● Operating System (next FS lectures: implementation):
○ Buffer cache
○ File system drivers: FAT driver, NTFS driver, Ext4 driver, ...
○ Virtual File System, page cache
● Syscall and user program (this lecture: files, directories, user operations)
Overview
● Files and directories + user operations
● File system implementation
Files
● Abstract storage nodes
● File access:
○ Sequential vs. random access
● File types:
○ Regular files, directories, soft links
○ Special files (e.g., device files, metadata files)
● File structure:
○ OS’ perspective: Files as streams of bytes
○ Program’s perspective: Archives, Executables, etc.
○ Is OS ever aware of the file structure?
File Naming
● Different file systems have different
limitations/conventions for file names
● File extensions
● File name length
○ FAT12: 8.3 characters (later extended to 255)
○ Ext4: 255 characters
● Special characters in file names
○ FAT12: No "*/:<>?\| and more
○ Ext4: No '\0' and '/' or the special names "." and ".."
● Case sensitivity
Possible File Attributes
File Operations
● Create/Delete
● Open/Close
● Read/Write
● Append
● Seek
● GetAttributes/SetAttributes
● Rename
File Operations - Unix
● Opening and reading file:
○ int fd = open("foo.txt", O_RDONLY);
char buf[512];
ssize_t bytes_read = read(fd, buf, sizeof(buf) - 1);
close(fd);
buf[bytes_read > 0 ? bytes_read : 0] = '\0'; /* read() adds no terminator */
printf("read %zd: %s\n", bytes_read, buf);
● Opening a file returns a handle (file descriptor)
for future operations.
● Any call may fail; it then returns -1 and sets errno, e.g.:
○ ENOENT: File does not exist
○ EBADF: Bad file descriptor
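In userspace, the libc wrappers signal failure by returning -1 and storing the error code in errno; the negated form (such as -ENOENT) is what the kernel and FUSE callbacks return. A minimal sketch of checking every call, where the helper name `read_prefix` is made up for illustration:

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read up to cap-1 bytes from path into buf and
 * NUL-terminate. Returns the number of bytes read, or a negated errno
 * value (e.g., -ENOENT) on failure. */
ssize_t read_prefix(const char *path, char *buf, size_t cap)
{
    assert(path != NULL && buf != NULL);
    if (cap == 0)
        return -EINVAL;

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -errno;            /* e.g., ENOENT: file does not exist */

    ssize_t n = read(fd, buf, cap - 1);
    if (n < 0) {
        int saved = errno;        /* close() may overwrite errno */
        close(fd);
        return -saved;
    }
    buf[n] = '\0';                /* read() does not terminate strings */
    close(fd);
    return n;
}
```

Saving errno before calling close() matters because any later library call is allowed to change it.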
File Operations - Unix
● Seeking in files:
○ int fd = open("foo.txt", O_RDONLY);
lseek(fd, 128, SEEK_CUR);
char buf[8];
read(fd, buf, 8);
close(fd);
● Move current position in file forwards or
backwards.
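Seeking is what turns the sequential byte stream into random access. The sketch below wraps open, lseek and read into one hypothetical helper, `read_at`, that fetches bytes from an absolute offset; the name and the error convention are assumptions made for this example.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read up to n bytes starting at absolute offset
 * off. Returns the number of bytes read (possibly fewer than n near
 * the end of the file) or -1 on error. */
ssize_t read_at(const char *path, off_t off, void *buf, size_t n)
{
    assert(path != NULL && buf != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* SEEK_SET: absolute position; SEEK_CUR: relative to the current
     * position; SEEK_END: relative to the end of the file. */
    if (lseek(fd, off, SEEK_SET) == (off_t)-1) {
        close(fd);
        return -1;
    }
    ssize_t got = read(fd, buf, n);
    close(fd);
    return got;
}
```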
File Operations - Unix
● Writing files:
○ int fd = open("foo.txt", O_WRONLY |
O_CREAT |
O_TRUNC, 0644); /* O_CREAT requires a mode argument */
char buf[] = "Hi there";
write(fd, buf, strlen(buf));
close(fd);
● O_CREAT: Create file if it does not exist
● O_TRUNC: “Truncate” file to size 0 if it exists
(i.e., throw away current file contents)
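Two details are easy to miss when writing files: open() needs a third argument (the permission bits) whenever O_CREAT is given, and write() may store fewer bytes than requested. A sketch of a hypothetical helper, `write_text`, that handles both:

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: replace the contents of path with text.
 * Returns 0 on success, -1 on error (errno set by the failing call). */
int write_text(const char *path, const char *text)
{
    assert(path != NULL && text != NULL);
    /* The mode only applies if the file is created, and it is
     * filtered by the process umask. */
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    size_t left = strlen(text);
    while (left > 0) {
        ssize_t n = write(fd, text, left);   /* may be a short write */
        if (n < 0) {
            close(fd);
            return -1;
        }
        text += n;
        left -= (size_t)n;
    }
    return close(fd);
}
```

Because of O_TRUNC, writing a short string over a longer file leaves only the short string behind.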
File Operations - Unix
● Removing file:
○ unlink("foo.txt");
● Renaming file:
○ rename("foo.txt", "bar.txt");
● Change file permission attribute:
○ chmod("foo.txt", 0755);
● Change file owner attribute:
○ chown("foo.txt", uid, gid);
Directories
● Data structures organizing and maintaining
information about files
● Often stored as file entries with special
attributes
● Path components are separated by “/” (Unix) or “\”
(Windows)
● Special directory entries:
○ . Current directory
○ .. Parent directory
Directory Hierarchies
Directory Operations
● Create/Delete
● Opendir/Closedir
● Readdir
● Rename
● Link/Unlink (hard and soft links)
Directory Operations - Unix
● Reading current directory contents:
○ DIR *dirp = opendir(".");
struct dirent *dirent;
while ((dirent = readdir(dirp)))
    printf("%s\n", dirent->d_name);
closedir(dirp);
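readdir also returns the special "." and ".." entries, so code that walks a directory usually filters them out. A small sketch with error checking; the helper name `count_entries` is hypothetical:

```c
#include <assert.h>
#include <dirent.h>
#include <string.h>

/* Hypothetical helper: count the entries of a directory, skipping the
 * special "." and ".." entries. Returns -1 if it cannot be opened. */
int count_entries(const char *path)
{
    assert(path != NULL);
    DIR *dirp = opendir(path);
    if (dirp == NULL)
        return -1;

    int count = 0;
    struct dirent *entry;
    while ((entry = readdir(dirp)) != NULL) {
        if (strcmp(entry->d_name, ".") == 0 ||
            strcmp(entry->d_name, "..") == 0)
            continue;
        count++;
    }
    closedir(dirp);
    return count;
}
```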
Virtual File System (VFS)
● Interface between user programs and
underlying file systems
● Programs only see a single file system, e.g.:
○ /: Partition 1 (Ext3)
○ /home: Partition 2 (Ext4)
○ /mnt/usb: USB stick partition 1 (FAT)
○ /mnt/share: Remote network share (NFS)
○ /tmp: RAM (non-persistent) storage (tmpfs)
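One way to picture how the VFS hides several file systems behind one tree is a mount table plus a table of function pointers per driver. The sketch below is a toy model only: `struct fs_ops`, `struct mount`, the two fake drivers and the longest-prefix rule are all assumptions for illustration, not the data structures of any real kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model: each "driver" supplies a table of function pointers. */
struct fs_ops {
    const char *name;
    int (*read_byte)(const char *path);   /* returns a byte or -1 */
};

struct mount {
    const char *prefix;                   /* mount point, e.g. "/mnt/usb" */
    const struct fs_ops *ops;
};

static int ext4_read_byte(const char *path) { (void)path; return 'E'; }
static int fat_read_byte(const char *path)  { (void)path; return 'F'; }

static const struct fs_ops toy_ext4 = { "ext4", ext4_read_byte };
static const struct fs_ops toy_fat  = { "fat",  fat_read_byte };

/* Pick the mount with the longest prefix that matches the path at a
 * component boundary ("/" itself matches every absolute path). */
const struct mount *vfs_resolve(const struct mount *tab, size_t n,
                                const char *path)
{
    assert(tab != NULL && path != NULL);
    const struct mount *best = NULL;
    size_t best_len = 0;
    for (size_t i = 0; i < n; i++) {
        size_t len = strlen(tab[i].prefix);
        if (strncmp(path, tab[i].prefix, len) != 0)
            continue;
        int boundary = len == 1 || path[len] == '/' || path[len] == '\0';
        if (boundary && len >= best_len) {
            best = &tab[i];
            best_len = len;
        }
    }
    return best;
}

/* The "syscall" only sees the path; the VFS dispatches to a driver. */
int vfs_read_byte(const struct mount *tab, size_t n, const char *path)
{
    const struct mount *m = vfs_resolve(tab, n, path);
    return m ? m->ops->read_byte(path) : -1;
}
```

The caller never names a driver: the same `vfs_read_byte` call reaches the fake Ext4 or FAT code depending only on where the path lives.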
Virtual File System (VFS)
VFS Data Structures
Overview
● Storage: HDD, SSD, network, RAM, … (Chapter 5 (I/O), next week)
● Operating System (this/next lecture: FS implementation):
○ Buffer cache
○ File system drivers: FAT driver, NTFS driver, Ext4 driver, ...
○ Virtual File System, page cache
● Syscall
● User program (last lecture)

File System
Implementation
● How to store files?
● How to implement directories?
● How to manage disk space?
● How to ensure file system reliability?
File System Layout

● MBR: contains boot code to locate the active
primary partition and execute its boot block
● Partitions: primary, extended, subpartitions
● MBR nowadays superseded by GPT
File System Layout
● How to store files on the disk?
○ Contiguous allocation
■ Store files as a contiguous stream of bytes
■ Prone to fragmentation, requires max file size
■ Never used in practice?

○ Block-based strategies:
■ Linked List
■ File Allocation Table
■ i-nodes
File Storage:
Linked List

● Add pointer to next block at start of each block


● Sequential accesses OK, random accesses slow
● Size of stored data no longer power-of-two: inefficient
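Both drawbacks show up in a small in-memory model. In the sketch below (block size, pointer width and all names are assumptions), the pointer takes 4 of the 512 bytes, leaving a 508-byte payload, and reaching a given offset means visiting every block before it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE  512u
#define END_OF_FILE 0xFFFFFFFFu
/* The first 4 bytes of each block hold the next block number, so only
 * 508 bytes of payload remain: no longer a power of two. */
#define PAYLOAD     (BLOCK_SIZE - sizeof(uint32_t))

struct block {
    uint32_t next;              /* number of the next block in the file */
    uint8_t  data[PAYLOAD];
};

/* Return the byte at file offset off, or -1 past the end of the file.
 * *blocks_read counts how many blocks had to be visited. */
int ll_read_byte(const struct block *disk, uint32_t first, uint32_t off,
                 unsigned *blocks_read)
{
    assert(disk != NULL && blocks_read != NULL);
    uint32_t cur = first;
    *blocks_read = 0;
    while (cur != END_OF_FILE) {
        (*blocks_read)++;       /* on a real disk: one read per step */
        if (off < PAYLOAD)
            return disk[cur].data[off];
        off -= PAYLOAD;
        cur = disk[cur].next;
    }
    return -1;
}
```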
File Storage:
File Allocation Table
● Store linked-list pointer in separate table (the FAT)
● Efficient random accesses, blocks contain power-of-two
stored bytes
● Does not scale: requires pre-allocated table for every
block on the disk
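With the pointers moved into a table, finding the n-th block of a file still means following a chain, but the chain lives in memory and no data block is touched. A toy sketch, assuming 16-bit entries and a made-up end-of-chain marker `FAT_EOC`:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FAT_EOC 0xFFFFu   /* toy end-of-chain marker */

/* fat[c] holds the number of the block that follows block c.
 * Returns the disk block holding the n-th block of the file (n = 0 is
 * the first), or FAT_EOC if the file is shorter than that. */
uint16_t fat_nth_block(const uint16_t *fat, size_t fat_len,
                       uint16_t first, size_t n)
{
    assert(fat != NULL);
    uint16_t cur = first;
    while (n > 0 && cur != FAT_EOC) {
        assert(cur < fat_len);
        cur = fat[cur];        /* table lookup, no data block read */
        n--;
    }
    return cur;
}
```

The table needs one entry per block on the disk whether or not the block is in use, which is the scaling problem noted above.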
File Storage:
i-nodes
● Store list of data block addresses per file in an i-node
● Better scalability: only require metadata for files/blocks
that are actually used
● For larger files, one i-node can point to additional blocks
containing more data block addresses
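The mapping from "n-th block of the file" to a disk address can be sketched as follows. The layout is a simplification chosen for this example (10 direct addresses, one single-indirect block, 1 KB blocks with 4-byte addresses); real i-nodes differ per file system.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define N_DIRECT       10u
#define PTRS_PER_BLOCK 256u   /* assumes 1 KB blocks, 4-byte addresses */
#define NO_BLOCK       0u     /* toy convention: 0 means "no block" */

struct inode {
    uint32_t size;                /* file size in bytes */
    uint32_t direct[N_DIRECT];    /* addresses of the first data blocks */
    uint32_t indirect;            /* address of a block full of addresses */
};

/* Map the n-th block of a file to a disk block address.
 * `indirect` stands for the already-read contents of block
 * ino->indirect (PTRS_PER_BLOCK addresses), or NULL if there is none. */
uint32_t inode_block(const struct inode *ino, const uint32_t *indirect,
                     uint32_t n)
{
    assert(ino != NULL);
    if (n < N_DIRECT)
        return ino->direct[n];    /* small files: no extra disk read */
    n -= N_DIRECT;
    if (indirect != NULL && n < PTRS_PER_BLOCK)
        return indirect[n];       /* large files: one extra block */
    return NO_BLOCK;              /* beyond what this toy i-node maps */
}
```

Under these assumptions one i-node maps at most 10 + 256 blocks; double and triple indirect blocks extend the same idea.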

Simple i-node example Complex i-node for large file


File System
Implementation
● How to store files?
● How to implement directories?
● How to manage disk space?
● How to ensure file system reliability?
Directory
Implementation
● How to store directory information?
● Are attributes stored in directory entries?
● How to find the root directory?
● How to find any other directory?

Storing data in-line Using i-nodes


FAT12/FAT16 Layout
Boot Sector | Reserved | FAT #1 | FAT #2 | Root Directory | Clusters

● Reserved later used for FS information


● FAT #1/#2: preallocated File Allocation Tables
● Preallocated root directory
● Clusters represent addressable blocks:
○ FAT contains chains indexed by starting cluster
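In FAT12 each table entry is 12 bits wide, so two entries are packed into three bytes. The sketch below decodes entry n from the raw table bytes; the function name is hypothetical, and reading the table from disk is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode entry n from a packed FAT12 table: two 12-bit entries share
 * three bytes, stored little-endian. */
uint16_t fat12_entry(const uint8_t *fat, size_t fat_bytes, uint16_t n)
{
    assert(fat != NULL);
    size_t off = (size_t)n + n / 2u;          /* n * 1.5, rounded down */
    assert(off + 1 < fat_bytes);
    uint16_t pair = (uint16_t)(fat[off] | (fat[off + 1] << 8));
    /* Odd entries use the upper 12 bits, even entries the lower 12. */
    return (n & 1u) ? (uint16_t)(pair >> 4) : (uint16_t)(pair & 0x0FFFu);
}
```

Following a file's chain then amounts to calling this repeatedly, starting from the first cluster stored in the directory entry.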
FAT12/16 Directory Entry
● Filename in 8.3 format
○ Later on, Long File Name (LFN) entries used
multiple entries for a single file, setting several
reserved bits for backwards compatibility
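On disk, an 8.3 name occupies 11 bytes: 8 for the base name and 3 for the extension, upper-cased and padded with spaces, with the dot not stored. A simplified conversion sketch; the function name is made up, and it ignores forbidden characters and LFN entries:

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Convert "foo.txt" to the 11-byte on-disk form "FOO     TXT".
 * Returns 0 on success, -1 if the name does not fit in 8.3.
 * Simplified: does not reject forbidden characters. */
int to_83(const char *name, char out[11])
{
    assert(name != NULL && out != NULL);
    const char *dot = strrchr(name, '.');
    size_t base_len = dot ? (size_t)(dot - name) : strlen(name);
    size_t ext_len = dot ? strlen(dot + 1) : 0;
    if (base_len == 0 || base_len > 8 || ext_len > 3)
        return -1;

    memset(out, ' ', 11);                     /* pad with spaces */
    for (size_t i = 0; i < base_len; i++)
        out[i] = (char)toupper((unsigned char)name[i]);
    for (size_t i = 0; i < ext_len; i++)
        out[8 + i] = (char)toupper((unsigned char)dot[1 + i]);
    return 0;
}
```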
FAT Directory Entry
UNIX Directory Entry

● Directory entry stores only name and i-node number


● Attributes are stored in the i-node
● One i-node per file, first i-node is the root
● Where are i-nodes stored?
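Because a directory entry is just a (name, i-node number) pair, resolving a path means repeating one step: search the current directory for the next component and continue at the i-node it names. The sketch below models this in memory; the structures, the root i-node number and the fixed entry limit are all assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ROOT_INO    1u    /* toy choice for "first i-node is the root" */
#define MAX_ENTRIES 4

struct toy_dirent { const char *name; uint32_t ino; };

struct toy_inode {
    int is_dir;                                /* attributes live here */
    struct toy_dirent entries[MAX_ENTRIES];    /* used only if is_dir */
};

/* Resolve an absolute path to an i-node number; 0 means "not found". */
uint32_t lookup(const struct toy_inode *itable, const char *path)
{
    assert(itable != NULL && path != NULL);
    uint32_t cur = ROOT_INO;
    while (*path != '\0') {
        while (*path == '/')
            path++;                            /* skip separators */
        size_t len = strcspn(path, "/");       /* next component */
        if (len == 0)
            break;
        if (!itable[cur].is_dir)
            return 0;                          /* cannot descend into a file */
        uint32_t next = 0;
        for (int i = 0; i < MAX_ENTRIES; i++) {
            const char *nm = itable[cur].entries[i].name;
            if (nm != NULL && strlen(nm) == len &&
                strncmp(nm, path, len) == 0) {
                next = itable[cur].entries[i].ino;
                break;
            }
        }
        if (next == 0)
            return 0;
        cur = next;
        path += len;
    }
    return cur;
}
```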
UNIX FS Layout
Directory Lookup
File System
Implementation
● How to store files?
● How to implement directories?
● How to manage disk space?
● How to ensure file system reliability?
Disk Space:
Block Size Impact
● How to choose the block size?
○ Trade-off between data rate and space overhead
(fragmentation)

(assumption: average file size is 2KB)
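The space side of the trade-off is simple arithmetic: a file occupies a whole number of blocks, so the tail of its last block is wasted. A short sketch (function names are illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Number of blocks a file occupies: file_size / block_size, rounded up. */
uint64_t blocks_needed(uint64_t file_size, uint64_t block_size)
{
    assert(block_size > 0);
    return (file_size + block_size - 1) / block_size;
}

/* Bytes allocated but unused in the file's last block. */
uint64_t wasted_bytes(uint64_t file_size, uint64_t block_size)
{
    return blocks_needed(file_size, block_size) * block_size - file_size;
}
```

For the assumed 2 KB average file, 1 KB blocks waste nothing, 4 KB blocks waste half of the allocated space, and 64 KB blocks waste 63488 of 65536 bytes.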


Disk Space:
Tracking free blocks
● Option 1: Linked list of all free blocks
○ Stored in free blocks, less metadata the fuller the
disk gets (except for fragmentation)
● Option 2: Static bitmap with 1 bit per block
○ Size is fixed, even if disk is full
Creating a file system
● Format partition
● Create basic, FS-specific
layout
● Unix: mkfs
● E.g., mkfs.ext4,
mkfs.fat
File System
Implementation
● How to store files?
● How to implement directories?
● How to manage disk space?
● How to ensure file system reliability?
File Systems:
Reliability Threats
● Disk failures:
○ Bad blocks
○ Whole-disk errors
● Power failures:
○ (Meta)-data inconsistently written to disk
● Software bugs:
○ Bad (meta)-data written to disk
● User errors:
○ rm *.o vs. rm * .o
○ dd if=/dev/zero of=zeros bs=1M # fill quota
Reliability solution:
Backups
Backups:
Technical aspects
● Incremental vs. full
● Online vs. offline
● Physical vs. logical
● Compressed vs. uncompressed
● Local vs. remote
Image source: Wikipedia CC BY-SA 3.0

● RAID 0: Expand file system over multiple disks
(increase storage)
● RAID 1: Duplicate file system over disks
(increase redundancy)
● RAID 3-6: Add parity, on a dedicated disk (RAID 3/4) or
spread over all disks (RAID 5/6) (increase both)

RAID: Redundant Array of Independent Disks


RAID 0

RAID 1
RAID 5

● RAID 0 & 2: Expand file system over multiple
disks (increase storage)
● RAID 1: Duplicate file system over disks
(increase redundancy)
● RAID 3-6: Add parity, on a dedicated disk (RAID 3/4) or
spread over all disks (RAID 5/6) (increase both)

RAID: Redundant Array of Independent Disks
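Single-parity RAID levels rely on XOR: the parity block is the XOR of the data blocks in a stripe, so any one missing block equals the XOR of the parity with the surviving blocks. A sketch over in-memory buffers (function names are illustrative, and striping across real disks is left out):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* parity[i] = blocks[0][i] ^ blocks[1][i] ^ ... ^ blocks[n-1][i] */
void raid_parity(const uint8_t *const *blocks, size_t n, size_t len,
                 uint8_t *parity)
{
    assert(blocks != NULL && parity != NULL);
    memset(parity, 0, len);
    for (size_t d = 0; d < n; d++)
        for (size_t i = 0; i < len; i++)
            parity[i] ^= blocks[d][i];
}

/* Rebuild the block of the failed disk `lost` by XOR-ing the parity
 * with the blocks of all surviving disks. blocks[lost] is not read. */
void raid_rebuild(const uint8_t *const *blocks, size_t n, size_t lost,
                  const uint8_t *parity, size_t len, uint8_t *out)
{
    assert(blocks != NULL && parity != NULL && out != NULL && lost < n);
    memcpy(out, parity, len);
    for (size_t d = 0; d < n; d++) {
        if (d == lost)
            continue;
        for (size_t i = 0; i < len; i++)
            out[i] ^= blocks[d][i];
    }
}
```

One parity block protects against the loss of exactly one disk; losing two at once cannot be repaired this way.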


File System
Integrity check & repair
● Programs like fsck and chkdsk try to find (and
fix) consistency errors in file system metadata
○ Corrupt values
○ Used blocks also marked as free
○ Blocks marked as neither free nor used
○ Blocks being used multiple times
○ ...
(a) Consistent (b) Missing block
(c) Duplicate free block (d) Duplicate block

fsck: File System Consistency Check
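The block check can be pictured as two counters per block: how often the block appears in a file, and how often it appears in the free list. Exactly one of them should be 1. A sketch of the classification, with made-up enum and function names:

```c
#include <assert.h>
#include <stddef.h>

enum block_state {
    BLK_OK,             /* (a) in one file, or once in the free list */
    BLK_MISSING,        /* (b) neither used nor free */
    BLK_DUP_FREE,       /* (c) in the free list more than once */
    BLK_DUP_USED,       /* (d) used by more than one file */
    BLK_USED_AND_FREE   /* in a file and in the free list */
};

enum block_state classify_block(unsigned in_files, unsigned in_free_list)
{
    if (in_files > 0 && in_free_list > 0)
        return BLK_USED_AND_FREE;
    if (in_files > 1)
        return BLK_DUP_USED;
    if (in_free_list > 1)
        return BLK_DUP_FREE;
    if (in_files == 0 && in_free_list == 0)
        return BLK_MISSING;
    return BLK_OK;
}

/* Count the blocks whose two counters are not consistent. */
size_t count_problems(const unsigned *in_files,
                      const unsigned *in_free_list, size_t n)
{
    assert(in_files != NULL && in_free_list != NULL);
    size_t problems = 0;
    for (size_t b = 0; b < n; b++)
        if (classify_block(in_files[b], in_free_list[b]) != BLK_OK)
            problems++;
    return problems;
}
```

A missing block is harmless and can simply be added to the free list; a block used by two files is the dangerous case.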


File System Performance
● Minimize disk access
● Minimize seek time
File System Performance
● Minimize disk access:
○ Caching: buffer cache, inode cache, direntry cache
○ Block read ahead
● Minimize seek time:
○ Try to allocate files contiguously
○ Spread i-nodes over disk
○ Store small file data “inline” in i-node
○ Defragment disk
Minimize disk accesses:
Caches
● Buffer cache: Cache disk blocks in RAM
● Page cache: Cache VFS pages (before
going to driver)
● Often contains the same data as buffer cache, so
OSs often merge these
Cache management
● Caches have LRU semantics adapted to:
○ Blocks with poor temporal locality
○ Critical blocks for file system consistency
● Write-through caching vs. periodic syncing
● Disk read-ahead: read blocks into cache that
may be used soon
File System Performance
● Minimize seek time (on traditional HDDs):
○ Try to allocate files contiguously
○ Defragment disk
○ Spread i-nodes over disk
○ Store small file data “inline” in i-node
● Not so much an issue on SSDs
○ Instead, optimize
wear reduction
Log-structured
File Systems
● Idea: Optimize for frequent small writes
● Collect any pending write in a log segment
● Segments contain i-nodes, entries, blocks
● Segments are regularly flushed to disk and
can be large (e.g., 1MB)
● i-node map to find i-nodes in the log
● Garbage collection to reclaim old log entries
i-nodes Data blocks segment segment

UNIX FS Log-Structured File System


Journaling File Systems
● Idea: Use “logs” for crash recovery
● First write transactional operations in log.
● E.g., removing a file:
a. Remove file from its directory.
b. Release i-node to the pool of free i-nodes.
c. Return all disk blocks to pool of free disk blocks.
● After crash, replay operations from log
● Single operations need to be idempotent
● Should support multiple, arbitrary crashes
● Journaling widely used in modern systems,
e.g., Ext4 and NTFS
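Idempotence is what makes replay safe after repeated crashes. In the toy model below (all structures and names are assumptions), each logged operation sets state to an absolute value ("mark block 5 free") rather than a relative one ("increase the free count"), so applying it twice has the same effect as applying it once.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum op_kind { OP_REMOVE_DIRENT, OP_FREE_INODE, OP_FREE_BLOCK };

struct log_op { enum op_kind kind; unsigned target; };

struct toy_fs {
    bool dirent_present[8];
    bool inode_used[8];
    bool block_used[8];
};

/* Apply one logged operation; repeating it changes nothing further. */
void apply_op(struct toy_fs *fs, struct log_op op)
{
    assert(fs != NULL && op.target < 8);
    switch (op.kind) {
    case OP_REMOVE_DIRENT: fs->dirent_present[op.target] = false; break;
    case OP_FREE_INODE:    fs->inode_used[op.target] = false;     break;
    case OP_FREE_BLOCK:    fs->block_used[op.target] = false;     break;
    }
}

/* Recovery: replay the first n entries of the log in order. */
void replay(struct toy_fs *fs, const struct log_op *log, size_t n)
{
    assert(log != NULL);
    for (size_t i = 0; i < n; i++)
        apply_op(fs, log[i]);
}
```

A crash halfway through the removal, followed by one or more full replays, ends in the same state as an uninterrupted run.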
Textbook (Chapter 4)
● All sections except:
○ Section 4.5.3
FUSE:
Userspace drivers
Extra material for Asg3 (fs)
FUSE
● Filesystem in Userspace:
Allow FS drivers implemented in userspace
instead of inside kernel
● Layers, from storage to userspace:
○ Storage
○ OS: buffer cache; FAT driver, NTFS driver, Ext4 driver, FUSE; VFS, page cache
○ Syscall: read, readdir, ...
○ Userspace: user program (libc); FUSE driver (libfuse)
FUSE
● The driver is just a normal userspace
process
● libfuse talks with the kernel module
● Layers, from storage to userspace:
○ Storage
○ OS: buffer cache; FAT driver, NTFS driver, Ext4 driver, FUSE; VFS, page cache
○ Syscall: normal syscalls (read, readdir, ...) and FUSE syscalls (read/write /dev/fuse)
○ Userspace: user program (libc); FUSE driver (libfuse, libc)
FUSE
● Makes driver development much easier
● No permissions required for new FS drivers
● But, slower than in-kernel drivers

● Examples using FUSE:


● sshfs: Mount remote machine over ssh
● ntfs-3g: NTFS on Linux
● exfat: exFAT on Linux (in-kernel driver available only since Nov 2019)

● WikipediaFS: “View and edit Wikipedia as files” (??)


● EmojiFS: “Manipulate your emojis on Slack/Discord” (????)
● πfs: “Store files (as offsets) in π” (yes, really... https://github.com/philipl/pifs )
Using FUSE
● Compile and link a program against libfuse
● Run binary to mount into VFS
● Callback driven:
● Kernel/FUSE calls you when someone interacts with
your files/directories

Source: https://github.com/libfuse/libfuse/blob/master/example/hello.c
FUSE callbacks
● Most important callback: getattr
● Called before any operation on a file
● Retrieves file(/directory) attributes
via struct stat - see man 2 stat
● Directory example: stbuf->st_mode = S_IFDIR | 0755;
● File example: stbuf->st_mode = S_IFREG | 0755;
● Non-existing entry: return -ENOENT
FUSE callbacks
● Familiar from syscalls:
● readdir: List directory contents
● read: Read file contents
● mkdir: Create directory
● rmdir: Remove directory
● unlink: Remove file
● create: Create file
● truncate: Resize file
● write: Write data to file
● rename: Rename/move file
