Architectural and Design Issues in the General Parallel File System

May 12, 2002

IBM Research Lab in Haifa

Benny Mandler


Outline
- What is GPFS? A file system for deep computing
- GPFS uses
- General architecture
- How does GPFS meet its challenges? Architectural issues:
  - performance
  - scalability
  - high availability
  - concurrency control

What is GPFS? A file system for deep computing

Scalable parallel computing enables I/O-intensive applications:
- Deep computing: simulation, seismic analysis, data mining, databases
- Streaming video and audio for multimedia presentations
- Scalable object store for large digital libraries
- Server consolidation: aggregating file and web servers onto a centrally managed machine

RS/6000 SP Scalable Parallel Computer:
- 1-512 nodes connected by a high-speed switch
- 1-16 CPUs per node (Power2 or PowerPC)
- >1 TB disk per node
- 500 MB/s full duplex per switch port

GPFS addresses SP I/O requirements

High performance:
- multiple GB/s to/from a single file
- concurrent reads and writes, within a file and across files
- fully parallel access to both file data and metadata
- client caching enabled by distributed locking
- wide striping, large data blocks, prefetch

Scalability:
- scales up to 512 nodes (N-way SMP)
- file system nodes, storage nodes, adapters

High availability:
- fault tolerance via logging, replication, and RAID support
- survives node and disk failures

High capacity:
- single-image file system; uniform access via shared disks
- multiple TB per file system, 100s of GB per file
- standards compliant (X/Open 4.0 "POSIX") with minor exceptions
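
Wide striping is the mechanism behind the "multiple GB/s to/from a single file" claim: successive file blocks are spread round-robin over all disks, so a sequential scan keeps every disk busy. A minimal sketch (not GPFS source; the mapping function and names are illustrative assumptions):

```python
# Illustrative sketch of wide striping: block i of a file lands on
# disk (i mod N), so large sequential I/O is spread over all N disks.

BLOCK_SIZE = 256 * 1024  # GPFS default block size (256 KB)

def block_location(file_offset: int, num_disks: int):
    """Map a byte offset to (disk index, block number on that disk)."""
    block_no = file_offset // BLOCK_SIZE
    return block_no % num_disks, block_no // num_disks

# Successive 256 KB blocks land on successive disks:
disks = [block_location(i * BLOCK_SIZE, 4)[0] for i in range(6)]
assert disks == [0, 1, 2, 3, 0, 1]
```

With this placement, a read of six consecutive blocks touches all four disks, which is why aggregate throughput grows with the number of disks rather than being capped by one spindle.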

GPFS vs. local and distributed file systems on the SP

Native AIX file system (JFS):
- no file sharing: an application can only access files on its own node
- applications must do their own data partitioning

DCE Distributed File System (follow-up of AFS):
- application nodes (DCE clients) share files on a server node
- the switch is used as a fast LAN
- coarse-grained (file- or segment-level) parallelism
- the server node is a performance and capacity bottleneck

GPFS parallel file system:
- GPFS file systems are striped across multiple disks on multiple storage nodes
- independent GPFS instances run on each application node
- GPFS instances use storage nodes as "block servers": all instances can access all disks

GPFS uses: Tokyo Video-on-Demand trial
- video on demand for a new "borough" of Tokyo
- applications: movies, news, karaoke, education
- video distribution via hybrid fiber/coax
- trial "live" since June '96; currently 500 subscribers
- 6 Mbit/sec MPEG video streams
- 100 simultaneous viewers (75 MB/sec)
- 200 hours of video on line (700 GB)
- 12-node SP-2 (7 distribution, 5 storage)

GPFS uses: engineering design
- major aircraft manufacturer
- using GPFS to store CATIA designs and structural modeling data
- GPFS allows all nodes to share designs and models
- using CATIA for large designs, Elfini for structural modeling and analysis
- SP used for modeling/analysis

General architecture: shared disks

Virtual Shared Disk architecture:
- file systems consist of one or more shared disks
- an individual disk can contain data, metadata, or both
- disks are assigned to failure groups
- data and metadata are striped to balance load and maximize parallelism

Recoverable Virtual Shared Disk (VSD) for accessing disk storage:
- disks are physically attached to SP nodes
- VSD allows clients to access disks over the SP switch
- the VSD client looks like a disk device driver on the client node
- the VSD server executes I/O requests on the storage node
- VSD supports JBOD or RAID volumes, fencing, and multipathing (where the physical hardware permits)
- GPFS only assumes a conventional block I/O interface
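
The last bullet is the key design point: GPFS never talks to a file server, only to something that behaves like a block device. A toy sketch of that narrow interface (the class and method names are illustrative assumptions, not actual VSD APIs):

```python
# Hypothetical stand-in for a VSD-backed shared disk: the only interface
# GPFS assumes is "read sector N" / "write sector N". The VSD layer
# provides exactly this over the SP switch.

SECTOR_SIZE = 512

class SharedDisk:
    """In-memory model of a shared disk reachable by every node."""
    def __init__(self, num_sectors: int):
        self._sectors = [bytes(SECTOR_SIZE)] * num_sectors

    def read_sector(self, sector_no: int) -> bytes:
        return self._sectors[sector_no]

    def write_sector(self, sector_no: int, data: bytes) -> None:
        assert len(data) == SECTOR_SIZE
        self._sectors[sector_no] = data

# Every GPFS instance sees every disk; there is no file-level server.
disk = SharedDisk(num_sectors=1024)
disk.write_sector(7, b"x" * SECTOR_SIZE)
assert disk.read_sector(7) == b"x" * SECTOR_SIZE
```

Because the interface is this small, any storage that can serve fixed-size blocks (JBOD, RAID, dual-attached disks) can sit underneath GPFS unchanged.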

General architecture: GPFS architecture overview

Implications of the shared-disk model:
- all data and metadata reside on globally accessible disks (VSD)
- all access to permanent data goes through the disk I/O interface
- distributed protocols (e.g. distributed locking) coordinate disk access from multiple nodes
- fine-grained locking allows parallel access by multiple clients
- logging and shadowing restore consistency after node failures

Implications of large scale:
- supports up to 4096 disks of up to 1 TB each (4 petabytes); the largest system in production is 75 TB
- failure detection and recovery protocols handle node failures
- replication and/or RAID protect against disk / storage node failure
- on-line dynamic reconfiguration (add, delete, replace disks and nodes; rebalance file system)

GPFS architecture: node roles

Three types of nodes: file system, storage, and manager. Each node can perform any of these functions.

File system nodes:
- run user programs; read/write data to/from storage nodes
- implement the virtual file system interface
- cooperate with manager nodes to perform metadata operations

Manager nodes (one per file system):
- global lock manager, recovery manager, global allocation manager, quota manager, file metadata manager
- admin services; fail-over

Storage nodes:
- implement the block I/O interface
- shared access from file system and manager nodes
- interact with manager nodes for recovery (e.g. fencing)
- file data and metadata are striped across multiple disks on multiple storage nodes

General architecture: GPFS software structure

[Figure: GPFS software structure]

Disk data structures: files
- large block size allows efficient use of disk bandwidth
- fragments reduce space overhead for small files
- multi-level indirect blocks:
  - each disk address is a list of pointers to replicas
  - each pointer is a disk id + sector number
- no designated "mirror", no fixed placement function:
  - flexible replication (e.g. replicate only metadata, or only important files)
  - dynamic reconfiguration: data can migrate block by block
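
The addressing scheme above can be sketched as a small data structure: a logical block address is simply a list of (disk id, sector number) pointers, one per replica, with no fixed mirror pairing. This is an illustrative model only; field and function names are assumptions:

```python
# Sketch of per-block replica addressing: each block independently
# chooses where its replicas live, which is what allows block-by-block
# migration and selective replication.

from dataclasses import dataclass

@dataclass
class DiskPointer:
    disk_id: int
    sector_no: int

def addresses_for_block(replication: int, candidates):
    """Pick the first `replication` candidate (disk, sector) pairs."""
    return [DiskPointer(d, s) for d, s in candidates[:replication]]

# A doubly replicated block: two pointers, on unrelated disks.
addr = addresses_for_block(2, [(3, 120), (7, 45), (1, 99)])
assert [p.disk_id for p in addr] == [3, 7]
```

Because the replica list lives in the address itself, dropping to single-copy metadata or moving one replica to a new disk changes only that block's pointer list, not a global placement function.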

Performance: large file block size
- conventional file systems store data in small blocks to pack data more densely
- GPFS uses large blocks (256 KB default) to optimize disk transfer speed

[Figure: throughput (MB/sec, 0-7) vs. I/O transfer size (0-1024 KB); throughput rises steeply with transfer size]
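
The shape of that curve follows from a simple back-of-the-envelope disk model (illustrative numbers, not measured GPFS data): each I/O pays a fixed positioning cost, so effective throughput approaches the media rate only as the block grows.

```python
# Toy model: throughput for block size B with positioning overhead
# t_pos and media transfer rate r is  B / (t_pos + B / r).
# The parameter values below are assumed for illustration.

def throughput_mb_s(block_kb: float,
                    t_pos_ms: float = 10.0,   # seek + rotation
                    rate_mb_s: float = 8.0) -> float:
    block_mb = block_kb / 1024.0
    seconds = t_pos_ms / 1000.0 + block_mb / rate_mb_s
    return block_mb / seconds

# Larger blocks amortize the positioning cost:
assert throughput_mb_s(4) < throughput_mb_s(64) < throughput_mb_s(256)
```

At 4 KB blocks almost all the time goes to positioning; at 256 KB the transfer itself dominates, which is the rationale for GPFS's large default block size.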

Performance: parallelism and consistency

GPFS uses different coordination techniques for different kinds of state:
- distributed locking: acquire the appropriate lock for every operation; used for updates to user data
- centralized management: conflicting operations are forwarded to a designated node; used for file metadata
- distributed locking + centralized hints: used for space allocation
- central coordinator: used for configuration changes
- trade-off: accepting additional I/O activity and I/O slowdown effects rather than overloading the token server

Performance: parallel file access from multiple nodes
- GPFS allows parallel applications on multiple nodes to access non-overlapping ranges of a single file with no conflict
- global locking serializes access to overlapping ranges of a file
- global locking is based on "tokens", which convey access rights to an object (e.g. a file) or a subset of an object (e.g. a byte range)
- tokens can be held across file system operations, enabling coherent data caching in clients
- cached data is discarded or written to disk when the token is revoked
- performance optimizations: required/desired ranges, metanode, data shipping, special token modes for file-size operations
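
The byte-range rule above reduces to a small compatibility check: two tokens conflict only if their ranges overlap and at least one holder writes. A minimal sketch (modes and field names are illustrative assumptions, not the actual GPFS token protocol):

```python
# Byte-range token compatibility: non-overlapping writers, or any
# number of readers, can proceed in parallel; an overlapping
# reader/writer or writer/writer pair forces a token revoke.

from dataclasses import dataclass

@dataclass
class RangeToken:
    node: int
    start: int
    end: int        # exclusive
    mode: str       # "read" or "write"

def conflicts(a: RangeToken, b: RangeToken) -> bool:
    overlapping = a.start < b.end and b.start < a.end
    return overlapping and ("write" in (a.mode, b.mode))

# Two nodes writing disjoint megabytes of one file: no conflict.
w1 = RangeToken(node=1, start=0, end=1 << 20, mode="write")
w2 = RangeToken(node=2, start=1 << 20, end=2 << 20, mode="write")
assert not conflicts(w1, w2)
# A reader overlapping a writer does conflict:
assert conflicts(w1, RangeToken(node=3, start=0, end=4096, mode="read"))
```

Holding such a token across operations is what makes client-side caching coherent: as long as no conflicting token is granted elsewhere, the cached range cannot be stale.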

Performance: deep prefetch for high throughput
- GPFS stripes successive blocks across successive disks
- disk I/O for sequential reads and writes is done in parallel
- GPFS measures application "think time", disk throughput, and cache state to automatically determine the optimal degree of parallelism
- prefetch algorithms now recognize strided and reverse-sequential access
- accepts hints; write-behind policy
- example: the application reads at 15 MB/sec while each disk reads at 5 MB/sec, so three I/Os are executed in parallel
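
The 15 MB/sec vs. 5 MB/sec example follows from a simple rule: keep enough prefetch I/Os in flight for the disks to collectively match the application's consumption rate. A one-line sketch of that calculation (illustrative, not the actual GPFS heuristic, which also weighs think time and cache state):

```python
# Prefetch depth needed so that `depth` disks at disk_rate keep up
# with an application consuming at app_rate.

import math

def prefetch_depth(app_rate_mb_s: float, disk_rate_mb_s: float) -> int:
    return math.ceil(app_rate_mb_s / disk_rate_mb_s)

# The slide's example: 15 MB/s consumer, 5 MB/s disks -> 3 parallel I/Os.
assert prefetch_depth(15, 5) == 3
```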

Scalability: GPFS throughput scaling for non-cached files
- hardware: Power2 wide nodes, SSA disks
- experiment: sequential read/write from a large number of GPFS nodes to a varying number of storage nodes
- result: throughput increases nearly linearly with the number of storage nodes
- bottlenecks: the Micro Channel bus limits node throughput to 50 MB/s; system throughput is limited by the available storage nodes

Scalability: disk data structures: allocation map

Segmented block allocation map:
- each segment contains bits representing blocks on all disks
- each segment is a separately lockable unit
- minimizes contention for the allocation map when writing files on multiple nodes
- the allocation manager service provides hints about which segments to try
- similar: the inode allocation map
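
The point of segmenting can be shown in a few lines: because each segment covers blocks on every disk and is locked independently, two nodes steered to different segments allocate (with full striping available to each) without ever contending. This is an illustrative model; the structures and hint mechanism are simplified assumptions:

```python
# Toy segmented allocation map: each segment is a separately lockable
# set of free blocks. The allocation manager's "hint" steers different
# nodes toward different segments.

class AllocationMap:
    def __init__(self, num_segments: int, blocks_per_segment: int):
        self.segments = [
            {"locked_by": None, "free": set(range(blocks_per_segment))}
            for _ in range(num_segments)
        ]

    def allocate(self, node: int, hint_segment: int):
        """Try segments starting at the hinted one; return (segment, block)."""
        n = len(self.segments)
        for i in range(n):
            seg_no = (hint_segment + i) % n
            seg = self.segments[seg_no]
            if seg["locked_by"] in (None, node) and seg["free"]:
                seg["locked_by"] = node           # take the segment lock
                return seg_no, seg["free"].pop()  # allocate one block
        return None

amap = AllocationMap(num_segments=4, blocks_per_segment=8)
s1, _ = amap.allocate(node=1, hint_segment=0)
s2, _ = amap.allocate(node=2, hint_segment=1)
assert s1 != s2  # different hints -> different segments -> no contention
```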

High availability: logging and recovery

Problem: detect and fix file system inconsistencies after a failure of one or more nodes.
- all updates that may leave inconsistencies if uncompleted are logged
- write-ahead logging policy: the log record is forced to disk before the dirty metadata is written
- redo log: replaying all log records at recovery time restores file system consistency
- logged updates: I/O to replicated data; directory operations (create, delete, move, ...); allocation map changes
- other techniques: ordered writes, shadowing
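
The write-ahead/redo discipline can be sketched in miniature: force the log record to (simulated) disk first, then write the metadata; after a crash, replaying the log reapplies anything that was logged but never written. Structures and names here are illustrative, not GPFS internals:

```python
# Minimal write-ahead redo log. Invariant: a metadata update only
# reaches "disk" after its log record is durable, so redo replay
# always restores consistency.

log_on_disk = []          # durable redo log: (key, value) records
metadata_on_disk = {}     # durable metadata

def update_metadata(key, value, crash_before_write=False):
    log_on_disk.append((key, value))      # 1. force log record to disk
    if crash_before_write:
        return                            # simulated crash: write lost
    metadata_on_disk[key] = value         # 2. write the dirty metadata

def recover():
    for key, value in log_on_disk:        # redo: replay every record
        metadata_on_disk[key] = value

update_metadata("dir/a", "inode 17")
update_metadata("dir/b", "inode 42", crash_before_write=True)
assert "dir/b" not in metadata_on_disk    # inconsistent after the "crash"
recover()
assert metadata_on_disk["dir/b"] == "inode 42"
```

In GPFS the same idea lets the file system manager run log recovery on behalf of a failed node, since the log is on shared disks that surviving nodes can read.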

High availability: node failure recovery

Application node failure:
- the force-on-steal policy ensures that all changes visible to other nodes have been written to disk and will not be lost
- all potential inconsistencies are protected by a token and are logged
- the file system manager runs log recovery on behalf of the failed node
- after successful log recovery, tokens held by the failed node are released
- actions taken: restore metadata being updated by the failed node to a consistent state; release resources held by the failed node

File system manager failure:
- a new node is appointed to take over
- the new file system manager restores volatile state by querying other nodes
- the new file system manager may have to undo or finish a partially completed configuration change (e.g. add/delete disk)

Storage node failure:
- dual-attached disk: use the alternate path (VSD)
- single-attached disk: treat as a disk failure

High availability: handling disk failures

When a disk failure is detected:
- the node that detects the failure informs the file system manager
- the file system manager updates the configuration data to mark the failed disk as "down" (quorum algorithm)

While a disk is down:
- read one / write all available copies
- a "missing update" bit is set in the inode of modified files

When/if the disk recovers:
- the file system manager searches the inode file for missing-update bits
- all data and metadata of files with missing updates are copied back to the recovering disk (one file at a time, using the normal locking protocol)
- until missing-update recovery is complete, data on the recovering disk is treated as write-only

Unrecoverable disk failure:
- the failed disk is deleted from the configuration or replaced by a new one
- new replicas are created on the replacement disk or on other disks
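
The "read one / write all available" rule plus the missing-update bit fit in a short sketch: while a replica disk is down, writes go to the copies that remain, and the file is flagged so the stale replica can be rebuilt later. All structures and names are illustrative assumptions:

```python
# Toy two-way replication with missing-update tracking.

replicas = {0: {}, 1: {}}        # two replica disks: (file, block) -> data
disk_up = {0: True, 1: True}
missing_update = set()           # files flagged in their inode

def write_block(file_id, block, data):
    wrote_all = True
    for disk, store in replicas.items():
        if disk_up[disk]:
            store[(file_id, block)] = data   # write all available copies
        else:
            wrote_all = False
    if not wrote_all:
        missing_update.add(file_id)          # replica is now stale

def read_block(file_id, block):
    for disk, store in replicas.items():     # read any one available copy
        if disk_up[disk] and (file_id, block) in store:
            return store[(file_id, block)]
    raise IOError("no available replica")

disk_up[1] = False               # a replica disk goes down
write_block("f", 0, "new data")
assert read_block("f", 0) == "new data"      # reads still succeed
assert "f" in missing_update     # recovery knows what to copy back
```

Recovery then only has to scan for flagged files and copy their data back to the returning disk, instead of resynchronizing every block.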

Cache management

[Figure: cache organization - a general pool plus several per-block-size pools, each with a clock list and statistics (optimal/total size, sequential/random access)]

- pool sizes are balanced dynamically according to usage patterns
- fragmentation, both internal and external, is avoided
- unified steal policy
- periodic re-balancing: merge, re-map

Epilogue
- used on six of the ten most powerful supercomputers in the world, including the largest (ASCI White)
- installed at several hundred customer sites, on clusters ranging from a few nodes with less than a TB of disk up to 512 nodes with 140 TB of disk in 2 file systems
- IP rich: ~20 filed patents
- state of the art: TeraSort world record of 17 minutes, using a 488-node SP (432 file system and 56 storage nodes, 604e 332 MHz; 6 TB total disk space)

References:
- GPFS home page: http://www. ...
- FAST 2002: http://www. ... usenix ... .html
- Tiger Shark: ... .com/cs/
- TeraSort: ... .com/journal/rd/422/