
Unit 9:

GPFS Overview

© Copyright IBM Corporation 2013


Agenda
• Overview

• Terminology and concepts

• GPFS and Oracle

© Copyright IBM Corporation 2013


GPFS overview
• GPFS: General Parallel File System
• GPFS is a high-performance, shared-disk file system that can provide data
  access from nodes in a cluster environment.
• Developed by IBM Research for IBM SP supercomputers.
  – GPFS has been available on AIX, Linux, and Windows
  – Disks (data and metadata) shared across all nodes
  – Concurrent, parallel access of data and metadata
  – Files accessed using standard UNIX interfaces and commands
(Diagram: nodes A, B, and C accessing shared GPFS disks over a SAN)

© Copyright IBM Corporation 2013


Why choose GPFS?
• GPFS is highly available and fault tolerant
  – Data protection mechanisms include journaling, replication (like mirroring), and support for
    storage array copies (synchronous and asynchronous)
  – Heartbeat mechanism to recover from multiple disk, node, and connectivity failures
  – Recovery software mechanisms implemented in all layers
• GPFS is highly scalable (2,000+ nodes)
  – Symmetric, scalable software architecture
  – Distributed metadata management
  – Allows for incremental scaling of the system (nodes, disk space) with ease
• GPFS is a high-performance file system
  – Large, tunable block size with wide striping (across nodes and disks)
  – Parallel access to files from multiple nodes
  – Efficient deep prefetching: read ahead, write behind
  – Recognizes access patterns (adaptable mechanism)
  – Highly multithreaded daemon

© Copyright IBM Corporation 2013


Agenda
• Overview

• Terminology and concepts

• GPFS and Oracle

© Copyright IBM Corporation 2013


GPFS terminology: SAN, NSD, VSD
• SAN – Storage Area Network
  – The disk is visible as a local device from any node in the cluster,
    typically over a switched Fibre Channel network.
• VSD – Virtual Shared Disk
  – Remote access to the disk across a network: the disk is local to one or
    more nodes and remote to the others. I/O access is over the network or an
    IBM high-performance switch. VSD requires the rsct.vsd fileset.
• NSD – Network Shared Disk
  – The ability to make a raw LUN available to remote clients over TCP/IP
    (see the sketch below).
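
As an illustrative sketch only (the disk names, node names, and descriptor file are hypothetical, and the descriptor format differs between GPFS releases), an NSD is typically defined from a raw LUN like this:

  # /tmp/disks.desc: a disk descriptor file naming the LUN, its NSD server node(s),
  # its usage (data/metadata), and a failure group (exact format varies by GPFS release)
  mmcrnsd -F /tmp/disks.desc     # define the raw LUN as an NSD, served to remote nodes over TCP/IP
  mmlsnsd                        # list the NSDs now known to the cluster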

© Copyright IBM Corporation 2013


GPFS terminology: Disks and file systems
• Any given disk can belong to only one file system
• One file system can have many disks
• When a file system is created on more than one disk, the file system is
  striped across the disks using the block size specified when the file
  system is created. Possible block sizes: 16K, 64K, 256K, 512K, and 1024K
  (256K is the default)
• Many operations on GPFS can be done dynamically (see the sketch below), such as:
  – Adding/deleting disks
  – Restriping (rebalancing)
  – Increasing the number of inodes
  – Adding/removing nodes
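
A minimal sketch of these dynamic operations, assuming a file system named fs_data and hypothetical NSD/node names (options vary slightly between GPFS releases):

  mmadddisk fs_data -F /tmp/newdisk.desc   # add the NSDs described in the file to the file system
  mmdeldisk fs_data nsd_hdisk5             # delete a disk; its data migrates to the remaining disks
  mmrestripefs fs_data -b                  # rebalance (restripe) existing data across all disks
  mmaddnode -N nodeD                       # add a node to the cluster
  mmdelnode -N nodeD                       # remove a node from the cluster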

© Copyright IBM Corporation 2013


GPFS terminology: Replication
• Replication is the duplication of data and/or metadata (usually both) on
  GPFS disks for failover support (see the sketch below)
• Requires 2x the storage
• This is GPFS synchronous “mirroring”; GPFS cannot mirror at the logical
  volume level
• Can be used with extended-distance RAC clusters
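
As an illustration (the file system name, mount point, and descriptor file are hypothetical, and mmcrfs syntax varies slightly by release), replication is usually enabled when the file system is created by setting the default and maximum replica counts:

  mmcrfs /gpfs/data fs_data -F /tmp/disks.desc -m 2 -M 2 -r 2 -R 2
  #  -m/-M: default/maximum metadata replicas    -r/-R: default/maximum data replicas
  #  the disks must span at least two failure groups for the replicas to protect against failures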

© Copyright IBM Corporation 2013


GPFS terminology: Cluster data configuration file
• A primary cluster data server must be defined to act as the primary holder
  of the GPFS cluster configuration information file
  /var/mmfs/gen/mmsdrfs
• A secondary GPFS cluster data server is highly recommended
• If you don’t have a secondary cluster data server, then when the node
  housing the primary data server fails, no changes can be made to the
  cluster configuration.
• The cluster data servers are specified when the cluster is formed (a fuller
  sketch follows)
  #mmcrcluster -p node1_priv -s node2_priv…
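
A slightly fuller sketch of forming the cluster (the node file, node names, and cluster name are hypothetical; check the mmcrcluster options for your release):

  mmcrcluster -N /tmp/nodes.lst -p node1_priv -s node2_priv \
              -r /usr/bin/ssh -R /usr/bin/scp -C ora_cluster
  #  -N: node descriptor file   -p/-s: primary/secondary cluster data servers
  #  -r/-R: remote shell and remote file copy commands   -C: cluster name
  mmlscluster                   # verify the cluster definition and data servers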

© Copyright IBM Corporation 2013


GPFS terminology: Configuration manager (CfgMgr)
• The configuration manager monitors for failures of components
  (hardware and software: network, adapters, disks, nodes, …)
  – Drives recovery from node failure within the cluster
  – Selects the file system manager, avoiding data corruption
• Disk leasing: each node requests the CfgMgr to renew its lease (see the
  sketch below)
  – leaseDuration: time a disk lease is granted by the CfgMgr to any node
    (default is 35 sec.)
  – leaseRecoveryWait: additional wait time to allow transactions to
    complete (default is 35 sec.)
• Pings: sent from the CfgMgr to a node when the node fails to renew its lease
  – PingPeriod: seconds between pings (default is 2 sec.)
  – totalPingTimeout: total ping time before giving up (default is 120 sec.)
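
These values are normally left at their defaults; as a sketch, the current settings can be reviewed and, where your GPFS release exposes these attributes, adjusted with mmchconfig:

  mmlsconfig                            # display the current cluster configuration values
  mmchconfig leaseRecoveryWait=35       # example only: set the extra recovery wait (the default shown above)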

© Copyright IBM Corporation 2013


GPFS terminology: File system manager
• Processes changes to the file system: adding disks, changing disk
  availability, repairing the file system, and mounting or unmounting the
  file system
• Management of disk space allocation: controls which regions of disks are
  allocated to each node, allowing effective parallel allocation of space
• Token management: coordinates access to files on shared disks by granting
  tokens that convey the right to read or write the data or metadata of a file
• Quota management: allocates disk blocks to nodes that are writing to the
  file system and compares the allocated space to the quota limits at
  regular intervals
• If the node containing the file system manager fails, the configuration
  manager assigns the role to another node (see the sketch below)
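
To see which node currently holds the file system manager role, or to move it (the file system and node names are hypothetical):

  mmlsmgr                     # show the file system manager node for each file system
  mmchmgr fs_data nodeB       # move the file system manager role for fs_data to nodeB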

© Copyright IBM Corporation 2013


GPFS terminology: File system descriptors
• A structure in GPFS that is initially written to every disk in the file
  system, but is replicated on only a subset of the disks as changes to the
  file system occur, such as adding or deleting disks
• Failure group: a collection of disks that share a common access path or
  adapter connection and could all become unavailable through a single
  hardware failure. When used in conjunction with the replication feature of
  GPFS, the creation of multiple failure groups provides for increased file
  availability should a group of disks fail.
• GPFS creates replicas of the file system descriptor according to these rules:
  – At least 5 failure groups: 5 replicas
  – At least 3 disks: 3 replicas
  – 1-2 disks: 1 replica per disk
© Copyright IBM Corporation 2013
GPFS terminology: Quorum
• Quorum node: a node in the cluster that is counted to determine whether a
  quorum exists.
• Two methods for determining node quorum:
  – Node quorum: the minimum number of quorum nodes that must be running in
    order for the daemon to start. Defined as: quorum = 1 + (number of quorum
    nodes / 2), using integer division
  – Node quorum with tiebreaker disks
    • Allows you to run with as few as one quorum node available, as long as
      you have access to a majority of the quorum disks
    • You may have one, two, or three tiebreaker disks; however, you should
      use an odd number of tiebreaker disks
    • Use the tiebreakerDisks parameter on the mmchconfig command (see the
      sketch below)
• File system descriptor quorum
  – The number of disks needed in order to write the file system descriptor
    correctly.
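
A minimal sketch of enabling tiebreaker disks, assuming three existing NSDs with hypothetical names (in GPFS 3.x the daemon typically has to be down on all nodes while this attribute is changed):

  mmshutdown -a                                          # stop GPFS on all nodes
  mmchconfig tiebreakerDisks="nsd_tb1;nsd_tb2;nsd_tb3"   # semicolon-separated list of tiebreaker NSDs
  mmstartup -a                                           # start GPFS on all nodes again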

© Copyright IBM Corporation 2013


Agenda
• Overview

• Terminology and concepts

• GPFS and Oracle

© Copyright IBM Corporation 2013


GPFS and Oracle
• Consistent with the Oracle SAME methodology
  – Stripe And Mirror Everything
• Supports Direct I/O
  – Bypasses the AIX buffer cache and GPFS cache for additional performance
  – Lets Oracle manage and optimize buffer caching
  – Performance close to that of raw LVs
• Supports all files: Oracle Home, CRS Home, data files, log files, control
  files, backups, and so forth.

© Copyright IBM Corporation 2013


Software Requirements
All entries apply to Oracle RAC 11gR2. Minimum certified levels by AIX release and GPFS version:

• AIX 7.1
  – GPFS 3.5: AIX 7.1-TL1-SP02, GPFS 3.5.0.1, RAC 11.2.0.1 / 11.2.0.2 / 11.2.0.3
  – GPFS 3.4: AIX 7.1-TL0-SP02 (IZ89165), GPFS 3.4.0.2, RAC 11.2.0.1 / 11.2.0.2 / 11.2.0.3
  – GPFS 3.3: AIX 7.1-TL0-SP2 (IZ89165), GPFS 3.3.0.12, RAC 11.2.0.1 / 11.2.0.2 / 11.2.0.3
  – GPFS 3.2: not planned
• AIX 6.1
  – GPFS 3.5: AIX 6.1-TL7-SP03, GPFS 3.5.0.1, RAC 11.2.0.1 / 11.2.0.2 / 11.2.0.3
  – GPFS 3.4: AIX 6.1-TL6-SP03 (IZ88711), GPFS 3.4.0.2, RAC 11.2.0.1 / 11.2.0.2 / 11.2.0.3
  – GPFS 3.3: AIX 6.1-TL4-SP6, GPFS 3.3.0.6, RAC 11.2.0.1 / 11.2.0.2
  – GPFS 3.2: AIX 6.1-TL3-SP1, GPFS 3.2.1.14, RAC 11.2.0.1
• AIX 5.3
  – GPFS 3.5: not planned
  – GPFS 3.4: AIX 5.3-TL10-SP05, GPFS 3.4.0.2, RAC 11.2.0.1 / 11.2.0.2 / 11.2.0.3
  – GPFS 3.3: AIX 5.3-TL09-SP2, GPFS 3.3.0.6, RAC 11.2.0.1
  – GPFS 3.2: AIX 5.3-TL09-SP2, GPFS 3.2.1.14, RAC 11.2.0.1
• VIOS*
  – GPFS 3.5: VIOS 2.2.1.3
  – GPFS 3.4: VIOS 2.2-fixpack24-SP01
  – GPFS 3.3: VIOS 2.1-fixpack21
  – GPFS 3.2: VIOS 2.1-fixpack20.1

VIOS* – If AIX, GPFS, or Oracle RAC minimum certified levels are different while using a VIO server, the levels are specifically noted in the VIOS row. Otherwise, for VIOS entries that do not specify AIX levels, see the corresponding RAC and AIX release rows to determine minimum certified levels.
© Copyright IBM Corporation 2013
GPFS and Oracle tuning (1 of 2)
• GPFS with Oracle uses Direct I/O
  – Oracle parameter filesystemio_options is ignored
• GPFS block size (set when the file system is created; see the sketch below):
  – Generally suggested: 512-1024 KB
  – 256 KB if Oracle is combined with something else that contains a lot of
    small files
• db_file_multiblock_read_count:
  – If the Oracle database block size is 16 KB and the GPFS block size is
    512 KB, set the parameter to 32 (512/16) or 64
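
A minimal sketch of creating a GPFS file system with a 512 KB block size for Oracle data files (the device name, mount point, and descriptor file are hypothetical; mmcrfs syntax differs slightly between GPFS releases):

  mmcrfs /oradata fs_oradata -F /tmp/oradisks.desc -B 512K -A yes
  #  -B 512K: file system block size    -A yes: mount the file system automatically when GPFS starts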

© Copyright IBM Corporation 2013


GPFS and Oracle tuning (2 of 2)
• GPFS threads (see the sketch below):
  – worker1Threads + prefetchThreads <= 550
  – prefetchThreads defaults to 64; set it higher (between 50 and 100) for
    sequential Oracle activity
  – mmchconfig worker1Threads=(550 - prefetchThreads) to allow a high degree
    of parallelism for the Oracle AIO threads
  – Set aio maxservers = (worker1Threads / # CPUs) + 10
  – When mmchconfig is used, GPFS has to be restarted across the cluster
• Other parameters:
  – autoload: starts GPFS automatically when nodes are rebooted
  – pagepool: GPFS file system buffer cache
    • Tune when I/O is expected to file systems outside of Oracle control
    • Default = 64 MB
    • Maximum value = 8 GB
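
A minimal sketch of the settings above (the values simply satisfy worker1Threads + prefetchThreads <= 550 and are an illustration, not a recommendation for a particular workload):

  mmchconfig prefetchThreads=100,worker1Threads=450    # 100 + 450 <= 550
  mmchconfig autoload=yes                              # start GPFS automatically when a node boots
  mmchconfig pagepool=256M                             # only if non-Oracle I/O also goes through GPFS
  mmshutdown -a && mmstartup -a                        # restart GPFS so the new values take effect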
© Copyright IBM Corporation 2013
GPFS and Oracle: Other recommendations
• Have fewer than 10 GPFS file systems in total
• OCR and voting disk placement is recommended on raw LUNs, rather than as GPFS files, for the
  fastest failover time
• Because of different usage patterns, use separate file systems for binaries and for Oracle
  data files
• It is strongly recommended to use a local (JFS2) file system for $CRS_HOME
  – Facilitates rolling upgrades of Oracle Clusterware
  – If a shared $CRS_HOME is required, the $CRS_HOME/log directory for each node must be
    linked to a local (JFS2) directory (see the sketch below); otherwise the cluster will not
    survive a failover of the file system manager for the $CRS_HOME
• $ORACLE_HOME for the database binaries may be placed on GPFS with no additional
  modifications.
  – For best support of rolling upgrades, use local file systems (JFS2) for binaries rather
    than sharing $ORACLE_HOME directories
  – However, only a few 10g database patches can be applied in a rolling fashion
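
One way the per-node log link could be set up when a shared $CRS_HOME is unavoidable (the local path is hypothetical; a sketch to be run on each node with Clusterware stopped):

  mkdir -p /local/crs/log                  # node-local JFS2 directory for Clusterware logs
  cp -Rp $CRS_HOME/log/* /local/crs/log    # preserve any existing log content
  mv $CRS_HOME/log $CRS_HOME/log.orig      # keep the original directory as a backup
  ln -s /local/crs/log $CRS_HOME/log       # point the shared home at the node-local directory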

© Copyright IBM Corporation 2013


CSS Heartbeat and GPFS
• CSS has two heartbeat mechanisms (see the sketch below):
  – Network heartbeat across the interconnect to establish/confirm cluster
    membership
    • CSS misscount parameter
    • If network ping time > CSS misscount, the node is evicted
    • CSS misscount = 30 seconds for UNIX with Oracle Clusterware
  – Disk heartbeat to the voting device
    • Internal I/O timeout interval (IOT) within which an I/O to the voting
      disk must complete
    • If I/O to the voting disk > IOT, the node is evicted
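
On an 11gR2 cluster the two timeouts can be checked with crsctl (run as a privileged user from the Grid Infrastructure home; a sketch, output not shown):

  crsctl get css misscount      # network heartbeat timeout, in seconds
  crsctl get css disktimeout    # voting disk I/O timeout, in seconds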

© Copyright IBM Corporation 2013