
GPFS and HDFS

IBM General Parallel File System (IBM GPFS) is a highly scalable and clustered file system that

1. is used to distribute and manage data across multiple servers, and is implemented in
many high-performance computing and large-scale storage environments.
2. is used for commercial applications requiring high-speed access to large volumes of
data, such as digital media, seismic data processing and engineering design.
3. is described as a parallel file system because GPFS data is broken into blocks and
striped across multiple disks in an array, then read in parallel when the data is accessed.
This allows for faster read and write speeds.
4. provides other management features such as high availability, replication, mirroring,
policy-based automation and disaster recovery.
5. can be deployed in shared-disk or shared-nothing distributed parallel modes, or a
combination of these.
6. is deployed by clients with file systems as large as 18 PB.
7. has no single point of failure, because metadata is distributed across nodes in an
active configuration.
8. supports policy-based data ingestion.
9. is an enterprise-class storage solution.
10. is a POSIX-compliant system.

The Portable Operating System Interface (POSIX) is a family of standards specified by the
IEEE Computer Society for maintaining compatibility between operating systems.
A POSIX-compliant file system is one that complies with the IEEE Std 1003.1 system
interfaces.

Storage used for large supercomputers is often GPFS-based. GPFS provides concurrent
high-speed file access to applications executing on multiple nodes of a cluster.

The system stores data on standard block storage volumes, but includes an internal RAID
layer that can virtualize those volumes for redundancy and parallel access, much like a RAID
block storage system. It can also replicate across volumes at the higher file level.
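The striping described in point 3 above can be pictured with a short conceptual sketch. The
Python fragment below is an illustration only, not GPFS code: the block size, the local
directories standing in for separate disks, and the thread pool are assumptions made for the
example. It splits a file into fixed-size blocks, spreads the blocks round-robin across the
"disks", and reads them back in parallel.

import os
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 1 * 1024 * 1024            # assumed block size for the illustration
DISKS = ["disk0", "disk1", "disk2"]     # local directories standing in for separate disks

def stripe_file(path):
    """Split a file into blocks and write them round-robin across the 'disks'."""
    for d in DISKS:
        os.makedirs(d, exist_ok=True)
    block_paths = []
    with open(path, "rb") as src:
        index = 0
        while True:
            block = src.read(BLOCK_SIZE)
            if not block:
                break
            dest = os.path.join(DISKS[index % len(DISKS)], "block_%06d" % index)
            with open(dest, "wb") as out:
                out.write(block)
            block_paths.append(dest)
            index += 1
    return block_paths

def read_striped(block_paths):
    """Read every block in parallel and reassemble the original byte stream."""
    def read_block(p):
        with open(p, "rb") as f:
            return f.read()
    with ThreadPoolExecutor(max_workers=len(DISKS)) as pool:
        blocks = list(pool.map(read_block, block_paths))   # result order is preserved
    return b"".join(blocks)

GPFS performs the equivalent work transparently beneath its POSIX interface, so applications
simply see a single file.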

Other features of the architecture include:

• Distributed metadata, including the directory tree. There is no single "directory
controller" or "index server" in charge of the filesystem.
• Efficient indexing of directory entries for very large directories.
• Distributed locking. This allows for full POSIX filesystem semantics, including
locking for exclusive file access.
• Partition awareness. A failure of the network may partition the filesystem into two or
more groups of nodes that can only see the nodes in their group. This can be detected
through a heartbeat protocol, and when a partition occurs, the filesystem remains live
for the largest partition formed. This offers graceful degradation of the filesystem:
some machines keep working (a simplified sketch of this behaviour follows the list).
• Online filesystem maintenance. Most filesystem maintenance chores (adding new disks,
rebalancing data across disks) can be performed while the filesystem is live. This keeps
the filesystem, and therefore the supercomputer cluster itself, available for longer.
• High availability, the ability to be used in a heterogeneous cluster, disaster recovery,
security, DMAPI, HSM and ILM.
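The partition-aware behaviour in the list above can be sketched as follows. This is a
conceptual Python illustration, not the GPFS heartbeat protocol; the node names and the
can_reach() reachability test are hypothetical. The largest group of mutually reachable nodes
is the one that keeps the filesystem live.

def find_partitions(nodes, can_reach):
    """Group nodes into partitions; nodes that can reach each other end up together.
    can_reach(a, b) is a hypothetical reachability test (e.g. heartbeat replies)."""
    partitions = []
    for node in nodes:
        for group in partitions:
            if can_reach(node, group[0]):
                group.append(node)
                break
        else:
            partitions.append([node])
    return partitions

def live_partition(nodes, can_reach):
    """Only the largest partition keeps the filesystem live; the others degrade gracefully."""
    return max(find_partitions(nodes, can_reach), key=len)

# Example: a network split isolates node4 and node5 from the first three nodes.
reachable = {("node1", "node2"), ("node1", "node3"), ("node4", "node5")}
def can_reach(a, b):
    return a == b or (a, b) in reachable or (b, a) in reachable

print(live_partition(["node1", "node2", "node3", "node4", "node5"], can_reach))
# -> ['node1', 'node2', 'node3']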

GPFS was originally developed as a SAN file system, which would normally prevent it from
being used with Hadoop and the direct-attached disks that make up a Hadoop cluster. This is
where the IBM GPFS feature called File Placement Optimizer (FPO) comes into play. In 2009,
IBM hooked GPFS up to Hadoop, and today GPFS scales into the petabyte range and offers
more advanced data management capabilities than HDFS.

GPFS FPO (File Placement Optimizer) is specifically designed for Hadoop environments and
is a replacement for HDFS. The flexibility to access GPFS-resident data from both Hadoop
and non-Hadoop applications frees users to build more flexible big data workflows.

Most traditional big data cluster implementations use the Hadoop Distributed File System
(HDFS) as the underlying file system to store data. The following are some Spectrum Scale
features that are not supported by HDFS.

Spectrum Scale is a POSIX-compliant file system; HDFS is not.


All applications will run as-is, or with very minimal changes, on a Hadoop cluster if Spectrum
Scale is used as the underlying file system instead of HDFS. Using GPFS minimizes new
application development and testing costs, and your big data cluster is production-ready in
the least amount of time.
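Because Spectrum Scale is POSIX compliant, ordinary file calls work directly against a GPFS
mount. The sketch below is a minimal illustration that assumes a hypothetical mount point
/gpfs/data; with plain HDFS the equivalent operations would instead have to go through the
Java-based HDFS APIs noted in the comparison table later in this document.

import fcntl
import os

# Hypothetical GPFS mount point; any POSIX path behaves the same way.
path = "/gpfs/data/report.log"

with open(path, "a+b") as f:
    fcntl.lockf(f, fcntl.LOCK_EX)    # advisory exclusive lock via the standard POSIX interface
    f.write(b"new record\n")         # append in place, no Hadoop client library involved
    f.flush()
    os.fsync(f.fileno())             # durability through the ordinary fsync() call
    f.seek(0)                        # random access works as on any local file
    first_line = f.readline()
    fcntl.lockf(f, fcntl.LOCK_UN)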

Spectrum Scale provides seamless integration of Hadoop clusters with other data
warehouse environments and transfers the data with ease between your Hadoop cluster and
Spectrum Scale. This offers high flexibility to easily integrate your big data environment with
traditional analytics environments.

Spectrum Scale is a highly available file system.


Managing large clusters with thousands of nodes and petabytes of storage is a complex task,
and providing high availability is a key requirement in such environments. Spectrum Scale
provides data and metadata replication with up to three copies, file system replication across
multiple sites, multiple failure groups, node-based and disk-based quorum, automated node
recovery, automated data striping and rebalancing, and more. These high availability features,
in my view, make Spectrum Scale a better choice than HDFS for enterprise production data.
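As a rough illustration of node-based quorum (a simplified sketch, not the actual GPFS
implementation; the node names are made up), the cluster keeps running only while a
majority of the designated quorum nodes are reachable:

def has_node_quorum(quorum_nodes, reachable_nodes):
    """Node-based quorum: more than half of the designated quorum nodes must be up."""
    up = sum(1 for n in quorum_nodes if n in reachable_nodes)
    return up > len(quorum_nodes) // 2

# Example with five quorum nodes: losing two still leaves quorum, losing three does not.
quorum_nodes = ["q1", "q2", "q3", "q4", "q5"]
print(has_node_quorum(quorum_nodes, {"q1", "q3", "q5"}))   # True  (3 of 5 up)
print(has_node_quorum(quorum_nodes, {"q1", "q2"}))         # False (2 of 5 up)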

Security Compliance

Security compliance for business-critical data is another critical requirement for any
enterprise, but it is often overlooked during the development phase of many big data proofs
of concept. Because many big data PoCs use a wide variety of open source components, it
often becomes a daunting task to achieve the required security compliance, and the PoC
implementation cannot go to production unless it meets all the security compliance
requirements. Consider the security compliance features offered by Spectrum Scale, such as
file system encryption, NIST SP 800-131A compliance, NFS v4 ACL support and SELinux
compatibility, when selecting the appropriate file system for big data clusters. It is much
easier to implement these operating system security features with Spectrum Scale than with
HDFS.

Information Life-cycle Management


Spectrum Scale has extensive Information Life-cycle Management (ILM) features, which are
necessary when working with large big data clusters with petabytes of storage. Using
Spectrum Scale ILM policies, aging data can be automatically archived, deleted or moved to
low-performance disk. This is a major advantage of Spectrum Scale over HDFS and
minimizes continuously growing storage costs.
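The effect of such an ILM policy can be sketched conceptually. The Python fragment below is
not the Spectrum Scale policy language; the 90-day threshold and the pool paths are
assumptions made for the illustration. It moves files that have not been accessed recently from
a fast tier to a slower, cheaper tier.

import os
import shutil
import time

AGE_LIMIT_DAYS = 90              # assumed threshold for "aging" data
FAST_POOL = "/pools/fast"        # hypothetical high-performance tier
SLOW_POOL = "/pools/slow"        # hypothetical low-performance tier

def migrate_cold_files():
    """Move files not accessed within AGE_LIMIT_DAYS from the fast tier to the slow tier."""
    cutoff = time.time() - AGE_LIMIT_DAYS * 24 * 3600
    for root, _dirs, files in os.walk(FAST_POOL):
        for name in files:
            src = os.path.join(root, name)
            if os.stat(src).st_atime < cutoff:                # last access older than cutoff
                dest = os.path.join(SLOW_POOL, os.path.relpath(src, FAST_POOL))
                os.makedirs(os.path.dirname(dest), exist_ok=True)
                shutil.move(src, dest)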

Lately, IBM has been talking up the benefits of hooking Hadoop up to the General Parallel
File System (GPFS). IBM has done the work of integrating GPFS with Hadoop. GPFS is used
in IBM's biggest supercomputers, such as Blue Gene, ASCI Purple, Watson, Sequoia and
MIRA.

FPO essentially emulates a key component of HDFS: moving the application workload to the
data. "Basically, it moves the job to the data as opposed to moving the data to the job," he
says in the interview. "Say I have 20 servers in a rack and three racks. GPFS FPO knows a
copy of the data I need is located on the 60th server and it can send the job right to that
server. This reduces network traffic, since GPFS-FPO does not need to move the data. It also
improves performance and efficiency."
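The scheduling idea in the quote can be sketched as follows; the block-to-server replica map
and the server names are hypothetical. A job for a given block is dispatched to a server that
already holds a replica, so the data itself does not have to cross the network.

def pick_server(block_id, replica_map, all_servers):
    """Dispatch the job to a server that already holds a replica of the block
    ("move the job to the data"); fall back to any server if no location is known."""
    candidates = replica_map.get(block_id, [])
    return candidates[0] if candidates else all_servers[0]

# Hypothetical layout: 3 racks x 20 servers, block 42 replicated on two of them.
all_servers = ["rack%d-server%d" % (r, s) for r in range(1, 4) for s in range(1, 21)]
replica_map = {42: ["rack3-server20", "rack1-server7"]}
print(pick_server(42, replica_map, all_servers))   # -> rack3-server20 (data-local)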

GPFS also brings benefits in the area of data de-duplication, because it does not tend to
duplicate data the way HDFS does, IBM says. However, if users prefer to have copies of their
data spread out in multiple places on their cluster, they can use the write affinity depth (WAD)
feature that debuted with the introduction of FPO. The GPFS quota system also helps to
control the number of files and the amount of file data in the file system, which helps to
manage storage.

The shared-nothing architecture used by GPFS-FPO also provides greater resilience than
HDFS by allowing each node to operate independently, reducing the impact of failure events
across multiple nodes. The elimination of the HDFS NameNode also eliminates the single-
point-of-failure problem that shadows enterprise Hadoop deployments.

The Active File Management (AFM) feature of GPFS also boosts resiliency by caching
datasets in different places on the cluster, thereby ensuring that applications can access data
even when the remote storage cluster is unavailable. AFM also effectively masks wide-area
network latencies and outages. Customers can either use AFM to maintain an asynchronous
copy of the data at a separate physical location or use GPFS synchronous replication, which
is what FPO replicas use.
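A rough way to picture the AFM caching behaviour (a conceptual sketch, not the AFM
implementation; the cache directory and the fetch_remote() function are hypothetical): data is
refreshed from the remote cluster while it is reachable and served from the local cached copy
when it is not.

import os

CACHE_DIR = "/gpfs/afm_cache"    # hypothetical local cache fileset

def read_with_cache(name, fetch_remote):
    """Serve data from the home (remote) cluster while it is reachable, keeping a local
    copy; fall back to the cached copy when the remote site or the WAN link is down."""
    cached = os.path.join(CACHE_DIR, name)
    try:
        data = fetch_remote(name)            # hypothetical call to the remote cluster
        os.makedirs(CACHE_DIR, exist_ok=True)
        with open(cached, "wb") as f:
            f.write(data)                    # refresh the local cache
        return data
    except OSError:
        with open(cached, "rb") as f:        # remote unavailable: serve last cached copy
            return f.read()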

GPFS versus HDFS:

1. GPFS is a POSIX-compliant system; HDFS is not.
2. GPFS breaks files into small blocks; HDFS breaks files into blocks of 64 MB or 128 MB.
3. GPFS has no single point of failure; HDFS has a single point of failure (the NameNode).
4. GPFS enables any application running atop Hadoop to access data stored in the file
system; HDFS supports only Hadoop applications, which must go through the Java-based
HDFS APIs.
5. GPFS is more secure; HDFS is less secure.
6. With GPFS, aging data is automatically archived, deleted or moved to low-performance
disk; HDFS has no such function, and hence incurs higher storage costs.
7. GPFS supports SAN storage; HDFS does not.
8. GPFS uses RAID; HDFS does not.
9. GPFS uses point-in-time snapshots and off-site replication capabilities; HDFS relies on
copy commands.
10. With GPFS, only the filling up of disks is a concern and there is no need to dedicate
storage; with HDFS, administrators need to carefully plan the disk space for the Hadoop
cluster, for the output of Map-Reduce jobs and for log files.
11. With GPFS, a third-party management tool can manage data storage for internal storage
pools, so that only the "hottest" data is kept on the fastest disk (the policy-based
information life-cycle management function); HDFS has no such function.
