IBM General Parallel File System (IBM GPFS) is a highly scalable, clustered file system that
1. is used to distribute and manage data across multiple servers, and is implemented in
many high-performance computing and large-scale storage environments.
2. is used for commercial applications requiring high-speed access to large volumes of
data, such as digital media, seismic data processing and engineering design.
3. is described as a parallel file system because GPFS data is broken into blocks and
striped across multiple disks in an array, then read in parallel when data is accessed.
This allows for faster read and write speeds (see the sketch after this list).
4. provides other management features such as high availability, replication, mirroring,
policy-based automation and disaster recovery.
5. can be deployed in shared-disk or shared-nothing distributed parallel modes, or a
combination of these.
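To make the striping idea in item 3 concrete, here is a minimal Python sketch that simulates round-robin placement of a file's blocks across disks and a parallel read-back. The block size, disk count, and data are invented for illustration; real GPFS striping happens inside the file system, at much larger block sizes.

```python
# Illustrative sketch only: simulates how a parallel file system such as GPFS
# stripes a file's blocks across disks and reads them back concurrently.
# Block size, disk count, and file contents are made-up example values.
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 4  # bytes per block; real GPFS block sizes are much larger (e.g. 1 MiB)
NUM_DISKS = 3

def stripe(data: bytes, num_disks: int, block_size: int):
    """Split data into blocks and assign them round-robin across disks."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    disks = [[] for _ in range(num_disks)]
    for idx, block in enumerate(blocks):
        disks[idx % num_disks].append((idx, block))
    return disks

def read_disk(disk):
    """Stand-in for one disk's I/O; each disk is read by its own thread."""
    return disk

def parallel_read(disks):
    with ThreadPoolExecutor(max_workers=len(disks)) as pool:
        results = list(pool.map(read_disk, disks))
    # Reassemble blocks in their original order.
    blocks = sorted((pair for disk in results for pair in disk), key=lambda p: p[0])
    return b"".join(block for _, block in blocks)

disks = stripe(b"GPFS stripes file data", NUM_DISKS, BLOCK_SIZE)
assert parallel_read(disks) == b"GPFS stripes file data"
```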
Storage used for large supercomputers is often GPFS-based. GPFS provides concurrent high-
speed file access to applications executing on multiple nodes of a cluster.
The system stores data on standard block storage volumes, but includes an internal RAID
layer that can virtualize those volumes for redundancy and parallel access, much like a RAID
block storage system. It also has the ability to replicate across volumes at the higher file level.
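As a hedged illustration of that file-level replication, the sketch below shows how a GPFS file system might be created with two copies of both data and metadata via mmcrfs. It assumes an admin node with Spectrum Scale installed and an NSD stanza file prepared in advance; the device name, paths, and values are hypothetical, so check the command reference for your release.

```python
# Hedged sketch: create a GPFS file system with two replicas of data and
# metadata. Assumes Spectrum Scale is installed; names/values are examples.
import subprocess

subprocess.run(
    ["mmcrfs", "fs1",          # file system (device) name
     "-F", "/tmp/nsd.stanza",  # NSD (disk) definitions, prepared beforehand
     "-m", "2",                # default number of metadata replicas
     "-r", "2",                # default number of data replicas
     "-B", "1M"],              # file system block size
    check=True,
)
```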
GPFS was originally developed as a SAN file system, which would normally prevent it from
being used with Hadoop and the direct-attached disks that make up a Hadoop cluster. This is
where the IBM GPFS feature called File Placement Optimizer (FPO) comes into play. In 2009,
IBM hooked GPFS to Hadoop, and today IBM runs GPFS, which scales into the
petabyte range and has more advanced data management capabilities than HDFS.
GPFS FPO is specifically designed for Hadoop environments
and serves as a replacement for HDFS. The flexibility to access GPFS-resident data from both
Hadoop and non-Hadoop applications frees users to build more flexible big data workflows.
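Because GPFS presents a standard POSIX file system, data that a Hadoop job writes into GPFS can be read directly by any ordinary application. The sketch below assumes a hypothetical GPFS mount point and output path purely for illustration.

```python
# Minimal sketch: a non-Hadoop application reading Hadoop job output straight
# from a GPFS mount point, with no HDFS client or copy step in between.
# The mount point and file name below are hypothetical examples.
from pathlib import Path

GPFS_MOUNT = Path("/gpfs/bigdata")                 # hypothetical mount point
results_file = GPFS_MOUNT / "job-output" / "part-00000"

if results_file.exists():
    with results_file.open() as f:
        for line in f:
            print(line.rstrip())                   # application-specific handling
```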
Most traditional Big Data cluster implementations use the Hadoop Distributed
File System (HDFS) as the underlying file system for storing data. The following are some
Spectrum Scale features that HDFS does not support.
Spectrum Scale provides seamless integration of Hadoop clusters with other data
warehouse environments and moves data easily between your Hadoop cluster and
Spectrum Scale. This offers high flexibility to integrate your big data environment with
traditional analytics environments.
Security Compliance
Security compliance of business-critical data is another critical requirement for any
enterprise, but it is often overlooked during the development phase of many Big Data proofs
of concept. Because many Big Data PoCs use a wide variety of open source components,
achieving the required security compliance often becomes a daunting task, and a PoC
implementation cannot go to production unless it meets all security compliance requirements.
Consider the security compliance features offered by Spectrum Scale, such as file system
encryption, NIST SP 800-131A compliance, NFSv4 ACL support and SELinux compatibility,
when selecting the appropriate file system for Big Data clusters. It is much easier to implement
these operating system security features with Spectrum Scale than with HDFS.
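As a hedged sketch, the snippet below shows how an administrator might query two of the settings mentioned above from a Spectrum Scale node: the cluster's NIST SP 800-131A mode (the nistCompliance configuration parameter) and the NFSv4 ACL on a file. It assumes the mm* administration commands are on PATH; the file path is a hypothetical example.

```python
# Hedged sketch: inspect security-related Spectrum Scale settings from an admin
# node. Assumes Spectrum Scale is installed and mm* commands are on PATH.
import subprocess

def run(cmd):
    """Run an administration command and return its output (raises on failure)."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

# Cluster-wide NIST SP 800-131A mode (config parameter 'nistCompliance').
print(run(["mmlsconfig", "nistCompliance"]))

# NFSv4 ACL entries on a GPFS-resident file (hypothetical path).
print(run(["mmgetacl", "/gpfs/bigdata/secure/report.csv"]))
```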
Lately, IBM has been talking up the benefits of hooking Hadoop up to the General Parallel
File System (GPFS). IBM has done the work of integrating GPFS with Hadoop.
FPO essentially emulates a key component of HDFS: moving the application workload to the
data. "Basically, it moves the job to the data as opposed to moving data to the job," as described
in one interview. "Say I have 20 servers in a rack and three racks. GPFS FPO knows a copy of
the data I need is located on the 60th server and it can send the job right to that server. This
reduces network traffic since GPFS-FPO does not need to move the data. It also improves
performance."
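The scheduling idea in that quote can be sketched in a few lines: given a map of which servers hold replicas of a block, dispatch the task to a server that already has the data, and fall back to shipping the data only when no replica holder is free. All server names and numbers below are invented.

```python
# Illustrative sketch of locality-aware scheduling: send the task to a server
# that already holds a replica of the data instead of moving the data.
block_locations = {
    "block-42": ["server-07", "server-23", "server-60"],  # replica holders
}

def schedule(task_block: str, busy: set[str]) -> str:
    """Prefer any idle server that holds a local copy; fall back to a remote one."""
    for server in block_locations[task_block]:
        if server not in busy:
            return server            # data-local: no network transfer needed
    return "server-01"               # remote fallback: data must be shipped

print(schedule("block-42", busy={"server-07"}))  # -> server-23 (local, idle)
```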
GPFS also brings benefits in the area of data de-duplication because it does not tend to
duplicate data as HDFS does, IBM says. However, if users prefer to have copies of their data
spread out in multiple places on their cluster, they can use the write-affinity depth (WAD)
feature that debuted with the introduction of FPO. The GPFS quota system also helps to
control the number of files and the amount of file data in the file system, which helps to
manage storage.
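A hedged sketch of those quota controls: mmlsquota reports a user's current usage and limits, and mmsetquota caps both data blocks and file (inode) counts. It assumes a Spectrum Scale admin node; the user, file system, and limits are hypothetical, and the exact option syntax should be verified against your release documentation.

```python
# Hedged sketch: view and set GPFS quotas that cap file count and data volume.
# Assumes Spectrum Scale is installed; user/file system/limits are examples.
import subprocess

# Show current usage and limits for user 'alice' on file system 'fs1'.
subprocess.run(["mmlsquota", "-u", "alice", "fs1"], check=True)

# Set soft:hard limits on data blocks and on number of files (inodes);
# verify option syntax against your Spectrum Scale release documentation.
subprocess.run(
    ["mmsetquota", "fs1", "--user", "alice",
     "--block", "10G:12G", "--files", "100000:120000"],
    check=True,
)
```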
The shared-nothing architecture used by GPFS-FPO also provides greater resilience than
HDFS by allowing each node to operate independently, reducing the impact of failure events
across multiple nodes. Eliminating the HDFS NameNode also removes the single point of
failure that shadows enterprise Hadoop deployments.
The Active File Management (AFM) feature of GPFS also boosts resiliency by caching
datasets in different places on the cluster, thereby ensuring that applications can access data
even when the remote storage cluster is unavailable. AFM also effectively masks wide-area
network latencies and outages. Customers can either use AFM to maintain an asynchronous
copy of the data at a separate physical location or use GPFS synchronous replication, the
mechanism used by FPO replicas.
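To illustrate the caching behavior described above, here is a toy Python model (not the AFM implementation): reads are served from a local cache fileset when possible, so previously cached data stays available even while the remote home cluster is unreachable. All paths and data are invented.

```python
# Conceptual sketch of the AFM behavior described above: serve reads from a
# local cache fileset, degrading gracefully when the home cluster is down.
class AFMCache:
    def __init__(self):
        self.local = {}                      # cached copies at the cache site

    def fetch_remote(self, path):
        """Stand-in for a WAN read from the home cluster; may be unavailable."""
        raise ConnectionError("home cluster unreachable")

    def read(self, path):
        if path in self.local:               # cache hit: WAN latency masked
            return self.local[path]
        data = self.fetch_remote(path)       # cache miss: go to home cluster
        self.local[path] = data
        return data

cache = AFMCache()
cache.local["/gpfs/data/model.bin"] = b"..."   # previously cached dataset
print(cache.read("/gpfs/data/model.bin"))      # served locally despite outage
```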