
big data storage

Posted by: Margaret Rouse, WhatIs.com
Contributor(s): Garry Kranz

Big data storage is a compute-and-storage architecture that collects and manages large data sets
and enables real-time data analytics.

Companies apply big data analytics to extract greater intelligence from the data and its metadata. In
most cases, big data storage uses low-cost hard disk drives, although moderating flash prices appear
to have opened the door to using flash in the servers and storage systems that underpin big data
storage. These systems can be all-flash or hybrids that mix disk and flash storage.

The data in big data environments is mostly unstructured, which makes file-based and object storage the natural fit.

Although a specific volume size or capacity is not formally defined, big data storage usually
refers to volumes that grow exponentially to terabyte or petabyte scale.

The big promise behind big data

Several factors have fueled the rise of big data. Businesses store more information than ever
before, thanks to the widespread digitization of paper records. The proliferation of sensor-based
Internet of Things (IoT) devices, which generate data without human intervention, has driven a
corresponding rise in applications built on artificial intelligence (AI) and its machine learning
branch.

A common misconception is that the term big data refers solely to the size of the data set. Although
scale is part of the picture, the science behind big data is more focused. The intention is to mine
specific subsets of data from multiple, large storage volumes. This data may be widely dispersed
across different systems and may lack an obvious correlation. The objective is to unify the data
with structure and intelligence so it can be analyzed rapidly.

The ability to collect data from various sources and place it in an understandable context allows
an organization to glean details that are not readily apparent otherwise. The analysis is used to
inform decision-making, such as examining online browsing behavior to tailor products and services
to a customer's habits or preferences.

Big data analytics has paved the way for DevOps organizations to emerge as a strategic analytics
arm within many enterprises. Companies in finance, health care and energy need to analyze data
to pinpoint trends and improve business functions. In the past, businesses were limited to using a
data warehouse or high-performance computing (HPC) cluster to parallelize batch processing of
structured data, a process that could take days or perhaps weeks to complete.

By contrast, big data analytics processes large semi-structured or unstructured data and streams
the results within seconds. Google and Facebook exploit fast big data storage to serve targeted
advertising to users as they surf the Internet, for example. A data warehouse or HPC cluster may
be used separately as an adjunct to a big data system.

Analyst firm IDC estimates that the market for big data storage hardware, services and software
will generate $151 billion in 2017 and grow at a compound annual rate of nearly 12% through 2020,
when revenue is pegged to hit $210 billion.

Big data storage demand will approach 163 zettabytes by 2025, according to a separate 2017
report by IDC and Seagate. The report attributes the growth to increased use of cognitive
computing, embedded systems, machine learning, mobile devices and security.

The components of big data storage infrastructure

A big data storage system clusters a large number of commodity servers attached to high-
capacity disk to support analytic software written to crunch vast quantities of data. The system
relies on massively parallel processing databases to analyze data ingested from a variety of
sources.

Big data often lacks structure and comes from varied sources, making it a poor fit for
processing with a relational database. Apache Hadoop, with its Hadoop Distributed File System
(HDFS), is the most prevalent analytics platform for big data, and it is typically combined with
some flavor of NoSQL database.
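
To make the ingest side of such a cluster concrete, here is a brief sketch that uses Hadoop's Java FileSystem API to land a local log file in HDFS, where it is block-replicated across the cluster's data nodes before analysis begins. The NameNode address, directory and file names are placeholders chosen for illustration, not values taken from this article.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsIngestExample {
    public static void main(String[] args) throws Exception {
        // Point the client at the cluster; the NameNode URI is a placeholder.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
        FileSystem fs = FileSystem.get(conf);

        // Create a landing directory and copy a local file into it.
        Path landing = new Path("/data/raw/clickstream");
        fs.mkdirs(landing);
        fs.copyFromLocalFile(new Path("/tmp/clicks.log"),
                             new Path(landing, "clicks.log"));

        // List what now sits in HDFS, spread across the cluster's data nodes.
        for (FileStatus status : fs.listStatus(landing)) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
        fs.close();
    }
}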

Video -- Big data storage methods: Ben Woo of Neuralytix discusses Hadoop and storage for big data
projects.

Hadoop is open source software written in the Java programming language. HDFS spreads the data
across hundreds or even thousands of server nodes without a performance hit, while Hadoop's
MapReduce component distributes the processing across those nodes as a safeguard against
catastrophic failure. The multiple nodes serve as a platform for data analysis at the network's
edge. When a query arrives, MapReduce executes processing directly on the storage node where the
data resides. Once the analysis is completed, MapReduce gathers the results from each server and
“reduces” them to a single cohesive response.
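
To make the map-and-reduce flow concrete, the following is a minimal sketch of the classic word count job written against Hadoop's Java MapReduce API: the mapper runs on the nodes where the input blocks reside and emits intermediate counts, and the reducer combines them into a single result. The class names and the command-line input and output paths are illustrative assumptions, not part of the article.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // The "map" step runs on each storage node and emits (word, 1) pairs.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // The "reduce" step gathers the per-node counts into one total per word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
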
How big data storage compares to traditional enterprise storage

Big data can bring an organization a competitive advantage through large-scale statistical analysis
of the data or its metadata. In a big data environment, the analytics mostly operate on a
circumscribed set of data, using data mining-based predictive models to forecast customer behavior
or the likelihood of future events.

Statistical big data analysis and modeling is gaining adoption in a cross-section of industries,
including aerospace, environmental science, energy exploration, financial markets, genomics,
healthcare and retailing. A big data platform is built for much greater scale, speed and
performance than traditional enterprise storage. Also, in most cases, big data storage targets a
much more limited set of workloads on which it operates.

For example, your enterprise resource planning systems might be attached to a dedicated storage
area network (SAN). Meanwhile, your clustered network-attached storage (NAS) supports
transactional databases and corporate sales data, while a private cloud handles on-premises
archiving.

It's not uncommon for larger organizations to have multiple SAN and NAS environments that
support discrete workloads. Each enterprise storage silo may contain pieces of data that pertain
to your big data project.

Podcast: Antony Adshead, a site editor at Computer Weekly, discusses what defines big data and the
key attributes required of big data storage.

Legacy storage systems handle a broader range of application workloads. The generally
accepted industry practice in primary storage is to assign an individual service level to each
application to govern availability, backup policies, data access, performance and security.
Storage used for production -- the activities a company uses daily to generate revenue --
demands high uptime, whereas big data storage projects can tolerate higher latency.

The three Vs of big data storage technologies

Storage for big data is designed to collect voluminous data produced at variable speeds by
multiple sources and in varied formats. Industry experts describe this process as the three Vs: the
variety, velocity and volume of data.

Variety describes the different sources and types of data to be mined. Sources include audio files,
documents, email, file storage, images, log data, social media posts, streaming video and user
clickstreams.

Velocity pertains to the speed at which storage can ingest big data volumes and run analytic
operations against them.

Volume acknowledges that modern data sets are large and growing larger, outstripping the
capabilities of existing legacy storage.

Some experts suggest big data storage needs to encompass a fourth V: veracity. This involves
ensuring that the data sources being mined are verifiably trustworthy. A major pitfall of big data
analytics is that errors tend to compound, whether through corruption, user error or other causes.
Veracity may be the most important element and the toughest issue to address; in many cases, it is
achievable only after a thorough cleansing of the underlying databases.
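
As a minimal sketch of the kind of cleansing pass described above, the code below filters a batch of comma-separated records, dropping empty lines, malformed rows and exact duplicates before the data reaches the analytics tier. The four-field record layout is an assumption made purely for illustration.

import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class RecordCleanser {
    private static final int EXPECTED_FIELDS = 4;  // assumed record layout

    // Keep only well-formed, non-duplicate records.
    public static List<String> cleanse(List<String> rawRecords) {
        Set<String> seen = new LinkedHashSet<>();
        List<String> clean = new ArrayList<>();
        for (String record : rawRecords) {
            if (record == null || record.trim().isEmpty()) {
                continue;                          // drop empty lines
            }
            String[] fields = record.split(",", -1);
            if (fields.length != EXPECTED_FIELDS) {
                continue;                          // drop malformed rows
            }
            if (seen.add(record.trim())) {
                clean.add(record.trim());          // drop exact duplicates
            }
        }
        return clean;
    }
}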

How machine learning affects big data storage

Machine learning is a branch of AI whose rising prominence mirrors that of big data analytics.
Trillions of data points are generated each day by AI-based sensors embedded in IoT devices
ranging from automobiles to oil wells to refrigerators.

In machine learning, a computing device produces analysis without human intervention. Iterative
statistical models apply a series of mathematical formulas, and with each computation, the machine
learns different pieces of intelligence that it uses to fine-tune the results.
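
The loop below is a toy illustration of that iterative idea, not any vendor's algorithm: a one-variable linear model repeatedly adjusts its weight and bias by gradient descent, and each pass reduces the error on the small synthetic data set it has seen.

public class IterativeModel {
    public static void main(String[] args) {
        // Tiny synthetic data set in which y is roughly 2x + 1.
        double[] x = {1, 2, 3, 4, 5};
        double[] y = {3.1, 4.9, 7.2, 9.0, 11.1};

        double weight = 0.0, bias = 0.0;
        double learningRate = 0.01;

        // Each iteration nudges the parameters to reduce the squared error.
        for (int iteration = 0; iteration < 5000; iteration++) {
            double weightGradient = 0.0, biasGradient = 0.0;
            for (int i = 0; i < x.length; i++) {
                double error = (weight * x[i] + bias) - y[i];
                weightGradient += error * x[i];
                biasGradient += error;
            }
            weight -= learningRate * weightGradient / x.length;
            bias -= learningRate * biasGradient / x.length;
        }

        System.out.printf("Learned model: y = %.2f * x + %.2f%n", weight, bias);
    }
}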

The theory of machine learning is that the analysis will grow more reliable over time. Google's
self-driving car is a high-profile example of machine learning in the corporate world, but consumers
encounter it whenever they click on a recommended streaming video or receive a fraud-detection alert
from their bank.

Most machine data exists in an unstructured format. Human intellect alone is not capable of
rendering this data in context. Making sense of it requires massively scalable, high-performance
storage, overlaid with powerful software intelligence that imposes structure on the raw data and
extracts it in a way that is easy to digest.

Building custom big data storage

Big data storage architecture generally falls into one of two camps: geographically distributed
server nodes, as in the Hadoop model, or scale-out NAS and object storage systems. Each has its
advantages and disadvantages. Depending on the nature of the big data storage requirements, a blend
of several systems might be used to build out your infrastructure.

Hyper-scale cloud providers often design big data storage architecture around mega-sized server
clusters with direct-attached storage. In this arrangement, PCI Express flash might be placed in
the server for performance and surrounded with just a bunch of disks (JBOD) to control storage
costs. This type of architecture is geared toward workloads that seek and open hundreds of small
files. Consequently, the setup comes with limitations, such as the inability to deliver shared
storage among users and the need to add third-party data management services.

As big data implementations have started to mature, the prospect of a do-it-yourself big data
storage platform is not as daunting as it once was, although it is not a task to undertake lightly.
It requires taking stock of internal IT to determine whether it makes sense to build a big data
storage environment.

An enterprise IT staff will shoulder the burden of drawing up hardware specs and building the
system from scratch, including sourcing components, testing and managing the overall
deployment. Hadoop integration can be challenging for enterprises used to relational database
platforms.

Big data storage also requires a development team to write code for in-house analytics software
or to integrate third-party software. Enterprises also need to weigh the cost-to-benefit ratio of
creating a system for a limited set of applications, compared to enterprise data storage that
handles a more diverse range of primary workloads.

Buying big data storage: scale-out NAS, object storage

Clustered and scale-out NAS provides shared access to parallel file-based storage. Data is
distributed among many storage nodes and capacity scales to billions of files, with independent
scaling of compute and storage. NAS is recommended for big data jobs involving large files.
Most NAS vendors provide automated tiering for data redundancy and to lower the cost per
gigabyte.

DataDirect Networks SFA, Dell EMC Isilon, NetApp FAS, Qumulo QC and Panasas ActiveStor are among
the leading scale-out NAS arrays.

Similar to scale-out NAS, object storage archive systems can extend to support potentially
billions of files. Instead of a file tree, an object storage system attaches a unique identifier to
each file. The objects are presented as a single managed system within a flat address space.
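
The toy class below illustrates that flat-address-space idea: each stored object receives a unique identifier instead of a place in a directory tree, and retrieval is by that identifier alone. It is a conceptual sketch, not the interface of any particular object storage product.

import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

public class FlatObjectStore {
    // One flat namespace: object ID -> object bytes, with no directory hierarchy.
    private final Map<String, byte[]> objects = new HashMap<>();

    // Store an object and return its unique identifier.
    public String put(byte[] data) {
        String objectId = UUID.randomUUID().toString();
        objects.put(objectId, data);
        return objectId;
    }

    // Retrieve an object by its identifier.
    public byte[] get(String objectId) {
        return objects.get(objectId);
    }

    public static void main(String[] args) {
        FlatObjectStore store = new FlatObjectStore();
        String id = store.put("sensor reading 42".getBytes());
        System.out.println("Stored object " + id + " (" + store.get(id).length + " bytes)");
    }
}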

Most legacy block storage and file storage vendors have added object storage to their portfolios.
A number of newer vendors sell object-based storage with native file support.

A data lake is sometimes considered an outgrowth of object storage, although critics deride the
term as a marketing ploy. Frequently associated with Hadoop-based big data storage, a data lake
streamlines the management of non-relational data scattered among numerous Hadoop clusters.

Big data storage security strategies

Big data storage raises several security issues, and no single tool addresses them all. NoSQL
databases are hailed for their ability to ingest, manage and quickly process a flood of complex
data. However, aside from some inherent security basics, NoSQL security is not as robust as that of
more mature relational databases, underscoring the need to surround a big data project with data
management and data protection tools for critical data.
