
Mechanical Punch Card

Herman Hollerith invented the mechanical punch card in 1890. Punch cards were made of thick card stock with dimensions equal to a bank note (7 3/8 inches wide by 3 ¼ inches high). The first major application of the punch card was tallying the 1890 Census. Compared to doing the computations by hand, this was a huge advance in productivity: the 1890 Census took only one year to tabulate, while the 1880 Census had taken eight. To use the cards you needed a punch, a tabulating machine and a sorting machine; no actual computer was required. The cards and machines evolved over the years, but the equipment cost was high and the process of using the cards remained cumbersome.
Magnetic Tape 
Magnetic tape, the cold storage king, was invented
in 1928 by Fritz Pfleumer. It was originally used to
record audio and then expanded to storing data in
1951 with the invention of the UNISERVO I, the
first digital tape recorder. The UNISERVO I was
implemented in the UNIVAC I computer. The
UNISERVO tape drives were able to provide both
read and write functionality at a rate of 12.8 KBps
and had a storage capacity up to 184 KB.
Hard Disk Drive (HDD)
IBM designed the first hard disk drive in 1956. The IBM 350 Disk File
was designed to work with the IBM 305 RAMAC mainframe
computer. This first hard drive weighed about 1 ton and stood almost
6 feet high. The IBM 350 Disk File was configured with 50 magnetic
disks which contained 50,000 sectors and stored 3.75MB of data.
The IBM 305 RAMAC mainframe computer combined with the IBM
350 Disk File allowed businesses to record and maintain data in real
time. In 1980 the 5.25 inch drive arrived on the market and significantly reduced the space needed for the drive. Today's drives are 3.5 inches or smaller, yet far larger in storage capacity (3.5 inch HDDs can hold up to 10 TB).
Virtualization
The first program that created virtual machines was released to the public in 1968. Before its release, time-sharing a mainframe meant manually dividing its resources, including memory, disk space and CPU cycles, among users. Virtualization, with later advancements from VMware and Microsoft, revolutionized the way data was stored in the data center.
With virtualization, data center managers are able to
maximize their storage space and reduce their hardware
expenses.
Solid State Drive (SSD)
The first solid state drive, Dataram's Bulk Core, was introduced to the public in 1976. It was sized very differently from its modern successors, measuring 19 inches wide by 15.75 inches tall in a rack-mount chassis. The Bulk Core could hold up to eight memory boards with 256 KB of RAM chips. The main advantage of an SSD for managing data is its faster storage and retrieval times, which makes it great for frequently accessed data and operating systems. Other benefits include reduced power consumption, size and noise, and the ability to function in more extreme environments.
Software Defined Storage (SDS)
The concept of Software Defined Storage was born in the 1990s
as a mechanism for common interoperability and manageability
between hardware manufacturers within a network. Fearing that SDS would remove the differentiators in their hardware, manufacturers held off on bringing it to market. When virtualization took off, so did the demands that applications placed on the hardware behind it. SDS was brought in to work with virtualization to manage hardware requirements and allocate workloads. With the addition of SDS, both data management and hardware utilization have become more efficient.
Cloud Storage
• Although the idea of Cloud Storage originated in the 1960s, it did not really come to life until the arrival of Salesforce.com in 1999. Salesforce.com enabled the delivery of enterprise applications via a website, something that had not been done before. This delivery of an application over the internet opened the door for other software companies. Amazon followed suit in 2002 with internet-based (cloud-based) services including storage. In 2006, Amazon began renting computers that small businesses could run their applications on, and with this the cloud as we know it took shape. Since then a variety of cloud providers have entered the space, and ISO standards for the cloud have recently been implemented.
• Data management in the cloud removes the responsibility of
hardware management from the company and places it in the hands
of the Cloud Storage provider, thus reducing hardware costs for the
data center.
Evolution of Data Management
• Manual Record (4000 BC – AD 1900).
• Punched Card Record (1900 – 1955).
• Programmed Record (1955 – 1970).
• On-Line Network Data Management (1965 – 1980).
• Relational Data Management.
• Multimedia Databases.
• Web-Based System.
Big Data Storage
• Big data storage primarily supports storage and input/output operations on very large numbers of data files and objects. A typical big data storage architecture is made up of a redundant and scalable supply of direct attached storage (DAS) pools, scale-out or clustered network attached storage (NAS), or an infrastructure based on an object storage format.
• The storage infrastructure is connected to computing server nodes that enable quick processing and retrieval of large quantities of data. Moreover, most big data storage architectures and infrastructures have native support for big data analytics solutions such as Hadoop, Cassandra and NoSQL databases.
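As a rough illustration of the object storage format mentioned above, the minimal Python sketch below models a flat key-to-object namespace with attached metadata, rather than a directory hierarchy; the class and method names are hypothetical and do not correspond to any particular product's API.

# Toy sketch of an object-store-style interface: a flat key -> (bytes, metadata)
# namespace instead of a directory tree. All names here are hypothetical.
class ToyObjectStore:
    def __init__(self):
        self._objects = {}  # key -> (data bytes, metadata dict)

    def put(self, key, data, **metadata):
        """Store an object under a flat key together with arbitrary metadata."""
        self._objects[key] = (bytes(data), dict(metadata))

    def get(self, key):
        """Retrieve an object's bytes by key."""
        data, _meta = self._objects[key]
        return data

    def head(self, key):
        """Return only the metadata, without the data itself."""
        _data, meta = self._objects[key]
        return meta

store = ToyObjectStore()
store.put("sensors/2024/05/device-17.json", b'{"temp": 21.5}',
          content_type="application/json", source="sensor")
print(store.head("sensors/2024/05/device-17.json"))
print(store.get("sensors/2024/05/device-17.json"))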
Column and row oriented databases

• Row oriented databases are databases that organize data by record, keeping all of the data associated with a record next to each other in memory.
• Column oriented databases are databases that
organize data by field, keeping all of the data
associated with a field next to each other in
memory.
Example data:
• In a row oriented database, an example table would be stored record by record, with all of the fields of a record kept next to each other (see the sketch at the end of this section).
• Row oriented databases perform write operations very quickly and are still widely used for Online Transactional Processing (OLTP) style applications.
• In a column oriented database, the same table would be stored field by field, with all of the values of a column kept next to each other.

• Column oriented databases perform aggregate functions quickly and efficiently, as all of the data located in a field is stored sequentially in memory.
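As a concrete illustration of the two layouts (the original example table is not reproduced here, so the records below are hypothetical), this minimal Python sketch stores the same three records both row by row and column by column:

# Hypothetical example table: three records with id, name and price fields.
records = [
    (1, "apple", 0.50),
    (2, "banana", 0.25),
    (3, "cherry", 2.00),
]

# Row-oriented layout: all fields of a record sit next to each other,
# which makes writing or reading a whole record (OLTP style) cheap.
row_store = list(records)

# Column-oriented layout: all values of a field sit next to each other,
# which makes aggregate functions over one field cheap.
column_store = {
    "id":    [r[0] for r in records],
    "name":  [r[1] for r in records],
    "price": [r[2] for r in records],
}

# Aggregate over one field: a sequential scan of a single list in the column store.
average_price = sum(column_store["price"]) / len(column_store["price"])
print(average_price)

# Whole-record write in the row layout is a single append
# (the column layout would need one append per field).
row_store.append((4, "date", 3.75))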
Big Data Architecture
• Big data architecture primarily serves as the key
design reference for big data infrastructures
and solutions.
• It is created by big data designers/architects
before physically implementing a big data
solution. Creating big data architecture
generally requires an understanding of the
organization and its big data needs.
The architecture has multiple layers; four logical layers exist in any big data architecture:
• Big data sources layer: Data sources for big data architecture are all over the map.
Data can come from company servers and sensors, or from third-party
data providers. The big data environment can take data in batch mode or real-time.
A few data source examples include enterprise applications like ERP or CRM, MS
Office docs, data warehouses and relational database management systems
(RDBMS), databases, mobile devices, sensors, social media, and email.
• Data massaging and storage layer: This layer receives data from the sources. If
necessary, it converts unstructured data to a format that analytic tools can
understand and stores the data according to its format. The big data architecture
might store structured data in an RDBMS, and unstructured data in a specialized file
system like Hadoop Distributed File System (HDFS), or a NoSQL database.
• Analysis layer: The analytics layer interacts with stored data to extract business
intelligence. Multiple analytics tools operate in the big data environment.
Structured data supports mature technologies like sampling, while unstructured
data needs more advanced (and newer) specialized analytics toolsets.
• Consumption layer: This layer receives analysis results and presents them to the appropriate output. Outputs include human viewers, applications, and business processes.
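Read together, these four layers form a simple pipeline. The hedged Python sketch below wires hypothetical placeholder functions together in that order; every function name and the sample record are illustrative assumptions, not part of any specific product.

# Hypothetical sketch of the four logical layers as a pipeline.
# Each function is a placeholder standing in for real systems.

def read_sources():
    """Big data sources layer: batch or real-time records from apps, sensors, etc."""
    return [{"sensor_id": 7, "reading": "21.5", "unit": "C"}]

def massage_and_store(raw_records):
    """Data massaging and storage layer: convert data to an analyzable format and keep it."""
    cleaned = [{**r, "reading": float(r["reading"])} for r in raw_records]
    return {"structured": cleaned}  # stands in for an RDBMS, HDFS or a NoSQL store

def analyze(storage):
    """Analysis layer: extract business intelligence from the stored data."""
    readings = [r["reading"] for r in storage["structured"]]
    return {"mean_reading": sum(readings) / len(readings)}

def consume(results):
    """Consumption layer: present results to people, applications or processes."""
    print("Report:", results)

consume(analyze(massage_and_store(read_sources())))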
In addition to the logical layers, four major processes operate
cross-layer in the big data environment: data source connection,
governance, systems management, and quality of service (QoS).
• Connecting to data sources: Fast data access requires connectors and adapters that can efficiently connect to different storage systems, protocols and networks, and that can handle data formats ranging from database records to social media content to sensor readings.
• Governing big data: Big data architecture includes governance provisions for privacy and security.
Organizations can choose to use native compliance tools on analytics storage systems, invest in
specialized compliance software for their Hadoop environment, or sign service level security
agreements with their cloud Hadoop provider. Compliance policies must operate from the point of
ingestion through processing, storage, analysis, and deletion or archive.
• Managing systems: Big data architecture is typically built on large-scale distributed clusters with
highly scalable performance and capacity. It must continuously monitor and address system health
via central management consoles. If your big data environment is in the cloud, you will still need to
spend time and effort to establish and monitor strong service level agreements (SLAs) with your
cloud provider.
• Protecting Quality of service: QoS is the framework that supports defining data quality, compliance
policies, ingestion frequency and sizes, and filtering data. For example, a public cloud provider
experimented with QoS-based data storage scheduling in a cloud-based, distributed big data
environment. The provider wanted to improve the data massaging and storage layer's availability and response time, so it automatically routed ingested data to predefined virtual clusters based on QoS service levels.
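As a rough sketch of the QoS-based routing idea in the last bullet, the Python snippet below assigns ingested items to predefined virtual clusters according to a service-level field; the cluster names, QoS levels and sample payloads are hypothetical assumptions, not taken from any real provider.

# Hedged sketch: route ingested data to predefined virtual clusters by QoS level.
# Cluster names, QoS levels and the sample payloads are hypothetical.
QOS_ROUTES = {
    "gold":   "low-latency-cluster",
    "silver": "standard-cluster",
    "bronze": "batch-cluster",
}

def route(item):
    """Pick a target cluster from the item's QoS service level (default to batch)."""
    return QOS_ROUTES.get(item.get("qos", "bronze"), "batch-cluster")

ingested = [
    {"id": 1, "qos": "gold", "payload": "clickstream event"},
    {"id": 2, "qos": "bronze", "payload": "archived log line"},
]

for item in ingested:
    print(item["id"], "->", route(item))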
Big Data computation

The R Project
• R is a free software environment for statistical
computing and graphics. R is a general statistical
analysis platform that runs on the command line. R can
find means, medians, standard deviations, correlations
and much more, including linear and generalized
linear models, nonlinear regression models, time
series analysis, classical parametric and nonparametric
tests, clustering and smoothing.
Time Flow
• This is desktop software for analyzing temporal data. TimeFlow can generate visual
timelines from text files, with entries color-
and size-coded for easy pattern spotting. It
also allows the information to be sorted and
filtered, and it gives some statistical
summaries of the data.
NodeXL
• NodeXL is visualization and analysis software for networks and relationships. It draws on social network analysis, the discipline of finding connections between people based on various data sets.
Tableau Public
• Tableau Public is a data visualization tool that lets anyone organize and present information. It is exceptionally powerful in business because it communicates insights through data visualization. This tool can turn
data into any number of visualizations, from simple to
complex. You can drag and drop fields onto the work
area and ask the software to suggest a visualization
type, then customize everything from labels and tool
tips to size, interactive filters and legend display.
CSVKit
• CSVKit contains tools for importing, analyzing
and reformatting comma-separated data files.
CSVKit makes it quick and easy to preview, slice and summarize a file before examining it in depth.
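CSVKit itself ships as a set of command-line utilities. As a rough sketch of the same preview-and-summarize idea using only Python's standard library (not CSVKit's own tools), the snippet below prints the first few rows of a file and summarizes one numeric column; the file name "data.csv" and the "price" column are hypothetical.

# Sketch of previewing and summarizing a CSV with the standard library,
# in the spirit of CSVKit; "data.csv" and the "price" column are hypothetical.
import csv
import statistics

with open("data.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Preview: show the first three rows.
for row in rows[:3]:
    print(row)

# Summarize one numeric column.
prices = [float(r["price"]) for r in rows if r.get("price")]
print("count:", len(prices), "mean:", statistics.mean(prices),
      "median:", statistics.median(prices))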
Relational database in Big Data
• Big data is becoming an important element in
the way organizations are leveraging high-
volume data at the right speed to solve specific
data problems. Relational Database Management Systems (RDBMS) play an important role in handling this high-volume data. Big data does not live in isolation.
To be effective, companies often need to
combine the results of big data analysis with
the data that exists within the business.
• Relational databases are built on one or more
relations and are represented by tables. These
tables are defined by their columns, and the data
is stored in the rows. The primary key is often the
first column in the table. The consistency of the
database and much of its value are achieved by
“normalizing” the data. Normalized data has been converted from its native format into a shared, agreed-upon format.
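As an illustration of tables, primary keys and combining big data results with existing business data, here is a hedged sketch using Python's built-in sqlite3 module; the table names, columns and figures are hypothetical.

# Hedged sketch: a small relational table whose first column is the primary key,
# joined against (hypothetical) aggregated results from a big data job.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Business data that already lives in the relational database.
cur.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, region TEXT)")
cur.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                [(1, "Acme", "EU"), (2, "Globex", "US")])

# Hypothetical aggregated output of a big data analysis, loaded as a table.
cur.execute("CREATE TABLE clickstream_summary (customer_id INTEGER PRIMARY KEY, total_clicks INTEGER)")
cur.executemany("INSERT INTO clickstream_summary VALUES (?, ?)",
                [(1, 12840), (2, 9310)])

# Combine the analysis results with the data that exists within the business.
cur.execute("""
    SELECT c.name, c.region, s.total_clicks
    FROM customers c JOIN clickstream_summary s ON c.customer_id = s.customer_id
""")
for row in cur.fetchall():
    print(row)
conn.close()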
