
Storage Systems

A cloud provides the vast amounts of storage and computing cycles demanded by many applications. The
network-centric content model allows a user to access data stored on a cloud from any device connected to the Internet.
Mobile devices with limited power reserves and local storage take advantage of cloud environments to store audio and
video files. Clouds provide an ideal environment for multimedia content delivery.
A variety of sources feed a continuous stream of data to cloud applications. An ever-increasing number of
cloud-based services collect detailed data about their services and information about the users of these services. Then the
service providers use the clouds to analyze that data.
A new concept, “big data,” reflects the fact that many applications use data sets so large that they cannot be
stored and processed using local resources. The consensus is that “big data” growth can be viewed as a three-dimensional
phenomenon; it implies an increased volume of data, requires an increased processing speed to process more data and
produce more results, and at the same time it involves a diversity of data sources and data types.

6.1 Advantages and disadvantages of cloud storage

Cloud storage is a service in which data is remotely maintained, managed, and backed up. The service allows users to store files online so that they can access them from any location via the Internet. According to a survey of more than 800 business decision makers and users worldwide, the number of organizations gaining a competitive advantage through high cloud adoption has almost doubled in recent years, and the public cloud services market was predicted to exceed $244 billion by 2017. Now, let's look into some of the advantages and disadvantages of cloud storage.
Advantages of Cloud Storage

1. Usability: All cloud storage services reviewed in this topic have desktop folders for Macs and PCs. This allows users to drag and drop files between cloud storage and their local storage.
2. Bandwidth: You can avoid emailing files to individuals and instead send a web link to recipients through your email.
3. Accessibility: Stored files can be accessed from anywhere via an Internet connection.
4. Disaster Recovery: It is highly recommended that businesses have a backup plan ready in case of an emergency. Cloud storage can serve as such a plan by providing a second copy of important files. These files are stored at a remote location and can be accessed through an Internet connection.
5. Cost Savings: Businesses and organizations can often reduce annual operating costs by using cloud storage, which costs about 3 cents per gigabyte of stored data. Users can see additional cost savings because storing information remotely does not require internal power.

Disadvantages of Cloud Storage

1. Usability: Be careful when using drag/drop to move a document into the cloud storage folder. This will permanently
move your document from its original folder to the cloud storage location. Do a copy and paste instead of drag/drop if
you want to retain the document’s original location in addition to moving a copy onto the cloud storage folder.
2. Bandwidth: Several cloud storage services have a specific bandwidth allowance. If an organization surpasses the
given allowance, the additional charges could be significant. However, some providers allow unlimited bandwidth. This
is a factor that companies should consider when looking at a cloud storage provider.
3. Accessibility: If you have no internet connection, you have no access to your data.
4. Data Security: There are concerns about the safety and privacy of important data stored remotely. The possibility of private data commingling with data from other organizations makes some businesses uneasy.
5. Software: If you want to be able to manipulate your files locally through multiple devices, you’ll need to download the
service on all devices.

6.2 The evolution of storage technology


Since the invention of computers and other computing devices, data storage capacity has always been a major concern. As computers became more advanced, data sizes increased, creating an ever-growing demand for storage capacity.
Data storage devices have evolved drastically, from large trunks with the capacity to hold a few kilobytes of data to microchips able to hold a few gigabytes.

The technological capacity to store information has grown over time at an accelerated pace:
1986: 2.6 EB; equivalent to less than one 730 MB CD-ROM of data per computer user.
1993: 15.8 EB; equivalent to four CD-ROMs per user.
2000: 54.5 EB; equivalent to 12 CD-ROMs per user.
2007: 295 EB; equivalent to almost 61 CD-ROMs per user.
6.3 Storage models, file systems, and databases
A storage model describes the layout of a data structure in physical storage; a data model captures
the most important logical aspects of a data structure in a database. The physical storage can be a
local disk, a removable media, or storage accessible via a network.

File systems and databases


A file management system, better known as a file system, is the oldest and still the most popular way to keep data files organised on your drives.
On the other hand, when security and the management of data under constraints are the main concern, the first choice of many experts is a database management system (DBMS).
So what are they, and what parameters decide which one best fits your needs? Let's consider these aspects now.
A file system is the traditional way to keep data organised for easy physical access, whether on a shelf or on a drive.
Earlier, people kept records and maintained data in registers, and any alteration or retrieval of that data was difficult. When computers arrived, the same approach was followed for storing data on drives.
A file system stores data as isolated files, each with its own set of properties and a physical location on the drive, and the user manually goes to these locations to access the files.
It is an easy way to store general files such as images, text, video, and audio, but security is weaker because the only protections available are those provided by the operating system, such as locks, hidden files, and sharing permissions.
These files are hard to maintain when they change frequently.
Data redundancy is high and cannot easily be controlled, data integration is hard to achieve, and data consistency is not guaranteed.
A database management system (DBMS) is an effective way to store data when constraints are strict and data maintenance and security are the user's primary concerns.
A DBMS stores data in the form of interrelated tables and files. A database installation generally consists of the DBMS software used to store and manipulate the databases, the hardware where the data is physically stored, and user-friendly application software developed for specific purposes, through which users can access the database without worrying about its underlying schema.
A DBMS is a great way to manage data: data redundancy is minimized because data entities are interrelated, data integration is supported by the centralisation of data in the database, and data security is strengthened through password protection, encryption/decryption, the granting of authorized access, and other mechanisms.
File System vs DBMS – Differences between File System and DBMS

• A file system is a general, easy-to-use way to store general files that require little security and few constraints; a database management system is used when security requirements and constraints are high.
• Data redundancy is higher in a file management system; it is lower in a database management system.
• Data inconsistency is more common in a file system; it is less common in a database management system.
• Centralisation is hard to achieve with a file management system; it is achieved in a database management system.
• In a file management system the user locates the physical address of the files to access the data; in a database management system the user is unaware of the physical address where the data is stored.
• Security is low in a file management system; it is high in a database management system.
• A file management system stores unstructured data as isolated files/entities; a database management system stores structured data with well-defined constraints and interrelations.
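The contrast between the two approaches can be made concrete with a small sketch: the same records are stored first as a plain text file that the application must parse itself, and then in an SQLite database where the DBMS enforces a primary-key constraint and answers queries. The file name, table name, and fields below are illustrative only.

```python
# Minimal sketch contrasting file-based storage with a DBMS (SQLite).
# File name, table name, and fields are illustrative, not from the text.
import sqlite3

# File-system approach: the application manages layout, lookup, and integrity itself.
with open("customers.txt", "w") as f:
    f.write("1,Alice,alice@example.com\n")
    f.write("2,Bob,bob@example.com\n")

# Finding a record means scanning and parsing the file manually.
with open("customers.txt") as f:
    for line in f:
        cid, name, email = line.strip().split(",")
        if cid == "2":
            print("file lookup:", name, email)

# DBMS approach: the engine enforces constraints (primary key) and answers queries.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                 [(1, "Alice", "alice@example.com"), (2, "Bob", "bob@example.com")])
print("DBMS lookup:", conn.execute(
    "SELECT name, email FROM customers WHERE id = ?", (2,)).fetchone())
conn.close()
```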

6.4 General Parallel File System


Parallel I/O implies execution of multiple input/output operations concurrently. Support for parallel I/O is
essential to the performance of many applications [236]. Therefore, once distributed file systems became ubiquitous, the
natural next step in the evolution of the file system was to support parallel access. Parallel file systems allow multiple
clients to read and write concurrently from the same file.
Concurrency control is a critical issue for parallel file systems. Several semantics for handling the shared
access are possible. For example, when the clients share the file pointer, successive reads issued by multiple clients
advance the file pointer; another semantic is to allow each client to have its own file pointer. Early supercomputers such
as the Intel Paragon took advantage of parallel file systems to support applications based on the single-program, multiple-data (SPMD) paradigm.
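A minimal sketch of the two sharing semantics mentioned above, using hypothetical class names: one toy file object keeps a single pointer that every client's read advances, while the other keeps a separate pointer per client.

```python
# Toy sketch of the two file-pointer semantics described above (not parallel file system code).
class SharedPointerFile:
    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0  # single pointer shared by every client

    def read(self, client: str, n: int) -> bytes:
        chunk = self.data[self.pos:self.pos + n]
        self.pos += len(chunk)          # any client's read advances the shared pointer
        return chunk

class PrivatePointerFile:
    def __init__(self, data: bytes):
        self.data = data
        self.pos = {}                   # one pointer per client

    def read(self, client: str, n: int) -> bytes:
        p = self.pos.get(client, 0)
        chunk = self.data[p:p + n]
        self.pos[client] = p + len(chunk)
        return chunk

shared = SharedPointerFile(b"abcdefgh")
print(shared.read("A", 4), shared.read("B", 4))    # b'abcd' b'efgh': B continues where A stopped
private = PrivatePointerFile(b"abcdefgh")
print(private.read("A", 4), private.read("B", 4))  # b'abcd' b'abcd': each client reads from its own offset
```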

The General Parallel File System (GPFS) was developed at IBM in the early 2000s as a successor to the
TigerShark multimedia file system. GPFS is a parallel file system that closely emulates the behavior of a general-purpose POSIX file system running on a single machine. GPFS was designed for optimal performance on large clusters; it can support a file system of up to 4 PB consisting of up to 4,096 disks of 1 TB each.
The maximum file size is 2^63 − 1 bytes. A file consists of blocks of equal size, ranging from 16 KB to 1 MB, striped across several disks. The system can support not only very large files but also a very large number of files.
The directories use extensible hashing techniques to access a file. The system maintains user data, file metadata such as
the time when last modified, and file system metadata such as allocation maps. Metadata, such as file attributes and data
block addresses, is stored in inodes and indirect blocks.
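The block-and-striping layout can be illustrated with a little arithmetic: a byte offset maps to a block index, and blocks are assigned round-robin to disks. The block size (1 MB) and disk count (8) below are illustrative choices, not a GPFS configuration.

```python
# Sketch of round-robin striping of equal-size blocks across disks (illustrative numbers).
BLOCK_SIZE = 1 * 1024 * 1024   # 1 MB, the upper end of the block-size range cited above
NUM_DISKS = 8                  # hypothetical small stripe group

def locate(offset: int):
    """Map a byte offset within a file to (block index, disk, offset inside block)."""
    block = offset // BLOCK_SIZE
    return block, block % NUM_DISKS, offset % BLOCK_SIZE

for off in (0, BLOCK_SIZE - 1, 5 * BLOCK_SIZE + 42):
    print(off, "->", locate(off))
```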

6.5 Distributed File System


The Distributed File System (DFS) functions provide the ability to logically group shares on multiple servers
and to transparently link shares into a single hierarchical namespace. DFS organizes shared resources on a network in a
treelike structure.
DFS supports stand-alone DFS namespaces, those with one host server, and domain-based namespaces that have multiple host servers and high availability.
The DFS topology data for domain-based namespaces is stored in Active Directory. The data includes the
DFS root, DFS links, and DFS targets.
Each DFS tree structure has one or more root targets. The root target is a host server that runs the DFS
service.
A DFS tree structure can contain one or more DFS links. Each DFS link points to one or more shared folders
on the network.
We can add, modify, and delete DFS links from a DFS namespace. When the last target associated with a DFS link is removed, DFS deletes the link from the namespace.
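A toy model of a namespace, assuming invented server and share names, shows how each DFS link maps to one or more targets and how removing the last target removes the link itself, as described above.

```python
# Toy model of a DFS namespace: each link maps to one or more shared-folder targets.
# Server and share names are made up for illustration.
namespace = {
    r"\\contoso\public\docs":  [r"\\fileserver1\docs", r"\\fileserver2\docs"],
    r"\\contoso\public\media": [r"\\mediaserver\media"],
}

def remove_target(link: str, target: str):
    """Remove a target; when the last target goes, the link itself is deleted."""
    targets = namespace.get(link, [])
    if target in targets:
        targets.remove(target)
    if not targets:
        namespace.pop(link, None)

remove_target(r"\\contoso\public\media", r"\\mediaserver\media")
print(namespace)   # the 'media' link disappears along with its last target
```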
Basic distributed file systems provide an essential underpinning for organizational computing based on
intranets.

File systems were originally developed for centralized computer systems and desktop computers as an
operating system facility providing a convenient programming interface to disk storage.
They subsequently acquired features such as access-control and file-locking mechanisms that made them
useful for the sharing of data and programs.
Distributed file systems support the sharing of information in the form of files and hardware resources in the
form of persistent storage throughout an intranet. A well-designed file service provides access to files stored at a server
with performance and reliability similar to, and in some cases better than, files stored on local disks.
They enable programs to store and access remote files exactly as they do local ones, allowing users to access their files from any computer in an intranet. The concentration of persistent storage at a few servers reduces the need for local disk storage and (more importantly) enables economies to be made in the management and archiving of the persistent data owned by an organization. Other services, such as the name service, the user authentication service, and the print service, can be more easily implemented when they can call upon the file service to meet their needs for persistent storage.

6.6 Google File System


The Google File System (GFS) was developed in the late 1990s. It uses thousands of storage systems built
from inexpensive commodity components to provide petabytes of storage to a large user community with diverse needs.
It is not surprising that a main concern of the GFS designers was to ensure the reliability of a system exposed to hardware
failures, system software errors, application errors, and last but not least, human errors.
The system was designed after a careful analysis of the file characteristics and of the access models. Some of
the most important aspects of this analysis reflected in the GFS design are:
• Scalability and reliability are critical features of the system; they must be considered from the
beginning rather than at some stage of the design.
• The vast majority of files range in size from a few GB to hundreds of TB.
• The most common operation is to append to an existing file; random write operations to a file
are extremely infrequent.
• Sequential read operations are the norm.
• The users process the data in bulk and are less concerned with the response time.
• The consistency model should be relaxed to simplify the system implementation, but without
placing an additional burden on the application developers.

Several design decisions were made as a result of this analysis:


1. Segment a file into large chunks (see the sketch after this list).
2. Implement an atomic file append operation allowing multiple applications operating
concurrently to append to the same file.
3. Build the cluster around a high-bandwidth rather than low-latency interconnection network.
Separate the flow of control from the data flow; schedule the high-bandwidth data flow by
pipelining the data transfer over TCP connections to reduce the response time. Exploit network
topology by sending data to the closest node in the network.
4. Eliminate caching at the client site. Caching increases the overhead for maintaining consistency
among cached copies at multiple client sites and it is not likely to improve performance.
5. Ensure consistency by channeling critical file operations through a master, a component of the
cluster that controls the entire system.
6. Minimize the involvement of the master in file access operations to avoid hot-spot contention and
to ensure scalability.
7. Support efficient checkpointing and fast recovery mechanisms.
8. Support an efficient garbage-collection mechanism.
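Decision 1 can be illustrated with a short sketch: with fixed chunks of the 64 MB size mentioned later in this section, a client translates a byte offset into a chunk index and an offset within that chunk before asking the master where the chunk is stored. This is a simplified illustration, not GFS client code.

```python
# Sketch of decision 1: translate a byte offset into (chunk index, offset within chunk).
CHUNK_SIZE = 64 * 1024 * 1024  # the 64 MB chunk size discussed later in this section

def chunk_of(offset: int):
    return offset // CHUNK_SIZE, offset % CHUNK_SIZE

print(chunk_of(0))                   # (0, 0)
print(chunk_of(200 * 1024 * 1024))   # (3, 8388608): a 200 MB offset falls in the fourth chunk
```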

GFS was built primarily as the fundamental storage service for Google’s search engine. As the size of the
web data that was crawled and saved was quite substantial, Google needed a distributed file system to
redundantly store massive amounts of data on cheap and unreliable computers. None of the traditional distributed
file systems could provide such functions or hold such large amounts of data. In addition, GFS was designed for
Google applications, and Google applications were built for GFS. In traditional file system design, such a philosophy
is not attractive, as there should be a clear interface between applications and the file system, such as a POSIX
interface.
Thus, Google made some special decisions regarding the design of GFS. As noted earlier, a 64 MB block
size was chosen. Reliability is achieved by using replications (i.e., each chunk or data block of a file is replicated
across at least three chunk servers). A single master coordinates access as well as maintains the metadata.
The architecture of GFS will look very familiar if you know HDFS. In GFS there is a single master server (similar to the HDFS NameNode) and one chunkserver per server (similar to an HDFS DataNode). Files are broken down into large, fixed-size chunks of 64 MB (similar to HDFS blocks), which are stored as local Linux files and are replicated for high availability (three replicas by default). The master maintains all the metadata of the files and chunks in memory.
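A toy picture of the master's in-memory metadata, with invented chunk handles and server names, shows the two mappings involved: file name to an ordered list of chunk handles, and chunk handle to the chunk servers holding its replicas.

```python
# Toy model of the master's in-memory metadata (handles and server names are invented).
file_to_chunks = {
    "/logs/crawl-2024": ["c001", "c002", "c003"],
}
chunk_locations = {
    "c001": ["cs-11", "cs-27", "cs-40"],
    "c002": ["cs-03", "cs-27", "cs-55"],
    "c003": ["cs-11", "cs-19", "cs-63"],
}

def replicas_for(path: str, chunk_index: int):
    """A client asks the master which servers hold a chunk, then reads from them directly."""
    handle = file_to_chunks[path][chunk_index]
    return handle, chunk_locations[handle]

print(replicas_for("/logs/crawl-2024", 1))   # ('c002', ['cs-03', 'cs-27', 'cs-55'])
```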

6.7 Apache Hadoop


Hadoop supports distributed applications handling extremely large volumes of data. Many members of the
community contributed to the development and optimization of Hadoop and several related Apache projects such as Hive
and HBase.
Hadoop is used by many organizations from industry, government, and research; the long list of Hadoop users
includes major IT companies such as Apple, IBM, HP, Microsoft, Yahoo!, and Amazon; media companies such as The
New York Times and Fox; social networks, including Twitter, Facebook, and LinkedIn; and government agencies, such
as the U.S. Federal Reserve. In 2011, the Facebook Hadoop cluster had a capacity of 30 PB.
The Hadoop core is divided into two fundamental layers: the MapReduce engine and HDFS. The MapReduce engine is the computation engine running on top of HDFS, which serves as its data storage manager.
HDFS: HDFS is a distributed file system inspired by GFS that organizes files and stores their data on a
distributed computing system.
HDFS Architecture: HDFS has a master/slave architecture containing a single NameNode as the master and a number of DataNodes as workers (slaves). To store a file in this architecture, HDFS splits the file into fixed-size blocks (initially 64 MB, now 128 MB) and stores them on the workers (DataNodes). The mapping of blocks to DataNodes is determined by the NameNode. The NameNode (master) also manages the file system's metadata and namespace. In such systems, the namespace is the area maintaining the metadata, and metadata refers to all the information stored by a file system that is needed for the overall management of all files. For example, the NameNode stores in its metadata all information regarding the location of the input splits/blocks in all DataNodes. Each DataNode, usually one per node in a cluster, manages the storage attached to the node. Each DataNode is responsible for storing and retrieving its file blocks.
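The block-splitting step can be sketched with simple arithmetic: a file is divided into fixed-size blocks (128 MB in current HDFS), with a final partial block when the file size is not a multiple of the block size. The file size below is illustrative.

```python
# Sketch of splitting a file into fixed-size HDFS blocks (illustrative file size).
import math

BLOCK_SIZE = 128 * 1024 * 1024   # current HDFS default block size

def split_into_blocks(file_size: int):
    """Return (block index, size of that block) for every block of the file."""
    n = math.ceil(file_size / BLOCK_SIZE)
    return [(i, min(BLOCK_SIZE, file_size - i * BLOCK_SIZE)) for i in range(n)]

# A 300 MB file becomes two full 128 MB blocks plus one 44 MB block.
print(split_into_blocks(300 * 1024 * 1024))
```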

Running a Job in Hadoop

• Job Submission Each job is submitted from a user node to the JobTracker node that might be situated in a
different node within the cluster through the following procedure:
➢ A user node asks for a new job ID from the JobTracker and computes input file splits.
➢ The user node copies some resources, such as the job’s JAR file, configuration file, and computed input
splits, to the JobTracker’s file system.
➢ The user node submits the job to the JobTracker by calling the submitJob() function.
• Task assignment The JobTracker creates one map task for each computed input split by the user node
and assigns the map tasks to the execution slots of the TaskTrackers. The JobTracker considers the
localization of the data when assigning the map tasks to the TaskTrackers. The JobTracker also creates
reduce tasks and assigns them to the TaskTrackers. The number of reduce tasks is predetermined by the
user, and there is no locality consideration in assigning them. (A sketch of the locality rule appears after this list.)
• Task execution The control flow to execute a task (either map or reduce) starts inside the TaskTracker
by copying the job JAR file to its file system. Instructions inside the job JAR file are executed after
launching a Java Virtual Machine (JVM) to run its map or reduce task.
• Task running check A task running check is performed through periodic heartbeat messages sent from the TaskTrackers to the JobTracker. Each heartbeat notifies the JobTracker that the sending TaskTracker is alive, and whether the sending TaskTracker is ready to run a new task.
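The locality rule in the task-assignment step can be sketched as follows; the node names and data structures are invented simplifications, not Hadoop source code. A TaskTracker asking for work receives a split whose data is stored on its own node if one remains, and otherwise any remaining split.

```python
# Simplified sketch of locality-aware map-task assignment (not Hadoop source code).
# Each input split lists the nodes holding its block; the tracker asking for work
# gets a data-local split if one exists, otherwise any remaining split.
splits = {
    "split-0": ["node-1", "node-2"],
    "split-1": ["node-2", "node-3"],
    "split-2": ["node-4", "node-5"],
}
unassigned = set(splits)

def assign_map_task(tasktracker_node: str):
    for s in list(unassigned):
        if tasktracker_node in splits[s]:      # data-local assignment preferred
            unassigned.remove(s)
            return s, "local"
    if unassigned:                             # otherwise fall back to a remote split
        return unassigned.pop(), "remote"
    return None, None

print(assign_map_task("node-2"))   # a split stored on node-2, assigned locally
print(assign_map_task("node-9"))   # no local data, so a remote assignment
```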

Bigtable

Google BigTable is a nonrelational, distributed and multidimensional data storage mechanism built on the
proprietary Google storage technologies for most of the company's online and back-end applications/products. It
provides scalable data architecture for very large database infrastructures.
Google BigTable is mainly used in proprietary Google products, although some access is available in the
Google App Engine and third-party database applications.
Google BigTable is a persistent, sorted map. Each value in the map is indexed by a row key, a column key (of which there are several types), and a timestamp. For example, the data for a website is saved as follows:
The reversed URL address is saved as the row name (com.google.www).
The content column stores the Web page contents.
The anchor content saves any anchor text or content referencing the page.
A time stamp provides the exact time when the data was stored and is used for sorting multiple instances
of a page.
Google BigTable is built on technologies like Google File System (GFS) and SSTable. It is used by more than
60 Google applications, including Google Finance, Google Reader, Google Maps, Google Analytics and Web
indexing.
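A toy version of the sorted map, with invented values and column names, indexes every cell by (row, column, timestamp), using the reversed URL as the row key as in the example above.

```python
# Toy sketch of a BigTable-style sorted map keyed by (row, column, timestamp).
# Values and column names are invented for illustration.
import time

table = {}

def put(row, column, value, ts=None):
    table[(row, column, ts or time.time())] = value

put("com.google.www", "contents:", "<html>...</html>")       # page contents column
put("com.google.www", "anchor:example.org", "Google")        # anchor text referencing the page

# Scanning in key order groups all columns and versions of a row together.
for key in sorted(table):
    print(key, "->", table[key])
```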

Megastore
Megastore is a storage system developed to meet the requirements of today's interactive online services. Megastore blends the scalability of a NoSQL datastore with the convenience of a traditional RDBMS in a novel way, providing both strong consistency guarantees and high availability. It offers fully serializable ACID semantics within fine-grained partitions of data. This partitioning allows Megastore to synchronously replicate each write across a wide-area network with reasonable latency and to support seamless failover between datacenters. The original paper describes Megastore's semantics and replication algorithm, as well as the designers' experience supporting a wide range of Google production services built with Megastore.
The middle ground between traditional and NoSQL databases taken by the Megastore designers is also reflected in the data model. The data model is declared in a schema consisting of a set of tables composed of entities, each entity being a collection of named and typed properties. The unique primary key of an entity in a table is created as a composition of the entity's properties. A Megastore table can be a root table or a child table. Each child entity must reference a special entity, called a root entity, in its root table. An entity group consists of the primary entity and all entities that reference it.
The system makes extensive use of BigTable. Entities from different Megastore tables can be mapped to
the same BigTable row without collisions. This is possible because the BigTable column name is a concatenation of
the Megastore table name and the name of a property. A BigTable row for the root entity stores the transaction
and all metadata for the entity group. As noted earlier in the discussion of BigTable, multiple versions of the data with different time stamps can be stored in a cell. Megastore takes advantage of this feature to implement multi-version concurrency control (MVCC); when a mutation of a transaction occurs, the mutation is recorded along with its time stamp, rather than marking the old data as obsolete.
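A small sketch, with invented table and property names, of the mapping described above: the BigTable column name is the concatenation of the Megastore table name and the property name, so entities from different Megastore tables can share one BigTable row without collision.

```python
# Sketch of mapping Megastore entities onto one BigTable row (names are invented).
def column_name(megastore_table: str, prop: str) -> str:
    return f"{megastore_table}.{prop}"

bigtable_row = {}   # one BigTable row for a hypothetical root entity and its entity group

# Root entity from a hypothetical User table.
bigtable_row[column_name("User", "name")] = "Alice"
# Child entity from a hypothetical Photo table that references the same root entity.
bigtable_row[column_name("Photo", "photo_id=7.caption")] = "Sunset"

print(bigtable_row)
# Different Megastore tables coexist in one row without collision because each
# column name is prefixed with its Megastore table name.
```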
