
UNIT-IV

1. CLOUD STORAGE: OVERVIEW OF CLOUD STORAGE

 Cloud storage is a service model in which data is maintained, managed, backed up remotely and made available to users over a network (typically the Internet).
 Users generally pay for their cloud data storage at a monthly rate based on consumption.
 Although the per-gigabyte cost has been radically driven down, cloud storage providers
have added operating expenses that can make the technology more expensive than users
bargained for. Cloud security continues to be a concern among users.
 Providers have tried to deal with those fears by building security capabilities, such as
encryption and authentication, into their services.

Advantages of Cloud Storage

1. Usability: All cloud storage services reviewed in this topic have desktop folders for Macs
and PCs. This allows users to drag and drop files between the cloud storage and their local
storage.

2. Bandwidth: You can avoid emailing files to individuals and instead send a web link to
recipients through your email.

3. Accessibility: Stored files can be accessed from anywhere via Internet connection.

4. Disaster Recovery: It is highly recommended that businesses have a backup plan ready in
case of an emergency. Cloud storage can serve as that backup plan by providing a second copy
of important files. These files are stored at a remote location and can be accessed through an
internet connection.

5. Cost Savings: Businesses and organizations can often reduce annual operating costs by using
cloud storage; cloud storage costs about 3 cents per gigabyte, typically less than the cost of
storing the same data internally. Users can see additional cost savings because storing
information remotely does not consume internal power.

Cloud Storage Requirements

Ensuring your company's critical data is safe, secure, and available when needed is essential.
There are several fundamental requirements when considering storing data in the cloud.
Durability: Data should be redundantly stored, ideally across multiple facilities and multiple
devices in each facility. Natural disasters, human error, or mechanical faults should not result in
data loss.

Availability: All data should be available when needed, but there is a difference between
production data and archives. The ideal cloud storage will deliver the right balance of retrieval
times and cost.

Security: All data is ideally encrypted, both at rest and in transit. Permissions and access
controls should work just as well in the cloud as they do for on-premises storage.

Types of Cloud Storage

There are three types of cloud data storage, and each offers its own advantages and has its own
use cases:

1. Object Storage - Applications developed in the cloud often take advantage of object storage's
vast scalability and metadata characteristics. Object storage solutions like Amazon Simple
Storage Service (S3) are ideal for building modern applications from scratch that require scale
and flexibility, and can also be used to import existing data stores for analytics, backup, or
archive (see the sketch after this list).

2. File Storage - Some applications need to access shared files and require a file system. This
type of storage is often supported with a Network Attached Storage (NAS) server. File storage
solutions like Amazon Elastic File System (EFS) are ideal for use cases like large content
repositories, development environments, media stores, or user home directories.

3. Block Storage - Other enterprise applications like databases or ERP systems often require
dedicated, low latency storage for each host. This is analogous to direct-attached storage (DAS)
or a Storage Area Network (SAN). Block-based cloud storage solutions like Amazon Elastic
Block Store (EBS) are provisioned with each virtual server and offer the ultra low latency
required for high performance workloads.
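
To make the object-storage model concrete, here is a minimal sketch using boto3, the AWS SDK for Python (see item 1 above). The bucket name and object key are placeholders, not values from any particular deployment.

```python
# Minimal object-storage sketch using boto3 (the AWS SDK for Python).
# The bucket name and object key below are placeholders.
import boto3

s3 = boto3.client("s3")  # credentials are taken from the environment or ~/.aws

# Store an object: the key is just a name; there is no real directory hierarchy.
s3.put_object(
    Bucket="example-bucket",
    Key="reports/2024/summary.txt",
    Body=b"quarterly summary goes here",
)

# Retrieve the same object later, from anywhere with network access.
response = s3.get_object(Bucket="example-bucket", Key="reports/2024/summary.txt")
print(response["Body"].read().decode("utf-8"))
```

File and block storage, by contrast, are consumed through a mounted file system or a block device attached to a server rather than through an HTTP API.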

There are three main cloud-based storage architecture models: public, private and hybrid.

Public cloud storage services provide a multi-tenant storage environment that is most suited for
unstructured data. Data is stored in global data centers with storage data spread across multiple
regions or continents. Customers generally pay on a per-use basis similar to the utility payment
model.
Private cloud, or on-premises, storage services provide a dedicated environment protected
behind an organization's firewall. Private clouds are appropriate for users who need
customization and more control over their data.

Hybrid cloud is a mix of private cloud and third-party public cloud services with orchestration
between the platforms for management. The model offers businesses flexibility and more data
deployment options.

2. CLOUD STORAGE PROVIDERS

A cloud storage provider is a third-party company that offers organizations and end users the
ability to save data to an off-site storage system.

Instead of storing data to a local hard drive or other local storage device, such as tape backup or
flash drives, the data is stored on a system in a remote data center.

Customers can access their files from anywhere with an Internet connection.

Both individual and corporate customers can store virtually unlimited amounts of data on the
provider's servers at a low price point.

The key to a cloud storage model is flexibility, so the user has the advantage of not worrying
about storage limitations. Typically, they only pay for what they use, much like a utility.

The key to storing data in the cloud is availability, and the level of availability can impact the
price.
Companies can use cloud storage providers as an offsite vault for backups, thus eliminating the
need for tape backups, or they can use cloud storage as the primary destination for backups. It
can also be used for primary storage.

There are numerous cloud storage providers, including but not limited to Nirvanix, Rackspace,
Amazon Simple Storage Service (Amazon S3), AT&T’s Synaptic Storage and Mozy.

The best cloud storage providers can be identified based on the following criteria:

 We removed services that are focused primarily on media- and OS-level backups.
 We removed services that are just for business and have no personal option.
 We cut all services without extensive support for OS X, Windows, Android, and iOS.
 We cut any cloud storage services that did not offer a premium version.
 We cut any contenders that didn’t have an average of 3.5 stars or higher from the App
Store, Google Play Store, and Windows Store.

Dropbox

Perhaps the biggest and best-known name in cloud storage, Dropbox is many people’s first and
last experience of a cloud provider. It is a highly popular choice for personal storage, thanks
to its simple interface and competitive pricing.

Dropbox syncs your files automatically and allows you to share them with your family and
friends, even if they don’t have a Dropbox account. It’s supremely usable and intuitive, and files
can be accessed from any device. You can share folders to collaborate on documents, though it’s
not really suitable for business use – it’s designed for individuals and casual users, rather than
enterprise.

Google Drive

The giant of the web, Google provides a reliable and low-cost solution for cloud storage (and
much more besides). You get 15GB of free storage, and if you need more it’s just $1.99 a month
for 100GB, or $23.88 per year. You can use Google Drive on unlimited machines, but there’s no
Linux option – just Windows and Mac.

CrashPlan

CrashPlan comes in at $5.99 a month or $59.99 a year, with a 30-day free trial. It’s available for
Windows, Mac or Linux. Once you’ve set up your preferences, CrashPlan works away in the
background, making sure you don’t lose any vital files – music, photos and docs are
continuously and automatically backed up for you.

3. CASE STUDIES: WALRUS

Walrus offers persistent storage to all of the virtual machines in the Eucalyptus cloud and can be
used as a simple HTTP put/get storage as a service solution.

There are no data type restrictions for Walrus, and it can contain images (i.e., the building blocks
used to launch virtual machines), volume snapshots (i.e., point-in-time copies), and application
data. Only one Walrus can exist per cloud.

Backing Store (where the buckets are)

Currently, Walrus requires a POSIX filesystem to store buckets and objects. By default, it uses
the filesystem at $EUCALYPTUS/var/lib/eucalyptus/bukkits, creating a directory per bucket;
each object is stored as a single file within that bucket directory.

How is the Backing Store Used

 Bucket directories are named exactly as the bucket itself, so bukkits/bucket01 is the
directory for the 'bucket01' bucket, as expected. However, object files do not retain their
name on the filesystem for several reasons and are instead stored as files which have a
random hash as the name.
 This prevents name conflicts due to object versioning and allows the system to handle
object names which are not valid POSIX filenames (as the S3 API allows '/' in object
names, for example).
 This does of course, mean that the number of buckets in a single Walrus installation is
limited by the number of subdirectories allowed by the file system.
 For ext3 that number is 32,000 and for ext4 it is 64,000. XFS and btrfs have much higher
limits (i.e., > 1M). XFS also has many other advantageous features, such as full metadata
journaling that allows repair of the filesystem without a full fsck run.
 There is plenty of info online for all of these filesystems, but we suggest XFS or ext4. Walrus
does not utilize the filesystem for storing metadata; that is all persisted in the database
managed by the Cloud Controller (CLC).
 Therefore, Walrus does not require extended metadata support from the filesystem itself.
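
The layout described above can be inspected with a short script. This sketch assumes only the documented convention: the default bukkits path, one directory per bucket, and one hash-named file per object.

```python
# Sketch: summarize the Walrus backing store layout described above.
# Assumes the default path $EUCALYPTUS/var/lib/eucalyptus/bukkits with one
# directory per bucket and one hash-named file per object.
import os

eucalyptus_home = os.environ.get("EUCALYPTUS", "/")
bukkits = os.path.join(eucalyptus_home, "var/lib/eucalyptus/bukkits")

for bucket in sorted(os.listdir(bukkits)):
    bucket_dir = os.path.join(bukkits, bucket)
    if not os.path.isdir(bucket_dir):
        continue
    objects = [name for name in os.listdir(bucket_dir)
               if os.path.isfile(os.path.join(bucket_dir, name))]
    # The file names are random hashes; the real S3 object names live in the
    # database managed by the Cloud Controller (CLC), not on the filesystem.
    print(f"{bucket}: {len(objects)} object file(s)")
```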
HA (DRBD and more)

 To provide high availability, Walrus leverages DRBD to replicate the backing block storage
device at the block level.
 This was chosen for several reasons, one of which is that it handles replication such that
once the block device reports a write is complete the system can be sure that it has been
written to disk on both hosts and is not simply in a memory buffer somewhere awaiting a
lazy write.
 One of the key issues with distributed and/or shared filesystems as a replication
mechanism is that they typically rely heavily on caching to boost performance
(ahem, NFS) at the cost of data consistency in the case of unexpected failures.
 Walrus uses a local filesystem on top of a replicated block device to ensure that
completed writes are truly completed on both hosts.
 Walrus uses DRBD for a single volume replicated using replication protocol C (DRBD
Replication) in an active-passive configuration.
 As per all of Eucalyptus's HA components, one Walrus node is ENABLED and one is
DISABLED such that only the ENABLED node services user requests.
 The enabled node has a DRBD block device (volume) mounted to the filesystem at
$EUCALYPTUS/var/lib/eucalyptus/bukkits and holds the primary role.
 On the disabled Walrus node the DRBD volume is unmounted from the filesystem and holds
the secondary role, in DRBD terms.
 If the disabled Walrus node becomes enabled it becomes primary on the DRBD volume
and mounts it in the filesystem.
 Walrus does not configure DRBD itself, but expects the administrator to have configured
a block device as a DRBD volume and to specify that volume in the Walrus configuration.

Walrus API

 Walrus implements much of the Amazon Web Services Simple Storage Service (S3) API
(see the Amazon S3 documentation).
 Specifically, Walrus supports: get/put/head/post of objects and buckets, object/bucket
ACLs, and object/bucket versioning.
 Walrus does not yet implement static websites, bucket-policies, degraded durability
modes, or multi-part uploads. We are, of course, working on many of those features, but
can't give a timeline just yet. Walrus supports the S3 SOAP API as well as REST and
supports normal S3 authorization-header-based request authentication or query-string
authentication.
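
Because Walrus exposes the S3 REST API, a standard S3 client can be pointed at it. The sketch below uses boto3; the endpoint URL, port, and credentials are illustrative assumptions and should be replaced with the values reported by your own Eucalyptus installation.

```python
# Sketch: using an ordinary S3 client (boto3) against a Walrus endpoint.
# The endpoint URL and credentials are illustrative assumptions.
import boto3
from botocore.client import Config

walrus = boto3.client(
    "s3",
    endpoint_url="http://walrus.example.com:8773/services/Walrus",  # assumed endpoint
    aws_access_key_id="EUCA_ACCESS_KEY",        # placeholder credential
    aws_secret_access_key="EUCA_SECRET_KEY",    # placeholder credential
    config=Config(signature_version="s3"),      # older S3-compatible endpoints often need the original signature scheme
)

walrus.create_bucket(Bucket="bucket01")
walrus.put_object(Bucket="bucket01", Key="hello.txt", Body=b"stored in Walrus")
print(walrus.get_object(Bucket="bucket01", Key="hello.txt")["Body"].read())
```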

Walrus and the Database

 Walrus uses a database to manage metadata for buckets and objects as well as ACLs and
policies.
 Walrus has its own database, but it is currently co-hosted on the Cloud Controller (CLC)
with all of the other databases that Eucalyptus uses.
 The current implementation leverages a PostgreSQL database, although versions prior to
Eucalyptus 3.1 ran on MySQL.
 Walrus, like all Eucalyptus components written in Java, utilizes Hibernate to interact with
the database layer, and database HA is handled via a replicating Hibernate JDBC connection.

Authentication, ACLs, Permissions

Walrus implements and supports S3-style ACLs and IAM policies. It does not support bucket-
policies or bucket-lifecycles (yet!).

The ACL instance holds a detailed list of Access Control Entries (ACEs), which are used to make
access decisions. The ACL system focuses on two main objectives:

 providing a way to efficiently retrieve and modify a large number of ACLs/ACEs for your
domain objects; and
 providing a way to easily decide whether a person is allowed to perform an action on a
domain object.

4. AMAZON S3

 Amazon Simple Storage Service (Amazon S3) is object storage with a simple web
service interface to store and retrieve any amount of data from anywhere on the web.
 It is designed to deliver 99.999999999% durability, and scale past trillions of objects
worldwide.
 Customers use Amazon S3 as primary storage and as a bulk repository for user-generated
content, as a tier in an active archive, with serverless computing applications, as a "data
lake" for Big Data analytics, and as a target for backup & recovery and disaster recovery.
 It's simple to move large volumes of data into or out of Amazon S3 with Amazon's cloud
data migration options.
 Once data is stored in S3, it can be automatically tiered into lower cost, longer-term cloud
storage classes like S3 Standard - Infrequent Access and Amazon Glacier for archiving.

Amazon S3 features

Simple-Amazon S3 is simple to use with a web-based management console and mobile app.

Durable-Amazon S3 provides durable infrastructure to store important data and is designed for
durability of 99.999999999% of objects.

Scalable-With Amazon S3, you can store as much data as you want and access it when needed.

Secure-Amazon S3 supports data transfer over SSL and automatic encryption of your data once
it is uploaded (see the sketch after this list).

Available-Amazon S3 Standard is designed for up to 99.99% availability of objects over a given
year and is backed by the Amazon S3 Service Level Agreement, ensuring that you can rely on it
when needed.

Low Cost-Amazon S3 allows you to store large amounts of data at a very low cost.

Simple Data Transfer-Amazon provides multiple options for cloud data migration, and makes it
simple and cost-effective for you to move large volumes of data into or out of Amazon S3.

Integrated-Amazon S3 is deeply integrated with other AWS services to make it easier to build
solutions that use a range of AWS services.

Easy to Manage-Amazon S3 Storage Management features allow you to take a data-driven
approach to storage optimization, data security, and management efficiency.
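
As a concrete illustration of the "Secure" feature above, server-side encryption can be requested when an object is uploaded. This is a minimal boto3 sketch; the bucket, key, and local file name are placeholders.

```python
# Sketch: upload an object and ask S3 to encrypt it at rest (SSE with S3-managed keys).
# Bucket, key, and local file name are placeholders. Transfers use HTTPS by default.
import boto3

s3 = boto3.client("s3")
with open("report.pdf", "rb") as f:          # assumed local file
    s3.put_object(
        Bucket="example-bucket",
        Key="confidential/report.pdf",
        Body=f,
        ServerSideEncryption="AES256",       # server-side encryption at rest
    )
```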

Amazon S3 Use Cases

Amazon S3 is an excellent choice for a large variety of use cases ranging from a simple storage
repository for backup & recovery to primary storage for some of the most cutting edge cloud-
native applications in the market today - and everything in between.

Backup & Archiving-Amazon S3 offers a highly durable, scalable, and secure destination for
backing up and archiving your critical data.

Content Storage & Distribution-You can distribute your content directly from Amazon S3 or use
S3 as an origin store for delivering content to your Amazon CloudFront edge locations.

Big Data Analytics-Whether you’re storing pharmaceutical or financial data, or multimedia files
such as photos and videos, Amazon S3 can be used as your big data object store.

Static Website Hosting-You can host your entire static website on Amazon S3 for a low-cost,
highly available hosting solution that can scale automatically to meet traffic demands.

Cloud-native Application Data-Amazon S3 provides high performance, highly available
storage that makes it easy to scale and maintain cost-effective mobile and Internet-based apps
that run fast.

Disaster Recovery-Amazon S3’s highly durable, secure, global infrastructure offers a robust
disaster recovery solution designed to provide superior data protection.

Amazon S3 Storage Classes

Amazon S3 offers a range of storage classes designed for different use cases.

These include Amazon S3 Standard for general-purpose storage of frequently accessed data,
Amazon S3 Standard - Infrequent Access for long-lived, but less frequently accessed data, and
Amazon Glacier for long-term archive.

Amazon S3 also offers configurable lifecycle policies for managing your data throughout its
lifecycle.

Once a policy is set, your data will automatically migrate to the most appropriate storage class
without any changes to your application.
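
The lifecycle behaviour just described can be configured through the S3 API. The sketch below uses boto3; the bucket name, prefix, and day counts are illustrative assumptions.

```python
# Sketch: a lifecycle rule that tiers objects under "logs/" to Standard - IA
# after 30 days and archives them to Glacier after 365 days.
# Bucket name, prefix, and day counts are illustrative assumptions.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="example-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-then-archive",
                "Status": "Enabled",
                "Filter": {"Prefix": "logs/"},
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 365, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)
```

Once such a rule is in place, objects move between storage classes automatically, and the application continues to read and write them through the same API.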

General Purpose

Amazon S3 Standard

 Amazon S3 Standard offers high durability, availability, and performance object storage
for frequently accessed data.
 Because it delivers low latency and high throughput, Standard is perfect for a wide
variety of use cases, including cloud applications.

Infrequent Access

Amazon S3 Standard - Infrequent Access

 Amazon S3 Standard - Infrequent Access (Standard - IA) is an Amazon S3 storage class
for data that is accessed less frequently, but requires rapid access when needed. Standard - IA
offers the high durability, throughput, and low latency of Amazon S3 Standard, with a lower
per-gigabyte storage price and a per-gigabyte retrieval fee.
 This combination of low cost and high performance makes Standard - IA ideal for long-
term storage, backups, and as a data store for disaster recovery.

Archive

Amazon Glacier

 Amazon Glacier is a secure, durable, and extremely low-cost storage service for data
archiving.
 You can reliably store any amount of data at costs that are competitive with or cheaper
than on-premises solutions.
 To keep costs low yet suitable for varying retrieval needs, Amazon Glacier provides three
options for access to archives, from a few minutes to several hours.

5. CLOUD FILE SYSTEM

Cloud file storage (CFS) is a storage service that is delivered over the Internet, billed on a pay-
per-use basis and has an architecture based on common file level protocols such as Server
Message Block (SMB), Common Internet File System (CIFS) and Network File System (NFS).

CFS and cloud block storage are the two main types of cloud storage services. (These are akin to
the two types of networked storage: NAS, or file storage, and SAN, or block storage.) Most cloud
storage services, including cloud backup services, use a file storage architecture.

Cloud file storage is most appropriate for unstructured data or semi-structured data, such as
documents, spreadsheets, presentations and other file-based data.

For applications that may have very large files, such as databases, a block storage architecture is
more appropriate to maintain satisfactory performance.

Only a handful of cloud storage services offer block storage, however, because the bandwidth
required to traverse the cloud would not likely be adequate to maintain acceptable database
performance without significant latency.

Cloud File Sharing

Cloud file sharing, also called cloud-based file sharing or online file sharing, is a system in
which a user is allotted storage space on a server and reads and writes are carried out over the
Internet.
Cloud file sharing provides end users with the ability to access files with any Internet-capable
device from any location. Usually, the user has the ability to grant access privileges to other
users as they see fit.

Although cloud file sharing services are easy to use, the user must rely upon the service provider's
ability to provide high availability (HA) and backup and recovery in a timely manner.

In the enterprise, cloud file sharing can present security risks and compliance concerns if
company data is stored on third-party providers without the IT department's knowledge. Popular
third-party providers for cloud file sharing include Box, Dropbox, Egnyte and Syncplicity.

6. MAPREDUCE

MapReduce is a core component of the Apache Hadoop software framework.

Hadoop enables resilient, distributed processing of massive unstructured data sets across
commodity computer clusters, in which each node of the cluster includes its own storage.
MapReduce serves two essential functions: it parcels out work to various nodes within the
cluster (the map step), and it organizes and reduces the results from each node into a cohesive
answer to a query (the reduce step).

MapReduce is composed of several components, including:

JobTracker -- the master node that manages all jobs and resources in a cluster

TaskTrackers -- agents deployed to each machine in the cluster to run the map and reduce tasks

JobHistoryServer -- a component that tracks completed jobs, and is typically deployed as a
separate function or with JobTracker

To distribute input data and collate results, MapReduce operates in parallel across massive
cluster sizes.

Because cluster size doesn't affect a processing job's final results, jobs can be split across almost
any number of servers.

Therefore, MapReduce and the overall Hadoop framework simplify software development.
MapReduce is available in several languages, including C, C++, Java, Ruby, Perl and Python.

Programmers can use MapReduce libraries to create tasks without dealing with communication
or coordination between nodes.
MapReduce is also fault-tolerant, with each node periodically reporting its status to a master
node.

If a node doesn't respond as expected, the master node re-assigns that piece of the job to other
available nodes in the cluster.

This creates resiliency and makes it practical for MapReduce to run on inexpensive commodity
servers.

MapReduce in action

For example, users can list and count the number of times every word appears in a novel as a
single server application, but that is time consuming.

By contrast, users can split the task among 26 people, so each takes a page, writes a word on a
separate sheet of paper and takes a new page when they're finished.

This is the map aspect of MapReduce. And if a person leaves, another person takes his place.
This exemplifies MapReduce's fault-tolerant element.

When all pages are processed, users sort their single-word pages into 26 boxes, which represent
the first letter of each word.

Each user takes a box and sorts each word in the stack alphabetically. Counting the number of
pages with the same word is an example of the reduce aspect of MapReduce.

MapReduce is a framework for processing parallelizable problems across large datasets using a
large number of computers (nodes), collectively referred to as a cluster (if all nodes are on the
same local network and use similar hardware) or a grid (if the nodes are shared across
geographically and administratively distributed systems, and use more heterogeneous hardware).

MapReduce performs the following steps:

Map" step: Each worker node applies the "map()" function to the local data, and writes the
output to a temporary storage. A master node ensures that only one copy of redundant input data
is processed.

"Shuffle" step: Worker nodes redistribute data based on the output keys (produced by the
"map()" function), such that all data belonging to one key is located on the same worker node.
"Reduce" step: Worker nodes now process each group of output data, per key, in parallel.

MapReduce can be described as a five-step parallel and distributed computation:

Prepare the Map() input – The "MapReduce system" designates Map processors, assigns the
input key value K1 that each processor would work on, and provides that processor with all the
input data associated with that key value.

Run the user-provided Map() code – Map() is run exactly once for each K1 key value,
generating output organized by key values K2.

"Shuffle" the Map output to the Reduce processors – The MapReduce system designates
Reduce processors, assigns the K2 key value each processor should work on, and provides that
processor with all the Map-generated data associated with that key value.

Run the user-provided Reduce() code – Reduce() is run exactly once for each K2 key value
produced by the Map step.

Produce the final output – The MapReduce system collects all the Reduce output, and sorts it
by K2 to produce the final outcome.

The frozen part of the MapReduce framework is a large distributed sort. The hot spots,
which the application defines, are:

 an input reader
 a Map function
 a partition function
 a compare function
 a Reduce function
 an output writer
MapReduce Architecture (diagram)

7. HADOOP

 Hadoop is an open source, Java-based programming framework that supports the
processing and storage of extremely large data sets in a distributed computing
environment.
 It is part of the Apache project sponsored by the Apache Software Foundation.
 Hadoop makes it possible to run applications on systems with thousands of commodity
hardware nodes, and to handle thousands of terabytes of data.
 Its distributed file system facilitates rapid data transfer rates among nodes and allows the
system to continue operating in case of a node failure.
 This approach lowers the risk of catastrophic system failure and unexpected data loss,
even if a significant number of nodes become inoperative.
 Consequently, Hadoop quickly emerged as a foundation for big data processing tasks,
such as scientific analytics, business and sales planning, and processing enormous
volumes of sensor data, including from internet of things sensors.
 Hadoop was created by computer scientists Doug Cutting and Mike Cafarella in 2006 to
support distribution for the Nutch search engine.
 It was inspired by Google's MapReduce, a software framework in which an application is
broken down into numerous small parts.
 Any of these parts, which are also called fragments or blocks, can be run on any node in
the cluster.
 After years of development within the open source community, Hadoop 1.0 became
publicly available in November 2012 as part of the Apache project sponsored by the
Apache Software Foundation.
 Since its initial release, Hadoop has been continuously developed and updated.
 The second iteration of Hadoop (Hadoop 2) improved resource management and
scheduling.
 It features a high-availability file-system option and support for Microsoft Windows and
other components to expand the framework's versatility for data processing and analytics.
 Organizations can deploy Hadoop components and supporting software packages in their
local data center.
 However, most big data projects depend on short-term use of substantial computing
resources.
 This type of usage is best-suited to highly scalable public cloud services, such as Amazon
Web Services (AWS), Google Cloud Platform and Microsoft Azure.
 Public cloud providers often support Hadoop components through basic services, such as
AWS Elastic Compute Cloud and Simple Storage Service instances.
 However, there are also services tailored specifically for Hadoop-type tasks, such as
AWS Elastic MapReduce, Google Cloud Dataproc and Microsoft Azure HDInsight.
Hadoop modules and projects

 As a software framework, Hadoop is composed of numerous functional modules. At a
minimum, Hadoop uses Hadoop Common as a kernel to provide the framework's
essential libraries.
 Other components include Hadoop Distributed File System (HDFS), which is capable of
storing data across thousands of commodity servers to achieve high bandwidth between
nodes;
 Hadoop Yet Another Resource Negotiator (YARN), which provides resource
management and scheduling for user applications; and Hadoop MapReduce, which
provides the programming model used to tackle large distributed data processing --
mapping data and reducing it to a result.
 Hadoop also supports a range of related projects that can complement and extend
Hadoop's basic capabilities. Complementary software packages include:

Apache Flume: A tool used to collect, aggregate and move huge amounts of streaming data into
HDFS;

Apache HBase: An open source, nonrelational, distributed database;

Apache Hive: A data warehouse that provides data summarization, query and analysis;

Cloudera Impala: A massively parallel processing database for Hadoop, originally created by
the software company Cloudera, but now released as open source software;

Apache Oozie: A server-based workflow scheduling system to manage Hadoop jobs;

Apache Phoenix: An open source, massively parallel processing, relational database engine for
Hadoop that is based on Apache HBase;

Apache Pig: A high-level platform for creating programs that run on Hadoop;

Apache Sqoop: A tool to transfer bulk data between Hadoop and structured data stores, such as
relational databases;

Apache Spark: A fast engine for big data processing capable of streaming and supporting SQL,
machine learning and graph processing;

Apache Storm: An open source data processing system; and

Apache ZooKeeper: An open source configuration, synchronization and naming registry service
for large distributed systems.
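
To connect these modules back to the MapReduce programming model, the sketch below shows how plain Python scripts can act as the map and reduce tasks through Hadoop Streaming, the utility shipped with Hadoop for running non-Java mappers and reducers. The file names and the job-submission command are illustrative assumptions that vary by installation.

```python
# mapper.py -- Hadoop Streaming mapper sketch: emit "word<TAB>1" for every word.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
# reducer.py -- Hadoop Streaming reducer sketch: input arrives sorted by key,
# so counts for the same word are adjacent and can be summed in a single pass.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

A typical (installation-dependent) submission looks like: hadoop jar hadoop-streaming.jar -input /data/books -output /data/wordcount -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py, where the exact path of the streaming jar depends on the Hadoop distribution.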
