1. Usability: All cloud storage services reviewed in this topic have desktop folders for Macs
and PCs. This allows users to drag and drop files between the cloud storage and their local
storage.
2. Bandwidth: You can avoid emailing files to individuals and instead send a web link to
recipients through your email.
3. Accessibility: Stored files can be accessed from anywhere via Internet connection.
4. Cost Savings: Businesses and organizations can often reduce annual operating costs by using
cloud storage; it costs about 3 cents per gigabyte to store data internally, an expense cloud
storage can reduce. Users can see additional cost savings because storing information remotely
does not require internal power.
Ensuring your company's critical data is safe, secure, and available when needed is essential.
There are several fundamental requirements when considering storing data in the cloud.
Durability: Data should be redundantly stored, ideally across multiple facilities and multiple
devices in each facility. Natural disasters, human error, or mechanical faults should not result in
data loss.
Availability: All data should be available when needed, but there is a difference between
production data and archives. The ideal cloud storage will deliver the right balance of retrieval
times and cost.
Security: All data is ideally encrypted, both at rest and in transit. Permissions and access
controls should work just as well in the cloud as they do for on-premises storage.
There are three types of cloud data storage, and each offers its own advantages and has its
own use cases:
1. Object Storage - Applications developed in the cloud often take advantage of object storage's
vast scalability and metadata characteristics. Object storage solutions like Amazon Simple
Storage Service (S3) are ideal for building modern applications from scratch that require scale
and flexibility, and can also be used to import existing data stores for analytics, backup, or
archive.
2. File Storage - Some applications need to access shared files and require a file system. This
type of storage is often supported with a Network Attached Storage (NAS) server. File storage
solutions like Amazon Elastic File System (EFS) are ideal for use cases like large content
repositories, development environments, media stores, or user home directories.
3. Block Storage - Other enterprise applications like databases or ERP systems often require
dedicated, low-latency storage for each host. This is analogous to direct-attached storage (DAS)
or a Storage Area Network (SAN). Block-based cloud storage solutions like Amazon Elastic
Block Store (EBS) are provisioned with each virtual server and offer the ultra low latency
required for high performance workloads.
There are three main cloud-based storage architecture models: public, private and hybrid.
Public cloud storage services provide a multi-tenant storage environment that is most suited for
unstructured data. Data is stored in global data centers with storage data spread across multiple
regions or continents. Customers generally pay on a per-use basis similar to the utility payment
model.
Private cloud, or on-premises, storage services provide a dedicated environment protected
behind an organization's firewall. Private clouds are appropriate for users who need
customization and more control over their data.
Hybrid cloud is a mix of private cloud and third-party public cloud services with orchestration
between the platforms for management. The model offers businesses flexibility and more data
deployment options.
A cloud storage provider is a third-party company that offers organizations and end users the
ability to save data to an off-site storage system.
Instead of storing data to a local hard drive or other local storage device, such as tape backup or
flash drives, the data is stored on a system in a remote data center.
Customers can access their files from anywhere with an Internet connection.
Both individual and corporate customers can store virtually unlimited amounts of data on the
provider's servers at a low price point.
The key to a cloud storage model is flexibility, so the user has the advantage of not worrying
about storage limitations. Typically, they only pay for what they use, much like a utility.
The key to storing data in the cloud is availability, and the level of availability can impact the
price.
Companies can use cloud storage providers as an offsite vault for backups, thus eliminating the
need for tape backups, or they can use cloud storage as the primary destination for backups. It
can also be used for primary storage.
There are numerous cloud storage providers, including but not limited to Nirvanix, Rackspace,
Amazon Simple Storage Service (Amazon S3), AT&T’s Synaptic Storage and Mozy.
To narrow the field of cloud storage services, the following criteria were applied:
1. We removed services that are focused primarily on media- and OS-level backups.
2. We removed services that are just for business and have no personal option.
3. We cut all services without extensive support for OS X, Windows, Android, and iOS.
4. We cut any cloud storage services that did not offer a premium version.
5. We cut any contenders that didn't have an average of 3.5 stars or higher from the App
Store, Google Play Store, and Windows Store.
Dropbox
Perhaps the biggest and best-known name in cloud storage, Dropbox is many people’s first and
last experience of a cloud provider. They are a highly popular choice for personal storage, thanks
to their simple interface and competitive pricing.
Dropbox syncs your files automatically and allows you to share them with your family and
friends, even if they don’t have a Dropbox account. It’s supremely usable and intuitive, and files
can be accessed from any device. You can share folders to collaborate on documents, though it’s
not really suitable for business use – it’s designed for individuals and casual users, rather than
enterprise.
Google Drive
The giant of the web, Google provides a reliable and low-cost solution for cloud storage (and
much more besides). You get 15GB of free storage, and if you need more it’s just $1.99 a month
for 100GB, or $23.88 per year. You can use Google Drive on unlimited machines, but there’s no
Linux option – just Windows and Mac.
CrashPlan
CrashPlan comes in at $5.99 a month or $59.99 a year, with a 30-day free trial. It’s available for
Windows, Mac or Linux. Once you’ve set up your preferences, CrashPlan works away in the
background, making sure you don’t lose any vital files – music, photos and docs are
continuously and automatically backed up for you.
Walrus offers persistent storage to all of the virtual machines in the Eucalyptus cloud and can be
used as a simple HTTP PUT/GET storage-as-a-service solution.
There are no data type restrictions for Walrus, and it can contain images (i.e., the building blocks
used to launch virtual machines), volume snapshots (i.e., point-in-time copies), and application
data. Only one Walrus can exist per cloud.
Currently, Walrus requires a POSIX filesystem to store buckets and objects. By default, it uses
the filesystem at $EUCALYPTUS/var/lib/eucalyptus/bukkits, creating a directory per bucket;
each object is stored as a single file within that bucket directory.
Bucket directories are named exactly as the bucket itself, so bukkits/bucket01 is the
directory for the 'bucket01' bucket, as expected. However, object files do not retain their
name on the filesystem for several reasons and are instead stored as files which have a
random hash as the name.
This prevents name conflicts due to object versioning and allows the system to handle
object names which are not valid POSIX filenames (as the S3 API allows '/' in object
names, for example).
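The naming scheme described above can be sketched in a few lines. MiniWalrusStore and its in-memory metadata dict are illustrative stand-ins (real Walrus persists the object-name mapping in the database managed by the CLC, not in process memory):

```python
import hashlib
import os
import tempfile
import uuid

class MiniWalrusStore:
    """Toy model of Walrus's on-disk layout: one directory per bucket,
    each object stored as a file whose name is a random hash, with the
    object-name -> file-name mapping held in metadata (a dict here)."""

    def __init__(self, root):
        self.root = root
        self.metadata = {}  # (bucket, object name) -> on-disk file name

    def create_bucket(self, bucket):
        # Bucket directories are named exactly as the bucket itself.
        os.makedirs(os.path.join(self.root, bucket), exist_ok=True)

    def put_object(self, bucket, name, data):
        # Random hash as the file name: sidesteps POSIX-invalid characters
        # (S3 allows '/' in object names) and versioning name conflicts.
        file_name = hashlib.sha256(uuid.uuid4().bytes).hexdigest()
        with open(os.path.join(self.root, bucket, file_name), "wb") as f:
            f.write(data)
        self.metadata[(bucket, name)] = file_name

    def get_object(self, bucket, name):
        file_name = self.metadata[(bucket, name)]
        with open(os.path.join(self.root, bucket, file_name), "rb") as f:
            return f.read()

# Demo: an object name containing '/' is stored safely under a hash name.
store = MiniWalrusStore(tempfile.mkdtemp())
store.create_bucket("bucket01")
store.put_object("bucket01", "photos/cat.jpg", b"jpeg bytes")
```

Listing the bucket01 directory after the put shows only the hash-named file, never "photos/cat.jpg" itself, which is exactly why the metadata database is required to resolve object names.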
This does, of course, mean that the number of buckets in a single Walrus installation is
limited by the number of subdirectories allowed by the filesystem.
For ext3 that number is 32,000 and for ext4 it is 64,000. XFS and btrfs have much higher
limits (i.e., > 1M). XFS also has many other advantageous features, such as full metadata
journaling that allows repair of the FS without a full fsck run.
There is plenty of info online for all the filesystems, but we suggest XFS or ext4. Walrus
does not utilize the filesystem for storing metadata; that is all persisted in the database
managed by the Cloud Controller (CLC).
Therefore, Walrus does not require extended metadata support from the filesystem itself.
HA (DRBD and more)
Walrus API
Walrus implements much of the Amazon Web Services Simple Storage Service (S3) API:
S3 Documentation
Specifically, Walrus supports: get/put/head/post of objects and buckets, object/bucket
ACLs, and object/bucket versioning.
Walrus does not yet implement static websites, bucket-policies, degraded durability
modes, or multi-part uploads. We are, of course, working on many of those features, but
can't give a timeline just yet. Walrus supports the S3 SOAP API as well as REST and
supports normal S3 authorization-header-based request authentication or query-string
authentication.
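The authorization-header scheme mentioned above is AWS Signature Version 2, the classic S3 request authentication that Walrus mirrors: the client signs a canonical string-to-sign with HMAC-SHA1 under its secret key and base64-encodes the result. A minimal sketch follows; the credentials are made up, and a real request would also carry the matching Date header (or, for query-string auth, an Expires parameter):

```python
import base64
import hashlib
import hmac

def s3_v2_authorization(access_key, secret_key, method,
                        content_md5, content_type, date, canonical_resource):
    # Classic S3 string-to-sign: verb, MD5, type, date, canonicalized resource.
    string_to_sign = "\n".join(
        [method, content_md5, content_type, date, canonical_resource])
    digest = hmac.new(secret_key.encode(), string_to_sign.encode(),
                      hashlib.sha1).digest()
    signature = base64.b64encode(digest).decode()
    # The header the server verifies: "Authorization: AWS AccessKeyId:Signature".
    return "AWS {}:{}".format(access_key, signature)

header = s3_v2_authorization(
    "AKIDEXAMPLE", "not-a-real-secret", "GET", "", "",
    "Tue, 27 Mar 2007 19:36:42 +0000", "/bucket01/photos/cat.jpg")
```

The server recomputes the same signature from its copy of the secret key and compares; query-string authentication carries the same signature as URL parameters instead of a header.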
Walrus uses a database to manage metadata for buckets and objects as well as ACLs and
policies.
Walrus has its own database, but it is currently co-hosted by the Cloud Controller (CLC)
along with all of the other databases that Eucalyptus uses.
The current implementation leverages a PostgreSQL database, although versions prior to
Eucalyptus 3.1 ran on MySQL.
Walrus, like all Eucalyptus components written in Java, utilizes Hibernate to interact with
the database layer, and database HA is handled via a replicating Hibernate JDBC connection.
Walrus implements and supports S3-style ACLs and IAM policies. It does not support bucket-
policies or bucket-lifecycles (yet!).
The ACL instance holds a detailed list of Access Control Entries (ACEs) which are used to make
access decisions. The ACL system focuses on two main objectives:
1. providing a way to efficiently retrieve and modify a large number of ACLs/ACEs for your
domain objects;
2. providing a way to easily decide whether a person is allowed to perform an action on a
domain object or not.
4. AMAZON S3
Amazon Simple Storage Service (Amazon S3) is object storage with a simple web
service interface to store and retrieve any amount of data from anywhere on the web.
It is designed to deliver 99.999999999% durability, and scale past trillions of objects
worldwide.
Customers use Amazon S3 as primary storage, as a bulk repository for user-generated
content, as a tier in an active archive, with serverless computing applications, as a "data
lake" for Big Data analytics, and as a target for backup & recovery and disaster recovery.
It's simple to move large volumes of data into or out of Amazon S3 with Amazon's cloud
data migration options.
Once data is stored in S3, it can be automatically tiered into lower cost, longer-term cloud
storage classes like S3 Standard - Infrequent Access and Amazon Glacier for archiving.
Amazon S3 features
Simple - Amazon S3 is simple to use, with a web-based management console and mobile app.
Durable - Amazon S3 provides durable infrastructure to store important data and is designed
for 99.999999999% durability of objects.
Scalable - With Amazon S3, you can store as much data as you want and access it when needed.
Secure - Amazon S3 supports data transfer over SSL and automatic encryption of your data once
it is uploaded.
Low Cost - Amazon S3 allows you to store large amounts of data at a very low cost.
Simple Data Transfer - Amazon provides multiple options for cloud data migration, making it
simple and cost-effective for you to move large volumes of data into or out of Amazon S3.
Integrated - Amazon S3 is deeply integrated with other AWS services, making it easier to build
solutions that use a range of AWS services.
Amazon S3 is an excellent choice for a large variety of use cases ranging from a simple storage
repository for backup & recovery to primary storage for some of the most cutting edge cloud-
native applications in the market today - and everything in between.
Backup & Archiving - Amazon S3 offers a highly durable, scalable, and secure destination for
backing up and archiving your critical data.
Content Storage & Distribution - You can distribute your content directly from S3, or use S3
as an origin store for delivering content to your Amazon CloudFront edge locations.
Big Data Analytics - Whether you're storing pharmaceutical or financial data, or multimedia files
such as photos and videos, Amazon S3 can be used as your big data object store.
Static Website Hosting - You can host your entire static website on Amazon S3 for a low-cost,
highly available hosting solution that can scale automatically to meet traffic demands.
Disaster Recovery - Amazon S3's highly durable, secure, global infrastructure offers a robust
disaster recovery solution designed to provide superior data protection.
Amazon S3 offers a range of storage classes designed for different use cases.
These include Amazon S3 Standard for general-purpose storage of frequently accessed data,
Amazon S3 Standard - Infrequent Access for long-lived, but less frequently accessed data, and
Amazon Glacier for long-term archive.
Amazon S3 also offers configurable lifecycle policies for managing your data throughout its
lifecycle.
Once a policy is set, your data will automatically migrate to the most appropriate storage class
without any changes to your application.
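A lifecycle policy of the kind described can be written as a small JSON-shaped rule set. The sketch below builds the configuration as a plain dict; the day thresholds, the bucket name, and the boto3 call in the comment are all illustrative:

```python
def lifecycle_rules(ia_after_days=30, glacier_after_days=365):
    # One rule: move objects to Standard-IA after ia_after_days, then to
    # Glacier after glacier_after_days, matching the tiering described above.
    return {
        "Rules": [{
            "ID": "tier-down",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},  # empty prefix: applies to every object
            "Transitions": [
                {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                {"Days": glacier_after_days, "StorageClass": "GLACIER"},
            ],
        }]
    }

# Applying it would look roughly like this (requires boto3 and AWS
# credentials; "my-example-bucket" is a made-up name):
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-example-bucket",
#       LifecycleConfiguration=lifecycle_rules())
```

Once such a configuration is attached to a bucket, S3 performs the transitions itself, which is why no application changes are needed.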
General Purpose
Amazon S3 Standard
Amazon S3 Standard offers high durability, availability, and performance object storage
for frequently accessed data.
Because it delivers low latency and high throughput, Standard is perfect for a wide
variety of use cases including cloud applications
Infrequent Access
Archive
Amazon Glacier
Amazon Glacier is a secure, durable, and extremely low-cost storage service for data
archiving.
You can reliably store any amount of data at costs that are competitive with or cheaper
than on-premises solutions.
To keep costs low yet suitable for varying retrieval needs, Amazon Glacier provides three
options for access to archives, from a few minutes to several hours.
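The three access options correspond to the retrieval tiers S3 exposes when restoring an archived object: "Expedited" (minutes), "Standard" (hours), and "Bulk" (the slowest and cheapest). A hedged sketch of building such a restore request; the bucket and key names in the comment are hypothetical:

```python
RETRIEVAL_TIERS = ("Expedited", "Standard", "Bulk")

def restore_request(days, tier):
    # Shape of an S3 restore request for a Glacier-archived object: keep a
    # temporary online copy for `days` days, fetched at the chosen tier.
    if tier not in RETRIEVAL_TIERS:
        raise ValueError("unknown retrieval tier: {}".format(tier))
    return {"Days": days, "GlacierJobParameters": {"Tier": tier}}

# Issuing it would look roughly like this (requires boto3 and credentials):
#   import boto3
#   boto3.client("s3").restore_object(
#       Bucket="my-example-bucket", Key="archive/2016/logs.tar",
#       RestoreRequest=restore_request(7, "Bulk"))
```

Choosing the tier per request is what lets one archive serve both urgent retrievals and cheap bulk recalls.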
Cloud file storage (CFS) is a storage service that is delivered over the Internet, billed on a pay-
per-use basis and has an architecture based on common file level protocols such as Server
Message Block (SMB), Common Internet File System (CIFS) and Network File System (NFS).
CFS and cloud block storage are the two main types of cloud storage services. (These are akin to
the two types of networked storage: NAS, or file storage, and SAN, or block storage.) Most cloud
storage services, including cloud backup services, use a file storage architecture.
Cloud file storage is most appropriate for unstructured data or semi-structured data, such as
documents, spreadsheets, presentations and other file-based data.
For applications that may have very large files, such as databases, a block storage architecture is
more appropriate to maintain satisfactory performance.
Only a handful of cloud storage services offer block storage, however, because the bandwidth
required to traverse the cloud would not likely be adequate to maintain acceptable database
performance without significant latency.
Cloud file sharing, also called cloud-based file sharing or online file sharing, is a system in
which a user is allotted storage space on a server and reads and writes are carried out over the
Internet.
Cloud file sharing provides end users with the ability to access files with any Internet-capable
device from any location. Usually, the user has the ability to grant access privileges to other
users as they see fit.
Although cloud file sharing services are easy to use, the user must rely upon the service
provider's ability to provide high availability (HA) and backup and recovery in a timely manner.
In the enterprise, cloud file sharing can present security risks and compliance concerns if
company data is stored on third-party providers without the IT department's knowledge. Popular
third-party providers for cloud file sharing include Box, Dropbox, Egnyte and Syncplicity.
6. MAP REDUCE
Hadoop enables resilient, distributed processing of massive unstructured data sets across
commodity computer clusters, in which each node of the cluster includes its own storage.
MapReduce serves two essential functions: it parcels out work to various nodes within the
cluster (the map function), and it organizes and reduces the results from each node into a
cohesive answer to a query (the reduce function).
JobTracker -- the master node that manages all jobs and resources in a cluster
TaskTrackers -- agents deployed to each machine in the cluster to run the map and reduce tasks
To distribute input data and collate results, MapReduce operates in parallel across massive
cluster sizes.
Because cluster size doesn't affect a processing job's final results, jobs can be split across almost
any number of servers.
Therefore, MapReduce and the overall Hadoop framework simplify software development.
MapReduce is available in several languages, including C, C++, Java, Ruby, Perl and Python.
Programmers can use MapReduce libraries to create tasks without dealing with communication
or coordination between nodes.
MapReduce is also fault-tolerant, with each node periodically reporting its status to a master
node.
If a node doesn't respond as expected, the master node re-assigns that piece of the job to other
available nodes in the cluster.
This creates resiliency and makes it practical for MapReduce to run on inexpensive commodity
servers.
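The heartbeat-and-reassign behavior can be sketched as a toy master loop. The round-robin scheduling and the `dead` set below are simplifications for illustration, not the actual JobTracker logic:

```python
from collections import deque

def run_with_failures(tasks, workers, dead):
    # Toy master node: hand out tasks round-robin; a worker in `dead`
    # never reports back, so it is dropped and its task is re-queued
    # for a healthy node - the re-assignment described above.
    alive = list(workers)
    queue = deque(tasks)
    results = {}
    done = 0
    while queue:
        worker = alive[done % len(alive)]
        task = queue.popleft()
        if worker in dead:
            alive.remove(worker)  # missed heartbeat: stop scheduling to it
            queue.append(task)    # give the task to another node later
        else:
            results[task] = worker
            done += 1
    return results

finished = run_with_failures(["t1", "t2", "t3"], ["w1", "w2"], dead={"w2"})
```

Even with w2 silent, every task completes on the surviving worker, which is the resiliency property that makes commodity hardware acceptable.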
MapReduce in action
For example, users can list and count the number of times every word appears in a novel as a
single server application, but that is time consuming.
By contrast, users can split the task among 26 people, so each takes a page, writes a word on a
separate sheet of paper and takes a new page when they're finished.
This is the map aspect of MapReduce. And if a person leaves, another person takes his place.
This exemplifies MapReduce's fault-tolerant element.
When all pages are processed, users sort their single-word pages into 26 boxes, which represent
the first letter of each word.
Each user takes a box and sorts the words in the stack alphabetically. Counting the pages
that carry the same word is the reduce aspect of MapReduce.
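The analogy above is the canonical word-count example. A minimal in-process version in plain Python (not tied to any particular Hadoop API) makes the two phases explicit:

```python
from collections import defaultdict

def map_words(page):
    # Map: emit one (word, 1) pair per word - each person writing
    # a single word on a separate sheet of paper.
    for word in page.lower().split():
        yield word, 1

def reduce_counts(pairs):
    # Reduce: tally the sheets that carry the same word.
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

pages = ["the cat sat", "the cat ran"]
pairs = [kv for page in pages for kv in map_words(page)]
counts = reduce_counts(pairs)
```

Because each page is mapped independently, the map phase can run on any number of workers; only the final tally needs the pairs grouped by word.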
MapReduce is a framework for processing parallelizable problems across large datasets using a
large number of computers (nodes), collectively referred to as a cluster (if all nodes are on the
same local network and use similar hardware) or a grid (if the nodes are shared across
geographically and administratively distributed systems, and use more heterogeneous hardware).
"Map" step: Each worker node applies the "map()" function to the local data and writes the
output to temporary storage. A master node ensures that only one copy of redundant input data
is processed.
"Shuffle" step: Worker nodes redistribute data based on the output keys (produced by the
"map()" function), such that all data belonging to one key is located on the same worker node.
"Reduce" step: Worker nodes now process each group of output data, per key, in parallel.
Prepare the Map() input – The "MapReduce system" designates Map processors, assigns the
input key value K1 that each processor would work on, and provides that processor with all the
input data associated with that key value.
Run the user-provided Map() code – Map() is run exactly once for each K1 key value,
generating output organized by key values K2.
"Shuffle" the Map output to the Reduce processors – The MapReduce system designates
Reduce processors, assigns the K2 key value each processor should work on, and provides that
processor with all the Map-generated data associated with that key value.
Run the user-provided Reduce() code – Reduce() is run exactly once for each K2 key value
produced by the Map step.
Produce the final output – The MapReduce system collects all the Reduce output, and sorts it
by K2 to produce the final outcome.
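The five steps can be condensed into a single-process driver; real systems distribute every step across nodes, but the data flow is the same (here itertools.groupby over a sorted list stands in for the distributed sort-and-shuffle):

```python
from itertools import groupby

def mapreduce(inputs, map_fn, reduce_fn):
    # Map step: map_fn turns each (K1, V1) record into (K2, V2) pairs.
    intermediate = []
    for k1, v1 in inputs:
        intermediate.extend(map_fn(k1, v1))
    # Shuffle step: sort by K2 so all values for one key sit together
    # (this sort is the "frozen" distributed sort of the framework).
    intermediate.sort(key=lambda kv: kv[0])
    # Reduce step: run reduce_fn exactly once per K2; because the
    # intermediate list is sorted, output is already ordered by K2.
    return [(k2, reduce_fn(k2, [v for _, v in group]))
            for k2, group in groupby(intermediate, key=lambda kv: kv[0])]

# Word count expressed with this driver:
out = mapreduce(
    [("doc1", "a b a"), ("doc2", "b c")],
    map_fn=lambda k1, text: [(w, 1) for w in text.split()],
    reduce_fn=lambda k2, values: sum(values))
```

Swapping in different map_fn/reduce_fn pairs reuses the same driver, which is the point of the framework: the sort is fixed, the hot spots are pluggable.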
The frozen part of the MapReduce framework is a large distributed sort. The hot spots,
which the application defines, are:
an input reader
a Map function
a partition function
a compare function
a Reduce function
an output writer
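Of these hot spots, the partition function is the simplest to show: it decides which Reduce processor receives each K2 key. Hadoop's default is a hash of the key modulo the number of reducers; a rough Python equivalent (not byte-identical to Java's hashCode) might look like:

```python
def default_partition(key, num_reducers):
    # Mask to a non-negative value, then bucket by modulo - every
    # occurrence of the same key lands on the same reducer.
    return (hash(key) & 0x7FFFFFFF) % num_reducers
```

Because identical keys always map to the same partition, all values for a key shuffle to one node, which is what makes the per-key reduce correct.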
MapReduce Architecture (figure)
7. HADOOP
Apache Flume: A tool used to collect, aggregate, and move huge amounts of streaming data into
HDFS.
Apache Hive: A data warehouse that provides data summarization, query, and analysis.
Cloudera Impala: A massively parallel processing database for Hadoop, originally created by
the software company Cloudera, but now released as open source software.
Apache Phoenix: An open source, massively parallel processing, relational database engine for
Hadoop that is based on Apache HBase.
Apache Pig: A high-level platform for creating programs that run on Hadoop.
Apache Sqoop: A tool to transfer bulk data between Hadoop and structured data stores, such as
relational databases.
Apache Spark: A fast engine for big data processing, capable of streaming and supporting SQL,
machine learning, and graph processing.
Apache ZooKeeper: An open source configuration, synchronization, and naming registry service
for large distributed systems.