Unit III

Storage in clouds

Presented By
M S RATHOD, LIF, GP Amravati
Course Outcomes
At the end of this course, student will be able to:
1. Maintain Cloud-Based Applications.
2. Apply Virtualization in Cloud Computing.
3. Maintain and Manage Storage Systems in Cloud.
4. Use Cloud-Based Services.
5. Implement Security in Cloud Computing.
6. Carry out cloud-based activities on different
platforms.
History of cloud storage
The technology of data storage in computers has evolved over the
years,
starting from punch cards, magnetic tapes, floppy disks and hard
drives.
Other common forms of storage followed: CDs and DVDs,
and more recently flash memory devices such as pen drives and SD
cards.
History of cloud storage
The idea behind cloud computing is credited to Joseph Carl
Robnett Licklider, whose work on ARPANET in the 1960s aimed
to connect people and data from anywhere at any time.
In 2006, Amazon Web Services introduced its storage
service, Amazon S3.
In 2009, Google released applications such as
Google Docs and Google Sheets.
Hence, the cloud has been around and functional for
quite some time.
Cloud storage
Cloud storage is a digital storage solution
which utilizes multiple servers to store
data in logical pools.
Organizations buy storage capacity
from providers to store user,
organization, or application data.
Cloud storage
In the past few years, cloud storage has grown in
popularity and has become a direct challenger to
local storage, mainly due to the benefits it provides:
Security: Backups are spread across multiple
servers and are better protected from data loss and
hacking.
Accessibility: The stored data is accessible online
regardless of location.
Cloud storage
There are two major providers in the field of
cloud storage, namely:
•Amazon S3: It stores files across multiple
servers, offers file encryption, and allows
data to be shared publicly.
•Google Cloud: It offers unlimited storage space
and has the ability to resume a file transfer
after a failure.
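To make the provider APIs concrete, here is a minimal Python sketch using
Amazon's boto3 SDK. It assumes AWS credentials are already configured; the
bucket name and file paths are hypothetical, not from this course material.

    # Minimal sketch: uploading and sharing a file with Amazon S3 via boto3.
    # Assumes configured AWS credentials and an existing (hypothetical)
    # bucket named "example-bucket".
    import boto3

    s3 = boto3.client("s3")

    # Upload a local file, asking S3 to encrypt it at rest.
    s3.upload_file(
        Filename="report.pdf",
        Bucket="example-bucket",
        Key="docs/report.pdf",
        ExtraArgs={"ServerSideEncryption": "AES256"},  # server-side encryption
    )

    # Generate a time-limited URL so the object can be shared publicly.
    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": "example-bucket", "Key": "docs/report.pdf"},
        ExpiresIn=3600,  # link valid for one hour
    )
    print(url)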
Three levels of storage architecture: File, Block, and Object-based storage
 File storage, also called file-level or file-based storage, is exactly
what you think it might be:
 Data is stored as a single piece of information inside a folder, just
like you’d organize pieces of paper inside a manila folder.
 When you need to access that piece of data, your computer needs to
know the path to find it.
Three levels of storage architecture: File, Block, and Object-based storage
 Block storage chops data into blocks and stores them as separate
pieces.
 Each block of data is given a unique identifier, which allows a
storage system to place the smaller pieces of data wherever is most
convenient.
 That means that some data can be stored in a Linux® environment
and some can be stored in a Windows unit.
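A toy Python sketch of this idea (not a real block device driver): data is
chopped into fixed-size blocks, each block gets a unique identifier, and an
index preserves the logical order so the pieces can be placed anywhere and
reassembled on demand. The block size and ID scheme are illustrative.

    # Toy block storage: chop data into fixed-size blocks with unique IDs.
    import uuid

    BLOCK_SIZE = 4096  # bytes; illustrative, real systems vary

    def split_into_blocks(data: bytes):
        """Return {block_id: chunk} plus the ordered index for reassembly."""
        blocks, index = {}, []
        for offset in range(0, len(data), BLOCK_SIZE):
            block_id = str(uuid.uuid4())  # unique identifier per block
            blocks[block_id] = data[offset:offset + BLOCK_SIZE]
            index.append(block_id)
        return blocks, index

    def reassemble(blocks, index) -> bytes:
        # Logical order comes from the index, not from physical placement.
        return b"".join(blocks[block_id] for block_id in index)

    data = b"x" * 10000
    blocks, index = split_into_blocks(data)
    assert reassemble(blocks, index) == data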
Three levels of storage architecture: File, Block, and Object-based storage
 Object storage, also known as object-based storage, is a flat
structure in which files are broken into pieces and spread out
among hardware.
 In object storage, the data is broken into discrete units called
objects and is kept in a single repository, instead of being kept as
files in folders or as blocks on servers.
Cloud storage architecture
[Diagram: cloud storage architecture]
Cloud storage architecture
Cloud storage is based on virtualized
infrastructure and is like cloud computing in
terms of accessible interfaces, scalability and
metered resources.
It refers to a hosted object storage service, but
the term has broadened to include other types of
data storage that are now available as a service,
like block storage.
Cloud storage architecture
Object storage is a data storage
architecture for large stores of
unstructured data.
It designates each piece of data as an
object, keeps it in a separate storehouse,
and bundles it with metadata and a unique
identifier for easy access and retrieval.
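A minimal in-memory sketch of this object model: one flat repository in
which each object bundles its data with metadata and a unique identifier.
All names here are illustrative, not any particular product's API.

    # Toy object store: a flat repository of objects, each bundling data
    # with metadata and a globally unique identifier.
    import uuid

    repository = {}  # flat namespace: object_id -> object

    def put_object(data: bytes, **metadata) -> str:
        object_id = str(uuid.uuid4())  # unique identifier
        repository[object_id] = {"data": data, "metadata": metadata}
        return object_id

    def get_object(object_id: str) -> dict:
        return repository[object_id]

    oid = put_object(b"<jpeg bytes>", content_type="image/jpeg", owner="alice")
    print(get_object(oid)["metadata"])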
Benefits of cloud storage
Usability and accessibility
Security
Cost-efficient
Convenient sharing of files
Automation
Multiple users
Synchronization
Convenience
Scalable
Disaster recovery
Virtualized Data Center
A virtual data center is a pool or collection of cloud
infrastructure resources specifically designed for
enterprise business needs.
The basic resources are the processor (CPU), memory
(RAM), storage (disk space) and networking (bandwidth).
It is a virtual representation of a physical data center,
complete with servers, storage clusters and networking
components, all of which reside in virtual space hosted
by one or more actual data centers.
Virtual Data Center
A virtual data center is a product of the Infrastructure as a Service
(IaaS) delivery model of cloud computing.
It can provide on-demand computing, storage and networking, as
well as applications, all of which can be seamlessly integrated
into an organization's existing IT infrastructure.
The premise of the virtual data center solution is to give
organizations the option of adding capacity or installing a new IT
infrastructure without the need to buy or install costly hardware,
which would demand additional manpower, space and power.
The whole data center infrastructure is instead provided over the cloud.
Virtual Data Center
This is one of the greatest benefits of cloud
computing:
to allow relatively small organizations access to IT
infrastructure in the form of a virtual data center
without spending millions of dollars in capital in
order to construct an actual data center.
They only need to pay for the resources that they use,
which allows for great flexibility and scalability.
Benefits of VDC
[Diagram: benefits of a virtual data center]
VDC Architecture
[Diagram: VDC architecture]
VMWare Data Center
A VMware virtual data center presents the following set of virtual
elements:
Hosts, Clusters and Resource (compute and memory) Pools –
A Host is the virtual representation of the computing and memory
resources of a physical machine running ESXi Server.
When one or more physical machines are grouped together to work
and be managed as a whole, the aggregate computing and memory
resources form a Cluster.
Machines can be dynamically added or removed from a Cluster.
Computing and memory resources from Hosts and Clusters can be
finely partitioned into a hierarchy of Resource Pools.
VMWare Data Center
Datastores (storage resources)
Datastores are virtual representations of combinations of underlying
physical storage resources in the data center.
These physical storage resources can include local SCSI disks of the
server, Fibre Channel SAN disk arrays, iSCSI SAN disk arrays,
or Network Attached Storage (NAS) arrays.
Each virtual machine is stored as a set of files in a directory in the
datastore.
The disk storage associated with each virtual guest is a set of files
within the guest’s directory.
VMWare Data Center
Networking resources called Networks
Networks in the virtual environment connect virtual machines to each other or to
the physical network outside of the virtual data center.
These include virtual network interface cards (vNICs), vNetwork Standard
Switches (vSwitches), vNetwork Distributed Switches (vDS), and port groups.
Virtual machines.
Virtual machines are designated to a particular Host, Cluster or Resource Pool and
a Datastore when they are created.
A virtual machine consumes resources like a physical appliance. While
powered off or suspended, it consumes no resources. Once powered on, it
consumes resources dynamically, using more as the workload increases and
giving resources back as the workload decreases. Provisioning virtual
machines is much faster and easier than provisioning physical machines.
Virtual Storage Area Network
A virtual storage area network (virtual SAN, VSAN
or vSAN) is a logical representation of a physical
storage area network (SAN).
A VSAN abstracts the storage-related operations from
the physical storage layer and provides shared storage
access to applications and virtual machines by combining
the servers' local storage over a network into one or
more storage pools.
Storage Area Network
A storage area network (SAN) or storage
network is a computer network which provides
access to consolidated, block-level data storage.
SANs are primarily used to access data storage
devices, such as disk arrays and tape libraries,
from servers so that the devices appear to the
operating system as direct-attached storage.
Storage Area Network
A storage area network (SAN) is a dedicated,
independent high-speed network for data storage.
In addition to storing data, SANs allow for the
automatic backup of data, and the monitoring of
the storage as well as the backup process.
 A SAN is a combination of hardware and
software.
Storage Area Network
A storage area network (SAN) is a
dedicated, independent high-speed network
that interconnects and delivers shared pools
of storage devices to multiple servers or
computers.
Each server can access shared storage as if it
were a drive directly attached to the server.
SAN vs. NAS
A SAN and network-attached storage
(NAS) are two different types of shared
networked storage solutions.
While a SAN is a local network
composed of multiple devices, NAS is a
single storage device that connects to a
local area network (LAN). 
Virtual SAN
 A virtual storage area network (SAN) is a software-based
component that provides a virtualized ‘pool’ of storage to multiple
virtual machines (VMs) and applications.
Benefits of VSAN
Lower Cost: Physical storage infrastructure, such as a NAS or
SAN, typically requires specialist hardware, more substantial
network architecture, specialist IT support, cooling, spare parts and
physical space.
VSANs are less expensive to purchase up front and require fewer
resources to maintain.
High Availability: A VSAN duplicates data and applications across
multiple servers, eliminating the single point of failure (and the
resulting downtime) that a physical SAN can present.
Benefits of VSAN
 Flexibility : As a physical piece of hardware, a physical SAN must
be installed in a very specific way within the datacenter, and has
several strict requirements in terms of management and
maintenance.
 A virtual SAN has lower requirements and enables an organization
to tailor their configuration to meet their exact needs, and deploy it
in a way that best suits them.
 It can be configured in many different ways, and provides businesses
with much more flexibility to select the components (hardware and
software) that make the most sense for their environment.
Virtual Provisioning
Virtual provisioning is a strategy for efficiently
managing space in a storage area network (SAN)
by allocating physical storage on an "as needed"
basis.
This strategy is also called thin provisioning.
Virtual provisioning is designed to simplify storage
administration by allowing storage administrators
to meet requests on-demand. 
Virtual Provisioning
Virtual provisioning gives a host, application or file system
the illusion that it has more storage than is physically
provided.
Physical storage is allocated only when the data is written,
rather than when the application is initially configured.
Virtual provisioning can reduce power and cooling costs
by cutting down on the amount of idle storage devices in
the array. 
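A short Python sketch of the thin-provisioning idea: the volume advertises a
large virtual capacity, but physical blocks are counted only when a region is
actually written. The class name and sizes are illustrative.

    # Thin provisioning sketch: virtual capacity vs. actual allocation.
    BLOCK_SIZE = 4096

    class ThinVolume:
        def __init__(self, virtual_size: int):
            self.virtual_size = virtual_size  # what the host believes it has
            self.allocated = {}               # block number -> written data

        def write(self, block_no: int, data: bytes):
            # Physical space is consumed only on first write to a block.
            self.allocated[block_no] = data

        def physical_usage(self) -> int:
            return len(self.allocated) * BLOCK_SIZE

    vol = ThinVolume(virtual_size=1 << 40)  # host sees 1 TiB
    vol.write(0, b"hello")
    print(vol.virtual_size, vol.physical_usage())  # 1 TiB claimed, 4 KiB used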
Automated Storage Tiering
It is a storage management software feature that dynamically
moves information between different disk types and RAID
levels to meet space, performance and cost requirements.
Automated storage tiering features use policies that are set up
by storage administrators.
For example, a data storage administrator can assign
infrequently used data to slower, less-expensive SATA storage
but allow it to be automatically moved to higher-performing
SAS or solid-state drives (SSDs) as it becomes more active (and
vice versa).
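The policy in this example can be sketched as follows; the tier names match
the example above, but the thresholds are invented for illustration and would
be set by the storage administrator in a real system.

    # Policy-based automated storage tiering sketch: promote active data to
    # faster tiers, demote cold data to cheaper ones.
    TIERS = ["SATA", "SAS", "SSD"]  # slow/cheap -> fast/expensive
    PROMOTE_AT = 100                # accesses per period (illustrative)
    DEMOTE_AT = 10

    def retier(item: dict):
        tier = TIERS.index(item["tier"])
        if item["accesses"] >= PROMOTE_AT and tier < len(TIERS) - 1:
            item["tier"] = TIERS[tier + 1]  # promote to a faster tier
        elif item["accesses"] <= DEMOTE_AT and tier > 0:
            item["tier"] = TIERS[tier - 1]  # demote to a cheaper tier
        item["accesses"] = 0                # reset counter for next period

    dataset = {"tier": "SATA", "accesses": 250}
    retier(dataset)
    print(dataset["tier"])  # "SAS": promoted one level this period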
Google File System
A GFS cluster consists of a single master and multiple
chunkservers and is accessed by multiple clients.
Files are divided into fixed-size chunks.
Each chunk is identified by a globally unique 64-bit chunk
handle assigned by the master at the time of chunk
creation.
Chunkservers store chunks on local disks as Linux files
and read or write chunk data specified by a chunk
handle and byte range.
Google File System
For reliability, each chunk is replicated on multiple chunkservers.
The master maintains all file system metadata such as the
namespace, access control information, the mapping from files to
chunks, and the current locations of chunks.
GFS client code linked into each application implements the file
system API and communicates with the master and chunkservers
to read or write data on behalf of the application.
Clients interact with the master for metadata operations, but all
data-bearing communication goes directly to the chunkservers.
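The read path just described can be sketched as follows: the client turns a
byte offset into a chunk index, asks the master only for the chunk handle and
replica locations, then fetches the bytes directly from a chunkserver. The
classes are simplified stand-ins; 64 MB is GFS's published default chunk size.

    # Simplified GFS read path: metadata from the master, data from a
    # chunkserver.
    CHUNK_SIZE = 64 * 1024 * 1024  # GFS's published default chunk size

    class Master:
        """Metadata only: file -> chunk handles, handle -> replica locations."""
        def __init__(self, file_to_chunks, chunk_locations):
            self.file_to_chunks = file_to_chunks
            self.chunk_locations = chunk_locations

        def lookup(self, path, chunk_index):
            handle = self.file_to_chunks[path][chunk_index]
            return handle, self.chunk_locations[handle]

    def read(master, chunkservers, path, offset, length):
        chunk_index = offset // CHUNK_SIZE   # which chunk holds the offset
        handle, replicas = master.lookup(path, chunk_index)
        server = chunkservers[replicas[0]]   # data bypasses the master
        start = offset % CHUNK_SIZE
        return server[handle][start:start + length]

    master = Master({"/logs/a": [0xABCD]}, {0xABCD: ["cs1", "cs2", "cs3"]})
    chunkservers = {"cs1": {0xABCD: b"hello gfs"}}
    print(read(master, chunkservers, "/logs/a", 0, 5))  # b'hello'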
HDFS
HDFS (Hadoop Distributed File System) is a file system designed to
provide storage for extremely large files with a streaming data access
pattern, running on commodity hardware.
Extremely large files: Here we are talking about data in the range of
petabytes (1 PB = 1000 TB).
Streaming data access pattern: HDFS is designed on the principle
of write-once, read-many-times. Once data is written, large
portions of the dataset can be processed any number of times.
Commodity hardware: Hardware that is inexpensive and easily
available in the market. This is one of the features that especially
distinguishes HDFS from other file systems.
HDFS
Nodes: Master and slave nodes typically form the HDFS cluster.
MasterNode:
◦ Manages all the slave nodes and assigns work to them.
◦ It executes filesystem namespace operations like opening, closing and renaming files and
directories.
◦ It should be deployed on reliable, high-configuration hardware, not on
commodity hardware.

SlaveNode:
◦ Slave nodes are the actual worker nodes, which do the actual work like reading, writing and processing.
◦ They also perform creation, deletion and replication upon instruction from the master.
◦ They can be deployed on commodity hardware.
HDFS
HDFS daemons: Daemons are the processes running in the background.
NameNodes:
◦ Run on the master node.
◦ Store metadata (data about data) like the file path, the number of blocks, block
IDs, etc.
◦ Require a high amount of RAM.
◦ Store metadata in RAM for fast retrieval, i.e. to reduce seek time, though a
persistent copy of it is kept on disk.
DataNodes:
◦ Run on slave nodes.
◦ Require a large amount of storage, as the actual data is stored here.
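To illustrate this division of labour, here is a toy Python sketch of the
NameNode's in-memory metadata: files map to block IDs, blocks map to the
DataNodes holding replicas, and under-replicated blocks can be flagged for
re-replication. Replication factor 3 is the usual HDFS default; all other
names are illustrative.

    # Toy NameNode metadata: file -> block IDs, block -> DataNode replicas.
    REPLICATION_FACTOR = 3  # usual HDFS default

    namespace = {"/data/sales.csv": ["blk_1", "blk_2"]}  # file path -> blocks
    block_locations = {
        "blk_1": ["datanode1", "datanode2", "datanode3"],
        "blk_2": ["datanode2", "datanode3"],             # one replica lost
    }

    def under_replicated():
        """Blocks the NameNode would schedule for re-replication."""
        return [
            block
            for blocks in namespace.values()
            for block in blocks
            if len(block_locations[block]) < REPLICATION_FACTOR
        ]

    print(under_replicated())  # ['blk_2']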
HDFS
Features:  
Distributed data storage.
Blocks reduce seek time.
The data is highly available as the same block is
present at multiple datanodes.
Even if multiple datanodes are down we can still do
our work, thus making it highly reliable.
High fault tolerance.
HDFS
Limitations: Though HDFS provides many features, there are some
areas where it does not work well.
Low-latency data access: Applications that require low-latency
access to data, i.e. in the range of milliseconds, will not work well with
HDFS, because HDFS is designed for high throughput of data, even
at the cost of latency.
Small file problem: Having lots of small files results in lots of
seeks and lots of movement from one datanode to another to
retrieve each small file; this is a very inefficient data access
pattern.
