
CLOUD COMPUTING

UNIT III
Parallel and Distributed Programming Paradigms – MapReduce, Twister and Iterative MapReduce
– Hadoop Library from Apache – Mapping Applications – Cloud Software Environments –
Eucalyptus, OpenNebula, OpenStack

Distributed Programming Paradigms


In distributed computing, we have multiple autonomous computers that appear to the user as a
single system. In distributed systems there is no shared memory; the computers communicate
with each other through message passing, and a single task is divided among the different
computers.
Difference between Parallel Computing and Distributed Computing:
1. Parallel computing: many operations are performed simultaneously. Distributed computing: system components are located at different locations.
2. Parallel computing: a single computer is required. Distributed computing: uses multiple computers.
3. Parallel computing: multiple processors perform multiple operations. Distributed computing: multiple computers perform multiple operations.
4. Parallel computing: may have shared or distributed memory. Distributed computing: has only distributed memory.
5. Parallel computing: processors communicate with each other through a bus. Distributed computing: computers communicate with each other through message passing.
6. Parallel computing: improves system performance. Distributed computing: improves system scalability, fault tolerance and resource-sharing capabilities.

MapReduce, Twister and Iterative MapReduce

What is MapReduce?
MapReduce is a programming model or pattern within the Hadoop framework that is used to access big data stored in the Hadoop Distributed File System (HDFS). It is a core component, integral to the functioning of the Hadoop framework.

MapReduce facilitates concurrent processing by splitting petabytes of data into smaller chunks and processing them in parallel on Hadoop commodity servers. In the end, it aggregates all the data from multiple servers to return a consolidated output back to the application.

With MapReduce, rather than sending data to where the application or logic resides,
the logic is executed on the server where the data already resides, to expedite
processing. Data access and storage is disk-based—the input is usually stored as files
containing structured, semi-structured, or unstructured data, and the output is also stored in
files.

MapReduce was once the only method through which the data stored in the HDFS could be retrieved, but that is no longer the case. Today, there are other query-based systems such as Hive and Pig that are used to retrieve data from the HDFS using SQL-like statements. However, these usually run along with jobs that are written using the MapReduce model. That’s because MapReduce has unique advantages.

How MapReduce Works

At the crux of MapReduce are two functions: Map and Reduce. They are sequenced one after the other.

The Map function takes input from the disk as <key,value> pairs, processes them, and produces another set of intermediate <key,value> pairs as output.

The Reduce function also takes inputs as <key,value> pairs, and produces <key,value> pairs as output.

MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster.
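As an illustration, the classic word-count job can be sketched as a mapper and a reducer in the Hadoop Streaming style, where the mapper emits intermediate <word, 1> pairs and the reducer sums the counts for each word. This is a minimal, hypothetical Python sketch (script names and paths are illustrative), not the exact code of any particular Hadoop distribution:

```python
#!/usr/bin/env python3
# Word-count sketch in the Hadoop Streaming style.
import sys

def run_mapper():
    # Map: read raw lines and emit intermediate "word<TAB>1" pairs.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

def run_reducer():
    # Reduce: input arrives sorted by key; sum the counts per word.
    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").rsplit("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

if __name__ == "__main__":
    # In a real Streaming job these would be two separate scripts;
    # here a command-line flag selects the role for brevity.
    run_reducer() if "--reduce" in sys.argv else run_mapper()
```

Under a typical installation, such scripts would be submitted with the Hadoop Streaming JAR, for example: hadoop jar hadoop-streaming-*.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /data/in -output /data/out (the exact JAR path and options depend on the Hadoop version installed).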

Hadoop Library from Apache


Apache Hadoop software is an open source framework that allows for the distributed storage and processing of large datasets across clusters of computers using simple programming models. Hadoop is designed to scale up from a single computer to thousands of clustered computers, with each machine offering local computation and storage. In this way, Hadoop can efficiently store and process large datasets ranging in size from gigabytes to petabytes of data.

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.

Four modules comprise the primary Hadoop framework and work collectively to form the Hadoop ecosystem:

• Hadoop Distributed File System (HDFS): As the primary component of the Hadoop ecosystem, HDFS is a distributed file system that provides high-throughput access to application data with no need for schemas to be defined up front.

• Yet Another Resource Negotiator (YARN): YARN is a resource-management platform responsible for managing compute resources in clusters and using them to schedule users’ applications. It performs scheduling and resource allocation across the Hadoop system.

• MapReduce: MapReduce is a programming model for large-scale data processing. Using distributed and parallel computation algorithms, MapReduce makes it possible to carry over processing logic and helps to write applications that transform big datasets into one manageable set.

• Hadoop Common: Hadoop Common includes the libraries and utilities used and shared by other Hadoop modules.
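To make the role of HDFS in this ecosystem concrete, the short Python sketch below wraps the standard hdfs dfs shell commands to copy a local file into HDFS and list a directory. It is a minimal illustration that assumes a working Hadoop client configuration on the machine; the file and directory paths are hypothetical.

```python
import subprocess

def hdfs(*args):
    """Run an 'hdfs dfs' subcommand and return its standard output."""
    result = subprocess.run(["hdfs", "dfs", *args],
                            capture_output=True, text=True, check=True)
    return result.stdout

# Copy a local input file into HDFS (paths are illustrative).
hdfs("-mkdir", "-p", "/data/wordcount/in")
hdfs("-put", "-f", "local_input.txt", "/data/wordcount/in/")

# List the directory to confirm the file now lives in the distributed file system.
print(hdfs("-ls", "/data/wordcount/in"))
```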

What is Apache Hadoop used for?

Here are some common use cases for Apache Hadoop:

Analytics and big data

A wide variety of companies and organizations use Hadoop for research, production data processing, and analytics that require processing terabytes or petabytes of big data, storing diverse datasets, and data parallel processing.

Vertical industries

Companies in myriad industries, including technology, education, healthcare, and financial services, rely on Hadoop for tasks that share a common theme of high variety, volume, and velocity of structured and unstructured data.

AI and machine learning

Hadoop ecosystems also play a key role in supporting the development of artificial intelligence and machine learning applications.

Cloud computing

Companies often choose to run Hadoop clusters on public, private, or hybrid cloud resources versus on-premises hardware to gain flexibility, availability, and cost control. Many cloud solution providers offer fully managed services for Hadoop, such as Dataproc from Google Cloud. With this kind of prepackaged service for cloud-native Hadoop, operations that used to take hours or days can be completed in seconds or minutes, with companies paying only for the resources used.

Application Mapping

Application mapping refers to the process of identifying and mapping the interactions and relationships between applications and the underlying infrastructure. An application map, or network map, visualizes the devices on a network and how they are related. It gives users a sense of how the network performs, so they can run analysis and avoid data bottlenecks. For containerized applications, it depicts the dynamic connectivity and interactions between the microservices.
What is Application Mapping?
As enterprises grow, the number and complexity of applications grow as well. Application mapping helps IT teams track the interactions and relationships between applications, software, and supporting hardware.

In the past, companies mapped out interdependencies between apps using extensive spreadsheets and manual audits of application code. Today, however, companies can rely on an application mapping tool that automatically discovers and visualizes interactions for IT teams. Popular application mapping tools include configuration management database (CMDB) application mapping and UCMDB application mapping. Some application delivery controllers also integrate application mapping software.

Application mapping includes the following techniques:

• SNMP-Based Maps – Simple Network Management Protocol (SNMP) monitors the health of computer and network equipment such as routers. An SNMP-based map uses data from router and switch management information bases (MIBs).

• Active Probing – Creates a map with data from packets that report IP router and switch forwarding paths to the destination address. These maps are used to find “peering links” between Internet Service Providers (ISPs). The peering links allow ISPs to exchange customer traffic.

• Route Analytics – Creates a map by passively listening to layer 3 protocol exchanges between routers. This data facilitates real-time network monitoring and routing diagnostics.
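However the raw data is collected, an application map is essentially a graph of components and their dependencies. The sketch below is a minimal, hypothetical illustration in plain Python (the component names are invented) showing how such a map can be stored as an adjacency list and traversed to find everything a given application depends on:

```python
from collections import deque

# A tiny application map: each key depends on the components it lists.
# The component names below are purely illustrative.
app_map = {
    "web-frontend":   ["orders-service", "auth-service"],
    "orders-service": ["orders-db", "message-queue"],
    "auth-service":   ["users-db"],
    "orders-db":      [],
    "users-db":       [],
    "message-queue":  [],
}

def downstream_dependencies(component):
    """Breadth-first traversal: everything the component depends on, directly or indirectly."""
    seen, queue = set(), deque([component])
    while queue:
        current = queue.popleft()
        for dep in app_map.get(current, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen

# Tracing connections on the map: if any of these components fails,
# "web-frontend" is affected, because it transitively depends on them.
print(downstream_dependencies("web-frontend"))
```

This is the kind of traversal an application mapping tool performs automatically when it pinpoints which applications are affected by a faulty device or service.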

What are the Benefits of Application Mapping?

Application mapping diagrams provide the following benefits:

• Visibility – locate exactly where applications are running and plan accordingly for system failures.

• Application health – understand the health of the entire application instead of analyzing individual infrastructure silos.

• Quick troubleshooting – pinpoint faulty devices or software components in seconds by conveniently tracing connections on the app map, rather than sifting through the entire infrastructure.

How are Application Maps Used in Networking?

IT personnel use app maps to conceptualize the relationships between devices and transport layers that provide network services. Using the application map, IT can monitor network statuses, identify data bottlenecks, and troubleshoot when necessary.

What is an Application Mapping Example?

An application map provides visual insights into inter-app communications in a container-based microservices application deployment. It captures the complex relationships between containers. An application map can graph the latency, connections, and throughput information of microservice relationships.


Eucalyptus

Eucalyptus is an open source software platform for implementing Infrastructure as a Service (IaaS) in a private or hybrid cloud computing environment.

The Eucalyptus cloud platform pools together existing virtualized infrastructure to create cloud resources for infrastructure as a service, network as a service and storage as a service. The name Eucalyptus is an acronym for Elastic Utility Computing Architecture for Linking Your Programs To Useful Systems.

Eucalyptus was founded out of a research project in the Computer Science Department at the University of California, Santa Barbara, and became a for-profit business called Eucalyptus Systems in 2009. Eucalyptus Systems announced a formal agreement with Amazon Web Services (AWS) in March 2012, allowing administrators to move instances between a Eucalyptus private cloud and the Amazon Elastic Compute Cloud (EC2) to create a hybrid cloud. The partnership also allows Eucalyptus to work with Amazon’s product teams to develop unique AWS-compatible features.

Eucalyptus features include:

• Supports both Linux and Windows virtual machines (VMs).

• Application program interface (API) compatibility with the Amazon EC2 platform.

• Compatibility with Amazon Web Services (AWS) and the Simple Storage Service (S3).

• Works with multiple hypervisors including VMware, Xen and KVM.

• Can be installed and deployed from source code or from DEB and RPM packages.

• Internal process communications are secured through SOAP and WS-Security.

• Multiple clusters can be virtualized as a single cloud.

• Administrative features such as user and group management and reports.
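Because the API is EC2-compatible, standard AWS client libraries can usually be pointed at a Eucalyptus cloud simply by overriding the endpoint. The snippet below is a hedged sketch using Python's boto3 library; the endpoint URL, region name and credentials are placeholders for whatever your Eucalyptus installation exposes:

```python
import boto3

# Placeholder endpoint and credentials -- substitute the values for your
# own Eucalyptus cloud controller.
ec2 = boto3.client(
    "ec2",
    endpoint_url="https://eucalyptus.example.com:8773/services/compute",
    region_name="eucalyptus",
    aws_access_key_id="YOUR_ACCESS_KEY",
    aws_secret_access_key="YOUR_SECRET_KEY",
)

# The same EC2 calls used against AWS work against the Eucalyptus endpoint.
for image in ec2.describe_images(Owners=["self"])["Images"]:
    print(image["ImageId"], image.get("Name"))
```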

Version 3.3, which became generally available in June 2013, adds the following features:

• Auto Scaling: Allows application developers to scale Eucalyptus resources up or down based on policies defined using Amazon EC2-compatible APIs and tools.

• Elastic Load Balancing: An AWS-compatible service that provides greater fault tolerance for applications.

• CloudWatch: An AWS-compatible service that allows users to collect metrics, set alarms, identify trends, and take action to ensure applications run smoothly.

• Resource Tagging: Fine-grained reporting for showback and chargeback scenarios; allows IT/DevOps to build reports that show cloud utilization by application, department or user.

• Expanded Instance Types: An expanded set of instance types that aligns more closely with those available in Amazon EC2, growing from 5 to 15 instance types.

• Maintenance Mode: Allows replication of a virtual machine’s hard drive and evacuation of the server node, and provides a maintenance window.

EUCALYPTUS ARCHITECTURE

Eucalyptus CLIs can manage both Amazon Web Services and their own private instances. Users can easily migrate instances from Eucalyptus to Amazon EC2. Network, storage, and compute are managed by the virtualization layer, and instances are isolated from one another through hardware virtualization. The following terminology is used by the Eucalyptus architecture in cloud computing.

1. Images: Any software application, configuration, module software or framework software packaged and deployed in the Eucalyptus cloud is known as a Eucalyptus Machine Image.

2. Instances: When we run an image and use it, it becomes an instance.

3. Networking: The Eucalyptus network is partitioned into three modes: Static mode, System mode, and Managed mode.

4. Access control: Used to restrict what users are allowed to do.

5. Eucalyptus elastic block storage: Provides block-level storage volumes that can be attached to an instance.

6. Auto-scaling and load balancing: Used to create or destroy instances or services based on requirements.
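To connect the terminology above to practice, launching an instance from a Eucalyptus Machine Image and attaching a block storage volume can be done through the same EC2-compatible calls shown earlier. This is a hypothetical sketch (the image ID, instance type, availability zone and client configuration are placeholders), not the exact workflow of any particular Eucalyptus release:

```python
import boto3

# EC2-compatible client pointed at the Eucalyptus endpoint
# (placeholder endpoint and credentials, as in the earlier snippet).
ec2 = boto3.client("ec2",
                   endpoint_url="https://eucalyptus.example.com:8773/services/compute",
                   region_name="eucalyptus",
                   aws_access_key_id="YOUR_ACCESS_KEY",
                   aws_secret_access_key="YOUR_SECRET_KEY")

# Images and instances: launch an instance from a Eucalyptus Machine Image.
response = ec2.run_instances(
    ImageId="emi-12345678",   # placeholder image ID
    InstanceType="m1.small",  # placeholder instance type
    MinCount=1,
    MaxCount=1,
)
instance_id = response["Instances"][0]["InstanceId"]
print("Launched instance:", instance_id)

# Elastic block storage: create a volume and attach it to the instance.
volume = ec2.create_volume(Size=10, AvailabilityZone="eucalyptus-az-1")
ec2.attach_volume(VolumeId=volume["VolumeId"],
                  InstanceId=instance_id, Device="/dev/sdf")
```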

EUCALYPTUS COMPONENTS

Components of Eucalyptus in cloud computing:

1. Cluster Controller: Manages one or more Node Controllers and is responsible for deploying and managing instances on them.

2. Storage Controller: Allows the creation of snapshots of volumes.

3. Cloud Controller: The front end for the whole environment.

4. Walrus Storage Controller: A simple file storage system.

5. Node Controller: The basic component of the nodes; it maintains the life cycle of the instances running on each node.

OTHER TOOLS

Numerous other tools can be used to interact with AWS and Eucalyptus in cloud computing, and they are listed below.

1. Vagrant AWS Plugin: This tool provides config files to manage AWS instances and manage VMs on the local system.

2. s3curl: A tool for interacting with AWS S3 and Eucalyptus Walrus.

3. s3fs: A FUSE file system that can be used to mount a bucket from Walrus or S3 as a local file system.

4. Cloudberry S3 Explorer: A Windows tool for managing files between S3 and Walrus.

THE ADVANTAGES OF THE EUCALYPTUS CLOUD

The benefits of Eucalyptus in cloud computing are:

1. Eucalyptus can be used to build both private and public Eucalyptus clouds.

2. Users can run Amazon or Eucalyptus machine images as instances on either cloud.

3. It is not very popular in the market, but it is a strong competitor to CloudStack and OpenStack.

4. It has 100% Application Programming Interface compatibility with all the Amazon Web Services.

5. Eucalyptus can be used with DevOps tools like Chef and Puppet.

Features of Eucalyptus in cloud computing are:

1. Supports both Windows and Linux virtual machines.

2. Its API is compatible with the Amazon EC2 platform.

3. Compatible with the Simple Storage Service (S3) and Amazon Web Services (AWS).
 EUCALYPTUS VS OTHER IAAS PRIVATE CLOUDS

There are numerous Infrastructure-as-a-Service offerings available in the market, such as OpenNebula, Eucalyptus, CloudStack and OpenStack, all being used as private and public Infrastructure-as-a-Service offerings.

Of all of these Infrastructure-as-a-Service offerings, OpenStack remains the most popular, most active and largest open-source cloud computing project, while enthusiasm for OpenNebula, CloudStack and Eucalyptus remains strong.

WHAT IS THE USE OF EUCALYPTUS IN CLOUD COMPUTING?

Eucalyptus is used to build hybrid, public and private clouds. It can also turn your own data centre into a private cloud and allow you to extend that functionality to many other organisations.

OpenNebula

OpenNebula is an open-source tool for datacenter virtualization. It helps us to build any type of cloud (private, public and hybrid) for data centre management. This tool includes features for the integration, management, scalability, security and accounting of data centres. Its very efficient core is developed in C++ and it possesses a highly scalable database back-end with support for MySQL and SQLite.

A standard OpenNebula Cloud Architecture consists of the Cloud Management Cluster, with the Front-end node(s), and the Cloud Infrastructure, made of one or several workload Clusters. These can be located at multiple geographical locations, with different configurations and technologies to better meet your needs:

• Edge Clusters that can be automatically deployed both on premise and on public cloud or edge providers to enable true hybrid environments.

• Open Cloud Clusters based on certified combinations of open source hypervisors, storage and networking technologies.

• VMware Clusters that use existing VMware infrastructure.

Figure 1: OpenNebula architecture.

OpenNebula is an open-source Cloud Management Tool that embraces this vision. Its open architecture, interfaces and components provide the flexibility and extensibility that many enterprise IT shops need for internal cloud adoption. These features also facilitate its integration with any product and service in the cloud and virtualization ecosystem, and with any management tool in the data centre. OpenNebula provides an abstraction layer independent from the underlying services for security, virtualization, networking and storage, avoiding vendor lock-in and enabling interoperability. OpenNebula is not only built on standards, but has also provided reference implementations of open community specifications, such as the OGF Open Cloud Computing Interface. This open and flexible approach to cloud management ensures the widest possible market and user acceptability, and simplifies adaptation to different environments.

What is the OpenNebula Technology?

OpenNebula provides the most simple but feature-rich and flexible solution for the comprehensive management of virtualized data centers to enable on-premise IaaS clouds. OpenNebula interoperability makes cloud an evolution by leveraging existing IT assets, protecting your investments, and avoiding vendor lock-in.

OpenNebula can be primarily used as a platform to manage your virtualized infrastructure in the data center or cluster, which is usually referred to as a Private Cloud. OpenNebula supports Hybrid Cloud to combine local infrastructure with public cloud-based infrastructure, enabling highly scalable hosting environments. OpenNebula also supports Public Clouds by providing Cloud interfaces to expose its functionality for virtual machine, storage and network management.
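Front-end management in OpenNebula is exposed through an XML-RPC API (the CLI tools and language bindings are built on top of it). The following sketch uses only Python's standard xmlrpc.client module to query the virtual machine pool; the front-end URL and the session credentials are placeholders, and the exact parameters should be checked against the XML-RPC reference of the OpenNebula version in use:

```python
import xmlrpc.client

# Placeholder front-end endpoint and "user:password" session string.
ONE_ENDPOINT = "http://opennebula-frontend.example.com:2633/RPC2"
SESSION = "oneadmin:changeme"

server = xmlrpc.client.ServerProxy(ONE_ENDPOINT)

# one.vmpool.info(session, filter_flag, start_id, end_id, state):
# -2 = all resources, -1/-1 = full ID range, -1 = any VM state
# (verify against the XML-RPC reference of your OpenNebula release).
response = server.one.vmpool.info(SESSION, -2, -1, -1, -1)
success, result = response[0], response[1]

if success:
    # On success the result is an XML document describing the VM pool.
    print(result[:500])
else:
    print("OpenNebula API error:", result)
```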
What are our Design Principles?

The OpenNebula technology is the result of many years of research and development in the efficient and scalable management of virtual machines on large-scale distributed infrastructures. OpenNebula was designed to address the requirements of business use cases from leading companies across multiple industries, such as Hosting, Telecom, eGovernment and Utility Computing, among others. The principles that have guided the design of OpenNebula are:

• Openness of the architecture, interfaces, and code
• Flexibility to fit into any datacenter
• Interoperability and portability to prevent vendor lock-in
• Stability for use in production enterprise-class environments
• Scalability for large-scale infrastructures
• SysAdmin-centrism, with complete control over the cloud
• Simplicity: easy to deploy, operate and use
• Lightness for high efficiency

What Are Its Benefits?

For the Infrastructure Manager

• Faster response to infrastructure needs for services, with dynamic resizing of the physical infrastructure by adding new hosts, and dynamic cluster partitioning to meet the capacity requirements of services.

• Centralized management of all the virtual and physical distributed infrastructure.

• Higher utilization of existing resources, with the creation of an infrastructure incorporating the heterogeneous resources in the data center, and infrastructure sharing between different departments managing their own production clusters, thus removing application silos.

• Operational savings with server consolidation to a reduced number of physical systems, reducing space, administration effort, power and cooling requirements.

• Lower infrastructure expenses with the combination of local and remote Cloud resources, eliminating the over-purchase of systems to meet peak demands.
For the Infrastructure User

• Faster delivery and scalability of services to meet dynamic demands of service end-users.

• Support for heterogeneous execution environments with multiple, even conflicting, software requirements on the same shared infrastructure.

• Full control of the lifecycle of virtualized services management.
For System Integrators

• Fits into any existing data center thanks to its open, flexible and extensible interfaces, architecture and components.

• Builds any type of Cloud deployment.

• Open source software, Apache license.

• Seamless integration with any product and service in the virtualization/cloud ecosystem and management tool in the data center, such as cloud providers, VM managers, virtual image managers, service managers, management tools, and schedulers.
OpenStack

WHAT IS OPENSTACK?
OpenStack is a cloud operating system that controls large pools of
compute, storage, and networking resources throughout a datacenter,
all managed and provisioned through APIs with common authentication
mechanisms.

A dashboard is also available, giving administrators control while empowering their users to provision resources through a web interface.

Beyond standard infrastructure-as-a-service functionality, additional components provide orchestration, fault management and service management, amongst other services, to ensure high availability of user applications.

OpenStack is an open source platform that uses pooled virtual resources to build and manage private and public clouds. The tools that comprise the OpenStack platform, called "projects," handle the core cloud-computing services of compute, networking, storage, identity, and image services. More than a dozen optional projects can also be bundled together to create unique, deployable clouds.

In virtualization, resources such as storage, CPU, and RAM are abstracted from a variety of vendor-specific programs and split by a hypervisor before being distributed as needed. OpenStack uses a consistent set of application programming interfaces (APIs) to abstract those virtual resources one step further into discrete pools used to power standard cloud computing tools that administrators and users interact with directly.
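As a concrete illustration of interacting with those APIs, the sketch below uses the official openstacksdk Python library to authenticate against a cloud and list compute and image resources. It is a minimal example that assumes a cloud named "mycloud" is already defined in a local clouds.yaml file; the cloud name is a placeholder.

```python
import openstack

# Authenticate using the "mycloud" entry from clouds.yaml (placeholder name).
conn = openstack.connect(cloud="mycloud")

# Compute service: list the servers (instances) visible to this project.
for server in conn.compute.servers():
    print(server.name, server.status)

# Image service: list the images that could be used to boot new servers.
for image in conn.image.images():
    print(image.name)
```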