
GRID AND CLOUD COMPUTING

Globus Toolkit and Hadoop

Courtesy: Dr Gnanasekaran Thangavel
http://web.uettaxila.edu.pk/CMS/FALL2017/teGNCCms/
UNIT 6: Globus Toolkit and Hadoop
– Open source grid middleware packages:
– Globus Toolkit (GT6) Architecture, Components
– Usage of Globus
– Distributed processing using Hadoop
- Introduction to Hadoop
- MapReduce: input splitting, map and reduce functions, specifying
input and output parameters, configuring and running a job
- Design of the Hadoop file system: HDFS concepts, command-line and
Java interfaces, dataflow of file read and file write.

Open source grid middleware packages
• The Open Grid Forum and the Object Management Group (OMG) are two well-established
organizations behind the relevant standards.
• Middleware is the software layer that connects software components; it lies between the
operating system and the applications.
• Grid middleware is a specially designed layer between the hardware and the applications; it
enables the sharing of heterogeneous resources and manages the virtual organizations created
around the grid.
• Popular grid middleware packages include:
1. BOINC - Berkeley Open Infrastructure for Network Computing.
2. UNICORE - Middleware developed by the German grid computing community.
3. Globus (GT4) - A middleware library jointly developed by Argonne National Lab., Univ. of
Chicago, and USC Information Science Institute, funded by DARPA, NSF, and NIH.
4. CGSP in ChinaGrid - The CGSP (ChinaGrid Support Platform) is a middleware library
developed by 20 top universities in China as part of the ChinaGrid Project

Open source grid middleware packages (contd.)

5. Condor-G - Originally developed at the Univ. of Wisconsin for general
distributed computing, and later extended to Condor-G for grid job
management.
6. Sun Grid Engine (SGE) - Developed by Sun Microsystems for business grid
applications. Applied to private grids and local clusters within enterprises or
campuses.
7. gLite - Born from the collaborative efforts of more than 80 people in 12 different
academic and industrial research centers as part of the EGEE Project, gLite
provides a framework for building grid applications that tap into the power of
distributed computing and storage resources across the Internet.

The Globus Toolkit Architecture (GT6)
• The Globus Toolkit is an open-source middleware library for the grid computing community. These
open source software libraries support many operational grids and their applications on an
international basis.
• The toolkit addresses common problems and issues related to grid resource discovery,
management, communication, security, fault detection, and portability. The software itself
provides a variety of components and capabilities.
• The library includes a rich set of service implementations. The implemented software supports
grid infrastructure management, provides tools for building new web services in Java, C, and
Python, builds a powerful standards-based security infrastructure and client APIs (in different
languages), and offers comprehensive command-line programs for accessing various grid
services.
• The Globus Toolkit was initially motivated by a desire to remove obstacles that prevent
seamless collaboration, and thus sharing of resources and services, in scientific and
engineering applications. The shared resources can be computers, storage, data, services,
networks, science instruments (e.g., sensors), and so on.
• The Globus Toolkit version GT6 is shown conceptually in the figure on the next slide.
GT6 is binary compatible with GT5 and GT5.2

Grid Security Infrastructure (GSI)
• As grid resources and users are distributed and owned by different organizations,
only authorized users should be allowed to access them.
• A simple authentication infrastructure is needed.
• Also, both users and owners should be protected from each other.
• Users need to be assured about the security of their:
– Data
– Code
– Message
• GSI Provides all of the above
• GSI C is the set of C language libraries included in the Globus Toolkit
• They are used to compile and patch OpenSSH for use with GSI

MyProxy
• Online repository of encrypted GSI credentials
• Provides authenticated retrieval of proxy credentials over the
network
• Improves usability
– Retrieve proxy credentials when/where needed without managing private
key and certificate files
• Improves security
– Long-term credentials stored encrypted on a well-secured server
% bin/grid-proxy-init
Your identity: /O=Grid/OU=Example/CN=Adeel Akram
Enter GRID pass phrase for this identity:
Creating proxy ................................. Done
Your proxy is valid until: Tue Oct 26 01:33:42 2010
Credential Accessibility with MyProxy
• A MyProxy server can be deployed for a single user, a virtual
organization, or a Certificate Authority (CA)
• Users can delegate proxy credentials to the MyProxy server for
storage
– Can store multiple credentials with different names, lifetimes, and access
policies
• Then, they can retrieve stored proxies when needed using
MyProxy client tools
– And allow trusted services to retrieve proxies
• No need to copy certificate and key files between machines
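• As a sketch of typical usage (the server name and username below are placeholders, not from the slides), a credential is stored once with myproxy-init and a short-lived proxy is retrieved later with myproxy-logon:
% myproxy-init -s myproxy.example.org -l adeel
% myproxy-logon -s myproxy.example.org -l adeel -t 12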

GridFTP

• GridFTP is an extension of the File Transfer Protocol
(FTP) for grid computing.
• The protocol was defined within the GridFTP working
group of the Open Grid Forum.
• There are multiple implementations of the protocol; the
most widely used is that provided by the Globus Toolkit.
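• As an illustration (hostnames and paths are placeholders), a third-party, parallel-stream transfer with the Globus Toolkit's globus-url-copy client looks roughly like:
% globus-url-copy -p 4 gsiftp://source.example.org/data/in.dat gsiftp://dest.example.org/data/in.dat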

RLS

• The Replica Location Service (RLS) provides a framework
for tracking the physical locations of data that has been
replicated.
• At its simplest, RLS maps logical names (which don't
include specific pathnames or storage system information)
to physical names (which do include storage system
addresses and specific pathnames).
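• For illustration only (the exact syntax is from memory and should be checked against the RLS documentation; all names are placeholders), the globus-rls-cli client registers and queries logical-to-physical mappings roughly as follows:
% globus-rls-cli create lfn1 gsiftp://host.example.org/data/file1 rls://rls.example.org
% globus-rls-cli query lrc lfn lfn1 rls://rls.example.org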

GRAM5
• The Globus Toolkit includes a set of service components
collectively referred to as the Globus Resource Allocation
Manager (GRAM).
• GRAM simplifies the use of remote systems by providing a
single standard interface for requesting and using remote
system resources for the execution of "jobs".
• The most common use (and the best supported use) of
GRAM is remote job submission and control. This is typically
used to support distributed computing applications.
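• As a simple sketch (the host and job manager name are placeholders), submitting a job through GRAM with the globus-job-run client might look like:
% globus-job-run grid.example.org/jobmanager-pbs /bin/hostname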

Globus XIO
• Globus XIO is an extensible input/output library written in C for the
Globus Toolkit.
• It provides a single API (open/close/read/write) that supports
multiple wire (communication) protocols, with protocol
implementations encapsulated as drivers.
• The XIO drivers distributed with GT 6.0 include TCP, UDP, file, HTTP,
GSI, GSSAPI_FTP, TELNET, and queuing drivers.
• In addition, Globus XIO provides a driver development interface for
use by protocol developers.

The Globus Toolkit 4

The Globus Toolkit
• GT4 offers the middle-level core services in grid applications.
• The high-level services and tools, such as MPI , Condor-G, and
Nimrod/G, are developed by third parties for general purpose distributed
computing applications.
• The local services, such as LSF, TCP, Linux, and Condor, are at the
bottom level and are fundamental tools supplied by other developers.
• As a de facto standard in grid middleware, GT6 is based on industry-
standard web service technologies.

High Level Services and Tools
• DRM – Distributed Resource Management ➢ Resource manager
on ASCI supercomputer
• Cactus ➢ Grid-aware numerical solver framework
• MPICH-G2 ➢ Grid-enabled MPI
• Globusrun ➢ More complicated version of globus-job-run
• PUNCH ➢ Web browser based resource manager from Purdue
University
• Nimrod/G ➢ Modeling of parametric computational jobs, from Monash University
• Grid Status ➢ Repository of state of jobs in grid
• Condor-G ➢ Condor job management layer to Globus
GT Core Services
• GASS – Globus Access to Secondary Storage ➢ File and executable staging
and I/O redirection
• GridFTP - Grid File Transfer Protocol ➢ Reliable, high performance FTP
• MDS - Metacomputing Directory Service ➢ Maintains information about
available resources
• GSI - Grid Security Infrastructure ➢ Authentication, authorization via proxies,
delegation, PKI, SSL
• Replica Catalog ➢ Manages partial copies of full data set across grid
• GRAM - Grid Resource Allocation Management ➢ Allocation, reservation,
monitoring and control of programs on remote systems
• I/O ➢ Wrappers for TCP, UDP, IP multicast, and file I/O

GT Local Services
• Condor ➢ Job and resource manager for compute-intensive jobs
• MPI – Message Passing Interface ➢ Portability across platforms
• LSF – Load Sharing Facility ➢ Management of batch workload
• PBS – Portable Batch System ➢ Scheduling / resource
management
• NQE – Network Queueing Environment ➢ Resource manager on
Cray systems

Functionalities of GT4
• Global Resource Allocation Manager (GRAM) - Grid Resource Access
and Management (HTTP-based)
• Communication (Nexus) - Unicast and multicast communication
• Grid Security Infrastructure (GSI) - Authentication and related security
services
• Monitoring and Discovery Service (MDS) - Distributed access to structure
and state information
• Health and Status (HBM) - Heartbeat monitoring of system components
• Global Access of Secondary Storage (GASS) - Grid access of data in
remote secondary storage
• Grid File Transfer (GridFTP) - Fast inter-node file transfer

Globus Job Workflow

Globus Job Workflow
• A typical job execution sequence proceeds as follows: The user delegates his credentials to
a delegation service.
• The user submits a job request to GRAM with the delegation identifier as a parameter.
• GRAM parses the request, retrieves the user proxy certificate from the delegation service,
and then acts on behalf of the user.
• GRAM sends a transfer request to the RFT (Reliable File Transfer), which applies GridFTP
to bring in the necessary files.
• GRAM invokes a local scheduler via a GRAM adapter and the SEG (Scheduler Event
Generator) initiates a set of user jobs.
• The local scheduler reports the job state to the SEG. Once the job is complete, GRAM uses
RFT and GridFTP to stage out the resultant files. The grid monitors the progress of these
operations and sends the user a notification.

Client-Globus Interactions

• There are strong interactions between provider programs and user code. GT4 makes heavy
use of industry-standard web service protocols and mechanisms in service description,
discovery, access, authentication, authorization.
• User code for GT4 can be written in Java, C, and Python. Web service mechanisms
define specific interfaces for grid computing.
• Web services provide flexible, extensible, and widely adopted XML-based interfaces.

Data Management Using GT4
• For Grid applications one needs to provide access to and/or integrate large quantities of data at multiple
sites. The GT4 tools can be used individually or in conjunction with other tools to develop interesting
solutions to efficient data access. The following list briefly introduces these GT4 tools:
• 1. GridFTP supports reliable, secure, and fast memory-to-memory and disk-to-disk data movement over
high-bandwidth WANs. Based on the popular FTP protocol for Internet file transfer, GridFTP adds
additional features such as parallel data transfer, third-party data transfer, and striped data transfer. In
addition, GridFTP benefits from using the strong Globus Security Infrastructure for securing data
channels with authentication and reusability. It has been reported that the grid has achieved 27
Gbit/second end-to-end transfer speeds over some WANs.
• 2. RFT provides reliable management of multiple GridFTP transfers. It has been used to orchestrate the
transfer of millions of files among many sites simultaneously.
• 3. RLS (Replica Location Service) is a scalable system for maintaining and providing access to
information about the location of replicated files and data sets.
• 4. OGSA-DAI (Open Grid Services Architecture Data Access and Integration) tools were developed by the
UK eScience program and provide access to relational and XML databases.

Introduction to Hadoop
• Hadoop is an open source, Java-based programming framework that
supports the processing and storage of extremely large data sets in
a distributed computing environment.
• It is part of the Apache project sponsored by the Apache Software
Foundation.
• Hadoop essentially provides two things:
– A distributed filesystem called HDFS (Hadoop Distributed File System)
– A framework and API for building and running MapReduce jobs

• It is a flexible and highly-available architecture for large scale computation and data
processing on a network of commodity hardware.
• Hadoop offers a software platform that was originally developed by a Yahoo! group. The
package enables users to write and run applications over vast amounts of distributed
data.
• Users can easily scale Hadoop to store and process petabytes of data in the web space.
• Hadoop is economical in that it comes with an open source version of MapReduce that
minimizes overhead in task spawning and massive data communication.
• It is efficient, as it processes data with a high degree of parallelism across a large
number of commodity nodes, and it is reliable in that it automatically keeps multiple data
copies to facilitate redeployment of computing tasks upon unexpected system failures.

Hadoop Distributed File System (HDFS)

• HDFS is structured similarly to a regular Unix filesystem,
except that data storage is distributed across several
machines.
• It is not intended as a replacement to a regular filesystem,
but rather as a filesystem-like layer for large distributed
systems to use.
• It has built-in mechanisms to handle machine outages, and
is optimized for throughput rather than latency.

HDFS Cluster Machines
• There are two and a half types of machine in an HDFS
cluster:
– Datanode - where HDFS actually stores the data; there are
usually quite a few of these.
– Namenode - the ‘master’ machine. It controls all the metadata
for the cluster, e.g. which blocks make up a file, and which
datanodes those blocks are stored on.
– Secondary Namenode - this is NOT a backup namenode, but is
a separate service that keeps a copy of both the edit logs and the
filesystem image, merging them periodically to keep the size
reasonable.
HDFS Cluster Machines

• Name Node
• Data Nodes
• Secondary Name Nodes

• Data can be accessed using either the Java API or the
Hadoop command line client. Many operations are similar
to their Unix counterparts.
HDFS Command Line

• list files in the root directory
hadoop fs -ls /
• list files in my home directory
hadoop fs -ls ./
• print the contents of a compressed file in my home directory
hadoop fs -text ./file.txt.gz
• upload and retrieve a file
hadoop fs -put ./localfile.txt /home/matthew/remotefile.txt
hadoop fs -get /home/matthew/remotefile.txt ./local/file/path/file.txt
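• For comparison, a minimal sketch of reading a file through the HDFS Java API (the class name HdfsCat and the argument handling are illustrative, not from the slides):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCat {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();          // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);               // handle to the configured filesystem (HDFS)
        FSDataInputStream in = fs.open(new Path(args[0]));  // open an HDFS path for reading
        IOUtils.copyBytes(in, System.out, 4096, true);      // copy its contents to stdout, then close
    }
}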

Hadoop’s Architecture
• Distributed, with some centralization

• Main nodes of cluster are where most of the computational power and storage of the
system lies

• Main nodes run TaskTracker to accept and reply to MapReduce tasks, and also
DataNode to store needed blocks as closely as possible

• Central control node runs NameNode to keep track of HDFS directories & files, and
JobTracker to dispatch compute tasks to TaskTracker

• Written in Java, also supports Python and Ruby

Hadoop’s Architecture
• Hadoop Distributed
Filesystem
• Tailored to needs of
MapReduce
• Targeted towards many
reads of filestreams
• Writes are more costly
• High degree of data
replication (3x by default)
• No need for RAID on
normal nodes
• Large blocksize (64MB)
• Location awareness of
DataNodes in network
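• These settings normally come from the cluster configuration; a minimal hdfs-site.xml sketch (the values shown are only illustrative, and dfs.block.size is the older Hadoop 1.x property name) would be:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>           <!-- number of copies kept of each block -->
  </property>
  <property>
    <name>dfs.block.size</name>
    <value>67108864</value>    <!-- 64 MB block size, in bytes -->
  </property>
</configuration>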

Hadoop’s Architecture

NameNode:
• Stores metadata for the files, like the directory structure of a typical FS.
• The server holding the NameNode instance is quite crucial, as there is only one.
• Transaction log for file deletes/adds, etc. Does not use transactions for whole
blocks or file-streams, only metadata.
• Handles creation of more replica blocks when necessary after a DataNode failure
DataNode:
• Stores the actual data in HDFS
• Can run on any underlying filesystem (ext3/4, NTFS, etc)
• Notifies NameNode of what blocks it has
• NameNode replicates blocks 2x in local rack, 1x elsewhere

HDFS Features
• HDFS is optimized differently than a regular file system. It is
designed for non-realtime applications demanding high
throughput instead of online applications demanding low latency.
• For example, files cannot be modified once written, and the
latency of reads/writes is really bad by filesystem standards. On
the flip side, throughput scales fairly linearly with the number of
datanodes in a cluster, so it can handle workloads no single
machine would ever be able to.

HDFS Features Cont...
• Failure tolerant - data can be duplicated across multiple datanodes to
protect against machine failures. The industry standard seems to be a
replication factor of 3 (everything is stored on three machines).
• Scalability - data transfers happen directly with the datanodes so
your read/write capacity scales fairly well with the number of
datanodes
• Space - need more disk space? Just add more datanodes and re-
balance
• Industry standard - Lots of other distributed applications build on top
of HDFS (HBase, Map-Reduce)
• Optimized for MapReduce

Deployment of Task Trackers with DataNodes

Hadoop’s Architecture
MapReduce Engine:
• JobTracker & TaskTracker
• JobTracker splits up data into smaller tasks (“Map”) and sends them to the
TaskTracker process in each node
• TaskTracker reports back to the JobTracker node and reports on job progress,
sends data (“Reduce”) or requests new jobs
• None of these components are necessarily limited to using HDFS
• Many other distributed file-systems with quite different architectures work
• Many other software packages besides Hadoop's MapReduce platform make
use of HDFS

MapReduce

• The second fundamental part of Hadoop is the
MapReduce layer.
• This is made up of two sub components:
– An API for writing MapReduce workflows in Java.
– A set of services for managing the execution of these
workflows.

Map and Reduce APIs

• The basic premise is this:
– Map tasks perform a transformation.
– Reduce tasks perform an aggregation.
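• A minimal word-count sketch using the org.apache.hadoop.mapreduce API illustrates the split between the two (the class and field names are illustrative, not from the slides):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Map: transform each input line into (word, 1) pairs
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce: aggregate all the counts emitted for each word
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}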

MapReduce Services

• Hadoop MapReduce comes with two primary services for
scheduling and running MapReduce jobs. They are:
– the Job Tracker (JT) and
– the Task Trackers (TT)

JT and TTs
• The JT is the master and is in charge of allocating tasks to task
trackers and scheduling these tasks globally.
• TTs are in charge of running the Map and Reduce tasks
themselves.

JT and TTs Optimization
• Many things can go wrong in a big distributed system, so these services
have some clever tricks to ensure that your job finishes successfully:
– Automatic retries - if a task fails, it is retried N times (usually 3) on different task
trackers.
– Data locality optimizations - if you co-locate a TT with an HDFS Datanode (which you
should), it will take advantage of data locality to make reading the data faster
– Blacklisting a bad TT - if the JT detects that a TT has too many failed tasks, it will
blacklist it. No tasks will then be scheduled on this task tracker.
– Speculative Execution - the JT can schedule the same task to run on several
machines at the same time, just in case some machines are slower than others.
When one version finishes, the others are killed.

Hadoop Framework Tools

Hadoop in the Wild
Hadoop is in use at most organizations that handle big data:
• Yahoo!
• Facebook
• Amazon
• Netflix
• Etc…

Some examples of scale:

• Yahoo!’s Search Webmap runs on a 10,000-core Linux cluster and powers Yahoo! Web
search

• FB’s Hadoop cluster hosts 100+ PB of data (July 2012), growing at roughly ½ PB/day (Nov 2012)

Three main applications of Hadoop

• Advertisement (Mining user behavior to generate recommendations)

• Searches (group related documents)

• Security (search for uncommon patterns)

Hadoop Highlights
• Distributed File System
• Fault Tolerance
• Open Data Format
• Flexible Schema
• Queryable Database
Why use Hadoop?
• Need to process Multi Petabyte Datasets
• Data may not have strict schema
• Expensive to build reliability in each application
• Nodes fail every day
• Need common infrastructure
• Very Large Distributed File System
• Assumes Commodity Hardware
• Optimized for Batch Processing
• Runs on heterogeneous OS
DataNode
• A Block Server
• Stores data in local file system
• Stores meta-data of a block - checksum
• Serves data and meta-data to clients
• Block Report
• Periodically sends a report of all existing blocks to NameNode
• Facilitate Pipelining of Data
• Forwards data to other specified DataNodes

Block Placement
• Replication Strategy
• One replica on local node
• Second replica on a remote rack
• Third replica on same remote rack
• Additional replicas are randomly placed
• Clients read from nearest replica

Data Correctness
• Use Checksums to validate data – CRC32
• File Creation
• Client computes a checksum per 512 bytes
• DataNode stores the checksum
• File Access
• Client retrieves the data and checksum from
DataNode
• If validation fails, client tries other replicas
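• To illustrate the idea only (this is plain Java, not Hadoop's internal code), a CRC32 checksum per 512-byte chunk can be computed like this:

import java.util.zip.CRC32;

public class ChunkChecksums {
    // One CRC32 value per 512-byte chunk; HDFS stores such checksums alongside the data
    // and recomputes them on read to detect corruption.
    public static long[] checksums(byte[] data) {
        int chunks = (data.length + 511) / 512;
        long[] sums = new long[chunks];
        CRC32 crc = new CRC32();
        for (int i = 0; i < chunks; i++) {
            crc.reset();
            int off = i * 512;
            int len = Math.min(512, data.length - off);
            crc.update(data, off, len);
            sums[i] = crc.getValue();
        }
        return sums;
    }
}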

Data Pipelining

• Client retrieves a list of DataNodes on which to
place replicas of a block
• Client writes block to the first DataNode
• The first DataNode forwards the data to the next
DataNode in the Pipeline
• When all replicas are written, the client moves on to
write the next block in the file
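• From the client's point of view this pipeline is hidden behind the ordinary write API; a minimal sketch (the path and contents are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // create() asks the NameNode to allocate a DataNode pipeline for each block
        FSDataOutputStream out = fs.create(new Path("/user/demo/example.txt"));
        out.writeBytes("hello hdfs\n");   // bytes stream to the first DataNode, which forwards them
        out.close();                      // returns once the replicas acknowledge the data
    }
}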

MapReduce Usage

• Log processing
• Web search indexing
• Ad-hoc queries

MapReduce Process (org.apache.hadoop.mapred)

• JobClient
• Submit job
• JobTracker
• Manage and schedule job, split job into tasks
• TaskTracker
• Start and monitor the task execution
• Child
• The process that actually executes the task
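• As a sketch of configuring and running a job (this uses the newer org.apache.hadoop.mapreduce API and reuses the earlier WordCount sketch; the class name and paths are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");           // configure the job
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenMapper.class);          // mapper from the earlier sketch
        job.setReducerClass(WordCount.SumReducer.class);          // reducer from the earlier sketch
        job.setOutputKeyClass(Text.class);                        // output key/value types
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));     // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // output directory (must not already exist)
        System.exit(job.waitForCompletion(true) ? 0 : 1);         // submit the job and wait for it to finish
    }
}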

References

1. Kai Hwang, Geoffrey C. Fox and Jack J. Dongarra, “Distributed and Cloud
Computing: Clusters, Grids, Clouds and the Future of Internet”, First Edition,
Morgan Kaufmann Publishers, an Imprint of Elsevier, 2012.
2. www.csee.usf.edu/~anda/CIS6930-S11/notes/hadoop.ppt
3. www.ics.uci.edu/~cs237/lectures/cloudvirtualization/Hadoop.pptx
4. http://www.softwaresummit.com/2003/speakers/BrownGridIntro.pdf
5. http://bigdata-madesimple.com/20-essential-hadoop-tools-for-crunching-big-data/
6. https://developer.yahoo.com/hadoop/tutorial/module1.html
7. https://blog.matthewrathbone.com/2013/04/17/what-is-hadoop.html
8. https://en.wikipedia.org/wiki/Apache_Hadoop

Assignment #5
• Visit https://developer.yahoo.com/hadoop/tutorial/module1.html and summarize
how Hadoop compares to other distributed computing platforms. In what
scenarios does an MPI-based cluster perform better than Hadoop?
• Write a note on the Data Access Frameworks and Orchestration Frameworks
mentioned in slide 42 (Hadoop Framework Tools)

Thank You

Questions and Comments?

http://web.uettaxila.edu.pk/CMS/FALL2017/teGNCCms/

