
High Performance Cluster Computing

(Architecture, Systems, and Applications)

ISCA
2000

Rajkumar Buyya,

Monash University, Melbourne.

Email: rajkumar@csse.monash.edu.au / rajkumar@buyya.com


Web: http://www.csse.monash.edu.au/~rajkumar / www.buyya.com

Objectives

Learn and share recent advances in cluster
computing (both in research and commercial
settings):

Architecture
System Software
Programming Environments and Tools
Applications

Cluster Computing Infoware: (tutorial online)

http://www.buyya.com/cluster/
2

Agenda

Overview of Computing
Motivations & Enabling Technologies
Cluster Architecture & its Components
Clusters Classifications
Cluster Middleware
Single System Image
Representative Cluster Systems
Resources and Conclusions
3

Computing Elements

[Figure: layered view of the computing elements of a multi-processor computing system -
Applications run over Programming Paradigms and a Threads Interface, which sit on the
Operating System (Microkernel) and the Hardware (Processors hosting Processes and Threads).]
4

Two Eras of Computing


[Figure: timeline (1940 to 2030) of the two eras of computing. Both the Sequential Era and
the Parallel Era pass through the same phases - Architectures, System Software, Applications,
and Problem Solving Environments (P.S.Es) - each moving from R&D to commercialization to
commodity.]
5

Computing Power and Computer Architectures

Computing Power (HPC) Drivers

Solving grand challenge applications using
computer modeling, simulation and analysis:

Life Sciences
CAD/CAM
Aerospace
Digital Biology
E-commerce/anything
Military Applications
7

How to Run App. Faster ?

There are 3 ways to improve performance:

1. Work Harder
2. Work Smarter
3. Get Help

Computer Analogy:

1. Use faster hardware: e.g. reduce the time per
instruction (clock cycle).
2. Use optimized algorithms and techniques.
3. Use multiple computers to solve the problem: that
is, increase the number of instructions executed per
clock cycle.
8

Application Case Study

Web Serving and E-Commerce

10

E-Commerce and PDC ?


What are/will be the major problems/issues in
e-commerce? How will or can PDC be applied to
solve some of them?
Other than compute power, what else can PDC
contribute to e-commerce?
How would/could the different forms of PDC
(clusters, hyperclusters, the Grid, ...) be applied to
e-commerce?
Could you describe one hot research topic for
PDC applied to e-commerce?
A killer e-commerce application for PDC?
...

11

Killer Applications of Clusters

Numerous Scientific & Engineering Apps.
Parametric Simulations
Business Applications
E-commerce Applications (Amazon.com, eBay.com, ...)
Database Applications (Oracle on clusters)
Decision Support Systems
Internet Applications
Web serving / searching
Infowares (yahoo.com, AOL.com)
ASPs (application service providers)
eMail, eChat, ePhone, eBook, eCommerce, eBank, eSociety, eAnything!
Computing Portals
Mission Critical Applications
command and control systems, banks, nuclear reactor control, star wars, and
handling life-threatening situations.

12

Major problems/issues in E-commerce


Social Issues
Capacity Planning

Multilevel Business Support (e.g., B2P2C)


Information Storage, Retrieval, and Update
Performance
Heterogeneity
System Scalability
System Reliability
Identification and Authentication
System Expandability
Security
Cyber Attacks Detection and Control
(cyberguard)
Data Replication, Consistency, and Caching
Manageability (administration and control)

13

Amazon.com: Online sales/trading

A killer E-commerce portal

Several thousands of items
books, publishers, suppliers
Millions of customers
customer details, transaction details, support
for transaction updates
(Millions) of partners
keep track of partner details, tracking referral
links to partners, sales, and payments
Sales based on advertised price
Sales through auction/bids
a mechanism for participating in the bid
(buyers/sellers define the rules of the game)

14


Can these drive E-Commerce?

Clusters are already in use for web serving, web-hosting, and a
number of other Internet applications including E-commerce:

scalability, availability, performance, reliable high-performance
massive storage, and database support.
Attempts to support online detection of cyber attacks (through
data mining) and control.

Hyperclusters and the GRID:

Support for transparency in (secure) site/data replication for high
availability and quick response time (taking the site close to the user).
Compute power from hyperclusters/Grids can be used for data
mining for cyber attack and fraud detection and control.
Helps to build the Compute Power Market, ASPs, and Computing
Portals.

15

Science Portals - e.g., PAPIA system

Pentiums
Myrinet
NetBSD/Linux
PM
Score-D
MPC++

RWCP Japan: http://www.rwcp.or.jp/papia/

PAPIA PC Cluster
16

PDC hot topics for E-commerce

Cluster-based web servers, search engines, portals
Scheduling and Single System Image
Heterogeneous Computing
Reliability, High Availability, and Data Recovery
Parallel databases and high-performance, reliable mass storage
systems
CyberGuard! Data mining for detection of cyber attacks, frauds, etc.
and online control
Data mining for identifying sales patterns and automatically tuning
the portal to special sessions/festival sales
eCash, eCheque, eBank, eSociety, eGovernment, eEntertainment,
eTravel, eGoods, and so on
Data/site replication and caching techniques
Compute Power Market
Infowares (yahoo.com, AOL.com)
ASPs (application service providers)
17
...

Sequential Architecture
Limitations

Sequential architectures are reaching physical
limitations (speed of light, thermodynamics).

Hardware improvements like pipelining,
superscalar execution, etc., are non-scalable and
require sophisticated compiler technology.

Vector processing works well only for certain
kinds of problems.
18

Computational Power Improvement

[Figure: Computational Power Improvement (C.P.I.) vs. number of processors -
a uniprocessor improves slowly, while a multiprocessor scales with the number of processors.]
19

Human Physical Growth Analogy:
Computational Power Improvement

[Figure: vertical growth (one person growing with age) levels off, while horizontal growth
(adding more people) keeps scaling - just as adding processors scales computing power
beyond what a single processor can achieve.]
20

Why Parallel Processing NOW?

The technology of PP is mature and can be
exploited commercially; there is significant
R&D work on the development of tools
and environments.

Significant developments in networking
technology are paving the way for
heterogeneous computing.
21

History of Parallel Processing

PP can be traced back to a tablet dated
around 100 BC.

The tablet had 3 calculating positions.

We infer that multiple positions were used for
reliability and/or speed.
22

Motivating Factors

The aggregate speed with which complex calculations
are carried out by millions of neurons in the
human brain is amazing, even though an
individual neuron's response is slow
(on the order of milliseconds) - this demonstrates the
feasibility of PP.
23

Taxonomy of Architectures

Simple classification by Flynn
(number of instruction and data streams):

SISD - conventional
SIMD - data parallel, vector computing
MISD - systolic arrays
MIMD - very general, multiple approaches

Current focus is on the MIMD model, using
general purpose processors or multicomputers.
24

Main HPC Architectures..1a


SISD - mainframes, workstations, PCs.
SIMD Shared Memory - Vector machines, Cray...
MIMD Shared Memory - Sequent, KSR, Tera, SGI,
SUN.
SIMD Distributed Memory - DAP, TMC CM-2...
MIMD Distributed Memory - Cray T3D, Intel,
Transputers, TMC CM-5, plus recent workstation
clusters (IBM SP2, DEC, Sun, HP).

25

Motivation for using Clusters


The communications bandwidth between
workstations is increasing as new
networking technologies and protocols are
implemented in LANs and WANs.
Workstation clusters are easier to integrate
into existing networks than special parallel
computers.

26

Main HPC Architectures..1b.

NOTE: Modern sequential machines are not purely
SISD - advanced RISC processors use many
concepts from vector and parallel architectures
(pipelining, parallel execution of instructions,
prefetching of data, etc.) in order to achieve one or
more arithmetic operations per clock cycle.

27

Parallel Processing Paradox

The time required to develop a parallel
application for solving a grand challenge
application (GCA) is equal to:

the half-life of parallel supercomputers.
28

The Need for Alternative
Supercomputing Resources

Vast numbers of under-utilised
workstations are available to use.
Huge numbers of unused processor
cycles and resources could be
put to good use in a wide variety of
application areas.
There is reluctance to buy supercomputers
due to their cost and short life span.
Distributed compute resources fit
better into today's funding model.
29

Technology Trend

30

Scalable Parallel
Computers

31

Design Space of Competing


Computer Architecture

32

Towards Inexpensive
Supercomputing
It is:

Cluster Computing..
The Commodity
Supercomputing!

33

Cluster Computing Research Projects

Beowulf (CalTech and NASA) - USA
CCS (Computing Centre Software) - Paderborn, Germany
Condor - University of Wisconsin-Madison, USA
DQS (Distributed Queuing System) - Florida State University, USA
EASY - Argonne National Lab, USA
HPVM (High Performance Virtual Machine) - UIUC & now UCSB, USA
far - University of Liverpool, UK
Gardens - Queensland University of Technology, Australia
MOSIX - Hebrew University of Jerusalem, Israel
MPI (MPI Forum; MPICH is one of the popular implementations)
NOW (Network of Workstations) - Berkeley, USA
NIMROD - Monash University, Australia
NetSolve - University of Tennessee, USA
PBS (Portable Batch System) - NASA Ames and LLNL, USA
PVM - Oak Ridge National Lab / UTK / Emory, USA
34

Cluster Computing Commercial Software

Codine (Computing in Distributed Network Environment) - GENIAS GmbH, Germany
LoadLeveler - IBM Corp., USA
LSF (Load Sharing Facility) - Platform Computing, Canada
NQE (Network Queuing Environment) - Craysoft Corp., USA
OpenFrame - Centre for Development of Advanced Computing, India
RWPC (Real World Computing Partnership), Japan
Unixware (SCO - Santa Cruz Operation), USA
Solaris-MC (Sun Microsystems), USA
ClusterTools (a number of free HPC cluster tools from Sun)
A number of commercial vendors worldwide are offering
clustering solutions, including IBM, Compaq, Microsoft, and a
number of startups like TurboLinux, HPTI, Scali,
BlackStone...

35

Motivation for using Clusters


Surveys show that the utilisation of CPU cycles of
desktop workstations is typically <10%.
Performance of workstations and PCs is
rapidly improving.
As performance grows, percent utilisation
will decrease even further!
Organisations are reluctant to buy large
supercomputers, due to the large expense
and short useful life span.
36

Motivation for using Clusters


The development tools for workstations
are more mature than the contrasting
proprietary solutions for parallel
computers - mainly due to the non-standard
nature of many parallel systems.
Workstation clusters are a cheap and
readily available alternative to
specialised High Performance Computing
(HPC) platforms.
Use of clusters of workstations as a
distributed compute resource is very cost
effective - incremental growth of the system!!!
37

Cycle Stealing
Usually a workstation will be owned by an
individual, group, department, or
organisation - it is dedicated to the
exclusive use of its owners.
This brings problems when attempting to
form a cluster of workstations for running
distributed applications.

38

Cycle Stealing
Typically, there are three types of owners,
who use their workstations mostly for:
1. Sending and receiving email and preparing
documents.
2. Software development - the edit, compile, debug and
test cycle.
3. Running compute-intensive applications.

39

Cycle Stealing
Cluster computing aims to steal spare cycles
from (1) and (2) to provide resources for (3).
However, this requires overcoming the
ownership hurdle - people are very protective
of their workstations.
It usually requires an organisational mandate that
computers are to be used in this way.
Stealing cycles outside standard work hours
(e.g. overnight) is easy; stealing idle cycles
during work hours without impacting
interactive use (both CPU and memory) is
much harder.
40

Rise & Fall of Computing Technologies

Mainframes -> Minis (1970)
Minis -> PCs (1980)
PCs -> Network Computing (1995)
41

Original Food Chain Picture

42

1984 Computer Food Chain

Mainframe
Mini Computer

Workstation

PC

Vector Supercomputer

43

1994 Computer Food Chain

(hitting wall soon)

Mini Computer

Workstation
(future is bleak)

PC

Mainframe

Vector Supercomputer

MPP

44

Computer Food Chain (Now and Future)

45

What is a cluster?
A cluster is a type of parallel or distributed
processing system which consists of a
collection of interconnected stand-alone/complete
computers cooperatively working together
as a single, integrated computing resource.

A typical cluster has:
a network: faster, closer connection than a typical
LAN
low-latency communication protocols
a looser connection than an SMP

46

Why Clusters now?


(Beyond Technology and Cost)

The building block is big enough:
complete computers (HW & SW) shipped in
millions: killer micros, killer RAM, killer disks,
killer OS, killer networks, killer apps.
Workstation performance is doubling every 18
months.
Networks are faster:
Higher link bandwidth (vs. 10 Mbit Ethernet)
Switch-based networks coming (ATM)
Interfaces simple & fast (Active Messages)
Striped files preferred (RAID)
Demise of Mainframes, Supercomputers, & MPPs

47

Architectural Drivers(cont)

Node architecture dominates performance:
processor, cache, bus, and memory
design and engineering $ => performance
The greatest demand for performance is on large systems:
must track the leading edge of technology without lag
MPP network technology => mainstream:
system area networks
A complete system on every node is a powerful enabler:
very high speed I/O, virtual memory, scheduling, ...


48

...Architectural Drivers

Clusters can be grown: incremental scalability (up,
down, and across)
Individual node performance can be improved by
adding additional resources (new memory blocks/disks)
New nodes can be added or nodes can be removed
Clusters of Clusters and Metacomputing
Complete software tools:
Threads, PVM, MPI, DSM, C, C++, Java, Parallel
C++, Compilers, Debuggers, OS, etc.
Wide class of applications:
Sequential and grand challenge parallel applications


49

Clustering of Computers
for Collective Computing: Trends

[Figure: timeline of clustering for collective computing - 1960, 1990, 1995+, 2000, ...?]

Example Cluster: Berkeley NOW
100 Sun UltraSparcs
200 disks
Myrinet SAN, 160 MB/s
Fast comm. (AM, MPI, ...)
Ether/ATM switched external net
Global OS
Self Config

51

Basic Components

[Figure: a Berkeley NOW node - a Sun Ultra 170 workstation (processor plus cache) with a
Myricom NIC on the I/O bus, connected to the Myrinet SAN at 160 MB/s.]
52

Massive Cheap Storage


Cluster

Basic unit:
2 PCs double-ending
four SCSI chains of 8
disks each

Currently serving Fine Art at http://www.thinker.org/imagebase/


53

Cluster of SMPs (CLUMPS)

Four Sun E5000s

8 processors
4 Myricom NICs each

Multiprocessor, MultiNIC, Multi-Protocol

NPACI => Sun 450s

54

Millennium PC Clumps

Inexpensive, easy
to manage Cluster
Replicated in many
departments
Prototype for very
large PC cluster

55

Adoption of the Approach

56

So What's So Different?
Commodity parts?
Communications Packaging?
Incremental Scalability?
Independent Failure?
Intelligent Network Interfaces?
Complete System on every node

virtual memory
scheduler
files
...
57

OPPORTUNITIES
&
CHALLENGES

58

Opportunity of Large-scale
Computing on NOW

Shared pool of computing resources:
processors, memory, disks, interconnect

Guarantee at least one workstation to many individuals
(when active)

Deliver a large % of the collective resources to a few individuals
at any one time
59

Windows of Opportunities

MPP/DSM:
Compute across multiple systems: parallel.
Network RAM:
Use idle memory in other nodes: page across
other nodes' idle memory.
Software RAID:
File system supporting parallel I/O and
reliability, mass storage.
Multi-path Communication:
Communicate across multiple networks:
Ethernet, ATM, Myrinet.

60

Parallel Processing

Scalable parallel applications require:
good floating-point performance
low-overhead communication
scalable network bandwidth
parallel file system

61

Network RAM

The performance gap between processor and
disk has widened.
Thrashing to disk degrades performance
significantly.
Paging across networks can be effective
with high performance networks and an OS
that recognizes idle machines.
Typically, thrashing to network RAM can be 5
to 10 times faster than thrashing to disk.
62

Software RAID: Redundant
Array of Workstation Disks

I/O Bottleneck:
Microprocessor performance is improving more
than 50% per year.
Disk access improvement is < 10% per year.
Applications often perform I/O.
RAID cost per byte is high compared to single
disks.
RAIDs are connected to host computers, which are
often a performance and availability bottleneck.
RAID in software, writing data across an array of
workstation disks, provides performance, and some
degree of redundancy provides availability.

63

Software RAID, Parallel File


Systems, and Parallel I/O

64

Cluster Computer and its


Components

65

Clustering Today

Clustering gained momentum when 3
technologies converged:

1. Very HP microprocessors
workstation performance = yesterday's supercomputers
2. High speed communication
Comm. between cluster nodes >= between processors
in an SMP.
3. Standard tools for parallel/distributed
computing & their growing popularity.
66

Cluster Computer
Architecture

67

Cluster Components...1a
Nodes

Multiple High Performance Components:
PCs
Workstations
SMPs (CLUMPS)
Distributed HPC Systems leading to
Metacomputing

They can be based on different
architectures and run different OSs.
68

Cluster Components...1b
Processors

There are many (CISC/RISC/VLIW/Vector...):
Intel: Pentiums, Xeon, Merced
Sun: SPARC, ULTRASPARC
HP PA
IBM RS6000/PowerPC
SGI MIPS
Digital Alphas

Integrating memory, processing, and
networking into a single chip:
IRAM (CPU & Mem): http://iram.cs.berkeley.edu
Alpha 21364 (CPU, Memory Controller, NI)

69

Cluster Components...2
OS

State of the art OS:
Linux (Beowulf)
Microsoft NT (Illinois HPVM)
SUN Solaris (Berkeley NOW)
IBM AIX (IBM SP2)
HP UX (Illinois - PANDA)
Mach, a microkernel based OS (CMU)
Cluster Operating Systems (Solaris MC, SCO Unixware,
MOSIX (academic project))
OS gluing layers (Berkeley Glunix)
70

Cluster Components...3
High Performance Networks

Ethernet (10 Mbps)
Fast Ethernet (100 Mbps)
Gigabit Ethernet (1 Gbps)
SCI (Dolphin - MPI - 12 microsec latency)
ATM
Myrinet (1.2 Gbps)
Digital Memory Channel
FDDI

71

Cluster Components...4
Network Interfaces

Network Interface Card
Myrinet has an NIC with
user-level access support
The Alpha 21364 processor integrates
processing, the memory controller, and the
network interface into a single chip.

72

Cluster Components...5
Communication Software

Traditional OS-supported facilities (heavy
weight due to protocol processing):
Sockets (TCP/IP), Pipes, etc. (a minimal socket sketch follows this slide)
Lightweight protocols (user level):
Active Messages (Berkeley)
Fast Messages (Illinois)
U-net (Cornell)
XTP (Virginia)
Higher-level systems can be built on top of the
above protocols.
73
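As a point of reference for the "traditional" path above, here is a minimal sketch of a TCP/IP socket client in C. It is not from the original slides: the node name and port are made up, and the example only illustrates the kernel-mediated protocol processing that the user-level lightweight protocols try to avoid.

/* Minimal TCP client sketch (illustrative only; node name and port are hypothetical). */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netdb.h>
#include <sys/socket.h>

int main(void)
{
    struct addrinfo hints, *res;
    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_INET;       /* IPv4 */
    hints.ai_socktype = SOCK_STREAM; /* TCP */

    /* Resolve a (hypothetical) cluster node and service port. */
    if (getaddrinfo("node1.cluster.example.edu", "5000", &hints, &res) != 0)
        return 1;

    int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (fd < 0 || connect(fd, res->ai_addr, res->ai_addrlen) < 0)
        return 1;

    /* Every send/recv goes through the kernel's TCP/IP stack -
       this per-message protocol processing is the "heavy weight" cost. */
    const char *msg = "hello";
    send(fd, msg, strlen(msg), 0);

    char buf[128];
    ssize_t n = recv(fd, buf, sizeof(buf) - 1, 0);
    if (n > 0) {
        buf[n] = '\0';
        printf("reply: %s\n", buf);
    }

    close(fd);
    freeaddrinfo(res);
    return 0;
}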

Cluster Components...6a
Cluster Middleware

Resides between the OS and applications
and offers an infrastructure for supporting:
Single System Image (SSI)
System Availability (SA)

SSI makes the collection appear as a single
machine (globalised view of system
resources), e.g. telnet cluster.myinstitute.edu
SA - checkpointing and process
migration, etc.
74

Cluster Components...6b
Middleware Components

Hardware
DEC Memory Channel, DSM (Alewife, DASH), SMP
techniques
OS / Gluing Layers
Solaris MC, Unixware, Glunix
Applications and Subsystems
System management and electronic forms
Runtime systems (software DSM, PFS, etc.)
Resource management and scheduling (RMS):
CODINE, LSF, PBS, NQS, etc.

75

Cluster Components...7a
Programming Environments

Threads (PCs, SMPs, NOW...)
POSIX Threads (a minimal sketch follows this slide)
Java Threads
MPI
Linux, NT, and many supercomputers
PVM
Software DSMs (Shmem)

76
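To make the threads entry concrete, here is a minimal POSIX threads sketch in C (my own illustration, not taken from the slides); it simply spawns a few workers and joins them, the basic pattern used on the SMP nodes of a cluster. Compile with something like cc -pthread.

/* Minimal POSIX threads sketch: spawn N workers, join them. */
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4

static void *worker(void *arg)
{
    long id = (long)arg;
    printf("worker %ld running on this node\n", id);
    return NULL;
}

int main(void)
{
    pthread_t threads[NUM_THREADS];
    long i;

    for (i = 0; i < NUM_THREADS; i++)
        pthread_create(&threads[i], NULL, worker, (void *)i);

    for (i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);   /* wait for all workers */

    return 0;
}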

Cluster Components...7b
Development Tools

Compilers
C/C++/Java/...
Parallel programming with C++ (MIT Press book)
RAD (rapid application development) tools:
GUI based tools for PP modeling
Debuggers
Performance Analysis Tools
Visualization Tools
77

Cluster Components...8
Applications

Sequential
Parallel / Distributed (cluster-aware apps)
Grand challenge applications:
Weather Forecasting
Quantum Chemistry
Molecular Biology Modeling
Engineering Analysis (CAD/CAM)
...
PDBs, web servers, data-mining

78

Key Operational Benefits of Clustering

System availability (HA): clusters offer inherent high system
availability due to the redundancy of hardware,
operating systems, and applications.
Hardware fault tolerance: redundancy for most system
components (e.g. disk RAID), including both hardware
and software.
OS and application reliability: run multiple copies of the
OS and applications, gaining reliability through this redundancy.
Scalability: add servers to the cluster, add
more clusters to the network as the need arises, or add CPUs to an
SMP.
High performance: running cluster-enabled programs.

79

Classification
of Cluster Computer

80

Clusters Classification..1

Based on Focus (in Market):

High Performance (HP) Clusters
Grand Challenge Applications
High Availability (HA) Clusters
Mission Critical Applications

81

HA Cluster: Server Cluster with


"Heartbeat" Connection

82

Clusters Classification..2

Based on Workstation/PC Ownership:

Dedicated Clusters
Non-dedicated Clusters
Adaptive parallel computing
Also called communal multiprocessing

83

Clusters Classification..3

Based on Node Architecture:

Clusters of PCs (CoPs)
Clusters of Workstations (COWs)
Clusters of SMPs (CLUMPs)

84

Building Scalable Systems:
Cluster of SMPs (Clumps)

Performance of SMP systems vs.
four-processor servers in a cluster

85

Clusters Classification..4

Based on Node OS Type:

Linux Clusters (Beowulf)
Solaris Clusters (Berkeley NOW)
NT Clusters (HPVM)
AIX Clusters (IBM SP2)
SCO/Compaq Clusters (Unixware)
Digital VMS Clusters, HP clusters, ...

86

Clusters Classification..5

Based on node component
architecture & configuration
(processor arch, node type:
PC/Workstation, & OS: Linux/NT...):

Homogeneous Clusters
All nodes have a similar configuration
Heterogeneous Clusters
Nodes based on different processors and
running different OSs
87

Clusters Classification..6a

[Figure: dimensions of scalability & levels of clustering, along three axes -
(1) Platform: Uniprocessor, SMP, Cluster, MPP;
(2) Level: Workgroup, Department, Campus, Enterprise, up to Metacomputing (GRID);
(3) Network technology: private through public networks.]
88

Clusters Classification..6b
Levels of Clustering

Group Clusters (#nodes: 2-99)
(a set of dedicated/non-dedicated computers, mainly connected by a SAN like Myrinet)
Departmental Clusters (#nodes: 99-999)
Organizational Clusters (#nodes: many 100s)
(using ATM nets)
Internet-wide Clusters = Global Clusters
(#nodes: 1000s to many millions)
Metacomputing
Web-based Computing
Agent Based Computing

Java plays a major role in web and agent based computing.

89

Major issues in cluster design

Size Scalability (physical & application)

Enhanced Availability (failure management)

Single System Image (look-and-feel of one system)

Fast Communication (networks & protocols)

Load Balancing (CPU, Net, Memory, Disk)

Security and Encryption (clusters of clusters)

Distributed Environment (Social issues)

Manageability (admin. and control)

Programmability (simple API if required)

Applicability (cluster-aware and non-aware app.)


90

Cluster Middleware
and
Single System Image

91

A typical Cluster Computing


Environment

Application
PVM / MPI/ RSH

???

Hardware/OS
92

CC should support

Multi-user, time-sharing environments
Nodes with different CPU speeds and
memory sizes (heterogeneous configuration)
Many processes with unpredictable
requirements
Unlike an SMP: insufficient bonds between
nodes
Each computer operates independently

93

The missing link is provided by
cluster middleware/underware

Application
PVM / MPI/ RSH

Middleware or
Underware

Hardware/OS
94

SSI Clusters--SMP services on a CC


Pool together the cluster-wide resources
Adaptive resource usage for better
performance
Ease of use - almost like an SMP
Scalable configurations - by decentralized
control

Result: HPC/HAC at PC/workstation prices


95

What is Cluster Middleware ?

An interface between user
applications and the cluster hardware and OS
platform.
Middleware packages support each other at
the management, programming, and
implementation levels.
Middleware layers:
SSI Layer
Availability Layer: it enables the cluster services of
checkpointing, automatic failover, recovery from
failure, and fault-tolerant operation among all cluster nodes.
96

Middleware Design Goals

Complete Transparency (Manageability)
Lets the user see a single cluster system:
single entry point, ftp, telnet, software loading...
Scalable Performance
Easy growth of the cluster:
no change of API & automatic load distribution.
Enhanced Availability
Automatic recovery from failures:
employ checkpointing & fault tolerance technologies
Handle consistency of data when replicated.


97

What is Single System Image (SSI)?

A single system image is the
illusion, created by software or
hardware, that presents a
collection of resources as one,
more powerful resource.
SSI makes the cluster appear like a
single machine to the user, to
applications, and to the network.
A cluster without SSI is not a
cluster.
98

Benefits of Single System


Image

Usage of system resources transparently


Transparent process migration and load
balancing across nodes.
Improved reliability and higher availability
Improved system response time and
performance
Simplified system management
Reduction in the risk of operator errors
User need not be aware of the underlying
system architecture to use these machines
effectively
99

Desired SSI Services

Single Entry Point
telnet cluster.my_institute.edu
rather than telnet node1.cluster.my_institute.edu
Single File Hierarchy: xFS, AFS, Solaris MC Proxy
Single Control Point: management from a single GUI
Single virtual networking
Single memory space - Network RAM / DSM
Single Job Management: Glunix, Codine, LSF
Single User Interface: like the workstation/PC
windowing environment (CDE in Solaris/NT); maybe
it can use Web technology
100

Availability Support
Functions

Single I/O Space (SIO):
any node can access any peripheral or disk device
without knowledge of its physical location.
Single Process Space (SPS):
any process on any node can create processes cluster-wide,
and they communicate through
signals, pipes, etc., as if they were on a single node.
Checkpointing and Process Migration (a minimal checkpointing sketch follows this slide):
saves the process state and intermediate results in
memory or to disk to support rollback recovery when a
node fails; PM for load balancing...

101
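The checkpointing idea above can be illustrated with a deliberately simplified application-level sketch in C (my own example, not the mechanism of any particular cluster package): the program periodically writes its state to a file and, on restart, resumes from the last saved state. The file name and state layout are made up for illustration.

/* Toy application-level checkpointing sketch (illustrative only). */
#include <stdio.h>

struct state { long iteration; double partial_sum; };

/* Write the current state to a (hypothetical) checkpoint file. */
static void checkpoint(const struct state *s)
{
    FILE *f = fopen("ckpt.dat", "wb");
    if (f) { fwrite(s, sizeof(*s), 1, f); fclose(f); }
}

/* Try to resume from an earlier checkpoint; return 1 on success. */
static int restore(struct state *s)
{
    FILE *f = fopen("ckpt.dat", "rb");
    if (!f) return 0;
    int ok = fread(s, sizeof(*s), 1, f) == 1;
    fclose(f);
    return ok;
}

int main(void)
{
    struct state s = { 0, 0.0 };
    if (restore(&s))                      /* rollback recovery after a failure */
        printf("resuming at iteration %ld\n", s.iteration);

    for (; s.iteration < 1000000; s.iteration++) {
        s.partial_sum += 1.0 / (s.iteration + 1);   /* the "real" work */
        if (s.iteration % 100000 == 0)
            checkpoint(&s);               /* periodic checkpoint to disk */
    }
    printf("sum = %f\n", s.partial_sum);
    return 0;
}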

Scalability Vs. Single System Image

[Figure: trade-off between scalability and single system image, starting from a uniprocessor (UP).]

102

SSI Levels/How do we
implement SSI ?

It is the computer science notion of levels of
abstraction (a house is at a higher level of
abstraction than walls, ceilings, and floors).

Application and Subsystem Level
Operating System Kernel Level
Hardware Level

103

SSI at Application and
Subsystem Level

Level: application
  Examples: cluster batch system, system management
  Boundary: an application
  Importance: what a user wants

Level: subsystem
  Examples: distributed DB, OSF DME, Lotus Notes, MPI, PVM
  Boundary: a subsystem
  Importance: SSI for all applications of the subsystem

Level: file system
  Examples: Sun NFS, OSF DFS, NetWare, and so on
  Boundary: shared portion of the file system
  Importance: implicitly supports many applications and subsystems

Level: toolkit
  Examples: OSF DCE, Sun ONC+, Apollo Domain
  Boundary: explicit toolkit facilities: user, service name, time
  Importance: best level of support for heterogeneous system

(c) In search of clusters
104

SSI at Operating System
Kernel Level

Level: Kernel/OS Layer
  Examples: Solaris MC, Unixware, MOSIX, Sprite, Amoeba / GLUnix
  Boundary: each name space: files, processes, pipes, devices, etc.
  Importance: kernel support for applications, adm subsystems

Level: kernel interfaces
  Examples: UNIX (Sun) vnode, Locus (IBM) vproc
  Boundary: type of kernel objects: files, processes, etc.
  Importance: modularizes SSI code within the kernel

Level: virtual memory
  Examples: none supporting operating system kernel
  Boundary: each distributed virtual memory space
  Importance: may simplify implementation of kernel objects

Level: microkernel
  Examples: Mach, PARAS, Chorus, OSF/1 AD, Amoeba
  Boundary: each service outside the microkernel
  Importance: implicit SSI for all system services

(c) In search of clusters
105

SSI at Hardware Level

Level: memory
  Examples: SCI, DASH
  Boundary: memory space
  Importance: better communication and synchronization

Level: memory and I/O
  Examples: SCI, SMP techniques
  Boundary: memory and I/O device space
  Importance: lower overhead cluster I/O

(c) In search of clusters
106

SSI Characteristics

1. Every SSI has a boundary.
2. Single system support can exist
at different levels within a system,
one able to be built on another.

107

SSI Boundaries -- an
applications SSI boundary

Batch System
SSI
Boundary

(c) In search
of clusters
108

Relationship Among
Middleware Modules

109

SSI via OS path!

1. Build as a layer on top of the existing OS
Benefits: makes the system quickly portable, tracks
vendor software upgrades, and reduces development
time.
i.e. new systems can be built quickly by mapping
new services onto the functionality provided by the
layer beneath. E.g.: Glunix.

2. Build SSI at kernel level - a true cluster OS
Good, but can't leverage OS improvements from the
vendor.
E.g.: Unixware, Solaris-MC, and MOSIX.

110

SSI Representative Systems

OS level SSI
SCO NSC UnixWare
Solaris-MC
MOSIX, ...
Middleware level SSI
PVM, TreadMarks (DSM), Glunix,
Condor, Codine, Nimrod, ...
Application level SSI
PARMON, Parallel Oracle, ...

111

SCO NonStop Cluster for UnixWare

http://www.sco.com/products/clustering/

[Figure: two UP or SMP nodes, each running standard SCO UnixWare with clustering hooks;
users, applications, and systems management issue standard OS kernel calls, which modular
kernel extensions route to local devices or, over ServerNet, to devices on other nodes.]

112

How does NonStop Clusters


Work?

Modular extensions and hooks to provide:
Single clusterwide filesystem view
Transparent clusterwide device access
Transparent swap-space sharing
Transparent clusterwide IPC
High performance internode communications
Transparent clusterwide processes, migration, etc.
Node-down cleanup and resource failover
Transparent clusterwide parallel TCP/IP networking
Application availability
Clusterwide membership and cluster time sync
Cluster system administration
Load leveling
113

Solaris-MC: Solaris for MultiComputers

http://www.sun.com/research/solaris-mc/

[Figure: Solaris MC architecture - applications use the standard system call interface;
Solaris MC sits on top of the existing Solaris 2.5 kernel as a C++ object framework with
object invocations to other nodes, providing a global file system, globalized process
management, and globalized networking and I/O.]

114

Solaris MC components

[Figure: the same architecture, showing the Solaris MC modules layered over the existing
Solaris 2.5 kernel:]

Object and communication support
High availability support
PXFS global distributed file system
Process management
Networking

115

Multicomputer OS for UNIX (MOSIX)

http://www.mosix.cs.huji.ac.il/

An OS module (layer) that provides the
applications with the illusion of working on a single
system.
Remote operations are performed like local
operations.
Transparent to the application - the user interface is
unchanged.

[Figure: applications and PVM / MPI / RSH run over the MOSIX layer on top of Hardware/OS.]

116

Main tool

Preemptive process migration that can
migrate ---> any process, anywhere, anytime

Supervised by distributed algorithms that
respond on-line to global resource
availability - transparently.

Load-balancing - migrate processes from over-loaded
to under-loaded nodes.
Memory ushering - migrate processes from a
node that has exhausted its memory, to prevent
paging/swapping.

117

MOSIX for Linux at HUJI

A scalable cluster configuration:
50 Pentium-II 300 MHz
38 Pentium-Pro 200 MHz (some are SMPs)
16 Pentium-II 400 MHz (some are SMPs)
Over 12 GB cluster-wide RAM
Connected by the Myrinet 2.56 Gb/s LAN
Runs Red Hat 6.0, based on kernel 2.2.7
Upgrades: HW with Intel, SW with Linux
Download MOSIX:
http://www.mosix.cs.huji.ac.il/
118

NOW @ Berkeley
Design & Implementation of higher-level system
Global OS (Glunix)
Parallel File Systems (xFS)
Fast Communication (HW for Active Messages)
Application Support
Overcoming technology shortcomings
Fault tolerance
System Management
NOW Goal: Faster for Parallel AND Sequential

http://now.cs.berkeley.edu/

119

NOW Software Components

[Figure: NOW software stack - parallel apps and large sequential apps run over Sockets,
Split-C, MPI, HPF, and vSM; a Global Layer Unix (with a name server) spans the individual
Unix (Solaris) workstations; each workstation has a VN segment driver and an Active
Messages LCP over the Myrinet scalable interconnect.]

120

3 Paths for Applications on NOW?

Revolutionary (MPP style): write new programs from
scratch using MPP languages, compilers, libraries, ...
Porting: port programs from mainframes,
supercomputers, MPPs, ...
Evolutionary: take a sequential program & use
1) Network RAM: first use the memory of many
computers to reduce disk accesses; if not fast
enough, then:
2) Parallel I/O: use many disks in parallel for
accesses not in the file cache; if not fast enough,
then:
3) Parallel program: change the program until it sees
enough processors that it is fast => large speedup
without a fine-grain parallel program
121

Comparison of 4 Cluster Systems

122

Cluster Programming
Environments

Shared Memory Based
DSM
Threads/OpenMP (enabled for clusters)
Java threads (HKU JESSICA, IBM cJVM)
Message Passing Based
PVM (a minimal sketch follows this slide)
MPI
Parametric Computations
Nimrod/Clustor
Automatic Parallelising Compilers
Parallel Libraries & Computational Kernels (NetSolve)
123
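For the message-passing entry above, the deck walks through an MPI example later on; as a hedged companion, here is a minimal PVM 3 sketch in C (my own illustration, in the spirit of the classic hello/hello_other pair), where a parent spawns one worker task and prints the string the worker sends back. The spawned executable name "hello" is an assumption.

/* Minimal PVM 3 sketch: parent spawns one worker, worker sends a greeting back. */
#include <stdio.h>
#include "pvm3.h"

int main(void)
{
    int mytid = pvm_mytid();          /* enroll in PVM, get task id */
    int parent = pvm_parent();

    if (parent == PvmNoParent) {      /* parent: spawn one copy of this program */
        int child;
        char buf[100];
        pvm_spawn("hello", NULL, PvmTaskDefault, "", 1, &child);
        pvm_recv(-1, -1);             /* wait for any message from any task */
        pvm_upkstr(buf);
        printf("parent %x got: %s\n", mytid, buf);
    } else {                          /* worker: send a greeting to the parent */
        pvm_initsend(PvmDataDefault);
        pvm_pkstr("Hello from a PVM worker");
        pvm_send(parent, 1);          /* message tag 1 */
    }

    pvm_exit();                       /* leave the virtual machine */
    return 0;
}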

Levels of Parallelism

[Figure: levels of parallelism - PVM/MPI operate on tasks (task i-1, task i, task i+1),
threads on functions (func1, func2, func3), compilers on loop iterations
(a(0)=.., a(1)=.., a(2)=..), and the CPU on individual instructions.]

Code granularity / code item:
Large grain (task level) - program
Medium grain (control level) - function (thread)
Fine grain (data level) - loop (compiler)
Very fine grain (multiple issue) - with hardware

(A small sketch contrasting these granularities follows this slide.)
124
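As a rough illustration of these granularities (my own sketch, not from the slides), the fragment below marks in comments where each level would apply; the OpenMP pragma stands in for what a parallelising compiler or thread library does at the loop/function level.

/* Sketch of parallelism levels in one program (illustrative only). */
#include <stdio.h>

#define N 1000

/* Medium grain (control level): this whole function could run as a thread. */
static double partial_sum(const double *a, int lo, int hi)
{
    double s = 0.0;
    /* Fine grain (data level): independent loop iterations - a compiler or
       OpenMP can split them across the processors of one node. */
    #pragma omp parallel for reduction(+:s)
    for (int i = lo; i < hi; i++)
        s += a[i] * a[i];
    /* Very fine grain: within each iteration, the CPU issues multiple
       instructions (multiply, add) per cycle. */
    return s;
}

int main(void)
{
    double a[N];
    for (int i = 0; i < N; i++)
        a[i] = i * 0.001;

    /* Large grain (task level): with PVM/MPI, each task/rank would call
       partial_sum() on its own slice and the results would be combined
       with messages; here we simply do it in one process. */
    double s = partial_sum(a, 0, N);
    printf("sum of squares = %f\n", s);
    return 0;
}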

MPI (Message Passing


Interface)

http://www.mpi-forum.org/

A standard message passing interface.
MPI 1.0 - May 1994 (effort started in 1992)
C and Fortran bindings (and now Java)
Portable (once coded, it can run on virtually all HPC
platforms including clusters!)
Performance (by exploiting native hardware features)
Functionality (over 115 functions in MPI 1.0):
environment management, point-to-point &
collective communications, process groups,
communicators, derived data types, and virtual
topology routines.
Availability - a variety of implementations available,
both vendor and public domain.
125

A Sample MPI Program...


#include <stdio.h>
#include <string.h>
#include "mpi.h"

int main( int argc, char *argv[] )
{
  int my_rank;        /* process rank */
  int p;              /* no. of processes */
  int source;         /* rank of sender */
  int dest;           /* rank of receiver */
  int tag = 0;        /* message tag, like email subject */
  char message[100];  /* buffer */
  MPI_Status status;  /* function return status */

  /* Start up MPI */
  MPI_Init( &argc, &argv );
  /* Find our process rank/id */
  MPI_Comm_rank( MPI_COMM_WORLD, &my_rank );
  /* Find out how many processes/tasks are part of this run */
  MPI_Comm_size( MPI_COMM_WORLD, &p );

[Figure: the workers send "Hello, ..." greetings to the master process.]

126

A Sample MPI Program

  if( my_rank == 0 ) /* Master Process */
  {
    for( source = 1; source < p; source++ )
    {
      MPI_Recv( message, 100, MPI_CHAR, source, tag, MPI_COMM_WORLD, &status );
      printf( "%s \n", message );
    }
  }
  else /* Worker Process */
  {
    sprintf( message, "Hello, I am your worker process %d!", my_rank );
    dest = 0;
    MPI_Send( message, strlen(message)+1, MPI_CHAR, dest, tag, MPI_COMM_WORLD );
  }
  /* Shutdown MPI environment */
  MPI_Finalize();
}
127

Execution
% cc -o hello hello.c -lmpi
% mpirun -np 2 hello
Hello, I am your worker process 1!
% mpirun -np 4 hello
Hello, I am your worker process 1!
Hello, I am your worker process 2!
Hello, I am your worker process 3!
% mpirun hello
(no output: there are no workers, so no greetings)

128

PARMON: A Cluster
Monitoring Tool

[Figure: the PARMON client (parmon) runs on a JVM and talks over the high-speed switch to
a PARMON server (parmond) on each node.]

http://www.buyya.com/parmon/

129

Resource Utilization at a
Glance

130

Globalised Cluster Storage

Single I/O Space and
Design Issues

Reference:
Designing SSI Clusters with Hierarchical Checkpointing and Single I/O
Space, IEEE Concurrency, March 1999,
by K. Hwang, H. Jin et al.


131

Clusters with & without Single


I/O Space

Users

Users

Single I/O Space Services

Without Single I/O Space

With Single I/O Space Services

132

Benefits of Single I/O Space

Eliminates the gap between accessing local disk(s) and remote
disks
Supports a persistent programming paradigm
Allows striping on remote disks, accelerating parallel I/O
operations
Facilitates the implementation of distributed checkpointing and
recovery schemes

133

Single I/O Space Design Issues

Integrated I/O Space

Addressing and Mapping Mechanisms

Data movement procedures

134

Integrated I/O Space

[Figure: a single sequence of addresses spanning local disks LD1..LDn with blocks
D11..Dnt (the RADD space), shared RAIDs SD1..SDm with blocks B11..Bmk (the NASD space),
and attached peripherals P1..Ph (the NAP space).]
135

Addressing and Mapping

[Figure: user applications call a Name Agent, which uses a Disk/RAID/NAP Mapper and
I/O Agents with a Block Mover to reach the RADD, NASD, and NAP spaces; implemented as
user-level middleware plus some modified OS system calls.]

136

Data Movement Procedures

[Figure: a user application on Node 1 requests data block A; the I/O agents on Node 1 and
Node 2 locate it on LD2 or an SDi of the NASD, and the Block Mover transfers it between
the nodes.]
137

What Next ??
Clusters of Clusters (HyperClusters)
Global Grid
Interplanetary Grid
Universal Grid??

138

Clusters of Clusters (HyperClusters)

[Figure: three clusters connected over a LAN/WAN; each cluster has a scheduler, a master
daemon, execution daemons, and clients, with job submission and graphical control at each
site.]

139

Towards Grid Computing.

For illustration, placed resources arbitrarily on the GUSTO test-bed!!

140

What is Grid ?

An infrastructure that couples
Computers (PCs, workstations, clusters, traditional
supercomputers, and even laptops, notebooks, mobile
computers, PDAs, and so on)
Software? (e.g., renting expensive special purpose applications
on demand)
Databases (e.g., transparent access to the human genome database)
Special Instruments (e.g., radio telescopes - SETI@Home
searching for life in the galaxy, Astrophysics@Swinburne for
pulsars)
People (maybe even animals, who knows?)
across local/wide-area networks (enterprise,
organisations, or the Internet) and presents them as
a unified integrated (single) resource.
141

Conceptual view of the Grid

Leading to Portal (Super)Computing

http://www.sun.com/hpc/

142

Grid Application-Drivers

Old and new applications are getting enabled due
to the coupling of computers, databases,
instruments, people, etc.:

(distributed) supercomputing
collaborative engineering
high-throughput computing
large scale simulation & parameter studies
remote software access / renting software
data-intensive computing
on-demand computing
143

Grid Components

[Figure: layered Grid architecture]

Grid Apps. - Applications and Portals:
Scientific, Engineering, Collaboration, Problem Solving Environments, Web enabled Apps

Grid Tools - Development Environments and Tools:
Languages, Libraries, Debuggers, Monitoring, Resource Brokers, Web tools

Grid Middleware - Distributed Resources Coupling Services:
Comm., Sign on & Security, Information, Process, Data Access, QoS

Grid Fabric - Networked Resources across Organisations:
Computers, Clusters, Storage Systems, Data Sources, Scientific Instruments,
with Local Resource Managers (Operating Systems, Queuing Systems,
Libraries & App Kernels) and TCP/IP & UDP

144

Many GRID Projects and Initiatives

PUBLIC FORUMS
Computing Portals
Grid Forum
European Grid Forum
IEEE TFCC!
GRID2000 and more.

Australia
Nimrod/G
EcoGrid and GRACE
DISCWorld

Europe
UNICORE
MOL
METODIS
Globe
Poznan Metacomputing
CERN Data Grid
MetaMPI
DAS
JaWS
and many more...

Public Grid Initiatives
Distributed.net
SETI@Home
Compute Power Grid

USA
Globus
Legion
JAVELIN
AppLes
NASA IPG
Condor
Harness
NetSolve
NCSA Workbench
WebFlow
EveryWhere
and many more...

Japan
Ninf
Bricks

http://www.gridcomputing.com/

145

NetSolve

Client/Server/Agent-based computing
An easy-to-use tool providing efficient and uniform
access to a variety of scientific packages on UNIX platforms

Client-server design
Network-enabled solvers
Network resources
Seamless access to resources
Non-hierarchical system
Load balancing
Fault tolerance
Interfaces to Fortran, C, Java, Matlab, and more
Software is available: www.cs.utk.edu/netsolve/

[Figure: a NetSolve client sends a request to the NetSolve agent, which returns its choice
of server from the software repository; the chosen server computes and sends the reply.]
146

HARNESS Virtual Machine

http://www.epm.ornl.gov/harness/

Scalable distributed control and a CCA-based, component-based daemon

[Figure: hosts A-D join a virtual machine (another VM may coexist); discovery and
registration, process control, and user features are HARNESS daemon plug-ins, so the
system can be customized and extended by dynamically adding plug-ins; operation within a
VM uses distributed control.]

147

HARNESS Core Research

Parallel plug-ins for a heterogeneous distributed virtual machine:
One research goal is to understand and implement
a dynamic parallel plug-in environment.
It provides a method for many users to extend Harness
in much the same way that third-party serial plug-ins
extend Netscape, Photoshop, and Linux.

Research issues with parallel plug-ins include:
heterogeneity, synchronization, interoperation, partial success
Three typical cases:
load a plug-in into a single host of the VM w/o communication
load a plug-in into a single host, broadcast to the rest of the VM
load a plug-in into every host of the VM w/ synchronization

148

Nimrod - A Job Management


System

http://www.dgs.monash.edu.au/~davida/nimrod.html

149

Job processing with Nimrod

150

Nimrod/G Architecture

[Figure: Nimrod/G clients talk to the Nimrod engine, which uses a schedule advisor, trading
manager, persistent store, dispatcher, and grid explorer; these sit over middleware services
(trade server TS, grid explorer GE, grid information services GIS) and the local resource
managers and trade servers (RM & TS) of the GUSTO test bed.]

RM: Local Resource Manager, TS: Trade Server

151

Compute Power Market

[Figure: the user's application, via a resource broker (job control agent, schedule advisor,
trade manager, deployment agent), queries the grid information server and grid explorer and
trades with a trade server in each resource domain; the trade server applies charging
algorithms and handles accounting, resource reservation, and resource allocation over
resources R1..Rn, plus other services.]

152

Pointers to Literature on
Cluster Computing

153

Reading Resources..1a
Internet & WWW

Computer Architecture:
http://www.cs.wisc.edu/~arch/www/
PFS & Parallel I/O:
http://www.cs.dartmouth.edu/pario/
Linux Parallel Processing:
http://yara.ecn.purdue.edu/~pplinux/Sites/
DSMs:
http://www.cs.umd.edu/~keleher/dsm.html
154

Reading Resources..1b
Internet & WWW

Solaris-MC:
http://www.sunlabs.com/research/solaris-mc
Microprocessors: Recent Advances:
http://www.microprocessor.sscc.ru
Beowulf:
http://www.beowulf.org
Metacomputing:
http://www.sis.port.ac.uk/~mab/Metacomputing/
155

Reading Resources..2
Books

In Search of Clusters
by G. Pfister, Prentice Hall (2nd ed.), 1998
High Performance Cluster Computing
Volume 1: Architectures and Systems
Volume 2: Programming and Applications
edited by Rajkumar Buyya, Prentice Hall, NJ, USA
Scalable Parallel Computing
by K. Hwang & Z. Xu, McGraw Hill, 1998
156

Reading Resources..3
Journals

A Case for NOW, IEEE Micro, Feb. 1995
by Anderson, Culler, Patterson
Fault Tolerant COW with SSI, IEEE
Concurrency (to appear)
by Kai Hwang, Chow, Wang, Jin, Xu
Cluster Computing: The Commodity
Supercomputing, Journal of Software
Practice and Experience (available from my web page)
by Mark Baker & Rajkumar Buyya

157

Cluster Computing Infoware

http://www.csse.monash.edu.au/~rajkumar/cluster/
158

Cluster Computing Forum

IEEE Task Force on Cluster Computing


(TFCC)

http://www.ieeetfcc.org

159

TFCC Activities...
Network Technologies
OS Technologies
Parallel I/O
Programming Environments
Java Technologies
Algorithms and Applications
Analysis and Profiling
Storage Technologies
High Throughput Computing

160

TFCC Activities...
High Availability
Single System Image
Performance Evaluation
Software Engineering
Education
Newsletter
Industrial Wing
TFCC Regional Activities

All the above have their own pages; see pointers
from: http://www.ieeetfcc.org
161

TFCC Activities...

Mailing list, workshops, conferences, tutorials,
web resources, etc.
Resources for introducing the subject at senior
undergraduate and graduate levels.
Tutorials/Workshops at IEEE Chapters...
... and so on.
FREE MEMBERSHIP, please join!
Visit the TFCC page for more details:
http://www.ieeetfcc.org (updated daily!)


162

Clusters Revisited

163

Summary

We have discussed Clusters:
Enabling Technologies
Architecture & its Components
Classifications
Middleware
Single System Image
Representative Systems
164

Conclusions

Clusters are promising...
They solve the parallel processing paradox.
They offer incremental growth and match funding
patterns.
New trends in hardware and software
technologies are likely to make clusters even more
promising... so that
cluster-based supercomputers can be seen
everywhere!
165

166

Thank You ...

167

Backup Slides...

168

SISD: A Conventional Computer

[Figure: a single processor takes one instruction stream and one data input stream and
produces one data output stream.]

Speed is limited by the rate at which the computer can
transfer information internally.

Ex: PC, Macintosh, Workstations


169

The MISD Architecture

[Figure: a single data input stream flows through processors A, B, and C, each driven by
its own instruction stream, producing one data output stream.]

More of an intellectual exercise than a practical configuration.
Few were built, and none are commercially available.
170

SIMD Architecture

[Figure: a single instruction stream drives processors A, B, and C, each operating on its
own data input stream and producing its own data output stream.]

Ci <= Ai * Bi

Ex: CRAY vector processing machines, Thinking Machines CM*
(a minimal sketch of this data-parallel operation follows this slide)
171
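As a minimal sketch of the data-parallel operation Ci <= Ai * Bi shown above (my own illustration, not from the slides), the loop below applies one operation to every element; on a SIMD or vector machine each iteration would be executed in lockstep across the processing elements.

/* Data-parallel element-wise multiply: the SIMD idea, written as a plain C loop. */
#include <stdio.h>

#define N 8

int main(void)
{
    double a[N], b[N], c[N];
    int i;

    for (i = 0; i < N; i++) {   /* set up some sample operands */
        a[i] = i + 1;
        b[i] = 2.0;
    }

    /* One instruction, many data: on a SIMD/vector machine all N of these
       multiplications proceed in lockstep; on a modern CPU the compiler may
       still vectorize this loop. */
    for (i = 0; i < N; i++)
        c[i] = a[i] * b[i];

    for (i = 0; i < N; i++)
        printf("c[%d] = %g\n", i, c[i]);
    return 0;
}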

MIMD Architecture

[Figure: processors A, B, and C each have their own instruction stream and their own data
input and output streams.]

Unlike SISD and MISD machines, an MIMD computer works asynchronously.

Shared memory (tightly coupled) MIMD
Distributed memory (loosely coupled) MIMD

172
