
High Performance Cluster Computing

(Architecture, Systems, and Applications)

ISCA
2000

Rajkumar Buyya,

Monash University, Melbourne.

Email: rajkumar@csse.monash.edu.au / rajkumar@buyya.com


Web: http://www.csse.monash.edu.au/~rajkumar / www.buyya.com

Objectives

Learn and share recent advances in cluster
computing (both in research and commercial
settings):

Architecture
System Software
Programming Environments and Tools
Applications

Cluster Computing Infoware: (tutorial online)

http://www.buyya.com/cluster/
2

Agenda

Overview of Computing
Motivations & Enabling Technologies
Cluster Architecture & its Components
Clusters Classifications
Cluster Middleware
Single System Image
Representative Cluster Systems
Resources and Conclusions
3

Computing Elements

[Figure: layered view of the computing elements of a multi-processor computing system -
Applications run over Programming Paradigms and a Threads Interface, which sit on the
Operating System (Microkernel) and the Hardware (Processors hosting Processes and Threads).]
4

Two Eras of Computing


[Figure: timeline (1940 to 2030) of the two eras of computing. Both the Sequential Era and
the Parallel Era pass through the same phases - Architectures, System Software, Applications,
and Problem Solving Environments (P.S.Es) - each moving from R&D to commercialization to
commodity.]
5

Computing Power and Computer Architectures

Computing Power (HPC) Drivers

Solving grand challenge applications using
computer modeling, simulation and analysis:

Life Sciences
CAD/CAM
Aerospace
Digital Biology
E-commerce/anything
Military Applications
7

How to Run App. Faster ?

There are 3 ways to improve performance:

1. Work Harder
2. Work Smarter
3. Get Help

Computer Analogy:

1. Use faster hardware: e.g. reduce the time per
instruction (clock cycle).
2. Use optimized algorithms and techniques.
3. Use multiple computers to solve the problem: that
is, increase the number of instructions executed per
clock cycle.
8

Application Case Study

Web Serving and E-Commerce

10

E-Commerce and PDC ?


What are/will be the major problems/issues in
e-commerce? How will or can PDC be applied to
solve some of them?
Other than compute power, what else can PDC
contribute to e-commerce?
How would/could the different forms of PDC
(clusters, hyperclusters, the Grid, ...) be applied to
e-commerce?
Could you describe one hot research topic for
PDC applied to e-commerce?
A killer e-commerce application for PDC?
...

11

Killer Applications of Clusters

Numerous Scientific & Engineering Apps.
Parametric Simulations
Business Applications
E-commerce Applications (Amazon.com, eBay.com, ...)
Database Applications (Oracle on clusters)
Decision Support Systems
Internet Applications
Web serving / searching
Infowares (yahoo.com, AOL.com)
ASPs (application service providers)
eMail, eChat, ePhone, eBook, eCommerce, eBank, eSociety, eAnything!
Computing Portals
Mission Critical Applications
command and control systems, banks, nuclear reactor control, star wars, and
handling life-threatening situations.

12

Major problems/issues in E-commerce


Social Issues
Capacity Planning

Multilevel Business Support (e.g., B2P2C)


Information Storage, Retrieval, and Update
Performance
Heterogeneity
System Scalability
System Reliability
Identification and Authentication
System Expandability
Security
Cyber Attacks Detection and Control
(cyberguard)
Data Replication, Consistency, and Caching
Manageability (administration and control)

13

Amazon.com: Online sales/trading

A killer E-commerce portal

Several thousands of items
books, publishers, suppliers
Millions of customers
customer details, transaction details, support
for transaction updates
(Millions) of partners
keep track of partner details, tracking referral
links to partners, sales, and payments
Sales based on advertised price
Sales through auction/bids
a mechanism for participating in the bid
(buyers/sellers define the rules of the game)

14


Can these drive E-Commerce?

Clusters are already in use for web serving, web-hosting, and a
number of other Internet applications including E-commerce:

scalability, availability, performance, reliable high-performance
massive storage, and database support.
Attempts to support online detection of cyber attacks (through
data mining) and control.

Hyperclusters and the GRID:

Support for transparency in (secure) site/data replication for high
availability and quick response time (taking the site close to the user).
Compute power from hyperclusters/Grids can be used for data
mining for cyber attack and fraud detection and control.
Helps to build the Compute Power Market, ASPs, and Computing
Portals.

15

Science Portals - e.g., PAPIA system

Pentiums
Myrinet
NetBSD/Linux
PM
Score-D
MPC++

RWCP Japan: http://www.rwcp.or.jp/papia/

PAPIA PC Cluster
16

PDC hot topics for E-commerce

Cluster-based web servers, search engines, portals
Scheduling and Single System Image
Heterogeneous Computing
Reliability, High Availability, and Data Recovery
Parallel databases and high-performance, reliable mass storage
systems
CyberGuard! Data mining for detection of cyber attacks, frauds, etc.
and online control
Data mining for identifying sales patterns and automatically tuning
the portal to special sessions/festival sales
eCash, eCheque, eBank, eSociety, eGovernment, eEntertainment,
eTravel, eGoods, and so on
Data/site replication and caching techniques
Compute Power Market
Infowares (yahoo.com, AOL.com)
ASPs (application service providers)
17
...

Sequential Architecture
Limitations

Sequential architectures are reaching physical
limitations (speed of light, thermodynamics).

Hardware improvements like pipelining,
superscalar execution, etc., are non-scalable and
require sophisticated compiler technology.

Vector processing works well only for certain
kinds of problems.
18

Computational Power Improvement

[Figure: Computational Power Improvement (C.P.I.) vs. number of processors -
a uniprocessor improves slowly, while a multiprocessor scales with the number of processors.]
19

Human Physical Growth Analogy:
Computational Power Improvement

[Figure: vertical growth (one person growing with age) levels off, while horizontal growth
(adding more people) keeps scaling - just as adding processors scales computing power
beyond what a single processor can achieve.]
20

Why Parallel Processing NOW?

The technology of PP is mature and can be
exploited commercially; there is significant
R&D work on the development of tools
and environments.

Significant developments in networking
technology are paving the way for
heterogeneous computing.
21

History of Parallel Processing

PP can be traced back to a tablet dated
around 100 BC.

The tablet had 3 calculating positions.

We infer that multiple positions were used for
reliability and/or speed.
22

Motivating Factors

The aggregate speed with which complex calculations
are carried out by millions of neurons in the
human brain is amazing, even though an
individual neuron's response is slow
(on the order of milliseconds) - this demonstrates the
feasibility of PP.
23

Taxonomy of Architectures

Simple classification by Flynn
(number of instruction and data streams):

SISD - conventional
SIMD - data parallel, vector computing
MISD - systolic arrays
MIMD - very general, multiple approaches

Current focus is on the MIMD model, using
general purpose processors or multicomputers.
24

Main HPC Architectures..1a


SISD - mainframes, workstations, PCs.
SIMD Shared Memory - Vector machines, Cray...
MIMD Shared Memory - Sequent, KSR, Tera, SGI,
SUN.
SIMD Distributed Memory - DAP, TMC CM-2...
MIMD Distributed Memory - Cray T3D, Intel,
Transputers, TMC CM-5, plus recent workstation
clusters (IBM SP2, DEC, Sun, HP).

25

Motivation for using Clusters


The communications bandwidth between
workstations is increasing as new
networking technologies and protocols are
implemented in LANs and WANs.
Workstation clusters are easier to integrate
into existing networks than special parallel
computers.

26

Main HPC Architectures..1b.

NOTE: Modern sequential machines are not purely
SISD - advanced RISC processors use many
concepts from vector and parallel architectures
(pipelining, parallel execution of instructions,
prefetching of data, etc.) in order to achieve one or
more arithmetic operations per clock cycle.

27

Parallel Processing Paradox

The time required to develop a parallel
application for solving a grand challenge
application (GCA) is equal to:

the half-life of parallel supercomputers.
28

The Need for Alternative
Supercomputing Resources

Vast numbers of under-utilised
workstations are available to use.
Huge numbers of unused processor
cycles and resources could be
put to good use in a wide variety of
application areas.
There is reluctance to buy supercomputers
due to their cost and short life span.
Distributed compute resources fit
better into today's funding model.
29

Technology Trend

30

Scalable Parallel
Computers

31

Design Space of Competing


Computer Architecture

32

Towards Inexpensive
Supercomputing
It is:

Cluster Computing..
The Commodity
Supercomputing!

33

Cluster Computing Research Projects

Beowulf (CalTech and NASA) - USA
CCS (Computing Centre Software) - Paderborn, Germany
Condor - University of Wisconsin-Madison, USA
DQS (Distributed Queuing System) - Florida State University, USA
EASY - Argonne National Lab, USA
HPVM (High Performance Virtual Machine) - UIUC & now UCSB, USA
far - University of Liverpool, UK
Gardens - Queensland University of Technology, Australia
MOSIX - Hebrew University of Jerusalem, Israel
MPI (MPI Forum; MPICH is one of the popular implementations)
NOW (Network of Workstations) - Berkeley, USA
NIMROD - Monash University, Australia
NetSolve - University of Tennessee, USA
PBS (Portable Batch System) - NASA Ames and LLNL, USA
PVM - Oak Ridge National Lab / UTK / Emory, USA
34

Cluster Computing Commercial Software

Codine (Computing in Distributed Network Environment) - GENIAS GmbH, Germany
LoadLeveler - IBM Corp., USA
LSF (Load Sharing Facility) - Platform Computing, Canada
NQE (Network Queuing Environment) - Craysoft Corp., USA
OpenFrame - Centre for Development of Advanced Computing, India
RWPC (Real World Computing Partnership), Japan
Unixware (SCO - Santa Cruz Operation), USA
Solaris-MC (Sun Microsystems), USA
ClusterTools (a number of free HPC cluster tools from Sun)
A number of commercial vendors worldwide are offering
clustering solutions, including IBM, Compaq, Microsoft, and a
number of startups like TurboLinux, HPTI, Scali,
BlackStone...

35

Motivation for using Clusters


Surveys show that the utilisation of CPU cycles of
desktop workstations is typically <10%.
Performance of workstations and PCs is
rapidly improving.
As performance grows, percent utilisation
will decrease even further!
Organisations are reluctant to buy large
supercomputers, due to the large expense
and short useful life span.
36

Motivation for using Clusters


The development tools for workstations
are more mature than the contrasting
proprietary solutions for parallel
computers - mainly due to the non-standard
nature of many parallel systems.
Workstation clusters are a cheap and
readily available alternative to
specialised High Performance Computing
(HPC) platforms.
Use of clusters of workstations as a
distributed compute resource is very cost
effective - incremental growth of the system!!!
37

Cycle Stealing
Usually a workstation will be owned by an
individual, group, department, or
organisation - it is dedicated to the
exclusive use of its owners.
This brings problems when attempting to
form a cluster of workstations for running
distributed applications.

38

Cycle Stealing
Typically, there are three types of owners,
who use their workstations mostly for:
1. Sending and receiving email and preparing
documents.
2. Software development - the edit, compile, debug and
test cycle.
3. Running compute-intensive applications.

39

Cycle Stealing
Cluster computing aims to steal spare cycles
from (1) and (2) to provide resources for (3).
However, this requires overcoming the
ownership hurdle - people are very protective
of their workstations.
It usually requires an organisational mandate that
computers are to be used in this way.
Stealing cycles outside standard work hours
(e.g. overnight) is easy; stealing idle cycles
during work hours without impacting
interactive use (both CPU and memory) is
much harder.
40

Rise & Fall of Computing Technologies

Mainframes -> Minis (1970)
Minis -> PCs (1980)
PCs -> Network Computing (1995)
41

Original Food Chain Picture

42

1984 Computer Food Chain

Mainframe
Mini Computer

Workstation

PC

Vector Supercomputer

43

1994 Computer Food Chain

(hitting wall soon)

Mini Computer

Workstation
(future is bleak)

PC

Mainframe

Vector Supercomputer

MPP

44

Computer Food Chain (Now and Future)

45

What is a cluster?
A cluster is a type of parallel or distributed
processing system which consists of a
collection of interconnected stand-alone/complete
computers cooperatively working together
as a single, integrated computing resource.

A typical cluster has:
a network: faster, closer connection than a typical
LAN
low-latency communication protocols
a looser connection than an SMP

46

Why Clusters now?


(Beyond Technology and Cost)

The building block is big enough:
complete computers (HW & SW) shipped in
millions: killer micros, killer RAM, killer disks,
killer OS, killer networks, killer apps.
Workstation performance is doubling every 18
months.
Networks are faster:
Higher link bandwidth (vs. 10 Mbit Ethernet)
Switch-based networks coming (ATM)
Interfaces simple & fast (Active Messages)
Striped files preferred (RAID)
Demise of Mainframes, Supercomputers, & MPPs

47

Architectural Drivers(cont)

Node architecture dominates performance:
processor, cache, bus, and memory
design and engineering $ => performance
The greatest demand for performance is on large systems:
must track the leading edge of technology without lag
MPP network technology => mainstream:
system area networks
A complete system on every node is a powerful enabler:
very high speed I/O, virtual memory, scheduling, ...


48

...Architectural Drivers

Clusters can be grown: incremental scalability (up,
down, and across)
Individual node performance can be improved by
adding additional resources (new memory blocks/disks)
New nodes can be added or nodes can be removed
Clusters of Clusters and Metacomputing
Complete software tools:
Threads, PVM, MPI, DSM, C, C++, Java, Parallel
C++, Compilers, Debuggers, OS, etc.
Wide class of applications:
Sequential and grand challenge parallel applications


49

Clustering of Computers
for Collective Computing: Trends

[Figure: timeline of clustering for collective computing - 1960, 1990, 1995+, 2000, ...?]

Example Cluster: Berkeley NOW
100 Sun UltraSparcs
200 disks
Myrinet SAN, 160 MB/s
Fast comm. (AM, MPI, ...)
Ether/ATM switched external net
Global OS
Self Config

51

Basic Components

[Figure: a Berkeley NOW node - a Sun Ultra 170 workstation (processor plus cache) with a
Myricom NIC on the I/O bus, connected to the Myrinet SAN at 160 MB/s.]
52

Massive Cheap Storage


Cluster

Basic unit:
2 PCs double-ending
four SCSI chains of 8
disks each

Currently serving Fine Art at http://www.thinker.org/imagebase/


53

Cluster of SMPs (CLUMPS)

Four Sun E5000s

8 processors
4 Myricom NICs each

Multiprocessor, MultiNIC, Multi-Protocol

NPACI => Sun 450s

54

Millennium PC Clumps

Inexpensive, easy
to manage Cluster
Replicated in many
departments
Prototype for very
large PC cluster

55

Adoption of the Approach

56

So What's So Different?
Commodity parts?
Communications Packaging?
Incremental Scalability?
Independent Failure?
Intelligent Network Interfaces?
Complete System on every node

virtual memory
scheduler
files
...
57

OPPORTUNITIES
&
CHALLENGES

58

Opportunity of Large-scale
Computing on NOW

Shared pool of computing resources:
processors, memory, disks, interconnect

Guarantee at least one workstation to many individuals
(when active)

Deliver a large % of the collective resources to a few individuals
at any one time
59

Windows of Opportunities

MPP/DSM:
Compute across multiple systems: parallel.
Network RAM:
Use idle memory in other nodes: page across
other nodes' idle memory.
Software RAID:
File system supporting parallel I/O and
reliability, mass storage.
Multi-path Communication:
Communicate across multiple networks:
Ethernet, ATM, Myrinet.

60

Parallel Processing

Scalable parallel applications require:
good floating-point performance
low-overhead communication
scalable network bandwidth
parallel file system

61

Network RAM

The performance gap between processor and
disk has widened.
Thrashing to disk degrades performance
significantly.
Paging across networks can be effective
with high performance networks and an OS
that recognizes idle machines.
Typically, thrashing to network RAM can be 5
to 10 times faster than thrashing to disk.
62

Software RAID: Redundant
Array of Workstation Disks

I/O Bottleneck:
Microprocessor performance is improving more
than 50% per year.
Disk access improvement is < 10% per year.
Applications often perform I/O.
RAID cost per byte is high compared to single
disks.
RAIDs are connected to host computers, which are
often a performance and availability bottleneck.
RAID in software, writing data across an array of
workstation disks, provides performance, and some
degree of redundancy provides availability.

63

Software RAID, Parallel File


Systems, and Parallel I/O

64

Cluster Computer and its


Components

65

Clustering Today

Clustering gained momentum when 3
technologies converged:

1. Very HP microprocessors
workstation performance = yesterday's supercomputers
2. High speed communication
Comm. between cluster nodes >= between processors
in an SMP.
3. Standard tools for parallel/distributed
computing & their growing popularity.
66

Cluster Computer
Architecture

67

Cluster Components...1a
Nodes

Multiple High Performance Components:
PCs
Workstations
SMPs (CLUMPS)
Distributed HPC Systems leading to
Metacomputing

They can be based on different
architectures and run different OSs.
68

Cluster Components...1b
Processors

There are many (CISC/RISC/VLIW/Vector...):
Intel: Pentiums, Xeon, Merced
Sun: SPARC, ULTRASPARC
HP PA
IBM RS6000/PowerPC
SGI MIPS
Digital Alphas

Integrating memory, processing, and
networking into a single chip:
IRAM (CPU & Mem): http://iram.cs.berkeley.edu
Alpha 21364 (CPU, Memory Controller, NI)

69

Cluster Components...2
OS

State of the art OS:
Linux (Beowulf)
Microsoft NT (Illinois HPVM)
SUN Solaris (Berkeley NOW)
IBM AIX (IBM SP2)
HP UX (Illinois - PANDA)
Mach, a microkernel based OS (CMU)
Cluster Operating Systems (Solaris MC, SCO Unixware,
MOSIX (academic project))
OS gluing layers (Berkeley Glunix)
70

Cluster Components...3
High Performance Networks

Ethernet (10 Mbps)
Fast Ethernet (100 Mbps)
Gigabit Ethernet (1 Gbps)
SCI (Dolphin - MPI - 12 microsec latency)
ATM
Myrinet (1.2 Gbps)
Digital Memory Channel
FDDI

71

Cluster Components...4
Network Interfaces

Network Interface Card
Myrinet has an NIC with
user-level access support
The Alpha 21364 processor integrates
processing, the memory controller, and the
network interface into a single chip.

72

Cluster Components...5
Communication Software

Traditional OS-supported facilities (heavy
weight due to protocol processing):
Sockets (TCP/IP), Pipes, etc. (a minimal socket sketch follows this slide)
Lightweight protocols (user level):
Active Messages (Berkeley)
Fast Messages (Illinois)
U-net (Cornell)
XTP (Virginia)
Higher-level systems can be built on top of the
above protocols.
73
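As a point of reference for the "traditional" path above, here is a minimal sketch of a TCP/IP socket client in C. It is not from the original slides: the node name and port are made up, and the example only illustrates the kernel-mediated protocol processing that the user-level lightweight protocols try to avoid.

/* Minimal TCP client sketch (illustrative only; node name and port are hypothetical). */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netdb.h>
#include <sys/socket.h>

int main(void)
{
    struct addrinfo hints, *res;
    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_INET;       /* IPv4 */
    hints.ai_socktype = SOCK_STREAM; /* TCP */

    /* Resolve a (hypothetical) cluster node and service port. */
    if (getaddrinfo("node1.cluster.example.edu", "5000", &hints, &res) != 0)
        return 1;

    int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (fd < 0 || connect(fd, res->ai_addr, res->ai_addrlen) < 0)
        return 1;

    /* Every send/recv goes through the kernel's TCP/IP stack -
       this per-message protocol processing is the "heavy weight" cost. */
    const char *msg = "hello";
    send(fd, msg, strlen(msg), 0);

    char buf[128];
    ssize_t n = recv(fd, buf, sizeof(buf) - 1, 0);
    if (n > 0) {
        buf[n] = '\0';
        printf("reply: %s\n", buf);
    }

    close(fd);
    freeaddrinfo(res);
    return 0;
}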

Cluster Components...6a
Cluster Middleware

Resides between the OS and applications
and offers an infrastructure for supporting:
Single System Image (SSI)
System Availability (SA)

SSI makes the collection appear as a single
machine (globalised view of system
resources), e.g. telnet cluster.myinstitute.edu
SA - checkpointing and process
migration, etc.
74

Cluster Components...6b
Middleware Components

Hardware
DEC Memory Channel, DSM (Alewife, DASH), SMP
techniques
OS / Gluing Layers
Solaris MC, Unixware, Glunix
Applications and Subsystems
System management and electronic forms
Runtime systems (software DSM, PFS, etc.)
Resource management and scheduling (RMS):
CODINE, LSF, PBS, NQS, etc.

75

Cluster Components...7a
Programming Environments

Threads (PCs, SMPs, NOW...)
POSIX Threads (a minimal sketch follows this slide)
Java Threads
MPI
Linux, NT, and many supercomputers
PVM
Software DSMs (Shmem)

76
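To make the threads entry concrete, here is a minimal POSIX threads sketch in C (my own illustration, not taken from the slides); it simply spawns a few workers and joins them, the basic pattern used on the SMP nodes of a cluster. Compile with something like cc -pthread.

/* Minimal POSIX threads sketch: spawn N workers, join them. */
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4

static void *worker(void *arg)
{
    long id = (long)arg;
    printf("worker %ld running on this node\n", id);
    return NULL;
}

int main(void)
{
    pthread_t threads[NUM_THREADS];
    long i;

    for (i = 0; i < NUM_THREADS; i++)
        pthread_create(&threads[i], NULL, worker, (void *)i);

    for (i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);   /* wait for all workers */

    return 0;
}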

Cluster Components...7b
Development Tools

Compilers
C/C++/Java/...
Parallel programming with C++ (MIT Press book)
RAD (rapid application development) tools:
GUI based tools for PP modeling
Debuggers
Performance Analysis Tools
Visualization Tools
77

Cluster Components...8
Applications

Sequential
Parallel / Distributed (cluster-aware apps)
Grand challenge applications:
Weather Forecasting
Quantum Chemistry
Molecular Biology Modeling
Engineering Analysis (CAD/CAM)
...
PDBs, web servers, data-mining

78

Key Operational Benefits of Clustering

System availability (HA): clusters offer inherent high system
availability due to the redundancy of hardware,
operating systems, and applications.
Hardware fault tolerance: redundancy for most system
components (e.g. disk RAID), including both hardware
and software.
OS and application reliability: run multiple copies of the
OS and applications, gaining reliability through this redundancy.
Scalability: add servers to the cluster, add
more clusters to the network as the need arises, or add CPUs to an
SMP.
High performance: running cluster-enabled programs.

79

Classification
of Cluster Computer

80

Clusters Classification..1

Based on Focus (in Market):

High Performance (HP) Clusters
Grand Challenge Applications
High Availability (HA) Clusters
Mission Critical Applications

81

HA Cluster: Server Cluster with


"Heartbeat" Connection

82

Clusters Classification..2

Based on Workstation/PC Ownership:

Dedicated Clusters
Non-dedicated Clusters
Adaptive parallel computing
Also called communal multiprocessing

83

Clusters Classification..3

Based on Node Architecture:

Clusters of PCs (CoPs)
Clusters of Workstations (COWs)
Clusters of SMPs (CLUMPs)

84

Building Scalable Systems:
Cluster of SMPs (Clumps)

Performance of SMP systems vs.
four-processor servers in a cluster

85

Clusters Classification..4

Based on Node OS Type:

Linux Clusters (Beowulf)
Solaris Clusters (Berkeley NOW)
NT Clusters (HPVM)
AIX Clusters (IBM SP2)
SCO/Compaq Clusters (Unixware)
Digital VMS Clusters, HP clusters, ...

86

Clusters Classification..5

Based on node component
architecture & configuration
(processor arch, node type:
PC/Workstation, & OS: Linux/NT...):

Homogeneous Clusters
All nodes have a similar configuration
Heterogeneous Clusters
Nodes based on different processors and
running different OSs
87

Clusters Classification..6a

[Figure: dimensions of scalability & levels of clustering, along three axes -
(1) Platform: Uniprocessor, SMP, Cluster, MPP;
(2) Level: Workgroup, Department, Campus, Enterprise, up to Metacomputing (GRID);
(3) Network technology: private through public networks.]
88

Clusters Classification..6b
Levels of Clustering

Group Clusters (#nodes: 2-99)
(a set of dedicated/non-dedicated computers, mainly connected by a SAN like Myrinet)
Departmental Clusters (#nodes: 99-999)
Organizational Clusters (#nodes: many 100s)
(using ATM nets)
Internet-wide Clusters = Global Clusters
(#nodes: 1000s to many millions)
Metacomputing
Web-based Computing
Agent Based Computing

Java plays a major role in web and agent based computing.

89

Major issues in cluster design

Size Scalability (physical & application)

Enhanced Availability (failure management)

Single System Image (look-and-feel of one system)

Fast Communication (networks & protocols)

Load Balancing (CPU, Net, Memory, Disk)

Security and Encryption (clusters of clusters)

Distributed Environment (Social issues)

Manageability (admin. and control)

Programmability (simple API if required)

Applicability (cluster-aware and non-aware app.)


90

Cluster Middleware
and
Single System Image

91

A typical Cluster Computing


Environment

Application
PVM / MPI/ RSH

???

Hardware/OS
92

CC should support

Multi-user, time-sharing environments
Nodes with different CPU speeds and
memory sizes (heterogeneous configuration)
Many processes with unpredictable
requirements
Unlike an SMP: insufficient bonds between
nodes
Each computer operates independently

93

The missing link is provided by
cluster middleware/underware

Application
PVM / MPI/ RSH

Middleware or
Underware

Hardware/OS
94

SSI Clusters--SMP services on a CC


Pool together the cluster-wide resources
Adaptive resource usage for better
performance
Ease of use - almost like an SMP
Scalable configurations - by decentralized
control

Result: HPC/HAC at PC/workstation prices


95

What is Cluster Middleware ?

An interface between user
applications and the cluster hardware and OS
platform.
Middleware packages support each other at
the management, programming, and
implementation levels.
Middleware layers:
SSI Layer
Availability Layer: it enables the cluster services of
checkpointing, automatic failover, recovery from
failure, and fault-tolerant operation among all cluster nodes.
96

Middleware Design Goals

Complete Transparency (Manageability)
Lets the user see a single cluster system:
single entry point, ftp, telnet, software loading...
Scalable Performance
Easy growth of the cluster:
no change of API & automatic load distribution.
Enhanced Availability
Automatic recovery from failures:
employ checkpointing & fault tolerance technologies
Handle consistency of data when replicated.


97

What is Single System Image (SSI)?

A single system image is the
illusion, created by software or
hardware, that presents a
collection of resources as one,
more powerful resource.
SSI makes the cluster appear like a
single machine to the user, to
applications, and to the network.
A cluster without SSI is not a
cluster.
98

Benefits of Single System


Image

Usage of system resources transparently


Transparent process migration and load
balancing across nodes.
Improved reliability and higher availability
Improved system response time and
performance
Simplified system management
Reduction in the risk of operator errors
User need not be aware of the underlying
system architecture to use these machines
effectively
99

Desired SSI Services

Single Entry Point
telnet cluster.my_institute.edu
rather than telnet node1.cluster.my_institute.edu
Single File Hierarchy: xFS, AFS, Solaris MC Proxy
Single Control Point: management from a single GUI
Single virtual networking
Single memory space - Network RAM / DSM
Single Job Management: Glunix, Codine, LSF
Single User Interface: like the workstation/PC
windowing environment (CDE in Solaris/NT); maybe
it can use Web technology
100

Availability Support
Functions

Single I/O Space (SIO):
any node can access any peripheral or disk device
without knowledge of its physical location.
Single Process Space (SPS):
any process on any node can create processes cluster-wide,
and they communicate through
signals, pipes, etc., as if they were on a single node.
Checkpointing and Process Migration (a minimal checkpointing sketch follows this slide):
saves the process state and intermediate results in
memory or to disk to support rollback recovery when a
node fails; PM for load balancing...

101
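The checkpointing idea above can be illustrated with a deliberately simplified application-level sketch in C (my own example, not the mechanism of any particular cluster package): the program periodically writes its state to a file and, on restart, resumes from the last saved state. The file name and state layout are made up for illustration.

/* Toy application-level checkpointing sketch (illustrative only). */
#include <stdio.h>

struct state { long iteration; double partial_sum; };

/* Write the current state to a (hypothetical) checkpoint file. */
static void checkpoint(const struct state *s)
{
    FILE *f = fopen("ckpt.dat", "wb");
    if (f) { fwrite(s, sizeof(*s), 1, f); fclose(f); }
}

/* Try to resume from an earlier checkpoint; return 1 on success. */
static int restore(struct state *s)
{
    FILE *f = fopen("ckpt.dat", "rb");
    if (!f) return 0;
    int ok = fread(s, sizeof(*s), 1, f) == 1;
    fclose(f);
    return ok;
}

int main(void)
{
    struct state s = { 0, 0.0 };
    if (restore(&s))                      /* rollback recovery after a failure */
        printf("resuming at iteration %ld\n", s.iteration);

    for (; s.iteration < 1000000; s.iteration++) {
        s.partial_sum += 1.0 / (s.iteration + 1);   /* the "real" work */
        if (s.iteration % 100000 == 0)
            checkpoint(&s);               /* periodic checkpoint to disk */
    }
    printf("sum = %f\n", s.partial_sum);
    return 0;
}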

Scalability Vs. Single System Image

[Figure: trade-off between scalability and single system image, starting from a uniprocessor (UP).]

102

SSI Levels/How do we
implement SSI ?

It is the computer science notion of levels of
abstraction (a house is at a higher level of
abstraction than walls, ceilings, and floors).

Application and Subsystem Level
Operating System Kernel Level
Hardware Level

103

SSI at Application and
Subsystem Level

Level: application
  Examples: cluster batch system, system management
  Boundary: an application
  Importance: what a user wants

Level: subsystem
  Examples: distributed DB, OSF DME, Lotus Notes, MPI, PVM
  Boundary: a subsystem
  Importance: SSI for all applications of the subsystem

Level: file system
  Examples: Sun NFS, OSF DFS, NetWare, and so on
  Boundary: shared portion of the file system
  Importance: implicitly supports many applications and subsystems

Level: toolkit
  Examples: OSF DCE, Sun ONC+, Apollo Domain
  Boundary: explicit toolkit facilities: user, service name, time
  Importance: best level of support for heterogeneous system

(c) In search of clusters
104

SSI at Operating System
Kernel Level

Level: Kernel/OS Layer
  Examples: Solaris MC, Unixware, MOSIX, Sprite, Amoeba / GLUnix
  Boundary: each name space: files, processes, pipes, devices, etc.
  Importance: kernel support for applications, adm subsystems

Level: kernel interfaces
  Examples: UNIX (Sun) vnode, Locus (IBM) vproc
  Boundary: type of kernel objects: files, processes, etc.
  Importance: modularizes SSI code within the kernel

Level: virtual memory
  Examples: none supporting operating system kernel
  Boundary: each distributed virtual memory space
  Importance: may simplify implementation of kernel objects

Level: microkernel
  Examples: Mach, PARAS, Chorus, OSF/1 AD, Amoeba
  Boundary: each service outside the microkernel
  Importance: implicit SSI for all system services

(c) In search of clusters
105

SSI at Hardware Level

Level: memory
  Examples: SCI, DASH
  Boundary: memory space
  Importance: better communication and synchronization

Level: memory and I/O
  Examples: SCI, SMP techniques
  Boundary: memory and I/O device space
  Importance: lower overhead cluster I/O

(c) In search of clusters
106

SSI Characteristics

1. Every SSI has a boundary.
2. Single system support can exist
at different levels within a system,
one able to be built on another.

107

SSI Boundaries -- an
applications SSI boundary

Batch System
SSI
Boundary

(c) In search
of clusters
108

Relationship Among
Middleware Modules

109

SSI via OS path!

1. Build as a layer on top of the existing OS
Benefits: makes the system quickly portable, tracks
vendor software upgrades, and reduces development
time.
i.e. new systems can be built quickly by mapping
new services onto the functionality provided by the
layer beneath. E.g.: Glunix.

2. Build SSI at kernel level - a true cluster OS
Good, but can't leverage OS improvements from the
vendor.
E.g.: Unixware, Solaris-MC, and MOSIX.

110

SSI Representative Systems

OS level SSI
SCO NSC UnixWare
Solaris-MC
MOSIX, ...
Middleware level SSI
PVM, TreadMarks (DSM), Glunix,
Condor, Codine, Nimrod, ...
Application level SSI
PARMON, Parallel Oracle, ...

111

SCO NonStop Cluster for UnixWare

http://www.sco.com/products/clustering/

[Figure: two UP or SMP nodes, each running standard SCO UnixWare with clustering hooks;
users, applications, and systems management issue standard OS kernel calls, which modular
kernel extensions route to local devices or, over ServerNet, to devices on other nodes.]

112

How does NonStop Clusters


Work?

Modular extensions and hooks to provide:
Single clusterwide filesystem view
Transparent clusterwide device access
Transparent swap-space sharing
Transparent clusterwide IPC
High performance internode communications
Transparent clusterwide processes, migration, etc.
Node-down cleanup and resource failover
Transparent clusterwide parallel TCP/IP networking
Application availability
Clusterwide membership and cluster time sync
Cluster system administration
Load leveling
113

Solaris-MC: Solaris for MultiComputers

http://www.sun.com/research/solaris-mc/

[Figure: Solaris MC architecture - applications use the standard system call interface;
Solaris MC sits on top of the existing Solaris 2.5 kernel as a C++ object framework with
object invocations to other nodes, providing a global file system, globalized process
management, and globalized networking and I/O.]

114

Solaris MC components

[Figure: the same architecture, showing the Solaris MC modules layered over the existing
Solaris 2.5 kernel:]

Object and communication support
High availability support
PXFS global distributed file system
Process management
Networking

115

Multicomputer OS for UNIX (MOSIX)

http://www.mosix.cs.huji.ac.il/

An OS module (layer) that provides the
applications with the illusion of working on a single
system.
Remote operations are performed like local
operations.
Transparent to the application - the user interface is
unchanged.

[Figure: applications and PVM / MPI / RSH run over the MOSIX layer on top of Hardware/OS.]

116

Main tool

Preemptive process migration that can
migrate ---> any process, anywhere, anytime

Supervised by distributed algorithms that
respond on-line to global resource
availability - transparently.

Load-balancing - migrate processes from over-loaded
to under-loaded nodes.
Memory ushering - migrate processes from a
node that has exhausted its memory, to prevent
paging/swapping.

117

MOSIX for Linux at HUJI

A scalable cluster configuration:
50 Pentium-II 300 MHz
38 Pentium-Pro 200 MHz (some are SMPs)
16 Pentium-II 400 MHz (some are SMPs)
Over 12 GB cluster-wide RAM
Connected by the Myrinet 2.56 Gb/s LAN
Runs Red Hat 6.0, based on kernel 2.2.7
Upgrades: HW with Intel, SW with Linux
Download MOSIX:
http://www.mosix.cs.huji.ac.il/
118

NOW @ Berkeley
Design & Implementation of higher-level system
Global OS (Glunix)
Parallel File Systems (xFS)
Fast Communication (HW for Active Messages)
Application Support
Overcoming technology shortcomings
Fault tolerance
System Management
NOW Goal: Faster for Parallel AND Sequential

http://now.cs.berkeley.edu/

119

NOW Software Components

[Figure: NOW software stack - parallel apps and large sequential apps run over Sockets,
Split-C, MPI, HPF, and vSM; a Global Layer Unix (with a name server) spans the individual
Unix (Solaris) workstations; each workstation has a VN segment driver and an Active
Messages LCP over the Myrinet scalable interconnect.]

120

3 Paths for Applications on NOW?

Revolutionary (MPP style): write new programs from
scratch using MPP languages, compilers, libraries, ...
Porting: port programs from mainframes,
supercomputers, MPPs, ...
Evolutionary: take a sequential program & use
1) Network RAM: first use the memory of many
computers to reduce disk accesses; if not fast
enough, then:
2) Parallel I/O: use many disks in parallel for
accesses not in the file cache; if not fast enough,
then:
3) Parallel program: change the program until it sees
enough processors that it is fast => large speedup
without a fine-grain parallel program
121

Comparison of 4 Cluster Systems

122

Cluster Programming
Environments

Shared Memory Based
DSM
Threads/OpenMP (enabled for clusters)
Java threads (HKU JESSICA, IBM cJVM)
Message Passing Based
PVM (a minimal sketch follows this slide)
MPI
Parametric Computations
Nimrod/Clustor
Automatic Parallelising Compilers
Parallel Libraries & Computational Kernels (NetSolve)
123
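For the message-passing entry above, the deck walks through an MPI example later on; as a hedged companion, here is a minimal PVM 3 sketch in C (my own illustration, in the spirit of the classic hello/hello_other pair), where a parent spawns one worker task and prints the string the worker sends back. The spawned executable name "hello" is an assumption.

/* Minimal PVM 3 sketch: parent spawns one worker, worker sends a greeting back. */
#include <stdio.h>
#include "pvm3.h"

int main(void)
{
    int mytid = pvm_mytid();          /* enroll in PVM, get task id */
    int parent = pvm_parent();

    if (parent == PvmNoParent) {      /* parent: spawn one copy of this program */
        int child;
        char buf[100];
        pvm_spawn("hello", NULL, PvmTaskDefault, "", 1, &child);
        pvm_recv(-1, -1);             /* wait for any message from any task */
        pvm_upkstr(buf);
        printf("parent %x got: %s\n", mytid, buf);
    } else {                          /* worker: send a greeting to the parent */
        pvm_initsend(PvmDataDefault);
        pvm_pkstr("Hello from a PVM worker");
        pvm_send(parent, 1);          /* message tag 1 */
    }

    pvm_exit();                       /* leave the virtual machine */
    return 0;
}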

Levels of Parallelism

[Figure: levels of parallelism - PVM/MPI operate on tasks (task i-1, task i, task i+1),
threads on functions (func1, func2, func3), compilers on loop iterations
(a(0)=.., a(1)=.., a(2)=..), and the CPU on individual instructions.]

Code granularity / code item:
Large grain (task level) - program
Medium grain (control level) - function (thread)
Fine grain (data level) - loop (compiler)
Very fine grain (multiple issue) - with hardware

(A small sketch contrasting these granularities follows this slide.)
124
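As a rough illustration of these granularities (my own sketch, not from the slides), the fragment below marks in comments where each level would apply; the OpenMP pragma stands in for what a parallelising compiler or thread library does at the loop/function level.

/* Sketch of parallelism levels in one program (illustrative only). */
#include <stdio.h>

#define N 1000

/* Medium grain (control level): this whole function could run as a thread. */
static double partial_sum(const double *a, int lo, int hi)
{
    double s = 0.0;
    /* Fine grain (data level): independent loop iterations - a compiler or
       OpenMP can split them across the processors of one node. */
    #pragma omp parallel for reduction(+:s)
    for (int i = lo; i < hi; i++)
        s += a[i] * a[i];
    /* Very fine grain: within each iteration, the CPU issues multiple
       instructions (multiply, add) per cycle. */
    return s;
}

int main(void)
{
    double a[N];
    for (int i = 0; i < N; i++)
        a[i] = i * 0.001;

    /* Large grain (task level): with PVM/MPI, each task/rank would call
       partial_sum() on its own slice and the results would be combined
       with messages; here we simply do it in one process. */
    double s = partial_sum(a, 0, N);
    printf("sum of squares = %f\n", s);
    return 0;
}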

MPI (Message Passing


Interface)

http://www.mpi-forum.org/

A standard message passing interface.
MPI 1.0 - May 1994 (effort started in 1992)
C and Fortran bindings (and now Java)
Portable (once coded, it can run on virtually all HPC
platforms including clusters!)
Performance (by exploiting native hardware features)
Functionality (over 115 functions in MPI 1.0):
environment management, point-to-point &
collective communications, process groups,
communicators, derived data types, and virtual
topology routines.
Availability - a variety of implementations available,
both vendor and public domain.
125

A Sample MPI Program...


#include <stdio.h>
#include <string.h>
#include "mpi.h"

int main( int argc, char *argv[] )
{
  int my_rank;        /* process rank */
  int p;              /* no. of processes */
  int source;         /* rank of sender */
  int dest;           /* rank of receiver */
  int tag = 0;        /* message tag, like email subject */
  char message[100];  /* buffer */
  MPI_Status status;  /* function return status */

  /* Start up MPI */
  MPI_Init( &argc, &argv );
  /* Find our process rank/id */
  MPI_Comm_rank( MPI_COMM_WORLD, &my_rank );
  /* Find out how many processes/tasks are part of this run */
  MPI_Comm_size( MPI_COMM_WORLD, &p );

[Figure: the workers send "Hello, ..." greetings to the master process.]

126

A Sample MPI Program

  if( my_rank == 0 ) /* Master Process */
  {
    for( source = 1; source < p; source++ )
    {
      MPI_Recv( message, 100, MPI_CHAR, source, tag, MPI_COMM_WORLD, &status );
      printf( "%s \n", message );
    }
  }
  else /* Worker Process */
  {
    sprintf( message, "Hello, I am your worker process %d!", my_rank );
    dest = 0;
    MPI_Send( message, strlen(message)+1, MPI_CHAR, dest, tag, MPI_COMM_WORLD );
  }
  /* Shutdown MPI environment */
  MPI_Finalize();
}
127

Execution
% cc -o hello hello.c -lmpi
% mpirun -np 2 hello
Hello, I am your worker process 1!
% mpirun -np 4 hello
Hello, I am your worker process 1!
Hello, I am your worker process 2!
Hello, I am your worker process 3!
% mpirun hello
(no output: there are no workers, so no greetings)

128

PARMON: A Cluster
Monitoring Tool

[Figure: the PARMON client (parmon) runs on a JVM and talks over the high-speed switch to
a PARMON server (parmond) on each node.]

http://www.buyya.com/parmon/

129

Resource Utilization at a
Glance

130

Globalised Cluster Storage

Single I/O Space and
Design Issues

Reference:
Designing SSI Clusters with Hierarchical Checkpointing and Single I/O
Space, IEEE Concurrency, March 1999,
by K. Hwang, H. Jin et al.


131

Clusters with & without Single


I/O Space

Users

Users

Single I/O Space Services

Without Single I/O Space

With Single I/O Space Services

132

Benefits of Single I/O Space

Eliminates the gap between accessing local disk(s) and remote
disks
Supports a persistent programming paradigm
Allows striping on remote disks, accelerating parallel I/O
operations
Facilitates the implementation of distributed checkpointing and
recovery schemes

133

Single I/O Space Design Issues

Integrated I/O Space

Addressing and Mapping Mechanisms

Data movement procedures

134

Integrated I/O Space

[Figure: a single sequence of addresses spanning local disks LD1..LDn with blocks
D11..Dnt (the RADD space), shared RAIDs SD1..SDm with blocks B11..Bmk (the NASD space),
and attached peripherals P1..Ph (the NAP space).]
135

Addressing and Mapping

[Figure: user applications call a Name Agent, which uses a Disk/RAID/NAP Mapper and
I/O Agents with a Block Mover to reach the RADD, NASD, and NAP spaces; implemented as
user-level middleware plus some modified OS system calls.]

136

Data Movement Procedures

[Figure: a user application on Node 1 requests data block A; the I/O agents on Node 1 and
Node 2 locate it on LD2 or an SDi of the NASD, and the Block Mover transfers it between
the nodes.]
137

What Next ??
Clusters of Clusters (HyperClusters)
Global Grid
Interplanetary Grid
Universal Grid??

138

Clusters of Clusters (HyperClusters)

[Figure: three clusters connected over a LAN/WAN; each cluster has a scheduler, a master
daemon, execution daemons, and clients, with job submission and graphical control at each
site.]

139

Towards Grid Computing.

For illustration, placed resources arbitrarily on the GUSTO test-bed!!

140

What is Grid ?

An infrastructure that couples
Computers (PCs, workstations, clusters, traditional
supercomputers, and even laptops, notebooks, mobile
computers, PDAs, and so on)
Software? (e.g., renting expensive special purpose applications
on demand)
Databases (e.g., transparent access to the human genome database)
Special Instruments (e.g., radio telescopes - SETI@Home
searching for life in the galaxy, Astrophysics@Swinburne for
pulsars)
People (maybe even animals, who knows?)
across local/wide-area networks (enterprise,
organisations, or the Internet) and presents them as
a unified integrated (single) resource.
141

Conceptual view of the Grid

Leading to Portal (Super)Computing

http://www.sun.com/hpc/

142

Grid Application-Drivers

Old and new applications are getting enabled due
to the coupling of computers, databases,
instruments, people, etc.:

(distributed) supercomputing
collaborative engineering
high-throughput computing
large scale simulation & parameter studies
remote software access / renting software
data-intensive computing
on-demand computing
143

Grid Components

[Figure: layered Grid architecture]

Grid Apps. - Applications and Portals:
Scientific, Engineering, Collaboration, Problem Solving Environments, Web enabled Apps

Grid Tools - Development Environments and Tools:
Languages, Libraries, Debuggers, Monitoring, Resource Brokers, Web tools

Grid Middleware - Distributed Resources Coupling Services:
Comm., Sign on & Security, Information, Process, Data Access, QoS

Grid Fabric - Networked Resources across Organisations:
Computers, Clusters, Storage Systems, Data Sources, Scientific Instruments,
with Local Resource Managers (Operating Systems, Queuing Systems,
Libraries & App Kernels) and TCP/IP & UDP

144

Many GRID Projects and Initiatives

PUBLIC FORUMS
Computing Portals
Grid Forum
European Grid Forum
IEEE TFCC!
GRID2000 and more.

Australia
Nimrod/G
EcoGrid and GRACE
DISCWorld

Europe
UNICORE
MOL
METODIS
Globe
Poznan Metacomputing
CERN Data Grid
MetaMPI
DAS
JaWS
and many more...

Public Grid Initiatives
Distributed.net
SETI@Home
Compute Power Grid

USA
Globus
Legion
JAVELIN
AppLes
NASA IPG
Condor
Harness
NetSolve
NCSA Workbench
WebFlow
EveryWhere
and many more...

Japan
Ninf
Bricks

http://www.gridcomputing.com/

145

NetSolve

Client/Server/Agent-based computing
An easy-to-use tool providing efficient and uniform
access to a variety of scientific packages on UNIX platforms

Client-server design
Network-enabled solvers
Network resources
Seamless access to resources
Non-hierarchical system
Load balancing
Fault tolerance
Interfaces to Fortran, C, Java, Matlab, and more
Software is available: www.cs.utk.edu/netsolve/

[Figure: a NetSolve client sends a request to the NetSolve agent, which returns its choice
of server from the software repository; the chosen server computes and sends the reply.]
146

HARNESS Virtual Machine

http://www.epm.ornl.gov/harness/

Scalable distributed control and a CCA-based, component-based daemon

[Figure: hosts A-D join a virtual machine (another VM may coexist); discovery and
registration, process control, and user features are HARNESS daemon plug-ins, so the
system can be customized and extended by dynamically adding plug-ins; operation within a
VM uses distributed control.]

147

HARNESS Core Research

Parallel plug-ins for a heterogeneous distributed virtual machine:
One research goal is to understand and implement
a dynamic parallel plug-in environment.
It provides a method for many users to extend Harness
in much the same way that third-party serial plug-ins
extend Netscape, Photoshop, and Linux.

Research issues with parallel plug-ins include:
heterogeneity, synchronization, interoperation, partial success
Three typical cases:
load a plug-in into a single host of the VM w/o communication
load a plug-in into a single host, broadcast to the rest of the VM
load a plug-in into every host of the VM w/ synchronization

148

Nimrod - A Job Management


System

http://www.dgs.monash.edu.au/~davida/nimrod.html

149

Job processing with Nimrod

150

Nimrod/G Architecture

[Figure: Nimrod/G clients talk to the Nimrod engine, which uses a schedule advisor, trading
manager, persistent store, dispatcher, and grid explorer; these sit over middleware services
(trade server TS, grid explorer GE, grid information services GIS) and the local resource
managers and trade servers (RM & TS) of the GUSTO test bed.]

RM: Local Resource Manager, TS: Trade Server

151

Compute Power Market

[Figure: the user's application, via a resource broker (job control agent, schedule advisor,
trade manager, deployment agent), queries the grid information server and grid explorer and
trades with a trade server in each resource domain; the trade server applies charging
algorithms and handles accounting, resource reservation, and resource allocation over
resources R1..Rn, plus other services.]

152

Pointers to Literature on
Cluster Computing

153

Reading Resources..1a
Internet & WWW

Computer Architecture:
http://www.cs.wisc.edu/~arch/www/
PFS & Parallel I/O:
http://www.cs.dartmouth.edu/pario/
Linux Parallel Processing:
http://yara.ecn.purdue.edu/~pplinux/Sites/
DSMs:
http://www.cs.umd.edu/~keleher/dsm.html
154

Reading Resources..1b
Internet & WWW

Solaris-MC:
http://www.sunlabs.com/research/solaris-mc
Microprocessors: Recent Advances:
http://www.microprocessor.sscc.ru
Beowulf:
http://www.beowulf.org
Metacomputing:
http://www.sis.port.ac.uk/~mab/Metacomputing/
155

Reading Resources..2
Books

In Search of Clusters
by G. Pfister, Prentice Hall (2nd ed.), 1998
High Performance Cluster Computing
Volume 1: Architectures and Systems
Volume 2: Programming and Applications
edited by Rajkumar Buyya, Prentice Hall, NJ, USA
Scalable Parallel Computing
by K. Hwang & Z. Xu, McGraw Hill, 1998
156

Reading Resources..3
Journals

A Case for NOW, IEEE Micro, Feb. 1995
by Anderson, Culler, Patterson
Fault Tolerant COW with SSI, IEEE
Concurrency (to appear)
by Kai Hwang, Chow, Wang, Jin, Xu
Cluster Computing: The Commodity
Supercomputing, Journal of Software
Practice and Experience (available from my web page)
by Mark Baker & Rajkumar Buyya

157

Cluster Computing Infoware

http://www.csse.monash.edu.au/~rajkumar/cluster/
158

Cluster Computing Forum

IEEE Task Force on Cluster Computing


(TFCC)

http://www.ieeetfcc.org

159

TFCC Activities...
Network Technologies
OS Technologies
Parallel I/O
Programming Environments
Java Technologies
Algorithms and Applications
Analysis and Profiling
Storage Technologies
High Throughput Computing

160

TFCC Activities...
High Availability
Single System Image
Performance Evaluation
Software Engineering
Education
Newsletter
Industrial Wing
TFCC Regional Activities

All the above have their own pages; see pointers
from: http://www.ieeetfcc.org
161

TFCC Activities...

Mailing list, workshops, conferences, tutorials,
web resources, etc.
Resources for introducing the subject at senior
undergraduate and graduate levels.
Tutorials/Workshops at IEEE Chapters...
... and so on.
FREE MEMBERSHIP, please join!
Visit the TFCC page for more details:
http://www.ieeetfcc.org (updated daily!)


162

Clusters Revisited

163

Summary

We have discussed Clusters:
Enabling Technologies
Architecture & its Components
Classifications
Middleware
Single System Image
Representative Systems
164

Conclusions

Clusters are promising...
They solve the parallel processing paradox.
They offer incremental growth and match funding
patterns.
New trends in hardware and software
technologies are likely to make clusters even more
promising... so that
cluster-based supercomputers can be seen
everywhere!
165

166

Thank You ...

167

Backup Slides...

168

SISD: A Conventional Computer

[Figure: a single processor takes one instruction stream and one data input stream and
produces one data output stream.]

Speed is limited by the rate at which the computer can
transfer information internally.

Ex: PC, Macintosh, Workstations


169

The MISD Architecture

[Figure: a single data input stream flows through processors A, B, and C, each driven by
its own instruction stream, producing one data output stream.]

More of an intellectual exercise than a practical configuration.
Few were built, and none are commercially available.
170

SIMD Architecture

[Figure: a single instruction stream drives processors A, B, and C, each operating on its
own data input stream and producing its own data output stream.]

Ci <= Ai * Bi

Ex: CRAY vector processing machines, Thinking Machines CM*
(a minimal sketch of this data-parallel operation follows this slide)
171
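As a minimal sketch of the data-parallel operation Ci <= Ai * Bi shown above (my own illustration, not from the slides), the loop below applies one operation to every element; on a SIMD or vector machine each iteration would be executed in lockstep across the processing elements.

/* Data-parallel element-wise multiply: the SIMD idea, written as a plain C loop. */
#include <stdio.h>

#define N 8

int main(void)
{
    double a[N], b[N], c[N];
    int i;

    for (i = 0; i < N; i++) {   /* set up some sample operands */
        a[i] = i + 1;
        b[i] = 2.0;
    }

    /* One instruction, many data: on a SIMD/vector machine all N of these
       multiplications proceed in lockstep; on a modern CPU the compiler may
       still vectorize this loop. */
    for (i = 0; i < N; i++)
        c[i] = a[i] * b[i];

    for (i = 0; i < N; i++)
        printf("c[%d] = %g\n", i, c[i]);
    return 0;
}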

MIMD Architecture

[Figure: processors A, B, and C each have their own instruction stream and their own data
input and output streams.]

Unlike SISD and MISD machines, an MIMD computer works asynchronously.

Shared memory (tightly coupled) MIMD
Distributed memory (loosely coupled) MIMD

172
