

Vishal Shrivastav
Indian Institute of Technology Kharagpur


- What is a Supercomputer?
- Where do we use Supercomputers?
- Differences between Supercomputers and PCs
- Brief History of Supercomputers
- Present Day Supercomputers

System Considerations

- Multi core Computing

- Symmetric Multiprocessing
- Distributed Computing
- Distributed Computing: Cluster
- Distributed Computing: Grid
- Cluster vs Grid

- Comparing various classes of Processors


- Amdahl's Law
- Gustafson's Law
- Analogies
- Differences

Memory Considerations

Case Studies

- Memory Hierarchy

- K Computer
- Blue Gene
- Hopper Cray

Processor Considerations


What is a Supercomputer?
Wikipedia defines a Supercomputer as a computer at the frontline of current processing capacity, particularly speed of calculation.

"A Supercomputer is a computer that is only one generation behind what large-scale users want."

Neil Lincoln, architect for the CDC Cyber 205 and others

Where do we use Supercomputers?

Engineering (Automotive)

- Crash Simulations
- Aerodynamics

- Weather Forecasts
- Hurricane Warnings
Applied Mathematics

- Lattice Boltzmann Flow Solvers


- Simulation of HIV protease Dynamics

Differences between Supercomputers and PCs

Supercomputers differ from PCs in the following aspects:
Memory Hierarchy

- Measured in Tera and Peta flops

- Range from $100,000s to $1,000,000s
- Require environmentally controlled rooms

Brief History of Supercomputers

In the 1960s a series of Supercomputers was designed at Control Data Corporation (CDC) by Seymour Cray

- CDC 6600, released in 1964, is considered to be the first Supercomputer
Then arrived the Cray series of Supercomputers

- CRAY 1 : An 80 MHz Supercomputer released in 1976

- CRAY 2 : Released in 1985
An 8 processor liquid cooled computer with Fluorinert pumped
through it as it operated
Performed at 1.9 gigaflops and was the world's fastest until 1990

CONTD . . .
In the 1990s machines with thousands of processors began to appear in the US and Japan

Intel Paragon : Ranked fastest in 1993

- A MIMD machine which connected processors via a high speed 2-D mesh
allowing processes to execute on separate nodes; communicating via the
Message Passing Interface
Fujitsu's Numerical Wind Tunnel : Ranked fastest in 1994

- Used 166 vector processors; achieved top speed of 1.7 gigaflops / processor
Hitachi SR2201 : Ranked fastest in 1996

- Used 2048 processors connected via 3-D crossbar network

- Achieved peak performance of 600 gigaflops

Present Day Supercomputers

Characterized by very high degree of parallelism

- The memory hierarchy is designed so that the processor is kept fed with data and
instructions at all times
- The I/O systems have very high bandwidth
Uses various modern processing techniques:

- Vector Processing
- Non-uniform Memory access
- Parallel Filesystems
More than 90% of present day Supercomputers run some form of Linux as their operating system

CONTD . . .
The base programming language of Supercomputers is FORTRAN or C
The software tools for distributed processing include:

- APIs such as MPI and PVM, VTL

- Open source-based software solutions such as Beowulf, WareWulf and
An easy programming language for Supercomputers remains an open research topic in computer science


Comparing various classes of Processors

[Table: Class of Processor; Implementation; Instruction; Inst. Issue; Speedup (wrt Scalar)]

Memory Hierarchy
Shared Memory

- Memory space is shared between multiple processors for read & write
- Processors interact by modifying data objects in the shared address space
- Memory-CPU bandwidth limits its scalability

Distributed Memory

- Many processors, each with local memory accessible only to it, are connected via an interconnection network (highly scalable)
- They communicate by passing messages to each other
- Non-uniform memory access model

CONTD . . .
The largest and fastest computers in the world today employ a hybrid of shared and distributed memory architectures

- At the low level, compute nodes comprise multiple processors sharing the same address space (shared memory)
- At the higher level, memory is distributed across the nodes (distributed memory)
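
The two memory models above can be contrasted in a small sketch. This is only an illustration using Python threads on a single machine, not how a Supercomputer is actually programmed; the function names `sum_shared` and `sum_message_passing` are hypothetical, and a `queue.Queue` stands in for the interconnection network.

```python
import queue
import threading


def sum_shared(chunks):
    """Shared-memory style: all workers update one shared data object."""
    total = [0]                 # the shared data object
    lock = threading.Lock()     # serializes access to the shared object

    def worker(chunk):
        partial = sum(chunk)
        with lock:              # workers interact by modifying shared data
            total[0] += partial

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]


def sum_message_passing(chunks):
    """Distributed-memory style: workers keep data private and send messages."""
    inbox = queue.Queue()       # stands in for the interconnection network

    def worker(chunk):
        inbox.put(sum(chunk))   # only the partial result is communicated

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(inbox.get() for _ in chunks)


chunks = [range(0, 250), range(250, 500), range(500, 750), range(750, 1000)]
print(sum_shared(chunks), sum_message_passing(chunks))
```

Both functions return the same sum. The difference is where the coordination cost sits: in the lock around shared data in the first case, and in the messages in the second.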


Multi core Computing

- A processor that includes multiple execution units (cores) on the same chip
- Each core in a multi core processor can potentially be superscalar, that is, every cycle, each core can issue multiple instructions from one instruction stream
- IBM's Cell microprocessor, designed for use in the Sony PlayStation 3, is a prominent example of a multi core processor

Symmetric Multiprocessing
- A computer system with multiple identical processors that share memory
and connect via a bus
- Bus contention prevents bus architectures from scaling. As a result, SMPs
generally do not comprise more than 32 processors.





[Figure labels: Main memory, I/O system]

Distributed Computing
Individual memory for each processor
Messaging interface for communication










200+ processors can work on the same application
But they are still not independent and collectively constitute a computer
Synchronization aspects and communication overheads are the major bottlenecks


Distributed Computing: Cluster

Cluster defined:
Coordinated use of interconnected autonomous computers in a machine room
a collection of standalone workstations or PCs that are interconnected
by a high-speed network
work as an integrated collection of resources (unified computing resource)
have a single system image spanning all its nodes
A cluster consists of:
standalone machines with storage
a fast interconnection network
Low latency communication protocols
software to give Single System Image: Cluster Middleware
Programming tools

Classification of Clusters
Non-dedicated Clusters:
Network of Workstations

Use spare computation cycles
of nodes
Background job distribution
Individual owners of
Dedicated Clusters:
Joint ownership
Dedicated nodes
Parallel computing

Homogeneous cluster:
Similar processors
Software, etc


data format
computational speed
system software, etc

Distributed Computing: Grid

The term grid computing originated in the early 1990s as a metaphor for making computer power as easy to access as an electric power grid.

Makes use of computers communicating over the Internet to work on a given problem
Deals only with embarrassingly parallel problems, owing to the low bandwidth and extremely high latency available on the Internet
IBM's Grid computing has put forward:
Use open standards and protocols to gain access to computing resources over the Internet

Uses a large number of small systems spread across a large geographical region and presents a unified picture to the users.

Availability of high speed networking surmounts the distance problem.

The Big Question

At some level, all applications share common needs.
how to:
find resources?
acquire resources?
locate and move data?
start/monitor computation?
all securely and conveniently?


Grid middleware: a single software infrastructure supports all of the above.

Grid Middleware Components

Grid Information Service (GIS):

Support registering and querying grid resources.

Grid Resource Broker (GRB):

End users submit their application requirements to GRB.

GRB discovers resources by querying GIS.

Grid fabric: manages resources

computers, storage devices, scientific instruments, etc.

Core grid middleware: Offers services

Process management, allocation of resources, security

User-level grid Middleware: Offers services:

Programming tools, resource brokers, scheduling application tasks for execution on global resources.
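
The interaction between the Grid Information Service and the Grid Resource Broker described above can be sketched as follows. This is a toy model under stated assumptions, not the API of any real grid middleware: the class names, the `Resource` fields, and the "smallest resource that fits" selection policy are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    name: str    # hypothetical resource identifier
    cores: int   # hypothetical capacity measure
    site: str    # hypothetical administrative domain


class GridInformationService:
    """Toy GIS: supports registering and querying grid resources."""

    def __init__(self):
        self._resources = []

    def register(self, resource):
        self._resources.append(resource)

    def query(self, min_cores):
        return [r for r in self._resources if r.cores >= min_cores]


class GridResourceBroker:
    """Toy GRB: takes application requirements and discovers resources via the GIS."""

    def __init__(self, gis):
        self._gis = gis

    def submit(self, min_cores):
        candidates = self._gis.query(min_cores)
        if not candidates:
            return None
        # Assumed policy for illustration: pick the smallest resource that fits.
        return min(candidates, key=lambda r: r.cores)


gis = GridInformationService()
gis.register(Resource("cluster-a", 64, "site-1"))
gis.register(Resource("cluster-b", 512, "site-2"))
broker = GridResourceBroker(gis)
print(broker.submit(min_cores=100))
```

A real broker would also have to handle security, data movement, and monitoring, which this sketch leaves out.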

Cluster vs Grid
Cluster computing can be said to be a subset of grid computing.
Cluster nodes are in close proximity and interconnected by LAN
Grid nodes are geographically separate.

Clusters provide guarantee of service

Nodes are expected to give full resource.

Clusters are usually a homogeneous set of nodes.

Availability and performance of grid resources are unpredictable:
Requests from within an administrative domain may get higher priority than requests from outside.


Limitations of Parallelism
Parallelization is the process of formulating a problem in a way that lends itself to concurrent execution by several execution units of some kind

Using N processors, we ideally expect the execution time to come down by a factor of N. This is termed a speedup of N

Reality is not so perfect

- Shared resources
- Dependencies between processors
- Communication
- Load imbalance
The serial part limits speedup
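
As a rough illustration of one item in the list above, load imbalance, the sketch below assumes that all processors start together, that everyone waits for the slowest one, and that communication is free. The function name `speedup` and the work figures are made up for the example.

```python
def speedup(work_per_processor):
    """Speedup over a single processor when each processor runs its share
    concurrently and the run ends when the slowest processor finishes."""
    serial_time = sum(work_per_processor)    # one processor does everything
    parallel_time = max(work_per_processor)  # wait for the most loaded one
    return serial_time / parallel_time


balanced = [25, 25, 25, 25]      # same total work, evenly split
imbalanced = [40, 20, 20, 20]    # same total work, one overloaded processor

print(speedup(balanced))    # 4.0
print(speedup(imbalanced))  # 2.5
```

With four processors the ideal speedup of 4 is reached only when the work is evenly split.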

Amdahl's Law
1 processor : T(1) = s + p = 1 (s: serial part, p: parallel part)
n processors : T(n) = s + (p/n)
Scalability (Speedup) = T(1)/T(n) = 1/(s + (1-s)/n)
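
A minimal sketch of the law, using the normalization T(1) = s + p = 1 so that the speedup is 1/(s + (1 - s)/n). The function name `amdahl_speedup` and the 10% serial fraction are illustrative choices only.

```python
def amdahl_speedup(s, n):
    """Speedup T(1)/T(n) for serial fraction s on n processors,
    with T(1) = s + p = 1 and T(n) = s + p/n."""
    if not 0.0 <= s <= 1.0:
        raise ValueError("serial fraction s must lie in [0, 1]")
    if n < 1:
        raise ValueError("number of processors n must be at least 1")
    p = 1.0 - s
    return 1.0 / (s + p / n)


# With a 10% serial part the speedup can never exceed 1/s = 10,
# no matter how many processors are added.
for n in (1, 10, 100, 1000):
    print(n, round(amdahl_speedup(0.1, n), 2))
```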

Gustafson's Law
- Addresses the shortcomings of Amdahl's law
- Says that problems with large, repetitive data sets can be efficiently parallelized
S(P) = P - α(P - 1)
where P is the number of processors, S is the speedup, and α the non-parallelizable part of the process
- Gives due consideration to large-scale problems and tasks
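
A small sketch comparing the two laws, assuming the usual statement of Gustafson's law, S(P) = P - α(P - 1), where α is the non-parallelizable fraction. The function names and the value α = 0.1 are illustrative, not taken from the slides.

```python
def gustafson_speedup(alpha, p):
    """Scaled speedup on p processors when the problem grows with p
    and a fraction alpha of the run time is not parallelizable."""
    return p - alpha * (p - 1)


def amdahl_speedup(s, n):
    """Fixed-size speedup for comparison: 1 / (s + (1 - s) / n)."""
    return 1.0 / (s + (1.0 - s) / n)


# Same 10% serial fraction, very different outlook as processors are added.
for p in (10, 100, 1000):
    print(p, round(amdahl_speedup(0.1, p), 2), round(gustafson_speedup(0.1, p), 2))
```

Under Amdahl's fixed problem size the speedup saturates near 10, while under Gustafson's scaled problem size it keeps growing almost linearly with the number of processors.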

Amdahl's Law
Suppose two cities are 60 km apart and a car has spent one hour travelling the first
30 km. No matter how fast it drives the last 30 km, it is impossible to achieve an
average speed of 90 km/h before arriving at the destination, since even an instantaneous second half would give an average of only 60 km/h

Gustafson's Law
Suppose a car has already been travelling for some time at a speed of less than
90 km/h. Given enough time and distance to travel, the car's average
speed can reach 90 km/h as long as it drives faster than 90 km/h for some time.
The average speed can even reach 120 km/h or 150 km/h as long as it
drives fast enough in the following part
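
The arithmetic behind the two analogies above can be checked with a short sketch. The helper names `average_speed` and `hours_needed`, and the sample speeds, are hypothetical choices for illustration.

```python
def average_speed(legs):
    """Average speed in km/h over a list of (distance_km, speed_kmh) legs."""
    distance = sum(d for d, _ in legs)
    time = sum(d / v for d, v in legs)
    return distance / time


def hours_needed(t0, v0, v, target):
    """Hours of driving at speed v (above target) needed to lift the overall
    average to target after t0 hours at speed v0 (below target)."""
    return t0 * (target - v0) / (v - target)


# Amdahl: the distance is fixed at 60 km and the first 30 km took one hour.
# The average creeps towards 60 km/h but never reaches it.
for v in (90, 300, 3000):
    print(v, round(average_speed([(30, 30), (30, v)]), 1))

# Gustafson: the distance is not fixed, so driving longer at a higher
# speed can lift the average to any target below that speed.
print(hours_needed(t0=1, v0=30, v=120, target=90))
```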

Amdahl's Law

- Does not scale the availability of computing power as the number of machines increases
- Based on a fixed workload or fixed problem size. It implies that the sequential part of a program does not change with respect to machine size (i.e., the number of processors). However, the parallel part is evenly distributed over P processors.

Gustafson's Law

- Proposes that programmers set the size of problems to use the available equipment to solve problems within a practical fixed time
- Therefore, if faster (more parallel) equipment is available, larger problems can be solved in the same time
- Redefines efficiency as a need to minimize the sequential part of a program, even if it increases the total amount of computation

Case Studies

K Computer
Produced by Fujitsu at the RIKEN Advanced Institute for Computational Science campus in Kobe, Japan
In June of 2011, K became the world's fastest Supercomputer, as recorded by the TOP500, with a rating of over 8 petaflops (quadrillion calculations per second)
In November 2011, it became the first computer to top 10 petaflops

CONTD . . .
Major Features
Uses 68,544 2.0 GHz 8-core SPARC64 VIIIfx processors packed in 672 cabinets, for a total of 548,352 cores

It uses 45 nm CMOS technology
Each cabinet has 96 compute nodes and 6 I/O nodes, where each node contains a single processor and 16 GB of memory
Uses a 6-D torus network interconnect called Tofu, and a Tofu-optimized Message Passing Interface based on the Open MPI library

Adopts a 2-level local/global file system

Fujitsu developed an optimized parallel file system based on Lustre, called Fujitsu Exabyte File System, scalable to several hundred petabytes
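
The processor and core counts quoted above can be cross-checked with a few lines of arithmetic. The variable names are arbitrary. One observation worth labeling as an inference: the cabinet figures reproduce the processor total only if the I/O nodes are counted alongside the compute nodes.

```python
processors = 68_544
cores_per_processor = 8
cabinets = 672
compute_nodes_per_cabinet = 96
io_nodes_per_cabinet = 6

# Each node holds a single processor, so node count equals processor count.
total_cores = processors * cores_per_processor
total_nodes = cabinets * (compute_nodes_per_cabinet + io_nodes_per_cabinet)

print(total_cores)   # 548352
print(total_nodes)   # 68544
```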

Blue Gene
Blue Gene is a computer architecture project carried out as a cooperative effort of IBM, Lawrence Livermore National Lab, US Department of Energy and academia
4 Blue Gene projects are in progress:

- Blue Gene/L
- Blue Gene/C
- Blue Gene/P
- Blue Gene/Q
The project was awarded the National Medal of Technology and Innovation by US President Barack Obama in 2009

CONTD . . .
Major features
Trading the speed of processors for lower power consumption.
Dual processors per node with two working modes: co-processor (1 user process/node: computation and communication work is shared by two processors) and virtual mode (2 user processes/node)

A large number of nodes (scalable in increments of 1024 up to at least

Three-dimensional torus interconnect with auxiliary networks for global communications, I/O, and management

Lightweight OS per node for minimum system overhead (computational


CONTD . . .
The block scheme of the Blue Gene/L ASIC including dual PowerPC 440 cores

Hopper Cray
Hopper is NERSC's first petaflop system
It has 153,216 compute cores
The size of main memory is 217 TB
The secondary storage (disk) size is 2 PB
Hopper placed number 5 on the November 2010 Top500 Supercomputer list.

CONTD . . .
Compute Nodes
6,384 nodes
2 twelve-core AMD 'Magny Cours' 2.1-GHz processors per node
24 cores per node (153,216 total cores)
32 GB DDR3 1333-MHz memory per node (6,000 nodes)
64 GB DDR3 1333-MHz memory per node (384 nodes)
Peak Gflop/s rate:
8.4 Gflops/core
201.6 Gflops/node
1.28 Petaflops for the entire machine

Each core has its own L1 and L2 caches, of 64 KB and 512 KB respectively
One 6-MB L3 cache shared between 6 cores on the Magny Cours processor
Four DDR3 1333-MHz memory channels per twelve-core 'Magny Cours'
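
The node, core, memory, and peak-rate figures listed above are mutually consistent, as the following arithmetic sketch shows. The variable names are arbitrary, and decimal units (1 TB = 1000 GB) are assumed when comparing against the 217 TB main memory figure.

```python
nodes = 6_384
cores_per_node = 24
gflops_per_core = 8.4

total_cores = nodes * cores_per_node                 # 153216 cores
gflops_per_node = gflops_per_core * cores_per_node   # about 201.6 Gflops
peak_pflops = gflops_per_node * nodes / 1e6          # Gflops -> Pflops

# 6,000 nodes with 32 GB plus 384 nodes with 64 GB
memory_gb = 6_000 * 32 + 384 * 64

print(total_cores)
print(round(gflops_per_node, 1))
print(round(peak_pflops, 3))
print(memory_gb, "GB, about", round(memory_gb / 1000), "TB")
```

The computed peak of roughly 1.287 Pflops matches the quoted 1.28 Petaflops when truncated to two decimals.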


CONTD . . .
Magny Cours Processor

CONTD . . .
Hopper's compute nodes are connected via a custom high-bandwidth, low-latency network provided by Cray

Each network node handles not only data destined for itself, but also data to be relayed to other nodes

Nodes at the "edges" of the mesh network are connected to nodes at the other edge to form a 3-D torus
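
The wrap-around at the edges can be expressed with modular arithmetic. The sketch below is a generic 3-D torus model, not Cray's routing logic; the function names and the 4 x 4 x 4 dimensions are hypothetical and do not reflect Hopper's actual layout.

```python
def torus_neighbors(coord, dims):
    """Six direct neighbors of a node in a 3-D torus; the modulo makes
    nodes on one edge wrap around to the opposite edge."""
    x, y, z = coord
    nx, ny, nz = dims
    return [
        ((x - 1) % nx, y, z), ((x + 1) % nx, y, z),
        (x, (y - 1) % ny, z), (x, (y + 1) % ny, z),
        (x, y, (z - 1) % nz), (x, y, (z + 1) % nz),
    ]


def torus_hops(a, b, dims):
    """Minimum number of hops between two nodes, taking the shorter way
    around the ring in each dimension."""
    return sum(min(abs(i - j), n - abs(i - j)) for i, j, n in zip(a, b, dims))


dims = (4, 4, 4)
print(torus_neighbors((0, 0, 0), dims))
print(torus_hops((0, 0, 0), (3, 3, 3), dims))   # 3 hops instead of 9 in a plain mesh
```

Because intermediate nodes relay data, the hop count is a rough proxy for how many network nodes a message passes through.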

CONTD . . .
The custom chips that route communication over the network are known as "Gemini" and the entire network is often referred to as the "Cray Gemini Network."
Each pair of Hopper nodes (containing 48 cores) is connected to one Gemini Application-Specific Integrated Circuit (ASIC)

Cray Gemini Interconnect with an "exploded" view of the ASIC

CONTD . . .
Wiring up a Cray XE6

CONTD . . .
File System
All of NERSC's global file systems are available on Hopper. Additionally, Hopper has 2 PB of locally attached high-performance /scratch disk space

Scratch File Systems

Size    Aggregate Peak    # of Disks
1 PB    35 GB/sec
1 PB    35 GB/sec