
B. Tech IIIrd

CO322: PARALLEL PROCESSING AND ARCHITECTURE


(EIS II)
L 3, T 0, P 0, C 3
1 lecture/week [ from my side]

30 Marks Midsem
50 Marks Endsem
10 Marks Class tests/Quizzes/Assignments
10 Marks Attendance
100 Marks Total

Parallel computer model 4


The state of Computing
Multiprocessors and Multicomputers
Multivector and SIMD Computers

Program and network Properties 4


Conditions of parallelism
Program Partitioning and scheduling
Program Flow Mechanism
System Interconnect Architecture

Principles of scalable performance 4


Performance Metrics and Measures
Parallel Processing Applications
Speedup Performance Laws
Scalability Analysis and Approaches

Processors and Memory Hierarchy 4


Advanced Processor Technology
Superscalar and vector Processors
Memory Hierarchy Technology
Virtual Memory Technology

Multiprocessors and Multicomputers


Multiprocessor system Interconnects
Cache Coherence and synchronization
Message Passing Mechanism

Multivector and SIMD Computers


Vector Processing Principles,
Multivector Multiprocessors
Compound Vector Processing
SIMD Computer Organization,
The Connection Machine CM5.

Scalable Multithreaded and dataflow Architecture

Latency Hiding Techniques,


Principles Of Multithreading,
Fine Grain MultiComputers,
Scalable and Multithreaded Architecture,
Dataflow and Hybrid Architectures.

Multicore Programming

Single Core Processor Fundamentals,


Introduction to Multi Core Architecture,
System Overview of Threading,
Fundamental Concepts of Parallel Programming,
Threading and Parallel Programming
7

1) Kai Hwang, F. Briggs, Computer Architecture and Parallel Processing, McGraw Hill International Edition, Reprint 2006.
2) M. Flynn, Computer Architecture: Pipelined and Parallel Processor Design, 1/E, Jones and Bartlett, 1995.
3) Harry F. Jordan, Fundamentals of Parallel Processing, 1/E, 2002.
4) Hesham El-Rewini and Mostafa Abd-El-Barr, Advanced Computer Architecture and Parallel Processing, Wiley Interscience, 2005.
5) Shameem Akhter and Jason Roberts, Multi-Core Programming, Intel Press, 2006.
8

The State of Computing

Multiprocessors and Multicomputer

Multivector and SIMD Computers

10

Parallel processing
It is a form of processing in which many calculations are carried out simultaneously.
It is driven by the increasing demand for higher performance, lower costs, and nonstop productivity in real-life applications.
Concurrent events take place in today's high-performance computers due to the common practice of multiprogramming, multiprocessing, or multicomputing.
11

Processor = programmable computing element that runs stored programs written using a predefined instruction set.

A parallel computer (or multiple-processor system) is a collection of communicating processing elements (processors) that cooperate to solve large computational problems fast by dividing such problems into parallel tasks.

12

Parallelism appears in various forms


lookahead, pipelining, vectorization, concurrency, data
parallelism, partitioning, interleaving, overlapping,
replication, timesharing, spacesharing, multitasking,
multiprogramming, multithreading, and distributed
computing at different processing levels.

Model physical architectures of


parallel computers, vector supercomputers,

multiprocessors, multicomputers and massively parallel


processors
13

Modern computers are equipped with powerful hardware facilities driven by extensive software packages.
To assess the state of computing we examine:
historical milestones in the development of computers
crucial hardware and software elements
how the performance of computers is identified and analyzed

14

Computers have gone through two major stages of


development:
Mechanical and Electronic

Zuse's and Aiken's machines


Designed for general-purpose computations
Computing and communication were carried out with

moving mechanical parts


Moving parts limited the computing speed and reliability of mechanical computers

15

Modern computers
Electronic components
Moving parts in mechanical computers replaced by

high mobility electrons in electronic computers


Information transmission by mechanical gears or

levers replaced by electric signals


16

17

A modern computer is an integrated system consisting of machine hardware, an instruction set, system software, application programs, and user interfaces.

The use of a computer is driven by real-life problems demanding numerical computing, transaction processing, and logical reasoning.

18

19

Computing Problems
Numerical computing: science and engineering problems demand intensive integer and floating-point computations.
Logical reasoning: artificial intelligence (AI) problems demand logic inferences, symbolic manipulations, and searches over large spaces.

20

Algorithms and Data Structures


Special algorithms and data structures are needed to

specify the computations and communications involved in


computing problems
Most numerical algorithms are deterministic using regular
data structures
Symbolic processing may use heuristics or nondeterministic searches
Parallel algorithm development requires interdisciplinary
interaction

21

Hardware Resources
Processors, memory, and peripheral devices
Special hardware interfaces built into I/O devices
Software interface programs
Processor connectivity (system interconnects, network) and memory organization influence the system architecture

22

Operating System
An effective operating system manages the allocation and deallocation of resources during the execution of user programs.
Mapping matches algorithmic structures with hardware architecture and vice versa: processor scheduling, memory mapping, interprocessor communication.
Parallelism can be exploited at:
(1) algorithm design,
(2) program writing,
(3) compilation, and
(4) run time
23

System Software Support
Needed for the development of efficient programs in high-level languages.
Compiler: translates HLL source code into object code
Assembler: translates assembly code into machine code
Loader: used to initiate the program execution

24

Compiler Support - 3 approaches

Preprocessor
Uses a sequential compiler and a low-level library of the target computer to implement high-level parallel constructs

Precompiler
Performs program flow analysis, dependence checking, and limited optimizations toward parallelism detection

Parallelizing compiler
A fully developed parallelizing compiler can automatically detect parallelism in source code and transform sequential code into parallel constructs

25

Computing Resources and Computation Allocation


The number of processing elements (PEs), the computing power of each element, and the amount/organization of physical memory used.


What portions of the computation and data are allocated or mapped
to each PE

Data access, Communication and Synchronization


How the processing elements cooperate and communicate.
How data is shared/transmitted between processors.
Abstractions and primitives for cooperation/communication and

synchronization.
The characteristics and performance of parallel system network
(System interconnects).

26

Parallel Processing Performance and Scalability Goals


Maximize performance enhancement of parallelism:

Maximize Speedup.
By minimizing parallelization overheads and balancing workload on
processors

Scalability of performance to larger systems

27

Application demands:
More computing cycles/memory needed
Scientific/Engineering computing: CFD, Biology, Chemistry,

Physics, ...
General-purpose computing: Video, Graphics, CAD, Databases, Transaction Processing, Gaming
Mainstream multithreaded programs are similar to parallel programs

28

Challenging Applications in Applied Science/Engineering
Astrophysics
Atmospheric and Ocean Modeling
Bioinformatics
Biomolecular simulation: Protein folding
Computational Chemistry
Computational Physics
Computer vision and image understanding
Data Mining and Data-intensive Computing
Engineering analysis (CAD/CAM)
Global climate modeling and forecasting
Military applications
Quantum chemistry
VLSI design

Such applications have very high computational and memory requirements that cannot be met with single-processor architectures.
Many applications contain a large degree of computational parallelism.

29

30

31

The study of architecture involves both hardware organization and programming/software requirements.

From the assembly-language programmer's point of view:
the instruction set, which includes opcodes (operation codes), addressing modes, registers, and virtual memory

From the hardware implementation point of view:
CPUs, caches, buses, microcode, pipelines, physical memory

Architecture covers the ISA plus the machine implementation


32


33

The von Neumann architecture was built as a sequential machine executing scalar data

Sequential computer improved from


bit-serial to word-parallel operations
fixed-point to floating-point operations

The von Neumann architecture is slow due to


sequential execution of instructions in programs
34

Lookahead
Techniques introduced to prefetch instructions in order to

overlap I/E (instruction fetch/decode and execution)


operations and to enable functional parallelism

Functional parallelism
To use multiple functional units simultaneously
To practice pipelining at various processing levels

35

Pipelining includes
pipelined instruction execution
pipelined arithmetic computations
pipelined memory-access operations

Pipelining is especially useful for performing identical operations repeatedly over vector data strings.

Vector operations were originally carried out implicitly by software-controlled looping using scalar pipeline processors.

36

Classification of various computer architectures


based on notions of instruction and data streams
SISD
single instruction stream over a single data stream

SIMD
single instruction stream over multiple data streams

MISD
multiple instruction streams over a single data stream

MIMD
multiple instruction streams over multiple data streams

37

SISD
Conventional sequential machines

38

They are also called scalar processors, i.e., they execute one instruction at a time, and each instruction has only one set of operands.
Single instruction: only one instruction stream is being acted on by the CPU during any one clock cycle.
Single data: only one data stream is being used as input during any one clock cycle.

39

SIMD
Vector computers are equipped with scalar and vector

hardware

40

Single instruction: all processing units execute the same instruction, issued by the control unit, at any given clock cycle; as shown in the figure, multiple processors execute the instruction issued by one control unit.

Multiple data: each processing unit can operate on a different data element; as shown in the figure, multiple data streams are supplied to the processing units.

41

MIMD

42

Multiple Instruction: every processor may be


executing a different instruction stream

Multiple Data: every processor may be working with a different data stream; as shown in the figure, the multiple data streams are provided by the shared memory.

Can be categorized as loosely coupled or tightly


coupled depending on sharing of data and control

Execution can be synchronous or asynchronous,


deterministic or nondeterministic
43

MISD
The same data stream flows through a linear array of

processors executing different instruction streams

44

A single data stream is fed into multiple processing units.
Each processing unit operates on the data independently via an independent instruction stream, as shown in the figure.
The single data stream is forwarded to the different processing units, each of which is connected to its own control unit and executes the instructions given to it by that control unit.

45

Flynn's Taxonomy (CU = Control Unit, PE = Processing Element, M = Memory)

Single Instruction stream over a Single Data stream (SISD):
Conventional sequential machines or uniprocessors.

Single Instruction stream over Multiple Data streams (SIMD):
Vector computers, arrays of synchronized processing elements (shown here: an array of synchronized processing elements).

Multiple Instruction streams and a Single Data stream (MISD):
Systolic arrays for pipelined execution.

Multiple Instruction streams over Multiple Data streams (MIMD):
Parallel computers or multiprocessor systems (shown here: a distributed-memory multiprocessor system).

46

Parallel computers
execute programs in MIMD mode

Two major classes of parallel computers
shared-memory multiprocessors
message-passing multicomputers

They differ in memory sharing and in the mechanisms used for interprocessor communication

47

Multiprocessor system
Processors communicate with each other through shared variables in a common memory

Multicomputer system
Each computer node has a local memory, unshared with

other nodes
Interprocessor communication is done through message
passing among the nodes

48
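Below is a minimal sketch, added as an illustration (not from the original slides), contrasting the two communication styles just described: a shared variable protected by a lock for the multiprocessor model, and an explicit message over a pipe for the multicomputer model. The function names and values are assumptions made for the example.

```python
# Hypothetical sketch: shared-variable vs. message-passing communication.
import threading
import multiprocessing

# Shared-memory style (multiprocessor): threads update a shared variable.
counter = 0
lock = threading.Lock()

def add_shared(x):
    global counter
    with lock:              # synchronize access to the common memory
        counter += x

# Message-passing style (multicomputer): processes exchange messages.
def node(conn):
    value = conn.recv()     # receive a message from the other node
    conn.send(value + 1)    # reply with a result

if __name__ == "__main__":
    t = threading.Thread(target=add_shared, args=(5,))
    t.start(); t.join()
    print("shared counter =", counter)          # 5

    parent, child = multiprocessing.Pipe()
    p = multiprocessing.Process(target=node, args=(child,))
    p.start()
    parent.send(41)
    print("message reply =", parent.recv())     # 42
    p.join()
```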

Vector processors (implicit)
Vector instructions
Equipped with multiple vector pipelines
Concurrently used under hardware or firmware control

Two families of pipelined (explicit) vector processors:

Memory-to-memory architecture
Pipelined flow of vector operands directly from the memory to the pipelines and then back to the memory

Register-to-register architecture
Uses vector registers to interface between the memory and the functional pipelines
49

50

Hardware configurations differ from machine to machine


(even with the same Flynn classification)
Address spaces of processors
vary among different architectures, and
depend on memory organization, and
should match target application domain.

The communication model and language environments


should ideally be machine-independent
to allow porting to many computers with minimum conversion costs.

Application developers prefer architectural transparency

51

Programmability depends on the programming environment


provided to the users
Conventional computers are used in a sequential
programming environment with tools developed for a
uniprocessor computer
Parallel computers need
parallel tools that allow specification or easy detection of parallelism
operating systems that can perform parallel scheduling of concurrent

events, shared memory allocation, and shared peripheral and


communication links.

52

Use a conventional language (like C, Fortran, Lisp, or Pascal)


to write the program
Use a parallelizing compiler to translate the source code into
parallel code
The compiler must detect parallelism and assign target
machine resources
Success relies heavily on the quality of the compiler.

53

Programmers write explicit parallel code using parallel dialects of common languages (see the sketch below).

The compiler has a reduced need to detect parallelism, but must still preserve existing parallelism and assign target machine resources.

54
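As a concrete illustration of the explicit approach, here is a minimal sketch in Python (added here, not part of the original slides), where the programmer states the parallelism directly through a process pool; the worker function and input sizes are made-up examples.

```python
# Hypothetical sketch of explicitly parallel code: the programmer, not the
# compiler, decides what runs in parallel.
from multiprocessing import Pool

def work(x):
    # stand-in for an expensive, independent computation
    return x * x

if __name__ == "__main__":
    data = range(8)

    # Sequential version (what an implicit approach would start from)
    seq = [work(x) for x in data]

    # Explicitly parallel version: the map over 'data' is declared parallel
    with Pool(processes=4) as pool:
        par = pool.map(work, data)

    assert seq == par
    print(par)
```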

(a) Implicit Parallelism:
Programmer -> source code written in sequential languages (C, C++, Fortran, Lisp, ...) -> parallelizing compiler -> parallel object code -> execution by runtime system

(b) Explicit Parallelism:
Programmer -> source code written in concurrent dialects of C, C++, Fortran, Lisp, ... -> concurrency-preserving compiler -> concurrent object code -> execution by runtime system

55

Parallel extensions of conventional high-level languages

Integrated environments provide
different levels of program abstraction
validation, testing, and debugging
performance prediction and monitoring
visualization support to aid program development and performance measurement
graphic display and animation of computational results

56

SharedMemory Multiprocessors

DistributedMemory Multicomputers

57

Shared memory parallel computers generally have the ability


for all processors to access all memory as global address
space.
Multiple processors can operate independently but share the
same memory resources.
Changes in a memory location effected by one processor are
visible to all other processors.
Shared memory machines can be divided into classes based upon memory access times: UMA, NUMA, and COMA.

58

Three sharedmemory multiprocessor models


The Uniform Memory Access (UMA) model,
The Non Uniform Memory Access (NUMA) model,
The Cache Only Memory Architecture (COMA) model

Models differ in how the memory and peripheral

resources are shared or distributed.


59

The UMA Model
The physical memory is uniformly shared by all the processors
All processors have equal access time to all memory words
Each processor may use a private cache
Peripherals are also shared
Called tightly coupled systems due to the high degree of resource sharing
The system interconnect takes the form of a common bus, a crossbar switch, or a multistage network
Symmetric multiprocessor: all processors are equally capable of running the executive programs
Asymmetric multiprocessor: only one processor or a subset of processors has executive capability

60

UMA

61

The NUMA Model
A NUMA multiprocessor is a shared-memory system in which the access time varies with the location of the memory word.
The shared memory is physically distributed to all processors as local memories.
The collection of all local memories forms a global address space accessible by all processors.
Access to remote memory incurs an additional delay through the interconnection network.

62

NUMA

63

Globally shared memory


Three memory-access patterns
The fastest is local memory access
The next is global memory access
The slowest is access of remote memory

64

Hierarchically structured multiprocessor
Processors are divided into several clusters
Each cluster is itself a UMA or a NUMA multiprocessor
The clusters are connected to global shared-memory modules
The entire system is considered a NUMA machine
All processors belonging to the same cluster are allowed to uniformly access the cluster shared-memory modules
All clusters have equal access to the global memory
Access time to cluster memory is shorter than access time to the global memory

65

66

The COMA Model
A special case of a NUMA machine
The distributed main memories are converted to caches
There is no memory hierarchy at each processor node
All the caches form a global address space
Remote cache access is assisted by distributed cache directories (D)
Initial data placement is not critical

67

The COMA Model

68

Other variants of shared-memory multiprocessors

CC-NUMA
Cache-coherent nonuniform memory access
The model can be specified with distributed shared memory and cache directories

CC-COMA
Cache-coherent COMA

69

The system consists of
multiple computers, called nodes
interconnected by a message-passing network
that provides point-to-point static connections among the nodes

Each node is an autonomous computer consisting of a processor, local memory, and attached disks or I/O peripherals
All local memories are private and are accessible only by the local processor
NORMA: no-remote-memory-access machines
Internode communication is carried out by passing messages through the static connection network

70

71

Multicomputers use hardware routers to pass messages


Computer node is attached to each router
Boundary router may be connected to I/O and peripheral
devices
Message passing between any two nodes involves a sequence
of routers and channels
Mixed types of nodes are allowed in a heterogeneous
multicomputer
Internode communication is achieved through compatible data representations and message-passing protocols

72

The first generation was based on processor-board technology, using hypercube architecture and software-controlled message switching.

The second generation was implemented with mesh-connected architecture, hardware message routing, and a software environment for medium-grain distributed computing.

73

Important issues for multicomputers
Well-known topologies include the ring, tree, mesh, torus, hypercube, cube-connected cycles, etc.
Various communication patterns: one-to-one, broadcast, permutations, and multicast patterns
Message-routing schemes, network flow-control strategies, deadlock avoidance, virtual channels, message-passing primitives, and program decomposition techniques

74

Introduce supercomputers and parallel processors for vector


processing and data parallelism

Supercomputers are classified either as pipelined vector machines using a few powerful processors equipped with vector hardware, or as SIMD computers emphasizing massive data parallelism.

75

A vector computer is often built on top of a scalar processor
The vector processor is attached to the scalar processor as an optional feature
Program and data are loaded into main memory through a host computer
All instructions are first decoded by the scalar control unit

76

77

If the decoded instruction is a scalar operation or a program control operation,
it is executed directly by the scalar processor using the scalar functional pipelines.

If the decoded instruction is a vector operation,
it is sent to the vector control unit (CU).
The CU supervises the flow of vector data between the main memory and the vector functional pipelines.
A number of vector functional pipelines may be built into a vector processor.

78

Register-to-register architecture
Vector registers are used to hold the vector operands and the intermediate and final vector results
Programmable in user instructions
Each is equipped with a component counter that keeps track of the component registers used in successive pipeline cycles
The length of each vector register is usually fixed, e.g., 64-bit component registers in a vector register in a Cray-series supercomputer
The vector functional pipelines retrieve operands from and put results into the vector registers
79

80

Memory-to-memory architecture
Differs in the use of a vector stream unit in place of the vector registers
Vector operands and results are retrieved directly from the main memory in superwords, e.g., 512 bits as in the Cyber 205
81

An operational model of an SIMD computer is specified by a 5-tuple: M = (N, C, I, M, R)
N is the number of processing elements (PEs) in the machine.
C is the set of instructions directly executed by the control unit (CU), including scalar and program flow control instructions.
I is the set of instructions broadcast by the CU to all PEs for parallel execution.
M is the set of masking schemes, where each mask partitions the set of PEs into enabled and disabled subsets.
R is the set of data-routing functions, specifying various patterns to be set up in the interconnection network for inter-PE communications.

82
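As a small illustration (my own sketch, not from the slides), the 5-tuple can be written down as a data structure; the concrete values below are invented placeholders.

```python
# Hypothetical representation of the SIMD operational model M = (N, C, I, M, R).
from typing import NamedTuple, Set, Callable, Dict

class SIMDModel(NamedTuple):
    N: int                  # number of processing elements (PEs)
    C: Set[str]             # instructions executed directly by the CU
    I: Set[str]             # instructions broadcast by the CU to all PEs
    M: Set[str]             # masking schemes (enable/disable subsets of PEs)
    R: Dict[str, Callable]  # data-routing functions for inter-PE communication

# Invented placeholder values, just to show the shape of the model:
example = SIMDModel(
    N=64,
    C={"scalar_add", "branch"},
    I={"vector_add", "vector_mul"},
    M={"all_enabled", "even_pes_only"},
    R={"shift_left": lambda pe: (pe - 1) % 64},
)
print(example.N, sorted(example.I))
```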

83

Performance Measures
The ideal performance of a computer system

demands a perfect match between machine


capability and program behavior.
Machine capability can be enhanced with better:
Hardware technology,
Innovative architectural features, and
Efficient resource management.

84

Program behavior is difficult to predict due to its heavy dependence on application and runtime conditions.
There are many factors affecting program behavior, including:
Algorithm design,
Data structures,
Language efficiency,
Programmer skill, and
Compiler technology.

85

We introduce some fundamental factors for projecting the performance of computers.
They can be used to guide system architects in designing better machines.
They also help educate programmers and compiler writers in optimizing code for more efficient execution.

86

The simplest measure of program performance is the


turnaround time, which includes disk and memory accesses,
input and output activities, compilation time, OS overhead,
and CPU time.
In order to shorten the turnaround time, one must reduce all
these time factors.
In a multiprogrammed computer, the I/O and system
overheads of a given program may overlap with the CPU
times required in other programs.
It is fair to compare just the total CPU time needed for
program execution.
87

Performance Measures
Response Time (Execution time, Latency): the time elapsed between the start and the completion of an event.
Throughput (Bandwidth): the amount of work done in a given time.
Performance: the number of events occurring per unit of time.
Note: execution time is the reciprocal of performance; lower execution time implies higher performance.

88

Performance Measures
A system X is faster than a system Y if, for a given task, the response time on X is lower than on Y.

n = Execution time(Y) / Execution time(X)
  = (1 / Performance(Y)) / (1 / Performance(X))
  = Performance(X) / Performance(Y)
89

Consequently, the statement that X is n% faster than Y means:

Execution time(Y) / Execution time(X) = 1 + n/100

(1 / Performance(Y)) / (1 / Performance(X)) = Performance(X) / Performance(Y) = 1 + n/100

and hence,

n = 100 * (Performance(X) - Performance(Y)) / Performance(Y)
90

Example: Machine A runs a program in 10 seconds and machine B runs the same program in 15 seconds. Therefore:

Execution time(B) / Execution time(A) = 1 + n/100

and hence,

n = 100 * (Execution time(B) - Execution time(A)) / Execution time(A)
n = 100 * (15 - 10) / 10
n = 50

so machine A is 50% faster than machine B.

91
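A few lines of Python, added here as a sketch (not part of the original example), reproduce the calculation above for machines A and B:

```python
# Hypothetical check of the "n times faster" and "n% faster" relations
# for the A/B example above.
time_A = 10.0   # seconds
time_B = 15.0   # seconds

speedup = time_B / time_A                      # ratio of execution times
n_percent = 100 * (time_B - time_A) / time_A   # "n% faster"

print(speedup)     # 1.5  -> A is 1.5 times faster than B
print(n_percent)   # 50.0 -> A is 50% faster than B
```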

Clock Rate
The processor is driven by a clock with a constant cycle time (t).
The inverse of the cycle time is the clock rate (f = 1/t).

CPI - cycles per instruction
The size of a program is its instruction count (Ic): the number of machine instructions to be executed.
Different instructions require different numbers of clock cycles to execute.

92

CPI is an important parameter for measuring the time needed


to execute each instruction
For a given instruction set, we can calculate an average CPI
over all instruction types, provided we know their
frequencies of appearance in the program.
An accurate estimate of the average CPI requires a large
amount of program code to be traced over a long period of
time.
CPI will be taken as an average value for a given instruction
set and a given program mix.

93

Let us define the average number of clock cycles per instruction (CPI) as:

CPI = CPU clock cycles for a program / Instruction count
    = ( sum over i = 1..n of (CPI_i * I_i) ) / Ic

where I_i is the number of times instruction type i is executed in the program and CPI_i is the average number of clock cycles for instruction type i.

94
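The weighted-average CPI formula above can be sketched in a few lines of Python (an illustration added here, not part of the slides); the instruction mix in the example call is an assumption.

```python
# Hypothetical helper implementing CPI = sum(CPI_i * I_i) / Ic.
def average_cpi(mix):
    """mix: list of (count_i, cpi_i) pairs for each instruction type."""
    total_cycles = sum(count * cpi for count, cpi in mix)
    ic = sum(count for count, _ in mix)
    return total_cycles / ic

# Assumed example mix: 60 one-cycle, 30 two-cycle, 10 four-cycle instructions.
print(average_cpi([(60, 1), (30, 2), (10, 4)]))   # 1.6
```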

CPI and clock rate depends on the technology and


architecture of the machine.
Instruction count depends on the instruction set of the
machine and compiler technology.

95

The CPU time (T) or Execution Time is the time needed to execute a given program, excluding the waiting time for I/O or other running programs.

CPU time is further divided into user CPU time and system CPU time.

The CPU time is estimated as:

T = Ic * CPI * t = ( sum over i = 1..n of (CPI_i * I_i) ) * t

where t is the processor cycle time.

96

The execution of an instruction requires going through a cycle of events involving instruction fetch, decode, operand(s) fetch, execution, and storing of result(s):

T = Ic * CPI * t = Ic * (p + m*k) * t

where:
p is the number of processor cycles needed to decode and execute the instruction,
m is the number of memory references needed,
k is the ratio between the memory cycle time and the processor cycle time (a latency factor: how slow the memory is with respect to the CPU),
t is the processor cycle time.
97
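To make the decomposition concrete, here is a small added sketch that plugs assumed values of Ic, p, m, k, and t into T = Ic * (p + m*k) * t; all the numbers are illustrative assumptions.

```python
# Hypothetical example of T = Ic * (p + m*k) * t with assumed parameters.
Ic = 1_000_000      # assumed instruction count
p  = 4              # assumed processor cycles to decode/execute an instruction
m  = 2              # assumed memory references per instruction
k  = 10             # assumed memory-to-processor cycle-time ratio
t  = 25e-9          # assumed 25 ns cycle time (a 40 MHz clock)

cpi = p + m * k     # 24 cycles per instruction on average
T = Ic * cpi * t    # total CPU time in seconds
print(cpi, T)       # 24, 0.6  (i.e., 0.6 s)
```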

Now let C be the total number of cycles required to execute a program:

C = Ic * CPI

The time to execute the program is then:

T = C * t = C / f
T = Ic * CPI * t = Ic * CPI / f
98

MIPS - Million Instructions Per Second
A measure of processor speed:

MIPS = Ic / (Execution time (T) * 10^6)
MIPS = f / (CPI * 10^6)
MIPS = (f * Ic) / (C * 10^6)
99

MFLOPS - Million Floating-Point Operations Per Second
Another performance measure used to evaluate computers:

MFLOPS = Number of floating-point operations in a program / (Execution time * 10^6)

100

Throughput Rate
The number of programs executed per unit time (programs/second).
Ws = system throughput
Wp = CPU throughput

Wp = 1 / T
Wp = (MIPS * 10^6) / Ic
based on the MIPS rate and the average program length Ic

Ws < Wp in a multiprogramming environment, because there are always additional overheads such as those of the time-sharing operating system.
101

Throughput (W/Tn): the execution rate on an n-processor system, measured in FLOPs per unit time or instructions per unit time.

Speedup (Sn = T1/Tn): how much faster an actual machine with n processors runs compared to one processor.

Efficiency (En = Sn/n): the fraction of the maximum possible speedup achieved by n processors.

102
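A small added sketch computing Sn and En from measured run times; the timing values and processor count are assumptions for illustration.

```python
# Hypothetical speedup/efficiency calculation for an n-processor run.
T1 = 120.0    # assumed run time on 1 processor (seconds)
Tn = 20.0     # assumed run time on n processors (seconds)
n  = 8

Sn = T1 / Tn          # speedup
En = Sn / n           # efficiency (fraction of ideal n-fold speedup)
print(Sn, En)         # 6.0, 0.75
```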

The attributes of a computer system that allow it to be scaled linearly up or down in size, to handle smaller or larger workloads, or to obtain proportional increases or decreases in speed on a given application.

Good scalability requires both the algorithm and the machine to have the right properties.

Thus, in general, there are five performance factors (Ic, p, m, k, t), which are influenced by four system attributes.

103

System Attributes versus Performance Factors
Performance factors: Ic and CPI (determined by p, m, k), together with the cycle time t
System attributes:
Instruction-set architecture
Compiler technology
CPU implementation and technology
Memory hierarchy
104

A benchmark program is executed on a 40 MHz processor. The benchmark program has the following statistics:

Instruction Type   Instruction Count   Clock Cycle Count
Arithmetic         45000               1
Branch             32000               2
Load/Store         15000               2
Floating Point     8000                2

Calculate the average CPI, MIPS rate, and execution time for the above benchmark program.
Solved in class (a sketch of the calculation follows below).
105
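A minimal sketch of the calculation requested above; the data come from the slide, while the code itself is an added illustration.

```python
# CPI, MIPS rate, and execution time for the 40 MHz benchmark above.
f = 40e6                                                 # clock rate in Hz
mix = [(45000, 1), (32000, 2), (15000, 2), (8000, 2)]    # (count, cycles) pairs

Ic = sum(count for count, _ in mix)                      # 100000 instructions
cycles = sum(count * cpi for count, cpi in mix)          # 155000 cycles
CPI = cycles / Ic                                        # 1.55
T = cycles / f                                           # 3.875e-3 s
MIPS = f / (CPI * 1e6)                                   # ~25.8

print(CPI, T, MIPS)
```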

(Book Problem 1.4) Solved in class

106

(Book Problem 1.6) Solved in class

107

Operation   Frequency   CPI
ALU ops     35%         1
Loads       25%         2
Stores      15%         2
Branches    25%         3

Compute the Average CPI.
Solved in class

108

For the purpose of solving a given application problem, you benchmark a program on two computer systems.
On system A, the object code executed 80 million Arithmetic Logic Unit operations (ALU ops), 40 million load instructions, and 25 million branch instructions.
On system B, the object code executed 50 million ALU ops, 50 million loads, and 40 million branch instructions.
In both systems, each ALU op takes 1 clock cycle, each load takes 3 clock cycles, and each branch takes 5 clock cycles.

A. Compute the relative frequency of occurrence of each type of instruction executed in both systems.
B. Find the average CPI for each system.
(A sketch of the calculation follows below.)

109
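A sketch of the calculation for parts A and B; the counts and cycle costs come from the problem statement, while the code is an added illustration.

```python
# Relative instruction frequencies and average CPI for systems A and B.
cycles = {"alu": 1, "load": 3, "branch": 5}
counts = {
    "A": {"alu": 80e6, "load": 40e6, "branch": 25e6},
    "B": {"alu": 50e6, "load": 50e6, "branch": 40e6},
}

for system, mix in counts.items():
    total = sum(mix.values())
    freqs = {op: n / total for op, n in mix.items()}          # part A
    cpi = sum(freqs[op] * cycles[op] for op in mix)           # part B
    print(system, {op: round(fr, 3) for op, fr in freqs.items()}, round(cpi, 3))
# Expected: A -> CPI = 325/145 ~ 2.241, B -> CPI = 400/140 ~ 2.857
```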
