
B. Tech IIIrd

CO322: PARALLEL PROCESSING AND ARCHITECTURE


(EIS II)
L 3, T 0, P 0, C 3
1 lecture/week [ from my side]

30 Marks Midsem
50 Marks Endsem
10 Marks Class tests/Quizzes/Assignments
10 Marks Attendance
100 Marks Total

Parallel computer model 4


The state of Computing
Multiprocessors and Multicomputers
Multivector and SIMD Computers

Program and network Properties 4


Conditions of parallelism
Program Partitioning and scheduling
Program Flow Mechanism
System Interconnect Architecture

Principles of scalable performance 4


Performance Metrics and Measures
Parallel Processing Applications
Speedup Performance Laws
Scalability Analysis and Approaches

Processors and Memory Hierarchy 4


Advanced Processor Technology
Superscalar and vector Processors
Memory Hierarchy Technology
Virtual Memory Technology

Multiprocessors and Multicomputers


Multiprocessor system Interconnects
Cache Coherence and synchronization
Message Passing Mechanism

Multivector and SIMD Computers


Vector Processing Principles,
Multivector Multiprocessors
Compound Vector Processing
SIMD Computer Organization,
The Connection Machine CM5.

Scalable Multithreaded and dataflow Architecture

Latency Hiding Techniques,


Principles Of Multithreading,
Fine Grain MultiComputers,
Scalable and Multithreaded Architecture,
Dataflow and Hybrid Architectures.

Multicore Programming

Single Core Processor Fundamentals,


Introduction to Multi Core Architecture,
System Overview of Threading,
Fundamental Concepts of Parallel Programming,
Threading and Parallel Programming
7

1) Kai Hwang, F. Briggs, Computer Architecture and Parallel Processing, McGraw Hill International Edition, Reprint 2006.
2) M. Flynn, Computer Architecture: Pipelined and Parallel Processor Design, 1/E, Jones and Bartlett, 1995.
3) Harry F. Jordan, Fundamentals of Parallel Processing, 1/E, 2002.
4) Hesham El-Rewini and Mostafa Abd-El-Barr, Advanced Computer Architecture and Parallel Processing, Wiley Interscience, 2005.
5) Shameem Akhter and Jason Roberts, Multi-Core Programming, Intel Press, 2006.
8

The State of Computing

Multiprocessors and Multicomputer

Multivector and SIMD Computers

10

Parallel processing
It is a form of processing in which many calculations are carried out simultaneously.
It is driven by the increasing demand for higher performance, lower costs, and nonstop productivity in real-life applications.
Concurrent events take place in today's high-performance computers due to the common practice of multiprogramming, multiprocessing, or multicomputing.
11

Processor = programmable computing element that runs stored programs written using a predefined instruction set.

A parallel computer (or multiple-processor system) is a collection of communicating processing elements (processors) that cooperate to solve large computational problems fast by dividing such problems into parallel tasks.

12

Parallelism appears in various forms


lookahead, pipelining, vectorization, concurrency, data
parallelism, partitioning, interleaving, overlapping,
replication, timesharing, spacesharing, multitasking,
multiprogramming, multithreading, and distributed
computing at different processing levels.

Model physical architectures of


parallel computers, vector supercomputers,

multiprocessors, multicomputers and massively parallel


processors
13

Modern computers are equipped with powerful hardware facilities driven by extensive software packages.
To assess the state of computing we examine:
historical milestones in the development of computers
crucial hardware and software elements
how the performance of computers is identified and analyzed

14

Computers have gone through two major stages of


development:
Mechanical and Electronic

Zuse's and Aiken's machines


Designed for general-purpose computations
Computing and communication were carried out with

moving mechanical parts


Moving parts limited the computing speed and reliability of mechanical computers

15

Modern computers
Electronic components
Moving parts in mechanical computers replaced by

high mobility electrons in electronic computers


Information transmission by mechanical gears or

levers replaced by electric signals


16

17

A modern computer is an integrated system consisting of machine hardware, an instruction set, system software, application programs, and user interfaces.

The use of a computer is driven by real-life problems demanding numerical computing, transaction processing, and logical reasoning.

18

19

Computing Problems
Numerical computing: science and engineering problems demand intensive integer and floating-point computations.
Logical reasoning: artificial intelligence (AI) problems demand logic inferences, symbolic manipulations, and searches over large spaces.

20

Algorithms and Data Structures


Special algorithms and data structures are needed to

specify the computations and communications involved in


computing problems
Most numerical algorithms are deterministic using regular
data structures
Symbolic processing may use heuristics or nondeterministic searches
Parallel algorithm development requires interdisciplinary
interaction

21

Hardware Resources
Processors, memory, and peripheral devices
Special hardware interfaces built into I/O devices
Software interface programs
Processor connectivity (system interconnects, network) and memory organization influence the system architecture

22

Operating System
An effective operating system manages the allocation and deallocation of resources during the execution of user programs.
Mapping matches algorithmic structures with hardware architecture and vice versa: processor scheduling, memory mapping, interprocessor communication.
Parallelism can be exploited at:
(1) algorithm design,
(2) program writing,
(3) compilation, and
(4) run time
23

System Software Support
Needed for the development of efficient programs in high-level languages.
Compiler: translates HLL source code into object code
Assembler: translates assembly code into machine code
Loader: used to initiate the program execution

24

Compiler Support - 3 approaches

Preprocessor
Uses a sequential compiler and a low-level library of the target computer to implement high-level parallel constructs

Precompiler
Performs program flow analysis, dependence checking, and limited optimizations toward parallelism detection

Parallelizing compiler
A fully developed parallelizing compiler can automatically detect parallelism in source code and transform sequential code into parallel constructs

25

Computing Resources and Computation Allocation


The number of processing elements (PEs), the computing power of each element, and the amount/organization of physical memory used.


What portions of the computation and data are allocated or mapped
to each PE

Data access, Communication and Synchronization


How the processing elements cooperate and communicate.
How data is shared/transmitted between processors.
Abstractions and primitives for cooperation/communication and

synchronization.
The characteristics and performance of parallel system network
(System interconnects).

26

Parallel Processing Performance and Scalability Goals


Maximize performance enhancement of parallelism:

Maximize Speedup.
By minimizing parallelization overheads and balancing workload on
processors

Scalability of performance to larger systems

27

Application demands:
More computing cycles/memory needed
Scientific/Engineering computing: CFD, Biology, Chemistry,

Physics, ...
General-purpose computing: Video, Graphics, CAD, Databases, Transaction Processing, Gaming
Mainstream multithreaded programs are similar to parallel programs

28

Challenging Applications in Applied Science/Engineering
Astrophysics
Atmospheric and Ocean Modeling
Bioinformatics
Biomolecular simulation: Protein folding
Computational Chemistry
Computational Physics
Computer vision and image understanding
Data Mining and Data-intensive Computing
Engineering analysis (CAD/CAM)
Global climate modeling and forecasting
Military applications
Quantum chemistry
VLSI design

Such applications have very high computational and memory requirements that cannot be met with single-processor architectures.
Many applications contain a large degree of computational parallelism.

29

30

31

The study of architecture involves both hardware organization and programming/software requirements.

From the assembly-language programmer's point of view:
the instruction set, which includes opcodes (operation codes), addressing modes, registers, and virtual memory

From the hardware implementation point of view:
CPUs, caches, buses, microcode, pipelines, physical memory

Architecture covers the ISA plus the machine implementation


32


33

The von Neumann architecture was built as a sequential machine executing scalar data

Sequential computer improved from


bit-serial to word-parallel operations
fixed-point to floating-point operations

The von Neumann architecture is slow due to


sequential execution of instructions in programs
34

Lookahead
Techniques introduced to prefetch instructions in order to

overlap I/E (instruction fetch/decode and execution)


operations and to enable functional parallelism

Functional parallelism
To use multiple functional units simultaneously
To practice pipelining at various processing levels

35

Pipelining includes
pipelined instruction execution
pipelined arithmetic computations
pipelined memory-access operations

Pipelining is especially useful for performing identical operations repeatedly over vector data strings.

Vector operations were originally carried out implicitly by software-controlled looping using scalar pipeline processors.

36

Classification of various computer architectures


based on notions of instruction and data streams
SISD
single instruction stream over a single data stream

SIMD
single instruction stream over multiple data streams

MISD
multiple instruction streams over a single data stream

MIMD
multiple instruction streams over multiple data streams

37

SISD
Conventional sequential machines

38

They are also called scalar processors, i.e., they execute one instruction at a time, and each instruction has only one set of operands.
Single instruction: only one instruction stream is being acted on by the CPU during any one clock cycle.
Single data: only one data stream is being used as input during any one clock cycle.

39

SIMD
Vector computers are equipped with scalar and vector

hardware

40

Single instruction: all processing units execute the same instruction, issued by the control unit, at any given clock cycle; as shown in the figure, multiple processors execute the instruction issued by one control unit.

Multiple data: each processing unit can operate on a different data element; as shown in the figure, multiple data streams are supplied to the processing units.

41

MIMD

42

Multiple Instruction: every processor may be


executing a different instruction stream

Multiple Data: every processor may be working with a different data stream; as shown in the figure, the multiple data streams are provided by the shared memory.

Can be categorized as loosely coupled or tightly


coupled depending on sharing of data and control

Execution can be synchronous or asynchronous,


deterministic or nondeterministic
43

MISD
The same data stream flows through a linear array of

processors executing different instruction streams

44

A single data stream is fed into multiple processing units.
Each processing unit operates on the data independently via an independent instruction stream, as shown in the figure.
The single data stream is forwarded to the different processing units, each of which is connected to its own control unit and executes the instructions given to it by that control unit.

45

Flynn's Taxonomy (CU = Control Unit, PE = Processing Element, M = Memory)

Single Instruction stream over a Single Data stream (SISD):
Conventional sequential machines or uniprocessors.

Single Instruction stream over Multiple Data streams (SIMD):
Vector computers, arrays of synchronized processing elements (shown here: an array of synchronized processing elements).

Multiple Instruction streams and a Single Data stream (MISD):
Systolic arrays for pipelined execution.

Multiple Instruction streams over Multiple Data streams (MIMD):
Parallel computers or multiprocessor systems (shown here: a distributed-memory multiprocessor system).

46

Parallel computers
execute programs in MIMD mode

Two major classes of parallel computers
shared-memory multiprocessors
message-passing multicomputers

They differ in memory sharing and in the mechanisms used for interprocessor communication

47

Multiprocessor system
Processors communicate with each other through shared variables in a common memory

Multicomputer system
Each computer node has a local memory, unshared with

other nodes
Interprocessor communication is done through message
passing among the nodes

48
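Below is a minimal sketch, added as an illustration (not from the original slides), contrasting the two communication styles just described: a shared variable protected by a lock for the multiprocessor model, and an explicit message over a pipe for the multicomputer model. The function names and values are assumptions made for the example.

```python
# Hypothetical sketch: shared-variable vs. message-passing communication.
import threading
import multiprocessing

# Shared-memory style (multiprocessor): threads update a shared variable.
counter = 0
lock = threading.Lock()

def add_shared(x):
    global counter
    with lock:              # synchronize access to the common memory
        counter += x

# Message-passing style (multicomputer): processes exchange messages.
def node(conn):
    value = conn.recv()     # receive a message from the other node
    conn.send(value + 1)    # reply with a result

if __name__ == "__main__":
    t = threading.Thread(target=add_shared, args=(5,))
    t.start(); t.join()
    print("shared counter =", counter)          # 5

    parent, child = multiprocessing.Pipe()
    p = multiprocessing.Process(target=node, args=(child,))
    p.start()
    parent.send(41)
    print("message reply =", parent.recv())     # 42
    p.join()
```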

Vector processors (implicit)
Vector instructions
Equipped with multiple vector pipelines
Concurrently used under hardware or firmware control

Two families of pipelined (explicit) vector processors:

Memory-to-memory architecture
Pipelined flow of vector operands directly from the memory to the pipelines and then back to the memory

Register-to-register architecture
Uses vector registers to interface between the memory and the functional pipelines
49

50

Hardware configurations differ from machine to machine


(even with the same Flynn classification)
Address spaces of processors
vary among different architectures, and
depend on memory organization, and
should match target application domain.

The communication model and language environments


should ideally be machine-independent
to allow porting to many computers with minimum conversion costs.

Application developers prefer architectural transparency

51

Programmability depends on the programming environment


provided to the users
Conventional computers are used in a sequential
programming environment with tools developed for a
uniprocessor computer
Parallel computers need
parallel tools that allow specification or easy detection of parallelism
operating systems that can perform parallel scheduling of concurrent

events, shared memory allocation, and shared peripheral and


communication links.

52

Use a conventional language (like C, Fortran, Lisp, or Pascal)


to write the program
Use a parallelizing compiler to translate the source code into
parallel code
The compiler must detect parallelism and assign target
machine resources
Success relies heavily on the quality of the compiler.

53

Programmers write explicit parallel code using parallel dialects of common languages (see the sketch below).

The compiler has a reduced need to detect parallelism, but must still preserve existing parallelism and assign target machine resources.

54
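As a concrete illustration of the explicit approach, here is a minimal sketch in Python (added here, not part of the original slides), where the programmer states the parallelism directly through a process pool; the worker function and input sizes are made-up examples.

```python
# Hypothetical sketch of explicitly parallel code: the programmer, not the
# compiler, decides what runs in parallel.
from multiprocessing import Pool

def work(x):
    # stand-in for an expensive, independent computation
    return x * x

if __name__ == "__main__":
    data = range(8)

    # Sequential version (what an implicit approach would start from)
    seq = [work(x) for x in data]

    # Explicitly parallel version: the map over 'data' is declared parallel
    with Pool(processes=4) as pool:
        par = pool.map(work, data)

    assert seq == par
    print(par)
```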

(a) Implicit Parallelism:
Programmer -> source code written in sequential languages (C, C++, Fortran, Lisp, ...) -> parallelizing compiler -> parallel object code -> execution by runtime system

(b) Explicit Parallelism:
Programmer -> source code written in concurrent dialects of C, C++, Fortran, Lisp, ... -> concurrency-preserving compiler -> concurrent object code -> execution by runtime system

55

Parallel extensions of conventional high-level languages

Integrated environments provide
different levels of program abstraction
validation, testing, and debugging
performance prediction and monitoring
visualization support to aid program development and performance measurement
graphic display and animation of computational results

56

SharedMemory Multiprocessors

DistributedMemory Multicomputers

57

Shared memory parallel computers generally have the ability


for all processors to access all memory as global address
space.
Multiple processors can operate independently but share the
same memory resources.
Changes in a memory location effected by one processor are
visible to all other processors.
Shared memory machines can be divided into classes based upon memory access times: UMA, NUMA, and COMA.

58

Three sharedmemory multiprocessor models


The Uniform Memory Access (UMA) model,
The Non Uniform Memory Access (NUMA) model,
The Cache Only Memory Architecture (COMA) model

Models differ in how the memory and peripheral

resources are shared or distributed.


59

The UMA Model
The physical memory is uniformly shared by all the processors
All processors have equal access time to all memory words
Each processor may use a private cache
Peripherals are also shared
Called tightly coupled systems due to the high degree of resource sharing
The system interconnect takes the form of a common bus, a crossbar switch, or a multistage network
Symmetric multiprocessor: all processors are equally capable of running the executive programs
Asymmetric multiprocessor: only one processor or a subset of processors has executive capability

60

UMA

61

The NUMA Model
A NUMA multiprocessor is a shared-memory system in which the access time varies with the location of the memory word.
The shared memory is physically distributed to all processors as local memories.
The collection of all local memories forms a global address space accessible by all processors.
Access to remote memory incurs an additional delay through the interconnection network.

62

NUMA

63

Globally shared memory


Three memory-access patterns
The fastest is local memory access
The next is global memory access
The slowest is access of remote memory

64

Hierarchically structured multiprocessor
Processors are divided into several clusters
Each cluster is itself a UMA or a NUMA multiprocessor
The clusters are connected to global shared-memory modules
The entire system is considered a NUMA machine
All processors belonging to the same cluster are allowed to uniformly access the cluster shared-memory modules
All clusters have equal access to the global memory
Access time to cluster memory is shorter than access time to the global memory

65

66

The COMA Model
A special case of a NUMA machine
The distributed main memories are converted to caches
There is no memory hierarchy at each processor node
All the caches form a global address space
Remote cache access is assisted by distributed cache directories (D)
Initial data placement is not critical

67

The COMA Model

68

Other variants of shared-memory multiprocessors

CC-NUMA
Cache-coherent nonuniform memory access
The model can be specified with distributed shared memory and cache directories

CC-COMA
Cache-coherent COMA

69

The system consists of
multiple computers, called nodes
interconnected by a message-passing network
that provides point-to-point static connections among the nodes

Each node is an autonomous computer consisting of a processor, local memory, and attached disks or I/O peripherals
All local memories are private and are accessible only by the local processor
NORMA: no-remote-memory-access machines
Internode communication is carried out by passing messages through the static connection network

70

71

Multicomputers use hardware routers to pass messages


Computer node is attached to each router
Boundary router may be connected to I/O and peripheral
devices
Message passing between any two nodes involves a sequence
of routers and channels
Mixed types of nodes are allowed in a heterogeneous
multicomputer
Internode communication is achieved through compatible data representations and message-passing protocols

72

The first generation was based on processor-board technology, using hypercube architecture and software-controlled message switching.

The second generation was implemented with mesh-connected architecture, hardware message routing, and a software environment for medium-grain distributed computing.

73

Important issues for multicomputers
Well-known topologies include the ring, tree, mesh, torus, hypercube, cube-connected cycles, etc.
Various communication patterns: one-to-one, broadcast, permutations, and multicast patterns
Message-routing schemes, network flow-control strategies, deadlock avoidance, virtual channels, message-passing primitives, and program decomposition techniques

74

Introduce supercomputers and parallel processors for vector


processing and data parallelism

Supercomputers are classified either as pipelined vector machines using a few powerful processors equipped with vector hardware, or as SIMD computers emphasizing massive data parallelism.

75

A vector computer is often built on top of a scalar processor
The vector processor is attached to the scalar processor as an optional feature
Program and data are loaded into main memory through a host computer
All instructions are first decoded by the scalar control unit

76

77

If the decoded instruction is a scalar operation or a program control operation,
it is executed directly by the scalar processor using the scalar functional pipelines.

If the decoded instruction is a vector operation,
it is sent to the vector control unit (CU).
The CU supervises the flow of vector data between the main memory and the vector functional pipelines.
A number of vector functional pipelines may be built into a vector processor.

78

Register-to-register architecture
Vector registers are used to hold the vector operands and the intermediate and final vector results
Programmable in user instructions
Each is equipped with a component counter that keeps track of the component registers used in successive pipeline cycles
The length of each vector register is usually fixed, e.g., 64-bit component registers in a vector register in a Cray-series supercomputer
The vector functional pipelines retrieve operands from and put results into the vector registers
79

80

Memory-to-memory architecture
Differs in the use of a vector stream unit in place of the vector registers
Vector operands and results are retrieved directly from the main memory in superwords, e.g., 512 bits as in the Cyber 205
81

An operational model of an SIMD computer is specified by a 5-tuple: M = (N, C, I, M, R)
N is the number of processing elements (PEs) in the machine.
C is the set of instructions directly executed by the control unit (CU), including scalar and program flow control instructions.
I is the set of instructions broadcast by the CU to all PEs for parallel execution.
M is the set of masking schemes, where each mask partitions the set of PEs into enabled and disabled subsets.
R is the set of data-routing functions, specifying various patterns to be set up in the interconnection network for inter-PE communications.

82
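As a small illustration (my own sketch, not from the slides), the 5-tuple can be written down as a data structure; the concrete values below are invented placeholders.

```python
# Hypothetical representation of the SIMD operational model M = (N, C, I, M, R).
from typing import NamedTuple, Set, Callable, Dict

class SIMDModel(NamedTuple):
    N: int                  # number of processing elements (PEs)
    C: Set[str]             # instructions executed directly by the CU
    I: Set[str]             # instructions broadcast by the CU to all PEs
    M: Set[str]             # masking schemes (enable/disable subsets of PEs)
    R: Dict[str, Callable]  # data-routing functions for inter-PE communication

# Invented placeholder values, just to show the shape of the model:
example = SIMDModel(
    N=64,
    C={"scalar_add", "branch"},
    I={"vector_add", "vector_mul"},
    M={"all_enabled", "even_pes_only"},
    R={"shift_left": lambda pe: (pe - 1) % 64},
)
print(example.N, sorted(example.I))
```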

83

Performance Measures
The ideal performance of a computer system

demands a perfect match between machine


capability and program behavior.
Machine capability can be enhanced with better:
Hardware technology,
Innovative architectural features, and
Efficient resource management.

84

Program behavior is difficult to predict due to its heavy dependence on application and runtime conditions.
There are many factors affecting program behavior, including:
Algorithm design,
Data structures,
Language efficiency,
Programmer skill, and
Compiler technology.

85

We introduce some fundamental factors for projecting the performance of computers.
They can be used to guide system architects in designing better machines.
They also help educate programmers and compiler writers in optimizing code for more efficient execution.

86

The simplest measure of program performance is the


turnaround time, which includes disk and memory accesses,
input and output activities, compilation time, OS overhead,
and CPU time.
In order to shorten the turnaround time, one must reduce all
these time factors.
In a multiprogrammed computer, the I/O and system
overheads of a given program may overlap with the CPU
times required in other programs.
It is fair to compare just the total CPU time needed for
program execution.
87

Performance Measures
Response Time (Execution time, Latency): the time elapsed between the start and the completion of an event.
Throughput (Bandwidth): the amount of work done in a given time.
Performance: the number of events occurring per unit of time.
Note: execution time is the reciprocal of performance; lower execution time implies higher performance.

88

Performance Measures
A system X is faster than a system Y if, for a given task, the response time on X is lower than on Y.

n = Execution time(Y) / Execution time(X)
  = (1 / Performance(Y)) / (1 / Performance(X))
  = Performance(X) / Performance(Y)
89

Consequently, the statement that X is n% faster than Y means:

Execution time(Y) / Execution time(X) = 1 + n/100

(1 / Performance(Y)) / (1 / Performance(X)) = Performance(X) / Performance(Y) = 1 + n/100

and hence,

n = 100 * (Performance(X) - Performance(Y)) / Performance(Y)
90

Example: Machine A runs a program in 10 seconds and machine B runs the same program in 15 seconds. Therefore:

Execution time(B) / Execution time(A) = 1 + n/100

and hence,

n = 100 * (Execution time(B) - Execution time(A)) / Execution time(A)
n = 100 * (15 - 10) / 10
n = 50

so machine A is 50% faster than machine B.

91
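A few lines of Python, added here as a sketch (not part of the original example), reproduce the calculation above for machines A and B:

```python
# Hypothetical check of the "n times faster" and "n% faster" relations
# for the A/B example above.
time_A = 10.0   # seconds
time_B = 15.0   # seconds

speedup = time_B / time_A                      # ratio of execution times
n_percent = 100 * (time_B - time_A) / time_A   # "n% faster"

print(speedup)     # 1.5  -> A is 1.5 times faster than B
print(n_percent)   # 50.0 -> A is 50% faster than B
```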

Clock Rate
The processor is driven by a clock with a constant cycle time (t).
The inverse of the cycle time is the clock rate (f = 1/t).

CPI - cycles per instruction
The size of a program is its instruction count (Ic): the number of machine instructions to be executed.
Different instructions require different numbers of clock cycles to execute.

92

CPI is an important parameter for measuring the time needed


to execute each instruction
For a given instruction set, we can calculate an average CPI
over all instruction types, provided we know their
frequencies of appearance in the program.
An accurate estimate of the average CPI requires a large
amount of program code to be traced over a long period of
time.
CPI will be taken as an average value for a given instruction
set and a given program mix.

93

Let us define the average number of clock cycles per instruction (CPI) as:

CPI = CPU clock cycles for a program / Instruction count
    = ( sum over i = 1..n of (CPI_i * I_i) ) / Ic

where I_i is the number of times instruction type i is executed in the program and CPI_i is the average number of clock cycles for instruction type i.

94
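The weighted-average CPI formula above can be sketched in a few lines of Python (an illustration added here, not part of the slides); the instruction mix in the example call is an assumption.

```python
# Hypothetical helper implementing CPI = sum(CPI_i * I_i) / Ic.
def average_cpi(mix):
    """mix: list of (count_i, cpi_i) pairs for each instruction type."""
    total_cycles = sum(count * cpi for count, cpi in mix)
    ic = sum(count for count, _ in mix)
    return total_cycles / ic

# Assumed example mix: 60 one-cycle, 30 two-cycle, 10 four-cycle instructions.
print(average_cpi([(60, 1), (30, 2), (10, 4)]))   # 1.6
```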

CPI and clock rate depends on the technology and


architecture of the machine.
Instruction count depends on the instruction set of the
machine and compiler technology.

95

The CPU time (T) or Execution Time is the time needed to execute a given program, excluding the waiting time for I/O or other running programs.

CPU time is further divided into user CPU time and system CPU time.

The CPU time is estimated as:

T = Ic * CPI * t = ( sum over i = 1..n of (CPI_i * I_i) ) * t

where t is the processor cycle time.

96

The execution of an instruction requires going through a cycle of events involving instruction fetch, decode, operand(s) fetch, execution, and storing of result(s):

T = Ic * CPI * t = Ic * (p + m*k) * t

where:
p is the number of processor cycles needed to decode and execute the instruction,
m is the number of memory references needed,
k is the ratio between the memory cycle time and the processor cycle time (a latency factor: how slow the memory is with respect to the CPU),
t is the processor cycle time.
97
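To make the decomposition concrete, here is a small added sketch that plugs assumed values of Ic, p, m, k, and t into T = Ic * (p + m*k) * t; all the numbers are illustrative assumptions.

```python
# Hypothetical example of T = Ic * (p + m*k) * t with assumed parameters.
Ic = 1_000_000      # assumed instruction count
p  = 4              # assumed processor cycles to decode/execute an instruction
m  = 2              # assumed memory references per instruction
k  = 10             # assumed memory-to-processor cycle-time ratio
t  = 25e-9          # assumed 25 ns cycle time (a 40 MHz clock)

cpi = p + m * k     # 24 cycles per instruction on average
T = Ic * cpi * t    # total CPU time in seconds
print(cpi, T)       # 24, 0.6  (i.e., 0.6 s)
```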

Now let C be the total number of cycles required to execute a program:

C = Ic * CPI

The time to execute the program is then:

T = C * t = C / f
T = Ic * CPI * t = Ic * CPI / f
98

MIPS - Million Instructions Per Second
A measure of processor speed:

MIPS = Ic / (Execution time (T) * 10^6)
MIPS = f / (CPI * 10^6)
MIPS = (f * Ic) / (C * 10^6)
99

MFLOPS - Million Floating-Point Operations Per Second
Another performance measure used to evaluate computers:

MFLOPS = Number of floating-point operations in a program / (Execution time * 10^6)

100

Throughput Rate
The number of programs executed per unit time (programs/second).
Ws = system throughput
Wp = CPU throughput

Wp = 1 / T
Wp = (MIPS * 10^6) / Ic
based on the MIPS rate and the average program length Ic

Ws < Wp in a multiprogramming environment, because there are always additional overheads such as those of the time-sharing operating system.
101

Throughput (W/Tn): the execution rate on an n-processor system, measured in FLOPs per unit time or instructions per unit time.

Speedup (Sn = T1/Tn): how much faster an actual machine with n processors runs compared to one processor.

Efficiency (En = Sn/n): the fraction of the maximum possible speedup achieved by n processors.

102
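A small added sketch computing Sn and En from measured run times; the timing values and processor count are assumptions for illustration.

```python
# Hypothetical speedup/efficiency calculation for an n-processor run.
T1 = 120.0    # assumed run time on 1 processor (seconds)
Tn = 20.0     # assumed run time on n processors (seconds)
n  = 8

Sn = T1 / Tn          # speedup
En = Sn / n           # efficiency (fraction of ideal n-fold speedup)
print(Sn, En)         # 6.0, 0.75
```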

The attributes of a computer system that allow it to be scaled linearly up or down in size, to handle smaller or larger workloads, or to obtain proportional increases or decreases in speed on a given application.

Good scalability requires both the algorithm and the machine to have the right properties.

Thus, in general, there are five performance factors (Ic, p, m, k, t), which are influenced by four system attributes.

103

System Attributes versus Performance Factors
Performance factors: Ic and CPI (determined by p, m, k), together with the cycle time t
System attributes:
Instruction-set architecture
Compiler technology
CPU implementation and technology
Memory hierarchy
104

A benchmark program is executed on a 40 MHz processor. The benchmark program has the following statistics:

Instruction Type   Instruction Count   Clock Cycle Count
Arithmetic         45000               1
Branch             32000               2
Load/Store         15000               2
Floating Point     8000                2

Calculate the average CPI, MIPS rate, and execution time for the above benchmark program.
Solved in class (a sketch of the calculation follows below).
105
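A minimal sketch of the calculation requested above; the data come from the slide, while the code itself is an added illustration.

```python
# CPI, MIPS rate, and execution time for the 40 MHz benchmark above.
f = 40e6                                                 # clock rate in Hz
mix = [(45000, 1), (32000, 2), (15000, 2), (8000, 2)]    # (count, cycles) pairs

Ic = sum(count for count, _ in mix)                      # 100000 instructions
cycles = sum(count * cpi for count, cpi in mix)          # 155000 cycles
CPI = cycles / Ic                                        # 1.55
T = cycles / f                                           # 3.875e-3 s
MIPS = f / (CPI * 1e6)                                   # ~25.8

print(CPI, T, MIPS)
```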

(Book Problem 1.4) Solved in class

106

(Book Problem 1.6) Solved in class

107

Operation   Frequency   CPI
ALU ops     35%         1
Loads       25%         2
Stores      15%         2
Branches    25%         3

Compute the Average CPI.
Solved in class

108

For the purpose of solving a given application problem, you benchmark a program on two computer systems.
On system A, the object code executed 80 million Arithmetic Logic Unit operations (ALU ops), 40 million load instructions, and 25 million branch instructions.
On system B, the object code executed 50 million ALU ops, 50 million loads, and 40 million branch instructions.
In both systems, each ALU op takes 1 clock cycle, each load takes 3 clock cycles, and each branch takes 5 clock cycles.

A. Compute the relative frequency of occurrence of each type of instruction executed in both systems.
B. Find the average CPI for each system.
(A sketch of the calculation follows below.)

109
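A sketch of the calculation for parts A and B; the counts and cycle costs come from the problem statement, while the code is an added illustration.

```python
# Relative instruction frequencies and average CPI for systems A and B.
cycles = {"alu": 1, "load": 3, "branch": 5}
counts = {
    "A": {"alu": 80e6, "load": 40e6, "branch": 25e6},
    "B": {"alu": 50e6, "load": 50e6, "branch": 40e6},
}

for system, mix in counts.items():
    total = sum(mix.values())
    freqs = {op: n / total for op, n in mix.items()}          # part A
    cpi = sum(freqs[op] * cycles[op] for op in mix)           # part B
    print(system, {op: round(fr, 3) for op, fr in freqs.items()}, round(cpi, 3))
# Expected: A -> CPI = 325/145 ~ 2.241, B -> CPI = 400/140 ~ 2.857
```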
