PPA Intro Chp1
Tech IIIrd
30 Marks: Midsem
50 Marks: Endsem
10 Marks: Class tests/Quizzes/Assignments
10 Marks: Attendance
100 Marks: Total
Multicore Programming
Parallel processing
It is a form of processing in which many calculations are
carried out simultaneously. It has become mainstream in
high-performance computers due to the common practice of
multiprogramming, multiprocessing, or multicomputing.
Modern computers
The moving parts in mechanical computers were replaced by
electronic components.
Computing Problems
Numerical Computing: Science and engineering
Hardware Resources
Processors, memory and peripheral devices
Special hardware interfaces built into I/O devices
Software interface programs
Processor connectivity (system interconnects, networks)
Operating System
An effective operating system manages the allocation and
deallocation of resources during the execution of user
programs, which are written in high-level languages.
HLL to object code: compiler
Assembly code to machine code: assembler
A loader is used to initiate the program execution
Precompiler
Performs program flow analysis, dependence checking, and limited
optimizations toward parallelism detection
Parallelizing compiler
A fully developed parallelizing compiler can automatically
detect parallelism in source code and transform sequential codes
into parallel constructs
synchronization.
The characteristics and performance of parallel system network
(System interconnects).
Maximize Speedup
By minimizing parallelization overheads and balancing the workload
across processors
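As an illustration of the goal above, speedup and efficiency can be computed from serial and parallel execution times (a minimal sketch; the formulas are the standard textbook definitions and the function names are my own, not from the slides):

```python
def speedup(t_serial, t_parallel):
    """Speedup = serial execution time / parallel execution time."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, p):
    """Efficiency = speedup / number of processors p."""
    return speedup(t_serial, t_parallel) / p

# A perfectly balanced workload with no overhead on 4 processors:
s = speedup(100.0, 25.0)        # 4.0
e = efficiency(100.0, 25.0, 4)  # 1.0 (ideal; overheads push this below 1)
```

Minimizing overheads and balancing load is what keeps the measured efficiency close to the ideal value of 1.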
Application demands:
More computing cycles/memory needed
Scientific/Engineering computing: CFD, Biology, Chemistry,
Physics, ...
General-purpose computing: Video, Graphics, CAD,
Databases, Transaction Processing, Gaming
Mainstream multithreaded programs are similar to parallel
programs
Computational Chemistry
Computational Physics
Computer vision and image understanding
Data Mining and Data-intensive Computing
Lookahead
Techniques introduced to prefetch instructions in order to
overlap instruction fetch/decode with execution (I/E overlap)
Functional parallelism
To use multiple functional units simultaneously
To practice pipelining at various processing levels
Pipelining includes
pipelined instruction execution
pipelined arithmetic computations
memory-access operations
SIMD
A single instruction stream over multiple data streams
MISD
Multiple instruction streams over a single data stream
MIMD
Multiple instruction streams over multiple data streams
SISD
Conventional sequential machines
SIMD
Vector computers are equipped with scalar and vector
hardware
MIMD
MISD
The same data stream flows through a linear array of
processors executing different instruction streams
(Taxonomy)
[Figure: Flynn's classification. CU = Control Unit, PE = Processing Element, M = Memory.
SISD: uniprocessor.
SIMD: an array of synchronized processing elements.
MIMD: parallel computers or multiprocessor systems.
MISD: systolic arrays for pipelined execution.]
Parallel computers
execute programs in MIMD mode
Multiprocessor system
Processors communicate with each other through shared
variables in a common memory
Multicomputer system
Each computer node has a local memory, unshared with
other nodes
Interprocessor communication is done through message
passing among the nodes
Register-to-register architecture
Uses vector registers to interface between the memory and
functional pipelines
[Figure: two approaches to parallel programming.
Implicit parallelism: Programmer -> Parallelizing compiler -> Parallel object code -> Execution by runtime system.
Explicit parallelism: Programmer -> Concurrency-preserving compiler -> Concurrent object code -> Execution by runtime system.]
Shared-Memory Multiprocessors
Distributed-Memory Multicomputers
or a multistage network
Symmetric multiprocessor: all processors are equally capable of
running the executive (OS) programs
Asymmetric multiprocessor: only one processor, or a subset of the
processors, has executive capability
UMA (Uniform Memory Access)
NUMA (Non-Uniform Memory Access)
distributed cache directories (D).
Initial data placement is not critical
CC-COMA
Cache-coherent COMA
System consists of
multiple computers, called nodes,
interconnected by a message-passing network that
provides point-to-point static connections among the nodes
Each node is
an autonomous computer consisting of a processor, local memory,
and sometimes attached disks or I/O peripherals
Register-to-register architecture
Vector registers are
used to hold the vector operands, intermediate and final vector
results
Programmable in user instructions
Each is equipped with a component counter, which
keeps track of the component registers used in successive pipeline
cycles.
The length of each vector register is usually fixed, e.g.,
sixty-four 64-bit component registers in a vector register in a Cray
Series supercomputer
The vector functional pipelines retrieve operands from and put
results into the vector registers
Memory-to-memory architecture
Differs from register-to-register architecture in the use of a
vector stream unit in place of vector registers; vector operands and
results are retrieved from and stored directly into the main memory
Performance Measures
The ideal performance of a computer system demands a perfect
match between machine capability and program behavior
Performance Measures
Response Time (Execution Time, Latency): the
time elapsed between the start and the completion
of an event.
Throughput (Bandwidth): the amount of work
done in a given time.
Performance: the number of events occurring per
unit of time.
Note: execution time is the reciprocal of performance;
lower execution time implies higher performance.
88
Performance Measures
A system X is faster than a system Y if, for a given task,
Execution time(X) < Execution time(Y)
Since Performance = 1 / Execution time, this is equivalent to
Performance(X) > Performance(Y)
X is n% faster than Y when
Execution time(Y) / Execution time(X) = Performance(X) / Performance(Y) = 1 + n/100
Example: if Execution time(Y) = 15 and Execution time(X) = 10, then
1 + n/100 = 15/10 = 1.5, and hence n = 50,
i.e., X is 50% faster than Y.
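The "n% faster" relation can be sketched in a few lines of Python (a minimal illustration; the function name is my own):

```python
def percent_faster(time_y, time_x):
    """Return n such that X is n% faster than Y, using
    Execution time(Y) / Execution time(X) = 1 + n/100."""
    return 100.0 * (time_y / time_x - 1.0)

# The worked example above: Execution time(Y) = 15, Execution time(X) = 10
n = percent_faster(15.0, 10.0)  # 50.0 -> X is 50% faster than Y
```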
Clock Rate
The processor is driven by a clock with a constant cycle
time (t).
The inverse of the cycle time is the clock rate (f = 1/t).
CPI = (total clock cycles) / (instruction count):

CPI = [ sum over i = 1..n of (CPIi x Ii) ] / Ic,  where Ic = sum over i = 1..n of Ii

Here CPIi is the cycles per instruction of instruction type i, and Ii is
the number of instructions of type i executed in the program.
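The weighted-average CPI formula translates directly into code (a sketch; the function name and the pair representation are my own):

```python
def average_cpi(instruction_mix):
    """Compute CPI = sum(CPI_i * I_i) / Ic, where Ic = sum(I_i).

    instruction_mix: list of (count_Ii, cpi_i) pairs,
    one per instruction type."""
    ic = sum(count for count, _ in instruction_mix)          # total instruction count Ic
    cycles = sum(count * cpi for count, cpi in instruction_mix)  # total clock cycles
    return cycles / ic
```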
CPU time is further divided into the user CPU time and the
system CPU time.
T = Ic x CPI x t = [ sum over i = 1..n of (CPIi x Ii) ] x t

where t is the processor cycle time.
T = Ic x CPI x t = Ic x (p + m x k) x t
where p is the number of processor cycles per instruction, m is the
number of memory references per instruction, and k is the ratio of the
memory cycle time to the processor cycle time.
C = Ic x CPI (total clock cycles)
T = C x t = C / f
T = Ic x CPI x t = Ic x CPI / f
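A numeric sketch of T = Ic x (p + m x k) x t (the parameter values below are made up for illustration, not taken from the slides):

```python
def cpu_time(ic, p, m, k, tau):
    """T = Ic * (p + m*k) * tau
    ic  : instruction count
    p   : processor cycles per instruction
    m   : memory references per instruction
    k   : memory-cycle / processor-cycle time ratio
    tau : processor cycle time, in seconds"""
    return ic * (p + m * k) * tau

# Hypothetical machine: 10^6 instructions, p = 1.5, m = 0.4, k = 10,
# and a 1 ns cycle time (f = 1 GHz):
t = cpu_time(1_000_000, 1.5, 0.4, 10, 1e-9)  # about 5.5e-3 s
```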
MIPS = Ic / (T x 10^6)
MIPS = f / (CPI x 10^6)
MIPS = (f x Ic) / (C x 10^6)
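The MIPS expressions above are equivalent, which a short sketch can check (the numbers are made up for illustration):

```python
def mips_from_time(ic, t):
    """MIPS = Ic / (T * 10^6)"""
    return ic / (t * 1e6)

def mips_from_cpi(f, cpi):
    """MIPS = f / (CPI * 10^6)"""
    return f / (cpi * 1e6)

# Hypothetical machine: 10^8 instructions, CPI = 2, f = 500 MHz
ic, cpi, f = 100_000_000, 2.0, 500_000_000
t = ic * cpi / f  # execution time T = Ic * CPI / f = 0.4 s

a = mips_from_time(ic, t)   # ~250 MIPS
b = mips_from_cpi(f, cpi)   # ~250 MIPS, same rate by either formula
```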
The MFLOPS rate is also used to evaluate computers.
MFLOPS = (number of floating-point operations) / (Execution time x 10^6)
Throughput Rate
The number of programs executed per unit time.
Ws = system throughput (programs/second)
Wp = CPU throughput
Wp = 1 / T
Wp = MIPS x 10^6 / Ic
(based on the MIPS rate and the average program length Ic)
Ws < Wp in a multiprogramming environment, because there are
always additional overheads such as those of a time-sharing
operating system.
System Attributes vs. Performance Factors (Ic and CPI):
Instruction-set architecture: affects both Ic and CPI
Compiler technology: affects both Ic and CPI
CPU implementation & technology: affects CPI (processor cycle time)
Memory hierarchy: affects CPI (memory access latency)
Instruction Count

Instruction type   Count   CPI
Arithmetic         45000   1
Branch             32000   2
Load/Store         15000   2
Floating Point      8000   2

Calculate the average CPI, MIPS rate, and execution time for the above
benchmark program
Solved in class
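The average CPI for this benchmark can be reproduced with a short script. The slide does not state a clock rate, so a hypothetical f = 100 MHz is assumed below for the MIPS rate and execution time:

```python
# Instruction mix from the benchmark table: type -> (count, CPI)
mix = {
    "Arithmetic":     (45_000, 1),
    "Branch":         (32_000, 2),
    "Load/Store":     (15_000, 2),
    "Floating Point": (8_000, 2),
}

ic = sum(count for count, _ in mix.values())              # Ic = 100,000 instructions
cycles = sum(count * cpi for count, cpi in mix.values())  # 155,000 clock cycles
avg_cpi = cycles / ic                                     # 1.55 cycles/instruction

# Hypothetical clock rate (not given on the slide):
f = 100e6                      # 100 MHz
mips = f / (avg_cpi * 1e6)     # ~64.5 MIPS
exec_time = ic * avg_cpi / f   # ~1.55e-3 s
```

With a different clock rate, the average CPI is unchanged; only the MIPS rate and execution time scale with f.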
Operation   Frequency   CPI
ALU ops     35%         1
Loads       25%         2
Stores      15%         2
Branches    25%         3
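When the mix is given as frequencies rather than raw counts, the average CPI is a weighted sum, CPI = sum(frequency_i x CPI_i) (a minimal sketch using the table above):

```python
# Operation mix from the table: (name, frequency, CPI)
mix = [
    ("ALU ops",  0.35, 1),
    ("Loads",    0.25, 2),
    ("Stores",   0.15, 2),
    ("Branches", 0.25, 3),
]

# Weighted average: 0.35*1 + 0.25*2 + 0.15*2 + 0.25*3 = 1.90
avg_cpi = sum(freq * cpi for _, freq, cpi in mix)
```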