slide 1

Outline
• Classification
• ILP Architectures
• Data Parallel Architectures
• Process level Parallel Architectures
• Issues in parallel architectures
• Cache coherence problem
• Interconnection networks
slide 2
Outline
• Classification
– Flynn’s [66]
– Feng’s [72]
– Händler’s [77]
– Modern (Sima, Fountain & Kacsuk)
• ILP Architectures
• Data Parallel Architectures
• Process level Parallel Architectures
• Issues in parallel architectures
• Cache coherence problem
• Interconnection networks
slide 3
Flynn’s Classification
Architecture Categories
SISD SIMD MISD MIMD
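A minimal sketch of how the four categories follow from the number of instruction streams and data streams. The function name `flynn_category` and its boolean parameters are illustrative assumptions, not taken from the slides.

```python
def flynn_category(multiple_instruction_streams: bool,
                   multiple_data_streams: bool) -> str:
    """Combine the two stream counts into one of Flynn's four labels."""
    instr = "MI" if multiple_instruction_streams else "SI"
    data = "MD" if multiple_data_streams else "SD"
    return instr + data


# Enumerate all four combinations.
for mi in (False, True):
    for md in (False, True):
        print(flynn_category(mi, md))
```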
slide 4
SISD
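[Figure: one control unit (C) sends an instruction stream (IS) to one processor (P), which exchanges one data stream (DS) with memory (M)]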
slide 5
SIMD
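[Figure: one control unit (C) sends a single instruction stream (IS) to several processors (P), each exchanging its own data stream (DS) with memory (M)]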
slide 6
MISD
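[Figure: several control units (C) and processors (P), with separate instruction streams (IS) and data stream (DS) connections to memory (M)]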
slide 7
MIMD
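[Figure: several control units (C), each sending its own instruction stream (IS) to its own processor (P); each processor exchanges its own data stream (DS) with memory (M)]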
slide 8
Feng’s Classification
[Figure: machines plotted by word length (horizontal axis: 1, 16, 32, 64) against bit slice length (vertical axis: 1, 16, 64, 256, 16K). Machines shown: MPP, STARAN, PEPE, IlliacIV, C.mmP, PDP11, IBM370, CRAY-1]
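A short sketch of how the two axes combine, assuming the usual reading of Feng's scheme in which the maximum degree of parallelism is the product of word length and bit slice length. The coordinates below are commonly quoted textbook values for a few of the machines, not values read from the figure, and the helper names are hypothetical.

```python
# (word length n, bit slice length m): assumed textbook values, not from the figure
MACHINES = {
    "PDP11": (16, 1),
    "CRAY-1": (64, 1),
    "C.mmP": (16, 16),
    "IlliacIV": (64, 64),
    "STARAN": (1, 256),
    "MPP": (1, 16384),
}


def max_parallelism(word_length: int, bit_slice_length: int) -> int:
    """Bits that can be processed together: n * m."""
    return word_length * bit_slice_length


def processing_mode(word_length: int, bit_slice_length: int) -> str:
    """Name the mode: words in parallel if m > 1, bits in parallel if n > 1."""
    word = "word-parallel" if bit_slice_length > 1 else "word-serial"
    bit = "bit-parallel" if word_length > 1 else "bit-serial"
    return f"{word}, {bit}"


for name, (n, m) in MACHINES.items():
    print(name, max_parallelism(n, m), processing_mode(n, m))
```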
slide 9
Händler’s Classification
< K x K’ , D x D’ , W x W’ >
K: control, D: data, W: word
dash (’) => degree of pipelining
TI-ASC <1, 4, 64 x 8>
CDC 6600 <1, 1 x 10, 60> x <10, 1, 12> (I/O)
C.mmP <16, 1, 16> + <1 x 16, 1, 16> + <1, 16, 16>
PEPE <1 x 3, 288, 32>
Cray-1 <1, 12 x 8, 64 x (1 ~ 14)>
slide 10
Modern Classification
Parallel architectures
• Data-parallel architectures
• Function-parallel architectures
slide 11
Data Parallel Architectures
Data-parallel architectures
• Vector architectures
• Associative and neural architectures
• SIMDs
• Systolic architectures
slide 12
Function Parallel Architectures
Function-parallel architectures
• Instruction level parallel architectures (ILPs)
– Pipelined processors
– VLIWs
– Superscalar processors
• Thread level parallel architectures
• Process level parallel architectures (MIMDs)
– Distributed Memory MIMD
– Shared Memory MIMD
slide 13
Outline
• Classification
• ILP Architectures
– Pipelining
– VLIW
– Superscalar
• Data Parallel Architectures
• Process level Parallel Architectures
• Issues in parallel architectures
• Cache coherence problem
• Interconnection networks
slide 14
Pipelining
Stages: IF, D, RF, EX/AG, M, WB
• higher throughput with pipelining
• resources are shared across cycles
• not all instructions take the same number of cycles
slide 15
Hazards in Pipelining
• Procedural dependencies => Control hazards
– conditional and unconditional branches, calls/returns
• Data dependencies => Data hazards
– RAW (read after write)
– WAR (write after read)
– WAW (write after write)
• Resource conflicts => Structural hazards
– use of same resource in different stages
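The three data dependency types can be sketched as a check on register names. This is a simplified illustration: the tuple layout `(destination, sources)` and the function name `data_hazards` are assumptions, and whether a dependency actually stalls a pipeline depends on the pipeline design.

```python
def data_hazards(first, second):
    """Classify dependencies between two instructions in program order.

    Each instruction is (destination_register, [source_registers]).
    """
    dest1, srcs1 = first
    dest2, srcs2 = second
    found = []
    if dest1 is not None and dest1 in srcs2:
        found.append("RAW")  # second reads what first writes
    if dest2 is not None and dest2 in srcs1:
        found.append("WAR")  # second overwrites what first reads
    if dest1 is not None and dest1 == dest2:
        found.append("WAW")  # both write the same register
    return found


# add r1, r2, r3 followed by sub r4, r1, r5
print(data_hazards(("r1", ["r2", "r3"]), ("r4", ["r1", "r5"])))
```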
slide 16
Pipeline Performance
• T: time for one instruction to pass through all S stages
• S: number of stages
• b: frequency of interruptions
CPI = 1 + (S - 1) * b
Time = CPI * T / S
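A worked instance of the two formulas, assuming T is the time for one instruction to traverse all S stages (so the cycle time is T / S) and b is the fraction of instructions that interrupt the pipeline. The function names and the sample numbers are illustrative only.

```python
def pipeline_cpi(stages: int, b: float) -> float:
    """CPI = 1 + (S - 1) * b."""
    return 1 + (stages - 1) * b


def time_per_instruction(stages: int, b: float, total_time: float) -> float:
    """Time = CPI * T / S."""
    return pipeline_cpi(stages, b) * total_time / stages


def speedup(stages: int, b: float) -> float:
    """Unpipelined time T divided by pipelined time, which reduces to S / CPI."""
    return stages / pipeline_cpi(stages, b)


# Example: S = 5 stages, b = 0.25, T = 10 time units
print(pipeline_cpi(5, 0.25))                 # 2.0
print(time_per_instruction(5, 0.25, 10.0))   # 4.0
print(speedup(5, 0.25))                      # 2.5
```

With b = 0 the CPI is 1 and the speedup equals the number of stages; every interruption pulls it below that ideal.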
slide 17
ILP in VLIW processors
[Figure: the fetch unit reads a single multi-operation instruction from cache/memory and passes its operations to several functional units (FUs) that share a register file]
slide 18
ILP in Superscalar processors
[Figure: the fetch unit reads a sequential stream of instructions from cache/memory; the decode and issue unit issues multiple instructions to functional units (FUs) that share a register file. Legend: instruction/control paths, data paths, FU = Functional Unit]
slide 19
Why are Superscalars popular?
• Binary code compatibility among scalar and superscalar processors of the same family
• The same compiler works for all processors (scalars and superscalars) of the same family
• Assembly programming of VLIWs is tedious
• Code density in VLIWs is very poor, which makes the instruction encoding scheme an issue
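The code density point can be illustrated with a toy model, assuming an uncompressed encoding in which every long instruction has a fixed number of slots and unused slots hold NOPs. The function names, the slot count and the operation lists are hypothetical.

```python
def pack_bundles(cycles, width):
    """Pad each cycle's operations with NOPs up to the bundle width."""
    bundles = []
    for ops in cycles:
        if len(ops) > width:
            raise ValueError("more operations than slots in one bundle")
        bundles.append(list(ops) + ["nop"] * (width - len(ops)))
    return bundles


def code_density(bundles):
    """Fraction of slots that hold a useful operation."""
    slots = sum(len(bundle) for bundle in bundles)
    useful = sum(op != "nop" for bundle in bundles for op in bundle)
    return useful / slots


schedule = [["add", "mul", "load"], ["sub"], ["store", "add"]]
bundles = pack_bundles(schedule, width=4)
print(code_density(bundles))  # 0.5
```

Half of the encoded slots here are NOPs, which is the kind of waste a more compact instruction encoding tries to avoid.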

slide 20
Issues in VLIW Architecture
[Figure: several functional units (FUs) sharing one register file]
• Instruction encoding
• Scalability: access time, area and power consumption increase sharply with the number of register ports
slide 21
Tasks of superscalar processing
• Parallel decoding
• Superscalar instruction issue
• Parallel instruction execution
• Preserving the sequential consistency of execution
• Preserving the sequential consistency of exception processing

slide 22
Outline
• Classification
• ILP Architectures
• Data Parallel Architectures
– SIMD Processors
– Vector Processors
– Associative Processors
– Systolic Arrays
• Process level Parallel Architectures
• Issues in parallel architectures
• Cache coherence problem
• Interconnection networks
slide 23
Data Parallel Architectures
• SIMD Processors
– Multiple processing elements driven by a single
instruction stream
• Vector Processors
– Uni-processors with vector instructions
• Associative Processors
– SIMD like processors with associative memory
• Systolic Arrays
– Application specific VLSI structures
slide 24
Systolic Arrays [H.T. Kung 1978]
Simplicity, Regularity, Concurrency, Communication
Example :
Band matrix multiplication
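\[
\begin{pmatrix}
A_{11} & A_{12} & 0 & 0 & 0 & 0 \\
A_{21} & A_{22} & A_{23} & 0 & 0 & 0 \\
A_{31} & A_{32} & A_{33} & A_{34} & 0 & 0 \\
0 & A_{42} & A_{43} & A_{44} & A_{45} & 0 \\
0 & 0 & A_{53} & A_{54} & A_{55} & A_{56} \\
0 & 0 & 0 & A_{64} & A_{65} & A_{66}
\end{pmatrix}
\cdot
\begin{pmatrix}
B_{11} & B_{12} & 0 & 0 & 0 & 0 \\
B_{21} & B_{22} & B_{23} & 0 & 0 & 0 \\
B_{31} & B_{32} & B_{33} & B_{34} & 0 & 0 \\
0 & B_{42} & B_{43} & B_{44} & B_{45} & 0 \\
0 & 0 & B_{53} & B_{54} & B_{55} & B_{56} \\
0 & 0 & 0 & B_{64} & B_{65} & B_{66}
\end{pmatrix}
= C
\]
slide 25
[Figure: the systolic array at T=0, with elements A11, A12, A21, A22, A23, A31 and B11, B12, B21, B31 entering the array]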
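A small check of the band structure in the example, not a model of the systolic data flow itself. It assumes 6 x 6 matrices whose nonzeros lie two diagonals below and one diagonal above the main diagonal, as in the slide's A and B, and uses 1 for every nonzero entry; the helper names are hypothetical.

```python
def band_matrix(n, lower, upper, fill=1):
    """n x n matrix with `fill` on diagonals -lower..upper and 0 elsewhere."""
    return [[fill if -lower <= j - i <= upper else 0 for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain triple-loop matrix product for square matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def bandwidths(m):
    """Return (lower, upper): the farthest nonzero diagonals below and above."""
    lower = max((i - j for i, row in enumerate(m)
                 for j, v in enumerate(row) if v), default=0)
    upper = max((j - i for i, row in enumerate(m)
                 for j, v in enumerate(row) if v), default=0)
    return lower, upper


A = band_matrix(6, lower=2, upper=1)
B = band_matrix(6, lower=2, upper=1)
C = matmul(A, B)
print(bandwidths(A), bandwidths(C))
```

The product of two matrices of band width 4 has band width 4 + 4 - 1 = 7, so only the entries inside that band of C need to be computed.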
slide 26
Outline
• Classification
• ILP Architectures
• Data Parallel Architectures
• Process level Parallel Architectures
– MIMD Processors (Shared Memory, Distributed Memory)
• Issues in parallel architectures
• Cache coherence problem
• Interconnection networks
slide 27
Why Process level Parallel Architectures?
• Data-parallel architectures
• Function-parallel architectures
– Instruction level PAs
– Thread level PAs
– Process level PAs (MIMDs): built using general purpose processors
– Distributed Memory MIMD
– Shared Memory MIMD
slide 28
MIMD Architectures
Design Space
• Extent of address space sharing
• Location of memory modules
• Uniformity of memory access
slide 29
Outline
• Classification
• ILP Architectures
• Data Parallel Architectures
• Process level Parallel Architectures
• Issues in parallel architectures
– User’s perspective
– Architect’s perspective
• Cache coherence problem
• Interconnection networks
slide 30
Issues from user’s perspective
• Specification / Program design
– explicit parallelism or
– implicit parallelism + parallelizing compiler
• Partitioning / mapping to processors
• Scheduling / mapping to time instants
– static or dynamic
• Communication and Synchronization
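The static versus dynamic scheduling choice can be sketched with a toy comparison. It assumes independent tasks with known run times, a round-robin assignment as the static policy, and "next task goes to whichever processor frees up first" as the dynamic policy; both policies and all names are illustrative choices, not methods prescribed by the slides.

```python
import heapq


def static_schedule(task_times, processors):
    """Assign tasks round-robin before execution; return the finish time."""
    loads = [0] * processors
    for index, duration in enumerate(task_times):
        loads[index % processors] += duration
    return max(loads)


def dynamic_schedule(task_times, processors):
    """Give each task to the processor that becomes free first."""
    free_at = [0] * processors
    heapq.heapify(free_at)
    for duration in task_times:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + duration)
    return max(free_at)


tasks = [8, 1, 8, 1, 1, 1]
print(static_schedule(tasks, 2))   # 17
print(dynamic_schedule(tasks, 2))  # 10
```

The static mapping costs nothing at run time but can leave one processor overloaded; the dynamic one balances the load at the price of run-time bookkeeping.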
slide 31
Parallel programming models
• Concurrent control flow
• Functional or logic program
• Vector/array operations
• Concurrent tasks/processes/threads/objects
– with shared variables or message passing
Relationship between programming model and architecture?
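The two communication styles for concurrent tasks can be contrasted in a minimal sketch using Python threads. The same sum is computed once through a lock-protected shared variable and once through a message queue; the function names and the chunking are assumptions made for illustration.

```python
import queue
import threading


def sum_shared(chunks):
    """Workers update one shared variable, guarded by a lock."""
    total = 0
    lock = threading.Lock()

    def worker(chunk):
        nonlocal total
        partial = sum(chunk)
        with lock:  # synchronization protects the shared variable
            total += partial

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


def sum_message_passing(chunks):
    """Workers share nothing; each sends its partial sum as a message."""
    mailbox = queue.Queue()

    def worker(chunk):
        mailbox.put(sum(chunk))

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(mailbox.get() for _ in chunks)


data = [[1, 2, 3], [4, 5], [6]]
print(sum_shared(data), sum_message_passing(data))
```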
slide 32
Issues from architect’s perspective
• Coherence problem in shared memory with
caches
• Efficient interconnection networks
slide 33
Outline
• Classification
• ILP Architectures
• Data Parallel Architectures
• Process level Parallel Architectures
• Issues in parallel architectures
• Cache coherence problem
– Coherence Protocols: bus or directory based, invalidate or update, definition of states
• Interconnection networks
slide 34
Cache Coherence Problem
Multiple copies of data may exist
=> Problem of cache coherence
Options for coherence protocols
• What action is taken?
– Invalidate or Update
• Which processors/caches communicate?
– Snoopy (broadcast) or directory based
• Status of each block?
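One combination of these options can be sketched as a toy simulation: a snoopy, write-invalidate protocol with three block states (Invalid, Shared, Modified) for a single memory block. The state set, the class name `SnoopyBus` and its methods are assumptions chosen for illustration; the slides do not fix a particular protocol.

```python
class SnoopyBus:
    """Write-invalidate protocol for one block shared by several caches."""

    def __init__(self, n_caches, value=0):
        self.memory = value
        self.state = ["I"] * n_caches  # "I" invalid, "S" shared, "M" modified
        self.data = [None] * n_caches

    def read(self, cpu):
        if self.state[cpu] == "I":
            # Read miss: a cache holding a modified copy writes it back first.
            for other, s in enumerate(self.state):
                if s == "M":
                    self.memory = self.data[other]
                    self.state[other] = "S"
            self.data[cpu] = self.memory
            self.state[cpu] = "S"
        return self.data[cpu]

    def write(self, cpu, value):
        # Broadcast an invalidate so no other cache keeps a stale copy.
        for other in range(len(self.state)):
            if other != cpu:
                self.state[other] = "I"
        self.data[cpu] = value
        self.state[cpu] = "M"


bus = SnoopyBus(2, value=5)
bus.read(0)
bus.read(1)
bus.write(0, 9)       # cache 1 is invalidated
print(bus.state)      # ['M', 'I']
print(bus.read(1))    # 9
```

An update protocol would instead send the new value to the other caches on each write, keeping their copies valid.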
slide 35
Outline
• Classification
• ILP Architectures
• Data Parallel Architectures
• Process level Parallel Architectures
• Issues in parallel architectures
• Cache coherence problem
• Interconnection networks
– Switching and control
– Topology
slide 36
Interconnection Networks
• Architectural Variations:
– Topology
– Direct or Indirect (through switches)
– Static (fixed connections) or Dynamic (connections
established as required)
– Routing type (store and forward / wormhole)
• Efficiency:
– Delay
– Bandwidth
– Cost
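The delay difference between the two routing types can be sketched with a simplified, contention-free model. It assumes that store and forward retransmits the whole packet on every link, while wormhole routing forwards the header flit hop by hop and pipelines the rest of the packet behind it. The formulas, parameter names and sample numbers are illustrative assumptions, not figures from the slides.

```python
def store_and_forward_delay(packet_bits, bandwidth, hops):
    """Each hop receives the full packet before forwarding it."""
    return hops * packet_bits / bandwidth


def wormhole_delay(packet_bits, flit_bits, bandwidth, hops):
    """The header flit crosses every hop; the remaining bits stream behind it."""
    header = hops * flit_bits / bandwidth
    body = (packet_bits - flit_bits) / bandwidth
    return header + body


# Example: 1024-bit packet, 32-bit flits, 32 bits per time unit, 4 hops
print(store_and_forward_delay(1024, 32, 4))  # 128.0
print(wormhole_delay(1024, 32, 32, 4))       # 35.0
```

Under this model the store and forward delay grows with the product of distance and packet length, while the wormhole delay grows with their sum.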
slide 37
Books
• D. Sima, T. Fountain, P. Kacsuk, "Advanced Computer Architectures: A Design Space Approach", Addison Wesley, 1997.
• M.J. Flynn, "Computer Architecture: Pipelined and Parallel Processor Design", Narosa Publishing House / Jones and Bartlett, 1996.
• J.L. Hennessy, D.A. Patterson, "Computer Architecture: A Quantitative Approach", Morgan Kaufmann Publishers, 2002.
• K. Hwang, "Advanced Computer Architecture: Parallelism, Scalability, Programmability", McGraw Hill, 1993.
• H.G. Cragon, "Memory Systems and Pipelined Processors", Narosa Publishing House / Jones and Bartlett, 1998.
• D.E. Culler, J.P. Singh, A. Gupta, "Parallel Computer Architecture: A Hardware/Software Approach", Harcourt Asia / Morgan Kaufmann Publishers, 2000.
