Advanced VLSI Architecture: Lecture 25: Data Level Parallelism

CKV
Advanced VLSI Architecture

MEL G624
Lecture 25: Data Level Parallelism

Vector Processors
Scalar MIPS code for DAXPY (Y = a × X + Y), 64 elements:
L.D F0, a ; load scalar a
DADDIU R4, Rx, #512 ; last address to load
Loop: L.D F2, 0(Rx) ; load X[i]
MUL.D F2, F2, F0 ; a × X[i]
L.D F4, 0(Ry) ; load Y[i]
ADD.D F4, F4, F2 ; a × X[i] + Y[i]
S.D F4, 0(Ry) ; store into Y[i]
DADDIU Rx, Rx, #8 ; increment index to X
DADDIU Ry, Ry, #8 ; increment index to Y
DSUBU R20, R4, Rx ; compute bound
BNEZ R20, Loop ; check if done
Executes nearly 600 instructions (2 setup + 64 iterations × 9 = 578)
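For reference, the scalar loop corresponds to the C loop below. This is a minimal sketch: the function names daxpy and scalar_instruction_count are illustrative and not from the slides, and the count assumes 2 setup instructions plus 9 instructions per iteration, as in the listing.

```c
#include <assert.h>
#include <stddef.h>

/* DAXPY: Y = a * X + Y, the computation performed by the scalar MIPS loop. */
void daxpy(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i = i + 1)
        y[i] = a * x[i] + y[i];
}

/* Dynamic instruction count of the scalar MIPS version:
   2 setup instructions plus 9 instructions per loop iteration. */
size_t scalar_instruction_count(size_t n)
{
    assert(n > 0); /* the loop body in the listing always runs at least once */
    return 2 + 9 * n;
}
```

With n = 64 (the #512 byte bound divided by 8 bytes per double), scalar_instruction_count returns 578, which is the "nearly 600" on the slide.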

Vector Processors
VMIPS code for DAXPY:
L.D F0, a ; load scalar a
LV V1, Rx ; load vector X
MULVS.D V2, V1, F0 ; vector-scalar multiply
LV V3, Ry ; load vector Y
ADDVV.D V4, V2, V3 ; add two vectors
SV Ry, V4 ; store the result
Only 6 instructions
Vector Execution Time
Execution Time depends on three factors:
The length of the operand vectors
Structural hazards among the operations
The data dependences

Vector Arithmetic Execution
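Use a deep pipeline to execute element operations
A deep pipeline allows a fast clock
[Figure: pipelined functional unit reading V1 and V2 and writing V3, V3 <- V1 * V2]
Control of the deep pipeline is simple because the elements of a vector are independent, so there are no hazards within a vector instruction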
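Initiation rate: the rate at which a vector unit consumes new operands and produces new results
Time taken for a single vector instruction, with the initiation rate in elements per clock cycle:
$$T_{\text{instruction}} \approx \frac{\text{Vector length}}{\text{Initiation rate}}$$
[Figure: single lane vs. multiple lanes]
VMIPS (single lane): initiation rate = 1 element per clock cycle, so a vector instruction takes about as many cycles as the vector has elements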
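A minimal sketch of this estimate in C. It assumes the initiation rate is expressed in elements per clock cycle (so a higher rate shortens the instruction) and ignores start-up time. The function name and the rounding up for a final partial group are assumptions for illustration.

```c
#include <assert.h>

/* Approximate clock cycles for one vector instruction, ignoring start-up:
   ceil(vector_length / elements_per_cycle). */
unsigned vector_instruction_cycles(unsigned vector_length,
                                   unsigned elements_per_cycle)
{
    assert(elements_per_cycle > 0);
    return (vector_length + elements_per_cycle - 1) / elements_per_cycle;
}
```

For a 64-element vector this gives 64 cycles at one element per cycle (the VMIPS case) and 16 cycles at four elements per cycle.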
Convoy
A set of vector instructions that could potentially execute together
Performance can be estimated by counting the number of convoys
The instructions in a convoy must not have structural hazards among them
Assume a convoy must complete execution before any other instructions can begin
MULVS.D V2, V1, F0
ADDVV.D V4, V2, V3
Separate convoys?
Sequences with RAW hazards can be in the same convoy via chaining
Chaining is the vector version of register bypassing
It allows a vector operation to start as soon as the individual elements of its vector source operand become available
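Chaining and Flexible Chaining
LV V1
MULV V3, V1, V2
ADDV V5, V3, V4
[Figure: the load unit fills V1 from memory; V1 is chained into the multiplier, which writes V3; V3 is chained into the adder, which writes V5]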
Vector Chaining Advantage
Without chaining, a dependent instruction must wait for the last element of the result to be written before it can start
[Timeline: Load, Mul and Add execute one after another]
With chaining, a dependent instruction can start as soon as the first result appears
[Timeline: Load, Mul and Add overlap in time]
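The two timelines can be compared with a simple cycle model. The sketch below is an illustration under stated assumptions, not a definitive timing model: each operation has a start-up latency, then delivers one element per cycle; the function names are hypothetical. The example figures use the Cray-1 start-up latencies quoted later in these slides (load 12, multiply 7, add 6) and assume a vector length of 64.

```c
#include <assert.h>

/* WITHOUT chaining: each dependent operation waits until the previous
   one has written all n elements, so every operation costs startup + n. */
unsigned unchained_cycles(const unsigned *startup, unsigned num_ops, unsigned n)
{
    assert(num_ops > 0);
    unsigned total = 0;
    for (unsigned i = 0; i < num_ops; i = i + 1)
        total += startup[i] + n;
    return total;
}

/* WITH chaining: a dependent operation starts as soon as the first
   element arrives, so the start-up latencies add up once and the
   n elements then stream through the whole chain. */
unsigned chained_cycles(const unsigned *startup, unsigned num_ops, unsigned n)
{
    assert(num_ops > 0);
    unsigned total = n;
    for (unsigned i = 0; i < num_ops; i = i + 1)
        total += startup[i];
    return total;
}
```

For Load, Mul, Add with start-ups {12, 7, 6} and n = 64, the model gives (12 + 64) + (7 + 64) + (6 + 64) = 217 cycles without chaining and 12 + 7 + 6 + 64 = 89 cycles with chaining.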
Chime
A timing metric for estimating the time taken by a convoy
One chime is the unit of time taken to execute one convoy
A vector sequence of m convoys executes in m chimes
For a vector length of n, this is approximately m × n clock cycles
The approximation ignores overheads such as start-up time
It is more accurate for long vectors

Example
LV V1, Rx ;load vector X
MULVS.D V2, V1, F0 ;vector‐scalar multiply
LV V3, Ry ;load vector Y
ADDVV.D V4, V2, V3 ;add two vectors
SV Ry, V4 ;store the sum
How many Convoys?

How many Chimes?
Total number of clock cycles?
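One way to work through these questions is to pack the instructions into convoys in program order. The sketch below assumes one load/store unit, one multiplier and one adder, and assumes chaining, so only structural hazards split a convoy. The enum, the function names and the vector length of 64 are assumptions for illustration.

```c
#include <assert.h>

/* Functional units assumed for this sketch (hypothetical names). */
enum vector_unit { UNIT_LOAD_STORE, UNIT_MULTIPLY, UNIT_ADD, UNIT_COUNT };

/* Greedily pack instructions into convoys in program order.
   A new convoy starts whenever the next instruction needs a functional
   unit already used in the current convoy (a structural hazard).
   RAW dependences do not split a convoy because chaining is assumed. */
unsigned count_convoys(const enum vector_unit *units, unsigned num_instructions)
{
    int busy[UNIT_COUNT] = {0};
    unsigned convoys = 0;
    for (unsigned i = 0; i < num_instructions; i = i + 1) {
        int unit = (int)units[i];
        assert(unit >= 0 && unit < UNIT_COUNT);
        if (convoys == 0 || busy[unit]) {
            convoys = convoys + 1;           /* open a new convoy */
            for (int u = 0; u < UNIT_COUNT; u = u + 1)
                busy[u] = 0;
        }
        busy[unit] = 1;
    }
    return convoys;
}

/* Chime approximation: m convoys on vectors of length n take about m * n cycles. */
unsigned chime_cycles(unsigned convoys, unsigned n)
{
    return convoys * n;
}
```

For the sequence LV, MULVS.D, LV, ADDVV.D, SV the units are load/store, multiply, load/store, add, load/store. The second LV and the SV each collide with an earlier use of the load/store unit, giving three convoys (LV + MULVS.D, LV + ADDVV.D, SV), hence three chimes and about 3 × 64 = 192 clock cycles for 64-element vectors.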
Challenges of Vector Instructions
Start-up time: the pipeline latency of the vector functional unit
Application and architecture must support long vectors so that the start-up time is amortized
Assume the same start-up latencies as the Cray-1:
Floating-point add => 6 clock cycles
Floating-point multiply => 7 clock cycles
Floating-point divide => 20 clock cycles
Vector load => 12 clock cycles
Multiple Lanes
The parallel nature of vector instructions allows an implementation to execute them using:
A deeply pipelined functional unit
An array of parallel functional units
A combination of parallel and pipelined functional units

VMIPS ISA property: arithmetic instructions allow element N of one vector register to take part in operations only with element N of other vector registers
[Figure: vector add C = A + B on a single lane (one adder, starting with C[0]) and on four lanes (four adders, producing C[0] to C[3] together)]
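Because element N only ever meets element N, the lanes never need to exchange data. The sketch below shows this in C, assuming an interleaved mapping in which element i is handled by lane i mod lanes; the mapping and the function names are assumptions for illustration, and the outer loop stands in for lanes that would run in parallel in hardware.

```c
#include <assert.h>
#include <stddef.h>

/* Lane that handles element i under the assumed interleaved mapping. */
size_t lane_of_element(size_t i, size_t lanes)
{
    assert(lanes > 0);
    return i % lanes;
}

/* Element-wise C = A + B organized by lane: lane k processes elements
   k, k + lanes, k + 2*lanes, ... and touches no other lane's elements. */
void lane_vector_add(size_t n, size_t lanes,
                     const double *a, const double *b, double *c)
{
    assert(lanes > 0);
    for (size_t lane = 0; lane < lanes; lane = lane + 1)   /* parallel in hardware */
        for (size_t i = lane; i < n; i = i + lanes)
            c[i] = a[i] + b[i];
}
```

With four lanes, lane 0 handles elements 0, 4, 8, ... and lane 1 handles elements 1, 5, 9, ..., so four results are produced per step.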
Vector Length Registers
What if the vector length is not known at compile time?
for (i=0; i<n; i=i+1)
    C[i] = A[i] + B[i];
The value of n may not be known until run time
Use the Vector Length Register (VLR)
For n <= Maximum Vector Length (MVL), set VLR to n
For n > MVL, use strip mining

Vector Length Registers
"Strip mining": break the loop into pieces that fit into the vector registers
Loop to be modified:
for (i=0; i<n; i=i+1)
    C[i] = A[i] + B[i];
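A strip-mined version of this loop might look like the C sketch below. It handles an odd-size first strip of n mod MVL elements, then full strips of MVL elements. The value MVL = 64, the helper vector_add_strip (standing in for one vector add with VLR set to vl) and the returned strip count are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MVL 64  /* maximum vector length; 64 is an assumed value */

/* Stand-in for one vector add executed with the VLR set to vl. */
static void vector_add_strip(size_t vl, const double *a, const double *b, double *c)
{
    assert(vl <= MVL);
    for (size_t i = 0; i < vl; i = i + 1)
        c[i] = a[i] + b[i];
}

/* Strip-mined C[i] = A[i] + B[i]. Returns the number of non-empty strips. */
size_t strip_mined_add(size_t n, const double *a, const double *b, double *c)
{
    size_t low = 0;
    size_t vl = n % MVL;              /* odd-size first strip */
    size_t strips = 0;
    for (size_t j = 0; j <= n / MVL; j = j + 1) {
        vector_add_strip(vl, a + low, b + low, c + low);
        if (vl > 0)
            strips = strips + 1;
        low = low + vl;               /* start of the next strip */
        vl = MVL;                     /* all later strips are full length */
    }
    return strips;
}
```

For n = 130 the strips have lengths 2, 64 and 64. Putting the short strip first keeps the inner operation at full length for every later pass.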
Thank You for Attending
