You are on page 1of 19

CKV

Advanced VLSI Architecture


MEL G624

Lecture 25: Data Level Parallelism


CKV
Vector Processors
L.D F0, a ; load scalar a
DADDIU R4, Rx, #512 ; last address to load
Loop: L.D F2, 0(Rx) ; load X[i]
MUL.D F2, F2, F0 ; a x X[i]
L.D F4, 0(Ry) ; load Y[i]
ADD.D F4, F2, F2 ; a x X[i] + Y[i]
S.D F4, 0(Ry) ; store into Y[i]
DADDIU Rx, Rx, #8 ; increment index to X
DADDIU Ry, Ry, #8 ; increment index to Y
SUBBU R20, R4, Rx ; compute bound
BNEZ R20, Loop ; check if done

Execute nearly 600 instructions


CKV
Vector Processors
L.D F0, a ; load scalar a
LV V1, Rx ; load vector X
MULVS.D V2, V1, F0 ; vector-scalar multiply
LV V3, Ry ; load vector Y
ADDVV V4, V2, V3 ; add
SV Ry, V4 ; store the result

Only 6 instructions
CKV
Vector Execution Time
Execution Time depends on three factors:

The length of the operand vectors

Structural hazards among the operations

The data dependences


CKV
Vector Arithmetic Execution
Use deep pipeline to execute element operations
Fast clock
V1 V2 V3

V3 <- V1 * V2

Simplifies control of deep pipeline because elements in
vector are independent  No Hazards
CKV
Vector Execution Time

Initiation Rate  the rate at which a vector unit


consumes new operands and produces new results

Time taken for single vector instruction

𝑉𝑒𝑐𝑡𝑜𝑟 𝑙𝑒𝑛𝑔𝑡h × 𝐼𝑛𝑖𝑡𝑖𝑎𝑡𝑖𝑜𝑛 𝑅𝑎𝑡𝑒

Single Lane Multiple Lanes


VMIPS
Initiation Rate =1
CKV
Vector Execution Time
Convoy
A set of vector instructions that could potentially execute
together.

Performance can be estimated by counting the number of


convoys

Convoys should not have structural hazard

Assume convoy of instructions must complete execution


before any other instructions can begin.
CKV
Vector Execution Time

MULVS.D V2, V1, V0

ADDVVV4, V2, V3

Separate convoys?
Sequences with RAW hazards can be in the same convey
via chaining.

Vector version of register bypassing

Allows a vector operation to start as soon as the individual 
elements of its vector source operand become available
CKV
Vector Execution Time
Chaining Flexible Chaining
LV V1
MULV V3, V1, V2 V1 V2 V3 V4 V5
ADDV V5, V3, V4

Chain Chain

Load Unit

Memory
CKV
Vector Chaining Advantage
Without chaining, must wait for last element of result to be
written before starting dependent instruction
Load
Mul
Add
Time
With chaining, can start dependent instruction as soon as
first result appears
Load
Mul
Add
CKV
Vector Execution Time
Timing metric to estimate the time for a convoy

Chime

Unit of time taken to execute one convoy

Vector sequence of m convoys executes in m chimes

For a vector length of n, its clock cycles

Approximation  ignores overheads

Better for long vectors


CKV
Example
LV V1, Rx ;load vector X
MULVS.D V2, V1, F0 ;vector‐scalar multiply
LV V3, Ry ;load vector Y
ADDVV.D V4, V2, V3 ;add two vectors
SV Ry, V4 ;store the sum

How many Convoys?


How many Chimes?
Total number of clock cycles?
CKV
Challenges of Vector Instructions
Start up time

Application and architecture must support long vectors.

Assume the same as Cray‐1

Floating‐point add => 6 clock cycles

Floating‐point multiply => 7 clock cycles

Floating‐point divide => 20 clock cycles

Vector load => 12 clock cycles
CKV
Multiple Lanes

Parallel nature of instructions allow an implementation to


execute these instructions

Using deeply pipelined functional unit

Array of parallel functional units

Combination of parallel and pipelined functional units


CKV

A[8] B[8] VMIPS ISA property  arithmetic instructions allow


element N of one vector register to take part in
A[7] B[7] operations with element N from other register
A[6] B[6]

A[5] B[5]
Lane
A[4] B[4]

A[3] B[3]

A[2] B[2]

A[1] B[1]

+ + + + +
C[0] C[0] C[1] C[2] C[3]
CKV
Multiple Lanes
CKV
Vector Length Registers

Vector length not known at compile time?

for (i=0; i<n; i=i+1)


C[i] = A[i] + B[i];

n value may not even be known at run time

Use Vector Length Register (VLR)

For n < Maximum Vector Length (MVL), set VLR to n

For n > MVL ? Strip Mining


CKV
Vector Length Registers

“Strip‐mining”

Break loops into pieces that fit into vector registers

Modify

for (i=0; i<n; i=i+1)


C[i] = A[i] + B[i];
CKV

Thank You for Attending

You might also like