OPTIMIZATIONS OF VECTOR
ARCHITECTURES
1 > Multiple Lanes
Element N of vector register A is “hardwired” to element N of vector register B
Element-by-element operations!
= Each lane contains
+ One portion of the vector register file
+ One execution pipeline from each vector functional unit
= Each VFU executes vector instructions at the rate of one element group per cycle using multiple pipelines, one per lane
2 > Vector Length Registers
+ In our discussion, maximum vector length = vlmax = 32 (assumption, default value)
Mismatch of vector length in programs > Less than vlmax?
Length of vector operations is unknown at compile time?
In above code:
Solution:
+ Vector Length Register (vl)
2 > Vector Length Registers
= What if the value of n is not known at compile time and
may be greater than vlmax?
= Solution: Separate loop for strip mining! (Chapter 3)
= RISC-V has a better solution:
= Instruction setvl sets vl
+ If n > vlmax, then setvl sets vl = vlmax
+ setvl sets vl = n in the last iteration of the loop
2 > Vector Length Registers
Strip Mining: Separate loops for green and yellow blocks!
RISC-V: One loop only!
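The contrast between the two approaches can be sketched in Python, with plain lists standing in for vector registers. This is an illustrative model only: the DAXPY body is assumed as the example loop, and `VLMAX`, `setvl`, `daxpy_strip_mined`, and `daxpy_setvl` are hypothetical names, not real RISC-V intrinsics.

```python
VLMAX = 32  # maximum vector length assumed in these slides


def setvl(n):
    """Model of setvl: vl is the smaller of VLMAX and the remaining count n."""
    return min(VLMAX, n)


def daxpy_strip_mined(a, x, y):
    """Strip mining: one loop for the odd-sized piece, one for full pieces."""
    n = len(x)
    out = list(y)
    odd = n % VLMAX
    # Loop 1: odd-sized piece (shorter than VLMAX, possibly empty).
    for i in range(odd):
        out[i] = a * x[i] + y[i]
    # Loop 2: the remaining pieces, each exactly VLMAX elements long.
    for low in range(odd, n, VLMAX):
        for i in range(low, low + VLMAX):
            out[i] = a * x[i] + y[i]
    return out


def daxpy_setvl(a, x, y):
    """setvl style: a single loop; vl shrinks by itself on the last pass."""
    n = len(x)
    out = list(y)
    low = 0
    while n > 0:
        vl = setvl(n)                   # vl = VLMAX, or n on the last pass
        for i in range(low, low + vl):  # stands in for one vector operation
            out[i] = a * x[i] + y[i]
        low += vl
        n -= vl                         # elements still to be processed
    return out
```

Both functions return the same result; the second needs no separate loop for the odd-sized piece because `setvl` picks the shorter length on the final pass.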
Odd-sized piece (less than MVL)
3 > Vector Mask Control
IF statements introduce control dependences in a loop.
[Code figure not recovered: a 64-iteration loop with an IF statement in its body]
The above loop cannot be vectorized directly due to the conditional execution of the body.
Solution:
Vector Mask Control
Predicate Registers: Boolean vectors that hold the mask
= Starting addresses of X and Y are in x5 and x6, respectively.
= Use predicate register to “disable” elements
3 > Vector Mask Registers
= Using a vector-mask register does have an overhead.
But using vector-mask control may still be significantly faster than using scalar mode.
Vector processors rely on compilers to manipulate mask registers.
GPUs get the same effect using hardware to manipulate internal mask registers that are invisible to GPU software.
In both versions GFLOPS rate drops when masks are
used.
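A minimal Python sketch of mask control follows. It assumes, purely for illustration, that the conditional loop is "if X[i] != 0 then X[i] = X[i] - Y[i]"; `masked_subtract` is a hypothetical name, and lists stand in for vector and predicate registers.

```python
def masked_subtract(x, y):
    """Compute x[i] - y[i] only where x[i] != 0; other elements are unchanged."""
    # Vector compare: builds the Boolean mask (the predicate register).
    mask = [xi != 0 for xi in x]
    # The subtract is evaluated for every element position in this model,
    # so disabled elements still cost work (the mask overhead noted above).
    diff = [xi - yi for xi, yi in zip(x, y)]
    # Only elements whose mask bit is set receive the new value.
    return [d if m else xi for xi, d, m in zip(x, diff, mask)]


print(masked_subtract([0, 5, 0, 3], [1, 2, 3, 4]))  # [0, 3, 0, -1]
```

The loop body has become straight-line vector code: the IF is replaced by a compare that sets the mask, which is what lets the loop vectorize.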
4 > Memory Banks
The memory system must be designed to support high bandwidth for the vector load/store unit (VLSU)
For RV64V VLSU, start-up time is 12 clock cycles.
Initiation rate may not necessarily be 1 clock cycle due to
memory bank stalls (limited bandwidth)
Solution:
= Spreading accesses across multiple independent memory banks (can maintain initiation rate = 1 for VLSU)
+ Control bank addresses independently
+ Ability to load or store nonsequential words
+ Support multiple vector processors sharing the same memory banks
4 > Memory Banks
= Example:
+ 32 processors, each generating 4 loads and 2 stores per cycle
+ Processor cycle time is 2.167 ns, SRAM cycle time is 15 ns
+ How many memory banks are needed?
= Solution:
+ Number of references per processor per cycle = 4 + 2 = 6
+ Maximum number of memory references each cycle by 32 processors = (4 + 2) x 32 = 192
+ Each SRAM bank is busy for 15/2.167 = 6.92 ≈ 7 processor clock cycles
+ Minimum memory banks required = 192 x 7 = 1344 memory banks!
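The same arithmetic as a short Python sketch. The function name `min_memory_banks` and its parameter names are hypothetical, chosen only for this illustration.

```python
import math


def min_memory_banks(processors, loads, stores, cpu_cycle_ns, sram_cycle_ns):
    """Banks needed so every reference in a cycle can find a free bank."""
    # Peak memory references issued per processor clock cycle.
    refs_per_cycle = processors * (loads + stores)
    # A bank stays busy for a whole number of processor cycles (round up).
    busy_cycles = math.ceil(sram_cycle_ns / cpu_cycle_ns)
    return refs_per_cycle * busy_cycles


print(min_memory_banks(32, 4, 2, 2.167, 15))  # 1344
```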
= The Cray T932 has 1024 memory banks ...
5 > Stride
Position in memory of adjacent elements in a vector may not be sequential
When an array is allocated memory, it is linearized and must be laid out in either row-major or column-major order
+ Linearization means that either the elements in a row or the elements in a column are not adjacent in memory
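To make the linearization concrete, the sketch below computes the linear position of element (i, j) under both layouts. The function names are hypothetical, and positions are counted in elements, not bytes.

```python
def row_major_index(i, j, ncols):
    """Linear position of element (i, j) when rows are stored contiguously."""
    return i * ncols + j


def col_major_index(i, j, nrows):
    """Linear position of element (i, j) when columns are stored contiguously."""
    return j * nrows + i


# In a row-major 100 x 100 array, walking along a row moves 1 element at a
# time, but walking down a column jumps a full row (100 elements) each step.
ncols = 100
step_along_row = row_major_index(0, 1, ncols) - row_major_index(0, 0, ncols)
step_down_col = row_major_index(1, 0, ncols) - row_major_index(0, 0, ncols)
print(step_along_row, step_down_col)  # 1 100
```

The second number is the stride a vector load would need to fetch one column of a row-major array.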
5 > Stride
Solution:
= Use non-unit stride
Bank conflict (stall) occurs when the same bank is hit faster than the bank busy time:
= Number of banks / GCD(stride, number of banks) < Bank busy time
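A small sketch of this check, assuming one memory access is issued per clock cycle and that consecutive element addresses map to consecutive banks. With stride s and B banks, an access stream returns to the same bank every B / gcd(s, B) accesses. The function names are hypothetical.

```python
import math


def cycles_between_same_bank(stride, num_banks):
    """Accesses (one per cycle) issued before the stream revisits a bank."""
    return num_banks // math.gcd(stride, num_banks)


def has_bank_conflict(stride, num_banks, bank_busy_time):
    """True if a bank is revisited while it is still busy."""
    return cycles_between_same_bank(stride, num_banks) < bank_busy_time


# 8 banks, bank busy time of 6 cycles:
print(has_bank_conflict(1, 8, 6))   # False: each bank is revisited every 8 cycles
print(has_bank_conflict(32, 8, 6))  # True: every access hits the same bank
```

Strides that share a large common factor with the number of banks are the problematic ones; an odd stride with a power-of-two bank count spreads accesses over all banks.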
6 > Scatter-Gather
= Handling sparse matrices in vector architectures
= In a sparse matrix, elements of a vector are stored in some compacted form and then accessed indirectly
= Consider:
+ Sparse vector sum on the arrays A and C, using index vectors K and M to designate the nonzero elements of A and C.
6 > Scatter-Gather
= Solution:
+ Gather-scatter operations!
+ Hardware support is available in almost all modern processors
+ Base address + offset given in index vector
6 > Scatter-Gather
x5 > Base address of A[]
x6 > Base address of C[]
x7 > Index vector K[] for A[]
x28 > Index vector M[] for C[]
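The gather, add, scatter sequence can be sketched in Python as below. It assumes the sparse sum has the form A[K[i]] = A[K[i]] + C[M[i]]; `gather`, `scatter`, and `sparse_vector_sum` are hypothetical helper names, and lists stand in for memory and vector registers.

```python
def gather(base, index):
    """Indexed load: fetch base[k] for each offset k in the index vector."""
    return [base[k] for k in index]


def scatter(base, index, values):
    """Indexed store: write each value back to base[k]."""
    for k, v in zip(index, values):
        base[k] = v


def sparse_vector_sum(A, C, K, M):
    """A[K[i]] = A[K[i]] + C[M[i]] via gather, dense add, scatter."""
    va = gather(A, K)                        # nonzero elements of A, packed
    vc = gather(C, M)                        # nonzero elements of C, packed
    total = [a + c for a, c in zip(va, vc)]  # ordinary dense vector add
    scatter(A, K, total)                     # expand back into sparse A
    return A


A = [0, 10, 0, 20, 0]
C = [1, 0, 2, 0, 0]
print(sparse_vector_sum(A, C, K=[1, 3], M=[0, 2]))  # [0, 11, 0, 22, 0]
```

Between the gather and the scatter the data is dense, so the add itself runs as a normal vector operation.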
7 > Programming Vector Architectures
= Compilers can provide feedback to programmers
= Programmers can provide hints to the compiler