Lec 24

OPTIMIZATIONS OF VECTOR ARCHITECTURES

1 > Multiple Lanes
- Element N of vector register A is "hardwired" to element N of vector register B.
- Element-by-element operations!
- Each lane contains:
  - One portion of the vector register file
  - One execution pipeline from each vector functional unit
- Each VFU executes vector instructions at the rate of one element group per cycle using multiple pipelines, one per lane.

2 > Vector Length Registers
- In our discussion, maximum vector length = vlmax = 32 (assumption, default value).
- Mismatch of vector length in programs:
  - Less than vlmax?
  - Length of vector operations is unknown at compile time?
- [Code figure not recovered]
- Solution for the code above: the vector-length register (vl).
- What if the value of n is not known at compile time and may be greater than vlmax?
- Solution: a separate loop for strip mining (Chapter 3).
- RISC-V has a better solution:
  - The instruction setvl sets vl.
  - If n > vlmax, setvl sets vl = vlmax; in the last iteration of the loop, setvl sets vl = n (the number of elements still remaining).
- [Code figure: RISC-V loop using setvl, not recovered]
- Strip mining: separate loops for the green and yellow blocks in the figure. RISC-V: one loop only!
- Figure label: odd-sized piece (less than MVL).

3 > Vector Mask Control
- IF statements introduce control dependences in a loop.
- [Code figure: loop with an IF statement in its body; only the value 64 is legible]
- The loop above cannot be vectorized because the body executes conditionally.
- Solution: vector mask control.
- Predicate registers: Boolean vectors that hold the mask.
- Starting addresses of X and Y are in x5 and x6 respectively.
- Use a predicate register to "disable" elements: [code figure not recovered]

3 > Vector Mask Registers
- Using a vector-mask register does have an overhead, but vector-mask control may still be significantly faster than scalar mode.
- Vector processors rely on compilers to manipulate mask registers.
- GPUs get the same effect using hardware to manipulate internal mask registers that are invisible to GPU software.
- In both cases, the GFLOPS rate drops when masks are used.
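The vector-length and mask ideas above can be sketched together in plain C. This is only a scalar model under stated assumptions, not RISC-V code: the helper `setvl` is a hypothetical stand-in that returns min(remaining, VLMAX), the `mask` array stands in for a predicate register, and the loop body (`if (x[i] != 0) x[i] = x[i] - y[i]`) is assumed from the usual textbook form because the slide's code did not survive extraction.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define VLMAX 32  /* maximum vector length assumed in these notes */

/* Hypothetical model of setvl: vl = min(remaining, VLMAX). */
size_t setvl(size_t remaining) {
    return remaining < VLMAX ? remaining : VLMAX;
}

/* Assumed loop: if (x[i] != 0) x[i] = x[i] - y[i];
 * A single loop handles full strips and the odd-sized last piece. */
void masked_sub(double *x, const double *y, size_t n) {
    bool mask[VLMAX];  /* models a predicate (mask) register */
    size_t i = 0;

    while (i < n) {
        size_t vl = setvl(n - i);  /* VLMAX for full strips, remainder at the end */

        /* Vector compare: set the mask bit where x != 0. */
        for (size_t e = 0; e < vl; e++)
            mask[e] = (x[i + e] != 0.0);

        /* Masked subtract: every element slot is visited,
         * but only enabled elements are written back. */
        for (size_t e = 0; e < vl; e++)
            if (mask[e])
                x[i + e] = x[i + e] - y[i + e];

        i += vl;
    }
}
```

With n = 70 the loop runs three strips of 32, 32, and 6 elements. The second inner loop still iterates over disabled elements, which mirrors the point that execution time is spent on a slot whether or not its mask bit is set.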
4 > Memory Banks
- The memory system must be designed to support high bandwidth for the vector load/store unit (VLSU).
- For the RV64V VLSU, start-up time is 12 clock cycles.
- The initiation rate may not be 1 clock cycle because of memory bank stalls (limited bandwidth).
- Solution: spread accesses across multiple independent memory banks (this can maintain an initiation rate of 1 for the VLSU):
  - Control bank addresses independently
  - Ability to load or store non-sequential words
  - Support multiple vector processors sharing the same memory banks
- Example:
  - 32 processors, each generating 4 loads and 2 stores per cycle
  - Processor cycle time is 2.167 ns, SRAM cycle time is 15 ns
  - How many memory banks are needed?
- Solution:
  - Number of references per processor per cycle = 4 + 2 = 6
  - Maximum number of memory references each cycle by 32 processors = (4 + 2) x 32 = 192
  - Each SRAM bank is busy for 15 / 2.167 = 6.92, rounded up to 7 processor clock cycles
  - Minimum memory banks required = 192 x 7 = 1344 memory banks!
  - The Cray T932 has 1024 memory banks ...

5 > Stride
- The position in memory of adjacent elements in a vector may not be sequential.
- When an array is allocated memory, it is linearized and must be laid out in either row-major or column-major order.
- Linearization means that the elements of either a row or a column are not adjacent in memory.
- Solution: use a non-unit stride.
- A bank conflict (stall) occurs when the same bank is hit faster than the bank busy time.
- [Formula not recovered: a condition involving the bank busy time]

6 > Scatter-Gather
- Handling sparse matrices in vector architectures.
- In a sparse matrix, the elements of a vector are stored in some compacted form and then accessed indirectly.
- Consider: a sparse vector sum on the arrays A and C, using index vectors K and M to designate the nonzero elements of A and C.
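The sparse vector sum just described can be written out in C to show the indirect access pattern. This is a sketch under assumptions: the loop form `A[K[i]] = A[K[i]] + C[M[i]]` follows the usual textbook formulation (the slide's code was not recovered), the function names are hypothetical, and VLMAX = 32 is carried over from the earlier assumption. The second function splits the same work into explicit gather, add, and scatter phases over strips of up to VLMAX elements.

```c
#include <assert.h>
#include <stddef.h>

#define VLMAX 32  /* maximum vector length assumed in these notes */

/* Scalar reference (assumed form): A[K[i]] = A[K[i]] + C[M[i]] */
void sparse_sum_scalar(double *a, const double *c,
                       const size_t *k, const size_t *m, size_t n) {
    for (size_t i = 0; i < n; i++)
        a[k[i]] = a[k[i]] + c[m[i]];
}

/* Same computation as gather / dense add / scatter.
 * Matches the scalar loop only when K has no repeated index within a strip. */
void sparse_sum_gather_scatter(double *a, const double *c,
                               const size_t *k, const size_t *m, size_t n) {
    double va[VLMAX], vc[VLMAX];  /* stand-ins for vector registers */

    for (size_t i = 0; i < n; i += VLMAX) {
        size_t vl = (n - i < VLMAX) ? (n - i) : VLMAX;

        for (size_t e = 0; e < vl; e++) va[e] = a[k[i + e]];  /* gather A[K[]] */
        for (size_t e = 0; e < vl; e++) vc[e] = c[m[i + e]];  /* gather C[M[]] */
        for (size_t e = 0; e < vl; e++) va[e] += vc[e];       /* dense vector add */
        for (size_t e = 0; e < vl; e++) a[k[i + e]] = va[e];  /* scatter back to A */
    }
}
```

Once the operands are gathered into dense temporaries, the add itself is an ordinary unit-stride vector operation; only the loads and the final store need index-vector addressing.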
6 > Scatter-Gather (continued)
- Solution: gather-scatter operations!
- Hardware support is available in almost all modern processors.
  - Base address + offset given in an index vector
- Register assignment:
  - x5 -> base address of A[]
  - x6 -> base address of C[]
  - x7 -> index vector K[] for A[]
  - x28 -> index vector M[] for C[]

7 > Programming Vector Architectures
- Compilers can provide feedback to programmers.
- Programmers can provide hints to the compiler.
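One concrete form a programmer hint can take in C is the C99 `restrict` qualifier, which promises the compiler that two pointers do not overlap. The slide does not name a specific mechanism, so the choice of `restrict` and the function name `daxpy` are illustrative assumptions; whether a given compiler actually vectorizes the loop depends on the compiler and its options.

```c
#include <assert.h>
#include <stddef.h>

/* y = a * x + y.
 * 'restrict' tells the compiler that x and y do not alias, so loop
 * iterations are independent and the loop is a vectorization candidate. */
void daxpy(size_t n, double a, const double *restrict x, double *restrict y) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

Without the qualifier, the compiler must allow for the possibility that a store to `y[i]` changes a later `x[j]`, which can force it to emit a runtime overlap check or fall back to scalar code.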
