Book 1 – Computer Architecture: A Quantitative Approach, Henessy and Patterson, 5th Edition, Morgan Kaufmann, 2012 Chapter 4 - Data-Level Parallelism in Vector, SIMD, and GPU Architectures Vector Architecture – some important questions 1.How does a vector processor execute on a single vector faster than one element per clock cycle? 2.How does a vector processor handle programs where vector lengths are not the same as the length of the vector registers? 3.What happens when there is an IF conditional statement inside the code to be vectorized? 4.What does a vector processor need from the memory system? 5.How does a vector processor handle multidimensional matrices? 6.How does a vector processor handle sparse matrices? 1.How does a vector processor execute on a single vector faster than one element per clock cycle? Solution – Multiple Lanes Every scalar operation within a vector operation is independent of the rest • Allows for multiple hardware lanes Every lane has • A complete replica of all execution units • A portion of the registers (e.g. odd, even for two lanes) • A path to get data from memory Multiple Lanes With 2 lanes the cost of a vector operation goes from 64+overhead to 32+overhead With 4 lanes goes to 16+overhead Q2 How does a vector processor handle programs where vector lengths are not the same as the length of the vector registers? Solution – Vector Length Register • It is easy to write programs that work with vectors that are multiple of 64 • What to do when vector length is not exactly 64 or vector length is not known at compile time?
• We need a Vector Length Register (VLR)
➢ Very important for read write ➢ The value in VLR cannot be greater than the length of the vector registers • Every operation is done on the first VLR elements only ➢ We ignore the rest Strip Mining • What if n > Max. Vector Length (MVL)? • Strip mining is the generation of code such that each vector operation is done for a size less than or equal to MVL • One loop is created that handles any number of iterations that is multiple of the MVL and another iteration that handles any remaining iterations (less than MVL) • In practice, compilers usually create a single strip-mined loop that is parametrized to handle both portions by changing the length Strip Mined version of DAXPY Q3 What happens when there is an IF conditional statement inside the code to be vectorized? Solution – Vector Mask Register (VMR)
• Allows the use of simple if statement
• The vector-mask control uses a Boolean vector to control the execution of a vector instruction • Requires clock even for the element where the mask is 0 – some vector processors use vector mask only to disable the storing of the result and the operation still occurs Conditional Execution
Use vector mask register to ‘disable’ elements (if conversion):
Vector Architecture – some important questions 1.How does a vector processor execute on a single vector faster than one element per clock cycle? 2.How does a vector processor handle programs where vector lengths are not the same as the length of the vector registers? 3.What happens when there is an IF conditional statement inside the code to be vectorized? 4.What does a vector processor need from the memory system? 5.How does a vector processor handle multidimensional matrices? 6.How does a vector processor handle sparse matrices? Q4 What does a vector processor need from the memory system? Solution – Multiple Memory Banks Example : CRAY T932 • 32 processors, each generating 4 loads and 2 stores/cycle • Processor cycle time is 2.167 ns, SRAM cycle time is 15 ns • How many memory banks are needed? o 32*6=192 accesses o 15/2.167 ≈ 7 processor cycles o 1344! o (Cray T932 has 1024 memory banks) • Without sufficient memory bandwidth, vector execution can be futile • Spread accesses across multiple banks and allow multiple independent accesses