You are on page 1of 12

Chapter 2

Data Level Parallelism


Book 1 – Computer Architecture: A Quantitative Approach, Henessy and Patterson,
5th Edition, Morgan Kaufmann, 2012
Chapter 4 - Data-Level Parallelism in Vector, SIMD, and GPU Architectures
Vector Architecture – some important questions
1.How does a vector processor execute on a single vector faster
than one element per clock cycle?
2.How does a vector processor handle programs where vector
lengths are not the same as the length of the vector registers?
3.What happens when there is an IF conditional statement inside
the code to be vectorized?
4.What does a vector processor need from the memory system?
5.How does a vector processor handle multidimensional matrices?
6.How does a vector processor handle sparse matrices?
1.How does a vector processor execute on a single vector
faster than one element per clock cycle?
Solution – Multiple Lanes
Every scalar operation within a vector operation is independent of the rest
• Allows for multiple hardware lanes
Every lane has
• A complete replica of all execution units
• A portion of the registers (e.g. odd, even for two lanes)
• A path to get data from memory
Multiple
Lanes
With 2 lanes the cost of a
vector operation goes from
64+overhead to
32+overhead
With 4 lanes goes to
16+overhead
Q2 How does a vector processor handle programs where vector
lengths are not the same as the length of the vector registers?
Solution – Vector Length Register
• It is easy to write programs that work with vectors that are multiple of 64
• What to do when vector length is not exactly 64 or vector length is not
known at compile time?

• We need a Vector Length Register (VLR)


➢ Very important for read write
➢ The value in VLR cannot be greater than the length of the vector registers
• Every operation is done on the first VLR elements only
➢ We ignore the rest
Strip Mining
• What if n > Max. Vector Length (MVL)?
• Strip mining is the generation of code such that each vector
operation is done for a size less than or equal to MVL
• One loop is created that handles any number of iterations that is
multiple of the MVL and another iteration that handles any
remaining iterations (less than MVL)
• In practice, compilers usually create a single strip-mined loop that
is parametrized to handle both portions by changing the length
Strip Mined version of DAXPY
Q3 What happens when there is an IF conditional statement inside
the code to be vectorized?
Solution – Vector Mask Register (VMR)

• Allows the use of simple if statement


• The vector-mask control uses a Boolean vector to control the
execution of a vector instruction
• Requires clock even for the element where the mask is 0 – some
vector processors use vector mask only to disable the storing of
the result and the operation still occurs
Conditional Execution

Use vector mask register to ‘disable’ elements (if conversion):


Vector Architecture – some important questions
1.How does a vector processor execute on a single vector faster
than one element per clock cycle?
2.How does a vector processor handle programs where vector
lengths are not the same as the length of the vector registers?
3.What happens when there is an IF conditional statement inside
the code to be vectorized?
4.What does a vector processor need from the memory system?
5.How does a vector processor handle multidimensional matrices?
6.How does a vector processor handle sparse matrices?
Q4 What does a vector processor need from the memory system?
Solution – Multiple Memory Banks
Example : CRAY T932
• 32 processors, each generating 4 loads and 2 stores/cycle
• Processor cycle time is 2.167 ns, SRAM cycle time is 15 ns
• How many memory banks are needed?
o 32*6=192 accesses
o 15/2.167 ≈ 7 processor cycles
o 1344!
o (Cray T932 has 1024 memory banks)
• Without sufficient memory bandwidth, vector execution can be futile
• Spread accesses across multiple banks and allow multiple independent
accesses

You might also like