
Chapter 2

Data-Level Parallelism


Book 1 – Computer Architecture: A Quantitative Approach, Hennessy and Patterson,
5th Edition, Morgan Kaufmann, 2012
Chapter 4 - Data-Level Parallelism in Vector, SIMD, and GPU Architectures
Q5 How does a vector processor handle multi-dimensional matrices?
Solution – Loads and stores with stride
• Each memory bank delivers data fastest when it is accessed in sequential order
• With multiple interleaved banks we can work around this
• We use strides:
• Fetch one element every stride-th element (LVWS, load vector with stride; see the C sketch below)
• Strides may introduce bank conflicts
• e.g. with 64 memory banks and a stride of 128, every access falls on the same bank

LVWS V1, (R1,R2)    ; load:  V1[i] <- Mem[R1 + i * R2]
SVWS (R1,R2), V1    ; store: Mem[R1 + i * R2] <- V1[i]
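As a minimal C sketch of where strided access arises (load_column is an illustrative helper, not from the text): reading column j of a row-major n × n matrix touches one element every n elements, exactly the pattern LVWS loads into a vector register.

    /* Column access of a row-major matrix is a strided load.
       Base address plays the role of R1 = &A[j], stride of R2 = n elements. */
    void load_column(const double *A, double *col, int n, int j) {
        for (int i = 0; i < n; i++)
            col[i] = A[i * n + j];   /* address = base + i * stride */
    }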
We interleave our elements amongst the multiple banks, e.g. with 4 banks:

        BANK0   BANK1   BANK2   BANK3
          0       1       2       3
          4       5       6       7
          8       9      10      11
         12      13      14      15
         16      17      18      19
         20      21      22      23

If stride = 1: no stalls as long as  #banks > bank busy time
If stride ≠ 1: a bank conflict occurs if
        #banks / GCD(#banks, stride) < bank busy time
Example
Suppose we have 8 memory banks with a bank busy time of 6 clocks and a total
memory latency of 12 cycles. How long will it take to complete a 64-element vector
load with a stride of 1? With a stride of 32?
Answer
stride = 1
Since the number of banks is larger than the bank busy time, for a stride of 1 the load
will take 12 + 64 = 76 clock cycles, or 1.2 clock cycles per element.

stride = 32
The worst possible stride is a value that is a multiple of the number of memory banks,
as in this case with a stride of 32 and 8 memory banks.
Every access to memory (after the first one) will collide with the previous access.
Every access will therefore have to wait out the 6-clock-cycle bank busy time.
The total time will be 12 + 1 + 6 * 63 = 391 clock cycles, or 6.1 clock cycles per
element.
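The same arithmetic as a small C sketch (vector_load_cycles is an illustrative helper; the second formula assumes the fully serialized case, as with stride 32 here):

    /* Timing model from the example: conflict-free when enough banks
       are in play, fully serialized otherwise. */
    static int gcd(int a, int b) { return b ? gcd(b, a % b) : a; }

    int vector_load_cycles(int banks, int busy, int latency, int n, int stride) {
        int banks_used = banks / gcd(banks, stride);
        if (banks_used >= busy)               /* no bank conflicts */
            return latency + n;               /* 12 + 64    = 76  (stride 1)  */
        return latency + 1 + busy * (n - 1);  /* 12 + 1+6*63 = 391 (stride 32) */
    }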
Q6 How does a vector processor handle sparse matrices?
Solution – Gather and Scatter Instructions
• A sparse matrix is a matrix whose elements are mostly zero
• Huge sparse matrices often appear in scientific and engineering applications, e.g. when solving
partial differential equations
• Consider the following loop (the sparse vector sum from Hennessy & Patterson, reconstructed here to match the description below):
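    for (i = 0; i < n; i = i + 1)
        A[K[i]] = A[K[i]] + C[M[i]];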

• This code implements a sparse vector sum on the arrays A and C, using index vectors K and M to
designate the non-zero elements of A and C
• A and C must have the same number of nonzero elements, so K and M are the same size
• Gather (LVI) and scatter (SVI) instructions are used for handling sparse matrices
• Result of a gather operation is a non-sparse vector in a vector register
LVI V1, (R1+V2)     ; gather:  V1[i] <- Mem[R1 + V2[i]]
SVI (R1+V2), V1     ; scatter: Mem[R1 + V2[i]] <- V1[i]
index:  0   1   2   3   4   5   6   7   8   9
A:      2   0   0   0   5   0   8  12   0   0        V2 = K = {0, 4, 6, 7}
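A minimal C sketch of these semantics (gather and scatter are illustrative helper names, not a real ISA API; base plays the role of R1, idx of V2, v of V1):

    void gather(double *v, const double *base, const int *idx, int n) {
        for (int i = 0; i < n; i++)
            v[i] = base[idx[i]];      /* LVI: V1[i] = Mem[R1 + V2[i]] */
    }
    void scatter(const double *v, double *base, const int *idx, int n) {
        for (int i = 0; i < n; i++)
            base[idx[i]] = v[i];      /* SVI: Mem[R1 + V2[i]] = V1[i] */
    }

Gathering from A above with idx = K = {0, 4, 6, 7} yields the dense vector {2, 5, 8, 12}.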
Graphics Processing Units (GPUs)
• GPUs produce high-impact visuals and tackle data-parallel problems
• They have a unified graphics and computing architecture that serves as both a
programmable graphics processor and a scalable parallel computing platform
• This text focuses on GPUs for computing
• GPUs provide hundreds of parallel floating-point units, paired with a programming
language that makes them easier to program
• CUDA and OpenCL are the two main programming paradigms
• CUDA (Compute Unified Device Architecture) - a C-like programming language NVIDIA
developed to improve the productivity of GPU programming
• CUDA programming model – SIMT (Single Instruction, Multiple Threads)
• Compared to a plain C loop, each loop iteration becomes an independent thread (see the
CUDA sketch below)
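A minimal CUDA sketch of SIMT using the classic DAXPY loop, in the style of the H&P example (the 256-thread block size is an illustrative choice):

    // Each thread performs one iteration of y[i] = a*x[i] + y[i]
    __global__ void daxpy(int n, double a, const double *x, double *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                    // guard: the grid may overshoot n
            y[i] = a * x[i] + y[i];
    }

    // Host launch: one thread per loop iteration, 256 threads per block
    // daxpy<<<(n + 255) / 256, 256>>>(n, 2.0, d_x, d_y);

The serial C loop disappears entirely; the grid of threads covers the whole iteration space at once.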
NVIDIA Fermi Architecture

[Figure: Fermi block diagram, labeling a streaming multiprocessor (SM) and a CUDA core/SIMD lane]
NVIDIA Fermi Architecture
• Basics of the Fermi architecture:
• Host processor/CPU
• GPU device
• Streaming multiprocessor (SM)
• CUDA core/SIMD lane
• SFUs (special function units)
• Programming model of GPUs
