[Table: science areas (Plasmas/Magnetosphere, Earthquakes/Seismology, Evolution, Engineering/System of Systems, Computer Science), number of teams, representative codes (H3D(M), VPIC, OSIRIS, Magtail/UPIC, AWP-ODC, HERCULES, PLSQR, SPECFEM3D, Eve, GRIPS, Revisit), and the algorithmic patterns each area uses: structured grids, unstructured grids, dense matrix, sparse matrix, N-body, Monte Carlo, FFT, PIC, significant I/O]
Objective
• To learn the key techniques for compacting data
in parallel sparse methods
– reduced main memory space
– better use of on-chip memory (cache, shared mem)
– reduced use of memory bandwidth (fewer bytes transferred to on-chip memory)
– retaining regularity
Sparse Matrix
• Solution of linear systems A*X + Y = 0
– A is the coefficient matrix; X and Y are vectors. Solve for X
• Solution 1: inversion of the coefficient matrix
– X = A⁻¹ * (−Y)
– Traditional inversion algorithms such as Gaussian
elimination can create too many “fill-in” elements and
explode the size of the matrix
• Solution 2: iterative conjugate gradient solvers
– A conjugate gradient solver guesses a solution X and evaluates A*X + Y
– If the result is not (close to) 0, it refines the guess and iterates again
Sparse Matrix-Vector Multiplication (SpMV)
[Figure: A × X + Y → Y]
Sparse Matrix Example
Row 0 3 0 1 0
Row 1 0 0 0 0
Row 2 0 2 4 1
Row 3 1 0 0 1
Compressed Sparse Row (CSR) Format
Row 0 3 0 1 0
Row 1 0 0 0 0
Row 2 0 2 4 1
Row 3 1 0 0 1
CSR Data Layout
row_ptr   { 0, 2, 2, 5, 7 }
data      { 3, 1, 2, 4, 1, 1, 1 }
col_index { 0, 2, 1, 2, 3, 0, 3 }
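In C, the CSR arrays for this example could be declared as follows (a sketch mirroring the layout above; row_ptr[r] .. row_ptr[r+1]-1 index the non-zeros of row r):

    // CSR arrays for the 4x4 example matrix.
    int   row_ptr[5]   = { 0, 2, 2, 5, 7 };
    float data[7]      = { 3, 1, 2, 4, 1, 1, 1 };
    int   col_index[7] = { 0, 2, 1, 2, 3, 0, 3 };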
A Serial SpMV/CSR Kernel
Y = A*X + Y
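A minimal sketch of the serial loop, assuming the CSR arrays (row_ptr, data, col_index) shown earlier; the function name is illustrative:

    // Serial SpMV/CSR: y = A*x + y, one row at a time.
    void spmv_csr_serial(int num_rows, const int *row_ptr, const float *data,
                         const int *col_index, const float *x, float *y) {
        for (int row = 0; row < num_rows; row++) {
            float dot = 0;
            for (int elem = row_ptr[row]; elem < row_ptr[row + 1]; elem++)
                dot += data[elem] * x[col_index[elem]];
            y[row] += dot;
        }
    }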
A Simple Parallel SpMV
Row 0 3 0 1 0  → Thread 0
Row 1 0 0 0 0  → Thread 1
Row 2 0 2 4 1  → Thread 2
Row 3 1 0 0 1  → Thread 3
A Parallel SpMV/CSR Kernel (CUDA)
row_ptr   { 0, 2, 2, 5, 7 }
data      { 3, 1, 2, 4, 1, 1, 1 }
col_index { 0, 2, 1, 2, 3, 0, 3 }
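A sketch of the CUDA kernel, assuming one thread per row as on the previous slide (kernel name and exact signature are illustrative):

    // Parallel SpMV/CSR: each thread computes one row of y = A*x + y.
    __global__ void SpMV_CSR(int num_rows, float *data, int *col_index,
                             int *row_ptr, float *x, float *y) {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < num_rows) {
            float dot = 0;
            for (int elem = row_ptr[row]; elem < row_ptr[row + 1]; elem++)
                dot += data[elem] * x[col_index[elem]];
            y[row] += dot;
        }
    }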
CSR Kernel: Noncoalesced Memory Accesses
• Adjacent threads access non-adjacent memory locations
– In iteration 0, thread 0 reads data[0], thread 2 reads data[2], and thread 3 reads data[5]; these locations are not adjacent in memory
row_ptr   { 0, 2, 2, 5, 7 }
data      { 3, 1, 2, 4, 1, 1, 1 }
col_index { 0, 2, 1, 2, 3, 0, 3 }
Padding and Transposing
• Problems with CSR:
– Noncoalesced memory accesses
– Control divergence (rows have different numbers of non-zeros, so threads execute different numbers of iterations)
Padding and Transposing the Example Matrix
• Pad all rows to the length of the longest row
• Transpose (column major) for coalescing
– Place all elements of column 0 in consecutive memory locations (they become row 0 of the new layout)
– Followed by all elements of column 1, then column 2, …
– This is called transposing to column-major order (a conversion sketch follows below)
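A host-side conversion sketch, assuming the CSR arrays introduced earlier; the function and parameter names are illustrative, and padded slots get value 0 (and column index 0) so they contribute nothing to the dot product:

    // Convert CSR to a column-major (transposed) ELL layout (sketch).
    // ell_data and ell_col must hold num_rows * max_row_len entries.
    void csr_to_ell(int num_rows, const int *row_ptr, const float *csr_data,
                    const int *csr_col, int max_row_len,
                    float *ell_data, int *ell_col) {
        for (int row = 0; row < num_rows; row++) {
            int len = row_ptr[row + 1] - row_ptr[row];   // non-zeros in this row
            for (int i = 0; i < max_row_len; i++) {
                int dst = i * num_rows + row;            // column-major position
                ell_data[dst] = (i < len) ? csr_data[row_ptr[row] + i] : 0.0f;
                ell_col[dst]  = (i < len) ? csr_col[row_ptr[row] + i]  : 0;
            }
        }
    }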
Regularizing SpMV with ELL(PACK) Format
CSR with padding (each row padded to the length of the longest row):
Row 0 3 1 *
Row 1 * * *
Row 2 2 4 1
Row 3 1 1 *

Transposed (column-major):
3 * 2 1
1 * 4 1
* * 1 *
How to determine data[] & col_index[]
• data[0] … data[3] contain 3, *, 2, 1 (the 0th element of every row); data[4] … data[7] contain the 1st elements, and so on (the full arrays are spelled out below)
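Spelled out for the example matrix (a sketch in C; the '*' padding shown in the figures is stored as 0 here, so padded entries contribute nothing):

    // Column-major ELL arrays for the 4x4 example (3 padded elements per row).
    float data[12]      = { 3, 0, 2, 1,   1, 0, 4, 1,   0, 0, 1, 0 };
    int   col_index[12] = { 0, 0, 1, 0,   2, 0, 2, 3,   0, 0, 3, 0 };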
Memory Coalescing with ELL
[Figure: with the transposed (column-major) layout, adjacent threads access adjacent elements of data[] in each iteration, so the accesses are coalesced]
A parallel SpMV/ELL kernel
• The SpMV/ELL kernel is simpler than the SpMV/CSR kernel
– All rows have the same length (with padding)
A parallel SpMV/ELL kernel (code)
__global__ void SpMV_ELL(int num_rows, float *data, int *col_index,
                         int num_elem, float *x, float *y) {
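  // Sketch of the kernel body, assuming the column-major (transposed) ELL
  // layout described above, with num_elem padded elements per row and
  // padded slots holding 0.
  int row = blockIdx.x * blockDim.x + threadIdx.x;
  if (row < num_rows) {
    float dot = 0;
    for (int i = 0; i < num_elem; i++) {
      // In iteration i, adjacent threads read adjacent memory locations,
      // so accesses to data[] and col_index[] are coalesced.
      dot += data[row + i * num_rows] * x[col_index[row + i * num_rows]];
    }
    y[row] += dot;
  }
}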
Is SpMV/ELL an all-around winner?
• If the sparse matrix has a few rows with a very large number of non-zero elements
– the ELL format ends up with a lot of padding (wasted space and wasted execution)
• Example
– 1000×1000 matrix, 1% non-zero elements; on average each row has 10 non-zero elements
– CSR: about 2% of the uncompressed matrix size (including overhead)
– Suppose one row has 200 non-zero elements
– ELL: about 40% of the uncompressed matrix size (including overhead)
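Roughly, counting one data value and one col_index entry per stored element: CSR stores about 10,000 non-zeros, i.e. about 20,000 entries versus 1,000,000 for the dense matrix, or ~2%. ELL must pad every row to 200 elements, so it stores 1000 × 200 = 200,000 slots in each of data[] and col_index[], about 400,000 entries, or ~40%.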
Reduced Padding
Coordinate (COO) format
Nonzero values   data[7]      { 3, 1, 2, 4, 1, 1, 1 }
Column indices   col_index[7] { 0, 2, 1, 2, 3, 0, 3 }
Row indices      row_index[7] { 0, 0, 2, 2, 2, 3, 3 }
(the first two elements come from row 0, the next three from row 2, the last two from row 3)
COO Allows Reordering Elements
Nonzero values   data[7]      { 3, 1, 2, 4, 1, 1, 1 }
Column indices   col_index[7] { 0, 2, 1, 2, 3, 0, 3 }
Row indices      row_index[7] { 0, 0, 2, 2, 2, 3, 3 }
Unlike ELL, the stored elements can be kept in any order, as long as data, col_index, and row_index are permuted together.
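For example, one valid reordering simply reverses the order of the stored elements; any permutation works as long as all three arrays are permuted identically:

Nonzero values   data[7]      { 1, 1, 1, 4, 2, 1, 3 }
Column indices   col_index[7] { 3, 0, 3, 2, 1, 2, 0 }
Row indices      row_index[7] { 3, 3, 2, 2, 2, 0, 0 }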
Hybrid format
• Row 2 has the largest number of non-zero elements
• Remove the last non-zero element of row 2 from the ELL format and move it to the COO format

Row 0 3 0 1 0
Row 1 0 0 0 0
Row 2 0 2 4 1
Row 3 1 0 0 1
Reduced Padding with Hybrid Format
ELL part (each thread/row now has only 2 padded elements):

          data   col_index
Thread 0  3 1    0 2
Thread 1  * *    * *
Thread 2  2 4    1 2
Thread 3  1 1    0 3

Transposed (column-major) ELL arrays, without the third padded column:
data      { 3, *, 2, 1, 1, *, 4, 1 }
col_index { 0, *, 1, 0, 2, *, 2, 3 }

COO part (the element removed from row 2):
data      { 1 }
col_index { 3 }
row_index { 2 }

The elements moved to the COO format may be processed in any order.
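A host-side usage sketch under these assumptions: SpMV_ELL is the kernel sketched earlier, the d_* pointers are device copies of the shortened ELL arrays and vectors, and coo_data, coo_col, coo_row, num_coo hold the leftover COO elements (all names are illustrative).

    // Hybrid SpMV sketch: ELL part on the GPU, the few COO leftovers on the host.
    SpMV_ELL<<<(num_rows + 255) / 256, 256>>>(num_rows, d_ell_data, d_ell_col,
                                              num_elem, d_x, d_y);
    cudaMemcpy(y, d_y, num_rows * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < num_coo; i++)
        y[coo_row[i]] += coo_data[i] * x[coo_col[i]];   // finish the COO elements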
A Sequential Loop Implements SpMV/COO
• We can process the elements in any order since they are independent of each other
• Each loop iteration processes one COO element, accumulating it into the corresponding row of y (see the sketch below)
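A minimal sketch of the sequential loop, assuming the COO arrays shown earlier and num_coo stored elements:

    // Sequential SpMV/COO: accumulate each stored element into its row of y.
    for (int i = 0; i < num_coo; i++)
        y[row_index[i]] += data[i] * x[col_index[i]];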