
Multithreaded Architectures
(Applied Parallel Programming)

Chapter 10: Sparse Matrix Computation

©Wen-mei W. Hwu and David Kirk/NVIDIA, 2010-2016
Sparse matrix computation appears across science areas. (In the original table each area is also marked against the methods it uses: Structured Grids, Unstructured Grids, Dense Matrix, Sparse Matrix, N-Body, Monte Carlo, FFT, PIC, Signal I/O; the per-column marks do not survive here.)

| Science Area                       | Teams | Codes                                            |
|------------------------------------|-------|--------------------------------------------------|
| Climate and Weather                | 3     | CESM, GCRM, CM1/WRF, HOMME                       |
| Plasmas/Magnetosphere              | 2     | H3D(M), VPIC, OSIRIS, Magtail/UPIC               |
| Stellar Atmospheres and Supernovae | 5     | PPM, MAESTRO, CASTRO, SEDONA, ChaNGa, MS-FLUKSS  |
| Cosmology                          | 2     | Enzo, pGADGET                                    |
| Combustion/Turbulence              | 2     | PSDNS, DISTUF                                    |
| General Relativity                 | 2     | Cactus, Harm3D, LazEV                            |
| Molecular Dynamics                 | 4     | AMBER, Gromacs, NAMD, LAMMPS                     |
| Quantum Chemistry                  | 2     | SIAL, GAMESS, NWChem                             |
| Material Science                   | 3     | NEMOS, OMEN, GW, QMCPACK                         |
| Earthquakes/Seismology             | 2     | AWP-ODC, HERCULES, PLSQR, SPECFEM3D              |
| Quantum Chromo Dynamics            | 1     | Chroma, MILC, USQCD                              |
| Social Networks                    | 1     | EPISIMDEMICS                                     |
| Evolution                          | 1     | Eve                                              |
| Engineering/System of Systems      | 1     | GRIPS, Revisit                                   |
| Computer Science                   | 1     |                                                  |
Objective
• To learn the key techniques for compacting data in parallel sparse methods
  – reduced main memory space
  – better use of on-chip memory (cache, shared memory)
  – reduced use of memory bandwidth: fewer bytes transferred to on-chip memory
  – retained regularity
Sparse Matrix
• Solution of linear systems A*X + Y = 0
  – A: coefficient matrix; X and Y are vectors. Solve for X.
• Solution 1: inversion of the coefficient matrix
  – X = A⁻¹ * (−Y)
  – Traditional inversion algorithms such as Gaussian elimination can create too many "fill-in" elements and explode the size of the matrix
• Solution 2: iterative Conjugate Gradient solvers
  – Guess a solution X and evaluate A*X + Y
  – If the result is not (close to) 0, refine the guess and iterate
Sparse Matrix-Vector Multiplication (SpMV)

[Figure: Y = A × X + Y, a sparse matrix A times a dense vector X, accumulated into Y]


Challenges
• Compared to dense matrix multiplication, SpMV (sparse matrix-vector multiplication)
  – is irregular/unstructured
  – has little input data reuse
  – benefits little from compiler transformation tools
• Keys to maximal performance
  – Maximize regularity (by reducing divergence and load imbalance)
  – Maximize DRAM burst utilization (layout arrangement)
Sparse Matrix Example

Row 0: 3 0 1 0
Row 1: 0 0 0 0
Row 2: 0 2 4 1
Row 3: 1 0 0 1
Compressed Sparse Row (CSR) Format

Nonzero values   data[7]      { 3, 1, 2, 4, 1, 1, 1 }
Column indices   col_index[7] { 0, 2, 1, 2, 3, 0, 3 }
Row pointers     row_ptr[5]   { 0, 2, 2, 5, 7 }

(data holds Row 0's nonzeros, then Row 2's, then Row 3's; row_ptr[row] marks where each row starts, with one extra entry marking the end of the last row.)

Row 0: 3 0 1 0
Row 1: 0 0 0 0
Row 2: 0 2 4 1
Row 3: 1 0 0 1
CSR Data Layout

row_ptr 0 2 2 5 7

data 3 1 2 4 1 1 1

col_index 0 2 1 2 3 0 3

A Serial SpMV/CSR Kernel

Y = A * X + Y

Nonzero values   data[7]      { 3, 1, 2, 4, 1, 1, 1 }
Column indices   col_index[7] { 0, 2, 1, 2, 3, 0, 3 }
Row pointers     row_ptr[5]   { 0, 2, 2, 5, 7 }

A Simple Parallel SpMV

Row 0 3 0 1 0 Thread 0
Row 1 0 0 0 0 Thread 1
Row 2 0 2 4 1 Thread 2
Row 3 1 0 0 1 Thread 3

• Each thread processes one row

A Parallel SpMV/CSR Kernel (CUDA)

__global__ void SpMV_CSR(int num_rows, float *data, int *col_index,
                         int *row_ptr, float *x, float *y) {
  int row = blockIdx.x * blockDim.x + threadIdx.x;
  if (row < num_rows) {
    float dot = 0;
    int row_start = row_ptr[row];
    int row_end   = row_ptr[row + 1];
    for (int elem = row_start; elem < row_end; elem++) {
      dot += data[elem] * x[col_index[elem]];
    }
    y[row] += dot;  // still inside the 'if'
  }
}
CSR Kernel Control Divergence
• Number of for-loop iterations = compressed row length
• Threads execute different numbers of iterations, depending on their row's length

row_ptr   0 2 2 5 7
data      3 1 2 4 1 1 1
col_index 0 2 1 2 3 0 3
CSR Kernel Noncoalesced Memory Accesses
• Adjacent threads access non-adjacent memory locations
  – In iteration 0, Threads 0, 2, and 3 read data[0], data[2], and data[5], which are scattered through memory

row_ptr   0 2 2 5 7
data      3 1 2 4 1 1 1
col_index 0 2 1 2 3 0 3
Padding and Transposing
• Problems in CSR:
  – Noncoalesced memory accesses
  – Control divergence
• Solution: the ELL format
  – Named after ELLPACK, a package that included sparse matrix support
Padding and Transposing the Example Matrix
• Pad all rows to the length of the longest row
• Transpose (column major) for coalescing
  – Place the 0th element of every row in consecutive memory locations
  – Followed by the 1st elements of every row, then the 2nd elements, …
• Both data and col_index are padded and transposed
Regularizing SpMV with ELL(PACK) Format

CSR with padding:        Transposed:
Row 0: 3 1 *             3 * 2 1
Row 1: * * *             1 * 4 1
Row 2: 2 4 1             * * 1 *
Row 3: 1 1 *
How to Determine data[] and col_index[]
• data[0] … data[3] contain 3, *, 2, 1 (the 0th element of every row)
• col_index[0] … col_index[3] contain 0, *, 1, 0 (the column position, in the original matrix, of the 0th element of every row)
• Row pointers are no longer needed: all rows have the same length, so the row number is the index

Memory Coalescing with ELL

[Figure: the transposed data/col_index arrays; in each iteration, consecutive threads read consecutive elements of one column, so the accesses coalesce]
A Parallel SpMV/ELL Kernel
• The SpMV/ELL kernel is simpler than the SpMV/CSR kernel
  – All rows are the same length (with padding)
• The "for" loop iterates num_elem times, which is the same for all threads
  – No control divergence: all threads iterate the same number of times
• Memory accesses are coalesced
A Parallel SpMV/ELL Kernel (code)

__global__ void SpMV_ELL(int num_rows, float *data, int *col_index,
                         int num_elem, float *x, float *y) {
  int row = blockIdx.x * blockDim.x + threadIdx.x;  // each thread has its own row
  if (row < num_rows) {
    float dot = 0;
    for (int i = 0; i < num_elem; i++) {  // num_elem is the same for all rows
      dot += data[row + i * num_rows] * x[col_index[row + i * num_rows]];
    }
    y[row] += dot;
  }
}
Is SpMV/ELL an All-Around Winner?
• If the sparse matrix has a few rows with a very large number of non-zero elements, the ELL format carries a lot of padding (wasted space and execution)
• Example
  – 1000x1000 matrix, 1% non-zero elements: on average each row has 10 non-zero elements
  – CSR: about 2% of the uncompressed matrix size (including overhead)
  – Now suppose one row has 200 non-zero elements, so every row is padded to 200
  – ELL: 40% of the uncompressed matrix size (including overhead)
Reduced Padding
• Root of the problem: a few very long rows
• Need to "take away" some elements of the long rows
• The Coordinate (COO) format does that
Coordinate (COO) Format

Nonzero values  data[7]      { 3, 1, 2, 4, 1, 1, 1 }
Column indices  col_index[7] { 0, 2, 1, 2, 3, 0, 3 }
Row indices     row_index[7] { 0, 0, 2, 2, 2, 3, 3 }

• Explicitly list the column and row indices for every non-zero element
• No need for row_ptr: each element carries its own row and column positions
COO Allows Reordering Elements

Nonzero values  data[7]      { 3, 1, 2, 4, 1, 1, 1 }
Column indices  col_index[7] { 0, 2, 1, 2, 3, 0, 3 }
Row indices     row_index[7] { 0, 0, 2, 2, 2, 3, 3 }

• Each column of the three arrays fully identifies an element: its value and its (row, col) coordinates
• Elements may therefore be processed in any order
• Columns may be reordered, for example:

Nonzero values  data[7]      { 1, 1, 2, 4, 3, 1, 1 }
Column indices  col_index[7] { 0, 2, 1, 2, 0, 3, 3 }
Row indices     row_index[7] { 3, 0, 2, 2, 0, 2, 3 }
[Figure: converting between ELL and a hybrid ELL + COO representation; extracting the long-row elements into COO is often implemented in sequential host code in practice]
Hybrid Format

Row 0: 3 0 1 0
Row 1: 0 0 0 0
Row 2: 0 2 4 1
Row 3: 1 0 0 1

• Row 2 has the largest number of elements
• Remove the last non-zero element of row 2 from the ELL format and move it to the COO format
Reduced Padding with Hybrid Format

ELL part, without the third column (each thread now keeps at most 2 elements per row):

           data   col_index
Thread 0:  3 1    0 2
Thread 1:  * *    * *
Thread 2:  2 4    1 2
Thread 3:  1 1    0 3

Transposed ELL layout:
data      3 * 2 1 1 * 4 1
col_index 0 * 1 0 2 * 2 3

COO part (the element removed from row 2; may be processed in any order):
data      1
col_index 3
row_index 2
A Sequential Loop Implements SpMV/COO
• Elements can be processed in any order, since each update is independent
• The loop processes all COO elements:

for (int i = 0; i < num_elem; i++)
  y[row_index[i]] += data[i] * x[col_index[i]];

