
Multithreaded Architectures
(Applied Parallel Programming)

Chapter 10: Sparse Matrix Computation

©Wen-mei W. Hwu and David Kirk/NVIDIA, 2010-2016
Sparse matrix computation appears across science areas. (In the original table each area is also marked against the methods it uses: Structured Grids, Unstructured Grids, Dense Matrix, Sparse Matrix, N-Body, Monte Carlo, FFT, PIC, Signal I/O; the per-column marks do not survive here.)

| Science Area                       | Teams | Codes                                            |
|------------------------------------|-------|--------------------------------------------------|
| Climate and Weather                | 3     | CESM, GCRM, CM1/WRF, HOMME                       |
| Plasmas/Magnetosphere              | 2     | H3D(M), VPIC, OSIRIS, Magtail/UPIC               |
| Stellar Atmospheres and Supernovae | 5     | PPM, MAESTRO, CASTRO, SEDONA, ChaNGa, MS-FLUKSS  |
| Cosmology                          | 2     | Enzo, pGADGET                                    |
| Combustion/Turbulence              | 2     | PSDNS, DISTUF                                    |
| General Relativity                 | 2     | Cactus, Harm3D, LazEV                            |
| Molecular Dynamics                 | 4     | AMBER, Gromacs, NAMD, LAMMPS                     |
| Quantum Chemistry                  | 2     | SIAL, GAMESS, NWChem                             |
| Material Science                   | 3     | NEMOS, OMEN, GW, QMCPACK                         |
| Earthquakes/Seismology             | 2     | AWP-ODC, HERCULES, PLSQR, SPECFEM3D              |
| Quantum Chromo Dynamics            | 1     | Chroma, MILC, USQCD                              |
| Social Networks                    | 1     | EPISIMDEMICS                                     |
| Evolution                          | 1     | Eve                                              |
| Engineering/System of Systems      | 1     | GRIPS, Revisit                                   |
| Computer Science                   | 1     |                                                  |
Objective
• To learn the key techniques for compacting data in parallel sparse methods
  – reduced main memory space
  – better use of on-chip memory (cache, shared memory)
  – reduced use of memory bandwidth: fewer bytes transferred to on-chip memory
  – retained regularity
Sparse Matrix
• Solution of linear systems A*X + Y = 0
  – A: coefficient matrix; X and Y are vectors. Solve for X.
• Solution 1: inversion of the coefficient matrix
  – X = A⁻¹ * (−Y)
  – Traditional inversion algorithms such as Gaussian elimination can create too many "fill-in" elements and explode the size of the matrix
• Solution 2: iterative Conjugate Gradient solvers
  – Guess a solution X and evaluate A*X + Y
  – If the result is not (close to) 0, refine the guess and iterate
Sparse Matrix-Vector Multiplication (SpMV)

[Figure: Y = A × X + Y, a sparse matrix A times a dense vector X, accumulated into Y]


Challenges
• Compared to dense matrix multiplication, SpMV (sparse matrix-vector multiplication)
  – is irregular/unstructured
  – has little input data reuse
  – benefits little from compiler transformation tools
• Keys to maximal performance
  – Maximize regularity (by reducing divergence and load imbalance)
  – Maximize DRAM burst utilization (layout arrangement)
Sparse Matrix Example

Row 0: 3 0 1 0
Row 1: 0 0 0 0
Row 2: 0 2 4 1
Row 3: 1 0 0 1
Compressed Sparse Row (CSR) Format

Nonzero values   data[7]      { 3, 1, 2, 4, 1, 1, 1 }
Column indices   col_index[7] { 0, 2, 1, 2, 3, 0, 3 }
Row pointers     row_ptr[5]   { 0, 2, 2, 5, 7 }

(data holds Row 0's nonzeros, then Row 2's, then Row 3's; row_ptr[row] marks where each row starts, with one extra entry marking the end of the last row.)

Row 0: 3 0 1 0
Row 1: 0 0 0 0
Row 2: 0 2 4 1
Row 3: 1 0 0 1
CSR Data Layout

row_ptr 0 2 2 5 7

data 3 1 2 4 1 1 1

col_index 0 2 1 2 3 0 3

A Serial SpMV/CSR Kernel

Y = A * X + Y

Nonzero values   data[7]      { 3, 1, 2, 4, 1, 1, 1 }
Column indices   col_index[7] { 0, 2, 1, 2, 3, 0, 3 }
Row pointers     row_ptr[5]   { 0, 2, 2, 5, 7 }

A Simple Parallel SpMV

Row 0 3 0 1 0 Thread 0
Row 1 0 0 0 0 Thread 1
Row 2 0 2 4 1 Thread 2
Row 3 1 0 0 1 Thread 3

• Each thread processes one row

A Parallel SpMV/CSR Kernel (CUDA)

__global__ void SpMV_CSR(int num_rows, float *data, int *col_index,
                         int *row_ptr, float *x, float *y) {
  int row = blockIdx.x * blockDim.x + threadIdx.x;
  if (row < num_rows) {
    float dot = 0;
    int row_start = row_ptr[row];
    int row_end   = row_ptr[row + 1];
    for (int elem = row_start; elem < row_end; elem++) {
      dot += data[elem] * x[col_index[elem]];
    }
    y[row] += dot;  // still inside the 'if'
  }
}
CSR Kernel Control Divergence
• Number of for-loop iterations = compressed row length
• Threads execute different numbers of iterations, depending on their row's length

row_ptr   0 2 2 5 7
data      3 1 2 4 1 1 1
col_index 0 2 1 2 3 0 3
CSR Kernel Noncoalesced Memory Accesses
• Adjacent threads access non-adjacent memory locations
  – In iteration 0, Threads 0, 2, and 3 read data[0], data[2], and data[5], which are scattered through memory

row_ptr   0 2 2 5 7
data      3 1 2 4 1 1 1
col_index 0 2 1 2 3 0 3
Padding and Transposing
• Problems in CSR:
  – Noncoalesced memory accesses
  – Control divergence
• Solution: the ELL format
  – Named after ELLPACK, a package that included sparse matrix support
Padding and Transposing the Example Matrix
• Pad all rows to the length of the longest row
• Transpose (column major) for coalescing
  – Place the 0th element of every row in consecutive memory locations
  – Followed by the 1st elements of every row, then the 2nd elements, …
• Both data and col_index are padded and transposed
Regularizing SpMV with ELL(PACK) Format

CSR with padding:        Transposed:
Row 0: 3 1 *             3 * 2 1
Row 1: * * *             1 * 4 1
Row 2: 2 4 1             * * 1 *
Row 3: 1 1 *
How to Determine data[] and col_index[]
• data[0] … data[3] contain 3, *, 2, 1 (the 0th element of every row)
• col_index[0] … col_index[3] contain 0, *, 1, 0 (the column position, in the original matrix, of the 0th element of every row)
• Row pointers are no longer needed: all rows have the same length, so the row number is the index

Memory Coalescing with ELL

[Figure: the transposed data/col_index arrays; in each iteration, consecutive threads read consecutive elements of one column, so the accesses coalesce]
A Parallel SpMV/ELL Kernel
• The SpMV/ELL kernel is simpler than the SpMV/CSR kernel
  – All rows are the same length (with padding)
• The "for" loop iterates num_elem times, which is the same for all threads
  – No control divergence: all threads iterate the same number of times
• Memory accesses are coalesced
A Parallel SpMV/ELL Kernel (code)

__global__ void SpMV_ELL(int num_rows, float *data, int *col_index,
                         int num_elem, float *x, float *y) {
  int row = blockIdx.x * blockDim.x + threadIdx.x;  // each thread has its own row
  if (row < num_rows) {
    float dot = 0;
    for (int i = 0; i < num_elem; i++) {  // num_elem is the same for all rows
      dot += data[row + i * num_rows] * x[col_index[row + i * num_rows]];
    }
    y[row] += dot;
  }
}
Is SpMV/ELL an All-Around Winner?
• If the sparse matrix has a few rows with a very large number of non-zero elements, the ELL format carries a lot of padding (wasted space and execution)
• Example
  – 1000x1000 matrix, 1% non-zero elements: on average each row has 10 non-zero elements
  – CSR: about 2% of the uncompressed matrix size (including overhead)
  – Now suppose one row has 200 non-zero elements, so every row is padded to 200
  – ELL: 40% of the uncompressed matrix size (including overhead)
Reduced Padding
• Root of the problem: a few very long rows
• Need to "take away" some elements of the long rows
• The Coordinate (COO) format does that
Coordinate (COO) Format

Nonzero values  data[7]      { 3, 1, 2, 4, 1, 1, 1 }
Column indices  col_index[7] { 0, 2, 1, 2, 3, 0, 3 }
Row indices     row_index[7] { 0, 0, 2, 2, 2, 3, 3 }

• Explicitly list the column and row indices for every non-zero element
• No need for row_ptr: each element carries its own row and column positions
COO Allows Reordering Elements

Nonzero values  data[7]      { 3, 1, 2, 4, 1, 1, 1 }
Column indices  col_index[7] { 0, 2, 1, 2, 3, 0, 3 }
Row indices     row_index[7] { 0, 0, 2, 2, 2, 3, 3 }

• Each column of the three arrays fully identifies an element: its value and its (row, col) coordinates
• Elements may therefore be processed in any order
• Columns may be reordered, for example:

Nonzero values  data[7]      { 1, 1, 2, 4, 3, 1, 1 }
Column indices  col_index[7] { 0, 2, 1, 2, 0, 3, 3 }
Row indices     row_index[7] { 3, 0, 2, 2, 0, 2, 3 }
[Figure: converting between ELL and a hybrid ELL + COO representation; extracting the long-row elements into COO is often implemented in sequential host code in practice]
Hybrid Format

Row 0: 3 0 1 0
Row 1: 0 0 0 0
Row 2: 0 2 4 1
Row 3: 1 0 0 1

• Row 2 has the largest number of elements
• Remove the last non-zero element of row 2 from the ELL format and move it to the COO format
Reduced Padding with Hybrid Format

ELL part, without the third column (each thread now keeps at most 2 elements per row):

           data   col_index
Thread 0:  3 1    0 2
Thread 1:  * *    * *
Thread 2:  2 4    1 2
Thread 3:  1 1    0 3

Transposed ELL layout:
data      3 * 2 1 1 * 4 1
col_index 0 * 1 0 2 * 2 3

COO part (the element removed from row 2; may be processed in any order):
data      1
col_index 3
row_index 2
A Sequential Loop Implements SpMV/COO
• Elements can be processed in any order, since each update is independent
• The loop processes all COO elements:

for (int i = 0; i < num_elem; i++)
  y[row_index[i]] += data[i] * x[col_index[i]];

