
Literature Review of Selected Papers

Topic : Sparse Matrix-Vector Multiplication using Merge-Path based hybrid format


Introduction:
Sparse matrices arise from the discretization of partial differential equations modelling physical
phenomena, as well as from computationally intensive artificial neural networks. As a result, sparse
matrix-based computations are becoming an increasingly important problem. When the fraction of
non-zero elements in the matrices involved is very small, the applications are typically bottlenecked
by memory rather than computation. This is primarily due to the irregular, data-dependent memory
accesses in existing sparse matrix libraries, which lead to poor throughput and performance. To fully
unleash the potential of systems offering parallelism of unprecedented scale, a multicore-specific
optimization methodology must be devised for important scientific computations. An optimized
strategy is proposed that is effective in the multicore environment and establishes a significant
performance improvement over existing state-of-the-art serial and parallel sparse matrix-vector
product (SpMV) implementations. SpMV is important for solving sparse linear systems and
eigenvalue problems, for Krylov subspace methods, and for similar applications. It is also an
important building block for large-scale combinatorial graph algorithms. Achieving good
performance on today's parallel architectures requires complementary strategies for workload
decomposition and matrix storage formatting that provide both uniform processor utilization
and efficient use of memory bandwidth. A server-class multicore CPU and a GPU are used as
baseline architectures for comparison.
Background:
SpMV in general refers to operations of the form y ← αAx + βy, where A is large and sparse,
x and y are (dense) column vectors, and α, β are scalars. We choose the variant y ← Ax + y
because it isolates the sparse matrix from the dense components. Specifically, the number of
floating point operations is necessarily twice the number of non-zero elements in A (one
multiply-accumulate per element), independent of the matrix dimensions. This is in contrast to
the general form, where the scaled vector βy can skew performance.
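As a point of reference, the sketch below makes this operation count concrete. It is a minimal serial version of y ← Ax + y over a generic list of stored non-zeros (the Entry struct and function name are illustrative, not taken from any particular library): each stored entry contributes exactly one multiply and one add, so the cost is 2·nnz floating point operations regardless of the matrix dimensions.

    // Minimal serial reference for y <- A*x + y over the stored non-zeros of A.
    // Each stored entry (row, col, val) contributes one multiply and one add,
    // so the FLOP count is exactly 2 * nnz, independent of matrix dimensions.
    #include <vector>

    struct Entry { int row, col; double val; };   // one non-zero of A (illustrative)

    void spmv_reference(const std::vector<Entry>& A,
                        const std::vector<double>& x,
                        std::vector<double>& y) {
        for (const Entry& e : A)
            y[e.row] += e.val * x[e.col];          // one multiply-accumulate per non-zero
    }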
One of the primary goals of most SpMV optimizations is to mitigate the impact of
irregularity in the underlying matrix structure. This is also our central concern. Specifically,
we aim to minimize execution and memory divergence caused by the potential irregularity of
the underlying sparse matrix. Since an efficient SpMV kernel should be memory-bound, our
measure of success will be the fraction of peak bandwidth these kernels can achieve.
To address these problems, a kernel design must specify two key features:
a. Matrix format
b. Parallel kernel design
We choose well-known formats such as DIA, ELL, CSR and COO that are supported by
standard sparse matrix packages and we organise our computations to minimise divergence.
i. Diagonal format (DIA): The diagonal format is formed by two arrays: data, which
stores the nonzero values, and offsets, which stores the offset of each diagonal
from the main diagonal. Diagonals above and below the main diagonal have
positive and negative offsets, respectively.
ii. ELLPACK format (ELL): An M-by-N sparse matrix with at most K nonzeros per
row is stored as a dense M-by-K array, data, of nonzero values together with an
M-by-K array, indices, of the corresponding column indices.
iii. Compressed Sparse Row format (CSR): Explicitly stores column indices and
nonzero values in arrays indices and data. A third array of row pointers, ptr,
allows the CSR format to represent rows of varying length.
iv. Coordinate Format (COO): Arrays: row, indices, and data store the row indices,
column indices, and values, respectively, of the nonzero entries. We further
assume that entries with the same row index are stored contiguously.
The CSR format eliminates COO row-index repetition by storing the nonzero values and
column indices in row-major order, and building a separate row-offsets array such that the
entries for row i in the other arrays occupy the half-open interval [row-offsets(i), row-
offsets(i+1)). CSR is perhaps the most commonplace general-purpose format for in-memory
storage.
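To make the storage layouts concrete, the following sketch shows a small hypothetical 4-by-4 matrix in both COO and CSR form; the array names mirror those used above, and the matrix itself is invented purely for illustration.

    // Hypothetical 4x4 example matrix ('.' marks a zero):
    //   [ 1 . 2 . ]
    //   [ . 3 . . ]
    //   [ 4 5 6 . ]
    //   [ . . . 7 ]
    #include <vector>

    // COO: one (row, column, value) triple per non-zero, rows stored contiguously.
    std::vector<int>    coo_row = {0, 0, 1, 2, 2, 2, 3};
    std::vector<int>    coo_col = {0, 2, 1, 0, 1, 2, 3};
    std::vector<double> coo_val = {1, 2, 3, 4, 5, 6, 7};

    // CSR: the repeated row indices are compressed into a row-offsets array of
    // length M+1; the non-zeros of row i occupy the half-open range
    // [row_offsets[i], row_offsets[i+1]).
    std::vector<int>    row_offsets = {0, 2, 3, 6, 7};
    std::vector<int>    col_indices = {0, 2, 1, 0, 1, 2, 3};
    std::vector<double> values      = {1, 2, 3, 4, 5, 6, 7};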
CsrMV implementations adhere to one of two general parallelization strategies: row splitting,
where rows are distributed across processors, or non-zero splitting, where the non-zero data is
equally partitioned.
i. Row-splitting: This strategy assigns regularized sections of long rows to multiple
processors to limit the number of data items assigned to each processor.
The partial sums from related sub-rows can be aggregated in a subsequent pass.
Differences in length between rows that are smaller than the splitting size can
still contribute to load imbalance between threads. The splitting of long rows can be
done statically via a pre-processing step that encodes the dataset into an
alternative format, or dynamically using a work-sharing runtime. The dynamic
variant requires runtime task distribution, a behaviour likely to incur processor
contention and limit scalability on massively parallel systems.
ii. Non-zero splitting: An alternative to row-based parallelization is to assign an
equal share of the nonzero data (i.e., the values and column-indices arrays) to each
processor. Processors then determine to which row(s) their data items belong by
searching within the row-offsets array. As each processor consumes its section of
nonzeros, it must track its progress through the row-offsets array. CsrMV
parallelizations of this kind perform the searching as an offline pre-processing step,
storing processor-specific waypoints within supplemental data structures.
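As a rough illustration of the non-zero splitting idea, the sketch below (with illustrative names) shows how a processor assigned a contiguous range of non-zeros can locate the row owning its first element by binary-searching the CSR row-offsets array; the waypoints mentioned above would simply be the results of such searches, computed once up front.

    // Sketch of non-zero splitting: processor p is assigned non-zeros
    // [p*chunk, (p+1)*chunk) and finds the row containing its first non-zero
    // by binary-searching the CSR row-offsets array.
    #include <algorithm>
    #include <vector>

    // Returns the row that owns global non-zero index nz, i.e. the largest i
    // with row_offsets[i] <= nz.
    int starting_row(const std::vector<int>& row_offsets, int nz) {
        // upper_bound finds the first offset strictly greater than nz;
        // the owning row is the one immediately before it.
        auto it = std::upper_bound(row_offsets.begin(), row_offsets.end(), nz);
        return static_cast<int>(it - row_offsets.begin()) - 1;
    }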
Merge-path CsrMV is a specific type of parallel CsrMV decomposition in which each processor
is given an equal share of merge steps, regardless of list sizes and value content. The
decomposition is modelled as the merger of two sorted lists, A and B, traced as a decision path
across a two-dimensional grid; for CsrMV, A is the CSR row-offsets array and B is the implicit
sequence of indices of the non-zero values. The decision path begins in the top-left corner and
ends in the bottom-right. When traced sequentially, the merge path moves right when consuming
elements from A and down when consuming from B. Consequently, the path coordinates describe
a complete schedule of element comparison and consumption across both input sequences.
Furthermore, each path coordinate can be linearly indexed by its grid diagonal, where diagonals
are enumerated from top-left to bottom-right. Per convention, the semantics of merge always
prefer items from A over those from B when comparing same-valued items. This results in
exactly one decision path.
To parallelize across p threads, the grid is sliced diagonally into p swaths of equal width, and
it is the job of each thread to establish the route of the decision path through its swath. The
two elements Ai and Bj scheduled to be compared at diagonal k can be found via a constrained
binary search along that diagonal: find the first (i, j) where Ai is greater than all of the items
consumed before Bj, given that i + j = k. Each thread need only search the first diagonal in its
swath; the remainder of its path segment can be trivially established via sequential
comparisons seeded from that starting coordinate.
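The following is a simplified sketch of that constrained binary search, modelled on the merge-path search described by Merrill and Garland [1]; the struct and parameter names are illustrative. For CsrMV, list B never needs to be materialised because B[j] = j.

    // Locates the merge path's crossing point on grid diagonal `diag`.
    // For CsrMV, A is the row-offsets array (without its leading zero) and
    // B is the implicit sequence 0,1,2,... of non-zero indices, so B[j] == j.
    #include <algorithm>

    struct PathCoord { int i; int j; };   // i indexes A (rows), j indexes B (non-zeros)

    PathCoord merge_path_search(int diag, const int* A, int a_len, int b_len) {
        int lo = std::max(diag - b_len, 0);
        int hi = std::min(diag, a_len);
        while (lo < hi) {
            int pivot = (lo + hi) / 2;
            // B[diag - pivot - 1] == diag - pivot - 1 for the implicit list B.
            if (A[pivot] <= diag - pivot - 1)
                lo = pivot + 1;            // path passes to the right of A[pivot]
            else
                hi = pivot;                // path passes above A[pivot]
        }
        return { lo, diag - lo };
    }

Each of the p threads would call this once, on the first diagonal of its swath, and then walk the rest of its path segment with simple sequential comparisons.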
Another implementation strategy, in terms of both matrix formatting and parallelization, is the
blocked row-column (BRC) format. It uses a two-dimensional blocking mechanism that reduces
thread divergence by reordering the rows of the input matrix and mapping rows with a nearly
equal number of non-zero elements onto the same execution units. This improves load balance
by partitioning rows into blocks with a constant number of non-zeros, so that the blocks perform
the same amount of work. Selecting the block size based on the sparsity characteristics of the
matrix plays a vital role in improving performance. Representing only the non-zero elements
keeps both the operation count and the memory requirements low. Row-blocking of the matrix A
is done to reduce thread divergence, and column-blocking to improve load balance. Neighbouring
rows are grouped into blocks after the permutation to achieve memory coalescing. This results in
a decrease in memory usage, redundant computation, and data transfer. By assigning one warp to
each block, thread divergence is avoided within the warp. Blocks are produced in such a way that
the first row in a block has a larger number of non-zeros than the last row. Column-blocking on
top of row-blocking means that the matrix A is also blocked along the column dimension,
avoiding the performance bottleneck caused by very long rows.
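As a simplified illustration of the row-reordering step only (the full BRC construction also blocks along the column dimension and chooses block sizes from the sparsity profile), the sketch below sorts rows by decreasing non-zero count to produce the permutation described above; all names are illustrative.

    // Row-reordering step behind BRC-style blocking: sort rows by non-zero
    // count (descending) so that neighbouring rows in the permuted matrix do a
    // similar amount of work and can be grouped into blocks.
    #include <algorithm>
    #include <numeric>
    #include <vector>

    std::vector<int> rows_by_decreasing_nnz(const std::vector<int>& row_offsets) {
        int m = static_cast<int>(row_offsets.size()) - 1;
        std::vector<int> perm(m);
        std::iota(perm.begin(), perm.end(), 0);
        std::stable_sort(perm.begin(), perm.end(), [&](int a, int b) {
            int nnz_a = row_offsets[a + 1] - row_offsets[a];
            int nnz_b = row_offsets[b + 1] - row_offsets[b];
            return nnz_a > nnz_b;                 // longer rows come first
        });
        return perm;   // perm[k] = original index of the k-th row after reordering
    }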
Experimental Evaluation:
The merge-path algorithms were implemented on both CPU and GPU for benchmarking against
other SpMV kernels. Intel's Math Kernel Library (MKL) and NVIDIA's cuSPARSE library were
used as the standard baselines for comparison. The dataset used for comparison is drawn from
the University of Florida sparse matrix collection.

Additionally, the different matrix-format-based implementations are evaluated against the BRC
format on different GPU devices, in both single and double precision (SP/DP).

Conclusion:
Sparse matrix-vector multiplication is best optimised by evenly balancing the load of all
non-zero computations across all processing elements. This would ideally require a hybrid
blocked format derived from CSR, parallelised with the merge-path algorithm described above.
Such an implementation would give good speedup, trading some peak performance for
consistent throughput across inputs. Power requirements across different platforms need
further exploration.
References:
[1] Merrill, Duane, and Michael Garland. "Merge-based sparse matrix-vector multiplication
(SpMV) using the CSR storage format." Proceedings of the 21st ACM SIGPLAN
Symposium on Principles and Practice of Parallel Programming. ACM, 2016.
[2] Bell, Nathan, and Michael Garland. "Implementing sparse matrix-vector multiplication on
throughput-oriented processors." Proceedings of the Conference on High Performance
Computing Networking, Storage and Analysis. ACM, 2009.
[3] Davis, Timothy A., and Yifan Hu. "The University of Florida sparse matrix collection."
ACM Transactions on Mathematical Software (TOMS) 38.1 (2011): 1.
