Analysis of a Class of
Parallel Matrix Multiplication Algorithms
John Gunnels
Calvin Lin
Greg Morrow
Robert van de Geijn
Department of Computer Sciences
The University of Texas at Austin
Austin, TX 78712
{gunnels,lin,morrow,rvdg}@cs.utexas.edu

A Technical Paper Submitted to IPPS 98†

Abstract
Publications concerning parallel implementation of matrix-matrix multiplication
continue to appear with some regularity. It may seem odd that an algorithm that can
be expressed as one statement and three nested loops deserves this much attention.
This paper provides some insights as to why this problem is complex: Practical algorithms
that use matrix multiplication tend to use differently shaped matrices, and the
shape of the matrices can significantly impact the performance of matrix multiplication.
We provide theoretical analysis and experimental results to explain the differences in
performance achieved when these algorithms are applied to differently shaped matrices.
This analysis sets the stage for hybrid algorithms which choose between the algorithms
based on the shapes of the matrices involved. While the paper resolves a number of
issues, it concludes with a discussion of a number of directions yet to be pursued.

Corresponding Author, Phone: (512) 471-9720, Fax: (512) 471-8885.
†Regarding the length of this paper: There are a large number of tables and figures within the text, in
addition to an extensive bibliography. Taking this into account, the length is within submission guidelines.

1 Introduction
Over the last three decades, a number of different approaches have been proposed for
implementation of matrix-matrix multiplication on distributed memory architectures. These
include Cannon's algorithm [7], the broadcast-multiply-roll algorithm [16, 15], and the Parallel
Universal Matrix Multiplication Algorithm (PUMMA) [11]. This last algorithm is a generalization
of broadcast-multiply-roll to non-square meshes of processors. An alternative generalization
of this algorithm [19, 20] and an interesting algorithm based on a three-dimensional
data distribution have also been developed [2].
The approach now considered the most practical, sometimes known as broadcast-broadcast,
was first proposed by Agarwal et al. [1], who showed that a sequence of parallel rank-k
(panel-panel) updates is a highly effective way to parallelize C = AB. That same observation
was independently made by van de Geijn et al. [24], who introduced the Scalable
Universal Matrix Multiplication Algorithm (SUMMA). In addition to computing C = AB
as a sequence of panel-panel updates, SUMMA implements C = AB^T and C = A^T B as
a sequence of matrix-panel (of vectors) multiplications, and C = A^T B^T as a sequence of
panel-panel updates. Later work [8] showed how these techniques can be extended to a large
class of commonly used matrix-matrix operations that are part of the Level-3 Basic Linear
Algebra Subprograms [12].
All of the above-mentioned efforts have concentrated primarily on the special case where
the input matrices are approximately square. More recently, work by Li et al. [22] attempts
to create a "poly-algorithm" that chooses between three of the above algorithms (Cannon's,
broadcast-multiply-roll, and broadcast-broadcast). Through empirical data they show some
advantage when meshes are non-square or odd shaped. However, the algorithms being
considered are not inherently more adapted to specific shapes of matrices, and thus limited
benefit is observed.
In earlier work [23, 4, 3] we observe the need for algorithms to be sensitive to the shape
of the matrices, and we hint at the fact that a class of algorithms naturally supported by
the Parallel Linear Algebra Package (PLAPACK) infrastructure provides a convenient set
of algorithms to serve as the basis for such shape-adaptive algorithms. Thus, a number of
hybrid algorithms are now part of PLAPACK. This observation was also made as part of
the ScaLAPACK [9] project, and indeed ScaLAPACK includes a number of matrix-matrix
operations that are implemented to choose algorithms based on the shape of the matrix. We
comment on the difference between our approach and ScaLAPACK later.
This paper provides a description and analysis of the class of parallel matrix multiplication
algorithms that naturally lends itself to hybridization. The analysis combines theoretical
results with empirical results to provide a complete picture of the benefits of the different
algorithms and how they can be combined into hybrid algorithms. A number of simplifications
are made to provide a clear picture that illustrates the issues. Having done so, a
number of extensions are identified for further study.
2 Background
2.1 Model of Parallel Architecture
For our analysis, we will assume a logical r × c mesh of computational nodes which are
indexed P_{i,j}, 0 ≤ i < r and 0 ≤ j < c, so that the total number of nodes is p = rc. These
nodes are connected through some communication network, which could be a hypercube
(Intel iPSC/860), a higher dimensional mesh (Intel Paragon, Cray T3D/E), or a multistage
network (IBM SP2).

2.2 Model of Communication Cost


2.2.1 Individual messages
We will assume that in the absence of network conflict, sending a message of length n between
any two nodes can be modeled by

    α + nβ,

where α represents the latency (startup cost) and β the cost per byte transferred. For all
available communication networks, the collective communications in Section 2.2.2 can be
implemented to avoid network conflicts. This will in some networks require an increased
number of messages, which affects the latency incurred.
2.2.2 Collective communication
In the matrix-matrix multiplication algorithms discussed here, all communication encoun-
tered is collective in nature. These collective communications are summarized in Table 1.
    Operation         MPI call             Cost
    Broadcast         MPI_Bcast            T_Bcast(n, p)
    Collect           MPI_Allgather        T_Coll(n, p)
    Scatter           MPI_Scatter          T_Scat(n, p)
    Gather            MPI_Gather           T_Gath(n, p)
    Distributed sum   MPI_Reduce_scatter   T_DistrSum(n, p)
    Sum-to-one        MPI_Reduce           T_Sum2one(n, p)
    Sum-to-all        MPI_Allreduce        T_Sum2all(n, p)

Table 1: Collective communication routines used in matrix-matrix multiplication.

Frequently, operations like broadcast are implemented using minimum spanning trees, in
which case the cost can be modeled by T_Bcast(n, p) = log2(p)(α + nβ). However, when n is
large, better algorithms are available. Details on the algorithms and estimated costs can be
found in the literature (e.g., [6, 15, 18]).
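
For concreteness, these estimates can be evaluated directly. The following is a minimal Python sketch (an illustration of the model only, not PLAPACK code) of the point-to-point cost α + nβ and a minimum-spanning-tree broadcast. The parameter values are assumptions taken from the measurements reported later in Section 4.1.1, with the per-byte cost derived from the quoted 22 Mbytes per second.

    import math

    # Assumed machine parameters, taken from the measurements in Section 4.1.1.
    ALPHA = 50e-6       # latency (startup cost) per message, in seconds
    BETA = 1.0 / 22e6   # cost per byte, derived from the 22 Mbytes/sec transfer rate

    def t_send(n_bytes):
        """Point-to-point model: sending n_bytes between two nodes costs alpha + n*beta."""
        return ALPHA + n_bytes * BETA

    def t_bcast_mst(n_bytes, p):
        """Minimum-spanning-tree broadcast among p nodes: log2(p) message steps."""
        return math.log2(p) * t_send(n_bytes)

    # Example: broadcasting a 128 x 128 panel of 8-byte reals within a row of 4 nodes.
    print(t_bcast_mst(128 * 128 * 8, 4))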

2.3 Model of Computational Cost


All computation on the individual nodes will be implemented using calls to highly optimized
sequential matrix-matrix multiply kernels that are part of the Level-3 Basic Linear Algebra
Subprograms (BLAS) [12]. In our model, we will represent the cost of one add or one
multiply by γ.
For many vendors, the shape of the matrices involved in a sequential matrix-matrix
multiply (dgemm call) influences the performance of this kernel. Given the operation

    C = AB,

there are three dimensions: m, n, and k, where C is m × n, A is m × k, and B is k × n. In
a typical use, each of these dimensions can be either large or small, leading to the different
shapes given in Table 2. In our model, we will use a different cost for each shape, as indicated
in the table. The fact that this cost can vary considerably will be demonstrated in Section 4
and Table 5.
We identify the different shapes by the operation they become when "small" equals one,
which indicates that one or more of the matrices become scalars or vectors. The different
operations gemm, gemv, ger, axpy, and dot refer to the BLAS routines¹ for general matrix-matrix
multiplication, matrix-vector multiplication, rank-1 update, scalar-α-times-x-plus-y,
and dot product. The names for some of the cases reported in the last column come from
recognizing that when "small" is greater than one but still small, a matrix that would otherwise
be a row or column vector becomes a row panel or column panel of vectors, respectively.

¹Notice that the more general operation performed is C ← αAB + βC, which further explains the choice
of BLAS routines mentioned.

2.4 Data distribution


2.4.1 Physically Based Matrix Distribution
We have previously observed [14] that data distributions should focus on the decomposition
of the physical problem to be solved, rather than on the decomposition of matrices. Typically,
it is the elements of vectors that are associated with data of physical signi cance, and it is
therefore their distribution to nodes that is signi cant. A matrix, which is a discretized
operator, merely represents the relation between two vectors, which are discretized spaces:
y = Ax. Since it is more natural to start by distributing the problem to nodes, we partition
x and y, and assign portions of these vectors to nodes. The matrix A is then distributed
to nodes so that it is consistent with the distribution of the vectors, as we describe below.
We call such matrix distributions physically based if the layout of the vectors which induce
the distribution of A to nodes is consistent with where an application would naturally want
them. We will use the abbreviation PBMD for Physically Based Matrix Distribution.

    m      n      k      Cost per flop   "small" = 1 operation   Case
    large  large  large  γ_gemm          gemm
    large  large  small  γ_ger           ger                     panel-panel
    large  small  large  γ_gemv          gemv                    matrix-panel
    large  small  small  γ_axpy          axpy (n = 1)            panel-axpy
    small  large  large  γ_gemv          gemv                    panel-matrix
    small  large  small  γ_axpy          axpy (m = 1)            panel-axpy
    small  small  large  γ_dot           dot (m = n = 1)         panel-dot
    small  small  small  γ               scalar mult.

Table 2: Different possible cases for C = AB.
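
To illustrate how the shape-dependent constant enters the model, the following small Python sketch (ours, not from the paper's implementation) evaluates the local computation time 2mnkγ. The particular value of γ is the best constant reported later in Table 5, which Section 4.1.2 associates with the panel-panel shape; treating it as such here is our reading.

    def local_matmul_time(m, n, k, gamma):
        """Time for a local matrix-matrix multiply: 2*m*n*k floating point
        operations at gamma seconds per operation."""
        return 2.0 * m * n * k * gamma

    # gamma for the best (panel-panel) shape measured on the Cray T3E, from Table 5.
    GAMMA_PANEL_PANEL = 2.28e-9

    # Example: a rank-128 (panel-panel) update of a 1024 x 1024 local block of C.
    print(local_matmul_time(1024, 1024, 128, GAMMA_PANEL_PANEL))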
We now describe the distribution of the vectors, x and y, to nodes, after which we will
show how the matrix distribution is induced (derived) by this vector distribution. Let us
assume x and y are of length n and m, respectively. We will distribute these vectors using a
distribution block size of b_distr. For simplicity assume n = N b_distr and m = M b_distr. Partition
x and y so that

\[
x = \begin{pmatrix} x_0 \\ x_1 \\ \vdots \\ x_{N-1} \end{pmatrix}
\quad \mbox{and} \quad
y = \begin{pmatrix} y_0 \\ y_1 \\ \vdots \\ y_{M-1} \end{pmatrix}
\]

where N >> p and M >> p and each x_i and y_i is of length b_distr. Partitioning A conformally
yields the blocked matrix

\[
A = \begin{pmatrix}
A_{0,0}   & A_{0,1}   & \cdots & A_{0,N-1}   \\
A_{1,0}   & A_{1,1}   & \cdots & A_{1,N-1}   \\
\vdots    & \vdots    & \ddots & \vdots      \\
A_{M-1,0} & A_{M-1,1} & \cdots & A_{M-1,N-1}
\end{pmatrix}
\qquad (1)
\]

where each subblock is of size b_distr × b_distr. Blocks of x and y are assigned to a two-dimensional
mesh of nodes in column-major order, as illustrated in Fig. 1. The matrix distribution is induced by
these vector distributions by assigning blocks of columns A_{·,j} to the same column of nodes
as subvector x_j, and blocks of rows A_{i,·} to the same row of nodes as subvector y_i. This is
illustrated in Fig. 1, taken from [23].
Notice that the distribution of the matrix wraps blocks of rows and columns onto rows
and columns of processors, respectively. The wrapping is necessary to improve load balance.
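
To make the induced mapping concrete, the following Python sketch (our illustration, not PLAPACK code) wraps subvector blocks onto an r × c mesh in column-major order and derives the owner of each b_distr × b_distr subblock A_{i,j} from the owners of y_i and x_j, as in Fig. 1.

    def vector_block_owner(l, r, c):
        """Node (row, col) owning subvector block l when blocks are wrapped onto an
        r x c mesh in column-major order: block 0 -> (0,0), block 1 -> (1,0), ..."""
        return (l % r, (l // r) % c)

    def matrix_block_owner(i, j, r, c):
        """Owner of subblock A_{i,j} under the induced (PBMD) distribution:
        the node row comes from y_i, the node column from x_j."""
        node_row, _ = vector_block_owner(i, r, c)
        _, node_col = vector_block_owner(j, r, c)
        return (node_row, node_col)

    # Example: reproduce the kind of assignment shown in Fig. 1 for a 3 x 4 mesh.
    r, c = 3, 4
    for i in range(4):
        print([matrix_block_owner(i, j, r, c) for j in range(6)])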
2.4.2 Size of local data
As part of our cost analysis in the subsequent sections, it will be important to compute the
maximum number of elements of a vector x of length n assigned to a processor. Similarly,
we need the maximum number of matrix elements assigned to any node. Functions for these
quantities are given in Table 3.
    lmax(b, n, p)              =  maximum local length when a vector of length n is wrapped onto p
                                  processors using block size b
                               =  ⌈⌈n/b⌉/p⌉ b

    lmax(b_distr, m, r)        =  maximum local number of rows when a matrix of size m × n is
                                  wrapped onto an r × c mesh of processors using distribution block
                                  size b_distr

    lmax(r b_distr, n, c)      =  maximum local number of columns when a matrix of size m × n
                                  is wrapped onto an r × c mesh of processors using distribution
                                  block size b_distr. The factor r comes from the fact that column
                                  blocks are wrapped r at a time.

    Lmax(b_distr, m, n, r, c)  =  maximum local number of matrix elements when a matrix of size
                                  m × n is wrapped onto an r × c mesh of processors using
                                  distribution block size b_distr
                               =  lmax(b_distr, m, r) · lmax(r b_distr, n, c)

Table 3: Functions that give the maximum local number of elements for distributed vectors
and matrices.
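
The functions of Table 3 translate directly into code; the following small Python transcription (a sketch of ours for evaluating the cost model, not PLAPACK code) implements lmax and Lmax as defined above.

    import math

    def lmax(b, n, p):
        """Maximum local length when a vector of length n is wrapped onto p
        processors (or processor rows/columns) using block size b:
        ceil(ceil(n / b) / p) * b."""
        return math.ceil(math.ceil(n / b) / p) * b

    def Lmax(b_distr, m, n, r, c):
        """Maximum local number of elements of an m x n matrix wrapped onto an
        r x c mesh with distribution block size b_distr. Column blocks are
        wrapped r at a time, hence the factor r in the second argument."""
        return lmax(b_distr, m, r) * lmax(r * b_distr, n, c)

    # Example: a 4096 x 4096 matrix on a 4 x 4 mesh with b_distr = 128.
    print(Lmax(128, 4096, 4096, 4, 4))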

Figure 1: Inducing a matrix distribution from vector distributions when the vectors are
over-decomposed. Top: The subvectors of x and y are assigned to a logical 3 × 4 mesh of nodes
by wrapping in column-major order. By projecting the indices of y to the left, we determine
the distribution of the matrix row-blocks of A. By projecting the indices of x to the top, we
determine the distribution of the matrix column-blocks of A. Bottom: Distribution of the
subblocks of A. The indices indicate the node to which the subblock is mapped.

2.4.3 Data movements for matrix-matrix multiplication
In the PBMD representation, vectors are fully distributed (not duplicated!), while a single
row or column of a matrix will reside in a single row or column of processors. One benefit
of the PBMD representation is the clean manner in which vectors, matrix rows, and matrix
columns interact. Redistributing a column of a matrix like a vector requires a scatter com-
munication within rows. The natural extension is that redistributing a panel of columns like
a multivector (a group of vectors) requires each of the columns of the panel to be scattered
within rows. In Table 4, we show the natural data movements encountered during matrix-
matrix multiplication, and their costs. In that table, we also count the time for packing
(T_Pack) and unpacking (T_Unpack) data into a format that allows for the collective communi-
cation to proceed. The necessity for this can be seen by carefully examining Fig. 1, which
shows that when indices that indicate vector distribution are projected to derive row and
column distribution, an interleaving occurs.

3 A Class of Algorithms
The previous section laid the foundation for the analysis of a class of parallel matrix-matrix
multiplication algorithms. We show that different blockings of the operands lead to different
algorithms, each of which can be built from a simple parallel matrix-matrix multiplication
kernel. These kernels themselves can be recognized as the special cases in Table 2 where one
or more of the dimensions are small.
The kernels can be separated into two groups. For the first, at least two dimensions are
large, leading to operations involving matrices and panels of rows or columns: the panel-panel
update, matrix-panel multiply, and panel-matrix multiply. Notice that these cases are similar
to kernels used in our paper [8] for parallel implementation of matrix-matrix operations as
well as basic building blocks used by ScaLAPACK² for parallel implementation of Basic
Linear Algebra Subprograms. However, it is the systematic data movement supported by
PLAPACK that allows the straightforward implementation, explanation, and analysis of
these operations, which we present in the rest of this section.
The second group of kernels assumes that only one of the dimensions is large, making
one of the operands a small matrix, and the other two row or column panels. Their implementation
recognizes that when the bulk of the computation involves operands that are panels,
redistributing like a multivector becomes beneficial. These kernels inherently cannot be
supported by ScaLAPACK due to fundamental design decisions underlying that package.
In contrast, Physically Based Matrix Distribution and PLAPACK naturally support their
implementation.
We will see that all algorithms will be implemented as a series of calls to these kernels.

²The PB-BLAS, for Parallel Blocked BLAS.

    Redistribute a column panel A as a duplicated column panel:
        T_cp→dcp(b_distr, m, n, r, c) = (n/b_distr) T_Bcast(lmax(b_distr, m, r) b_distr, c)

    Redistribute a row panel A as a duplicated row panel:
        T_rp→drp(b_distr, m, n, r, c) = (m/b_distr) T_Bcast(lmax(r b_distr, n, c) b_distr, r)

    Redistribute a column panel A as a multivector:
        T_cp→mv(b_distr, m, n, r, c) = (n/b_distr) [T_Pack(lmax(b_distr, m, r) b_distr, c)
                                                    + T_Scat(lmax(b_distr, m, r) b_distr, c)]

    Redistribute a row panel A as a multivector:
        T_rp→mv(b_distr, m, n, r, c) = (m/b_distr) [T_Pack(lmax(r b_distr, n, c) b_distr, r)
                                                    + T_Scat(lmax(r b_distr, n, c) b_distr, r)]

    Redistribute a multivector A as a duplicated column panel:
        T_mv→dcp(b_distr, m, n, r, c) = (n/b_distr) [T_Coll(lmax(b_distr, m, r) b_distr, c)
                                                     + T_Unpack(lmax(b_distr, m, r) b_distr, c)]

    Redistribute a multivector A as a duplicated row panel:
        T_mv→drp(b_distr, m, n, r, c) = (m/b_distr) [T_Coll(lmax(r b_distr, n, c) b_distr, r)
                                                     + T_Unpack(lmax(r b_distr, n, c) b_distr, r)]

    Redistribute a column panel A as a duplicated row panel:
        T_cp→drp(b_distr, m, n, r, c) = T_cp→mv(b_distr, m, n, r, c) + T_mv→drp(b_distr, n, m, r, c)

    Redistribute a row panel A as a duplicated column panel:
        T_rp→dcp(b_distr, m, n, r, c) = T_rp→mv(b_distr, m, n, r, c) + T_mv→dcp(b_distr, n, m, r, c)

    Sum duplicated column panels within rows to a column panel:
        T_Σ→cp(b_distr, m, n, r, c) = (n/b_distr) T_Sum2one(lmax(b_distr, m, r) b_distr, c)

    Sum duplicated row panels within columns to a row panel:
        T_Σ→rp(b_distr, m, n, r, c) = (m/b_distr) T_Sum2one(lmax(r b_distr, n, c) b_distr, r)

Table 4: Data movements encountered as part of matrix-matrix multiplication.
One benefit from this approach is that a high level of performance is achieved while workspace
required for storing duplicates of data is kept reasonable.
We recognize that due to page limitations, the explanations of the algorithms are some-
what terse. We refer the reader to the literature [8, 23, 24] for additional details.

3.1 Notation
For simplicity, we will assume that the dimensions m, n, and k are integer multiples of
the algorithmic block size b_alg. When discussing these algorithms, we will use the following
partitionings of matrices A, B, and C:

\[
X = \left( X_0 \mid X_1 \mid \cdots \mid X_{n_X/b_{alg}} \right)
  = \begin{pmatrix} \hat{X}_0 \\ \hat{X}_1 \\ \vdots \\ \hat{X}_{m_X/b_{alg}} \end{pmatrix}
\]

where X ∈ {A, B, C}, and m_X and n_X are the row and column dimensions of the indicated
matrix. Also,

\[
X = \begin{pmatrix}
X_{0,0} & X_{0,1} & \cdots & X_{0,n/b_{alg}} \\
X_{1,0} & X_{1,1} & \cdots & X_{1,n/b_{alg}} \\
\vdots  & \vdots  & \ddots & \vdots \\
X_{m/b_{alg},0} & X_{m/b_{alg},1} & \cdots & X_{m/b_{alg},n/b_{alg}}
\end{pmatrix}
\]

In the above partitionings, the block size b_alg is chosen to maximize the performance of the
local matrix-matrix multiplication operation.

3.2 Panel-panel based algorithm


Observe that

\[
C = AB = \left( A_0 \mid A_1 \mid \cdots \mid A_{k/b_{alg}} \right)
\begin{pmatrix} \hat{B}_0 \\ \hat{B}_1 \\ \vdots \\ \hat{B}_{k/b_{alg}} \end{pmatrix}
= A_0 \hat{B}_0 + A_1 \hat{B}_1 + \cdots + A_{k/b_{alg}} \hat{B}_{k/b_{alg}}
\]

Thus, matrix-matrix multiplication can be implemented as k/b_alg calls to a kernel that
performs a panel-panel multiplication. This kernel can be implemented as follows:

    Case: m large, n large, k = b_alg.

      T(b_distr, m, n, k, r, c) =
        Duplicate panel A. This requires a broadcast within rows:      T_dup A(b_distr, m, k, r, c)
      + Duplicate panel B. This requires a broadcast within columns:   T_dup B(b_distr, k, n, r, c)
      + Local update:                                                  2 Lmax(b_distr, m, n, r, c) γ

The cost of calling this kernel repeatedly to compute C = AB becomes

    (k/b_alg) T(b_distr, m, n, b_alg, r, c).
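
The blocking that underlies this algorithm can be checked serially, independently of any distribution or communication. The following NumPy sketch (ours; the matrix sizes and block size are arbitrary choices) verifies that accumulating the panel-panel products A_l B̂_l reproduces C = AB.

    import numpy as np

    def panel_panel_matmul(A, B, b_alg):
        """Compute C = A B as a sum of rank-b_alg (panel-panel) updates."""
        m, k = A.shape
        _, n = B.shape
        C = np.zeros((m, n))
        for l in range(0, k, b_alg):
            A_l = A[:, l:l + b_alg]   # column panel of A (m x b_alg)
            B_l = B[l:l + b_alg, :]   # corresponding row panel of B (b_alg x n)
            C += A_l @ B_l            # local panel-panel (rank-b_alg) update
        return C

    rng = np.random.default_rng(0)
    A = rng.standard_normal((300, 256))
    B = rng.standard_normal((256, 200))
    print(np.allclose(panel_panel_matmul(A, B, 64), A @ B))   # True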

3.3 Matrix-panel based algorithm


Observe that

\[
\left( C_0 \mid C_1 \mid \cdots \mid C_{n/b_{alg}} \right) = C = AB
= A \left( B_0 \mid B_1 \mid \cdots \mid B_{n/b_{alg}} \right)
= \left( A B_0 \mid A B_1 \mid \cdots \mid A B_{n/b_{alg}} \right)
\]

Thus, matrix-matrix multiplication can be implemented by computing each column panel of
C, C_j, by a call to a matrix-panel kernel:

    Case: m large, n = b_alg, k large.

      T(b_distr, m, n, k, r, c) =
        Duplicate panel B. This requires a scatter within rows
        followed by a collect within columns:                          T_dup B(b_distr, k, n, r, c)
      + Local matrix-panel multiply to compute contributions to C:     2 Lmax(b_distr, m, n, r, c) γ
      + Summation of the contributions to C within rows. This
        requires a sum-to-one operation within columns:                T_sum C(b_distr, m, n, r, c)

The total cost to compute C = AB in this fashion is

    (n/b_alg) T(b_distr, m, b_alg, k, r, c).

3.4 Panel-matrix based algorithm


Observe that

\[
\begin{pmatrix} \hat{C}_0 \\ \hat{C}_1 \\ \vdots \\ \hat{C}_{m/b_{alg}} \end{pmatrix}
= C = AB =
\begin{pmatrix} \hat{A}_0 \\ \hat{A}_1 \\ \vdots \\ \hat{A}_{m/b_{alg}} \end{pmatrix} B
= \begin{pmatrix} \hat{A}_0 B \\ \hat{A}_1 B \\ \vdots \\ \hat{A}_{m/b_{alg}} B \end{pmatrix}
\]

Thus, matrix-matrix multiplication can be implemented by computing each row panel of C,
Ĉ_i, by a call to the panel-matrix kernel:

    Case: m = b_alg, n large, k large.

      T(b_distr, m, n, k, r, c) =
        Duplicate panel A:                                             T_dup A(b_distr, m, k, r, c)
      + Local panel-matrix multiply to compute contributions to C:     2 Lmax(b_distr, m, n, r, c) γ
      + Summation of the contributions to C within rows. This
        requires a sum-to-one operation within columns:                T_sum C(b_distr, m, n, r, c)

The total cost is given by

    (m/b_alg) T(b_distr, b_alg, n, k, r, c).

3.5 Panel-axpy based algorithm


Observe that

\[
\left( C_0 \mid C_1 \mid \cdots \mid C_{n/b_{alg}} \right)
= \left( A_0 \mid A_1 \mid \cdots \mid A_{k/b_{alg}} \right)
\begin{pmatrix}
B_{0,0} & B_{0,1} & \cdots & B_{0,n/b_{alg}} \\
B_{1,0} & B_{1,1} & \cdots & B_{1,n/b_{alg}} \\
\vdots  & \vdots  & \ddots & \vdots \\
B_{k/b_{alg},0} & B_{k/b_{alg},1} & \cdots & B_{k/b_{alg},n/b_{alg}}
\end{pmatrix}
\]
\[
= \left( A_0 B_{0,0} + \cdots + A_{k/b_{alg}} B_{k/b_{alg},0} \;\mid\; \cdots \;\mid\;
   A_0 B_{0,n/b_{alg}} + \cdots + A_{k/b_{alg}} B_{k/b_{alg},n/b_{alg}} \right)
\]

Thus, each column panel of C, C_j, can be computed by k/b_alg calls to the panel-axpy kernel:

    Case: m large, n = b_alg, k = b_alg.

      T(b_distr, m, n, k, r, c) =
        Redistribute A as a multivector:            T_A→mv(b_distr, m, k, r, c)
      + Duplicate B to all nodes:                   T_dup B(b_distr, m, n, r, c)
      + Locally compute the contribution to C:      2 lmax(b_distr, m, p) b_alg² γ
      + Add the contribution to C:                  (b_alg/b_distr) [T_add C(b_distr, m, b_distr, r, c)
                                                    + lmax(b_distr, m, r) b_distr γ]

For computing all panels in this fashion, the cost becomes

    (n/b_alg)(k/b_alg) T(b_distr, m, b_alg, b_alg, r, c).

Alternatively, C and B can be partitioned by row blocks, and A into b_distr × b_distr blocks,
which leads to an algorithm that computes each row panel of C by k/b_alg calls to an
alternative panel-axpy kernel. This leads to a total cost of

    (m/b_alg)(k/b_alg) T(b_distr, b_alg, n, b_alg, r, c).
3.6 Panel-dot based algorithm
Observe that

\[
\begin{pmatrix}
C_{0,0} & \cdots & C_{0,n/b_{alg}} \\
\vdots  & \ddots & \vdots \\
C_{m/b_{alg},0} & \cdots & C_{m/b_{alg},n/b_{alg}}
\end{pmatrix}
= C = AB =
\begin{pmatrix} \hat{A}_0 \\ \vdots \\ \hat{A}_{m/b_{alg}} \end{pmatrix}
\left( B_0 \mid \cdots \mid B_{n/b_{alg}} \right)
= \begin{pmatrix}
\hat{A}_0 B_0 & \cdots & \hat{A}_0 B_{n/b_{alg}} \\
\vdots        & \ddots & \vdots \\
\hat{A}_{m/b_{alg}} B_0 & \cdots & \hat{A}_{m/b_{alg}} B_{n/b_{alg}}
\end{pmatrix}
\]

Thus, each block C_{i,j} can be computed as a panel-dot operation:

    Case: m = b_alg, n = b_alg, k large.

      T(b_distr, m, n, k, r, c) =
        Redistribute A as a multivector:            T_A→mv(b_distr, m, k, r, c)
      + Redistribute B as a multivector:            T_B→mv(b_distr, m, n, r, c)
      + Locally compute the contribution to C:      2 lmax(b_distr, m, p) b_alg² γ
      + Add the contribution to C:                  T_Sum2all(b_alg², p)

The total cost is

    (m/b_alg)(n/b_alg) T(b_distr, b_alg, b_alg, k, r, c).
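
The panel-dot blocking can be checked serially in the same way. The following NumPy sketch (ours, with arbitrary sizes) computes each b_alg × b_alg block C_{i,j} = Â_i B_j independently, which is the decomposition the panel-dot based algorithm parallelizes.

    import numpy as np

    def panel_dot_matmul(A, B, b_alg):
        """Compute C = A B block by block: C_{i,j} = A^_i B_j, where A^_i is a row
        panel of A and B_j is a column panel of B."""
        m, _ = A.shape
        _, n = B.shape
        C = np.empty((m, n))
        for i in range(0, m, b_alg):
            for j in range(0, n, b_alg):
                # Each block of C needs the full k dimension but only b_alg rows
                # of A and b_alg columns of B.
                C[i:i + b_alg, j:j + b_alg] = A[i:i + b_alg, :] @ B[:, j:j + b_alg]
        return C

    rng = np.random.default_rng(1)
    A = rng.standard_normal((128, 2048))
    B = rng.standard_normal((2048, 128))
    print(np.allclose(panel_dot_matmul(A, B, 64), A @ B))   # True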

4 Preliminary Results
In this section, we report preliminary performance results on the Cray T3E of implemen-
tations of the described algorithms. Our implementations were coded using the PLAPACK
library infrastructure, which naturally supports these kinds of algorithms. In addition, we
report results from our performance model.

4.1 Low-level parameters


In order to report results from our performance model, a number of parameters must first
be determined. These can be grouped into two sets: communication related parameters and
computation related parameters.
4.1.1 Communication
All PLAPACK communication is accomplished by calls to the Message-Passing Interface
(MPI). Independent of the algorithms, we determined the communication related parameters
by performing a simple "bounce-back" communication. From this, the latency (or startup
cost) was measured to equal α = 50 μsec., while the cost per byte communicated equaled
β = 0.45 nanosec. (for a transfer rate of 22 Mbytes per second).

    Constant   Value (sec per flop)   Rate (MFLOPS)
    γ          2.28 × 10⁻⁹            439
    γ          5.80 × 10⁻⁹            172
    γ          4.22 × 10⁻⁹            237
    γ          4.35 × 10⁻⁹            230
    γ          5.43 × 10⁻⁹            184
    γ          5.44 × 10⁻⁹            184

Table 5: Values of the local computational parameter γ, one per matrix shape, measured on
the Cray T3E. Notice that these parameters represent the performance of an early version
of the BLAS.

For the data movements that require packing, we measured cost per byte packed to be
 = 0:2 nanosec. Considerable improvements in the PLAPACK packing routines can still be
achieved and thus we expect this overhead to be reduced in the future.
4.1.2 Computation
Ultimately, it is the performance of the local 64-bit matrix-matrix multiplication implementation
that determines the asymptotic performance rate that can be achieved. Notice that
our algorithms inherently operate on matrices and panels, and thus the performance for the
different cases in Table 2 where "small" is taken to equal the algorithmic blocking size needs
to be determined. In our preliminary performance tests, we chose the algorithmic blocking
size to equal 128, which has, in the past, given us reasonable performance for a wide range
of algorithms. Given this blocking size, the performance of the local BLAS matrix-matrix
multiplication kernel is given in Table 5.
It is interesting to note that it is the panel-panel kernel that performs best. To those of
us experienced with high-performance architectures this is not a surprise: This is the case
that is heavily used by the LINPACK benchmark, and thus the first to be fully optimized.
There is really no technical reason why, for a block size of 128, the other shapes should not
perform equally well. Indeed, as an architecture matures, those cases also get optimized.
For example, on the Cray T3D the performance of the local matrix-matrix BLAS kernel is
not nearly as dependent on the shape.

4.2 Performance of the algorithms


In this section, we demonstrate both through the cost model and measured data how the
different algorithms perform as a function of shape. To do so, we fix one or two of the
dimensions to be large and vary the other dimension(s).
4.2.1 Dimensions m and n are large
In Figure 2, we show the predicted and measured performance of the various algorithms
when both m and n are fixed to be large.
First, let us comment on the asymptotic behavior. It is the case that for the panel-panel,
matrix-panel, and panel-matrix algorithms, as k gets large, communication and other
overhead eventually become a lower order term when compared to computation. Thus,
asymptotically, they should all approach a performance rate equal to the rate achieved by
the local matrix-matrix kernel. This argument does not hold for the panel-axpy and panel-dot
based algorithms: For each call to the parallel panel-axpy and panel-dot kernels, O(b_alg n/r)
or O(b_alg n/c) data is communicated before performing O(b_alg² n/p) floating point operations.
Thus, if b_alg, r, and c are fixed, communication does not become less significant, even if the
matrix dimensions are increased. All of these observations can be seen in the graphs.
[Figure 2 shows two plots of MFLOPS/node versus the matrix dimension k: model data on
the left and measured T3E data on the right, with one curve per algorithm, labeled LLS,
LSL, SLL, SSL, SLS, and LSS according to which of m, n, and k are Large or Small in the
kernel on which each algorithm is built.]

Figure 2: Performance of the various algorithms on a 16 processor configuration of the Cray
T3E. In this graph, two dimensions, m and n, are fixed to be large, while k is varied. A
distribution block size equal to the algorithmic block size of 128 was used to collect this
data. The graph on the left reports data derived from the model, while the graph on the
right reports measured data.
Next, notice from the graphs that the panel-panel based algorithm always outperforms
the others. A moment of thought leads us to conclude that this is natural, since the panel-
panel kernel is inherently designed for the case where m and n are large, and k is small.
By comparison, a matrix-panel based algorithm will inherently perform badly when k is
small, since in that algorithm parallelism is derived from the distribution of A, which is only
distributed to one or a few columns of nodes when k is small. Similar arguments explain the
poor behavior of the other algorithms when k is small.
Finally, comparing the graph derived from the model to the graph reporting measured
data, we notice that while the model is clearly not perfect, it does capture qualitatively the
measured data. Also, the predicted performance reasonably reflects the measured performance,
especially for the panel-panel, matrix-panel, and panel-matrix based algorithms.
4.2.2 Dimensions m and k are large
In Fig. 3, we show the predicted and measured performance of the various algorithms when
both m and k are fixed to be large.
Again, we start by commenting on the asymptotic behavior. As for the case where m
and k are fixed to be large, for the panel-panel, matrix-panel, and panel-matrix algorithms,
as n gets large, communication and other overhead eventually become a lower order term
when compared to computation. Thus, asymptotically, they should and do all approach a
performance rate equal to the rate achieved by the local matrix-matrix kernel. Again, this
argument does not hold for the panel-axpy and panel-dot based algorithms. All of these
observations can be seen in the graphs.
[Figure 3 shows two plots of MFLOPS/node versus the matrix dimension n: model data on
the left and measured T3E data on the right, with curves labeled LLS, LSL, SLL, SSL, SLS,
and LSS.]

Figure 3: Performance of the various algorithms on a 16 processor configuration of the Cray
T3E. In this graph, two dimensions, m and k, are fixed to be large, while n is varied. A
distribution block size equal to the algorithmic block size of 128 was used to collect this
data. The graph on the left reports data derived from the model, while the graph on the
right reports measured data.
Next, notice from the graphs that the matrix-panel based algorithm initially outperforms
the others. This is natural, since the matrix-panel kernel is inherently designed for the case
where m and k are large, and n is small. By comparison, a panel-panel or panel-matrix based
algorithm will inherently perform badly when n is small, since in those algorithms parallelism
is derived from the distribution of C and B, respectively, which are only distributed to one
or a few columns of nodes when n is small. Similar arguments explain the poor behavior of
the other algorithms when n is small.
Notice that eventually the panel-panel based algorithm outperforms the matrix-panel
algorithm. This is solely due to the fact that the implementation of the local matrix-matrix
multiply is more efficient for the panel-panel case.
Again, the model is clearly not perfect, but it does capture qualitatively the measured data.
For the panel-panel, matrix-panel, and panel-matrix based algorithms, the model is quite
good: for example, it predicts the cross-over point where the panel-panel algorithm becomes
superior to within five percent.
4.2.3 Dimension n is large
In Fig. 4, we present results for one of the cases where only one dimension is fixed to be
large. This time, we would expect one of the panel-axpy based algorithms to be superior,
at least when the other dimensions are small. While it is clear from the graphs that the
other algorithms perform poorly when m and k are small, due to time pressures, we did not
collect data for the case where m and k were extremely small (much less than the algorithmic
blocking size), and thus at this moment we must declare the evidence inconclusive.

[Figure 4 shows two plots of MFLOPS/node versus the matrix dimensions m = k: model
data on the left and measured T3E data on the right, with curves labeled LLS, LSL, SLL,
SSL, SLS, and LSS.]

Figure 4: Performance of the various algorithms on a 16 processor configuration of the Cray
T3E. In this graph, one dimension, n, is fixed to be large, while m = k are varied. A
distribution block size equal to the algorithmic block size of 128 was used to collect this
data. The graph on the left reports data derived from the model, while the graph on the
right reports measured data.

5 Conclusion

In this paper, we have systematically presented algorithms, analysis, and performance results
for a class of parallel matrix algorithms. This class of algorithms allows for high performance
to be attained across a range of shapes of matrices.
As mentioned, the results presented in the previous section are extremely preliminary.
They are literally the first results computed with the model and the first results for implementations
of the panel-axpy and panel-dot based algorithms.
The good news is that, certainly qualitatively, the model reflects the measured performance.
Even quantitatively, the initial results look good. Finally, there were no surprises:
both the model and the empirical results reflected what we expected from experience.
A lot of further experimentation must and will be performed in order to paint a complete
picture:

- Experimental and model data must be collected for various architectures. Of particular
  interest will be those architectures where the local matrix-matrix multiplication kernel
  is more mature, yielding a more uniform performance regardless of shape.
- The algorithmic blocking size was chosen to equal an algorithmic blocking size that has
  performed well for us across a range of algorithms. However, this choice is somewhat
  skewed towards the blocking size that makes the local panel-panel multiply efficient. In
  particular, we would expect that using a much larger algorithmic blocking size for the
  panel-axpy and panel-dot based algorithms would improve the ratio of computation to
  communication and thus the parallel performance.
- Both the model and PLAPACK support the case where the distribution and algorithmic
  blockings are not equal. Much of the jagged behavior in the curves in the graphs is
  due to load imbalance, which can be reduced by decreasing the distribution blocking
  size. By keeping the algorithmic blocking size large, the asymptotic performance will
  not change.
- The model is greatly affected by the choice of implementation for the collective communication
  algorithms on a given architecture. Our results for the model were derived by
  assuming that commonly used algorithms were implemented. However, optimizations
  (or lack thereof) on different architectures may make this assumption invalid.
- The differences between the algorithms can be expected to become much more pronounced
  when larger numbers of processors are utilized. Thus, we must collect data from larger
  machine configurations.
We conclude by remarking that if the model were perfect, it could be used to pick between
the different algorithms, thus creating a hybrid algorithm. Clearly, too many unknown
parameters make it difficult to tune the model for different vendors. Thus there is a need
for simple heuristics which are not as sensitive to imperfections in the model. If the vendors
supplied local matrix-matrix multiply kernels that performed reasonably well for all shapes
of matrices encountered in this paper, this task would be greatly simplified. In particular,
choosing between the algorithms that target the case where at least two dimensions are large
would become a matter of determining which matrix has the largest number of elements.
The algorithm that keeps that matrix stationary, communicating data from the other matrices,
should perform best, since parallelism is derived from the distribution of that stationary
matrix. Similarly, choosing between the algorithms that target the case where only one
dimension is large would become a matter of determining which matrix has the least number
of elements. The algorithm that redistributes the other two matrices should perform best,
since parallelism is derived from the length of the matrices that are redistributed. Thus,
what would be left would be to determine whether the shapes of the matrices are such that at
least two dimensions are large, or only one dimension is large. Investigation of this heuristic
is left as future research.
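
As a sketch of how such a heuristic might look in code (our illustration only; the threshold separating "large" from "small" dimensions is an arbitrary assumption), consider:

    def choose_algorithm(m, n, k, large=1024):
        """Shape-based heuristic sketched above: if at least two dimensions are
        large, keep the matrix with the most elements stationary; otherwise pick
        the kernel family keyed on the smallest matrix, which is redistributed
        the least."""
        stationary = {"panel-panel": m * n,    # C stationary
                      "matrix-panel": m * k,   # A stationary
                      "panel-matrix": k * n}   # B stationary
        if sum(d >= large for d in (m, n, k)) >= 2:
            return max(stationary, key=stationary.get)
        smallest = min(stationary, key=stationary.get)
        return {"panel-panel": "panel-dot",     # C smallest: m, n small, k large
                "matrix-panel": "panel-axpy",   # A smallest: m, k small, n large
                "panel-matrix": "panel-axpy"}[smallest]   # B smallest: k, n small

    print(choose_algorithm(4096, 4096, 128))   # panel-panel
    print(choose_algorithm(128, 128, 4096))    # panel-dot

The same selection could equally be driven by the full cost model once its parameters have been calibrated for a given machine.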

Acknowledgments
We are indebted to the rest of the PLAPACK team (Philip Alpatov, Dr. Greg Baker, Dr.
Carter Edwards, James Overfelt, and Dr. Yuan-Jye Jason Wu) for providing the infrastruc-
ture that made this research possible.
Primary support for this project comes from the Parallel Research on Invariant Subspace
Methods (PRISM) project (ARPA grant P-95006) and the Environmental Molecular Sciences
construction project at Pacific Northwest National Laboratory (PNNL) (PNNL is a multipro-
gram national laboratory operated by Battelle Memorial Institute for the U.S. Department
of Energy under Contract DE-AC06-76RLO 1830). Additional support for PLAPACK came
from the NASA High Performance Computing and Communications Program's Earth and
Space Sciences Project (NRA Grants NAG5-2497 and NAG5-2511), and the Intel Research
Council.
We gratefully acknowledge access to equipment provided by the National Partnership for
Advanced Computational Infrastructure (NPACI), The University of Texas Computation
Center's High Performance Computing Facility, and the Texas Institute for Computational
and Applied Mathematics' Distributed Computing Laboratory.

Additional information
For additional information, visit the PLAPACK web site:
http://www.cs.utexas.edu/users/plapack

References
[1] Agarwal, R. C., F. Gustavson, and M. Zubair, "A high-performance matrix multiplication
algorithm on a distributed memory parallel computer using overlapped communication," IBM
Journal of Research and Development, Volume 38, Number 6, 1994.
[2] Agarwal, R. C., S. M. Balle, F. G. Gustavson, M. Joshi, and P. Palkar, "A 3-Dimensional
Approach to Parallel Matrix Multiplication," IBM Journal of Research and Development,
Volume 39, Number 5, pp. 1-8, Sept. 1995.
[3] Alpatov, P., G. Baker, C. Edwards, J. Gunnels, G. Morrow, J. Overfelt, Robert van de Geijn,
and J. Wu, "PLAPACK: Parallel Linear Algebra Package - Design Overview," Proceedings of
SC97, to appear.
[4] Alpatov, P., G. Baker, C. Edwards, J. Gunnels, G. Morrow, J. Overfelt, Robert van de Geijn,
and J. Wu, "PLAPACK: Parallel Linear Algebra Package," Proceedings of the SIAM Parallel
Processing Conference, 1997.
[5] Anderson, E., Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. DuCroz, A. Greenbaum, S.
Hammarling, A. McKenney, and D. Sorensen, "Lapack: A Portable Linear Algebra Library
for High Performance Computers," Proceedings of Supercomputing '90, IEEE Press, 1990, pp.
1-10.
[6] Barnett, M., S. Gupta, D. Payne, L. Shuler, R. van de Geijn, and J. Watts, "Interprocessor
Collective Communication Library (InterCom)," Scalable High Performance Computing
Conference 1994.
[7] Cannon, L. E., A Cellular Computer to Implement the Kalman Filter Algorithm, Ph.D. Thesis
(1969), Montana State University.
[8] Chtchelkanova, A., J. Gunnels, G. Morrow, J. Overfelt, and R. van de Geijn, "Parallel Implementation
of BLAS: General Techniques for Level 3 BLAS," TR-95-40, Department of Computer
Sciences, University of Texas, Oct. 1995.
[9] Choi, J., J. J. Dongarra, R. Pozo, and D. W. Walker, "Scalapack: A Scalable Linear Algebra
Library for Distributed Memory Concurrent Computers," Proceedings of the Fourth Symposium
on the Frontiers of Massively Parallel Computation, IEEE Comput. Soc. Press, 1992,
pp. 120-127.
[10] Choi, J., J. J. Dongarra, and D. W. Walker, "Level 3 BLAS for distributed memory concurrent
computers," CNRS-NSF Workshop on Environments and Tools for Parallel Scientific
Computing, Saint Hilaire du Touvet, France, Sept. 7-8, 1992. Elsevier Science Publishers, 1992.
[11] Choi, J., J. J. Dongarra, and D. W. Walker, "PUMMA: Parallel Universal Matrix Multiplication
Algorithms on distributed memory concurrent computers," Concurrency: Practice and
Experience, Vol. 6(7), 543-570, 1994.
[12] Dongarra, J. J., J. Du Croz, S. Hammarling, and I. Duff, "A Set of Level 3 Basic Linear
Algebra Subprograms," TOMS, Vol. 16, No. 1, pp. 1-16, 1990.
[13] Dongarra, J. J., R. A. van de Geijn, and D. W. Walker, "Scalability Issues Affecting the Design
of a Dense Linear Algebra Library," Journal of Parallel and Distributed Computing, Vol. 22,
No. 3, Sept. 1994, pp. 523-537.
[14] Edwards, C., P. Geng, A. Patra, and R. van de Geijn, "Parallel matrix distributions: have we
been doing it all wrong?," Tech. Report TR-95-40, Dept. of Computer Sciences, The University
of Texas at Austin, 1995.
[15] Fox, G. C., M. A. Johnson, G. A. Lyzenga, S. W. Otto, J. K. Salmon, and D. W. Walker,
Solving Problems on Concurrent Processors, Vol. 1, Prentice Hall, Englewood Cliffs, N.J.,
1988.
[16] Fox, G., S. Otto, and A. Hey, "Matrix algorithms on a hypercube I: matrix multiplication,"
Parallel Computing 3 (1987), pp. 17-31.
[17] Gropp, W., E. Lusk, and A. Skjellum, Using MPI: Portable Programming with the Message-Passing
Interface, The MIT Press, 1994.
[18] Ho, C.-T., and S. L. Johnsson, "Distributed Routing Algorithms for Broadcasting and Personalized
Communication in Hypercubes," in Proceedings of the 1986 International Conference on
Parallel Processing, pages 640-648, IEEE, 1986.
[19] Huss-Lederman, S., E. Jacobson, and A. Tsao, "Comparison of Scalable Parallel Matrix Multiplication
Libraries," in Proceedings of the Scalable Parallel Libraries Conference, Starkville,
MS, Oct. 1993.
[20] Huss-Lederman, S., E. Jacobson, A. Tsao, and G. Zhang, "Matrix Multiplication on the Intel
Touchstone DELTA," Concurrency: Practice and Experience, Vol. 6(7), Oct. 1994, pp. 571-594.
[21] Lin, C., and L. Snyder, "A Matrix Product Algorithm and its Comparative Performance on
Hypercubes," in Proceedings of the Scalable High Performance Computing Conference (Stout, Q.,
and M. Wolfe, eds.), IEEE Press, Los Alamitos, CA, 1992, pp. 190-3.
[22] Li, J., A. Skjellum, and R. D. Falgout, "A Poly-Algorithm for Parallel Dense Matrix Multiplication
on Two-Dimensional Process Grid Topologies," Concurrency: Practice and Experience,
Vol. 10, ???
[23] van de Geijn, R., Using PLAPACK: Parallel Linear Algebra Package, The MIT Press, 1997.
[24] van de Geijn, R., and J. Watts, "SUMMA: Scalable Universal Matrix Multiplication Algorithm,"
Concurrency: Practice and Experience, Vol. 9(4), 255-274 (April 1997).
