Matrix Multiplication

Matrix Multiplications and Collective Communication

Michael Hanke
School of Engineering Sciences

Parallel Computations for Large-Scale Problems I

Chapter 5
Outline

1 Introduction
2 The Recursive Doubling Algorithm
3 Matrix-Vector Multiplication
4 Matrix-Matrix Multiplication
5 An Application: Google
Basic Matrix Operations

• Let A and B be two matrices of dimensions M × N and K × L
• Matrix addition C = A + B (defined if M = K and N = L):

    c_{m,n} = a_{m,n} + b_{m,n}   (1 ≤ m ≤ M, 1 ≤ n ≤ N)

• Matrix multiplication C = AB (defined if N = K):

    c_{m,l} = \sum_{n=1}^{N} a_{m,n} b_{n,l}

  c_{m,l} is the product of the m-th row of A and the l-th column of B
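The two definitions above can be sketched directly in plain Python (the slides use MATLAB-style pseudocode; this is an equivalent 0-based illustration, not part of the original):

```python
# Sketch of the two basic operations with 0-based indexing,
# unlike the 1-based notation on the slide.
def mat_add(A, B):
    M, N = len(A), len(A[0])
    return [[A[m][n] + B[m][n] for n in range(N)] for m in range(M)]

def mat_mul(A, B):
    # c[m][l] = sum_n a[m][n] * b[n][l]
    M, N, L = len(A), len(B), len(B[0])
    return [[sum(A[m][n] * B[n][l] for n in range(N)) for l in range(L)]
            for m in range(M)]
```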
Addition of Matrices

for n = 1:N
    for m = 1:M
        c(m,n) = a(m,n)+b(m,n);
    end
end

• Embarrassingly parallel, all data access purely local
• Any data distribution will do, no overlap necessary
Multiplication of Matrices

• Assume that C is initialized to zero.

for l = 1:L
    for m = 1:M
        for n = 1:N
            c(m,l) = c(m,l)+a(m,n)*b(n,l);
        end
    end
end

• The innermost loop implies a global data dependency for every c_{m,l}
The Challenge

Find an algorithm and a data distribution such that
1 duplication of data on different processes is avoided;
2 excessive data movement (communication!) is avoided.
What Will Come

Strategy
Consider the loops from the innermost to the outermost one.
1 The innermost (n-) loop is the scalar product of two vectors.
2 The n- and m-loops together form a matrix-vector multiplication for each l.
3 Finally, find an implementation of the complete algorithm.
Scalar Product of Two Vectors

• The scalar product of two vectors x, y ∈ R^M is defined as

    s = \langle x, y \rangle = x^T y = \sum_{m=1}^{M} x_m y_m

• Assume some data distribution on P processors such that x and y are distributed identically
• Then it holds

    s = \sum_{m=1}^{M} x_m y_m = \sum_{p=0}^{P-1} \underbrace{\sum_{i \in I_p} x_i y_i}_{\omega_p} = \sum_{p=0}^{P-1} \omega_p

• Assume that all local computations are done such that ω_p is available
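The splitting into local partial sums ω_p can be sketched as follows (a Python illustration; the block index sets I_p are an assumption here, any disjoint cover of the indices works):

```python
# Split <x, y> into per-processor partial sums omega_p using a
# block distribution of the index range over P processors.
def partial_sums(x, y, P):
    M = len(x)
    bounds = [M * p // P for p in range(P + 1)]
    return [sum(x[i] * y[i] for i in range(bounds[p], bounds[p + 1]))
            for p in range(P)]

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [6.0, 5.0, 4.0, 3.0, 2.0, 1.0]
omegas = partial_sums(x, y, 4)
s = sum(omegas)   # the global summation step discussed next
```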
Global Summation

Problem
Compute the sum s on P processors,

    s = \omega_0 + \omega_1 + \cdots + \omega_{P-1}

• Sequential complexity: O(P) operations
• The MPI solution to this problem: MPI_Reduce
• How could it be realized?
Global Summation: Idea

Idea (for P = 8): sum the terms pairwise, then the pair sums pairwise, and so on,

    s = ((\omega_0 + \omega_1) + (\omega_2 + \omega_3)) + ((\omega_4 + \omega_5) + (\omega_6 + \omega_7))
Global Summation

[Figure: recursive-doubling (butterfly) communication pattern for the global summation]
Realization

• Assume that P = 2^D is a power of 2. On process p, the program reads:

s = ω_p;
for d = 0:D-1
    q = bitflip(p,d);
    send(s,q);
    receive(h,q);
    s = s+h;
end

• The bitflip operation inverts bit number d in the binary representation of p.
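The program above can be simulated on a single machine; bitflip(p,d) is the XOR p ^ 2^d, and the send/receive pair collapses into a list lookup (a sketch only; real code would use MPI point-to-point calls):

```python
# Simulation of the recursive-doubling sum for P = 2^D processes.
# In round d every process p exchanges its partial sum with partner
# p XOR 2^d and adds it in.
def recursive_doubling_sum(omega):
    P = len(omega)          # must be a power of two
    s = list(omega)
    d = 1
    while d < P:
        s = [s[p] + s[p ^ d] for p in range(P)]
        d *= 2
    return s                # every process now holds the full sum
```

After D = log2(P) rounds every entry equals the global sum, which is why the MPI counterpart is MPI_Allreduce rather than MPI_Reduce.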
Comments

• After execution of the program, every processor contains s
• This algorithm is called Recursive Doubling (butterfly algorithm)
• The basic algorithm can be applied in many different contexts (examples: barriers, broadcast)
• Since every processor contains the final sum, its MPI counterpart is MPI_Allreduce
• The algorithm as given may deadlock. Use MPI_Sendrecv
Modification For Any Number of Processes

• Let 2^D < P < 2^{D+1}

s = ω_p;
if p >= 2^D
    send(s,bitflip(p,D));
end
if p < P-2^D
    receive(h,bitflip(p,D));
    s = s+h;
end
if p < 2^D
    for d = 0:D-1
        send(s,bitflip(p,d));
        receive(h,bitflip(p,d));
        s = s+h;
    end
end
if p < P-2^D
    send(s,bitflip(p,D));
end
if p >= 2^D
    receive(s,bitflip(p,D));
end
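A simulated version of this three-phase scheme (fold the excess processes into the lower 2^D, run the butterfly there, copy the result back out); the sequential loops below stand in for what happens concurrently on the processes:

```python
# General-P recursive doubling, 2^D < P <= 2^(D+1).
def general_sum(omega):
    P = len(omega)
    D = P.bit_length() - 1          # largest D with 2^D <= P
    lo = 1 << D
    s = list(omega)
    # fold-in phase: p >= 2^D sends to bitflip(p, D) = p - 2^D
    for p in range(lo, P):
        s[p - lo] += s[p]
    # butterfly on the lower 2^D processes
    d = 1
    while d < lo:
        s[:lo] = [s[p] + s[p ^ d] for p in range(lo)]
        d *= 2
    # copy-back phase: p < P - 2^D sends to p + 2^D
    for p in range(P - lo):
        s[p + lo] = s[p]
    return s
```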
Performance Analysis

• Obviously, T_1^* = (2M − 1) t_a
• The local computation time is t_{comp,1} = (2 I_p − 1) t_a
• Time for recursive doubling:

    t = D (2 (t_{startup} + 1 \cdot t_{data}) + t_a)

• Using a load-balanced data distribution, i.e., I_p = M/P, it holds

    T_P = (2 I_p − 1) t_a + \log P (2 (t_{startup} + 1 \cdot t_{data}) + t_a)
Matrix-Vector Multiplication

• For a given M × N-matrix A and an N-dimensional vector x, y = Ax is given by

    y_m = \sum_{n=1}^{N} a_{m,n} x_n = \langle a_m, x \rangle

  where a_m is the m-th row of A
• Use a load-balanced data distribution on R = P × Q processors in an array topology to store A (no overlap!)
• x must have the same data distribution as the rows of A
• Similarly, y has the same data distribution as the columns of A
Data Distribution

[Figure: block distribution of A on a P × Q processor mesh, with x and y distributed accordingly]

There is additional memory needed to store x and y: in the sequential version, M + N memory locations are necessary; in the parallel version, QM + PN locations are necessary.
Algorithm

• Once the data are distributed, all individual components y_m can be computed in parallel using the Recursive Doubling Algorithm
• Do not forget to block data exchanges to avoid excessive startup times!
• The performance analysis is similar to the one given before
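The distributed matrix-vector product can be sketched as follows: processor (p, q) owns one block of A and the matching slice of x, and the sum over q is the per-row recursive-doubling (allreduce) step. The block boundaries are an assumed load-balanced splitting; Python's built-in sum stands in for the collective:

```python
# y = A x on a simulated P x Q processor mesh.
def mesh_matvec(A, x, P, Q):
    M, N = len(A), len(A[0])
    rsplit = [M * p // P for p in range(P + 1)]   # row-block boundaries
    csplit = [N * q // Q for q in range(Q + 1)]   # column-block boundaries
    y = [0] * M
    for p in range(P):
        for m in range(rsplit[p], rsplit[p + 1]):
            # inner sum: local partial product on processor (p, q);
            # outer sum: reduction across mesh row p
            y[m] = sum(sum(A[m][n] * x[n]
                           for n in range(csplit[q], csplit[q + 1]))
                       for q in range(Q))
    return y
```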
Matrix-Matrix Multiplication

• Let A be an M × N-matrix, B an N × K-matrix. Then C = AB is

    c_{m,k} = \sum_{n=1}^{N} a_{m,n} b_{n,k} = \langle a^m, b_k \rangle

  where a^m is the m-th row of A and b_k is the k-th column of B
• Why not simply generalize matrix-vector multiplication?
  We would need excessive additional memory (QMK + PNK)
Potential

Observation
If M = N = K, the matrix multiplication requires M^3 operations on M^2 data. So we expect a good potential for parallelization.

In the following we assume M = N = K and R = P × P.
A Simple Algorithm

• Assume that we have P = M
• Assign (a^m, b_k) to processor (m, k)
• Compute c_{m,k} on processor (m, k)
• Advantages:
  • Computation time t_{comp} = (2M − 1) t_a
  • No communication at the computation stage
• Drawbacks:
  • A lot of communication for initializing,
  • or, large memory requirements
Cannon's Algorithm

• In a first step, assume P = M
• Data distribution: processor (m − 1, n − 1) holds a_{m,n}, b_{m,n}, c_{m,n}

Example (M = 3): c_{3,2} requires the third row of A (a_{31}, a_{32}, a_{33}) and the second column of B (b_{12}, b_{22}, b_{32}).
Partial Products

• On the "diagonal" processors (m − 1, m − 1), parts of the sum

    c_{m,m} = a_{m,1} b_{1,m} + \cdots + a_{m,m} b_{m,m} + \cdots + a_{m,M} b_{M,m}

  are available: e.g., processor (1, 1) holds a_{22} and b_{22} and can form the term a_{22} b_{22} of c_{2,2}
• Shifting the rows of A cyclically to the right and those of B cyclically downwards provides the next term: processor (1, 1) then holds a_{21} and b_{12} and forms a_{21} b_{12}
• Repeating this cyclic exchange once again completes the calculation of the diagonal elements
The Outer Diagonal Elements

• Consider element c_{1,2},

    c_{1,2} = a_{1,1} b_{1,2} + a_{1,2} b_{2,2} + \cdots

• While a_{1,2} is available, b_{2,2} is not
• This can be corrected by shifting the second column of B (cyclically) upwards
• In order not to change the indices on the diagonal, the second row of A must be shifted (cyclically) to the left
Initial Data Distribution

• More generally, the matrices A and B must be initialized as follows:
  • The m-th row of A is shifted cyclically m − 1 positions to the left
  • The n-th column of B is shifted cyclically n − 1 positions upwards

        A                    B
    a11 a12 a13        b11 b22 b33
    a22 a23 a21        b21 b32 b13
    a33 a31 a32        b31 b12 b23
Algorithm

1 Initially, processor (p, q) has elements a_{p+1,q+1}, b_{p+1,q+1}
2 Shift the rows of A and the columns of B into the structure described above
3 for k = 1, . . . , N:
  1 Multiply the local values on each processor
  2 Shift the rows of A and the columns of B cyclically by one processor
4 If necessary: undo step 2 (restore the initial distribution)

Remarks:
• If the number of processors is much less than N^2, a blocked version will do.
• The last shift in step 3.2 can be omitted.
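A compact NumPy simulation of these steps, assuming one matrix element per "processor" (P = M) and the left/up skew shown on the previous slide; np.roll stands in for the cyclic shifts between neighboring processors:

```python
import numpy as np

# Cannon's algorithm, simulated with one element per processor.
def cannon(A, B):
    M = A.shape[0]
    A, B = A.copy(), B.copy()
    # step 2: skew - shift row m of A left by m, column n of B up by n
    for m in range(M):
        A[m, :] = np.roll(A[m, :], -m)
    for n in range(M):
        B[:, n] = np.roll(B[:, n], -n)
    C = np.zeros_like(A)
    for _ in range(M):                    # step 3, repeated N = M times
        C += A * B                        # 3.1: local product everywhere
        A = np.roll(A, -1, axis=1)        # 3.2: rows of A one step left,
        B = np.roll(B, -1, axis=0)        #      columns of B one step up
    return C
```

After the skew, processor (i, j) holds a_{i,(i+j) mod M} and b_{(i+j) mod M,j}, so the M multiply-and-shift rounds accumulate exactly the M terms of c_{i,j}.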
Efficiency Analysis

Assume: R = P × P and P divides N, with block size I_p = N/P.

Step 2: Use red-black communication

    t_2 = 4 (t_{startup} + I_p^2 t_{data})

Step 3.1: t_{3,1} = 2 I_p^3 t_a
Step 3.2: t_{3,2} = 4 (t_{startup} + I_p^2 t_{data})

    T_R = T_{P×P} = 2 P I_p^3 t_a + 4 (P + 1)(t_{startup} + I_p^2 t_{data})
    T_s^* = 2 N^3 t_a
Speedup

    S_R = \frac{2 N^3 t_a}{2 P I_p^3 t_a + 4 (P + 1)(t_{startup} + I_p^2 t_{data})}
        = \frac{2 N^3 t_a}{2 P (N/P)^3 t_a + 4 (P + 1)(t_{startup} + (N/P)^2 t_{data})}
        = R \cdot \frac{1}{1 + 2 \frac{R(\sqrt{R}+1)}{N^3} \frac{t_{startup}}{t_a} + 2 \frac{P+1}{N} \frac{t_{data}}{t_a}}
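As a sanity check, the closed form in R can be compared numerically against T_s*/T_R computed from the definitions; the parameter values below are invented for illustration:

```python
from math import sqrt, isclose

def speedup_direct(N, P, ta, tstartup, tdata):
    # T_s* / T_R from the definitions (assumes P divides N)
    Ip = N // P
    TR = 2 * P * Ip**3 * ta + 4 * (P + 1) * (tstartup + Ip**2 * tdata)
    return 2 * N**3 * ta / TR

def speedup_closed(N, P, ta, tstartup, tdata):
    # the closed form in R = P^2 from the slide
    R = P * P
    return R / (1 + 2 * R * (sqrt(R) + 1) / N**3 * tstartup / ta
                  + 2 * (P + 1) / N * tdata / ta)

s1 = speedup_direct(256, 8, 1.0, 100.0, 2.0)
s2 = speedup_closed(256, 8, 1.0, 100.0, 2.0)
```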
Other Algorithms

There are a number of other algorithms available which have a better efficiency than Cannon's algorithm:
• Strassen's algorithm reduces the complexity of matrix multiplication from O(N^3) to O(N^{2.81...}). This approach can be refined to complexity O(N^{2.376...})
• A refined communication strategy on a differently organized processor topology was proposed by Dekel, Nassimi, and Sahni (DNS algorithm)
Google's Problem

The birth of Google
Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd: The PageRank citation ranking: Bringing order to the web. Technical Report SIDL-WP-1999-0120, http://dbpubs.stanford.edu/pub/1999-66

A search engine faces two problems:
1 Find all web pages which contain the search phrase (more generally: satisfy some search criterion)
2 Among all these pages, find the most relevant

Ad 1: Web crawler
Ad 2: PageRank citation ranking
Page Ranking

• Every web page has a number of forward links (pointers to other web pages) and a number of backlinks (web pages pointing to the given page)
• A web page seems to be more important if the number of backlinks is large
• Simply counting the number of backlinks may not be sufficient: a backlink with high importance should have a higher weight
• This is a recursive definition: a page is important if its backlinks are important.
• But what is the root of this recursion? The web does not have a root :-(
Definition of PageRank

• Let m be a web page
• Let F_m be the set of pages m points to
• Let B_m be the set of pages that point to m
• Let N_m = |F_m| be the number of forward links, and c be a normalization factor
• Then we define PageRank:

    R_m = c \sum_{n \in B_m} \frac{R_n}{N_n}

• Actually, this definition is slightly simplified.
Reformulation of PageRank

• Define a matrix A by

    a_{mn} = \begin{cases} 1/N_n, & \text{if there is a link from } n \text{ to } m, \\ 0, & \text{otherwise} \end{cases}

  (so that \sum_n a_{mn} R_n reproduces the sum over the backlinks B_m)
• Let R = (R_1, . . . , R_M)^T be the PageRank vector. Then it holds

    R = cAR
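The matrix form can be checked on a tiny invented link graph (three pages; the links and the test vector R are made up for illustration): the product AR must coincide with the componentwise backlink sum from the previous slide.

```python
# links[m] = pages that page m points to (an invented 3-page web)
links = {0: [1, 2], 1: [2], 2: [0]}
M = len(links)
N_out = {m: len(links[m]) for m in links}   # forward-link counts N_m

# a_{mn} = 1/N_n if n links to m: column n distributes n's rank
# evenly over the pages n points to
A = [[0.0] * M for _ in range(M)]
for n, targets in links.items():
    for m in targets:
        A[m][n] = 1.0 / N_out[n]

R = [0.2, 0.3, 0.5]                          # an arbitrary rank vector
AR = [sum(A[m][n] * R[n] for n in range(M)) for m in range(M)]
```

Each column of A sums to 1, so AR redistributes the total rank without changing its sum.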
PageRank

Definition
The PageRank of a given connectivity matrix A is an eigenvector R corresponding to the largest eigenvalue c of A.

Remark: Actually, A is a so-called sparse matrix, i.e., a matrix which contains very many zero entries. For the time being, we neglect this property. In connection with algorithms on graphs, we will consider sparse matrices.
The Power Method For Eigenvalue Problems

• Given a square matrix A and an initial guess x^{[0]} ≠ 0
• Form the sequence x^{[k+1]} = A x^{[k]}, k = 0, 1, . . .
• It can be shown that, for many matrices A and almost all initial guesses x^{[0]}, the sequence {x^{[k]} / \|x^{[k]}\|} converges towards an eigenvector to the largest eigenvalue c of A
• An estimate c_k of c can be obtained by

    c_k = \frac{\langle x^{[k+1]}, x^{[k]} \rangle}{\langle x^{[k]}, x^{[k]} \rangle}
Algorithm

% Let x = x0 be chosen
err = tol;
c = 0;
while err >= tol*abs(c)
    nrm = 1/sqrt(<x,x>);   % recursive doubling
    x = nrm*x;
    y = A*x;               % matrix-vector multiplication
    cnew = <y,x>;          % recursive doubling
    err = abs(cnew-c);
    c = cnew;
    x = y;                 % vector transposition
end
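A sequential Python transcription of this iteration (the parallel version would replace the dot products by recursive doubling and the product A*x by the distributed matrix-vector multiplication):

```python
from math import sqrt

def power_method(A, x, tol=1e-12):
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    matvec = lambda M, v: [dot(row, v) for row in M]
    err, c = tol, 0.0
    while err >= tol * abs(c):
        nrm = 1.0 / sqrt(dot(x, x))     # normalize x
        x = [nrm * xi for xi in x]
        y = matvec(A, x)
        cnew = dot(y, x)                # Rayleigh-quotient estimate of c
        err = abs(cnew - c)
        c = cnew
        x = y
    return c, x
```

On a matrix with a clearly dominant eigenvalue the estimate c converges to that eigenvalue and x to a corresponding eigenvector.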
Vector Transposition

• Assume a P × Q processor mesh and A distributed accordingly
• Let y be distributed in the same way as the columns of A
• The simple case: P = Q and the row and column data distributions are equal:
  • MPI_Alltoall can be used to transpose y
• The general case is much more involved. Essentially, it is equivalent to one recursive doubling step
Optimization of Parallel Algorithms

• In a sequential algorithm, the program will run faster if the consecutive steps are optimized individually.
• Hence, optimizing a sequential algorithm is a local problem.
• In a parallel environment, for a given algorithm the execution time depends both on the computation time and the data distribution.
• A certain data distribution may be optimal at one step of the algorithm while suboptimal at another.
• Thus, optimizing a parallel program is a global problem!
