Matrix Multiplication

Matrix Multiplications and Collective Communication

Michael Hanke
School of Engineering Sciences

Parallel Computations for Large-Scale Problems I

Chapter 5
Outline

1 Introduction
2 The Recursive Doubling Algorithm
3 Matrix-Vector Multiplication
4 Matrix-Matrix Multiplication
5 An Application: Google
Basic Matrix Operations

• Let A and B be two matrices of dimensions M × N and K × L
• Matrix addition C = A + B (defined if M = K and N = L):

    c_{m,n} = a_{m,n} + b_{m,n}   (1 ≤ m ≤ M, 1 ≤ n ≤ N)

• Matrix multiplication C = AB (defined if N = K):

    c_{m,l} = \sum_{n=1}^{N} a_{m,n} b_{n,l}

  c_{m,l} is the product of the m-th row of A and the l-th column of B
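The two definitions above can be sketched directly in plain Python (the slides use MATLAB-style pseudocode; this is an equivalent 0-based illustration, not part of the original):

```python
# Sketch of the two basic operations with 0-based indexing,
# unlike the 1-based notation on the slide.
def mat_add(A, B):
    M, N = len(A), len(A[0])
    return [[A[m][n] + B[m][n] for n in range(N)] for m in range(M)]

def mat_mul(A, B):
    # c[m][l] = sum_n a[m][n] * b[n][l]
    M, N, L = len(A), len(B), len(B[0])
    return [[sum(A[m][n] * B[n][l] for n in range(N)) for l in range(L)]
            for m in range(M)]
```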
Addition of Matrices

for n = 1:N
    for m = 1:M
        c(m,n) = a(m,n)+b(m,n);
    end
end

• Embarrassingly parallel, all data access purely local
• Any data distribution will do, no overlap necessary
Multiplication of Matrices

• Assume that C is initialized to zero.

for l = 1:L
    for m = 1:M
        for n = 1:N
            c(m,l) = c(m,l)+a(m,n)*b(n,l);
        end
    end
end

• The innermost loop implies a global data dependency for every c_{m,l}
The Challenge

Find an algorithm and a data distribution such that
1 duplication of data on different processes is avoided;
2 excessive data movement (communication!) is avoided.
What Will Come

Strategy
Consider the loops from the innermost to the outermost one.
1 The innermost (n-) loop is the scalar product of two vectors.
2 The n- and m-loops together form a matrix-vector multiplication for each l.
3 Finally, find an implementation of the complete algorithm.
Scalar Product of Two Vectors

• The scalar product of two vectors x, y ∈ R^M is defined as

    s = \langle x, y \rangle = x^T y = \sum_{m=1}^{M} x_m y_m

• Assume some data distribution on P processors such that x and y are distributed identically
• Then it holds

    s = \sum_{m=1}^{M} x_m y_m = \sum_{p=0}^{P-1} \underbrace{\sum_{i \in I_p} x_i y_i}_{\omega_p} = \sum_{p=0}^{P-1} \omega_p

• Assume that all local computations are done such that ω_p is available
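The splitting into local partial sums ω_p can be sketched as follows (a Python illustration; the block index sets I_p are an assumption here, any disjoint cover of the indices works):

```python
# Split <x, y> into per-processor partial sums omega_p using a
# block distribution of the index range over P processors.
def partial_sums(x, y, P):
    M = len(x)
    bounds = [M * p // P for p in range(P + 1)]
    return [sum(x[i] * y[i] for i in range(bounds[p], bounds[p + 1]))
            for p in range(P)]

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [6.0, 5.0, 4.0, 3.0, 2.0, 1.0]
omegas = partial_sums(x, y, 4)
s = sum(omegas)   # the global summation step discussed next
```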
Global Summation

Problem
Compute the sum s on P processors,

    s = \omega_0 + \omega_1 + \cdots + \omega_{P-1}

• Sequential complexity: O(P) operations
• The MPI solution to this problem: MPI_Reduce
• How could it be realized?
Global Summation: Idea

Idea (for P = 8): sum the terms pairwise, then the pair sums pairwise, and so on,

    s = ((\omega_0 + \omega_1) + (\omega_2 + \omega_3)) + ((\omega_4 + \omega_5) + (\omega_6 + \omega_7))
Global Summation

[Figure: recursive-doubling (butterfly) communication pattern for the global summation]
Realization

• Assume that P = 2^D is a power of 2. On process p, the program reads:

s = ω_p;
for d = 0:D-1
    q = bitflip(p,d);
    send(s,q);
    receive(h,q);
    s = s+h;
end

• The bitflip operation inverts bit number d in the binary representation of p.
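The program above can be simulated on a single machine; bitflip(p,d) is the XOR p ^ 2^d, and the send/receive pair collapses into a list lookup (a sketch only; real code would use MPI point-to-point calls):

```python
# Simulation of the recursive-doubling sum for P = 2^D processes.
# In round d every process p exchanges its partial sum with partner
# p XOR 2^d and adds it in.
def recursive_doubling_sum(omega):
    P = len(omega)          # must be a power of two
    s = list(omega)
    d = 1
    while d < P:
        s = [s[p] + s[p ^ d] for p in range(P)]
        d *= 2
    return s                # every process now holds the full sum
```

After D = log2(P) rounds every entry equals the global sum, which is why the MPI counterpart is MPI_Allreduce rather than MPI_Reduce.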
Comments

• After execution of the program, every processor contains s
• This algorithm is called Recursive Doubling (butterfly algorithm)
• The basic algorithm can be applied in many different contexts (examples: barriers, broadcast)
• Since every processor contains the final sum, its MPI counterpart is MPI_Allreduce
• The algorithm as given may deadlock. Use MPI_Sendrecv
Modification For Any Number of Processes

• Let 2^D < P < 2^{D+1}

s = ω_p;
if p >= 2^D
    send(s,bitflip(p,D));
end
if p < P-2^D
    receive(h,bitflip(p,D));
    s = s+h;
end
if p < 2^D
    for d = 0:D-1
        send(s,bitflip(p,d));
        receive(h,bitflip(p,d));
        s = s+h;
    end
end
if p < P-2^D
    send(s,bitflip(p,D));
end
if p >= 2^D
    receive(s,bitflip(p,D));
end
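A simulated version of this three-phase scheme (fold the excess processes into the lower 2^D, run the butterfly there, copy the result back out); the sequential loops below stand in for what happens concurrently on the processes:

```python
# General-P recursive doubling, 2^D < P <= 2^(D+1).
def general_sum(omega):
    P = len(omega)
    D = P.bit_length() - 1          # largest D with 2^D <= P
    lo = 1 << D
    s = list(omega)
    # fold-in phase: p >= 2^D sends to bitflip(p, D) = p - 2^D
    for p in range(lo, P):
        s[p - lo] += s[p]
    # butterfly on the lower 2^D processes
    d = 1
    while d < lo:
        s[:lo] = [s[p] + s[p ^ d] for p in range(lo)]
        d *= 2
    # copy-back phase: p < P - 2^D sends to p + 2^D
    for p in range(P - lo):
        s[p + lo] = s[p]
    return s
```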
Performance Analysis

• Obviously, T_1^* = (2M − 1) t_a
• The local computation time is t_{comp,1} = (2 I_p − 1) t_a
• Time for recursive doubling:

    t = D (2 (t_{startup} + 1 \cdot t_{data}) + t_a)

• Using a load-balanced data distribution, i.e., I_p = M/P, it holds

    T_P = (2 I_p − 1) t_a + \log P (2 (t_{startup} + 1 \cdot t_{data}) + t_a)
Matrix-Vector Multiplication

• For a given M × N-matrix A and an N-dimensional vector x, y = Ax is given by

    y_m = \sum_{n=1}^{N} a_{m,n} x_n = \langle a_m, x \rangle

  where a_m is the m-th row of A
• Use a load-balanced data distribution on R = P × Q processors in an array topology to store A (no overlap!)
• x must have the same data distribution as the rows of A
• Similarly, y has the same data distribution as the columns of A
Data Distribution

[Figure: block distribution of A on a P × Q processor mesh, with x and y distributed accordingly]

There is additional memory needed to store x and y: in the sequential version, M + N memory locations are necessary; in the parallel version, QM + PN locations are necessary.
Algorithm

• Once the data are distributed, all individual components y_m can be computed in parallel using the Recursive Doubling Algorithm
• Do not forget to block data exchanges to avoid excessive startup times!
• The performance analysis is similar to the one given before
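The distributed matrix-vector product can be sketched as follows: processor (p, q) owns one block of A and the matching slice of x, and the sum over q is the per-row recursive-doubling (allreduce) step. The block boundaries are an assumed load-balanced splitting; Python's built-in sum stands in for the collective:

```python
# y = A x on a simulated P x Q processor mesh.
def mesh_matvec(A, x, P, Q):
    M, N = len(A), len(A[0])
    rsplit = [M * p // P for p in range(P + 1)]   # row-block boundaries
    csplit = [N * q // Q for q in range(Q + 1)]   # column-block boundaries
    y = [0] * M
    for p in range(P):
        for m in range(rsplit[p], rsplit[p + 1]):
            # inner sum: local partial product on processor (p, q);
            # outer sum: reduction across mesh row p
            y[m] = sum(sum(A[m][n] * x[n]
                           for n in range(csplit[q], csplit[q + 1]))
                       for q in range(Q))
    return y
```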
Matrix-Matrix Multiplication

• Let A be an M × N-matrix, B an N × K-matrix. Then C = AB is

    c_{m,k} = \sum_{n=1}^{N} a_{m,n} b_{n,k} = \langle a^m, b_k \rangle

  where a^m is the m-th row of A and b_k is the k-th column of B
• Why not simply generalize matrix-vector multiplication?
  We would need excessive additional memory (QMK + PNK)
Potential

Observation
If M = N = K, the matrix multiplication requires M^3 operations on M^2 data. So we expect a good potential for parallelization.

In the following we assume M = N = K and R = P × P.
A Simple Algorithm

• Assume that we have P = M
• Assign (a^m, b_k) to processor (m, k)
• Compute c_{m,k} on processor (m, k)
• Advantages:
  • Computation time t_{comp} = (2M − 1) t_a
  • No communication at the computation stage
• Drawbacks:
  • A lot of communication for initializing,
  • or, large memory requirements
Cannon's Algorithm

• In a first step, assume P = M
• Data distribution: processor (m − 1, n − 1) holds a_{m,n}, b_{m,n}, c_{m,n}

Example (M = 3): c_{3,2} requires the third row of A (a_{31}, a_{32}, a_{33}) and the second column of B (b_{12}, b_{22}, b_{32}).
Partial Products

• On the "diagonal" processors (m − 1, m − 1), parts of the sum

    c_{m,m} = a_{m,1} b_{1,m} + \cdots + a_{m,m} b_{m,m} + \cdots + a_{m,M} b_{M,m}

  are available: e.g., processor (1, 1) holds a_{22} and b_{22} and can form the term a_{22} b_{22} of c_{2,2}
• Shifting the rows of A cyclically to the right and those of B cyclically downwards provides the next term: processor (1, 1) then holds a_{21} and b_{12} and forms a_{21} b_{12}
• Repeating this cyclic exchange once again completes the calculation of the diagonal elements
The Outer Diagonal Elements

• Consider element c_{1,2},

    c_{1,2} = a_{1,1} b_{1,2} + a_{1,2} b_{2,2} + \cdots

• While a_{1,2} is available, b_{2,2} is not
• This can be corrected by shifting the second column of B (cyclically) upwards
• In order not to change the indices on the diagonal, the second row of A must be shifted (cyclically) to the left
Initial Data Distribution

• More generally, the matrices A and B must be initialized as follows:
  • The m-th row of A is shifted cyclically m − 1 positions to the left
  • The n-th column of B is shifted cyclically n − 1 positions upwards

        A                    B
    a11 a12 a13        b11 b22 b33
    a22 a23 a21        b21 b32 b13
    a33 a31 a32        b31 b12 b23
Algorithm

1 Initially, processor (p, q) has elements a_{p+1,q+1}, b_{p+1,q+1}
2 Shift the rows of A and the columns of B into the structure described above
3 for k = 1, . . . , N:
  1 Multiply the local values on each processor
  2 Shift the rows of A and the columns of B cyclically by one processor
4 If necessary: undo step 2 (restore the initial distribution)

Remarks:
• If the number of processors is much less than N^2, a blocked version will do.
• The last shift in step 3.2 can be omitted.
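A compact NumPy simulation of these steps, assuming one matrix element per "processor" (P = M) and the left/up skew shown on the previous slide; np.roll stands in for the cyclic shifts between neighboring processors:

```python
import numpy as np

# Cannon's algorithm, simulated with one element per processor.
def cannon(A, B):
    M = A.shape[0]
    A, B = A.copy(), B.copy()
    # step 2: skew - shift row m of A left by m, column n of B up by n
    for m in range(M):
        A[m, :] = np.roll(A[m, :], -m)
    for n in range(M):
        B[:, n] = np.roll(B[:, n], -n)
    C = np.zeros_like(A)
    for _ in range(M):                    # step 3, repeated N = M times
        C += A * B                        # 3.1: local product everywhere
        A = np.roll(A, -1, axis=1)        # 3.2: rows of A one step left,
        B = np.roll(B, -1, axis=0)        #      columns of B one step up
    return C
```

After the skew, processor (i, j) holds a_{i,(i+j) mod M} and b_{(i+j) mod M,j}, so the M multiply-and-shift rounds accumulate exactly the M terms of c_{i,j}.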
Efficiency Analysis

Assume: R = P × P and P divides N, with block size I_p = N/P.

Step 2: Use red-black communication

    t_2 = 4 (t_{startup} + I_p^2 t_{data})

Step 3.1: t_{3,1} = 2 I_p^3 t_a
Step 3.2: t_{3,2} = 4 (t_{startup} + I_p^2 t_{data})

    T_R = T_{P×P} = 2 P I_p^3 t_a + 4 (P + 1)(t_{startup} + I_p^2 t_{data})
    T_s^* = 2 N^3 t_a
Speedup

    S_R = \frac{2 N^3 t_a}{2 P I_p^3 t_a + 4 (P + 1)(t_{startup} + I_p^2 t_{data})}
        = \frac{2 N^3 t_a}{2 P (N/P)^3 t_a + 4 (P + 1)(t_{startup} + (N/P)^2 t_{data})}
        = R \cdot \frac{1}{1 + 2 \frac{R(\sqrt{R}+1)}{N^3} \frac{t_{startup}}{t_a} + 2 \frac{P+1}{N} \frac{t_{data}}{t_a}}
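As a sanity check, the closed form in R can be compared numerically against T_s*/T_R computed from the definitions; the parameter values below are invented for illustration:

```python
from math import sqrt, isclose

def speedup_direct(N, P, ta, tstartup, tdata):
    # T_s* / T_R from the definitions (assumes P divides N)
    Ip = N // P
    TR = 2 * P * Ip**3 * ta + 4 * (P + 1) * (tstartup + Ip**2 * tdata)
    return 2 * N**3 * ta / TR

def speedup_closed(N, P, ta, tstartup, tdata):
    # the closed form in R = P^2 from the slide
    R = P * P
    return R / (1 + 2 * R * (sqrt(R) + 1) / N**3 * tstartup / ta
                  + 2 * (P + 1) / N * tdata / ta)

s1 = speedup_direct(256, 8, 1.0, 100.0, 2.0)
s2 = speedup_closed(256, 8, 1.0, 100.0, 2.0)
```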
Other Algorithms

There are a number of other algorithms available which have a better efficiency than Cannon's algorithm:
• Strassen's algorithm reduces the complexity of matrix multiplication from O(N^3) to O(N^{2.81...}). This approach can be refined to complexity O(N^{2.376...})
• A refined communication strategy on a differently organized processor topology was proposed by Dekel, Nassimi, and Sahni (DNS algorithm)
Google's Problem

The birth of Google
Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd: The PageRank citation ranking: Bringing order to the web. Technical Report SIDL-WP-1999-0120, http://dbpubs.stanford.edu/pub/1999-66

A search engine faces two problems:
1 Find all web pages which contain the search phrase (more generally: satisfy some search criterion)
2 Among all these pages, find the most relevant

Ad 1: Web crawler
Ad 2: PageRank citation ranking
Page Ranking

• Every web page has a number of forward links (pointers to other web pages) and a number of backlinks (web pages pointing to the given page)
• A web page seems to be more important if the number of backlinks is large
• Simply counting the number of backlinks may not be sufficient: a backlink with high importance should have a higher weight
• This is a recursive definition: a page is important if its backlinks are important.
• But what is the root of this recursion? The web does not have a root :-(
Definition of PageRank

• Let m be a web page
• Let F_m be the set of pages m points to
• Let B_m be the set of pages that point to m
• Let N_m = |F_m| be the number of forward links, and c be a normalization factor
• Then we define PageRank:

    R_m = c \sum_{n \in B_m} \frac{R_n}{N_n}

• Actually, this definition is slightly simplified.
Reformulation of PageRank

• Define a matrix A by

    a_{mn} = \begin{cases} 1/N_n, & \text{if there is a link from } n \text{ to } m, \\ 0, & \text{otherwise} \end{cases}

  (so that \sum_n a_{mn} R_n reproduces the sum over the backlinks B_m)
• Let R = (R_1, . . . , R_M)^T be the PageRank vector. Then it holds

    R = cAR
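The matrix form can be checked on a tiny invented link graph (three pages; the links and the test vector R are made up for illustration): the product AR must coincide with the componentwise backlink sum from the previous slide.

```python
# links[m] = pages that page m points to (an invented 3-page web)
links = {0: [1, 2], 1: [2], 2: [0]}
M = len(links)
N_out = {m: len(links[m]) for m in links}   # forward-link counts N_m

# a_{mn} = 1/N_n if n links to m: column n distributes n's rank
# evenly over the pages n points to
A = [[0.0] * M for _ in range(M)]
for n, targets in links.items():
    for m in targets:
        A[m][n] = 1.0 / N_out[n]

R = [0.2, 0.3, 0.5]                          # an arbitrary rank vector
AR = [sum(A[m][n] * R[n] for n in range(M)) for m in range(M)]
```

Each column of A sums to 1, so AR redistributes the total rank without changing its sum.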
PageRank

Definition
The PageRank of a given connectivity matrix A is an eigenvector R corresponding to the largest eigenvalue c of A.

Remark: Actually, A is a so-called sparse matrix, i.e., a matrix which contains very many zero entries. For the time being, we neglect this property. In connection with algorithms on graphs, we will consider sparse matrices.
The Power Method For Eigenvalue Problems

• Given a square matrix A and an initial guess x^{[0]} ≠ 0
• Form the sequence x^{[k+1]} = A x^{[k]}, k = 0, 1, . . .
• It can be shown that, for many matrices A and almost all initial guesses x^{[0]}, the sequence {x^{[k]} / \|x^{[k]}\|} converges towards an eigenvector to the largest eigenvalue c of A
• An estimate c_k of c can be obtained by

    c_k = \frac{\langle x^{[k+1]}, x^{[k]} \rangle}{\langle x^{[k]}, x^{[k]} \rangle}
Algorithm

% Let x = x0 be chosen
err = tol;
c = 0;
while err >= tol*abs(c)
    nrm = 1/sqrt(<x,x>);   % recursive doubling
    x = nrm*x;
    y = A*x;               % matrix-vector multiplication
    cnew = <y,x>;          % recursive doubling
    err = abs(cnew-c);
    c = cnew;
    x = y;                 % vector transposition
end
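A sequential Python transcription of this iteration (the parallel version would replace the dot products by recursive doubling and the product A*x by the distributed matrix-vector multiplication):

```python
from math import sqrt

def power_method(A, x, tol=1e-12):
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    matvec = lambda M, v: [dot(row, v) for row in M]
    err, c = tol, 0.0
    while err >= tol * abs(c):
        nrm = 1.0 / sqrt(dot(x, x))     # normalize x
        x = [nrm * xi for xi in x]
        y = matvec(A, x)
        cnew = dot(y, x)                # Rayleigh-quotient estimate of c
        err = abs(cnew - c)
        c = cnew
        x = y
    return c, x
```

On a matrix with a clearly dominant eigenvalue the estimate c converges to that eigenvalue and x to a corresponding eigenvector.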
Vector Transposition

• Assume a P × Q processor mesh and A distributed accordingly
• Let y be distributed in the same way as the columns of A
• The simple case: P = Q and the row and column data distributions are equal:
  • MPI_Alltoall can be used to transpose y
• The general case is much more involved. Essentially, it is equivalent to one recursive doubling step
Optimization of Parallel Algorithms

• In a sequential algorithm, the program will run faster if the consecutive steps are optimized individually.
• Hence, optimizing a sequential algorithm is a local problem.
• In a parallel environment, for a given algorithm the execution time depends both on the computation time and the data distribution.
• A certain data distribution may be optimal at one step of the algorithm while suboptimal at another.
• Thus, optimizing a parallel program is a global problem!
