
MATRIX MULTIPLICATION

(Part b)
By:
Shahrzad Abedi
Professor: Dr. Haj Seyed Javadi
Matrix Multiplication
SIMD
MIMD
Multiprocessors
Multicomputers
Matrix Multiplication Algorithms
for Multiprocessors
(Figure: one matrix partitioned into row bands p1–p4 and another into column bands p1–p4.)
Matrix Multiplication Algorithm
for a UMA Multiprocessor
(Figure: the result matrix partitioned into bands of n/p rows, one band per process p1–p4.)
Matrix Multiplication Algorithm
for a UMA Multiprocessor
Example: n = 8, p = 2, so each process computes n/p = 4 rows of C.
(Figure: C = A · B, with processes p1 and p2 each computing a band of 4 rows of C.)
Each process must read n/p rows of A, and it reads every element of B n/p times.
Matrix Multiplication Algorithms
for Multiprocessors
Question: Which loop should be made parallel in the sequential matrix multiplication algorithm?

Grain size:
Amount of work performed between processor interactions
Measured as the ratio of computation time to communication time (computation time / communication time)
Sequential Matrix Multiplication
Algorithm
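The slide's pseudocode figure did not survive extraction; below is a minimal Python sketch of the standard three-loop algorithm it shows (function and variable names are illustrative, not Quinn's):

    # Standard sequential matrix multiplication: three nested loops.
    # A and B are n x n matrices stored as lists of lists.
    def matmul_seq(A, B, n):
        C = [[0] * n for _ in range(n)]
        for i in range(n):              # row of C
            for j in range(n):          # column of C
                for k in range(n):      # inner-product dimension
                    C[i][j] += A[i][k] * B[k][j]
        return C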
Matrix Multiplication Algorithms
for Multiprocessors
Design strategy: if load balancing is not a problem, maximize grain size.
Question: Which loop should be made parallel: i, j, or k?
The k loop carries a data dependency (every iteration accumulates into the same element of C), so it cannot be parallelized this way.
If j is parallelized: grain size = Θ(n³/(np)) = Θ(n²/p)
If i is parallelized: grain size = Θ(n³/p)
Matrix Multiplication Algorithm
for a UMA Multiprocessor
Parallelizing the i loop:
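The parallel pseudocode figure was lost in extraction; here is a hedged Python sketch of the same idea, with each of p processes computing a band of n/p rows (the multiprocessing usage and all names are illustrative):

    from multiprocessing import Pool

    # One task computes rows lo..hi-1 of C; every process reads all of B.
    def row_band(args):
        A, B, n, lo, hi = args
        return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
                for i in range(lo, hi)]

    def matmul_par_i(A, B, n, p):
        step = n // p                   # n/p rows per process (assumes p divides n)
        tasks = [(A, B, n, lo, lo + step) for lo in range(0, n, step)]
        with Pool(p) as pool:           # p worker processes
            bands = pool.map(row_band, tasks)
        return [row for band in bands for row in band]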
Matrix Multiplication Algorithm
for a UMA Multiprocessor
Each process computes n/p rows of C, and each row costs Θ(n²) operations:
(n/p) × Θ(n²) = Θ(n³/p)
Synchronization overhead: Θ(p)
Overall complexity: Θ(n³/p + p)
Matrix Multiplication in Loosely
Coupled Multiprocessors
Some matrix elements may be much cheaper to access than others.
It is important to keep as many memory references local as possible.
In the previous UMA algorithm:
Every process must access n/p rows of matrix A and access every element of B n/p times.
Only a single addition and a single multiplication occur for every element of B fetched. This is not a good ratio!
Implementing this algorithm on a NUMA multiprocessor therefore yields poor speedup.
Matrix Multiplication in Loosely
Coupled Multiprocessors
Another method must be found to partition the problem.
An attractive method: block matrix multiplication.
Block Matrix Multiplication
A and B are both n × n matrices, n = 2k.
A and B can be thought of as conglomerates of four smaller matrices (blocks), each of size k × k.
Given this partitioning of A and B into blocks, C is defined as follows:
C(1,1) = A(1,1)·B(1,1) + A(1,2)·B(2,1)    C(1,2) = A(1,1)·B(1,2) + A(1,2)·B(2,2)
C(2,1) = A(2,1)·B(1,1) + A(2,2)·B(2,1)    C(2,2) = A(2,1)·B(1,2) + A(2,2)·B(2,2)
Block Matrix Multiplication
For example, if there are p = 4 processes, matrix multiplication can be done by dividing A and B into four blocks of size k × k and assigning one block of C to each process.
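As a concrete Python sketch of this 2 × 2 block scheme with k = n/2 (function names are illustrative):

    # Multiply n x n matrices via 2 x 2 blocks of size k = n/2.
    def block(M, bi, bj, k):            # the k x k block at block coords (bi, bj)
        return [row[bj * k:(bj + 1) * k] for row in M[bi * k:(bi + 1) * k]]

    def block_matmul(A, B, n):
        k = n // 2
        C = [[0] * n for _ in range(n)]
        for bi in range(2):
            for bj in range(2):
                for s in range(2):      # C(bi,bj) += A(bi,s) * B(s,bj)
                    Ab, Bb = block(A, bi, s, k), block(B, s, bj, k)
                    for i in range(k):
                        for j in range(k):
                            C[bi * k + i][bj * k + j] += sum(
                                Ab[i][t] * Bb[t][j] for t in range(k))
        return C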
Block Matrix Multiplication
STEP 1: Compute C(i,j) = A(i,1)·B(1,j)
(Figure: for each process P1–P4, the block of A and the block of B it reads in this step.)
Block Matrix Multiplication
STEP 2: Compute C(i,j) = C(i,j) + A(i,2)·B(2,j)
(Figure: the blocks of A and B each process P1–P4 reads in this step to complete its block of C.)
Block Matrix Multiplication
Each block multiplication requires 2k² memory fetches, k³ additions, and k³ multiplications.
The number of arithmetic operations per memory access has therefore risen from 2 in the previous algorithm to:
2k³ / 2k² = k
Matrix Multiplication Algorithm
for NUMA Multiprocessors
Try to resolve memory contention as much as possible.
Increase the locality of memory references to reduce memory access time.
Design strategy: reduce average memory latency by increasing locality.
Algorithms for Multicomputers:
Row-Column Oriented Algorithm

Partition matrix A into rows and B into columns (assume n is a power of 2 and the algorithm executes on an n-processor hypercube).
One imaginable parallelization: parallelize the outer loop (i).
Then all parallel processes access column 0 of B, then column 1 of B, and so on.
This results in a sequence of broadcast steps, each taking Θ(log n) time on an n-processor hypercube (refer to Chapter 6, p. 170).
On a multiprocessor, this kind of concentrated contention for the same memory bank is called a hot spot.
Row-Column Oriented Algorithm
Design strategy: eliminate contention for shared resources by changing the temporal order of data accesses.
New solution for a multicomputer:
Change the order in which the algorithm computes the elements of each row of C.
Processes are organized as a ring.
After each process has used its current column block of B, it fetches the next column block from its successor on the ring (see the sketch below).
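A hedged sequential simulation of the ring algorithm (the rotation is modeled with an index array rather than real message passing; names are mine, not Quinn's):

    # p processes; process q owns rows q*r..(q+1)*r-1 of A and, initially,
    # columns q*r..(q+1)*r-1 of B (r = n/p). Column blocks rotate each round.
    def ring_matmul(A, B, n, p):
        r = n // p
        C = [[0] * n for _ in range(n)]
        held = list(range(p))               # held[q] = B block now at process q
        for _ in range(p):                  # p rotations
            for q in range(p):              # conceptually in parallel
                b = held[q]
                for i in range(q * r, (q + 1) * r):
                    for j in range(b * r, (b + 1) * r):
                        C[i][j] = sum(A[i][k] * B[k][j] for k in range(n))
            held = [held[(q + 1) % p] for q in range(p)]  # fetch from successor
        return C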
Row-Column Oriented Algorithm
We embed a ring in a hypercube with dilation 1 using Gray codes, so each message can be sent in Θ(1) time.
(Figure: the 8-node ring 0, 1, 3, 2, 6, 7, 5, 4 embedded in a 3-dimensional hypercube.)
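The mapping itself is the reflected Gray code: ring position i is placed at hypercube node gray(i), so consecutive ring positions are hypercube neighbors. In Python:

    def gray(i):
        return i ^ (i >> 1)     # consecutive values differ in exactly one bit

    print([gray(i) for i in range(8)])   # [0, 1, 3, 2, 6, 7, 5, 4]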
Row-Column Oriented Algorithm
Example: Use 4 processes to multiply two 4 × 4 matrices A and B.
(Figures: the four ring iterations. In each iteration, every process multiplies its row of A by the column of B it currently holds; between iterations the columns rotate one position around the ring, so after 4 iterations each process has built one full row of C.)
Row-Column Oriented Algorithm
Generalizing the algorithm:
Multiply an l × m matrix by an m × n matrix on p processors, where p < l and p < n.
Assume l and n are integer multiples of p.
Every processor begins with l/p rows of A and n/p columns of B, and multiplies its (l/p) × m submatrix of A by the m × (n/p) submatrix of B, producing an (l/p) × (n/p) submatrix of the product.
Then every processor passes its piece of B to its successor on the ring.
After p iterations, each processor has multiplied its piece of A with every piece of B, building an (l/p) × n section of the product matrix.
Total number of computation steps: (l/p) · m · (n/p) · p = Θ(lmn/p).
(For example, with l = m = n = 8 and p = 4: 2 · 8 · 2 · 4 = 128 = 8³/4.)
Row-Column Oriented Algorithm
Total communication time:
The standard assumption: sending and receiving a message costs the message latency plus the per-value transmission time times the number of values sent.
λ: message latency
β: transmission time per value
Every iteration has communication time 2(λ + β·m·(n/p)).
Over p iterations, the total communication time is 2p(λ + β·m·(n/p)) = 2pλ + 2βmn.
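Plugging the model into code (a small helper of my own, assuming p divides n):

    # Row-column algorithm: total communication time over p iterations.
    def rowcol_comm_time(lam, beta, m, n, p):
        per_iter = 2 * (lam + beta * m * (n // p))   # send + receive one B piece
        return p * per_iter                          # = 2*p*lam + 2*beta*m*n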
Algorithms for Multicomputers:
Block-Oriented Algorithm
We want to maximize the number of multiplications performed per iteration.
Multiply an l × m matrix A by an m × n matrix B, where l, m, and n are integer multiples of √p and p is an even power of 2.
Organize the processors as a √p × √p two-dimensional mesh with wraparound connections.
Give each processor an (l/√p) × (m/√p) subsection of A and an (m/√p) × (n/√p) subsection of B.
Block-Oriented Algorithm
The new matrix multiplication algorithm is a corollary of two results shown earlier:
Block matrix multiplication can be performed analogously to scalar matrix multiplication: each occurrence of a scalar multiplication is replaced by a matrix (block) multiplication.
The staggering technique previously used on a two-dimensional mesh of processors: the same technique positions the blocks of A and B so that every processor multiplies two submatrices in every iteration.
Block-Oriented Algorithm
Phase 1: Staggering the block submatrices of matrix A is done in both directions, left and right.
(Figure: the rows of A blocks shifted into their staggered positions.)
Block-Oriented Algorithm
Phase 1: Staggering the block submatrices of matrix B is done in both directions, up and down.
(Figure: the columns of B blocks shifted into their staggered positions.)
Block-Oriented Algorithm
(Figure: the staggered blocks of A and B combined on the processor mesh.)
Block-Oriented Algorithm
From P(1,2)'s point of view:
C(1,2) = A(1,3)·B(3,2) (1) + A(1,0)·B(0,2) (2) + A(1,1)·B(1,2) (3) + A(1,2)·B(2,2) (4)
(The parenthesized numbers give the iteration in which each block product is computed.)
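Putting the staggering and the shifts together, here is a hedged sequential simulation in the spirit of these slides (block ownership is tracked with index arrays instead of messages; after staggering, P(i, j) holds A block (i, (i+j) mod q) and B block ((i+j) mod q, j), where q = √p):

    # Block-oriented multiply of n x n matrices on a simulated q x q mesh.
    def mesh_block_matmul(A, B, n, q):
        k = n // q
        def blk(M, bi, bj):
            return [row[bj * k:(bj + 1) * k] for row in M[bi * k:(bi + 1) * k]]
        C = [[0] * n for _ in range(n)]
        # Staggered starting positions: A column / B row held by P(i, j).
        a_col = [[(i + j) % q for j in range(q)] for i in range(q)]
        b_row = [[(i + j) % q for j in range(q)] for i in range(q)]
        for _ in range(q):                      # sqrt(p) multiply-and-shift rounds
            for i in range(q):
                for j in range(q):              # conceptually in parallel
                    Ab, Bb = blk(A, i, a_col[i][j]), blk(B, b_row[i][j], j)
                    for r in range(k):
                        for c in range(k):
                            C[i * k + r][j * k + c] += sum(
                                Ab[r][t] * Bb[t][c] for t in range(k))
            a_col = [[(v + 1) % q for v in row] for row in a_col]  # A shifts left
            b_row = [[(v + 1) % q for v in row] for row in b_row]  # B shifts up
        return C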
Block-Oriented Algorithm
There are √p iterations in which every processor sends and receives a portion of matrices A and B.
Number of computation steps: √p × (l/√p)(m/√p)(n/√p) = Θ(lmn/p)
The staggering and unstaggering phases take at most √p/2 steps instead of the p − 1 steps of Gentleman's algorithm.
How? With wraparound connections, blocks can shift in whichever direction is shorter, so no block needs to move more than √p/2 positions.
Total communication cost per iteration for transferring an A block and a B block:
2(λ + (lm/p)β) + 2(λ + (mn/p)β) = 2(2λ + ((lm + mn)/p)β)
Block-Oriented Algorithm
Total communication time:
The standard assumption again: sending and receiving a message costs the message latency plus the per-value transmission time times the number of values sent.
λ: message latency
β: transmission time per value
Over the √p iterations (ignoring the shorter staggering phase), the total communication time is:
√p × 2(2λ + ((lm + mn)/p)β) = 4√p·λ + 2((lm + mn)/√p)β
The two multicomputer Algorithms
Both the block-oriented algorithm and the row-column algorithm have the same number of computation steps: Θ(lmn/p).
When does the second (block-oriented) algorithm require less communication time?
Assume we are multiplying two n × n matrices, where n is an integer multiple of p.
Row-column: 2pλ + 2n²β
Block-oriented (with l = m = n): 4√p·λ + (4n²/√p)β
The block-oriented algorithm's latency term is smaller when 4√p < 2p, and its transmission term is smaller when 4/√p < 2; both hold exactly when √p > 2.
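A quick numeric check of this comparison (the λ and β values here are illustrative only):

    # Communication times for n x n matrices under the two models above.
    def rowcol(lam, beta, n, p):
        return 2 * p * lam + 2 * beta * n * n

    def block_oriented(lam, beta, n, p):
        q = int(p ** 0.5)
        return 4 * q * lam + 4 * beta * n * n / q

    for p in (4, 16, 64):                 # even powers of 2
        print(p, rowcol(100, 1, 256, p), block_oriented(100, 1, 256, p))
    # p = 4 ties; p = 16 and p = 64 favor the block-oriented algorithm.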
The two multicomputer Algorithms
Thus the block-oriented algorithm is uniformly superior to the row-column algorithm when the number of processors is an even power of 2 greater than or equal to 16.

Questions?