
MATRIX MULTIPLICATION

(Part b)
By:
Shahrzad Abedi
Professor: Dr. Haj Seyed Javadi
Matrix Multiplication
SIMD
MIMD
Multiprocessors
Multicomputers
Matrix Multiplication Algorithms
for Multiprocessors
(Figure: one matrix partitioned into row bands p1–p4 and another into column bands p1–p4.)
Matrix Multiplication Algorithm
for a UMA Multiprocessor
(Figure: the result matrix partitioned into bands of n/p rows, one band per process p1–p4.)
Matrix Multiplication Algorithm
for a UMA Multiprocessor
Example: n = 8, p = 2, so each process computes n/p = 4 rows of C.
(Figure: C = A · B, with processes p1 and p2 each computing a band of 4 rows of C.)
Each process must read n/p rows of A, and it reads every element of B n/p times.
Matrix Multiplication Algorithms
for Multiprocessors
Question: Which loop should be made parallel in the sequential matrix multiplication algorithm?

Grain size:
Amount of work performed between processor interactions
Measured as the ratio of computation time to communication time (computation time / communication time)
Sequential Matrix Multiplication
Algorithm
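The slide's pseudocode figure did not survive extraction; below is a minimal Python sketch of the standard three-loop algorithm it shows (function and variable names are illustrative, not Quinn's):

    # Standard sequential matrix multiplication: three nested loops.
    # A and B are n x n matrices stored as lists of lists.
    def matmul_seq(A, B, n):
        C = [[0] * n for _ in range(n)]
        for i in range(n):              # row of C
            for j in range(n):          # column of C
                for k in range(n):      # inner-product dimension
                    C[i][j] += A[i][k] * B[k][j]
        return C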
Matrix Multiplication Algorithms
for Multiprocessors
Design strategy: if load balancing is not a problem, maximize grain size.
Question: Which loop should be made parallel: i, j, or k?
The k loop carries a data dependency (every iteration accumulates into the same element of C), so it cannot be parallelized this way.
If j is parallelized: grain size = Θ(n³/(np)) = Θ(n²/p)
If i is parallelized: grain size = Θ(n³/p)
Matrix Multiplication Algorithm
for a UMA Multiprocessor
Parallelizing the i loop:
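The parallel pseudocode figure was lost in extraction; here is a hedged Python sketch of the same idea, with each of p processes computing a band of n/p rows (the multiprocessing usage and all names are illustrative):

    from multiprocessing import Pool

    # One task computes rows lo..hi-1 of C; every process reads all of B.
    def row_band(args):
        A, B, n, lo, hi = args
        return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
                for i in range(lo, hi)]

    def matmul_par_i(A, B, n, p):
        step = n // p                   # n/p rows per process (assumes p divides n)
        tasks = [(A, B, n, lo, lo + step) for lo in range(0, n, step)]
        with Pool(p) as pool:           # p worker processes
            bands = pool.map(row_band, tasks)
        return [row for band in bands for row in band]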
Matrix Multiplication Algorithm
for a UMA Multiprocessor
Each process computes n/p rows of C, and each row costs Θ(n²) operations:
(n/p) × Θ(n²) = Θ(n³/p)
Synchronization overhead: Θ(p)
Overall complexity: Θ(n³/p + p)
Matrix Multiplication in Loosely
Coupled Multiprocessors
Some matrix elements may be much cheaper to access than others.
It is important to keep as many memory references local as possible.
In the previous UMA algorithm:
Every process must access n/p rows of matrix A and access every element of B n/p times.
Only a single addition and a single multiplication occur for every element of B fetched. This is not a good ratio!
Implementing this algorithm on a NUMA multiprocessor therefore yields poor speedup.
Matrix Multiplication in Loosely
Coupled Multiprocessors
Another method must be found to partition the problem.
An attractive method: block matrix multiplication.
Block Matrix Multiplication
A and B are both n × n matrices, n = 2k.
A and B can be thought of as conglomerates of four smaller matrices (blocks), each of size k × k.
Given this partitioning of A and B into blocks, C is defined as follows:
C(1,1) = A(1,1)·B(1,1) + A(1,2)·B(2,1)    C(1,2) = A(1,1)·B(1,2) + A(1,2)·B(2,2)
C(2,1) = A(2,1)·B(1,1) + A(2,2)·B(2,1)    C(2,2) = A(2,1)·B(1,2) + A(2,2)·B(2,2)
Block Matrix Multiplication
For example, if there are p = 4 processes, matrix multiplication can be done by dividing A and B into four blocks of size k × k and assigning one block of C to each process.
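As a concrete Python sketch of this 2 × 2 block scheme with k = n/2 (function names are illustrative):

    # Multiply n x n matrices via 2 x 2 blocks of size k = n/2.
    def block(M, bi, bj, k):            # the k x k block at block coords (bi, bj)
        return [row[bj * k:(bj + 1) * k] for row in M[bi * k:(bi + 1) * k]]

    def block_matmul(A, B, n):
        k = n // 2
        C = [[0] * n for _ in range(n)]
        for bi in range(2):
            for bj in range(2):
                for s in range(2):      # C(bi,bj) += A(bi,s) * B(s,bj)
                    Ab, Bb = block(A, bi, s, k), block(B, s, bj, k)
                    for i in range(k):
                        for j in range(k):
                            C[bi * k + i][bj * k + j] += sum(
                                Ab[i][t] * Bb[t][j] for t in range(k))
        return C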
Block Matrix Multiplication
STEP 1: Compute C(i,j) = A(i,1)·B(1,j)
(Figure: for each process P1–P4, the block of A and the block of B it reads in this step.)
Block Matrix Multiplication
STEP 2: Compute C(i,j) = C(i,j) + A(i,2)·B(2,j)
(Figure: the blocks of A and B each process P1–P4 reads in this step to complete its block of C.)
Block Matrix Multiplication
Each block multiplication requires 2k² memory fetches, k³ additions, and k³ multiplications.
The number of arithmetic operations per memory access has therefore risen from 2 in the previous algorithm to:
2k³ / 2k² = k
Matrix Multiplication Algorithm
for NUMA Multiprocessors
Try to resolve memory contention as much as possible.
Increase the locality of memory references to reduce memory access time.
Design strategy: reduce average memory latency by increasing locality.
Algorithms for Multicomputers:
Row-Column Oriented Algorithm

Partition matrix A into rows and B into columns (assume n is a power of 2 and the algorithm executes on an n-processor hypercube).
One imaginable parallelization: parallelize the outer loop (i).
Then all parallel processes access column 0 of B, then column 1 of B, and so on.
This results in a sequence of broadcast steps, each taking Θ(log n) time on an n-processor hypercube (refer to Chapter 6, p. 170).
On a multiprocessor, this kind of concentrated contention for the same memory bank is called a hot spot.
Row-Column Oriented Algorithm
Design strategy: eliminate contention for shared resources by changing the temporal order of data accesses.
New solution for a multicomputer:
Change the order in which the algorithm computes the elements of each row of C.
Processes are organized as a ring.
After each process has used its current column block of B, it fetches the next column block from its successor on the ring (see the sketch below).
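A hedged sequential simulation of the ring algorithm (the rotation is modeled with an index array rather than real message passing; names are mine, not Quinn's):

    # p processes; process q owns rows q*r..(q+1)*r-1 of A and, initially,
    # columns q*r..(q+1)*r-1 of B (r = n/p). Column blocks rotate each round.
    def ring_matmul(A, B, n, p):
        r = n // p
        C = [[0] * n for _ in range(n)]
        held = list(range(p))               # held[q] = B block now at process q
        for _ in range(p):                  # p rotations
            for q in range(p):              # conceptually in parallel
                b = held[q]
                for i in range(q * r, (q + 1) * r):
                    for j in range(b * r, (b + 1) * r):
                        C[i][j] = sum(A[i][k] * B[k][j] for k in range(n))
            held = [held[(q + 1) % p] for q in range(p)]  # fetch from successor
        return C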
Row-Column Oriented Algorithm
We embed a ring in a hypercube with dilation 1 using Gray codes, so each message can be sent in Θ(1) time.
(Figure: the 8-node ring 0, 1, 3, 2, 6, 7, 5, 4 embedded in a 3-dimensional hypercube.)
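The mapping itself is the reflected Gray code: ring position i is placed at hypercube node gray(i), so consecutive ring positions are hypercube neighbors. In Python:

    def gray(i):
        return i ^ (i >> 1)     # consecutive values differ in exactly one bit

    print([gray(i) for i in range(8)])   # [0, 1, 3, 2, 6, 7, 5, 4]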
Row-Column Oriented Algorithm
Example: Use 4 processes to multiply two 4 × 4 matrices A and B.
(Figures: the four ring iterations. In each iteration, every process multiplies its row of A by the column of B it currently holds; between iterations the columns rotate one position around the ring, so after 4 iterations each process has built one full row of C.)
Row-Column Oriented Algorithm
Generalizing the algorithm:
Multiply an l × m matrix by an m × n matrix on p processors, where p < l and p < n.
Assume l and n are integer multiples of p.
Every processor begins with l/p rows of A and n/p columns of B, and multiplies its (l/p) × m submatrix of A by the m × (n/p) submatrix of B, producing an (l/p) × (n/p) submatrix of the product.
Then every processor passes its piece of B to its successor on the ring.
After p iterations, each processor has multiplied its piece of A with every piece of B, building an (l/p) × n section of the product matrix.
Total number of computation steps: (l/p) · m · (n/p) · p = Θ(lmn/p).
(For example, with l = m = n = 8 and p = 4: 2 · 8 · 2 · 4 = 128 = 8³/4.)
Row-Column Oriented Algorithm
Total communication time:
The standard assumption: sending and receiving a message costs the message latency plus the per-value transmission time times the number of values sent.
λ: message latency
β: transmission time per value
Every iteration has communication time 2(λ + β·m·(n/p)).
Over p iterations, the total communication time is 2p(λ + β·m·(n/p)) = 2pλ + 2βmn.
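Plugging the model into code (a small helper of my own, assuming p divides n):

    # Row-column algorithm: total communication time over p iterations.
    def rowcol_comm_time(lam, beta, m, n, p):
        per_iter = 2 * (lam + beta * m * (n // p))   # send + receive one B piece
        return p * per_iter                          # = 2*p*lam + 2*beta*m*n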
Algorithms for Multicomputers:
Block-Oriented Algorithm
We want to maximize the number of multiplications performed per iteration.
Multiply an l × m matrix A by an m × n matrix B, where l, m, and n are integer multiples of √p and p is an even power of 2.
Organize the processors as a √p × √p two-dimensional mesh with wraparound connections.
Give each processor an (l/√p) × (m/√p) subsection of A and an (m/√p) × (n/√p) subsection of B.
Block-Oriented Algorithm
The new matrix multiplication algorithm is a corollary of two results shown earlier:
Block matrix multiplication can be performed analogously to scalar matrix multiplication: each occurrence of a scalar multiplication is replaced by a matrix (block) multiplication.
The staggering technique previously used on a two-dimensional mesh of processors: the same technique positions the blocks of A and B so that every processor multiplies two submatrices in every iteration.
Block-Oriented Algorithm
Phase 1: Staggering the block submatrices of matrix A is done in both directions, left and right.
(Figure: the rows of A blocks shifted into their staggered positions.)
Block-Oriented Algorithm
Phase 1: Staggering the block submatrices of matrix B is done in both directions, up and down.
(Figure: the columns of B blocks shifted into their staggered positions.)
Block-Oriented Algorithm
(Figure: the staggered blocks of A and B combined on the processor mesh.)
Block-Oriented Algorithm
From P(1,2)'s point of view:
C(1,2) = A(1,3)·B(3,2) (1) + A(1,0)·B(0,2) (2) + A(1,1)·B(1,2) (3) + A(1,2)·B(2,2) (4)
(The parenthesized numbers give the iteration in which each block product is computed.)
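Putting the staggering and the shifts together, here is a hedged sequential simulation in the spirit of these slides (block ownership is tracked with index arrays instead of messages; after staggering, P(i, j) holds A block (i, (i+j) mod q) and B block ((i+j) mod q, j), where q = √p):

    # Block-oriented multiply of n x n matrices on a simulated q x q mesh.
    def mesh_block_matmul(A, B, n, q):
        k = n // q
        def blk(M, bi, bj):
            return [row[bj * k:(bj + 1) * k] for row in M[bi * k:(bi + 1) * k]]
        C = [[0] * n for _ in range(n)]
        # Staggered starting positions: A column / B row held by P(i, j).
        a_col = [[(i + j) % q for j in range(q)] for i in range(q)]
        b_row = [[(i + j) % q for j in range(q)] for i in range(q)]
        for _ in range(q):                      # sqrt(p) multiply-and-shift rounds
            for i in range(q):
                for j in range(q):              # conceptually in parallel
                    Ab, Bb = blk(A, i, a_col[i][j]), blk(B, b_row[i][j], j)
                    for r in range(k):
                        for c in range(k):
                            C[i * k + r][j * k + c] += sum(
                                Ab[r][t] * Bb[t][c] for t in range(k))
            a_col = [[(v + 1) % q for v in row] for row in a_col]  # A shifts left
            b_row = [[(v + 1) % q for v in row] for row in b_row]  # B shifts up
        return C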
Block-Oriented Algorithm
There are √p iterations in which every processor sends and receives a portion of matrices A and B.
Number of computation steps: √p × (l/√p)(m/√p)(n/√p) = Θ(lmn/p)
The staggering and unstaggering phases take at most √p/2 steps instead of the p − 1 steps of Gentleman's algorithm.
How? With wraparound connections, blocks can shift in whichever direction is shorter, so no block needs to move more than √p/2 positions.
Total communication cost per iteration for transferring an A block and a B block:
2(λ + (lm/p)β) + 2(λ + (mn/p)β) = 2(2λ + ((lm + mn)/p)β)
Block-Oriented Algorithm
Total communication time:
The standard assumption again: sending and receiving a message costs the message latency plus the per-value transmission time times the number of values sent.
λ: message latency
β: transmission time per value
Over the √p iterations (ignoring the shorter staggering phase), the total communication time is:
√p × 2(2λ + ((lm + mn)/p)β) = 4√p·λ + 2((lm + mn)/√p)β
The two multicomputer Algorithms
Both the block-oriented algorithm and the row-column algorithm have the same number of computation steps: Θ(lmn/p).
When does the second (block-oriented) algorithm require less communication time?
Assume we are multiplying two n × n matrices, where n is an integer multiple of p.
Row-column: 2pλ + 2n²β
Block-oriented (with l = m = n): 4√p·λ + (4n²/√p)β
The block-oriented algorithm's latency term is smaller when 4√p < 2p, and its transmission term is smaller when 4/√p < 2; both hold exactly when √p > 2.
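A quick numeric check of this comparison (the λ and β values here are illustrative only):

    # Communication times for n x n matrices under the two models above.
    def rowcol(lam, beta, n, p):
        return 2 * p * lam + 2 * beta * n * n

    def block_oriented(lam, beta, n, p):
        q = int(p ** 0.5)
        return 4 * q * lam + 4 * beta * n * n / q

    for p in (4, 16, 64):                 # even powers of 2
        print(p, rowcol(100, 1, 256, p), block_oriented(100, 1, 256, p))
    # p = 4 ties; p = 16 and p = 64 favor the block-oriented algorithm.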
The two multicomputer Algorithms
Thus the block-oriented algorithm is uniformly superior to the row-column algorithm when the number of processors is an even power of 2 greater than or equal to 16.

Questions?