Letter

A Practical Performance Comparison of Parallel Matrix Multiplication Algorithms on Network of Workstations

Kalim Qureshi*  Non-member
Haroon Rashid**  Non-member

In this paper we practically compared the performance of three block based parallel matrix multiplication algorithms with a simple fixed-size runtime task scheduling strategy on a homogeneous cluster of workstations. Parallel Virtual Machine (PVM) was used for this study.

Keywords: parallel algorithms, load balancing, matrix multiplication, performance analysis, cluster parallel computing
1. Introduction

Matrix multiplication is a problem which has wide application in many fields of science, but it costs too much computational complexity if it is solved using the conventional serial algorithm. There are some serial algorithms, like (1), which may improve the performance a bit. To gain much more performance, a parallel algorithm implemented on a distributed system is the best solution. As matrix multiplication is embarrassingly parallel, this application is relatively easy to implement, but it is still an issue of great interest to develop a computationally optimal matrix multiplication algorithm. In this paper, four existing parallel matrix multiplication algorithms will be implemented and their performance will be measured from many aspects. The algorithms are RTS (Run Time Strategy) (2), the simple block based algorithm (3), Cannon's algorithm (3), and Fox's algorithm (3). RTS has been chosen as it is considered one of the best non-block based simple parallel matrix multiplication algorithms.

2. Investigated Parallel Algorithms on Network of Workstations (NOW)

2.1 Runtime Task Scheduling (RTS) Strategy   In the RTS strategy, a unit of task (one row and one column) is distributed at runtime. As a node completes the previously assigned sub-task, a new task is assigned to it for processing by the master. The number of tasks processed by a node depends highly on the node's performance, so the strategy copes with the heterogeneity and load imbalance of nodes. The size of the task plays a great role in performance under this strategy.

2.2 Block based Matrix Multiplication   The concept of blocked matrix multiplication will be used for the three remaining algorithms. For example, an n x n matrix A can be regarded as a q x q array of blocks such that each block is an (n/q) x (n/q) sub-matrix. We can use p processors to implement the block version of the matrix multiplication algorithm in parallel. The serial block based matrix multiplication algorithm is given below:

Procedure BLOCK_MAT_MUL(A, B, C)
Begin
  For i = 0 to q-1 do
    For j = 0 to q-1 do
    Begin
      Initialize all elements of C(i,j) to zero
      For k = 0 to q-1 do
        C(i,j) = C(i,j) + A(i,k) * B(k,j)
      Endfor
    End
End BLOCK_MAT_MUL

This algorithm can be easily parallelized if an all-to-all broadcast of matrix A's blocks is performed in each row of processors and an all-to-all broadcast of matrix B's blocks is performed in each column.

2.3 Cannon's Algorithm   The matrices A and B are partitioned into p square blocks. We label the processors from P(0,0) to P(sqrt(p)-1, sqrt(p)-1). Initially A(i,j) and B(i,j) are assigned to P(i,j). The blocks are systematically rotated among the processors after every sub-matrix multiplication so that every processor gets a fresh A(i,k) after each rotation. With s = sqrt(p), Cannon's algorithm reorders the summation in the inner loop of block matrix multiplication as follows:

  C(i,j) = C(i,j) + sum_{k=0}^{s-1} A(i, (i+j+k) mod s) * B((i+j+k) mod s, j)

Cannon's matrix multiplication algorithm:

forall (i = 0 to s-1)            /* "skew" A */
  Left-circular-shift row i of A by i,
  so that A(i,j) is overwritten by A(i, (j+i) mod s)
end forall
forall (i = 0 to s-1)            /* "skew" B */
  Up-circular-shift column i of B by i,
  so that B(i,j) is overwritten by B((i+j) mod s, j)
end forall
for k = 0 to s-1
  forall (i = 0 to s-1, j = 0 to s-1)
    C(i,j) = C(i,j) + A(i,j) * B(i,j)
    Left-circular-shift each row of A by 1,
    so that A(i,j) is overwritten by A(i, (j+1) mod s)
    Up-circular-shift each column of B by 1,
    so that B(i,j) is overwritten by B((i+1) mod s, j)
  end forall
end for

* Math. and Computer Science Department, Kuwait University, Kuwait
** COMSATS Institute of Information Technology, Abbottabad, Pakistan
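As an illustration (not part of the letter's original PVM implementation), the skew-and-shift schedule of Cannon's algorithm can be simulated sequentially in Python. For brevity each "block" here is a single scalar, so the s x s processor grid is just an s x s matrix; a real implementation would hold (n/s) x (n/s) sub-matrices on each processor and perform the shifts as message exchanges:

```python
# Sequential simulation of Cannon's algorithm on an s x s grid.
# Each "block" is a single scalar for brevity; real implementations
# hold (n/s) x (n/s) sub-matrices and shift them between processors.

def left_shift_row(M, i, d):
    # Left-circular-shift row i of M by d positions.
    M[i] = M[i][d:] + M[i][:d]

def up_shift_col(M, j, d):
    # Up-circular-shift column j of M by d positions.
    col = [M[i][j] for i in range(len(M))]
    col = col[d:] + col[:d]
    for i in range(len(M)):
        M[i][j] = col[i]

def cannon(A, B):
    s = len(A)
    A = [row[:] for row in A]          # work on copies; A, B are rotated
    B = [row[:] for row in B]
    C = [[0] * s for _ in range(s)]
    for i in range(s):                 # "skew" A: shift row i left by i
        left_shift_row(A, i, i)
    for j in range(s):                 # "skew" B: shift column j up by j
        up_shift_col(B, j, j)
    for _ in range(s):                 # s multiply-and-rotate steps
        for i in range(s):
            for j in range(s):
                C[i][j] += A[i][j] * B[i][j]
        for i in range(s):             # rotate each row of A left by 1 ...
            left_shift_row(A, i, 1)
        for j in range(s):             # ... and each column of B up by 1
            up_shift_col(B, j, 1)
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(cannon(A, B))  # matches the ordinary product: [[19, 22], [43, 50]]
```

After the initial skew, processor (i,j) holds A(i, (i+j) mod s) and B((i+j) mod s, j), so every step multiplies a matching pair of blocks, which is exactly the reordered summation above.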
2.4 Fox's Algorithm   This is another memory efficient parallel algorithm for multiplying dense matrices. Again, both n x n matrices A and B are partitioned among p processors so that each processor initially stores an (n/sqrt(p)) x (n/sqrt(p)) block of each matrix. The algorithm uses one-to-all broadcasts of the blocks of matrix A in processor rows, and single-step circular upward shifts of the blocks of matrix B along processor columns. Initially each diagonal block A(i,i) is selected for broadcast.
The pseudo-code of the master program of Fox's algorithm is given below:
// Partition the matrices A and B into sub-matrices and send to the processors
node = 0
FOR i = 1, sqrt(N)
  FOR j = 1, sqrt(N)
    r = 0
    FOR k = (i-1)*m/sqrt(N) + 1, i*m/sqrt(N)
      r = r + 1
      c = 0
      FOR l = (j-1)*m/sqrt(N) + 1, j*m/sqrt(N)
        c = c + 1
        Aij[r][c] = a[k][l]   // a[k][l] represents the element in the kth
                              // row and lth column of A, whereas Aij
                              // represents a sub-matrix of A
        Bij[r][c] = b[k][l]   // as above but for B
      ENDFOR
    ENDFOR
    SEND m, N, i, j to current node
    SEND Aij to current node
    SEND Bij to current node
    node = node + 1
  ENDFOR
ENDFOR
// Wait for the nodes to compute the sub-matrix products
WAIT for nodes to finish
// Read sub-matrices Cij from the nodes
FOR node = 0, N-1
  READ Cij from current node
ENDFOR
// Assemble C from all the Cij's
OUTPUT C
END PROGRAM
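The broadcast-and-shift schedule that the worker nodes follow can likewise be simulated sequentially in Python (our illustration, not the letter's PVM code; scalar "blocks" again stand in for sub-matrices):

```python
# Sequential simulation of Fox's algorithm on an s x s grid of scalar
# "blocks". At stage k, block A(i, (i+k) mod s) is broadcast along
# processor row i, multiplied with the B block currently resident on
# each processor in that row, and then B is shifted one step upward.

def fox(A, B):
    s = len(A)
    B = [row[:] for row in B]          # B is rotated, so work on a copy
    C = [[0] * s for _ in range(s)]
    for k in range(s):
        for i in range(s):
            # One-to-all broadcast of A(i, (i+k) mod s) along row i.
            a = A[i][(i + k) % s]
            for j in range(s):
                C[i][j] += a * B[i][j]
        # Single-step circular upward shift of B along columns.
        B = B[1:] + B[:1]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(fox(A, B))  # matches the ordinary product: [[19, 22], [43, 50]]
```

After k shifts, processor (i,j) holds B((i+k) mod s, j), so stage k accumulates A(i, (i+k) mod s) * B((i+k) mod s, j), and the s stages together produce the full block product. Note that A is only read, which is why Fox's algorithm needs no skewing step.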
3. Implementation and Measurement

All four algorithms are implemented in a UNIX environment on a SUN homogeneous cluster of machines. The workstations are interconnected by a 10 Mb/s Ethernet LAN, and the PVM library has been used to create processes and distribute the tasks.

The algorithms were tested for 16 x 16 and 64 x 64 matrix sizes. All the results are summarized in Fig. 1.

From Fig. 1 it is evident that any block based matrix multiplication algorithm is far better than the simple matrix multiplication algorithm. The performances of Cannon's and Fox's algorithms are almost the same, but both are slightly better than the simple block based algorithm.

Fig. 1  The measured processing time of the four algorithms using matrices of 64 x 64 and 16 x 16

Fig. 2  Memory used by all four algorithms

The main advantage of Cannon's and Fox's algorithms is that the memory usage per process is much lower than for the other algorithms. This is very prominent from Fig. 2.
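As a shared-memory stand-in for the PVM master-worker setup, the RTS strategy of Section 2.1 can be sketched with Python threads and a task queue (our illustration; the function and parameter names are ours, and real RTS workers would be separate PVM processes requesting tasks over the network):

```python
# Illustrative stand-in for the master-worker RTS strategy: each task
# is one (row, column) element of the product, handed out at runtime,
# so faster workers naturally take more tasks (dynamic load balancing).
import queue
import threading

def rts_matmul(A, B, n_workers=4):
    n = len(A)
    cols = [[B[k][j] for k in range(n)] for j in range(n)]  # column views
    tasks = queue.Queue()
    for i in range(n):
        for j in range(n):
            tasks.put((i, j))               # one row-times-column task
    C = [[0] * n for _ in range(n)]

    def worker():
        while True:
            try:
                i, j = tasks.get_nowait()   # ask the "master" for a task
            except queue.Empty:
                return                      # no work left; worker exits
            # Each C[i][j] is written by exactly one worker, so no lock
            # is needed for the result matrix.
            C[i][j] = sum(a * b for a, b in zip(A[i], cols[j]))

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return C

print(rts_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

The sketch also shows why RTS is memory-hungry compared with the block algorithms: every worker needs access to a full row of A and a full column of B per task, rather than one sub-matrix of each.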
(Manuscript received April 1, 2004)
References

(1) D. H. Bailey, K. Lee, and H. D. Simon: "Using Strassen's Algorithm to Accelerate the Solution of Linear Systems", The Journal of Supercomputing, Vol. 4, No. 4, pp. 357-371 (Jan. 1991)
(2) K. Qureshi and M. Hatanaka: "An empirical study of task scheduling strategies for image processing application on heterogeneous distributed computing system", Parallel and Distributed Computing Practices Journal, No. 2, pp. 297-306 (Sept. 2000)
(3) A. Grama, A. Gupta, G. Karypis, and V. Kumar: Introduction to Parallel Computing, Second Edition, Addison Wesley, ISBN 0-201-64865-2, pp. 345-349 (2003)
Kalim Qureshi (Non-member)  He is an Assistant Professor in the Department of Math and Computer Science at Kuwait University, Kuwait. He did his MS and Ph.D. from the Dept. of Computer Science and Systems Engg., Muroran Institute of Technology, Hokkaido, Japan, in 1997 and 2000, respectively. He is a member of the IEEE Computer Society. He can be contacted by E-mail: qureshi@sci.kuniv.edu.kw

Haroon Rashid (Non-member)  He is an Associate Professor in the Department of Computer Science at COMSATS Institute of Information Technology, Abbottabad Campus, Pakistan. His research interests include parallel computing, distributed systems, high speed networks, multimedia, and network performance optimization. His E-mail: haroon@ciit.net.pk
IEEJ Trans. EIS, Vol. 125, No. 4, 2005