Letter

A Practical Performance Comparison of Parallel Matrix Multiplication Algorithms on a Network of Workstations

Kalim Qureshi* (Non-member), Haroon Rashid** (Non-member)

In this paper we practically compare the performance of three block-based parallel matrix multiplication algorithms with a simple fixed-size runtime task scheduling strategy on a homogeneous cluster of workstations. Parallel Virtual Machine (PVM) was used for this study.

Keywords: parallel algorithms, load balancing, matrix multiplication, performance analysis, cluster parallel computing

1. Introduction

Matrix multiplication is a problem with wide application in many fields of science, but it is computationally expensive if solved with the conventional serial algorithm. Some serial algorithms (1) improve performance a bit; to gain much more performance, a parallel algorithm implemented on a distributed system is the best solution. Because matrix multiplication is embarrassingly parallel, the application is relatively easy to implement, yet developing a computationally optimal parallel matrix multiplication algorithm remains an issue of great interest. In this paper, four existing parallel matrix multiplication algorithms are implemented and their performance is measured from many aspects. The algorithms are the Run Time Strategy (RTS) (2), the simple block-based algorithm (3), Cannon's algorithm (3), and Fox's algorithm (3). RTS has been chosen as it is considered one of the best non-block-based simple parallel matrix multiplication algorithms.

2. Investigated Parallel Algorithms on a Network of Workstations (NOW)

2.1 Runtime Task Scheduling (RTS) Strategy   In the RTS strategy, a unit of task (one row and one column) is distributed at runtime. As a node completes its previously assigned sub-task, a new task is assigned to it by the master. The number of tasks processed by each node therefore depends strongly on that node's performance, which lets the strategy cope with heterogeneity and load imbalance among the nodes. The size of the task plays a great role in performance under this strategy.

2.2 Block-based Matrix Multiplication   The concept of blocked matrix multiplication is used for the remaining three algorithms. For example, an n x n matrix A can be regarded as a q x q array of blocks such that each block is an (n/q) x (n/q) sub-matrix. We can use p processors to implement the block version of the matrix multiplication algorithm in parallel. The algorithm is easily parallelized if an all-to-all broadcast of matrix A's blocks is performed in each row of processors and an all-to-all broadcast of matrix B's blocks is performed in each column. The serial block-based matrix multiplication algorithm is given below:

Procedure BLOCK_MAT_MUL(A, B, C)
Begin
  For i = 0 to q-1 do
    For j = 0 to q-1 do
    Begin
      Initialize all elements of C(i,j) to zero
      For k = 0 to q-1 do
        C(i,j) = C(i,j) + A(i,k) x B(k,j)
      Endfor
    End
End BLOCK_MAT_MUL
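The serial blocked procedure above can be sketched in Python with NumPy (our illustration, not the authors' code; the function name and the choice of block count q are assumptions):

```python
import numpy as np

def block_mat_mul(A, B, q):
    """Serial blocked matrix multiplication: the n x n matrices are
    viewed as a q x q grid of (n/q) x (n/q) sub-matrices."""
    n = A.shape[0]
    assert n % q == 0, "n must be divisible by q"
    s = n // q                      # side length of one block
    C = np.zeros((n, n))
    for i in range(q):
        for j in range(q):
            # C(i,j) accumulates the sum over k of A(i,k) @ B(k,j)
            for k in range(q):
                C[i*s:(i+1)*s, j*s:(j+1)*s] += (
                    A[i*s:(i+1)*s, k*s:(k+1)*s] @ B[k*s:(k+1)*s, j*s:(j+1)*s]
                )
    return C

# Sanity check against NumPy's own matrix product.
A = np.arange(16.0).reshape(4, 4)
B = np.arange(16.0).reshape(4, 4)
assert np.allclose(block_mat_mul(A, B, 2), A @ B)
```

Blocking does not change the arithmetic, only its order; in the parallel variants each (i, j) block pair is handled by a different processor.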
2.3 Cannon's Algorithm   The matrices A and B are partitioned into p square blocks. We label the processors from P(0,0) to P(sqrt(p)-1, sqrt(p)-1); initially, blocks A(i,j) and B(i,j) are assigned to processor P(i,j). The blocks are systematically rotated among the processors after every sub-matrix multiplication, so that every processor gets a fresh A(i,k) after each rotation. Cannon's algorithm reorders the summation in the inner loop of blocked matrix multiplication as follows (with s = sqrt(p)):

C(i,j) = sum_{k=0}^{s-1} A(i, (i+j+k) mod s) x B((i+j+k) mod s, j)

Cannon's matrix multiplication algorithm:

forall (i = 0 to s-1)            /* "skew" A */
  Left-circular-shift row i of A by i,
  so that A(i,j) is overwritten by A(i, (i+j) mod s)
end forall
forall (i = 0 to s-1)            /* "skew" B */
  Up-circular-shift column i of B by i,
  so that B(i,j) is overwritten by B((i+j) mod s, j)
end forall
for k = 0 to s-1
  forall (i = 0 to s-1, j = 0 to s-1)
    C(i,j) = C(i,j) + A(i,j) x B(i,j)
    Left-circular-shift each row of A by 1,
    so that A(i,j) is overwritten by A(i, (j+1) mod s)
    Up-circular-shift each column of B by 1,
    so that B(i,j) is overwritten by B((i+1) mod s, j)
  end forall
end for

* Math. and Computer Science Department, Kuwait University, Kuwait
** COMSATS Institute of Information Technology, Abbottabad, Pakistan

2.4 Fox's Algorithm   This is another memory-efficient parallel algorithm for multiplying dense matrices. Again, both n x n matrices A and B are partitioned among p processors so that each processor initially stores an (n/sqrt(p)) x (n/sqrt(p)) block of each matrix. The algorithm uses one-to-all broadcasts of the blocks of matrix A in processor rows, and single-step circular upward shifts of the blocks of matrix B along processor columns. Initially each diagonal block A(i,i) is selected for broadcast. The pseudo-code of Fox's algorithm is given below:

// Partition the matrices A and B into sub-matrices and send them to the processor nodes
node = 0
FOR i = 1, m/sqrt(N)
  FOR j = 1, m/sqrt(N)
    FOR k = (i-1)*m/sqrt(N) + 1, i*m/sqrt(N)
      FOR l = (j-1)*m/sqrt(N) + 1, j*m/sqrt(N)
        Aij = a[k][l]   // a[k][l] is the element in the k-th row and l-th column of A,
                        // whereas Aij is a sub-matrix of A
        Bij = b[k][l]   // as above, but for B
      ENDFOR
    ENDFOR
    SEND m, N, i, j to current node
    SEND Aij to current node
    SEND Bij to current node
    node = node + 1
  ENDFOR
ENDFOR
// Wait for the nodes to compute the sub-matrix products
WAIT for nodes to finish
// Read sub-matrices Cij from the nodes
FOR node = 0, N-1
  READ Cij from current node
ENDFOR
// Assemble C from all the Cij's
OUTPUT C
END PROGRAM
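To make the communication pattern of Section 2.3 concrete, here is a single-process NumPy simulation of Cannon's skew-and-shift scheme (our sketch, not the paper's PVM implementation; the grid size s and all names are assumptions, and list re-indexing stands in for the circular-shift messages a real cluster would exchange):

```python
import numpy as np

def cannon_mat_mul(A, B, s):
    """Simulate Cannon's algorithm on an s x s 'processor' grid.
    Each grid cell (i, j) holds one block of A, B, and C."""
    n = A.shape[0]
    bs = n // s                                   # block size
    # View A and B as s x s grids of bs x bs blocks.
    Ab = [[A[i*bs:(i+1)*bs, j*bs:(j+1)*bs] for j in range(s)] for i in range(s)]
    Bb = [[B[i*bs:(i+1)*bs, j*bs:(j+1)*bs] for j in range(s)] for i in range(s)]
    # Initial skew: left-shift row i of A by i, up-shift column j of B by j,
    # so cell (i, j) starts with A(i, (i+j) mod s) and B((i+j) mod s, j).
    Ab = [[Ab[i][(i + j) % s] for j in range(s)] for i in range(s)]
    Bb = [[Bb[(i + j) % s][j] for j in range(s)] for i in range(s)]
    Cb = [[np.zeros((bs, bs)) for _ in range(s)] for _ in range(s)]
    for _ in range(s):
        # Every 'processor' multiplies its current block pair...
        for i in range(s):
            for j in range(s):
                Cb[i][j] += Ab[i][j] @ Bb[i][j]
        # ...then A rotates left by one and B rotates up by one.
        Ab = [[Ab[i][(j + 1) % s] for j in range(s)] for i in range(s)]
        Bb = [[Bb[(i + 1) % s][j] for j in range(s)] for i in range(s)]
    return np.block(Cb)

A = np.arange(16.0).reshape(4, 4)
B = np.arange(16.0).reshape(4, 4)
assert np.allclose(cannon_mat_mul(A, B, 2), A @ B)
```

After the skew, cell (i, j) computes A(i, (i+j+k) mod s) x B((i+j+k) mod s, j) at step k, which is exactly the reordered summation above; note that each cell ever holds only one block of each operand, which is the memory advantage the paper measures. Fox's algorithm differs only in using a row broadcast of a diagonal A block instead of the left rotation.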
3. Implementation and Measurement

All four algorithms were implemented in a UNIX environment on a homogeneous cluster of SUN machines. The workstations are interconnected by a 10 Mbit/s Ethernet LAN, and the PVM library was used to create the processes and distribute the tasks. The algorithms were tested for matrices of size 16 x 16 and 64 x 64. All these results are summarized in Fig. 1.

From Fig. 1 it is evident that any block-based matrix multiplication algorithm is far better than the simple matrix multiplication algorithms. The performances of Cannon's and Fox's algorithms are almost the same, but both are slightly better than the simple block-based algorithm. The main benefit of Cannon's and Fox's algorithms is that the memory usage per process is much lower than that of the other algorithms, as is very prominent in Fig. 2.

Fig. 1. The measured processing time of the four algorithms using matrices of 64 x 64 and 16 x 16.

Fig. 2. Memory used by all four algorithms.

(Manuscript received April 1, 2004)

References

(1) D. H. Bailey, K. Lee, and H. D. Simon: "Using Strassen's Algorithm to Accelerate the Solution of Linear Systems", The Journal of Supercomputing, Vol. 4, pp. 357-371 (Jan. 1991)
(2) K. Qureshi and M. Hatanaka: "An empirical study of task scheduling strategies for image processing application on heterogeneous distributed computing system", Parallel and Distributed Computing Practices Journal, No. 2, pp. 297-306 (Sept. 2000)
(3) A. Grama, A. Gupta, G. Karypis, and V. Kumar: Introduction to Parallel Computing, Second Edition, Addison-Wesley, ISBN 0201648652

Kalim Qureshi (Non-member) is an Assistant Professor in the Department of Math and Computer Science at Kuwait University, Kuwait. He received his M.S. and Ph.D. from the Department of Computer Science and Systems Engineering, Muroran Institute of Technology, Hokkaido, Japan, in 1997 and 2000, respectively. He is a member of the IEEE Computer Society. He can be contacted by E-mail: qureshi@sci.kuniv.edu.kw

Haroon Rashid (Non-member) is an Associate Professor in the Department of Computer Science at COMSATS Institute of Information Technology, Abbottabad Campus, Pakistan.
His research interests include parallel computing, distributed systems, high-speed networks, multimedia, and network performance optimization. His E-mail: haroon@ciit.net.pk

IEEJ Trans. EIS, Vol. 125, No. 4, 2005
