Parallel Algorithms for Singular Value Decomposition

Renard R. Ulrey†, Anthony A. Maciejewski‡, and Howard Jay Siegel‡
†NCR Corporation, 2001 Danfield Ct., Ft. Collins, CO 80525 USA
‡Parallel Processing Laboratory, School of Electrical Engineering, Purdue University, West Lafayette, IN 47907-1285 USA

Abstract

In motion rate control applications, it is faster and easier to solve the equations involved if the singular value decomposition (SVD) of the Jacobian matrix is first determined. A parallel SVD algorithm with minimum execution time is desired. One approach using Givens rotations lends itself to parallelization, reduces the iterative nature of the algorithm, and efficiently handles rectangular matrices. This research focuses on the minimization of the SVD execution time when using this approach. Specific issues addressed include considerations of data mapping, effects of the number of processors used on execution time, impacts of the interconnection network on performance, and trade-offs between modes of parallelism. Results are verified by experimental data collected on the PASM parallel machine prototype.

1: Introduction

Decreasing the execution time of computerized tasks is the focus of a tremendous amount of study. The use of parallel computer systems is one method to help decrease these times. The performance of a parallel system, however, is dependent on the algorithm implementation and the parallel machine characteristics. Performance optimization is therefore complicated, due to the wide variety of algorithm characteristics [7] and the rapidly growing variety of parallel machines that have been built or proposed. Thus, the study of mapping algorithms onto parallel machines is an important research area.

The singular value decomposition (SVD) of matrices has been extensively used in control applications, e.g.,
during the computational analysis of robotic manipulators [8, 22]. The decomposition aids the computational solution of system equations such as the motion rate control formula ẋ = Jθ̇, where ẋ ∈ R^M specifies the end effector velocity, θ̇ ∈ R^N specifies the joint velocities, and J ∈ R^{M×N} is the Jacobian matrix [21]. For systems with many cooperating manipulators the value of N can reach into the hundreds, resulting in a severe computational burden for achieving real-time control.

(This research was supported in part by the National Science Foundation under grant CDA-9015696, by Sandia National Laboratories, and by Rome Laboratory. © 1996 IEEE.)

In general, computation of the SVD of an arbitrary matrix is an iterative procedure, so the number of operations required to calculate it to within acceptable error limits is not known beforehand. The control of many systems, however, is based on equations involving the current Jacobian matrix, which can be regarded as a perturbation of the previous matrix, i.e., J(t+Δt) = J(t) + ΔJ. It has been demonstrated that for these cases knowledge of the previous state can be used during the computation of the current SVD to decrease execution time [12]. This paper describes and analyzes two SVD algorithm implementations for these cases. Experimental data obtained on the PASM prototype parallel computer [1, 19] is provided that supports the conclusions of the algorithm analyses.

Section 2 provides background information about SVD, Givens rotations, and PASM. Descriptions of the two parallel SVD implementations being analyzed are presented in Section 3. Section 4 demonstrates an analysis approach to determine which implementation has the shortest execution time. The performances of SVD implementations on PASM are evaluated in Section 5.

2: Background information

The SVD of a matrix J ∈ R^{M×N} is defined as the matrix factorization J = UDV^T, where U ∈ R^{M×M}
and V ∈ R^{N×N} are orthogonal matrices of the singular vectors, and D ∈ R^{M×N} is a nonnegative diagonal matrix. The singular values of J are ordered from largest to smallest along the diagonal of D. It is assumed here that M ≤ N.

The Golub-Reinsch algorithm [6] is the standard technique for determining the SVD of a matrix. This method, however, has two unattractive aspects. The first is that the algorithm, as it is defined, cannot use knowledge of a previous matrix decomposition. The second is that the technique is relatively serial in nature, making more parallelizable algorithms desirable.

Several parallel SVD algorithms have been implemented for various machine architectures, including those proposed in [3, 4, 10, 11, 16]. These implementations also do not allow their iterative natures to be reduced. The algorithms being studied in this paper are based on a methodology presented in [12], which exclusively uses Givens rotations [6] to orthogonalize matrix columns.

Successive Givens rotations are used to generate the orthogonal matrix V that will result in JV = B, where the columns of B ∈ R^{M×N} are orthogonal. A matrix with orthogonal columns can be written as the product of an orthogonal matrix U and a diagonal matrix D (i.e., B = UD) by letting the columns of U equal the normalized columns of B,

  u_i = b_i / ||b_i||   (where ||b_i|| = sqrt(b_i^T b_i)),   (1)

and defining the diagonal elements of D to be equal to the norms of the columns of B, d_ii = ||b_i||. This results in the SVD of J.

The orthogonal matrix V that will orthogonalize the columns of J is formed as a product of Givens rotations, each of which orthogonalizes two columns. Considering the ith and kth columns of an arbitrary matrix A, a single Givens rotation results in new columns, a_i' and a_k', given by

  a_i' = a_i cos(φ) + a_k sin(φ)   (2)
  a_k' = a_k cos(φ) − a_i sin(φ).   (3)

The cos(φ) and sin(φ) terms necessary to achieve orthogonality are computed using the formulas in [14], which are based on the quantities

  p = a_i^T a_k,   q = a_i^T a_i − a_k^T a_k,   and   v = sqrt(4p² + q²).

Using these quantities, when q ≥ 0,

  cos(φ) = sqrt((v + q)/(2v))   and   sin(φ) = p / (v cos(φ)).   (4)

When q < 0,

  sin(φ) = sgn(p) sqrt((v − q)/(2v))   and   cos(φ) = p / (v sin(φ)),   (5)

where sgn(p) equals 1 if p ≥ 0 and −1 if p < 0.
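As a concrete illustration of the rotation in equations (2)-(5), the following NumPy sketch computes p, q, and v, selects the appropriate branch for cos(φ) and sin(φ), and updates the two columns in place (the function name and in-place interface are illustrative, not from the paper):

```python
import numpy as np

def givens_orthogonalize(A, i, k):
    """Orthogonalize columns i and k of A in place with one Givens rotation.

    The q >= 0 and q < 0 branches are the two formula sets that avoid
    subtracting nearly equal numbers."""
    ai, ak = A[:, i].copy(), A[:, k].copy()
    p = ai @ ak                    # p = a_i^T a_k
    q = ai @ ai - ak @ ak          # q = a_i^T a_i - a_k^T a_k
    v = np.hypot(2.0 * p, q)       # v = sqrt(4 p^2 + q^2)
    if v == 0.0:                   # both columns zero; nothing to rotate
        return
    if q >= 0.0:
        c = np.sqrt((v + q) / (2.0 * v))
        s = p / (v * c)
    else:
        sgn = 1.0 if p >= 0.0 else -1.0
        s = sgn * np.sqrt((v - q) / (2.0 * v))
        c = p / (v * s)
    A[:, i] = ai * c + ak * s      # equation (2)
    A[:, k] = ak * c - ai * s      # equation (3)
```

After the call the two columns are numerically orthogonal, and, since the rotation is orthogonal, the combined norm of the column pair is preserved.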
Two sets of formulas are given so that ill-conditioned equations resulting from the subtraction of nearly equal numbers can always be avoided.

To orthogonalize each possible pair of columns requires N(N−1)/2 rotations, referred to as a sweep [6]. The matrix V can be computed by iteratively forming the product of a set of sweeps and testing for convergence. While the number of sweeps required to orthogonalize the columns of J is not generally known beforehand, it was shown in [12] that by using the V matrix from the SVD of the previous J to find an initial estimate for B,

  B(t+Δt) = J(t+Δt)V(t) = [J(t) + ΔJ]V(t),   (6)

one can obtain a good approximation to the new SVD using a single sweep if ΔJ is small. Therefore, in this work the current V matrix is calculated using

  V(t+Δt) = V(t) ∏∏ G_ik,   (7)

where G_ik denotes the Givens rotation to orthogonalize columns i and k. Only a single sweep is performed to update the matrix V.

The PASM (partitionable SIMD/MIMD) parallel processing system [1, 19] was used to implement these algorithms. PASM, designed at Purdue University, supports mixed-mode parallelism; that is, it can operate in either the SIMD or MIMD mode of parallelism, and can switch modes at instruction-level granularity with generally negligible overhead. A small-scale 30-processor PASM prototype has been built with 16 PEs (processor/memory pairs) in the computational engine. For inter-PE communications, PASM uses a partitionable circuit-switched multistage cube interconnection network [18], also called an Omega network [9]. The network can be used in both SIMD and MIMD modes.

PASM is capable of employing barrier synchronization [5] in MIMD mode, called Barrier MIMD (BMIMD). Each PE executes its code independently until it arrives at a synchronization point called a barrier. Then, each PE waits at the barrier until all PEs indicate they have reached it. One use for this is to synchronize inter-PE transfers performed in MIMD mode.

3: Data mapping

3.1: Overview

Based on the equations in Section 2, Fig.
1 gives an algorithm to calculate V, D, and U using Givens rotations. This algorithm assumes that the SVD of the Jacobian matrix from the previous control sample period has been computed. Thus, for step 1, the previous V matrix is available on the system. It is assumed that the algorithm then converges with a single sweep of rotations in step 2.

Referring to the parallel execution of a Givens rotation by all PEs as a rotation step, N−1 rotation steps must be performed on N/2 column pairs to form all N(N−1)/2 column pairs. With unique column pairs distributed among N/2 PEs, inter-PE communication is avoided within each rotation step. After the initial rotation step, however, an inter-PE communication is required before each remaining rotation step. This rotate/transfer/rotate sequence is required both to form all column pairs and to converge the B and V matrices to their single-sweep values. Newly updated columns are transferred in each communication step.

  1. Calculate the initial estimate for B from J and the previous V.
  2. For all column pairs (i, k) do:
       calculate p, q, and v;
       calculate cos(φ) and sin(φ);
       perform the rotation on columns i and k of B;
       perform the rotation on columns i and k of V.
  3. Calculate D from B (d_ii = ||b_i||).
  4. Calculate U from B and D (normalize the columns of B).

Fig. 1: High-level algorithm for finding the SVD using Givens rotations.

As presented in Section 2, the calculations involved in this algorithm are straightforward. Of greater interest are ways to effectively map matrix elements to particular parallel machines, and the types of inter-PE communication these mappings dictate. Various implementations of column transfer operations have been devised, including those in [3, 4, 16]. Each of these methods maps a unique column pair to each of N/2 PEs.
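The Fig. 1 steps can be sketched serially in NumPy as follows. This is a minimal single-process sketch under stated assumptions: the function name, variable names, and the serial pair ordering are illustrative, and the paper's implementations instead distribute the column pairs across PEs.

```python
import numpy as np

def svd_single_sweep(J_new, V_prev):
    """Single-sweep SVD update in the spirit of Fig. 1.

    Starts from the estimate B = J(t+dt) V(t), applies one sweep of Givens
    rotations to B while accumulating the same rotations into V, then reads
    the diagonal of D off the column norms of B and forms U by normalizing
    the columns of B."""
    B = J_new @ V_prev                     # step 1: initial estimate for B
    V = V_prev.astype(float)               # working copy of V
    N = B.shape[1]
    for i in range(N - 1):                 # step 2: one sweep, N(N-1)/2 pairs
        for k in range(i + 1, N):
            p = B[:, i] @ B[:, k]
            q = B[:, i] @ B[:, i] - B[:, k] @ B[:, k]
            v = np.hypot(2.0 * p, q)
            if v == 0.0:
                continue                   # pair already (trivially) orthogonal
            if q >= 0.0:
                c = np.sqrt((v + q) / (2.0 * v))
                s = p / (v * c)
            else:
                sgn = 1.0 if p >= 0.0 else -1.0
                s = sgn * np.sqrt((v - q) / (2.0 * v))
                c = p / (v * s)
            for X in (B, V):               # identical rotation on B and V
                xi = X[:, i].copy()
                X[:, i] = xi * c + X[:, k] * s
                X[:, k] = X[:, k] * c - xi * s
    d = np.linalg.norm(B, axis=0)          # step 3: diagonal of D
    U = B / np.where(d > 0.0, d, 1.0)      # step 4: normalized columns of B
    return U, d, V
```

Because every rotation applied to B is also applied to V, the invariant B = J_new·V holds throughout, so U·diag(d)·V^T reconstructs J_new regardless of how well a single sweep has converged; a production version would iterate sweeps and test for convergence when ΔJ is large.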
The availability of a multistage cube network on PASM allows matrix data to be distributed across more PEs than allowed by the implementations in [3, 4, 16], and thus increases the number of PEs that can perform useful work while still performing all necessary inter-PE communications in single transfer steps.

Two different methods for mapping matrices to PEs are presented. These implementations assume that M = 2^m …

… M in these two equations, the number of FLOPs used in each implementation continues to decrease as R (and the number of PEs) increases, up to the maximum allowed when R = M. Setting the derivative of the DT count of the 2CPP approach to zero results in the mathematically optimal value of R = (N² − 2N + NM − 4N)/(2N − 2). In this equation, R may be less than M, depending on the values of N and M. Setting the derivative of the DT count of the 1CPP approach to zero results in the mathematically optimal value of R = (N² − N + NM − 2M)/(2N − 1). Again, R may be less than M, depending on the values of N and M. An examination of this equation, however, provides interesting results. Letting M = N, the equation reduces to R = N − 2N/(2N − 1), so the optimal value of R will be between N − 2 and N − 1. Therefore, when using the 1CPP algorithm with M = N, the number of DTs will decrease as R increases from 1 to M − 2. Also, if M in the original equation is reduced to less than N/2 by some power of two, the mathematically optimal value of R is larger than its assumed maximum value of M ≤ N/2, and the minimum number of DTs is always reached when the maximum number of PEs is used.

[Table 1: SVD algorithm operation count totals — floating-point operation and data transfer counts for the 2CPP and 1CPP implementations as functions of N, M, R, and r.]
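As a quick numerical check of the M = N reduction above, the stated 1CPP optimum R = (N² − N + NM − 2M)/(2N − 1) can be evaluated directly (plain Python; the helper name is illustrative):

```python
def r_opt_1cpp(N, M):
    """Mathematically optimal R from setting the derivative of the 1CPP
    data-transfer count to zero, as stated in the text."""
    return (N * N - N + N * M - 2 * M) / (2 * N - 1)

for N in (4, 8, 16, 64):
    R = r_opt_1cpp(N, N)                       # the M = N special case
    # reduces to R = N - 2N/(2N - 1), strictly between N - 2 and N - 1
    assert abs(R - (N - 2 * N / (2 * N - 1))) < 1e-9
    assert N - 2 < R < N - 1
```

Since 2N/(2N − 1) is only slightly larger than 1, the optimum sits just below N − 1, consistent with the claim that the DT count keeps decreasing as R grows from 1 to M − 2.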
The possibility that the number of DTs performed by an algorithm may increase as the number of PEs increases means that there could be a case when the total algorithm execution time increases when more PEs are used. A method is presented in the next subsection for determining whether this is true for a given system and problem size.

4.4: Performance prediction

A method is adapted from [15] to predict the number of PEs to use that will minimize the execution time for the SVD algorithm. This method gives relative weights to the FLOP and DT operations by the determination of a communication ratio (CR). This ratio is used with the complexity equations in Table 1 to predict only whether performance improves as more PEs are used. Because the numbers of FLOP and DT operations do not account for the total execution time, machine-dependent data was collected to use for the prediction.

The CR is calculated in terms of the average expected time to perform a DT over the average expected time to perform a FLOP (including memory access and array address calculation times). The units of measure for the CR are (secs./DT)/(secs./FLOP) = FLOPs/DT. Various methods can be used to determine the CR. The one chosen executes one implementation of the SVD algorithm on a small matrix, using the minimum number of PEs that the implementation allows. The 1CPP algorithm was arbitrarily selected to measure the CR, with four PEs being used to decompose a random 4×4 matrix. Although the PASM prototype can operate in different modes of parallelism, SIMD mode is used throughout this analysis for consistency. Hardware timers are used to measure the execution times of the operations being considered. Because the PASM prototype currently performs all FP calculations in software and has a relatively fast inter-PE communication network, its CR measured 0.119.
It is assumed for this analysis that the CR does not vary with the number of PEs used. Using the CR, the predicted performance (PP) of a machine running an SVD implementation is approximated by PP = (no. of FLOPs) + CR · (no. of DTs), and is a function of both matrix size and the number of PEs. With this definition, PP has units of number of FLOPs. Because the PP equations for the 2CPP and 1CPP approaches (PP_2CPP and PP_1CPP) do not consider many overhead operations, they do not provide absolute execution times, but they are reasonable estimates of relative execution times as R, N, and M are varied. Therefore, they can be analyzed to determine the number of PEs that will provide minimum execution time on a particular machine.

4.5: Implementation comparison

The operation counts of the 2CPP and 1CPP approaches are now compared. One comparison covers when the number of PEs equals the minimum common number that the two implementations can use (N PEs). A second comparison is for when the maximum common number of PEs is used (NM/2 PEs). These two cases are focused on because various numbers of PEs can be used, depending on the values of N, M, and R. The third case directly compares PP_2CPP and PP_1CPP.

To compare the two implementations with N PEs, replacements are made for R and r in the equations of Table 1 which correspond to using N PEs with either approach. The 2CPP approach requires both fewer FLOPs and fewer DTs under the constraints that M ≥ 2 and N ≥ 4 (details in [20]). Because these constraints are met for all values of N and M of interest, the 2CPP implementation is expected to be the fastest (neglecting differences in overhead between the two approaches) when the minimum common number of PEs is used.
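The PP-based selection rule can be sketched as follows. Note that the operation-count triples below are hypothetical placeholders, not values from Table 1; only the weighting PP = (no. of FLOPs) + CR · (no. of DTs) and the measured CR = 0.119 come from the text.

```python
CR = 0.119  # measured (secs/DT)/(secs/FLOP) on the PASM prototype, SIMD mode

def predicted_performance(flops, dts, cr=CR):
    """PP = (no. of FLOPs) + CR * (no. of DTs), in equivalent FLOPs."""
    return flops + cr * dts

def best_pe_count(candidates, cr=CR):
    """Pick the PE count whose (FLOPs, DTs) pair minimizes PP.

    candidates: list of (num_pes, flops, dts) triples. In the paper these
    counts would come from the Table 1 complexity equations; the triples
    used below are illustrative placeholders."""
    return min(candidates, key=lambda t: predicted_performance(t[1], t[2], cr))[0]

# Hypothetical counts for one problem size at three candidate PE counts:
candidates = [(4, 9000.0, 500.0), (8, 5200.0, 900.0), (16, 3500.0, 2100.0)]
```

With PASM's low CR the FLOP term dominates and `best_pe_count(candidates)` selects 16 PEs; on a machine with expensive communication (say `cr=10.0`) the same counts favor 4 PEs, illustrating how the DT growth can outweigh the FLOP savings.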
To compare the two approaches using NM/2 PEs, the same method is followed, with different values replacing R and r in the equations of Table 1. Analysis in [20] shows that the 1CPP implementation uses fewer FLOPs when NM/2 PEs are used, under the constraint that M > 2, which is true for all values of M of interest. It is also shown that the 1CPP implementation uses fewer DTs under the constraint (M(N−1) − (m+1) + M) > N². This inequality is not true for all values of N and M, but it can easily be shown to be true when M
