
Dense Matrix Algorithms

To accompany the text "Introduction to Parallel Computing", Addison Wesley, 2003.

Topic Overview

• Matrix-Vector Multiplication

• Matrix-Matrix Multiplication

Matrix Algorithms: Introduction

• Due to their regular structure, parallel computations involving matrices and vectors readily lend themselves to data-decomposition.

• Typical algorithms rely on input, output, or intermediate data decomposition.

• Most algorithms use one- and two-dimensional block, cyclic, and block-cyclic partitionings.

Matrix-Vector Multiplication

• We aim to multiply a dense n × n matrix A with an n × 1 vector x to yield the n × 1 result vector y.

• The serial algorithm requires n² multiplications and additions, so

W = n². (1)

Matrix-Vector Multiplication: Rowwise 1-D Partitioning

• The n × n matrix is partitioned among n processes, with each process storing one complete row of the matrix.

• The n × 1 vector x is distributed such that each process owns one of its elements.

Matrix-Vector Multiplication: Rowwise 1-D Partitioning

[Figure: multiplication of an n × n matrix with an n × 1 vector using rowwise block 1-D partitioning: (a) initial partitioning of the matrix and the starting vector x; (b) distribution of the full vector among all the processes by all-to-all broadcast; (c) entire vector distributed to each process after the broadcast; (d) final distribution of the matrix and the result vector y. For the one-row-per-process case, p = n.]

Matrix-Vector Multiplication: Rowwise 1-D Partitioning

• Since each process starts with only one element of x, an all-to-all broadcast is required to distribute all the elements to all the processes.

• Process Pi then computes y[i] = Σ(j=0 to n−1) (A[i, j] × x[j]).

• The all-to-all broadcast and the computation of y[i] both take time Θ(n). Therefore, the parallel time is Θ(n).

Matrix-Vector Multiplication: Rowwise 1-D Partitioning

• Consider now the case when p < n and we use block 1-D partitioning.

• Each process initially stores n/p complete rows of the matrix and a portion of the vector of size n/p.

• The all-to-all broadcast takes place among p processes and involves messages of size n/p.

• This is followed by n²/p local multiply-add operations, so the parallel run time of this procedure is

TP = n²/p + ts log p + tw n. (2)

This is cost-optimal.
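The data movement of the rowwise 1-D algorithm can be sketched as a single-process NumPy simulation (illustrative only: the all-to-all broadcast is modeled by concatenating the local vector pieces, and the function name and sizes are made up for this sketch):

```python
import numpy as np

def matvec_rowwise_1d(A, x, p):
    """Simulate matrix-vector multiply with rowwise block 1-D partitioning.

    Each of the p simulated processes owns n/p complete rows of A and
    n/p elements of x. The all-to-all broadcast that assembles the full
    vector is modeled by concatenating the local pieces.
    """
    rows = np.array_split(A, p)        # local row blocks, n/p rows each
    x_parts = np.array_split(x, p)     # local vector pieces, n/p each

    # All-to-all broadcast: every process ends up with the full vector.
    x_full = np.concatenate(x_parts)

    # Each process computes its n/p entries of y with local dot products.
    return np.concatenate([block @ x_full for block in rows])

A = np.arange(16.0).reshape(4, 4)
x = np.ones(4)
assert np.allclose(matvec_rowwise_1d(A, x, p=2), A @ x)
```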

Matrix-Vector Multiplication: Rowwise 1-D Partitioning

Scalability Analysis:

• The overhead is To = pTP − W = ts p log p + tw np.

• For isoefficiency we set W = KTo, where K = E/(1 − E) for the desired efficiency E.

• There is also a bound on isoefficiency due to concurrency. In this case, p < n, therefore, W = n² = Ω(p²).

Matrix-Vector Multiplication: 2-D Partitioning

• Consider the multiplication of an n × n matrix with an n × 1 vector using n² processes, i.e., each process owns a single element.

• The n × 1 vector x is initially distributed only in the last column of n processes.

Matrix-Vector Multiplication: 2-D Partitioning

[Figure: matrix-vector multiplication with block 2-D partitioning: (a) initial data distribution and communication steps to align the vector along the diagonal; (b) one-to-all broadcast of portions of the vector along process columns; (c) all-to-one reduction of partial results; (d) final distribution of the result vector. For the one-element-per-process case, p = n² if the matrix size is n × n.]

Matrix-Vector Multiplication: 2-D Partitioning

• The first communication step for the 2-D partitioning aligns the vector x along the principal diagonal of the matrix.

• The second step copies the vector elements from each diagonal process to all the processes in the corresponding column using n simultaneous broadcasts among all processes in the column.

• Finally, after the local products are computed, the result vector is obtained by an all-to-one reduction in each row.

Matrix-Vector Multiplication: 2-D Partitioning

• Three basic communication operations are used in this algorithm: one-to-one communication to align the vector along the main diagonal, one-to-all broadcast of each vector element among the n processes of each column, and all-to-one reduction in each row.

• Each of these operations takes Θ(log n) time, so the parallel time is Θ(log n).

• The cost (process-time product) is Θ(n² log n); hence, the algorithm is not cost-optimal.

Matrix-Vector Multiplication: 2-D Partitioning

• When using fewer than n² processes, each process owns an (n/√p) × (n/√p) block of the matrix.

• The vector is distributed in portions of n/√p elements in the last process-column only.

• In this case, the message sizes for the alignment, broadcast, and reduction are all (n/√p).

• The computation is a product of an (n/√p) × (n/√p) submatrix with a vector of length (n/√p).
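The three phases (align, broadcast down columns, reduce along rows) can be mimicked in a single-process NumPy sketch (illustrative: communication is modeled by plain array indexing, and `matvec_2d` and its arguments are invented for this example):

```python
import numpy as np

def matvec_2d(A, x, p):
    """Simulate matrix-vector multiply with 2-D (checkerboard) partitioning.

    A sqrt(p) x sqrt(p) process mesh owns (n/sqrt(p))-sized blocks.
    Step 1: align piece j of x to the diagonal process (j, j).
    Step 2: one-to-all broadcast down each process column, so every
            process (i, j) holds pieces[j].
    Step 3: local block-times-piece products, then an all-to-one
            reduction along each process row forms y[i*b:(i+1)*b].
    """
    q = int(round(p ** 0.5))
    n = A.shape[0]
    b = n // q                                  # block size n / sqrt(p)
    # x starts in the last process column; in the simulation the list
    # of pieces already indexes by the mesh-column j after alignment.
    pieces = [x[j * b:(j + 1) * b] for j in range(q)]
    y = np.zeros(n)
    for i in range(q):
        for j in range(q):
            block = A[i * b:(i + 1) * b, j * b:(j + 1) * b]
            y[i * b:(i + 1) * b] += block @ pieces[j]   # row reduction
    return y

A = np.arange(36.0).reshape(6, 6)
x = np.linspace(0.0, 1.0, 6)
assert np.allclose(matvec_2d(A, x, p=9), A @ x)
```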

Matrix-Vector Multiplication: 2-D Partitioning

• The first alignment step takes time ts + tw n/√p.

• The broadcast and reductions take time (ts + tw n/√p) log(√p).

• Total time is

TP ≈ n²/p + ts log p + tw (n/√p) log p. (4)

Matrix-Vector Multiplication: 2-D Partitioning

Scalability Analysis:

• To = pTP − W = ts p log p + tw n√p log p.

• Equating To with W term by term for isoefficiency, we have W = K² t²w p log² p as the dominant term (due to the tw, i.e., bandwidth, term).

• For cost optimality, we have W = n² = p log² p. For this, we have p = O(n²/log² n).

Matrix-Matrix Multiplication

• Consider the problem of multiplying two n × n dense, square matrices A and B to yield the product matrix C = A × B.

• The serial complexity is O(n³). We do not consider better serial algorithms (such as Strassen's method), although these can be used as serial kernels in the parallel algorithms.

• A useful concept in this case is called block operations. In this view, an n × n matrix A can be regarded as a q × q array of blocks Ai,j (0 ≤ i, j < q) such that each block is an (n/q) × (n/q) submatrix.

• In this view, the algorithm performs q³ matrix multiplications, each involving (n/q) × (n/q) matrices.
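The block formulation can be checked with a short NumPy sketch (the function name and the matrix sizes are illustrative; it verifies that the q³ submatrix products reproduce the ordinary product):

```python
import numpy as np

def block_matmul(A, B, q):
    """Multiply n x n matrices via the q x q block formulation:
    C[i,j] = sum over k of A[i,k] @ B[k,j], where each index denotes an
    (n/q) x (n/q) submatrix. Performs q**3 submatrix multiplications."""
    n = A.shape[0]
    b = n // q                                  # block size n/q
    C = np.zeros((n, n))
    for i in range(q):
        for j in range(q):
            for k in range(q):
                C[i*b:(i+1)*b, j*b:(j+1)*b] += (
                    A[i*b:(i+1)*b, k*b:(k+1)*b]
                    @ B[k*b:(k+1)*b, j*b:(j+1)*b])
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
B = rng.standard_normal((6, 6))
assert np.allclose(block_matmul(A, B, q=3), A @ B)
```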

Matrix-Matrix Multiplication

• Consider two n × n matrices A and B partitioned into p blocks Ai,j and Bi,j (0 ≤ i, j < √p) of size (n/√p) × (n/√p) each.

• Process Pi,j initially stores Ai,j and Bi,j and computes block Ci,j of the result matrix.

• Computing submatrix Ci,j requires all submatrices Ai,k and Bk,j for 0 ≤ k < √p.

• All-to-all broadcast blocks of A along rows and blocks of B along columns.

Matrix-Matrix Multiplication

• The two broadcasts take time 2(ts log(√p) + tw (n²/p)(√p − 1)).

• The computation requires √p multiplications of (n/√p) × (n/√p) sized submatrices.

• The parallel run time is approximately

TP = n³/p + ts log p + 2tw n²/√p. (5)

• The algorithm is cost optimal, with isoefficiency due to the bandwidth term tw and concurrency.

• A major drawback of the algorithm is that it is not memory optimal.

Matrix-Matrix Multiplication: Cannon's Algorithm

• In this algorithm, we schedule the computations of the √p processes of the ith row such that, at any given time, each process is using a different block Ai,k.

• These blocks can be systematically rotated among the processes after every submatrix multiplication so that every process gets a fresh Ai,k after each rotation.

Matrix-Matrix Multiplication: Cannon's Algorithm

[Figure: Cannon's algorithm on 16 processes: (c) A and B after initial alignment; (d) submatrix locations after first shift; (e) submatrix locations after second shift; (f) submatrix locations after third shift.]

Matrix-Matrix Multiplication: Cannon's Algorithm

• We first align the blocks of A and B so that each process can multiply its local submatrices. This is done by shifting all submatrices Ai,j to the left (with wraparound) by i steps and all submatrices Bi,j up (with wraparound) by j steps.

• After a local submatrix multiplication, each block of A moves one step left and each block of B moves one step up (again with wraparound).

• The sequence of local multiplications and single-step shifts is repeated until all √p blocks have been multiplied.
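The alignment and shift pattern can be simulated on one machine with NumPy (an illustrative sketch: the process mesh is modeled as nested lists of blocks and the wraparound shifts as index arithmetic; the function name is invented here):

```python
import numpy as np

def cannon_matmul(A, B, q):
    """Simulate Cannon's algorithm on a q x q process mesh."""
    n = A.shape[0]
    b = n // q
    # Partition into q x q grids of (n/q) x (n/q) blocks.
    Ab = [[A[i*b:(i+1)*b, j*b:(j+1)*b] for j in range(q)] for i in range(q)]
    Bb = [[B[i*b:(i+1)*b, j*b:(j+1)*b] for j in range(q)] for i in range(q)]

    # Initial alignment: shift row i of A left by i, column j of B up by j,
    # both with wraparound.
    Ab = [[Ab[i][(j + i) % q] for j in range(q)] for i in range(q)]
    Bb = [[Bb[(i + j) % q][j] for j in range(q)] for i in range(q)]

    C = [[np.zeros((b, b)) for _ in range(q)] for _ in range(q)]
    for _ in range(q):
        # Local multiply-accumulate on every mesh process.
        for i in range(q):
            for j in range(q):
                C[i][j] += Ab[i][j] @ Bb[i][j]
        # Single-step shifts: A one step left, B one step up (wraparound).
        Ab = [[Ab[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        Bb = [[Bb[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    return np.block(C)

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 6))
B = rng.standard_normal((6, 6))
assert np.allclose(cannon_matmul(A, B, q=3), A @ B)
```

After alignment, process (i, j) holds A[i, (i+j) mod q] and B[(i+j) mod q, j]; each subsequent shift advances the common index, so over q steps every pair Ai,k, Bk,j visits Pi,j exactly once.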

Matrix-Matrix Multiplication: Cannon's Algorithm

• In the alignment step, since the maximum distance over which a block shifts is √p − 1, the two shift operations require a total of 2(ts + tw n²/p) time.

• Each of the √p single-step shifts in the compute-and-shift phase of the algorithm takes ts + tw n²/p time.

• The computation time for multiplying √p matrices of size (n/√p) × (n/√p) is n³/p.

• The parallel time is approximately

TP = n³/p + 2√p ts + 2tw n²/√p. (6)

• The cost-efficiency and isoefficiency of the algorithm are identical to the first algorithm, except, this is memory optimal.

Matrix-Matrix Multiplication: DNS Algorithm

• Uses a 3-D partitioning: visualize the matrix multiplication as a cube in which matrices A and B come in two orthogonal faces and the result C comes out the other orthogonal face.

• Each internal node in the cube represents a single add-multiply operation (and thus the complexity).

Matrix-Matrix Multiplication: DNS Algorithm

• Since each add-multiply takes constant time and accumulation and broadcast take log n time, the total runtime is Θ(log n).

• This is not cost optimal. It can be made cost optimal by using n/log n processors along the direction of accumulation.
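The DNS idea of one add-multiply per point of the cube, followed by accumulation along one direction, can be sketched in NumPy (a minimal single-process illustration; the function name is invented here):

```python
import numpy as np

def dns_matmul(A, B):
    """Illustrate the DNS partitioning: form all n**3 scalar products
    A[i,k] * B[k,j] at once (one per virtual process in the n x n x n
    cube), then accumulate along the k direction."""
    # products[i, j, k] = A[i, k] * B[k, j]: the add-multiply performed
    # by virtual process P(i, j, k) after A and B are broadcast into
    # their planes of the cube.
    products = A[:, None, :] * B.T[None, :, :]
    # All-to-one reduction along the accumulation direction k.
    return products.sum(axis=2)

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4))
assert np.allclose(dns_matmul(A, B), A @ B)
```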

Matrix-Matrix Multiplication: DNS Algorithm

[Figure: the communication steps in the DNS algorithm while multiplying 4 × 4 matrices A and B on 64 processes: (a) initial distribution of A and B; (b) after moving A[i,j] from Pi,j,0 to Pi,j,j; subsequent panels show the broadcast copies and the products A[0,k] × B[k,0] accumulated along the k direction.]

Matrix-Matrix Multiplication: DNS Algorithm

• The DNS algorithm can use fewer than n³ processes by choosing p = q³ for some q < n.

• Each matrix can thus be regarded as a q × q two-dimensional square array of blocks.

• The algorithm follows from the previous one, except that in this case, we operate on blocks rather than on individual elements.

Matrix-Matrix Multiplication: DNS Algorithm

• The first one-to-one communication step is performed for both A and B, and takes ts + tw (n/q)² time for each matrix.

• The two one-to-all broadcasts take 2(ts log q + tw (n/q)² log q) time for each matrix.

• Accounting for the block multiplications and the final accumulation, the parallel time is approximately

TP = n³/p + ts log p + tw (n²/p^(2/3)) log p. (7)

Solving a System of Linear Equations

• Consider the problem of solving the linear system

a0,0x0 + a0,1x1 + · · · + a0,n−1xn−1 = b0,
a1,0x0 + a1,1x1 + · · · + a1,n−1xn−1 = b1,
.. .. .. ..
an−1,0x0 + an−1,1x1 + · · · + an−1,n−1xn−1 = bn−1.

• In matrix notation, this is written as Ax = b, where A is a dense n × n matrix with A[i, j] = ai,j, b is an n × 1 vector [b0, b1, . . . , bn−1]T, and x is the desired solution vector.

Solving a System of Linear Equations

• Two steps in the solution: reduction of the system to a triangular form, followed by back-substitution. The triangular form is:

x0 + u0,1x1 + u0,2x2 + · · · + u0,n−1xn−1 = y0,
x1 + u1,2x2 + · · · + u1,n−1xn−1 = y1,
.. ..
xn−1 = yn−1.

• We write this as Ux = y. A commonly used method for transforming a given matrix into an upper-triangular matrix is Gaussian elimination.

Gaussian Elimination

1. procedure GAUSSIAN ELIMINATION (A, b, y)
2. begin

3. for k := 0 to n − 1 do /* Outer loop */

4. begin

5. for j := k + 1 to n − 1 do

6. A[k, j] := A[k, j]/A[k, k]; /* Division step */

7. y[k] := b[k]/A[k, k];

8. A[k, k] := 1;

9. for i := k + 1 to n − 1 do

10. begin

11. for j := k + 1 to n − 1 do

12. A[i, j] := A[i, j] − A[i, k] × A[k, j]; /* Elimination step */

13. b[i] := b[i] − A[i, k] × y[k];

14. A[i, k] := 0;

15. endfor; /* Line 9 */

16. endfor; /* Line 3 */

17. end GAUSSIAN ELIMINATION
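The pseudocode above translates directly into NumPy (a serial transcription for illustration; like the pseudocode, it performs no pivoting, so every A[k, k] must be nonzero, and A and b are modified in place):

```python
import numpy as np

def gaussian_eliminate(A, b):
    """Reduce Ax = b to the unit upper-triangular system Ux = y,
    following the pseudocode above (no pivoting)."""
    n = len(b)
    y = np.zeros(n)
    for k in range(n):                           # outer loop
        A[k, k+1:] /= A[k, k]                    # division step
        y[k] = b[k] / A[k, k]
        A[k, k] = 1.0
        for i in range(k + 1, n):
            A[i, k+1:] -= A[i, k] * A[k, k+1:]   # elimination step
            b[i] -= A[i, k] * y[k]
            A[i, k] = 0.0
    return A, y

A = np.array([[2.0, 1.0], [4.0, 5.0]])
b = np.array([3.0, 6.0])
U, y = gaussian_eliminate(A, b)
# U is unit upper-triangular; back-substitution on Ux = y recovers x.
assert np.allclose(np.tril(U, -1), 0.0) and np.allclose(np.diag(U), 1.0)
```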

Gaussian Elimination

• In the kth iteration of the outer loop, the algorithm performs (n − k)² computations. Summing over k = 1 . . . n, we have roughly (n³/3) multiplication-subtraction pairs.

[Figure: a typical computation in Gaussian elimination. Entries A[i, j] with i, j ≥ k (row k, column k, column j, and below) form the active part of the matrix; the rest is inactive.]

Parallel Gaussian Elimination

• Assume p = n, with each row assigned to one process.

• The first step of the algorithm normalizes the row. This is a serial operation and takes time (n − k) in the kth iteration.

• In the second step, the normalized row is broadcast to all the processes. This takes time (ts + tw (n − k − 1)) log n.

• Each process can independently eliminate this row from its own. This requires (n − k − 1) multiplications and subtractions.

• The total parallel time can be computed by summing over k = 1 . . . n − 1 as

TP = (3/2) n(n − 1) + ts n log n + (1/2) tw n(n − 1) log n. (8)

Parallel Gaussian Elimination

[Figure: Gaussian elimination steps during the iteration corresponding to k = 3 for an 8 × 8 matrix partitioned rowwise among eight processes: (a) computation: (i) A[k, j] := A[k, j]/A[k, k] for k < j < n, (ii) A[k, k] := 1; (b) communication: one-to-all broadcast of row A[k, *]; (c) computation: (i) A[i, j] := A[i, j] − A[i, k] × A[k, j] for k < i < n and k < j < n, (ii) A[i, k] := 0 for k < i < n.]

Parallel Gaussian Elimination: Pipelined Execution

• In the algorithm above, the (k + 1)st iteration starts only after all the computation and communication for the kth iteration is complete, so processes spend much of their time idle.

• Performance improves substantially if the steps are overlapped: each iteration consists of normalization of a row, communication, and elimination. These steps are performed in an asynchronous fashion.

• When process Pk has eliminated all earlier rows from its own row, it normalizes that row. Once it has done this, it forwards its own row to process Pk+1.

Parallel Gaussian Elimination: Pipelined Execution

[Figure: successive snapshots of pipelined Gaussian elimination on a 5 × 5 matrix partitioned with one row per process; normalized rows (leading 1s) and eliminated entries (0s) advance through the matrix as a front.]

Parallel Gaussian Elimination: Pipelined Execution

• The total number of steps in the entire pipelined procedure is Θ(n).

• In any step, either O(n) elements are communicated between directly-connected processes, or a division step is performed on O(n) elements of a row, or an elimination step is performed on O(n) elements of a row.

• The parallel run time is therefore O(n²), which is cost optimal.

Parallel Gaussian Elimination: Pipelined Execution

[Figure: the communication in the Gaussian elimination iteration corresponding to k = 3 for an 8 × 8 matrix distributed among four processes using block 1-D partitioning.]

Parallel Gaussian Elimination: Block 1D with p < n

• The above algorithm can be easily adapted to the case when p < n.

• In the kth iteration, a process with all rows belonging to the active part of the matrix performs (n − k − 1)n/p multiplications and subtractions.

• In the pipelined version, for n > p, computation dominates communication.

• The total computation time is 2(n/p) Σ(k=0 to n−1) (n − k − 1), which is approximately n³/p.

• The total cost of the parallel algorithm is higher than the sequential run time by a factor of 3/2.

Parallel Gaussian Elimination: Block 1D with p < n

[Figure: computational load on different processes in block and cyclic 1-D partitioning of an 8 × 8 matrix on four processes during the Gaussian elimination iteration corresponding to k = 3.]

Parallel Gaussian Elimination: Cyclic 1D Mapping

• The load imbalance of the block mapping can be alleviated by using a cyclic mapping, which assigns rows to processes in a round-robin fashion.

• The active rows then stay nearly evenly divided among the processes in every iteration, greatly reducing the load imbalance.

• The cumulative load-imbalance overhead is O(n²p) (instead of O(n³) in the previous case).

Parallel Gaussian Elimination: 2-D Mapping

• Assume an n × n matrix A mapped onto an n × n mesh of processes.

• Each update of the partial matrix can be viewed as a scaled rank-one update (scaling by the pivot element).

• In each iteration, a process updates the single element it owns. For this it needs the corresponding value from the pivot row, and the scaling value from its own row.

Parallel Gaussian Elimination: 2-D Mapping

[Figure: various steps in the Gaussian elimination iteration corresponding to k = 3 for an 8 × 8 matrix on 64 processes arranged in a logical two-dimensional mesh; the panel labels show updates for (k − 1) < i < n, for k < j < n, and for k < i < n and k < j < n.]

Parallel Gaussian Elimination: 2-D Mapping with Pipelining

• Like the 1-D case, the algorithm can be pipelined: the scaling values (the pivot column) are pipelined along the rows, and the scaled pivot row is pipelined down the columns.

• A process performs the elimination step A[i, j] := A[i, j] − A[i, k] × A[k, j] as soon as A[i, k] and A[k, j] are available.

• The computation and communication of each iteration move through the mesh from top-left to bottom-right as a "front."

• After the front corresponding to a certain iteration passes through a process, the process is free to perform subsequent iterations.

• Multiple fronts that correspond to different iterations are active simultaneously.

Parallel Gaussian Elimination: 2-D Mapping with Pipelining

• If each step (communication, division, or elimination) is assumed to take constant time, the front moves a single step in this time. The front takes Θ(n) time to reach Pn−1,n−1.

• Once a front has passed a process, the next front can be initiated. In this way, the last front passes the bottom-right corner of the matrix Θ(n) steps after the first one.

• The parallel run time is therefore Θ(n), which is cost-optimal.

2-D Mapping with Pipelining

[Figure: successive snapshots of pipelined Gaussian elimination on a 5 × 5 matrix with one element per process; the division, broadcast, and elimination fronts sweep from the top-left toward the bottom-right corner of the mesh.]

Parallel Gaussian Elimination: 2-D Mapping with Pipelining and p < n

• In this case, a process containing a completely active part of the matrix performs n²/p multiplications and subtractions, and communicates n/√p words along its row and its column.

• The computation dominates the communication. The total parallel run time is (2n²/p) × n, since there are n iterations. This is equal to 2n³/p.

Parallel Gaussian Elimination: 2-D Mapping with Pipelining and p < n

[Figure: the communication steps in the Gaussian elimination iteration corresponding to k = 3 for an 8 × 8 matrix on 16 processes of a two-dimensional mesh, each process owning an (n/√p) × (n/√p) block; the panel labels show communication for i = k to (n − 1) and for j = (k + 1) to (n − 1).]

Parallel Gaussian Elimination: 2-D Mapping with Pipelining and p < n

[Figure: block and cyclic 2-D mappings of an 8 × 8 matrix onto 16 processes during the Gaussian elimination iteration corresponding to k = 3.]

Parallel Gaussian Elimination: 2-D Cyclic Mapping

• The idling in the block 2-D mapping can be alleviated by using a cyclic mapping.

• The maximum difference in computational load between any two processes in any iteration is that of one row and one column update.

• This contributes Θ(n√p) to the overhead function. Since there are n iterations, the total overhead is Θ(n²√p).

Gaussian Elimination with Partial Pivoting

• For numerical stability, the kth iteration selects a column i (called the pivot column) such that A[k, i] is the largest in magnitude among all A[k, j] such that k ≤ j < n.

• With rowwise 1-D partitioning, the pivot search is local to the process that owns row k, and pivoting incurs no additional overhead since the division step takes the same time as computing the max.

• With columnwise 1-D partitioning, the max must be computed across all processes, adding a log p term to the overhead.
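The effect of pivot selection can be illustrated with a small serial NumPy sketch (illustrative only, not from the slides: it implements the column pivoting described above, recording the column swaps so the solution components can be unscrambled; without pivoting, a tiny leading element like 1e-18 would destroy the accuracy of the result):

```python
import numpy as np

def solve_with_column_pivoting(A, b):
    """Gaussian elimination with partial (column) pivoting: in
    iteration k, pick column i with the largest |A[k, i]|, k <= i < n,
    and swap it with column k."""
    A = A.astype(float).copy()
    b = b.astype(float).copy()
    n = len(b)
    perm = np.arange(n)                    # tracks column (variable) swaps
    for k in range(n):
        i = k + np.argmax(np.abs(A[k, k:]))   # pivot column in row k
        A[:, [k, i]] = A[:, [i, k]]
        perm[[k, i]] = perm[[i, k]]
        piv = A[k, k]
        A[k, k:] /= piv                    # division step
        b[k] /= piv
        for r in range(k + 1, n):          # elimination step
            b[r] -= A[r, k] * b[k]
            A[r, k:] -= A[r, k] * A[k, k:]
    # Back-substitution on the unit upper-triangular system.
    x = np.zeros(n)
    for k in range(n - 1, -1, -1):
        x[k] = b[k] - A[k, k + 1:] @ x[k + 1:]
    # Undo the column swaps: x[k] is the value of variable perm[k].
    sol = np.empty(n)
    sol[perm] = x
    return sol

A = np.array([[1e-18, 1.0], [1.0, 1.0]])
b = np.array([1.0, 2.0])
assert np.allclose(solve_with_column_pivoting(A, b), [1.0, 1.0])
```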

Gaussian Elimination with Partial Pivoting: 2-D Partitioning

• With 2-D partitioning, the pivot search requires communication among the processes sharing row k; pivoting restricts the use of pipelining, resulting in performance loss.

• This loss can be alleviated by restricting the pivot search to a subset of columns.

Solving a Triangular System: Back-Substitution

• The upper-triangular system Ux = y is solved by back-substitution to determine the vector x.

1. procedure BACK SUBSTITUTION (U, x, y)
2. begin

3. for k := n − 1 downto 0 do /* Main loop */

4. begin

5. x[k] := y[k];

6. for i := k − 1 downto 0 do

7. y[i] := y[i] − x[k] × U [i, k];

8. endfor;

9. end BACK SUBSTITUTION
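The back-substitution pseudocode also translates directly into NumPy (a serial transcription for illustration, assuming U is unit upper-triangular as produced by the elimination above):

```python
import numpy as np

def back_substitution(U, y):
    """Solve the unit upper-triangular system Ux = y, following the
    pseudocode above; the update of y (lines 6-7) is vectorized."""
    n = len(y)
    y = y.copy()                     # keep the caller's y intact
    x = np.zeros(n)
    for k in range(n - 1, -1, -1):   # main loop
        x[k] = y[k]
        y[:k] -= x[k] * U[:k, k]     # y[i] := y[i] - x[k] * U[i, k]
    return x

U = np.array([[1.0, 0.5], [0.0, 1.0]])
y = np.array([1.5, 0.0])
x = back_substitution(U, y)
assert np.allclose(U @ x, y)
```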

Solving a Triangular System: Back-Substitution

• The serial algorithm performs approximately n²/2 multiplications and subtractions.

• Since this is a lower-order term than the factorization cost, we should optimize the data distribution for the factorization part.

• Consider a rowwise block 1-D mapping of the matrix U with vector y distributed uniformly.

• The value of the variable solved in each iteration must be communicated to the remaining processes; this communication can be pipelined back.

• Each step of a pipelined implementation requires a constant amount of time for communication and Θ(n/p) time for computation.

Solving a Triangular System: Back-Substitution

• If the matrix is partitioned by using 2-D partitioning on a √p × √p logical mesh of processes, and the elements of the vector are distributed along one of the columns of the process mesh, then only the √p processes containing the vector perform any computation.

• Using pipelining to communicate the appropriate elements of U to the process containing the corresponding elements of y for the substitution step (line 7), the algorithm can be executed in Θ(n²/√p) time.

• While this is not cost optimal, since this does not dominate the overall computation, the cost optimality is determined by the factorization.
