
Dense Matrix Algorithms

Mapping Matrices onto Processors


1- Striped Partitioning
The matrix is divided into groups of complete rows or columns,
and each processor is assigned one such group.

• The partitioning is uniform if each group contains an equal number of rows or columns.

• The partitioning is called block striped if each processor is assigned contiguous rows or columns.

• If the rows or columns of the matrix are distributed among the processors one at a time in a wraparound manner, the partitioning is called cyclic striped mapping.
* A mapping that is a hybrid between block and cyclic mapping is called block-cyclic striped mapping. The matrix is striped into blocks of q rows (q < n/p), and these blocks are distributed among the processors in a cyclic manner.
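To make the three striped mappings concrete, here is a small Python sketch (the function names and the row-wise orientation are illustrative, not from the lecture) computing which processor owns each row of an n x n matrix:

```python
def block_striped(n, p):
    """Block striping: row i goes to processor i // (n/p),
    so each processor gets a contiguous group of n/p rows."""
    return [i // (n // p) for i in range(n)]

def cyclic_striped(n, p):
    """Cyclic striping: row i goes to processor i % p,
    i.e. rows are dealt out in wraparound fashion."""
    return [i % p for i in range(n)]

def block_cyclic_striped(n, p, q):
    """Block-cyclic striping: rows are grouped into blocks of q rows
    (q < n/p) and the blocks are dealt out to processors cyclically."""
    return [(i // q) % p for i in range(n)]

if __name__ == "__main__":
    print(block_striped(8, 4))            # [0, 0, 1, 1, 2, 2, 3, 3]
    print(cyclic_striped(8, 4))           # [0, 1, 2, 3, 0, 1, 2, 3]
    print(block_cyclic_striped(8, 2, 2))  # [0, 0, 1, 1, 0, 0, 1, 1]
```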
2- Checkerboard Partitioning
The matrix is divided into smaller square or rectangular blocks or submatrices that are distributed among the processors.

* In uniform checkerboard partitioning, all submatrices are of the same size.

• A checkerboard partitioning splits both the rows and the columns of the matrix, so no processor is assigned any complete row or column.

• Checkerboard partitioning can be block or cyclic.



* Checkerboarding can exploit more concurrency than striping (if the parallel algorithm allows it), because the matrix computation can be divided among a maximum of n² processors using checkerboarding, whereas we cannot use more than n processors with striping.
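To make the checkerboard mapping concrete, here is a small Python sketch (the function name and layout are illustrative) assigning each element of an n x n matrix to a processor on a √p x √p grid under uniform block checkerboard partitioning; the output shows that no processor receives a complete row or column:

```python
import math

def block_checkerboard(n, p):
    """Element (i, j) goes to processor (i // b) * s + (j // b) on an
    s x s processor grid, where s = sqrt(p) and each block is b x b
    with b = n / s. Assumes sqrt(p) divides n evenly."""
    s = math.isqrt(p)   # processor grid is s x s
    b = n // s          # each submatrix is b x b
    return [[(i // b) * s + (j // b) for j in range(n)] for i in range(n)]

if __name__ == "__main__":
    for row in block_checkerboard(4, 4):
        print(row)
    # [0, 0, 1, 1]
    # [0, 0, 1, 1]
    # [2, 2, 3, 3]
    # [2, 2, 3, 3]
```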
Matrix Transposition
* If we assume that it takes unit time to exchange a pair of matrix elements, then the sequential run time of transposing an n x n matrix is (n² - n)/2, which can be approximated to n²/2.
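A minimal sequential sketch in Python illustrating this count: an in-place transpose performs exactly (n² - n)/2 pairwise exchanges.

```python
def transpose_in_place(A):
    """Sequential in-place transpose of an n x n matrix:
    (n^2 - n)/2 element-pair exchanges, about n^2/2 unit-time swaps."""
    n = len(A)
    swaps = 0
    for i in range(n):
        for j in range(i + 1, n):   # one swap per off-diagonal pair
            A[i][j], A[j][i] = A[j][i], A[i][j]
            swaps += 1
    return swaps

if __name__ == "__main__":
    n = 4
    A = [[n * i + j for j in range(n)] for i in range(n)]
    print(transpose_in_place(A), "==", (n * n - n) // 2)  # 6 == 6
```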

Matrix Transposition using Checkerboard Partitioning - Mesh
1- Assume that an n x n matrix is stored in an n x n mesh of processors, so that each processor holds a single element of the matrix.

* For example, consider a 4 x 4 matrix on a 16-processor mesh.


• To obtain the transpose, the matrix elements located below the diagonal must move to the corresponding diametrically opposite locations above the diagonal.
• An element located below the diagonal first moves up to the diagonal, and then to the right to its destination processor.
• An element above the diagonal moves down to the diagonal, and then to the left to its destination processor.
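A small Python sketch (illustrative, assuming one element per processor as above) counting the routing hops each element needs under this scheme; the corner elements travel the farthest, 2(n - 1) links.

```python
def mesh_transpose_hops(n):
    """Hop count for the one-element-per-processor mesh transpose:
    the element at (i, j) moves |i - j| links to reach the diagonal
    (up if below it, down if above it) and |i - j| more links across
    to its destination, 2|i - j| links in total."""
    return [[2 * abs(i - j) for j in range(n)] for i in range(n)]

if __name__ == "__main__":
    for row in mesh_transpose_hops(4):
        print(row)
    # Corner elements travel farthest: 2 * (n - 1) = 6 links for n = 4.
```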
2- Consider p processors, where p < n².

* The matrix is distributed among the processors using a uniform block checkerboard partitioning.

The transpose of the entire matrix can be computed in two phases:
1. The square matrix blocks are treated as indivisible units, and the two-dimensional array of blocks is transposed.
2. All blocks are transposed locally within their respective processors.
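Here is a sketch of the two phases using NumPy on a single machine (the function name and the block-view manipulation are illustrative; on the parallel machine, phase 1 is the communication step and phase 2 is local computation):

```python
import numpy as np

def two_phase_transpose(A, s):
    """A is n x n, viewed as an s x s array of (n/s) x (n/s) blocks,
    where s = sqrt(p). Phase 1 transposes the array of blocks as
    indivisible units; phase 2 transposes each block internally."""
    n = A.shape[0]
    b = n // s
    # View A as a 4-D array indexed (block_row, block_col, row, col).
    blocks = A.reshape(s, b, s, b).swapaxes(1, 2)
    blocks = blocks.transpose(1, 0, 2, 3)   # phase 1: swap block positions
    blocks = blocks.transpose(0, 1, 3, 2)   # phase 2: transpose each block
    return blocks.swapaxes(1, 2).reshape(n, n)

if __name__ == "__main__":
    A = np.arange(16).reshape(4, 4)
    assert (two_phase_transpose(A, 2) == A.T).all()
```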
• During the communication phase, the matrix blocks residing on the bottom-left and top-right processors cover the longest distances to swap their locations. These paths cover approximately 2√p links each. Each block contains n²/p elements, so the total communication time is approximately 2√p x n²/p = 2n²/√p.
• During the local phase, the exchange of a pair of matrix elements is assumed to take unit time, so each processor spends approximately n²/(2p) time transposing its own block.
The total parallel run time on the mesh is therefore approximately n²/(2p) + 2n²/√p.
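As a quick numeric check of this estimate, a small Python sketch (the function name and sample numbers are illustrative):

```python
from math import sqrt

def mesh_transpose_time(n, p):
    """Mesh estimate derived above, with unit time per element transfer:
    corner blocks of n^2/p elements cross ~2*sqrt(p) links, plus
    n^2/(2p) local transpose work per processor."""
    comm = 2 * sqrt(p) * (n * n / p)   # ~ 2*n^2/sqrt(p)
    comp = n * n / (2 * p)
    return comm + comp

if __name__ == "__main__":
    print(mesh_transpose_time(1024, 256))   # 131072 + 2048 = 133120.0
```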
Matrix Transposition using Hypercube

We use an algorithm known as the Recursive Transposition Algorithm (RTA).

For example, consider an 8 x 8 matrix.
If the matrix is checkerboarded into four blocks, the task of transposing the matrix involves exchanging the top-right and bottom-left blocks and then computing the transpose of each of the four blocks internally. We can compute the transposes of these blocks in parallel by further dividing each one of them into four parts and repeating the procedure (see the sketch below).
- A hypercube of p processors is composed of four subcubes of p/4 processors each.
- For n = 8 and p = 16, each subcube consists of 4 processors.
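A minimal sequential sketch of the RTA recursion in Python/NumPy (the function name is illustrative and n is assumed to be a power of two). On the hypercube, the four recursive calls proceed in parallel, one per subcube:

```python
import numpy as np

def recursive_transpose(A):
    """Split the matrix into four blocks, swap the top-right and
    bottom-left blocks, then recurse into each block."""
    n = A.shape[0]
    if n == 1:
        return A
    h = n // 2
    B = np.empty_like(A)
    B[:h, :h] = recursive_transpose(A[:h, :h])   # top-left stays put
    B[h:, h:] = recursive_transpose(A[h:, h:])   # bottom-right stays put
    B[:h, h:] = recursive_transpose(A[h:, :h])   # bottom-left -> top-right
    B[h:, :h] = recursive_transpose(A[:h, h:])   # top-right -> bottom-left
    return B

if __name__ == "__main__":
    A = np.arange(64).reshape(8, 8)              # the 8 x 8 example above
    assert (recursive_transpose(A) == A.T).all()
```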
- The matrix is divided into p blocks, so the size of each block is (n/√p) x (n/√p), i.e. n²/p elements.
- Each communication step moves one block across a link and therefore takes n²/p time.
- Every exchange of data between diagonally opposite subcubes takes 2 communication steps, and there are (log p)/2 levels of recursion, so the total number of communication steps is log p.
- Total communication time = (n²/p) log p.
- The total computation inside each processor (the local block transpose) takes time equal to n²/(2p).
- The total parallel run time is n²/(2p) + (n²/p) log p.
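A quick numeric check of this expression in Python (sample numbers are illustrative), compared against the mesh estimate of the previous section:

```python
from math import log2

def hypercube_transpose_time(n, p):
    """RTA estimate from above: log p communication steps, each moving
    one block of n^2/p elements, plus n^2/(2p) local transpose work."""
    return (n * n / p) * log2(p) + n * n / (2 * p)

if __name__ == "__main__":
    # For n = 1024, p = 256: 32768 + 2048 = 34816, versus ~133120
    # for the mesh estimate of the previous section.
    print(hypercube_transpose_time(1024, 256))
```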
Matrix-Vector Multiplication

Using fewer than n processors (p < n), each processor is assigned n/p complete rows of the matrix (block striped partitioning) and the corresponding n/p elements of the vector. Since every processor needs the entire vector to compute its portion of the result, the vector is first gathered on all processors by an all-to-all broadcast; each processor then computes its n/p entries of the result in n²/p time.

Parallel Run Time on Hypercube: the all-to-all broadcast takes approximately n + log p time (with unit time per element transfer), so the total parallel run time is approximately n²/p + n + log p.


Using fewer than n processors, the same row-striped algorithm runs on a mesh; only the cost of the all-to-all broadcast changes.

Parallel Run Time on Mesh: on a wraparound mesh of p processors, the all-to-all broadcast takes approximately n + 2√p time, so the total parallel run time is approximately n²/p + n + 2√p.
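A small Python/NumPy sketch of the row-striped algorithm (the function name is illustrative; the sequential loop stands in for the p processors working in parallel, and the all-to-all broadcast is implicit since x is already visible to all):

```python
import numpy as np

def striped_matvec(A, x, p):
    """Row-striped matrix-vector multiplication with p < n processors:
    processor k holds n/p complete rows of A and, once the full vector
    is available, computes its n/p entries of the result."""
    n = A.shape[0]
    rows = n // p
    y = np.empty(n)
    for k in range(p):                 # each iteration = one processor
        lo, hi = k * rows, (k + 1) * rows
        y[lo:hi] = A[lo:hi, :] @ x     # local work: (n/p) rows of length n
    return y

if __name__ == "__main__":
    n, p = 8, 4
    A = np.arange(n * n, dtype=float).reshape(n, n)
    x = np.ones(n)
    assert np.allclose(striped_matvec(A, x, p), A @ x)
```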
