
Dense Matrix Algorithms

Mapping Matrices onto Processors


1- Striped Partitioning
The matrix is divided into groups of complete rows or columns,
and each processor is assigned one such group.

• The partitioning is uniform if each group contains an equal number of rows or columns.

• The partitioning is called block striped if each processor is assigned contiguous rows or columns.

• If the rows or columns of the matrix are distributed among the processors one at a time in a wraparound manner, the partitioning is called cyclic striped mapping.
* A mapping that is a hybrid between block and cyclic mapping is called block-cyclic striped mapping. The matrix is striped into blocks of q rows (q < n/p), and these blocks are distributed among the processors in a cyclic manner.
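To make the three striped mappings concrete, here is a small Python sketch (the function names and the row-wise orientation are illustrative, not from the lecture) computing which processor owns each row of an n x n matrix:

```python
def block_striped(n, p):
    """Block striping: row i goes to processor i // (n/p),
    so each processor gets a contiguous group of n/p rows."""
    return [i // (n // p) for i in range(n)]

def cyclic_striped(n, p):
    """Cyclic striping: row i goes to processor i % p,
    i.e. rows are dealt out in wraparound fashion."""
    return [i % p for i in range(n)]

def block_cyclic_striped(n, p, q):
    """Block-cyclic striping: rows are grouped into blocks of q rows
    (q < n/p) and the blocks are dealt out to processors cyclically."""
    return [(i // q) % p for i in range(n)]

if __name__ == "__main__":
    print(block_striped(8, 4))            # [0, 0, 1, 1, 2, 2, 3, 3]
    print(cyclic_striped(8, 4))           # [0, 1, 2, 3, 0, 1, 2, 3]
    print(block_cyclic_striped(8, 2, 2))  # [0, 0, 1, 1, 0, 0, 1, 1]
```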
2- Checkerboard Partitioning
The matrix is divided into smaller square or rectangular blocks or submatrices that are distributed among the processors.

* In uniform checkerboard partitioning, all submatrices are of the same size.

• A checkerboard partitioning splits both the rows and the columns of the matrix, so no processor is assigned any complete row or column.

• Checkerboard partitioning can be block or cyclic.



* Checkerboarding can exploit more concurrency than striping (if the parallel algorithm allows it), because the matrix computation can be divided among a maximum of n² processors using checkerboarding, whereas we cannot use more than n processors with striping.
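To make the checkerboard mapping concrete, here is a small Python sketch (the function name and layout are illustrative) assigning each element of an n x n matrix to a processor on a √p x √p grid under uniform block checkerboard partitioning; the output shows that no processor receives a complete row or column:

```python
import math

def block_checkerboard(n, p):
    """Element (i, j) goes to processor (i // b) * s + (j // b) on an
    s x s processor grid, where s = sqrt(p) and each block is b x b
    with b = n / s. Assumes sqrt(p) divides n evenly."""
    s = math.isqrt(p)   # processor grid is s x s
    b = n // s          # each submatrix is b x b
    return [[(i // b) * s + (j // b) for j in range(n)] for i in range(n)]

if __name__ == "__main__":
    for row in block_checkerboard(4, 4):
        print(row)
    # [0, 0, 1, 1]
    # [0, 0, 1, 1]
    # [2, 2, 3, 3]
    # [2, 2, 3, 3]
```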
Matrix Transposition
* If we assume that it takes unit time to exchange a pair of matrix elements, then the sequential run time of transposing an n x n matrix is (n² - n)/2, which can be approximated to n²/2.
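A minimal sequential sketch in Python illustrating this count: an in-place transpose performs exactly (n² - n)/2 pairwise exchanges.

```python
def transpose_in_place(A):
    """Sequential in-place transpose of an n x n matrix:
    (n^2 - n)/2 element-pair exchanges, about n^2/2 unit-time swaps."""
    n = len(A)
    swaps = 0
    for i in range(n):
        for j in range(i + 1, n):   # one swap per off-diagonal pair
            A[i][j], A[j][i] = A[j][i], A[i][j]
            swaps += 1
    return swaps

if __name__ == "__main__":
    n = 4
    A = [[n * i + j for j in range(n)] for i in range(n)]
    print(transpose_in_place(A), "==", (n * n - n) // 2)  # 6 == 6
```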

Matrix Transposition using Checkerboard Partitioning - Mesh
1- Assume that an n x n matrix is stored in an n x n mesh of processors, so that each processor holds a single element of the matrix.

* For example, consider a 4 x 4 matrix on a 16-processor mesh.


• To obtain the transpose, the matrix elements located below the diagonal must move to the corresponding diametrically opposite locations above the diagonal.
• An element located below the diagonal first moves up to the diagonal, and then to the right to its destination processor.
• An element above the diagonal moves down to the diagonal, and then to the left to its destination processor.
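A small Python sketch (illustrative, assuming one element per processor as above) counting the routing hops each element needs under this scheme; the corner elements travel the farthest, 2(n - 1) links.

```python
def mesh_transpose_hops(n):
    """Hop count for the one-element-per-processor mesh transpose:
    the element at (i, j) moves |i - j| links to reach the diagonal
    (up if below it, down if above it) and |i - j| more links across
    to its destination, 2|i - j| links in total."""
    return [[2 * abs(i - j) for j in range(n)] for i in range(n)]

if __name__ == "__main__":
    for row in mesh_transpose_hops(4):
        print(row)
    # Corner elements travel farthest: 2 * (n - 1) = 6 links for n = 4.
```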
2- Consider p processors, where p < n².

* The matrix is distributed among the processors using a uniform block checkerboard partitioning.

The transpose of the entire matrix can be computed in two phases:
1. The square matrix blocks are treated as indivisible units, and the two-dimensional array of blocks is transposed.
2. All blocks are transposed locally within their respective processors.
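Here is a sketch of the two phases using NumPy on a single machine (the function name and the block-view manipulation are illustrative; on the parallel machine, phase 1 is the communication step and phase 2 is local computation):

```python
import numpy as np

def two_phase_transpose(A, s):
    """A is n x n, viewed as an s x s array of (n/s) x (n/s) blocks,
    where s = sqrt(p). Phase 1 transposes the array of blocks as
    indivisible units; phase 2 transposes each block internally."""
    n = A.shape[0]
    b = n // s
    # View A as a 4-D array indexed (block_row, block_col, row, col).
    blocks = A.reshape(s, b, s, b).swapaxes(1, 2)
    blocks = blocks.transpose(1, 0, 2, 3)   # phase 1: swap block positions
    blocks = blocks.transpose(0, 1, 3, 2)   # phase 2: transpose each block
    return blocks.swapaxes(1, 2).reshape(n, n)

if __name__ == "__main__":
    A = np.arange(16).reshape(4, 4)
    assert (two_phase_transpose(A, 2) == A.T).all()
```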
• During the communication phase, the matrix blocks residing on the bottom-left and top-right processors cover the longest distances to swap their locations. These paths cover approximately 2√p links each. Each block contains n²/p elements, so the total communication time is approximately 2√p x n²/p = 2n²/√p.
• During the local phase, the exchange of a pair of matrix elements is assumed to take unit time, so each processor spends approximately n²/(2p) time transposing its own block.
The total parallel run time on the mesh is therefore approximately n²/(2p) + 2n²/√p.
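As a quick numeric check of this estimate, a small Python sketch (the function name and sample numbers are illustrative):

```python
from math import sqrt

def mesh_transpose_time(n, p):
    """Mesh estimate derived above, with unit time per element transfer:
    corner blocks of n^2/p elements cross ~2*sqrt(p) links, plus
    n^2/(2p) local transpose work per processor."""
    comm = 2 * sqrt(p) * (n * n / p)   # ~ 2*n^2/sqrt(p)
    comp = n * n / (2 * p)
    return comm + comp

if __name__ == "__main__":
    print(mesh_transpose_time(1024, 256))   # 131072 + 2048 = 133120.0
```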
Matrix Transposition using Hypercube

We use an algorithm known as the Recursive Transposition Algorithm (RTA).

For example, consider an 8 x 8 matrix.
If the matrix is checkerboarded into four blocks, the task of transposing the matrix involves exchanging the top-right and bottom-left blocks and then computing the transpose of each of the four blocks internally. We can compute the transposes of these blocks in parallel by further dividing each one of them into four parts and repeating the procedure (see the sketch below).
- A hypercube of p processors is composed of four subcubes of p/4 processors each.
- For n = 8 and p = 16, each subcube consists of 4 processors.
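A minimal sequential sketch of the RTA recursion in Python/NumPy (the function name is illustrative and n is assumed to be a power of two). On the hypercube, the four recursive calls proceed in parallel, one per subcube:

```python
import numpy as np

def recursive_transpose(A):
    """Split the matrix into four blocks, swap the top-right and
    bottom-left blocks, then recurse into each block."""
    n = A.shape[0]
    if n == 1:
        return A
    h = n // 2
    B = np.empty_like(A)
    B[:h, :h] = recursive_transpose(A[:h, :h])   # top-left stays put
    B[h:, h:] = recursive_transpose(A[h:, h:])   # bottom-right stays put
    B[:h, h:] = recursive_transpose(A[h:, :h])   # bottom-left -> top-right
    B[h:, :h] = recursive_transpose(A[:h, h:])   # top-right -> bottom-left
    return B

if __name__ == "__main__":
    A = np.arange(64).reshape(8, 8)              # the 8 x 8 example above
    assert (recursive_transpose(A) == A.T).all()
```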
- The matrix is divided into p blocks, so the size of each block is (n/√p) x (n/√p), i.e. n²/p elements.
- Each communication step moves one block across a link and therefore takes n²/p time.
- Every exchange of data between diagonally opposite subcubes takes 2 communication steps, and there are (log p)/2 levels of recursion, so the total number of communication steps is log p.
- Total communication time = (n²/p) log p.
- The total computation inside each processor (the local block transpose) takes time equal to n²/(2p).
- The total parallel run time is n²/(2p) + (n²/p) log p.
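A quick numeric check of this expression in Python (sample numbers are illustrative), compared against the mesh estimate of the previous section:

```python
from math import log2

def hypercube_transpose_time(n, p):
    """RTA estimate from above: log p communication steps, each moving
    one block of n^2/p elements, plus n^2/(2p) local transpose work."""
    return (n * n / p) * log2(p) + n * n / (2 * p)

if __name__ == "__main__":
    # For n = 1024, p = 256: 32768 + 2048 = 34816, versus ~133120
    # for the mesh estimate of the previous section.
    print(hypercube_transpose_time(1024, 256))
```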
Matrix-Vector Multiplication

Using fewer than n processors (p < n), each processor is assigned n/p complete rows of the matrix (block striped partitioning) and the corresponding n/p elements of the vector. Since every processor needs the entire vector to compute its portion of the result, the vector is first gathered on all processors by an all-to-all broadcast; each processor then computes its n/p entries of the result in n²/p time.

Parallel Run Time on Hypercube: the all-to-all broadcast takes approximately n + log p time (with unit time per element transfer), so the total parallel run time is approximately n²/p + n + log p.


Using fewer than n processors, the same row-striped algorithm runs on a mesh; only the cost of the all-to-all broadcast changes.

Parallel Run Time on Mesh: on a wraparound mesh of p processors, the all-to-all broadcast takes approximately n + 2√p time, so the total parallel run time is approximately n²/p + n + 2√p.
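A small Python/NumPy sketch of the row-striped algorithm (the function name is illustrative; the sequential loop stands in for the p processors working in parallel, and the all-to-all broadcast is implicit since x is already visible to all):

```python
import numpy as np

def striped_matvec(A, x, p):
    """Row-striped matrix-vector multiplication with p < n processors:
    processor k holds n/p complete rows of A and, once the full vector
    is available, computes its n/p entries of the result."""
    n = A.shape[0]
    rows = n // p
    y = np.empty(n)
    for k in range(p):                 # each iteration = one processor
        lo, hi = k * rows, (k + 1) * rows
        y[lo:hi] = A[lo:hi, :] @ x     # local work: (n/p) rows of length n
    return y

if __name__ == "__main__":
    n, p = 8, 4
    A = np.arange(n * n, dtype=float).reshape(n, n)
    x = np.ones(n)
    assert np.allclose(striped_matvec(A, x, p), A @ x)
```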
