
# Programming project

Dijkstra's algorithm

## Summer school: Introduction to High Performance Computing

Center for High-Performance Computing & KTH School of Computer
Science and Communication

Salvator Gkalea
Department ICT/ES, KTH
salvator@kth.se

Dionysios Zelios
Department of Physics, Stockholm University
dionisis.zel@gmail.com

Tutor: Stefano Markidis
Date: October 2014

Contents
- Abstract
- Introduction
- Dijkstra's Parallelization
- Simulation results and analysis
- References

## Abstract

Dijkstra's algorithm is a graph search algorithm that solves the single-source shortest path problem for a graph with non-negative edge costs, producing a shortest-path tree. The algorithm is often used in routing and as a subroutine in other graph algorithms. In this project we investigate the parallelization of this algorithm and its speedup against the sequential version. The Message Passing Interface (MPI) is used to parallelize the single-source shortest-path algorithm.

## Introduction

Dijkstra's algorithm determines the shortest path from a single source vertex to every other vertex. It can also be used for finding the cost of the shortest path from a single vertex to a single destination vertex, by stopping the algorithm once the shortest path to the destination has been determined. For instance, if the vertices of the graph represent cities and edge costs represent driving distances between pairs of cities connected by a direct road, Dijkstra's algorithm can be used to find the shortest route between one city and all other cities.
A pseudo-code representation of Dijkstra's sequential single-source shortest-paths algorithm is given below. It proceeds as follows:

1.  function Dijkstra(V, E, w, s)
2.  begin
3.      VT := {s};
4.      for all v in (V - VT) do
5.          if (s, v) exists set l[v] := w(s, v);
6.          else set l[v] := infinity;
        end for
7.      while VT ≠ V do
8.      begin
9.          find a vertex u such that l[u] := min{l[v] | v in (V - VT)};
10.         VT := VT ∪ {u};
11.         for all v in (V - VT) do
12.             l[v] := min{l[v], l[u] + w(u, v)};
13.     end while
14. end Dijkstra

Suppose we have a weighted graph G = (V, E, w), where V is the set of vertices, E the set of edges, and w the weight function. The pseudo-code above computes the shortest path from vertex s to every other vertex v. For every vertex v in (V - VT), the algorithm stores in l[v] the minimum known cost to reach v from s. Lines 5-6 initialize the l[] array with the weight of the edge from s to each vertex (or infinity if no such edge exists). In the remaining lines, the algorithm repeatedly finds the vertex u closest to s, based on weight, among all those not yet examined, and then updates the minimum distances in the l[] array.
The main idea for parallelizing this problem is that every process discovers the vertex, within its own segment, that is closest to the source. A reduction then determines the globally closest vertex, and the selection is broadcast to all the other processes. Finally, each process updates its local part of the distance array (the l[] array).
The parallelization can be achieved with a few basic MPI communication functions. These functions are described below and are used to parallelize the single-source shortest-paths algorithm.

| Function | Description |
|---|---|
| MPI_Init | Initializes the parallel environment |
| MPI_Finalize | Ends the parallel environment and releases system resources |
| MPI_Comm_size | Returns the number of parallel computing units |
| MPI_Comm_rank | Returns the identification number of the current computing unit |
| MPI_Reduce | Reduces values on all processes to a single value |
| MPI_Bcast | Broadcasts a message from the process with rank "root" to all other processes of the communicator |
| MPI_Gather | Gathers together values from a group of processes |
## Dijkstra's Parallelization

The pseudo-code above can be parallelized based on line 9, in which the minimum distance among all the vertices is computed, or based on lines 11-12, where the algorithm updates the local minimum distance array with the overall minimum.
The classic approach is to break the data to be computed into segments; each node is then responsible for processing one segment. At the end, the results from all the segments must be gathered at one node. The C code below demonstrates this logic.

#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <mpi.h>

int numV,          // number of vertices
    *todo,         // todo[v] == 1 while vertex v has not been visited yet
    numNodes,      // number of MPI processes (nodes)
    chunk,         // number of vertices handled by every node
    start, end,    // first and last vertex owned by this node
    myNode;        // rank (ID) of this node

int localMin[2],   // [0]: min distance on this node, [1]: vertex of that min
    globalMin[2],  // [0]: global min distance,       [1]: vertex of that min
    *graph,        // edge weights; graph[v * chunk + j] is the weight between
                   // global vertex v and this node's j-th local vertex
    *minDistance;  // current shortest known distance from the source

double T1, T2;     // start and end time

// Deterministic, symmetric edge weight in 1..21, so that every node
// generates a consistent graph without any communication.
int weight(int u, int v) {
    if (u == v) return 0;
    int a = u < v ? u : v, b = u < v ? v : u;
    return (a * 31 + b * 17) % 21 + 1;
}

// Generation of the graph and initialization of MPI
void init(int ac, char **av) {
    int v, j;
    numV = atoi(av[1]);
    MPI_Init(&ac, &av);
    MPI_Comm_size(MPI_COMM_WORLD, &numNodes);
    MPI_Comm_rank(MPI_COMM_WORLD, &myNode);
    chunk = numV / numNodes;
    start = myNode * chunk;
    end = start + chunk - 1;
    graph = malloc((size_t)chunk * numV * sizeof(int));
    minDistance = malloc(numV * sizeof(int));
    todo = malloc(numV * sizeof(int));
    // Each node stores only the adjacency-matrix columns of its own vertices.
    for (v = 0; v < numV; v++)
        for (j = 0; j < chunk; j++)
            graph[v * chunk + j] = weight(v, start + j);
    for (v = 0; v < numV; v++) {
        todo[v] = 1;
        minDistance[v] = INT_MAX;
    }
    minDistance[0] = 0;   // vertex 0 is the source
}

int main(int ac, char **av) {
    int i, j, step;
    // Initialization process
    init(ac, av);
    if (myNode == 0)
        T1 = MPI_Wtime();
    for (step = 0; step < numV; step++) {
        // Find the local minimum distance among my unvisited vertices
        localMin[0] = INT_MAX;
        localMin[1] = 0;
        for (i = start; i <= end; i++)
            if (todo[i] && minDistance[i] < localMin[0]) {
                localMin[0] = minDistance[i];
                localMin[1] = i;
            }
        // Reduce the (distance, vertex) pairs to the global minimum ...
        MPI_Reduce(localMin, globalMin, 1, MPI_2INT, MPI_MINLOC,
                   0, MPI_COMM_WORLD);
        // ... and broadcast the selected vertex to all nodes
        MPI_Bcast(globalMin, 1, MPI_2INT, 0, MPI_COMM_WORLD);
        // Mark that vertex as visited
        todo[globalMin[1]] = 0;
        // Update the local part of the min distance array
        if (globalMin[0] < INT_MAX)
            for (j = 0; j < chunk; j++) {
                int d = globalMin[0] + graph[globalMin[1] * chunk + j];
                if (d < minDistance[start + j])
                    minDistance[start + j] = d;
            }
    }
    // Collect every segment of the distance array on node 0;
    // MPI_IN_PLACE avoids overlapping send and receive buffers at the root
    if (myNode == 0)
        MPI_Gather(MPI_IN_PLACE, chunk, MPI_INT,
                   minDistance, chunk, MPI_INT, 0, MPI_COMM_WORLD);
    else
        MPI_Gather(minDistance + start, chunk, MPI_INT,
                   NULL, 0, MPI_INT, 0, MPI_COMM_WORLD);
    if (myNode == 0)
        T2 = MPI_Wtime();
    if (0 && myNode == 0)   // debug output, disabled
        for (i = 1; i < numV; i++)
            printf("node[%d] = %d\n", i, minDistance[i]);
    if (myNode == 0)
        printf("elapsed: %f\n", T2 - T1);
    MPI_Finalize();
    return 0;
}

Node 0 acts as a master: it collects the local minima from all nodes through the reduction (MPI_Reduce) and is responsible for broadcasting (MPI_Bcast) the resulting global minimum distance. The other nodes act as receivers, waiting for the global minimum from node 0. As mentioned before, each node processes one segment of the whole data; at the end, the distance segments from all nodes are gathered (MPI_Gather) at node 0.

## Simulation results and analysis

The experiments have been conducted on Milner with the Intel compiler.

### Time measurements

First, we run our algorithm for different mesh sizes. Each time we vary the number of cores and measure the time needed for the algorithm to run.
The measurements are shown in the tables below. The program was executed as aprun -n Nodes ./Dijkstra.out mesh, where Nodes = 1..32 and mesh is the size of the graph (e.g. mesh = 4×10^6 means the number of vertices).
mesh = 4×10^6:

| Nodes | 1 | 2 | 4 | 8 | 16 | 32 |
|---|---|---|---|---|---|---|
| Time (sec) | 0.009083 | 0.010213 | 0.007118 | 0.008694 | 0.015567 | 0.034436 |

mesh = 64×10^6:

| Nodes | 1 | 2 | 4 | 8 | 16 | 32 |
|---|---|---|---|---|---|---|
| Time (sec) | 0.14132 | 0.150809 | 0.092254 | 0.062077 | 0.105193 | 0.1638634 |

mesh = 256×10^6:

| Nodes | 1 | 2 | 4 | 8 | 16 | 32 |
|---|---|---|---|---|---|---|
| Time (sec) | 0.564806 | 0.597349 | 0.352130 | 0.202681 | 0.189242 | 0.200911 |

mesh = 900×10^6:

| Nodes | 1 | 2 | 4 | 8 | 16 | 32 |
|---|---|---|---|---|---|---|
| Time (sec) | 1.984156 | 2.077290 | 1.213774 | 0.874406 | 0.85756 | 1.056482 |

It is obvious that, for the large networks, the running time decreases as the number of cores increases, until it reaches its smallest value; after that, the running time increases again because of the communication latency.
However, for a middle-size network this reduction in running time is less pronounced. For a small network, the running time even increases as the number of cores increases, because the communication latency outweighs the benefit of using more cores.

### Speedup measurements

In our next step, we want to investigate the speedup of our algorithm. To do that, we divide the base time on one node by the time needed on multiple nodes. The measurements are given below:
mesh = 4×10^6:

| Nodes | 1 | 2 | 4 | 8 | 16 | 32 |
|---|---|---|---|---|---|---|
| Speedup | 1 | 0.889356 | 1.276060 | 1.044743 | 0.583477 | 0.263764 |

mesh = 64×10^6:

| Nodes | 1 | 2 | 4 | 8 | 16 | 32 |
|---|---|---|---|---|---|---|
| Speedup | 1 | 0.9370793 | 1.531857 | 2.27652 | 1.3434353 | 0.862425 |

mesh = 256×10^6:

| Nodes | 1 | 2 | 4 | 8 | 16 | 32 |
|---|---|---|---|---|---|---|
| Speedup | 1 | 0.945520 | 1.60397 | 2.78667 | 2.98457 | 2.81122 |

mesh = 900×10^6:

| Nodes | 1 | 2 | 4 | 8 | 16 | 32 |
|---|---|---|---|---|---|---|
| Speedup | 1 | 0.9551 | 1.63469 | 2.269147 | 2.3137 | 1.8780 |

The speedup increases as the number of cores increases, until it reaches its maximum value; after that, the speedup decreases, because the communication latency outweighs the benefit of using more cores.
As the network size increases, the number of cores needed to reach the maximum speedup also increases.

### Cost measurements

In addition, we would like to investigate the cost in relation to the number of cores used. To do that, we multiply the number of cores by the corresponding execution time. The measurements are presented below:

mesh = 4×10^6:

| Nodes | 1 | 2 | 4 | 8 | 16 | 32 |
|---|---|---|---|---|---|---|
| Cost (core·sec) | 0.009083 | 0.020426 | 0.028472 | 0.069552 | 0.249072 | 1.101952 |

mesh = 64×10^6:

| Nodes | 1 | 2 | 4 | 8 | 16 | 32 |
|---|---|---|---|---|---|---|
| Cost (core·sec) | 0.14132 | 0.301618 | 0.369016 | 0.496616 | 1.683088 | 5.2436288 |

mesh = 256×10^6:

| Nodes | 1 | 2 | 4 | 8 | 16 | 32 |
|---|---|---|---|---|---|---|
| Cost (core·sec) | 0.564806 | 1.194698 | 1.40852 | 1.621448 | 3.027872 | 6.429152 |

mesh = 900×10^6:

| Nodes | 1 | 2 | 4 | 8 | 16 | 32 |
|---|---|---|---|---|---|---|
| Cost (core·sec) | 1.984156 | 4.15458 | 4.855096 | 6.995248 | 13.72096 | 36.97687 |

The cost increases because the speedup (i.e. the benefit of a reduced running time) cannot outweigh the expense of using more cores.

## References

1) Wikipedia: http://en.wikipedia.org/wiki/Dijkstra%27s_algorithm
2) Lecture notes by Erwin Laure: http://agenda.albanova.se/conferenceDisplay.py?confId=4384
3) Han, Xiao Gang, Qin Lei Sun, and Jiang Wei Fan. "Parallel Dijkstra's Algorithm Based on Multi-Core and MPI." Applied Mechanics and Materials 441 (2014): 750-753.
4) http://www.mpich.org/
5) http://www.open-mpi.org/
6) Crauser, Andreas, et al. "A parallelization of Dijkstra's shortest path algorithm." Mathematical Foundations of Computer Science 1998. Springer Berlin Heidelberg, 1998. 722-731.
7) Meyer, Ulrich, and Peter Sanders. "Δ-stepping: A parallel single source shortest path algorithm." Algorithms - ESA '98. Springer Berlin Heidelberg, 1998. 393-404.
8) http://www.inf.ed.ac.uk/publications/thesis/online/IM040172.pdf