Parallel and Distributed Programming on Low Latency Clusters

Analyzing the functions of Section 4.4 across several profiling sessions, a common pattern was found: every function contained one or more loops carrying out a large number of instructions over arrays and matrices. For this reason a general plan was devised, summarized in Figure 17.

As a first step, the standard sequential loop is parallelized to fully exploit all eight cores that each machine offers. By setting up proper shared/private variable lists, the loop is divided among a given number of OpenMP threads, each of which carries out a portion of the iterations; as soon as a thread finishes, a new one is created and assigned an element, until the whole loop section is completed.
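The sketch below illustrates this kind of loop-level parallelization with an explicit shared/private variable list; the array names, their size and the element-wise operation are placeholders, not the actual computation of the profiled functions.

    #include <omp.h>
    #include <stdio.h>

    #define N 1024                      /* hypothetical problem size */

    static double a[N], b[N], c[N];     /* placeholders for the real arrays */

    int main(void)
    {
        int i;
        double scale = 2.0;

        /* The iterations are divided among the OpenMP threads; the arrays
           and the scaling factor are shared, the loop index is private. */
        #pragma omp parallel for shared(a, b, c, scale) private(i)
        for (i = 0; i < N; i++)
            c[i] = a[i] + scale * b[i];

        printf("processed %d elements\n", N);
        return 0;
    }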

The second step in this strategy is to split the data into two distinct and equal parts before exploiting OpenMP. Each part is submitted to a node of the cluster and executed separately; at the end of the loop the data is exchanged back with MPI and merged, so that the two machines can continue working on complete arrays. Thanks to Infiniband, the latency of the exchanged data sets is reduced to a minimum.
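A minimal sketch of this split-and-exchange step is given below, assuming exactly two MPI ranks, an even-sized array and a placeholder computation; the real code operates on the arrays and matrices of the profiled functions.

    #include <mpi.h>
    #include <omp.h>

    #define N 1024                      /* hypothetical, even array length */

    static double data[N];              /* placeholder for the real data set */

    int main(int argc, char **argv)
    {
        int rank, half = N / 2;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* assumes exactly 2 ranks */

        /* Each node works on its own half of the array with OpenMP. */
        int lo = (rank == 0) ? 0 : half;
        #pragma omp parallel for
        for (int i = lo; i < lo + half; i++)
            data[i] *= 2.0;                     /* placeholder computation */

        /* Exchange the two halves so that both machines continue with a
           complete array; the blocking semantics of the send/receive pair
           also act as an implicit synchronization point. */
        int other = 1 - rank;
        MPI_Sendrecv(&data[lo],                   half, MPI_DOUBLE, other, 0,
                     &data[rank == 0 ? half : 0], half, MPI_DOUBLE, other, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        MPI_Finalize();
        return 0;
    }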


Even though OpenMP requires only minor software modifications, some updates were carried out in order to obtain the maximum possible throughput, mainly by removing portions of redundant code.

Figure 17. Implementation scheme overview

It should be noted, however, that the software is not embarrassingly parallel; in fact, a number of modifications to the software were required in order to apply parallelization and distributed computing. The synchronization mechanism used most often is the implicit blocking offered by the send() and recv() pair; since data is exchanged between the two machines in the same manner, neither one can continue until the other is ready to process data. In other sections of the code, synchronization was achieved with native OpenMP directives, as shown in Section 5.3.4.
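For reference, the following sketch shows the kind of native OpenMP synchronization directives referred to above (critical sections and barriers); it is only an illustration with assumed variable names, not the code of Section 5.3.4.

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        double total = 0.0;             /* shared accumulator (hypothetical) */

        #pragma omp parallel
        {
            /* Placeholder per-thread work. */
            double partial = (double)omp_get_thread_num();

            /* Only one thread at a time may update the shared total. */
            #pragma omp critical
            total += partial;

            /* Every thread waits here until all updates are done. */
            #pragma omp barrier

            /* A single thread reports the result. */
            #pragma omp single
            printf("total = %f\n", total);
        }
        return 0;
    }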
