International Journal of Emerging Trends & Technology in Computer Science (IJETTCS)

Web Site: www.ijettcs.org Email: editor@ijettcs.org, editorijettcs@gmail.com


Volume 8, Issue 6, November - December 2019 ISSN 2278-6856

Input File Affinity Measure Running on Master-Worker Paradigm with MPI/PVM
Ewedafe Simon Uzezi
Department of Computing, Faculty of Science and Technology,
The University of the West Indies, Mona Kingston 7, Jamaica

Abstract: In this paper we present a parallel implementation of an alternating iterative scheme, the Mitchell-Fairweather Double Sweep method (MF-DS), for the 2-D Telegraph Equation (TE), using an input file affinity measure running on a master-worker paradigm. The implementation was carried out with the Message Passing Interface (MPI) and the Parallel Virtual Machine (PVM). The TE was discretized with the finite difference method, yielding the MF-DS scheme. The parallel performance and parallel effectiveness of the method under the Input File Affinity Measure (Iaff) were evaluated experimentally, and the numerical results show its effectiveness and parallel performance. We show in this paper that the input file affinity measure scales well while requiring little information to schedule tasks.

Keywords: Parallel Performance, Input File Affinity, Master-Worker, 2-D Telegraph Equation, MPI.

1. INTRODUCTION
Distributed applications have become increasingly popular, not only because their commodity component architecture gives them an economic advantage, but also because of their potential to achieve high performance by exploiting application-level parallelism. However, this potential is often compromised by the high overheads of communication among cluster nodes [32]. Computing infrastructures are reaching an unprecedented degree of complexity. First, parallel processing is becoming mainstream, because the frequency and power consumption wall has led to the design of multi-core processors. Second, distributed processing technologies are being widely adopted because of the deployment of the Internet and, consequently, of large-scale grid and cloud infrastructures. Together, these developments make the programming of computing infrastructure a very difficult challenge. Programmers have to face both parallel and distributed programming paradigms when designing an application, together with several software codes that are executed on various computing resources spread over the Internet within a grid or cloud-based infrastructure [29].
The parallel applications considered here are composed of sequential, independent tasks. By independent we mean that there is no communication or dependency among tasks. The input for each task is one or more files, and one file can be input to more than one task. The output of each task is also one or more files, and each task generates its own set of output files. Such applications are often referred to as parameter-sweep applications [21]. In the master-worker paradigm a master node is responsible for scheduling computation among the workers and for collecting the results [7]. The master-worker paradigm has fundamental limitations: both communication from the master and contention in accessing file repositories may become bottlenecks of the overall scheduling scheme, causing scalability problems. It is worth noting that related works on parallel implementation focus on the problem of efficiently scheduling applications to execute on distributed architectures organized either as pure master-worker or as hierarchical platforms, while lacking scalability analysis. Several metrics are available for measuring the scalability of algorithm-machine pairs.
Parallelization of Partial Differential Equations (PDEs) by time decomposition was first proposed by [34], following earlier efforts at space-time methods [27, 28, 49]. The application of alternating iterative methods to 2-D telegraph equations has shown that they need high computational power and substantial communication. Numerical algorithms comprise series of heuristics that can help to optimize a wide range of tasks required for parallel and distributed architectures to work efficiently. Genetic Algorithms (GAs), Genetic Programming (GP) and Simulated Annealing (SA) nowadays help computer designers advance computer architecture, while improvements in parallel architectures allow computing-intensive numerical algorithms to be run on other difficult problems. In the world of parallel computing, MPI is the de facto standard for implementing programs on multiprocessors. To help with program development in a distributed computing environment a number of software tools have been developed; MPI and PVM [23, 25] are chosen here since they have large user groups. Many applications are "embarrassingly" parallel and require minimal performance from MPI. These applications exploit coarse-grain parallelism and communicate rarely. Nevertheless, measuring the communication performance is useful for determining which applications are most suitable for MPI. The ultimate goal of running MPI on parallel computers is to increase programmer productivity and decrease the large software cost of using High Performance Computing (HPC) systems. Obtaining increased peak performance (i.e. exploiting more parallelism) requires more lines of code. MPI is a point-to-point message library; a significant amount of code can be added to any application to perform parallel operations.
Sequential numerical methods for solving time-dependent problems have been explored extensively [40, 47]. Attempts have also been made towards parallel solutions on distributed-memory MIMD machines. Large-scale computational scientific and engineering problems such as

time-dependent and 3-D flows of viscous elastic fluids require large computational resources, with performance approaching some tens of giga (10^9) floating point calculations per second. An alternative and cost-effective means of achieving comparable performance is distributed computing, using a system of processors loosely connected through a Local Area Network (LAN) [10]. Relevant data need to be passed from processor to processor through a message passing mechanism [13, 22, 14, 42, 30]. The telegraph equation is important for modeling several relevant problems such as signal analysis [1] and wave propagation [38]. In this paper, we deal with an electrical transmission line with constant linear parameters resistance (R), inductance (L), capacitance (C) and leakage conductance (G), in both the space and time domains. A number of iterative methods have been developed to solve the telegraph equations [18]. Some of these iterative schemes are employed on various parallel platforms [49, 27, 28]. Parallel algorithms have been implemented for the finite difference method [19, 20, 17] and for the discrete eigenfunctions method used in [2]; [17] used the AGE method on the 1-D telegraph problem, but it was not implemented on a parallel platform for parallel improvement. The boundary element method and the finite volume method using domain decomposition have been implemented in [13].
We present in this paper algorithms whose scalability is comparable to other algorithms in several circumstances. We describe the design of our parallel system through Iaff for understanding parallel execution. Through several implementations and performance improvement analyses of the alternating iterative methods, we explore how to use the Iaff on the master-worker paradigm for identifying parallel performance on the 2-D TE. We assume the amount of work is increased exclusively by increasing the number of tasks. The reason is that it is relatively easy to increase the range of parameters to be analyzed in an application. On the other hand, to increase the amount of work related to each individual task, the input data for that task would have to be changed, which may change the nature of the problem processed by those tasks and is not always feasible. It is worth noting that some of the related works mentioned focus on the problem of efficiently scheduling the decomposition to execute on distributed architectures organized either as pure master-slave or as hierarchical platforms, while lacking scalability analysis. We consider that the amount of computation associated with each task is fixed, that each task depends on one or more input files for execution, and that an input file can be shared among tasks [26]. The implementation was carried out using MPI and PVM; we solve the 2-D TE using MF-DS [47]. The method involves the solution of sets of tridiagonal equations along lines parallel to the x and y axes at the first and second time steps respectively. The first is obtained by employing the double sweep method of [37], the second that of [20, 46].
In this work we assume that the sizes of the input files and the dependencies among input files and tasks are known. We also consider that the master node has access to a storage system which serves as the repository of all input and output files. This paper is organized as follows: Section 2 discusses previous research work. Section 3 presents the model for the 2-D TE and introduces the MF-DS scheme. Section 4 describes the input file affinity measure. Section 5 exemplifies the use of the input file affinity measure. Section 6 describes the parallel implementation of the algorithms. Section 7 presents the numerical and experimental results of the schemes under consideration. Finally, Section 8 presents our conclusions.

1.1 Previous research work
Parallel performance of algorithms has been studied in several papers during the last years [6, 25, 11, 12, 15, 25]. Bag-of-Tasks (BoT) applications composed of independent tasks with file sharing have appeared in [7, 26, 31]. Their relevance has motivated the development of specialized environments which aim to facilitate the execution of large BoT applications on computational grids and clusters, such as the AppLeS Parameter-Sweep Template (APST) [5]. Giersch et al. [26] proved theoretical limits on the computational complexity associated with the scheduling problem. The authors also proposed several new heuristics which produce schedules that approach the quality achieved by the heuristics proposed in [7], while keeping the computational complexity one order of magnitude lower. In [18], an iterative scheduling approach was proposed that produces effective and efficient schedules compared to the previous works in [7, 26]. However, for homogeneous platforms, the algorithms proposed in [31] can be considerably simplified, in a way that they become equivalent to algorithms proposed previously. Fabricio [21] analyzes the scalability of BoT applications running on master-slave platforms and proposes a scalability-related measure.
Eric [3] presents a detailed study on the scheduling of tasks in the parareal algorithm that achieves significantly better efficiency than the usual algorithm. It proposes two algorithms, one which uses a manager-worker paradigm with overlap of sequential and parallel phases, and a second that is completely distributed. Hinde et al. [29] proposed a generic approach to embed the master-worker paradigm into software component models and describe how this generic approach can be implemented within an existing software component model. Many works deal with the master-worker paradigm. With respect to distributed computing, they can be divided into two categories according to [29]. On one side, some works focus on Network Enabled Servers (NES), usually based on the GridRPC operations standardized by the OGF [39]. On the other side, the master-worker paradigm is very popular on desktop grids such as SETI@HOME [23] and BOINC [4]. They all rely on a more or less automatic management of non-functional properties such as worker management, request scheduling and the transport of requests between master and workers. The desired level of transparency is provided through an application programming interface (API) that implements the user-visible part of the system. For our concern, there are two main APIs, one for the master side and one for the worker side. A key-message approach to prioritize communications along the critical path to speed up

execution of parallel applications in a cluster environment was presented by [50].
On the numerical computing side, [44] went in search of numerical consistency in parallel programming and presented methods that can drastically improve numerical consistency for parallel calculations across varying numbers of processors. The study assesses the value of the enhanced numerical consistency in the context of general finite difference calculations. In this paper, we consider the consistency approach of the alternating iterative methods, obtained from the discretization of PDEs, using Iaff implemented on the master-worker paradigm across varying numbers of processors with MPI and PVM, and we propose some performance improvement methods. In [19, 20], parallel implementations of the 2-D telegraph equation using the SPMD technique and a domain decomposition strategy were investigated; a model was presented that overlaps communication with computation to avoid unnecessary synchronization, thus yielding significant speed-up.
The ADI method for PDEs proposed by Peaceman and Rachford [41] has been widely used for solving the algebraic systems resulting from finite difference analysis of PDEs in several scientific and engineering applications. On the parallel computing front, [43] proposed a parallel ADI solver for a linear array of processors. Chan and Saied [8] implemented an ADI scheme on a hypercube. Later, Lixing et al. [35] parallelized the ADI solver on multiprocessors. The ADI method in [41] has been used for solving heat equations in 2-D. Several approaches to solving the telegraph equations using different numerical schemes have been treated in [17, 18].

2. TELEGRAPH EQUATION
This paper deals with an electrical transmission line with constant linear parameters R, L, C and G, in both the frequency and time domains. The speed of convergence of the iterative scheme is examined for the synchronous communication approach in a parallel environment [48, 49]. In the present work we deal with the numerical approximation of the second-order telegraph equation shown in Eq. (2.1), where a and b are known constant coefficients. Equation (2.1), referred to as the second-order telegraph equation with constant coefficients, models a mixture of diffusion and wave propagation by introducing a term that accounts for effects of finite velocity in the standard heat or mass transport equation [16]. The TE (2.1) is commonly used in signal analysis for the transmission and propagation of electrical signals [36] and also has applications in other fields (see [45]). In recent years, much attention has been given in the literature to the development, analysis and implementation of stable methods for the numerical solution of the 2-D telegraph equation [38]. Recently, [38] developed unconditionally stable difference schemes for the solution of multi-dimensional telegraph equations; the schemes are second-order accurate in space and time. Also of concern are suspension flows. These combine directed and random motion and are traditionally modeled by parabolic partial differential equations. Sometimes they can be better modeled (in terms of fitting the data generated by certain blood flow experiments) by the TE. The existence of time-bounded solutions of nonlinear bounded perturbations of the telegraph equation with Neumann boundary conditions has recently been considered in [1]. We consider the second-order TE as given in [17]:

\frac{\partial^2 v}{\partial t^2} + a\frac{\partial v}{\partial t} = b\left(\frac{\partial^2 v}{\partial x^2} + \frac{\partial^2 v}{\partial y^2}\right), \qquad 0 \le x \le 1,\; 0 \le y \le 1,\; t > 0    (2.1)

with initial condition

v(x, y, 0) = f(x, y)    (2.2a)

and boundary conditions

v(0, y, t) = f_1(y, t), \quad v(1, y, t) = f_2(y, t), \qquad v(x, 0, t) = f_3(x, t), \quad v(x, 1, t) = f_4(x, t).    (2.2b)

It is assumed that the initial and boundary conditions are given with sufficient smoothness to maintain the order of accuracy and consistency of the different schemes under consideration. To obtain the solutions of the above initial-boundary value problem, we divide the unit interval into subintervals. Here a = RC + GL and b = 1/(LC); let \Delta x, \Delta y and \Delta t be the grid spacings in the x, y and t directions, where \Delta x = 1/m, \Delta y = 1/n, and m and n are positive integers. The approximate values v_{i,j,k} of the solution v(x, y, t) of the problem (2.1)-(2.2b) are to be computed at the grid points (x_i, y_j, t_k), where x_i = i\Delta x, i = 0, 1, 2, ..., m, and y_j = j\Delta y, j = 0, 1, ..., n. The region and its boundary, consisting of the lines x = 0, x = 1 and t = 0, are thus discretized at the grid points (x_i, y_j, t_k). Let v_{i,j} denote the approximate value of v at the grid point (x_i, y_j, t_k). The analytical solution of the initial-boundary value problem (2.1)-(2.2b) cannot be determined for arbitrary f, and the only alternative is the application of stable numerical methods. Of the numerical methods available, the most important is the finite difference method, because it is easy to implement and universally applicable. Evans and Hassan [17] discussed an alternating group explicit finite difference method for the solution of the TE. Mohanty [38] discussed finite difference schemes of O(k^4 + h^4) for the solution of the multi-dimensional linear telegraph equation, and it has been shown that these schemes are conditionally stable. Furthermore, the standard central difference scheme of O(k^2 + h^2) for the differential equation (2.1) is obtained by using second-order central difference approximations to the derivatives. Let

t_k = k\Delta t, k = 1, 2, .... For simplicity, we take \Delta x = \Delta y = d, and sometimes denote (x_i, y_j, t_k) by (i, j, k). Among the finite difference methods for the numerical solution of the problem (2.1)-(2.2b), the classical explicit method is in principle well suited to parallel computing, but the method is stable only when \Delta t / d^2 \le 1/4, so \Delta t must be restricted to a very small value. The central and forward difference operators are given by:

\frac{\partial^2 v}{\partial t^2} \approx \frac{v^{n+1}_{i,j} - 2v^{n}_{i,j} + v^{n-1}_{i,j}}{(\Delta t)^2}, \qquad \frac{\partial v}{\partial t} \approx \frac{v^{n+1}_{i,j} - v^{n-1}_{i,j}}{2\Delta t},
\frac{\partial^2 v}{\partial x^2} \approx \frac{v^{n+1}_{i+1,j} - 2v^{n+1}_{i,j} + v^{n+1}_{i-1,j}}{(\Delta x)^2}, \qquad \frac{\partial^2 v}{\partial y^2} \approx \frac{v^{n+1}_{i,j+1} - 2v^{n+1}_{i,j} + v^{n+1}_{i,j-1}}{(\Delta y)^2},    (2.3)

so that the finite difference scheme applied to the telegraph equation (2.1) becomes:

\frac{v^{n+1}_{i,j} - 2v^{n}_{i,j} + v^{n-1}_{i,j}}{(\Delta t)^2} + a\,\frac{v^{n+1}_{i,j} - v^{n-1}_{i,j}}{2\Delta t} = b\left[\frac{v^{n+1}_{i+1,j} - 2v^{n+1}_{i,j} + v^{n+1}_{i-1,j}}{(\Delta x)^2} + \frac{v^{n+1}_{i,j+1} - 2v^{n+1}_{i,j} + v^{n+1}_{i,j-1}}{(\Delta y)^2}\right]    (2.4)

Although this simple implicit scheme is unconditionally stable, a penta-diagonal system of algebraic equations has to be solved at each time step, so the computational cost is huge.

2.1 DS-MF
We recall from [19] and [20] that the 2-D ADI method results in the following. Sub-iteration 1 is given by:

-\rho_x v^{n+1(1)}_{i-1,j} + (1 + 2\rho_x)\, v^{n+1(1)}_{i,j} - \rho_x v^{n+1(1)}_{i+1,j} = (-A_2)\, v^{n+1(*)}_{i,j} + \left(2C_o v^{n}_{i,j} - C_1 v^{n-1}_{i,j}\right)    (2.5)

Let a = 1 + 2\rho_x, b = c = -\rho_x. For the various values of i and j, (2.5) can be written in a more compact matrix form at the (k + 1/2) time level as:

A\, v^{(k+1/2)}_{j} = f_k, \qquad j = 1, 2, \ldots, n,    (2.6)

where v = (v_{1,j}, v_{2,j}, \ldots, v_{m,j})^T and f = (f_{1,j}, f_{2,j}, \ldots, f_{m,j})^T.

At the (k + 1) time level, sub-iteration 2 is given by:

-\rho_y v^{n+1(2)}_{i,j-1} + (1 + 2\rho_y)\, v^{n+1(2)}_{i,j} - \rho_y v^{n+1(2)}_{i,j+1} = v^{n+1(1)}_{i,j} - A_2 v^{n+1(*)}_{i,j}    (2.7)

Let a = 1 + 2\rho_y, b = c = -\rho_y. For the various values of i and j, (2.7) can be written in a more compact matrix form as:

B\, v^{(k+1)}_{i} = g_{k+1/2}, \qquad i = 1, 2, \ldots, m,    (2.8)

where v^{(k+1)}_i = (v_{i,1}, v_{i,2}, \ldots, v_{i,n})^T and g = (g_{i,1}, g_{i,2}, \ldots, g_{i,n})^T, and we set

v^{n+1(*)}_{i,j} = 2v^{n}_{i,j} - v^{n-1}_{i,j},    (2.9)

which is a prediction of v^{n+1}_{i,j} by extrapolation. Splitting by an ADI procedure as in [41], we obtain the following set of recursion relations:

(I + A_1)\, v^{n+1(1)}_{i,j} = (-A_2)\, v^{n+1(*)}_{i,j} + \left(2C_o v^{n}_{i,j} - C_1 v^{n-1}_{i,j}\right)    (2.10)
(I + A_2)\, v^{n+1(2)}_{i,j} = v^{n+1(1)}_{i,j} - A_2 v^{n+1(*)}_{i,j}    (2.11)

where v^{n+1(1)}_{i,j} is the intermediate solution and the desired solution is v^{n+1}_{i,j} = v^{n+1(2)}_{i,j}.
The numerical representation of Eqs. (2.10) and (2.11) using the Mitchell and Fairweather scheme is as follows:

\left[1 + \frac{1}{2}\left(\rho_x - \frac{1}{6}\right) A_1\right] v^{n+1(1)}_{i,j} = \left[1 - \frac{1}{2}\left(\rho_y + \frac{1}{6}\right) A_2\right] v^{n+1(*)}_{i,j} + \left(2C_o v^{n}_{i,j} - C_1 v^{n-1}_{i,j}\right)    (2.12)

\left[1 + \frac{1}{2}\left(\rho_y - \frac{1}{6}\right) A_2\right] v^{n+1(2)}_{i,j} = \left[1 - \frac{1}{2}\left(\rho_x + \frac{1}{6}\right) A_1\right] v^{n+1(1)}_{i,j} - A_2 v^{n+1(*)}_{i,j}    (2.13)

The horizontal sweep (2.12) and the vertical sweep (2.13) can be manipulated and written in compact matrix form. Let a = 5/6 + \rho_x, b = c = 1/12 - \rho_x/2. For the various values of i and j, (2.12) can then be written in a more compact matrix form at the (k + 1/2) time level as in (2.6). Similarly, let a = 5/6 + \rho_y, b = c = 1/12 - \rho_y/2; for the various values of i and j, (2.13) can be written in a more compact matrix form at the (k + 1) time level as in (2.8). With these coefficients the resulting tridiagonal systems of equations are solved using a similar iterative procedure as in the DS-PR, that is, the two-stage IADE-DY algorithm.
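Each horizontal or vertical sweep therefore reduces to independent tridiagonal solves along grid lines. The routine below sketches a standard constant-coefficient Thomas algorithm for one such line solve; it only illustrates the kind of kernel each worker runs per line, and the function name and calling convention are our own rather than the two-stage IADE-DY routine used in the paper.

    /* Illustrative Thomas algorithm for one tridiagonal line solve:
     * b on the sub-diagonal, a on the diagonal, c on the super-diagonal,
     * d holds the right-hand side on entry and the solution on exit.
     * Hypothetical helper assuming constant coefficients as in (2.6)/(2.8). */
    #include <stdlib.h>

    void solve_tridiag_line(double a, double b, double c, double *d, int m)
    {
        double *cp = malloc(m * sizeof(double)); /* modified super-diagonal */
        cp[0] = c / a;
        d[0]  = d[0] / a;
        for (int i = 1; i < m; i++) {            /* forward elimination */
            double denom = a - b * cp[i - 1];
            cp[i] = c / denom;
            d[i]  = (d[i] - b * d[i - 1]) / denom;
        }
        for (int i = m - 2; i >= 0; i--)         /* back substitution */
            d[i] -= cp[i] * d[i + 1];
        free(cp);
    }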
3 THE IAFF
With reference to [11], we introduce the Iaff. First, we describe a simplified execution model that captures several issues related to the execution of the telegraph equation on the master-worker paradigm. Typically, each task goes through three phases during the execution of a parameter-sweep application: (1) an initialization phase, where the necessary files are sent from the master to the worker node and the task is started; the duration of this phase is t_init. Note that this phase includes the overhead incurred by the master to initiate a data transfer to a worker. (2) A computational phase, where the task processes the parameter file at the worker node and produces an output file; the duration of this phase is t_comp. Any additional overhead related to the reception of input files by a worker node is also included in this phase. (3) A completion phase, where the output file is sent back to the master and the task is completed; the duration of this phase is t_end. This phase may require some processing at the master, mainly related to writing output files to the repository. Since this writing may be deferred until the disk is available, we assume that this processing time is negligible. Therefore, the initialization phase of one worker can occur concurrently with the completion phase of another worker node. Given these three phases, the total execution time of a task is

t_{total} = t_{init} + t_{comp} + t_{end}    (3.1)

As the machine model, we consider a cluster composed of P + 1 processors. For the rest of this paper we assume that T >> P. One processor is the master and the other processors are workers. Communication between master and workers is carried out through Ethernet, and the master can only send files through the network to a single worker at a given time. We assume the communication link is full-duplex, i.e., the master can receive an output file from a worker at the same time it sends an input file to another worker. We also assume that computation on a worker begins as soon as the input files are completely received. This assumption is coherent with the master-worker paradigm as currently available in [14].
For the sake of simplicity and without loss of generality, we consider in this section that there is no contention related to the transmission of output files from worker nodes to the master. Indeed, it is possible to merge the computation phase with the completion phase without affecting the results of this section. Therefore, we merge both phases as t'_comp in the equations that follow. A worker is idle when it is not involved in the execution of any of the three phases of a task. For the results below, the task model is composed of T tasks; all tasks and files have the same size and each task depends upon a single non-shared file. Note that the problem of scheduling an application where each task depends upon a single non-shared file, all tasks and files have the same size and the master-worker platform is heterogeneous has polynomial complexity [14]. These assumptions are considered in the analysis that follows. We define the effective number of processors Peff as the maximum number of workers needed to run an application with no idle periods on any worker processor. Taking into account the task and platform models described in this section, a processor may have idle periods if:

t'_{comp} < (P - 1)\, t_{init}    (3.2)

Peff is then given by the following equation:

P_{eff} = \left\lceil \frac{t'_{comp}}{t_{init}} \right\rceil + 1    (3.3)

The total number of tasks to be executed on a processor is at most

M = \left\lceil \frac{T}{P} \right\rceil    (3.4)

For a platform with Peff processors, the upper bound on the total execution time (makespan) is

t_{makespan} = M\,(t_{init} + t'_{comp}) + (P - 1)\, t_{init}    (3.5)

The second term on the right-hand side of Eq. (3.5) accounts for the time needed to start the first (P - 1) tasks on the other P - 1 processors. If we have a platform where the number of processors is larger than Peff, the overall makespan is dominated by the communication times between the master and the workers. We then have:

t_{makespan} = M\,P\, t_{init} + t'_{comp}    (3.6)
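To make the bounds (3.2)-(3.6) concrete, the following small C helper evaluates them for given phase durations. The function name, variables and the sample values in main() are placeholders of our own, not measurements from the paper.

    #include <math.h>
    #include <stdio.h>

    /* Upper bound on the makespan from Eqs. (3.3)-(3.6):
     * t_init - initialization time per task (master-side transfer included)
     * t_comp - merged computation + completion time per task (t'_comp)
     * T      - number of tasks, P - number of worker processors */
    double makespan_bound(double t_init, double t_comp, long T, long P)
    {
        long p_eff = (long)ceil(t_comp / t_init) + 1;    /* Eq. (3.3) */
        long M     = (long)ceil((double)T / (double)P);  /* Eq. (3.4) */

        if (P <= p_eff)                                  /* workers stay busy */
            return M * (t_init + t_comp) + (P - 1) * t_init;   /* Eq. (3.5) */
        else                                             /* communication-bound */
            return M * P * t_init + t_comp;                    /* Eq. (3.6) */
    }

    int main(void)
    {
        /* Placeholder values, only to illustrate the two regimes. */
        printf("%f\n", makespan_bound(0.5, 20.0, 1000, 16));
        printf("%f\n", makespan_bound(0.5, 20.0, 1000, 128));
        return 0;
    }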

The set of Eqs. (3.2)-(3.6) will be used in subsequent sections of this paper. It is worth noting that Eq. (3.5) is valid when workers are constantly busy, either performing computation or communication. Eq. (3.6) applies when workers have idle periods, i.e., are not performing either computation or communication. The condition (3.2) occurs mainly in two cases: (1) for very large platforms (P large), and (2) for applications with a small ratio t_comp / t_init, such as fine-grain applications.
In order to measure the degree of affinity of a set of tasks with respect to their input files, we introduce the concept of input file affinity. Given a group G composed of K tasks, G = {T_1, T_2, ..., T_K}, and the set F of the Y input files needed by the tasks belonging to group G, F = {f_1, f_2, ..., f_Y}, we define Iaff as follows:

I_{aff}(G) = \frac{\sum_{i=1}^{Y} (N_i - 1)\,|f_i|}{\sum_{i=1}^{Y} N_i\,|f_i|}    (3.7)

where |f_i| is the size in bytes of file f_i and N_i (1 <= N_i <= K) is the number of tasks in group G which have file f_i as an input file. The term (N_i - 1) in the numerator can be explained as follows: if N_i tasks share an input file f_i, that file may be sent only once (instead of N_i times) when the group of tasks is executed on a worker node. The potential reduction in the number of bytes transferred from the master node to a worker node considering only input file f_i is then (N_i - 1)|f_i|. Therefore, the input file affinity indicates the overall reduction of the amount of data that needs to be transferred to a worker node when all tasks of a group are sent to that node. Note that 0 <= Iaff < 1. For the special case where all tasks share a single input file, Iaff = (k - 1)/k, where k is the number of tasks in the group.
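Eq. (3.7) can be computed directly from the per-file sizes and share counts of a group. The function below is a direct transcription; the record layout and the names are our own assumptions for illustration.

    /* Input file affinity of a group, Eq. (3.7).
     * files[i].size   - size of file f_i in bytes
     * files[i].ntasks - N_i, number of tasks in the group using f_i (1 <= N_i <= K) */
    struct file_use { double size; int ntasks; };

    double input_file_affinity(const struct file_use *files, int nfiles)
    {
        double saved = 0.0, total = 0.0;
        for (int i = 0; i < nfiles; i++) {
            saved += (files[i].ntasks - 1) * files[i].size; /* bytes not re-sent */
            total += files[i].ntasks * files[i].size;       /* bytes if sent once per task */
        }
        return total > 0.0 ? saved / total : 0.0;
    }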
3.1 Using the Iaff
In this section, we exemplify the use of the input file affinity concept to schedule the iterative methods discussed above for the 2-D TE. The dynamic clustering algorithm is oblivious to task execution time, i.e., it does not need to know in advance actual or estimated task execution times. This class of algorithm is important especially when information about task execution time is not available and cannot be estimated accurately. The initial idea of the algorithm is to form an initial grouping based only on static information; tasks are then replicated between nodes taking into account the rate of task completion, which depends on the load of the individual processors. A description of the algorithm follows (see Fig.1):
-1. Group tasks using the input file affinity metric.
-2. When dispatching tasks to processors, the input files of the groups of tasks are sent in a pipelined way, overlapping communication with computation whenever possible. The master node only sends a file to a worker node if the file is not already stored on the worker node, in order to save network bandwidth. The task that is executed first is the one whose sum of input file bytes to be transferred is the smallest, in order to begin execution on the worker node as soon as possible. Note that this also reduces t_init when file transfers are pipelined.

1. Group tasks according to Input File Affinity;
2. Schedule Groups of tasks on Worker Processors;
3. Wait;
4. On Pi returning results after x tasks completed do
{
5.    Compute average execution time on processor Pi;
6.    Update task queue;
7.    Abort still running replicas of completed tasks;
8.    If Processor Pi is idle
9.       If there are unfinished groups not yet replicated on slower processors
10.         Replicate unfinished group into processor Pi;
}

Fig.1 Clustering Algorithm

-4. For every x tasks completed (where x is an adjustable parameter), the worker machine should send the corresponding results back to the master node. This mechanism is similar to regular check-pointing. If the worker machine fails, the only tasks that have to be re-executed are those for which results have not yet been received. It is also possible to obtain information about the current load of a machine by measuring the number of tasks still to be executed.
-8. If a machine becomes idle, send a replica of the remaining tasks (not yet replicated) of the machine with the largest amount of unfinished computation to be executed on the idle machine.
-10. If processor Pi is idle, the master identifies the processor Ps that has made the smallest amount of progress in processing its tasks and replicates some of its tasks onto Pi. Tasks for replication are chosen in a way that maximizes input file affinity.
The complexity of this algorithm is clearly dominated by the function that generates the groups. It is impractical for an algorithm to exhaustively search the solution space to find an optimal clustering of tasks. There is a large number

of possible heuristics to cluster tasks into groups. Our objective here is to use the input file affinity measure for clustering tasks in a way that scalability is improved with the use of the iterative techniques explained above. For that reason we propose a heuristic for clustering the groups, which we call I-Group, described below and in Fig.2:
-1. Define the number of tasks to be assigned to each processor depending on its relative speed, based on the average speed of the processors of the cluster. For instance, if the relative speed is 1.0 (equal to the average of the worker processors of the cluster), the number of tasks to be assigned to that processor is at most ceil(T/P). It is worth noting that this is only a first approximation, to be adjusted later by the heuristic.
-2. Compute the byte sum of all input files needed to execute each task on a worker processor (Ifilesum).
-4. Sort all computed results by Ifilesum. Smaller Ifilesum should appear at the top of the list.
-6. Tasks are grouped in this loop. In the heuristic presented in this section there is at most one group per processor at any time during execution. At the very beginning of the execution the number of groups is equal to the number of processors, since initially there is one group per processor. First the task with the smallest Ifilesum not yet assigned to a group is selected. This is done in order to minimize the time needed to start execution of the first task on a worker node.
-10. If the set of files associated with the next task in the list defined in (4) has an input file affinity greater than β with the set of files belonging to the task just assigned, then the next task in the list is assigned to the same processor. Note that β is a tunable parameter, but normally it should be greater than 0.5. The reason is to maximize the input file affinity (Iaff) inside a group; therefore, sending similar sets of files to multiple processors can be avoided. If the set of files belonging to the next task in the list is different enough (i.e., the Iaff between the two sets of files is less than 0.5), then the task located P positions after the task just assigned is selected. This is done for two reasons: first, to guarantee that the tasks with smaller Ifilesum are dispatched first to the processors; second, to create groups as balanced as possible regarding the number of bytes that should be sent from the master node. Our objective here is to avoid the possibility of one group having tasks that depend only on small files, while other groups have tasks that depend only on larger files.

1. For all Processors define the number of tasks to be assigned
2. For all tasks
3.    Compute the input files byte sum of the task (Ifilesum).
4. Sort results of step (3) according to Ifilesum in list L
5. P = Number of processors
6. For all groups defined in step (1) (in non-increasing order of number of tasks, beginning with the largest group)
{
7.    Assign the task with the smallest Ifilesum, not yet assigned, to the group
8.    position = 1
9.    Until the group is completed do
   {
10.      If ( Iaff(L[position], L[position+1]) < β ) then position = (position + P) mod size(L)
11.      else position = (position + 1) mod size(L)
12.      Assign to the group the task located at position in list L
13.      Remove assigned tasks from list L
   }
14. End do
15. P = P - 1
}

Fig.2 I-Group - Heuristic for grouping tasks using the Iaff measure

Remember that input file transfers are done in a pipelined way, overlapping computation with communication when possible. Depending on the size of the group and the increment of the position variable, the end of list L could be reached before completing the group. Note that the complexity of the heuristic I-Group, shown in Fig.2, is dominated by steps (2), (4) and (10). The loop represented by step (2) computes the Ifilesum for all tasks; it can be implemented by processing all the T dependency lists (only once) in time O(T + D), where D is the number of dependency relations and T is the number of tasks. As step (4) sorts the tasks according to Ifilesum, its complexity is O(T log T). The complexity of steps (6)-(13) is dominated by step (10), which computes the Iaff for all pairs of adjacent tasks in the list. If Δ_T denotes the maximum number of files a task depends upon, then the Iaff of a single pair of tasks can be computed in time O(Δ_T), and the calculation for all pairs can be executed in time O(T·Δ_T). Therefore, the complexity of the heuristic I-Group is O(T + D + T log T + T·Δ_T). Furthermore, it is worth noting that T·Δ_T >= D and T·Δ_T >= T, so the complexity of the heuristic can be simply denoted as O(T log T + T·Δ_T).
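The sketch below mirrors the control flow of Fig.2 in C for the homogeneous case (P groups of at most ceil(T/P) tasks each). The pair_affinity() callback stands for Eq. (3.7) evaluated on the file sets of two tasks; all type and function names are ours, and the sketch is only meant to illustrate the position-advance rule of steps (10)-(11), not to reproduce the authors' code.

    /* Illustrative grouping pass of the I-Group heuristic (Fig.2), assuming
     * tasks 0..T-1 are already sorted by increasing Ifilesum (step 4) and
     * pair_affinity(a, b) returns Iaff over the file sets of tasks a and b. */
    #define UNASSIGNED (-1)

    void i_group(int T, int P, double beta,
                 double (*pair_affinity)(int a, int b),
                 int *group_of /* out: group index per task, UNASSIGNED on entry */)
    {
        int group_size = (T + P - 1) / P;   /* ceil(T/P), step 1 */
        int remaining  = T;

        for (int g = 0; g < P && remaining > 0; g++) {
            int pos = 0;                            /* smallest Ifilesum not yet taken */
            while (group_of[pos] != UNASSIGNED) pos++;
            group_of[pos] = g; remaining--;         /* step 7 */

            for (int filled = 1; filled < group_size && remaining > 0; filled++) {
                /* steps 10-11: jump P entries ahead when affinity is too low */
                int cand = (pair_affinity(pos, (pos + 1) % T) < beta)
                               ? (pos + P) % T
                               : (pos + 1) % T;
                while (group_of[cand] != UNASSIGNED)    /* skip assigned entries */
                    cand = (cand + 1) % T;
                group_of[cand] = g;                     /* step 12 */
                pos = cand;
                remaining--;
            }
        }
    }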

4 PARALLEL IMPLEMENTATION
The implementation is carried out on a distributed computing environment (Armadillo Generation Cluster) consisting of 48 Intel Pentium processors at 1.73 GHz with 0.99 GB RAM each. Communication is through fast Ethernet at 100 Mbit/s, running Linux. The cluster has high memory bandwidth, with message passing supported by PVM, which is public-domain software from Oak Ridge National Laboratory [25]. PVM is a software system that enables a collection of heterogeneous computers to be used as a coherent and flexible concurrent computational resource. Programs written in Fortran, C or C++ access PVM by calling PVM library routines for functions such as process initiation and message transmission and reception. The Geranium Cadcam Cluster likewise consists of 48 Intel Pentium processors at 1.73 GHz with 0.99 GB RAM each; communication is through fast Ethernet at 100 Mbit/s, running Linux. The cluster has high memory bandwidth, with message passing supported by MPI [23]. The programs, written in C, access MPI by calling MPI library routines.
At each time step we have to evaluate the v^{n+1} values at lm grid points, where l is the number of grid points along the x-axis. Suppose we implement this method on an R x S mesh-connected computer. Denote the workers by P_{i1,j1}: i1 = 1, 2, ..., R with R <= l, and j1 = 1, 2, ..., S with S <= M. The workers P_{i1,j1} are connected as shown in Fig.3. Let L1 = ceil(l/R) and M1 = ceil(M/S), where ceil(.) denotes the smallest integer not less than its argument. Divide the lm grid points into RS groups so that each group contains at most (L1 + 1)(M1 + 1) grid points and at least L1 M1 grid points. Denote these groups by G_{i1,j1}: i1 = 1, 2, ..., R, j1 = 1, 2, ..., S.

Fig.3 R x S mesh-connected workers

Design G_{i1,j1} such that it contains the grid points
G_{i1,j1} = { (x_{(i1-1)L1+i}, y_{(j1-1)M1+j}) : i = 1, 2, ..., L1 or L1 + 1; j = 1, 2, ..., M1 or M1 + 1 }.
Assign the group G_{i1,j1} to worker P_{i1,j1}, i1 = 1, 2, ..., R, j1 = 1, 2, ..., S. Each worker computes the v^{n+1}_{i,j} values of its assigned group in the required number of sweeps. At the (p + 1/2)th sweep the workers compute the v^{(p+1/2)}_{i,j} values of their assigned groups. For the (p + 1/2)th level, worker P_{i1,j1} requires one value from worker P_{i1-1,j1} or P_{i1+1,j1}, so the communication between the workers is done row-wise, as shown in Fig.4. After the communication between the workers is completed, each worker P_{i,j} computes the v^{(p+1/2)}_{i,j} values. For the (p + 1)th sweep each worker P_{i1,j1} requires one value from worker P_{i1,j1-1} or P_{i1,j1+1}; here the communication between processors is done column-wise, as shown in Fig.5. Then each worker computes the v^{(p+1)}_{i,j} values of its assigned group.

Fig.4 R x S mesh-connected row-wise worker communication

Fig.5 R x S mesh-connected column-wise worker communication

Statements need to be inserted to select which portions of the code will be executed by each processor. The copy of the program starts by checking pvm_parent; it then spawns multiple copies of itself and passes them the array of tids. At this point, each copy is equal and can work on its data partition in collaboration with the other workers. In the master model, the master program spawns and directs a number of worker programs which perform the computations. Any PVM task can initiate processes on the machine. The

master calls pvm_mytid which, as the first PVM call, enrolls this task in the PVM system. It then calls pvm_spawn to execute a given number of worker programs on other machines in PVM. Each worker calls pvm_mytid to determine its task id in the virtual machine, and then uses the data broadcast from the master to create a unique ordering from 0 to nprocessors-1. Subsequently, pvm_send and pvm_recv are used to pass messages between processors. When finished, all PVM programs call pvm_exit, which allows PVM to disconnect any sockets to the process and to keep track of which processes are currently running.
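The PVM master side described above follows the usual enroll-spawn-communicate-exit skeleton. The fragment below is a minimal sketch of that pattern, assuming a worker executable named "worker", NWORKERS processes and simple integer messages; these names, tags and payloads are ours and are not taken from the paper's program.

    #include <stdio.h>
    #include "pvm3.h"

    #define NWORKERS 4
    #define TAG_INIT 1
    #define TAG_DONE 2

    int main(void)
    {
        int mytid = pvm_mytid();                 /* enroll in the PVM system */
        int tids[NWORKERS];
        printf("master enrolled as t%x\n", mytid);

        /* spawn the worker program on any available hosts */
        int started = pvm_spawn("worker", NULL, PvmTaskDefault, "", NWORKERS, tids);

        /* broadcast the tid array so each worker can derive its rank 0..n-1 */
        pvm_initsend(PvmDataDefault);
        pvm_pkint(&started, 1, 1);
        pvm_pkint(tids, started, 1);
        pvm_mcast(tids, started, TAG_INIT);

        /* collect one result message from each worker */
        for (int i = 0; i < started; i++) {
            int who, result;
            pvm_recv(-1, TAG_DONE);
            pvm_upkint(&who, 1, 1);
            pvm_upkint(&result, 1, 1);
            printf("worker %d returned %d\n", who, result);
        }
        pvm_exit();                              /* leave the virtual machine */
        return 0;
    }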
4.1 MPI Communication Service Design
MPI, like most other network-oriented middleware services, communicates data from one worker to another across a network. However, MPI's higher level of abstraction provides an easy-to-use interface more appropriate for distributed parallel computing applications. We focus our evaluation on [23] because MPI serves as an important foundation for a large group of applications. MPI provides a wide variety of communication operations, including both blocking and non-blocking sends and receives and collective operations such as broadcast and global reductions. We concentrate on the basic message operations: blocking send, blocking receive, non-blocking send and non-blocking receive. Note that MPI provides a rather comprehensive set of messaging operations. MPI's primitive communication operation is the blocking send to blocking receive. A blocking send (MPI_Send) does not return until both the message data and envelope have been safely stored away. When the blocking send returns, the sender is free to access and overwrite the send buffer. Note that these semantics allow the blocking send to complete even if no matching receive has been executed by the receiver. A blocking receive (MPI_Recv) returns when a message that matches its specification has been copied to the buffer.
As an alternative to blocking communication operations, MPI provides non-blocking communication to allow an application to overlap communication and computation. This overlap improves application performance. In non-blocking communication, initiation and completion of communication operations are distinct. The second message operation in Fig.6 illustrates a message transfer from task 1 to task 0 using a non-blocking send and a non-blocking receive, respectively. A non-blocking send has both a send start call (MPI_Isend), which initiates the send operation and may return before the message is copied from the send buffer, and a send complete call (MPI_Wait), which completes the non-blocking send by verifying that the data has been copied out of the send buffer. It is this separation of send start and send complete that gives the application the opportunity to perform computations. Task 1 in Fig.6 uses MPI_Isend to initiate the transfer of sdata to task 0. During the time between MPI_Isend and MPI_Wait, Task 1 cannot modify sdata because the actual copy of the message from sdata is not guaranteed until the MPI_Wait call returns. After MPI_Wait returns, Task 1 is free to use or overwrite the data in sdata. Similarly, a non-blocking receive has both a receive start call and a receive complete call. The receive start call (MPI_Irecv) initiates the receive operation and may return before the incoming message is copied into the receive buffer. The receive complete call (MPI_Wait) completes the non-blocking receive by verifying that the data has been copied into the buffer. As with the non-blocking send, the application has the opportunity to perform computation between the receive start and receive complete calls. Task 0 in Fig.6 uses MPI_Irecv to initiate the receive of sdata from Task 1; during the time between MPI_Irecv and MPI_Wait, Task 0 cannot read or modify rdata because the message from Task 1 is not guaranteed to be in this buffer until the MPI_Wait call returns. After MPI_Wait returns, Task 0 is free to read rdata.
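These non-blocking operations are also what the row-wise and column-wise sweep exchanges of Section 4 (Fig.4 and Fig.5) reduce to in practice. The sketch below is our own illustration of that pattern on a 2-D Cartesian communicator; the routine name, buffers and tags are assumptions, and the paper's implementation may differ. Fig.6 then shows the paper's own two-task example.

    #include <mpi.h>

    /* Non-blocking exchange of one line of boundary values with the two
     * neighbours along one direction of a 2-D process mesh, as needed
     * before the row-wise ((p+1/2)th) or column-wise ((p+1)th) sweep.
     * dir = 0 for the row direction, dir = 1 for the column direction. */
    void sweep_exchange(MPI_Comm cart, int dir,
                        double *send_lo, double *send_hi,
                        double *recv_lo, double *recv_hi, int count)
    {
        int lo, hi;
        MPI_Request req[4];

        MPI_Cart_shift(cart, dir, 1, &lo, &hi);   /* ranks of both neighbours */

        MPI_Irecv(recv_lo, count, MPI_DOUBLE, lo, 0, cart, &req[0]);
        MPI_Irecv(recv_hi, count, MPI_DOUBLE, hi, 1, cart, &req[1]);
        MPI_Isend(send_hi, count, MPI_DOUBLE, hi, 0, cart, &req[2]);
        MPI_Isend(send_lo, count, MPI_DOUBLE, lo, 1, cart, &req[3]);

        /* interior computation could be overlapped here before the waits */
        MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
    }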

Task 0:
    #define size …
    int sdata[size];
    int rdata[size];
    MPI_Status status;
    MPI_Request request;
    int tag = 20;
    /* initialization */
    /* ... blocking send-receive 0 -> 1 */
    /* fill sdata */
    MPI_Send(sdata, size, MPI_INT, 1, tag, MPI_COMM_WORLD);
    /* use or overwrite sdata */
    /* ... non-blocking send-receive 1 -> 0 */
    MPI_Irecv(rdata, size, MPI_INT, 1, tag, MPI_COMM_WORLD, &request);
    /* computation excluding rdata */
    MPI_Wait(&request, &status);
    /* use rdata */
    /* finish */

Task 1:
    #define size …
    int sdata[size];
    int rdata[size];
    MPI_Status status;
    MPI_Request request;
    int tag = 20;
    /* initialization */
    /* ... blocking send-receive 0 -> 1 */
    MPI_Recv(rdata, size, MPI_INT, 0, tag, MPI_COMM_WORLD, &status);
    /* ... non-blocking send-receive 1 -> 0 */
    /* fill sdata */
    MPI_Isend(sdata, size, MPI_INT, 0, tag, MPI_COMM_WORLD, &request);
    /* computation excluding sdata */
    MPI_Wait(&request, &status);
    /* use or overwrite sdata */
    /* finish */

Fig.6 Example of message operations with MPI

4.2 Speedup, Efficiency and Effectiveness
The performance metrics most commonly used are speedup and efficiency, which measure the improvement in performance experienced by an application when executed on a parallel system [15]. Speedup is the ratio of the serial time to that of the parallel version run on N workers. Efficiency judges how effective the parallel algorithm is and is expressed as the ratio of the speedup to the number of workers N. In traditional parallel systems they are widely defined as:

S(N) = \frac{T(s)}{T(N)}, \qquad E(N) = \frac{S(N)}{N}    (4.1)

where S(N) is the speedup factor for the parallel computation, T(s) is the CPU time of the best serial algorithm, T(N) is the CPU time of the parallel algorithm using N workers, and E(N) is the total efficiency of the parallel algorithm.
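Eq. (4.1) translates directly into code; the small helper below (with names of our own choosing) shows how S(N) and E(N) could be computed from measured serial and parallel times such as those reported later in Tables 1-6.

    /* Speedup and efficiency as defined in Eq. (4.1).
     * t_serial   - CPU time of the best serial run, T(s)
     * t_parallel - CPU time of the parallel run on n workers, T(N) */
    struct perf { double speedup; double efficiency; };

    struct perf speedup_efficiency(double t_serial, double t_parallel, int n)
    {
        struct perf p;
        p.speedup    = t_serial / t_parallel;  /* S(N) = T(s) / T(N) */
        p.efficiency = p.speedup / n;          /* E(N) = S(N) / N    */
        return p;
    }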
However, this simple definition has been focused on constant improvements. A generalized speedup formula is the ratio of parallel to sequential execution speed. A different approach, known as relative speedup, considers the parallel and sequential algorithms to be the same. While the absolute speedup measures the performance gain for a particular problem using any algorithm, relative speedup focuses on the performance gain of a specific algorithm that solves the problem. The total efficiency is usually decomposed as follows:

E(N) = E_{num}(N)\, E_{par}(N)\, E_{load}(N),    (4.2)

where E_num is the numerical efficiency, which represents the loss of efficiency relative to the serial computation due to the variation of the convergence rate of the parallel computation; E_load is the load balancing efficiency, which takes into account the extent of the utilization of the workers; and E_par is the parallel efficiency, defined as the ratio of the CPU time taken on one worker to that on N workers. The parallel efficiency and the corresponding speedup are commonly written as follows:

S_{par}(N) = \frac{T(1)}{T(N)}, \qquad E_{par}(N) = \frac{S_{par}(N)}{N}    (4.3)

The parallel efficiency takes into account the loss of efficiency due to data communication and data management owing to domain decomposition. The CPU time for the parallel computation with N workers can be written as:

T(N) = T_m(N) + T_{sd}(N) + T_{sc}(N)    (4.4)

where T_m(N) is the CPU time taken by the master program, T_sd(N) is the average worker CPU time spent in data communication, and T_sc(N) is the average worker CPU time spent in computation. Generally,

T_m(N) \approx T_m(1), \qquad T_{sd}(N) \approx T_{sd}(1), \qquad T_{sc}(N) = \frac{T_{sc}(1)}{N},    (4.5)

therefore the speedup can be written as:

S_{par}(N) = \frac{T(1)}{T(N)} = \frac{T_{ser}(1) + T_{sc}(1)}{T_{ser}(1) + T_{sc}(1)/N} \;\longrightarrow\; \frac{T_{ser}(1) + T_{sc}(1)}{T_{ser}(1)}    (4.6)

where T_{ser}(1) = T_m(1) + T_{sd}(1) is the part that cannot be parallelized. This is Amdahl's law, showing that there is a limiting value of the speedup for a given problem. The corresponding efficiency is given by:

E_{par}(N) = \frac{T(1)}{N\,T(N)} = \frac{T_{ser}(1) + T_{sc}(1)}{N\,T_{ser}(1) + T_{sc}(1)} \;\longrightarrow\; \frac{T_{ser}(1) + T_{sc}(1)}{N\,T_{ser}(1)}    (4.7)

The parallel efficiency represents the effectiveness of the parallel program running on N workers relative to a single processor. However, it is the total efficiency that is of real significance when comparing the performance of a parallel program to the corresponding serial version. Let T_s^{N_o}(1) denote the CPU time of the corresponding serial program to reach a prescribed accuracy with N_o iterations, and let T_{B}^{N_1,L}(N) denote the total CPU time of the parallel version of the program with B blocks run on N workers to reach the same prescribed accuracy with N_1 iterations, including any idle time. The superscript L acknowledges the degradation in performance due to the load balancing problem. The total efficiency can be decomposed as follows:

E(N) = \frac{T_s^{N_o}(1)}{N\,T_{B}^{N_1,L}(N)} = \frac{T_s^{N_o}(1)}{T_{B=1}^{N_o}(1)} \cdot \frac{T_{B=1}^{N_o}(1)}{T_{B}^{N_o}(1)} \cdot \frac{T_{B}^{N_o}(1)}{T_{B}^{N_1}(1)} \cdot \frac{T_{B}^{N_1}(1)}{T_{B}^{N_1}(N)} \cdot \frac{T_{B}^{N_1}(N)}{T_{B}^{N_1,L}(N)}    (4.8)

where T_{B}^{N_1}(N) has the same meaning as T_{B}^{N_1,L}(N) except that the idle time is not included. Comparing (4.8) and (4.2), we obtain:

E_{load}(N) = \frac{T_{B}^{N_1}(N)}{T_{B}^{N_1,L}(N)}, \qquad E_{par}(N) = \frac{T_{B}^{N_1}(1)}{N\,T_{B}^{N_1}(N)}, \qquad E_{num}(N) = \frac{T_s^{N_o}(1)}{T_{B}^{N_1}(1)} = \frac{T_s^{N_o}(1)}{T_{B=1}^{N_o}(1)} \cdot \frac{T_{B=1}^{N_o}(1)}{T_{B}^{N_o}(1)} \cdot \frac{T_{B}^{N_o}(1)}{T_{B}^{N_1}(1)}    (4.9)

When B = 1 and N = 1, T_m(1) + T_{sd}(1) \ll T_{sc}(1), so T_{B=1}^{N_o}(1)/T_s^{N_o}(1) \approx 1.0. We also note that T_{B}^{N_o}(1)/T_{B}^{N_1}(1) = N_o/N_1. Therefore,

E_{num}(N) = E_{dd}\,\frac{N_o}{N_1}, \qquad E_{dd} = \frac{T_{B=1}^{N_o}(1)}{T_{B}^{N_o}(1)}    (4.10)

We call E_dd in (4.10) the domain decomposition (DD) efficiency; it includes the increase of CPU time induced by grid overlap at interfaces and the CPU time variation generated by DD techniques. The second term N_o/N_1 in the right

hand side of (4.10) represents the increase in the number of Spar and the efficiency Epar for a mesh of 300x300, with B = 50 blocks and
Niter = 100 for PVM and MPI.
iterations required by the parallel method to achieve a
specified accuracy compared to the serial method. The
effectiveness is given by: Schem
es
N Tw Tm Ts
d
Tsc PVM MPI

T Spar Epar T Spar Epar


Ln  S n C n (4.11)
1 296 11 53 1454 1623 1.000 1.00 1328 1.000 1.00
8 6 0 0
where C n  nTn , T1 is the execution time on a serial 2 257 11 51 700.6 865.6 1.875 0.93 689.8 1.925 0.96
5 4 8 7 3
4 238 11 51 377.6 542.6 2.991 0.74 439.5 3.021 0.75
machine and Tn is the computing time on parallel machine 8 4 3 3 8 9 5
8 201 11 51 308.0 473.0 3.431 0.42 340.9 3.895 0.48

with N processors. Hence, effectiveness can be written as: MF-DS 1


1
193
4
11 51
4
131.0
4
296.0 5.482
9
0.34
5
217.6 6.103
7
0.38
6 2 4 6 6 3 1
2 163 11 51 84.19 249.1 6.513 0.32 195.3 6.798 0.34
0 4 4 9 6 5 0
Ln  S n (nTn )  E n Tn  E n S n T1 (4.12) 2
4
153
8
11
4
51 57.79 222.7
9
7.285 0.30
4
175.4
8
7.568 0.31
5
3 118 11 51 24.56 189.5 8.562 0.28 154.1 8.613 0.28
0 4 4 6 5 9 7

which clearly shows that Ln is a measure of both speedup 3


8
109
9
11
4
51 10.71 175.7
1
9,237 0.24
3
136.8
9
9.701 0.25
5
4 987 11 31 0.11 145.1 11.18 0.23 123.4 10.75 0.22
and efficiency. Therefore, a parallel algorithm is said to be 8 4 1 5 3 3 9 4

effective if it maximizes Ln and hence Ln T1  S n E n .

Table 1 The wall time TW, the master time TM, the worker
data time TSD, the worker computational time TSC, the total
time T, the parallel speed-up Spar and the efficiency Epar for
a mesh of 200x200, with B = 50 blocks and Niter = 100 for
PVM and MPI. 5 Results and Discussion

Schem N Tw T Ts Tsc PVM MPI


es m d
5.1 Benchmark Problem
T Spar Epar T Spar Epar

Schem N Tw Tm Ts Tsc PVM MPI


We implement the MS-DF schemes on the 2-D TE using
es d
T Spar Epar T Spar Epar
the input file affinity measure, with the values of the
physical properties in our test case chosen in such a way
1 387 13 58 1639 1832 1.000 1.00 1528 1.000 1.00
4 5 0 0 that LC and (RC + GL) are equal to one. The application of
2 301 13 56 798.1 988.1 1.854 0.92 791.3 1.931 0.96
4 4 3 3 7 0 6 the above mentioned algorithms were compared in terms of
4 249 13 56 437.6 627.6 2.919 0.70 440.6 3.468 0.86
1 4 1 1 3 0 7 performance by simulating their executions in master-
8 193 13 56 278.1 468.1 3.913 0.48 369.4 4.136 0.51
8 4 9 9 9 4 7 worker paradigm on several sizes. We assume a platform
MF-DS 1 135 13 56 124.4 314.4 5.827 0.36 238.0 6.418 0.40
6 6 4 0 4 8 1 composed of variable number of heterogeneous processors.
2 111 13 56 90.4 280.6 6.528 0.32 218.4 6.995 0.35
0 7 4 4 6 4 0 The solution domain was divided into rectangular blocks.
2 938 13 56 51.27 241.2 7.593 0.31 195.1 7.831 0.32
4 4 7 6 2 6 The experiment is demonstrated on meshes of 200x200 and
3 899 13 56 19.95 209.9 8.726 0.29 167.7 9.108 0.30
0 4 5 1 6 4 300x300 for block sizes of 50, 100 and 200 respectively,
3 806 12 56 10.58 190.5 9.613 0.25 149.3 10.23 0.26
8 4 8 3 5 1 9 both for MPI and PVM. Tables 1 - 14 show the various
4 718 11 46 1.72 161.7 11.32 0.23 127.5 11.98 0.25
8 4 2 8 6 2 2 0 performance timing.
1 231 5 23 1169. 1245. 1.000 1.00 1083.6 1.000 1.00
4 3 6 6 0 3 0
2 196 5 19 632 704.1 1.769 0.88 566.46 1.913 0.95
8 3 3 5 7 Table 3 The wall time TW, the master time TM, the slave
4 134 5 19 407.6 477.6 2.608 0.65 365.10 2.968 0.74
1 1 1 1 2 2 data time TSD, the slave computational time TSC, the total
8 108 5 19 305.9 375.9 3.313 0.41 285.99 3.789 0.47
9 1 7 7 4 4 time T, the parallel speed-up Spar and the efficiency Epar for
MF-DS 1 862 5 19 164.6 234.6 5.309 0.33 182.92 5.924 0.37
6 1 2 2 2 0 a mesh of 200x200, with B = 100 blocks and Niter = 100
2 718 5 19 133.3 203.3 6.125 0.30 166.41 6.512 0.32
0 1 6 6 6 6 for PVM and MPI.
2 701 5 19 107.5 177.5 7.016 0.29 146.73 7.385 0.30
4 1 4 4 2 8
3 629 5 19 83.44 153.4 8.118 0.27 127.08 8.527 0.28
0 1 4 1 4
3 611 5 19 68.2 138.2 9.013 0.23 114.51 9.463 0.24 Table 4 The wall time TW, the master time TM, the worker
8 1 0 7 9
4
8
528 5
1
19 47.31 117.3
1
10.61
8
0.22
1
97.47 11.11
8
0.23
2
data time TSD, the worker computational time TSC, the total
time T, the parallel speed-up Spar and the efficiency Epar for
a mesh of 300x300, with B = 100 blocks and Niter = 100
for PVM and MPI.
Table 2 The wall time TW, the master time TM, the worker data time TSD,
the worker computational time TSC, the total time T, the parallel speed-up

Scheme  N   Tw    Tm   Tsd  Tsc      T(PVM)   Spar    Epar    T(MPI)   Spar    Epar
MF-DS   1   3382  151  96   2074     2321     1.000   1.000   1825     1.000   1.000
        2   3141  148  92   979.01   1219.01  1.904   0.952   945.11   1.931   0.966
        4   2961  148  92   437.07   677.07   3.428   0.857   489.93   3.725   0.931
        8   2749  147  92   401.98   640.98   3.621   0.453   413.08   4.418   0.552
        16  2421  147  92   165.50   404.50   5.738   0.359   283.52   6.437   0.402
        20  2097  147  92   100.08   339.08   6.845   0.342   254.00   7.185   0.359
        24  1862  147  92   66.07    305.07   7.608   0.317   230.20   7.928   0.330
        30  1718  147  92   24.87    263.87   8.796   0.293   202.98   8.991   0.300
        38  1601  147  92   14.24    236.57   9.811   0.258   182.28   10.012  0.263
        48  1497  137  92   12.18    205.02   11.321  0.236   166.15   10.984  0.229
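The speed-up and efficiency columns tabulated above are consistent with the usual definitions, stated here for reference (this is a reading of the tabulated values, not an equation reproduced from this section):

\[
S_{par}(N) = \frac{T(1)}{T(N)}, \qquad E_{par}(N) = \frac{S_{par}(N)}{N},
\]

where T(N) = TM + TSD + TSC is the total time on N workers. For example, for the 300x300 mesh with B = 100 (the table above, PVM), Spar(16) = 2321/404.50 ≈ 5.738 and Epar(16) = 5.738/16 ≈ 0.359, matching the tabulated values.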
Consider the TE of the form:

\[
\frac{\partial^2 U}{\partial x^2} + \frac{\partial^2 U}{\partial y^2}
= \frac{\partial^2 U}{\partial t^2} + \frac{\partial U}{\partial t} + U .
\tag{5.1}
\]

The boundary conditions and initial condition posed are:

\[
U(0, y, t) = 0, \quad U(1, y, t) = 100, \quad U(x, 0, t) = 0, \quad U(x, 1, t) = 100, \qquad t \ge 0,
\tag{5.1a}
\]

\[
U(x, y, 0) = e^{xy}.
\tag{5.1b}
\]
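For concreteness, the boundary and initial data in (5.1a)-(5.1b) can be set up on a uniform grid over the unit square as in the C fragment below. This is an illustrative sketch only: the array layout, the mesh size NI x NJ and the helper init_grid are assumptions for illustration, not taken from the paper.

#include <math.h>
#include <stdio.h>

#define NI 200               /* illustrative mesh size, as in the experiments */
#define NJ 200

static double U[NI + 1][NJ + 1];

/* Apply the initial condition (5.1b), U(x,y,0) = exp(xy), and the Dirichlet
   boundary values of (5.1a) on the unit square [0,1] x [0,1].               */
static void init_grid(void)
{
    const double hx = 1.0 / NI, hy = 1.0 / NJ;

    for (int i = 0; i <= NI; i++)
        for (int j = 0; j <= NJ; j++)
            U[i][j] = exp((i * hx) * (j * hy));   /* U(x,y,0) = e^{xy} */

    for (int j = 0; j <= NJ; j++) {
        U[0][j]  = 0.0;                           /* U(0,y,t) = 0      */
        U[NI][j] = 100.0;                         /* U(1,y,t) = 100    */
    }
    for (int i = 0; i <= NI; i++) {
        U[i][0]  = 0.0;                           /* U(x,0,t) = 0      */
        U[i][NJ] = 100.0;                         /* U(x,1,t) = 100    */
    }
}

int main(void)
{
    init_grid();
    printf("U at the centre of the domain: %f\n", U[NI / 2][NJ / 2]);
    return 0;
}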
5.2 Parallel Efficiency

To obtain a high efficiency, the worker computational time TSC(1) should be significantly larger than the serial time Tser. In the present program, the CPU time for the master task and for data communication is constant for a given grid size and number of sub-domains; therefore, the task in the inner loop should be made as large as possible to maximize the efficiency. The speed-up and efficiency obtained for grid sizes from 200x200 to 300x300 and for various numbers of sub-domains, from B = 50 to 200, are listed in Tables 1 - 6 for the PVM and MPI applications. In these tables we list the wall (elapsed) time of the master task, TW (this is necessarily greater than the maximum wall time returned by the workers), the master CPU time, TM, the average worker computational time, TSC, and the average worker data communication time, TSD, all in seconds. The speed-up and efficiency versus the number of processors are shown in Fig. 7(a,b,c) and Fig. 8(a,b) respectively, with the block number B as a parameter. The results show that the parallel efficiency increases with increasing grid size for a given block number, both for MPI and PVM, and decreases with increasing block number for a given grid size. With the other parameters fixed, the speed-up increases with the number of processors. At a large number of processors Amdahl's law starts to operate, imposing a limiting speed-up due to the constant serial time. Note that the elapsed time is a strong function of the background activities of the cluster. When the number of processors is small the wall time decreases with the number of processors; as the number of processors becomes large, the wall time increases with the number of processors, as observed from the figures and tables.

The total CPU time is composed of three parts: the CPU time for the master task, the average worker CPU time for data communication and the average worker CPU time for computation, T = TM + TSD + TSC. Data communication at the end of every iteration is necessary in this strategy. Indeed, the updated values of the solution variables on the full domain are multicast to all workers after each iteration, since a worker can be assigned a different sub-domain under the pool-of-tasks paradigm. The master task includes sending updated data to the workers, assigning the task tid to the workers, waiting for messages from the processors and receiving the results from the workers. For a given grid size, the CPU time to send the task tid to the workers increases with the block number, but the timing of the other tasks does not change significantly with the block number. In Tables 1 - 6, the master time TM is constant when the number of processors increases for a given grid size and number of sub-domains. The master program is responsible for (1) sending updated variables to the workers (T1), (2) assigning tasks to the workers (T2), (3) waiting for the workers to execute their tasks (T3), and (4) receiving the results (T4).
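The master/worker cycle just described (T1-T4) can be sketched with MPI as follows. This is a minimal illustration only, not the authors' implementation: the buffer sizes NDOF, NBLK and NITER, the message tags, and the compute_block routine named in the comment are hypothetical placeholders.

#include <mpi.h>
#include <stdlib.h>

#define TAG_TASK   1
#define TAG_RESULT 2
#define NDOF  40000          /* illustrative: 200x200 solution values        */
#define NBLK  100            /* illustrative: number of sub-domains B        */
#define NITER 100            /* illustrative: Niter used in the experiments  */

int main(int argc, char **argv)
{
    int rank, nproc;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    if (nproc < 2) MPI_Abort(MPI_COMM_WORLD, 1);  /* needs at least one worker */

    double *u   = calloc(NDOF, sizeof(double));   /* full-domain solution     */
    double *buf = calloc(NDOF, sizeof(double));   /* per-block work buffer    */

    for (int it = 0; it < NITER; it++) {
        /* (T1) multicast the updated solution to every worker */
        MPI_Bcast(u, NDOF, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        if (rank == 0) {                          /* ---- master ---- */
            MPI_Status st;
            int next = 0, done = 0;
            /* (T2) prime every worker with a block id, or -1 if none is left */
            for (int w = 1; w < nproc; w++) {
                int task = (next < NBLK) ? next++ : -1;
                MPI_Send(&task, 1, MPI_INT, w, TAG_TASK, MPI_COMM_WORLD);
            }
            /* (T3)/(T4) wait for results, reassigning blocks until all are done */
            while (done < NBLK) {
                MPI_Recv(buf, NDOF, MPI_DOUBLE, MPI_ANY_SOURCE,
                         TAG_RESULT, MPI_COMM_WORLD, &st);
                done++;
                int task = (next < NBLK) ? next++ : -1;
                MPI_Send(&task, 1, MPI_INT, st.MPI_SOURCE, TAG_TASK,
                         MPI_COMM_WORLD);
            }
        } else {                                  /* ---- worker ---- */
            for (;;) {
                int blk;
                MPI_Recv(&blk, 1, MPI_INT, 0, TAG_TASK, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                if (blk < 0) break;               /* no more blocks this sweep  */
                /* compute_block(u, buf, blk);  -- hypothetical MF-DS sweep     */
                MPI_Send(buf, NDOF, MPI_DOUBLE, 0, TAG_RESULT, MPI_COMM_WORLD);
            }
        }
    }

    free(u); free(buf);
    MPI_Finalize();
    return 0;
}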
Table 5 The wall time TW, the master time TM, the worker data time TSD, the worker computational time TSC, the total time T, the parallel speed-up Spar and the efficiency Epar for a mesh of 200x200, with B = 200 blocks and Niter = 100 for PVM and MPI.

Scheme  N   Tw    Tm   Tsd  Tsc      T(PVM)   Spar    Epar    T(MPI)   Spar    Epar
MF-DS   1   5328  128  63   2327     2518     1.000   1.000   2267     1.000   1.000
        2   4692  127  60   1165.31  1352.31  1.862   0.931   1160.78  1.953   0.977
        4   3968  127  60   557.75   744.75   3.381   0.845   609.24   3.721   0.930
        8   2887  127  60   369.1    556.10   4.528   0.566   471.41   4.809   0.601
        16  2126  127  60   213      400      6.295   0.393   337.65   6.714   0.420
        20  1974  127  60   173.13   360.13   6.992   0.350   309.87   7.316   0.366
        24  1684  127  60   134.54   321.54   7.832   0.326   282.46   8.026   0.334
        30  1322  127  60   94.09    281.09   8.958   0.299   243.29   9.318   0.311
        38  1181  127  60   68.04    255.04   9.873   0.260   213.18   10.634  0.280
        48  967   127  60   27.88    214.88   11.718  0.244   184.14   12.311  0.256

Table 6 The wall time TW, the master time TM, the worker data time TSD, the worker computational time TSC, the total time T, the parallel speed-up Spar and the efficiency Epar for a mesh of 300x300, with B = 200 blocks and Niter = 100 for PVM and MPI.
Scheme  N   Tw    Tm   Tsd  Tsc      T(PVM)   Spar    Epar    T(MPI)   Spar    Epar
MF-DS   1   5982  172  174  2836     3182     1.000   1.000   2618     1.000   1.000
        2   4634  169  173  1284.79  1626.79  1.956   0.978   1329.61  1.969   0.985
        4   4182  169  173  475.78   817.78   3.891   0.973   667.35   3.923   0.981
        8   3763  169  173  296.19   638.19   4.986   0.623   495.18   5.287   0.661
        16  3211  169  173  145.51   487.51   6.527   0.408   357.94   7.314   0.457
        20  2727  169  173  98.78    440.78   7.219   0.361   307.31   8.519   0.426
        24  2189  169  173  51.13    393.13   8.094   0.337   272.59   9.604   0.400
        30  1962  169  173  19.28    337.76   9.421   0.314   259.00   10.108  0.337
        38  1724  169  173  15.74    300.64   10.584  0.279   223.42   11.718  0.308
        48  1510  169  173  11.92    269.02   11.828  0.246   207.38   12.624  0.263

Table 7 Effectiveness of the various schemes with PVM and MPI for the 300x300 mesh size

Scheme  N   T(s) PVM   Ln      T(s) MPI   Ln
MF-DS   2   1626.79    0.060   1329.61    0.074
        8   638.19     0.098   495.18     0.133
        16  487.51     0.084   357.94     0.128
        20  440.78     0.082   307.31     0.139
        30  337.76     0.093   259.00     0.130
        48  269.02     0.091   207.38     0.127

Fig.7 Speed-up versus the number of workers for various block sizes: (a) mesh 200x200 PVM, (b) mesh 200x200 MPI, (c) mesh 300x300 MPI
Fig.8 Parallel efficiency versus the number of workers for various block sizes: (a) mesh 200x200 MPI, (b) mesh 300x300 MPI

Table 7 shows the effectiveness of the various schemes with PVM and MPI. As the number of processors increases, the MF-DS scheme performs significantly better than the ADI scheme in terms of effectiveness. As the total number of processors increases, the bottleneck of parallel computers appears and the global reduction consumes a large part of the time; we anticipate that the improvement will become more significant. Fig.7 and Fig.8 show the efficiency using PVM and MPI with varying block sizes, and we observe that the MPI implementation in Fig. 8 converges slightly better using the Iaff measure.
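The effectiveness values Ln reported in Table 7 are consistent with the standard effectiveness measure; this reading is inferred from the tabulated values rather than reproduced from an equation in this section:

\[
L_n = \frac{S_{par}(N)}{N\,T(N)} = \frac{E_{par}(N)}{T(N)} .
\]

For example, for N = 16 with MPI, Epar/T = 0.457/357.94 ≈ 1.28 x 10^-3, which matches the tabulated 0.128 when the tabulated Ln is read in units of 10^-2 s^-1.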
5.3 Numerical Efficiency

The numerical efficiency Enum includes the domain decomposition (DD) efficiency EDD and the convergence rate behaviour No/N1, as defined in Eq. (6.10). The DD efficiency

\[
E_{DD} = \frac{T_{B=1}^{N_o}(1)}{T_{B}^{N_o}(1)},
\]

where T_B^{N_o}(1) is the CPU time of No iterations on one processor with B blocks, includes the increase in floating-point operations induced by the grid overlap at the interfaces and the CPU time variation generated by the DD technique. In Tables 9 and 10 we list the total CPU time distribution over various grid sizes and block numbers running on only one processor, for PVM and MPI. Using these tables, the DD efficiency EDD can be calculated; the results are shown in Fig.9 and Fig.10. Note that the DD efficiency can be greater than one, even with one processor. Fig.9 and Fig.10 show that the optimum number of sub-domains, which maximizes the DD efficiency EDD, increases with the grid size.
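Eqs. (6.8) and (6.10) themselves are not reproduced in this section; a decomposition consistent with the discussion here and in Section 5.4 (an assumed standard form, not a formula quoted from the source) is

\[
E_{num} = E_{DD}\,\frac{N_o}{N_1}, \qquad E(N) = E_{par}\,E_{num} = E_{par}\,E_{DD}\,\frac{N_o}{N_1},
\]

so that, for a fixed grid size and block number (EDD constant), the variation of E(N) with the number of processors is governed by Epar and No/N1.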
The convergence rate behaviour No/N1 is the ratio of the number of iterations for the best sequential CPU time on one processor to the number of iterations for the parallel CPU time on N processors; it describes the increase in the number of iterations required by the parallel method to achieve a specified accuracy compared with the serial method. This increase is caused mainly by the deterioration of the rate of convergence with an increasing number of processors and sub-domains. Because the best serial algorithm is generally not known, we take the existing parallel program running on one processor in its place. The question is then how the decomposition strategy affects the convergence rate. The results are summarized in Tables 11 and 12 with Fig.11 and Fig.12, and in Tables 13 and 14 with Fig.13 and Fig.14.

It can be seen that No/N1 decreases with increasing block number and increasing number of processors for a given grid size. The larger the grid size, the higher the convergence rate. For a given block number, a higher convergence rate is obtained with fewer processors. This is because one processor may be responsible for several sub-domains at each iteration; if some of these sub-domains share common interfaces, the subsequent blocks to be computed will use the newly updated boundary values, and therefore an improved convergence rate results. The convergence rate is reduced when the block number is large. The reason for this is evident: the boundary conditions propagate to the interior of the domain in the serial computation after one iteration, but this propagation is delayed in the parallel computation. In addition, the values of the variables at the interfaces used in the current iteration are the previous values obtained in the last iteration. Therefore, the parallel algorithm is less "implicit" than the serial one. Despite these inherent shortcomings, a high efficiency is obtained for large-scale problems.

Table 9 The worker computational time TSC, for 100 iterations as a function of various block numbers (PVM)

Scheme  NIxNJ    B=20  B=30  B=50    B=100  B=200
MF-DS   200x200  797   981   1169.6  1639   2327
        300x300  943   1268  1454    2074   2836

Table 10 The worker computational time TSC, for 100 iterations as a function of various block numbers (MPI)

Scheme  NIxNJ    B=20  B=30  B=50   B=100  B=200
MF-DS   200x200  728   864   992    1464   1996
        300x300  852   1096  1218   1986   2334

Fig.9 The DD efficiency versus the number of sub-domains for various blocks of PVM
Fig.10 The DD efficiency versus the number of sub-domains for various blocks of MPI

Table 11 The number of iterations to achieve a given tolerance of 10^-3 for a grid of 200x200 for PVM

Scheme  N   B=20  B=50  B=100  B=200
MF-DS   2   3621  4347  4621   3118
        4   3621  4508  4792   3346
        8   3621  4921  5121   3598
        16  3621  5248  5448   3771
        30  3621  5422  5611   3964
        48  3621  5745  5908   4125

Table 12 The number of iterations to achieve a given tolerance of 10^-3 for a grid of 200x200 for MPI

Scheme  N   B=20  B=50  B=100  B=200
MF-DS   2   1914  3125  3541   2392
        4   1914  3448  3679   2586
        8   1914  3703  3824   2793
        16  1914  4094  4219   2899
        30  1914  4268  4387   3016
        48  1914  4491  4666   3325

Table 13 The number of iterations to achieve a given tolerance of 10^-3 for a grid of 300x300 for PVM

Scheme  N   B=20  B=50  B=100  B=200
MF-DS   2   1914  3125  3541   2392
        4   1914  3448  3679   2586
        8   1914  3703  3824   2793
        16  1914  4094  4219   2899
        30  1914  4268  4387   3016
        48  1914  4491  4666   3325

Table 14 The number of iterations to achieve a given tolerance of 10^-3 for a grid of 300x300 for MPI

Scheme  N   B=20  B=50  B=100  B=200
MF-DS   2   3101  4321  4852   3211
        4   3101  4582  5074   3528
        8   3101  4718  5362   3769
        16  3101  4992  5621   3928
        30  3101  5284  5895   4325
        48  3101  5476  6122   4518

Fig.11 Convergence behavior with domain decomposition for mesh 200x200 MF-DS PVM

Fig.12 Convergence behavior with domain decomposition for mesh 200x200 MF-DS MPI

Fig.13 Convergence behavior with domain decomposition for mesh 300x300 MF-DS PVM
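The iteration counts reported in Tables 11 - 14 above correspond to iterating until a prescribed tolerance of 10^-3 is met. A minimal sketch of such a stopping test is given below; the use of the maximum norm, the mesh size and the helper converged are assumptions for illustration, not the authors' criterion.

#include <math.h>
#include <stdio.h>
#include <stdbool.h>

#define NI 200
#define NJ 200

static double u_old[NI + 1][NJ + 1], u_new[NI + 1][NJ + 1];

/* True when the largest pointwise change between two successive iterates
   falls below the tolerance (10^-3 in Tables 11 - 14).                    */
static bool converged(double tol)
{
    double diff = 0.0;
    for (int i = 0; i <= NI; i++)
        for (int j = 0; j <= NJ; j++) {
            double d = fabs(u_new[i][j] - u_old[i][j]);
            if (d > diff) diff = d;
        }
    return diff < tol;
}

int main(void)
{
    /* In the solver this test would follow each sweep: iterate, copy u_new
       over u_old, and stop once converged(1.0e-3) returns true.            */
    u_new[NI / 2][NJ / 2] = 5.0e-4;      /* tiny illustrative perturbation   */
    printf("converged: %s\n", converged(1.0e-3) ? "yes" : "no");
    return 0;
}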
Fig.14 Convergence behavior with domain decomposition for mesh 300x300 MF-DS MPI

5.4 Total Efficiency

We implemented the serial computations on one of the processors and calculated the total efficiencies. The total efficiency E(N) for the grid sizes 200x200 and 300x300 has been shown, respectively. From Eq. (6.8), we know that the total efficiency depends on No/N1, Epar and the DD efficiency EDD, since load balancing is not the real problem here. For a given grid size and block number, the DD efficiency is constant. Thus, the variation of E(N) with the processor number N is governed by Epar and No/N1. When the processor number becomes large, E(N) decreases with N due to the effect of both the convergence rate and the parallel efficiency. With the MPI version we were able to achieve the best implementation for the efficiency of the different mesh sizes with different block numbers, as observed in Fig. 12 and 14. Comparing Fig. 12 and 14 on the convergence behavior with domain decomposition for the different mesh sizes, we observed that the implementation with MPI achieved closer conformity to unity.

6. CONCLUSION

In this paper we have presented the results of an experimental study into using the input file affinity measure to schedule tasks in a master-worker paradigm using PVM and MPI on the MF-DS scheme. The aim was to evaluate the execution performance of the said applications and the utilization of the master-worker paradigm, to provide some recommendations regarding the feasibility of this approach, to suggest a scheduling method and to confirm experimental results achieved by other researchers. The performance results demonstrate the effectiveness of the alternating schemes: the MF-DS scheme not only outperforms stationary iterative schemes, but also achieves good convergence. The computational results obtained have clearly shown the benefits of using parallel algorithms. We have come to the following conclusions: (1) the parallel efficiency is strongly dependent on the problem size, the block number and the number of processors, as observed in Fig.6 both for PVM and MPI. (2) A high parallel efficiency can be obtained with large-scale problems. (3) The decomposition of the domain greatly influences the performance of the parallel computation (Fig. 12 - 13). (4) The convergence rate depends upon the block number and the number of processors for a given grid; for a given number of blocks, the convergence rate increases with decreasing number of processors, and for a given number of processors it decreases with increasing block number for both MPI and PVM (Fig. 14 - 15). The speed-up was obtained because of the computation/communication interleaving approach, as observed in the figures and tables. On the basis of the current parallelization strategy, more sophisticated models can be attacked efficiently.

References

[1.] J. M. Alonso, J. Mawhin, R. Ortega, (1999). Bounded Solution of Second-Order Semilinear Evolution Equations and Applications to Telegraph Equation. J. Math Pures Appl. 78, 49 - 63
[2.] R. Aloy, M. C. Casaban, L. A. Caudillomate, L. Jodar, (2007). Computing the Variable Coefficient Telegraph Equation using a Discrete Eigen Functions Method. Computers and Mathematics with Applications 54, pp. 448 - 458.
[3.] E. Arubanel, (2011). Scheduling Tasks in the Parareal Algorithm. Parallel Computing 37, 172 - 182
[4.] Berkeley Open Infrastructure for Network Computing, 2002. http://boinc.berkeley.edu/
[5.] F. Berman, R. Wolski, H. Casanova, W. Cirne, H. Dail, M. Faerman, S. Figueira, J. Hayes, G. Obertelli, J. Schopf, G. Shao, S. Smallen, N. Spring, A. Su, D. Zagorodnov, (2003). Adaptive Computing on the Grid using AppLes. IEEE Transactions on Parallel and Distributed Systems, 14(4), pp 369 - 382
[6.] D. Callahan, K. Kennedy, 1988. Compiling Programs for Distributed Memory Multiprocessors. Journal of Supercomputer 2, pp 151 - 169
[7.] H. Casanova, A. Legrand, D. Zagorodnov, F. Berman, (2000). Heuristics for Scheduling. The Ninth Heterogeneous Computing Workshop, IEEE Computer Society Press
[8.] T. Chan, F. Saied, 1987. Hypercube Multiprocessors. SIAM, Philadelphia
[9.] H. Chi-Chung, G. Ka-Kaung, et al., 1994. Solving Partial Differential Equations on a Network of Workstations. IEEE, pp 194 - 200.
[10.] R. Chypher, A. Ho, et al., 1993. Architectural Requirements of Parallel Scientific Applications with Explicit Communications. Computer Architecture, pp 2 - 13
[11.] P.J Coelho, M.G Carvalho, 1993. Application of a Domain Decomposition Technique to the Mathematical Modeling of Utility Boiler. Journal of Numerical Methods in Eng., 36, pp 3401 - 3419
[12.] P. D'Ambra, M. Danelutto, S. Daniela, L. Marco, 2002. Advance Environments for Parallel and Distributed Applications: a view of current status. Parallel Computing 28, pp 1637 - 1662.
[13.] H.S Dou, N. Phan-Thien, 1997. A Domain Decomposition Implementation of the Simple Method with PVM. Computational Mechanics 20, pp 347 - 358
[14.] F. Durst, M. Perie, D. Chafer, E. Schreck, 1993. Parallelization of Efficient Numerical Methods for Flows in Complex Geometries. Flow Simulation with High Performance Computing I, pp 79 - 92, Vieweg, Braunschweig
[15.] J. H. Eduardo, M. A., H. Amaral, (2007). Speedup and Scalability Analysis of Master-Slave Applications on Large Heterogeneous Clusters. Journal of Parallel and Distributed Computing 67(11), pp 1155 - 1167
[16.] M. S. El-Azah, M. El-Gamel, (2007). Appl. Math Comput. 190, 757 - 764
[17.] D.J Evans, B. Hassan, 2003. Numerical Solution of the Telegraph Equation by the AGE Method. International Journal of Computer Mathematics, Vol. 80, No. 10, pp 1289 - 1297
[18.] D.J. Evans, M.S. Sahimi, (1988). The Alternating Group Explicit Iterative Method for Parabolic Equations I: 2-Dimensional Problems. Intern. J. Compt. Math, Vol. 24, pp 311 - 341
[19.] S. U. Ewedafe, R. H. Shariffudin, (2011). Armadillo Generation Distributed System with Geranium Cadcam Cluster for solving 2-D Telegraph Problem. Intern. J. Compt. Math, Vol. 88, 589 - 609
[20.] S. U. Ewedafe, R. H. Shariffudin, (2011). Parallel Implementation of 2-D Telegraphic Equation on MPI/PVM Cluster. Int. J. Parallel Prog, 39, 202 - 231
[21.] D.S. Fabricio, H. Senger, (2009). Bag of Task running on Master-Slave with Input File. Parallel Computing 35, pp 57 - 71
[22.] C. Fan, C. Jiannong, S. Yudong, (2003). High Abstractions for Message Passing Parallel Programming. Parallel Computing 29, 1589 - 1621.
[23.] I. Foster, J. Geist, W. Groop, E. Lust, 1998. Wide-Area Implementations of the MPI. Parallel Computing 24, pp 1735 - 1749.
[24.] A. Geist, A. Beguelin, J. Dongarra, 1994. Parallel Virtual Machine (PVM). Cambridge, MIT Press
[25.] G. A Geist, V. M Sunderami, 1992. Network Based Concurrent Computing on the PVM System. Concurrency Practice and Experience, pp 293 - 311
[26.] A. Giersch, Y. Robert, F. Vivien, (2006). Scheduling Task Sharing Files on Heterogeneous Master-Slave Platforms. Journal of System Architecture, 52(2), pp 88 - 104
[27.] Y. Guang-Wei, S. Long-Jun, Z. Yu-Lin, 2001. Unconditional Stability of Parallel Alternating Difference Schemes for Semilinear Parabolic Systems. Applied Mathematics and Computation 117, pp 267 - 283
[28.] Y. Guangwei, H. Xudeng, (2007). Parallel Iterative Difference Schemes Based on Prediction Techniques for Sn Transport Method. Applied Numerical Mathematics 57, 746 - 752.
[29.] L. B. Hinde, C. Perez, T. Priol, (2010). Extending Software Component Models with the Master-Worker Paradigm. Parallel Computing 36, 86 - 103
[30.] K. Jaris, D.G. Alan, 2003. A High-Performance Communication Service for Parallel Computing on Distributed Systems. Parallel Computing 29, pp 851 - 878
[31.] K. Kaya, C. Aykanat, (2006). Iterative Improvement-Based Heuristics for Adapting Scheduling of Task Sharing Files on Heterogeneous Master-Slave. IEEE Transactions on Parallel and Distributed Systems, 17(8), pp 883 - 896
[32.] A. Keren, A. Barak, (2003). Opportunity Cost Algorithms for Reduction of I/O and Interprocess Communication Overhead in a Computing Cluster. IEEE Transactions on Computer 14(1), 39 - 50
[33.] E. Korpela, D. Werthimer, D. Anderson, J. Cobb, M. Leboisky, (2001). SETI@home - Massively Distributed Computing for SETI. IEEE Computer Society, Los Alamitos, CA, USA, 78 - 83
[34.] J. L. Lions, Y. Maday, G. Turinici, (2001). A Parareal in Time Discretization of PDEs. Comptes Rendus de Academie des Sciences - Series 1 - Mathematics 332(7), 661 - 668
[35.] M. Lixing, C. Frederick, J. Harris, 1998. Technical Report, Department of Computer Science, University of Nevada, Reno, NV 89557
[36.] A. C. Metaxas, R. J. Meredith, (1993). Industrial Microwave Heating. Peter Peregrinus, London
[37.] A. R. Mitchell, G. Fairweather, (1964). Improved forms of the Alternating Direction Methods of Douglas, Peaceman and Rachford for solving parabolic and elliptic equations. Numer. Maths, 6, 285 - 292.

AUTHOR

Ewedafe Simon Uzezi received the B.Sc. and M.Sc. degrees in Industrial-Mathematics and Mathematics respectively from Delta State University and The University of Lagos in 1998 and 2003. He further obtained his Ph.D. in Numerical Parallel Computing in 2010 from the University of Malaya, Malaysia. In 2011 he joined UTAR in Malaysia as an Assistant Professor and later lectured in Oman and Nigeria as Senior Lecturer in Computing. Currently he is an Associate Professor.