Strategies of Domain Decomposition to Partition Mesh-Based Applications onto Computational Grids

Journal of Computing, eISSN 2151-9617, http://www.journalofcomputing.org, https://sites.google.com/site/journalofcomputing/volume-3-issue-4-april-2011
Strategies of Domain Decomposition to Partition Mesh-Based Applications onto Computational Grids
Beatriz Otero and Marisa Gil
Abstract— In this paper, we evaluate domain decomposition strategies for solving mesh-based applications in Grid environments. We compare the balanced distribution strategy with unbalanced distribution strategies. While the former is common in homogeneous computing environments (e.g. parallel computers), it suffers from communication latency in Grid environments. Unbalanced decomposition strategies assign less workload to the processors responsible for sending updates outside the host. The results obtained in Grid environments show that unbalanced distribution strategies improve the expected execution time of mesh-based applications by up to 53%. However, this is not true when the number of processors devoted to communication exceeds the number of processors devoted to calculation in the host. To solve this problem, we propose a new unbalanced distribution strategy that improves the expected execution time by up to 43%. We analyze the influence of the communication patterns on execution times using the Dimemas simulator.

Index Terms— Domain decomposition methods, load balancing algorithms, parallelism and concurrency, simulation.

——————————  ——————————

1 INTRODUCTION

Domain decomposition strategies are used for the efficient parallel execution of mesh-based applications. These applications are widely used in disciplines such as engineering, structural mechanics and fluid dynamics, and require high computational capabilities [1]. Computational Grids are emerging as a new infrastructure for high-performance computing. Clusters of workstations from multiple institutions can be used to efficiently solve PDEs in parallel, where the problem size and number of processors are chosen to maintain sufficiently coarse-grained parallelism. A workstation can be a computer or a group of computers; hereafter we refer to both as a host. Our focus is on simulations that use finite element analysis to solve the problems that arise from the discretization of PDEs on meshes.

The general algorithmic structure of the explicit simulations is composed of two nested loops. The outer loop performs the discretization of the PDE with respect to the simulation time. The inner loop applies this discretization to all finite elements of the mesh and performs a matrix-vector product. This numerical operation represents between 80% and 90% of the total iteration time, so an efficient parallelization of this calculation can significantly improve the total simulation time. To this end, we use domain decomposition techniques, in which the matrix and vector are decomposed into sub-domains of data, each mapped onto one processor. Each sub-domain interacts with the others through its boundary elements.

In order to obtain optimal performance of mesh-based applications in Grid environments, a suitable partitioning method should take into consideration several features, such as the characteristics of the processors, the quantity of traffic in the network, and the latency and bandwidth between processors both inside a host and between hosts. Most partitioners do not have this capacity; therefore, they do not produce good results when the network and processors in the Grid are heterogeneous. In this paper, we evaluate mesh-based applications in Grid environments using a domain decomposition technique. The main objective of this study is to improve the execution time of mesh-based applications in Grid environments by overlapping remote communications with useful computation. To achieve this, we propose new domain decomposition strategies for partitioning mesh-based applications on computational Grids, where the workload can vary depending on the characteristics of the processor and of the network. The successful deployment of parallel mesh-based applications in a Grid environment requires efficient mesh partitioning. We use the Dimemas tool to simulate the behavior of the distributed applications in Grid environments [2].

This work is organized as follows. Section 2 briefly discusses the work related to this study. Section 3 describes the applications and the algorithmic structure of explicit simulations. Section 4 defines the Grid topologies considered and describes the tool used to simulate the Grid environment. Section 5 deals with the mesh-based applications studied and the workload assignment patterns. Section 6 shows the results obtained in the environments specified for the three different data distribution patterns. Section 7 presents the new unbalanced distribution that solves the problems of the unbalanced distributions proposed before. Finally, the conclusions of this work are presented in Section 8.

————————————————
B. Otero is with the Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya, 08034, Spain.
M. Gil is with the Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya, 08034, Spain.

2 RELATED WORK

We can distinguish two types of related work: one based on partitioning methods and the other on load balancing. As mentioned above, the success of parallel mesh-based applications in Grid environments depends on efficient mesh partitioning. Several partitioners have already been proposed for computational Grids; PART, JOSTLE, SCOTCH, MinEX, PaGrid and METIS are some examples.

PART [3] uses compute-intensive algorithms and simulated annealing, and therefore requires a parallel implementation to obtain good performance. JOSTLE [4] produces data partitions without taking into account the communication cost of each processor. SCOTCH [5] has the same limitation as JOSTLE, since it generates partitions assuming homogeneous interprocessor communication costs. MinEX [6] makes partitions without taking into account the application granularity. PaGrid [7] uses techniques already applied by other partitioners but adds a stage that load-balances the execution time; it produces partitions comparable to those of JOSTLE and attempts some improvement by minimizing the estimated execution time. Finally, METIS [8] is based on a multilevel recursive bisection algorithm. All of these approaches consider estimated execution time rather than communication cost to measure the performance of a mesh partitioner. However, minimizing the communication between hosts is fundamental in computational Grids to reduce the execution time.

As regards workload, some works deal with the relationship between architecture and domain decomposition algorithms [9]. There are several studies on latency, bandwidth and the optimum workload needed to take full advantage of the available resources [10, 11], as well as analyses of the behavior of MPI applications in Grid environments [12, 13]. In all of these cases, the same workload is considered for all processors. Li et al. [14] provide a survey of existing load-balancing solutions as well as new efforts to deal with the challenges of Grid computing; they describe and classify different load-balancing schemes for Grid computing, but none is fully adaptive to the characteristics of the Grid. In previous works we proposed two unbalanced distribution strategies, called singleB-domain and multipleB-domain, to execute mesh-based applications in Grid environments [15, 16, 17, 18]. Both use unbalanced data distributions that take into account the execution platform and the processor characteristics. Both strategies minimize the communication between processors and reduce the expected execution time by up to 53% compared with a balanced distribution strategy. In this paper we present the details of the two unbalanced distributions proposed above, describe the characteristics of the applications executed and the explicit simulation schemes, and propose a new unbalanced distribution, called multipleCB-domain, which combines the two previous unbalanced distributions and allows more efficient processor utilization.

3 APPLICATIONS AND SIMULATIONS

In this section we describe the features of the mesh-based applications and the general algorithmic structure of the simulation schemes.

3.1 Mesh-based Applications
Finite element methods are fundamental techniques for solving engineering problems modeled by PDEs. These methods include three basic steps:

1. The physical problem is written in variational form (also called weighted residual form).
2. The problem's domain is discretized by complex shapes called elements. This is called meshing.
3. The variational form is discretized using quadrature rules, leading to a system of equations. The solution of this system represents a discrete approximation of the solution of the original continuum problem.

Applications that involve a meshing procedure (step 2) are referred to as mesh-based applications. Mesh-based applications are naturally suited to parallel or distributed systems because they require large amounts of processing time. Furthermore, they can be partitioned to execute concurrently on heterogeneous computers in a Grid. Implementing the finite element method in parallel involves partitioning the global domain of nodes among nprocs processors. Our example applications use explicit finite element analysis for problems involving sheet stamping and car crashes [19]. We describe each of these below.

Sheet stamping problems. The performance of sheet stamping dies must be predicted during the die-design process. As well as market pressure for faster delivery of dies and cost reduction, large car manufacturers increasingly tend to offload design responsibility onto their suppliers. Typically, dies are designed by highly experienced people who know what sort of die is needed to produce a part of a given shape. On the basis of the design, fine-tuning is performed by actually using dies to produce parts, observing the result and manually milling the die until the sheet comes out as specified by the contractor. In complex cases, it is very difficult to produce a good die design by intuition. In addition to the associated costs, failure to meet a delivery date damages a company's image and has a negative impact on future business.

Numerical simulations could provide the quantitative information needed to minimize modifications during the manufacturing process [19]. Simulations with serial codes take as long as 40 to 60 processor-hours and usually require high-end workstations. Parallel stamping simulations enable die manufacturing companies to alter their die-design procedures: instead of an iterative approach, they can run more analyses before the first die prototypes are made. This reduces overall die design and manufacturing time, which is vital for the industry.


This problem has high computational requirements: the size of the problem is very large because of the high complexity of the design model. In this case, Grid computing is necessary because a complex model is required to solve the problem, and supplier companies currently do not have enough computational power available to perform such simulations; they are forced to use the services of remote computers that belong to other companies.

Car crash problems. Both the car body design and the passengers' safety must be considered. Before a real crash test is performed, hundreds of crashworthiness simulations are computed and analyzed [20]. Car crash simulations are required to predict the effects of various collisions, such as two cars colliding, on new advanced materials. As in the previous case, the car crash problem has high computational requirements, and a Grid platform is a good option for performing the simulations at lower cost.

3.2 Simulations of Mesh-based Applications
For the structural analysis of mesh-based applications such as car crash and sheet stamping, we use the displacement equation [21]. This equation determines the numerical solution for our applications. The discretization of the displacement equation using the finite element method gives:

    [M]{ü} + [K]{u} = {Fa}    (1)

where [M] is the mass matrix, [K] is the stiffness matrix, {ü} is the acceleration vector, {u} is the displacement vector and {Fa} is the force vector. To obtain the numerical solution of (1), we use the central difference method and numerical integration in time, obtaining:

    {u_{n+1}} = [A]{u_n}    (2)

Equation (2) defines the explicit method used to determine the numerical solution of our applications. Details of the characteristics of the matrix [A] and the vector {u_n} can be found in [21].

3.3 Explicit Simulations
In the previous subsection we described the general algorithmic structure used to determine the numerical solution for our simulations. The simulations follow two different schemes, called with and without matrix assembling, depending on whether or not the global matrix is gathered inside the inner loop. The scheme without matrix assembling uses an element-by-element calculation algorithm, so it is not necessary to gather a global equation system. The structure of this scheme is the following:

    for (each time step of the simulation)
        for (each finite element e)
            Gather the global vector values associated with element e ({u_n^e})
            Solve the system: {u_{n+1}^e} = [A^e]{u_n^e}
            Scatter {u_{n+1}^e} to the global vector ({u_{n+1}})
        Calculate the residual vector
        Update the boundary conditions
        Advance to the next time step

As we can see in the above algorithm, the scheme without matrix assembling performs one matrix-vector product per element, using the part of the global vector associated with that element ({u_n^e}). The result ({u_{n+1}^e}) is scattered to the global vector ({u_{n+1}}). After this, the residual vector is calculated and the boundary conditions are updated if necessary. This scheme has the advantage that the global matrix [A] does not need to be formed in the inner loop, which leads to considerable memory savings and makes it possible to solve large problems on PCs with relatively little memory. In contrast, the scheme with matrix assembling needs an initial gathering of the global matrix, after which a matrix-vector product is performed as in the first scheme. The algorithm for this scheme is the following:

    for (each time step of the simulation)
        for (each finite element e)
            Gather [A^e]: [A] = [A] + [A^e]
        Solve the system: {u_{n+1}} = [A]{u_n}
        Calculate the residual vector
        Update the boundary conditions
        Advance to the next time step

Our work follows the scheme without matrix assembling to produce the explicit simulations, as this saves both memory and time.

4 DIMEMAS AND GRID ENVIRONMENT

We use a performance simulator called Dimemas. Dimemas is a tool developed by CEPBA¹ for simulating parallel environments [2, 12, 13]. The tool is fed with a trace file and a configuration file. To obtain the trace file, the parallel application is executed using an instrumented version of MPI [22]; note that this execution can be done on any kind of computer. The configuration file contains details of the simulated architecture, such as the number of nodes and the latency and bandwidth between nodes. Dimemas generates an output file that contains the execution times of the simulated application for the parameters specified in the configuration file. It is also possible to obtain a graphical representation of the parallel execution. Figure 1 shows the sequence of steps required to obtain the output file.
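The element-by-element scheme can be sketched as follows. This is our illustrative code, not the authors'; it assumes dense per-element matrices and a connectivity list giving each element's global degree-of-freedom indices:

```python
import numpy as np

def explicit_step(u, elem_matrices, connectivity):
    """One time step of the without-assembly scheme: for each element e,
    gather u_n^e, apply A^e, and scatter-add the result into u_{n+1}."""
    u_next = np.zeros_like(u)
    for A_e, dofs in zip(elem_matrices, connectivity):
        u_e = u[dofs]              # gather element vector from global vector
        u_next[dofs] += A_e @ u_e  # per-element product, scattered back
    return u_next
```

Summed over elements this is equivalent to assembling [A] as the sum of the [A^e] and computing [A]{u_n}, but no global matrix is ever stored.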

The Dimemas simulator uses a simple model for point-to-point communications. This model breaks the communication time into five components:

1. The latency time is a fixed time needed to start the communication.
2. The resource contention time depends on the global load in the local host [23].
3. The transfer time depends on the message size. We model this time with a bandwidth parameter.
4. The WAN contention time depends on the global traffic in the WAN [24].
5. The flight time is the time spent in the transmission of the message, during which no CPU time is used [23]. It depends on the distance between hosts; we consider hosts placed at equal distances, since our environment is homogeneous.

We consider an ideal environment to be one in which the resource contention time is negligible: the interconnection network has an infinite number of buses, and each host has as many links as it has distinct remote communications with other hosts. For the WAN contention time, we use a linear model to estimate the traffic in the external network, with a 1% influence from internal traffic and a 99% influence from external traffic. Thus, we model the communications with just three parameters: latency, bandwidth and flight time, set according to values commonly found in present networks [24, 25]. Table 1 shows the values of these parameters for internal and external host communications: the internal column gives the latency and bandwidth between processors inside a host, and the external column gives the values between hosts. Communications inside a host are fast (latency of 25 µs, bandwidth of 100 Mbps), while communications between hosts are slow (latency of 10 ms and 100 ms, bandwidth of 64 Kbps, 300 Kbps and 2 Mbps, flight time of 1 ms and 100 ms).

We model a Grid environment as a set of connected hosts, where each host is a network of Symmetric Multiprocessors (SMPs). Each host has a direct full-duplex connection with every other host. We choose this model because some of the most interesting Grids for scientists involve nodes that are themselves high-performance parallel machines or clusters. We consider three topologies in this study: two, four and eight hosts.

Fig. 1. The Dimemas tool.

¹ European Center for Parallelism of Barcelona

5 DATA DISTRIBUTION

This work is based on distributed applications that solve sparse linear systems using iterative methods. These systems arise from the discretization of partial differential equations, especially when explicit methods are used. The algorithms are parallelized using domain decomposition for the data distribution, with a parallel process associated to each domain. A matrix-vector product is carried out in each iteration of the iterative method. The matrix-vector product is performed with a domain decomposition algorithm, as a set of independent computations and a final set of communications. The communications in this context are associated to the domain boundaries: each process exchanges the boundary values with all its neighbors, so each process has as many communication exchanges as neighbor domains [26, 27]. For each exchange, the size of the message is the length of the boundary between the two domains involved. We use METIS to perform the domain decomposition of the initial mesh [28, 29, 30].

Balanced distribution pattern. This is the usual strategy for domain decomposition algorithms. It generates as many domains as there are processors in the Grid, with the computational load perfectly balanced between domains. This strategy is suitable for homogeneous parallel computing, where all communications have the same cost. Figure 2 shows an example of a finite element mesh with 256 degrees of freedom (dofs), with the boundary nodes of each balanced partition. We consider a Grid with 4 hosts and 8 processors per host, starting from an initial decomposition into four balanced domains. Figure 3 shows the balanced domain decomposition.
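The per-iteration operation described above, an independent local matrix-vector product followed by one boundary message per neighbor domain, can be sketched as follows. This is an illustrative helper of ours; the exchange step is only simulated here, whereas in the real application each message becomes an MPI exchange with the corresponding neighbor process:

```python
import numpy as np

def domain_matvec(A_local, u_local, neighbors):
    """One iteration on one domain: an independent local product, then
    one boundary message per neighbor domain (the message size equals
    the length of the boundary shared with that neighbor)."""
    v_local = A_local @ u_local                  # independent computation
    messages = {dom: v_local[idx]                # boundary values to send
                for dom, idx in neighbors.items()}
    return v_local, messages
```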

TABLE 1
LATENCY, BANDWIDTH AND FLIGHT TIME VALUES

    Parameter      Internal     External
    Latency        25 µs        10 ms and 100 ms
    Bandwidth      100 Mbps     64 Kbps, 300 Kbps and 2 Mbps
    Flight time    -            1 ms and 100 ms

Fig. 2. Boundary nodes per host (finite element mesh with 256 dofs).
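Ignoring the negligible contention terms, the three parameters of Table 1 give a simple linear cost per point-to-point message. A minimal sketch (our helper function, not part of Dimemas):

```python
def comm_time(msg_bits, latency_s, bandwidth_bps, flight_s=0.0):
    """Point-to-point cost: fixed latency + size-dependent transfer
    time + flight time; contention terms are assumed negligible."""
    return latency_s + msg_bits / bandwidth_bps + flight_s

# 1 Mbit boundary message on an internal link: 25 us latency, 100 Mbps
internal = comm_time(1e6, 25e-6, 100e6)
# same message on an external link: 10 ms latency, 2 Mbps, 1 ms flight time
external = comm_time(1e6, 10e-3, 2e6, 1e-3)
```

With these values the external message costs about 0.51 s versus about 0.01 s internally, which is why hiding remote communication behind computation dominates the strategy design.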

Fig. 3. Balanced distribution.

Fig. 5. MultipleB-domain distribution.

Fig. 6.c. Communication diagram for a computational iteration of the multipleB-domain distribution.

Unbalanced distribution pattern. Our proposal is to build some domains with a negligible computational load, devoted only to managing the slow communications. To do this, we divide the domain decomposition into two phases. First, a balanced domain decomposition is done between the hosts; this guarantees that the computational load is balanced across hosts. Second, an unbalanced domain decomposition is done inside each host. This second decomposition splits off the boundary nodes of the host sub-graph: we create as many special domains as there are remote communications. Since these domains contain only boundary nodes, they have negligible computational load. We call these special domains B-domains (Boundary domains). The remainder of the host sub-graph is decomposed into nproc − b domains, where nproc is the number of processors in the host and b is the number of B-domains. We call these domains C-domains (Computational domains). As a first approximation we assign one CPU to each domain. The CPUs assigned to B-domains remain inactive most of the time; we use this policy to obtain the worst case for our decomposition algorithm. This inefficiency could be avoided by assigning all the B-domains in a host to the same CPU.

We consider two unbalanced decompositions of the mesh shown in Fig. 2. First, we create one sub-domain with the layer of boundary nodes of each initial domain (singleB-domain), leaving seven computational domains per host (Fig. 4). Second, we create several domains (multipleB-domain) for each initial partition using the layer of boundary nodes; the remaining mesh is then decomposed into five computational domains (Fig. 5).

We must remark that the communication patterns of the balanced and unbalanced domain decompositions may differ, since the number of neighbors of each domain may also differ. Figure 6 illustrates the communication patterns of the balanced and unbalanced distributions for this example. The arrows in the diagram represent processors interchanging data: the beginning of an arrow identifies the sender and its end identifies the receiver. Short arrows represent local communications inside a host, whereas long arrows represent remote communications between hosts. In Fig. 6.a, all the processors are busy and the remote communications are done at the end of each iteration; the bars in the diagram represent the computational time of each processor. In Figs. 6.b and 6.c, the remote communication is overlapped with the computation: in Fig. 6.b, the remote communication is overlapped only with the first remote computation, while in Fig. 6.c all remote communications in the same host are overlapped.
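The second phase of the decomposition above can be sketched as follows. This is an illustrative helper of ours; the actual decompositions use METIS on the host sub-graph:

```python
def split_host_domains(adj, part, host):
    """Within one host, collect one B-domain (set of boundary nodes) per
    neighboring host, and leave the remaining nodes for the C-domains.
    `adj` maps node -> neighbor nodes; `part` maps node -> host id."""
    b_domains = {}   # neighbor host id -> boundary nodes of `host`
    c_nodes = set()
    for v, h in part.items():
        if h != host:
            continue
        remote = {part[w] for w in adj[v] if part[w] != host}
        if remote:                       # v touches another host's domain
            for r in remote:
                b_domains.setdefault(r, set()).add(v)
        else:                            # purely interior node
            c_nodes.add(v)
    return b_domains, c_nodes
```

The returned C-node set would then be split into nproc − b balanced pieces, again with a partitioner such as METIS.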
Fig. 6.a. Communication diagram for a computational iteration of the balanced distribution.
Fig. 4. SingleB-domain distribution.

Fig. 6.b. Communication diagram for a computational iteration of the singleB-domain distribution.

s ingleB-domain (4x 4)

6 RESULTS
In this section we show the results obtained using Dimemas. We simulate a 128 processors machine using the following Grid environment. The number of hosts is 2, 4 or 8; the number of CPUs/host is 4, 8, 16, 32 or 64; thus, we have from 8 to 128 total CPUs. The simulations were done considering lineal network traffic models. We consider three significant parameters to analyze the execution time behavior: the communication latency between hosts, the bandwidth in the external network and the flight time.
As data set, we consider a finite element mesh with 1,000,000 dofs. This size is usual for car crash or sheet stamping models. We consider two kinds of meshes, which define most of the typical cases. The first one, called stick mesh, can be completely decomposed in strips, so there are, at most, two neighbors per domain. The second one, called box mesh, cannot be decomposed in strips, so the number of neighbors per domain could be greater than two. The size of the stick mesh is 104x10x10 nodes. The size of the box mesh is 102x102x102 nodes. Figures 7.a, 7.b, 8.a and 8.b show the time reduction percentages compared with the balanced distribution for each Grid configuration in stick mesh as a function of the bandwidth. The unbalanced decomposition reduces the execution time expected for the balanced distribution in most cases. For a Grid with 2 hosts and 4 processors per host, the predicted execution time of the balanced distribution is better than other distributions because the number of remote communications is two. In this case, the multipleBdomain unbalanced distribution has only one or two processors per host computation. The results are similar when we consider that the external latency is equal to 100 ms (figs. 9.a, 9.b, 10.a and 10.b). Therefore, the value of this parameter has not significant impact on the results for this topology. In the other cases, the benefit of the unbalanced distributions ranges from 1% to 53% of time reduction. The execution time reduction increases until 82% for other topologies and configurations. For 4 and 8 hosts, the singleB-domain unbalanced distribution has similar behavior than the balanced distribution, since the remote communications
cannot be overlapped and have to be performed sequentially. In this case, topologies with few processors devoted to computation are not appropriate. Here the unbalanced distribution reduces the execution time by up to 32%.

Fig. 7.b. Execution time reduction for the stick mesh with external latency of 10 ms and flight time of 1 ms (4 and 8 hosts).

Fig. 8.a. Execution time reduction for the stick mesh with external latency of 10 ms and flight time of 100 ms (2 hosts).

Fig. 8.b. Execution time reduction for the stick mesh with external latency of 10 ms and flight time of 100 ms (4 and 8 hosts).
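The difference between the singleB-domain and multipleB-domain behaviors on 4 and 8 hosts (serialized versus overlapped remote exchanges) can be captured by a crude cost model. This is an illustrative sketch, not the Dimemas simulator's model; the message size and parameter values below are assumptions.

```python
def remote_comm_cost(n_neighbors, msg_bytes, bandwidth_bps, latency_s, overlapped):
    """Per-iteration cost of one host's remote exchanges.

    overlapped=True  -> one communication CPU per remote neighbor
                        (multipleB-domain): exchanges proceed in parallel.
    overlapped=False -> a single communication CPU (singleB-domain):
                        exchanges are serialized.
    """
    per_msg = latency_s + 8.0 * msg_bytes / bandwidth_bps
    return per_msg if overlapped else n_neighbors * per_msg

# Interior host of a stick mesh on 4 hosts: two remote neighbors.
# 50 KB boundary message, 300 Kbps link, 100 ms external latency (assumed).
cost_seq = remote_comm_cost(2, 50_000, 300_000, 0.1, overlapped=False)
cost_ovl = remote_comm_cost(2, 50_000, 300_000, 0.1, overlapped=True)
assert cost_seq == 2 * cost_ovl
```

Under this model the singleB-domain distribution pays for each remote neighbor in turn, which is why it degenerates to balanced-like behavior once a host has more than one remote neighbor.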

Fig. 7.a. Execution time reduction for the stick mesh with external latency of 10 ms and flight time of 1 ms (2 hosts).
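The strip decomposition of the stick mesh, with thin B-domains assigned to the communication processors, can be sketched as follows. This is an illustrative model, not the authors' partitioning code; the 10% boundary share is an assumed parameter.

```python
def strip_partition(n_layers, n_hosts, cpus_per_host, boundary_share=0.1):
    """Unbalanced strip decomposition of a stick mesh.

    Each host receives a contiguous block of mesh layers.  Inside a
    host, thin B-domains at the block boundaries go to the CPUs that
    send updates to neighboring hosts; the remaining layers are split
    evenly among the computation CPUs.
    """
    layers_per_host = n_layers // n_hosts
    hosts = []
    for h in range(n_hosts):
        # end hosts of a stick mesh have one remote neighbor, interior hosts two
        n_boundary = 1 if h in (0, n_hosts - 1) else 2
        boundary = max(1, int(layers_per_host * boundary_share))
        comm_domains = [boundary] * n_boundary
        n_comp = cpus_per_host - n_boundary
        interior = layers_per_host - boundary * n_boundary
        comp_domains = [interior // n_comp] * n_comp
        for i in range(interior % n_comp):  # distribute the remainder
            comp_domains[i] += 1
        hosts.append({"comm": comm_domains, "comp": comp_domains})
    return hosts

# 10^4 layers over 4 hosts with 8 CPUs each: every host covers 2500 layers
parts = strip_partition(n_layers=10_000, n_hosts=4, cpus_per_host=8)
assert all(sum(p["comm"]) + sum(p["comp"]) == 2500 for p in parts)
```

The key property is that the communication CPUs hold far less work than the computation CPUs, so they are free to drive the slow external links.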

JOURNAL OF COMPUTING, VOLUME 3, ISSUE 4, APRIL 2011, ISSN 2151-9617 HTTPS://SITES.GOOGLE.COM/SITE/JOURNALOFCOMPUTING/ WWW.JOURNALOFCOMPUTING.ORG

136
Fig. 9.a. Execution time reduction for the stick mesh with external latency of 100 ms and flight time of 1 ms (2 hosts).

Fig. 10.b. Execution time reduction for the stick mesh with external latency of 100 ms and flight time of 100 ms (4 and 8 hosts).

Fig. 9.b. Execution time reduction for the stick mesh with external latency of 100 ms and flight time of 1 ms (4 and 8 hosts).

Fig. 10.a. Execution time reduction for the stick mesh with external latency of 100 ms and flight time of 100 ms (2 hosts).

Fig. 11.a. Execution time reduction for the box mesh with external latency of 10 ms and flight time of 1 ms (2 hosts).

Fig. 11.b. Execution time reduction for the box mesh with external latency of 10 ms and flight time of 1 ms (4 and 8 hosts).

Figures 11.a, 11.b, 12.a and 12.b show the reduction of the expected execution time obtained for each Grid configuration on the box mesh, varying the flight time, the external latency and the bandwidth. For the 2-host configurations, the behavior of the singleB-domain and multipleB-domain unbalanced distributions is similar, since the number of remote communications is the same. Variations of the flight time and the external latency improve the results by up to 85%. Figures 11.b and 12.b show the reduction of the expected execution time obtained for 4 and 8 hosts. The influence of the external latency on the application performance in a box mesh increases the execution time reduction by up to 4%. We assume that the distance between hosts is the same; however, if we consider hosts distributed at different distances, we obtain similar benefits for the different distributions. Moreover, if the calculation capacity of each processor in a host differs, the initial data partition will take this into account. The data assigned to each processor will then not be the same, but


Fig. 12.a. Execution time reduction for the box mesh with external latency of 10 ms and flight time of 100 ms (2 hosts).

Fig. 12.b. Execution time reduction for the box mesh with external latency of 10 ms and flight time of 100 ms (4 and 8 hosts).

Fig. 14. Communication diagram for a computational iteration of the multipleCB-domain distribution.

TABLE 2
MAXIMUM NUMBER OF COMMUNICATIONS FOR A COMPUTATIONAL ITERATION

              |    Balanced     | singleB-domain  | multipleB-domain
 Host x CPUs  | Remote |  Local | Remote |  Local | Remote |  Local
 STICK MESH   |        |        |        |        |        |
 2x4          |   1    |   1    |   1    |   1    |   1    |   1
 2x8          |   1    |   1    |   1    |   1    |   1    |   1
 2x16         |   1    |   1    |   1    |   1    |   1    |   1
 2x32         |   1    |   1    |   1    |   1    |   1    |   1
 2x64         |   1    |   1    |   1    |   1    |   1    |   1
 4x4          |   1    |   1    |   2    |   2    |   1    |   3
 4x8          |   1    |   1    |   2    |   2    |   1    |   3
 4x16         |   1    |   1    |   2    |   2    |   1    |   3
 4x32         |   1    |   1    |   2    |   2    |   1    |   3
 8x8          |   1    |   1    |   2    |   2    |   1    |   3
 8x16         |   1    |   1    |   2    |   2    |   1    |   3
 BOX MESH     |        |        |        |        |        |
 2x4          |   2    |   3    |   1    |   3    |   1    |   3
 2x8          |   4    |   5    |   1    |   6    |   1    |   6
 2x16         |   5    |   8    |   1    |   7    |   1    |   8
 2x32         |   6    |   7    |   1    |   15   |   1    |   14
 2x64         |   7    |   8    |   1    |   25   |   1    |   24
 4x8          |   7    |   5    |   3    |   6    |   4    |   6
 4x16         |   10   |   9    |   3    |   11   |   4    |   9
 4x32         |   9    |   8    |   3    |   22   |   4    |   14
 8x8          |   13   |   5    |   6    |   7    |   13   |   7
 8x16         |   13   |   4    |   6    |   13   |   13   |   11

the computational load will be balanced between processors. The number of remote and local communications varies depending on the partition and the dimensions of the data meshes. Table 2 shows the maximum number of communications for a computational iteration. The number of remote communications is higher for a box mesh than for a stick mesh; thus, the box mesh suffers higher overhead. We propose the use of unbalanced distribution patterns to reduce the number of remote communications required. Our approach proves very effective, especially for box meshes. We observe that the multipleB-domain unbalanced distribution is not sensitive to a latency increase until the latency exceeds the computational time, whereas the execution time of the balanced distribution increases with the latency.
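Counting the local and remote communications of a given placement, as summarized in Table 2, amounts to classifying the edges of the domain-adjacency graph. The following sketch illustrates the idea on a toy stick mesh; the adjacency and placement data are assumptions, not the paper's meshes.

```python
def count_communications(domains, host_of):
    """Count local (intra-host) and remote (inter-host) neighbor
    exchanges for one computational iteration.

    `domains` maps a domain id to the ids of its mesh neighbors;
    `host_of` maps a domain id to the host it is placed on.
    """
    local = remote = 0
    for d, neighbors in domains.items():
        for n in neighbors:
            if d < n:  # count each neighbor pair once
                if host_of[d] == host_of[n]:
                    local += 1
                else:
                    remote += 1
    return local, remote

# A 4-domain stick mesh split across 2 hosts: only one remote exchange.
stick = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
placement = {0: "A", 1: "A", 2: "B", 3: "B"}
assert count_communications(stick, placement) == (2, 1)
```

For a box mesh the adjacency graph is denser, so the same computation yields the larger remote counts seen in Table 2.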

7 MULTIPLECB-DOMAIN DISTRIBUTION
The multipleB-domain unbalanced distribution creates as many special domains per host as external communications. The scalability of this unbalanced distribution is therefore limited, because one processor is devoted solely to managing communications for every special domain. The optimal domain decomposition is problem dependent, but a simple model can be built to approximate the optimum.

To reduce the number of processors performing remote communications in the multipleB-domain distribution, we propose assigning all B-domains in a host to a single CPU, which concurrently manages all the communications. We call this unbalanced distribution multipleCB-domain. Figures 13 and 14 illustrate the domain decomposition and communication pattern of the multipleCB-domain distribution for the example described in Section 5. The main difference between the multipleB-domain and multipleCB-domain distributions is the number of domains per host, because in the latter all communications are assigned to the same CPU inside a host. Whereas in Figure 5 the multipleB-domain distribution has 8 data domains per host, the multipleCB-domain distribution has 10 data domains per host: three of them (the special domains) are assigned to one CPU, and the remaining data domains are assigned to the rest of the CPUs. Now only one CPU manages remote communications, and a larger number of CPUs perform computation. This kind of distribution minimizes the number of CPUs in a host devoted solely to remote communications.

For a Grid with 2 hosts, the predicted execution time is the same as for the multipleB-domain distribution, because the number of remote communications is only one. However, for 4 or 8 hosts, the multipleCB-domain distribution reduces the execution time by up to 43% compared to the balanced distribution, while the multipleB-domain distribution achieves reductions of up to 53%. In general, the multipleCB-domain distribution is about 10% worse than the multipleB-domain distribution, mainly due to the difficulty of managing concurrent remote communications in the simulator.

It is also important to consider the MPI implementation [31], since the ability to overlap communication and computation depends on it. A multithreaded MPI implementation could overlap communication and computation, but problems with context switching between threads and interference between processes could appear. In a single-threaded MPI implementation we can use non-blocking send/receive with a wait-all routine. However, we have observed some problems with this approach, associated with the internal ordering of non-blocking MPI send and receive operations. In our experiments this could be solved by explicitly programming the proper order of the communications, but the problem remains in the general case. We conclude that it is very important to have non-blocking MPI primitives that actually exploit the full-duplex channel capability. As future work, we will consider other MPI implementations that optimize collective operations [32], [33].

Fig. 13. MultipleCB-domain distribution.
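The CPU budget of the three unbalanced strategies inside one host can be sketched as follows. The mapping is an assumption based on the descriptions above, not the authors' implementation; the 13 external communications used in the example come from the box-mesh 8x8 row of Table 2.

```python
def cpu_budget(cpus_per_host, external_comms, strategy):
    """CPUs devoted to communication vs. computation inside one host."""
    if strategy == "singleB":        # one special domain per host
        comm = 1
    elif strategy == "multipleB":    # one special domain (and CPU) per external link
        comm = external_comms
    elif strategy == "multipleCB":   # all special domains on one CPU
        comm = 1
    else:
        raise ValueError(strategy)
    return comm, cpus_per_host - comm

# 8 CPUs per host on a box mesh with 13 external communications:
# multipleB would need more communication CPUs than the host has,
# while multipleCB still keeps 7 CPUs for computation.
assert cpu_budget(8, 13, "multipleB")[1] < 0
assert cpu_budget(8, 13, "multipleCB") == (1, 7)
```

This is exactly the regime in which the multipleB-domain distribution breaks down and the multipleCB-domain distribution remains applicable.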

8 CONCLUSIONS
In this paper, we present an unbalanced domain decomposition strategy for solving problems that arise from the discretization of partial differential equations on meshes. Applying the unbalanced distribution on different platforms is simple, because the data partition is easy to obtain. We compare the results with those of the classical balanced strategy and show that the unbalanced distribution pattern improves the execution time of domain decomposition applications in Grid environments. We considered two kinds of meshes, which cover the most typical cases, and show that the expected execution time can be reduced by up to 53%. The unbalanced distribution pattern reduces the number of remote communications required per host compared with the balanced distribution, especially for box meshes. However, the unbalanced distribution can be inappropriate if the total number of processors is less than the total number of remote communications. The optimal case is when the number of processors performing calculation in a host is twice the number of processors managing remote communications. Otherwise, if the number of processors performing calculation is small, the unbalanced distribution will be less efficient than the balanced distribution. In this case, we propose the use of the multipleCB-domain distribution, in which all remote communications in a host are concurrently managed by the same CPU. This distribution has around 10% worse execution time than the other unbalanced distributions. In general, to obtain good performance with the strategies presented in this paper, the number of processors per host needs to be 8 or more; otherwise, the number of processors performing computation is not enough to overlap the remote communications.

ACKNOWLEDGMENTS
This work was supported by the Ministry of Science and Technology of Spain under contract TIN2007-60625, the HiPEAC European Network of Excellence and the Barcelona Supercomputing Center (BSC).

REFERENCES
[1] G. Allen et al., "Classifying and enabling grid applications", Concurrency and Computation: Practice and Experience, vol. 0, pp. 1-13, 2000.
[2] Dimemas, http://www.cepba.upc.es/dimemas/, 2000.
[3] J. Chen and V. E. Taylor, "Mesh partitioning for efficient use of distributed systems", IEEE Trans. Parallel and Distributed Systems, vol. 13, no. 1, pp. 67-79, 2002.
[4] C. Walshaw and M. Cross, "Multilevel mesh partitioning for heterogeneous communications networks", Future Generation Computer Systems, vol. 17, no. 5, pp. 601-623, 2001.
[5] F. Pellegrini and J. Roman, "A software package for static mapping by dual recursive bipartitioning of process and architecture graphs", Proc. High Performance Computing and Networking, pp. 493-498, 1996.
[6] S. K. Das, D. J. Harvey and R. Biswas, "MinEX: a latency-tolerant dynamic partitioner for grid computing applications", Future Generation Computer Systems, vol. 18, no. 4, pp. 477-489, 2002.
[7] S. Huang, E. Aubanel and V. Bhavsar, "Mesh partitioners for computational grids: a comparison", Computational Science and Its Applications, LNCS 2269, pp. 60-68, 2003.
[8] S. Kumar, S. Das and R. Biswas, "Graph partitioning for parallel applications in heterogeneous grid environments", Proc. Sixteenth International Parallel and Distributed Processing Symposium, 2002, doi.ieeecomputersociety.org/10.1109/IPDPS.2002.1015564.
[9] W. D. Gropp and D. E. Keyes, "Complexity of Parallel Implementation of Domain Decomposition Techniques for Elliptic Partial Differential Equations", SIAM Journal on Scientific and Statistical Computing, vol. 9, no. 2, pp. 312-326, 1988.
[10] D. K. Kaushik, D. E. Keyes and B. F. Smith, "On the Interaction of Architecture and Algorithm in the Domain-based Parallelization of an Unstructured Grid Incompressible Flow Code", Proc. Tenth International Conference on Domain Decomposition Methods, pp. 311-319, 1997.
[11] W. Gropp et al., "Latency, Bandwidth, and Concurrent Issue Limitations in High-Performance CFD", Proc. First MIT Conference on Computational Fluid and Solid Mechanics, pp. 830-841, 2001.


[12] R. M. Badia et al., "Dimemas: Predicting MPI Applications Behavior in Grid Environments", Proc. Workshop on Grid Applications and Programming Tools (GGF8), 2003.
[13] R. M. Badia et al., "Performance Prediction in a Grid Environment", Proc. First European Across Grids Conference, 2003.
[14] Y. Li and Z. Lan, "A Survey of Load Balancing in Grid Computing", Computational and Information Science, Proc. First International Symposium (CIS'04), pp. 280-285, 2004.
[15] B. Otero et al., "A Domain Decomposition Strategy for GRID Environments", Proc. Eleventh European PVM/MPI Users' Group Meeting, pp. 353-361, 2004.
[16] B. Otero and J. M. Cela, "A workload distribution pattern for grid environments", Proc. 2007 International Conference on Grid Computing and Applications, pp. 56-62, 2007.
[17] B. Otero et al., "Performance Analysis of Domain Decomposition", Proc. Fourth International Conference on Grid and Cooperative Computing, pp. 1031-1042, 2005.
[18] B. Otero et al., "Data Distribution Strategies for Domain Decomposition Applications in Grid Environments", Proc. Sixth International Conference on Algorithms and Architectures for Parallel Processing, pp. 214-224, 2005.
[19] W. Sosnowski, "Flow Approach-Finite Element Model for Stamping Processes versus Experiment", Computer Assisted Mechanics and Engineering Sciences, vol. 1, pp. 49-75, 1994.
[20] N. Frisch et al., "Visualization and Pre-processing of Independent Finite Element Meshes for Car Crash Simulations", The Visual Computer, vol. 18, no. 4, pp. 236-249, 2002.
[21] Z. H. Zhong, Finite Element Procedures for Contact-Impact Problems, Oxford University Press, pp. 1-372, 1993.
[22] Paraver, http://www.cepba.upc.es/dimemas, 2002.
[23] R. M. Badía et al., "DAMIEN: Distributed Applications and Middleware for Industrial Use of European Networks", D5.3/CEPBA, IST-2000-25406, unpublished.
[24] R. M. Badía et al., "DAMIEN: Distributed Applications and Middleware for Industrial Use of European Networks", D5.2/CEPBA, IST-2000-25406, unpublished.
[25] B. Otero and J. M. Cela, "Latencia y ancho de banda para simular ambientes Grid", Technical Report TR-UPC-DAC-2004-33, UPC, Spain, 2004. http://www.ac.upc.es/recerca/reports/DAC/2004/index,ca.html
[26] D. E. Keyes, "Domain Decomposition Methods in the Mainstream of Computational Science", Proc. Fourteenth International Conference on Domain Decomposition Methods, pp. 79-93, 2003.
[27] X. C. Cai, "Some Domain Decomposition Algorithms for Nonselfadjoint Elliptic and Parabolic Partial Differential Equations", Technical Report TR-461, Courant Institute, NY, 1989.
[28] G. Karypis and V. Kumar, "Parallel multilevel k-way partitioning scheme for irregular graphs", SIAM Review, vol. 41, no. 2, pp. 278-300, 1999.
[29] G. Karypis and V. Kumar, "A fast and high quality multilevel scheme for partitioning irregular graphs", SIAM J. Sci. Comput., vol. 20, no. 1, pp. 359-392, 1998.
[30] Metis, http://glaros.dtc.umn.edu/gkhome/views/metis, 2011.
[31] Message Passing Interface Forum, MPI-2: Extensions to the MPI, 2003. http://scc.ustc.edu.cn/zlsc/cxyy/200910/W020100308601028317962.pdf
[32] N. Karonis, B. Toonen and I. Foster, "MPICH-G2: A Grid-enabled Implementation of the Message Passing Interface", Journal of Parallel and Distributed Computing, vol. 63, no. 5, pp. 551-563, 2003.
[33] I. Foster and N. T. Karonis, "A Grid-enabled MPI: Message Passing in Heterogeneous Distributed Computing Systems", Proc. ACM/IEEE Supercomputing, 1998.

B. Otero received her M.Sc. and her first Ph.D. degree in Computer Science from the Central University of Venezuela in 1999 and 2006, respectively. She then received her second Ph.D., in Computer Architecture and Technology, in 2007 from the Polytechnic University of Catalonia (UPC). Currently, she is an Assistant Professor in the Computer Architecture Department at UPC. Her research interests include parallel programming, load balancing, cluster computing, and autonomic communications. She is a member of the HiPEAC Network of Excellence.

M. Gil is an Associate Professor at the Universitat Politècnica de Catalunya (UPC). She received her Ph.D. in computer science from UPC in 1994. Her research is primarily concerned with the design and implementation of system software for parallel computing to improve resource management; her work focuses mainly on OS, middleware and runtime support for multicore architectures. She is a member of the HiPEAC Network of Excellence and of the SARC European project.
