
Failure Resilient Heterogeneous Parallel Computing Across
Multidomain Clusters
Dawid Kurzyniec, Peter Hwang and Vaidy Sunderam
{dawidk,vss}@mathcs.emory.edu
Dept. of Math and Computer Science
Emory University, Atlanta, GA, USA

30th November 2004

Abstract
We propose lightweight middleware solutions that facilitate and simplify the execution of
failure-resilient MPI programs across multidomain clusters. The system described in this pa-
per leverages H2O, a distributed metacomputing framework, to route MPI message passing
across heterogeneous aggregates located in different administrative or network domains. MPI
programs instantiate a specially written H2O pluglet; messages that are destined for remote
sites are intercepted and transparently forwarded to their final destinations. We demonstrate
that the proposed technique is indeed effective in enabling communication by MPI programs
across distinct clusters and across firewalls. Only minimally lowered performance was observed
in our tests, and we believe the substantially increased functionality would compensate for this
overhead in most situations. In addition to enabling multi-cluster communications, we note
that with the increasing size and distribution of metacomputing environments, fault tolerance
aspects become critically important. We argue that the fault tolerance model proposed by
FT-MPI fits well in geographically distributed environments, even though its current imple-
mentation is confined to a single administrative domain. We describe extensions to overcome
these limitations by combining FT-MPI with the H2O framework. Our approach allows users
to run fault tolerant MPI programs on heterogeneous, geographically distributed shared ma-
chines, without sacrificing performance and with minimal involvement of resource providers.

1 Introduction
Clusters have recently grown in popularity as the high performance computing architecture of
choice, and they continue to increase in size – currently measured in thousands of nodes
per installation [31]. Additionally, there is a trend towards aggregating such supercomputing
resources across multiple networks that are geographically and administratively separate, and
heterogeneous in terms of architecture, software, and capabilities. In such “multidomain” systems,
some parts of the resource collection may also be behind firewalls, or on a non-routable private
network, leading to difficulties in access to and from a counterpart entity in another portion of
the system.
Despite the emergence of numerous distributed programming technologies, such as workflow
systems, Web Services [32], or problem solving environments [2], MPI [11] arguably remains the
most popular programming paradigm for high performance applications. In the MPI model,
individual processes may send and receive messages to and from any other process, and may spawn
processes on any other host, irrespective of its physical location. It is easy to see that this model

does not translate very effectively to multidomain systems in which communication from one
domain to another is not straightforward, and process control is subject to domain-specific security
policies. The situation is further exacerbated by the heterogeneity of multi-domain environments
which are based on diverse hardware platforms, operating systems, access policies, and middleware
systems. Last but not least, the probability of failures, disconnects, and network partitions
increases tremendously, both due to size and the uncoordinated, unpredictable nature of such
aggregates.
It has been demonstrated [3] that MPI can provide capabilities for solving very large scientific
problems on heterogeneous computational grids, under tightly controlled circumstances, using
grid-enabled MPI implementations [19]. However, current solutions require tedious and time-
consuming configuration by resource providers [7, 29], and they lack fault tolerance facilities. To
realize the importance of the latter, we note that current 10-20 Tflops machines experience a
partial failure every 10-40 hours, even if individual components are of very high reliability [28]. In
geographically distributed setups, the situation is much worse: the unpredictability of wide-area networks
causes estimated mean time between failures (MTBF) to drop to the order of minutes. It is obvious
that scientific applications, whose execution times routinely reach weeks, or even months, will not
be able to run in such setups unless appropriate provisions of the environment allow them to
withstand and recover from partial failures.
In the context of the Harness and H2O projects, we are devising lightweight schemes to ad-
dress the above obstacles by exploiting and enhancing established MPI implementations [16, 10].
Our approach to enable communication between domains involves the instantiation of customiz-
able agents at selected locations; by leveraging the security and reconfigurability features of H2O,
such "pluglets" serve as proxies that relay messages between individual domains as appropriate,
transparently performing address and other translations that may be necessary. Fault tolerance
is addressed by leveraging appropriate features of the FT-MPI [10] implementation, extending
them to geographically distributed environments spanning multiple administrative domains. We
present the overall architecture and design of this system, describing how it enables MPI pro-
grams to operate across firewalls and private networks, with preliminary performance results.
Finally, we demonstrate how multi-domain computing is facilitated by enabling dynamic staging
of applications and of the MPI runtime on shared resources.

2 Background and Related Work


Aspects of heterogeneity, fault-tolerance, and support for geographically distributed environments,
in the context of MPI, have been addressed individually by several other projects. We discuss
these related projects in this section, along with frameworks and MPI implementations that were
used as a foundation for our work.

2.1 The H2O Metasystem Framework


Among general purpose distributed software infrastructures considered most promising today are
“grid” [27, 14] and “metacomputing” frameworks. These infrastructures primarily aim to aggre-
gate or cross-access varied resources, often spanning different administrative domains, networks,
and institutions [13]. Our ongoing work (upon which the research described in this paper is
based) has pursued alternative approaches to metasystem middleware; its mainstay is the H2O
substrate [24] for lightweight and flexible cooperative computing. In the H2O system, a soft-
ware backplane architecture supporting component-based services hosts pluggable components
that provide composable services.

Figure 1: H2O component model illustrating pluglets deployable by clients, providers, or third-
party resellers, and different modes of operation, including client-side aggregation.

Such components may be uploaded by resource providers but
also by clients or third-parties who have appropriate permissions. Resource owners retain com-
plete and fine-grained control over sharing policies; yet, authorized clients have much flexibility
in configuring and securely using compute, data, and application services. By utilizing a model
in which service providers are central and independent entities, global distributed state is signifi-
cantly reduced at the lower levels of the system, thereby resulting in increased failure resilience and
dynamic adaptability. Further, since providers themselves can be clients, a distributed computing
environment can be created that operates in true peer-to-peer mode, but effectively facilitates the
farm computing, private (metacomputing) virtual machine, and grid paradigms for resource shar-
ing. H2O assumes that each individual resource may be represented by a software component that
provides services through well defined remote interfaces. Providers supply a runtime environment
in the form of a component container. The containers proposed in H2O are similar to those in
other environments (e.g. J2EE), but are capable of hosting dynamic components that are
supplied and deployed by (authorized) external entities, thereby facilitating (re)configuration of
services according to client needs. A comprehensive depiction of the H2O architecture is shown in
Figure 1. Components, called pluglets, follow a standardized paradigm and implement composable
functionality; details may be found in the H2O programming guide and other papers [17].

2.2 Grid-enabled and fault-tolerant MPI


The popularity of MPI and of grid software toolkits has motivated a number of efforts to marry the
two. One of the earliest attempts was the MPICH-G project [12], which successfully demonstrated
the deployment of MPI programs on multiple clusters interconnected by TCP/IP links. Based
on the MPICH implementation of MPI, this project leveraged the Globus toolkit to spawn MPI
jobs across different clusters, delivering single sign-on, executable staging, and process control
facilities. MPICH-G mainly concentrated on resource allocation and process spawning issues,
although heterogeneity and multinetwork systems were catered for. A more basic approach is
adopted by Madeleine-III [4] whose goal is to provide a true multi-protocol implementation of
MPI on top of a generic and multi-protocol communication layer called Madeleine.

Figure 2: Simplified H2O MPI Firewall architecture

A number
of projects have focused on the message passing substrate, providing optimized implementations
of collective communication primitives: these include PACX-MPI [21], MPICH-G2 [19] (a recent
incarnation of MPICH-G), StaMPI [18] which is now being extended to include MPI-IO, and
MagPie [23]. These efforts have been quite successful; however, they provide few fault tolerance
facilities and are therefore not scalable to very large distributed setups.
On the other hand, efforts to support fault tolerant MPI have thus far been confined to single
administrative domains. Solutions based on coordinated checkpointing [30, 1, 6] and message log-
ging [5, 26, 8] are attractive due to transparency for the application developer; however, they suffer
from certain inherent limitations related to the checkpoint/restart approach [25] reducing support
for heterogeneity. Also, in order to withstand network partitioning in geographically distributed
setups, they would necessarily have to transmit large amounts of data over low-bandwidth links
that are prone to partitioning. FT-MPI [9, 10] adopts a different approach, addressing fault tol-
erance at the API level. FT-MPI survives the crash of n-1 processes in an n-process job, and
can restart them. However, failure recovery is not transparent: the process state must be recon-
structed by the application. In exchange, FT-MPI maintains support for heterogeneity (e.g. by
allowing failed processes to be restarted on a different architecture) and minimizes communication
and computation overheads, resulting in high communication performance and scalability.
Our ongoing project seeks to build upon these experiences, leveraging features of existing
MPI implementations. The project aims to (1) comprehensively support machine, interconnection
network, and operating system heterogeneity; non-routable and non-IP networks; operation across
firewalls; and failure resilience, and (2) leverage the component architecture of H2O to enable
collaborations and sharing across administrative boundaries, by providing support for dynamic
and secure staging of application components, application data, and also MPI runtime libraries,
under strict control of the resource providers’ security policies. We describe our design and
prototype implementations in the remainder of this paper.

3 Multi-domain computing with MPICH2


Initially, our approach to enabling multi-domain MPI was focused on communication across fire-
walls, and was based on one of the most widespread implementations of MPI, viz. MPICH2. To
achieve our goals, we made appropriate modifications to the MPICH2 implementation and cou-
pled them with the H2O proxy mechanism. The design of these extensions and experiences from
initial experiments are described in this section.

Figure 3: Startup sequence

3.1 The H2O Proxy Pluglet


The H2O proxy pluglet essentially served as a forwarding and demultiplexing engine at the edge
of each resource in a multidomain system. This way, the necessity to handle wide-area communi-
cation and the responsibility to address firewall-related issues were confined to H2O proxy pluglets,
as illustrated in Figure 2. From an operational viewpoint, H2O proxy pluglets were loaded onto
H2O kernels by kernel owners, end-users, or other authorized third-parties, by exploiting the
hot-deployment feature of H2O. A startup program was provided to help load H2O proxy
pluglets into H2O kernels. The proxy pluglets leveraged H2O communication facilities to forward
messages across clusters. The fact that H2O uses well-known port numbers and is capable of tun-
neling communication via HTTP made it possible to configure firewalls appropriately to allow
the H2O forwarding. The startup program took two arguments: a filename of kernel references
corresponding to the H2O kernels where the proxy pluglets were to be loaded, and a codebase
URL where the proxy pluglet binaries were to be found. The startup program (1) read the kernel
references from the argument file; (2) logged in to each of the H2O kernels referenced in the file;
(3) loaded the proxy pluglets onto the kernels; (4) obtained a handle to each proxy pluglet; and
(5) distributed handle data to each proxy pluglet, as shown in the sequence diagram of Figure 3.
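For concreteness, the control flow of such a startup program can be sketched in Java as follows. The H2O-specific types and calls used below (KernelRef, KernelSession, loadPluglet, initialize) are hypothetical placeholders rather than the actual H2O client API; they merely mirror steps (1) through (5) above.

import java.io.File;
import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for the H2O client API; names are illustrative only.
interface KernelRef { KernelSession login() throws IOException; }                     // step 2
interface KernelSession { ProxyHandle loadPluglet(URL codebase) throws IOException; } // steps 3-4
interface ProxyHandle { void initialize(List<ProxyHandle> allProxies) throws IOException; } // step 5

public class ProxyStartup {
    public static void main(String[] args) throws IOException {
        File refFile = new File(args[0]);   // file with one kernel reference per line
        URL codebase = new URL(args[1]);    // location of the proxy pluglet binaries

        // (1) read the kernel references; the reference format is H2O-specific
        List<KernelRef> kernels = readKernelReferences(refFile);

        // (2)-(4) log in to each kernel, deploy the proxy pluglet, and collect handles
        List<ProxyHandle> handles = new ArrayList<ProxyHandle>();
        for (KernelRef kernel : kernels) {
            handles.add(kernel.login().loadPluglet(codebase));
        }

        // (5) hand every proxy the handles of all other proxies so that they can
        // later route messages between clusters
        for (ProxyHandle proxy : handles) {
            proxy.initialize(handles);
        }
    }

    private static List<KernelRef> readKernelReferences(File f) {
        throw new UnsupportedOperationException("depends on the H2O kernel reference format");
    }
}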

3.2 Changes to MPICH2


One of our design goals was to isolate any changes to the underlying MPI implementation so that
evolutionary changes might be made easily. The core modifications to the library involved the
re-routing of all messages through the H2O proxy pluglet rather than communicating directly
with the remote MPI process – an action that would not work across firewalls and non-routable
networks. This goal was addressed by locating the proxy protocol at the level of socket connections,
which was low enough to allow us to keep the number and extent of changes to a minimum, and
standard enough to ensure portability.

Figure 4: Establishing a connection via proxy

We needed to modify only two socket functions to reroute all MPI program communications
through the H2O proxy: sock_post_connect(host, port) – which asynchronously connects to a
host and port, and sock_handle_connect() – which handles newly
formed connections. The first function was modified to connect to the proxy pluglet (that listened
on a well-known local port) instead of the final destination; the second function was augmented
to forward the destination address to the proxy and let it establish the routed connection. Steps
in the interaction between MPI and the H2O proxy pluglet are listed below, and are shown
diagrammatically in Figure 4. The proxy pluglet enables outgoing connections using a locally
published port (steps 1, 2, 3); spawns a connection handler thread to service the connection
request (4,5); establishes a mapping to the destination pluglet (6,7); connects to the destination
user process (8); and sets up direct channels via helper threads (9,10).
After MPI programs were set up to successfully communicate over H2O channels, tests showed
that MPI was still unable to completely communicate through firewalls. This was due to the fact
that MPICH2 daemons, through which processes are initially spawned, had unresolved connec-
tions that were blocked by the firewall. To solve this problem, the MPICH2 daemon Python code
was also modified to reroute its communication in an analogous manner, by modifying the lowest
level socket function mpd_get_inet_socket_and_connect(host, port).
A few other issues are worthy of mention. First, due to the use of lazy connections, overheads
were confined to only those communication channels that were required, at the relatively low
expense of first-time delay. The second issue involves connecting to the proxy pluglet. Since
rerouting all communications to the H2O proxy pluglet requires changing the host and port
parameters of the MPI connect() call, a locally published connection endpoint scheme was used,
again resulting in isolating all changes to the sock_post_connect(host, port) function. Also, in
order to ensure that connection to the remote MPI program through the H2O proxy pluglets is
established before any communications are conducted, a handshake protocol was devised. After
the MPICH2 socket is connected and the host and port are sent to the H2O proxy pluglet, an
acknowledgment is received in confirmation.
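The handshake can be illustrated at the socket level as follows. The actual modification lives in MPICH2's C code (sock_post_connect); the wire format and the proxy port shown below are assumptions made purely for illustration. Only the overall exchange, namely connect locally, forward the real destination, and wait for an acknowledgment, reflects the scheme described above.

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.net.Socket;

// Socket-level illustration of the rerouted connect plus handshake.
public class ProxyConnect {
    // Port locally published by the H2O proxy pluglet (hypothetical value).
    private static final int LOCAL_PROXY_PORT = 7979;

    public static Socket connectViaProxy(String targetHost, int targetPort) throws IOException {
        // Connect to the local proxy instead of the final destination...
        Socket s = new Socket("localhost", LOCAL_PROXY_PORT);
        BufferedWriter out = new BufferedWriter(new OutputStreamWriter(s.getOutputStream(), "US-ASCII"));
        BufferedReader in = new BufferedReader(new InputStreamReader(s.getInputStream(), "US-ASCII"));

        // ...forward the real destination so the proxy can set up the routed channel...
        out.write(targetHost + ":" + targetPort + "\n");
        out.flush();

        // ...and block until the proxy acknowledges that the route is established.
        String reply = in.readLine();
        if (!"ACK".equals(reply)) {
            s.close();
            throw new IOException("proxy refused connection: " + reply);
        }
        return s; // from here on the socket is used exactly like a direct connection
    }
}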

3.3 Preliminary Results


With the described proxy-based approach, some degradation of performance is inevitable because
all direct TCP/IP connections (in MPICH2) now become three-stage channels that are routed
through the H2O proxy pluglet. To measure the overhead, we utilized a test suite included with
MPICH2, aimed at providing reproducible and reliable benchmarks for any MPI implementation.
Mpptest provides both point-to-point and collective operations tests, and goptest is used to study
the scalability of collective routines as a function of the number of processors [15]. The results for
MPICH2 and MPICH2 over H2O on Intel PCs running Linux over 100 Mbit Ethernet are presented.

Figure 5: Results of mpptest from the MPICH2 package

Mpptest results are shown in Figure 5. MPICH2 throughput plateaus at 11.2 MB/s (near the
theoretical bandwidth limit) and MPICH2 over H2O plateaus at 8.8 MB/s, showing a 26.8%
decrease. The anomalous drop in performance around message sizes of 50 KB for MPICH2 over
H2O is sometimes seen in parallel computing performance evaluations, and is likely due to a
benchmark-dependent maximum buffer size.
Goptest shows the scalability of calling collective routines such as scatter and broadcast with
respect to the number of processes. MPICH2 and MPICH2-over-H2O scalability with different
message sizes are shown in Figure 6. A scalability comparison between MPICH2 and MPICH2
over H2O of the 256-byte message size is shown in Figure 7. Very little difference between
MPICH2 and MPICH2 over H2O exists because the added communication is outweighed by the
synchronization time of collective routines.

4 Towards Fault-Tolerant Wide-Area MPI


Positive experiences with extending MPICH2 to enable communication across firewalls, described
in the previous section, encouraged us to seek a more thorough solution to multi-domain MPI
computing. Such a solution must address two concerns. First, individual users have different levels of access rights on shared resources in
domains that they do not control. The software infrastructure should relieve users of the logistical
burdens of using such environments, including problems related to deployment of MPI runtime
libraries, program executables, and data files, among computational resources that belong to
distinct administrative domains. Second, it should address issues of fragility that characterize
such geographically distributed multidomain environments.
Figure 6: Results of scatter goptest from the MPICH2 suite

Figure 7: Scalability comparison between MPICH2 and MPICH2-over-H2O

ClassLoader cl = this.getClass().getClassLoader();
String rname = "$pvmarch{}/$executable{myprogram}";
URL source = cl.getResource(rname);
File cmd = runtimeCxt.stageResource(source, "$executable{myprogram}");
Runtime.getRuntime().exec(cmd.getAbsolutePath());

Figure 8: Resource staging in the H2O framework.

Motivated by its scalability and performance characteristics [10], as well as its support for
heterogeneous environments, we chose FT-MPI as the base MPI implementation to support the
above facilities. This choice allowed us to leverage the fairly sophisticated and thorough support
provided in FT-MPI to survive process failures. As mentioned in the previous section, our de-
sign to enable multidomain MPICH2 necessitated only minimal and isolated changes; therefore,
porting those changes to FT-MPI was straightforward, and we could immediately focus on the
new extensions. This section presents features of H2O related to resource staging, and describes
how we applied them to extend FT-MPI functionality to multi-domain setups. We outline the
security aspects of our proposed extensions and discuss communication mechanisms implemented
to support interconnection between isolated private networks.

4.1 Staging in H2O


One of the main design goals of the H2O resource sharing framework was to achieve maximum,
preferably binary, compatibility across heterogeneous platforms. Also, it has been considered es-
sential to provide fine-grained access control infrastructure that would, for example, allow remote
users to utilize a local CPU but prevent them from accessing a file system. These goals motivated
the use of Java as the foundation technology. Nonetheless, H2O provides extensive support for
platform-dependent functionality, and provides features that allow the hiding of heterogeneity
from the user. For instance, an H2O service (i.e. pluglet) may request the JVM to link a dynamic
native library. The pluglet may carry multiple versions of the library precompiled for different
platforms; the appropriate version is transparently resolved and linked at run time. The following
scheme is used: first, the system obtains the “library path” attribute associated with the plu-
glet, then it resolves the actual resource name by combining information about the library path,
requested library name, and detected platform type, then it stages the resource to a temporary
local file, and finally it loads the library from that file.
In fact, the same mechanism can be used explicitly, by the pluglet or by its client, to stage
arbitrary resources. These resources may be either platform-specific or not, and they may originate
from the client's filesystem, the pluglet's class path, or an arbitrary URL. Figure 8 shows how pluglets can
stage native executables and then invoke them. Automatic platform type detection is based on
the classification introduced by PVM and also adopted by FT-MPI. Detection involves inspecting
Java system properties, analyzing results from the “uname” command (if present), and verifying
the presence of certain system-specific files.
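As a rough illustration, a PVM-style platform type can be derived from Java system properties alone, as in the sketch below. The actual H2O detection logic additionally consults "uname" output and system-specific files; the mapping here is a simplified assumption covering only the platform names that appear in this paper.

import java.util.Locale;

// Simplified PVM-style platform detection based solely on Java system properties.
public class PlatformType {
    public static String detect() {
        String os = System.getProperty("os.name", "").toLowerCase(Locale.US);
        String arch = System.getProperty("os.arch", "").toLowerCase(Locale.US);
        if (os.contains("windows")) {
            return "WIN32";
        } else if (os.contains("linux")) {
            return "LINUX";
        } else if (os.contains("sunos") || os.contains("solaris")) {
            // SPARC versus x86 Solaris map to different PVM architecture names
            return arch.contains("sparc") ? "SUN4SOL2" : "X86SOL2";
        }
        return "UNKNOWN";
    }

    public static void main(String[] args) {
        System.out.println("Detected platform type: " + detect());
    }
}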
Figure 9: Staging the FT-MPI runtime using the H2O framework.

4.2 FT-MPI Runtime Setup


The contents of the binary FT-MPI distribution can be divided into three categories:

• server side: service daemons and the FT-MPI runtime library;

• client side: ftmpirun and console;

• developer tools: ftmpicc, code examples, and tests.

As the above classification suggests, the only files that must be present at the computing nodes
are: the FT-MPI startup daemon (and optionally a notifier daemon and a watchdog daemon),
the FT-MPI runtime library, and the application itself. Importantly, since FT-MPI does not have
any external dependencies except for the standard C library, precompiled versions of these files
are widely portable within a single platform type.
We have applied the staging mechanism of H2O to automate the process of MPI runtime
deployment, as illustrated in Figure 9. FT-MPI service daemons have been wrapped into H2O
pluglets. To set up the MPI runtime, users simply deploy these pluglets to remote H2O kernels.
It is necessary that providers authorize the deployment (e.g. by indicating that they trust the
code signer) but they do not have to configure the MPI runtime themselves. The pluglets stage
and start up appropriate platform-specific versions of FT-MPI service daemons, fetching them
from a network repository.
Importantly, this scheme allows users to set up the MPI runtime, and thus execute MPI
programs, on machines where they do not have a login account but instead, only restricted H2O
access. The setup hides heterogeneity from users and releases providers from the responsibility
to install or configure additional software except for the H2O kernel itself.
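The staging step inside such a startup pluglet can be sketched by reusing the pattern of Figure 8. The snippet below assumes the same pluglet context as Figure 8 (the runtimeCxt object and resource-name conventions shown there), with the file names taken from Figure 9; it is a sketch of the idea, not the actual pluglet code.

// Inside the startup pluglet: stage the platform-specific FT-MPI daemon and
// runtime library from the pluglet's codebase, then launch the daemon.
ClassLoader cl = this.getClass().getClassLoader();
URL daemonSrc = cl.getResource("$pvmarch{}/$executable{startup_d}");
URL libSrc    = cl.getResource("$pvmarch{}/libftmpi.so");

// Copy both files to temporary local storage on the hosting kernel.
File daemon = runtimeCxt.stageResource(daemonSrc, "$executable{startup_d}");
File lib    = runtimeCxt.stageResource(libSrc, "libftmpi.so");

// Start the FT-MPI startup daemon from its freshly staged local copy; the staged
// runtime library sits alongside it for the MPI application to use.
Runtime.getRuntime().exec(daemon.getAbsolutePath());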

4.3 Application staging


By default, FT-MPI can launch only those applications that are stored in the node file system.
The startup daemon designates a directory where all application binaries must be stored; the
user supplies only the file name. We have extended this model to allow for applications to be
staged from arbitrary locations. Our scheme is analogous to Java remote class loading. Instead of
the aforementioned predefined base directory, we use a URL code base that the user may specify
along with the executable name.

Figure 10: Launching an MPI application from a network repository, e.g. via
ftmpirun -np 512 -codebase "http://myorg.edu/mpiapps/" myapp1

As illustrated in Figure 10, an execution request is handled by
the startup pluglet which uses the H2O staging mechanism to fetch a platform-specific version of
the application and save it in a local file. Then, the startup pluglet invokes the FT-MPI startup
daemon instructing it to launch the application from the local file.
Taking the Java class loading analogy even further, we advocate using the popular JAR
archive format as a deployment technology for multi-platform MPI applications stored in network
repositories, and JAR URLs as a means to refer to their contents. Below we list the features of
the JAR file format that prove useful in this usage scenario:

• Compression and indexing: reduces download sizes, sometimes significantly.

• Metadata: can be used to specify application requirements and dependencies.

• Digital signatures: support code source authentication, essential for assessing the trustwor-
thiness of code downloaded from remote repositories.

These features are backed by a readily available toolkit (e.g. command-line tools to sign the
code or to verify signatures) that can be directly applied to perform appropriate application
deployment.
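For example, a platform-specific entry of an application JAR stored in a network repository can be referred to with a standard Java "jar:" URL, which a staging mechanism can then open like any other URL. The repository location and entry names below are illustrative.

import java.io.InputStream;
import java.net.URL;

// Referring to an entry inside a remote application JAR: <JAR location>!/<entry path>.
public class JarCodebaseExample {
    public static void main(String[] args) throws Exception {
        URL entry = new URL("jar:http://myorg.edu/mpiapps/myapp1.jar!/LINUX/myapp1");
        try (InputStream in = entry.openStream()) {
            System.out.println("first byte of the staged binary: " + in.read());
        }
    }
}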

4.4 Access Control


To provide protection against malicious code, the startup daemon must assess whether the application
that it is asked to launch can be trusted. The usual approach is to base the decision upon the
code source and/or principal who requested the execution. FT-MPI includes limited support for
both: the code source is restricted to a designated place in a local filesystem which is assumed to
be trusted, and additionally, optional X.509-based user authentication is provided.
Figure 11: Interconnecting non-routable private networks.

Application staging, however, requires more sophisticated measures to ensure adequate secu-
rity. In H2O-FT-MPI, the notion of a code source has been extended to follow the Java model:
it is comprised of a URL base path and code signers. Information about the principal is acquired
from the H2O framework, which uses flexible authentication mechanisms to establish user identity.
The startup pluglet maintains a security policy that defines access rules, e.g. equivalent to “allow
Dr. X to execute any application he likes” or “allow all users in group Y to execute applications
coming from ’https://vendorZ/apps/’ or signed by ’VendorZ’ ”, etc.
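The kind of rule evaluated by such a policy can be sketched as follows. The class below is purely illustrative (the actual policy representation used by the startup pluglet is not shown here), but it captures the decision described above: a request is permitted when the principal matches the rule and the application's code source is either downloaded from a trusted URL prefix or signed by a trusted signer.

import java.net.URL;
import java.util.Set;

// Illustrative access rule combining principal, code base URL, and code signers.
public class AccessRule {
    private final Set<String> principals;      // e.g. {"DrX"} or all members of group Y
    private final String trustedUrlPrefix;     // e.g. "https://vendorZ/apps/"
    private final Set<String> trustedSigners;  // e.g. {"VendorZ"}

    public AccessRule(Set<String> principals, String trustedUrlPrefix, Set<String> trustedSigners) {
        this.principals = principals;
        this.trustedUrlPrefix = trustedUrlPrefix;
        this.trustedSigners = trustedSigners;
    }

    public boolean permits(String principal, URL codeBase, Set<String> signers) {
        if (!principals.contains(principal)) {
            return false;                      // requesting user not covered by this rule
        }
        boolean trustedLocation = trustedUrlPrefix != null
                && codeBase.toExternalForm().startsWith(trustedUrlPrefix);
        boolean trustedSigner = signers != null
                && !java.util.Collections.disjoint(signers, trustedSigners);
        return trustedLocation || trustedSigner;
    }
}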

4.5 Connectivity among Heterogeneous Clusters


FT-MPI uses TCP/IP as the underlying transport, assuming that nodes have unique IP addresses
and that they are accessible to each other through these addresses. However, it is common for
a cluster interconnect to be a private non-routable network. Consequently, nodes belonging to
two such clusters cannot directly communicate via IP. FT-MPI can be used inside a single such
cluster, but it is incapable of interconnecting a collection of them. To address this limitation, we
adopted and extended the proxy-based approach presented in Section 3, as illustrated here
in Figure 11. As before, proxies are deployed on clusters’ front-ends, which are assumed to have
global IP addresses as well as direct connectivity with their individual nodes. Since actual node
IP addresses may be private and thus non-unique, we instead assign to each node a virtual class
A private IP address, unique within a given distributed virtual machine. To ensure uniqueness,
each cluster’s proxy is granted a distinct range of virtual addresses that it can assign to the nodes.
Since proxies are created upon instantiation of the virtual machine, it is possible to let each proxy
know the virtual IP ranges of all other proxies. We modified the low-level communication layer of
FT-MPI (in a manner similar to that described previously for MPICH2) so that when a process
wants to establish a connection with another process, it contacts its local proxy first, specifying
the requested target’s virtual IP address. Based on that IP address, the proxy either sets up a
routed connection to the destination (through the destination’s proxy), or returns the physical
target IP so that the source process can follow with a direct connection.
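The address resolution performed by a proxy can be sketched as below. The encoding of virtual ranges (one second octet of the 10.x.x.x space per cluster) and all names are assumptions made for illustration; the essential point is the decision between returning a physical node address for a local target and routing through the owning remote proxy otherwise.

import java.util.HashMap;
import java.util.Map;

// Sketch of per-proxy virtual address resolution for a distributed virtual machine.
public class VirtualAddressTable {
    private final int localRange;                                      // e.g. 1 for 10.1.x.x
    private final Map<String, String> localPhysical = new HashMap<String, String>();
    private final Map<Integer, String> remoteProxies = new HashMap<Integer, String>();

    public VirtualAddressTable(int localRange) { this.localRange = localRange; }

    public void registerLocalNode(String virtualIp, String physicalIp) {
        localPhysical.put(virtualIp, physicalIp);      // assigned when the node joins
    }

    public void registerRemoteProxy(int range, String proxyAddress) {
        remoteProxies.put(range, proxyAddress);        // learned at virtual machine startup
    }

    // Returns the physical address for local targets, so the source can connect
    // directly, or the address of the remote proxy through which the connection
    // must be routed otherwise.
    public String resolve(String virtualIp) {
        String[] octets = virtualIp.split("\\.");
        int range = Integer.parseInt(octets[1]);       // assumed 10.<range>.x.x encoding
        if (range == localRange) {
            return localPhysical.get(virtualIp);
        }
        return remoteProxies.get(range);
    }
}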
We reemphasize that implementing proxies at the socket level allows us to minimize changes
to the MPI implementation, and simplifies future maintenance. However, further
development is needed in order to optimize the performance of collective operations [20, 18, 23, 22].
Figure 12: Throughput of routed connection versus direct connection, for small and large mes-
sages.

5 Performance Evaluation
Most of the functionality resulting from coupling FT-MPI with H2O is related only to the applica-
tion and environment startup and thus has no effect on the FT-MPI communication performance,
which is discussed elsewhere [10]. The only exception is the communication proxy infrastructure
that enables FT-MPI applications to communicate across clusters. In the case of intra-cluster com-
munications, resolution of virtual addresses introduces a small overhead at socket creation time
but causes no penalty thereafter. To measure the performance of a routed communication, we
performed a comparison experiment on two Pentium 4 2.4 GHz nodes interconnected by a Gi-
gabit Ethernet switch. The machines had 1 GB RAM each and were running Mandrake Linux
10.0. We used the standard communication benchmark “bmtest” from the FT-MPI distribution
to compare performance of a direct TCP/IP connection versus a route through two proxies, one
at each machine. Results, shown in Figure 12, indicate that the routed connection achieved about
65% of the throughput of a direct link consistently for all message sizes, reaching 50 MB/s for 1
MB messages. These results, consistent with earlier experiments with MPICH2, lead us to the
conclusion that proxies themselves are unlikely to cause a performance bottleneck if the commu-
nication channel between them (e.g. the Internet) is slower than the cluster interconnect, thus
constraining the available bandwidth.

6 Discussion and Future Work


In this paper, we have discussed effects and issues of extending the MPI paradigm to hetero-
geneous collections of distributed resources spanning administrative domains. We described our
initial efforts in supporting the operation of the standard MPI model over multidomain systems,
exemplified by aggregates of multiple clusters behind firewalls. This exercise demonstrated the
substantially improved functionality that could be attained with minimal changes to the under-
lying MPI implementation and with only marginally lowered performance. We then extended
this subsystem to comprehensive multidomain systems, and also supported the FT-MPI interface.
The fault tolerance model introduced by FT-MPI fits well in such environments; but the stan-
dard implementation was not directly applicable due to certain limitations. We have applied the
H2O metacomputing framework to address these shortcomings. Our approach allows users to
run fault tolerant MPI programs on heterogeneous, geographically distributed shared machines,
without sacrificing performance and without requiring resource providers to install and configure
MPI software. In the course of future work, we plan to implement optimizations to collective
communication routines by taking advantage of a hierarchical interconnect topology, typical in
geographically distributed environments.

References
[1] A. Agbaria and R. Friedman. Starfish: Fault-tolerant dynamic MPI programs on clusters
of workstations. In Eighth IEEE International Symposium on High Performance Distributed
Computing, page 31, Aug. 1999. Available at http://portal.acm.org/citation.cfm?id=
823236.

[2] S. Agrawal, J. Dongarra, K. Seymour, and S. Vadhiyar. NetSolve: Past, present, and future
- a look at a grid enabled server. In F. Berman, G. Fox, and A. Hey, editors, Making the
Global Infrastructure a Reality. Wiley Publishing, 2003. Available at http://icl.cs.utk.
edu/news_pub/submissions/netsolve-ppf.pdf.

[3] G. Allen, T. Dramlitsch, I. Foster, N. Karonis, M. Ripeanu, E. Seidel, and B. Toonen.


Supporting efficient execution in heterogeneous distributed computing environments with
Cactus and Globus. In Supercomputing 2001 Conference, Denver, Colorado, USA, November
10-16 2001. Available at http://www.cactuscode.org/Papers/GordonBell_2001.ps.gz.

[4] O. Aumage and G. Mercier. MPICH/MadIII: a Cluster of Clusters Enabled MPI Implemen-
tation. In Proc. 3rd IEEE/ACM International Symposium on Cluster Computing and the
Grid (CCGrid 2003), pages 26–35, Tokyo, May 2003. IEEE.

[5] A. Bouteiller, F. Cappello, T. Hérault, G. Krawezik, P. Lemarinier, and F. Magniette.


MPICH-V2: a fault tolerant MPI for volatile nodes based on pessimistic sender based mes-
sage logging. In ACM/IEEE SC2003 Conference, page 25, Phoenix, Arizona, USA, Nov.
2003. Available at http://www.sc-conference.org/sc2003/paperpdfs/pap209.pdf.

[6] Y. Chen, J. S. Plank, and K. Li. CLIP: A Checkpointing Tool for Message-Passing Parallel
Programs, pages 182–200. The MIT Press, Cambridge, MA, 2004.

[7] J. Chin and P. V. Coveney. Towards tractable toolkits for the Grid: a plea for lightweight,
usable middleware. Available at http://www.realitygrid.org/lgpaper21.pdf.

[8] E. Elnozahy and W. Zwaenepoel. Manetho: Transparent rollback-recovery with low overhead,
limited rollback and fast output commit. IEEE Transactions on Computers, Special Issue
on Fault-Tolerant Computing, 41(5):526–531, May 1992. Available at ftp://cs.rice.edu/
public/manetho/ieeetc-92.ps.gz.

[9] G. Fagg, A. Bukovsky, and J. Dongarra. HARNESS and fault tolerant MPI. Parallel Com-
puting, 27(11):1479–1496, Oct. 2001. Available at http://icl.cs.utk.edu/publications/
pub-papers/2001/harness-ftmpi-pc.pdf.
[10] G. Fagg, E. Gabriel, Z. Chen, T. Angskun, G. Bosilca, J. Pjesivac-Grbovic, and J. Don-
garra. Process fault-tolerance: Semantics, design and applications for high performance
computing. International Journal for High Performance Applications and Supercom-
puting, Apr. 2004. Available at http://icl.cs.utk.edu/projectsfiles/sans/pubs/
ft-semantics-lyon2002.pdf.

[11] Message Passing Interface Forum. MPI-2: Extensions to the Message-Passing Interface, July 18, 1997. http:
//www.mpi-forum.org/docs/mpi-20.ps.

[12] I. Foster and N. Karonis. A grid-enabled MPI: Message passing in heterogeneous distributed
computing systems. In Supercomputing 98, Orlando, FL, Nov. 1998.

[13] I. Foster and C. Kesselman. Globus: A metacomputing infrastructure toolkit. The Intl
Journal of Supercomputer Applications and High Performance Computing, 11(2):115–128,
Summer 1997.

[14] I. Foster, C. Kesselman, J. Nick, and S. Tuecke. The physiology of the grid: An open
grid services architecture for distributed systems integration, Jan. 2002. Available at http:
//www.globus.org/research/papers/ogsa.pdf.

[15] W. Gropp and E. Lusk. Reproducible measurements of MPI performance characteristics. In


Proceedings of 6th European PVM/MPI Users’ Group Meeting, volume 1697 of Lecture Notes
in Computer Science, Barcelona, Spain, Sept. 1999.

[16] W. Gropp, E. Lusk, N. Doss, and A. Skjellum. A high-performance, portable implementation


of the MPI message passing interface standard. Parallel Computing, 22(6):789–828, Sept.
1996.

[17] H2O Home Page. http://www.mathcs.emory.edu/dcl/h2o/.

[18] T. Imamura, Y. Tsujita, H. Koide, and H. Takemiya. An architecture of Stampi: MPI library
on a cluster of parallel computers. In 7th European PVM/MPI Users’ Group Meeting, pages
200–207, Balatonfüred, Lake Balaton, Hungary, 2000. Available at http://portal.acm.
org/citation.cfm?id=648137.746622.

[19] N. Karonis, B. Toonen, and I. Foster. MPICH-G2: A grid-enabled implementation of the Mes-
sage Passing Interface. Journal of Parallel and Distributed Computing (JPDC), 63(5):551–
563, May 2003. Available at ftp://ftp.cs.niu.edu/pub/karonis/papers/JPDC_G2/JPDC_
G2.ps.gz.

[20] N. Karonis, B. Toonen, and I. Foster. MPICH-G2: A grid-enabled implementation of the Mes-
sage Passing Interface. Journal of Parallel and Distributed Computing (JPDC), 63(5):551–
563, May 2003. Available at ftp://ftp.cs.niu.edu/pub/karonis/papers/JPDC_G2/JPDC_
G2.ps.gz.

[21] R. Keller, B. Krammer, M. S. Mueller, M. M. Resch, and E. Gabriel. MPI development


tools and applications for the grid. In Workshop on Grid Applications and Programming
Tools, Seattle, WA, June 2003.

[22] R. Keller, B. Krammer, M. S. Mueller, M. M. Resch, and E. Gabriel. MPI development tools
and applications for the grid. In Workshop on Grid Applications and Programming Tools,
Seattle, WA, USA, June 2003.
[23] T. Kielmann, H. E. Bal, S. Gorlatch, K. Verstoep, and R. F. Hofman. Network performance-
aware collective communication for clustered wide area systems. Parallel Computing,
27(11):1431–1456, Oct. 2001. Available at http://www.cs.vu.nl/~kielmann/papers/
lyon00.ps.gz.

[24] D. Kurzyniec, T. Wrzosek, D. Drzewiecki, and V. Sunderam. Towards self-organizing dis-


tributed computing frameworks: The H2O approach. Parallel Processing Letters, 13(2):273–
290, 2003.

[25] M. Litzkow, T. Tannenbaum, J. Basney, and M. Livny. Checkpoint and migration of UNIX
processes in the Condor distributed processing system. Technical Report 1346, University of
Wisconsin-Madison, 1997. Available at http://www.cs.wisc.edu/condor/doc/ckpt97.ps.

[26] S. Louca, N. Neophytou, A. Lachanas, and P. Eviripidou. MPI-FT: Portable fault tolerance
scheme for MPI. Parallel Processing Letters, 10(4):371–382, 2000. Available at http://www.
worldscinet.com/ppl/10/1004/S0129626400000342.html.

[27] Z. Nemeth and V. Sunderam. A comparison of conventional distributed computing envi-


ronments and computational grids. In International Conference on Computational Science
(ICCS), Amsterdam, Apr. 2002. Available at http://www.mathcs.emory.edu/harness/
pub/general/zsolt1.ps.gz.

[28] F. Petrini, K. Davis, and J. C. Sancho. System-level fault-tolerance in large-scale parallel


machines with buffered coscheduling. In 9th IEEE Workshop on Fault-Tolerant Parallel,
Distributed and Network-Centric Systems (FTPDS04), Santa Fe, NM, Apr. 2004. Available
at http://www.c3.lanl.gov/~fabrizio/papers/ftpds04.pdf.

[29] J. M. Schopf and B. Nitzberg. Grids: The top 10 questions. Scientific Programming, special
issue on Grid Computing, 10(2):103–111, Aug. 2002. Available at http://www.globus.org/
research/papers/topten.final.pdf.

[30] G. Stellner. CoCheck: Checkpointing and process migration for MPI. In 10th International
Parallel Processing Symposium, pages 526–531, Honolulu, Hawaii, Apr. 1996. Available at
http://portal.acm.org/citation.cfm?id=660853.

[31] Top 500 supercomputer sites. http://top500.org/.

[32] W3C Consortium. Web Services activity. http://www.w3.org/2002/ws/.
