Failure Resilient Heterogeneous Parallel Computing Across Multidomain Clusters
Article in The International Journal of High Performance Computing Applications · May 2005
DOI: 10.1177/1094342005054260 · Source: DBLP
2 authors, including: Dawid Kurzyniec (Google Inc.)
Abstract
We propose lightweight middleware solutions that facilitate and simplify the execution of
failure-resilient MPI programs across multidomain clusters. The system described in this pa-
per leverages H2O, a distributed metacomputing framework, to route MPI message passing
across heterogeneous aggregates located in different administrative or network domains. MPI
programs instantiate a specially written H2O pluglet; messages that are destined for remote
sites are intercepted and transparently forwarded to their final destinations. We demonstrate
that the proposed technique is indeed effective in enabling communication by MPI programs
across distinct clusters and across firewalls. Only minimally lowered performance was observed
in our tests, and we believe the substantially increased functionality would compensate for this
overhead in most situations. In addition to enabling multi-cluster communications, we note
that with the increasing size and distribution of metacomputing environments, fault tolerance
aspects become critically important. We argue that the fault tolerance model proposed by
FT-MPI fits well in geographically distributed environments, even though its current imple-
mentation is confined to a single administrative domain. We describe extensions to overcome
these limitations by combining FT-MPI with the H2O framework. Our approach allows users
to run fault tolerant MPI programs on heterogeneous, geographically distributed shared ma-
chines, without sacrificing performance and with minimal involvement of resource providers.
1 Introduction
Clusters have recently grown in popularity as the high performance computing architecture of
choice, while continuously increasing in size – currently measured in thousands of nodes
per installation [31]. Additionally, there is a trend towards aggregating such supercomputing
resources across multiple networks that are geographically and administratively separate, and
heterogeneous in terms of architecture, software, and capabilities. In such “multidomain” systems,
some parts of the resource collection may also be behind firewalls, or on a non-routable private
network, leading to difficulties in access to and from a counterpart entity in another portion of
the system.
Despite the emergence of numerous distributed programming technologies, such as workflow
systems, Web Services [32], or problem solving environments [2], MPI [11] arguably remains the
most popular programming paradigm for high performance applications. In the MPI model,
individual processes may send messages to and receive messages from any other process, and
may spawn processes on any other host, irrespective of its physical location. It is easy to see that this model
does not translate very effectively to multidomain systems in which communication from one
domain to another is not straightforward, and process control is subject to domain-specific security
policies. The situation is further exacerbated by the heterogeneity of multi-domain environments
which are based on diverse hardware platforms, operating systems, access policies, and middleware
systems. Last but not least, the probability of failures, disconnects, and network partitions
increases tremendously, both due to size and the uncoordinated, unpredictable nature of such
aggregates.
It has been demonstrated [3] that MPI can provide capabilities for solving very large scientific
problems on heterogeneous computational grids, under tightly controlled circumstances, using
grid-enabled MPI implementations [19]. However, current solutions require tedious and time-
consuming configuration by resource providers [7, 29], and they lack fault tolerance facilities. To
realize the importance of the latter, we note that current 10-20 Tflops machines experience a
partial failure every 10-40 hours, even if individual components are of very high reliability [28]. In
geographically distributed setups, the situation is much worse: the unpredictability of wide-area networks
causes the estimated mean time between failures (MTBF) to drop to the order of minutes. It is obvious
that scientific applications, whose execution times routinely reach weeks, or even months, will not
be able to run in such setups unless appropriate provisions of the environment allow them to
withstand and recover from partial failures.
In the context of the Harness and H2O projects, we are devising lightweight schemes to ad-
dress the above obstacles by exploiting and enhancing established MPI implementations [16, 10].
Our approach to enable communication between domains involves the instantiation of customiz-
able agents at selected locations; by leveraging the security and reconfigurability features of H2O,
such "pluglets" serve as proxies that relay messages between individual domains as appropriate,
transparently performing address and other translations that may be necessary. Fault tolerance
is addressed by leveraging appropriate features of the FT-MPI [10] implementation, extending
them to geographically distributed environments spanning multiple administrative domains. We
present the overall architecture and design of this system, describing how it enables MPI pro-
grams to operate across firewalls and private networks, with preliminary performance results.
Finally, we demonstrate how multi-domain computing is facilitated by enabling dynamic staging
of applications and of the MPI runtime on shared resources.
Figure 1: H2O component model illustrating pluglets deployable by clients, providers, or third-
party resellers, and different modes of operation, including client-side aggregation.
In H2O, resource providers host software components that provide composable services. Such components may be uploaded by resource providers but
also by clients or third-parties who have appropriate permissions. Resource owners retain com-
plete and fine-grained control over sharing policies; yet, authorized clients have much flexibility
in configuring and securely using compute, data, and application services. By utilizing a model
in which service providers are central and independent entities, global distributed state is signifi-
cantly reduced at the lower levels of the system, thereby resulting in increased failure resilience and
dynamic adaptability. Further, since providers themselves can be clients, a distributed computing
environment can be created that operates in true peer-to-peer mode, while effectively facilitating
the farm computing, private (metacomputing) virtual machine, and grid paradigms for resource
sharing. H2O assumes that each individual resource may be represented by a software component that
provides services through well defined remote interfaces. Providers supply a runtime environment
in the form of a component container. The containers proposed in H2O are similar to those in
other environments (e.g. J2EE), but are capable of hosting dynamic components that are
supplied and deployed by (authorized) external entities, thereby facilitating (re)configuration of
services according to client needs. A comprehensive depiction of the H2O architecture is shown in
Figure 1. Components, called pluglets, follow a standardized paradigm and implement composable
functionality; details may be found in the H2O programming guide and other papers [17].
(Figure: MPI daemons on either side of a firewall communicating through H2O proxy pluglets.)
A number of projects have focused on the message passing substrate, providing optimized implementations
of collective communication primitives: these include PACX-MPI [21], MPICH-G2 [19] (a recent
incarnation of MPICH-G), StaMPI [18], which is now being extended to include MPI-IO, and
MagPie [23]. These efforts have been quite successful; however, they provide few fault tolerance
facilities and are therefore not scalable to very large distributed setups.
On the other hand, efforts to support fault tolerant MPI have thus far been confined to single
administrative domains. Solutions based on coordinated checkpointing [30, 1, 6] and message
logging [5, 26, 8] are attractive due to their transparency for the application developer; however, they suffer
from certain inherent limitations of the checkpoint/restart approach [25] that reduce support
for heterogeneity. Also, in order to withstand network partitioning in geographically distributed
setups, they would necessarily have to transmit large amounts of data over low-bandwidth links
that are prone to partitioning. FT-MPI [9, 10] adopts a different approach, addressing fault tol-
erance at the API level. FT-MPI survives the crash of n-1 processes in an n process job, and
can restart them. However, failure recovery is not transparent: the process state must be
reconstructed by the application. In exchange, FT-MPI maintains support for heterogeneity (e.g. by
allowing a failed process to be restarted on a different architecture) and minimizes communication
and computation overheads, resulting in high communication performance and scalability.
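This recovery model places the burden of rebuilding state on the application itself. The pattern can be sketched as follows; this is an illustrative Python caricature, not FT-MPI's actual interface, which is a C API driven by communicator error modes:

```python
class ProcessFailure(Exception):
    """Stands in for the error code FT-MPI reports to surviving processes."""

def run_with_recovery(step, checkpoint, max_restarts=3):
    """Re-enter the computation, reconstructing application-level state from
    the last checkpoint each time a (simulated) process failure is reported."""
    restarts = 0
    while True:
        state = dict(checkpoint)   # the application, not the runtime, rebuilds state
        try:
            return step(state)
        except ProcessFailure:
            restarts += 1
            if restarts > max_restarts:
                raise

# Toy usage: the first attempt "crashes", the restarted attempt completes.
attempts = {"n": 0}
def step(state):
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise ProcessFailure()     # simulated crash of a peer process
    return sum(state["data"])

result = run_with_recovery(step, {"data": [1, 2, 3]})
```

The essential point is that the restart loop and the checkpoint contents are chosen by the application developer, which is precisely what keeps the runtime lightweight and portable across architectures.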
Our ongoing project seeks to build upon these experiences, leveraging features of existing
MPI implementations. The project aims to (1) comprehensively support machine, interconnection
network, and operating system heterogeneity; non-routable and non-IP networks; operation across
firewalls; and failure resilience, and (2) leverage the component architecture of H2O to enable
collaborations and sharing across administrative boundaries, by providing support for dynamic
and secure staging of application components, application data, and also MPI runtime libraries,
under strict control of the resource providers’ security policies. We describe our design and
prototype implementations in the remainder of this paper.
(Figure: The startup program reads H2O kernel references from /tmp/h2oMPICH.dat, logs in to each kernel, deploys a proxy pluglet, receives a pluglet handle, and initializes it.)
Changes were confined to two low-level MPICH2 socket functions: sock_post_connect() – which
asynchronously connects to a host and port – and sock_handle_connect() – which handles newly
formed connections. The first function was modified to connect to the proxy pluglet (which listened
on a well-known local port) instead of the final destination; the second function was augmented
to forward the destination address to the proxy and let it establish the routed connection. Steps
in the interaction between MPI and the H2O proxy pluglet are listed below, and are shown
diagrammatically in Figure 4. The proxy pluglet enables outgoing connections using a locally
published port (steps 1, 2, 3); spawns a connection handler thread to service the connection
request (4,5); establishes a mapping to the destination pluglet (6,7); connects to the destination
user process (8); and sets up direct channels via helper threads (9,10).
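The direct channels set up in steps 9 and 10 amount to a bidirectional byte relay. This can be sketched with two helper threads, one per direction; the socket pairs below stand in for the real MPI-process and remote-destination connections, and none of the names are H2O's actual API:

```python
import socket
import threading

def pump(src, dst):
    """Copy bytes from src to dst until src reports end-of-stream,
    then propagate the close to dst's sending side."""
    while True:
        data = src.recv(4096)
        if not data:
            break
        dst.sendall(data)
    dst.shutdown(socket.SHUT_WR)

def relay(a, b):
    """Two helper threads form a full-duplex channel between the endpoints."""
    threads = [threading.Thread(target=pump, args=(a, b)),
               threading.Thread(target=pump, args=(b, a))]
    for t in threads:
        t.start()
    return threads

# In-process socket pairs stand in for MPI process <-> proxy and
# proxy <-> remote destination connections.
mpi_side, proxy_in = socket.socketpair()
proxy_out, dest_side = socket.socketpair()
threads = relay(proxy_in, proxy_out)

mpi_side.sendall(b"MPI message")
mpi_side.shutdown(socket.SHUT_WR)    # no more data in this direction
received = dest_side.recv(4096)
dest_side.shutdown(socket.SHUT_WR)   # remote side closes its sending half too
for t in threads:
    t.join()
```

Once the relay threads are running, the proxy is entirely passive: it neither inspects nor reorders the MPI traffic, which keeps the per-message overhead low.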
After MPI programs were set up to communicate successfully over H2O channels, tests showed
that MPI was still unable to communicate fully through firewalls. This was because the
MPICH2 daemons, through which processes are initially spawned, opened connections of their
own that were blocked by the firewall. To solve this problem, the MPICH2 daemon Python code
was also modified to reroute its communication in an analogous manner, by modifying the lowest-level
socket function mpd_get_inet_socket_and_connect(host, port).
A few other issues are worthy of mention. First, due to the use of lazy connections, overheads
were confined to only those communication channels that were required, at the relatively low
expense of first-time delay. The second issue involves connecting to the proxy pluglet. Since
rerouting all communications to the H2O proxy pluglet requires changing the host and port
parameters of the MPI connect() call, a locally published connection endpoint scheme was used,
again isolating all changes to the sock_post_connect(host, port) function. Also, in
order to ensure that connection to the remote MPI program through the H2O proxy pluglets is
established before any communications are conducted, a handshake protocol was devised. After
the MPICH2 socket is connected and the host and port are sent to the H2O proxy pluglet, an
acknowledgment is received in confirmation.
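The handshake reduces to a small framing protocol: the connecting side sends the real destination, and the proxy replies with an acknowledgment once the routed connection exists. The wire format below is an assumption for illustration; the paper does not specify the actual encoding:

```python
import struct

ACK = b"\x06"  # assumed acknowledgment byte (ASCII ACK)

def encode_destination(host: str, port: int) -> bytes:
    """Frame the real destination as: 2-byte host length, host bytes, 2-byte port."""
    raw = host.encode("ascii")
    return struct.pack("!H", len(raw)) + raw + struct.pack("!H", port)

def decode_destination(frame: bytes):
    """Proxy side: recover (host, port) from a received frame."""
    (hlen,) = struct.unpack("!H", frame[:2])
    host = frame[2:2 + hlen].decode("ascii")
    (port,) = struct.unpack("!H", frame[2 + hlen:4 + hlen])
    return host, port

# The hostname here is hypothetical.
frame = encode_destination("node7.cluster-b.example.org", 2345)
host, port = decode_destination(frame)
```

Blocking on the acknowledgment before any MPI traffic is sent is what guarantees that the end-to-end route exists before the first message is attempted.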
We used the mpptest and goptest benchmarks: mpptest measures point-to-point performance, while goptest measures
the scalability of collective routines as a function of the number of processors [15]. The results for
MPICH2 and MPICH2 over H2O on Intel PCs running Linux over 100 Mbit Ethernet are presented below.
Mpptest results are shown in Figure 5. MPICH2 throughput plateaus at 11.2 MB/s (near the
theoretical bandwidth limit) and MPICH2 over H2O plateaus at 8.8 MB/s, a 21.4%
decrease. The anomalous drop in performance around message sizes of 50 KB for MPICH2 over
H2O is sometimes seen in parallel computing performance evaluations, and is likely due to a
benchmark-dependent maximum buffer size.
Goptest shows the scalability of calling collective routines such as scatter and broadcast with
respect to the number of processes. MPICH2 and MPICH2-over-H2O scalability with different
message sizes are shown in Figure 6. A scalability comparison between MPICH2 and MPICH2
over H2O of the 256-byte message size is shown in Figure 7. Very little difference between
MPICH2 and MPICH2 over H2O exists because the added communication is outweighed by the
synchronization time of collective routines.
(Figure: Layout of a network repository holding a pluglet and platform-specific executables: pluglet/Pluglet.class, LINUX/myprogram, WIN32/myprogram.exe.)

ClassLoader cl = this.getClass().getClassLoader();
String rname = "$pvmarch{}/$executable{myprogram}";
URL source = cl.getResource(rname);
File cmd = runtimeCxt.stageResource(source, "$executable{myprogram}");
Runtime.getRuntime().exec(cmd.getAbsolutePath());
Because of its fault tolerance model and its support for heterogeneous environments, we chose FT-MPI as the base MPI implementation to support the
above facilities. This choice allowed us to leverage the fairly sophisticated and thorough support
provided in FT-MPI to survive process failures. As mentioned in the previous section, our de-
sign to enable multidomain MPICH2 necessitated only minimal and isolated changes; therefore,
porting those changes to FT-MPI was straightforward, and we could immediately focus on the
new extensions. This section presents features of H2O related to resource staging, and describes
how we applied them to extend FT-MPI functionality to multi-domain setups. We outline the
security aspects of our proposed extensions and discuss communication mechanisms implemented
to support interconnection between isolated private networks.
As the above classification suggests, the only files that must be present at the computing nodes
are: the FT-MPI startup daemon (and optionally a notifier daemon and a watchdog daemon),
the FT-MPI runtime library, and the application itself. Importantly, since FT-MPI does not have
any external dependencies except for the standard C library, precompiled versions of these files
are widely portable within a single platform type.
We have applied the staging mechanism of H2O to automate the process of MPI runtime
deployment, as illustrated in Figure 9. FT-MPI service daemons have been wrapped into H2O
pluglets. To set up the MPI runtime, users simply deploy these pluglets to remote H2O kernels.
It is necessary that providers authorize the deployment (e.g. by indicating that they trust the
code signer) but they do not have to configure the MPI runtime themselves. The pluglets stage
and start up appropriate platform-specific versions of FT-MPI service daemons, fetching them
from a network repository.
Importantly, this scheme allows users to set up the MPI runtime, and thus execute MPI
programs, on machines where they do not have a login account but instead, only restricted H2O
access. The setup hides heterogeneity from users and releases providers from the responsibility
to install or configure additional software except for the H2O kernel itself.
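The repository lookup performed by such a pluglet can be sketched as a mapping from a platform identifier to a per-platform subdirectory. The repository URL, daemon name, and layout below are assumptions modeled on the $pvmarch{} staging convention described later in this section:

```python
def resource_path(base_url, platform, executable):
    """Resolve the platform-specific copy of an FT-MPI daemon or application
    in a network repository laid out as <base>/<PLATFORM>/<name>."""
    suffix = ".exe" if platform == "WIN32" else ""
    return f"{base_url}/{platform}/{executable}{suffix}"

# Hypothetical repository and daemon names, for illustration only.
linux_url = resource_path("http://repo.example.org/ftmpi", "LINUX", "ftmpi_startup_d")
win32_url = resource_path("http://repo.example.org/ftmpi", "WIN32", "ftmpi_startup_d")
```

Because precompiled FT-MPI binaries are portable within a platform type, a single repository entry per platform suffices for all kernels of that type.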
(Figure 10: An execution request from the user is delivered to the startup pluglet in each cluster's H2O kernel.)
To execute an application, the user provides a reference to its location in a repository along with the executable name. As illustrated in Figure 10, an execution request is handled by
the startup pluglet which uses the H2O staging mechanism to fetch a platform-specific version of
the application and save it in a local file. Then, the startup pluglet invokes the FT-MPI startup
daemon instructing it to launch the application from the local file.
Taking the Java class loading analogy even further, we advocate using the popular JAR
archive format as a deployment technology for multi-platform MPI applications stored in network
repositories, and JAR URLs as a means to refer to their contents. Below we list the features of
the JAR file format that prove useful in this usage scenario:
• Digital signatures: support code source authentication, essential for assessing the trustwor-
thiness of code downloaded from remote repositories.
These features are backed by a readily available toolkit (e.g. command-line tools to sign the
code or to verify signatures) that can be directly applied to perform appropriate application
deployment.
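Since a JAR file is an ordinary ZIP archive, selecting the platform-specific entry can be sketched with Python's standard zipfile module; the archive layout mirrors the LINUX/ and WIN32/ repository structure shown above, and the entry contents are placeholders:

```python
import io
import zipfile

def pick_entry(jar_bytes, platform, executable):
    """Return the bytes of the platform-specific executable stored in a
    multi-platform application JAR laid out as <PLATFORM>/<name>."""
    suffix = ".exe" if platform == "WIN32" else ""
    name = f"{platform}/{executable}{suffix}"
    with zipfile.ZipFile(io.BytesIO(jar_bytes)) as jar:
        return jar.read(name)

# Build a toy two-platform JAR in memory.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as jar:
    jar.writestr("LINUX/myprogram", b"\x7fELF...")
    jar.writestr("WIN32/myprogram.exe", b"MZ...")
linux_binary = pick_entry(buf.getvalue(), "LINUX", "myprogram")
```

In a real deployment the JAR's digital signature would be verified before any entry is extracted, which is exactly the property that makes the format attractive here.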
(Figure: FT-MPI startup daemons and startup pluglets in each cluster, interconnected through H2O proxies; communication within a cluster remains direct.)
Authorization decisions are based on the application's code source, which comprises a URL base path and code signers. Information about the principal is acquired
from the H2O framework, which uses flexible authentication mechanisms to establish user identity.
The startup pluglet maintains a security policy that defines access rules, e.g. equivalent to “allow
Dr. X to execute any application he likes” or “allow all users in group Y to execute applications
coming from ’https://vendorZ/apps/’ or signed by ’VendorZ’ ”, etc.
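Such a policy check can be sketched as follows; the rule structure and principal names are illustrative, not H2O's actual policy format:

```python
def is_authorized(principal, code_url, signers, rules):
    """Allow execution if any rule matches the requesting principal and
    either the code source's URL base path or one of its signers."""
    for rule in rules:
        if rule["principal"] not in (principal, "*"):
            continue
        if rule.get("any_code"):
            return True                     # e.g. "Dr. X may run anything"
        if rule.get("url_prefix") and code_url.startswith(rule["url_prefix"]):
            return True                     # trusted repository base path
        if rule.get("signer") and rule["signer"] in signers:
            return True                     # trusted code signer
    return False

rules = [
    {"principal": "drX", "any_code": True},
    {"principal": "*", "url_prefix": "https://vendorZ/apps/"},
    {"principal": "*", "signer": "VendorZ"},
]
allowed = is_authorized("alice", "https://vendorZ/apps/solver", [], rules)
denied = is_authorized("alice", "http://other/app", [], rules)
```

The check is deliberately local to the startup pluglet: providers express trust in repositories and signers once, rather than vetting each application individually.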
Figure 12: Throughput of routed connection versus direct connection, for small and large mes-
sages.
5 Performance Evaluation
Most of the functionality resulting from coupling FT-MPI with H2O is related only to the applica-
tion and environment startup and thus has no effect on the FT-MPI communication performance,
which is discussed elsewhere [10]. The only exception is the communication proxy infrastructure
that enables FT-MPI applications to communicate across clusters. In the case of intra-cluster
communication, resolution of virtual addresses introduces a small overhead at socket creation time
but causes no penalty thereafter. To measure the performance of a routed communication, we
performed a comparison experiment on two Pentium 4 2.4 GHz nodes interconnected by a Gi-
gabit Ethernet switch. The machines had 1 GB RAM each and were running Mandrake Linux
10.0. We used the standard communication benchmark “bmtest” from the FT-MPI distribution
to compare performance of a direct TCP/IP connection versus a route through two proxies, one
at each machine. Results, shown in Figure 12, indicate that the routed connection consistently achieved about
65% of the throughput of a direct link for all message sizes, reaching 50 MB/s for 1
MB messages. These results, consistent with earlier experiments with MPICH2, lead us to the
conclusion that proxies themselves are unlikely to cause a performance bottleneck if the
communication channel between them (e.g. the Internet) is slower than the cluster interconnect, thus
constraining the available bandwidth.
References
[1] A. Agbaria and R. Friedman. Starfish: Fault-tolerant dynamic MPI programs on clusters
of workstations. In Eighth IEEE International Symposium on High Performance Distributed
Computing, page 31, Aug. 1999. Available at http://portal.acm.org/citation.cfm?id=
823236.
[2] S. Agrawal, J. Dongarra, K. Seymour, and S. Vadhiyar. NetSolve: Past, present, and future
- a look at a grid enabled server. In F. Berman, G. Fox, and A. Hey, editors, Making the
Global Infrastructure a Reality. Wiley Publishing, 2003. Available at http://icl.cs.utk.
edu/news_pub/submissions/netsolve-ppf.pdf.
[4] O. Aumage and G. Mercier. MPICH/MadIII: a Cluster of Clusters Enabled MPI Implemen-
tation. In Proc. 3rd IEEE/ACM International Symposium on Cluster Computing and the
Grid (CCGrid 2003), pages 26–35, Tokyo, May 2003. IEEE.
[6] Y. Chen, J. S. Plank, and K. Li. CLIP: A Checkpointing Tool for Message-Passing Parallel
Programs, pages 182–200. The MIT Press, Cambridge, MA, 2004.
[7] J. Chin and P. V. Coveney. Towards tractable toolkits for the Grid: a plea for lightweight,
usable middleware. Available at http://www.realitygrid.org/lgpaper21.pdf.
[8] E. Elnozahy and W. Zwaenepoel. Manetho: Transparent rollback-recovery with low overhead,
limited rollback and fast output commit. IEEE Transactions on Computers, Special Issue
on Fault-Tolerant Computing, 41(5):526–531, May 1992. Available at ftp://cs.rice.edu/
public/manetho/ieeetc-92.ps.gz.
[9] G. Fagg, A. Bukovsky, and J. Dongarra. HARNESS and fault tolerant MPI. Parallel Com-
puting, 27(11):1479–1496, Oct. 2001. Available at http://icl.cs.utk.edu/publications/
pub-papers/2001/harness-ftmpi-pc.pdf.
[10] G. Fagg, E. Gabriel, Z. Chen, T. Angskun, G. Bosilca, J. Pjesivac-Grbovic, and J. Don-
garra. Process fault-tolerance: Semantics, design and applications for high performance
computing. International Journal for High Performance Applications and Supercom-
puting, Apr. 2004. Available at http://icl.cs.utk.edu/projectsfiles/sans/pubs/
ft-semantics-lyon2002.pdf.
[11] Message Passing Interface Forum. MPI-2: Extensions to the Message-Passing Interface, July 18, 1997. http:
//www.mpi-forum.org/docs/mpi-20.ps.
[12] I. Foster and N. Karonis. A grid-enabled MPI: Message passing in heterogeneous distributed
computing systems. In Supercomputing 98, Orlando, FL, Nov. 1998.
[13] I. Foster and C. Kesselman. Globus: A metacomputing infrastructure toolkit. The Intl
Journal of Supercomputer Applications and High Performance Computing, 11(2):115–128,
Summer 1997.
[14] I. Foster, C. Kesselman, J. Nick, and S. Tuecke. The physiology of the grid: An open
grid services architecture for distributed systems integration, Jan. 2002. Available at http:
//www.globus.org/research/papers/ogsa.pdf.
[18] T. Imamura, Y. Tsujita, H. Koide, and H. Takemiya. An architecture of Stampi: MPI library
on a cluster of parallel computers. In 7th European PVM/MPI Users’ Group Meeting, pages
200–207, Balatonfüred, Lake Balaton, Hungary, 2000. Available at http://portal.acm.
org/citation.cfm?id=648137.746622.
[19] N. Karonis, B. Toonen, and I. Foster. MPICH-G2: A grid-enabled implementation of the Mes-
sage Passing Interface. Journal of Parallel and Distributed Computing (JPDC), 63(5):551–
563, May 2003. Available at ftp://ftp.cs.niu.edu/pub/karonis/papers/JPDC_G2/JPDC_
G2.ps.gz.
[22] R. Keller, B. Krammer, M. S. Mueller, M. M. Resch, and E. Gabriel. MPI development tools
and applications for the grid. In Workshop on Grid Applications and Programming Tools,
Seattle, WA, USA, June 2003.
[23] T. Kielmann, H. E. Bal, S. Gorlatch, K. Verstoep, and R. F. Hofman. Network performance-
aware collective communication for clustered wide area systems. Parallel Computing,
27(11):1431–1456, Oct. 2001. Available at http://www.cs.vu.nl/~kielmann/papers/
lyon00.ps.gz.
[25] M. Litzkow, T. Tannenbaum, J. Basney, and M. Livny. Checkpoint and migration of UNIX
processes in the Condor distributed processing system. Technical Report 1346, University of
Wisconsin-Madison, 1997. Available at http://www.cs.wisc.edu/condor/doc/ckpt97.ps.
[26] S. Louca, N. Neophytou, A. Lachanas, and P. Eviripidou. MPI-FT: Portable fault tolerance
scheme for MPI. Parallel Processing Letters, 10(4):371–382, 2000. Available at http://www.
worldscinet.com/ppl/10/1004/S0129626400000342.html.
[29] J. M. Schopf and B. Nitzberg. Grids: The top 10 questions. Scientific Programming, special
issue on Grid Computing, 10(2):103–111, Aug. 2002. Available at http://www.globus.org/
research/papers/topten.final.pdf.
[30] G. Stellner. CoCheck: Checkpointing and process migration for MPI. In 10th International
Parallel Processing Symposium, pages 526–531, Honolulu, Hawaii, Apr. 1996. Available at
http://portal.acm.org/citation.cfm?id=660853.