
Comparison of overhead in different Grid

environments

master thesis in computer science

by

Thomas Zangerl

submitted to the Faculty of Mathematics, Computer


Science and Physics of the University of Innsbruck

in partial fulfillment of the requirements


for the degree of Master of Science

supervisor: Dr. Maximilian Berger, Institute of Computer


Science

Innsbruck, 24 January 2009


Certificate of authorship/originality
I certify that the work in this thesis has not previously been submitted for a
degree nor has it been submitted as part of requirements for a degree except as
fully acknowledged within the text.

I also certify that the thesis has been written by me. Any help that I have
received in my research work and the preparation of the thesis itself has been
acknowledged. In addition, I certify that all information sources and literature
used are indicated in the thesis.

Thomas Zangerl, Innsbruck, 24 January 2009

Abstract

Grid middlewares simplify access to Grid resources for the end user by pro-
viding functionality such as metascheduling, matchmaking, information services
and adequate security facilities. However, the advantages of the middlewares
usually come at the cost of temporal overhead added to the execution time
of the submitted jobs. In this thesis, we group the overhead into two cate-
gories: the first type of overhead occurs before the jobs are executed, in the
form of scheduling latency. The second is information service overhead, which
is introduced by delays in the information flow about the job status from the
executing worker node up to the end user. We analyse both types of overhead,
in terms of absolute values and variance, for SSH, Globus and gLite. We
evaluate our experimental data with regard to the influence of time of day,
weekday, CE and queue, and discuss the results and their implications.
Contents

1 Introduction 3
1.1 The Grid . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Grid middlewares . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.1 Virtual Organisations . . . . . . . . . . . . . . . . . . . . 5
1.2.2 Data staging . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 Problem definition . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4.1 Impact of overhead on short-running jobs . . . . . . . . . 7
1.4.2 Impact of overhead variance on timeout values . . . . . . 7
1.4.3 Impact of overhead variance on workflows . . . . . . . . . 8
1.4.4 Implications of measured overhead for middlewares . . . . 8
1.5 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.6 Structure of the thesis . . . . . . . . . . . . . . . . . . . . . . . . 9
1.7 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2 The Middlewares 13
2.1 SSH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2 Globus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.1 Security concepts . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.2 Resource Management . . . . . . . . . . . . . . . . . . . . 15
2.2.3 Information services . . . . . . . . . . . . . . . . . . . . . 17
2.2.4 Data transfer . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3 gLite . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3.1 Security concepts . . . . . . . . . . . . . . . . . . . . . . . 18
2.3.2 Resource Management . . . . . . . . . . . . . . . . . . . . 19
2.3.3 Information services . . . . . . . . . . . . . . . . . . . . . 23
2.3.4 Data transfer . . . . . . . . . . . . . . . . . . . . . . . . . 23

3 Implementation of adaptor and testbench 25


3.1 JavaGAT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.1.1 The motivation behind JavaGAT . . . . . . . . . . . . . . 25
3.1.2 JavaGAT’s overall architecture . . . . . . . . . . . . . . . 25
3.1.3 Writing adaptors for JavaGAT . . . . . . . . . . . . . . . 27

3.2 Implementing the gLite adaptor . . . . . . . . . . . . . . . . . . . 27
3.2.1 VOMS-Proxy creation . . . . . . . . . . . . . . . . . . . . 29
3.3 Job submission . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.3.1 External libraries . . . . . . . . . . . . . . . . . . . . . . . 30
3.3.2 License . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.4 The testbench . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

4 Results 35
4.1 Experiment setup . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.1.1 Statistical methods . . . . . . . . . . . . . . . . . . . . . . 35
4.1.2 SSH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.1.3 Globus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.1.4 gLite . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.2 Scheduling times . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.2.1 SSH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.2.2 Globus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.2.3 gLite . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.3 Notification overhead . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.3.1 SSH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.3.2 Globus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.3.3 gLite . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.4 Staging overhead . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.4.1 Globus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.4.2 gLite . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.5 Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.5.1 Scheduling times in Globus and gLite . . . . . . . . . . . 68
4.5.2 Information service overhead in Globus and gLite . . . . . 68

5 Conclusions 71

A gLite Documentation 73

B GridLab license 81

List of Figures 83

List of Tables 85

Bibliography 87

Acknowledgements
First and foremost, I want to thank my supervisor, Dr. Maximilian Berger,
who has always, when required, found the time to give his advice on problems
in person. More than once, this meant spending a lot of time on a particular
problem. He also gave me support and assistance, whenever I needed it.
My thanks also go to Dr. Alexander Beck-Ratzka, who developed
the original gLite adaptor and has been very helpful in supplying me with the
source code before its release and in so quickly answering the many questions
that I have asked him. Roelof Kemp from the JavaGAT project also gave me
invaluable support by integrating the adaptor into JavaGAT and by providing
me SVN contributor access to the project.
Furthermore, I want to thank the EGEE project, and particularly the VOCE
VO, for providing infrastructure and user support. In this context, I also want to
acknowledge all the people who helped me quickly, even on weekends, on the
various developer mailing lists that I burdened with my problems.
Roberto Pugliese was so kind as to mail me the source code of the glite-security-
trustmanager from the CVS at the University of Trieste, which I couldn't access
from outside due to the university's security policy.
Last but not least, I want to thank my family, who have supported me on so many
occasions during my studies, and my girlfriend Nadine, who has always motivated
me to carry on when I was frustrated.

Chapter 1

Introduction

1.1 The Grid

Scientific research, especially but not exclusively in the fields of astronomy,
physics, biology or chemistry, often means measuring and collecting large sets
of data. In order to gain usable results from the collected data, the raw samples
have to be evaluated, usually in the form of some sort of computational process-
ing. As experiments grow larger and larger, the amount of data that needs to be
processed and the computational power required for processing it increase.
For example, as of 2008 the Large Hadron Collider (LHC) is the largest sci-
entific instrument in the world and produces 15 petabytes of data per year [1].
If such amounts of data are to be processed, either large and expensive main-
frames are needed or the computational task has to be distributed among many
computers.
To assist in achieving that task, Grid computing was introduced by Foster
in [2] and given a more exact definition in [3].
Grid computing means harnessing the computational power provided by all kinds
of computational resources in different geographic areas of the world in order to
solve concrete computational problems on demand and as the resources become
available. In order to highlight the conceptual similarity to a power grid, the
term Grid was coined for this heterogeneous, distributed, worldwide computing
network.
Computing resources (so-called sites) participating in the Grid usually consist
of standard, general purpose hardware and a standard operating system (mostly
a flavor of Linux). Network communications between the Grid components
are usually routed using standard Internet protocols on existing Internet wires.
Because standard components are commonly manufactured in large amounts,
they are comparatively inexpensive. However, this advantage comes at the cost
of heterogeneity of hardware and software that may introduce complications and
management overhead.

Figure 1.1: The Grid layer model with the components of each layer (Fabric,
Connectivity, Resource, Collective, Application)

With this set of standard components, the Grid can provide considerable
computational power to the end user. The theoretical performance of the LHC
computing Grid alone has been estimated to equal 1 petaflop [4]. This is about
the same value as the Rmax performance in the Linpack benchmark of the top
site in the Top 500 list of supercomputers of June 2008 (see [5]).
Hence, the sheer mass of comparatively inexpensive computing components
can amount to considerable performance, yet somehow the resources have to be
coordinated to form a useful distributed computing environment. This is the
task of the Grid middleware.

1.2 Grid middlewares


The management of the Grid environment and of Grid tasks is conducted by a
Grid middleware. The middleware provides a layer between user-created Grid
applications and the Grid infrastructure (figure 1.2). Thus the middleware typ-
ically implements the functionality of two or three layers of the Grid service
stack (figure 1.1).
This way it abstracts details often considered unnecessary for application
development from the user and allows her to exploit Grid resources more con-
veniently in a coordinated way. The middleware follows a client-server model:
The client software stack is installed on the machine of the Grid user, while
compatible server versions have to be present on the Grid sites where the compu-
tational tasks of the user are to be executed.

Figure 1.2: The middleware as a transparent component between APIs and services

The set of Grid-related tasks that a middleware can take over on behalf of the
Grid user varies among different
middlewares; however, minimal functionality that a scientific user would expect
from a middleware includes access control, copying of needed data sets and the
executable itself to the execution sites (input staging), copying of result data
back to the user’s computer (output staging), continually reporting the progress
of the computational task (job monitoring) and relaying system messages to the
user (staging stdout and stderr).

1.2.1 Virtual Organisations


In order to implement access control, Grid middlewares often aggregate users
in so-called virtual organisations. A virtual organisation (VO) is a group of
individuals that can be mapped from real organisations or that consists of people
sharing a common goal, geographic region, usage motivation or other
incentives for cooperation. For example, in the EGEE Grid infrastructure there
exist virtual organisations like VOCE (VO for Central Europe) or compchem
(VO for people working in computational chemistry).
With many middlewares, Grid access control is performed on the basis of
virtual organisations, i.e. if a resource X may be used by VO Z and person Y

is a member of VO Z, person Y may use resource X. Of course, authenticity
and authorisation have to be ensured. For this purpose, cryptographically well
defined public key infrastructure procedures with X509-certificates [6] are used.
VOs maintain tables of their members along with the signatures of their digital
certificates. Upon accessing a resource, the user is expected to show her certifi-
cate or a temporally limited proxy created from it as proof of VO membership.

1.2.2 Data staging


Many middlewares perform input and output staging of data using the GridFTP
protocol [7]. GridFTP is an extension to standard FTP. Additional commands
have been introduced for parallel transmissions and striping, and accordingly a
transmission mode for out-of-order reception has been created. To make GridFTP
fit into the standard Grid security concepts, it supports authenticated transmis-
sions using GSS-API [8] tokens.
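
For illustration, a simple GridFTP transfer with parallel streams might be started
with the globus-url-copy client shipped with the Globus Toolkit; the host name and
paths below are placeholders and the sketch assumes that a valid proxy already
exists:

    # stage an input file to a Grid site using four parallel TCP streams
    globus-url-copy -p 4 file:///home/user/input.dat \
        gsiftp://gridftp.example.org/scratch/user/input.dat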

1.3 Problem definition


Grid middlewares exist to make life for the user easier; details are abstracted
away and the user or programmer is usually supplied with command-line tools
and/or programming language bindings that access middleware functionality in
a uniform way.
As an additional layer between the Grid applications or the user and the
actual Grid resources, the middlewares can also introduce significant overhead
for various operations. The overall temporal overhead in the execution of a job
can result from latencies in submitting jobs and scheduling them on CEs (if a
metascheduler is involved) and workqueues, data transfer times, queue waiting
times and information service overhead.
We define information service overhead as the latencies that can occur in
information flow within the Grid. Mostly this concerns the middleware’s job
monitoring components.
Since events such as the initialisation of output staging depend on information
about the job’s status, the information service overhead directly influences the
execution time. This type of overhead can normally not be modelled by a
constant factor. It varies among middlewares, execution sites, workqueues and
with different workloads of the respective resources. Due to the architecture of
some middlewares, the overhead can also be subject to random variance.
Various types of overhead can in the worst case lead to job executions that
show latencies that are disproportionately high when compared to the expected

net job execution time or even stay in a non-completed state for an indefinite
time span. In a Grid environment, the goal is to avoid such executions, for
example by using a timeout and resubmission strategy. To determine feasible
timeout values, knowledge about approximate latency values and latency vari-
ance is critical.

1.4 Motivation
In this thesis, the overhead of different middlewares is measured and compared.
Such overhead has different negative effects, depending on whether the total overhead
or the overhead variance is the major problem.

1.4.1 Impact of overhead on short-running jobs


Middleware overhead prolongs execution time. This especially impacts experi-
ments with many short-running jobs, because the middleware-overhead adds to
the job’s execution time and, in case of high overheads, may even surpass the
execution time itself. This can render the Grid infeasible for certain experiment
configurations.
A quick solution from the point of view of the application developers is to bypass
the middleware's workload or job management systems by creating applications that
submit job controllers to the Grid, which spawn several of their own small jobs.
Because those small jobs are invisible to the workload management of the Grid
middleware, they are not directly subject to the quality of service mechanisms
imposed by it.
It can neither be in the interest of a Grid middleware creator to render the
Grid infeasible for certain types of jobs, nor can it be intended that users create
their own middlewares within the middleware. Hence the goal should be to keep
information service latency values at a minimum level.

1.4.2 Impact of overhead variance on timeout values


Not all jobs submitted to the Grid will terminate successfully and return the
result of the computation. This is a natural result of the Grid’s structure. Due
to the heterogeneity of the Grid, the site on which the job is scheduled may lack
a required library or may not be able to provide all the needed resources. Due
to the decentralised organisation, sites may be offline but not reported as such,
or network failures may occur. The job may get stuck indefinitely in a waiting
queue on a computing element, because other jobs with higher priorities receive
preferential treatment.

Solutions to this problem are often based on timeouts and resubmission of
jobs. The timeout values either include an estimation of the job’s execution
time and a statistical model of the latency or are based on the estimated times
a job should spend in a certain state (for example, in the "WAITING" state).
Timeouts can be crucial in an environment without success guarantees. How-
ever, latency variance makes it hard to estimate such timeout values. Most
notably, the latency should not depend on a random factor introduced by the
middleware by its way of organising information flow or by site schedulers with
esoteric scheduling policies.
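
To make the role of the latency distribution explicit, consider a simple illustrative
model (our own sketch, not taken from the cited works): let T be the latency of a
single, independent job attempt, let t be the timeout and let p(t) = P(T <= t) be the
probability that an attempt finishes before the timeout. Cancelled attempts each cost
t, and their number is geometrically distributed, so the expected overall completion
time under a timeout-and-resubmit strategy is

    E[T_total] = ((1 - p(t)) / p(t)) * t + E[T | T <= t]

Both terms depend on the latency distribution, so high variance or heavy tails shift
the optimal timeout considerably; this is why knowledge about the distribution, and
not only about its mean, is needed.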

1.4.3 Impact of overhead variance on workflows

If jobs are inter-dependent, i.e. if certain jobs require results of other jobs in
order to complete execution, latency variance can heavily affect the performance
of the experiments, because in order to finish execution, jobs may have to wait
a long time for required intermediate results of other jobs.
This property holds especially for complex application workflows. If inter-
dependence is high, even a small fraction of outliers can lead to the need for
multiple resubmissions of the workflow. Here, again, this may provide an in-
centive to the user for submitting complete workflow management processes as
single jobs to the Grid, which may bypass quality of service mechanisms of the
middleware and negatively influence overall stability even more.

1.4.4 Implications of measured overhead for middlewares

There is not much that middlewares can do to avoid overhead caused by hard-
ware or network components, or by general workload in the Grid. Scheduling
overhead mostly depends on such workload and on the configuration of the
workqueue (prioritisation etc.).
However, the reason for information service overhead is often to be found in
the design of a middleware. By measuring that information service overhead
with respect to

• Overall latency

• Latency variance

in different middlewares, and by examining the design of the middleware
components, conclusions concerning potential improvements can be drawn.

1.5 Methodology
In order to measure the latency values, a testbench was implemented in the Java
programming language. To ensure uniform access to different middlewares from
the testbench, the middleware wrapper API JavaGAT [9] was used. JavaGAT
had to be extended with an adaptor for the gLite middleware. Part of that
adaptor had already been developed by Dr. Alexander Beck-Ratzka and Andreas
Havenstein (Max-Planck-Institut für Gravitationsphysik). However, the original
adaptor was based on an outdated version of JavaGAT, lacked needed functionality
and showed problematic behaviour in terms of memory usage. The latter was
not the fault of the original authors but originated from third party libraries
with memory leaks. We have significantly improved the adaptor and added
functionality, while fixing the memory leaks and porting it to the latest JavaGAT
interfaces. We have used the adaptor as a part of the testbench to measure gLite
execution times.
The measurements themselves were conducted on a uniform timescale on dif-
ferent Grid sites with different middlewares. The results were statistically anal-
ysed and interpreted.
The implementation of the testbench and the adaptor is described in detail
in chapter 3.

1.6 Structure of the thesis


Section 1.7 presents previous work in the field of latency modelling, together with
the most significant results obtained in it. To establish a context for the middleware
performance comparison, Chapter 2 gives an overview on the components and
internals of the analysed middlewares.
Chapter 3 explains the mode of operation of the middleware wrapper Java-
GAT, our work on the gLite adaptor and our middleware testbench, which uses
JavaGAT as its base API.
Having outlined all important parts of the testbench itself, Chapter 4 focuses
on the collected results. We interpret overall scheduling latencies and middle-
ware overhead for all middlewares, try to isolate obvious patterns in them,
detect influences of factors such as submission date, used CE and configura-
tion options, and compare the obtained values. Following that, we interpret the
evaluation according to the peculiarities of the different middlewares.
Chapter 5 briefly summarises our results.
Appendix A contains a read-me written for the gLite adaptor explaining some
of the settings and pitfalls in practical use.

1.7 Related work

Previous work on latency modelling can be categorised according to the view-


point. An abundance of work exists on the modelling of server-side queueing,
while fewer authors have dealt with the latency as it is perceived by the user.
In earlier days, this was not necessary: due to the simple protocols that
were involved in client-server communication, server-side queueing times
and queueing times from the perspective of the client were identical up to a
small, more or less constant factor. However, previous work covering differences
in server-side queueing times and user-perceived latency exists in other re-
search areas, e.g. w.r.t. network transmissions in the web [10].
Pure queueing times have been analysed by Mu’alem and Feitelson [11] on
a typical cluster configuration with a backfilling FIFO scheduler. Backfilling
denotes an advance reservation system based on future prediction for the jobs
in the queue. The conclusion of the authors is that backfilling achieves lower
execution times than simple FIFO strategies. Interesting corollaries of their
results are that users are generally bad at estimating job execution times (which
some cluster schedulers require as input) and that users seem to adapt their
workload over time, such that it best fits the deployed scheduling policy.
In [12] Chun and Culler attach a utility function to jobs in order to evaluate
queueing times in a user-based approach. It is argued that in a commercial
supercomputing environment, transparent pricing and scheduling based on first-
price auctions is most effective concerning the overall job waiting times. It
is assessed that pre-emption does not speed up the waiting process if the
differently priced queues have adequate sizes.
The workload behaviour on an LCG cluster (LCG is the predecessor of gLite) is
measured and analysed in [13]. The LCG cluster is managed by several queues
with different policies for long, medium and short-running jobs respectively.
The concept behind these multiple queues is that higher waiting times are
acceptable for long-running jobs, when seen in relation to the execution time.
It is outlined that despite the fact that VOs submit to queues with different
lengths based on the estimated CPU time of their jobs, some VOs are treated
unfairly concerning waiting times. Furthermore, the authors argue that jobs
split into several subtasks will experience a comparatively longer waiting time
than one large job performing all the tasks alone.
In [14], Oikonomakos et al. also measure job waiting times on a Grid site
running the LCG middleware. The observed waiting times have standard devi-
ations that exceed the mean waiting times by large factors, which attests to the
heavy-tailed distribution of job waiting times in the EGEE environment.

However, the structural complexity of multi-layered, hierarchical middlewares
like gLite causes a divergence of user-perceived and cluster-generated overhead,
since many middleware components may introduce overhead themselves. Con-
sequently, some works focus on the overhead from the user perspective.
Lingrand et al. present experiments conducted within EGEE’s biomed VO
over a long period of time in [15]. A timeout value of about 500 seconds is shown
to significantly improve the expectation of the job execution time. Furthermore,
the authors show that there is no relevant impact on the job execution time by
the day of the week.
The same authors show in [16] that the optimal timeout value based on their
latency model varies significantly among different gLite sites. Furthermore, dif-
ferent queues on gLite sites are evaluated with respect to their latency behaviour.
It is argued that the best result with regard to execution time is obtained by
classifying the queues in two classes with different optimal timeouts.
Concerning temporal influence, the observation is made that overall there
are slightly higher latencies on weekends. However, the presented data on day-
of-the-week influences shows partial inconsistencies and would need further ex-
ploration, which is also acknowledged by the authors in the conclusion. Fur-
thermore, it is questionable whether ANOVA is the best choice for analysis of
the class partitioning, because the CDFs indicate that the samples are not nor-
mally distributed. ANOVA needs normal distribution of the underlying data as
a prerequisite.
Another probabilistic model for computing optimal timeout values for resub-
mission strategies is introduced in [17]. The authors apply the model to several
well-known distributions in order to derive optimal timeout values for systems
modelled by them. For the EGEE grid, the probability distribution corresponds
to a mixture of log-normal and Pareto distributions. It is argued that by de-
riving optimal timeout values from the model, the expectation of the job duration
approaches that of outlier-free production systems. As for collecting
necessary input data for the model, the authors suggest that it could be taken
from the workload management system logs.
All three works focus on total execution times as perceived by the end-user
as the subject of their analysis. Our approach, on the other hand, allows us to
analyse the different sources of overhead, namely scheduling and information
service delays, separately.

Chapter 2

The Middlewares

2.1 SSH
In order to utilise Grid resources for job execution, the following minimal com-
ponents are required.

• Authorisation and access control

• Computing resource access

• File transfer

These tasks can already be achieved with the standard SSH protocol [18].
Authorisation is part of the protocol itself and access control can be ensured
using Unix ACLs and schedulers on the execution site. The computing resource
access is a central part of the protocol’s functionality and file transfer can be
implemented with SCP, which is based on SSH.
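
As an illustration, a complete "submission" with this minimal solution boils down
to three standard commands; host name, account and file names are placeholders:

    # stage the executable and input data to the execution host
    scp job.sh input.dat user@remote.example.org:run/
    # execute the job remotely via the login shell
    ssh user@remote.example.org 'cd run && ./job.sh input.dat > output.dat'
    # stage the result back to the user's machine
    scp user@remote.example.org:run/output.dat .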
Such a solution is very simple, because there is no matchmaking and no meta-
scheduling. The authorisation solution mandates that each user owns an account
on the machine on which the Grid application is executed, which is not
very scalable. There is no inherent support for parallelism with this solution
and scheduling on the executing machine has to be performed by the operating
system scheduler, which is often hardly apt to do this job in an uncontrolled
multi-user environment with long-running, non-interactive applications. Fur-
thermore, due to the lack of matchmaking, the user is obliged to have knowledge
about all computing elements, their respective addresses and hardware and soft-
ware properties, and to do the matchmaking herself.
Hence, SSH is not a Grid middleware, because even though it can be used
for remote job execution, it lacks vital functionality of middlewares. But the
time overhead produced by this kind of solution is very close to zero, basically
only comprising network latencies. Therefore, it is a good baseline reference when
overheads of complex middlewares are evaluated.

2.2 Globus
The Globus Toolkit was one of the first Grid middlewares and is perhaps still
the most popular one. Many technologies that have emerged from the Globus
project are now de facto standards in Grid computing, such as the GridFTP
protocol and the Grid Security Infrastructure (GSI).
As a basis for our experiments, we have used the older, non-webservices-based
version 2 of Globus Toolkit. All essential components of Globus version 2 are
still contained in the current major version 4 [19]; in particular, GT4 contains mod-
ules for webservice-based job management as well as pre-webservices job execution
management services (WS GRAM and Pre-WS GRAM).
Generally, the Globus Toolkit is a collection of APIs provided by client li-
braries for different programming languages, server libraries which provide the
Globus services and command-line tools for end users. The middleware consists
of several components, which implement different infrastructural tasks.

2.2.1 Security concepts


In order to guarantee access control and message protection, the Grid Secu-
rity Infrastructure (GSI) was created as a part of the Globus middleware. GSI
enforces that only members of an authorised VO may access Grid resources.
Additionally, it provides cryptographic means for message integrity and authen-
ticity.
The access control mechanisms require the identity of the user to be estab-
lished. For this purpose, X509-certificates that are issued by certificate author-
ities are used, as described in section 1.2.1. Such an X509 certificate consists of
the following components:

• Version (in Globus, this is version 3)

• Serial number

• Signature algorithm ID (e.g. sha1WithRSAEncryption)

• Issuer (the Certificate Authority)

• Validity

• Subject (the Distinguished Name of the user in X500-format)

• Subject Public Key (used for message encryption)

• Optional extensions

• Signature algorithm ID

• Signature value, encoded in ASN.1 DER

The signature at the end is from the CA and attests to the validity of the
information in the certificate and especially to the ownership of the public key by
the certificate's identity. For GSI, this certificate is required to be saved in the
.pem format, which means that the above information is stored in Base64
encoding.
For access control, the user would be required to prove her identity every time
she uses Grid resources. Since her identity is established by challenging her
knowledge of the private key that belongs to the public key bound to her identity
in the certificate, this would mean having her enter the password for decryption
of the private key on each resource access. Since this is hardly feasible in a
productive environment, GSI supports the delegation of a temporally limited
proxy which can impersonate the user [20].
The delegation process consists of creating a new public/private key pair, the
public key of which will be used for a new X509 certificate that is signed by
the holder of the Grid certificate herself. The generated private key is stored
alongside the certificate in a proxy file (see figure 2.1). For security reasons,
file permissions on the proxy are set restrictively and its validity is temporally
limited.
The proxy’s subject corresponds to the subject name in the Grid certificate
and other properties are derived from the original certificates as well. A new
critical X509 extension ensures that the proxy creator can define different poli-
cies concerning the proxy rights (i.e. whether it may delegate new proxies by
itself).
The generated private key is intended for one-time use, which makes proxy
revocation an easy task: It suffices to delete the proxy file.
Based on these proxy files, authentication in Globus is performed with the
GSS-API and SSL version 3, using the OpenSSL implementation [21].
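
In practice, such a proxy is created with the Globus command-line tools before any
resource is contacted; a short sketch (the option syntax may vary slightly between
Globus releases):

    # create a proxy valid for 12 hours from the user's certificate and private key
    grid-proxy-init -valid 12:00
    # display subject, type and remaining lifetime of the proxy file
    grid-proxy-info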

2.2.2 Resource Management


In Globus, resource management consists of the following components [22]:

• Resource Specification Language (RSL)

• Resource brokers (RBs)

• Site co-allocators (e.g. DUROC)

• Grid Resource Allocation Manager (GRAM)

Figure 2.1: The GSI proxy delegation process

• Information services (MDS)

• Local job queueing systems (SGE, PBS, LSF, fork, etc.)

With the Resource Specification Language (RSL), the Grid user specifies the
executable, files for input and output staging, estimated job duration (wall clock
time), environment variables, arguments etc. This is done in a standard textual
way, using name/value pairs.
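
A small RSL description in the GT2 syntax might look as follows; the attribute names
are those commonly used with Pre-WS GRAM, and the values are placeholders:

    & (executable = /bin/hostname)
      (arguments = -f)
      (count = 1)
      (stdout = hostname.out)
      (stderr = hostname.err)
      (maxWallTime = 5)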
The RSL is then sent to a Resource Broker, whose task is to specialise the
RSL, i.e. to derive possible execution sites directly from the RSL’s attributes.
Such a specialisation can be done by querying the Globus Information service
(see section 2.2.3) or subsequent resource brokers for suitable sites according to
certain properties derived from the RSL (e.g. available RAM size, number of
CPUs, installed software etc.). The resulting processed RSL is called ground
RSL.
If the job requires parallel execution on different sites, using MPI or
some other environment, the RSL will be passed to a site co-allocator. We
don’t have such a use-case in our testbench scenario. Hence, co-allocation will
not be covered here. An in-depth description can be found in [23].
In our case, the specification refined by the resource broker is passed to
a GRAM. GRAM exists in a newer, webservices based version (WS GRAM,
GRAM4) and in an older version still supported by current releases of the Globus
Toolkit (Pre-WS GRAM, GRAM2).

Figure 2.2: GRAM job states (Start, Pending, Active, Done, Failed)

Figure 2.3: Example GRAM job registration process

Since the AustrianGrid sites still exclusively
provide GRAM2 service endpoints, our experiments rely on the older version of
GRAM, which will be described here.
When the resource description reaches GRAM level, it is already on an ex-
ecution site, i.e. on a cluster, blade, parallel machine or some other compute
element. Usually on these compute elements, some local load balancer/sched-
uler will run, such as Sun Grid Engine (SGE), Portable Batch System (PBS)
or LoadLeveler (LL). GRAM provides a uniform interface to all those sched-
ulers while using them for queueing the Grid jobs. At the same time, GRAM
supervises these executions, notifies the user of status changes using a callback
URL specified at job submission and generates a job handle that can be used
for cancelling the job or actively polling its status (figure 2.3).
The job’s legal status transitions can be modelled by an acyclic directed graph
(see figure 2.2).

2.2.3 Information services


Information services in Globus are implemented using central LDAP directo-
ries [24] and a standardised information model, called GLUE [25]. Along with
the protocols for automated site information retrieval (GRIP) and notification
about information availability (GRRP), those components form the Globus in-

formation system MDS [26]. A frequent configuration is known as Berkeley
Database Information Index (BDII [27]), which consists of two or more LDAP
databases in combination with a well-defined update-process. Aggregate LDAP
directories may be hierarchically grouped and can notify each other about in-
formation updates within their controlled realms using GRRP. The resource
broker or the user will likely search a top-level LDAP service which includes
information of all smaller domain-specific LDAP services.
This method of information aggregation ensures scalability. On the other
hand, the use of aggregation with notification protocols can mean that infor-
mation in the aggregate directory services can be outdated due to propagation
delay. However, often updates are relatively infrequent and the notification in-
tervals can be fine-tuned to form a trade-off between system load due to useless
messages and update notification delay.
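
Since MDS/BDII is plain LDAP, the published GLUE information can be inspected
with any LDAP client. A sketch of such a query follows; the host name is a
placeholder, while port 2170 and the base DN mds-vo-name=local,o=grid are the
usual BDII conventions:

    ldapsearch -x -H ldap://bdii.example.org:2170 \
        -b "mds-vo-name=local,o=grid" \
        "(objectClass=GlueCE)" GlueCEUniqueID GlueCEStateWaitingJobs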

2.2.4 Data transfer


Globus 2 uses GridFTP (section 1.2.2) as the only protocol for input and output
staging and inter-grid-site data transfer.

2.3 gLite
gLite is used as the middleware of the EGEE project (http://www.eu-egee.org/), which is an EU-funded
multi-disciplinary Grid environment and one of the largest Grid infrastructures
in the world. gLite is a de facto successor of the LHC computing grid (LCG) and
uses components developed for the LCG. Since the LCG is a package of Globus,
Condor and CERN-made components, many parts of the gLite middleware are
based on protocols and technology developed as part of Globus.
gLite is distributed as a set of command-line tools that run exclusively on
Scientific Linux, various APIs for programming languages for the proprietary
protocols of gLite, and WSDL files for the gLite webservices, from which bindings
for different programming languages may be generated. Many components of gLite
are webservice-based, so most bindings can be generated from the respective
WSDL files.

2.3.1 Security concepts


gLite uses GSI and proxy-delegation based on the Globus concepts (sec-
tion 2.2.1). However, in a Grid environment with the dimensions of EGEE, the
existing Globus practice of manually adding authorised users to map-files on every
potential execution site for a certain VO has been found not to scale.
Hence, authorisation itself is performed by a Virtual Organisation Member-
ship Service (VOMS) [28]. The VOMS server has knowledge about the users of a
certain VO and maintains that information in a relational database. If the user
wants to construct a VOMS proxy, her client first handshakes with the VOMS server
using a standard GSI-authenticated message and then sends a signed request.
The user's request contains roles, groups and capabilities that will later be used
by Grid sites for fine-grained access control. If the request can be parsed cor-
rectly by the VOMS server and the user has sufficient rights for the demanded
group membership, role and capabilities, the server sends a structured VOMS
authorisation response. The VOMS response contains the following data:

• VOMS user authorisation info (groups, roles, capabilities)

• User credentials

• VOMS server credentials

• Temporal validity of the proxy

• VOMS server signature

The client can now save the VOMS response as an X509v3 Attribute Cer-
tificate in the grid proxy certificate and serialise the resulting certificate to a
VOMS proxy file. For access control, the user transmits the full VOMS proxy
to the resource broker, which only needs to check the validity of the VOMS
information instead of consulting a user map-file. However, unwanted users can
still be banned from resources by explicit blacklisting.
VOMS proxies cannot be revoked, but have a limited lifetime which is in-
cluded in the attribute certificate to avoid reuse of existing VOMS tickets by
malicious users.
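
From the user's point of view, this whole handshake is hidden behind a single
command of the gLite user interface; a sketch with the VOCE VO (validity and VO
name are placeholders):

    # create a VOMS proxy for the VOCE VO, valid for 12 hours
    voms-proxy-init --voms voce --valid 12:00
    # show the embedded attribute certificate (groups, roles, remaining lifetime)
    voms-proxy-info --all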

2.3.2 Resource Management


The gLite resource management consists of the following components [29] [30]:

• Job Description Language (JDL)

• Workload Management System (WMS)

• Logging and Bookkeeping Service (LB)

• Information supermarket (ISM)

• Computing Element (CE)

• Local Resource Management Systems (LRMS, e.g. PBS, Maui, SGE,...)

Job Description Language

The structure of the Job Description Language is quite similar to the RSL
format used in Globus (section 2.2.2). It also consists of name/value pairs, but
offers functionality beyond the scope of RSL. The language is based on Condor’s
ClassAd language [31] and besides standard fields for executables, input and
output sandbox files, stdout and stderr, arguments and environment variables,
explicit GLUE attributes can be supplied as requirements for matchmaking.
Since all GLUE attributes that the WMS can derive from the information
supermarket may be specified in the requirements section, the user can be
quite explicit about hardware and software requirements; for example, the user
may desire more than 1 GB of RAM, RedHat Linux as the CE's operating system
or an i686 processor on the executing machine. It is even possible to specify
explicit workqueues for scheduling, or to exclude certain workqueues.
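
A minimal JDL description illustrating these possibilities could look as follows;
the attribute names follow the gLite JDL conventions, while the concrete requirement
values are placeholders:

    Executable    = "/bin/hostname";
    Arguments     = "-f";
    StdOutput     = "hostname.out";
    StdError      = "hostname.err";
    OutputSandbox = {"hostname.out", "hostname.err"};
    RetryCount    = 3;
    Requirements  = other.GlueHostMainMemoryRAMSize > 1024 &&
                    other.GlueHostArchitecturePlatformType == "i686";
    Rank          = -other.GlueCEStateEstimatedResponseTime;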

The Workload Management System

Like the Resource Broker in Globus, in gLite the WMS takes the user's JDL
and tries to find an appropriate resource fitting the requirements expressed in
the JDL for job scheduling. Unlike in Globus, there is no active layer between
the WMS and the local resource manager, i.e. there is no specific gLite analogy
to Globus’ GRAM. However, the CE must provide a webservice interface for
submitting and cancelling jobs, retrieving the job status and sending signals to
jobs. This is called a computing element acceptance service (CEA). CEs can
operate in push and pull models, i.e. they either receive jobs and execute them
when they are idle, or actively ask for jobs when they are idle. In order to make informed matchmaking
decisions, the WMS queries the information supermarket. If no resources are
available, it keeps the submission request in a task queue.
Clients can communicate with the WMS using a webservice-interface and
SOAP messages. The whole job registration process is depicted in figure 2.4.

The compute element

The CE is the subject of the WMS meta-scheduling process. As described above,


it offers a webservice interface for job registration and cancellation called CEA.
The CEA interacts with the job controller (JC) that controls the locally de-
ployed cluster-manager/load-balancer (LRMS). These local resource managers

20
9. query job state

WS−Int.
4. register Job LB logging
server

WS−Int. Interface
GridFTP
6. stage in sandbox

7. start job
Monitor

WS−Int.
8. start: jobId
5. return jobID WMS
CEA
JC
1. VOMS−proxy−auth
2. send JDL string LRMS
3. locate LB server (e.g. Maui)
generated JobId schedule

WN WN WN

Figure 2.4: Job registration in gLite

queue arriving jobs and schedule them based on autonomous decisions on vari-
ous worker nodes (WN), which are responsible for performing the actual com-
putation. Typically deployed LRMSes are PBS, SGE, LCG-PBS, LSF, LL or
Maui, which is a queueing plug-in for various LRMS products. The job monitor
checks the status of the job (running, completed) and reports it to the logging
and bookkeeping system.

Logging & Bookkeeping

In gLite, the job’s status is tracked using a specialised component, the logging
and bookkeeping (LB) server. The status is updated by events that are gener-
ated either by CEs themselves or by an aggregating WMS, using some logging
API. The logging API passes the information to a physically close locallogger
daemon, which stores it on a local disk file and reports success to the logging
API. The interlogger daemon forwards the information from the locallogger
daemon to the responsible bookkeeping server. The URL of the bookkeeping
server is included in the server part of the job-id, which has already been cho-
sen by the WMS. Hence, LB server and job remain interdependent and an LB
server collects information about the job’s status during its complete lifetime.
The incoming events are mapped to higher level gLite states (figure 2.6) by the
LB-server, where the user may query for them, or actively receive them if she
registered for update notifications (figure 2.5).

Figure 2.5: The information flow in L & B reporting

Figure 2.6: gLite job states (Submitted, Waiting, Ready, Scheduled, Running,
Done, Cleared, Aborted; bold arrows on the successful execution path)
2.3.3 Information services
The information supermarket (ISM) is a central repository for information about
Grid resources. It can be queried by the WMS during the process of matchmak-
ing for a JDL submission request. The architecture of the ISM itself is usually
implemented in quite a similar way as in Globus (section 2.2.3), based on the
BDII. As in Globus, GLUE is used as the information model for the data.
More advanced query and update technologies for the information supermar-
ket, such as R-GMA [32] or XML/SOAP exist, but are rarely used.

2.3.4 Data transfer


gLite supports multiple ways of transferring data. As in Globus, input data
can be sent to the WMS, which sends it along to the CE where the job is
executed. Output data will be sent from the CE to the WMS, which passes it
back to the user. The whole data transfer is conducted using GridFTP; input
and output files for data staging are specified in the job's JDL string. The WMS
will store input and output in local directories, called input/output sandbox.
However, the input and output sandbox for a job is usually limited to a size
between 10 and 100 MByte. For larger files, a Grid site dedicated to file storage,
a so-called storage element (SE) should be used. gLite employs a number of
different technologies to allow the user consistent SE access. Upload of a large
file to a SE requires the following steps:

1 Space on the SE for the file needs to be allocated. For this purpose the
Storage Resource Manager interface (SRM [33]) is used. SRM is a uni-
form webservice-based interface to different kinds of SEs and their dif-
ferent storage systems. SRM is not implemented by gLite, but by the
storage systems, which are supposed to provide a SRM interface. If a
space reservation request invoked by a user is granted by SRM, it will
return a storage URL (SURL) which denotes the full path where the file
is going to be stored and a transport URL (TURL), which is an endpoint
for direct GridFTP transfers of the file to the SE.

2 Register the file in the LCG File Catalog (LFC) [34]. For that purpose,
the LFC replica server for the current VO has to be contacted using a
proprietary protocol. Once the user sends in the address of one replica
(the SURL received in step 1), the LFC server will respond with a unique
identifier and a logical file name (LFN). This is a one-to-many mapping,
i.e. more replicas with different storage URLs can be added to the same
unique identifier. The file’s GUID is immutable, while the LFN can be
changed by the user.

3 Use the transport URL and GridFTP to copy the file to the SE.

4 Get the file during job execution. Currently, this is only possible using
the respective command line interfaces of the lcg-tools. The WMS can be
instructed to schedule jobs to CEs ”close” to the SE where the files are
stored by adding the LFN or the GUID to the DataRequirement field in
the JDL.

Hence, the management of large files in gLite is still quite a complex matter,
and disappointingly there exist only C reference implementations for the pro-
prietary LFC protocol. Java APIs such as GFAL [35] are just C wrappers and
require platform-dependent libraries to be installed.
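
For completeness, a sketch of these steps as performed with the lcg-utils
command-line tools; SE host, LFC catalogue entries and file names are placeholders
and the exact options may differ between releases:

    # allocate space on the SE, copy the file and register it in the LFC in one step
    lcg-cr --vo voce -d se.example.org \
        -l lfn:/grid/voce/user/large-input.dat file:/home/user/large-input.dat
    # later, e.g. from within a job: fetch a replica of the file to the worker node
    lcg-cp --vo voce lfn:/grid/voce/user/large-input.dat file:$PWD/large-input.dat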

Chapter 3
Implementation of adaptor and
testbench

3.1 JavaGAT
JavaGAT [9] is a high-level Grid-access interface for abstracting the technical
details of middleware-specific code, which is maintained independently from the
interface.

3.1.1 The motivation behind JavaGAT


Learning all the details about specific Grid middlewares in order to use their
APIs in a productive way can be a tedious task, especially for application pro-
grammers who don’t know and don’t care much about the Grid paradigms and
the relations among the various components. Additionally, Grid APIs are con-
stantly evolving and thus render Grid-based applications short-lived.
To remove this burden from application programmers who prefer to focus on their
own field of expertise, the middleware-wrapper JavaGAT was created. The aims
of the JavaGAT project are:

• Hiding the diversity and frequent API changes of existing middlewares
behind a uniform programming API

• Abstracting the low-level nature of current middleware APIs away and
helping the developer concentrate on the core application rather than Grid
internals

• Allowing applications to stay portable and making the exchange of em-
ployed middlewares easy.

3.1.2 JavaGAT’s overall architecture


JavaGAT uses an adaptor concept to achieve the goals detailed in section 3.1.1.
Adaptors can be written for different Grid middlewares and provide different
capabilities (such as job monitoring, resource broker contact, I/O operations,
etc.).

Figure 3.1: JavaGAT's structure

JavaGAT defines the interfaces for the middleware-specific adaptors and
matches the generic API calls of the end user to middleware-specific code. When
the user invokes a high-level operation such as copying a file from a source to
a (Grid) destination, JavaGAT, if not restricted by user-supplied preferences,
will try to invoke all available middleware implementations until the first suc-
ceeds in performing the requested operation. This process is called intelligent
dispatching.
The contract that middleware adaptors must fulfil is described in JavaGAT’s
capabilities provider interface (CPI). The term "interface" is misleading, since
the CPIs are actually abstract classes that themselves implement all functionality
that can be solved generically. The structure is shown in figure 3.1.
Middleware adaptors that extend the CPI-classes can leave methods described
in the contract unimplemented, in which case an UnsupportedOperationExcep-
tion will be thrown upon invoking the respective method in the respective adap-
tor. JavaGAT hides failures in method invocation on a certain adaptor from the
user, unless all applicable adaptors have failed.
JavaGAT aims to seamlessly integrate with Java’s programming paradigms.

For example, classes for file staging extend Java’s java.io classes. To control the
behaviour of JavaGAT itself and the instantiated adaptors, a set of meta-classes
is available, such as GAT , GATContext and Preferences. GAT is used as a fac-
tory façade for the construction of actual objects behind the various interfaces.
GATContext carries security parameters and additional GAT- or adaptor-specific
preferences specified in the Preferences class. For instance, if the user wants
to restrict the set of file adaptors that are going to be invoked to SFTP, she
may add the preference ("File.adaptor.name", "sftp") to the GATContext and
subsequently limit connections to passive mode by specifying the preference
("ftp.connection.passive", "true").
Thus, knowledgeable users can control the behaviour of JavaGAT, while the
interface remains simple for inexperienced users.

3.1.3 Writing adaptors for JavaGAT


Adaptors for JavaGAT have to fulfil the following requirements:

• Extend the respective CPI classes (e.g. an adaptor for file transfer should
extend FileCpi, whereas an adaptor for resource brokerage should extend
the ResourceBrokerCpi class).

• Implement the minimal set of methods that are necessary for the adaptor
to provide useful functionality.

• Deploy the adaptor as a JAR file in the directory specified by the
gat.adaptor.path environment variable.

• Include all the actual CPI implementations provided by the adaptor in
the JAR's manifest file, with the respective CPI as attribute name and the
implementing classes as the attribute value.

The invocation of the adaptor implementation by the GATEngine is con-
ducted with the help of Java classloading and reflection mechanisms.
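
As a sketch, the skeleton of a file adaptor fulfilling the first two points could look
like the following; the constructor signature and the exact CPI package are
assumptions based on the JavaGAT 2.0 CPI classes:

    import org.gridlab.gat.GATContext;
    import org.gridlab.gat.GATInvocationException;
    import org.gridlab.gat.URI;
    import org.gridlab.gat.io.cpi.FileCpi;

    // a minimal file adaptor that only knows how to copy files
    public class ExampleFileAdaptor extends FileCpi {

        public ExampleFileAdaptor(GATContext gatContext, URI location) {
            super(gatContext, location);
        }

        // only the methods the adaptor can usefully provide are overridden;
        // all other CPI methods keep their default behaviour and throw an
        // UnsupportedOperationException when invoked
        public void copy(URI destination) throws GATInvocationException {
            // middleware-specific transfer code goes here
        }
    }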

3.2 Implementing the gLite adaptor


Dr. Alexander Beck-Ratzka (Max-Planck-Institut für Gravitationsphysik, Albert-
Einstein-Institut, http://www.aei.mpg.de) has kindly supplied us with the gLite
adaptor that
had been developed at his institute. The adaptor included basic functionality
such as job submission, input/output data staging, status polling and getting
job information. However, it lacked some functionality considered important by
us.
First and foremost, the original adaptor didn’t include methods to create
VOMS proxies. If the adaptor cannot create VOMS proxies itself, a manual
step is required before invoking the adaptor: one needs to create the required
proxies with the assistance of the gLite command-line tools. Since those tools de
facto exclusively support Scientific Linux, which is not a common Linux flavor
on workplace computers, this mostly implies creating the proxy on a system
running Scientific Linux (typically a Grid UI machine) and copying it back to
the system executing the adaptor.
We found this hardly acceptable with regard to our use cases, so the pri-
mary goal was to implement VOMS proxy support. Further improvements to
the original adaptor were made, as additional requirements were identified in
productive use. Those improvements include:

• Copying of output files to user-specified directories

• Use of LDAP directory URLs as resource broker names and identification
of available resource brokers from LDAP by the adaptor

• Custom poll intervals for the job status polling thread

• Support of additional JDL attributes (DataRequirements.InputData,
RetryCount, GLUE matchmaking attributes)

• Support for passive status updates via metric listeners

• Resolving problems with multithreaded use

• Working around memory leaks in external libraries

• Making job information retrieval JavaGAT-compliant

• Porting the original adaptor to JavaGAT 2.0

• Refactoring the original adaptor's code

Attempts have been made to implement logical file handling (section 2.3.4),
but this idea has been dropped due to unresolvable problems with the propri-
etary LFC protocol. Furthermore, with respect to JavaGAT’s structure, such
functionality would have had to be implemented as a subclass of LogicalFileCpi
and not as a part of the gLite-Adaptor. Because CPIs are abstract classes, not
interfaces, and Java does not support multiple inheritance, an adaptor can only
be derived from one CPI.

Figure 3.2: Class diagram of the VOMS proxy management classes

The following sections are going to outline some aspects of the adaptor in
detail.

3.2.1 VOMS-Proxy creation

We created the package org.gridlab.gat.security.glite, in which we im-
plemented our gLite security classes (figure 3.2).
The process of VOMS proxy creation is roughly explained in section 2.3.1.
Our classes model these mechanisms. GlobusProxyManager implements the stan-
dard Globus proxy factory methods, mostly by using the respective commod-
ity methods from the JavaCoG-Kit [36]. The VomsProxyManager extends the
GlobusProxyManager and can create a VOMS-proxy by executing the following
steps:

1 A standard Globus proxy is created.

2 This proxy is used for obtaining a GSI-secured socket to the VOMS-server.

3 The VomsServerCommunicator sends the request string containing the group
membership, role and capabilities request, encapsulated in an XML message.

4 If the server was able to parse the request, the VomsServerCommunicator
receives an XML response which contains the granted attribute certificate
(AC), or an error message.

5 The AC is Base64-encoded and must be decoded in order to be readable
for use with the certificate tools.

6 Finally, the AC is DER-encoded and stored as an extension in a new
Globus proxy, which is now a full VOMS-proxy.

3.3 Job submission


The architecture of the gLite resource brokerage adaptor is visualised in fig-
ure 3.3.
JavaGAT's adaptor invocation mechanism will call the submitJob() method
on the GliteResourceBrokerAdaptor when the user submits a job. If the bro-
kerURI passed to the GliteResourceBrokerAdaptor upon construction is an LDAP
address, the constructor uses the LDAPResourceFinder to locate WMSes corre-
sponding to the virtual organisation specified in the GATContext. It will pick
one such WMS randomly and set the brokerURI accordingly. In submitJob() it
constructs a new GliteJob and attaches the given metric listener to it.
The GliteJob class constructs a VOMS-Proxy using the classes from 3.2.1,
authenticates with the WMS via webservice calls, constructs a JDL string with
the methods in the JDL class, copies the sandbox files with GridFTP and uses
the respective webservice methods to get a job-ID and start the job. Once it
receives the job-ID, it tracks the state at the LB server. Thus it closely models
the steps depicted in figure 2.4.
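
To illustrate how the adaptor is used from the caller’s perspective, the following
minimal sketch submits a job through the standard JavaGAT API. The host names, file
locations, DN and passphrase are placeholders, and the exact constructor signatures may
differ slightly between JavaGAT versions; the VOMS-related preference keys correspond
to those documented in appendix A.

import org.gridlab.gat.GAT;
import org.gridlab.gat.GATContext;
import org.gridlab.gat.Preferences;
import org.gridlab.gat.URI;
import org.gridlab.gat.resources.Job;
import org.gridlab.gat.resources.JobDescription;
import org.gridlab.gat.resources.ResourceBroker;
import org.gridlab.gat.resources.SoftwareDescription;
import org.gridlab.gat.security.CertificateSecurityContext;

public class GliteSubmissionSketch {
    public static void main(String[] args) throws Exception {
        GATContext context = new GATContext();

        // user certificate and key from which the VOMS proxy is created
        context.addSecurityContext(new CertificateSecurityContext(
                new URI("userkey.pem"), new URI("usercert.pem"), "passphrase"));

        // VOMS-related preferences as described in the adaptor readme (appendix A)
        Preferences prefs = new Preferences();
        prefs.put("VirtualOrganisation", "voce");
        prefs.put("vomsServerURL", "voms.example.org");
        prefs.put("vomsServerPort", "7001");
        prefs.put("vomsHostDN", "/DC=org/DC=example/CN=voms.example.org");

        // the job itself: a single executable without arguments
        SoftwareDescription sd = new SoftwareDescription();
        sd.setExecutable("/bin/hostname");
        JobDescription jd = new JobDescription(sd);

        // the broker URI may point directly to a WMS or to an LDAP directory
        ResourceBroker broker = GAT.createResourceBroker(context, prefs,
                new URI("https://wms.example.org:7443/glite_wms_wmproxy_server"));
        Job job = broker.submitJob(jd);
        System.out.println("Job submitted, current state: " + job.getState());

        GAT.end();
    }
}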

3.3.1 External libraries


Due to the webservice nature and the security constraints of gLite, the adaptor
is dependent on 30 external libraries. Among those are many webservice
support libraries, such as Axis or JAXRPC, the webservice stubs for the var-
ious webservices themselves (WMS, LB, security delegation, JDL processing),

Figure 3.3: Class diagram of the gLite adaptor

Globus libraries, security libraries (to enable SSL encryption in GSI-protected
sessions), and libraries upon which the aforementioned ones depend.
When loading the adaptors, JavaGAT identifies the adaptor JAR by its man-
ifest file. In a second step, it saves the paths to the classes from the libraries on
which the adaptor depends (which have to be in the same directory) into
a URLClassLoader. This classloader is set as the thread’s context classloader
upon invoking the adaptor. Normally, conflicts between differing versions of the
same libraries required by different adaptors should thus be avoided.
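
Conceptually, the loading mechanism corresponds to the standard Java idiom sketched
below. This is a simplified illustration with a hypothetical adaptor directory, not
JavaGAT’s actual loading code.

import java.io.File;
import java.net.URL;
import java.net.URLClassLoader;

public class AdaptorLoadingSketch {
    public static void main(String[] args) throws Exception {
        // collect the dependency JARs that sit in the adaptor's directory
        File adaptorDir = new File("lib/adaptors/glite");   // hypothetical path
        File[] jars = adaptorDir.listFiles((dir, name) -> name.endsWith(".jar"));
        if (jars == null) {
            throw new IllegalStateException("adaptor directory not found");
        }
        URL[] urls = new URL[jars.length];
        for (int i = 0; i < jars.length; i++) {
            urls[i] = jars[i].toURI().toURL();
        }

        // the adaptor's libraries are isolated in their own classloader ...
        ClassLoader adaptorLoader =
                new URLClassLoader(urls, AdaptorLoadingSketch.class.getClassLoader());

        // ... which becomes the thread context classloader while the adaptor runs
        ClassLoader previous = Thread.currentThread().getContextClassLoader();
        Thread.currentThread().setContextClassLoader(adaptorLoader);
        try {
            // invoke adaptor methods here
        } finally {
            Thread.currentThread().setContextClassLoader(previous);
        }
    }
}
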
Nevertheless, we found that different versions of Apache Axis can cause prob-
lems with that mechanism. For example, classes from the Axis library version
found in the Globus adaptor are still in the classpath when the gLite adaptor
is instantiated. The reasons for this behaviour are rather unclear; web dis-
cussions about Axis indicate that it may stem from Axis using thread context
classloaders itself. A workaround is to include the required version of Axis in a
classpath with higher priority than the thread context classloader’s (e.g. Java’s
system classpath) when executing the gLite adaptor.

3.3.2 License
JavaGAT is published under the conditions of the GridLab Open Source license.
This license is BSD-like and basically allows redistribution in source or binary
form, modification, use and installation without restrictions. Because the gLite
adaptor was submitted to the central JavaGAT repository and published as a
part of the JavaGAT release, it is non-exclusively subject to the license provi-
sions thereof. The full license text can be found in Appendix B.

3.4 The testbench


The testbench we have used for measuring our results has a structure roughly
similar to JavaGAT. A core module includes all the functionality common for
all middleware test runners, while middleware-specific modules provide the im-
plementation. An illustration can be found on the simplified class diagram in
figure 3.4. This decision was made out of pragmatic considerations: it works
well with JavaGAT’s architecture, it allows for easy creation of new adaptor
tests, since only little code is adaptor-dependent, and it makes deploying tests
on remote sites easier, because only the dependency libraries of the used modules
need to be copied and maintained.
The testbench logs interesting information to an XML file. Typically, one XML
file is associated with one concrete testcase class that extends Testcase. For
each started job, the GATTestRunner logs meta-information such as jobID, used
middleware, virtual organisation, resource broker URI and the executables with
their arguments and input files. Following that, all status updates received from
the middleware are logged along with the time at which they were received.
Furthermore, upon construction, the GATTestRunner starts a RunListener, which
creates a listening socket at the first unbound port in the Globus TCP port range
(ports 40000 - 40500). Those ports were chosen because they are usually not
blocked by firewalls in Grid environments. Before submitting the job specified
in the testcase instance, the GATTestRunner creates a wrapper shell-script. This
shell script calls the executables specified in the testcase with the respective
arguments and adds a wget call to the IP address of the computer on which the
testbench runs and the RunListener’s port, both before and after execution
of the main part.
The GATTestRunner replaces the specified executable by /bin/sh and adds the
wrapper-script as the only input argument. The actual execution parameters, as
they are specified in the testcase-instance, are logged as part of the job’s meta-
information. When the shell script is executed, it first sends a callback
via wget to the RunListener. Thus, the RunListener can record the exact time
of execution start and notify the GATTestRunner about it. The same applies to
execution end. The GATTestRunner eventually logs the dates received from the
RunListener along with the dates reported back by the middleware to the XML
file.
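
The generation of the wrapper script can be sketched as follows. This is a simplified
illustration rather than the actual GATTestRunner code; the class and method names,
the host name and the exact script contents are assumptions, and the RunListener can
distinguish the two callbacks simply by their order of arrival.

public class WrapperScriptSketch {

    // builds a shell script that surrounds the payload with two wget callbacks
    static String buildWrapperScript(String executable, String arguments,
                                     String callbackHost, int callbackPort) {
        String callback = "wget -q -O /dev/null http://" + callbackHost + ":"
                + callbackPort + "/\n";
        StringBuilder script = new StringBuilder();
        script.append("#!/bin/sh\n");
        script.append(callback);                                   // marks execution start
        script.append(executable).append(' ').append(arguments).append('\n');
        script.append(callback);                                   // marks execution end
        return script.toString();
    }

    public static void main(String[] args) {
        // the job is then submitted as "/bin/sh wrapper.sh"; the payload only
        // appears inside the wrapper (names below are placeholders)
        System.out.print(buildWrapperScript("python", "sieve.py",
                "testbench.example.org", 40000));
    }
}
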
Because jobs may, under certain circumstances, exhibit running
times that are virtually infinite (see section 1.3), a maximum execution time
has to be defined in the concrete testcase instance. After that time interval, the
TestInstanceGuardian will forcibly terminate the job. The GATTestRunner logs
such terminations to the XML file.

Figure 3.4: A simplified class diagram of the testbench

Chapter 4

Results

4.1 Experiment setup


We have grouped our results according to middleware, scheduling times and
callback overhead, and we have tried to quantify the influence of the weekday
and the time of day. Furthermore, we have aggregated data based on the queue
to which the job was submitted, the queue management system and the con-
figuration of this system. We have tried to statistically model the gathered
results.
In all middleware tests, the executable consisted of a small Python script
which computes prime numbers up to 100000 using the Sieve of Eratosthenes
algorithm. Furthermore, small files (smaller than 10 kilobytes) had to be pre-
and poststaged. This problem was chosen because it renders the tests less syn-
thetic than, for example, just executing /bin/hostname, while not imposing too
much load on the computational resources. We believe that under such a simulation
the behaviour of information services can be observed better than with
purely synthetic tasks. Python was chosen because a Python installation is a
standard component of virtually every Linux distribution, and it is therefore
universally available on the Grid.
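
The actual test executable was a small Python script; the snippet below merely
illustrates the equivalent computation in Java (the language of the other listings in
this thesis) to give an impression of the payload’s size and cost. The class name is
made up for illustration.

public class SievePayloadSketch {
    public static void main(String[] args) {
        final int limit = 100000;
        boolean[] composite = new boolean[limit + 1];

        // Sieve of Eratosthenes: mark every multiple of each prime as composite
        for (int i = 2; (long) i * i <= limit; i++) {
            if (!composite[i]) {
                for (int j = i * i; j <= limit; j += i) {
                    composite[j] = true;
                }
            }
        }

        int count = 0;
        for (int i = 2; i <= limit; i++) {
            if (!composite[i]) count++;
        }
        System.out.println("primes up to " + limit + ": " + count);
    }
}
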
If the execution did not finish within 45 minutes, the job was forcibly termi-
nated by the testbench.

4.1.1 Statistical methods


Important figures of the collected data are the median and the mean, because they
give an indication of the average absolute waiting times and overhead values.
The variance of the data, on the other hand, gives an intuition about the devia-
tion of the values from the mean.
A fast way of visualising both absolute overhead and its variance is the box-
plot. It shows the median of a sample, along with the upper and lower quartiles,
i.e. the box spans the range between the lowest 25% and the highest 25% of the
data. The length
of this box is called the interquartile range (IQR). The bars located 1.5*IQR above
the upper and below the lower quartile are called whiskers. Every data point
smaller than the lower whisker or larger than the upper whisker is marked as
a dot on the graph and considered an outlier.
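
As a small illustration, the box-plot quantities described above can be computed from
a sample as sketched below; the class name and the toy data are made up, and the
quartile estimate uses a simple nearest-rank convention, which may differ slightly
from the convention used by the plotting software.

import java.util.Arrays;

public class BoxplotStatsSketch {

    // nearest-rank style quantile estimate on a sorted sample
    static double quantile(double[] sorted, double q) {
        int idx = (int) Math.ceil(q * sorted.length) - 1;
        return sorted[Math.max(0, Math.min(idx, sorted.length - 1))];
    }

    public static void main(String[] args) {
        double[] sample = { 3, 5, 7, 8, 9, 10, 12, 14, 15, 40 }; // toy data
        double[] s = sample.clone();
        Arrays.sort(s);

        double q1 = quantile(s, 0.25);
        double median = quantile(s, 0.50);
        double q3 = quantile(s, 0.75);
        double iqr = q3 - q1;
        double lowerWhisker = q1 - 1.5 * iqr;
        double upperWhisker = q3 + 1.5 * iqr;

        System.out.printf("median=%.1f  IQR=%.1f  whiskers=[%.1f, %.1f]%n",
                median, iqr, lowerWhisker, upperWhisker);
        for (double x : s) {
            if (x < lowerWhisker || x > upperWhisker) {
                System.out.println("outlier: " + x);   // 40.0 in this toy sample
            }
        }
    }
}
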
Histograms show data values on the x-axis and the frequency at which they
occur on the y-axis. Sometimes, this allows for a fast understanding of the
approximate distribution of data values. When the large spread of data values
makes it appropriate, we use logarithmic scales on the x- or the y-axis. Whenever
an axis is logarithmic, this is mentioned on the respective graph.
To determine whether the partition of measurements into categories (executed
on a certain weekday, scheduled to a certain queue/type of queue, etc.) is
reasonable, we use different tests. The t-test can be applied to two independent
vectors of data points and tests the hypothesis that the two samples are from
distributions with equal means. The Kruskal-Wallis test is a one-way analysis
of variance by ranks on the collected data and returns a p-value. The p-value is the
probability of obtaining, under the null hypothesis (in the Kruskal-Wallis test
the null hypothesis is that the samples from the different groups are equally
distributed), a result as extreme as or more extreme than the observed one.
If the p-value is smaller than the significance level, the null hypothesis is
rejected. The significance level is the probability of falsely rejecting the null
hypothesis (i.e. a false-positive probability) and is often set to 5%. If the p-
value exceeds the significance level, the only conclusion that can be derived is
that the hypothesis cannot be rejected at that significance level. The Kruskal-Wallis
test has the advantage that, unlike ANOVA, it does not assume a normal distribution
of the underlying measurement values; the t-test is comparatively robust against
deviations from normality for large samples.
It must be kept in mind that such tests can merely give a hint and cannot
replace thorough interpretation of the data itself. In particular, there may
be statistically significant differences among certain groupings even when the
practical implications, i.e. the subjective latency difference in relation to the
total execution time or in relation to other delay factors, are comparatively
small. Additionally, for large samples, as are available in most of our cases, any
one-way analysis of variance is likely to produce smaller p-values than for smaller
samples. Hence, the tests will only be used in addition to interpretation based
on other statistical measures.
The Kolmogorov-Smirnov test (abbreviated K-S test) is a goodness-of-fit test,
which means that it can be applied to check whether a gathered sample matches
a certain distribution, or whether two samples are from the same distribution.
The null hypothesis in the K-S test is that the two tested samples are from the
same distribution, and as in the Kruskal-Wallis test, it can either be rejected
by the test or not be rejected at the given significance level.

An empirical way to evaluate the distribution of the data is to plot its
empirical cumulative distribution function (CDF) against the CDFs of well-
known distributions. This method is not as exact as a K-S test, but it can point
out tendencies and give hints concerning the distribution of the sample. The
CDF is defined as

F(x) = \int_{-\infty}^{x} f(u) \, du

where f is the probability density function, i.e. at a point x, the CDF rep-
resents the summed probability of the value x and all values smaller than x in
the sample.
The above mentioned techniques become interesting if they can be used in
a practical context, e.g. if timeout values for a timeout and resubmission
strategy can be derived from them. In [17], the authors propose the following
model for computing the optimal timeout value:

E_j(t_\infty) = \frac{1}{F_R(t_\infty)} \int_0^{t_\infty} u \, f_R(u) \, du
              + \frac{t_\infty}{(1-\rho) \, F_R(t_\infty)} - t_\infty          (4.1)

where t_\infty denotes the timeout value, F_R the CDF and f_R the probability
density function of the samples, and \rho is the probability of outliers.
We are going to use this model to estimate the expected execution time for time-
out values in our analysis. However, it must be critically remarked that even
though this model takes the outlier ratio into account, it only considers the past
distribution at each evaluation point and not the length of the remaining tail.
Hence it may be too optimistic on long-tailed distributions. For the sake of
simplicity, from now on, we are going to refer to the above equation simply as
the timeout model.
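
To make the use of the timeout model concrete, the following minimal sketch evaluates
equation 4.1 on an empirical latency sample by replacing f_R and F_R with their
empirical counterparts; the class name, the toy sample values and the assumed outlier
ratio are purely illustrative and not taken from our measurements.

import java.util.Arrays;

public class TimeoutModelSketch {

    // Expected time until success E_j(t) for a candidate timeout t, estimated
    // from an empirical sample of non-outlier latencies and an outlier ratio rho.
    static double expectedTime(double[] sample, double rho, double t) {
        double[] s = sample.clone();
        Arrays.sort(s);
        int below = 0;          // number of samples <= t
        double sumBelow = 0.0;  // sum of samples <= t
        for (double u : s) {
            if (u > t) break;
            below++;
            sumBelow += u;
        }
        if (below == 0) return Double.POSITIVE_INFINITY;  // F_R(t) = 0
        double fr = (double) below / s.length;            // empirical F_R(t)
        double integral = sumBelow / s.length;            // approximates the integral of u*f_R(u)
        return integral / fr + t / ((1.0 - rho) * fr) - t;
    }

    public static void main(String[] args) {
        double[] latencies = { 60, 80, 91, 100, 120, 150, 200, 235, 400, 900 }; // toy sample (sec)
        double rho = 0.05;                                 // assumed outlier ratio
        double bestT = 0, bestE = Double.POSITIVE_INFINITY;
        for (double t = 10; t <= 1000; t += 10) {
            double e = expectedTime(latencies, rho, t);
            if (e < bestE) { bestE = e; bestT = t; }
        }
        System.out.printf("optimal timeout ~ %.0f s, expected time ~ %.1f s%n", bestT, bestE);
    }
}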

4.1.2 SSH

A central remote computer (zid-gpl.uibk.ac.at) served as the execution en-
vironment for the SSH test. From this server, desktop computers in computer
rooms of the university are reachable via SSH. These computers constitute the
CEs in our test. Every 30 minutes, the testbench picks three user desktops
from three distinct computer rooms (RR15, RR18 and RR20) independently
and submits the test job to each of them. There is no a priori way of knowing
whether a computer will be switched on and running Linux, which is a necessary
condition, or how much user-created load will already exist on the ”CE” at the
time of job acceptance.

4.1.3 Globus
The Globus experiment consists of jobs that were submitted between 2008-10-10
and 2008-10-14 and between 2008-11-07 and 2008-11-15 at 30-minute intervals
to the following Globus sites in the AustrianGrid1 (we will use the abbreviations
in brackets from now on):

• http://blade.labs.fhv.at/jobmanager-sge (blade.labs)

• http://schafberg.coma.sbg.ac.at (schafberg)

• http://altix1.jku.austriangrid.at/jobmanager-pbs (altix1.jku)

• http://hydra.gup.uni-linz.ac.at/jobmanager-pbs (hydra.gup)

Due to problems with firewalls, we used the active job polling mechanism in
Globus as opposed to GRAM callbacks (as shown in figure 2.3).
Active job polling contacts the resource broker, which polls GRAM.

4.1.4 gLite
Every 30 minutes, 3 jobs were submitted to the same resource broker in the
virtual organisation VOCE (VO for Central Europe)2 . The used resource broker
was skurut67-6.cesnet.cz (now replaced by wms1.egee.cesnet.cz). Since
the resource broker acts as a meta-scheduler, the decision about the queue in
which the job will be executed is made by the WMS (see section 2.2.2).
The measurements took place between 2008-09-22 and 2008-10-05 and between
2008-10-09 and 2008-10-14.

4.2 Scheduling times


In the following sections we discuss scheduling times for the SSH, Globus and
gLite Grid environments. The terms scheduling time, scheduling latency and
scheduling delay all refer to the time interval that passes between the first
middleware notification that the job has been registered and the actual execution
of the payload on a computational resource in the Grid.

4.2.1 SSH
SSH has a median scheduling latency of 0 seconds, the mean of all samples is
39 milliseconds, and the variance is 7847 ms², which corresponds to a standard
deviation of roughly 0.09 seconds. This means that nearly all SSH jobs were scheduled
at the same time or before JavaGAT could report the submission status. The
few hundred milliseconds of latency that some job submissions exhibit are composed
of network and authentication latency, but overall it can be said that the delays
are hardly noticeable by the end user.

1 http://www.austriangrid.at
2 http://egee.cesnet.cz/en/voce/

4.2.2 Globus
After evaluating the measured scheduling times in general, we have tried to
factor in influences on the scheduling latency by the date at which the job was
submitted and the CE on which the job was executed. The results are presented
in the following sections.

Overall scheduling latency

Figure 4.1 shows the Globus scheduling latencies that were measured during the
experiments. It can clearly be seen that, with the exception of a few outliers,
by far most jobs are executed after a scheduling delay below 10 seconds.
The mean latency value of all measurements is 3.9 seconds. The median
value of the scheduling times is 0. This implies that in the used Austrian Grid
configuration with the Globus middleware one can expect the jobs to get scheduled
almost instantly. Note that in this case 0 means that the job was executed
before or at the same time as the first middleware notification, i.e. the wrapper
script callback arrived before the middleware could notify the testbench of the
job’s submission status.
The histogram in figure 4.2 shows the distribution of scheduling latencies
without extreme outliers.
It can be seen that a scheduling latency of 0 is by far the most probable. Fur-
ther scheduling latencies between 1 and 10 seconds are approximately equiprob-
able. Hence, unsurprisingly, the cumulative distribution function (CDF) of
the latency values starts at a high value at 0 and is light-tailed. Figure 4.3 con-
tains this CDF along with CDFs of other well-known distributions evaluated at
the same mean and standard deviation. Truncated Gaussian refers to a normal
distribution centred at the mean of the measured data, but without negative val-
ues, which cannot occur in a scheduling process. The latencies could probably
be modelled approximately by a mixture of the log-normal and the exponen-
tial distribution. Our claim of light-tailed behaviour is confirmed by the decay
rate, which is faster than that of the exponential model.
About 79% of all job submissions are executed at their compute element
without any noticeable latency and over 90% may consume the required resources
within 7 seconds.

Figure 4.1: Globus scheduling latencies (scheduling latency in ms over submission date,
with median and mean): (a) all measurements, (b) zoomed.

Figure 4.2: Globus scheduling histogram (extreme outliers truncated).

Figure 4.3: Globus scheduling latency CDF, plotted against exponential, log-normal and
truncated Gaussian CDFs (x-axis: scheduling latency in seconds).

Figure 4.4: Globus scheduling times per day of the week (box plots, milliseconds).

All latencies higher than 12 seconds can be considered outliers.
Since the distribution of the scheduling latency tail roughly approximates an
exponential distribution, setting a timeout value in a timeout and resubmission
strategy necessarily implies the loss of jobs with a certain probability. However, by
taking the last non-outlier value of 12 seconds, the probability of terminating
a job that would not have been stuck indefinitely is already lower than 2%. Since in
the AustrianGrid, higher latencies form not only outliers but extreme outliers
(greater than three times the interquartile range above the third quartile), it would
be a good strategy to resubmit jobs after a 12-second timeout.

Scheduling latency with respect to execution date

Grouping the results according to the day of the week at which the job was
scheduled exposes no obvious differences (figure 4.4). Applying the Kruskal-
Wallis test yields a p-value of 4.7 × 10^-5, which indicates differently distributed
samples on the daily temporal scale. However, the millisecond scale is of little practical
relevance when considering the latency as it is perceived by the user. Rounding
the values to seconds and reapplying Kruskal-Wallis still yields 0.0141, which is
below the significance level.
However, since large samples are more likely to produce small p-values and
the median is the same for all days, we conclude that there are weekday differ-
ences in the scheduling behaviour, but that they are negligible from a practical
perspective.

Figure 4.5: Globus scheduling times per daytime period (box plots, milliseconds; periods
00-08, 08-16, 16-24).

There is no visible correlation between the period of the day and the
scheduling latency. Scheduling latency during typical working hours does not differ
noticeably from that during typical spare-time periods (figure 4.5).
The Kruskal-Wallis test result does not contradict that statement.

Scheduling latency with respect to CE

Finally, we have evaluated the scheduling latency properties of the different queues.
Figure 4.6 shows box plots of those latencies with and without ex-
treme outliers. The sites seem to exhibit specific patterns. Only altix1.jku
and hydra.gup show larger latency outliers, while all jobs were scheduled with-
out noticeable delay on schafberg. blade.labs did not produce any outliers,
but the scheduling times seem to be more uniformly distributed in the interval
between 0 and 16 seconds, with a median of about 4 seconds.
Intuitively, different queue load balancing and scheduling managers seem to
cause different latencies. schafberg uses the operating system to spawn a new
process for each arriving job (jobmanager-fork), which explains the very low
scheduling latencies. altix1.jku and hydra.gup both use the PBS (Portable
Batch System) load-balancer, while blade.labs schedules arriving jobs with
SGE (Sun Grid Engine). A survey of these and other cluster schedulers was
conducted by Etsion and Tsafrir [37]. According to that survey, the default
behaviour of a PBS queue is a shortest job first scheduling policy, while SGE
schedules with a simple first-come first-served policy. This explains the low
median scheduling times at the PBS queues, since our job was fairly short compared
to other Grid workloads.
Hence, the scheduling overhead in Globus seems to depend more on the re-
motely deployed load balancer than on the middleware. However, deriving de-
tailed results with respect to different load-balancing products would require
more in-depth investigation.

4.2.3 gLite

In addition to the evaluations we have conducted on the Globus measurements,
we had the opportunity to evaluate gLite’s behaviour on the same CE with dif-
ferent configuration options. Our findings are described in the following sections,
along with possible influences by daytime, weekday and the used CE.

Overall scheduling latency

With the gLite middleware and the used configuration, scheduling latency is
considerably higher than with the Globus middleware. The mean scheduling
latency is 135 seconds and the median latency in gLite is 91 seconds. Figure 4.7
shows the distribution of the scheduling latency values.
Over 90% of the jobs are scheduled within 200 seconds, as can be seen from the
empirical cumulative distribution function in figure 4.8.
When comparing the gLite scheduling CDF to the exponential function, it
becomes obvious that the higher latency values of gLite form a heavy tail, since
the distribution decays more slowly for higher values than the respective exponential
distribution. The third quartile is 120 seconds, the IQR is 77 seconds and the upper
whisker is at 235.5 seconds. Therefore jobs taking more than 351 seconds (third
quartile + 3 ∗ IQR) before getting scheduled must be considered extreme outliers and
become subject to resubmission. The resulting loss of up to 5% of otherwise com-
pleting jobs must be considered a good tradeoff, given the heavy tail and the
corresponding low probability that the respective jobs will still be scheduled in
what the user may perceive as a short time after that waiting period.
Feeding our data into the timeout model yields 209 seconds as the optimal
timeout value and confirms the observation of the authors that higher time-
out values penalize the execution time behaviour less than aggressive ones (fig-
ure 4.9).

Figure 4.6: Scheduling latency box plots per Globus site (hydra.gup, blade.labs,
altix1.jku, schafberg): (a) full range in seconds, (b) zoomed (milliseconds).

Figure 4.7: gLite scheduling latencies (seconds over submission date, with median and
mean).

Figure 4.8: gLite scheduling latency CDF, plotted against exponential, log-normal and
truncated Gaussian CDFs.

Figure 4.9: Expectation of scheduling time (y-axis) against scheduling timeout value
(x-axis).

Scheduling latencies with respect to execution date

In the context of our measurement results, the variations in scheduling latency
across different weekdays and daytime periods seem marginal.
Figures 4.10 and 4.11 show the respective box plots.
The Kruskal-Wallis test for the daytime period groups returns a p-value of
0.29, hence the hypothesis that the sample distributions are equal is not rejected
by the test. The p-value obtained from the weekday grouping is 0.0048, which
is below the 5% significance level. This rejects the null hypothesis that the
distributions are equal.
However, we consider the absolute differences based on submission date to
be minimal, when compared to other influences, such as type of CE. Hence,
we conclude that the scheduling times in EGEE are rather insensitive to the
submission date.

Scheduling latencies with respect to CE

In figure 4.12 we use the abbreviations listed in table 4.1 for the different queues
to which our job was submitted by the WMS.
The queues show significantly different scheduling latencies. A clear pattern
can be seen with respect to the deployed queue load-balancer. Eight queues are
managed by PBS and seven queues by the LCG-PBS load-balancer. Queues
configured with PBS scheduled jobs with a small median latency but a high
latency variance, whereas the opposite holds for queues configured with LCG-PBS.

Figure 4.10: gLite scheduling times per day of the week (box plots, seconds, logarithmic
scale).

Figure 4.11: gLite scheduling times per daytime period (box plots, seconds, logarithmic
scale).

SRCE ce1-egee.srce.hr:2119/jobmanager-sge-prod
NIIF egee-ce.grid.niif.hu:2119/jobmanager-pbs-voce
LINZ egee-ce1.gup.uni-linz.ac.at:2119/jobmanager-pbs-voce
ELTE eszakigrid66.inf.elte.hu:2119/jobmanager-lcgpbs-voce
POZNAN ce.reef.man.poznan.pl:2119/jobmanager-pbs-voce
BME ce.hpc.iit.bme.hu:2119/jobmanager-lcgpbs-long
IRB egee.irb.hr:2119/jobmanager-lcgpbs-grid
CYF ce.cyf-kr.edu.pl:2119/jobmanager-pbs-voce
AMU pearl.amu.edu.pl:2119/jobmanager-lcgpbs-voce
WROC dwarf.wcss.wroc.pl:2119/jobmanager-lcgpbs-voce
SAVBA ce.ui.savba.sk:2119/jobmanager-pbs-voce
KFKI grid109.kfki.hu:2119/jobmanager-lcgpbs-voce
CESNET ce2.egee.cesnet.cz:2119/jobmanager-pbs-egee voce
TUKE ce.grid.tuke.sk:2119/jobmanager-pbs-voce
IJS lcgce.ijs.si:2119/jobmanager-pbs-voce
OEAW hephygr.oeaw.ac.at:2119/jobmanager-lcgpbs-voce

Table 4.1: Abbreviations used in gLite per queue boxplot

Figure 4.12: gLite scheduling latency per queue (box plots, seconds, logarithmic scale;
queues as listed in table 4.1).

On the LCG-PBS queues, jobs were scheduled with a high median latency but a small la-
tency variance. Among the queues on which our jobs were scheduled, there is
one queue managed by the SGE load balancer (SRCE), which shows a low me-
dian scheduling latency and high latency variance, but of course no conclusions
can be drawn from the measurement results of just one queue.
From the perspective of timeout and resubmission strategies, low latency vari-
ance is more desirable than a lower median latency, because the system be-
haviour becomes more predictable. Hence, unless many time-critical small jobs
or complex workflows are submitted to EGEE, the LCG-PBS queues perform
better. LCG-PBS is a PBS wrapper that increases PBS scalability by allowing
the monitoring of all jobs from a user through the same resource broker. The
reason for this fine-tuning of PBS is the need for the higher scalability inherent in
a meta-scheduling architecture such as gLite [38]. Using the adapted version of
PBS seems to be advantageous in terms of scheduling latency predictability.

Scheduling latencies with respect to CE configuration

Patryk Lasoń, the administrator of the CYF (cyfronet) CE, has kindly provided
us with the opportunity to run our test with different CE load balancer configu-
ration options. The CYF CE is configured with the Maui cluster scheduler [39].
Maui is available as a pluggable scheduler for different load balancers such as PBS
or LSF and uses them for resource management, while providing Maui-specific
functionality such as reservations and policies to the outside. Maui has to poll
PBS for node information. Maui’s scheduling system is based on a priority mech-
anism, which runs and makes reservations for the highest-priority jobs and tries
to fit lower-priority jobs into gaps in the reservation system. The interval at
which Maui polls PBS is determined by the RMPOLLINTERVAL configuration
option.
We have measured the performance of the Maui-managed cyfronet queue with
an RMPOLLINTERVAL setting of 60 seconds and later with a setting of 5 sec-
onds. The resulting box plots can be seen in figure 4.13.
The median of the results obtained with the 5-second RMPOLLINTERVAL is
127 seconds, while the median of the 60-second RMPOLLINTERVAL measure-
ment is 135 seconds. This marginal improvement comes at the cost of a much
higher latency variance: the 60-second sample has a variance of 23905 sec²,
whereas the 5-second sample shows a variance of 196430 sec². These variances
correspond to standard deviations of 155 and 443 seconds respectively.
Due to the larger variance, the mean value of the queue with the small poll
interval is also considerably larger.

Figure 4.13: gLite scheduling latency with different Maui RMPOLLINTERVAL settings
(box plots, seconds, logarithmic scale).

The mean scheduling latency on the CE with the more frequently updated scheduler
is 287 seconds, whereas it is only 157 seconds for the CE with less frequent updates.
The reason for the higher mean value becomes obvious when one views the
CDF in figure 4.14.
With the shorter poll interval comes a higher share of jobs with very long
scheduling times. Based on the collected data, 90% of all jobs are scheduled
within less than 250 seconds with the 60-second RMPOLLINTERVAL setting.
With the 5-second setting, however, one would have to wait about 600 seconds
until the job is scheduled with 90% probability. About one third of the submitted
jobs has to wait considerably longer for a computational resource with the short
setting than with the long setting.
One possible explanation is that with the smaller poll interval, Maui
can make reservations for jobs with high priority more often and run them at
a higher frequency. Thus, jobs with lower priority waiting in the CE queue
may starve because there are fewer time gaps to be filled by the Maui backfill
mechanism.

4.3 Notification overhead


Notification overhead denotes all kinds of overhead introduced by the reporting
capabilities of a Grid middleware. Sometimes, we use the terms middleware
overhead, notification delay and reporting overhead, which can all be considered
as semantically equivalent in the following sections.

Figure 4.14: CDF of gLite scheduling latency with different Maui poll intervals
(RMPOLLINTERVAL of 60 s and 5 s).

We have measured two different types of notification overhead. First, notifi-
cation overhead occurs between the actual start of a job’s execution and the
middleware reporting it to the end user. For the sake of simplicity, we are go-
ing to call this type of overhead ”pre-execution” overhead. Although this is not
strictly an exact term, because the overhead actually occurs while the job is al-
ready executing, it nevertheless gives a good intuitive notion of which type
of overhead it denotes.
Consequently, the second type of overhead, which occurs between the com-
pletion of a job and the middleware reporting it, is going to be called
”post-execution” overhead.

4.3.1 SSH
We have evaluated a sample of 132 jobs that were executed on the different
desktop machines they were submitted to.
The overhead between the RUNNING notification and the actual execution could not
be measured, because the SSH JavaGAT adaptor did not consistently report the
RUNNING state. The post-execution overhead is 787 milliseconds in the median
and 885 milliseconds in the mean. The variance is 2140800 ms² and
thus the standard deviation is 1463 milliseconds. All values already include
potential overhead added by the JavaGAT API.
Hence, middleware notification overhead in SSH, when compared to typical
payload execution times, is below noticeability. However, it must be considered
that SSH by itself is not a middleware in the common sense of the term.

4.3.2 Globus
Out of 2098 submissions, 306 jobs needed to be forcibly terminated after the
timeout period, which corresponds to a ratio of jobs without excessive waiting
times of about 85%.

Overall IS overhead

Normally, it could be expected that pre-execution and post-execution overheads


converge to the same value with a large number of samples. We have collected a
sample of 1368 jobs for pre-execution overhead and 1417 jobs for post-execution
overhead. The different numbers stem from the fact that sometimes notifica-
tions, either middleware-generated or wrapper-script callbacks, get dropped due
to firewall issues or network errors.
The collected data contradicts the above assumption that both information
service overheads are approximately equal. The delay between the actual job
start and the RUNNING notification is about 6 seconds in the mean and 5
seconds in the median, while after job completion the mean and median time
until notification is about 11 seconds. No significant outliers can be found in
the observed notification overhead times, as displayed in figure 4.15.
Thus, the Globus information service reports job start faster than job com-
pletion.
As displayed in figure 4.16, not only the mean and median of notification overhead
times differ between pre- and post-execution information service messages, but
also the distribution of frequencies. With the exception of the spike at 6 seconds,
post-execution notification overhead follows a Gaussian distribution and has a
variance of 23 sec². The pre-execution notification delays, on the other
hand, are more bursty, but show a lower variance of 9.2 sec².
Both samples can be fitted fairly well by a truncated Gaussian distribution,
as confirmed by the CDFs in figure 4.17. The Kolmogorov-Smirnov test rejects
that assumption for the pre-execution overhead (which is bursty in parts), but
does not reject it for the post-execution overhead at the 5% significance level.
Since both overheads are more or less normally distributed and don’t contain
outliers, in a timeout and resubmission strategy, the timeout value is best set
to infinity (i.e. no timeouting of jobs should take place).

IS overhead with respect to the CE

For post-execution overhead, it must be noted that the state is set to
POST STAGING as soon as execution end is detected: GridFTP is initiated
after the state change is reported, by JavaGAT and not by Globus.

Figure 4.15: Middleware notification overhead in Globus (IS overhead in ms over
submission date): (a) time between actual execution and RUNNING notification, (b) time
between completion and POST STAGING notification.

Figure 4.16: Histograms of middleware notification overhead in Globus: (a) latencies
between actual execution and RUNNING notification, (b) latencies between completion and
POST STAGING notification.

Figure 4.17: CDFs of middleware notification overhead in Globus, plotted against
exponential, log-normal and truncated Gaussian CDFs: (a) latencies between actual
execution and RUNNING notification, (b) latencies between completion and POST STAGING
notification.
However, the fact that the histogram shows a spike points to the influence that the
CE’s reporting and polling intervals might have.
Consequently, the differences could result from GRAM’s job monitoring mech-
anisms (and the polling intervals) or from properties of the underlying load-
balancing systems that are polled by GRAM. However, unlike in section 4.2.2,
no clear patterns with respect to the load-balancing system can be observed,
because the two PBS sites behave differently (figure 4.18). The information
service overhead is highest for the SGE system, but this can also be due to
the local GRAM installation of the site. In the measured sample, information
service overhead can be as high as 15 seconds even on the system with a pure
fork load-balancer.
Interestingly, GRAM delivers most RUNNING notifications within 3 and 4 seconds,
respectively, on two of the four systems, which intriguingly run different sched-
ulers. Hence, without a larger sample regarding the load-balancers/schedulers,
no exact conclusions can be drawn.

4.3.3 gLite
Out of 2655 jobs, 689 had to be terminated because they did not finish within 2700
seconds. Accordingly, based on the collected data, the probability that a gLite
job submission will not end up waiting in some state indefinitely is approximately
74%.

Overall information service overhead in gLite

The observation from section 4.3.2, that notification overhead is different for pre-
and post-execution notifications, also holds for gLite. Figure 4.19 shows the
overheads for pre- and post-execution notifications.
As in Globus, the post-execution overhead is higher than the pre-execution
overhead. However, here the overhead difference lies in the dimension of a factor of
ten. The median value for pre-execution overhead is 9 seconds and the mean
value 28 seconds. For post-execution overhead, these values are 198 and
209 seconds respectively. The variance is also larger for post-execution notifications:
analysing the gathered data yields a variance of 2078 sec² for pre-execution
overhead and 17352 sec² for post-execution overhead.
The histogram of the pre-execution overhead is rather unsurprising, with a
global maximum around the median, but smaller local maxima up to the highest
measured delay. The histogram of the post-execution overhead shows two spikes
around 200 and 300 seconds with a local minimum in between. Both histograms
are depicted in figure 4.20.

Figure 4.18: Middleware notification overhead by Globus site (IS overhead in seconds):
(a) time between actual execution and RUNNING notification, (b) time between completion
and POST STAGING notification.

Figure 4.19: Middleware notification overhead before and after job execution in gLite
(seconds, logarithmic scale, over submission date): (a) time between execution start and
RUNNING notification, (b) time between completion and DONE notification.

Figure 4.20: gLite IS overhead histograms: (a) pre-execution IS overhead, (b) post-
execution IS overhead.

Figure 4.21: gLite IS overhead CDFs, plotted against exponential, log-normal and
truncated Gaussian CDFs: (a) pre-execution overhead, (b) post-execution overhead.

Figure 4.22: Expected notification times (y-axis) in relation to timeout values (x-axis).

Figure 4.21 contains a visualisation of the respective CDFs. Neither distribution
fits any well-known distribution. Among the distributions tried, but not
shown in the figure, were also a Pareto-tailed truncated Gaussian distribution
and a generalised Pareto distribution.
The tail of the pre-execution overhead vanishes around the maximum value of
166 seconds, so a conservative approach in a timeout and resubmission scenario
would be to set the timeout to infinity. This is also the most reasonable value,
since picking the upper whisker or the third quartile + 3*IQR would result in a mas-
sive loss of jobs that would otherwise be reported after a relatively short
additional time interval. The timeout model, on the other hand, yields an
optimal timeout value of 14 seconds for this sample, resulting in an estimated
average reporting time of about 12 seconds.
Post-execution overhead has a long tail and therefore it makes sense to set a
finite timeout value. Heuristic timeouts could be 379 or 509 seconds, i.e. the
values that form the upper whisker and the lower boundary for extreme outliers
respectively. The timeout model returns an optimal value of 316 seconds. As
can be seen in figure 4.22, setting too small a timeout value heavily penalizes
the execution behaviour.
The bursty post-execution overhead with its long absolute delay may orig-
inate from the hierarchical reporting system deployed within gLite’s logging
and bookkeeping mechanism (section 2.3.2), which necessitates multi-layered
polling/pushing for state notification propagation.

Information service overhead with respect to CE

Apart from the insight that gLite’s hierarchical logging system causes large
information service overhead and overhead burstiness, it was within the scope
of our interest to quantify the influence of the used CE respectively the queue load
balancer. For Globus we have shown that there are strong indications for an
influence by those components.
Analysis of the pre-execution middleware overhead per queue reveals signifi-
cant differences. It can be seen from the box plots displayed in figure 4.23 that
especially the overhead for pre-execution notification exhibits a strong queue
dependence.
CESNET and ELTE are obvious outliers and testify to the CE’s sig-
nificance regarding IS overhead. Furthermore, the influence of the specific load
balancer is an interesting aspect. Queues managed by LCG-PBS have a median
pre-execution reporting overhead of 8 seconds, a mean overhead of 40 seconds
and a variance of 3346 sec². All values are lower for PBS, the median being 3
seconds, the mean 11 seconds and the variance 701 sec². According to the t-test,
the samples from the two load balancers stem from distributions with different means.
Therefore, the collected data allows the interpretation that information ser-
vice overhead between the actual running state and the RUNNING notification
mainly originates from the used scheduler/load-balancer and the CE configura-
tion. The LCG enhancement to PBS, which allows the central resource broker
to monitor all submitted jobs, seems to negatively influence IS overhead.
The same basically holds for post-execution IS overhead. Queues managed
by LCG-PBS report job completion with a higher mean (270 vs. 169 seconds) and
median latency (250 vs. 168 seconds) than PBS and vary more in the expected
notification duration (the variance for LCG-PBS is 28843 sec², whereas it is
5041 sec² for PBS). The t-test considers the LCG-PBS and PBS samples to be
differently distributed. However, unlike in the pre-execution IS overhead measure-
ments, the median overhead value is greater than 100 seconds for all queues
equally, which strongly suggests that the hierarchical multistage logging process
in gLite’s L&B mechanism may cause part of the overall delay.

4.4 Staging overhead


Poststaging includes copying stdout and stderr messages and the output of the
Sieve of Eratosthenes algorithm. We only copied back a stdout file, typically
a few hundred bytes in size, which is far less than what the used networks can
transfer in a second. Hence, the measured delays quantify the overhead
imposed by the involved middlewares in the post-staging and clearing phases.

Figure 4.23: gLite IS overheads per queue (queues as listed in table 4.1): (a) box plot
for pre-execution IS overhead per queue (seconds), (b) box plot for post-execution IS
overhead per queue (seconds, logarithmic scale).

Clearing usually involves deletion of the sandbox files and directories.

4.4.1 Globus
The Globus post-staging process exposes, as shown in figure 4.24, relatively low
absolute latency values, with a median of 6 and a mean of 7.3 seconds, and a
low variance of 12 sec².
However, the tested sites behave differently, which can be seen in the box
plots in figure 4.25. blade.labs has a near-to-zero variance, while hydra.gup,
altix1.jku and schafberg have a larger variance, with many outliers showing
up at altix1.jku. The post-staging latency is not so much dependent upon
the middleware, since it is exclusively dependent on the speed of the GridFTP
transactions.
GridFTP has some overhead compared to FTP, which originates from the need
for GSI authentication. However, the performance of blade.labs shows that
file transfer within Globus environments can be quite a deterministic process in
terms of latency. The variance of the other sites is also comparatively low and
may have its cause in network effects.

4.4.2 gLite
We have no results for gLite, but since the post-staging process is exactly the
same as in Globus with the WMS acting as the remote GridFTP endpoint, it
can be expected that the results for post-staging will be quite similar.

4.5 Comparison
The values that we have presented in our analysis for the different middlewares
are summarised in table 4.2. The optimal timeout values correspond to numbers
obtained with the timeout model shown in equation 4.1 for our samples. All
results computed with the model show that overestimating the timeout value
does not have as negative an impact as underestimating it.
The overhead of SSH with respect to scheduling and to information service
overhead is close to 0, which is hardly surprising, since SSH only takes care of
forwarding commands and data (via scp); apart from that it does not in-
clude any of the functionalities of Grid middlewares, such as meta-scheduling,
matchmaking, job state tracking and queueing. The queue is just a simple fork,
which creates a new process that becomes subject to operating system schedul-
ing. Hence, there is also close-to-zero scheduling overhead because jobs may
start instantly. The values measured in SSH merely serve as a base reference.

Figure 4.24: Post-staging overhead in Globus (latency in seconds over submission date,
with median and mean): (a) full range, (b) zoomed.

Figure 4.25: Data staging latencies per Globus site (seconds, logarithmic scale).

                                      SSH     Globus           gLite

Scheduling
  Mean (sec)                          0.04    3.9              135
  Median (sec)                        0       0                91
  Variance (sec²)                     0.008   1802             74004
  Empirical distribution              mix     mix              mix
  Outlier ratio                       2.5%    19%              6%
  Optimal timeout (sec)               0       0                209
  Daytime influence                   no      no               no
  Weekday influence                   no      minimal          small
  CE influence                        -       large            large

Pre-exec. IS-overhead
  Mean (sec)                          -       6.2              28
  Median (sec)                        -       4.6              9
  Variance (sec²)                     -       9.2              2078
  Empirical distribution              -       Trunc. Gaussian  mix
  Outlier ratio                       -       0%               21%
  Optimal timeout (sec)               -       ∞                14
  CE influence                        -       large            large

Post-exec. IS-overhead
  Mean (sec)                          0.9     11               209
  Median (sec)                        0.8     10               198
  Variance (sec²)                     2.1     22               17352
  Empirical distribution              mix     Trunc. Gaussian  mix
  Outlier ratio                       1.5%    0%               1.4%
  Optimal timeout (sec)               1.6     ∞                316
  CE influence                        -       large            large

Table 4.2: Summary of the overhead properties of all middlewares

This base reference confirms that possible overhead introduced by JavaGAT and
the testbench is negligible when compared to the measured values themselves.

4.5.1 Scheduling times in Globus and gLite


The more interesting middleware is Globus, because it contains the basic com-
ponents of a middleware. Most important with respect to the measurements are
the queueing mechanisms on the CE, which influence scheduling times, a central
resource broker which performs matchmaking, and a central interface (GRAM)
on the CE for job status tracking.
The scheduling times may vary by up to 12 seconds within a normal execution
trace, but are typically close to 0. Timeouting should be performed nevertheless,
since outliers may occur. We have shown that the times depend heavily on the
deployed load-balancer and probably very much on the configuration of the
queue itself. In our experiments, the shortest-job-first scheduling policy favours
our job on average but can also produce outliers, while the
scheduling times obtained at the first-come first-served queue are longer in the
median, but also resistant to outliers. These results may look different with
other workloads. Compared to the influence of the CE, the influence of the
date at which we submitted our jobs was minimal.
The mean scheduling latency in gLite is about 33 times that of Globus and the
median latency is much higher as well. On an absolute scale, the latencies are
thus much higher than in Globus, but it must be kept in mind that the scale
of the AustrianGrid is far smaller than the scale of the EGEE in-
frastructure, where multiple VOs access shared queues and the meta-scheduling
has a larger search space. Furthermore, at least one queue was governed by a
long-job scheduling policy, which is bound to discriminate against our relatively
short-running job.
However, it must be said that the fraction of outliers produced by gLite casts
a damning light on the middleware, the schedulers or the middleware/scheduler
combination. The corresponding latency distribution is heavy-tailed. Outliers
occur, as they do in Globus as well, but the distribution in gLite cannot be fitted
with a faster-than-exponentially decaying function. Hence, with high probability,
there is a noticeable fraction of jobs that must be considered outliers.
Exactly as in Globus, the latencies are queue- and queue-configuration depen-
dent and do not vary drastically among different weekdays or daytime periods.

4.5.2 Information service overhead in Globus and gLite


Globus IS overhead roughly follows a truncated Gaussian distribution and is
outlier-free. Hence the delivery of the respective status update messages can be
considered reliable in Globus. gLite shows outliers in its overhead, both in pre-
execution and post-execution delays. Especially in post-execution overhead, the
outliers form a long tail. Therefore it must be said that gLite cannot guarantee
reliable status updates and that waiting for job completion must be timeouted in
practice. It is questionable whether gLite’s hierarchical logging and bookkeeping
process is optimal for reporting. The different poll intervals in the stages of this
process manifest themselves as local maxima in the histograms of the overhead
measurements.
For both Globus and gLite, post-execution notification overhead is larger than
pre-execution overhead. Since the overhead changes with the picked CE, it can
be expected that some load-balancers are slow in registering job completion.
We could not identify a clear load-balancer dependency for the four tested Globus
sites, but for gLite it is quite obvious that the LCG-PBS scheduler performs
worse than the PBS scheduler. LCG-PBS attempts to enhance scalability by
allowing the resource broker to track all jobs submitted by the same user. This
suggests that the worse performance of LCG-PBS stems from these scalability
modifications.
Furthermore, the post-execution notification delay in gLite is much more dis-
proportionate compared to the pre-execution delay than in Globus. While in
Globus the difference between the medians is about a factor of 2, in gLite the dif-
ference is more than a factor of 20. Accordingly, in direct comparison to Globus,
gLite does a bad job of reporting job completion even in consideration of the
larger scale of EGEE. Since the logging process is the same for pre-execution
and post-execution reporting, the conclusion remains that the interaction be-
tween the local resource management software and the CE-deployed job status
monitor is suboptimal.

Chapter 5

Conclusions

In this thesis we presented an overview of the different Grid-enabling technolo-
gies SSH, Globus and gLite, of which Globus and gLite are fully-fledged Grid
middlewares. Our interest in analysing the different delay components summing
up to the total Grid delay motivated the construction of a comprehensive test-
bench, which not only measures total execution times or scheduling times, but
also the overhead introduced by the middleware reporting mechanisms.
To attain that goal, we first strove for the ability to use a single library as
the basic Grid-enabling API for our testbench. We picked the Grid middleware
API wrapper JavaGAT as the uniform Grid access API for our testbench and
consequently introduce the basic concepts and functionalities of JavaGAT in the
thesis. To conduct tests with gLite middleware, we enhanced and improved an
existing gLite adaptor for JavaGAT to fit our needs. The thesis describes the
improvements made on the original work and how they can be of assistance to
users in productive use. Problems with the library management of the adaptors
are addressed and explained and the architecture of the adaptor outlined.
From there, we introduced the testbench which served as the tool for our
observations. We described the mechanisms involved in measuring and gave a
short overview of the testbench’s architecture.
Finally, we presented the experimental results with detailed measurement eval-
uations for SSH, Globus and gLite with respect to scheduling and notification
delay. We have concluded that scheduling overhead and IS overhead depend
heavily on the remotely deployed local resource management system, but not
much on the submission date. Additionally, we found that gLite has a
larger overhead and a larger overhead variance than Globus concerning schedul-
ing as well as middleware notifications. The post-execution overhead in gLite
represents a particularly strong disturbance in the job execution cycle, since
it introduces high absolute latencies and the corresponding distribution is
heavy-tailed with outliers. Hence it forces the user to set timeouts for resubmission
on the waiting times for the status notification of the job’s completion, which should
not be necessary otherwise. Therefore we have to say that the hierarchical re-
porting system in gLite, with the CE’s monitor forwarding information received
from the LRMS to a multi-staged L&B system, leaves room for improvement.
Due to the fact that our testbench identifies latencies at all stages of execution,
it is possible to timeout and resubmit the job at nearly every execution state if
reasonable waiting times are exceeded. We presented such timeout values, using
both intuitive considerations and the latency model introduced in [17].
With our testbench, new data on the execution behaviour can be gathered
every time a middleware environment improves significantly with respect to the
perceived delays, as has frequently happened in EGEE in the last few years. Due
to the modular design of the testbench and the fact that only very little code is
middleware-dependent, it should be easy to create new testcases on demand.

Appendix A
gLite Documentation

Glite Resource Broker Adapter Readme

Thomas Zangerl

January 7, 2009
Contents

1 VOMS-Proxy Creation 3
1.1 Frequent errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.1 ”Unknown CA” error . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.2 ”Error while setting CRLs” . . . . . . . . . . . . . . . . . . . . . . 3
1.1.3 ”pad block corrupted” . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.4 Could not create VOMS proxy! failed: null: . . . . . . . . . . . . . 3
1.2 Preference keys for VOMS Proxy creation . . . . . . . . . . . . . . . . . . 4
1.3 Minimum configuration to make VOMS-Proxy creation work . . . . . . . 4
1.4 (Not) reusing the VOMS proxy . . . . . . . . . . . . . . . . . . . . . . . . 5

2 The gLite-Adaptor 6
2.1 Adaptor-specific preference keys . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Supported SoftwareDescription attributes . . . . . . . . . . . . . . . . . . 6
2.3 Supported additional attributes . . . . . . . . . . . . . . . . . . . . . . . . 7
2.4 Setting arbitrary GLUE requirements . . . . . . . . . . . . . . . . . . . . 7

1 VOMS-Proxy Creation
1.1 Frequent errors
1.1.1 ”Unknown CA” error
The proxy classes report an ”Unknown CA” error (Could not get stream from secure
socket). Probably the VomsProxyManager is missing either your root certificate or the
root certificate of the server you are communicating with. It is best if you include
all needed certificates in ~/.globus/certificates/ (e.g. you can copy the /etc/grid-
security/certificates directory from a UI machine of the VO you are trying to work
with to that location).
If this doesn’t suffice, try to include a file called cog.properties in the
~/.globus/ directory. The content of this file could look like this:

#Java CoG Kit Configuration File
#Thu Apr 05 15:59:23 CEST 2007
usercert=~/.globus/usercert.pem
userkey=~/.globus/userkey.pem
proxy=/tmp/x509up_u<your user id>
cacert=/etc/grid-security/certificates/
ip=<your ip address>

Also check the vomsHostDN preference value for typos/errors.

1.1.2 ”Error while setting CRLs”


Try to create a cog.properties file in ~/.globus and set the cacert property to
~/.globus/certificates. This will cause the VOMS proxy classes not to look for CRLs
in /etc/grid-security/certificates.

1.1.3 ”pad block corrupted”


Check whether you have provided all necessary information (password, host DN, location
of your user certificate etc., see section 1.3) as global preferences of the GAT
context. If you have provided all necessary information, check the password you have
given for typos.

1.1.4 Could not create VOMS proxy! failed: null:


See the answer in section 1.1.3; maybe you forgot to specify the vomsServerPort
preference value.

Key                  Description                                  Example                           Required
vomsHostDN           distinguished name of the VOMS host          /DC=at/DC=uorg/O=org/CN=somesite  compulsory
vomsServerURL        URL of the VOMS server, without protocol     skurut19.cesnet.cz                compulsory
vomsServerPort       port on which to connect to the VOMS server  7001                              compulsory
VirtualOrganisation  name of the virtual organisation for which   voce                              compulsory
                     the VOMS proxy is created
vomsLifetime         the desired proxy lifetime in seconds        3600                              optional

Table 1.1: Necessary preference keys for VOMS proxy creation

1.2 Preference keys for VOMS Proxy creation


The necessary preference keys are shown in table 1.1.
Additionally you need a CertificateSecurityContext which points to your user certifi-
cate and your user key. Add that CertificateSecurityContext to the GATContext.
With the compulsory preferences, the proxy is created with a standard lifetime of 12
hours. If you want to have a different lifetime, add the optional vomsLifetime preference
key.
Do something like

GATContext context = new GATContext();

CertificateSecurityContext secContext =
    new CertificateSecurityContext(new URI(your_key), new URI(your_cert), cert_pwd);
Preferences globalPrefs = new Preferences();
globalPrefs.put("vomsServerURL", "voms.cnaf.infn.it");
...
context.addPreferences(globalPrefs);
context.addSecurityContext(secContext);

1.3 Minimum configuration to make VOMS-Proxy creation work

• A cog.properties file with lines as in section 1.1.1.

• The following global preferences set in the GAT context (a minimal sketch follows
  below):

   – vomsHostDN
   – vomsServerURL
   – vomsServerPort
   – VirtualOrganisation
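
Putting the pieces together, a minimal sketch of such a configuration could look like the
following (JavaGAT classes from the org.gridlab.gat packages are assumed to be imported;
the certificate paths and the password are placeholders, and the preference values are the
examples from table 1.1):

GATContext context = new GATContext();

// user key and certificate plus the password protecting the key (placeholders)
CertificateSecurityContext secContext = new CertificateSecurityContext(
    new URI("userkey.pem"), new URI("usercert.pem"), "cert_password");

// compulsory preferences for VOMS proxy creation (see table 1.1)
Preferences prefs = new Preferences();
prefs.put("vomsHostDN", "/DC=at/DC=uorg/O=org/CN=somesite");
prefs.put("vomsServerURL", "skurut19.cesnet.cz");
prefs.put("vomsServerPort", "7001");
prefs.put("VirtualOrganisation", "voce");
// optional: request a proxy lifetime of one hour instead of the default 12 hours
prefs.put("vomsLifetime", "3600");

context.addPreferences(prefs);
context.addSecurityContext(secContext);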

1.4 (Not) reusing the VOMS proxy


If multiple job submissions to the same VO happen within a small time interval, it is not
necessary to create a new VOMS proxy for every submission. Hence, if the user sets
the system property glite.newProxy to false (or doesn’t set it at all), the system will
check whether a valid VOMS proxy exists. If such a proxy is found, it is determined
whether the proxy lifetime exceeds the one specified in the vomsLifetime property. If
the vomsLifetime property is unspecified, it is checked whether the proxy is still valid
for more than 10 minutes. Very frequently, all job submissions of an application will
happen within the same VO. If there is a need to submit jobs to multiple VOs, it won’t
be possible to reuse the old proxy due to the VO-specific attribute certificates stored
in the proxy file.

2 The gLite-Adaptor
2.1 Adaptor-specific preference keys
The mechanisms provided by the GAT API alone did not suffice to provide all the
control we found desirable for the adaptor. Hence, a few proprietary preference keys
were introduced. They are useful for controlling adaptor behaviour but are by no means
necessary if one just wants to use the adaptor. Nonetheless, they are documented here.
To avoid confusion, all of them take strings as values, even if the name would suggest
an integer or boolean.
If you want to use them, set them using preferences.put(); for example write

preferences.put("glite.pollIntervalSecs", "15");
context.addPreferences(preferences);

The following preference keys are supported (a combined sketch follows after the list):

• glite.pollIntervalSecs - how often the job lookup thread should poll the WMS for
  job status updates and fire MetricEvents (value in seconds, default 3 seconds)

• glite.deleteJDL - if this is set to true, the JDL file used for job submission will be
deleted when the job is done (”true”/”false”, default is ”false”)

• glite.newProxy - if this is set to true, create a new proxy even if the lifetime of the
old one is still sufficient
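
For illustration, a sketch that sets all three keys at once (the values are arbitrary
examples):

Preferences prefs = new Preferences();
// poll the WMS for job status updates every 15 seconds
prefs.put("glite.pollIntervalSecs", "15");
// delete the generated JDL file once the job is done
prefs.put("glite.deleteJDL", "true");
// force the creation of a fresh VOMS proxy for this run
prefs.put("glite.newProxy", "true");
context.addPreferences(prefs);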

2.2 Supported SoftwareDescription attributes


The minimum supported attributes of the software description seem to be derived
from the features that RSL (the Globus job submission file format) provides. Hence,
they are easy to translate to RSL properties. However, the format used for gLite job
submission is JDL, and attributes like count or hostCount are hard to translate to JDL.
Most of the supported attributes are not even realised by the JDL format itself, but by
adding GLUE requirements. Sadly, the JDL format does not provide much of the
functionality covered by RSL, hence many attributes remain unsupported.
On the other hand, to enable working with the features that the JDL format provides
in addition to the RSL format, a new attribute was introduced: set glite.retryCount to
some String or Integer in order to use the retry count function of gLite.
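
As an illustration, a sketch that sets the retry count (it assumes JavaGAT's
SoftwareDescription.addAttribute(String, Object) and uses a placeholder executable):

SoftwareDescription sd = new SoftwareDescription();
sd.setExecutable("/bin/hostname");
// gLite-specific attribute: let the WMS retry the job up to three times
sd.addAttribute("glite.retryCount", "3");
JobDescription jd = new JobDescription(sd);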

2.3 Supported additional attributes
The attribute glite.DataRequirements.InputData can be set. It expects an ArrayList<String>
as value, containing one or more input-data LFNs or GUIDs, which the matchmaker uses
during scheduling to decide on which CE the job is going to run. Normally, this is a CE
”close” (i.e. with low latency) to the SE.
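
A sketch of how this could be set, reusing the software description sd from the previous
sketch (java.util.ArrayList is assumed to be imported, and the LFN is a made-up example):

ArrayList<String> inputData = new ArrayList<String>();
inputData.add("lfn:/grid/voce/some/input/file.dat");
// the matchmaker uses these LFNs/GUIDs to pick a CE close to the corresponding SE
sd.addAttribute("glite.DataRequirements.InputData", inputData);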

2.4 Setting arbitrary GLUE requirements


If you would like to specify any GLUE requirements that are not covered by the standard
set of SoftwareResourceDescription or HardwareResourceDescription keys, you may set
glite.other either in a Software- or HardwareResourceDescription and add a *full* legal
GLUE requirement as the entry, for example
!( other.GlueCEUniqueID == "some_ce_of_your_choice").
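
A sketch of what this could look like, assuming JavaGAT's addResourceAttribute method on
the resource description and the two-argument JobDescription constructor:

SoftwareResourceDescription srd = new SoftwareResourceDescription();
// exclude one particular CE with a full GLUE requirement expression
srd.addResourceAttribute("glite.other",
    "!( other.GlueCEUniqueID == \"some_ce_of_your_choice\")");
JobDescription jobDescription = new JobDescription(sd, srd);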

Appendix B
GridLab license
GRIDLAB OPEN SOURCE LICENSE

The GridLab licence allows software to be used by anyone and for any purpose,
without restriction. We believe that this is the best way to ensure that Grid
technologies gain wide spread acceptance and benefit from a large developer
community.
Copyright (c) 2002 GridLab Consortium. All rights reserved.
This software includes voluntary contribution made to the EU GridLab
Project by the Consortium Members: Instytut Chemii Bioorganicznej PAN
Poznańskie Centrum Superkomputerowo-Sieciowe (PSNC), Poznań, Poland;
Max-Planck Institut fuer Gravitationsphysik (AEI), Golm/Potsdam, Ger-
many, Konrad-Zuse-Zentrum fuer Informationstechnik (ZIB), Berlin, Germany;
Masaryk University, Brno, Czech Republic; MTA SZTAKI, Budapest, Hungary;
Vrije Universiteit (VU), Amsterdam, The Netherlands; ISUFI/High Perfor-
mance Computing Center (ISUFI/HPCC), Lecce, Italy; Cardiff University,
Cardiff, Wales; National Technical University of Athens (NTUA), Athens,
Greece; Sun Microsystems Gridware GmbH, Germany; HP Competency Center
France
Installation, use, reproduction, display, modification and redistribution with
or without modification, in source and binary forms, is permitted provided that
the following conditions are met:

1. Redistributions in either source-code or binary form along with accompanying
documentation must retain the above copyright notice, the list of
conditions and the following disclaimer.

2. The names ”GAT Grid Application Toolkit”, ”EU GridLab” or ”GridLab”
may be used to endorse or promote software, or products derived
therefrom.

3. You are under no obligation to provide anyone with bug fixes, patches, upgrades
or other modifications, enhancements or derivatives of the features,
functionality or performance of any software you provide under this license.
However, if you publish or distribute your modifications, enhancements or
derivative works without contemporaneously requiring users to enter into a
separate written license agreement, then you are deemed to have granted
GridLab Consortium a worldwide, non-exclusive, royalty-free, perpetual
license to install, use, reproduce, display, modify, redistribute and sub-
license your modifications, enhancements or derivative works, whether in
binary or source code form, under the license stated in this list of condi-
tions.

4. DISCLAIMER
THIS SOFTWARE IS PROVIDED ”AS IS” AND ANY EXPRESS
OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
THE GRIDLAB CONSORTIUM MAKE NO REPRESNTATION THAT
THE SOFTWARE, ITS MODIFICATIONS, ENHANCEMENTS OR
DERIVATIVE WORK THEROF WILL NOT INFRINGE PRIVATELY
OWNED RIGHTS. IN NO EVENT SHALL THE COPYRIGHT OWNER
OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCURE-
MENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLI-
GENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY
OF SUCH DAMAGE

List of Figures

1.1 The Grid layer model with the components of each layer . . . . . 4
1.2 The middleware as a transparent component between APIs and
services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2.1 The GSI proxy delegation process . . . . . . . . . . . . . . . . . . 16


2.2 GRAM job states . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3 Example GRAM job registration process . . . . . . . . . . . . . . 17
2.4 Job registration in gLite . . . . . . . . . . . . . . . . . . . . . . . 21
2.5 The information flow in L & B reporting . . . . . . . . . . . . . . 22
2.6 gLite job states (bold arrows on successful execution path) . . . . 22

3.1 JavaGAT’s structure . . . . . . . . . . . . . . . . . . . . . . . . . 26


3.2 Class diagram of the VOMS proxy management classes . . . . . 29
3.3 Class diagram of the gLite adaptor . . . . . . . . . . . . . . . . . 31
3.4 A simplified class diagram of the testbench . . . . . . . . . . . . 34

4.1 Globus scheduling latencies . . . . . . . . . . . . . . . . . . . . . 40


4.2 Globus scheduling histogram (extreme outliers truncated) . . . . 41
4.3 Globus scheduling latency CDF . . . . . . . . . . . . . . . . . . . 41
4.4 Globus scheduling times per day of the week . . . . . . . . . . . 42
4.5 Globus scheduling times per daytime periods . . . . . . . . . . . 43
4.6 Scheduling latency box plots per queue . . . . . . . . . . . . . . . 45
4.7 gLite scheduling latencies . . . . . . . . . . . . . . . . . . . . . . 46
4.8 gLite scheduling latency CDF . . . . . . . . . . . . . . . . . . . . 46
4.9 Expectation of scheduling time (y-axis) against scheduling time-
out value (x-axis) . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.10 gLite scheduling times per day of the week . . . . . . . . . . . . . 48
4.11 gLite scheduling times per daytime period . . . . . . . . . . . . . 48
4.12 gLite scheduling latency per queue . . . . . . . . . . . . . . . . . 49
4.13 gLite scheduling latency with different Maui RMPOLLINTER-
VAL settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.14 CDF of gLite scheduling latency with different Maui poll intervals 52
4.15 Middleware notification overhead in Globus . . . . . . . . . . . . 54

4.16 Histograms of middleware notification overhead in Globus . . . . 55
4.17 CDFs of middleware notification overhead in Globus . . . . . . . 56
4.18 Middleware notification overhead by Globus site . . . . . . . . . 58
4.19 Middleware notification overhead before and after job execution
in gLite . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.20 gLite IS overhead histograms . . . . . . . . . . . . . . . . . . . . 60
4.21 gLite IS overhead CDFs . . . . . . . . . . . . . . . . . . . . . . . 61
4.22 Expected notification times (y-axis) in relation to timeout values 62
4.23 gLite IS overheads per queue . . . . . . . . . . . . . . . . . . . . 64
4.24 Post staging overhead in Globus . . . . . . . . . . . . . . . . . . 66
4.25 Data staging latencies per Globus site . . . . . . . . . . . . . . . 67

List of Tables

4.1 Abbreviations used in gLite per queue boxplot . . . . . . . . . . 49


4.2 Summary of the overhead properties of all middlewares . . . . . . 67

Bibliography

[1] Worldwide LHC Computing Grid at a glance. Website. http://lcg.web.cern.ch/LCG/,
visited on November 7th 2008.

[2] Ian Foster. The Grid: Blueprint for a New Computing Infrastructure.
Morgan-Kaufman, 1999.

[3] Ian Foster. What is the grid? - a three point checklist. GRIDtoday, 1(6),
July 2002.

[4] Ein Petaflop Leistung. Website. http://www.netigator.de/netigator/live/
fachartikelarchiv/ha_artikel/powerslave,pid,archiv,id,31673719,obj,
CZ,np,,ng,,thes,.html, visited on November 7th 2008.

[5] Supercomputer Top500 list 06/2008. Website. http://www.top500.org/lists/2008/06,
visited on November 9th 2008.

[6] D. Cooper, S. Santesson, S. Farrell, S. Boeyen, R. Housley, and W. Polk.
Internet X.509 Public Key Infrastructure Certificate and Certificate Revocation
List (CRL) Profile. RFC 5280 (Proposed Standard), May 2008.

[7] W. Allcock. GridFTP: Protocol Extensions to FTP for the Grid. Global Grid
Forum Recommendation (GWD-R), April 2003. http://www.globus.org/alliance/
publications/papers/GFD-R.0201.pdf.

[8] J. Linn. Generic Security Service Application Program Interface Version 2,
Update 1. RFC 2743 (Proposed Standard), January 2000.

[9] Rob V. van Nieuwpoort, Thilo Kielmann, and Henri E. Bal. User-friendly
and reliable grid computing based on imperfect middleware. In Proceed-
ings of the ACM/IEEE Conference on Supercomputing (SC’07), November 2007.
Online at http://www.supercomp.org.

[10] Marik Marshak and Hanoch Levy. Evaluating web user perceived latency
using server side measurements. Computer Communications, 26(8):872–
887, 2003.

[11] Ahuva W. Mu’alem and Dror G. Feitelson. Utilization, predictability, work-
loads, and user runtime estimates in scheduling the IBM SP2 with backfill-
ing. IEEE Transactions on Parallel and Distributed Systems, 12(6):529–
543, 2001.

[12] B. Chun and D. Culler. User-centric performance analysis of market-based
cluster batch schedulers, 2002.

[13] Emmanuel Medernach. Workload analysis of a cluster in a grid environment.
In 11th Workshop on Job Scheduling Strategies for Parallel Processing, 2005.

[14] Michael Oikonomakos, Kostas Christodoulopoulos, and Emmanouel (Manos)
Varvarigos. Profiling computation jobs in grid systems. In CCGRID ’07:
Proceedings of the Seventh IEEE International Symposium on Cluster Computing
and the Grid, pages 197–204, Washington, DC, USA, 2007. IEEE Computer Society.

[15] Diane Lingrand, Johan Montagnat, and Tristan Glatard. Estimation of
latency on production grid over several weeks. In ICT4Health, Manila,
Philippines, February 2008. Oncomedia.

[16] Diane Lingrand, Johan Montagnat, and Tristan Glatard. Modeling the
latency on production grids with respect to the execution context. In CC-
GRID ’08: Proceedings of the 2008 Eighth IEEE International Symposium
on Cluster Computing and the Grid (CCGRID), pages 753–758, Washing-
ton, DC, USA, 2008. IEEE Computer Society.

[17] Tristan Glatard, Johan Montagnat, and Xavier Pennec. Optimizing jobs
timeouts on clusters and production grids. In International Symposium on
Cluster Computing and the Grid (CCGrid), pages 100–107, Rio de Janeiro,
May 2007. IEEE.

[18] T. Ylonen and C. Lonvick. The Secure Shell (SSH) Protocol Architecture.
RFC 4251 (Proposed Standard), January 2006.

[19] Ian T. Foster. Globus toolkit version 4: Software for service-oriented sys-
tems. In Hai Jin, Daniel A. Reed, and Wenbin Jiang, editors, NPC, volume
3779 of Lecture Notes in Computer Science, pages 2–13. Springer, 2005.

[20] Von Welch, Ian Foster, Carl Kesselman, Olle Mulmo, Laura Pearlman,
Jarek Gawor, Sam Meder, and Frank Siebenlist. X.509 proxy certificates
for dynamic delegation. In In Proceedings of the 3rd Annual PKI R & D
Workshop, 2004.

[21] The OpenSSL project. Website. http://www.openssl.org.

[22] Karl Czajkowski, Ian T. Foster, Nicholas T. Karonis, Carl Kesselman, Stu-
art Martin, Warren Smith, and Steven Tuecke. A resource management
architecture for metacomputing systems. In IPPS/SPDP ’98: Proceedings
of the Workshop on Job Scheduling Strategies for Parallel Processing, pages
62–82, London, UK, 1998. Springer-Verlag.

[23] Karl Czajkowski, Ian T. Foster, and Carl Kesselman. Resource manage-
ment for ultra-scale computational grid applications. In PARA ’98: Pro-
ceedings of the 4th International Workshop on Applied Parallel Comput-
ing, Large Scale Scientific and Industrial Problems, pages 88–94. Springer-
Verlag, 1998.

[24] J. Sermersheim. Lightweight Directory Access Protocol (LDAP): The Protocol.
RFC 4511 (Proposed Standard), June 2006.

[25] Sergio Andreozzi, Stephen Burke, Felix Ehm, Laurence Field, Gerson
Galang, Balazs Konya, Maarten Litmaath, Paul Millar, and JP Navarro.
Glue specification v2.0.42. OGF Specification draft, May 2008.

[26] K. Czajkowski, S. Fitzgerald, I. Foster, and C. Kesselman. Grid information
services for distributed resource sharing, 2001.

[27] Berkeley database information index. Website. https://twiki.cern.ch/
twiki//bin/view/EGEE/BDII, visited on November 18th 2008.

[28] Roberto Alfieri, Roberto Cecchini, Vincenzo Ciaschini, Luca dell’Agnello,
Alberto Gianoli, Fabio Spataro, Franck Bonnassieux, Philippa J. Broadfoot,
Gavin Lowe, Linda Cornwall, Jens Jensen, David P. Kelsey, Ákos Frohner,
David L. Groep, Wim Som de Cerff, Martijn Steenbakkers, Gerben Venekamp,
Daniel Kouril, Andrew McNab, Olle Mulmo, Mika Silander, Joni Hahkala, and
Károly Lörentey. Managing dynamic user communities in a grid of autonomous
resources. CoRR, cs.DC/0306004, 2003. Informal publication.

[29] Stephen Burke, Simone Campana, Patricia Méndez Lorenzo, Christopher
Nater, Roberto Santinelli, and Andrea Sciabà. gLite 3.1 users guide, 2006.

[30] Erwin Laure. Egee middleware architecture planning (release 2). EU De-
liverable DJRA1.4, July 2005. http://edms.cern.ch/document/594698.

[31] Rajesh Raman, Miron Livny, and Marvin Solomon. Matchmaking: Dis-
tributed resource management for high throughput computing. In Proceed-
ings of the Seventh IEEE International Symposium on High Performance
Distributed Computing (HPDC7), Chicago, IL, July 1998.

[32] R-GMA: Relational Grid Monitoring Architecture. Website. http://www.r-gma.org/
index.html.

[33] A. Shoshani, A. Sim, and J. Gu. Storage resource managers: Middleware
components for grid storage, 2002.

[34] Tony Calanducci. LFC: the LCG File Catalog. Slides for NA3: User Training
and Induction, June 2005. http://www.phenogrid.dur.ac.uk/howto/LFC.pdf.

[35] GFAL Java API. Website. https://grid.ct.infn.it/twiki/bin/view/GILDA/APIGFAL.

[36] Gregor von Laszewski, Ian Foster, Jarek Gawor, and Peter Lane. A Java
Commodity Grid Kit. Concurrency and Computation: Practice and Expe-
rience, 13(8-9):643–662, 2001.

[37] Yoav Etsion and Dan Tsafrir. A short survey of commercial cluster batch
schedulers. Technical Report 2005-13, School of Computer Science and
Engineering, the Hebrew University, Jerusalem, Israel, May 2005.

[38] Jens Jensen, Graeme Stewart, Matthew Viljoen, David Wallom, and Steven
Yound. Practical Grid Interoperability: GridPP and the National Grid
Service. In UK AHM 2007, April 2007.

[39] Brett Bode, David M. Halstead, Ricky Kendall, Zhou Lei, David Jackson,
and Maui High. The portable batch scheduler and the maui scheduler on
linux clusters. In In Proceedings of the 4th Annual Linux Showcase and
Conference. Press, 2000.
