environments
by
Thomas Zangerl
I also certify that the thesis has been written by me. Any help that I have
received in my research work and the preparation of the thesis itself has been
acknowledged. In addition, I certify that all information sources and literature
used are indicated in the thesis.
Abstract
Grid middlewares simplify access to Grid resources for the end user by providing
functionality such as metascheduling, matchmaking, information services
and adequate security facilities. However, the advantages of the middlewares
usually come at the cost of temporal overhead added to the execution time
of the submitted jobs. In this thesis, we group the overhead into two categories:
the first type occurs before the jobs are executed, in the form of
scheduling latency. What follows is information service overhead, which
is introduced by delays in the flow of information about the job status from the
executing worker node up to the end user. We analyse both types of overhead
with respect to several factors, such as absolute values and variance, for the
Grid middlewares SSH, Globus and gLite. We evaluate our experimental data
regarding daytime-, weekday-, CE- and queue-influence, and discuss the results
and the implications.
Contents
1 Introduction 3
1.1 The Grid . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Grid middlewares . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.1 Virtual Organisations . . . . . . . . . . . . . . . . . . . . 5
1.2.2 Data staging . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 Problem definition . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4.1 Impact of overhead on short-running jobs . . . . . . . . . 7
1.4.2 Impact of overhead variance on timeout values . . . . . . 7
1.4.3 Impact of overhead variance on workflows . . . . . . . . . 8
1.4.4 Implications of measured overhead for middlewares . . . . 8
1.5 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.6 Structure of the thesis . . . . . . . . . . . . . . . . . . . . . . . . 9
1.7 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2 The Middlewares 13
2.1 SSH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2 Globus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.1 Security concepts . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.2 Resource Management . . . . . . . . . . . . . . . . . . . . 15
2.2.3 Information services . . . . . . . . . . . . . . . . . . . . . 17
2.2.4 Data transfer . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3 gLite . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3.1 Security concepts . . . . . . . . . . . . . . . . . . . . . . . 18
2.3.2 Resource Management . . . . . . . . . . . . . . . . . . . . 19
2.3.3 Information services . . . . . . . . . . . . . . . . . . . . . 23
2.3.4 Data transfer . . . . . . . . . . . . . . . . . . . . . . . . . 23
3 Implementation of adaptor and testbench 25
3.1 JavaGAT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2 Implementing the gLite adaptor . . . . . . . . . . . . . . . . . . . 27
3.2.1 VOMS-Proxy creation . . . . . . . . . . . . . . . . . . . . 29
3.3 Job submission . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.3.1 External libraries . . . . . . . . . . . . . . . . . . . . . . . 30
3.3.2 License . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.4 The testbench . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4 Results 35
4.1 Experiment setup . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.1.1 Statistical methods . . . . . . . . . . . . . . . . . . . . . . 35
4.1.2 SSH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.1.3 Globus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.1.4 gLite . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.2 Scheduling times . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.2.1 SSH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.2.2 Globus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.2.3 gLite . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.3 Notification overhead . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.3.1 SSH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.3.2 Globus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.3.3 gLite . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.4 Staging overhead . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.4.1 Globus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.4.2 gLite . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.5 Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.5.1 Scheduling times in Globus and gLite . . . . . . . . . . . 68
4.5.2 Information service overhead in Globus and gLite . . . . . 68
5 Conclusions 71
A gLite Documentation 73
B GridLab license 81
List of Figures 83
List of Tables 85
Bibliography 87
Acknowledgements
First and foremost, I want to thank my supervisor, Dr. Maximilian Berger,
who has always, when required, found the time to give his advice on problems
in person. More than once, this meant spending a lot of time on a particular
problem. He also gave me support and assistance, whenever I needed it.
My thanks are also expressed to Dr. Alexander Beck-Ratzka, who developed
the original gLite adaptor and has been very helpful in supplying me with the
source code before its release and in so quickly answering the many questions
that I have asked him. Roelof Kemp from the JavaGAT project also gave me
invaluable support by integrating the adaptor into JavaGAT and by providing
me SVN contributor access to the project.
Furthermore, I want to thank the EGEE project, and particularly the VOCE
VO, for providing infrastructure and user support. In this context, I also want
to acknowledge all the people who helped me quickly, even on weekends, on
the various developer mailing lists that I burdened with my problems.
Roberto Pugliese was so kind as to mail me the source code of the glite-security-trustmanager
from the CVS at the University of Trieste, which I couldn't access
from outside due to the university's security policy.
Last but not least, I want to thank my family, who have supported me on so many
occasions during my studies, and my girlfriend Nadine, who has always motivated
me to carry on when I was frustrated.
Chapter 1
Introduction
Figure 1.1: The Grid layer model with the components of each layer (from top:
Application: user applications; Collective: scheduling, workload management;
Resource: resource access, bookkeeping, job information; Connectivity:
communication, authentication; Fabric: hardware resources)
With this set of standard components, the Grid can provide considerable
computational power to the end user. The theoretical performance of the LHC
computing Grid alone has been estimated to equal 1 petaflop [4]. This is about
the same value as the Rmax performance in the Linpack benchmark of the top
site in the Top 500 list of supercomputers of June 2008 (see [5]).
Hence, the sheer mass of comparatively inexpensive computing components
can amount to considerable performance, yet somehow the resources have to be
coordinated to form a useful distributed computing environment. This is the
task of the Grid middleware.
Figure 1.2: The middleware as a transparent component between APIs and services
putational task of the user is to be executed. The set of Grid-related tasks that
a middleware can take over on behalf of the Grid user varies among different
middlewares; however, the minimal functionality that a scientific user would expect
from a middleware includes access control, copying needed data sets and the
executable itself to the execution sites (input staging), copying result data
back to the user's computer (output staging), continually reporting the progress
of the computational task (job monitoring) and relaying system messages to the
user (staging stdout and stderr).
is a member of VO Z, person Y may use resource X. Of course, authenticity
and authorisation have to be ensured. For this purpose, cryptographically well-defined
public key infrastructure procedures with X509 certificates [6] are used.
VOs maintain tables of their members along with the signatures of their digital
certificates. Upon accessing a resource, the user is expected to show her certificate,
or a temporally limited proxy created from it, as proof of VO membership.
net job execution time or even stay in a non-completed state for an indefinite
time span. In a Grid environment, the goal is to avoid such executions, for
example by using a timeout and resubmission strategy. To determine feasible
timeout values, knowledge about approximate latency values and latency vari-
ance is critical.
1.4 Motivation
In this thesis, overhead of different middlewares is measured and compared.
Such overhead has different negative effects, based on whether the total overhead
or the overhead variance is the major problem.
Solutions to this problem are often based on timeouts and resubmission of
jobs. The timeout values either include an estimation of the job's execution
time and a statistical model of the latency, or are based on the estimated times
a job should spend in a certain state (for example, in the "WAITING" state).
Timeouts can be crucial in an environment without success guarantees. However,
latency variance makes it hard to estimate such timeout values. Most
notably, the latency should not depend on a random factor introduced by the
middleware's way of organising information flow or by site schedulers with
esoteric scheduling policies.
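As a minimal sketch of why resubmission with a well-chosen timeout helps, consider a simplified model (our illustration, not the latency model of the works cited later): each attempt completes within the timeout t with probability p, at a conditional mean duration m, and every unsuccessful attempt costs the full timeout before resubmission. The number of failed attempts is then geometric with mean (1 - p)/p, so the expected total time is m + t(1 - p)/p. The sketch below checks this closed form against a Monte Carlo simulation in which latency is exponential and a fraction of jobs is lost entirely; all numeric values are invented for illustration.

```java
import java.util.Random;

/** Sketch: expected completion time under a timeout-and-resubmit strategy. */
class TimeoutModel {

    /**
     * Closed form: each attempt succeeds within the timeout t with
     * probability p (conditional mean duration m); the number of failed
     * attempts is geometric with mean (1 - p) / p, and every failed
     * attempt costs the full timeout before resubmission.
     */
    static double expectedTotal(double p, double t, double m) {
        return m + t * (1 - p) / p;
    }

    /**
     * Monte Carlo check: latency is exponential with the given mean, a
     * fraction lossProb of jobs never completes at all, and attempts that
     * exceed the timeout are cancelled and resubmitted.
     */
    static double simulate(double meanLatency, double lossProb,
                           double timeout, int runs, long seed) {
        Random rnd = new Random(seed);
        double total = 0;
        for (int i = 0; i < runs; i++) {
            double elapsed = 0;
            while (true) {
                if (rnd.nextDouble() < lossProb) {
                    elapsed += timeout;          // job lost: wait out the timeout
                    continue;                    // and resubmit
                }
                double latency = -meanLatency * Math.log(1 - rnd.nextDouble());
                if (latency <= timeout) {
                    elapsed += latency;          // success within the timeout
                    break;
                }
                elapsed += timeout;              // too slow: cancel and resubmit
            }
            total += elapsed;
        }
        return total / runs;
    }
}
```

Note that for the purely exponential (memoryless) part of the latency this expectation does not depend on the timeout at all; it is the lost-job mass, and in practice heavy-tailed latencies, that make a timeout-and-resubmit strategy pay off.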
If jobs are inter-dependent, i.e. if certain jobs require results of other jobs in
order to complete execution, latency variance can heavily affect the performance
of the experiments, because in order to finish execution, jobs may have to wait
a long time for required intermediate results of other jobs.
This holds especially for complex application workflows. If interdependence
is high, even a small fraction of outliers can lead to the need for
multiple resubmissions of the workflow. Here, again, this may provide an
incentive for the user to submit complete workflow management processes as
single jobs to the Grid, which may bypass quality-of-service mechanisms of the
middleware and negatively influence overall stability even more.
There is not much that middlewares can do to avoid overhead caused by hard-
ware or network components, or by general workload in the Grid. Scheduling
overhead mostly depends on such workload and on the configuration of the
workqueue (prioritisation etc.).
However, the reason for information service overhead is often to be found in
the design of a middleware. By measuring that information service overhead
with respect to
• Overall latency
• Latency variance
1.5 Methodology
In order to measure the latency values, a testbench was implemented in the Java
programming language. To ensure uniform access to different middlewares from
the testbench, the middleware wrapper API JavaGAT [9] was used. JavaGAT
had to be extended with an adaptor for the gLite middleware. Part of that
adaptor had already been developed by Dr. Alexander Beck-Ratzka and Andreas
Havenstein (Max-Planck-Institut für Gravitationsphysik). However, the original
adaptor was based on an outdated version of JavaGAT, lacked needed functionality
and showed problematic behaviour in terms of memory usage. The latter was
not the fault of the original authors but originated from third party libraries
with memory leaks. We have significantly improved the adaptor and added
functionality, while fixing the memory leaks and porting it to the latest JavaGAT
interfaces. We have used the adaptor as a part of the testbench to measure gLite
execution times.
The measurements themselves were conducted on a uniform timescale on dif-
ferent Grid sites with different middlewares. The results were statistically anal-
ysed and interpreted.
The implementation of the testbench and the adaptor is described in detail
in chapter 3.
1.7 Related work
However, the systematic complexity of multi-layered hierarchical middlewares
like gLite causes a divergence of user-perceived and cluster-generated overhead,
since many middleware components may introduce overhead themselves. Con-
sequently, some works focus on the overhead from the user perspective.
Lingrand et al. present experiments conducted within EGEE's biomed VO
over a long period of time in [15]. A timeout value of about 500 seconds is shown
to significantly improve the expectation of the job execution time. Furthermore,
the authors show that the day of the week has no relevant impact on the job
execution time.
The same authors show in [16] that the optimal timeout value based on their
latency model varies significantly among different gLite sites. Furthermore, different
queues on gLite sites are evaluated with respect to their latency behaviour.
It is argued that the best result with regard to execution time is obtained by
classifying the queues into two classes with different optimal timeouts.
Concerning temporal influence, the observation is made that overall there
are slightly higher latencies on weekends. However, the presented data on
day-of-the-week influences shows partial inconsistencies and would need further
exploration, which the authors also acknowledge in their conclusion. Furthermore,
it is questionable whether ANOVA is the best choice for analysing
the class partitioning, because the CDFs indicate that the samples are not
normally distributed, and ANOVA presupposes normally distributed data.
Another probabilistic model for computing optimal timeout values for resubmission
strategies is introduced in [17]. The authors apply the model to several
well-known distributions in order to derive optimal timeout values for systems
modelled by them. For the EGEE grid, the probability distribution corresponds
to a mixture of log-normal and Pareto distributions. It is argued that by
deriving optimal timeout values from the model, the expected job duration
moves within the range of outlier-free production systems. As for collecting
the necessary input data for the model, the authors suggest that it could be taken
from the workload management system logs.
All three works focus on total execution times as perceived by the end user
as the subject of their analysis. Our approach, on the other hand, allows us to
analyse the different sources of overhead, namely scheduling and information
service delays, separately.
Chapter 2
The Middlewares
2.1 SSH
In order to utilise Grid resources for job execution, the following minimal components
are required:
• Authorisation
• Computing resource access
• File transfer
These tasks can already be achieved with the standard SSH protocol [18].
Authorisation is part of the protocol itself and access control can be ensured
using Unix ACLs and schedulers on the execution site. The computing resource
access is a central part of the protocol’s functionality and file transfer can be
implemented with SCP, which is based on SSH.
Such a solution is very simple, because there is no matchmaking and no metascheduling.
The authorisation scheme mandates that each user owns an account
on the machine on which the Grid application is executed, which does not
scale well. There is no inherent support for parallelism with this solution,
and scheduling on the executing machine has to be performed by the operating
system scheduler, which is often ill-suited to this task in an uncontrolled
multi-user environment with long-running, non-interactive applications. Furthermore,
due to the lack of matchmaking, the user is obliged to know all
computing elements, their respective addresses and hardware and software
properties, and to do the matchmaking himself.
Hence, SSH is not a Grid middleware, because even though it can be used
for remote job execution, it lacks vital functionality of middlewares. However,
the time overhead produced by this kind of solution is very close to zero,
basically comprising only network latencies. Therefore it serves as a good baseline
when the overheads of more complex middlewares are evaluated.
2.2 Globus
The Globus Toolkit was one of the first Grid middlewares and is perhaps still
the most popular one. Many technologies that have emerged from the Globus
project are now de facto standards in Grid computing, such as the GridFTP
protocol and the Grid Security Infrastructure (GSI).
As a basis for our experiments, we have used the older, non-webservices-based
version 2 of Globus Toolkit. All essential components of Globus version 2 are
still contained in the current major version 4 [19]; in particular, GT4 contains
modules for webservices-based job management as well as pre-webservices job
execution management services (WS GRAM and Pre-WS GRAM).
Generally, the Globus Toolkit is a collection of APIs provided by client li-
braries for different programming languages, server libraries which provide the
Globus services and command-line tools for end users. The middleware consists
of several components, which implement different infrastructural tasks.
• Serial number
• Validity
• Optional extensions
• Signature algorithm ID
The signature at the end is from the CA and attests to the validity of the
information in the certificate, in particular the ownership of the public key by
the certificate's identity. For GSI, this certificate is required to be saved in the
.pem format, which means that the above information is stored in Base64
encoding.
For access control, the user would be required to prove her identity every time
she uses a Grid resource. Since her identity is established by challenging her
knowledge of the private key that belongs to the public key bound to her identity
in the certificate, this would mean having her enter the password for decrypting
the private key on each resource access. Since this is hardly feasible in a
production environment, GSI supports the delegation of a temporally limited
proxy which can impersonate the user [20].
The delegation process consists of creating a new public/private key pair, the
public key of which will be used for a new X509 certificate that is signed by
the holder of the Grid certificate herself. The generated private key is stored
alongside the certificate in a proxy file (see figure 2.2.1). For security reasons,
file permissions on the proxy are set restrictively and its validity is temporally
limited.
The proxy’s subject corresponds to the subject name in the Grid certificate
and other properties are derived from the original certificates as well. A new
critical X509 extension ensures that the proxy creator can define different poli-
cies concerning the proxy rights (i.e. whether it may delegate new proxies by
itself).
The generated private key is intended for one-time use, which makes proxy
revocation an easy task: It suffices to delete the proxy file.
Based on these proxy files, authentication in Globus is performed via the
GSS-API and SSL version 3, as implemented by OpenSSL [21].
[Figure: proxy delegation. A new private/public key pair is created; the new
public key is placed in a proxy certificate signed with the user's private key,
and the proxy file stores this certificate together with the new private key.]
With the Resource Specification Language (RSL), the Grid user specifies the
executable, files for input and output staging, estimated job duration wall clock
time, environment variables, arguments etc. This is done in a standard textual
way, using name/value pairs.
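An illustrative GRAM2-style RSL fragment might look as follows (a sketch with invented attribute values, not taken from our experiments):

```
& (executable = "/bin/hostname")
  (arguments = "-f")
  (count = 1)
  (maxWallTime = 5)
  (stdout = "hostname.out")
  (stderr = "hostname.err")
```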
The RSL is then sent to a Resource Broker, whose task is to specialise the
RSL, i.e. to derive possible execution sites directly from the RSL’s attributes.
Such a specialisation can be done by querying the Globus Information service
(see section 2.2.3) or subsequent resource brokers for suitable sites according to
certain properties derived from the RSL (e.g. available RAM size, number of
CPUs, installed software etc.). The resulting processed RSL is called ground
RSL.
If the job requires parallel execution on different sites in terms of MPI or
some other environment, the RSL will be passed to a site co-allocator. We do
not have such a use case in our testbench scenario; hence, co-allocation will
not be covered here. An in-depth description can be found in [23].
In our case, the specification refined by the resource broker is passed to
a GRAM. GRAM exists in a newer, webservices based version (WS GRAM,
GRAM4) and in an older version still supported by current releases of Globus
[Figures 2.2 and 2.3: the job state graph (Start, Pending, Active, Done,
Failed) and the GRAM submission sequence, including RSL specialisation,
the returned job handle and status notification callbacks, with a local
scheduler such as PBS at the bottom]
toolkit (Pre-WS GRAM, GRAM2). Since the AustrianGrid sites still exclusively
provide GRAM2 service endpoints, our experiments rely on the older version of
GRAM, which will be described here.
When the resource description reaches GRAM level, it is already on an ex-
ecution site, i.e. on a cluster, blade, parallel machine or some other compute
element. Usually on these compute elements, some local load balancer/sched-
uler will run, such as Sun Grid Engine (SGE), Portable Batch System (PBS)
or LoadLeveler (LL). GRAM provides a uniform interface to all those sched-
ulers while using them for queueing the Grid jobs. At the same time, GRAM
supervises these executions, notifies the user of status changes using a callback
URL specified at job submission and generates a job handle that can be used
for cancelling the job or actively polling its status (figure 2.3).
The job’s legal status transitions can be modelled by an acyclic directed graph
(see figure 2.2).
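The legal transitions can be captured in a small transition table; the sketch below is a simplified model restricted to the states shown in figure 2.2 (the full GRAM protocol also knows states such as Unsubmitted, StageIn, StageOut and Suspended, which are omitted here):

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;

/** Sketch: legal GRAM2 status transitions as an acyclic directed graph. */
class GramJobStates {

    enum State { START, PENDING, ACTIVE, DONE, FAILED }

    private static final Map<State, EnumSet<State>> LEGAL =
            new EnumMap<>(State.class);
    static {
        LEGAL.put(State.START,   EnumSet.of(State.PENDING));
        LEGAL.put(State.PENDING, EnumSet.of(State.ACTIVE, State.FAILED));
        LEGAL.put(State.ACTIVE,  EnumSet.of(State.DONE, State.FAILED));
        LEGAL.put(State.DONE,    EnumSet.noneOf(State.class));  // terminal
        LEGAL.put(State.FAILED,  EnumSet.noneOf(State.class));  // terminal
    }

    /** True if a status notification may move a job from one state to the other. */
    static boolean isLegal(State from, State to) {
        return LEGAL.get(from).contains(to);
    }
}
```

A client receiving status callbacks can use such a table to reject out-of-order or duplicate notifications before updating its view of the job.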
formation system MDS [26]. A frequent configuration is known as the Berkeley
Database Information Index (BDII [27]), which consists of two or more LDAP
databases in combination with a well-defined update-process. Aggregate LDAP
directories may be hierarchically grouped and can notify each other about in-
formation updates within their controlled realms using GRRP. The resource
broker or the user will likely search a top-level LDAP service which includes
information of all smaller domain-specific LDAP services.
This method of information aggregation ensures scalability. On the other
hand, the use of aggregation with notification protocols can mean that infor-
mation in the aggregate directory services can be outdated due to propagation
delay. However, often updates are relatively infrequent and the notification in-
tervals can be fine-tuned to form a trade-off between system load due to useless
messages and update notification delay.
2.3 gLite
gLite is used as the middleware of the EGEE project, which is an EU-funded
multi-disciplinary Grid environment and one of the largest Grid infrastructures
in the world. gLite is a de facto successor of the LHC computing grid (LCG) and
uses components developed for the LCG. Since the LCG is a package of Globus,
Condor and CERN-made components, many parts of the gLite middleware are
based on protocols and technology developed as part of Globus.
gLite is distributed as a set of command-line tools that run exclusively on
Scientific Linux, various programming-language APIs for the proprietary
protocols of gLite, and WSDL files for the gLite webservices, from which bindings
for different programming languages may be generated. Many components of gLite
are webservice-based, so most bindings can be generated from the respective
WSDL files.
existing Globus practice of manually adding authorised users to map-files on every
potential execution site for a certain VO has been found not to scale.
Hence, authorisation itself is performed by a Virtual Organisation Member-
ship Service (VOMS) [28]. The VOMS server has knowledge about the users of a
certain VO and maintains that information in a relational database. If the user
wants to construct a VOMS proxy, her client first handshakes with the VOMS server
using a standard GSI-authenticated message and then sends a signed request.
The user's request contains roles, groups and capabilities that are later used
by Grid sites for fine-grained access control. If the request can be parsed
correctly by the VOMS server and the user has sufficient rights for the demanded
group membership, role and capabilities, the server sends a structured VOMS
authorisation response. The VOMS response contains the following data:
• User credentials
The client can now save the VOMS response as an X509v3 Attribute Cer-
tificate in the grid proxy certificate and serialise the resulting certificate to a
VOMS proxy file. For access control, the user transmits the full VOMS proxy
to the resource broker, which only needs to check the validity of the VOMS
information instead of consulting a user map-file. However, unwanted users can
still be banned from resources by explicit blacklisting.
VOMS proxies cannot be revoked, but they have a limited lifetime, which is
included in the attribute certificate to avoid reuse of existing VOMS tickets by
malicious users.
• Computing Element (CE)
The structure of the Job Description Language (JDL) is quite similar to the RSL
format used in Globus (section 2.2.2). It also consists of name/value pairs, but
offers functionality beyond the scope of RSL. The language is based on Condor's
ClassAd language [31]; besides standard fields for the executable, input and
output sandbox files, stdout and stderr, arguments and environment variables,
explicit GLUE attributes can be supplied as requirements for matchmaking.
Since all GLUE attributes that the WMS can derive from the information
supermarket may be specified in the requirements section, the user can be
quite explicit about hardware and software requirements; for example, the user
may demand more than 1 GB of RAM, RedHat Linux as the CE's operating system
or an i686 processor on the executing machine. It is even possible to specify
explicit workqueues for scheduling, or to exclude certain workqueues.
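Such a JDL description might look like the following sketch (illustrative values; the requirement expression uses GLUE attribute names of the kind described above):

```
Executable    = "analysis.sh";
Arguments     = "run1";
InputSandbox  = {"analysis.sh", "input.dat"};
StdOutput     = "std.out";
StdError      = "std.err";
OutputSandbox = {"std.out", "std.err"};
Requirements  = other.GlueHostMainMemoryRAMSize >= 1024 &&
                other.GlueHostArchitecturePlatformType == "i686";
```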
Like the Resource Broker in Globus, in gLite the WMS takes the user’s JDL
and tries to find an appropriate resource fitting to the requirements expressed in
the JDL for job scheduling. Unlike in Globus, there is no active layer between
the WMS and the local resource manager, i.e. there is no specific gLite analogy
to Globus’ GRAM. However, the CE must provide a webservice interface for
submitting and cancelling jobs, retrieving the job status and sending signals to
jobs. This is called a computing element acceptance service (CEA). CEs can
operate in push and pull models, i.e. receive jobs and execute them, if they are
idle or ask for them, when they are idle. In order to make informed matchmaking
decisions, the WMS queries the information supermarket. If no resources are
available, it keeps the submission request in a task queue.
Clients can communicate with the WMS using a webservice-interface and
SOAP messages. The whole job registration process is depicted in figure 2.4.
[Figure 2.4: the gLite job registration sequence: (1) VOMS proxy
authentication, (2) sending the JDL string to the WMS webservice interface,
(3) locating the LB server via the generated job ID, (4) registering the job
with the LB logging server, (5) returning the job ID, (6) staging in the
sandbox via GridFTP, (7, 8) starting the job through the CEA and the LRMS
(e.g. Maui) on the worker nodes, and (9) querying the job state at the LB
server]
queue arriving jobs and schedule them based on autonomous decisions on vari-
ous worker nodes (WN), which are responsible for performing the actual com-
putation. Typically deployed LRMSes are PBS, SGE, LCG-PBS, LSF, LL or
Maui, which is a queueing plug-in for various LRMS products. The job monitor
checks the status of the job (running, completed) and reports it to the logging
and bookkeeping system.
In gLite, the job’s status is tracked using a specialised component, the logging
and bookkeeping (LB) server. The status is updated by events that are gener-
ated either by CEs themselves or by an aggregating WMS, using some logging
API. The logging API passes the information to a physically close locallogger
daemon, which stores it on a local disk file and reports success to the logging
API. The interlogger daemon forwards the information from the locallogger
daemon to the responsible bookkeeping server. The URL of the bookkeeping
server is included in the server part of the job ID, which has already been
chosen by the WMS. Hence, the LB server and the job remain associated, and an LB
server collects information about the job's status during its complete lifetime.
The incoming events are mapped to higher level gLite states (figure 2.6) by the
LB-server, where the user may query for them, or actively receive them if she
registered for update notifications (figure 2.5).
[Figure 2.5: the LB logging chain: WMS components use the logging API to
write events to a local log file via the locallogger; the interlogger forwards
them to the LB server, which clients query through its WSDL interface]
Figure 2.6: gLite job states (bold arrows on successful execution path)
2.3.3 Information services
The information supermarket (ISM) is a central repository for information about
Grid resources. It can be queried by the WMS during the process of matchmak-
ing for a JDL submission request. The architecture of the ISM itself usually
is implemented in quite similar way as in Globus (section 2.2.3), based on the
BDII. As in Globus, GLUE is used as the information model for the data.
More advanced query and update technologies for the information supermar-
ket, such as R-GMA [32] or XML/SOAP exist, but are rarely used.
1 Space on the SE for the file needs to be allocated. For this purpose the
Storage Resource Manager interface (SRM [33]) is used. SRM is a uniform
webservice-based interface to different kinds of SEs and their different
storage systems. SRM is not implemented by gLite, but by the
storage systems, which are supposed to provide an SRM interface. If a
space reservation request invoked by a user is granted by SRM, it will
return a storage URL (SURL) which denotes the full path where the file
is going to be stored and a transport URL (TURL), which is an endpoint
for direct GridFTP transfers of the file to the SE.
2 Register the file in the LCG File Catalog (LFC) [34]. For that purpose,
the LFC replica server for the current VO has to be contacted using a
proprietary protocol. Once the user sends in the address of one replica
(the SURL received in step 1), the LFC server will respond with a unique
identifier and a logical file name (LFN). This is a one-to-many mapping,
i.e. more replicas with different storage URLs can be added to the same
unique identifier. The file’s GUID is immutable, while the LFN can be
changed by the user.
3 Use the transport URL and GridFTP to copy the file to the SE.
4 Get the file during job execution. Currently, this is only possible using
the respective command-line interfaces of the lcg-tools. The WMS can be
instructed to schedule jobs to CEs "close" to the SE where the files are
stored by adding the LFN or the GUID to the DataRequirement field in
the JDL.
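As an illustrative sketch (the host name, the VO and all paths are placeholders), the steps above roughly map to the lcg data-management tools as follows; lcg-cr combines steps 1 to 3 (space allocation, the GridFTP copy and the LFC registration) in a single call:

```
# Steps 1-3: allocate space on the SE, copy the file and register it in the LFC
lcg-cr --vo voce -d se.example.org \
       -l lfn:/grid/voce/user/input.dat file:/home/user/input.dat

# Step 4, inside the job: fetch a replica by its logical file name
lcg-cp --vo voce lfn:/grid/voce/user/input.dat file:$PWD/input.dat
```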
Hence, management of large files in gLite is still quite a complex matter, and
disappointingly there exist only C reference implementations of the proprietary
LFC protocol. Java APIs such as GFAL [35] are just C wrappers and
require platform-dependent libraries to be installed.
Chapter 3
Implementation of adaptor and
testbench
3.1 JavaGAT
JavaGAT [9] is a high-level Grid-access interface for abstracting the technical
details of middleware-specific code, which is maintained independently from the
interface.
• Abstracting the low level nature of current middleware APIs away and
helping the developer concentrate on the core application rather than Grid
internals
[Figure: the JavaGAT architecture: a Grid application uses the File, resource
broker, security and monitoring APIs; the JavaGAT engine dispatches calls
through the corresponding CPI classes to GAT adaptors such as gridftp, sftp,
globus and gLite]
26
For example, classes for file staging extend Java’s java.io classes. To control the
behaviour of JavaGAT itself and the instantiated adaptors, a set of meta-classes
is available, such as GAT , GATContext and Preferences. GAT is used as a fac-
tory façade for the construction of actual objects behind the various interfaces.
GATContext carries security parameters and additional GAT- or adaptor-specific
preferences specified in the Preferences class. For instance, if the user wants
to restrict the set of file adaptors that are going to be invoked to SFTP, she
may add the preference ("File.adaptor.name", "sftp") to the GATContext and
subsequently limit connections to passive mode by specifying the preference
("ftp.connection.passive", "true").
Thus, knowledgeable users can control the behaviour of JavaGAT, while the
interface remains simple for inexperienced users.
• Extend the respective CPI classes (e.g. an adaptor for file transfer should
extend FileCpi, whereas an adaptor for resource brokerage should extend
the ResourceBrokerCpi class).
• Implement the minimal set of methods that are necessary for the adaptor
to provide useful functionality.
job information. However, it lacked some functionality that we considered
important.
First and foremost, the original adaptor did not include methods to create
VOMS proxies. If the adaptor cannot create VOMS proxies itself, a manual
step is required before invoking the adaptor: One needs to create the required
proxies with the assistance of gLite command-line tools. Since those tools de
facto exclusively support Scientific Linux, which is not a common Linux flavor
on workplace computers, this mostly implies creating the proxy on a system
running Scientific Linux (typically a Grid UI machine) and copying it back to
the system executing the adaptor.
We found this hardly acceptable with regard to our use cases, so the pri-
mary goal was to implement VOMS proxy support. Further improvements to
the original adaptor were made, as additional requirements were identified in
productive use. Those improvements include:
Attempts have been made to implement logical file handling (section 2.3.4),
but this idea has been dropped due to unresolvable problems with the propri-
etary LFC protocol. Furthermore, with respect to JavaGAT’s structure, such
functionality would have had to be implemented as a subclass of LogicalFileCpi
and not as a part of the gLite-Adaptor. Because CPIs are abstract classes, not
interfaces, and Java does not support multiple inheritance, an adaptor can only
be derived from one CPI.
Figure 3.2: Class diagram of the VOMS proxy management classes
The following sections are going to outline some aspects of the adaptor in
detail.
1 A standard Globus proxy is created.
Figure 3.3: Class diagram of the gLite adaptor
Globus libraries, security libraries (to enable SSL encryption in GSI-protected
sessions), and libraries upon which the aforementioned ones depend.
When loading the adaptors, JavaGAT identifies the adaptor JAR by its
manifest file. In a second step, it saves the paths to the classes from the
libraries on which the adaptor depends (which have to be in the same directory)
into a URLClassLoader. This classloader is set as the thread's context classloader
upon invoking the adaptor. Normally, conflicts between differing versions of the
same libraries required by different adaptors should thus be avoided.
Nevertheless, we found that different versions of Apache Axis can cause
problems with that mechanism. For example, classes from the Axis library
version found in the Globus adaptor are still in the classpath when the gLite
adaptor is instantiated. The reasons for this behaviour are rather unclear;
web discussions about Axis indicate that it may stem from Axis using thread
context classloaders itself. A workaround is to include the required version of
Axis in a classpath with higher priority than the thread context classloader's
(e.g. Java's system classpath) when executing the gLite adaptor.
3.3.2 License
JavaGAT is published under the conditions of the GridLab Open Source license.
This license is BSD-like and basically allows redistribution in source or binary
form, modification, use and installation without restrictions. Because the gLite
adaptor was submitted to the central JavaGAT repository and published as a
part of the JavaGAT release, it is non-exclusively subject to the license provi-
sions thereof. The full license text can be found in Appendix B.
middleware, virtual organisation, resource broker URI and the executables with
their arguments and input files. Following that, all status updates received from
the middleware are logged along with the time at which they were received.
Furthermore, upon construction, the GATTestRunner starts a RunListener, which
creates a listening socket at the first unbound port in the Globus TCP port range
(ports 40000 - 40500). Those ports were chosen because they are usually not
blocked by firewalls in Grid environments. Before submitting the job specified
in the testcase instance, the GATTestRunner creates a wrapper shell-script. This
shell script calls the executables specified in the testcase with the respective
arguments and adds a wget call to the IP address of the computer on which the
testbench runs and to the RunListener's port, before and after execution
of the main part, respectively.
The GATTestRunner replaces the specified executable by /bin/sh and adds the
wrapper-script as the only input argument. The actual execution parameters, as
they are specified in the testcase-instance, are logged as part of the job’s meta-
information. When the shell script is executed, it first sends a callback
via wget to the RunListener. Thus, the RunListener can record the exact time
of execution start and notify the GATTestRunner about it. The same applies to
execution end. The GATTestRunner eventually logs the dates received from the
RunListener along with the dates reported back by the middleware to the XML
file.
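The wrapper mechanism described above can be sketched as follows. The actual testbench generates the script in Java; the host, port, payload and the /start and /end callback paths here are illustrative placeholders, not the testbench's real values.

```python
def make_wrapper_script(listener_host, listener_port, executable, args):
    """Sketch of the wrapper shell-script created by the GATTestRunner:
    a wget callback before and after the payload lets the RunListener
    record the exact start and end times of the execution."""
    callback = "wget -q -O /dev/null http://%s:%d/%%s" % (listener_host,
                                                          listener_port)
    return "\n".join([
        "#!/bin/sh",
        callback % "start",                   # marks execution start
        " ".join([executable] + list(args)),  # the actual payload
        callback % "end",                     # marks execution end
    ])

script = make_wrapper_script("192.0.2.1", 40000, "/bin/hostname", [])
```

The job specification then names /bin/sh as the executable and this script as its only argument, exactly as described above.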
Because, under certain circumstances, jobs may exhibit virtually infinite
running times (see section 1.3), a maximum execution time has to be defined
in the concrete testcase instance. After that time interval, the
TestInstanceGuardian forcibly terminates the job. The GATTestRunner logs
such terminations to the XML file.
Figure 3.4: A simplified class diagram of the testbench
Chapter 4
Results
of this area is called the interquartile range (IQR). The bars located 1.5·IQR
above the upper and below the lower quartile are called whiskers. Every data
point smaller than the lower whisker or larger than the upper whisker is marked
as a dot on the graph and considered an outlier.
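The boxplot quantities just described can be computed as in the following sketch; it uses linear-interpolation quartiles, whereas other quartile conventions yield slightly different whisker positions.

```python
def quantile(sorted_xs, q):
    """Linear-interpolation quantile of an already sorted sample."""
    pos = q * (len(sorted_xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_xs) - 1)
    return sorted_xs[lo] + (pos - lo) * (sorted_xs[hi] - sorted_xs[lo])

def boxplot_stats(data):
    """Quartiles, IQR, whisker limits and outliers as used in the
    boxplots of this chapter."""
    xs = sorted(data)
    q1, q3 = quantile(xs, 0.25), quantile(xs, 0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # whisker limits
    return {"q1": q1, "q3": q3, "iqr": iqr,
            "lower": lower, "upper": upper,
            "outliers": [x for x in xs if x < lower or x > upper]}

stats = boxplot_stats([1, 2, 2, 3, 3, 3, 4, 4, 5, 40])
```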
Histograms show data values on the x-axis and the frequency at which they
occur on the y-axis. Sometimes, this allows for a fast understanding of the
approximate distribution of data values. When the large spread of data values
makes it appropriate, we use logarithmic scales on the x- or the y-axis. Whenever
an axis is logarithmic, this is mentioned on the respective graph.
To determine whether the partition of measurements into categories (executed
on a certain weekday, scheduled to a certain queue/type of queue, etc.) is
reasonable, we use different tests. The t-test can be applied to two independent
vectors of data points and tests the hypothesis that the two samples are from
distributions with equal means. The Kruskal-Wallis test is a one-way analysis
of variance on the collected data and returns a p-value. The p-value is the
probability of obtaining, under the null hypothesis (in the Kruskal-Wallis test
the null hypothesis is that the samples from the different groups are equally
distributed), a result at least as extreme as the one observed.
If the p-value is smaller than the significance level, the null hypothesis is
rejected. The significance level is the probability of falsely rejecting the null
hypothesis (i.e. a false-positive probability) and is often set to 5%. If the p-
value exceeds the significance level, the only conclusion that can be derived is
that the hypothesis cannot be rejected at that significance level. Both tests are
applicable here: unlike ANOVA, the Kruskal-Wallis test does not assume a normal
distribution of the underlying measurement values at all, and the t-test is
robust against deviations from normality for large samples.
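As a sketch of the Kruskal-Wallis statistic used here (tie correction omitted for brevity): for two groups, H is approximately chi-square distributed with one degree of freedom under the null hypothesis, so H above the 5% critical value of 3.841 rejects equally distributed groups.

```python
def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic on the ranks of the pooled sample
    (tie correction omitted).  Under the null hypothesis of equally
    distributed groups, H is approximately chi-square distributed
    with k-1 degrees of freedom."""
    pooled = sorted(x for g in groups for x in g)
    # assign average ranks to tied values
    ranks = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        ranks[pooled[i]] = (i + 1 + j) / 2.0   # mean of ranks i+1 .. j
        i = j
    n = len(pooled)
    h = sum(sum(ranks[x] for x in g) ** 2 / float(len(g)) for g in groups)
    return 12.0 / (n * (n + 1)) * h - 3 * (n + 1)
```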
It must be kept in mind that such tests can merely give a hint and cannot
replace thorough interpretation of the data itself. In particular, there may
be statistically significant differences among certain groupings even when the
practical implications, i.e. the subjective latency difference in relation to the
total execution time or to other delay factors, are comparatively small.
Additionally, for large samples, as we have available in most cases, any
one-way analysis of variance is likely to produce smaller p-values than for
smaller samples. Hence, the tests will only be used in addition to interpretation
based on other statistical benchmarks.
The Kolmogorov-Smirnov test (abbreviated K-S test) is a goodness-of-fit test,
which means that it can be applied to check whether a gathered sample matches
a certain distribution, or whether two samples are from the same distribution.
The null hypothesis in the K-S test is that the two tested samples are from the
same distribution; as in the Kruskal-Wallis test, it can either be rejected
by the test or not be rejected at the given significance level.
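The two-sample K-S statistic is simply the largest vertical distance between the two empirical CDFs; a minimal sketch, leaving out the computation of the p-value:

```python
import bisect

def ecdf(sample):
    """Return the empirical CDF of a sample as a function of x."""
    xs = sorted(sample)
    n = float(len(xs))
    return lambda x: bisect.bisect_right(xs, x) / n

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic.  Both empirical CDFs
    are step functions, so the supremum of their distance is attained
    at one of the sample points."""
    fa, fb = ecdf(a), ecdf(b)
    return max(abs(fa(x) - fb(x)) for x in list(a) + list(b))
```

The statistic is then compared against a critical value that shrinks with the sample sizes; the larger the samples, the smaller the distance needed to reject the null hypothesis.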
An empirical way to evaluate the distribution of the data is to plot its
empirical cumulative distribution function (CDF) against the CDFs of well-
known distributions. This method is not as exact as a K-S test, but it can point
out tendencies and give hints concerning the distribution of the sample. The
CDF is defined as

F(x) = ∫_{−∞}^{x} f(u) du

where f is the probability density function, i.e. at point x, the CDF rep-
resents the summed probability of the value x and all values smaller than x in
the sample.
The above-mentioned techniques become interesting if they can be used in
a practical context, e.g. if timeout values for a timeout-and-resubmission
strategy can be derived from them. In [17], the authors propose the following
model for computing the optimal timeout value:

E_j(t_∞) = (1 / F_R(t_∞)) ∫_0^{t_∞} u f_R(u) du + t_∞ / ((1 − ρ) F_R(t_∞)) − t_∞    (4.1)

where t_∞ denotes the timeout value, F_R the CDF and f_R the probability
density function of the samples, and ρ is the probability of outliers.
We are going to use this model to estimate the expected execution time for
timeout values in our analysis. However, it must be critically remarked that even
though the model takes the outlier ratio into account, it only considers the past
distribution at each evaluation point and not the length of the remaining tail.
Hence it may be too optimistic on long-tailed distributions. For the sake of
simplicity, we are going to refer to the above equation from now on simply as
the timeout model.
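The timeout model can be evaluated directly on a measured sample by replacing the integral with the empirical mean over the values below t∞ and F_R with the empirical CDF; the sample in the usage example below is made up.

```python
def expected_time(samples, t, rho):
    """Discrete evaluation of the timeout model (equation 4.1): the
    integral of u*f_R(u) up to t becomes the sum of sample values <= t
    divided by the sample size, and F_R(t) becomes the empirical CDF."""
    n = float(len(samples))
    below = [x for x in samples if x <= t]
    if not below:
        return float("inf")       # F_R(t) = 0: the timeout always fires
    f_t = len(below) / n          # empirical CDF at t
    integral = sum(below) / n     # empirical integral of u*f_R(u)
    return integral / f_t + t / ((1.0 - rho) * f_t) - t

def optimal_timeout(samples, rho):
    """Candidate timeout (among the observed values) that minimises
    the expected execution time."""
    return min(sorted(set(samples)),
               key=lambda t: expected_time(samples, t, rho))
```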
4.1.2 SSH
4.1.3 Globus
The Globus experiment consists of jobs that were submitted between 2008-10-10
and 2008-10-14 and between 2008-11-07 and 2008-11-15 at 30-minute intervals
to the following Globus sites in the AustrianGrid1 (we will use the abbreviations
in brackets from now on):
• http://blade.labs.fhv.at/jobmanager-sge (blade.labs)
• http://schafberg.coma.sbg.ac.at (schafberg)
• http://altix1.jku.austriangrid.at/jobmanager-pbs (altix1.jku)
• http://hydra.gup.uni-linz.ac.at/jobmanager-pbs (hydra.gup)
Due to problems with firewalls, we used the active job polling mechanism in
Globus as opposed to GRAM callbacks (as shown in figure 2.3).
Active job polling contacts the resource broker, which polls GRAM.
4.1.4 gLite
Every 30 minutes, 3 jobs were submitted to the same resource broker in the
virtual organisation VOCE (VO for Central Europe)2 . The used resource broker
was skurut67-6.cesnet.cz (now replaced by wms1.egee.cesnet.cz). Since
the resource broker acts as a meta-scheduler, the decision about the queue in
which the job will be executed is made by the WMS (see section 2.2.2).
The measurements took place between 2008-09-22 and 2008-10-05 and 2008-10-
09 and 2008-10-14.
4.2.1 SSH
SSH has a median scheduling latency of 0 seconds; the mean of all samples is
39 milliseconds, while the variance is 7847 millisec², which corresponds to a
standard deviation of roughly 0.09 seconds. This means that nearly all SSH jobs
were scheduled at the same time as, or before, JavaGAT could report the
submission status. The latency of a few hundred milliseconds that some job
submissions exhibit is composed of network and authentication latency, but
overall it can be said that delays will hardly be noticeable by the end user.

1 http://www.austriangrid.at
2 http://egee.cesnet.cz/en/voce/
4.2.2 Globus
After evaluating the measured scheduling times in general, we have tried to
factor in influences on the scheduling latency of the date at which the job was
submitted and of the CE on which the job was executed. The results are
presented in the following sections.
Figure 4.1 shows Globus scheduling latencies that were measured during the
experiments. It can clearly be seen that, with the exception of a few outliers,
by far most jobs become executed after a scheduling delay below 10 seconds.
The mean latency value of all measurements is 3.9 seconds. The median
of the scheduling times is 0. This implies that in the used Austrian Grid
configuration with the Globus middleware, one can expect the jobs to get
scheduled almost instantly. Note that in this case 0 means that the job was
executed before or at the same time as the first middleware notification, i.e.
the wrapper script callback arrived before the middleware could notify the
testbench of the job's submission status.
The histogram in figure 4.2 shows the distribution of scheduling latencies
without extreme outliers.
It can be seen that a scheduling latency of 0 is by far the most probable.
Further scheduling latencies between 1 and 10 seconds are approximately
equiprobable. Hence, little surprisingly, the cumulative distribution function
(CDF) of the latency values starts at a high value at 0 and is light-tailed.
Figure 4.3 contains this CDF along with the CDFs of other well-known
distributions evaluated at the same mean and standard deviation. Truncated
Gaussian refers to a normal distribution centred at the mean of the measured
data, but without negative values, which cannot occur in a scheduling process.
The latencies could probably be approximately modelled by a mixture of the
log-normal and the exponential distribution. Our claim of light-tailed behaviour
is confirmed by the decay rate, which is faster than that of the exponential
model.
[Figures 4.1 to 4.3: Globus scheduling latencies per submission date (with median and mean), histogram of the latency distribution without extreme outliers, and empirical CDF of the scheduling latency compared with exponential, log-normal and truncated Gaussian CDFs]

About 79% of all job submissions become executed at their compute element
without any noticeable latency, and over 90% may consume the required
resources within 7 seconds. All latencies higher than 12 seconds can be
considered outliers.
Since the tail of the scheduling latency distribution roughly approximates an
exponential distribution, setting a timeout value in a timeout-and-resubmission
strategy necessarily implies the loss of jobs with a certain probability. However,
by taking the last non-outlier value of 12 seconds, the probability of terminating
a job that would not be stuck indefinitely is already lower than 2%. Since in
the AustrianGrid higher latencies form not only outliers but extreme outliers
(greater than three times the interquartile range), it would be a good strategy
to resubmit jobs after a 12-second timeout.
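Such a timeout-and-resubmission strategy can be sketched generically; the three callbacks below stand in for real middleware operations and are assumptions for illustration, not the testbench's actual API.

```python
def submit_with_timeout(submit, scheduled_within, cancel, timeout,
                        max_attempts):
    """Sketch of a timeout-and-resubmission strategy: a job that is not
    scheduled within `timeout` seconds is cancelled and resubmitted.
    The callbacks are placeholders for submission, waiting for the
    scheduled state, and cancellation."""
    for attempt in range(1, max_attempts + 1):
        job = submit()
        if scheduled_within(job, timeout):   # True if scheduled in time
            return job, attempt
        cancel(job)                          # presumably stuck: kill and retry
    raise RuntimeError("job not scheduled after %d attempts" % max_attempts)
```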
Grouping the results according to the day of the week on which the job was
scheduled exposes no visible differences (figure 4.4). Applying the Kruskal-
Wallis test yields a p-value of 4.7 · 10^-5, which attests to differently
distributed samples on the millisecond scale. However, the millisecond scale is
of little practical relevance when considering the latency as it is perceived by
the user. Rounding the values to seconds and reapplying the Kruskal-Wallis test
still yields 0.0141, which is below the significance level.
However, since large samples are more likely to produce small p-values and
the median is the same for all days, we conclude that there are weekday
differences in the scheduling behaviour, but that they are negligible from a
practical perspective.
There is no visible correlation between the time-period of the day and the
scheduling latency. Scheduling latency is not lower at the typical work periods
as compared to typical spare time periods (figure 4.5).
The Kruskal-Wallis test result does not contradict that statement.
dian scheduling times at the PBS queues, since our job was fairly short compared
to other Grid workloads.
Hence, the scheduling overhead in Globus seems to depend more on the re-
motely deployed load balancer than on the middleware. However, deriving de-
tailed results with respect to different load-balancing products would require
more in-depth investigation.
4.2.3 gLite
With the gLite middleware and the used configuration, scheduling latency is
considerably higher than with the Globus middleware. The mean scheduling
latency is 135 seconds and the median latency in gLite is 91 seconds. Figure 4.7
shows the distribution of the scheduling latency values.
Over 90% of the jobs are scheduled within 200 seconds, as can be seen from the
empirical cumulative distribution function in figure 4.8.
When comparing the gLite scheduling CDF to the exponential function, it
becomes obvious that the higher latency values of gLite form a heavy tail, since
the distribution decays more slowly for higher values than the respective
exponential distribution. The third quartile is 120 seconds, the IQR is 77 and
the upper whisker 235.5. Therefore jobs taking more than 351 seconds (3rd
quartile + 3·IQR) before getting scheduled must be considered extreme outliers
and should become subject to resubmission. The resulting loss of up to 5% of
otherwise completing jobs must be considered a good tradeoff, given the heavy
tail and the correspondingly low probability that the respective jobs will still
be scheduled in what the user may perceive as a short time after that waiting
period.
Feeding our data into the timeout model yields 209 seconds as the optimal
timeout value and confirms the observation of the authors that higher time-
out values penalize the execution time behaviour less than aggressive ones (fig-
ure 4.9).
[Figures 4.7 to 4.9: distribution of the gLite scheduling latency values per submission date, empirical CDF of the gLite scheduling latency compared with exponential, log-normal and truncated Gaussian CDFs, and expected execution time as a function of the timeout value]
In figure 4.12 we use the abbreviations listed in table 4.1 for the different queues
to which our job was submitted by the WMS.
The queues show significantly different scheduling latencies. A clear pattern
can be seen with respect to the deployed queue load-balancer. 8 queues are
managed by PBS and 7 queues by the LCG-PBS load-balancer. While queues
configured with PBS scheduled jobs with a small median latency but a high
latency variance, the opposite holds for queues configured with the LCG-PBS
system, where jobs were scheduled with a high median latency but a small
latency variance. Among the queues on which our jobs were scheduled, there is
one queue managed by the SGE load balancer (SRCE), which shows a low
median scheduling latency and high latency variance, but of course no
conclusions can be drawn from the measurement results of just one queue.

Table 4.1: Abbreviations for the queues to which our jobs were scheduled

SRCE ce1-egee.srce.hr:2119/jobmanager-sge-prod
NIIF egee-ce.grid.niif.hu:2119/jobmanager-pbs-voce
LINZ egee-ce1.gup.uni-linz.ac.at:2119/jobmanager-pbs-voce
ELTE eszakigrid66.inf.elte.hu:2119/jobmanager-lcgpbs-voce
POZNAN ce.reef.man.poznan.pl:2119/jobmanager-pbs-voce
BME ce.hpc.iit.bme.hu:2119/jobmanager-lcgpbs-long
IRB egee.irb.hr:2119/jobmanager-lcgpbs-grid
CYF ce.cyf-kr.edu.pl:2119/jobmanager-pbs-voce
AMU pearl.amu.edu.pl:2119/jobmanager-lcgpbs-voce
WROC dwarf.wcss.wroc.pl:2119/jobmanager-lcgpbs-voce
SAVBA ce.ui.savba.sk:2119/jobmanager-pbs-voce
KFKI grid109.kfki.hu:2119/jobmanager-lcgpbs-voce
CESNET ce2.egee.cesnet.cz:2119/jobmanager-pbs-egee voce
TUKE ce.grid.tuke.sk:2119/jobmanager-pbs-voce
IJS lcgce.ijs.si:2119/jobmanager-pbs-voce
OEAW hephygr.oeaw.ac.at:2119/jobmanager-lcgpbs-voce

[Figure 4.12: gLite scheduling latency per queue (seconds, logarithmic scale)]
From the perspective of timeout-and-resubmission strategies, low latency
variance is more desirable than a low median latency, because the system
behaviour becomes more predictable. Hence, unless many time-critical small
jobs or complex workflows are submitted to EGEE, the LCG-PBS queues perform
better. LCG-PBS is a PBS wrapper that increases PBS scalability by allowing
the monitoring of all jobs from a user through the same resource broker. The
reason for this fine-tuning of PBS is the need for a higher scalability inherent in
a meta-scheduling architecture such as gLite [38]. Using the adapted version of
PBS seems to be advantageous in terms of scheduling latency.
Patryk Lasoń, the administrator of the CYF (cyfronet) CE, has kindly provided
us with the opportunity to run our tests with different CE load-balancer
configuration options. The CYF CE is configured with the Maui cluster
scheduler [39]. Maui is available as a pluggable queue to different load
balancers such as PBS or LSF and uses them for resource management, while
providing Maui-specific functionality such as reservations and policies to the
outside. Maui has to poll PBS for node information. Maui's scheduling system
is based on a priority mechanism, which runs and makes reservations for the
highest-priority jobs and tries to fit lower-priority jobs into gaps in the
reservation system. The interval at which Maui polls PBS is determined by the
RMPOLLINTERVAL configuration option.
We have measured the performance of the Maui-managed cyfronet queue with
a RMPOLLINTERVAL setting of 60 seconds and later with a setting of 5 sec-
onds. The resulting boxplots can be seen in figure 4.13.
The median of the results obtained with the 5 second RMPOLLINTERVAL is
127 seconds, while the median of the 60 second RMPOLLINTERVAL measure-
ment is 135 seconds. This marginal improvement comes at the cost of a much
higher latency variance: the variance of the first (60 second) sample is
23905 sec², whereas the second (5 second) sample shows a variance of
196430 sec². These variances correspond to standard deviations of 155 and
443 seconds, respectively.
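The quoted standard deviations follow directly from the measured variances:

```python
import math

# Standard deviations from the measured variances (sec^2) of the two
# RMPOLLINTERVAL samples discussed above.
std_60s = math.sqrt(23905)     # 60 s poll interval sample -> ~155 s
std_05s = math.sqrt(196430)    # 5 s poll interval sample -> ~443 s
```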
Due to the larger variance, the mean value of the queue with the small poll
interval is also considerably larger: the mean scheduling latency on the
[Figure 4.13: Boxplots of gLite scheduling latency with RMPOLLINTERVAL = 60 s and 5 s (logarithmic scale)]
Figure 4.14: CDF of gLite scheduling latency with different Maui poll intervals
4.3.1 SSH
We have evaluated a sample of 132 jobs that were executed on the different
desktop machines they were submitted to.
The overhead between the RUNNING notification and actual execution could not
be measured, because the SSH JavaGAT adaptor did not consistently report the
RUNNING state. The post-execution overhead is 787 milliseconds in the median
and 885 milliseconds in the mean. The variance is 2140800 millisec², and thus
the standard deviation is 1463 milliseconds. All values already include
potential overhead added by the JavaGAT API.
Hence, middleware notification overhead in SSH is below noticeability when
compared to typical payload execution times. However, it must be considered
that SSH by itself is not a middleware in the common sense of the term.
4.3.2 Globus
Out of 2098 submissions, 306 states needed to be forcibly terminated after the
timeout period, which corresponds to a ratio of jobs without excessive waiting
times of about 85%.
Overall IS overhead
[Figures: Globus IS overhead per submission date before and after job execution (with median and mean), histograms of the overhead distributions, and empirical CDFs of the overhead compared with exponential, log-normal and truncated Gaussian models]
fact that the histogram shows a spike points to the influence that CE-reporting
polling intervals might have.
Consequently, the differences could result from GRAM’s job monitoring mech-
anisms (and the polling intervals) or from properties of the underlying load-
balancing systems that are polled by GRAM. However, unlike in section 4.2.2,
no clear patterns with respect to the load-balancing system can be observed,
because the two PBS sites behave differently (figure 4.18). The information
service overhead is highest for the SGE system, but this can also be due to
the local GRAM installation of the site. In the measured sample, information
service overhead can be as high as 15 seconds even on the system with a pure
fork load-balancer.
Interestingly, GRAM can deliver most RUNNING notifications within 3 and 4
seconds, respectively, on two of the four systems, which intriguingly run
different schedulers. Hence, without a larger sample regarding the
load-balancers/schedulers, no exact conclusions can be drawn.
4.3.3 gLite
Out of 2655 jobs, 689 had to be terminated because they did not finish within
2700 seconds. Accordingly, based on the collected data, the probability that a
gLite job submission will not end up waiting indefinitely in some state is
approximately 74%.
The observation from section 4.3.2, that notification overhead is different for pre-
and post-job execution notifications, also holds for gLite. Figure 4.19 shows the
overheads for pre- and post-execution notifications.
As in Globus, the post-execution overhead is higher than the pre-execution
overhead. However, here the difference lies in the dimension of a factor of
ten. The median value for pre-execution overhead is 9 seconds and the mean
value 28 seconds; for post-execution overhead, these values are 198 and 209
seconds, respectively. The variance is also larger for post-execution
notifications: analysing the gathered data yields a variance of 2078 sec² for
pre-execution overhead and 17352 sec² for post-execution overhead.
The histogram of the pre-execution overhead is rather unsurprising, with a
global maximum around the median, but smaller local maxima up to the highest
measured delay. The histogram of the post-execution overhead shows two spikes
around 200 and 300 seconds with a local minimum in between. Both histograms
are depicted in figure 4.20.
[Figure 4.18: Globus IS overhead (sec) per Grid site: blade.labs, altix1.jku, schafberg, hydra.gup]
Figure 4.19: Middleware notification overhead before and after job execution in
gLite
[Figure 4.20: Histograms of the gLite IS overhead (sec) before and after job execution]
60
Empirical CDF
1
0.9
0.8
0.7
0.6
F(x)
0.5
0.4
0.3
0.2
Latency cdf
0.1 Exponential
Log−normal
Trunc. Gaussian
0
0 20 40 60 80 100 120 140 160
Latency values (sec)
Empirical CDF
1
0.9
0.8
0.7
0.6
F(x)
0.5
0.4
0.3
0.2
Latency cdf
0.1 Exponential
Log−normal
Trunc. Gaussian
0
0 100 200 300 400 500 600 700 800 900 1000
Latency values (sec)
61
[Figure: expected execution time as a function of the timeout value (sec)]
Information service overhead with respect to CE
Apart from the insight that gLite's hierarchical logging system causes large
information service overhead and overhead burstiness, it was within the scope
of our interest to quantify the influence of the CE and of the queue load
balancer used. For Globus we have shown that there are strong indications of
an influence by those components.
Analysis of the pre-execution middleware overhead per queue reveals significant
differences. It can be seen from the box plots displayed in figure 4.23 that
especially the overhead for pre-execution notification exhibits a strong queue
dependence.
CESNET and ELTE are obvious outliers and testify to the CE's significance
regarding IS overhead. Furthermore, the influence of the specific load
balancer is an interesting aspect. Queues managed by LCG-PBS have a median
pre-execution reporting overhead of 8 seconds, a mean overhead of 40 seconds
and a variance of 3346 sec2 . All values are lower for PBS, the median being 3
seconds, the mean 11 seconds and the variance 701 sec². According to the t-test,
the samples from the two load balancers stem from different distributions.
Therefore, the collected data allows the interpretation that information
service overhead between the actual running state and the RUNNING notification
mainly originates from the used scheduler/load-balancer and the CE
configuration. The LCG enhancement to PBS, which allows the central resource
broker to monitor all submitted jobs, seems to negatively influence IS overhead.
Basically, this is the same for post-execution IS overhead. Queues managed
by LCG-PBS report job completion with higher mean (270 vs. 169 seconds) and
median latency (250 vs. 168 seconds) than PBS and vary more in the expected
notification duration (the variance for LCG-PBS is 28843 sec2 , whereas it is
5041 sec2 for PBS). The t-test considers the LCG-PBS and PBS samples to be
differently distributed. However, unlike in pre-execution IS overhead measure-
ments, the median overhead value is greater than 100 seconds for all queues
equally, which strongly suggests that the hierarchical multistage logging process
in gLite’s L & B mechanism may cause parts of the overall delay.
[Figure 4.23: gLite IS overhead per queue, before execution (seconds) and after execution (seconds, logarithmic scale)]
Clearing usually involves deletion of the sandbox files and directories.
4.4.1 Globus
The Globus post-staging process exposes, as shown in figure 4.24, relatively low
absolute latency values, with a median of 6 seconds and a mean of 7.3 seconds,
and a low variance of 12 sec².
However, the tested sites behave differently, which can be seen in the boxplots
in figure 4.25. blade.labs has a near-to-zero variance, while hydra.gup,
altix1.jku and schafberg have a larger variance, with many outliers showing
up at altix1.jku. The post-staging latency does not depend much on the
middleware, since it is exclusively dependent on the speed of the GridFTP
transactions.
GridFTP has some overhead compared to FTP, which originates from the need
for GSI authentication. However, the performance of blade.labs shows that
file transfer within Globus environments can be quite a deterministic process in
terms of latency. The variance of the other sites is also comparatively low and
may have its cause in network effects.
4.4.2 gLite
We have no results for gLite, but since the post-staging process is exactly the same as in Globus, with the WMS acting as the remote GridFTP endpoint, the post-staging results can be expected to be quite similar.
4.5 Comparison
The values that we have presented in our analysis of the different middlewares are summarised in table 4.2. The optimal timeout values correspond to the numbers obtained with the timeout model of equation 4.1 for our samples. All results computed with the model show that overestimating the timeout value has less severe consequences than underestimating it.
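The asymmetry between over- and underestimation can be seen directly in a resubmission model of this kind. The sketch below assumes i.i.d. attempt latencies and resubmission after every expired timeout t, and estimates the expected completion time from an empirical latency sample; it follows the spirit of equation 4.1 and [17], not necessarily their exact form:

```java
// Expected completion time under a timeout-and-resubmit policy, estimated
// from an empirical latency sample. Sketch assuming i.i.d. attempts.
public class TimeoutModel {
    /** Expected completion time when every attempt is cancelled and
     *  resubmitted after timeout t. */
    public static double expectedCost(double[] latencies, double t) {
        double sumBelow = 0.0;
        int nBelow = 0;
        for (double l : latencies) {
            if (l <= t) {
                sumBelow += l;
                nBelow++;
            }
        }
        if (nBelow == 0) {
            // Timeout below every observed latency: the job never completes.
            return Double.POSITIVE_INFINITY;
        }
        double p = (double) nBelow / latencies.length;  // P(L <= t)
        double meanBelow = sumBelow / nBelow;           // E[L | L <= t]
        // Each failed attempt costs t; the expected number of failures
        // before a success is (1 - p) / p for i.i.d. attempts.
        return meanBelow + t * (1 - p) / p;
    }
}
```

With the illustrative sample {10, 10, 10, 200} seconds, a generous timeout of 250 s yields an expected 57.5 s, while a timeout of 5 s, below every observed latency, never completes at all; overestimation only wastes bounded time per failure, underestimation can be fatal.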
The overhead of SSH with respect to scheduling and information service latency is close to zero, which is hardly surprising: SSH only forwards commands and data (via scp) and provides none of the functionality of Grid middlewares, such as meta-scheduling, matchmaking, job state tracking and queueing. The queue is just a simple fork, which creates a new process that becomes subject to operating system scheduling. Hence, there is also close-to-zero scheduling overhead because jobs may
start instantly. The values measured for SSH merely serve as a base reference, confirming that any overhead introduced by JavaGAT and the testbench itself is negligible when compared to the measured values.

[Figure 4.24: Post-staging overhead in Globus; latency values in seconds (median and mean) over the submission date (not to scale), shown at two scales. Figure 4.25: Data staging latencies per Globus site; post-staging latency in seconds on a logarithmic scale.]
Status update notification can thus be considered reliable in Globus. gLite, by contrast, shows outliers in its overhead, in both pre-execution and post-execution delays. Especially in post-execution overhead, the outliers form a long tail. Therefore gLite cannot guarantee reliable status updates, and waiting for job completion must be bounded by timeouts in practice. It is questionable whether gLite's hierarchical logging and bookkeeping process is optimal for reporting. The different poll intervals in the stages of this process manifest themselves as local maxima in the histograms of the overhead measurements.
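How cascaded poll intervals produce such structured delays can be illustrated with a small deterministic sketch. The interval values below are hypothetical, not gLite's actual configuration; each stage only observes a status change at its next poll, so the residual waiting times of the stages accumulate:

```java
// Illustrative sketch: a status change passes through a chain of pollers,
// each of which only notices it at its next poll tick. Interval values
// are hypothetical, not taken from any real gLite deployment.
public class CascadedPolling {
    /** Time at which an event occurring at 'event' becomes visible to the
     *  user after passing through pollers with the given intervals. */
    public static double visibleAt(double event, double[] pollIntervals) {
        double t = event;
        for (double interval : pollIntervals) {
            // Wait until this stage's next poll at or after time t.
            t = Math.ceil(t / interval) * interval;
        }
        return t;
    }
}
```

For instance, with stages polling every 5 and 30 seconds, an event at second 1 becomes visible at second 30 and an event at second 31 only at second 60; sweeping the event time over one period produces the plateaus and local maxima of the kind seen in the overhead histograms.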
For both Globus and gLite, post-execution notification overhead is larger than pre-execution overhead. Since the overhead changes with the chosen CE, it can be expected that some load-balancers are slow in registering job completion. We could not identify a clear load-balancer dependency for the four tested Globus sites, but for gLite it is quite obvious that the LCG-PBS scheduler performs worse than the PBS scheduler. LCG-PBS attempts to enhance scalability by allowing the central resource broker to track all jobs submitted by the same user; this suggests that the worse performance of LCG-PBS stems from this modification.
Furthermore, the post-execution notification delay in gLite is far more disproportionate, relative to the pre-execution delay, than in Globus. While in Globus the difference between the medians is about a factor of 2, in gLite it is more than a factor of 20. Accordingly, in direct comparison to Globus, gLite performs poorly at reporting job completion, even considering the larger scale of EGEE. Since the logging process is the same for pre-execution and post-execution reporting, the conclusion remains that the interaction between the local resource management software and the CE-deployed job status monitor is suboptimal.
Chapter 5
Conclusions
The reporting system in gLite, with the CE's monitor forwarding information received from the LRMS to a multi-staged L&B system, leaves room for improvement.
Because our testbench identifies latencies at all stages of execution, it is possible to time out and resubmit a job at nearly every execution state if reasonable waiting times are exceeded. We presented such timeout values, derived both from intuitive considerations and from the latency model introduced in [17].
With our testbench, new data on execution behaviour can be gathered every time a middleware environment improves significantly with respect to the perceived delays, as has happened frequently in EGEE in the last few years. Thanks to the modular design of the testbench and the fact that only very little code is middleware-dependent, new test cases can easily be created on demand.
Appendix A
gLite Documentation
Glite Resource Broker Adapter Readme
Thomas Zangerl
January 7, 2009
Contents
1 VOMS-Proxy Creation 3
1.1 Frequent errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.1 ”Unknown CA” error . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.2 ”Error while setting CRLs” . . . . . . . . . . . . . . . . . . . . . . 3
1.1.3 ”pad block corrupted” . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.4 Could not create VOMS proxy! failed: null: . . . . . . . . . . . . . 3
1.2 Preference keys for VOMS Proxy creation . . . . . . . . . . . . . . . . . . 4
1.3 Minimum configuration to make VOMS-Proxy creation work . . . . . . . 4
1.4 (Not) reusing the VOMS proxy . . . . . . . . . . . . . . . . . . . . . . . . 5
2 The gLite-Adaptor 6
2.1 Adaptor-specific preference keys . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Supported SoftwareDescription attributes . . . . . . . . . . . . . . . . . . 6
2.3 Supported additional attributes . . . . . . . . . . . . . . . . . . . . . . . . 7
2.4 Setting arbitrary GLUE requirements . . . . . . . . . . . . . . . . . . . . 7
1 VOMS-Proxy Creation
1.1 Frequent errors
1.1.1 ”Unknown CA” error
The proxy classes report an ”Unknown CA” error (Could not get stream from secure socket). Most likely, the VomsProxyManager is missing either your root certificate or the root certificate of the server you are communicating with. It is best to include all needed certificates in ~/.globus/certificates/ (e.g. you can copy the /etc/grid-security/certificates directory from a UI machine of the VO you are trying to work with to that location).
If this doesn't suffice, try placing a file called cog.properties in the ~/.globus/ directory. The content of this file could be something like this:
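A minimal cog.properties (a sketch; the exact keys depend on your Java CoG installation, and all paths below are illustrative) points the CoG kit at the user credentials and the CA certificate directory:

```properties
# Illustrative cog.properties; adjust paths to your environment.
usercert=/home/user/.globus/usercert.pem
userkey=/home/user/.globus/userkey.pem
cacert=/home/user/.globus/certificates/
```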
vomsHostDN           distinguished name of the VOMS host
                     (example: /DC=at/DC=uorg/O=org/CN=somesite) - compulsory
vomsServerURL        URL of the VOMS server, without protocol
                     (example: skurut19.cesnet.cz) - compulsory
vomsServerPort       port on which to connect to the VOMS server
                     (example: 7001) - compulsory
VirtualOrganisation  name of the virtual organisation for which the VOMS proxy
                     is created (example: voce) - compulsory
vomsLifetime         the desired proxy lifetime in seconds
                     (example: 3600) - optional
– VirtualOrganisation
2 The gLite-Adaptor
2.1 Adaptor-specific preference keys
The mechanisms provided by the GAT API alone did not suffice to provide all the control we found desirable for the adaptor. Hence, a few proprietary preference keys were introduced. They are useful for controlling adaptor behaviour but are by no means necessary if one just wants to use the adaptor. Nonetheless, they are documented here. To avoid confusion, all of them take strings as values, even if the name would suggest an integer or boolean.
If you want to use them, set them using preferences.put(); for example write
preferences.put("glite.pollIntervalSecs", "15");
context.addPreferences(preferences);
• glite.pollIntervalSecs - how often should the job lookup thread poll the WMS for
job status updates and fire MetricEvents with status updates (value in seconds,
default 3 seconds)
• glite.deleteJDL - if this is set to true, the JDL file used for job submission will be
deleted when the job is done (”true”/”false”, default is ”false”)
• glite.newProxy - if this is set to true, create a new proxy even if the lifetime of the
old one is still sufficient
2.3 Supported additional attributes
The attribute glite.DataRequirements.InputData can be set. It expects an ArrayList<String> as value, containing one or more InputData LFNs or GUIDs, which the matchmaker uses during scheduling to decide on which CE the job is going to be scheduled. Normally, this is a CE ”close” (i.e. with low latency) to the SE.
Appendix B
GridLab license
GRIDLAB OPEN SOURCE LICENSE
The GridLab licence allows software to be used by anyone and for any purpose, without restriction. We believe that this is the best way to ensure that Grid technologies gain widespread acceptance and benefit from a large developer community.
Copyright (c) 2002 GridLab Consortium. All rights reserved.
This software includes voluntary contributions made to the EU GridLab Project by the Consortium Members: Instytut Chemii Bioorganicznej PAN, Poznańskie Centrum Superkomputerowo-Sieciowe (PSNC), Poznań, Poland; Max-Planck-Institut fuer Gravitationsphysik (AEI), Golm/Potsdam, Germany; Konrad-Zuse-Zentrum fuer Informationstechnik (ZIB), Berlin, Germany; Masaryk University, Brno, Czech Republic; MTA SZTAKI, Budapest, Hungary; Vrije Universiteit (VU), Amsterdam, The Netherlands; ISUFI/High Performance Computing Center (ISUFI/HPCC), Lecce, Italy; Cardiff University, Cardiff, Wales; National Technical University of Athens (NTUA), Athens, Greece; Sun Microsystems Gridware GmbH, Germany; HP Competency Center, France.
Installation, use, reproduction, display, modification and redistribution with
or without modification, in source and binary forms, is permitted provided that
the following conditions are met:
3. You are under no obligation to provide anyone with bug fixes, patches, up-
grades or other modifications, enhancements or derivatives of the features,
functionality or performance of any software you provide under this license.
However, if you publish or distribute your modifications, enhancements or
derivative works without contemporaneously requiring users to enter into a
separate written license agreement, then you are deemed to have granted
GridLab Consortium a worldwide, non-exclusive, royalty-free, perpetual
license to install, use, reproduce, display, modify, redistribute and sub-
license your modifications, enhancements or derivative works, whether in
binary or source code form, under the license stated in this list of condi-
tions.
4. DISCLAIMER
THIS SOFTWARE IS PROVIDED ”AS IS” AND ANY EXPRESS
OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
THE GRIDLAB CONSORTIUM MAKES NO REPRESENTATION THAT
THE SOFTWARE, ITS MODIFICATIONS, ENHANCEMENTS OR
DERIVATIVE WORK THEREOF WILL NOT INFRINGE PRIVATELY
OWNED RIGHTS. IN NO EVENT SHALL THE COPYRIGHT OWNER
OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCURE-
MENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLI-
GENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY
OF SUCH DAMAGE.
List of Figures
1.1 The Grid layer model with the components of each layer . . . . . 4
1.2 The middleware as a transparent component between APIs and
services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
4.16 Histograms of middleware notification overhead in Globus . . . . 55
4.17 CDFs of middleware notification overhead in Globus . . . . . . . 56
4.18 Middleware notification overhead by Globus site . . . . . . . . . 58
4.19 Middleware notification overhead before and after job execution
in gLite . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.20 gLite IS overhead histograms . . . . . . . . . . . . . . . . . . . . 60
4.21 gLite IS overhead CDFs . . . . . . . . . . . . . . . . . . . . . . . 61
4.22 Expected notification times (y-axis) in relation to timeout values 62
4.23 gLite IS overheads per queue . . . . . . . . . . . . . . . . . . . . 64
4.24 Post staging overhead in Globus . . . . . . . . . . . . . . . . . . 66
4.25 Data staging latencies per Globus site . . . . . . . . . . . . . . . 67
List of Tables
Bibliography
[2] Ian Foster. The Grid: Blueprint for a New Computing Infrastructure.
Morgan-Kaufman, 1999.
[3] Ian Foster. What is the Grid? A three point checklist. GRIDtoday, 1(6), July 2002.
[9] Rob V. van Nieuwpoort, Thilo Kielmann, and Henri E. Bal. User-friendly
and reliable grid computing based on imperfect middleware. In Proceedings of the ACM/IEEE Conference on Supercomputing (SC’07), November 2007.
Online at http://www.supercomp.org.
[10] Marik Marshak and Hanoch Levy. Evaluating web user perceived latency
using server side measurements. Computer Communications, 26(8):872–
887, 2003.
[11] Ahuva W. Mu’alem and Dror G. Feitelson. Utilization, predictability, workloads, and user runtime estimates in scheduling the IBM SP2 with backfilling. IEEE Transactions on Parallel and Distributed Systems, 12(6):529–543, 2001.
[16] Diane Lingrand, Johan Montagnat, and Tristan Glatard. Modeling the
latency on production grids with respect to the execution context. In CC-
GRID ’08: Proceedings of the 2008 Eighth IEEE International Symposium
on Cluster Computing and the Grid (CCGRID), pages 753–758, Washing-
ton, DC, USA, 2008. IEEE Computer Society.
[17] Tristan Glatard, Johan Montagnat, and Xavier Pennec. Optimizing jobs
timeouts on clusters and production grids. In International Symposium on
Cluster Computing and the Grid (CCGrid), pages 100–107, Rio de Janeiro,
May 2007. IEEE.
[18] T. Ylonen and C. Lonvick. The Secure Shell (SSH) Protocol Architecture.
RFC 4251 (Proposed Standard), January 2006.
[19] Ian T. Foster. Globus toolkit version 4: Software for service-oriented sys-
tems. In Hai Jin, Daniel A. Reed, and Wenbin Jiang, editors, NPC, volume
3779 of Lecture Notes in Computer Science, pages 2–13. Springer, 2005.
[20] Von Welch, Ian Foster, Carl Kesselman, Olle Mulmo, Laura Pearlman,
Jarek Gawor, Sam Meder, and Frank Siebenlist. X.509 proxy certificates
for dynamic delegation. In Proceedings of the 3rd Annual PKI R&D
Workshop, 2004.
[21] The OpenSSL project. website. http://www.openssl.org.
[22] Karl Czajkowski, Ian T. Foster, Nicholas T. Karonis, Carl Kesselman, Stu-
art Martin, Warren Smith, and Steven Tuecke. A resource management
architecture for metacomputing systems. In IPPS/SPDP ’98: Proceedings
of the Workshop on Job Scheduling Strategies for Parallel Processing, pages
62–82, London, UK, 1998. Springer-Verlag.
[23] Karl Czajkowski, Ian T. Foster, and Carl Kesselman. Resource manage-
ment for ultra-scale computational grid applications. In PARA ’98: Pro-
ceedings of the 4th International Workshop on Applied Parallel Comput-
ing, Large Scale Scientific and Industrial Problems, pages 88–94. Springer-
Verlag, 1998.
[25] Sergio Andreozzi, Stephen Burke, Felix Ehm, Laurence Field, Gerson
Galang, Balazs Konya, Maarten Litmaath, Paul Millar, and JP Navarro.
GLUE Specification v2.0.42. OGF specification draft, May 2008.
[30] Erwin Laure. EGEE middleware architecture planning (release 2). EU Deliverable DJRA1.4, July 2005. http://edms.cern.ch/document/594698.
[31] Rajesh Raman, Miron Livny, and Marvin Solomon. Matchmaking: Dis-
tributed resource management for high throughput computing. In Proceed-
ings of the Seventh IEEE International Symposium on High Performance
Distributed Computing (HPDC7), Chicago, IL, July 1998.
[34] Tony Calanducci. LFC: the LCG File Catalog. Slides for NA3: User Training and Induction, June 2005. http://www.phenogrid.dur.ac.uk/howto/LFC.pdf.
[36] Gregor von Laszewski, Ian Foster, Jarek Gawor, and Peter Lane. A Java
Commodity Grid Kit. Concurrency and Computation: Practice and Expe-
rience, 13(8-9):643–662, 2001.
[37] Yoav Etsion and Dan Tsafrir. A short survey of commercial cluster batch
schedulers. Technical Report 2005-13, School of Computer Science and
Engineering, the Hebrew University, Jerusalem, Israel, May 2005.
[38] Jens Jensen, Graeme Stewart, Matthew Viljoen, David Wallom, and Steven Young. Practical Grid Interoperability: GridPP and the National Grid Service. In UK AHM 2007, April 2007.
[39] Brett Bode, David M. Halstead, Ricky Kendall, Zhou Lei, and David Jackson. The Portable Batch Scheduler and the Maui Scheduler on Linux clusters. In Proceedings of the 4th Annual Linux Showcase and Conference, 2000.