
Local Grid Scheduling Techniques using Performance Prediction
D.P. Spooner, S.A. Jarvis, J. Cao†, S. Saini‡ and G.R. Nudd

High Performance Systems Group, Department of Computer Science,
University of Warwick, Coventry, CV4 7AL, UK.
Tel: +44 24 7652 2863  Fax: +44 24 7657 3024
Email: dps@dcs.warwick.ac.uk

† C & C Research Laboratories, NEC Europe Ltd, St. Augustin, Germany.
‡ NASA Ames Research Center, Moffett Field, California, USA.

Abstract
The use of computational grids to provide an integrated computer platform, composed of differentiated and distributed systems, presents fundamental resource and workload management questions. Key services such as resource discovery, monitoring and scheduling are inherently more complicated in a grid environment where the resource pool is large, dynamic and architecturally diverse.

In this research, we approach the problem of grid workload management through the development of a multi-tiered scheduling architecture (TITAN) that employs a performance prediction system (PACE) and task distribution brokers to meet user-defined deadlines and improve resource usage efficiency. This paper focuses on the lowest tier, which is responsible for local scheduling. By coupling application performance data with scheduling heuristics, the architecture is able to balance the processes of minimising run-to-completion time and processor idle time, whilst adhering to service deadlines on a per-task basis.

1 Introduction

It is anticipated that grids will develop to deliver high performance computing capabilities with flexible resource sharing to dynamic virtual organisations [1]. Essential to this growth is the development of fundamental infrastructure services that will integrate and manage large heterogeneous pools of computing resources, and offer them seamlessly to differentiated users. A key grid service is workload management, which involves distributing tasks and data over a selection of resources and coordinating communication and transfer systems.

Grid computing itself originated from a new computing infrastructure for scientific research and cooperation [2] and is rapidly becoming an established technology for large-scale resource sharing and distributed system integration [3]. The Open Grid Services Architecture (OGSA [4]) and its associated implementation, the Globus toolkit [5], are becoming a standard platform for grid service and application development, based upon web service protocols and open standards.

It is a common misconception that grid computing will provide general access to free resources and high performance systems. This is not the case, however, and Foster et al. predict in [1] that providers will charge for their services through the various grid accounting protocols that are currently in development [6]. In this consumer-orientated model, quality-of-service (QoS) contracts must become an integral element of grid scheduling, where the perceived value of a service will steadily decrease as key service metrics such as deadline are not met.

In this research, a performance prediction system (PACE) is used to obtain parallel application performance data prior to run-time, allowing resource requirements to be anticipated and deadlines considered. PACE [7] models consist of application and resource objects, with a parametric evaluation engine that provides expected execution traces and run-time predictions for different resource configurations. When coupled with a suitable scheduling algorithm, this approach can enhance deadline scheduling on heterogeneous multi-processor systems, which motivates its application to grid computing.

The specific case of allocating independent tasks to multi-processor systems is examined in this work, where application performance models exist and the hardware has been characterised. The architecture, known as TITAN, adopts a conceptual multi-tiered approach to workload management, where standard Globus [5] grid protocols such as the Grid Index Information Service (GIIS [8]) and Globus Resource Allocation Management (GRAM [9]) are used at the top tier, service-providing brokers are used in the middle tier and localised scheduling systems are used at the lowest tier. This local and global partitioning provides a scalable platform for task and resource management.

This paper focuses on the lowest-level tier, where PACE is used to estimate execution times for a particular task on a given set of architecture types prior to run-time. This is combined with an iterative heuristic (genetic) algorithm to select schedules and organise the task queue according to a given set of metrics. The rationale for selecting such an algorithm is that it is able to adapt to minor changes in the problem space (such as a host disconnecting, or a user submitting a task), and can be guided by means of a fitness function to improve task allocation according to multiple metrics.

The main contribution of this work is to apply an established performance prediction tool to grid scheduling by use of an iterative heuristic algorithm. This paper introduces the main theoretical considerations for the development of such a scheduling system, and then describes the implemented system with results based on an experimental workload. The organisation of this paper is as follows: section 2 introduces the workload management problem and how a performance prediction system can assist; in section 3, the algorithm used in the lowest tier is presented along with the coding scheme. Section 4 discusses the implementation and section 5 presents and evaluates experimental data; related work and conclusions are given in sections 6 and 7.

2 Grid Workload Management

While grid computing is not aimed specifically at the high performance computing (HPC) community, many organisations regard the grid as a means to deliver commodity computation that exceeds the capabilities of their current systems. Along with computationally powerful systems, grid providers of this type will need to employ effective management tools in order to steer codes to appropriate processing resources.

In a static environment, application performance may be improved through the tuning of an appropriate algorithm, the re-mapping of data, or the adjustment of communication behaviour, all of which can be considered, to some extent, during development. In a dynamic grid setting, system-wide issues such as resource configuration, user contention, node failure and congestion can have a major impact on performance and are difficult factors to predict a priori, particularly when access to system information is restricted. Additionally, computational providers will invariably offer services to users with conflicting needs and requirements. Contract issues including deadline and response times must be balanced in order to meet the respective service-level agreements.

It is therefore non-trivial for workload management systems to select suitable resources for a particular task given a varied, dynamic resource pool. The search space for this multi-parameter scheduling problem is large and not fully defined until run-time. As a result, a just-in-time approach is employed in this research where performance models of an application are evaluated as tasks are submitted, allowing the system to consider run-time parameters prior to execution. The models are evaluated by local schedulers which are typically responsible for managing small to medium sized clusters of homogeneous processing nodes.

2.1 TITAN Architecture

TITAN adopts a loosely hierarchical structure consisting of workload managers, distribution brokers and Globus interoperability providers. The tiers map onto conceptual divisions between group, organisation and inter-organisational boundaries, as illustrated in Figure 1. In the current implementation, each distributed TITAN node can provide each or any of these services depending on configuration.

The purpose of the top tier is to provide Globus interoperability services. This includes passing GRAM requests to the distribution brokers and managing data that passes from the schedulers to the Globus information services. This represents the top of the TITAN chain, and hence information regarding schedulers and distribution brokers outside the organisation is obtained through the Globus Monitoring and Discovery Service (MDS). The middle tier consists of distribution broker systems which can manage subordinate brokers or schedulers. The brokers are based on the A4 advertising and discovery agent structure [10, 11]. The lowest tier consists of the schedulers that manage the physical computing resources.

2.2 Task Allocation

Tasks are submitted to the TITAN system via a text-based tool, a portal or (in development) GRAM. In each case, the request is delivered through the interoperability layer and received by a distribution broker.
Figure 1: The TITAN architecture consisting of distribution brokers that communicate with other brokers and local
schedulers to execute tasks in accordance with service policy constraints. An interoperability layer provides an interface
to Globus permitting job submission to a particular broker.

In this paper, each task must have an associated performance model, which is currently stored in a database indexed by task name, although a longer-term aim is to store details of the performance model in the Globus information service. When a task is submitted without a performance model, TITAN can be configured (by means of system policies) to move tasks to another broker, to a subset of resources in the cluster, or to the end or front of the current queue. While the lack of a performance model reduces TITAN to the functional equivalent of a batch scheduler, it is still able to make some decisions based on deadline, priority and resource requirement.

On receipt of an application request, the distribution broker interrogates its local scheduler (if one is associated) to ascertain whether the task can execute on the available resources and meet the user-specified deadline. The response is based upon evaluating a PACE model for the given task and resource type, and by examining the current state of the schedule queue. If the scheduler can meet the user-defined deadline, the task is submitted to the local manager for scheduling. Otherwise, the broker attempts to locate a scheduler that can meet the requirements through discovery and advertisement mechanisms.

Each broker can maintain details of subordinate schedulers and make this data available via the GIIS (acting as an information provider). This 'advertisement' strategy allows a foreign broker to interrogate the information service to determine whether a scheduler exists that can run a task and meet the deadline. If a suitable broker cannot be located by this method, the originating broker may initiate a 'discovery' process where it interrogates neighbouring brokers in the hierarchy. If these brokers are unable to meet the requirement, they pass on the request to their neighbours by means of a 'discovery step'. If an ideal scheduler cannot be located within a preset number of discovery steps, the task request is either rejected or passed to a scheduler that can minimise the deadline failure, depending on a task request parameter.

The balance of advertisement and discovery can be selected by a series of policies to suit the environment (and modified on-the-fly as required). Where task requests are relatively infrequent, the scheduling queues remain relatively static and so periodic advertisement of queue details to the information service is sufficient. Conversely, high load will result in many changes in the queue length and task mix. In this environment, brokers may need to 'discover' these details as necessary.
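The placement logic described above can be summarised as a short sketch. The following fragment is illustrative only: the callables standing in for the local scheduler test, the GIIS lookup, the neighbouring brokers and the best-effort fallback are assumptions made for this example, not part of the TITAN interface.

```python
# Illustrative sketch of the broker-side placement policy described above.
# The callables below stand in for TITAN components and are assumptions.

def place_task(request, local_can_meet, giis_lookup, neighbours,
               best_effort, steps_left=3):
    """Return a scheduler identifier, or None if the request is rejected."""
    # 1. Local scheduler: PACE-based check that the deadline can be met on
    #    the available resources, given the current schedule queue.
    if local_can_meet(request):
        return "local"

    # 2. 'Advertisement': interrogate the information service (GIIS) for a
    #    scheduler that has advertised it can run the task in time.
    advertised = giis_lookup(request)
    if advertised is not None:
        return advertised

    # 3. 'Discovery': interrogate neighbouring brokers, one discovery step
    #    per hop, up to a preset limit.
    if steps_left > 0:
        for neighbour in neighbours:
            found = neighbour(request, steps_left - 1)
            if found is not None:
                return found

    # 4. No ideal scheduler located: reject the request, or pass it to the
    #    scheduler that minimises the deadline failure, depending on a
    #    task request parameter.
    return best_effort(request) if request.get("allow_deadline_failure") else None
```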
2.3 Schedule Representation

When a suitable scheduler is located, the task request is passed from the broker to the lowest tier, whose objective is to organise tasks in a manner that satisfies contract policies and system metrics (such as the minimisation of execution time and idle time).

In this work, the localised scheduling problem is defined by the allocation of a group of independent tasks $T = \{T_0, T_1, \ldots, T_{n-1}\}$ to a network of hosts $H = \{H_0, H_1, \ldots, H_{m-1}\}$. Tasks are run in a designated order $\ell \in P(T)$, where $P$ is the set of permutations. $\ell_j$ defines the $j$th task of $\ell$ to execute, where $\forall \ell_j,\ \exists \pi_j \subseteq H,\ \pi_j \neq \emptyset$; that is, each task in $\ell$ has a suitable host architecture on which to run. To provide a compact representation of a schedule, a two-dimensional coding scheme is utilised:

$$S_k = \begin{array}{c|cccc} & \ell_0 & \ell_1 & \cdots & \ell_{n-1} \\ \hline H_0 & M_{0,0} & M_{0,1} & \cdots & M_{0,n-1} \\ H_1 & M_{1,0} & M_{1,1} & \cdots & M_{1,n-1} \\ \vdots & \vdots & \vdots & & \vdots \\ H_{m-1} & M_{m-1,0} & M_{m-1,1} & \cdots & M_{m-1,n-1} \end{array} \qquad (1)$$

where $S_k$ is the $k$th schedule in a set of schedules $S = \{S_0, S_1, \ldots, S_{p-1}\}$ and $M_{ij}$ is the mapping of hosts to a particular application:

$$M_{ij} = \begin{cases} 1, & \text{if } H_i \in \pi_j \\ 0, & \text{if } H_i \notin \pi_j \end{cases} \qquad (2)$$

For a set of identical processors, the predicted execution time for each task is a function, $p_{ext}(\pi_j)_j$, where $\pi_j$ is the set of allocated resources for task $\ell_j$. Each arrangement of $\pi_j$ is a performance scenario that will result in different run-time execution behaviour for a particular task. While codes exist that benefit from as many hosts as are available (i.e. embarrassingly parallel), most applications will experience degradation in performance as the communication costs outweigh the computation capability at a certain level of parallelism. Additional constraints such as memory size and hierarchy, node architecture and configuration will also affect application scalability. It is therefore important that the resource set ($\pi_j$) is selected appropriately depending on the task involved. Performance prediction can assist in this selection via accurate modelling of scaling behaviour using software and hardware characterisation.

In the case where all the processing nodes are identical (as in the case of a cluster), the number of different performance scenarios is equal to the number of hosts. On a conventional system, it would be straightforward to formulate the function $p_{ext}()_j$, either by retaining historical data or by one-shot executions. However, in a heterogeneous grid environment, with architectural diversity and dynamic system properties, the number of performance scenarios increases rapidly. For any non-trivial resource network, it would be difficult to obtain $p_{ext}(\pi_j)_j$ prior to scheduling.
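To make the coding scheme concrete, the sketch below shows one possible in-memory form of a schedule $S_k$: a q-ary task-ordering vector and a binary host map, as in expressions (1) and (2). The representation is an assumption for illustration; the actual TITAN data structures are not described at this level in the paper.

```python
# A minimal sketch of the two-dimensional coding scheme of expression (1).

n_tasks, n_hosts = 4, 3

# Task ordering part (q-ary coded): a permutation of task indices,
# so order[j] is the task run in position j of the schedule.
order = [1, 3, 0, 2]

# Host mapping part (binary coded): hostmap[i][j] = 1 if host H_i is
# allocated to the j-th task of the ordering (M_ij in expression (2)).
hostmap = [
    [1, 0, 1, 0],   # H0
    [1, 1, 0, 0],   # H1
    [0, 1, 0, 1],   # H2
]

def allocated_hosts(j):
    """Return the resource set for the j-th task of the ordering."""
    return {i for i in range(n_hosts) if hostmap[i][j] == 1}

# Each task must have at least one suitable host, as required in section 2.3.
assert all(allocated_hosts(j) for j in range(n_tasks))
print([sorted(allocated_hosts(j)) for j in range(n_tasks)])
```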
which can be used to establish prediction metrics
for the execution of a particular application on a tar-
2.4 Performance Prediction get architectural platform. The level of detail in the
The PACE system provides a method to obtain model can be chosen so as to improve accuracy and
pext ( j ) dynamically, given an application model models are accumulated so that they can be reused.
and suitable hardware descriptions. The hardware As the models are modular it is possible to evaluate
(resource) descriptions are generated when the re- pext ()j with differing host architectures, simply by
sources are configured for use by TITAN, and the ap- replacing the resource object. This facilitates the ap-
plication models are generated prior to submission. plication of PACE to heterogeneous environments as
The PACE evaluation combines these components the application model can remain, for the most part,
with parameters to predict the execution time just– independent of the hardware. Examples of the use
in–time. The speed of evaluation negates the need of PACE include performance optimisation of finan-
to store performance timings in a database, although cial applications [12], on-the-fly performance analy-
TITAN uses a cache internally during its packing op- sis for application execution steering [13], and per-
eration. formance prediction using the Accelerated Strategic
Assuming that a run-to-completion (non- Computing Initiative (ASCI) demonstrator applica-
overlap) strategy is adopted, TITAN is able to esti- tion Sweep3D [14].
mate the finish time for a particular task by summing Figure 3 demonstrates the connection between
the earliest possible start time plus the expected ex- PACE and the scheduling system. The application
ecution time, readily obtained by PACE from the tools are used by the users when the task is submitted
Figure 2: The PACE system consisting of application and resource tools that produce objects which form the input to
a parametric evaluation engine.

Figure 3 demonstrates the connection between PACE and the scheduling system. The application tools are used by the users when a task is submitted if a model does not exist. With a good understanding of the code, model generation is reasonably straightforward; however, the PACE system is being developed to facilitate model generation for less well-known codes [15]. The scheduler uses the evaluation engine to identify the expected execution run-time from the application models and from the resource models that represent the cluster or multi-processor system.

3 Localised Scheduling

In this work, a cluster is considered as a network of peer-to-peer hosts that are architecturally identical with similar communication costs between nodes. Scheduling issues are addressed at this level using task scheduling algorithms driven by PACE performance predictions. Three scheduling algorithms have been developed and are described below.

3.1 Deadline Sort (DS)

When tasks are received by the scheduling system, it is possible to map each process to all possible nodes and determine the expected run-time using PACE. The expected end-time $te_j$ for each task in the schedule queue can then be determined by summing the task start-time with the task's execution time, given by the following expression:

$$te_j = ts_j + p_{ext}(\pi_j)_j \qquad (3)$$

The start time, $ts_j$, can be considered as the latest free time of all hosts mapped to the application. If all of the mapped hosts are idle, then $ts_j$ is set to the current system time $t$, hence:

$$ts_j = \max_{\forall i,\ H_i \in \pi_j} \{ tr_{ji} \} \qquad (4)$$

where $tr_{ji}$ is the release time for task $j$ on host $i$. For tasks $\ell_0 \ldots \ell_{j-1}$ that execute on $H_i$, this is set to the end time of the application. Where there are no previous applications on this host, it is set to the host release time, $hr_i$, which is the time when $H_i$ is expected to become available. It is based upon tasks in the running queue and is set from $te_j$ when tasks move out of the scheduling state. It is defined as:

$$tr_{ji} = \max_{\forall r,\ H_i \in \pi_r} \{ TR_{jir} \} \qquad (5)$$

$$TR_{jir} = \begin{cases} te_r, & \text{if } r < j \\ hr_i, & \text{if } r \ge j \end{cases} \qquad (6)$$

Using this algorithm it is possible to sort tasks based on deadline and priority. However, no improvement in resource utilisation is made and so it is possible for tasks to 'over-utilise' a cluster. This results in a lower run-time for the task but prevents other tasks from using the cluster.
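A compact sketch of the end-time calculation in equations (3)-(6) is given below, with a plain function standing in for the PACE prediction $p_{ext}(\pi_j)_j$. The interface is an assumption made for illustration.

```python
# Sketch of the end-time calculation of equations (3)-(6); illustrative only.

def end_times(order, hostmap, p_ext, host_release, t_now=0.0):
    """Return te_j for each position j in the task ordering.

    order        -- list of task identifiers (the ordering l)
    hostmap      -- hostmap[i][j] = 1 if host i is mapped to the j-th task
    p_ext        -- p_ext(task, hosts) -> predicted run-time (stands in for PACE)
    host_release -- hr_i: time at which each host becomes free (running queue)
    """
    release = [max(hr, t_now) for hr in host_release]    # current tr_ji per host
    te = []
    for j, task in enumerate(order):
        hosts = [i for i in range(len(release)) if hostmap[i][j]]
        ts_j = max(release[i] for i in hosts)             # equation (4)
        te_j = ts_j + p_ext(task, hosts)                  # equation (3)
        te.append(te_j)
        # Run-to-completion: later tasks on these hosts cannot start earlier
        # than te_j (equations (5)-(6)).
        for i in hosts:
            release[i] = te_j
    return te

# Example: two idle hosts and a toy prediction that halves with two hosts.
print(end_times(order=["T0", "T1"],
                hostmap=[[1, 0], [1, 1]],
                p_ext=lambda task, hosts: 10.0 / len(hosts),
                host_release=[0.0, 0.0]))      # -> [5.0, 15.0]
```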
3.2 Deadline Sort with Node Limitation (DS-NL)

The second approach extends the first by limiting the number of nodes allocated to a task based on predictive data. This can result in improved resource utilisation, where the tasks use the number of nodes that results in the fastest run-time. Also, by selectively assigning the appropriate start node, it is possible to parallelise two or more tasks if the number of nodes occupied is less than the sum of all nodes.
Figure 3: The connection between the scheduler and the PACE environment.

The problem with this algorithm is that a task can conceivably prevent another task from running at the expense of one overlapped node. In this instance, one solution may be to force one of the tasks to relinquish a node (if the task topology allows it). This is not a simple problem, and the exploration of multiple scheduling solutions is an extension of the classical multi-processing (MS) scheduler problem, which is known to be intractable in the general case [16] and NP-hard for the case when m > 4, where m is the number of machines.

3.3 Genetic Algorithm (GA)

The third approach uses a genetic algorithm to explore areas of the solution space to find good-quality schedules. Such algorithms are well-suited to search problems of this nature [17] and have been applied to a number of similar areas such as dynamic load balancing [18]. The approach used in this work generates a set of initial schedules, evaluates the schedules to obtain a measure of fitness, selects the most appropriate and combines them together using operators (crossover and mutation) to formulate a new set of solutions. This is repeated using the current schedule as a basis, rather than restarting the process, allowing the system to capitalise on the organisation that has occurred previously.

As with other iterative searching algorithms, this process is based upon the notion of a fitness function that provides a quantitative representation of how 'good' a solution is with respect to other solutions. In this manner, a set of solutions can be created and evaluated with respect to multiple metrics to isolate superior solutions. This 'survival of the fittest' approach maintains good solutions and penalises poor solutions in order to move towards a good-quality result.

This is achieved using an algorithm that selects suitable candidate schedules from a population of schedules in $S$ and produces new schedules in $S'$. A fitness function is used to guide the selection of successful schedules by ejecting poor schedules from the system and maintaining good schedules in the representation space.

3.3.1 Fitness Function

The most significant factor in our fitness function is the make-span, $\omega$, which is defined as the time from the start of the first task to the end of the last task. It is calculated by evaluating the PACE function $p_{ext}()_j$ for each host allocation and combining this with the end time of the task:

$$\omega = \max_{0 \le j \le n-1} \{ te_j \} \qquad (7)$$

Equation 7 represents the latest completion time of all the tasks, where the aim of the process is to minimise this function with respect to the schedule. While schedules can be evaluated on make-span alone, in a real-time system idle time must also be considered to ensure that unnecessary space is removed from the start of the schedule queue. TITAN employs an additional function (8) that penalises early idle time more harshly than later idle time, using the following expression:

$$\theta_k = \sum_i \frac{\Delta T_i^{idle}}{\omega_k^2} \left( 2\omega_k - 2T_i^{idle} - \Delta T_i^{idle} \right) \qquad (8)$$

where $T_i^{idle}$ is the start of the idle time and $\Delta T_i^{idle}$ is the size of the idle time space, which can be calculated as the difference between the end time of the task and the start time of the next task on the same host. This idle weighting function therefore decreases from 100% weighting at $(T_i = 0,\ \Delta T_i = \omega)$ to 0% weighting at $(T_i = \omega,\ \Delta T_i = 0)$. The reason for penalising early idle time is that late idle time has less impact on the final solution as:

1. the algorithm has more time to eliminate the pocket;
2. new tasks may appear that fill the gap.

A linear contract penalty is also evaluated, which can be used to ensure that the deadline time for a task, $d_j$, is greater than the expected end-time of the task:

$$\gamma_k = \sum_{j=0}^{n-1} \max \{ (te_j - d_j),\ 0 \} \qquad (9)$$

The three fitness functions are combined with preset weights to formulate an overall fitness function (10). Using these weights, it is possible to change the behaviour of the scheduler on-the-fly, allowing the scheduler to prioritise on deadline or compress on make-span, for example:

$$f_k = \frac{(W_i \cdot \theta_k) + (W_m \cdot \omega_k) + (W_c \cdot \gamma_k)}{W_i + W_m + W_c} \qquad (10)$$

The fitness value is normalised to a cost function using a dynamic scaling technique, and then used as the basis for the selection function. Schedules with a higher fitness value are more likely to be selected than a schedule with a fitness approaching 0.
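The combined cost of equations (7)-(10) can be sketched as follows; the symbols $\theta_k$, $\omega_k$ and $\gamma_k$ follow the reconstruction above and the weight values are illustrative. Lower values correspond to fitter schedules, consistent with Figure 8; the dynamic scaling used to turn this cost into a selection probability is omitted.

```python
# Sketch of the combined fitness of equations (7)-(10); illustrative only.

def fitness(te, deadlines, idle_gaps, w_i=1.0, w_m=1.0, w_c=1.0):
    """te        -- expected end times te_j for all tasks in the schedule
    deadlines -- user-defined deadlines d_j per task
    idle_gaps -- list of (start, size) idle pockets across the hosts
    """
    omega = max(te)                                               # equation (7)
    # Idle-time penalty: early idle time is weighted more heavily (equation (8)).
    theta = sum(size / omega**2 * (2*omega - 2*start - size)
                for start, size in idle_gaps)
    # Linear contract penalty for tasks expected to miss their deadline (9).
    gamma = sum(max(t - d, 0.0) for t, d in zip(te, deadlines))
    # Weighted combination (10); the weights can be changed on-the-fly.
    return (w_i*theta + w_m*omega + w_c*gamma) / (w_i + w_m + w_c)

# Toy example: two tasks and one idle pocket of length 2 at the queue start.
print(fitness(te=[8.0, 10.0], deadlines=[9.0, 9.0], idle_gaps=[(0.0, 2.0)]))
```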
3.3.2 Crossover / Mutation

The selected schedules are subject to two algorithmic operators to create a new solution set, namely 'crossover' and 'mutation'. The purpose of crossover is to maintain the qualities of a solution, whilst exploring a new region of the problem space. Mutation adds diversity to the solution set.

As the schedules are represented as a two-dimensional string that consists of a task ordering part (which is q-ary coded) and a binary constituent (which is binary coded), the genetic operators have to be different for the task order and host mappings. Each operator acts on a pair of schedules, and is applied to the entire solution set:

$$S \xrightarrow{\text{crossover}} S' \xrightarrow{\text{mutation}} S'' \qquad (11)$$

The task crossover process consists of merging a portion of the task order with the remaining portion of a second task order, and then re-ordering to eliminate illegal orders. The mutation process involves swapping random task orders. For example, two task orders can be represented as follows:

$$\ell_A = [T_1, T_3, T_5, T_2, T_0, T_6, T_4, T_7]$$
$$\ell_B = [T_4, T_5, T_2, T_1, T_6, T_7, T_0, T_3]$$

$\ell_A$ and $\ell_B$ can be combined at a random point to produce a new schedule task list $\ell'$, where $\ell_{A:n}$ denotes the $n$th element of $\ell_A$. Task duplications are then removed by undoing the crossover operation for the repeated task, as illustrated in the following expression, where $(T_3)$ denotes a task duplication which is resolved by setting the first $(T_3)$ to the first unused value of $\ell_B$:

$$\begin{aligned} \ell' &= [\ell_{A:0}, \ell_{A:1}, \ell_{A:2}, \ell_{B:3}, \ell_{B:4}, \ell_{B:5}, \ell_{B:6}, \ell_{B:7}] \\ &= [T_1, (T_3), T_5, T_2, T_6, T_7, T_0, (T_3)] \\ &= [T_1, T_4, T_5, T_2, T_6, T_7, T_0, T_3] \end{aligned} \qquad (12)$$

Random changes are then applied to $\ell'$ to give the mutated representation $\ell''$, illustrated with the bracketed tasks in the following expression:

$$\begin{aligned} \ell'' &= [T_1, T_4, T_5, T_2, T_6, (T_7), T_0, (T_3)] \\ &= [T_1, T_4, T_5, T_2, T_6, T_3, T_0, T_7] \end{aligned} \qquad (13)$$

The host map crossover process merges the host component of the two schedules together, using a simple binary copy. The mutation process randomly toggles bits in the hostmap, while ensuring that topology rules have not been broken (such as trying to run the task on zero hosts, for example). Expression 14 represents a full schedule of 8 tasks and 5 hosts:

$$\ell'' = [T_1, T_4, T_5, T_2, T_6, T_3, T_0, T_7]$$
$$M'' = \begin{bmatrix} [1,1,0,1,1], & [1,1,0,0,0], \\ [0,0,1,0,0], & [0,0,0,1,0], \\ [0,0,0,0,1], & [1,1,0,0,1], \\ [1,1,1,1,1], & [1,0,1,0,1] \end{bmatrix} \qquad (14)$$

Figure 4 depicts a visualisation of this representation. The tasks are moved as close to the bottom of the schedule as possible, which can result in a task that has no host dependencies being run prior to a task earlier in the task order. This is illustrated with $T_5$ being scheduled before $T_4$ despite being further on in the task order.
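The sketch below illustrates the order crossover with duplicate repair and the two mutation operators just described. It is a reconstruction for illustration only; the exact crossover-point handling in TITAN is not specified in detail, so this code will not necessarily reproduce expression (12) step for step.

```python
# Sketch of the order crossover with duplicate repair and the two mutation
# operators described above; an illustrative reconstruction, not TITAN's code.
import random

def order_crossover(parent_a, parent_b, cut):
    """Merge parent_a[:cut] with parent_b[cut:]; repair each duplicated task
    by setting its first occurrence to the first unused value of parent_b."""
    child = parent_a[:cut] + parent_b[cut:]
    missing = [t for t in parent_b if t not in child]   # tasks lost by the cut
    counts = {t: child.count(t) for t in child}
    for idx, t in enumerate(child):
        if counts[t] > 1 and missing:
            counts[t] -= 1
            child[idx] = missing.pop(0)                  # first unused value
    return child

def order_mutation(order):
    """Swap two randomly chosen positions in the task ordering."""
    i, j = random.sample(range(len(order)), 2)
    order = list(order)
    order[i], order[j] = order[j], order[i]
    return order

def hostmap_mutation(hostmap, rate=0.05):
    """Randomly toggle bits, but never leave a task mapped to zero hosts."""
    new = [row[:] for row in hostmap]
    for i in range(len(new)):
        for j in range(len(new[i])):
            if random.random() < rate:
                new[i][j] ^= 1
    for j in range(len(new[0])):          # topology rule: at least one host
        if not any(new[i][j] for i in range(len(new))):
            new[random.randrange(len(new))][j] = 1
    return new

l_a = ["T1", "T3", "T5", "T2", "T0", "T6", "T4", "T7"]
l_b = ["T4", "T5", "T2", "T1", "T6", "T7", "T0", "T3"]
print(order_crossover(l_a, l_b, cut=3))
```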
4 Implementation

The techniques presented in the previous section have been developed into a working system for scheduling parallel tasks over a heterogeneous network of resources. The GA algorithm introduced in section 3 forms the centre point of the localised workload managers and is responsible for selecting, creating and evaluating new schedules. Connection to the PACE system is achieved by a remote evaluation library, with a performance cache to store timings for subsequent re-use.
Figure 4: Visualisation of the schedule given in expression 14. Task T5 is run earlier than T4 as it requires H2, which is not used by either T1 or T4.

4.1 Workload Management Process

The core data type is the Task. Tasks are submitted to the workload manager from the distribution brokers, and enter a pool of Task objects. The scheduler maintains continuous visibility of this pool and applies the strategies described earlier to form solution sets. The scheduler also accepts host information from a second pool which is updated by an on-line host gathering and statistics module. Currently, this information is gained from the UNIX 'uptime' and 'sar' commands, although a small agent is being developed to remove the operating system dependence that this approach involves. The information is used to check the load, status and idle-time of the hosts in the cluster. It is therefore possible for TITAN to utilise hosts as 'dual-mode' machines which are only used for scheduling when they have been idle for a preset period of time. As this level of resource is below the Globus layers, the MDS cannot be used for statistics gathering.

At each iteration, the best solution is compared with an overall fittest solution. This allows the scheduler to maintain a copy of the best schedule found to date, whilst allowing the process to explore other regions of the problem space. If a significant change occurs in either the host or task pools, then this schedule is deemed 'stale' and replaced by the fittest schedule in the current generation.

Periodically, the scheduler examines the bottom of the schedule and, if there are any tasks to run, will move them into the run queue and remove them from the scheduling queue. Figure 5 provides a summary view of this system.
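The iteration loop described above can be sketched as follows. The `evolve`, `fitness` and `pools_changed` callables are placeholders for the GA operators, the cost function of equation (10) and the task/host pool change test; they are assumptions made for this example, not TITAN's actual interfaces.

```python
# Sketch of the scheduler iteration loop described in section 4.1.
import copy

def schedule_loop(population, evolve, fitness, pools_changed, generations=1000):
    """Keep the best schedule found to date; discard it as 'stale' when the
    task or host pools change significantly."""
    best = min(population, key=fitness)          # lower cost value is fitter
    for _ in range(generations):
        population = evolve(population)          # selection, crossover, mutation
        fittest = min(population, key=fitness)
        if pools_changed():
            best = copy.deepcopy(fittest)        # stale: take the current fittest
        elif fitness(fittest) < fitness(best):
            best = copy.deepcopy(fittest)        # new overall best found
    return best
```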
Figure 5: Diagram that details the key system states.

4.2 GA Performance

In order to test the GA operation, simulations were run in 'test' mode where tasks are not moved into the executing state. This allows the operation of the algorithms to be tested and tuned. In this simulation, a workload of 32 tasks was created for a 16-node cluster. Each task was selected from a set of five codes with different performance scaling characteristics, as illustrated in Figure 6. Each of the codes has associated PACE performance data which can be used to provide the scaling curves. Details of model generation can be found for the Sweep3D model in [19].

In the first experiment, the test schedule is evaluated to 6000 generations with 3 different population sizes: $S = \{S_0, S_1, \ldots, S_{19}\}$, $\{S_0, S_1, \ldots, S_{39}\}$ and $\{S_0, S_1, \ldots, S_{59}\}$.

The first 2000 iterations are shown in Figure 7, which illustrates the classical GA properties: namely rapid convergence to (an often) sub-optimal solution. Reducing the population size constrains the diversity, which can have a direct effect on identifying a good quality solution. However, increasing the population has computation and storage implications, limiting the number of iterations that can be achieved between tasks. An approximate cost of this technique is given by:

$$t_{eval} = N_p N_t \left[ C_{ga} + C_{pace} \right] \qquad (15)$$

where $N_p$ is the number of schedules in the population, $N_t$ is the number of tasks in the scheduling queue, and $C_{ga}$ and $C_{pace}$ are the computational costs of executing the GA operators and the PACE evaluation engine respectively. Hence, reducing the population size permits more iterations per second, which can move more rapidly toward a solution. During experimentation, it was found that the difference between population sizes 40 and 60 was minimal, with a significant saving in run-time.

With a population of 40, the scheduler (running on a standalone node with an 800MHz Pentium III) can deliver 100 iterations per second, assuming that the PACE data has been cached (which typically occurs after the initial few iterations).
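As a rough illustration of equation (15), and assuming that the cost is dominated by the per-task evaluations (the individual values of $C_{ga}$ and $C_{pace}$ are not reported separately in this paper): with $N_p = 40$ schedules and $N_t = 32$ tasks, 100 generations per second implies $t_{eval} \approx 10$ ms per generation, and hence $C_{ga} + C_{pace} \approx 0.01 / (40 \times 32) \approx 8\,\mu\text{s}$ per task-schedule evaluation once the PACE results are cached.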
In the second experiment, the weighting function of (10) is demonstrated by introducing synthetic deadlines at approximately 1000 generations. Figure 8 contains the normalised graphs of the make-span contribution, the deadline contribution and the global fitness. The idle-time weight is neglected in this experiment as it has a less visible effect in the tests, and is more useful when tasks are executing.

Initially, the unsorted schedule results in a high make-span and an associated make-span contribution to the overall global fitness. As the iterations continue, the make-span is reduced and the population fitness improves. At generation 600, the imminent deadline constraints start to affect the global fitness, although make-span is still the dominant factor. As the tasks are not removed from the queue in this experiment, the deadline metric becomes progressively dominant and at 1000 generations the scheduler becomes deadline-driven as opposed to makespan-driven, demonstrating how multiple parameters can be considered dynamically.

5 Evaluation

To schedule and run tasks, TITAN is started in 'execution' mode (as opposed to the 'test' mode described previously). The same 32 tasks assigned in the previous section were used with deadlines set cyclically between 5 and 11 minutes, and priorities assigned between 1 and 5, where 1 is the most important. Each task was submitted to the TITAN scheduler using a proprietary text-based tool.

Figure 6: Scaling characteristics (execution time in seconds against number of nodes) for CPI, the Sweep3D ASCI demonstrator code, an FFT routine, an image processing code and a bitonic memory sort, based on resource timings for a cluster of Pentium III 800MHz systems connected over a Fast Ethernet network.
Figure 7: Make-span (seconds) against generation for three different population sizes (20, 40 and 60) over the first 2000 generations.

Work on adding GRAM interoperability is in progress and will allow the scheduler to be utilised from Globus. In addition, the scheduler will make key service data available via the information services.

5.1 Run-time Evaluation

In the first experiment, the three scheduling algorithms are compared by make-span. They are set against a simple first-come-first-served (FC-FS) algorithm which performs no adjustment other than sorting by priority. Figure 9 illustrates the number of tasks completed over time. In this test, FC-FS and the deadline sort demonstrate the worst-case make-span as they are unable to reconfigure tasks to improve resource usage. The approach of limiting the tasks to the maximum required nodes results in an expected drop in make-span. However, this operation is critically related to the number of nodes, and hence the GA algorithm, which is adaptable, is able to improve on this by 20% by making slight modifications that can fill the gaps left by the node limitation technique.

In each of these cases, it is possible to calculate the average advance time ($A_T$), which gives an indication of how efficiently deadlines are being met. This is calculated as follows:

$$A_T = \frac{\sum_{j} (d_j - TE_j)}{n} \qquad (16)$$

where $d_j$ is the user-defined deadline, $TE_j$ is the actual end-time of application $j$, and $n$ is the number of tasks in the queue. Table ?? provides the average advance time for each approach.

The negative values of FC-FS and deadline sort indicate that many of the deadlines were not met; indeed, on average both approaches missed their respective deadlines by over two minutes.

5.2 Dynamic Evaluation

The advantage of the GA approach over the node limitation algorithm is that it is not so sensitive to the number of processing nodes. This is especially apparent if an event occurs such as a host going off-line.

Figure 8: The contribution of makespan and deadline to the global fitness of the population (lower is fitter). At 1000 generations, the scheduler is more influenced by deadline than makespan.
Figure 9: Tasks completed against time (seconds). The GA algorithm results in the most rapid execution of the scheduler queue, compared with first-come first-served (FC-FS), deadline sort (DS) and deadline sort with node limitation (DS-NL).

In the final simulation, 4 of the 16 processors are removed after approximately 60 seconds of run-time. The node-limited algorithm takes an immediate hit which pushes the make-span up accordingly. Without any means to work around the gaps, all the tasks are pushed back and it takes over 2 minutes for the make-span to fall back to previous levels. The GA, which has a lower make-span from the outset, copes far better with the gap, recovering the make-span in less than a minute.

In section 2.4 it was stated that the scheduler makes the assumption that PACE is 100% accurate. Although tests have yet to be run, it is felt that the GA will cope well with deviations from expected run-time in much the same manner as when hosts go off-line, reorganising the schedule queue where necessary.

6 Related Work

There are a number of grid projects that address various workload and resource management issues, including Nimrod [20], NetSolve [21], AppLeS [22], Ninf [23], Condor [24], LSF [25] and the Grid Resource Broker (GRB) [26] project. Some of these works, including Nimrod, Ninf and AppLeS, are similar to TITAN/PACE in that performance models are employed to enhance resource utilisation through effective allocation. Nimrod, for example, has been developed specifically to assist in virtual laboratory work and is able to orchestrate distributed experiments using a parametric evaluation engine and a declarative performance modelling language. It also utilises a scheduling heuristic [27] to allocate resources. Nimrod/G [28] extends Nimrod with a grid-aware version that uses grid protocols to perform automatic discovery of resources.

A different approach is taken by AppLeS, which places the onus on the user to provide the scheduling functionality that is embedded into the application. Agents that contain both static and dynamic performance information query the Network Weather Service [29] at the time of execution and use algorithms to select whether resources are suitable candidates or not. Ninf also utilises the Network Weather Service and is based on performance evaluation techniques. The approach of NetSolve to provide remote computation facilities is through the selection of the highest performing resources in a network. Unlike TITAN/PACE, however, modifications to the application source code are required in order to utilise NetSolve's services.

Other related grid projects include batch queuing systems that provide system-level cluster scheduling, including Condor, Condor/G [30] and LSF. These systems do not, however, necessarily aim to find the best scheduling solution, while TITAN has been developed with this in mind. Both Condor and LSF provide facilities for managing large, high throughput grid resources and Condor, in particular, integrates tightly with Globus. TITAN requires further development to achieve full Globus integration. Other scheduling tools that integrate with Globus include GRB, which provides access to Globus services using a web-based portal, permitting location-transparent scheduling of tasks. It is a more general tool than TITAN, using resource finders and matchmakers based on resource and user characteristics as opposed to performance models.

As grid development continues, the requirement to consider performance issues will become increasingly important. While effective utilisation and allocation is desirable for system and resource providers, performance characterisation provides the user with realisable deadline and priority parameters. It is anticipated that future implementations of Globus will feature reservation services in which performance prediction will be crucially important to prevent wasteful allocations and meaningless user metrics (such as non-fulfillable deadlines).
Figure 10: Make-span (seconds) of the schedule queue over time for the node-limited (DS-NL) and GA approaches. After 1 minute, 4 hosts are taken off-line, resulting in a schedule reorganisation. The GA copes well with this incremental change.

Consideration of performance issues at the 'middleware' level, the approach taken by many of the works above and by TITAN, will help resolve these issues. The approach of TITAN is to utilise an established and flexible performance characterisation system (PACE) with a GA-based algorithm. While PACE provides true hardware and software separation, which is ideal for heterogeneous grid environments, it would be possible for TITAN to utilise other performance tools including POEMS [?] or analytical models such as [?].

The use of genetic algorithms (GA) in dynamic multi-processing scheduling systems has been explored previously [31, 32], where they have been successfully applied to the minimisation of schedule length (make-span) for dissimilar tasks. TITAN augments the traditional GA operations of crossover and mutation with application-centric data obtained through PACE model evaluation, including the ability to reject schedules that over-utilise resources. As PACE is able to produce accurate results rapidly, it can be used in iterative procedures (although some caching is required). Evaluating PACE models prior to execution also allows the system to consider variables that are not available during the traditional performance analysis stage, which often requires recomposition of the model.

7 Conclusions

The work presented in this paper is concerned with improving grid workload management using a multi-tiered framework based on Globus providers, distribution brokers and local schedulers. This paper has focused primarily on the iterative heuristic algorithm and performance prediction techniques that have been developed for the local schedulers. Current work is focused on fully implementing interoperability with Globus to enable performance-based 'global' scheduling in addition to 'local' scheduling. This work was initially based on Globus 2, and is now being updated to integrate with OGSA and associated services.

A test suite has been implemented to tune the algorithm parameters and the PACE models. The results demonstrate that the scheduler converges quickly to a suitable schedule pattern, and that it balances the three functions of idle time, make-span and the quality-of-service metric, deadline time.

The PACE system provides an effective mechanism for determining general application performance through a characterisation technique. The resultant models can be used for performance studies, design work and code-porting activities. In this work, they have been adapted for use with a GA-based scheduler to provide the basis for performance-based middleware for general applications.

Further work will examine the relationship between low-level scheduling, the middle tier of workload management and the top tier of wide-area management. It is also possible to extend the system to manage flows of work as groups of related tasks. It is envisaged that this framework will develop to provide complementary services to existing grid infrastructures (i.e. Globus) using performance prediction and service brokers to meet quality demands for grid-aware applications.

Acknowledgments

This work is sponsored in part by grants from the NASA Ames Research Center (administered by USARDSG, contract no. N68171-01-C-9012) and the EPSRC (contract no. GR/R47424/01). The authors would also like to express their gratitude to Darren Kerbyson and Stuart Perry for their contribution to this work.
References

[1] I. Foster, C. Kesselman, and S. Tuecke. The Anatomy of the Grid: Enabling Scalable Virtual Organisations. Int. J. of Supercomputing Applications, 2001.

[2] I. Foster and C. Kesselman. The GRID: Blueprint for a New Computing Infrastructure. Morgan-Kaufmann, 1998.

[3] I. Foster, C. Kesselman, J.M. Nick, and S. Tuecke. Grid services for distributed system integration. IEEE Computer, 35(6):37-46, 2002.

[4] I. Foster, C. Kesselman, J. Nick, and S. Tuecke. The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration. Open Grid Service Infrastructure WG, Global Grid Forum, Jun 2002.

[5] I. Foster and C. Kesselman. Globus: A Metacomputing Infrastructure Toolkit. Int. J. of Supercomputer Applications, 11(2):115-128, 1997.

[6] Global Grid Forum. http://www.globalgridforum.org.

[7] G.R. Nudd, D.J. Kerbyson, E. Papaefstathiou, S.C. Perry, J.S. Harper, and D.V. Wilcox. PACE: A Toolset for the Performance Prediction of Parallel and Distributed Systems. Int. J. of High Performance Computing Applications, Special Issue on Performance Modelling, 14(3):228-251, 2000.

[8] K. Czajkowski, S. Fitzgerald, I. Foster, and C. Kesselman. Grid Information Services for Distributed Resource Sharing. In Proc. of the 10th IEEE International Symposium on High-Performance Distributed Computing (HPDC-10), 2001.

[9] The Globus Toolkit. http://www.globus.org/toolkit.

[10] J. Cao, D.J. Kerbyson, and G.R. Nudd. High Performance Service Discovery in Large-scale Multi-agent and Mobile-agent Systems. Int. J. of Software Engineering and Knowledge Engineering, Special Issue on Multi-Agent Systems and Mobile Agents, 11(5):621-641, 2001.

[11] J. Cao, S.A. Jarvis, S. Saini, D.J. Kerbyson, and G.R. Nudd. ARMS: an Agent-based Resource Management System for Grid Computing. Scientific Programming, Special Issue on Grid Computing, 10(2):135-148, 2002.

[12] S.C. Perry, R.H. Grimwood, D.J. Kerbyson, E. Papaefstathiou, and G.R. Nudd. Performance Optimisation of Financial Option Calculations. Parallel Computing, 26(5):623-639, 2000.

[13] D.J. Kerbyson, E. Papaefstathiou, and G.R. Nudd. Application Execution Steering using On-the-fly Performance Prediction. High Performance Computing and Networking, Lecture Notes in Computer Science, 1401:718-727, 1998.

[14] J. Cao, D.J. Kerbyson, E. Papaefstathiou, and G.R. Nudd. Modelling of ASCI High Performance Applications using PACE. Proc. UK Performance Engineering Workshop (UKPEW'99), pages 413-424, 1999.

[15] J.D. Turner, D.P. Spooner, S.A. Jarvis, D.N. Dillenberger, and G.R. Nudd. A Transaction Definition Language for Java Application Response Measurement. In Proc. of 27th Int. Conf. on Technology Management and Performance Evaluation of Enterprise-Wide Information Systems (CMG2001), Anaheim, California, USA, December 2001.

[16] J. Du and J. Leung. Complexity of Scheduling Parallel Task Systems. SIAM J. on Discrete Mathematics, November 1989.

[17] D. Dumitrescu, B. Lazzerini, L.C. Jain, and A. Dumitrescu. Evolutionary Computation. CRC Press, 2000. ISBN 0-84-930588-8.

[18] A.Y. Zomaya and Y.H. Teh. Observations on Using Genetic Algorithms for Dynamic Load-Balancing. IEEE Transactions on Parallel and Distributed Systems, 12(9):899-911, 2001.

[19] K.R. Koch, R.S. Baker, and R.E. Alcouffe. Solution of the First-Order Form of the 3-D Discrete Ordinates Equation on a Massively Parallel Processor. Trans. of the Amer. Nuc. Soc., 65(108), 1992.

[20] R. Buyya, D. Abramson, and J. Giddy. Nimrod/G: An Architecture for a Resource Management and Scheduling System in a Global Computational Grid. In Proc. of 4th Int. Conf. on High Performance Computing, Asia-Pacific Region, Beijing, China, 2000.

[21] H. Casanova and J. Dongarra. NetSolve: A Network Server for Solving Computational Science Problems. Int. J. of Supercomputing Applications and HPC, 11(3), 1997.

[22] F. Berman, R. Wolski, S. Figueira, J. Schopf, and G. Shao. Application-level Scheduling on Distributed Heterogeneous Networks. Proc. of Supercomputing, 1996.

[23] A. Takefusa, S. Matsuoka, H. Nakada, K. Aida, and U. Nagashima. Overview of a Performance Evaluation System for Global Computing Scheduling Algorithms. In Proc. of 8th IEEE Int. Symp. on High Performance Distributed Computing, pages 97-104, 1999.

[24] M. Litzkow, M. Livny, and M. Mutka. Condor - A Hunter of Idle Workstations. In Proc. of 8th Int. Conf. on Distributed Computing Systems (ICDCS 1988), pages 104-111, 1988.

[25] S. Zhou. LSF: Load Sharing in Large-scale Heterogeneous Distributed Systems. In Proc. of 1992 Workshop on Cluster Computing, 1992.

[26] G. Aloisio, M. Cafaro, E. Blasi, and I. Epicoco. The Grid Resource Broker, a Ubiquitous Grid Computing Framework. Scientific Programming, Special Issue on Grid Computing, 10(2), 2002.

[27] A. Abraham, R. Buyya, and B. Nath. Nature's Heuristics for Scheduling Jobs on Computational Grids. In Proc. of 8th IEEE Int. Conf. on Advanced Computing and Communications, Cochin, India, 2000.

[28] R. Buyya, D. Abramson, and J. Giddy. Nimrod-G Resource Broker for Service-oriented Grid Computing. IEEE Distributed Systems Online, 2(7), 2001.

[29] R. Wolski, N.T. Spring, and J. Hayes. The Network Weather Service: A Distributed Resource Performance Forecasting Service for Metacomputing. Future Generation Computing Systems, 1999.

[30] J. Frey, T. Tannenbaum, I. Foster, M. Livny, and S. Tuecke. Condor-G: A Computation Management Agent for Multi-Institutional Grids. In Proc. of the 10th IEEE Symposium on High Performance Distributed Computing (HPDC10), 2001.

[31] E.S. Hou, N. Ansari, and H. Ren. A Genetic Algorithm for Multiprocessor Scheduling. IEEE Transactions on Parallel and Distributed Systems, 5(2):113-120, 1994.

[32] M.D. Kidwell and D.J. Cook. Genetic Algorithm for Dynamic Task Scheduling. In Proc. of IEEE 13th Annual International Phoenix Conference on Computers and Communications, pages 61-67, 1994.
Appendix A: TITAN Screen Shots

Figure 11: Visualisation of the schedule before GA operation.

Figure 12: Visualisation of the schedule after 300 iterations.
