Exploring Virtual Workspace Concepts in a Dynamic Universe for Condor

Quinn Lewis

ABSTRACT

Virtualization offers a cost-effective and flexible way to use and manage

computing resources. Such an abstraction is appealing in grid computing for better

matching jobs (applications) to computational resources. This paper applies the virtual

workspace concept introduced in the Globus Toolkit to the Condor workload

management system. It allows existing computing resources to be dynamically

provisioned at run-time by users based on application requirements instead of statically at

design-time.

INTRODUCTION

A common goal of computer systems is to minimize cost while maximizing other

criteria, such as performance, reliability, and scalability, to achieve the objectives of the

user(s). In Grid computing, a scalable way to harness large amounts of computing power

across various organizations is to amass several relatively inexpensive computing

resources together. Coordinating these distributed and heterogeneous computing

resources for the purposes of perhaps several users can be difficult. In such an

environment, resource consumers have several varying, specific, and demanding

requirements and preferences for how they would like their applications and services to

leverage the resources made available by resource providers. Resource providers must

ensure the resources meet a certain quality of service (e.g. make resources securely and

consistently available to several concurrent users).

In the past, control over the availability, quantity, and software configurations of

resources has been limited to the resource provider. With virtualization, it becomes
possible for resource providers to offer up more control of the resources to a user without

sacrificing quality of service to other resource consumers. Users (resource consumers)

can more easily create execution environments that meet the needs of their applications

and jobs within the policies defined by the resource providers. Such a relationship,

enabled by virtualization, is both cost-effective and flexible for the resource provider and

consumer. [1]

The virtual workspace term, initially coined in [2] for use with the Globus

Toolkit, "is an abstraction of an execution environment that can be made dynamically

available to authorized clients by using well-defined protocols". This execution

environment can encompass several physical resources. Generically, this concept could

be implemented in various ways; however, virtualization has proven itself to be a

practicable implementation. [3]

Condor "is a specialized workload management system for compute-intensive

jobs" [4]. Condor currently abstracts the resources of a single physical machine into

virtual machines which can run multiple jobs at the same time [5]. A "universe" is used

to statically describe the execution environment in which the jobs are expected to run.

This approach assumes the resources (whether real or virtual) have to all be allocated in

advance. While there is support for adding more resources to an existing pool via the

Glide-in mechanism, the user still has to dedicate the use of these other physical

resources.

The purpose of this paper is to describe how a Condor execution environment

(universe) can be dynamically created at run-time by users to more flexibly and cost-

effectively use and manage existing resources using virtualization. Two of the unique
implementation details described in this paper are the use of Microsoft Windows and

Microsoft Virtual Server 2005 R2 for the virtual machine manager (VMM) on the host

operating system (instead of being Linux-based using Xen or VMware) and the use of

differencing virtual hard disks. More details about virtual workspaces and similar

attempts to virtualize Condor are described in Related Work. The implementation details

of the work performed for a dynamic Condor universe are provided along with

performance test results. Future enhancements are included for making this work-in-

progress more robust.

RELATED WORK

While virtualization has a number of applications for business computing and

software development and testing, the work outlined in this paper most directly applies to

technical computing, including Grid computing, clusters, and resource-scavenging

systems.

Grid Computing

The use of virtualization in Grid computing has been proposed before, touting the

benefits of legacy application support, improved security, and the ability to deploy

computation independently of site administration. The challenges of dynamically

creating and managing virtual machines are also described [6]. The virtual workspace

concept [7] extended [6] to present "a unified abstraction" and address additional issues

associated with the complexities of managing such an environment in the Grid. Two key

differences between the Grid-related work mentioned and this paper are the emphasis on

dynamically creating the execution environment at run-time and the (Microsoft)

virtualization software employed.
As mentioned previously, the Condor Glide-in mechanism works in conjunction

with the Globus Toolkit to temporarily make Globus resources available to a user’s

Condor pool. This has the advantage of being able to submit Condor jobs using Condor

capabilities (matchmaking and scheduling) on Globus managed resources [8]. However,

it is expected that the user acquire these remote resources before the jobs are executed.

Using virtualization allows the existing “local” Condor resources to be leveraged as the

jobs require.

Clusters

Many of the same motivations that exist for this work have also been applied to

clusters [9, 10] but focus more on dynamically provisioning homogeneous execution

environments on resources. Although perhaps accommodated in the design of the

Cluster-on-Demand [9], virtualization technology is not used in the implementation of the

system. The resources are assumed to physically exist and the software is deployed by

re-imaging the machine. In [10], virtualization is used to provision the software on the

cluster(s) but the time required to stage in the virtual image(s) is costly. The use of the

“differencing” virtual hard disk image type in this work offers a mitigating solution to

this problem [11].

Condor

Additional work with virtualization and Condor focuses on exploiting Condor’s

“cycle stealing” capability at the University of Nebraska-Lincoln to transform typical

Windows campus machines into Unix-based machines required by researchers [12]. The

solution leveraged coLinux to run a Condor compute node through a Windows device

driver [13]. While some of the same motivation exists for this work, using a
virtualization technology such as Virtual Server 2005 R2 allows other operating systems

and versions to be used and provides more flexible ways to programmatically control the

dynamic environment.

IMPLEMENTATION

We leverage Condor’s existing ability to schedule jobs, advertise resource

availability, and match jobs to resources and introduce a flexible extension for

dynamically describing, deploying, and using virtual execution resources in the Condor

universe.

In Condor, one or more machines (resources) along with jobs (resource requests)

are part of a collection, known as a pool. The resources in the pool have one or more of

the following roles: Central Manager, Execute, and/or Submit. The Central Manager

collects information and negotiates how jobs are matched to available resources. Submit

resources allow jobs to be submitted to the Condor pool through a description of the job

and its requirements. Execute resources run jobs submitted by users after having been

matched and negotiated by the Central Manager. [14]

We extend the responsibilities of each of these three different roles to incorporate

virtualization into Condor. Each Execute resource describes the extent to which it can be

virtualized (to the Central Manager) and is responsible for hosting additional (virtual)

resources. The Submit resource(s) takes a workflow of jobs and requirements and

initiates the deployment of the virtual resources, and signals their usage (start/stop) to the

host/execute machine. The Central Manager is responsible for storing virtual machine

metadata used for scheduling. For this implementation, a single machine is used for the

Central Manager, Submit, and Execute roles.
The virtualization capabilities for a particular Execute resource can be published

to the Central Manager via authorized use of condor_advertise. Attributes about the

virtual Execute resources, such as the operating system (and version), available memory

and disk space, and more specific data about the status of the virtual machine are

included. Currently, the “host” Execute resource invokes condor_advertise for each

“guest” or virtual Execute resource it anticipates hosting at start-up. This approach

allows virtual resources to appear almost indistinguishable from real physical resources

and to be included in Condor’s resource scheduling. Note that real resources are

running while the virtual resources are not. They have only been described.
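
As an illustration of what such an advertisement might contain, the following Python sketch builds ClassAd-style text for one virtual Execute resource. MyType, Name, Machine, OpSys, Memory, and Disk are standard machine attributes; IsVirtualMachine, VirtualMachineHost, and VirtualMachineState are hypothetical names introduced here, as are the example host names. The paper does not list the attribute set actually used, and a real machine ad carries more attributes than shown.

```python
def format_value(value):
    """Render a Python value as a ClassAd literal."""
    # bool must be tested before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "True" if value else "False"
    if isinstance(value, (int, float)):
        return str(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_virtual_machine_ad(name, host, opsys, memory_mb, disk_kb,
                             vm_state="NotDeployed"):
    """Return ClassAd text describing a virtual Execute resource."""
    attributes = {
        "MyType": "Machine",
        "Name": name,
        "Machine": name,
        "OpSys": opsys,
        "Memory": memory_mb,   # megabytes
        "Disk": disk_kb,       # kilobytes
        # Hypothetical attributes, not taken from the paper:
        "IsVirtualMachine": True,
        "VirtualMachineHost": host,
        "VirtualMachineState": vm_state,
    }
    lines = [f"{key} = {format_value(value)}"
             for key, value in attributes.items()]
    return "\n".join(lines) + "\n"
```

The resulting text could be written to a file and published by the host with condor_advertise UPDATE_STARTD_AD followed by the file name, once per guest it anticipates hosting.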

Using the standard Condor tools, such as condor_status, users can view the

resources (real and virtual) available in the pool. Users can then create workflows (using

Windows Workflow Foundation [16]) for one or more jobs that intend to run on the

provided resources. Since the virtual resource(s) may not be running when a job is

submitted, the initial scheduling will fail. Fortunately, Condor provides a SOAP-based

API for submitting and querying jobs [15]. Using this Condor API via workflows,

unsuccessful job submissions can be checked for the intended attributes of the advertised

machine to determine if the resource is a virtual machine and if it needs to be deployed,

and/or if it needs to be started.
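
The decision made by the workflow after an unsuccessful submission can be sketched as follows. This is a minimal Python illustration, not the Windows Workflow Foundation implementation used in this work. The attribute names IsVirtualMachine and VirtualMachineState and the state values are hypothetical, and the deploy and start callables stand in for whatever mechanism transfers the image and starts the virtual machine.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"                       # physical, or already running
    START = "start"                     # deployed but not running
    DEPLOY_AND_START = "deploy_and_start"


def next_action(machine_ad):
    """Decide what a virtual resource needs before a job can match it.

    machine_ad is a dict of advertised attributes for the intended machine.
    """
    if not machine_ad.get("IsVirtualMachine", False):
        return Action.NONE
    state = machine_ad.get("VirtualMachineState", "NotDeployed")
    if state == "Running":
        return Action.NONE
    if state == "Stopped":
        return Action.START
    return Action.DEPLOY_AND_START


def handle_unmatched_job(machine_ad, deploy, start):
    """Run the deploy and start steps required for the advertised machine."""
    action = next_action(machine_ad)
    if action is Action.DEPLOY_AND_START:
        deploy(machine_ad)
    if action in (Action.START, Action.DEPLOY_AND_START):
        start(machine_ad)
    return action
```

After the required steps complete, the job would be left for Condor to match in a later scheduling cycle.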

The user can indicate specific job requirements in the workflow. These

requirements can optionally specify the location of the files required to run the virtual

machine for consumer flexibility (assuming the provider has allowed it). These files

provide the operating system and necessary configuration (including Condor) for

executing the job. The workflow is invoked by the Submit machine. If the virtual
resource is specified by the workflow, the workflow manager on the Submit machine

either transfers the virtual machine files to the Execute resource or provides the Execute

resource with the location and protocol for retrieving the virtual machine files. (The

automatic copying of virtual images was not completely implemented for this paper.) For

performance, it is expected that host Execute machines have base virtual images local to

the resource that provide the operating system and Condor. Additional software and

configuration can be added in a separate file, called a differencing virtual disk, that stores only the

blocks modified relative to a parent hard disk (file). This provides a flexible

balance, allowing resource providers to provide base images and giving resource

consumers the ability to extend the base images.
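
The relationship between a base image and a differencing disk can be illustrated with a simplified in-memory model. The Python sketch below is not the on-disk virtual hard disk format; the class name, the 512-byte block size, and the dictionary of blocks are assumptions made for illustration. It shows only the idea that a child disk stores its own modified blocks and defers to its parent for everything else.

```python
class VirtualDisk:
    """Toy block device; a disk with a parent acts as a differencing disk."""

    def __init__(self, block_count, block_size=512, parent=None):
        self.block_count = block_count
        self.block_size = block_size
        self.parent = parent
        self.blocks = {}  # holds only the blocks written to this disk

    def _check(self, index):
        if not 0 <= index < self.block_count:
            raise IndexError(f"block {index} out of range")

    def write_block(self, index, data):
        """Store a block locally; the parent is never modified."""
        self._check(index)
        if len(data) != self.block_size:
            raise ValueError("data must be exactly one block long")
        self.blocks[index] = bytes(data)

    def read_block(self, index):
        """Return the nearest copy of a block, walking up the parent chain."""
        self._check(index)
        disk = self
        while disk is not None:
            if index in disk.blocks:
                return disk.blocks[index]
            disk = disk.parent
        return bytes(self.block_size)  # never written anywhere: zeros

    def stored_bytes(self):
        """Bytes held by this disk alone, excluding its parents."""
        return len(self.blocks) * self.block_size
```

In this model a provider would hold the base disk, and a consumer's differencing disk would carry only the blocks changed by installing additional software, so only that smaller file needs to be transferred.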

The workflow, running on the Submit machine, also provides the logic for starting

the virtual resource on the host. Microsoft Virtual Server 2005 R2 provides an API for

managing local and remote virtual machines. The workflow leverages this API for

starting the virtual resources. For this paper, it is assumed that virtual resources are started

from a “cold” state. The result is that startup times are as long as a normal boot time for

the respective operating system.

PERFORMANCE TESTS AND MEASUREMENTS

To test performance, a 2GHz AMD Athlon 64 processor with 1 GB of RAM

running Windows XP was used in the Central Manager, Execute, and Submit roles. Two

virtual Execute machines, running Debian Linux 3.1 and Windows 2000, each with 128

MB RAM were created. A virtual network was created to allow communication between

the three different operating systems, each running Condor.
The MEME [17] bioinformatics application was used as the test job. Initially, a

MEME job was submitted to the Condor pool using the standard Condor command-line

tools (e.g. condor_submit). With the test input and configuration options used, job

submission, execution, and return of results took less than one minute.

Using Windows Workflow Foundation and Visual Studio, a graphical workflow

was constructed that submitted the same MEME job to the cluster, specifically requesting

a Windows 2000 or Linux resource. The same test input and configuration options took 6

to 8 minutes on average. Since the virtual machines are programmatically started only

after an initial job schedule fails and are currently starting from a cold state, the start

times include the setup and also reflect the time for the operating system to boot. There

is also an unresolved issue with the (5 minute) cycle time between scheduling when using

the Condor SOAP API [18].

Additionally, the Windows 2000 virtual machine was created as a base image

(932 MB) with a differencing virtual disk that included Condor and other support

software (684 MB). Since the differencing disks use a sector bitmap to indicate which

sectors are within the current disk (1’s) or on the parent (0’s), the specification [11]

suggests it may be possible to achieve performance improvements. The differencing disk also lent itself well

to compression. The 684 MB difference disk was compressed to 116 MB (using standard

ZIP compression). This file could be transferred over a standard broadband Internet

connection in 3.7 minutes (at 511.88 Kb/s) as opposed to 30 minutes.
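
The compression ratio and the transfer time for the compressed disk can be checked with a short calculation. The Python sketch below assumes that the quoted rate is read as kilobytes per second and that decimal units apply (1 MB = 1000 KB); the text does not state either, and the function names are introduced here for illustration. Under those assumptions, the 116 MB file takes about 3.8 minutes, close to the reported figure.

```python
def compression_ratio(compressed_mb, original_mb):
    """Fraction of the original size that remains after compression."""
    return compressed_mb / original_mb


def transfer_minutes(size_mb, rate_kb_per_s):
    """Transfer time in minutes, assuming 1 MB = 1000 KB (an assumption)."""
    seconds = (size_mb * 1000) / rate_kb_per_s
    return seconds / 60


if __name__ == "__main__":
    # Sizes and rate are the figures reported in the text above.
    print(f"ratio: {compression_ratio(116, 684):.3f}")
    print(f"minutes: {transfer_minutes(116, 511.88):.2f}")
```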

CONCLUSION AND FUTURE WORK

A number of additional modifications are required for this solution to become

more robust. For example, security was not considered. Also, the current times for
executing short-running jobs are not acceptable. Another improvement would be to start

the virtual machines from a “hot” or paused state. Since the virtual machines used in this

exercise obtained their addresses via DHCP, they would need static IP addresses or some

way of knowing when they have been un-paused. The virtual hard

disk(s) may be further compressed using a specific compression algorithm that takes the

disk format into account. Performance considerations could also be given to differencing

hard disks that are chained together for application extensibility purposes.

This paper describes a mechanism for extending Condor to take advantage of

virtualization to more flexibly (and cost-effectively) create an execution environment at

run-time that balances the interests of the resource providers and consumers.
REFERENCES

1. Keahey, K., Foster, I., Freeman, T., Zhang, X. Virtual Workspaces: Achieving
Quality of Service and Quality of Life in the Grid. CCGRID 2006, Singapore,
May 2006.
2. Keahey, K., Foster, I., Freeman, T., Zhang, X., Galron, D. Virtual Workspaces in
the Grid. Europar 2005, Lisbon, Portugal, September, 2005.
3. http://workspace.globus.org/vm
4. http://www.cs.wisc.edu/condor/description.html
5. http://www.bo.infn.it/alice/alice-doc/mll-doc/condor/node4.html
6. Figueiredo, R., Dinda, P., Fortes, J. A Case for Grid Computing on Virtual
Machines.
7. Keahey, K., Ripeanu, M., Doering, K. Dynamic Creation and Management of
Runtime Environments in the Grid.
8. http://www.cs.wisc.edu/condor/CondorWeek2005/presentations/user_tutorial.ppt
9. Chase, J., Irwin, D., Grit, L., Moore, J., Sprenkle, S. Dynamic Virtual Clusters in
a Grid Site Manager.
10. Zhang, X., Keahey, K., Foster, I., Freeman, T. Virtual Cluster Workspaces for
Grid Applications.
11. Virtual Hard Disk Image Format Specification. October 11, 2006 – Version 1.0.
Microsoft.
12. Sumanth, J. Running Condor in a Virtual Environment with coLinux.
http://www.cs.wisc.edu/condor/CondorWeek2006/presentations/sumanth_condor_colinux.ppt
13. Santosa, M., Schaefer, A. Build a heterogeneous cluster with coLinux and
openMosix.
http://www-128.ibm.com/developerworks/linux/library/l-colinux/index.html
14. Condor Version 6.9.2 Manual. http://www.cs.wisc.edu/condor/manual/v6.9/
15. http://www.cs.wisc.edu/condor/birdbath/
16. http://wf.netfx3.com/content/WFIntro.aspx
17. MEME. http://meme.sdsc.edu
18. https://lists.cs.wisc.edu/archive/condor-users/2006-May/msg00296.shtml