You are on page 1of 89


Faisal N. Abu-Khzam & Michael A. Langston

University of Tennessee

Hour 1: Introduction Break Hour 2: Using the Grid Break Hour 3: Ongoing Research Q&A Session

Hour 1: Introduction
What is Grid Computing? Who Needs It? An Illustrative Example Grid Users Current Grids


What is Grid Computing?
Computational Grids
± Homogeneous (e.g., Clusters) ± Heterogeneous (e.g., with one-of-a-kind instruments)

Cousins of Grid Computing Methods of Grid Computing

Each user should have a single login account to access all resources. Resources may be owned by diverse organizations. . and data. switches.Computational Grids A network of geographically distributed resources including computers. instruments. peripherals.

. capacity.g. Gridware can be viewed as a special type of middleware that enable sharing and manage grid components based on user requirements and resource attributes (e.Computational Grids Grids are typically managed by gridware. availability«) . performance.

etc. .. Network Computing. Internet Computing..Cousins of Grid Computing Parallel Computing Distributed Computing Peer-to-Peer Computing Many others: Cluster Computing. Client/Server Computing.

Distributed Computing People often ask: Is Grid Computing a fancy new name for the concept of distributed computing? In general. the answer is ³no. .´ Distributed Computing is most often concerned with distributing the load of a program across two or more processes.

. Computers can act as clients or servers depending on what role is most efficient for the network.PEER2PEER Computing Sharing of computer resources and services by direct exchange between systems.

Methods of Grid Computing Distributed Supercomputing High-Throughput Computing On-Demand Computing Data-Intensive Computing Collaborative Computing Logistical Networking 10 .

Distributed Supercomputing Combining multiple high-capacity resources on a computational grid into a single. Tackle problems that cannot be solved on a single system. 11 . virtual distributed supercomputer.

12 . with the goal of putting unused processor cycles to work.High-Throughput Computing Uses the grid to schedule large numbers of loosely coupled or independent tasks.

Models real-time computing demands.On-Demand Computing Uses grid capabilities to meet short-term requirements for resources that are not locally accessible. 13 .

and databases.Data-Intensive Computing The focus is on synthesizing new information from data that is maintained in geographically distributed repositories. 14 . Particularly useful for distributed data mining. digital libraries.

15 . Applications are often structured in terms of a virtual shared space.Collaborative Computing Concerned primarily with enabling and enhancing human-to-human interactions.

Contrasts with traditional networking. and distribution channels.Logistical Networking Global scheduling and optimization of data movement. depots. Called "logistical" because of the analogy it bears with the systems of warehouses. 16 . which does not explicitly model storage resources in the network.

Who Needs Grid Computing? A chemist may utilize hundreds of processors to screen thousands of compounds per hour. Meteorologists seek to visualize and analyze petabytes of climate data with enormous computational demands. 17 . Teams of engineers worldwide pool resources to analyze terabytes of structural data.

collected microbiological samples in the tidewaters around Wallops Island. University of California. San Diego. 18 . Virginia. She needed the high-performance microscope located at the National Center for Microscopy and Imaging Research (NCMIR).An Illustrative Example Tiffany Moisan. a NASA research scientist.

she could move the platform holding them and make adjustments to the microscope. Thus. in addition to viewing the samples. 19 .Example (continued) She sent the samples to San Diego and used NPACI¶s Telescience Grid and NASA¶s Information Power Grid (IPG) to view and control the output of the microscope from her desk on Wallops Island.

Example (continued) The microscope produced a huge dataset of images. 20 . This dataset was stored using a storage resource broker on NASA¶s IPG. Moisan was able to run algorithms on this very dataset while watching the results in real time.

Grid Users Grid developers Tool developers Application developers End Users System Administrators 21 .

Grid Developers Very small group. Implementers of a grid ³protocol´ who provides the basic services required to construct a grid. 22 .

Tool Developers Implement the programming models used by application developers. Implement basic services similar to conventional computing services: ± User authentication/authorization ± Process management ± Data access and communication 23 .

Tool Developers
Also implement new (grid) services such as:
± Resource locations ± Fault detection ± Security ± Electronic payment


Application Developers
Construct grid-enabled applications for end-users who should be able to use these applications without concern for the underlying grid. Provide programming models that are appropriate for grid environments and services that programmers can rely on when developing (higher-level) applications.

System Administrators
Balance local and global concerns. Manage grid components and infrastructure. Some tasks still not well delineated due to the high degree of sharing required.


Some Highly-Visible Grids
The NSF PACI/NCSA Alliance Grid. The NSF PACI/SDSC NPACI Grid. The NASA Information Power Grid (IPG). The Distributed Terascale Facility (DTF) Project.

Intel. and Oracle. Sun Microsystems. Myricom.DTF Currently being built by NSF¶s Partnerships for Advanced Computational Infrastructure (PACI) A collaboration: NCSA. Quest Communications. 28 . and Caltech will work in conjunction with IBM. SDSC. Argonne.

visualization systems. and data at four sites. Stores more than 450 trillion bytes of data.6 trillion calculations per second.DTF Expectations A 40-billion-bits-per-second optical network (Called TeraGrid) is to link computers. 29 . Performs 11.


Hour 2: Using the Grid Globus Condor Harness Legion IBP NetSolve Others 31 .

and the University of Chicago's Distributed Systems Laboratory. Started in 1996 and is gaining popularity year after year. 32 . the University of Southern California¶s Information Sciences Institute.Globus A collaboration of Argonne National Laboratory¶s Mathematics and Computer Science Division.

data resources.Globus A project to develop the underlying technologies needed for the construction of computational grids. displays. special instruments and so forth. 33 . Focuses on execution environments for integrating widely-distributed computational platforms.

34 .The Globus Toolkit The Globus Resource Allocation Manager (GRAM) ± Creates. The Grid Security Infrastructure (GSI) ± Provides authentication services. and manages services. monitors. ± Maps requests to local schedulers and computers.

etc. 35 . and locations of replicated datasets. including server configurations. network status.The Globus Toolkit The Monitoring and Discovery Service (MDS) ± Provides information about system status. Nexus and globus_io ± provides communication services for heterogeneous environments.

Heartbeat Monitor (HBM) ± Used by both system administrators and ordinary users to detect failure of system components or processes. 36 .The Globus Toolkit Global Access to Secondary Storage (GASS) ± Provides data movement and access mechanisms that enable remote programs to manipulate local data.

Condor The Condor project started in 1988 at the University of Wisconsin-Madison. 37 . The main goal is to develop tools to support High Throughput Computing on large collections of distributively owned computing resources.

Condor pools can share resources by a feature of Condor called flocking. A ³Condor pool´ consists of any number of machines. that are connected by a network. 38 . of possibly different architectures and operating systems.Condor Runs on a cluster of workstations to glean wasted CPU cycles.

± Enables the submission of new jobs. 39 . ± Puts a job on hold. A machine with job management installed is called a submit machine. ± Provides information about jobs that are already finished.The Condor Pool Software Job management services: ± Supports requests about the job queue .

A machine could be a ³submit´ and an ³execute´ machine simultaneously. ± Performs resource allocation and scheduling.The Condor Pool Software Resource management: ± Keeps track of available machines. 40 . Machines with resource management installed are called execute machines.

Condor-G A version of Condor that uses Globus to submit jobs to remote resources. 41 . Can be installed on a single machine. Allows users to monitor jobs submitted through the Globus toolkit. Thus no need to have a Condor pool installed.

Legion An object-based metasystems software project designed at the University of Virginia to support millions of hosts and trillions of objects linked together with high-speed links. 42 . Allows groups of users to construct shared virtual work spaces. to collaborate research and exchange information.

43 . runtime library implementations. The key feature of Legion is its object-oriented approach.Legion An open system designed to encourage third party development of new or updated applications. and core components.

44 . the University of Tennessee. Conceived as a natural successor of the PVM project.Harness A Heterogeneous Adaptable Reconfigurable Networked System A collaboration between Oak Ridge National Lab. and Emory University.

distributed virtual machine (DVM) that can run on anything from a Supercomputer to a PDA. and Multiple DVM Collaboration. Distributed Peer-to-Peer Control.Harness An experimental system based on a highly customizable. Built on three key areas of research: Parallel Plug-in Interface. 45 .

distributed systems and applications. 46 .IBP The Internet Backplane Protocol (IBP) is a middleware for managing and using remote storage. It was devised at the University of Tennessee to support Logistical Networking in large scale.

and can direct communication between them with DMA. 47 . On a processor backplane.IBP Named because it was designed to enable applications to treat the Internet as if it were a processor backplane. the user has access to memory and peripherals.

IBP IBP gives the user access to remote storage and standard Internet resources (e. content servers implemented with standard sockets) and can direct communication between them with the IBP API.g. 48 .

IBP makes it possible for applications of all kinds to use logistical networking to exploit data locality and more effectively manage buffer resources. 49 . applicationindependent interface to storage in the network.IBP By providing a uniform.

Designed for solving complex scientific problems in a loosely-coupled heterogeneous environment. 50 .NetSolve A client-server-agent model.

in addition to usage statistics.The NetSolve Agent A ³resource broker´ that represents the gateway to the NetSolve system Maintains an index of the available computational resources and their characteristics. 51 .

The NetSolve Agent Accepts requests for computational services from the client API and dispatches them to the best-suited sever. 52 . Runs on Linux and UNIX.

Runs on Linux. UNIX. Runs on a user¶s local system. Contacts the NetSolve system through the agent.The NetSolve Client Provides access to remote resources through simple and intuitive APIs. 53 . and Windows. which in turn returns the server that can best service the request.

Runs on different platforms: a single workstation. A daemon process that awaits client requests. symmetric multiprocessors (SMPs).The NetSolve Server The computational backbone of the system. cluster of workstations. or massively parallel processors (MPPs). 54 .

The NetSolve Server A key component of the server is the Problem Description File (PDF). With the PDF. 55 . routines local to a given server are made available to clients throughout the NetSolve system.

The PDF Template PROBLEM Program Name « LIB Supporting Library Information « INPUT specifications « OUTPUT specifications « CODE 56 .

57 . Uses statistical models on the collected data to generate a forecast of future behavior.Network Weather Service Supports grid technologies. Uses sensor processes to monitor cpu loads and network traffic. NetSolve is currently integrating NWS into its agent.

A NetSolve client is now in testing that allows access to Globus. The NetSolve client uses Legion¶s data-flow graphs to keep track of data dependencies. 58 . Legion has adopted NetSolve¶s client-user interface to leverage its metacomputing resources.Gridware Collaboarations NetSolve is using Globus' "Heartbeat Monitor" to detect failed servers.

Gridware Collaboarations NetSolve can access Condor pools among its computational resources. This improves fault tolerance. IBP-enabled clients and servers allow NetSolve to allocate and schedule storage resources as part of its resource brokering. 59 .


± Ongoing work at Tennessee General Issues. ± Open questions of interest to the entire research community 61 . Special Projects.Hour 3: Ongoing Research Motivation.

Motivation Computer speed doubles every 18 months Network speed doubles every 9 months Graph from Scientific American (Jan-2001) by Cleo Vilett. Caufield and Perkins 62 . source Vined Khoslan. Kleiner.

Special Projects The SInRG Project. Unbridled Parallelism ± SETI@home and Folding@home ± The Vertex Cover Solver Security. ± Grid Service Clusters (GSCs) ± Data Switches Incorporating Hardware Acceleration. 63 .

The SInRG Project 64 .

but tuned to take advantage of the highly structured and controlled design of the cluster.The Grid Service Cluster The basic grid building block. Some GSCs are general-purpose and some are special-purpose. Each GSC will use the same software infrastructure as is now being deployed on the national Grid. 65 .

The Grid Service Cluster 66 .

Links of at least1Gbps assure QoS in many circumstances simply by over provisioning.An advanced data switch The components that make up a GSC must be able to access each other at very high speeds and with guaranteed Quality of Service (QoS). 67 .

68 . Initial in-core memory (RAM) is approximately 4 gigabytes.Computational Ecology GSC Collaboration between computer science and mathematical ecology. 8-processor Symmetric MultiProcessor (SMP). Out-of-core data storage unit provides a minimum of 450 gigabytes.

High-end graphics workstations.Medical Imaging GSC Collaboration between computer science and the medical school. 69 . Distinguished by the need to have these workstations attached as directly as possible to the switch to facilitate interactive manipulation of the reconstructed images.

Molecular Design GSC Collaboration between computer science and chemical engineering. Data visualization laboratory 32 dual processors High performance switch 70 .

Investigating the potential of reconfigurable computing in grid environments. 8 Linux boxes with Pilchard boards.Machine Design GSC Collaboration between computer science and electrical engineering. 12 Unix-based CAD workstations. 71 .

Machine Design GSC 72 .

Types of Hardware General purpose hardware ± can implement any function ASICs ± hardware that can implement only a specific application FPGAs ± reconfigurable hardware that can implement any function 73 .

The FPGA FPGAs offer reprogrammability Allows optimal logic design of each function to be implemented Hardware implementations offer acceleration over software implementations which are run on general purpose processors 74 .

Higher bandwidth and lower latency than other environments.The Pilchard Environment Developed at Chinese University in Hong Kong. Plugs into 133MHz RAM DIMM slot and is an example of ³programmable active memory.´ Pilchard is accessed through memory read/write operations. 75 .

Determine effectiveness of hardware acceleration in this environment. Provide an interface for the remote use of FPGAs. Allow users to experiment and gauge whether a given problem would benefit from hardware acceleration. 76 .Objectives Evaluate utility of NetSolve gridware.

Sample Implementations Fast Fourier Transform (FFT) Data Encryption Standard algorithm (DES) Image backprojection algorithm A variety of combinatorial algorithms 77 .

VHDL code is needed 78 .Implementation Techniques Two types of functions are implemented Software version .runs in the FPGA To implement the hardware version of the function.runs on the PC¶s processor Hardware version .

CAD tools help make mapping decisions based on constraints such as: chip area. resource usage minimization.The Hardware Function Implemented in VHDL or some other hardware description language. I/O pin counts. routing resources and topologies. partitioning. 79 . The VHDL code is then mapped onto the FPGA (synthesis).

To run.The Hardware Function Result of synthesis is a configuration file (bit stream). 80 . This file defines how the FPGA is to be reprogrammed in order to implement the new desired functionality. a copy of the configuration file must be loaded on the FPGA.

Behind the Scenes VHDL programmer VHDL code Synthesis Software programmer Server administrator Configuration file Software and Hardware functions PDFs. Libraries Request Client Request results NetSolve server 81 .

Conclusions Hardware acceleration is offered to both local and remote users. A development environment is provided for devising and testing a wide variety of software. 82 . Resources are available through an efficient and easy-to-use interface. hardware and hybrid solutions.

83 .Unbridled Parallelism Sometimes the overhead of gridware is unneeded. We¶re currently building a Vertex Cover solver with multiple levels of acceleration. Well known examples include SETI@home and Folding@home.

± When does it make sense? ± How much efficiency are we gaining? ± What are its limitations? 84 .A Naked SSH Approach A bit of blasphemy: the anti-gridware paradigm Our work begs several questions.

Grid Security Algorithm complexity theory ± Verifiability ± Concealment Cryptography and checkpointing ± Corroboration ± Scalability Voting and spot-checking ± Fault tolerance ± Reliability 85 .

Some General Issues Grid architecture. Resource management. QoS mechanisms. Fault tolerance. 86 . Performance monitoring. 87 .edu/~abukhzam/gridtutorial.htm Condor: http://www.References URL to these slides: Globus: http://www.cs. 88 .edu/ SInRG: Harness: NWS: http://nws.References NetSolve: http://icl.cs.