
HIGH-PERFORMANCE COMPUTING

Installing High-Performance Linux Computing Clusters

By Christopher Stanton, Rizwan Ali, Yung-Chin Fang, and Munira A. Hussain

The first challenge in turning newly deployed cluster hardware into a usable high-performance computing cluster is installing the operating system and third-party software packages. In four- to eight-node clusters, each node can be installed manually. Large, industrial-strength clusters require a more efficient method. This article describes different types of cluster configurations, efficient Linux installation methods, and the benefits of each.

High-performance computing (HPC) clusters use three main types of master and compute node configurations: loosely, moderately, and tightly coupled. Each configuration describes the compute nodes' dependency on the master node (see Figure 1). Although all three require the master to be available for a job to run, the master's status does not necessarily affect the compute nodes' availability.

From an operating system viewpoint, the compute nodes in a loosely coupled cluster are fully autonomous machines. Each node has a full copy of the operating system (OS), which allows someone to boot up and log into the node without contacting the master node, unless the network uses dynamic Internet Protocol (IP) addresses. Failure to retrieve a dynamic IP address from the master node will not prevent a node from starting successfully, but the node will be accessible only through a local console.

A moderately coupled cluster binds the compute nodes more closely to the master node. In this configuration, the compute node's boot process requires the master node because, at minimum, the programs and information needed during boot are located on the master. Once the compute node has retrieved all needed file systems from the master, it acts like a stand-alone machine and can be logged into as though all file systems were local.

Tightly coupled systems push the dependence on the master node one step further. The compute node must load its operating system over the network from the master node. Compute nodes in a tightly coupled cluster do not store file systems locally, aside from possibly swap or tmp. From an OS standpoint, few differences exist between the master node and the compute nodes. The ability to log into the compute nodes individually does not exist. The process space is leveled so that the cluster looks more like one large monolithic machine than a collection of smaller machines.

The following sections explain the utilities and methods that enable setup and installation of each cluster type. Each configuration has inherent advantages and disadvantages, and the discussion explores which configuration best matches particular needs.

Installing loosely coupled clusters


In a loosely coupled cluster, each compute node has a local copy of the operating system. The most tedious and cumbersome way to install such a cluster is one node at a time from a CD. Automated methods to install a loosely coupled cluster include the following.

The Kickstart file
The Red Hat Kickstart installation method lets a user create a single, simple text file to automate most of a Red Hat Linux installation, including language selection, network configuration, keyboard selection, boot loader installation (such as the Linux Loader (LILO) or the GRand Unified Bootloader (GRUB)), disk partitioning,
mouse selection, and X Window System configuration. The Kickstart file consists of three sections: commands, package list, and scripts.

Commands. The commands section lists all the installation options, such as language and partition specification, network configuration, and installation method. For example, administrators can use the network configuration option to specify the node's IP address, host name, and gateway.

Packages. The %packages command starts the Kickstart file section that lists the packages to be installed. Packages can be specified either by a component name (a group of related packages) or by an individual package name.
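To make the structure concrete, the commands and %packages sections of a Kickstart file for a compute node might look like the following minimal sketch. The server address, host name, partition sizes, and package choices are illustrative assumptions, the directives follow Red Hat 7.1-era Kickstart syntax, and the scripts (%post) section is discussed below with Figure 2.

    install
    nfs --server 10.180.0.2 --dir /opt/nfs_export/redhat
    lang en_US
    keyboard us
    mouse generic3ps/2
    network --bootproto static --ip 10.180.0.11 --netmask 255.255.0.0 --gateway 10.180.0.1 --hostname node1
    rootpw changeme
    auth --useshadow --enablemd5
    timezone America/Chicago
    lilo --location mbr
    zerombr yes
    clearpart --all
    part /boot --size 64
    part swap --size 512
    part / --size 4096 --grow

    %packages
    @ Base
    openssh-server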

Figure 1. Cluster-level view of master and compute node configurations: a master node serving loosely coupled (independent) compute nodes, moderately coupled (dependent) compute nodes, or tightly coupled (integrated) compute nodes

A comps file on the Red Hat Linux CD-ROM (RedHat/base/comps) lists several predefined components. Users also can create their own component and list the packages it needs. (Note: To create a component, users must create a new International Organization for Standardization (ISO) image of the CD-ROM with their modified comps file.) The first component in the file is the Base component, which lists the set of packages necessary for Linux to run.

Scripts. Administrators can use the post-installation section of the Kickstart file to install packages that are not on the CD-ROM or to further tune the installation, such as customizing the host files or enabling SSH (Secure Shell). The post section is usually at the end of the Kickstart file and starts with the %post command. The additional packages must be available from a server on the network, typically the master node. The %post section would look like Figure 2:

    # POST-INSTALLATION COMMANDS
    %post
    rpm -ivh 10.180.0.2:/opt/nfs_export/Beowulf/drivers/my_driver.rpm

Figure 2. Post-installation command in the Kickstart file

This sample command installs the rpm package my_driver.rpm from the server with the IP address 10.180.0.2.

Red Hat 7.1 includes a Kickstart Configurator, a graphical user interface (GUI) for creating a Kickstart file instead of typing one by hand. After selecting Kickstart options, a user can click the Save File button to generate the Kickstart file. The Configurator enables users to select most of the options required for a Kickstart file and provides a good starting point for expert users, who may alter the generated file to suit their needs.

Kickstart installation methods
The Installation Method command in the Kickstart file lets administrators specify the installation method: a local CD-ROM, a local hard drive, or the network via Network File System (NFS), File Transfer Protocol (FTP), or Hypertext Transfer Protocol (HTTP).

The most cumbersome approach is to create a Kickstart file for each node and save the file to a Red Hat installation boot floppy. When the system is booted from the floppy (the Red Hat Linux CD must be in the CD-ROM drive and the Kickstart file set to install from CD-ROM), the installation starts automatically based on the options specified in the Kickstart file on the floppy. Each node has different network settings (IP address and host name) and therefore requires a separate floppy. This method is tedious for large cluster installations: It requires manual intervention to move the floppy and CD

from node to node, unless a large number of floppies or CDs are available to install all the nodes simultaneously.

A more efficient method uses the network to perform the installation. Here again, each node must have a floppy, but the CD is no longer required. The Installation Method section in the Kickstart file can be changed to support either FTP or NFS installation. Once the Red Hat installation with the Kickstart file has booted, it will retrieve the installation image from a dedicated server (usually the master node) on the network.

In the most commonly used installation approach, administrators place the Kickstart file as well as the CD image on the network. A Boot Protocol/Dynamic Host Configuration Protocol (BOOTP/DHCP) server and an NFS server must be on the local network, usually on the cluster master node. The BOOTP/DHCP server must include configuration information for all the machines to be installed in the cluster. It provides the client its networking information as well as the location of the installation boot kernel and ramdisk, and possibly the location of the Kickstart file. If the location of the Kickstart file is not provided, the installation program will try to read the file /kickstart/1.2.3.4-kickstart, where 1.2.3.4 is the numeric IP address of the machine being installed, on the DHCP server. Finally, the client NFS mounts the file's path, copies the specified file to its local disk, and begins installing the machine as described by the Kickstart file.

Installing the cluster using SystemImager
SystemImager is a remote OS duplication and maintenance system that reduces the repetitive steps needed to create a cluster of autonomous machines. SystemImager requires the administrator to install and configure an example compute node before cloning the remaining compute nodes. One advantage of this approach is that during installation, the administrator is not required to write specialized scripts to install additional software packages or configure system settings.

In the SystemImager approach, the compute node that will be used as the source or example system is called the golden client. The administrator must first install and configure this machine using traditional methods so that it is representative of all compute nodes in the cluster. SystemImager, which is installed on the master node, then creates a file system image of the entire golden client machine by using the getimage command. This image contains only the files on the remote machine rather than an image of the entire partition, which saves space. The prepareclient command creates a partition information table and a list of mounted file systems, which allows partitions to be created with the same mount points and sizes.

The master node now contains the information needed to create a duplicate of the golden client (see Figure 3). During the installation of compute nodes, the addclients command allows administrators to adjust system-specific configuration information on each node. The addclients command prompts for a host-name base and range, a client image, and IP addresses. The base represents the static part of the host name, and the range represents a starting and ending index to append to the host name. For example, if we chose node as the base and 1-3 as the range, installation routines would be created for node1, node2, and node3.

When the naming convention has been finalized, the administrator is prompted to assign an install image to these machines and then an IP address to each node. The host name and associated IP address are then added to a host-name file that is used during both the boot and installation process.

Upon completion of these steps on the master node, the compute-node boot method must be chosen. The SystemImager kernel and ramdisk can be booted from portable media such as a floppy or CD-ROM (created by the makeautoinstallfloppy or makeautoinstallcd commands, respectively). Alternatively, the kernel and ramdisk can be booted over the network via the Preboot Execution Environment (PXE). SystemImager contains prebuilt configuration files for the Linux PXE server (PXELinux), which must be running on the master node. PXE is a lightweight protocol that enables the compute node to contact a BOOTP/DHCP server. BOOTP (and DHCP, which is an extension of BOOTP) allows a server to provide the client, identified by its hardware Media Access Control (MAC) address, with much of its initial configuration information, such as IP address, subnet mask, broadcast address, network address, gateway address, host name, and kernel and ramdisk download path.

Figure 3. SystemImager installation method: the master node grabs the image from the golden client, and the remaining compute nodes pull the image from the master node
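Taken together, the SystemImager steps above reduce to a short command sequence on the golden client and the master node. The sketch below is illustrative only; the host name node1, the image name compute-image, and the option spellings are assumptions based on SystemImager documentation of that era and may differ between releases.

    # On the golden client, after it has been installed and configured by hand:
    prepareclient                 # builds the partition and mounted-file-system tables

    # On the master node:
    getimage -golden-client node1 -image compute-image   # pull the golden client's files
    addclients                    # interactive: host-name base and range, image, IP addresses
    makedhcpserver                # generate the DHCP configuration (discussed next)
    makeautoinstallfloppy         # or makeautoinstallcd, or boot the nodes via PXE/PXELinux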
Once the node has booted, it must retrieve its IP address and host name. This is accomplished by a DHCP server on the master node that assigns both values or by placing both values on the floppy disk used to boot the node. SystemImager provides a DHCP configuration-building tool, makedhcpserver, that will construct a DHCP configuration file that maps a host name and IP address. The makedhcpstatic command can create static mappings between a specific machine and a host name/IP address pair.
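The static mappings that makedhcpstatic maintains correspond to ISC dhcpd host entries of roughly the following form. The MAC address, IP address, and file paths are placeholders, and the exact file SystemImager generates may differ:

    # /etc/dhcpd.conf fragment (illustrative)
    host node1 {
        hardware ethernet 00:06:5b:12:34:56;   # compute node's MAC address
        fixed-address 10.180.0.11;             # IP address tied to that MAC
        option host-name "node1";
        next-server 10.180.0.2;                # master node acting as the TFTP server
        filename "/tftpboot/pxelinux.0";       # boot loader served over TFTP for PXE boots
    }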

Maintaining the cluster with SystemImager
An administrator also can use the golden client image as a change log and a single point of administration for cluster-wide modifications, from a single file to an entire package. First, the cluster administrator makes the desired modifications to the golden client. Next, the administrator either updates the currently used image or creates a new image from which to base the cluster.

This approach enables the administrator to keep a version history in case a change breaks the cluster. Once the new image has been created, the remaining compute nodes are synced with the changes. This generally requires minimal time because only the modified files are copied to each node. If a change does disrupt the cluster, it can be re-synced from an earlier image that is known to work.

Installing moderately coupled clusters
In a moderately coupled cluster, each compute node can be accessed as an individual machine, but each node does not have its own local copy of the OS. Administrators can use many different methods to install a moderately coupled system. The following sections describe two common methods: the hybrid model (temporary data is stored locally) and the fully diskless model (the compute nodes do not have hard drives). Both methods use a central server to store and load the OS and other system information.

Booting the compute node from the network
The compute node will need to retrieve many essential OS components over the network. First it must be able to network boot, which requires the compute node to support a network booting protocol, such as PXE, so the node can contact a BOOTP/DHCP server for configuration information.

Each time a node boots, it is assigned its network information and given a path from which to download a Linux kernel and ramdisk via Trivial FTP (TFTP). Although the kernel and ramdisk can be located on the same server as the BOOTP/DHCP server, this is not a requirement. The kernel must be built with support for an initial ramdisk because the entire root (/) file system will be located inside the ramdisk. This allows the node to boot without an NFS mount.

To use the ramdisk as a local root file system, some modifications are required. When the kernel and ramdisk are loaded into memory, the kernel mounts the ramdisk as a read-write file system. Next it looks for a /linuxrc file (a binary executable or a script beginning with #!). After this file has finished running, the kernel normally unmounts the ramdisk and mounts a traditional root file system from disk. Because the root file system does not exist locally on disk, the /linuxrc file must be linked to /sbin/init so the OS will run directly from the ramdisk.

At this point, the implementation of the two methods splits.

Hybrid model. In the hybrid model, the newly booted kernel checks for the existence of swap, var, and tmp partitions for local storage and logging. If correctly sized partitions do not exist on the local media, they are created and mounted. The hybrid model reduces the network load, stores logs on a static storage device, and provides swap space for memory swapping. Storing logs statically allows the logs to be saved across reboots.
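The hybrid model's partition check can be pictured as a small boot-time script along these lines. This is purely a sketch under assumed device names, not code shipped with any of the tools discussed here; on a real node it might run from an rc script after the ramdisk root comes up:

    #!/bin/sh
    # Illustrative hybrid-model boot fragment: use local partitions for swap
    # and /var when they exist, and create them only when they are missing.
    # /dev/hda1 and /dev/hda2 are assumptions about the node's disk layout;
    # /tmp could be handled the same way as /var.

    if [ -b /dev/hda1 ]; then
        swapon /dev/hda1 2>/dev/null || { mkswap /dev/hda1 && swapon /dev/hda1; }
    fi

    if [ -b /dev/hda2 ]; then
        # Mount the existing /var partition; build a file system only if none
        # exists, so logs survive across reboots.
        mount /dev/hda2 /var 2>/dev/null || { mke2fs -q /dev/hda2 && mount /dev/hda2 /var; }
    fi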
Diskless model. The diskless model uses var and tmp directories located on either the initial ramdisk, another ramdisk downloaded later, or NFS-mounted directories. Using an NFS-mounted var directory provides a single point of access to cluster-wide log files, and the logs will not be lost because of rebooting. This benefit can ease administration: running utilities locally on the NFS-exporting machine allows log monitoring as well as condensed log reporting. Because no local disk exists, memory swapping must occur over NFS or not at all.

Memory swapping is important because if an application's needs exceed the amount of available physical memory, the program could crash or the operating system might hang. Although memory can be swapped across the network to nonlocal drives, this action will drastically degrade the cluster's performance. For these reasons, if jobs with unknown memory requirements will run on the cluster, diskless nodes are not recommended.

Moderately coupled clusters offer administrative benefits
Under both methods, after the final directories have been mounted over the network or the local drive, the compute node will be fully booted and ready to accept jobs. From an external
as well as a cluster viewpoint, each compute node acts as an individual machine. From an administrative viewpoint, upgrading each compute node is vastly simplified. Because only one compute node image exists, administrators need to upgrade only one image rather than all the compute nodes. Modifications or updates made to an NFS shared directory take effect immediately. If the changes are made to the compute node kernel, ramdisk, or any other piece of the OS that has been downloaded, administrators must reboot each machine for the changes to take effect.
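For example, a master node exporting a shared OS tree and a cluster-wide log area might carry /etc/exports entries like the following; the paths and host patterns are illustrative assumptions, not values from this article:

    # /etc/exports on the master node (illustrative)
    /opt/cluster/root    node*(ro,no_root_squash)    # shared, read-only OS tree for compute nodes
    /opt/cluster/var     node*(rw,no_root_squash)    # writable area for per-node logs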

Tightly coupled clusters remove distinctions


Tightly coupled clusters try to remove the distinction between the compute nodes and the cluster. Users see only the master node, which looks like a massively parallel computer. The compute nodes are simply an extension of the master node and are pure processing platforms. The Scyld Beowulf system creates such a cluster.

Implementing Scyld Beowulf
Scyld is a commercial package that provides a simple and convenient way to install, administer, and maintain a Beowulf cluster. A set of compute nodes is managed and controlled from a master node. This cluster differs from a traditional Beowulf deployment because it behaves much like a single, large-scale parallel computer. The compute nodes are pure compute nodes that do not offer a login shell. System tools have been modified so administrators can view the entire process space from the master node, similar to how the local process space is viewed on a stand-alone machine.

Installation of the master node begins with the Scyld Beowulf CD, which is a modified version of Red Hat Linux 6.2. These modifications include additional cluster management and control applications, message passing libraries, and a modified kernel for the compute nodes. Once the master node has been installed from the CD, administrators can choose from a few compute node installation methods. A compute node installation transfers the necessary kernel, ramdisk, and libraries, and occurs in two stages.

Stage 1. Several media can transfer the stage-1 kernel and ramdisk to the compute nodes: floppy disk, CD-ROM, basic input/output system (BIOS), hard drive, or the network using a network booting protocol such as PXE. The first four only require that a bootable kernel be located on the chosen device. Booting the node from the network requires modifications to the master node: it will need a TFTP server and a DHCP server installed and configured. When a compute node first boots, it will receive an IP address and the location from which to download the stage-1 kernel and ramdisk via TFTP.

Once a node has booted into stage 1, it is ready to be added to the cluster. Stage 1 places the compute node in a loop where it repeatedly broadcasts its MAC address via Reverse Address Resolution Protocol (RARP) to indicate that it is available. On the master node, daemons that are part of the Scyld system detect this activity and add the MAC address to a list of unknown machines. A graphical utility called beosetup allows the administrator to migrate machines from the list of unknown machines to the compute node list. Compute nodes are assigned their node identifier based on the order in which they are added to the compute node list.

Stage 2. This stage begins when the machine has been placed on the compute node list. The master node first transfers the stage-2 kernel to the selected compute node over a TCP/IP connection. The currently running stage-1 kernel then switches to the stage-2 kernel by a technique known as two kernel monte. After this kernel finishes booting, it downloads all required shared libraries, mounts exported file systems, and downloads and starts two user daemons, beostats and beoslave. These daemons enable, respectively, the transfer of node statistics (such as load and memory pressure) to the master node and the migration of processes from the master node. A login console is never available on the compute node, so all control and monitoring is centralized at the master node and communicated through the two user daemons.

Advantages and disadvantages of tightly coupled clusters
It should be noted that no information has been permanently stored on the compute nodes. The Scyld system allows the creation of completely diskless compute nodes if desired but also provides the ability to locally install parts of the system, such as the stage 1 and 2 pieces and swap space (much like the moderately coupled method). From a user and administrator standpoint, the cluster is one large machine, which is the major advantage of this style of Beowulf cluster. The tightly integrated suite of cluster management and clustering utilities offers the administrator a uniform set of tools, which reduces time requirements and interoperability problems.

Unfortunately, this clustering style has some of the disadvantages of the moderately coupled clustering style and also ties the cluster to a single software vendor for the entire solution stack.
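Although the article does not name them, Scyld's BProc-based user tools are what make this single-system view concrete in day-to-day use. A typical interaction from the master node might look like the following sketch (node numbers and the job name are placeholders):

    # All commands run on the master node; compute nodes have no login shell.
    bpstat                      # list compute nodes and their state
    bpsh 0 uname -r             # run a command on node 0; output returns to the master
    bpsh -a uptime              # run the same command across all available nodes
    ps aux | grep my_job        # migrated processes appear in the master's unified process space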

Balancing administrative ease and system performance


The type of configuration selected for a given cluster should be based on user needs. As a cluster becomes more tightly coupled, the burden of administering the cluster drops. Unfortunately, this approach can adversely affect cluster performance because system management over the network degrades the available network bandwidth. In very tightly coupled clusters, many common utilities and applications will fail to operate because of the proprietary nature of the cluster design. These cost/benefit ratios should be considered before selecting a configuration.


Christopher Stanton (christopher_stanton@dell.com) is a senior systems engineer in the Scalable Systems Group at Dell. His HPC cluster-related interests include cluster installation, management, and performance benchmarking. Christopher graduated from the University of Texas at Austin with a B.S. and special honors in Computer Science.

Rizwan Ali (rizwan_ali@dell.com) is a systems engineer in the Scalable Systems Group at Dell. His current research interests are performance benchmarking and high-speed interconnects. Rizwan has a B.S. in Electrical Engineering from the University of Minnesota.

Yung-Chin Fang (Yung-Chin_fang@dell.com) is a member of the Scalable Systems Group at Dell. Yung-Chin has a B.E. in Computer Science from Tamkang University and an M.S. in Computer Science from Utah State University. He is currently working on his Ph.D. in Computer Science at the University of Houston.

Munira A. Hussain (munira_hussain@dell.com) is a systems engineer in the Scalable Systems Group at Dell. She completed her B.S. in Electrical Engineering with a minor in Computer Science at the University of Illinois at Urbana-Champaign.

FOR MORE INFORMATION


Kickstart: http://www.redhat.com/docs/manuals/linux/RHL-7.1-Manual/custom-guide/ch-kickstart2.html
SystemImager: http://systemimager.org
Scyld: http://www.scyld.com
