
Implementing a Linux Cluster

A project report
submitted in partial fulfilment of
the requirements for the award of the degree of
Bachelor of Technology
in
Computer Science and Engineering
by
Deepak Lukose
Roll No: 16
Group No: 8
S8 CSE
Department of Computer Engineering
National Institute of Technology, Calicut
Kerala - 673601
2006
Certificate
This is to certify that the project entitled Implementing a Linux Cluster is a bonafide record of the project presented by Deepak Lukose (Y2.025) under our supervision and guidance. The project report has been submitted to the Department of Computer Engineering of National Institute of Technology, Calicut in partial fulfilment of the requirements for the award of the Degree of Bachelor of Technology in Computer Science and Engineering.
Dr. M.P. Sebastian
Professor and Head
Dept. of Computer Engineering
NIT Calicut
Mr. Vinod Pathari
Lecturer
Dept. of Computer Engineering
NIT Calicut
Abstract
A computer cluster is a group of loosely coupled computers that work together closely so that in many respects they can be viewed as though they were a single computer. Clusters are commonly connected through fast local area networks and are usually deployed to improve speed and/or reliability over that provided by a single computer, while typically being much more cost-effective than single computers of comparable speed or reliability.
Clusters built from open source software, particularly based on the GNU/Linux
operating system, are increasingly popular. Their success is not hard to explain
because they can cheaply solve an ever-widening range of number-crunching appli-
cations. A wealth of open source or free software has emerged to make it easy to
set up, administer, and program these clusters. This work aims at an implementation of a free and open source cluster for performing scientific computations at a faster pace.
Acknowledgements
I express my sincere thanks to Mr. Vinod Pathari for his constant backing and support. I would like to extend my gratitude to the entire faculty and staff of the CSE Department of NITC, who stood by me in all the difficulties I had to face during the completion of this project. Last but not the least, I thank God Almighty for being the guiding light all throughout.
Deepak Lukose
Contents
Chapter
1 Introduction 1
1.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Cluster history . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.3 High-performance (HPC) clusters . . . . . . . . . . . . . . . . . . . 2
1.4 Cluster technologies . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2 Motivation 4
3 Design 5
3.1 Design decisions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3.1.1 Mission . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3.1.2 Operating System . . . . . . . . . . . . . . . . . . . . . . . . 6
3.1.3 Cluster Software . . . . . . . . . . . . . . . . . . . . . . . . 6
3.1.4 Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
4 Cluster Installation 8
4.1 OSCAR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
4.2 Packages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
4.3 Installation Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . 10
5 Parallel Programming 11
5.1 Hello World . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
5.1.1 MPI_Init . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
5.1.2 MPI_Finalize . . . . . . . . . . . . . . . . . . . . . . . . . . 12
5.1.3 MPI_Comm_size . . . . . . . . . . . . . . . . . . . . . . . . 13
5.1.4 MPI_Comm_rank . . . . . . . . . . . . . . . . . . . . . . . 13
5.1.5 MPI_Get_processor_name . . . . . . . . . . . . . . . . . . 14
5.2 Numerical Integration . . . . . . . . . . . . . . . . . . . . . . . . . 15
6 Molecular Dynamics Modeling of Thermal Conductivity of Engineering Fluids and its Enhancement due to Nanoparticle Inclusion 19
6.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
6.2 Profiling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
7 Testing and Benchmarking 26
8 Conclusion 29
Chapter 1
Introduction
1.1 Problem Definition
To set up a cluster which can increase the performance of CPU-intensive tasks like scientific computations, graphics rendering, etc. In short, to create a high-performance cluster for general-purpose computing with special regard to modeling and simulation of molecular systems.
1.2 Cluster history
The first commodity clustering product was ARCnet, developed by Datapoint in 1977. ARCnet was not a commercial success and clustering did not really take off until DEC released their VAXcluster product in the 1980s for the VAX/VMS operating system. The ARCnet and VAXcluster products not only supported parallel computing, but also shared file systems and peripheral devices. They were supposed to give us the advantage of parallel processing while maintaining data reliability and uniqueness. VAXcluster, now VMScluster, is still available
on OpenVMS systems from HP running on Alpha and Itanium systems.
The history of cluster computing is intimately tied up with the evolution of
networking technology. As networking technology has become cheaper and faster,
cluster computers have become significantly more attractive.
1.3 High-performance (HPC) clusters
High-performance clusters are implemented primarily to provide increased
performance by splitting a computational task across many different nodes in the cluster, and are most commonly used in scientific computing. One of the more popular HPC implementations is a cluster with nodes running Linux as the OS and free software to implement the parallelism. This configuration is often referred to as a Beowulf cluster. Such clusters commonly run custom programs which have been designed to exploit the parallelism available on HPC clusters. Many such programs use libraries such as the Message Passing Interface (MPI) which are specially
designed for writing scientic applications for HPC computers.
1.4 Cluster technologies
The Message Passing Interface (MPI) is a computer communications protocol.
It is a de facto standard for communication among the nodes running a parallel
program on a distributed memory system. MPI is a library of routines that can
be called from Fortran, C, C++ and Ada programs. MPI's advantage over older
message passing libraries is that it is both portable (because MPI has been imple-
mented for almost every distributed memory architecture) and fast (because each
implementation is optimized for the hardware it runs on).
The GNU/Linux world sports various cluster software, such as:
Specialized application clusters - clusters built using specialized applications, e.g. Beowulf, distcc, MPICH and others. distcc provides parallel compilation when using GCC.
Director-based clusters - these clusters allow incoming requests for services to be distributed across multiple cluster nodes, e.g. Linux Virtual Server and Linux-HA.
Full-blown clusters - this category includes clusters which integrate into the kernel a mechanism for automatic process migration among homogeneous nodes, e.g. MOSIX, openMosix, Kerrighed and OpenSSI. OpenSSI and Kerrighed are single-system image implementations.
DragonFly BSD, a recent fork of FreeBSD 4.8, is being redesigned at its core
to enable native clustering capabilities. It also aims to achieve single-system image
capabilities.
MSCS is Microsoft's high-availability cluster service for Windows, based on technology developed by Digital Equipment Corporation. The current version supports up to eight nodes in a single cluster, typically connected to a SAN. A set of APIs supports cluster-aware applications, and generic templates provide support for non-cluster-aware applications.
Grid computing is a technology closely related to cluster computing. The key
differences between grids and traditional clusters are that grids connect collections
of computers which do not fully trust each other, and hence operate more like a
computing utility than like a single computer. In addition, grids typically support
more heterogeneous collections than are commonly supported in clusters.
Chapter 2
Motivation
Most of the time, the computer is idle. Start a program like xload or top
that monitors the system use, and one will probably find that the processor's load is not even hitting the 1.0 mark. If one has two or more computers, chances are that at any given time, at least one of them is doing nothing. Unfortunately, when we really do need CPU power - during a C++ compile, or encoding Ogg Vorbis music files - we need a lot of it at once. The idea behind clustering is to spread
these loads among all available computers, using the resources that are free on
other machines.
The basic unit of a cluster is a single computer, also called a node. Clusters
can grow in size - they scale - by adding more machines. The power of the cluster as a whole depends on the speed of the individual computers and of their interconnections. In addition, the operating system of the cluster must make the best use of the available hardware in response to changing conditions. This becomes more of a challenge if the cluster is composed of different hardware types (a heterogeneous cluster), if the configuration of the cluster changes unpredictably (machines joining and leaving the cluster), or if the loads cannot be predicted ahead
of time.
Chapter 3
Design
Designing a cluster entails four sets of design decisions:
(1) Determine the overall mission of the cluster.
(2) Select a general architecture for the cluster.
(3) Select the operating system, cluster software, and other system software
that will be used.
(4) Select the hardware for the cluster.
While each of these tasks, in part, depends on the others, the first step is crucial. If at all possible, the cluster's mission should drive all other design decisions. At the very least, the other design decisions must be made in the context of the cluster's mission and be consistent with it.
Selecting the hardware should be the final step in the design, but often we won't have as much choice as we would like. A number of constraints may force us to select the hardware early in the design process. The most obvious is the budget constraint.
Defining what we want to do with the cluster is really the first step in designing it. For many clusters, the mission will be clearly understood in advance. This is particularly true if the cluster has a single use or a few clearly defined uses.
But it should be noted that clusters have a way of evolving. What may be a
reasonable assessment of needs today may not be tomorrow. Good design is often
the art of balancing today's resources with tomorrow's needs.
3.1 Design decisions
3.1.1 Mission
To set up a cluster which can increase the performance of CPU-intensive tasks like scientific computations, graphics rendering, etc. In short, to create a high-performance cluster for general-purpose computing with special regard to modeling and simulation of molecular systems.
3.1.2 Operating System
Red Hat Enterprise Linux 4 (RHEL4) was selected because it is one of the platforms most widely supported by cluster software.
3.1.3 Cluster Software
OSCAR (Open Source Cluster Application Resources) is a software package for building high-performance Linux clusters for parallel/distributed computing needs. OSCAR works across several Linux distributions. For greater control over how the cluster is configured, one will be happier with OSCAR in the long run. Typically, OSCAR provides better documentation than other cluster kits like Rocks. OSCAR was chosen over a traditional Beowulf setup due to its ease of installation as well as its comprehensive package set, which includes many compilers and application software. One of the main packages being used on OSCAR is LAM/MPI, a popular implementation of the MPI parallel programming paradigm. LAM/MPI can be used for developing parallel programs and is one of the leading implementations available.
3.1.4 Hardware
The preferred hardware is P4 3.0 GHz machines with 2 GB DDR RAM on each node and 1 Gbps/100 Mbps NICs which support PXE (Preboot Execution Environment) for network booting. The head node must have at least 7 GB of hard disk capacity and all other client nodes need a minimum of 5 GB of hard disk capacity.
Chapter 4
Cluster Installation
One of the more important developments in the short life of high perfor-
mance clusters has been the creation of cluster installation kits such as OSCAR
(Open Source Cluster Application Resources) and Rocks. With software packages
like these, it is possible to install everything one needs and very quickly have a
fully functional cluster. A fully functional cluster will have a number of software
packages each addressing a different need, such as programming, management, and
scheduling.
4.1 OSCAR
OSCAR is a software package that is designed to simplify cluster installation.
A collection of open source cluster software, OSCAR includes everything that one
is likely to need for a dedicated, high-performance cluster. OSCAR uses a best-in-
category approach, selecting the best available software for each type of cluster-
related task. One will often have several products to choose from for any given
need.
The design goals for OSCAR include using the best-of-class software, elimi-
nating the downloading, installation, and configuration of individual components,
and moving toward the standardization of clusters. OSCAR, it is said, reduces
the need for expertise in setting up a cluster because OSCAR takes us completely
through the installation of a cluster. In practice, it might be more fitting to say
that OSCAR delays the need for expertise and allows us to create a fully functional
cluster before mastering all the skills one will eventually need. In the long run,
one will want to master those packages in OSCAR. OSCAR makes it very easy to
experiment with packages and dramatically lowers the barrier to getting started.
OSCAR is designed with high-performance computing in mind. So, unless
one customizes the installation, the compute nodes are meant to be dedicated to
the cluster.
4.2 Packages
OSCAR brings together a number of software packages for clustering. Most
of the packages are available as standalone packages. The main packages in OS-
CAR are:
Core: This is the core OSCAR package.
C3: The Cluster Command and Control tool suite provides a command-line administration interface. cexec mkdir /opt/c3-4 will create the directory on all clients. Similarly, we can use commands like cget, ckill, cpush, crm, and cshutdown.
Environment Switcher: This is based on Modules, a Perl script that allows the user to make changes to the environment of future shells. For example, Switcher allows a user to change between MPICH and LAM/MPI. switcher mpi = mpich-ch-p4-gcc-1.2.5.10 can be used to set the execution environment needed for the MPICH library.
SIS: The System Installation Suite is used to install the operating systems on the
clients.
Monitoring systems: Clumon, a web-based performance-monitoring system,
and Ganglia, a real-time monitoring system and execution environment, are the
software packages included in this category.
MAUI: This job scheduler is used with openPBS.
openPBS: The portable batch system is a workload management system.
PVFS: Parallel Virtual File System is a high-performance, scalable, parallel vir-
tual file system.
Any high-performance cluster would be incomplete without programming
tools. The OSCAR distribution includes:
LAM/MPI: This is one implementation of the message passing interface (MPI) libraries.
MPICH: This is another implementation of the message passing interface (MPI) libraries.
PVM: This package provides the Parallel Virtual Machine system, another message passing library.
4.3 Installation Strategy
With OSCAR, one first installs Linux (but only on the head node) and then
installs OSCAR. The installations of the two are separate. This makes the instal-
lation more involved, but it gives us more control over the conguration of the
system, and it is somewhat easier to recover when we encounter installation prob-
lems. And because the OSCAR installation is separate from the Linux installation,
we are not tied to a single Linux distribution.
OSCAR uses a system image cloning strategy to distribute the disk image to
the compute nodes. With OSCAR it is best to use the same hardware throughout
the cluster. OSCAR's thin client model is designed for diskless systems.
Chapter 5
Parallel Programming
5.1 Hello World
It is customary to start learning a new programming language with a Hello World program, so here also we will start with a Hello World program. The parallel version of the program (using LAM/MPI) is as follows:
#include "mpi.h"
#include <stdio.h>
int main(int argc, char *argv[])
{
int processId; /* rank of process */
int noProcesses; /* number of processes */
int nameSize; /* length of name */
char computerName[MPI_MAX_PROCESSOR_NAME];
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &noProcesses);
MPI_Comm_rank(MPI_COMM_WORLD, &processId);
MPI_Get_processor_name(computerName, &nameSize);
fprintf(stderr,"Hello from process %d on %s\n", processId, computerName);
MPI_Finalize( );
return 0;
}
This example introduces five MPI functions, defined through the inclusion of the header file for the MPI library, mpi.h, and included when the MPI library is linked to the program. While this example uses C, similar libraries are available
for C++ and FORTRAN.
Four of these functions, MPI_Init, MPI_Comm_size, MPI_Comm_rank, and MPI_Finalize, are seen in virtually every MPI program.
5.1.1 MPI_Init
MPI_Init is used to initialize an MPI session. All MPI programs must have a call to MPI_Init. MPI_Init is called once, typically at the start of a program. One can have lots of other code before this call, or one can even call MPI_Init from a subroutine, but one should call it before any other MPI functions are called. (There is an exception: the function MPI_Initialized can be called before MPI_Init. MPI_Initialized is used to see if MPI_Init has been previously called.) In C, MPI_Init can be called with the addresses for argc and argv as shown in the example. This allows the program to take advantage of command-line arguments. Alternatively, these addresses can be replaced with a NULL.
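As a small illustration of that alternative form (a sketch, not part of the report's program), MPI_Init can be called without forwarding the command-line arguments:

#include "mpi.h"

int main(void)
{
    /* The addresses of argc and argv are replaced with NULL,
       as permitted when command-line arguments are not needed. */
    MPI_Init(NULL, NULL);
    MPI_Finalize();
    return 0;
}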
5.1.2 MPI_Finalize
MPI_Finalize is called to shut down MPI. MPI_Finalize should be the last MPI call made in a program. It is used to free memory, etc. It is the user's responsibility to ensure that all pending communications are complete before a process calls MPI_Finalize. Every process must call MPI_Finalize.
5.1.3 MPI_Comm_size
This routine is used to determine the total number of processes running in a communicator (the communications group for the processes being used). It takes the communicator as the first argument and the address of an integer variable used to return the number of processes. For example, if one is executing a program using five processes and the default communicator, the value returned by MPI_Comm_size will be five, the total number of processes being used. This is the number of processes, but not necessarily the number of machines being used.
In the hello world program, both MPI_Comm_size and MPI_Comm_rank used the default communicator, MPI_COMM_WORLD. This communicator includes all the processes available at initialization and is created automatically. Communicators are used to distinguish and group messages. As such, communicators provide a powerful encapsulation mechanism. While it is possible to create and manipulate one's own communicators, the default communicator will probably satisfy most needs.
5.1.4 MPI_Comm_rank
MPI_Comm_rank is used to determine the rank of the current process within the communicator. MPI_Comm_rank takes a communicator as its first argument and the address of an integer variable used to return the value of the rank.
Basically, each process is assigned a different process number or rank within a communicator. Ranks range from 0 to one less than the size returned by MPI_Comm_size. For example, if one is running a set of five processes, the individual processes will be numbered 0, 1, 2, 3, and 4. By examining its rank, a process can distinguish itself from other processes. The values returned by MPI_Comm_size and MPI_Comm_rank are often used to divide up a problem among processes. Next, each individual process can examine its rank to determine its role in the calculation. For example, the process with rank 0 might work on the first part of the problem; the process with rank 1 will work on the second part of the problem, etc. One can divide up the problem differently as well. For example, the process with rank 0 might collect all the results from the other processes for the final report rather than participate in the actual calculation. It is really up to the programmer to determine how to use this information.
5.1.5 MPI_Get_processor_name
MPI_Get_processor_name is used to retrieve the host name of the node on which the individual process is running. In the sample program, we used it to display host names. The first argument is an array to store the name and the second is used to return the actual length of the name.
Each of the C versions of these five functions returns an integer error code. With a few exceptions, the actual code is left up to the implementers. Error codes can be translated into meaningful messages using the MPI_Error_string function.
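For instance, here is a minimal sketch (not from the report's code) of checking a return value and translating it with MPI_Error_string; the deliberately out-of-range destination rank is only there to provoke an error:

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    char message[MPI_MAX_ERROR_STRING];
    int messageLength, errorCode, noProcesses;
    double value = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &noProcesses);
    /* Ask MPI to return error codes instead of aborting (the default handler is fatal). */
    MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
    /* Ranks run from 0 to noProcesses-1, so noProcesses is an invalid destination. */
    errorCode = MPI_Send(&value, 1, MPI_DOUBLE, noProcesses, 0, MPI_COMM_WORLD);
    if (errorCode != MPI_SUCCESS) {
        MPI_Error_string(errorCode, message, &messageLength);
        fprintf(stderr, "MPI error: %s\n", message);
    }
    MPI_Finalize();
    return 0;
}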
Here is an example of compiling and running the code:
[deepak@ssl1 ~]$ mpicc hello.c -o hello
[deepak@ssl1 ~]$ mpirun -np 5 hello
Messages received by process 0 on ssl1.
Greetings from process 1 on ssl3.deltaforce!
Greetings from process 2 on ssl1!
Greetings from process 3 on ssl3.deltaforce!
Greetings from process 4 on ssl1!
There is no apparent order in the output. This will depend on the speed
of the individual machines, the loads on the machines, and the speeds of the
communications links. Unless one takes explicit measures to control the order of
execution among processors, one should not make assumptions about the order of
execution.
When running the program, the user specifies the number of processes on the command line. MPI_Comm_size provides a way to get that information back into the program. Next time, if one wants to use a different number of processes, just change the command line and the code will take care of the rest.
5.2 Numerical Integration
This is a fairly standard problem for introducing parallel calculations because
it can be easily decomposed into parts that can be shared among the computers
in a cluster. Although in most cases it can be solved quickly on a single processor,
the parallel solution illustrates all the basics one needs to get started writing MPI
code. The reason this area problem is both interesting and commonly used is
that it is very straightforward to subdivide this problem. We can let different computers calculate the areas for different rectangles. Basically, MPI_Comm_size and MPI_Comm_rank are used to divide the problem among processors. MPI_Send is used to send the intermediate results back to the process with rank 0, which collects the results with MPI_Recv and prints the final answer. Here is the program:
#include "mpi.h"
#include <stdio.h>
/* problem parameters */
#define f(x) ((x) * (x))
#define numberRects 50
#define lowerLimit 2.0
#define upperLimit 5.0
int main(int argc, char *argv[])
{
/* MPI variables */
int dest, noProcesses, processId, src, tag;
MPI_Status status;
/* problem variables */
int i;
double area, at, height, lower, width, total, range;
/* MPI setup */
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &noProcesses);
MPI_Comm_rank(MPI_COMM_WORLD, &processId);
/* adjust problem size for subproblem*/
range = (upperLimit - lowerLimit) / noProcesses;
width = range / numberRects;
lower = lowerLimit + range * processId;
/* calculate area for subproblem */
area = 0.0;
for (i = 0; i < numberRects; i++)
{
at = lower + i * width + width / 2.0;
height = f(at);
area = area + width * height;
}
/* collect information and print results */
tag = 0;
if (processId == 0) /* if rank is 0, collect results */
{
total = area;
for (src=1; src < noProcesses; src++)
{
MPI_Recv(&area, 1, MPI_DOUBLE, src, tag, MPI_COMM_WORLD, &status);
total = total + area;
}
fprintf(stderr, "The area from %f to %f is: %f\n",
lowerLimit, upperLimit, total );
}
else /* all other processes only send */
{
dest = 0;
MPI_Send(&area, 1, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD);
};
/* finish */
MPI_Finalize( );
return 0;
}
In this example, we are calculating the area under the curve y = f(x) = x² between x = 2 and x = 5. Since each process only needs to do part of this calculation, we need
to divide the problem among the processes so that each process gets a different part and all the parts are accounted for. MPI_Comm_size is used to determine the number of parts the problem will be broken into, noProcesses. That is, we divide
the total range (2 to 5) equally among the processes and adjust the start of the
range for an individual process based on its rank. In the next section of code, each
process calculates the area for its part of the problem. Then we need to collect
and combine all our individual results. One process will act as a collector to which
the remaining processes will send their results. Using the process with rank 0 as
the receiver is the logical choice. The remaining processes act as senders. A fair
amount of MPI code development can be done on a single processor system and
then moved to a multiprocessor environment.
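As a side note on the design choice: the same collection step could also be written with MPI's collective reduction operation instead of the explicit send/receive loop. The fragment below is only a sketch of that alternative, not the code used in this report, and it reuses the variables declared in the program above:

/* Every process contributes its partial area; the sum arrives in
   total on the process with rank 0. */
MPI_Reduce(&area, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
if (processId == 0)
    fprintf(stderr, "The area from %f to %f is: %f\n",
            lowerLimit, upperLimit, total);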
Chapter 6
Molecular Dynamics Modeling of Thermal Conductivity of
Engineering Fluids and its Enhancement due to Nanoparticle Inclusion
In the present study, an attempt is made to estimate the enhancement of the
thermal conductivity of water by suspension of nanoparticles, using the method
of Molecular Dynamics simulation. This involves the process of generating the
atomic trajectories of a system of a finite number of particles by direct integration of the classical Newton's equations of motion, with appropriate specification of interatomic potentials and application of suitable initial and boundary conditions. Initially, a general simulation procedure is developed for the thermal conductivity of a liquid and the procedure is validated against standard values. A few among the plausible theoretical models for the thermal conductivity of nanofluids have been selected and studied. Algorithms are made for simulating them, abiding by the procedural steps of the Molecular Dynamics method. The thermal conductivity enhancements in the base fluid due to suspension of nanoparticles, estimated using the simulations of the models considered, are compared among themselves and with the existing experimental results, and further investigated to select the most appropriate model which matches best with a practical case of interest (metal oxide - water system). Parametric studies are conducted to study the variation of thermal conductivity enhancement with temperature, and the optimal dosing levels of nanofluids are also investigated. Further, an optimization of the simulation procedure and algorithms is attempted to bring out an efficient computation strategy.
6.1 Algorithm
main()
{
    initializePositions();
    initializeVelocities();
    for(time=0; time<totalTimeSteps; time++)
    {
        velocityVerlet(dt);          // advance positions and velocities by one step dt
        instantaneousTemperature();
        if(time%200==0)
            rescaleVelocities();     // rescale velocities every 200 steps to control temperature
        if(time>equilibrationtime)
            jt();                    // accumulate data only after the equilibration phase
    }
    thermalConductivity();
}
velocityVerlet()
{
    computeAccelerations();   // at time t
    updatePositions();        // function of a[t]
    computeAccelerations();   // at time t+dt
    updateVelocities();       // function of a[t] and a[t+dt]
}
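For reference, the standard velocity Verlet update that this routine sketches can be written as (textbook equations, not taken from the report):

x(t+\Delta t) = x(t) + v(t)\,\Delta t + \tfrac{1}{2}\,a(t)\,\Delta t^{2}

v(t+\Delta t) = v(t) + \tfrac{1}{2}\,[\,a(t) + a(t+\Delta t)\,]\,\Delta t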
6.2 Profiling
It is generally said that a typical program will spend over 90% of its execution time in less than 10% of the actual code. This is just a rule of thumb or heuristic, and as such, will be wildly inaccurate or totally irrelevant for some programs. But for many, if not most, programs, it is a reasonable observation. The actual numbers don't matter since they will change from program to program. It is the idea that is important: for most programs, most of the execution time is spent in a very small portion of the code.
If the application spends 95% of its time in 5% of the code, there is little to
be gained by optimizing the other 95% of the code. Even if one could completely
eliminate it, one would only see a 5% improvement. But if one can manage a 10%
improvement in the critical 5% of the code, for example, we will see a 9.5% overall
improvement in the program. Thus, the key to improving the code's performance is to identify that crucial 5%. That is the region where one should spend one's time optimizing code.
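To make that arithmetic explicit (a worked check of the figures above): if the hot region accounts for 95% of the running time and is made 10% faster, the new running time is

0.05 + 0.95 \times 0.90 = 0.905

of the original, i.e. an overall saving of 9.5% and an overall speedup of 1/0.905 \approx 1.10.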
There is a point of diminishing returns when optimizing code. We will need
to balance the amount of time we spend optimizing code with the amount of
improvement we actually get. There is a point where the code is good enough.
The goals of profiling are two-fold: to decide how much optimization is worth doing and to identify which parts of the code should be optimized.
For serial algorithms, one can often make reasonable estimates on how time
is being spent by simply examining and analyzing the algorithm. The standard
approach characterizes performance using some measurement of the problem size.
Since the problem size often provides a bound for algorithmic performance, this
approach is sometimes called asymptotic analysis.
Asymptotic analysis can be problematic with parallel programs for several
reasons. First, it may be difficult to estimate the cost of communications required by a parallel solution. This can be further complicated by the need for additional code to coordinate communications among the processors. Second, there is often a less than perfect overlap among the communicating processes. A processor may be idle while it waits for its next task. In particular, it may be difficult to predict when a processor will be idle and what effect this will have on overall performance.
For these and other reasons, an empirical approach to estimating performance is
often the preferred approach for parallel programs. That is, we directly measure
performance of existing programs.
Thus, with parallel programs, the most appropriate strategy is to select the
best algorithm one can and then empirically verify its actual performance.
The profile of the unoptimized code is as follows:
[delta16@athena project]$ time -p ./green_without_prof
Thermal Conductivity : 0.000922671
real 378.63
user 378.61
sys 0.00
% cumulative self self total
time seconds seconds calls ms/call ms/call name
49.31 398.03 398.03 40000 9.95 9.95 Accelerations()
34.33 675.12 277.09 1620000 0.17 0.17 lj(int)
15.88 803.34 128.21 15000 8.55 27.09 jt()
0.47 807.09 3.75 20000 0.19 20.09 Verlet(double)
0.13 808.10 1.01 1620000 0.00 0.17 ei(int)
0.00 808.14 0.04 201 0.20 0.20 rescaleVelocities()
<some output removed for brevity>
From this profile we can find that there is scope for improvement in functions like Accelerations, lj and jt because they account for more than 99% of the execution time. It can be noticed that a single call to the lj function takes very little time to execute, but since it is executed a large number of times, even a slight improvement in that code will heavily affect the execution time. It can be found from the code that the functions jt and Accelerations both have a complexity of O(N²). So the work done by this part of the code can be split among the individual nodes, as the sketch below illustrates.
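One common way to make that split (shown here only as a hypothetical sketch; the names N, pairForce and acc are placeholders and are not taken from the report's code) is to give each MPI rank a cyclic share of the outer loop and then combine the partial results with a collective reduction:

#include "mpi.h"

#define N 256                      /* placeholder particle count */

static double pairForce(int i, int j)
{
    /* Placeholder pair interaction; the real code would evaluate,
       e.g., a Lennard-Jones term between particles i and j. */
    return 0.0;
}

/* Each process evaluates every noProcesses-th value of i of the O(N^2)
   double loop, then all partial results are summed so that every process
   ends up with the complete acceleration array. */
void parallelAccelerations(int processId, int noProcesses, double acc[N])
{
    double partial[N];
    int i, j;

    for (i = 0; i < N; i++)
        partial[i] = 0.0;

    for (i = processId; i < N; i += noProcesses)   /* cyclic split by rank */
        for (j = 0; j < N; j++)
            if (j != i)
                partial[i] += pairForce(i, j);

    MPI_Allreduce(partial, acc, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
}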
The profile of the optimized code is as follows:
[delta16@athena project]$ mpirun -np 4 time -p ./mpi_green_pg
Thermal Conductivity : 0.000922359
real 155.74
user 134.28
sys 1.70
<some output removed for brevity>
% cumulative self self total
time seconds seconds calls ms/call ms/call name
54.35 182.83 182.83 40000 4.57 4.57 Accelerations()
22.37 258.11 75.27 405000 0.19 0.19 lj(int)
19.61 324.08 65.97 15000 4.40 9.45 jt()
2.39 332.12 8.04 20000 0.40 9.54 Verlet(double)
<some output removed for brevity>
The comparison between the unoptimized and the optimized code is shown below:
(Figure: bar charts comparing Total Time, self ms/call, and No of Calls for the profiled functions, before and after optimization.)
With this optimized code we get a speedup of 2.43 on a single node (4 CPUs) cluster. The error in the computation, of the order of 10⁻⁷, is due to the presence of floating point arithmetic. Maximum speedup was achieved on dual processor machines by using shared memory instead of message passing. When the program was run on a cluster with two nodes (4 CPUs), the speedup achieved was in the range of 1.7 to 1.9.
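That figure follows directly from the wall-clock times reported by the two runs above:

S = \frac{T_{\text{unoptimized}}}{T_{\text{optimized}}} = \frac{378.63\ \mathrm{s}}{155.74\ \mathrm{s}} \approx 2.43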
Chapter 7
Testing and Benchmarking
Once the cluster is running, one needs to run a benchmark or two just to see
how well it performs.
There are three main reasons for running benchmarks. First, a benchmark
will provide us with a baseline. If we were to make changes to the cluster or
if we suspect problems with the cluster, we can rerun the benchmark to see if
performance is really any different. Second, benchmarks are useful when comparing systems or cluster configurations. They can provide a reasonable basis for selecting
between alternatives. Finally, benchmarks can be helpful with planning. If we can
run several benchmarks with differently sized clusters, etc., we should be able to make better estimates of the impact of scaling the cluster.
High Performance Linpack, Hierarchical Integration (HINT) and NAS Par-
allel Benchmarks are some common benchmarks available for clusters.
As a yardstick of performance we are using the best performance as mea-
sured by the LINPACK Benchmark. LINPACK was chosen because it is widely
used and performance numbers are available for almost all relevant systems.
LINPACK is a software library for performing numerical linear algebra on
digital computers. LINPACK makes use of the BLAS (Basic Linear Algebra Sub-
programs) libraries for performing basic vector and matrix operations. The LINPACK Benchmark is based on LINPACK, representing a measure of a system's floating point computing power. It measures how fast a computer solves dense n by n systems of linear equations Ax = b, a common task in engineering. The solution is based on Gaussian elimination with partial pivoting, with (2/3)n³ + n² floating point operations. The result is reported in millions of floating point operations per second (Mflop/s).
For large scale distributed memory systems, the performance of a portable implementation of the High-Performance Linpack Benchmark is used as a performance measure for ranking supercomputers in the TOP500 list of the world's fastest computers. This performance does not reflect the overall performance of a given system, as no single number ever can. It does, however, reflect the performance of a dedicated system for solving a dense system of linear equations. Since the problem is very regular, the performance achieved is quite high, and the performance numbers give a good correction of peak performance. When the High Performance Linpack benchmark was run on the cluster it gave a peak performance of 1.380e-01 GFLOPS.
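As an illustration of what that figure implies (a worked example with an assumed problem size, not a measurement from the report), a hypothetical n = 1000 would require roughly

\tfrac{2}{3}n^{3} + n^{2} = \tfrac{2}{3}(1000)^{3} + (1000)^{2} \approx 6.68 \times 10^{8}

floating point operations, which at 1.380e-01 GFLOPS would take about 6.68 \times 10^{8} / 1.38 \times 10^{8} \approx 4.8 seconds.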
The following figure explains the relationship between the execution time and the number of available processors.
Chapter 8
Conclusion
The basic objective of this project was to set up a high performance computational cluster with special concern for molecular modelling. By using a cluster kit such as OSCAR, the first phase of the project, setting up a high performance cluster, could be completed. With the help of message passing libraries like LAM/MPI, the second phase of the project, improving the performance of a molecular modelling problem, was completed. The molecular modelling problem, which is considered to be a standard benchmark, when tested on a two node cluster gave a remarkable speedup of the order of 2.0. This increased efficiency came at the expense of higher code complexity. Finally, the testing phase was completed using the High Performance Linpack benchmark, which showed a peak performance of 1.380e-01 GFLOPS. Future work includes the formal analysis of existing code for automated conversion to a parallel version for cluster implementation.
