
isiseHPC

User Manual

1. INTRODUCTION

1.1. What is the isiseHPC

The isiseHPC is a High Performance Cluster built to run numerical analyses from several users at the same time. The cluster is composed of two machines, or nodes: a frontend node (master), to which users connect and where they manage their files and jobs; and a compute node, which carries out the analyses submitted by users through the job management system (see …). It is not possible for users to connect directly to the compute nodes, as they are on a private network with the frontend. Figure 1 schematizes the cluster network.

FIGURE 1

1.1.1. Cluster job management philosophy

The machines composing the cluster are built for their function. The frontend handles the external connections, the data transfers between nodes, and the job management system. It acts as a server, making sure all information is synchronized and that every user can submit work properly. The Compute Nodes, on the other hand, are simply connected to the Frontend; their function is to receive job requests from the Frontend's job management system, compute, and return the results. Please keep in mind that the Frontend Node is for job submission purposes only and must never be used as a compute node. Trying to run analyses on it may lead to a system crash (overload). If such a scenario occurs, the system will flag the abnormal activity and the perpetrators will be quickly identified by the administrator.

In a cluster, job files and information are shared through the private network between the frontend and the compute nodes via NFS exports. This basically means that some directories are synchronized across the connected machines (something like Dropbox). Synchronization operations, although necessary, should be kept to a minimum: they increase network traffic and, consequently, force the machines to perform IO operations, degrading the performance of the system (severely, if big chunks of data are being transferred). Better performance is achieved by using the Compute Nodes' local directories to run your analyses and then transferring the results back to the shared one. The method to do it is demonstrated in XXXXX.
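In short, the pattern is: copy the input data to a local (non-shared) directory on the compute node, run the analysis there, and copy the results back to the shared directory at the end. The sketch below only illustrates the idea; the directory and file names in angle brackets are placeholders, and a complete, working job script is given later in this manual.

Linux shell

#illustrative sketch only - this is meant to run inside a job script, not by hand
LOCALDIR=/state/partition1/$USER/myjob   #local disk on the compute node (not shared)
mkdir -p $LOCALDIR
cp <my_input_files> $LOCALDIR            #copy inputs from the shared directory
cd $LOCALDIR
<run_the_analysis_here>                  #all IO happens on the local disk
cp $LOCALDIR/* <my_shared_directory>     #copy results back to the shared directory
rm -rf $LOCALDIR                         #clean up the local disk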

1.1.2. Machine specifications

The frontend is a 4-core (8-thread) computer with 8GB of RAM. It serves the requirements of fast access from many users at the same time and of managing transferred data. It has 4TB of disk space, which is a sufficient amount if users keep retrieving their results back to their personal computers and freeing the space. Compute Nodes are equipped with 2 state-of-the-art CPUs with 8 cores each. Each core supports two threads (Hyper-Threading), which results in 32 threads available for computations (or a maximum of 32 jobs in parallel). The memory available is 64GB for the 32 threads, which yields 2GB/thread. In order to keep IO operations from slowing down the computations, Compute Nodes have two 600GB 15K RPM disks in RAID-1.

The Frontend Node and the Compute Nodes have the following configurations:

Frontend (Dell PowerEdge R320):

  • Processor: Intel® Xeon® E5-1410, 2.80GHz, 10M Cache, 6.4GT/s, Turbo, 4 Cores (8 threads), 80W
  • Memory (RAM): 8GB RDIMM, 1333 MHz, Low Volt, Dual Rank, x4
  • Hard Drives: 2x 2TB, Near-Line SAS 6Gbps, 3.5-in, 7.2K RPM Hard Drive (Hot-Plug)

Compute Node (Dell PowerEdge R720):

  • Processor: 2x Intel® Xeon® E5-2690, 2.90GHz, 20M Cache, 8.0GT/s QPI, Turbo, 8 Cores (16 threads), 135W
  • Memory (RAM): 8x 8GB RDIMM, 1600 MHz, Low Volt, Dual Rank
  • Hard Drives: 2x 600GB, SAS 6Gbps, 3.5-in, 15K RPM Hard Drive (Hot-Plug) (RAID-1)

1.2. How to obtain a user account

In order to obtain a user account, request one from ISISE's administration. Upon configuration, you will be informed of your username and temporary password. It is your responsibility to change it after you connect for the first time (see …).
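As a general note (assuming the cluster uses standard local Linux accounts rather than a central directory service), a password can be changed from the shell with the passwd command once you are logged in:

Linux shell

#change your own password (you will be asked for the current one first)
[user@isisehpc]$ passwd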

1.3. Accessing isiseHPC and managing your account

Access to the Frontend Node may be done via different methods, depending on the operating system of the client (cluster user). As most users will be connecting using Windows OS, Mac OS and Linux OS will not be covered for now.

A. Connect to the university network

The first thing that is certainly required is for you to be connected to the Structural Mechanics Laboratory Local Network (Gateway 10.8.0.254). This is achieved either by plugging the client machine (your computer) in by Ethernet cable or by using openVPN. This is because isiseHPC is protected behind the university server firewall, and so only secured connections to the lab's local network will be allowed.

What you need:
  • A physical connection to the local network (by Ethernet cable) or a tunnel connection via openVPN.

B. Download WinSCP

In order to transfer files from the client machine to the Frontend Node (server) in Windows OS, the best approach is to use WinSCP or equivalent. Besides transferring data between your machine and the cluster, it also allows you to manage the files in your isiseHPC account. WinSCP also provides a console (for secure shell, or SSH) for executing commands. For secure shell communication, you can also download PuTTY or a Google Chrome / Mozilla Firefox extension for SSH.

What you need:

  • WinSCP (http://winscp.net/eng/download.php) or equivalent

  • Secure shell software or extension (optional)

2. INTERACTING WITH ISISEHPC

2.1. Connecting to isiseHPC using WinSCP

The connection to the cluster is established by SSH. After making sure you have a connection to the Structural Mechanics Laboratory Local Network, open WinSCP and create a new connection.

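If you prefer a command-line SSH client instead (for example, any terminal with an ssh command available), the equivalent connection can be opened as sketched below; the address is a placeholder, to be replaced by the frontend address provided by the administrator:

Linux shell

#open an SSH session to the frontend (replace the placeholders,
#removing the angle brackets)
ssh <your_username>@<frontend_address>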

Once you have connected, you'll be presented with your HOME directory. Only you and the administrator are able to modify your files. If you wish to open the secure shell after logging in, just go to "Commands > Open in Terminal" or use the shortcut [CTRL+T].

2.2. Your HOME directory

The HOME directory is a SHARED DIRECTORY used by the system to communicate your data through the private network (see…), sharing your submission files, the required system files, and saving your finished job results. Upon creation of your account, the HOME directory is composed of 3 main subdirectories:

  • Desktop – which has little or no importance for your operations (just keep it there);

  • Examples – where you may find scripting examples for submitting jobs to SGE (see …)

  • Modules – which contains the environment modules you need to tell the compute nodes where the software for your analyses is installed and which version of the software you wish to use (this works like the PATH in Windows OS).

There are also several hidden files and directories in your HOME directory (their names start with a dot "."). Once you access your HOME directory for the first time, create a new directory where you will put your analysis files. In this way, you will keep your main directory clean.
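For example (the directory name "work" is arbitrary, choose whatever suits you):

Linux shell

#list the contents of your HOME directory, including hidden files
[user@isisehpc]$ ls -la
#create a directory for your analysis files
[user@isisehpc]$ mkdir work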

2.3. Software

The isiseHPC already has basic software installed that you may need for your work, such as programming language compilers and interpreters (C++, Fortran, Python, etc.) and Abaqus (versions v6.11, v6.12, and v6.13). If you need any other software to be installed, you must request it from the ISISE administration and you will be informed when it is available for use. Please keep in mind that the software must be Linux compatible.
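To check which software modules are currently available, a quick sketch (assuming the environment-modules setup described in section 2.2, with your module files under $HOME/modules as used in the job scripts later in this manual):

Linux shell

#make your HOME module files visible to the module command
module use $HOME/modules
#list the modules (and versions) available for loading
module avail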

2.4. Submitting jobs to SGE

Job submissions are processed by SGE. SGE is a job submission and management software which allocates user jobs in queues and decides on the timing of their execution (scheduling). Scheduling depends on the cluster workload and on the priority of the queued jobs. As an example, if you request 4 CPUs to run your job and there are only 3 available, your job will be queued until processor resources become available and its priority is higher than that of the remaining queued jobs. The definition of job priorities is not covered in this manual.

2.4.1. SGE basic commands

The submission/management of your jobs to SGE is done via the system shell. SGE is called using the command "qsub" followed by your batch file (SGE script). The script contains the various instructions required to execute your job properly (see…). An example of how to submit a script is

Linux shell

[user@isisehpc]$ qsub myscript.qsub

This will submit the myscript.qsub file to the management system. Done in this way, you will submit the file using the default job configurations. In order to adjust/modify the job configuration, you may use the following commands:

Linux shell

#Give a name to your job using the -N option

[user@isisehpc]$ qsub -N myjobname myscript.qsub

#Provide the number of CPUs to use in the analysis with -pe orte

[user@isisehpc]$ qsub -pe orte 2 myscript.qsub

#Inform the system of the maximum run time of your analysis (-l h_rt)

[user@isisehpc]$ qsub -l h_rt=hh:mm:ss -j y myscript.qsub

#Name your job and request the SGE output (-j y/n) [two options]

[user@isisehpc]$ qsub -N myjobname -j y myscript.qsub

The commands shown above are, in general, important for your scripting. The job name and the number of processors are regarded as essential information, while the SGE output is important to log the information sent by your analysis software during job execution.

Commands can also be issued inside the script. As a tip, define all these commands inside the script, if possible, with the exception of the job name. This will allow you to keep a generic SGE script file and simply indicate, in the Linux shell, which job you want to run, as sketched below.
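For example, with every option except the job name defined inside a generic script (the script and job names here are only illustrative):

Linux shell

#all other options live inside the script; only the job name changes
[user@isisehpc]$ qsub -N Beam01 SGE_AbaqusJob.qsub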

If you wish to monitor your jobs, simply execute

Linux shell

[user@isisehpc]$ qstat
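qstat can also report detailed information on a single job (for example, to check why it is still queued):

Linux shell

#show detailed information about one of your jobs
[user@isisehpc]$ qstat -j <job_id>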

In case you want to delete one of your jobs, execute

Linux shell

[user@isisehpc]$ qdel <job_id>

2.4.2. SGE scripting

To submit a job to SGE, you need to prepare a script which basically tells the Frontend the resources you need, such as the number of CPUs and the amount of RAM, and the operations it must execute in order to successfully perform your job, such as the analysis directory, the modules to load (OS environment information on the software you need), and the software commands. You do not actually run a job directly, as you would on your workstation, but rather program a set of tasks. You may feel tempted to put all your tasks in a single script but, in case of failure, all of your work will be lost, even if it is 99.999% completed! To avoid this, make different scripts for your different jobs. In the end, though, it is up to the user to judge.

As an example, consider the submission of an Abaqus input file called “MyAbaqusJob.inp”. First, copy the “.inp” file to the cluster using WinSCP. In the directory where you have placed the file, create a file called “SGE_AbaqusJob.qsub” (you can see examples of “.qsub” files in the examples directory). Most of its structure is not to be modified. Just change what is indicated in “<>”, removing the angle brackets.

The script starts with a shebang (#!), telling the system what type of scripting language it is; in this case, the parent shell.

Script - SGE_AbaqusJob.qsub – 1/5

#!/bin/sh

Now that Linux knows how to read your job submission script (which shell to use), you can prepare your commands to SGE. SGE commands are expressed inside the script as #$ followed by the command itself. Common commands are:

Script - SGE_AbaqusJob.qsub – 2/5

#$ -cwd
#$ -S /bin/sh

#$ -N <MyAbaqusJob>

#$ -j y
#$ -l h_rt=<hh:mm:ss>
#$ -pe orte <number_of_cpus_you_need>

The first command simply tells SGE that your work directory is the current work directory, or cwd. In the following lines, and similarly to the shebang for Linux, you also indicate to SGE that you wish to use the parent shell (-S), the name of your job (-N), whether you want reports to be printed (-j y/n), the analysis run time (-l h_rt=hh:mm:ss), and the number of processors you require to run your analysis (in this case, the number of CPUs Abaqus will use). If you omit the run time, you will be telling SGE you need 76h of analysis by default, which will likely keep your job queued for longer. On the other hand, the time you set is the maximum time the job is allowed to run: the script is killed if it is still running when that time is reached.

The variables set for SGE are usable as environment variables, referenced in the script as $<variable>. Variables from the shell are also usable here; for example, $HOME is your HOME directory. You can also assign additional variables. In this case, you do not precede them with $ when defining them, only when you need to call them (see the SCRATCHDIR example).

Script - SGE_AbaqusJob.qsub – 3/5

#tell the system you want to use the modules in your HOME directory
module use $HOME/modules

#tell the system which version of Abaqus you want

module load abaqus/6.11.2

#or just tell it you wish to use abaqus (if you don't care about
#the version):
#module load abaqus

#don't change these lines!!

SCRATCHDIR=/state/partition1/$USER/$JOB_NAME.$JOB_ID

mkdir -p $SCRATCHDIR
cp $JOB_NAME.inp $SCRATCHDIR
cd $SCRATCHDIR

The code snippet above informs the system (not SGE) of the operations you wish to execute for your job. First, you tell the system you want to load the modules which contain the commands for the installed software (like the commands for Abaqus). As isiseHPC is provided with three versions of Abaqus, it is important for you to tell Linux which version you wish to use. If you omit it, the system will use the latest software version available. Notice that the system environment variable $HOME is used.

The SCRATCHDIR directory is also prepared, using the SGE variables $JOB_NAME (defined by the -N command) and $JOB_ID (assigned by SGE). The SCRATCHDIR is a directory inside the Compute Node which provides enhanced speed in IO operations, for numerous reasons. One of the reasons is that SCRATCHDIR is created in a special partition of the Compute Node which is not shared through the network. After defining the SCRATCHDIR, you tell the system to create this directory, to copy your input file to the SCRATCHDIR, and, finally but very importantly(!), to set the SCRATCHDIR as the current directory.

Script - SGE_AbaqusJob.qsub – 4/5

abaqus job=$JOB_NAME input=$JOB_NAME.inp cpus=$NSLOTS scratch=$SCRATCHDIR

sleep 60
while [ -f $JOB_NAME.lck ]; do
    sleep 5
done

With the starting IO operations completed, you can now instruct the system to run your Abaqus job. The lines after the abaqus command make sure the script does not proceed until the analysis ".lck" file disappears, i.e., until the analysis finishes.

In the end, you need to tell the system to return the results to your HOME directory and delete the SCRATCHDIR. Deleting the SCRATCHDIR is very important to keep the compute nodes clean and to prevent the system from crashing due to lack of space.

Script - SGE_AbaqusJob.qsub 5/5

cp $SCRATCHDIR/* $SGE_O_WORKDIR
rm -rf $SCRATCHDIR

Once your script is complete, you can submit it by going to the shell and executing

Linux shell

[user@isisehpc]$ qsub SGE_AbaqusJob.qsub

The full script, commented, is shown below (comments are preceded by a plain #). Just do not forget to change the values between "<>" according to your needs.

Script - SGE_AbaqusJob.qsub

#!/bin/sh

#$ -cwd
#$ -S /bin/sh

#$ -N <MyAbaqusJob>

#$ -j y

#Expected analysis time - use more than you think you will need,
#or just delete this line (76 hours will be assumed by default)

#$ -l h_rt=<hh:mm:ss>

#$ -pe orte <number_of_cpus_you_need>

#tell the system you want to use the modules in your HOME directory
module use $HOME/modules

#tell the system which version of Abaqus you want

module load abaqus/6.11.2

#don’t change these lines!!

SCRATCHDIR=/state/partition1/$USER/$JOB_NAME.$JOB_ID

mkdir -p $SCRATCHDIR
cp $JOB_NAME.inp $SCRATCHDIR
cd $SCRATCHDIR

#JOB_NAME is the SGE variable containing the job name indicated
#by you, above. NSLOTS is the variable which contains the number
#of CPUs requested.
#
#here is where your job actually begins!

abaqus job=$JOB_NAME input=$JOB_NAME.inp cpus=$NSLOTS scratch=$SCRATCHDIR

#The system will make sure your job data will not be transferred
#back to the frontend before your lock file disappears. Don't change
#this!!!

sleep 60
while [ -f $JOB_NAME.lck ]; do
    sleep 5
done

#After the job ends, the script copies the result files back to the
#frontend and deletes the files in the scratch directory.
#Do not change these lines!!!

cp $SCRATCHDIR/* $SGE_O_WORKDIR
rm -rf $SCRATCHDIR