Ganit Labs - SGE Basic Commands PDF
▪ What is SGE
– SGE stands for Sun Grid Engine
© 2010 Wipro Ltd -- Confidential
How SGE works
▪ Unless resources are immediately available, jobs are held in queues until
the resources needed to execute them become available.
▪ Records of each job's progress through the system are kept and reported
when requested.
SGE Components
▪ Hosts
– Master
– Execution
– Administration
– Submit
▪ Queues (defined by the administrator)
▪ Daemons:
– sge_qmaster (Master Daemon)
– sge_schedd (Scheduler Daemon)
– sge_execd (Execution Daemon)
– sge_commd (Communication Daemon)
Host Roles
▪ Master Host
– Controls overall cluster activity
– Frontend, head node
– Runs the master daemon, sge_qmaster, which controls
• queues, jobs, status, user access permissions
– Also runs the scheduler, sge_schedd
▪ Execution Host
– Executes SGE jobs
– Runs the execution daemon, sge_execd, which
• runs jobs on its host
• forwards system status information to sge_qmaster
▪ Submit Host
– Allowed only to submit and control batch jobs
– No daemon needs to run on this type of host
▪ Administration Host
– Host from which the SGE administrator controls the whole setup
Summary table of useful SGE commands
Working with SGE as a user:
Submitting a Job:
▪ Create a script file (named script.sh) using a text editor such as gedit, vi or emacs,
and enter the following lines:
#!/bin/sh
#
echo "This code is running on"
/bin/hostname
/bin/date
▪ Submit the script to SGE:
qsub script.sh
SGE job script
#$ -N <Job/Program Name>
#$ -e <Error File>
#$ -o <Output File>
#$ -q <Queue Name>
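The #$ lines are qsub options embedded in the script, so the same settings can also be given on the qsub command line. The sketch below only prints the equivalent command instead of running it, so it can be tried on a machine without SGE. The function name build_qsub_cmd and the values demo_job, demo.err, demo.out and all.q are placeholders chosen for illustration, not names from the slides.

```shell
#!/bin/sh
# Print (do not run) the qsub command line that is equivalent to the
# four "#$" directives shown above.
# Arguments: $1 job name, $2 error file, $3 output file, $4 queue, $5 script
build_qsub_cmd() {
    printf 'qsub -N %s -e %s -o %s -q %s %s\n' "$1" "$2" "$3" "$4" "$5"
}

# Placeholder values; substitute your own job name, files and queue.
build_qsub_cmd demo_job demo.err demo.out all.q script.sh
```

Keeping the options inside the script as #$ lines has the advantage that the job is submitted the same way every time.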
SGE – Sample script
#!/bin/bash
#$ -N SLEEP_JOB
#$ -cwd
#$ -e Error.$JOB_NAME.$JOB_ID
#$ -o Output.$JOB_NAME.$JOB_ID
#$ -V
date
sleep 100
date
Script for Serial job
#!/bin/bash
#$ -N SERIAL_JOB
#$ -cwd
#$ -e Error.$JOB_NAME.$JOB_ID
#$ -o Output.$JOB_NAME.$JOB_ID
#$ -V
<full path to the serial executable> <options & input parameters>
Script for Parallel Job
#!/bin/bash
#$ -N PARALLEL_JOB
#$ -cwd
#$ -e Error.$JOB_NAME.$JOB_ID
#$ -o Output.$JOB_NAME.$JOB_ID
#$ -V
#$ -pe mvapich2 32
/data/mvapich2_intel/bin/mpirun -np $NSLOTS <full path to the executable> <options & input parameters>
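SGE sets $NSLOTS to the number of slots granted to the job (32 in the script above). When the same script is tried outside SGE, the variable is normally unset. A minimal sketch of a fallback, assuming a default of one process is acceptable for such a trial run; the helper name slot_count is hypothetical:

```shell
#!/bin/sh
# Print the number of MPI processes to start: the value of NSLOTS when
# SGE has set it, otherwise 1 so the script can be tried outside SGE.
slot_count() {
    echo "${NSLOTS:-1}"
}

# Only reports the value; the real mpirun line stays as in the job script.
echo "mpirun would be started with -np $(slot_count)"
```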
SGE Commands
$ qsub myjob.sh
Your job 742 ("myjob.sh") has been submitted.
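The job number in the confirmation message is what the monitoring and deleting commands take as input, so it is often worth capturing. A minimal sketch, assuming the message has the form shown above; the function name parse_job_id is made up for this example:

```shell
#!/bin/sh
# Extract the job number from a qsub confirmation line such as:
#   Your job 742 ("myjob.sh") has been submitted.
# The third whitespace-separated field is the job number.
parse_job_id() {
    printf '%s\n' "$1" | awk '/^Your job / { print $3 }'
}

# Sample line copied from the slide. On a cluster it would be captured
# with: msg=$(qsub myjob.sh)
msg='Your job 742 ("myjob.sh") has been submitted.'
job_id=$(parse_job_id "$msg")
echo "Submitted job number: $job_id"
```

Some Grid Engine versions also provide qsub -terse, which prints only the job number; check the local qsub man page before relying on it.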
Monitoring Jobs:
2. be executing,
To monitor the progress of your job while it is in state (1) or (2), use the
qstat or Qstat command, which tells you whether the job is still waiting or has started
executing. qstat reports on all jobs, whereas Qstat reports on
your jobs alone.
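qstat reports each job's state as a short code in its "state" column. The sketch below translates the most common codes into words; it covers only a handful of codes, and the function name describe_state is hypothetical.

```shell
#!/bin/sh
# Translate a code from the "state" column of qstat output into a short
# description. Only the most common codes are covered.
describe_state() {
    case "$1" in
        qw)  echo "waiting in the queue" ;;
        hqw) echo "waiting in the queue, on hold" ;;
        t)   echo "being transferred to an execution host" ;;
        r)   echo "executing" ;;
        Eqw) echo "waiting in the queue, in error state" ;;
        dr)  echo "executing, marked for deletion" ;;
        *)   echo "other state: $1" ;;
    esac
}

describe_state qw   # state (1) above
describe_state r    # state (2) above
```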
Monitoring Jobs ( Contd... )
While executing (state 2):
Use qstat -j job_number to monitor the job's status, including time and memory
consumption.
Better still, use qstat -j job_number | grep mem, which shows just the time and memory
consumed.
Also use tail -f job_output_filename to see the latest output from the job.
qacct is the only command that may be able to tell you about past jobs, by referring
to a database of past usage. Output file names contain the job number.
SGE Commands - qstat
Check status of your job:
qstat : lists all the jobs in the system that are either waiting to
be run or running
– qstat -f -u "*" : Detailed information on nodes
– qstat -u username : Displays jobs submitted by the given user
– qstat -j job : Displays job-related information
Some useful options:
qstat:
-explain a|A|c|E
c : displays the reason for the configuration-ambiguous state of a queue instance.
a : shows the reason for the alarm state (the load threshold is currently exceeded).
A : shows the reason for the suspend alarm state (the suspend threshold is currently exceeded).
E : displays the reason for a queue instance error state.
-ext : displays additional information for each job related to the job ticket policy scheme.
-f : specifies a "full" format display of information.
-pri : displays additional information for each job related to job priorities in general.
-r : prints extended information about the resource requirements of the displayed jobs.
Deleting Jobs :
The qdel command removes the specified jobs from the queue if they are waiting
to be run, or aborts them if they are already running.
▪ Individual Job
qdel Job_number
▪ List of Jobs
qdel Job_number1 Job_number2 ....
▪ All Jobs of a User
qdel -u <username>
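Because qdel aborts running jobs as well as queued ones, it can help to check the job numbers before passing them on. A minimal sketch that validates each argument and prints the resulting qdel command rather than running it; is_job_number and build_qdel_cmd are hypothetical helper names, and the check accepts plain numbers only.

```shell
#!/bin/sh
# Return 0 if the argument consists only of digits (a job number).
is_job_number() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
        *)           return 0 ;;
    esac
}

# Print the qdel command for the valid job numbers among the arguments.
# Printing instead of running keeps the sketch safe to try anywhere.
build_qdel_cmd() {
    cmd="qdel"
    for id in "$@"; do
        if is_job_number "$id"; then
            cmd="$cmd $id"
        else
            echo "skipping invalid job number: $id" >&2
        fi
    done
    if [ "$cmd" = "qdel" ]; then
        echo "no valid job numbers given" >&2
        return 1
    fi
    echo "$cmd"
}

# 742 and 743 are kept; "abc" is reported on stderr and skipped.
build_qdel_cmd 742 743 abc
```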
Reasons for Job Failure:
– SGE cannot find the binary file specified in the job script
– You have exceeded your quota and the job fails when trying to write to a file (use the
quota command to check usage)
– Hardware failure
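The first two failure causes can be partly caught by checks at the top of a job script. The sketch below is one possible approach, with hypothetical function names and demo paths. The directory check only tests write permission; it does not detect an exhausted quota, for which the quota command is still needed.

```shell
#!/bin/sh
# Return 0 if the path is an executable regular file; otherwise print
# an error and return 1.
check_executable() {
    if [ -f "$1" ] && [ -x "$1" ]; then
        return 0
    fi
    echo "ERROR: cannot find executable: $1" >&2
    return 1
}

# Return 0 if the path is a directory the current user may write to;
# otherwise print an error and return 1.
check_writable_dir() {
    if [ -d "$1" ] && [ -w "$1" ]; then
        return 0
    fi
    echo "ERROR: cannot write to directory: $1" >&2
    return 1
}

# Demo with paths that exist on most systems; replace them with the
# executable and output directory of your own job.
if check_executable /bin/sh && check_writable_dir "${TMPDIR:-/tmp}"; then
    echo "pre-flight checks passed"
fi
```

A failed check writes its message to standard error, which SGE stores in the file named by the -e option.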
Monitor your cluster
Ganglia Monitoring Tool : Home Page
Ganglia – Home Page Contd…
Ganglia - Node View :