
HRSK-II Nutzerschulung

2013-05-30

Jan Wender
Senior IT Consultant

science+computing ag, 2013


science + computing ag

Founded: 1989
Offices: Tübingen, München, Düsseldorf, Berlin
Employees: ~300
Turnover 2012: ~30 million euros
Shareholder: Bull GmbH
Partners: Daikin (Japan), NICE srl (Italy), IBM Platform, Univa

science+computing ag, 2013

s+c Customers

[Map of customer and office locations in Germany: Bremen, Hamburg, Wolfsburg, Beelen, Duisburg, Köln, Aachen, Alzenau, Mannheim, Stuttgart; offices in Düsseldorf, Berlin and München; service locations in Frankfurt and Ingolstadt; headquarters in Tübingen]

science+computing ag, 2013

s+c Services

Software Development
Remote Visualisation
Script Solutions
InfiniBand Network
IT Services
Cluster Management
Distributed Resource Management
Cluster Filesystems

science+computing ag, 2013

Bull: European leader in mission-critical digital systems

9,000 experts recognized worldwide in secure systems
Operating in 50 countries
+23% efforts in research in 2011
1.3bn revenues, +4.6% growth in 2011
+29% growth in profitability in 1st quarter 2012

science+computing ag, 2013
7

Creating market leadership in Extreme Computing

[Chart: Revenue 2007-2011 (scale 30-180) - sustained global growth with a strong international focus; revenue split France vs. rest of world, 2014]
[Chart: HPC experts 2007-2011 (scale 300-600) - the largest group of HPC experts in Europe]

science+computing ag, 2013
8

HPC Solutions
Infrastructure
Data center design
Mobile DataCenter
Water-cooling

Servers
Full range development from
ASICs to boards, blades, racks
Support for accelerators

Software Stack
Open, scalable, reliable
Linux, OpenMPI, Lustre, Slurm
Complete administration & monitoring

Expertise

Architecture
Benchmarking
Storage
HPC Cloud

science+computing ag, 2013

3 petaflops-scale systems: TERA 100, CURIE & IFERC

TERA 100: 1.25 PetaFlops, 140,000+ Xeon cores, 256 TB memory, 30 PB disk storage, 500 GB/s IO throughput, 580 m² footprint
CURIE:    2 PetaFlops, 90,000+ Xeon cores, 148,000 GPU cores, 360 TB memory, 10 PB disk storage, 250 GB/s IO throughput, 200 m² footprint
IFERC:    1.5 PetaFlops, 70,000+ Xeon cores, 280 TB memory, 15 PB disk storage, 120 GB/s IO throughput, 200 m² footprint

science+computing ag, 2013
10

HRSK-II

science+computing ag, 2013

11

HRSK-II

Phase 1: 2013

Phase 2: 2014+

science+computing ag, 2013

12

Phase 1

science+computing ag, 2013

13

Phase 1 Island 1

270 Nodes B710 SandyBridge


FDR Infiniband

science+computing ag, 2013

14

Phase 1 Island 2

180 Nodes B510 Westmere


QDR Infiniband

science+computing ag, 2013

15

Phase 1 Island 3

24 Nodes B510 Westmere
QDR Infiniband

[Topology diagram: two top-level 36-port switches, three 36-port leaf switches plus two 36-port service switches, blade chassis of 8 nodes each attached via 3x IB uplinks]

science+computing ag, 2013

16

bullx blade system naming

B _ _ _
  1st digit - cooling: 5 = air (water-cooled door), 7 = Direct Liquid Cooling
  2nd digit - CPU generation: 0 = Nehalem/Westmere, 1 = Sandy Bridge, 2 = Haswell
  3rd digit - 0 = pure CPU, 5 = CPU + accelerators

                     Air / water-cooled door   Direct Liquid Cooling
Pure CPU             B500                      B710
CPU + accelerators   B515                      B715

science+computing ag, 2013

17

B500

science+computing ag, 2013

18

bullx chassis packaging

[7U chassis: 18x blades, LCD unit, CMM, 4x PSU, ESM]

science+computing ag, 2013

19

bullx B500 compute blade

[Blade layout: connector to backplane, Westmere-EP CPUs w/ 1U heatsink, 12x DDR III DIMMs, Tylersburg I/O hub w/ short heatsink, ConnectX QDR, iBMC, ICH10, fans, 1.8" HDD/SSD]
science+computing ag, 2013

20

bullx B500 compute blade - block diagram

[Block diagram: 2x Nehalem/Westmere-EP linked by QPI, 31.2 GB/s memory bandwidth per socket, QPI (12.8 GB/s each direction) to the Tylersburg I/O controller, PCIe 8x (4 GB/s) to the ConnectX QDR InfiniBand adapter, GbE, SATA SSD or diskless]

science+computing ag, 2013

21

B710

science+computing ag, 2013

22

bullx Direct Liquid Cooling system

science+computing ag, 2013

23

bullx B710 DLC blade: design

2 dual-socket nodes per blade

[Thermal simulation of a dual-socket node and cross-section showing DIMMs and CPU]
science+computing ag, 2013

24

bullx B710 DLC compute blade - block diagram

[Block diagram per node: 2x SNB sockets linked by QPI (8 GT/s), 4x DDR3@1600 channels per socket (51.2 GB/s), PCH attached via DMI with 2x HDD/SSD, BMC with GbE, ConnectX-3 (CX3) on PCIe-3 8x providing IB FDR 56 Gb/s]

bullx B710
  Proc:     2x SandyBridge-EP (IvyBridge)
  RAM:      8x DDR3@1600MHz (1866 MHz)
  HD:       2x SATA HDD/SSD 2.5"
  IB:       single ConnectX-3 FDR (56Gb/s), dual as an option
  Ethernet: 2x GbE

science+computing ag, 2013

25

B515

science+computing ag, 2013

26

bullx B515 accelerator blade

Available Q1 2013
Double-width blade
2x NVIDIA K20 GPUs (Kepler) or 2x Intel Xeon Phi coprocessors
2x Intel Xeon E5-24xx (SandyBridge) CPUs
1 dedicated PCIe-3 16x connection for each accelerator
Double InfiniBand FDR connections between blades
165 TF/rack - 6 racks for 1 PF

[Photo: blade carrying 2x CPUs and 2x GPUs/Xeon Phis]

science+computing ag, 2013

27

bullx B515 block diagram

[Block diagram: 2x SNB sockets linked by QPI (8 GT/s), 3x DDR3@1600 channels per socket (38.4 GB/s), PCH via DMI with HDD/SSD, BMC with GbE, 2x ConnectX-3 (CX3) on PCIe-3 8x providing IB FDR 56 Gb/s, one dedicated PCIe-3 16x link per accelerator (MIC / GPU)]

science+computing ag, 2013

28

R428 SMP node

science+computing ag, 2013

29

bullx R428 SMP node

Processor: 4x Intel Xeon E5-4600
  2 QPI links (6.4 GT/s, 7.2 GT/s or 8.0 GT/s)
Chipset: Patsburg PCH (C602)
Memory:
  32 DIMM sockets: 8 DIMMs per CPU, 2 DIMMs per channel
  Up to 1TB with 32GB RDIMMs (1600MT/s)
Local Disk

science+computing ag, 2013

30

bullx R428 block diagram

science+computing ag, 2013

31

Compute Power Phase 1

[Chart: aggregate compute power of the Phase 1 islands]

science+computing ag, 2013

32

Workload Management: SLURM

science+computing ag, 2013

33

SLURM

science+computing ag, 2013

34

SLURM - Simple Linux Utility for Resource Management

History and Facts
  Developed at LLNL since 2003, at SchedMD since 2011
  Multiple enterprises and research centers have been contributing to the project (LANL, CEA, HP, BULL, BSC, etc.)
  Large international community
    Active mailing lists
    Contributions
  Used on more than 30% of the world's largest supercomputers
    Sequoia (IBM), 16.32 petaflop/s, 2nd in 2012
    Tianhe-1A (NUDT), 2.5 petaflop/s, 8th in 2012
    Curie (BULL), 1.6 petaflop/s, 11th in 2012


science+computing ag, 2013

35

Bull and SLURM

BULL initially started to work with SLURM in 2005
  At least 5 active BULL developers since then
  Development of new SLURM features
  Corrections of bugs and support requests
Integrated into the bullx cluster offers since 2006
Close collaboration between BULL, SchedMD and LLNL
Slurm User Group (SUG)
  2nd SUG Conference September 2011 - BULL sponsors and organizes with SchedMD and LLNL
  3rd SUG Conference October 2012 - BULL presents new features
  4th SUG Conference September 2013 - Oakland (USA)

science+computing ag, 2013

36

SLURM Architecture

science+computing ag, 2013

37

SLURM Entities

Jobs: Resource allocation requests


Job steps: Set of (typically parallel) tasks
Partitions: Job queues with limits and access controls
Nodes

NUMA boards

Sockets

Cores

Hyperthreads

Memory
Generic Resources (e.g. GPUs)

science+computing ag, 2013

38

SLURM Entities Example

Users submit jobs to a partition (queue)

Partition debug
Job 1
Job 2
Job 3

science+computing ag, 2013

39

SLURM Entities Example

Jobs are allocated resources

Partition debug

Job 1
Job 2
Job 3

science+computing ag, 2013

Node: tux123
Socket: 0
Core 0
Core 1
Core 2
Core 3
Core 4
Core 5

40

SLURM Entities Example

Jobs spawn steps, which are allocated resources from


within the job's allocation

Partition debug

Step 0

Job 1
Job 2
Job 3

Step 1

science+computing ag, 2013

Node: tux123
Socket: 0
Core 0
Core 1
Core 2
Core 3
Core 4
Core 5

#!/bin/bash
srun -n4 --exclusive a.out &
srun -n2 --exclusive a.out &
wait

41

SLURM Node and Job States


Node States
Down
Idle
Allocated
Completing
Draining
Drained

Job States
Pending
Configuring
Resizing
Running
Suspended
Completing

When finished:
  Cancelled
  Preempted
  Completed (zero exit code)
  Failed (non-zero exit code)
  TimeOut
  NodeFail
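
A quick way to inspect these states on a live system (a minimal sketch; the state names follow the lists above):
$ sinfo -t idle,drained          # nodes that are Idle or Drained
$ squeue -t pending,running      # jobs that are Pending or Running
$ squeue -t all -u $USER         # also show finished jobs still in slurmctld's records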
science+computing ag, 2013

42

SLURM Commands: Job/step Allocation

sbatch   Submit script for later execution (batch mode)
salloc   Create job allocation and start a shell to use it (interactive mode)
srun     Create a job allocation (if needed) and launch a job step (typically an MPI job)
sattach  Connect stdin/out/err for an existing job or job step

science+computing ag, 2013

43

Job and Step Allocation Examples


Submit sequence of three batch jobs
> sbatch --ntasks=1 --time=10 pre_process.bash
Submitted batch job 45001
> sbatch --ntasks=128 --time=60 --depend=45001 do_work.bash
Submitted batch job 45002
> sbatch --ntasks=1 --time=30 --depend=45002 post_process.bash
Submitted batch job 45003
Example Job Script
#!/bin/bash
#SBATCH -N 2                       # number of nodes
#SBATCH -n 2                       # number of cores
#SBATCH -o /home/user/job-%j.out   # output file
#SBATCH -e /home/user/job-%j.err   # error output file
#SBATCH -p exec                    # partition
#SBATCH --exclusive                # default on HRSK-II
srun -N2 -n2 hostname
srun -N2 -n2 time                  # multiple srun commands for multiple job steps
science+computing ag, 2013

44

Job and Step Allocation Examples

Create allocation for 2 tasks, then launch hostname on the allocation, label output with the task ID
> srun --ntasks=2 --label hostname
0: tux123
1: tux123
As above, but allocate the job two whole nodes
> srun --nodes=2 --label hostname
0: tux123
1: tux124

science+computing ag, 2013

SchedMD LLC
http://www.schedmd.com

45

Job and Step Allocation Examples


Create allocation for 4 tasks and 10 minutes for bash shell,
then launch some tasks
> salloc --ntasks=4 --time=10 bash
salloc: Granted job allocation 45000
> env | grep SLURM
SLURM_JOBID=45000
SLURM_NPROCS=4
SLURM_JOB_NODELIST=tux[123-124]
...
> hostname
tux_login
> srun --label hostname
0: tux123
1: tux123
2: tux124
3: tux124
> exit (terminate bash shell)

science+computing ag, 2013

SchedMD LLC
http://www.schedmd.com

46

Different Executables by Task ID

Different programs may be launched by task ID with different arguments
Use --multi-prog option and specify configuration file instead
of executable program
Configuration file lists task IDs, executable programs, and
arguments (%t mapped to task ID and %o mapped to offset
within task ID range)
> cat master.conf
#TaskID  Program         Arguments
0        /usr/me/master
1-4      /usr/me/slave   --rank=%o
> srun --ntasks=5 --multi-prog master.conf

science+computing ag, 2013

47

SLURM Commands

sinfo    Report system status (nodes, queues, etc.)
squeue   Report job and job step status
smap     Report system, job or step status with topology (curses-based GUI), less functionality than sview
sview    Report and/or update system, job, step, partition or reservation status with topology (GTK-based GUI)
scontrol Administrator tool to view and/or update system, job, step, partition or reservation status
scancel  Signal/cancel jobs or job steps
sbcast   Transfer a file to the compute nodes allocated to a job (uses hierarchical communications)

science+computing ag, 2013

48

sinfo Command

Reports status of nodes or partitions

Partition-oriented format is the default

Almost complete control over filtering, sorting and output format is available

> sinfo --Node (report status in node-oriented form)
NODELIST      NODES PARTITION STATE
tux[000-099]    100 batch     idle
tux[100-127]     28 debug     idle
> sinfo -p debug (report status of nodes in partition debug)
PARTITION AVAIL TIMELIMIT NODES NODELIST
debug     up    60:00        28 tux[100-127]
> sinfo -i60 (report status every 60 seconds)

science+computing ag, 2013

49

squeue Command

Reports status of jobs and/or steps in the slurmctld daemon's records (recent jobs only; older information is available in the accounting records)
Almost complete control over filtering, sorting and output format is available

> squeue -u alec -t all (report jobs for user alec in any state)
JOBID PARTITION NAME  USER ST TIME NODES NODELIST(REASON)
45124 debug     a.out alec CD 0:12     1 tux123
> squeue -s -p debug (report steps in partition debug)
STEPID  PARTITION NAME  USER TIME  NODELIST
45144.0 debug     a.out moe  12:18 tux[100-115]
> squeue -i60 (report currently active jobs every 60 seconds)

science+computing ag, 2013

50

sview

science+computing ag, 2013

51

scontrol Command

Designed for system administrator use


Shows all available fields, but no filtering, sorting or
formatting options
Many fields can be modified

> scontrol show partition


PartitionName=debug
AllocNodes=ALL AllowGroups=ALL Default=YES
DefaultTime=NONE DisableRootJobs=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1
Nodes=tux[000-031]
Priority=1 RootOnly=NO Shared=NO PreemptMode=OFF State=UP
TotalCPUs=64 TotalNodes=32 DefMemPerNode=512 MaxMemPerNode=1024
> scontrol update PartitionName=debug MaxTime=60

science+computing ag, 2013

52

scancel Command

Cancel a running or pending job or step


Can send arbitrary signal to all processes on all nodes
associated with a job or step
Has filtering options (state, user, partition, etc.)
Has interactive (verify) mode

> scancel 45001.1 (cancel job step 45001.1)


> scancel 45002 (cancel job 45002)
> scancel --user=alec --state=pending (cancel all pending jobs from user alec)

science+computing ag, 2013

53

sbcast Command

Copy a file to local disk on allocated nodes

Execute command after a resource allocation is made

Data transferred using hierarchical slurmd daemon communications
May be faster than a shared file system
> salloc -N100 bash
salloc: Granted job allocation 45201
> sbcast --force my_data /tmp/moe/my_data (overwrite old files)
> srun a.out
> exit (terminate spawned bash shell)

science+computing ag, 2013

54

Partitions and QOS

Partitions and QOS are used in SLURM to group node and job characteristics
The use of the Partition and QOS (Quality of Service) entities in SLURM is orthogonal:
  Partitions group resource characteristics
  QOS group limitations and priorities

[Diagram: Partition high (32 nodes, high memory) and Partition low (32 nodes, low memory) make up Partition all (64 nodes); QOS 1 (higher priority, higher limits) and QOS 2 (lower priority, lower limits) apply across the partitions]

science+computing ag, 2013

55

Partitions and QoS

Partition configuration in the slurm.conf file
  Can be shared: more than one job on a resource (node, socket, core)
  Can be exclusive
  Show with: scontrol show partitions

QoS configuration in the database
  Used to provide detailed limitations and priorities for jobs
  Every user/account can have multiple allowed QoS
    one default QoS
    submit jobs with the --qos parameter
  Show with: sacctmgr show qos
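
For illustration, a minimal sketch of the user-side commands; the QoS name 'normal' and the job script are assumptions, not taken from this system's configuration:
$ scontrol show partitions                          # partitions and their limits
$ sacctmgr show qos format=name,priority,maxwall    # QoS definitions in the database
$ sbatch -p mpi2 --qos=normal job.sh                # submit with an explicit QoS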

science+computing ag, 2013

56

Partitions on taurus

Name          mpi       mpi2            smp
Nodes         180       270             2
Cores         2160      4320            64
CPU-Type      X5660     E5-2690         E5-4650L
CPU-Freq.     2.8 GHz   2.9 GHz         2.6 GHz
Core/Socket   6         8               8
Socket/Node   2         2               4
Cores/Node    12        16              32
Memory/Node   48 GB     32/64/128 GB    1 TB

Partitions with _lustre appended assert Lustre availability, e.g. mpi_lustre, mpi2_lustre, smp_lustre
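
For illustration, submissions into these partitions might look as follows (a sketch; the job scripts and sizes are hypothetical):
$ sbatch -p mpi2 -N 4 --exclusive job.sh             # 4 Sandy Bridge nodes = 64 cores
$ sbatch -p mpi2_lustre -N 4 --exclusive io_job.sh   # same hardware, asserts Lustre availability
$ sbatch -p smp -N 1 job.sh                          # one large-memory SMP node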

science+computing ag, 2013

57

Interactive Usage 1
$ salloc -p mpi -N 2
salloc: Granted job allocation 19206
$ env | grep SLURM
SLURM_NODELIST=taurusi[3087-3088]
SLURM_NODE_ALIASES=(null)
SLURM_NNODES=2
SLURM_JOBID=19206
SLURM_TASKS_PER_NODE=1(x2)
SLURM_JOB_ID=19206
SLURM_SUBMIT_DIR=/home/h4/bull/jw
SLURM_JOB_NODELIST=taurusi[3087-3088]
SLURM_JOB_CPUS_PER_NODE=1(x2)
SLURM_SUBMIT_HOST=tauruslogin1
SLURM_JOB_NUM_NODES=2
SLURM_CPU_BIND=threads
$ hostname
tauruslogin1
$ srun --label hostname
0: taurusi3087
1: taurusi3088
$
science+computing ag, 2013

Ask for 2 CPUs on 2 nodes
Environment variables are set
The subshell is executed on the submit host
To execute commands on compute nodes, use srun

58

Interactive Usage 2
$ srun --label bash
hostname
0: taurusi3087
1: taurusi3088
env | egrep "SLURM_(JOBID|NODEID|PROCID|STEPID)"
0: SLURM_JOBID=19211
0: SLURM_STEPID=1
0: SLURM_NODEID=0
0: SLURM_PROCID=0
1: SLURM_JOBID=19211
1: SLURM_STEPID=1
1: SLURM_NODEID=1
1: SLURM_PROCID=1
exit
$ hostname
tauruslogin1
$ env | grep JOBID
SLURM_JOBID=19206
$ exit
exit
salloc: Relinquishing job allocation 19206
$ hostname
tauruslogin1
$ env | grep JOBID
$

science+computing ag, 2013

To execute a shell on all allocated nodes - no prompt is displayed!
The command line is executed on all nodes in parallel
Back on the submit host - still inside the allocation

59

Interactive Usage 3
$ salloc -p mpi -n 2
salloc: Granted job allocation 19213
[bull@tauruslogin1 jw 10:00:48 ]
$ srun --label bash
env | egrep "SLURM_(JOBID|NODEID|PROCID|STEPID)"
0: taurusi3087: SLURM_JOBID=19213
0: taurusi3087: SLURM_STEPID=0
0: taurusi3087: SLURM_NODEID=0
0: taurusi3087: SLURM_PROCID=1
1: taurusi3087: SLURM_JOBID=19213
1: taurusi3087: SLURM_STEPID=0
1: taurusi3087: SLURM_NODEID=0
1: taurusi3087: SLURM_PROCID=0

Ask for 2 CPUs (= cores)
Get 2 cores on 1 node
NODEID is the same, PROCID is different

science+computing ag, 2013

60

Job Priority
Multifactor Priority Calculation
Priority: 0 .. 4294967295
Factors: Age, Fair-Share, Job size, Partition, QOS (0.0 .. 1.0)
Job_priority = (PriorityWeightAge) * (age_factor) +
(PriorityWeightFairshare) * (fair-share_factor) +
(PriorityWeightJobSize) * (job_size_factor) +
(PriorityWeightPartition) * (partition_factor) +
(PriorityWeightQOS) * (QOS_factor)
$ sprio -l
  JOBID USER  PRIORITY  AGE  FAIRSHARE  JOBSIZE  PARTITION  QOS  NICE
  19200 bull         5    4          0        1          0    0     0
  19207 bull        24    4          0       21          0    0     0
$ sprio -w
  JOBID     PRIORITY   AGE  FAIRSHARE  JOBSIZE
  Weights             1000     100000     1000

science+computing ag, 2013

61

SLURM Associations
$ sacctmgr list user format=user,defaultaccount where user=$USER
      User   Def Acct
---------- ----------
      bull  everybody
$ sacctmgr list associations format=user,account,partition where user=$USER
      User    Account  Partition
---------- ---------- ----------
      bull      tests
      bull  benchmark
      bull   project1
      bull  everybody
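
To charge a job to one of these accounts instead of the default, something like the following can be used (a sketch; the script name is hypothetical, the account name is taken from the listing above):
$ sbatch -A project1 -p mpi2 job.sh                  # -A/--account selects the association to charge
$ sacct -X -u $USER --format=jobid,account,state     # verify which account was charged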

science+computing ag, 2013

62

SLURM Accounting

$ sacct -l -j 19200

JobID        JobName   Partition  AllocCPUS  NTasks  Elapsed   State      ExitCode
19200        test.job  mpi2       16                 00:00:00  COMPLETED  0:0
19200.batch  batch                1          1       00:00:00  COMPLETED  0:0
19200.0      hostname             1          1       00:00:00  COMPLETED  0:0
19200.1      cat                  1          1       00:00:00  COMPLETED  0:0
19200.2      free                 1          1       00:00:00  COMPLETED  0:0

Remaining -l fields for the steps of this (trivial) job:
  AveCPU, MinCPU: 00:00:00
  MaxVMSize, AveVMSize, MaxRSS, AveRSS, MaxPages, AvePages, AveCPUFreq, ReqCPUFreq,
  ConsumedEnergy, MaxDiskRead, AveDiskRead, MaxDiskWrite, AveDiskWrite: 0
  MaxVMSizeNode, MaxRSSNode, MaxPagesNode, MinCPUNode, MaxDiskReadNode, MaxDiskWriteNode: taurusi1226
  MaxVMSizeTask, MaxRSSTask, MaxPagesTask, MinCPUTask: 0
  MaxDiskReadTask, MaxDiskWriteTask: 65534
  ReqMem: 0n

science+computing ag, 2013

63

SLURM Causal Research

Simple job script
$ cat test.job
#!/bin/bash
#SBATCH -N 1
# SBATCH -n 2
#SBATCH -o /home/bull/jw/job-%j.out
#SBATCH -e /home/bull/jw/job-%j.err
#SBATCH -p mpi2
srun hostname
srun cat /proc/cpuinfo
srun free -g
$ sbatch test.job
Submitted batch job 19200
$

science+computing ag, 2013

64

SLURM Causal Research

Why doesn't the job start?
$ squeue -j 19200
  JOBID PARTITION     NAME  USER ST  TIME NODES NODELIST(REASON)
  19200      mpi2 test.job  bull PD  0:00     1 (ReqNodeNotAvail)
$ sinfo -t idle
PARTITION    AVAIL TIMELIMIT NODES STATE NODELIST
mpi*         up    infinite     87 idle  taurusi[3002-3007,3009-3015,3088,3090,3109-3180]
mpi2         up    infinite    241 idle  taurusi[1001,1003,1005-1007,1009-1013,1015-1180,1182-1232,1255-1261,1264-1270]
smp          up    infinite      1 idle  taurussmp1
mpi_lustre   up    infinite     87 idle  taurusi[3002-3007,3009-3015,3088,3090,3109-3180]
mpi2_lustre  up    infinite    241 idle  taurusi[1001,1003,1005-1007,1009-1013,1015-1180,1182-1232,1255-1261,1264-1270]
smp_lustre   up    infinite      1 idle  taurussmp1

science+computing ag, 2013

65

SLURM Causal Research


Why doesn't the job start?
$ scontrol show job 19200
JobId=19200 Name=test.job
UserId=bull(2054944) GroupId=tests(200026)
Priority=11 Account=everybody QOS=normal
JobState=PENDING Reason=ReqNodeNotAvail Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
RunTime=00:00:00 TimeLimit=04:00:00 TimeMin=N/A
SubmitTime=2013-05-29T09:32:11 EligibleTime=2013-05-29T09:32:11
StartTime=2013-05-30T11:20:24 EndTime=Unknown
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=mpi2 AllocNode:Sid=tauruslogin1:17145
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null)
NumNodes=1-1 NumCPUs=1 CPUs/Task=1 ReqS:C:T=*:*:*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
Shared=0 Contiguous=0 Licenses=(null) Network=(null)
Command=/home/h4/bull/jw/test.job
WorkDir=/home/h4/bull/jw
$ scontrol show reservation
ReservationName=bull_24 StartTime=2013-05-29T13:00:00 EndTime=2013-05-29T18:00:00
Duration=05:00:00
Nodes=taurusi[1001-1270] NodeCnt=270 CoreCnt=4320 Features=(null) PartitionName=(null)
Flags=SPEC_NODES
Users=bull Accounts=(null) Licenses=(null) State=INACTIVE

science+computing ag, 2013

66

SLURM Causal Research


Why doesn't the job start?
$ scontrol show job 19200
JobId=19200 Name=test.job
UserId=bull(2054944) GroupId=tests(200026)
Priority=11 Account=everybody QOS=normal
JobState=PENDING Reason=ReqNodeNotAvail Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
RunTime=00:00:00 TimeLimit=04:00:00 TimeMin=N/A
SubmitTime=2013-05-29T09:32:11 EligibleTime=2013-05-29T09:32:11
StartTime=2013-05-30T11:20:24 EndTime=Unknown
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=mpi2 AllocNode:Sid=tauruslogin1:17145
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null)
NumNodes=1-1 NumCPUs=1 CPUs/Task=1 ReqS:C:T=*:*:*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
Shared=0 Contiguous=0 Licenses=(null) Network=(null)
Command=/home/h4/bull/jw/test.job
WorkDir=/home/h4/bull/jw
$ scontrol show reservation
ReservationName=bull_24 StartTime=2013-05-29T13:00:00 EndTime=2013-05-29T18:00:00
Duration=05:00:00
Nodes=taurusi[1001-1270] NodeCnt=270 CoreCnt=4320 Features=(null) PartitionName=(null)
Flags=SPEC_NODES
Users=bull Accounts=(null) Licenses=(null) State=INACTIVE
$ scontrol update jobid=19200 TimeLimit=00:10:00

science+computing ag, 2013

67

SLURM Causal Research


Remedy:
$ cat test.job
#!/bin/bash
#SBATCH -N 1
# SBATCH -n 2
#SBATCH -o /home/bull/jw/job-%j.out
#SBATCH -e /home/bull/jw/job-%j.err
#SBATCH -p mpi2
#SBATCH --time=5
srun hostname
srun cat /proc/cpuinfo
srun free -g
$ sbatch test.job
Submitted batch job 19200
$

science+computing ag, 2013

68

MPI Support

Many different MPI implementations are supported:

MPICH1, MPICH2, MVAPICH, OpenMPI, etc.

Many use srun to launch the tasks directly


Some use mpirun or another tool within an existing
SLURM allocation (they reference SLURM environment
variables to determine what resources are allocated to the
job)
Details are online:
http://www.schedmd.com/slurmdocs/mpi_guide.html

science+computing ag, 2013

69

Translation aid between batch systems
http://www.schedmd.com/slurmdocs/rosetta.pdf

science+computing ag, 2013

70

Credits

SLURM slides with input from


Moe Jette, SchedMD

Additional input from


Christophe Berthelot, Bull

science+computing ag, 2013

71

Intel Cluster Studio

science+computing ag, 2013

72

Intel Cluster Studio XE


Slides will be integrated later

science+computing ag, 2013

73

CPU types on taurus

Partition          mpi           mpi2            smp
Name               Westmere-EP   SandyBridge-EP  SandyBridge-EP
CPU-Type           X5660         E5-2690         E5-4650L
CPU-Freq.          2.8 GHz       2.9 GHz         2.6 GHz
Max. Turbo-Freq.   3.2 GHz       3.8 GHz         3.1 GHz
Core/Socket        6             8               8
Max Sockets/Node   2             2               4
Cores/Node         12            16              32
Cache              12 MB         20 MB           20 MB
Memory Channels    3             4               4
Max. Memory BW     32 GB/s       51.2 GB/s       51.2 GB/s
Extensions         SSE4.2        AVX             AVX
QPI links          2             2               2
QPI Speed          6.4 GT/s      8 GT/s          8 GT/s
Max. TDP           95 W          135 W           115 W

science+computing ag, 2013

74

Intel Compiler Flags for Optimization


General Optimization
-O2: Optimize for Speed (Default)
-O3: Optimize for applications with large data sets or many floating point operations (loop unrolling, scalar replacement, ...)
-O0: No Optimization, use for Debugging
-opt-report: Generate report on performed optimizations
-opt-report-phase=hpo: Report from High Performance Optimizer
(including vectorizer and parallelizer)
-parallel: Invoke auto-parallelizer
-mkl=type: Compile with MKL, Type can be parallel (use threaded
MKL), sequential (use non-threaded MKL), cluster (use cluster and
sequential MKL)
-ip: Single file interprocedural optimization
-ipo: Interprocedural optimization among multiple files
http://software.intel.com/sites/default/files/Compiler_QRG_2013.pdf
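A minimal sketch of compile lines combining some of the flags above (the source file names are hypothetical):
$ icc -O3 -ip -parallel -opt-report -opt-report-phase=hpo -o solver solver.c
$ ifort -O2 -mkl=sequential -o post post.f90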
science+computing ag, 2013

75


Compiler Flags for Optimization


Common
  -vec-report=[0,1,2,3,6]: Generate report on vectorization (for analysis use 3)

Westmere
  -xSSE4.2: Enable additional optimizations not enabled with -m

SandyBridge
  -xAVX: Enable additional optimizations not enabled with -m

Mixed Executable
  -msse4.2 -axAVX: Runs on Westmere and SandyBridge, executable includes 2 code paths

http://software.intel.com/sites/default/files/Compiler_QRG_2013.pdf
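For illustration, a mixed Westmere/Sandy Bridge build with the flags above (the source file is hypothetical):
$ mpicc -O3 -msse4.2 -axAVX -vec-report=3 -o app app.c   # SSE4.2 baseline plus an AVX code path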
science+computing ag, 2013

76

Bullx scs AE

science+computing ag, 2013

77

bullx supercomputer suite architecture

science+computing ag, 2013

78

Bullx DE
Installation path: /opt/bullxde
Access via modules:
  module avail
  module load bullxde
  module unload
--------------- /opt/bullxde/modulefiles/debuggers --------------
padb/3.2
--------------- /opt/bullxde/modulefiles/utils ------------------
OTF/1.8
--------------- /opt/bullxde/modulefiles/profilers --------------
hpctoolkit/4.9.9_3111_Bull.2
--------------- /opt/bullxde/modulefiles/perftools --------------
bpmon/1.0_Bull.1.20101208  papi/4.1.1_Bull.2  ptools/0.10.4_Bull.4.20101203
--------------- /opt/bullxde/modulefiles/mpicompanions ----------
boost-mpi/1.44.0  mpianalyser/1.1.4  scalasca/1.3.

science+computing ag, 2013

79

bullx DE Tools

science+computing ag, 2013

80

padb
Job inspection tool; collects stack traces
http://padb.pittman.org.uk/
For hanging jobs that have not crashed

Synopsis
padb -O rmgr=slurm -x[t] -a|<jobid>
export PADB_RMGR=slurm
$ salloc -p Zeus -IN 3
salloc: Granted job allocation 47136
$ mpirun -n 9 pp_sndrcv_spb
$ ./padb -O rmgr=slurm -x 47136
0:ThreadId: 1
0:main() at pp_sndrcv_spbl.c:52
0:PMPI_Finalize() at ?:?
0:ompi_mpi_finalize() at ?:?
0:barrier() at ?:?
0:opal_progress() at ?:?
0:opal_event_loop() at ?:?
0:poll_dispatch() at ?:?

science+computing ag, 2013

81

bullxprof
Lightweight profiling tool
Usage
  module load bullxprof/<version>
  module load papi
  module load bullxmpi

Configuration

Config file only:
  app.modules.excluded
  app.functions.excluded
  bullxprof.io.functions
  bullxprof.mpi.functions
  bullxprof.papi.counters

Command line or config file:
  bullxprof.experiments=timing|hwc
  bullxprof.tracelevel=1|2|3   (1: Basic, 2: Detailed, 3: Advanced)
  bullxprof.debug=0|1|2|3      (0: Off, 1: Low, 2: Medium, 3: High)
  bullxprof.smartdisplay

science+computing ag, 2013

82

bullxprof
bullxprof [ -l ] [ -s ] [ -e experiments ] [ -d debuglevel ] [ -t tracelevel ] program [prog-arguments]
  -l   Print list of instrumentable functions
  -s   Enable smart display
  -e   timing, hwc
  -d   0: Off, 1: Low, 2: Medium, 3: High, Default: 0
  -t   1: Basic, 2: Detailed, 3: Advanced, Default: 1

bullxprof with mpirun
  bullxprof <bullxprof args> mpirun <mpirun args> program <program args>
bullxprof with srun
  bullxprof <bullxprof args> srun <srun args> program <program args>

science+computing ag, 2013

83

MPI Analyser
MPI profiling tool, non-invasive
Logs during execution
Post-mortem analysis

Analyses
  Communication
    point-to-point messages, count and size
    collective messages, count and size
  Execution time
    Maximum time interval between MPI_Init and MPI_Finalize
  Table of calls of MPI functions
    Number of calls for each profiled function
  Message size histograms
    point-to-point, collective communications

science+computing ag, 2013

84

MPI Analyser Usage

Load modules
  Source Intel compilers: module load intel
  Source bullxmpi: module load bullxmpi
  Source BullxDE: module load bullxde
  Load the MPIAnalyser module file: module load mpianalyser

Link the application with the library
  C: mpicc file.o -o exec ${MPIANALYSER_LINK}
  Fortran: mpif90 file.o -o exec -lmpi_f77 -lmpi_f90 ${MPIANALYSER_LINK}

Use the environment to enable it (default is disabled)
  MPIANALYSER_PROFILECOMM=1

Run your application
  MPIANALYSER_PROFILECOMM=1 srun -p partition -N 1 -n 4 ./foo

Analysis with readpfc


science+computing ag, 2013

85

MPI Analyser readpfc

Analysis; export as graphics

science+computing ag, 2013

86

Scalasca
Scalable Performance Analysis of Large-Scale Applications
http://www.scalasca.org/

FZ Jülich, German Research School for Simulation Sciences

Documentation: /opt/bullxde/mpicompanions/scalasca/doc

science+computing ag, 2013

87

Scalasca
Inside Scalasca
OPARI (OpenMP and user region instrumentation tool)
EPIK (measurement library for summarization & tracing)
EARL (serial event-trace analysis library)
EXPERT (serial automatic performance analyzer)
PEARL (parallel event-trace analysis library)
SCOUT (parallel automatic performance analyzer)
CUBE3 (analysis report presentation component & utilities)

science+computing ag, 2013

88

Scalasca Simple Usage


Instrumentation: use scalasca instrument
scalasca -instrument mpicc -c foo.c

Measurement: use scalasca -analyze


scalasca -analyze mpirun -np 8 ./cg.B.8
Variables EPK_ and ESD_ to configure scalasca like export
EPK_METRICS=CYCLES

Analysis Report Examination.


With GUI: use scalasca -examine

scalasca -examine epik_cg_8_sum_CYCLES

Without GUI: use cube3_score

cube3_score -m CYCLES -r epik_cg_8_sum_CYCLES/epitome.cube

science+computing ag, 2013

89

Scalasca GUI output

science+computing ag, 2013

90

xPMPI
Framework for using the MPI profiling layer (PMPI)
Invocation via modules:
  module load xPMPI/<version>_bullxmpi

Configuration via file
  /opt/bullxde/mpicompanions/xPMPI/etc/xpmpi.conf
  export PNMPI_CONF=<path to user defined configuration file>

Supported tools
  MPI Analyser
    see above
  IPM
    Reports statistics on performance and resource usage
    http://ipm-hpc.sourceforge.net/
  mpiP
    Lightweight collection of statistical information on each node
    http://mpip.sourceforge.net
  MPIPROF
    Lightweight MPI application profiler

science+computing ag, 2013

91

PAPI
Performance API
Foundation for performance analysis tools
Standard definitions for metrics
Standard API

High-level and low-level interface
  Start/stop/read event counters
  Configure the counters

The source code must be modified
  Not trivial to use
  See the bullx DE documentation

science+computing ag, 2013

92

bpmon
Command-line tool for single-node runs
Uses the PAPI interface
Example invocation
  bpmon -e INSTRUCTIONS_RETIRED,LLC_MISSES,MEM_LOAD_RETIRED:L3_MISS,MEM_UNCORE_RETIRED:LOCAL_DRAM,MEM_UNCORE_RETIRED:REMOTE_DRAM /opt/hpctk/test_cases/llclat -S -l 4 -i 256 -r 200 -o r

Processor performance reporting
Memory usage reporting

science+computing ag, 2013

93

HPCToolkit
http://hpctoolkit.org

Bull-extended version
  History, viewer, wrapper
  See the bullx DE documentation
science+computing ag, 2013

94

HPCToolkit

science+computing ag, 2013

95

HPCToolkit

science+computing ag, 2013

96

HPCToolkit

science+computing ag, 2013

97

Open|SpeedShop
http://www.openspeedshop.org
Goals: ease of use, modularity, extensibility
Scope:
  Sampling experiments
  Support for callstack analysis
  Hardware performance counters
  MPI profiling and tracing
  I/O profiling and tracing
  Floating point exception analysis

See the tutorials

science+computing ag, 2013

98

Darshan
Records the HPC I/O of an application
http://www.mcs.anl.gov/research/projects/darshan/
Recompilation required
  mpicc.darshan, mpiCC.darshan, mpif77.darshan, mpif90.darshan

DARSHAN_LOGPATH contains the path for the output data

For analysis:
  darshan-job-summary.pl <logfile>
    Generates a PDF
  darshan-parser <logfile>
    Human-readable output of all data
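
A possible end-to-end workflow with the wrappers above (a sketch; source file, partition and log path are assumptions):
$ module load bullxde
$ mpicc.darshan -O2 -o app app.c
$ export DARSHAN_LOGPATH=$HOME/darshan-logs              # assumed location for the log output
$ srun -p mpi2 -n 16 ./app
$ darshan-job-summary.pl $DARSHAN_LOGPATH/<logfile>      # generates the PDF report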

science+computing ag, 2013

99

bullx MPI

science+computing ag, 2013

100

Bullx MPI
bullx MPI is based on Open MPI
http://www.openmpi.org
  Most of the documentation is also valid for bullx MPI
bullx MPI conforms to the MPI-2 standard and supports up to the MPI_THREAD_SERIALIZED level

Usage
  module load bullxmpi
  or
  source ${bullxmpi_install_path}/bin/mpivars.{sh,csh}

Compilation
  mpicc, mpiCC, mpif77, mpif90
  Uses the Intel compilers by default
  GNU compilers via environment variables:
    OMPI_FC=gfortran, OMPI_F77=gfortran, OMPI_CC=gcc, OMPI_CXX=g++
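
For illustration, switching a build to the GNU toolchain could look like this (a sketch; hello.c is hypothetical):
$ module load bullxmpi
$ export OMPI_CC=gcc OMPI_CXX=g++ OMPI_FC=gfortran OMPI_F77=gfortran
$ mpicc -O2 -o hello hello.c
$ mpicc --showme:command          # check which backend compiler the wrapper calls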

science+computing ag, 2013

101

SLURM and bullx MPI

salloc/mpirun
  SLURM allocates the nodes, BullxMPI places the ranks
  BullxMPI options can be set by mpirun
  Syntax:
    salloc [Slurm options] -p <partition> -N nodesCount -n coreCount --exclusive
    mpirun [BullxMPI placement] [BullxMPI options] ./a.out

srun
  SLURM allocates the nodes and places the ranks
  BullxMPI options must be set with environment variables
  Syntax:
    srun -p partition --resv-ports [Slurm options] [Slurm placement] ./myapp

science+computing ag, 2013

102

SLURM and bullx MPI

sbatch
  Batch scripts (non-interactive)
  The script can use mpirun or srun
  Syntax:
    sbatch [Slurm options] -p partition -N nodesCount -n coreCount ./myapp_batch.sh

Basic examples
  salloc --exclusive -N 2 -n 32 -p 424E3 mpirun ./hello
  srun --resv-ports --cpu_bind=cores --distribution=block:block -N 2 -n 32 -p 424E3 ./hello

science+computing ag, 2013

103

Mapping and Binding

Mapping: distribution of ranks on nodes
  Use successive cores (default): mpirun --bycore
  Cycle over successive sockets: mpirun --bysocket
  Cycle over successive nodes: mpirun --bynode

Binding ranks to a core, socket, ...
  Bind to a single core (default): mpirun --bind-to-core
  Bind to an entire socket: mpirun --bind-to-socket
  Do not bind (i.e. bind to the whole node): mpirun --bind-to-none
  Custom bindings: mpirun --cpus-per-rank
  View the binding: mpirun --report-bindings
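
A possible combination of these options (a sketch; partition name and executable are hypothetical):
$ salloc --exclusive -N 2 -n 32 -p mpi2 \
    mpirun --bysocket --bind-to-socket --report-bindings ./a.out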

science+computing ag, 2013

104

bullx MPI and Lustre


Parallelization of file accesses
Set stripe count correctly
Should be a multiple of the number of compute nodes
Maximum: Number of OSTs in Lustre
Example: 10 nodes and 48 OSTs

Stripe count 10 or 20 or 30 or 40

Setting via command line


lfs setstripe -c 20 /mnt/lustre/lfs_file.new
cp /mnt/lustre/lfs_file /mnt/lustre/lfs_file.new
mv /mnt/lustre/lfs_file.new /mnt/lustre/lfs_file
lfs getstripe -c /mnt/lustre/lfs_file

science+computing ag, 2013

105

bullx MPI and Lustre


With a hint in the source code
  MPI_Info_set(info, "striping_factor", "32");
  MPI_File_open(MPI_COMM_WORLD, filename, MPI_MODE_CREATE | MPI_MODE_RDWR, info, &fh);

Automatically
When calling

MPI_File_open(comm, filename, amode, info, &fh)


amode contains MPI_MODE_CREATE

Stripe count will be set:

Number of compute nodes smaller than number of OSTs:


stripe count = Largest Multiple of #CN smaller than #OST

Number of compute nodes larger than number of OSTs:


stripe count = #OST

science+computing ag, 2013

106

bullx MPI Tuning


Setting MCA Parameters
  List of available parameters: ompi_info -a
  Set:
    mpirun --mca name value
    export OMPI_MCA_name=value

Communication devices
  btl: self, sm (shared memory), openib (InfiniBand), tcp (Ethernet)
  Exclude a device:
    mpirun --mca btl ^btl_to_exclude
  Set the device list:
    mpirun --mca btl self,btl1,btl2

btl_openib_use_eager_rdma
  Use RDMA for eager messages
  Lower latency BUT higher memory footprint
btl_openib_eager_limit
  Size of eager messages
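
For example, restricting the BTLs and enabling eager RDMA can be expressed either way (a sketch using the parameters above; the rank count is arbitrary):
$ export OMPI_MCA_btl=self,sm,openib
$ export OMPI_MCA_btl_openib_use_eager_rdma=1
$ mpirun -np 32 ./a.out
# equivalent, set only on the command line:
$ mpirun --mca btl self,sm,openib --mca btl_openib_use_eager_rdma 1 -np 32 ./a.out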
science+computing ag, 2013

107

bullx MPI Tuning

mpi_leave_pinned
  User buffers are left registered (decreases de/re-registration costs)
  BUT the application should reuse exactly the same send buffers
  IMB pingpong reaches maximum bandwidth
  This variable will be used by default in the next bullxmpi version (1.2.1.1)

Improve collective performance
  coll_tuned_use_dynamic_rules
    Switch used to decide whether static (compiled/if statements) or dynamic (built at runtime) decision function rules are used
  coll_tuned_alltoall_algorithm
    Parameter to select the alltoall algorithm
    0: ignore, 1: basic linear, 2: pairwise, 3: modified bruck, 4: linear with sync, 5: two proc only
  coll_tuned_allreduce_algorithm
    Parameter to select the allreduce algorithm
    0: ignore, 1: basic linear, 2: nonoverlapping (tuned reduce + tuned bcast), 3: recursive doubling, 4: ring, 5: segmented ring
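
For illustration, forcing a specific alltoall algorithm with the parameters above (a sketch; whether it helps depends on the application):
$ mpirun --mca coll_tuned_use_dynamic_rules 1 \
         --mca coll_tuned_alltoall_algorithm 2 \
         -np 64 ./a.out                     # 2 = pairwise, per the list above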

science+computing ag, 2013

108

bullx MPI Extensions


Device failover
Change of communication device if timeout occurs

Deadlock detection
Problem: Polling by MPI processes
If there is no activity for a longer time, a deadlock is assumed and the process is put
into micro-sleeps

Generalized Hierarchical Collective


Takes topology into account in collective operations

Warning Data Capture


Gathering unusual events on MPI level
Output in Log
must be enabled explicitly

science+computing ag, 2013

109

science+computing ag, 2013

110

SLURM and Intel MPI


The mpiexec.hydra Command (Hydra Process Manager)
SLURM is supported by the Intel MPI Library 4.0 Update 3 directly
through the Hydra PM.
Use the following command to start an MPI job within an existing SLURM
session:
mpiexec.hydra -bootstrap slurm -n <num_procs> a.out
The srun Command (SLURM, recommended)
This advanced method is supported by the Intel MPI Library 4.0 Update
3. This method is the best integrated with SLURM and supports process
tracking, accounting, task affinity, suspend/resume and other features.
Use the following commands to allocate a SLURM session and start an
MPI job in it, or to start an MPI job within a SLURM session already
created using the sbatch or salloc commands:
Set the I_MPI_PMI_LIBRARY environment variable to point to the SLURM
Process Management Interface (PMI) library:
export I_MPI_PMI_LIBRARY=/path/to/slurm/pmi/library/libpmi.so
Use the srun command to launch the MPI job:
srun -n <num_procs> a.out
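
Putting both steps into a batch script could look like this (a sketch; partition, sizes, module name and the path to libpmi.so are site-specific assumptions):
#!/bin/bash
#SBATCH -p mpi2
#SBATCH -N 2
#SBATCH -n 32
module load intel                                # assumed module providing Intel MPI
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so    # site-specific path to SLURM's PMI library
srun -n 32 ./a.out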
science+computing ag, 2013

111
