
HRSK-II Nutzerschulung

2013-05-30

Jan Wender
Senior IT Consultant

science+computing ag, 2013


science + computing ag

Founded: 1989
Offices: Tübingen, München, Düsseldorf, Berlin
Employees: ~300
Turnover 2012: ~30 million euros
Shareholder: Bull GmbH
Partners: Daikin (Japan), NICE srl (Italy), IBM Platform, Univa

science+computing ag, 2013

s+c Customers

[Map of customer and office locations in Germany: Bremen, Hamburg, Wolfsburg, Beelen, Duisburg, Köln, Aachen, Alzenau, Mannheim, Stuttgart; offices in Düsseldorf, Berlin and München; service locations in Frankfurt and Ingolstadt; headquarters in Tübingen]

science+computing ag, 2013

s+c Services

Software Development
Remote Visualisation
Script Solutions
InfiniBand Network
IT Services
Cluster Management
Distributed Resource Management
Cluster Filesystems

science+computing ag, 2013

Bull: European leader in mission-critical digital systems

9,000 experts recognized worldwide in secure systems
Operating in 50 countries
+23% efforts in research in 2011
1.3bn revenues, +4.6% growth in 2011
+29% growth in profitability in 1st quarter 2012

science+computing ag, 2013
7

Creating market leadership in Extreme Computing

[Chart: Revenue 2007-2011 (scale 30-180) - sustained global growth with a strong international focus; revenue split France vs. rest of world, 2014]
[Chart: HPC experts 2007-2011 (scale 300-600) - the largest group of HPC experts in Europe]

science+computing ag, 2013
8

HPC Solutions
Infrastructure
Data center design
Mobile DataCenter
Water-cooling

Servers
Full range development from
ASICs to boards, blades, racks
Support for accelerators

Software Stack
Open, scalable, reliable
Linux, OpenMPI, Lustre, Slurm
Complete administration & monitoring

Expertise

Architecture
Benchmarking
Storage
HPC Cloud

science+computing ag, 2013

3 petaflops-scale systems: TERA 100, CURIE & IFERC

TERA 100: 1.25 PetaFlops, 140,000+ Xeon cores, 256 TB memory, 30 PB disk storage, 500 GB/s IO throughput, 580 m² footprint
CURIE:    2 PetaFlops, 90,000+ Xeon cores, 148,000 GPU cores, 360 TB memory, 10 PB disk storage, 250 GB/s IO throughput, 200 m² footprint
IFERC:    1.5 PetaFlops, 70,000+ Xeon cores, 280 TB memory, 15 PB disk storage, 120 GB/s IO throughput, 200 m² footprint

science+computing ag, 2013
10

HRSK-II

science+computing ag, 2013

11

HRSK-II

Phase 1: 2013

Phase 2: 2014+

science+computing ag, 2013

12

Phase 1

science+computing ag, 2013

13

Phase 1 Island 1

270 Nodes B710 SandyBridge


FDR Infiniband

science+computing ag, 2013

14

Phase 1 Island 2

180 Nodes B510 Westmere


QDR Infiniband

science+computing ag, 2013

15

Phase 1 Island 3

24 Nodes B510 Westmere
QDR Infiniband

[Topology diagram: two top-level 36-port switches, three 36-port leaf switches plus two 36-port service switches, blade chassis of 8 nodes each attached via 3x IB uplinks]

science+computing ag, 2013

16

bullx blade system naming

B _ _ _
  1st digit - cooling: 5 = air (water-cooled door), 7 = Direct Liquid Cooling
  2nd digit - CPU generation: 0 = Nehalem/Westmere, 1 = Sandy Bridge, 2 = Haswell
  3rd digit - 0 = pure CPU, 5 = CPU + accelerators

                     Air / water-cooled door   Direct Liquid Cooling
Pure CPU             B500                      B710
CPU + accelerators   B515                      B715

science+computing ag, 2013

17

B500

science+computing ag, 2013

18

bullx chassis packaging

[7U chassis: 18x blades, LCD unit, CMM, 4x PSU, ESM]

science+computing ag, 2013

19

bullx B500 compute blade

[Blade layout: connector to backplane, Westmere-EP CPUs w/ 1U heatsink, 12x DDR III DIMMs, Tylersburg I/O hub w/ short heatsink, ConnectX QDR, iBMC, ICH10, fans, 1.8" HDD/SSD]
science+computing ag, 2013

20

bullx B500 compute blade - block diagram

[Block diagram: 2x Nehalem/Westmere-EP linked by QPI, 31.2 GB/s memory bandwidth per socket, QPI (12.8 GB/s each direction) to the Tylersburg I/O controller, PCIe 8x (4 GB/s) to the ConnectX QDR InfiniBand adapter, GbE, SATA SSD or diskless]

science+computing ag, 2013

21

B710

science+computing ag, 2013

22

bullx Direct Liquid Cooling system

science+computing ag, 2013

23

bullx B710 DLC blade: design

2 dual-socket nodes per blade

[Thermal simulation of a dual-socket node and cross-section showing DIMMs and CPU]
science+computing ag, 2013

24

bullx B710 DLC compute blade - block diagram

[Block diagram per node: 2x SNB sockets linked by QPI (8 GT/s), 4x DDR3@1600 channels per socket (51.2 GB/s), PCH attached via DMI with 2x HDD/SSD, BMC with GbE, ConnectX-3 (CX3) on PCIe-3 8x providing IB FDR 56 Gb/s]

bullx B710
  Proc:     2x SandyBridge-EP (IvyBridge)
  RAM:      8x DDR3@1600MHz (1866 MHz)
  HD:       2x SATA HDD/SSD 2.5"
  IB:       single ConnectX-3 FDR (56Gb/s), dual as an option
  Ethernet: 2x GbE

science+computing ag, 2013

25

B515

science+computing ag, 2013

26

bullx B515 accelerator blade

Available Q1 2013
Double-width blade
2x NVIDIA K20 GPUs (Kepler) or 2x Intel Xeon Phi coprocessors
2x Intel Xeon E5-24xx (SandyBridge) CPUs
1 dedicated PCIe-3 16x connection for each accelerator
Double InfiniBand FDR connections between blades
165 TF/rack - 6 racks for 1 PF

[Photo: blade carrying 2x CPUs and 2x GPUs/Xeon Phis]

science+computing ag, 2013

27

bullx B515 block diagram

[Block diagram: 2x SNB sockets linked by QPI (8 GT/s), 3x DDR3@1600 channels per socket (38.4 GB/s), PCH via DMI with HDD/SSD, BMC with GbE, 2x ConnectX-3 (CX3) on PCIe-3 8x providing IB FDR 56 Gb/s, one dedicated PCIe-3 16x link per accelerator (MIC / GPU)]

science+computing ag, 2013

28

R428 SMP node

science+computing ag, 2013

29

bullx R428 SMP node

Processor: 4x Intel Xeon E5-4600
  2 QPI links (6.4 GT/s, 7.2 GT/s or 8.0 GT/s)
Chipset: Patsburg PCH (C602)
Memory:
  32 DIMM sockets: 8 DIMMs per CPU, 2 DIMMs per channel
  Up to 1TB with 32GB RDIMMs (1600MT/s)
Local Disk

science+computing ag, 2013

30

bullx R428 block diagram

science+computing ag, 2013

31

Compute Power Phase 1

[Chart: aggregate compute power of the Phase 1 islands]

science+computing ag, 2013

32

Workload Management: SLURM

science+computing ag, 2013

33

SLURM

science+computing ag, 2013

34

SLURM - Simple Linux Utility for Resource Management

History and Facts
  Developed at LLNL since 2003, at SchedMD since 2011
  Multiple enterprises and research centers have been contributing to the project (LANL, CEA, HP, BULL, BSC, etc.)
  Large international community
    Active mailing lists
    Contributions
  Used on more than 30% of the world's largest supercomputers
    Sequoia (IBM), 16.32 petaflop/s, 2nd in 2012
    Tianhe-1A (NUDT), 2.5 petaflop/s, 8th in 2012
    Curie (BULL), 1.6 petaflop/s, 11th in 2012


science+computing ag, 2013

35

Bull and SLURM

BULL initially started to work with SLURM in 2005
  At least 5 active BULL developers since then
  Development of new SLURM features
  Corrections of bugs and support requests
Integrated into the bullx cluster offers since 2006
Close collaboration between BULL, SchedMD and LLNL
Slurm User Group (SUG)
  2nd SUG Conference September 2011 - BULL sponsors and organizes with SchedMD and LLNL
  3rd SUG Conference October 2012 - BULL presents new features
  4th SUG Conference September 2013 - Oakland (USA)

science+computing ag, 2013

36

SLURM Architecture

science+computing ag, 2013

37

SLURM Entities

Jobs: Resource allocation requests


Job steps: Set of (typically parallel) tasks
Partitions: Job queues with limits and access controls
Nodes

NUMA boards

Sockets

Cores

Hyperthreads

Memory
Generic Resources (e.g. GPUs)

science+computing ag, 2013

38

SLURM Entities Example

Users submit jobs to a partition (queue)

Partition debug
Job 1
Job 2
Job 3

science+computing ag, 2013

39

SLURM Entities Example

Jobs are allocated resources

Partition debug

Job 1
Job 2
Job 3

science+computing ag, 2013

Node: tux123
Socket: 0
Core 0
Core 1
Core 2
Core 3
Core 4
Core 5

40

SLURM Entities Example

Jobs spawn steps, which are allocated resources from


within the job's allocation

Partition debug

Step 0

Job 1
Job 2
Job 3

Step 1

science+computing ag, 2013

Node: tux123
Socket: 0
Core 0
Core 1
Core 2
Core 3
Core 4
Core 5

#!/bin/bash
srun -n4 --exclusive a.out &
srun -n2 --exclusive a.out &
wait

41

SLURM Node and Job States


Node States
Down
Idle
Allocated
Completing
Draining
Drained

Job States
Pending
Configuring
Resizing
Running
Suspended
Completing

When finished:
  Cancelled
  Preempted
  Completed (zero exit code)
  Failed (non-zero exit code)
  TimeOut
  NodeFail
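
A quick way to inspect these states on a live system (a minimal sketch; the state names follow the lists above):
$ sinfo -t idle,drained          # nodes that are Idle or Drained
$ squeue -t pending,running      # jobs that are Pending or Running
$ squeue -t all -u $USER         # also show finished jobs still in slurmctld's records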
science+computing ag, 2013

42

SLURM Commands: Job/step Allocation

sbatch   Submit script for later execution (batch mode)
salloc   Create job allocation and start a shell to use it (interactive mode)
srun     Create a job allocation (if needed) and launch a job step (typically an MPI job)
sattach  Connect stdin/out/err for an existing job or job step

science+computing ag, 2013

43

Job and Step Allocation Examples


Submit sequence of three batch jobs
> sbatch --ntasks=1 --time=10 pre_process.bash
Submitted batch job 45001
> sbatch --ntasks=128 --time=60 --depend=45001 do_work.bash
Submitted batch job 45002
> sbatch --ntasks=1 --time=30 --depend=45002 post_process.bash
Submitted batch job 45003
Example Job Script
#!/bin/bash
#SBATCH -N 2                       # number of nodes
#SBATCH -n 2                       # number of cores
#SBATCH -o /home/user/job-%j.out   # output file
#SBATCH -e /home/user/job-%j.err   # error output file
#SBATCH -p exec                    # partition
#SBATCH --exclusive                # default on HRSK-II
srun -N2 -n2 hostname
srun -N2 -n2 time                  # multiple srun commands for multiple job steps
science+computing ag, 2013

44

Job and Step Allocation Examples

Create allocation for 2 tasks, then launch hostname on the allocation, label output with the task ID
> srun --ntasks=2 --label hostname
0: tux123
1: tux123
As above, but allocate the job two whole nodes
> srun --nodes=2 --label hostname
0: tux123
1: tux124

science+computing ag, 2013

SchedMD LLC
http://www.schedmd.com

45

Job and Step Allocation Examples


Create allocation for 4 tasks and 10 minutes for bash shell,
then launch some tasks
> salloc --ntasks=4 --time=10 bash
salloc: Granted job allocation 45000
> env | grep SLURM
SLURM_JOBID=45000
SLURM_NPROCS=4
SLURM_JOB_NODELIST=tux[123-124]
...
> hostname
tux_login
> srun --label hostname
0: tux123
1: tux123
2: tux124
3: tux124
> exit (terminate bash shell)

science+computing ag, 2013

SchedMD LLC
http://www.schedmd.com

46

Different Executables by Task ID

Different programs may be launched by task ID with different arguments
Use --multi-prog option and specify configuration file instead
of executable program
Configuration file lists task IDs, executable programs, and
arguments (%t mapped to task ID and %o mapped to offset
within task ID range)
> cat master.conf
#TaskID  Program         Arguments
0        /usr/me/master
1-4      /usr/me/slave   --rank=%o
> srun --ntasks=5 --multi-prog master.conf

science+computing ag, 2013

47

SLURM Commands

sinfo    Report system status (nodes, queues, etc.)
squeue   Report job and job step status
smap     Report system, job or step status with topology (curses-based GUI), less functionality than sview
sview    Report and/or update system, job, step, partition or reservation status with topology (GTK-based GUI)
scontrol Administrator tool to view and/or update system, job, step, partition or reservation status
scancel  Signal/cancel jobs or job steps
sbcast   Transfer a file to the compute nodes allocated to a job (uses hierarchical communications)

science+computing ag, 2013

48

sinfo Command

Reports status of nodes or partitions

Partition-oriented format is the default

Almost complete control over filtering, sorting and output format is available

> sinfo --Node (report status in node-oriented form)
NODELIST      NODES PARTITION STATE
tux[000-099]    100 batch     idle
tux[100-127]     28 debug     idle
> sinfo -p debug (report status of nodes in partition debug)
PARTITION AVAIL TIMELIMIT NODES NODELIST
debug     up    60:00        28 tux[100-127]
> sinfo -i60 (report status every 60 seconds)

science+computing ag, 2013

49

squeue Command

Reports status of jobs and/or steps in the slurmctld daemon's records (recent jobs only; older information is available in the accounting records)
Almost complete control over filtering, sorting and output format is available

> squeue -u alec -t all (report jobs for user alec in any state)
JOBID PARTITION NAME  USER ST TIME NODES NODELIST(REASON)
45124 debug     a.out alec CD 0:12     1 tux123
> squeue -s -p debug (report steps in partition debug)
STEPID  PARTITION NAME  USER TIME  NODELIST
45144.0 debug     a.out moe  12:18 tux[100-115]
> squeue -i60 (report currently active jobs every 60 seconds)

science+computing ag, 2013

50

sview

science+computing ag, 2013

51

scontrol Command

Designed for system administrator use


Shows all available fields, but no filtering, sorting or
formatting options
Many fields can be modified

> scontrol show partition


PartitionName=debug
AllocNodes=ALL AllowGroups=ALL Default=YES
DefaultTime=NONE DisableRootJobs=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1
Nodes=tux[000-031]
Priority=1 RootOnly=NO Shared=NO PreemptMode=OFF State=UP
TotalCPUs=64 TotalNodes=32 DefMemPerNode=512 MaxMemPerNode=1024
> scontrol update PartitionName=debug MaxTime=60

science+computing ag, 2013

52

scancel Command

Cancel a running or pending job or step


Can send arbitrary signal to all processes on all nodes
associated with a job or step
Has filtering options (state, user, partition, etc.)
Has interactive (verify) mode

> scancel 45001.1 (cancel job step 45001.1)


> scancel 45002 (cancel job 45002)
> scancel --user=alec --state=pending (cancel all pending jobs from user alec)

science+computing ag, 2013

53

sbcast Command

Copy a file to local disk on allocated nodes

Execute command after a resource allocation is made

Data transferred using hierarchical slurmd daemon communications
May be faster than a shared file system
> salloc -N100 bash
salloc: Granted job allocation 45201
> sbcast --force my_data /tmp/moe/my_data (overwrite old files)
> srun a.out
> exit (terminate spawned bash shell)

science+computing ag, 2013

54

Partitions and QOS

Partitions and QOS are used in SLURM to group node and job characteristics
The use of the Partition and QOS (Quality of Service) entities in SLURM is orthogonal:
  Partitions group resource characteristics
  QOS group limitations and priorities

[Diagram: Partition high (32 nodes, high memory) and Partition low (32 nodes, low memory) make up Partition all (64 nodes); QOS 1 (higher priority, higher limits) and QOS 2 (lower priority, lower limits) apply across the partitions]

science+computing ag, 2013

55

Partitions and QoS

Partition configuration in the slurm.conf file
  Can be shared: more than one job on a resource (node, socket, core)
  Can be exclusive
  Show with: scontrol show partitions

QoS configuration in the database
  Used to provide detailed limitations and priorities for jobs
  Every user/account can have multiple allowed QoS
    one default QoS
    submit jobs with the --qos parameter
  Show with: sacctmgr show qos
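
For illustration, a minimal sketch of the user-side commands; the QoS name 'normal' and the job script are assumptions, not taken from this system's configuration:
$ scontrol show partitions                          # partitions and their limits
$ sacctmgr show qos format=name,priority,maxwall    # QoS definitions in the database
$ sbatch -p mpi2 --qos=normal job.sh                # submit with an explicit QoS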

science+computing ag, 2013

56

Partitions on taurus

Name          mpi       mpi2            smp
Nodes         180       270             2
Cores         2160      4320            64
CPU-Type      X5660     E5-2690         E5-4650L
CPU-Freq.     2.8 GHz   2.9 GHz         2.6 GHz
Core/Socket   6         8               8
Socket/Node   2         2               4
Cores/Node    12        16              32
Memory/Node   48 GB     32/64/128 GB    1 TB

Partitions with _lustre appended assert Lustre availability, e.g. mpi_lustre, mpi2_lustre, smp_lustre
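
For illustration, submissions into these partitions might look as follows (a sketch; the job scripts and sizes are hypothetical):
$ sbatch -p mpi2 -N 4 --exclusive job.sh             # 4 Sandy Bridge nodes = 64 cores
$ sbatch -p mpi2_lustre -N 4 --exclusive io_job.sh   # same hardware, asserts Lustre availability
$ sbatch -p smp -N 1 job.sh                          # one large-memory SMP node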

science+computing ag, 2013

57

Interactive Usage 1
$ salloc -p mpi -N 2
salloc: Granted job allocation 19206
$ env | grep SLURM
SLURM_NODELIST=taurusi[3087-3088]
SLURM_NODE_ALIASES=(null)
SLURM_NNODES=2
SLURM_JOBID=19206
SLURM_TASKS_PER_NODE=1(x2)
SLURM_JOB_ID=19206
SLURM_SUBMIT_DIR=/home/h4/bull/jw
SLURM_JOB_NODELIST=taurusi[3087-3088]
SLURM_JOB_CPUS_PER_NODE=1(x2)
SLURM_SUBMIT_HOST=tauruslogin1
SLURM_JOB_NUM_NODES=2
SLURM_CPU_BIND=threads
$ hostname
tauruslogin1
$ srun --label hostname
0: taurusi3087
1: taurusi3088
$
science+computing ag, 2013

Ask for 2 CPUs on 2 nodes
Environment variables are set
The subshell is executed on the submit host
To execute commands on compute nodes, use srun

58

Interactive Usage 2
$ srun --label bash
hostname
0: taurusi3087
1: taurusi3088
env | egrep "SLURM_(JOBID|NODEID|PROCID|STEPID)"
0: SLURM_JOBID=19211
0: SLURM_STEPID=1
0: SLURM_NODEID=0
0: SLURM_PROCID=0
1: SLURM_JOBID=19211
1: SLURM_STEPID=1
1: SLURM_NODEID=1
1: SLURM_PROCID=1
exit
$ hostname
tauruslogin1
$ env | grep JOBID
SLURM_JOBID=19206
$ exit
exit
salloc: Relinquishing job allocation 19206
$ hostname
tauruslogin1
$ env | grep JOBID
$

science+computing ag, 2013

To execute a shell on all allocated nodes - no prompt is displayed!
The command line is executed on all nodes in parallel
Back on the submit host - still inside the allocation

59

Interactive Usage 3
$ salloc -p mpi -n 2
salloc: Granted job allocation 19213
[bull@tauruslogin1 jw 10:00:48 ]
$ srun --label bash
env | egrep "SLURM_(JOBID|NODEID|PROCID|STEPID)"
0: taurusi3087: SLURM_JOBID=19213
0: taurusi3087: SLURM_STEPID=0
0: taurusi3087: SLURM_NODEID=0
0: taurusi3087: SLURM_PROCID=1
1: taurusi3087: SLURM_JOBID=19213
1: taurusi3087: SLURM_STEPID=0
1: taurusi3087: SLURM_NODEID=0
1: taurusi3087: SLURM_PROCID=0

Ask for 2 CPUs (= cores)
Get 2 cores on 1 node
NODEID is the same, PROCID is different

science+computing ag, 2013

60

Job Priority
Multifactor Priority Calculation
Priority: 0 .. 4294967295
Factors: Age, Fair-Share, Job size, Partition, QOS (0.0 .. 1.0)
Job_priority = (PriorityWeightAge) * (age_factor) +
(PriorityWeightFairshare) * (fair-share_factor) +
(PriorityWeightJobSize) * (job_size_factor) +
(PriorityWeightPartition) * (partition_factor) +
(PriorityWeightQOS) * (QOS_factor)
$ sprio -l
  JOBID USER  PRIORITY  AGE  FAIRSHARE  JOBSIZE  PARTITION  QOS  NICE
  19200 bull         5    4          0        1          0    0     0
  19207 bull        24    4          0       21          0    0     0
$ sprio -w
  JOBID     PRIORITY   AGE  FAIRSHARE  JOBSIZE
  Weights             1000     100000     1000

science+computing ag, 2013

61

SLURM Associations
$ sacctmgr list user format=user,defaultaccount where user=$USER
      User   Def Acct
---------- ----------
      bull  everybody
$ sacctmgr list associations format=user,account,partition where user=$USER
      User    Account  Partition
---------- ---------- ----------
      bull      tests
      bull  benchmark
      bull   project1
      bull  everybody
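
To charge a job to one of these accounts instead of the default, something like the following can be used (a sketch; the script name is hypothetical, the account name is taken from the listing above):
$ sbatch -A project1 -p mpi2 job.sh                  # -A/--account selects the association to charge
$ sacct -X -u $USER --format=jobid,account,state     # verify which account was charged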

science+computing ag, 2013

62

SLURM Accounting

$ sacct -l -j 19200

JobID        JobName   Partition  AllocCPUS  NTasks  Elapsed   State      ExitCode
19200        test.job  mpi2       16                 00:00:00  COMPLETED  0:0
19200.batch  batch                1          1       00:00:00  COMPLETED  0:0
19200.0      hostname             1          1       00:00:00  COMPLETED  0:0
19200.1      cat                  1          1       00:00:00  COMPLETED  0:0
19200.2      free                 1          1       00:00:00  COMPLETED  0:0

Remaining -l fields for the steps of this (trivial) job:
  AveCPU, MinCPU: 00:00:00
  MaxVMSize, AveVMSize, MaxRSS, AveRSS, MaxPages, AvePages, AveCPUFreq, ReqCPUFreq,
  ConsumedEnergy, MaxDiskRead, AveDiskRead, MaxDiskWrite, AveDiskWrite: 0
  MaxVMSizeNode, MaxRSSNode, MaxPagesNode, MinCPUNode, MaxDiskReadNode, MaxDiskWriteNode: taurusi1226
  MaxVMSizeTask, MaxRSSTask, MaxPagesTask, MinCPUTask: 0
  MaxDiskReadTask, MaxDiskWriteTask: 65534
  ReqMem: 0n

science+computing ag, 2013

63

SLURM Causal Research

Simple job script
$ cat test.job
#!/bin/bash
#SBATCH -N 1
# SBATCH -n 2
#SBATCH -o /home/bull/jw/job-%j.out
#SBATCH -e /home/bull/jw/job-%j.err
#SBATCH -p mpi2
srun hostname
srun cat /proc/cpuinfo
srun free -g
$ sbatch test.job
Submitted batch job 19200
$

science+computing ag, 2013

64

SLURM Causal Research

Why doesn't the job start?
$ squeue -j 19200
  JOBID PARTITION     NAME  USER ST  TIME NODES NODELIST(REASON)
  19200      mpi2 test.job  bull PD  0:00     1 (ReqNodeNotAvail)
$ sinfo -t idle
PARTITION    AVAIL TIMELIMIT NODES STATE NODELIST
mpi*         up    infinite     87 idle  taurusi[3002-3007,3009-3015,3088,3090,3109-3180]
mpi2         up    infinite    241 idle  taurusi[1001,1003,1005-1007,1009-1013,1015-1180,1182-1232,1255-1261,1264-1270]
smp          up    infinite      1 idle  taurussmp1
mpi_lustre   up    infinite     87 idle  taurusi[3002-3007,3009-3015,3088,3090,3109-3180]
mpi2_lustre  up    infinite    241 idle  taurusi[1001,1003,1005-1007,1009-1013,1015-1180,1182-1232,1255-1261,1264-1270]
smp_lustre   up    infinite      1 idle  taurussmp1

science+computing ag, 2013

65

SLURM Causal Research


Why doesn't the job start?
$ scontrol show job 19200
JobId=19200 Name=test.job
UserId=bull(2054944) GroupId=tests(200026)
Priority=11 Account=everybody QOS=normal
JobState=PENDING Reason=ReqNodeNotAvail Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
RunTime=00:00:00 TimeLimit=04:00:00 TimeMin=N/A
SubmitTime=2013-05-29T09:32:11 EligibleTime=2013-05-29T09:32:11
StartTime=2013-05-30T11:20:24 EndTime=Unknown
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=mpi2 AllocNode:Sid=tauruslogin1:17145
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null)
NumNodes=1-1 NumCPUs=1 CPUs/Task=1 ReqS:C:T=*:*:*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
Shared=0 Contiguous=0 Licenses=(null) Network=(null)
Command=/home/h4/bull/jw/test.job
WorkDir=/home/h4/bull/jw
$ scontrol show reservation
ReservationName=bull_24 StartTime=2013-05-29T13:00:00 EndTime=2013-05-29T18:00:00
Duration=05:00:00
Nodes=taurusi[1001-1270] NodeCnt=270 CoreCnt=4320 Features=(null) PartitionName=(null)
Flags=SPEC_NODES
Users=bull Accounts=(null) Licenses=(null) State=INACTIVE

science+computing ag, 2013

66

SLURM Causal Research


Why doesn't the job start?
$ scontrol show job 19200
JobId=19200 Name=test.job
UserId=bull(2054944) GroupId=tests(200026)
Priority=11 Account=everybody QOS=normal
JobState=PENDING Reason=ReqNodeNotAvail Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
RunTime=00:00:00 TimeLimit=04:00:00 TimeMin=N/A
SubmitTime=2013-05-29T09:32:11 EligibleTime=2013-05-29T09:32:11
StartTime=2013-05-30T11:20:24 EndTime=Unknown
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=mpi2 AllocNode:Sid=tauruslogin1:17145
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null)
NumNodes=1-1 NumCPUs=1 CPUs/Task=1 ReqS:C:T=*:*:*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
Shared=0 Contiguous=0 Licenses=(null) Network=(null)
Command=/home/h4/bull/jw/test.job
WorkDir=/home/h4/bull/jw
$ scontrol show reservation
ReservationName=bull_24 StartTime=2013-05-29T13:00:00 EndTime=2013-05-29T18:00:00
Duration=05:00:00
Nodes=taurusi[1001-1270] NodeCnt=270 CoreCnt=4320 Features=(null) PartitionName=(null)
Flags=SPEC_NODES
Users=bull Accounts=(null) Licenses=(null) State=INACTIVE
$ scontrol update jobid=19200 TimeLimit=00:10:00

science+computing ag, 2013

67

SLURM Causal Research


Remedy:
$ cat test.job
#!/bin/bash
#SBATCH -N 1
# SBATCH -n 2
#SBATCH -o /home/bull/jw/job-%j.out
#SBATCH -e /home/bull/jw/job-%j.err
#SBATCH -p mpi2
#SBATCH --time=5
srun hostname
srun cat /proc/cpuinfo
srun free -g
$ sbatch test.job
Submitted batch job 19200
$

science+computing ag, 2013

68

MPI Support

Many different MPI implementations are supported:

MPICH1, MPICH2, MVAPICH, OpenMPI, etc.

Many use srun to launch the tasks directly


Some use mpirun or another tool within an existing
SLURM allocation (they reference SLURM environment
variables to determine what resources are allocated to the
job)
Details are online:
http://www.schedmd.com/slurmdocs/mpi_guide.html

science+computing ag, 2013

69

Translation aid between batch systems
http://www.schedmd.com/slurmdocs/rosetta.pdf

science+computing ag, 2013

70

Credits

SLURM slides with input from


Moe Jette, SchedMD

Additional input from


Christophe Berthelot, Bull

science+computing ag, 2013

71

Intel Cluster Studio

science+computing ag, 2013

72

Intel Cluster Studio XE


Slides will be integrated later

science+computing ag, 2013

73

CPU types on taurus

Partition          mpi           mpi2            smp
Name               Westmere-EP   SandyBridge-EP  SandyBridge-EP
CPU-Type           X5660         E5-2690         E5-4650L
CPU-Freq.          2.8 GHz       2.9 GHz         2.6 GHz
Max. Turbo-Freq.   3.2 GHz       3.8 GHz         3.1 GHz
Core/Socket        6             8               8
Max Sockets/Node   2             2               4
Cores/Node         12            16              32
Cache              12 MB         20 MB           20 MB
Memory Channels    3             4               4
Max. Memory BW     32 GB/s       51.2 GB/s       51.2 GB/s
Extensions         SSE4.2        AVX             AVX
QPI links          2             2               2
QPI Speed          6.4 GT/s      8 GT/s          8 GT/s
Max. TDP           95 W          135 W           115 W

science+computing ag, 2013

74

Intel Compiler Flags for Optimization


General Optimization
-O2: Optimize for Speed (Default)
-O3: Optimize for applications with large data sets or many floating point operations (loop unrolling, scalar replacement, ...)
-O0: No Optimization, use for Debugging
-opt-report: Generate report on performed optimizations
-opt-report-phase=hpo: Report from High Performance Optimizer
(including vectorizer and parallelizer)
-parallel: Invoke auto-parallelizer
-mkl=type: Compile with MKL, Type can be parallel (use threaded
MKL), sequential (use non-threaded MKL), cluster (use cluster and
sequential MKL)
-ip: Single file interprocedural optimization
-ipo: Interprocedural optimization among multiple files
http://software.intel.com/sites/default/files/Compiler_QRG_2013.pdf
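A minimal sketch of compile lines combining some of the flags above (the source file names are hypothetical):
$ icc -O3 -ip -parallel -opt-report -opt-report-phase=hpo -o solver solver.c
$ ifort -O2 -mkl=sequential -o post post.f90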
science+computing ag, 2013

75


Compiler Flags for Optimization


Common
  -vec-report=[0,1,2,3,6]: Generate report on vectorization (for analysis use 3)

Westmere
  -xSSE4.2: Enable additional optimizations not enabled with -m

SandyBridge
  -xAVX: Enable additional optimizations not enabled with -m

Mixed Executable
  -msse4.2 -axAVX: Runs on Westmere and SandyBridge, executable includes 2 code paths

http://software.intel.com/sites/default/files/Compiler_QRG_2013.pdf
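For illustration, a mixed Westmere/Sandy Bridge build with the flags above (the source file is hypothetical):
$ mpicc -O3 -msse4.2 -axAVX -vec-report=3 -o app app.c   # SSE4.2 baseline plus an AVX code path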
science+computing ag, 2013

76

Bullx scs AE

science+computing ag, 2013

77

bullx supercomputer suite architecture

science+computing ag, 2013

78

Bullx DE
Installation path: /opt/bullxde
Access via modules:
  module avail
  module load bullxde
  module unload
--------------- /opt/bullxde/modulefiles/debuggers --------------
padb/3.2
--------------- /opt/bullxde/modulefiles/utils ------------------
OTF/1.8
--------------- /opt/bullxde/modulefiles/profilers --------------
hpctoolkit/4.9.9_3111_Bull.2
--------------- /opt/bullxde/modulefiles/perftools --------------
bpmon/1.0_Bull.1.20101208  papi/4.1.1_Bull.2  ptools/0.10.4_Bull.4.20101203
--------------- /opt/bullxde/modulefiles/mpicompanions ----------
boost-mpi/1.44.0  mpianalyser/1.1.4  scalasca/1.3.

science+computing ag, 2013

79

bullx DE Tools

science+computing ag, 2013

80

padb
Job inspection tool; collects stack traces
http://padb.pittman.org.uk/
For hanging jobs that have not crashed

Synopsis
padb -O rmgr=slurm -x[t] -a|<jobid>
export PADB_RMGR=slurm
$ salloc -p Zeus -IN 3
salloc: Granted job allocation 47136
$ mpirun -n 9 pp_sndrcv_spb
$ ./padb -O rmgr=slurm -x 47136
0:ThreadId: 1
0:main() at pp_sndrcv_spbl.c:52
0:PMPI_Finalize() at ?:?
0:ompi_mpi_finalize() at ?:?
0:barrier() at ?:?
0:opal_progress() at ?:?
0:opal_event_loop() at ?:?
0:poll_dispatch() at ?:?

science+computing ag, 2013

81

bullxprof
Lightweight profiling tool
Usage
  module load bullxprof/<version>
  module load papi
  module load bullxmpi

Configuration

Config file only:
  app.modules.excluded
  app.functions.excluded
  bullxprof.io.functions
  bullxprof.mpi.functions
  bullxprof.papi.counters

Command line or config file:
  bullxprof.experiments=timing|hwc
  bullxprof.tracelevel=1|2|3   (1: Basic, 2: Detailed, 3: Advanced)
  bullxprof.debug=0|1|2|3      (0: Off, 1: Low, 2: Medium, 3: High)
  bullxprof.smartdisplay

science+computing ag, 2013

82

bullxprof
bullxprof [ -l ] [ -s ] [ -e experiments ] [ -d debuglevel ] [ -t tracelevel ] program [prog-arguments]
  -l   Print list of instrumentable functions
  -s   Enable smart display
  -e   timing, hwc
  -d   0: Off, 1: Low, 2: Medium, 3: High, Default: 0
  -t   1: Basic, 2: Detailed, 3: Advanced, Default: 1

bullxprof with mpirun
  bullxprof <bullxprof args> mpirun <mpirun args> program <program args>
bullxprof with srun
  bullxprof <bullxprof args> srun <srun args> program <program args>

science+computing ag, 2013

83

MPI Analyser
MPI profiling tool, non-invasive
Logs during execution
Post-mortem analysis

Analyses
  Communication
    point-to-point messages, count and size
    collective messages, count and size
  Execution time
    Maximum time interval between MPI_Init and MPI_Finalize
  Table of calls of MPI functions
    Number of calls for each profiled function
  Message size histograms
    point-to-point, collective communications

science+computing ag, 2013

84

MPI Analyser Usage

Load modules
  Source Intel compilers: module load intel
  Source bullxmpi: module load bullxmpi
  Source BullxDE: module load bullxde
  Load the MPIAnalyser module file: module load mpianalyser

Link the application with the library
  C: mpicc file.o -o exec ${MPIANALYSER_LINK}
  Fortran: mpif90 file.o -o exec -lmpi_f77 -lmpi_f90 ${MPIANALYSER_LINK}

Use the environment to enable it (default is disabled)
  MPIANALYSER_PROFILECOMM=1

Run your application
  MPIANALYSER_PROFILECOMM=1 srun -p partition -N 1 -n 4 ./foo

Analysis with readpfc


science+computing ag, 2013

85

MPI Analyser readpfc

Analysis; export as graphics

science+computing ag, 2013

86

Scalasca
Scalable Performance Analysis of Large-Scale Applications
http://www.scalasca.org/

FZ Jülich, German Research School for Simulation Sciences

Documentation: /opt/bullxde/mpicompanions/scalasca/doc

science+computing ag, 2013

87

Scalasca
Inside Scalasca
OPARI (OpenMP and user region instrumentation tool)
EPIK (measurement library for summarization & tracing)
EARL (serial event-trace analysis library)
EXPERT (serial automatic performance analyzer)
PEARL (parallel event-trace analysis library)
SCOUT (parallel automatic performance analyzer)
CUBE3 (analysis report presentation component & utilities)

science+computing ag, 2013

88

Scalasca Simple Usage


Instrumentation: use scalasca instrument
scalasca -instrument mpicc -c foo.c

Measurement: use scalasca -analyze


scalasca -analyze mpirun -np 8 ./cg.B.8
Variables EPK_ and ESD_ to configure scalasca like export
EPK_METRICS=CYCLES

Analysis Report Examination.


With GUI: use scalasca -examine

scalasca -examine epik_cg_8_sum_CYCLES

Without GUI: use cube3_score

cube3_score -m CYCLES -r epik_cg_8_sum_CYCLES/epitome.cube

science+computing ag, 2013

89

Scalasca GUI output

science+computing ag, 2013

90

xPMPI
Framework for using the MPI profiling layer (PMPI)
Invocation via modules:
  module load xPMPI/<version>_bullxmpi

Configuration via file
  /opt/bullxde/mpicompanions/xPMPI/etc/xpmpi.conf
  export PNMPI_CONF=<path to user defined configuration file>

Supported tools
  MPI Analyser
    see above
  IPM
    Reports statistics on performance and resource usage
    http://ipm-hpc.sourceforge.net/
  mpiP
    Lightweight collection of statistical information on each node
    http://mpip.sourceforge.net
  MPIPROF
    Lightweight MPI application profiler

science+computing ag, 2013

91

PAPI
Performance API
Foundation for performance analysis tools
Standard definitions for metrics
Standard API

High-level and low-level interface
  Start/stop/read event counters
  Configure the counters

The source code must be modified
  Not trivial to use
  See the bullx DE documentation

science+computing ag, 2013

92

bpmon
Command-line tool for single-node runs
Uses the PAPI interface
Example invocation
  bpmon -e INSTRUCTIONS_RETIRED,LLC_MISSES,MEM_LOAD_RETIRED:L3_MISS,MEM_UNCORE_RETIRED:LOCAL_DRAM,MEM_UNCORE_RETIRED:REMOTE_DRAM /opt/hpctk/test_cases/llclat -S -l 4 -i 256 -r 200 -o r

Processor performance reporting
Memory usage reporting

science+computing ag, 2013

93

HPCToolkit
http://hpctoolkit.org

Bull-extended version
  History, viewer, wrapper
  See the bullx DE documentation
science+computing ag, 2013

94

HPCToolkit

science+computing ag, 2013

95

HPCToolkit

science+computing ag, 2013

96

HPCToolkit

science+computing ag, 2013

97

Open|SpeedShop
http://www.openspeedshop.org
Goals: ease of use, modularity, extensibility
Scope:
  Sampling experiments
  Support for callstack analysis
  Hardware performance counters
  MPI profiling and tracing
  I/O profiling and tracing
  Floating point exception analysis

See the tutorials

science+computing ag, 2013

98

Darshan
Records the HPC I/O of an application
http://www.mcs.anl.gov/research/projects/darshan/
Recompilation required
  mpicc.darshan, mpiCC.darshan, mpif77.darshan, mpif90.darshan

DARSHAN_LOGPATH contains the path for the output data

For analysis:
  darshan-job-summary.pl <logfile>
    Generates a PDF
  darshan-parser <logfile>
    Human-readable output of all data
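
A possible end-to-end workflow with the wrappers above (a sketch; source file, partition and log path are assumptions):
$ module load bullxde
$ mpicc.darshan -O2 -o app app.c
$ export DARSHAN_LOGPATH=$HOME/darshan-logs              # assumed location for the log output
$ srun -p mpi2 -n 16 ./app
$ darshan-job-summary.pl $DARSHAN_LOGPATH/<logfile>      # generates the PDF report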

science+computing ag, 2013

99

bullx MPI

science+computing ag, 2013

100

Bullx MPI
bullx MPI is based on Open MPI
http://www.openmpi.org
  Most of the documentation is also valid for bullx MPI
bullx MPI conforms to the MPI-2 standard and supports up to the MPI_THREAD_SERIALIZED level

Usage
  module load bullxmpi
  or
  source ${bullxmpi_install_path}/bin/mpivars.{sh,csh}

Compilation
  mpicc, mpiCC, mpif77, mpif90
  Uses the Intel compilers by default
  GNU compilers via environment variables:
    OMPI_FC=gfortran, OMPI_F77=gfortran, OMPI_CC=gcc, OMPI_CXX=g++
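
For illustration, switching a build to the GNU toolchain could look like this (a sketch; hello.c is hypothetical):
$ module load bullxmpi
$ export OMPI_CC=gcc OMPI_CXX=g++ OMPI_FC=gfortran OMPI_F77=gfortran
$ mpicc -O2 -o hello hello.c
$ mpicc --showme:command          # check which backend compiler the wrapper calls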

science+computing ag, 2013

101

SLURM and bullx MPI

salloc/mpirun
  SLURM allocates the nodes, BullxMPI places the ranks
  BullxMPI options can be set by mpirun
  Syntax:
    salloc [Slurm options] -p <partition> -N nodesCount -n coreCount --exclusive
    mpirun [BullxMPI placement] [BullxMPI options] ./a.out

srun
  SLURM allocates the nodes and places the ranks
  BullxMPI options must be set with environment variables
  Syntax:
    srun -p partition --resv-ports [Slurm options] [Slurm placement] ./myapp

science+computing ag, 2013

102

SLURM and bullx MPI

sbatch
  Batch scripts (non-interactive)
  The script can use mpirun or srun
  Syntax:
    sbatch [Slurm options] -p partition -N nodesCount -n coreCount ./myapp_batch.sh

Basic examples
  salloc --exclusive -N 2 -n 32 -p 424E3 mpirun ./hello
  srun --resv-ports --cpu_bind=cores --distribution=block:block -N 2 -n 32 -p 424E3 ./hello

science+computing ag, 2013

103

Mapping and Binding

Mapping: distribution of ranks on nodes
  Use successive cores (default): mpirun --bycore
  Cycle over successive sockets: mpirun --bysocket
  Cycle over successive nodes: mpirun --bynode

Binding ranks to a core, socket, ...
  Bind to a single core (default): mpirun --bind-to-core
  Bind to an entire socket: mpirun --bind-to-socket
  Do not bind (i.e. bind to the whole node): mpirun --bind-to-none
  Custom bindings: mpirun --cpus-per-rank
  View the binding: mpirun --report-bindings
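
A possible combination of these options (a sketch; partition name and executable are hypothetical):
$ salloc --exclusive -N 2 -n 32 -p mpi2 \
    mpirun --bysocket --bind-to-socket --report-bindings ./a.out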

science+computing ag, 2013

104

bullx MPI and Lustre


Parallelization of file accesses
Set stripe count correctly
Should be a multiple of the number of compute nodes
Maximum: Number of OSTs in Lustre
Example: 10 nodes and 48 OSTs

Stripe count 10 or 20 or 30 or 40

Setting via command line


lfs setstripe -c 20 /mnt/lustre/lfs_file.new
cp /mnt/lustre/lfs_file /mnt/lustre/lfs_file.new
mv /mnt/lustre/lfs_file.new /mnt/lustre/lfs_file
lfs getstripe -c /mnt/lustre/lfs_file

science+computing ag, 2013

105

bullx MPI and Lustre


With a hint in the source code
  MPI_Info_set(info, "striping_factor", "32");
  MPI_File_open(MPI_COMM_WORLD, filename, MPI_MODE_CREATE | MPI_MODE_RDWR, info, &fh);

Automatically
When calling

MPI_File_open(comm, filename, amode, info, &fh)


amode contains MPI_MODE_CREATE

Stripe count will be set:

Number of compute nodes smaller than number of OSTs:


stripe count = Largest Multiple of #CN smaller than #OST

Number of compute nodes larger than number of OSTs:


stripe count = #OST

science+computing ag, 2013

106

bullx MPI Tuning


Setting MCA Parameters
  List of available parameters: ompi_info -a
  Set:
    mpirun --mca name value
    export OMPI_MCA_name=value

Communication devices
  btl: self, sm (shared memory), openib (InfiniBand), tcp (Ethernet)
  Exclude a device:
    mpirun --mca btl ^btl_to_exclude
  Set the device list:
    mpirun --mca btl self,btl1,btl2

btl_openib_use_eager_rdma
  Use RDMA for eager messages
  Lower latency BUT higher memory footprint
btl_openib_eager_limit
  Size of eager messages
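
For example, restricting the BTLs and enabling eager RDMA can be expressed either way (a sketch using the parameters above; the rank count is arbitrary):
$ export OMPI_MCA_btl=self,sm,openib
$ export OMPI_MCA_btl_openib_use_eager_rdma=1
$ mpirun -np 32 ./a.out
# equivalent, set only on the command line:
$ mpirun --mca btl self,sm,openib --mca btl_openib_use_eager_rdma 1 -np 32 ./a.out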
science+computing ag, 2013

107

bullx MPI Tuning

mpi_leave_pinned
  User buffers are left registered (decreases de/re-registration costs)
  BUT the application should reuse exactly the same send buffers
  IMB pingpong reaches maximum bandwidth
  This variable will be used by default in the next bullxmpi version (1.2.1.1)

Improve collective performance
  coll_tuned_use_dynamic_rules
    Switch used to decide whether static (compiled/if statements) or dynamic (built at runtime) decision function rules are used
  coll_tuned_alltoall_algorithm
    Parameter to select the alltoall algorithm
    0: ignore, 1: basic linear, 2: pairwise, 3: modified bruck, 4: linear with sync, 5: two proc only
  coll_tuned_allreduce_algorithm
    Parameter to select the allreduce algorithm
    0: ignore, 1: basic linear, 2: nonoverlapping (tuned reduce + tuned bcast), 3: recursive doubling, 4: ring, 5: segmented ring
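
For illustration, forcing a specific alltoall algorithm with the parameters above (a sketch; whether it helps depends on the application):
$ mpirun --mca coll_tuned_use_dynamic_rules 1 \
         --mca coll_tuned_alltoall_algorithm 2 \
         -np 64 ./a.out                     # 2 = pairwise, per the list above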

science+computing ag, 2013

108

bullx MPI Extensions


Device failover
Change of communication device if timeout occurs

Deadlock detection
Problem: Polling by MPI processes
If there is no activity for a longer time, a deadlock is assumed and the process is put
into micro-sleeps

Generalized Hierarchical Collective


Takes topology into account in collective operations

Warning Data Capture


Gathering unusual events on MPI level
Output in Log
must be enabled explicitly

science+computing ag, 2013

109

science+computing ag, 2013

110

SLURM and Intel MPI


The mpiexec.hydra Command (Hydra Process Manager)
SLURM is supported by the Intel MPI Library 4.0 Update 3 directly
through the Hydra PM.
Use the following command to start an MPI job within an existing SLURM
session:
mpiexec.hydra -bootstrap slurm -n <num_procs> a.out
The srun Command (SLURM, recommended)
This advanced method is supported by the Intel MPI Library 4.0 Update
3. This method is the best integrated with SLURM and supports process
tracking, accounting, task affinity, suspend/resume and other features.
Use the following commands to allocate a SLURM session and start an
MPI job in it, or to start an MPI job within a SLURM session already
created using the sbatch or salloc commands:
Set the I_MPI_PMI_LIBRARY environment variable to point to the SLURM
Process Management Interface (PMI) library:
export I_MPI_PMI_LIBRARY=/path/to/slurm/pmi/library/libpmi.so
Use the srun command to launch the MPI job:
srun -n <num_procs> a.out
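
Putting both steps into a batch script could look like this (a sketch; partition, sizes, module name and the path to libpmi.so are site-specific assumptions):
#!/bin/bash
#SBATCH -p mpi2
#SBATCH -N 2
#SBATCH -n 32
module load intel                                # assumed module providing Intel MPI
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so    # site-specific path to SLURM's PMI library
srun -n 32 ./a.out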
science+computing ag, 2013

111
