2013-05-30
Jan Wender
Senior IT Consultant
science + computing ag
Founded
1989
Offices
Tübingen
München
Düsseldorf
Berlin
Employees
~300
Turnover 2012
~30 million euros
Shareholder
Bull GmbH
Partners
Daikin, Japan
NICE srl, Italy
IBM Platform
Univa
s+c Customers
Bremen, Hamburg
Wolfsburg
Beelen
Duisburg
Office
Düsseldorf
Office
Berlin
Alzenau
Köln
Aachen
Service location
Frankfurt
Mannheim
Stuttgart
Headquarters
Tübingen
Service location
Ingolstadt
Office
München
s+c Services
Software
Development
Remote
Visualisation
Script
Solutions
InfiniBand
Network
IT Services
Cluster
Management
Distributed Resource
Management
Cluster
Filesystems
9,000 experts recognized worldwide in secure systems
Operating in 50 countries
1.3 bn revenues in 2011
+4.6% growth in 2011
+29% growth in profitability in 1st quarter 2012
+23% efforts in research in 2011
7
Revenue
Sustained
global growth
with a strong
international
focus
[Chart: Revenue split, France vs. rest of world, 2007–2014]
HPC experts
The largest
group of HPC
experts in
Europe
[Chart: Growth of the HPC expert group, 2007–2011]
8
HPC Solutions
Infrastructure
Data center design
Mobile DataCenter
Water-cooling
Servers
Full range development from
ASICs to boards, blades, racks
Support for accelerators
Software Stack
Open, scalable, reliable
Linux, OpenMPI, Lustre, Slurm
Complete administration & monitoring
Expertise
Architecture
Benchmarking
Storage
HPC Cloud
1.25 PetaFlops
140 000+ Xeon cores
256 TB memory
30 PB disk storage
500 GB/s
IO throughput
580 m² footprint
2 PetaFlops
1.5 PetaFlops
360 TB memory
10 PB disk storage
250 GB/s
IO throughput
200 m² footprint
280 TB memory
15 PB disk storage
120 GB/s
IO throughput
200 m² footprint
10
HRSK-II
11
HRSK-II
Phase 1: 2013
Phase 2: 2014+
12
Phase 1
13
Phase 1 Island 1
14
Phase 1 Island 2
15
Phase 1 Island 3
[Diagram: InfiniBand topology — two top-level 36-port switches; leaf 36-port switches 1–3 with 3x IB uplinks each; blade chassis 1–4 with 8 nodes each; two 36-port service switches]
16
Blade naming scheme B _ _ _:
  last digit: 0 for CPU, 5 for accelerators
  middle digit: 0 Nehalem/Westmere, 1 Sandy Bridge, 2 Haswell
  cooling: 5 air, 7 Direct Liquid Cooling

                            Pure CPU    CPU + accelerators
  Air (water-cooled door)   B500        B515
  Direct Liquid Cooling     B710        B715
17
B500
18
7U chassis: LCD unit, CMM, 4x PSU, 18x blades, ESM
19
[Board layout: 2x Westmere-EP with 1U heatsink, fans, 1.8" HDD/SSD, Tylersburg I/O hub with short heatsink, ConnectX QDR, iBMC, ICH10; board dimensions 425 x 143.5 mm]
20
[Block diagram: two Westmere-EP sockets, QPI links (31.2 GB/s; 12.8 GB/s each direction towards the I/O controller), Tylersburg I/O controller, PCIe x8 (4 GB/s) to InfiniBand, GbE, SATA SSD or diskless operation]
21
B710
22
23
[Thermal simulation of the dual-socket node (CPU region)]
24
[Block diagram: bullx B710 — two SandyBridge-EP sockets (QPI 8 GT/s), 4x DDR3@1600 per socket (51.2 GB/s), PCH via DMI, HDD/SSD, ConnectX-3 FDR InfiniBand (56 Gb/s) on PCIe-3 x8, BMC, GbE]

bullx B710
  Proc:     2x SandyBridge-EP (IvyBridge)
  RAM:      8x DDR3@1600 MHz (1866 MHz)
  HD:       2x SATA HDD/SSD 2.5"
  IB:       single ConnectX-3 FDR (56 Gb/s), dual as an option
  Ethernet: 2x GbE
25
B515
26
Available
Q1 2013
Double-width blade
2 x CPUs
2x GPUs/Xeon Phis
27
[Block diagram: bullx B515 — two SandyBridge-EP sockets (QPI 8 GT/s), 3x DDR3@1600 per socket (38.4 GB/s), PCH via DMI, HDD/SSD, ConnectX-3 FDR InfiniBand (56 Gb/s) on PCIe-3 x8, two PCIe-3 x16 accelerator slots (MIC/GPU), BMC, GbE]
28
29
Local Disk
30
31
32
33
SLURM
34
SLURM
History and Facts
Developed at LLNL since 2003
Led by SchedMD since 2011
Multiple enterprises and research centers have been contributing to the project (LANL, CEA, HP, BULL, BSC, etc.)
Large international community: active mailing lists, contributions
Used in more than 30% of the world's largest supercomputers:
Sequoia (IBM), 16.32 petaflop/s, 2nd in 2012
Tianhe-1A (NUDT), 2.5 petaflop/s, 8th in 2012
Curie (BULL), 1.6 petaflop/s, 11th in 2012
35
36
SLURM Architecture
37
SLURM Entities
NUMA boards
Sockets
Cores
Hyperthreads
Memory
Generic Resources (e.g. GPUs)
38
[Diagram: partition "debug" containing Job 1, Job 2, Job 3]
39
[Diagram: Jobs 1–3 in partition "debug" allocated to node tux123, socket 0, cores 0–5]
40
[Diagram: job steps Step 0 and Step 1 within a job in partition "debug", node tux123, socket 0, cores 0–5]
#!/bin/bash
srun -n4 --exclusive a.out &
srun -n2 --exclusive a.out &
wait
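Such a script would typically be submitted with sbatch; the script name and the node/task counts below are placeholders matching the figure:
$ sbatch -N1 -n6 job_steps.sh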
41
Job States
Pending
Configuring
Resizing
Running
Suspended
Completing
When finished:
Cancelled
Preempted
Completed
Failed
TimeOut
NodeFail
42
43
44
SchedMD LLC
http://www.schedmd.com
45
SchedMD LLC
http://www.schedmd.com
46
47
SLURM Commands
48
sinfo Command
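A couple of typical invocations (the partition name is taken from the taurus examples later in this talk):
$ sinfo
$ sinfo -p mpi -l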
49
squeue Command
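Typical invocations (user and partition are placeholders):
$ squeue -u $USER
$ squeue -p mpi --start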
50
sview
51
scontrol Command
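Typical invocations (job id and node name are taken from the interactive examples later in this talk):
$ scontrol show job 19206
$ scontrol show node taurusi3087
$ scontrol update JobId=19206 TimeLimit=02:00:00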
52
scancel Command
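Typical invocations (job id is a placeholder):
$ scancel 19206
$ scancel -u $USER -p mpi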
53
sbcast Command
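Typical use inside an allocation, staging a binary to node-local storage before running it (paths are assumptions):
$ sbcast ./a.out /tmp/a.out
$ srun /tmp/a.out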
54
QOS 1:
Higher Priority,
Higher Limits
QOS 2:
Lower Priority,
Lower Limits
55
Can be exclusive
Show with
56
Partitions on taurus
Name           mpi            mpi2             smp
Nodes          180            270              2
Cores          2160           4320             64
CPU-Type       X5660          E5-2690          E5-4650L
CPU-Freq.      2.8 GHz        2.9 GHz          2.6 GHz
Core/Socket    6              8                8
Socket/Node    2              2                4
Cores/Node     12             16               32
Memory/Node    48 GB          32/64/128 GB     1 TB
57
Interactive Usage 1
$ salloc -p mpi -N 2
salloc: Granted job allocation 19206
$ env | grep SLURM
SLURM_NODELIST=taurusi[3087-3088]
SLURM_NODE_ALIASES=(null)
SLURM_NNODES=2
SLURM_JOBID=19206
SLURM_TASKS_PER_NODE=1(x2)
SLURM_JOB_ID=19206
SLURM_SUBMIT_DIR=/home/h4/bull/jw
SLURM_JOB_NODELIST=taurusi[3087-3088]
SLURM_JOB_CPUS_PER_NODE=1(x2)
SLURM_SUBMIT_HOST=tauruslogin1
SLURM_JOB_NUM_NODES=2
SLURM_CPU_BIND=threads
$ hostname
tauruslogin1
$ srun --label hostname
0: taurusi3087
1: taurusi3088
$
58
Interactive Usage 2
$ srun --label bash
hostname
0: taurusi3087
1: taurusi3088
env | egrep "SLURM_(JOBID|NODEID|PROCID|STEPID)"
0: SLURM_JOBID=19211
0: SLURM_STEPID=1
0: SLURM_NODEID=0
0: SLURM_PROCID=0
1: SLURM_JOBID=19211
1: SLURM_STEPID=1
1: SLURM_NODEID=1
1: SLURM_PROCID=1
exit
$ hostname
tauruslogin1
$ env | grep JOBID
SLURM_JOBID=19206
$ exit
exit
salloc: Relinquishing job allocation 19206
$ hostname
tauruslogin1
$ env | grep JOBID
$
59
Interactive Usage 3
$ salloc -p mpi -n 2
salloc: Granted job allocation 19213
[bull@tauruslogin1 jw 10:00:48 ]
$ srun --label bash
env | egrep "SLURM_(JOBID|NODEID|PROCID|STEPID)"
0: taurusi3087: SLURM_JOBID=19213
0: taurusi3087: SLURM_STEPID=0
0: taurusi3087: SLURM_NODEID=0
0: taurusi3087: SLURM_PROCID=1
1: taurusi3087: SLURM_JOBID=19213
1: taurusi3087: SLURM_STEPID=0
1: taurusi3087: SLURM_NODEID=0
1: taurusi3087: SLURM_PROCID=0
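The same resources can also be requested non-interactively with a batch script (a minimal sketch; partition, sizes and time limit are assumptions):
#!/bin/bash
#SBATCH -p mpi
#SBATCH -N 2
#SBATCH -n 32
#SBATCH --time=00:10:00
srun ./hello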
60
Job Priority
Multifactor Priority Calculation
Priority: 0 .. 4294967295
Factors: Age, Fair-Share, Job size, Partition, QOS (0.0 .. 1.0)
Job_priority = (PriorityWeightAge) * (age_factor) +
(PriorityWeightFairshare) * (fair-share_factor) +
(PriorityWeightJobSize) * (job_size_factor) +
(PriorityWeightPartition) * (partition_factor) +
(PriorityWeightQOS) * (QOS_factor)
$ sprio -l
  JOBID USER  PRIORITY  AGE  FAIRSHARE  JOBSIZE  PARTITION  QOS  NICE
  19200 bull         5    4          0        1          0    0     0
  19207 bull        24    4          0       21          0    0     0

$ sprio -w
  JOBID   PRIORITY   AGE   FAIRSHARE   JOBSIZE
Weights             1000      100000      1000
61
SLURM Associations
$ sacctmgr
      User
----------
      bull

$ sacctmgr
      User
----------
      bull
      bull
      bull
      bull
62
SLURM Accounting
[sacct output for job 19200 (JobName test.job) on partition mpi2, node taurusi1226; steps 19200.batch (batch), 19200.0 (hostname), 19200.1 (cat), 19200.2 (free). All entries COMPLETED with ExitCode 0:0; AllocCPUS 16 for the job and 1 per step; Elapsed 00:00:00; NTasks 1 per step; VMSize/RSS/paging/disk-I/O/energy counters all 0]
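A narrower query using only a few of the fields above is often more readable (field list is an example):
$ sacct -j 19200 --format=JobID,JobName,Partition,AllocCPUS,Elapsed,State,ExitCode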
63
64
65
66
67
68
MPI Support
69
Translation aid
http://www.schedmd.com/slurmdocs/rosetta.pdf
70
Credits
71
72
73
                     mpi              mpi2              smp
Name                 Westmere-EP      SandyBridge-EP    SandyBridge-EP
CPU-Type             X5660            E5-2690           E5-4650L
CPU-Freq.            2.8 GHz          2.9 GHz           2.6 GHz
Max. Turbo-Freq.     3.2 GHz          3.8 GHz           3.1 GHz
Core/Socket          6                8                 8
Max Sockets/Node     2                2                 4
Cores/Node           12               16                32
Cache                12 MB            20 MB             20 MB
Memory Channels      3                4                 4
Max. Memory BW       32 GB/s          51.2 GB/s         51.2 GB/s
Extensions           SSE4.2           AVX               AVX
QPI links            2                2                 2
QPI Speed            6.4 GT/s         8 GT/s            8 GT/s
Max. TDP             95 W             135 W             115 W
74
75
Westmere: -xSSE4.2
SandyBridge: -xAVX
Mixed executable: -msse4.2 -axAVX
http://software.intel.com/sites/default/files/Compiler_QRG_2013.pdf
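Example compile lines (a sketch; source file name and optimization level are assumptions):
$ icc -O2 -xSSE4.2 -o prog prog.c          # Westmere-only binary
$ icc -O2 -xAVX -o prog prog.c             # SandyBridge-only binary
$ icc -O2 -msse4.2 -axAVX -o prog prog.c   # mixed binary with an AVX code path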
76
Bullx scs AE
77
78
Bullx DE
Installationspfad /opt/bullxde
Zugriff via modules
module avail
module load bullxde
module unload
--------------- /opt/bullxde/modulefiles/debuggers -------------padb/3.2
-------------- /opt/bullxde/modulefiles/utils ------------------OTF/1.8
-------------- /opt/bullxde/modulefiles/profilers --------------hpctoolkit/4.9.9_3111_Bull.2
-------------- /opt/bullxde/modulefiles/perftools --------------bpmon/1.0_Bull.1.20101208 papi/4.1.1_Bull.2
ptools/0.10.4_Bull.4.20101203
------------ /opt/bullxde/modulefiles/mpicompanions ------------boost-mpi/1.44.0 mpianalyser/1.1.4 scalasca/1.3.
79
bullx DE Tools
80
padb
Job inspection tool for collecting stack traces
http://padb.pittman.org.uk/
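Typical invocations (a sketch based on the padb documentation; the job id is a placeholder):
$ padb --show-jobs                      # list active jobs
$ padb --stack-trace --tree <jobid>     # merged stack trace of all ranks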
81
bullxprof
Lightweight profiling tool
Usage
module load bullxprof/<version>
module load papi
module load bullxmpi
Configuration
Config file only
app.modules.excluded
app.functions.excluded
bullxprof.io.functions
bullxprof.mpi.functions
bullxprof.papi.counters
bullxprof.debug=0|1|2|3
bullxprof.smartdisplay
82
bullxprof
bullxprof [ -l ] [ -s ] [ -e experiments ] [ -d debuglevel ] [ -t tracelevel ] program [prog-arguments]

Options: -l, -s, -e <experiments>, -d <debuglevel>, -t <tracelevel>
83
MPI Analyser
MPI profiling tool, non-invasive
Logging during execution
Post-mortem analysis
Analyses:
Communication
point-to-point messages, number and size
collective messages, number and size
Execution time
84
85
86
Scalasca
Scalable Performance Analysis of Large-Scale Applications
http://www.scalasca.org/
Documentation: /opt/bullxde/mpicompanions/scalasca/doc
87
Scalasca
Inside Scalasca
OPARI (OpenMP and user region instrumentation tool)
EPIK (measurement library for summarization & tracing)
EARL (serial event-trace analysis library)
EXPERT (serial automatic performance analyzer)
PEARL (parallel event-trace analysis library)
SCOUT (parallel automatic performance analyzer)
CUBE3 (analysis report presentation component & utilities)
88
89
90
xPMPI
Framework for using the MPI profiling layer PMPI
Invoked via modules:
/opt/bullxde/mpicompanions/xPMPI/etc/xpmpi.conf
export PNMPI_CONF=<path to user defined configuration file>
Supported tools:
MPI Analyser
see above
IPM
mpiP
MPIPROF
91
PAPI
Performance API
Foundation for performance analysis tools
Standard definitions for metrics
Standard-API
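The preset counters available on a node can be listed with the PAPI utilities (module name taken from the bullx DE listing above):
$ module load papi
$ papi_avail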
92
bpmon
Command-line tool for single-node runs
Uses the PAPI interface
Example invocation:
bpmon -e INSTRUCTIONS_RETIRED,LLC_MISSES,MEM_LOAD_RETIRED:L3_MISS,MEM_UNCORE_RETIRED:LOCAL_DRAM,MEM_UNCORE_RETIRED:REMOTE_DRAM /opt/hpctk/test_cases/llclat -S -l 4 -i 256 -r 200 -o r
93
HPCToolkit
http://hpctoolkit.org
Bull-extended version
History, viewer, wrapper
See the bullx DE documentation
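The usual HPCToolkit workflow looks roughly like this (a sketch; event and binary name are assumptions):
$ hpcrun -e PAPI_TOT_CYC ./a.out
$ hpcstruct ./a.out
$ hpcprof -S a.out.hpcstruct hpctoolkit-a.out-measurements*
$ hpcviewer hpctoolkit-a.out-database*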
94
HPCToolkit
95
HPCToolkit
96
HPCToolkit
97
Open|SpeedShop
http://www.openspeedshop.org
Goal: ease of use, modularity, extensibility
Scope:
Sampling Experiments
Support for Callstack Analysis
Hardware Performance Counters
MPI Profiling and Tracing
I/O Profiling and Tracing
Floating Point Exception Analysis
See the tutorials
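Example of a sampling experiment with the convenience scripts (a sketch; the process count is an assumption):
$ osspcsamp "srun -n 32 ./a.out"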
98
Darshan
Records an application's HPC I/O
http://www.mcs.anl.gov/research/projects/darshan/
Recompilation required
mpicc.darshan, mpiCC.darshan, mpif77.darshan, mpif90.darshan
darshan-parser <Logfile>
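Sketch of the workflow (the log file is written by Darshan at job end; its name is site-specific):
$ mpicc.darshan -O2 -o my_app my_app.c
$ darshan-parser <Logfile> > my_app_io.txt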
99
bullx MPI
100
Bullx MPI
bullx MPI based on Open MPI
http://www.openmpi.org
Compilation
mpicc, mpiCC, mpif77, mpif90
Uses the Intel compilers by default
GNU compilers via environment variables:
OMPI_FC=gfortran, OMPI_F77=gfortran, OMPI_CC=gcc, OMPI_CXX=g++
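For example, to build once with the GNU toolchain (a sketch):
$ export OMPI_CC=gcc OMPI_CXX=g++ OMPI_F77=gfortran OMPI_FC=gfortran
$ mpicc -O2 -o hello hello.c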
101
srun
SLURM allocates the nodes and places the ranks
BullxMPI options must be set with environment variables
Syntax
102
Basic examples
salloc --exclusive -N 2 -n 32 -p 424E3 mpirun ./hello

srun --resv-ports --cpu_bind=cores --distribution=block:block \
  -N 2 -n 32 -p 424E3 ./hello
103
104
Stripe count: 10, 20, 30, or 40
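If these are Lustre stripe counts, they can also be set manually per directory with the lfs tool (an assumption; path and count are placeholders):
$ lfs setstripe -c 20 /scratch/myproject/results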
105
Automatically
When calling
106
Communication device
btl: self, sm (shared memory), openib (InfiniBand), tcp (Ethernet)
Exclude
btl_openib_use_eager_rdma
use RDMA for eager messages
btl_openib_eager_limit
size of eager messages
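These Open MPI MCA parameters are typically passed through the environment when launching with srun (values are examples):
$ export OMPI_MCA_btl=self,sm,openib
$ export OMPI_MCA_btl_openib_eager_limit=32768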
107
coll_tuned_alltoall_algorithm
coll_tuned_allreduce_algorithm
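Example of forcing a specific collective algorithm via the environment (values are examples; 0 selects the default decision logic):
$ export OMPI_MCA_coll_tuned_use_dynamic_rules=1
$ export OMPI_MCA_coll_tuned_alltoall_algorithm=2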
108
Deadlock detection
Problem: polling by MPI processes
If there is no activity for a longer time, a deadlock is assumed and the process is put into micro-sleeps
109
110
111