High-Performance Computing
OpenPOWER Workshop
CINECA, Italy
July 9, 2020
José Moreira – IBM Research
Many thanks to: Jeff Stuecheli
Edmund Gieske
Brian Thompto
Open architectures for supercomputing
• Since the dawn of supercomputing with the CDC 6600, through the many
generations of Seymour Cray machines, massively parallel processors like Blue
Gene and the current champion Fugaku, high performance computing has been
about three things:
▪ Number crunching: perform as many arithmetic operations as possible
▪ Memory bandwidth: read/write as much data from/to memory as possible
▪ Interconnect: communicate between elements as fast as possible
• IBM servers are now being designed and built around a variety of computing
technologies that IBM has made openly available to the community, including the
three supercomputing technologies listed above
• In this talk, we will focus on recent OpenPOWER developments in those areas
...
Instruction Heritage    Note        # Instr.    Cum. Instr.    Open ISA
POWER (P1)              Base        218         218            Contributing
POWER (P2)              -           6           224            Contributing
PowerPC (P3)            64b         119         343            Contributing
...
PowerPC 2.00 (P4)       -           7           350            Contributing
PowerPC 2.01            -           2           352            Contributing
PowerPC 2.02 (P5)       -           14          366            Contributing
Power ISA 2.03          SIMD-VMX    171         537            Contributing
Power ISA 2.06 (P7)     SIMD-VSX    189         831            Contributing
Power ISA 2.07 (P8)     -           111         942            Contributing
Power ISA 3.0 (P9)      -           231         1173           Compliance
Power ISA 3.1 (P10)     Prefix      246         1419           Compliance
[Figure: ISA timeline with milestones 1997, 2003, 2006, 2017, 2020. Labels: PowerPC, POWER, PPC 64, PPC-E (embedded), PPC 2.00, Power.org, PPC 2.02, Power ISA 2.03 (2006) with embedded features, 2.07 and 2.07B, 3.0 (2017), 3.1 and 3.0C (2020), Open Power ISA with custom extensions]
© 2020 IBM Corporation 3
PowerISA 3.1: Foundation for expansion
Power ISA 3.1 includes a number of new features (see the specification / preface for details):
• General: byte reverse instructions, vector integer multiply/divide/modulo instructions, 128-bit binary
integer operations, set boolean extension, string operations, test LSB by byte operation, VSX scalar
minimum/maximum/compare quad-precision
• SIMD: VSX 32-byte storage access operations, SIMD permute-class operations, bit-manipulation
operations, VSX load/store rightmost element operations, VSX mask manipulation operations, VSX PCV
generate operations for expand/compress
• Translation management extensions
• Copy/paste extensions
• Persistent storage / store synchronization
• Pause / wait-reserve
• Hypervisor interrupt location control
• Matrix math operations (more details later)
• Debug: BHRB filtering updates, multiple DEAW, new performance monitor SPRs, performance monitor
facility sampling security
• Instruction prefix support: 8-byte and modifying opcodes
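[Figure: prefixed (8-byte) instruction layout. Prefix word, bit positions marked 0, 5, 6, 7, 11, 31: PO=1 in bits 0-5, followed by the M0 field. Suffix word, bit positions marked 0, 5, 31: PO in bits 0-5, opening a new suffix opcode space.]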
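The PO=1 marker in the prefix word suggests how a decoder can tell 8-byte instructions from 4-byte ones. The following is a minimal sketch, assuming the instruction stream is available as a list of 32-bit words and using the Power convention that bit 0 is the most significant bit. The function names and sample words are illustrative and are not taken from the specification.

```python
PREFIX_PRIMARY_OPCODE = 1  # PO=1 in bits 0-5 marks a prefix word


def primary_opcode(word: int) -> int:
    """Return bits 0-5 of a 32-bit word (bit 0 is the most significant bit)."""
    return (word >> 26) & 0x3F


def is_prefix(word: int) -> bool:
    """True if the word carries the prefix primary opcode."""
    return primary_opcode(word) == PREFIX_PRIMARY_OPCODE


def instruction_lengths(words):
    """Walk a list of 32-bit words and return each instruction's length in bytes.

    A prefix word and the suffix word that follows it count as one 8-byte
    instruction; any other word is a 4-byte instruction. A stream that ends
    right after a prefix word is not treated specially in this sketch.
    """
    lengths = []
    i = 0
    while i < len(words):
        if is_prefix(words[i]):
            lengths.append(8)
            i += 2  # skip the suffix word
        else:
            lengths.append(4)
            i += 1
    return lengths


# Sample words chosen only for their primary opcode fields (hypothetical stream):
# 0x04000000 has PO=1, 0x38600001 has PO=14, 0x60000000 has PO=24.
stream = [0x04000000, 0x38600001, 0x60000000]
print(instruction_lengths(stream))
```

In this stream the second word is consumed as the suffix of the first, so the walk reports one 8-byte instruction followed by one 4-byte instruction.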
[Figure labels: rank-2 update, rank-8 update]
Instruction    A               X               Y^T             # of madds/instruction
xvf64ger       4 × 2 (fp64)    4 × 1 (fp64)    1 × 2 (fp64)    8
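The shapes in the xvf64ger row describe an outer product: a 4 × 1 column X times a 1 × 2 row Y^T, added into a 4 × 2 accumulator A, for 4 × 2 = 8 multiply-adds. The sketch below models only that arithmetic in plain Python. It does not model register layout, and it ignores which instruction variants initialize the accumulator and which accumulate into it. The function name is hypothetical.

```python
def rank1_update(acc, x, y):
    """Accumulate the outer product x * y^T into acc, in place.

    acc is a list of len(x) rows with len(y) columns each.
    Returns the number of multiply-adds performed.
    """
    madds = 0
    for i in range(len(x)):
        for j in range(len(y)):
            acc[i][j] += x[i] * y[j]
            madds += 1
    return madds


# fp64 shapes from the table: A is 4 x 2, X is 4 x 1, Y^T is 1 x 2.
acc = [[0.0, 0.0] for _ in range(4)]
x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0]
count = rank1_update(acc, x, y)
print(count)
print(acc)
```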
[Figure: accumulator tilings fed by X[0] and X[1]: ACC[0] to ACC[7] (2 × 4), ACC[0] to ACC[3] (2 × 2), and ACC[0], ACC[1] (2 × 1)]
• Tolerates the most latency
• 4 accumulators (8 × 8 result)
• 2 accumulators (8 × 4 result)
▪ 3 VSR loads / 2 xv_ger instructions
▪ 1.5 VSR loads per xv_ger
• 1 accumulator (4 × 4 result)
▪ 2 VSR loads / 1 xv_ger instruction
▪ 2 VSR loads per xv_ger
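The load-to-instruction ratios above follow from simple counting. As a sketch, assume a tile of accumulators is fed by some number of X registers and Y registers, each loaded once per step, and every (X, Y) pair issues one xv_ger. Then loads = X + Y and xv_ger instructions = X × Y. The function and parameter names are hypothetical. The 2 × 2 figure it produces is derived from this counting assumption, not quoted from the slide.

```python
from fractions import Fraction


def loads_per_ger(x_regs: int, y_regs: int) -> Fraction:
    """VSR loads per xv_ger for a tile of x_regs * y_regs accumulators.

    Assumes one VSR load per X register and per Y register in each step,
    and one xv_ger per accumulator in each step.
    """
    loads = x_regs + y_regs
    gers = x_regs * y_regs
    return Fraction(loads, gers)


for x_regs, y_regs in [(2, 2), (2, 1), (1, 1)]:
    ratio = loads_per_ger(x_regs, y_regs)
    print(f"{x_regs * y_regs} accumulator(s): {float(ratio)} VSR loads per xv_ger")
```

Larger tiles reuse each loaded register across more accumulators, which is why the ratio falls as the accumulator count grows.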
The micro-kernel of GEMM: $C_{m \times n} \mathrel{+}= A_{m \times k} \times B_{k \times n}$
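One way to read this micro-kernel is as a sequence of k rank-1 updates: for each index p, the p-th column of A times the p-th row of B is added into C. The sketch below shows that decomposition in plain Python with nested lists. It illustrates the arithmetic only and is not an implementation for the hardware described in this talk. The function name is hypothetical.

```python
def gemm_microkernel(c, a, b):
    """Compute C += A * B in place as k rank-1 (outer-product) updates.

    c is m x n, a is m x k, b is k x n, all as lists of rows.
    """
    m, n, k = len(c), len(c[0]), len(b)
    for p in range(k):
        col = [a[i][p] for i in range(m)]  # p-th column of A
        row = b[p]                         # p-th row of B
        for i in range(m):
            for j in range(n):
                c[i][j] += col[i] * row[j]
    return c


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c = [[1, 0], [0, 1]]
print(gemm_microkernel(c, a, b))
```

Ordering the loops with p outermost means C stays resident while columns of A and rows of B stream past, which is the access pattern an accumulator-based design favors.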
Same Open Memory Interface used for all systems and memory technologies
Primary tier memory options
Memory option      Interface                               Capacity          Bandwidth      System
DDR4 RDIMM         72b, ~2666 MHz bidirectional            ~256 GB           ~150 GB/sec    Same system (DDR4)
DDR4 LRDIMM        72b, ~2666 MHz bidirectional            ~2 TB             ~150 GB/sec    Same system (DDR4)
DDR4 OMI DIMM      8b, ~25G differential; buffered         ~256 GB → 4 TB    ~320 GB/sec    Same system (OMI strategy)
BW Opt OMI DIMM    8b, ~25G differential; 16b; buffered    ~128 → 512 GB     ~650 GB/sec    Same system (OMI strategy)
On Module HBM      1024b, on-module Si interposer          ~16 → 32 GB       ~1 TB/sec      Unique system
OMI DIMM latency: only 5-10 ns higher load-to-use than RDIMM (< 5 ns for LRDIMM)
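The capacity and bandwidth figures above trade off against each other, and a small lookup makes the trade-off concrete. The sketch below encodes the upper capacity bound and approximate bandwidth of each option and picks the highest-bandwidth option that can hold a given working set. The class, field, and function names are hypothetical, TB values are converted at 1 TB = 1024 GB as an assumption, and all numbers are the approximate slide values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryTier:
    name: str
    max_capacity_gb: int   # upper end of the quoted capacity range (approximate)
    bandwidth_gb_s: int    # quoted bandwidth (approximate)


TIERS = [
    MemoryTier("DDR4 RDIMM", 256, 150),
    MemoryTier("DDR4 LRDIMM", 2048, 150),
    MemoryTier("DDR4 OMI DIMM", 4096, 320),
    MemoryTier("BW Opt OMI DIMM", 512, 650),
    MemoryTier("On Module HBM", 32, 1000),
]


def fastest_tier(required_gb):
    """Return the highest-bandwidth tier whose capacity covers required_gb, or None."""
    candidates = [t for t in TIERS if t.max_capacity_gb >= required_gb]
    if not candidates:
        return None
    return max(candidates, key=lambda t: t.bandwidth_gb_s)


for need in (24, 400, 3000):
    tier = fastest_tier(need)
    print(need, "GB ->", tier.name if tier else "no single tier fits")
```

This ranks on bandwidth alone. A real selection would also weigh latency, cost, and whether the option requires a unique system design.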
POWER connectivity variants
                       Direct Attach Memory    OMI Buffered Memory
Max capacity           2 TB                    4 TB
Max bandwidth          150 GB/s                650 GB/s
System interconnect    Limited                 Enhanced
System attach          x24                     x48
Local SMP              2 x                     3 x
Signaling              -                       6x the bandwidth per mm² of DDR signaling
• Technology agnostic
• Low cost
• Ultra-scale system density
• Enterprise reliability
• Low-latency
• High bandwidth
[Figure: OpenCAPI Enabled Processor (application, caches, TL/DL, 25Gb I/O) connected through TLx/DLx to OpenCAPI devices: accelerated functions with device memory (Storage/Compute/Network etc.; ASIC/FPGA/FFSA; FPGA, SOC, GPU accelerator), advanced SCM solutions (load/store or block access), and buffered system memory (OpenCAPI memory buffers, JEDEC DDR DIMM)]