
OpenPOWER Innovations for

High-Performance Computing
OpenPOWER Workshop
CINECA, Italy
July 9, 2020
José Moreira – IBM Research
Many thanks to: Jeff Stuecheli, Edmund Gieske, Brian Thompto
Open architectures for supercomputing
• Since the dawn of supercomputing with the CDC 6600, through the many
generations of Seymour Cray machines, massively parallel processors like Blue
Gene and the current champion Fugaku, high performance computing has been
about three things:
▪ Number crunching: perform as many arithmetic operations as possible
▪ Memory bandwidth: read/write as much data from/to memory as possible
▪ Interconnect: communicate between elements as fast as possible

• IBM servers are now being designed and built around a variety of computing
technologies that IBM has made openly available to the community, including the
three supercomputing technologies listed above
• In this talk, we will focus on recent OpenPOWER developments in those areas

© 2020 IBM Corporation 2


Power ISA – foundation of ecosystem

• Abbreviated lineage of Power ISA
− Greater than 30 years of innovation and a developed ecosystem
− Instruction heritage shown for Power ISA 3.1

[Figure: timeline of the Power ISA lineage – America (1985) and IBM server POWER (1990), POWER2 and PowerPC, the 64-bit PPC and PowerPC Embedded lines (1997), Power.org and PPC 2.02 (2003), Power ISA 2.03 with embedded features (2006), 2.07/2.07B, the Open Power ISA 3.0 (2017), and 3.1 / 3.0C with custom extensions (2020).]

Instruction Heritage | Note     | # Instr. | Cum. Instr. | Open ISA
POWER (P1)           | Base     | 218      | 218         | Contributing
POWER (P2)           |          | 6        | 224         | Contributing
PowerPC (P3)         | 64b      | 119      | 343         | Contributing
PowerPC 2.00 (P4)    |          | 7        | 350         | Contributing
PowerPC 2.01         |          | 2        | 352         | Contributing
PowerPC 2.02 (P5)    |          | 14       | 366         | Contributing
Power ISA 2.03       | SIMD-VMX | 171      | 537         | Contributing
Power ISA 2.05 (P6)  |          | 105      | 642         | Contributing
Power ISA 2.06 (P7)  | SIMD-VSX | 189      | 831         | Contributing
Power ISA 2.07 (P8)  |          | 111      | 942         | Contributing
Power ISA 3.0 (P9)   |          | 231      | 1173        | Compliance
Power ISA 3.1 (P10)  | Prefix   | 246      | 1419        | Compliance
Power ISA 3.1: Foundation for expansion
Power ISA 3.1 includes a number of new features (see the specification preface for details):

• General: byte reverse instructions, vector integer multiply/divide/modulo instructions, 128-bit binary
integer operations, set boolean extension, string operations, test LSB by byte operation, VSX scalar
minimum/maximum/compare quad-precision
• SIMD: VSX 32-byte storage access operations, SIMD permute-class operations, bit-manipulation
operations, VSX load/store rightmost element operations, VSX mask manipulation operations, VSX PCV
generate operations for expand/compress
• Translation management extensions
• Copy/paste extensions
• Persistent storage / store synchronization
• Pause / wait-reserve
• Hypervisor interrupt location control
• Matrix math operations (more details later)
• Debug: BHRB filtering updates, multiple DEAW, new performance monitor SPRs, performance monitor
facility sampling security
• Instruction prefix support: 8-byte and modifying opcodes



Power ISA 3.1 prefix architecture

[Figure: prefixed instruction format – a 4-byte prefix word with primary opcode PO=1 in bits 0–5 and form/modifier bits starting at bit 6, followed by a 4-byte suffix word whose primary opcode selects the new suffix opcode space.]
• Prefix architecture, primary opcode=1


▪ RISC-friendly variable length instructions:
o New 8-byte instruction space lays the foundation for future ISA expansion
o Always 4-byte instruction alignment
▪ Two forms: modifying (M=1) and 8-byte opcode (M=0)
o Modifying: prefix extends function of existing instructions
o 8-byte opcode: provides new opcode space for expansion, multi-operand instructions, etc.
▪ PC-relative addressing: reduced path-length with new Power ABI support
▪ MMA lane masking: mask by lane for MMA operations
▪ Additional instructions and capabilities
• Generous room for expanded capabilities including opcode space and modifiers
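As a rough illustration of the 8-byte format, the sketch below splits a 34-bit immediate across prefix and suffix fields the way the prefixed paddi instruction does (an upper 18-bit field in the prefix, a lower 16-bit field in the suffix). The function names are ours, and the 18/16 field split is the only ISA detail assumed:

```python
def split_si34(value):
    """Split a signed 34-bit immediate into the 18-bit prefix field
    and 16-bit suffix field, as in prefixed paddi."""
    assert -(1 << 33) <= value < (1 << 33), "must fit in 34 bits"
    u = value & ((1 << 34) - 1)   # two's-complement, 34 bits
    si0 = u >> 16                 # upper 18 bits -> prefix word
    si1 = u & 0xFFFF              # lower 16 bits -> suffix word
    return si0, si1

def join_si34(si0, si1):
    """Reassemble and sign-extend the 34-bit immediate."""
    u = (si0 << 16) | si1
    return u - (1 << 34) if u & (1 << 33) else u

# Round trip: negative and large positive values survive the split.
assert join_si34(*split_si34(-5)) == -5
assert join_si34(*split_si34(0x1_2345_6789)) == 0x1_2345_6789
```

The payoff of the prefix word is visible here: a plain D-form instruction only has room for the 16-bit suffix field, while the prefix contributes 18 more immediate bits.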



Matrix-Multiply Assist (MMA) instructions
• The latest version of Power ISA (for POWER10) is now publicly available
• The Matrix-Multiply Assist instructions lead to very efficient implementations for
key algorithms in technical computing, machine learning, deep learning and
business analytics
• These instructions are a natural match for implementing dense numerical linear
algebra computations
• We have also shown application to other computations such as convolution
• Various other computations require additional work and research, including
arbitrary precision arithmetic, discrete Fourier transform, …



Power ISA Vector-Scalar Registers (VSRs)

[Figure: the 64 128-bit vector-scalar registers VSR[0:63]; VSR[0:31] overlay the floating-point registers and VSR[32:63] overlay the vector registers.]



Accumulators
• Accumulators are 4 × 4 arrays of 32-bit elements (we will briefly mention the
64-bit extension later):

      | a00 a01 a02 a03 |
  A = | a10 a11 a12 a13 |
      | a20 a21 a22 a23 |
      | a30 a31 a32 a33 |

• The elements can be either 32-bit signed integers (int32) or 32-bit single-precision floating-point numbers (fp32)
• Each accumulator is a software-managed shadow of a set of 4 consecutive
VSRs (8 architected accumulators – ACC[0:7]); software must choose between
using an accumulator or its associated VSRs
• State must be explicitly transferred between accumulators and VSRs using
VSX Move From Accumulator (xxmfacc) and VSX Move To Accumulator
(xxmtacc)
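A toy Python model of the accumulator/VSR relationship described above; the class and method names are illustrative, and only the xxmtacc/xxmfacc semantics come from the architecture:

```python
class Accumulator:
    """Toy model: an accumulator shadows 4 consecutive VSRs.
    State moves only via explicit xxmtacc/xxmfacc analogues."""
    def __init__(self, vsrs, base):
        self.vsrs, self.base = vsrs, base   # shadows VSR[base : base+4]
        self.rows = None                    # 4 rows of 4 elements

    def xxmtacc(self):
        # prime: copy VSR contents into the accumulator
        self.rows = [list(self.vsrs[self.base + i]) for i in range(4)]

    def xxmfacc(self):
        # de-prime: copy accumulator contents back to the VSRs
        for i in range(4):
            self.vsrs[self.base + i] = list(self.rows[i])

vsrs = [[0.0] * 4 for _ in range(64)]   # 64 VSRs, 4 fp32 lanes each
acc0 = Accumulator(vsrs, base=0)        # ACC[0] shadows VSR[0:3]
acc0.xxmtacc()
acc0.rows[2][3] = 7.0                   # update lives only in the accumulator
assert vsrs[2][3] == 0.0                # ...not yet visible in the VSRs
acc0.xxmfacc()
assert vsrs[2][3] == 7.0                # visible after the explicit move
```

The assertions capture the key rule: while an accumulator is primed, its associated VSRs hold stale data until an explicit xxmfacc.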
Outer-product (xv<type>ger<rank-𝑘>) instructions
• Accumulators are updated by rank-𝑘 update instructions:
• Input: 1 accumulator (𝐴) + 2 VSRs (𝑋, 𝑌)
• Output: 1 accumulator (same as input to reduce instruction encoding space)
• Operation: A ← ±A ± XY^T
• For 32-bit data, 𝑋 and 𝑌 are 4 × 1 arrays of elements
• For 16-bit data, 𝑋 and 𝑌 are 4 × 2 arrays of elements
• For 8-bit data, 𝑋 and 𝑌 are 4 × 4 arrays of elements
• For 4-bit data, 𝑋 and 𝑌 are 4 × 8 arrays of elements
• This way XY^T always has a 4 × 4 shape, compatible with the accumulator
Instruction | A            | X            | Y^T          | # of madds/instruction
xvf32ger    | 4 × 4 (fp32) | 4 × 1 (fp32) | 1 × 4 (fp32) | 16
xvf16ger2   | 4 × 4 (fp32) | 4 × 2 (fp16) | 2 × 4 (fp16) | 32
xvi16ger2   | 4 × 4 (int32)| 4 × 2 (int16)| 2 × 4 (int16)| 32
xvi8ger4    | 4 × 4 (int32)| 4 × 4 (int8) | 4 × 4 (int8) | 64
xvi4ger8    | 4 × 4 (int32)| 4 × 8 (int4) | 8 × 4 (int4) | 128
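The semantics of these rank-k updates can be sketched in a few lines of Python; xvger below is an illustrative model rather than an ISA mnemonic, and the negation controls (±) are omitted:

```python
def xvger(acc, x, y, rank):
    """A <- A + X Y^T, where X and Y are 4 x rank,
    so X Y^T is always 4 x 4 (a rank-k update)."""
    for i in range(4):
        for j in range(4):
            acc[i][j] += sum(x[i][k] * y[j][k] for k in range(rank))
    return acc

# rank-1 update (xvf32ger-style, 32-bit inputs): X, Y are 4 x 1
A = [[0.0] * 4 for _ in range(4)]
X = [[1.0], [2.0], [3.0], [4.0]]
Y = [[10.0], [20.0], [30.0], [40.0]]
xvger(A, X, Y, rank=1)
assert A[1][2] == 2.0 * 30.0      # a_ij = x_i * y_j

# rank-4 update (xvi8ger4-style, 8-bit inputs): X, Y are 4 x 4,
# four products accumulated into each 32-bit element
B = [[0] * 4 for _ in range(4)]
xvger(B, [[1, 2, 3, 4]] * 4, [[1, 1, 1, 1]] * 4, rank=4)
assert B[0][0] == 1 + 2 + 3 + 4
```

Note how narrower input types raise the rank, not the output shape: halving the element width doubles the number of multiply-adds folded into each accumulator element.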
Rank-k update

[Figure: shapes of the rank-k update for each input width — 32-bit elements: rank-1 update, (4 × 4) ← ±(4 × 4) ± (4 × 1) × (1 × 4); 16-bit elements: rank-2 update, (4 × 4) ← ±(4 × 4) ± (4 × 2) × (2 × 4); 8-bit elements: rank-4 update, (4 × 4) ← ±(4 × 4) ± (4 × 4) × (4 × 4); 4-bit elements: rank-8 update, (4 × 4) ← ±(4 × 4) ± (4 × 8) × (8 × 4).]


Extension to 64-bit
• Accumulators are 4 × 2 arrays of 64-bit floating-point elements:

      | a00 a01 |
  A = | a10 a11 |
      | a20 a21 |
      | a30 a31 |
• Accumulators are updated by outer-product instructions:
▪ Input: 1 accumulator (A) + 3 VSRs (X1, X2, Y)
▪ Output: 1 accumulator (same as input to reduce instruction encoding space)
▪ Operation: A ← A + [X1; X2] Y^T
▪ X1, X2 and Y are 2 × 1 arrays of elements
▪ [X1; X2] Y^T always has a 4 × 2 shape, compatible with the accumulator

Instruction | A            | [X1; X2]     | Y^T          | # of madds/instruction
xvf64ger    | 4 × 2 (fp64) | 4 × 1 (fp64) | 1 × 2 (fp64) | 8
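A matching sketch for the 64-bit form, again as an illustrative model rather than the ISA mnemonic:

```python
def xvf64ger_model(acc, x1, x2, y):
    """A (4 x 2) <- A + [X1; X2] Y^T, where X1, X2, Y are 2 x 1 (fp64)."""
    x = x1 + x2                  # stack the two VSRs into a 4 x 1 column
    for i in range(4):
        for j in range(2):
            acc[i][j] += x[i] * y[j]
    return acc

A = [[0.0, 0.0] for _ in range(4)]
xvf64ger_model(A, [1.0, 2.0], [3.0, 4.0], [10.0, 100.0])
assert A == [[10.0, 100.0], [20.0, 200.0], [30.0, 300.0], [40.0, 400.0]]
```

The extra input VSR exists because fp64 elements are twice as wide: two 128-bit VSRs are needed to supply the 4 × 1 column that the 32-bit forms get from one.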



Load pressure and unit latency (32-bit results)
• 8 accumulators (8 × 16 result):
▪ 6 VSR loads / 8 xv_ger instructions = 0.75 VSR load per xv_ger
▪ Tolerates the most latency
• 4 accumulators (8 × 8 result):
▪ 4 VSR loads / 4 xv_ger instructions = 1 VSR load per xv_ger
▪ Could work well in SMT modes
• 2 accumulators (8 × 4 result):
▪ 3 VSR loads / 2 xv_ger instructions = 1.5 VSR loads per xv_ger
• 1 accumulator (4 × 4 result):
▪ 2 VSR loads / 1 xv_ger instruction = 2 VSR loads per xv_ger

[Figure: accumulator tilings — the 8 × 16 case uses ACC[0:7] arranged 2 × 4, fed by X[0], X[1] and Y[0:3]; the 8 × 8 case uses ACC[0:3] arranged 2 × 2, fed by X[0], X[1] and Y[0:1]; the 8 × 4 case uses ACC[0:1], fed by X[0], X[1] and Y[0].]
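The four load-to-compute ratios above all follow from one formula: an r × c grid of accumulators needs r X-tile loads and c Y-tile loads per step of the inner loop, but issues r × c xv_ger instructions. A quick check, with the function name ours:

```python
def loads_per_xvger(acc_rows, acc_cols):
    """VSR loads per xv_ger for an (acc_rows*4) x (acc_cols*4) virtual
    accumulator: one load per X tile and per Y tile, one xv_ger per
    accumulator, each inner-loop step."""
    loads = acc_rows + acc_cols          # X tiles + Y tiles
    instructions = acc_rows * acc_cols   # one xv_ger per accumulator
    return loads / instructions

assert loads_per_xvger(2, 4) == 0.75    # 8 accumulators, 8 x 16 result
assert loads_per_xvger(2, 2) == 1.0     # 4 accumulators, 8 x 8 result
assert loads_per_xvger(2, 1) == 1.5     # 2 accumulators, 8 x 4 result
assert loads_per_xvger(1, 1) == 2.0     # 1 accumulator, 4 x 4 result
```

Because loads grow as r + c while compute grows as r × c, the largest tiling that fits in the 8 architected accumulators minimizes load pressure and best hides load latency.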
The micro-kernel of GEMM: C (m × n) += A (m × k) × B (k × n)



SGEMM micro-kernel using 8 × 8 virtual accumulator
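A minimal Python sketch of that micro-kernel, modeling the 8 × 8 virtual accumulator as a 2 × 2 grid of 4 × 4 tiles, each receiving one rank-1 (xvf32ger-style) update per step of the k loop; function and variable names are ours:

```python
def sgemm_microkernel(A, B, C, K):
    """C (8 x 8) += A (8 x K) x B (K x 8), as an 8 x 8 virtual accumulator:
    four 4 x 4 tiles, one rank-1 outer-product update per tile per k step."""
    for k in range(K):
        x = [A[i][k] for i in range(8)]   # 8 x 1 column of A (2 VSR loads)
        y = [B[k][j] for j in range(8)]   # 1 x 8 row of B (2 VSR loads)
        for ti in (0, 4):                 # 2 x 2 grid of 4 x 4 tiles
            for tj in (0, 4):             # (4 xv_ger-style updates per k)
                for i in range(4):
                    for j in range(4):
                        C[ti + i][tj + j] += x[ti + i] * y[tj + j]
    return C

# Validate against a straightforward triple-loop reference.
K = 3
A = [[i + k for k in range(K)] for i in range(8)]
B = [[k * j for j in range(8)] for k in range(K)]
C = [[0] * 8 for _ in range(8)]
sgemm_microkernel(A, B, C, K)
ref = [[sum(A[i][k] * B[k][j] for k in range(K)) for j in range(8)]
       for i in range(8)]
assert C == ref
```

Each k step loads 4 VSRs and issues 4 accumulator updates, matching the 1-load-per-xv_ger ratio of the 4-accumulator case on the previous slide.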



POWER family memory architecture

• Scale Up – Buffered Memory: superior RAS, high bandwidth, high capacity; agnostic interface for alternate memory innovations
• Scale Out – Direct Attach Memory: low latency access; commodity packaging form factor
• OpenCAPI Agnostic Buffered Memory (OMI) spans the memory tiers:
▪ Near Tier: extreme bandwidth, lower capacity
▪ Commodity: low latency, low cost
▪ Enterprise: RAS, capacity, bandwidth
▪ Storage Class: extreme capacity, persistence
• The same Open Memory Interface is used for all systems and memory technologies
Primary tier memory options

• DDR4 RDIMM (72b, ~2666 MHz bidirectional): capacity ~256 GB, BW ~150 GB/s
• DDR4 LRDIMM (72b, ~2666 MHz bidirectional): capacity ~2 TB, BW ~150 GB/s
▪ RDIMM and LRDIMM fit in the same system
• DDR4 OMI DIMM (buffered, 8b ~25G differential): capacity ~256 GB → 4 TB, BW ~320 GB/s
▪ Only 5–10 ns higher load-to-use than RDIMM (< 5 ns vs. LRDIMM)
• Bandwidth-optimized OMI DIMM (buffered, 16b ~25G differential): capacity ~128 → 512 GB, BW ~650 GB/s
▪ Both OMI DIMM options fit in the same system – the OMI strategy
• On-module HBM (1024b, Si interposer): capacity ~16 → 32 GB, BW ~1 TB/s; unique system
POWER connectivity variants

Direct Attach Memory:
• Max capacity: 2 TB; max bandwidth: 150 GB/s
• 4 × DDR4 memory; x24 system attach
• Limited system interconnect: 2 × local SMP

OMI Buffered Memory:
• Max capacity: 4 TB; max bandwidth: 650 GB/s
• 8 × OMI buffered memory; x48 system attach
• Enhanced system interconnect: 3 × local SMP
• Advanced interconnect signaling: 6× the bandwidth per mm² of DDR signaling


DRAM DIMM comparison

[Figure, approximate scale: IBM Centaur DIMM, JEDEC DDR DIMM, and OMI DDIMM side by side.]

OMI DDIMM attributes:
• Technology agnostic
• Low cost
• Ultra-scale system density
• Enterprise reliability
• Low latency
• High bandwidth
OpenCAPI key attributes

[Figure: an OpenCAPI-enabled processor with TL/DL over 25Gb I/O connecting to accelerated OpenCAPI devices (FPGA/SoC/GPU accelerators running accelerated functions with device memory and caches, via TLx/DLx), OpenCAPI memory buffers providing buffered system memory, and advanced SCM solutions — storage/compute/network, ASIC/FPGA/FFSA, load/store or block access — all mapped into the application's address space.]

1. Architecture agnostic bus – Applicable with any system/microprocessor architecture


2. Optimized for high bandwidth and low latency
3. High performance 25Gbps PHY design
4. Coherency - Attached devices operate natively within application’s user space and coherently with host microprocessor
5. Virtual addressing enables low overhead with no kernel, hypervisor or firmware involvement; security benefit
6. Wide range of use cases and access semantics
7. CPU coherent device memory (Home Agent Memory)
8. Architected for both classic memory and emerging advanced storage class memory
9. Minimal OpenCAPI design overhead (Less than 5% of a Xilinx VU3P FPGA)



OpenCAPI adapters

Mellanox Innova2 — Network + FPGA:
• Xilinx US+ KU15P FPGA
• Mellanox CX5 NIC
• 16 GB DDR4
• 2 × 25Gb SFP cages
• x8 25 Gb/s OpenCAPI support
• Use cases: network acceleration (NFV, packet classification), security acceleration

Nallatech 250-SoC — Multipurpose converged network/storage:
• Xilinx Zynq US+ ZU19EG FPGA
• 8/16 GB DDR4, 4/8 GB DDR4 ARM
• PCIe Gen3 x16, CAPI2
• 4 × x8 Oculink ports supporting NVMe, network, or OpenCAPI
• Use cases: NVMe-oF target, high-BW storage server

AlphaData ADM-9H3 — Medium FPGA with 8GB HBM:
• Xilinx Virtex US+ VU33P-3 FPGA + HBM
• 8 GB High Bandwidth Memory
• PCIe Gen4 x8 or Gen3 x16, CAPI2
• 1 × x8 25 Gb/s OpenCAPI port
• 1 × 2x100Gb QSFP28-DD cage, 2 × 100Gb QSFP28 cages
• Use cases: ML/DL, inference, system modeling, HPC

AlphaData ADM-9H7 — Large FPGA with 8GB HBM:
• Xilinx US+ VU37P FPGA + HBM
• 8 GB High Bandwidth Memory
• PCIe Gen4 x8 or Gen3 x16, CAPI2
• 2 × x8 25 Gb/s OpenCAPI ports (support up to 50 GB/s)
• 4 × 100Gb QSFP28 cages
• Use cases: ML/DL, inference, system modeling, HPC


Conclusions
• OpenPOWER delivers the three essential technologies of supercomputing:
▪ Number crunching: Through its SIMD and MMA instructions
▪ Memory bandwidth and capacity: Through OMI
▪ System interconnect: Through OpenCAPI
• The MMA instructions provide a new level of performance for dense linear
algebra and related computations
• OMI provides a new level of memory bandwidth for computing systems while
delivering low cost, versatility and capacity
• OpenCAPI provides an enhanced system interconnect for acceleration and
additional functionality
• All three are scalable, offering room to grow in all dimensions
• Together with Open Source software, we see a clear road ahead for high-
performance computing systems that combine the best innovation from the
community!

