
OpenPOWER Innovations for

High-Performance Computing
OpenPOWER Workshop
CINECA, Italy
July 9, 2020
José Moreira – IBM Research
Many thanks to: Jeff Stuecheli, Edmund Gieske, Brian Thompto
Open architectures for supercomputing
• Since the dawn of supercomputing with the CDC 6600, through the many
generations of Seymour Cray machines, massively parallel processors like Blue
Gene and the current champion Fugaku, high performance computing has been
about three things:
▪ Number crunching: perform as many arithmetic operations as possible
▪ Memory bandwidth: read/write as much data from/to memory as possible
▪ Interconnect: communicate between elements as fast as possible

• IBM servers are now being designed and built around a variety of computing
technologies that IBM has made openly available to the community, including the
three supercomputing technologies listed above
• In this talk, we will focus on recent OpenPOWER developments in those areas

© 2020 IBM Corporation 2


Power ISA – foundation of ecosystem

• Abbreviated lineage of Power ISA
− Greater than 30 years of innovation and a developed ecosystem
− Instruction heritage shown for Power ISA 3.1

[Figure: timeline of the Power ISA lineage – America (1985) and IBM server POWER (1990), POWER2 and PowerPC, the 64-bit PPC and PowerPC Embedded lines (1997), Power.org and PPC 2.02 (2003), Power ISA 2.03 with embedded features (2006), 2.07/2.07B, the Open Power ISA 3.0 (2017), and 3.1 / 3.0C with custom extensions (2020).]

Instruction Heritage | Note     | # Instr. | Cum. Instr. | Open ISA
POWER (P1)           | Base     | 218      | 218         | Contributing
POWER (P2)           |          | 6        | 224         | Contributing
PowerPC (P3)         | 64b      | 119      | 343         | Contributing
PowerPC 2.00 (P4)    |          | 7        | 350         | Contributing
PowerPC 2.01         |          | 2        | 352         | Contributing
PowerPC 2.02 (P5)    |          | 14       | 366         | Contributing
Power ISA 2.03       | SIMD-VMX | 171      | 537         | Contributing
Power ISA 2.05 (P6)  |          | 105      | 642         | Contributing
Power ISA 2.06 (P7)  | SIMD-VSX | 189      | 831         | Contributing
Power ISA 2.07 (P8)  |          | 111      | 942         | Contributing
Power ISA 3.0 (P9)   |          | 231      | 1173        | Compliance
Power ISA 3.1 (P10)  | Prefix   | 246      | 1419        | Compliance
Power ISA 3.1: Foundation for expansion
Power ISA 3.1 includes a number of new features (see the specification preface for details):

• General: byte reverse instructions, vector integer multiply/divide/modulo instructions, 128-bit binary
integer operations, set boolean extension, string operations, test LSB by byte operation, VSX scalar
minimum/maximum/compare quad-precision
• SIMD: VSX 32-byte storage access operations, SIMD permute-class operations, bit-manipulation
operations, VSX load/store rightmost element operations, VSX mask manipulation operations, VSX PCV
generate operations for expand/compress
• Translation management extensions
• Copy/paste extensions
• Persistent storage / store synchronization
• Pause / wait-reserve
• Hypervisor interrupt location control
• Matrix math operations (more details later)
• Debug: BHRB filtering updates, multiple DEAW, new performance monitor SPRs, performance monitor
facility sampling security
• Instruction prefix support: 8-byte and modifying opcodes



Power ISA 3.1 prefix architecture

[Figure: prefixed instruction format – a 4-byte prefix word with primary opcode PO=1 in bits 0–5 and form/modifier bits starting at bit 6, followed by a 4-byte suffix word whose primary opcode selects the new suffix opcode space.]
• Prefix architecture, primary opcode=1


▪ RISC-friendly variable length instructions:
o New 8-byte instruction space lays the foundation for future ISA expansion
o Always 4-byte instruction alignment
▪ Two forms: modifying (M=1) and 8-byte opcode (M=0)
o Modifying: prefix extends function of existing instructions
o 8-byte opcode: provides new opcode space for expansion, multi-operand instructions, etc.
▪ PC-relative addressing: reduced path-length with new Power ABI support
▪ MMA lane masking: mask by lane for MMA operations
▪ Additional instructions and capabilities
• Generous room for expanded capabilities including opcode space and modifiers
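As a rough illustration of the 8-byte format, the sketch below splits a 34-bit immediate across prefix and suffix fields the way the prefixed paddi instruction does (an upper 18-bit field in the prefix, a lower 16-bit field in the suffix). The function names are ours, and the 18/16 field split is the only ISA detail assumed:

```python
def split_si34(value):
    """Split a signed 34-bit immediate into the 18-bit prefix field
    and 16-bit suffix field, as in prefixed paddi."""
    assert -(1 << 33) <= value < (1 << 33), "must fit in 34 bits"
    u = value & ((1 << 34) - 1)   # two's-complement, 34 bits
    si0 = u >> 16                 # upper 18 bits -> prefix word
    si1 = u & 0xFFFF              # lower 16 bits -> suffix word
    return si0, si1

def join_si34(si0, si1):
    """Reassemble and sign-extend the 34-bit immediate."""
    u = (si0 << 16) | si1
    return u - (1 << 34) if u & (1 << 33) else u

# Round trip: negative and large positive values survive the split.
assert join_si34(*split_si34(-5)) == -5
assert join_si34(*split_si34(0x1_2345_6789)) == 0x1_2345_6789
```

The payoff of the prefix word is visible here: a plain D-form instruction only has room for the 16-bit suffix field, while the prefix contributes 18 more immediate bits.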



Matrix-Multiply Assist (MMA) instructions
• The latest version of Power ISA (for POWER10) is now publicly available
• The Matrix-Multiply Assist instructions lead to very efficient implementations for
key algorithms in technical computing, machine learning, deep learning and
business analytics
• These instructions are a natural match for implementing dense numerical linear
algebra computations
• We have also shown application to other computations such as convolution
• Various other computations require additional work and research, including
arbitrary precision arithmetic, discrete Fourier transform, …



Power ISA Vector-Scalar Registers (VSRs)

[Figure: the 64 128-bit vector-scalar registers VSR[0:63]; VSR[0:31] overlay the floating-point registers and VSR[32:63] overlay the vector registers.]



Accumulators
• Accumulators are 4 × 4 arrays of 32-bit elements (we will briefly mention the
64-bit extension later):

      | a00 a01 a02 a03 |
  A = | a10 a11 a12 a13 |
      | a20 a21 a22 a23 |
      | a30 a31 a32 a33 |

• The elements can be either 32-bit signed integers (int32) or 32-bit single-precision floating-point numbers (fp32)
• Each accumulator is a software-managed shadow of a set of 4 consecutive
VSRs (8 architected accumulators – ACC[0:7]); software must choose between
using an accumulator or its associated VSRs
• State must be explicitly transferred between accumulators and VSRs using
VSX Move From Accumulator (xxmfacc) and VSX Move To Accumulator
(xxmtacc)
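A toy Python model of the accumulator/VSR relationship described above; the class and method names are illustrative, and only the xxmtacc/xxmfacc semantics come from the architecture:

```python
class Accumulator:
    """Toy model: an accumulator shadows 4 consecutive VSRs.
    State moves only via explicit xxmtacc/xxmfacc analogues."""
    def __init__(self, vsrs, base):
        self.vsrs, self.base = vsrs, base   # shadows VSR[base : base+4]
        self.rows = None                    # 4 rows of 4 elements

    def xxmtacc(self):
        # prime: copy VSR contents into the accumulator
        self.rows = [list(self.vsrs[self.base + i]) for i in range(4)]

    def xxmfacc(self):
        # de-prime: copy accumulator contents back to the VSRs
        for i in range(4):
            self.vsrs[self.base + i] = list(self.rows[i])

vsrs = [[0.0] * 4 for _ in range(64)]   # 64 VSRs, 4 fp32 lanes each
acc0 = Accumulator(vsrs, base=0)        # ACC[0] shadows VSR[0:3]
acc0.xxmtacc()
acc0.rows[2][3] = 7.0                   # update lives only in the accumulator
assert vsrs[2][3] == 0.0                # ...not yet visible in the VSRs
acc0.xxmfacc()
assert vsrs[2][3] == 7.0                # visible after the explicit move
```

The assertions capture the key rule: while an accumulator is primed, its associated VSRs hold stale data until an explicit xxmfacc.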
Outer-product (xv<type>ger<rank-𝑘>) instructions
• Accumulators are updated by rank-𝑘 update instructions:
• Input: 1 accumulator (𝐴) + 2 VSRs (𝑋, 𝑌)
• Output: 1 accumulator (same as input to reduce instruction encoding space)
• Operation: A ← ±A ± XY^T
• For 32-bit data, 𝑋 and 𝑌 are 4 × 1 arrays of elements
• For 16-bit data, 𝑋 and 𝑌 are 4 × 2 arrays of elements
• For 8-bit data, 𝑋 and 𝑌 are 4 × 4 arrays of elements
• For 4-bit data, 𝑋 and 𝑌 are 4 × 8 arrays of elements
• This way XY^T always has a 4 × 4 shape, compatible with the accumulator
Instruction | A            | X            | Y^T          | # of madds/instruction
xvf32ger    | 4 × 4 (fp32) | 4 × 1 (fp32) | 1 × 4 (fp32) | 16
xvf16ger2   | 4 × 4 (fp32) | 4 × 2 (fp16) | 2 × 4 (fp16) | 32
xvi16ger2   | 4 × 4 (int32)| 4 × 2 (int16)| 2 × 4 (int16)| 32
xvi8ger4    | 4 × 4 (int32)| 4 × 4 (int8) | 4 × 4 (int8) | 64
xvi4ger8    | 4 × 4 (int32)| 4 × 8 (int4) | 8 × 4 (int4) | 128
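The semantics of these rank-k updates can be sketched in a few lines of Python; xvger below is an illustrative model rather than an ISA mnemonic, and the negation controls (±) are omitted:

```python
def xvger(acc, x, y, rank):
    """A <- A + X Y^T, where X and Y are 4 x rank,
    so X Y^T is always 4 x 4 (a rank-k update)."""
    for i in range(4):
        for j in range(4):
            acc[i][j] += sum(x[i][k] * y[j][k] for k in range(rank))
    return acc

# rank-1 update (xvf32ger-style, 32-bit inputs): X, Y are 4 x 1
A = [[0.0] * 4 for _ in range(4)]
X = [[1.0], [2.0], [3.0], [4.0]]
Y = [[10.0], [20.0], [30.0], [40.0]]
xvger(A, X, Y, rank=1)
assert A[1][2] == 2.0 * 30.0      # a_ij = x_i * y_j

# rank-4 update (xvi8ger4-style, 8-bit inputs): X, Y are 4 x 4,
# four products accumulated into each 32-bit element
B = [[0] * 4 for _ in range(4)]
xvger(B, [[1, 2, 3, 4]] * 4, [[1, 1, 1, 1]] * 4, rank=4)
assert B[0][0] == 1 + 2 + 3 + 4
```

Note how narrower input types raise the rank, not the output shape: halving the element width doubles the number of multiply-adds folded into each accumulator element.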
Rank-k update

[Figure: shapes of the rank-k update for each input width — 32-bit elements: rank-1 update, (4 × 4) ← ±(4 × 4) ± (4 × 1) × (1 × 4); 16-bit elements: rank-2 update, (4 × 4) ← ±(4 × 4) ± (4 × 2) × (2 × 4); 8-bit elements: rank-4 update, (4 × 4) ← ±(4 × 4) ± (4 × 4) × (4 × 4); 4-bit elements: rank-8 update, (4 × 4) ← ±(4 × 4) ± (4 × 8) × (8 × 4).]


Extension to 64-bit
• Accumulators are 4 × 2 arrays of 64-bit floating-point elements:

      | a00 a01 |
  A = | a10 a11 |
      | a20 a21 |
      | a30 a31 |
• Accumulators are updated by outer-product instructions:
▪ Input: 1 accumulator (A) + 3 VSRs (X1, X2, Y)
▪ Output: 1 accumulator (same as input to reduce instruction encoding space)
▪ Operation: A ← A + [X1; X2] Y^T
▪ X1, X2 and Y are 2 × 1 arrays of elements
▪ [X1; X2] Y^T always has a 4 × 2 shape, compatible with the accumulator

Instruction | A            | [X1; X2]     | Y^T          | # of madds/instruction
xvf64ger    | 4 × 2 (fp64) | 4 × 1 (fp64) | 1 × 2 (fp64) | 8
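A matching sketch for the 64-bit form, again as an illustrative model rather than the ISA mnemonic:

```python
def xvf64ger_model(acc, x1, x2, y):
    """A (4 x 2) <- A + [X1; X2] Y^T, where X1, X2, Y are 2 x 1 (fp64)."""
    x = x1 + x2                  # stack the two VSRs into a 4 x 1 column
    for i in range(4):
        for j in range(2):
            acc[i][j] += x[i] * y[j]
    return acc

A = [[0.0, 0.0] for _ in range(4)]
xvf64ger_model(A, [1.0, 2.0], [3.0, 4.0], [10.0, 100.0])
assert A == [[10.0, 100.0], [20.0, 200.0], [30.0, 300.0], [40.0, 400.0]]
```

The extra input VSR exists because fp64 elements are twice as wide: two 128-bit VSRs are needed to supply the 4 × 1 column that the 32-bit forms get from one.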



Load pressure and unit latency (32-bit results)
• 8 accumulators (8 × 16 result):
▪ 6 VSR loads / 8 xv_ger instructions = 0.75 VSR load per xv_ger
▪ Tolerates the most latency
• 4 accumulators (8 × 8 result):
▪ 4 VSR loads / 4 xv_ger instructions = 1 VSR load per xv_ger
▪ Could work well in SMT modes
• 2 accumulators (8 × 4 result):
▪ 3 VSR loads / 2 xv_ger instructions = 1.5 VSR loads per xv_ger
• 1 accumulator (4 × 4 result):
▪ 2 VSR loads / 1 xv_ger instruction = 2 VSR loads per xv_ger

[Figure: accumulator tilings — the 8 × 16 case uses ACC[0:7] arranged 2 × 4, fed by X[0], X[1] and Y[0:3]; the 8 × 8 case uses ACC[0:3] arranged 2 × 2, fed by X[0], X[1] and Y[0:1]; the 8 × 4 case uses ACC[0:1], fed by X[0], X[1] and Y[0].]
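The four load-to-compute ratios above all follow from one formula: an r × c grid of accumulators needs r X-tile loads and c Y-tile loads per step of the inner loop, but issues r × c xv_ger instructions. A quick check, with the function name ours:

```python
def loads_per_xvger(acc_rows, acc_cols):
    """VSR loads per xv_ger for an (acc_rows*4) x (acc_cols*4) virtual
    accumulator: one load per X tile and per Y tile, one xv_ger per
    accumulator, each inner-loop step."""
    loads = acc_rows + acc_cols          # X tiles + Y tiles
    instructions = acc_rows * acc_cols   # one xv_ger per accumulator
    return loads / instructions

assert loads_per_xvger(2, 4) == 0.75    # 8 accumulators, 8 x 16 result
assert loads_per_xvger(2, 2) == 1.0     # 4 accumulators, 8 x 8 result
assert loads_per_xvger(2, 1) == 1.5     # 2 accumulators, 8 x 4 result
assert loads_per_xvger(1, 1) == 2.0     # 1 accumulator, 4 x 4 result
```

Because loads grow as r + c while compute grows as r × c, the largest tiling that fits in the 8 architected accumulators minimizes load pressure and best hides load latency.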
The micro-kernel of GEMM: C (m × n) += A (m × k) × B (k × n)



SGEMM micro-kernel using 8 × 8 virtual accumulator
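A minimal Python sketch of that micro-kernel, modeling the 8 × 8 virtual accumulator as a 2 × 2 grid of 4 × 4 tiles, each receiving one rank-1 (xvf32ger-style) update per step of the k loop; function and variable names are ours:

```python
def sgemm_microkernel(A, B, C, K):
    """C (8 x 8) += A (8 x K) x B (K x 8), as an 8 x 8 virtual accumulator:
    four 4 x 4 tiles, one rank-1 outer-product update per tile per k step."""
    for k in range(K):
        x = [A[i][k] for i in range(8)]   # 8 x 1 column of A (2 VSR loads)
        y = [B[k][j] for j in range(8)]   # 1 x 8 row of B (2 VSR loads)
        for ti in (0, 4):                 # 2 x 2 grid of 4 x 4 tiles
            for tj in (0, 4):             # (4 xv_ger-style updates per k)
                for i in range(4):
                    for j in range(4):
                        C[ti + i][tj + j] += x[ti + i] * y[tj + j]
    return C

# Validate against a straightforward triple-loop reference.
K = 3
A = [[i + k for k in range(K)] for i in range(8)]
B = [[k * j for j in range(8)] for k in range(K)]
C = [[0] * 8 for _ in range(8)]
sgemm_microkernel(A, B, C, K)
ref = [[sum(A[i][k] * B[k][j] for k in range(K)) for j in range(8)]
       for i in range(8)]
assert C == ref
```

Each k step loads 4 VSRs and issues 4 accumulator updates, matching the 1-load-per-xv_ger ratio of the 4-accumulator case on the previous slide.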



POWER family memory architecture

• Scale Up – Buffered Memory: superior RAS, high bandwidth, high capacity; agnostic interface for alternate memory innovations
• Scale Out – Direct Attach Memory: low latency access; commodity packaging form factor
• OpenCAPI Agnostic Buffered Memory (OMI) spans the memory tiers:
▪ Near Tier: extreme bandwidth, lower capacity
▪ Commodity: low latency, low cost
▪ Enterprise: RAS, capacity, bandwidth
▪ Storage Class: extreme capacity, persistence
• The same Open Memory Interface is used for all systems and memory technologies
Primary tier memory options

• DDR4 RDIMM (72b, ~2666 MHz bidirectional): capacity ~256 GB, BW ~150 GB/s
• DDR4 LRDIMM (72b, ~2666 MHz bidirectional): capacity ~2 TB, BW ~150 GB/s
▪ RDIMM and LRDIMM fit in the same system
• DDR4 OMI DIMM (buffered, 8b ~25G differential): capacity ~256 GB → 4 TB, BW ~320 GB/s
▪ Only 5–10 ns higher load-to-use than RDIMM (< 5 ns vs. LRDIMM)
• Bandwidth-optimized OMI DIMM (buffered, 16b ~25G differential): capacity ~128 → 512 GB, BW ~650 GB/s
▪ Both OMI DIMM options fit in the same system – the OMI strategy
• On-module HBM (1024b, Si interposer): capacity ~16 → 32 GB, BW ~1 TB/s; unique system
POWER connectivity variants

Direct Attach Memory:
• Max capacity: 2 TB; max bandwidth: 150 GB/s
• 4 × DDR4 memory; x24 system attach
• Limited system interconnect: 2 × local SMP

OMI Buffered Memory:
• Max capacity: 4 TB; max bandwidth: 650 GB/s
• 8 × OMI buffered memory; x48 system attach
• Enhanced system interconnect: 3 × local SMP
• Advanced interconnect signaling: 6× the bandwidth per mm² of DDR signaling


DRAM DIMM comparison

[Figure, approximate scale: IBM Centaur DIMM, JEDEC DDR DIMM, and OMI DDIMM side by side.]

OMI DDIMM attributes:
• Technology agnostic
• Low cost
• Ultra-scale system density
• Enterprise reliability
• Low latency
• High bandwidth
OpenCAPI key attributes

[Figure: an OpenCAPI-enabled processor with TL/DL over 25Gb I/O connecting to accelerated OpenCAPI devices (FPGA/SoC/GPU accelerators running accelerated functions with device memory and caches, via TLx/DLx), OpenCAPI memory buffers providing buffered system memory, and advanced SCM solutions — storage/compute/network, ASIC/FPGA/FFSA, load/store or block access — all mapped into the application's address space.]

1. Architecture agnostic bus – Applicable with any system/microprocessor architecture


2. Optimized for high bandwidth and low latency
3. High performance 25Gbps PHY design
4. Coherency - Attached devices operate natively within application’s user space and coherently with host microprocessor
5. Virtual addressing enables low overhead with no kernel, hypervisor or firmware involvement; security benefit
6. Wide range of use cases and access semantics
7. CPU coherent device memory (Home Agent Memory)
8. Architected for both classic memory and emerging advanced storage class memory
9. Minimal OpenCAPI design overhead (Less than 5% of a Xilinx VU3P FPGA)



OpenCAPI adapters

Mellanox Innova2 — Network + FPGA:
• Xilinx US+ KU15P FPGA
• Mellanox CX5 NIC
• 16 GB DDR4
• 2 × 25Gb SFP cages
• x8 25 Gb/s OpenCAPI support
• Use cases: network acceleration (NFV, packet classification), security acceleration

Nallatech 250-SoC — Multipurpose converged network/storage:
• Xilinx Zynq US+ ZU19EG FPGA
• 8/16 GB DDR4, 4/8 GB DDR4 ARM
• PCIe Gen3 x16, CAPI2
• 4 × x8 Oculink ports supporting NVMe, network, or OpenCAPI
• Use cases: NVMe-oF target, high-BW storage server

AlphaData ADM-9H3 — Medium FPGA with 8GB HBM:
• Xilinx Virtex US+ VU33P-3 FPGA + HBM
• 8 GB High Bandwidth Memory
• PCIe Gen4 x8 or Gen3 x16, CAPI2
• 1 × x8 25 Gb/s OpenCAPI port
• 1 × 2x100Gb QSFP28-DD cage, 2 × 100Gb QSFP28 cages
• Use cases: ML/DL, inference, system modeling, HPC

AlphaData ADM-9H7 — Large FPGA with 8GB HBM:
• Xilinx US+ VU37P FPGA + HBM
• 8 GB High Bandwidth Memory
• PCIe Gen4 x8 or Gen3 x16, CAPI2
• 2 × x8 25 Gb/s OpenCAPI ports (support up to 50 GB/s)
• 4 × 100Gb QSFP28 cages
• Use cases: ML/DL, inference, system modeling, HPC


Conclusions
• OpenPOWER delivers the three essential technologies of supercomputing:
▪ Number crunching: Through its SIMD and MMA instructions
▪ Memory bandwidth and capacity: Through OMI
▪ System interconnect: Through OpenCAPI
• The MMA instructions provide a new level of performance for dense linear
algebra and related computations
• OMI provides a new level of memory bandwidth for computing systems while
delivering low cost, versatility and capacity
• OpenCAPI provides an enhanced system interconnect for acceleration and
additional functionality
• All three are scalable, offering room to grow in all dimensions
• Together with Open Source software, we see a clear road ahead for high-
performance computing systems that combine the best innovation from the
community!

