
Welcome to the CXL 3.0 Technical Training

To access Wifi, join the network “Google Guest”.


Introduction to CXL 3.0
Debendra Das Sharma
Intel Senior Fellow and Co-Chair CXL TTF


CXL Board of Directors

Industry Open Standard for High Speed Communications
200+ Member Companies


CXL Specification Release Timeline

• March 2019: CXL 1.0 Specification Released
• September 2019: CXL Consortium Officially Incorporates; CXL 1.1 Specification Released
• November 2020: CXL 2.0 Specification Released
• August 2022: CXL 3.0 Specification Released


Agenda
• Industry Landscape and CXL
• CXL 1.0/1.1 and CXL 2.0 recap
• CXL 3.0 Features
• Conclusions



Industry Landscape
• Proliferation of Cloud Computing
• Growth of AI & Analytics
• Cloudification of the Network & Edge


CXL Overview
• New breakthrough high-speed fabric
  • Enables a high-speed, efficient interconnect between CPU, memory and accelerators
  • Builds upon PCI Express® (PCIe®) infrastructure, leveraging the PCIe® physical and electrical interface
  • Maintains memory coherency between the CPU memory space and memory on CXL-attached devices
  • Enables fine-grained resource sharing for higher performance in heterogeneous compute environments
  • Enables memory disaggregation, memory pooling and sharing, persistent memory and emerging memory media
• Delivered as an open industry standard
  • CXL Specification 3.0 available in August 2022 with full backward compatibility with CXL 1.0/CXL 1.1 and CXL 2.0
  • Future CXL Specification generations will include continuous innovation to meet industry needs and support new technologies


CXL™ Approach
• Coherent interface: leverages PCIe® with three multiplexed protocols (CXL.io, CXL.cache, CXL.memory)
• Built on top of PCIe® infrastructure
• Low latency: CXL.cache/CXL.memory target near-CPU cache-coherent latency (<200 ns load to use)
• Asymmetric complexity: eases the burden of cache-coherence interface designs for devices

Protocol roles:
• CXL.io – discovery, configuration, initialization, interrupts, DMA, ATS
• CXL.cache – coherent requests
• CXL.memory – memory flows

(Diagram: a host processor with CPU cores, home agent, coherence and cache logic, memory controller, IOMMU, and PCIe/CXL.io logic connects over a CXL link, through a converged low-latency PCIe PHY, to an accelerator containing a caching agent, DTLB, accelerator logic, memory controller, and optional device memory.)
Representative CXL Usages with CXL 1.0

TYPE 1 – Caching Devices / Accelerators (e.g., NIC with cache)
• Protocols: CXL.io, CXL.cache
• Usages: PGAS NIC, NIC atomics

TYPE 2 – Accelerators with Memory (e.g., accelerator with HBM)
• Protocols: CXL.io, CXL.cache, CXL.memory
• Usages: GP GPU, dense computation

TYPE 3 – Memory Buffers (memory controller with attached memory)
• Protocols: CXL.io, CXL.memory
• Usages: memory bandwidth expansion, memory capacity expansion, storage class memory


CXL 2.0 FEATURE SUMMARY
MEMORY POOLING
1. Device memory can be allocated across multiple hosts.
2. Multi-Logical Devices allow for finer-grain memory allocation.

(Diagram: hosts H1–H# connect through a CXL 2.0 switch to pooled devices D1–D#.)


CXL 2.0 FEATURE SUMMARY
SWITCH CAPABILITY
• Supports single-level switching
• Enables memory expansion and resource allocation

(Diagram: host H1 connects over CXL through a CXL switch to devices D1–D#.)


CXL™ 2.0: Resource Pooling at Rack Level, Persistent Memory Support and Enhanced Security
• Resource pooling/disaggregation
• Managed hot-plug flows to move resources
• Type-1/Type-2 device assigned to one host
• Type-3 device (memory) pooling at rack level
• Direct load-store, low-latency access – similar to memory attached in a
neighboring CPU socket
(vs. RDMA over network)
• Persistence flows for persistent memory
• Fabric Manager/API for managing resources
• Security: authentication, encryption
• Beyond node to rack-level connectivity!
Disaggregated system with CXL optimizes resource utilization, delivering lower TCO and power efficiency
CXL 3.0 Specification

Industry trends:
• Use cases driving the need for higher bandwidth include high-performance accelerators, system memory, SmartNICs, and leading-edge networking
• CPU efficiency is declining due to reduced memory capacity and bandwidth per core
• Efficient peer-to-peer resource sharing across multiple domains
• Memory bottlenecks due to CPU pin and thermal constraints

CXL 3.0 introduces:
• Fabric capabilities
  • Multi-headed and fabric-attached devices
  • Enhanced fabric management
  • Composable disaggregated infrastructure
• Improved capability for better scalability and resource utilization
  • Enhanced memory pooling
  • Multi-level switching
  • New symmetric coherency capabilities
  • Improved software capabilities
• Double the bandwidth
• Zero added latency over CXL 2.0
• Full backward compatibility with CXL 2.0, CXL 1.1, and CXL 1.0


CXL 3.0 Specification
Expanded capabilities for increasing scale and optimizing resource utilization:
• Fabric capabilities and fabric-attached memory
• Enhanced fabric management framework
• Memory pooling and sharing
• Peer-to-peer memory access
• Multi-level switching
• Near-memory processing
• Multi-headed devices
• Multiple Type 1/Type 2 devices per root port
• Fully backward compatible with CXL 2.0, 1.1, and 1.0
• Supports PCIe® 6.0

(Key themes: fabric capabilities and management, improved memory sharing and pooling, symmetric coherency, peer-to-peer.)
CXL 3.0 Spec Feature Summary

Features                                         CXL 1.0 / 1.1   CXL 2.0   CXL 3.0
Release date                                     2019            2020      2H 2022
Max link rate                                    32 GT/s         32 GT/s   64 GT/s
Flit 68 byte (up to 32 GT/s)                     ✓               ✓         ✓
Flit 256 byte (up to 64 GT/s)                    –               –         ✓
Type 1, Type 2 and Type 3 devices                ✓               ✓         ✓
Memory pooling w/ MLDs                           –               ✓         ✓
Global Persistent Flush                          –               ✓         ✓
CXL IDE                                          –               ✓         ✓
Switching (single-level)                         –               ✓         ✓
Switching (multi-level)                          –               –         ✓
Direct memory access for peer-to-peer            –               –         ✓
Symmetric coherency (256 byte flit)              –               –         ✓
Memory sharing (256 byte flit)                   –               –         ✓
Multiple Type 1/Type 2 devices per root port     –               –         ✓
Fabric capabilities (256 byte flit)              –               –         ✓


CXL 3.0: Doubles Bandwidth with Same Latency
• Uses the PCIe® 6.0 PHY @ 64 GT/s
• PAM-4 signaling and its higher BER are mitigated by the PCIe 6.0 FEC and CRC (a different CRC is used for the latency-optimized flit)
• Standard 256B flit, along with an additional 256B latency-optimized flit (0-latency adder over CXL 2.0)
  • The 0-latency adder trades off FIT (failure in time, per 10⁹ hours) from 5×10⁻⁸ to 0.026 and link efficiency from 0.94 to 0.92 for 2–5 ns of latency savings (x16 – x4)¹
• Extends to lower data rates (8 GT/s, 16 GT/s, 32 GT/s)
• Enables several new CXL 3 protocol enhancements with the 256B flit format
CXL 3.0 Protocol Enhancements (UIO and BI) for Device-to-Device Connectivity
• CXL 3.0 enables non-tree topologies and peer-to-peer (P2P) communication within a virtual hierarchy of devices
  • Virtual hierarchies are associations of devices that maintain a coherency domain
• P2P to HDM-DB memory is I/O coherent: a new Unordered I/O (UIO) flow in CXL.io. The Type-2/3 device that hosts the memory generates a new Back-Invalidation flow (CXL.mem) to the host to ensure coherency if there is a coherency conflict
CXL 3.0 Protocol Enhancements: Mapping Large Memory in Type-2 Devices to HDM with Back-Invalidate
The existing Bias-Flip mechanism required HDM to be tracked fully, since the device could not back-snoop the host. Back-Invalidate in CXL 3.0 enables a snoop-filter implementation, allowing large memory to be mapped to HDM.
CXL 3.0: SWITCH CASCADE/FANOUT
Supporting a vast array of switch topologies
1. Multiple switch levels (aka cascade)
   • Supports fanout of all device types

(Diagram: hosts connect to a first-level CXL switch, which cascades to additional CXL switches fanning out to devices D1–D# of all types.)


CXL 3.0: MULTIPLE DEVICES OF ALL TYPES PER ROOT PORT
1. Each host’s root port can connect to more than one device type.

(Diagram: in CXL 2.0, a root port below a CXL switch connects to at most one Type 1 or Type 2 device plus Type 3 memory devices; in CXL 3.0, a root port below a CXL switch can connect to a mix of Type 1, Type 2, and Type 3 devices, e.g. memory, FPGA, and NIC.)


CXL 3.0: Pooling & Sharing
• Expanded use case showing memory sharing and pooling
• The CXL Fabric Manager is available to setup, deploy, and modify the environment
• Shared coherent memory across hosts using hardware coherency (directory + Back-Invalidate flows). Allows one to build large clusters to solve large problems through shared-memory constructs. Defines a Global Fabric Attached Memory (GFAM) which can provide access to up to 4095 entities.
CXL 3.0: Multiple-Level Switching, Multiple Type-1/2 Devices per Root Port
• Each host’s root port can connect to more than one device type (up to 16 CXL.cache devices)
• Multiple switch levels (aka cascade)
  • Supports fanout of all device types
CXL 3.0: COHERENT MEMORY SHARING
1. Device memory can be shared by all hosts to increase data-flow efficiency and improve memory utilization.
2. A host can have a coherent copy of the shared region, or portions of the shared region, in host cache.
3. CXL 3.0 defines mechanisms to enforce hardware cache coherency between copies.

(Diagram: hosts H1–H#, each caching copies of shared regions S1/S2, connect through CXL switch(es) managed by a standardized CXL Fabric Manager to a pooled-memory device D1 that exposes shared regions S1 and S2.)
CXL 3.0 Fabrics
Composable systems with spine/leaf architecture at rack/pod level

CXL 3.0 Fabric Architecture:
• Interconnected spine switch system
• Leaf switch NIC enclosure
• Leaf switch CPU enclosure
• Leaf switch accelerator enclosure
• Leaf switch memory enclosure

(Diagram: a Fabric Manager controls spine CXL switches connected to leaf CXL switches, which attach to end devices – NICs, CPUs, accelerators, memory, and GFAM memory; example traffic flows traverse leaf–spine–leaf.)
Conclusions
• CXL 3.0 features
  • Full fabric capabilities and fabric management
  • Expanded switching topologies
  • Symmetric coherency capabilities
  • Peer-to-peer resource sharing
  • Double the bandwidth and zero added latency compared to CXL 2.0
  • Full backward compatibility with CXL 2.0, CXL 1.1, and CXL 1.0
• Enabling new usage models
  • Memory sharing between hosts and peer devices
  • Support for multi-headed devices
  • Expanded support for Type-1 and Type-2 devices
  • GFAM provides expansion capabilities for current and future memory
• Call to action
  • Download the CXL 3.0 specification
  • Support future specification development by joining the CXL Consortium
  • Follow us on Twitter and LinkedIn for updates!


Thank You



Next Session:
Flex Bus Physical Layer
To access Wifi, join the network “Google Guest”.



CXL 3.0: Flex Bus Physical Layer
Michelle Jen

December 8, 2022


Flex Bus Physical Layer – Overview
• Based on the PCIe 6.0 Physical Layer
  • The Electrical Sub-block is identical to PCIe
  • The Logical Sub-block has deltas to support CXL mode
• Dynamic negotiation during initial link training to select PCIe or CXL mode
• CXL speeds and widths:
  • 64 GT/s (PAM4), 32 GT/s, 16 GT/s, 8 GT/s
  • Link widths: x16, x8, x4, x2, x1
  • Link subdivision down to x4


Flex Bus Physical Layer – New in CXL 3.0
• Aligned to the PCIe 6.0 specification (blue-highlighted items in the slide have a delta from PCIe)
  • Top speed increased to 64 GT/s (PAM4) from 32 GT/s (NRZ)
  • Support for 256B flits
    • Gray code
    • FEC + CRC generation and checking
    • Ack/Nak generation; retry logic (Tx Retry Buffer and Rx Retry Buffer)
  • L0p state low-power optimization
• Additional information in Modified TS1/TS2:
  • Latency-Optimized 256B Flit Capable/Enable
  • CXL.io Throttle Required at 64 GT/s
  • CXL NOP Hint Info[1:0]
  • PBR Flit Capable/Enable
• Terminology updates to remove ambiguity related to operating modes


Terminology Changes
• CXL 1.1 mode and CXL 2.0 mode terminology is eliminated; specific features are referenced instead

Terminology in the CXL 3.0 spec (with previous terminology or description):
• 68B Flits – previously CXL 1.1/CXL 2.0 flits
• 256B Flits (standard and latency-optimized) – new CXL flit formats based on the PCIe flit
• Restricted CXL Device (RCD) – CXL device operating in RCD mode
• Exclusive RCD (eRCD) – previously "CXL 1.1-only device"
• RCD mode – Restricted CXL Device mode: a CXL operating mode limited by specific restrictions such as no hot-plug, RCiEP, no switch support, and 68B Flit mode only
• Exclusive Restricted CXL Host (eRCH) – previously "CXL 1.1-only host"
• Virtual Hierarchy (VH) – everything from the CXL RP down, including the CXL RP, the CXL PPBs, and the CXL Endpoints
• VH mode – mode of operation where the CXL RP is the root of the hierarchy
• 68B Flit and VH capable/enable (in Modified TS1/TS2) – previously "CXL 2.0 capable"


68B Flit Mode vs. 256B Flit Mode – Transmit Path
• CRC and retry logic moved from the Link Layer (per-TLP / per-flit in 68B flit mode) to the Physical Layer (a common point, per 256B flit)
• ARB/MUX ALMPs go through the Retry Buffer in 256B flit mode

(Diagram, 68B flit mode: the CXL.io link layer adds TLP sequence numbers and a 32-bit LCRC, generates DLLPs with a 16-bit CRC, and keeps a per-TLP retry buffer; the CXL.cachemem link layer builds 512-bit flit payloads with a 16-bit flit-level CRC and its own retry buffer; CXL ARB/MUX ALMPs use a 32-bit payload protected via replication; the Flex Bus physical layer inserts a 16-bit protocol ID protected via an encoding scheme, producing 512-bit flits + 16-bit CRC/unused + 16-bit protocol ID.)

(Diagram, 256B flit mode: the CXL.io link layer supplies TLPs and DLLP payloads plus 2B credit info to flit packing logic, the CXL.cachemem link layer supplies 512-bit flit payloads, and the ARB/MUX supplies 256B ALMP flits along with IDLE/NOP flit generation; the Flex Bus physical layer performs flit packing – 2B flit header, sequence number, 6B/8B CRC, 6B FEC – with a per-256B-flit retry buffer driven by Ack/Nak requests from the Rx side, outputting 256B CXL flits.)


Review – 68B Flit Framing
• 68B flit mode uses PCIe 128b/130b block encoding
  • Optional 2-bit sync header (negotiated)
  • Transitions from Data Blocks to Ordered Set Blocks are indicated via the 16-bit protocol ID
• A 68B flit consists of a 16-bit protocol ID + 512-bit payload + 16-bit CRC
• The protocol ID identifies the source of the flit (and end of data stream):
  • CXL.io
  • CXL.cache/CXL.mem
  • ARB/MUX (virtual link state management)
  • Physical Layer (NULL flits)
• The 16-bit CRC field is reserved for CXL.io


Background – PCIe Flit
• PAM4 signaling @ 64 GT/s: Pulse Amplitude Modulation, 4-level
  • 4 levels (2 bits) are transferred in a UI (Unit Interval)
  • Increased susceptibility to errors: FBER (First Bit Error Rate) is 10⁻⁶ (vs. FBER of 10⁻¹² at NRZ signaling rates of 32 GT/s and lower)
• PCIe uses two mechanisms to address the higher BER and correct errors:
  • A lightweight FEC (Forward Error Correction)
  • A strong CRC followed by link-level retry (selective/standard) upon detection of an error post-FEC
• PCIe introduced a flit-based unit of transfer to facilitate the FEC & CRC implementation
  • FEC uses a fixed set of bytes in a 256B flit
  • Error correction (FEC) in the flit → CRC (detection in the flit) → retry at the flit level


Background – PCIe Flit (continued)
• 256B PCIe flit
  • 236B TLP, 6B DLP, 8B CRC, 6B FEC
  • No sync header @ 64 GT/s, no framing tokens, no TLP/DLLP CRC
    • Reduced overhead → improved bandwidth
  • Latency: 2 ns x16, 4 ns x8, 8 ns x4, 16 ns x2, 32 ns x1
  • Guaranteed space in each flit for Ack and credit exchange → low latency, low storage
• FEC is a 3-way interleaved ECC
  • Covers TLP, DLP, and CRC
  • Three groups (83B + 83B + 84B) + (3 × 2B ECC)
  • Each ECC code is capable of correcting up to a single byte of error
  • A burst of errors in a single lane of up to 16 bits does not impact more than one byte per group
• 8B of CRC: covers TLP and DLP
  • Reed-Solomon (242, 250) code guarantees detection of up to 8 symbol errors
• PCIe Flit Mode is negotiated during link training
  • Applies to all data rates
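
Since several later slides build on these byte budgets, here is a quick sanity-check sketch (plain Python, not taken from the specification) confirming that the 256B PCIe flit accounting, the 3-way FEC grouping, and the per-width flit latencies above are mutually consistent; the 8 GB/s-per-lane figure is simply the raw 64 Gb/s lane rate expressed in bytes.

```python
# Sanity check of the 256B PCIe flit byte budget described above.
TLP, DLP, CRC, FEC = 236, 6, 8, 6
assert TLP + DLP + CRC + FEC == 256              # full 256B flit

# FEC is 3-way interleaved: groups of 83B + 83B + 84B cover TLP + DLP + CRC,
# and each group carries its own 2B ECC code.
groups = [83, 83, 84]
assert sum(groups) == TLP + DLP + CRC            # 250 protected bytes
assert len(groups) * 2 == FEC                    # 3 x 2B ECC = 6B of FEC

# Flit accumulation latency scales inversely with link width at 64 GT/s
# (raw lane rate 64 Gb/s = 8 GB/s): 2 ns at x16, 4 ns at x8, ... 32 ns at x1.
for width, expected_ns in [(16, 2), (8, 4), (4, 8), (2, 16), (1, 32)]:
    assert 256 / (width * 8) == expected_ns
print("PCIe 256B flit byte budgets check out")
```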


CXL 256B Flit
• The CXL 256B flit is based on the PCIe flit
  • PCIe 256B flit layout (two 128B halves): TLP (128 bytes) | TLP (108 bytes), DLLP (6 bytes), CRC (8 bytes), FEC (6 bytes)
  • CXL adds a 2-byte FlitHdr followed by 240 bytes of flit data + CRC + FEC
  • TLP bytes are shifted by 2 bytes due to the FlitHdr
• Two variants of the 256B flit: Standard and Latency-Optimized

Layouts (two 128B halves each):
• CXL Standard 256B flit: FlitHdr (2B) + FlitData (126B) | FlitData (114B) + CRC (8B) + FEC (6B)
• CXL.io Standard 256B flit: FlitHdr (2B) + FlitData (126B) | FlitData (110B) + DLLP (4B) + CRC (8B) + FEC (6B)


CXL 256B Flit – Flit Header
• 256B Flit Header fields (field, bit location, description):
  • Flit Type[1:0] – bits [7:6]
    • 00b = Physical Layer IDLE flit, Physical Layer NOP flit, or CXL.io NOP flit
    • 01b = CXL.io Payload flit
    • 10b = CXL.cachemem Payload flit or CXL.cachemem-generated Empty flit
    • 11b = ALMP
  • Prior Flit Type – bit [5]
    • 0 = prior flit was a non-retryable Empty, NOP, or IDLE flit (not allocated into the Retry buffer)
    • 1 = prior flit was a Payload flit or retryable Empty flit (allocated into the Retry buffer)
  • Type of DLLP Payload – bit [4]: if Flit Type = CXL.io Payload or CXL.io NOP, used as defined in the PCIe Base Specification
  • Replay Command[1:0] – bits [3:2]: same as defined in the PCIe Base Specification
  • Flit Sequence Number[9:0] – bits {[1:0], [15:8]}: 10-bit sequence number as defined in the PCIe Base Specification
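
To make the bit locations concrete, here is a minimal decoder sketch (plain Python, illustrative only; it assumes header byte 0 carries header bits [7:0] and byte 1 carries bits [15:8], with field names taken from the table above).

```python
def decode_flit_header(hdr: bytes) -> dict:
    """Decode the 2-byte CXL 256B flit header per the field table above (sketch)."""
    assert len(hdr) == 2
    b0, b1 = hdr[0], hdr[1]
    type_names = {0b00: "Phys IDLE/NOP or CXL.io NOP",
                  0b01: "CXL.io Payload",
                  0b10: "CXL.cachemem Payload/Empty",
                  0b11: "ALMP"}
    return {
        "flit_type": type_names[(b0 >> 6) & 0x3],          # bits [7:6]
        "prior_flit_retryable": bool((b0 >> 5) & 0x1),      # bit [5]
        "dllp_payload_type": (b0 >> 4) & 0x1,               # bit [4]
        "replay_command": (b0 >> 2) & 0x3,                  # bits [3:2]
        "sequence_number": ((b0 & 0x3) << 8) | b1,          # {[1:0], [15:8]} -> 10 bits
    }

# Example: CXL.cachemem payload flit, prior flit retryable, sequence number 0x155.
print(decode_flit_header(bytes([0b10100001, 0x55])))
```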


CXL 256B Flit – Flit Type[1:0]
• Flit Type[1:0] encodings (payload, source, description, retry-buffer allocation):
  • 00b – Physical Layer / CXL.io NOP flits
    • Physical Layer NOP: Physical Layer generated (and sunk) flit with no valid payload; inserted in the data stream when the Tx Retry buffer is full, when no other flits from upper layers are available to transmit, or in response to a NOP hint. Not allocated to the Retry buffer.
    • IDLE: Physical Layer generated (and sunk) all-0s payload flit used to facilitate LTSSM transitions per the PCIe specification. Not allocated to the Retry buffer.
    • CXL.io NOP (CXL.io Link Layer): valid CXL.io DLLP payload (no TLP payload); periodically inserted by the CXL.io link layer to satisfy the PCIe Base Specification requirement for the credit update interval if no other CXL.io flits are available to transmit. Not allocated to the Retry buffer.
  • 01b – CXL.io Payload (CXL.io Link Layer): valid CXL.io TLP and valid DLLP payload. Allocated to the Retry buffer.
  • 10b – CXL.cachemem Payload (CXL.cachemem Link Layer): valid CXL.cachemem slot and/or CXL.cachemem credit payload. Allocated to the Retry buffer.
    • CXL.cachemem Empty: no valid CXL.cachemem payload; generated when the CXL.cachemem Link Layer speculatively arbitrates to transfer a flit to reduce idle-to-valid transition time but no valid CXL.cachemem payload arrives in time to use any slots in the flit. If retryable, allocated to the Tx Retry buffer only.
  • 11b – ALMP (ARB/MUX): ARB/MUX Link Management Packet. Allocated to the Retry buffer.
Latency-Optimized 256B Flits
• Optionally supported flit format; negotiated during link training (no dynamic switching)
• Comprised of 128-byte even and odd flit halves, each independently protected by a 6-byte CRC
• The even flit half can be consumed without waiting for the second half to be received
• Flit accumulation latency savings increase for smaller link widths (e.g., 8 ns of round-trip flit accumulation latency for a x4 link @ 64 GT/s)

Layouts (two 128B halves each):
• CXL Standard 256B flit: FlitHdr (2B) + FlitData (126B) | FlitData (114B) + CRC (8B) + FEC (6B)
• CXL.io Standard 256B flit: FlitHdr (2B) + FlitData (126B) | FlitData (110B) + DLLP (4B) + CRC (8B) + FEC (6B)
• CXL Latency-Optimized 256B flit: FlitHdr (2B) + FlitData (120B) + CRC (6B) | FlitData (116B) + FEC (6B) + CRC (6B)
• CXL.io Latency-Optimized 256B flit: FlitHdr (2B) + FlitData (120B) + CRC (6B) | FlitData (112B) + DLLP (4B) + FEC (6B) + CRC (6B)
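
All four CXL variants, like the PCIe base flit, must pack into exactly 256 bytes; the short check below (plain Python, field sizes copied from the layout figures above) verifies that each variant's fields add up.

```python
# Field sizes (bytes) for the flit variants shown above; FlitData sums both halves.
flit_layouts = {
    "PCIe 256B flit":                {"TLP": 236, "DLLP": 6, "CRC": 8, "FEC": 6},
    "CXL Standard 256B":             {"FlitHdr": 2, "FlitData": 126 + 114, "CRC": 8, "FEC": 6},
    "CXL.io Standard 256B":          {"FlitHdr": 2, "FlitData": 126 + 110, "DLLP": 4, "CRC": 8, "FEC": 6},
    "CXL Latency-Optimized 256B":    {"FlitHdr": 2, "FlitData": 120 + 116, "CRC": 6 + 6, "FEC": 6},
    "CXL.io Latency-Optimized 256B": {"FlitHdr": 2, "FlitData": 120 + 112, "DLLP": 4, "CRC": 6 + 6, "FEC": 6},
}
for name, fields in flit_layouts.items():
    total = sum(fields.values())
    assert total == 256, f"{name} accounts for {total} bytes, expected 256"
print("Every flit variant accounts for exactly 256 bytes")
```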


Latency-Optimized 256B Flits – Datapath
• The lowest-latency path (6B CRC check only) is used if CRC passes; the receiver switches over to the higher-latency path (flit accumulation + FEC + CRC check) if CRC fails
• The entire 256-byte flit is retried if either flit half fails CRC after FEC decode and correct
• The receiver tracks whether it has previously consumed the even flit half when it receives a replayed flit
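
The receive-side decision above can be captured in a few lines of control logic; the sketch below is purely behavioral (plain Python; FlitHalf, fec_corrects_flit, and the returned strings are illustrative stand-ins, not spec-defined interfaces).

```python
from dataclasses import dataclass

@dataclass
class FlitHalf:
    crc_ok: bool          # result of the per-half 6B CRC check (stand-in for real logic)

def receive_lat_opt_flit(even: FlitHalf, odd: FlitHalf,
                         even_already_consumed: bool, fec_corrects_flit: bool) -> str:
    """Behavioral sketch of the latency-optimized receive path described above.

    A half that passes its 6B CRC can be consumed on the low-latency path; any
    CRC failure sends the whole 256B flit through FEC decode/correct and re-check.
    If either half still fails, the entire flit is retried, and the receiver
    remembers whether the even half was already consumed so a replay isn't
    consumed twice.
    """
    consumed = []
    if even.crc_ok and not even_already_consumed:
        consumed.append("even half")                   # low-latency consumption
    if even.crc_ok and odd.crc_ok:
        consumed.append("odd half")
        return f"consumed {consumed} on low-latency path"
    if fec_corrects_flit:                              # slow path: accumulate + FEC + CRC
        consumed.append("remaining data after FEC")
        return f"consumed {consumed} via FEC path"
    return "retry entire 256B flit"

# Example: even half passes CRC, odd half fails but FEC corrects the flit.
print(receive_lat_opt_flit(FlitHalf(True), FlitHalf(False),
                           even_already_consumed=False, fec_corrects_flit=True))
```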


Latency-Optimized 256B Flits (continued)
Example scenarios (see the CXL specification for the comprehensive list):
• Original flit: even CRC pass, odd CRC pass → consume the flit (no FEC correction or retransmission needed).
• Original flit: even CRC pass, odd CRC fail → OK to consume the even half; perform FEC decode and correct.
  • Post-FEC-correct flit: even pass, odd pass → consume anything not previously consumed.
  • Post-FEC-correct flit: even pass, odd fail → OK to consume the even half if not previously consumed; request retry.
    • Retransmitted flit: even pass, odd pass → consume anything not previously consumed.
    • Retransmitted flit: even pass, odd fail → OK to consume the even half if not previously consumed; FEC decode and correct.
    • Retransmitted flit: even fail, odd pass/fail → FEC decode and correct.


6B CRC Computation
• The specific Reed-Solomon (130, 136) code selected enables the 6B CRC to be derived from the 8B CRC logic (to save area)
• When independently generated, the 6B CRC is computed for each half of the flit by zero-padding the data input on the MSB side to make a 130-byte message
• When the 8B CRC logic is used to generate the 6B CRC:
  • Specific bytes of the message are fed to the corresponding bytes of the 8B CRC generator
  • Once the 8B CRC is generated, it can be used to generate the required 6B CRC with minimal logic

PCIe CRC input bytes      Even flit half mapping               Odd flit half mapping
Byte 0 to Byte 113        00h for all bytes                    00h for all bytes
Byte 114 to Byte 229      Byte 0 to Byte 115 of the flit       Byte 128 to Byte 243 of the flit
Byte 230 to Byte 235      Byte 116 to Byte 121 of the flit     00h for all bytes
Byte 236 to Byte 241      00h for all bytes                    00h for all bytes
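
The byte mapping in the table lends itself to a few lines of code; the sketch below (plain Python) builds the 242-byte message that would feed the 8B CRC generator for either flit half. The Reed-Solomon generator itself is omitted, since its polynomial comes from the specification rather than this slide.

```python
def crc8_input_message(flit: bytes, half: str) -> bytes:
    """Build the 242-byte input to the 8B CRC generator per the mapping table above.

    flit: the full 256-byte flit; half: "even" or "odd".
    Unmapped generator input bytes are zero-padded (00h).
    """
    assert len(flit) == 256
    msg = bytearray(242)                       # Byte 0 .. Byte 241, default 00h
    if half == "even":
        msg[114:230] = flit[0:116]             # generator bytes 114..229 <- flit bytes 0..115
        msg[230:236] = flit[116:122]           # generator bytes 230..235 <- flit bytes 116..121
    else:
        msg[114:230] = flit[128:244]           # generator bytes 114..229 <- flit bytes 128..243
    return bytes(msg)

# Example: both halves of an arbitrary flit produce 242-byte generator inputs.
flit = bytes(range(256))
assert len(crc8_input_message(flit, "even")) == 242
assert len(crc8_input_message(flit, "odd")) == 242
```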


Flex Bus Physical Layer – Negotiation
• Utilizes PCIe Alternate Protocol Negotiation during link training at Gen1 link speed
• Performed only once, during initial link training
• CXL mode negotiation control and status bits are in the Flex Bus Port DVSEC
• Flit_Mode_Enabled is determined in Config.Linkwidth.Accept, per the PCIe specification

Negotiation sequence between Downstream Port (DP), retimer, and Upstream Port (UP):
• Phase 1 (Config.Lanenum.Wait / Config.Lanenum.Accept):
  • DP communicates common clock mode
  • DP and UP advertise CXL capabilities
  • Retimer communicates sync-header bypass capability (68B flit only)
• Phase 2 (Config.Complete):
  • DP communicates the results of the negotiation
  • UP sends an acknowledgement (reflecting what was received)
• The link then reaches L0 at Gen1


New Flex Bus Info Bits in Modified TS1/TS2
• Latency-Optimized 256B Flit Capable/Enable – DP and UP advertise the capability in Phase 1; the DP communicates whether the mode is enabled in Phase 2
• CXL.io Throttle Required at 64 GT/s – indicates that the Port does not support receiving consecutive CXL.io flits when operating at 64 GT/s link speed; enables devices to simplify hardware for power and buffer savings (e.g., a Type 3 device that only needs full bandwidth for CXL.mem traffic)
• CXL NOP Hint Info[1:0]:
  • 00 – no support for the NOP hint feature
  • 01 – requires a single NOP to switch over from the FEC+CRC path to the low-latency path
  • 10 – reserved
  • 11 – requires two back-to-back NOPs to switch over
• PBR Flit Capable/Enable – Link Layer Port-Based Routing flit; DP and UP advertise the capability in Phase 1, and the DP communicates whether the mode is enabled in Phase 2
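
As a small illustration, the CXL NOP Hint Info[1:0] encoding maps directly onto how many back-to-back NOPs a port needs before it can switch paths; a decode sketch (plain Python, illustrative only):

```python
from typing import Optional

def nops_needed_for_switchover(nop_hint_info: int) -> Optional[int]:
    """Decode CXL NOP Hint Info[1:0] from Modified TS1/TS2 (see the list above).

    Returns how many back-to-back NOPs the port needs to switch from the
    FEC+CRC path to the low-latency path, or None if the feature is unsupported.
    """
    if nop_hint_info == 0b10:
        raise ValueError("encoding 10b is reserved")
    return {0b00: None,   # no support for the NOP hint feature
            0b01: 1,      # a single NOP is sufficient
            0b11: 2}[nop_hint_info]

print(nops_needed_for_switchover(0b11))   # -> 2
```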
CXL NOP Hint[1:0]
• Switching from the high-latency FEC+CRC path to the low-latency CRC-only path requires a bubble
  • Periodic SKP OSs and NOP flits provide a natural opportunity for switchover
  • Relying only on the PCIe SKP OS insertion frequency, the system may spend more time than desired in the high-latency path
• The NOP Hint feature provides an explicit mechanism for the receiver to request that its link partner inject NOPs, providing a bubble for the switchover

(Example datapath: a combinational CRC-only low-latency path in parallel with a 4-cycle flit-accumulation + FEC + CRC pipeline; the receiver switches over in a cycle when only NOP halves, and no valid traffic, remain in the high-latency pipeline.)


CXL NOP Hint[1:0] Usage
• Each side independently communicates during initial link training (Phase 1 of Alternate Protocol Negotiation) whether a single NOP or two back-to-back NOPs are needed to switch over to the low-latency path
• If the link partner supports injecting NOPs, a port utilizes a NOP Hint to request that its link partner inject NOPs
  • A NOP Hint is a NAK with a sequence number of 0
• Detecting a CRC error switches the receiver to the higher-latency FEC path; this may trigger the Physical Layer to send a NOP Hint request (other algorithms are possible), after which the receiver switches back to the low-latency path


Negotiation Resolution Tables



Link Training Example
• Start in Detect; begin training to L0 at PCIe Gen1
• Negotiate PCIe Flit Mode per the PCIe specification (result: Flit_Mode_Enabled = 1)
• If Modified TS1/TS2 usage is not enabled: no alternate protocol negotiation
• If enabled: complete Alternate Protocol Negotiation for CXL mode; the final result enables all selected CXL protocols (drift buffer mode is an independent decision), and the link reaches L0 at 2.5 GT/s
• Unless equalization is bypassed to the highest supported NRZ data rate, complete training and equalization to 8 GT/s, 16 GT/s, 32 GT/s, and finally 64 GT/s, each with PCIe Flit Mode and drift buffer mode enabled; an SDS Ordered Set sequence is transmitted in Recovery.Idle, followed by IDLE/NULL flits with control SKP OS insertion at a fixed interval
• A Downstream Port proceeds directly; an Upstream Port waits until it receives a non-Physical-Layer flit from the link partner
• The link is then available for the CXL ARB/MUX and link layers to transmit CXL flits (‘DL_Up = 1b’)


Thank You



Lunch Break

Next Session: CXL 3.0 Protocols


Training will resume at 1:00pm.

To access Wifi, join the network “Google Guest”.




CXL 3.0 Protocols

Presenter : Nathan Kalyanasundharam, AMD & CXL Board Member


Co-Presenter : Rob Blankenship, Intel Corporation & CXL Protocol WG co-chair


Agenda
• Protocols
• Coherence/Caching Primer
• CXL.Cache Protocol
• Focus on updates in CXL3
• CXL.mem Protocol
• Focus on updates in CXL3


Protocols Supported


3 Protocols in CXL
• IO
• PCI Express baseline with a few extensions
• Usage: configuration, Address Translation, DMA, etc.
• Not the focus of this training
• Cache
• Device Caching Host Memory
• Special use for managing HDM-D coherence (“Bias Flip” flows).
• Can also be used for DMA
• 64B Granularity only
• Host Physical Address (HPA) Only.
• Memory
• Exposes device memory to the host → Host-managed Device Memory (HDM)
• 3 Types of coherence: Host-only, Device, Device with Back-Invalidate.
• Abbreviated as HDM-H, HDM-D, HDM-DB.
• 64B Granularity
• HPA Only


Caching Primer


Caching Overview
• Caching temporarily brings data closer to the consumer
• Improves latency and bandwidth using prefetching and/or locality
  • Prefetching: loading data into the cache before it is required
  • Spatial locality (locality in space): access address X, then X+n
  • Temporal locality (locality in time): multiple accesses to the same data

(Diagram: an accelerator reads data through a local data cache – ~10 ns access latency, 100+ GB/s dedicated bandwidth – backed by host memory with ~200 ns access latency and 100+ GB/s shared bandwidth.)
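
To make the two flavors of locality concrete, here is a toy cache model (plain Python, purely illustrative and unrelated to the CXL protocol itself) showing how temporal reuse and nearby addresses both become hits once a 64-byte line has been loaded:

```python
CACHELINE = 64  # bytes; CXL coherence granularity is a 64-byte cacheline

class ToyCache:
    """Minimal fully-associative cache keyed by cacheline address (illustrative only)."""
    def __init__(self):
        self.lines = set()
        self.hits = self.misses = 0

    def access(self, addr: int):
        line = addr // CACHELINE          # addresses in the same 64B line map together
        if line in self.lines:
            self.hits += 1                # temporal or spatial locality pays off here
        else:
            self.misses += 1
            self.lines.add(line)          # fetch the whole line on a miss

cache = ToyCache()
for addr in [0x1000, 0x1008, 0x1010, 0x1000]:   # spatial (+8B strides) and temporal (repeat) reuse
    cache.access(addr)
print(cache.hits, cache.misses)           # -> 3 hits, 1 miss
```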

CPU Cache/Memory Hierarchy with CXL
• Most CPUs have two or more levels of coherent cache
• Lower levels (L1) are smaller in capacity, with the lowest latency and the highest bandwidth per source
• Higher levels (L3) have less bandwidth per source but much higher capacity and support more sources
• Devices may also have multiple levels of cache

(Diagram: each CPU socket has per-core L1 (~50 KB) and L2 (~500 KB) caches, a shared L3/LLC (~10 MB), and a Home Agent with directly connected DDR memory (~10 GB) plus CXL.mem-attached memory (~10 GB each); sockets are joined by coherent CPU-to-CPU symmetric links, and devices with ~1 MB caches attach via CXL.cache, CXL.io, and PCIe. Note: cache/memory capacities are examples and not aligned to a specific product.)

Cache Consistency
• How do we make sure updates in a cache are visible to other agents?
  • Invalidate all peer caches prior to the update
  • Can be managed with software or hardware → CXL uses hardware coherence
• Define a point of "Global Observation" (aka GO) when new data from writes is visible
• Tracking granularity is a "cacheline" of data → 64 bytes for CXL
• All addresses are assumed to be Host Physical Addresses (HPA) in the CXL cache and memory protocols → translations are done using Address Translation Services (ATS)

Cache Coherence Protocol
• Modern CPU caches and CXL are built on the M, E, S, I protocol/states:
  • Modified – only in one cache; can be read or written; data is NOT up-to-date in memory
  • Exclusive – only in one cache; can be read or written; data IS up-to-date in memory
  • Shared – can be in many caches; can only be read; data IS up-to-date in memory
  • Invalid – not in cache
• M, E, S, I is tracked for each cacheline address in each cache
  • The cacheline address in CXL is Addr[51:6]
• Notes:
  • Each level of the CPU cache hierarchy follows MESI, and layers above must be consistent
  • Other extended states and flows are possible but not covered in the context of CXL
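
As a toy illustration of per-cacheline MESI tracking at the CXL granularity (a 64-byte line addressed by Addr[51:6]), here is a minimal sketch (plain Python, illustrative only, not the CXL wire protocol) that also shows how an invalidating snoop such as SnpInv, introduced on the next slide, degrades the line and surfaces modified data:

```python
CACHELINE_SHIFT = 6                        # CXL cacheline address is Addr[51:6] (64B lines)

class DeviceCache:
    """Toy per-cacheline MESI tracker (illustrative only)."""
    def __init__(self):
        self.state = {}                    # cacheline address -> 'M' | 'E' | 'S' ('I' = absent)

    def line(self, addr: int) -> int:
        return (addr >> CACHELINE_SHIFT) & ((1 << 46) - 1)   # keep bits [51:6]

    def fill(self, addr: int, state: str):
        self.state[self.line(addr)] = state

    def snoop_invalidate(self, addr: int):
        """SnpInv: degrade the line to I-state and return modified data if present."""
        prev = self.state.pop(self.line(addr), "I")
        return ("dirty data returned" if prev == "M" else "no data", prev)

cache = DeviceCache()
cache.fill(0x4000, "M")                    # device wrote this line (Modified)
print(cache.snoop_invalidate(0x4000))      # -> ('dirty data returned', 'M')
```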

How are Peer Caches Managed?
• All peer caches are managed by the "Home Agent"
• A "Snoop" is the term for the Home Agent checking the cache state, fetching the data, and causing cache-state changes
• Example CXL snoops:
  • Snoop Invalidate (SnpInv): causes a cache to degrade to I-state, and any Modified data must be returned
  • Snoop Current (SnpCurr): does not change the cache state, but does return an indication of the current state and any modified data

CXL.cache Protocol

Cache Protocol Summary
• Simple set of 15 commands; both reads and writes from the device to host memory
• Keeps the complexity of global coherence management in the host
• CXL3 enables up to 16 caching devices below each root port
  • Prior generations were limited to 1 per root port

Cache Protocol Channels
• 3 channels in each direction: D2H vs. H2D
• The Data and RSP channels are pre-allocated
• D2H requests come from the device
• H2D requests are snoops from the host
• Ordering: H2D Req (Snoop) pushes H2D RSP

Read Flow
• Diagram to show
message flows in time
• X-axis: Agents
• Y-axis: Time


Mapping Flow Back to CPU Hierarchy
• In the read flow, the peer cache can be:
  • A peer CXL device with a cache
  • A CPU cache in the local socket
  • A CPU cache in a remote socket
• The memory controller can be:
  • Native DDR on the local socket
  • Native DDR on a remote socket
  • CXL.mem on a peer device

(Diagram: the CPU cache/memory hierarchy from earlier, with the requesting CXL device, the candidate peer caches, the home agents, and the candidate memory controllers highlighted.)

Example #2: Write (CXL.cache)
• For cache writes there are three phases:
  • Ownership: the device requests ownership; the peer cache transitions S → I and the device cache transitions I → E
  • Silent write: the device writes the line in its own cache, E → M, with no messages on the link
  • Cache eviction: the device evicts the modified line, M → I, and the data is written back through the Home Agent to the memory controller

(Flow diagram legend: cache states Modified / Exclusive / Shared / Invalid; tracker allocate/deallocate at the Home Agent.)

Example #3: Streaming Write (CXL.cache)
• Direct write to the host: ownership + write in a single flow
• Relies on the completion to indicate ordering
  • May see reduced bandwidth for ordered traffic
• The host may install the data into the LLC instead of writing it to memory

(Flow diagram: the device issues the streaming write; the peer cache transitions S → I and the data is delivered to the Home Agent/memory controller.)

15 Requests in CXL.cache
• Reads: RdShared, RdCurr, RdOwn, RdAny
• Read-0: RdOwnNoData, CLFlush, CacheFlushed
• Writes: DirtyEvict, CleanEvict, CleanEvictNoData
• Streaming writes: ItoMWr, WrCur, WOWrInv, WrInv(F)

CXL Memory Protocol

Memory Protocol Summary
• Simple reads and writes from host to memory
• Memory-technology independent
  • HBM, DDR, PMem
  • Architected hooks to manage persistence
• Includes 2 bits of "meta-state" per cacheline
  • Memory-only device: up to the host to define usage
  • For accelerators: the host encodes the required cache state
• Host-managed Device Memory (HDM) comes in 3 types:
  • Host-managed coherence (HDM-H)
  • Device-managed coherence (HDM-D)
  • Device-managed coherence with Back-Invalidation (HDM-DB) – new in CXL3

Memory Protocol Channels
• 3 channels in each direction
  • M2S: Request (Req), Request with Data (RwD)
  • S2M: Non-Data Response (NDR), Data Response (DRS) – both pre-allocated
  • M2S BIRsp and S2M BISnp are used for HDM-DB to manage coherence – new in CXL3
• Limited ordering
  • Req channel ordering for HDM-D memory (CXL2 accelerators)
  • NDR channel ordering for conflict flows with HDM-DB

Host Memory Caching View – HDM-H
• CXL.cache may share the cache hierarchy with the CPU cores
• CXL.mem is treated similarly to native DDR, with optional metadata
• Two levels of coherence protocol:
  • CXL.cache to the LLC (Last Level Cache)
  • LLC to the Home Agent (host specific)

(Diagram: the CPU cache/memory hierarchy with ~16 GB of directly connected memory and ~16 GB behind each CXL.mem device, all homed by the Home Agent.)

Cache Hierarchy of HDM-D/HDM-DB
• HDM-D* adds a third level to the coherence protocol at the top: the host Home Agent acts as a proxy caching agent toward the device's DCOH
• The host Home Agent does not track device caching of HDM-D*
• The device tracks host caching of HDM-D*
• DCOH is the final point of coherence resolution for the HDM-D* address region

(Diagram: a Type 2 device exposes ~16 GB of HDM (device-attached memory) with a ~1 MB device cache for that memory and a DCOH agent; above it sits the usual host hierarchy of L1/L2/L3 caches and the Home Agent.)

Example #1: Write (CXL.mem)
• CXL memory-only device
• Media ECC is handled by the device
• HDM-H provides 2 bits of host-defined Meta Value, which the device optionally supports
• The completion is sent when the data is visible to future reads
• Note: only the host caches HDM-H (host-only coherent)

(Flow: the host memory controller issues the write with a new MetaValue to the CXL device, which commits it to the memory media.)

Example #2: Read (CXL.mem)
• Memory-only device
• A Meta Value change requires the device to write the memory media

(Flow: the host read carries the new MetaValue; the device returns the data with the old MetaValue and updates the stored metadata in the media.)

Example #3: Read, No Meta Update (CXL.mem)
• Memory-only device
• The host may indicate that no meta-state update is required on reads

(Flow: the read is marked "no meta update needed", so the device returns data without changing the stored meta-state.)

HDM-D/HDM-DB Common Attributes
• "Device Coherent" → provides the ability for both host and device to cache
• The request MetaValue field indicates the host cache state:
  • Any – the host may cache in M, E, or S states
  • Shared – the host will only cache in S state
  • Invalid – the host does not have a cached copy
• The request SnpType indicates the required device cache-state change:
  • SnpInv – invalidate the device cache
  • SnpData – device cache ends in I or S state
• The Device Coherence Engine (DCOH) is the final conflict-resolution arbiter between host and device accesses for HDM-D* memory
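
A device-side sketch of how these two request fields might be interpreted (plain Python, purely illustrative; the real DCOH behavior is implementation specific):

```python
def interpret_m2s_request(meta_value: str, snp_type: str) -> dict:
    """Interpret the MetaValue / SnpType fields described above (illustrative only)."""
    host_state = {
        "Any":     {"M", "E", "S"},   # host may hold the line in M, E, or S
        "Shared":  {"S"},             # host will only cache in S
        "Invalid": set(),             # host holds no cached copy
    }[meta_value]
    device_action = {
        "SnpInv":  "invalidate device cache (end in I)",
        "SnpData": "device cache may remain in I or S",
    }[snp_type]
    return {"host_may_cache_in": host_state, "device_cache_action": device_action}

print(interpret_m2s_request("Shared", "SnpData"))
```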

Device Coherent (HDM-D) Specifics
• CXL.mem requests indicate the coherence required from the host
• CXL.cache is used by the device to change host cache state
  • The host must detect the device accessing its own memory and trigger special flows which return a "Forward" message
  • These flows can be blocked behind accesses to host memory
  • Requires the device to implement full directory tracking (aka a Bias Table)

Full Directory → Bias Table
• Device keeps 1 or 2 bits of state per cacheline indicating whether the host has a cached copy:
  • Device Bias: no host caching, allowing direct device reads
  • Host Bias: the host may have a cached copy, so the read goes through the host
  • Optionally tracks Shared vs. Any state in the host
    • With the Shared state, the device may directly read the data, but must not modify it
• The host tracks which peer caches have copies
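
A minimal sketch of a bias-table lookup on the device side (plain Python, purely illustrative; the specification does not mandate any particular structure or policy):

```python
# Bias-table states per cacheline of HDM-D* memory (see the bullets above).
DEVICE_BIAS, HOST_SHARED, HOST_ANY = "device", "host-shared", "host-any"

def device_access(bias: str, is_write: bool) -> str:
    """Decide whether a device access to its own HDM can proceed directly
    or must first resolve host coherence (illustrative policy only)."""
    if bias == DEVICE_BIAS:
        return "access device memory directly"
    if bias == HOST_SHARED and not is_write:
        return "read directly (host holds at most S copies)"
    # Host may hold M/E copies, or we intend to modify a shared line:
    # flip the bias via the host (e.g. CXL.cache RdOwn, or Back-Invalidate for HDM-DB).
    return "resolve coherence with the host first"

print(device_access(HOST_SHARED, is_write=True))   # -> resolve coherence with the host first
```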

Host Bias Read (HDM-D)
• The MemRdFwd message is sent on the M2S Request channel after peer-cache coherence is resolved

(Flow, accelerator device with HDM-D: with BIAS=HOST, the device cache's RdOwn X goes through the DCOH to the host Home Agent on CXL.cache; the host snoops the peer cache (SnpInv X, hit S, RspI; peer goes S → I) and returns GO-E plus MemRdFwd X; the DCOH flips the bias to DEVICE, the device cache goes I → E, and data is returned from device memory.)
Device Bias Read (HDM-D)
• No messages on the CXL interface

(Flow: with BIAS=DEVICE, the device cache's RdOwn is resolved locally by the DCOH (GO-E, device cache I → E) and data is returned from device memory; the host/Home Agent and peer caches are not involved.)
Device Cache Evictions (HDM-D)
• E/M state in the device cache implies Bias=Device, so there is no indication to the host

(Flow: the device cache issues DirtyEvict X to the DCOH, receives GO-WrPull, transitions M → I, and sends the data; the DCOH writes device memory with MemWr X and receives Cmp.)
Host Bias Streaming Write (HDM-D)
• The MemWrFwd message is sent after peer-cache coherence is resolved

(Flow: with BIAS=HOST, the device issues WOWr X through the DCOH to the host; the host snoops the peer cache (SnpInv X, hit S, RspI; peer goes S → I) and returns MemWrFwd X; the DCOH issues GO-WrPull, pulls the data, writes device memory (MemWr X, Cmp), flips the bias to DEVICE, and returns ExtCmp.)

Device Bias Streaming Write (HDM-D)
• No message to the host

(Flow: with BIAS=DEVICE, the device issues WOWr X to the DCOH, which returns GO-WrPull, pulls the data, writes device memory (MemWr X, Cmp), and returns ExtCmp.)

HDM-DB Newin CXL3

12/8/2022 Confidential | CXL™ Consortium 2022 91


[AMD Official Use Only - General]

HDM-DB
• "Device Coherent with Back-Invalidation" (HDM-DB) adds the BISnp and BIRsp channels for optimized coherence management, enabling inclusive Snoop Filter (SF) architectures
• Uses the same "Bias Table" states to track host coherence: I, S, A
• An inclusive SF architecture may block an M2S Request while waiting for a Back-Invalidation Snoop (BISnp) to complete, which enables sizing the SF to match host caching instead of memory capacity

Back-Invalidation Snooping (HDM-DB)
• Enables an inclusive Snoop Filter (SF) to track host caching
• The device can block new requests while waiting for an SF victim

Block Access with BISnp
• To improve efficiency, there are BISnp messages that cover more than one cacheline (aka a "Block")
• Either 2 (128B) or 4 (256B) cachelines are supported
• The host may send a combined response or a per-cacheline response

Conflicts
• A BISnp conflicting with a pending MemRd creates a race case if the read is requesting cache state (MV = A or S) from the device
  • The host must find out whether the MemRd response is in flight before it knows how to process the BISnp
• Conflict resolution adds a handshake using M2S RwD and S2M NDR
  • The M2S RwD message is "BI-Conflict"
    • RwD is the lowest-complexity implementation option and works within the dependence graph
    • RwD causes a wasted 64B data transfer, but conflicts are not expected at high rates, so it is better to keep the flow simple and live with the overhead
  • The S2M NDR message is Conflict-Ack and must push all prior Cmp messages to the same address
• Conflict cases resolved with the handshake:
  • Late conflict: Data/Cmp-E/S are in flight from the device
  • Early conflict: the MemRd is blocked at the device
  • Without the handshake, the host cannot resolve between the two cases
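
From the host's perspective, the handshake reduces to observing whether the read completion or the Conflict-Ack arrives first, as the next two slides illustrate; a tiny classification sketch (plain Python, illustrative only):

```python
def classify_bi_conflict(observed: list) -> str:
    """Classify a BISnp/MemRd conflict from the host-observed S2M message order.

    observed: messages seen for the conflicting address, in arrival order,
    e.g. ["Conflict-Ack", "Cmp-E"] or ["Cmp-E", "Conflict-Ack"].
    """
    ack = observed.index("Conflict-Ack")
    cmp_first = any(m.startswith("Cmp") or m == "Data" for m in observed[:ack])
    if cmp_first:
        return "late conflict: device already processed MemRd; data/Cmp in flight"
    return "early conflict: device has not yet processed the MemRd"

print(classify_bi_conflict(["Conflict-Ack", "Cmp-E"]))   # -> early conflict
print(classify_bi_conflict(["Cmp-E", "Conflict-Ack"]))   # -> late conflict
```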

Early Conflict
• A Conflict-Ack observed before Cmp* means the device has not yet processed the MemRd

Late Conflict
• "Late" is defined as after the device has processed the MemRd
• A Cmp* observed before the Conflict-Ack confirms to the host that data is in flight
• The BI-Conflict message uses the RwD channel

Channel Dependence
• Req may be blocked by BISnp
• BISnp may be blocked by RwD (writebacks)
• RwD may only be blocked by response messages (NDR/DRS)

(Dependence order: Req → BISnp → RwD → NDR/DRS)
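
The blocking rules above form a strict order with the response channels at the bottom, which is what keeps the channels deadlock-free; a small sketch (plain Python, illustrative) encodes the allowed edges and checks that no channel can transitively wait on itself:

```python
# Allowed "may be blocked by" edges from the rules above.
blocked_by = {
    "Req":   {"BISnp"},
    "BISnp": {"RwD"},
    "RwD":   {"NDR", "DRS"},
    "NDR":   set(),          # responses must always sink -> never blocked
    "DRS":   set(),
}

def can_reach(src: str, dst: str) -> bool:
    """True if src can end up waiting (transitively) on dst."""
    stack, seen = [src], set()
    while stack:
        ch = stack.pop()
        for nxt in blocked_by[ch]:
            if nxt == dst:
                return True
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False

# No channel may transitively wait on itself (no dependency cycles).
assert all(not can_reach(ch, ch) for ch in blocked_by)
print("Channel dependence order is cycle-free")
```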

New Use Models with HDM-DB

Direct Peer-to-Peer (P2P) to HDM
• HDM-DB enables direct P2P to HDM from CXL or PCIe sources in CXL3
• In prior generations, all HDM accesses had to go through the host CPU to resolve coherence
• With HDM-DB, the device hosting the memory directly resolves coherence with the host before committing the P2P access

(Diagram: host H1 above a CXL switch with devices D1 (PCIe), D2 (CXL accelerator), and D3 (CXL Type-2/3 accelerator exposing HDM-DB); P2P traffic flows between devices through the switch.)

Pooled and Shared Memory
• Pooled memory and CXL switching, added in CXL2, allow dedicated assignment of memory resources to a host
• Shared memory assigned to multiple hosts is enabled in CXL3
  • Multi-host hardware-coherent shared memory is possible with HDM-DB
  • Software-coherent sharing is also supported
• More on these uses in the Fabric session

(Diagram: hosts H1–H# above a CXL switch connecting to pooled/shared memory devices D1–D#, with regions marked as hardware-coherent shared and software-coherent shared.)

Summary
• CXL protocols are evolving
• CXL2 added switching and pooled-memory capabilities
• CXL3 enables new capabilities:
  • CXL.cache scaling
  • CXL.mem Back-Invalidation channel for snoop filters, direct P2P, and multi-host coherence
  • Port-Based Routing (covered in the Fabric training)

Thank You



Extended Break

Next Session: CXL ARB/MUX and CXL.cachemem Link Layer
Training will resume at 3:30pm.

To access Wifi, join the network “Google Guest”.



CXL.cachemem Link Layer
Presented by: Swadesh Choudhary



Overview
• Link Layer Functions and CXL 256B Flit Mode Changes
• Flit Formats
• Slot Formats
• Implicit Decode and Packing rules
• Efficiency calculation examples
• Link Layer Control Messages
• Latency Optimizations



CXL.cachemem Link Layer Functions
• Flow control with the remote link partner for the different channels
• Flit packing (Tx) and unpacking (Rx)
  • New flit formats for Standard 256B and Latency-Optimized 256B flits
  • 256B flit mode has new slot formats compared to 68B flit mode
• Integrity and encryption
  • Updated to support the 256B flit formats
  • Efficiency with encryption in containment mode is higher than in 68B flit mode (the aggregation flit count is updated to 2 flits of 256B for 256B flit mode)
  • No "all-data flits" in 256B mode → better latency in the loaded case
• Reliable flit transfer (CRC/Retry)
  • Moved to the Physical Layer for 256B flit mode


Flit Format – Standard 256B
• Standard 256B flit
  • The 2B header, CRC, and FEC are populated by the Physical Layer
  • One 14B slot (Header)
  • Fourteen 16B slots (Generic)
  • 2B dedicated for credit returns
  • PBR flits are only supported with Standard 256B flits


Flit Format – Latency-Optimized
• Latency-Optimized 256B flit
  • The 2B header, CRC, and FEC are populated by the Physical Layer
  • One 14B slot (Header)
  • Thirteen 16B slots (Generic)
  • One 12B slot (Header-Subset)
  • 2B dedicated for credit returns


Decoding Slot Formats for 256B Flit Mode
• Each slot has a 4-bit encoding in bits[3:0] that defines the slot format (SlotFmt encoding)
  • Different from 68B flit mode, where a 3-bit encoding was used for the whole flit
• Packing rules are constrained to a subset of messages for UP and DP to match Transaction Layer requirements
  • Slot packing is message oriented – encodings are unique between UP and DP unless they are enabled to be sent in both directions (for example, PBR)
• CXL.cache and CXL.mem messages are not mixed in the same slot in 256B flit mode
• HS-Slots ⊂ H-Slots ⊂ G-Slots
• Slots used for PBR flits can only be G-Slots or H-Slots, and are a subset of the regular flit slots


Slot Summary G-Slots vs H-Slots



Slot summary HS-Slots vs H-Slots



Slot Packing Examples
• Example: S2M DRS
  • A G-Slot can carry up to 3 S2M DRS headers
  • An H-Slot/HS-Slot can carry up to 2 S2M DRS headers
  • PBR G- or H-Slots can carry up to 2 S2M DRS headers


Implicit Decode and Packing Rules
• Data and byte-enable slots are implicitly decoded from the preceding slots
• Data/byte-enable slots can only occupy G-Slots
• Only 64B transfers are supported in 256B flit mode
• Tightly-packed rules apply
  • The TX must not send data headers in H-Slots unless the remaining data slots to transfer are <= 16
• Maximum message-count rules apply over rolling 128B groups


Efficiency Example – 100% Read to CXL.mem

Standard 256B flit format (flit, slot contents, rollover data slots, data bytes per flit):
• Flit 0: slot 0 = 2 S2M DRS hdrs; slots 1–8 = data; slot 9 = 3 S2M DRS hdrs; slots 10–14 = data — rollover 7 — 208 bytes
• Flit 1: slot 0 = 2 S2M DRS hdrs; slots 1–14 = data — rollover 1 — 224 bytes
• Flit 2: slot 0 = 2 S2M DRS hdrs; slots 1–9 = data; slot 10 = 3 S2M DRS hdrs; slots 11–14 = data — rollover 8 — 208 bytes
• Flit 3: slot 0 = 2 S2M DRS hdrs; slots 1–14 = data — rollover 2 — 224 bytes
• Flit 4: slot 0 = 2 S2M DRS hdrs; slots 1–10 = data; slot 11 = 3 S2M DRS hdrs; slots 12–14 = data — rollover 9 — 208 bytes
• Flit 5: slot 0 = 2 S2M DRS hdrs; slots 1–14 = data — rollover 3 — 224 bytes
• Flit 6: slot 0 = 2 S2M DRS hdrs; slots 1–11 = data; slot 12 = 3 S2M DRS hdrs; slots 13–14 = data — rollover 10 — 208 bytes
• Flit 7: slot 0 = empty; slots 1–10 = data; slot 11 = 3 S2M DRS hdrs; slots 12–14 = data — rollover 9 — 208 bytes
• Flit 8: slot 0 = 2 S2M DRS hdrs; slots 1–14 = data — rollover 3 — 224 bytes
• Flit 9: slot 0 = 2 S2M DRS hdrs; slots 1–11 = data; slot 12 = 3 S2M DRS hdrs; slots 13–14 = data — rollover 10 — 208 bytes
• The pattern of flits 7, 8 and 9 repeats.

Efficiency = ((208 × 2) + 224) × 100 / (256 × 3) = 83.33% (Standard 256B flit format)

68B flit format:
Flit   Slot 0    Slot 1   Slot 2   Slot 3
1      2 DRS     D0       D0       D0
2      D0        D1       D1       D1
3      2 DRS     D1       D2       D2
4      D2        D2       D3       D3
5      2 DRS     D3       D3       D4
6      D4        D4       D4       D5
7      2 DRS     D5       D5       D5
8      D6        D6       D6       D6
9      D7        D7       D7       D7

Efficiency = (64 × 8) × 100 / (68 × 9) = 83.67% (68B flit format)
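
The two efficiency numbers are easy to re-derive from the tables; the arithmetic below (plain Python) reproduces the steady-state read-efficiency calculation for both flit formats.

```python
# Standard 256B flit, 100% reads: the steady state repeats over flits 7, 8 and 9,
# which carry 208, 224 and 208 bytes of read data respectively.
data_bytes = 208 + 224 + 208
wire_bytes = 256 * 3
print(f"256B flit read efficiency: {100 * data_bytes / wire_bytes:.2f}%")   # 83.33%

# 68B flit mode: 8 cachelines (64B each) are delivered over 9 flits of 68 bytes.
print(f"68B flit read efficiency: {100 * (64 * 8) / (68 * 9):.2f}%")        # ~83.66% (slide rounds to 83.67%)
```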


Efficiency Example – 100% Write to CXL.mem

Standard 256B flit format:
• Flit 0: slot 0 = 1 M2S RwD hdr; slots 1–4 = data; slot 5 = 1 M2S RwD hdr; slots 6–9 = data; slot 10 = 1 M2S RwD hdr; slots 11–14 = data — 192 data bytes per flit
• The pattern repeats.

Efficiency = 192 × 100 / 256 = 75% (Standard 256B flit format)

68B flit format:
Flit   Slot 0    Slot 1   Slot 2   Slot 3
1      1 RwD     D0       D0       D0
2      D0        1 RwD    D1       D1
3      D1        D1       1 RwD    D2
4      D2        D2       D2       1 RwD
5      D3        D3       D3       D3

Efficiency = 64 × 4 × 100 / (68 × 5) = 75.3% (68B flit format)
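
And the matching arithmetic for the 100% write case (plain Python):

```python
# Standard 256B flit, 100% writes: each flit carries 3 M2S RwD headers plus
# 12 data slots (3 cachelines x 4 slots x 16B = 192 bytes of write data).
print(f"256B flit write efficiency: {100 * 192 / 256:.1f}%")              # 75.0%

# 68B flit mode: 4 cachelines (64B each) are delivered over 5 flits of 68 bytes.
print(f"68B flit write efficiency: {100 * (64 * 4) / (68 * 5):.1f}%")     # 75.3%
```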


Link Layer Control Messages
• Carried in the H8 slot
• A subset is permitted in the HS8 slot
• All except IDE.MAC require the rest of the flit to be empty once they are inserted


Latency Optimization and Viral Containment
• Flow for Standard 256B Flit:

• Flow for Latency-Optimized 256B Flit:



Latency Optimization – Empty Flit
• To avoid paying the latency of waiting for the current IDLE flit to finish before starting a CXL.cachemem slot insertion (up to 6 ns in x4), Empty flits are introduced in 256B flit mode
• A special encoding in the 2B credit field indicates the flit is an "Empty flit" – no protocol or Link Layer message and no credit returns
• On TX, the default winner for the ARB/MUX can be left at the CXL.cachemem Link Layer, allowing the Link Layer to continue sending Empty flits even when there are no Transaction Layer packets
• On RX, the Physical Layer is permitted to treat Empty flits as NOPs when trying to catch up from the higher-latency (FEC) path to the lower-latency bypass path


CXL ARB/MUX



Overview
• Recap of ARB/MUX functionality

• Summary of changes in 256B Flit mode

• ALMP mapping to 256B Flits

• PM NAK example

• L0p flow outline



ARB/MUX Functions
• The ARB/MUX provides arbitration (Tx) and data steering (Rx) of the CXL.io and CXL.cache/CXL.mem flits
  • The arbitration policy is weighted round-robin, with architected registers to program the relative weights associated with the link layers
• Virtualizes and coordinates link-state transitions
  • Maintains virtual Link State Machines (vLSMs) for the respective CXL link layers
  • Coordinates state transitions with the remote ARB/MUX through link management packets (ALMPs), which feed into the arbiter on Tx and are consumed on Rx
  • Determines the link-state request for the Flex Bus Physical Layer
• The ARB/MUX is bypassed if the link trains as the PCIe protocol


Changes in 256B Flit Mode
• ALMPs go through the Tx/Rx retry buffers in the Physical Layer when operating in 256B flit mode
  • Reliable transfer is guaranteed
  • No Status Synchronization Protocol on LTSSM Recovery, etc.
• LTSSM Recovery is not exposed on the vLSM to the link layers
• The Active.PMNAK state is added to simplify handshakes
• No scenarios cause unexpected ALMPs in 256B flit mode
• L0p aggregation and negotiation with the remote link partner is the responsibility of the ARB/MUX


ALMP Mapping to Flits
• ALMPs are mapped to both Standard and Latency-Optimized flits
• No replication of bytes, since the ALMP is now protected by CRC/FEC/Retry
• A maximum of 2 ALMPs can be outstanding at a time:
  • PM/state-change handshake
  • L0p handshake


PM Flow NAK Example
• PM timeout is mapped to an uncorrectable internal error in 256B flit mode
L0p Flow
• ALMPs are used to negotiate the desired L0p width with the remote link partner before triggering the Physical Layer to change the link width
• Each link layer independently indicates its required link width to the ARB/MUX, along with a priority
• The ARB/MUX aggregates the widths and, if the desired width differs from the current link width, initiates the L0p handshake with the remote link partner
• The ARB/MUX notifies the Physical Layer of the result of the L0p negotiation
  • The Physical Layer takes the L0p action at the next SKP Ordered Set (SOS) boundary (i.e., sends EIOS on the appropriate lanes)
  • The Physical Layer notifies the ARB/MUX of successful L0p entry by reflecting the L0p width back to the ARB/MUX (once both Tx and Rx lanes are at the desired width)
  • The ARB/MUX propagates the L0p width to the upper layers
• There is no L0p DLLP communication from CXL.io; L0p does not impact vLSM state
• The LTSSM going to Recovery restores the link to its full configured width


Thank You



Thank You for Participating
Please complete the CXL 3.0 Technical Training survey linked above.