CXL - 3.0 - Training - PPT - Day 1 - v1
CXL 3.0 Technical Training
To access Wi-Fi, join the network “Google Guest”.
CXL 1.1 Specification Released
• Proliferation of Cloud Computing
• Growth of AI & Analytics
• CXL.io: Discovery, Configuration, Initialization, Interrupts, DMA, ATS
• CXL.cache: Coherent Requests
• CXL.memory: Memory Flows
Asymmetric Complexity
• Eases burdens of cache coherence interface designs for devices
[Diagram: accelerator with Accelerator Logic, Caching Agent, and DTLB; Memory Controller with Optional Device Memory]
Representative CXL Usages with CXL 1.0

TYPE 1 — Caching Devices / Accelerators
• Protocols: CXL.io + CXL.cache
• Example: NIC with a device cache, caching host memory
TYPE 2 — Accelerators with Memory
• Protocols: CXL.io + CXL.cache + CXL.memory
• Example: accelerator with a cache and HBM device memory
TYPE 3 — Memory Buffers
• Protocols: CXL.io + CXL.memory
• Example: memory controller with attached memory (memory expansion)
[Diagram: each device type attached over CXL to a processor with native DDR memory]
[Diagram: hosts H1–H# connected through a CXL switch to devices D1–D#]
• Multi-Logical Devices (MLDs) allow for finer-grained memory allocation
CXL 2.0
• Supports single-level switching
• Enables memory expansion and resource allocation
• Peer-to-peer
[Diagram: host H1 connected through a CXL Switch to devices D1–D#]
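As a rough illustration of how MLDs enable finer-grained allocation, here is a minimal sketch of a fabric manager carving one pooled device among hosts. The class and method names (MultiLogicalDevice, FabricManager, assign) are illustrative, not from the specification; only the 16-LD-per-MLD limit reflects CXL 2.0.

```python
# Illustrative model of CXL 2.0 memory pooling with a Multi-Logical Device.
# Names are hypothetical; real allocation is performed by a Fabric Manager
# through the management interfaces defined in the CXL specification.

class MultiLogicalDevice:
    """A pooled memory device exposing up to 16 logical devices (LDs)."""
    MAX_LDS = 16  # CXL 2.0 allows up to 16 LDs per MLD

    def __init__(self, total_gib):
        self.total_gib = total_gib
        self.allocations = {}  # ld_id -> (host, gib)

    def free_gib(self):
        return self.total_gib - sum(g for _, g in self.allocations.values())

class FabricManager:
    """Assigns LD-IDs of an MLD to hosts for finer-grained allocation."""
    def assign(self, mld, host, gib):
        if len(mld.allocations) >= mld.MAX_LDS:
            raise RuntimeError("no free LD-ID")
        if gib > mld.free_gib():
            raise RuntimeError("insufficient pooled capacity")
        ld_id = next(i for i in range(mld.MAX_LDS) if i not in mld.allocations)
        mld.allocations[ld_id] = (host, gib)
        return ld_id

mld = MultiLogicalDevice(total_gib=512)
fm = FabricManager()
print(fm.assign(mld, "H1", 128))  # LD-ID 0
print(fm.assign(mld, "H2", 64))   # LD-ID 1
```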
CXL 3.0 Spec Feature Summary

Features                                      | CXL 1.0/1.1   | CXL 2.0       | CXL 3.0
Release date                                  | 2019          | 2020          | 1H 2022
Max link rate                                 | 32 GT/s       | 32 GT/s       | 64 GT/s
Flit 68 byte (up to 32 GT/s)                  | Yes           | Yes           | Yes
Flit 256 byte (up to 64 GT/s)                 | --            | --            | Yes
Type 1, Type 2 and Type 3 devices             | Yes           | Yes           | Yes
Memory pooling w/ MLDs                        | --            | Yes           | Yes
Global Persistent Flush                       | --            | Yes           | Yes
CXL IDE                                       | --            | Yes           | Yes
Switching (single-level)                      | --            | Yes           | Yes
Switching (multi-level)                       | --            | --            | Yes
Direct memory access for peer-to-peer         | --            | --            | Yes
Symmetric coherency (256 byte flit)           | --            | --            | Yes
Memory sharing (256 byte flit)                | --            | --            | Yes
Multiple Type 1/Type 2 devices per root port  | Not supported | Not supported | Yes
Fabric capabilities (256 byte flit)           | --            | --            | Supported
CXL 3.0 Switching
1. Multiple switch levels (aka cascade)
   • Supports fanout of all device types
2. Each host’s root port can connect to more than one device type
[Diagram: hosts H1–H# connected through cascaded CXL Switches to devices D1–D#, with a Fabric Manager attached]
Terminology
• 256B Flits (standard and latency-optimized): new CXL flit formats based on the PCIe flit
• RCD mode (Restricted CXL Device mode): CXL operating mode limited by specific restrictions such as no hot-plug, RCiEP, no switch support, and 68B Flit mode only
• eRCH (Exclusive Restricted CXL Host): previously a CXL 1.1-only host
• Virtual Hierarchy (VH): everything from the CXL RP down, including the CXL RP, the CXL PPBs, and the CXL Endpoints
• VH mode: mode of operation where the CXL RP is the root of the hierarchy
• 68B Flit and VH capable/enable (in modified TS1/TS2): CXL 2.0 capable
[Diagram: Link Layer to Flex Bus Physical Layer handoff — 68B mode: TLP + Seq#, 512-bit flit + 16-bit CRC/unused + 16-bit Protocol ID; 256B mode: 256B CXL flit]
• PCIe uses two mechanisms to address the higher BER and correct errors:
  • A lightweight FEC (Forward Error Correction)
  • A strong CRC followed by Link Level Retry (selective/standard) upon detection of an error post-FEC
256B Flit Header Fields
• Prior Flit Type (bit [5]):
  • 0 = prior flit was a non-retryable Empty, NOP, or IDLE flit (not allocated into the Retry buffer)
  • 1 = prior flit was a Payload flit or a retryable Empty flit (allocated into the Retry buffer)
• Type of DLLP Payload (bit [4]): if Flit Type = CXL.io Payload or CXL.io NOP, use as defined in the PCIe Base Specification
• Replay Command[1:0] (bits [3:2]): same as defined in the PCIe Base Specification
• Flit Sequence Number[9:0] (bits {[1:0], [15:8]}): 10-bit sequence number as defined in the PCIe Base Specification
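To make the bit positions above concrete, here is a minimal sketch of pulling those fields out of a 16-bit header word. The function is illustrative, and reading the {[1:0], [15:8]} concatenation as bits [1:0] supplying the two most-significant sequence-number bits is an assumption based on the notation above.

```python
# Hypothetical parser for the 256B flit header fields listed above.
# Prior Flit Type = bit 5, Type of DLLP Payload = bit 4,
# Replay Command = bits 3:2, sequence number = {bits [1:0], bits [15:8]}.

def parse_flit_header(hdr16: int) -> dict:
    assert 0 <= hdr16 <= 0xFFFF
    seq_hi = hdr16 & 0x3            # bits [1:0] -> assumed seq bits [9:8]
    seq_lo = (hdr16 >> 8) & 0xFF    # bits [15:8] -> assumed seq bits [7:0]
    return {
        "prior_flit_type": (hdr16 >> 5) & 0x1,
        "dllp_payload_type": (hdr16 >> 4) & 0x1,
        "replay_command": (hdr16 >> 2) & 0x3,
        "sequence_number": (seq_hi << 8) | seq_lo,
    }

print(parse_flit_header(0b1010_0011_0010_0111))
```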
Flit Type encodings (Link Layer):
• 10b — CXL.cachemem Payload: valid CXL.cachemem slot and/or CXL.cachemem credit payload. Allocated into the Retry buffer: yes.
• 10b — CXL.cachemem Empty: no valid CXL.cachemem payload; generated when the CXL.cachemem Link Layer speculatively arbitrates to transfer a flit (to reduce idle-to-valid transition time) but no valid CXL.cachemem payload arrives in time to use any slots in the flit. If retryable, allocated into the Tx Retry buffer only.
• 11b — ALMP: ARB/MUX Link Management Packet. Allocated into the Retry buffer: yes.
Latency-Optimized 256B Flits
• Optionally supported flit format; negotiated during link training (no dynamic switching)
• Comprised of 128-byte even and odd flit halves, each protected by its own 6-byte CRC
• Even flit half can be consumed without waiting for the second half to be received
• Flit accumulation latency savings increase for narrower or slower links

CXL Standard 256B Flit:
  [FlitHdr (2 bytes) | FlitData (126 bytes)]
  [FlitData (114 bytes) | CRC (8 bytes) | FEC (6 bytes)]
CXL Standard 256B Flit (with DLLP):
  [FlitHdr (2 bytes) | FlitData (126 bytes)]
  [FlitData (110 bytes) | DLLP (4 bytes) | CRC (8 bytes) | FEC (6 bytes)]
CXL Latency-Optimized 256B Flit:
  Even half: [FlitHdr (2 bytes) | FlitData (120 bytes) | CRC (6 bytes)]
  Odd half:  [FlitData (116 bytes) | FEC (6 bytes) | CRC (6 bytes)]
CXL Latency-Optimized 256B Flit (with DLLP):
  Even half: [FlitHdr (2 bytes) | FlitData (120 bytes) | CRC (6 bytes)]
  Odd half:  [FlitData (112 bytes) | DLLP (4 bytes) | FEC (6 bytes) | CRC (6 bytes)]
CRC/FEC handling for latency-optimized flits:
• Even-half CRC pass, odd-half CRC pass: consume the flit (no FEC decode needed).
• Even-half CRC pass, odd-half CRC fail: OK to consume the even half; perform FEC decode and correct, then re-check CRC:
  • Pass / Pass: consume anything not previously consumed.
  • Pass / Fail: OK to consume the even half if not previously consumed; request Retry.
• Even-half CRC fail: perform FEC decode and correct, then re-check CRC:
  • Pass / Pass: consume anything not previously consumed.
  • Pass / Fail: OK to consume the even half if not previously consumed; request Retry.
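A compact way to read the cases above is as a two-stage decision function: pre-FEC CRC results decide whether FEC decode is needed at all, and post-FEC CRC results decide what else can be consumed. This is a simplified sketch (real hardware pipelines these checks); the function name and return values are illustrative.

```python
# Decision flow for latency-optimized 256B flit reception, following the
# case list above. Returns the receiver's actions in order.

def rx_actions(even_crc_ok, odd_crc_ok,
               post_fec_even_ok=None, post_fec_odd_ok=None):
    actions = []
    if even_crc_ok and odd_crc_ok:
        return ["consume flit"]              # no FEC decode needed
    if even_crc_ok:
        actions.append("consume even half")  # safe before FEC decode
    actions.append("FEC decode and correct")
    # Re-check both CRCs after FEC correction:
    if post_fec_even_ok and post_fec_odd_ok:
        actions.append("consume anything not previously consumed")
    elif post_fec_even_ok:
        actions.append("consume even half if not previously consumed")
        actions.append("request retry")
    else:
        actions.append("request retry")
    return actions

print(rx_actions(True, True))
print(rx_actions(True, False, post_fec_even_ok=True, post_fec_odd_ok=True))
print(rx_actions(False, False, post_fec_even_ok=True, post_fec_odd_ok=False))
```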
For the 6B CRC, each flit half is zero-extended and fed into the corresponding bytes of the 8B CRC generator:

Generator input bytes  | Even flit half                  | Odd flit half
Byte 0 to Byte 113     | 00h for all bytes               | 00h for all bytes
Byte 114 to Byte 229   | Byte 0 to Byte 115 of the flit  | Byte 128 to Byte 243 of the flit
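The idea is that one fixed-width CRC generator serves both flit halves by zero-filling the input bytes a half does not use. A minimal sketch of that padding scheme, using Python’s binascii.crc32 purely as a stand-in — the real CXL CRC polynomial and width are different.

```python
import binascii

# Sketch of reusing one fixed-width CRC generator for a shorter input by
# zero-padding, as described above for the 6B CRC / 8B CRC generator.
# NOTE: crc32 is a stand-in; the actual CXL polynomial/width differ.

GEN_INPUT_LEN = 230  # generator consumes input bytes 0..229

def crc_of_half(half_bytes: bytes, offset: int = 114) -> int:
    assert offset + len(half_bytes) <= GEN_INPUT_LEN
    padded = bytes(offset) + half_bytes            # bytes 0..113 are 00h
    padded += bytes(GEN_INPUT_LEN - len(padded))   # pad tail if shorter
    return binascii.crc32(padded)

even_half = bytes(range(116))        # stand-in for flit bytes 0..115
odd_half = bytes(range(128, 244))    # stand-in for flit bytes 128..243
print(hex(crc_of_half(even_half)), hex(crc_of_half(odd_half)))
```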
[Diagram: receive latency paths at L0/Gen1 — flit accumulation, then CRC check on the low-latency path; FEC decode (4-cycle pipeline) + CRC check on the high-latency path]
Support for NOP Hints during Phase 1 of APN: YES / No
CXL 3.0 Protocols
Agenda
• Protocols
• Coherence/Caching Primer
• CXL.cache Protocol
  • Focus on updates in CXL 3.0
• CXL.mem Protocol
  • Focus on updates in CXL 3.0
Protocols Supported
3 protocols in CXL:
• IO
  • PCI Express baseline with a few extensions
  • Usage: configuration, address translation, DMA, etc.
  • Not the focus of this training
• Cache
  • Device caching of host memory
  • Special use for managing HDM-D coherence (“Bias Flip” flows)
  • Can also be used for DMA
  • 64B granularity only
  • Host Physical Address (HPA) only
• Memory
  • Exposes device memory to the host as Host-managed Device Memory (HDM)
  • 3 types of coherence: Host-only, Device, and Device with Back-Invalidate
  • Abbreviated as HDM-H, HDM-D, and HDM-DB
  • 64B granularity
  • HPA only
Caching Primer
Caching Overview
• Caching temporarily brings data closer to the consumer
  • Host memory: access latency ~200 ns, shared bandwidth 100+ GB/s
  • Local data cache: access latency ~10 ns, dedicated bandwidth 100+ GB/s
• Prefetching: loading data into the cache before it is required
• Spatial locality (locality in space): accesses to nearby addresses
• Temporal locality (locality in time): multiple accesses to the same data
[Diagram: accelerator issuing Read/Data to host memory vs. its local data cache]
Cache Hierarchy
• Higher levels (e.g., L3) have less bandwidth per source and support more sources
• Devices may also have multiple levels of cache
[Diagram: CPU Socket 0 with per-CPU L1 (~50 KB) and L2 caches and a shared L3; CXL.cache/CXL.io devices with ~1 MB caches and PCIe devices attached]
Note: Cache/memory capacities are examples and not aligned to a specific product.
Cache Consistency
• How do we make sure updates in a cache are visible to other agents?
  • Invalidate all peer caches prior to the update
  • Can be managed with software or hardware; CXL uses hardware coherence
• Define a point of “Global Observation” (aka GO) at which new data from writes is visible
• Notes:
  • Each level of the CPU cache hierarchy follows MESI, and layers above must be consistent
  • Other extended states and flows are possible but not covered in the context of CXL
• A “Snoop” is the mechanism by which the Home Agent checks cache state, fetches data, and causes cache state changes
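As a concrete illustration of the invalidate-before-update rule and GO, here is a toy hardware-coherence model in which a home agent snoop-invalidates all peer caches before granting a writer Modified state. This is a teaching sketch, not the CXL.cache protocol; all class and method names are illustrative.

```python
# Toy MESI-style coherence: the home agent invalidates peer caches before a
# writer may transition to Modified, so the write's Global Observation (GO)
# point is well defined. Illustrative only; not the CXL.cache message set.

M, E, S, I = "M", "E", "S", "I"

class Cache:
    def __init__(self, name):
        self.name, self.lines = name, {}   # addr -> (state, data)

class HomeAgent:
    def __init__(self, caches, memory):
        self.caches, self.memory = caches, memory

    def snoop_invalidate(self, addr, requester):
        """Snoop all peers: write back dirty data, then invalidate."""
        for c in self.caches:
            if c is requester or addr not in c.lines:
                continue
            state, data = c.lines.pop(addr)
            if state == M:                 # dirty: write back first
                self.memory[addr] = data

    def write(self, requester, addr, data):
        self.snoop_invalidate(addr, requester)   # invalidate peers first
        requester.lines[addr] = (M, data)        # GO: now globally observable

mem = {0x40: "old"}
c0, c1 = Cache("c0"), Cache("c1")
c1.lines[0x40] = (S, "old")
ha = HomeAgent([c0, c1], mem)
ha.write(c0, 0x40, "new")
print(c0.lines, c1.lines)   # c0 holds M/"new"; c1 is invalidated
```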
CXL.cache Protocol
Read Flow
• Diagrams show message flows in time
  • X-axis: agents
  • Y-axis: time
[Diagram: read flow across agents — device cache, CXL.cache link, Home Agent on the CPU socket, peer caches, and memory]
Memory behind the Home Agent can be:
• Native DDR on the local socket
• Native DDR on a remote socket
• CXL.mem on a peer device
Ownership: Silent Write and Cache Eviction
• Ownership: the device requests ownership; the host snoops peer caches (S to I) and grants Exclusive (I to E)
• Silent Write: once ownership is held, the device writes with no message on the link (<Silent Wr>, E to M)
• Cache Eviction: dirty data is written back to the host (M to I, Data)
[Legend — cache states: Modified, Exclusive, Shared, Invalid; tracker events: Allocate Tracker, Deallocate Tracker]
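The three transitions above can be read as one lifecycle of a device cache line. A toy sketch of that lifecycle follows (states are MESI; the message names in the comments mirror the flow above and are used descriptively, not as exact wire encodings):

```python
# Toy lifecycle of a device cache line under CXL.cache ownership flows:
# RdOwn -> GO-E grants Exclusive; a silent write upgrades E->M with no
# message on the link; DirtyEvict writes Modified data back (M->I).

class DeviceCacheLine:
    def __init__(self):
        self.state, self.data = "I", None

    def acquire_ownership(self, host_data):
        # Device sends RdOwn; host snoop-invalidates peers, returns GO-E + data.
        assert self.state == "I"
        self.state, self.data = "E", host_data

    def silent_write(self, new_data):
        # No CXL.cache message: ownership already held, so E->M is silent.
        assert self.state in ("E", "M")
        self.state, self.data = "M", new_data

    def dirty_evict(self):
        # Device sends DirtyEvict; host responds GO-WrPull; device sends Data.
        assert self.state == "M"
        data, self.state, self.data = self.data, "I", None
        return data  # written back to the host

line = DeviceCacheLine()
line.acquire_ownership(host_data=b"old")
line.silent_write(b"new")
print(line.state, line.dirty_evict(), line.state)  # M b'new' I
```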
Ordered Writes
• Rely on the completion to indicate ordering
• May see reduced bandwidth for ordered traffic
• The host may install data into the LLC instead of writing to memory
[Diagram: write flow with peer snoop (S to I) and Data transfer; legend as above]
15 Request Types in CXL.cache
CXL.mem Protocol
Limited Ordering
• Req channel: used for HDM-D memory (CXL 2.0 accelerators)
• NDR channel: used for conflict flows with HDM-DB
2 Levels of Coherence Protocols
• CXL.$ to the LLC (Last Level Cache)
• LLC to the Home Agent (host specific)
[Diagram: CPU Socket 0 cache hierarchy (L1/L2 per CPU, shared LLC) with CXL.cache/CXL.io devices (~1 MB L1/L2) attached]
CXL.mem Coherence Tracking
• The host Home Agent does not track device caching of HDM-D*
• The device tracks host caching of HDM-D*
• DCoh is the final point of coherence resolution for HDM-D*
[Diagram: CPU Socket 0 and CPU Socket 1 joined by coherent CPU-to-CPU symmetric links; Home Agent and ~10 MB L3 (aka LLC) on the host; T1/T2 devices with proxy caching agents attached via CXL.mem]
Device Media and Meta Values
• Media ECC is handled by the device’s memory controller
• A Meta Value change requires the device to write to the media (new MetaValue)
• Reads of the old MetaValue need no meta update
[Diagram: Host ↔ CXL Device (memory controller ↔ memory media) for both cases]
HDM-D/HDM-DB Common Attributes
• “Device Coherent”: provides the ability for both host and device to cache
• The Device Coherence Engine (DCoh) is the final conflict-resolution arbiter between host and device accesses for HDM-D* memory
HDM-D Bias Flip Flows

Device read, BIAS=HOST — MemRdFwd message sent on the M2S Request channel; coherence resolved by the host:
• Device (I) issues RdOwn on CXL.cache for its own HDM-D address
• Host snoops its caches (SnpInv; Hit-S, RspI; S to I)
• Host sends MemRdFwd on the M2S Request channel with GO-E; the device line goes I to E
• BIAS flips to DEVICE; the device reads Data from its local memory

Device access, BIAS=DEVICE — no messages on the CXL interface:
• The device accesses its HDM-D locally; E/M in the device cache imply Bias=Device

Dirty eviction, BIAS=DEVICE:
• Device cache (M) issues DirtyEvict; GO-WrPull pulls the data (M to I)
• Data is committed with MemWr / Cmp

Weakly-ordered write, BIAS=HOST — MemWrFwd message sent; coherence resolved by the host:
• Device (I) issues WOWr*; host snoops its caches (SnpInv; Hit-S, RspI; S to I)
• Host sends MemWrFwd; the write then completes (GO-WrPull, Data, MemWr, Cmp, ExtCmp)
• BIAS flips to DEVICE

Weakly-ordered write, BIAS=DEVICE — no message to the host:
• Device (I) issues WOWr*; DCoh resolves it locally (GO-WrPull, Data, MemWr, Cmp, ExtCmp)
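A compact way to see the bias rule: accesses complete locally when bias is DEVICE, and must first flip bias through the host when it is HOST. A toy sketch of that decision for the read flow above (not the actual message encodings; the class name and log strings are illustrative):

```python
# Toy model of HDM-D bias: device-local access is free when BIAS=DEVICE;
# when BIAS=HOST the device first requests ownership via CXL.cache (RdOwn),
# the host snoops itself and replies MemRdFwd, and bias flips to DEVICE.

class HdmDLine:
    def __init__(self):
        self.bias = "HOST"   # the host may be caching this line

    def device_read(self, log):
        if self.bias == "HOST":
            log += ["D2H RdOwn", "host SnpInv (S->I)", "M2S MemRdFwd + GO-E"]
            self.bias = "DEVICE"           # bias flip complete
        log.append("local media read")     # no CXL messages once BIAS=DEVICE
        return log

line = HdmDLine()
print(line.device_read([]))  # full bias-flip sequence on first access
print(line.device_read([]))  # ['local media read'] only
```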
HDM-DB Conflicts
• A BISnp conflicting with a pending MemRd creates a race case if the read is requesting cache state (MV=A,S) from the device
• The host must find out whether the MemRd response is in flight before it knows how to process the BISnp
• Conflict resolution adds a handshake using M2S RwD and S2M NDR:
  • M2S RwD message “BI-Conflict”
    • RwD is the lowest-complexity option for implementation and works within the dependence graph
    • RwD causes a wasted 64B data transfer, but conflicts are not expected at high rates, so it is better to keep the flow simple and live with the overhead
  • S2M NDR message “Conflict-Ack”, which must push all prior Cmp messages to the same address
• Conflict cases resolved with the handshake:
  • Late conflict: Data/Cmp-E/S are in flight from the device
  • Early conflict: the MemRd is blocked at the device
  • Without the handshake the host cannot distinguish between the two cases
Early Conflict
• Conflict-Ack observed before Cmp* means the device has not yet processed the MemRd

Late Conflict
• Late is defined as after the device has processed the MemRd
• Cmp* observed before Conflict-Ack confirms to the host that data is in flight
• The BI-Conflict message uses the RwD channel
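The host’s disambiguation reduces to message ordering: because Conflict-Ack must push all prior Cmp messages to the same address, whichever of Cmp* or Conflict-Ack arrives first tells the host which case it is in. A small sketch of that rule (illustrative strings, not spec message formats):

```python
# The host classifies a BISnp/MemRd conflict purely by arrival order:
#   Conflict-Ack before Cmp*  -> early conflict (MemRd blocked at device)
#   Cmp* before Conflict-Ack  -> late conflict (Data/Cmp already in flight)

def classify_conflict(arrivals):
    """arrivals: S2M messages for the conflicting address, in receive order."""
    for msg in arrivals:
        if msg == "Conflict-Ack":
            return "early conflict: device has not processed the MemRd"
        if msg.startswith("Cmp"):
            return "late conflict: MemRd data is in flight; consume it first"
    raise ValueError("handshake incomplete")

print(classify_conflict(["Conflict-Ack", "Cmp-E"]))
print(classify_conflict(["Cmp-S", "Conflict-Ack"]))
```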
Channel Dependence
• Req may be blocked by BISnp
• RwD may only be blocked by response messages (NDR/DRS)
[Diagram: channel dependence graph over Req, RwD, BISnp, NDR, DRS]
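Those blocking rules form a dependence graph that must stay acyclic for forward progress (a cycle would permit deadlock). A tiny sketch checking that property; the edge set is my reading of the two bullets above, and the assumption that BISnp/NDR/DRS always drain is illustrative.

```python
# Channel dependence as a graph: an edge A -> B means "A may be blocked by B".
# Forward progress requires this graph to be acyclic.

DEPENDS = {
    "Req": ["BISnp"],                    # Req may be blocked by BISnp
    "RwD": ["NDR", "DRS"],               # RwD blocked only by responses
    "BISnp": [], "NDR": [], "DRS": [],   # assumed to always drain
}

def has_cycle(graph):
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {n: WHITE for n in graph}
    def dfs(n):
        color[n] = GRAY
        for m in graph[n]:
            if color[m] == GRAY or (color[m] == WHITE and dfs(m)):
                return True
        color[n] = BLACK
        return False
    return any(color[n] == WHITE and dfs(n) for n in graph)

print("deadlock-free ordering possible:", not has_cycle(DEPENDS))
```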
Peer-to-Peer to HDM-DB
• HDM-DB will directly resolve coherence with the host before committing the P2P access
[Diagram: PCIe/CXL accelerators (D1–D3) performing P2P accesses to a CXL Type-2/3 device’s HDM-DB]
Summary
• CXL protocols are evolving
Thank You
S2M DRS Packing Example

256B Flit Format (15 slots per flit):
Flit | Slot usage                                    | Data bytes | Rollover
2    | 1–9 data; 10: 3 S2M DRS headers; 11–14 data   | 208        | 8
3    | 0: 2 S2M DRS headers; 1–14 data               | 224        | 2
4    | 1–10 data; 11: 3 S2M DRS headers; 12–14 data  | 208        | 9
5    | 0: 2 S2M DRS headers; 1–14 data               | 224        | 3
6    | 1–11 data; 12: 3 S2M DRS headers; 13–14 data  | 208        | 10
7    | 1–10 data; 11: 3 S2M DRS headers; 12–14 data  | 208        | 9
8    | 0: 2 S2M DRS headers; 1–14 data               | 224        | 3
9    | 1–11 data; 12: 3 S2M DRS headers; 13–14 data  | 208        | 10
Flits 7, 8 and 9 pattern repeats.

68B Flit Format (4 slots per flit):
Flit | Slot 0 | Slot 1 | Slot 2 | Slot 3
0    | Empty  |        |        |
1    | 2 DRS  | D0     | D0     | D0
2    | D0     | D1     | D1     | D1
3    | 2 DRS  | D1     | D2     | D2
4    | D2     | D2     | D3     | D3
5    | 2 DRS  | D3     | D3     | D4
6    | D4     | D4     | D4     | D5
7    | 2 DRS  | D5     | D5     | D5
8    | D6     | D6     | D6     | D6
9    | D7     | D7     | D7     | D7
Efficiency = (64*8)*100/(68*9) = 83.67% (computed for the 68B Flit Format)
M2S RwD Packing Example

256B Flit Format:
Flit | Slot usage                                                                 | Data bytes | Rollover
0    | 0: 1 M2S RwD; 1–4 data; 5: 1 M2S RwD; 6–9 data; 10: 1 M2S RwD; 11–14 data  | 192        | 0
Pattern repeats.

68B Flit Format:
Flit | Slot 0 | Slot 1 | Slot 2 | Slot 3
1    | 1 RwD  | D0     | D0     | D0
2    | D0     | 1 RwD  | D1     | D1
3    | D1     | D1     | 1 RwD  | D2
4    | D2     | D2     | D2     | 1 RwD
5    | D3     | D3     | D3     | D3
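The efficiency figures follow directly from the slot accounting above: each 64B cacheline occupies four 16B slots plus a share of header slots. A quick check of both examples (flit counts taken from the tables; the RwD percentage is computed the same way as the DRS one on the slide):

```python
# Verify the packing-efficiency numbers from the tables above.
# A 68B flit carries 4 x 16B slots plus 2B protocol ID + 2B CRC.

def efficiency(cachelines, flits, flit_bytes):
    """Payload bytes delivered per raw link bytes consumed, in percent."""
    return cachelines * 64 * 100 / (flits * flit_bytes)

# S2M DRS: 8 cachelines (D0..D7) packed into 9 x 68B flits.
print(f"DRS, 68B flits: {efficiency(8, 9, 68):.2f}%")   # 83.67%

# M2S RwD: 4 cachelines (D0..D3) packed into 5 x 68B flits
# (each RwD header consumes a slot, pushing data into a 5th flit).
print(f"RwD, 68B flits: {efficiency(4, 5, 68):.2f}%")   # 75.29%

# 256B standard-flit DRS steady state: flits 7-9 carry 208+224+208 data bytes.
print(f"DRS, 256B flits: {(208 + 224 + 208) * 100 / (3 * 256):.2f}%")  # 83.33%
```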
• PM NAK example
• L0p aggregation and negotiation with the remote link partner is the responsibility of the ARB/MUX