CXL - 3.0 - Training - PPT - Day 1 - v1
CXL 3.0 Technical Training
To access Wi-Fi, join the network “Google Guest”.
CXL 1.1 Specification Released
• Proliferation of Cloud Computing
• Growth of AI & Analytics
• CXL.io: Discovery, Configuration, Initialization, Interrupts, DMA, ATS
• CXL.cache: Coherent Requests
• CXL.memory: Memory Flows
Asymmetric Complexity
• Eases burdens of cache coherence interface designs for devices
[Diagram: accelerator with Accelerator Logic, Caching Agent, and DTLB; Memory Controller with Optional Device Memory]
Representative CXL Usages with CXL 1.0

TYPE 1 — Caching Devices / Accelerators
• Protocols: CXL.io + CXL.cache
• Example: NIC with a device cache, caching host memory
TYPE 2 — Accelerators with Memory
• Protocols: CXL.io + CXL.cache + CXL.memory
• Example: accelerator with a cache and HBM device memory
TYPE 3 — Memory Buffers
• Protocols: CXL.io + CXL.memory
• Example: memory controller with attached memory (memory expansion)
[Diagram: each device type attached over CXL to a processor with native DDR memory]
[Diagram: hosts H1–H# connected through a CXL switch to devices D1–D#]
• Multi-Logical Devices (MLDs) allow for finer-grained memory allocation
CXL 2.0
• Supports single-level switching
• Enables memory expansion and resource allocation
• Peer-to-peer
[Diagram: host H1 connected through a CXL Switch to devices D1–D#]
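As a rough illustration of how MLDs enable finer-grained allocation, here is a minimal sketch of a fabric manager carving one pooled device among hosts. The class and method names (MultiLogicalDevice, FabricManager, assign) are illustrative, not from the specification; only the 16-LD-per-MLD limit reflects CXL 2.0.

```python
# Illustrative model of CXL 2.0 memory pooling with a Multi-Logical Device.
# Names are hypothetical; real allocation is performed by a Fabric Manager
# through the management interfaces defined in the CXL specification.

class MultiLogicalDevice:
    """A pooled memory device exposing up to 16 logical devices (LDs)."""
    MAX_LDS = 16  # CXL 2.0 allows up to 16 LDs per MLD

    def __init__(self, total_gib):
        self.total_gib = total_gib
        self.allocations = {}  # ld_id -> (host, gib)

    def free_gib(self):
        return self.total_gib - sum(g for _, g in self.allocations.values())

class FabricManager:
    """Assigns LD-IDs of an MLD to hosts for finer-grained allocation."""
    def assign(self, mld, host, gib):
        if len(mld.allocations) >= mld.MAX_LDS:
            raise RuntimeError("no free LD-ID")
        if gib > mld.free_gib():
            raise RuntimeError("insufficient pooled capacity")
        ld_id = next(i for i in range(mld.MAX_LDS) if i not in mld.allocations)
        mld.allocations[ld_id] = (host, gib)
        return ld_id

mld = MultiLogicalDevice(total_gib=512)
fm = FabricManager()
print(fm.assign(mld, "H1", 128))  # LD-ID 0
print(fm.assign(mld, "H2", 64))   # LD-ID 1
```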
CXL 3.0 Spec Feature Summary

Features                                      | CXL 1.0/1.1   | CXL 2.0       | CXL 3.0
Release date                                  | 2019          | 2020          | 1H 2022
Max link rate                                 | 32 GT/s       | 32 GT/s       | 64 GT/s
Flit 68 byte (up to 32 GT/s)                  | Yes           | Yes           | Yes
Flit 256 byte (up to 64 GT/s)                 | --            | --            | Yes
Type 1, Type 2 and Type 3 devices             | Yes           | Yes           | Yes
Memory pooling w/ MLDs                        | --            | Yes           | Yes
Global Persistent Flush                       | --            | Yes           | Yes
CXL IDE                                       | --            | Yes           | Yes
Switching (single-level)                      | --            | Yes           | Yes
Switching (multi-level)                       | --            | --            | Yes
Direct memory access for peer-to-peer         | --            | --            | Yes
Symmetric coherency (256 byte flit)           | --            | --            | Yes
Memory sharing (256 byte flit)                | --            | --            | Yes
Multiple Type 1/Type 2 devices per root port  | Not supported | Not supported | Yes
Fabric capabilities (256 byte flit)           | --            | --            | Supported
CXL 3.0 Switching
1. Multiple switch levels (aka cascade)
   • Supports fanout of all device types
2. Each host’s root port can connect to more than one device type
[Diagram: hosts H1–H# connected through cascaded CXL Switches to devices D1–D#, with a Fabric Manager attached]
Terminology
• 256B Flits (standard and latency-optimized): new CXL flit formats based on the PCIe flit
• RCD mode (Restricted CXL Device mode): CXL operating mode limited by specific restrictions such as no hot-plug, RCiEP, no switch support, and 68B Flit mode only
• eRCH (Exclusive Restricted CXL Host): previously a CXL 1.1-only host
• Virtual Hierarchy (VH): everything from the CXL RP down, including the CXL RP, the CXL PPBs, and the CXL Endpoints
• VH mode: mode of operation where the CXL RP is the root of the hierarchy
• 68B Flit and VH capable/enable (in modified TS1/TS2): CXL 2.0 capable
[Diagram: Link Layer to Flex Bus Physical Layer handoff — 68B mode: TLP + Seq#, 512-bit flit + 16-bit CRC/unused + 16-bit Protocol ID; 256B mode: 256B CXL flit]
• PCIe uses two mechanisms to address the higher BER and correct errors:
  • A lightweight FEC (Forward Error Correction)
  • A strong CRC followed by Link Level Retry (selective/standard) upon detection of an error post-FEC
256B Flit Header Fields
• Prior Flit Type (bit [5]):
  • 0 = prior flit was a non-retryable Empty, NOP, or IDLE flit (not allocated into the Retry buffer)
  • 1 = prior flit was a Payload flit or a retryable Empty flit (allocated into the Retry buffer)
• Type of DLLP Payload (bit [4]): if Flit Type = CXL.io Payload or CXL.io NOP, use as defined in the PCIe Base Specification
• Replay Command[1:0] (bits [3:2]): same as defined in the PCIe Base Specification
• Flit Sequence Number[9:0] (bits {[1:0], [15:8]}): 10-bit sequence number as defined in the PCIe Base Specification
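To make the bit positions above concrete, here is a minimal sketch of pulling those fields out of a 16-bit header word. The function is illustrative, and reading the {[1:0], [15:8]} concatenation as bits [1:0] supplying the two most-significant sequence-number bits is an assumption based on the notation above.

```python
# Hypothetical parser for the 256B flit header fields listed above.
# Prior Flit Type = bit 5, Type of DLLP Payload = bit 4,
# Replay Command = bits 3:2, sequence number = {bits [1:0], bits [15:8]}.

def parse_flit_header(hdr16: int) -> dict:
    assert 0 <= hdr16 <= 0xFFFF
    seq_hi = hdr16 & 0x3            # bits [1:0] -> assumed seq bits [9:8]
    seq_lo = (hdr16 >> 8) & 0xFF    # bits [15:8] -> assumed seq bits [7:0]
    return {
        "prior_flit_type": (hdr16 >> 5) & 0x1,
        "dllp_payload_type": (hdr16 >> 4) & 0x1,
        "replay_command": (hdr16 >> 2) & 0x3,
        "sequence_number": (seq_hi << 8) | seq_lo,
    }

print(parse_flit_header(0b1010_0011_0010_0111))
```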
Flit Type encodings (Link Layer):
• 10b — CXL.cachemem Payload: valid CXL.cachemem slot and/or CXL.cachemem credit payload. Allocated into the Retry buffer: yes.
• 10b — CXL.cachemem Empty: no valid CXL.cachemem payload; generated when the CXL.cachemem Link Layer speculatively arbitrates to transfer a flit (to reduce idle-to-valid transition time) but no valid CXL.cachemem payload arrives in time to use any slots in the flit. If retryable, allocated into the Tx Retry buffer only.
• 11b — ALMP: ARB/MUX Link Management Packet. Allocated into the Retry buffer: yes.
Latency-Optimized 256B Flits
• Optionally supported flit format; negotiated during link training (no dynamic switching)
• Comprised of 128-byte even and odd flit halves, each protected by its own 6-byte CRC
• Even flit half can be consumed without waiting for the second half to be received
• Flit accumulation latency savings increase for narrower or slower links

CXL Standard 256B Flit:
  [FlitHdr (2 bytes) | FlitData (126 bytes)]
  [FlitData (114 bytes) | CRC (8 bytes) | FEC (6 bytes)]
CXL Standard 256B Flit (with DLLP):
  [FlitHdr (2 bytes) | FlitData (126 bytes)]
  [FlitData (110 bytes) | DLLP (4 bytes) | CRC (8 bytes) | FEC (6 bytes)]
CXL Latency-Optimized 256B Flit:
  Even half: [FlitHdr (2 bytes) | FlitData (120 bytes) | CRC (6 bytes)]
  Odd half:  [FlitData (116 bytes) | FEC (6 bytes) | CRC (6 bytes)]
CXL Latency-Optimized 256B Flit (with DLLP):
  Even half: [FlitHdr (2 bytes) | FlitData (120 bytes) | CRC (6 bytes)]
  Odd half:  [FlitData (112 bytes) | DLLP (4 bytes) | FEC (6 bytes) | CRC (6 bytes)]
CRC/FEC handling for latency-optimized flits:
• Even-half CRC pass, odd-half CRC pass: consume the flit (no FEC decode needed).
• Even-half CRC pass, odd-half CRC fail: OK to consume the even half; perform FEC decode and correct, then re-check CRC:
  • Pass / Pass: consume anything not previously consumed.
  • Pass / Fail: OK to consume the even half if not previously consumed; request Retry.
• Even-half CRC fail: perform FEC decode and correct, then re-check CRC:
  • Pass / Pass: consume anything not previously consumed.
  • Pass / Fail: OK to consume the even half if not previously consumed; request Retry.
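A compact way to read the cases above is as a two-stage decision function: pre-FEC CRC results decide whether FEC decode is needed at all, and post-FEC CRC results decide what else can be consumed. This is a simplified sketch (real hardware pipelines these checks); the function name and return values are illustrative.

```python
# Decision flow for latency-optimized 256B flit reception, following the
# case list above. Returns the receiver's actions in order.

def rx_actions(even_crc_ok, odd_crc_ok,
               post_fec_even_ok=None, post_fec_odd_ok=None):
    actions = []
    if even_crc_ok and odd_crc_ok:
        return ["consume flit"]              # no FEC decode needed
    if even_crc_ok:
        actions.append("consume even half")  # safe before FEC decode
    actions.append("FEC decode and correct")
    # Re-check both CRCs after FEC correction:
    if post_fec_even_ok and post_fec_odd_ok:
        actions.append("consume anything not previously consumed")
    elif post_fec_even_ok:
        actions.append("consume even half if not previously consumed")
        actions.append("request retry")
    else:
        actions.append("request retry")
    return actions

print(rx_actions(True, True))
print(rx_actions(True, False, post_fec_even_ok=True, post_fec_odd_ok=True))
print(rx_actions(False, False, post_fec_even_ok=True, post_fec_odd_ok=False))
```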
For the 6B CRC, each flit half is zero-extended and fed into the corresponding bytes of the 8B CRC generator:

Generator input bytes  | Even flit half                  | Odd flit half
Byte 0 to Byte 113     | 00h for all bytes               | 00h for all bytes
Byte 114 to Byte 229   | Byte 0 to Byte 115 of the flit  | Byte 128 to Byte 243 of the flit
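The idea is that one fixed-width CRC generator serves both flit halves by zero-filling the input bytes a half does not use. A minimal sketch of that padding scheme, using Python’s binascii.crc32 purely as a stand-in — the real CXL CRC polynomial and width are different.

```python
import binascii

# Sketch of reusing one fixed-width CRC generator for a shorter input by
# zero-padding, as described above for the 6B CRC / 8B CRC generator.
# NOTE: crc32 is a stand-in; the actual CXL polynomial/width differ.

GEN_INPUT_LEN = 230  # generator consumes input bytes 0..229

def crc_of_half(half_bytes: bytes, offset: int = 114) -> int:
    assert offset + len(half_bytes) <= GEN_INPUT_LEN
    padded = bytes(offset) + half_bytes            # bytes 0..113 are 00h
    padded += bytes(GEN_INPUT_LEN - len(padded))   # pad tail if shorter
    return binascii.crc32(padded)

even_half = bytes(range(116))        # stand-in for flit bytes 0..115
odd_half = bytes(range(128, 244))    # stand-in for flit bytes 128..243
print(hex(crc_of_half(even_half)), hex(crc_of_half(odd_half)))
```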
[Diagram: receive latency paths at L0/Gen1 — flit accumulation, then CRC check on the low-latency path; FEC decode (4-cycle pipeline) + CRC check on the high-latency path]
Support for NOP Hints during Phase 1 of APN: YES / No
CXL 3.0 Protocols
Agenda
• Protocols
• Coherence/Caching Primer
• CXL.cache Protocol
  • Focus on updates in CXL 3.0
• CXL.mem Protocol
  • Focus on updates in CXL 3.0
Protocols Supported
3 protocols in CXL:
• IO
  • PCI Express baseline with a few extensions
  • Usage: configuration, address translation, DMA, etc.
  • Not the focus of this training
• Cache
  • Device caching of host memory
  • Special use for managing HDM-D coherence (“Bias Flip” flows)
  • Can also be used for DMA
  • 64B granularity only
  • Host Physical Address (HPA) only
• Memory
  • Exposes device memory to the host as Host-managed Device Memory (HDM)
  • 3 types of coherence: Host-only, Device, and Device with Back-Invalidate
  • Abbreviated as HDM-H, HDM-D, and HDM-DB
  • 64B granularity
  • HPA only
Caching Primer
Caching Overview
• Caching temporarily brings data closer to the consumer
  • Host memory: access latency ~200 ns, shared bandwidth 100+ GB/s
  • Local data cache: access latency ~10 ns, dedicated bandwidth 100+ GB/s
• Prefetching: loading data into the cache before it is required
• Spatial locality (locality in space): accesses to nearby addresses
• Temporal locality (locality in time): multiple accesses to the same data
[Diagram: accelerator issuing Read/Data to host memory vs. its local data cache]
Cache Hierarchy
• Higher levels (e.g., L3) have less bandwidth per source and support more sources
• Devices may also have multiple levels of cache
[Diagram: CPU Socket 0 with per-CPU L1 (~50 KB) and L2 caches and a shared L3; CXL.cache/CXL.io devices with ~1 MB caches and PCIe devices attached]
Note: Cache/memory capacities are examples and not aligned to a specific product.
Cache Consistency
• How do we make sure updates in a cache are visible to other agents?
  • Invalidate all peer caches prior to the update
  • Can be managed with software or hardware; CXL uses hardware coherence
• Define a point of “Global Observation” (aka GO) at which new data from writes is visible
• Notes:
  • Each level of the CPU cache hierarchy follows MESI, and layers above must be consistent
  • Other extended states and flows are possible but not covered in the context of CXL
• A “Snoop” is the mechanism by which the Home Agent checks cache state, fetches data, and causes cache state changes
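As a concrete illustration of the invalidate-before-update rule and GO, here is a toy hardware-coherence model in which a home agent snoop-invalidates all peer caches before granting a writer Modified state. This is a teaching sketch, not the CXL.cache protocol; all class and method names are illustrative.

```python
# Toy MESI-style coherence: the home agent invalidates peer caches before a
# writer may transition to Modified, so the write's Global Observation (GO)
# point is well defined. Illustrative only; not the CXL.cache message set.

M, E, S, I = "M", "E", "S", "I"

class Cache:
    def __init__(self, name):
        self.name, self.lines = name, {}   # addr -> (state, data)

class HomeAgent:
    def __init__(self, caches, memory):
        self.caches, self.memory = caches, memory

    def snoop_invalidate(self, addr, requester):
        """Snoop all peers: write back dirty data, then invalidate."""
        for c in self.caches:
            if c is requester or addr not in c.lines:
                continue
            state, data = c.lines.pop(addr)
            if state == M:                 # dirty: write back first
                self.memory[addr] = data

    def write(self, requester, addr, data):
        self.snoop_invalidate(addr, requester)   # invalidate peers first
        requester.lines[addr] = (M, data)        # GO: now globally observable

mem = {0x40: "old"}
c0, c1 = Cache("c0"), Cache("c1")
c1.lines[0x40] = (S, "old")
ha = HomeAgent([c0, c1], mem)
ha.write(c0, 0x40, "new")
print(c0.lines, c1.lines)   # c0 holds M/"new"; c1 is invalidated
```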
CXL.cache Protocol
Read Flow
• Diagrams show message flows in time
  • X-axis: agents
  • Y-axis: time
[Diagram: read flow across agents — device cache, CXL.cache link, Home Agent on the CPU socket, peer caches, and memory]
Memory behind the Home Agent can be:
• Native DDR on the local socket
• Native DDR on a remote socket
• CXL.mem on a peer device
Ownership: Silent Write and Cache Eviction
• Ownership: the device requests ownership; the host snoops peer caches (S to I) and grants Exclusive (I to E)
• Silent Write: once ownership is held, the device writes with no message on the link (<Silent Wr>, E to M)
• Cache Eviction: dirty data is written back to the host (M to I, Data)
[Legend — cache states: Modified, Exclusive, Shared, Invalid; tracker events: Allocate Tracker, Deallocate Tracker]
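The three transitions above can be read as one lifecycle of a device cache line. A toy sketch of that lifecycle follows (states are MESI; the message names in the comments mirror the flow above and are used descriptively, not as exact wire encodings):

```python
# Toy lifecycle of a device cache line under CXL.cache ownership flows:
# RdOwn -> GO-E grants Exclusive; a silent write upgrades E->M with no
# message on the link; DirtyEvict writes Modified data back (M->I).

class DeviceCacheLine:
    def __init__(self):
        self.state, self.data = "I", None

    def acquire_ownership(self, host_data):
        # Device sends RdOwn; host snoop-invalidates peers, returns GO-E + data.
        assert self.state == "I"
        self.state, self.data = "E", host_data

    def silent_write(self, new_data):
        # No CXL.cache message: ownership already held, so E->M is silent.
        assert self.state in ("E", "M")
        self.state, self.data = "M", new_data

    def dirty_evict(self):
        # Device sends DirtyEvict; host responds GO-WrPull; device sends Data.
        assert self.state == "M"
        data, self.state, self.data = self.data, "I", None
        return data  # written back to the host

line = DeviceCacheLine()
line.acquire_ownership(host_data=b"old")
line.silent_write(b"new")
print(line.state, line.dirty_evict(), line.state)  # M b'new' I
```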
Ordered Writes
• Rely on the completion to indicate ordering
• May see reduced bandwidth for ordered traffic
• The host may install data into the LLC instead of writing to memory
[Diagram: write flow with peer snoop (S to I) and Data transfer; legend as above]
15 Request Types in CXL.cache
CXL.mem Protocol
Limited Ordering
• Req channel: used for HDM-D memory (CXL 2.0 accelerators)
• NDR channel: used for conflict flows with HDM-DB
2 Levels of Coherence Protocols
• CXL.$ to the LLC (Last Level Cache)
• LLC to the Home Agent (host specific)
[Diagram: CPU Socket 0 cache hierarchy (L1/L2 per CPU, shared LLC) with CXL.cache/CXL.io devices (~1 MB L1/L2) attached]
CXL.mem Coherence Tracking
• The host Home Agent does not track device caching of HDM-D*
• The device tracks host caching of HDM-D*
• DCoh is the final point of coherence resolution for HDM-D*
[Diagram: CPU Socket 0 and CPU Socket 1 joined by coherent CPU-to-CPU symmetric links; Home Agent and ~10 MB L3 (aka LLC) on the host; T1/T2 devices with proxy caching agents attached via CXL.mem]
Device Media and Meta Values
• Media ECC is handled by the device’s memory controller
• A Meta Value change requires the device to write to the media (new MetaValue)
• Reads of the old MetaValue need no meta update
[Diagram: Host ↔ CXL Device (memory controller ↔ memory media) for both cases]
HDM-D/HDM-DB Common Attributes
• “Device Coherent”: provides the ability for both host and device to cache
• The Device Coherence Engine (DCoh) is the final conflict-resolution arbiter between host and device accesses for HDM-D* memory
HDM-D Bias Flip Flows

Device read, BIAS=HOST — MemRdFwd message sent on the M2S Request channel; coherence resolved by the host:
• Device (I) issues RdOwn on CXL.cache for its own HDM-D address
• Host snoops its caches (SnpInv; Hit-S, RspI; S to I)
• Host sends MemRdFwd on the M2S Request channel with GO-E; the device line goes I to E
• BIAS flips to DEVICE; the device reads Data from its local memory

Device access, BIAS=DEVICE — no messages on the CXL interface:
• The device accesses its HDM-D locally; E/M in the device cache imply Bias=Device

Dirty eviction, BIAS=DEVICE:
• Device cache (M) issues DirtyEvict; GO-WrPull pulls the data (M to I)
• Data is committed with MemWr / Cmp

Weakly-ordered write, BIAS=HOST — MemWrFwd message sent; coherence resolved by the host:
• Device (I) issues WOWr*; host snoops its caches (SnpInv; Hit-S, RspI; S to I)
• Host sends MemWrFwd; the write then completes (GO-WrPull, Data, MemWr, Cmp, ExtCmp)
• BIAS flips to DEVICE

Weakly-ordered write, BIAS=DEVICE — no message to the host:
• Device (I) issues WOWr*; DCoh resolves it locally (GO-WrPull, Data, MemWr, Cmp, ExtCmp)
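A compact way to see the bias rule: accesses complete locally when bias is DEVICE, and must first flip bias through the host when it is HOST. A toy sketch of that decision for the read flow above (not the actual message encodings; the class name and log strings are illustrative):

```python
# Toy model of HDM-D bias: device-local access is free when BIAS=DEVICE;
# when BIAS=HOST the device first requests ownership via CXL.cache (RdOwn),
# the host snoops itself and replies MemRdFwd, and bias flips to DEVICE.

class HdmDLine:
    def __init__(self):
        self.bias = "HOST"   # the host may be caching this line

    def device_read(self, log):
        if self.bias == "HOST":
            log += ["D2H RdOwn", "host SnpInv (S->I)", "M2S MemRdFwd + GO-E"]
            self.bias = "DEVICE"           # bias flip complete
        log.append("local media read")     # no CXL messages once BIAS=DEVICE
        return log

line = HdmDLine()
print(line.device_read([]))  # full bias-flip sequence on first access
print(line.device_read([]))  # ['local media read'] only
```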
HDM-DB Conflicts
• A BISnp conflicting with a pending MemRd creates a race case if the read is requesting cache state (MV=A,S) from the device
• The host must find out whether the MemRd response is in flight before it knows how to process the BISnp
• Conflict resolution adds a handshake using M2S RwD and S2M NDR:
  • M2S RwD message “BI-Conflict”
    • RwD is the lowest-complexity option for implementation and works within the dependence graph
    • RwD causes a wasted 64B data transfer, but conflicts are not expected at high rates, so it is better to keep the flow simple and live with the overhead
  • S2M NDR message “Conflict-Ack”, which must push all prior Cmp messages to the same address
• Conflict cases resolved with the handshake:
  • Late conflict: Data/Cmp-E/S are in flight from the device
  • Early conflict: the MemRd is blocked at the device
  • Without the handshake the host cannot distinguish between the two cases
Early Conflict
• Conflict-Ack observed before Cmp* means the device has not yet processed the MemRd

Late Conflict
• Late is defined as after the device has processed the MemRd
• Cmp* observed before Conflict-Ack confirms to the host that data is in flight
• The BI-Conflict message uses the RwD channel
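The host’s disambiguation reduces to message ordering: because Conflict-Ack must push all prior Cmp messages to the same address, whichever of Cmp* or Conflict-Ack arrives first tells the host which case it is in. A small sketch of that rule (illustrative strings, not spec message formats):

```python
# The host classifies a BISnp/MemRd conflict purely by arrival order:
#   Conflict-Ack before Cmp*  -> early conflict (MemRd blocked at device)
#   Cmp* before Conflict-Ack  -> late conflict (Data/Cmp already in flight)

def classify_conflict(arrivals):
    """arrivals: S2M messages for the conflicting address, in receive order."""
    for msg in arrivals:
        if msg == "Conflict-Ack":
            return "early conflict: device has not processed the MemRd"
        if msg.startswith("Cmp"):
            return "late conflict: MemRd data is in flight; consume it first"
    raise ValueError("handshake incomplete")

print(classify_conflict(["Conflict-Ack", "Cmp-E"]))
print(classify_conflict(["Cmp-S", "Conflict-Ack"]))
```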
Channel Dependence
• Req may be blocked by BISnp
• RwD may only be blocked by response messages (NDR/DRS)
[Diagram: channel dependence graph over Req, RwD, BISnp, NDR, DRS]
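Those blocking rules form a dependence graph that must stay acyclic for forward progress (a cycle would permit deadlock). A tiny sketch checking that property; the edge set is my reading of the two bullets above, and the assumption that BISnp/NDR/DRS always drain is illustrative.

```python
# Channel dependence as a graph: an edge A -> B means "A may be blocked by B".
# Forward progress requires this graph to be acyclic.

DEPENDS = {
    "Req": ["BISnp"],                    # Req may be blocked by BISnp
    "RwD": ["NDR", "DRS"],               # RwD blocked only by responses
    "BISnp": [], "NDR": [], "DRS": [],   # assumed to always drain
}

def has_cycle(graph):
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {n: WHITE for n in graph}
    def dfs(n):
        color[n] = GRAY
        for m in graph[n]:
            if color[m] == GRAY or (color[m] == WHITE and dfs(m)):
                return True
        color[n] = BLACK
        return False
    return any(color[n] == WHITE and dfs(n) for n in graph)

print("deadlock-free ordering possible:", not has_cycle(DEPENDS))
```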
Peer-to-Peer to HDM-DB
• HDM-DB will directly resolve coherence with the host before committing the P2P access
[Diagram: PCIe/CXL accelerators (D1–D3) performing P2P accesses to a CXL Type-2/3 device’s HDM-DB]
Summary
• CXL protocols are evolving
Thank You
S2M DRS Packing Example

256B Flit Format (15 slots per flit):
Flit | Slot usage                                    | Data bytes | Rollover
2    | 1–9 data; 10: 3 S2M DRS headers; 11–14 data   | 208        | 8
3    | 0: 2 S2M DRS headers; 1–14 data               | 224        | 2
4    | 1–10 data; 11: 3 S2M DRS headers; 12–14 data  | 208        | 9
5    | 0: 2 S2M DRS headers; 1–14 data               | 224        | 3
6    | 1–11 data; 12: 3 S2M DRS headers; 13–14 data  | 208        | 10
7    | 1–10 data; 11: 3 S2M DRS headers; 12–14 data  | 208        | 9
8    | 0: 2 S2M DRS headers; 1–14 data               | 224        | 3
9    | 1–11 data; 12: 3 S2M DRS headers; 13–14 data  | 208        | 10
Flits 7, 8 and 9 pattern repeats.

68B Flit Format (4 slots per flit):
Flit | Slot 0 | Slot 1 | Slot 2 | Slot 3
0    | Empty  |        |        |
1    | 2 DRS  | D0     | D0     | D0
2    | D0     | D1     | D1     | D1
3    | 2 DRS  | D1     | D2     | D2
4    | D2     | D2     | D3     | D3
5    | 2 DRS  | D3     | D3     | D4
6    | D4     | D4     | D4     | D5
7    | 2 DRS  | D5     | D5     | D5
8    | D6     | D6     | D6     | D6
9    | D7     | D7     | D7     | D7
Efficiency = (64*8)*100/(68*9) = 83.67% (computed for the 68B Flit Format)
M2S RwD Packing Example

256B Flit Format:
Flit | Slot usage                                                                 | Data bytes | Rollover
0    | 0: 1 M2S RwD; 1–4 data; 5: 1 M2S RwD; 6–9 data; 10: 1 M2S RwD; 11–14 data  | 192        | 0
Pattern repeats.

68B Flit Format:
Flit | Slot 0 | Slot 1 | Slot 2 | Slot 3
1    | 1 RwD  | D0     | D0     | D0
2    | D0     | 1 RwD  | D1     | D1
3    | D1     | D1     | 1 RwD  | D2
4    | D2     | D2     | D2     | 1 RwD
5    | D3     | D3     | D3     | D3
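The efficiency figures follow directly from the slot accounting above: each 64B cacheline occupies four 16B slots plus a share of header slots. A quick check of both examples (flit counts taken from the tables; the RwD percentage is computed the same way as the DRS one on the slide):

```python
# Verify the packing-efficiency numbers from the tables above.
# A 68B flit carries 4 x 16B slots plus 2B protocol ID + 2B CRC.

def efficiency(cachelines, flits, flit_bytes):
    """Payload bytes delivered per raw link bytes consumed, in percent."""
    return cachelines * 64 * 100 / (flits * flit_bytes)

# S2M DRS: 8 cachelines (D0..D7) packed into 9 x 68B flits.
print(f"DRS, 68B flits: {efficiency(8, 9, 68):.2f}%")   # 83.67%

# M2S RwD: 4 cachelines (D0..D3) packed into 5 x 68B flits
# (each RwD header consumes a slot, pushing data into a 5th flit).
print(f"RwD, 68B flits: {efficiency(4, 5, 68):.2f}%")   # 75.29%

# 256B standard-flit DRS steady state: flits 7-9 carry 208+224+208 data bytes.
print(f"DRS, 256B flits: {(208 + 224 + 208) * 100 / (3 * 256):.2f}%")  # 83.33%
```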
• PM NAK example
• L0p aggregation and negotiation with the remote link partner is the responsibility of the ARB/MUX