Professional Documents
Culture Documents
(CG095)
Boost system performance for emerging use cases
External
Flash SRAM DDR
Cache miss
Interconnect
means wait
Majority of the memory accesses Mem Ctrl states that
are likely to be cache hit depend on the
(high) latency to
High latency memory
memory
• Lower overall memory access latency when cache hit 32-bit AHB or AHB5
rate is high
TrustZone Filters TrustZone Filters TrustZone Filters
• Runs at processor frequency depending on cache
configuration* Flash Controller SRAM Controller DDR Controller
– Approx. 165 MHz for TSMC 40 ULP, C40, 9 track library with
SVT
Flash SRAM External
– Approx. 440 MHz for TSMC 22 ULL, C30, 7 track library with DDR
SVT System memories
* Data based on AHB Cache EAC, subject to change
Description Value
Size of Data RAM 2 kB to 64 kB
User bus width of HAUSER, HRUSER, HWUSER 0-64
Width of Master ID (HMASTER_WIDTH) 4-8
The Master ID that the cache will generate 0-2HMASTER_WIDTH-1
Endianness support LE, BE8, BE32
XOM support OFF, ON
Snapshot support OFF, ON
Clock Q synchronizer OFF, ON
• 4 Tag RAMs
• Support to join into a single RAM instance
• Single write enable for each Tag RAM
• 4 Data RAMs
• Write enable is byte granularity
Tag RAM
• 1 RAM for Dirty status bits
• Write enable is bit granularity
AHB Cache
RAM Sizes and Geometry for a 64 kB Cache Size Data RAM
Cache Size / cache line size = 64 kB / 32 B
= 2 k cache lines
Dirty RAM
Data RAM size = 64 kB
Tag RAM size = 2k*20 or 2k*21 bits depending on XOM
configuration
Dirty RAM size = 2k bits
• Support for automatic maintenance which can be enabled/disabled via config. ports
1. Powerdown - perform clean all when low-power request is received
2. Cache enable - perform invalidate all in the background (traffic is unaffected) and then enable cache
3. Cache disable - stall slave interface, perform clean all, then disable cache and release slave interface
Invalidate all manual operation Enable the Cache (this performs Invalidate all
before enabling the cache)
Enable the Cache
Workloads compiled with Arm Compiler 6.14 with below flags – Cortex-M33
Compiler flags: --target=arm-arm-none-eabi -mcpu=cortex-m33 -D__ARM_ARCH_8M__=1 -march=armv8-m.main -mcmse -mthumb -munaligned -access -O3 -munaligned-access -Omax -MMD -g -c
Linker flags: --map --ro-base=0x0 --rw-base=0x20000000 --first='boot_exectb_mcu.o(vectors)' --datacompressor=off --info=inline --entry=main --cpu=teal --lto -Omax
Workloads compiled with Arm Compiler 6.14 with below flags – Cortex-M4
Compiler flags: -mcpu=Cortex-m4 --target=arm-arm-none-eabi -g -Omax -mfloat-abi=hard -mfpu=fpv4-sp-d16 -mthumb -fno-common -ffunction-sections -funsigned-char -fshort-enums -fshort-
wchar -gdwarf-3 -ffp-mode=fast -mcpu=Cortex-m4 -O3 -munaligned-access -g -c
Linker flags: --ro-base=0x0 --rw-base=0x20100000 --first='startup_CMSDK_CM4.s.o(RESET)' --info=inline --datacompressor=off --debug --map -Omax --cpu=Cortex-M4 --fpu=FPv4-SP --
load_addr_map_info --xref --callgraph --symbols
19 Confidential © 2020 Arm Limited
Cortex-M33 & SRAM Performance Uplift with AHB Cache (1)
Cortex-M33 CPU clock to system memory clock is 1:4
Interconnect
Sync down
bridge
ROM RAM
(code) (data)
@ CPUCLK/4
Performance of Cortex-M33 with FPU, FPMark single precision small data set, Write Back policy
5
0
ArcTan Black Scholes Horner's method Livermore loops Neural net Linear algebra Matrix LU Inner product
decomposition
X-axis shows workloads from the FPMark suite - single precision versions with small data set.
Y-axis is the performance of each configuration setting relative to the performance of the ‘cache off’ setting.
When cache is on, write policy is Write Back.
Code memory accesses are zero wait state for all runs. Ideal memory means zero wait state for data.
21 Confidential © 2020 Arm Limited Data based on AHB Cache EAC, subject to change
Cortex-M33 & SRAM Performance Uplift with AHB Cache (3)
Cortex-M33 CPU clock to system memory clock is 1:4
Performance of Cortex-M33 with FPU, FPMark single precision small data set, Write-Through policy
5
4
3
2
1
0
ArcTan Black Scholes Horner's method Livermore loops Neural net Linear algebra Matrix LU Inner product
decomposition
X-axis shows workloads from the FPMark suite - single precision versions with small data set.
Y-axis is the performance of each configuration setting relative to the performance of the ‘cache off’ setting.
When cache is on, write policy is Write-Through.
Code memory accesses are zero wait state for all runs. Ideal memory means zero wait state for data.
22 Confidential © 2020 Arm Limited Data based on AHB Cache EAC, subject to change
Cortex-M33 & DDR4 Performance Uplift with AHB Cache (1)
Round trip latency for a read to DDR4 memory is 19 CPU cycles
Cortex-M33
C-AHB S-AHB
19 CPU • System based on SSE-200 MPS3 FPGA
cycles • Cortex-M33 is an Armv8-M processor and has two
SSE-200
AHB Cache
AHB5 master interfaces
instr. cache • AHB Cache (CG095) is integrated on the S-AHB
system master interface
Interconnect
• Instruction cache from SSE-200 is integrated on the
Memory C-AHB code master interface
controller • Data is mapped to DDR4, round trip latency for a
read to DDR4 memory is 19 CPU cycles
• Code is mapped to BRAM on FPGA
BRAM DDR4
• AHB Cache write policy is Write Back
Only relevant part of SSE-200 is shown, AHB Cache is integrated on data path
Performance of Cortex-M33 with FPU, FPMark single precision small data set
5
4
3
2
1
0
ArcTan Black Scholes Horner's method Livermore loops Neural net Linear algebra Inner product
X-axis shows workloads from the FPMark suite - single precision versions with small data set.
Y-axis is the performance of each configuration setting relative to the performance of the ‘cache off’ setting.
Cortex-M33 Floating Point Unit (FPU) is included and FP operations are emulated in software. When cache is on, write policy is Write Back.
24 Confidential © 2020 Arm Limited Data based on AHB Cache EAC, subject to change
Cortex-M33 & DDR4 Performance Uplift with AHB Cache (3)
Round trip latency for a read to DDR4 memory is 19 CPU cycles
Performance of Cortex-M33 with FPU, FPMark single precision medium data set
5
4
3
2
1
0
Black Scholes Horner's method Linear algebra
X-axis shows select workloads from the FPMark suite - single precision with medium data set. Due to technical issues not all workloads were run.
Y-axis is the performance of each configuration setting relative to the performance of the ‘cache off’ setting.
Cortex-M33 Floating Point Unit (FPU) is included and FP operations are emulated in software. When cache is on, write policy is Write Back.
25 Confidential © 2020 Arm Limited Data based on AHB Cache EAC, subject to change
Cortex-M33 & DDR4 Performance Uplift with AHB Cache (4)
Round trip latency for a read to DDR4 memory is 19 CPU cycles
Performance of Cortex-M33 without FPU, FPMark single precision small data set
3
0
ArcTan Black Scholes Horner's method Livermore loops Neural net Linear algebra Inner product
X-axis shows workloads from the FPMark suite - single precision versions with small data set.
Y-axis is the performance of each configuration setting relative to the performance of the ‘cache off’ setting.
Cortex-M33 Floating Point Unit (FPU) is not included and FP operations are emulated in software. When cache is on, write policy is Write Back.
26 Confidential © 2020 Arm Limited Data based on AHB Cache EAC, subject to change
Cortex-M33 & DDR4 Performance Uplift with AHB Cache (5)
Round trip latency for a read to DDR4 memory is 19 CPU cycles
Performance of Cortex-M33 without FPU, FPMark single precision medium data set
3
0
Black Scholes Horner's method Linear algebra
X-axis shows select workloads from the FPMark suite - single precision with medium data set. Due to technical issues not all workloads were run.
Y-axis is the performance of each configuration setting relative to the performance of the ‘cache off’ setting.
Cortex-M33 Floating Point Unit (FPU) is not included and FP operations are emulated in software. When cache is on, write policy is Write Back.
27 Confidential © 2020 Arm Limited Data based on AHB Cache EAC, subject to change
Cortex-M4 & SRAM Performance Uplift with AHB Cache (1)
Cortex-M4 CPU clock to system memory clock is 1:4
@ CPUCLK
• System setup
Cortex-M4
• Cortex-M4 is an Armv7-M processor and has three
ICode DCode S-AHB AHB-Lite master interfaces
AHB-Lite • Code access at 1:1 clock, data access at 1:4 clock
AHB Cache wrapper • AHB Cache size configuration is swept
• Cortex-M4 Floating Point Unit (FPU) is enabled
Interconnect • AHB Cache write policy is Write Back
Sync down
bridge
ROM RAM
(code) (data)
@ CPUCLK/4
Performance of Cortex-M4 with FPU, FPMark single precision small data set, Write Back policy
4
3
2
1
0
ArcTan Black Scholes Horner's method Livermore loops Neural net Linear algebra Matrix LU Inner product
decomposition
X-axis shows workloads from the FPMark suite - single precision versions with small data set.
Y-axis is the performance of each configuration setting relative to the performance of the ‘cache off’ setting.
When cache is on, write policy is Write Back.
Code memory accesses are zero wait state for all runs.
29 Confidential © 2020 Arm Limited Data based on AHB Cache EAC, subject to change
CoreMark
Cortex-M0, Cortex-M0+, Cortex-M23,
Cortex-M3, Cortex-M4, Cortex-M33
Cortex-M0/M0+/M23 Performance Uplift with AHB Cache (1)
Cortex-M CPU clock to system memory clock is 1:4
@ CPUCLK @ CPUCLK
• System setup
Cortex-M0
or Cortex-M23 • Cortex-M0 and Cortex-M0+ are Armv6-M processors
Cortex-M0+ and have single AHB master interface
AHB-Lite • AHB-Lite wrapper handles the AHB5 <-> AHB-Lite
AHB Cache wrapper AHB Cache conversion
• Cortex-M23 is an Armv8-M processor with a single
Interconnect Interconnect AHB5 master interface
• AHB Cache write policy is Write Back
Sync down Sync down • Code access at 1:1 clock, data access at 1:4 clock
bridge bridge
Benchmark
ROM RAM ROM RAM EEMBC CoreMark
(code) (data) (code) (data)
Compiled with Arm Compiler 6.14 with below flags
@ CPUCLK/4 @ CPUCLK/4
Compiler flags: -O0 -fomit-frame-pointer -fno-common -fno-inline
Performance uplift with AHB Cache: Performance uplift with AHB Cache: Performance uplift with AHB Cache:
Cortex-M0 CoreMark Cortex-M0+ CoreMark Cortex-M23 CoreMark
4 4 4
3 3 3
2 2 2
1 1 1
0 0 0
Y-axis is the CoreMark score for each configuration setting relative to the performance of the ‘cache off’ setting.
When cache is on, write policy is Write Back.
Code memory accesses are zero wait state for all runs.
32 Confidential © 2020 Arm Limited Data based on AHB Cache EAC, subject to change
Cortex-M3/M4/M33 Performance Uplift with AHB Cache (1)
Cortex-M CPU clock to system memory clock is 1:4
@ CPUCLK @ CPUCLK
• System setup
Cortex-M3 or Cortex-M4
Cortex-M33 • Cortex-M3 and Cortex-M4 are Armv7-M processors
ICode DCode S-AHB C-AHB S-AHB and have three AHB-Lite master interfaces
AHB-Lite • AHB-Lite wrapper handles the AHB5 <-> AHB-Lite
AHB Cache wrapper AHB Cache conversion
• Cortex-M33 is an Armv8-M processor and has two
Interconnect Interconnect AHB5 master interfaces
• AHB Cache write policy is Write Back
Sync down Sync down • Code access at 1:1 clock, data access at 1:4 clock
bridge bridge
Benchmark
ROM RAM ROM RAM EEMBC CoreMark
(code) (data) (code) (data)
Compiled with Arm Compiler 6.14 with below flags
@ CPUCLK/4 @ CPUCLK/4
Compiler flags: -O0 -fomit-frame-pointer -fno-common -fno-inline
Performance uplift with AHB Cache: Performance uplift with AHB Cache: Performance uplift with AHB Cache:
Cortex-M3 CoreMark Cortex-M4 CoreMark Cortex-M33 CoreMark
5 5 5
4 4 4
3 3 3
2 2 2
1 1 1
0 0 0
Cache 2 kB 4 kB 8 kB 64 kB Cache 2 kB 4 kB 8 kB 64 kB Cache 2 kB 4 kB 8 kB 64 kB
disabled cache cache cache cache disabled cache cache cache cache disabled cache cache cache cache
Y-axis is the CoreMark score for each configuration setting relative to the performance of the ‘cache off’ setting.
When cache is on, write policy is Write Back.
Code memory accesses are zero wait state for all runs.
34 Confidential © 2020 Arm Limited Data based on AHB Cache EAC, subject to change
Value Proposition of
CoreLink AHB Cache
(CG095)
CoreLink AHB5 Cache for Cortex-M systems (CG095)
• Get more efficient embedded systems
• Enable faster systems with larger memory
Improve performance • Integrate even in secure embedded system
• Reduce frequency of large memories to save power
www.arm.com/company/policies/trademarks
FAQ for internal use
• Is ECC supported?
• No. In theory could integrate it outside the cache controller and this logic would then be on the critical path.
• Can cache lines be locked?
• No.
• Could pseudo-random replacement perform badly or in an indeterministic way? / Why not least recently used (LRU)?
• For caches with small number of ways, such as 4 in our case, LRU and random have similar hit rates, while random has lower area and power
• Literature supports this, e.g. “the performance of RND is reasonably close to that of LRU in terms of average miss rate” is the conclusion of this
analysis: https://www.normalesup.org/~kauffman/publi/perf-random.pdf
• Random is better when there is set thrashing [slide 24: https://course.ece.cmu.edu/~ece740/f10/lib/exe/fetch.php?media=740-fall10-lecture12-
advancedcaching.pdf] and does not depend on workload specific data patterns (which LRU and other sophisticated policies do depend on)
• Can the seed for pseudo-random replacement be manipulated?
• No.
• Is AHB Cache integrated in Corstone-201/SSE-200?
• No. By following steps in CIM it is fairly straightforward to do this integration.
• Does it support register slice?
• No. Expectation is that it is quite straightforward to insert registers on the AHB interface.