
Analyzing and Debugging Performance Issues with Advanced ARM CoreLink System IP Components


By William Orme, Strategic Marketing Manager, ARM Ltd. and Nick Heaton, Senior Solutions
Architect, Cadence

Finding the optimal configuration options that meet the requirements of a particular system requires
complementary design tools to enable the designer to rapidly explore and correlate trade-offs in
performance, power, and area (PPA). This paper describes the challenges confronting the designer and
proposes a new tool leveraging ARM® and Cadence technology to overcome the challenges of today’s
highly integrated, multi-processor system-on-chip (SoC) designs.

Contents
Introduction
Accelerating SoC Integration with CoreLink NIC-400 and Interconnect Workbench
Performance Implications of Interconnect Choices
Interconnect Design Choices
Verifying Latency
Push-Button Testbench Generation
And There’s More
Conclusion

Introduction
The evolution of today’s system-on-chip (SoC) devices from uni-processor systems to heterogeneous
multi-processor designs has added a significant burden to the SoC designer’s job. Designers are confronted
with integrating many high-performance masters and slaves with dynamically changing traffic profiles.

Figure 1 illustrates how functions with real-time, maximum-latency requirements compete with
high-bandwidth streaming traffic, along with CPUs that need minimum latency to reach optimum performance.
Advanced system intellectual property (IP), the “glue” that provides the interconnect tying all of the
major functional blocks together and connecting them to main memory, is required to help solve these
competing requirements. Just as each system may have its own unique set of design challenges, system IP is,
by its nature, highly configurable, allowing the designer to choose the optimal configuration for their
design. Advanced system IP not only allows designers to select interconnect topologies but also places
solutions such as hardware-managed cache coherency and dynamic end-to-end quality of service at their
disposal.

[Figure 1 diagram: blocks shown include CPU, GPU, network interface, DMA controller, display controller,
audio CODEC, image transform, HD video, and peripherals, connected through an interconnect to dynamic and
static memory controllers. Traffic flows are marked as minimum bandwidth, maximum latency, or minimum
latency.]

Figure 1: Typical smartphone traffic

The configuration options the designer chooses must solve a multi-dimensional problem that affects the
performance of each function as well as the physical size and power dissipation of the device.

Figure 2 shows a typical SoC core, which uses the ARM CoreLink™ System IP components connected to a
Cadence® Databahn DDR controller.

[Figure 2 diagram: ADB-400 bridges (ACE) and MMU-400 units (ACE-Lite + DVM) connect over 128-bit links to
the CoreLink™ CCI-400 Cache Coherent Interconnect (128 bit at up to 0.5 Cortex-A15 frequency). Also shown:
NIC-400 interconnects (configurable AXI4/AXI3/AHB/APB), a Thin Link, and 128-bit ACE-Lite links toward the
Databahn IP, DDR PHYs, and LPDDR2 and DDR3 models.]

Figure 2: Typical SoC configuration


The sophistication of these system IP components, which is necessary to allow the designer to integrate many
functions together, provides many choices to the designer. Finding the optimal configuration options that meet the
requirements of a particular system requires complementary design tools to enable the designer to rapidly explore
and correlate trade-offs in performance, power, and area (PPA). This paper describes the challenges confronting
the designer and proposes a new tool to accelerate the integration of many SoC functions with an optimized
system IP configuration.

Accelerating SoC Integration with CoreLink NIC-400 and Interconnect Workbench


In order to better understand the range of choices SoC designers face, let us look at one of the system IP
components from ARM, the CoreLink NIC-400™. This is a highly configurable component that can have any
number (within practical limits) of master and slave interfaces using any of the AMBA® 3 family of
protocols (AHB™, AXI™, APB™) as well as AXI4™ from the AMBA 4 family. Each of these interfaces can have a
configurable width, address map, and clock speed. In addition, the user can configure the internal
implementation of the IP to control the paths from master to slave and add mechanisms for bandwidth and
latency management called Quality of Service (QoS) and QoS Virtual Networks (QVN).

Adding to this high configurability, the IP also allows the user to make additional choices to help with
routing congestion and layout through a mechanism called “Thin Links.” In a complex SoC with hundreds of
IP blocks, connecting them all to the main system memory can require an AMBA bus to be routed across the
chip, which is difficult for wide AMBA buses. Thin Links allow the user to create a point-to-point AMBA
connection using only a few wires, thereby alleviating the routing problem.

This connection is a user configuration choice for each interface. In fact, the NIC-400 is so configurable
that ARM provides CoreLink AMBA Designer, an interactive tool created specifically to make it easy for
users to select implementation options. Figure 3 shows an example of using AMBA Designer to configure a
complex NIC-400 interconnect.

Figure 3: AMBA Designer

Performance Implications of Interconnect Choices


From the perspective of the SoC designer, using a contemporary GUI to configure and generate a complex IP is very
efficient: they can quickly architect the SoC and, within minutes, have a new interconnect IP (or multiple cascaded
interconnect IPs) configured, generated, and stitched together. Creating the design is, however, only one piece of
the problem. How will this choice of implementation actually perform under the range of scenarios that will be
encountered in the full SoC context? This question is a challenge to answer.


The Cadence Interconnect Workbench provides a suite of capabilities to enable this kind of “what if?”
experimentation. Let’s look at an example of the kind of analysis that Interconnect Workbench enables.

Figure 4 illustrates a bandwidth plot from a performance scenario in which specific read bandwidth criteria
are met; the displayed USB and High-Speed I/O bandwidths are in the 100-300 MBps range. Interconnect
Workbench allows us to quickly visualize this kind of simulation running on cycle-accurate
register-transfer level (RTL) models of the interconnect, using Cadence VIP for AMBA to model the masters
and slaves.
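
As a rough illustration of what a bandwidth-by-source view computes, the Python sketch below sums the bytes moved by each master in fixed time windows and converts them to MB/s. The record layout, master names, and numbers are hypothetical; they are not Interconnect Workbench's data format or measured results.

```python
from collections import defaultdict

def bandwidth_by_source(transactions, window_ns):
    """Return {master: {window_index: MB/s}} for completed transactions.

    Each transaction is a (master, end_time_ns, num_bytes) tuple; this layout
    is an assumption made for the example.
    """
    bytes_per_window = defaultdict(lambda: defaultdict(int))
    for master, end_time_ns, num_bytes in transactions:
        bytes_per_window[master][end_time_ns // window_ns] += num_bytes
    # 1 byte/ns equals 1000 MB/s (with 1 MB = 10**6 bytes).
    return {
        master: {w: b * 1000 / window_ns for w, b in windows.items()}
        for master, windows in bytes_per_window.items()
    }

# Illustrative records only, not measured data.
log = [
    ("usb", 100, 64),
    ("usb", 900, 64),
    ("hs_io", 500, 128),
    ("usb", 1500, 64),
]
result = bandwidth_by_source(log, window_ns=1000)
```

Plotting one series per master from `result` would give a view similar in spirit to a bandwidth plot split by source.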

Figure 4: Bandwidth split by source

Interconnect Design Choices


The SoC designer must design the interconnect to provide sufficient throughput at low enough latency while
minimizing its area, power, and routing congestion. The CoreLink NIC-400 allows the designer to craft
different topologies while varying the size and number of switch matrices. Smaller switches permit higher
operating frequencies and lower latencies, and timing closure options for the insertion of register slices
allow throughput to be traded off against static latency. Different AMBA protocols can be selected and
data bus widths varied to increase bandwidth.
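
The effect of data bus width can be made concrete with a back-of-the-envelope calculation. The sketch below computes the theoretical peak bandwidth of one data channel as bus width times clock frequency, assuming one data beat per cycle. The 500 MHz clock is an assumed value for illustration; achieved throughput will be lower once arbitration, contention, and memory behavior are accounted for.

```python
def peak_bandwidth_mbps(data_width_bits, clock_mhz):
    """Theoretical peak of one data channel in MB/s (one beat per cycle)."""
    return data_width_bits / 8 * clock_mhz

# Assumed 500 MHz interconnect clock, for illustration only.
narrow = peak_bandwidth_mbps(64, 500)   # 4000.0
wide = peak_bandwidth_mbps(128, 500)    # 8000.0
```

Doubling the width doubles the peak figure, but it also doubles the number of data wires to route, which is the area and congestion side of the trade-off described above.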

To prevent blocking without adding more and more physical channels, virtual channels can be defined,
allowing one virtual channel to remain clear for latency-critical masters even when another virtual channel
is fully utilized. Dynamic regulators can be inserted at the ingress to the interconnect network to
prioritize traffic within a single virtual or physical channel, thus ensuring the required quality of
service is met for each master. Once an interconnect configuration is selected, the designer needs to be
able to verify its performance under load.

Verifying Latency
An important question that designers should ask is, “What is the consequence of adding an asynchronous
bridge into my architecture with respect to latency?” The graph in Figure 5 shows the latency of accesses
across a CCI-400 interconnect with and without ADB-400 asynchronous domain bridges. The top chart is the
latency distribution without bridges; the bottom chart is the distribution with bridges.
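
One way to put a number on this question is to compare summary statistics of the two latency samples. The sketch below uses Python's standard `statistics` module; the latency values are invented for illustration and are not taken from Figure 5.

```python
import statistics

def latency_shift(baseline_ns, with_bridge_ns):
    """Return the change in mean and median latency (ns) caused by a bridge."""
    return {
        "mean_delta": statistics.mean(with_bridge_ns) - statistics.mean(baseline_ns),
        "median_delta": statistics.median(with_bridge_ns) - statistics.median(baseline_ns),
    }

# Hypothetical per-transaction latencies in nanoseconds.
without_bridges = [40, 42, 44, 46, 48]
with_bridges = [52, 54, 56, 58, 70]
shift = latency_shift(without_bridges, with_bridges)
```

Reporting both figures is a deliberate choice: a few slow outliers move the mean more than the median, so a gap between the two deltas hints that the bridge affects the tail of the distribution as well as the typical case.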


Figure 5: CCI-400 latency with and without ADB bridges

Interconnect Workbench also allows us to investigate latency through statistical distributions. Figure 6
shows a latency distribution view of a group of simulations. It is easy to identify the slowest
transactions on a distribution, as the buckets to the right are the slowest. The chart also clearly
illustrates that the latency distributions for reads and writes differ and that writes happen more quickly
than reads.
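
The bucketing behind such a view can be sketched in a few lines of Python. The example below counts transactions per latency bucket, kept separate for reads and writes. The tuple layout, bucket width, and sample values are assumptions made for the example, not the tool's internal representation.

```python
from collections import Counter

def latency_histogram(transactions, bucket_ns):
    """Count transactions per (direction, bucket start) pair.

    Each transaction is a (direction, latency_ns) tuple, where direction is
    "read" or "write"; this layout is an assumption made for the example.
    """
    counts = Counter()
    for direction, latency_ns in transactions:
        bucket_start = (latency_ns // bucket_ns) * bucket_ns
        counts[(direction, bucket_start)] += 1
    return counts

# Illustrative values only.
sample = [("read", 95), ("read", 130), ("write", 40), ("write", 55), ("read", 210)]
hist = latency_histogram(sample, bucket_ns=50)
slowest_bucket = max(start for _, start in hist)
```

Here `slowest_bucket` is the right-most bucket of the distribution, which is where a search for latency outliers would start.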

Figure 6: Latency distribution

From the latency distribution, Interconnect Workbench provides the ability to click on a bucket and show the
transaction(s) in that bucket along with all the details, thus enabling the rapid debug of latency outliers. As shown
in Figure 7, right-clicking on the transaction details further accelerates debugging by launching the SimVision tool.
Within the tool, the simulation waveform is already configured and markers highlight the transaction of interest.


Figure 7: One-click waveform debug of slow transactions

Push-Button Testbench Generation


As has been shown, Interconnect Workbench provides comprehensive analysis capabilities for capturing
cycle-accurate performance metrics. How are the testbenches created to run this kind of analysis?

Interconnect Workbench provides a complete solution for automatically generating a UVM testbench for any
ARM interconnect built from the NIC-301™, NIC-400, and CCI-400™ CoreLink System IPs. Once a user has
defined the interconnect implementation details, AMBA Designer generates the RTL as well as an IP-XACT XML
file that matches the design. Interconnect Workbench has been architected to read this IP-XACT and
automatically generate a UVM testbench in either of the most popular high-level verification languages
(HVLs): e or SystemVerilog.
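
To give a feel for the information a testbench generator can extract from IP-XACT, the Python sketch below lists the bus interfaces of a component. The XML fragment is a hypothetical, heavily trimmed description written for this example; it is not AMBA Designer output, and the namespace shown is the IEEE 1685-2009 one, which may differ from the schema version a given tool emits.

```python
import xml.etree.ElementTree as ET

# Namespace of IEEE 1685-2009 IP-XACT; other schema versions use different URIs.
NS = {"spirit": "http://www.spiritconsortium.org/XMLSchema/SPIRIT/1685-2009"}

# Hypothetical, heavily trimmed component description for illustration.
IPXACT = """<spirit:component
    xmlns:spirit="http://www.spiritconsortium.org/XMLSchema/SPIRIT/1685-2009">
  <spirit:name>example_interconnect</spirit:name>
  <spirit:busInterfaces>
    <spirit:busInterface>
      <spirit:name>cpu_s0</spirit:name>
      <spirit:busType spirit:vendor="example.org" spirit:library="demo"
                      spirit:name="AXI4" spirit:version="1.0"/>
      <spirit:slave/>
    </spirit:busInterface>
    <spirit:busInterface>
      <spirit:name>ddr_m0</spirit:name>
      <spirit:busType spirit:vendor="example.org" spirit:library="demo"
                      spirit:name="AXI4" spirit:version="1.0"/>
      <spirit:master/>
    </spirit:busInterface>
  </spirit:busInterfaces>
</spirit:component>
"""

def list_bus_interfaces(xml_text):
    """Return (interface name, protocol, mode) for each bus interface."""
    root = ET.fromstring(xml_text)
    name_attr = "{%s}name" % NS["spirit"]
    interfaces = []
    for bus in root.findall("spirit:busInterfaces/spirit:busInterface", NS):
        name = bus.findtext("spirit:name", namespaces=NS)
        protocol = bus.find("spirit:busType", NS).get(name_attr)
        mode = "master" if bus.find("spirit:master", NS) is not None else "slave"
        interfaces.append((name, protocol, mode))
    return interfaces

interfaces = list_bus_interfaces(IPXACT)
```

A generator working from such a list could, for instance, attach a master VIP to each slave interface and a slave VIP to each master interface; how Interconnect Workbench does this internally is not described here.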

In a typical SoC, a mix of components makes up the “glue” connecting all of the major IP together. Reading an
IP-XACT description of these system IP cores enables Interconnect Workbench to provide performance analysis
capabilities for not only the interconnect components but also the cycle-accurate models of the DDR controller.
Figure 8 shows how Interconnect Workbench might be used to generate a testbench for the core of an SoC and
the included DDR controller.


[Figure 8 diagram: the Figure 2 configuration surrounded by a generated testbench. AXI4, ACE, and ACE-Lite
VIP instances are attached around the NIC-400, ADB-400, MMU-400, and CCI-400 components, an ICM VIP is
shown alongside the CCI-400, and plug-ins are shown at the LPDDR2 and DDR3 models behind the Databahn IP
and DDR PHYs.]

Figure 8: Interconnect Workbench-generated testbench

And There’s More...


With the introduction of AMBA 4, ARM introduced hardware coherency to the world through the ACE™ protocol,
a contrast to the previously described, relatively simple non-coherent systems. Coherency enables multiple
masters to share coherent data structures while enabling L1 and L2 local caching for higher performance,
for example in the ARM big.LITTLE™ processor clusters using Cortex™-A15 and Cortex-A7 processors. These
caches are kept coherent with other components in the system through interconnect hardware, which improves
performance, reduces power consumption, and simplifies software. This coherency adds to the performance
prediction scenarios that the system designer must consider.

In the same way that Interconnect Workbench can help with non-cached systems, it can be used with the
AMBA 4 cache-coherent protocols. In a cached system using, for example, the ARM CCI-400 cache-coherent
interconnect, traffic from an I/O master can share the L2 cache of either of two processor clusters
connected via the ACE interfaces using snoop commands. If a transaction’s data is cached in one of these
clusters, there will be a “snoop hit.” If the corresponding data is not stored in these caches, then the
transaction will eventually be forced to go to main memory, resulting in a “snoop miss.” The difference in
latency between these hits and misses is significant, and it is of paramount importance for an SoC
designer to characterize the behavior of the system under differing loads and conditions. Interconnect
Workbench provides the perfect vehicle to do this kind of analysis. Figure 9 shows a latency distribution
for a CCI-400 simulation with data split by hits and misses.
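
Why the hit/miss split matters can be seen from a simple weighted average: the mean latency seen by a master depends on what fraction of its transactions hit. The sketch below is a first-order model only, and the 60 ns and 200 ns latencies are hypothetical values chosen for illustration, not CCI-400 figures.

```python
def expected_latency_ns(hit_rate, hit_latency_ns, miss_latency_ns):
    """Average latency when a fraction hit_rate of transactions snoop-hit."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * hit_latency_ns + (1.0 - hit_rate) * miss_latency_ns

# Hypothetical latencies: 60 ns for a snoop hit, 200 ns for a miss.
mostly_hits = expected_latency_ns(0.75, 60, 200)    # 95.0
mostly_misses = expected_latency_ns(0.25, 60, 200)  # 165.0
```

Under these assumed numbers, moving from 75% to 25% hits raises the average from 95 ns to 165 ns, which is why characterizing the system under differing loads and sharing patterns is worthwhile.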


Figure 9: CCI-400 latency distribution showing hits and misses

As can be seen, the expected lower latency for hits is validated by the analysis. The value of
visualization is that it is easy to see if hits were slower than expected or if misses were quicker, which
might point to either a functional problem or perhaps an error in the scenario.

Conclusion
The increase in complexity of SoCs based on heterogeneous, multi-core systems can be addressed by advanced
system IP. Design integration can be accelerated with appropriate tools that simplify choice of
architectures, clock schemes, power domains, memory sizes, cache sizes, QoS mechanisms, and other
configuration options. ARM’s AMBA Designer provides a quick way to generate CoreLink interconnect designs
from a large set of configurable options. The Cadence Interconnect Workbench is a valuable tool for
measuring and comparing different architectures and configurations in cycle-accurate RTL simulations for a
variety of scenarios. Understanding how these numerous and varying IP cores behave together in a system
when pushed to their limits is key to ensuring that a new design delivers on its expected performance
targets.

Cadence Design Systems enables global electronic design innovation and plays an essential role in the
creation of today’s electronics. Customers use Cadence software, hardware, IP, and expertise to design
and verify today’s mobile, cloud and connectivity applications. www.cadence.com

© 2013 Cadence Design Systems, Inc. All rights reserved worldwide. Cadence and the Cadence logo are
registered trademarks of Cadence Design Systems, Inc. in the United States and other countries. ARM and
AMBA are registered trademarks and ACE, AHB, APB, AXI, AXI4, big.LITTLE, CCI-400, CoreLink, Cortex,
NIC-301, and NIC-400 are trademarks of ARM Ltd.
