
Contents

Certificate
Abstract
1   Introduction
2   Overview of SoC
    2.1  Structure of SoC
    2.2  FPGAs and MPSoCs
    2.3  Introduction to Zynq UltraScale+ MPSoC
3   PCI Express Controller
    3.1  Introduction to PCIe
    3.2  PCIe Features
    3.3  PCI Express Layering Overview
    3.4  PCI Express Topology
    3.5  PCI Express Transactions
    3.6  Transaction Layer Packets
    3.7  Power Management
4   Debug Flow
    4.1  Xilinx System Debugger
    4.2  Hardware Debug Target
    4.3  PCIE Decode Utility Flow
5   PCIE Enumeration
    5.1  Introduction
    5.2  Setting up with ARM DEV Studio
    5.3  Running a (General) Bare Metal Image on the N1SDP
    5.4  Running XSDB with PCIE4.0 EP
6   Results
7   References

Abstract
“Post-silicon validation is a critical part of integrated circuit development: it is the process of finding the bugs that have escaped the pre-silicon phase. According to the International Technology Roadmap for Semiconductors (ITRS), time-to-market is the major constraint on verification and testing. The main challenge of post-silicon validation is the limited observability and controllability of internal signals in the manufactured chips. On-chip buffers are used to improve the observability and controllability of these signal states at runtime. Among existing techniques, trace-based debug has been widely adopted by industry over the past few years to overcome these challenges.”

“PCI Express is a high performance, general purpose I/O interconnect defined for a wide variety of future
computing and communication platforms. Key PCI attributes, such as its usage model, load-store
architecture, and software interfaces, are maintained, whereas its parallel bus implementation is
replaced by a highly scalable, fully serial interface. PCI Express takes advantage of recent advances in
point-to-point interconnects, Switch-based technology, and packetized protocol to deliver new levels of
performance and features. Power Management, Quality of Service (QoS), Hot-Plug/hot-swap support,
data integrity, and error handling are among some of the advanced features supported by PCI Express.”

In post-silicon validation of PCIe 4.0, PCIe Endpoint devices are connected to AXI masters through an AXI-to-PCIe bridge. Validation of the PCIe 4.0 master controller uses standalone test sequences covering all supported features. To validate a PCI Express transaction during debug, the data captured at the PCIe Transaction Layer must be decoded, which reveals what was actually transmitted. Because the data captured at the PCI Express interface is encoded and not directly readable, a decode utility is needed. Such a utility gives end-to-end debug visibility on every test sequence, which saves much of the time spent debugging issues and thereby reduces time to market.

1. Introduction
The primary challenge of PCIe validation is to test system functionality with speed and accuracy so that the product can go to market. Protocol errors must be detected, analyzed, and corrected in an efficient manner. Debugging the PCIe protocol means capturing at-speed traffic, including power management transitions. Protocol debug tools need to lock onto traffic quickly and then trigger on a unique protocol sequence. Debugging lower-level problems, such as power management, requires exceptionally fast lock times. Once traffic is captured, viewing the data at different levels of abstraction makes it possible to isolate the problem.

Once users achieve a data-valid state at the physical layer, which requires validation of link signaling, they test the higher layers at the protocol level with a protocol analyzer and exerciser. Validation of the PCIe Data Link Layer is performed by specification tests that check that Transaction Layer Packets (TLPs) are transferred and that flow control operates correctly. Validation teams need robust systems that can recover from all errors, including intermittent failures.

2. Overview of SoC
SoC stands for System on Chip: an entire system integrated on a single chip. The system consists of I/O devices, a microprocessor, on-chip memories, external memory interfaces, clock and reset generators, oscillator circuitry, and an interrupt control unit; these are the basic building blocks of any SoC, and most designs contain many other blocks as well. A basic SoC has one processor core per chip. An SoC is essentially an IC, or chip, that integrates all components of a computer or other electronic system, and it typically combines analog, digital, mixed-signal, and radio-frequency functions. Because SoCs consume very little power, they are widely used in the mobile computing market. An SoC may also include a graphics processing unit and co-processors. The figure below shows a generalized SoC architecture.

Three different types of SoC dominate today's market: SoCs built around microcontrollers, SoCs built around microprocessors, and programmable SoCs such as MPSoCs, whose internal elements are not predefined and can be programmed in much the same way as an FPGA or CPLD.

Figure 2.1: Architecture of SoC

2.1 Structure of SoC


A typical SoC consists of:
a. A microcontroller, microprocessor, or DSP core; multiprocessor SoCs (MPSoCs) have more than one processor core
b. Memory blocks including a selection of ROM, RAM, EEPROM, and flash memory
c. Timing sources including oscillators and phase-locked loops
d. Peripherals including counter-timers, real-time timers, and power-on reset generators
e. External interfaces, including industry standards such as USB, FireWire, Ethernet, and USART
f. Analog interfaces including ADCs and DACs
g. Voltage regulators and power management circuits

A bus, either proprietary or an industry standard such as the AMBA bus from ARM Holdings, connects these blocks. DMA controllers route data directly between external interfaces and memory, bypassing the processor core and thereby increasing the data throughput of the SoC. The MPSoC, an emerging successor to the SoC, is the Multi-Processor System on Chip: multiple processors communicate with each other while the device still behaves as a single chip. Another SoC-derived architecture is the RFSoC (Radio Frequency System on Chip), which generally integrates RF data converters and soft-decision forward error correction.

2.2 FPGAs and MPSOCs
Xilinx's MPSoCs, RFSoCs, and FPGAs built on the Xilinx UltraScale architecture are differentiated by their very high operating frequencies. These families address a vast spectrum of system requirements, with a focus on lowering total and relative power consumption through a range of innovative technological advancements. The following variants are available in the market.

a. Kintex UltraScale FPGAs: High-performance FPGAs with a focus on price/performance, using both monolithic and next-generation Xilinx-patented stacked silicon interconnect (SSI) technology. A high DSP count and transceivers built on next-generation manufacturing technology, combined with low-cost packaging, enable an optimum balance of capability and cost.

b. Kintex UltraScale+ FPGAs: Increased performance and on-chip UltraRAM memory to reduce
BOM cost. The ideal mix of high-performance peripherals and cost-effective system
implementation. Kintex UltraScale+ FPGAs have numerous power options that deliver the optimal
balance between the required system performance and the smallest power envelope.

c. Virtex UltraScale FPGAs: High-capacity, high-performance FPGAs built on the same process as the Kintex UltraScale FPGAs. Virtex UltraScale devices achieve the highest system capacity, performance, and bandwidth, addressing key market and application requirements through the integration of various system-level functions.

d. Virtex UltraScale+ FPGAs: The highest transceiver bandwidth, highest DSP count, and highest on-chip and external memory available in the UltraScale architecture. Virtex UltraScale+ FPGAs also provide various power options that deliver the optimal balance between the required system performance and the smallest power envelope.

e. Zynq UltraScale+ RFSoCs: Combine an RF data converter subsystem and forward error correction
with industry-leading programmable logic and heterogeneous processing capability. Integrated RF-
ADCs, RF-DACs, and soft decision FECs (SD-FEC) provide the key subsystems for multiband,
multimode cellular radios and cable infrastructure.

f. Zynq UltraScale+ MPSoCs: Combine a high-performance, energy-efficient, ARMv8-based 64-bit Cortex-A53 application processing unit with an ARM Cortex-R5 real-time processor on the same die, together with the UltraScale programmable-logic architecture, to create the industry's first All Programmable MPSoCs. They provide unprecedented power savings, heterogeneous processing, and programmable acceleration.

2.3 Introduction to Zynq UltraScale+ MPSoC


2.3.1 Architecture of Zynq MPSoC
This architecture is also called the UltraScale+ MPSoC. The distinction between UltraScale and UltraScale+ rests on parameters such as dedicated clocking for each on-chip module, the power savings provided, heterogeneous processing, programmable acceleration, and several other features.
“Zynq UltraScale+ MPSoC devices provide 64-bit processor scalability while combining real-time control with hard engines for graphics, video, waveform, and packet processing capabilities in the programmable logic. Integrating an ARM-based system for advanced analytics and on-chip programmable logic for task acceleration creates unlimited possibilities for applications including 5G Wireless, next-generation ADAS, and the Industrial Internet-of-Things. Zynq UltraScale+ MPSoC is the Xilinx second-generation Zynq platform, combining a powerful processing system (PS) and user-programmable logic (PL) in the same device, in addition to the cost and integration benefits previously provided by the Zynq-7000 devices. The processing system features the ARM Cortex-A53 64-bit quad-core or dual-core processor and a Cortex-R5 dual-core real-time processor.”

Figure 2.2: Architecture of Zynq MPSOC


“Zynq UltraScale+ MPSoCs offer the flexibility and scalability of an FPGA, while providing the performance, power, and ease-of-use typically associated with ASICs and ASSPs. The range of the Zynq UltraScale+ family enables designers to target cost-sensitive and high-performance applications from a single platform using industry-standard tools. There are two versions of the PS: dual Cortex-A53 and quad Cortex-A53. The features of the PL vary from one device type to another. A Zynq UltraScale+ MPSoC consists of two major underlying sections, the PS and the PL, in two isolated power domains. The PS acts as a standalone SoC and can boot and support all the features of the processing system without powering on the PL. The PS block has three major processing units.

a. Cortex-A53 application processing unit (APU): ARM v8 architecture-based 64-bit quad-core or dual-core multiprocessing CPU.

b. Cortex-R5 real-time processing unit (RPU): ARM v7 architecture-based 32-bit dual real-time processing unit with dedicated tightly coupled memory (TCM).

c. Mali-400 graphics processing unit (GPU): Graphics processing unit with pixel and geometry processors and a 64 KB L2 cache.”

3. PCI Express Controller:


3.1 Introduction to PCIe
The Zynq® UltraScale+™ MPSoC provides a controller for PCIe, comprising a PCI Express® v4 compliant integrated block, an AXI-PCIe bridge, and DMA modules. The AXI-PCIe bridge provides high-performance bridging between PCIe and AXI. The controller for PCIe supports both Endpoint and Root Port modes of operation. As shown in the figure, the controller comprises two sub-modules.

PCI Express is the third-generation high-performance I/O bus used to interconnect peripheral devices in applications such as computing and communication platforms. The first-generation buses include the ISA, EISA, VESA, and Micro Channel buses, while the second-generation buses include PCI, AGP, and PCI-X. PCI Express is an all-encompassing I/O device interconnect bus that has applications in mobile, desktop, workstation, server, embedded computing, and communication platforms.

PCI Express implements switch-based technology to interconnect many devices. Communication over the
serial interconnect is accomplished using a packet-based communication protocol.

• The AXI-PCIe bridge provides AXI to PCIe protocol translation and vice-versa, ingress/egress address
translation, DMA, and Root Port/Endpoint (RP/EP) mode specific services.

• The integrated block for PCIe interfaces to the AXI-PCIe bridge on one side and the PS-GTR
transceivers on the other. It performs link negotiation, error detection and recovery, and many other
PCIe protocol specific functions. This block cannot be directly accessed.
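To make the ingress/egress address translation concrete, the sketch below shows how a single translation aperture might remap an AXI address onto the PCIe address space. It is a minimal illustration only: the class and field names are invented for this sketch and do not correspond to the bridge's actual register names or to the eight-aperture configuration described in Section 3.2.

```python
# Minimal sketch of egress (AXI -> PCIe) translation through one aperture.
# Names are illustrative; a real bridge implements this in hardware and is
# programmed through its own ingress/egress translation registers.
from typing import Optional


class TranslationAperture:
    def __init__(self, src_base: int, size: int, dst_base: int):
        self.src_base = src_base    # aperture base on the source bus (AXI for egress)
        self.size = size            # aperture size in bytes
        self.dst_base = dst_base    # base on the destination bus (PCIe for egress)

    def translate(self, addr: int) -> Optional[int]:
        """Return the translated address, or None if addr falls outside the aperture."""
        if self.src_base <= addr < self.src_base + self.size:
            return self.dst_base + (addr - self.src_base)
        return None


# Example: a 16 MB window at AXI 0xA000_0000 mapped onto PCIe address 0x0.
aperture = TranslationAperture(0xA000_0000, 0x0100_0000, 0x0000_0000)
assert aperture.translate(0xA000_1000) == 0x0000_1000   # inside the window
assert aperture.translate(0xB000_0000) is None           # outside: not claimed
```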

PCI Express Link
“A Link represents a dual-simplex communications channel between two components. The fundamental
PCI Express Link consists of two, low-voltage, differentially driven signal pairs: a Transmit pair and a
Receive pair.

The primary Link attributes for PCI Express Link are:

1. The basic Link – PCI Express Link consists of dual unidirectional differential Links, implemented as
a Transmit pair and a Receive pair. A data clock is embedded using an encoding scheme to
achieve very high data rates.
2. Signaling rate – Once initialized, each Link must only operate at one of the supported signaling
levels.
a. For the first generation of PCI Express technology, there is only one signaling rate defined,
which provides an effective 2.5 Gigabits/second/Lane/direction of raw bandwidth.
b. The second generation provides an effective 5.0 Gigabits/second/Lane/direction of raw
bandwidths.
c. The third generation provides an effective 8.0 Gigabits/second/Lane/direction of raw
bandwidth.
d. The fourth generation provides an effective 16.0 Gigabits/second/Lane/direction of raw
bandwidth.
3. Lanes – A Link must support at least one Lane – each Lane represents a set of differential signal
pairs (one pair for transmission, one pair for reception). To scale bandwidth, a Link may aggregate
multiple Lanes denoted by xN where N may be any of the supported Link widths. A x8 Link
operating at the 2.5 GT/s data rate represents an aggregate bandwidth of 20 Gigabits/second of
raw bandwidth in each direction. This specification describes operations for x1, x2, x4, x8, x12,
x16, and x32 Lane widths.
4. Initialization – During hardware initialization, each PCI Express Link is set up following a
negotiation of Lane widths and frequency of operation by the two agents at each end of the Link.
No firmware or operating system software is involved.
5. Symmetry – Each Link must support a symmetric number of Lanes in each direction, i.e., a x16
Link indicates there are 16 differential signal pairs in each direction.”
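As a quick cross-check of the per-lane rates and the x8 example above, the short sketch below computes raw link bandwidth from generation and lane count. The encoding-efficiency figures (8b/10b for Gen1/Gen2, 128b/130b for Gen3/Gen4) are added here for illustration and are not part of the quoted text.

```python
# Raw and encoding-adjusted link bandwidth per direction. Function and table
# names are this sketch's own; only the per-lane rates come from the text above.
GT_PER_LANE = {1: 2.5, 2: 5.0, 3: 8.0, 4: 16.0}           # GT/s per lane per direction
ENCODING_EFFICIENCY = {1: 8 / 10, 2: 8 / 10, 3: 128 / 130, 4: 128 / 130}


def raw_bandwidth_gbps(gen: int, lanes: int) -> float:
    """Raw signaling bandwidth in Gigabits/second per direction for an xN link."""
    return GT_PER_LANE[gen] * lanes


def effective_bandwidth_gbps(gen: int, lanes: int) -> float:
    """Bandwidth after line-encoding overhead (before any protocol overhead)."""
    return raw_bandwidth_gbps(gen, lanes) * ENCODING_EFFICIENCY[gen]


print(raw_bandwidth_gbps(1, 8))         # 20.0 Gb/s: the x8 Gen1 example above
print(effective_bandwidth_gbps(4, 16))  # ~252 Gb/s for an x16 Gen4 link
```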

3.2 PCIe Features


“This section provides a summary of features supported by the controller for PCIe.

• Endpoint or Root Port mode of operation

• Support for Gen1, Gen2, or Gen3 link rates.

• Support for single x1, x2, x4, x8, x16 or x32 link.

• Endpoint mode supports MSI-X interrupts in addition to MSI and legacy.

• Support for advanced error reporting capability.

AXI-PCIe bridge supports:

• 64-bit AXI3 compliant AXI master and AXI slave interfaces operating at a 250 MHz clock.

• MSI-X table and PBA implementation at predefined location for Endpoint mode.
• Eight fully-configurable address translation apertures in each direction (egress— AXI to PCIe
and ingress—PCIe to AXI).
• Generation of configuration transactions through the enhanced configuration access
mechanism (ECAM) and messages by the AXI CPU in Root Port mode.
• Receive interrupt controller aggregates and presents legacy and MSI interrupts from PCIe to
the AXI CPU in Root Port mode.

• Receive PCIe message FIFO for Root Port.

• Four-channel fully-configurable DMA engine.

• Each DMA channel controllable from PCIe CPU, AXI CPU, or both.
• Separate source and destination scatter-gather queues with the option to have separate status
scatter-gather queues.”

Figure 3.2 Block Diagram of PCI Express

3.3 PCI Express Layering Overview
“The architecture is described in terms of three discrete logical layers: the Transaction Layer, the Data Link Layer, and the Physical Layer. Each of these layers is divided into two sections: one that processes outbound (to be transmitted) information and one that processes inbound (received) information.”

Fig 3.3 a) High-Level Layering Diagram

“PCI Express uses packets to communicate information between components. Packets are formed in the
Transaction and Data Link Layers to carry the information from the transmitting component to the
receiving component. As the transmitted packets flow through the other layers, they are extended with
additional information necessary to handle packets at those layers. At the receiving side the reverse
process occurs and packets get transformed from their Physical Layer representation to the Data Link
Layer representation and finally (for Transaction Layer Packets) to the form that can be processed by the
Transaction Layer of the receiving device”

Fig 3.3 b) Packet Flow Through the Layers

3.3.1 Transaction Layer


“The upper Layer of the architecture is the Transaction Layer. The Transaction Layer's primary responsibility is the assembly and disassembly of TLPs. TLPs are used to communicate transactions, such as read and write, as well as certain types of events. The Transaction Layer is also responsible for managing credit-based flow control for TLPs. Every request packet requiring a response packet is implemented as a Split Transaction. Each packet has a unique identifier that enables response packets to be directed to the correct originator. The packet format supports different forms of addressing depending on the type of the transaction (Memory, I/O, Configuration, and Message). Packets may also have attributes such as No Snoop, Relaxed Ordering, and ID-Based Ordering (IDO).”

Fig 3.3.1 Layering Diagram Highlighting the Transaction Layer

“The Transaction Layer supports four address spaces: it includes the three PCI address spaces (memory,
I/O, and configuration) and adds Message Space. This specification uses Message Space to support all
prior sideband signals, such as interrupts, power-management requests, and so on, as in-band Message
transactions. You could think of PCI Express Message transactions as “virtual wires” since their effect
is to eliminate the wide array of sideband signals currently used in a platform implementation.”

3.3.2 Data Link Layer


“The middle Layer in the stack, the Data Link Layer, serves as an intermediate stage between the
Transaction Layer and the Physical Layer. The primary responsibilities of the Data Link Layer include Link
management and data integrity, including error detection and error correction.

The transmission side of the Data Link Layer accepts TLPs assembled by the Transaction Layer, calculates
and applies a data protection code and TLP sequence number, and submits them to Physical Layer for
transmission across the Link. The receiving Data Link Layer is responsible for checking the integrity of
received TLPs and for submitting them to the Transaction Layer for further processing. On detection of
TLP error(s), this Layer is responsible for requesting retransmission of TLPs until information is correctly
received, or the Link is determined to have failed.

The Data Link Layer also generates and consumes packets that are used for Link management functions.
To differentiate these packets from those used by the Transaction Layer (TLP), the term Data Link Layer
Packet (DLLP) will be used when referring to packets that are generated and consumed at the Data Link
Layer.”

3.3.3 Physical Layer


“The Physical Layer includes all circuitry for interface operation, including driver and input buffers, parallel-
to-serial and serial-to-parallel conversion, PLL(s), and impedance matching circuitry. It also includes logical
functions related to interface initialization and maintenance. The Physical Layer exchanges information
with the Data Link Layer in an implementation-specific format. This Layer is responsible for converting
information received from the Data Link Layer into an appropriate serialized format and transmitting it
across the PCI Express Link at a frequency and width compatible with the device connected to the other
side of the Link.
The PCI Express architecture has “hooks” to support future performance enhancements via speed
upgrades and advanced encoding techniques. The future speeds, encoding techniques or media may only
impact the Physical Layer definition.”

3.3.4 Layer Functions and Services
“1. Transaction Layer Services:
The Transaction Layer, in the process of generating and receiving TLPs, exchanges Flow Control
information with its complementary Transaction Layer on the other side of the Link. It is also responsible
for supporting both software and hardware-initiated power management.

Initialization and configuration functions require the Transaction Layer to:


• Store Link configuration information generated by the processor or management device
• Store Link capabilities generated by Physical Layer hardware negotiation of width and operational
frequency

A Transaction Layer’s Packet generation and processing services require it to:


• Generate TLPs from device core Requests
• Convert received Request TLPs into Requests for the device core
• Convert received Completion Packets into a payload, or status information, deliverable to the core
• Detect unsupported TLPs and invoke appropriate mechanisms for handling them
• If end-to-end data integrity is supported, generate the end-to-end data integrity CRC and update the
TLP header accordingly.

Flow Control services:


• The Transaction Layer tracks Flow Control credits for TLPs across the Link.
• Transaction credit status is periodically transmitted to the remote Transaction Layer using transport
services of the Data Link Layer.
• Remote Flow Control information is used to throttle TLP transmission.

Ordering rules:
• PCI/PCI-X compliant producer/consumer ordering model
• Extensions to support Relaxed Ordering
• Extensions to support ID-Based Ordering

Power management services:


• Software-controlled power management through mechanisms, as dictated by system software.
• Hardware-controlled autonomous power management minimizes power during full-on power states.

Virtual Channels and Traffic Class:


• The combination of Virtual Channel mechanism and Traffic Class identification is provided to support differentiated services and QoS support for certain classes of applications.
• Virtual Channels: Virtual Channels provide a means to support multiple independent logical data flows
over given common physical resources of the Link. Conceptually this involves multiplexing different
data flows onto a single physical Link.
• Traffic Class: The Traffic Class is a Transaction Layer Packet label that is transmitted unmodified end-
to-end through the fabric. At every service point (e.g., Switch) within the fabric, Traffic Class labels
are used to apply appropriate servicing policies. Each Traffic Class label defines a unique ordering
domain - no ordering guarantees are provided for packets that contain different Traffic Class labels.”
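The credit-based flow control mentioned under “Flow Control services” can be pictured as a simple gate on the transmit side: a TLP is sent only if enough receiver-advertised credits remain, and credits are replenished by flow-control update DLLPs. The sketch below collapses the real per-type header/data credit pools into a single counter purely for illustration.

```python
# Toy model of credit gating on the transmit side. Real PCIe keeps separate
# Posted / Non-Posted / Completion header and data credit pools; this sketch
# uses one counter only to show the throttling behaviour described above.
class CreditGate:
    def __init__(self, advertised_credits: int):
        self.credits = advertised_credits      # initial credits from the receiver

    def try_send(self, tlp_cost: int) -> bool:
        """Send the TLP if credits allow; otherwise it is throttled (held back)."""
        if tlp_cost <= self.credits:
            self.credits -= tlp_cost
            return True
        return False

    def on_update_fc(self, returned_credits: int) -> None:
        """Receiver freed buffer space and returned credits via an UpdateFC DLLP."""
        self.credits += returned_credits


gate = CreditGate(advertised_credits=8)
print(gate.try_send(4))   # True: 4 credits remain
print(gate.try_send(6))   # False: transmission throttled until more credits arrive
gate.on_update_fc(4)
print(gate.try_send(6))   # True: credits replenished
```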

2. Data Link Layer Services


“The Data Link Layer is responsible for reliably exchanging information with its counterpart on the opposite
side of the Link.
Initialization and power management services:
• Accept power state Requests from the Transaction Layer and convey to the Physical Layer
• Convey active/reset/disconnected/power managed state to the Transaction Layer

Data protection, error checking, and retry services:
• CRC generation
• Transmitted TLP storage for Data Link level retry
• Error checking
• TLP acknowledgement and retry Messages
• Error indication for error reporting and logging”

3. Physical Layer Services


“Interface initialization, maintenance control, and status tracking:
• Reset/Hot-Plug control/status
• Interconnect power management
• Width and Lane mapping negotiation
• Lane polarity inversion
Symbol and special Ordered Set generation:
• 8b/10b encoding/decoding
• Embedded clock tuning and alignment
Symbol transmission and alignment:
• Transmission circuits
• Reception circuits
• Elastic buffer at receiving side
• Multi-Lane de-skew (for widths > x1) at receiving side
System Design For Testability (DFT) support features:
• Compliance pattern
• Modified Compliance pattern”

3.4 PCI Express Topology

A fabric is composed of point-to-point Links that interconnect a set of components. This figure illustrates
a single fabric instance referred to as a Hierarchy - composed of a Root Complex (RC), multiple Endpoints
(I/O devices), a Switch, and a PCI Express to PCI/PCI-X Bridge, all interconnected via PCI Express Links.

3.4.1 Root Complex

An RC denotes the root of an I/O hierarchy that connects the CPU/memory subsystem to the I/O. As
illustrated in Figure, an RC may support one or more PCI Express Ports. Each interface defines a separate
hierarchy domain. Each hierarchy domain may be composed of a single Endpoint or a sub-hierarchy
containing one or more Switch components and Endpoints.

3.4.2 Endpoints

Endpoint refers to a type of Function that can be the Requester or Completer of a PCI Express transaction
either on its own behalf or on behalf of a distinct non-PCI Express device (other than a PCI device or host
CPU), e.g., a PCI Express attached graphics controller or a PCI Express-USB host controller. Endpoints are
classified as either legacy, PCI Express, or Root Complex Integrated Endpoints.

PCI Express Endpoint Rules


• A PCI Express Endpoint must be a Function with a Type 00h Configuration Space header.
• A PCI Express Endpoint must support Configuration Requests as a Completer.
• A PCI Express Endpoint must not depend on operating system allocation of I/O resources claimed
through BAR(s). A PCI Express Endpoint must not generate I/O Requests. A PCI Express Endpoint must
not support Locked Requests as a Completer or generate them as a Requester.
• PCI Express-compliant software drivers and applications must be written to prevent the use of lock
semantics when accessing a PCI Express Endpoint
• For a PCI Express Endpoint, 64-bit addressing must be supported for all BARs that have the
Prefetchable bit Set. 32-bit addressing is permitted for all BARs that do not have the Prefetchable bit
Set.
• A PCI Express Endpoint must appear within one of the hierarchy domains originated by the Root
Complex.

Figure 3.4 Example of PCI Express Topology

3.4.3 Switch

A Switch is defined as a logical assembly of multiple virtual PCI-to-PCI Bridge devices, as illustrated in the figure. All Switches are governed by the following base rules. Switches appear to configuration software as two or more logical PCI-to-PCI Bridges. A Switch forwards transactions using PCI Bridge mechanisms, e.g., address-based routing, except when engaged in a Multicast. Each enabled Switch Port must comply with the Flow Control rules of the PCI Express specification.
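The address-based routing mentioned above can be sketched as a base/limit comparison per downstream virtual bridge: a memory TLP is forwarded out of the port whose window claims the address, and otherwise travels toward the upstream port. The window values and names below are illustrative, not configuration-space register layouts.

```python
# Sketch of address routing inside a Switch: each downstream virtual
# PCI-to-PCI bridge claims a memory window; unclaimed addresses go upstream.
class VirtualBridge:
    def __init__(self, name: str, mem_base: int, mem_limit: int):
        self.name = name
        self.mem_base = mem_base      # inclusive lower bound of the window
        self.mem_limit = mem_limit    # inclusive upper bound, as with PCI base/limit pairs

    def claims(self, addr: int) -> bool:
        return self.mem_base <= addr <= self.mem_limit


def route(addr: int, downstream_ports: list) -> str:
    for bridge in downstream_ports:
        if bridge.claims(addr):
            return bridge.name
    return "upstream port"


ports = [VirtualBridge("downstream port 1", 0x9000_0000, 0x9FFF_FFFF),
         VirtualBridge("downstream port 2", 0xA000_0000, 0xAFFF_FFFF)]
print(route(0xA123_0000, ports))   # downstream port 2
print(route(0x8000_0000, ports))   # upstream port (no downstream window claims it)
```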

Figure 3.4.3 PCI Express Switch Block

3.5 PCI Express Transactions
“PCI Express employs packets to accomplish data transfers between devices. A root complex can
communicate with an endpoint. An endpoint can communicate with a root complex. An endpoint can
communicate with another endpoint. Communication involves the transmission and reception of packets
called Transaction Layer packets (TLPs).

PCI Express transactions can be grouped into four categories:

1) memory,
2) IO,
3) configuration,
4) message transactions.

Memory, IO and configuration transactions are supported in PCI and PCI-X architectures, but the message
transaction is new to PCI Express. Transactions are defined as a series of one or more packet transmissions
required to complete an information transfer between a requester and a completer.

Table 3.5 Type of PCIe transactions

For Non-posted transactions, a requester transmits a TLP request packet to a completer. Later, the
completer returns a TLP completion packet back to the requester. The purpose of the completion TLP is
to confirm to the requester that the completer has received the request TLP. In addition, non-posted read
transactions contain data in the completion TLP. Non-Posted write transactions contain data in the write
request TLP.

For Posted transactions, a requester transmits a TLP request packet to a completer. The completer
however does NOT return a completion TLP back to the requester. Posted transactions are optimized for
best performance in completing the transaction at the expense of the requester not having knowledge of
successful reception of the request by the completer. Posted transactions may or may not contain data in
the request TLP.”
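The posted/non-posted split above can be captured in a small lookup, a distinction that matters later when the decode utility decides whether a data payload or a completion should follow. The grouping follows the text; the dictionary itself is this sketch's own construction.

```python
# Which request types expect a completion TLP back? Posted requests do not.
NON_POSTED = {"MRd", "IORd", "IOWr", "CfgRd0", "CfgRd1", "CfgWr0", "CfgWr1"}
POSTED = {"MWr", "Msg", "MsgD"}


def expects_completion(request_type: str) -> bool:
    """True for non-posted requests (a Cpl/CplD must come back), False for posted."""
    if request_type in NON_POSTED:
        return True
    if request_type in POSTED:
        return False
    raise ValueError(f"unknown request type: {request_type}")


print(expects_completion("MWr"))   # False: posted memory write, no completion
print(expects_completion("MRd"))   # True: read data returns in one or more CplD TLPs
```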
3.6 Transaction Layer Packets
“In PCI Express terminology, high-level transactions originate at the device core of the transmitting device
and terminate at the core of the receiving device. The Transaction Layer is the starting point in the
assembly of outbound Transaction Layer Packets (TLPs), and the end point for disassembly of inbound
TLPs at the receiver. Along the way, the Data Link Layer and Physical Layer of each device contribute to
the packet assembly and disassembly.”

“PCI Express uses a packet-based protocol to exchange information between the Transaction Layers of the
two components communicating with each other over the Link.
PCI Express supports the following basic transaction types:

1. Memory, I/O, Configuration, and Messages.


2. Two addressing formats for Memory Requests are supported: 32 bit and 64 bit.

Transactions are carried using Requests and Completions. Completions are used only where required, for
example, to return read data, or to acknowledge Completion of I/O and Configuration Write Transactions.
Completions are associated with their corresponding Requests by the value in the Transaction ID field of
the Packet header.

All TLP fields marked Reserved (sometimes abbreviated as R) must be filled with all 0's when a TLP is
formed. Values in such fields must be ignored by Receivers and forwarded unmodified by Switches. Note
that for certain fields there are both specified and Reserved values - the handling of Reserved values in
these cases is specified separately for each case.”

3.6.1 Packet Format Overview


“Transactions consist of Requests and Completions, which are communicated using packets. Figure shows
a high level serialized view of a TLP, consisting of one or more optional TLP Prefixes, a TLP header, a data
payload (for some types of packets), and an optional TLP Digest.”

Figure 3.6.1 Serial View of a TLP

“In the figure above, the leftmost byte is transmitted/received first (byte 0 if one or more optional TLP Prefixes are present, else byte H). Detailed layouts of the TLP Prefix, TLP Header, and TLP Digest are drawn with the lower numbered
bytes on the left rather than on the right as has traditionally been depicted in other PCI specifications. The
header layout is optimized for performance on a serialized interconnect, driven by the requirement that
the most time critical information be transferred first. For example, within the TLP header, the most
significant byte of the address field is transferred first so that it may be used for early address decode.”
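A tiny illustration of the “most significant byte first” rule: the 32-bit address in a 3DW request header is serialized big-endian, so the byte containing the most significant address bits leaves the transmitter first and can be used for early address decode. This is only a byte-ordering sketch, not a full header builder.

```python
# Serialize a 32-bit address in on-the-wire (big-endian) order. In a real 3DW
# header the two least significant address bits are Reserved (addresses are
# DW-aligned); that detail is ignored here.
def serialize_address(addr32: int) -> bytes:
    return addr32.to_bytes(4, byteorder="big")


wire_bytes = serialize_address(0x1234_5678)
print([hex(b) for b in wire_bytes])   # ['0x12', '0x34', '0x56', '0x78'] - MSB first
```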

Table 3.6.1 Format of TLP

3.6.2 Posted Memory Write Transactions:


“Memory write requests, shown in Figure 3.6.2, are posted transactions. This implies that the completer
returns no completion notification to inform the requester that the memory write request packet has
reached its destination successfully. No time is wasted in returning a completion, thus back-to-back
posted writes complete with higher performance relative to non-posted transactions.

The write request packet, which contains data, is routed through the fabric of switches using information in the header portion of the packet. The packet makes its way to a completer, which accepts the specified amount of data within the packet, and the transaction is over. If the write request is received in error, or the completer is unable to write the posted data to its destination due to an internal error, the requester is not informed via the hardware protocol. The completer could log an error and generate an error message notification to the root complex; error handling software then manages the error.”

Table 3.6.2 Fmt[2:0] and Type[4:0] Field Encodings
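Since the Fmt/Type table above is reproduced as an image, a small subset of the common encodings is restated below as a lookup. It covers only the request and completion types used in this report and should be checked against the PCI Express Base Specification; it is a sketch, not the full table.

```python
# Subset of Fmt[2:0]/Type[4:0] encodings for common TLPs (sketch only).
FMT_TYPE = {
    (0b000, 0b00000): "MRd (3DW header)",
    (0b001, 0b00000): "MRd (4DW header)",
    (0b010, 0b00000): "MWr (3DW header)",
    (0b011, 0b00000): "MWr (4DW header)",
    (0b000, 0b00010): "IORd",
    (0b010, 0b00010): "IOWr",
    (0b000, 0b00100): "CfgRd0",
    (0b010, 0b00100): "CfgWr0",
    (0b000, 0b00101): "CfgRd1",
    (0b010, 0b00101): "CfgWr1",
    (0b000, 0b01010): "Cpl",
    (0b010, 0b01010): "CplD",
}


def classify(header_dw0: int) -> str:
    """Extract Fmt (bits 31:29) and Type (bits 28:24) from the first header DWORD."""
    fmt = (header_dw0 >> 29) & 0x7
    tlp_type = (header_dw0 >> 24) & 0x1F
    return FMT_TYPE.get((fmt, tlp_type), f"unknown (fmt={fmt:03b}, type={tlp_type:05b})")


print(classify(0x4000_0001))   # "MWr (3DW header)"
```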

Figure 3.6.2 Memory Write Transaction Protocol

3.6.3 Non-Posted Read Transactions


“To complete this transfer, a requester transmits a non-posted read request TLP to a completer it intends
to read data from. Non-posted read request TLPs include memory read request (MRd), IO read request
(IORd), and configuration read request type 0 or type 1 (CfgRd0, CfgRd1) TLPs. Requesters may be root
complex or endpoint devices (endpoints do not initiate configuration read/write requests however).

The request TLP is routed through the fabric of switches using information in the header portion of the
TLP. The packet makes its way to a targeted completer. The completer can be a root complex, switches,
bridges or endpoints. When the completer receives the packet and decodes its contents, it gathers the
amount of data specified in the request from the targeted address. The completer creates a single
completion TLP or multiple completion TLPs with data (CplD) and sends it back to the requester. The
completer can return up to 4 KBytes of data per CplD packet. The completion packet contains routing
information necessary to route the packet back to the requester. This completion packet travels through
the same path and hierarchy of switches as the request packet. Requesters use a tag field in the
completion to associate it with a request TLP of the same tag value it transmitted earlier. Use of a tag in
the request and completion TLPs allows a requester to manage multiple outstanding transactions. If a
completer is unable to obtain requested data because of an error, it returns a completion packet without
data (Cpl) and an error status indication. The requester determines how to handle the error at the
software layer.”
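The tag mechanism described above is essentially a table of outstanding requests on the requester side, keyed by tag. The sketch below shows that bookkeeping in isolation; names and structure are invented for illustration.

```python
# Track outstanding non-posted requests by tag and match completions to them.
class OutstandingRequests:
    def __init__(self):
        self._pending = {}                      # tag -> request description

    def issue(self, tag: int, description: str) -> None:
        if tag in self._pending:
            raise ValueError(f"tag {tag} is already in flight")
        self._pending[tag] = description

    def complete(self, tag: int) -> str:
        """Return (and retire) the request that this completion's tag refers to."""
        return self._pending.pop(tag)


reqs = OutstandingRequests()
reqs.issue(0x05, "MRd, 64 bytes @ 0x9000_0000")
reqs.issue(0x06, "MRd, 64 bytes @ 0x9000_0040")
print(reqs.complete(0x06))   # completions may return out of order; the tag resolves it
```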

Figure 3.6.3 Non-Posted Read Transaction Protocol

3.6.4 Non-Posted Write Transactions


“In a non-posted write transaction, to complete the transfer, a requester transmits a non-posted write
request TLP to a completer it intends to write data to. Non-posted write request TLPs include IO write
request (IOWr), configuration write request type 0 or type 1 (CfgWr0, CfgWr1) TLPs. Memory write
request and message requests are posted requests. Requesters may be a root complex or endpoint device
(though not for configuration write requests).

A request packet with data is routed through the fabric of switches using information in the header of the
packet. The packet makes its way to a completer. When the completer receives the packet and decodes
its contents, it accepts the data. The completer creates a single completion packet without data (Cpl) to
confirm reception of the write request. This is the purpose of the completion.”

Figure 3.6.4 Non-Posted Write Transaction Protocol

3.6.5 Posted Message Transactions

“There are two categories of message request TLPs, Msg and MsgD. Some message requests propagate
from requester to completer, some are broadcast requests from the root complex to all endpoints, some
are transmitted by an endpoint to the root complex. Message packets may be routed to completer(s)
based on the message’s address, device ID or routed implicitly. The completer accepts any data that may
be contained in the packet (if the packet is MsgD) and/or performs the task specified by the message.
Message request support eliminates the need for side-band signals in a PCI Express system. They are used
for PCI style legacy interrupt signaling, power management protocol, error signaling, unlocking a path in
the PCI Express fabric, slot power support, hot plug protocol, and vendor-defined purposes.”

Figure 3.6.5 Posted Message Transaction Protocol

3.6.6 Power Management Capabilities Register:

“PCI Express device Functions are required to support D0 and D3 device states; PCI-PCI Bridge structures
representing PCI Express Ports as described in Section 7.1 are required to indicate PME Message passing
capability due to the in-band nature of PME messaging for PCI Express.

The PME_Status bit for the PCI-PCI Bridge structure representing PCI Express Ports, however, is only Set
when the PCI-PCI Bridge Function is itself generating a PME. The PME_Status bit is not Set when the Bridge
is propagating a PME Message but the PCI-PCI Bridge Function itself is not internally generating a PME.”

3.7 Power Management:
Power Management states are as follows:
1. D states are associated with a particular Function
2. D0 is the operational state and consumes the most power
3. D1 and D2 are intermediate power saving states
4. D3Hot is a very low power state
5. D3Cold is the power off state
L states are associated with a particular Link
1. L0 is the operational state
2. L0s, L1, L1.0, L1.1, and L1.2 are various lower power states

PM provides the following services:

1. A mechanism to identify power management capabilities of a given Function
2. The ability to transition a Function into a certain power management state
3. Notification of the current power management state of a Function
4. The option to wake up the system on a specific event

3.7.1 Link State Power Management

“PCI Express defines Link power management states, replacing the bus power management states that
were defined by the PCI Bus Power Management Interface Specification. Link states are not visible to PCI-
PM legacy compatible software, and are either derived from the power management D-states of the
corresponding components connected to that Link, or by ASPM protocols.

PCI Express-PM defines the following Link power management states:

• L0 - Active state.

L0 support is required for both ASPM and PCI-PM compatible power management.

All PCI Express transactions and other operations are enabled.

• L0s - A low resume latency, energy saving “standby” state.

L0s support is optional for ASPM unless the applicable form factor specification for the Link
explicitly requires L0s support.

All main power supplies, component reference clocks, and components' internal PLLs must be active at all

times during L0s. TLP and DLLP transmission is disabled for a Port whose Link is in Tx_L0s.

The Physical Layer provides mechanisms for quick transitions from this state to the L0 state. When
common (distributed) reference clocks are used on both sides of a Link, the transition time from L0s to L0
is desired to be less than 100 Symbol Times.

It is possible for the Transmit side of one component on a Link to be in L0s while the Transmit side of the
other component on the Link is in L0.

• L1 - Higher latency, lower power “standby” state.

L1 support is required for PCI-PM compatible power management. L1 is optional for ASPM
unless specifically required by a particular form factor.

When L1 PM Substates is enabled by setting one or more of the enable bits in the L1 PM Substates Control
1 Register this state is referred to as the L1.0 substate.”

All main power supplies must remain active during L1. As long as they adhere to the advertised L1 exit
latencies, implementations are explicitly permitted to reduce power by applying techniques such as, but
not limited to, periodic rather than continuous checking for Electrical Idle exit, checking for Electrical Idle
exit on only one Lane, and powering off of unneeded circuits. All platform-provided component reference
clocks must remain active during L1, except as permitted by Clock Power Management (using CLKREQ#)
and/or L1 PM Substates when enabled. A component's internal PLLs may be shut off during L1, enabling
greater power savings at a cost of increased exit latency

“The L1 entry negotiation (whether invoked via PCI-PM or ASPM mechanisms) and the L2/L3 Ready entry
negotiation map to a state machine which corresponds to the actions described later in this chapter. This
state machine is reset to an idle state. For a Downstream component, the first action taken by the state
machine, after leaving the idle state, is to start sending the appropriate entry DLLPs depending on the
type of negotiation. If the negotiation is interrupted, for example by a trip through Recovery, the state
machine in both components is reset back to the idle state. The Upstream component must always go to
the idle state and wait to receive entry DLLPs. The Downstream component must always go to the idle
state and must always proceed to sending entry DLLPs to restart the negotiation.”
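The negotiation behaviour described above can be summarised as a very small state machine per component: the Downstream component initiates by sending entry DLLPs, the Upstream component waits in idle, and any interruption (for example a trip through Recovery) resets both sides to idle. The sketch below is a narrative illustration only; it is not the LTSSM or DLLP handling defined by the specification.

```python
# Toy model of L1 / L2-L3 Ready entry negotiation roles (illustrative only).
class PmEntryNegotiation:
    def __init__(self, role: str):              # "downstream" or "upstream"
        self.role = role
        self.state = "idle"

    def start(self) -> None:
        # Only the Downstream component initiates by sending entry DLLPs;
        # the Upstream component stays in idle waiting to receive them.
        self.state = ("sending_entry_dllps" if self.role == "downstream"
                      else "waiting_for_entry_dllps")

    def on_recovery(self) -> None:
        # A trip through Recovery interrupts the negotiation on both sides.
        self.state = "idle"


down, up = PmEntryNegotiation("downstream"), PmEntryNegotiation("upstream")
down.start(); up.start()
print(down.state, up.state)   # sending_entry_dllps waiting_for_entry_dllps
down.on_recovery(); up.on_recovery()
print(down.state, up.state)   # idle idle - the Downstream side restarts the DLLPs
```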

4. Debug Flow
4.1 Xilinx System Debugger
Xilinx® System Debugger uses the Xilinx hw_server as the underlying debug engine. SDK translates each user interface action into a sequence of TCF commands. “It then processes the output from System Debugger to display the current state of the program being debugged. It communicates to the processor on the hardware using Xilinx hw_server.”

The workflow is made up of the following components:

• Executable ELF File: “To debug your application, you must use an Executable and Linkable Format
(ELF) file compiled for debugging. The debug ELF file contains additional debug information for the
debugger to make direct associations between the source code and the binaries generated from that
original source.”

• Debug Configuration: “To launch the debug session, you must create a debug configuration in SDK.
This configuration captures options required to start a debug session, including the executable name,
processor target to debug, and other information.”
• SDK Debug Perspective: “Using the Debug perspective, you can manage the debugging or running of
a program in the Workbench. You can control the execution of your program by setting breakpoints,
suspending launched programs, stepping through your code, and examining the contents of
variables.”

The debug workflow is described in the following diagram:

Figure 4.1 Debug workflow Diagram

4.2 Hardware Debug Target


“SDK supports debugging of a program on a processor running on an FPGA or a Zynq-7000 AP SoC device. All processor architectures (MicroBlaze™ and ARM Cortex-A9 processors) are supported. SDK communicates with the processor on the FPGA or Zynq-7000 AP SoC device.

The debug logic for each processor enables program debugging by controlling the processor execution. The debug logic on soft MicroBlaze processor cores is configurable and can be enabled or disabled by the hardware designer when building the embedded hardware. Enabling the debug logic on MicroBlaze processors provides advanced debugging capabilities such as hardware breakpoints, read/write memory watchpoints, safe-mode debugging, and more visibility into MicroBlaze processors.”

4.3 PCIE Decode Utility Flow
1. The System Debugger starts generating a log file containing all PCI Express transaction data exchanged between the Root Complex and Endpoint devices. The log data is not in a readable format, because it is encoded as it passes through the different PCI Express layers during a transaction.
2. The generated log file is given as input to the PCIE Decoder, which reads each state of the transaction. If a TLP is present in the transaction, the decoder analyses whether the TLP is valid.
3. Whenever a transaction starts, the Transaction Layer assembles the header and then the payload. The header is the part of the packet that carries all the fields required to classify the data, such as the address, Requester ID, Completer ID, device ID, bus number, tag, and length.
4. The PCIE Decode Utility decodes each Transaction Layer Packet to obtain the TLP length, address, TLP format, TLP type, header, data (if applicable), Requester ID, Completer ID, device ID, bus number, tag, and the other details held in the header (a condensed sketch of this step follows below).
5. When a memory or I/O read request TLP is decoded, it carries no data payload, because the Root Complex is only requesting to read memory; a memory write request, in contrast, carries a data payload.
6. After a memory read request, the Root Complex expects a completion TLP from the Endpoint, whose data payload is encoded by the Endpoint device.
7. The PCIE Decoder extracts the data payload from requester transactions only for memory write and I/O write requests.
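The condensed sketch referred to in step 4 is shown below. The log format is assumed here to be one TLP per line as space-separated hexadecimal DWORDs; the real hw_server/XSDB log format is different, so this is only meant to illustrate the header-field extraction, not to reproduce the actual utility.

```python
# Assumed input: one TLP per line as space-separated hex DWORDs (not the real
# log format). Extract the fields listed in step 4 for the common request TLPs.
def decode_tlp(dwords):
    fmt = (dwords[0] >> 29) & 0x7
    tlp_type = (dwords[0] >> 24) & 0x1F
    length_dw = (dwords[0] & 0x3FF) or 1024     # 10-bit Length field; 0 means 1024 DW
    has_data = fmt in (0b010, 0b011)            # "with data" formats
    header_len = 4 if fmt in (0b001, 0b011) else 3
    return {
        "fmt": fmt,
        "type": tlp_type,
        "length_dw": length_dw,
        "requester_id": (dwords[1] >> 16) & 0xFFFF,   # request TLPs: DW1[31:16]
        "tag": (dwords[1] >> 8) & 0xFF,
        "payload": dwords[header_len:header_len + length_dw] if has_data else [],
    }


def decode_log(path):
    with open(path) as log:
        for line in log:
            if line.strip():
                yield decode_tlp([int(tok, 16) for tok in line.split()])


# One-DW memory write to 0x1000_0000 from requester 0x0100, tag 0x05:
print(decode_tlp([0x4000_0001, 0x0100_05FF, 0x1000_0000, 0xDEADBEEF]))
```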

Figure 4.3 PCIE decoder flow

5. PCIE Enumeration:
5.1 Introduction

“PCI Express (PCIe) utilizes a point to point interconnect and uses switches to fan out and expand the
number of PCIe connections in a system. Upon system boot up a critical task is the discovery or
enumeration process of all the devices in the PCIe tree so they can be allocated by the system software.
During the enumeration process the system software discovers all of the switch and endpoint devices that
are connected to the system, determines the memory requirements and then configures the PCIe devices.
The PCIe switch devices represent a special case in this process as their configuration is unique and
separate from that of PCIe endpoints. In the simulation testbench environment however, only their
configuration is required; the discovery process is not strictly necessary as the number of PCIe devices are
known ahead of time. This paper will elucidate the process of switch configuration using Xilinx' PCI Express
simulation class libraries.

First, while the discovery process is not needed within the testbench environment, the testbench must
still select the bus numbers and memory address of all the devices.”
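For reference, the discovery part of enumeration reduces to a depth-first scan of configuration space, reading Vendor IDs and assigning secondary/subordinate bus numbers to each bridge that is found. The sketch below assumes hypothetical cfg_read/cfg_write accessors (for instance ECAM-backed) and ignores multi-function devices; it is not the Xilinx simulation class library API.

```python
# Simplified depth-first bus scan. cfg_read/cfg_write are placeholders for
# whatever configuration access the environment provides; single-function
# devices only, no BAR sizing, purely to illustrate the discovery flow.
def scan_bus(bus, next_bus, cfg_read, cfg_write):
    for dev in range(32):
        vendor_device = cfg_read(bus, dev, 0, offset=0x00)
        if vendor_device & 0xFFFF == 0xFFFF:             # no device at this slot
            continue
        header_type = (cfg_read(bus, dev, 0, offset=0x0C) >> 16) & 0x7F
        if header_type == 0x01:                          # Type 1: PCI-PCI bridge / switch port
            next_bus[0] += 1
            secondary = next_bus[0]
            # primary/secondary/subordinate numbers live in the DWORD at offset 0x18
            cfg_write(bus, dev, 0, offset=0x18,
                      value=(0xFF << 16) | (secondary << 8) | bus)
            scan_bus(secondary, next_bus, cfg_read, cfg_write)
            cfg_write(bus, dev, 0, offset=0x18,          # tighten subordinate number
                      value=(next_bus[0] << 16) | (secondary << 8) | bus)
        else:
            print(f"endpoint at {bus:02x}:{dev:02x}.0 id={vendor_device:08x}")

# Usage with environment-specific accessors: scan_bus(0, [0], my_cfg_read, my_cfg_write)
```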

5.2 Setting up with ARM DEV Studio

Arm Development Studio is an embedded C/C++ development toolchain designed specifically for Arm-based SoCs, from tiny microcontrollers to custom multicore processors. Designed alongside Arm processor IP, it accelerates system design and software development for Cortex-M, Cortex-R, and Cortex-A processors.
Gather Equipment:

▪ Windows PC

▪ USB Hub (not completely necessary, but probably useful)

Download Software:

▪ ARM Development Studio

▪ Python

▪ Pip

▪ PuTTY

▪ Win32 Disk Imager

▪ Plug the Arm DSTREAM's USB output into the PC. Plug the PC into the DBG USB port of the N1SDP.

5.3 Running A (General) Bare Metal Image on the N1SDP

Open Arm Development Studio and click “New Debug Connection”, selecting the following
options:

1. “Hardware Connection” > Next


2. Enter desired debug connection name e.g. “Neoverse N1 SDP” > Next
3. Click “Add New Platform”
4. Select your DSTREAM unit from the list > Next
5. Save a Debug-Only Platform Configuration > Next
6. “Platform Manufacturer”=“Arm” and “Platform Name”=“Neoverse N1 SDP” > Finish
7. This will scan your N1SDP’s debug logic and automatically generate a debug configuration
(you only need to do this once to import the N1SDP platform config into ArmDS and save
it in the local database).
8. Click “Finish” on the window that pops up, then select “V8_3-Generic_0” as the target;
this corresponds to the cpu0 Neoverse N1 core. Remember that N1SDP is an ALPHA1
release right now which is why the CPU is not fully recognized as a Neoverse N1 core.
9. You will also need to set the Bare Metal Debug connection to your DSTREAM again (note that your DSTREAM address will almost certainly differ; click the “Browse…” button to auto-detect it).

For the N1 SDP, the PCIe root configuration space, endpoint configuration space, and endpoint
memory mapped IO are not all contiguous. To remedy this, the SCP (system control processor,
which boots before the application processor) builds a BDF. A "segment" is a term used to refer
to a PCIe root. That is, if there are 2 roots, then there are 2 segments (seg 0 and seg 1). For the
N1SDP there are 2 roots. The first root (segment 0) is for PCIe slots 0 through 3. Segment 1 is for
PCIe slot 4 (the CCIX port).
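With two segments, the configuration address for any device follows the standard ECAM layout within that segment's window: bus, device, function, and register offset select a 4 KB configuration space. The sketch below uses placeholder per-segment base addresses; they are not the actual N1SDP memory map.

```python
# ECAM address = segment base | bus<<20 | device<<15 | function<<12 | offset.
ECAM_BASE = {0: 0x6000_0000, 1: 0x7000_0000}   # hypothetical per-segment bases


def ecam_address(segment: int, bus: int, device: int, function: int, offset: int = 0) -> int:
    assert device < 32 and function < 8 and offset < 0x1000
    return ECAM_BASE[segment] | (bus << 20) | (device << 15) | (function << 12) | offset


# A device behind segment 1 (the CCIX root), bus 1, device 0, function 0:
print(hex(ecam_address(1, 1, 0, 0)))   # 0x70100000
```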

5.4 Running XSDB with PCIE4.0 EP

1. In Remote Desktop, connect to the host PC.
2. Open 3 PuTTY sessions on the host PC to connect to the SCP, AP, and MCP.
3. On the host PC, in the MCP terminal, type `reboot`.
4. You should see it rebooting. Once you see "program the bit file and hit enter to continue" in the SCP terminal, DO NOT HIT ENTER YET; go to step 5.
5. In systest, run:
   xsdb
   conn
   source execute.tcl
   Wait for this to complete, then move on to step 6.
6. In the SCP terminal on the host PC, hit enter. It should now say "program MWR commands".
7. In systest, wait until this completes. It takes a while, but when it is done and you hit enter in the systest terminal, it should show `xsdb`.
8. Make sure the AP terminal is open. Then, hit enter in the SCP terminal.
9. Once the AP terminal asks you if you want to pause boot (it pauses and starts printing dots), hit escape (you only have a few seconds to do this).
10. It should put you into a grey screen with some options in the AP terminal.
11. Now, open Arm Development Studio. It should have a greyed-out debug connection that says N1SDP; double click on it. If there is no debug connection that says N1SDP, follow the steps in "Running A (General) Bare Metal Image on the N1SDP" to create one.
12. Once you have clicked on that, you will need to run run_burst.ds, so in the command tab of Development Studio, type: source "run.ds"
13. Hit enter. This should download the ELF; the output is in the AP terminal.

6. Results
Encoded Input Data:

Decoded Output:
1. Memory Write Request TLP

2. Memory Read Request TLP

3. Completion with Data (CplD) for a Memory Read Request TLP

PCIE Enumeration Result:

7. References
1. Zynq UltraScale+ Device Technical Reference Manual (UG1085), Xilinx. https://www.xilinx.com/support/documentation/user_guides/ug1085-zynq-ultrascale-trm.pdf
2. Zynq UltraScale+ MPSoC Software Developer Guide (UG1137), Xilinx. https://www.xilinx.com/support/documentation/user_guides/ug1137-zynq-ultrascale-mpsoc-swdev.pdf
3. Interactions of Zynq-7000 devices with general purpose computers through PCI-express: A case study. https://ieeexplore.ieee.org/document/7495400
4. Generating Basic Software Platforms (UG1138), Xilinx. https://www.xilinx.com/support/documentation/sw_manuals/xilinx2014_4/ug1138-generating-basic-software-platforms.pdf
5. Ready PCIe Data Streaming Solutions for FPGAs. https://ieeexplore.ieee.org/document/6927444

