Introduction to PCI Express

About the author: Arjun Shetty is pursuing a Master of Technology in VLSI & Embedded Systems at the International Institute of Information Technology, Hyderabad.

We will start with a conceptual understanding of PCI Express. This will let us appreciate the importance of PCI Express. This will be followed by a brief study of the PCI Express protocol. Then we will look at the enhancements and improvements of the protocol in the newer 3.0 specs.

1 Basic PC system architecture

We will start by looking at the basic layout of a PC system. Logically, an average PC system is laid out something like the figure shows.

The core logic chipset acts as a switch or router, and routes I/O traffic among the different devices that make up the system.

In reality, the core logic chipset is split into two parts: the northbridge and the southbridge (or I/O bridge). This split is there for a couple of reasons, the most important of which is the fact that there are three types of devices that naturally work very closely together, and so they need to have faster access to each other: the CPU, the main memory, and the video card. In a modern system, the video card's GPU is functionally a second (or third) CPU, so it needs to share privileged access to main memory with the CPU(s). As a result, these three devices are all clustered together off of the northbridge.

The northbridge is tied to a secondary bridge, the southbridge, which routes traffic
from the different I/O devices on the system: the hard drives, USB ports, Ethernet
ports, etc. The traffic from these devices is routed through the southbridge to the
northbridge and then on to the CPU and/or memory.


As is evident from the diagram above, the PCI bus is attached to the southbridge.
This bus is usually the oldest, slowest bus in a modern system, and is the one most
in need of an upgrade.

The main thing that we should take away from the previous diagram is that the
modern PC is a motley collection of specialized buses of different protocols and
bandwidth capabilities. This mix of specialized buses designed to attach different
types of hardware directly to the southbridge is something of a continuously
evolving hack that has been gradually and collectively engineered by the PC industry
as it tries to get around the limitations of the aging PCI bus. Because the PCI bus
can't really cut it for things like Serial ATA, Firewire, etc., the trend has been to
attach interfaces for both internal and external I/O directly to the southbridge. So
today's southbridge is sort of the Swiss Army Knife of I/O switches, and thanks to
Moore's Curves it has been able to keep adding functionality in the form of new
interfaces that keep bandwidth-hungry devices from starving on the PCI bus.

In an ideal world, there would be one primary type of bus and one bus protocol that
connects all of these different I/O devices, including the video card/GPU, to the
CPU and main memory. Of course, this "one bus to rule them all" ideal is never, ever
going to happen in the real world. It won't happen with PCI Express, and it won't
happen with Infiniband (although it technically could happen with Infiniband if we
threw away all of today's PC hardware and started over from scratch with a round of
natively Infiniband-compliant devices).

Still, even though the utopian ideal of one bus and one bus protocol for every device
will never be achieved, there has to be a way to bring some order to the chaos. Luckily
for us, that way has finally arrived in the form of PCI Express (a.k.a. PCIe).

2 A primer on PCI

Before we go into detail on PCIe, it helps to understand how PCI works and what its
limitations are.

The PCI bus debuted over a decade ago at 33MHz, with a 32-bit bus and a peak
theoretical bandwidth of 132MB/s. This was pretty good for the time, but as the
rest of the system got more bandwidth hungry both the bus speed and bus width
were cranked up in an effort to keep pace. Later flavors of PCI included a 64-bit, 33MHz bus combination with a peak bandwidth of 264MB/s, and a more recent 64-bit, 66MHz combination with a bandwidth of 528MB/s.
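As a quick sanity check on those numbers, peak PCI bandwidth is just bus width times clock rate. A minimal Python sketch (the helper name is ours, and it ignores real-world arbitration and wait states):

def pci_peak_bandwidth_mb_s(bus_width_bits, clock_mhz):
    """Peak theoretical PCI bandwidth: bytes per transfer times transfers per second."""
    return (bus_width_bits / 8) * clock_mhz   # one transfer per clock -> MB/s

print(pci_peak_bandwidth_mb_s(32, 33))   # 132.0 MB/s (classic PCI)
print(pci_peak_bandwidth_mb_s(64, 33))   # 264.0 MB/s
print(pci_peak_bandwidth_mb_s(64, 66))   # 528.0 MB/s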

PCI uses a shared bus topology to allow for communication among the different
devices on the bus; the different PCI devices (e.g., a network card, a sound card, a RAID card, etc.) are all attached to the same bus, which they use to communicate
with the CPU. Take a look at the following diagram to get a feel for what a shared
bus looks like.

Because all of the devices attached to the bus must share it among themselves,
there has to be some kind of bus arbitration scheme in place for deciding who gets
access to the bus and when, especially in situations where multiple devices need to
use the bus at the same time. Once a device has control of the bus, it becomes the
bus master, which means that it can use the PCI bus to talk to the CPU or memory
via the chipset's southbridge.

The shared bus topology's main advantages are that it's simple, cheap, and easy to
implement, or at least, that's the case as long as you're not trying to do anything
too fancy with it. Once you start demanding more performance and functionality
from a shared bus, then you run into its limitations. Let's take a look at some of
those limitations, in order to motivate our discussion of PCI Express's improvements.

This scheme works fine when there are only a few devices attached to the bus,
listening to it for addresses and data. But the nature of a bus is that any device
that's attached to it and is "listening" to it injects a certain amount of noise onto the
bus. Thus the more devices that listen to the bus, and thereby place an electrical load on the bus, the more noise there is on the bus and the harder it becomes to
get a clean signal through.

2.1 Summary of PCI's shortcomings

To summarize, PCI as it exists today has some serious shortcomings that prevent it
from providing the bandwidth and features needed by current and future
generations of I/O and storage devices. Specifically, its highly parallel shared-bus
architecture holds it back by limiting its bus speed and scalability, and its simple,
load-store, flat memory-based communications model is less robust and extensible
than a routed, packet-based model.

3 PCI-X: wider and faster, but still outdated

The PCI-X spec was an attempt to update PCI as painlessly as possible and allow it to
hobble along for a few more years. This being the case, the spec doesn't really fix
any of the inherent problems outlined above. In fact, it actually makes some of the
problems worse.

The PCI-X spec essentially doubled the bus width from 32 bits to 64 bits, thereby
increasing PCI's parallel data transmission abilities and enlarging its address space.
The spec also ups PCI's basic clock rate to 66MHz with a 133MHz variety on the high
end, providing yet another boost to PCI's bandwidth and bringing it up to 1GB/s (at
133MHz).

The latest version of the PCI-X spec (PCI-X 266) also double-pumps the bus, so that
data is transmitted on the rising and falling edges of the clock. While this improves
PCI-X's peak theoretical bandwidth, its real-world sustained bandwidth gains are
more modest.

While both of these moves significantly increased PCI's bandwidth and its usefulness,
they also made it more expensive to implement. The faster a bus runs, the more sensitive
it becomes to noise; manufacturing standards for high-speed buses are exceptionally
strict for this very reason; shoddy materials and/or wide margins of error translate
directly into noise at higher clock speeds. This means that the higher-speed PCI-X
bus is more expensive to make.

The higher clock speed isn't the only thing that increases PCI-X's noise problems and
manufacturing costs. The other factor is the increased bus width. Because the bus is
wider and consists of more wires, there's more noise in the form of crosstalk.
Furthermore, all of those new wires are connected at their endpoints to multiple
PCI devices, which means an even larger load on the bus and thus more noise
injected into the bus by attached devices. And then there's the fact that the PCI
devices themselves need 32 extra pins, which increases the manufacturing cost of
each individual device and of the connectors on the motherboard.

All of these factors, when taken together with the increased clock rate, combine to
make PCI-X a more expensive proposition than PCI, which keeps it out of
mainstream PCs. And it should also be noted that most of the problems with
increasing bus parallelism and double-pumping the bus also plague recent forms of
DDR, and especially the DDR-II spec.

And after all of that pain, you still have to deal with PCI's shared-bus topology and
all of its attendant ills. Fortunately, there's a better way.

4 PCI Express: the next generation

PCI Express (PCIe) is the newest name for the technology formerly known as 3GIO.
Though the PCIe specification was finalized in 2002, PCIe-based devices have just
now started to debut on the market.

PCIe's most drastic and obvious improvement over PCI is its point-to-point bus
topology. Take a look at the following diagram, and compare it to the layout of the
PCI bus.

In a point-to-point bus topology, a shared switch replaces the shared bus as the
single shared resource by means of which all of the devices communicate. Unlike in
a shared bus topology, where the devices must collectively arbitrate among
themselves for use of the bus, each device in the system has direct and exclusive
access to the switch. In other words, each device sits on its own dedicated bus,
which in PCIe lingo is called a link.

Like a router in a network or a telephone switchbox, the switch routes bus traffic
and establishes point-to-point connections between any two communicating devices
on a system.

4.1 Enabling Quality of Service

The overall effect of the switched fabric topology is that it allows the "smarts"
needed to manage and route traffic to be centralized in one single chip: the
switch. With a shared bus, the devices on the bus must use an arbitration scheme to
decide among themselves how to distribute a shared resource (i.e., the bus). With a
switched fabric, the switch makes all the resource-sharing decisions.

By centralizing the traffic-routing and resource-management functions in a single unit, PCIe also enables another important and long overdue next-generation
function: quality of service (QoS). PCIe's switch can prioritize packets, so that real-
time streaming packets (i.e., a video stream or an audio stream) can take priority
over packets that aren't as time critical. This should mean fewer dropped frames in
your first-person shooter and lower audio latency in your digital recording software.
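To illustrate the idea only (this is not the arbitration machinery the spec actually defines, which maps Traffic Classes onto Virtual Channels with configurable arbitration), here is a minimal Python sketch of an egress port that always forwards the highest-priority queued packet first:

import heapq

class SwitchPort:
    """Toy egress port: higher 'priority' values are forwarded first (illustrative only)."""
    def __init__(self):
        self._queue = []
        self._seq = 0   # tie-breaker keeps FIFO order within a priority level

    def enqueue(self, packet, priority):
        heapq.heappush(self._queue, (-priority, self._seq, packet))
        self._seq += 1

    def forward_next(self):
        return heapq.heappop(self._queue)[2] if self._queue else None

port = SwitchPort()
port.enqueue("bulk disk read", priority=0)
port.enqueue("isochronous audio frame", priority=7)
print(port.forward_next())   # the audio frame goes out first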

4.2 Traffic runs in lanes

When PCIe's designers started thinking about a true next-generation upgrade for
PCI, one of the issues that they needed to tackle was pin count. In the section on
PCI above, I covered some of the problems with the kind of large-scale data
parallelism that PCI exhibits (i.e. noise, cost, poor frequency scaling, etc.). PCIe
solves this problem by taking a serial approach.

As I noted previously, a connection between a PCIe device and a PCIe switch is
called a link. Each link is composed of one or more lanes, and each lane is capable
of transmitting one byte at a time in both directions at once. This full-duplex
communication is possible because each lane is itself composed of one pair of
signals: send and receive.

In order to transmit PCIe packets, which are composed of multiple bytes, a one-lane
link must break down each packet into a series of bytes, and then transmit the
bytes in rapid succession. The device on the receiving end must collect all of the
bytes and then reassemble them into a complete packet. This disassembly and
reassembly must happen rapidly enough that it's transparent to the
next layer up in the stack. This means that it requires some processing power on
each end of the link. The upside, though, is that because each lane is only one byte
wide, very few pins are needed to transmit the data. You might say that this serial
transmission scheme is a way of turning processing power into bandwidth; this is in
contrast to the old PCI parallel approach, which turns bus width (and hence pin
counts) into bandwidth. It so happens that thanks to Moore's Curves, processing
power is cheaper than bus width, hence PCIe's tradeoff makes a lot of sense.
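Here is a rough Python sketch of that byte striping and reassembly; it ignores framing, encoding, and the spec's ordering rules, and the function names are ours:

def stripe(packet_bytes, num_lanes):
    """Deal packet bytes round-robin onto the lanes of a link (lane 0 gets byte 0, etc.)."""
    lanes = [[] for _ in range(num_lanes)]
    for i, b in enumerate(packet_bytes):
        lanes[i % num_lanes].append(b)
    return lanes

def reassemble(lanes):
    """Receiver side: interleave the per-lane byte streams back into the original packet."""
    total = sum(len(lane) for lane in lanes)
    out = bytearray()
    for i in range(total):
        out.append(lanes[i % len(lanes)][i // len(lanes)])
    return bytes(out)

pkt = bytes(range(12))
assert reassemble(stripe(pkt, 4)) == pkt   # on an x4 link, four bytes move per symbol time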

We saw earlier that a link can be composed of "one or more lanes", so let us clarify that now. One of PCIe's nicest features is the ability to aggregate multiple individual lanes together to form a single link. In other words, two lanes could be coupled
together to form a single link capable of transmitting two bytes at a time, thus
doubling the link bandwidth. Likewise, you could combine four lanes, or eight lanes,
and so on.

A link that's composed of a single lane is called an x1 link; a link composed of two
lanes is called an x2 link; a link composed of four lanes is called an x4 link, etc.
PCIe supports x1, x2, x4, x8, x12, x16, and x32 link widths.

PCIe's bandwidth gains over PCI are considerable. A single lane is capable of
transmitting 2.5Gbps in each direction, simultaneously. Add two lanes together to
form an x2 link and you've got 5 Gbps, and so on with each link width. These high
transfer speeds are very good news, and will enable a new class of applications,
like SLI video card rendering.
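The numbers above are raw signaling rates. A short Python sketch of the arithmetic, folding in the 8b/10b encoding overhead discussed later (approximate, per direction, ignoring packet and protocol overhead):

def link_bandwidth(lanes, rate_gbps=2.5, encoding_efficiency=8/10):
    """Per-direction bandwidth of a first-generation PCIe link."""
    raw_gbps = lanes * rate_gbps                                  # raw signaling rate
    effective_mb_s = raw_gbps * encoding_efficiency * 1000 / 8    # usable payload -> MB/s
    return raw_gbps, effective_mb_s

for width in (1, 2, 4, 8, 16, 32):
    raw, eff = link_bandwidth(width)
    print(f"x{width}: {raw} Gbps raw, ~{eff:.0f} MB/s effective per direction")
# x1 -> 2.5 Gbps raw, ~250 MB/s; x16 -> 40 Gbps raw, ~4000 MB/s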

5 PCIe Protocol Details

So far we have been concerned with the system-level impact of PCIe; we have not looked at the protocol itself. The following material briefly explains the PCIe protocol, its layers, and the functions of each layer.

PCI Express is a high performance, general purpose I/O interconnect defined for a
wide variety of future computing and communication platforms.

5.1 PCIe Link

A Link represents a dual-simplex communications channel between two components. The fundamental PCI Express Link consists of two low-voltage, differentially driven signal pairs: a Transmit pair and a Receive pair.

5.2 PCIe Fabric Topology


5.2.1 Root Complex

A Root Complex (RC) denotes the root of an I/O hierarchy that connects the
CPU/memory subsystem to the I/O.

5.2.2 Endpoints

Endpoint refers to a type of Function that can be the Requester or Completer of a PCI Express transaction either on its own behalf or on behalf of a distinct non-PCI
Express device (other than a PCI device or Host CPU), e.g., a PCI Express attached
graphics controller or a PCI Express-USB host controller. Endpoints are classified as
either legacy, PCI Express, or Root Complex Integrated Endpoints.

5.2.3 PCI Express to PCI/PCI-X Bridge

A PCI Express to PCI/PCI-X Bridge provides a connection between a PCI Express fabric and a PCI/PCI-X hierarchy.

5.3 PCI Express Layering Overview

PCI Express can be divided into three discrete logical layers: the Transaction Layer,
the Data Link Layer, and the Physical Layer. Each of these layers is divided into two
sections: one that processes outbound (to be transmitted) information and one that
processes inbound (received) information.

PCI Express uses packets to communicate information between components. Packets
are formed in the Transaction and Data Link Layers to carry the information from
the transmitting component to the receiving component. As the transmitted packets
flow through the other layers, they are extended with additional information
necessary to handle packets at those layers. At the receiving side the reverse
process occurs and packets get transformed from their Physical Layer representation
to the Data Link Layer representation and finally (for Transaction Layer Packets) to
the form that can be processed by the Transaction Layer of the receiving device.
The figure below shows the conceptual flow of transaction-level packet information
through the layers.

Note that a simpler form of packet communication is supported between two Data
Link Layers (connected to the same Link) for the purpose of Link management.
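To make the layering concrete, here is a heavily simplified Python sketch of a TLP picking up extra fields on the way down the stack; field sizes and framing symbols are only indicative, and the CRC shown is a stand-in for the real LCRC:

import zlib

def transaction_layer_tx(header: bytes, payload: bytes) -> bytes:
    # Transaction Layer: assemble the TLP (an optional end-to-end ECRC is omitted here)
    return header + payload

def data_link_layer_tx(tlp: bytes, seq_num: int) -> bytes:
    # Data Link Layer: prepend a sequence number and append a link CRC
    lcrc = zlib.crc32(tlp).to_bytes(4, "little")   # stand-in for the real LCRC-32
    return seq_num.to_bytes(2, "big") + tlp + lcrc

def physical_layer_tx(dll_packet: bytes) -> bytes:
    # Physical Layer: add start/end framing; encoding and lane striping are not shown
    stp, end = b"\xfb", b"\xfd"                    # illustrative framing markers
    return stp + dll_packet + end

wire = physical_layer_tx(data_link_layer_tx(transaction_layer_tx(b"HDR0", b"DATA"), seq_num=1))
print(wire.hex())   # the receiver unwraps these layers in the reverse order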

5.4 Layers of the Protocol

We take a brief look at the functions of each of the 3 layers.

5.4.1 Transaction Layer

This is the top layer that interacts with the software above.

Functions of the Transaction Layer:

1. Mechanisms for differentiating the ordering and processing requirements of Transaction Layer Packets (TLPs)
2. Credit-based flow control (see the sketch after this list)
3. TLP construction and processing
4. Association of transaction-level mechanisms with device resources, including Flow Control and Virtual Channel management
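To get a feel for what credit-based flow control means, here is a hedged Python sketch: the receiver advertises buffer space, the transmitter sends a TLP only when enough credits remain, and credits are returned as buffers drain. The real protocol tracks separate header and data credits for posted, non-posted, and completion traffic; this collapses everything into one counter:

class CreditedLink:
    """Single-counter caricature of PCIe flow control (illustrative, not spec-accurate)."""
    def __init__(self, advertised_credits):
        self.credits = advertised_credits    # advertised by the receiver at initialization

    def try_send(self, tlp_size_credits):
        if tlp_size_credits > self.credits:
            return False                     # transmitter must wait; nothing is dropped
        self.credits -= tlp_size_credits
        return True

    def return_credits(self, freed_credits):
        self.credits += freed_credits        # receiver reports freed buffer space

link = CreditedLink(advertised_credits=8)
print(link.try_send(6))    # True
print(link.try_send(4))    # False -- blocked until credits come back
link.return_credits(6)
print(link.try_send(4))    # True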

5.4.2 Data Link Layer

The Data Link Layer acts as an intermediate stage between the Transaction Layer
and the Physical Layer. Its primary responsibility is to provide a reliable mechanism
for exchanging Transaction Layer Packets (TLPs) between the two components on a
Link.

Functions of the Data Link Layer:

1. Data exchange
2. Error detection and retry
3. Initialization and power management

5.4.3 Physical Layer


The Physical Layer isolates the Transaction and Data Link Layers from the signaling technology used for Link data interchange. The Physical Layer is divided into the logical and electrical sub-blocks.

Logical Sub-block
Takes care of symbol encoding, framing, data scrambling, Link initialization and training, and Lane-to-Lane de-skew.

Electrical Sub-block
The electrical sub-block defines the physical layer of PCI Express at 5.0 GT/s, which consists of a reference clock source, Transmitter, channel, and Receiver. This section defines the electrical-layer parameters required to guarantee interoperability among the above-listed PCI Express components, and it comprehends both 2.5 GT/s and 5.0 GT/s electricals. In many cases the parameter definitions between 2.5 and 5.0 GT/s are identical, even though their respective values may differ. However, the need at 5.0 GT/s to minimize guardbanding, while simultaneously comprehending all phenomena affecting signal integrity, requires that all the PCI Express system components (Transmitter, Receiver, channel, and Refclk) be explicitly defined in the specification. For this reason, each of these four
components has a separate specification section for 5.0 GT/s.

6 Changes in PCIe 3.0 (GEN3)

The goal of the PCI-SIG work group defining this next-generation interface was to double the bandwidth of PCIe Gen 2, which is 5 gigatransfers per second (GT/s) signaling but 4GT/s effective bandwidth after 8b/10b encoding overhead. The group had two choices: either increase the signaling rate to 10GT/s with 20 percent encoding overhead, or select a lower signaling rate (8GT/s) for better signal integrity and reduced encoding overhead, with a different set of challenges. The PCI-SIG decided to go with 8GT/s and reduce the encoding overhead to offer approximately 7.88GT/s of effective bandwidth, approximately double that of PCIe Gen 2. The increase in signaling rate from PCIe Gen 2's 5GT/s to PCIe Gen 3's 8GT/s provides a sixty percent increase in data rate, and the remainder of the effective bandwidth increase comes from replacing the 8b/10b encoding (20 percent inefficiency) with 128b/130b coding (1-2 percent inefficiency).

The challenge of moving from PCIe Gen 2 to Gen 3 is to accommodate the signaling rate where clock timing goes from 200ps to 125ps, jitter tolerance goes from 44ps to 14ps, and the total sharable band (for SSC) goes down from 80ps to 35ps. These are enormous challenges to overcome, but the PCI-SIG has already completed board, package, and system modeling to make sure designers are able to develop systems that support these rates. The table below highlights some key aspects of PCIe Gen 2 and Gen 3.

The beauty of the Gen 3 solution is that it will support twice the data rate with equal or lower power consumption than PCIe Gen 2. Additionally, applications using PCIe Gen 2 would be able to migrate seamlessly, as the reference clock remains at 100MHz and the channel reach for mobiles (8 inches), clients (14 inches), and volume servers (20 inches) stays the same. More complex equalizers, such as decision feedback equalization, may be implemented optionally for the extended reach needed in a backplane environment. The Gen 3 specification will enhance signaling by adding transmitter de-emphasis, receiver equalization, and optimization of Tx/Rx Phase Lock Loops and Clock Data Recovery. The Gen 3 specification also requires devices that support the Gen 3 rate to dynamically negotiate up or down to/from Gen 1 and Gen 2 data rates based on signal/line conditions.

6.1 Benefits from the newer specs

6.1.1 Higher Speed

Goal: improve performance. Each successive generation doubles the bit rate of the previous generation, and that holds true for Gen3, too, but there's a significant difference this time. Since the previous speed was 5.0 GT/s, the new speed would normally have been 10.0 GT/s, but the spec writers considered a signal that used a 10GHz clock problematic because of the board design and signal integrity issues that many vendors would face. Constrained to stay under that frequency, they were forced to consider other options. The solution they chose was to move away from the 8b/10b encoding scheme that PCIe and most other serial transports have used, because it adds a 20% overhead from the receiver's perspective. Instead, they chose a modified scrambling method that effectively creates a 128b/130b encoding method. This gain in efficiency meant that an increase in frequency to only 8.0GHz would be enough to achieve a doubling of the bandwidth and meet this goal.
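A small Python sketch of the arithmetic behind that decision (approximate, per lane, per direction, ignoring packet and protocol overhead):

def effective_gbps(rate_gt_s, payload_bits, total_bits):
    """Effective data rate after line-encoding overhead."""
    return rate_gt_s * payload_bits / total_bits

gen2 = effective_gbps(5.0, 8, 10)       # 8b/10b    -> 4.0 Gb/s per lane
gen3 = effective_gbps(8.0, 128, 130)    # 128b/130b -> ~7.88 Gb/s per lane
print(gen2, gen3, gen3 / gen2)          # roughly a 2x gain without needing a 10GHz clock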

6.1.2 Resizable BAR Capability

Goal: allow the system to select how much system resource is allocated to a device.
This new optional set of registers allows functions to communicate their resource size options to system software, which can then select the optimal size and
communicate that back to the function. Ideally, the software would use the largest
setting reported, since that would give the best performance, but it might choose a
smaller size to accommodate constrained resources. Currently, sizes from 1MB to
512GB are possible. If these registers are implemented, there is one capability
register to report the possible sizes, and one control register to select the desired
size for each BAR. Note that devices might report a smaller size by default to help them be compatible in many systems, but using the smaller size would also reduce their performance.
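A hedged Python sketch of the selection policy just described; the way supported sizes are represented here is invented for illustration, not the actual register encoding:

def choose_bar_size(supported_sizes_mb, available_mb):
    """Pick the largest size the function supports that still fits the system's budget."""
    fitting = [s for s in supported_sizes_mb if s <= available_mb]
    return max(fitting) if fitting else None

# A function advertising 1MB..512GB (powers of two), on a system willing to spend 4GB:
supported = [1 << n for n in range(0, 20)]             # 1MB ... 512GB, expressed in MB
print(choose_bar_size(supported, available_mb=4096))   # 4096 -> a 4GB BAR is selected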

6.1.3 Dynamic Power Allocation

Goal: provide more software-controlled power states to improve power management (PM). Some endpoint devices don't have a device-specific driver to
manage their power efficiently, and DPA provides a means to fill that gap. DPA only
applies when the device is in the D0 state, and it defines up to 32 substates.
Substate 0 (the default) defines the max power, and successive substates have a power allocation equal to or lower than the previous one. Software is permitted to change the substates in any order. The Substate Control Enabled bit can be used to disable
this capability. Any time the device is changing between substates, it must always
report the highest power requirement of the two until the transition has been
completed, and the time needed to make the change is implementation specific.

To allow software to set up PM policies, functions define two transition latency values, and every substate associates its max transition time (the longest time it takes to
enter that substate from any other substate) with one of those.
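A brief Python sketch of those rules; the substate count and wattages are made up, and only the ordering constraint and the reporting rule during a transition are modeled:

class DpaFunction:
    """Toy model of Dynamic Power Allocation substates (illustrative only)."""
    def __init__(self, allocations_w):
        # Substate 0 is the max; each later substate must allocate <= the previous one
        assert all(a <= b for a, b in zip(allocations_w[1:], allocations_w[:-1]))
        self.allocations_w = allocations_w
        self.current = 0                       # default substate

    def reported_power(self, target=None):
        # While transitioning, report the higher requirement of the two substates
        if target is None:
            return self.allocations_w[self.current]
        return max(self.allocations_w[self.current], self.allocations_w[target])

dev = DpaFunction([25.0, 18.0, 10.0, 5.0])
print(dev.reported_power(target=3))   # 25.0 W is reported until the transition completes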

6.1.4 Alternative Routing-ID Interpretation

Goal: support a much larger number of functions inside devices. For requesters and
completers, this means treating the device number value as though it was really just
an extension of the function field to give an 8-bit value for the function number.
Since the device number is no longer included, it's always assumed to be 0. The
spec also defines a new set of optional registers that can be used to assign a
function group number to each function. Within an ARI device, several functions can
be associated with a single group number, and that can serve as the basis for
arbitration or access control.
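A short Python sketch of the difference in how the 16-bit Routing ID is interpreted, per the description above (the helper names are ours):

def traditional_routing_id(bus, device, function):
    """Classic interpretation: 8-bit bus, 5-bit device, 3-bit function."""
    return (bus << 8) | (device << 3) | function

def ari_routing_id(bus, function):
    """ARI interpretation: 8-bit bus, 8-bit function (device number implied as 0)."""
    return (bus << 8) | function

print(hex(traditional_routing_id(0x3, 0x1f, 0x7)))   # 0x3ff
print(hex(ari_routing_id(0x3, 0xff)))                # 0x3ff -- same bits, up to 256 functions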
