
System Address Map Initialization in x86/x64 Architecture Part 1: PCI-Based Systems

This article serves as a clarification about PCI expansion ROM address mapping, which was
not sufficiently covered in my "Malicious PCI Expansion ROM" article published by Infosec
Institute last year (http://resources.infosecinstitute.com/pci-expansion-rom/). Low-level
programmers are sometimes puzzled about the mapping of device memory, such as PCI device
memory, to the system address map. This article explains the initialization of the system address
map, focusing on the initialization of the PCI chip registers that control PCI device memory
address mapping to the system address map. PCI device memory address mapping is only
required if the PCI device contains memory, such as a video card, a network card with an onboard
buffer, or a network card that supports PCI expansion ROM.
The x86/x64 system address map is complex due to the backward compatibility that must be maintained
in the bus protocols of the x86/x64 architecture. The bus protocol used in a system dictates the
address mapping of the memory of a device (attached to the bus) into the system address
map. Therefore, you must understand the address mapping mechanism of the specific bus
protocol to understand the system address map initialization. This article focuses on systems
based on the PCI bus protocol. The PCI bus protocol is a legacy bus protocol by today's standards.
However, it's very important to understand how it works at the lowest level in terms of
software/firmware, because it's impossible to understand the later bus protocol, PCI Express
(PCIe), without understanding the PCI bus protocol. PCIe is the main bus protocol in virtually
every x86/x64 system today. Part 2 of this article will focus on PCIe-based systems.

Conventions
There are several different usages of the word "memory" in this article. It can be confusing for
those new to the subject. Therefore, this article uses these conventions:
1. The word "main memory" refers to the RAM modules installed on the motherboard.
2. The word "memory controller" refers to the part of the chipset or the CPU that controls the
RAM modules and accesses to the RAM modules.
3. "Flash memory" refers to either the chip on the motherboard that stores the BIOS/UEFI or
the chip that stores the PCI expansion ROM contents.
4. The word "memory range" or "memory address range" means the range, i.e., from the
base/start address to the end address (base address + memory size), occupied by a
device in the CPU memory space.
5. The word "memory space" means the set of memory addresses accessible by the CPU,
i.e., the memory that is addressable from the CPU. Memory in this context could mean
RAM, ROM, or other forms of memory which can be addressed by the CPU.
6. The word "PCI expansion ROM" mostly refers to the ROM chip on a PCI device, except
when the context contains other specific explanation.

The Boot Process at a Glance


This section explains the boot process in sufficient detail to understand the system address map
and other bus protocol-related matters that are explained later in this article. You need a
clear understanding of the boot process before we get into the system address map and bus
protocol-related discussion.
The boot process in x86/x64 starts with the platform firmware (BIOS/UEFI) execution. The
platform firmware execution happens prior to the operating system (OS) boot, specifically before
the boot loader loads and executes the OS. Platform firmware execution can be summarized as
follows:
1. Start of execution at the CPU (processor) reset vector. On all platforms, the bootstrap
processor (BSP) starts execution by fetching the instruction located at an address known
as the reset vector. In x86/x64 this address is 4GB minus 16 bytes (FFFF_FFF0h). This
address is always located in the BIOS/UEFI flash memory on the motherboard.
2. CPU operating mode initialization. In this stage, the platform firmware switches the CPU
to the platform firmware's CPU operating mode; it could be real mode, "voodoo" mode, or
flat protected mode, depending on the platform firmware. An x86/x64 CPU resets in a
modified real mode operating mode, i.e., real mode at physical address FFFF_FFF0h.
Therefore, if the platform firmware's CPU operating mode is flat protected mode, it must
switch the CPU into that mode. Present-day platform firmware doesn't use voodoo
mode as extensively as in the past. In fact, most present-day platform firmware has
abandoned its use altogether. For example, UEFI implementations use flat protected
mode.
3. Preparation for memory initialization. In this stage there are usually three steps carried
out by the platform firmware code:
1. CPU microcode update. In this step the platform firmware loads the CPU
microcode update into the CPU.
2. CPU-specific initialization. In x86/x64 CPUs since (at least) the Pentium III and
AMD Athlon era, part of the code in this stage usually sets up a temporary stack
known as cache-as-RAM (CAR), i.e., the CPU cache acts as temporary
(writeable) RAM, because at this point of execution there is no writable memory;
the RAM hasn't been initialized yet. Complex code in the platform firmware
requires the use of a stack. In old BIOS code, there was some sort of assembler macro
trick for return address handling, because by default the return address from a
function call in x86/x64 is stored on the stack, but no writeable memory is
available for a stack yet. However, this old trick is not needed anymore, because all
present-day CPUs support CAR. If you want to know more about CAR, you can
consult the BIOS and Kernel Developer Guide (BKDG) for AMD Family 10h
at http://support.amd.com/us/Processor_TechDocs/31116.pdf. Section 2.3.3 of
that document explains how to use the CPU L2 cache as general storage on
boot. CAR is required because main memory (RAM) initialization is a complex
task and requires the use of complex code as well. The presence of CAR is an
invaluable help here. Aside from CAR setup, certain CPUs need to initialize some
of their model-specific registers (MSRs); that initialization is usually carried out in
this step.
3. Chipset initialization. In this step the chipset registers are initialized, particularly
the chipset base address registers (BARs). We'll have a deeper look into BARs later.
For the time being, it's sufficient to know that a BAR controls how the chip
registers and memory (if the device has its own memory) are mapped to the
system address map. In some chipsets, there is a watchdog timer that must be
disabled before memory initialization because it could randomly reset the system.
In that case, disabling the watchdog timer is carried out in this step.
4. Main memory (RAM) initialization. In this step, the memory controller initialization
happens. In the past, the memory controller was part of the chipset. Today, that's no
longer the case; the memory controller is now integrated into the CPU. The memory
controller initialization and RAM initialization happen together as complementary code,
because the platform firmware code must figure out the correct parameters supported by
both the memory controller and the RAM modules installed on the system and then
initialize both components into the correct setup.
5. Post memory initialization. Before this step, the platform firmware code is executed from
the flash ROM on the motherboard, and if CAR is enabled, the CPU cache acts as the
stack. That's painfully slow compared to ordinary code execution in RAM, especially
for instruction fetches into the CPU, because the flash ROM is very slow compared to
RAM. Therefore, the platform firmware binary usually copies itself to RAM in this step and
continues execution there. In the previous step, the main memory (RAM) was initialized.
However, there are several more steps required before the main memory (RAM) can be
used to execute the platform firmware code:
1. Memory test. This is a test performed to make sure RAM is ready to be used,
because it's possible that some parts of the RAM are broken. How thorough
the test is depends on the boot time requirement of the system. If the
boot time requirement is very tight, in many cases it's impossible to test all of
the RAM; only some parts can be tested, chosen with some sort of statistical
approach, so that the test covers as wide a part of the RAM as possible
(statistically speaking).
2. Shadowing the firmware to RAM. Shadowing in this context means copying
the firmware code from the flash ROM into RAM at an address range below the 1MB limit
(1MB is the old 20-bit address mapping limit set for DOS-era hardware). However,
the copying is not a trivial copy, because the code will reside in RAM in
the same address range previously occupied by the flash ROM; this is why it's
called shadowing. Some bit twiddling must be done on the chipset by the
platform firmware code to control the mapping of the address range to the RAM
and the flash ROM. Details of the bit twiddling are outside the scope of this
article. You can read details of the mapping in the respective chipset datasheet.
3. Redirecting memory transactions to the correct target. This is a continuation of the
shadowing step. The details depend on the platform (CPU and chipset
combination) and the runtime setup, i.e., whether or not to shadow the platform
firmware at runtime (when the OS runs).
4. Setting up the stack. This step sets up the stack (in RAM) to be used for further
platform firmware code execution. In the previous steps, the stack was assumed to
reside in CAR. In this step, the stack is switched from CAR to RAM because
the RAM is ready to be used. This is important because the space for the stack in
CAR is limited compared to RAM.
5. Transferring platform firmware execution to RAM. This is a jump to the platform
firmware code that was shadowed to RAM in step 2 (shadowing).
6. Miscellaneous platform enabling. This step depends on the specific system configuration,
i.e., the motherboard and supporting chips. It usually consists of clock generator chip
initialization, to run the platform at the intended speed, and on some platforms this step
also includes initializing the general purpose I/O (GPIO) registers.
7. Interrupt enabling. The previous steps assume that interrupts are not yet enabled, because
none of the interrupt hardware has been configured. In this step the interrupt hardware, such
as the interrupt controller(s), and the associated interrupt handler software are initialized.
There are several possible interrupt controllers in x86/x64, i.e., the 8259
programmable interrupt controller (PIC), the local advanced programmable interrupt
controller (LAPIC) present in most CPUs today, and the I/O advanced programmable
interrupt controller (IOxAPIC) present in most chipsets today. After the hardware and
software required to handle interrupts are ready, the interrupt is enabled.
8. Timer initialization. In this step, the hardware timer is enabled. The timer generates a timer
interrupt when a certain interval is reached. The OS and some applications running on top of
the OS use the timer to work. There are several possible pieces of hardware (or
some combination) that could act as the timer in an x86/x64 platform, i.e., the 8254
programmable interval timer (PIT) chip that resides in the chipset; the high precision
event timer (HPET), also residing in the chipset (this timer doesn't need initialization and
is used only by the OS); the real time clock (RTC), which also resides in the chipset; and the
local APIC (LAPIC) timer present in the CPU.
9. Memory caching control initialization. The x86/x64 CPU contains memory type range
registers (MTRRs) that control the caching of all memory ranges addressable by the
CPU. The caching of a memory range depends on the type of hardware present in the
respective memory range, and it must be initialized accordingly. For example, the memory
range(s) occupied by I/O devices such as the PCI bus must be initialized as uncached
address range(s); memory range(s) in this context are as seen from the CPU's point of
view.
10. Application processor(s) initialization. The non-bootstrap CPU (processor) cores are called
application processors (APs) in some documentation; we will use the same naming
here. In multicore x86/x64 CPUs, only the BSP is active upon reset. Therefore, the other
cores (the APs) must be initialized accordingly before the OS boot loader takes control of
the system. One of the most important things to initialize in the APs is the MTRRs. The
MTRRs must be consistent across all CPU cores; otherwise, memory reads and writes could
misbehave and bring the system to a halt (a short sketch of capturing the MTRR values
follows this list).
11. Simple I/O devices initialization. Simple I/O devices in this context are hardware such
as the super I/O (SIO) chip, the embedded controller, etc. This initialization depends on the
system configuration. The SIO typically controls legacy I/O, such as PS/2 and serial
interfaces. The embedded controller is mostly found in laptops; it controls things such as
the buttons on the laptop, the interface from the laptop motherboard to the battery, etc.
12. PCI device discovery and initialization. In this step, PCI devices (and, by extension, PCIe
devices and other devices connected to a PCI-compatible bus) are detected and
initialized. The devices detected in this step could be part of the chipset and/or other PCI
devices in the system, either soldered to the motherboard or plugged into the PCI/PCIe expansion
slots. Several resources are assigned to the devices in this step: I/O
space, memory mapped I/O (MMIO) space, IRQs (for
devices that require an IRQ), and expansion ROM detection and execution. The
assignment of memory or I/O address space happens via the use of BARs. We'll get into
the details in the PCI Bus Base Address Registers Initialization section. USB device
initialization happens in this step as well, because the USB host controllers are PCI devices.
Other non-legacy devices, such as SATA, SPI, etc., are initialized in this step as well.
13. OS boot-loader execution. This is where the platform firmware hands over execution
to the OS boot-loader, such as GRUB or LILO in Linux, or the Windows OS loader.
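As a side note on steps 9 and 10, the MTRR state that must be kept consistent across cores lives in model-specific registers with architecturally defined indices. The following C fragment is my own illustration (not vendor firmware code) of how the BSP's MTRR values could be captured so that identical values can later be programmed into each AP; it assumes ring 0 and GCC-style inline assembly:

#include <stdint.h>

#define IA32_MTRRCAP        0xFEu    /* bits 7:0 = number of variable-range MTRRs */
#define IA32_MTRR_DEF_TYPE  0x2FFu   /* default memory type and MTRR enable bits  */
#define IA32_MTRR_PHYSBASE0 0x200u   /* first variable-range MTRR base            */
#define IA32_MTRR_PHYSMASK0 0x201u   /* first variable-range MTRR mask            */

/* Read a model-specific register; must run in ring 0. */
static inline uint64_t rdmsr(uint32_t msr)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdmsr" : "=a"(lo), "=d"(hi) : "c"(msr));
    return ((uint64_t)hi << 32) | lo;
}

/* Capture the MTRR state of the executing core (normally the BSP).
 * 'count' is the number of variable-range MTRR pairs, which the caller
 * would take from IA32_MTRRCAP bits 7:0. Firmware would later program
 * the very same values (with wrmsr) on every AP. */
static void capture_mtrrs(uint64_t *def_type, uint64_t *bases,
                          uint64_t *masks, unsigned count)
{
    *def_type = rdmsr(IA32_MTRR_DEF_TYPE);
    for (unsigned i = 0; i < count; i++) {
        bases[i] = rdmsr(IA32_MTRR_PHYSBASE0 + 2u * i);
        masks[i] = rdmsr(IA32_MTRR_PHYSMASK0 + 2u * i);
    }
}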
Now, the boot process carried out by the platform firmware should be clear to you, particularly the
steps in which the system address map is initialized in relation to PCI devices, namely step 3.3
(chipset initialization) and step 12. Both of these steps deal with the BARs in the PCI chips or in
parts of the chipset.

Dissecting PCI-Based System Address Map


We need to dissect a typical implementation sample of the system address map in x86/x64
before we can proceed to the system address map initialization in more detail. The Intel 815E
chipset is the implementation sample here. You can download the chipset datasheet
at http://download.intel.com/design/chipsets/datashts/29068801.pdf. Intel 815E is only half of the
equation, because it's the northbridge part of the chipset. The southbridge part of the chipset is
the Intel ICH2. You can download the Intel ICH2 datasheet
at http://www.intel.com/content/www/us/en/chipsets/82801ba-i-o-controller-hub-2-82801bam-i-ocontroller-hub-2-mobile-datasheet.html. The northbridge is the chipset closer to the CPU and
connected directly to the CPU, while the southbridge is the chipset farther away from the CPU
and connected to the CPU via the northbridge. In present-day CPUs, the northbridge is typically
integrated into the CPU package.

System Based on Intel 815E-ICH2 Chipset


Figure 1 shows a simplified block diagram of a system that uses the Intel 815E-ICH2 chipset
combination. Figure 1 doesn't show all of the connections from the chipset to other components in
the system, only those related to address mapping in the system.
Figure 1 Intel 815E-ICH2 (Simplified) Block Diagram

The Intel 815E-ICH2 chipset pair is not a pure PCI chipset, because it implements a non-PCI
bus, called the hub interface (HI), to connect the northbridge and the southbridge, as you can see
in Figure 1. However, from a logical point of view, the HI bus is basically a PCI bus with a faster
transfer speed. Our focus here is not on the transfer speed or techniques related to data
transfers, but on the address mapping, and, since HI doesn't alter anything related to address
mapping, we can safely ignore the HI bus specifics and regard it in the same respect as the PCI bus.
The Intel 815E chipset is ancient by present standards, but it's a very interesting case study for
those new to PCI chipset low-level details because it's very close to a pure PCI-based system.
As you can see in Figure 1, the Intel 815E chipset was one of the northbridge chipsets for Intel CPUs
that use socket 370, such as the Pentium III (code name Coppermine) and Intel Celeron CPUs.
Figure 1 shows how the CPU connects (logically) to the rest of the system via the Intel 815E
northbridge. This implies that any access to any device outside the CPU must pass through the
northbridge. From an address mapping standpoint, this means that the Intel 815E acts as a sort of
address mapping router, i.e., the device that routes read or write transactions to a certain
address (or address range) to the correct device. In fact, that is how the northbridge works in
practice. The difference with present-day systems is the physical location of the northbridge,
which is now integrated into the CPU instead of being an individual component on the motherboard,
as the Intel 815E was back then. The configuration of the address mapping router in the northbridge at
boot differs from the runtime configuration. In practice, the address mapping router is a series of
(logical) PCI device registers in the northbridge that control the system address map. Usually,
these registers are part of the memory controller and the AGP/PCI Bridge logical devices in the
northbridge chipset. The platform firmware initializes these address mapping-related registers at
boot to prepare for runtime usage inside an OS.
Now that you know how the system address map works at the physical level, i.e., that it is controlled
by registers in the northbridge, it's time to dive into the system address map itself. Figure 2 shows
the system address map of systems using the Intel 815E chipset.

Figure 2 Intel 815E System Address Map

Figure 2 shows the Intel 815E system address map. You can find a complementary address mapping
explanation in the Intel 815E chipset datasheet, Section 4.1, System Address Map. Be aware that
the system address map is seen from the CPU's point of view, not from the point of view of
other chips in the system.
Figure 2 shows that the memory space supported by the CPU is actually 64GB. This support has
been present in Intel CPUs since the Pentium Pro era, through a technique called physical address
extension (PAE). Despite that, the amount of memory space used depends on the memory
controller in the system, which in this case is located in the northbridge (Intel 815E chipset). The
Intel 815E chipset only supports up to 512MB of main memory (RAM) and only uses a 4GB
memory space. Figure 2 also shows that PCI devices consume (use) the CPU memory space. A
device that consumes CPU memory space is termed a memory mapped I/O device, or MMIO
device for short. The MMIO term is widely used in the industry and applies to all other CPUs, not
just x86/x64.
Figure 2 shows that the RAM occupies (at most) the lowest 512MB of the memory space.
However, above the 16MB physical address, the space seems to be shared between RAM and
PCI devices. Actually, there is no sharing of a memory range happening in the system, because the
platform firmware initializes the PCI devices' memory to use the memory range above the memory
range consumed by main memory (RAM) in the CPU memory space. The memory range free
from RAM depends on the amount of RAM installed in the system. If the installed RAM size is
256MB, the PCI memory range starts right after the 256MB boundary and runs up to the 4GB memory
space boundary; if the installed RAM size is 384MB, the PCI memory range starts right after the
384MB boundary and runs up to the 4GB memory space boundary, and so on. This implies
that the memory range consumed by the PCI devices is relocatable, i.e., it can be relocated within
the CPU memory space; otherwise it would be impossible to prevent conflicts in memory range usage.
That is indeed the case, and it's actually one of the features of the PCI bus that sets it apart from the
ISA bus it replaced. On the (very old) ISA bus, you had to set the jumpers on the ISA device to
the correct setting; otherwise there would be an address usage conflict in your system. On the PCI bus,
it's the job of the platform firmware to set up the correct address mapping in the PCI devices.
There are several special ranges in the memory range consumed by PCI device memory above
the installed RAM size. The installed RAM size is termed top of main memory (TOM) in Intel
documentation, so I'll use the same term from now on. Some of the special ranges above TOM
are hardcoded and cannot be changed, because the CPU reset vector and certain non-CPU chip
registers always map to those special memory ranges, i.e., they are not relocatable. Among
these hardcoded memory ranges are the memory ranges used by the advanced programmable
interrupt controller (APIC) and the flash memory that stores the BIOS/UEFI code. These
hardcoded memory ranges cannot be used by PCI devices at all.

PCI Configuration Register


Now, we have arrived at the core of our discussion: how does the PCI bus protocol map PCI
device memory to the system address map? The mapping is accomplished by using a set of PCI
device registers called BARs (base address registers). We will get into the details of BAR
initialization in the PCI Bus Base Address Registers Initialization section later. Right now, we will
look at details of the BAR implementation in PCI devices.
BARs are part of the so-called PCI configuration register. Every PCI device must implement the
PCI configuration register dictated by the PCI specification; otherwise, the device will not be
regarded as a valid PCI device. The PCI configuration register controls the behavior of the PCI
device at all times. Therefore, changing the (R/W) values in the PCI configuration register
changes the behavior of the system as well. In x86/x64 architecture, the PCI configuration register
is accessible via two 32-bit I/O ports: I/O port CF8h-CFBh acts as the address port, while I/O port
CFCh-CFFh acts as the data port to read/write values in the PCI configuration register. For
details on accessing the PCI configuration register in x86, please read this section of my past
article: https://sites.google.com/site/pinczakko/pinczakko-s-guide-to-award-bios-reverseengineering#PCI_BUS. The material in that link should give you a clearer view of access to
the PCI configuration register in x86/x64 CPUs. Mind you, the code in that link must be
executed in ring 0 (kernel mode) or under DOS (if you're using a 16-bit assembler); otherwise
the OS will respond with an access permission exception.
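As an illustration of this access mechanism, below is a minimal C sketch of the legacy configuration access described above. It assumes a freestanding, ring 0 environment and GCC-style inline assembly; the helper names (pci_config_read32/pci_config_write32) are mine, not a standard API:

#include <stdint.h>

/* Port I/O helpers (x86, GCC/Clang inline assembly); must run in ring 0. */
static inline void outl(uint16_t port, uint32_t value)
{
    __asm__ __volatile__("outl %0, %1" : : "a"(value), "Nd"(port));
}

static inline uint32_t inl(uint16_t port)
{
    uint32_t value;
    __asm__ __volatile__("inl %1, %0" : "=a"(value) : "Nd"(port));
    return value;
}

#define PCI_CONFIG_ADDRESS 0xCF8   /* address port, CF8h-CFBh */
#define PCI_CONFIG_DATA    0xCFC   /* data port, CFCh-CFFh    */

/* Build the CF8h address: enable bit, bus, device, function, and a
 * dword-aligned offset into the 256-byte configuration space. */
static uint32_t pci_config_address(uint8_t bus, uint8_t dev,
                                   uint8_t func, uint8_t offset)
{
    return (1u << 31) | ((uint32_t)bus << 16) | ((uint32_t)dev << 11) |
           ((uint32_t)func << 8) | (offset & 0xFCu);
}

/* Read a 32-bit value from a device's PCI configuration space. */
static uint32_t pci_config_read32(uint8_t bus, uint8_t dev,
                                  uint8_t func, uint8_t offset)
{
    outl(PCI_CONFIG_ADDRESS, pci_config_address(bus, dev, func, offset));
    return inl(PCI_CONFIG_DATA);
}

/* Write a 32-bit value to a device's PCI configuration space. */
static void pci_config_write32(uint8_t bus, uint8_t dev, uint8_t func,
                               uint8_t offset, uint32_t value)
{
    outl(PCI_CONFIG_ADDRESS, pci_config_address(bus, dev, func, offset));
    outl(PCI_CONFIG_DATA, value);
}

For example, reading offset 00h of a given bus/device/function returns the vendor and device IDs; a read-back of FFFF_FFFFh conventionally means that no device responds at that location.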
Now, let's look more closely at the PCI configuration register. The PCI configuration register
consists of 256 bytes of registers, from (byte) offset 00h to (byte) offset FFh. The 256-byte PCI
configuration register consists of two parts: the first 64 bytes are called the PCI configuration register
header and the rest are called device-specific PCI configuration registers. This article only deals
with the BARs, which are located in the PCI configuration register header. It doesn't deal with the
device-specific PCI configuration registers, because only the BARs affect the mapping of PCI
device memory to the system address map.
There are two types of PCI configuration register header: a type 0 and a type 1 header. A PCI-to-PCI
bridge device must implement the PCI configuration register type 1 header, while other
PCI devices must implement the PCI configuration register type 0 header. This article only deals
with the PCI configuration register type 0 header and focuses on the BARs. Figure 3 shows the PCI
configuration register type 0 header. The registers are accessed via I/O ports CF8h-CFBh and
CFCh-CFFh, as explained previously.

Figure 3 PCI Configuration Registers Type 0

Figure 3 shows that there are two kinds of BAR, highlighted on a blue background: the BARs
themselves and the expansion ROM base address register (XROMBAR). The BARs span six 32-bit
registers (24 bytes), from offset 10h to offset 27h in the PCI configuration header type 0.
BARs are used for mapping non-expansion ROM PCI device memory (usually RAM on the
PCI device) to the system memory map, while the XROMBAR is specifically used for mapping the
PCI expansion ROM to the system address map. It's the job of the platform firmware to initialize the
values of the BARs. Each BAR is a 32-bit register, hence each of them can map PCI device
memory in the 32-bit system address map, i.e., can map the PCI device memory into the 4GB
memory address space.

The BARs and XROMBAR are readable and writeable registers, i.e., the contents of the
BARs/XROMBAR can be modified. That's the core capability required to relocate PCI device
memory in the system address map. PCI device memory is said to be relocatable in the system
address map because you can change the base address (start address) of the PCI device
memory in the system address map by changing the contents of the BARs/XROMBAR. It works
like this:

- Different systems can have different main memory (RAM) sizes.
- A different RAM size means that the area in the system address map set aside for the PCI
MMIO range differs.
- A different PCI MMIO range means that the PCI device memory occupies different addresses
in the CPU memory space. That means you must be able to change the base address
of the PCI device memory in the CPU memory space as required when migrating the PCI
device to a system with a different amount of RAM; the same is true if you add more RAM
to the same system.
- The BARs and XROMBAR control the addresses occupied by the PCI device memory. Because
both of them are modifiable, you can change the memory range occupied by the PCI
device memory (in the CPU memory space) as required.

Practical System Address Map on Intel 815E-ICH2 Platform


Perhaps the XROMBAR and BARs explanation is still a bit confusing for beginners. Let's look at
some practical examples. The scenario is as follows:
1. The system in focus uses an Intel Pentium III CPU with 256 MB RAM, a motherboard with the
Intel 815E chipset, and an AGP video card with 32 MB of onboard memory. The AGP video
card is basically a PCI device with onboard memory from the system address map point
of view. We call this configuration the first system configuration from now on.
2. The same system as in point 1, but with a new 256 MB RAM module added. Therefore,
the system now has 512 MB RAM, the maximum amount of RAM supported by the Intel
815E chipset according to its datasheet. We call this configuration the second system
configuration from now on.
We'll have a look at what the system address map looks like in both of the configurations above.
Figure 4 shows the system address map for the first system configuration (256 MB RAM) and the
system address map for the second system configuration (512 MB RAM). As you can see, the
memory range occupied by the PCI devices shrinks from 3840 MB (4GB minus 256 MB) in the first
system configuration (256 MB RAM) to 3584 MB (4GB minus 512 MB) in the second system
configuration (512 MB RAM). The change also causes the base address of the AGP video card
memory to change: in the first system configuration the base address is 256 MB, while in the
second system configuration the base address is 512 MB. The change in the AGP video card
memory base address is possible because the contents of the AGP video card's (its video chip's)
BAR can be modified.

Figure 4 System Address Map for the First (256 MB RAM) and Second (512 MB RAM) System
Configuration

Now, let's see how the Intel 815E northbridge routes accesses to the CPU memory space in the first
system configuration shown in Figure 4 (256 MB RAM). Let's break down the steps for a read
request to the video (AGP) memory in the first system configuration. In the first system
configuration, the platform firmware initializes the video memory to be mapped to the memory range
from 256 MB to 288 MB, because the video memory size is 32 MB; the first 256 MB is mapped to
RAM. The platform firmware does so by configuring/initializing the video chip's BARs to accept
accesses in the 256 MB to 288 MB memory range. Figure 5 shows the steps to read the contents
of the video card memory starting at physical address 11C0_0000h (284MB) at runtime (inside an
OS).

Figure 5 Steps for a Memory Request to the AGP Video Card in the First System Configuration (256
MB RAM)

Details of the steps shown in Figure 5 are as follows:


1. Suppose that an application (running inside the OS) requests texture data in the video
memory, and the request ultimately translates into a read from physical address 11C0_0000h
(284MB). The read transaction goes from the CPU to the northbridge.
2. The northbridge forwards the read request to the device whose memory range contains the
requested address, 11C0_0000h. The northbridge logic checks all of its registers related to
address mapping to find the device that matches the requested address. In the Intel 815E
chipset, address mapping of the device on the AGP port is controlled by four registers in the
AGP/PCI bridge part of the chipset: MBASE, MLIMIT, PMBASE, and PMLIMIT. Any access to a
memory address within the range of either MBASE to MLIMIT or PMBASE to PMLIMIT is
forwarded to the AGP port. It turns out the requested address (11C0_0000h) is within the
PMBASE to PMLIMIT memory range. Note that initialization of the four address mapping
registers is the job of the platform firmware.
3. Because the requested address (11C0_0000h) is within the PMBASE to PMLIMIT
memory range, the northbridge forwards the read request to the AGP port, i.e., to the video
card chip.
4. The video card chip's BAR settings allow it to respond to the request from the
northbridge. The video card chip then reads the video memory at the requested address
(11C0_0000h) and returns the requested contents to the northbridge.
5. The northbridge forwards the result returned by the video card to the CPU. After this step,
the CPU receives the result from the northbridge and the read transaction completes.
The breakdown of a memory read in the sample above should make the role of the Intel 815E
northbridge as an address router clear, even for those not yet accustomed to it.
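To make the role of those registers more concrete, the following C sketch models the decision the northbridge makes in step 2 above. It is a conceptual model only, not chipset code; the window values are illustrative placeholders for the first system configuration, and the real MBASE/MLIMIT and PMBASE/PMLIMIT registers hold only the upper address bits:

#include <stdbool.h>
#include <stdint.h>

/* Illustrative window settings for the first system configuration:
 * the 32 MB of AGP video memory is mapped at 256 MB-288 MB. */
#define PMBASE   0x10000000u   /* 256 MB: prefetchable window base        */
#define PMLIMIT  0x11FFFFFFu   /* 288 MB - 1: prefetchable window limit   */
#define MBASE    0x12000000u   /* non-prefetchable window base (example)  */
#define MLIMIT   0x120FFFFFu   /* non-prefetchable window limit (example) */

/* Returns true if the northbridge would forward a memory transaction
 * targeting 'addr' to the AGP port, i.e., the address falls inside one
 * of the two AGP forwarding windows. */
static bool forward_to_agp(uint32_t addr)
{
    bool in_prefetchable     = (addr >= PMBASE) && (addr <= PMLIMIT);
    bool in_non_prefetchable = (addr >= MBASE)  && (addr <= MLIMIT);
    return in_prefetchable || in_non_prefetchable;
}

/* forward_to_agp(0x11C00000) evaluates to true: 11C0_0000h (284 MB) lies
 * inside the PMBASE..PMLIMIT window, so the read goes to the AGP card. */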

The System Management Mode (SMM) Memory Mapping


Earlier in this article, I talked about the special memory ranges inside the PCI
memory range, i.e., the memory ranges occupied by the flash ROM chip containing the platform
firmware and by the APIC. If you look at the system address map in Figure 2, you can see that there
are two more memory ranges in the system address map that show up mysteriously: the
HSEG and TSEG memory ranges. Both of these memory ranges are not accessible in normal
operating mode, i.e., not accessible from inside the operating system, not even DOS. They are only
accessible when the CPU is in system management mode (SMM). Therefore, they are only directly
accessible to SMM code in the BIOS. Let's have a look at both of them:
1. HSEG is an abbreviation of high segment. This memory range is hardcoded
to FEEA_0000h-FEEB_FFFFh. The system management RAM control register in the Intel
815E chipset must be programmed (at boot, in the BIOS code) to enable this hardcoded
memory range as HSEG; otherwise the memory range is mapped to the PCI bus. HSEG
maps accesses to the FEEA_0000h-FEEB_FFFFh memory range into the A_0000h-B_FFFFh
memory range in RAM. By default, the A_0000h-B_FFFFh memory range is
not mapped to RAM, but to part of the video memory, to provide compatibility for DOS
applications. The HSEG capability makes the RAM hole in the A_0000h-B_FFFFh memory
range available for storing SMM code. Neither of the HSEG-related memory ranges
is accessible at runtime (inside an OS).
2. TSEG is an abbreviation of top of main memory segment; top of main memory (RAM) is
abbreviated as TOM. This memory range is relocatable, depending on the size of main
memory: it is always mapped from TOM minus the TSEG size up to TOM. The system
management RAM control register in the Intel 815E chipset must be programmed (at
boot, in the BIOS code) to enable TSEG. TSEG is only accessible in SMM, with its physical
address set in the northbridge; the area occupied by TSEG is a memory hole with
respect to the OS, i.e., the OS cannot see the TSEG area.

Accesses to all of the special memory ranges (APIC, flash ROM, and HSEG) are not forwarded
to the PCI bus by the Intel 815E-ICH2 chipset; they are forwarded to their respective devices,
i.e., accesses to the APIC memory range are forwarded to the APIC, accesses to the flash ROM memory
range are forwarded to the flash ROM, and accesses to the HSEG memory range at
FEEA_0000h-FEEB_FFFFh are forwarded to RAM in the A_0000h-B_FFFFh memory range.
At this point, everything regarding the system address map in a typical Intel 815E-ICH2 system
should be clear. The one thing remaining to be studied is the initialization of the BARs. There are some
additional materials in the remaining sections, though; they are all related to the system address
map.

PCI Bus Base Address Registers Initialization


PCI BAR initialization is the job of the platform firmware. The PCI specification provides an
implementation note on BAR initialization. Let's have a look at the BAR formats before we get into
the implementation note.

PCI BAR Formats


There are two types of BAR: the first is a BAR that maps to the CPU I/O space and the second is
a BAR that maps to the CPU memory space. The formats of these two types of BAR are quite
different. Figure 6 and Figure 7 show the formats of both types of BAR.
Figure 6 BAR That Maps to CPU I/O Space Format

Figure 6 shows the format of a PCI BAR that maps to CPU I/O space. As you can see, the
lowest two bits in the 32-bit BAR are hardcoded to the binary value 01. Actually, only the lowest bit
matters; it differentiates the type of the BAR, because the bit has a different hardcoded value
for each BAR type. In a BAR that maps to CPU I/O space, the lowest bit is always hardcoded
to one, while in a BAR that maps to CPU memory space, the lowest bit is always hardcoded to zero.
You can see this difference in Figure 6 and Figure 7.

Figure 7 BAR That Maps to CPU Memory Space Format

Figure 7 shows the format of a BAR that maps to CPU memory space. This article deals
with this type of BAR because the focus is on the system address map, particularly the system
memory map. Figure 7 shows that the lowest bit is hardcoded to zero in a BAR that maps to CPU
memory space. Figure 7 also shows that bits 1 and 2 determine whether the BAR is a 32-bit
BAR or a 64-bit BAR. This needs a little bit of explanation. The PCI bus protocol actually supports both
32-bit and 64-bit memory space. The 64-bit memory space is supported through dual address
cycles, i.e., access to a 64-bit address requires two address cycles instead of one, because the PCI
bus is natively a 32-bit bus in its implementation. However, we're not going to delve deeper into that,
because we are only concerned with the 32-bit PCI bus in this article. If you ask "Why only 32-bit?",
it's because the CPU we are talking about here only supports 32-bit operation (without physical
address extension (PAE), and we don't deal with PAE here).
Figure 7 shows that bit 3 controls prefetching for a BAR that maps to CPU memory space.
Prefetching in this context means that the contents of the memory addressed by the
BAR may be fetched before a request to that specific memory address is made, i.e., the fetching happens in
advance, hence "pre-fetching". This feature is used to improve overall PCI device memory
read speed.
Figure 6 and Figure 7 show the rest of the BAR contents as "Base Address". Not all of the bits in
the part marked as "Base Address" are writeable. The number of writeable bits depends on the
size of the memory range required by the PCI device's onboard memory. Details of which bits of the
"Base Address" are hardcoded and which bits are writeable are discussed in the next section, PCI
BAR Sizing Implementation Note.
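The encoding described above can be summarized in a few small C helpers that classify a raw 32-bit BAR value. This is only a sketch based on the bit layout in Figures 6 and 7; the function names are mine:

#include <stdint.h>
#include <stdbool.h>

/* Classify a raw 32-bit BAR value according to the layout in Figures 6 and 7:
 *   bit 0    : 0 = memory space BAR, 1 = I/O space BAR
 *   bits 2:1 : memory BAR type (00b = 32-bit, 10b = 64-bit)
 *   bit 3    : memory BAR prefetchable flag */
static bool bar_is_io(uint32_t bar)
{
    return (bar & 0x1u) != 0;
}

static bool bar_is_64bit(uint32_t bar)
{
    return !bar_is_io(bar) && ((bar >> 1) & 0x3u) == 0x2u;
}

static bool bar_is_prefetchable(uint32_t bar)
{
    return !bar_is_io(bar) && (bar & 0x8u) != 0;
}

/* Extract the base address by masking off the encoding bits. */
static uint32_t bar_base_address(uint32_t bar)
{
    return bar_is_io(bar) ? (bar & ~0x3u)    /* I/O BAR: clear bits 1:0    */
                          : (bar & ~0xFu);   /* memory BAR: clear bits 3:0 */
}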

Figure 8 XROMBAR Format

Figure 8 shows the XROMBAR format. As you can see, the lowest 11 bits are practically
hardcoded to zero, because the expansion ROM enable bit doesn't act as an address bit, only
as a control bit. This means a PCI expansion ROM must be mapped to a 2 KB boundary. The
expansion ROM enable bit controls whether access to the PCI expansion ROM is enabled or
not: when the bit is set to one, access is enabled, and when the bit is set to zero, access is
disabled. If you look at Figure 3 (PCI Configuration Registers Type 0), you see the command
register at offset 04h. There's one bit in the command register, called the memory space bit, that
must also be enabled to allow access to the PCI expansion ROM. The memory space bit has
precedence over the expansion ROM enable bit. Therefore, both the memory space bit in the
command register and the expansion ROM enable bit in the XROMBAR must be set to one to
enable access to the PCI expansion ROM.
The number of writeable bits in the expansion ROM base address bits depends on the size of
the memory range required by the PCI expansion ROM. Details of which bits of the expansion
ROM base address are hardcoded and which bits are writeable are discussed in the next section,
PCI BAR Sizing Implementation Note. The process for BAR initialization also applies to initializing
the XROMBAR.
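As a sketch of the enable sequence just described (again assuming the hypothetical pci_config_read32()/pci_config_write32() helpers from the earlier sketch; the 04h and 30h offsets are those of the type 0 header):

#include <stdint.h>

/* Hypothetical helpers from the earlier configuration space sketch. */
uint32_t pci_config_read32(uint8_t bus, uint8_t dev, uint8_t func, uint8_t off);
void     pci_config_write32(uint8_t bus, uint8_t dev, uint8_t func,
                            uint8_t off, uint32_t value);

#define PCI_COMMAND       0x04        /* command register offset (type 0 header) */
#define PCI_XROMBAR       0x30        /* expansion ROM base address offset       */
#define CMD_MEMORY_SPACE  (1u << 1)   /* memory space enable bit                 */
#define XROM_ENABLE       (1u << 0)   /* expansion ROM enable bit                */

/* Map the expansion ROM at 'base' (2 KB aligned) and enable decode. */
static void enable_expansion_rom(uint8_t bus, uint8_t dev, uint8_t func,
                                 uint32_t base)
{
    uint32_t cmd;

    /* Program the ROM base address together with the enable bit. */
    pci_config_write32(bus, dev, func, PCI_XROMBAR, base | XROM_ENABLE);

    /* The memory space bit has precedence, so set it as well. The upper
     * 16 bits (the status register) are written as zero so that its
     * write-1-to-clear bits are left untouched. */
    cmd = pci_config_read32(bus, dev, func, PCI_COMMAND) & 0xFFFFu;
    pci_config_write32(bus, dev, func, PCI_COMMAND, cmd | CMD_MEMORY_SPACE);
}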
In practice, some vendors prefer to use a BAR instead of the XROMBAR to map the PCI expansion
ROM to either the CPU memory space or the CPU I/O space. One particular example is the old
Realtek RTL8139D LAN card chip. This chip maps the PCI expansion ROM to the CPU I/O space
instead of the CPU memory space by using a BAR instead of the XROMBAR. You can see the
details in the chip's datasheet. This shows that, despite the presence of the bus protocol standard,
some vendors prefer a rather different approach compared to what the standard suggests. The
Realtek approach is certainly not wrong as per the PCI bus protocol, but it's quite outside the norm.
However, I think there might be an economic reason to do so: perhaps, by doing that,
Realtek saved quite a sizeable die area in the LAN card chip, which translated into cheaper
production costs.

PCI BAR Sizing Implementation Note


At this point, the PCI BAR format and XROMBAR format are already clear, except for the base
address bits, which, as stated previously, contain device-specific values. The base address bits
depend on the size of the memory range required by the PCI device. The size of the memory
range required by a PCI device is calculated from the number of writeable bits in the base
address bits of the BAR. The details are described in the BAR sizing implementation note in the
PCI specification v2.3, as follows:

Implementation Note: Sizing a 32-bit Base Address Register Example


Decoding (I/O or memory) of a register is disabled via the command register (offset 04h in PCI
configuration space) before sizing a base address register. Software saves the original value of
the base address register, writes 0FFFFFFFFh to the register, then (the software) reads it back.
Size calculation can be done from the 32-bit value read by first clearing the encoding information
bits (bit 0 for I/O, bits 0-3 for memory), inverting all 32 bits (logical NOT), then incrementing by 1.
The resultant 32-bit value is the memory/I/O range size decoded by the register. Note that the
upper 16 bits of the result are ignored if the base address register is for I/O and bits 16-31 return
zero upon read. The original value in the base address register is restored before re-enabling
decode in the command register (offset 04h in PCI configuration space) of the device. 64-bit
(memory) base address registers can be handled the same, except that the second 32-bit register
is considered an extension of the first; i.e., bits 32-63. Software writes 0FFFFFFFFh to both
registers, reads them back, and combines the result into a 64-bit value. Size calculation is done
on the 64-bit value.
The implementation note above is probably still a bit vague. Now, let's see how the BAR in the
AGP video card chip should behave in order to allocate 32 MB of the CPU memory space for
the onboard video memory. The address range allocation happens during the BAR sizing phase of
the BIOS execution. The BAR sizing routine is part of the BIOS code that builds the system address
map. BAR sizing happens in these steps (as per the implementation note above):
1. The BIOS code saves the original value of the video card chip's BAR to a temporary variable in
RAM. There are six BARs in the PCI configuration registers of the video card chip. We
assume the BIOS works with the first BAR (offset 10h in the PCI configuration registers).
2. The BIOS code writes FFFF_FFFFh to the first video card chip BAR. The video card chip
allows writes only to the writeable bits in the Base Address part of the BAR. The lowest 4 bits
are hardcoded values meant as control bits for the BAR. These bits control prefetching
behavior, whether the BAR is 32-bit or 64-bit, and whether the BAR maps to CPU memory
space or CPU I/O space. Here we assume that the BAR maps to CPU memory space.
3. The BAR should contain FE00_000Xh after the write in step 2, because the video card
memory size is 32 MB; you will see later how this value transforms into 32 MB. X denotes variable
contents that we are not interested in, because they depend on the specific video card and
don't affect the BAR sizing process.
4. As per the implementation note, the lowest 4 bits must be reset to 0h. Therefore, we now
have FE00_0000h.
5. Then, the value obtained in the previous step must be inverted (32-bit logical NOT).
Now we have 01FF_FFFFh.
6. Then, the value obtained in the previous step must be incremented by one. Now we
have 0200_0000h, which is 32 MB, the size of the memory range required by the AGP video
card.
Perhaps you're asking why the FFFF_FFFFh write to the BAR only returns FE00_000Xh? The
answer is that only the top seven bits of the BAR are writeable. The rest are hardcoded to
zero, except the lowest 4 bits, which are used for BAR information: the I/O or memory mapping
indicator, prefetching, and the 32-bit/64-bit indicator. These lowest 4 bits could be hardcoded to
non-zero values, which is why they must be cleared to zero in the read result before the size is
computed. It's possible to relocate the video card memory anywhere in the 4GB CPU memory
space on a 32 MB boundary because the top seven bits of the 32-bit BAR are writeable; I believe
you can do the math yourself for this.
After the BAR sizing, the BIOS knows the amount of memory required by the video card, i.e., 32
MB. The BIOS can now map the video card memory into the CPU memory space by writing the
intended address into the video card BAR. For example, to map the video card memory to
address 256 MB, the BIOS would write 1000_0000h (256 MB) to the video card BAR; with this
value, the top seven bits of the BAR contain the binary value 0001000b.
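The sizing steps above translate almost directly into code. Below is a minimal C sketch, again reusing the hypothetical pci_config_read32()/pci_config_write32() helpers from the earlier sketch; disabling decode through the command register, as the implementation note requires, is omitted for brevity:

#include <stdint.h>

/* Hypothetical helpers from the earlier configuration space sketch. */
uint32_t pci_config_read32(uint8_t bus, uint8_t dev, uint8_t func, uint8_t off);
void     pci_config_write32(uint8_t bus, uint8_t dev, uint8_t func,
                            uint8_t off, uint32_t value);

/* Size the 32-bit memory BAR at 'bar_off' (10h for the first BAR of a
 * type 0 header), then map the device at 'new_base'. Returns the size of
 * the memory range the BAR decodes. */
static uint32_t size_and_map_bar(uint8_t bus, uint8_t dev, uint8_t func,
                                 uint8_t bar_off, uint32_t new_base)
{
    uint32_t original, readback, size;

    /* Save the original BAR value. */
    original = pci_config_read32(bus, dev, func, bar_off);

    /* Write all ones; only the writeable base address bits stick. */
    pci_config_write32(bus, dev, func, bar_off, 0xFFFFFFFFu);
    readback = pci_config_read32(bus, dev, func, bar_off);

    /* Clear the encoding bits (bits 3:0 for a memory BAR), invert, add one.
     * For the 32 MB video card example: FE00_000Xh -> FE00_0000h ->
     * 01FF_FFFFh -> 0200_0000h (32 MB). */
    size = ~(readback & ~0xFu) + 1u;

    /* Program the new base address; the hardcoded low bits are preserved. */
    pci_config_write32(bus, dev, func, bar_off, new_base | (original & 0xFu));

    return size;
}

/* Example call, sizing the first BAR (offset 10h) of the video card chip
 * and mapping its 32 MB of onboard memory at 256 MB (1000_0000h), as in
 * the first system configuration; the bus/device numbers are placeholders:
 *     uint32_t size = size_and_map_bar(1, 0, 0, 0x10, 0x10000000);
 */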
Now, you might be asking, what about the XROMBAR? Well, the process is just the same, except that
the XROMBAR maps read-only memory (ROM) instead of RAM (as in the video card memory
sample explained here) into the CPU memory space. This also means there is no prefetching bit to
be read or processed during the BAR sizing, because ROM usually doesn't support prefetching
at all, unlike the RAM that acts as video card memory. The explanation of BAR sizing in this
subsection should clarify the expansion ROM mapping to the system address map, which was not
very clear in my previous article (http://resources.infosecinstitute.com/pci-expansion-rom/).

The AGP Graphics Address Remapping/Relocation Table (GART)


The example of PCI device mapping to the system address map in this article is an AGP video card.
Logically, an AGP device is actually an enhanced PCI device, with some peculiarities, such as the
fact that the AGP slot runs at 66 MHz instead of the 33 MHz of an ordinary PCI slot, and AGP also
has some AGP-specific logic support inside the northbridge chipset.
One of the AGP-specific supports in the northbridge is the so-called AGP graphics address
remapping/relocation table (GART) hardware logic. AGP GART, or GART for short, allows the
video chip on the video card to use a portion of the main memory (RAM) as video memory if
the application(s) using the video memory run out of onboard video memory. The GART
hardware is basically a sort of memory management unit (MMU) that presents the chunks of RAM
allocated as additional video memory as one contiguous memory region to the video card chip. This
is required because, by the time the video card chip requests additional video memory from the
system, the RAM is very probably already fragmented, because other applications have used
portions of the RAM for their own purposes.

The GART hardware has a direct relation to system address map initialization via the so-called
AGP aperture. If you are old enough, you might remember setting the AGP aperture size in
your PC BIOS back then. The GART is in some respects a precursor to the input/output memory
management unit (IOMMU) technology in use in some present-day hardware.
Now, let's look deeper into the GART and the AGP aperture. Figure 9 shows how the GART and
the AGP aperture work.
Figure 9 AGP GART and AGP Aperture in the System Address Map

Figure 9 shows the implementation of the AGP aperture and GART. The AGP aperture, which is set
in the BIOS/platform firmware, basically reserves a contiguous portion of the PCI MMIO range to
be used as additional video memory in case the video memory is exhausted at runtime (inside the
OS). The size of the AGP aperture is set in the BIOS setup. Therefore, when the BIOS initializes
the system address map, it sets aside a contiguous portion of the PCI MMIO range to be used as
the AGP aperture.
The GART logic in the northbridge chipset contains a register that holds a pointer to the GART
data structure (translation table) in RAM. The GART data structure is stored in RAM, much like
the global descriptor table or local descriptor table in x86/x64 CPUs. The video driver, working
alongside the OS kernel, initializes the GART data structure. When the video memory is running
low, the video driver (working alongside the OS kernel) allocates chunks of RAM as required.
The video card chip accesses the additional video memory via the AGP aperture. Therefore, the
video card chip sees only contiguous additional video memory instead of chunks of memory
scattered in RAM. The GART hardware in the northbridge chipset translates from the AGP aperture
address space to the RAM chunks' address space for the video card chip. For example, in Figure 9,
when the video card chip accesses block #1 in the AGP aperture region, the GART logic
translates the access into an access to block #1 in the corresponding RAM chunk; the block is
marked in red in Figure 9. The GART logic translates the access based on the contents of the
GART data structure, which is also located in RAM but cached in the GART logic, much like the
descriptor cache in x86/x64 CPUs.
The downside of the AGP architecture lies in the GART being located in the northbridge, a shared
and critical resource for the entire system. This implies that a misbehaving video card driver could
crash the entire system, because the video card driver literally programs the northbridge for
GART-related matters. This fact encouraged the architects of the PCIe protocol to require the GART
logic to be implemented in the video card chip instead of in the chipset in systems that implement the
PCIe protocol. For details on the requirement, you can read this: http://msdn.microsoft.com/enus/library/windows/hardware/gg463285.aspx.
Now you should have a very good overall view of the effect of AGP on the system address
map. The most important thing to remember is that the GART logic consults a translation table,
i.e., the GART data structure, in order to access the real contents of the additional video memory
in RAM. A similar technique, the use of a translation table, is employed in an IOMMU.
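To make the translation step concrete, the following C sketch models the idea of the GART lookup. It is a conceptual model only; the entry format, the 4 KB page granularity, and the names are my own assumptions and do not reflect the real Intel 815E GART implementation:

#include <stdint.h>

#define GART_PAGE_SIZE 4096u   /* assumed translation granularity */

/* Conceptual model: the GART data structure in RAM is an array of
 * physical page addresses, one entry per aperture page. */
typedef struct {
    uint32_t aperture_base;   /* start of the AGP aperture in the MMIO range */
    uint32_t num_entries;     /* aperture size / GART_PAGE_SIZE              */
    uint32_t *entries;        /* physical address of each backing RAM page   */
} gart_t;

/* Translate an address inside the AGP aperture into the physical address
 * of the scattered RAM chunk that actually backs it. */
static uint32_t gart_translate(const gart_t *gart, uint32_t aperture_addr)
{
    uint32_t offset = aperture_addr - gart->aperture_base;
    uint32_t index  = offset / GART_PAGE_SIZE;   /* which aperture page  */
    uint32_t within = offset % GART_PAGE_SIZE;   /* offset inside a page */
    return gart->entries[index] + within;        /* backing RAM address  */
}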

Hijacking BIOS Interrupt 15h AX=E820h Interface


Now let's see how you can query the system for the system address map. On legacy systems with
legacy platform firmware, i.e., BIOS, the most common way to request the system address map is via
interrupt 15h function E820h (AX=E820h). The interrupt must be invoked while the x86/x64
system runs in real mode, right after the BIOS completes platform initialization. You can find
details of this interface at http://www.uruk.org/orig-grub/mem64mb.html. Interrupt 15h function
E820h is sometimes called the E820h interface; we'll adopt this naming here.
A legacy boot rootkit (let's call it a "bootkit") could hide in the system by patching the interrupt 15h
function E820h handler. One of the ways to do that is to alter the address range
descriptor structure returned by the E820h interface. The address range descriptor structure is
defined as follows:

typedef struct AddressRangeDescriptorTag {
    unsigned long BaseAddrLow;
    unsigned long BaseAddrHigh;
    unsigned long LengthLow;
    unsigned long LengthHigh;
    unsigned long Type;
} AddressRangeDescriptor;

The Type field in the address range descriptor structure determines whether the memory range is
available to be used by the OS, i.e., whether it can be read and written by the OS. The encoding of
the Type field is as follows:

- A value of 1 means this run (the returned memory range from the interrupt handler) is
available RAM usable by the operating system.
- A value of 2 means this run of addresses (the returned memory range from the interrupt
handler) is in use or reserved by the system, and must not be used by the operating system.
- Other values mean undefined; they are reserved for future use. Any range of this type
must be treated by the OS as if the type returned was reserved (the return value is 2).
If the RAM chunk used by the bootkit is marked as a reserved region in the Type field of
the address range descriptor structure, the OS regards that RAM chunk as off-limits. This
practically hides the bootkit from the OS. This is one of the ways a bootkit can hide
on a legacy system.
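To see how such a map is consumed, here is a small C sketch (my own illustration, not BIOS or OS code) that walks address range descriptors after a bootloader has collected them into an array by repeatedly invoking the E820h interface in real mode. Any entry whose Type is not 1 is simply skipped, which is exactly why a chunk marked as reserved stays hidden from the OS:

#include <stdint.h>

/* Same layout as the address range descriptor shown above; fixed-width
 * types are used here instead of 'unsigned long' for clarity. */
typedef struct {
    uint32_t BaseAddrLow;
    uint32_t BaseAddrHigh;
    uint32_t LengthLow;
    uint32_t LengthHigh;
    uint32_t Type;          /* 1 = usable RAM, anything else = reserved */
} AddressRangeDescriptor;

/* Sum the RAM that the OS is allowed to use. */
static uint64_t usable_ram_bytes(const AddressRangeDescriptor *map, int count)
{
    uint64_t total = 0;
    for (int i = 0; i < count; i++) {
        if (map[i].Type != 1) {
            continue;   /* reserved or undefined: off-limits to the OS */
        }
        total += ((uint64_t)map[i].LengthHigh << 32) | map[i].LengthLow;
    }
    return total;
}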
Anyway, memory ranges in the CPU memory space consumed by PCI MMIO devices are
also categorized as reserved regions by the BIOS interrupt 15h function E820h handler,
because they are not RAM regions and should not be regarded as such. Moreover, random reads
and writes to PCI devices can have unintended effects.
The users of the E820h interface are mostly the OS and the OS bootloader. The OS needs to
know the system address map in order to initialize the system properly. The OS bootloader
sometimes has additional functions, such as RAM tests, and in some circumstances it also passes
the system memory map to the OS. That's why the OS bootloader is also a user of the E820h
interface.

UEFI: The GetMemoryMap() Interface


Now you might be asking: what is the equivalent of the E820h interface in UEFI? Well, the
equivalent of the E820h interface in UEFI is the GetMemoryMap() function. This function is
available as part of the UEFI boot services. Therefore, you need to traverse the UEFI boot
services table to call the function. The simplified algorithm to call this function is as follows:
1. Locate the EFI system table.
2. Traverse to the EFI_BOOTSERVICES_TABLE in the EFI System Table.
3. Traverse the EFI_BOOTSERVICES_TABLE to locate the GetMemoryMap() function.

4. Call the GetMemoryMap() function.


The GetMemoryMap() function returns a data structure that is similar to the one returned by the
legacy E820h interface. The data structure is called EFI_MEMORY_DESCRIPTOR. This article
doesn't try to delve deeper into this interface; you can read the details of the interface and
the EFI_MEMORY_DESCRIPTOR in the UEFI specification. Should you be interested in digging
deeper, the GetMemoryMap() function is described in the boot services chapter of the UEFI
specification, under the memory allocation services section.
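To make the call sequence above concrete, here is a minimal EDK II-style sketch (my own illustration, assuming an EDK II build environment, where the boot services table is already exposed through the gBS pointer). The usual pattern is to call GetMemoryMap() once with a zero-sized buffer to learn the required size, allocate the buffer, then call it again and walk the EFI_MEMORY_DESCRIPTOR entries:

#include <Uefi.h>
#include <Library/UefiBootServicesTableLib.h>  // provides the gBS pointer
#include <Library/UefiLib.h>                   // provides Print()

EFI_STATUS
EFIAPI
UefiMain (
  IN EFI_HANDLE        ImageHandle,
  IN EFI_SYSTEM_TABLE  *SystemTable
  )
{
  EFI_STATUS             Status;
  UINTN                  MapSize = 0;
  EFI_MEMORY_DESCRIPTOR  *Map = NULL;
  EFI_MEMORY_DESCRIPTOR  *Desc;
  UINTN                  MapKey;
  UINTN                  DescSize;
  UINT32                 DescVersion;
  UINTN                  Offset;

  //
  // First call with a zero-sized buffer: returns EFI_BUFFER_TOO_SMALL
  // and the required buffer size in MapSize.
  //
  Status = gBS->GetMemoryMap (&MapSize, Map, &MapKey, &DescSize, &DescVersion);
  if (Status != EFI_BUFFER_TOO_SMALL) {
    return Status;
  }

  //
  // Allocate a slightly larger buffer (the allocation itself can add
  // descriptors), then fetch the real memory map.
  //
  MapSize += 2 * DescSize;
  Status = gBS->AllocatePool (EfiBootServicesData, MapSize, (VOID **)&Map);
  if (EFI_ERROR (Status)) {
    return Status;
  }
  Status = gBS->GetMemoryMap (&MapSize, Map, &MapKey, &DescSize, &DescVersion);
  if (EFI_ERROR (Status)) {
    gBS->FreePool (Map);
    return Status;
  }

  //
  // Walk the descriptors using DescSize as the stride; it may be larger
  // than sizeof (EFI_MEMORY_DESCRIPTOR).
  //
  for (Offset = 0; Offset < MapSize; Offset += DescSize) {
    Desc = (EFI_MEMORY_DESCRIPTOR *)((UINT8 *)Map + Offset);
    Print (L"Type %d  PhysicalStart %lx  Pages %lx\n",
           Desc->Type, Desc->PhysicalStart, Desc->NumberOfPages);
  }

  gBS->FreePool (Map);
  return EFI_SUCCESS;
}

Note that the walk uses the returned descriptor size as the stride rather than sizeof(EFI_MEMORY_DESCRIPTOR), because the UEFI specification allows the descriptor to grow in future versions.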

Closing Thoughts
Thanks go to reader Jimbo Bob, who asked the question regarding clarification of PCI expansion
ROM chip address mapping in my "Malicious PCI Expansion ROM" article
(http://resources.infosecinstitute.com/pci-expansion-rom/). I was not aware of the insufficiency of
my explanation in that article. Perhaps, because I have been working on BIOS and other bare metal
stuff for years, I took it for granted that readers of my article would be sufficiently informed
regarding the bus protocols I talked about. So here I am, fixing that mistake.
Looking forward, I'm preparing another article in the same spirit as this one that focuses on the
present-day bus protocol, PCI Express (PCIe). For those who have ventured into bare metal
programming in x86/x64 for the first time, I hope this article helps in understanding the system.
Hopefully the reader is now informed about where to go to find chip-specific information,
particularly, knowledgeable enough to search the web for the relevant x86/x64-related chip/chipset
datasheets according to his/her needs.

System Address Map Initialization in x86/x64 Architecture Part 2: PCI Express-Based Systems
This article is the second part of a series that clarifies PCI expansion ROM address mapping to
the system address map. The mapping was not sufficiently covered in my "Malicious PCI
Expansion ROM" article (http://resources.infosecinstitute.com/pci-expansion-rom/). You are
assumed to have a working knowledge of the PCI bus protocol and details of the x86/x64 boot
process. If you don't, then please read the first part
(at http://resources.infosecinstitute.com/systemaddress-map-initialization-in-x86x64-architecture-part-1-pci-based-systems/) to get up to speed
with the background knowledge required to understand this article.
The first part focuses on system address map initialization in an x86/x64 PCI-based system. This
article focuses on more recent systems, i.e., x86/x64 PCI Express-based systems. From this
point on, PCI Express is abbreviated as PCIe throughout this article, in accordance with the official
PCI Express specification.
We are going to look at system address map initialization in x86/x64 PCIe-based systems. Similar
to the first part, the focus is on understanding the address mapping mechanism of the PCIe bus
protocol. Knowledge of the address mapping is important for understanding access to the contents
of the PCI expansion ROM in PCIe-based systems.
PCIe is very different from PCI at the physical level. However, at the logical level PCIe is an
extension of PCI. In fact, you can boot an OS that only supports the PCI bus on a platform using
PCIe without a problem, as long as the OS bus support conforms to the PCI bus protocol
specification. The fact that PCIe is an extension of PCI means that you should be familiar with the
PCI bus protocol before you can understand PCIe. That's why you are strongly advised to read
the first part before moving forward with this second part.

Conventions
This article uses these conventions:
1. Main memory refers to RAM modules installed on the motherboard.
2. Memory controller refers to that part of the chipset or the CPU that controls the RAM
modules and controls access to the RAM modules.
3. Flash memory refers to either the chip on the motherboard that stores the BIOS/UEFI or
the chip on an expansion card that stores the PCI expansion ROM contents.
4. Memory range or memory address range means the range from the base/start
address to the end address (base address + memory size) occupied by a device in the
CPU memory space.

5. Memory space means the set of memory addresses accessible by the CPU, i.e., the
memory that is addressable from the CPU. Memory in this context could mean RAM,
ROM, or other forms of memory that can be addressed by the CPU.
6. PCI expansion ROM refers to the ROM chip on a PCI device or the contents of the chip,
except when the context contains other specific explanation.
7. The terms "hostbridge" and "northbridge" refer to the same logic component in this
article. Both terms refer to the digital logic component that glues the CPU cores to the
rest of the system, i.e., connecting the CPU cores to the RAM modules, the PCIe graphics,
the southbridge chipset, etc.
8. Intel 4th Generation Core Architecture CPUs are called "Haswell CPU" or simply "Haswell"
in most of this article. Intel uses Haswell as the codename for this CPU generation.
9. Hexadecimal values end with "h", as in 0B0Ah, or start with "0x", as in 0xB0A.
10. Binary values end with "b", as in 1010b.
11. The term "memory transaction routing" means routing of memory transactions based on
the target address of the transaction, unless stated otherwise.
Another recurring term in this article is "platform firmware". Platform firmware refers to the code that
initializes the platform upon reset, i.e., the BIOS or UEFI code residing in the flash ROM chip on the
motherboard.

Preserving Firmware Code Compatibility in Modern-Day x64 Hardware

The x64 architecture is an extension of the x86 architecture. Therefore, x64 inherits most of the x86 architecture characteristics, including its very early boot characteristics and most of its system address map. There are two important aspects that x64 preserves from x86 with respect to firmware code execution:
1. The CPU reset vector location. Even though x64 is a 64-bit CPU architecture, the reset vector remains the same as in the x86 (32-bit) architecture, i.e., at address 4GB-16 bytes (FFFF_FFF0h). This is meant to preserve compatibility with old add-on hardware migrated to x64 platforms and also compatibility with the large amount of low-level code that depends on the reset vector.
2. The compatibility/legacy memory ranges in the system address map. The compatibility memory ranges are used for legacy devices. For example, some ranges in the lowest 1MB memory area are mapped to legacy hardware or their hardware emulation equivalent. More important, part of the memory range is mapped to the bootblock part of the BIOS/UEFI flash ROM chip. The memory ranges for the BIOS/UEFI flash ROM didn't change because the CPU reset vector remains the same. Moreover, lots of firmware code depends on that address mapping as well. Breaking compatibility with x86 would cause a lot of problems for a smooth migration to the 64-bit x64 architecture, not counting the business impact such a break would cause. That's why preserving the compatibility between the two different architectures is important, down to the firmware and chip level.
Now that you know the reason for preserving the compatibility, let's look at what is needed at the chip level to preserve backward compatibility with the x86 architecture. Figure 1 shows the logic components of the Haswell platform that relate to UEFI/BIOS code fetches/reads. As you can see, two blocks of logic, one in the CPU and one in the Platform Controller Hub (PCH), are provided to preserve the backward compatibility. They are the compatibility memory range logic in the CPU and the internal memory target decoder logic in the PCH. As for the Direct Media Interface (DMI) 2.0 controller logic, it's transparent with respect to software, including firmware code; it just acts as a very fast pass-through device and doesn't alter any of the transactions initiated by the firmware code that pass through it.

Figure 1. BIOS/UEFI Code Read Transaction in Modern Platform

Figure 1 shows the CPU core fetching code from the BIOS/UEFI flash ROM connected to the PCH (southbridge) via the serial peripheral interface (SPI); see the dashed red line in Figure 1. This is what happens in the very early boot stage, when the CPU has just finished initializing itself and starts to fetch code located at the reset vector. The presence of compatibility logic in the platform, as shown in Figure 1, makes it possible to run DOS or a 32-bit OS without any problems.

Figure 1 shows four CPU cores in the CPU. However, not all of them are the same; one of them is marked as the bootstrap processor (BSP), while the other three are marked as application processors (APs). When the system first boots or during a hardware reset, only one core is active, the BSP. The APs are not active at that point. It's the job of the firmware (BIOS/UEFI) code that runs on the BSP to initialize and activate the APs during the system initialization phase.
Be aware, though, that Figure 1 doesn't show all of the interconnections and hardware logic in both the CPU and the PCH, only those related to BIOS/UEFI code execution. The point is to highlight the components that take part in very early BIOS/UEFI code execution after a system reset takes place.
As you can see in Figure 1, the transaction to reach the BIOS/UEFI flash ROM chip doesn't involve any PCIe logic or fabric; even though the hostbridge contains the PCIe root complex logic, the transaction doesn't go through it. Nonetheless, you still need to learn about the PCIe bus protocol because the PCI expansion ROM that resides in a PCIe expansion card will use the PCIe fabric and logic. That's the reason the PCIe-related sections come next in this article.

A Deeper Look into PCI-to-PCI Bridge

PCIe hardware is logically represented as one PCI device or a collection of PCI devices. Some contain logical PCI-to-PCI bridge(s). The first part of this series doesn't delve much into the PCI-to-PCI bridge. Therefore, we're going to take a much closer look into it here because it's used heavily as a building block for logical PCIe devices. For example, the root port (the outgoing port from the root complex) is logically a PCI-to-PCI bridge, and a PCIe switch logically looks like several connected PCI-to-PCI bridges.
We'll start dissecting the PCI-to-PCI bridge by looking at its PCI configuration register header. A PCI-to-PCI bridge must implement the PCI configuration register type 1 header in its PCI configuration space registers, unlike the header that must be implemented by a non-PCI-to-PCI bridge device; refer to the first part for the PCI configuration register type 0 header. Figure 2 shows the format of the PCI-to-PCI bridge configuration space header, i.e., the PCI configuration register type 1 header. This format is dictated by the PCI-to-PCI Bridge Architecture Specification v1.1 published by the PCI-SIG.

Figure 2. PCI Configuration Register Type 1 Header (for PCI-to-PCI Bridge)
The numbers at the top of Figure 2 mark the bit positions in the registers of the PCI configuration space header. The numbers to the right of Figure 2 mark the offsets of the registers in the PCI configuration space header. Registers marked in yellow in Figure 2 determine the memory and IO ranges forwarded by the PCI-to-PCI bridge from its primary interface (the interface closer to the CPU) to its secondary interface (the interface farther away from the CPU). Registers marked in green in Figure 2 determine the PCI bus number of the bus on the PCI-to-PCI bridge primary interface (Primary Bus Number), the PCI bus number of the PCI bus on its secondary interface (Secondary Bus Number), and the highest PCI bus number downstream of the PCI-to-PCI bridge (Subordinate Bus Number).
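To make the type 1 header layout easier to follow in code, below is a minimal sketch of it as a packed C struct. The field names are my own and the struct is for illustration only; the offsets follow the type 1 header format shown in Figure 2.

```c
#include <stdint.h>

/* Illustrative layout of the PCI configuration register type 1 header
 * (PCI-to-PCI bridge). Field names are informal; offsets are in comments. */
struct pci_type1_header {
    uint16_t vendor_id;               /* 00h */
    uint16_t device_id;               /* 02h */
    uint16_t command;                 /* 04h */
    uint16_t status;                  /* 06h - Capabilities List bit lives here */
    uint8_t  revision_id;             /* 08h */
    uint8_t  class_code[3];           /* 09h */
    uint8_t  cache_line_size;         /* 0Ch */
    uint8_t  latency_timer;           /* 0Dh */
    uint8_t  header_type;             /* 0Eh - 01h marks a type 1 header */
    uint8_t  bist;                    /* 0Fh */
    uint32_t bar0;                    /* 10h */
    uint32_t bar1;                    /* 14h */
    uint8_t  primary_bus;             /* 18h */
    uint8_t  secondary_bus;           /* 19h */
    uint8_t  subordinate_bus;         /* 1Ah */
    uint8_t  secondary_latency_timer; /* 1Bh */
    uint8_t  io_base;                 /* 1Ch */
    uint8_t  io_limit;                /* 1Dh */
    uint16_t secondary_status;        /* 1Eh */
    uint16_t memory_base;             /* 20h */
    uint16_t memory_limit;            /* 22h */
    uint16_t prefetch_memory_base;    /* 24h */
    uint16_t prefetch_memory_limit;   /* 26h */
    uint32_t prefetch_base_upper32;   /* 28h */
    uint32_t prefetch_limit_upper32;  /* 2Ch */
    uint16_t io_base_upper16;         /* 30h */
    uint16_t io_limit_upper16;        /* 32h */
    uint8_t  capabilities_pointer;    /* 34h */
    uint8_t  reserved[3];             /* 35h */
    uint32_t expansion_rom_base;      /* 38h */
    uint8_t  interrupt_line;          /* 3Ch */
    uint8_t  interrupt_pin;           /* 3Dh */
    uint16_t bridge_control;          /* 3Eh */
} __attribute__((packed));

_Static_assert(sizeof(struct pci_type1_header) == 0x40,
               "type 1 header is 64 bytes");
```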

NOTE:
Every PCIe device must set the Capabilities List bit in the Status register to 1, and every PCIe device must implement the Capabilities Pointer register (marked in purple in Figure 2). The reason is that PCIe is implemented as an extension of the PCI protocol, and the way to extend the configuration space of a PCIe device (compared to an ordinary PCI device) is via the Capabilities Pointer register.

Figure 3 shows an illustration of the PCI-to-PCI bridge primary and secondary interfaces in a hypothetical platform. The inner workings of the platform components are the same as in a real-world system despite the platform being hypothetical; it's just simplified to make it easier to understand. PCI bus 1 connects to the PCI-to-PCI bridge primary interface and PCI bus 2 connects to the PCI-to-PCI bridge secondary interface in Figure 3.

Figure 3. PCI-to-PCI Bridge Interfaces

The PCI-to-PCI bridge forwards an IO transaction downstream (from the primary interface to the secondary interface) if the IO limit register contains a value greater than or equal to the IO base register value and the transaction address falls within the range covered by both registers. Likewise, the PCI-to-PCI bridge forwards a memory transaction downstream if the memory limit register contains a value greater than or equal to the memory base register value and the transaction address falls within the range covered by both registers.
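The following sketch shows how this downstream forwarding decision can be expressed in C for the (non-prefetchable) memory window, assuming the 16-bit memory base and memory limit register values have already been read from the bridge's type 1 header. Bits 15:4 of those registers correspond to address bits 31:20 of the window; the low 20 bits of the limit are implied to be F_FFFFh.

```c
#include <stdbool.h>
#include <stdint.h>

/* Positive decode for the memory window of a PCI-to-PCI bridge (a sketch). */
static bool bridge_forwards_downstream(uint16_t memory_base,
                                        uint16_t memory_limit,
                                        uint32_t address)
{
    uint32_t window_start = (uint32_t)(memory_base  & 0xFFF0) << 16;
    uint32_t window_end   = ((uint32_t)(memory_limit & 0xFFF0) << 16) | 0x000FFFFF;

    if (window_end < window_start)   /* limit below base: window disabled */
        return false;

    return address >= window_start && address <= window_end;
}
```

The prefetchable memory base/limit registers are decoded the same way for address bits 31:20; their low four bits indicate whether the bridge supports 64-bit prefetchable addressing, in which case the upper 32 bits of the window come from the prefetchable base/limit upper 32 bits registers.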
There is a fundamental difference between the memory base/limit registers and the prefetchable memory base/limit registers. The memory base/limit registers are used for memory ranges occupied by devices that have side effects on read transactions. The prefetchable memory base/limit registers are used only for devices that don't have side effects on read because, in this case, the PCI-to-PCI bridge can prefetch the data on a read transaction from the device without problems. Prefetching works because there is no side effect on the read transaction. Another difference is that the prefetchable memory base/limit registers are able to handle devices located above the 4GB limit because they can handle a 64-bit address space.
There are no memory base/limit registers for devices mapped above 4GB because the PCI specification assumes all devices that require large address ranges behave like memory, i.e., their memory contents are prefetchable and don't have side effects on reads. Therefore, the PCI specification implies that devices with large address range consumption should implement prefetchable memory base/limit registers instead of memory base/limit registers, and all devices with memory that has side effects on read should be mapped to address ranges below the 4GB limit by the platform firmware.
A fact sometimes overlooked when dealing with the PCI-to-PCI bridge is that the bridge forwards memory transactions upstream (from the secondary interface to the primary interface, i.e., from the PCI device in the direction of the CPU) if the transaction address doesn't fall within the range covered by the memory base/limit or prefetchable memory base/limit registers. Perhaps you're asking, why is this behavior needed? The answer is that we need direct memory access (DMA) to work for devices connected to the PCI-to-PCI bridge secondary interface. In DMA, the device downstream of the PCI-to-PCI bridge initiates the transaction (to read from or write to RAM) and the PCI-to-PCI bridge must ensure that the transaction is forwarded from the device in the upstream direction toward the RAM.
Devices doing DMA (in this case PCI devices) need to write data into the system memory, the so-called DMA write transaction. If you look at Figure 3, the DMA write transaction for devices connected to the PCI-to-PCI bridge secondary interface must go through the PCI-to-PCI bridge to reach the system memory; if the PCI-to-PCI bridge didn't forward the write transaction upstream, DMA could not work because the contents from the device could not be written to the system memory.
Now, let's have a look at an example of a memory transaction that's forwarded downstream by the PCI-to-PCI bridge in Figure 3. Before we proceed to examine the example, we are going to make several assumptions:
- The system in Figure 3 has 8GB system memory. The first 3GB of the system memory is mapped to the lowest 3GB of the CPU memory address space; the rest is mapped to the address range 4GB-to-9GB in the CPU memory address space, above the 4GB limit.
- The platform firmware has initialized all of the base address registers (BARs) of the PCI devices in the system, including the PCI-to-PCI bridge BARs. The platform firmware initialized the CPU memory range from 3GB to 4GB to be used by PCI devices, of course outside of the hardcoded ranges used by the advanced programmable interrupt controller (APIC) logic, the platform firmware flash ROM chip, and some other legacy system functions in the memory range close to the 4GB limit.
- Contents of the initialized PCI device BARs and related registers are as follows:
  o PCI device 1: only one BAR in use, with 16 MB (non-prefetchable) memory space consumption starting at address E000_0000h (3.5GB). This BAR claims transactions to the E000_0000h to E0FF_FFFFh non-prefetchable memory range.
  o PCI device 2: only one BAR in use, with 16 MB (non-prefetchable) memory space consumption starting at address E100_0000h (3.5GB + 16 MB). This BAR claims transactions to the E100_0000h to E1FF_FFFFh non-prefetchable memory range.
  o PCI device 3: only one BAR in use, with 32 MB (prefetchable) memory space consumption starting at address E200_0000h (3.5GB + 32 MB). This BAR claims transactions to the E200_0000h to E3FF_FFFFh prefetchable memory range.
  o PCI device 4: only one BAR in use, with 128 MB (prefetchable) memory space consumption starting at address C000_0000h (3GB). This BAR claims transactions to the C000_0000h to C7FF_FFFFh prefetchable memory range.
  o PCI device 5: only one BAR in use, with 128 MB (prefetchable) memory space consumption starting at address C800_0000h (3GB + 128MB). This BAR claims transactions to the C800_0000h to CFFF_FFFFh prefetchable memory range.
  o PCI device 6: only one BAR in use, with 256 MB (prefetchable) memory space consumption starting at address D000_0000h (3GB + 256MB). This BAR claims transactions to the D000_0000h to DFFF_FFFFh prefetchable memory range.
  o PCI-to-PCI bridge address and routing related configuration register contents:
    - Primary Bus Number Register: 1
    - Secondary Bus Number Register: 2
    - Subordinate Bus Number Register: 2 (Note: This is the same as the secondary bus number because there is no other bus with a higher number downstream of the PCI-to-PCI bridge)
    - Memory Base: (Disabled)
    - Memory Limit: (Disabled)
    - Prefetchable Memory Base: C000_0000h (3GB)
    - Prefetchable Memory Limit: DFFF_FFFFh (3.5GB - 1)

Now let's look at a sample read transaction with the PCI device arrangement from the assumptions above. Let's say the CPU needs to read data from the PCI device memory at address D100_0000h (3.25GB + 16MB) into RAM. This is what happens:

1. The CPU core issues a read transaction. This read transaction reaches the integrated hostbridge/northbridge.
2. The northbridge forwards the read transaction to the southbridge because it knows that the requested address resides behind the southbridge.
3. The southbridge forwards the read transaction to PCI bus 1, which is connected directly to it.
4. The PCI-to-PCI bridge claims the read transaction and responds to it because the requested address is within its prefetchable memory range (between the value of the prefetchable memory base and the prefetchable memory limit).
5. The PCI-to-PCI bridge forwards the read transaction to its secondary bus, PCI bus 2.
6. PCI device 6 claims the read transaction on PCI bus 2 because it falls within the range of its BAR.
7. PCI device 6 returns the data at the target address (D100_0000h) via PCI bus 2.
8. The PCI-to-PCI bridge forwards the data to the southbridge via PCI bus 1.
9. The southbridge forwards the data to the CPU.
10. The northbridge in the CPU then places the data in RAM and the read transaction completes.
From the sample above, you can see that the PCI-to-PCI bridge forwards a read/write transaction from its primary interface to its secondary interface if the requested address falls within its range. If the read/write transaction doesn't fall within its configured range, the PCI-to-PCI bridge will not forward the transaction from its primary interface to its secondary interface.
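As a quick sanity check on the positive decode in this walkthrough, the snippet below plugs the assumed register values into the same range checks; the constants are the hypothetical values from the example above, not values from any real device.

```c
#include <assert.h>
#include <stdint.h>

int main(void)
{
    uint32_t addr = 0xD1000000;                     /* target of the CPU read  */

    /* Bridge positive decode: prefetchable window C000_0000h-DFFF_FFFFh. */
    uint32_t pf_base = 0xC0000000, pf_limit = 0xDFFFFFFF;
    assert(addr >= pf_base && addr <= pf_limit);    /* forwarded to PCI bus 2  */

    /* PCI device 6 BAR decode: 256MB BAR programmed at D000_0000h. */
    uint32_t bar_base = 0xD0000000, bar_size = 256u << 20;
    assert(addr >= bar_base && addr < bar_base + bar_size); /* device 6 claims */

    return 0;
}
```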
A little-known fact about PCI-to-PCI bridges is the existence of the subtractive decode PCI-to-PCI bridge. The decoding method explained in the example above to claim the read transaction is known as positive decode, i.e., the device claims the transaction if it's within its assigned range (in one of its BARs). The reverse of positive decode is known as subtractive decode. In subtractive decode, the device (with subtractive decode support) claims the transaction if no other device on the bus claims the transaction, irrespective of whether the transaction is within the device's range or not. There can only be one subtractive decode device in one PCI bus tree. There is a certain class of PCI-to-PCI bridge device that supports subtractive decode. It was used to support address decoding of legacy devices, such as a BIOS chip, in older chipsets. However, this technique is largely abandoned in modern-day chipsets because there is already legacy-support logic in the chipset and the CPU.

PCIe Device Types

You have learned all the required prerequisites to understand the PCIe protocol in the previous section. Now let's start by looking into PCIe device types based on their role in a PCIe device tree topology. This is important because you need a fundamental understanding of PCIe device types to understand PCIe device initialization. PCIe devices are categorized as follows:
1. PCIe root complex. The root complex is similar to the northbridge in a PCI-based system. It acts as the glue logic that connects the PCIe device tree to main memory (RAM) and the CPU. In many cases, the root complex also provides a high-speed PCIe connection to the GPU. The root complex can be implemented as part of the northbridge in systems that employ two physical chips for the chipset logic. However, nowadays the root complex is always integrated into the CPU chip, as you can see in Figure 1. Figure 1 shows the PCIe root complex as part of the hostbridge that's integrated into the CPU. The root complex connects to the PCIe device tree through a logical port known as the root port. It is a logical port because the root port can be implemented physically in a chip outside the chip containing the root complex. For example, the root complex can reside in the CPU, but the root port can be located in the chipset. The Haswell CPU and Intel 8-series PCH implement this root port arrangement. Note that the root complex can have more than one root port.
2. PCIe switch. A PCIe switch is a device that connects two or more PCIe links. A switch internally contains several connected virtual PCI-to-PCI bridges. That is the reason you should have a deep understanding of the PCI-to-PCI bridge in order to understand PCIe device tree topology. The root complex can also contain a switch, in which case the root complex has several root ports.
3. PCIe endpoint device. This is the PCIe device type that most people know as a PCIe device. A PCIe endpoint device is a PCIe device that terminates a PCIe link; it only has one connection to the PCIe tree topology, though it can have a connection to another kind of bus. For example, a PCIe network card in most cases is an endpoint device, just like a PCIe storage controller, etc. A PCIe endpoint device can also act as a bridge to a legacy/compatibility bus, such as a PCIe-to-PCI bridge or a bridge to a low pin count (LPC) bus, etc.
Perhaps the explanation of a PCIe switch and endpoint device is still vague. Figure 4 shows an example of a PCIe switch and endpoint devices in a PCIe device tree topology. Figure 4 shows that the PCIe switch is composed of three connected virtual (logical) PCI-to-PCI bridges. The switch has one inbound port (called an ingress port in PCIe) and two outbound ports (called egress ports in PCIe). There are two endpoint devices connected to the switch, an add-in network controller and an add-in SCSI controller. Each of the endpoint devices connects to the switch via one of the switch's virtual PCI-to-PCI bridges.
Figure 4 shows the physical location of the root ports of the PCIe root complex. One is directly connected to the PCIe root complex and the other is not directly connected to the PCIe root complex, i.e., it is connected via the chipset interconnect. In the latter case, the chipset interconnect is said to be transparent with respect to the PCIe device tree topology. Figure 4 shows the external PCIe graphics links to the root port that's located in the PCIe root complex, while the PCIe switch links to the root port via the chipset interconnect. There is no difference between them from a PCIe logic point of view.

Figure 4. PCIe Switch and Endpoint Devices

Figure 4 shows the interconnection between PCIe devices. This interconnection is called a link in the PCIe bus protocol. The link is a logical interconnection that connects two PCIe ports on two different PCIe devices. Each link consists of one or more lanes. Each lane consists of a pair of physical interconnects, one in the outgoing direction from the PCIe device and one in the incoming direction to the PCIe device. The physical interconnect uses differential signaling to transmit the PCIe packets in either direction.
At this point, PCIe device basics should be clear to you. In the next section I'll go through the details of communication between PCIe devices.

PCIe Packets and Device Layering

One of the major differences between the PCIe and PCI bus protocols is the implementation of a higher level of abstraction in PCIe. Each transaction in PCIe is wrapped into a PCIe packet before it's transmitted to another PCIe device. Therefore, PCIe is a packet-based protocol for chip-to-chip communication. A consequence of this is that PCIe has the capability to implement quality of service (QoS) via packet prioritization. However, I'm not going to explain QoS in PCIe; you just need to know that QoS exists in PCIe.
Now let's get to the packet details. The PCIe protocol employs the same philosophy as TCP/IP in that it uses several communication layers, with each layer appending a header to the content of the packet to provide routing, error correction, and other housekeeping. Figure 5 shows how PCIe implements this philosophy.

Figure 5. PCIe Packet-Based Communication

There are three types of packets in the PCIe protocol (as seen from the highest level of abstraction down to the lowest-level packet sent over the PCIe link):
1. Transaction layer packet (TLP): The transaction layer in the PCIe device constructs this packet, as seen in Figure 5. The TLP consists of a TLP header and the data content being transmitted. The source of the data content is the PCIe device core and the PCIe core logic interface in the device. The TLP also carries a CRC, among other fields. A TLP can travel through the PCIe device tree, passing through more than one PCIe device between the source and the destination device. This looks as if the packet tunnels through the PCIe devices in between the source and the destination, but in reality the packet is just routed through the PCIe device tree. A PCIe device in between the source and the destination must be a PCIe switch, because only a switch can forward/route packets from its ingress port to its egress port.
2. Data link layer packet (DLLP): The data link layer in the PCIe device constructs this packet, as seen in Figure 5. The DLLP wraps the TLP in yet another header. The DLLP provides another CRC for the packet. A DLLP can only travel between PCIe devices directly connected to each other through a PCIe link. Therefore, the purpose of this CRC is different from the one carried by the TLP, because the DLLP CRC is used to make sure that the neighboring PCIe device receives the packet correctly. There are also some specific DLLPs which don't contain any TLP inside, such as DLLPs for link power management, flow control, etc.
3. Physical layer packet (PLP): The physical layer in the PCIe device constructs this packet, as seen in Figure 5. The physical layer wraps the DLLP into one or several PLPs, depending on the size of the DLLP; if the DLLP cannot fit into a single PLP, the physical layer logic divides the DLLP into several PLP frames. The PLPs are transmitted on the link between two connected PCIe devices. There are also some specific PLPs that don't contain any DLLP, such as PLPs for link training, clock tolerance compensation, etc.
The explanation of the PCIe packet types above implies that a PCIe device must have three device layers, one for each type of packet. In practice, that's not always the case. As long as the PCIe device can create PCIe packets that conform to the specification, it's fine.

PCIe Address Spaces

You know from the previous section that PCIe is a packet-based chip-to-chip communication protocol. This means that the protocol requires some means to route the DLLPs or TLPs between chips. A DLLP can only reach directly linked PCIe chips. Therefore, we are more interested in TLP routing, because in several cases the target of a read/write transaction lies several chips away from the source of the read/write transaction. There are several mechanisms to route a TLP. Here, we are going to look into one of them, namely TLP routing based on address, also known as address routing.
There are four address spaces in PCIe. In contrast, PCI only has three address spaces. The PCIe address spaces are as follows:
1. PCIe configuration space: This address space is used to access the PCI-compatible configuration registers in PCIe devices and also the PCIe enhanced configuration registers. The part of this address space that resides in the CPU IO space is provided for backward compatibility reasons, i.e., compatibility with the PCI bus protocol. The rest of the PCIe configuration space resides in the CPU memory space. The access mechanism for the first 256 registers is the same as in PCI for the x86/x64 architecture, i.e., using IO ports 0xCF8-0xCFB for the address and 0xCFC-0xCFF for the data. As with PCI devices, 256 eight-bit configuration space registers per function are mapped to this IO address space in PCIe. The first 256 bytes of configuration registers are immediately available at the very early boot stage via the CPU IO space (because the mapping doesn't require firmware initialization), while the rest are available only after the platform firmware finishes initializing the CPU memory space used for the PCIe configuration space. PCIe supports a larger number of configuration space registers than PCI. Each PCIe device has 4KB of configuration space registers. The first 256 bytes of those registers are mapped to both the legacy PCI configuration space and to the PCIe configuration space in the CPU memory space. The entire 4KB of PCIe configuration space registers can be accessed via the PCIe enhanced configuration mechanism. The PCIe enhanced configuration mechanism uses the CPU memory space instead of the CPU IO space (the PCI configuration mechanism uses the CPU IO space in the x86/x64 architecture).
2. PCIe memory space: This address space lies in the CPU memory address space, just as in PCI. However, PCIe supports 64-bit addressing by default. Part of the PCIe configuration registers is located in the PCIe memory space. However, what is meant by PCIe memory space in this context is the CPU memory space consumed by PCIe devices for non-configuration purposes, for example, the CPU memory space used to store PCIe device data, such as the local RAM on a PCIe network controller card or the local RAM on a PCIe graphics card used for the graphics buffer.
3. PCIe IO space: This is the same as the IO space in the PCI bus protocol. It exists only for PCI backward compatibility reasons.
4. PCIe message space: This is a new address space not previously implemented in PCI. This address space exists to eliminate the need for physical sideband signals. Therefore, everything that used to be implemented physically in previous bus protocols, such as the interrupt sideband signal, is now implemented as messages in the PCIe device tree. We are not going to look deeper into this address space. It's enough to know its purpose.
This article only deals with two of the four PCIe address spaces explained above, the PCIe configuration space and the PCIe memory space. We are going to look into the PCIe configuration space in the PCIe configuration mechanisms section later. In this section, we're going to look into the PCIe memory space in detail.
For the sample, we're going to scrutinize a PCIe memory read transaction that goes through the PCIe fabric (device tree), a read transaction routed via address routing. We're going to look at a quite complicated PCIe platform that contains a PCIe switch. This kind of configuration usually doesn't exist on a desktop-class PCIe platform, only on a server-class PCIe platform. The complexity of the sample should make it a lot easier for the reader to deal with desktop-class hardware in a real-world scenario, because the latter is simpler compared to a server-class platform.
Figure 6 shows the sample memory read transaction with its target address at C000_0000h (3GB). The memory read transaction originates in CPU core 1, and the target is the contents of the PCIe Infiniband network controller local memory, because that address is mapped to the latter device's memory. The transaction is routed through the PCIe fabric. The double arrow on the read transaction path in Figure 6, marked as a dashed purple line, indicates that the path taken to get to the PCIe device memory contents is identical to the path taken by the requested data back to CPU core 1.
Address routing in the PCIe fabric can only happen after all the address-related registers in all PCIe devices in the fabric are initialized. We assume that the platform firmware initializes the platform in Figure 6 as follows:

1. The system has 8GB RAM; 3GB is mapped to the 0-to-3GB memory range and the rest is mapped to the 4GB-to-9GB memory range. The mapping is controlled by the respective mapping registers in the hostbridge.
2. The PCIe Infiniband network controller has 32MB of local memory, mapped to addresses C000_0000h to C1FF_FFFFh (3GB to 3GB+32MB-1).
3. The PCIe SCSI controller card has 32MB of local memory, mapped to addresses C200_0000h to C3FF_FFFFh (3GB+32MB to 3GB+64MB-1).
4. Virtual PCI-to-PCI bridges 1, 2, and 3 don't claim memory or IO ranges for themselves, i.e., BAR 0 and BAR 1 of each of these devices are initialized to values that don't claim any memory ranges.
5. Virtual PCI-to-PCI bridge 1 claims memory range C000_0000h to C3FF_FFFFh (3GB to 3GB+64MB-1) as prefetchable memory.
6. Virtual PCI-to-PCI bridge 2 claims memory range C000_0000h to C1FF_FFFFh (3GB to 3GB+32MB-1) as prefetchable memory.
7. Virtual PCI-to-PCI bridge 3 claims memory range C200_0000h to C3FF_FFFFh (3GB+32MB to 3GB+64MB-1) as prefetchable memory.
With all the memory-related registers initialized, we can proceed to see how the read transaction travels through the PCIe fabric.

Figure 6. PCIe Memory Read Transaction Sample Going through the PCIe Fabric via Address Routing

Now let's look at the steps taken by the read transaction shown in Figure 6:
1. The memory read transaction with target address C000_0000h originates in the CPU's core 1.
2. The memory read transaction reaches the hostbridge. The hostbridge mapping registers direct the transaction to the PCIe root complex logic because the address maps to the memory range claimed by PCIe.
3. The PCIe root complex logic in the hostbridge recognizes the memory read transaction as targeting the PCIe fabric, due to the hostbridge mapping register settings, and converts the memory read transaction into a PCIe read TLP.
4. The TLP is placed on logical PCI bus 0. PCI bus 0 originates in the PCIe root complex logic and ends in virtual PCI-to-PCI bridge 1 in the PCIe switch inside the chipset. Note that the chipset interconnect logic is transparent with respect to the PCI and PCIe protocols.
5. Virtual PCI-to-PCI bridge 1 checks the target address of the TLP. First, virtual PCI-to-PCI bridge 1 checks whether the TLP target address is within virtual PCI-to-PCI bridge 1 itself by comparing the target address with the values of its BAR 0 and BAR 1 registers. However, virtual PCI-to-PCI bridge 1's BAR 0 and BAR 1 don't claim any memory read/write transactions, as per the platform firmware initialization values. Then it checks whether the target address of the TLP is within the range of one of its memory base/limit or prefetchable memory base/limit registers. Figure 6 shows both of these steps in points (a) and (b). Virtual PCI-to-PCI bridge 1 finds that the target address of the TLP is within its prefetchable memory range. Therefore, virtual PCI-to-PCI bridge 1 accepts the TLP on PCI bus 0 and routes the TLP to PCI bus 1.
6. Virtual PCI-to-PCI bridge 2 and virtual PCI-to-PCI bridge 3 do a similar thing to the TLP on PCI bus 1 as virtual PCI-to-PCI bridge 1 did on PCI bus 0, i.e., they check their own BARs and their memory base/limit and prefetchable memory base/limit registers. Virtual PCI-to-PCI bridge 2 finds that the TLP target address lies behind its secondary interface. Therefore, virtual PCI-to-PCI bridge 2 accepts the TLP on PCI bus 1 and routes the TLP to PCI bus 2.
7. The PCIe Infiniband network controller on PCI bus 2 checks the target address of the TLP routed by virtual PCI-to-PCI bridge 2 and accepts the TLP because the target address is within the range of one of its BARs.
8. The PCIe Infiniband network controller returns the contents of the target address to the CPU via the PCIe fabric. Note: we are not going to scrutinize this process in detail because we have already learned the details of TLP address-based routing from the CPU to the PCIe Infiniband network controller.
At this point, PCIe address spaces and PCIe address routing should be clear. The next section focuses on the PCIe configuration space and the mechanism for routing PCIe configuration transactions to their targets.

PCIe Configuration Mechanisms

You need to know the PCIe configuration mechanisms because they are the methods used to initialize all of the PCIe devices in a platform that implements PCIe. There are two types of configuration mechanisms in PCIe, as follows:
1. The PCI-compatible configuration mechanism: This configuration mechanism is identical to the PCI configuration mechanism. On an x86/x64 platform, this configuration mechanism uses IO ports CF8h-CFBh as the address port and IO ports CFCh-CFFh as the data port to read/write values from/into the PCI configuration registers of the PCIe device (a code sketch follows this list); refer to the PCI configuration register section in the first part of this series (at http://resources.infosecinstitute.com/system-address-map-initialization-in-x86x64-architecture-part-1-pci-based-systems/) for details on the PCI configuration mechanism. This configuration mechanism can access 256 bytes of configuration registers per device. PCIe supports 4KB of configuration registers per device, in contrast to only 256 bytes supported by PCI. The rest of the configuration registers can be accessed via the second PCIe configuration mechanism, the PCIe enhanced configuration mechanism.
2. The PCIe enhanced configuration mechanism: In this configuration mechanism, the entire set of PCIe configuration registers of all of the PCIe devices is mapped to the CPU memory space, including the first 256 bytes of PCI-compatible configuration registers, which are mapped to both the CPU IO space and the CPU memory space; see point 1 above. The CPU memory range occupied by the PCIe configuration registers must be aligned to a 256MB boundary. The size of the memory range is 256MB. The calculation to arrive at this memory range size is simple: each PCIe device has 4KB of configuration registers, and PCIe supports the same number of buses as PCI, i.e., 256 buses, 32 devices per bus, and 8 functions per device. Therefore, the total size of the required memory range is 256 x 32 x 8 x 4KB, which is equal to 256MB.
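As a concrete illustration of the PCI-compatible mechanism, here is a minimal sketch of a 32-bit configuration read through ports CF8h/CFCh. It assumes a Linux user-space environment with <sys/io.h> and iopl()/ioperm() privileges purely for demonstration; firmware would use its own port IO primitives instead. Note that this mechanism only reaches the first 256 bytes of a function's configuration space.

```c
#include <stdint.h>
#include <sys/io.h>   /* outl()/inl(); requires iopl(3) or ioperm() privileges */

#define PCI_CONFIG_ADDRESS 0xCF8
#define PCI_CONFIG_DATA    0xCFC

static uint32_t pci_config_read32(uint8_t bus, uint8_t dev,
                                  uint8_t func, uint8_t offset)
{
    uint32_t address = (1u << 31)                       /* enable bit           */
                     | ((uint32_t)bus << 16)            /* bits 16-23: bus      */
                     | ((uint32_t)(dev  & 0x1F) << 11)  /* bits 11-15: device   */
                     | ((uint32_t)(func & 0x07) << 8)   /* bits 8-10: function  */
                     | (offset & 0xFC);                 /* dword-aligned offset */

    outl(address, PCI_CONFIG_ADDRESS);   /* write the address port (CF8h-CFBh) */
    return inl(PCI_CONFIG_DATA);         /* read the data port (CFCh-CFFh)     */
}
```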
One of the implications of the PCIe configuration mechanisms is that the first 256 bytes of each PCIe device's configuration registers are mapped into two different spaces: the CPU IO space, through the PCI-compatible configuration mechanism, and the CPU memory space, through the PCIe enhanced configuration mechanism. If you are still confused by this explanation, take a look at Figure 7. Figure 7 shows the mapping of the configuration space registers of one PCIe device into the CPU IO space and the CPU memory space.

Figure 7. PCIe Device Configuration Space Register Mapping as Seen from the CPU

You might be asking why PCIe systems still need to implement the PCI configuration mechanism. The first reason is to provide backward compatibility to operating systems that existed before PCIe was adopted, and the second reason is to provide a way to initialize the PCIe enhanced configuration mechanism. On an x64 platform, the CPU memory range consumed by the PCIe enhanced configuration mechanism is not hardcoded to a certain CPU memory range; it's relocatable in the 64-bit CPU memory space. The platform firmware must initialize a certain register in the PCIe root complex logic to map the PCIe devices' configuration registers to a certain address in the 64-bit CPU memory space. The start address at which the PCIe configuration registers are mapped must be aligned to a 256MB boundary. On the other hand, the location of the PCI configuration registers in the CPU IO space is hardcoded in x86 and x64; this provides a way to initialize the register that controls the mapping of all of the PCIe configuration registers (in the PCIe root complex) via the PCI-compatible configuration mechanism, because the PCI-compatible configuration mechanism is available at all times, including very early at system boot.
The PCIe enhanced configuration mechanism has the implication that reading or writing the PCIe configuration registers of a PCIe device requires a memory read or write. This is in contrast to the PCI configuration mechanism, where the code to do the same thing requires an IO read or IO write. This approach was a trend in the hardware world in the late 90s, i.e., moving all hardware-related registers to the CPU memory space to simplify hardware and system software design. It was adopted not just by the PCIe bus protocol but also by other bus protocols in CPU architectures other than x64.

Figure 8. PCIe Enhanced Configuration Mechanism Address Bits Mapping to CPU Memory Space

Figure 8 shows the mapping of the PCIe enhanced configuration space into the 64-bit CPU memory space. This is the breakdown of the 64-bit PCIe enhanced configuration space register address:
1. Address bits 28-63 are the upper bits of the 256MB-aligned base address of the 256MB memory-mapped IO address range allocated for the enhanced configuration mechanism. The manner in which the base address is allocated is implementation-specific. Platform firmware supplies the base address to the OS. Mostly, the base address is controlled by a programmable register that resides in the chipset or is integrated into the CPU.
2. Address bits 20-27 select the target bus (1-of-256).
3. Address bits 15-19 select the target device (1-of-32) on the bus.
4. Address bits 12-14 select the target function (1-of-8) within the device.
5. Address bits 2-11 select the target double-word (a.k.a. dword), 1-of-1024, within the selected function's configuration space.
6. Address bits 0-1 define the start byte location within the selected dword.

As with PCI configuration register accesses, reads from or writes to the PCIe enhanced configuration registers must be aligned to a dword (32-bit) boundary. This is because the CPU and the chipset in the path to the PCIe enhanced configuration register only guarantee the delivery of configuration transactions if they are aligned to a 32-bit boundary.
In the x64 architecture, a special register in the CPU, part of the PCIe root complex logic, controls the 36-bit PCIe enhanced configuration space base address. This base address register must be initialized by the platform firmware at boot. The register initialization is carried out through the PCI-compatible configuration mechanism because, at very early boot, the register contains a default value that is not usable to address the registers in the PCIe enhanced configuration space. We'll take a deeper look into the implementation of this base address when we dissect the PCIe-based system address map later.
Now, let's look at a simple example of PCIe enhanced configuration register mapping into the CPU address space. Let's make these assumptions:
1. The base address of the PCIe enhanced configuration address space is set to C400_0000h (3GB+64MB) in the PCIe root complex register.
2. The target PCIe device resides on bus number one (1).
3. The target PCIe device is device number zero (0) on the corresponding bus.
4. The target PCIe function is function number zero (0).
5. The target register resides at offset 256 (100h) in the PCIe device configuration space.
6. The size of the target register is 32 bits (1 dword).

With the assumptions above, we find that the target PCIe enhanced configuration register resides at address C410_0100h. The upper 32-bit value of the PCIe enhanced configuration register address is practically 0; the target address only uses the lower 32 bits of the CPU memory address space. If the target address that corresponds to the target PCIe configuration register is still confusing, break it down according to the mapping shown in Figure 8. It should be clear after that.
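The same breakdown can be written as a small helper function. This is only a sketch of the address calculation implied by Figure 8; the function name and the use of a flat physical address are my own, and the ECAM base (C400_0000h here) is whatever the platform firmware programmed into the root complex.

```c
#include <stdint.h>
#include <stdio.h>

/* Compute the CPU physical address of a register in the PCIe enhanced
 * configuration space. ecam_base must be 256MB aligned. */
static uint64_t pcie_ecam_address(uint64_t ecam_base, uint8_t bus,
                                  uint8_t dev, uint8_t func, uint16_t offset)
{
    return ecam_base
         | ((uint64_t)bus << 20)             /* bits 20-27: bus (1-of-256)     */
         | ((uint64_t)(dev  & 0x1F) << 15)   /* bits 15-19: device (1-of-32)   */
         | ((uint64_t)(func & 0x07) << 12)   /* bits 12-14: function (1-of-8)  */
         | (offset & 0xFFF);                 /* bits 0-11: offset in 4KB space */
}

int main(void)
{
    /* The worked example above: base C400_0000h, bus 1, device 0,
     * function 0, offset 100h. Prints c4100100. */
    printf("%llx\n",
           (unsigned long long)pcie_ecam_address(0xC4000000ULL, 1, 0, 0, 0x100));
    return 0;
}
```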

PCIe Capabilities Register Set

There are several fundamental differences between PCIe and legacy PCI devices. We are going to look into one of those differences before we move on to PCIe BAR initialization, because it affects the PCIe BAR implementation: the PCIe capabilities register set. All PCIe devices must implement the PCIe capabilities register set in the first 256 bytes of their configuration space registers. In contrast, a legacy PCI device is not required to implement any capabilities register set. In legacy PCI devices, implementing a capabilities pointer is optional, not mandatory.
Figure 9 shows the implementation of the PCIe capabilities register set in a PCIe device's configuration space registers.

Figure 9. PCIe Device Capabilities Register Set

Figure 9 shows a capabilities pointer register, highlighted in purple, in the PCIe device configuration space pointing to the PCIe capabilities register set. In practice, the capabilities pointer register points to the start of the PCIe capabilities register set by using an 8-bit offset (in bytes) to the start of the PCIe capabilities register set. The offset is calculated from the start of the PCIe device configuration space. This 8-bit offset is stored in the capabilities pointer register. The position of the PCIe capabilities register set is device-specific. However, the PCIe capabilities register set is guaranteed to be placed in the first 256 bytes of the PCIe device configuration space and located after the mandatory PCI header. Both type 0 and type 1 headers must implement the PCIe capabilities register set in a PCIe device configuration space.
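Because each entry in the capabilities list starts with a capability ID byte followed by a next-pointer byte, locating the PCIe capabilities register set is a simple list walk. The sketch below assumes a hypothetical cfg_read8() accessor for the function's configuration space; the PCI Express capability ID is 10h.

```c
#include <stdint.h>

#define PCI_CAP_POINTER_OFFSET 0x34   /* capabilities pointer register        */
#define PCI_CAP_ID_PCIE        0x10   /* capability ID of the PCIe capability */

/* Walk the capabilities list and return the offset of the PCIe capabilities
 * register set, or 0 if the function does not implement it.
 * cfg_read8() is a hypothetical accessor for this function's config space. */
static uint8_t find_pcie_capability(uint8_t (*cfg_read8)(uint8_t offset))
{
    uint8_t ptr = cfg_read8(PCI_CAP_POINTER_OFFSET) & 0xFC;

    while (ptr != 0) {
        uint8_t cap_id = cfg_read8(ptr);         /* byte 0: capability ID   */
        uint8_t next   = cfg_read8(ptr + 1);     /* byte 1: next capability */
        if (cap_id == PCI_CAP_ID_PCIE)
            return ptr;
        ptr = next & 0xFC;
    }
    return 0;
}
```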
Now, let's look more closely at part of the PCIe capabilities register set. Figure 9 shows that the third register in the capabilities register set is the PCIe capabilities register. Figure 10 shows the format of this register's contents.

Figure 10. PCIe Capabilities Register Format

The device/port type bits (bits 4-7) in the PCIe capabilities register are the ones that affect the PCIe device mapping to the system address map. The device/port type bits determine whether the PCIe device is a native PCIe endpoint function or a legacy PCIe endpoint function. The differences between the two types of PCIe device are:
1. The value of the device/port type bits in a native PCIe endpoint function is 0000b. Native PCIe endpoint function devices must map all of the device components, such as their registers and local memory, to the CPU memory space at runtime, i.e., from inside a running OS. The only time the device is permitted to use the CPU IO space is during early boot, before the platform firmware finishes initializing the system.
2. The value of the device/port type bits in a legacy PCIe endpoint function is 0001b. Legacy PCIe endpoint function devices are permitted to use the CPU IO space even at runtime. The PCIe specification assumes that legacy PCIe endpoint function devices act as front-ends to a legacy bus, such as PCI or PCI-X.
Now, it's clear that the contents of the PCIe capabilities register determine whether the PCIe device will map its BARs to the CPU memory space or to the CPU IO space at runtime. There are special cases, though, especially when dealing with legacy IO devices. For example, legacy PC-compatible devices such as VGA and IDE controllers frequently expect to be located within fixed legacy IO ranges. Such functions do not implement base address registers. Instead, the configuration software identifies them as legacy functions via their respective class codes, at offset 09h in the PCIe configuration space, and then enables their IO decoder(s) by setting the IO space bit in the command register to one.

PCIe Base Address Register Initialization

PCIe devices use BARs just like PCI devices. Therefore, a PCIe device's BARs must be initialized before the device can be used. PCI BAR initialization is the job of the platform firmware. The PCI specification provides implementation notes on PCI BAR initialization. PCIe continues to support this BAR initialization method.
I'm not going to repeat the explanation of PCI BAR initialization here; I'm only going to highlight the differences between PCIe BAR initialization and PCI BAR initialization in this section. Please refer to the first part of the series for the basics of PCI BAR formats and PCI BAR initialization (at http://resources.infosecinstitute.com/system-address-map-initialization-in-x86x64-architecture-part-1-pci-based-systems/).

PCIe BAR Formats

There are two types of BAR: the first is a BAR that maps to the CPU IO space, an IO BAR, and the second is a BAR that maps to the CPU memory space, a memory BAR. A PCIe IO BAR is exactly the same as a PCI IO BAR. However, the PCIe specification recommends abandoning the IO BAR for new PCIe devices. These new devices should use the memory BAR instead.

Figure 11. PCI/PCIe Memory BAR Format

Figure 11 shows the memory BAR format. Figure 11 shows that the lowest bit is hardcoded to 0 in a BAR that maps to the CPU memory space. It also shows that bit 1 and bit 2 determine whether the BAR is a 32-bit BAR or a 64-bit BAR.
Figure 11 shows that bit 3 controls prefetching for a BAR that maps to the CPU memory space. Prefetching in this context means that the CPU fetches the contents of the memory addressed by the BAR before a request to that specific memory address is made, i.e., the fetching happens in advance, hence pre-fetching. This feature is used to improve the overall PCI/PCIe device memory read speed.
The main difference between a PCI and a PCIe memory BAR is that all memory BAR registers in PCIe endpoint functions with the prefetchable bit set to 1 must be implemented as 64-bit memory BARs. Memory BARs that do not have the prefetchable bit set to 1 may be implemented as 32-bit BARs. The minimum memory range requested by a memory BAR is 128 bytes.
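A short sketch of decoding those low BAR bits follows; it assumes the raw 32-bit value has already been read from the BAR, and the struct and function names are illustrative only.

```c
#include <stdbool.h>
#include <stdint.h>

struct bar_info {
    bool     is_io;         /* bit 0: 1 = IO BAR, 0 = memory BAR               */
    bool     is_64bit;      /* bits 2:1 = 10b: 64-bit memory BAR               */
    bool     prefetchable;  /* bit 3: prefetchable memory                      */
    uint32_t base_low;      /* bits 31:4 hold the low part of the base address */
};

static struct bar_info decode_memory_bar(uint32_t raw)
{
    struct bar_info info;
    info.is_io        = raw & 0x1;
    info.is_64bit     = ((raw >> 1) & 0x3) == 0x2;
    info.prefetchable = (raw >> 3) & 0x1;
    info.base_low     = raw & 0xFFFFFFF0u;
    return info;
}
```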
Another difference between PCIe and PCI concerns the dual address cycle (DAC). PCIe is a serial bus protocol and doesn't implement DAC. PCIe was designed with native 64-bit addressing in mind. Therefore, support for memory transactions targeting 64-bit addresses is native in PCIe. There is no performance penalty for carrying out memory transactions targeting 64-bit addresses.

PCIe BAR Sizing

The algorithm for PCIe BAR sizing is the same as the algorithm for PCI device BAR sizing explained in the first article. The difference lies only in the prefetchable memory BAR: because a prefetchable memory BAR in PCIe must be 64 bits wide, the BAR sizing algorithm must use two consecutive 32-bit BARs instead of one 32-bit BAR during BAR sizing.
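For reference, here is a minimal sketch of sizing such a 64-bit memory BAR with the classic write-all-ones method, assuming hypothetical cfg_read32()/cfg_write32() accessors for the function's configuration space and bar_offset pointing at the lower BAR dword (e.g., 10h).

```c
#include <stdint.h>

/* Size a 64-bit memory BAR: write all 1s to both halves, read back the
 * writable bits, then restore the values programmed by the firmware. */
static uint64_t size_64bit_memory_bar(uint32_t (*cfg_read32)(uint8_t),
                                      void (*cfg_write32)(uint8_t, uint32_t),
                                      uint8_t bar_offset)
{
    uint32_t orig_lo = cfg_read32(bar_offset);
    uint32_t orig_hi = cfg_read32(bar_offset + 4);

    cfg_write32(bar_offset,     0xFFFFFFFF);
    cfg_write32(bar_offset + 4, 0xFFFFFFFF);

    /* Mask off the low 4 attribute bits of the lower dword before sizing. */
    uint64_t mask = ((uint64_t)cfg_read32(bar_offset + 4) << 32)
                  | (cfg_read32(bar_offset) & 0xFFFFFFF0u);

    cfg_write32(bar_offset,     orig_lo);
    cfg_write32(bar_offset + 4, orig_hi);

    return ~mask + 1;   /* requested memory range size in bytes */
}
```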

Dissecting PCIe-Based System Address Map

In this section we look at an implementation sample of the system address map in x86/x64 before proceeding to the system address map initialization in more detail. The implementation sample is based on the Haswell platform, with its integrated northbridge/hostbridge, and the Intel 8-series PCH. This platform implements the PCIe bus and it's an up-to-date platform. Therefore, it's a perfect example to learn a real-world PCIe implementation.
The Intel 8-series PCH can be viewed as the southbridge in the classic system layout; however, the two are not the same logic, because there are some functions in the PCH that are absent in the classic southbridge. You can download the CPU datasheet from http://www.intel.com/content/www/us/en/processors/core/CoreTechnicalResources.html and the PCH datasheet from http://www.intel.com/content/www/xr/en/chipsets/8-series-chipset-pch-datasheet.html.
PCIe differs from PCI in that PCIe moves everything to the CPU memory space, including its configuration space, as you can see from the PCIe configuration mechanisms section. The presence of part of the PCIe configuration registers in the CPU IO space is only for backward compatibility reasons. This fact means the CPU memory space in a PCIe-based system is a bit more fragmented compared to PCI-based systems. However, this approach pays off in terms of a less complicated CPU design and quicker access to all of the memory ranges mapped to the CPU memory space, including the PCIe configuration registers, because access to the CPU memory space is quicker than access to the IO space by default.

Haswell CPU and Intel 8-series Chipset Platform

Figure 12 shows a block diagram of systems with the Haswell CPU and 8-series chipset combination. Figure 12 shows the entire connection from the chipset to other components in the system, including those that might not exist in all chipset stock keeping units (SKUs).

Figure 12. Intel Haswell CPU with 8-series Chipset Block Diagram

Not all of the system interconnects in Figure 12 affect the system address map. We are going to focus only on the interconnects and control registers that affect the system address map in this article. The interconnects of interest in Figure 12 are the DMI 2.0 interconnect, the interconnect from the CPU to the PCIe graphics, the SPI interconnect from the Intel H87 chipset to the platform firmware, the interconnect from the Intel H87 chipset to PCIe devices, and the interconnect from the CPU to the DDR3 DRAM modules. We will get into the details of memory transaction routing to these interconnects in the next section (Haswell Memory Transactions Routing).

Haswell Memory Transactions Routing

Address-based memory transaction routing in the Haswell CPU determines the system memory map. There are several control registers in the hostbridge part of the CPU that control memory transaction routing in this platform. Before we get into the register details, we'll have a look at the high-level view of memory transaction routing in the northbridge. Figure 13 shows the logic components in the northbridge that take care of memory transaction routing. You won't see these logic blocks depicted in any of the publicly available datasheets from Intel. I drew them in Figure 13 based on details provided in the datasheets. The logic blocks are abstractions to make the memory transaction routing understandable.

Figure 13. Memory Transactions Routing in Haswell Northbridge/Hostbridge

The memory transactions in Figure 13 originate in the CPU and target DRAM, the DMI, or the external PCIe graphics. We are not going to delve into direct memory access (DMA) in this article, because DMA can originate in the PCIe graphics or the DMI. This means that DMA adds unnecessary complications to understanding memory transaction routing in the hostbridge.
Figure 13 shows five different memory transaction routing logic blocks that connect directly to the Haswell CPU cores. The memory transaction routing logic blocks are as follows:
1. Normal DRAM range logic: This logic block routes memory transactions (read/write) targeting the range covered by DRAM which requires no remapping, i.e., the target address of the transaction doesn't need any translation before entering the memory/DRAM controller. The control registers that control this memory range are the top of low usable DRAM (TOLUD) register and the remap base register. Both registers are in the hostbridge. TOLUD controls the CPU memory range occupied by the DRAM below 4GB. The remap base is only in use if the system DRAM size is equal to or larger than 4GB; in this case the remap base marks the end of the normal CPU DRAM range above 4GB.
2. Remapped DRAM range logic: This logic block routes memory transactions (read/write) targeting the range covered by DRAM that requires remapping, i.e., the target address of the transaction needs to be translated before entering the memory/DRAM controller. There are two control registers that control the remapped memory range, the remap base and remap limit registers. The registers are in the hostbridge.
3. Compatibility memory range logic: This logic block routes memory transactions (read/write) targeting ranges covered by the compatibility memory ranges. These ranges comprise the range between A_0000h and F_FFFFh, the ISA hole from F0_0000h to FF_FFFFh (15MB to 16MB), and the legacy range just below the 4GB limit. The compatibility memory ranges are further divided into these sub-ranges:
1. Legacy VGA memory range, which lies between A_0000h and B_FFFFh: The VGA memory map mode control register controls the mapping of the compatibility memory range from A_0000h to B_FFFFh. This range may be mapped to PCIe, DMI, or the Internal Graphics Device (IGD), depending on the VGA memory map mode control register value. Therefore, a memory transaction targeting the memory range between A_0000h and B_FFFFh will be routed to either PCIe, DMI, or the IGD.
2. Non-VGA compatibility and non-ISA hole memory range, which consists of the memory range from C_0000h to F_FFFFh: All memory transactions targeting this compatibility memory range are routed either to the memory/DRAM controller or to the 8-series PCH chipset (the Intel H87 chipset in Figure 12) via the DMI interface, depending on the values in the control registers of the corresponding compatibility memory range logic. The control registers for the compatibility memory range from C_0000h to F_FFFFh are named programmable attribute map (PAM) registers. There are seven PAM registers, from PAM0 to PAM6; all of them are located in the hostbridge part of the CPU.
3. ISA hole memory range from F0_0000h to FF_FFFFh (15MB-16MB): A legacy access control (LAC) register in the hostbridge controls the routing of memory transactions targeting the ISA hole at memory range F0_0000h to FF_FFFFh (15MB-16MB). All memory transactions targeting this compatibility memory range are routed either to the memory/DRAM controller or to the 8-series PCH chipset (the Intel H87 chipset in Figure 12) via the DMI interface, depending on the values in the LAC control register. The ISA hole is an optional range; it's disabled by default in the hostbridge.
4. Platform firmware flash, message-signaled interrupt (MSI), and advanced programmable interrupt controller (APIC) memory range: This range is between 4GB-minus-20MB and 4GB (FEC0_0000h-FFFF_FFFFh). All memory transactions targeting this compatibility memory range are always routed to the DMI interface, except those targeting the MSI address range and the APIC memory ranges that correspond to the local APICs in the CPU cores. Memory transactions targeting the range occupied by the platform firmware flash will be forwarded by the southbridge to the platform firmware flash chip once the transactions reach the southbridge via the DMI. This memory range is hardcoded; no control register in the hostbridge is needed for memory transaction routing in this memory range.
4. PCIe 3.0 graphics memory range logic: This logic block routes memory transactions (read/write) targeting the range covered by the BARs of the external PCIe graphics card. If the range is below 4GB, there is no specific control register in the hostbridge that alters access to this memory range; only the external PCIe graphics BARs determine the routing. The PMBASEU and PMLIMITU registers control access to the PCIe graphics memory range if the external PCIe graphics uses a memory range above 4GB. Both registers are part of the PCIe controller logic integrated into the Haswell CPU.
5. PCIe 2.0/PCI memory range logic: This logic block routes memory transactions (read/write) targeting the range from the value of the TOLUD register to 4GB to the 8-series PCH chipset via the DMI interface. This logic block also routes memory transactions (read/write) targeting the range between the PMBASEU and PMLIMITU registers, but which don't fall within the range covered by the PCIe 3.0 graphics memory range, if the system has 4GB RAM or more. The range from the TOLUD value to 4GB is set aside for PCI/PCIe memory. The PCI/PCIe memory range that's not claimed by the PCIe 3.0 graphics resides in the 8-series PCH chipset. The control registers for this range are the TOLUD, PMBASEU, and PMLIMITU registers, located in the hostbridge.
All five memory transaction routing logic blocks are mutually exclusive, i.e., every memory transaction must be claimed by only one of them. There should be only one memory transaction routing logic block that claims a given memory transaction. Anarchy in memory transaction routing could happen, though. Anarchy in this context means more than one logic block claims a memory transaction. Anarchy happens if the platform firmware initializes one or more control registers of these logic blocks incorrectly.
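To tie the routing rules together, here is a deliberately simplified sketch of an address classifier using only the registers discussed above (TOLUD and the remap limit); it is an illustration of the decode order, not Haswell's actual hardware logic, which also involves the PAM, LAC, TSEGMB, PMBASEU/PMLIMITU, and other registers.

```c
#include <stdint.h>

enum route { ROUTE_DRAM, ROUTE_PCI_PCIE, ROUTE_COMPAT_HIGH };

/* Classify a CPU physical address (a sketch under the assumptions above). */
static enum route classify_address(uint64_t addr, uint64_t tolud,
                                   uint64_t remap_limit)
{
    if (addr < tolud)
        return ROUTE_DRAM;            /* normal DRAM range below 4GB            */
    if (addr >= 0xFEC00000ULL && addr < 0x100000000ULL)
        return ROUTE_COMPAT_HIGH;     /* APIC/MSI/platform firmware flash       */
    if (addr < 0x100000000ULL)
        return ROUTE_PCI_PCIE;        /* TOLUD-to-4GB: PCI/PCIe device memory   */
    if (addr <= remap_limit)
        return ROUTE_DRAM;            /* DRAM above 4GB, incl. reclaimed range  */
    return ROUTE_PCI_PCIE;            /* e.g., 64-bit PCIe BARs above 4GB       */
}
```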

Haswell System Address Map

In the preceding section, you learned how memory transactions are routed in Haswell by the northbridge based on the target address of the transactions. This section delves into the result of the routing, the system address map. The presence of address remapping in the northbridge makes the system address map quite complicated, i.e., the address map depends on the point of view, whether the address map is seen from the CPU core(s) perspective or not. Figure 14 shows the Haswell system address map with 4GB RAM or more. I chose not to cover Haswell systems with less than 4GB of RAM because address remapping is not in use in such a configuration.

Figure 14. Haswell System Address Map (System Memory >= 4GB)

Figure 14 shows the Haswell system address map from the CPU core perspective and from the
DRAM controller perspective. The system address maps from the two perspectives are different
because the DRAM controller doesn't see memory ranges consumed by PCI/PCIe devices, and it
doesn't need such visibility either. The CPU views the memory range from TOLUD to 4GB as
allocated to PCI/PCIe devices, while the DRAM controller views the same memory range as
allocated to DRAM. Such different views are possible because the northbridge remaps the
respective memory range in the DRAM from TOLUD to 4GB (as seen from the DRAM controller) to a
new memory range above 4GB, called the reclaim memory range, in the CPU memory space. The
reclaim memory range is determined by two registers in the northbridge: the REMAPBASE and
REMAPLIMIT registers. The memory remapping logic in the northbridge carries out the
remapping task, as you can see in Figure 13.

Boxes with a light blue color in Figure 14 represent memory ranges occupied by RAM. Note that
the DRAM controller sees the available RAM as a contiguous memory range while the CPU
core doesn't; the CPU core view contains holes in the memory range below 4GB that don't
belong to RAM. The holes are marked as boxes with non-light-blue colors in Figure 14.
Details of the memory ranges in Figure 14 are as follows:
1. Legacy address range (as seen from the CPU core perspective). This range is the DOS
compatibility range between 0 and 1MB. Memory transactions targeting the range between
0-640KB are always routed to DRAM, while memory transactions targeting the range
between 640KB-1MB are routed based on the values of the PAM registers which control
the respective sub-ranges. Recall that there are seven PAM registers controlling the memory
range between 640KB and 1MB.
2. Main memory ranges. These are memory ranges occupied by DRAM that don't require address
remapping. The OS has visibility into these ranges. The normal memory range logic in Figure
13 handles these memory ranges. These memory ranges have identical mappings from
both the CPU perspective and the DRAM controller perspective.
3. TSEG range. This memory range has an identical mapping from both the CPU
perspective and the DRAM controller perspective. TSEG is an abbreviation for top of
main memory segment. In today's context, however, it means the segment that lies at
the top of the main memory range below the 4GB limit. The start of this memory range is
determined by the value of the TSEG memory base (TSEGMB) register in the
hostbridge. Contents of this segment can only be seen when the CPU is running in
system management mode (SMM). Therefore, code running outside of SMM, even
operating system code, doesn't have visibility into this range in RAM. This segment stores
the runtime data and code of the platform firmware.
4. GFX GTT stolen memory range. This memory range is seen as part of the PCI/PCIe
memory range from the CPU perspective, while it's seen as part of the DRAM from the
DRAM controller perspective. This memory range only exists if the integrated graphics
device (IGD) in the CPU is enabled in the hostbridge GMCH graphics control (GGC)
register (via its VAMEN bit); the platform firmware must initialize the bit based on the
firmware configuration setting. This memory range stores the graphics translation table
(GTT) entries. GTT entries are akin to page table entries (PTEs) in the CPU, but GTT
entries are used for graphics memory. This memory range occupies the PCI/PCIe
memory range from the CPU perspective despite the fact that it physically resides in the
system memory (DRAM), not in the local RAM of a PCIe graphics card. Bits 8-9 (GGMS
bits) in the GGC register in the hostbridge determine the size of this memory range, while
the base of GTT stolen memory (BGSM) register in the hostbridge determines the
start/base address of this memory range.
5. GFX stolen memory range. This memory range is seen as part of the PCI/PCIe
memory range from the CPU perspective, while it's seen as part of the DRAM from the
DRAM controller perspective. This memory range only exists if the IGD in the CPU is
enabled; see the GFX GTT stolen memory range above for details on enabling the IGD. This memory
range stores the graphics data, i.e., it acts as graphics memory for the IGD. This memory
range occupies the PCI/PCIe memory range from the CPU perspective despite the fact
that it physically resides in the system memory (DRAM), not in the local RAM of a PCIe
graphics card. Bits 3-7 (GMS bits) in the GGC register in the hostbridge determine the
size of this memory range, while the base data of stolen memory (BDSM) register in the
hostbridge determines the start/base address of this memory range. (A short sketch that
extracts these GGC fields appears after this list.)
6. PCI/PCIe memory range below 4GB. This memory range is only seen from the CPU
perspective. Its start depends on whether the IGD is active: if the IGD is
active, this memory range starts at the value of the BGSM register; otherwise it
starts at the value of the TOLUD register. The upper limit of this memory
range is 4GB minus 20MB (FEC0_0000h). Accesses to this range are forwarded either to the
external PCIe graphics via the PCIe 3.0 connection or to the southbridge via the DMI.
7. PCI/PCIe memory range above 4GB. This memory range is only seen from the CPU
perspective. The value of the top of upper usable DRAM (TOUUD) register in the
hostbridge determines the start of this memory range; the value of TOUUD is equal to
the value of the REMAPLIMIT register plus one. The value of the PMLIMITU register in the
hostbridge determines the end of this memory range. Accesses to this range are
forwarded either to the external PCIe graphics via the PCIe 3.0 connection or to the
southbridge via the DMI.
8. Graphics aperture range. This memory range is shown as part of the PCI/PCIe memory range
below 4GB in Figure 14. In practice, however, this memory range may as well reside in
the PCI/PCIe memory range above 4GB, depending on the platform firmware and the
system configuration. This memory range is always mapped to the PCI/PCIe memory
range. It's used as a contiguous address space and as additional graphics memory space when
the graphics memory in the IGD or the external PCIe graphics card runs out. The
graphics aperture must be enabled in the platform firmware configuration for this
range to exist; otherwise, it doesn't exist. Memory management for the system memory
allocated to serve the graphics aperture uses a graphics translation table, just like the
legacy AGP aperture. The differences lie in the possibility that the aperture lies above the
4GB limit and in the handling of the graphics aperture memory management: it's the
responsibility of the OS, specifically the graphics device driver, to carry out memory
management for the graphics aperture. Refer to the first part of this series for graphics
aperture basics (at http://resources.infosecinstitute.com/system-address-map-initialization-in-x86x64-architecture-part-1-pci-based-systems/). The graphics memory
aperture base address register (GMADR) in the IGD determines the start address of the
graphics aperture when the IGD acts as the graphics chip in the Haswell platform.
9. Flash BIOS, APIC, MSI interrupt memory range (FEC0_0000h-FFFF_FFFFh). As
explained in the previous section, all memory transactions targeting this compatibility
memory range are always routed to the DMI interface, except those targeting the MSI
address range and the APIC memory ranges that correspond to the local APIC in the
CPU cores. Memory transactions targeting the last two ranges are always directed to the
CPU.

10. Main memory reclaim address range. This memory range occupies a different address
when viewed from the CPU perspective than when viewed from the DRAM controller perspective.
The REMAPBASE and REMAPLIMIT registers in the hostbridge determine the start
and end of this memory range as seen from the CPU perspective. The TOLUD
register and the 4GB limit define the start and end of the same memory range when
viewed from the DRAM controller perspective. The remapped memory range logic in the
hostbridge, shown in Figure 13, remaps memory transactions from the CPU targeting
this memory range before the transactions reach the DRAM controller.
11. Manageability engine UMA memory range. This range is not controlled by the CPU.
The manageability engine (ME) is integrated into the 8-series PCH (southbridge). The
platform firmware reads the Intel management engine UMA register in the 8-series PCH
to determine the size of this range. The platform firmware must allocate this range from
the top of memory (TOM), up to the size requested by the UMA register, and it initializes
the MESEG_BASE and MESEG_MASK registers in the hostbridge to allocate the requested
range from DRAM. ME is basically a microcontroller running in its own execution
environment, outside the CPU's control. ME uses the manageability engine UMA memory
range in RAM to cache its firmware while the system is not in a low-power state; it uses
its own integrated static RAM when the system is in a low-power state. ME doesn't run
when the system is completely out of power.
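As a small illustration of the GGC fields mentioned in items 4 and 5, the sketch below extracts the GGMS field (bits 8-9) and the GMS field (bits 3-7) from a raw GGC register value. The helper names are mine, and the mapping from these raw encodings to actual sizes is datasheet-specific, so only the raw field values are returned.

#include <stdint.h>

/* Hypothetical decode of the GGC (GMCH Graphics Control) register fields
 * described above: GGMS in bits 8-9 selects the GTT stolen memory size,
 * GMS in bits 3-7 selects the graphics data stolen memory size.
 * The encoding-to-size tables are datasheet-specific and omitted here. */
static inline uint32_t ggc_ggms_field(uint16_t ggc) { return (ggc >> 8) & 0x3;  }
static inline uint32_t ggc_gms_field (uint16_t ggc) { return (ggc >> 3) & 0x1F; }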
At this point, the Haswell system memory map should be clear. Nonetheless, we are going to look
into a hypothetical memory read transaction to improve our understanding of the Haswell system
memory map. Let's make the following assumptions about the system configuration:

Physical memory (DRAM) size: 6GB
Memory space allocated to memory-mapped IO (including flash, APIC, MSI, Intel TXT): 1GB
Size of remapped physical memory: 1GB
Top of memory (TOM): 1_8000_0000h (6GB)
ME UMA size: 0 (ME is disabled)
TOUUD: 1_C000_0000h (7GB); this address is 1MB-aligned
TOLUD: C000_0000h (3GB)
REMAPBASE: 1_8000_0000h (6GB)
REMAPLIMIT: 1_BFF0_0000h (effectively 7GB - 1; see the note below)

Note: The remap range is inclusive of the base and limit addresses. In the address
decoder, bits 0-19 of the remap base address are assumed to be 0s; similarly, bits 0-19
of the remap limit are assumed to be Fs (F_FFFFh). This ensures that the remap range
stays on 1MB boundaries.

Now, let's trace a memory read transaction that targets the physical address 1_8000_0000h (6GB)
in this system configuration. Figure 15 shows how the memory read transaction travels in the
system.

Figure 15. Haswell Memory Read Transaction Sample

The red lines in Figure 15 denote the memory read transaction. Figure 15 intentionally omits
logic blocks not related to the memory read transaction to make the transaction flow easier to
follow. The memory read transaction originates from CPU core 1. The remapped memory range
logic claims the memory read transaction once it enters the hostbridge, because the target
address is within the range covered by the REMAPBASE and REMAPLIMIT registers. The
remapped memory range logic then remaps the transaction target address into the address
as seen from the DRAM controller perspective and forwards the transaction to the DRAM
controller. The DRAM controller then handles the memory read transaction, i.e., it
fetches the correct contents from the DRAM module.
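To make the remapping arithmetic concrete, here is a hedged C sketch of the translation that the remapped memory range logic performs on this sample transaction. The formula (DRAM address = CPU address - REMAPBASE + TOLUD) is my reading of the behavior described above, and the register values follow the sample configuration.

#include <stdint.h>
#include <stdio.h>

/* Sketch of the remapping done by the remapped memory range logic, based on
 * the description above: a CPU address inside [REMAPBASE, REMAPLIMIT] maps to
 * the DRAM range [TOLUD, 4GB). Values follow the sample configuration. */
#define TOLUD       0xC0000000ULL    /* 3GB                      */
#define REMAPBASE   0x180000000ULL   /* 6GB                      */
#define REMAPLIMIT  0x1BFFFFFFFULL   /* effective limit, 7GB - 1 */

static uint64_t remap_to_dram(uint64_t cpu_addr)
{
    if (cpu_addr < REMAPBASE || cpu_addr > REMAPLIMIT)
        return cpu_addr;                      /* not in the reclaim range */
    return cpu_addr - REMAPBASE + TOLUD;      /* assumed translation      */
}

int main(void)
{
    /* The sample transaction: CPU core reads physical address 6GB. */
    printf("0x%llx\n", (unsigned long long)remap_to_dram(0x180000000ULL));
    /* Prints 0xc0000000 (3GB), i.e., the address seen by the DRAM controller. */
    return 0;
}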
The sample memory read transaction illustrates how a logic block in the hostbridge claims a
memory read transaction and processes it accordingly. The Haswell system address map should
be clear to you once you fully understand this memory read transaction sample.

Last but not least, you might be asking how the PCIe expansion ROM is addressed in Haswell.
Well, it's very similar to a PCI-based system: the XROMBAR register in the particular PCIe
expansion card must be enabled and programmed to consume a memory range in the PCI/PCIe
memory range. The rest is just the same as in a PCI-based system; the PCIe bus protocol adds
no particular enhancement in this respect.

PCIe Enhanced Configuration Space in Haswell Platform


In this section we will look at the PCIe enhanced configuration space location in the Haswell system
address map. The first 256 bytes of PCIe configuration space registers are accessible via the CPU IO
ports CF8h-CFFh, just as in the legacy PCI bus; in addition, these registers are also
mapped into the PCIe enhanced configuration space.
Contrary to the legacy PCI configuration space, the entire PCIe configuration space (4KB per function)
is located in the CPU memory space. On the x86/x64 platform, the memory range consumed by
the PCIe configuration space is relocatable in the CPU memory space, and the platform firmware
must initialize the location of this configuration space in the CPU memory space. In this section
we look more closely at the Haswell-specific implementation.
Now, let's calculate the memory space requirement of the PCIe configuration space registers:
1. The maximum number of PCIe buses in the system is 256.
2. The maximum number of PCIe devices per bus is 32.
3. The maximum number of functions per device is 8.
4. Each function can implement up to 4KB of configuration registers.

Using the numbers above, the entire set of PCIe configuration space registers requires 256 x 32 x 8 x
4KB of memory space, which amounts to 256MB. Therefore, the platform
firmware must initialize the system address map to accommodate this PCIe configuration space
requirement. In practice, however, the memory space requirement of the PCIe enhanced
configuration space in a particular system can be less than 256MB because the system cannot
physically support that many PCIe devices.
In most cases, the PCIe enhanced configuration space is carved out of the PCI/PCIe memory
range. In the Haswell memory map, the PCIe configuration space can be mapped either to the
PCI/PCIe memory range below 4GB (from TOLUD to the 4GB limit) or to the PCI/PCIe memory
range above the 4GB limit (above TOUUD), as shown in Figure 16.
On the Haswell platform, the PCI express register range base address (PCIEXBAR) register
in the hostbridge determines the location of the PCIe enhanced configuration space.
The PCIEXBAR contents determine the start address and the size of the PCIe enhanced configuration
space. Figure 16 shows the two possible alternatives for mapping the PCIe enhanced configuration
space. They are marked as Mapping Alternative 1 (within the PCI/PCIe memory range below
4GB) and Mapping Alternative 2 (within the PCI/PCIe memory range above TOUUD).
PCIEXBAR can set the size of the PCIe enhanced configuration space to 64MB, 128MB, or
256MB. The platform firmware should initialize the bits that control the size of the PCIe enhanced
configuration space in PCIEXBAR at boot.

Figure 16. PCIe Enhanced Configuration Space Register Mapping on Haswell Platform

Mapping of the PCIe enhanced configuration space to the Haswell system address map should
be clear at this point. Now, let's proceed to learn how to access a PCIe enhanced configuration
space register. The memory address used to access the PCIe configuration space of a specific
device function on the Haswell platform is calculated as follows:

PCIe_reg_addr_in_CPU_memory_space = PCIEXBAR + Bus_Number * 1MB +
                                    Device_Number * 32KB + Function_Number * 4KB +
                                    Register_Offset

Perhaps you're asking where the 1MB, 32KB, and 4KB multipliers come from. It's simple, actually:
for each bus, we need 32 (devices) * 8 (functions) * 4KB of memory space, which equals 1MB; for
each device, we need 8 (functions) * 4KB of memory space, which equals 32KB.
Now, let's look at a simple example. Let's assume that PCIEXBAR is initialized
to C000_0000h (3GB) and we want to access the PCIe configuration register in bus 0, device 2,
function 1, at offset 40h. What is the address of this particular register? Let's calculate it:

Register_address_in_memory = C000_0000h + 0 * 1MB + 2 * 32KB + 1 * 4KB + 40h
Register_address_in_memory = C000_0000h + 0 + 1_0000h + 1000h + 40h
Register_address_in_memory = C001_1040h

We found that the target PCIe configuration register is located at C001_1040h in the CPU
memory space. With this sample, you should now have no problem dealing with the PCIe enhanced
configuration space.
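The same calculation can be expressed as a small C helper. This is only a sketch: the function name is mine, and the PCIEXBAR value in the usage comment is just the example value used above.

#include <stdint.h>

/* Compute the memory address of a PCIe enhanced configuration space register,
 * following the formula above: bus stride 1MB, device stride 32KB,
 * function stride 4KB. */
static uint64_t pcie_cfg_address(uint64_t pciexbar, uint8_t bus,
                                 uint8_t dev, uint8_t fn, uint16_t offset)
{
    return pciexbar
         + ((uint64_t)bus << 20)            /* bus      * 1MB  */
         + ((uint64_t)(dev & 0x1F) << 15)   /* device   * 32KB */
         + ((uint64_t)(fn  & 0x07) << 12)   /* function * 4KB  */
         + (offset & 0xFFF);
}

/* pcie_cfg_address(0xC0000000ULL, 0, 2, 1, 0x40) evaluates to 0xC0011040,
 * matching the worked example above. */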

System Management Mode (SMM) Memory on the Haswell Platform


In the first article of this series, you learned that there are two memory ranges used to store SMM
code and data: the high segment (HSEG) and TSEG. However, HSEG is deprecated and no longer
supported on the Haswell platform. Therefore, there is only one memory range used to store
SMM code and data in Haswell: the TSEG memory range.
Figure 14 shows the location of the SMM memory in the system address map. The TSEGMB
register in the hostbridge controls the TSEG start address, and the TSEG memory range always
ends at the value of the BGSM register. Contents of the TSEG memory range are accessible on only
two occasions: the first is when the system has just started and the platform firmware has not yet
initialized the TSEG configuration; the second is when the CPU is running in system
management mode. Access to TSEG is controlled by the system management RAM control
(SMRAMC) register in the hostbridge.
The Haswell hostbridge prevents accesses that don't originate in a CPU core from reaching TSEG.
This prevents rogue hardware or firmware code running on an add-on device from messing with the
contents of TSEG. The main reason for doing this is that the security of the system would be
compromised if a device other than the CPU were given access to TSEG. At this point, everything
regarding SMM memory in a typical Haswell-based system should be clear.

Graphics Address Remapping/Relocation Table (GART) on Haswell Platform

In this section we are going to delve into GART. In the first article, I talked about GART in a
legacy system, i.e., AGP GART. This section talks about present-day GART, i.e., GART in a PCIe-based
system. Microsoft outlines requirements for GART implementation in a PCIe-based system,
PCIe GART for short. You can read the requirements at
http://msdn.microsoft.com/en-us/library/windows/hardware/gg463285.aspx. This is the relevant excerpt:
By definition, AGP requires a chipset with a graphics address relocation table (GART), which
provides a linear view of nonlinear system memory to the graphics device. PCIe, however,
requires that the memory linearization hardware exist on the graphics device itself instead of on
the chipset. Consequently, driver support for memory linearization in PCIe must exist in the video
driver, instead of as an AGP-style separate GART miniport driver. Graphics hardware vendors who
want to use nonlocal video memory in their Windows XP driver model (XPDM) drivers must
implement both memory linearization hardware and the corresponding software. All PCIe graphics
adapters that are compatible with the WDDM must support memory linearization in hardware and
software.
It's clear from the excerpt above that the GART logic must be implemented in the PCIe graphics chip
itself, not in the chipset logic. However, in the case of Haswell, there is an integrated PCIe
graphics chip, the IGD, which is part of the northbridge. This is not a problem, though, as long
as the integrated PCIe graphics implements its own GART logic, i.e., the GART logic is part of the
IGD, not part of other logic in the northbridge. This way the system is compliant with the Microsoft
requirement above. Indeed, Haswell implements the GART logic as part of the IGD. We'll look
closer into it in this section.

Figure 17. Haswell GART Implementation

Figure 17 shows the inner workings of the GART logic in the IGD, which is located inside the
northbridge/hostbridge. Figure 17 shows that the GART logic maps three memory blocks in the
graphics aperture, located in the PCI/PCIe memory range, to three different memory blocks in
the main memory (system DRAM).
Before we get into the details of the GART logic, I'd like to point out the meaning of the
abbreviations and component names related to GART shown in Figure 17. These are the details:
1. IGD or internal graphics device is the integrated graphics chip of the Haswell platform.
This chip is integrated into the CPU silicon die. This chip contains the GART logic.
2. GTTADR is the graphics translation table base register. This register contains the start
address of the graphics translation table, i.e., the start of the GART entries in the CPU
memory space. Figure 17 shows the GART entries residing in the GTTADR range (marked
in light green). The GTTADR range is a memory-mapped IO range because the contents
of the range reside in a PCIe device, the IGD; this memory range is not located in
system memory but in the IGD. You can think of it as a buffer (memory) containing GART
entries but residing in the IGD. This is different from GART entries in a legacy AGP
system, where the GART entries reside in system memory.
3. GMADR is the graphics memory aperture base register. This register is part of the GART
logic. Its contents contain the start address of the graphics aperture in the
CPU memory space. Figure 17 shows that GMADR points to the start of Block #1, which is
the first block of the graphics aperture range.
4. TLBs are the translation look-aside buffers used to handle graphics memory transactions.
They are part of the GART logic in the IGD. These TLBs are similar to the TLBs you would
find in the CPU's memory management unit (MMU).
5. PTEs means page table entries. I use this term to highlight the fact that the GART entries
are basically similar to PTEs in the CPU, but these are used specifically for graphics.
6. DSM is the graphics data stolen memory. This is the memory range used by the IGD as
graphics memory. If you look at Figure 14, this range is part of the PCI/PCIe memory
range despite residing in the system DRAM. It resides in system DRAM because it's
below TOLUD. This memory range is only accessible to the IGD at runtime. Therefore, it
behaves just like the local memory of an external PCIe graphics card.
7. GSM is the graphics translation table (GTT) stolen memory. This memory range is
completely different from the range covered by GTTADR. This memory range contains
GTT entries for IGD internal use only. GTT entries in this memory range are not the same
as the GTT entries for GART. You have to pay attention to this difference.
8. TOLUD is the top of low usable DRAM register. This register contains the highest
address below 4GB that's used by the system DRAM.
9. TSEGMB is the TSEG memory base register. This register contains the start address of
TSEG.
10. Graphics aperture is a memory range in the CPU memory address space which is part of
the PCI/PCIe memory space and is used to provide a linear address space for additional
graphics memory. This is very similar to the AGP aperture in legacy AGP-based
systems, as explained in the first article. The only difference is the possible location of the
graphics aperture: in a legacy AGP-based system, the graphics aperture could only be
located below 4GB, whereas the graphics aperture in a PCIe-based system can lie either
below 4GB or above 4GB, as long as it's within the PCI/PCIe memory range.
11. Block #1, Block #2, Block #3. These blocks illustrate the mapping of adjacent memory blocks
from the graphics aperture range to the system DRAM memory range. They illustrate the
presence of a linear memory space in the graphics aperture range, which is translated by
the GART logic into memory blocks allocated by the OS in the system DRAM. If you are
still confused about how this mapping works, read the GART section in the first
article.

Figure 17 simplifies a couple of things; among them is the location of the graphics aperture
memory range. The graphics aperture pointed to by GMADR can start anywhere in the CPU
memory space, either below 4GB or above 4GB. However, Figure 17 shows the graphics aperture
memory range residing below 4GB. You have to be aware of this.
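To summarize how the GART logic resolves a graphics aperture address, here is a purely conceptual C sketch. The 4KB page size, the entry format, and the names are simplifications of mine; they do not reflect the actual Haswell GTT entry layout.

#include <stdint.h>

/* Conceptual GART translation: an address inside the graphics aperture is
 * split into a page index and a page offset; the page index selects a GART
 * (GTT) entry, which holds the physical address of the backing page in
 * system DRAM. This mirrors the Block #1/#2/#3 mapping in Figure 17. */
#define GFX_PAGE_SHIFT 12                       /* assume 4KB pages */
#define GFX_PAGE_SIZE  (1ULL << GFX_PAGE_SHIFT)

static uint64_t gart_translate(uint64_t aperture_addr, uint64_t gmadr,
                               const uint64_t *gtt_entries)
{
    uint64_t offset_in_aperture = aperture_addr - gmadr;   /* GMADR = aperture base */
    uint64_t page_index  = offset_in_aperture >> GFX_PAGE_SHIFT;
    uint64_t page_offset = offset_in_aperture &  (GFX_PAGE_SIZE - 1);
    uint64_t dram_page   = gtt_entries[page_index];         /* simplified: entry holds a page-aligned DRAM address */
    return dram_page + page_offset;
}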

Let's summarize the differences between legacy AGP GART and modern-day PCIe GART. The
first is that AGP GART logic was implemented as part of the hostbridge, while modern-day
GART logic is implemented as part of the PCIe graphics chip; if the PCIe graphics chip is
located in the hostbridge (as in the Haswell case), the GART logic happens to be part of the hostbridge.
The operating system also treats AGP GART and PCIe GART differently: AGP GART has its own
miniport driver, while the PCIe GART driver is part of the PCIe graphics device driver. The second
major difference is the location of the graphics aperture: in a legacy AGP system, the graphics
aperture always resides below 4GB, while the modern-day PCIe graphics aperture can lie either
below 4GB or above 4GB.
At this point you should have a clear understanding of GART on the Haswell platform. Even though
this section talks about GART in the IGD PCIe graphics chip, you should be able to understand GART
as implemented by an add-on PCIe graphics card easily, because the principle is just the same. The
difference is only in the location of the graphics memory/buffer, which is basically very similar
from the system address map standpoint.

Haswell System Address Map Initialization


In this section we'll have a look at the Haswell system address map initialization. We're not going
to dive into the minute details of the initialization, but just deep enough to understand the whole
process. Several steps in the Haswell boot process are part of the system address
map initialization. They are as follows:
1. Manageability engine (ME) initialization. ME initialization happens prior to platform
firmware code execution. The ME initializes the Intel management engine UMA
register in the 8-series PCH to signal to the platform firmware how much space it
requires in the system DRAM for use as the ME UMA memory region.
2. Chipset initialization. In this step, the chipset registers are initialized, including the chipset
base address registers (BARs). We are particularly interested in chipset BAR initialization
because it affects the system address map. There are two chipsets in the
Haswell platform: the northbridge, which is part of the CPU (sometimes called the uncore
part), and the southbridge, the 8-series PCH. Many registers involved in the system
address map are part of the chipset, as you can see from the previous sections;
TOLUD, TSEGMB, and TOUUD are just a few of them. However, most of these registers
cannot be initialized before the size of the system DRAM is known. Therefore, most of
them are initialized as part of, or after, main memory initialization.
3. Main memory (RAM) initialization. In this step, the memory controller initialization
happens. The memory controller initialization and RAM initialization happen together as
complementary code, because the platform firmware code must figure out the correct
parameters supported by both the memory controller and the RAM modules installed on
the system and then initialize both components into the correct setup. The
memory sizing process is carried out in this step. Memory sizing determines the size
of the system DRAM, mostly by reading the contents of the serial presence detect (SPD)
chip on the DRAM module. However, most platform firmware also executes random
read/write operations to the DRAM to determine whether the size claimed in the SPD is
indeed usable; the algorithm of the random read/write depends on the particular platform
firmware. Initialization of the BARs in the chipset, such as MESEG_BASE, TOLUD,
TOUUD, TSEGMB, etc., is also carried out in this step, after the actual size of the system
DRAM is known.
4. PCI/PCIe device discovery and initialization. In this step, PCI devices, and by extension the
PCIe devices and other devices connected to a PCI-compatible bus, are detected and
initialized. The devices detected in this step could be part of the chipset and/or other PCI
devices in the system, either soldered to the motherboard or plugged into the PCI/PCIe
expansion slots. Several resource assignments to the devices happen in this step: IO
space assignment, memory-mapped IO (MMIO) space assignment, IRQ assignment (for
devices that require an IRQ), and expansion ROM detection and execution. The
assignment of memory or IO address space happens via the BARs in the PCI/PCIe
devices. Initialization of USB devices happens in this step as well because USB is a PCI
bus-compatible protocol. Other non-legacy devices, such as SATA, SPI, etc., are also
initialized in this step. The PCIe GART logic registers are initialized in this step too,
because all of them point to memory ranges in the PCI/PCIe memory range. This step
actually consists of two sub-steps:
1. Initialization of the critical PCI configuration space registers in all of the PCI and
PCIe devices via the legacy PCI configuration mechanism that uses CPU IO ports.
This sub-step is required because only the CPU IO ports used for this mechanism
are hardcoded and can therefore be used right away (a minimal sketch of the
mechanism follows this list). This sub-step also initializes the PCIEXBAR register
in the hostbridge; the PCIe enhanced configuration space registers cannot be
accessed before this register is initialized via the legacy PCI configuration
mechanism.
2. Initialization of the PCIe enhanced configuration space registers. Once
PCIEXBAR is initialized and the PCIe devices are identified, the platform firmware
can initialize all of the PCIe configuration registers, including the PCIe enhanced
configuration space registers.
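For reference, here is a minimal sketch of the legacy PCI configuration mechanism mentioned in sub-step 1: the configuration address is written to IO port CF8h and the data is read from IO port CFCh. The inl()/outl() port-IO helpers are assumed to be supplied by the environment (they are not standard C), and error handling is omitted.

#include <stdint.h>

#define PCI_CONFIG_ADDRESS 0xCF8
#define PCI_CONFIG_DATA    0xCFC

/* Assumed port-IO primitives; on bare metal these would wrap the IN/OUT
 * instructions, e.g., via inline assembly or a firmware library. */
extern void     outl(uint16_t port, uint32_t value);
extern uint32_t inl(uint16_t port);

/* Read a 32-bit register from the legacy 256-byte PCI configuration space. */
static uint32_t pci_cfg_read32(uint8_t bus, uint8_t dev, uint8_t fn, uint8_t offset)
{
    uint32_t address = (1U << 31)                  /* enable bit           */
                     | ((uint32_t)bus << 16)
                     | ((uint32_t)(dev & 0x1F) << 11)
                     | ((uint32_t)(fn  & 0x07) << 8)
                     | (offset & 0xFC);            /* dword-aligned offset */
    outl(PCI_CONFIG_ADDRESS, address);
    return inl(PCI_CONFIG_DATA);
}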
Once all of the registers in the hostbridge, the 8-series PCH, and all PCI and PCIe devices are
initialized, the system address map is formed. The code in the Haswell platform firmware that
carries out this initialization must be complicated because, as you have seen in the Haswell
System Address Map section, the system address map itself is complicated. However, at this point
you should have a clear understanding of a modern-day PCIe-based system from the system
address map point of view, including the initialization of the system address map carried out by the
platform firmware.

Deeper Look into UEFI GetMemoryMap() Interface

In the first part of this series, you learned about the BIOS E820h interface. In this article I will
only reiterate the UEFI equivalent of that function, the UEFI GetMemoryMap() function. This
function is available as part of the UEFI boot services. Therefore, you need to traverse the
UEFI boot services table to call the function. The simplified algorithm to call this function is as
follows:
1. Locate the EFI system table.
2. Traverse to the EFI boot services table (EFI_BOOT_SERVICES) in the EFI system table.
3. Traverse the EFI boot services table to locate the GetMemoryMap() function.
4. Call the GetMemoryMap() function.

The GetMemoryMap() function returns a data structure similar to the one returned by the legacy
E820h interface. The data structure is called EFI_MEMORY_DESCRIPTOR and is defined as follows:
//*******************************************************
//EFI_MEMORY_DESCRIPTOR
//*******************************************************
typedef struct {
UINT32 Type;
EFI_PHYSICAL_ADDRESS PhysicalStart;
EFI_VIRTUAL_ADDRESS VirtualStart;
UINT64 NumberOfPages;
UINT64 Attribute;
} EFI_MEMORY_DESCRIPTOR;

The GetMemoryMap() function returns a copy of the current memory map. The map is an array
of memory descriptors, each of which describes a contiguous block of memory. The map
describes all of memory, no matter how it is being used. The memory map is only used to
describe memory that is present in the system. Memory descriptors are never used to describe
holes in the system memory map.
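Below is a hedged sketch of calling GetMemoryMap() from a UEFI application, written against EDK II-style definitions (gBS is the boot services table pointer provided to the application by its entry point library). It uses the usual two-call pattern: the first call reports the required buffer size, the second call fills the buffer.

#include <Uefi.h>
#include <Library/UefiBootServicesTableLib.h>  /* provides gBS in EDK II */

/* Sketch: retrieve the current memory map via the boot services table. */
EFI_STATUS DumpMemoryMapSketch(VOID)
{
  UINTN                  MemoryMapSize = 0;
  EFI_MEMORY_DESCRIPTOR  *MemoryMap    = NULL;
  UINTN                  MapKey;
  UINTN                  DescriptorSize;
  UINT32                 DescriptorVersion;
  EFI_STATUS             Status;

  /* First call: with a zero-sized buffer, GetMemoryMap() reports the size needed. */
  Status = gBS->GetMemoryMap(&MemoryMapSize, MemoryMap, &MapKey,
                             &DescriptorSize, &DescriptorVersion);
  if (Status != EFI_BUFFER_TOO_SMALL) {
    return Status;
  }

  /* Pad a little in case the allocation itself changes the map, then retry. */
  MemoryMapSize += 2 * DescriptorSize;
  Status = gBS->AllocatePool(EfiBootServicesData, MemoryMapSize, (VOID **)&MemoryMap);
  if (EFI_ERROR(Status)) {
    return Status;
  }

  Status = gBS->GetMemoryMap(&MemoryMapSize, MemoryMap, &MapKey,
                             &DescriptorSize, &DescriptorVersion);
  /* On success, MemoryMap now holds an array of EFI_MEMORY_DESCRIPTOR entries;
     walk it using DescriptorSize as the stride, not sizeof(EFI_MEMORY_DESCRIPTOR). */

  gBS->FreePool(MemoryMap);
  return Status;
}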
Well, this article doesn't try to delve deeper into the UEFI GetMemoryMap() interface. You can read
the details of the interface and of EFI_MEMORY_DESCRIPTOR in the UEFI specification. Should you
be interested in digging deeper, the GetMemoryMap() function is documented in the Boot Services
chapter of the UEFI specification, under the Memory Allocation Services section.

Closing Thoughts
This article delves quite deeply into the Haswell system address map and its initialization. It
should give strong background knowledge to those looking to understand present-day systems,
which could be even more complex than the one explained here. If there is anything really
intriguing about the Haswell platform, it's the manageability engine (ME). This part of the
system deserves its own scrutiny and further research. I'm aware of at least one proof-of-concept
work in this particular field, but it was not on Haswell.
