Real World Digital Design

Topics include error detection and correction, processor design, cache design, virtual memory, AXI bus, and PCI bus


Contents

Introduction to Error Detection and Correction #1: Parity to Hamming
    Introduction
    Basic Error Detection
    Parity
    Checksum
    Cyclic Redundancy Checks
    Error Correction
    Hamming Codes
    Conclusions

Introduction to Error Detection and Correction #2: Reed-Solomon
    Introduction
    Reed-Solomon Codes
    Implementation
    Real-World Example
    ECC in Digital Data Storage
    Conclusions

Demystifying Memory Sub-systems Part 1: Caches
    Introduction
    A History Lesson
    Cache Requirements
    Cache Terminology
    Associativity
    Fully Associative Caches
    Set Associative Caches
    Multi-way Set Associative Caches
    Cache Control and Status Bits
    Write Policies
    Cache Coherency
    Cache Regions
    Conclusions

Demystifying Memory Sub-systems Part 2: Virtual Memory
    Introduction
    Pages
    Memory Management Unit
    TLB
    Page Tables
    Real World Example, RISC-V
    Virtual Memory and Caches
    Conclusions

SoC Bus and Interconnect Protocols #1: Busses (APB and AHB)
    Introduction
    AMBA Overview
    APB
    Signals
    Transfers
    AHB
    Transfers
    Conclusions

SoC Bus and Interconnect Protocols #2: Interconnect (AXI)
    Introduction
    Write Channels
    Read Channels
    Implementing AXI Interfaces
    Other Protocols
    Wishbone
    Conclusions

PCI Express Primer #1: Overview and Physical Layer
    Introduction
    PCIe Overview
    Physical Layer
    Link Initialisation and Training
    SERDES Interface and PIPE
    Conclusions

PCI Express Primer #2: Data Link Layer
    Introduction
    Data Link Layer
    Virtual Channels
    Conclusions

PCI Express Primer #3: Transaction Layer
    Introduction
    Transaction Layer Packets
    Memory Accesses
    Conclusions

PCI Express Primer #4: Configuration Space
    Introduction
    Configuration Space
    Real World Example
    Note on the PCIe Model Configuration Space
    PCIe Evolution
    Conclusions
    Access to Specifications

C++ Modelling of SoC Systems Part 1: Processor Elements
    Introduction
    Modelling the Processing Element
    Generic Processor Models
    Instruction Set Simulators
    Timing Models
    Multiple Processing Elements
    Conclusions

C++ Modelling of SoC Systems Part 2: Infrastructure
    Introduction
    FPU
    Bus and Interconnect
    Memory Model
    Interfaces
    Timers
    Algorithmic Modelling
    Sources of Interrupts
    Conclusions

Finite Impulse Response Filters Part 1: Convolution and HDL
    Introduction
    Convolution
    HDL Implementation
    Conclusions

Finite Impulse Response Filters Part 2: Sinc Functions and Windows
    Introduction
    The sinc Function and Impulse Response
    Windowing and Window Functions
    Window Mathematics
    Quantisation
    Designing the Filter
    Conclusions

Processor Design #1: Overview
    Introduction
    What is a Processor?
    Basic CPU Operation
    Processor Core Port Architecture
    Internal Registers
    Exceptions and Interrupts
    Conclusions

Processor Design #2: Introduction to RISC-V
    Introduction
    RISC-V Register set
    RV32I Instructions
    Control and Status Registers
    RISC-V Exception Handling
    RISC-V Extensions
    Conclusions

Processor Design #3: Processor Logic
    Introduction
    Design Approaches
    Example Implementation
    Conclusions

Processor Design #4: Assembly Language
    Introduction
    Assembly Language
    Other Directives
    Macros
    Pseudo-instructions
    Compiling Code
    Getting the toolchain
    Compiling
    Disassembling
    Running Code
    Conclusions

Introduction to Error Detection and Correction #1: Parity to Hamming


Simon Southwell | Retired logic, software and systems designer | Published Aug 24, 2022

Introduction
I first came across error correction codes (ECC), back in the 1990s, whilst working on integrated circuits for
Digital Data Storage (DDS), based on digital audio tape (DAT) technology, at Hewlett Packard. I wasn’t working
on ECC implementation (I was doing, amongst other things, data compression logic) and it seemed like magic
to me. However, I had to learn how this worked to some degree as I’d pitched an idea for a test board, using
early FPGAs, to emulate the mechanism and channel for the tape drive for testing the controller silicon and
embedded software. That is, it could look, from the controller's point of view, like a small section of tape that
could be written to, read, and moved forward and back. To make this useful, the idea was to create a fully
encoded set of data, with any arbitrary error types added, which was loaded into a large buffer for the tape
data emulation. Therefore, I would have to be able to compile this data, including the encoding of ECC.
Fortunately, I had the engineer on hand who was doing the ECC implementation, and he’d usefully written
some C models for encoding and decoding which I could adapt for my purposes. This all worked well and was
used in earnest for testing hardware and software in all sorts of error scenarios. Indeed, when a colleague
from our partner development company was returning home after a secondment analysing tape errors, we
arranged for his last analysis to be a 'fake' pattern, generated by the tool, that spelled out 'best wishes' in the
error data.

Since then, access to information on ECC has become more readily available, with the expansion of the
internet (in its infancy when I was working on DDS), but I have found most of the information on this subject to
be quite mathematical (necessarily so when proving the abilities and limits of algorithms), with very little
discussion of how this is turned into actual logic gates. In this article I want to assume that the proof of
functionality is well established and concentrate on how to get from these mathematical concepts to actual
gates that encode and decode data. I want to take this one step at a time so that I can map the mathematical
concepts to real-world logic and hopefully dispel the idea that it’s too complicated. I will start at the beginning
with concepts that many of you will already be familiar with, and you may skip these sections if you wish, but it
is in these sections that I will introduce the logic that fits the terminology used in the following sections, as we
build to more complex ideas.

Once we get to Hamming and Reed-Solomon coding there will be logic implementations of the examples that
can be accessed on github, along with accompanying test benches, so that these implementations can be
explored as executing logic. The logic is synthesisable, but has no state, which it would need for practical use,
so it is ‘behavioural’ in that sense, but is just logic gates, with no high-level Verilog used. Also, there will be an
Excel Spreadsheet for exploring Hamming codes where any data byte value can be input, and errors added to
the channel to see what the Hamming code does to correct or detect these.

ECC has been around for some time (even before the 90s when I was introduced to it), so you may wonder
how relevant it is to today’s design problems. Well, Reed-Solomon codes are used, for example, in QR codes
(see diagram below), now ubiquitous in our lives, and the newly launched PCI Express 6.0 specification adds
forward error correction to its encoding (see my PCIe article).

So, let’s get to it, and the beginning seems like a good place to start.

Basic Error Detection


In this section I want to go through two of the most basic error detection methods that you may already be
familiar with—parity and checksums. I want to do this firstly for completeness (some people may be just
starting out and not be as familiar with these concepts), but also to introduce some terms and concepts used
later.

In all the algorithms we will discuss, though, there is a trade-off between the amount of data we have to add
to the basic information (i.e., the amount of redundancy) versus the effectiveness of the encoding for error
detection or correction. The engineering choices made rely on an understanding of the channel that the data
is transported through (whether wire, wireless or even magnetic or optical media) and the noise and
distortion likely to be encountered. None of these methods is 100% effective, and all can be defeated so that
errors get through. The idea is to bring the observed error rate below an acceptable probability. Often more than one mechanism is
employed. This article will not be discussing the analysis of channels and the design choices to make in
choosing the ECC scheme, but we will look at the effectiveness of each method.

Parity
This is perhaps the simplest encoding for detecting errors but, as we shall see, it is the foundation of many of
the more complex algorithms we will discuss.

Parity, fundamentally, is “adding up ones” to see if there are an odd number of them or an even number of
them. For example, taking an 8-bit byte, if there are an even number of ones (say, 4), then it has even parity,
else it is has odd parity. We can encode the data into a codeword to ensure the code has either even or odd
parity by adding an extra bit and setting that bit as one or zero to make the codeword even or odd parity. We
still need to ‘add up’ the ones from the data to know how to set this bit. This is done using modulo 2
arithmetic. That is, we add 1s and 0s with no carry, so 0 + 0 = 0, 0 + 1 = 1, 1 + 0 = 1, but 1 + 1 = 0 as it is
modulo 2—the equivalent in C is (x + y)%2. Note that in modulo 2 arithmetic, adding and subtracting are the
same thing, so 1 – 1 is 0, just like 1 + 1. This is useful to know, as I've read articles that mention subtraction and
then give examples where addition is being used.

You may have noticed that this is the same output as an exclusive OR gate (XOR). Therefore, we can use XOR
or XNOR gates to add up the number of ones and either set or clear the added bit for even or odd parity as
desired. The Verilog code to calculate the added bit is shown below for odd and even codes:
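
A minimal sketch of such code, with module and signal names assumed, might be:

module parity_gen (
  input  [7:0] data,
  output       even_parity,
  output       odd_parity
);
  // Even parity bit: XOR reduction of all the data bits
  assign even_parity = ^data;

  // Odd parity bit: the inverse of this (an XNOR reduction)
  assign odd_parity  = ~^data;
endmodule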

Here we see the even parity code as the byte bits all XORed together, whilst the odd parity code uses the
inverse of this. The byte and parity bit code can then be transmitted, over a UART say, and the receiver
performs a modulo 2 sum over the whole codeword. If the sum is 0, then no error occurred. If the sum
is 1, then an error is detected. This is true whether even or odd parity, so long as the same XOR or XNOR is used
by both transmitter and receiver.

This method is very limited in protection. It can’t correct any errors and can only detect a single bit in error. A
two-bit error will make the code ‘valid’ once more, and the bad data will be considered good. Therefore, it is
suitable for channels where bit errors are rare and, certainly, where the probability of a two-bit error within the
span of a codeword (9 bits in the example) is below the required rate for the system.

In the example a byte was used, so adding a parity bit has a redundancy of 12.5%. This can be reduced by
encoding over a larger data size, so 16 bits has a redundancy of 6.25%, but the probability of two-bit errors within
the codeword now increases. This is why I said an understanding of the channel is important in making design
choices. Increasing to 16-bits might be a valid choice for increasing efficiency if the raw bit error rate is low
enough to meet design criteria.

Checksum
A checksum is an improvement to basic parity but is still only a detection algorithm. In this case the data is
formed into words (bytes, 16-bit words, 32-bit double words etc.), and then a checksum is done over a block
of these words, by adding them all up using normal arithmetic, to form a new word which is appended to the
data. The checksum word will be of a fixed number of bits, and so is doing modulo 2^n summation, where n is
the number of bits.

The checksum word might be bigger than the word size of data so, for example, 8-bit byte data might be being
used, with a 16-bit checksum word. In all cases all the data words are added up as unsigned numbers, modulo
2 to the power of the checksum word's width. I.e., (D0 + D1 + D2 + … + Dn-1) % 2^chksum_width. This sum is then two's
complemented and appended to the data. On reception, the same summation is made, including the
appended word and if zero, the checksum detected no errors, else an error occurred. Alternatively, the
checksum is appended unmodified, and then subtracted from the data summation at the receiver to give 0 for
no error—it depends where it is most convenient to do the two’s complement. The diagram below shows
some code for generating a checksum, with the data buffer assumed loaded with data.
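
A minimal Verilog sketch of the idea, with names, buffer length and word widths assumed, might be:

// A sketch of checksum generation over a buffer assumed already loaded with
// data: 8-bit data words summed into a 16-bit checksum. The appended word is
// the two's complement of the modulo 2^16 sum, so that summing the whole
// block at the receiver gives zero when error free.
localparam BUF_LEN = 64;

reg  [7:0]  buffer [0:BUF_LEN-1];
reg  [15:0] sum;
reg  [15:0] checksum;

integer i;

always @*
begin
  sum = 16'h0000;

  for (i = 0; i < BUF_LEN; i = i + 1)
    sum = sum + {8'h00, buffer[i]};      // modulo 2^16 addition

  checksum = ~sum + 16'h0001;            // two's complement of the final sum
end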

Like for parity, the checksum adds redundancy, and the trade-off is between the size of the checksum word, the
length of the data block and the probability of errors. In general, checksums perform more efficiently than
parity. It comes down to the probability of corrupted data generating the same checksum value as good data.
If the checksum word is narrow, say a byte, then the chance of a corrupted message still producing a 'valid' checksum
is 1 in 256. This protection improves as the word gets wider but degrades again as the data block size is increased.

Another failing of checksums is that there is no position information in the encoding. In the example above, all the
data in the buffer could be re-arranged in all possible combinations and all give a valid checksum, though only
one is correct. This can be somewhat alleviated using cyclic redundancy checks, which we’ll deal with in the
next section.

Before leaving checksums, though, I want to briefly mention MD5, which is sometimes referred to as a
checksum. Actually, it is a kind of hash function using cryptographic techniques. Its original intention was for it
to be infeasible to have two distinct data patterns (of the same size—512-bit blocks for MD5) that produce the
same MD5 value (128 bits), which would give some indication that the data received, matching a given MD5
number, was the message intended to be sent. However, this has long since been shown not to be true for
MD5, so it is not really used for this purpose anymore. It is still seen in many places though, as it does give
high confidence in detecting data being corrupted by channel noise (as opposed to malicious interference)
and, in this sense, is used as a checksum. As it's not really a checksum, I won't discuss this further here.

Cyclic Redundancy Checks


For cyclic redundancy checks we need to introduce the concept of a linear feedback shift register, or twisted
ring counter. This is like a normal shift register, but the output is fed back to various stages and combined
with the shifting data using XOR gates. The useful property of this arrangement is that, if the feedback points
are chosen carefully, and the shift register is initialised to a non-zero value and then clocked, it will go through
all possible values except 0 for the length of the shift register (i.e., 2^n − 1 values, with n the number of bits
in the shift register), without repeating, until it reaches the start value again.
Depending on the feedback points, the values at each clock cycle will follow a pseudo random pattern and,
indeed, these types of shift register are used for pseudo-random number generation, with the start value
being the ‘seed’.

The set of numbers and their pattern are known as a Galois field (after Évariste Galois, an interesting
character worth looking up), or sometimes a finite field. These can (and often are) described using a
polynomial. For example, if we have a 16-bit shift register and XOR the output into the shift register's bit inputs
at 0, 5 and 12, with the actual output at 16, then this can be described with the polynomial:

x^16 + x^12 + x^5 + x^0

(note that x^0 is 1, and you will see this written as 1 quite often). By default, this will cycle through its Galois
field, but if we add a data input, then this modifies the shift values dependent on that data. The diagram
below shows this arrangement of the polynomial with a data input:
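
A sketch of this arrangement in Verilog, with names assumed, might be the following, using the 16-bit polynomial above with its feedback taps at bits 0, 5 and 12:

module crc16_serial (
  input         clk,
  input         rst_n,
  input         data_in,   // serial data, one bit per clock
  output [15:0] crc
);
  reg  [15:0] sr;

  // The feedback is the shift register output XORed with the incoming data bit
  wire        fb = sr[15] ^ data_in;

  always @(posedge clk or negedge rst_n)
    if (!rst_n)
      sr <= 16'hffff;                                  // non-zero starting 'seed'
    else
      sr <= {sr[14:0], 1'b0} ^ ({16{fb}} & 16'h1021);  // taps at bits 0, 5 and 12

  assign crc = sr;
endmodule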

If we have a block of data and shift each data bit into the shift register at each clock, then the shift register will
have a value which, mathematically, is the remainder when dividing the data by the polynomial (using modulo
2 arithmetic). By appending this remainder (the CRC code) to the data at transmission, then at reception,
pushing the data, including the CRC codeword, through the same LFSR, with the same starting ‘seed’, should
return 0—i.e., there is no remainder. If there is, then the data has been corrupted.

The polynomials used are often standardised. The example polynomial is the CRC-CCITT 16-bit standard, but

x^16 + x^15 + x^2 + x^0

is the CRC-16 standard. Larger width CRCs are used for better protection, such as the 32-bit:

x^32 + x^26 + x^23 + x^22 + x^16 + x^12 + x^11 + x^10 + x^8 + x^7 + x^5 + x^4 + x^2 + x^1 + x^0

This polynomial is used in, amongst other things, Ethernet, PKZIP, and PNG. The advantage of the CRC over a
checksum, even though it might be the same width, is that the CRC codeword is a function of the data and the
order in which the data was input.

The arrangement shown in the above diagram works well if shifting data in serial form, but often data is being
processed as words (bytes, 16-bit words, double words etc.). Serialising this to calculate the CRC codeword is
not practical without sacrificing bandwidth. What is required is a way to calculate the CRC in a single cycle as if
all n bits of the data had been shifted through the LFSR. For a given set of data bits, from a given CRC starting
point, the remainder will be fixed. Therefore, we can generate a table of all possible remainders for dividing
the data by the polynomial. For 8-bit data, this is 256 remainders. These can be pre-calculated and placed in
a lookup table (LUT). The C code fragment below generates the LUT for 8-bit data for the 32-bit polynomial
shown above:

Note that the set bits in the POLYNOMIAL definition correspond to the polynomial terms, except for an
implied x^32. To calculate the CRC codeword using the LUT, a 32-bit CRC state is kept and initialised (usually to
0xffffffff, the inverse of a remainder of 0). The block of data bytes is run through the CRC, with each byte
XORed with the bottom 8 bits of the CRC to index into the LUT. This LUT value is XORed with the CRC state shifted
down by a byte, which becomes the new CRC value. When all the bytes have been processed like this, the CRC
codeword is appended as the inverse of the CRC state.
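
A minimal Verilog sketch of the same LUT scheme, with names assumed, is shown below. It uses the bit-reversed form of the polynomial (0xedb88320, one of the equivalent forms discussed shortly), which suits the right-shifting, byte-wise update just described:

module crc32_byte (
  input      [31:0] crc_in,   // current CRC state (initialise to 32'hffffffff)
  input       [7:0] data,     // next data byte
  output reg [31:0] crc_out   // updated CRC state (invert after the last byte)
);
  // Bit-reversed form of the 32-bit polynomial 0x04c11db7
  localparam [31:0] POLYNOMIAL = 32'hedb88320;

  reg [31:0] lut [0:255];
  reg [31:0] c;
  integer i, j;

  // Pre-calculate the 256 possible remainders, one for each byte value
  initial
    for (i = 0; i < 256; i = i + 1)
    begin
      c = i;
      for (j = 0; j < 8; j = j + 1)
        c = (c >> 1) ^ (c[0] ? POLYNOMIAL : 32'h0);
      lut[i] = c;
    end

  // Byte-wise update: XOR the byte into the bottom 8 bits to index the LUT,
  // then XOR the LUT value with the CRC state shifted down by a byte
  always @*
    crc_out = (crc_in >> 8) ^ lut[crc_in[7:0] ^ data];
endmodule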

Note that data wider than a byte can be processed in this way, but the LUT length, for the arrangement above,
is a function of 2^DATA_WIDTH, so 16-bit data needs a LUT of 65536 entries and 32-bit data a LUT of
4294967296 entries. Alternatively, the calculation can be pipelined, with intermediate CRC values fed forward
and multiple ports into the smaller LUT, or multiple copies of the LUT for each stage, at the expense of more
gates and other resources. These, though, are optimisations for speed, and do not change the result of the
basic CRC shift register.
It should be noted that a given CRC polynomial, with the protection properties that it brings, has the same
protection if inverted or reversed. For example, the 32-bit polynomial is represented as a value 0x04c11db7 in
the LUT generation code. If this 32-bit value is inverted (0xfb3ee248) or bit reversed (0xedb88320), or both
(0x12477cdf), then the characteristics remain the same, even if the generated values are different and often
this 32-bit polynomial is seen in these alternative forms.

Error Correction
All of the above methods are for error detection only. In this section, and the next article, I want to start
talking about error correction and look at a couple of methods that build on what we have looked at so far.
These methods are known as forward error correction codes. That is, information is passed forward to allow
the receiver to recover the data. This is in contrast with, say, a retry mechanism, where errors are detected,
and a message passed back to the transmitter (a NAK, for example) to resend the data. We will look first at
Hamming codes and then, in the next article, Reed-Solomon codes.

Hamming Codes
Hamming codes build upon the concept of parity and, by careful placement of multiple parity bits, can be used to
locate the position of an error, and thus correct it. Since we are using ones and zeros, knowing the position of
an error means we can correct it by simply inverting the bit. The basic principle is that data is encoded into
codewords, which are a sub-set of a larger set of values, such that valid codewords do not become other valid
codewords in the presence of n bits in error. The n value is known as the Hamming distance.

For example, a 4-bit code with valid codes being 0000b, 0011b, 0101b, 0110b, 1001b, 1111b, would have a
Hamming distance of 2, as it would take 2 bits in error to move from one valid code to the next. All ten
other codes are invalid, and thus this example is only 37.5% efficient. With more distance, such codes can be
used to correct errors by moving a received invalid code back to the nearest valid code.

Let’s look at a more practical example of Hamming code usage. To keep the example manageable, we will
encode 8-bit bytes into code words for the ability to correct a single bit in error. The addition of another parity
bit will allow for the detection of a second bit in error. The Hamming code will need four bits, and with a fifth
parity bit, the codeword will be 13 bits in total, giving a 62.5% redundancy. This is not so good, but after going
through the example we will see how this might be scaled for encoding larger words with better efficiency.
Below is a view of a spreadsheet that encodes an 8-bit byte into a 12-bit hamming code and then adds a
detection parity bit, and then allows the code to be decoded, through a channel that can add errors. This
spreadsheet is available on github, along with the Hamming and Reed-Solomon Verilog source code, so this
can be experimented with.
From the top, the data is defined as a set of 8 bits that can be set to any byte value. To construct the Hamming
codeword we need to calculate 4 different parity values (even parity in this case) that calculate parity over
different bits of the data. The codeword bits are numbered 1 to 12, with a P detection parity bit (more on this
later). The parity bits P1, P2, P3, and P4, are positioned at codeword locations 1, 2, 4 and 8. These are the
power of 2 locations. The rest of the 12 bits are filled with the data byte bits. It actually doesn’t matter where
the data bits are placed in the remaining spaces, so long as they are extracted in the same manner on decode,
but placing them in order seems like a good idea.

So, the line marked D has the values of the data byte, and the parity bits are calculated on the next four lines.
P1 is calculated for even parity across D0, D1, D3, D4, and D6, as marked by the cells with an x. The other three
parity values are calculated in a similar manner but choosing different data bits for inclusion, as shown. We
will discuss the choice of data bit inclusion shortly.

The code line, then, contains the final codeword. Over this whole codeword the detection parity bit is
calculated for even parity. The spreadsheet then has an error ‘channel’ where errors can be added (none in
the example above), before getting the received codeword (marked rx code). To decode this, the parity is
calculated for the bit positions as during encoding, giving four results marked e0 to e3 on the spreadsheet.
When all zeros, as in the above example, there is no error, and the data bits can be extracted from the
codeword as good.

Notice the pattern of x positions on the e0 to e3 decoding (I left this until now, as it is easier to see, I
think). Looking down the cells, the x positions (with an x representing 1 and a space 0) encode the values 1, 2, 3, and so
on, in binary, for the 12 positions of the codeword. This is how the location of an error will be detected. Let's
introduce a single bit error.
In the above example D1 is corrupted at position 5, and on decode e0 and e2 are set, giving a value of 5. This is
the index of the corrupted bit, so now the bit can be inverted and the corrected code at the bottom is the
same as that transmitted. This works for any position, including for the parity bits. So, by careful encoding and
positioning of the parity bits, the codeword can move an invalid codeword back to a valid codeword in the
presence of a single bit error. What if we have two errors?

In the above example an additional error is introduced at position 7. The error bits now index position 2, which
has no error, but it is inverted anyway as there is no indication that this isn’t correct at this point. The
corrected codeword is now wrong, but the parity is now odd, which is incorrect, and so the double error is
detected. It can’t be corrected but at least its presence is known. There is another failure mode where two
errors decode to give an error index that is greater than 12—i.e., it indexes off the end of the codeword. This
is also detecting a double bit error. I won't show a spreadsheet example here, but get hold of the spreadsheet
and try it for yourself.
So, what does the Verilog look like for this? An encoder Verilog module is shown below.
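
A minimal combinational sketch along these lines, with port names assumed (the author's full module and test bench are on github), might be:

module hamming_encoder (
  input  [7:0]  data,
  output [12:0] code     // {detection parity, 12-bit Hamming code}
);
  // Even parity bits calculated over the data bits at positions with the
  // corresponding index bit set (P1 at position 1, P2 at 2, P3 at 4, P4 at 8)
  wire p1 = data[0] ^ data[1] ^ data[3] ^ data[4] ^ data[6];
  wire p2 = data[0] ^ data[2] ^ data[3] ^ data[5] ^ data[6];
  wire p3 = data[1] ^ data[2] ^ data[3] ^ data[7];
  wire p4 = data[4] ^ data[5] ^ data[6] ^ data[7];

  // Codeword positions 12 down to 1: D7 D6 D5 D4 P4 D3 D2 D1 P3 D0 P2 P1
  wire [12:1] hamming = {data[7:4], p4, data[3:1], p3, data[0], p2, p1};

  // Detection parity calculated over the entire Hamming code
  wire p = ^hamming;

  // Output is the parity bit and Hamming code concatenated
  assign code = {p, hamming};
endmodule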

This code is purely combinatorial (one could register the output if required) and simply calculates the four
parity bits from the relevant input data positions. The 12-bit hamming code is then constructed, placing the
parity and data in the correct locations, and then the detection parity bit is calculated over the entire
hamming code. The output is then the parity bit and hamming code concatenated. The diagram below shows
the decoder.
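
Again, a sketch with assumed names, implementing the steps described next, might be:

module hamming_decoder (
  input  [12:0] code,    // {detection parity, 12-bit Hamming code}
  output [7:0]  data,
  output        error    // double-bit (uncorrectable) error detected
);
  wire [12:1] rx = code[11:0];

  // Recalculate the parities to form the error index e3..e0
  wire e0 = rx[1] ^ rx[3] ^ rx[5] ^ rx[7] ^ rx[9]  ^ rx[11];
  wire e1 = rx[2] ^ rx[3] ^ rx[6] ^ rx[7] ^ rx[10] ^ rx[11];
  wire e2 = rx[4] ^ rx[5] ^ rx[6] ^ rx[7] ^ rx[12];
  wire e3 = rx[8] ^ rx[9] ^ rx[10] ^ rx[11] ^ rx[12];

  wire [3:0]  e         = {e3, e2, e1, e0};        // index of the bit in error
  wire [15:0] flip_mask = 16'h0001 << e;           // bit 0 is the no-error case

  // Invert the indexed bit to correct a single-bit error
  wire [12:1] corrected = rx ^ flip_mask[12:1];

  // Parity over the corrected code plus the received detection parity bit
  wire parity = ^{code[12], corrected};

  // Extract the data byte from its codeword positions
  assign data = {corrected[12:9], corrected[7:5], corrected[3]};

  // Error if the parity is bad with a non-zero index, or the index is invalid
  assign error = (parity & (e != 4'd0)) | (|flip_mask[15:13]);
endmodule
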
In this code, the error bits are calculated from the relevant input code positions. From this error index a ‘flip
mask’ is generated by shifting a 1 by the index. This is a 16-bit number to cover all possible value for e. A
corrected code is then calculated by XORing it with the flip_code, bits 12 down to 1. Bit 0 isn’t used, as an
error code of 0 is the no-error case. If the index is greater than 12, then this is an invalid code, which we will
deal with shortly. The corrected code plus parity bit is then used to generate the parity for the entire message.
The output data is extracted from the relevant positions within the code, before an error status is calculated as
either a parity error (when error index non-zero), or an invalid index (flip_mask bit above 12 set). And that’s all
there is to it. This code is available on github, along with a simple test bench to run through all possible data
and one and two bit errors.

It was stated before that encoding a byte this way is not too efficient. In more practical usage, such as
protecting RAM data, 64-bit data can be encoded with 7 bits of hamming parity, and an 8th detection parity
bit for 72 bits in total. Hopefully it can be seen from the index pattern on the spreadsheet how this may be
expanded for 64-bit data by simply following the indexing in binary, with the parity bits at the power of 2
locations. Many RAM components come in multiples of 9, 18 or 36 bits just so 72-bit words (or multiples
thereof) can be constructed. For example, the Micron 576Mb RLDRAM 3 comes in 18- and 36-bit wide
versions.

I have implemented Hamming codecs for DDR DRAM in just this manner for memory systems in high
performance computers, where error rates over the system skyrocket as the amount of RAM used becomes so
much larger, with thousands of compute nodes.
Conclusions
In this article we have gone from simple parity for low level error detection, to a more useful way of
employing multiple parity bits for error correction, with a detection fallback. I have tried to simplify the
mathematical concepts into practical logic-based implementations, with example logic that can be used in real
synthesisable modules. Checksums give a simple method for improving detection over parity but have no
position protection. CRCs introduce the concept of polynomials and Galois fields and then reduce this to
practical logic, with methods to improve encoding efficiency. For the Hamming code, a spreadsheet
implementation is used for exploring the codes more immediately and visually and revealing the pattern of
the encoding process to allow indexing to error locations giving the means for correction. (Verilog and
spreadsheet available on github.) Each of the methods discussed trade complexity against expected channel
error rates and nature, and the appropriate method depends on the design requirements.

In the next article we will explore Reed-Solomon codes, which take parity even further, also using the Galois
fields we saw for CRCs, to be tolerant of burst errors as well as random (Gaussian) errors.

Introduction to Error Detection and Correction #2: Reed-Solomon


Simon Southwell | Published Sep 1, 2022

Introduction
In the first article we looked at error detection, starting from parity before moving through checksums and then
CRCs. This introduced the idea of modulo 2 arithmetic, which maps to XOR gates, and linear feedback shift
registers to implement polynomials and generate Galois fields (basically pseudo-random number sequencers).
In this final article I want to cover the concepts of Reed-Solomon codes and we’ll bring together the concepts
of the previous article to do so. The goal though, like for most of my articles, is to get to actual synthesisable
gate implementations to map the concepts to practical solutions.

We will work through a small example, so that large, unwieldy amounts of data don’t obscure what’s going on,
and there is Verilog code, discussed in the text, implementing an encoder and decoder, along with a test
bench to exercise it so that those interested can experiment. This code can be found, along with the first
article's Hamming code Verilog, on github here.

Reed-Solomon codes are a form of block code. That is, they work on a fixed size block of data made up of
words, in contrast to Hamming codes that work on a single multi-bit word to make up a code word. The
advantage of this is that they can correct whole words so that burst errors, where consecutive data bits are
corrupted for a short period, can be corrected. This might occur on, say, magnetic or optical media where
there is dirt or a scratch. There may still be random noise on the channel and other schemes may need to be
employed to counter these affects. Reed-Solomon codes are often used with interleaving data, and with two
stage processing, as we’ll see.

We will now look in detail at the small example. This is adapted from John Watkinson's The Art of Digital Audio
book, with some modifications and, more importantly for this article, taken to a logic implementation, which is
not done in the book. I was lucky enough to have had John Watkinson teach me some of these concepts when
at Hewlett Packard (see the introduction to the first article), as he was hired by them to teach us Digital Audio
Tape concepts, which we were about to adapt into Digital Data Storage. I have had his book since then and still
refer to it often. So, let’s get started.

Reed-Solomon Codes
For local reference I will redefine some terms here that we will need for generating Reed-Solomon codes,
along with some new ones.

• Galois Field: a pseudo-random table generated from an LFSR defined as a polynomial (as per CRC in the first
article). The table of Galois field values will be used as powers (i.e., the logarithms) of the primitive element a
(defined below).
• The primitive element a, which is a constant value. In our case this is simply 2, and we will stick to modulo 2
arithmetic. This primitive element will effectively be the ‘seed’ to the pseudo-random number generator
(PRNG), made from the LFSR.
• P and Q: the parity symbols added to the message symbols. This is the redundant information for allowing error
correction. In Reed-Solomon codes these are ‘appended’ to the data block, rather than mixed in with the
message like Hamming codes. P is used to determine the error pattern and is called the corrector, and Q is used
to determine the location of the symbol in error and is called the locator.
• Generator Polynomials: polynomials calculated from a set procedure. It is these polynomials that are used to
generate the parity symbols P and Q. In the example this is done by solving simultaneous equations.

In the example we’ll look at we will have data symbols of 3 bits. In a practical solution this might be 8-bit
bytes, but we will reduce this to 3 for clarity. The example will have 5 of these data symbols (A to E) and then
the two parity symbols (P and Q), which we will calculate using the predetermined generator polynomials
(more later). The diagram below shows the example codeword.

The symbols in the example are 3 bits, so there are 8 possible values but, as for the Hamming codes, a value of 0
is the error-free case, and so we have a seven-symbol block, allowing 5 data symbols and the two parity
symbols. To generate P and Q from the data, the Galois field must be defined and each of the powers of the
primitive element a mapped to this field. The diagram below shows the polynomial for this example, the logic
used to generate it, and a table of the output values if seeded with the value of a (i.e., 2).
The value of a is 2, and we will map each subsequent step as successive powers of a, as shown in the table
(a^1 = 010b, a^2 = 100b, a^3 = 011b, a^4 = 110b, a^5 = 111b, a^6 = 101b, a^7 = 001b). Thus, each power of a
has its corresponding value. That is, a to the fifth power is 111b. These, then,
are the values of a raised to successive powers, and the indexes into the table (starting from 1) are the logs of
a. It is by this means, as we shall see, that we can bypass multiplications by finding the logs and doing addition.

We can find the values of P and Q for the particular data by using the following polynomials, given here but
calculated from generator polynomials and solving simultaneous equations:

The equations involve multiplying powers of a with the data but, by using the logs this becomes much simpler.
For example, if a data symbol is 100b and needs to be multiplied by a cubed then, since 100b is a squared (log 2),
adding the logs gives 2 + 3 = 5, and the result is a to the fifth power, or 111b.

So, this multiplication is done by adding the logs. Note that, if the addition falls off the end of the table (i.e., is
> 7) it wraps back to the beginning of the table, so a to the 11th power maps to a to the 4th power. This isn't
quite modulo 8, as that would include the value 0, which is not part of the Galois field. It is really modulo 8 + 1.

It is possible to generate simple logic circuits, using XOR gates, for multiplying data symbols by each of the
powers of a, effectively stepping through the table the number of times to the value of the data, with the
required multiplying power of a as the starting point in the table. This can also be done with look up tables for
generating logs and antilogs and that is what we’ll do in the implementation.

So now we can calculate the P and Q using the polynomials as given above. The diagram below shows this
worked through:
It happens that, in this example, the P and Q values are the same, but this is a coincidence. Column 2 is the data
values, with column 3 the terms for the generator polynomial for P and the fifth column that for Q. The results
(columns 4 and 6) are worked through, just as for the example from before. So, C is 010b which is a multiplied
by a squared, to give 1 + 2 = 3. So, a cubed is 011b and that’s the result shown. You can go through the rest at
your leisure. Having done this for data symbols A through E, the parity is then just all five values XORed
together, which is where the P and Q values come from. This is the block that is then transmitted.

On reception we need to generate two syndromes S0 and S1. The first is just the received symbols, including
the parity symbols, XORed together. The second is the received symbols multiplied by powers of a (using the
log trick) and then XORed. This step of generating syndromes is effectively multiplying the received symbols
with a matrix, albeit that one multiplication row (that of generating S0) is multiplying by 1. This matrix is
known as the check matrix, and the check matrix for our example is as shown below, in a form that would be
the usual way of specifying this:

    ( 1     1     1     1     1     1     1 )
    ( a^7   a^6   a^5   a^4   a^3   a^2   a )

Other larger Reed-Solomon codes might have more rows in such a matrix and also slightly vary in form, for
instance starting at a to the power 0 (i.e., 1)—we will look at an example of a real-world specification later.

The diagram below shows the syndrome calculation where no errors have been introduced.
Since there were no errors, both the syndromes come to 0 and the message is good. Note that the table
shows the power of a multiplications going from power of 7 down to 1. This could be the other way around,
but this would also affect the P and Q polynomial equations, so it is about matching between these two things.

Let’s now introduce an error into the message. We will corrupt symbol C, which is then C’. In order to locate
which symbol is bad we need to know the power of a that the bad symbol was multiplied by. This, as yet
unknown, power is k. The way things have been set up, we can find a to the power k by dividing S1 by S0. This
is because S1 contains the error multiplied by the power of a at the corrupted location (i.e., by a to the power k),
whilst S0 contains the error value itself, so dividing S1 by S0 gives a to the power k = S1/S0.

We can use logs again, where division becomes subtraction. The diagram below shows the case where C has
been corrupted (to 110b).

From the example, C is now C’ and is 110b. C’ multiplied by a to the fifth power yields 100b. The two
syndromes are now 100b and 001b. So, S1 divided by S0 is 001b divided by 100b and, taking logs, log(001b) = 7
and log(100b) = 2, giving 7 – 2 = 5. Thus, k is 5 and indicates the location of the error—at symbol C.

Now we have to correct the located symbol. Since S0 is 000b when there are no errors, it must be the value of
the corruption when there are errors, so C’ xor S0 gives C, the correct symbol value, or 110b xor 100b = 010b.

An important point to note here is that it doesn’t matter what the corruption of the symbol was, with 1, 2, or
3 bits in error—S0 would be the corruption value and the multi-bit symbol can still be corrected. This is where
the burst error correction capabilities come from, and is also important for using other methods without
locator parity symbols (more later).

Implementation
In this section we will examine an encoder and decoder Verilog implementation for the example. Necessarily,
this will be fragments of code to highlight the operation, but the entire code is available on github, along with
a test bench.

Encoder
The encoder’s main purpose is to calculate the locator and corrector value P and Q. It uses tables of pre-
calculated values for the Galois field and inverse Galois field. That is, it will have an antilog and log table for
the primitive value a. In the code these are two arrays GF_alog and GF_log, with seven entries of 3 bits width.
In addition, it has tables for the generator polynomials’ (P and Q) powers (i.e., log values). These arrays are
p_poly and q_poly. All these arrays are initialised in an initial block and remain unchanged, with GF_alog set to
be like the table in the LFSR diagram above, and GF_log set with the diagram values as the indexes and the
powers of a the entries. (An initial block is fine for FPGA, but these might be better moved to a reset block or
fixed as constants, such as using local parameters. This is just for ease of illustration here.)
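
A sketch of that initialisation, assuming the tables are indexed 1 to 7, might be:

// Galois field tables for the example, from the x^3 + x + 1 LFSR seeded with
// a = 2. GF_alog maps a power of a to its value; GF_log is the inverse mapping.
reg [2:0] GF_alog [1:7];
reg [2:0] GF_log  [1:7];

integer i;

initial
begin
  // Successive powers of the primitive element a (= 2)
  GF_alog[1] = 3'b010;   // a^1
  GF_alog[2] = 3'b100;   // a^2
  GF_alog[3] = 3'b011;   // a^3
  GF_alog[4] = 3'b110;   // a^4
  GF_alog[5] = 3'b111;   // a^5
  GF_alog[6] = 3'b101;   // a^6
  GF_alog[7] = 3'b001;   // a^7

  // The log table is just the antilog table's inverse mapping
  for (i = 1; i <= 7; i = i + 1)
    GF_log[GF_alog[i]] = i;
end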

P and Q are defined as 3-bit vectors (of the same name), and there is a seven-entry array of 3-bit width, s,
which holds the calculated symbols. This is used to split the flat input data (idata—see below) into the 3-bit
words A to E, assigning them to s[0] through s[4], whilst s[5] and s[6] are assigned to P and Q results. In this
implementation, the input data (idata) is a flat 15-bit vector, for the 15 bits of five 3-bit words A to E. Similarly,
the output is a flat 21-bit vector. Two intermediate arrays, p_data and q_data, are also defined for use in
calculating the P and Q values, as we shall see. Having defined the signalling (and reference the source code
for the details), let’s look at the functional part of the encoder:

The code loops through all five input data words (which are assigned to s[0] through s[4]), calculating P and Q.
The logs of these values are looked up by indexing into GF_log and adding to the polynomial logs (as
appropriate for P or Q), placing the result in the temporary variables (p_data or q_data). This is the main
operation of ‘multiplication’ by adding the logarithms.

The next line takes care of overflow, where the addition might index off the end of the tables. As we saw
earlier, this is modulo 8 + 1, so the logic does the modulo 8 by taking the 3 lower bits of the addition result,
and then adds one if bit 3 is 1, indicating the overflow.

Once the log addition and overflow are done, the antilog of the result is taken by indexing the result into
GF_alog to get the result of the data multiplied by the polynomial term. These are accumulated in P (or Q),
which were set to 0 at the start of the loop, to get all the terms XORed together, to generate the parity
symbols. These are already being assigned to s[5] and s[6]. The flat output signal is then generated by
concatenating each entry of the s array.

Like for the Hamming Verilog of article 1, this code is purely combinatorial (though synthesisable), and a
practical solution might have synchronous registers, at least on the outputs as appropriate to the design, and
also be better constructed. Here, I just wish to remove any extraneous code that might obfuscate the main
points of an implementation. This also applies to the decoder, which we will discuss in the next section.

Decoder

The decoder's main purpose is to generate the syndrome values and then use them to find the index (k) that
indicates where an error occurred (if any). It has its own Galois Field tables, set in an initial block, just as for
the encoder. In place of P and Q we have S0 and S1 syndrome signals, and a 4-bit k index. An s symbol array is
used as for the encoder, to hold the seven 3-bit symbols, split out from the flat input vector (icode). There is
also a temporary calculation signal, S1_poly. The diagram below shows the functional Verilog fragment for the
decoder.
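
A sketch of this fragment, with names from the text where given and the rest assumed, might be as follows (the handling when S0 or S1 is zero is simplified here to 'no error located'):

integer idx;
reg [3:0] S1_poly;   // log-domain working value, one bit wider for overflow
reg [3:0] k_diff;

always @*
begin
  // S0 is simply all the received symbols XORed together
  S0 = s[0] ^ s[1] ^ s[2] ^ s[3] ^ s[4] ^ s[5] ^ s[6];

  // S1 multiplies each symbol by a decreasing power of a (7 down to 1)
  S1 = 3'b000;

  for (idx = 0; idx < 7; idx = idx + 1)
    if (s[idx] != 3'b000)   // multiplying a zero symbol just gives zero
    begin
      S1_poly = GF_log[s[idx]] + (7 - idx);                  // add the logs
      S1_poly = {1'b0, S1_poly[2:0]} + {3'b000, S1_poly[3]}; // wrap the table
      S1      = S1 ^ GF_alog[S1_poly[2:0]];                  // antilog and XOR
    end

  // Locate the error: a^k = S1/S0, so k is the difference of the logs
  if (S0 != 3'b000 && S1 != 3'b000)
  begin
    k_diff = {1'b0, GF_log[S1]} - {1'b0, GF_log[S0]};
    k      = {1'b0, k_diff[2:0]} - {3'b000, k_diff[3]};      // wrap underflow

    if (k == 4'd0)
      k = {1'b0, GF_log[3'b001]};   // S1 = S0 means a^k = 1, i.e. k = log(1) = 7
  end
  else
    k = 4'd0;   // no correctable error located

  // The located symbol, s[7 - k], is then corrected by XORing it with S0
end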

As discussed earlier, S0 is just all the symbols XORed together, and so this does not need to be in the loop
(though it could have been and generate the same logic). S1 is set to 0, and then a loop iterates over all seven
symbols. Firstly, in the loop, the log of the power of a is calculated as (7 – idx), since the loop has an increasing
index, but the example multiplies the symbols by decreasing powers of a, as shown in the example earlier. Just
like for the encoder, once adding the logs for multiplication, any wrapping over the table is dealt with. It is
possible the symbol being processed is 0 but using logs to process zero doesn’t really work, so this is detected
on the next line and the result set to 0, as multiplying by 0 gives 0, without the need for logs. If it isn’t zero,
then the antilog is taken for the result for that term. These are all accumulated over the loop by XORing into
S1, which was initialised to 0.

Now we have S0 and S1, k needs to be calculated. This is the division of S1 by S0, turned into a subtraction
with logs. Thus, the log of S0 is subtracted from the log of S1, indexing into the GF_log table. The k value can be
negative, which needs handling, so this is done by subtracting 1 if the result underflowed, in much the same
way as overflow was handled. Also, the result for k can be 0—that is, a to the power 0, or 1. Again, logarithms
don't handle zero, but the result is known to be 1, so the log of 1 is looked up in this case (k is already a power
of a, so is in logarithmic form; the value 1 therefore also needs converting to a logarithm).

To correct the error, then, the indexed value in the s array (reversed for earlier stated reasons) is XORed with
S0 to get the correct result. The flat output data vector (odata—not shown in the code above) is just s[4] down
to s[0] concatenated.

Real-World Example
The example we have worked through is meant to boil down the essential operations of Reed-Solomon. There
are many variants of this basic operation. For instance, if there is a means to locate the position of the errors
by some other means (e.g., product codes, not discussed here), then the locator parity symbols are not
needed, halving the redundancy added or using the same number of parity symbols to correct multiple errors.
Where two errors are effectively XORed together, simultaneous equations can be solved to separate them.
This is vastly simplified by the fact that, as we saw, the nature of the error is not important (1, 2 or n bits in
error makes no difference), so the known erroneous symbols can be set to 0, thus removing the error values
from the calculations. This is known as correction by erasure.

Now, I’ve skipped the calculations of the polynomials P and Q in this article as I wanted to keep away from
complex mathematics and because they are fixed values, known up front. In general, so the theory goes, if n
symbols are to be corrected (or n/2 corrected and n/2 located), then n redundant symbols are necessary. In
general, there is a generator polynomial defined as:

Now you might think that this doesn’t look much like the polynomials of the example but the values of the
powers of a are known constants (the values of the Galois field), and one just has to multiply it all out to get a
polynomial in terms of powers of a multiplied by the polynomial terms (powers of x) to get the generator
polynomials. These generator polynomials are then used to solve the simultaneous equations to get the parity
symbols (P and Q in the example). Most specifications I have seen give this general form and leave it up to the
engineer to plug in the values. The reason I’m explaining all this is so that the example we’re about to look at
makes sense, and there is a mapping back to the example already discussed. There is no space to go through
this here, but for those interested in the mathematics, then there is plenty of information out there.
ECC in Digital Data Storage
Let’s look at a case I’m familiar with, namely DDS as mentioned in the first article. In DDS, Reed-Solomon
encoding is done as a two-stage process. Firstly, data is interleaved to improve error robustness, and formed
into two rectangular arrays (a plus and minus array), with space for parity symbols. A first RS coding is done
across this array, filling in parity symbols in the middle of the rows, and then down the array, filling in parity
symbols at the end of the columns, though the nature and position of the parity symbols changes with
iterations of the specification. So, two RS codes are added at right angles to each other over an array of
interleaved data bytes.

The two RS codes are defined in the specifications as follows. There is an RS(32,28) code and
an RS(32,26). It uses correction by erasure, as discussed above, and has a Galois field polynomial of:

The primitive element a is 00000010b, and the generator polynomials for the two RS codes are defined as:

Finally, the check matrices are defined as:

This, then, is what’s provided in a typical specification, which then needs to be implemented as gates—or
possibly as software on an embedded processor, depending on the bandwidth requirements. Hopefully, some
of this now makes sense with respect to the example discussed, though we’ve had to take some things as
given.

Conclusions
In this article, we’ve taken some of the concepts from the first article and seen how they can be used in Reed-
Solomon codes to generate a code with properties robust to burst errors. An example has provided a means
to show how parity symbols can be generated to locate and then correct an erroneous symbol, whatever the
number of bits in the symbol are in error. One needs double the amount of parity symbols for the number of
symbols that can be corrected, but one can trade this for just correction if other means of error location are
employed (not discussed here) and with the use of error symbol erasure.

Having worked through the example, this was mapped to an actual implementation in logic (the source code
of which can be found here), which is the main purpose of these articles, to take theory to practice. A real-
world example was reviewed, in the form of DDS error correction, to see what a practical specification looks
like, mapping some of this back to the example that was worked through.

I’ve tried to minimise the mathematical theory behind this and have taken some things as given in order to get
to a practical solution and demonstrate how one gets from mathematical symbols to actual logic gates. This,
of course, means that some details have been glossed over. If you ever find yourself having to implement a
solution, then there will still be work to do for your particular case. The hope is that, having seen a worked
example mapped to logic, you know that it is at least possible and the problem does simplify to manageable
levels.

Demystifying Memory Sub-systems Part1: Caches


Simon Southwell
Published Jun 30, 2022

Introduction
In this and the next article I want to cover the topic of memory sub-systems, discussing subjects such as caches,
memory protection, virtual memory, and memory-management systems. It is my experience over the years that
some engineers seem to think these topics are too difficult and specialist to understand—even ‘magic’. This
depends where one finds oneself working, of course, and I have worked in environments where these functions
need to be implemented and expertise is at hand. So, in these articles, I want to demystify these topics and show
that, even if this is not an area of expertise or experience, it is not really a difficult subject if approached at the
right pace and with an appreciation of the context of what problems are being solved by these systems. I also
think that many logic and embedded software engineers will at least be working within a processor-based SoC
system with these types of memory sub-systems, such as an ARM or RISC-V based SoC, and knowing how these
systems work is extremely helpful in understanding how to use them effectively, and in debugging issues when
constructing a solution based on them. This first article, though, will concentrate on caches, what types are
common and why they are employed.

A History Lesson
In the early 1980s, at the start of the personal computer revolution, several machines were launched: the
Sinclair ZX81, the BBC Micro, the Commodore 64 and the IBM PC, amongst others. These systems had
certain attributes in common, such as an 8-bit microprocessor, RAM, ROM and peripherals for
keyboard, sound, graphics, tape/disk interface and programmable input/output. The operating system might be
stored on the ROM and programs loaded into RAM. When the computer started running it would read
instructions over its memory mapped bus directly from ROM and/or RAM. Memory was expensive and these
early systems were limited to tens of kilobytes (or less—the ZX81 was shipped with 1Kbyte). The dynamic
random-access memory (DRAM) used at the time had comparable access times to the CPU clock rate and the
CPU was not stalled on instruction accesses. In some cases, where video RAM was shared with the CPU it
might even be faster. The BBC micro, for example, had a 2MHz CPU, with RAM running at 4MHz,
interleaving video and CPU access, without stalling either.

As time progressed and CPUs became faster, accesses to memory could not keep up, particularly with the
introduction of synchronous DRAM (SDRAM). A CPU system can function perfectly well with the slower
memory, but each load or store between the CPU and memory introduces wait states that slows down the
effective performance of the CPU. The solution was to introduce cache memory. That is a faster memory that
contains fragments of the code and data currently being used or, more precisely, likely to be used. Initially
caches were external fast RAM, such as SRAM, but were quickly integrated into the microprocessors
themselves as on-chip memory.

Cache Requirements
The idea behind the use of a cache makes some assumptions about the way programs tend to execute. Most
programs have a linear flow for more than just a few instructions before they might branch or jump, and also
mostly work on local data within those fragments of code, say on a section of memory on the stack. If a
program did truly jump around to random memory locations, and access random areas of data memory, a cache
would not be an advantage—actually, it would slow things down.

Given this, a cache design would need to access sections of instructions or data from main memory to place into
the cache memory. This is actually efficient from an SDRAM access point of view as well. Since, in modern
CPU systems, multiple processes are running, the cache will need to hold several disparate fragments and know
which memory fragments these are (i.e., their addresses). Also, given the finite size of cache memory that will
be available, it will need some mechanism for writing back updated fragments to main memory, in the correct
place, so it can use that part of the cache memory for a new fragment, as needed.

As we shall see, the way caches work means that even they may take multiple CPU clock cycles to access,
though less than accesses to main memory. In ARM (and I assume other) systems there may be some tightly
coupled memory (TCM) which would be single cycle memory, holding important, time critical, code that needs
fast execution, such as initial interrupt routines or special operating system code and data. TCM, however, is not
to be confused with cache memory. TCM would sit at some fixed place in the memory map and would not be
cached.

Caches can also be multi-layered. The CPU may access memory via a cache (let’s say level 1) which is very fast,
but this limits the amount that can be deployed. There is a (rough) inverse relationship between RAM size and
speed. So many systems will have another layer of cache (layer 2) underneath layer 1, which is larger but
slower—though still faster than direct access to main memory. If data is not in the L1 cache, L2 is inspected and
L1 updated if present, else main memory is accessed. Systems can even have a third level (L3).

Many modern processors employ a Harvard architecture. That is, there are separate busses for instructions and
data. Many systems reflect this architecture by having separate instruction and data level 1 caches, though level
2 would be common. Below is a block diagram of the Xilinx Zynq-7000 SoC FPGA.
Highlighted in yellow, at the centre of the diagram, are the blocks for the ARM Cortex-A9 processors,
indicating 32Kbyte instruction and data L1 caches, with a common 512 Kbyte L2 cache. This is mirrored in
other similar devices, such as the Intel Cyclone V FPGA and gives an idea of a typical SoC setup.

So, after this overview, let’s move on to how a cache is constructed.

Cache Terminology
There is a lot of potentially confusing terminology surrounding caches, but it is all for good reason. We need to
define this terminology so that we use a common, understood language.

A cache that stored individual bytes, or even words of the basic architecture size (e.g., 32 bits), as the size of
fragment mentioned before, would not work and does not fit with the assumption about instructions and data being
accessed from locally contiguous memory sections. What the size of these fragments needs to be is part of the
design of the cache and is a trade-off between efficiency when accessing a block from main memory, the cache
size, and the likelihood of reading values that are not accessed before the data is removed from the cache once
more. This fragment of the chosen size is called a cache line. Typical sizes might range from 8 to 256 bytes for
an L1 cache (in powers of 2). The cache line fragments would be aligned in memory to their size so that, for
example, 32-byte cache line fragments will have start addresses at multiples of 32 bytes—0, 32, 64 etc.

Obviously, a cache with space for only a single cache line would not be very efficient, as it would quickly need to
be replaced with a new line, with all the delay that that would cause. So, the cache has a number of cache line spaces, and
the collection of cache-lines is called a set. A typical set size might range from 128 to 16K cache-lines.
A cache may also have more than one set. This is known as the number of ways and might range from 1 to 4.
The total size of the cache, in bytes, is just these three numbers multiplied together. For example, if a cache-line
size is 32 bytes, with a set size of 128 and is a 2-way cache, then this is 32 × 128 × 2 = 8Kbyte cache. Below is
a table showing some real-world processor systems and their level 1 cache parameters (some of which are
configurable):

Associativity
Another new term (unfortunately) that defines how a cache will function is associativity. Basically, there are two
types of associativity:

• Fully associative
• Set associative

What it indicates is how the particular cache-line addresses are arranged and searched for to see if an address
being accessed is actually in the cache.

Fully Associative Caches


A fully associative cache is one where a cache-line can be placed anywhere within a set. The data will be placed
in the cache RAM and, in order to identify where in memory that cache line resides, a ‘tag’ RAM will store the
relevant address bits. In a fully associative cache, since this can be in any aligned location in the cache, when
checking that an access is to an address in the cache, every location within the tag RAM will need to be checked
to see if the access matches an address within one of the stored cache-lines.

Functionally, this is the simplest to understand. One has a finite number of cache-line spaces in a set, and any
can be used to store a given cache-line. When it’s full, some mechanism (more later) can be used to release one
of the lines currently in use and replace it with the new required cache-line. From an implementation point of
view, this is an expensive mechanism. Each lookup of the tag RAM requires all entries to be inspected.
Normally this would be via a content addressable memory (CAM), sometimes known as an associative
memory. These are larger and slower than standard RAMs, but are used in other applications, such as network
address lookups and are sometimes employed as caches—but not too often, so I won’t delve further here. Once
we have looked at Set Associative caches, it should become obvious how a fully associative cache would work,
just with fewer restrictions.

Set Associative Caches


Set associative caches are not as flexible as fully associative caches, but are much cheaper, faster, and simpler
to implement and are, thus, more widespread. I mentioned, in passing, that a cache has two RAMs, the cache
RAM itself, and the tag RAM—which would be a CAM for fully associative. The diagram below shows the set
up for a single set for a set associative cache.
The diagram shows an example of a cache which has 32-byte cache-lines and a 16 entry set, giving a 512-byte
cache RAM, and a tag RAM with 16 entries of tag-width bits. This is probably not a practical cache but is useful
as an illustration.

In a set associative cache, a cache-line can’t be stored just anywhere in the cache; its location is a function of
some bits in its address. A cache-line’s address is always aligned with its size, so the bottom bits of a cache-line’s
address are always 0. When accessing a particular byte or word within a cache line, the bottom address bits of the
access are an offset within the cache-line. For a 32-byte cache-line, then, the bottom 5 bits are an offset into the
cache-line. The next n bits are used to determine in which entry of the cache set the cache-line will reside. For a
set of 16 entries (as per this example), this is the next 4 bits—bits 8 down to 5. All addresses that have this same
value in these ‘index’ bits must use that entry in the set. The remaining bits are the ‘tag’ and it is these bits that
are stored in the tag RAM, along with some control and status bits (more later).

When a processor does a load or a store to a particular address, a lookup in the tag RAM will be performed by
breaking up the access address bits into the index and TAG. The index will address the entry from the tag RAM
and the data read compared with the tag bits of the access address. If the values match—a ‘hit’—then the
addressed data is in the cache, and the address of the data in the cache RAM is the index × cache-line size +
offset, and the data retrieved for a load, or written for a store. If the values don’t match—a ‘miss’—then the data
is not in the cache, and the cache-line will need to be replaced with the data from main memory. The existing
data in the particular cache-line may have to be written to main memory first if it has been updated but not yet
written back at the point of the miss. Once the cache-line has been replaced, then the comparison will hit, and
the data read/written as for a first-time hit.
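As a sketch of the address arithmetic just described, using the example parameters (32-byte cache-lines, a 16-entry set) and a single way, a lookup might be modelled like this; the type and field names are inventions for illustration, not from any particular implementation:

#include <stdint.h>
#include <stdbool.h>

#define LINE_BYTES   32              /* cache-line size                    */
#define SET_ENTRIES  16              /* lines per set                      */
#define OFFSET_BITS  5               /* log2(LINE_BYTES)                   */
#define INDEX_BITS   4               /* log2(SET_ENTRIES)                  */

/* One tag RAM entry: the tag bits plus valid/dirty status bits (more later) */
typedef struct {
    uint32_t tag;
    bool     valid;
    bool     dirty;
} tag_entry_t;

typedef struct {
    tag_entry_t tag_ram[SET_ENTRIES];
    uint8_t     cache_ram[SET_ENTRIES * LINE_BYTES];
} cache_t;

/* Returns true on a hit and sets *line_addr to the byte address within the
   cache RAM; on a miss the line must first be fetched from main memory
   (writing back the old line if it is dirty). */
bool cache_lookup(cache_t *c, uint32_t addr, uint32_t *line_addr)
{
    uint32_t offset = addr & (LINE_BYTES - 1);
    uint32_t index  = (addr >> OFFSET_BITS) & (SET_ENTRIES - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

    tag_entry_t *e = &c->tag_ram[index];

    if (e->valid && e->tag == tag) {
        *line_addr = index * LINE_BYTES + offset;   /* hit */
        return true;
    }
    return false;                                   /* miss: replace the line */
}

A multi-way cache simply repeats the tag comparison across each way’s tag RAM, as described below.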

This description is for a 1-way set associative cache. This will function perfectly well but is limited in that
addresses with the same index bits will clash on the same set entry, increasing the likelihood that an entry won’t
be in the cache and that an entry will have to be swapped. A solution is to increase the number of sets
available—the ‘ways’.

Multi-way Set Associative Caches


If a cache has multiple sets (i.e., the number of ways) then cache-line address clashes are reduced, as a line can
reside in any of the ways available. As mentioned before, typical values for the number of ways are 2 or 4. The
diagram below extends the example by increasing the number of ways to 4.

Now the index bits can index into any of the 4 set entries in the ways. When matching a load or store address,
all four tag RAMs are read at the index and all four tags compared. If any of them match, this is a hit, and the
‘way’ that contains the match is flagged to access the data from the cache RAM. The cache RAM might also be
four different RAMs, but the way bits (2 bits for a 4-way cache) could just as easily be used as the top address
bits in a larger single RAM. The choice will depend on factors like speed and resource requirements. Separate
cache RAMs allow pre-fetching of data (on reads) and will be smaller and thus faster but will take up more
ASIC real-estate.

If none of the ways contains a match, then this is a miss, but now an added complication arises as to which way
to update. In the single set case there was only one choice. Now we have to choose between n different ways (4 in
the example). Common algorithms employed are:

• Round-robin
• Least recently used (LRU)

The first is the easiest to understand and implement. For a given index one just loops round the way count each
time an entry is updated with a new line: 0, 1, 2, 3, 0, 1…, and so on. This round-robin count, however, must be
kept for each line in the set as there is no association between the lines and when they are updated.

In a least recently used implementation a tally must be kept of when a particular way was accessed at a given
index. When a miss occurs, the entry that was accessed the furthest back in time is chosen to be updated. This
sounds like it might be complicated, but the number of ways is not likely to be huge (4 is a typical value, as
we’ve seen). The way I have approached this before is to assign each entry across the ways a value 0 to n, with
0 being the least recent access and n being the most recent. For a 4-way cache, n is 3. Each time there is a hit on
a given index, the matching way has its LRU count updated to 3. The value it had before is used to flag that any
way with an LRU value greater than this is decremented. Everything else is left unchanged. For example, if a
way is accessed with previous value of 1, then the ways with values 3 and 2 are decremented to become 2 and 1,
and the matched way entry set to 3. This maintains the LRU counts across the ways. Note that an LRU count
must be kept for each set index, as per the round-robin method.
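A minimal sketch of that LRU bookkeeping, for a 4-way cache, might look as follows (the array layout is an assumption for illustration):

#include <stdint.h>

#define NUM_WAYS 4

/* lru[index][way] holds 0 (least recently used) to NUM_WAYS-1 (most recent).
   On a hit, promote the matching way to most recent and decrement every way
   that was previously more recent than it. */
void lru_update(uint8_t lru[][NUM_WAYS], unsigned index, unsigned hit_way)
{
    uint8_t old = lru[index][hit_way];

    for (unsigned w = 0; w < NUM_WAYS; w++)
        if (lru[index][w] > old)
            lru[index][w]--;

    lru[index][hit_way] = NUM_WAYS - 1;
}

/* On a miss, the victim is simply the way whose count is 0 */
unsigned lru_victim(uint8_t lru[][NUM_WAYS], unsigned index)
{
    for (unsigned w = 0; w < NUM_WAYS; w++)
        if (lru[index][w] == 0)
            return w;
    return 0;   /* unreachable if the counts are maintained correctly */
}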

Cache Control and Status Bits


Along with the address TAG bits, the tag RAM will usually have additional control and status bits. As a
minimum, each entry will have the following bits:

• Valid
• Dirty

The valid bit indicates that a valid cache-line is stored at that location. After reset (or a cache invalidation),
there are no cache-lines in the cache, and the valid bits are reset. When doing an address match, a valid bit that is
clear forces a miss.

The dirty bit is used to indicate that a write has been performed at the cache-line entry, but that entry has not
been written to main memory. When a cache line needs to be replaced, if the dirty bit is set, then the current
cache-line must be written back to main memory before it can be updated. If the entry has never been written,
then this write back can be skipped, saving time.
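Continuing the earlier lookup sketch (same hypothetical cache_t and tag_entry_t types), the replacement decision driven by these two bits might be modelled as:

/* Hypothetical write-back of a cache-line to main memory, not shown here */
extern void writeback_line(cache_t *c, unsigned index);

/* Free a line for replacement: write it back only if it is both valid and dirty */
void cache_evict(cache_t *c, unsigned index)
{
    tag_entry_t *e = &c->tag_ram[index];

    if (e->valid && e->dirty)
        writeback_line(c, index);

    e->valid = false;
    e->dirty = false;
}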

Additional, but optional, bits are often seen. Some 'accessed' bits may be present to store things such as the
LRU count bits, used in choosing which line across the ways is to be replaced. Also, it may contain permission
bits, indicating read, write, execute, and privilege level permissions. This is only usually done if the system does
not have a separate memory protection unit, or an MMU managing these access permissions.

Write Policies
Most commonly, because it gives the best performance, a cache has a write policy of ‘write-back’. That is, it will
write-back a cache-line to memory only when it is to be replaced and has the dirty (and valid) bit set.

Another policy used is that of write-through. Here a cache-line is written back to memory every time it is
written to from the processor. This does not mean it is invalidated or replaced, just that the line is also written to
memory. This increases the number of writes to main memory over a write-back policy, reducing performance.
It is used where more than one processing element or other function has access to main memory and needs to
access data from the processor updating the cache. With a write-back policy, there can be a mismatch between
the data in a dirty cache-line and main memory for an indefinite amount of time. With write-through the data
ends up in main memory after just the cache-line write latency.

Many caches also allow the ability to flush them or to invalidate them—something that an operating system
might want to do, rather than a user application. Flushing requires that all dirty entries in the cache be written to
main memory. The cache remains valid and can still have hits. If invalidated it is both flushed and all valid bits
cleared so that no matches on previous cache-lines are possible.

Cache Coherency
The issue of dirty cache entries and main memory mismatches is a problem known as cache coherency. In a
single processor system, when it is the sole accessor of the main memory, this is not an issue. When there’s
more than one device in the system with memory access, this might be a problem, if the blocks need to
communicate through main memory.

We’ve seen how a write-through policy might solve this issue, but it has a high performance penalty. Also
flushing ensures valid main memory but requires co-ordination with operating system software. Other solutions
exist to address cache coherency. If you inspect the block diagram of the Xilinx Zynq-7000 from earlier, you
will see a snoop control unit (SCU). Connected to it is a bus called ACP—an Accelerator Coherency Port. This
allows an external master, via the SCU, to check whether a memory access hits in the cache and retrieve the data
from there, or to access main memory directly. There is a penalty for this cache lookup, but it is smaller than for
write-through, where every cache-line is written back whether it needs to be accessed from another source or
not. Modern busses and interconnect protocols, such as AXI, support whether a cache coherent access is
required or not, giving control to the lookup penalty for only those accesses that require it. Beyond this, for
large distributed systems, the Coherent Hub Interface (CHI) interconnect uses message transactions for cache
coherent accessing across many caches. These kinds of transactions were employed in supercomputers when I
was working in that field and are now making their way into SoCs and chips with high processor counts to solve
these types of coherency problems.

As you can see, the use of a cache solves lots of problems with slow memory accesses but creates a set of new
problems in coherency which must be addressed.

Cache Regions
The assumption so far is that all processor loads and stores are to go through the cache, but not all accesses to
the memory map go to main memory. In an SoC, the control and status registers of peripherals may be memory
mapped in the same address space. It would not be useful, for instance, to cache accesses to control registers
when writing, say, a ‘go’ bit to start a transmission. This is controlled by defining cache regions.

At its simplest a system might mirror an address range, so that a 1 GByte region exists at 0x00000000 to
0x3fffffff and is mirrored at 0x80000000 to 0xbfffffff. The first of these regions might be cached whilst the
second region, though accessing the same physical memory, would be un-cached. The top bit of the address
defines whether the cache is bypassed or not.
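With a mirrored map like that hypothetical one, software can pick the cached or un-cached view of the same physical location just by manipulating the top address bit, for example:

#include <stdint.h>

#define UNCACHED_BASE 0x80000000u   /* hypothetical un-cached alias of the lower 1 GByte */

static inline volatile uint32_t *uncached_view(uint32_t phys_addr)
{
    return (volatile uint32_t *)(uintptr_t)(phys_addr | UNCACHED_BASE);
}

static inline volatile uint32_t *cached_view(uint32_t phys_addr)
{
    return (volatile uint32_t *)(uintptr_t)(phys_addr & ~UNCACHED_BASE);
}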

More generally, a cache system may have the ability to define multiple regions of cached and un-cached
accesses. These might be separate for instruction and data caches. The LatticeMico32 softcore processor’s
cache, for example, has just such definable regions. The diagram below shows some examples of the
configurability of cached regions.
More sophisticated systems may allow more, and discontiguous, regions to be defined.

Conclusions
In this article we discussed how there is a mismatch between processor speed and accesses to main memory,
causing bottlenecks in programs at loads and stores, and how caches can help solve some of these issues. Caches
can take on a couple of different forms of associativity, with advantages and disadvantages of both.

Having looked at some practical cache architectures it was seen that caches create new problems of coherency
across a distributed system. This was looked at in terms of write policy and cache coherent accesses via snoop
control units and an ACP.

Hopefully this has given you some insight into how caches work, and into terms that you may have heard but
whose meaning was unfamiliar. It may be that you will never need to deal with caches directly, or implement
new cache systems yourself, but I think understanding them can be crucial when working in systems with these
functions. As a software or logic engineer, using the system efficiently with caches and knowing when
coherency might be an issue and how to deal with it is useful knowledge.

In the next article we look at virtual memory, memory protection and memory management units (MMUs) that
implement this.
Demystifying Memory Sub-systems Part 2: Virtual Memory
Simon Southwell | Published Jul 4, 2022

Introduction
In this article we will look at virtual memory and what problems it is trying to solve in a multi-tasking
environment. Dividing memory into 'pages' to map to main memory will be discussed, as well as the tables
used to keep track of these mappings. Hardware support for virtual memory is also covered in the form of
memory management units (MMUs), and a real-world example of VM is discussed in the form of the RISC-V
specification, using Sv32 as an example, showing how an MMU could use this to do virtual address translation.

In the first article of this series, discussing caches, I gave a quick history lesson on what the situation was, pre-
caches, in the early 1980s when the blossoming PC revolution was well underway. Let’s start by going back to
this era, which was also pre-virtual memory.

These original PCs could only perform one task at a time—they were single task machines. After reset the
computer would boot and execute start-up operating system code until it was ready to accept user input to load
some application. The OS might be in ROM (or loaded into RAM from disk or tape, via some BIOS in ROM),
leaving some unreserved RAM for applications. The OS would load an application to this spare RAM and then
jump to the start address of that application, which would then run. The OS is no longer running in the sense that
it cannot process any more user input to load and run another program. The application, of course, might make
calls to OS service routines, but a single ‘thread’ of execution is maintained. When the application exits, this
thread of execution returns to the OS, and new user input can be accepted to load and run other applications.

Given the speed of microprocessors and memory at the time, this was all that was possible on a PC. As CPUs
became faster and faster, and larger memories became available, PCs started to have the capacity to run more
than one application or task at a time. But there is a problem. The application software written for, say, DOS, would
have some minimum system memory requirements to run, and would expect that memory to be at a known
fixed location within the memory map. Each application would make the same assumptions about where the
resources it could use would reside. To run more than one task at once would cause clashes in the use of
memory.

One solution is to compile the code to only use relative addresses: so-called position-independent code (PIC).
The OS can load separate applications in differing parts of memory (assuming enough memory) and run the
code from their loaded locations, swapping between the tasks to give CPU time to each. (This is context
switching for multi-tasking, which is outside the scope of this article.) A version of Linux, called μClinux, uses
just such a method, so there is no virtual memory. As an example, my LatticeMico32 instruction set simulator
(mico32) can boot this version of Linux, which also uses μClibc and BusyBox for a multi-tasking environment
and user shell. The diagram below, taken from the ISS manual, shows the fairly straightforward system required
to boot this OS without a memory management unit and VM support. (The ISS does actually have a basic
MMU model, but it is not used in this scenario.)
This approach, though quite useful, is limiting in that all the relevant code must reside in RAM at the same
time, which will quickly run out as more tasks need to be run together ‘concurrently’.

A better solution might involve fooling each task into having the illusion that it is the sole process running and
has access to all the memory it would have had in the original situation of a single task computer. That is, it has
a ‘virtual’ view of memory. In reality, the task has its memory dynamically mapped to real physical memory,
and each task's memory is mapped to different parts of actual memory, avoiding clashes. On top of this, in order
to support multiple tasks whose total memory requirements might exceed that of the physical memory available,
parts of a task’s memory might be removed from physical memory and stored on a slower storage device, such
as a hard disk drive. As tasks are given CPU time by the OS, this data may be reloaded to physical memory
(most likely to a different physical location from the last time it was loaded). This situation was already
available on main-frame computers of the time, and quickly adopted in PCs and then embedded SoC systems.
Since the application code has a virtual memory view as if it were the only task running, code from the pre-
virtual memory era can run without modification.

This system of memory mapping is Virtual Memory, and in the rest of this article we will discuss how one
might arrange virtual memory and the operations of hardware, such as a memory management unit (MMU), to
support this virtual memory in a multi-tasking system. We will look at how memory is divided into pages, how
the pages in virtual memory are mapped to physical memory and how this mapping is tracked and managed
efficiently.

Pages
In order to get a handle on managing virtual memory, the memory is divided into ‘pages’. That is, fixed blocks
of contiguous memory. A virtual memory system may support only a single page size, such as 4Kbytes, but
some systems can support pages of a few different sizes. A 4Kbyte page size is common in operating systems
such as Linux, for example. This is quite a small page size but was chosen some time ago and modern
computers could deal with larger pages (and some do), but 4Kbyte is still widespread.
A particular task or process will have a virtual memory space that is contiguous. This space is divided up into
pages, each with a page number. Since this is virtual memory, this is the virtual page number (VPN). The page
number is really the top bits of the virtual address of the start of the page. Each page is aligned to its size, so that
a 4Kbyte page will have the bottom 12 bits equal to 0 at its start address. The page number is then the remaining
address bits. For a 32-bit system, for example, this is bits 31 down to 12. The virtual pages will be mapped to
physical pages. Real physical memory is also divided into pages in the same way, with their own physical page
numbers (PPN). When a task requires memory for its code or data, the virtual pages will be allocated to
physical pages, and tables must be kept to remember which physical page addresses are being referenced when
an access by the task is made to its virtual addresses. The diagram below summarises the situation of virtual to
physical mapping.

The left of the diagram shows part of a process running requiring some contiguous buffer—a large array, say.
The buffer is larger than the page size but is contiguous in virtual memory, with the virtual address ranging
from the buffer’s base address (0x78ffc084 in the example) to the base plus buffer size (0x79000084). This
address space will map over a set of contiguous virtual pages, though it needn’t necessarily be aligned with the
start and/or end of a page. The OS will allocate physical pages for each of the virtual pages referenced. This is
done as whole pages, even though the used virtual memory doesn’t use the start or end of the lowest and highest
virtual pages’ memory. The physical pages could be allocated to any PPN in physical memory and need not be
contiguous.
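The address arithmetic behind the diagram is just a split at the page boundary. A small sketch, assuming 4Kbyte pages on a 32-bit machine:

#include <stdint.h>

#define PAGE_SHIFT 12                           /* 4Kbyte pages */
#define PAGE_SIZE  (1u << PAGE_SHIFT)

/* Virtual page number: bits 31 down to 12 */
static uint32_t vpn(uint32_t vaddr)      { return vaddr >> PAGE_SHIFT; }

/* Offset within the page: bits 11 down to 0 */
static uint32_t page_off(uint32_t vaddr) { return vaddr & (PAGE_SIZE - 1); }

/* The physical address is the translated physical page number (PPN)
   re-joined with the unchanged offset */
static uint64_t phys_addr(uint64_t ppn, uint32_t off)
{
    return (ppn << PAGE_SHIFT) | off;
}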

In order to access the data in physical memory, tables must be kept with the translations from virtual pages to
physical pages. The page tables reside in physical memory and, of course, don’t have virtual page equivalents.
As was mentioned before, the system can run out of physical pages to hold all the required data for every
running process, and so some that have not been accessed for a while can be stored in secondary storage, as
shown for one of the pages in the diagram. These are kept on swap partitions or page files, depending on the
operating system. On Linux, for instance, there is a swap partition for storing these pages. The diagram below
shows a Linux system’s partitions, with the swap partition highlighted.
The pages that have been ‘swapped’ out to disk do not have a mapping to physical memory, and no entry in the
page tables. If a swapped-out page is referenced, the lookup in the page table fails and the swap space searched.
If found, then a resident page in memory is selected for removal to swap space, and the referenced page loaded
to that physical page, and the page tables updated to map the virtual address of the loaded page to the selected
physical page. Of course, if the virtual page is not in the swap space either, then an exception is raised. The
details of how an OS does this swapping and the format of a swap partition is beyond the scope of this article,
which is looking at VM from an MMU point of view. For those that want a deeper dive into OS VM
functionality, I have found Mel Gorman’s book Understanding The Linux Virtual Memory Manager helpful,
which is freely available online. Chapter 11 discusses swap management.

Note that there are separate page tables for each process that is running, since each process can reference the
same virtual address as another process, mapped to different physical pages in memory. That clash, of course, is
where we started and why we need virtual memory. Some systems (that I have worked on) combine the page
tables for all processes, with additional information to uniquely identify the process with an ID (PID) or
context number. A lookup then has to match both a page number and a context number/PID.

The logic unit that manages the translations from virtual to physical pages, along with some other functionality
we will discuss, is the memory management unit (MMU).

Memory Management Unit


Hardware support for virtual memory comes in the form of a memory management unit. It is usually fairly
autonomous as a function with regard to translating between virtual and physical pages, which needs to be done at
high speed in order for the system to perform efficiently. The operating system, however, would usually be
looking after the physical page allocations and the context switches, and would configure the MMU
accordingly. This might be as simple as pointing the MMU to a different set of page tables when there is a
context switch. The MMU would access, for a given process, a set of translations from memory, kept in a page
table, in order to do the translations. As we shall see, it might also keep a local cache of some translations to
speed up this process. The MMU would also normally be responsible for memory access protection and
privilege.

Fundamentally, then, in co-ordination with the operating system, the MMU does the following things:

• Translates virtual addresses to physical addresses
• Reads these translations from page tables in memory and keeps a cache of them
• Does memory protection and monitors access privileges.

TLB
As mentioned above, page tables are kept in physical RAM and each time the processor does a load or a store to
virtual memory the MMU has to translate this to a physical load or store. If it had to access the page tables in
main memory every time, then this access latency would be added to each and every load/store that the
processor did. To alleviate this penalty, a cache of translations is kept in a translation lookaside buffer (TLB).
This acts exactly like a cache for instructions or data that was discussed in the previous article. Only, now,
instead of being data or instructions in the cache it is virtual to physical page translations. The logic is that same
as for a data cache, just with parameters that fit translations. For instance, an entry in the page table (a Page
Table Entry or PTE) for a 32-bit system might be 4 bytes, so the TLB’s cache-line size would be 4 bytes. The
set size might be smaller and may even be a fully associative cache but, in general, the function is the same as
that for a data cache.

Page Tables
The entries in the page tables (PTEs) in RAM could all be kept in a single flat table, but this would be a
problem. Searching through a variable-sized table in which the entries were just randomly scattered would be
too slow. Therefore, tables are organized where the position in the table is a function of the virtual page
number. This is similar to the index bits of an address in a set associative cache, as discussed in the previous
article. A flat table using this method, though, would require quite large tables. Even in a 32-bit system that can
address up to 4Gbytes of memory this requires, if using 4Kbyte pages, 2^20 (around a million) table entries, one
for each possible virtual page. And this is for every process running. For a 64-bit system this rises to impractical values.

The solution is to have a hierarchy of page tables where the virtual page number is divided into sections, with
the top bits used to index into a first table. If the PTEs are 4 bytes, then the top 10 bits might be used, and the
table fits neatly into a 4Kbyte page. The PTE there, instead of being a translation, is actually a pointer to another
table. The next (10) bits are used to index into that table, and so on, until a translation entry is found. The
number of tables traversed depends on the architectural size. For example, a 32-bit system might have just two
steps, whilst a 64-bit system might have 3 or even 4 steps. Traversing the tables like this is called a table walk.
This could be done by the operating system and the TLB updated by software but, more efficiently, this would
be done in logic with a table walk engine as part of the MMU functionality. The process of reading an entry and
either continuing to another table, or fetching the translation is something well suited to logic. This also gives a
mechanism whereby different sized pages can be supported. If a table walk terminates early by finding a
translation in a higher table, then that page is 2^(n × number-of-early-steps) times bigger than the fundamental
page size. For example, with 4Kbyte pages, and 10-bit page number sub-indexes, terminating one step before the
4Kbyte translation table gives a page size of 2^10 × 4Kbyte = 4Mbyte.

Real World Example, RISC-V


Having described the virtual memory functions of an MMU, let’s look at a real-world example to bring the
theory together.

Volume 2 of the RISC-V specification (see sections 4.3 to 4.5), defines page-based support for virtual memory
for various size address spaces. It defines 3 sizes, as listed below:

• Sv32 maps 32-bit virtual addresses to 34-bit physical addresses, 4Gbyte virtual space, 16Gbyte physical
space, two level page tables
• Sv39 maps 39-bit virtual addresses to 56-bit physical addresses, 512Gbyte virtual space, 64Pbyte
physical space, three level tables
• Sv48 maps 48-bit virtual addresses to 56-bit physical addresses, 256Tbyte virtual space, 64Pbyte
physical space, four level tables

We will concentrate on Sv32, as an example, as the diagrams are less messy, but I hope it will be clear how this
would scale to the larger address spaces. The diagram below shows the format of the virtual- and physical
addresses, along with the format of a page table entry:

Since Sv32 is for a 32-bit system, the virtual address is 32 bits. The pages are 4Kbytes, and so the lower 12 bits
of the address are an offset into a page. The next 20 bits are the virtual page number, but these are divided into
two 10-bit values as shown. The physical address format is bigger, with the same offset bits for a 4Kbyte page,
but 22 bits for the physical page number. For a 32-bit architecture, only 4GBytes of virtual space can be
accessed by an individual process. However, since there are multiple processes, these can be mapped to a large
physical memory space. In this case, to a 16Gbyte space. Thus, a system can choose to add up to 16GBytes of
memory, even though the architecture can only address 4Gbytes.

The page table entry (PTE) consists, firstly, of the 22 bits of a physical page number. (Note that bits 33 down to
12 of the physical address are mapped to bits 31 down to 10 of the PTE.) A couple of bits are reserved, and then
some control and status bits are defined for privilege and permission control, some TLB cache status and other
status bits for use by the system.

Firstly, the valid bit just marks the PTE as containing a translation. This is used for both the table entry in RAM
and when in the TLB. The next three bits define the access permissions and hierarchy of the table entry—i.e.,
whether it points to another table or not. The diagram shows what each combination means, with greyed out
entries reserved. The values are mostly combinations of read, write and/or execution permissions. If all three
bits are 0, however, then the entry is a pointer to the next table in a table hierarchy, and the PPN is the physical
page number where that table is located.

The user bit indicates that the page has permissions for access in user privilege mode (the lowest privilege for
RISC-V). The global bit is used if the system supports multiple address spaces, and the page is mapped to all
the spaces. The accessed bit, along with the dirty bit, is used much as for the caches described in the previous
article. The accessed bit is used for choosing which pages to swap out when a new page needs to be loaded.
This could be from the TLB to RAM, or from RAM to the swap space on disk. Similarly, the dirty bit indicates
that the page has been written since last being written back to the layer below, thus indicating whether it needs
to be written back or not if swapped out.

Having defined the format of the various addresses and page table entries, let’s look at how this all works
together. The diagram below shows how an Sv32 two-stage lookup might work.

When the processor does a load or store, the VPN bits from the virtual address are presented to the TLB. If the
translation is in the TLB (a hit), then the physical page number (PPN) is returned. The physical address to
memory is then the PPN concatenated with the 12 bits of offset. The data at this physical address might be in
the data (or instruction) cache, or the cache system may have to retrieve this from main memory, just as
described in the previous article.

If the entry is not in the TLB (a miss), then a table walk must be started to look up the translation in RAM. In a
RISC-V system the root table’s page number for the running process is stored in a supervisor CSR
called satp (Supervisor Address Translation and Protection). The equivalent in an x86 system is the CR3 register.
This satp register will be updated at each context swap to point to the tables for the process being restarted. The
top bits of the virtual address are used to index into the first table. The entry there will have the xwr bits set to 0
(unless it is a ‘super page’), and the PPN in the entry points to the next table’s physical page where the actual
translation resides. The lower VPN bits then index into that page to get the PPN for the access. This would be
loaded into the TLB, possibly unloading a resident entry first, and the process continues as if for a hit. It is at this
point that the access permissions are checked, and an exception/trap/interrupt raised, if the checks fail, so that
the operating system can intervene as appropriate. Needless to say, the access is not completed in this
circumstance.
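A software model of that Sv32 sequence, reduced to its essentials, might look like the sketch below. It is illustrative only: the read_phys32() helper is a hypothetical stand-in for a physical memory read, permission checking and accessed/dirty updates are omitted, and a level-1 leaf is assumed to be a correctly aligned 4Mbyte super page.

#include <stdint.h>

#define PAGE_SHIFT   12
#define PTE_V        (1u << 0)          /* valid bit                          */
#define PTE_XWR_MASK (7u << 1)          /* execute/write/read permission bits */

extern uint32_t read_phys32(uint64_t paddr);   /* hypothetical physical read */

/* Translate a 32-bit virtual address to a 34-bit physical address with a
   two-level Sv32 walk. 'satp_ppn' is the root table's physical page number
   taken from the satp CSR. Returns 0 to indicate a page fault (for simplicity). */
uint64_t sv32_translate(uint32_t satp_ppn, uint32_t vaddr)
{
    uint32_t vpn1   = (vaddr >> 22) & 0x3ff;    /* upper 10 VPN bits  */
    uint32_t vpn0   = (vaddr >> 12) & 0x3ff;    /* lower 10 VPN bits  */
    uint32_t offset =  vaddr        & 0xfff;    /* 12-bit page offset */

    /* Level 1: index the root table with VPN[1]; each PTE is 4 bytes */
    uint64_t pte_addr = ((uint64_t)satp_ppn << PAGE_SHIFT) + vpn1 * 4;
    uint32_t pte      = read_phys32(pte_addr);

    if (!(pte & PTE_V))
        return 0;                               /* fault: no translation */

    if (pte & PTE_XWR_MASK)                     /* leaf here: a 4Mbyte super page */
        return (((uint64_t)(pte >> 10)) << PAGE_SHIFT)
               | ((uint64_t)vpn0 << PAGE_SHIFT) | offset;

    /* Level 0: the PTE's PPN points at the next table; index with VPN[0] */
    pte_addr = (((uint64_t)(pte >> 10)) << PAGE_SHIFT) + vpn0 * 4;
    pte      = read_phys32(pte_addr);

    if (!(pte & PTE_V) || !(pte & PTE_XWR_MASK))
        return 0;                               /* fault: invalid, or a pointer where a leaf was expected */

    /* Leaf PTE: PPN (PTE bits 31:10) rejoined with the original page offset */
    return (((uint64_t)(pte >> 10)) << PAGE_SHIFT) | offset;
}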
So that’s how, in general, an MMU works, using real-world formats, and there’s nothing more to discuss, right?
Well, there is one more thing where we need to go back to the cache and work out how the virtual memory, the
MMU and the data/ instruction caches interact.

Virtual Memory and Caches


Until now we have looked at virtual memory independently of a memory sub-system’s caches. When, in the
previous article, we looked at caches it was assumed that all addresses were physical addresses. Looking at both
together, the most obvious thing to do to combine the two would be to first translate virtual to physical
addresses and then send the physical address accesses through the cache. This is a valid arrangement, and is
known as a ‘physical index, physical tag’ architecture (PIPT). The diagram below shows this arrangement.

The disadvantage of this arrangement is that (assuming the VA hits in the TLB) the TLB must be looked up
first, and only then the cache inspected for a hit on the physical tag. The two lookup latencies are added.

A solution might be to work on the virtual addresses instead. This arrangement is known as ‘virtual index,
virtual tag’ or VIVT. The diagram below shows this architecture.

This arrangement avoids the TLB latency if the data is in the cache, as the hit cache entry is a virtual address
and physical memory references are not needed. However, it has a couple of problems. Firstly, the TLB entry
must still be looked up for checking the permission bits. Secondly, the cache would need flushing each time
there was a context switch, as processes can use the same virtual addresses and there would be a clash if a
matching address from a different process remained in the cache. PIDs/context numbers as part of the tag might
alleviate that somewhat, but that is more complicated and, whilst a given process is running, the cache entries for
a different process will definitely not be hit, wasting cache space for the running process.

A compromise solution is where the virtual address is used for the cache index, but the physical address is used
for the tag. This is known as ‘virtual index, physical tag’ (VIPT). The diagram below shows this arrangement.

In this arrangement both the cache and TLB lookups are done in parallel. Since the physical address’s tag was
stored in the cache tag RAM, then the physical address returned from the TLB can be compared with the
physical address from the cache tag lookup and a hit/miss flagged. This architecture then reduces the latency to
that of the longer of the TLB and cache lookups which, hopefully, are comparable.

Conclusions
In this article we have looked at virtual memory and what problems it is trying to solve for a multi-tasking
environment. A virtual view of memory is given to each task/process/program so that the software does not
need to be aware of the resource clashes caused by running lots of different code concurrently.

Virtual memory support is implemented by dividing the virtual and physical address spaces into pages, and then
the virtual pages are mapped to physical pages, dynamically. Page tables are resident in memory to keep track
of these mappings, with page table sets for each running process. The logic to do the table lookups and virtual to
physical address translations is implemented in an MMU, which also checks for access permissions. The
MMU might keep a cache of translations in a TLB in order to speed up the process, just as for a data or
instruction cache that speeds up main memory accesses.

We looked at a real-world example of virtual memory formats for addresses and page tables to consolidate the
theory, using the RISC-V specification. We also had to discuss how the data/instruction caches interact with the
virtual memory functionality in order to get an efficient system.

Over these two articles I’ve summarised the main aspects of a typical memory sub-system, and it might seem
that every time a feature was added to solve a particular problem a new one arose which needed solving. For
example, coherency problems caused by caches in a distributed multi-processor environment, or the positioning
of VM translations with respect to those caches to avoid unnecessary latencies. A processor-based environment,
as we saw historically, would function without these systems, but not very efficiently. It is often the case that
engineering is about optimising systems with added complexity to get the most from the capabilities.
Sometimes a solution causes a new issue which must be solved but, over time, a solution is found to all of these
to get a system with best performance.

If you never need to work directly on, or with, memory sub-systems I hope this and the previous article will act
as a case-study on how systems need to be refined in engineering to get a better solution, even if that adds
complexity. If you work within a processor-based system (as is likely) an awareness of the memory sub-system
functionality is useful when designing within that system (in either logic or software realms) and I hope these
summaries are useful to that end. I have, in the past, worked more directly on designs that implement MMU
functionality but, even in completely different environments where the design space was outside of the
processor systems and sub-systems, I have had to design DMA functionality that can traverse over buffers in
contiguous virtual memory, but distributed in physical memory, and solve issues caused by cache coherency in
order to make it function correctly. This would not have been possible without a knowledge of the functionality
discussed in these articles.
SoC Bus and Interconnect Protocols #1: Busses (APB and AHB)
Simon Southwell
Published Sep 7, 2022

Introduction
Not so long ago I was in a meeting with a representative of a supplier (FPGA, tools, and support services),
along with a manager and another engineer. In the discussion we were looking at features of both Intel (which
we were using) and Xilinx FPGA products when it was mentioned that Xilinx uses AXI bus for interconnect.
(Intel products do too, but one may use the Intel/Altera Avalon protocol in its place.) At this point the other
engineer launched into, what I can only describe as, a rant about how complicated AXI was and how difficult it
was to use compared to Avalon, as if it had been deliberately done this way to foil would-be engineering users.
This notwithstanding that I was using AXI in our designs to have access to cache coherent data from the FPGA
logic as this was not available using Avalon (see my article on caches regarding coherency issues).

I'm sure the rep was not responsible for the AXI protocol, or in a position to do much about it. That said, I want,
in these articles, to dispel this misconception that AXI, or other bus protocols, are deliberately obtuse. The good
people at ARM (just a few klicks from where I'm writing), I believe, have done a fine job of creating bus
protocols that meet the requirements of processor based embedded systems (and beyond), with features added
only where necessary to create an efficient system. Not only that, they have carefully allowed certain features to
be optionally used so that only the complexity actually required need be considered. I have been working with
ARM based systems for many years now (decades, even), and have been grateful that certain features were an
option I could use (such as the cache coherent accesses I mentioned before).

If one comes to a specification that is the result of decades of development and refinement without any prior
knowledge, then this can be overwhelming. However, breaking a specification down into its individual features,
which ones are required and which optional, and understanding what problem each feature is trying to solve,
then the problem becomes more manageable. In these articles, then, I want to use the ARM AMBA bus
protocols as a case study as these are the most likely encountered and have all the features I want to talk about.
So, we will start with APB for peripherals before moving to AHB for higher speed (both protocols which I have
used in many SoC designs). Then we will move to AXI in the next article and will see that there has been an
evolutionary path from these other protocols, building to this interconnect specification. To conclude, I will
review some alternative protocols along with the ARM coherent hub interface (CHI) next generation
interconnect specification. By building up in steps, I hope to make the journey to AXI comprehensible.

AMBA Overview
The diagram below shows the different protocols for the AMBA (Advanced Microcontroller Bus Architecture)
suite of interconnects and their revisions.
In these articles we will concentrate on AMBA versions 3 and 4, though version 5 will be mentioned, and we
will stick to APB, AHB and AXI, but CHI features will be reviewed. The Advanced Peripheral Bus (APB) is
meant for low power, low bandwidth peripherals. For example, an I2C controller where the serial protocol is
low bandwidth compared to the main SoC system, so the interface to the system bus does not need to be high
bandwidth. The Advanced High-performance Bus (AHB) is a true bus protocol and meant for high bandwidth
interconnect and peripherals. For example, a 100Mbps ethernet controller. Before AXI, AHB was the mainstay
of a system bus in an ARM based system. To meet the needs of higher performance systems, with multiple
cores and parallel high-speed data transfer requirements, the Advanced eXtensible Interface (AXI) protocol was
developed. This is an interconnect rather than a true bus, and devices are connected through an interconnect
matrix—a small switching network, if you will.

Systems may have a mixture of these protocols, including AHB for reuse of legacy components with this
interface, and these are usually arranged in a hierarchy, with bridges connecting different busses. A typical
structure of an SoC based around AMBA bus protocols might look something like the following diagram:
The diagram shows the main system bus is AXI, with its interconnect fabric. A bridge connects one of the AXI
ports to an AHB bus, though multiple busses might be supported. On the AHB bus, another bridge connects an
APB bus, though it might have been connected directly to an AXI interconnect port. In general, though, there
would be a hierarchy, with lower bandwidth and latency tolerant peripherals further away from the main system
bus, and high-speed, low latency peripherals nearer the main bus.

I want to build up to AXI in steps from the simplest protocols, so that it is clearer what added functionality the
more complex protocols bring, and the problems they are trying to solve, rather than jumping straight to a list of
all the AXI signals with no context. So, let’s start with APB.

APB
APB is the simplest of the AMBA protocols, and the one I’ve probably added to IP I was developing the most.
By its very nature, it does not support advanced features meant for high-speed efficient data transfer, as it is
assumed peripherals requiring such features do not use APB. Therefore, it only supports single word transfers,
and there are no multiple word ‘burst’ transfers. It also does not support ‘split transactions’, that is, where a
request for data is accepted but the data is returned at some future time. A whole transfer is completed before another can be
started, so there are no ‘overlapping’ transfers, and it can be big- or little-endian (the default). There is optional
support for byte writes (when the bus width is wider than a byte), for some protection modes and user defined
attributes. The use of optional features we shall see is common in the protocols, some of which were added in
later specifications. This is useful for simplifying interface design for IP that does not, or cannot, use these
features, or interfacing to peripherals of an older specification, even if the manager interface (ARM uses the
terms manager and subordinate) has the optional features.

Signals
So, the APB signals are defined as mandatory or optional. The table below shows the mandatory signals.

The clock (PCLK) and reset (PRESETn) are global signals, and all the peripherals on the bus will connect to the
same sources. The reset is often connected to the system reset, but may be different if separate APB resetting
required, or if clock domains are different. Each peripheral on the bus will have an individual select line
(PSEL), and some address decode logic will generate a mutually exclusive select when a peripheral is being
accessed.

The other signals are the data transfer signals and transfer control. This is for single word, memory mapped
transfers, so there is a write-not-read strobe (PWRITE) with an address (PADDR), and then separate read and
write data busses (PRDATA and PWDATA). The width of the address and data busses is variable, defined with
a parameter as shown in the table. The address width is up to 32 bits, whilst the data width can be 8, 16, or 32-
bits.

The PENABLE signal, along with PREADY, is used to ‘sequence’ the transfer, as we will see shortly. The final
mandatory signal is PSLVERR. The name is a legacy one, as ARM now uses the terms manager and subordinate
for requesters and completers. This signal flags any error in the transfer (read or write) by the subordinate. It is
mandatory for a manager device but is optional on a subordinate. If a subordinate does not have a PSLVERR
output, then the manager’s input signal is tied low.

The optional signals were introduced in AMBA 4 and 5. These signals are shown in the table
below:

The 3-bit PPROT signal, generated by the manager, indicates the protection privilege level—normal, privileged,
or secure—if the peripheral can make distinctions between these levels. The PSTRB signal is effectively byte
enables for the write data bus. This can only be present if DATA_WIDTH is greater than 8, where each bit
corresponds to a write enable for the byte lane on PWDATA.

The other signals (in red) were all introduced in AMBA 5, so we will just briefly discuss them here. PWAKEUP is generated by a manager for a subordinate that supports different power states, to indicate that there is activity on the bus. It is asynchronous to PCLK. The user signals are undefined and are for any additional custom signalling that a design might wish to have. Each has its own (bounded) parameter to define its width.

Transfers
APB transfers are two-phase, with the ability for the subordinate to insert wait states, using the PENABLE and PREADY signals. At the start of a transfer, a peripheral's PSEL line is asserted, and PENABLE is set low by the manager to indicate the first phase. The address is set, with PWRITE either high or low to indicate a write or a read. The optional PPROT signal is also set in phase 1. The PSEL, PADDR, and PWRITE will remain unchanged for the rest of the transfer (as is PPROT if present). If a write, then PWDATA is also set and remains unchanged and, if PSTRB signalling is supported, this too is set in phase 1 and held.

In the cycle after PENABLE is low, PENABLE will go high to indicate the second phase. It is in the second phase that a peripheral can insert wait states. During the first phase, the subordinate's PREADY signal is a don't care, but will go low if wait states are to be asserted (for instance if there is no space to accept a write, or it takes multiple cycles to fetch read data). It doesn't have to set PREADY low if it can complete the transfer immediately. When PENABLE is high, and PREADY is also high, then phase 2 completes and the transfer is finished. The use of PSEL and PENABLE means that a transfer to the same peripheral can begin in the next cycle. In that case, PSEL remains high, but PENABLE is deasserted to indicate a new phase 1. The diagram below shows a read transfer with wait states (shaded).

The subordinate inserts two wait states by de-asserting PREADY for 2 cycles when PENABLE goes high. It then asserts PREADY, along with the read data (PRDATA), to complete the transfer. Not shown in the diagram, but if the peripheral had a PSLVERR signal, this would be asserted (if an error occurred) when PREADY is high, and PRDATA would be a don't care. A write transfer, with two wait states, is shown below:

The signalling and timings are almost identical to a read, but PWRITE is high instead of low, and PWDATA is
set in phase 1, and remains unchanged (PRDATA from the subordinate is a don’t care). If PSTRB byte enables
are supported, this is asserted in phase 1 and held for the whole transfer, along with the optional PPROT if
present.

This, then, is all that’s needed for an APB interface. A simple memory mapped word transfer interface with
some configurability, optional signalling, and a straight-forward two-phase protocol with simple flow control.
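
As an illustration, the sketch below shows what a minimal APB completer might look like in Verilog: a small register bank with no wait states. It is a hypothetical example, not taken from any particular library, and the register decode and widths are arbitrary.

  // Minimal APB subordinate sketch: four word-wide registers, no wait states,
  // no PSLVERR (so the requester's input would be tied low).
  module apb_regs
  #(parameter ADDR_WIDTH = 12, DATA_WIDTH = 32)
  (
    input                           PCLK,
    input                           PRESETn,
    input                           PSEL,
    input                           PENABLE,
    input                           PWRITE,
    input      [ADDR_WIDTH-1:0]     PADDR,
    input      [DATA_WIDTH-1:0]     PWDATA,
    output reg [DATA_WIDTH-1:0]     PRDATA,
    output                          PREADY
  );

    reg [DATA_WIDTH-1:0] regs [0:3];

    // A simple register bank can always complete in phase 2
    assign PREADY = 1'b1;

    always @(posedge PCLK or negedge PRESETn) begin
      if (!PRESETn) begin
        PRDATA <= {DATA_WIDTH{1'b0}};
      end else if (PSEL) begin
        if (PWRITE) begin
          // Complete the write when the second (PENABLE) phase is reached
          if (PENABLE)
            regs[PADDR[3:2]] <= PWDATA;
        end else begin
          // Register the read data in phase 1 so it is valid in phase 2
          PRDATA <= regs[PADDR[3:2]];
        end
      end
    end

  endmodule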

AHB
The Advanced High-performance Bus has many of the characteristics of the APB bus, with some signalling
that’s very similar. Now, though, the protocol must be able to move bulk data efficiently and, being nearer the
processor system, must support features that the processor system might need. This includes things like support
for locked, secured, or exclusive transfers. For even better efficiency, transfers can overlap, though the basic
transfers are much like for APB, and the general structure is very similar. The diagram below shows the basic
transfer architecture, and could equally apply to APB (with different signal names):

Signals

So, let’s start with what is the same on AHB as for APB. The table below shows the signals:
As shown in the table, these ‘common’ signals map almost one-to-one with APB signals. The exception is that
there is no equivalent of a PENABLE signal, and sequencing the transfer is done a little differently to cater for
overlapping transfers (as we shall see). Like in APB, the protection signal (HPROT) is optional, and HWSTRB
isn’t introduced until AMBA 5 (hence red in the table).

New signals are introduced over APB to determine the amount of data to be sent, and the nature of the order of
the data. In addition, a transfer can be ‘locked’—i.e., indivisible, to support atomic operations. AMBA 5
introduces some extra signals for secure and exclusive operations, as well as a manager ID, for managing access from multiple managers on a single bus. These optional signals are shown in the table below.
The first two signals (HBURST and HSIZE) are associated with the transfer of data and HMASTLOCK for
locked transfers, used for things like semaphores. The two-bit HTRANS signal is what replaces PENABLE and
sequences the transfers on the bus and the select line takes a less critical role, which we shall look at shortly.
The other signals, in red, are for AMBA 5 and no more will be said about these here.

Transfers
At its simplest, then, AHB can transfer single words much like APB. The diagrams below show the timings for
single word accesses, firstly for a read, followed by the timing for a write:

For clarity, the HSEL signal is implied for the whole of the transactions, and for single cycle access HTRANS
is set at a value of NONSEQ in the first phase, and then IDLE from then on, assuming no further transactions
(see below). HADDR is only valid in the first phase, as shown, as are HWRITE and other manager control
signals listed in the tables, for a single word transfer. Wait states are inserted by the subordinate using
HREADY, just as for APB and PREADY. Not shown, but HRESP (like PSLVERR) is asserted at the end of the
transaction if an error occurred.

The AHB protocol would be no better than APB if this was all that was possible. The first facility it has is that
transfers can overlap. Since the address and general control signalling is valid in the first phase, then when data
is transferred in the second phase, a new address and control may be asserted so that all cycles are used for data
transfer, assuming no wait states inserted by the subordinate. The diagram below shows some overlapping transfers (including wait states):
Here we see a write to address A in the first cycle, the data is transferred in the second cycle, as HREADY was
asserted. In that second cycle, a new read is instigated to address B. The subordinate de-asserts HREADY in the
next cycle, so this is a wait state, but it’s asserted in the following cycle, and the read data is returned. Another
write, to address C, was instigated in the cycle after address B, but this is held whilst the wait state is active, and
into the cycle with data for B returned. The data for address C is written in the cycle after this. Without the wait
state, data would be transferred in consecutive cycles after address A is asserted, giving 100% efficiency.

The next feature above APB that AHB has available is that multiple words can be transferred on a single
address ‘command’—a burst. Until now, we have looked at a single word transfer, largely ignoring the
HTRANS, HBURST, and HSIZE signals, which were more or less fixed (HTRANS was NONSEQ/IDLE,
mirroring, somewhat, PENABLE).

The HTRANS signal, as mentioned before, sequences the transfers. The two-bit signal has four possible values:

• IDLE: indicates no transfer from the Manager. A subordinate must still return an OK response on
HRESP (low) if selected with HSEL.
• BUSY: allows a Manager to insert wait states in the middle of a burst (which APB managers can’t)
• NONSEQ: indicates the first transfer of a burst (or the only transfer of a single word transfer—
effectively a burst of 1).
• SEQ: indicates subsequent transfers in a burst

The number of words in a burst is determined by the HBURST signal. The simplest we have seen already, a
single transfer. This is the default setting and, remembering that HBURST is optional, is what happens when
this signal is not present. The other types of transfer are for an incrementing burst of undefined length, or fixed
size burst of 4, 8, or 16 words, either incrementing or wrapping.

The table below shows the encodings for the burst types:
The incrementing types are fairly self-explanatory, though a restriction of an undefined length burst is that it must not cross a 1Kbyte boundary. For all the incrementing types, the address must be aligned with the width of the data being transferred. For example, for word transfers on a 32-bit wide data bus, the bottom 2 bits of the address must be 0. At each data transfer, the address increments by the appropriate number of bytes.

For the wrapping burst types, the addresses still increment by default, but will wrap at the address boundary
determined by the wrap length. For instance, if a wrap4 type is selected, and the first address is 0x34, the
address sequence is 0x34, 0x38, 0x3c, 0x30. The address signal HADDR follows these incrementing or
wrapping address values—it is not implied after the command, and the subordinate need not work out the address
sequence (though it could). The diagram below shows a four-word incrementing burst read transfer:

It is assumed that HSEL is active in the first cycle, and for the entire transfer. Also not shown, but for every word transfer an HRESP response is required. The transaction is started with HTRANS as NONSEQ and an address of 0x20. The transfer is a read since HWRITE is 0, and HBURST indicates an incrementing burst of undefined length. In the second cycle, data is returned from the subordinate, but also HTRANS is now BUSY, inserting a wait state from the manager. The transfer is resumed in the next cycle with HTRANS at SEQ, indicating subsequent data transfers in the burst. This proceeds for addresses 0x24 and 0x28, then a subordinate wait state is inserted, with HREADY being de-asserted during the 0x2c address phase, which is held until the next cycle. The data for 0x2c is returned in the last cycle.

What happens next depends on what the signals are for the last cycle. If HSEL was de-asserted, then the transfer has finished. If HSEL remains asserted, but HTRANS is IDLE, then no new commands have been issued. With HSEL asserted and an HTRANS value of NONSEQ, a new burst (or single word transfer) is started, with any new start address required. Only with an HTRANS of SEQ would the current burst continue (possibly after IDLEs). Note that, for a fixed burst size, a burst can be terminated early with a new NONSEQ command before all the data is transferred.
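
To make the burst addressing concrete, the sketch below shows one way the expected next beat address could be computed, for example by a subordinate checking HADDR, or a manager generating it. It is illustrative only, assuming the standard HBURST encodings (WRAP4 = 010b, WRAP8 = 100b, WRAP16 = 110b) and byte addressing.

  // Compute the next beat address of an AHB burst from the current beat's
  // address, burst type and size. Incrementing (and single) bursts simply
  // add the beat size; wrapping bursts hold the upper address bits constant.
  function [31:0] ahb_next_addr (
    input [31:0] haddr,
    input [2:0]  hburst,   // WRAP4 = 010b, WRAP8 = 100b, WRAP16 = 110b
    input [2:0]  hsize     // bytes per beat = 1 << hsize
  );
    reg [31:0] incr_addr;
    reg [31:0] wrap_mask;
    begin
      incr_addr = haddr + (32'd1 << hsize);

      case (hburst)
        3'b010:  wrap_mask = (32'd4  << hsize) - 1;  // wrap at  4 beats
        3'b100:  wrap_mask = (32'd8  << hsize) - 1;  // wrap at  8 beats
        3'b110:  wrap_mask = (32'd16 << hsize) - 1;  // wrap at 16 beats
        default: wrap_mask = 32'hffffffff;           // no wrapping
      endcase

      // Keep the bits above the wrap boundary fixed, increment within it
      ahb_next_addr = (haddr & ~wrap_mask) | (incr_addr & wrap_mask);
    end
  endfunction

For the wrap4 example above (word transfers starting at 0x34), this returns 0x38, 0x3c and then 0x30, matching the sequence given.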

As stated previously, the width of the data busses is fixed by a parameter DATA_WIDTH. Although it is
recommended that this is 32-bits, it can be set as little as 8 bits (1 byte), or as much as 1024 bits (128 bytes).
Data less than the width of the bus may still be transferred if HSIZE is set to be less than this width. The table
below shows the encodings for HSIZE.

The timing for HSIZE is that it is set in the first phase and held for all the address phases of the transfer. This signal, along with HADDR and HBURST, determines which lanes on the bus are active. For example, on a 32-bit bus with an incrementing burst, if HSIZE indicates a half-word transfer (001b) and the lower 2 bits of the address are 10b, then the upper two lanes are active. In addition, for writes, the HWSTRB signal (introduced in AMBA 5) can modify 'active' lanes to not be written, so that non-consecutive bytes are updated. This is called sparse writing.
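
A sketch of this byte lane selection for a 32-bit bus is shown below (a hypothetical helper function, assuming little-endian lane numbering with lane 0 as the least significant byte). Write strobes would then be the AND of these enables with any HWSTRB bits.

  // Derive the active byte lanes on a 32-bit AHB data bus from the low
  // address bits and HSIZE. For the example in the text (halfword, address
  // ending 10b) this returns 4'b1100, i.e. the upper two lanes.
  function [3:0] ahb_lane_enables (
    input [1:0] addr_lo,   // HADDR[1:0]
    input [2:0] hsize
  );
    begin
      case (hsize)
        3'b000:  ahb_lane_enables = 4'b0001 << addr_lo;             // byte
        3'b001:  ahb_lane_enables = 4'b0011 << {addr_lo[1], 1'b0};  // halfword
        default: ahb_lane_enables = 4'b1111;                        // word
      endcase
    end
  endfunction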

The last AHB signal to discuss is HPROT. This is similar to the PPROT signal we saw for APB, but with some slight differences. For AMBA 3 this is a 4-bit signal but, by AMBA 5, it is extended to 7 bits. The first three
bits are (largely) the same as AMBA 3, but the rest extend the control over cache accesses and whether transfers
can be modified, when transiting through the bus structure. AHB, then, adds some control over cache coherent
accesses with AMBA 5 extending this control. (For the importance of cache coherent accesses, see my article
on caches.) Here we confine ourselves to looking at AMBA 3, and the table below shows the HPROT
encodings:
The timing for HPROT is that it is set in phase 1 and held for the rest of the address command cycles for the
transfer. This concludes the review of the AHB protocol.

Conclusions
In this article we have looked at two of the AMBA bus protocols, starting with the simplest (APB) and then moving to a higher performance bus (AHB), where additional features are added to move data within a processor-based system more efficiently, using burst and overlapping transfers, and to address the needs of such systems, such as requiring cache coherent accesses. The idea is to show a heritage from APB to AHB, with many common features.

As systems have evolved, with multi-core designs becoming common, AHB evolved to provide features for these, and AMBA 5 adds some of this over the AMBA 3 specifications. However, the limitations of a bus-oriented system start to show the more cores (or other manager components) that are required on the bus. Therefore, a new architecture based on a switching fabric is required, where multiple transfers from multiple managers are handled in parallel. The AXI bus was developed to do just this. As we shall see, this still has features that will be familiar from APB and AHB, but the architecture is somewhat different. We shall look at this, along with a review of alternative protocols and of AMBA protocols beyond AXI, in the next article.
SoC Bus and Interconnect Protocols #2: Interconnect (AXI)
Simon Southwell

Introduction
In the first article we covered the APB and AHB bus protocols. The defining common characteristic was that they were both true busses, with a single manager instigating transactions over the bus at any one time. The main differences between APB and AHB were some added features in AHB to allow overlapped and burst transfers. The assumption has been that there is only a single manager. AHB can, in fact, support multi-manager operation via an interconnect component that provides arbitration and routing of signals from different managers to the appropriate subordinates. In the Advanced Extensible Interface (AXI) protocol, this use of an interconnect now becomes the mode of operation, and the 'bus' itself is now a single point-to-point interface. Connection between managers and subordinates is done via interconnect matrices that can connect multiple managers to different subordinates, with data transfers occurring in parallel. An individual matrix will have a finite set of ports for connecting managers, and a similarly finite set for connecting subordinates. If more ports are required, other matrix components can be connected as subordinates. The diagram below shows a simplified arrangement of an AXI based system, with a second interconnect matrix component.

Another difference between AXI and its predecessors is that, instead of a single set of signals with outputs for sending commands and write data and inputs returning read data and responses, these functions are now split into five separate, independent interfaces, or 'channels'. Three of these are for writes: a write address channel, a write data channel, and a write response channel. The first two are instigated by the manager, and the last by the subordinate. The two remaining channels are a read address channel, from the manager, and a read data channel, from the subordinate. The read response is combined with the read data channel as this is flowing in the same direction. The diagram below summarises this situation.
These channels, at the signal level, are truly independent but, of course, have to be consistent at the higher level.
For example, a manager might send multiple write commands over the write address channel (if the subordinate
can accept these) and then, at some future point, send the data for the outstanding writes. It (usually) has to send
the write data in the order that was dictated by the address commands, but there is no timing coupling between
the interfaces. (Out-of-order completions can be supported.) Indeed, if I read the specifications correctly, one
might send all the data first (again, assuming the subordinate has space to accept this) and then the write
commands. (I’ve not actually tried this, but I believe this is valid behaviour.)

What it does allow is (like for AHB) overlapping accesses, with addresses sent before the end of a previous
burst, to allow full usage of the data bus bandwidth. We will be looking at bus signal timings shortly.

For the rest of the article, we will concentrate on the AXI-3 specification, though we will be making reference
to AXI-4 and even AXI-5. This is to keep the length of the article manageable, and the aim is to get familiar
with AXI in general (hopefully to the point where implementing such an interface does not seem too daunting)
and be able to discover more features in the newer specifications oneself if needed for a design’s particular
functionality.

Common Channel Characteristics

Each of the five channels has some common characteristics, which I will discuss here so that they can be taken
as read when looking at the individual interfaces.

Like for AHB and APB, AXI has global clock and reset signals ACLK and ARESETn, with the reset being
active low. These act in the same way as for the other interfaces.

Each of the channels has a handshaking method which, in my opinion, is simpler than for APB and AHB. All channels have a two-signal handshake, with an xVALID signal from the source, and an xREADY signal from the destination. Only when both signals are high is a command or data word transferred. This makes inserting wait states from either end of the link possible without additional signalling (see HTRANS' BUSY state for AHB in the previous article).
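
To show how simple this handshake is to implement, the sketch below is a single-register buffer stage that could be placed on any of the channels. It is a generic illustration (the module and port names are not from any particular library), with the incoming side using one valid/ready pair and the outgoing side another.

  // One-deep register stage for an AXI-style valid/ready channel. A word is
  // accepted when s_valid && s_ready and passed on when m_valid && m_ready.
  module axi_ch_reg
  #(parameter WIDTH = 32)
  (
    input                  clk,
    input                  resetn,
    // From the channel source
    input                  s_valid,
    output                 s_ready,
    input      [WIDTH-1:0] s_data,
    // Towards the channel destination
    output reg             m_valid,
    input                  m_ready,
    output reg [WIDTH-1:0] m_data
  );

    // Can accept a new word if the output register is empty, or is being
    // emptied this cycle
    assign s_ready = !m_valid || m_ready;

    always @(posedge clk or negedge resetn) begin
      if (!resetn)
        m_valid <= 1'b0;
      else if (s_valid && s_ready) begin
        m_valid <= 1'b1;
        m_data  <= s_data;
      end else if (m_ready)
        m_valid <= 1'b0;
    end

  endmodule

The WIDTH parameter would be set to the total width of whatever channel payload is being carried (address plus control, write data plus strobes, and so on).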

Write Channels
The three write channels (address, data, and response), as we shall see, have some features in common with the protocols of the first article.

Write Address Channel Signals

The write address channel signals are summarised in the table below. This has some colour coding to help identify the required versus optional signalling. Signals in black represent signals required at both manager and subordinate. The green signals are optional for a manager but required by a subordinate, though they may be tied off with the default values, as indicated in the table. Red signals are required by the manager but not by a subordinate and may be left unconnected. The rest of the signals, in grey, are optional for both manager and subordinate.

Before we look at what these individual signals do, let's look at what minimal interfacing this channel requires. There must be three active signals—an address and the two handshake signals mentioned previously. The green signals can be tied off on a subordinate, and the red signal left unconnected on the manager. The default values for the green signals mean that all transfers will be from a single manager and be a full single word transfer. This is the minimum functionality of AHB and, I would argue, easier to implement with the AXI handshaking. This belies the criticism of AXI complexity levelled by the engineer I mentioned in the introduction to the first article.

All the signals for the AXI write address channel are prefixed with AW, such as AWADDR. As this is the write address channel, the address is the only fully required signal (with handshaking implied and common to all interfaces). Like for AHB, it is configurable in width and gives the address of the first transfer. Since the data is transferred over its own channel, the address does not change for each data transfer and the subordinate must internally increment the address appropriately for a burst.

The AWSIZE signal is basically the same as for AHB, defining the number of bytes that will be in each
individual data transfer, from 1 to 128. The length of a burst is defined with AWLEN, unlike AHB where this
was undefined, or from a limited set of fixed sizes, depending on the transfer type. This length can be, for AXI-
3, anything from 1 up to 16 words, though in AXI-4 incrementing transfers can be up to 256 words. In all cases,
a burst must not be issued that would cross a 4Kbyte boundary (cf. 1Kbyte restriction of AHB). Note that
AWLEN is actually the transfer length minus 1, so that 0 equals a 1 word transfer up to 255 for a 256-word
transfer.
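
Since the 4Kbyte rule must be honoured by whatever generates the commands, a simple check such as the sketch below might be used (an illustrative function, assuming an aligned incrementing burst).

  // Returns 1 if a proposed burst would cross a 4Kbyte boundary.
  function burst_crosses_4k (
    input [31:0] awaddr,
    input [7:0]  awlen,    // beats minus 1
    input [2:0]  awsize    // bytes per beat = 1 << awsize
  );
    reg [31:0] last_addr;
    begin
      // Address of the last byte of the last beat
      last_addr        = awaddr + ((awlen + 32'd1) << awsize) - 32'd1;
      burst_crosses_4k = (awaddr[31:12] != last_addr[31:12]);
    end
  endfunction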

The AWBURST signal is not quite the same as for HBURST of AHB. It does have an incrementing mode
(INCR) and a wrapping mode (WRAP) with the AWLEN dictating the wrapping characteristics, just as for
AHB. In addition, there is a FIXED mode, where each data transfer is to the same address. This might be useful
for pushing onto a queue at some fixed location in the memory map.

AWPROT defines the access's protection mode and is similar to AHB's HPROT, with bits for privileged, secure, and instruction (as opposed to data) accesses. The AWLOCK signal is 2 bits for AXI-3 and specifies NORMAL, EXCLUSIVE, and LOCKED accesses. AXI-4 actually simplifies this signal to a single bit for just NORMAL and EXCLUSIVE. The AWCACHE signal defines the access's cache attributes. In AHB these were part of the HPROT signal, and AWCACHE has many of the same attributes. It defines whether the access is bufferable, modifiable (i.e., attributes changed in transit), cached or non-cached, the read and/or write cache allocation, and write-through/write-back behaviour.

The AWID signal is an identification tag for the write transactions. It is optional for a manager and can be tied
off for a subordinate to the default value. When used, it identifies which manager port was used for the
transaction for routing back read data and responses, which will have a matching ID. For a single manager it is
not required.

The other signals—AWQOS, AWREGION, and AWUSER—are introduced in AXI-4. The first is related to
quality-of-service (i.e., minimum bandwidth/latency targets), the second indicates an address region identifier
allowing a single interface to have multiple logical interfaces on a single physical interface, and the last which
is a user defined signal. More details can be found in the AXI specifications (listed at the end of the article).

We shall look at timing in another section but, in general, all of the manager driven signals are set in a single cycle, when AWVALID is set. If AWREADY is high then the command is transferred, else the signals are held during the wait states until AWREADY does go high.
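
In RTL terms, a manager might drive the channel something like the sketch below (illustrative names only; start_write, req_addr, req_len_m1 and req_size are assumed internal signals of the design, and the AW outputs are declared as registers).

  // Hold the write address channel signals stable until accepted.
  always @(posedge ACLK or negedge ARESETn) begin
    if (!ARESETn)
      AWVALID <= 1'b0;
    else if (start_write && !AWVALID) begin
      AWVALID <= 1'b1;
      AWADDR  <= req_addr;
      AWLEN   <= req_len_m1;      // burst length minus 1
      AWSIZE  <= req_size;        // bytes per beat = 1 << AWSIZE
      AWBURST <= 2'b01;           // INCR
    end else if (AWVALID && AWREADY)
      AWVALID <= 1'b0;            // command accepted by the subordinate
  end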

Write Data Channel Signals

The write data channel interface is the data counterpart to the write address channel and has the same
handshaking method. All the write data channel signals are prefixed with W, and the handshaking signals are
thus WVALID and WREADY. The table below summarises the write data channel signals, with the same colour
coding as for the write address channel.

As for the write address channel, only three signals need to be active for a valid interface: the two handshaking signals and WDATA. The WLAST signal is required for a manager and is set on the last data transfer word of a burst (or the only transfer for a single word transfer). However, it need not be connected if the subordinate has no such input. The WSTRB signal (cf. HWSTRB of AHB-5) provides the byte enables for the data bus, if this is wider than 8 bits. It is optional for a manager and can be tied off to all ones in a subordinate, so that all lanes are always active.

The optional WID signal is a write identification tag, meant to identify the manager port that the data came from for data ordering purposes. This is defined for AXI-3 only, and the specification suggests not using it, relying instead on the AWID signal of the write address channel. Living with regret is just part of being an engineer. The
WUSER signal was introduced in AXI-4 for user defined signalling and is optional.

Write Response Channel Signals

Since the write address and data channels are both from manager to subordinate, a separate response channel is
needed. The signals are defined in the table below, with the usual colour coding. All write response channel
signals are prefixed with B (not sure why).
The only absolutely required signals here are the two handshaking signals. The BRESP signal indicates the
response status but, if a subordinate does not generate errors, this signal need not be implemented, and the
default response is OKAY. The valid values for this two-bit signal are:

• 00b : OKAY (no error)


• 01b : EXOKAY (exclusive access with no error)
• 10b : SLVERR (unsuccessful transfer)
• 11b : DECERR (decoder error)

Note that the top bit of BRESP acts like PSLVERR of APB or HRESP of AHB, whilst the bottom bit sub-divides this, indicating an exclusive-access OK (cf. HEXOKAY of AHB-5) when the response is good, or distinguishing a normal error from a decoder error when it is not.
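
At a manager, decoding the response might then be no more than the sketch below (signal names illustrative).

  // Qualify BRESP with a completed handshake on the response channel.
  wire bresp_handshake = BVALID && BREADY;
  wire bresp_error     = bresp_handshake && BRESP[1];           // SLVERR or DECERR
  wire bresp_exokay    = bresp_handshake && (BRESP == 2'b01);   // exclusive OK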

BID is a write identification that is required on a subordinate but is optional for a manager. When AWID is used
on a write command, the BID returned with the response will match that ID. The BUSER signal, introduced in AXI-4,
is completely optional and is for user defined signalling.

AXI Write Transaction Timing

Now we have defined all the signals for the write channels, let’s look at what a write transaction looks like for
the signal timing. The diagram below shows a write transfer for a 4-word incrementing burst transfer, where the
data width is a byte.
The timing diagram shows the write address channel at the top. A write is issued for address 0x0, for a burst of
length 4 (remember AWLEN is length – 1), with AWBURST showing incrementing addresses. Since the
AWREADY is low for the first cycle that AWVALID is high, a wait state is inserted, and the values are held
for a cycle.

Some cycles later, on the write data channel, the data appears with the setting of WVALID, and 4 transfers take
place. The first transfer had a wait state inserted by the subordinate with WREADY low. The WLAST signal (optional at the subordinate) is low for all the transfers until the final one. Note that WVALID is shown active for
the whole transfer, but it need not be. This may be deasserted at any point and then reasserted for the next
transfer.

To complete the whole transfer, sometime after the write data is transferred, the write response channel is
activated with BVALID asserted (driven by the subordinate), which waits for BREADY assertion (by the
manager) and, if present, the response is given on BRESP.

For all the other signals not shown, whether optional, or for different specifications, the timings are just the
same, being asserted when the appropriate xVALID signal is set.

Having looked at the write channels, we can turn our attention to the read channels. These have so much in
common with the write channels that we can run through this in fairly short order.
Read Channels
The two read channels (address and data) are very similar to the write channels, with signalling combined for
read data and response.

AXI Read Address Channel Signalling

The read address channel signals all have the prefix AR, but in every other respect they are the same as for the
write address channel. The table below shows the signals with the usual colour coding:

Suffice it to say we need not run through all these signals in detail and the section for the write address channel
applies to these except for the signal name prefix.

AXI Read Data Channel Signalling

The read data channel combines both the data and response signalling, and all the signals are prefixed with R.
The table below summarises these signals.
As ever, only the handshaking signals and the data signal are required at both ends. The RID (optional for the manager) is equivalent to the BID signal, and RLAST is required for the subordinate.
The remaining signals are optional and have the same functionality as for the write data channel, except in the
opposite direction (driven by the subordinate). So, we can now go straight into looking at the timings for a read
transaction.

AXI Read Transaction Timing

The diagram below shows a read transfer for a 4-word incrementing burst, where the data width is a byte.
The timing diagram shows the read address channel at the top. A read is issued for address 0x0, for a burst of
length 4, with ARBURST defining incrementing addresses. Since the ARREADY is low for the first cycle that
ARVALID is high, a wait state is inserted, and the values are held for a cycle. Some cycles later the read data is
returned on RDATA when RVALID is asserted. Two wait states are inserted (by the manager) during the data
transfer. The read response is also returned on RRESP. Note that this is separately valid for each data
transaction, so that while some values may have a response of OKAY, others may not. RLAST is asserted only with the last data transaction.

The other signals not shown have the same timings as the control signals that are shown, if they are required.

AXI Multiple Outstanding Transactions

As mentioned before, transfers can overlap with new addresses valid before a transaction has completed, and
data returned before a response for a previous transaction has been issued. This allows the bus to run at 100% efficiency.

The diagram below shows two overlapping read transactions for four-word bursts.
As can be seen, a read is issued, and data starts to return in the following cycles. Before the end of the burst, a
new read address is issued for a 4-word burst. When the last data word is transferred for the first read, the data
for the second read begins immediately.

Implementing AXI Interfaces


We have now been through all five channels of the AXI interface, with their timings. Hopefully, by now, you
have seen that the protocol is not so complicated and has many advantages. We have seen that it has many
advanced features to support such things as cache coherency and atomic operations, and also seen that many of
these features are optional. The diagram below summarises a minimum manager subordinate interface
connection:
For minimum compliance, just the address and data busses (both read and write) are connected, each with the valid/ready handshake signals. The only additional connected signalling is the write response handshake pair. All other signals are only required at one or the other end and can be left unattached or tied off to defaults, as shown. Even the manager's BREADY signal could be tied high internally to the manager.

For the address and data channels, as a minimum, each end could be a short FIFO, with the driving end setting xVALID when 'not empty' and popping the value when xREADY is also high. At the receiving end, xREADY is simply set on 'not full', and data is pushed into the queue when xVALID is also high. This can be improved with registered outputs and early full or empty statuses, or by not using queues at all for lower latency, but you get the idea. As this is the same for all interfaces, this logic need only be designed once and used for all address and data channels. Then, internally, any logic timing and implementation can be used as needed for the design, but one can now interface to an AXI port, either as a manager or a subordinate, and transfer data.
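
In Verilog terms the channel ends then reduce to something like the sketch below, where the FIFOs themselves (and their empty, full, push and pop signals) are assumed to exist elsewhere in the design and the names are purely illustrative.

  // Driving end of a channel from a command FIFO (e.g., write address):
  assign AWVALID    = !cmd_fifo_empty;        // present a command whenever queued
  assign cmd_pop    = AWVALID && AWREADY;     // pop it when the subordinate accepts

  // Receiving end of a channel into a FIFO (e.g., write data at a subordinate):
  assign WREADY     = !wdata_fifo_full;       // accept whenever there is space
  assign wdata_push = WVALID && WREADY;       // push on a completed handshake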

This only gives a minimalist interface for single word transfers but, from this base, additional functionality can
be added as needed for burst transfers, cache coherency, QoS, and whatever else is required.

Other Protocols
Beyond AXI (ACE and CHI)
Back in the first article, the AMBA busses were summarised in the first diagram and a couple of protocols were
shown beyond AXI. These are the AXI Coherency Extensions (ACE) and the Coherency Hub Interface (CHI).
As you can see from the names, cache coherency is front and centre of the protocol features. As systems have
more and more processors, cache coherency is a genuine concern for efficient and accurate data sharing
between cores (discussed in my article on caches).

The ACE protocol extends AXI and adds hardware-coherent cache support. That is, cached memory can
coherently be shared across components without the need for software cache maintenance. This is done system
wide, and regions can be specified for coherent memory. Each cache line now has five states associated with it,
and what happens when that cached location is accessed is determined by that state. These states are
UniqueClean, UniqueDirty, SharedClean, SharedDirty, and Invalid. For invalid and dirty/clean definitions, see
my article on caches. The unique status is that a cache line is only valid in one cache, whereas shared implies it
might be valid in multiple caches.

Cached data associated with a particular manager can be accessed from another manager, even when the first
manager is active on the interconnect. The diagram below shows multiple managers, each with their own
cache, with a coherent interconnect, allowing (depending on cache line state) for one manager to access data in
another’s cache.

The ACE specification modifies the AXI address channels for cache access and barriers and extends signalling
on BRESP. In addition, three new channels are added for ‘snooping’: snoop address, snoop response and snoop
data. In this context, snooping is accessing a cache to see if an address is active within it and retrieving data
from there.

The CHI protocol is trying to solve basically the same problems, but for systems that are far more distributed,
with many more cores, in a scalable way. It has the same five states for cache lines as for ACE but moves away
from busses and interconnects as such and moves to a message-based access model. It can be configured in
multiple ways as a network: mesh, ring, crossbar etc. Indeed, the high-performance computer systems designed
at Quadrics, when I was there, were just such message-based coherent systems using a crossbar network. Since
then, on-chip systems with large numbers of processors have been developed, with all the same problems
associated with them, and CHI aims to solve these in a similar manner.

The CHI protocol ensures coherency by allowing only one copy of data to exist for a location when a store
occurs. Each manager can get a copy for its own local cache after the store. Within the interconnect, a ‘home
node’ receives all the requests and manages and coordinates the snooping, caching, etc. The interconnect block
may optionally have its own cache.

Intel Avalon

The Avalon bus specification from Intel (formerly Altera) aims to provide the same general functions as the
AMBA specifications, with memory mapped interfaces, burst transfers, streaming interfaces and more. Much of
the basic signalling for the memory mapped interface is similar to AXI, with notable differences being separate
read and write strobes and burst lengths being from 0 to 4095 words and specified as n rather than n-1. Indeed,
the interfaces are so similar that I have, in the past, constructed a converter from Avalon burst read interface
(used internally to the design) to AXI manager interface (for interfacing to a memory controller) using just
simple combinatorial logic.

The memory mapped interface has similar response errors and support for locked transactions. What is notably
missing is any support for cache coherency, QoS, or protection. It does, though, specify an interrupt interface.

In addition to the memory mapped and interrupt interfaces, the Avalon specification defines two streaming
interfaces; that is, queue-based data transfers. The first is signal flow controlled whilst the second is credit
controlled (cf. data link layer of PCIe). A last ‘conduit’ interface is defined, but this is really a means to bundle
an arbitrary set of user defined signals into a single unit—useful when using Intel’s platform designer to
construct logic block hierarchy and auto-generate the interconnect logic.

Wishbone
I want to mention the Wishbone bus, which is an open-source specification with much the same aims as the
other specifications. It can be configured for a shared bus (like AHB), a pipeline or a crossbar switch system.
The signalling is very limited compared to the other specifications and mainly associated with data transfer.
There is no higher-level signalling for cache coherency, QoS, locked transfers etc.

Its main advantage is that it is a free, open-source specification. My understanding, though, is that AMBA has no
licence or royalty associated with using the specifications (unlike the ARM processor architectures), so this
advantage is limited.

Conclusions
We have, over these two articles, looked at some of the AMBA specifications as case studies for the bus and
interconnect typically used in SoC embedded systems, starting with a simple word transfer protocol (APB),
through a higher performance burst bus (AHB), to an interconnect-based protocol, AXI. All of these have a
common heritage, but new features are added (often optionally) to solve particular problems of efficient data
transfer within a system. We have seen that cache coherency solutions start to dominate, with ACE and CHI
specifications beyond AXI, in order to meet the needs of large multi-core distributed systems, using techniques
that were the mainstay only of supercomputers not so long ago.
AMBA protocols are by far the most commonly used (in my experience), but other protocols such as Avalon
are used within their own domain, and open-source specifications exist for freedom from any tie-in to corporate
control.

The aim of these articles has been to show how, at the basic level, these interfaces are not complicated to understand or implement, despite the perception of the engineer mentioned in the introduction to the first article. The layering of features, we have shown, allows, at its simplest, a straight-forward implementation of an AXI interface (either manager or subordinate) using just well understood components, such as FIFOs. So, do you still think AXI is too complicated a protocol to use?

Getting Hold of the Specifications

For those wishing to dive deeper into the protocols, the list below gives links to all the specifications mentioned
in these two articles.

• APB
• AHB
• AXI/ACE
• CHI
• Avalon
• Wishbone
PCI Express Primer #1: Overview and Physical Layer
Simon Southwell | Published Jul 20, 2022

Introduction
This is the first in a set of articles giving an overview of the PCI Express (PCIe) protocol. This is quite a large subject and, I think, needs to be split over a number of separate, more manageable documents and, even so, this is just a brief introduction to the subject. The main intended audience for these articles is anyone wishing to understand PCIe better if, say, they are working on a system which includes this protocol, or anyone needing a primer before diving more deeply into the specifications themselves.

In this, the first article, an overview will be given of the PCIe architecture and an introduction to the first of the three layers that make up the PCIe protocol. The Transaction and Data Link Layer protocol details will wait for future articles, and just the Physical Layer will be focused on here, after the overview.

To accompany these articles, a behavioural PCIe simulation model (pcievhost), written for Verilog, is available. This runs a C model of the PCIe protocol on a simulation virtual processor (VProc) that interfaces this model to the Verilog environment. The model's documentation explains how this is set up, and the API of the model for generating PCIe traffic. In these articles I will be making reference to the output of this model for many of the examples, but the model is also intended to allow readers the chance to explore the protocol for themselves. An example test environment is included which generates most of the traffic that the model is capable of producing, and the reader is encouraged to get the model and make changes to the driving code to explore the PCIe space.

So, let’s start at the beginning.

Beginnings

Back when the PCI Express (PCIe) protocol was first published (at version 1.0a), it was decided that the HPC
systems we were designing would upgrade from PCI to this new protocol so that we had a future path of
bandwidth capability. There were a few 3rd party IP options at that time but, on inspection, these did not meet
certain requirements for our designs, such as low latency through the interface. Therefore, I was tasked,
amongst other things, to design a 16-lane endpoint interface to the PCIe specification that also met the other
requirements. So, I got hold of the specification and started to look through it—all 500 plus pages of it. This was,
of course, quite intimidating. I also went to a PCIe convention in London and spoke to one of the keynote
speakers who led a team that had implemented a PCIe interface. I asked her how long it took and how big a
team she had. She replied that it took about a year, with a team of 4 engineers—oh, and she had 20 contractors
doing various things, particularly on the verification side. I had 18 months and it was just me. One deep breath
later I started breaking down the specification into the areas I would need to cover, optional things I could (at
least at first) drop and slowly a pattern emerged that was a set of manageable concepts. In the end a working
endpoint was implemented that covered the 1.1 specification with 16 lanes, or the 2.0 specification with 8 lanes.

In these articles I want to introduce the concepts I learnt in that endpoint implementation exercise in the same
manageable chunks. Since that 2.0 endpoint, the specification has moved forward and, at the time of writing, is at version 6.0. Nonetheless, the article will start at the beginning before reviewing the changes that
have taken place since the initial specification. Often systems can seem too complicated to understand because
there have been a multitude of incremental changes and improvements to produce something of complexity. By
starting at the beginning and then tracking the changes the problem becomes easier to follow, so that’s what I
will do here.

PCIe Overview
Unlike its predecessor, PCI, PCIe is not a bus. It is a point-to-point protocol, more like AXI for example. The
structure of the PCIe system consists of a number of point-to-point interfaces, with multiple peripherals and
modules connected through an infrastructure, or fabric. An example fabric topology is shown below:

Unlike some other point-to-point architectures, there is a definite directional hierarchy with PCIe. The main
CPU (or processor sub-system) sits at the top and is connected to a 'root complex' (RC) using whatever user interface is appropriate. This root complex is the top-level PCIe interconnect component and would typically
be connected to main memory through which the CPU system would access it. The root complex will have a
number of PCIe interfaces included, but to a limited degree. To expand the number of supported peripherals
‘switches’ may be attached to a root complex PCIe interface to expand the number of connections. Indeed, a
switch may have one or more of its interfaces connected to other switches to allow even more expansion.
Eventually an ‘endpoint’ (EP) is connected to an interface, which would be on a peripheral device, such as a
graphics card or ethernet network card etc.

At each link, then, there is a definite ‘downstream’ link (from an upstream component e.g., RC to a switch or
EP) and an 'upstream' link (from a downstream component e.g., EP to switch/RC). For each link the specification defines three layers built on top of each other:

• Physical Layer
• Data Link Layer
• Transaction Layer

The physical layer is concerned with the electrical connections, the serialisation, encoding of bytes, the link
initialisation and training and moving between power states. The data link layer sits on top of the physical layer
and is involved in data flow control, ACK and NAK replies for transactions and power management. The
transaction layer sits on top of the data link layer and is involved with sending data packet reads and writes for
memory or I/O and returning read completions. The transaction layer also has a configuration space—a set of
control and status registers separate to the main address map—and the transaction layer protocol has read and
write packets to access this space.

Physical Layer
Lanes

The PCIe protocol communicates data through a set of serial ‘lanes’. Electrically, these are a pair of AC coupled
differential wires. The number of lanes for an interface can be of differing widths, with x1, x2, x4, x8, x12, x16
and x32 supported. Obviously the higher the number of lanes the greater the data bandwidth that can be
supported. Graphics cards, for instance, might be x16, whilst a serial interface might be x1. However, an
interface need not be connected to another interface of the same size. During initialisation, active lanes are
detected, and a negotiated width is agreed (to the largest of the mutually supported widths). The interface will
then operate as if both ends are of the negotiated width. The diagram below shows the motherboard PCIe
connectors of my PC supporting 3 different lane widths: x16 at the top, x1 (two connectors), and x4.

Note that the x4 connector at the bottom has an open end on the right. This allows a larger width card (e.g., x16)
to plug into this slot and operate at a x4 lane configuration. The signals are arranged so that the lower lane
signals are on the left (with respect to the above diagram). It is outside the scope of this article to go into
physical and electrical details for PCIe as we are concentrating on the protocol but signals other than the lane
data pairs include power and ground, hot plug detects, JTAG, reference clocks, an SMBus interface, a wake
signal, and a reset.
Scrambling

The basic data unit of PCIe is a byte which will be encoded prior to serialisation. Before this encoding, though,
the data bytes are scrambled on a per lane basis. This is done with a 16-bit linear feedback shift register or an
equivalent. The polynomial used for PCIe 2.0 and earlier is G(x) = x^16 + x^5 + x^4 + x^3 + 1, whilst for 3.0 it is G(x) = x^23 + x^21 + x^16 + x^8 + x^5 + x^2 + 1. The scrambling can be disabled during initialisation, but this is normally for test
and debug purposes.

To keep the scrambling synchronised across multiple lanes, the LFSR is reset to 0xffff when a COM symbol is processed (see below). Also, it is not advanced when a SKP symbol is encountered, since these may be added or deleted in some lanes for alignment (more later). The K symbols are not scrambled, nor are the data symbols within a training sequence.

Serial Encoding

I said earlier that the serial lines were AC coupled. The first layer of protocol that we will look at is the
encoding of bytes. Data is serialised from bytes, but the bytes are encoded into DC free signals using one of two
encoding schemes:

• 8b/10b encoding (version 2.1 and earlier)


• 128b/130b encoding (version 3.0 onwards)

Both of these encodings achieve three things. Firstly, they minimise any localised DC component, with the average signal being DC free. Secondly, they allow clock recovery from the data with a guaranteed transition rate.
Thirdly they allow differentiation between control information and data. We will not look at the details of the
two encodings here. The aforementioned model has an 8b/10b encoder and decoder included, which may be
inspected, and there are good articles on this particular encoding. The 128b/130b is based on 64b/66b and
simply doubles the payload size.

Having encoded the bytes for each lane, the SERDES (serialiser-deserialiser) will turn this into a bit stream and
send out least-significant bit first. Other than when the lane is off there will always be a code transmitted, with
one of the codes being IDLE if there is nothing to send. At the receiving end, the SERDES will decode the
encoded values and produce a stream of bytes.

Within the 8b/10b encoding are control symbols, as mentioned before, called K symbols, and for PCIe these are
encoded to have the following meanings.
For 128b/130b encoding, the two control bits determine whether the following 16 bytes are an ordered set (10b) or data (01b), rather than using a K symbol. When an ordered set, the first symbol determines the type of
ordered set. Thus, the 10b control bits act like a COM symbol, and the next symbol gives the value, whereas
01b control bits have symbol 0 encode the various other token types. More details are given in the next section.

Ordered Sets

So now we know how to send bytes out over the lanes, some of which are scrambled, all of which are encoded,
and some of the encoded symbols are control symbols. Using these encodings, the protocol can encode blocks
of data either within a lane, or frame packets across the available lanes. Within the lanes are ‘ordered sets’
which are used during initialisation and link training. (We will deal with the link training in a separate section as
it is a large subject.) The diagram below shows the ordered sets for PCIe versions up to 2.1:
The training sequence OSs, as we shall see in a following section, are sent between connected lanes advertising
the PCIe versions supported and link and lane numbers. The training will start at the 1.x speeds and then, at the
appropriate point switch to the highest rate supported by both ends. The link number is used to allow for
possible splitting of a link. For example, if a downstream link is x8, and connected to two x4 upstream links, the
link numbers will be N and N+1. The lane number is used to allow the reversal of lanes in a link when lane 0
connects to a lane N-1, and lane N-1 connects to 0. This can be reversed, meaning the internal logic design still
sends the same data out on its original, unreversed, lane numbers. By sending the lane number, the receiving
link knows that this has happened. This may seem a strange feature, but it may occur due to, say, layout constraints on a PCB, and reassigning lanes electrically helps in this regard. In addition, the training will also detect lanes that have their differential pairs wired up incorrectly, in which case a receiver may see inverted TSX symbols in the training sequences on one or more lanes and will, at the appropriate point in the initialisation, invert the data.

The Electrical Idle OS is sent on active lanes by a transmitter immediately before entering an electrical idle state (which is also the normal initial state).

The Fast Training OS is sent when moving from a particular power saving state (L0s) back to L0 (normal
operation) to allow a quick bit and symbol lock procedure without a full link recovery and training. The number
of fast training OS blocks sent was configured during link initialisation with the N_FTS value in the training
sequence OSs.

The Skip OS is used for lane-to-lane deskew. The delays through the serial links, via connectors and traces, and differences in the electrical properties of the SERDES will skew data on the lanes, and the bit rates may vary by some amount between transmitter and receiver. Allowance for this is made via the skip OS. At the transmitter
these are sent at regular intervals; for PCIe 2.1 and below this is between 1180 and 1538 symbols, and for PCIe
3.0 this is 370 to 375 blocks. Deskew logic is used to detect the arrival of skip OSs and will align across the
lanes by adding and subtracting one or more SKP symbols from the stream to keep the lanes aligned to within
tight constraints to minimize the amount of compensation buffering required.
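
A transmitter therefore just needs to schedule a skip OS within the allowed window, which could be as simple as the counter sketch below (a hypothetical fragment, assuming one symbol time per count and that the OS is actually inserted between packets rather than mid-packet).

  // Schedule skip ordered sets within the 1180-to-1538 symbol window (PCIe 2.1
  // and earlier). 'symbol_sent' and 'skp_os_sent' are assumed internal signals.
  reg [10:0] skp_count;
  wire       skp_due = (skp_count >= 11'd1180);

  always @(posedge clk or negedge resetn) begin
    if (!resetn)
      skp_count <= 11'd0;
    else if (skp_os_sent)                 // a SKP OS has just been transmitted
      skp_count <= 11'd0;
    else if (symbol_sent)                 // one symbol time has elapsed
      skp_count <= skp_count + 11'd1;
  end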

PCIe 3.0+ Ordered Sets


For specifications beyond 2.1, each 130-bit block consists of 2 bits of control and 16 bytes. The leading control bits determine which type of data follows, with 01b a data block (across lanes) and 10b an ordered set, taking the place of a comma for the ordered sets. The following 16 bytes define which ordered set is present:

• Training Sequence Ordered Set: First symbol of 1Eh (TS1) or 2Dh (TS2), followed by the same
information fields as above (though PAD is encoded as F7h). Symbols 6 to 9 replace TSX values with
additional configuration values such as equalization and other electrical settings.
• Electrical Idle Ordered Set: All symbols 66h
• Fast Training Ordered Set: a sequence of: 55h, 47h, 4Eh, C7h, CCh, C6h, C9h, 25h, 6Eh, ECh, 88h,
7Fh, 80h, 8Dh, 8Bh, 8Eh
• Skip Ordered Set: 12 symbols of AAh, a skip end symbol of E1h, and the last 3 symbols carrying status
and LFSR values. Note that the first 12 symbols can vary in number, since symbols may be added or deleted
for lane-to-lane deskew.

Link Initialisation and Training


The state of the link is defined by a Link Training and Status State Machine (LTSSM). From an initial state, the
state machine goes through various major states (Detect, Polling, Configuration) to train and configure the link
before being fully in a link-up state (L0). The initialisation states also have sub-states, which we will discuss shortly.

In addition, there are various powered-down states of varying degrees from L0s to L1 and L2, with L2 being all
but powered off. The link can also be put into a loopback mode for test and debug, or a ‘hot reset’ state to send
the link back to its initial state. The disabled state is for configured links where communications are suspended.
Many of these ancillary states can be entered from the recovery state, but the main purpose of this state is to
allow a configured link to change data rates, establishing lock and deskew for the new rate. Note that many of
these states can be entered if directed from a higher layer, or if the link receives a particular TS ordered set
where the control symbol has a particular bit set. For example, if a receiver receives two consecutive TS1
ordered sets with the Disable Link Bit asserted in the control symbol (see diagram above), the state will be
forced to the Disabled state.

The diagram below shows these main states and the paths between them:
From power-up, then, the main flow is from the Detect state which checks what’s connected electrically and
that it’s electrically idle. After this it enters the polling state where both ends start transmitting TS ordered sets
and waits to receive a certain number of ordered sets from the other link. Polarity inversion is done in this state.
After this, the Configuration state does a multitude of things with both ends sending ordered sets moving
through assigning a link number (or numbers if splitting) and the lane numbers, with lane reversal if supported.
In the configuration state the received TS ordered sets may direct the next state to be Disabled or Loopback and,
in addition, scrambling may be disabled. Deskewing will be completed by the end of this state and the link will
now be ’up’ and the state enters L0, the normal link-up state (assuming not directed to Loopback or Disabled).
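
A skeleton of the top-level LTSSM, covering just this main power-up path plus the Recovery loop, might look like the sketch below. The state names follow the specification, but the encodings and the condition signals (receiver_detected, polling_done, and so on) are purely illustrative and would be driven by the sub-state logic.

  // Top-level LTSSM sketch (main path only; power states, Loopback, Hot Reset
  // and Disabled transitions omitted).
  localparam DETECT   = 3'd0, POLLING = 3'd1, CONFIG = 3'd2, L0 = 3'd3,
             RECOVERY = 3'd4;

  reg [2:0] ltssm_state;

  always @(posedge clk or negedge resetn) begin
    if (!resetn)
      ltssm_state <= DETECT;
    else
      case (ltssm_state)
        DETECT:   if (receiver_detected) ltssm_state <= POLLING;
        POLLING:  if (polling_done)      ltssm_state <= CONFIG;
        CONFIG:   if (config_done)       ltssm_state <= L0;        // LinkUp = 1
        L0:       if (retrain_required)  ltssm_state <= RECOVERY;
        RECOVERY: if (recovery_done)     ltssm_state <= L0;
        default:                         ltssm_state <= DETECT;
      endcase
  end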

As mentioned before, the initialisation states have sub-states, and the diagram below lists these states, what’s
transmitted on those states and the condition to move to the next state.
In the Detect.Quiet state the link waits for Electrical Idle to be broken. Detection of Electrical Idle (or not) is done with analogue circuitry, though it may be inferred from the absence of received packets or TS OSs, depending on the current state. When broken, Detect.Active performs a Receiver Detect by measuring for a DC impedance (40 Ω – 60 Ω for PCIe 2.0) on the line. Moving into Polling.Active, both ends start transmitting TS1 ordered sets with the lane and link numbers set to the PAD symbol. They wait to have sent at least 1024 TS1s and received 8 (or their inverse), before moving to Polling.Config. Here they start transmitting TS2 ordered sets with link and lane set to PAD, having inverted the RX lanes as necessary. The state then waits until it has transmitted at least 16 TS2s (after receiving one) and received at least 8.

Now we move to Config.LinkWidth.Start. It is this, and the next state, that the viable link width or split is
configured using different link numbers for each viable group of lanes. Here the upstream link (e.g., the
endpoint) starts transmitting TS1s again with link and lane set to PAD. The downstream link (e.g., from root
complex) start transmitting TS1s with a chosen link number and the lane number set to PAD. The upstream link
responds to the receiving a minimum of two TS1s with a link number by sending back the TS1 with that link
value and moves to Config.LinkWidth.Accept. The downstream will move to the same state when it has received
to TS1s with a non-PAD link number. At this point the downstream link will transmit TS1s with assigned lane
numbers whilst the upstream will initially continue to transmit TS1s with the lanes at PAD but will respond by
matching the lane numbers on its TS1 transmissions (or possibly lane reversed) and then move to
Config.Lanenum.Wait. The downstream link will move to this state on receiving TS1s with non-PAD lanes.
This state is to allow for up to 1ms of time to settle errors or deskew that could give false link width
configuration. The downstream will start transmitting TS2s when it has seen two consecutive TS1s, and the
upstream lanes will respond when it has received two consecutive TS2s. At this point the state is
Config.Complete and will move to Config.Idle after receiving eight consecutive TS2s whilst sending them. The
lanes start sending IDL symbols and will move to state L0 (LinkUp=1) after receiving eight IDL symbols and
having sent at least sixteen after seeing the first IDL.

The diagram below summarises these described steps for a normal non-split link initialisation.

To summarise these steps, each end sends training sequences of a certain type with certain link and lane
values. When a certain number of TSs are seen, and on which lanes, the state is advanced, and
configurations are set. There is a slight asymmetry in that a downstream link will switch type first to lead the
upstream link into the next state. By the end of the process the link is configured for width, link number and
lane assignment, with link reversal, lane inversion, and disabled scrambling where indicated. There are many
variations of possible flow, such as being directed to Disabled or Loopback, or timeouts forcing the link back to
Detect from Configuration states etc., which we won’t describe in detail here.
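
As a very rough illustration only (and not the specification’s state machine), the main flow just described can be
modelled in C with a simple state/event function such as the sketch below. The state and event names are my
own assumptions for illustration; sub-states, Disabled, Loopback and Recovery are omitted.

#include <stdio.h>

/* Simplified top-level LTSSM states (sub-states omitted) */
typedef enum { DETECT, POLLING, CONFIG, L0 } ltssm_state_t;

/* Events abstracted from the received ordered sets and electrical conditions */
typedef enum { RX_DETECTED, TS_EXCHANGE_DONE, CONFIG_DONE, TIMEOUT } ltssm_event_t;

static ltssm_state_t ltssm_next(ltssm_state_t state, ltssm_event_t event)
{
    switch (state) {
    case DETECT:  return (event == RX_DETECTED)      ? POLLING : DETECT;
    case POLLING: return (event == TS_EXCHANGE_DONE) ? CONFIG  : POLLING;
    case CONFIG:  return (event == CONFIG_DONE)      ? L0      :
                         (event == TIMEOUT)          ? DETECT  : CONFIG;
    case L0:      return L0;   /* Normal link-up; recovery etc. not modelled */
    }
    return DETECT;
}

int main(void)
{
    ltssm_state_t s = DETECT;
    s = ltssm_next(s, RX_DETECTED);       /* receiver detected              */
    s = ltssm_next(s, TS_EXCHANGE_DONE);  /* TS1/TS2 exchange in Polling    */
    s = ltssm_next(s, CONFIG_DONE);       /* link/lane numbers assigned     */
    printf("LinkUp = %d\n", s == L0);
    return 0;
}
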
A fragment of the output from the start of the link initialisation of the pcievhost model is shown below:

You can try this out for yourselves by retrieving the pcievhost repository, along with the virtual processor
(VProc), and running the default test.

Compliance Pattern

There is an additional Polling state, Polling.Compliance, that is entered if at least a single lane never exited
Electrical Idle during Polling.Active. This implies that some passive test equipment is attached to
measure transmitter electrical compliance. The transmitter must then output the compliance pattern which, for
8b/10b, is K28.5, D21.5, K28.5 and D10.2, repeated. For multiple lane devices, a two-symbol delay is
introduced on every eighth lane, and then scrolled around in one-lane steps at the end of the sequence.

Since this is a test mode, we will not detail this any further here, but it must be available in an implementation.

SERDES Interface and PIPE


The PCIe protocol runs serial lanes at high speed. As of version 6.0 this is 64GT/s (that is, raw bits). The
SERDES that drive these serial lines at these high rates are complex and vary between manufacturers and ASIC
processes. The ‘PHY Interface for PCI Express’ (PIPE) was developed by Intel to standardize the interface
between the logical protocol that we have been discussing, and the PHY sub-layer. It is not strictly part of the
PCIe specification but is used so ubiquitously that I have included an overview here.

The PIPE specification conceptually splits the Physical layer into a media access layer (MAC) which includes
the link training and status state machine (LTSSM), with the ordered sets and lane to lane deskew logic, a
Physical Coding Sub-layer (PCS) with 8b/10b or 128b/130b codecs, RX detection and elastic buffering, and a
Physical Media Attachment (PMA) with the analogue buffers and SERDES etc. The PIPE then standardizes the
interface between the MAC and the PCS. The diagram below shows an overview of the PIPE signalling
between the MAC and PCS:

The transmit data (TxData) carries the bytes for transmission. This could be wider than a byte, with 16- and 32-
bit inputs allowed. The TxDataK signal indicates whether the byte is a control symbol (K symbol in 8b/10b
parlance). If the data interface is wider than a byte then this signal will have one wire per byte. The command
signals are made up of various control signal inputs that we will discuss shortly. The data receive side mirrors
the data transmit side with RxData and RxDataK signals. A set of Status signals are returned from the receiver,
discussed shortly. The CLK input’s specification is implementation dependent, but it provides a reference for
the TX and RX bit-rate clocks. The PCLK is the parallel data clock from which all data transfers are referenced.

The transmit command signals are summarised in the following table for PIPE version 2.00.
The receive status signals are summarised in the following table for PIPE version 2.00.
Hopefully from the tables it is easy to see how, via the PIPE interface signaling, MAC logic can control PHY
state in a simple way and receive PHY status to indicate how it may transition through the LTSSM for
initialisation and power down states.

The use of the PIPE standard makes development and verification much easier and allows Physical layer logic
to be more easily migrated to different SERDES solutions. Usually, ASIC or IP vendors will provide IP that has
this PIPE standard interface and will implement the PCS and PMA functions themselves. The MAC logic, then,
becomes more generic and can more easily be ported between PHY solutions.

Conclusions
In this article we have looked at how PCIe is organized, with Root Complex, Switches and Endpoints, in a
definite flow from upstream to downstream. We have seen that a PCIe link can be from 1 to 32 differential
serial ‘lanes’. Bytes are scrambled (if data) and then encoded into DC free symbols (8b/10b or 128b/130b).
Ordered Sets are defined for waking up a link from idle, link bit and symbol lock and lane-to-lane deskew.
Training sequence Ordered sets are used to bring up a link from electrically idle to configured and initialized,
configuring parameters as it does so or, optionally, forcing to non-standard states. Additional states are used for
powered down modes of varying degrees, and a recovery state to update to higher link speed if supported. We
also looked at the complementary PIPE specification for virtualizing away SERDES and PHY details to a
standard interface.

We dwelt on the LTSSM at some length as this is the more complex aspect of the physical layer protocol, and
the only remaining aspects of this layer are how the physical layer carries the higher data link layer and
transaction layer packets.
PCI Express Primer #2: Data Link Layer
Simon Southwell | Published Jul 28, 2022

Introduction
In the first article PCIe was introduced, defining the architecture in terms of a root complex, switches and
endpoints. The three layers of physical, data link and transaction were defined before diving into a detailed look
at the lowest layer; the physical layer. This layer defines the serial ‘lanes’ over which the data is transported,
with their specific electrical requirements for high-speed serial data transmission. Data is encoded and
scrambled to make it suitable for these high-speed channels. Ordered sets were defined to allow for deskewing
and skipping symbols to compensate for clock and delay variations as well as for initializing the link during
training. A state machine was defined to manage the link’s state from powered down to being ready, as well as
some low powered states and test states. Connection to the SERDES was also discussed, with a new
specification used to standardize this connection—the PIPE specification—virtualizing away the SERDES
specifics which can vary greatly in detail.

At this point, then, we are ready to send bytes over the lanes. The layer that sits above the physical layer is the
data link layer. In this article we will look at this layer in more detail. The data link layer is responsible for
ensuring data integrity over the link as a whole, for the flow control of data across the links and for some power
management. In addition, it provides a means for vendor specific information to be transferred.

Data Link Layer


Before we look at data link layer specifics, we need to define three types of transaction layer packets that the
data link layer needs to know about (we will discuss what type of packets fit this model when we look at the
transaction layer). PCIe defines three packet types:

• Posted (P)
• Non-Posted (NP)
• Completion (Cpl)

Posted packets are ones where no response is issued (or expected), such as a write to memory, non-posted are
the opposite where a response is required, and a completion is a returned packet for an earlier packet in the
opposite direction—such as read data from an earlier read.

As well as transporting transaction layer packets (TLPs) the data link has its own packets called data-link-layer
packets (DLLPs). The diagram below shows the layout of such DLLPs.
PHY Layer Revisited

The DLLP itself, of course, must be transported over the physical layer. In the first article a set of tokens was
defined, such as COM, IDL, FTS etc. Some that were not discussed in detail were SDP (start of data-link-layer
packet) and END. All the ordered sets mentioned in the last article were sent down each lane serially, but now
for the DLLPs (and TLPs) data will be striped across the lanes (if the link width is greater than ×1). So, the start
of a DLLP is marked with an SDP token, and the end is marked with the END token. If an END token, in a
wide link (> ×4), would not be followed immediately by another packet (DLLP or TLP), then PAD tokens are
used to pad to the end of the link width.

For 128b/130b modulation generations (beyond Gen 2) a DLLP/TLP packet is identified with the two control
bits (sync header) of the 130 bits being 10b, followed by framing tokens to cover SDP, STP, IDL and EDB. IDL
is one symbol, SDP is two and STP and EDB are both four symbols. Symbol 0 uniquely identifies what type of
token it is, with the other symbols conveying additional information. There is no equivalent of the END token.
An EDS token (end of data stream—four symbols) is defined, which indicates that the next block is an ordered set
block. This might be sent, for instance, in place of IDL tokens, aligned to the last lanes in a wide link, just
before a skip ordered set is transmitted. Basically, EDS switches out of the data stream; the link switches back
to data when a sync header has a 10b value, indicating a data block.

DLLP structure

A DLLP has a fixed size of 6 bytes. The first byte defines the type of DLLP it is, with the next three bytes
specific to each type. The encodings for the different DLLP types are given in the diagram. The DLLP
(including the type byte) is protected by a 16-bit CRC, with the polynomial as shown in the diagram above. The
resultant CRC is also inverted, and the bytes are bit reversed (i.e., bit 0 becomes bit 7, bit 1 becomes bit 6 etc.).
As we shall see, all the CRCs used in PCIe follow this pattern, though not inverting the LCRC of a TLP, along
with an EDB token instead of an END in the PHY layer, is used to indicate a terminated TLP.
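
The 16-bit CRC generation can be sketched in C. The polynomial itself is given in the diagram (not reproduced
here); the sketch below assumes the commonly quoted PCIe DLLP polynomial 0x100B with an all-ones seed, and
applies the inversion and per-byte bit reversal described above. The exact bit ordering on the wire depends on
the symbol mapping, so treat this as illustrative rather than a reference implementation.

#include <stdint.h>
#include <stdio.h>

/* Assumed DLLP CRC parameters: polynomial x^16+x^12+x^3+x+1 (0x100B), seed 0xFFFF */
#define DLLP_CRC_POLY 0x100Bu
#define DLLP_CRC_SEED 0xFFFFu

static uint8_t bit_reverse8(uint8_t b)
{
    uint8_t r = 0;
    for (int i = 0; i < 8; i++)
        if (b & (1u << i))
            r |= (uint8_t)(1u << (7 - i));
    return r;
}

/* Bit-serial CRC over the 4 DLLP bytes (type byte plus 3 type-specific bytes),
 * then inverted and bit-reversed per byte as the text describes.              */
static uint16_t dllp_crc16(const uint8_t dllp[4])
{
    uint16_t crc = DLLP_CRC_SEED;

    for (int i = 0; i < 4; i++) {
        for (int bit = 7; bit >= 0; bit--) {
            unsigned din = (dllp[i] >> bit) & 1u;
            unsigned msb = (crc >> 15) & 1u;
            crc = (uint16_t)(crc << 1);
            if (din ^ msb)
                crc ^= DLLP_CRC_POLY;
        }
    }

    crc = (uint16_t)~crc;                              /* final inversion      */
    return (uint16_t)((bit_reverse8(crc >> 8) << 8) |  /* per-byte bit reverse */
                       bit_reverse8(crc & 0xFF));
}

int main(void)
{
    uint8_t dllp[4] = {0x40, 0x00, 0x20, 0x10};  /* arbitrary example bytes */
    printf("CRC = 0x%04x\n", dllp_crc16(dllp));
    return 0;
}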

There are four categories of DLLPs:

• Flow control
• Acknowledgment
• Power management
• Vendor

Within flow control there are two sub-categories: initialisation and update. Before we discuss how these DLLPs
are used, we need to look at two new concepts. That of virtual channels and that of flow control credits.

Virtual Channels
Each PCIe link is a single conduit for transferring data and the data packets are indivisible units that can’t be
interrupted. (Actually, a root complex can divide a block transfer request into smaller transfers, within certain
restrictions, but this is before constructing the PCIe packets.) Therefore, a link can’t switch from sending part
of a low priority packet to a higher priority one and then resume the low priority data. However, as we saw in the first
article, a transaction may have to hop across multiple links, via switches, before arriving at an endpoint. The
use of virtual channels allows, at each hop, a packet on one virtual channel the opportunity to overtake packets
on other virtual channels. The diagram below shows an arrangement for one link (in just one direction, say a
downstream link).

PCIe defines 8 virtual channels from VC0 to VC7. Only VC0 must be implemented—the others are optional. If,
as in the diagram, logic has multiple virtual channels, each with its own buffering of packets, then an arbiter can
choose between whichever virtual channel has data to send, based on priority. Note that the virtual channels do
not have a priority associated with them directly, but transaction layer packets have a Traffic Class associated
with them that are mapped to a virtual channel. This mapping is done at each hop, so the first link may only
have one virtual channel (VC0) and all traffic classes are mapped to this. Further along, in a switch, say, there
may be multiple virtual channels with one or more traffic classes associated with them. Note, though, that TC0
is always mapped to VC0. We will discuss traffic classes and mapping again when covering TLPs and the
configuration space but, for now, we have enough information for the data link layer.

Flow Control and Credits


The PCIe protocol operates a credit-based flow control system. That is, a sender is given a certain amount of
‘credits’ for sending data across the link and can send data to the value of those credits and no more, until such
time that more credits are issued. In PCIe, credits are given to control flow for the three transaction types we
discussed earlier: Posted, Non-posted and Completions. In addition, header and data are separated so that each
of the three types has two credit values associated with them. The reason for this is that some transactions have
no data associated with them and would not want to be stuck behind a large data packet in the same buffer.
Also, even for packets with data, it gives the opportunity to start processing the packet header before the data
arrives (perhaps due to insufficient data credits). So, for example, to send a whole memory write request, 1
posted header credit is needed, and n posted data credits (where n is the amount of data in units of credit). The
unit of a credit is 16 bytes (4 double-words).
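
As a small worked example (a sketch, not anything normative), the credit cost of a posted memory write is one
posted header credit plus the payload rounded up to 16-byte credit units:

#include <stdio.h>

#define CREDIT_UNIT_BYTES 16u   /* 1 credit = 16 bytes = 4 DWs */

/* Returns the number of posted data credits needed for a payload of
 * 'payload_bytes'; one posted header credit is always needed as well. */
static unsigned posted_data_credits(unsigned payload_bytes)
{
    return (payload_bytes + CREDIT_UNIT_BYTES - 1) / CREDIT_UNIT_BYTES;
}

int main(void)
{
    unsigned bytes = 128;  /* e.g. a 128-byte memory write payload */
    printf("Needs 1 PH credit and %u PD credits\n", posted_data_credits(bytes));
    return 0;
}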

Flow Control Initialisation and Updates

The issuing and consumption of credits discussed above are all defined for the transaction layer, but the data
link layer is where this flow control credit information is communicated. After the physical layer has powered
up to L0 (normal operation) a transmitter will not know how many credits are available for each of the six
types. The data link layer must go through an initialisation process, just as for the PHY layer. Mercifully this is
nowhere near as complicated. To go from DL-LinkDown to DL-LinkUp, only virtual channel 0 (VC0) needs to
have been initialized for flow control. After that, data can be transmitted over the data link layer. Any other
virtual channels supported can then be initialized, even with data flowing over VC0. Flow control is initialized
using the InitFCx-ttt DLLPs, where x is either 1 or 2, and ttt is the packet type P, NP, or Cpl. The diagram
below shows the detailed structure of the data flow DLLPs:

For each of the types of flow control DLLPs, a 3-bit virtual channel number is given along with 8 bits of header
flow control credits and 12 bits of data flow control credits. Note that if, during initialisation, a credit value of 0
is given, this means that infinite credits are available. This might be used, say, for completions if a device would
never issue a non-posted request without having space to receive any reply.

At initialisation, the data link layer transmits InitFC1 DLLPs for each of the three transaction types in order of
P, NP and Cpl and repeats this at least every 34μs (for early generations). When it receives InitFC1s it records
the credits for the types and starts transmitting InitFC2s. From this point it ignores the credit values for any
newly received InitFC DLLPs (either 1 or 2). Once it receives any InitFC (or an updateFC) it goes to DL-
LinkUp. Below is shown a fragment of the pcie model test’s output during data link layer initialisation, coloured
to differentiate the up and down links and also showing the physical layer with the raw DLLP bytes, for
reference.

Note the advertisement of 0 for completions (both header and data), indicating infinite credits. The Posted
and Non-Posted values are non-zero, allowing flow control. The InitFC2 DLLPs repeat the values of the
InitFC1 DLLPs, but these are really “don’t cares”.

Once the data link layer is up (for a given virtual channel) the receiver must send updates to advertise the latest
available space. In the InitFC DLLPs, the header and data information values were absolute credit values. The
UpdateFC DLLPs, though, are a rolling count. So, if a posted data initial value is 128, then the transmitter can
send 128 credits worth of data. If it then receives an update of 130 credits then only two more credits have been
made available. In other words, each end keeps count of the credits issued from initialisation and the amount
available to the transmitter is the amount advertised minus the amount consumed since initialisation. The
maximum size of the packets compared to the maximum value in the credit fields ensures rollover is easily dealt
with, without ambiguity. Below is shown a fragment from the pcie model test’s output, once the data link layer
is initialized and traffic is flowing. In this case only the data link layer output is enabled.
Note that the flow control count values keep on increasing as these are counts of advertised credits since
initialisation. There are also no updates for completion flow control counts as infinite values were advertised at
initialisation.
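
A minimal C sketch of the transmitter-side rolling credit arithmetic just described might look like the following,
using the 12-bit data credit field width mentioned earlier; it is illustrative only and tracks just one credit type.

#include <stdint.h>
#include <stdio.h>

#define DATA_FC_BITS 12u
#define DATA_FC_MASK ((1u << DATA_FC_BITS) - 1u)

/* Transmitter-side view of one credit type (e.g. posted data) */
typedef struct {
    uint16_t credit_limit;     /* latest advertised count (from InitFC/UpdateFC) */
    uint16_t credits_consumed; /* total credits sent since initialisation        */
} fc_counter_t;

/* Modulo (wrap-safe) check: can we send a packet costing 'needed' credits? */
static int fc_can_send(const fc_counter_t *fc, unsigned needed)
{
    uint16_t available = (fc->credit_limit - fc->credits_consumed) & DATA_FC_MASK;
    return needed <= available;
}

int main(void)
{
    fc_counter_t pd = { .credit_limit = 128, .credits_consumed = 0 };

    printf("can send 8 credits: %d\n", fc_can_send(&pd, 8));   /* yes           */
    pd.credits_consumed = (pd.credits_consumed + 128) & DATA_FC_MASK;
    printf("can send 8 credits: %d\n", fc_can_send(&pd, 8));   /* no, exhausted */
    pd.credit_limit = 130;     /* UpdateFC: only 2 more credits made available  */
    printf("can send 8 credits: %d\n", fc_can_send(&pd, 8));   /* still no      */
    printf("can send 2 credits: %d\n", fc_can_send(&pd, 2));   /* yes           */
    return 0;
}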

Transmission of updateFCs are scheduled under certain conditions. The most common is that an update is made
when space is freed up, to the size of a single unit of the type or if no space is free at all. For Posted data credits
the ‘unit’ size is the maximum payload size. Also, updates must be scheduled for sending at regular intervals.
This defaults to 30μs (-0%/+50%) but can be configured to be 120μs. This is because a DLLP is not
acknowledged and may have been discarded at the receiver if the CRC failed. If no new updates were normally
scheduled to be transmitted (such as if the receive buffer is empty), then lock-up could ensue. By sending
regular updates, even if nothing changes, this is unblocked. The specification also states that updates can be sent
more frequently than required, and it is good practice to send update DLLPs more regularly if no other traffic is
using the link.

Transporting Data

So far, we have looked at data link layer packets and how they can be used for initializing and updating flow
control for the higher layer packets. The data link layer is also responsible for transporting the transaction layer
packets (TLPs) and ensuring their integrity. Below is shown the structure of a transaction layer packet, with the
data link layer information added, and bracketed by the physical layer tokens STP (cf. SDP) and END.
Like a DLLP bracketed with SDP and END tokens, a data link layer transaction packet has an STP and END
pair. In addition, though, there is an EDB (end-data-bad) token. A transaction can be terminated on the link
early, without sending all of its data (if it has any), so long as it is terminated with EDB in the PHY and also the
LCRC bits are not inverted. This effectively says to discard that packet.

To a transaction layer packet the data link layer adds a 12-bit sequence number and a 32-bit CRC, known as the
LCRC. The diagram above gives the polynomial used for the CRC and its initial value and, like the DLLP CRC,
the bits within bytes are reversed and the values are inverted. When sending packets the data link layer adds a
sequence number starting from 0 (initialised when DL-LinkDown), and increments this for each TLP sent. The
link CRC is added (which also covers the sequence number). At the receive side the packet is received and the
LCRC is checked. If the CRC passes then an acknowledgement is sent back in the form of an ACK DLLP. If
the CRC fails then a NAK DLLP is sent instead to request retries. The format of the ACK and NAK DLLPs is
shown below:

The type field identifies whether the DLLP is an ACK or a NAK, and the only data is the 12 bit ACK/NAK
sequence number. Since the transmitter can’t know if the packet was received correctly until it receives an
acknowledgement it will keep a retry buffer where transmitted TLPs wait until they are acknowledged. When
an ACK is received, all packets in the retry buffer with that sequence number or older are acknowledged and can
be removed from the retry buffer.

If a NAK is received instead of an ACK, this initiates a retry from the transmitter. The sequence number carried
by the NAK is that of the last packet correctly received. All packets in the retry buffer at this sequence, or older,
are deemed acknowledged and deleted from the retry buffer. All the remaining packets not acknowledged are
then resent, oldest first.

There is a limit on the number of retries that can take place. This is set at four. If a fifth retry would take place,
then the physical layer is informed to retrain the link, though the data link and transaction layers are not reset.
When link training is complete, the data link proceeds once more and does a retry of all the unacknowledged
TLPs in the retry buffer in their original order (i.e. oldest first). A timer is also kept by the transmitter which
detects when any TLP in the retry buffer has been neither ACK’d nor NAK’d within a time limit, and initiates
retries on its expiration. The expiration time is a function of the maximum payload size, the link width and other
latencies. This condition is a
reported error. If a sequence number is received with a packet that was not the expected sequence number, then
a TLP has gone missing, and this is also a reported error. These mechanisms ensure that the data link does not
lock up with transitory errors, though if the link becomes permanently bad then the reported errors will flag this.

The sequence numbers are 12 bits with a range in values of 0 to 4095. To ensure clean rollover, just as for flow
control, the maximum allowed unacknowledged packets is limited to half this range, at 2048, even if there are
enough credits to send additional TLPs. The data link layer will stop sending TLPs if this maximum is reached.
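
To make the retry mechanism concrete, below is a highly simplified, illustrative transmitter-side sketch in C,
using modulo-4096 sequence arithmetic and the 2048 outstanding limit described above; timers, retry counts and
the actual buffered TLP contents are omitted.

#include <stdint.h>
#include <stdio.h>

#define SEQ_MOD   4096u
#define MAX_OUTST 2048u

typedef struct {
    uint16_t oldest_unacked;   /* sequence of oldest TLP still in retry buffer */
    uint16_t next_seq;         /* sequence to give the next transmitted TLP    */
} retry_buf_t;

unsigned outstanding(const retry_buf_t *rb)
{
    return (rb->next_seq - rb->oldest_unacked) & (SEQ_MOD - 1);
}

/* Returns the sequence used, or -1 if the 2048 outstanding limit is reached */
int send_tlp(retry_buf_t *rb)
{
    if (outstanding(rb) >= MAX_OUTST)
        return -1;                             /* must wait for ACKs            */
    uint16_t seq = rb->next_seq;
    rb->next_seq = (uint16_t)((rb->next_seq + 1) & (SEQ_MOD - 1));
    return seq;                                /* TLP also stored in the buffer */
}

/* ACK: everything up to and including 'seq' is purged from the retry buffer  */
void handle_ack(retry_buf_t *rb, uint16_t seq)
{
    rb->oldest_unacked = (uint16_t)((seq + 1) & (SEQ_MOD - 1));
}

/* NAK: purge up to 'seq' then resend the remainder, oldest first (not shown) */
void handle_nak(retry_buf_t *rb, uint16_t seq)
{
    handle_ack(rb, seq);
}

int main(void)
{
    retry_buf_t rb = { .oldest_unacked = 5, .next_seq = 5 };

    for (int i = 0; i < 4; i++)                /* send sequences 5,6,7,8        */
        send_tlp(&rb);
    handle_ack(&rb, 7);                        /* 5-7 acknowledged, 8 remains   */
    printf("outstanding = %u\n", outstanding(&rb));
    return 0;
}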

Below is shown a fragment of the pcie model test’s output. This shows only the transactions and
acknowledgement for one direction of data flow, with the data link layer and Transaction layers displays
enabled.
In this example the upstream link has just acknowledged sequence 4. After this the downlink sends transactions
5, 6, 7 and 8. All four TLPs will remain in the retry buffer. When the upstream link sends an acknowledge
DLLP back with sequence number 7, TLPs 5 to 7 are acknowledged and are freed from the retry buffer, whilst
TLP 8 remains, waiting to be acknowledged.

Power Management Support

Power management is done at a higher level than the data link layer, but the data link layer provides
mechanisms to support these functions in the form of PM DLLPs. The general form for PM DLLPs is shown in
the diagram below.
The PM DLLPs carry no information in the 3 symbols after the DLLP type, and each PM DLLP is just
identified by the type value, as shown in the diagram.

In general, changes in power state are instigated from upstream components to power down a downstream
component—e.g., the Root Complex to an endpoint. This is done with a write to the downstream component’s
configuration space (which is covered in a future article). At this point the downstream component will stop
sending TLPs and wait for all outstanding TLPs to be acknowledged. It then repeatedly sends PM_Enter_xx
DLLPs (where xx is either L1 or L23, depending on which power state was configured). The upstream
component also stops sending TLPs and waits until all those sent are acknowledged. It will then send
PM_Request_Ack DLLPs repeatedly until it sees electrical idle on the downstream component, and then goes
idle itself.

PCIe supports Active State Power Management whereby downstream components can request entering a lower
power state when their link is idle (and likely to be for a while). A downstream component, if wishing to enter a
low power state, stops sending TLPs and waits for all to be acknowledged and then starts a request to enter L1
by repeatedly sending a PM_Active_State_Request_L1 DLLP. If the request is accepted then
PM_Request_Acks are sent by the upstream component until it sees electrical idle. If the downstream
component’s request is rejected then the upstream component sends a PM_Active_State_Nak TLP Message, as
there is no equivalent DLLP type for this power management NAK. At this point the downstream link must
enter the physical layer L0s state and recover to L0 from there.

Vendor

The final DLLP to discuss is the vendor specific DLLP. The diagram below shows the format for this DLLP:

The contents of this DLLP are undefined and are implementation specific. The specification states that this
DLLP is not used in normal operations, but it is available for extensions to functionality.

Conclusions
This article has given an overview of the data link layer of the PCIe protocol. We have looked at flow control
with the use of credits, including a data link initialisation process. We have seen how ACK and NAK, along
with CRC protection, at the data link layer implements a retry mechanism for transaction layer data and we
have discussed power management support on the data link layer.

The data link layer sits between the transaction and physical layers, and it has been impossible not to discuss
aspects of both these layers and, indeed, the configuration space and software control in the case of power
management. I have tried not to drift too far away from the data link layer but hope that sufficient
information is provided to understand why the data link layer has the features described. These aspects of the
transaction layer and configuration space will be discussed more fully in upcoming articles. This, like the last
article and future articles, is only an overview and the specifications have a vast number of rules and constraints
that must be followed to be compliant. These articles are a primer to get a handle on the concepts and the
specifications are the final authority on the protocol.

In the next article I will cover the transaction layer protocol which carries the data requests (and data) for
memory, I/O and configuration space, and also messages for such things as power management, error reporting,
and interrupts.
PCI Express Primer #3: Transaction Layer
Simon Southwell | Published Aug 4, 2022

Introduction
In the first and second articles in this series the PCIe physical and data link layers were discussed and we got to
the point of having a physical channel we can send data through, then a means to flow control through that
channel with integrity, using CRCs and a retry based model. In this third article we look at the transaction layer,
where we can (finally) send some data, to and fro, across the link in the form of transaction layer packets (TLPs).

The transaction layer, as we shall see, defines three categories of packets for transferring data as reads and
writes into three address spaces, and a fourth category for sending messages for housekeeping and signalling.
Compared to the layers discussed in the last two articles, the transaction layer has a lot of detailed rules, though
the general concepts are not complicated. To do a comprehensive survey of all these rules would create a large
and quite tedious article, and the specifications themselves cover this. So, in this article on the transaction layer,
I want to go through all the different transaction layer packet types, looking at their individual structures, and
give a summary of the functionality for which they are used. Necessarily this will not be comprehensive, but
specifications often will detail a packet and then say ‘see section x.y.z for details’ of the function they are
associated with, making it difficult to form a picture. Summarizing this functionality at the point of packet
definition will, I hope, give a broader picture of how PCIe works and the transactions that are used for each
function.

So, let’s get to it.

Transaction Layer Packets


As we saw in the data link layer article, three types of transaction were identified and it is worth reiterating what
these are here. Below are the three identified types:

• Posted
• Non-posted
• Completions

Posted packets are ones where no response is issued (or expected), such as a write to memory, non-posted are
the opposite where a response is required, and a completion is a returned packet for an earlier packet in the
opposite direction—such as read data from an earlier read. In the data link layer, we discussed these types with
reference to flow control, as each type is flow controlled separately and, indeed, within those types are flow
controlled for header and data separately as well. The transaction layer defines a set of transaction layer packets
(TLPs), each of which fits into one of these three types. The general category of TLPs are listed below along
with the type to which they belong.
Of these TLP types most are non-posted, whilst just memory writes and messages are posted. The completion
TLPs are the responses to non-posted requests, may or may not carry data, and are of the ‘completion’ type, as
you’d expect.

Other than message transactions, the access request TLPs (with a completion being a response, if applicable) are
reads and writes to different address spaces—namely memory, I/O and configuration.

Memory accesses are just what you’d expect, with reads and writes of data within a memory mapped address
space. According to the PCIe specifications, the I/O TLPs are to support legacy PCI which defines a separate
I/O address space, but even modern systems still make a distinction of main memory and I/O, such as the RISC-
V fence instructions. The configuration access TLPs are used to access the configuration space of the PCIe. The
configuration space is effectively the control and status registers of the PCIe interface. These ‘registers’
advertise capabilities, reports status and allow configurations. We will look at the configuration space details in
another article. The I/O and configuration writes, unlike memory writes, are both non-posted and require a
completion.

The general structure of a TLP is as shown in the diagram below:

Each TLP has a header which is either three or four double words, depending on its type and (where applicable) the
address width being used (either 32-bit or 64-bit). This is followed by the data, if any. The header will indicate
the length of the data, as we shall see. The maximum payload size supported by the protocol is 4096 bytes (1024
DWs), but an endpoint can advertise in its configuration space that it supports less, in powers of 2, down to 128
bytes (32 DWs). An optional CRC can be added for additional data integrity. This is called the TLP digest and is
a CRC with the same specification as the LCRC of the data link layer discussed in article 2, including inversion
and bit swapping the bytes. A bit in the header indicates whether this TLP digest is present or not.

TLP Headers

The first double word of all TLP headers has a common format to indicate what the construction of the rest of
the TLP is like. The diagram below shows the layout of this first double word:

The first field in the header is the ‘fmt’, or format field. This dictates whether the header is 3DW or 4, and
whether there is a payload or not. Basically, if bit 0 of the format is 0 it’s a 3DW header, and if 1 it’s a 4DW
header. If bit 1 of the format field is 0 then there is no data payload, else there is.

The type field, in conjunction with the format field, indicates the specific type of TLP this header is for. The
table below gives the possible values:
The three-bit TC field defines the traffic class. We discussed virtual channels in the last article, and how these
are mapped (through the configuration space) to traffic classes of TLPs. These three bits define to which class
the TLP belongs and thus its priority through links that have more than one virtual channel. Traffic class 0 is
always implemented and always mapped to VC0.

The TD bit is the TLP digest bit and indicates the presence of the ECRC TLP Digest word (when set). In
addition, there is the EP bit, indicating that the TLP data payload is ‘poisoned’. That means that some error
occurred, such as the ECRC check (if TLP digest present) failing at some hop over a link towards its endpoint
destination, or perhaps error correction failed when reading memory. A packet is still forwarded, through any
switches, to the destination in these cases, which is known as error forwarding. This feature is optional but, if
present, the destination reports the error and discards the packet. Though this is a reported error it need not be
fatal, as higher layer recovery mechanisms may exist.

The two attribute field bits (attr[1:0]) are to do with ordering and cache coherency (snooping—see my article
on caches). Bit 1 indicates relaxed ordering when set, like for PCI-X, but strict ordering when clear (as for PCI).
Bit 0 is a cache snoop bit, where a 1 indicates no snooping for cache coherency, and a 0 indicates cache
snooping expected. Both of these bits are only set for memory requests.

The AT field, introduced in Gen 2.0, is an address type field. There are three valid values as shown below.

• 00b: Default/untranslated
• 01b: Translation request
• 10b: Translated

These are only used for memory requests and are reserved for all other types of transactions. The use of these
bits relates to address translation services (ATS) extensions. This allows endpoints to keep local mappings
between untranslated addresses and physical addresses. Which type in the header is being sent is defined by the
AT bits, as per the list above. The translation request (01b) allows endpoints to request the ‘translation agent’
(logically sitting between the root complex and main memory) to return the physical address for storing locally.
Using local endpoint mappings relieves the bottleneck in the translation agent. There are also mechanisms for
invalidating mappings, but more details on ATS are beyond the scope of this article.

The last field is the length field. This indicates the length of the data payload in double words (32 bits). All data
in a TLP is naturally aligned to double words, with byte enables used to align at bytes and words, where
applicable. Note that a length of 0 for packets with a payload indicates 1024 double words, whilst for TLPs with
no payload the length field is reserved. Note also that a transaction’s length must not be such that the access
crosses a 4KByte boundary.
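
As an illustration of the field packing just described, the following C sketch decodes the common first header
double word. The bit positions are assumed from the conventional layout (fmt/type in byte 0, TC in byte 1,
TD/EP/attr/AT and length[9:8] in byte 2, length[7:0] in byte 3) and should be checked against the diagrams and
the specification.

#include <stdint.h>
#include <stdio.h>

typedef struct {
    unsigned fmt, type, tc, td, ep, attr, at, length_dw, hdr_dw;
} tlp_hdr0_t;

/* Decode the common first header DW from its four bytes (byte 0 first). */
static tlp_hdr0_t tlp_decode_dw0(const uint8_t b[4])
{
    tlp_hdr0_t h;
    h.fmt       = (b[0] >> 5) & 0x3;
    h.type      =  b[0]       & 0x1F;
    h.tc        = (b[1] >> 4) & 0x7;
    h.td        = (b[2] >> 7) & 0x1;
    h.ep        = (b[2] >> 6) & 0x1;
    h.attr      = (b[2] >> 4) & 0x3;
    h.at        = (b[2] >> 2) & 0x3;
    h.length_dw = ((b[2] & 0x3) << 8) | b[3];
    if (h.length_dw == 0 && (h.fmt & 0x2))
        h.length_dw = 1024;                    /* length 0 means 1024 DWs      */
    h.hdr_dw    = (h.fmt & 0x1) ? 4 : 3;       /* fmt bit 0: 3DW or 4DW header */
    return h;
}

int main(void)
{
    uint8_t dw0[4] = {0x40, 0x00, 0x00, 0x20}; /* MWr, 3DW hdr, 32 DW payload  */
    tlp_hdr0_t h = tlp_decode_dw0(dw0);
    printf("fmt=%u type=%u hdr=%uDW len=%u DW\n", h.fmt, h.type, h.hdr_dw, h.length_dw);
    return 0;
}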

Having defined the headers’ common first double word, present for all TLPs, let’s look in detail at the
individual TLP header formats and uses.

Memory Accesses
Memory access TLPs are the fundamental means for doing reads and writes over the PCIe links. As mentioned
before, memory accesses come in two forms: a 64-bit long address format and a 32-bit short address format. Both
the read and write memory requests can use either format, but a requester accessing an address less than
4GBytes must use the 32-bit format. The diagram below shows the format for the Memory requests’ headers for
the two address types.

As mentioned above when discussing the common first header DW, the fmt and type fields identify the TLP as
either a Memory Write (MWr), a Memory Read (MRd) or a Memory Read-Locked (MRdLk—see the table
above). For memory writes the length field will determine the number of double words of the accompanying
payload. For memory reads, this is the amount of data requested to be returned. If the TD bit is set, then the
ECRC digest is present.

The second double word for memory transactions contains a requester ID, a tag and byte enables for the first and
last double words. The requester ID is a unique value for every PCIe function within the fabric hierarchy and is,
as we shall see later, another means of routing transactions through the fabric. It consists of a bus number, a
device number and function number, as shown in the diagram below:

The bus and device numbers are captured by a device during certain configuration writes (more later) and must
be used by that device whenever it issues requests. Many devices are single function but, if multiple function,
then the device must assign unique function numbers to each function it contains. It is possible that the bus and
device number change at run-time and so the device must recapture these numbers if the particular type of
configuration write is received once more. This feature might be used when a new device is hot-plugged and the
system determines it would be useful to group the new device with devices of neighbouring numbers, but a
number it wants is already allocated. Before being assigned a bus and device number a device can’t initiate any non-
posted transactions as a requester ID is required to route back the completion.
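
The requester ID described above (bus, device and function) can be expressed as a small helper. A minimal
sketch, assuming the usual bus[15:8], device[7:3], function[2:0] layout, is shown below.

#include <stdint.h>
#include <stdio.h>

/* Requester/completer ID layout: bus[15:8], device[7:3], function[2:0] */
static uint16_t make_id(unsigned bus, unsigned dev, unsigned fn)
{
    return (uint16_t)(((bus & 0xFF) << 8) | ((dev & 0x1F) << 3) | (fn & 0x7));
}

int main(void)
{
    uint16_t id = make_id(1, 0, 0);            /* bus 1, device 0, function 0 */
    printf("requester ID = 0x%04x (bus %u, dev %u, fn %u)\n",
           id, (id >> 8) & 0xFF, (id >> 3) & 0x1F, id & 0x7);
    return 0;
}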

The tag field is assigned a unique value by the requester from all other outstanding requests so that it may be
identified for completions (which might be out of order from the order of requests). By default, only 32
outstanding requests are allowed, but if the device is configured for extended tags in the configuration space,
then all 8 bits can be used for 256 outstanding requests. The number of outstanding requests can be extended
even further with the use of ‘Phantom function numbers’. If a device has fewer than the full number of separate
functions that can be supported (8), then the unused function numbers may be used to uniquely identify
outstanding transactions in conjunction with the tag. Since a device must have at least one function, leaving 7,
this extends the maximum possible outstanding transactions to 1792.

The byte enables indicate the valid bytes at the beginning and end of a transaction. This means the bytes to be
written or read, when set. When set on a write, only those bytes are updated. When clear on reads, this indicates
the bytes are not to be read if the data isn’t pre-fetchable. If the payload length is 1DW, the last BE must be
0000b. Also, in this case, the first BE need not be contiguous, so 0101b or 1001b etc., are valid. For multiple
DW lengths the BE must be such to form a contiguous set of bytes, and neither must be 0000b.

After the second double word (or words), the address follows. This address must be aligned to a double word so
the lower two bits are reserved and are implied as 0.

The above descriptions refer to both writes and reads to memory. The memory read lock variant, identified with
a type of 00001b, is identical in usage to normal memory reads. It is included for legacy reasons, as PCI
supported locked reads, but is a potential source of bus lockup, so new device designs are not to include support for
this type of read and normally only root complexes would issue these types of transaction.

Completions
Completions are used as responses to all non-posted requests. That is, all read requests and non-posted write
requests (i.e., I/O and configurations writes). The diagram below shows the header format for a completion.

All completion headers are 3DW, so fmt bit 0 is always 0. A completion sets the TC, attribute, requester ID and
tag fields to match those of the request for which it is a response.

The second DW carries a completer ID, which is the bus, device and function number for the device issuing the
completion, using bus and device numbers as captured on receipt of a CfgWr0 (more later). If the bus and
device hasn’t been programmed, then a completion sets the bus and device to 0. All completions use ID routing,
and the requester ID sent with the non-posted transaction is used in the third DW for this purpose. After the
completer ID is a 3-bit completion status which has the following valid values:

• 000b: Successful completion (SC)
• 001b: Unsupported request (UR)
• 010b: Configuration Request Retry Status (CRS)
• 100b: Completer abort (CA)

Of the non-successful statuses, we have mentioned unsupported request before, where a request, such as a
vendor message, is not implemented so a completion with UR is returned. The CRS status is for configuration
requests only where, say after initialisation, a configuration request can’t yet be processed but will be able to in
the future and a retry can be scheduled. The completer abort is used only to indicate a serious error that makes
the completer permanently unable to respond to a request that it would otherwise have normally responded to
and is a reported error. The error that might result in such a response can be very high level, such as violating the
programming model of the device.

The BCM (byte count modified) field is for PCI legacy support, and PCIe completers should set this to 0. The
byte count gives the remaining byte count to complete the read request, including the payload data of the
completion. For memory reads, the completions can be split into multiple completions, so long as the total
amount sent exactly equals that requested. Since all I/O and configuration reads are 1DW in length, only one
completion is allowed for these packets. Note that a byte count of 0 equals 4096 bytes.

The final completer field to mention is the lower address field. For completions other than for memory reads,
this value is set to 0. For memory reads it is the lower byte address of the first byte in the returned data (or
partial data). This is set for the first (or only) completion and will be 0 in the lower 7 bits from then on, as the
completions, if split, must be naturally aligned to a read completion boundary (RCB), which is usually 128
bytes (though 64 bytes in root complex).
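
The splitting behaviour can be illustrated with a short sketch that carves a read completion into RCB-aligned
chunks. Note that a completer is also free to return the whole payload in a single completion, as in the traffic
example that follows, so this shows the maximum splitting only and is illustrative rather than normative.

#include <stdio.h>

#define RCB_BYTES 128u   /* read completion boundary (64 in a root complex)   */

/* Illustrative split of a read completion into RCB-aligned chunks.
 * 'addr' is the start byte address of the data, 'count' the total bytes.     */
static void split_completion(unsigned addr, unsigned count)
{
    while (count) {
        unsigned to_boundary = RCB_BYTES - (addr % RCB_BYTES);
        unsigned chunk = (count < to_boundary) ? count : to_boundary;
        printf("CplD: lower addr 0x%02x, %u bytes\n", addr & 0x7F, chunk);
        addr  += chunk;
        count -= chunk;
    }
}

int main(void)
{
    split_completion(0x40, 512);   /* 512 bytes starting 64 bytes into an RCB */
    return 0;
}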

The diagram below shows some traffic, with requests and completions, from the pcie model with just the
transaction layer enabled for display, with colour and highlights added for clarity:
In this traffic snippet we can see the down link sending out two requests; a memory read request, with a tag of
1, and a configuration read (type 0) request with a tag of 3. For the memory read, the address is given as
0xa0000080, but the first byte enable (FBE) is 0001b, so the data actually starts at byte address 0xa0000083.
The length is given as 0x21 (33) DWs, but the last byte enable is 0111b, so the actual length of the transfer, in
bytes, is 128. The traffic class is TC0 and the request has a digest word (ECRC).

The successful completion for the memory read is returned by the upstream port after the config read request,
identified as for the memory access with a tag of 1. The count is set at 0x80, matching the 128-byte request (so
no split completion), but the bytes are spread over the 33 returned DWs (132 bytes) since the address offset was
3, and the lower address value in the header reflects this.

The completion for the configuration read reflects that all configuration reads return a single DW, so the byte
count is 4 and the payload length 1.

I/O Accesses

I/O access transactions are very similar to memory access transactions, but with some restrictions. As mentioned
before, they are used to access an I/O space that’s separate from the memory address space and are really for
legacy support. The diagram below shows the header for these types of TLPs.
The main thing to note here is that only 32-bit address types are supported for I/O requests, so bit 0 of the fmt
field is always 0. An I/O request can only be for 1DW, so the length is always 1. Also, to comply with the BE
rules for 1DW payloads, the last BE is fixed at 0000b. Since the attribute field bits are associated with memory
access ordering and cache snooping, they are both set to 0 in I/O TLP headers. Other than this, I/O transactions
work in much the same way as 32-bit address, 1DW memory accesses.

Configuration Space Access

The configuration space is a third address space, separate from the memory and I/O spaces. In addition, unlike
memory and I/O TLPs, the transactions are not routed with an address but with an ID containing a bus, device,
and function number, as per the requester ID mentioned in the discussion of memory accesses. There we
talked of unique bus, device, and function number for each device on the fabric, and transactions for
configuration accesses use these to specify the destination configuration space. The header format for
configuration request TLPs is shown below:

Like I/O TLPs, configuration TLPs are only 1DW, and the same field values are set to 0 and length set to 1, as
for I/O. The device sending the configuration request also has a unique ID, with bus, device, and function
number, and this is in the second DW as for other transactions. The device it is addressing is in the third DW, in
lieu of the 32-bit address, with the target bus, device, and function numbers.

In addition, there is a register number. The configuration space is made up of a set of 8-bit registers with an
offset associated with each of them, addressed by the register number. The PCIe device has a PCI compatible
256 register space, addressed by the register number, but extends this to a 4096 register space. The extended
register number bits are used to access this extended space. Thus, the PCI compatible configuration space
occupies offsets 0 to FFh, and the PCIe extended configuration space occupies offsets 100h to FFFh.
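
Assuming the configuration TLP’s 6-bit register number and 4-bit extended register number address double words
(with the byte enables then selecting bytes within the DW), the byte offset into the 4KB space can be computed as
in this sketch; the field widths are taken from the general specification rather than the text above, so treat them
as assumptions.

#include <stdio.h>

/* Byte offset into the 4KB configuration space from the config TLP's
 * register number fields, assuming they address double words.            */
static unsigned cfg_byte_offset(unsigned ext_reg_num, unsigned reg_num)
{
    return (((ext_reg_num & 0xF) << 6) | (reg_num & 0x3F)) << 2;
}

int main(void)
{
    printf("0x%03x\n", cfg_byte_offset(0, 0x3F));  /* last DW of PCI space: 0x0FC */
    printf("0x%03x\n", cfg_byte_offset(1, 0x00));  /* first extended DW:    0x100 */
    return 0;
}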

One final thing that identifies the destination is the configuration TLP’s type—either type 1 or type 0, as shown
in the table of TLP types in the section on TLP headers. Type 0 configuration reads and writes are routed to a
destination device (endpoint) and intermediate link hops simply route the request to the destination. Type 1
configuration accesses are destined for root complexes or switches/bridges. The configuration register set for
type 1 is different from type 0, though there are common registers (more later).

Note that the bus and device numbers, as used by completions, are not fixed for a given link. Whenever an
endpoint receives a type 0 configuration write, the bus and device number used in the transaction is set in the
device’s configuration space and used in the CID of all completions it generates. It is sampled on all type 0
configuration writes, as it may be updated dynamically whilst the link is up. The configuration space itself will
be discussed in a separate section.

Messages

Messages convey a variety of information that isn’t an access to an addressable space. The general groups of
information carried by messages are:

• Interrupt signalling
• Power management
• Error signalling
• Locked transaction support
• Slot power limit support
• Vendor-defined messaging

The general format for a message header is shown below:

Message headers are 4DW, so bit 0 of the fmt field is fixed at 1. The attribute field is also fixed at 00b. Some
message types can have payloads (MsgD TLPs) as well as be assigned a traffic class. A requester ID and tag is
included as normal, but in place of the byte enables is a message code defining the type of the TLP message.
For most message types, the third and fourth double word are reserved, but are used for some types, as we shall
see.

Unlike the other TLPs, messages can have different routing types. The table in the TLP Header section listed
the Msg/MsgD types as having their lower 3 bits of type as rrr. These bits define the routing used, as shown in
the table below.
Interrupt Messages

Interrupt messages are for legacy support, though they must be implemented. The preferred interrupt signalling
method is to use message signalled interrupts—MSI or MSI-X (extended). These are implemented using normal
memory write transactions. PCI Express devices must support MSI, but legacy devices might not be capable,
and the interrupt messages are used in that case. Switch devices, at least, must support the interrupt messages.

The interrupt messages effectively implement four ‘virtual wires’ that can be asserted or deasserted—namely
A, B, C, and D, mirroring the four wires in PCI. Thus, there are two types of interrupt message, Assert_INTx and
Deassert_INTx, where x is one of the virtual wires. The message codes for the eight interrupt messages are as
follows (a short encoding sketch is given after the list):

• 00100000b: Assert_INTA
• 00100001b: Assert_INTB
• 00100010b: Assert_INTC
• 00100011b: Assert_INTD
• 00100100b: Deassert_INTA
• 00100101b: Deassert_INTB
• 00100110b: Deassert_INTC
• 00100111b: Deassert_INTD
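
The encoding of these codes is regular, and a trivial helper can generate them; a minimal sketch is shown below
(wire index 0 to 3 corresponding to INTA to INTD).

#include <stdint.h>
#include <stdio.h>

/* INTx message codes follow a regular pattern: 0010_0xyy b, where x selects
 * assert (0) or deassert (1) and yy selects the virtual wire A..D.           */
static uint8_t intx_msg_code(unsigned wire /*0=A..3=D*/, int deassert)
{
    return (uint8_t)(0x20u | (deassert ? 0x4u : 0x0u) | (wire & 0x3u));
}

int main(void)
{
    printf("Assert_INTB   = 0x%02x\n", intx_msg_code(1, 0));  /* 0x21 */
    printf("Deassert_INTD = 0x%02x\n", intx_msg_code(3, 1));  /* 0x27 */
    return 0;
}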

All interrupt messages use local routing (rrr = 100b). It is up to the switches to amalgamate interrupts arriving
on their downstream ports and map these to interrupts on their upstream port. Also, only upstream ports (e.g.,
endpoint to switch) can issue these messages as it makes no sense sending interrupts ‘away’ from the CPU
direction towards endpoint devices. Interrupt messages never have a payload (so no MsgD types). Ultimately, at
the root complex, an actual interrupt is raised on the system interrupt resource system—e.g., an interrupt
controller.

So, these messages are sent by an upstream port whenever the state of one of the interrupts changes, either to
active or inactive. Duplicate messages (e.g., a second Assert_INTB without a deassertion) have no effect but are
not errors and are ignored by the receiving device. Note that interrupts can be disabled individually in the
command register of the configuration space andn if in an asserted state when disabled, a Deassert_INTx
message must be sent.

Power Management Messages


We have already alluded to one of the power management messages, PM_Active_State_Nak, when discussing
the data link layer in the second article in this series: it is sent when a downstream device requests a lower
power state by sending PM_Active_State_Request_L1 DLLPs and that request is rejected. There
are three other message types to look at, and the full message code encodings are shown below:

• 00010100b: PM_Active_State_Nak, uses local routing (100b)
• 00011000b: PM_PME, uses routing to root complex (000b)
• 00011001b: PME_Turn_Off, uses broadcast from root complex routing (011b)
• 00011011b: PME_TO_Ack, uses gathered and routed to root complex routing (101b)

None of these messages include a payload (no MsgD types) and all are traffic class 0 (TC0).

The PM_PME message signals a ‘power management event’—e.g., some change in state of power has
completed. These are sent by an endpoint device towards the root complex and are another source of interrupt.
All these events can be enabled/disabled in the configuration space, like the interrupt messages.

The last two power management messages are the PME_Turn_off, a request broadcast from the root complex to
prepare for power removal, and PME_TO_Ack, an acknowledgment sent back to the root complex that the
appropriate state is reached. From a link LTSSM point of view, the downstream component must get to L0, if in
a lower power state, so the PME_TO_Ack can be sent, and then it eventually ends up in L2 (see the first article in
this series). Power can then be removed (L3 power state) when the root complex has seen acknowledgement
from all the devices.

Error Signalling Messages

Error signals originate from downstream components and are routed towards the root complex (routing type
000b) and do not have payloads (no MsgD types). There are three types of error messages, as listed below:

• 00110000b: ERR_COR
• 00110001b: ERR_NON_FATAL
• 00110011b: ERR_FATAL

The message types reflect correctable, non-correctable but not fatal, and fatal errors. An example correctable
error might be a TLP LCRC error, but where this can be fixed with a retry. This is correctable but the error
might still be reported for analysis and debug of error rates. A non-fatal error is one which cannot be corrected
but does not render the link itself unusable. It would then be up to software to process the error to recover the
situation if possible. The reception of a malformed packet might be an example of a non-fatal error. A fatal
error is one where a link is now considered as unreliable. For example, a time out on acknowledgements that
has reached maximum link retraining attempts. The three types of error can be individually enabled or disabled
in the device control register of the configuration space. Some error types are listed below:

• Access control services violation
• Receiver overflow
• Flow control protocol error
• ECRC check failed
• Malformed TLP
• Unsupported request
• Completer abort
• Unexpected completion
• Poisoned TLP
• Completion timeout
If extended capabilities are supported in the configuration space, then, if the advanced error reporting capability
structure is present, the above errors have their own separate status and can be enabled or disabled individually.

Locked Transaction Messages

There is a single message used to support locked transactions. As we have seen previously, there is a MRdLk
and CplDLk TLP type. A lock transaction is initiated by one or more CPU locked read accesses (with
subsequent CplDLk responses) followed by a number of writes to the same locations. This establishes a lock,
and all other traffic is blocked from using the link path from the RC to the (legacy) endpoint. The lock is released
by sending an Unlock message from the root complex. The message code value is shown below:

• 00000000b: Unlock, uses RC broadcast routing (011b)

The Unlock messages do not have payloads (no MsgD types) and always have a traffic class of 0 (TC0).

Slot Power Limit Messages

There is a single message defined for support of slot power limiting: Set_Slot_Power_Limit. This message is
used to set a slot power limit value from a downstream port of a root complex or switch to an upstream
port of a component (e.g., endpoint or switch) attached to the same link. The message code value is shown
below:

• 01010000b: Set_Slot_Power_Limit, uses local routing (100b)

The Set_Slot_Power_Limit message includes a 1DW data payload, and this data payload is copied from the slot
capabilities configuration space register of the downstream port and is written into the device capabilities
register’s captured slot power limit fields (a scale and limit) of the upstream port on the other side of the link.
The two fields then define the upper limit of power supplied by the slot, which the device must honour.

All Set_Slot_Power_Limit messages must belong to traffic class 0 (TC0).

Vendor Defined Messages

Vendor messages are meant for PCIe expansion or vendor-specific functionality. There are two types of vendor
messages defined: type 0 and type 1. Both types can be routed using one of four mechanisms: routed to RC
(000b), routed by ID (010b), broadcast from RC (011b), and routed locally (100b). The message codes for the
two vendor defined messages are as listed:

• 01111110b: Vendor_Defined Type 0
• 01111111b: Vendor_Defined Type 1

The main difference between type 0 and type 1 vendor messages is that receiving a type 0 vendor message, if
vendor messages are not implemented, triggers an unsupported request (UR) error, whereas receiving a type
1 message when not implemented discards the packet without error.

The structure of a vendor message is shown in the diagram below:


In these messages bytes 8 and 9 are a route ID (bus, device, and function numbers) when the routing type
is 010b; otherwise these bytes are reserved. The last DW is defined by the vendor specific implementation.
Vendor messages may contain payloads (Msg and MsgD TLPs supported). Bit 0 of the format field in the first
DW is fixed at 1, as the header is always 4DWs long. The attribute field, though, is not fixed and either bit may
be set, and any traffic class value can be used.

Conclusions
In this article we have gone through all the transaction layer packet types and discussed their use with the PCIe
protocols. Necessarily, this has been a summary as the amount of detail would quickly overwhelm an article
such as this.

For the most part, the transaction layer is involved in reading and writing to various address spaces: memory, I/O
and configuration, each with their own transaction layer packet (TLP) types. These access requests, where
applicable, result in completion packets with a success/error status and returned data where reading—which can
be split into smaller completions. Each outstanding packet request has a unique tag, and completions are matched
with the request using the same tag number. We have also seen that packets can be routed using different
mechanisms—address, ID, routed to RC (possibly gathered), broadcast from RC and routed to local link. As
well as the different kinds of reading and writing transactions, there are message TLPs used for interrupt
signalling, error reporting, power management, locked transaction support, and vendor defined messages.

We have now covered all three layers of the PCIe protocol, so that should be it, right? In this article I have
mentioned the configuration space on numerous occasions but with only a few details to explain what was
necessary. In the next (and last) article I want to look at the configuration space in a little more detail to see
what information it contains and what can be controlled. Then we will finish with a quick look at later
specifications, including PCIe 6.0, released in January of this year (2022) and PCIe 7.0, the development of
which was announced in June. I will also try and summarise what features these articles have not covered,
through lack of space and time.
PCI Express Primer #4: Configuration Space
Simon Southwell | Published Aug 17, 2022

Introduction
The configuration space for each link is where driver software can inspect the capabilities and status advertised
by the device and set certain parameters. As was mentioned in the last article, when discussing configuration
space access TLPs, there are two types of spaces—type 0 and type 1, with corresponding configuration TLP
types. The former is for endpoints, and the latter for root complexes and switches that have virtual PCI-PCI
bridges.

A lot of work was done in constructing the PCI Express specification to give backwards compatibility to PCI
configurations such that operating system code for PCI could function when enumerating and configuring
systems which now had PCIe components. However, operating systems with PCIe aware software can have
access to extended capability status and configuration. The original PCI configuration space was 256 bytes.
This is now extended to 4096 bytes, with the first 256 bytes for PCI and the rest for PCIe extended capabilities.
Furthermore, within the 256 byte PCI space, the first 64 bytes are fully PCI compatible registers, with the other
192 bytes used for PCIe capabilities that can be accessed by legacy PCI OS code. The diagram below
summarises this arrangement:

Since the configuration space is designed to be compatible with legacy OS code, and must function as such, the
extended capabilities are all optional and meant to enhance functionality for operating systems that can make
use of them.

In this article the configuration space's structure and registers will be summarised. It will necessarily be an
overview, and we will not get bogged down in the fine detail of every individual field in all the registers—the
specifications are the source of this information. Instead, I want to give a 'picture' of what information and
configuration options are available to driver software through the configuration spaces.

Configuration Space
Both type 0 and type 1 configuration spaces have a set of common registers in the PCI compatible region (0 to 3Fh).
The diagram below shows these common registers and their relative position in the configuration space.
The device ID and vendor ID are read-only registers that uniquely identify the function. The vendor ID is
assigned by the PCI-SIG and is different for each vendor, but the device ID is set by the vendor to identify the
function. The revision ID field is also set by the vendor to identify hardware and/or firmware versions.

The command register allows some global control of the device, the main one being a master enable (bit 2). For
an endpoint this means it may issue memory and I/O read and write requests. For a root complex/switch it
enables the forwarding of such transactions. Because this is a PCI compatible register there are a set of register
fields that must be present but which, for PCIe devices, are hardwired to 0. There is also control of whether poisoned
TLPs are flagged in the status register's master data parity error bit or not, and control of reporting of non-fatal and fatal errors—
though this is also controllable through the PCIe device control register. Finally, there is a bit for disabling the
legacy INTx message interrupts.

The status register has a set of read-only bits and some which have write-one-to-clear functionality. Like the
command register, many of these are hardwired for PCIe. The interrupt status field (bit 3) indicates a pending
INTx interrupt. A master data parity error bit (bit 8) flags a poisoned TLP, when the command register is
configured to enable this. Bit 15 (detected parity error) also reflects a poisoned status but can’t be disabled. A
couple of bits flag whether a device is completing with an ABORT error (bit 11) or has received a packet with
this error (bit 12). Both can be cleared by writing a 1 to the bits. Similarly, if a completion with unsupported
request arrives, a received master abort bit (bit 13) is set and can be cleared. If a function sends ERR_FATAL
or ERR_NONFATAL status in a TLP, a signaled system error bit (bit 14) is set, but only if the command
register is configured to enable this.

The class code field is a PCI register for identifying the type of function, with different numbers representing
different classes of functionality. For example, a class code of 02h indicates a network controller and 01h a mass
storage controller. These are defined in the PCI Code and ID Assignment Specification.

The cache line size register is usually programmed by the operating system to the implemented cache line size.
However, in PCIe devices, although it remains a read/write register for compatibility, it has no effect on the device.
Similarly, the master latency timer is not used for PCIe and is hardwired to 0.

The header type identifies whether the space is type 0 or type 1, thus defining the layout of the rest of the type
specific registers. The BIST register allows control of any built-in-self-test of the function. Bit 7 indicates
whether a BIST capability is available, and bit 6 is written to 1 to start the test if available. A result is returned
in bits 0 to 3.

The interrupt line register is a read-write register that is programmed by the operating system if an interrupt pin
is implemented for interrupt routing. The device doesn’t use this value but must provide the register if an
interrupt pin is implemented. The interrupt pin register is a read-only register that indicates which legacy interrupts are
used (if any). Valid values are 1, 2, 3, and 4 for each of the INTA to INTD legacy interrupt messages. A value
of 0 indicates no legacy interrupt message support. (See Interrupt Messages section in article 3.)

Finally, the capabilities pointer register indicates an offset, beyond the header registers, to further capability
register structures. In other words, beyond the PCI registers, the location of other capabilities is not fixed within
the configuration space. Instead, capabilities are arranged as a linked list of structures so that only those
structures required within the function need to have registers implemented, or allocated space within the
configuration space. The structures themselves have a fixed relative set of registers, but these can be anywhere
in a valid region of configuration space for that structure, aligned to 32-bits. The capability pointer register
gives the offset for the first capability structure. A popular value would be 40h, the first offset beyond the PCI
compatible space—but it need not be.
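To make the layout concrete, the sketch below lists the standard byte offsets of these common registers (the offsets come from the standard PCI header definition rather than the text above) along with a trivial use; the cfg_read8()/cfg_read16() helpers are assumptions standing in for whatever configuration access mechanism is available, not part of any real API:

#include <cstdint>

// Assumed configuration space read helpers (illustrative only)
uint8_t  cfg_read8 (uint32_t offset);
uint16_t cfg_read16(uint32_t offset);

// Standard offsets of the common (type 0 and type 1) PCI compatible registers
constexpr uint32_t CFG_VENDOR_ID     = 0x00;  // 16-bit, read-only, assigned by the PCI-SIG
constexpr uint32_t CFG_DEVICE_ID     = 0x02;  // 16-bit, read-only, assigned by the vendor
constexpr uint32_t CFG_COMMAND       = 0x04;  // 16-bit
constexpr uint32_t CFG_STATUS        = 0x06;  // 16-bit
constexpr uint32_t CFG_REVISION_ID   = 0x08;  // 8-bit
constexpr uint32_t CFG_CLASS_CODE    = 0x09;  // 24-bit (prog I/F, sub-class, base class)
constexpr uint32_t CFG_CACHE_LINE    = 0x0c;  // 8-bit, no effect for PCIe
constexpr uint32_t CFG_LATENCY_TIMER = 0x0d;  // 8-bit, hardwired to 0 for PCIe
constexpr uint32_t CFG_HEADER_TYPE   = 0x0e;  // 8-bit, selects the type 0 or type 1 layout
constexpr uint32_t CFG_BIST          = 0x0f;  // 8-bit
constexpr uint32_t CFG_CAP_PTR       = 0x34;  // 8-bit offset of first capability structure
constexpr uint32_t CFG_INT_LINE      = 0x3c;  // 8-bit
constexpr uint32_t CFG_INT_PIN       = 0x3d;  // 8-bit

// Example: the first check enumeration software typically makes is to read the
// vendor ID, treating a value of all 1s as no function present, before going on
// to fetch, for instance, the capability pointer.
bool function_present(uint8_t &cap_ptr)
{
    if (cfg_read16(CFG_VENDOR_ID) == 0xffff)
        return false;

    cap_ptr = cfg_read8(CFG_CAP_PTR);             // e.g. 40h, but need not be
    return true;
}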

Type 0 Configuration Space

Type 0 configuration spaces are for endpoints. Beyond the common registers described above, this type of
configuration space is mainly given over to defining base address registers (BARs), but with a few extra
registers thrown in. The diagram below shows the layout of the PCI compatible region for a type 0
configuration space.
There are six base address registers which are used to define regions of memory mapping that the device can be
assigned. This can be up to six individual 32-bit address regions, or even/odd pairs can be formed for 64-bit
address regions. The lower 4 bits define characteristics of the address:

• Bit 0 is region type: 0 = memory, 1 = I/O


• Bits 2:1 are the locatable type (memory only): 0 = any 32-bit region, 1 = < 1MB, 2 = any 64-bit region
• Bit 3 is prefetchable flag (memory only): 0 = not prefetchable, 1 = prefetchable

If the BAR is for I/O, bit 1 is reserved and bits 3:2 are used as part of the naturally aligned 32-bit address. If bits
2:1 of an even BAR register indicate a 64-bit address, then the following BAR register holds the upper bits (63:32) of
the address. The operating system software can determine how much space the device is requesting to be
reserved in the address space by writing all 1s to the BAR registers. The implementation hardwires the address
bits to 0 for the requested space. For example, if wishing to reserve 8Mbytes of address space, bits 22:4 will be
hardwired to 0, and read back as such. The minimum that can be reserved is 128 bytes. Once the software has
determined the requested space it can allocate this in the memory map and set the upper bits of the BAR to the
relevant address, naturally aligned to the requested space size. Normally the BAR prefetchable bit would be set
unless the device has regions where reading a location would have side effects. Designing devices with this
characteristic is strongly discouraged, so that all regions can be made prefetchable.
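As an illustration of the sizing procedure just described, the sketch below performs the classic write-all-ones probe on a single 32-bit memory BAR; cfg_read32()/cfg_write32() are assumed helpers, and real enumeration code must also cope with 64-bit and I/O BARs:

#include <cstdint>

// Assumed configuration space access helpers (illustrative only)
uint32_t cfg_read32 (uint32_t offset);
void     cfg_write32(uint32_t offset, uint32_t value);

// Probe a 32-bit memory BAR to discover the size of the region being requested.
// bar_offset is the configuration space offset of the BAR (10h, 14h, ... 24h).
uint32_t probe_bar_size(uint32_t bar_offset)
{
    uint32_t orig = cfg_read32(bar_offset);         // preserve the current value

    cfg_write32(bar_offset, 0xffffffff);            // write all 1s
    uint32_t readback = cfg_read32(bar_offset);     // hardwired 0s reveal the size

    cfg_write32(bar_offset, orig);                  // restore the original value

    // Mask off the low 4 characteristic bits, then the size is the two's
    // complement of the remaining (writable) address bits.
    return ~(readback & 0xfffffff0u) + 1;           // e.g. 0x00800000 for an 8Mbyte request
}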

The cardbus CIS pointer points to the cardbus information structure, should that be supported—CardBus being
a category of PCMCIA interface. The lower 3 bits can indicate that it is in the device's configuration space, or
that a BAR register points to it, or that it’s in the device’s expansion ROM (see below). It is beyond the scope of
this document to expand on this any further.

The subsystem ID and subsystem vendor ID are similar to that of the common registers but are used to
differentiate a specific variant. For example, a common driver for network cards with the same device
ID/vendor ID combinations might wish to know further information, such as different chips used on the
particular device, to alter its behaviour to match any slight variations in configuration or control.

The expansion ROM base address is similar to other BARs and is used to locate, in the memory map, code held
in a ROM on the device that can be executed during BIOS initialisation. The BIOS will initialise this BAR but
then hand off execution to this code, usually having copied the code to main memory. The code will have
device specific initialisation routines.

The last two type 0 registers, min_gnt and max_lat are not relevant to PCIe and are hardwired to 0.

Type 1 Configuration Space

As mentioned before, the type 1 configuration space is for switches and root complexes—basically devices with
virtual PCI-PCI bridges. The diagram below shows the layout of the PCI compatible region for a type 1
configuration space.
Type 1 spaces also have base address registers for mapping into the address space, but it’s limited to two 32-bit
regions or a single 64-bit region. The primary bus number register is not used in PCIe but must exist as a read/write
register for legacy software. The secondary bus number is the bus number immediately downstream of the
virtual PCI-PCI bridge, whilst the subordinate bus number is the highest bus number of all the busses that are
reachable downstream. These, then, are used to construct the bus hierarchy and to route packets that use ID
routing.

The secondary latency timer is not used in PCIe and is tied to 0. The secondary status register is basically a
mirror of the common status register but without the interrupt status or capabilities list flag bits.

The type 1 configuration space registers also have a set of base/limit pairs, split over multiple registers, which
define an upper and lower boundary for the memory and I/O regions. The memory region is split into non-
prefetchable and prefetchable regions (see above). If a TLP is received by the link from upstream, and it fits
between the base and limit values (for the relevant type) of the link, it will be forwarded on that downstream
port.
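A hedged sketch of how these registers drive a downstream forwarding decision is shown below, combining the ID-routing check (secondary/subordinate bus numbers) and the address-routing check (base/limit windows); the structure and field names are illustrative only:

#include <cstdint>

// Illustrative snapshot of a type 1 (virtual bridge) port's routing registers
struct BridgeRouting {
    uint8_t  secondary_bus;              // bus immediately downstream of the virtual bridge
    uint8_t  subordinate_bus;            // highest bus number reachable downstream
    uint64_t mem_base,  mem_limit;       // non-prefetchable memory window
    uint64_t pref_base, pref_limit;      // prefetchable memory window
};

// ID-routed TLPs are forwarded downstream if the target bus number falls
// within the secondary..subordinate range.
bool forward_id_routed(const BridgeRouting &p, uint8_t target_bus)
{
    return target_bus >= p.secondary_bus && target_bus <= p.subordinate_bus;
}

// Address-routed memory TLPs are forwarded downstream if the address falls
// within the relevant base/limit window.
bool forward_addr_routed(const BridgeRouting &p, uint64_t addr, bool prefetchable)
{
    if (prefetchable)
        return addr >= p.pref_base && addr <= p.pref_limit;
    else
        return addr >= p.mem_base && addr <= p.mem_limit;
}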

The expansion ROM base address has the same function as for type 0, though it is located at a different offset.

The bridge control register has various fields, some of which aren’t applicable to PCIe, and some of which are
duplicates of the common control register. The parity error response enable (bit 0) is a duplicate of bit 6 of the
common control register. The SERR# enable (bit 1) is a duplicate of bit 8 of the common control register.
Master abort mode (bit 5) is unused for PCIe and hardwired to 0, along with fast back-to-back transactions
enabled (bit 7), primary discard timer (bit 8), secondary discard timer status (bit 9), discard timer status (bit
10), and discard timer SERR# enable (bit 11). The only remaining active bit is the Secondary Bus Reset (bit
6). Setting this bit instigates a hot reset on the corresponding PCIe port.

Capabilities

Beyond the PCI compatible type 0 and type 1 headers we have looked at, there are other capability registers
that may (or should) be included in the PCI configuration space (i.e. between 00h and FFh). These capabilities
are listed below:

• PCI power management capabilities


• MSI capabilities (if device capable of generating MSI or MSI-X interrupt messages)
• PCIe capabilities

Each capability structure can be located anywhere in the PCI configuration space, aligned to 32-bits, and will
have a capability ID in the first byte to identify the type of structure. This is then followed by a next capability
pointer byte that links to the next structure. A value of 00h for the next capability pointer marks the end of the
linked list of capabilities.
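This linked-list arrangement lends itself to a simple walk, sketched below, to find a capability by its ID (e.g. 01h power management, 05h MSI, 10h PCIe, 11h MSI-X); cfg_read8() is again an assumed configuration read helper:

#include <cstdint>

// Assumed configuration space read helper (illustrative only)
uint8_t cfg_read8(uint32_t offset);

// Walk the PCI capability linked list looking for a given capability ID.
// Returns the offset of the structure, or 0 if it is not present.
uint8_t find_capability(uint8_t wanted_id)
{
    uint8_t offset = cfg_read8(0x34);           // capabilities pointer register

    while (offset != 0x00)                      // a next pointer of 00h ends the list
    {
        uint8_t id   = cfg_read8(offset);       // capability ID in the first byte
        uint8_t next = cfg_read8(offset + 1);   // next capability pointer byte

        if (id == wanted_id)
            return offset;

        offset = next;
    }

    return 0;
}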

Power Management Capability Structure

The power management capability has an ID of 01h. A diagram of this structure is shown below:
The power management capability structure, after the common ID and next capability pointer, has a power
management capabilities register (PMC) which is a set of read-only fields indicating whether a PME clock is
required for generating power management events (hardwired to 0 for PCIe), whether device specific
initialisation is required, the maximum 3.3V auxiliary current required, whether D1 and D2 power states are
supported and which power states can assert the power management event signal. In addition, this structure has
a power management control/status register (PMCSR). A two-bit field indicates the current power state
(D3hot to D0) or, when written to, sets a new power state. Another bit indicates whether a reset is performed
when software changes the power state to D0. Generation of power management events can also be enabled. A data select field
selects what information is shown in the data register (the last byte of the power management capabilities
structure) if that register is implemented. This optional information is for reporting power dissipated or
consumed for each of the power states. Finally there is a PME_Status bit that shows the state of the PME#
signal, regardless of whether PME is enabled or not.

After the power management control/status register is a bridge support extension register (PMCSR_BSE) that
only has two bits. A read-only B2_B3# bit determines the action when transitioning to D3hot. When set, the
secondary bus's clock will be stopped. When clear, the secondary bus has power removed. A BPCC_En bit
indicates that the bus power/clock control features are enabled or not. When not enabled the power control bits
in the power management control/status register can’t be used by software to control the power or clock of the
secondary bus.

MSI Capability Structure

If a device is capable of generating message signalled interrupts (MSI) or extended message signalled interrupts
(MSI-X) messages, then it must have an MSI capabilities structure and/or an MSI-X capabilities structure. Only
one of each is allowed, but an MSI and MSI-X can co-exist (or neither or just one). In general, message
signalled interrupts are a way of signalling the state of an interrupt line (or lines) by writing to a given address,
in place of routing interrupt wires. We saw in article 3 that message TLPs can be used to signal legacy
interrupts, but the MSI/MSI-X is the preferred method using normal memory writes. In order to know what
address to write to, and to give some control over the interrupts, a device has an MSI or MSI-X capability
structure. The diagram below shows a summary of the MSI capabilities structure.
In the above diagram the standard capability ID (05h for MSI and 11h for MSI-X) and next capability pointer
registers are shown. This is followed by a message control register (more on this shortly). The next register is the lower
32 bits of the address for sending any MSI messages. If the device can only generate 32-bit accesses, then the
address is defined by this first address register. If it is 64-bit capable, then there is an upper 32-bit address
register. The address is followed by a message data register (16 bits), to be sent to the MSI address. Then there
are two optional registers, both 32-bit. These are present if a bit in the message control register is set and give
control and status of interrupts. The mask allows software to disable certain messages, whilst the pending status
indicates which interrupts are active.

The message control register has various fields for controlling interrupts. Firstly an MSI enable bit is a master
enable/disable control. A 3-bit multiple message capable field indicates the requested number of vectors that the
device would like, in powers of 2, from 1 to 32. However, it may not be granted these, and a writable multiple
message enable field gives the actual number enabled by the software. These two fields are encoded as log2(n),
so 1 vector is encoded as 0, and 2 as 1 and so on, up to 5 for 32. The next bit flags whether the device can
generate 64-bit message addresses, in which case the message upper address register is present in the structure.
The last active bit flags whether masking is supported for the multiple MSI vectors. If set, then the mask and
pending registers are implemented in the structure. These 32-bit registers map one bit to each of the 32 possible
vectors, allowing individual masking and pending status. The message data register is the message to send. This
is the interrupt vector value and must be restricted to values less than the number of vectors granted in the
multiple message enable field.
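A small sketch of decoding these message control fields, including the log2 encoding of the vector counts, is shown below; the bit positions are taken from the standard MSI message control layout and should be treated as assumptions if checking against a specific specification revision:

#include <cstdint>

// Decoded view of the 16-bit MSI message control register (standard layout assumed)
struct MsiControl {
    bool     enabled;             // master MSI enable
    unsigned vectors_requested;   // from the multiple message capable field
    unsigned vectors_granted;     // from the multiple message enable field
    bool     addr_64bit;          // upper address register present if set
    bool     per_vector_masking;  // mask and pending registers present if set
};

MsiControl decode_msi_control(uint16_t msg_ctrl)
{
    MsiControl c;
    c.enabled            =  msg_ctrl        & 0x1;
    c.vectors_requested  = 1u << ((msg_ctrl >> 1) & 0x7);  // log2 encoded: 0 => 1, 5 => 32
    c.vectors_granted    = 1u << ((msg_ctrl >> 4) & 0x7);
    c.addr_64bit         = (msg_ctrl >> 7) & 0x1;
    c.per_vector_masking = (msg_ctrl >> 8) & 0x1;
    return c;
}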

The MSI-X capability structure is just an extension of the MSI capability but allows a greater number of
vectors. It has a capability ID (11h for MSI-X) and next pointer register, along with the message control
register, just as for MSI. It then has two offset registers—the first pointing to a table of 32-bit words of
addresses, data, and vector control, the other to a table of 64-bit words with pending status. The message control
register gives the size of the table, a global mask to mask all vectors and a master MSI-X enable. I won’t
elaborate further on MSI-X, as its function is very similar to MSI, just with higher granularity of control and
status and the support of more vectors.

PCIe Capabilities Structure

All PCIe devices must have a PCIe capability structure. The initial registers are a capability ID (10h), a next
capabilities pointer and a PCIe Capabilities Register. The rest of the structure consists of a set of three
registers—status, control, and capabilities— for each of four types, namely device, link, slot, and root. The first
three of these types are split across two sets of registers. For a given configuration space, all the registers must
be present, but if a register is not relevant for a given configuration space, such as for the root complex registers
of an endpoint, then these are reserved and set to 0. The diagram below shows the layout of the PCIe capability
structure.
All devices have the PCIe capabilities register and the device registers. The link registers are active for devices
with links, whilst the slot registers are active for ports with slots (such as a device on a card that plugs into a
connector, as opposed to an integrated device). Root ports will include the root registers, along with the others.
There are a lot of controls and statuses in these registers and this section will summarise relevant ones, based
largely on the Gen 2.0 specification. Hopefully this will give a flavour of the control and status available to
software for configuring and enumerating PCIe devices.

The read-only PCIe capabilities register has fields for identifying the capability version, as well as the device
port type (e.g. endpoint, root port of an RC, upstream or downstream port of switch or RC event collector etc.).
In addition, a flag indicates whether a port is connected to a slot (as opposed to an integrated component or
disabled) and is valid only for root ports of an RC and downstream ports of a switch. A 5-bit field indicates
which MSI/MSI-X vector will be used for any interrupt messages associated with the status bits of the structure.

The device capabilities register includes fields to indicate the following functionality:

• Maximum payload size supported: 128 to 4096 bytes


• Phantom functions supported: use of unused function numbers to extend tags
• Extended tag supported: use of extended tags from 5 to 8 bits
• Endpoint L0s acceptable latency: acceptable latency from L0s to L0 state from 64ns to 4μs, or no limit
• Endpoint L1 acceptable latency: acceptable latency from L1 to L0 state from 64ns to 4μs, or no limit
• Role based error reporting: basically must be set for generations after 1.0a
• Captured slot power limit value: specifies upper limit, with scale register, on power to the slot. This
register is the normalised value to be multiplied with scale
• Captured slot power limit scale: Scale for slot power limit, from 1.0×, down to 0.001× in Watts.
• Function level reset capability: Endpoint supports function level reset (other types must be 0).

The device control register is a read/write register that has various configuration and enable fields:

• Correctable error reporting enable


• Non-fatal error reporting enable
• Fatal error reporting enable
• Unsupported request reporting enable
• Relaxed ordering enable
• Maximum payload size: The actual maximum allowed on link, even if capabilities advertise large
payload capability.
• Extended tag field enable
• Phantom functions enable
• Aux power PM enable: when set enables function to draw Aux power independent of PME Aux.
• Enable no snoop
• Maximum read request size
• Initiate function level reset: function level reset capable endpoints

The device status register is a read-only register with the following status bits:

• Correctable error detected


• Non-fatal error detected
• Fatal error detected
• Unsupported request detected
• Aux power detected: set by functions that need Aux power and have detected it
• Transactions pending: Set by endpoints waiting on completions, or by root and switch ports waiting on
completion initiated at that port

The device capabilities 2 register adds indication of completion timeout support and timeout ranges, with the
device control register 2 giving the ability to set the timeout or disable it. These ranges go from 50μs to 64s.
The device status register 2 is not used.

The link capabilities register has information about the supported link speeds, maximum link width, active state
power management (ASPM) support, L0s and L1 exit latencies, clock power management, surprise power down
error support, data link layer active reporting capability, link bandwidth notification capability, and a port
number. The control register adds control bits for the link: ASPM control, read completion boundary (RCB),
link disable, retrain link, common clock config, extended synch, enable clock power management, hardware
autonomous width disable, link bandwidth management interrupt enable, and link autonomous bandwidth
interrupt enable. The link status register indicates the link’s current speed, negotiated width, link training (in
LTSSM config or recovery states), slot uses same ref clock as platform, data link layer is active, link bandwidth
management status, link autonomous management status. The link capabilities 2 register is not used, but the link
control 2 register adds control for target link speed, enter compliance, hardware autonomous speed disable,
select de-emphasis, transmit margin, enter modified compliance, compliance send SKP OSs, and compliance
de-emphasis. The link status 2 register just reports the current de-emphasis level.

The slot capabilities register has information about the presence (or not) of buttons, sensors, indicators and
power controllers. It advertises hot-plug features and slot power limits. The slot control register adds enables
and disables for these advertised functions, whilst the slot status register indicates state change for enabled
functions and power faults. The slot 2 registers are unused.
The root command register gives enable control for correctable, non-fatal and fatal error reporting, whilst the
root status register gives PME status information. The root capabilities register has one control bit to enable
configuration request retry (CRS) software visibility.

Extended Capabilities

In addition to the capabilities discussed above (as if that were not enough) there are a set of extended capability
structures possible that would reside in the PCIe configuration space (100h to FFFh). These are all optional, and
so I will simply summarise the capability functionality that is relevant to discussions in the previous articles,
namely advanced error reporting and virtual channels. The full list of extended capabilities (at Gen 2.0) is given
below:

• Advanced Error Reporting capabilities


• Virtual Channel capability
• Device Serial Number capability
• PCIe RC link declaration capability
• PCIe RC internal link control capability
• Power budgeting capability
• Access control services extended capability
• PCIe RC event collector endpoint association capability
• Multi-function virtual channel capability
• Vendor specific capability
• RCRB header capability

All these extended capabilities start with a capability ID and next capability pointer, just like PCI capabilities,
but the ID is now 16 bits and the pointer 12 bits.
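The extended capability header fits the 16-bit ID and 12-bit next pointer (with a small version field in between) into a single DW, and the list starts at offset 100h. A minimal walk, mirroring the PCI capability walk earlier and again using an assumed cfg_read32() helper, might be:

#include <cstdint>

// Assumed configuration space read helper (illustrative only)
uint32_t cfg_read32(uint32_t offset);

// Walk the extended capability list, starting at offset 100h, looking for a
// 16-bit capability ID. Returns the offset of the structure, or 0 if absent.
uint16_t find_extended_capability(uint16_t wanted_id)
{
    uint16_t offset = 0x100;

    while (offset != 0)
    {
        uint32_t header = cfg_read32(offset);       // ID, version and next pointer in one DW

        uint16_t id   =  header        & 0xffff;    // 16-bit capability ID
        uint16_t next = (header >> 20) & 0xfff;     // 12-bit next capability offset

        if (id == wanted_id)
            return offset;

        offset = next;                              // a next offset of 0 ends the list
    }

    return 0;
}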

Advanced error reporting gives more granularity and control over specific errors than the default functionality.
For uncorrectable errors, differentiation can be made—for example, between malformed TLP and ECRC
errors—for masking, status, and control of severity (i.e. fatal/non-fatal). Status and mask registers exist for
correctable errors (though no severity register). A control and capabilities register gives status and control over
ECRC. A set of header log registers capture the TLP header of a reported error. For root ports and root complex
event collectors, a set of registers enable/disable correctable, non-fatal and fatal errors, and report error
reception, as well as logging the source (requester ID) of an error and its level. The diagram below gives an
overview of the advanced error reporting capabilities structure.
As mentioned in article 2, virtual channels along with traffic classes can be used to give control of priority for
packets with differing traffic classes. The virtual channel capabilities structure controls the mapping of traffic
classes to virtual channels. In particular there are a set of VC Resource Capabilities and Control registers, one
for each supported VC, mapping the traffic classes to the particular VC (an 8-bit bitmap for each of the 8 TCs).
Various arbitration schemes can be advertised and then selected. An optional VC arbitration table can be
defined for weighted round robin schemes. A similar table for port arbitration can be added for switch, root and
RCRB ports. The diagram below gives an overview of the virtual channel capabilities structure.
Note that the lower 3 bits of the port VC capability register 1 define the number of supported virtual channels
n. This determines the number of VC resource register groups (of capability, control, and status) and the number
of port arbitration tables, pointed to by the port arbitration table offset field of the VC resource capability
register for each channel.

As for the rest of the extended capabilities, we have run out of room to discuss these here, but hopefully the
names indicate their general function and more detail can be found in the specifications.

Real World Example


To summarise this discussion of configurations space, a real-world example is in order. Below is shown a
snapshot output from lspci for the endpoint device I designed at Quadrics, during its development, with the type
0 configuration space information displayed in a textually formatted manner.
The device is configured as a 16 lane Gen 1.1 device in this output, but it could also be configured as an 8 lane
Gen 2.0 device. Many of the fields can be configured in software, and some of the power and latency numbers
may not be final in this snapshot. It should be noted that this version is a minimalist implementation. If you've
made it this far then you will appreciate that this set of articles, even if just a primer and summary, indicates
how large the PCIe specification is. Implementing a compliant device is non-trivial, but manageable if one
understands only what is required for one’s device to meet the specification rules.

Note on the PCIe Model Configuration Space


For those wishing to use the PCIe simulation model (pcievhost), this section gives some notes on the limitations
of its configuration space functionality. If not using the model you may skip this section.

The model was originally conceived as a means to teach myself the PCIe protocol by implementing something
that would actually execute, generating traffic for all three layers, and the results inspected. It grew from there
to be used in a work environment to test an endpoint implementation and eventually even co-simulating with
kernel driver code (see my article on PLIs and co-simulation). In that context the model was designed to
generate all the different configuration read and write TLPs so that an endpoint’s configuration space, amongst
other things, could be accessed and tested. An endpoint does not generate configuration access packets, and so
the model was not required to process these.

None-the-less, the model will accept configuration read and write packets of type 0 if the model is configured as
an endpoint (via a parameter of the Verilog module). In this case the model simply has a separate 4Kbyte buffer
that is read or written to by the CfgRd0 and CfgWr0 TLPs. By default, the configuration space buffer is
uninitialised, but the model provides a couple of read and write direct access API functions so that the space
may be configured by the local user program. There is no mechanism at present to have read-only bits, bytes, or
words (as seen by the Cfg TLP accesses), and so the model will not respond correctly to CfgWr0 TLPs that write to read-
only fields. More details can be found in the model's documentation.

PCIe Evolution
PCIe has evolved from its initial release in 2003 until the latest 6.0 specification released in January of this year
(2022). In general, each major revision has doubled its raw bit rate. This does not translate to a doubling of data
rate due to the encoding overheads. We have seen 8b/10b and 128b/130b encoding, but the latest specification
changes the encoding once more to a pulse-amplitude-modulation scheme and organizes data into fixed size
units (FLITs). In fact, in June of this year, a 7.0 specification was announced as being in development, with
finalisation expected in 2025.

The table below summarizes the major characteristics of the PCIe generations

Since the first article discussed the encodings for 1.x to 5.x, we will not discuss this here, but we will look
briefly at the 6.0 PAM encoding for reference, as it’s not been covered elsewhere. However, the PCIe 6.0
specification has only just been released, and is available only to members of the PCI special interest group (PCI-SIG),
of which I am not a member, nor do I work for a company that is (and the membership fee is $4000 annually—which
is a bit pricey for me). So, I will summarise what I know from the press releases and other available public
information.

The first thing that's different is that support for link widths of ×12 and ×32 has been dropped. It turns out that
no-one really designed PCIe links at these widths, and so dropping them simplifies the specification. A more
major change is a move away from straightforward ones and zeros on the serial lines to multi-level encoding
using PAM.

PAM, CRC and FEC

Pulse amplitude modulation (PAM) is a signal modulation where data is encoded in the amplitude of a signal as
a set of pulses. In PCIe generations before 6.0, the signal was the 'traditional' 1s and 0s in an NRZ (non-return
to zero) format using just two levels. In PCIe 6.0 this is replaced with a PAM4 scheme—that is, a PAM with
four levels, with each level encoding two bits, as listed below (a small code sketch of the mapping follows the list).

• V0 = 00b
• V1 = 01b
• V2 = 10b
• V3 = 11b
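As a minimal sketch of this mapping (assuming, for illustration only, that the least significant bit pair maps to the first symbol; the specification defines the actual ordering and any coding of bit pairs to levels):

#include <array>
#include <cstdint>

// Map one byte onto four PAM4 symbols (levels 0 to 3), two bits per symbol.
std::array<uint8_t, 4> byte_to_pam4(uint8_t data)
{
    std::array<uint8_t, 4> symbols;

    for (int i = 0; i < 4; i++)
        symbols[i] = (data >> (2 * i)) & 0x3;   // 00b = V0, 01b = V1, 10b = V2, 11b = V3

    return symbols;
}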
An example of encoding a byte is shown in the diagram below:

This is fine and straightforward to understand and doubles the bandwidth without increasing the maximum
channel frequency. However, there are problems introduced by this method. In NRZ there is maximum
separation between 0 and 1 states, giving a better signal-to-noise ratio (SNR). With PAM4, there is less
separation between adjacent levels and the SNR is reduced (see diagram below), increasing the channel bit-
error-rate (BER). Therefore, further measures are necessary.
Firstly, the PCIe 6.0 specification adds forward error correction (FEC). In the information I’ve read, this is
described as a “lightweight” FEC without detailing the algorithm used. In general, FECs work on fixed size
blocks of data and so data is now organised into flow controls units (FLITS) of 256 bytes. This is organised as
242 bytes of data (with the TLPs and DLLPs), 8 bytes for a CRC and 6 bytes for FEC. I would infer from this
that a Reed-Solomon code is used for the FEC; RS(256,250). The FEC decreases the error rate to acceptable
levels but does not eliminate it. The CRC is added for detection of errors should the FEC fail. If the CRC fails
then the DLLP NAK and resend mechanism comes into play. The main point of this is to get error rates to a
point where the latency of retries is at previous generations’ levels, under the reduced SNR of PAM4.

A future article is planned on error correction and detection, including Reed-Solomon codes. So look out for
this if interested in more FEC details.

Conclusions
In these four articles on PCIe we have built up the layers from the physical serial data and encodings, data link
layer for data integrity, transaction layer for transporting data and finally configuration space for software
initialisation and control. These articles have ended up larger than I was expecting, and I have only summarised
the protocol. Power management and budgeting has only been lightly touched upon and features such as access
control services, root complex register blocks, device synchronization, reset rules, and hot plugging are lightly
skipped over. In this article, most of the extended capabilities structures have been skipped.

What these articles have attempted to achieve is give a flavour of how PCIe works, with enough information to
jump off to more detailed inspection as and when required. Even if one is never involved with PCIe directly, a
working mental picture of how it functions is useful when working on systems that include PCIe.
There are likely to be errors in these articles, and things have changed over the different generation
specifications. Whilst writing the articles, reviewing nearly always found errors against the specification, so let
me know if you find any and I will try and correct the documents.

Access to Specifications
The PCIe specifications are published by the PCI-SIG. However, they are only made available to members and,
as mentioned above, the cost of membership is not an insignificant amount. It is not my place to publish this
information here, or even to link to places where some of the earlier specifications have been made freely
available, as I don’t know what their legal status is (even though some are available from very reputable
organisations). A quick search on Google for "pcie specifications", however, will quickly return links to some of
these documents for those interested in exploring PCIe in more detail. It is likely that, if you are designing a
PCIe component, either RC, switch, or endpoint, your company will be a member of the special interest
group (if only to register a vendor ID), and so you will be able to register your work e-mail address and gain
access.
C++ Modelling of SoC Systems Part 1: Processor Elements
Simon Southwell | Published Oct 2, 2023

Introduction
In this article I want to discuss the modelling of System-on-Chip (SoC) systems in software. In particular in
C++, but other programming languages could be used. The choice will depend on many factors but, as we shall
see, there are some advantages in modelling with a language that will also be the ‘programming language’ of
the model. Modelling processor-based systems in software is not uncommon. In my own career alone, I have
seen this done, to varying degrees, at Quadrics, Infineon, Blackberry, u-blox and Global Inkjet Systems and
have been involved in constructing some of these models, and have used such models at all of them.

SoCs are generally characterised by having on a single chip many of the functions that might, in the past, have
been separate components on a PCB, or set of PCBs. We can define some common characteristics of an SoC
that we’re likely to find on any device. The SoC will have one or more processor cores, and this immediately
implies a memory sub-system, with a combination of one or more devices from ROM, Flash, SRAM, DDR
DRAM etc., to facilitate running programs and operating systems. The cores may have caches to varying levels,
and may support virtual memory, implying an MMU (memory management unit) or, if not VM support, at least
memory protection in the form of an MPU (memory protection unit). Some memory mapped interconnect or
bus will be needed for the core(s) to access memory and other devices, so an interconnect/bus system will be
present, such as Amba busses (APB, AHB, AXI, CHI), or Intel/Altera busses (Avalon). Almost certainly, the
processor will need to support interrupts, and an interrupt controller would then be needed for multiple nested
interrupts with, perhaps, support for both level and edge triggered interrupts. If support for a multi-tasking
and/or real-time operating system is needed, then a real-time timer that can generate interrupts will be present,
along with other counter/timer functionality, including perhaps watchdog timers.

Once we have the processor system, with memory and interconnect, the SoC will need to interact with the real
world via peripherals. These might be mapped on the main memory address map, but there may be a separate
I/O space. The peripherals might be low bandwidth serial interfaces (UART, I2C, SPI, CAN), or higher
bandwidth interfaces, such as SDIO, USB, Gigabit Ethernet or PCIe. Moving data from the interfaces
(especially those with high bandwidth) might require direct memory access (DMA) functionality, utilising
streaming bus protocols. Encryption and security support may also be required. For control and status of
external devices, a certain number of general purpose I/O (GPIO) pins may be supported, as well as analogue-
to-digital converters (ADCs) and/or digital-to-analogue converters (DACs). There may also be peripherals to
drive display devices such as LCD displays.

Having a set (or sub-set) of these general-purpose functions, custom functions specific to the system being
developed can then be added where required. In an FPGA based SoC these might be implemented in the logic part of
the device. The diagrams below show two commonly used FPGA devices that have SoC hard macro logic
(Custom ASIC type logic implementation): one from AMD and one from Intel.
These two devices have very similar architectures and sets of SoC components. This is not surprising for two
reasons. Firstly, they serve the same market and are competitors in that market, and secondly, they are based
around the same processor system, namely the ARM Cortex-A, using the same Amba interconnect. They do,
however, give a good reference point to what a generic SoC might look like, and what functionality is present.
They contain a lot of options for interfaces and protocols, and a specific implementation may not use all of
them, so any modelling of a given system need only model what is going to be present and used in the
implementation.

SoCs are not restricted to FPGAs, and many ASICs follow this same pattern. I worked on a Zigbee based
wireless ASIC which was also ARM based and had a smaller, but not dissimilar, set of peripherals to those
above, allowing customers to adapt the chip for their specific application.

Having defined a typical set of functions one might find in an SoC, we find that there are a lot of complex
things present, from the processor cores to the peripherals and all the functionality and protocols in between.
How can we make a model that covers all this functionality?

If cycle accurate modelling is required then the model is likely to converge on the complexity of the logic
implementation, and we need some strategies to simplify the problem or else the model development will rival
the logic in effort and elapsed time. It is possible, if an HDL implementation of the logic is available, to convert
this to a programming language. The Verilator simulator can convert SystemVerilog or Verilog to C++ or
SystemC (a set of libraries for event driven simulation) which can be interfaced to other C++ based model
functions. However, this rather negates some of the advantages of having a C++ model; namely, having a model
on which software can be developed before logic implementation is available, speed of execution, and ease of
system modification for architecture experimentation and exploration. So, is it worth making a software model
of an SoC system at all?

In the rest of this article, and the following article(s), I want to break down each of the functions we looked at
for an SoC and look at strategies for putting together software models to quickly and usefully construct a system
that can be used to develop software and explore a design space either before committing to a logic
development or used in parallel with a development to shorten schedules and mitigate risk. We will begin in this
article by looking at ways to have a processing element on which we can run our embedded software.

Modelling the Processing Element


The beating heart of an SoC is its processor cores. One of the motivations for building a software model is to
execute software that is targeted at the final product. The software will run on one or more cores and, in general,
have a memory mapped view of the rest of the SoC hardware. There may be a separation of memory and I/O
spaces, but this is just another simple level of decode. Other than the memory and I/O views, the only other
external interaction the processor cores usually have is that of interrupts. This may range from a single interrupt
(e.g., RISC-V), where an external interrupt controller can handle multiple sources of interrupt, to having a
vector of interrupt inputs (e.g., ARM Cortex with built in NVIC). Actually, in either case, the interrupt
controller could be modelled as an external peripheral. What remains is what I call the ‘processing element’,
with just this memory and interrupt interfaces. This simplifies what we have to model considerably. The next
question is, what processor do we need to model? There are two answers to this, one of which is obvious, and
the other not so obvious:

1. Model the processor that is targeted for the product.


2. Don't model the processor

The first answer is, I hope you’ll see, the obvious answer, and we will look at constructing instruction set
simulators later, including with timing accurate modelling.

The second option is not so obvious. Whatever we model, we want to run a program that can read and write to
memory (or I/O space) and be interrupted, just like a real processor core. If we present an API to that software
so that it can do these memory accesses and have interrupt service routines called when an external interrupt
event occurs, then we are close to a solution. The model, presumably, is compiled and run on a PC or
workstation, likely compiled for an x86-64 processor. Even if the embedded software is targeted for a different
processor, such as a RISC-V RV32G processor, then it might still be possible to cross-compile it for the
model's host machine—especially if steps are taken to ease this process, as we will discuss shortly. This saves
constructing a specific processor model, which requires a good understanding of the processor's instruction
set architecture (ISA), and is useful when no third-party model is available. Since an instruction set simulator is, itself, just
a program, once we have a generic processing element model, we can simply make the program we run on it an
ISS and, voila, we have a model that can run code for an architecture other than the host computer’s processor.
The diagram below summarises these two cases:
Hopefully it is clear that one is, in general, just an extension of the other and that taking a generic processor
route has an ‘upgrade path’ for more accurate processor modelling as the next logical step.

In the next section I want to look at this generic processing element approach, before looking at methods for
constructing instruction set simulators for specific processor instruction set architectures (ISAs).

Generic Processor Models


The question on whether to take a generic processor model approach or use an ISS is really down to the timing
accuracy required of the model. With an ISS, instruction execution time is usually well documented and can be
built into the model, as we will see when discussing this approach. For a generic solution, we can still have
timing models, but these will be more crude estimates based on statistical modelling (or educated guesses).
None-the-less, this may still be very useful in constructing the embedded code and running on a model with the
desired peripheral functionality.

Memory Accesses

It's fair to say, I think, that most SoC processors' memory accesses will be done through the equivalent of load
and store instructions of fairly limited functionality, perhaps being able to load or store from bytes to double
words etc. From a software viewpoint, this is largely hidden (unless writing in assembly language), and the
software manipulates variables, arrays, structures, class members etc. In a generic processor model these
memory data structures can just be part of the program and reside on the host. It gets interesting when accessing
memory mapped peripherals and their registers.

The simplest API for accessing memory mapped space within the SoC model is perhaps a pair of C like
functions, or C++ methods in a class, to read and write such as shown below (assuming ultimately targeting a
32-bit processor):
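A minimal sketch of such a pair, consistent with how they are used later in this article (the exact names and argument order here are illustrative), might be:

#include <cstdint>

// Possible access types for a 32-bit target
enum access_type {BYTE, HWORD, WORD};

// Read from, and write to, the memory mapped address space of the SoC model.
// The fault argument is set if the access failed (e.g. an unmapped address).
uint32_t read_mem (uint32_t addr, access_type type, bool &fault);
void     write_mem(uint32_t addr, uint32_t data, access_type type, bool &fault);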

The type argument defines the type of access—byte, half-word and so on. Of course, these will be wrapped up
as methods in an API class, and there may be I/O equivalents. This isn’t an article on C++, but the functions
could be overloaded so that the type of data (the return type for read_mem, and the data argument for
write_mem) could define the type of access, dropping the type argument. Where possible I will avoid
obfuscating the points being made with this kind of 'best practice' optimisation. When writing your own
models, you should use good coding style (and comment liberally), but I want to keep things simple. You can,
of course, write the whole thing in C, and the embedded code to be run on the model may well be in that
language in any case.

In many of the embedded systems I have worked on, the software has a virtualising layer between the main
code and accessing the registers of the various memory mapped hardware. This is a Hardware Abstraction
Layer (HAL) and might consist of a set of classes that define access methods to all the different registers and
their sub-fields—perhaps one per peripheral—built into a hierarchy that matches that of the SoC. I.e., a sub-unit
may consist of a set of peripherals, each with their own register access class, and even, perhaps, some memory,
gathered into a parent class for the sub-unit. The advantage here of having a HAL is that it can be used to hide
the access methods we defined above and make compiling the code for both the target and the host running the
model that much easier. Ultimately, the HAL will do a load or a store to a memory location. If we arrange
things so that, when compiled for the target, the HAL simply makes a memory access (a = *reg or *reg = a),
but when compiled for the model references the methods (a = read_mem(reg, WORD, fault) or
write_mem(reg, WORD, fault)) then the embedded software gets the same view of the SoC registers whether
running on the target platform or running on the generic processor as part of the SoC model. Indeed, this was
done at one of my employers and the HAL was automatically generated from JSON descriptions, as was the
register logic, ensuring that the software and hardware views agreed. Again, avoiding C++ nuances, it is
possible (for those interested) that if the register types are not the standard types (e.g., uint32_t) but a custom
type, accesses such as a = *reg or *reg = a can be overloaded to call the read and write methods, so retaining
pointer access. This is more complicated, and a HAL would virtualise this away anyway, making it
unnecessary.
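As a hedged sketch of this dual-target HAL idea, building on the read_mem/write_mem and access_type declarations above (the TARGET_BUILD define and the accessor names are assumptions for illustration), a register accessor might be compiled one of two ways:

#include <cstdint>

// Illustrative HAL register accessors: a plain load/store when compiled for the
// target, or a call into the model's memory access API when compiled for the host.
#ifdef TARGET_BUILD

static inline uint32_t reg_read(uint32_t addr)
{
    return *reinterpret_cast<volatile uint32_t*>(addr);      // a = *reg on the target
}

static inline void reg_write(uint32_t addr, uint32_t data)
{
    *reinterpret_cast<volatile uint32_t*>(addr) = data;      // *reg = a on the target
}

#else

static inline uint32_t reg_read(uint32_t addr)
{
    bool fault;
    return read_mem(addr, WORD, fault);                      // route through the model
}

static inline void reg_write(uint32_t addr, uint32_t data)
{
    bool fault;
    write_mem(addr, data, WORD, fault);
}

#endif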

Whether overloading pointers, using a HAL, or just calling a read and write API method directly, from a
software view we have an API for reading and writing to a memory mapped bus. We haven’t discussed what
goes in these methods yet, but we will get to this when we talk about modelling the bus/interconnect.

Interrupts

The other interface to the model of a processing element we identified was for interrupts. Notoriously, on a PC
or workstation, when running user privilege programs we don’t have access to the computer’s interrupts
directly. Fortunately, we do not need to.

In a real processor core, at each execution of an instruction, the logic will inspect interrupt inputs, gating them
through specific and then master interrupt enables and if one is active, and enabled, will alter the flow of the
program in accordance with the processor’s architecture. Thus the granularity of an interrupt is at the instruction
level. For our generic processor model, we aren’t running at the instruction level, but just running a program on
a host machine. We do, however, access the SoC model with the read and write API calls. Since the SoC model
will be the source of interrupts, this is a good point to inspect the current interrupt state. Glossing over just how
that state might get updated for the moment, so long as, at each read and write call, the interrupt state can be
inspected, we can implement interrupts and have interrupt service routine functions.

If the read_mem and write_mem methods of the memory access class call a process_int method as the first
thing they do, then this can keep interrupt state and make decisions on whether to call an interrupt service
routine (ISR) method. The main program is stalled at the memory access call whilst the ISR is running, and so
will return to that point when the ISR method exits. The ISRs themselves can access memory and can also be
interrupted by higher priority interrupts allowing hierarchical interrupt modelling to be achieved. A sketch for
an API class with interrupts is shown below:
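One possible minimal form, matching the members and methods described in the following paragraph (using the access_type from the earlier sketch, with the access bodies elided and the priority ordering an assumption), is:

#include <cstdint>
#include <cstddef>

class MemApi
{
public:
    static const unsigned MAX_INT = 32;
    typedef void (*isr_func_t)();

    MemApi() : int_master_enable(false), int_enable(0), int_req(0), int_active(0)
    {
        for (unsigned i = 0; i < MAX_INT; i++)
            isr[i] = NULL;                            // no ISRs registered by default
    }

    uint32_t read_mem(uint32_t addr, access_type type, bool &fault)
    {
        process_int();                                // check for interrupts on every access
        /* ... perform the actual read and return the data ... */
        return 0;
    }

    void write_mem(uint32_t addr, uint32_t data, access_type type, bool &fault)
    {
        process_int();
        /* ... perform the actual write ... */
    }

    void registerIsr (isr_func_t func, unsigned level) { isr[level] = func; }
    void enableMasterInterrupt ()                      { int_master_enable = true;  }
    void disableMasterInterrupt()                      { int_master_enable = false; }
    void enableIsr   (unsigned level)                  { int_enable |=  (1u << level); }
    void disableIsr  (unsigned level)                  { int_enable &= ~(1u << level); }
    void updateIntReq(uint32_t new_req)                { int_req = new_req; }

private:
    // Call the ISR for the highest priority requested-and-enabled interrupt, but only
    // if it is higher priority than any already active (level 0 assumed highest here).
    void process_int()
    {
        if (!int_master_enable)
            return;

        for (unsigned level = 0; level < MAX_INT; level++)
        {
            if (int_active & (1u << level))           // an equal/higher priority ISR is running
                break;

            if ((int_req & int_enable & (1u << level)) && isr[level] != NULL)
            {
                int_active |=  (1u << level);
                isr[level]();                         // ISR may itself call read_mem/write_mem
                int_active &= ~(1u << level);
                break;
            }
        }
    }

    isr_func_t isr[MAX_INT];                          // registered ISR functions
    bool       int_master_enable;                     // master interrupt enable
    uint32_t   int_enable;                            // per-interrupt enables (bitmap)
    uint32_t   int_req;                               // currently requested interrupts (bitmap)
    uint32_t   int_active;                            // interrupts being serviced (bitmap)
};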

Here we have a class with the two methods for read and write, and I’ve shown these with some code to show
that an internal process_int method is called before actually processing the access. The class contains some
state, with an array of function pointers, set to NULL in the constructor, which can be set to point to external
functions via the registerIsr() method. A master interrupt variable, int_master_enable, can be set or cleared
with enable- or disableMasterInterrupt methods. Similarly, the individual enables can be controlled with
enableIsr and disableIsr methods. To actually interrupt the code, the updateIntReq method is called with the
new interrupt state, which would set the int_req internal bitmap variable, which process_int will process. A
bitmap int_active variable is also used by process_int to indicate which interrupts are active (i.e., requested
and enabled). There can be more than one active, and the highest priority will be the one that is executing.

This type of method is used with the OSVVM co-simulation code and I write about how this is done in more
detail in a blog on that website. In this environment there is an OsvvmCosim class with, amongst other
methods, transRead and transWrite methods. This is used as a base class to derive an OsvvmCosimInt
class, which then overloads the transRead and transWrite methods (and others) to insert a processInt method call
which models the interrupts. The ISRs don't reside within the class, but external functions can be registered by
the user to be called for each of the ISR priority levels. The blog gives more details and the referenced source
code can be found on OSVVM's github repository, so I won't repeat the description here; the details of the
processInt method of that code serve to show how this would be done with the sketch class described above,
and its process_int method.
So here we have a framework to build an SoC and run a program. We have defined a class with read and write
capabilities and the ability to update interrupt state and have the running program interrupted with prioritised
and nested interruptable interrupt service routines, provided externally by registering them with our class. We
can now write a program and a set of ISRs that uses this class to do memory mapped accesses and support
interrupts. I’ve left off the details of the read_mem and write_mem methods, for now, as this is how we will
talk to the rest of the model which will be dealt with in another article.

Instruction Set Simulators


With the class defined from the last section we can write arbitrary programs and interact with the rest of the
model (when we get that far). Of course, that arbitrary program could just be an instruction set simulator (ISS).
One difference is that the granularity of interrupts will be at the instruction level, rather than the read and write
memory level, and the ISS model of the processor itself, in some cases, will contain the interrupt handling
code. Thus the API class we defined before simplifies considerably, with the read and write methods no longer
requiring a process_int call, and all the code associated with interrupts disappears. We still need to inspect
interrupt state but, as we shall see, a slightly different method is used. In the OSVVM code, the non-interrupt
class (OsvvmCosim) is defined as a base class, and then a derived class (OsvvmCosimInt) overloads the read
and write methods to insert an interrupt processing method at the beginning of each one, and then call the base
class’s read or write method. If this split was done to the class from the last section, then the base class could be
used for an ISS, which wouldn’t need the interrupt functionality externally. In the rest of this section I want to
outline the architecture of an ISS model which is largely common to modelling any processor’s ISA.

Just as for a logic implementation, we have some basic operations we must implement:

• Reading an instruction
• Decoding the instruction
• Executing the instruction

These three basic functions are repeated, in an execution loop, indefinitely or until some termination condition
has been reached, such as having executed a particular number of instructions, executed a particular instruction
(like a break for example) or some such state, set up prior to running the processor. In addition to these basic
functions, some state also needs to be modelled for things like internal registers and the program counter. These
can all be collected into a processor model class.
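A minimal sketch of such an execution loop, assuming an Iss class holding the memory API (mem), a program counter (pc), a decoded instruction member (decoded), decode tables and a decode() method along the lines described in the following sections (all names here are illustrative), might be:

#include <cstdint>

// Fetch, decode and execute instructions, terminating here on a simple instruction count.
void Iss::run(uint64_t max_instructions)
{
    bool fault;

    for (uint64_t count = 0; count < max_instructions; count++)
    {
        // Fetch: read one instruction word from memory at the current PC
        uint32_t instr = mem.read_mem(pc, WORD, fault);

        // Decode: walk the decode tables to find the instruction's entry and fill
        // in the decoded fields (registers, immediates, etc.)
        const decode_entry_t *entry = decode(instr, decoded);

        // Execute: call the instruction's execution method, which updates
        // registers and the PC
        (this->*entry->exec)(decoded);
    }
}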

For reading an instruction we are already set up, as we can use our API with the read_mem method to read
instructions preloaded into memory, though, as we'll see later, this will be done via an indirection. For RISC
type processors, such as the ARM, RISC-V and LatticeMico32 processors, only one word is read per instruction.
Therefore the decode process is completely isolated from the other steps. This is the case for my RISC-V and
LatticeMico32 ISS models. For non-RISC, usually older, processors, instructions may be variable in length,
with a basic instruction opcode followed by zero or more arguments. Therefore a decode, or partial decode,
needs to be done, and then any further bytes/words read before moving to execution. Thus, reading and
decoding can be entwined somewhat, complicating the first two stages, though only mildly so. This is the case
for my 6502 processor and 8051 microcontroller models.

Decoding an instruction will involve extracting opcode bits to uniquely identify the instruction for execution, with the other bits being 'arguments' such as source and destination registers, immediate bits and the like. The number of opcode bits the processor's ISA defines determines how many possible unique instructions there can be, though they might not all decode to a valid instruction. The number of opcode bits might be quite small, such as for the LatticeMico32, which has 6 bits and 64 possible instructions, or the 8051, which has 8 bits for 256 possible instructions. For other processors it may be much higher: the RISC-V RV32I processor's R-type instructions have 17 opcode bits (see my article on RISC-V for more details). Many ISS models I have seen use a switch/case statement for the decoding. For the small opcode processors, like the LatticeMico32 with 6 bits, a switch statement with 64 cases to select the execution code is manageable. For the larger opcode spaces, such as the 17 bits of RISC-V RV32I, this becomes 131072 cases, most of which will be invalid instructions. To manage all of the different architectures, I prefer to use a hierarchy of tables which have pointers to instruction execution methods as part of each entry. For the smaller opcode spaces, this table hierarchy can be one deep (i.e., a single flat table), but for the large spaces this is broken down. The RISC-V instruction formats have a common 7-bit opcode, and then various other functX fields of various sizes, such as a three-bit funct3 or a seven-bit funct7 field. We can use this to produce a hierarchy. An initial primary table can be made with the number of entries for the opcode (i.e., 128). Each entry in the table can have a flag saying whether it is an instruction, in which case it has a pointer to an instruction execution method, or whether it points to another table. A secondary table would have entries for the funct3 field, and a tertiary table would have entries for the funct7 field. This can be repeated for any depth required. Decoding then walks down the tables until it finds an instruction entry.

The diagram below, taken from the RV32 ISS Reference Manual, shows this situation.

So what might each table entry look like? Here we define a structure (class) to group all the relevant
information and make an array of these structures for the tables. The code snippet below shows a top-level
structure for the rv32 ISS.
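The actual code is in the rv32 ISS source, but a sketch along those lines, with illustrative names rather than the exact definitions, might be:

    // Sketch of a decode table entry (illustrative names, not the exact rv32 ISS code)
    #include <cstdint>

    struct rv32i_decode_t                       // decoded instruction fields, filled in by decode
    {
        uint32_t instr;                         // raw instruction value
        uint32_t opcode;                        // 7-bit opcode field
        uint32_t funct3;                        // funct3 field
        uint32_t funct7;                        // funct7 field
        uint32_t rd, rs1, rs2;                  // register indexes
        int32_t  imm;                           // sign-extended immediate (format dependent)
    };

    class rv32_cpu;                             // the ISS model class (assumed name)
    typedef void (rv32_cpu::*pInstrFunc_t)(rv32i_decode_t&);   // pointer to an execution method

    struct rv32i_decode_table_t
    {
        bool sub_table;                         // true if this entry points to another table

        union {
            rv32i_decode_t*       p_entry;      // leaf: this instruction's decode data
            rv32i_decode_table_t* p_table;      // sub-table: next table in the hierarchy
        } ref;

        pInstrFunc_t p_instr;                   // execution method (null if a sub-table)
    };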

This structure has the sub-table flag, a union of either a pointer to a decoded instruction data structure or a pointer to another table, and then a pointer to an instruction execution function (which is null if the entry is a sub-table). The decoded instruction data structure has all the fields of the instruction extracted out, and is 'filled in' by the decode code. Since this will be passed to all the instruction execution functions, it contains all possible fields for all instruction types, so that the instruction execution methods can simply pick out the appropriate values they need.

The decode table arrays are constructed and filled by the constructor, to link the table hierarchies and point to the instructions' execution methods. In the execution loop, decoding does the 'table walk', indexing down the tables with the opcode and functX values as indexes, until it reaches an instruction execution entry. The entry's decode structure is filled in from the raw instruction value, and then the instruction method, pointed to in the entry, is called with that decode structure as an argument. If the pointer is pointing to a 'reserved' method, then the decoding reached an invalid/unsupported instruction, and an exception can be raised.
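As a sketch of how this might look for the RV32I formats (illustrative names, consistent with the entry structure sketched above, and with immediate extraction omitted as it is format dependent):

    // Sketch of the decode 'table walk' (illustrative, not the actual rv32 ISS code)
    rv32i_decode_table_t* rv32_cpu::decode(const uint32_t instr)
    {
        // Index the primary table with the 7-bit opcode
        rv32i_decode_table_t* entry = &primary_tbl[instr & 0x7f];

        // Walk down any sub-tables, indexing with funct3 and then funct7
        if (entry->sub_table)
            entry = &entry->ref.p_table[(instr >> 12) & 0x07];
        if (entry->sub_table)
            entry = &entry->ref.p_table[(instr >> 25) & 0x7f];

        // Fill in the leaf entry's decode structure from the raw instruction
        rv32i_decode_t* d = entry->ref.p_entry;
        d->instr  = instr;
        d->opcode = instr & 0x7f;
        d->rd     = (instr >> 7)  & 0x1f;
        d->funct3 = (instr >> 12) & 0x07;
        d->rs1    = (instr >> 15) & 0x1f;
        d->rs2    = (instr >> 20) & 0x1f;
        d->funct7 = (instr >> 25) & 0x7f;

        return entry;                           // a null/'reserved' p_instr means an invalid instruction
    }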

The instruction execution methods are now simply a matter of executing the functionality of the instruction.
Below is an example of an add instruction method:
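A sketch of such a method (assuming a regs[] register file and an increment_pc() helper, rather than the exact rv32 ISS code) is:

    // Sketch of an ADD execution method (illustrative names)
    void rv32_cpu::add(rv32i_decode_t& d)
    {
        regs[d.rd] = regs[d.rs1] + regs[d.rs2];    // rd <= rs1 + rs2 (x0 write handling omitted)
        increment_pc();                            // advance to the next instruction
    }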

As you can see, this is now fairly straightforward. The branch and jump instructions will update the PC (the former on a condition), whilst the load and store instructions will do reads and writes. Now these could use the API class that we defined earlier, but a better way (as I hope I will convince you) is to have internal read and write methods which call an external callback function registered with the ISS model. (If you are not familiar with pointers to functions and callback methods, I talk about these in one of my articles on real-time operating systems, in the Asymmetric Multi-Processor section.)

The reason I think this is better is that it decouples the ISS completely from the rest of the SoC model, allowing the ISS to drop into any SoC model, which can register its external memory access callback function and run code on the ISS for that environment. The rv32 ISS read and write methods check for any access errors (such as misaligned addresses) and then execute the external memory callback function if one has been registered. If one hasn't been registered, or if the call to the callback returns indicating it didn't handle the access, the ISS will attempt to make an access to a small 64Kbyte memory model it has internally. If the access is outside of this range, then an error is generated.

Many (but not all) of the instructions can generate exceptions, and in the rv32 ISS a process_trap method is defined to handle these, called as appropriate from the instruction execution methods with the trap type. The process_trap method simply updates register state for the exception and sets the PC to the appropriate exception address. Since interrupts are forms of exception, we can also have a process_interrupts method. This, though, is not called from the instruction methods, but is in the execution loop so that it is called every instruction. Some processors have a mixture of internal and external interrupt sources. So, for example, the RISC-V processor can generate timer interrupts internally (ironically from timers that are allowed to be external), whilst also having an external interrupt input. In order for the ISS to be able to be interrupted by external code we, once again, use an externally registered callback function. At each call to process_interrupts this callback (if one is registered) is executed and returns the external interrupt request state. This is then processed against interrupt enable state and, if enabled, a call to process_trap is made, with an interrupt type instead of an internal exception type, and the PC will be altered similarly to that for an exception.

So we now have all the components for the basic functionality, with decode, execution, memory access and
exception/interrupt handling. To run an actual program we just need a run loop.
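A minimal sketch of such a loop, consistent with the sketches above and with assumed helper method names (fetch_instruction, process_interrupts, terminate_check), might be:

    // Sketch of the execution (run) loop (illustrative names)
    void rv32_cpu::run(const int num_instructions)
    {
        while (!halt)
        {
            uint32_t instr = fetch_instruction(pc);             // read via read_mem / the memory callback
            rv32i_decode_table_t* entry = decode(instr);        // walk the decode tables
            (this->*(entry->p_instr))(*entry->ref.p_entry);     // execute the instruction
            process_interrupts();                               // check pending interrupt state
            halt = terminate_check(++instr_count, num_instructions);  // breakpoints, count, step etc.
        }
    }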

We can now refine our ISS diagram a bit with the callback functions provided by the external model software,
called by the ISS for memory accesses and inspecting interrupt state, and then these callbacks making use of the
API we defined earlier.
Timing Models
If modelling timing to some degree is important (and it may not be), then what is possible will depend on whether a generic model or a processor-specific model is used. With the generic model, the software running on the processing element is just a program running on the host machine, and only when it interacts with the rest of the model (e.g., does a read or write) is there any concept of the advancement of 'model' time. Of course, if there is some data on the average mix between memory access and non-memory access instructions in a similar real-world system, then an estimate for clock cycles run between calls to the API read and write methods can be made, and some state kept that's updated when calls to the methods are made. It will, necessarily, be a crude estimate, but may be a useful approximation of how a system might perform. However, the generic method is not really suitable for more accurate performance measurements.

With an ISS things become a lot better. Processor execution times are usually well understood and documented
for a given implementation. For example, from the documentation of the rv32 RISC-V softcore we have:

• 1 cycle for arithmetic and logic instructions
• Jumps take 4 cycles
• Branches take 1 cycle when not taken and 4 cycles when taken
• Loads take 3 cycles plus wait states
• Stores take 1 cycle plus wait states

This is a fairly straightforward specification, and the ISS instruction execution functions can update some count state to keep track of cycle time. The exception is the memory access instructions, which also have a wait state element. This wait delay is a black box as far as the processor is concerned, as it is caused by the external modelling of the memory sub-system. Therefore, the memory callback function prototype specifies a return value which is the additional cycles added by the memory access, and the SoC model memory functionality will calculate this. Thus, the memory access instructions will add their base timing to the cycle count, and the callback return value will then be subsequently added. From a processor point of view, then, we have cycle accurate behaviour. Of course, if the core has complex features, such as out-of-order execution or dynamic branch prediction, or is superscalar in architecture, accurate cycle counts become harder as the model must take these factors into account.
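As a sketch of how a load instruction might account for this (read_mem being the internal method that invokes the external callback and returns any additional wait states; names are illustrative):

    // Sketch of cycle counting in a load word method (illustrative names)
    void rv32_cpu::lw(rv32i_decode_t& d)
    {
        uint32_t data;
        uint32_t addr = regs[d.rs1] + d.imm;                          // d.imm assumed filled by decode

        int wait_states = read_mem(addr, data, MEM_RD_ACCESS_WORD);   // external callback's extra cycles

        regs[d.rd]   = data;
        cycle_count += 3 + wait_states;                               // loads take 3 cycles plus wait states
        increment_pc();
    }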
Debugging

With my software hat on, the first question I might ask when presented with a software model of a system to
program is, how will I debug my code? Here, I’m talking about the code that is running on the model, rather
than the code that is the model—which can be debugged using the normal tools and methods as it’s just an
application running on the host machine. In fact, for the generic processor model, the code to be executed is just
an extension of that model code, perhaps usefully separated from the model code, but compiled and linked with
it. So the same techniques and tools can be used here and the issue is sorted.

For an ISS based model things are a little bit more complicated—but not much. Here the software running on
the model (using the ISS) is probably cross-compiled for the particular processor, and the host tools can’t be
used and one would need to use those supplied with the toolchain for that modelled processor architecture.
However, taking gdb as a common debug tool, it has a remote mode where it can connect to a processor remotely, via a TCP/IP socket, and then send 'machine' versions of the common commands to load programs,
set breakpoints, run code, inspect memory, step and continue the program etc. If the ISS model has some
TCP/IP server code that can receive and decode the gdb machine commands, and then act appropriately,
sending any required responses, a debugging session can be set up using the processor’s version of gdb. This
has, in fact, been done for my RISC-V and LatticeMico32 ISS models, and the source code can be inspected for
how this is implemented. The LatticeMico32 ISS documentation has sections on the gdb interface, and an
appendix on how to set up the Eclipse IDE with gdb so that a full IDE can be used for debugging. So, now we
have a full debug environment and can debug code.

Multiple Processing Elements


Many embedded systems have multiple processor cores, and we may need to model this. With the generic
processor model, we might ask, do we really need to model multiple cores as the code is just host application
code? The main motivation here is that the system may be set up to have different functionality on each core,
rather than just have them as a pool of processing resources to run processes and threads as allocated by the
operating system. For this latter situation, maybe no further modelling is needed for multiple cores. Even if the
embedded code is multi-threaded, these can just be threads running on the host. The only issue to solve is that accesses to the memory read and write API from multiple threads will need to be made thread-safe. This might mean wrapping the API calls with mutexes to make sure any access is completed atomically by each thread. For the case where each core performs different tasks, the source code is likely to be structured with this split, and so these could be run as separate host threads with the same methods used to ensure thread-safe operation.
Again, as for the timing models, the generic processor model will stray from accurate and predictable flow of
code when modelling multiple cores in this way, and it is mainly for software architecture accuracy, with the
aim to be able, as much as possible, to compile and run the same code on the model as for the target platform.

For the ISS based modelling we don’t need to rely on threads or mutexes and can maintain a single threaded
application. How? It was implied, when discussing debugging, that the ISS model is able to be stepped one
instruction at a time. The run loop code snippet shown earlier had a ‘while not halt’ as the main loop, where halt
is doing a lot of heavy lifting. Actual real code will have a whole host of possible reasons to break out of the
loop, allowing breaks on reaching a certain address in the instructions (a break point) or after a certain number
of instructions have been run, such as 1 (a step). We can use this step feature to advance the processor externally instead of free running. Now a run loop can have calls to step multiple ISS objects in sequence. With access to the ISS objects' concept of time, the execution order can be improved by, at each loop iteration, only stepping the ISS object that has the smaller cycle count (sketched below). This minimises the error in cycle counts between the processor models and keeps them synchronised. This is discussed in more detail in the
LatticeMico32 documentation, under the Multi-processor System Modelling section, for those wanting to know
more about this subject. This method can be done for any number of processor cores required to be modelled
and can even be done with different processor models, which is not an uncommon situation in some embedded
systems.
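A sketch of this scheme for two processor objects, assuming an iss_model class with step() and get_cycle_count() methods, might be:

    // Sketch of stepping two processor models in one thread, synchronised by cycle count
    void run_dual_core(iss_model& cpu0, iss_model& cpu1)
    {
        bool halt = false;

        while (!halt)
        {
            // Always advance the core that is 'behind' in time to minimise cycle count skew
            if (cpu0.get_cycle_count() <= cpu1.get_cycle_count())
                halt = cpu0.step();             // run one instruction on core 0
            else
                halt = cpu1.step();             // run one instruction on core 1
        }
    }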

Conclusions
We have started our look at constructing software models of SoC systems by looking at ways to allow us to run
embedded software on a ‘processing element’. This might be a virtual element where a specific processor isn’t
modelled, or an instruction set simulator, where the target processor is fully modelled. In either case we present
an API that does the basic processor external operations—read and write to memory space and get interrupted.

For the generic model, the API can be used directly for reads and writes, and strategies were discussed to model
nested interrupts whilst maintaining single threaded code. Timing modelling with the generic processor model
was shown to be limited in accuracy but may have some useful application with real-world based estimates.

The ISS modelling was broken down into mimicking the basic steps of a core, and we looked at using a table hierarchy for instruction decoding, with pointers to instruction execution methods in the table entries, to be executed when the decoding terminates in a decoded instruction entry within the tables. To de-couple the processor models from the rest of the modelling, the API was not used directly, but callback functions were used for memory space accesses and inspecting interrupt request state, which can then use the API, allowing modelling of the bus and interconnect to be external to the processor model, which we'll discuss in the next article. Methods were also discussed for accurate timing models and debugging, as well as how to handle the modelling of multiple processor cores.

In this article, we have just focused on processor modelling, and we still have to look at the bus/interconnect,
memory sub-system and all the various peripherals, as well as looking at interfacing to external programs to
extend the model’s usefulness into such domains as co-simulation or to interface to other external models. I will
cover these subjects in the article(s) to come.
C++ Modelling of SoC Systems Part 2 : Infrastructure
Simon Southwell | Published Oct 14, 2023

Introduction
In the previous article in this series, we looked at the modelling of the processing element of an SoC. In this article I want to talk about the rest of the system model. When writing the article, the size of what I wanted to cover grew and grew, and I had to start trimming the details down in order to get an article of digestible length, knowing that many reading these articles are new to the subject, or less familiar with software than they are with logic. So, I have made this article an overview of approaches but have tried to complement it with references to examples that I have made available in the open-source domain, and to other relevant documents and articles that describe how these are constructed, so that people can dive deeper to the level of their interest. I have, professionally, been involved in modelling using techniques additional to those covered here, for which I don't have examples available. I have mentioned a few which extrapolate from the techniques described below but have tried to keep this to a minimum. I hope I've succeeded, though, in laying a foundation from which someone can start constructing their own models.

We will do ourselves a great favour if, when constructing the rest of an SoC model, we separate out certain
aspects of the blocks we wish to model:

• Basic functionality
• Specific protocols
• Timing

Usually (not always) the basic functionality of an SoC block is easy to describe. For example, an AXI bus facilitates a read or a write of data between (using ARM terminology) a 'manager' (such as a processor) and a 'subordinate', as indexed with an address. This is also true of AHB, APB, the Avalon bus, Wishbone, or any memory mapped bus or interconnect protocol. There is a 'transaction' between the manager and the subordinate in the data exchange. It would be easiest to model at this transactional level. The details of the protocol really only manifest themselves at this transaction level as timing for the transaction and some specific facilities, such as privilege level—which may or may not be common with other similar protocols.

By separating out the different aspects of the functionality like this we can restrict what is modelled to meet the specific needs. If transactional modelling meets requirements, then a model of a block can be constructed more easily and quickly. Timings would then, of course, be approximations—if timing is of interest at all. To get more accurate timings, certain parts of the block's detailed functionality may need to be modelled. This might be at the level of "operation a takes m cycles, operation b takes n cycles" etc., if the functionality is that linear, all the way to a full-blown cycle accurate model with full protocol modelling. This latter is most important if the block interfaces with another model that needs the proper protocol, for example if you're interfacing to a 3rd party model that expects the received bits to be fully fledged ethernet packets, say. There are no short cuts, and the requirements on the model will depend on the intended use, from purely functional, to allow initial software architecture testing, to the ability to get accurate estimates for software and hardware performance. With separation of these aspects, complexity and features can be added over time to refine the model, whilst still having a system up and running before all aspects are implemented.

In this article it is not possible to go through every single SoC type block and discuss in detail how to model them. I want to mention the most common and reference some of my open-source code that uses these methods by way of working examples. The open-source examples can't be as complete as a whole professional system (I'm only a one man band), such as the systems I worked on at Infineon as a software modelling developer. There, for instance, the idea was to have a GUI where you could pick components from a library—core, bus, memory, peripherals, etc., as required—and then automatically generate a software model from this, ready to run embedded software. Also, the 8051 model was used at Silicon Labs when I specified a 'soft MAC' for the Zigbee systems and modelled a solution to Silicon Labs' product specs, using the 8051 model, which could run some prototype embedded code that the software team had written some time before. Naturally, the code for these system models is not in my possession or open-source. What I'm hoping is that I have enough examples to show how to approach modelling an SoC, starting with an easy base system which can be quickly constructed, especially if the embedded software architecture is constructed with software modelling in mind, and adding complexity as and when needed.

FPU
In the previous article we left the modelling of processors behind. However, we didn't talk about floating point functionality. Many embedded processors don't support FP arithmetic, but many do. From the modelling perspective, though, we can treat floating point support as a separate extension.

Naturally, when modelling a processor in C++ running on a host, access to floating point functionality is built
into the language, so the FP instruction code can just do floating point arithmetic using the float and double
types as appropriate. For many cases this is all that’s required. However, not in all cases.

The RISC-V F and D extensions specify different rounding modes. I won't go into details (see The RISC-V Instruction Set Manual Volume I, sections 8 and 9), but rounding modes can be set for things like rounding towards zero, rounding down, etc. By default, your host will do one of these, but to get accurate results we need to round as appropriate to the configuration. Fortunately, the fenv library provides a means to program which mode is used by the processor running our program, via the fesetround function. The rv32 RISC-V ISS uses this library to set the rounding mode to match that programmed for the processor (see the update_rm method in the file iss/src/rv32f_cpu.cpp).
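A sketch of the kind of mapping involved (illustrative, not the actual update_rm code; the rm field encodings are from the RISC-V F extension specification) is:

    // Sketch of setting the host rounding mode from a RISC-V rm field value
    #include <cfenv>

    static void set_host_rounding(const int rv_rm)
    {
        switch (rv_rm)
        {
        case 0: std::fesetround(FE_TONEAREST);  break;   // RNE: round to nearest, ties to even
        case 1: std::fesetround(FE_TOWARDZERO); break;   // RTZ: round towards zero
        case 2: std::fesetround(FE_DOWNWARD);   break;   // RDN: round down (towards -infinity)
        case 3: std::fesetround(FE_UPWARD);     break;   // RUP: round up (towards +infinity)
        default:                                break;   // RMM etc. need handling separately
        }
    }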

Another option, which I've used before professionally, is the softfloat library. This is a standalone library implementing the IEEE standard for floating-point arithmetic. The advantage of this is that it is agnostic to the features of the host computer that the SoC model is being run on. The disadvantage is that it is slower than using the host's floating point capabilities, if that is an issue. The choice will depend on the priorities for the model between performance and features.

Bus and Interconnect


As was outlined in the introduction, the functionality of a memory mapped bus or interconnect is really quite simple: it does a transaction between a source requestor and, ultimately, a completer. There may be layers and hierarchies between these two points, but the destination is uniquely identified by an address. In an AHB system, for example, a decoder will activate one of several chip-select lines, based on some bits of the address. The subordinate may be a peripheral but can also be a device for further decoding, such as a bridge between AHB and APB, for example. So, the modelling of this can be as simple as a switch-case statement to select between a set of appropriate methods to access the subordinate. If the subordinate has further decode selections to make, then this is repeated to reflect the hierarchy. Even with a point-to-point interconnect protocol, such as AXI, the interconnect is usually a crossbar type architecture, but it is still connecting between a manager and a subordinate by decoding the address bits. Ultimately, the peripheral model will have a set of registers (or maybe a memory) and the final decode to individual registers and bitfields is done there. If an interconnect has multiple input ports for multiple cores and other manager devices, then this is just the situation of multiple processors discussed in the first article.
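As a sketch, a top-level decode for an AHB-like system might be little more than this (class, member and constant names are illustrative):

    // Sketch of a top-level bus decode selecting subordinate models by address
    int soc_bus::access(const uint32_t addr, uint32_t& data, const int type)
    {
        switch (addr & 0xf0000000)                                         // 'chip select' on upper address bits
        {
        case 0x00000000: return mem.access(addr, data, type);              // main memory model
        case 0x80000000: return apb_bridge.access(addr, data, type);       // bridge to peripheral bus (further decode)
        case 0x90000000: return timer_intc.access(addr, data, type);       // timer/interrupt controller block
        default:         return ACCESS_NOT_HANDLED;                        // unmapped: caller raises an error
        }
    }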

Thus, modelling the bus and interconnect functionality is fairly straightforward. Layering timing and protocol onto this model will involve calculations based on the nature of the transactions, the type of operation and any arbitration resolution. The accuracy of this depends on the effort put into accurately modelling the protocol used and the device doing the decoding or routing. Of all the models I have encountered in my career, none has actually claimed 100% cycle accuracy against the real bus system. This is because this would involve having every event from external sources happening precisely as it would in the real system; arbitration may go a different way if an access request is just one cycle different, and these errors accumulate. What is achievable, though, is a system that behaves consistently with the exact same inputs over a model's run (repeatability) and is statistically representative (precise) over many transactions, so it is worth putting good timing modelling in the bus models if performance measurements are required. Again, precision is a function of effort.

Connection to the Processing Element


In the first article of this series I mentioned the use of a callback function from the processing element model, which is called whenever it does a load or store (read or write) operation, but I did not elaborate on what that might look like in any detail. Taking the RISC-V model as an example, the callback type is defined in the ISS headers.
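A sketch of its general shape (with illustrative names; the exact definition is in the rv32 ISS headers) is:

    // Sketch of the external memory access callback type (illustrative names)
    typedef int (*p_ext_mem_callback_t) (const uint32_t byte_addr,   // byte address of the access
                                         uint32_t&      data,        // write data in / read data out
                                         const int      type,        // MEM_WR_ACCESS_XX or MEM_RD_ACCESS_XX
                                         const uint64_t time);       // processor's current cycle time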

When a callback of this type is registered with the model (using the register_ext_mem_callback method) it will be called at each memory access by the processor with a byte address, a data value (on write type accesses), an access type and also the processor model's current time value. The type value takes one of a set of defined values in the form of either MEM_WR_ACCESS_XX or MEM_RD_ACCESS_XX, where XX is one of BYTE, HWORD, WORD, or INSTR. On read accesses, the value is returned in the data argument. The callback has a return value which is normally a cycle count (which can be 0) giving the number of additional cycles the access took over and above the processor instruction cycles, allowing wait state counts to be inserted, and these are added by the model to include in its elapsed time reckoning. If the callback returns RV32I_EXT_MEM_NOT_PROCESSED, then this lets the model know that the external call did not decode to any active block and was not processed by it. The model will then try to see if the access decodes to its own internal memory, otherwise it raises an exception. The 6502, 8051 and LatticeMico32 models all have very similar mechanisms.

This callback, then, is the entry point for the rest of the model and where accesses to any read and write API
functions can be made. For simple systems, the callback function itself could do the top level decode and call
the read and write API calls for each modelled sub-block or, for more complex systems, a top level API (such as
that discussed in the first article) could virtualize away the detail, and a hierarchical call structure made to
reflect the system structure. The diagram below shows a simple example of this.
For a virtual processing element, the only difference from the above diagram is that the code modelling the
embedded software calls the API directly, and has no callback.

This isn’t an article on code structure, and I don’t want to dictate how one might definitively go about putting
together software models, and good judgment should prevail for each implementation. However, bear in mind
that you could very well, in the future, want to re-use components from your model in another model that has a
different mix of components, so keep re-use in mind when structuring your code.

Memory Model
The first major 'peripheral' we will definitely need is memory. A system may have more than one type of memory, such as DDR3 main memory, EEPROM, flash, ROM, etc. These, of course, could all be modelled as different components, each with its own memory model. From a processor's point of view, though, there is just one flat memory space (excluding any I/O space). Depending on how much detail of the different characteristics for each memory type is required, they could all be modelled as a single structure. This brings us to our first problem. A 32-bit processor has a 4Gbyte address space, and we can't simply create a byte array of that size (well, we could, but perhaps shouldn't). Things only get worse when modelling a 64-bit architecture. An SoC system is unlikely to have populated the entire address space, but it may have main memory of 512Mbytes or more, and memory may be distributed within the memory map.

The solution is to have a ‘sparse’ memory model. This is where memory is allocated on demand as the memory
is accessed, usually in blocks (or ‘pages’). At start up, the model has no memory allocated, and the first access,
with whatever address is provided, will break the address down to access a series of tables that eventually point
to a block of memory, dynamically allocated for that access. Any subsequent access to the address space
covered by that block will not cause another allocation but will simply access that block of memory. If a new
access doesn't land on an allocated block, a new one is created, and so on. The diagram below is adapted from the documentation of my memory model.

Here a primary table pointer is initially set to NULL. On the first access a new primary table is allocated with
4K entries. The top bits of the address are ‘hashed’ to generate a 12 bit number to index an entry in the primary
table. The primary table entry has the address for the base of the space—that is, the provided address with only
the bits (63 down to 24) used, and the others set at zero. A valid flag marks the entry as having been set, and
then there is a pointer to a secondary table. At first access, there are no secondary tables, so memory is allocated
for a table with 4K entries. The secondary table is just a set of pointers to byte memory. The entry is indexed by
bits 23 down to 12 of the access address, and that entry points to a dynamically allocated 4Kbyte space, with the bottom 12 bits used to index into it; this is the actual storage space where data is written and read back. As the model is accessed further, secondary tables and 4Kbyte spaces are created as needed, or previously created tables and memory blocks are accessed by walking down the primary and secondary tables and then writing or reading the allocated memory block. Thus, we have a memory model that spans an entire 64-bit address space but only uses the amount of memory actually needed, without having to configure it. More details can be found in the model's documentation. Any memory protection or invalid regions would be modelled externally to this model by, say, an MMU or MPU model or, at its simplest, a wrapper class which maps the actively present memory regions and generates an error outside these configured regions.
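To make this concrete, a stripped-down sketch of a byte write following this scheme (illustrative names and types, with hash collision handling via the stored base address omitted) might be:

    // Sketch of a sparse memory write: 4K-entry primary table, 4K-entry secondary
    // tables of pointers to 4Kbyte pages, all allocated on demand
    #include <cstdint>

    struct prim_entry_t {
        bool      valid     = false;
        uint64_t  base_addr = 0;            // address bits 63:24, lower bits zero
        uint8_t** sec_tbl   = nullptr;      // 4K pointers to 4Kbyte pages
    };

    static prim_entry_t* prim_tbl = nullptr;

    void write_byte(const uint64_t addr, const uint8_t byte)
    {
        if (prim_tbl == nullptr)
            prim_tbl = new prim_entry_t[4096];                         // primary table on first access

        // Illustrative 12-bit hash of the upper address bits (collision check omitted)
        uint32_t pidx = (uint32_t)((addr >> 24) ^ (addr >> 36)) & 0xfff;

        prim_entry_t& p = prim_tbl[pidx];
        if (!p.valid) {
            p.valid     = true;
            p.base_addr = addr & ~0xffffffULL;                         // bits 63:24 of the space
            p.sec_tbl   = new uint8_t*[4096]();                        // 4K null page pointers
        }

        uint32_t sidx = (addr >> 12) & 0xfff;                          // bits 23:12 index the secondary table
        if (p.sec_tbl[sidx] == nullptr)
            p.sec_tbl[sidx] = new uint8_t[4096]();                     // allocate a 4Kbyte page on demand

        p.sec_tbl[sidx][addr & 0xfff] = byte;                          // bits 11:0 index into the page
    }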

The memory model was originally targeted for co-simulation with logic, but the core source code (in files
mem.h and mem.c) is purely software and provides an API that allows for both processor type accesses (bytes,
half-words, words, double words) but also allows for block accesses. This is important because most SoC main
memory systems provide ports for burst accesses and DMA. The model also allows big- or little-endian storage
and for multiple 'nodes', allowing more than one memory space to be defined. I would expect that the model (written in C) would be wrapped up in a class so that these different types of API could be exposed to the rest of the model, with checks and validations. This might be in the form of a 'singleton', so that only one memory model exists but is accessed via API objects that can be constructed where needed, effectively modelling different ports to the memory. No additional API functionality is provided for loading data, such as the code that might be in a ROM, but this would be loaded using the same API (with an 'instruction' type, say) prior to the model being executed.

Cache

From a program’s viewpoint, a cache is ‘invisible’, in that reading and writing data, whether it is in a cache or
not does not make a difference in terms of the values stored and retrieved (and it shouldn’t, otherwise
something has gone horribly wrong). The main difference is that the access time is much faster when the
memory being accessed is in the cache (see my article on caches for more information on how they work).
Therefore, the first level at which one might model a cache is to adjust the wait states that are generated by an access. If accurate timing is not important, then one can skip modelling the cache and, perhaps, use an estimated average for the timing (based on measured data) which incorporates the probabilities of when data is accessed in main memory versus when it is in the cache.

Even with just modelling for more accurate timing, the cache operation must be modelled to know when cache
lines are loaded and flushed, and when accesses hit or miss, and then the appropriate timings reported. The
LatticeMico32 model has a model of a cache for just this purpose (lm32_cache.cpp).

Caches also have different behaviours, depending on configuration, with a 'write-thru' or 'write-back' policy. The former means that all writes to the cache also have that cache-line written to main memory (the benefit of caching only being seen on reads), whilst the latter means that the cache-line is only written to memory when it is evicted to re-use the entry. It might be important to model this if cache-coherency issues are to be detected with the model. The cache model will then need to hold the cache-line in internal memory, and only write it back as per the configured write policy. It may also have to provide an API to allow 'snooping' for cache coherent accesses, but this is all getting beyond the scope of this introductory summary article. The point I want to make is that the level of detail is a function of the expectations of the model, and the effort increases as more detailed modelling is required.
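As a sketch, a timing-only cache model might sit in the memory access path something like this (illustrative names; the data always goes to the memory model, and only the reported wait states depend on hit or miss):

    // Sketch of a timing-only cache in the access path
    int soc_model::cached_access(const uint32_t addr, uint32_t& data, const int type)
    {
        bool hit = cache.lookup_and_update(addr, type);     // updates line/valid/LRU state

        mem.access(addr, data, type);                       // functional access is unaffected

        return hit ? CACHE_HIT_CYCLES : CACHE_MISS_CYCLES;  // wait states to report back
    }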

MMU

If the model needs to host an operating system which uses virtual memory (e.g. Linux) then a memory
management unit (MMU) will be present and may need modelling. I discuss how VM and MMUs work in my
article on the subject, and use RISC-V as an example. The functionality is not so complex (but it’s all relative)
as the complexity mainly resides in the software. There is no real way around this if the model is to boot a VM
OS. The diagram below, from my article, summarises the basic components of a virtual to physical page
number lookup for a RISC-V SV32 VM spec., that would need to be modelled for a RISC-V RV32 based
system, with the TLB (translation lookaside buffer) being a cache of page translations.

The LatticeMico32 model does have some basic MMU modelling, based on that added to the LatticeMico32 logic for the MilkyMist project by M-LABs. The LatticeMico32 model does, in fact, boot a version of Linux though, ironically, it is an 'MMU-less' version of Linux called μClinux, and doesn't use the MMU features. See the ISS documentation for details.

So we’re at the point where we have processing elements (from the first article) with connection to the rest of
the model, address decoding and bus/interconnect infrastructure, and a memory sub-system with optional
caching and virtual memory. So we can run programs, but it’s not really an SoC unless we have some
peripherals and connections to the outside world.

Interfaces
Modelling the Software View

As we saw in the first article in this series, a couple of examples of ARM-based FPGA SoC devices were shown, and a significant number of the bundled peripherals were interfaces to the outside world, with a mix of low bandwidth serial interfaces (e.g., UART, SPI, CAN) and higher bandwidth interfaces (e.g., USB, GbE). What they all have in common is that they communicate data between the SoC device and an external component. As we've seen, the software view of these interface peripherals is usually a set of memory mapped registers, and these can be modelled sufficiently to allow the code to interact with a software model of the interface, initiating control and reading status without necessarily modelling all the internal functionality. The interfaces usually come in one of two flavours—addressed accesses (memory mapped) and streaming accesses. For the former, an example might be an I2C interface, which sends read or write commands with an address (though not an address in the memory map as seen from the processor). For the latter case, a UART sends and receives bytes on a point-to-point connection without an address. Even ethernet, though having network addressing in the packet protocol, can be thought of as a streaming device at the interface.
The functionality of the model for the interfaces, once a register view is established, need not model the protocol used by the interface. There is also the question of what will be connected to the other end of the interface and need modelling. An example of a register accurate, but non-protocol accurate, model is that of the LatticeMico32 ISS UART (lnxuart.cpp). This is modelled on the Lattice 16550 UART IP core, and models registers that match this specification. The internal behaviour, however, is to emulate a terminal connected to the UART, printing characters to the terminal where the program was run from, and receiving keyboard inputs to return from the UART, using the C library functions available to our model code. The UART model has lm32_uart_write, lm32_uart_read and lm32_uart_tick functions by way of an API to the rest of the model, called when the peripheral is addressed. The tick function returns the interrupt status of the interface, allowing connection to the interrupt structure (more later).

As another example, the 8051 ISS model exports its GPIO pins (amongst other things) by allowing the registering of an external callback function that's called on every special function register (SFR) access, which includes the port registers mapped to the GPIO pins. The external function can decode the SFR access and respond to port 0 to 3 accesses, but return a 'not handled' status for any SFRs that are not modelled, and the ISS will handle these internally. If no callback is registered, the GPIO port registers are handled by the ISS, with registers that can be read or written but have no external connection.

This same basic structure can be used for any of the interfaces mentioned (and others, I’m sure). The actual
modelling is restricted to emulating what would actually be connected to the interface. With ethernet, though,
the libraries available can allow a TCP/IP client or server (say) to be set up in software that can actually be
connected from an external program, and even be a remote device. In the OSVVM co-simulation environment
an example of a server TCP/IP socket class (OsvvmCosimSkt.cpp) is provided that allows the sending of read
and write commands to the co-simulation software from an external connected device. In a demonstration test
case, the external client is a python program (both with a GUI and for running as a batch program), but the
OSVVM software is agnostic to who is sending the commands, and so the model can be connected to any
external client. The co-simulation documentation has more details. Now this example is for co-simulation with a logic simulator, but the connection over the socket is to software, which then does the actual communication with the simulator, so this method works for purely software modelling as well. So, we have a means of actually connecting one of the interfaces to the outside world. In projects I have worked on, a TCP/IP socket connection
between the embedded system and an external control program running on a remote host machine is the main
data link with the SoC. If our model can communicate with the actual control software in this same manner, we
have increased the potential software test coverage to include this application code.

Modelling Protocols

Sometimes, having a behavioural model of a peripheral isn't quite enough and a more detailed model is required.
We’ve seen before that this might be because accurate timing estimates are desirable, but it might be that we
need a valid protocol to interface with a model that is expecting it. My own experience of this has always
centred around co-simulation and driving logic IP from a software model, though this isn’t restricted to doing
so. However, the advantage here is that, having created a model and used it to develop the initial embedded
software, once the logic that connects to the interface becomes available, with a software model of the protocol
the logic IP can be connected into the model over a co-simulation link to a simulator and the model can drive
the real IP, replacing the emulation code that was originally there. This gives us an upgrade path for the model
and a basis of a test platform for logic and the driving software.

I’ve written about the logic simulators and their programming interfaces for co-simulation in more detail before
in a set of articles (see parts 1, 2 and 3 for more details), but it’s worth looking at an example in summary of
what is required. As well as a software modelled TCP/IP traffic generator, I have also created a PCIe model,
both of which have been used to drive logic IP for test purposes. A slightly modified diagram from the PCIe
model’s documentation is shown further below.
The model provides an API (left of the diagram) to generate PCIe gen 1 or gen 2 data for all three layers of the
protocol, the ultimate output and input being 8b10b encoded data (on the right of the diagram). The model is
aimed at co-simulation, where the “External Model” block is an interface to my Virtual Processor co-simulation
element, but this has a simple interface of VWrite and VRead functions which mostly index the lanes to
update or read the current 8b10b code, so this can easily be used with other software models. All the
functionality for the PCIe protocol is done in the software model and only (optional) serialisation was ever done
in the logic HDL domain.

If co-simulating, there are other ways of approaching this. In OSVVM co-simulation, for example, the software
side uses a generic interface to do reads, writes, bursts and streams etc., without regard to the underlying
protocol. In the logic simulation, OSVVM provides ‘Verification Components’ to translate from the generic
transactions to protocol specific signalling, such as for ethernet. If a VC isn’t available for a given protocol,
then there are instructions on how to construct this for oneself. Which approach is best, a pure software model
or an HDL translation from a generic transaction, will depend very much on circumstances and the
requirements, and I think both approaches have their advantages. The diagram below shows an OSVVM co-
simulation setup running the rv32 ISS model.
Timers
Timers are essential in SoC systems that run a multi-tasking operating system or have real-time capabilities (or
both). The simplest way to model this is to use the clock count from the processing element, such as discussed
in the first article, if such a timing model is implemented. In many SoC systems the clock is running
continuously and at a constant frequency, and so can be relied upon to indicate real-time as clock-count divided
by clock-frequency. However, many SoCs have variable clock rates, or clock gating, for power saving
measures, and thus no reliable clock count that matches real-time. Often these systems have either internal,
always powered, logic that is clocked from a separate real-time clock (often 32.768KHz) or relies on an external
device. Either way, we need strategies to model these situations.

For the case of a reliable system clock rate, a model of the timer block just needs to know the time. When we
discussed the memory callback of the processor models, I indicated that the processor’s view of the time is
passed into the callback. This might be passed down to the timer using the memory access hierarchy, but that
would mean it would only get an update if a register was accessed in the timer block. The callback function
could save the time off to some accessible means, which is better, but the granularity would only be at the
load/store instruction execution rate—which might be acceptable. If a clock tick granularity is needed, then the
timer will need to be called every clock cycle to inspect the current time so that it can generate any interrupts
that may have been set up. This requires that the processor models’ step function has clock based stepping as
well as instruction based stepping, and the execution loop can then step the processors and call a timer function
at a clock granularity. My experience is that this level of granularity isn’t really necessary, down to the low
nanosecond precision, as the interfaces to the timer are the memory mapped registers, which require a load/store
instruction to access (and thus can update the available time) or the interrupts generated by the timer block with,
often, a variable latency before the processor responds, making clock cycle precision just part of the ‘noise’.

So, these approaches work for running a simulation, where the executing model behaves correctly for its own sense of time but is not synchronised with real time and runs at whatever rate is dictated by the performance capabilities of the host that the model is running on. Its advantage is repeatability, since its timing is not reliant on outside events. What about running in real-time?

Real-time Clock

As we shall see later, basic SoC system models can run in the mega-instructions per second range. This lends itself to running in real time. It may be that the real SoC will run at a faster rate, but the model is still sufficiently fast for human-level interaction, such as keyboard input, or even a blinking LED at a once-per-second blink rate, and this can be very useful.

For this, then, we need a real-time clock model for our timer block. The host machine will have a real-time clock which we can access through library functions. The RISC-V rv32 ISS is configurable to use its internal cycle count state or a real time clock as the clock value for its mtime Zicsr register. A method is defined for this in the ISS.
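A sketch of such a method (illustrative, rather than the exact ISS code) might be:

    // Sketch of a real-time clock access returning microseconds since the epoch
    #include <chrono>
    #include <cstdint>

    static uint64_t get_real_time_us()
    {
        using namespace std::chrono;
        return duration_cast<microseconds>(system_clock::now().time_since_epoch()).count();
    }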

Here the chrono library is used, and the system clock accessed and cast to microseconds since the ‘epoch’ and
returned by the method. This is used by the RISC-V model in a small SoC model which can run the FreeRTOS multi-tasking operating system, scheduling tasks with time-slicing, and a demonstration is provided. I write
about this in my series of articles on real-time operating systems (see parts 1, 2, 3 and 4). In part 4 I discuss
porting this to the RISC-V rv32 based SoC model. The SoC system modelled to run FreeRTOS is shown in the
diagram below, taken from the 4th article, which uses the UART and memory model we’ve already discussed,
and the real-time library functionality for the timer, as described above. With this it’s possible to set up
concurrent tasks that execute at real-time delays of, say, 1 second and have available all the other features of the
real-time OS.

This is a simple SoC model, but already it is sufficiently able to run an OS supported by a large commercial
company (Amazon Web Services) with real-time capabilities and uses many of the techniques discussed
already. With components already available—the memory taken from the co-simulation memory model, the UART adapted from the mico32 model, and the rv32 ISS already implemented—only the timer functionality needed to be implemented (as described above) and the components put together. Thus, a working system that can run embedded software is constructed in a short space of time, and is then the starting point for adding more functionality to build a more complex model.

Speed of SoC Models for Real-time

I mentioned in the last sub-section that real-time operation is possible because models can run in the mega-instructions per second range. But what kind of range? I will briefly describe what I measure with both the mico32 model running μClinux and the rv32 SoC model running the FreeRTOS demo.

The LatticeMico32 model is much older than the RISC-V ISS, and I went through an exercise, in the past, of putting in configurability which allowed all 'nonessential' operations to be compiled out. That is, various checks and optional functionality were no longer included in the executable. This is the system that the μClinux demonstration is run on, with compilation optimisations set to 'fast'. As I write, my host PC has an Intel i5-8400 CPU running at 2.81 GHz, with 32GBytes of memory. On a virtual Linux machine on this host, running CentOS 7, I have measured between 99 and 101 Mips.

The RISC-V system, on this same machine runs at between 10 and 12Mips. Why the disparity? Well, firstly I
have not, as yet, gone through the same exercise as I have for the LatticeMico32 model to strip down to
barebones code. I’ve run a profiler when executing the code and, as one would expect, the functions in the
execution loop take up most of the time, with the decode method and the processing of interrupts method taking
up most of this time. Compared to the LatticeMico32, these two functions are much more complicated. The
mico32 decoding is only one table deep with a single 6-bit opcode for indexing, whilst the RISC-V model has
depth up to four tables. As for the interrupts, the RISC-V architecture has many more sources of exceptions and
more complex enabling and status constructions (with the Zicsr extension). Thus, the model runs more slowly,
though still at a fast enough pace to allow real-time operation.

Nonetheless, more complex models with, say, multiple processors and more sophisticated peripheral models will run slower still, and coding techniques matching in complexity may be needed to maintain real-time operation, if this is really required. This might be in the form of utilizing multiple cores in the host's processor, running model components, such as the processor models, in different threads, or using the virtual processing techniques discussed in article 1, which will run much faster than an ISS. When writing models professionally, speed of execution is always a constraint, and coding is always done with this in mind.

Algorithmic Modelling
So far, we have talked about interface modelling and the processors. Some peripherals are not interfaces,
however, but process data in an algorithmic way instead. Such things might include ECC, data compression or
encryption. How much of the algorithms is modelled depends on what the requirements are.

When I was at Hewlett Packard, I was putting together an FPGA based emulator for a tape storage drive data
channel and mechanism. It had to look, to the main controller logic and embedded software, like the tape
mechanism and be able to read and write a section of ‘tape’. In order to keep things simple, the hardware and
FPGA logic just implemented a large buffer to store a partial tape image, with some logic to produce emulated
signalling, such as the head spinning signal, and to ‘move’ up and down the tape. All of the tape image was
going to be generated in software, and one of the first reasons I went down that path is that I knew some of the
software I needed already existed. The engineer developing an ECC solution had already constructed a C model
of the algorithm to test and then match against the logic, when it had been developed. I was developing a data
compression solution and had done a similar exercise for this. There were some missing bits in the DSP part of
the channel, though, and no available model (the design spec. was not even fully complete). Part of this did
have to be on the hardware and so a block was constructed that did basically nothing (an effective delta-function
for the channel) but looked like the component to the rest of the system. This example demonstrates the kind of
modelling decisions that need to be made, making use of existing models, or simplifying the modelling to
implement enough for a valid algorithm case, even if this is ‘don’t alter the data’. I don’t have access to these
models, but I did re-implement a simplified data compression model, based on the LZW algorithm used at HP
(DCLZ), but stripped of unnecessary fluff, and this is an example of an algorithmic model.

Sources of Interrupts
In the first article, I discussed modelling interrupts for a processing element, with techniques for a virtual
processor without resorting to complex coding, and ISS models that use an interrupt callback to check the status
of interrupt requests. For both these cases it is likely that an interrupt controller block is modelled. In the SoC
model running FreeRTOS mentioned above, for example, the interrupt functionality models the RISC-V CLIC
specification. This interrupt controller model, from the processor side, can either call a method to update
interrupt state with the virtual processor, or update system state that the ISS interrupt callback can inspect when
called.

The sources of interrupts can come from various places, and, in an SoC, most peripherals will likely be a source
of interrupts, including things closer to the processor such as MMUs. The interrupt controller block just needs
to provide methods to allow each peripheral to update its interrupt request state, both set and clear. If edge
triggered interrupts need to be modelled (perhaps configurably), then the interrupt controller model will need to
hold request state after the input returns to inactive, until cleared explicitly by a write to the controller, as
dictated by the controller functionality being modelled, such as the ARM NVIC, for example.
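As a sketch, the controller model's request interface might be as simple as this (illustrative, ignoring edge-triggered and priority handling):

    // Sketch of an interrupt controller model's request interface (illustrative names)
    #include <cstdint>

    class intc_model
    {
    public:
        void set_irq   (const unsigned irq)  { pending |=  (1u << irq); }   // peripheral asserts its request
        void clear_irq (const unsigned irq)  { pending &= ~(1u << irq); }   // peripheral clears its request
        void set_enable(const uint32_t mask) { enable   = mask;         }   // programmed via its registers

        // Inspected by the ISS interrupt callback, or used to update a virtual processor's state
        uint32_t active_requests() const     { return pending & enable; }

    private:
        uint32_t pending = 0;
        uint32_t enable  = 0;
    };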

Conclusions
In this, the final article on modelling SoC systems in C++, we've concentrated on the devices other than the processors. This, and the previous article, have only been an introduction to the subject, but there have been plenty of references to example code and documentation to allow further investigation. There has not been room to mention all the features a model might have. For example, the mico32 ISS models hardware debug registers, which can be utilized by the gdb debugger, and also has save and restore features.

I have covered only in summary the subject of connection to external programs and co-simulation with a logic
simulator, but I detail this in a previous set of articles on PLI and on the co-simulation features of OSVVM,
including articles on interrupt modelling, and more on modelling event driven, multi-threaded programs on
a virtual processor as part of a document on OSVVM co-simulation nodes, both of which are relevant to this
discussion.

I hope I’ve given some insight into modelling SoC systems in C++ but had to trim this to a sensible length. If
people would like more detailed discussions on a particular topic, or clarification on the points I’ve raised, let
me know in the comments. Perhaps smaller, targeted, and more detailed articles can be done in the future, if
there is interest.

I have rarely had a job where software modelling wasn’t used in some capacity, even if I wasn’t directly
involved (which usually I was). It bridges the gap between embedded software and the logic and hardware at
the earliest possible stage, from architecture exploration, to allowing early integration, and then feeds forward
as a platform for testing IP, either as a means to co-simulate logic with the model, or as a model for comparative verification against the logic in a simulation test bench. Either way, it is useful for embedded software
engineers, logic design engineers and design verification engineers to understand what is possible with such
models, and for early cooperation and early engagement between these disciplines.
Finite Impulse Response Filters Part 1: Convolution and HDL
Simon Southwell | Published Sep 11, 2023

Introduction
In these articles on convolution and finite impulse response filters I want to move into the world of digital
signal processing. Strictly speaking my articles on data compression, including image compression with JPEG,
fall under this category but, I think, this article on finite impulse response filters is what most people would
consider as definitely in the realm of DSP.

Finite impulse response filters (or FIRs) are based around the process of ‘convolution’, and we will have to
have a look at some equations to explain what’s going on, but I promise that there is nothing particularly taxing
about this. When we get to a practical solution, we will be down to multiply-and-accumulate and nothing more
complicated than that. We will also look at impulse responses—that is, the output you get from a system when
an impulse (or delta) is input—which tend to be infinite (assuming unlimited precision). It turns out that if you
convolve the impulse response of the desired filter with your signal it will filter that signal as per the filter
characteristics. In fact, convolving in the time domain is the same as multiplying in the frequency domain (and
vice versa). So, if we wanted a low pass filter, say, this means that in the frequency domain we want to multiply all
frequencies from 0 to fc by 1, and all frequencies above fc by 0. This is the ideal
frequency response, and the diagram below shows this for a low pass filter.

Even more handy, multiplying everything by 1 in the frequency domain is equivalent to convolving with a ‘delta
function’ in the time domain—i.e., an impulse. Thus, if we know the impulse response of the filter, we can
convolve this with our signal and it will filter it, just as if we had taken the signal into the frequency domain,
multiplied it by our frequency response (by 1 for low frequencies, and by 0 for high frequencies), and put it
back in the time domain again. In fact, this is another approach to filtering which is outside of the scope of this
article, but we might visit this again when looking at discrete fourier transforms and the fast fourier transform
algorithm in the future. So, we ‘simply’ need to work out the impulse response values of our desired filter and
convolve these with our signal. We will look at this in this article and a handy open-source program is provided
to help us do just this.
In practical solutions, dealing with infinity is somewhat of a problem, but we can truncate things as values tend
towards zero, making things finite—and this is where the finite part of the FIR comes from, with the ‘IR’ from
the impulse response. The truncation of impulse response values effectively gives a ‘window’ on that response,
with everything outside of that window set to zero. Now, simply truncating the values causes unwanted artifacts,
and things can be improved if the window on the impulse response is shaped with a gentler roll-off.
There are many types of window functions that have been discovered and developed over time to reduce the
artifacts of truncating an infinite impulse response, and we shall be looking at some of
these and what the trade-offs are. The program I mentioned implements many of these (though we shall restrict
ourselves to just a few) and so the different performances of the window functions can be explored.

Another source of undesired artifacts come from the fact that we can’t have infinite precision in the values we
use for our impulse response but must ‘quantise’ these to be of finite precision, with a set number of bits, such
as 8- or 16-bit values. As we shall see, this tends to limit the amount of attenuation in the ‘stop-band’ that can
be achieved, and the number of bits chosen needs to match the requirements of the filter being implemented.

Finally, the impulse responses of ideal filters are made up with ‘sinc’ functions, or combinations thereof. A sinc
function has a general form of sine(x)/x, which is not too complicated. The diagram below shows a plot for a
generic sinc function.

As can be seen, the sine component provides the oscillations and the 1/x component gives the decay, with the
plot centred on 0, thus negative x on the left and positive x on the right. The mathematicians amongst you may
have spotted that if x is 0 the function is undefined (0/0), but that is not what’s shown. As x tends to 0 the plot
approaches 1, and so this limit is used as the value at 0. (A bit of a mathematical cheat, I know, but I’m not
mathematician enough to know why this is okay.)

Using these functions, as we shall see in the second article, gives the infinite impulse responses of our ideal
filter, and multiplying these with a window function to make them finite gives us our finite impulse response. Thus,
these are sometimes called windowed-sinc filters.

Now, don’t worry if this introduction has introduced a lot of concepts in a short space of time as we will revisit
these in some more detail in the rest of the articles. And, as mentioned before, when we get to an
implementation in HDL, we will only be doing multiplication and addition. All the messing about with
trigonometrical functions can be done off-line to calculate the values we will need in the logic, and the provided
program will do this for us, so we could use it ‘turn-key’, though I’d very much like you to understand
what’s going on.
The article is divided into two parts. In this first article we will introduce convolution for digital signals and
move straight to discussing an HDL implementation. By the end we will have a solution that can be used to
construct an FIR but won’t know what values to configure it with to do the filtering that we desire. That will
have to wait for the next article, where we will look at sinc functions and their use in constructing impulse
responses for our filters and at window functions to modify the infinite impulse responses to a finite set of
values.

Before diving straight into convolution, I want to talk briefly about the general characteristics of FIRs and their
advantages and disadvantages.

FIR Characteristics

There is not space in this article to give a full treatment to the alternative to FIRs, which are infinite impulse
response (IIR) filters. Without going into too much detail IIRs are characterised by the current output being a
function of both the current input and also previous output values (and inputs). This gives them a potentially
infinite impulse response, which has many advantages. Compared to FIRs there are fewer coefficients required
and thus smaller memory requirements to get similar characteristics in terms of cut off frequency transition and
stop-band attenuation. Because there are fewer coefficients, the latency is smaller than for FIRs. Also, IIRs can
be constructed in such a way as to be equivalent to analogue filters in terms of mapping between the s
(analogue) and z (digital) planes (for those that know what this means—which we can’t cover here). This is
handy if digitizing a previous analogue system. There is no such equivalent mapping for FIRs.

So, if I were a salesman selling FIRs, I’ve just convinced you to go down the IIR route, right?

Well, unlike FIRs, IIRs have non-linear phase characteristics and in some applications, such as processing audio
or biometric data, this can be quite undesirable. Due to the fact that IIRs have feedback paths, using previous
output values, they can become unstable, whereas FIRs can’t become unstable for any input signal since they
are a function of input values and impulse response only. Although these articles will limit what filter response
we will investigate, none-the-less, FIRs can have an arbitrary frequency response. FIRs must handle numerical
overflow to some extent, but this is easier than with IIRs, where the feedback makes design more complex and the effects of
quantisation more severe.

So which one is used for any given application will depend on the requirements, resources, complexity, and all
the normal engineering trade-offs in choosing between competing solutions. For these articles, we will explore
FIRs.

Convolution
I mentioned in the introduction that convolution was at the heart of FIRs, and so we need to understand this.
However, since we will be implementing a logic based convolution solution, I want to move straight on to that
implementation afterwards, in a break from a traditional text ordering, because, once we have this solution,
making a desired filter from it is simply a matter of setting the ‘right numbers’, and the next article will deal with
this. Thus, if you are only interested in a filter implementation, and are happy to get the right numbers from the
provided winfilter program, then you can skip the next article—though I genuinely hope you don’t.

What is Convolution

The general form of the convolution of two signals (e.g., an input signal and an impulse response signal h) is
given as:
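In standard discrete form (with x the input signal and h the impulse response) this is the familiar convolution sum:

    o[i] = \sum_{j} h[j] \, x[i-j]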
If the input signal has N points and the impulse response has M points, then the resultant output (o) will be an
N+M-1 point signal. What does this actually mean? Well, each output o[i] is calculated by multiplying the
impulse response values from 0 to M-1 with the signal values from i down to i-(M-1) and adding all these up. This is
summarised in the equation below.
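With the finite lengths made explicit, each output value is:

    o[i] = \sum_{j=0}^{M-1} h[j] \, x[i-j], \quad i = 0 \ldots N+M-2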

In other words, if we think of our impulse response as an M point array of values, and our signal as an N point
array of values, then for each output we multiply the M array elements of the impulse response with the last M
elements of the signal, starting from element i and running backwards, and add all the M multiplication results together.
If the signal is finite in length (N), then any signal value indexed outside of its valid range is assumed to be
0, so the multiplication is 0 and has no effect on the output value. Let’s try and picture this. The diagram
below has an 8 point signal to be convolved with a 4 point impulse response. This should yield an 11 point
output.
Each box is an element in an array for the signal (green), impulse response (blue) and output (orange). Each
output element is calculated as the sum of the overlapping signal and impulse response elements multiplied
together. I’ve tried to relate this to the previous equation, showing the specific summation of multiplications on
the right by each output element. So, we’ve ‘swept’ the impulse response across the signal for each output
calculation, multiplying and accumulating the overlapping elements. Note that the indexes for the response and
the signal run in opposite directions, and this is important for the general case of convolution. It is likely that a
signal will be asymmetric, and the secondary signal (an impulse response for our purposes but may not be for
other uses of convolution) may also be asymmetrical. Thus, to convolve, one or the other signal must be
reversed. It doesn’t matter which one, as we could treat the impulse response as our signal, and the signal as the
values to be convolved. As I said, the two sets of values need not be a signal and a filter’s impulse response, as
convolution has uses beyond FIRs. As we shall see later, the filter impulse responses will be symmetrical, so the
reversal won’t matter, but an implementation should do the reversal anyway so that it can be used for any
convolution purpose.

What About Arbitrarily Long Signals?

The example in the diagram assumed a finite set of signal values, but in a real DSP usage, the signal will
probably be an unending set of samples to be processed. Notice from the diagram that the impulse response
only ever overlaps the signal values over a limited range: basically, the length of the impulse response. So, as
long as the values to be convolved with the signal are of finite length then, if we remember the last M signal
values, we can keep producing output values indefinitely.

Now this is beginning to suggest a solution in logic. A multi-bit shift register is good for remembering the last
M values of an input. For our purposes, the impulse response we want to convolve with our signal is fixed, and
so we just need a look-up-table of values. Then, to convolve the impulse response with the signal for a new
output value, we simply multiply and accumulate the elements of the LUT and the shift register. When a new
input arrives, and the shift register is shifted, with the oldest value disposed of and the new value added, the
system is ready to generate a new output. It’s that simple!

There are some practical considerations to take into account. The above diagram suggests that there are as many
multipliers as there are entries in the LUT (known as ‘taps’). A practical impulse response might need to have
as many as 256 taps or more, and the accumulator would have to be able to add that many multiplication results
at once. If the rate of new input samples is low relative to the maximum clock frequency that, say, the multiplier
logic can run at, then the implementation can do a multiply and accumulate once per cycle for M cycles, and if
that period is less than the input sample period, only one multiplier and one adder is required. In between these
extremes, if M cycles is too long, then two circuits running in parallel, each doing half the taps (bottom/top or
odd/even, for example) can be employed. This would take two multipliers and two adders plus an additional adder
to add the two final results. Hopefully you can see that this bifurcation can be extended for quarter, eighth, sixteenth etc.
parallel calculations, trading off speed against logic resources, depending on the requirements. In the example
implementation we will discuss later, the clock rate is sufficiently fast that we can use a single multiply-
accumulate logic block to process all the taps with time to spare for the target functionality.

Another consideration is the quantisation Q of the signal and ‘tap’ values. The choice here is dependent on the
requirements and could range from double precision floating point (64 bits) to 8 bit integers or fixed point
values. Having chosen the number of Q bits, it must also be noted that multiplying two numbers of Q bits wide
gives a result width of 2Q bits and the adder must be able to take this size input. Also, when adding numbers,
the result can be one bit larger than the inputs (let’s assume they’re both the same width). Thus, potentially, to
avoid overflow (or underflow), the adder should allow additional bits for the accumulation: one extra bit per addition
in the worst case, though in fact log2(M) extra bits (rounded up) over the product width is enough when summing M
products. In practice, growing the accumulator like this can become impractical, particularly if M is configurable (as our example will be, via a
parameter). Various ways of dealing with this are available. Firstly, do nothing and let the accumulator rollover.
This may seem ‘lazy’, but it might be that upstream processing makes certain range guarantees that means the
accumulation calculation can’t over- or under-flow, and additional logic to deal with it is redundant. The next
level is to detect under- or overflow and flag an error. This might come in the form of detecting that an addition
would change the sign of the result when the input signs match. I.e., adding two positive numbers should not
result in a negative number and, similarly, adding two negative numbers should not result in a positive
number. Having detected over/underflow one could then take a further step and ‘saturate’ the result so that the
overflowed or underflowed result is discarded to be replaced with the maximum positive or negative number as
appropriate.
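As a rough sketch of what that detection and saturation might look like in Verilog (the signal names and widths here are hypothetical and not taken from the firfilter code), for a signed accumulator and product of the same width:

  // Illustrative fragment: signed overflow occurs when both addends have the same
  // sign but the sum's sign differs. On overflow, clamp to the most positive or
  // most negative value as appropriate.
  wire signed [ACC_BITS-1:0] sum    = accum + product;
  wire                       ovfl   = (accum[ACC_BITS-1] == product[ACC_BITS-1]) &&
                                      (sum[ACC_BITS-1]   != accum[ACC_BITS-1]);
  wire signed [ACC_BITS-1:0] satsum = ~ovfl              ? sum :
                                       accum[ACC_BITS-1] ? {1'b1, {(ACC_BITS-1){1'b0}}} :  // most negative
                                                           {1'b0, {(ACC_BITS-1){1'b1}}};   // most positive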

Having dealt with these problems, we have a result that is 2Q bits wide but, for many applications, we want the
result to be Q bits wide to match the input. We could just take the top Q bits from the result, but we can do
better than this and introduce less quantisation noise if we add the most significant bit of the truncated bits to
the rescaled value to round up. This, then, picks the nearest rescaled value to the original.

Earlier in the article I made the bold claim that convolving a signal with the delta function (an impulse) in the
time domain results in multiplying all frequencies by 1 in the frequency domain—in other words an all-pass
filter, which doesn’t alter the signal. Our solution gives us an insight into why that might be. Imagine our
impulse response is a delta function, with just one of the taps set to 1.0 (or the equivalent in binary). Now, as we
sweep this over our signal, the single tap with the 1 in it will pick out only one value from the signal and will
pick each one in turn as it sweeps through. All other values will be masked to 0, being multiplied by 0 from the
impulse response. The output, then, being the addition of all the results, and all but one of the multiplications
being zero, will simply reproduce the input value unaltered. This is true for whatever input signal we choose to
put into the convolution. Thus, this acts as an all pass filter.

Conversely, if we make our signal a delta function (impulse), when we convolve this with our impulse response
tap values, it will reproduce the impulse response at the output, as it picks one value at a time as it sweeps
through. So here we see, in our solution, that convolving in the time domain is the same as multiplying in the
frequency domain, and that an impulse in the time domain is the same as an all pass filter.
HDL Implementation
I have put together a configurable Verilog solution, firfilter, on github. Firstly, let’s have a look at its ports. The
diagram below shows the fir module block diagram with its signals and parameters.

The module has a clock (clk) and power-on reset (nreset, active low). It also has a synchronous external reset
(reset, active high) so that external logic can reset the module outside of POR. There is also a simple
subordinate memory mapped bus port which allows the tap values to be programmed from an external source.

The other two ports consist of the input samples (sample) with an accompanying valid (smplvalid), and the
output values (out) with its own valid signal (opvalid). The number of bits for the input and output values is
determined by the SMPL_BITS parameter (with a default value of 12). Needless to say, these values are going to
be treated as signed numbers. The TAPS parameter (with a default of 127) determines the number of logical taps
the impulse response LUT will have. It will also determine the width of the address (addr) on the memory
mapped port to be able to index all the tap entries.
Actually, as this implementation is targeting an FIR implementation, and because the impulse responses will be
symmetrical about the centre value, an optimisation is made to only store half the data, plus the centre value.
Thus, the actual storage defined is TAPS/2+1 tap entries and logic will be used to recreate the full impulse
response. If the size of the LUT is not a limiting factor, then this need not be done, saving on a little bit of logic,
and making the implementation a more generic convolution solution. However, I wanted to illustrate some
optimisation considerations that might be required when looking at the logic.

One other deviation in this implementation from the generic diagram seen earlier is that the taps are actually
SMPL_BITS+1 wide. This allows for the equivalent of 1.0 to be stored, otherwise the maximum value is
equivalent to 0.111111111111… (in binary) which gives a slight scaling to the output (an attenuation), the
degree of which is dependent on the magnitude of SMPL_BITS (the larger, the smaller the error). This may not be
an issue for some applications, but it will make it easier to cross-reference results with what we’ve
discussed and not introduce any scaling errors.

Indexing the Tap Values

Before jumping into the convolution logic, we need some logic to extract values from the tap LUT (taps—an
array of signed values). These will be programmed with the impulse response values from index 0 to TAPS/2 (integer
division). So, for the default TAPS value of 127, this is 0 to 63. When a new input value arrives (smplvalid is 1 for a cycle), a
register count is set to TAPS and counts down to zero, decrementing each clock cycle. For the first half we want
to index the LUT from 0 up to TAPS/2, and then count back down from TAPS/2-1 to 0. A combinatorial bit of logic
does this calculation assigning a wire tapidx:
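Something along the following lines would do the job (an illustrative sketch only—the exact expression in the firfilter source may differ):

  // Illustrative sketch: map the down-counter to a symmetric tap index.
  // countdlyd runs from TAPS down to 1 while a convolution is in progress, so the
  // index folds back after the centre tap: 0, 1, ... TAPS/2, TAPS/2-1, ... 1, 0.
  wire [$clog2(TAPS)-1:0] tapidx = (countdlyd > TAPS/2) ? TAPS - countdlyd :
                                                          countdlyd - 1;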

Note that countdlyd is the count value delayed by one cycle, which is equivalent to count+1 when the count is
active, saving an adder, aligning the value, and relieving the timing.

Multiply and Accumulate

The heart of the convolution is a multiply and accumulate function. In many FPGAs, ‘DSP’ functions are
included. For example, the Intel Cyclone V or the AMD Zynq 7000 series FPGAs, both fairly low cost
solutions. These DSP blocks are basically multiply and accumulate blocks (though they can be used as just
multipliers)—just what we need. One could instantiate these blocks as modules, but the synthesis tools can infer
the functionality and construct the blocks we need for us. It does help, though, if we construct the logic to look
like a multiply and accumulate equation. In the fir module, the synchronous process has the code:
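As an illustrative sketch only (the actual code in the repository has more going on around it), the heart of it reduces to something like:

  // Illustrative multiply-accumulate, using the signal names from the text and
  // assuming accum, tapval and validsample are declared signed. Writing the update
  // as a single sum-of-product expression helps synthesis infer a DSP block.
  always @(posedge clk)
  begin
    if (smplvalid)
      accum <= 0;                             // clear, ready for the new sample's convolution
    else if (countdlyd != 0)
      accum <= accum + tapval * validsample;  // one tap processed per cycle
  end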

The accum register is just what you’d expect, accumulating the values over the convolution. The tapval is the
registered extracted LUT value (taps[tapidx]). The validsample signal is a registered version of the shift register
(smpls[]) values, as indexed by count, but also validated as containing a value. After POR, the shift register has
no valid values and could be random. POR could reset them all, but this could be a lot of bits (TAPS × Q+1) so,
in this implementation, a register, smplsvalid, of TAPS bits wide is cleared on reset and is used as a single-bit
shift register, filling with 1s at each shift in of the input samples. Thus, validsample is the indexed shift register
value if the equivalent bit of smplsvalid is set, otherwise it’s 0. The update of accum happens whilst the count
value (delayed) is non-zero. It gets zeroed each time a new input sample arrives.

Rescaling

Lastly, when in the last cycle, the output is updated with the accum value rescaled. The code fragment below
shows this:
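A sketch of the idea is shown below, where lastcycle stands in for the logic flagging the final cycle of the convolution, and ACC_BITS for the accumulator width (both names, and the exact bit slicing, are assumed here rather than copied from the repository):

  // Inside the synchronous process: default opvalid low so it pulses for a single
  // cycle, then, on the last cycle, keep the top SMPL_BITS of the accumulator and
  // add the MSB of the discarded bits to round to the nearest output value.
  opvalid <= 1'b0;

  if (lastcycle)
  begin
    out     <= accum[ACC_BITS-1 -: SMPL_BITS] + accum[ACC_BITS-SMPL_BITS-1];
    opvalid <= 1'b1;
  end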

Note that the opvalid is set to a default value of 1'b0 at the beginning of the synchronous process so that, when
set here, it pulses for just one cycle, being cleared on the next. And that’s all there is to it. We now have the
basis of a finite impulse response filter, if only we knew what to put in the tap values. That will be the subject of
the next article.

With this design, having a single multiply and accumulate function, a new output takes TAPS + 3 clock cycles
to calculate. The extra three cycles allow for registering at both the input and output, and around the multiply-
and-accumulate function. Thus, for a 100MHz clock this would be, for the default TAPS size of 127, 1.3μs or a
rate of 769KHz. This 1.3μs is also the latency through the filter.

Conclusions
In this first of two articles, digital convolution was introduced showing how two signals can be convolved in the
time domain and what this means in the frequency domain. By making one of the signals a finite impulse
response we can use convolution to filter a signal.

A generic solution was explored, where there were no restrictions on resources such as memory, multipliers, or
the number of terms in an addition. This was used to demonstrate an intuitive understanding of why convolution
in the time domain is equivalent to multiplication in the frequency domain, and why a delta function (or
impulse) in the time domain is an ‘all-pass’ filter—i.e., multiples everything in the frequency domain by 1.

From this idealised solution some practical constraints were formalised, and a Verilog based implementation
was discussed. The solution requires only multiply and accumulate functionality over and above normal logic,
and the solution is, I hope, simple and easy to understand. We left this HDL implemented but with no
knowledge of how to configure this logic to act as a filter. We still need the numbers.

In the next article, I want to concentrate on generating the tap values required to turn our HDL implementation
into a filter. It will cover generating impulse responses using sinc functions, ‘windowing’ the sinc function to
make finite, designing these to meet our requirement, and analysing these to validate the responses. To do this
we will take a specification from digital audio processing as an example, for filtering 48KHz samples with a 4 ×
oversampled low-pass filter, a cut off at 20KHz and a stop-band attenuation target of -60dB. We will use my
winfilter program to help us do this but will discuss the calculations behind all this for a fuller understanding.
Finite Impulse Response Filters Part 2: Sinc Functions and Windows
Simon Southwell | Published Sep 15, 2023

Introduction
In the last article, I introduced the subject of finite impulse response filters (FIRs) and had a look at convolution.
An HDL implementation was then discussed showing just how easy it is to construct logic to implement a
solution for convolving two signals—for our purposes a finite impulse response with a continuous stream of
digital signal samples. However, I left things hanging, as the implementation needs its ‘taps’ (or
coefficients) configuring to program the impulse response into the logic, and we don’t know how to do that yet.

In this article I want to look more closely at the sinc function and explain why we’re messing around with it,
and then window functions to turn the infinite nature of the sinc into a finite set of values we can configure into
the logic implementation. Using a digital audio example to set some requirements on a practical design, we will
then use winfilter to generate tap values, looking at the effects of the different window functions, quantisation
bit width and the number of taps have on things like the filter’s roll-off rate, stop-band attenuation and ripple in
the pass-band. We will be able to plot the frequency response, the phase, the impulse response, and even the
window function to visualise the functions and the filter performance. Although we will explore a low pass
filter in detail as a practical example, we will also look briefly at highpass, bandpass and bandstop filters as
well, and how we can construct these. I mentioned in the first article that, actually, an arbitrary frequency
response can be implemented using FIRs, but space restricts what we can discuss here, so I will give some
references to allow adventurers to continue to explore on their own.

Once we have some tap values for our logic implementation, we can then test how the filter actually performs
against our predictions for the digital audio example, and some simulation results will be analysed and cross-
referenced. A simulation environment is provided with the Verilog implementation which should be adaptable
for a preferred logic simulator.

The sinc Function and Impulse Response


I mentioned in the first article that the impulse response for a low pass filter is a sinc function of general form
sine(x)/x. Firstly, this isn’t strictly true as it’s sine(πx)/πx, and secondly why is it a sinc function? In the
introduction to the first article there was a diagram of an idealised low pass filter. This diagram is actually only
half the story. It only plots the frequency axis from 0 to the sample frequency divided by 2 (fs/2). Without going
into too much detail, when sampling data at a given sample frequency, only signals up to half that rate can be
represented—this is the Nyquist limit. Any higher frequencies get ‘aliased’, and fold-back into the lower
frequencies. What we see when we plot the entire frequency range is something like that shown in the diagram
below:
If we simply extend the diagram from the first article to fs, we get a plot like that shown on the left of the
diagram above, which is a reflection of the first half. If we kept plotting higher and higher frequencies this
would actually repeat periodically, both in the positive and negative directions. The diagram on the right, then,
shows what you get if plot from - fs/2 to + fs/2—i.e., a rectangle. Now, the fourier transform of a rectangle is a
sinc function. I’m not going to prove this (others who know more mathematics than I do have done so already),
but this is why we want a sinc function for our impulse response. The phase for a sinc function centred around
0, is 0 for all frequencies. Even if the sinc function is offset from 0 the phase remains linear, with a slope that is a
function of the offset. This linear phase is one of the defining characteristics of an FIR, as mentioned in the first
article. We will look at a phase plot when we discuss the audio example.

The diagrams above don’t specify what the sample or cut-off frequencies are, but as we increase and decrease
the passband as a proportion of the sample frequency, we can get a ratio fc/fs. This can now be mapped into our
sinc function to get the impulse response for the cut off frequency desired as a fraction of sample frequency:
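In its usual published form, the low pass impulse response is:

    h[i] = \sin(2\pi (f_c/f_s)\, i) / (i\pi), \qquad h[0] = 2 f_c/f_s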

So, fc/fs can be anything from 0 to 0.5 to determine the low pass filter pass- and stopbands. For example, if our
sample rate was 100KHz, and the desired cut off frequency is, say, 12.5KHz, then fc/fs = 0.125. Plug this into
the equation and we can calculate the impulse response for all values of i, to plus and minus infinity.
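For instance, with fc/fs = 0.125 the first few values come out as h[0] = 2 × 0.125 = 0.25, h[±1] = sin(0.25π)/π ≈ 0.225, h[±2] = sin(0.5π)/2π ≈ 0.159, and h[±4] = sin(π)/4π = 0, with the values oscillating and decaying as i grows.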

The value when i is zero isn’t just 1—previously, in the first article, we used nice round numbers. The sine
function becomes linear the nearer we approach zero radians—i.e., sin(θ) tends to θ. Therefore, replacing the
sine with its argument, and dividing by iπ, leaves 2fc/fs as the value at i equal to zero, as shown in the equation. So now we’d better deal with
infinity.

Windowing and Window Functions


In the first article, I mentioned that truncating the infinite impulse response gave a ‘window’ on that response. I
also mentioned that this simple truncation adds undesirable artifacts and that things can be improved with
window functions that are a more rounded roll-off. The simple truncation, in fact, is known as a ‘uniform’
window. Over the years, researchers have experimented with various functions to improve the response from
the basic uniform window, from a triangle (aka Bartlett window), a cosine window, a raised cosine window (aka
Hamming window), through to some quite elaborate functions such as a Kaiser window. We can only look at a
few, but the winfilter program supports quite a few, and the functions’ definitions are all in the source code (see
windows.c). Starting from winfilter’s default settings, let’s look at a few example windows:

As you can see, the shapes of these windows more or less have a similar form, starting at 1 for the centre point
and reducing either side towards 0, and it gets harder to spot the difference between them, though they do make
a difference as we shall see. I haven’t plotted the uniform window, as I assume you can picture multiplying by 1
over the number of taps and 0 everywhere else.
If we now take the sinc impulse response that we calculated from the last section and multiply it by the selected
window function, we get the finite windowed-sinc filter tap coefficients (sometimes known as the filter’s kernel)
with which we need to program our HDL implementation. Plotting the frequency responses (in dBs) for these
windowed-sinc impulse responses (including for the uniform window), we get the following results:

I’ve ordered the windowed-sinc impulse responses by improving performance, from first to
last (left to right, then down). The uniform window has ripple in the passband and tails off slowly to an
attenuation of -46dB or so by the fs/2 point. The Bartlett window is not much better, reaching around -55dB at
fs/2, but with a slightly better roll-off. The cosine window improves things again, but with the Hamming
window we now have a very decent response, reaching -60dB (for the given Alpha value) with a superior roll-
off rate and maintaining that attenuation in the stopband. The Kaiser window, like the Hamming window, has
an Alpha (α) value that can be manipulated to trade off passband ripple, roll-off rate, and stopband attenuation.
With the values I’ve configured for the example, you can see that superior stopband attenuation can be
achieved, with good roll-off and minimal passband ripple.

Phase Characteristics

The diagram below shows the phase plot for the example Kaiser windowed-sinc impulse response over the
passband (0 to 24KHz).
The phase axis is in degrees and runs from -180° to + 180°, where the plot then wraps. If you imagine, though,
that the axis goes to infinity, then you would see a straight line extending in a linear fashion.

This confirms what was suggested in the first article that the phase is linear. This remains true regardless of the
number of taps, the window function used or the quantisation.

Window Mathematics
So far, I’ve taken the window functions as read without describing how they are calculated. The uniform
window should be intuitive, as, I hope, is the triangular window. For the rest, I want to look at the examples
discussed above, but the comments above each window function in the winfilter source code detail
the mathematics for all the window functions not explored here. This section can be skipped by those not
interested in the details who just want the tap coefficient values, but I think it is interesting to have a look for
informative purposes.

The cosine window isn’t quite as straightforward as plotting a cosine over ±½π, as it gets raised to a power, defined by the
Alpha setting (1.0 in the case of the example above). I.e.
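A common textbook form for this power-of-cosine window, with the taps indexed by i centred on zero, N the number of taps and α the Alpha setting (winfilter’s exact definition is in windows.c), is:

    w[i] = \cos^{\alpha}(\pi i/(N-1)), \quad -(N-1)/2 \le i \le (N-1)/2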

The Hamming window is a raised cosine where the Alpha (α) value defines how much above 0 the cosine is
raised:
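In the same centred-index notation the generalised raised cosine is as below, with the classic Hamming window using α = 0.54 (again, windows.c has the exact form used by winfilter):

    w[i] = \alpha + (1-\alpha)\cos(2\pi i/(N-1))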

A Kaiser window is now going to pile on the mathematics to some degree, using a ‘modified order 0 Bessel
function’. The window is calculated as:
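A commonly published form, with α playing the role of winfilter’s Alpha setting and i again centred on zero, is:

    w[i] = I_0\!\left(\alpha\sqrt{1-(2i/(N-1))^2}\right) / I_0(\alpha), \quad -(N-1)/2 \le i \le (N-1)/2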
where I0 is a modified order 0 Bessel function.
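Its standard series expansion, which an implementation sums to a finite number of terms, is:

    I_0(x) = \sum_{k=0}^{\infty}\left[\frac{(x/2)^k}{k!}\right]^2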

The Bessel function runs with k from 0 to infinity which, obviously, can’t be computed, but the terms
converge towards 0 as k increases, and a value of 69 terms was heuristically chosen to be used in winfilter.

So you can see, things can get quite elaborate. The Chebyshev window, for example, is actually calculated in
the frequency domain, and then converted into the time domain with an inverse discrete fourier transform. I’ve
included some examples in this section so that there is an idea of what’s involved. The good thing, of course, is
that these calculations do not have to be done as part of the implementation, but can be calculated offline, with
the tap coefficients as the result to be configured into the FIR logic implementation. The source code documents
all the windows, with C implementations, for those that wish to study this more deeply.

Quantisation
All the plots we’ve seen so far use double precision numbers for maximum accuracy, but quantising the
impulse response taps to a fixed number of bits (either signed integers or signed fixed-point values—it makes
no odds) has an effect on the resultant filter performance. In particular, the stopband attenuation performance
suffers. Taking the Kaiser windowed-sinc filter example from before, which reached better than -80dB in the
stopband, we can change the quantisation setting and replot the frequency response.
With 16-bit quantisation, we still reached around -80dB attenuation, albeit with slightly messier stopband ‘lobes’, but
with 12-bit precision, this becomes around -60dB and with 8 bits around -45dB but rises above this further up
the stopband frequencies. Thus the choice of window function and an Alpha value is not enough to reach a
particular target performance, and the number of quantisation bits has to be chosen to sufficiently meet the
requirements as well.

In theory, the achievable signal to noise ratio for a given number of bits (Q) is around 6Q + 1.76 dB. So a Q of
12 should yield -74dB. In the Q = 12 plot above, it actually does no better than -60dB across the stopband. What
I see is that the initial transition will reach around the theoretical limit, but the uneven lobes will often bring this
back up above the limit, so plotting the actual response is important to verify performance targets have been met.

High Pass, Band Pass and Band Stop Filters

Until now only low pass filters have been discussed, with the sinc function being the fourier transform of the
rectangular shape of the LPF, but what about highpass and bandpass filters?

For highpass filters we could think of this in terms of lowpass filters in one of two ways. Firstly, if we invert an
LPF in the frequency domain, i.e. flip it vertically so that the pass- and stopbands swap over, it will become a high
pass filter. Also, if we reversed the frequency response end-for-end between 0 and fs/2, so that the stopband comes
first, ending in the passband, that would also be a highpass filter. What does this mean for the impulse
responses?

For the inversion, in the frequency domain, if we subtracted the low pass filter response from an all-pass filter
response (i.e., 1) this would achieve an inversion. We saw before that an all-pass filter in the frequency domain
is a delta function (an impulse of magnitude 1) in the time domain. So subtracting our low pass impulse
response from the delta function will invert the response. This means multiplying all the values by -1, except the
centre point, where the delta occurs, which is instead subtracted from 1. The diagram below shows the impulse response
for an inverted LPF and its frequency response based on the Kaiser windowed examples from above.
To reverse an LPF, recalling the earlier idealised response diagram and the rectangular shape centred on 0
frequency, if we could shift this so that it was centred on fs/2 instead, that would reverse the response, with
(from the earlier diagram) the response from - fs/2 to 0—a reflection of that from 0 to fs/2—now sitting from 0
to fs/2. We can do this if we convolve the LPF frequency response with a delta function at fs/2. A delta function
in one domain is a constant 1 in the other if it is at point 0. If it is shifted, the signal in the other domain becomes
an oscillation: a delta function at fs/2 produces a time domain signal at a frequency of
fs/2, which alternates between +1 and -1 on successive samples, and we know that convolving in one domain means
multiplying in the other domain. So if we multiply every other coefficient of an LPF impulse response by -1, we
will reverse the frequency response. This is shown
below:
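In terms of the tap values, with c the index of the centre tap, the two highpass constructions can be summarised as:

    h_{hp}[i] = \delta[i-c] - h_{lp}[i]    (inversion)
    h_{hp}[i] = (-1)^{i} \, h_{lp}[i]      (reversal)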

Now that we can make highpass filters, we can combine these to make bandpass and bandstop filters. For a
bandpass filter, imagine having an LPF and an HPF such that, in the frequency domain, their passbands overlap. If you
multiply these responses together then, where they overlap, there will be a pass band and elsewhere a stop band.
Multiplying in the frequency domain is the same as convolving in the time domain, so we can take the impulses
from the low- and highpass filters and convolve them together to generate the impulse response for the band
pass filter. If we invert this band pass kernel, in the manner discussed above, this will become a band stop filter.
So, by using convolution, multiplication, and lowpass filters, we can generate these other types of filters as well.
This is why the bulk of these articles have concentrated on low pass filters, as the others are just a function of
these.
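Summarising in the same notation, with * denoting convolution of the two kernels:

    h_{bp} = h_{lp} * h_{hp}
    h_{bs}[i] = \delta[i-c] - h_{bp}[i]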

Designing the Filter


To put what we’ve looked at so far together, we will work through an example filter design, using winfilter, from
the world of digital audio. Digital audio sample rates perhaps more familiar to most people from consumer
electronics and streaming services might be 44.1KHz from Compact Discs (CDs), or 48KHz from DVDs and
(in an obsolete reference to my own past) Digital Audio Tape (DAT). Blu-ray supports many audio formats,
with varying number of channels at rates from 48KHz to 192KHz. These audio formats have sample widths
ranging from 16- to 24-bits. The sample rates for audio will not tax the HDL implementation, which can run at
over 750KHz with a 100MHz clock, and so we’ll target an audio application.
As a side note, an FIR that runs at the higher rate of, say, 192KHz can still process signals at a lower (integer)
rate such as 48KHz by ‘oversampling’ the signal. That is, inputting an individual sample multiple times (in this
case four, for 4 times oversampling) and the filter will work just fine. The advantage of this is that the digital
noise of the signal is spread over the larger frequency range. When the signal is converted back to the analogue
domain and low pass filtered, the noise at the higher frequencies is attenuated, leaving less noise in the passband
than would otherwise have been without oversampling. The analogue anti-aliasing filter specification can also
be relaxed as it has a much longer attenuation region with which to reach its attenuation target to filter aliased
frequencies. A simplified system is shown in the diagram below:

In the diagram I’ve ignored any clock synchronisation between the sampling rate and the internal clock, but this
would need considering—either the system clock rate can be a synchronous integer multiple of the sample rate,
or CDC logic added. CDC can add some time domain jitter to the samples which would need analysing to
ensure it is acceptable (beyond the scope of this article).

With oversampling, the specification of the analogue filter can also be relaxed, as mentioned above, as it can
roll off more slowly from the cut off frequency, as the space between this and its reflected frequency response
between fs/2 to fs is wider than without oversampling. But the analogue LPF must be present to remove the
reflected aliased signals. A typical analogue LPF used for this might be a Bessel filter which has a linear phase
delay (aka maximally flat group delay) preserving the phase linearity of the FIR. An example 4-pole Bessel
filter is shown below, courtesy of the Texas Instruments filter design tool.

Before setting out on the specific audio example, a quick summary of the winfilter program usage is in order.
Note on Using winfilter

Firstly, a note on the winfilter source code. The winfilter source code (originally xfilter for Linux and X-
windows) has its origins from more than twenty years ago. In fact, some parts of the code, particularly the
DFT/FFT parts, originate from my undergraduate dissertation work done in 1988/’89. Therefore, it’s written in
C, rather than C++, and its style is not necessarily how I’d construct code nowadays. What I was gratified to
see, when revisiting the code, was that there are plenty of comments, including the mathematics for things such
as the window functions alongside the code that implements them. So, if a bit rough around the edges in terms
of coding best practices, it is still a useful reference for all the details we can’t cover in this article (we didn’t
explore every window function or the DFT and FFT functionality—this latter we’ll save for another article).

As the name implies, winfilter is a Windows based graphical program as shown in the diagram below with the
default values:

The documentation that comes with winfilter explains in detail each of the functions on the GUI and what they
are all for, but a brief summary here is worthwhile, I think.
My intention for the program was to be able to explore FIRs, all the different window functions, and the effects
of the different parameters on filter performance. In terms of generating some tap coefficients we can
actually use, my intention is a two-phase approach. The first is to select parameters and alter them until we
meet the specification required, minimising parameters that cost in terms of logic, such as number of
quantisation bits and the number of taps. For this we can plot frequency magnitudes to verify predicted
performance. Once satisfied with these results, we can then plot the impulse response. The program will always
output the plot values to the file specified in the ‘Configure Output’ box. This would normally be both X and Y
values so the glGraph library I’m using can read these and display the graph. A new checkbox has been added,
labelled ‘Symmetric Impulse’, which can be checked and, so long as the quantisation value, Q, is non-zero, the
output will just be a list of signed hexadecimal numbers of the correct width for the Q value, suitable for
configuring the memory in the HDL implementation.

The program does come with an automode design function. This is only available for the Kaiser window
function and so this is automatically selected. In this mode, instead of setting the number of taps and an ‘Alpha’
value and seeing what performance is achieved, one can specify the pass- to stopband transition (the roll off)
and a target attenuation in dBs. When executed it will update the Taps and Alpha boxes that achieve that
specification, which then informs you of the number of taps the HDL implementation needs to be configured for.
Note that the Q value for quantisation will also have an effect.

By default, the quantisation value Q is set to 0. This means no quantisation, but actually the program will use
double precision floating point values, which is the best we can do. Using Automode when Q is 0 will, indeed,
achieve the specification configured. However, if you specify a target of -100dB attenuation with a 1KHz roll-
off period (other parameters being default), and then set the quantisation to 8 bits, you have no chance of
achieving that target, and this parameter must be experimented with as well.

Example Specification

So, for our audio example let’s set some parameters.

• 192KHz sampling rate


• Kaiser windowed-sinc filter
• Pass- to stopband transition < 3KHz
• -60dB attenuation in the stopband

The first parameter we can set from this specification is the quantisation Q. For an attenuation of -60dB we saw
from our quantisation plots from before that 12 bits is the smallest we can practically get away with. The
number of taps will determine the frequency transition width and we can use the Kaiser window automode to
set the 3KHz transition and the target attenuation. This gives 116 taps and an Alpha of 5.6533. If we make the
taps 127 (the implementation module’s default parameter setting) and set an Alpha value of 6.0, along with a
quantisation of 12 bits, we get a frequency response as shown below:
This appears to meet the brief well. At some of the higher frequencies the lobes do push slightly above -60dB,
but the analogue anti-aliasing LPF will be attenuating by those frequencies, bringing them back into line. Now
we have our settings we can generate the tap coefficients by selecting ‘Symmetrical Impulse’ in the Configure
Output box. The winfilter GUI will look like the following:
If the program is executed with these settings, the generated filter.dat file will now contain 127 signed
hexadecimal 13-bit numbers—from the last article we made the HDL LUT values one bit larger than Q so that
an equivalent 1.0 can be stored (0x800). The table below shows the first 64 hexadecimal values, remembering
we will only store half the table, with the impulse response reflected about the value at index 63:

These numbers have been scaled by the winfilter program to give maximum range, and thus resolution, to the
impulse response, with the peak of the impulse response at 2048 (at index 63). This will add a gain to the filter.
When winfilter uses floating point values (Q set to 0) the impulse response generated gives unity gain at DC,
but when quantised this would reduce the resolution, and so the numbers are rescaled to make the peak the
maximum quantised value. This can be set back to unity, if needed, by external means, such as the gain of the
anti-aliasing filter, or reduce the filter kernel values to normalise them. This latter, though, will reduce the
filter’s response performance.
So, after all this effort, we simply need to put these numbers into the tap lookup table of the logic
implementation and we have a filter to our specification. Let’s check if it works.

Performance

To test the FIR implementation, a simple test bench (tb) is provided in the github repository to simulate the
module and check the results. This test bench has a set of user definable parameters to configure the test.

The input signal to the UUT can be selected to be either an impulse (IP_FREQ_KHZ is 0), or a sine wave with
a frequency defined by IP_FREQ_KHZ, and a peak-to-peak value defined by IP_PK_PK. The system clock
rate is defined with CLK_FREQ_HZ, and the input signal sample rate is defined with SMPL_RATE_KHZ.
The default values are set up so that the clock rate is a multiple of the sample rate. I.e., a 96MHz clock and a
192KHz sample rate. The filter’s parameters, SMPL_BITS and TAPS are also exposed for configuration with
the test bench parameters. The illustration below shows a simplified block diagram of the test bench setup.

Simulation Results

For the purposes of this document, three tests were performed for the example configuration discussed in the
last section: an impulse input, a sinewave at 15KHz in the passband, and a sine wave at 24KHz in the stopband.

The diagram below shows the waveform trace for the impulse response. The signals displayed (from top to
bottom) are a clock and reset followed by the smplvalid pulses at the sample rate. The impulse signal, on
sample, is next (look to the left), followed by opvalid and then out from the FIR filter. The last signal uses the
simulator’s ‘interpolation’ mode to emulate a DAC and analogue anti-aliasing filter to remove the steps. As can
be seen, we do, indeed, get the impulse response at the FIR’s output. This also illustrates the latency of the
filter.
The next plot is for the 15KHz signal, which is at a frequency just before the predicted filter response starts to
attenuate. Here, a sine wave is generated and then sampled to input to the filter. As expected, we do see this
signal passing through the filter after the latency period.

The last plot does the same thing as the previous test, but the input signal frequency is now 24KHz, the lowest
frequency in the predicted stopband. Here we see that the signal is filtered and does not pass through. There is
some initial response after the filter latency, and this is due to the non-linearity of the transition from no input
samples to a sine wave input temporarily generating frequencies in the passband, but it soon settles down to
attenuating the input signal.

Conclusions
In this article, we built on the foundations of the previous article, which introduced convolution and left us with an
HDL implementation ready to program with some numbers. Here we discussed how to generate those
numbers for a lowpass filter, using the sinc function, being the fourier transform of the rectangular
frequency response, to give an infinite impulse response. This was turned into a finite impulse response using
‘window’ functions of varying degrees of complexity and performance. To make this practical for the Verilog
implementation the finite impulse response was then quantised, and the limitations introduced by this step explored.

Having generated a practical set of coefficients for a low pass filter, we then looked at how to manipulate and
combine the tap coefficients to produce high pass and band pass filters.

With all this knowledge a practical example from the world of digital audio was specified and a simulation
constructed to test the FIR implementation with some filter tap coefficients to match that specification. The
results did indeed confirm the filter’s response to impulses, passband frequencies and stopband frequencies to
be as predicted.

The convolution/FIR logic can be seen, I hope, to be fairly simple, with the complexity in the calculation of the
tap coefficients. Although more involved, the mathematics of this is not, I’d like to think, in the realm of
professors of pure mathematics, and at least none of it ends up in the logic implementation, as it can all be
done offline. The resultant FIR, though, has impressive characteristics for such simple functionality, and is
useful in a wide range of applications where stable and linear filtering is required.

More Information

For a very large section of my career, the foundation for the DSP knowledge I have has come from a book “The
Scientist and Engineer's Guide to Digital Signal Processing”, by Steven W. Smith. I cannot recommend this
book highly enough for anyone starting out in digital signal processing. Its writing style and clarity are
something that I have tried to emulate in my own documentation and these articles. Even better, you can get
hold of it for free, on Steven Smith’s website at www.dspguide.com. This book includes way more information
and detail than I can fit in a few articles, and covers much more than FIR filters. I still refer to this text (I
actually bought a hard copy, despite it being free). As far as implementation goes, it focuses more on digital
signal processors, where I aim towards logic implementation, but is very much relevant for both disciplines. So,
if you enjoyed these articles, and want to explore DSP more, check out this book.
Processor Design #1: Overview
Simon Southwell | Published Sep 14, 2022

Introduction
This and the next few articles are based on the notes I made for a mentoring program where I covered processor
architecture and logic design using RISC-V as the case-study, as this is a modern RISC based instruction set
architecture, is open-source, and is making a lot of noise in the industry right now. From knowing nothing about
how a processor worked, the mentee produced a fully working implementation that passed all the relevant RISC-V
International instruction tests. They then went on, of their own volition, to pipeline it for single cycle operation
on non-memory instructions. I’m hoping that this article will provide enough information that anyone who
wants to do so can reproduce what my mentee did, at least to a finite state-machine based design, but I will also
discuss some steps beyond these fundamental principles for more advanced features.

RISC-V is being used as a relevant example instruction set, but this is not a document on all aspects of RISC-V
and I will stick to only those features that allow a processor core to be implemented. Thus, in this article, we
will stick to the base implementation and relevant control and status registers, whereas there are many
instruction extensions to this base. Also, RISC-V can be 32- or 64-bit (or even 128-bit) and has three privilege
modes—machine, supervisor, and user—or four if you count hypervisor—but we will stick to 32-bits and the
highest privilege mode only (machine) as that has no restrictions on any permissions. RISC-V also supports multiple
hardware threads (harts), but we will stick to just one…and so on. There are many, many good resources out
there for those who want to know more about RISC-V, including the specifications which I will give links to at
the end of the article.

Throughout the articles I will be making reference to my own RISC-V logic implementation for illustration and
example. The source code, along with documentation, is available on github. It is targeted at FPGA, though
would easily be implemented on an ASIC, and is restricted to the specification I have just laid down. I have
sometimes sacrificed efficiency for clarity in the design as it is meant for informative and educative purposes
and there are many better, more developed, more verified, implementations than this available as open source
(e.g., Ibex), but my core is architected and documented for ease of understanding. Where a ‘better’ approach
might be warranted I will discuss this in the text to explore more general processor design features.

In this first article, though, processor function and design is discussed in a generic way (though looking at some
real examples) in order to define what a processor is, where it fits within a system, and what the common traits
are for the vast majority of processors. In future articles, we will look at the RISC-V architecture, the
instructions it defines for the base system, and the register sets (both general purpose and CSR). Then we will
look at the logic architecture to actually implement such a processor core, including optimisations and
alternatives. Finally, we will look at assembly language, the lowest level programming (discounting
programming directly in machine code, which nobody has been foolish enough to do since the 1970s). There is an
instruction set simulator (ISS) as part of the accompanying RISC-V project, and this can be used to experiment
with assembly language programming without the need for processor hardware.

What is a Processor?
Firstly, I want to say that a processor doesn’t do very much. It reads a set of fairly simple instructions from a
memory or internal registers, manipulates associated data according to those instructions, and stores this altered
data either internally, or back to memory. And that’s it. The power comes from the fact that it can do these
instructions very fast, and that more complex operations can be achieved by combining the limited set of
instructions. The instructions used vary between different processors, but a result that might surprise you is that,
in the limit, only one instruction is really necessary! All the processors that have multiple instructions are
actually doing engineering to make the processors’ operations more efficient. Such One Instruction Set
Computers (OISC) actually exist and can perform the same functions as a bigger processor, albeit much less
efficiently.

A modern processor has more than one instruction, and these are encoded as plain binary numbers (‘machine
code’) which the processor reads from memory (maybe RAM, flash, or other storage device) to manipulate data
and read and write to memory. This is the software running on the processor. It will have a bus to access
memory and other devices such as I/O, using protocols such as AXI or Avalon (see my article on busses). The
internal registers vary in number and purpose between different processors, as we shall see, and this register set
and the instruction set that the processor recognises, is known as the processor’s Instruction Set Architecture
(ISA). The ISA defines, then, whether the processor is RISC-V RV32I, ARMv8, IA64, etc. There may be
different implementations of the same ISA, but the processor is classed based on the ISA that it implements.

Processors are often classed as n-bit, for example 8-bit. This (usually) refers to the size of the data and
instructions it processes, and this has been increasing with time. For microprocessors, it all started with the Intel
4004 at 4-bit, and the 1980s 8-bit home computer revolution with 8-bit processors, such as the MOS 6502,
Zilog Z80, and the Intel 8080—all originally designed in the 1970s. A short period of 16-bit processors (e.g.,
Motorola 68000 and Intel 80286) was taken over by 32-bit processors such as the Intel 80386, SPARC v8 and
ARM Cortex-M. This 32-bit era still persists in the embedded processor world, but modern PCs, workstations,
and smart phones, use 64-bit processors (I’m ignoring graphics processors), such as Intel i9, and the Apple A15
incorporating 64-bit ARM based processors. Beyond this are Very Long Instruction Word (VLIW) processors
such as the HP/STMicroelectronics ST200 family of processors, Analog Devices’ SHARC DSP processor and
the u-blox software defined modem (SDM) processor (which I worked on as a DSP software engineer doing 4G
physical layer code including the code on the VLIW processor).

CISC versus RISC

Another categorisation of processors is whether it is CISC or RISC. That is, is it a complex instruction set
computer, or a reduced instruction set computer? In the early days of processors, iterations of processors tended
to add more instructions with more complex functionality to aid in the efficiency of programs which, originally,
were written directly with the processors’ own specific instructions. With the advent of higher-level
programming languages and their compilers, it was found in research that, in general, compilers would use 80%
of the instructions only 20% of the time, and 20% of the instructions 80% of the time. Using this result, work
was done to design processor architectures that had a reduced number of instructions—the ones that
ran most of the time—implemented to run more efficiently. The more complex functions can be emulated using multiple
simpler instructions which, although slower than a dedicated instruction, is done less often on a processor that
can run much faster.

The term complex instruction set computer (CISC) was retro-fitted to earlier processors that had the characteristics
of a large instruction set, including instructions that had complex functionality, with multiple ‘modes’ for each
instruction, and variable sizes for encoding an instruction. An example of a CISC processor is the
Intel x86 family and the processors derived from this architecture. RISC processors, by contrast, have fewer
more simple instructions, implemented for fast execution. The instructions are all fixed width, making it easy to
pipeline an implementation. Also, RISC processors generally separate data manipulation from memory input
and output. So all data manipulation is done on values held in internal registers, with results placed back in internal
registers. All movement to or from memory is done with instructions that can’t alter the data. This separation
avoids having multiple modes for each data manipulation instruction—e.g., supporting a function that can have
data either from memory or a register, or a memory location indirected by another register etc., as is common in
CISC processors. Example RISC processor architectures include ARM and, of course, RISC-V. Indeed, even
modern CISC processors often have an internal RISC architecture, and stages to break down complex
instructions to multiple simpler internal operations.

Role of a Processor in an Embedded System

My particular interests and experiences are with embedded systems and SoCs, so where does a processor fit in
such a system? Within an embedded system or SoC, one or more processors might sit on a system bus, such as
an AHB or AXI bus, along with a memory sub-system (with caches and MMU) and with peripherals that it
might control, such as ethernet, UART, USB etc. It might also have an interrupt controller and timer to produce
internal and external ‘events’ (more on this later). The diagram below shows a simple SoC arrangement.

This arrangement is oversimplified, but indicative of most processor-based systems, with software in memory
running on one or more processors which control a set of peripheral devices over the system bus or interconnect
fabric, which make up a system, such as a controller for a storage device, a smartphone, or even a Raspberry Pi. If
there are multiple cores, then the cores may also communicate with each other, usually through memory.

Basic CPU Operation


We’ve not yet defined any kinds of instructions that a processor may use, but we can still discuss what a
processor does when it is powered up and taken out of reset, that is common to the majority of processors. The
diagram below shows what happens at this first step.
After reset is removed, a processor will start to read an instruction from some predetermined fixed location, for
example at address 0 (shown as step 1 on the diagram). The value of the instruction is returned from memory to
the logic for interpretation (step 2). It may be that the particular instruction manipulates data from two of the
internal registers and so these are fetched by the logic (step 3). The result of that manipulation might be,
depending on the instruction, placed back into another internal register (step 4), and the instruction execution is
completed. In step 1, the address is shown to come from a particular register labelled PC. This is the program
counter. When an instruction completes, this is normally incremented to point to the start address of the next
instruction located in memory immediately after that just executed. For a 32-bit RISC machine, all instructions
are 32-bits, or 4 bytes, and so the PC (a byte address) would be incremented by 4 and the whole cycle started
again.

The next instruction, instead of manipulating data, might be a memory access, such as a write.
The diagram shows a store operation, where two internal register values are accessed, one to form an address
for the data to be written and one for the actual data to be stored. So, instead of the ‘result’ being written back to
an internal register, it is directed to memory. Similarly, a read from memory instruction might read an internal
register for an address value, perform a read operation from memory, and the returned data written back to
another internal register.

One last class of instruction is one which can change the value of the PC from its default of moving to the
address of the next instruction. The diagram below illustrates this:

Here, the instruction is read from the current PC address and sent to the logic. The logic might read two internal
registers to compare their values. If they meet some criteria, such as being equal, the logic then overrides the
default increment of the PC to some new address location. Often this is an offset from the current PC value,
forward or back, that is encoded in the instruction itself. So, in the diagram, if the two registers did not meet the
criteria (of being equal, say), then the PC would increment as normal and the new PC address would be 0xC. If
the two registers were equal, then the instruction might have an offset of, say, -8 and the PC would be set back to
0x0 and start executing instruction from there once more. The offset could just as easily have been positive, +8,
say, and would then skip the instructions at 0xC and 0x10, and start executing instructions from 0x14.

This basically describes all the types of operations that a modern RISC processor does: data manipulation to and
from registers, memory reads and writes, and overriding the program counter. The only addition to this is an
exception where an external signal (interrupt) or an internal error event can also change the program counter
value, but this is not normal program flow and we will look at this shortly.

Processor Core Port Architecture


It has been shown above that a processor core reads instructions from memory. Some of those instructions will
direct the core to load data from memory or store data to memory. These two classes of memory access,
instruction and data, can be handled in one of two ways via the external ports of the core. The diagram below
shows two examples of cores with memory port configurations.
The first configuration has a single memory port. The diagram shows a simple SRAM-like port with a wait
request, but it could be any kind of port to access memory (e.g., AHB). This configuration is known as a von
Neumann architecture. This might cause a conflict on memory access if an instruction needs to be read whilst data is
being loaded from, or stored to, memory. If the internal design is such that an instruction is completed before
the next instruction is fetched, then no conflict arises. However, this is not a very efficient implementation, and
modern pipelined implementations can fetch instructions in parallel with accessing memory for data. The
second configuration has separate memory ports for instructions (a read only port) and for data (a read and write
port) and is known as a Harvard architecture. Now, memory can be accessed for data, as well as instructions
fetched in parallel. It may be that the instructions are in a separate ROM which is connected directly to the
instruction port, with the data port to RAM or DRAM. Alternatively, the instructions and data may ultimately
reside in the same set of memory, which would simply move the conflict to the memory sub-system which
would need to arbitrate access for both ports. However, the memory bandwidth may be higher than that of the
core’s memory ports, and so both ports can be run at 100% efficiency. This is not an uncommon situation in an
SoC.
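
To make the port options concrete, the sketch below shows roughly what the external ports of a Harvard-style core might look like in Verilog, with simple SRAM-like ports and wait requests as described above. The module and signal names here are my own inventions for illustration, not those of any particular bus standard or of the github implementation.

module core_ports_sketch
(
  input         clk,
  input         rst_n,

  // Instruction port: read only
  output [31:0] iaddr,      // instruction address
  output        iread,      // instruction read strobe
  input  [31:0] irdata,     // returned instruction
  input         iwaitreq,   // hold the access while asserted

  // Data port: read and write
  output [31:0] daddr,      // data address
  output        dread,      // load strobe
  output        dwrite,     // store strobe
  output  [3:0] dbyteen,    // byte enables for sub-word stores
  output [31:0] dwdata,     // store data
  input  [31:0] drdata,     // returned load data
  input         dwaitreq    // hold the access while asserted
);

  // Core logic omitted: only the port shape is being illustrated here.

endmodule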

Internal Registers
Also common amongst nearly all processors is a set of internal registers, which we have alluded to already. The
type and number vary, but they all have very similar roles. The basic categories of these internal registers are as
follows:

• Program counter (PC)
• Control and status registers (CSR)
• General purpose registers

Within the general-purpose registers, certain ones might be nominated as having additional specific purposes
but could still be used as a general-purpose register. Also, many modern processors nominate a register to
always read as zero.

Below are some examples of real register sets from three different processors, and from different eras.
All three of these examples have registers that fit, roughly, into the listed categories. The 6502 has A, X and Y
registers that are more or less general purpose. The ‘stack pointer’ is kind of custom but could be used as a
general-purpose register. The PC maps directly to the functionality already described, and the P register
(processor flags) is a status register. The ARM processor has a set of general-purpose registers, r0 to r15, where
r13 is nominated for a ‘stack pointer’, r14 for a ‘link register’ and r15 for the program counter. The PSR is the
status register. Finally, the Tricore processor from Infineon (which I worked on, constructing software models
of the processor and system) has 32 general purpose registers (though split as address and data) and a PC, with
control and status in the PCXI and PSW registers.
This, I think, serves to illustrate that there is a commonality of internal registers amongst varied processors.
Compilers, such as gcc, work (in broad terms) by having a generic model of registers which they use to map the
programming language code to an intermediate code before mapping to actual registers (and instructions)
available on the target processor. Modern processor architecture is now sympathetic to this compilation process
to make mapping straightforward.

In the diagram for the three processor register sets, looking at the ARM and Tricore, the diagrams show labels
in brackets next to some of the register names. I’ve mentioned before that some registers are nominated for
particular functions but this need not be the case (though it might be for some processors). In reality, it is only
when writing code in assembly language (the lowest level of programming) that these nominations could be ignored. The reason for
nominating registers like this is to have a convention when compiling code from a higher-level language. For
interoperability of code compiled separately it is helpful if everyone follows the same convention. This
convention is often called the application binary interface (ABI). It dictates things like the register to use for the
stack pointer (I’m not going to define these terms here), which registers to use as inputs to a function call,
which as outputs, and other things like this that are relevant to a high-level language.

Exceptions and Interrupts


Before we leave this generic discussion on how processors are used and operate, I want to mention exceptions
and interrupts. I have mentioned these in passing in the above sections, but I want to fill in some of the blanks.

We have discussed the flow of a program running on a processor as proceeding in a sequential fashion through
memory, though this can be altered using instructions to change the default program counter increment to
change the flow through the running program. Another way to change the flow of the program is with
‘exceptions’. These usually happen from some internal error condition, or from some external event or interrupt.

For the internal events an error condition might be, for example, an unrecognised instruction. If an instruction
read from memory has a value that does not decode to any known instruction, or some fields within an
instruction have illegal values, this is an error and causes an exception. Some processors have special
instructions to actually cause particular exceptions, so that a program can generate these events itself, rather
than on an error condition.

For external events, these usually come in the form of interrupt input signals. There can be multiple interrupt
inputs to a processor core, but fundamentally this boils down to one exception with other logic (an interrupt
controller) sorting out if a given interrupt input is enabled (so the processor responds) or is the top priority to
respond to if multiple interrupts are active. The interrupt controller logic might be within a core, but is often
external to the core, with just a single input signal to the processor.

For both the internal and external events a processor will finish its current instruction and then change the
program counter from its normal next value to be some fixed, predetermined, address in memory (not unlike
after coming out of reset). It will save the address of the next instruction it would have normally executed to
some store (usually a register). In some cases, the source of the exception may add an offset to the fixed
address to differentiate the various types and sources of exceptions. The code that is located at this fixed region
of memory is specially written and is known as an exception handler. So, for example, if a UART peripheral has
a new byte just arrived, it might raise an interrupt to indicate this byte needs processing. The interrupt causes an
exception, the exception handler is called and will identify that the exception is from the UART, so call a
routine to fetch the byte which might place it in a buffer in memory for the main software to process. When this
is finished (and the handler code is usually as small and fast as possible), the exception can be completed and
the program counter reset to the saved address so that it can carry on from where it left off at the point of the
exception. If the exception handler can’t process the exception, this is when the system can ‘crash’, perhaps displaying a
message (if it can) before halting the processor.
In the simplified SoC diagram above I showed a timer, along with the other peripherals which, itself, can be a
source of interrupts. This is important when constructing a multi-tasking system, where the processor core is
running an operating system and several other processes and threads ‘concurrently’. That is, it appears that
multiple programs are running, but actually only one is running at a time, and they are swapped out at regular
intervals by the operating system software, which is where the timer comes in. This might be programmed to
interrupt after a given time, and then a particular process’s code is allowed to run. When the timer interrupts at the
end of the period, the exception handler notices that this is so, and hands control back to the operating system,
which can then swap in another process to start running from where it last was running when it was swapped
out, and so on. (There are other reasons a process might be swapped out, such as it waiting for the UART to
receive another byte, as it won’t make progress anyway, so the OS might as well let something else run, but this
is still event driven and under the control of the OS software.) Thus multiple processes and threads can make
progress on a single core, as if running in parallel on multiple processors. However, from the processor core’s
point of view, this is all just interrupts to jump to the handlers, and the handlers are software to be run, just as
for the main code, to know what actions need taking.

Conclusions
In this introductory article we have restricted ourselves to discussing processor cores in a generic way, without
committing to details of supported instructions (which can be just one, but usually isn’t) or the particulars of a
logic implementation. This is so that the fundamental characteristics can be teased out, without the complexity
of the details for a particular architecture or an implementation strategy.

And it turns out that a processor does not do very much that’s particularly complicated (I hope I’ve
demonstrated). It reads instructions (which are just numbers in memory) which tell the logic to process data in
and out of internal registers, load or store data from memory or update the program counter to start executing
somewhere other than the next sequential location. Exceptions are very much like the PC update instructions,
except they are caused by errors or external events (interrupts) or maybe even special instructions. In these
cases, the new program counter address is set to a fixed location (or fixed location plus an offset), and normal
program flow can be restored once the exception is handled by the specially written software, having saved the
place where the exception changed the PC.

In the next article I want to map what we have discussed here to a modern real-world example, the RISC-V
architecture. We will look at the base configuration, the instructions defined for this, and the registers, both
general purpose and the control and status registers. This should set us up nicely for discussing an
implementation in logic.
Processor Design #2: Introduction to RISC-V
Simon Southwell | Published Sep 17, 2022

Introduction
In the first article I discussed what a processor was and what it did in generic terms, with just some loose
examples of particular processors. In order to get to implementing a processor in logic, more specific details are
required and I said we’d look at the base specification for RISC-V as a relevant modern processor instruction
set architecture, whose ISA is open-source. In this article, then, I will introduce the instruction set for the RV32I
specification—that is, the 32-bit integer specification that’s the minimum feature set for compliance. Also, to tie
in with the other generic processor aspects discussed in article 1, the control and status registers (CSRs) will be
introduced, though, strictly speaking, this is an extension (Zicsr) to the base specification, but all practical
implementations would have these registers. You’ll be glad to hear that there is only a small minimum set that
need to be implemented. Following on from the CSR registers, but closely associated with them, is the RISC-V
exception handling—events and interrupts—which will be discussed.

In order to strip away obfuscating and unnecessary detail, but still have an operational and compliant processor,
other aspects of RISC-V will default to the simplest legal configuration. All the extensions (except Zicsr) will
not be detailed, though we will have a review of what the common extensions are and what they do, and we will
stick to the highest-level privilege mode (machine). RISC-V also defines harts, which are hardware threads—
multiple copies of the processor registers for fast, hardware assisted, swapping between different program
flows. The minimum number specified is 1, so we will stick with this. The hope here is that, by sticking to a
minimal but functional and compliant implementation (when we get to the logic), this becomes a foundation for
discovering the additional features that can augment the base specification without too much trouble.

I realise that I am writing this article as a summary of a specification and so, necessarily, this might be a little
repetitive in form, but I hope to map back to the concepts introduced in the first article as we go, and I promise
we are not going through every instruction in minute detail (not that there are too many). The article is a little
longer than my normal articles, but we need to cover enough ground to have as a basis for an implementation.

RISC-V Register set


Before looking at the RV32I instruction set, the processor internal general-purpose register set needs to be
defined. The tables below list these registers:
So, RV32I defines 32 general-purpose registers from x0 to x31, plus a program counter (PC). There is nothing
special about the general-purpose registers except for x0. This register will always return 0 when read. It can be
written to, but its value will remain at zero. For x1 to x31, the specification does not define the values of these
registers after reset, and these may be written and read using the instructions we will define presently.

The tables, next to the register column, define a name for each register under ABI. In the first article, this was
mentioned as the application binary interface, which is a convention on the use of registers for programming
languages. As this is academic to a logic implementation, the ABI names are just here for reference and
because, when we get to assembly language programming either the register name or the ABI name can be
used, and disassemblers can also be configured to use one or the other.

All the registers are 32-bits wide since we are looking at the 32-bit specification. If this were for the 64-bit
specification (RV64I) then these would all be 64 bits but would still have the same names. It is this register set,
then, on which the processor’s instructions will perform operations.

RV32I Instructions
As mentioned before, RISC processors have the characteristic of a limited number of fixed width instructions.
For RV32I these are all 32-bit instructions. Let’s remind ourselves of the categories of processor instructions
defined in general in the first article.

• Instructions for data manipulation in and out of internal registers
• Instructions for reading and writing memory
• Instructions for altering the default update of the PC
• Instructions to generate events and other system related functions

The RISC-V RV32I specification follows this model, and in the rest of this section we will look at the
instructions for each of them.
Instruction Formats

Before looking at specific instructions, recall from the first article that the instructions are just binary
numbers (32-bits for RV32I), and these are known as machine code. Machine code instructions are sub-divided
into bit fields with different meanings associated with them. Some of the fields are used to identify the unique
instruction (thus its operation), whilst others are for identifying which registers to read from and to write back a
result. Some fields are even just encoded constant numbers to be used in an operation.

For the majority of the data manipulation instructions these instructions follow one of two formats, the R-type
or I-type format, as shown below:

Both these instruction format types share common attributes. The 7-bit ‘opcode’ (operation code) is the first
field that identifies a ‘class’ of instruction. For example, all arithmetic instructions of R-type format have the
same opcode value (in this case 0110011b). The next field is rd, which is a five-bit index into the register set,
and is the destination register—i.e., the register (x0 to x31) to which the result of the operation will be written. The 3-
bit function field (funct3) further refines the specific instruction within the class (e.g., add, subtract etc.) and, for
R-type format instructions, it also requires the 7-bit funct7 field, whereas the I-type format instructions are
uniquely identified with just funct3. The rs1 field is a 5-bit index into the processor registers from where a value
will be read for manipulation. For R-type instructions there is also an rs2 index for reading a second value from
a register to process. For I-type instructions, there is no second register index, but a field called ‘imm’. This 12-
bit field stands for ‘immediate’ and is a number encoded within the instruction itself in place of a number
fetched from an internal register. This number can be interpreted as a signed value or an unsigned value,
depending on the specific instruction, but is limited in range: either -2048 to 2047, or 0 to 4095.

There are only four other instruction formats, and they have the exact same sorts of fields as the R- and I-type
formats, but just formatted differently or, in the case of immediate fields, a different number and range of bits.
These remaining formats are shown below.

The S-Type format is used exclusively for memory store instructions (load instructions use the I-type format).
The main difference here is that the immediate value is split up into different bits. The RISC-V specification
will always try to keep formats that have a particular field, such as rd, in the same bits for all of the different
types. This makes decoding so much easier (and faster). Something has to give, though, and the immediate
values tend to get jumbled up.

The B-type and J-type formats are for the PC updating instructions (branch and jump). The immediate values
look jumbled up somewhat but, even here, an attempt was made to align the same bit position in the same
locations within the instruction. So, for example, the B-Type instruction does not require imm[0], so imm[11] is
placed in this space, leaving imm[4:1] in the same positions as for the S-type format. Similarly, for imm[10:5].
This bit alignment is followed where possible for J-type instructions as well.
The final format is the U-type, used for a couple of special data manipulation instructions to load a 20-bit
immediate value into the upper 20 bits of a register, used for forming addresses within a register. Thus, only a
destination register index (rd) and the 20-bit immediate number are required after the opcode. No attempt is
made to align these immediate bits with the other formats.
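
To illustrate how cheap this field alignment makes decoding, the sketch below extracts the common fields and reassembles the sign-extended immediates for each format in Verilog. It is only indicative (the module and wire names are mine, not taken from any specific implementation), but it shows that unpacking an instruction is little more than wiring.

module rv32i_fields
(
  input  [31:0] instr,                        // fetched 32-bit instruction
  output [6:0]  opcode,
  output [4:0]  rd, rs1, rs2,
  output [2:0]  funct3,
  output [6:0]  funct7,
  output [31:0] imm_i, imm_s, imm_b, imm_u, imm_j
);
  // Fields common to the formats always sit in the same bit positions
  assign opcode = instr[6:0];
  assign rd     = instr[11:7];
  assign funct3 = instr[14:12];
  assign rs1    = instr[19:15];
  assign rs2    = instr[24:20];
  assign funct7 = instr[31:25];

  // Immediates reassembled per format (all but the U-type are sign extended from instr[31])
  assign imm_i  = {{20{instr[31]}}, instr[31:20]};
  assign imm_s  = {{20{instr[31]}}, instr[31:25], instr[11:7]};
  assign imm_b  = {{19{instr[31]}}, instr[31], instr[7], instr[30:25], instr[11:8], 1'b0};
  assign imm_u  = {instr[31:12], 12'b0};
  assign imm_j  = {{11{instr[31]}}, instr[31], instr[19:12], instr[20], instr[30:21], 1'b0};
endmodule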

Arithmetic Instructions

Now (finally!) we can look at some instructions. This first set is for arithmetic operations. In this context this
means not only basic arithmetic (adding and subtracting) but also shifting, comparison, and bitwise operations.
The table below shows the arithmetic operations for RV32I that have the R-type format.

Since these are R-type instructions, there are two source register indexes (rs1 and rs2) and a destination register
index (rd). The values for the opcode, funct3 and funct7 are shown for each unique instruction. The description
shows the operation for each instruction, where a register index in brackets means the actual register, such as
x17. It should be noted at this point that the destination register index can be the same as one of the source
register indexes and, indeed, the two source register indexes could also be the same. This is perfectly legal.

The first two operations are add (add) and subtract (sub) and the values of the two source registers are added or
subtracted and placed in the destination register. For these two operations, whether the two source values are
interpreted as signed or unsigned does not affect the result and is the same operation.

The next two instructions are comparison instructions: ‘set if less than’. The signs of the values do matter here,
so there is a signed (slt) and unsigned (sltu) version of the instructions. If the rs1 register is less than the rs2
register, then the rd register is set to 1, else it is set to 0.

The next three instructions are for logic bit manipulation, with ‘and’, ‘or’ and ‘xor’ operations. Since these
instructions just do logic operations across the 32 bits, the sign of the source values is not important.

The last three operations are for shifting values. In the case of these instructions the value in the rs1 indexed
register is to be shifted by an amount dictated by the value in the rs2 indexed register (and the result written to
rd). Only the bottom 5 bits of the value in rs2 are used since 31 is the maximum amount that can be shifted. The
sign of the rs1 value does matter, and the first two instructions are for unsigned operations to shift left logical
(sll) and shift right logical (srl). A shift right arithmetic (sra) instruction uses a signed value interpretation of
the rs1 value, and will shift right, preserving the sign of the value—that is, the top bit value of the number being
shifted, whether 0 or 1, is used to fill the top bits above the shifted value.
Having looked at the R-type arithmetic instructions, there are versions of these exact same instructions using the
I-type format. The table below lists these.

For I-type instructions, the major difference is that, instead of an rs2 index to fetch a value from a register, the
encoded immediate value is used in its place. The instructions have the same name as the R-type instructions
but are suffixed with an ‘i’ (e.g., addi). In all other respects these perform the same operations as for the R-type
instructions. The eagle-eyed amongst you will have noticed that there is no subtract instruction in this list. Since
the add immediate instruction is a signed operation, subtraction is performed by just making the immediate
value negative.

These few operations, then, are the main data manipulation instructions for RV32I, barring the U-type which we
will get to later. Hopefully, you can see that these operations are simple and map easily to the kinds of
logic you would see in any digital design, with just adds, shifts, comparisons and logic operations.
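
As a rough illustration of how directly these map to logic, below is the kind of combinatorial block that could implement the register-register operations, keyed off funct3 and bit 5 of funct7 (which separates sub from add and sra from srl). The module and signal names are mine; the immediate versions would reuse the same logic with the second operand taken from the decoded immediate instead of rs2.

module rv32i_alu
(
  input      [31:0] a,       // value read from the rs1 indexed register
  input      [31:0] b,       // value from the rs2 indexed register (or the immediate)
  input      [2:0]  funct3,
  input      [6:0]  funct7,
  output reg [31:0] result   // value to be written back to the rd indexed register
);
  always @(*)
    case (funct3)
      3'b000:  result = funct7[5] ? a - b : a + b;                   // sub : add
      3'b001:  result = a << b[4:0];                                 // sll
      3'b010:  result = ($signed(a) < $signed(b)) ? 32'd1 : 32'd0;   // slt (signed compare)
      3'b011:  result = (a < b) ? 32'd1 : 32'd0;                     // sltu (unsigned compare)
      3'b100:  result = a ^ b;                                       // xor
      3'b101:  if (funct7[5])
                 result = $signed(a) >>> b[4:0];                     // sra (sign preserving)
               else
                 result = a >> b[4:0];                               // srl
      3'b110:  result = a | b;                                       // or
      default: result = a & b;                                       // and
    endcase
endmodule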

Memory Access Instructions

In the first article it was mentioned that a characteristic of RISC architectures was that memory loads and stores
were separated from data manipulation, and RISC-V is no exception. The instruction set, for RV32I, provides
access to memory for reading and writing bytes (8-bits), half-words (16-bits), and words (32-bits). The tables
below show the load and store instructions:
As stated, there are load instructions for byte (lb), half-word (lh) and word (lw). An address to read from
memory is formed by adding (with signed arithmetic) the immediate value to the value in the register indexed
by rs1. A memory read is done to this address and the returned value is placed in the register indexed by rd. For
lb and lh the returned value is sign extended to make a 32-bit word (the lw read data is already 32 bits).
Unsigned versions of lb and lh are provided, lbu and lhu, that do not sign extend the returned value.

The store instructions write bytes, half-words, and words to memory with sb, sh and sw. The address is
formed in the same way as for loads, but the value to be written is taken from a second register indexed by rs2.
Since the destination for a store is memory, there is no rd index, and no register values are updated. There are
no signed/unsigned differentiated instructions for stores, as the sb and sh are reducing, not expanding, the
values, with the bottom 8 or 16 bits in the rs2 indexed register used.
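
A sketch of the address formation and load extension described above might look like the following. To keep it short it assumes the memory sub-system has already placed the addressed byte or half-word in the least significant bits of the returned word, so the byte-lane selection from the low address bits is omitted, and the names are mine.

module rv32i_load_sketch
(
  input      [31:0] rs1_val,    // base address from the rs1 indexed register
  input      [31:0] imm,        // sign-extended 12-bit immediate
  input      [2:0]  funct3,     // 000=lb 001=lh 010=lw 100=lbu 101=lhu
  input      [31:0] mem_rdata,  // data returned from memory
  output     [31:0] addr,       // address sent out on the data port
  output reg [31:0] load_result // value written back to the rd indexed register
);
  // The same address calculation serves both loads and stores
  assign addr = rs1_val + imm;

  always @(*)
    case (funct3)
      3'b000:  load_result = {{24{mem_rdata[7]}},  mem_rdata[7:0]};  // lb: sign extend byte
      3'b001:  load_result = {{16{mem_rdata[15]}}, mem_rdata[15:0]}; // lh: sign extend half-word
      3'b100:  load_result = {24'b0, mem_rdata[7:0]};                // lbu: zero extend byte
      3'b101:  load_result = {16'b0, mem_rdata[15:0]};               // lhu: zero extend half-word
      default: load_result = mem_rdata;                              // lw: full word
    endcase
endmodule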

PC Update Instructions

For RV32I, the PC updating instructions are classed into two types: those that change the PC on a condition (a
branch), or those that unconditionally change the PC (a jump). These have the instruction format types B-type
and J-type respectively. The tables below show these instructions:

For the branch instructions the values from two registers (indexed by rs1 and rs2) are compared. If the
condition is met, then the PC is updated by adding (with signed arithmetic) the instruction’s immediate value to
the current PC value. If it’s not met, then the PC increments as normal (4 is added for RV32I). There are four
conditions for branches:

• Equal
• Not equal
• Less than
• Greater than or equal

The instructions beq, bne, blt, and bge implement branches on these conditions. Since the sign of a value
matters for the last two conditions, there are unsigned versions of blt and bge—namely bltu and bgeu.

For the jump instructions, RISC-V defines a jump-and-link approach. That is, the address of the instruction after
the jump instruction is saved so that it can be used later on to return flow back to the instruction just after the
jump. There are two jump and link instructions, a direct jump instruction (jal) and an indirect jump instruction
(jalr). For the direct jump, the current PC + 4 is saved to the register indexed by rd, and the sign extended
immediate value is added to the current PC. For the indirect jump, the PC + 4 is saved to rd and the sign
extended immediate value is added to the value in the register indexed by rs1, and the PC set to this result, but
with the least significant bit forced to 0 if set (instructions can’t start at an odd byte address—but can on a half-
word address if compressed instructions are supported, hence only the bottom bit is forced to 0).
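
The branch conditions and jump targets described above might be computed along the lines of the sketch below. The funct3 values are the B-type encodings, the imm input is assumed to be the appropriately sign-extended immediate for the instruction, and the control inputs (is_branch, is_jal, is_jalr) are assumed to come from the decoder; the names are illustrative only.

module rv32i_next_pc
(
  input  [31:0] pc,                 // address of the branch or jump instruction
  input  [31:0] rs1_val, rs2_val,
  input  [31:0] imm,                // sign-extended B-, J- or I-type immediate
  input  [2:0]  funct3,             // 000=beq 001=bne 100=blt 101=bge 110=bltu 111=bgeu
  input         is_branch,          // conditional branch (B-type)
  input         is_jal,             // direct jump and link
  input         is_jalr,            // indirect jump and link
  output [31:0] next_pc,
  output [31:0] link_value          // written to rd for jal/jalr
);
  reg taken;
  always @(*)
    case (funct3)
      3'b000:  taken = (rs1_val == rs2_val);                   // beq
      3'b001:  taken = (rs1_val != rs2_val);                   // bne
      3'b100:  taken = ($signed(rs1_val) <  $signed(rs2_val)); // blt
      3'b101:  taken = ($signed(rs1_val) >= $signed(rs2_val)); // bge
      3'b110:  taken = (rs1_val <  rs2_val);                   // bltu
      default: taken = (rs1_val >= rs2_val);                   // bgeu
    endcase

  wire [31:0] rel_target  = pc + imm;                          // branches and jal
  wire [31:0] jalr_target = (rs1_val + imm) & 32'hfffffffe;    // bottom bit forced to 0

  assign next_pc    = is_jal               ? rel_target  :
                      is_jalr              ? jalr_target :
                      (is_branch && taken) ? rel_target  :
                                             pc + 32'd4;
  assign link_value = pc + 32'd4;                              // saved return address
endmodule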

Other Instructions

The last set of instructions for RV32I tidies up some data manipulation and adds a small number of ‘system’
instructions. The tables below show these remaining instructions:

The ‘load upper’ instructions are really data manipulation instructions, but they use their own special format
and are a means to an end, rather than true data manipulation. Since, for the 32-bit RV32I specification,
addresses are 32-bits and instructions are also 32-bits, one can’t construct a single instruction to load a full 32-
bit address into a register using immediate bits. What can be done is to load a large portion of an address into
the upper bits (bits 31 down to 12) and then use an addi instruction to construct the lower bits (11 down to 0).
One could use the shift left immediate instruction (slli) but, as this can only do 12 bits at a time, would take
three instructions instead of two. There are two flavours of ‘load upper’ instructions. The ‘load upper
immediate’ (lui) simply loads the instruction’s immediate value into the upper bits of the destination register,
indexed by rd, and clears the lower 12 bits. The ‘add upper immediate to PC’ (auipc) instruction constructs
upper address bits relative to the current PC value (i.e., the address of the auipc instruction). Here imm[31:12],
concatenated with 12 bits of 0, is added to the PC value and placed in the destination register, indexed by rd.

The system instructions are fairly straightforward. The ecall and ebreak instructions force an exception
condition. In the specification we’re following, they basically do the same thing. If vectored exceptions
(mentioned in the first article, but more later) are supported and enabled, they will jump to different offsets from
the base exception handler address. For reference, the ecall would normally be used to break from a running
process back into the operating system code for service. The ebreak is involved with debugging, where the
exception would jump into debugging software—a break point in the code. The mret instruction (machine level
return from exception) actually sits outside the RV32I specification and is a privileged instruction. Since we are
going to always run in the top-level machine privilege mode, this is a valid instruction, and is vital to exception
handling. This instruction updates the PC to the address where an exception occurred, which was saved when the
PC was updated to the exception base address (plus any offset). This saved address could have been where, say,
an ebreak instruction was, or an illegal instruction, or even the instruction that was interrupted by an external
event. The address is saved in a CSR register, which is the subject of the next section. So mret is called at the
end of an exception handler to return flow back to normal.
Lastly, fence is used for memory and I/O access ordering. If a system is allowed to reorder memory accesses
(for efficiency’s sake) then, for certain operations this can cause hazards between different harts or other
external cores or devices that read and write memory or I/O. The fence instruction forces all memory or I/O
accesses that were sent (from this core and hart) to have completed before continuing. For our purposes, with a
single hart and core, and no memory sub-system with access re-ordering, it actually will not need to do anything
and is a ‘no operation’.

These are all the instructions we need for RV32I compliance. It is possible to construct a processor that processes
just these instructions but, as I’ve said before, a practical solution will have control and status registers,
specified by the Zicsr extension, and this introduces a few more instructions to manipulate these (though,
luckily, not many).

Control and Status Registers


In the first article, some example register sets were looked at and included among them were control and status
registers (CSRs) but in the RISC-V registers we looked at earlier there were no control and status registers.
RISC-V does things slightly differently (than processors I’ve been familiar with).

RISC-V defines a separate 4K word space in which various CSRs are mapped. This space is completely
separate from the normal address space accessed with load and store instructions, which cannot access the CSR
registers. New instructions (as part of the Zicsr extensions) are required to get to this space. In theory, then, we
can have up to 4K 32-bit CSR registers. In the current specification a little over 200 registers are defined but,
thankfully, most of these are optional. Before we look at those that are required, the register access instructions
need defining.

CSR Access Instructions

The table below shows the six instructions for accessing CSR registers. There are actually only three basic
operations, with register and immediate versions of these.

The three operations are to read-and-write a register, read-and-set a register, or read-and-clear a register. For the
register type instructions, the write, set, or clear is done from a value in the general-purpose register indexed by
rs1. The CSR index field identifies where within the 4K word space the operation is performed, and the rd field
indexes the general-purpose register where the register’s old value will be written. Note that, if the old value is
of no interest, rd can index x0.

The immediate versions of the instructions are similar, but the 5-bit field is unsigned and can, obviously, only
affect the bottom five bits of the register being accessed, though many of them have active bits higher than this.
This is still useful as many of the CSRs only have the first few bits defined with sub-fields.
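
A sketch of the read-modify-write behaviour of these six instructions is shown below, with bit 2 of funct3 selecting the immediate forms and bits 1 to 0 selecting write, set, or clear. For brevity it ignores the rule that csrrs and csrrc with rs1 set to x0 must not cause a write side effect, and the module and signal names are mine.

module csr_modify_sketch
(
  input      [31:0] csr_old,   // current value of the addressed CSR
  input      [31:0] rs1_val,   // value from the rs1 indexed register
  input      [4:0]  uimm,      // 5-bit immediate from the instruction
  input      [2:0]  funct3,    // 001/010/011 = csrrw/csrrs/csrrc, 101/110/111 = immediate forms
  output     [31:0] rd_value,  // old CSR value returned to the rd indexed register
  output reg [31:0] csr_new    // value written back to the CSR
);
  // Immediate forms use the zero-extended 5-bit field in place of rs1
  wire [31:0] operand = funct3[2] ? {27'b0, uimm} : rs1_val;

  assign rd_value = csr_old;

  always @(*)
    case (funct3[1:0])
      2'b01:   csr_new = operand;             // csrrw / csrrwi: write
      2'b10:   csr_new = csr_old |  operand;  // csrrs / csrrsi: set bits
      default: csr_new = csr_old & ~operand;  // csrrc / csrrci: clear bits
    endcase
endmodule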

Registers
Of the 200 or so possible control and status registers, only a handful are required. Even within this requirement,
some may be defined as fixed at 0. The sub-set we shall look at comes from a real-world example.

In October 2021 Intel released a softcore RISC-V processor as part of its NIOS range for its FPGAs—the NIOS
V/m softcore processor. This meets the RV32IA specifications (and implied Zicsr extension), where the A
extension is for atomic operations (we will discuss standard extensions later). As part of this softcore
it has a set of control and status registers as listed in the table below with their names and offsets within the
CSR space.

Two things to note about this table. Firstly, the register names in green indicate the registers that could legally be
set to 0, though this does not imply that they are for the NIOS V/m processor. Secondly, my understanding of
the specification is that the mscratch register is also a requirement, at offset 0x340.

Concentrating, for a moment, on the registers that are black, these are pretty much all involved with exception
handling. Since there is a section on this subject below, I just want to summarise the other registers here, and
deal with the exception functionality in the later section.

The mscratch register mentioned earlier is just a 32-bit CSR that can be written (or set, or cleared) with a value
and read back. It has no defined function and is there as a place to hold a useful value for the
software. The misa register (if not 0) defines which extensions are implemented. Bits 0 to 25 map to the letters
A to Z, and when set means that extension is implemented. The NIOS V/m would then have bit 0 (A for atomic)
and bit 8 (I for integer) set. It also has two bits (31 down to 30 for RV32I) to define the architecture size (RV32,
RV64 or RV128), with 01b being RV32. These bits can be writable if an implementation can support multiple
architecture sizes, for example where an RV64 architecture can be set to operate as RV32. If an implementation
can’t do this, writing has no effect.

The four ‘id’ registers are read-only, and (if not 0) give information about the device. The vendor ID
(mvendorid) is a unique ID, obtained from RISC-V International, for each manufacturer of cores. The marchid
specifies the microarchitecture. For open-source projects these are also allocated by RISC-V International and
have the top bit clear. A vendor may set this number, but the top bit must be set. The mimpid register is for
vendors to set a version number. The mhartid register is a read-only register indicating which hart is currently
running the code. As has been mentioned before, an implementation can have just 1 hart, and its ID must be 0.
For multiple hart devices one of the harts must be ID 0 though the others can be any valid number. For our
purposes, with 1 hart, mhartid will always be 0.
Timer Registers

In the first article, the simplified SoC diagram had a timer that could generate interrupts and it was indicated
that this was needed for multi-tasking operating systems to schedule the swapping of processes and threads. The
RISC-V specification defines two timer registers but these are deliberately defined outside of the CSR space
and are mapped into the normal memory address space. The addresses within the memory map are not defined.
The reason for this is that timers often run from real-time clocks and may even be external devices. Running a
timer from the core’s clock might be possible if it is a fixed clock rate, but many implementations have variable
clock rates for power saving when loaded more lightly. The specification doesn’t dictate the tick rate for a timer
(though it must be possible to derive what it is), but it must be constant and it must be possible to derive real-
time from its value.

The timer registers are 64 bits, though for RV32 these are split into high and low registers of 32-bits each and
mapped to consecutive addresses. There are only two timer registers; mtime and mtimecmp. The first counts at
the constant tick rate and will wrap if it overflows. The second is a comparison value and will set an interrupt
request if its value is less than, or equal to, the mtime register. Whether this causes an exception depends on
whether it has been enabled or not.
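
A sketch of how that comparison might be realised is shown below, assuming a constant-rate tick input from whatever clock source the implementation chooses; the names are mine and the memory-mapped access to mtime and mtimecmp is omitted.

module mtimer_sketch
(
  input         clk,
  input         rst_n,
  input         tick,       // constant-rate timer tick
  input  [63:0] mtimecmp,   // comparison value (written via its memory-mapped location)
  output        mtip        // timer interrupt pending
);
  reg [63:0] mtime;
  always @(posedge clk or negedge rst_n)
    if (!rst_n)
      mtime <= 64'd0;
    else if (tick)
      mtime <= mtime + 64'd1;     // wraps naturally on overflow

  // Interrupt requested when mtimecmp is less than, or equal to, mtime
  assign mtip = (mtime >= mtimecmp);
endmodule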

RISC-V Exception Handling


Exception handling is closely tied to the CSR registers. The first functionality required is to enable or disable all
interrupts and events (that aren’t error conditions). This global interrupt enable is controlled from a bit in the mstatus
register (MIE at bit 3). This is the machine interrupt enable bit. There is a supervisor interrupt enable bit (SIE)
but we are not supporting this mode. If interrupts are enabled and an event occurs, the MIE bit state is saved to
another bit (MPIE at bit 7) and the MIE bit cleared. This is so the interrupt or event does not keep causing a new
exception. When the exception code returns (by executing an mret instruction) the MPIE value is copied back
to MIE. There is one more field to mention in mstatus, and that is MPP (machine previous privilege at bits 12
down to 11). This is used, if multiple privilege levels are supported, to save off (and then restore) which level was
active at the point of the exception. For our purposes this is a fixed value for machine privilege (11b). There are
other fields in the mstatus register to do with supervisor and user modes which we will not concern ourselves
with here and can set to 0 in an implementation.
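
The MIE/MPIE behaviour just described might be sketched as below. The control inputs (an exception being taken, an mret retiring, and a CSR write to mstatus) are assumed to come from elsewhere in the core, the reset value of MPIE is an implementation choice here, and the names are mine.

module mstatus_bits_sketch
(
  input         clk,
  input         rst_n,
  input         exception_taken,     // an interrupt or internal exception is being taken
  input         mret_executed,       // an mret instruction is retiring
  input         csr_wr_mstatus,      // a Zicsr instruction is writing mstatus
  input  [31:0] csr_wdata,
  output reg    mie,                 // mstatus.MIE (bit 3)
  output reg    mpie                 // mstatus.MPIE (bit 7)
);
  always @(posedge clk or negedge rst_n)
    if (!rst_n) begin
      mie  <= 1'b0;                  // interrupts disabled out of reset
      mpie <= 1'b0;
    end
    else if (exception_taken) begin
      mpie <= mie;                   // save the current enable state
      mie  <= 1'b0;                  // block further interrupts during the handler
    end
    else if (mret_executed) begin
      mie  <= mpie;                  // restore the saved enable state
      mpie <= 1'b1;
    end
    else if (csr_wr_mstatus) begin
      mie  <= csr_wdata[3];          // software updates via the CSR instructions
      mpie <= csr_wdata[7];
    end
endmodule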

Having defined a global interrupt enable, there are finer controls of some different classes of exception
controlled from the machine interrupt enable register (mie). There are both machine and supervisor control bits
but we shall limit ourselves to just the machine bits:

• MSIE: bit 3: software level interrupt enable
• MTIE: bit 7: timer interrupt enable
• MEIE: bit 11: external interrupt enable

The software level interrupts are generated by software writing to some location to raise this type of exception.
The mechanism is not specified but writing to a bit in an implementation specific register could be one way.
The timer and external interrupt types are, I hope, self-explanatory. There is an equivalent read-only pending
interrupt register (mip) with the same bit field definitions that indicates, when set, if any of these three types of
interrupts are active.

As mentioned before, when an exception occurs, the program counter (PC) is set to some fixed address with,
possibly, an offset depending on the nature of the exception. The address that the code will jump to is defined in
the machine trap vector base address register (mtvec). Since the address will be 32-bit aligned, only bits 31
down to 2 define the address, with bits 1 down to 0 implied as 00b. Instead, these bottom two bits define
whether all exceptions will jump to the defined address (non-vectored, when 00b) or whether they will jump to
an offset from this address depending on the exception type (vectored, when 01b). The offset is calculated as 4
× the ‘cause number’ (which we will look at shortly). It has also been mentioned that when an exception is
taken, the value of the PC at that point is saved. This is done into the machine exception program counter
register (mepc), which can also be updated with the CSR instructions. The machine trap value (mtval) register
may also be updated on taking an exception with implementation specific information to aid an exception
handler, or it may be set to 0, and can also be updated with CSR instructions.

The final exception CSR is the mcause register. When an exception occurs, and the PC is updated to the
exception address, the cause of the exception is noted in the mcause register. It is this value that determines the
offset taken from the mtvec address if vectored exceptions are enabled. The top bit of the mcause value indicates
whether it is an interrupt (the bit is set) or an internal exception (bit clear). For the interrupt case only the three
classes of interrupt mentioned before (software, timer, and external) are supported, but separately for the machine,
supervisor, and user privilege modes. The exception codes just for the machine level exceptions are shown
below (with bit 31 implied as set to 1):

• 3: machine software interrupt
• 7: machine timer interrupt
• 11: machine external interrupt

For the internal exceptions (with top bit at 0), the list of codes is shown below, and is applicable for all privilege
modes:

• 0: Instruction address misaligned
• 1: Instruction access fault
• 2: Illegal instruction
• 3: Breakpoint (ebreak instruction)
• 4: Load address misaligned
• 5: Load access fault
• 6: Store address misaligned
• 7: Store access fault
• 8: Environment call from U-mode (ecall instruction)
• 9: Environment call from S-mode (ecall instruction)
• 10: Reserved
• 11: Environment call from M-mode (ecall instruction)
• 12: Instruction page fault
• 13: Load page fault
• 14: Reserved for future standard use
• 15: Store page fault

Of these codes, 8 and 9 are for user and supervisor privilege modes so we can skip these. The access faults refer
to some illegal addressing, for example reading an instruction from memory marked as having no execution
permissions. These would normally be flagged by external signalling from the memory sub-system. The page
faults would also be flagged externally by the memory sub-system when, say, there is no mapping of a physical
page for the virtual address being used (see my article on virtual memory). The access and page faults need not
concern anyone until interfacing a core implementation to a memory sub-system.

The ones that are of interest are, firstly, the illegal instruction exception, where either the instruction does not decode to a
legal value, or has a sub-field with an illegal value, etc. Secondly, the address misaligned
exception, for instruction, load, and store, will flag if a calculated address was bad, such as a word access being
aligned to an odd byte. The breakpoint code is for when the ebreak instruction is executed, and the
environment call code is for when the ecall instruction is executed.
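
To tie the trap CSRs together, below is a sketch of how the new PC and the values saved to mepc and mcause might be formed on taking an exception. Note that the privileged specification only applies the vectored offset to interrupt causes; synchronous exceptions always use the base address. The names are mine.

module trap_entry_sketch
(
  input  [31:0] mtvec,         // base address, with the mode in bits 1:0
  input  [31:0] epc,           // PC at the point the exception was taken
  input  [31:0] cause,         // bit 31 = interrupt, lower bits = cause code
  output [31:0] trap_pc,       // address loaded into the PC
  output [31:0] mepc_value,    // value saved into mepc
  output [31:0] mcause_value   // value saved into mcause
);
  wire [31:0] base     = {mtvec[31:2], 2'b00};
  wire        vectored = (mtvec[1:0] == 2'b01);

  // Vectored interrupts jump to base + 4 x cause; everything else uses the base
  assign trap_pc      = (vectored && cause[31]) ? (base + {cause[29:0], 2'b00}) : base;
  assign mepc_value   = epc;
  assign mcause_value = cause;
endmodule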

We have now been through everything that is required to put together a logic implementation for an RV32I
(+Zicsr) compliant processor core. Before we finish though, for completeness, some other standard extensions
should be summarised to show what other features are of interest to processor functionality.

RISC-V Extensions
On top of the base specification, a set of standard extensions is defined to add further functionality. Additional
extensions are being defined all the time and added to the specification once ratified, but we will look at some
of the original standard extensions to the RV32I base specification. As you may have gleaned, these extensions
usually have a letter associated with them, and an implementation is defined by listing the letters of the
supported functionality (e.g., the NIOS V/m softcore was RV32IA, meaning it was the base plus atomic
instructions). The most common extensions are listed below:

• RV32M adds integer multiplication and division instructions
• RV32A adds atomic instructions to ease multiple processor communication through memory
• RV32F adds single precision floating point instructions
• RV32D adds double precision floating point instructions

The above extensions, together with the base and Zicsr, constitute the RV32G specification, or general-purpose
processor. RV32G is thus shorthand for RV32IMAFD, and this is considered a useful specification for a
general-purpose processor implementation.

Other useful extensions are:

• RV32E drops the regfile registers to just x0 – x15, for logic saving
• RV32C adds 16-bit compressed instructions to save coding space in embedded systems

For RV32E, it was realised that to make RISC-V useful for small embedded (hence the E), low-power, low
frequency applications it would be useful to reduce the number of registers from 32 to 16 to save on gates. If
you refer back to the register set and the ABI names, you will see that the first sixteen registers include at least
one of each type of register found in the upper 16 registers, allowing a calling convention to be maintained for the
RV32E specification.

The compressed instructions are 16-bit ‘abbreviations’ of a sub-set of the full 32-bit instructions, and can be
mapped one-to-one with a full instruction, though often with restricted parameter values (such as register
indexes). The ARM processors have an equivalent with their thumb instruction set, though one difference is that
the ARM enters and exits a thumb mode (and does so on 32-bit boundaries) whereas, for RISC-V, compressed
instructions can be freely mixed with the full 32-bit instructions meaning that 32-bit instructions can be aligned
on 16-bit boundaries and not just 32-bit boundaries, and this must be handled by an implementation with
compressed instruction support.

In all of these descriptions we have restricted the discussion to RV32, but all this is applicable to RV64 as well.
In that case all registers, data sizes and CSRs become 64 bits (though instructions remain 32 bits wide). All the default instruction behaviour then
works on 64-bit values. Because manipulating 32 bits is still useful for a 64-bit implementation, additional data
instructions are included, for 32-bit data manipulation, with RV64 over RV32 (though they are not the same
opcode as their RV32 equivalents, as these are upgraded to 64-bit operations).

Conclusions
This article has introduced the RISC-V ISA specification. In order to limit the size of the discussion, only the
minimum has been reviewed for a compliant implementation; namely, RV32I+Zicsr. As well as being an
informative introduction to a modern RISC processor, it is also going to be used as a basis for actually
implementing a RISC-V core, in logic, to the RV32I specification, and detailed discussion of all optional
features has been either skipped or only mentioned in passing—though I have attempted to make the article a
stand-alone introduction to RISC-V for those who do not wish to go the next steps to an implementation.

This article is just a review and does not constitute a formal specification. The specifications themselves are the
final word on compliant operation and can be freely sourced on the riscv.org website. The RV32I instructions
are defined in volume 1 of the specification and the CSR registers and exception handling operations in volume
2.

The next article will look at what an RV32I logic implementation might look like as an example, and as a
platform for discussing more advanced techniques that could be employed for improved processor performance
(though this can’t be exhaustive), and a final article will look at assembly language programming of a RISC-V,
as this will be both informative and useful for testing any logic implementation.
Processor Design #3: Processor Logic
Simon Southwell | Published Sep 22, 2022

Introduction
In the last article the 32-bit RISC-V base ISA was introduced, along with the Zicsr extensions that are really
needed for a useful implementation. The specific classes of instructions were defined for data manipulation in
and out of internal registers (also defined), memory accesses, and program counter updating. With the addition
of control and status registers and exception handling that they mostly control, we should have enough to
implement a compliant RISC-V processor in logic.

In this article, approaches to the implementation of a processor are discussed, and reference will be made to an
example open-source Verilog implementation that is available on github for those who want to study the code
alongside this discussion. Some of the design decisions will be highlighted, along with alternatives and more
advanced approaches. The aim of the example implementation is to provide a working design with some
common features to make a useful and somewhat efficient core, but in places clarity has taken precedence at the
slight cost of speed and logic size. Where I can, these will be ‘confessed’. Better open-source implementations
exist that will be more efficient in gates, speed or power consumption, or a mixture of these.

Before looking at implementation specifics I want, in the first part of the article, to define what functions the
logic will need to perform. There is a set of operations commonly defined for the basic execution of instructions
by a processor core, with slight variations. The diagram below shows a typical set of these operations.

Starting at the top, without implementation specifics, a processor will fetch an instruction from memory, it will
decode the instruction and read the values of the indexed source registers, it will execute the instruction using
the register values, it will access memory (if a load or store instruction) and it will write the result back to a
destination register (unless a store) or the program counter. The process then repeats, either as the normal flow
of execution or because the PC was updated due to an exception condition that occurred, externally or internally.
In the diagram I have shown some optional steps, where the reading of the source registers might be bundled as
part of decoding, or possibly split out as a separate prefetch, and where memory access is either combined into the
execute step or has its own step. In theory, all of these steps could be done in one big operation, and the reality
is that different implementations have different sets of combined or split steps to meet specific requirements.
Nonetheless, the above diagram lends itself to the start of an implementation design, so let's begin with this.

Design Approaches
When designing a core implementation, as with any design process, a set of conflicting
requirements will shape what approach is taken. Perhaps the engineering problem being solved lends itself to a
large number of tiny processing elements, so the core design should be as small as possible whilst trading away
instruction execution speed (see, for instance, the fascinating bit-serial SERV RISC-V core). Alternatively,
instruction execution efficiency might be key, and a superscalar and deeper pipeline solution is required (more
on these terms later)—the SiFive series 7 processors are dual-issue superscalar, 8-stage pipelined cores. The
truth is, it is likely to be a mixture of requirements—power, size, speed, complexity, time to market—and a
balance between them all. First, though, let's look at just implementing something that functions. The required
functionality from the processor ISA does not change with any of these engineering considerations, and we can
keep things simple to begin with.

Finite-State-Machine Based Core

The diagram above, suggesting various operations for an executing processor, looks very much like a finite-
state-machine (FSM), if a very simple one. If we were only interested in making a functional core, then the
FSM approach is a good strategy. The diagram below adapts the original diagram into a set of matching states
of an FSM.
In its simplest form the FSM can just do exactly what the original diagram showed, with one clock cycle for
each step (ignoring wait states for now). However, we can already improve on that by skipping states when
they are irrelevant for the particular instruction. For example, if the instruction is not a load or store, then the
memory state can be skipped. If the instruction is a store, then the writeback step might be skipped—so long as
the PC can be updated for any exception conditions in the memory state. Indeed, exceptions that are internal
error conditions might occur in the other stages, and paths from these states directly to writeback when these
occur might be another efficiency improvement.

At its simplest, every instruction would take 6 cycles to execute. With some stated improvements, data
manipulation and PC updating instructions can be reduced to 5 cycles. This does not sound particularly
impressive, but this FSM approach has some major advantages. Firstly, it is very simple to understand and
implement and avoids some complex issues of other approaches (as we shall see). It is likely, therefore, to be in
a working state much sooner than other methods, with fewer problems along the way. Many early processors
are multi-cycle instruction execution implementations, as it was complex enough already to design integrated
circuits at that time, and the designers were limited in the amount of logic they could lay down. So, I would
suggest that anyone new to this subject and implementing their first processor core starts with this approach.
Everything else we are about to talk about does not affect the functional operation of the core and is just there to
engineer the efficiency of one or more parameters of an implementation. So, we’ll leave the FSM based
implementation discussion here, as I hope it is self-evident as an approach to the required logic design.
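
As an illustration, a minimal sketch of the control for an FSM based core might look like the following in
Verilog (this is not taken from the example implementation, and the signal and state names are assumed):

  module core_ctrl (
    input            clk,
    input            resetn,
    input            instr_valid,   // the fetched instruction has returned
    input            is_load,
    input            is_store,
    input            mem_wait,      // memory sub-system wait state
    output reg [2:0] state
  );

    // One state per step, with the memory state skipped for instructions
    // that are neither loads nor stores.
    localparam FETCH     = 3'd0,
               RS_READ   = 3'd1,
               DECODE    = 3'd2,
               EXECUTE   = 3'd3,
               MEM       = 3'd4,
               WRITEBACK = 3'd5;

    always @(posedge clk or negedge resetn)
      if (!resetn)
        state <= FETCH;
      else
        case (state)
          FETCH:     if (instr_valid) state <= RS_READ;
          RS_READ:                    state <= DECODE;
          DECODE:                     state <= EXECUTE;
          EXECUTE:                    state <= (is_load | is_store) ? MEM : WRITEBACK;
          MEM:       if (!mem_wait)   state <= WRITEBACK;
          WRITEBACK:                  state <= FETCH;
          default:                    state <= FETCH;
        endcase

  endmodule

The state then enables the fetch, register read, decode, execute, memory and writeback logic in turn.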

Pipeline

The classic step of improving the instruction execution speed of a processor core, or indeed any serial data path
such as a streaming interface, is to pipeline the design, where each step is a stage in the pipeline feeding into the
next step. The diagram below shows the steps ‘unwrapped’ into a pipeline:

Now, in this arrangement, an instruction is fetched by the fetch ‘stage’ in a given clock cycle and the returned
instruction passed to, in this case, the rs prefetch stage. The prefetch stage, in the cycle after the fetch, can start
processing the instruction to retrieve the values in the indexed source registers. In parallel, though, the fetch
stage can get the next instruction. As this continues, the pipeline is quickly filled in all the stages and
instructions are being fetched and fed into the pipeline every clock cycle. So, apart from an initial latency before
the first instruction is finished, the pipeline completes an instruction every cycle. We have now gone from, at its
most basic, 6 cycles per instruction for the FSM to 1 cycle for a pipeline, and the pipeline doesn't look any more
complicated than the FSM; perhaps it is even simpler. Of course, there is a price to pay, and we have just created
a whole set of problems for ourselves that need solving.

Pipeline Hazards
If a processor only ever did memory accesses and data manipulation instructions, where it was never updating a
register that was needed by any of the other instructions in the pipe, then things would be fine. However, it is
possible that an instruction’s destination register is a source register for an instruction that is in a pipeline stage
behind it. Since, during the rs prefetch, an old ‘stale’ value was fetched, the instruction will fail without some
intervention. This is known as a register writeback hazard. Fortunately, the internal registers are only updated in
one stage, the writeback stage, and it is always just writing to one register (rd), so this can be monitored and
compared with the rs indexes for the instructions at each previous stage. If it matches either of the rs indexes,
the source value is replaced with that being written to the destination register. This does mean that the indexes
for the rs values must be forwarded through the pipeline, and the rd index and writeback value routed to each
stage with a comparator and mux to choose between the value from the previous stage or the writeback value
fed-forward.
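
A minimal sketch of this feed-forward selection for one stage might look like the following (the port and
signal names are assumed, not lifted from any particular design):

  module wb_feedfwd (
    input         wb_valid,
    input  [4:0]  wb_rd,
    input  [4:0]  rs1_idx, rs2_idx,
    input  [31:0] wb_value,
    input  [31:0] rs1_val_in, rs2_val_in,
    output [31:0] rs1_val, rs2_val
  );

    // If the register being written back matches a source index at this stage,
    // use the writeback value in place of the stale value read earlier.
    // Writes to x0 are ignored, so index 0 never matches.
    wire fwd_rs1 = wb_valid && (wb_rd != 5'd0) && (wb_rd == rs1_idx);
    wire fwd_rs2 = wb_valid && (wb_rd != 5'd0) && (wb_rd == rs2_idx);

    assign rs1_val = fwd_rs1 ? wb_value : rs1_val_in;
    assign rs2_val = fwd_rs2 ? wb_value : rs2_val_in;

  endmodule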

The second issue we have introduced by going to a pipelined architecture is branch hazards. The default
behaviour has been to keep fetching instructions linearly through memory (the default PC + 4 behaviour for the
RV32I). If a PC updating instruction is processed then, at writeback, the PC is potentially updated to a new non-
linear address and all the instructions in the pipeline behind it are then wrong. Dealing with this hazard is the
domain of branch prediction, which varies in complexity from simply avoiding the problem to quite complex solutions.

Branch Prediction

A whole new article could be written on the subject of branch prediction, so we will limit ourselves to a few
common solutions. Before we look at these, it should be noted that all these schemes make some assumptions
about the nature of the code that runs on a processor. In general, instructions aren’t executed from random
places in memory but have some localisation, the simplest being located in consecutive addresses in memory.
Even when there is a branch, this might be to code that is not far removed from the current PC address, such as
a small local loop. If it were truly random, branch prediction wouldn’t really work.

There are two major categories of branch prediction: static and dynamic. The simpler of the two is static branch
prediction. Here a predetermined action will be taken when processing a PC updating instruction. Below is a list
of three popular strategies.

• Always Take
• Never Take
• Take on negative

The first is to assume that the branch or jump is always taken and to flush the pipeline. When encountering a
branch or jump, the pipeline effectively stops fetching new instructions until all the stages have completed and
any new PC address has been written to the PC. Whilst flushing, the pipeline is 'stuffed' with no-operation
instructions (e.g., addi x0, x0, 0). This completely avoids any branch hazard but means execution time is
increased to flush the pipe, even if the branch would not have been taken. It also requires a partial decode of the
instruction to flag a branch or jump. An improvement might be to pre-calculate the new PC value if taken and
start fetching from there. This is okay for branch instructions and jump-and-link (jal) instructions, as the
potential new PC value is just the current value plus an immediate value from the instruction. The jump-and-
link-register instruction (jalr), though, is also a function of a source register, indexed by rs1, which might be
stale in the register file, with an instruction ahead in the pipe about to update it. Partial decode, addition, and
handling stale register values quickly becomes complicated, potentially creating critical path logic or the need to
add more pipeline stages to relieve this, with the writeback hazards also needing addressing in the new stages.

The next strategy is to never take the branch. That is, we will carry on linearly regardless. The advantage of this
is that no early partial decode is required, and we will be right for a proportion of the time and incur no penalty.
How big a proportion depends on the nature of the code running on the core. The disadvantage is that, when
wrong, the instructions in the pipe must be cancelled by some method—marked as invalid or forced to a "no-
operation" (nop) instruction, etc. Also, if the instruction was a jump, then it could have been known in advance that
it would be taken. This method is perhaps the simplest to implement and is an improvement on the flushing
variant of the always-take method, whilst avoiding the other always-take variant's new PC calculation and the
complexity of fetching from a non-consecutive address.

The last static method I want to mention is take-on-negative. That is, for a branch, the branch is taken if the
new PC is less-than-or-equal-to the current PC (negative) but isn't taken if the new PC is greater than the
current PC. To determine this, only the immediate value needs to be inspected as either negative or not. For
jumps, of course, the branch is unconditionally taken. The complexity of logic for always-take methods still applies.
The reason this might be an improvement is an assumption about the nature of code—particularly compiled
higher level language code—where local loops (such as the for-loops of C or C++, for example) are often executed
multiple times before exiting, making jumping back more likely than jumping forward. By assuming a
backwards branch might be a local loop that is not about to terminate, taking the branch makes the likelihood of
being correct higher. Of course, a backwards jump might not be this situation, and taking it would then be the
wrong decision.

Lastly, in this section, I want to look at dynamic branch prediction. This differs from static branch prediction in
that the decision to take or not take is based on state that is kept at run time, so the decision is not
predetermined. There is only space in this article to look at one simple method for illustration purposes, but this
is a vast subject and there are many solutions of varying complexity, some of which are proprietary and not in
the public domain, I'm sure.

A table is kept of recent branch locations (addresses) along with some state to use in making a decision when the
location is next encountered. The table will be finite, so some method is needed for replacing entries when newly
encountered branch locations are executed. As such, this is like a cache of branch addresses with a least-recently-used
(LRU) replacement algorithm (see my article on caches). As for the static methods, this relies on some locality of code
execution for some period. The state kept is just a (signed) counter that increments towards a maximum value when a
branch is taken (that is, actually taken, not the predicted value), and decrements towards a minimum value when a
branch isn't (actually) taken. For the decision on taking or not taking, if the counter is positive, then the branch
is predicted as taken, and if negative it is predicted as not taken. The number of bits in the counter mustn't be so
small as to add no real prediction, but not so large that it takes too long to respond to changes in direction. In
fact, a two-bit counter is a practical value. The diagram below summarises this:

The starting position can be either at a count of 0 or a count of 1 since no previous information is available, so
it's a 50:50 situation whether to predict one way or the other. This state is kept separately for each branch
address in the table until an old address is swapped out for a newly encountered branch address with initialised
state.
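
A hypothetical sketch of one table entry's counter is shown below, here treating the counter as unsigned with
the top bit giving the prediction, which is equivalent to the signed view described above (names assumed, table
lookup and replacement omitted):

  module bp_counter (
    input  clk,
    input  resetn,
    input  branch_resolved,   // the branch outcome is known this cycle
    input  taken,             // the actual (not predicted) outcome
    output predict_taken
  );

    // Two-bit saturating counter: 0 and 1 predict not-taken, 2 and 3 predict taken
    reg [1:0] count;

    assign predict_taken = count[1];

    always @(posedge clk or negedge resetn)
      if (!resetn)
        count <= 2'd1;                      // start in a weak state
      else if (branch_resolved)
      begin
        if (taken && count != 2'd3)
          count <= count + 2'd1;            // saturate at the maximum
        else if (!taken && count != 2'd0)
          count <= count - 2'd1;            // saturate at the minimum
      end

  endmodule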

As you can see, branch prediction gets messy pretty quickly, but much attention is given to it as it is all about
the efficiency of executing instructions and not pausing in execution due to branching. This is not unlike the
added complexity of caches for instruction and data in memory sub-systems which also can stall a pipeline
waiting for data to be written or read with wait states.

Memory access wait states are also something that needs to be handled within an implementation, and we will
look at this in the example implementation. It is usually done by stalling the pipeline when wait states are
inserted by the memory sub-system. Functionally, this is fairly simple, with the update of output state held
whilst the memory access is pending. However, care needs to be taken in the design not to stall a cycle late
after a wait state is asserted, and using an externally generated signal directly to stall all stages can lead to
critical paths in the pipeline.

Superscalar

With a pipelined architecture, fixing all the hazards it generates, an efficient branch predictor, and a single cycle
memory sub-system, a design could approach running at 1 instruction per cycle. This, then, is the limit of what
we can achieve for a single core, right? Can we do any better? The answer is yes, using a superscalar architecture.

The limiting factor is that the pipeline, once saturated, can't process instructions any faster. By duplicating the logic
within a core, with effectively two (or more) pipelines, instructions can be executed in parallel. The fetching of
instructions from memory now needs to be at a higher bandwidth to match the increased throughput of the core. This
might mean reading over a wider bus (64 bits when the instructions are 32 bits, such as for RV32I) or running the
fetch cycles at a higher clock frequency (or a combination of both). The diagram below shows a simplified two
pipeline superscalar arrangement.
pipeline superscalar arrangement.
Two duplicate sets of stages are present, with the increased bandwidth fetch stage, and a new dispatcher stage in
between. As with the initial move to pipelining, going superscalar creates more problems. If an instruction is
to update a register in one pipeline and the next logical instruction has this as a source, then dispatching it
down the other pipeline in the same cycle creates a race condition. What if, though, the dispatcher had a
following instruction that didn't rely on the result from either of the instructions in front and didn't write back
to any register that was a source for them? Then the dispatcher could send that instruction into the other pipeline
for execution, "out of order". The instruction that was paused can then be sent into the pipeline for execution.
This keeps the pipeline full without changing the result. For example, if the first instruction is addi x1, x0, 1
(put 1 into x1), followed by addi x2, x1, 1 (add 1 to x1 and place in x2), then addi x3, x0, 6 (place 6 into x3),
the last instruction can be executed in parallel with the first, and then the second instruction dispatched. The
result will still be x1 = 1, x2 = 2, and x3 = 6.

Here I have assumed that the pipeline is executing at 100% but, of course, memory accesses can stall a pipeline,
branch prediction may fail and cause flushing, etc. The dispatcher will actually keep tabs on all the stages of all
the pipelines regarding which destination registers are to be updated. It will prefetch later instructions, up to the
depth of the pipeline, in order to look ahead and determine whether any instruction is a candidate for dispatching
out-of-order, so it can keep the pipelines as full as possible. Of course, this is all complicated by branch and jump
instructions—and you thought branch prediction was messy. We are getting to quite advanced features now,
and still are only just looking under the hood, but we must stop here and look at an example implementation.

Example Implementation
To finish this article, I want to review the design of the RISC-V implementation on github to illustrate some of
the topics discussed above. This will not be exhaustive, as much of the detail of the design is given in the logic's
documentation. Not everything we've talked about is implemented, to keep things simple enough to understand,
but enough advanced features are included for a reasonably efficient design. The implementation has the following
specification.

• RV32I + Zicsr + RV32M, configurable for RV32E
• 5-stage pipeline
• 1 cycle for data manipulations and stores, 3 cycles for loads (plus wait states), 3 cycles for branches when taken
(else 1 cycle)
• “Never take” static branch prediction policy
• Single hart

The RV32M and RV32E features, though available, need not bother us here and can actually be configured out.

Top Level

The base RV32I logic is implemented as a 5-stage pipe in 3 separate blocks. A register file (regfile) implements
the internal registers and the program counter. As such it spans both the fetch stage, by providing the PC, and
the writeback phase where registers and the PC are updated. The rs prefetch and decode stages are implemented
in the decoder block, and the execute and writeback phases are implemented in the ALU block (with regfile
register updates in the writeback phase as well). The diagram shows the top-level block diagram arrangement.

The diagram does not show the Zicsr or RV32M blocks. These logically sit in the execute phase and, in this
implementation, they are in separate modules. This is so that the core can be easily configured without these
features for reduced size if those functions are not required. It requires some additional muxing to do this but
appeared not to be on the critical timing path. Otherwise, the additional instructions could have been added to
the ALU and the CSR registers to the regfile blocks. The decoder already partially decodes these instructions
and forwards them to the blocks.
Register File

The regfile block contains the general-purpose registers and PC. It is configurable to use either logic gates or
memory. Which is best depends on the target technology. FPGAs tend to have available memory blocks which,
if not all already allocated, can be used for the registers, saving on consuming logic resources which might be at
a premium in a small device. There is an issue, though, if the memories do not have dual read ports. R-Type
instructions, as discussed in the second article in this series, have two source registers to fetch. Either an
additional pipeline stage for fetching the registers separately can be used (not preferred) or, if resources allow,
two memories can be used. When updating registers during writeback, both memories are updated with the
same value at the same offset. When fetching source registers data, one memory is allocated for rs1 and the
other for rs2. For targeting an ASIC, whether logic or memory is more efficient will depend on the technology.
A dual port memory is more likely to be available for ASICs though.

Whether a memory or logic, the registers themselves are just 31 registers of 32 bits wide. Logic intercepts reads
and writes to x0, so it always reads 0. The regfile will increment the PC by default, unless the writeback stage
indicates an update.
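
For the logic (non-memory) configuration, the essentials might be sketched as the following self-contained
module (port and signal names assumed, simplified to a single write port; the memory-based variant with
duplicated RAMs follows the same pattern):

  module regfile_sketch (
    input             clk,
    input             stall,
    input             wb_valid,
    input      [4:0]  wb_rd,
    input      [31:0] wb_value,
    input             pc_wr,
    input      [31:0] new_pc,
    input      [4:0]  rs1_idx, rs2_idx,
    output     [31:0] rs1_val, rs2_val,
    output reg [31:0] pc
  );

    // 32 entries for simplicity: entry 0 is never written and reads of x0 are
    // forced to zero, so only 31 registers really hold state.
    reg [31:0] x [0:31];

    assign rs1_val = (rs1_idx == 5'd0) ? 32'd0 : x[rs1_idx];
    assign rs2_val = (rs2_idx == 5'd0) ? 32'd0 : x[rs2_idx];

    always @(posedge clk)
    begin
      // Writeback updates rd unless it indexes x0
      if (wb_valid && wb_rd != 5'd0)
        x[wb_rd] <= wb_value;

      // The PC increments by default unless the writeback stage supplies a new value
      if (!stall)
        pc <= pc_wr ? new_pc : pc + 32'd4;
    end

  endmodule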

Decoder

The decoder block extracts the source register indexes and sends them to the regfile for reading the values during
the rs prefetch phase. This is done for rs1 and rs2 regardless of whether the instruction ultimately requires them.
The instruction is then registered for the decode phase. The diagram below shows the main functions within the
decoder logic:
The decoder extracts the immediate bits from the instruction. Since these bits vary according to instruction type,
each type is extracted and then the correct one selected and, where applicable, sign-extended, within the decode
phase. The register values returned from the prefetch phase are routed to one of two decoder output parameters,
a and b, for routing to the ALU. This diagram shows that these returned register values can be overridden by
feedback rd values from the writeback stage, discussed later. As well as rs1 for a, and rs2 for b, these outputs
can be selected to be the PC value (from regfile) or 0 for a, and the trap vector (Zicsr’s mtvec + offset value)
and immediate value for b, depending on the instruction. The diagram shows which type of instructions these
are selected for. Other decoder outputs for the ALU include the PC value forwarded, the immediate value to be
used as a PC offset, the instruction’s rd index value and the set of controls to select the ALU execution, decoded
from the instruction. These are essentially a bitmap of what operations to perform in the execute phase. All
these outputs are registered to place them in the execute phase.
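
The immediate extraction and selection just described might look something like the following, using the
standard RV32I immediate bit positions (the module and signal names are assumed, with the type flags decoded
elsewhere):

  module imm_extract (
    input  [31:0] instr,
    input         is_i_type, is_s_type, is_b_type, is_u_type,
    output [31:0] imm
  );

    // Sign extended immediate for each instruction format
    wire [31:0] imm_i = {{21{instr[31]}}, instr[30:20]};
    wire [31:0] imm_s = {{21{instr[31]}}, instr[30:25], instr[11:7]};
    wire [31:0] imm_b = {{20{instr[31]}}, instr[7], instr[30:25], instr[11:8], 1'b0};
    wire [31:0] imm_u = {instr[31:12], 12'b0};
    wire [31:0] imm_j = {{12{instr[31]}}, instr[19:12], instr[20], instr[30:21], 1'b0};

    // The decoded instruction type selects which immediate is forwarded
    assign imm = is_i_type ? imm_i :
                 is_s_type ? imm_s :
                 is_b_type ? imm_b :
                 is_u_type ? imm_u :
                             imm_j;

  endmodule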

ALU

The ALU is essentially going to take the a and b parameters from the decoder and generate a result on a c
output. This is true for all the data manipulation instructions and the store instruction, whilst memory read data
and the PC are used for loads and jumps respectively. Some additional outputs are generated for memory accesses
and PC updates. The diagram below shows a simplified block diagram for the ALU.
The controls from the decoder select which operation to perform. The load instruction has an alignment
operation on data returned from memory and the jump instruction takes the forwarded PC and adds 4. The rest
of the instructions use a and b, as selected during the decode phase. The result is registered as output c.

In addition to this main ALU operation, the block generates a strobe for updating the pc, with the new PC value
being either the PC input plus the forwarded offset value for branches, or a plus b for jumps (which contain PC
and immediate values). Load and store strobes are generated as appropriate, with an address output calculated
from the forwarded offset value and the b parameter. From this address any byte enables are calculated from the
lower bits. The value to be written for stores is on the c output. Finally, the rd index value is output for the
writeback of the c result to the register file.

In the Verilog implementation the ALU code is not fully optimized for gate count. For example, there are three
RV32I shift instructions: sll, srl and sra. In the code this is handled by calculating the result for each of the
possible instructions and then selecting the required value depending on the particular instruction active. The
code snippet below shows this:
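
(The original snippet is shown as an image; the approach it illustrates is along these lines, where a, b and the
decode flags are assumed to come from the surrounding ALU code.)

  // One result per shift instruction, each inferring its own 32-bit shifter,
  // with the active instruction selecting which result is used
  wire [31:0] sll_res   = a << b[4:0];
  wire [31:0] srl_res   = a >> b[4:0];
  wire [31:0] sra_res   = $signed(a) >>> b[4:0];

  wire [31:0] shift_res = sll_instr ? sll_res :
                          srl_instr ? srl_res :
                                      sra_res;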
Hopefully it is clear how this operates to produce a shift result, but the three innocuous lines with the shifts hide
a lot of synthesized gates, with 32-bit barrel shifters using a fair number of gates. A single signed barrel
shifter might have been employed instead, with a_signed being a 33-bit sign-extended version of a. For right shifts,
a_signed is then used for both srl and sra. For shift left, the input of the barrel shifter is a_signed with bits 31:0
bit-reversed and bit 32 set to 0. The output of the barrel shifter is then extracted and bits 31:0 bit-reversed for the
result. This would result in a single 33-bit barrel shifter and some additional muxing, which ought to consume fewer
resources, but may add to the timing path. If timing is not critical then this is a useful optimisation, but the code
is less clear. Actually, even the shift for aligning data on loads and stores could use the same barrel shifter,
adding additional muxing to the inputs and outputs. Similarly, a maximum of 2 adders/subtractors is needed
(jal and jalr add 4 to the PC, plus the new PC value addition calculation), but actually 4 are used: for adds,
subtracts, the new PC calculation, and PC + 4. In the implementation, clarity and speed were chosen over
resources. All these decisions are a classic compromise between clarity, logic resources and speed.

Hazards and Stalls

In the logic implementation, writeback hazards are dealt with in the decoder and ALU by comparing the rd
index on an active writeback with the rs indexes for that stage. The diagram below shows the logic for selecting
between the input from the previous stage and the fed-back rd value. Note that the writeback logic ensures rd is
0 when there is no active writeback.

Within the register file, if using memory, it takes two cycles before a value written to the register array can be
available at the output. When the decoder reads from the registers, if a write to a matching rs index is currently
in progress, then the write value is forwarded. If the index matches the write of the previous cycle, that held value
is selected; otherwise the value from the register file itself is sent. This is illustrated in the diagram below:
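
A sketch of this bypass around a memory based register file, for one read port, might look like the following
(names assumed, with the x0 handling omitted for brevity):

  module regfile_bypass (
    input         clk,
    input         wb_valid,
    input  [4:0]  wb_rd,
    input  [31:0] wb_value,
    input  [4:0]  rs1_idx,
    input  [31:0] mem_rs1_dout,    // output of the register memory (two cycles behind)
    output [31:0] rs1_val
  );

    // Remember the last write so a read in the following cycle can be bypassed
    reg        last_wr_valid;
    reg [4:0]  last_wr_idx;
    reg [31:0] last_wr_val;

    always @(posedge clk)
    begin
      last_wr_valid <= wb_valid;
      if (wb_valid)
      begin
        last_wr_idx <= wb_rd;
        last_wr_val <= wb_value;
      end
    end

    // Forward the value being written this cycle, else last cycle's write,
    // else the value from the register memory itself
    assign rs1_val = (wb_valid      && wb_rd       == rs1_idx) ? wb_value    :
                     (last_wr_valid && last_wr_idx == rs1_idx) ? last_wr_val :
                                                                 mem_rs1_dout;

  endmodule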
To deal with branch hazards under the chosen never-take policy, whenever a branch (or jump) updates the PC,
the ALU cancels the next instruction it was about to execute and forwards this ‘branch taken’ state to the
decoder to clear its output state, effectively turning the decoder output to a “no-operation” so the ALU will not
do anything with this instruction as well.

Stalling on memory accesses is a matter of detecting a wait state and forwarding a stall signal to all the stages to
hold their state. Global stalls can be a source of critical paths, and faster means might be employed, such as a
shallow FIFO between each stage where an output is held off when the FIFO is full. This keeps the stalling
local to an interface between two stages, rather than global, but is more complicated to implement and costly in
gates—the outputs from some of the stages can have many bits.

Zicsr

The control and status registers are implemented in a separate module and are logic gates rather than an array.
As indicated in the previous article, only a small sub-set of the possible registers will be implemented, similar to
the NIOS V/m, with just a few extra. I won't go over their detailed function, which was discussed in the last article,
but they are mainly concerned with exception handling, with some additional counters.

The Zicsr logic handles both the external exceptions (interrupts, software, and timer) and the internal
exceptions. The interrupt and software exceptions are signals that originate from outside the core, whereas the
timer interrupts are part of the Zicsr logic. The timer can be read and updated from outside the core so that it
can be mapped into the memory address space, as per the specification. The timer can be configured to be a 1μs
real-time count, or just reflect the clock cycle counts in mcycle and mcycleh, which are always implemented.
Since this implementation has a fixed clock, using the cycle counts is valid. If it ever had a variable clock input,
that would not be true.

All the internal exceptions originate from the decoder, either as decoding an ecall or ebreak instruction, or as
an error condition, such as misaligned addresses etc.

The Zicsr block has outputs to update the PC on an exception and also to write to a general-purpose register
when executing a CSR instruction. The decoder partially decodes the CSR instructions and forwards the
parameters to the Zicsr block to complete decoding and access the registers. The system instruction, mret, is
decoded by the decoder block, and a flag sent to the Zicsr block. In this case the decoder sends a “no-operation”
to the ALU as the Zicsr block will handle the instruction. The logic updates the CSR register state and also
updates the PC to the exception return address that was saved in the mepc register when the exception was
generated.

The Zicsr logic also, configurably, adds the retired instruction counter registers (minstret and minstreth) which
are incremented on a pulse from the ALU for each instruction completed.

Conclusions
In this article we have looked at what the minimum logic might be to implement a RISC-V processor, starting
with just a functional approach based around an FSM. From then on, we looked at ways of improving on this
basic design, not to add functionality but to make it operate more efficiently.

Starting with pipelining, the design could suddenly process instructions at a much higher rate, but at the cost of
introducing some hazards, which needed to be solved with writeback feed-forwards and branch hazard handling.
The branch hazards were dealt with using branch prediction algorithms of varying complexity and efficiency,
based on some assumptions about the nature of executing code. This quickly got messy and is only the
beginning of the range of possible algorithms employed. With these methods the design could approach
executing (in the limit) one instruction per cycle. Looking at superscalar architectures, this can be improved
even further but, once again, with added complications. Complex logic is needed for multiple pipelines and
out-of-order execution of instructions without altering the functional results.

Finally, we reviewed an implementation that sits somewhere between the simplest FSM based approach and the
complex solutions already looked at which, hopefully, can give a useful example of an IP core for further
exploration. The review was cursory, and the implementation documentation gives a fuller picture of the design
details for those interested in following this up. Perhaps have a go at implementing a core yourself.

In the next, and final, article in the series we look at the RISC-V assembly language so that the core can be
programmed. This may not be of interest to everyone reading these articles, and so it has been left until last, but
it ought to be informative, and for those wishing to run simulations on the example core (or its accompanying
instruction set simulator) it will allow them to explore executing all the different instructions.
Processor Design #4: Assembly Language
Simon Southwell | Published Oct 3, 2022

Introduction
In this, the last of the four articles on processor design, I want to look at assembly language programming—at
least give an introduction and overview so that an individual can start writing basic programs. Now many
people might say that this is not so relevant these days, and only a specialist subject, as it is likely that any
software being written for a processor to run will use a higher-level language from C through to Java and
everything in between. Whilst that may be true to a large extent, I think it will be useful to know how assembly
language works and to take the first steps from this place to the more machine independent higher level
languages. After all, a compiler just converts the language being used into a set of assembled processor
instructions.

In this document we will, of course, use as a case study the RISC-V assembly language with the gnu toolchain
so that this directly follows on from the previous articles. Links will be provided in a later section for those who
wish to follow up on this. There will also be links to a RISC-V instruction set simulator so that assembled code
can be experimented with and run without the need to have a RISC-V hardware setup.

In the following sections the instruction format will be discussed, mapping directly to the RISC-V instructions,
before going through the layout of an assembly language program and its sections. Then compiling the code to a
program that can actually be run, and getting additional information from this process, will be looked at.
Executable code can also be disassembled to view the assembled code it's made from, and we will see how a
tool from the toolchain can do this for us. To finish, I want to show how to run an executable constructed from
assembly language on the ISS, so interested people can see what's going on and how instructions alter the
processor state.
The provided ISS does have a remote gdb interface so that those who know how to drive this can step through
code. There is no room in this article for details but the documentation with the ISS details how this is achieved.

Assembly Language
I have discussed before that the fundamental code that runs on a processor is machine code. That is, the raw
binary values that make up the instructions, as we saw in article 2 of this series. In the early days of computing,
this was how a program was laboriously entered into a computer, either directly or by creating paper tape with
the codes punched on it. I, myself, remember entering machine code into some programmable test equipment
by setting switches to on or off and then pressing a load button, for a set of instructions listed in a test manual. If
you got any step wrong, you had to reset and start again.

Assembly language was the next step where the instructions were symbolically represented in a source file, and
then “assembled” into the machine code instructions using an “assembler” program. The symbols used are just
the instruction names and registers that were documented in article 2 for RISC-V. So instead of 0x05d00893 we
can write addi x17, x0, 93. I defy anyone to claim that the former is the better choice. So, let’s look at how all
the RISC-V instructions previously discussed can be represented in assembly language.

Instruction Formats

The six instruction formats used for the base instructions of the RISC-V ISA, as introduced in article 2, are each
written in a particular way in assembler code. For the most part these are the natural way one might write them.
The R-Type instructions for register-to-register data manipulation have the following general format: <op> rd,
rs1, rs2, where op is the instruction. Some examples:

• add x3, x1, x2
• sll gp, ra, sp

Note that the first example uses the x register index format, whilst the second uses the ABI names which are, if
you recall, aliases for the registers. Both methods are valid. The I-Type format for immediate forms of the data
manipulation instructions is very similar in format to R-Type: <op> rd, rs1, imm. Here rs2 is replaced with an
immediate value. Some examples:

• ori x3, x1, 0x123
• slti gp, ra, -1

The load instruction is also an I-Type instruction but has a different format, namely: <op> rd, imm(rs1). This
indicates that the rs1 register's value is an address, modified by the immediate value (i.e., added to it—noting
that imm might be negative). The data is read from that address and stored in the destination register rd. Some
examples:

• lw x3, 64(x1)
• lb ra, -4(sp)

Similarly, the store instruction (S-type) uses parentheses to indicate a memory access, and has the format: <op>
rs2, imm(rs1). Again, the rs1 register value is the address for the store, offset by the immediate value, and rs2
indexes the register with the data to be stored. Examples:

• sw x6, 16(x19)
• sh s1, -14(t0)

With branch instructions (B-type) we’re back to a format somewhat similar to I-Type: <op> rs1, rs2, imm. Here
rs1 and rs2 are compared (there’s no destination register) and the immediate value is used to alter the PC
relative to its current value. Some examples:

• bge x12, x31, -16
• bne gp, t6, fail

Note that the second instruction doesn’t have a numeric value. This is a “label”. We will see shortly that, in
assembler, one can put labels next to bits of code and then reference them. The assembler will work out what
value to put for the immediate value when the code is assembled to go to that place in the code next to the label.
The jump instruction jal (J-Type) also uses an immediate offset, though more bits are available for larger jumps,
but the assembly format is still the same and labels can be used: <op> rd, imm. For example:

• jal ra, subroutine

The companion jump instruction, jalr, is actually an I-Type format, for example: "jalr x1, x2, 2000". Finally,
the load-upper-immediate instruction (U-Type) is laid out like J-Type but, again, the number of immediate
bits is different, though in assembler one does not need to worry about this except for the range of legal values
that can be used. E.g., "lui x5, 0x80000".

Example Code
Now we know how to put down individual instructions in this symbolic form, we are ready to write a program.
A short example will serve to see how a program is constructed and some additional details of an assembly
language program will be introduced and discussed. The diagram below shows a simple program.
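
(The listing in the original is an image, and the line numbers quoted in the discussion that follows refer to that
figure. A minimal program in the same spirit, with the details assumed, might look like this.)

  # Example assembly program: read a word from the data
  # section, manipulate it, and set the ISS exit criteria.

          .text
  _start:
          .global _start

          .global main
  main:
          lui    x3, 0x1           # x3 = 0x1000, assumed start of the data section
          lw     x1, 0(x3)         # load the initial word
          xori   x2, x1, -1        # invert all the bits
          sw     x2, 4(x3)         # store the result after the original word
          addi   x10, x0, 0        # pass/fail code (0 = pass)
          addi   x17, x0, 93       # exit code expected by the ISS
          ebreak                   # halt

          .data
  data:
          .word  0x12345678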

In general, you can see the instructions in the middle of the example surrounded by some additional
paraphernalia. At the top of the example are lines starting with '#'. These are comments: anything after the '#'
symbol is ignored by the assembler program, thus allowing ordinary text to be added to give information about
the program by the implementer. The first 'active' line of the program is ".text" on line 7. This is a section marker
declaring, in this case, that what follows is an actual program—known as text in assembly language parlance.
Programs and data are usually separated into different areas of memory. For example, in an embedded system,
the code might be in a ROM whilst the data is stored and retrieved via a RAM. These will be mapped into the
address space at different offsets, so in the code we mark program sections with .text and data sections with .data
(which can be seen on line 39). When assembled, areas marked as text will be put in the program space, and areas
marked data will be put in the data space. The assembler is usually told where the start of these sections is, or will
use a default. Indeed, line 32 loads the address 0x1000 into x3 on the assumption that this is where the data section
starts. Later we will see how to use the label so that this isn't necessary. A program can switch back and forth
between sections in the source code if desired, simply marking the beginning of a section appropriately. There are
various other types of section. A couple of other common ones are '.rodata', which is the same as data but can't be
written to, so may be part of a ROM, and '.bss', which stands for "block starting symbol" for historic reasons but
means an area of reserved uninitialised memory. Where the .data section defines memory with some initial values,
the '.bss' section defines an area that is known at compile time to need reserving but has no set initial value.

After the .text line, at line 10, is a label called '_start'. Labels were mentioned as part of the examples for the
instructions. A label is added as a name which must start with a letter or certain symbols (not numbers though),
followed by a colon (:). The address of the first instruction after a label will be inherited by that label, and thus
the label is an alias for that address. The '_start' label is special though, as it must be present once, and only once,
somewhere in the program (or in one of the source files, if there are several), and is where the program will start
executing from.

Immediately after the '_start' label there is a '.global _start'. This makes the label available externally, to
separately compiled programs and to the compiler and linker. The default is that a label is only visible
within the program file in which it is defined. However, one can use the '.local' directive to explicitly say a label
is local only. Note that the '.global' directive can be on a separate line if desired. Indeed, on lines 13 and 16 the
'.global' directive for a label main is on a separate line and before the label declaration. This is all legal.

We then have the program itself, using the symbolic notation outlined in the last section. This is a simple linear
set of instructions (no branches or jumps in this example) that terminates with an ebreak instruction. The
comments describe what effect each instruction has.

Other Directives
After the program instructions, the data section is declared ('.data') and a label added at the beginning called
'data'. This has a single word added to the section as '.word 0x12345678', placing that value in the first four
bytes of the data section. The '.word' directive defines a 32-bit value, and there are also '.byte', '.half', and '.quad'
for different sized words. A comma separated list of values after the directive will place these in consecutive order
in memory. In addition, a '.string' directive will store the bytes of a string in the data memory, terminated with a
zero, and a '.zero' directive, followed by an integer count, will place that many zero bytes in memory.
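
For example (with illustrative labels and values):

          .data
  vals:   .word   0x12345678, 0xcafef00d    # two 32-bit words, placed consecutively
  bytes:  .byte   1, 2, 3, 4                # four individual bytes
  msg:    .string "hello"                   # the string's bytes plus a terminating zero
  pad:    .zero   16                        # sixteen zero bytes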

In addition to the directives mentioned above in relation to the example, some additional useful ones are worth
mentioning here. The '.equ' directive allows an alias to be defined for a value. E.g., '.equ INVERT, -1'. The
example code uses literal values throughout the code, but if the above were to be defined at the top of the file,
then the -1 of line 23 could be substituted with INVERT, giving a clearer indication of the code’s intent. These,
then, are like labels, aliasing constant data values rather than address locations.

Another useful set of directives is for alignment. As the assembler constructs code, it keeps tabs on the address
location, incrementing for each instruction. It may be that a new section of code or data needs to start on some
boundary, aligned to, say, a page, a double-word or whatever. To enforce this the '.align', '.balign' and '.p2align'
directives are used. The '.balign' directive is unambiguous: it is followed by a byte count (a power of 2) for the
alignment, so '.balign 8' will move the location counter to the next address that falls on an 8-byte boundary (i.e.,
one that has the bottom 3 bits as 0). The alignment directives can have a second argument that sets the value of
the padding bytes. E.g., '.balign 8, 0xff' would set any padded bytes to 0xff. The default is 0x00 if there is no
second argument. The reason there is more than one alignment directive is that the meaning of '.align' varies
between processors: for some targets its argument is a byte count, whilst for others (RISC-V included) it specifies
the number of low order address bits to be zeroed, like '.p2align', so that '.align 3' gives 8-byte alignment.

These directives discussed here are not exhaustive, and the full list of those supported in the RISC-V toolchain
are documented in its assembly programmer’s manual.
Macros
One last couple of directives I want to talk about are '.macro' and '.endm'. These form a pair to define a section
of code that can be inserted in a program. It is possible, when writing code, that one finds oneself writing very
similar code in various places with just perhaps the values being used that are different. A macro can be used to
define a set of operations that can then be placed in an assembler program and substituted with the defined body
of the macro. The macro definition can have arguments so that the code generated is modifiable when
instantiated. An example:
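
(The definition in the original is shown as an image; one along the lines described, with the macro name
assumed, would be:)

          # Set the same byte value into three registers passed as arguments
          .macro SETB reg1, reg2, reg3, val
          addi   \reg1, x0, \val
          addi   \reg2, x0, \val
          addi   \reg3, x0, \val
          .endm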

In the code the above macro can be used anywhere to insert instructions:
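
For instance (registers and value chosen arbitrarily):

          SETB   x5, x6, x7, 0x7f          # expands to three addi instructions here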

In this simple example the macro is defined to set a byte value in three registers. Which registers are not
specified in the macro itself but are given as arguments when the macro is used, along with an argument for the
byte value. The use of the backslash allows an argument to be referenced in the macro body—e.g., \reg1. In
the code, the macro can be instantiated with arguments filled in to define the registers to be updated and the
value to be set in each, by substituting the backslash argument references in the macro definition with those
provided. Used in this way, macros are like defining higher level operations, constructed from the simpler
instructions. Note that this is different from calling a subroutine (a section of code that's called using, say, jal,
and then returned from using the saved return address). In that case there is one copy of the code in memory,
and arguments would be passed by setting predetermined registers with the argument values. Using a macro
places a copy of the code at each instantiation; effectively the macro is expanded to instructions at that location.
When disassembling, the individual instructions would be seen for that part of the code.

Macros, then, are just expanded at assembly time and, therefore, can have other contents such as directives and
definitions and need not be restricted to sections of instructions. In fact, much more is possible in assembly
language than introduced here and I’d encourage you to explore further and try programming yourself.

Pseudo-instructions
Another feature of an assembler is that it can alias instructions to give them more meaningful names. In this and
the previous articles I have mentioned that the instruction 'addi x0, x0, 0' is a no-operation. It is not necessarily
obvious at first glance that this instruction is meant to be a no-operation. It might be better if the defined
instruction set had a separate 'nop' instruction. Well, the assembler has a set of pseudo-instructions to alias the
processor base instructions into more meaningful names. Below is a sub-set of the pseudo-instructions for the
RV32I instruction set.
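
(The table in the original is an image; a few of the standard ones, each mapping one-to-one onto a base
instruction, are:)

  nop                  ->  addi  x0, x0, 0        # no operation
  mv   rd, rs          ->  addi  rd, rs, 0        # copy register
  not  rd, rs          ->  xori  rd, rs, -1       # one's complement
  neg  rd, rs          ->  sub   rd, x0, rs       # two's complement
  seqz rd, rs          ->  sltiu rd, rs, 1        # set rd if rs equals zero
  j    offset          ->  jal   x0, offset       # unconditional jump
  ret                  ->  jalr  x0, ra, 0        # return from subroutine
  beqz rs, offset      ->  beq   rs, x0, offset   # branch if rs equals zero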

I won’t list all the pseudo-instructions here as it would get too long, but there is a set for the Zicsr control and
status register instructions, such as 'csrw csr, rs' which maps to 'csrrw x0, csr, rs'. There are also some that map
to two instructions. A couple of examples are shown below:
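
(The original examples are shown as an image; two common cases are li with a full 32-bit constant and la,
which expand roughly as follows:)

  li x1, 0x12345678    ->  lui  x1, 0x12345         # upper 20 bits of the constant
                           addi x1, x1, 0x678       # lower 12 bits

  la x3, data          ->  auipc + addi             # PC-relative address of the label (or an
                                                    # auipc + load from the GOT when -fpic)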

With all this in mind, we can re-write the example code from before using pseudo-instructions and also use the
label for the data to make the code independent of where that might be located.
Lines 20, 22, 23, 29 and 30 use pseudo-instructions. At lines 20 and 22, the load address (la) instructions are
two-instruction pseudo-instructions. The full list of standard RISC-V pseudo-instructions is given in
volume 1 of the RISC-V specification (Ch. 25).

Compiling Code
At this point we have enough to construct and compile (assemble) a program. In this and the next section I will
be going through the steps needed to compile code and then run it on the ISS. For those who don't want to
take this step just now, you can skip these sections and go straight to the Conclusions section at the end. The
instructions here are all based on using the RISC-V toolchain for Windows 10 and the wyverSemi riscV
project's ISS, which has the example code, discussed above, included in the bundle.

Getting the toolchain


The RISC-V gnu toolchain can be downloaded from here, and needs to be installed in a convenient place. The
default is 'c:\SysGcc\riscv', though it may be installed elsewhere if more convenient. If, during installation, the
“Add binary directory to %PATH%” option was ticked, then the setup is complete after installation. To check that
you have this installed, open a new console and type at the prompt “where riscv64-unknown-elf-as” and you should
get a response something like:

c:\SysGcc\riscv\riscv64-unknown-elf-as.exe

If it wasn’t found something went wrong.

The ISS bundle can be downloaded from here. Unzip this in a suitable folder. The ISS executable rv32.exe uses
Visual C++ redistributable libraries. If you have these installed, then running the ISS with 'rv32 -h' will display a
help message; otherwise it will give an error message. The redistributable installation packages are bundled
with the ISS to install if not already done so. If you have Visual Studio you may compile from the source
directly, which can be found on github.

Compiling
A simple batch file is included with the ISS bundle to compile code, called rv_asm.bat. This can be used with a
single argument to specify a file to assemble to a RISC-V executable called test.exe. This can be used to
compile the provided examples: example.s and example_pseudo.s. E.g., 'rv32_asm.bat example.s'.

I could skip over what this batch file does, but it is worth taking a look at the compilation process. The
toolchain has a two-step process in order to compile an executable. There is the assembler program (riscv64-
unknown-elf-as) and there is a linker (riscv64-unknown-elf-ld). The riscv64-unknown-elf- prefix is there to
uniquely identify the programs for the RISC-V toolchain. Though we have a 64-bit toolchain (riscv64-) it can be
restricted to 32-bit instructions, as we'll see, and it is not specific to a particular RISC-V system or board (hence
unknown-). The elf- indicates that it uses "executable and linkable format" for the compiled code—a very common
format which defines the sections we discussed earlier (data, text, bss etc.), optional symbols to help debug the
code, as well as the actual program and predefined data. There are versions of the assembler and linker without
these prefixes in the toolchain folders, but many developers have multiple toolchains installed and using the
prefixes avoids clashes.

The batch file takes care of the prefix, so let’s pretend we don’t need it and see what the two steps are with all
the fluff removed:

as -fpic -march=rv32i -aghlms=test.list -o test.o example.s

ld test.o -Ttext 0 -Tdata 1000 -melf32lriscv -o test.exe

The 'as' executable assembles the source (example.s) to an “object file”. An object file is an intermediate
compilation point, also part of higher level language flows, where the instructions have all been encoded, but any
reference to absolute addresses is yet to be determined and added to the code. This means that the code is
locatable anywhere in memory, so that the linker can decide where this is when constructing an executable from
one or more object files. It is the linker that will provide the final absolute addresses and complete the
compilation process. The object files will be the same format whether the source code was assembly language
or C or whatever. Thus, the linker is language independent. If an object file created from C++, say, is
disassembled (more later) it will show processor instructions just as if it was created from an assembly program.

Looking at the assembler's arguments, the first is '-fpic' to specify position-independent code. We will talk about
this shortly, but the example code can be assembled without this argument. When compiling position-independent
code, especially from higher level languages like C, the compiler will only use instructions that are relative to the
PC, making the code relocatable within memory. In this toolchain, to aid in doing this, a 'global-offset-table' is
used, which keeps all the offsets required within the code in order to reference instructions for jumps and branches
and for data. In the assembly language there is even a .got section (cf. .data and .text) where this table data is kept.
For our example, this makes little difference.

The '-march=rv32i' argument restricts the assembler to accepting only RV32I base instructions (and Zicsr). For
assembly this means that it will give an error if, say, a multiply instruction from the RV32M specification was
used. The compiler for C (or another language) uses this argument to restrict which instructions it will use to map
the source code, making it compatible with the target processor's specification. A processor that has RV32M
instructions can then have source code compiled, as an example, with '-march=rv32im'.

The next argument '-aghlms=test.list' specifies that a list file is to be generated to show information about what
was assembled. I won’t go into details about what all the letters after the ‘a’ mean, but the example pretty much
turns on everything. When run, a new file 'test.list' is created, as specified, which has a listing of the code, now
with the provisional address locations and machine code and other useful information. The '-o' option specifies
that the output will be in a file 'test.o', and finally the source code is specified. Multiple source code arguments
could be specified for inclusion in the object file, though more usually source files are compiled to objects
separately and combined with the linker (see below).

The linker (ld) takes the 'test.o' object file and generates the executable. It can take multiple object files as input
to create a single output—this is the 'linking' part of the operation, where it links together several objects,
allocating positions in memory for the code and data and finalising absolute addresses. In the example, both the
text and data start addresses are defined with '-Ttext 0' and '-Tdata 1000'. The next argument, '-melf32lriscv',
tells the linker that the objects are 32-bit RISC-V code (as opposed to, say, 64-bit). Finally, the output file is
specified with a '-o' option, in this case test.exe. It is this executable that can be loaded into a RISC-V system's
memory and executed by the processor.

Disassembling
When the example code was compiled, a 'test.list' listing file was produced, giving a lot of information prior to
linking. It may be that there are object files and executables to hand that do not have this listing (perhaps they
were precompiled). We can still get information about them by disassembling them. The 'objdump' (prefix
assumed) executable will do just this. The diagram below shows the example program disassembled using this
program:
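
(The disassembly output itself is shown as an image in the original; the invocation is along these lines:)

  riscv64-unknown-elf-objdump -d -M numeric test.exe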
The '-d' option tells the program to disassemble the text section of the file. If a '-D' option was used instead, it
will attempt to disassemble all sections, including data—which may or may not be useful. The '-M numeric'
option tells the program to display instructions with the registers as the indexed x values. Without it, the ABI
names are used instead. In this example the test.exe executable was disassembled, but the test.o object file could
just as easily have been disassembled. There are plenty more options for displaying data from other sections, the
global symbols, etc. If you execute the program without any arguments, it will list what all these options are.

Running Code
Having generated an executable, we can try running it on the ISS. The diagram below shows an example of
running the test.exe executable on the ISS, this time compiled from the example_pseudo.s code:
The ISS has various options (use 'rv32.exe -h' to list them all), but the ones used here load the ELF executable to
memory ('-t test.exe'), enable run-time disassembly ('-r'), specify halting on ebreak/ecall ('-e'), dump the registers
when complete ('-x') and dump the first 8 words from data memory when complete ('-m8'). If the ABI register
names are required, a '-a' option can be specified.
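
Putting those options together, the invocation is something like this (option order assumed):

  rv32.exe -t test.exe -r -e -x -m8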
Having run the test program, we can now inspect what is in the registers and what is in memory and see if this
matches with expectations. Normally a test program would be self-checking and produce a pass/fail criterion on
completion, such as a 0 in x10 and 93 (0x5d) in x17. This, in fact, is what the ISS is looking for when it exits to
give a pass or fail message. This is copied from the exit pass/fail criteria from the RISC-V International’s unit
tests, which the ISS has run.

More information on the ISS is given in its manual, but one last thing to mention is that the ISS has a model of
a 16550 UART built into it. The upshot of this is that if one writes a byte to the transmit-holding-register
(THR) at address 0x8000000, the ASCII character will be printed to the screen. This, then, gives a means for
the programs themselves to display messages. Why not create some programs, maybe with branches and jumps
included, that also display messages to the screen, perhaps by putting predefined strings in the data section.

Conclusions
In this last article in the series on processor design, assembly language programming has been introduced. The
RISC-V has, once again, been used as a case-study looking at how the instructions presented in article 2 can be
programmed in assembly, along with all the directives for defining text and data sections, as well as macros,
constants and fixed words, bytes, and strings.

The main aim has not been to give a definitive guide to RISC-V assembler but to allow enough of a start for a
reader to get on with trying some programming for themselves to consolidate all that’s been learnt over this
series. For those intrepid explorers who want to get straight on and have a go, the RISC-V toolchain was
introduced to assemble code, along with an ISS to run it and experiment with new instructions and results.

Assembly language is the first step on the software side of processor design, which is a huge subject in itself. In
the future I plan some more articles that bridge the gap between this lower level programming and higher level
languages referencing stacks, heaps, and other memory organisation, as well as calling conventions alluded to
with the RISC-V ABI. I hope this series has been informative, especially for those new to the subject, and that
the pace and amount of information has been judged right to allow an individual to start exploring further for
themselves.
