Fpga Notes April23
Introduction to FPGA
In the last 20 years the majority of DSP applications have been enabled by DSP processors from vendors such as Texas Instruments, Motorola and Analog Devices. A number of DSP cores have also been available: Oak Core, LSI Logic ZSP, 3DSP.
ASICs (Application Specific Integrated Circuits) have been widely used for specific (high volume) DSP applications. But the most recent technology platform for high speed DSP applications is the Field Programmable Gate Array (FPGA). This course is all about the why and how of DSP with FPGAs!
R. Stewart, Dept EEE, University of Strathclyde, 2010
Notes:
DSP is all about multiplies and accumulates/adds (MACs). As we progress through the course, we will see that most algorithms that are used for different applications employ digital filters, adaptive filters, Fourier transforms and so on. These algorithms all require multiplies and adds (note that a divide or square root is quite a rare thing in DSP). Hence a DSP algorithm or problem is often specified in terms of its MAC requirements. In particular, when comparing two algorithms, if they both perform the same job but one requires fewer MACs than the other, then clearly the cheaper one would be the best choice. However this implies some assumptions. One is that the required MACs are the same - but surely a multiply is a multiply! Well, yes: in the traditional DSP processor based situation we are likely to be using, say, a 16 bit device which will process 16 bit inputs, using 16 bit digital filter coefficients etc. With FPGAs this constraint is removed - we can use as many, or as few, bits as are required. Therefore we can choose to optimise and schedule DSP algorithms in a completely different way.
[Diagram: a voltage input sampled by an ADC, processed by a DSP56307 processor, and converted back to a voltage by a DAC.]
Since around 1998 the evolution of FPGAs into the DSP market has been sustained by classic technology progress such as the ever present Moore's law. Late 1990s FPGAs allowed multipliers to be implemented in FPGA logic fabric; a few multipliers per device were possible. Early 2000s FPGAs: vendors placed hardwired multipliers onto the device with clocking speeds of > 100MHz, and with the number of multipliers ranging from 4 to > 500. Mid 2000s: FPGA vendors placed DSP algorithm signal flow graphs (SFGs) onto devices; full (pipelined) FIR filter SFGs, for example, are available. Late 2000s to early 2010s - who knows! Probably more DSP power, more arithmetic capability (fast square root, divide), FFTs, more floating point support. Rest assured, more DSP is coming....
Technology just keeps moving.
Anyone who has purchased a new laptop knows the feeling. If you just wait, then in the next quarter you will get the new model with integrated WiFi or WiMax, or a faster processor. Of course, wait another quarter and in 6 months it will be improved again - and the new faster, better, bigger machine is likely to be cheaper too! Such is technology. DSP for FPGAs is just the same. If you wait another year it's likely the vendors will bring out prepackaged algorithms for precisely what you want to do. And they will be easier to work with - higher level design tools, design wizards and so on. So if you are planning to design a QR adaptive equalizing beamformer for MIMO implementation of a software radio for 802.16 - then if you wait, it will probably be a free download in a few years. But of course who can wait? Therefore in this course, we discuss and review the fundamental strategies of designing DSP for FPGAs. Like all technologies, you still need to know how it works if you really want to use it.
Ver 10.423
We might be tempted to think of the latest FPGAs as repositories of DSP components just waiting to be connected together. Of course the resource is finite and the connections available are finite. In the days of circuit boards one had to be careful about running busses close together, lengths of wires, etc. Similar considerations are required for FPGAs (albeit out of your direct control). However, the high level concept is: take the blocks, and build it:
[Diagram: FPGA building blocks - clocks, input/output, registers and memory, connectors, logic and arithmetic - with a design, verify, place and route flow.]
This is undoubtedly the modern concept of FPGA design. Take the blocks, connect them together and the algorithm is in place. Do we actually need an FPGA/IC engineer then? Do we actually need a DSP engineer? Yes in both cases, but modern toolsets and design flows are such that it might be the same person. There is lots to worry about. In terms of the DSP design: is the arithmetic correct (i.e. overflows, underflows, saturates etc)? Do the latency or delays used allow the integrity to be maintained? For the FPGA: can we clock at a high enough rate? Does the device place and route? What device do we need, and how efficient is the implementation? (Just like compilers, different vendors' design flows will give different results, some better than others.) As vendors provide higher level components (like the DSP48 slice from Xilinx, which allows a complete FIR to be implemented), issues such as overflow, numerical integrity and so on are taken care of.
So with a MAC (multiply and accumulate/add) of two N bit numbers we could, in the worst case, end up with 2N+1 bits wordlength.
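This worst-case growth can be sketched in a few lines of Python (an illustration, not from the notes), using unsigned N-bit operands for simplicity; signed 2s complement behaves similarly:

```python
# Worst-case wordlength growth of a single multiply-accumulate (MAC):
# the product of two n-bit values needs up to 2n bits, and one
# accumulate on top of that needs up to 2n+1 bits.
def mac_worst_case_bits(n):
    big = 2**n - 1                   # largest n-bit unsigned value
    product = big * big              # multiply: up to 2n bits
    accumulated = product + product  # one accumulate: up to 2n+1 bits
    return product.bit_length(), accumulated.bit_length()

mul_bits, mac_bits = mac_worst_case_bits(16)
print(mul_bits, mac_bits)   # 32 33 -- a 32-bit product, a 33-bit MAC result
```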
If the wordlength grows beyond the maximum value you can store we clearly have the situation of numerical overflow, which is a non-linear operation and not desirable. Within traditional DSP processors this wordlength growth is well known and catered for. For example, the Motorola 56000 series is so called because it has a 56 bit accumulator, i.e. the largest result of any addition operation can have 56 bits. For a typical DSP filtering type operation we may require to take, say, an array of 24 bit numbers and multiply by another array of 24 bit numbers. The result of each multiply will be a 48 bit number. If we then add two 48 bit numbers together, and they both just happen to be large positive values, then the result could be a 49 bit number. Now if we add many 48 bit numbers together (and they all just happen to be large positive values), then the final result may have a word growth of quite a few bits. So one must assume that Motorola had a good look at this, and realised it was fairly unlikely that the result of adding these 48 bit products together would ever be larger than 56 bits - so 56 bits was chosen. (Of course if you did have a problem that grew beyond 56 bits you would have to put special trapping into the code to catch this.)
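A common rule of thumb for sizing the accumulator can be sketched in Python (my illustration, not a formula from the notes): each product needs 2n bits, and summing k of them adds up to ceil(log2(k)) guard bits. On that reckoning, the 8 guard bits of a 56 bit accumulator cover sums of up to 256 worst-case 48 bit products:

```python
import math

# Accumulator width needed to sum k worst-case products of two n-bit
# words without overflow: 2n bits per product, plus ceil(log2(k))
# guard bits for the additions (an unsigned-style bound).
def accumulator_bits(n, k):
    return 2 * n + math.ceil(math.log2(k))

print(accumulator_bits(24, 2))    # 49 -- adding two 48 bit products
print(accumulator_bits(24, 256))  # 56 -- e.g. the 56000's accumulator
```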
[Diagram: a 4-bit ripple-carry adder. A3 A2 A1 A0 + B3 B2 B1 B0, with carry-in 0, generates the carries C3 C2 C1 C0 and the sum S4 S3 S2 S1 S0 (S4 is the MSB, S0 the LSB).]
The simple Full Adder (FA): Adds two bits + one carry in bit, to produce sum and carry out
[Diagram: a full adder cell with inputs A, B and Cin, and outputs Sout and Cout. Worked example: 11 + 13 = 24.]
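The full adder and the ripple-carry chain built from it can be sketched behaviourally in Python (an illustration, not from the notes), reproducing the 11 + 13 = 24 example:

```python
# Gate-level sketch of the Full Adder (FA) and a 4-bit ripple-carry
# adder built from it.
def full_adder(a, b, cin):
    s = a ^ b ^ cin                         # sum bit
    cout = (a & b) | (a & cin) | (b & cin)  # carry out
    return s, cout

def ripple_add(a_bits, b_bits):
    """Add two little-endian bit lists [A0, A1, ...], with carry-in 0."""
    carry, sums = 0, []
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        sums.append(s)
    return sums + [carry]                   # S0..S3 plus S4 (final carry)

a = [1, 1, 0, 1]   # 11 = 1011, LSB first
b = [1, 0, 1, 1]   # 13 = 1101, LSB first
s = ripple_add(a, b)
print(sum(bit << i for i, bit in enumerate(s)))  # 24
```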
[Diagram: an array multiplier formed from a grid of cells, combining multiplicand bits a3 a2 a1 a0 with multiplier bits b3 b2 b1 b0 to produce the product bits p7 p6 p5 p4 p3 p2 p1 p0.]
Therefore an N by N multiply requires N^2 cells... so, for example, a 16 bit multiply is nominally 4 times more expensive to perform than an 8 bit multiply.
Each cell is composed of a Full Adder (FA) and an AND gate, plus some broadcast wires:
[Diagram: worked example 11 x 9 = 99, formed by summing the partial products.]
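The partial-product arithmetic the cell array performs can be sketched in Python (an illustration, not from the notes): each partial product is the multiplicand ANDed with one multiplier bit, shifted into place, then summed:

```python
# Shift-and-add (partial product) multiplication, as performed by the
# array of FA-plus-AND-gate cells.
def array_multiply(a, b, n=4):
    result = 0
    for i in range(n):
        bit = (b >> i) & 1            # one multiplier bit
        result += (a * bit) << i      # AND gates form a*bit; adders sum it
    return result

print(array_multiply(11, 9))  # 99, as in the worked example
```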
Designs were produced by interconnecting the gates to form combinational and sequential functions.
The NAND gate is often called the Universal gate, meaning that it can be used to produce any Boolean logic function. The early gate array design flow would be design, simulate/verify, device production and test. Metal layers make simple connections between the gates - for example, to implement Z = AB + CD from inputs A, B, C and D.
Early simulators and netlisters such as HILO (from GenRad) were used. From GA to FPGA. However simple gate arrays, although very generic, were used by many different users for similar systems... for example to implement two level logic functions, flip-flops and registers, and perhaps addition and subtraction functions. For a GA, once a layer (or layers) of metal had been laid on a device - that's it! No changes, no updates, no fixes. So then we move to field programmable gate arrays. There are two key differences between these and gate arrays: they can be reprogrammed in the field, i.e. the logic specified is changeable; and they are no longer composed of just NAND gates, but a carefully balanced selection of multi-input logic, flip-flops, multiplexors and memory.
Arrays of gates and higher level logic blocks might be referred to as the logic fabric...
[Diagram: a generic FPGA logic fabric - an array of logic blocks surrounded by I/O blocks, connected by row and column interconnects.]
The logic block in this generic FPGA contains a few logic elements. Different manufacturers will include different elements (and use different terms for logic block, e.g. slices etc). A simple logic block might contain the following:
[Diagram: a logic element comprising a LUT feeding a flip flop, with interconnects between logic elements.]

Of course the actual contents of a logic element will vary from manufacturer to manufacturer and device to device.
Looking more specifically at recent Xilinx FPGAs, we also find block RAMs and dedicated arithmetic blocks... both very useful for DSP!

[Diagram: an FPGA floorplan with columns of block RAM embedded in rows of logic fabric.]
One of the major features of more recent, DSP-targeted Xilinx FPGAs is the provision of dedicated arithmetic blocks, which offer lower power and higher clock frequency operation than the logic fabric (i.e. the array of CLBs). These can be configured to perform a number of different computations, and are especially suited to the Multiply Accumulate (MAC) operations prevalent in digital filtering. Block RAMs are also used extensively in DSP. Example uses are for storing filter coefficients, encoding and decoding, and other tasks. Despite the inclusion of these additional resources, the logic fabric still forms the majority of the FPGA. We will now look at the CLBs which comprise the logic fabric, and how they are connected together, in further detail.
Xilinx FPGA logic fabric comprises Configurable Logic Blocks (CLBs), which are groups of SLICEs (e.g. 2 or 4 SLICEs per CLB). Signals travel between CLBs via routing resources. Each CLB has an adjacent switch matrix for (most) routing.

[Diagram: a CLB containing slices, with its adjacent switch matrix connecting to other CLBs.]
The example in the main slide features a typical Xilinx FPGA architecture, and the Altera and Lattice architectures are different. Logic units differ in size, composition and name! However, in all cases, their Logic Blocks include both combinational logic and registers, and routing resources are required for connecting blocks together. Continuing with the Xilinx example, the combinational blocks are termed Lookup Tables (LUTs), and in most devices have 4 inputs (some of the more recent devices have 6-input LUTs). These LUTs can be utilised in four modes: To implement a combinatorial logic function As Read Only Memory (ROM) As Random Access Memory (RAM) As shift registers
Over the next few slides, the functionality of LUTs and registers in the above modes will be described.
When used to implement a logic function, the 4-bit input addresses the LUT to find the correct output, Z, for that combination of A, B, C and D.
[Diagram: a 4-input lookup table with inputs A, B, C and D, and the Karnaugh map of the output Z.]
The lookup table can also implement a ROM, containing sixteen 1-bit values. Instead of the four inputs representing inputs of a logic function, they can be thought of as a 4-bit address. A 1-bit value is stored within each memory location, and the appropriate output is supplied for any input address. In this example, A is considered the Most Significant Bit (MSB) and D the Least Significant Bit (LSB), and the output is Z.
A B C D | Z
0 0 0 0 | 1
0 0 0 1 | 1
0 0 1 0 | 0
0 0 1 1 | 1
0 1 0 0 | 1
0 1 0 1 | 0
0 1 1 0 | 0
0 1 1 1 | 0
1 0 0 0 | 0
1 0 0 1 | 0
1 0 1 0 | 0
1 0 1 1 | 0
1 1 0 0 | 0
1 1 0 1 | 0
1 1 1 0 | 1
1 1 1 1 | 1
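A LUT in this mode is just 16 stored bits addressed by the inputs, which is easy to model in Python (an illustration, not from the notes); the contents below are the Z column of the truth table above, with A as the MSB of the address:

```python
# A 4-input LUT modelled as 16 stored bits: inputs A, B, C, D form an
# address (A = MSB), and the stored bit at that address is the output Z.
LUT = [1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

def lut_read(a, b, c, d):
    address = (a << 3) | (b << 2) | (c << 1) | d
    return LUT[address]

print(lut_read(0, 0, 0, 0))  # 1
print(lut_read(1, 1, 1, 1))  # 1
```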
LUTs can also be configured as distributed RAM, in either single port or dual port modes. Single port: 1 address for both synchronous write operations and asynchronous read operations. Dual port: 1 address for both synchronous write operations and asynchronous read operations, and 1 address for asynchronous reads only.
Larger RAMs can be constructed by connecting two or more LUTs together. Dual port RAM requires more resources than single port RAM. For example, a 32x1 Single Port RAM (32 addresses, 1 bit data) requires two 4-input LUTs, whereas an equivalent Dual Port RAM requires four 4-input LUTs.
The two diagrams below demonstrate the implementation of 16x1 single and dual port RAMs in the Virtex II Pro device, respectively. Notice that the dual port RAM requires twice as many resources as the single port RAM.
[Diagrams: 16x1 Single Port and Dual Port RAM implementations. Source: Virtex-II Pro and Virtex-II Pro X Platform FPGAs: Complete Data Sheet, DS083 (v4.7), November 5, 2007.]
A final alternative is to use the LUT as a Shift Register of up to 16 bits. Additional Shift In and Shift Out ports are used, and the 4-bit address is used to define the memory location which is asynchronously read. For example, if the LUT input is the word 1001, the output from the 10th register is read, as depicted below.
[Diagram: a LUT configured as a 16-bit shift register: SHIFT IN feeds a chain of 16 clocked registers, the 4-bit LUT input (e.g. 1001) selects which register output appears on D OUT, and the final register drives SHIFT OUT.]
The slice register at the output from the LUT can be used to add another 1 clock cycle delay. Using the register also synchronises the read operation.
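The behaviour can be sketched with a small Python model (an illustration, not vendor code): each clock shifts a bit in, and the 4-bit address asynchronously selects which of the 16 stages is read:

```python
# Behavioural sketch of a LUT used as a 16-bit shift register (SRL16).
class SRL16:
    def __init__(self):
        self.regs = [0] * 16           # stage 0 (newest) .. stage 15 (oldest)

    def clock(self, shift_in):
        self.regs = [shift_in] + self.regs[:-1]   # shift one bit in

    def read(self, address):
        return self.regs[address]      # e.g. address 0b1001 reads stage 9

srl = SRL16()
for bit in [1, 0, 1, 1]:
    srl.clock(bit)
print(srl.read(0), srl.read(3))  # newest bit, and the oldest of the four
```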
As with the other LUT configurations, larger Shift Registers can be constructed by combining several LUTs. For example, a 64-bit shift register segment can be constructed by combining four 16-bit Shift Registers together, as shown below. The cascadable ports allow further interconnections for larger Shift Registers.
[Diagram: a 64-bit shift register built from four cascaded SRL16s across two slices, with the MSB of each SRL16 feeding the DI input of the next, and optional flip flops (FF) at each output. Cascadable ports allow further interconnections for larger shift registers.]
Registers
The sequential logic element which follows the lookup table can be configured as either: An edge-triggered D-type flip flop; or A level-sensitive latch
The input to the register may be the output from the LUT, or alternatively a direct input to the slice (i.e. the LUT is bypassed).
[Diagram: a slice register fed either by the LUT output or by a bypass input.]
A D-type flip flop provides a delay of one clock cycle, as confirmed by the truth table below (D(t) is the register input at time t, and Q(t+1) is the output 1 clock cycle later). A clock signal and control inputs (set, reset, etc.) are also provided.
D(t) | Q(t+1)
0    | 0
1    | 1
When configured as a latch, the control inputs define when data on the D input is captured and stored within the register. The Q output thereafter remains unchanged until new data is captured. Flip flops and registers are discussed in the Digital Logic Review notes chapter.
Whereas registers can be reset, SRL16s cannot. Therefore, adding reset capabilities to a design has implications for resource utilisation. For example, consider an 8-bit shift register. If resettable, then each element requires a slice register. If not, then an SRL16 can be used.
[Diagrams: a resettable implementation using eight slice registers across Slices 1-4, sharing INPUT, CLOCK and RESET; and a non-resettable implementation using a single SRL16.]
We can still design a resettable shift register with an SRL16, by using a slightly more sophisticated design. Instead of making all elements resettable, we can implement the first element using a slice register, and the subsequent ones using an SRL16. The reset signal is held high for 8 clock cycles, which allows the 0 input to propagate through the shift register. Instead of using 4 slices, this design would require 2 slices at most.
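The flushing idea can be sketched in Python (a simplified behavioural illustration; the real design also uses a resettable slice register for the first element): holding the input at 0 for 8 clocks clears all 8 stages a design uses:

```python
# Reset an SRL16-based shift register by flushing: clock in a 0 on
# each of 8 cycles, so the first 8 stages all read as 0 afterwards.
def reset_by_flushing(regs, stages=8):
    for _ in range(stages):
        regs = [0] + regs[:-1]     # one clock cycle with a 0 input
    return regs

regs = [1] * 16                    # arbitrary pre-reset contents
regs = reset_by_flushing(regs)
print(regs[:8])                    # the first 8 stages now hold 0
```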
[Diagram: a slice register followed by an SRL16. Hold the RESET signal high for 8 clock cycles to reset the shift register.]
The implementation of arithmetic operations in FPGA hardware is an integral and important aspect of DSP design. The following key issues are presented in this chapter: Number representations: binary word formats for signed and unsigned integers, 2s complement, fixed point and floating point; Binary arithmetic, including: Overflow and underflow, saturation, truncation and rounding Structures for arithmetic operations: Addition/Subtraction, Multiplication, Division and Square Root; Complex arithmetic operations; Mapping to Xilinx FPGA hardware ... including special resources for high speed arithmetic
This section of the course will introduce the following concepts: Integer number representations - unsigned, ones complement, twos complement. Non-integer number representations - fixed point, floating point. Quantisation of signals, truncation, rounding, overflow, underflow and saturation.
Addition - decimal, twos complement integer binary, twos complement fixed point, hardware structures for addition, Xilinx-specific FPGA structures for addition. Multiplication - decimal, 2s complement integer binary, twos complement fixed point, hardware structures for multiplication, Xilinx-specific FPGA structures for multiplication. Division. Square root. Complex addition. Complex multiplication.
Number Representations
DSP, by its very nature, requires quantities to be represented digitally using a number representation with finite precision. This representation must be specified to handle the real-world inputs and outputs of the DSP system. Sufficient resolution Large enough dynamic range
The number representation specified must also be efficient in terms of its implementation in hardware. The hardware implementation cost of an arithmetic operator increases with wordlength. The relationship is not always linear!
The use of binary numbers is a fundamental of any digital systems course, and is well understood by most engineers. However when dealing with large complex DSP systems, there can be literally billions of multiplies and adds per second. Therefore any possible cost reduction from reducing the number of bits of representation is likely to be of significant value. For example, assume we have a DSP filtering application using 16 bit resolution arithmetic. We will show later (see Slide 42) that the cost of a parallel multiplier (in terms of silicon area - speed product) can be approximated as the number of full adder cells. Therefore for a 16 bit by 16 bit parallel multiply the cost is of the order of 16 x 16 = 256 cells. The wordlength of 16 bits has been chosen (presumably) because the designer at some time demonstrated that 17 bits was too many bits, and 15 was not enough - or did they? Probably not! It's likely that we are using 16 bits because... well, that's what we usually use in DSP processors and we are creatures of habit! In the world of FPGA DSP arithmetic you can choose the resolution. Therefore, if it was demonstrated that in fact 9 bits was sufficient resolution, then the cost of a multiplier is 9 x 9 cells = 81 cells. This is approximately 30% of the cost of using 16 bit arithmetic. Therefore it's important to get the wordlength right: too many bits wastes resources, and too few bits loses resolution. So how do we get it right? Well, you need to know your algorithms and DSP.
bit weightings (MSB to LSB): 2^(n-1), 2^(n-2), 2^(n-3), ..., 2^2, 2^1, 2^0

Example: 0 1 0 1 0 0 1 0
= 0x2^7 + 1x2^6 + 0x2^5 + 1x2^4 + 0x2^3 + 0x2^2 + 1x2^1 + 0x2^0 = 64 + 16 + 2 = 82
The numerical range of an n-bit unsigned number is: 0 to 2n - 1. For example, an 8-bit word can represent all integers from 0 to 255.
Taking the example of an 8-bit unsigned number, the range of representable values is 0 to 255:
Integer Value | Binary
0   | 00000000
1   | 00000001
2   | 00000010
3   | 00000011
4   | 00000100
64  | 01000000
65  | 01000001
131 | 10000011
255 | 11111111
Note that the minimum value is 0, and the maximum value (255) is the sum of the powers of two from 2^0 to 2^7, where 8 is the number of bits in the binary word: i.e. 2^0 + 2^1 + 2^2 + 2^3 + 2^4 + 2^5 + 2^6 + 2^7 = 255 = 2^8 - 1
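Both facts are easy to confirm in Python (a quick illustration, not from the notes):

```python
# Unsigned interpretation of an n-bit word: the range is 0 to 2**n - 1.
def unsigned_value(bits):
    return int(bits, 2)

print(unsigned_value('01010010'))  # 82, as in the example above
print(unsigned_value('11111111'))  # 255, the 8-bit maximum = 2**8 - 1
```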
2s Complement caters for positive and negative numbers in the range -2^(n-1) to +2^(n-1) - 1, and has only one representation of 0 (zero). In 2s complement, the MSB has a negative weighting:
bit weightings (MSB to LSB): -2^(n-1), 2^(n-2), 2^(n-3), ..., 2^2, 2^1, 2^0

The most negative and most positive numbers are represented by:

most negative: 1 0 0 0 ... 0 0 0 (= -2^(n-1))
most positive: 0 1 1 1 ... 1 1 1 (= +2^(n-1) - 1)
As examples, we can convert the following two 8-bit 2s complement words to decimal. As for the unsigned representation, the decimal number 82 is 01010010 in 2s complement signed format:
0 1 0 1 0 0 1 0
= 0x(-2^7) + 1x2^6 + 0x2^5 + 1x2^4 + 0x2^3 + 0x2^2 + 1x2^1 + 0x2^0 = 64 + 16 + 2 = 82

...and the decimal number -82 is 10101110:

1 0 1 0 1 1 1 0
= 1x(-2^7) + 0x2^6 + 1x2^5 + 0x2^4 + 1x2^3 + 1x2^2 + 1x2^1 + 0x2^0 = -128 + 32 + 8 + 4 + 2 = -82
2s Complement Conversion
For 2s Complement, converting between negative and positive numbers involves inverting all bits, and adding 1. For example, we have just considered 2s complement representations of the decimal numbers -82 and +82. They are converted as shown:
+82:     0 1 0 1 0 0 1 0
invert:  1 0 1 0 1 1 0 1
add 1: + 0 0 0 0 0 0 0 1
-82:     1 0 1 0 1 1 1 0

-82:     1 0 1 0 1 1 1 0
invert:  0 1 0 1 0 0 0 1
add 1: + 0 0 0 0 0 0 0 1
+82:     0 1 0 1 0 0 1 0
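The invert-and-add-1 rule is mechanical enough to sketch in Python (an illustration, not from the notes), including the discarded carry out of the top bit:

```python
# 2s complement negation of an n-bit word: invert all bits, add 1
# (any carry out of the top bit is discarded).
def negate(word, n=8):
    inverted = word ^ ((1 << n) - 1)        # invert all n bits
    return (inverted + 1) & ((1 << n) - 1)  # add 1, keep n bits

plus82 = 0b01010010
minus82 = negate(plus82)
print(format(minus82, '08b'))          # 10101110
print(format(negate(minus82), '08b'))  # 01010010 -- back to +82
```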
Positive Numbers: 0 = 00000000, 1 = 00000001, 2 = 00000010, 3 = 00000011. To negate, invert all bits and ADD 1.
Note that when negating zero, the carry produces a ninth bit (inverting 00000000 gives 11111111, and adding 1 gives 100000000). However, if we simply ignore this ninth bit, the representation of negative zero becomes identical to the representation of positive zero. Notice from the above that -128 can be represented but +128 cannot.
Bits on the left of the binary point are termed integer bits, and bits on the right of the binary point are termed fractional bits. The format of a generic fixed point word, comprising n integer bits and b fractional bits, is: : binary point
bit weightings (MSB to LSB): 2^(n-1), 2^(n-2), ..., 2^1, 2^0 (the n integer bits), then the binary point, then 2^-1, 2^-2, ..., 2^-(b-1), 2^-b (the b fractional bits).
The MSB has -ve weighting for 2s complement (as for integer words).
As examples, we consider the 2s complement word 11010110 with the binary point in two different places. Firstly, with the binary point to the left of the third bit, i.e. 5 integer bits and 3 fractional bits:
1 1 0 1 0 . 1 1 0
= 1x(-2^4) + 1x2^3 + 0x2^2 + 1x2^1 + 0x2^0 + 1x2^-1 + 1x2^-2 + 0x2^-3 = -16 + 8 + 2 + 0.5 + 0.25 = -5.25
...and secondly, with the binary point to the left of the fifth bit, i.e. 3 integer bits and 5 fractional bits:
1 1 0 . 1 0 1 1 0
= 1x(-2^2) + 1x2^1 + 0x2^0 + 1x2^-1 + 0x2^-2 + 1x2^-3 + 1x2^-4 + 0x2^-5 = -4 + 2 + 0.5 + 0.125 + 0.0625 = -1.3125
Note that these results are related by a factor of 22 = 4, i.e. 4 x -1.3125 = -5.25.
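Interpreting a fixed point word programmatically amounts to reading it as a signed integer and scaling by 2^-b; a minimal Python sketch (an illustration, not from the notes) reproduces both examples:

```python
# Interpret an n-bit 2s complement word with frac_bits fractional bits:
# read it as a signed integer, then scale by 2**-frac_bits.
def fixed_point_value(bits, frac_bits):
    n = len(bits)
    raw = int(bits, 2)
    if bits[0] == '1':
        raw -= 1 << n              # the MSB carries a negative weighting
    return raw / (1 << frac_bits)

print(fixed_point_value('11010110', 3))  # -5.25
print(fixed_point_value('11010110', 5))  # -1.3125
```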
As with integer representations (which are also effectively fixed point numbers, but with the binary point at position 0), the binary range of fixed point numbers extends from:
most negative: 10000.....0000
most positive: 01111.....1111
The same number of quantisation levels is present, e.g. for an 8-bit binary word, 256 levels can be represented. Numerical range scales according to the binary point position, e.g.:
Dynamic range (range / interval) is independent of the binary point position, e.g. (127-(-128))/1 = 255 = (63.5-(-64))/0.5
To illustrate this further, let's consider the very simple case of a 3-bit 2s complement word, with the binary point in all four possible positions. Clearly the numerical range is affected by the binary point position, but the relationship between the interval and range remains the same.

binary point position 0: range -4 to +3, interval = 1
binary point position 1: range -2 to +1.5, interval = 0.5
binary point position 2: range -1 to +0.75, interval = 0.25
binary point position 3: range -0.5 to +0.375, interval = 0.125
Fixed-point Quantisation
Consider the fixed point number format with 3 integer bits and 5 fractional bits: n n n . b b b b b

Numbers between -4 and +3.96875 can be represented, in steps of 0.03125. As there are 8 bits, there are 2^8 = 256 different values. Revisiting our sine wave example using this fixed-point format, this looks much more accurate: the quantisation error is +/- LSB/2 (where LSB = least significant bit)... so 0.015625 rather than 0.5!
Quantisation is simply the DSP term for the process of representing infinite precision numbers with finite precision numbers. In the decimal world, it is familiar to most to work with a given number of decimal places. The real number pi can be represented as 3.14159265.... and so on. We can quantise or represent it to 4 decimal places as 3.1416. If we use rounding here then the error is: 3.1416 - 3.14159265 = 0.00000735. If we truncated (just chopped off the digits below the 4th decimal place) then the error is larger: 3.14159265 - 3.1415 = 0.00009265. Clearly rounding is most desirable to maintain the best possible accuracy. However it comes at a cost. Albeit the cost is relatively small, it is however not free. When multiplying fractional numbers we will choose to work to a given number of places. For example, if we work to two decimal places then the result of the calculation 0.57 x 0.43 = 0.2451 can be rounded to 0.25, or truncated to 0.24. The results are different. Once we start performing billions of multiplies and adds in a DSP system it is not difficult to see that these small errors can begin to stack up.
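The same comparison can be sketched in Python (my illustration, not from the notes), here quantising to decimal places to match the example; quantising to binary fractional bits works identically with powers of 2 in place of powers of 10:

```python
import math

# Quantisation to d decimal places by truncation (chop towards zero
# for positive values) versus rounding to nearest.
def truncate(x, d):
    return math.floor(x * 10**d) / 10**d

def round_nearest(x, d):
    return math.floor(x * 10**d + 0.5) / 10**d

product = 0.57 * 0.43                         # 0.2451
print(truncate(product, 2), round_nearest(product, 2))  # 0.24 0.25
```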
The binary point position has a power-of-2 relationship with the numerical range of the word format, and any number it represents. Therefore if we want to multiply or divide a fixed point number by a power-of-two, this is achieved by simply shifting the numbers with respect to the binary point!
original number: 0 1 0 1 1 1 against weightings -8, 4, 2, 1, 0.5, 0.25 (i.e. 0101.11, decimal 5.75)

shift right by 2 places: the same bits against weightings -2, 1, 0.5, 0.25, 0.125, 0.0625 (i.e. 01.0111, decimal 1.4375 = 5.75 / 4)

shift left by 1 place: the same bits against weightings -16, 8, 4, 2, 1, 0.5 (i.e. 01011.1, decimal 11.5 = 5.75 x 2)
Of course, looking at the example in the main slide, we could also consider that the binary point is moved, rather than the bits - it amounts to the same thing! Ultimately the binary point is conceptual - having no effect on the hardware produced - and it falls to the DSP design tool and/or DSP designer to keep track of it. Reviewing the divide-by-4 and multiply-by-2 examples from the main slide... if we move the binary point, the weightings of the bits comprising the word, and hence the value it represents, change by a power-of-two factor.
[Diagram: the same bit pattern with the binary point moved instead of the bits - weightings -8, 4, 2, 1, 0.5, 0.25 give decimal 5.75; weightings -2, 1, 0.5, 0.25, 0.125, 0.0625 give decimal 1.4375; weightings -16, 8, 4, 2, 1, 0.5 give decimal 11.5.]
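Because the binary point is purely conceptual, this is easy to model in Python (an illustration, not from the notes): keep the raw bits as an integer and let a fractional-bits count track where the point sits:

```python
# In fixed point, moving the binary point (equivalently, shifting the
# bits) multiplies or divides by a power of two. The raw integer holds
# the bit pattern; frac_bits tracks the conceptual binary point.
def value(raw, frac_bits):
    return raw / (1 << frac_bits)

raw = 0b010111            # the bit pattern from the example
print(value(raw, 2))      # 0101.11  ->  5.75
print(value(raw, 4))      # 01.0111  ->  1.4375  (shift right 2 = divide by 4)
print(value(raw, 1))      # 01011.1  -> 11.5     (shift left 1  = multiply by 2)
```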
Fixed point word formats with 1 integer bit and a number of fractional bits are often adopted. Numbers from -1 to +1-2-b (i.e. just less than 1) can be represented. Some examples:
bit weightings: -1, 1/2, 1/4, 1/8, 1/16, 1/32, 1/64, 1/128

most -ve: 1 0 0 0 0 0 0 0 = decimal -1
most +ve: 0 1 1 1 1 1 1 1 = decimal +0.9921875
Limiting the numeric range to 1 is advantageous because it makes the arithmetic easier to work with... multiplying two normalised numbers together cannot produce a result greater than 1!
The term Q-format is often used to describe fixed point number formats, usually in the context of DSP processors. However, it is useful to note that Q-format and 2s complement are actually the same thing. Q-format notation is given in the form Qm.n, where m is the number of integer bits, and n is the number of fractional bits. Notably this description excludes the MSB of the 2s complement representation, which Q-format considers a sign bit. Therefore the total number of bits in a Q-format number is 1 + m + n, whereas in 2s complement, the same word format would be described as having m+1 integer bits, and n fractional bits. For example, a Q2.5 number has a sign bit, 2 other integer bits, and 5 fractional bits, and hence can be represented as shown below. In 2s complement, this would be described as having 3 integer bits and 5 fractional bits.
[Diagram: the Q2.5 word - a sign bit, 2 further integer bits, and 5 fractional bits with weightings 2^-1 to 2^-5. In the 2s complement description: 3 integer bits and 5 fractional bits.]
The Q0.15 format (often abbreviated to Q15) is used extensively in DSP as it covers the normalised range of numbers from -1 to +1 - 2-15, and is equivalent to a 16 bit 2s complement representation with the binary point at 15, i.e. 1 integer bit and 15 fractional bits.
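Q15 conversion is a simple scaling by 2^15, which a short Python sketch (an illustration, not from the notes) makes concrete; note the float-to-Q15 direction shown here does not saturate, so inputs outside [-1, +1 - 2^-15) would wrap:

```python
# Q15: a 16-bit 2s complement word with 15 fractional bits, covering
# -1 to +1 - 2**-15.
def q15_to_float(word):
    if word & 0x8000:
        word -= 0x10000            # the sign bit has negative weighting
    return word / 32768.0

def float_to_q15(x):
    return int(round(x * 32768)) & 0xFFFF   # no saturation in this sketch

print(q15_to_float(0x8000))   # -1.0, the most negative Q15 value
print(q15_to_float(0x7FFF))   # 0.999969482421875, the most positive
```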
Working with fractional binary values makes the arithmetic easier to work with, and makes it easier to account for wordlength growth. As an example, take the case of a machine using 4 digit decimals and a 4 digit arithmetic unit - range -9999 to +9999. Multiplying two 4 digit numbers will result in up to 8 significant digits: 6787 x 4198 = 28491826. If we want to pass this number to the next stage in the machine (where arithmetic is 4 digits accuracy) then we need to scale down by 10000 (giving 2849.1826), then truncate (giving 2849). Now consider normalising to the range -0.9999 to +0.9999: 0.6787 x 0.4198 = 0.28491826, which truncates directly to 0.2849.
Of course the two results are exactly identical and the differences are in how we handle the truncate and scale. However using the normalised values, where all inputs are constrained to be in the range from -1 to +1, it's easy to note that multiplying ANY two numbers in this range together will give a result also in the range of -1 to +1. Exactly the same idea of normalisation is applied to binary, and the binary point is implicitly used in most DSP systems. Consider 8 bit values in 2s complement. The range is therefore: 10000000 to 01111111 (-128 to +127)
If we normalise these numbers to between -1 and 1 (i.e. divide through by 128) then the binary range is: 1.0000000 to 0.1111111 ( -1 to 0.9921875, where 127/128 = 0.9921875).
Therefore we apply the same normalising ideas as for decimal to multiplication in binary. Consider multiplying 36 x 97 = 3492, equivalent to 00100100 x 01100001 = 0000110110100100. In binary, normalising the values would give the calculation 0.0100100 x 0.1100001 = 0.00110110100100, which in decimal is equivalent to: 0.28125 x 0.7578125 = 0.213134765625. Note very clearly that in a DSP system the binary point is all in the eye of the designer. There is no physical connection or wire for the binary point. It just makes things significantly easier in keeping track of wordlength growth, and truncating just by dropping fractional bits. Of course if you prefer integers and would like to keep track of the scaling etc you can do this... you will get the same answer and the cost is the same.
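The point that the fractional and integer views are the same bits with the same cost can be checked directly in Python (an illustration, not from the notes):

```python
# The normalised multiply from the notes: fractional values are just
# integers with an implicit scaling of 1/128, so a single integer
# multiply gives the fractional result with a scaling of 1/(128*128).
a_int, b_int = 36, 97                        # 00100100, 01100001
a_frac, b_frac = a_int / 128, b_int / 128    # 0.28125, 0.7578125

product_int = a_int * b_int                  # 3492
product_frac = product_int / (128 * 128)     # rescale the same bits

print(product_frac)                     # 0.213134765625
print(a_frac * b_frac == product_frac)  # True -- same bits, same cost
```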
Ver 10.423
R. Stewart, Dept EEE, University of Strathclyde, 2010
27
An ADC is a device that can convert a voltage to a binary number, according to its specific input-output characteristic.
[Figure: 8-bit ADC characteristic - a staircase mapping from Voltage Input to Binary Output, output codes from -128 to +127, sampling at rate fs; an example sample is the 8 bit output 1 0 0 1 1 1 0 1]
Viewing the straight line portion of the device we are tempted to refer to the characteristic as linear. However a quick consideration clearly shows that the device is non-linear (recall the definition of a linear system from before) as a result of the discrete (staircase) steps, and also that the device clips above and below the maximum and minimum voltage swings. However if the step sizes are small and the number of steps large, then we are tempted to call the device piecewise linear over its normal operating range. Note that the ADC does not necessarily have a linear (straight line) characteristic. In telecommunications for example a defined standard nonlinear quantiser characteristic is often used (A-law and µ-law). Speech signals, for example, have a very wide dynamic range: harsh "oh" and "b" type sounds have a large amplitude, whereas softer sounds such as "sh" have small amplitudes. If a uniform quantisation scheme were used then although the loud sounds would be represented adequately, the quieter sounds may fall below the threshold of the LSB and therefore be quantised to zero and the information lost. Therefore non-linear quantisers are used such that the quantisation level at low input levels is much smaller than for higher level signals. A-law quantisers are often implemented by using a nonlinear circuit followed by a uniform quantiser. Two schemes are widely in use: the A-law in Europe, and the µ-law in the USA and Japan. Similarly the DAC can have a non-linear characteristic.
[Figure: non-linear quantiser characteristic - Binary Output against Voltage Input, with smaller steps at low input levels]
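The non-uniform quantiser can be modelled as a compressor followed by a uniform quantiser. A rough sketch of the standard µ-law compression curve (µ = 255, the usual telephony value; the function name is ours) shows how small amplitudes are expanded before quantisation:

```python
import math

MU = 255  # standard mu-law constant

def mu_law_compress(x):
    """Compress x in [-1, 1]: small amplitudes get proportionally more of
    the output range, so quiet sounds survive uniform quantisation."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

# A quiet input of 0.01 maps to roughly 0.23 of full scale, while a loud
# input of 0.5 maps to about 0.88 - the curve expands small signals.
for x in (0.01, 0.1, 0.5, 1.0):
    print(f"{x:5.2f} -> {mu_law_compress(x):.3f}")
```

A uniform quantiser applied after this compressor gives the effective non-linear staircase described above.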
28
Perfect signal reconstruction assumes that sampled data values are exact (i.e. infinite precision real numbers). In practice they are not, as an ADC will have a number of discrete levels. The ADC samples at the Nyquist rate, and the sampled data value is the closest (discrete) ADC level to the actual value:
[Figure: sampling of s(t) by an ADC at rate fs - the continuous voltage is quantised to the nearest discrete level at each sample time ts, giving the binary-valued sequence v(n) against sample number n]
For example purposes, we can assume our ADC or quantiser has 5 bits of resolution and maximum/minimum voltage swing of +15 and -16 volts. The input/output characteristic is shown below:
[Figure: 5-bit quantiser characteristic, output codes from 10000 (-16) to 01111 (+15)]
In the above slide figure, for the second sample the true sample value is 1.589998..., however our ADC quantises to a value of 2.
Quantisation Error
29
If the smallest step size of a linear ADC is q volts, then the error of any one sample is at worst q/2 volts.
[Figure: linear ADC characteristic - Binary Output from 10000 (-16) to 01111 (+15) against Voltage Input between -Vmax and +Vmax]
Quantisation error is often modelled as an additive noise component nq, and indeed the quantisation process can be considered purely as the addition of this noise to the ideal sample value:
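This additive-noise view is easy to check numerically. A small sketch (the step size and sample values are arbitrary, chosen for illustration):

```python
# Quantisation modelled as additive noise: v(n) = s(n) + nq(n), where the
# rounding error nq is bounded by +/- q/2 for a linear ADC of step size q.
q = 0.25                                   # quantiser step size (volts)

def quantise(x, q):
    """Round to the nearest quantiser level (a linear/uniform ADC)."""
    return q * round(x / q)

samples = [1.589998, -0.42, 0.1249, 3.14159]
for s in samples:
    nq = quantise(s, q) - s                # the equivalent additive noise
    assert abs(nq) <= q / 2                # worst-case error is q/2
    print(f"s={s:8.5f}  v={quantise(s, q):6.2f}  nq={nq:+.5f}")
```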
An Example
Here is an example using a 3-bit ADC:
[Figure: 3-bit ADC example - the input waveform (amplitude/volts against time/seconds), the ADC characteristic (output against input amplitude), the quantised output, and the quantisation error against time]
30
Notes:
In this case the worst case error is 1/2, i.e. q/2 with a step size of q = 1.
31
The full adder circuit can be used in a chain to add multi-bit numbers. The following example shows 4 bits:
[Figure: 4-bit ripple-carry adder - four full adders chained through the carries C0..C3 compute A3 A2 A1 A0 + B3 B2 B1 B0 = S4 S3 S2 S1 S0, with the final carry forming the MSB S4 and S0 the LSB]
This chain can be extended to any number of bits. Note that the last carry output forms an extra bit in the sum. If we do not allow for an extra bit in the sum, if a carry out of the last adder occurs, an overflow will result i.e. the number will be incorrectly represented.
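The chain can be modelled bit-for-bit in software. A minimal behavioural sketch (the list-of-bits representation, LSB first, is our choice):

```python
# Bit-level sketch of the ripple-carry chain described above (not FPGA code):
# each full adder consumes one bit of A and B plus the carry from the right.
def full_adder(a, b, cin):
    s = a ^ b ^ cin                          # sum bit: three-way XOR
    cout = (a & b) | (a & cin) | (b & cin)   # carry out: majority function
    return s, cout

def ripple_add(a_bits, b_bits):
    """Add two little-endian bit lists; the final carry becomes an extra
    sum bit, so a 4-bit + 4-bit addition yields a 5-bit result."""
    carry, out = 0, []
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        out.append(s)
    out.append(carry)                        # S4: keep the last carry out
    return out

# 4-bit example, LSB first: 0111 (7) + 0110 (6) = 01101 (13)
print(ripple_add([1, 1, 1, 0], [0, 1, 1, 0]))   # [1, 0, 1, 1, 0] = 13
```

Dropping the final `out.append(carry)` models the overflow case described above: 15 + 1 would wrap to 0.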
The truth table for the full adder is:

A  B  CIN  S  COUT
0  0  0    0  0
0  0  1    1  0
0  1  0    1  0
0  1  1    0  1
1  0  0    1  0
1  0  1    0  1
1  1  0    0  1
1  1  1    1  1
Sout = A'B'C + A'BC' + AB'C' + ABC = A xor B xor C
Cout = A'BC + AB'C + ABC' + ABC = AB + AC + BC = AB + C(A xor B)
The longest propagation delay path in the above full adder is two gates.
Subtraction
32
Subtraction is very readily derived from addition. Remember 2s complement? All we need to do to get a negative number is invert the bits and add 1. Then if we add these numbers, we'll get a subtraction D = A + (-B):
[Figure: 4-bit 2s complement subtractor - the B3..B0 inputs are inverted, the carry-in is set to 1, the final carry C4 is discarded, and the result is D3 D2 D1 D0]
Note for 4 bit positive numbers (i.e. NOT 2s complement) the range is from 0 to 15. For 2s complement the numerical range is from -8 to 7.
Addition/Subtraction (using 2s complement representation)
Sometimes we need a combined adder/subtractor with the ability to switch between modes. This can be achieved quite easily:
[Figure: combined adder/subtractor - a 2:1 MUX on each B input selects Bi or its inverse under control of K, with K also driving the carry-in]
For: A + B, K = 0. For: A - B, K = 1. This structure will be seen again in the Division/Square Root slides!
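In software the same K-controlled structure looks like this (a behavioural sketch; the 4-bit width and function name are our choices):

```python
# Sketch of the combined adder/subtractor: the MUX on each B input selects
# Bi (K=0) or its complement (K=1), and K also drives the carry-in, so
# A - B is computed as A + ~B + 1 (two's complement negation).
WIDTH = 4
MASK = (1 << WIDTH) - 1

def add_sub(a, b, k):
    """k=0: A + B;  k=1: A - B.  4-bit result, final carry discarded."""
    b_mux = (b ^ (MASK if k else 0)) & MASK   # invert B when subtracting
    return (a + b_mux + k) & MASK             # K is the carry-in

print(add_sub(0b0110, 0b0011, 0))   # 6 + 3 = 9  -> 0b1001
print(add_sub(0b0110, 0b0011, 1))   # 6 - 3 = 3  -> 0b0011
print(add_sub(0b0011, 0b0110, 1))   # 3 - 6 -> 0b1101, i.e. -3 in 2s comp
```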
33
With 2s complement, overflow will occur when the result to be produced lies outside the range of the number of bits. Therefore for an 8 bit example the range is -128 to +127 (in binary, 10000000 to 01111111):

100 + 37 = 137:        01100100 + 00100101 = 10001001
-65 + (-112) = -177:   10111111 + 10010000 = 101001111
With an 8 bit result we lose the 9th bit and the result wraps around to a positive value: 01001111 = 79 .
With an 8 bit result the result wraps around to a negative value: 10001001 = -119.
One solution to overflow is to ensure that the number of bits available is always sufficient for the worst case result. Therefore in the above example perhaps allow the wordlength to grow to 9 or even 10 bits. Using Xilinx System Generator we can specifically check for overflow in every addition calculation.
Recall from previously that overflow detect circuitry is relatively easy to design - we just need to keep an eye on the MSB bits (indicating whether each number is +ve or -ve). For example:
(-73) + 127 = 54 (discard the final 9th bit carry): operands of opposite sign will never overflow.
100 + 64 = 164: two +ve operands giving a -ve result means overflow; likewise two -ve operands giving a +ve result means overflow.
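The MSB rule can be sketched directly (illustrative model only; the function name is ours):

```python
# Sketch of MSB-based overflow detection for an 8-bit two's complement adder:
# overflow occurs only when both operands share a sign and the result's sign
# differs - operands of opposite sign can never overflow.
def add8_detect(a, b):
    """Add two 8-bit two's complement values (Python ints in -128..127)
    and report whether the 8-bit result overflowed."""
    ra, rb = a & 0xFF, b & 0xFF
    r = (ra + rb) & 0xFF                     # discard the 9th carry bit
    sa, sb, sr = ra >> 7, rb >> 7, r >> 7    # the three sign bits (MSBs)
    overflow = (sa == sb) and (sr != sa)
    signed = r - 256 if sr else r
    return signed, overflow

print(add8_detect(-73, 127))   # (54, False): opposite signs, never overflows
print(add8_detect(100, 64))    # (-92, True): +ve operands, -ve result
print(add8_detect(-65, -112))  # (79, True):  -ve operands, +ve result
```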
Saturation
Taking the previous overflowing examples from Slide 33:

100 + 37 = 137:        01100100 + 00100101 - detect overflow and saturate to +127 (01111111)
-65 + (-112) = -177:   10111111 + 10010000 - detect overflow and saturate to -128 (10000000)
34
When overflow is detected, the result is set to the closest possible value (i.e. for the 8 bit case either -128 or +127). This applies to every addition that is explicitly done with an adder block. In Xilinx System Generator the user will get a checkbox choice to allow results to either (i) Wraparound or (ii) Saturate. Implementing saturate will require detect overflow & select circuitry.
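The saturate behaviour is a one-line change from the wraparound adder (behavioural sketch, name ours):

```python
# Sketch of a saturating 8-bit adder: on overflow the result is clamped to
# the closest representable value instead of wrapping around.
def sat_add8(a, b):
    r = a + b               # full-precision result
    if r > 127:
        return 127          # saturate high
    if r < -128:
        return -128         # saturate low
    return r

print(sat_add8(-65, -112))  # -177 saturates to -128 (wraparound gave +79)
print(sat_add8(100, 37))    #  137 saturates to +127 (wraparound gave -119)
print(sat_add8(100, 20))    #  120: in range, unaffected
```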
Once again, design of a DSP system might be done such that overflow never happens, as the user has ensured there are enough bits to cater for the worst possible case leading to the maximum magnitude result. Generally for some later FPGAs such as the Virtex-4, using the DSP48 blocks gives adders with 48 bits of precision, therefore when working with, say, 16 bit values it is unlikely they will grow to over 48 bits. Hence overflow has been designed out. Of course not all applications will use these devices, and when using general slice logic and attempting to make adders as small as possible, care must be taken, and where appropriate to efficient design, saturate might be included.
Saturation is extremely useful for adaptive algorithms. For example, in the Least Mean Squares (LMS) algorithm, the filter weights w are updated according to the equation: w(k) = w(k-1) + 2µe(k)x(k). Without further concern over the meaning of this equation, we can see that the term 2µe(k)x(k) is added to the weights at time epoch k-1 to generate the new weights at time epoch k. If the operations that form 2µe(k)x(k) were to overflow, there is a high chance that the sign of the term would flip and drive the weights in completely the wrong direction, leading to instability. With saturation however, if the term 2µe(k)x(k) gets very big and would overflow, saturation will limit it to the maximum value representable, causing the weights to change in the right direction, and at the fastest speed possible in the current representation. The result is a huge increase in the stability of the algorithm.
35
[Figure: upper half of a Virtex-II Pro slice - a LUT plus the dedicated carry logic (MUXCY and XORG) form one bit of an adder, with inputs A, B and Cin and outputs Sout and Cout]
Picture of Xilinx-II Pro slice (upper half) taken from Virtex-II Pro Platform FPGAs: Introduction and Overview, DS083-1 (v2.4) January 20, 2003. http://www.xilinx.com
LookUp Table (LUT) programmed with the two-input XOR function D = G1 xor G2:

G1 (A)  G2 (B)  D
0       0       0
0       1       1
1       0       1
1       1       0

giving Sout = Cin xor D.
Cout = D'A + CinD (multiplex operation). The result is the FULL ADDER implementation:

G1 (A)  G2 (B)  Cin  D  Sout  Cout
0       0       0    0  0     0
0       0       1    0  1     0
0       1       0    1  1     0
0       1       1    1  0     1
1       0       0    1  1     0
1       0       1    1  0     1
1       1       0    0  0     1
1       1       1    0  1     1
36
A (very) high level diagram of the main logic components on one slice
[Figure: high-level diagram of one slice - upper and lower halves, each containing a LUT, a MULTAND, an XORG, a D-type flip-flop and several routing multiplexers]
Just reviewing the logic circuitry on one half of the slice (note that in Slide 35 only the top half of the slice is shown, whereas the above slide shows the top and bottom halves), we can note:
One D-type flip flop
One 4 input Look-Up-Table (LUT) (can be configured as a shift register or simply as RAM/memory)
One XOR gate
One AND gate
One OR gate
A few 2 input MUX (multiplexors) to route signals
Clock inputs
Small FPGAs will have just a few hundred (100s) slices; Large FPGAs will have many tens of thousands (10000s) of slices (and other components!)
37
To produce larger adders the Xilinx tools will simply cascade the carry bits in adjacent (where possible!) slices. The bottom half of a Virtex-II Pro slice can be programmed for an identical operation, with its COUT wired to the top half's CIN. Hence we can get two bits of addition per standard Xilinx slice. To produce a 4 bit adder, we cascade with another slice.
[Figure: 4-bit adder built from two slices - each slice provides two full adders (FA), with the carry chain C0..C3 cascaded between slices to produce the sum S4 S3 S2 S1 S0]
Note the importance of the LUT (look up table) in the Xilinx slice. When configured as a LUT, any four input Boolean equation can be implemented. For example, take the equation Y = ABC + ABCD. The truth table for this equation is:

ABCD  Y        ABCD  Y
0000  0        1000  0
0001  0        1001  0
0010  0        1010  0
0011  0        1011  0
0100  0        1100  0
0101  0        1101  0
0110  0        1110  1
0111  0        1111  1

To implement this function, simply store the values of Y in the slice LUT, and then address the LUT with the values of ABCD to get the output. Therefore ANY 4 variable Boolean function can be simply implemented with a four input LUT. (Of course if the equation has only 3 variables then we can also implement it by setting one input constant.)
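The store-then-address idea can be sketched in a few lines, using the Y = ABC + ABCD function from the notes (the helper names are ours):

```python
# Sketch of a 4-input LUT: the 16 output bits of the truth table are the
# "programming", and (A, B, C, D) simply form the address. Any 4-variable
# Boolean function can be stored this way.
def make_lut(fn):
    """Build the 16-entry contents for a function of (a, b, c, d)."""
    return [fn((i >> 3) & 1, (i >> 2) & 1, (i >> 1) & 1, i & 1)
            for i in range(16)]

def lut_read(lut, a, b, c, d):
    return lut[(a << 3) | (b << 2) | (c << 1) | d]   # address the LUT

# Program it with Y = A.B.C + A.B.C.D
lut = make_lut(lambda a, b, c, d: (a & b & c) | (a & b & c & d))

print(lut_read(lut, 1, 1, 1, 0))   # 1: the ABC term is true
print(lut_read(lut, 0, 1, 1, 1))   # 0: A = 0, neither term is true
```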
Multiplication in binary
Multiplying in binary follows the same form as in decimal:

        11010110    (A7..A0)
      x 00101101    (B7..B0)
      ----------
        11010110
       000000000
      1101011000
     11010110000
    000000000000
   1101011000000
  00000000000000
 000000000000000
----------------
0010010110011110    (P15..P0)
38
Note that the product P is composed purely of selecting, shifting and adding A. Bit i of B indicates whether or not a shifted version of A is selected in the i-th row of the sum. So we can perform multiplication using just full adders and a little logic for selection, in a layout which performs the shifting.
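The select/shift/add recipe above can be sketched directly (an illustrative model, not the array hardware; the function name is ours):

```python
# Sketch of unsigned multiplication as pure select/shift/add, mirroring the
# partial-product rows above: bit i of B selects A shifted left by i places.
def shift_add_mul(a, b, nbits=8):
    p = 0
    for i in range(nbits):
        if (b >> i) & 1:        # does bit i of B select this row?
            p += a << i         # shifted copy of A
    return p

a, b = 0b11010110, 0b00101101   # 214 x 45, as in the worked example
print(shift_add_mul(a, b))                   # 9630
print(format(shift_add_mul(a, b), '016b'))   # 0010010110011110
```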
Multiplication in decimal: starting with an example in decimal, each digit of one operand selects and scales the other operand, and the shifted partial products are then summed.
2s complement Multiplication
39
For one negative and one positive operand just remember to sign extend the negative operand.
             -42         11010110
           x  45       x 00101101
                       ----------
1111111111010110    (sign extended)
0000000000000000
1111111101011000
1111111010110000
0000000000000000
1111101011000000
0000000000000000
0000000000000000
----------------
1111100010011110    = -1890
2s complement multiplication (II) For both operands negative, subtract the last partial product.
We use the trick of inverting (negating and adding 1) the last partial product and adding it, rather than subtracting:

             -42         11010110
           x -83       x 10101101
                       ----------
 1111111111010110
 0000000000000000
 1111111101011000
 1111111010110000
 0000000000000000
 1111101011000000
 0000000000000000
+0001010100000000   (two's complement of the last partial product, 1110101100000000)
 ----------------
 0000110110011110   = 3486
Of course, if both operands are positive, just use the unsigned technique! The difference between signed and unsigned multiplies results in different hardware being necessary. DSP processors typically have separate unsigned and signed multiply instructions.
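The sign-extend and subtract-the-MSB-row rules can be checked in software. A behavioural sketch (names are ours; negative rows are represented by their two's complement patterns modulo 2^2n):

```python
# Sketch of two's complement multiplication: partial products are sign
# extended to the full result width, and the row selected by B's MSB
# (which carries weight -2**(n-1)) is subtracted - implemented, as in the
# notes, by adding its two's complement.
def signed_mul(a, b, n=8):
    """a, b are signed ints in -2**(n-1)..2**(n-1)-1; returns the 2n-bit
    two's complement bit pattern of the product."""
    mask = (1 << 2 * n) - 1
    bu = b & ((1 << n) - 1)                  # bit pattern of B
    p = 0
    for i in range(n - 1):
        if (bu >> i) & 1:
            p = (p + ((a << i) & mask)) & mask        # sign-extended row
    if (bu >> (n - 1)) & 1:                  # B's MSB row is subtracted
        p = (p + ((-(a << (n - 1))) & mask)) & mask   # add the 2's comp
    return p

def to_signed(x, bits):
    return x - (1 << bits) if x >> (bits - 1) else x

p = signed_mul(-42, -83)
print(format(p, '016b'), to_signed(p, 16))   # 0000110110011110 3486
```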
40
Fixed point multiplication is no more awkward than integer multiplication:

         11010.110    (26.750)
       x 00101.101    (x 5.625)
       -----------
         11.010110
        000.000000
       1101.011000
      11010.110000
     000000.000000
    1101011.000000
   00000000.000000
  000000000.000000
 -----------------
 0010010110.011110    (= 150.468750)

In decimal: 26.750 x 5.625 = 0.133750 + 0.535000 + 16.050000 + 133.750000 = 150.468750
Again we just need to remember to interpret the position of the binary point correctly.
41
High speed integrated arithmetic slices (DSP48s): multiply-accumulate; add, multiply, accumulate.
Over the next few slides we will see that multipliers can be implemented in a variety of different ways. As multipliers are used extensively in DSP, implementing them efficiently is a priority consideration. The most basic multiplier is a 2-input version which is implemented using the logic fabric, i.e. the lookup tables within the slices of the device. This type is referred to as a distributed multiplier, because the implementation is distributed over the resources in several slices. In the case of multiplication with a constant, which is commonly required in DSP, the knowledge of one multiplicand can be exploited to create a cheaper hardware implementation than a conventional 2-input multiplier. Two approaches that will be discussed in the coming pages are ROM-based constant multipliers, and shift-and-add multipliers which sum the outputs from binary shift operations. The FPGA companies are well aware that DSP engineers desire fast and efficient multipliers, and as a result, they began incorporating embedded multipliers into their devices in the year 2000. Since then the sophistication of these components has increased, and they have been extended to feature fast adders and in many cases longer wordlengths, too. We can now think of them as embedded arithmetic slices, rather than simply multipliers.
Distributed Multipliers
This figure shows a four-bit multiplication:
42
[Figure: 4-bit distributed array multiplier - AND gates form the partial-product bits ai.bj, and full adders (FA) arranged diagonally sum them to produce the product bits p7..p0]
The AND gate connected to a and b performs the selection for each bit. The diagonal structure of the multiplier implicitly inserts zeros in the appropriate columns and shifts the a operands to the right. Note that this structure does not work for signed twos complement!
Note the function of the simple AND gate: multiplying 1s and 0s is the same as ANDing 1s and 0s.

A  B  Z
0  0  0
0  1  0
1  0  0
1  1  1

or in Boolean algebra Z = A and B = AB, and Z = A x B (where x = multiply).
Hence the AND gate is the bit multiplier. The function of one partial product stage of the multiplier is as shown below.
[Figure: one partial-product stage - full adders (FA) combine the incoming partial sum x3 x2 x1 x0 with the AND-gated row b0.(a3 a2 a1 a0), giving y4 y3 y2 y1 y0 = b0(a3 a2 a1 a0) + x3 x2 x1 x0]
43
This shows the top half of a slice, which implements one multiplier cell.
[Figure: top half of a slice implementing one multiplier cell - inputs A, B, S and Cin; the LUT, MULTAND, MUXCY and XORG produce Sout and Cout]
NOTE: This implementation features a Virtex-II Pro FPGA.
Picture of Xilinx-II Pro slice (upper half) taken from Virtex-II Pro Platform FPGAs: Introduction and Overview, DS083-1 (v2.4) January 20, 2003. http://www.xilinx.com LUT implements the XOR of two ANDs:
The dedicated MULTAND unit is required as the intermediate product G1G2 cannot be obtained from within the LUT, but is required as an input to MUXCY. The two AND gates perform a one-bit multiply each, and the result is added by the XOR plus the external logic (MUXCY, XORG): Sout = Cin xor D, Cout = D'(AB) + CinD. This structure will perform one cell of the multiplier (see the next slide...). Note that whereas the signal flow graph of the distributed multiplier shows signals propagating from the top and right of the diagram to the bottom, the internal structure of the FPGA slice logic results in a different configuration when implemented on a device.
ROM-based Multipliers
44
Just as logical functions such as XOR can be stored in a LUT, as shown for addition, we can use storage-based methods to do other operations. By using a ROM, we can store the result of every possible multiplication of two operands. The two operands A and B are concatenated to form the address with which to access the ROM. The value stored at that address is the multiplication result, P:
[Figure: ROM-based multiplier - the 4-bit operands A = 1010 (-6) and B = 0011 (3) are concatenated into the 8-bit address A:B = 10100011; the addressed ROM location holds the 8-bit product P = 11101110 (-18)]
There is one serious problem with this technique: as the operand size grows, the ROM size grows exponentially. For two N bit input operands, there are 2^2N possible input combinations, and hence the ROM has 2^2N entries. The output result is 2N bits long, and in total 2N x 2^2N bits of storage are required. For example, with 8 bit operands (a fairly reasonable size), 1Mbit of storage is required - a large quantity. For bigger operands, e.g. 16 bits, a huge quantity of storage is required: 16 bit operands require 128Gbits of storage, and hence a ROM-based multiplier is clearly not a realistic implementation choice!
Input Wordlength (N)   Output Wordlength (2N)   No. of ROM entries (2^2N)    Total ROM Storage (2N x 2^2N)
4                      8                        2^8  = 256                   2 Kbits
6                      12                       2^12 = 4,096                 48 Kbits
8                      16                       2^16 = 65,536                1 Mbit
10                     20                       2^20 = 1,048,576             20 Mbits
12                     24                       2^24 = 16,777,216            384 Mbits
14                     28                       2^28 = 268,435,456           7 Gbits
16                     32                       2^32 = 4,294,967,296         128 Gbits
18                     36                       2^36 = 68,719,476,736        2.25 Tbits
20                     40                       2^40 = 1,099,511,627,776     40 Tbits
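The exponential growth in the table is quick to verify (the function name is ours):

```python
# Quick check of the table above: total ROM size for an N x N-bit multiplier
# is 2**(2N) entries, each 2N bits wide.
def rom_bits(n):
    return (2 * n) * (2 ** (2 * n))

for n in (4, 8, 16):
    print(f"N={n:2d}: {rom_bits(n):>15,} bits")

# N=8 needs 2**16 entries x 16 bits = 1 Mbit; N=16 needs 128 Gbits.
assert rom_bits(8) == 1024 * 1024
assert rom_bits(16) == 128 * 2 ** 30
```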
45
Consider a ROM multiplier with 8-bit inputs: 65,536 8-bit locations are required to store all possible outputs... so 1Mbit of storage is needed!
[Figure: ROM-based multiplier with 8-bit inputs - A = 01101011 (107) and B = 01000011 (67) form the 16-bit address A:B = 27,459; the addressed location holds the 16-bit product P = 0001110000000001 (7,169)]
For example, if the B input was the constant value 75, the possible input words would be composed of 256 possible combinations of the upper 8-bits of the address, concatenated with the 8-bit binary word 0100 1011, as shown below. The result is that only 256 of the 65,536 memory locations are actually accessed. Therefore, when one of the inputs to the ROM-based multiplier is fixed, the size of the required ROM can be reduced to 256 locations of 16-bit data (note that the precision of the stored output words remains 8 bits + 8 bits). The total memory required is thus 256 x 16 = 4 kbits. However, depending on the value of the constant, it may also be possible to reduce the length of the stored results. For instance, if the value of B is (decimal) 10, the maximum magnitude output product generated by the multiplication of B with any 8-bit input A will be: 128 x 10 = 1280. As -1280 can be represented with 12 bits, that represents a further saving of 4 bits storage x 256 memory locations = 1 kbit.
[Figure: constant-coefficient ROM multiplier - the unknown 8-bit input A (? ? ? ? ? ? ? ?) is concatenated with the fixed 8-bit constant B = 75 (0100 1011) to form the 16-bit address A:B, so only 256 of the 65,536 locations are ever accessed]
The total storage requirement for this example constant coefficient multiplier would therefore be 3 kbits... significantly smaller than the 1Mbit needed for a full 8-bit x 8-bit multiplier (16-bit results) where both operands are unknown!
46
ROM-based multipliers with a constant input require fewer addresses. The storage required for output words may also be reduced, if the maximum result does not require the full numerical range of:

-2^(2N-1) <= result <= 2^(2N-1) - 1
The maximum product and output wordlength can be calculated for the particular constant value, and the multiplier optimised accordingly...
[Figure: constant-coefficient multiplier with B = -83 - an unknown 8-bit signed input A (maximum absolute value 128) produces a 16-bit signed result P, whose stored wordlength can be minimised for this particular constant]
Constant multipliers can be implemented using the LUTs within the logic fabric (distributed ROM), or with one or more of the Block RAMs available on most devices. The selection is influenced by the other demands placed on these resources by the rest of the system being designed. In System Generator, the designer can specify the implementation style via the Constant Multiplier dialog box, along with the constant value, the output wordlength, and other parameters.
47
Multiplication by a power-of-2 can be achieved simply by shifting the number to the right or left by the appropriate number of binary places.
[Figure: multiplying by a power of two - the bits of the word are simply shifted left or right relative to the binary point]
Extending this a little, multiplications by other numbers can be performed at low cost by creating partial products from shifts, and then adding them together.
For example: 21 x 9 = (21 x 2^3) + 21 = 189; and multiplying by 1.3125 uses (1 x 2^-4) + 1 + (1 x 2^-2) = 1.3125, i.e. the input plus copies of itself shifted right by 2 and 4 places.
Shift operations are effectively free in terms of logic resources, as they are implemented only using routing. Therefore multiplications by power-of-two numbers are very cheap! By recognising that multiplications by other numbers can be achieved by summing partial products of power-of-two shifts, any arbitrary multiplication can be decomposed into a series of shift and add operations. The closer the desired multiplication is to a power-of-two, i.e. the fewer partial products that are required, the fewer adders are required, and hence the lower the cost of the multiplier. This type of multiplier is suitable only for constant multiplications, because there is only one input, and the result is achieved using the configuration of the hardware.
The technique can be particularly powerful when applied to parallel multiplications of the same input. The partial product terms common to several multiplications can be shared and thus the overall effort reduced. Transpose form filters are very suitable for optimisation in this way. Taking the simple example of two concurrent multiplications, one x9 and the other x24, it is clear that the shift left by three places can be shared, as x8 is common to both operations (x9 = x8 + x1, x24 = x16 + x8) - fewer partial products are needed than if x24 and x9 were calculated separately.
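The shared x8 term can be sketched directly (the function name is ours):

```python
# Sketch of shift-and-add constant multiplication with a shared partial
# product: x9 = x8 + x1 and x24 = x16 + x8, so the x8 term (a shift left
# by three places) is computed once and used by both outputs.
def times9_and_24(x):
    x8 = x << 3                     # shared partial product (x * 8)
    return x8 + x, x8 + (x << 4)    # x*9 and x*24 - one extra add each

print(times9_and_24(21))    # (189, 504)
```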
Embedded multipliers
48
The Xilinx Virtex-II and Virtex-II Pro series were the first to provide on-chip multipliers, in the early 2000s. These are hardwired on the FPGA ASIC, not in the user FPGA slice-logic area. They are therefore permanently available and use no slices. They also consume less power than a slice-based equivalent and can be clocked at the maximum rate of the device.
[Figure: embedded 18x18 bit multiplier with input operands A and B and product P]
A and B are 18-bit input operands, and P is the 36-bit product, i.e. P = A B. Depending upon the actual FPGA, between 12 and more than 2000 (Virtex 6 top of range) of these dedicated multipliers are available.
Looking at a device floorplan, you can clearly see the embedded multipliers, which are located next to Block RAMs on the FPGA in order to support high speed data fetching/writing and computation.
[Figure: device floorplan - columns of 18x18 multipliers sit next to Block RAMs, surrounded by the general slice fabric]
Information on dedicated multipliers taken from Virtex-II Pro Platform FPGAs: Introduction and Overview, DS083-1 (v2.4) January 20, 2003. http://www.xilinx.com.
49
It can be easy to utilise on chip embedded multipliers inefficiently through choice of wordlengths...
[Figure: wordlength examples - a 4 x 4 multiply mapped onto an 18 x 18 embedded multiplier (36-bit product), and a requested 19 x 19 multiply producing a 38-bit product]
If you specify the use of embedded multipliers for a particular multiplier in System Generator, the tool will do exactly as you have asked, and implement it entirely using embedded multipliers. However, depending on the wordlengths involved, this may lead to an inefficient implementation. The wordlengths of the embedded multipliers are fixed at 18 x 18 bits, and it makes sense to use them as fully as possible. It is relatively easy to see that a 4 x 4 bit multiply will greatly underuse the capabilities of the multiplier, and this particular multiply operation might be better mapped to a distributed implementation, which would leave the embedded multiplier free for use somewhere else. Of course, these decisions are made in the context of some larger design with its own particular needs for the various resources available on the FPGA being targeted. Perhaps less obviously, mapping a multiplication to embedded multipliers where the input operands are slightly longer than 18 bits is also inefficient. This may result in, for example, the following implementation for a requested 19 x 19 bits multiplier, where 4 embedded multipliers are used instead of the expected 1!
[Figure: a 19 x 19 multiply decomposed into four embedded multiplies - 18 x 18, 1 x 18, 18 x 1 and 1 x 1 - whose shifted partial products are summed into the 38-bit result]
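The four-multiplier cost arises because each 19-bit operand must be split into an 18-bit lower part and a 1-bit upper part, and the product reassembled from the four cross terms. A quick arithmetic sketch (the function name is ours):

```python
# Split each 19-bit operand X into x1*2**18 + x0 (x1 is 1 bit, x0 is 18
# bits). Then A*B = a1b1*2**36 + (a1b0 + a0b1)*2**18 + a0b0: one 1x1 term,
# one 1x18 and one 18x1 term, and one 18x18 term - four multipliers.
def mul19(a, b):
    a1, a0 = a >> 18, a & ((1 << 18) - 1)
    b1, b0 = b >> 18, b & ((1 << 18) - 1)
    return ((a1 * b1) << 36) + ((a1 * b0 + a0 * b1) << 18) + a0 * b0

a, b = (1 << 19) - 1, (1 << 19) - 5        # two arbitrary 19-bit values
assert mul19(a, b) == a * b                 # the reassembly is exact
print(mul19(a, b).bit_length())             # 38-bit product
```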
50
As much DSP involves the Multiply-Accumulate (MAC) operation, soon after embedded multipliers came DSP48 slices (on the Virtex-4). These feature an 18 x 18 bit multiplier followed by a 48 bit accumulator.
[Figure: DSP48 slice (Virtex-4) - an 18 x 18 multiplier produces a 36-bit product feeding a 48-bit accumulator]
Like the embedded multipliers, these are low power and fast. The ability to cascade slices together also means that whole filters can be constructed without having to use any slices.
The next series of FPGAs (the Virtex-5) enhanced the capabilities of the DSP slice with the DSP48E. The major improvements of this slice are logic capabilities within the adder/subtractor unit, and an extended wordlength of one input to 25 bits. The maximum clock frequency also increased in line with the speed of the device.
[Figure: DSP48E slice (Virtex-5) - a 25 x 18 multiplier produces a 43-bit product feeding a 48-bit accumulator]
51
The Spartan-3A DSP series and subsequent Spartan-6 feature a version of the DSP48 with a pre-adder, prior to the multiplier. This feature is especially useful for DSP structures like symmetric filters, because it allows the total number of multiplications to be reduced.
[Figure: Spartan DSP48A slice - an 18-bit pre-adder feeds one input of the 18 x 18 multiplier]
The Virtex-6 offers a combination of the benefits of the Virtex-5 (the longer wordlength and arithmetic unit), together with the pre-adder from the Spartan series. This results in a very computationally powerful device, especially as it can be clocked at 600MHz, and the largest chips have 2000+ of them!
[Figure: DSP48E1 slice (Virtex-6) - a 25-bit pre-adder, a 25 x 18 multiplier (43-bit product) and a 48-bit accumulator/ALU]
Division (i)
Divisions are sometimes required in DSP, although not very often. 6 bit non-restoring division array:
[Figure: 6-bit non-restoring division array - rows of controlled add/subtract cells (FA plus selection logic) compute the quotient Q = B / A, with each row's carry out forming a quotient bit q5..q0]
52
Note that each cell can perform either addition or subtraction, as shown in an earlier slide: either Sin + Bin or Sin - Bin can be selected.
A direct method of computing division exists. This paper and pencil method may look familiar as it is often taught in school. A binary example is given below. Note that each stage computes an addition or subtraction of the divisor A. The quotient is made up of the carry bits from each addition/subtraction. If the quotient bit is a 0, the next computation is an addition, and if it is a 1, the divisor is subtracted. It is not difficult to map this example into the structure shown on the slide.
Example: B = 01011 (11), A = 01101 (13), -A = 10011. Compute Q = B / A.

01011 + 10011 = 11110, carry 0, so q4 = 0
shift left: 11100; add A:      11100 + 01101 = (1)01001, carry 1, so q3 = 1
shift left: 10010; subtract A: 10010 + 10011 = (1)00101, carry 1, so q2 = 1
shift left: 01010; subtract A: 01010 + 10011 = (0)11101, carry 0, so q1 = 0
shift left: 11010; add A:      11010 + 01101 = (1)00111, carry 1, so q0 = 1

giving Q = 0.1101 (11/13 = 0.846..., truncated to four fractional bits).
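The same working can be replicated in a short behavioural sketch (register width and names are ours):

```python
# Sketch of non-restoring division, following the paper-and-pencil working
# above: subtract or add the divisor, take the carry out as the quotient
# bit, shift the remainder left, and let the quotient bit choose the next
# operation (0 -> add, 1 -> subtract).
def nonrestoring_div(b, a, nbits=5):
    """Quotient bits MSB first for Q = B / A (B < A, fractional result)."""
    mask = (1 << nbits) - 1
    neg_a = (-a) & mask               # two's complement of the divisor
    q_bits = []
    r = b + neg_a                     # first step: subtract A
    q = (r >> nbits) & 1              # quotient bit = carry out
    q_bits.append(q)
    r &= mask
    for _ in range(nbits - 1):
        r = (r << 1) & mask           # shift remainder left
        r = r + (a if q == 0 else neg_a)   # add back, or subtract again
        q = (r >> nbits) & 1
        q_bits.append(q)
        r &= mask
    return q_bits

print(nonrestoring_div(0b01011, 0b01101))   # [0, 1, 1, 0, 1] -> Q = 0.1101
```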
Division (ii)
[Figure: long-division working - 01011.0000 divided by 01101 produces the quotient 00000.1101, shifting the partial remainder and subtracting the divisor (divisor_in) at each step]
53
There is an alternative way to compute division using another paper and pencil technique.
VHDL Design
54
An important aspect of division is to note that the quotient is generated MSB first - unlike multiplication or addition/subtraction! This has implications for the rest of the system. It is unlikely that the quotient can be passed on to the next stage until all the bits are computed - hence slowing down the system! Also, an N by N array has another problem - ripple through adders. Note that we must wait for N full adder delays before the next row can begin its calculations. Unlike multiplication there is no way around this, and as a result division is always slower than multiply even when performed on a parallel array - an N by N multiply will run faster than an N by N divide!
By looking at the top two rows of a 4 x 4 division array we can see that the first bit to get generated is the MSB of the quotient. This is unlike the multiplication array that can also be seen below, where the LSB is generated first. This is a problem when using division as most operations require the LSBs to start a computation and hence the whole solution will have to be generated before the next stage can begin. Another problem for division is the fact that it takes N full adder delays before the next row can start. In the examples below, the order in which the cells can start has been shown. So for the multiplier, the first cell on the second row is the 3rd cell to start working. However, for the divider, the first cell on the second row is only the 5th cell to start working because it has to wait for the 4 cells on the first row to finish.
[Figure: the top rows of a 4 x 4 division array and a 4 x 4 multiplier array, with cells numbered in the order they can begin computing - in the divider the first cell of the second row is only the 5th to start, because it must wait for the whole first row, whereas in the multiplier it is the 3rd]
55
To increase the throughput, the critical path can be broken down by implementing pipeline delays at appropriate points. If pipelining is not used, the delay (critical path) from new data arriving to registering the full quotient is N^2 full adders. This delay represents the maximum rate at which new data can enter the array. However, by pipelining the array, the critical path is broken down to just N full adders and thus the rate at which new data can arrive is increased dramatically.
[Figure: the division array with pipeline registers inserted between rows, breaking the critical path down to N full adders]
56
[Figure: non-restoring square root array - controlled add/subtract cells compute B = sqrt(A), consuming two radicand bits per row]
The square root is found (with divides) in DSP in algorithms such as QR algorithms, vector magnitude calculations and communications constellation rotation.
Looking carefully at the non-restoring square root array, we can note that this array is essentially half of the division array! If the division array above is cut diagonally from the left we can see the cells that are needed for the square root array. The 2 extra cells on the right hand side are standard cells which can be simplified. So square root can be performed twice as fast as divide using half of the hardware!
[Figure: worked example - the radicand A = 10 11 01 01 (181) is processed two bits per row, each row's carry generating a result bit b3 = 1, b2 = 1, b1 = 0, b0 = 1, i.e. B = 1101 (13)]
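The one-result-bit-per-two-radicand-bits behaviour can be sketched with a restoring-style digit-by-digit method (simpler to write than the non-restoring array, but the same rate; the function name is ours):

```python
# Sketch of bit-by-bit integer square root: bring down two bits of the
# radicand per step and test whether the next result bit can be a 1.
def isqrt_bitwise(a, nbits=8):
    """floor(sqrt(a)) for an nbits-wide radicand, one result bit per step."""
    root, rem = 0, 0
    for i in range(nbits // 2 - 1, -1, -1):
        rem = (rem << 2) | ((a >> (2 * i)) & 0b11)   # bring down two bits
        trial = (root << 2) | 1                      # candidate subtrahend
        root <<= 1
        if trial <= rem:                             # next bit is a 1
            rem -= trial
            root |= 1
    return root

print(isqrt_bitwise(0b10110101))   # sqrt(181) -> 1101 = 13
```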
57
The main appearance of square roots and divides is in advanced adaptive algorithms such as QR using Givens rotations. For these techniques we often find equations of the form:

cos(theta) = x / sqrt(x^2 + y^2)   and   sin(theta) = y / sqrt(x^2 + y^2)
So in fact we actually have to perform two squares, a divide and a square root. (Note that squaring is simpler than multiply!) There are a number of iterative techniques that can be used to calculate square root. (However these routines invariably require multiplies and divides and do not converge in a fixed time.) There seems to be some misinformation out there about square roots: for FPGA implementation, square roots are easier and cheaper to implement than divides!
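A quick numerical sketch of the Givens rotation itself (illustrative only; the function name is ours):

```python
import math

# Sketch of a Givens rotation as used in QR: compute c = x/sqrt(x^2 + y^2)
# and s = y/sqrt(x^2 + y^2), then rotate the vector (x, y) so that its
# second component is annihilated - two squares, a square root and divides.
def givens(x, y):
    mag = math.sqrt(x * x + y * y)
    return x / mag, y / mag

x, y = 3.0, 4.0
c, s = givens(x, y)
print(c, s)                             # 0.6 0.8
print(c * x + s * y, -s * x + c * y)    # magnitude 5.0, and approx. 0
```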