Fpga Notes April23
Introduction to FPGA
In the last 20 years the majority of DSP applications have been enabled by DSP processors from vendors such as Texas Instruments, Motorola and Analog Devices. A number of DSP cores have also been available: Oak Core, LSI Logic ZSP, 3DSP.
ASICs (Application Specific Integrated Circuits) have been widely used for specific (high volume) DSP applications. But the most recent technology platform for high speed DSP applications is the Field Programmable Gate Array (FPGA). This course is all about the why and how of DSP with FPGAs!
R. Stewart, Dept EEE, University of Strathclyde, 2010
Notes:
DSP is all about multiplies and accumulates/adds (MACs). As we progress through the course, we will see that most algorithms that are used for different applications employ digital filters, adaptive filters, Fourier transforms and so on. These algorithms all require multiplies and adds (note that a divide or square root is quite a rare thing in DSP). Hence a DSP algorithm or problem is often specified in terms of its MAC requirements. In particular, when comparing two algorithms, if they both perform the same job but one requires fewer MACs than the other, then clearly the cheaper one would be the best choice. However this implies some assumptions. One is that the required MACs are the same - but surely a multiply is a multiply! Well, yes: in the traditional DSP processor based situation we are likely to be using, say, a 16 bit device which will process 16 bit inputs, using 16 bit digital filter coefficients etc. With FPGAs this constraint is removed - we can use as many, or as few, bits as are required. Therefore we can choose to optimise and schedule DSP algorithms in a completely different way.
[Diagram: a voltage input sampled by an ADC, processed by a DSP56307 processor, and converted back to a voltage by a DAC.]
Since around 1998 the evolution of FPGAs into the DSP market has been sustained by classic technology progress such as the ever present Moore's law. Late 1990s FPGAs allowed multipliers to be implemented in FPGA logic fabric; a few multipliers per device were possible. Early 2000s FPGAs: vendors placed hardwired multipliers onto the device with clocking speeds of > 100MHz, and with the number of multipliers ranging from 4 to > 500. Mid 2000s: FPGA vendors placed DSP algorithm signal flow graphs (SFGs) onto devices; full (pipelined) FIR filter SFGs, for example, are available. Late 2000s to early 2010s - who knows! Probably more DSP power, more arithmetic capability (fast square root, divide), FFTs, more floating point support. Rest assured, more DSP is coming....
Technology just keeps moving.
Anyone who has purchased a new laptop knows the feeling. If you just wait, then in the next quarter you will get the new model with integrated WiFi or WiMax, or a faster processor. Of course, wait another quarter and in 6 months it will be improved again - and the new faster, better, bigger machine is likely to be cheaper too! Such is technology. DSP for FPGAs is just the same. If you wait another year it's likely the vendors will bring out prepackaged algorithms for precisely what you want to do. And they will be easier to work with - higher level design tools, design wizards and so on. So if you are planning to design a QR adaptive equalizing beamformer for MIMO implementation of a software radio for 802.16 - then if you wait, it will probably be a free download in a few years. But of course who can wait? Therefore in this course, we discuss and review the fundamental strategies of designing DSP for FPGAs. Like all technologies, you still need to know how it works if you really want to use it.
Ver 10.423
We might be tempted to think of the latest FPGAs as repositories of DSP components just waiting to be connected together. Of course the resource is finite and the connections available are finite. In the days of circuit boards one had to be careful about running busses close together, lengths of wires, etc. Similar considerations are required for FPGAs (albeit out of your direct control). However, the high level concept is: take the blocks, and build it:
[Diagram: FPGA building blocks - clocks, input/output, registers and memory, connectors, logic and arithmetic - with a design, verify, place and route flow.]
This is undoubtedly the modern concept of FPGA design. Take the blocks, connect them together and the algorithm is in place. Do we actually need an FPGA/IC engineer then? Do we actually need a DSP engineer? Yes in both cases, but modern toolsets and design flows are such that it might be the same person. There is lots to worry about. In terms of the DSP design: is the arithmetic correct (i.e. overflows, underflows, saturates etc)? Do the latency or delays used allow the integrity to be maintained? For the FPGA: can we clock at a high enough rate? Does the device place and route? What device do we need, and how efficient is the implementation? (Just like compilers, different vendors' design flows will give different results, some better than others.) As vendors provide higher level components (like the DSP48 slice from Xilinx, which allows a complete FIR to be implemented), issues such as overflow, numerical integrity and so on are taken care of.
So with a MAC (multiply and accumulate/add) of two N bit numbers we could, in the worst case, end up with 2N+1 bits wordlength.
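This worst-case growth can be sketched in a few lines of Python (an illustration, not from the notes), using unsigned N-bit operands for simplicity; signed 2s complement behaves similarly:

```python
# Worst-case wordlength growth of a single multiply-accumulate (MAC):
# the product of two n-bit values needs up to 2n bits, and one
# accumulate on top of that needs up to 2n+1 bits.
def mac_worst_case_bits(n):
    big = 2**n - 1                   # largest n-bit unsigned value
    product = big * big              # multiply: up to 2n bits
    accumulated = product + product  # one accumulate: up to 2n+1 bits
    return product.bit_length(), accumulated.bit_length()

mul_bits, mac_bits = mac_worst_case_bits(16)
print(mul_bits, mac_bits)   # 32 33 -- a 32-bit product, a 33-bit MAC result
```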
If the wordlength grows beyond the maximum value you can store we clearly have the situation of numerical overflow, which is a non-linear operation and not desirable. Within traditional DSP processors this wordlength growth is well known and catered for. For example, the Motorola 56000 series is so called because it has a 56 bit accumulator, i.e. the largest result of any addition operation can have 56 bits. For a typical DSP filtering type operation we may require to take, say, an array of 24 bit numbers and multiply by another array of 24 bit numbers. The result of each multiply will be a 48 bit number. If we then add two 48 bit numbers together, and they both just happen to be large positive values, then the result could be a 49 bit number. Now if we add many 48 bit numbers together (and they all just happen to be large positive values), then the final result may have a word growth of quite a few bits. So one must assume that Motorola had a good look at this, and realised it was fairly unlikely that the result of adding these 48 bit products together would ever be larger than 56 bits - so 56 bits was chosen. (Of course if you did have a problem that grew beyond 56 bits you would have to put special trapping into the code to catch this.)
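A common rule of thumb for sizing the accumulator can be sketched in Python (my illustration, not a formula from the notes): each product needs 2n bits, and summing k of them adds up to ceil(log2(k)) guard bits. On that reckoning, the 8 guard bits of a 56 bit accumulator cover sums of up to 256 worst-case 48 bit products:

```python
import math

# Accumulator width needed to sum k worst-case products of two n-bit
# words without overflow: 2n bits per product, plus ceil(log2(k))
# guard bits for the additions (an unsigned-style bound).
def accumulator_bits(n, k):
    return 2 * n + math.ceil(math.log2(k))

print(accumulator_bits(24, 2))    # 49 -- adding two 48 bit products
print(accumulator_bits(24, 256))  # 56 -- e.g. the 56000's accumulator
```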
[Diagram: a 4-bit ripple-carry adder. A3 A2 A1 A0 + B3 B2 B1 B0, with carry-in 0, generates the carries C3 C2 C1 C0 and the sum S4 S3 S2 S1 S0 (S4 is the MSB, S0 the LSB).]
The simple Full Adder (FA): Adds two bits + one carry in bit, to produce sum and carry out
[Diagram: a full adder cell with inputs A, B and Cin, and outputs Sout and Cout. Worked example: 11 + 13 = 24.]
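The full adder and the ripple-carry chain built from it can be sketched behaviourally in Python (an illustration, not from the notes), reproducing the 11 + 13 = 24 example:

```python
# Gate-level sketch of the Full Adder (FA) and a 4-bit ripple-carry
# adder built from it.
def full_adder(a, b, cin):
    s = a ^ b ^ cin                         # sum bit
    cout = (a & b) | (a & cin) | (b & cin)  # carry out
    return s, cout

def ripple_add(a_bits, b_bits):
    """Add two little-endian bit lists [A0, A1, ...], with carry-in 0."""
    carry, sums = 0, []
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        sums.append(s)
    return sums + [carry]                   # S0..S3 plus S4 (final carry)

a = [1, 1, 0, 1]   # 11 = 1011, LSB first
b = [1, 0, 1, 1]   # 13 = 1101, LSB first
s = ripple_add(a, b)
print(sum(bit << i for i, bit in enumerate(s)))  # 24
```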
[Diagram: an array multiplier formed from a grid of cells, combining multiplicand bits a3 a2 a1 a0 with multiplier bits b3 b2 b1 b0 to produce the product bits p7 p6 p5 p4 p3 p2 p1 p0.]
Therefore an N by N multiply requires N^2 cells... so, for example, a 16 bit multiply is nominally 4 times more expensive to perform than an 8 bit multiply.
Each cell is composed of a Full Adder (FA) and an AND gate, plus some broadcast wires:
[Diagram: worked example 11 x 9 = 99, formed by summing the partial products.]
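The partial-product arithmetic the cell array performs can be sketched in Python (an illustration, not from the notes): each partial product is the multiplicand ANDed with one multiplier bit, shifted into place, then summed:

```python
# Shift-and-add (partial product) multiplication, as performed by the
# array of FA-plus-AND-gate cells.
def array_multiply(a, b, n=4):
    result = 0
    for i in range(n):
        bit = (b >> i) & 1            # one multiplier bit
        result += (a * bit) << i      # AND gates form a*bit; adders sum it
    return result

print(array_multiply(11, 9))  # 99, as in the worked example
```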
Designs were produced by interconnecting the gates to form combinational and sequential functions.
The NAND gate is often called the Universal gate, meaning that it can be used to produce any Boolean logic function. The early gate array design flow would be design, simulate/verify, device production and test. Metal layers make simple connections between the gates - for example, to implement Z = AB + CD from inputs A, B, C and D.
Early simulators and netlisters such as HILO (from GenRad) were used. From GA to FPGA. However simple gate arrays, although very generic, were used by many different users for similar systems... for example to implement two level logic functions, flip-flops and registers, and perhaps addition and subtraction functions. For a GA, once a layer (or layers) of metal had been laid on a device - that's it! No changes, no updates, no fixes. So then we move to field programmable gate arrays. There are two key differences between these and gate arrays: they can be reprogrammed in the field, i.e. the logic specified is changeable; and they are no longer composed of just NAND gates, but a carefully balanced selection of multi-input logic, flip-flops, multiplexors and memory.
Arrays of gates and higher level logic blocks might be referred to as the logic fabric...
[Diagram: a generic FPGA logic fabric - an array of logic blocks surrounded by I/O blocks, connected by row and column interconnects.]
The logic block in this generic FPGA contains a few logic elements. Different manufacturers will include different elements (and use different terms for logic block, e.g. slices etc). A simple logic block might contain the following:
[Diagram: a logic element comprising a LUT feeding a flip flop, with interconnects between logic elements.]

Of course the actual contents of a logic element will vary from manufacturer to manufacturer and device to device.
Looking more specifically at recent Xilinx FPGAs, we also find block RAMs and dedicated arithmetic blocks... both very useful for DSP!

[Diagram: an FPGA floorplan with columns of block RAM embedded in rows of logic fabric.]
One of the major features of more recent, DSP-targeted Xilinx FPGAs is the provision of dedicated arithmetic blocks, which offer lower power and higher clock frequency operation than the logic fabric (i.e. the array of CLBs). These can be configured to perform a number of different computations, and are especially suited to the Multiply Accumulate (MAC) operations prevalent in digital filtering. Block RAMs are also used extensively in DSP. Example uses are for storing filter coefficients, encoding and decoding, and other tasks. Despite the inclusion of these additional resources, the logic fabric still forms the majority of the FPGA. We will now look at the CLBs which comprise the logic fabric, and how they are connected together, in further detail.
Xilinx FPGA logic fabric comprises Configurable Logic Blocks (CLBs), which are groups of SLICEs (e.g. 2 or 4 SLICEs per CLB). Signals travel between CLBs via routing resources. Each CLB has an adjacent switch matrix for (most) routing.

[Diagram: a CLB containing slices, with its adjacent switch matrix connecting to other CLBs.]
The example in the main slide features a typical Xilinx FPGA architecture, and the Altera and Lattice architectures are different. Logic units differ in size, composition and name! However, in all cases, their Logic Blocks include both combinational logic and registers, and routing resources are required for connecting blocks together. Continuing with the Xilinx example, the combinational blocks are termed Lookup Tables (LUTs), and in most devices have 4 inputs (some of the more recent devices have 6-input LUTs). These LUTs can be utilised in four modes: To implement a combinatorial logic function As Read Only Memory (ROM) As Random Access Memory (RAM) As shift registers
Over the next few slides, the functionality of LUTs and registers in the above modes will be described.
When used to implement a logic function, the 4-bit input addresses the LUT to find the correct output, Z, for that combination of A, B, C and D.
[Diagram: a 4-input lookup table with inputs A, B, C and D, and the Karnaugh map of the output Z.]
The lookup table can also implement a ROM, containing sixteen 1-bit values. Instead of the four inputs representing inputs of a logic function, they can be thought of as a 4-bit address. A 1-bit value is stored within each memory location, and the appropriate output is supplied for any input address. In this example, A is considered the Most Significant Bit (MSB) and D the Least Significant Bit (LSB), and the output is Z.
A B C D | Z
0 0 0 0 | 1
0 0 0 1 | 1
0 0 1 0 | 0
0 0 1 1 | 1
0 1 0 0 | 1
0 1 0 1 | 0
0 1 1 0 | 0
0 1 1 1 | 0
1 0 0 0 | 0
1 0 0 1 | 0
1 0 1 0 | 0
1 0 1 1 | 0
1 1 0 0 | 0
1 1 0 1 | 0
1 1 1 0 | 1
1 1 1 1 | 1
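A LUT in this mode is just 16 stored bits addressed by the inputs, which is easy to model in Python (an illustration, not from the notes); the contents below are the Z column of the truth table above, with A as the MSB of the address:

```python
# A 4-input LUT modelled as 16 stored bits: inputs A, B, C, D form an
# address (A = MSB), and the stored bit at that address is the output Z.
LUT = [1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

def lut_read(a, b, c, d):
    address = (a << 3) | (b << 2) | (c << 1) | d
    return LUT[address]

print(lut_read(0, 0, 0, 0))  # 1
print(lut_read(1, 1, 1, 1))  # 1
```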
LUTs can also be configured as distributed RAM, in either single port or dual port modes. Single port: 1 address for both synchronous write operations and asynchronous read operations. Dual port: 1 address for both synchronous write operations and asynchronous read operations, and 1 address for asynchronous reads only.
Larger RAMs can be constructed by connecting two or more LUTs together. Dual port RAM requires more resources than single port RAM. For example, a 32x1 Single Port RAM (32 addresses, 1 bit data) requires two 4-input LUTs, whereas an equivalent Dual Port RAM requires four 4-input LUTs.
The two diagrams below demonstrate the implementation of 16x1 single and dual port RAMs in the Virtex II Pro device, respectively. Notice that the dual port RAM requires twice as many resources as the single port RAM.
[Diagrams: 16x1 Single Port and Dual Port RAM implementations. Source: Virtex-II Pro and Virtex-II Pro X Platform FPGAs: Complete Data Sheet, DS083 (v4.7), November 5, 2007.]
A final alternative is to use the LUT as a Shift Register of up to 16 bits. Additional Shift In and Shift Out ports are used, and the 4-bit address is used to define the memory location which is asynchronously read. For example, if the LUT input is the word 1001, the output from the 10th register is read, as depicted below.
[Diagram: a LUT configured as a 16-bit shift register: SHIFT IN feeds a chain of 16 clocked registers, the 4-bit LUT input (e.g. 1001) selects which register output appears on D OUT, and the final register drives SHIFT OUT.]
The slice register at the output from the LUT can be used to add another 1 clock cycle delay. Using the register also synchronises the read operation.
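The behaviour can be sketched with a small Python model (an illustration, not vendor code): each clock shifts a bit in, and the 4-bit address asynchronously selects which of the 16 stages is read:

```python
# Behavioural sketch of a LUT used as a 16-bit shift register (SRL16).
class SRL16:
    def __init__(self):
        self.regs = [0] * 16           # stage 0 (newest) .. stage 15 (oldest)

    def clock(self, shift_in):
        self.regs = [shift_in] + self.regs[:-1]   # shift one bit in

    def read(self, address):
        return self.regs[address]      # e.g. address 0b1001 reads stage 9

srl = SRL16()
for bit in [1, 0, 1, 1]:
    srl.clock(bit)
print(srl.read(0), srl.read(3))  # newest bit, and the oldest of the four
```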
As with the other LUT configurations, larger Shift Registers can be constructed by combining several LUTs. For example, a 64-bit shift register segment can be constructed by combining four 16-bit Shift Registers together, as shown below. The cascadable ports allow further interconnections for larger Shift Registers.
[Diagram: a 64-bit shift register built from four cascaded SRL16s across two slices, with the MSB of each SRL16 feeding the DI input of the next, and optional flip flops (FF) at each output. Cascadable ports allow further interconnections for larger shift registers.]
Registers
The sequential logic element which follows the lookup table can be configured as either: An edge-triggered D-type flip flop; or A level-sensitive latch
The input to the register may be the output from the LUT, or alternatively a direct input to the slice (i.e. the LUT is bypassed).
[Diagram: a slice register fed either by the LUT output or by a bypass input.]
A D-type flip flop provides a delay of one clock cycle, as confirmed by the truth table below (D(t) is the register input at time t, and Q(t+1) is the output 1 clock cycle later). A clock signal and control inputs (set, reset, etc.) are also provided.
D(t) | Q(t+1)
0    | 0
1    | 1
When configured as a latch, the control inputs define when data on the D input is captured and stored within the register. The Q output thereafter remains unchanged until new data is captured. Flip flops and registers are discussed in the Digital Logic Review notes chapter.
Whereas registers can be reset, SRL16s cannot. Therefore, adding reset capabilities to a design has implications for resource utilisation. For example, consider an 8-bit shift register. If resettable, then each element requires a slice register. If not, then an SRL16 can be used.
[Diagrams: a resettable implementation using eight slice registers across Slices 1-4, sharing INPUT, CLOCK and RESET; and a non-resettable implementation using a single SRL16.]
We can still design a resettable shift register with an SRL16, by using a slightly more sophisticated design. Instead of making all elements resettable, we can implement the first element using a slice register, and the subsequent ones using an SRL16. The reset signal is held high for 8 clock cycles, which allows the 0 input to propagate through the shift register. Instead of using 4 slices, this design would require 2 slices at most.
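The flushing idea can be sketched in Python (a simplified behavioural illustration; the real design also uses a resettable slice register for the first element): holding the input at 0 for 8 clocks clears all 8 stages a design uses:

```python
# Reset an SRL16-based shift register by flushing: clock in a 0 on
# each of 8 cycles, so the first 8 stages all read as 0 afterwards.
def reset_by_flushing(regs, stages=8):
    for _ in range(stages):
        regs = [0] + regs[:-1]     # one clock cycle with a 0 input
    return regs

regs = [1] * 16                    # arbitrary pre-reset contents
regs = reset_by_flushing(regs)
print(regs[:8])                    # the first 8 stages now hold 0
```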
[Diagram: a slice register followed by an SRL16. Hold the RESET signal high for 8 clock cycles to reset the shift register.]
The implementation of arithmetic operations in FPGA hardware is an integral and important aspect of DSP design. The following key issues are presented in this chapter: Number representations: binary word formats for signed and unsigned integers, 2s complement, fixed point and floating point; Binary arithmetic, including: Overflow and underflow, saturation, truncation and rounding Structures for arithmetic operations: Addition/Subtraction, Multiplication, Division and Square Root; Complex arithmetic operations; Mapping to Xilinx FPGA hardware ... including special resources for high speed arithmetic
This section of the course will introduce the following concepts: Integer number representations - unsigned, ones complement, twos complement. Non-integer number representations - fixed point, floating point. Quantisation of signals, truncation, rounding, overflow, underflow and saturation.
Addition - decimal, twos complement integer binary, twos complement fixed point, hardware structures for addition, Xilinx-specific FPGA structures for addition. Multiplication - decimal, 2s complement integer binary, twos complement fixed point, hardware structures for multiplication, Xilinx-specific FPGA structures for multiplication. Division. Square root. Complex addition. Complex multiplication.
Number Representations
DSP, by its very nature, requires quantities to be represented digitally using a number representation with finite precision. This representation must be specified to handle the real-world inputs and outputs of the DSP system. Sufficient resolution Large enough dynamic range
The number representation specified must also be efficient in terms of its implementation in hardware. The hardware implementation cost of an arithmetic operator increases with wordlength. The relationship is not always linear!
The use of binary numbers is a fundamental of any digital systems course, and is well understood by most engineers. However when dealing with large complex DSP systems, there can be literally billions of multiplies and adds per second. Therefore any possible cost reduction from reducing the number of bits of representation is likely to be of significant value. For example, assume we have a DSP filtering application using 16 bit resolution arithmetic. We will show later (see Slide 42) that the cost of a parallel multiplier (in terms of silicon area - speed product) can be approximated as the number of full adder cells. Therefore for a 16 bit by 16 bit parallel multiply the cost is of the order of 16 x 16 = 256 cells. The wordlength of 16 bits has been chosen (presumably) because the designer at some time demonstrated that 17 bits was too many bits, and 15 was not enough - or did they? Probably not! It's likely that we are using 16 bits because... well, that's what we usually use in DSP processors and we are creatures of habit! In the world of FPGA DSP arithmetic you can choose the resolution. Therefore, if it was demonstrated that in fact 9 bits was sufficient resolution, then the cost of a multiplier is 9 x 9 cells = 81 cells. This is approximately 30% of the cost of using 16 bit arithmetic. Therefore it's important to get the wordlength right: too many bits wastes resources, and too few bits loses resolution. So how do we get it right? Well, you need to know your algorithms and DSP.
bit weightings (MSB to LSB): 2^(n-1), 2^(n-2), 2^(n-3), ..., 2^2, 2^1, 2^0

Example: 0 1 0 1 0 0 1 0
= 0x2^7 + 1x2^6 + 0x2^5 + 1x2^4 + 0x2^3 + 0x2^2 + 1x2^1 + 0x2^0 = 64 + 16 + 2 = 82
The numerical range of an n-bit unsigned number is: 0 to 2n - 1. For example, an 8-bit word can represent all integers from 0 to 255.
Taking the example of an 8-bit unsigned number, the range of representable values is 0 to 255:
Integer Value | Binary
0   | 00000000
1   | 00000001
2   | 00000010
3   | 00000011
4   | 00000100
64  | 01000000
65  | 01000001
131 | 10000011
255 | 11111111
Note that the minimum value is 0, and the maximum value (255) is the sum of the powers of two from 2^0 to 2^7, where 8 is the number of bits in the binary word: i.e. 2^0 + 2^1 + 2^2 + 2^3 + 2^4 + 2^5 + 2^6 + 2^7 = 255 = 2^8 - 1
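Both facts are easy to confirm in Python (a quick illustration, not from the notes):

```python
# Unsigned interpretation of an n-bit word: the range is 0 to 2**n - 1.
def unsigned_value(bits):
    return int(bits, 2)

print(unsigned_value('01010010'))  # 82, as in the example above
print(unsigned_value('11111111'))  # 255, the 8-bit maximum = 2**8 - 1
```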
2s Complement caters for positive and negative numbers in the range -2^(n-1) to +2^(n-1) - 1, and has only one representation of 0 (zero). In 2s complement, the MSB has a negative weighting:
bit weightings (MSB to LSB): -2^(n-1), 2^(n-2), 2^(n-3), ..., 2^2, 2^1, 2^0

The most negative and most positive numbers are represented by:

most negative: 1 0 0 0 ... 0 0 0 (= -2^(n-1))
most positive: 0 1 1 1 ... 1 1 1 (= +2^(n-1) - 1)
As examples, we can convert the following two 8-bit 2s complement words to decimal. As for the unsigned representation, the decimal number 82 is 01010010 in 2s complement signed format:
0 1 0 1 0 0 1 0
= 0x(-2^7) + 1x2^6 + 0x2^5 + 1x2^4 + 0x2^3 + 0x2^2 + 1x2^1 + 0x2^0 = 64 + 16 + 2 = 82

...and the decimal number -82 is 10101110:

1 0 1 0 1 1 1 0
= 1x(-2^7) + 0x2^6 + 1x2^5 + 0x2^4 + 1x2^3 + 1x2^2 + 1x2^1 + 0x2^0 = -128 + 32 + 8 + 4 + 2 = -82
2s Complement Conversion
For 2s Complement, converting between negative and positive numbers involves inverting all bits, and adding 1. For example, we have just considered 2s complement representations of the decimal numbers -82 and +82. They are converted as shown:
+82:     0 1 0 1 0 0 1 0
invert:  1 0 1 0 1 1 0 1
add 1: + 0 0 0 0 0 0 0 1
-82:     1 0 1 0 1 1 1 0

-82:     1 0 1 0 1 1 1 0
invert:  0 1 0 1 0 0 0 1
add 1: + 0 0 0 0 0 0 0 1
+82:     0 1 0 1 0 0 1 0
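The invert-and-add-1 rule is mechanical enough to sketch in Python (an illustration, not from the notes), including the discarded carry out of the top bit:

```python
# 2s complement negation of an n-bit word: invert all bits, add 1
# (any carry out of the top bit is discarded).
def negate(word, n=8):
    inverted = word ^ ((1 << n) - 1)        # invert all n bits
    return (inverted + 1) & ((1 << n) - 1)  # add 1, keep n bits

plus82 = 0b01010010
minus82 = negate(plus82)
print(format(minus82, '08b'))          # 10101110
print(format(negate(minus82), '08b'))  # 01010010 -- back to +82
```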
Positive Numbers: 0 = 00000000, 1 = 00000001, 2 = 00000010, 3 = 00000011. To negate, invert all bits and ADD 1.
Note that when negating zero, the carry produces a ninth bit (inverting 00000000 gives 11111111, and adding 1 gives 100000000). However, if we simply ignore this ninth bit, the representation of negative zero becomes identical to the representation of positive zero. Notice from the above that -128 can be represented but +128 cannot.
Bits on the left of the binary point are termed integer bits, and bits on the right of the binary point are termed fractional bits. The format of a generic fixed point word, comprising n integer bits and b fractional bits, is: : binary point
bit weightings (MSB to LSB): 2^(n-1), 2^(n-2), ..., 2^1, 2^0 (the n integer bits), then the binary point, then 2^-1, 2^-2, ..., 2^-(b-1), 2^-b (the b fractional bits).
The MSB has -ve weighting for 2s complement (as for integer words).
As examples, we consider the 2s complement word 11010110 with the binary point in two different places. Firstly, with the binary point to the left of the third bit, i.e. 5 integer bits and 3 fractional bits:
1 1 0 1 0 . 1 1 0
= 1x(-2^4) + 1x2^3 + 0x2^2 + 1x2^1 + 0x2^0 + 1x2^-1 + 1x2^-2 + 0x2^-3 = -16 + 8 + 2 + 0.5 + 0.25 = -5.25
...and secondly, with the binary point to the left of the fifth bit, i.e. 3 integer bits and 5 fractional bits:
1 1 0 . 1 0 1 1 0
= 1x(-2^2) + 1x2^1 + 0x2^0 + 1x2^-1 + 0x2^-2 + 1x2^-3 + 1x2^-4 + 0x2^-5 = -4 + 2 + 0.5 + 0.125 + 0.0625 = -1.3125
Note that these results are related by a factor of 22 = 4, i.e. 4 x -1.3125 = -5.25.
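Interpreting a fixed point word programmatically amounts to reading it as a signed integer and scaling by 2^-b; a minimal Python sketch (an illustration, not from the notes) reproduces both examples:

```python
# Interpret an n-bit 2s complement word with frac_bits fractional bits:
# read it as a signed integer, then scale by 2**-frac_bits.
def fixed_point_value(bits, frac_bits):
    n = len(bits)
    raw = int(bits, 2)
    if bits[0] == '1':
        raw -= 1 << n              # the MSB carries a negative weighting
    return raw / (1 << frac_bits)

print(fixed_point_value('11010110', 3))  # -5.25
print(fixed_point_value('11010110', 5))  # -1.3125
```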
As with integer representations (which are also effectively fixed point numbers, but with the binary point at position 0), the binary range of fixed point numbers extends from:
most negative: 10000.....0000
most positive: 01111.....1111
The same number of quantisation levels is present, e.g. for an 8-bit binary word, 256 levels can be represented. Numerical range scales according to the binary point position, e.g.:
Dynamic range (range / interval) is independent of the binary point position, e.g. (127-(-128))/1 = 255 = (63.5-(-64))/0.5
To illustrate this further, let's consider the very simple case of a 3-bit 2s complement word, with the binary point in all four possible positions. Clearly the numerical range is affected by the binary point position, but the relationship between the interval and range remains the same.

binary point position 0: range -4 to +3, interval = 1
binary point position 1: range -2 to +1.5, interval = 0.5
binary point position 2: range -1 to +0.75, interval = 0.25
binary point position 3: range -0.5 to +0.375, interval = 0.125
Fixed-point Quantisation
Consider the fixed point number format with 3 integer bits and 5 fractional bits: n n n . b b b b b

Numbers between -4 and +3.96875 can be represented, in steps of 0.03125. As there are 8 bits, there are 2^8 = 256 different values. Revisiting our sine wave example using this fixed-point format, this looks much more accurate: the quantisation error is +/- LSB/2 (where LSB = least significant bit)... so 0.015625 rather than 0.5!
Quantisation is simply the DSP term for the process of representing infinite precision numbers with finite precision numbers. In the decimal world, it is familiar to most to work with a given number of decimal places. The real number pi can be represented as 3.14159265.... and so on. We can quantise or represent it to 4 decimal places as 3.1416. If we use rounding here then the error is: 3.1416 - 3.14159265 = 0.00000735. If we truncated (just chopped off the digits below the 4th decimal place) then the error is larger: 3.14159265 - 3.1415 = 0.00009265. Clearly rounding is most desirable to maintain the best possible accuracy. However it comes at a cost. Albeit the cost is relatively small, it is however not free. When multiplying fractional numbers we will choose to work to a given number of places. For example, if we work to two decimal places then the result of the calculation 0.57 x 0.43 = 0.2451 can be rounded to 0.25, or truncated to 0.24. The results are different. Once we start performing billions of multiplies and adds in a DSP system it is not difficult to see that these small errors can begin to stack up.
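The same comparison can be sketched in Python (my illustration, not from the notes), here quantising to decimal places to match the example; quantising to binary fractional bits works identically with powers of 2 in place of powers of 10:

```python
import math

# Quantisation to d decimal places by truncation (chop towards zero
# for positive values) versus rounding to nearest.
def truncate(x, d):
    return math.floor(x * 10**d) / 10**d

def round_nearest(x, d):
    return math.floor(x * 10**d + 0.5) / 10**d

product = 0.57 * 0.43                         # 0.2451
print(truncate(product, 2), round_nearest(product, 2))  # 0.24 0.25
```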
The binary point position has a power-of-2 relationship with the numerical range of the word format, and any number it represents. Therefore if we want to multiply or divide a fixed point number by a power-of-two, this is achieved by simply shifting the numbers with respect to the binary point!
original number: 0 1 0 1 1 1 against weightings -8, 4, 2, 1, 0.5, 0.25 (i.e. 0101.11, decimal 5.75)

shift right by 2 places: the same bits against weightings -2, 1, 0.5, 0.25, 0.125, 0.0625 (i.e. 01.0111, decimal 1.4375 = 5.75 / 4)

shift left by 1 place: the same bits against weightings -16, 8, 4, 2, 1, 0.5 (i.e. 01011.1, decimal 11.5 = 5.75 x 2)
Of course, looking at the example in the main slide, we could also consider that the binary point is moved, rather than the bits - it amounts to the same thing! Ultimately the binary point is conceptual - having no effect on the hardware produced - and it falls to the DSP design tool and/or DSP designer to keep track of it. Reviewing the divide-by-4 and multiply-by-2 examples from the main slide... if we move the binary point, the weightings of the bits comprising the word, and hence the value it represents, change by a power-of-two factor.
[Diagram: the same bit pattern with the binary point moved instead of the bits - weightings -8, 4, 2, 1, 0.5, 0.25 give decimal 5.75; weightings -2, 1, 0.5, 0.25, 0.125, 0.0625 give decimal 1.4375; weightings -16, 8, 4, 2, 1, 0.5 give decimal 11.5.]
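Because the binary point is purely conceptual, this is easy to model in Python (an illustration, not from the notes): keep the raw bits as an integer and let a fractional-bits count track where the point sits:

```python
# In fixed point, moving the binary point (equivalently, shifting the
# bits) multiplies or divides by a power of two. The raw integer holds
# the bit pattern; frac_bits tracks the conceptual binary point.
def value(raw, frac_bits):
    return raw / (1 << frac_bits)

raw = 0b010111            # the bit pattern from the example
print(value(raw, 2))      # 0101.11  ->  5.75
print(value(raw, 4))      # 01.0111  ->  1.4375  (shift right 2 = divide by 4)
print(value(raw, 1))      # 01011.1  -> 11.5     (shift left 1  = multiply by 2)
```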
Fixed point word formats with 1 integer bit and a number of fractional bits are often adopted. Numbers from -1 to +1-2-b (i.e. just less than 1) can be represented. Some examples:
bit weightings: -1, 1/2, 1/4, 1/8, 1/16, 1/32, 1/64, 1/128

most -ve: 1 0 0 0 0 0 0 0 = decimal -1
most +ve: 0 1 1 1 1 1 1 1 = decimal +0.9921875
Limiting the numeric range to 1 is advantageous because it makes the arithmetic easier to work with... multiplying two normalised numbers together cannot produce a result greater than 1!
The term Q-format is often used to describe fixed point number formats, usually in the context of DSP processors. However, it is useful to note that Q-format and 2s complement are actually the same thing. Q-format notation is given in the form Qm.n, where m is the number of integer bits, and n is the number of fractional bits. Notably this description excludes the MSB of the 2s complement representation, which Q-format considers a sign bit. Therefore the total number of bits in a Q-format number is 1 + m + n, whereas in 2s complement, the same word format would be described as having m+1 integer bits, and n fractional bits. For example, a Q2.5 number has a sign bit, 2 other integer bits, and 5 fractional bits, and hence can be represented as shown below. In 2s complement, this would be described as having 3 integer bits and 5 fractional bits.
[Diagram: the Q2.5 word - a sign bit, 2 further integer bits, and 5 fractional bits with weightings 2^-1 to 2^-5. In the 2s complement description: 3 integer bits and 5 fractional bits.]
The Q0.15 format (often abbreviated to Q15) is used extensively in DSP as it covers the normalised range of numbers from -1 to +1 - 2-15, and is equivalent to a 16 bit 2s complement representation with the binary point at 15, i.e. 1 integer bit and 15 fractional bits.
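Q15 conversion is a simple scaling by 2^15, which a short Python sketch (an illustration, not from the notes) makes concrete; note the float-to-Q15 direction shown here does not saturate, so inputs outside [-1, +1 - 2^-15) would wrap:

```python
# Q15: a 16-bit 2s complement word with 15 fractional bits, covering
# -1 to +1 - 2**-15.
def q15_to_float(word):
    if word & 0x8000:
        word -= 0x10000            # the sign bit has negative weighting
    return word / 32768.0

def float_to_q15(x):
    return int(round(x * 32768)) & 0xFFFF   # no saturation in this sketch

print(q15_to_float(0x8000))   # -1.0, the most negative Q15 value
print(q15_to_float(0x7FFF))   # 0.999969482421875, the most positive
```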
Working with fractional binary values makes the arithmetic easier to work with, and makes it easier to account for wordlength growth. As an example, take the case of a machine using 4 digit decimals and a 4 digit arithmetic unit - range -9999 to +9999. Multiplying two 4 digit numbers will result in up to 8 significant digits: 6787 x 4198 = 28491826. If we want to pass this number to the next stage in the machine (where arithmetic is 4 digits accuracy) then we need to scale down by 10000 (giving 2849.1826), then truncate (giving 2849). Now consider normalising to the range -0.9999 to +0.9999: 0.6787 x 0.4198 = 0.28491826, which truncates directly to 0.2849.
Of course the two results are exactly identical and the differences are in how we handle the truncate and scale. However using the normalised values, where all inputs are constrained to be in the range from -1 to +1, it's easy to note that multiplying ANY two numbers in this range together will give a result also in the range of -1 to +1. Exactly the same idea of normalisation is applied to binary, and the binary point is implicitly used in most DSP systems. Consider 8 bit values in 2s complement. The range is therefore: 10000000 to 01111111 (-128 to +127)
If we normalise these numbers to between -1 and 1 (i.e. divide through by 128) then the binary range is: 1.0000000 to 0.1111111 ( -1 to 0.9921875, where 127/128 = 0.9921875).
Therefore we apply the same normalising ideas as for decimal to multiplication in binary. Consider multiplying 36 x 97 = 3492, equivalent to 00100100 x 01100001 = 0000110110100100. In binary, normalising the values would give the calculation 0.0100100 x 0.1100001 = 0.00110110100100, which in decimal is equivalent to: 0.28125 x 0.7578125 = 0.213134765625. Note very clearly that in a DSP system the binary point is all in the eye of the designer. There is no physical connection or wire for the binary point. It just makes things significantly easier in keeping track of wordlength growth, and truncating just by dropping fractional bits. Of course if you prefer integers and would like to keep track of the scaling etc you can do this... you will get the same answer and the cost is the same.
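The point that the fractional and integer views are the same bits with the same cost can be checked directly in Python (an illustration, not from the notes):

```python
# The normalised multiply from the notes: fractional values are just
# integers with an implicit scaling of 1/128, so a single integer
# multiply gives the fractional result with a scaling of 1/(128*128).
a_int, b_int = 36, 97                        # 00100100, 01100001
a_frac, b_frac = a_int / 128, b_int / 128    # 0.28125, 0.7578125

product_int = a_int * b_int                  # 3492
product_frac = product_int / (128 * 128)     # rescale the same bits

print(product_frac)                     # 0.213134765625
print(a_frac * b_frac == product_frac)  # True -- same bits, same cost
```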
Ver 10.423
R. Stewart, Dept EEE, University of Strathclyde, 2010
27
An ADC is a device that can convert a voltage to a binary number, according to its specific input-output characteristic.
[Figure: 8-bit ADC characteristic - a staircase mapping from Voltage Input to Binary Output, output codes from -128 to +127, sampling at rate fs; an example sample is the 8 bit output 1 0 0 1 1 1 0 1]
Viewing the straight line portion of the device we are tempted to refer to the characteristic as linear. However a quick consideration clearly shows that the device is non-linear (recall the definition of a linear system from before) as a result of the discrete (staircase) steps, and also that the device clips above and below the maximum and minimum voltage swings. However if the step sizes are small and the number of steps large, then we are tempted to call the device piecewise linear over its normal operating range. Note that the ADC does not necessarily have a linear (straight line) characteristic. In telecommunications for example a defined standard nonlinear quantiser characteristic is often used (A-law and µ-law). Speech signals, for example, have a very wide dynamic range: harsh "oh" and "b" type sounds have a large amplitude, whereas softer sounds such as "sh" have small amplitudes. If a uniform quantisation scheme were used then although the loud sounds would be represented adequately, the quieter sounds may fall below the threshold of the LSB and therefore be quantised to zero and the information lost. Therefore non-linear quantisers are used such that the quantisation level at low input levels is much smaller than for higher level signals. A-law quantisers are often implemented by using a nonlinear circuit followed by a uniform quantiser. Two schemes are widely in use: the A-law in Europe, and the µ-law in the USA and Japan. Similarly the DAC can have a non-linear characteristic.
[Figure: non-linear quantiser characteristic - Binary Output against Voltage Input, with smaller steps at low input levels]
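The non-uniform quantiser can be modelled as a compressor followed by a uniform quantiser. A rough sketch of the standard µ-law compression curve (µ = 255, the usual telephony value; the function name is ours) shows how small amplitudes are expanded before quantisation:

```python
import math

MU = 255  # standard mu-law constant

def mu_law_compress(x):
    """Compress x in [-1, 1]: small amplitudes get proportionally more of
    the output range, so quiet sounds survive uniform quantisation."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

# A quiet input of 0.01 maps to roughly 0.23 of full scale, while a loud
# input of 0.5 maps to about 0.88 - the curve expands small signals.
for x in (0.01, 0.1, 0.5, 1.0):
    print(f"{x:5.2f} -> {mu_law_compress(x):.3f}")
```

A uniform quantiser applied after this compressor gives the effective non-linear staircase described above.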
28
Perfect signal reconstruction assumes that sampled data values are exact (i.e. infinite precision real numbers). In practice they are not, as an ADC will have a number of discrete levels. The ADC samples at the Nyquist rate, and the sampled data value is the closest (discrete) ADC level to the actual value:
[Figure: sampling of s(t) by an ADC at rate fs - the continuous voltage is quantised to the nearest discrete level at each sample time ts, giving the binary-valued sequence v(n) against sample number n]
For example purposes, we can assume our ADC or quantiser has 5 bits of resolution and maximum/minimum voltage swing of +15 and -16 volts. The input/output characteristic is shown below:
[Figure: 5-bit quantiser characteristic, output codes from 10000 (-16) to 01111 (+15)]
In the above slide figure, for the second sample the true sample value is 1.589998..., however our ADC quantises to a value of 2.
Quantisation Error
29
If the smallest step size of a linear ADC is q volts, then the error of any one sample is at worst q/2 volts.
[Figure: linear ADC characteristic - Binary Output from 10000 (-16) to 01111 (+15) against Voltage Input between -Vmax and +Vmax]
Quantisation error is often modelled as an additive noise component nq, and indeed the quantisation process can be considered purely as the addition of this noise to the ideal sample value:
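This additive-noise view is easy to check numerically. A small sketch (the step size and sample values are arbitrary, chosen for illustration):

```python
# Quantisation modelled as additive noise: v(n) = s(n) + nq(n), where the
# rounding error nq is bounded by +/- q/2 for a linear ADC of step size q.
q = 0.25                                   # quantiser step size (volts)

def quantise(x, q):
    """Round to the nearest quantiser level (a linear/uniform ADC)."""
    return q * round(x / q)

samples = [1.589998, -0.42, 0.1249, 3.14159]
for s in samples:
    nq = quantise(s, q) - s                # the equivalent additive noise
    assert abs(nq) <= q / 2                # worst-case error is q/2
    print(f"s={s:8.5f}  v={quantise(s, q):6.2f}  nq={nq:+.5f}")
```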
An Example
Here is an example using a 3-bit ADC:
[Figure: 3-bit ADC example - the input waveform (amplitude/volts against time/seconds), the ADC characteristic (output against input amplitude), the quantised output, and the quantisation error against time]
30
Notes:
In this case the worst case error is 1/2, i.e. q/2 with a step size of q = 1.
31
The full adder circuit can be used in a chain to add multi-bit numbers. The following example shows 4 bits:
[Figure: 4-bit ripple-carry adder - four full adders chained through the carries C0..C3 compute A3 A2 A1 A0 + B3 B2 B1 B0 = S4 S3 S2 S1 S0, with the final carry forming the MSB S4 and S0 the LSB]
This chain can be extended to any number of bits. Note that the last carry output forms an extra bit in the sum. If we do not allow for an extra bit in the sum, if a carry out of the last adder occurs, an overflow will result i.e. the number will be incorrectly represented.
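The chain can be modelled bit-for-bit in software. A minimal behavioural sketch (the list-of-bits representation, LSB first, is our choice):

```python
# Bit-level sketch of the ripple-carry chain described above (not FPGA code):
# each full adder consumes one bit of A and B plus the carry from the right.
def full_adder(a, b, cin):
    s = a ^ b ^ cin                          # sum bit: three-way XOR
    cout = (a & b) | (a & cin) | (b & cin)   # carry out: majority function
    return s, cout

def ripple_add(a_bits, b_bits):
    """Add two little-endian bit lists; the final carry becomes an extra
    sum bit, so a 4-bit + 4-bit addition yields a 5-bit result."""
    carry, out = 0, []
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        out.append(s)
    out.append(carry)                        # S4: keep the last carry out
    return out

# 4-bit example, LSB first: 0111 (7) + 0110 (6) = 01101 (13)
print(ripple_add([1, 1, 1, 0], [0, 1, 1, 0]))   # [1, 0, 1, 1, 0] = 13
```

Dropping the final `out.append(carry)` models the overflow case described above: 15 + 1 would wrap to 0.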
The truth table for the full adder is:

A  B  CIN  S  COUT
0  0  0    0  0
0  0  1    1  0
0  1  0    1  0
0  1  1    0  1
1  0  0    1  0
1  0  1    0  1
1  1  0    0  1
1  1  1    1  1
Sout = A'B'C + A'BC' + AB'C' + ABC = A xor B xor C
Cout = A'BC + AB'C + ABC' + ABC = AB + AC + BC = AB + C(A xor B)
The longest propagation delay path in the above full adder is two gates.
Subtraction
32
Subtraction is very readily derived from addition. Remember 2s complement? All we need to do to get a negative number is invert the bits and add 1. Then if we add these numbers, we'll get a subtraction D = A + (-B):
[Figure: 4-bit 2s complement subtractor - the B3..B0 inputs are inverted, the carry-in is set to 1, the final carry C4 is discarded, and the result is D3 D2 D1 D0]
Note for 4 bit positive numbers (i.e. NOT 2s complement) the range is from 0 to 15. For 2s complement the numerical range is from -8 to 7.
Addition/Subtraction (using 2s complement representation)
Sometimes we need a combined adder/subtractor with the ability to switch between modes. This can be achieved quite easily:
[Figure: combined adder/subtractor - a 2:1 MUX on each B input selects Bi or its inverse under control of K, with K also driving the carry-in]
For: A + B, K = 0. For: A - B, K = 1. This structure will be seen again in the Division/Square Root slides!
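In software the same K-controlled structure looks like this (a behavioural sketch; the 4-bit width and function name are our choices):

```python
# Sketch of the combined adder/subtractor: the MUX on each B input selects
# Bi (K=0) or its complement (K=1), and K also drives the carry-in, so
# A - B is computed as A + ~B + 1 (two's complement negation).
WIDTH = 4
MASK = (1 << WIDTH) - 1

def add_sub(a, b, k):
    """k=0: A + B;  k=1: A - B.  4-bit result, final carry discarded."""
    b_mux = (b ^ (MASK if k else 0)) & MASK   # invert B when subtracting
    return (a + b_mux + k) & MASK             # K is the carry-in

print(add_sub(0b0110, 0b0011, 0))   # 6 + 3 = 9  -> 0b1001
print(add_sub(0b0110, 0b0011, 1))   # 6 - 3 = 3  -> 0b0011
print(add_sub(0b0011, 0b0110, 1))   # 3 - 6 -> 0b1101, i.e. -3 in 2s comp
```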
33
With 2s complement, overflow will occur when the result to be produced lies outside the range of the number of bits. Therefore for an 8 bit example the range is -128 to +127 (in binary, 10000000 to 01111111):

100 + 37 = 137:        01100100 + 00100101 = 10001001
-65 + (-112) = -177:   10111111 + 10010000 = 101001111
With an 8 bit result we lose the 9th bit and the result wraps around to a positive value: 01001111 = 79 .
With an 8 bit result the result wraps around to a negative value: 10001001 = -119.
One solution to overflow is to ensure that the number of bits available is always sufficient for the worst case result. Therefore in the above example perhaps allow the wordlength to grow to 9 or even 10 bits. Using Xilinx System Generator we can specifically check for overflow in every addition calculation.
Recall from previously that overflow detect circuitry is relatively easy to design - we just need to keep an eye on the MSB bits (indicating whether each number is +ve or -ve). For example:
(-73) + 127 = 54 (discard the final 9th bit carry): operands of opposite sign will never overflow.
100 + 64 = 164: two +ve operands giving a -ve result means overflow; likewise two -ve operands giving a +ve result means overflow.
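The MSB rule can be sketched directly (illustrative model only; the function name is ours):

```python
# Sketch of MSB-based overflow detection for an 8-bit two's complement adder:
# overflow occurs only when both operands share a sign and the result's sign
# differs - operands of opposite sign can never overflow.
def add8_detect(a, b):
    """Add two 8-bit two's complement values (Python ints in -128..127)
    and report whether the 8-bit result overflowed."""
    ra, rb = a & 0xFF, b & 0xFF
    r = (ra + rb) & 0xFF                     # discard the 9th carry bit
    sa, sb, sr = ra >> 7, rb >> 7, r >> 7    # the three sign bits (MSBs)
    overflow = (sa == sb) and (sr != sa)
    signed = r - 256 if sr else r
    return signed, overflow

print(add8_detect(-73, 127))   # (54, False): opposite signs, never overflows
print(add8_detect(100, 64))    # (-92, True): +ve operands, -ve result
print(add8_detect(-65, -112))  # (79, True):  -ve operands, +ve result
```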
Saturation
Taking the previous overflowing examples from Slide 33:

100 + 37 = 137:        01100100 + 00100101 - detect overflow and saturate to +127 (01111111)
-65 + (-112) = -177:   10111111 + 10010000 - detect overflow and saturate to -128 (10000000)
34
When overflow is detected, the result is set to the closest possible value (i.e. for the 8 bit case either -128 or +127). This applies to every addition that is explicitly done with an adder block. In Xilinx System Generator the user will get a checkbox choice to allow results to either (i) Wraparound or (ii) Saturate. Implementing saturate will require detect overflow & select circuitry.
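The saturate behaviour is a one-line change from the wraparound adder (behavioural sketch, name ours):

```python
# Sketch of a saturating 8-bit adder: on overflow the result is clamped to
# the closest representable value instead of wrapping around.
def sat_add8(a, b):
    r = a + b               # full-precision result
    if r > 127:
        return 127          # saturate high
    if r < -128:
        return -128         # saturate low
    return r

print(sat_add8(-65, -112))  # -177 saturates to -128 (wraparound gave +79)
print(sat_add8(100, 37))    #  137 saturates to +127 (wraparound gave -119)
print(sat_add8(100, 20))    #  120: in range, unaffected
```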
Once again, design of a DSP system might be done such that overflow never happens, as the user has ensured there are enough bits to cater for the worst possible case leading to the maximum magnitude result. Generally for some later FPGAs such as the Virtex-4, using the DSP48 blocks gives adders with 48 bits of precision, therefore when working with, say, 16 bit values it is unlikely they will grow to over 48 bits. Hence overflow has been designed out. Of course not all applications will use these devices, and when using general slice logic and attempting to make adders as small as possible, care must be taken, and where appropriate to efficient design, saturate might be included.
Saturation is extremely useful for adaptive algorithms. For example, in the Least Mean Squares (LMS) algorithm, the filter weights w are updated according to the equation: w(k) = w(k-1) + 2µe(k)x(k). Without further concern over the meaning of this equation, we can see that the term 2µe(k)x(k) is added to the weights at time epoch k-1 to generate the new weights at time epoch k. If the operations that form 2µe(k)x(k) were to overflow, there is a high chance that the sign of the term would flip and drive the weights in completely the wrong direction, leading to instability. With saturation however, if the term 2µe(k)x(k) gets very big and would overflow, saturation will limit it to the maximum value representable, causing the weights to change in the right direction, and at the fastest speed possible in the current representation. The result is a huge increase in the stability of the algorithm.
35
[Figure: upper half of a Virtex-II Pro slice - a LUT plus the dedicated carry logic (MUXCY and XORG) form one bit of an adder, with inputs A, B and Cin and outputs Sout and Cout]
Picture of Xilinx-II Pro slice (upper half) taken from Virtex-II Pro Platform FPGAs: Introduction and Overview, DS083-1 (v2.4) January 20, 2003. http://www.xilinx.com
LookUp Table (LUT) programmed with the two-input XOR function D = G1 xor G2:

G1 (A)  G2 (B)  D
0       0       0
0       1       1
1       0       1
1       1       0

giving Sout = Cin xor D.
Cout = D'A + CinD (multiplex operation). The result is the FULL ADDER implementation:

G1 (A)  G2 (B)  Cin  D  Sout  Cout
0       0       0    0  0     0
0       0       1    0  1     0
0       1       0    1  1     0
0       1       1    1  0     1
1       0       0    1  1     0
1       0       1    1  0     1
1       1       0    0  0     1
1       1       1    0  1     1
36
A (very) high level diagram of the main logic components on one slice
[Figure: high-level diagram of one slice - upper and lower halves, each containing a LUT, a MULTAND, an XORG, a D-type flip-flop and several routing multiplexers]
Just reviewing the logic circuitry on one half of the slice (note that in Slide 35 only the top half of the slice is shown, whereas the above slide shows the top and bottom halves), we can note:
One D-type flip flop
One 4 input Look-Up-Table (LUT) (can be configured as a shift register or simply as RAM/memory)
One XOR gate
One AND gate
One OR gate
A few 2 input MUX (multiplexors) to route signals
Clock inputs
Small FPGAs will have just a few hundred (100s) slices; Large FPGAs will have many tens of thousands (10000s) of slices (and other components!)
37
To produce larger adders the Xilinx tools will simply cascade the carry bits in adjacent (where possible!) slices. The bottom half of a Virtex-II Pro slice can be programmed for an identical operation, with its COUT wired to the top half's CIN. Hence we can get two bits of addition per standard Xilinx slice. To produce a 4 bit adder, we cascade with another slice.
[Figure: 4-bit adder built from two slices - each slice provides two full adders (FA), with the carry chain C0..C3 cascaded between slices to produce the sum S4 S3 S2 S1 S0]
Note the importance of the LUT (look up table) in the Xilinx slice. When configured as a LUT, any four input Boolean equation can be implemented. For example, take the equation Y = ABC + ABCD. The truth table for this equation is:

ABCD  Y        ABCD  Y
0000  0        1000  0
0001  0        1001  0
0010  0        1010  0
0011  0        1011  0
0100  0        1100  0
0101  0        1101  0
0110  0        1110  1
0111  0        1111  1

To implement this function, simply store the values of Y in the slice LUT, and then address the LUT with the values of ABCD to get the output. Therefore ANY 4 variable Boolean function can be simply implemented with a four input LUT. (Of course if the equation has only 3 variables then we can also implement it by setting one input constant.)
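The store-then-address idea can be sketched in a few lines, using the Y = ABC + ABCD function from the notes (the helper names are ours):

```python
# Sketch of a 4-input LUT: the 16 output bits of the truth table are the
# "programming", and (A, B, C, D) simply form the address. Any 4-variable
# Boolean function can be stored this way.
def make_lut(fn):
    """Build the 16-entry contents for a function of (a, b, c, d)."""
    return [fn((i >> 3) & 1, (i >> 2) & 1, (i >> 1) & 1, i & 1)
            for i in range(16)]

def lut_read(lut, a, b, c, d):
    return lut[(a << 3) | (b << 2) | (c << 1) | d]   # address the LUT

# Program it with Y = A.B.C + A.B.C.D
lut = make_lut(lambda a, b, c, d: (a & b & c) | (a & b & c & d))

print(lut_read(lut, 1, 1, 1, 0))   # 1: the ABC term is true
print(lut_read(lut, 0, 1, 1, 1))   # 0: A = 0, neither term is true
```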
Multiplication in binary
Multiplying in binary follows the same form as in decimal:

        11010110    (A7..A0)
      x 00101101    (B7..B0)
      ----------
        11010110
       000000000
      1101011000
     11010110000
    000000000000
   1101011000000
  00000000000000
 000000000000000
----------------
0010010110011110    (P15..P0)
38
Note that the product P is composed purely of selecting, shifting and adding A. Bit i of B indicates whether or not a shifted version of A is selected in the i-th row of the sum. So we can perform multiplication using just full adders and a little logic for selection, in a layout which performs the shifting.
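The select/shift/add recipe above can be sketched directly (an illustrative model, not the array hardware; the function name is ours):

```python
# Sketch of unsigned multiplication as pure select/shift/add, mirroring the
# partial-product rows above: bit i of B selects A shifted left by i places.
def shift_add_mul(a, b, nbits=8):
    p = 0
    for i in range(nbits):
        if (b >> i) & 1:        # does bit i of B select this row?
            p += a << i         # shifted copy of A
    return p

a, b = 0b11010110, 0b00101101   # 214 x 45, as in the worked example
print(shift_add_mul(a, b))                   # 9630
print(format(shift_add_mul(a, b), '016b'))   # 0010010110011110
```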
Multiplication in decimal: starting with an example in decimal, each digit of one operand selects and scales the other operand, and the shifted partial products are then summed.
2s complement Multiplication
39
For one negative and one positive operand just remember to sign extend the negative operand.
             -42         11010110
           x  45       x 00101101
                       ----------
1111111111010110    (sign extended)
0000000000000000
1111111101011000
1111111010110000
0000000000000000
1111101011000000
0000000000000000
0000000000000000
----------------
1111100010011110    = -1890
2s complement multiplication (II) For both operands negative, subtract the last partial product.
We use the trick of inverting (negating and adding 1) the last partial product and adding it, rather than subtracting:

             -42         11010110
           x -83       x 10101101
                       ----------
 1111111111010110
 0000000000000000
 1111111101011000
 1111111010110000
 0000000000000000
 1111101011000000
 0000000000000000
+0001010100000000   (two's complement of the last partial product, 1110101100000000)
 ----------------
 0000110110011110   = 3486
Of course, if both operands are positive, just use the unsigned technique! The difference between signed and unsigned multiplies results in different hardware being necessary. DSP processors typically have separate unsigned and signed multiply instructions.
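The sign-extend and subtract-the-MSB-row rules can be checked in software. A behavioural sketch (names are ours; negative rows are represented by their two's complement patterns modulo 2^2n):

```python
# Sketch of two's complement multiplication: partial products are sign
# extended to the full result width, and the row selected by B's MSB
# (which carries weight -2**(n-1)) is subtracted - implemented, as in the
# notes, by adding its two's complement.
def signed_mul(a, b, n=8):
    """a, b are signed ints in -2**(n-1)..2**(n-1)-1; returns the 2n-bit
    two's complement bit pattern of the product."""
    mask = (1 << 2 * n) - 1
    bu = b & ((1 << n) - 1)                  # bit pattern of B
    p = 0
    for i in range(n - 1):
        if (bu >> i) & 1:
            p = (p + ((a << i) & mask)) & mask        # sign-extended row
    if (bu >> (n - 1)) & 1:                  # B's MSB row is subtracted
        p = (p + ((-(a << (n - 1))) & mask)) & mask   # add the 2's comp
    return p

def to_signed(x, bits):
    return x - (1 << bits) if x >> (bits - 1) else x

p = signed_mul(-42, -83)
print(format(p, '016b'), to_signed(p, 16))   # 0000110110011110 3486
```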
40
Fixed point multiplication is no more awkward than integer multiplication:

         11010.110    (26.750)
       x 00101.101    (x 5.625)
       -----------
         11.010110
        000.000000
       1101.011000
      11010.110000
     000000.000000
    1101011.000000
   00000000.000000
  000000000.000000
 -----------------
 0010010110.011110    (= 150.468750)

In decimal: 26.750 x 5.625 = 0.133750 + 0.535000 + 16.050000 + 133.750000 = 150.468750
Again we just need to remember to interpret the position of the binary point correctly.
41
High speed integrated arithmetic slices (DSP48s): multiply-accumulate; add, multiply, accumulate.
Over the next few slides we will see that multipliers can be implemented in a variety of different ways. As multipliers are used extensively in DSP, implementing them efficiently is a priority consideration. The most basic multiplier is a 2-input version which is implemented using the logic fabric, i.e. the lookup tables within the slices of the device. This type is referred to as a distributed multiplier, because the implementation is distributed over the resources in several slices. In the case of multiplication with a constant, which is commonly required in DSP, the knowledge of one multiplicand can be exploited to create a cheaper hardware implementation than a conventional 2-input multiplier. Two approaches that will be discussed in the coming pages are ROM-based constant multipliers, and shift-and-add multipliers which sum the outputs from binary shift operations. The FPGA companies are well aware that DSP engineers desire fast and efficient multipliers, and as a result, they began incorporating embedded multipliers into their devices in the year 2000. Since then the sophistication of these components has increased, and they have been extended to feature fast adders and in many cases longer wordlengths, too. We can now think of them as embedded arithmetic slices, rather than simply multipliers.
Distributed Multipliers
This figure shows a four-bit multiplication:
42
[Figure: 4-bit distributed array multiplier - AND gates form the partial-product bits ai.bj, and full adders (FA) arranged diagonally sum them to produce the product bits p7..p0]
The AND gate connected to a and b performs the selection for each bit. The diagonal structure of the multiplier implicitly inserts zeros in the appropriate columns and shifts the a operands to the right. Note that this structure does not work for signed twos complement!
Note the function of the simple AND gate: multiplying 1s and 0s is the same as ANDing 1s and 0s.

A  B  Z
0  0  0
0  1  0
1  0  0
1  1  1

or in Boolean algebra Z = A and B = AB, and Z = A x B (where x = multiply).
Hence the AND gate is the bit multiplier. The function of one partial product stage of the multiplier is as shown below.
[Figure: one partial-product stage - full adders (FA) combine the incoming partial sum x3 x2 x1 x0 with the AND-gated row b0.(a3 a2 a1 a0), giving y4 y3 y2 y1 y0 = b0(a3 a2 a1 a0) + x3 x2 x1 x0]
43
This shows the top half of a slice, which implements one multiplier cell.
[Figure: top half of a slice implementing one multiplier cell - inputs A, B, S and Cin; the LUT, MULTAND, MUXCY and XORG produce Sout and Cout]
NOTE: This implementation features a Virtex-II Pro FPGA.
Picture of Xilinx-II Pro slice (upper half) taken from Virtex-II Pro Platform FPGAs: Introduction and Overview, DS083-1 (v2.4) January 20, 2003. http://www.xilinx.com LUT implements the XOR of two ANDs:
The dedicated MULTAND unit is required as the intermediate product G1G2 cannot be obtained from within the LUT, but is required as an input to MUXCY. The two AND gates perform a one-bit multiply each, and the result is added by the XOR plus the external logic (MUXCY, XORG): Sout = Cin xor D, Cout = D'(AB) + CinD. This structure will perform one cell of the multiplier (see the next slide...). Note that whereas the signal flow graph of the distributed multiplier shows signals propagating from the top and right of the diagram to the bottom, the internal structure of the FPGA slice logic results in a different configuration when implemented on a device.
ROM-based Multipliers
44
Just as logical functions such as XOR can be stored in a LUT, as shown for addition, we can use storage-based methods to do other operations. By using a ROM, we can store the result of every possible multiplication of two operands. The two operands A and B are concatenated to form the address with which to access the ROM. The value stored at that address is the multiplication result, P:
[Figure: ROM-based multiplier - the 4-bit operands A = 1010 (-6) and B = 0011 (3) are concatenated into the 8-bit address A:B = 10100011; the addressed ROM location holds the 8-bit product P = 11101110 (-18)]
There is one serious problem with this technique: as the operand size grows, the ROM size grows exponentially. For two N bit input operands, there are 2^2N possible input combinations, and hence the ROM has 2^2N entries. The output result is 2N bits long, and in total 2N x 2^2N bits of storage are required. For example, with 8 bit operands (a fairly reasonable size), 1Mbit of storage is required - a large quantity. For bigger operands, e.g. 16 bits, a huge quantity of storage is required: 16 bit operands require 128Gbits of storage, and hence a ROM-based multiplier is clearly not a realistic implementation choice!
Input Wordlength (N)   Output Wordlength (2N)   No. of ROM entries (2^2N)    Total ROM Storage (2N x 2^2N)
4                      8                        2^8  = 256                   2 Kbits
6                      12                       2^12 = 4,096                 48 Kbits
8                      16                       2^16 = 65,536                1 Mbit
10                     20                       2^20 = 1,048,576             20 Mbits
12                     24                       2^24 = 16,777,216            384 Mbits
14                     28                       2^28 = 268,435,456           7 Gbits
16                     32                       2^32 = 4,294,967,296         128 Gbits
18                     36                       2^36 = 68,719,476,736        2.25 Tbits
20                     40                       2^40 = 1,099,511,627,776     40 Tbits
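The exponential growth in the table is quick to verify (the function name is ours):

```python
# Quick check of the table above: total ROM size for an N x N-bit multiplier
# is 2**(2N) entries, each 2N bits wide.
def rom_bits(n):
    return (2 * n) * (2 ** (2 * n))

for n in (4, 8, 16):
    print(f"N={n:2d}: {rom_bits(n):>15,} bits")

# N=8 needs 2**16 entries x 16 bits = 1 Mbit; N=16 needs 128 Gbits.
assert rom_bits(8) == 1024 * 1024
assert rom_bits(16) == 128 * 2 ** 30
```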
45
Consider a ROM multiplier with 8-bit inputs: 65,536 8-bit locations are required to store all possible outputs... so 1Mbit of storage is needed!
[Figure: ROM-based multiplier with 8-bit inputs - A = 01101011 (107) and B = 01000011 (67) form the 16-bit address A:B = 27,459; the addressed location holds the 16-bit product P = 0001110000000001 (7,169)]
For example, if the B input was the constant value 75, the possible input words would be composed of 256 possible combinations of the upper 8-bits of the address, concatenated with the 8-bit binary word 0100 1011, as shown below. The result is that only 256 of the 65,536 memory locations are actually accessed. Therefore, when one of the inputs to the ROM-based multiplier is fixed, the size of the required ROM can be reduced to 256 locations of 16-bit data (note that the precision of the stored output words remains 8 bits + 8 bits). The total memory required is thus 256 x 16 = 4 kbits. However, depending on the value of the constant, it may also be possible to reduce the length of the stored results. For instance, if the value of B is (decimal) 10, the maximum magnitude output product generated by the multiplication of B with any 8-bit input A will be: 128 x 10 = 1280. As -1280 can be represented with 12 bits, that represents a further saving of 4 bits storage x 256 memory locations = 1 kbit.
[Figure: constant-coefficient ROM multiplier - the unknown 8-bit input A (? ? ? ? ? ? ? ?) is concatenated with the fixed 8-bit constant B = 75 (0100 1011) to form the 16-bit address A:B, so only 256 of the 65,536 locations are ever accessed]
The total storage requirement for this example constant coefficient multiplier would therefore be 3 kbits... significantly smaller than the 1Mbit needed for a full 8-bit x 8-bit multiplier (16-bit results) where both operands are unknown!
46
ROM-based multipliers with a constant input require fewer addresses. The storage required for output words may also be reduced, if the maximum result does not require the full numerical range of:

-2^(2N-1) <= result <= 2^(2N-1) - 1
The maximum product and output wordlength can be calculated for the particular constant value, and the multiplier optimised accordingly...
[Figure: constant-coefficient multiplier with B = -83 - an unknown 8-bit signed input A (maximum absolute value 128) produces a 16-bit signed result P, whose stored wordlength can be minimised for this particular constant]
Constant multipliers can be implemented using the LUTs within the logic fabric (distributed ROM), or with one or more of the Block RAMs available on most devices. The selection is influenced by the other demands placed on these resources by the rest of the system being designed. In System Generator, the designer can specify the implementation style via the Constant Multiplier dialog box, along with the constant value, the output wordlength, and other parameters.
47
Multiplication by a power-of-2 can be achieved simply by shifting the number to the right or left by the appropriate number of binary places.
[Figure: multiplying by a power of two - the bits of the word are simply shifted left or right relative to the binary point]
Extending this a little, multiplications by other numbers can be performed at low cost by creating partial products from shifts, and then adding them together.
For example: 21 x 9 = (21 x 2^3) + 21 = 189; and multiplying by 1.3125 uses (1 x 2^-4) + 1 + (1 x 2^-2) = 1.3125, i.e. the input plus copies of itself shifted right by 2 and 4 places.
Shift operations are effectively free in terms of logic resources, as they are implemented only using routing. Therefore multiplications by power-of-two numbers are very cheap! By recognising that multiplications by other numbers can be achieved by summing partial products of power-of-two shifts, any arbitrary multiplication can be decomposed into a series of shift and add operations. The closer the desired multiplication is to a power-of-two, i.e. the fewer partial products that are required, the fewer adders are required, and hence the lower the cost of the multiplier. This type of multiplier is suitable only for constant multiplications, because there is only one input, and the result is achieved using the configuration of the hardware.
The technique can be particularly powerful when applied to parallel multiplications of the same input. The partial product terms common to several multiplications can be shared and thus the overall effort reduced. Transpose form filters are very suitable for optimisation in this way. Taking the simple example of two concurrent multiplications, one x9 and the other x24, it is clear that the shift left by three places can be shared, as x8 is common to both operations (x9 = x8 + x1, x24 = x16 + x8) - fewer partial products are needed than if x24 and x9 were calculated separately.
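The shared x8 term can be sketched directly (the function name is ours):

```python
# Sketch of shift-and-add constant multiplication with a shared partial
# product: x9 = x8 + x1 and x24 = x16 + x8, so the x8 term (a shift left
# by three places) is computed once and used by both outputs.
def times9_and_24(x):
    x8 = x << 3                     # shared partial product (x * 8)
    return x8 + x, x8 + (x << 4)    # x*9 and x*24 - one extra add each

print(times9_and_24(21))    # (189, 504)
```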
Embedded multipliers
48
The Xilinx Virtex-II and Virtex-II Pro series were the first to provide on-chip multipliers, in the early 2000s. These are hardwired on the FPGA ASIC, not in the user FPGA slice-logic area. They are therefore permanently available and use no slices. They also consume less power than a slice-based equivalent and can be clocked at the maximum rate of the device.
[Figure: embedded 18x18 bit multiplier with input operands A and B and product P]
A and B are 18-bit input operands, and P is the 36-bit product, i.e. P = A B. Depending upon the actual FPGA, between 12 and more than 2000 (Virtex 6 top of range) of these dedicated multipliers are available.
Looking at a device floorplan, you can clearly see the embedded multipliers, which are located next to Block RAMs on the FPGA in order to support high speed data fetching/writing and computation.
[Figure: device floorplan - columns of 18x18 multipliers sit next to Block RAMs, surrounded by the general slice fabric]
Information on dedicated multipliers taken from Virtex-II Pro Platform FPGAs: Introduction and Overview, DS083-1 (v2.4) January 20, 2003. http://www.xilinx.com.
49
It can be easy to utilise on chip embedded multipliers inefficiently through choice of wordlengths...
[Figure: wordlength examples - a 4 x 4 multiply mapped onto an 18 x 18 embedded multiplier (36-bit product), and a requested 19 x 19 multiply producing a 38-bit product]
If you specify the use of embedded multipliers for a particular multiplier in System Generator, the tool will do exactly as you have asked, and implement it entirely using embedded multipliers. However, depending on the wordlengths involved, this may lead to an inefficient implementation. The wordlengths of the embedded multipliers are fixed at 18 x 18 bits, and it makes sense to use them as fully as possible. It is relatively easy to see that a 4 x 4 bit multiply will greatly underuse the capabilities of the multiplier, and this particular multiply operation might be better mapped to a distributed implementation, which would leave the embedded multiplier free for use somewhere else. Of course, these decisions are made in the context of some larger design with its own particular needs for the various resources available on the FPGA being targeted. Perhaps less obviously, mapping a multiplication to embedded multipliers where the input operands are slightly longer than 18 bits is also inefficient. This may result in, for example, the following implementation for a requested 19 x 19 bits multiplier, where 4 embedded multipliers are used instead of the expected 1!
[Figure: a 19 x 19 multiply decomposed into four embedded multiplies - 18 x 18, 1 x 18, 18 x 1 and 1 x 1 - whose shifted partial products are summed into the 38-bit result]
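The four-multiplier cost arises because each 19-bit operand must be split into an 18-bit lower part and a 1-bit upper part, and the product reassembled from the four cross terms. A quick arithmetic sketch (the function name is ours):

```python
# Split each 19-bit operand X into x1*2**18 + x0 (x1 is 1 bit, x0 is 18
# bits). Then A*B = a1b1*2**36 + (a1b0 + a0b1)*2**18 + a0b0: one 1x1 term,
# one 1x18 and one 18x1 term, and one 18x18 term - four multipliers.
def mul19(a, b):
    a1, a0 = a >> 18, a & ((1 << 18) - 1)
    b1, b0 = b >> 18, b & ((1 << 18) - 1)
    return ((a1 * b1) << 36) + ((a1 * b0 + a0 * b1) << 18) + a0 * b0

a, b = (1 << 19) - 1, (1 << 19) - 5        # two arbitrary 19-bit values
assert mul19(a, b) == a * b                 # the reassembly is exact
print(mul19(a, b).bit_length())             # 38-bit product
```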
50
As much DSP involves the Multiply-Accumulate (MAC) operation, soon after embedded multipliers came DSP48 slices (on the Virtex-4). These feature an 18 x 18 bit multiplier followed by a 48 bit accumulator.
[Figure: DSP48 slice (Virtex-4) - an 18 x 18 multiplier produces a 36-bit product feeding a 48-bit accumulator]
Like the embedded multipliers, these are low power and fast. The ability to cascade slices together also means that whole filters can be constructed without having to use any slices.
The next series of FPGAs (the Virtex-5) enhanced the capabilities of the DSP slice with the DSP48E. The major improvements of this slice are logic capabilities within the adder/subtractor unit, and an extended wordlength of one input to 25 bits. The maximum clock frequency also increased in line with the speed of the device.
[Figure: DSP48E slice (Virtex-5) - a 25 x 18 multiplier produces a 43-bit product feeding a 48-bit accumulator]
51
The Spartan-3A DSP series and subsequent Spartan-6 feature a version of the DSP48 with a pre-adder, prior to the multiplier. This feature is especially useful for DSP structures like symmetric filters, because it allows the total number of multiplications to be reduced.
[Figure: Spartan DSP48A slice - an 18-bit pre-adder feeds one input of the 18 x 18 multiplier]
The Virtex-6 offers a combination of the benefits of the Virtex-5 (the longer wordlength and arithmetic unit), together with the pre-adder from the Spartan series. This results in a very computationally powerful device, especially as it can be clocked at 600MHz, and the largest chips have 2000+ of them!
[Figure: DSP48E1 slice (Virtex-6) - a 25-bit pre-adder, a 25 x 18 multiplier (43-bit product) and a 48-bit accumulator/ALU]
Division (i)
Divisions are sometimes required in DSP, although not very often. 6 bit non-restoring division array:
[Figure: 6-bit non-restoring division array - rows of controlled add/subtract cells (FA plus selection logic) compute the quotient Q = B / A, with each row's carry out forming a quotient bit q5..q0]
52
Note that each cell can perform either addition or subtraction, as shown in an earlier slide: either Sin + Bin or Sin - Bin can be selected.
A direct method of computing division exists. This paper and pencil method may look familiar as it is often taught in school. A binary example is given below. Note that each stage computes an addition or subtraction of the divisor A. The quotient is made up of the carry bits from each addition/subtraction. If the quotient bit is a 0, the next computation is an addition, and if it is a 1, the divisor is subtracted. It is not difficult to map this example into the structure shown on the slide.
Example: B = 01011 (11), A = 01101 (13), -A = 10011. Compute Q = B / A.

01011 + 10011 = 11110, carry 0, so q4 = 0
shift left: 11100; add A:      11100 + 01101 = (1)01001, carry 1, so q3 = 1
shift left: 10010; subtract A: 10010 + 10011 = (1)00101, carry 1, so q2 = 1
shift left: 01010; subtract A: 01010 + 10011 = (0)11101, carry 0, so q1 = 0
shift left: 11010; add A:      11010 + 01101 = (1)00111, carry 1, so q0 = 1

giving Q = 0.1101 (11/13 = 0.846..., truncated to four fractional bits).
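The same working can be replicated in a short behavioural sketch (register width and names are ours):

```python
# Sketch of non-restoring division, following the paper-and-pencil working
# above: subtract or add the divisor, take the carry out as the quotient
# bit, shift the remainder left, and let the quotient bit choose the next
# operation (0 -> add, 1 -> subtract).
def nonrestoring_div(b, a, nbits=5):
    """Quotient bits MSB first for Q = B / A (B < A, fractional result)."""
    mask = (1 << nbits) - 1
    neg_a = (-a) & mask               # two's complement of the divisor
    q_bits = []
    r = b + neg_a                     # first step: subtract A
    q = (r >> nbits) & 1              # quotient bit = carry out
    q_bits.append(q)
    r &= mask
    for _ in range(nbits - 1):
        r = (r << 1) & mask           # shift remainder left
        r = r + (a if q == 0 else neg_a)   # add back, or subtract again
        q = (r >> nbits) & 1
        q_bits.append(q)
        r &= mask
    return q_bits

print(nonrestoring_div(0b01011, 0b01101))   # [0, 1, 1, 0, 1] -> Q = 0.1101
```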
Division (ii)
[Figure: long-division working - 01011.0000 divided by 01101 produces the quotient 00000.1101, shifting the partial remainder and subtracting the divisor (divisor_in) at each step]
53
There is an alternative way to compute division using another paper and pencil technique.
VHDL Design
54
An important aspect of division is to note that the quotient is generated MSB first - unlike multiplication or addition/subtraction! This has implications for the rest of the system. It is unlikely that the quotient can be passed on to the next stage until all the bits are computed - hence slowing down the system! Also, an N by N array has another problem - ripple through adders. Note that we must wait for N full adder delays before the next row can begin its calculations. Unlike multiplication there is no way around this, and as a result division is always slower than multiply even when performed on a parallel array - an N by N multiply will run faster than an N by N divide!
By looking at the top two rows of a 4 x 4 division array we can see that the first bit to get generated is the MSB of the quotient. This is unlike the multiplication array that can also be seen below, where the LSB is generated first. This is a problem when using division as most operations require the LSBs to start a computation and hence the whole solution will have to be generated before the next stage can begin. Another problem for division is the fact that it takes N full adder delays before the next row can start. In the examples below, the order in which the cells can start has been shown. So for the multiplier, the first cell on the second row is the 3rd cell to start working. However, for the divider, the first cell on the second row is only the 5th cell to start working because it has to wait for the 4 cells on the first row to finish.
[Figure: the top rows of a 4 x 4 division array and a 4 x 4 multiplier array, with cells numbered in the order they can begin computing - in the divider the first cell of the second row is only the 5th to start, because it must wait for the whole first row, whereas in the multiplier it is the 3rd]
55
To increase the throughput, the critical path can be broken down by implementing pipeline delays at appropriate points. If pipelining is not used, the delay (critical path) from new data arriving to registering the full quotient is N^2 full adders. This delay represents the maximum rate at which new data can enter the array. However, by pipelining the array, the critical path is broken down to just N full adders and thus the rate at which new data can arrive is increased dramatically.
[Figure: the division array with pipeline registers inserted between rows, breaking the critical path down to N full adders]
56
[Figure: non-restoring square root array - controlled add/subtract cells compute B = sqrt(A), consuming two radicand bits per row]
The square root is found (with divides) in DSP in algorithms such as QR algorithms, vector magnitude calculations and communications constellation rotation.
Looking carefully at the non-restoring square root array, we can note that this array is essentially half of the division array! If the division array above is cut diagonally from the left we can see the cells that are needed for the square root array. The 2 extra cells on the right hand side are standard cells which can be simplified. So square root can be performed twice as fast as divide using half of the hardware!
[Figure: worked example - the radicand A = 10 11 01 01 (181) is processed two bits per row, each row's carry generating a result bit b3 = 1, b2 = 1, b1 = 0, b0 = 1, i.e. B = 1101 (13)]
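The one-result-bit-per-two-radicand-bits behaviour can be sketched with a restoring-style digit-by-digit method (simpler to write than the non-restoring array, but the same rate; the function name is ours):

```python
# Sketch of bit-by-bit integer square root: bring down two bits of the
# radicand per step and test whether the next result bit can be a 1.
def isqrt_bitwise(a, nbits=8):
    """floor(sqrt(a)) for an nbits-wide radicand, one result bit per step."""
    root, rem = 0, 0
    for i in range(nbits // 2 - 1, -1, -1):
        rem = (rem << 2) | ((a >> (2 * i)) & 0b11)   # bring down two bits
        trial = (root << 2) | 1                      # candidate subtrahend
        root <<= 1
        if trial <= rem:                             # next bit is a 1
            rem -= trial
            root |= 1
    return root

print(isqrt_bitwise(0b10110101))   # sqrt(181) -> 1101 = 13
```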
57
The main appearance of square roots and divides is in advanced adaptive algorithms such as QR using Givens rotations. For these techniques we often find equations of the form:

cos(theta) = x / sqrt(x^2 + y^2)   and   sin(theta) = y / sqrt(x^2 + y^2)
So in fact we actually have to perform two squares, a divide and a square root. (Note that squaring is simpler than multiply!) There are a number of iterative techniques that can be used to calculate square root. (However these routines invariably require multiplies and divides and do not converge in a fixed time.) There seems to be some misinformation out there about square roots: for FPGA implementation, square roots are easier and cheaper to implement than divides!
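A quick numerical sketch of the Givens rotation itself (illustrative only; the function name is ours):

```python
import math

# Sketch of a Givens rotation as used in QR: compute c = x/sqrt(x^2 + y^2)
# and s = y/sqrt(x^2 + y^2), then rotate the vector (x, y) so that its
# second component is annihilated - two squares, a square root and divides.
def givens(x, y):
    mag = math.sqrt(x * x + y * y)
    return x / mag, y / mag

x, y = 3.0, 4.0
c, s = givens(x, y)
print(c, s)                             # 0.6 0.8
print(c * x + s * y, -s * x + c * y)    # magnitude 5.0, and approx. 0
```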