
DSPedia Notes 1

Introduction to FPGA


Introduction: DSP and FPGAs

In the last 20 years the majority of DSP applications have been enabled by DSP processors: Texas Instruments, Motorola, Analog Devices.

A number of DSP cores have also been available: Oak Core, LSI Logic ZSP, 3DSP.

ASICs (Application Specific Integrated Circuits) have been widely used for specific (high volume) DSP applications. But the most recent technology platform for high speed DSP applications is the Field Programmable Gate Array (FPGA). This course is all about the why and how of DSP with FPGAs!
R. Stewart, Dept EEE, University of Strathclyde, 2010

Notes:


DSP is all about multiplies and accumulates/adds (MACs). As we progress through the course, we will see that most algorithms used for different applications employ digital filters, adaptive filters, Fourier transforms and so on. These algorithms all require multiplies and adds (note that a divide or square root is quite a rare thing in DSP). Hence a DSP algorithm or problem is often specified in terms of its MAC requirements. In particular, when comparing two algorithms, if they both perform the same job but one with fewer MACs than the other, then clearly the cheaper one would be the best choice. However this implies some assumptions. One is that the required MACs are the same - but surely a multiply is a multiply! Well, yes: in the traditional DSP processor based situation we are likely to be using, say, a 16 bit device which will process 16 bit inputs, using 16 bit digital filter coefficients etc. With FPGAs this constraint is removed - we can use as many, or as few, bits as are required. Therefore we can choose to optimise and schedule DSP algorithms in a completely different way.
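As a sketch of why MACs dominate, the output of an N-tap digital filter is simply a sum of coefficient-sample products - one MAC per tap. The function and signal names below are illustrative, not from the notes:

```python
# Illustrative N-tap FIR filter output: one MAC per coefficient.
def fir_output(coeffs, samples):
    acc = 0
    for c, x in zip(coeffs, samples):
        acc += c * x          # one multiply-accumulate (MAC)
    return acc

# A 4-tap filter needs 4 MACs per output sample:
print(fir_output([1, 2, 3, 4], [1, 1, 1, 1]))   # -> 10
```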

(Figure: a typical DSP processor board. A voltage input passes through amplifiers/filters to an ADC, into a DSP processor such as the DSP56307, and back out through a DAC and amplifiers/filters as a voltage output; the board connects to a general purpose input/output bus.)

Ver 10.423

The FPGA DSP Evolution

Since around 1998 the evolution of FPGAs into the DSP market has been sustained by classic technology progress such as the ever-present Moore's law. Late 1990s FPGAs allowed multipliers to be implemented in the FPGA logic fabric; a few multipliers per device were possible. Early 2000s FPGAs: vendors placed hardwired multipliers onto the device with clocking speeds of > 100MHz, with the number of multipliers ranging from 4 to > 500. Mid 2000s: FPGA vendors placed DSP algorithm signal flow graphs (SFGs) onto devices - full (pipelined) FIR filter SFGs, for example, are available. Late 2000s to early 2010s - who knows! Probably more DSP power, more arithmetic capability (fast square root, divide), FFTs, more floating point support. Rest assured, more DSP is coming....

Notes:
Technology just keeps moving.


Anyone who has purchased a new laptop knows the feeling. If you just wait, then in the next quarter you will get the new model with integrated WiFi or WiMax and a faster processor. Wait another quarter and in 6 months it will be improved again - and the new faster, better, bigger machine is likely to be cheaper too! Such is technology. DSP for FPGAs is just the same. If you wait another year it's likely the vendors will bring out prepackaged algorithms for precisely what you want to do. And they will be easier to work with - higher level design tools, design wizards and so on. So if you are planning to design a QR adaptive equalizing beamformer for a MIMO implementation of a software radio for 802.16 - then if you wait, it will probably be a free download in a few years. But of course who can wait? Therefore in this course, we discuss and review the fundamental strategies of designing DSP for FPGAs. Like all technologies, you still need to know how it works if you really want to use it.


FPGAs: A Box of DSP blocks

We might be tempted to think of the latest FPGAs as repositories of DSP components just waiting to be connected together. Of course the resource is finite and the connections available are finite. In the days of circuit boards one had to be careful about running busses close together, lengths of wires, etc. Similar considerations are required for FPGAs (albeit out of your direct control). However, the high level concept remains: take the blocks and build it:
(Figure: a box of FPGA building blocks - clocks, input/output, registers and memory, connectors, logic and arithmetic - assembled through a design, verify, place and route flow.)

Notes:


This is undoubtedly the modern concept of FPGA design. Take the blocks, connect them together and the algorithm is in place. Do we actually need an FPGA/IC engineer then? Do we actually need a DSP engineer? Yes in both cases, but modern toolsets and design flows are such that it might be the same person. There is lots to worry about. In terms of the DSP design: is the arithmetic correct (i.e. overflows, underflows, saturates etc)? Do the latency or delays used allow the integrity of the algorithm to be maintained? For the FPGA: can we clock at a high enough rate? Does the design place and route? What device do we need, and how efficient is the implementation? (Just like compilers, different vendors' design flows will give different results - some better than others.) As vendors provide higher level components (like the DSP48 slice from Xilinx, which allows a complete FIR to be implemented), issues such as overflow, numerical integrity and so on are taken care of.


Binary Addition and Multiply


The bottom line for DSP is multiplies and adds - and lots of them! Adding two N bit numbers will produce up to an (N+1) bit number:  (N bits) + (N bits) -> (N+1) bits

Multiplying two N bit numbers can produce up to a 2N bit number:  (N bits) x (N bits) -> 2N bits

So with a MAC (multiply and accumulate/add) of two N bit numbers we could, in the worst case, end up with a (2N+1) bit wordlength.
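This worst-case wordlength growth is easy to confirm with Python integers, using bit_length() to count the bits of each result (N = 8 here is just an illustrative choice):

```python
N = 8
a = b = 2**N - 1                       # largest N-bit unsigned values (255)

print((a + b).bit_length())            # add:      N+1 = 9 bits
print((a * b).bit_length())            # multiply: 2N  = 16 bits
print((a * b + a * b).bit_length())    # add two products: 2N+1 = 17 bits
```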

Notes:


If the wordlength grows beyond the maximum value you can store, we clearly have the situation of numerical overflow, which is a non-linear operation and not desirable. Within traditional DSP processors this wordlength growth is well known and catered for. For example, the Motorola 56000 series is so called because it has a 56 bit accumulator, i.e. the largest result of any addition operation can have 56 bits. For a typical DSP filtering type operation we may require to take, say, an array of 24 bit numbers and multiply it by another array of 24 bit numbers. The result of each multiply will be a 48 bit number. If we then add two 48 bit numbers together, and they both just happen to be large positive values, then the result could be a 49 bit number. Now if we add many 48 bit numbers together (and they all just happen to be large positive values), then the final result may have a word growth of quite a few bits. So one must assume that Motorola had a good look at this, and realised it was fairly unlikely that the result of adding these 48 bit products together would ever be larger than 56 bits - so 56 bits was chosen. (Of course if you did have a problem that grew beyond 56 bits you would have to put special trapping into the code to catch this.)
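The accumulator sizing argument can be checked directly: summing k worst-case 48 bit products needs 48 + log2(k) bits, so a 56 bit accumulator safely sums up to 256 such products. A quick sketch:

```python
p = 2**48 - 1                     # largest 48-bit product magnitude
for k in (2, 16, 256):
    bits = (k * p).bit_length()   # bits needed to hold the sum of k products
    print(k, bits)                # 2 -> 49, 16 -> 52, 256 -> 56
```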


The Cost of Addition


A 4 bit addition can be performed using a simple ripple adder: a chain of four full adders, where stage i adds bits Ai and Bi plus the carry from the previous stage (the LSB carry in is 0), producing sum bits S0 to S3 and a final carry out which forms the MSB, S4.

(Figure: 4-bit ripple adder computing A3 A2 A1 A0 + B3 B2 B1 B0, with carries C0 to C3 rippling from the LSB stage to the MSB stage, result S4 S3 S2 S1 S0.)

Therefore an N bit addition could be performed in parallel at a cost of N full adders.



Notes:
The simple Full Adder (FA): Adds two bits + one carry in bit, to produce sum and carry out


Sout = A'B'Cin + A'BCin' + AB'Cin' + ABCin = A xor B xor Cin
Cout = A'BCin + AB'Cin + ABCin' + ABCin = AB + ACin + BCin
(' denotes complement)

A  B  Cin | Cout  Sout
0  0  0   |  0     0
0  0  1   |  0     1
0  1  0   |  0     1
0  1  1   |  1     0
1  0  0   |  0     1
1  0  1   |  1     0
1  1  0   |  1     0
1  1  1   |  1     1

(Figure: gate-level full adder schematic with inputs A, B, Cin and outputs Sout, Cout.)

Worked example:

    1011   (+11)
  + 1101   (+13)
   11000   (+24)
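The full adder equations and the ripple carry chain can be modelled directly; this sketch chains four full adders and reproduces the 11 + 13 = 24 example (function names are illustrative):

```python
# Gate-level full adder: sum = A xor B xor Cin, carry = majority(A, B, Cin).
def full_adder(a, b, cin):
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout

# Ripple adder: chain n full adders, LSB carry in = 0.
def ripple_add(A, B, n):
    carry, result = 0, 0
    for i in range(n):
        s, carry = full_adder((A >> i) & 1, (B >> i) & 1, carry)
        result |= s << i
    return result | (carry << n)   # n sum bits plus the final carry out (MSB)

print(bin(ripple_add(0b1011, 0b1101, 4)))   # 11 + 13 = 0b11000 (24)
```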


The Cost of Multiply


A 4 bit multiply operation requires an array of 16 multiply/add cells:
(Figure: 4x4 multiplier array. Pencil-and-paper layout: a3 a2 a1 a0 multiplied by b3 b2 b1 b0 generates four shifted partial product rows c, d, e and f, which are summed to give the product p7 p6 p5 p4 p3 p2 p1 p0. In hardware, each of the 16 array cells ANDs one a bit with one b bit and adds it into the running sum.)

Therefore an N by N multiply requires N^2 cells... so, for example, a 16 bit multiply is nominally 4 times more expensive to perform than an 8 bit multiply.

Notes:
Notes:
(Figure: multiplier cell with inputs a, b, s (sum in) and c (carry in), and outputs aout, bout, sout and cout.)


Each cell is composed of a Full Adder (FA) and an AND gate, plus some broadcast wires:

z = a.b (the AND gate)
sout = (s xor z) xor c
cout = s'.z.c + s.z'.c + s.z.c' + s.z.c = s.z + s.c + z.c
aout = a,  bout = b (broadcast wires)


An 8 bit by 8 bit multiplier would require 8 x 8 = 64 cells

Worked example, 11 x 9 = 99:

      1011      (11)
    x 1001      (x 9)
      ----
      1011      partial products
     0000
    0000
   1011
 ---------
   1100011      (99)
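The array multiplier behaves like pencil-and-paper long multiplication in base 2: each bit of b gates (ANDs) a shifted copy of a into the sum. A minimal sketch reproducing the 11 x 9 = 99 example:

```python
# Shift-and-add multiply: each set bit of b gates a shifted copy of a
# (the AND gates) into the running sum (the rows of full adders).
def shift_add_multiply(a, b, n):
    product = 0
    for i in range(n):
        if (b >> i) & 1:
            product += a << i      # partial product row i
    return product

print(shift_add_multiply(0b1011, 0b1001, 4))   # 11 x 9 = 99
```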


The Gate Array (GA)


Early gate-arrays were simply arrays of NAND gates:

Designs were produced by interconnecting the gates to form combinational and sequential functions.

Notes:


The NAND gate is often called the universal gate, meaning that it can be used to produce any Boolean logic function. The early gate array design flow would be: design, simulate/verify, device production and test. Metal layers make simple connections between the gates, for example to implement Z = AB + CD.

Early simulators and netlisters such as HILO (from GenRad) were used.

From GA to FPGA

However simple gate arrays, although very generic, were used by many different users for similar systems... for example to implement two-level logic functions, flip-flops and registers, and perhaps addition and subtraction functions. For a GA, once the layer(s) of metal had been laid on a device - that's it! No changes, no updates, no fixes. So then we move to field programmable gate arrays. Two key differences between these and gate arrays: they can be reprogrammed in the field, i.e. the logic specified is changeable; and they are no longer just composed of NAND gates, but a carefully balanced selection of multi-input logic, flip-flops, multiplexors and memory.
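As a quick check of the universal-gate claim, the example function Z = AB + CD can be built from just three NAND gates, since NAND(A,B) = NOT(AB) and OR(p,q) = NAND(NOT p, NOT q). A sketch:

```python
def nand(x, y):
    return 1 - (x & y)

# Z = AB + CD from three NAND gates.
def Z(a, b, c, d):
    return nand(nand(a, b), nand(c, d))

# Exhaustively verify against the two-level AND/OR form:
for i in range(16):
    a, b, c, d = (i >> 3) & 1, (i >> 2) & 1, (i >> 1) & 1, i & 1
    assert Z(a, b, c, d) == ((a & b) | (c & d))
print("Z = AB + CD verified using only NAND gates")
```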

Generic FPGA Architecture (Logic Fabric)

Arrays of gates and higher level logic blocks might be referred to as the logic fabric...
(Figure: generic FPGA architecture - an array of logic blocks surrounded by I/O blocks, connected by row and column interconnects.)

Notes:


The logic block in this generic FPGA contains a few logic elements. Different manufacturers will include different elements (and use different terms for logic block, e.g. slices etc). A simple logic block might contain the following:

(Figure: a simple logic block containing a logic element - a LUT feeding a flip-flop via select MUX logic, with cascade/carry logic and interconnects.)
Of course the actual contents of a logic element will vary from manufacturer to manufacturer and device to device.



FPGA Architecture (Xilinx DSP Devices)

Looking more specifically at recent Xilinx FPGAs, we also find block RAMs and dedicated arithmetic blocks... both very useful for DSP!

(Figure: Xilinx DSP device floorplan - rows and columns of logic fabric, with columns of block RAM and arithmetic blocks, surrounded by input/output blocks.)

Notes:
Diagram key: Input/Output Block (IOB); Block RAM; DSP48 / DSP48A / DSP48E arithmetic block; Configurable Logic Block (CLB); FPGA.

One of the major features of more recent, DSP-targeted Xilinx FPGAs is the provision of dedicated arithmetic blocks, which offer lower power and higher clock frequency operation than the logic fabric (i.e. the array of CLBs). These can be configured to perform a number of different computations, and are especially suited to the Multiply Accumulate (MAC) operations prevalent in digital filtering. Block RAMs are also used extensively in DSP; example uses include storing filter coefficients, encoding and decoding, and other tasks. Despite the inclusion of these additional resources, the logic fabric still forms the majority of the FPGA. We will now look in further detail at the CLBs which comprise the logic fabric, and how they are connected together.


Example: Xilinx Logic Blocks and Routing


Xilinx FPGA logic fabric comprises Configurable Logic Blocks (CLBs), which are groups of SLICEs (e.g. 2 or 4 SLICEs per CLB). Signals travel between CLBs via routing resources. Each CLB has an adjacent switch matrix for (most) routing.

(Figure: a CLB containing slices, with its adjacent switch matrix routing to other CLBs.)

NOTE: Only a subset of routing resources is depicted above.



Notes:


The example in the main slide features a typical Xilinx FPGA architecture; the Altera and Lattice architectures are different - their logic units differ in size, composition and name! However, in all cases, their logic blocks include both combinational logic and registers, and routing resources are required for connecting blocks together. Continuing with the Xilinx example, the combinational blocks are termed Lookup Tables (LUTs), and in most devices have 4 inputs (some of the more recent devices have 6-input LUTs). These LUTs can be utilised in four modes: to implement a combinatorial logic function; as Read Only Memory (ROM); as Random Access Memory (RAM); or as shift registers.

The register can be used as: a flip-flop, or a latch.

Over the next few slides, the functionality of LUTs and registers in the above modes will be described.


The Lookup Table


When used to implement a logic function, the 4-bit input addresses the LUT to find the correct output, Z, for that combination of A, B, C and D.

(Figure: a 4-input LUT - inputs A, B, C and D address the lookup table to produce output Z.)

Example logic function: Z = BC'D' + AB'CD' (' denotes complement), with Karnaugh map:

          AB=00  AB=01  AB=11  AB=10
CD=00       0      1      1      0
CD=01       0      0      0      0
CD=11       0      0      0      0
CD=10       0      0      0      1

Notes:
The lookup table can also implement a ROM, containing sixteen 1-bit values. Instead of the four inputs representing inputs of a logic function, they can be thought of as a 4-bit address. A 1-bit value is stored within each memory location, and the appropriate output is supplied for any input address. In this example, A is considered the Most Significant Bit (MSB) and D the Least Significant Bit (LSB), and the output is Z.


A  B  C  D | Z
0  0  0  0 | 1
0  0  0  1 | 1
0  0  1  0 | 0
0  0  1  1 | 1
0  1  0  0 | 1
0  1  0  1 | 0
0  1  1  0 | 0
0  1  1  1 | 0
1  0  0  0 | 0
1  0  0  1 | 0
1  0  1  0 | 0
1  0  1  1 | 0
1  1  0  0 | 0
1  1  0  1 | 0
1  1  1  0 | 1
1  1  1  1 | 1
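A LUT in ROM mode is literally a 16-entry table indexed by the 4-bit address {A,B,C,D}. This sketch loads the Z column of the example truth table (A as MSB) and reads it back:

```python
# ROM contents: the Z column of the example truth table, addressed by
# the 4-bit word {A,B,C,D}, A being the MSB.
LUT = [1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

def lut_read(a, b, c, d):
    return LUT[(a << 3) | (b << 2) | (c << 1) | d]

print(lut_read(0, 0, 0, 0))   # address 0000 -> 1
print(lut_read(1, 1, 1, 0))   # address 1110 -> 1
```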


LUTs as Distributed RAM


LUTs can also be configured as distributed RAM, in either single port or dual port modes. Single port: one address for both synchronous write operations and asynchronous read operations. Dual port: one address for both synchronous writes and asynchronous reads, plus a second address for asynchronous reads only.

Larger RAMs can be constructed by connecting two or more LUTs together. Dual port RAM requires more resources than single port RAM. For example, a 32x1 Single Port RAM (32 addresses, 1 bit data), requires two 4-bit LUTs, whereas an equivalent Dual Port RAM requires four 4-bit LUTs.

Notes:


The two diagrams below demonstrate the implementation of 16x1 single and dual port RAMs in the Virtex II Pro device, respectively. Notice that the dual port RAM requires twice as many resources as the single port RAM.

(Figures: 16x1 single port and dual port RAM implementations. Source: Virtex-II Pro and Virtex-II Pro X Platform FPGAs: Complete Data Sheet, DS083 (v4.7), November 5, 2007.)

LUTs as Shift Registers (SRL16s)


A final alternative is to use the LUT as a Shift Register of up to 16 bits. Additional Shift In and Shift Out ports are used, and the 4-bit address is used to define the memory location which is asynchronously read. For example, if the LUT input is the word 1001, the output from the 10th register is read, as depicted below.
(Figure: SRL16 - a chain of 16 D-type registers clocked together, with SHIFT IN feeding the first register and SHIFT OUT taken from the last. The 4-bit LUT input, e.g. 1001, selects which register drives the asynchronous D output.)

The slice register at the output from the LUT can be used to add another 1 clock cycle delay. Using the register also synchronises the read operation.

Notes:


As with the other LUT configurations, larger Shift Registers can be constructed by combining several LUTs. For example, a 64-bit shift register segment can be constructed by combining four 16-bit Shift Registers together, as shown below. The cascadable ports allow further interconnections for larger Shift Registers.

(Figure: a 64-bit shift register built from four SRL16s spread across two slices. The MSB output of each SRL16 feeds, via the slice flip-flop (FF), the DI input of the next SRL16, and cascadable in/out ports allow further extension.)


Registers


The sequential logic element which follows the lookup table can be configured as either: An edge-triggered D-type flip flop; or A level-sensitive latch

The input to the register may be the output from the LUT, or alternatively a direct input to the slice (i.e. the LUT is bypassed).

(Figure: LUT / register pair - the LUT inputs feed the LUT, whose output passes through the carry logic to the register (REG) and the output; a bypass input allows the register to be driven directly, bypassing the LUT.)



Notes:


A D-type flip flop provides a delay of one clock cycle, as confirmed by the truth table below (D(t) is the register input at time t, and Q(t+1) is the output 1 clock cycle later). A clock signal and control inputs (set, reset, etc.) are also provided.

D(t) | Q(t+1)
 0   |  0
 1   |  1

When configured as a latch, the control inputs define when data on the D input is captured and stored within the register. The Q output thereafter remains unchanged until new data is captured. Flip flops and registers are discussed in the Digital Logic Review notes chapter.


Resets: Registers and SRL16s


Whereas registers can be reset, SRL16s cannot. Therefore, adding reset capabilities to a design has implications for resource utilisation. For example, consider an 8-bit shift register. If resettable, then each element requires a slice register. If not, then an SRL16 can be used.
Resettable implementation: eight resettable slice registers (D-type flip-flops) chained from INPUT to OUTPUT, occupying 4 slices (Slice 1 to Slice 4), with a common CLOCK and RESET.

Non-resettable implementation: a single SRL16 in Slice 1, clocked from INPUT to OUTPUT.

Notes:


We can still design a resettable shift register with an SRL16, by using a slightly more sophisticated design. Instead of making all elements resettable, we can implement the first element using a slice register, and the subsequent ones using an SRL16. The reset signal is held high for 8 clock cycles, which allows the 0 input to propagate through the shift register. Instead of using 4 slices, this design would require 2 slices at most.

(Figure: a resettable slice register (with reset input R) feeding an SRL16. The RESET signal is held high for 8 clock cycles to flush zeros through the shift register.)


FPGA Arithmetic Implementation


The implementation of arithmetic operations in FPGA hardware is an integral and important aspect of DSP design. The following key issues are presented in this chapter: number representations (binary word formats for signed and unsigned integers, 2s complement, fixed point and floating point); binary arithmetic, including overflow and underflow, saturation, truncation and rounding; structures for arithmetic operations (addition/subtraction, multiplication, division and square root); complex arithmetic operations; and mapping to Xilinx FPGA hardware, including special resources for high speed arithmetic.

Notes:
This section of the course will introduce the following concepts: Integer number representations - unsigned, ones complement, twos complement. Non-integer number representations - fixed point, floating point. Quantisation of signals, truncation, rounding, overflow, underflow and saturation.


Addition - decimal, twos complement integer binary, twos complement fixed point, hardware structures for addition, Xilinx-specific FPGA structures for addition. Multiplication - decimal, 2s complement integer binary, twos complement fixed point, hardware structures for multiplication, Xilinx-specific FPGA structures for multiplication. Division. Square root. Complex addition. Complex multiplication.


Number Representations


DSP, by its very nature, requires quantities to be represented digitally using a number representation with finite precision. This representation must be specified to handle the real-world inputs and outputs of the DSP system: sufficient resolution, and a large enough dynamic range.

The number representation specified must also be efficient in terms of its implementation in hardware. The hardware implementation cost of an arithmetic operator increases with wordlength. The relationship is not always linear!

There is a trade-off between cost, and numeric precision and range.



Notes:


The use of binary numbers is a fundamental of any digital systems course, and is well understood by most engineers. However, when dealing with large complex DSP systems, there can be literally billions of multiplies and adds per second. Therefore any possible cost reduction from reducing the number of bits of representation is likely to be of significant value. For example, assume we have a DSP filtering application using 16 bit resolution arithmetic. We will show later (see Slide 42) that the cost of a parallel multiplier (in terms of silicon area - speed product) can be approximated as the number of full adder cells. Therefore for a 16 bit by 16 bit parallel multiply the cost is of the order of 16 x 16 = 256 cells. The wordlength of 16 bits has been chosen (presumably) because the designer at some time demonstrated that 17 bits was too many and 15 was not enough - or did they? Probably not! It's likely that we are using 16 bits because... well, that's what we usually use in DSP processors and we are creatures of habit! In the world of FPGA DSP arithmetic you can choose the resolution. Therefore, if it was demonstrated that in fact 9 bits was sufficient resolution, then the cost of a multiplier is 9 x 9 = 81 cells - approximately 30% of the cost of using 16 bit arithmetic. Therefore it's important to get the wordlength right: too many bits wastes resources, and too few bits loses resolution. So how do we get it right? Well, you need to know your algorithms and DSP.


Unsigned Integer Numbers


Unsigned integers are used to represent non-negative numbers. The weightings of individual bits within a generic n-bit word are:
bit index:       n-1      n-2      n-3    ...   2     1     0
bit weighting:  2^(n-1)  2^(n-2)  2^(n-3) ...  2^2   2^1   2^0
                 (MSB)                                     (LSB)

The decimal number 82 is 01010010 in unsigned format as shown:


bit index:        7     6     5     4     3     2     1     0
bit weighting:   2^7   2^6   2^5   2^4   2^3   2^2   2^1   2^0
binary number:    0     1     0     1     0     0     1     0
decimal:        1x2^6 + 1x2^4 + 1x2^1 = 64 + 16 + 2 = 82

The numerical range of an n-bit unsigned number is 0 to 2^n - 1. For example, an 8-bit word can represent all integers from 0 to 255.

Notes:


Taking the example of an 8-bit unsigned number, the range of representable values is 0 to 255:

Integer Value    Binary Representation
0                00000000
1                00000001
2                00000010
3                00000011
4                00000100
...
64               01000000
65               01000001
131              10000011
255              11111111

Note that the minimum value is 0, and the maximum value (255) is the sum of the powers of two from 2^0 to 2^7, where 8 is the number of bits in the binary word: i.e. 2^0 + 2^1 + 2^2 + 2^3 + 2^4 + 2^5 + 2^6 + 2^7 = 255 = 2^8 - 1.

2s Complement (Signed Integers)


2s Complement caters for positive and negative numbers in the range -2^(n-1) to +2^(n-1) - 1, and has only one representation of 0 (zero). In 2s complement, the MSB has a negative weighting:
bit index:        n-1       n-2       n-3    ...   2     1     0
bit weighting:  -2^(n-1)  2^(n-2)   2^(n-3)  ...  2^2   2^1   2^0
                  (MSB)                                      (LSB)

The most negative and most positive numbers are represented by:

most negative:  1 0 0 0 ... 0 0 0   (MSB = 1, all other bits 0: -2^(n-1))
most positive:  0 1 1 1 ... 1 1 1   (MSB = 0, all other bits 1: +2^(n-1) - 1)


Notes:


As examples, we can convert the following two 8-bit 2s complement words to decimal. As for the unsigned representation, the decimal number 82 is 01010010 in 2s complement signed format:

bit index:        7     6     5     4     3     2     1     0
bit weighting:  -2^7   2^6   2^5   2^4   2^3   2^2   2^1   2^0
binary number:    0     1     0     1     0     0     1     0
decimal:        64 + 16 + 2 = 82

Meanwhile the decimal number -82 is 10101110 in 2s complement signed format:

bit index:        7     6     5     4     3     2     1     0
bit weighting:  -2^7   2^6   2^5   2^4   2^3   2^2   2^1   2^0
binary number:    1     0     1     0     1     1     1     0
decimal:        -128 + 32 + 8 + 4 + 2 = -82
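Both interpretations can be checked with a small helper that applies the negative MSB weighting; the function name is illustrative:

```python
# Value of a 2's complement bit string: the MSB carries a negative
# weighting, all other bits are weighted as usual.
def from_twos_complement(bits):
    n = len(bits)
    return -int(bits[0]) * 2**(n - 1) + int(bits[1:], 2)

print(from_twos_complement("01010010"))   # ->  82
print(from_twos_complement("10101110"))   # -> -82
```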

2s Complement Conversion


For 2s Complement, converting between negative and positive numbers involves inverting all bits, and adding 1. For example, we have just considered 2s complement representations of the decimal numbers -82 and +82. They are converted as shown:

+82 to -82:                          -82 to +82:

  0 1 0 1 0 0 1 0   (+82)              1 0 1 0 1 1 1 0   (-82)
  invert all bits:                     invert all bits:
  1 0 1 0 1 1 0 1                      0 1 0 1 0 0 0 1
  add 1:                               add 1:
+ 0 0 0 0 0 0 0 1                    + 0 0 0 0 0 0 0 1
  1 0 1 0 1 1 1 0   (-82)              0 1 0 1 0 0 1 0   (+82)

Notes:

Positive Numbers              Negative Numbers
Integer   Binary              Integer   Binary
0         00000000            -0        (1)00000000
1         00000001            -1        11111111
2         00000010            -2        11111110
3         00000011            -3        11111101
...                           ...
125       01111101            -125      10000011
126       01111110            -126      10000010
127       01111111            -127      10000001
                              -128      10000000

(Invert all bits and ADD 1 to convert between the columns.)

Note that when negating zero, a ninth bit is produced (the carry out of the add 1). However, if we simply ignore this ninth bit, the representation of negative zero becomes identical to the representation of positive zero. Notice from the above that -128 can be represented but +128 cannot.
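The invert-and-add-1 rule maps directly onto Python's bitwise operators if we mask back to 8 bits - the mask is also what discards the ninth bit when negating zero:

```python
# 2's complement negation of an 8-bit word: invert all bits, add 1.
# Masking with 0xFF keeps 8 bits, discarding the ninth bit that
# appears when negating zero.
def negate_8bit(x):
    return ((~x) + 1) & 0xFF

print(format(negate_8bit(0b01010010), '08b'))   # +82 -> 10101110 (-82)
print(negate_8bit(0))                           # -0 -> 0
```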


Fixed-point Binary Numbers


We can now define what is known as a fixed-point number: ...a number with a fixed position for the binary point.


Bits on the left of the binary point are termed integer bits, and bits on the right of the binary point are termed fractional bits. The format of a generic fixed point word, comprising n integer bits and b fractional bits, is:

bit index:       n-1      n-2    ...   1     0   .   -1     -2   ...  -b+1      -b
bit weighting:  2^(n-1)  2^(n-2) ...  2^1   2^0  .  2^-1   2^-2  ... 2^(-b+1)  2^-b
                 (MSB)    n integer bits         ^       b fractional bits     (LSB)
                                           binary point

The MSB has -ve weighting for 2s complement (as for integer words).

Notes:


As examples, we consider the 2s complement word 11010110 with the binary point in two different places. Firstly, with the binary point to the left of the third bit, i.e. 5 integer bits and 3 fractional bits:

bit index:        4     3     2     1     0   .   -1     -2     -3
bit weighting:  -2^4   2^3   2^2   2^1   2^0  .  2^-1   2^-2   2^-3
binary number:    1     1     0     1     0   .   1      1      0
decimal:        -16 + 8 + 2 + 0.5 + 0.25 = -5.25

...and secondly, with the binary point to the left of the fifth bit, i.e. 3 integer bits and 5 fractional bits:

bit index:        2     1     0   .   -1     -2     -3     -4     -5
bit weighting:  -2^2   2^1   2^0  .  2^-1   2^-2   2^-3   2^-4   2^-5
binary number:    1     1     0   .   1      0      1      1      0
decimal:        -4 + 2 + 0.5 + 0.125 + 0.0625 = -1.3125

Note that these results are related by a factor of 2^2 = 4, i.e. 4 x -1.3125 = -5.25.
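Both fixed-point readings of 11010110 follow from one rule: interpret the word as a 2s complement integer, then divide by 2^(number of fractional bits). A sketch (the helper name is illustrative):

```python
# Fixed-point value = (2's complement integer value) / 2^frac_bits.
def fixed_point_value(bits, frac_bits):
    n = len(bits)
    raw = -int(bits[0]) * 2**(n - 1) + int(bits[1:], 2)   # signed integer
    return raw / 2**frac_bits

print(fixed_point_value("11010110", 3))   # 5 integer bits -> -5.25
print(fixed_point_value("11010110", 5))   # 3 integer bits -> -1.3125
```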

Fixed Point Range and Precision


As with integer representations (which are also effectively fixed point numbers, but with the binary point at position 0), the binary range of fixed point numbers extends from:

unsigned:           00000.....0000  to  11111.....1111
signed (2s comp.):  10000.....0000  to  01111.....1111

The same number of quantisation levels is present, e.g. for an 8-bit binary word, 256 levels can be represented. Numerical range scales according to the binary point position, e.g.:

1000 0000. (-128)  to  0111 1111. (+127)
        | x0.5 (binary point moves one place left)
1000 000.0 (-64)   to  0111 111.1 (+63.5)

Dynamic range (range / interval) is independent of the binary point position, e.g. (127-(-128))/1 = 255 = (63.5-(-64))/0.5

Notes:


To illustrate this further, let's consider the very simple case of a 3-bit 2s complement word, with the binary point in all four possible positions. Clearly the numerical range is affected by the binary point position, but the relationship between the interval and range remains the same.

(Figure: the 3-bit 2s complement word with the binary point in each of the four possible positions:
binary point position 0 (all integer bits): range -4 to +3, interval 1
binary point position 1: range -2 to +1.5, interval 0.5
binary point position 2: range -1 to +0.75, interval 0.25
binary point position 3 (all fractional bits): range -0.5 to +0.375, interval 0.125)


Fixed-point Quantisation
Consider the fixed point number format:  n n n . b b b b b   (3 integer bits, 5 fractional bits)

Numbers between -4 and +3.96875 can be represented, in steps of 0.03125. As there are 8 bits, there are 2^8 = 256 different values. Revisiting our sine wave example, using this fixed-point format:

(Figure: a sine wave spanning +2 to -2, quantised using this 8-bit fixed-point format.)

This looks much more accurate. The quantisation error is LSB/2 (where LSB = least significant bit)... so 0.015625 rather than 0.5!

Notes:


Quantisation is simply the DSP term for the process of representing infinite precision numbers with finite precision numbers. In the decimal world, it is familiar to most to work with a given number of decimal places. The real number pi can be represented as 3.14159265... and so on. We can quantise or represent it to 4 decimal places as 3.1416. If we use rounding here then the error is: 3.1416 - 3.14159265 = 0.00000735. If we truncated (just chopped off the digits below the 4th decimal place) then the error is larger: 3.14159265 - 3.1415 = 0.00009265. Clearly rounding is most desirable to maintain the best possible accuracy. However it comes at a cost - albeit a relatively small one, it is not free. When multiplying fractional numbers we will choose to work to a given number of places. For example, if we work to two decimal places then the calculation 0.57 x 0.43 = 0.2451 can be rounded to 0.25, or truncated to 0.24. The results are different. Once we start performing billions of multiplies and adds in a DSP system it is not difficult to see that these small errors can begin to stack up.
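The decimal rounding-versus-truncation example can be reproduced directly (using base-10 scaling for illustration; in hardware the same choice arises at a binary bit position):

```python
import math

# Quantising 0.57 x 0.43 = 0.2451 to two decimal places:
x = 0.57 * 0.43

rounded   = round(x, 2)                  # round to nearest -> 0.25
truncated = math.floor(x * 100) / 100    # just chop -> 0.24

print(rounded, truncated)                # the results are different
```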


Multiplication & Division via Binary Shifts

24

The binary point position has a power-of-2 relationship with the numerical range of the word format, and any number it represents. Therefore if we want to multiply or divide a fixed point number by a power-of-two, this is achieved by simply shifting the numbers with respect to the binary point!
[Figure: the same bit pattern 0 1 0 1 1 1 read against three binary point positions.]

original number (weights -8, 4, 2, 1, 0.5, 0.25):  0 1 0 1 1 1  =  decimal 5.75

shift right by 2 places (weights -2, 1, 0.5, 0.25, 0.125, 0.0625):  0 1 0 1 1 1  =  decimal 1.4375

shift left by 1 place (weights -16, 8, 4, 2, 1, 0.5):  0 1 0 1 1 1  =  decimal 11.5

Notes:


Of course, looking at the example in the main slide, we could also consider that the binary point is moved, rather than the bits - it amounts to the same thing! Ultimately the binary point is conceptual - having no effect on the hardware produced - and it falls to the DSP design tool and/or DSP designer to keep track of it. Reviewing the divide-by-4 and multiply-by-2 examples from the main slide... if we move the binary point, the weightings of the bits comprising the word, and hence the value it represents, change by a power-of-two factor.

/4 (binary point moved two places left): weights -2, 1, 0.5, 0.25, 0.125, 0.0625  ->  decimal 1.4375

original number: weights -8, 4, 2, 1, 0.5, 0.25  ->  decimal 5.75

x2 (binary point moved one place right): weights -16, 8, 4, 2, 1, 0.5  ->  decimal 11.5
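The worked example can be checked with a short Python sketch (the `to_real` helper is an illustrative name, not from the course): the fixed-point word is held as a plain integer, and the binary point is purely an interpretation applied when reading it.

```python
def to_real(word, frac_bits):
    """Interpret an integer bit pattern as a fixed-point value with the
    binary point placed frac_bits places from the right."""
    return word / (1 << frac_bits)

x = 0b010111                       # the 6-bit word from the example

assert to_real(x, 2) == 5.75       # original: weights -8 4 2 1 . 0.5 0.25
assert to_real(x << 1, 2) == 11.5  # multiply by 2: shift the bits left one place
assert to_real(x, 4) == 1.4375     # divide by 4: move the binary point two places left
```

Note the hardware cost: the shift is just wiring, and "moving the binary point" costs nothing at all.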


Normalised Fixed Point Numbers

25

Fixed point word formats with 1 integer bit and a number of fractional bits are often adopted. Numbers from -1 to +1-2^-b (i.e. just less than +1) can be represented. Some examples:

[Figure: 8-bit normalised format with bit weightings -1, 1/2, 1/4, 1/8, 1/16, 1/32, 1/64, 1/128; the most negative value is 1000 0000 = decimal -1, and the most positive value is 0111 1111 = decimal +0.9921875.]

Limiting the numeric range to +/-1 is advantageous because it makes the arithmetic easier to work with... multiplying two normalised numbers together cannot produce a result of magnitude greater than 1!

Notes:


The term Q-format is often used to describe fixed point number formats, usually in the context of DSP processors. However, it is useful to note that Q-format and 2s complement are actually the same thing. Q-format notation is given in the form Qm.n, where m is the number of integer bits, and n is the number of fractional bits. Notably this description excludes the MSB of the 2s complement representation, which Q-format considers a sign bit. Therefore the total number of bits in a Q-format number is 1 + m + n, whereas in 2s complement, the same word format would be described as having m+1 integer bits, and n fractional bits. For example, a Q2.5 number has a sign bit, 2 other integer bits, and 5 fractional bits, and hence can be represented as shown below. In 2s complement, this would be described as having 3 integer bits and 5 fractional bits.

Q-format description: 1 sign bit, m = 2 integer bits, n = 5 fractional bits

bit weightings:  -2^2 | 2^1  2^0 | 2^-1  2^-2  2^-3  2^-4  2^-5

2s complement description: 3 integer bits, 5 fractional bits

The Q0.15 format (often abbreviated to Q15) is used extensively in DSP as it covers the normalised range of numbers from -1 to +1-2^-15, and is equivalent to a 16 bit 2s complement representation with the binary point at position 15, i.e. 1 integer bit and 15 fractional bits.
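A minimal Q15 sketch in Python (the helper names `to_q15`, `from_q15` and `q15_mult` are illustrative, not a DSP processor API) shows the key mechanics: the word is stored as a 16-bit integer, and the product of two Q15 values is Q30, so a 15-place shift returns it to Q15:

```python
def to_q15(x):
    """Quantise a real value in [-1, +1) to a Q15 (16-bit) integer,
    saturating at the ends of the representable range."""
    return max(-32768, min(32767, int(round(x * 32768))))

def from_q15(q):
    return q / 32768.0

def q15_mult(a, b):
    """Multiply two Q15 words: the raw product is Q30, so shift right
    by 15 (truncation) to return to Q15."""
    return (a * b) >> 15

assert from_q15(q15_mult(to_q15(0.5), to_q15(-0.25))) == -0.125
assert to_q15(1.0) == 32767    # +1.0 itself is not representable
```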

Fractional Motivation - Normalisation

26

Working with fractional binary values makes the arithmetic easier to work with and to account for wordlength growth. As an example take the case of working with a machine using 4 digit decimals and a 4 digit arithmetic unit - range -9999 to +9999. Multiplying two 4 digit numbers will result in up to 8 significant digits. 6787 x 4198 = 28491826
scale (/10000)  ->  2849.1826  ->  truncate  ->  2849

If we want to pass this number to the next stage in the machine (where arithmetic is 4 digits accuracy) then we need to scale down by 10000, then truncate.

Consider instead normalising to the range -0.9999 to +0.9999:

0.6787 x 0.4198 = 0.28491826  ->  truncate  ->  0.2849

Now the procedure for truncating back to 4 digits is much easier.



Notes:


Of course the two results are exactly identical, and the differences are in how we handle the truncate and scale. However, using the normalised values, where all inputs are constrained to be in the range from -1 to +1, it is easy to note that multiplying ANY two numbers in this range together will give a result also in the range of -1 to +1. Exactly the same idea of normalisation is applied to binary, and the binary point is implicitly used in most DSP systems. Consider 8 bit values in 2s complement. The range is therefore: 10000000 to 01111111 (-128 to +127)

If we normalise these numbers to between -1 and 1 (i.e. divide through by 128) then the binary range is: 1.0000000 to 0.1111111 ( -1 to 0.9921875, where 127/128 = 0.9921875).

Therefore we apply the same normalising ideas as for decimal to multiplication in binary. Consider multiplying 36 x 97 = 3492, equivalent to 00100100 x 01100001 = 0000 1101 1010 0100. Normalising the values, the binary calculation becomes 0.0100100 x 0.1100001 = 0.00110110100100, which in decimal is equivalent to: 0.28125 x 0.7578125 = 0.213134765625. Note very clearly that in a DSP system the binary point is all in the eye of the designer. There is no physical connection or wire for the binary point. It just makes things significantly easier in keeping track of wordlength growth, and truncating just by dropping fractional bits. Of course if you prefer integers and would like to keep track of the scaling explicitly you can do this... you will get the same answer and the cost is the same.
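The integer and normalised views of the 36 x 97 example can be checked in a few lines of Python; the point is that the bit pattern of the product is identical, and only the implied position of the binary point differs:

```python
# Integer view: 8 bit operands, 16 bit product
a_int, b_int = 36, 97                  # 00100100 and 01100001
p_int = a_int * b_int                  # 3492 = 0000110110100100

# Normalised view: divide each operand by 128 (move the binary point)
a_frac, b_frac = a_int / 128, b_int / 128   # 0.28125 and 0.7578125
p_frac = a_frac * b_frac                    # 0.213134765625

# Same bit pattern, different binary point: the product scale is 128 x 128
assert p_int == 3492
assert p_frac == p_int / 16384
```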

Analogue to Digital Converter (ADC)

27

An ADC is a device that can convert a voltage to a binary number, according to its specific input-output characteristic.
[Figure: an 8-bit ADC with sample rate fs maps the voltage input (approximately -2 to +2 volts) to a 2s complement binary output between 10000000 (-128) and 01111111 (+127); intermediate levels such as +32 (00100000), +64 (01000000) and +96 (01100000) are marked on the staircase characteristic.]
We can generally assume ADCs operate using twos complement arithmetic.



Notes:


Viewing the straight line portion of the device, we are tempted to refer to the characteristic as linear. However a quick consideration clearly shows that the device is non-linear (recall the definition of a linear system from before) as a result of the discrete (staircase) steps, and also because the device clips above and below the maximum and minimum voltage swings. However if the step sizes are small and the number of steps large, then we are tempted to call the device piecewise linear over its normal operating range. Note that the ADC does not necessarily have a linear (straight line) characteristic. In telecommunications for example a defined standard non-linear quantiser characteristic is often used (A-law and mu-law). Speech signals, for example, have a very wide dynamic range: harsh "oh" and "b" type sounds have a large amplitude, whereas softer sounds such as "sh" have small amplitudes. If a uniform quantisation scheme were used then although the loud sounds would be represented adequately, the quieter sounds may fall below the threshold of the LSB and therefore be quantised to zero and the information lost. Therefore non-linear quantisers are used such that the quantisation level at low input levels is much smaller than for higher level signals. A-law quantisers are often implemented by using a non-linear circuit followed by a uniform quantiser. Two schemes are widely in use: the A-law in Europe, and the mu-law in the USA and Japan. Similarly, the DAC can also have a non-linear characteristic.

[Figure: a non-linear ADC characteristic, binary output against voltage input.]


ADC Sampling Error

28

Perfect signal reconstruction assumes that sampled data values are exact (i.e. infinite precision real numbers). In practice they are not, as an ADC will have a number of discrete levels. The ADC samples at the Nyquist rate, and the sampled data value is the closest (discrete) ADC level to the actual value:
[Figure: the continuous signal s(t) is sampled by the ADC at rate fs (sample period ts); each output sample v(n), on a scale of -4 to +4, is the closest discrete ADC level to the true voltage at time n x ts.]
v(n) = Quantise{ s(n ts) }, for n = 0, 1, 2, ... Hence every sample has a small quantisation error.
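The sampling-plus-quantisation operation above can be modelled directly in Python. This is a sketch under stated assumptions: the `adc_sample` name and default parameters (8 levels, 1 volt step, round-to-nearest with clipping at the rails) are illustrative, not a specification of any real ADC:

```python
import math

def adc_sample(s, n_samples, fs, levels=8, q=1.0):
    """Sample s(t) at rate fs and quantise each value to the nearest ADC
    level (levels are multiples of the step q, clipped to the ADC swing)."""
    half = levels // 2
    v = []
    for n in range(n_samples):
        code = math.floor(s(n / fs) / q + 0.5)   # nearest level
        code = max(-half, min(half - 1, code))   # clip at the rails
        v.append(code * q)
    return v

# The sample value 1.589998... from the notes quantises to level 2
assert adc_sample(lambda t: 1.589998, 1, 8000) == [2.0]
```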



Notes:


For example purposes, we can assume our ADC or quantiser has 5 bits of resolution and maximum/minimum voltage swing of +15 and -16 volts. The input/output characteristic is shown below:

[Figure: the ADC characteristic runs from binary output 10000 (-16) at the minimum input voltage up to 01111 (+15) at Vmax = 15 volts, in steps of 1 volt.]
In the above slide figure, for the second sample the true sample value is 1.589998..., however our ADC quantises to a value of 2.


Quantisation Error

29

If the smallest step size of a linear ADC is q volts, then the error of any one sample is at worst q/2 volts.
[Figure: the ADC characteristic from 10000 (-16) to 01111 (+15), with step size q volts, over the input range -Vmax to +Vmax.]


Notes:


Quantisation error is often modelled as an additive noise component, and indeed the quantisation process can be considered purely as the addition of this noise, nq, to the signal passing through the ADC.


An Example
Here is an example using a 3-bit ADC:
30

[Figure: an input sine wave of amplitude roughly +/-4 volts is passed through the 3-bit ADC characteristic; the panels show the input and the quantised output against time, and the resulting quantisation error.]

Notes:
In this case worst case error is 1/2.



Adding multi-bit numbers

31

The full adder circuit can be used in a chain to add multi-bit numbers. The following example shows 4 bits:

[Figure: four full adders in a chain; stage i adds Ai, Bi and the incoming carry, producing sum bit Si and carry Ci; the final carry C3 becomes the extra sum bit S4 (the MSB), and S0 is the LSB.]

     A3 A2 A1 A0
  +  B3 B2 B1 B0
    (C3 C2 C1 C0)
  --------------
  S4 S3 S2 S1 S0

This chain can be extended to any number of bits. Note that the last carry output forms an extra bit in the sum. If we do not allow for an extra bit in the sum, then when a carry out of the last adder occurs, an overflow will result, i.e. the number will be incorrectly represented.

Notes:
The truth table for the full adder is:

A B CIN | S COUT
0 0 0   | 0  0     (0+0+0 = 0)
0 0 1   | 1  0     (0+0+1 = 1)
0 1 0   | 1  0     (0+1+0 = 1)
0 1 1   | 0  1     (0+1+1 = 2)
1 0 0   | 1  0     (1+0+0 = 1)
1 0 1   | 0  1     (1+0+1 = 2)
1 1 0   | 0  1     (1+1+0 = 2)
1 1 1   | 1  1     (1+1+1 = 3)

Full adder circuitry can therefore be produced with gates:

S = A'B'C + A'BC' + AB'C' + ABC = A xor B xor C
COUT = A'BC + AB'C + ABC' + ABC = AB + AC + BC = AB + C(A xor B)

(where ' denotes logical inversion)

The longest propagation delay path in the above full adder is two gates.
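The full adder equations and the ripple-carry chain from the slide can be modelled bit-for-bit in Python (the function names are illustrative):

```python
def full_adder(a, b, cin):
    """One-bit full adder: S = A xor B xor C, COUT = AB + AC + BC."""
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout

def ripple_add(a_bits, b_bits):
    """Chain full adders over equal-length bit lists (LSB first); the final
    carry out becomes the extra MSB of the sum, so the result never overflows."""
    carry, out = 0, []
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        out.append(s)
    return out + [carry]

# 1101 + 1011 = 11000 (13 + 11 = 24), bits listed LSB first
assert ripple_add([1, 0, 1, 1], [1, 1, 0, 1]) == [0, 0, 0, 1, 1]
```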

Subtraction

32

Subtraction is very readily derived from addition. Remember 2s complement? All we need to do to get a negative number is invert the bits and add 1. Then if we add these numbers, we'll get a subtraction D = A + (-B):

[Figure: a 4-bit 2s complement subtractor; each of the inputs B3...B0 is inverted before the full adder chain, the LSB carry in is set to 1 (the 'add 1'), the final carry C4 is discarded, and the difference appears on D3 D2 D1 D0.]

The addition of the 1 is done by setting the LSB carry in to be 1.



Notes:


Note for 4 bit positive numbers (i.e. NOT 2s complement) the range is from 0 to 15. For 2s complement the numerical range is from -8 to +7.

Addition/Subtraction (using 2s complement representation)

Sometimes we need a combined adder/subtractor with the ability to switch between modes. This can be achieved quite easily:

[Figure: each Bi input passes through a 2-input MUX which selects either Bi or its inverse, all controlled by the signal K; K also drives the LSB carry in.]

For: A + B, K = 0. For: A - B, K = 1. This structure will be seen again in the Division/Square Root slides!
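The combined adder/subtractor behaviour is easy to verify in Python. A sketch under stated assumptions: a 4-bit datapath, with `add_sub` and `to_signed` as invented helper names:

```python
MASK = 0xF   # 4-bit datapath

def add_sub(a, b, k):
    """4-bit combined adder/subtractor: K selects B (K=0) or inverted B
    (K=1) through the MUXes, and K is also the LSB carry in, so K=1
    computes A + (~B) + 1 = A - B. The final carry out is discarded."""
    operand = (b ^ (MASK if k else 0)) & MASK
    return (a + operand + k) & MASK

def to_signed(word):
    """Interpret a 4-bit word as 2s complement (-8 to +7)."""
    return word - 16 if word & 0x8 else word

assert add_sub(0b0101, 0b0011, 1) == 0b0010          # 5 - 3 = 2
assert to_signed(add_sub(0b0011, 0b0101, 1)) == -2   # 3 - 5 = -2
```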

Wraparound Overflow & 2s Complement

33

With 2s complement, overflow will occur when the result to be produced lies outside the range of the number of bits. Therefore for an 8 bit example the range is -128 to +127 (or in binary, 10000000 to 01111111):

   -65     10111111
 +-112   + 10010000
 -----   ----------
  -177    101001111

With an 8 bit result we lose the 9th bit and the result wraps around to a positive value: 01001111 = +79.

   100     01100100
  + 37   + 00100101
 -----   ----------
   137     10001001

With an 8 bit result the result wraps around to a negative value: 10001001 = -119.

One solution to overflow is to ensure that the number of bits available is always sufficient for the worst case result. Therefore in the above example perhaps allow the wordlength to grow to 9 or even 10 bits. Using Xilinx System Generator we can specifically check for overflow in every addition calculation.

Notes:


Recall from previously that overflow detect circuitry is relatively easy to design. We just need to keep an eye on the MSB bits (indicating whether each number is +ve or -ve). For example:

 (-73)     10110111
 + 127   + 01111111
 -----   ----------
   +54   1 00110110   (discard the final 9th bit carry)  ->  No overflow

   100     01100100
  + 64   + 01000000
 -----   ----------
   164     10100100   (MSB indicates a -ve result!)      ->  Overflow

Adding +ve and -ve: will never overflow!
Adding +ve and +ve: if a -ve result then overflow.
Adding -ve and -ve: if a +ve result then overflow.
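The wraparound behaviour and the MSB-based overflow rule can be modelled in a few lines of Python (the `wrap_add` helper is an illustrative name):

```python
def wrap_add(a, b):
    """8-bit 2s complement addition with wraparound, plus overflow detect:
    overflow can only occur when both operands have the same sign and the
    result's sign differs (adding +ve to -ve never overflows)."""
    raw = (a + b) & 0xFF
    result = raw - 256 if raw & 0x80 else raw
    overflow = (a >= 0) == (b >= 0) and (a >= 0) != (result >= 0)
    return result, overflow

assert wrap_add(-65, -112) == (79, True)    # -177 wraps to +79
assert wrap_add(100, 37) == (-119, True)    # +137 wraps to -119
assert wrap_add(-73, 127) == (54, False)    # mixed signs: never overflows
```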


Saturation
34

One method to try to address overflow is to use a saturate technique. Taking the previous overflowing examples from Slide 33:

   -65     10111111
 +-112   + 10010000
 -----   ----------
  -177    101001111   ->  detect overflow and saturate to -128 (10000000)

   100     01100100
  + 37   + 00100101
 -----   ----------
   137     10001001   ->  detect overflow and saturate to +127 (01111111)

When overflow is detected, the result is set to the closest possible value (i.e. for the 8 bit case, either -128 or +127). In Xilinx System Generator, for every addition that is explicitly done with an adder block, the user will get a checkbox choice to allow results to either (i) Wraparound or (ii) Saturate. Implementing saturate will require overflow detect and select circuitry.

Notes:


Once again, the design of a DSP system might be done such that overflow never happens, because the user has ensured there are enough bits to cater for the worst possible case leading to the maximum magnitude result. For some later FPGAs such as the Virtex-4, the DSP48 blocks give adders with 48 bits of precision, so when working with, say, 16 bit values, growth beyond 48 bits is unlikely; overflow has effectively been designed out. Of course not all applications will use these devices, and when using general slice logic and attempting to make adders as small as possible, care must be taken, and where appropriate to efficient design, saturation might be included.

Saturation is extremely useful for adaptive algorithms. For example, in the Least Mean Squares (LMS) algorithm, the filter weights w are updated according to the equation: w(k) = w(k-1) + 2u e(k)x(k), where u is the adaptation step size. Without further concern over the meaning of this equation, we can see that the term 2u e(k)x(k) is added to the weights at time epoch k-1 to generate the new weights at time epoch k. If the operations that form 2u e(k)x(k) were to overflow, there is a high chance that the sign of the term would flip and drive the weights in completely the wrong direction, leading to instability. With saturation however, if the term 2u e(k)x(k) gets very big and would overflow, saturation will limit it to the maximum value representable, causing the weights to change in the right direction, and at the fastest speed possible in the current representation. The result is a huge increase in the stability of the algorithm.
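The saturate alternative to wraparound is a simple clamp; a minimal Python sketch (the `saturate_add` name and default 8-bit wordlength are illustrative):

```python
def saturate_add(a, b, bits=8):
    """Add two 2s complement values; on overflow, clamp to the most
    positive or most negative representable value instead of wrapping."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, a + b))

assert saturate_add(-65, -112) == -128   # instead of wrapping to +79
assert saturate_add(100, 37) == 127      # instead of wrapping to -119
assert saturate_add(100, -37) == 63      # in-range sums are unaffected
```

Note how, unlike wraparound, the saturated result at least has the correct sign, which is what protects the LMS weight update described above.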


Xilinx Virtex-II Pro Addition


The components of the slice that are used are outlined below.

35

[Figure: the upper half of a Virtex-II Pro slice configured as a one-bit full adder; operands A and B enter the LUT, whose output drives the carry multiplexer (Cin to Cout) and the XOR gate that forms Sout.]

Notes:


Picture of Xilinx Virtex-II Pro slice (upper half) taken from Virtex-II Pro Platform FPGAs: Introduction and Overview, DS083-1 (v2.4) January 20, 2003. http://www.xilinx.com

The LookUp Table (LUT) is programmed with the two-input XOR function D = G1 xor G2:

G1 (A)  G2 (B)  D
0       0       0
0       1       1
1       0       1
1       1       0

Sout = Cin xor D, and Cout = D'A + Cin D (a multiplex operation, where ' denotes inversion). The result is the FULL ADDER implementation:

G1 (A)  G2 (B)  Cin  D  Sout  Cout
0       0       0    0  0     0
0       0       1    0  1     0
0       1       0    1  1     0
0       1       1    1  0     1
1       0       0    1  1     0
1       0       1    1  0     1
1       1       0    0  0     1
1       1       1    0  1     1


Xilinx Virtex-II Pro Slice Main Components


36

A (very) high level diagram of the main logic components on one slice:

[Figure: the slice is drawn as upper and lower halves; each half contains a 4-input LUT (configurable as RAM or a shift register), a D-type flip-flop, a MULTAND gate, an XORG gate and an ORCY gate, with several MUXes routing between the inputs and outputs.]

Notes:


Just reviewing the logic circuitry on one half of the slice (note that in Slide 35 only the top half of the slice is shown, whereas the above slide shows the top and bottom halves), we can note:

- One D-type flip-flop
- One 4-input Look-Up-Table (LUT), which can be configured as a shift register or simply as RAM/memory
- One XOR gate
- One AND gate
- One OR gate
- A few 2-input MUXes (multiplexors) to route signals
- Clock inputs

Small FPGAs will have just a few hundred (100s) slices; Large FPGAs will have many tens of thousands (10000s) of slices (and other components!)


Xilinx Virtex-II Pro 4 bit Adder

37

To produce larger adders the Xilinx tools will simply cascade the carry bits in adjacent (where possible!) slices. The bottom half of a Virtex-II Pro slice can be programmed for an identical operation, with its COUT wired to the top half's CIN. Hence we can get two bits of addition per standard Xilinx slice. To produce a 4 bit adder, we cascade with another slice.
[Figure: a 2-bit addition occupies one slice (two cascaded full adders adding A1 A0 to B1 B0, producing S1 S0 and carry C1); a 4-bit addition occupies two slices, with the first slice's carry C1 feeding the second slice (adding A3 A2 to B3 B2), producing the full sum S4 S3 S2 S1 S0.]

Notes:


Note the importance of the LUT (look up table) in the Xilinx slice. When configured as a LUT, any four input Boolean equation can be implemented. For example, take the equation Y = ABC + ABCD. The truth table for this equation is:

ABCD   Y
0000   0
0001   0
0010   0
0011   0
0100   0
0101   0
0110   0
0111   0
1000   0
1001   0
1010   0
1011   0
1100   0
1101   0
1110   1
1111   1

To implement this function, simply store the values of Y in the slice LUT, and address the LUT with the values of ABCD to get the output. Therefore ANY 4 variable Boolean function can be simply implemented with a four input LUT. (Of course if the equation has only 3 variables then we can also implement it by setting one input constant.)
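The LUT idea is easy to model in Python (a sketch; `make_lut` and `lut_read` are illustrative names): the 16 output bits of the function are precomputed once, and every subsequent evaluation is a single table read, exactly as in the slice.

```python
def make_lut(func):
    """Precompute the 16 stored bits of any 4-input Boolean function."""
    return [func((i >> 3) & 1, (i >> 2) & 1, (i >> 1) & 1, i & 1)
            for i in range(16)]

# Program the LUT with Y = ABC + ABCD from the notes
lut = make_lut(lambda a, b, c, d: (a & b & c) | (a & b & c & d))

def lut_read(a, b, c, d):
    # evaluating the function is now just a memory lookup
    return lut[(a << 3) | (b << 2) | (c << 1) | d]

assert lut_read(1, 1, 1, 0) == 1
assert lut_read(1, 1, 1, 1) == 1
assert sum(lut) == 2      # Y is 1 only for inputs 1110 and 1111
```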


Multiplication in binary
Multiplying in binary follows the same form as in decimal:

         11010110    (A7...A0)
       x 00101101    (B7...B0)
       ----------
         11010110
        000000000
       1101011000
      11010110000
     000000000000
    1101011000000
   00000000000000
  000000000000000
 ----------------
 0010010110011110    (P15...P0)

38

Note that the product P is composed purely of selecting, shifting and adding A. The i-th bit (column) of B indicates whether or not a shifted version of A is selected in the i-th row of the sum. So we can perform multiplication using just full adders and a little logic for selection, in a layout which performs the shifting.

Notes:
Multiplication in decimal

Starting with an example in decimal:


   214
  x 45
 -----
  1070
 +8560
 -----
  9630

Note that we do 214 x 5 = 1070 and then add to it the result of 214 x 4 = 856, shifted one column to the left. For each additional column in the second operand, we shift the multiplication of that column with the first operand by another place:

      zzz
   x aaaa
   ------
     bbbb
  + cccc0
 + dddd00
+ eeee000
 etc...

2s complement Multiplication

39

For one negative and one positive operand just remember to sign extend the negative operand.

     -42      11010110
   x  45    x 00101101
            (sign extend the negative partial products)
  1111111111010110
  0000000000000000
  1111111101011000
  1111111010110000
  0000000000000000
  1111101011000000
  0000000000000000
  0000000000000000
  ----------------
  1111100010011110   =  -1890


Notes:
2s complement multiplication (II)

For both operands negative, subtract the last partial product. We use the trick of twos complementing (inverting the bits and adding 1) the last partial product, and adding it rather than subtracting:

     -42      11010110
   x -83    x 10101101
  1111111111010110
  0000000000000000
  1111111101011000
  1111111010110000
  0000000000000000
  1111101011000000
  0000000000000000
 +0001010100000000   (the last partial product -1110101100000000, formed as its twos complement)
  ----------------
  0000110110011110   =  3486

Of course, if both operands are positive, just use the unsigned technique! The difference between signed and unsigned multiplies results in different hardware being necessary. DSP processors typically have separate unsigned and signed multiply instructions.


Fixed Point multiplication

40

Fixed point multiplication is no more awkward than integer multiplication. In binary:

           11010.110
         x 00101.101
         -----------
          11.010110
         000.000000
        1101.011000
       11010.110000
      000000.000000
     1101011.000000
    00000000.000000
   000000000.000000
  ------------------
  0010010110.011110

and the same calculation in decimal: 26.750 x 5.625 = 0.133750 + 0.535000 + 16.050000 + 133.750000 = 150.468750

Again we just need to remember to interpret the position of the binary point correctly.


Notes:



Multiplier Implementation Options


41

- Distributed multipliers: using the logic fabric (LUTs)
- Constant multipliers: using block RAM; shift-and-add multipliers
- High speed embedded multipliers: 18 x 18 bit multipliers
- High speed integrated arithmetic slices (DSP48s): multiply, accumulate; add, multiply, accumulate

Notes:


Over the next few slides we will see that multipliers can be implemented in a variety of different ways. As multipliers are used extensively in DSP, implementing them efficiently is a priority consideration. The most basic multiplier is a 2-input version which is implemented using the logic fabric, i.e. the lookup tables within the slices of the device. This type is referred to as a distributed multiplier, because the implementation is distributed over the resources in several slices. In the case of multiplication with a constant, which is commonly required in DSP, the knowledge of one multiplicand can be exploited to create a cheaper hardware implementation than a conventional 2-input multiplier. Two approaches that will be discussed in the coming pages are ROM-based constant multipliers, and shift-and-add multipliers which sum the outputs from binary shift operations. The FPGA companies are well aware that DSP engineers desire fast and efficient multipliers, and as a result, they began incorporating embedded multipliers into their devices in the year 2000. Since then the sophistication of these components has increased, and they have been extended to feature fast adders and in many cases longer wordlengths, too. We can now think of them as embedded arithmetic slices, rather than simply multipliers.


Distributed Multipliers
This figure shows a four-bit multiplication:
42

[Figure: a 4 x 4 array multiplier built from full adder (FA) cells; each cell ANDs one bit of a with one bit of b to form a partial product bit, and the diagonal layout of the cells performs the shifting and the summation into product bits p7...p0.]

Example:

      1101       13
    x 1011     x 11
    ------
      1101
     1101
    0000
   1101
  --------
  10001111      143
The AND gate connected to a and b performs the selection for each bit. The diagonal structure of the multiplier implicitly inserts zeros in the appropriate columns and shifts the a operands to the right. Note that this structure does not work for signed twos complement!

Notes:
Note the function of the simple AND gate. The operation of multiplying 1s and 0s is the same as ANDing 1s and 0s:


A B | Z
0 0 | 0
0 1 | 0
1 0 | 0
1 1 | 1

Z = A x B (where x = multiply), or in Boolean algebra, Z = A AND B = AB. Hence the AND gate is the bit multiplier. The function of one partial product stage of the multiplier is as shown below.
[Figure: one partial product stage; bit b0 is ANDed with each of the bits a3 a2 a1 a0 and the result is added to the incoming partial sum x3 x2 x1 x0 by a row of full adders:]

y4 y3 y2 y1 y0 = b0 (a3 a2 a1 a0) + x3 x2 x1 x0


Distributed Multiplier Cell


43

This shows the top half of a slice, which implements one multiplier cell.

[Figure: the upper half of a Virtex-II Pro slice wired as one multiplier cell; inputs S, A and B feed the LUT and the MULTAND gate, and the carry logic combines the LUT output with Cin to produce Sout and Cout. NOTE: This implementation features a Virtex-II Pro FPGA.]

Notes:


Picture of Xilinx Virtex-II Pro slice (upper half) taken from Virtex-II Pro Platform FPGAs: Introduction and Overview, DS083-1 (v2.4) January 20, 2003. http://www.xilinx.com

The LUT implements the XOR of two ANDs, with inputs G3 (S), G2 (A), G1 (B). The dedicated MULTAND unit is required as the intermediate product G1G2 cannot be obtained from within the LUT, but is required as an input to MUXCY. The two AND gates perform a one-bit multiply each, and the result is added by the XOR plus the external logic (MUXCY, XORG):

Sout = Cin xor D
Cout = D'(AB) + Cin D      (where D is the LUT output and ' denotes inversion)

This structure will perform one cell of the multiplier (see the next slide...). Note that whereas the signal flow graph of the distributed multiplier shows signals propagating from the top and right of the diagram to the bottom, the internal structure of the FPGA slice logic results in a different configuration when implemented on a device.


ROM-based Multipliers

44

Just as logical functions such as XOR can be stored in a LUT as shown for addition, we can use storage-based methods to do other operations. By using a ROM, we can store the result of every possible multiplication of two operands. The two operands A and B are concatenated to form the address with which to access the ROM. The value stored at that address is the multiplication result, P.

[Figure: a ROM-based multiplier with 4-bit operands; the concatenated address A:B selects one of 2^8 = 256 8-bit data locations. Example: A = 1010 (decimal -6) and B = 0011 (decimal 3) form the address 1010 0011, which holds the product P = 1110 1110 (decimal -18).]

Notes:


There is one serious problem with this technique: as the operand size grows, the ROM size grows exponentially. For two N bit input operands, there are 2^2N possible results, and hence the ROM has 2^2N entries. The output result is 2N bits long, and in total 2N x 2^2N bits of storage are required. For example, with 8 bit operands (a fairly reasonable size), 1 Mbit of storage is required - a large quantity. For bigger operands, e.g. 16 bits, a huge quantity of storage is required: 16 bit operands require 128 Gbits of storage, and hence a ROM-based multiplier is clearly not a realistic implementation choice!
Input Wordlength (N) | Output Wordlength (2N) | No. of ROM entries (2^2N) | Total ROM Storage (2N x 2^2N)
 4 |  8 | 2^8  = 256               | 2 Kbits
 6 | 12 | 2^12 = 4,096             | 48 Kbits
 8 | 16 | 2^16 = 65,536            | 1 Mbit
10 | 20 | 2^20 = 1,048,576         | 20 Mbits
12 | 24 | 2^24 = 16,777,216        | 384 Mbits
14 | 28 | 2^28 = 268,435,456       | 7 Gbits
16 | 32 | 2^32 = 4,294,967,296     | 128 Gbits
18 | 36 | 2^36 = 68,719,476,736    | 2.25 Tbits
20 | 40 | 2^40 = 1,099,511,627,776 | 40 Tbits
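The ROM-based scheme, and its exponential storage growth, can be modelled in Python (a sketch; a dict stands in for the ROM, and `bits`/`rom_multiply` are illustrative names):

```python
N = 4   # operand wordlength

def bits(x):
    """Mask a signed value to its N-bit 2s complement pattern."""
    return x & ((1 << N) - 1)

# Build the ROM: store every possible product, addressed by A:B
rom = {}
for a in range(-(1 << (N - 1)), 1 << (N - 1)):
    for b in range(-(1 << (N - 1)), 1 << (N - 1)):
        rom[(bits(a) << N) | bits(b)] = a * b   # concatenated address

def rom_multiply(a, b):
    # multiplication is reduced to a single memory read
    return rom[(bits(a) << N) | bits(b)]

assert rom_multiply(-6, 3) == -18     # the example from the slide
assert len(rom) == 2 ** (2 * N)       # 256 entries: exponential in N
```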


Input Wordlength and ROM Addresses

45

Consider a ROM multiplier with 8-bit inputs: 65,536 8-bit locations are required to store all possible outputs... so 1Mbit of storage is needed!
[Figure: an 8-bit ROM-based multiplier; the concatenated address A:B selects one of 2^16 = 65,536 16-bit data locations. Example: A = 0110 1011 (decimal 107) and B = 0100 0011 (decimal 67) form the address 0110 1011 0100 0011 (decimal 27,459), which holds the product P = 0001 1100 0000 0001 (decimal 7,169).]

Notes:


For example, if the B input was the constant value 75, the possible input words would be composed of the 256 possible combinations of the upper 8 bits of the address, concatenated with the 8-bit binary word 0100 1011, as shown below. The result is that only 256 of the 65,536 memory locations are actually accessed. Therefore, when one of the inputs to the ROM-based multiplier is fixed, the size of the required ROM can be reduced to 256 locations of 16-bit data (note that the precision of the stored output words remains 16 bits). The total memory required is thus 256 x 16 = 4 kbits. However, depending on the value of the constant, it may also be possible to reduce the length of the stored results. For instance, if the value of B is (decimal) 10, the maximum magnitude output product generated by the multiplication of B with any 8-bit input A will be: -128 x 10 = -1280. As -1280 can be represented with 12 bits, that represents a further saving of 4 bits storage x 256 memory locations = 1 kbit.

[Figure: with B fixed at 75 (0100 1011), the 16-bit address A:B always holds 0100 1011 in its lower 8 bits; the 256 possible addresses (decimal) are:

  0 x 2^8 + 75 = 75
  1 x 2^8 + 75 = 331
  2 x 2^8 + 75 = 587
  3 x 2^8 + 75 = 843
  ...
253 x 2^8 + 75 = 64,843
254 x 2^8 + 75 = 65,099
255 x 2^8 + 75 = 65,355]

The total storage requirement for the B = 10 constant coefficient multiplier would therefore be 3 kbits... significantly smaller than the 1 Mbit needed for a general 8-bit x 8-bit multiplier where both operands are unknown!

ROM-based Constant Multipliers

46

ROM-based multipliers with a constant input require fewer addresses. The storage required for output words may also be reduced, if the maximum result does not require the full numerical range of:

-2^(2N-1) <= result <= 2^(2N-1) - 1
The maximum product and output wordlength can be calculated for the particular constant value, and the multiplier optimised accordingly...
[Figure: input A is an unknown 8 bit signed number (maximum absolute value -128); with the constant B = -83, the maximum product is -83 x -128 = 10,624, so only a 15-bit representation is required rather than the full 16 bit signed result: 1 bit saved!]
Additional optimisations allow cost to be reduced further.



Notes:


Constant multipliers can be implemented using the LUTs within the logic fabric (distributed ROM), or with one or more of the Block RAMs available on most devices. The selection is influenced by the other demands placed on these resources by the rest of the system being designed. In System Generator, the designer can specify the implementation style via the Constant Multiplier dialog box, along with the constant value, the output wordlength, and other parameters.


Multiplication by Shift and Add


Multiplication by a power-of-2 can be achieved simply by shifting the number to the left or right by the appropriate number of binary places:

    -31 x 2^4 = -496        (x16: shift left 4 places)
    0.625 x 2^-2 = 0.15625  (x0.25: shift right 2 places)

Extending this a little, multiplications by other numbers can be performed at low cost by creating partial products from shifts, and then adding them together.
    21 x 9 = (21 x 2^3) + 21 = 168 + 21 = 189              (x9 = x8 + x1)
    1 x 1.3125 = 1 + (1 x 2^-2) + (1 x 2^-4) = 1.3125      (x1.3125 = x1 + x0.25 + x0.0625)

Notes:

Shift operations are effectively free in terms of logic resources, as they are implemented using routing only. Therefore multiplications by power-of-two numbers are very cheap! By recognising that multiplication by any other number can be achieved by summing partial products formed from power-of-two shifts, any arbitrary multiplication can be decomposed into a series of shift and add operations. The closer the desired multiplication is to a power-of-two, i.e. the fewer partial products that are required, the fewer adders are needed, and hence the lower the cost of the multiplier. This type of multiplier is suitable only for constant multiplications, because there is only one input, and the result is achieved through the configuration of the hardware.

The technique can be particularly powerful when applied to parallel multiplications of the same input: partial product terms common to several multiplications can be shared, and the overall effort reduced. Transpose form filters are very suitable for optimisation in this way. Taking the simple example of two concurrent multiplications of one input, x9 and x24, it is clear that the shift left by three places can be shared, as x8 is common to both operations: x9 = x8 + x1 and x24 = x16 + x8, so fewer partial products are needed than if x24 and x9 were calculated separately.
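The shared-partial-product idea can be sketched directly in software (a hypothetical function, using plain integer shifts):

```python
def mult_by_9_and_24(x):
    """Compute 9*x and 24*x together using only shifts and adds,
    sharing the x8 partial product between the two results."""
    x8 = x << 3              # shared partial product: x * 8
    x16 = x << 4             # partial product: x * 16
    return x8 + x, x16 + x8  # 9x = 8x + x,  24x = 16x + 8x
```

Only three shifts and two adds are used for the pair of products, rather than the four shifts and two adds of two independent shift-and-add multipliers.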


Embedded Multipliers


The Xilinx Virtex-II and Virtex-II Pro series were the first to provide on-chip multipliers, in the early 2000s. These are in hardware on the FPGA ASIC, not in the user slice-logic area. They are therefore permanently available, and they use no slices. They also consume less power than a slice-based equivalent and can be clocked at the maximum rate of the device.

[Figure: an 18 x 18-bit embedded multiplier with inputs A and B and product P.]
A and B are 18-bit input operands, and P is the 36-bit product, i.e. P = A x B. Depending upon the actual FPGA, between 12 and more than 2000 (for the top-of-range Virtex-6) of these dedicated multipliers are available.
Notes:

Looking at a device floorplan, you can clearly see the embedded multipliers, which are located next to Block RAMs on the FPGA in order to support high speed data fetching/writing and computation.

[Floorplan detail: columns of 18 x 18 embedded multipliers sit adjacent to the Block RAM columns, surrounded by the slice logic.]

Information on dedicated multipliers taken from Virtex-II Pro Platform FPGAs: Introduction and Overview, DS083-1 (v2.4) January 20, 2003. http://www.xilinx.com.


Embedded Multiplier Efficiency


It can be easy to utilise on-chip embedded multipliers inefficiently through choice of wordlengths...

[Figure: an 18 x 18-bit multiply (36-bit product) uses 1 embedded multiplier at 100% utilisation; a 4 x 4-bit multiply also uses 1 embedded multiplier, but at only ~5% utilisation.]

When using multipliers in System Generator... be careful:

[Figure: for an 18 x 18-bit multiply (36-bit result) SysGen will use 1 embedded multiplier, but for a 19 x 19-bit multiply (38-bit result) it will use 4 embedded multipliers.]


Notes:

If you specify the use of embedded multipliers for a particular multiplier in System Generator, the tool will do exactly as you have asked, and implement it entirely using embedded multipliers. However, depending on the wordlengths involved, this may lead to an inefficient implementation. The wordlengths of the embedded multipliers are fixed at 18 x 18 bits, and it makes sense to use them as fully as possible. It is relatively easy to see that a 4 x 4-bit multiply will greatly underuse the capabilities of the multiplier, and this particular multiply operation might be better mapped to a distributed implementation, leaving the embedded multiplier free for use somewhere else. Of course, these decisions are made in the context of some larger design with its own particular needs for the various resources available on the FPGA being targeted. Perhaps less obviously, mapping a multiplication to embedded multipliers where the input operands are slightly longer than 18 bits is also inefficient. This may result in, for example, the following implementation for a requested 19 x 19-bit multiplier, where 4 embedded multipliers are used instead of the expected 1!

[Figure: a 19 x 19-bit multiply decomposed over 4 embedded multipliers - an 18 x 18 product for the low bits, 1 x 18 and 18 x 1 products for the top bit of each operand, and a 1 x 1 product for the pair of top bits - combined to form the 38-bit result.]
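The four-multiplier decomposition can be checked numerically. The sketch below uses unsigned arithmetic for simplicity (the real hardware operates on signed 18-bit operands, and the function name is hypothetical):

```python
def mult19x19(a, b):
    """Multiply two unsigned 19-bit values using four narrower multiplies,
    mirroring how a 19 x 19 request maps onto 18 x 18 embedded multipliers."""
    a_lo, a_hi = a & 0x3FFFF, a >> 18   # 18 low bits, 1 high bit
    b_lo, b_hi = b & 0x3FFFF, b >> 18
    return (a_lo * b_lo                 # 18 x 18 partial product
            + ((a_hi * b_lo) << 18)     # 1 x 18
            + ((a_lo * b_hi) << 18)     # 18 x 1
            + ((a_hi * b_hi) << 36))    # 1 x 1
```

Splitting each operand into an 18-bit part and a 1-bit part produces exactly the four partial products shown in the figure.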

High Speed Arithmetic Slices (DSP48s)


As much DSP involves the multiply-accumulate (MAC) operation, soon after embedded multipliers came DSP48 slices (on the Virtex-4). These feature an 18 x 18-bit multiplier followed by a 48-bit adder/accumulator.

[Figure: Virtex-4 DSP48 slice - an 18 x 18-bit multiplier producing a 36-bit product, feeding a 48-bit adder/accumulator with a 48-bit output.]

Like the embedded multipliers, these are low power and fast. The ability to cascade DSP48s together also means that whole filters can be constructed without having to use any logic slices.
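The MAC pattern the DSP48 implements can be sketched as a software model (the mask simply mimics the 48-bit accumulator width; the names are hypothetical):

```python
MASK48 = (1 << 48) - 1   # accumulator width of the DSP48

def mac_dot(samples, coeffs):
    """Dot product computed as a running multiply-accumulate,
    one product added into the 48-bit accumulator per step."""
    acc = 0
    for x, h in zip(samples, coeffs):
        acc = (acc + x * h) & MASK48
    return acc
```

This is exactly the inner loop of an FIR filter: one DSP48 can perform one tap's multiply and accumulation per clock cycle.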
Notes:

The next series of FPGAs (the Virtex-5) enhanced the capabilities of the DSP slice with the DSP48E. The major improvements of this slice are logic capabilities within the adder/subtractor unit, and an extended wordlength of one input to 25 bits. The maximum clock frequency also increased in line with the speed of the device.

[Figure: Virtex-5 DSP48E slice - a 25 x 18-bit multiplier producing a 43-bit product, feeding a 48-bit adder with a 48-bit output.]


DSP48s with Pre-Adders


The Spartan-3A DSP series and subsequent Spartan-6 feature a version of the DSP48 with a pre-adder, prior to the multiplier. This feature is especially useful for DSP structures like symmetric filters, because it allows the total number of multiplications to be reduced.

[Figure: Spartan-3A DSP DSP48A / Spartan-6 DSP48A1 slice - two 18-bit inputs feed a pre-adder, whose sum drives the 18 x 18-bit multiplier (36-bit product) and 48-bit accumulator.]

Notes:

The Virtex-6 offers a combination of the benefits of the Virtex-5 (the longer wordlength and arithmetic unit), together with the pre-adder from the Spartan series. This results in a very computationally powerful device, especially as it can be clocked at 600MHz, and the largest chips have 2000+ of them!

[Figure: Virtex-6 DSP48E1 slice - a 25-bit pre-adder feeding a 25 x 18-bit multiplier (43-bit product) and a 48-bit adder/accumulator with a 48-bit output.]
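The pre-adder saving on a symmetric filter can be sketched as follows (a hypothetical helper, assuming an even-length symmetric coefficient set):

```python
def symmetric_fir_sum(x, h):
    """Dot product with symmetric coefficients h[i] == h[n-1-i]:
    the pre-add lets two taps share a single multiply."""
    n = len(h)
    acc = 0
    for i in range(n // 2):
        acc += (x[i] + x[n - 1 - i]) * h[i]  # pre-add, then one multiply
    return acc
```

An n-tap symmetric filter thus needs only n/2 multiplies, with each pre-add mapping directly onto the DSP48A/DSP48E1 pre-adder.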


Division (i)
Divisions are sometimes required in DSP, although not very often. 6-bit non-restoring division array:

[Figure: a 6-bit non-restoring division array computing Q = B / A. Each row of full-adder (FA) cells conditionally adds or subtracts the divisor A, and the carry out of each row forms one quotient bit, from q5 (MSB) down to q0.]

Note that each cell can perform either addition or subtraction, as shown in an earlier slide: either Sin + Bin or Sin - Bin can be selected.
Notes:

A direct method of computing division exists. This paper-and-pencil method may look familiar, as it is often taught in school. A binary example is given below. Note that each stage computes an addition or subtraction of the divisor A. The quotient is made up of the carry bits from each addition/subtraction: if the quotient bit is a 0, the next computation is an addition, and if it is a 1, the divisor is subtracted. It is not difficult to map this example onto the structure shown on the slide.

Example: B = 01011 (11), A = 01101 (13), so -A = 10011. Compute Q = B / A.

    R0 = B           01011
    R1 = R0 - A      01011 + 10011 = 11110    carry 0 -> q4 = 0
    2.R1             11100
    R2 = 2.R1 + A    11100 + 01101 = 01001    carry 1 -> q3 = 1
    2.R2             10010
    R3 = 2.R2 - A    10010 + 10011 = 00101    carry 1 -> q2 = 1
    2.R3             01010
    R4 = 2.R3 - A    01010 + 10011 = 11101    carry 0 -> q1 = 0
    2.R4             11010
    R5 = 2.R4 + A    11010 + 01101 = 00111    carry 1 -> q0 = 1

Q = B / A = 01101 x 2^-4 = 0.8125
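The worked example maps directly to a few lines of software. This is a behavioural sketch of the algorithm, not the array hardware (the function name is hypothetical):

```python
def nonrestoring_divide(b, a, nbits):
    """Quotient bits of b / a (with b < a), MSB first: the sign of each
    partial remainder selects add or subtract for the next step."""
    r = b - a                     # the first row always subtracts
    q = [0 if r < 0 else 1]
    for _ in range(nbits - 1):
        r = 2 * r + (a if q[-1] == 0 else -a)  # add after a 0, subtract after a 1
        q.append(0 if r < 0 else 1)
    return q
```

For b = 11 and a = 13 this returns the bits 0 1 1 0 1, i.e. Q = 01101 x 2^-4 = 0.8125, matching the hand calculation above.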



Division (ii)
There is an alternative way to compute division using another paper-and-pencil technique: binary long division, in which the divisor is repeatedly compared with, and subtracted from, the shifted partial remainder.

[Figure: long-division working of 01011.0000 ÷ 01101, producing the quotient 00000.1101 (0.8125) - the same result as the non-restoring example - shown alongside the corresponding VHDL design.]


The Problem With Division


An important aspect of division is that the quotient is generated MSB first - unlike multiplication or addition/subtraction! This has implications for the rest of the system: it is unlikely that the quotient can be passed on to the next stage until all of its bits have been computed, slowing the system down. An N by N array also has another problem - ripple-through adders. We must wait for N full-adder delays before the next row can begin its calculations. Unlike multiplication there is no way around this, and as a result division is always slower than multiplication, even when performed on a parallel array - an N by N multiply will run faster than an N by N divide!

R. Stewart, Dept EEE, University of Strathclyde, 2010

Notes:

Introduction to FPGA

By looking at the top two rows of a 4 x 4 division array we can see that the first bit to get generated is the MSB of the quotient. This is unlike the multiplication array that can also be seen below, where the LSB is generated first. This is a problem when using division as most operations require the LSBs to start a computation and hence the whole solution will have to be generated before the next stage can begin. Another problem for division is the fact that it takes N full adder delays before the next row can start. In the examples below, the order in which the cells can start has been shown. So for the multiplier, the first cell on the second row is the 3rd cell to start working. However, for the divider, the first cell on the second row is only the 5th cell to start working because it has to wait for the 4 cells on the first row to finish.
[Figure: the top rows of a 4 x 4 division array (quotient bits q3, q2, with the FA add/subtract cell detail) and of a 4 x 4 multiplier array (partial product bits p0, p1), with each cell numbered in the order it can begin computing. In the multiplier the first cell of the second row is the 3rd to start working; in the divider it is only the 5th, because it must wait for all 4 cells of the first row to finish.]

Pipelining The Division Array


The division array shown earlier can be pipelined to increase throughput.


[Figure: the 6-bit division array with pipeline registers inserted between rows, and matching pipeline delays applied to the operand inputs so that each row receives its data in the correct cycle; the quotient bits of Q = B / A, q5 down to q0, emerge over successive cycles.]

Notes:

To increase the throughput, the critical path can be broken by inserting pipeline registers at appropriate points. Without pipelining, the delay (critical path) from new data arriving to registering the full quotient is N^2 full adders, and this delay sets the maximum rate at which new data can enter the array. By pipelining the array, the critical path is reduced to just N full adders, so the rate at which new data can arrive increases dramatically.
[Figure: the 4 x 4 division array without pipelining, where the critical path runs through N^2 full adders, and with pipelining, where the longest register-to-register path - the critical path - is only N full adders.]


Square Root (i)


6-bit non-restoring square root array:

[Figure: a 6-bit non-restoring square root array computing B = √A from the radicand bits of A, MSB first. Each row of controlled add/subtract (FA) cells produces one result bit of B; the array is essentially half of the division array.]

The square root is found (along with divides) in DSP algorithms such as the QR algorithm, vector magnitude calculations, and communications constellation rotation.
Notes:

Looking carefully at the non-restoring square root array, we can note that this array is essentially half of the division array! If the division array above is cut diagonally from the left, we can see the cells that are needed for the square root array. The two extra cells on the right-hand side are standard cells which can be simplified. So square root can be performed twice as fast as divide, using half of the hardware!

[Worked example: A = 10 11 01 01 (181), processed two radicand bits at a time, MSB first. Each stage appends the next bit pair to the partial remainder and adds or subtracts a trial value formed from the result bits found so far; the carry from each stage gives the next result bit: b3 = 1, b2 = 1, b1 = 0, b0 = 1, so B = 1101 (13), consistent with 13^2 = 169 ≤ 181 < 14^2 = 196.]

Square Root and Divide - Pythagoras!


The main appearance of square roots and divides is in advanced adaptive algorithms such as QR using Givens rotations. For these techniques we often find equations of the form:

    cos θ = x / √(x^2 + y^2)    and    sin θ = y / √(x^2 + y^2)

So in fact we actually have to perform two squares, a divide and a square root. (Note that squaring is simpler than a general multiply!) There are a number of iterative techniques that can be used to calculate square root, however these routines invariably require multiplies and divides, and do not converge in a fixed time. There also seems to be some misinformation out there about square roots: for FPGA implementation, square roots are in fact easier and cheaper to implement than divides!
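In floating-point software the rotation values are computed directly. The sketch below (standard library only, hypothetical function name) shows the two squares, the square root and the divides the equations above call for:

```python
import math

def givens(x, y):
    """Return (cos, sin) of the Givens rotation that zeroes y."""
    mag = math.sqrt(x * x + y * y)   # two squares, an add and a square root
    return x / mag, y / mag          # two divides

# Applying the rotation maps (x, y) onto the x-axis:
#   [ c  s ] [x]   [mag]
#   [-s  c ] [y] = [ 0 ]
```

On an FPGA, the square, square-root and divide arrays discussed in this section would perform the same computation in fixed-point.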