# DSPedia Notes 1

Introduction to FPGA


Introduction: DSP and FPGAs (Slide 1)

• In the last 20 years the majority of DSP applications have been enabled by DSP processors:
    • Texas Instruments
    • Motorola
    • Analog Devices

• A number of DSP cores have been available:
    • Oak Core
    • LSI Logic ZSP
    • 3DSP

• ASICs (application specific integrated circuits) have been widely used for specific (high volume) DSP applications.
• But the most recent technology platform for high speed DSP applications is the Field Programmable Gate Array (FPGA).

This course is all about the why and how of DSP with FPGAs!
R. Stewart, Dept EEE, University of Strathclyde, 2010

Notes:

DSP is all about multiplies and accumulates/adds (MACs). As we progress through the course, we will see that most algorithms that are used for different applications employ digital filters, adaptive filters, Fourier transforms and so on. These algorithms all require multiplies and adds (note that a divide or square root is quite a rare thing in DSP). Hence a DSP algorithm or problem is often specified in terms of its MAC requirements. In particular, when comparing two algorithms, if they both perform the same job but one with fewer MACs than the other, then clearly the “cheaper” one would be the best choice.

However this implies some assumptions. One is that the required MACs are the same - but surely a multiply is a multiply! Well, yes: in the traditional DSP processor based situations we are likely to be using, say, a 16 bit device which will process 16 bit inputs, using 16 bit digital filter coefficients etc. With FPGAs this constraint is removed - we can use as many, or as few bits, as are required. Therefore we can choose to optimise and schedule DSP algorithms in a completely different way.

[Figure: circuit board with a DSP processor (DSP56307), ADC and DAC, amplifiers/filters and a general purpose input/output bus, between the voltage input and the voltage output.]

Ver 10.423

The FPGA DSP Evolution (Slide 2)

• Since around 1998 the evolution of FPGAs into the DSP market has been sustained by classic technology progress such as the ever present Moore’s law.
• Late 1990s: FPGAs allow multipliers to be implemented in FPGA logic fabric. A few multipliers per device are possible.
• Early 2000s: FPGA vendors place hardwired multipliers onto the device with clocking speeds of > 100MHz. Number of multipliers from 4 to > 500.
• Mid 2000s: FPGA vendors place DSP algorithm signal flow graphs (SFGs) onto devices. Full (pipelined) FIR filter SFGs, for example, are available.
• Late 2000s to early 2010s: who knows! Probably more DSP power, more arithmetic capability (fast square root, divide), more floating point support. Rest assured more DSP is coming.

Notes:

Technology just keeps moving. Anyone who has purchased a new laptop knows the feeling. If you just wait, then in the next quarter you will get the new model with integrated WiFi or WiMax, and a faster processor. Of course wait another quarter and in 6 months it will be improved again - also, the new faster, better, bigger machine is likely to be cheaper! Such is technology.

DSP for FPGAs is just the same. If you wait another year it's likely the vendors will bring out prepackaged algorithms for precisely what you want to do. And they will be easier to work with - higher level design tools, design wizards and so on. So if you are planning to design a QR adaptive equalizing beamformer for MIMO implementation of a software radio for 802.16 - then if you wait, it will probably be a free download in a few years.

But of course who can wait? Therefore in this course, we discuss and review the fundamental strategies of designing DSP for FPGAs. Like all technologies you still need to know how it works if you really want to use it.

FPGAs: A “Box” of DSP blocks (Slide 3)

• We might be tempted to think of the latest FPGAs as repositories of DSP components just waiting to be connected together.
• Of course the resource is finite and the connections available are finite.
• In the days of circuit boards one had to be careful about running busses close together, lengths of wires etc. Similar considerations are required for FPGAs (albeit out of your direct control).
• However, the high level concept is: take the blocks, & build it.

[Figure: a “box” of blocks (Clocks, Input/Output, Registers and Memory, “Connectors”, Logic, Arithmetic) feeding a flow of Design, Verify, Place and Route.]

Notes:

This is undoubtedly the modern concept of FPGA design. Take the blocks, connect them together and the algorithm is in place. As vendors provide higher level components (like the DSP48 slice from Xilinx which allows a complete FIR to be implemented) then issues such as overflow, numerical integrity and so on are taken care of.

Do we actually need an FPGA/IC engineer then? Do we actually need a DSP engineer? Yes in both cases, but modern toolsets and design flows are such that it might be the same person.

There is lots to worry about. In terms of the DSP design, is the arithmetic correct (i.e. overflows, underflows, saturates etc)? Do the latency or delays used allow the integrity to be maintained? For the FPGA, what device do we need, and can we clock at a high enough rate? Does the device place and route, and how efficient is the implementation? Just like compilers, different vendors' design flows will give different results (some better than others).

Binary Addition and Multiply (Slide 4)

• The bottom line for DSP is multiplies and adds - and lots of them!
• Adding two N bit numbers will produce up to an N+1 bit number: N bits + N bits = N+1 bits.
• Multiplying two N bit numbers can produce up to a 2N bit number: N bits x N bits = 2N bits.
• So with a MAC (multiply and accumulate/add) of two N bit numbers we could, in the worst case, end up with 2N+1 bits wordlength.
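The wordlength growth stated on this slide can be checked numerically. The sketch below is only an illustration, assuming unsigned operands and using a hypothetical helper name `word_growth`; it takes the largest N bit value and counts the bits needed for the worst-case sum, product and MAC result.

```python
def word_growth(n: int) -> tuple[int, int, int]:
    """Bits needed for the worst-case sum, product and MAC of two
    unsigned n-bit operands (illustrative helper, not from the notes)."""
    largest = (1 << n) - 1                  # largest unsigned n-bit value
    product = largest * largest             # worst-case multiply result
    sum_bits = (largest + largest).bit_length()
    product_bits = product.bit_length()
    # MAC: a worst-case product added to an equally large accumulated value.
    mac_bits = (product + product).bit_length()
    return sum_bits, product_bits, mac_bits


print(word_growth(8))   # (9, 16, 17): N+1, 2N and 2N+1 bits for N = 8
```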

Notes:

If the wordlength grows beyond the maximum value you can store we clearly have the situation of numerical overflow, which is a non-linear operation and not desirable.

For a typical DSP filtering type operation we may require to take, say, an array of 24 bit numbers and multiply by an array of another 24 bit numbers. The result of each multiply will be a 48 bit number. If we then add two 48 bit numbers together, and they both just happen to be large positive values, then the result could be a 49 bit number. Now if we add many 48 bit numbers together (and they just all happen to be large positive values), then the final result may have a word growth of quite a few bits.

Within traditional DSP processors this wordlength growth is well known and catered for. For example, the Motorola 56000 series is so called because it has a 56 bit accumulator, i.e. the largest result of any “addition” operation can have 56 bits. So one must assume that Motorola had a good look at this, and realised it was fairly unlikely that the result of adding these 48 bit products together would ever be larger than 56 bits - so 56 bits was chosen. (Of course if you did have a problem that grew beyond 56 bits you would have to put special trapping into the code to catch this.)
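As a rough worst-case check on the accumulator discussion above, the sketch below counts the extra bits needed to sum K products without any possibility of overflow: ceil(log2 K) bits on top of the product wordlength. The function names are hypothetical, and the bound is a general arithmetic fact rather than a statement of how the 56 bit width was actually chosen.

```python
def accumulator_bits(product_bits: int, num_products: int) -> int:
    """Worst-case wordlength after summing num_products values that are
    each product_bits wide (illustrative helper)."""
    extra = (num_products - 1).bit_length()   # equals ceil(log2(num_products))
    return product_bits + extra


def guaranteed_products(product_bits: int, accumulator_width: int) -> int:
    """Number of products that can always be summed without overflow."""
    return 1 << (accumulator_width - product_bits)


print(accumulator_bits(48, 2))       # 49: two 48 bit values can need 49 bits
print(guaranteed_products(48, 56))   # 256: 8 spare bits cover 256 worst-case sums
```

Under these assumptions a 56 bit accumulator only overflows on 48 bit products if more than 256 worst-case values are accumulated, which is why trapping is needed only for unusually long sums.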

The “Cost” of Addition (Slide 5)

• A 4 bit addition can be performed using a simple ripple adder:

[Figure: 4 bit ripple adder built from four full adders (Σ). Each stage takes Ai and Bi and produces Si; carries C0 to C2 ripple from the LSB stage towards the MSB stage, the carry in to the LSB stage is ‘0’, and the carry out of the MSB stage (C3) forms S4.]

        A3 A2 A1 A0
    +   B3 B2 B1 B0
     S4 S3 S2 S1 S0

• Therefore an N bit addition could be performed in parallel at a cost of N full adders.

Notes:

The simple Full Adder (FA): adds two bits plus one carry in bit, to produce sum and carry out.

$S_{out} = \bar{A}\bar{B}C + \bar{A}B\bar{C} + A\bar{B}\bar{C} + ABC = A \oplus B \oplus C$

$C_{out} = \bar{A}BC + A\bar{B}C + AB\bar{C} + ABC = AB + AC + BC$

(Here $C$ denotes the carry in, $C_{in}$.)

| A | B | Cin | Cout | Sout |
|---|---|-----|------|------|
| 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 1 | 0 | 1 |
| 0 | 1 | 0 | 0 | 1 |
| 0 | 1 | 1 | 1 | 0 |
| 1 | 0 | 0 | 0 | 1 |
| 1 | 0 | 1 | 1 | 0 |
| 1 | 1 | 0 | 1 | 0 |
| 1 | 1 | 1 | 1 | 1 |

Example: +11 + 13 = +24, i.e. 1011 + 1101 = 11000.
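The full adder equations and the ripple adder structure can be modelled in a few lines. This is a minimal behavioural sketch (bit lists are LSB first, and all function names are made up for the illustration), not a hardware description.

```python
def full_adder(a: int, b: int, cin: int) -> tuple[int, int]:
    """One full adder cell: returns (sum, carry out)."""
    s = a ^ b ^ cin                          # Sout = A xor B xor Cin
    cout = (a & b) | (a & cin) | (b & cin)   # Cout = AB + AC + BC
    return s, cout


def ripple_add(a_bits: list[int], b_bits: list[int]) -> list[int]:
    """Add two equal-length LSB-first bit lists with a chain of N full adders."""
    carry = 0                                # carry in to the LSB stage is 0
    result = []
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        result.append(s)
    result.append(carry)                     # final carry becomes the extra MSB
    return result


def to_bits(value: int, n: int) -> list[int]:
    return [(value >> i) & 1 for i in range(n)]


def from_bits(bits: list[int]) -> int:
    return sum(bit << i for i, bit in enumerate(bits))


print(from_bits(ripple_add(to_bits(11, 4), to_bits(13, 4))))   # 24
```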

.. R. Stewart.so for example a 16 bit multiply is nominally 4 times more expensive to perform than an 8 bit multiply.. Dept EEE...... 2010 .. ..The “Cost” of Multiply • A 4 bit multiply operation requires an array of 16 multiply/add cells: a3 a2 a1 a0 b3 b2 b1 b0 c3 c2 c1 c0 d3 d2 d1 d0 e3 e2 e1 e0 + f3 f2 f1 f0 p7 p6 p5 p4 p3 p2 p1 p0 b2 0 0 6 a3 0 a2 0 a1 0 a0 b0 0 b1 0 b3 0 p7 p6 p5 p4 p3 p2 p1 p0 • Therefore an N by N multiply requires N 2 cells. University of Strathclyde.

Notes:

Each cell is composed of a Full Adder (FA) and an AND gate, plus some broadcast wires. With cell inputs s, a, b and c, and outputs sout, aout, bout and cout:

$z = a \cdot b$

$s_{out} = (s \oplus z) \oplus c$

$c_{out} = \bar{s}zc + s\bar{z}c + sz\bar{c} + szc$

$a_{out} = a$, $b_{out} = b$

An 8 bit by 8 bit multiplier would require 8 x 8 = 64 cells.

Example: 11 x 9 = 99.

          1011      (11)
    x     1001      (9)
          1011      partial product
         0000
        0000
    + 1011
      1100011       (99)
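To illustrate why the cost is $N^2$ cells, the sketch below visits one cell per pair of input bits, forms the AND term z = a.b in each, and sums the terms at the appropriate bit weight. It is a behavioural model under simplifying assumptions (unsigned operands, hypothetical function name); the carry and sum wiring of the real array is replaced by ordinary integer addition.

```python
def array_multiply(a: int, b: int, n: int) -> tuple[int, int]:
    """Multiply two unsigned n-bit numbers one cell at a time.

    Returns (product, number of cells visited)."""
    product = 0
    cells = 0
    for row in range(n):                  # one row of cells per bit of b
        b_bit = (b >> row) & 1
        for col in range(n):              # one cell per bit of a
            a_bit = (a >> col) & 1
            z = a_bit & b_bit             # the AND gate in the cell: z = a.b
            product += z << (row + col)   # the adders sum z at weight 2^(row+col)
            cells += 1
    return product, cells


print(array_multiply(0b1011, 0b1001, 4))   # (99, 16)
```

Calling the same function with n = 8 and n = 16 gives 64 and 256 cells respectively, the factor of 4 quoted on the slide.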

The Gate Array (GA) (Slide 7)

• Early gate-arrays were simply arrays of NAND gates.

[Figure: array of NAND gates.]

• Designs were produced by interconnecting the gates to form combinational and sequential functions.

Notes:

The NAND gate is often called the Universal gate, meaning that it can be used to produce any Boolean logic function.

[Figure: metal layers make simple connections between NAND gates with inputs A, B, C, D and output Z, for example to implement two level logic functions such as Z = AB + CD.]

• Early gate array design flow would be design, simulate/verify, device production and test.

Early simulators and netlisters such as HILO (from GenRad) were used. For a GA, once a layer(s) of metal had been laid on a device... that’s it! No changes, no updates, no fixes.

From GA to FPGA

However simple gate arrays, although very generic, were used by many different users for similar systems, for example to implement two level logic functions, flip-flops and registers, and perhaps addition and subtraction functions. So then we move to field programmable gate arrays. Two key differences between these and gate arrays:

• They can be reprogrammed in the “field”, i.e. the logic specified is changeable.
• They no longer are just composed of NAND gates. They offer a carefully balanced selection of multi-input logic, flip-flops, multiplexors and memory.
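The two level function Z = AB + CD mentioned in these notes is a convenient test of the “universal gate” idea: by De Morgan's law it needs only three NAND gates. The sketch below checks this exhaustively; the function names are illustrative only.

```python
from itertools import product


def nand(a: int, b: int) -> int:
    """Two-input NAND on single bits."""
    return 1 - (a & b)


def z_from_nands(a: int, b: int, c: int, d: int) -> int:
    """Z = AB + CD built from three NAND gates:
    NAND(NAND(A, B), NAND(C, D)) = AB + CD by De Morgan's law."""
    return nand(nand(a, b), nand(c, d))


# Compare against the direct AND/OR form for all 16 input combinations.
matches = all(
    z_from_nands(a, b, c, d) == ((a & b) | (c & d))
    for a, b, c, d in product((0, 1), repeat=4)
)
print(matches)   # True
```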

Generic FPGA Architecture (Logic Fabric) (Slide 8)

• Arrays of gates and higher level logic blocks might be referred to as the logic fabric.

[Figure: grid of logic blocks linked by row interconnects and column interconnects, surrounded by I/O blocks on all sides.]

Notes:

The logic block in this generic FPGA contains a few logic elements. Different manufacturers will include different elements (and use different terms for logic block, e.g. slices etc). A simple logic block might contain the following:

[Figure: logic block containing logic elements, each with a LUT, select logic, a MUX, cascade/carry logic and flip flops, joined by interconnects.]

Of course the actual contents of a logic element will vary from manufacturer to manufacturer and device to device.

FPGA Architecture (Xilinx DSP Devices) (Slide 9)

• Looking more specifically at recent Xilinx FPGAs, we also find block RAMs and dedicated arithmetic blocks... both very useful for DSP!

[Figure: FPGA floorplan arranged in rows and columns, showing logic fabric, block RAM, arithmetic blocks and input/output blocks.]

Notes:

Diagram key:
• Input / Output Block (IOB)
• Block RAM
• DSP48 / DSP48A / DSP48E
• Configurable Logic Block (CLB)

One of the major features of more recent, DSP-targeted Xilinx FPGAs is the provision of dedicated arithmetic blocks, which offer lower power and higher clock frequency operation than the logic fabric (i.e. the array of CLBs). These can be configured to perform a number of different computations, and are especially suited to the Multiply Accumulate (MAC) operations prevalent in digital filtering.

Block RAMs are also used extensively in DSP. Example uses are for storing filter coefficients, encoding and decoding, and other tasks.

Despite the inclusion of these additional resources, the logic fabric still forms the majority of the FPGA. We will now look at the CLBs which comprise the logic fabric, and how they are connected together, in further detail.

Example: Xilinx Logic Blocks and Routing (Slide 10)

• Xilinx FPGA logic fabric comprises Configurable Logic Blocks (CLBs), which are groups of SLICEs (e.g. 2 or 4 SLICEs per CLB).
• Signals travel between CLBs via routing resources.
• Each CLB has an adjacent switch matrix for (most) routing.

[Figure: a CLB containing slices, its adjacent switch matrix, and connections to other CLBs. NOTE: Only a subset of routing resources is depicted.]

Notes:

The example in the main slide features a typical Xilinx FPGA architecture, and the Altera and Lattice architectures are different. Logic units differ in size, composition and name! However, in all cases, their Logic Blocks include both combinational logic and registers, and routing resources are required for connecting blocks together.

Continuing with the Xilinx example, the combinational blocks are termed Lookup Tables (LUTs), and in most devices have 4 inputs (some of the more recent devices have 6-input LUTs). These LUTs can be utilised in four modes:

• To implement a combinatorial logic function
• As Read Only Memory (ROM)
• As Random Access Memory (RAM)
• As shift registers

The register can be used as:

• A flip-flop
• A latch

Over the next few slides, the functionality of LUTs and registers in the above modes will be described.

The Lookup Table (Slide 11)

• When used to implement a logic function, the 4-bit input addresses the LUT to find the correct output, Z, for that combination of A, B, C and D.

[Figure: lookup table with inputs A, B, C, D and output Z.]

Example logic function: $Z = B\bar{C}\bar{D} + A\bar{B}C\bar{D}$

| CD \ AB | 00 | 01 | 11 | 10 |
|---------|----|----|----|----|
| 00 | 0 | 1 | 1 | 0 |
| 01 | 0 | 0 | 0 | 0 |
| 11 | 0 | 0 | 0 | 0 |
| 10 | 0 | 0 | 0 | 1 |

Notes:

The lookup table can also implement a ROM, containing sixteen 1-bit values. Instead of the four inputs representing inputs of a logic function, they can be thought of as a 4-bit address. A 1-bit value is stored within each memory location, and the appropriate output is supplied for any input address. In this example, A is considered the Most Significant Bit (MSB) and D the Least Significant Bit (LSB), and the output is Z.

| A | B | C | D | Z |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 |
| 0 | 0 | 0 | 1 | 1 |
| 0 | 0 | 1 | 0 | 0 |
| 0 | 0 | 1 | 1 | 1 |
| 0 | 1 | 0 | 0 | 1 |
| 0 | 1 | 0 | 1 | 0 |
| 0 | 1 | 1 | 0 | 0 |
| 0 | 1 | 1 | 1 | 0 |
| 1 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 1 | 0 |
| 1 | 0 | 1 | 0 | 0 |
| 1 | 0 | 1 | 1 | 0 |
| 1 | 1 | 0 | 0 | 0 |
| 1 | 1 | 0 | 1 | 0 |
| 1 | 1 | 1 | 0 | 1 |
| 1 | 1 | 1 | 1 | 1 |
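The ROM view of a LUT can be sketched directly: the sixteen stored bits are simply indexed by the 4-bit address formed from A (MSB) down to D (LSB). The contents below are the Z column of the ROM table in these notes; the names `ROM_CONTENTS` and `lut_read` are made up for the illustration.

```python
# Z values for addresses 0000 to 1111, taken from the ROM table in these notes.
ROM_CONTENTS = [1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]


def lut_read(a: int, b: int, c: int, d: int) -> int:
    """Return the stored bit for inputs A (MSB), B, C, D (LSB)."""
    address = (a << 3) | (b << 2) | (c << 1) | d
    return ROM_CONTENTS[address]


print(lut_read(0, 0, 1, 1))   # 1: address 0011 holds a 1
print(lut_read(0, 1, 0, 1))   # 0: address 0101 holds a 0
```

Changing the logic function only means changing the sixteen stored values, which is why the same structure serves as both a logic function and a small memory.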

LUTs as Distributed RAM (Slide 12)

• LUTs can also be configured as distributed RAM, in either single port or dual port modes.
    • Single port: 1 address for both synchronous write operations and asynchronous read operations.
    • Dual port: 1 address for both synchronous write operations and asynchronous read operations, and 1 address for asynchronous reads only.
• Larger RAMs can be constructed by connecting two or more LUTs together.
• Dual port RAM requires more resources than single port RAM.
• For example, a 32x1 Single Port RAM (32 addresses, 1 bit data) requires two 4-input LUTs, whereas an equivalent Dual Port RAM requires four 4-input LUTs.

Notes:

The two diagrams below demonstrate the implementation of 16x1 single and dual port RAMs in the Virtex II Pro device, respectively. Notice that the dual port RAM requires twice as many resources as the single port RAM.

[Figure: 16x1 Single Port RAM and 16x1 Dual Port RAM implementations.]

Source: Virtex-II Pro and Virtex-II Pro X Platform FPGAs: Complete Data Sheet, DS083 (v4.7), November 5, 2007.

LUTs as Shift Registers (SRL16s) (Slide 13)

• A final alternative is to use the LUT as a Shift Register of up to 16 bits.
• Additional Shift In and Shift Out ports are used, and the 4-bit address is used to define the memory location which is asynchronously read.
• For example, if the LUT input is the word 1001, the output from the 10th register is read, as depicted below.

[Figure: chain of 16 D-type registers (locations 0 to 15) clocked by CLK between SHIFT IN and SHIFT OUT. The 4-bit LUT input (e.g. 1001) selects which register drives D OUT.]

• The slice register at the output from the LUT can be used to add another 1 clock cycle delay. Using the register also synchronises the read operation.
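A behavioural sketch of the SRL16 follows, assuming the data shifts from location 0 towards location 15 on each clock and that the 4-bit address selects a location to read without affecting the shifting. The class and method names are hypothetical.

```python
class SRL16:
    """Behavioural model of a 16-deep addressable shift register."""

    def __init__(self) -> None:
        self.regs = [0] * 16              # locations 0 (newest) to 15 (oldest)

    def clock(self, shift_in: int) -> None:
        """One clock edge: every bit moves one location along the chain."""
        self.regs = [shift_in] + self.regs[:15]

    def read(self, address: int) -> int:
        """Asynchronous read of the location selected by the 4-bit address."""
        return self.regs[address & 0xF]

    @property
    def shift_out(self) -> int:
        """The Shift Out port is the last register in the chain."""
        return self.regs[15]


srl = SRL16()
srl.clock(1)                  # shift in a single 1 ...
for _ in range(9):
    srl.clock(0)              # ... followed by nine 0s (10 clocks in total)
print(srl.read(0b1001))       # 1: address 1001 reads the 10th register
```

With the address held at 1001 the output is therefore the input delayed by 10 clock cycles, which is how an SRL16 provides a programmable delay of 1 to 16 cycles.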

Notes:

As with the other LUT configurations, larger Shift Registers can be constructed by combining several LUTs. For example, a 64-bit shift register segment can be constructed by combining four 16-bit Shift Registers together, as shown below. The cascadable ports allow further interconnections for larger Shift Registers.

[Figure: four SRL16s, each followed by a flip flop (FF), spread over Slice 1 and Slice 2. The MSB output of each SRL16 feeds the DI input of the next, between the Cascadable In and Cascadable Out ports.]

Registers (Slide 14)

• The sequential logic element which follows the lookup table can be configured as either:
    • An edge-triggered D-type flip flop, or
    • A level-sensitive latch
• The input to the register may be the output from the LUT, or alternatively a direct input to the slice (i.e. the LUT is bypassed).

[Figure: LUT / register pair. The LUT inputs feed the LUT and carry logic, whose output drives the register (REG) and the slice output; a bypass input can drive the register directly.]

Notes:

A D-type flip flop provides a delay of one clock cycle, as confirmed by the truth table below (D(t) is the register input at time t, and Q(t+1) is the output 1 clock cycle later).

| D(t) | Q(t+1) |
|------|--------|
| 0 | 0 |
| 1 | 1 |

A clock signal and control inputs (set, reset, etc.) are also provided.

When configured as a latch, the control inputs define when data on the D input is “captured” and stored within the register. The Q output thereafter remains unchanged until new data is captured.

Flip flops and registers are discussed in the Digital Logic Review notes chapter.

Resets: Registers and SRL16s (Slide 15)

• Whereas registers can be reset, SRL16s cannot. Therefore, adding reset capabilities to a design has implications for resource utilisation.
• For example, consider an 8-bit shift register. If resettable, then each element requires a slice register. If not, then an SRL16 can be used.

[Figure: Resettable implementation: INPUT, CLOCK and RESET drive eight D-type registers with reset (R) spread over Slices 1 to 4. Non resettable implementation: INPUT and CLOCK drive a single SRL16 in Slice 1.]

Notes:

We can still design a resettable shift register with an SRL16... by using a slightly more sophisticated design. Instead of making all elements resettable, we can implement the first element using a slice register, and the subsequent ones using an SRL16. The reset signal is held high for 8 clock cycles, which allows the 0 input to propagate through the shift register.

Instead of using 4 slices, this design would require 2 slices at most.

[Figure: INPUT, CLOCK and RESET drive a resettable D-type register followed by an SRL16. Hold RESET signal high for 8 clock cycles to reset the shift register.]
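The reset scheme in these notes can be simulated with a small behavioural model: one resettable register followed by seven non-resettable stages standing in for the SRL16. The class below is an illustrative sketch with hypothetical names, and it assumes a synchronous reset on the first element.

```python
class ResettableShiftRegister:
    """8-bit shift register: one resettable flip flop followed by an
    SRL-style chain that has no reset of its own."""

    def __init__(self, length: int = 8) -> None:
        self.ff = 0                          # first element (slice register)
        self.srl = [0] * (length - 1)        # remaining elements (SRL16)

    def clock(self, data_in: int, reset: int = 0) -> int:
        """One clock edge; returns the new output of the final stage."""
        self.srl = [self.ff] + self.srl[:-1]   # SRL takes the old FF output
        self.ff = 0 if reset else data_in      # only the FF responds to reset
        return self.srl[-1]

    def state(self) -> list[int]:
        return [self.ff] + self.srl


sr = ResettableShiftRegister()
for _ in range(8):
    sr.clock(1)                   # fill the register with ones
for _ in range(8):
    sr.clock(1, reset=1)          # hold RESET high for 8 clock cycles
print(sr.state())                 # [0, 0, 0, 0, 0, 0, 0, 0]
```

In this model, releasing RESET after only 7 cycles leaves the oldest 1 still in the final stage, which is why the reset must be held for the full length of the register.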

FPGA Arithmetic Implementation (Slide 16)

• The implementation of arithmetic operations in FPGA hardware is an integral and important aspect of DSP design.
• The following key issues are presented in this chapter:
    • Number representations: binary word formats for signed and unsigned integers, 2’s complement, fixed point and floating point.
    • Binary arithmetic, including: overflow and underflow, saturation, truncation and rounding.
    • Structures for arithmetic operations: Addition/Subtraction, Multiplication, Division and Square Root.
    • Complex arithmetic operations.
    • Mapping to Xilinx FPGA hardware... including special resources for high speed arithmetic.

Notes:

This section of the course will introduce the following concepts:

• Integer number representations - unsigned, one’s complement, two’s complement.
• Non-integer number representations - fixed point, floating point.
• Quantisation of signals, truncation, rounding, overflow, underflow and saturation.
• Addition - decimal, two’s complement integer binary, two’s complement fixed point, hardware structures for addition, Xilinx-specific FPGA structures for addition.
• Multiplication - decimal, two’s complement integer binary, two’s complement fixed point, hardware structures for multiplication, Xilinx-specific FPGA structures for multiplication.
• Division.
• Square root.
• Complex addition.
• Complex multiplication.

Number Representations (Slide 17)

• DSP, by its very nature, requires quantities to be represented digitally using a number representation with finite precision.
• This representation must be specified to handle the “real-world” inputs and outputs of the DSP system:
    • Sufficient resolution
    • Large enough dynamic range
• The number representation specified must also be efficient in terms of its implementation in hardware:
    • The hardware implementation cost of an arithmetic operator increases with wordlength.
    • The relationship is not always linear!
• There is a trade-off between cost, and numeric precision and range.

Notes:

The use of binary numbers is a fundamental of any digital systems course, and is well understood by most engineers. However when dealing with large complex DSP systems, there can be literally billions of multiplies and adds per second. Therefore any possible cost reduction by reducing the number of bits for representation is likely to be of significant value.

For example, assume we have a DSP filtering application using 16 bit resolution arithmetic. We will show later (see Slide 42) that the cost of a parallel multiplier (in terms of silicon area - speed product) can be approximated as the number of full adder cells. Therefore for a 16 bit by 16 bit parallel multiply the cost is of the order of 16 x 16 = 256 “cells”.

The wordlength of 16 bits has been chosen (presumably) because the designer at some time demonstrated that 17 bits was too many bits, and 15 was not enough... or did they? Probably not! It's likely that we are using 16 bits because, well, that’s what we usually use in DSP processors and we are creatures of habit! In the world of FPGA DSP arithmetic you can choose the resolution. Therefore, if it was demonstrated that in fact 9 bits was sufficient resolution, then the cost of a multiplier is 9 x 9 cells = 81 cells. This is approximately 30% of the cost of using 16 bit arithmetic (81 / 256 ≈ 0.32).

Therefore it's important to get the wordlength right: too many bits wastes resources, and too few bits loses resolution. So how do we get it right? Well, you need to know your algorithms and DSP.

Unsigned Integer Numbers (Slide 18)

• Unsigned integers are used to represent non-negative numbers.
• The weightings of individual bits within a generic n-bit word are:

| bit index | n-1 (MSB) | n-2 | n-3 | ... | 2 | 1 | 0 (LSB) |
|---|---|---|---|---|---|---|---|
| bit weighting | $2^{n-1}$ | $2^{n-2}$ | $2^{n-3}$ | ... | $2^2$ | $2^1$ | $2^0$ |

• The decimal number 82 is “01010010” in unsigned format as shown:

| bit index | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
|---|---|---|---|---|---|---|---|---|
| bit weighting | $2^7$ | $2^6$ | $2^5$ | $2^4$ | $2^3$ | $2^2$ | $2^1$ | $2^0$ |
| example binary number | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| decimal representation | $0 \times 2^7$ | $1 \times 2^6$ | $0 \times 2^5$ | $1 \times 2^4$ | $0 \times 2^3$ | $0 \times 2^2$ | $1 \times 2^1$ | $0 \times 2^0$ |

64 + 16 + 2 = 82

• The numerical range of an n-bit unsigned number is: 0 to $2^n - 1$. For example, an 8-bit word can represent all integers from 0 to 255.

Notes:

Taking the example of an 8-bit unsigned number, the range of representable values is 0 to 255:

| Integer Value | Binary Representation |
|---|---|
| 0 | 00000000 |
| 1 | 00000001 |
| 2 | 00000010 |
| 3 | 00000011 |
| 4 | 00000100 |
| ... | ... |
| 64 | 01000000 |
| 65 | 01000001 |
| ... | ... |
| 131 | 10000011 |
| ... | ... |
| 255 | 11111111 |

Note that the minimum value is 0, and the maximum value (255) is the sum of the powers of two with exponents from 0 to 8 - 1, where 8 is the number of bits in the binary word, i.e.

$2^0 + 2^1 + 2^2 + 2^3 + 2^4 + 2^5 + 2^6 + 2^7 = 255 = 2^8 - 1$
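The bit weightings can be applied mechanically. The following sketch, with illustrative function names, sums the weights of the set bits in an unsigned word written as a string (MSB first) and reports the range of an n-bit word.

```python
def unsigned_value(bits: str) -> int:
    """Sum the weights 2^(n-1) ... 2^0 of the bits that are set."""
    n = len(bits)
    return sum(int(bit) << (n - 1 - i) for i, bit in enumerate(bits))


def unsigned_range(n: int) -> tuple[int, int]:
    """Smallest and largest values of an n-bit unsigned word."""
    return 0, (1 << n) - 1


print(unsigned_value("01010010"))   # 82 = 64 + 16 + 2
print(unsigned_range(8))            # (0, 255)
```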

2’s Complement (Signed Integers) (Slide 19)

• 2’s Complement caters for positive and negative numbers in the range $-2^{n-1}$ to $+2^{n-1} - 1$, and has only one representation of 0 (zero).
• In 2’s complement, the MSB has a negative weighting:

| bit index | n-1 (MSB) | n-2 | n-3 | ... | 2 | 1 | 0 (LSB) |
|---|---|---|---|---|---|---|---|
| bit weighting | $-2^{n-1}$ | $2^{n-2}$ | $2^{n-3}$ | ... | $2^2$ | $2^1$ | $2^0$ |

• The most negative and most positive numbers are represented by:

| bit index | n-1 (MSB) | n-2 | n-3 | ... | 2 | 1 | 0 (LSB) |
|---|---|---|---|---|---|---|---|
| most negative | 1 | 0 | 0 | ... | 0 | 0 | 0 |
| most positive | 0 | 1 | 1 | ... | 1 | 1 | 1 |

Notes:

As examples, we can convert the following two 8-bit 2’s complement words to decimal. As for the unsigned representation, the decimal number 82 is “01010010” in 2’s complement signed format:

| bit index | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
|---|---|---|---|---|---|---|---|---|
| bit weighting | $-2^7$ | $2^6$ | $2^5$ | $2^4$ | $2^3$ | $2^2$ | $2^1$ | $2^0$ |
| example binary number | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| decimal representation | $0 \times -2^7$ | $1 \times 2^6$ | $0 \times 2^5$ | $1 \times 2^4$ | $0 \times 2^3$ | $0 \times 2^2$ | $1 \times 2^1$ | $0 \times 2^0$ |

64 + 16 + 2 = 82

Meanwhile the decimal number -82 is “10101110” in 2’s complement signed format:

| bit index | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
|---|---|---|---|---|---|---|---|---|
| bit weighting | $-2^7$ | $2^6$ | $2^5$ | $2^4$ | $2^3$ | $2^2$ | $2^1$ | $2^0$ |
| example binary number | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0 |
| decimal representation | $1 \times -2^7$ | $0 \times 2^6$ | $1 \times 2^5$ | $0 \times 2^4$ | $1 \times 2^3$ | $1 \times 2^2$ | $1 \times 2^1$ | $0 \times 2^0$ |

-128 + 32 + 8 + 4 + 2 = -82

2’s Complement Conversion (Slide 20)

• For 2’s Complement, converting between negative and positive numbers involves inverting all bits, and adding 1.
• For example, we have just considered 2’s complement representations of the decimal numbers -82 and +82. They are converted as shown:

| step (+82 to -82) | word |
|---|---|
| +82 | 0 1 0 1 0 0 1 0 |
| invert all bits | 1 0 1 0 1 1 0 1 |
| add 1 | + 0 0 0 0 0 0 0 1 |
| -82 | 1 0 1 0 1 1 1 0 |

| step (-82 to +82) | word |
|---|---|
| -82 | 1 0 1 0 1 1 1 0 |
| invert all bits | 0 1 0 1 0 0 0 1 |
| add 1 | + 0 0 0 0 0 0 0 1 |
| +82 | 0 1 0 1 0 0 1 0 |

Notes:

Negative numbers are obtained from positive numbers by inverting all bits and adding 1:

| Positive integer | Binary | Negative integer | Binary |
|---|---|---|---|
| 0 | 00000000 | 0 | 100000000 |
| 1 | 00000001 | -1 | 11111111 |
| 2 | 00000010 | -2 | 11111110 |
| 3 | 00000011 | -3 | 11111101 |
| ... | ... | ... | ... |
| 125 | 01111101 | -125 | 10000011 |
| 126 | 01111110 | -126 | 10000010 |
| 127 | 01111111 | -127 | 10000001 |
| | | -128 | 10000000 |

Note that when negating zero, a ninth bit is required to represent negative zero. However, if we simply ignore this ninth bit, the representation for the negative zero becomes identical to the representation for positive zero.

Notice from the above that -128 can be represented but +128 cannot.
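The invert-and-add-1 rule, together with the negative MSB weighting, can be sketched as follows for an 8-bit word. The function names are illustrative, and the mask plays the role of “ignoring the ninth bit”.

```python
def twos_complement_value(word: int, n: int = 8) -> int:
    """Decode an n-bit word using a weight of -2^(n-1) for the MSB."""
    msb = (word >> (n - 1)) & 1
    lower_bits = word & ((1 << (n - 1)) - 1)
    return lower_bits - (msb << (n - 1))


def negate(word: int, n: int = 8) -> int:
    """Invert all bits and add 1, discarding any carry into bit n."""
    mask = (1 << n) - 1
    inverted = word ^ mask
    return (inverted + 1) & mask


minus_82 = negate(0b01010010)
print(format(minus_82, "08b"), twos_complement_value(minus_82))   # 10101110 -82
print(format(negate(0), "08b"))           # 00000000: the ninth bit is dropped
print(format(negate(0b10000000), "08b"))  # 10000000: -128 has no +128 partner
```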

Fixed-point Binary Numbers (Slide 21)

• We can now define what is known as a “fixed-point” number: a number with a fixed position for the binary point.
• Bits on the left of the binary point are termed integer bits, and bits on the right of the binary point are termed fractional bits.
• The format of a generic fixed point word, comprising n integer bits and b fractional bits, is shown below. The binary point lies between bit index 0 and bit index -1.

| bit index | n-1 (MSB) | n-2 | ... | 1 | 0 | -1 | -2 | ... | -b+1 | -b (LSB) |
|---|---|---|---|---|---|---|---|---|---|---|
| bit weighting | $2^{n-1}$ | $2^{n-2}$ | ... | $2^1$ | $2^0$ | $2^{-1}$ | $2^{-2}$ | ... | $2^{-b+1}$ | $2^{-b}$ |

• The MSB has -ve weighting for 2’s complement (as for integer words).

Notes:

As examples, we consider the 2’s complement word “11010110” with the binary point in two different places.

Firstly, with the binary point three places from the right, i.e. 5 integer bits and 3 fractional bits:

| bit index | 4 | 3 | 2 | 1 | 0 | -1 | -2 | -3 |
|---|---|---|---|---|---|---|---|---|
| bit weighting | $-2^4$ | $2^3$ | $2^2$ | $2^1$ | $2^0$ | $2^{-1}$ | $2^{-2}$ | $2^{-3}$ |
| binary number | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 0 |
| decimal representation | $1 \times -2^4$ | $1 \times 2^3$ | $0 \times 2^2$ | $1 \times 2^1$ | $0 \times 2^0$ | $1 \times 2^{-1}$ | $1 \times 2^{-2}$ | $0 \times 2^{-3}$ |

-16 + 8 + 2 + 0.5 + 0.25 = -5.25

...and secondly, with the binary point five places from the right, i.e. 3 integer bits and 5 fractional bits:

| bit index | 2 | 1 | 0 | -1 | -2 | -3 | -4 | -5 |
|---|---|---|---|---|---|---|---|---|
| bit weighting | $-2^2$ | $2^1$ | $2^0$ | $2^{-1}$ | $2^{-2}$ | $2^{-3}$ | $2^{-4}$ | $2^{-5}$ |
| binary number | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 0 |
| decimal representation | $1 \times -2^2$ | $1 \times 2^1$ | $0 \times 2^0$ | $1 \times 2^{-1}$ | $0 \times 2^{-2}$ | $1 \times 2^{-3}$ | $1 \times 2^{-4}$ | $0 \times 2^{-5}$ |

-4 + 2 + 0.5 + 0.125 + 0.0625 = -1.3125

Note that these results are related by a factor of $2^2 = 4$, i.e. 4 x -1.3125 = -5.25.
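Both interpretations of “11010110” come from the same underlying signed integer, scaled by a power of two that depends on the binary point position. A minimal sketch, with an illustrative function name:

```python
def fixed_point_value(word: int, total_bits: int, frac_bits: int) -> float:
    """Interpret a 2's complement word that has frac_bits fractional bits."""
    if word >> (total_bits - 1):          # MSB set: the value is negative
        signed = word - (1 << total_bits)
    else:
        signed = word
    return signed / (1 << frac_bits)      # scale by 2^-frac_bits


WORD = 0b11010110
print(fixed_point_value(WORD, 8, 3))      # -5.25   (5 integer, 3 fractional bits)
print(fixed_point_value(WORD, 8, 5))      # -1.3125 (3 integer, 5 fractional bits)
```

Moving the binary point two places changes only the scale factor, which is where the factor of 4 between the two results comes from.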

Fixed Point Range and Precision (Slide 22)

• As with integer representations (which are also effectively fixed point numbers, but with the binary point at position 0), the binary range of fixed point numbers extends from:

| format | from | to |
|---|---|---|
| unsigned | 00000...0000 | 11111...1111 |
| signed (2’s comp.) | 10000...0000 | 01111...1111 |

• The same number of quantisation levels is present, e.g. for an 8-bit binary word, 256 levels can be represented.
• Numerical range scales according to the binary point position, e.g.:

| most negative | most positive | interval |
|---|---|---|
| 10000000. (-128) | 01111111. (+127) | 1 |
| 1000000.0 (-64) | 0111111.1 (+63.5) | 0.5 |

• Dynamic range (range / interval) is independent of the binary point position, e.g. (127 - (-128)) / 1 = 255 = (63.5 - (-64)) / 0.5

Notes:

To illustrate this further, let's consider the very simple case of a 3-bit 2’s complement word, with the binary point in all four possible positions. Clearly the numerical range is affected by the binary point position, but the relationship between the interval and range remains the same.

| binary point position | most negative | most positive | interval |
|---|---|---|---|
| 0 (all integer bits) | -4 | +3 | 1 |
| 1 | -2 | +1.5 | 0.5 |
| 2 | -1 | +0.75 | 0.25 |
| 3 (all fractional bits) | -0.5 | +0.375 | 0.125 |
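The range, interval and dynamic range figures can be generated for any wordlength and binary point position. The sketch below uses illustrative function names and assumes 2’s complement words.

```python
def fixed_point_range(total_bits: int, frac_bits: int) -> tuple[float, float, float]:
    """Return (most negative, most positive, interval) for a 2's complement
    word with frac_bits fractional bits."""
    interval = 1 / (1 << frac_bits)                    # weight of the LSB
    lowest = -(1 << (total_bits - 1)) * interval
    highest = ((1 << (total_bits - 1)) - 1) * interval
    return lowest, highest, interval


def dynamic_range(total_bits: int, frac_bits: int) -> float:
    lowest, highest, interval = fixed_point_range(total_bits, frac_bits)
    return (highest - lowest) / interval


for position in range(4):
    print(position, fixed_point_range(3, position), dynamic_range(3, position))
```

For the 3-bit word the dynamic range is 7 at every binary point position, and for an 8-bit word it is 255, matching the figure quoted on the slide.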

Fixed-point Quantisation (Slide 23)

• Consider the fixed point number format: n n n . b b b b b (3 integer bits, 5 fractional bits).
• Numbers between -4 and 3.96875 can be represented, in steps of 0.03125. As there are 8 bits, there are $2^8 = 256$ different values.
• Revisiting our sine wave example, using this fixed-point format:

[Figure: sine wave quantised using this fixed-point format, plotted between -2 and +2, with the LSB step indicated.]

• This looks much more accurate. The quantisation error is $\pm \frac{LSB}{2}$ (where LSB = least significant bit), so 0.015625 rather than 0.5!

Notes:

Quantisation is simply the DSP term for the process of representing infinite precision numbers with finite precision numbers. In the decimal world, it is familiar to most to work with a given number of decimal places. The real number π can be represented as 3.14159265... and so on. We can quantise or represent π to 4 decimal places as 3.1416. If we use “rounding” here then the magnitude of the error is:

$|3.14159265\ldots - 3.1416| = 0.00000735$

If we truncated (just chopped off the digits below the 4th decimal place) then the error is larger:

$3.14159265\ldots - 3.1415 = 0.00009265$

Clearly rounding is most desirable to maintain best possible accuracy. However it comes at a cost. The cost is relatively small, but it is not “free”.

When multiplying fractional numbers we will choose to work to a given number of places. For example, if we work to two decimal places then the calculation:

0.57 x 0.43 = 0.2451

can be rounded to 0.25, or truncated to 0.24. The results are different. Once we start performing billions of multiplies and adds in a DSP system it is not difficult to see that these small errors can begin to stack up.
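The same comparison can be made in binary. The sketch below quantises π to 5 fractional bits (the 3.5 format used on the quantisation slide) by rounding and by truncation. The function name and the choice of round-half-up are assumptions for the illustration; truncation here means dropping the low-order bits, which always rounds towards minus infinity.

```python
import math


def quantise(value: float, frac_bits: int, mode: str = "round") -> float:
    """Quantise value onto a grid with a step of 2^-frac_bits."""
    scale = 1 << frac_bits
    scaled = value * scale
    if mode == "round":
        level = math.floor(scaled + 0.5)    # nearest level (ties round up)
    elif mode == "truncate":
        level = math.floor(scaled)          # discard the bits below the LSB
    else:
        raise ValueError("mode must be 'round' or 'truncate'")
    return level / scale


rounded = quantise(math.pi, 5, "round")         # 3.15625
truncated = quantise(math.pi, 5, "truncate")    # 3.125
print(abs(rounded - math.pi), abs(truncated - math.pi))
```

With rounding the error magnitude stays within half an LSB (0.015625 here), whereas truncation can be wrong by almost a full LSB.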

Multiplication & Division via Binary Shifts

24

• The binary point position has a power-of-2 relationship with the numerical range of the word format, and any number it represents.
• Therefore if we want to multiply or divide a fixed point number by a power-of-two, this is achieved by simply shifting the numbers with respect to the binary point!

Operation                  Bit weightings                      Bits           Decimal
original number            -8, 4, 2, 1, 0.5, 0.25              0 1 0 1 1 1    5.75
shift right by 2 places    -2, 1, 0.5, 0.25, 0.125, 0.0625     0 1 0 1 1 1    1.4375
shift left by 1 place      -16, 8, 4, 2, 1, 0.5                0 1 0 1 1 1    11.5

Notes:

Of course, looking at the example in the main slide, we could also consider that the binary point is moved, rather than the bits. It amounts to the same thing! If we move the binary point, the weightings of the bits comprising the word, and hence the value it represents, change by a power-of-two factor.

Reviewing the divide-by-4 and multiply-by-2 examples from the main slide:

Bit weightings                      Bits           Decimal
-8, 4, 2, 1, 0.5, 0.25              0 1 0 1 1 1    5.75 (original number)
-2, 1, 0.5, 0.25, 0.125, 0.0625     0 1 0 1 1 1    1.4375
-16, 8, 4, 2, 1, 0.5                0 1 0 1 1 1    11.5

Ultimately the binary point is conceptual, having no effect on the hardware produced, and it falls to the DSP design tool and/or DSP designer to keep track of it.
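As a rough illustration of the point that the binary point is only an interpretation, the sketch below reads the same six bits with three different binary point positions. The function name `fixed_value` is hypothetical and chosen only for this example.

```python
def fixed_value(bits: str, frac_bits: int) -> float:
    """Value of a 2's complement bit string with `frac_bits` fractional bits."""
    raw = int(bits, 2)
    if bits[0] == "1":            # MSB set: the word is negative
        raw -= 1 << len(bits)
    return raw / (1 << frac_bits)


word = "010111"                       # the bits never change

original = fixed_value(word, 2)       # 5.75
divided_by_4 = fixed_value(word, 4)   # binary point moved 2 places: 1.4375
times_2 = fixed_value(word, 1)        # binary point moved 1 place the other way: 11.5
```

No arithmetic is performed on the stored bits; only the assumed position of the binary point differs between the three calls.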

Normalised Fixed Point Numbers

25

• Fixed point word formats with 1 integer bit and a number of fractional bits are often adopted.
• Numbers from -1 to +1 - 2^-b (i.e. just less than 1) can be represented, where b is the number of fractional bits. Some examples:

Bit weightings    -1   1/2  1/4  1/8  1/16  1/32  1/64  1/128    decimal
most -ve           1    0    0    0    0     0     0     0       -1
most +ve           0    1    1    1    1     1     1     1       +0.9921875

• Limiting the numeric range to ±1 is advantageous because it makes the arithmetic easier to work with: multiplying two normalised numbers together cannot produce a result greater than 1!

Notes:

The term Q-format is often used to describe fixed point number formats, usually in the context of DSP processors. However, it is useful to note that Q-format and 2’s complement are actually the same thing.

Q-format notation is given in the form Qm.n, where m is the number of integer bits, and n is the number of fractional bits. Notably this description excludes the MSB of the 2’s complement representation, which Q-format considers a sign bit. Therefore the total number of bits in a Q-format number is 1 + m + n. In 2’s complement, the same word format would be described as having m+1 integer bits, and n fractional bits.

For example, a Q2.5 number has a sign bit, 2 other integer bits, and 5 fractional bits, whereas in 2’s complement, this would be described as having 3 integer bits and 5 fractional bits, and hence can be represented as shown below.

Bit weightings:               -2^2       2^1  2^0           2^-1  2^-2  2^-3  2^-4  2^-5
Q-format description:         sign bit   m = 2 integer bits  n = 5 fractional bits
2’s complement description:   3 integer bits                 5 fractional bits

The Q0.15 format (often abbreviated to Q15) is used extensively in DSP as it covers the normalised range of numbers from -1 to +1 - 2^-15, and is equivalent to a 16 bit 2’s complement representation with the binary point at 15, i.e. 1 integer bit and 15 fractional bits.
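A minimal sketch of the Qm.n description, assuming round-to-nearest and saturation when converting to Q15 (neither choice is specified in the notes). The function names `q_range`, `to_q15` and `from_q15` are illustrative only.

```python
def q_range(m: int, n: int) -> tuple[float, float, float]:
    """(minimum, maximum, step) of a Qm.n word: 1 sign bit, m integer bits, n fractional bits."""
    step = 2.0 ** -n
    return -(2.0 ** m), 2.0 ** m - step, step


def to_q15(x: float) -> int:
    """Convert a value to a 16-bit Q15 integer (assumed: round to nearest, then saturate)."""
    raw = round(x * 32768)
    return max(-32768, min(32767, raw))


def from_q15(raw: int) -> float:
    """Interpret a 16-bit Q15 integer as a value in the range -1 to just under +1."""
    return raw / 32768


q2_5 = q_range(2, 5)     # (-4.0, 3.96875, 0.03125)
q15 = q_range(0, 15)     # (-1.0, 1 - 2**-15, 2**-15)
half = to_q15(0.5)       # 16384
clipped = to_q15(1.0)    # +1 is not representable, so it saturates to 32767
```

The Q2.5 range agrees with the 3 integer bit, 5 fractional bit 2's complement format used earlier in these notes.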

Fractional Motivation - Normalisation

26

• Working with fractional binary values makes the arithmetic “easier” to work with and to account for wordlength growth.
• As an example take the case of working with a “machine” using 4 digit decimals and a 4 digit arithmetic unit, with range -9999 to +9999.
• Multiplying two 4 digit numbers will result in up to 8 significant digits.

6787 x 4198 = 28491826, scale to 2849.1826, truncate to 2849

If we want to pass this number to a next stage in the machine (where arithmetic is 4 digits accuracy) then we need to scale down by 10000, then truncate.
• Consider normalising to the range -0.9999 to +0.9999.

0.6787 x 0.4198 = 0.28491826, truncate to 0.2849

Now the procedure for truncating back to 4 digits is much “easier”.

Notes:

Exactly the same idea of normalisation is applied to binary. Consider 8 bit values in 2’s complement. The range is therefore:

10000000 to 01111111 (-128 to +127)

If we normalise these numbers to between -1 and 1 (i.e. divide through by 128) then the binary range is:

1.0000000 to 0.1111111 (-1 to 0.9921875, where 127/128 = 0.9921875)

Therefore we apply the same normalising ideas as for decimal for multiplication in binary. Consider multiplying 36 x 97 = 3492, equivalent to:

00100100 x 01100001 = 0000 1101 1010 0100

In binary, normalising the values would give the calculation:

0.0100100 x 0.1100001 = 0.00110110100100

which in decimal is equivalent to:

0.28125 x 0.7578125 = 0.213134765625

Of course the two results are exactly identical and the differences are in how we handle the truncate and scale.

Note very clearly that in a DSP system the binary point is all in the eye of the designer. There is no physical connection or wire for the binary point, and the binary point is implicitly used in most DSP systems. It just makes things significantly easier in keeping track of wordlength growth, and truncating just by dropping fractional bits.

Of course if you prefer integers and would like to keep track of the scaling etc. you can do this; you will get the same answer and the cost is the same. However using the normalised values, where all inputs are constrained to be in the range from -1 to +1, it is easy to note that multiplying ANY two numbers in this range together will give a result also in the range of -1 to +1.
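The 36 x 97 example can be reproduced in a few lines. This is only a sketch: the helper `normalise` and the choice to truncate back to 7 fractional bits are assumptions made to illustrate "dropping fractional bits".

```python
def normalise(x: int, frac_bits: int = 7) -> float:
    """Interpret an 8-bit 2's complement integer as a fraction between -1 and +1."""
    return x / (1 << frac_bits)


a, b = 36, 97

integer_product = a * b                              # 3492 = 0000 1101 1010 0100
normalised_product = normalise(a) * normalise(b)     # 0.28125 * 0.7578125

# The same 16 bits read with 14 fractional bits give the normalised result.
same_bits = integer_product / (1 << 14)              # 0.213134765625

# Truncating back to an 8-bit word just drops the 7 least significant bits.
truncated_raw = integer_product >> 7                 # 27
truncated = normalise(truncated_raw)                 # 0.2109375
```

Both routes use exactly the same bits; the only difference is where the designer chooses to place the binary point.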

R. University of Strathclyde. Stewart. according to its specific input-output characteristic.Analogue to Digital Converter (ADC) 27 • An ADC is a device that can convert a voltage to a binary number. Binary Output 127 96 64 32 01111111 01100000 01000000 00100000 fs 8 bit 1 0 0 1 1 1 0 1 Binary Output ADC -2 -1 -32 -64 -96 -128 1 11001000 11000000 10100000 10000000 2 Voltage Input Voltage Input • We can generally assume ADCs operate using two’s complement arithmetic. Dept EEE. 2010 .

Notes:

Viewing the straight line portion of the device we are tempted to refer to the characteristic as “linear”. However a quick consideration clearly shows that the device is non-linear (recall the definition of a linear system from before) as a result of the discrete (staircase) steps, and also because the device clips above and below the maximum and minimum voltage swings. However if the step sizes are small and the number of steps large, then we are tempted to call the device “piecewise linear over its normal operating range”.

Note that the ADC does not necessarily have a linear (straight line) characteristic. In telecommunications for example a defined standard nonlinear quantiser characteristic is often used (A-law and μ-law). Speech signals, for example, have a very wide dynamic range: harsh “oh” and “b” type sounds have a large amplitude, whereas softer sounds such as “sh” have small amplitudes. If a uniform quantisation scheme were used then although the loud sounds would be represented adequately the quieter sounds may fall below the threshold of the LSB and therefore be quantised to zero and the information lost. Therefore non-linear quantisers are used such that the quantisation step at low input levels is much smaller than for higher level signals. A-law quantisers are often implemented by using a nonlinear circuit followed by a uniform quantiser. Two schemes are widely in use: the A-law in Europe, and the μ-law in the USA and Japan. Similarly, the DAC can have a non-linear characteristic.

[Figure: non-linear quantiser characteristic, Binary Output against Voltage Input.]

28

• Perfect signal reconstruction assumes that sampled data values are exact (i.e. infinite precision real numbers). In practice they are not, as an ADC will have a number of discrete levels.
• The ADC samples at the Nyquist rate, and the sampled data value is the closest (discrete) ADC level to the actual value:

[Figure: the signal s(t) (Voltage, axis from -4 to 4) is sampled at rate fs (sample period ts) by the ADC, giving the quantised sequence $\hat{v}(n)$ (Binary value, axis from -4 to 4) against sample, n.]

$\hat{v}(n) = \mathrm{Quantise}\{ s(n t_s) \}$, for n = 0, 1, 2, …

• Hence every sample has a “small” quantisation error.

Notes:


For example purposes, we can assume our ADC or quantiser has 5 bits of resolution and maximum/minimum voltage swing of +15 and -16 volts. The input/output characteristic is shown below:

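[Figure: 5-bit ADC input/output characteristic, Binary Output from 10000 (-16) to 01111 (+15) against Voltage Input, with a step size of 1 volt and the labels Vmin = -15 volts and Vmax = 15 volts.]
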
In the above slide figure, for the second sample the true sample value is 1.589998..., however our ADC quantises to a value of 2.
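A small sketch of such a quantiser, assuming a uniform 1 volt step, rounding to the nearest level, and clipping at the ends of the 5-bit range. The function name `adc_quantise` and these implementation choices are assumptions for illustration, not a description of any particular ADC.

```python
import math


def adc_quantise(voltage: float, bits: int = 5, step: float = 1.0) -> int:
    """Return the nearest 2's complement output level, clipped to the ADC range."""
    lowest = -(1 << (bits - 1))          # -16 for 5 bits
    highest = (1 << (bits - 1)) - 1      # +15 for 5 bits
    level = math.floor(voltage / step + 0.5)   # nearest level
    return max(lowest, min(highest, level))


sample = 1.589998
code = adc_quantise(sample)          # 2
error = sample - code                # about -0.41, inside +/- half a step
```

Inputs beyond the voltage swing are simply clipped, e.g. `adc_quantise(100.0)` gives 15 and `adc_quantise(-100.0)` gives -16.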


Quantisation Error

29

• If the smallest step size of a linear ADC is q volts, then the error of any one sample is at worst q/2 volts.
[Figure: linear ADC characteristic, Binary Output from 10000 (-16) to 01111 (+15) against Voltage Input from -Vmax to Vmax, with a step size of q volts.]


Notes:

Quantisation error is often modelled as an additive noise component, and indeed the quantisation process can be considered purely as the addition of this noise:

[Figure: an ADC with input x and output y, shown alongside the equivalent model in which the noise nq is added to x to give y.]

An Example

30

• Here is an example using a 3-bit ADC:

[Figure: four panels: Input (Amplitude/volts against time/seconds), ADC Characteristic, Output (Amplitude/volts against time/seconds) and Quantisation Error; the amplitude axes run from -4 to 4.]

Notes:

In this case the worst case error is 1/2.

31

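• The full adder circuit can be used in a chain to add multi-bit numbers. The following example shows 4 bits:

[Figure: four full adders (Σ) in a chain with inputs A3 B3, A2 B2, A1 B1 and A0 B0. The carry in to the LSB stage is ‘0’, the carries C0, C1 and C2 pass between stages, and the final carry C3 forms S4 (MSB). The sum outputs are S3, S2, S1 and S0 (LSB).]

     A3 A2 A1 A0
+    B3 B2 B1 B0
     C3 C2 C1 C0   (carries)
  S4 S3 S2 S1 S0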

• This chain can be extended to any number of bits. Note that the last carry output forms an extra bit in the sum. • If we do not allow for an extra bit in the sum, if a carry out of the last adder occurs, an “overflow” will result i.e. the number will be incorrectly represented.

Notes:

The truth table for the full adder is:

A  B  CIN    S  COUT    A + B + CIN
0  0  0      0  0       0+0+0 = 0
0  0  1      1  0       0+0+1 = 1
0  1  0      1  0       0+1+0 = 1
0  1  1      0  1       0+1+1 = 2
1  0  0      1  0       1+0+0 = 1
1  0  1      0  1       1+0+1 = 2
1  1  0      0  1       1+1+0 = 2
1  1  1      1  1       1+1+1 = 3

Full adder circuitry can be therefore produced with gates:

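[Figure: gate-level full adder with inputs A, B and C and outputs S and COUT.]

$S_{out} = \bar{A}\bar{B}C + \bar{A}B\bar{C} + A\bar{B}\bar{C} + ABC = A \oplus B \oplus C$

$C_{out} = \bar{A}BC + A\bar{B}C + AB\bar{C} + ABC = AB + AC + BC = AB + C(A \oplus B)$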

The longest propagation delay path in the above full adder is “two gates”.
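The full adder equations and the ripple-carry chain from the slide can be modelled directly. This is a behavioural sketch only; the names `full_adder` and `ripple_add` are chosen for illustration.

```python
def full_adder(a: int, b: int, cin: int) -> tuple[int, int]:
    """One-bit full adder: returns (sum, carry out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def ripple_add(a: int, b: int, width: int = 4, cin: int = 0) -> int:
    """Add two unsigned `width`-bit numbers with a chain of full adders."""
    carry = cin
    total = 0
    for i in range(width):                      # LSB first
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        total |= s << i
    return total | (carry << width)             # last carry forms the extra sum bit


example = ripple_add(0b1001, 0b1000)            # 9 + 8 = 17 = 0b10001
```

Without the extra bit (S4) the result of 9 + 8 would be read as 0b0001, which is the overflow case described on the slide.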

Subtraction

32

• Subtraction is very readily derived from addition. Remember 2’s complement? All we need to do to get a negative number is invert the bits and add 1.
• Then if we add these numbers, we’ll get a subtraction D = A + (-B):

[Figure: 4 bit 2’s complement subtractor: a chain of four full adders (Σ) with inputs A3..A0 and the inverted bits of B3..B0, the carry in to the LSB stage set to ‘1’ (the “add 1”), outputs D3..D0 and carry out C4.]

• The addition of the 1 is done by setting the LSB carry in to be 1.

Notes:

Note for 4 bit positive numbers (i.e. NOT 2’s complement) the range is from 0 to 15. For 2’s complement the numerical range is from -8 to 7.

Addition/Subtraction (using 2’s complement representation)

Sometimes we need a combined adder/subtractor with the ability to switch between modes. This can be achieved quite easily:

[Figure: 4 bit adder/subtractor: a chain of four full adders (Σ) with inputs A3..A0, and a 2-input MUX (inputs 0 and 1) on each of B3..B0, all controlled by K.]

For A + B, K = 0
For A - B, K = 1

This structure will be seen again in the Division/Square Root slides!
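A behavioural sketch of the combined adder/subtractor follows. It assumes, consistent with the subtraction slide, that K = 1 selects the inverted B bits and also supplies the carry in of ‘1’; the function name `add_sub` is hypothetical.

```python
def add_sub(a: int, b: int, k: int, width: int = 4) -> int:
    """k = 0 gives A + B, k = 1 gives A - B, as a `width`-bit 2's complement result."""
    carry = k                                   # K doubles as the LSB carry in ("add 1")
    result = 0
    for i in range(width):
        a_i = (a >> i) & 1
        b_i = ((b >> i) & 1) ^ k                # MUX: B when k = 0, inverted B when k = 1
        s = a_i ^ b_i ^ carry
        carry = (a_i & b_i) | (carry & (a_i ^ b_i))
        result |= s << i
    if result & (1 << (width - 1)):             # reinterpret as a signed value
        result -= 1 << width
    return result


difference = add_sub(5, 3, 1)     # 5 - 3 = 2
negative = add_sub(3, 5, 1)       # 3 - 5 = -2
total = add_sub(-3, 2, 0)         # -3 + 2 = -1
```

The final carry is discarded here, so results outside -8 to 7 wrap around; that overflow behaviour is the subject of the following slides.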

Wraparound Overflow & 2’s Complement

33

• With 2’s complement, overflow will occur when the result to be produced lies outside the range of the number of bits.
• Therefore for an 8 bit example the range is -128 to +127 (or in binary this is 10000000 to 01111111):

   -65       10111111            100       01100100
+ -112      +10010000           + 37      +00100101
  -177      101001111            137       10001001

For -65 + -112: with an 8 bit result we lose the 9th bit and the result “wraps around” to a positive value: 01001111 = 79.
For 100 + 37: with an 8 bit result the result “wraps around” to a negative value: 10001001 = –119.

• One solution to overflow is to ensure that the number of bits available is always sufficient for the worst case result. Therefore in the above example perhaps allow the wordlength to grow to 9 or even 10 bits.
• Using Xilinx System Generator we can specifically check for overflow in every addition calculation.

Notes:

Recall from previously that overflow detect circuitry is relatively easy to design. We just need to keep an eye on the MSBs (indicating whether a number is +ve or -ve). For example:

(-73) + 127 = 54            100 + 64 = 164
   10110111                    01100100
  +01111111                   +01000000
  100110110                    10100100

For (-73) + 127: discard the final 9th bit carry to give 00110110. No overflow.
For 100 + 64: overflow, as the MSB indicates a -ve result!

Adding +ve and -ve:  will never overflow!
Adding +ve and +ve:  if a -ve result then overflow
Adding -ve and -ve:  if a +ve result then overflow
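The sign-based overflow rule can be sketched as below. The function `add_wrap` is an illustrative name; it models an 8 bit wraparound adder together with the "same-sign inputs, different-sign result" check.

```python
def add_wrap(a: int, b: int, bits: int = 8) -> tuple[int, bool]:
    """Add two signed values in `bits`-wide 2's complement.

    Returns (wrapped result, overflow flag).
    """
    raw = (a + b) & ((1 << bits) - 1)            # keep only the lower `bits` bits
    result = raw - (1 << bits) if raw >> (bits - 1) else raw
    # Overflow only when both inputs share a sign and the result has the other sign.
    overflow = (a < 0) == (b < 0) and (result < 0) != (a < 0)
    return result, overflow


case_1 = add_wrap(-65, -112)   # (79, True): wraps around to a positive value
case_2 = add_wrap(100, 37)     # (-119, True): wraps around to a negative value
case_3 = add_wrap(-73, 127)    # (54, False): 9th bit carry discarded, no overflow
case_4 = add_wrap(100, 64)     # (-92, True): the bit pattern 10100100 read as signed
```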

Saturation

34

• Taking the previous overflowing examples from Slide 33:

   -65       10111111            100       01100100
+ -112      +10010000           + 37      +00100101
  -177      101001111            137       10001001

Detect overflow and saturate:       Detect overflow and saturate:
  -128       10000000             127       01111111

• One method to try to address overflow is to use a saturate technique.
• When overflow is detected, the result is set to the closest possible value (i.e. for the 8 bit case either -128 or +127).
• Therefore for every addition that is explicitly done with an adder block, in Xilinx System Generator the user will get a checkbox choice to allow results to either (i) Wraparound or (ii) Saturate.
• Implementing saturate will require “detect overflow & select” circuitry.

Notes:

Once again, design of a DSP system might be done such that overflow never happens, as the user has ensured there are enough bits to cater for the worst possible case leading to the maximum magnitude result. Hence overflow has been “designed out”.

Generally for some later FPGAs such as the Virtex-4, some of the DSP48 blocks give adders with 48 bits of precision, so the likelihood of, say, working with 16 bit values that grow to over 48 bits is low. Of course not all applications will use these devices, and using general slice logic and attempting to make adders as small as possible, and at the fastest speed possible, would mean care must be taken, and where appropriate to efficient design, saturate might be included.

Saturation is extremely useful for adaptive algorithms. For example, in the Least Mean Squares algorithm (LMS), the filter weights w are updated according to the equation:

$w(k) = w(k-1) + 2\mu e(k) x(k)$

Without further concern over the meaning of this equation, we can see that the term $2\mu e(k) x(k)$ is added to the weights at time epoch k – 1 to generate the new weights at time epoch k. If the operations that form $2\mu e(k) x(k)$ were to overflow, there is a high chance that the sign of the term would flip and drive the weights in completely the wrong direction, leading to instability. With saturation however, if the term $2\mu e(k) x(k)$ gets very big and would overflow, saturation will limit it to the maximum value representable in the current representation, causing the weights to change in the right direction. The result is a huge increase in the stability of the algorithm.
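To make the sign-flip argument concrete, the sketch below compares a wraparound add with a saturating add for the two overflowing examples. Both function names are assumptions made for this illustration.

```python
def add_wraparound(a: int, b: int, bits: int = 8) -> int:
    """2's complement addition that silently wraps on overflow."""
    raw = (a + b) & ((1 << bits) - 1)
    return raw - (1 << bits) if raw >> (bits - 1) else raw


def add_saturate(a: int, b: int, bits: int = 8) -> int:
    """2's complement addition that clamps to the closest representable value."""
    lowest = -(1 << (bits - 1))          # -128 for 8 bits
    highest = (1 << (bits - 1)) - 1      # +127 for 8 bits
    return max(lowest, min(highest, a + b))


wrapped_neg = add_wraparound(-65, -112)   # 79: the sign has flipped
saturated_neg = add_saturate(-65, -112)   # -128: still negative
wrapped_pos = add_wraparound(100, 37)     # -119: the sign has flipped
saturated_pos = add_saturate(100, 37)     # 127: still positive
```

The saturated results are not exact either, but they keep the correct sign, which is why an update term that saturates still pushes the weights in the right direction.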

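Xilinx Virtex-II Pro Addition

35

• The used components of the slice are outlined below:

[Figure: upper half of a Virtex-II Pro slice with the components used for addition outlined: inputs A and B, LUT output D, carry in Cin, outputs Sout and Cout. Shown alongside the equivalent full adder (Σ) with inputs A, B and Cin and outputs Sout and Cout.]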

Notes:

Picture of Xilinx Virtex-II Pro slice (upper half) taken from “Virtex-II Pro Platform FPGAs: Introduction and Overview”, DS083-1 (v2.4) January 20, 2003. http://www.xilinx.com

LookUp Table (LUT) programmed with two-input XOR function:

G1 (A)  G2 (B)    D
0       0         0
0       1         1
1       0         1
1       1         0

$S_{out} = C_{in} \oplus D$

$C_{out} = \bar{D}A + C_{in}D$ (multiplex operation)

Result is the FULL ADDER implementation:

G1 (A)  G2 (B)  Cin    D    Sout  Cout
0       0       0      0    0     0
0       0       1      0    1     0
0       1       0      1    1     0
0       1       1      1    0     1
1       0       0      1    1     0
1       0       1      1    0     1
1       1       0      0    0     1
1       1       1      0    1     1

Xilinx Virtex-II Pro Slice Main Components

36

• A (very) high level diagram of the main logic components on one slice:

[Figure: one slice, drawn as an Upper half and a Lower half between Inputs and Outputs. Each half contains a 4 input LUT (also usable as RAM or ShiftReg), MULTAND, XORG, ORCY, several Mux blocks and a D-type FF.]

Notes:

Just reviewing the logic circuitry on one half of the slice (note that in Slide 35 only the top half of the slice is shown, whereas the above slide shows the top and bottom halves), we can note:

• One D-type Flip Flop
• One 4 input Look-Up-Table (LUT) (can be configured as a shift register or simply as RAM/memory)
• One XOR gate
• One AND gate
• One OR gate
• A few 2 input MUX (multiplexors) to route signals
• Clock inputs

“Small” FPGAs will have just a few hundred (100’s) slices. “Large” FPGAs will have many tens of thousands (10000’s) of slices (and other components!)

Xilinx Virtex-II Pro 4 bit Adder

37

• The bottom half of a Virtex-II Pro slice can be programmed for an identical operation, with its COUT wired to the top-half’s CIN. Hence we can get two bits of addition per standard Xilinx slice.
• To produce a 4 bit adder, we cascade with another slice.
• To produce larger adders the Xilinx tools will simply cascade the carry bits in adjacent (where possible!) slices.

[Figure: 2 bit addition in 1 slice (two full adders, FA, with inputs A1 B1 and A0 B0, carry in ‘0’, outputs S1, S0 and carry C1) and 4 bit addition in 2 slices (the second slice adds A3 B3 and A2 B2, taking C1 as its carry in and producing S3, S2 and C3), each shown alongside the equivalent chain of Σ blocks.]

Dept EEE.423 R. and the address the LUT with values of ABCD to get the output Therefore ANY 4 variable Boolean function can be simply implemented with a four input LUT.Notes: Introduction to FPGA Note the importance of the LUT (look up table) in the Xilinx slice. 2010 . When configured as a LUT. Stewart. University of Strathclyde. any four input Boolean equation can be implemented. (Of course if the equation is only 3 variables then we can also implement and just set one input constant) Ver 10. For example take the equation Y = ABC + ABCD The truth table for this equation is: ABCD 00001 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111 Y 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 A B C D 4 input LUT RAM ShiftReg Y To implement this function. simply store the values of Y in the Slice LUT.

Multiplication in binary

38

• Multiplying in binary follows the same form as in decimal:

         11010110     A7…A0
        x00101101     B7…B0
         11010110
        000000000
       1101011000
      11010110000
     000000000000
    1101011000000
   00000000000000
  000000000000000
 0010010110011110     P15…P0

• Note that the product P is composed purely of selecting, shifting and adding A. The i th column of B indicates whether or not a shifted version of A is to be selected in the i th row of the sum.
• So we can perform multiplication using just full adders and a little logic for selection, in a layout which performs the shifting.

Notes:

Multiplication in decimal

Starting with an example in decimal:

    214
    x45
   1070
  +8560
   9630

Note that we do 214 × 5 = 1070 and then add to it the result of 214 × 4 = 856 shifted left by one column. For each additional column in the second operand, we shift the multiplication of that column with the first operand by another place:

       zzz
     xaaaa
      bbbb
    +cccc0
   +dddd00
  +eeee000
   etc.

2’s complement Multiplication

39

• For one negative and one positive operand just remember to sign extend the negative operand.

          11010110      -42
         x00101101      x45
  1111111111010110      (sign extended)
  0000000000000000
  1111111101011000
  1111111010110000
  0000000000000000
  1111101011000000
  0000000000000000
  0000000000000000
  1111100010011110      -1890

Notes:

2’s complement multiplication (II)

For both operands negative, subtract the last partial product. We use the trick of negating the last partial product (inverting the bits and adding 1) and adding it rather than subtracting.

          11010110      -42
         x10101101      x-83
  1111111111010110
  0000000000000000
  1111111101011000
  1111111010110000
  0000000000000000
  1111101011000000
  0000000000000000
 +0001010100000000      (last partial product -1110101100000000, in two’s complement negated form)
  0000110110011110      3486

Of course, if both operands are positive, just use the unsigned technique!

The difference between signed and unsigned multiplies results in different hardware being necessary. DSP processors typically have separate unsigned and signed multiply instructions.
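The partial-product procedure for signed operands can be sketched as follows. The function name `signed_multiply` is illustrative; the key step, taken from the notes above, is that the partial product selected by the MSB of the second operand is subtracted because that bit has negative weight.

```python
def signed_multiply(a: int, b: int, bits: int = 8) -> int:
    """Multiply two `bits`-wide 2's complement values by summing shifted partial products."""
    out_bits = 2 * bits
    mask = (1 << out_bits) - 1
    acc = 0
    for i in range(bits):
        if (b >> i) & 1:
            partial = (a << i) & mask        # sign-extended A, shifted by i places
            if i == bits - 1:
                acc -= partial               # MSB of B has negative weight
            else:
                acc += partial
    acc &= mask                              # keep a 2*bits-wide result
    return acc - (1 << out_bits) if acc >> (out_bits - 1) else acc


mixed = signed_multiply(-42, 45)        # -1890
both_neg = signed_multiply(-42, -83)    # 3486
```

For positive `b` the MSB is 0, so the subtraction never happens and the routine reduces to the plain select, shift and add scheme.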

Fixed Point multiplication

40

• Fixed point multiplication is no more awkward than integer multiplication:

          11010.110                26.750
         x00101.101                x5.625
             11.010110              0.133750
            000.000000              0.535000
           1101.011000             16.050000
          11010.110000            133.750000
         000000.000000            150.468750
        1101011.000000
       00000000.000000
      000000000.000000
     0010010110.011110

• Again we just need to remember to interpret the position of the binary point correctly.


Multiplier Implementation Options

41

• Distributed multipliers
• Constant multipliers
    • Using the logic fabric (LUTs)
    • Using block RAM
• Shift-and-add “multipliers”
• High speed embedded multipliers
    • 18 x 18 bit multipliers
• High speed integrated arithmetic slices (DSP48s)
    • Multiply, accumulate
    • Add, multiply, accumulate

Notes:

Over the next few slides we will see that multipliers can be implemented in a variety of different ways. As multipliers are used extensively in DSP, implementing them efficiently is a priority consideration.

The most basic multiplier is a 2-input version which is implemented using the logic fabric, i.e. the lookup tables within the slices of the device. This type is referred to as a distributed multiplier, because the implementation is distributed over the resources in several slices.

In the case of multiplication with a constant, which is commonly required in DSP, the knowledge of one multiplicand can be exploited to create a cheaper hardware implementation than a conventional 2-input multiplier. Two approaches that will be discussed in the coming pages are ROM-based constant multipliers, and “shift-and-add” multipliers which sum the outputs from binary shift operations.

The FPGA companies are well aware that DSP engineers desire fast and efficient multipliers, and as a result, they began incorporating embedded multipliers into their devices in the year 2000. Since then the sophistication of these components has increased, and they have been extended to feature fast adders and in many cases longer wordlengths, too. We can now think of them as embedded arithmetic slices, rather than simply multipliers.

Distributed Multipliers

42

• This figure shows a four-bit multiplication:

[Figure: array of multiplier cells with inputs a3..a0 and b0..b3 and product outputs p7..p0; unused s and c inputs are tied to 0. Each cell has inputs s, a, b, c and outputs sout, aout, bout, cout, and contains an AND gate and a full adder (FA).]

Example:

      1101     13
     x1011     11
      1101
     1101
    0000
   1101
  10001111     143

• The AND gate connected to a and b performs the selection for each bit. The diagonal structure of the multiplier implicitly inserts zeros in the appropriate columns and shifts the a operands to the right.
• Note that this structure does not work for signed two’s complement!

Notes:

Note the function of the simple AND gate. The operation of multiplying 1’s and 0’s is the same as ANDing 1’s and 0’s:

A  B    Z
0  0    0
0  1    0
1  0    0
1  1    1

or in Boolean algebra Z = A and B = AB; Z = A x B (where x = multiply).

Hence the AND gate is the bit multiplier. The function of one partial product stage of the multiplier is as shown below.

[Figure: one row of multiplier cells (each an AND gate plus full adder, FA) with inputs x3..x0, a3..a0 and b0, producing outputs y4..y0.]

y4 y3 y2 y1 y0 = b0(a3 a2 a1 a0) + x3 x2 x1 x0

Distributed Multiplier Cell

43

• This shows the top half of a slice, which implements one multiplier cell.

[Figure: upper half of a slice with inputs S, B, A and Cin, LUT output D, and outputs Sout and Cout, shown alongside the equivalent multiplier cell (AND gate feeding a full adder, FA).]

NOTE: This implementation features a Virtex-II Pro FPGA.

Notes:

Picture of Xilinx Virtex-II Pro slice (upper half) taken from “Virtex-II Pro Platform FPGAs: Introduction and Overview”, DS083-1 (v2.4) January 20, 2003. http://www.xilinx.com

The LUT implements the XOR of two ANDs, with inputs G3 (S), G2 (A) and G1 (B) and output D.

The two AND gates perform a one-bit multiply each, and the result is added by the XOR plus the external logic (MUXCY, XORG):

$S_{out} = C_{IN} \oplus D$

$C_{OUT} = \bar{D}AB + C_{IN}D$

The dedicated MULTAND unit is required as the intermediate product G1G2 cannot be obtained from within the LUT, but is required as an input to MUXCY.

This structure will perform one cell of the multiplier (see the next slide...). Note that whereas the signal flow graph of the distributed multiplier shows signals propagating from the top and right of the diagram to the bottom, the internal structure of the FPGA slice logic results in a different configuration when implemented on a device.

ROM-based Multipliers

44

• Just as logical functions such as XOR can be stored in a LUT as shown for addition, we can use storage-based methods to do other operations.
• By using a ROM, we can store the result of every possible multiplication of two operands.
• The two operands A and B are concatenated to form the address with which to access the ROM. The value stored at that address is the multiplication result, P:

[Figure: ROM-based multiplier with 2^8 = 256 8-bit addresses and 8-bit data. A (4 bits) = 1010 (decimal -6) and B (4 bits) = 0011 (decimal 3) form the address A:B (8 bits) = 1010 0011; the data stored there is P (8 bits) = 1110 1110 (decimal -18). The ROM contents run from address 0000 0000 (data 0000 0000) to address 1111 1111 (data 0000 0001).]

Dept EEE. Stewart.777.776 Total ROM Storage (2N x 22N) 2 Kbits 48 Kbits 1 Mbit 20 Mbits 384 Mbits 7 Gbits 128 Gbits 2.216 228 = 268.627. 2010 .576 224 = 16.423 R. The output result is 2N bits long. For example.456 232 = 4.435. 16 bit operands require 128Gbits of storage and hence a ROM-based multiplier is clearly not a realistic implementation choice! Input Wordlength (N) 4 6 8 10 12 14 16 18 20 Output Wordlength (2N) 8 12 16 20 24 28 32 36 40 No. with 8 bits operands (a fairly reasonable) size.296 236 = 68.511.048. and in total 2N × 2 2N bits of storage are required. and hence the ROM has 2 2N entries. 1Mbit of storage is required .25 Tbits 40 Tbits Ver 10.536 220 = 1.967. For bigger operands e. For two N bit input operands.476.Notes: Introduction to FPGA There is one serious problem with this technique: as the operand size grows.736 240 = 1.294.a large quantity. 16 bits. University of Strathclyde.099.719. a huge quantity of storage is required. of ROM entries (22N) 28 = 256 212 = 4.g.096 216 = 65. the ROM size grows exponentially. there are 2 2N possible results.

Input Wordlength and ROM Addresses

45

• Consider a ROM multiplier with 8-bit inputs: 65,536 16-bit locations are required to store all possible outputs, so 1Mbit of storage is needed!

[Figure: ROM-based multiplier with 2^16 = 65,536 16-bit addresses and 16-bit data. A (8 bits) = 0110 1011 (decimal 107) and B (8 bits) = 0100 0011 (decimal 67) form the address A:B (16 bits) = 0110 1011 0100 0011 (address 27,459); the data stored there is P (16 bits) = 0001 1100 0000 0001 (decimal 7,169). The ROM contents run from address 0000 0000 0000 0000 (data 0000 0000 0000 0000) to address 1111 1111 1111 1111 (data 0000 0000 0000 0001).]

Notes:

For example, if the B input was the constant value 75, the possible input words would be composed of 256 possible combinations of the upper 8 bits of the address, concatenated with the 8-bit binary word 0100 1011, as shown below.

[Figure: A = ? (8 bits, unknown) and B = 75 (8 bits, 0100 1011) concatenated to form the 16 bit address A:B.]

Addresses (decimal):
0 x 2^8 + 75 = 75
1 x 2^8 + 75 = 331
2 x 2^8 + 75 = 587
3 x 2^8 + 75 = 843
...
253 x 2^8 + 75 = 64,843
254 x 2^8 + 75 = 65,099
255 x 2^8 + 75 = 65,355

The result is that only 256 of the 65,536 memory locations are actually accessed. Therefore, when one of the inputs to the ROM-based multiplier is fixed, the size of the required ROM can be reduced to 256 locations of 16-bit data (note that the precision of the stored output words remains 8 bits + 8 bits). The total memory required is thus 256 x 16 = 4kbits, significantly smaller than the 1Mbit needed for a multiplier with 8-bit inputs where both operands are unknown!

However, depending on the value of the constant, it may also be possible to reduce the length of the stored results. For instance, if the value of B is (decimal) 10, the largest-magnitude output product generated by the multiplication of B with any 8-bit input A will be:

–128 × 10 = –1280

As -1280 can be represented with 12 bits, that represents a further saving of 4 bits storage x 256 memory locations = 1kbit. The total storage requirement for this example constant coefficient multiplier would therefore be 3 kbits.
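The storage argument for the constant case can be checked with a small model. It assumes the ROM is addressed by the 8-bit pattern of the signed input A alone; `build_constant_rom` and `output_wordlength` are hypothetical helper names.

```python
def build_constant_rom(constant: int, bits: int = 8) -> list[int]:
    """One ROM entry per bit pattern of the signed `bits`-wide input A."""
    rom = []
    for address in range(1 << bits):
        # Interpret the address as a 2's complement value of A.
        a = address - (1 << bits) if address >> (bits - 1) else address
        rom.append(a * constant)
    return rom


def output_wordlength(rom: list[int]) -> int:
    """Smallest 2's complement wordlength that holds every stored product."""
    width = 1
    while min(rom) < -(1 << (width - 1)) or max(rom) > (1 << (width - 1)) - 1:
        width += 1
    return width


rom_10 = build_constant_rom(10)
width_10 = output_wordlength(rom_10)      # 12 bits, since -1280 fits in 12 bits
storage_10 = len(rom_10) * width_10       # 256 x 12 = 3072 bits, i.e. 3 kbits
```

A lookup is then a single indexed read, for example `rom_10[0b10000000]` returns -1280 for A = -128.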

ROM-based Constant Multipliers

46

• ROM-based multipliers with a constant input require fewer addresses.
• The storage required for output words may also be reduced, if the maximum result does not require the full numerical range of:

$-2^{2N-1} \le \text{result} \le 2^{2N-1} - 1$

• The maximum product and output wordlength can be calculated for the particular constant value, and the multiplier optimised accordingly.

[Figure: A = ? is an 8 bit signed number whose largest-magnitude value is -128; B = -83 is the constant (8 bit representation, min.); P = ? is nominally a 16 bit signed result. The maximum product is 10,624, so at most a 15-bit representation is required: 1 bit saved!]

• Additional optimisations allow cost to be reduced further.

Notes:

Constant multipliers can be implemented using the LUTs within the logic fabric (“distributed ROM”), or with one or more of the Block RAMs available on most devices. The selection is influenced by the other demands placed on these resources by the rest of the system being designed.

In System Generator, the designer can specify the implementation style via the Constant Multiplier dialog box, along with the constant value, the output wordlength, and other parameters.

Multiplication by Shift and Add

47

• Multiplication by a power-of-2 can be achieved simply by shifting the number to the right or left by the appropriate number of binary places.

x16:     -31 x 2^4 = -496          (shift by 4 places)
x0.25:   0.625 x 2^-2 = 0.15625    (shift by 2 places)

• Extending this a little, multiplications by other numbers can be performed at low cost by creating partial products from shifts, and then adding them together.

x9:        21 x 2^3 + 21 = 189                          (shift by 3 places, plus the unshifted input)
x1.3125:   (1 x 2^-4) + 1 + (1 x 2^-2) = 1.3125         (shifts by 4 and 2 places, plus the unshifted input)

Notes:

Shift operations are effectively “free” in terms of logic resources, as they are implemented only using routing. Therefore multiplications by power-of-two numbers are very cheap! By recognising that multiplications by other numbers can be achieved by summing partial products of power-of-two shifts, any arbitrary multiplication can be decomposed into a series of shift and add operations. The “closer” the desired multiplication is to a power-of-two, the fewer partial products are required, the fewer adders are required, and hence the lower the cost of the multiplier. This type of multiplier is suitable only for constant multiplications, because there is only one input, and the result is achieved using the configuration of the hardware.

The technique can be particularly powerful when applied to parallel multiplications of the same input, i.e. product terms common to several multiplications can be shared and thus the overall effort reduced. Taking the above simple example of two concurrent multiplications, one x9 and the other x24, it is clear that the shift left by three places can be shared, as x8 is common to both operations.

[Figure: x24 and x9 calculated separately: x24 = x16 (shift 4) + x8 (shift 3), and x9 = x8 (shift 3) + x1. x24 and x9 combined: the x8 partial product (shift 3) is shared, so fewer partial products are needed.]

Transpose form filters are very suitable for optimisation in this way.
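A minimal Python sketch of the shift-and-add idea follows. The function names are illustrative only; in hardware the shifts cost nothing (routing) and each `+` corresponds to an adder.

```python
def shift_add_terms(constant):
    """Shift amounts k such that constant == sum of 2**k (constant > 0)."""
    return [k for k in range(constant.bit_length()) if (constant >> k) & 1]


def multiply_by_constant(x, constant):
    """Multiply x by a positive integer constant using shifts and adds."""
    return sum(x << k for k in shift_add_terms(constant))


def times9_and_times24(x):
    """Two parallel constant multiplies sharing the x8 partial product."""
    x8 = x << 3          # shared partial product (shift left by 3)
    x9 = x8 + x          # x9  = x8 + x1
    x24 = (x << 4) + x8  # x24 = x16 + x8
    return x9, x24


print(-31 << 4)                     # -496
print(shift_add_terms(9))           # [0, 3]
print(multiply_by_constant(21, 9))  # 189
print(times9_and_times24(21))       # (189, 504)
```

The number of entries returned by `shift_add_terms` is the number of partial products, so a constant with few 1 bits needs few adders.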

Embedded multipliers

48

• The Xilinx Virtex-II and Virtex-II Pro series were the first to provide “on-chip” multipliers in the early 2000s.
• These are in hardware on the FPGA ASIC, not actually in the user FPGA “slice-logic-area”. They are therefore permanently available, and they use no slices. They also consume less power than a slice-based equivalent and can be clocked at the maximum rate of the device.

[Figure: 18x18 bit multiplier with inputs A and B and output P.]

• A and B are 18-bit input operands, and P is the 36-bit product, i.e. P = A × B.
• Depending upon the actual FPGA, between 12 and more than 2000 (Virtex 6 top of range) of these dedicated multipliers are available.

Notes:

Looking at a device floorplan, you can clearly see the embedded multipliers, which are located next to Block RAMs on the FPGA in order to support high speed data fetching/writing and computation.

[Figure: device floorplan showing Block RAM, slices and 18x18 multipliers.]

Information on dedicated multipliers taken from “Virtex-II Pro Platform FPGAs: Introduction and Overview”, DS083-1 (v2.4) January 20, 2003. http://www.xilinx.com.

Embedded Multiplier Efficiency

49

• It can be easy to utilise on-chip embedded multipliers inefficiently through choice of wordlengths.

[Figure: an 18 x 18 bit multiply (36-bit product) uses 1 embedded multiplier, 100% utilised. A 4 x 4 bit multiply (8-bit product) uses 1 embedded multiplier, ~5% utilised.]

• When using multipliers in System Generator... be careful.

[Figure: for an 18 x 18 bit multiply (36-bit product) SysGen will use 1 embedded mult. For a 19 x 19 bit multiply (38-bit product) SysGen will use 4 embedded mults.]

Notes:

The wordlengths of the embedded multipliers are fixed at 18 x 18 bits, and it makes sense to use them as fully as possible. It is relatively easy to see that a 4 x 4 bit multiply will greatly underuse the capabilities of the multiplier, and this particular multiply operation might be better mapped to a distributed implementation, which would leave the embedded multiplier free for use somewhere else.

Perhaps less obviously, mapping a multiplication to embedded multipliers where the input operands are slightly longer than 18 bits is also inefficient. If you specify the use of embedded multipliers for a particular multiplier in System Generator, the tool will do exactly as you have asked, and implement it entirely using embedded multipliers. However, depending on the wordlengths involved, this may lead to an inefficient implementation. This may result in, for example, the following implementation for a requested 19 x 19 bit multiplier, where 4 embedded multipliers are used instead of the expected 1!

[Figure: a 19 x 19 bit multiply (38-bit product) split into four partial multiplies: 18 x 18, 1 x 18, 18 x 1 and 1 x 1.]

Of course, these decisions are made in the context of some larger design with its own particular needs for the various resources available on the FPGA being targeted.
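The multiplier counts quoted here follow from a simple tiling argument, sketched below in Python. This is an assumed model for illustration (each operand is cut into 18-bit pieces and every pair of pieces takes one embedded multiplier); it matches the figures in these notes, but a real tool may map a given multiply differently. The function names are hypothetical.

```python
def embedded_mults(a_bits, b_bits, width=18):
    """Embedded multipliers needed under a simple tiling model."""
    a_pieces = -(-a_bits // width)  # ceiling division
    b_pieces = -(-b_bits // width)
    return a_pieces * b_pieces


def utilisation(a_bits, b_bits, width=18):
    """Fraction of the allocated multiplier array doing useful work."""
    allocated = embedded_mults(a_bits, b_bits, width) * width * width
    return (a_bits * b_bits) / allocated


print(embedded_mults(18, 18))  # 1
print(embedded_mults(19, 19))  # 4
print(embedded_mults(4, 4))    # 1
print(round(100 * utilisation(4, 4)))  # 5 (percent)
```

Under this model a single extra bit on each operand quadruples the multiplier count, which is the effect the notes warn about.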

High Speed Arithmetic Slices (DSP48s)

50

• As much DSP involves the Multiply-Accumulate (MAC) operation, soon after embedded multipliers came DSP48 slices (on the Virtex-4).
• These feature an 18 x 18 bit multiplier followed by a 48 bit accumulator.

[Figure: DSP48 (Virtex-4): two 18-bit inputs, 36-bit product, 48-bit accumulator with 48-bit output.]

• Like the embedded multipliers, these are low power and fast.
• The ability to cascade slices together also means that whole filters can be constructed without having to use any logic slices.

Notes:

The next series of FPGAs (the Virtex-5) enhanced the capabilities of the DSP slice with the DSP48E. The major improvements of this slice are logic capabilities within the adder/subtractor unit, and an extended wordlength of one input to 25 bits. The maximum clock frequency also increased in line with the speed of the device.

[Figure: DSP48E (Virtex-5): 18-bit and 25-bit inputs, 43-bit product, 48-bit accumulator with 48-bit output.]

DSP48s with Pre-Adders

51

• The Spartan-3A DSP series and subsequent Spartan-6 feature a version of the DSP48 with a pre-adder, prior to the multiplier.
• This feature is especially useful for DSP structures like symmetric filters, because it allows the total number of multiplications to be reduced.

[Figure: DSP48A (Spartan-3A DSP) and DSP48A1 (Spartan-6): 18-bit inputs with a pre-adder ahead of the multiplier, 36-bit product, 48-bit accumulator with 48-bit output.]
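The saving for symmetric filters can be illustrated in Python. The sketch assumes a filter whose coefficients are symmetric (c[i] == c[n-1-i]); the function names and the sample data are made up for the example and do not model any particular DSP48 configuration.

```python
def fir_direct(coeffs, window):
    """One output sample: one multiply per tap."""
    return sum(c * x for c, x in zip(coeffs, window))


def fir_preadd(coeffs, window):
    """Same output for symmetric coeffs: add mirrored samples first,
    then multiply once per coefficient pair."""
    n = len(coeffs)
    half = n // 2
    acc = sum(coeffs[i] * (window[i] + window[n - 1 - i]) for i in range(half))
    if n % 2:  # odd length: the centre tap has no partner
        acc += coeffs[half] * window[half]
    return acc


coeffs = [1, 3, 5, 3, 1]   # symmetric, 5 taps (illustrative values)
window = [2, -1, 4, 0, 7]  # current contents of the delay line

print(fir_direct(coeffs, window))  # 26, using 5 multiplies
print(fir_preadd(coeffs, window))  # 26, using 3 multiplies
```

The pre-adder performs the `window[i] + window[n - 1 - i]` step, so an n-tap symmetric filter needs roughly n/2 multipliers instead of n.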

Notes:

The Virtex-6 offers a combination of the benefits of the Virtex-5 (the longer wordlength and arithmetic unit), together with the pre-adder from the Spartan series. This results in a very computationally powerful device, especially as the DSP48E1 slice can be clocked at 600MHz, and the largest chips have 2000+ of them!

[Figure: DSP48E1 (Virtex-6): two 25-bit inputs to a pre-adder, 25-bit and 18-bit multiplier inputs, 43-bit product, 48-bit accumulator with 48-bit output.]

Division (i)

52

• Divisions are sometimes required in DSP, although not very often.
• 6 bit non-restoring division array:

[Figure: 6 bit non-restoring division array computing Q = B / A, with divisor inputs a5..a0, dividend inputs b5..b0 and quotient outputs q5..q0. Each cell is built around a full adder (FA) with ports Bin, sin, bin, cin, cout, bout, Bout and sout.]

• Note that each cell can perform either addition or subtraction as shown in an earlier slide ⇒ either Sin + Bin or Sin - Bin can be selected.

Notes:

A direct method of computing division exists. This “paper and pencil” method may look familiar as it is often taught in school. A binary example is given below. Note that each stage computes an addition or subtraction of the divisor A. The quotient is made up of the carry bits from each addition/subtraction. If the quotient bit is a 0, the next computation is an addition, and if it is a 1, the divisor is subtracted. It is not difficult to map this example into the structure shown on the slide.

Example: B = 01011 (11), A = 01101 (13) ⇒ -A = 10011. Compute Q = B / A.

R0 = B     01011
-A         10011
R1         11110    carry 0, q4 = 0
2.R1       11100
+A         01101
R2         01001    carry 1, q3 = 1
2.R2       10010
-A         10011
R3         00101    carry 1, q2 = 1
2.R3       01010
-A         10011
R4         11101    carry 0, q1 = 0
2.R4       11010
+A         01101
R5         00111    carry 1, q0 = 1

Q = B / A = 01101 x 2^-4 = 0.8125
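The worked example can be reproduced with a short Python model of the procedure. This is a behavioural sketch under stated assumptions, not a description of the array hardware: the function name is invented, the wordlength n defaults to the 5 bits of the example, and the inputs are assumed to satisfy 0 <= B < 2A and 0 < A < 2^(n-1).

```python
def nonrestoring_divide(b, a, n=5):
    """Return the n quotient bits (MSB first) of b / a.

    Assumes 0 <= b < 2*a and 0 < a < 2**(n - 1). The quotient is read
    as q[0].q[1]...q[n-1], i.e. with n - 1 fractional bits.
    """
    mask = (1 << n) - 1
    neg_a = (-a) & mask  # two's complement of the divisor
    r = b & mask
    subtract = True      # the first stage always subtracts A
    bits = []
    for _ in range(n):
        total = r + (neg_a if subtract else a)
        q = total >> n   # carry out of the n-bit adder
        bits.append(q)
        r = ((total & mask) << 1) & mask  # 2.R for the next stage
        subtract = (q == 1)  # q = 1: subtract next, q = 0: add next
    return bits


bits = nonrestoring_divide(0b01011, 0b01101)
value = int("".join(str(q) for q in bits), 2) / 2 ** 4
print(bits, value)  # [0, 1, 1, 0, 1] 0.8125
```

The quotient bits are simply the carry outs of successive additions, which is why the MSB is available first and the LSB last.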

Division (ii)

53

• There is an alternative way to compute division using another paper and pencil technique.

[Figure: binary long division of 01011.0000 by 01101 giving the quotient 00000.1101, with the shifted remainder (remdsh1) compared against the divisor (divisor_in) at each step. VHDL Design.]


The Problem With Division

54

• An important aspect of division is to note that the quotient is generated MSB first - unlike multiplication or addition/subtraction!
• This has implications for the rest of the system.
• It is unlikely that the quotient can be passed on to the next stage until all the bits are computed - hence slowing down the system!
• Also, an N by N array has another problem - ripple through adders.
• Note that we must wait for N full adder delays before the next row can begin its calculations.
• Unlike multiplication there is no way around this, and as a result division is always slower than multiply even when performed on a parallel array - an N by N multiply will run faster than an N by N divide!

Notes:

By looking at the top two rows of a 4 x 4 division array we can see that the first bit to get generated is the MSB of the quotient. This is unlike the multiplication array that can also be seen below, where the LSB is generated first. This is a problem when using division as most operations require the LSBs to start a computation and hence the whole solution will have to be generated before the next stage can begin.

Another problem for division is the fact that it takes N full adder delays before the next row can start. In the examples below, the order in which the cells can start has been shown. So for the multiplier, the first cell on the second row is the 3rd cell to start working. However, for the divider, the first cell on the second row is only the 5th cell to start working because it has to wait for the 4 cells on the first row to finish.

[Figure: top two rows of a 4 x 4 division array (inputs a3..a0, b3..b0, outputs q3, q2) and of a 4 x 4 multiplication array (outputs p0, p1), with each cell numbered in the order in which it can start. FA is full adder.]

Pipelining The Division Array

55

• The division array shown earlier can be pipelined to increase throughput.

[Figure: 6 bit division array computing Q = B / A with pipeline delays between rows. The operands a5..a0 and b5..b0 are delayed (operands pipeline delay) so that they reach each row at the correct time. Quotient outputs q5..q0.]

Notes:

To increase the throughput, the critical path can be broken down by implementing pipeline delays at appropriate points.

The longest path from register to register is the Critical Path. This delay determines the maximum rate at which new data can enter the array. If pipelining is not used, the delay (critical path) from new data arriving to registering the full quotient is $N^2$ full adders. However, by pipelining the array, the critical path is broken down to just N full adders and thus the rate at which new data can arrive is increased dramatically.

[Figure: two 4 x 4 division arrays (inputs a3..a0, b3..b0, outputs q3..q0). Without pipelining the critical path is through $N^2$ full adders. With pipelining the critical path is only N full adders.]
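As a rough feel for the numbers, the sketch below converts the two critical paths into maximum data rates. The full adder delay of 1 ns is a purely hypothetical figure chosen for illustration, and the model ignores routing and register overheads.

```python
def max_data_rate_hz(n, t_fa_seconds, pipelined):
    """Maximum input rate when the critical path is n (pipelined)
    or n * n (not pipelined) full adder delays."""
    critical_path_adders = n if pipelined else n * n
    return 1.0 / (critical_path_adders * t_fa_seconds)


N = 6        # 6 x 6 division array
T_FA = 1e-9  # assumed full adder delay (hypothetical value)

unpipelined = max_data_rate_hz(N, T_FA, pipelined=False)
pipelined = max_data_rate_hz(N, T_FA, pipelined=True)

print(round(unpipelined / 1e6, 1), "MHz")  # 27.8 MHz
print(round(pipelined / 1e6, 1), "MHz")    # 166.7 MHz
```

Whatever delay is assumed, the ratio of the two rates is N, so the benefit of pipelining grows with the size of the array.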

Square Root (i)

56

• 6 bit non-restoring square root array.

[Figure: non-restoring square root array computing $B = \sqrt{A}$, with inputs a7..a0 and outputs b5..b0. Each cell is built around a full adder (FA) with ports Bin, sin, bin, cin, cout, bout, Bout and sout.]

• The square root is found (with divides) in DSP in algorithms such as QR algorithms, vector magnitude calculations and communications constellation rotation.

Notes:

Looking carefully at the non-restoring square root array, we can note that this array is essentially “half” of the division array! If the division array above is cut diagonally from the left we can see the cells that are needed for the square root array. The 2 extra cells on the right hand side are standard cells which can be simplified. So square root can be performed twice as fast as divide using half of the hardware!

Example: A = 10 11 01 01 (181), taken two bits at a time as a4 = 10, a3 = 11, a2 = 01, a1 = 01. In each stage the second operand is built from the result bits found so far: (1 1 1), then (1, NOT b3, 1, 1), then (1, NOT b3, NOT b2, 1, 1), then (0, b3, b2, b1, 1, 1).

0 a4           010
               111
R1             001      carry 1, b3 = 1
R1<<1 & a3     0111
               1011
R2             0010     carry 1, b2 = 1
R2<<1 & a2     01001
               10011
R3             11100    carry 0, b1 = 0
R3<<1 & a1     110001
               011011
R4             001100   carry 1, b0 = 1

Result: B = 1101 (13).
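The same example can be followed with a behavioural Python model of a non-restoring integer square root. This is a sketch of the arithmetic only, written with ordinary signed integers rather than the fixed-width adders of the array; the function name and the `pairs` parameter are illustrative assumptions.

```python
def nonrestoring_sqrt(a, pairs=4):
    """Integer square root of a (2 * pairs bits wide).

    Returns (root, remainder) with root * root + remainder == a.
    """
    root = 0
    rem = 0
    for i in reversed(range(pairs)):
        two_bits = (a >> (2 * i)) & 3     # next pair of input bits
        if rem >= 0:
            # previous result bit was 1: subtract (root, 0, 1)
            rem = (rem << 2) + two_bits - ((root << 2) | 1)
        else:
            # previous result bit was 0: add (root, 1, 1), no restore
            rem = (rem << 2) + two_bits + ((root << 2) | 3)
        root = (root << 1) | (1 if rem >= 0 else 0)
    if rem < 0:                           # final remainder correction
        rem += (root << 1) | 1
    return root, rem


print(nonrestoring_sqrt(0b10110101))  # (13, 12): 13 * 13 + 12 == 181
```

For the input 181 the intermediate remainders are 1, 2, -4 and 12, matching R1 to R4 in the worked example, and the result bits 1, 1, 0, 1 are produced MSB first.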

Square Root and Divide - Pythagoras!

57

• The main appearance of square roots and divides is in advanced adaptive algorithms such as QR using Givens rotations.
• For these techniques we often find equations of the form: $\cos\theta = \dfrac{x}{\sqrt{x^2 + y^2}}$ and $\sin\theta = \dfrac{y}{\sqrt{x^2 + y^2}}$
• So in fact we actually have to perform two squares, a divide and a square root. (Note that squaring is “simpler” than multiply!)
• There are a number of iterative techniques that can be used to calculate square root. (However these routines invariably require multiplies and divides and do not converge in a fixed time.)
• There seems to be some misinformation out there about square roots: for FPGA implementation square roots are easier and cheaper to implement than divides...!
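To make the role of these equations concrete, the floating-point Python sketch below computes the rotation parameters for a two-element vector and applies them. It is only an illustration of where the squares, square root and divides arise; the function name `givens` and the test vector are assumptions, and a fixed-point FPGA implementation would look quite different.

```python
import math


def givens(x, y):
    """Return (cos, sin) of the rotation that zeroes y against x."""
    r = math.sqrt(x * x + y * y)  # two squares and a square root
    if r == 0.0:
        return 1.0, 0.0           # nothing to rotate
    return x / r, y / r           # the divides


x, y = 3.0, 4.0
c, s = givens(x, y)

# Applying the rotation moves all of the energy into the first element.
x_rot = c * x + s * y    # approximately 5.0, the vector magnitude
y_rot = -s * x + c * y   # approximately 0.0

print(c, s)
print(x_rot, y_rot)
```

Every rotation therefore costs a square root and divides in addition to multiplies, which is why efficient array implementations of both operations matter for QR-based algorithms.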
