Student: Ben Cope
Supervisors: Prof Peter Cheung & Prof Wayne Luk
Research Group: Circuits & Systems, EEE
Submission Date: Friday 22nd April, 2005
The use of a combination of different hardware blocks on the same chip is becoming increasingly popular for accelerating computationally intensive applications. The main problems faced are price, design time, the choice of which blocks to combine, tool support and minimising power.

Over the last decade FPGAs have been a popular solution for implementing computationally intensive tasks, such as video processing, in real time. They exploit the spatial and temporal parallelism of these tasks through large amounts of reconfigurable hardware.

Over the same period the capabilities of graphics hardware have been growing rapidly, with the advantage that cost is kept low due to its application to the consumer market. It is suggested that the computational speed and economical price of this hardware could encourage its use outside of the consumer domain for more complex tasks, such as video processing. Later it will be shown that today's graphics cards are capable of primary colour correction, in real time, on High Definition Video.

In the future a combination of the above two types of hardware, along with other cores such as processors, is likely to be used for video processing tasks. For this to be possible it is important to eliminate the interconnect bottleneck seen in current SOC (System on Chip) designs. A proposed solution has been a NOC (Network on Chip) structure, which will be explained in more detail in the literature survey section.

In this report I will cover my progress on my chosen research area of a mixed hardware solution to video processing. This will involve an annotated literature survey, discussion of research questions, work to date and a plan of work up to MPhil to PhD transfer. Work to date includes interconnect modelling and the implementation of a primary colour correction algorithm.
2 Literature Survey

Covered is a literature survey of the use of FPGAs along with graphics hardware in the application of video processing. Firstly, I will look at current architectures using GPUs (Graphics Processing Units) and FPGAs for video processing, followed by interconnect structures (including Network on Chip). FPGA architectural features which make the FPGA adaptable to such applications will then be shown, along with the tools used for debugging & programming.

2.1 Current Architectures

This section is split into the use of GPUs and FPGAs individually for video applications and discusses the possibility of interlinking these modules. The implementation of video processing applications is moving away from being predominantly software based to a more hardware based solution. Graphics hardware has progressed from being display hardware to allowing user programmability, leading to non-graphics applications. This can be seen with the first video cards: all image processing was done prior to it reaching the card. Today much more of the processing is performed on the card, such as lighting effects. There have also been advances in research into interconnect structures: the bus is no longer seen as the only solution to connect hardware cores together. Switch-boxes and networks are emerging; it is likely that we will see more topologies being considered in the future and some of the new ideas of today becoming commonplace.

GPU Architectures:

Emerging Field: Research into the use of GPUs for general-purpose 'computation' began on the Ikonas machine, which was developed in 1978. This was used for the Genesis Planet Sequence in 'Star Trek: The Wrath of Khan' and has also been used in TV commercials and the 'Superman III' video game. GPUs were first used for non-graphics 'applications' (www.GPGPU.org) by Lengyel et al in 1990, for robot motion planning. This highlighted early on the potential for graphics hardware to do much more than output successive images, and marked the start of the era of GPGPU (General Purpose GPUs).

Recent Developments: Trendall and Stewart in 2000 gave a summary of the possible calculations available on a GPU, with a real-time calculation of refractive caustics. These capabilities have progressed further since, with more on-board memory and larger processing ability.
This has been made possible by graphics hardware manufacturers (Nvidia / ATI being the largest) allowing programmers more control over the GPU. This is facilitated through shader programs (released in 2001) written in languages such as Cg (C for Graphics) by Nvidia or DirectX by Microsoft, which have been enhanced since for greater user control (DirectX 9.0 now allows 128-bit precision, i.e. a 32-bit per RGBA pixel component); however, there is no pointer support and debugging is difficult. There is still room for future progress as some hardware's functionality is still hidden from the user. More recently Moreland and Angel implemented an FFT routine using GPUs, with FFT, filtering and IFFT performed on a 512x512 image in under 1 second with a GeForce 5800. Moreland and Angel implemented 'clever tricks' with indexing (dynamic programming), splitting a 2D problem into two 1D problems, to achieve this speed up. Recognition for non-graphical applications of GPUs was given at the Siggraph / Eurographics Hardware Workshop in San Diego (2003), showing its emergence as a recognised field.

Performance: The rate of increase of processing performance of GPUs has been 2.8 times/year since 1993 (compared to 2 times/1.5 years for CPUs according to Moore's Law), a trend which is expected to continue till 2013. The GeForce 5900 performs at 20 Gflops, which is equivalent to a 10 GHz Pentium Processor. It is expected that TFLOP performance will be seen from graphics hardware in 2005, with a new generation being unveiled every 6 months. This shows the potential of graphics hardware to out-perform CPUs. For example, in Strzodka and Garbe's implementation of Motion Estimation they out-perform a P4 2GHz processor by 4 to 5 times (with a GeForce 5800 Ultra). Increases in performance also benefit filtering operations: in the previously mentioned FFT implementation the potential for frequency domain filtering and frequency compression was shown. The amount of computation required to do filtering is reduced by performing it in the frequency domain, from an O(NM²) problem to an O(NM) + FFT + IFFT (about O(MN(log(M) + log(N)))) one.

Nvidia: The intentions of the manufacturers are clear. Previously the CPU offloaded tasks, such as video processing, onto plug-in cards, which were later shrunk and encapsulated within the CPU. This minimisation was beneficial to the likes of Intel and less so to GPU producers such as Nvidia. In an article in July 2002, NVIDIA's CEO announced teaming with AMD to develop nForce, which will be capable of handling multimedia tasks and will bring theatre-style DVD playback to the home computer. In the development of Microsoft's Xbox more money was given to Nvidia than to Intel; this trend is likely to continue, showing a power shift towards the graphics card manufacturers. The implementation of multimedia applications on graphics cards (and their importance to the customer) means the screen is now seen as the computer, rather than the network it sits on.
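To make the operation counts quoted in the Performance paragraph above concrete, the sketch below compares direct 2D filtering with the frequency-domain route for a 512x512 image. The 15-tap kernel size is an illustrative assumption and constant factors are ignored, so this is only an order-of-magnitude comparison, not a model of any particular GPU implementation.

```cpp
// Rough operation-count comparison: direct filtering of an N x N image with an
// M x M kernel versus FFT, point-wise multiply, then inverse FFT.
#include <cmath>
#include <cstdio>

int main() {
    const double N = 512.0;  // image side, as in Moreland and Angel's test
    const double M = 15.0;   // illustrative filter kernel side (assumption)

    double direct = N * N * M * M;                  // O(N^2 M^2) multiply-adds
    double fft    = 2.0 * N * N * std::log2(N * N)  // forward FFT + inverse FFT
                  + N * N;                          // point-wise filter in the frequency domain

    std::printf("direct: %.2e ops, via FFT: %.2e ops, ratio %.1f\n",
                direct, fft, direct / fft);
    return 0;
}
```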
Cost: Another benefit of the GPU is cost: a top end graphics card (capable of near TFLOP performance) can be purchased for less than £300. Such a graphics card can perform equivalently to image generators costing £1000's in 1999. This gives the opportunity for real-time video processing capabilities on a standard workstation. With the rapidly increasing power of graphics cards it can be expected that the computation time will be reduced from 1 second to allow real-time processing.

Parallelism: The architecture of graphics hardware is equivalent to that of stretch computers (designed for fast floating point arithmetic). Strzodka and Garbe, in their paper on motion estimation / visualisation on graphics cards, show how a parallel computer application can be implemented in graphics hardware. They use stream processing: requiring a sequence of data in some order. In such applications the data-stream controls the flow rather than instructions. This method exploits the dataflow in the organisation of the processing elements to reduce caching (CPUs are typically 60% cache). Other features are the exploitation of the spatial parallelism of images and the fact that pixels are generally independent, facilitating the above cache benefit. A final factor which aids this is that 32-bit precision calculations are now possible on GPUs, which are vital for such calculations. They identify GPUs as not the best solution, but as having a better price-performance ratio than other hardware solutions. Moreland and Angel go further in branding the GPU as no longer a fixed pipeline architecture but a SIMD (Single Instruction-stream, Multiple Data-stream) parallel processor, which highlights the flexibility in the eyes of the programmer.

How to program: There are 2 programming sources in graphics hardware programming:
• Flowware: assembly and direction of dataflow
• Configware: configuration of processing elements
In comparison, these 2 features are implemented together in an FPGA; however, in graphics hardware they are explicitly different. APIs also handle Flowware and Configware separately. Careful implementation of GPU code is necessary for platform (e.g. DX 9.0) and system (e.g. C++) independence. This becomes important if considering programming FPGAs and GPUs simultaneously.
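The pixel-independence property described in the Parallelism paragraph above is what allows a video operation to be expressed as one small kernel applied across a whole frame. A minimal C++ illustration follows; on a GPU the loop body would be a fragment program rather than a loop, so this is only the shape of the idea.

```cpp
// Stream-processing sketch: the same pure per-pixel kernel is applied to every
// element of the stream, with no inter-pixel dependence, so the iterations can
// run in any order or fully in parallel.
#include <vector>

struct RGBA { float r, g, b, a; };

// Example per-pixel kernel (a simple gain); any pure function of one pixel fits.
static RGBA kernel(const RGBA& p) {
    return RGBA{ p.r * 1.1f, p.g * 1.1f, p.b * 1.1f, p.a };
}

void process(std::vector<RGBA>& frame) {
    for (std::size_t i = 0; i < frame.size(); ++i)
        frame[i] = kernel(frame[i]);   // each iteration is independent
}
```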
FPGA Architectures: ASIC solutions for processing tasks are optimal in speed, power and size; however, they are expensive and inflexible. The need for flexibility is justified as one may need to change an algorithm (e.g. a compression standard) post manufacture. DSPs allow for more flexibility and can be reused for many applications; however, they are energy inefficient and can cause delays if not optimised per task. For these reasons it is often favourable to implement such applications in a reconfigurable unit.

An Alternative: When deciding which hardware to use for a graphics sub-system there is a trade-off between operating speed and flexibility. To maximise its benefit, an FPGA implementation must give more flexibility than custom graphics processors and be faster than a general purpose processor. By utilising its re-programmability a small FPGA can appear as a large and efficient device.

Example: Singh and Bellec in 1994 implemented three graphics applications on an FPGA, namely outline of a circle, filled circle and a fast sphere algorithm. They found a RAM based FPGA was favourable due to the large storage requirements. The performance in drawing the circle was seen to be satisfactory, out-performing a general purpose display processor (TMS34020) by a factor of 6 (achieving 16 million pixels/sec). It was however worse in the application of fast sphere rendering, at only 2627 spheres/sec vs. 14,300 from a custom hardware block. Improvements are however expected with new FPGAs, such as the Virtex 4, which have larger on-chip storage and more processing power. FPGAs today also have more built-in blocks to speed up operations such as multiplication (these will be considered later).

Sonic / Ultra-Sonic: Two more possibilities to accelerate video processing which highlight the benefits of a re-configurable architecture. Typical applications are the removal of distortions introduced by watermarking an image, changes of spatial and temporal resolution, 2D filtering and 2D convolution. In the latter, an implementation at 1/2 the clock rate of state-of-the-art technology was adequate, suggesting a lower power solution. The challenges involved, highlighted in [9, 10], are: correct hardware and software partitioning, hardware integration with software, keeping memory accesses low and real-time throughput. Sonic approaches these challenges with PIPEs (Plug In Processing Elements), with 3 main components: an engine (for computations), a router (for routing, formatting and data access) and memory for storing video data. The hardware is systolic: 1 data item is clocked into and 1 out of the modules on every clock cycle; this maintains a high throughput rate, although latency can vary.
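A minimal sketch of the systolic behaviour just described: one item enters and one leaves the chain of stages every cycle, so throughput is fixed while latency depends on the number of stages. This is a generic illustration rather than the Sonic PIPE micro-architecture.

```cpp
// Systolic chain sketch: constant one-in / one-out throughput per cycle.
#include <array>
#include <cstdio>

int main() {
    const int STAGES = 3;
    std::array<int, STAGES> pipe{};               // one register per stage

    for (int cycle = 0; cycle < 8; ++cycle) {
        int in  = cycle + 1;                      // new item clocked in
        int out = pipe[STAGES - 1];               // item clocked out (latency = STAGES)
        for (int s = STAGES - 1; s > 0; --s)
            pipe[s] = pipe[s - 1];                // each stage would apply its operation here
        pipe[0] = in;
        std::printf("cycle %d: in=%d out=%d\n", cycle, in, out);
    }
    return 0;
}
```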
The sharing of the data and configuration control path reduces the bottleneck. 2D filtering was split into two 1D filters and showed a 5.5 times speed up when using 1 PIPE, and a greater speed up with more PIPEs. Another architecture where the bus bottleneck problem is reduced is seen in , where a daughter board is incorporated to perform D/A conversion.

Parallelism: At a task level, parallelism is often ignored in designs. Taking Sonic as an example: spatial parallelism is exploited by distributing parts of each frame across multiple hardware blocks (PIPEs in this case); these elements can also be grouped to perform bigger tasks. Temporal parallelism can be exploited by distributing entire frames over these blocks.

The following are general issues Sedcole et al propose to be considered in such large scale implementations:
• Design Complexity
• Modularisation - allocation / management of resources
• Connectivity / Communication between modules
• Power Minimisation (ties in with low memory accesses)
Further, by proposing a design method focused on the system dataflow, Sedcole et al hope to overcome these issues.

Bottlenecks: In contrast to memory, the FPGA's bottleneck isn't bus speed but configuration time. The capability of partial reconfiguration is important here: if a new task is required only a section of the FPGA need be reconfigured, leaving other sections untouched for later reuse. Data loss occurs during the configuration phase, but this is seen as acceptable. Configurations can be stored in a memory bank and copied into a local cache as required. A practical example of this is seen with the above Sonic architecture: the router and engine are implemented on separate FPGA elements; this separation also provides abstraction. If a different application required only a different memory access pattern (e.g. a 2*1D implementation of a 2D filter), only the router need be reconfigured. Singh and Bellec similarly suggest partitioning the FPGA into zones, each with good periphery access to the network and a different size, and grouping zones of a partitioned design.

Hardware or Software: The benefits of a software implementation are seen with irregular, data-dependent or floating point calculations. A hardware solution is beneficial for regular, parallel computation at high speeds. Advancements in hardware mean that some of the problems with floating point calculations and the like have been overcome; hardware can now perform equally to or even better than software. Tasks must be split optimally between these 2 methods. The software designer needs a good software model of the hardware and the hardware designer requires good abstraction.
The current FPGA limitations highlighted by papers [9, 14] are: configuration speed, debugging, number of gates, partial reconfiguration (Altera previously had no support) and the PCI Bus bottleneck. The requirements of a reconfigurable implementation are therefore to be flexible, powerful, low cost, run-time / partially reconfigurable and to fit in well with software. These considerations would be important if considering an FPGA's implementation with other hardware, and some or all of these requirements may also apply to this 'mixed system.'

Co-operation: Another way to look at the use of FPGAs in a graphics system is to extend the instruction set of a host processor as virtual hardware, analogous to software plug-ins. In the Sonic example the PIPEs act as plug-ins, which provides an easy path for Sonic into software. This overcomes a previous problem with re-configurable hardware, that there were no good software models. This idea is approached by Vermeulen et al, where a processor is mixed with some hardware to extend its instruction set. In general this 'hardware' could be another processor or an ASIC component; again there are issues with finding ways to get the components to communicate and work together.

GPU view: GPU components are implemented in conjunction with CPUs, acting as graphics sub-systems and working as co-processors with the CPUs. Hardware acceleration is particularly suited to video processing applications due to the parallelism and relatively simple calculations. To do this a high speed interface is required, as GPUs can process large amounts of data in parallel.

2.2 Interconnects

Interconnects currently used for graphics card to processor communications will be discussed, followed by a look at some System-on-Chip (SOC) and Network-on-Chip (NOC) architectures.

The AGP standard has progressed through 1x to the current 8x model (peaking at 2.1 GBytes/sec); however, with new GPUs working at higher bit precisions (128 bit/RGBA in the GeForce 6800 series) greater throughput was required, doubling in required bandwidth every 2 years. AGP uses parallel point to point interconnections with timing relative to the source; this became restrictive past 8x, as the transfer speed increased and the capacitance and inductance on connectors needed to be reduced. A new transfer method was required: serial differential point to point offers a high speed interconnect at 2.5 GHz/sec, overcoming the above capacitance / inductance issues.
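As a rough sense of scale for the bandwidth figures above, the following back-of-envelope calculation uses the quoted 128-bit RGBA precision and AGP 8x's 2.1 GByte/s peak; the 512x512 frame size is an illustrative assumption rather than a fixed video format.

```cpp
// Transfer arithmetic with the figures quoted above: 16 bytes per 128-bit RGBA
// pixel and a 2.1 GByte/s peak link rate.
#include <cstdio>

int main() {
    const double bytes_per_pixel = 16.0;                       // 128-bit RGBA
    const double frame_bytes     = 512.0 * 512.0 * bytes_per_pixel;
    const double agp8x           = 2.1e9;                      // bytes per second (peak)

    std::printf("one frame = %.1f MB, peak %.0f frames/s over AGP 8x\n",
                frame_bytes / 1e6, agp8x / frame_bytes);
    return 0;
}
```

Moving from 32-bit to 128-bit pixels quarters the achievable frame rate over the same link, which is the pressure behind the move away from AGP described here.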
This became the first PCI Express generation. PCI Express 16x offers 8 GB/sec peak bandwidth (4 GB/sec in each direction) and is the new standard for CPU-GPU communications.

SOC/NOC Topologies: The connection of a number of cores (e.g. DSPs) to a number of resources (e.g. memory) raises questions of the best interconnect structure. We require the maximum possible throughput at the lowest possible 'cost': the cost here can be power, area or even money.

The Bus: A bus has 3 main components: an Arbiter decides who 'owns' the bus, a Master attempts to own the bus and a Slave responds to the Master's communications. One bus structure is the AMBA AHB Bus, which is a processor controlled model, i.e. the processor decides when it can transmit and arbitrates for the bus. The opposite is where the Arbiter holds requests from the processors and then decides when they can transmit. There are other bus architectures, such as the Wishbone Bus and the CoreConnect Bus (IBM), which have different features, whether a processor or arbiter controlled model. All have the 3 main components above and adhere to the common concept of bus request by master and grant by arbiter.

The Switch: Xu et al propose a number of crossbar switch architectures, which are defined as NxN where N is the number of switch ports. The control, address and data are communicated in 1 packet and a central control unit arbitrates between masters on the switch ports. They simulated different switch sizes and port widths using OPNET (a communications network package, which they modified to work with a NOC). In creating their architecture model they make assumptions about how long certain blocks take to function; an arbiter/switch control, for example, takes 1 cycle. Their metrics, which may be useful to analyse interconnects in general, are as follows:

Utilisation = Average throughput / Maximum throughput   (1)
System frequency = 3x10^8 * 150 / Processed Frames   (2)

Throughput is proportional to 1/utilisation. They showed that a switch was better than the bus model for more than 8 bits/port, with a high effective throughput and low latency. Xu et al found the optimum for the Smart SOC Camera system to be a 3x3 crossbar switch with 2 shared memories, where the processors are split across the memory. They draw the conclusion that the bottleneck in a SOC design is the interconnect; this justifies a large amount of design time being spent on it.
The Network: The advantages of a network are that it has high performance / bandwidth, can handle concurrent communications and has better electrical properties than a bus or switch. The limitations are the opposite to computer networks: there is less constraint on the number of interconnections, but more on buffer space. As the size of chips increases, global synchrony becomes infeasible as it takes a signal several clock cycles to travel across a chip. The NOC overcomes this problem by being a GALS (Globally-Asynchronous Locally-Synchronous) architecture.

Dally and Towles propose a mesh structured NOC interconnect as a general purpose interconnect structure. In their example they divide a chip into an array of 16 tiles, numbered 0 through 3 in each axis. Interconnections between tiles are made as a folded torus topology; this attempts to minimise the number of tiles a packet must pass through to reach its destination. Each tile therefore has a N, E, S, W connection, and each has an input and output path to put data into the network or take it out respectively. The data, address and control signals are grouped and sent as a single 'flit.' Area is dominated by buffers (6.6% of tile area in their example). The network could be run at 4GB/s (at least twice the speed of the tiles) to increase efficiency; however this would increase the space required for buffers. The advantage of being general purpose is that the frequency of use would be greater, justifying more effort in design; the disadvantage is that one could do better by optimising to a certain application (though this may not be financially viable).

The disadvantage of the above example is that the tiles are not always going to be the same size, and thus space would be wasted for smaller designs. Jantsch proposes a solution which overcomes this, using a similar mesh structure. The main differences are that he no longer uses the torus topology but a standard connection to a tile's neighbours. He also provides a region wrapper, around a block considerably larger than the others, which emulates the original network being present. Jantsch suggests 2 possibilities for the future: many NOC designs for many applications (expensive in design time) or 1 NOC design for many applications (inflexible). The latter would justify the design cost; however, one would need to decide on the correct architecture (mix of CPU, DSP etc.), Language (to configure the NOC), Operating System (for when running) and design method for a set of tasks.
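The report does not state Dally and Towles' routing algorithm, so the following is only an illustrative way of counting how many links a packet crosses on a simple (non-folded) 4x4 mesh using dimension-ordered routing, to make the "minimise the number of tiles passed through" idea concrete.

```cpp
// X-then-Y routing on a 4x4 tile array: hop count equals Manhattan distance.
#include <cstdlib>
#include <cstdio>

struct Tile { int x, y; };   // coordinates 0..3 in each axis

int hops(Tile src, Tile dst) {
    return std::abs(dst.x - src.x) + std::abs(dst.y - src.y);
}

int main() {
    Tile a{0, 0}, b{3, 2};
    std::printf("packet from (0,0) to (3,2) crosses %d links\n", hops(a, b));
    return 0;
}
```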
There are other suggested interconnect methods: Hemani et al suggest a honeycomb structure where each component connects to 6 others. Dobkin et al propose a similar mesh structure to Jantsch, however they include bit-serial long-range links. They use a non-return-to-zero method for the bit-serial connection and believe it to be best for a NOC. Benini and De Micheli introduce the SPIN (Scalable, Programmable, Interconnect Network), with a tree structure of routers and the nodes being the leaves of the tree. This is a snapshot of the NOC ideas, for which there are possibly as many topology ideas as for our standard computer networks today.

2.3 FPGA Internal Structure

FPGAs were first designed to be as programmable as possible, comprising configurable logic blocks and interconnects. As they have developed, manufacturers have introduced standard components into them to improve functionality, such as embedded memory blocks and, in some of the latest Xilinx FPGAs, PowerPC processors. In this section interesting modules will be considered which could be used within FPGAs in the future. There is potential for future work in this area in the development of new blocks which could be placed into an FPGA.

Multipliers: The motivation for the use of embedded multipliers is that the implementation of binary multiplication in FPGAs is often too large and slow. A possible solution is Programmable Array Modules (PAMs): these are fixed in size, however they waste space if small bit-length multiplications are required. Other solutions are trees or pre-processing methods, although these are difficult to generalise. A better solution is presented by Haynes and Cheung: to use reconfigurable multiplier blocks. They designed a Flexible Array Block (FAB) capable of multiplication of two 4 bit numbers; FABs combine to multiply numbers of lengths 4n and 4m. The blocks are 1/30th the size of the equivalent pure FPGA implementation and need only 40% usage to make them a worthwhile asset. The speed of the FABs is comparable to that of non-configurable blocks, at a cost of them being twice the size and having twice the number of interconnects. The latter isn't a problem due to the many metal layers in an FPGA. A modification was proposed later by Haynes, Ferrari and Cheung, with a design based on the radix-4 overlapped multiple-bit scanning algorithm, which was more speed and area efficient. The MFAB (Modified FAB) multiplies 2 numbers of length 8 together, or less with redundancy. The 2 input numbers can be independently signed or unsigned. The length must be greater than 7 to make a space saving over the FAB; they are also smaller than a pure FPGA implementation.
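The principle behind combining fixed 4-bit multiplier blocks into a 4n x 4m multiplication can be sketched as below; this shows only the arithmetic decomposition into shifted partial products, not the FAB's actual circuit or its signed/unsigned handling.

```cpp
// Building an 8 x 8 multiply from four 4 x 4 multiplier blocks.
#include <cstdint>
#include <cstdio>
#include <cassert>

// Stand-in for one 4-bit x 4-bit hardware block.
static uint16_t mul4x4(uint8_t a, uint8_t b) {
    return static_cast<uint16_t>((a & 0xF) * (b & 0xF));
}

static uint16_t mul8x8(uint8_t a, uint8_t b) {
    uint8_t al = a & 0xF, ah = a >> 4;
    uint8_t bl = b & 0xF, bh = b >> 4;
    // Four partial products, shifted and summed: (ah*16 + al) * (bh*16 + bl).
    return static_cast<uint16_t>(
        mul4x4(al, bl)
        + (static_cast<uint32_t>(mul4x4(ah, bl)) << 4)
        + (static_cast<uint32_t>(mul4x4(al, bh)) << 4)
        + (static_cast<uint32_t>(mul4x4(ah, bh)) << 8));
}

int main() {
    assert(mul8x8(0x2A, 0x17) == 0x2A * 0x17);
    std::printf("0x2A * 0x17 = 0x%X\n", mul8x8(0x2A, 0x17));
    return 0;
}
```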
Function Evaluation: A more specific block is one for function evaluation, such as that proposed by Sidahao, Constantinides and Cheung. Previously a Lookup Table (LUT) approach was used; their architecture provides a lower area solution at the cost of execution speed.

Memory: In video applications the storage of frames of data is important, therefore it is useful to be able to store this data in memory efficiently. Embedded Dual-Port RAMs, currently available in devices such as the Xilinx Virtex II Pro family, enable two accesses concurrently. It is likely this technology will progress further, perhaps to an Autonomous Memory Block (AMB), which can create its own memory address.

2.4 Debugging tools / coding

The testing of a hardware module can be split into 2 areas: pre-load and post-load.

Pre-load: The most widely known pre-load test environments are ModelSim (Xilinx) and Quartus (Altera). COMPASS (Avant) is an automated design tool. The benefits are highlighted by Singh and Bellec in 1994: the user can enter a design as a state machine or dataflow and therefore implement at the system rather than a lower (e.g. VHDL) level. A downside to FPGAs compared with ASICs is in pre-load testing: specifically back-annotated compared with initial testing. In ASIC design only wiring capacitance is missing from pre-synthesis tests, whereas in FPGA design the module placement is decided at synthesis, drastically affecting timing.

Post-load: A previously popular test strategy was 'Bed of Nails', where pins are connected directly to the chip and a logic analyser. Due to the large pin count on today's devices this is impractical. Following this was 'Boundary Scanning' by JTAG (Joint Test Action Group); however this only probed external signals, and even if internal probing were possible it would significantly alter the timing. The issue of post-load testing is currently approached by using part of the FPGA space for a debugging environment. Better still is Xilinx ChipScope: an embedded black box which resides inside the FPGA as a probe unit, invoked during on-board test. Its function is to act as a logic analyser, creating a level of abstraction for the user. This takes up only 4% of a Virtex XCV1000 chip (512 slices). The downside is that it uses the slow JTAG interface to communicate readings. An example of an on-chip debugging environment, proposed by Melis, Cheung and Luk, is the SONICmole used with UltraSonic, which uses the faster interface (the PCI Bus).
The SONICmole's function is to act as a logic analyser - viewing and driving signals - whilst being as small as possible and having a good software interface. It uses the PIPE memory to store signal captures. It has been implemented at the UltraSonic maximum frequency of 66MHz and is portable to other reconfigurable systems.

Coding: Firstly, coding for FPGAs: these can be programmed through well known languages such as VHDL and Verilog at the lower level, and MATLAB (System Generator) and, more recently, SystemC (see systemc.org) and Handel-C at the higher level. The focus of this sub-section will be on programming the GPU: as FPGA coding is widely understood, the GPU is of particular interest. In the development of the PlayStation 2, Sony supported a full C implementation with the on-chip GPUs combined with off-chip resources. This shows the trend towards more user programmability.

Cg Language: Cg was developed by Nvidia for developers to program GPUs in a C-like manner. When developing Cg, Nvidia worked closely with other companies (such as Microsoft) who were developing similar tools. The features of C beneficial for an equivalent GPU programming tool are: performance, portability, generality and user control over machine level operations. The main difference to C is the stream processing model for parallelism in GPUs. Cg supports high level programming, however it is linkable with assembly code for optimised units, giving the programmer more control. An aim of Cg was to support non-shading uses of the GPU. Cg supports user defined compound types (e.g. arrays and structures), which are useful for non-graphics applications. It also allows vectors of floating point numbers up to size 4 (e.g. RGBA), along with matrices up to size 4x4 (for operations on the vectors). A downside is that Cg doesn't support pointers or recursive calls (as there is no stack structure); pointers may be implemented at a later date. Nvidia separates the programming of the 2 GPU processors (vertex and fragment) to avoid branching and loop problems, and so they are accessed independently. The downside is that optimisations across this boundary aren't possible; a solution is to use a meta-programming system to merge this boundary. Nvidia introduce the concept of profiling for handling differences in generations of GPUs: each GPU era has a profile of what it is capable of implementing. There is also a profile level for all GPUs, necessary for portable code. (Fernando and Kilgard provide a tutorial on using Cg to program graphics hardware.)
For the non-programmable parts of a GPU, CgFX handles the configuration settings and parameters.

2.5 Literature Survey Conclusions

In summary, some current architectural uses of GPUs and FPGAs have been considered, including an FFT routine on the GPU and some graphics routines on an FPGA. The Sonic architecture was looked at, particularly how it is used as a hardware accelerator for graphics applications. This was followed by interconnect structures, looking at buses, switches and networks: specifically their advantages and disadvantages. The internal structure of an FPGA was then considered, investigating embedded components that could be useful in video applications, such as multipliers, memory and function solvers. Finally, tools used in pre and post device function load and in device programming were analysed.

The potential of graphics hardware has long been exploited in the gaming industry, focusing on its high pixel throughput and fast processing. Programming this hardware was historically difficult: one could use an assembly level language, in which it takes a long time to prototype. The alternative is an API, such as OpenGL, which limits a programmer's choice to a set of functions. In 2003 Nvidia produced a language called Cg, allowing high level programming without losing the control of assembly level coding. Following this, non-graphical applications were explored, for example Moreland and Angel's FFT algorithm. It has been shown to be particularly efficient where there is no inter-dependence between pixels.

3 Research Questions

The interconnect between cores in a design is a common bottleneck. There have been many architectures proposed and developed for module interconnects (groupable as bus, switch and network), discussed in the Literature Survey, to either eliminate or reduce this delay. It is important to have a good model of the interconnect: a model allows the best interconnect for a task to be decided without the need for full implementation. This leads to the first research question: investigate suitable interconnect architectures for mixed core hardware blocks and find adequate ways to model interconnect behaviour.

The adaptability of graphics hardware to non-standard tasks leads to the second research question: to further investigate graphics hardware (specifically the Cg language) used in a mixed core architecture, whilst maintaining the current benefits of using FPGA / processor cores. This takes advantage of the price-performance ratio of graphics hardware.
FPGA cores allow for high levels of parallelism and flexibility, as many designs can be implemented on the same hardware. Processors can be optimised for certain types of instructions and run many permutations of them without the costly reprogramming associated with FPGAs.

When one wishes to resize an image there are two possibilities for determining new values for pixels: filtering or interpolation, each of varying complexity. Filtering could be a FIR (Finite Impulse Response) low-pass filter, with the complexity varying in the number of taps. Interpolation could be a bi-linear, bi-cubic or spline method. Theory suggests FIR filtering, of a 'long enough' tap length, should produce a smoother result; this may not however be perceptively the best, or could be too computationally complex. The final research question is: investigate the perceived quality versus computational complexity of the 2 methods.

4 Interconnect Model

My first task was to implement a high level model of the ARM AMBA Bus. This would model its performance for varying numbers of masters & slaves and be cycle accurate. The motivation came from a paper by Vermeulen and Catthoor, where an ARM7 processor was used, in addition to custom hardware, to allow for up to 10% post-manufacture functional modification. SystemC, a relatively new hardware modelling library, was used for this.

A multiply function, for a communicating processor and memory, was modelled: two values, to be multiplied, are loaded in consecutive cycles, multiplied, then returned to memory using an interconnect. This consists of data plus control signals, as a simple bus model. SystemC is used to create a VCD (Value Change Dump) file, which can be displayed in a waveform viewer such as ModelSim's. This demonstrates how to display and debug the results of a hardware model. The results are seen in Figure 1.

Figure 1. Waveform for multiplier implementation
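A minimal SystemC-style sketch of the kind of model and VCD tracing described above is shown below. The module and signal names are illustrative assumptions rather than the report's actual code, but the operand values follow the example discussed next (hexadecimal 2 and 17, giving the product 2e).

```cpp
// Minimal SystemC sketch: a registered multiplier whose ports are traced to a
// VCD file for viewing in a waveform viewer such as ModelSim.
#include <systemc.h>

SC_MODULE(Multiplier) {
    sc_in<bool>          clk;
    sc_in<sc_uint<8> >   a, b;       // operands loaded from 'memory'
    sc_out<sc_uint<16> > product;    // result returned over the simple bus model

    void compute() { product.write(a.read() * b.read()); }  // one multiply per clock edge

    SC_CTOR(Multiplier) {
        SC_METHOD(compute);
        sensitive << clk.pos();
    }
};

int sc_main(int, char*[]) {
    sc_clock clk("clk", 10, SC_NS);
    sc_signal<sc_uint<8> >  a, b;
    sc_signal<sc_uint<16> > product;

    Multiplier mult("mult");
    mult.clk(clk); mult.a(a); mult.b(b); mult.product(product);

    // Create the VCD (Value Change Dump) file described in the text.
    sc_trace_file* tf = sc_create_vcd_trace_file("multiplier");
    sc_trace(tf, clk, "clk");
    sc_trace(tf, a, "a");
    sc_trace(tf, b, "b");
    sc_trace(tf, product, "product");

    // Load the operands in consecutive cycles; 0x02 * 0x17 = 0x2E appears on 'product'.
    a.write(0x02);  sc_start(10, SC_NS);
    b.write(0x17);  sc_start(30, SC_NS);

    sc_close_vcd_trace_file(tf);
    return 0;
}
```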
Figure 1 shows 2 identical multiply cycles of the custom hardware, reading the 2 hexadecimal values, 2 and 17, from memory locations 1 and 2 respectively and returning their product, 2e, to location 3. The processor is optimised to take only the number of cycles required for memory accesses.

The AMBA bus specification was considered and 2 types were of interest, namely AHB (Advanced High-performance Bus) and ASB (Advanced System Bus). Each allows multiple master and slave configurations. The advantages of AHB are that it allows burst transfers and split transactions, has high performance and enables pipelining; for this reason, and because it is the newer architecture, the AHB bus was chosen. A typical design consisting of 1 master with 2 slaves, taken from the specification, is a good starting point for the model. The architecture is similar to the multiplier system: Slave 1 could hold instructions for the 'processor' and Slave 2 data (i.e. operands and result). The Arbiter trivially grants the single Master bus access all the time. A decoder switches between the 2 slaves, and their signals are multiplexed to the master. The routing and decoding between the blocks, along with some arbiter signalling, comprises 'the Bus'.

Figure 2. Block Diagram of the AMBA AHB Bus (Arbiter, master and slave multiplexers and Decoder, with the HADDR, HWDATA and HRDATA buses, master select, slave select, split and control signals between them)
Figure 3. Test output showing reset and bus request / grant procedure

A physical interpretation of how the AMBA AHB bus blocks fit together can be seen in Figure 2. HWDATA and HRDATA apply to write and read data respectively, and the H prefix denotes the AHB bus as opposed to ASB. The control signals are requests from masters and split (resume transfer) signals from slaves. Missing from Figure 2 are the global clock and reset signals, which are routed to each block. One master can use the data bus whilst another controls the address bus.

A number of meetings were held with Ray Cheung from Computing (currently modelling processors) to discuss the possible interoperability between an AMBA bus model and a processor model. A fully flexible bus and processor model was suggested, which could later be extended to include other hardware blocks such as FPGAs. Following this, my attention was turned to the design of such a bus model.

Complexity in coding the multiplexer blocks lay in making them general. The master multiplexer used a delayed master select signal from the Arbiter to pipeline the address and data buses. For the decoder, an assumption was made about how a slave is chosen. The number of address bits, used to decipher which slave to use, is calculated as log2(number of slaves), rounded up. The bits are taken as the MSBs of the address, and the literal value of the binary number indicates which slave to use, i.e. 01 would be slave 1. Constants were used, in place of actual numbers, for the data and address signal widths throughout.
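The slave-selection assumption described above can be sketched as follows: ceil(log2(number of slaves)) address MSBs are examined and their literal value picks the slave. The 32-bit address width is an illustrative constant, standing in for the named constants used in the model.

```cpp
// Address decoding sketch: the top ceil(log2(slaves)) bits select the slave.
#include <cstdio>

const unsigned ADDR_BITS = 32;   // address bus width (a named constant in the model)

unsigned select_bits(unsigned num_slaves) {
    unsigned bits = 0;
    while ((1u << bits) < num_slaves) ++bits;        // ceil(log2(num_slaves))
    return bits;
}

unsigned decode(unsigned addr, unsigned num_slaves) {
    unsigned bits = select_bits(num_slaves);
    return bits ? (addr >> (ADDR_BITS - bits)) : 0;  // take the MSBs
}

int main() {
    // With 2 slaves one MSB is used: addresses in the upper half go to slave 1.
    std::printf("slave = %u\n", decode(0x80000000u, 2));
    return 0;
}
```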
A test procedure was produced; this loads stimulus from a text file, with the result viewed as a waveform. The file consists of lines of either 'variable' and 'value' pairs, or 'tick' followed by a number of cycles to run for. Initially, simple tests were carried out, as with the multiplexer example, to check for correct reset behaviour and that the 2 multiplexers worked (with a setup of 1 master and 2 slaves, a common configuration). An example of a test output is shown in Figure 3. When reset, all outputs are set to zero, irrespective of inputs. When the master requests the bus, the HMASTER signal changes immediately (with HBUSREQ) to the correct master, allowing for multiplexing and so that the slaves know which master is communicating. In the case of more than 1 master, the arbiter waits till HREADY goes high before granting access through HGRANT. In the example, as the HSEL signals change at the bottom of the waveform, the 2 read data signals are multiplexed, which is what would be expected. The model was further tested with 2 masters and 2 slaves. Within this, the sending of packets consisting of 1 and of multiple data items was experimented with, along with split transfers and error responses from slaves. The correct, cycle accurate, results were seen. The waveforms for these become complicated and large very quickly; however they are of a similar form to Figure 3.

5 Primary Colour Correction

Primary Colour Correction is a non-graphical application, as with the FFT on a GPU algorithm discussed above. The algorithm performs three main transformations per pixel: Input Correction, Histogram Equalisation and Colour Balancing (see Figure 4). Input Correction and Colour Balancing require the RGB signal to be converted to HSL (Hue, Saturation and Luminance) space. I will now discuss my optimised version of this. In my optimisations, I converted half way to a chroma representation, 'ycbcr', and implemented the algorithm at this level, which showed considerable speed up. Other key optimisations were to perform calculations in vector space and to remove, where possible, conditional statements, which are inefficient on GPUs.
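One of these optimisations, removing conditional statements, is easiest to see on a small example such as the fix-to-range-[0,1] operations in the algorithm. The actual implementation is in Cg, so the C++ fragment below only shows the shape of the change from a branchy clamp to the branch-free min/max form that maps onto the GPU's built-in instructions.

```cpp
// Branch removal sketch: clamp a channel to [0,1] without conditionals.
#include <algorithm>

// Branchy version: conditionals like these are inefficient on the GPU.
float clamp_if(float x) {
    if (x < 0.0f) return 0.0f;
    if (x > 1.0f) return 1.0f;
    return x;
}

// Branch-free version: built-in min/max (or saturate) replaces the conditionals.
float clamp_minmax(float x) {
    return std::min(1.0f, std::max(0.0f, x));
}
```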
The lessons learnt can be summarised below:
• Perform calculations in vectors & matrices
• Use in-built functions to replace complex maths & conditional statements
• Pre-compute uniform inputs, avoiding repetition for each pixel
• Consider what is happening at the assembly code level - decipher the code if necessary
• Don't convert between colour spaces if not explicitly required

Table 1 shows the performance results for the initial and optimised designs using various generations of GPUs. It is seen that there is a large variation in the throughput rates of the devices, although there is only 2-3 years between them. For more information on the optimisation of the primary colour correction algorithm see .

Table 1: Performance Comparison on GeForce architectures (6800 Ultra, 6800 GT, 6600, 5700 Ultra and 5200 Ultra), giving the throughput in MP/s of the Optimised (Final) and Initial designs on each device

Figure 4. Primary Colour Correction Block Diagram (texture fetch of the RGB input, followed by the Input Correct, Histogram Correct and Color Balance stages, each fixed to the range [0,1], with parameters including HueShift, SatShift, LumShift, BlackLevel, Gamma, WhiteLevel, OutputBlackLevel, OutputWhiteLevel, AreaSelection and ChannelSelection)
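As a quick check of the real-time High Definition claim made in the introduction, the pixel rate that must be sustained can be estimated as below. The 1920x1080 resolution and 25 frames/s are assumptions for illustration, since the report does not state which HD format was used.

```cpp
// Required pixel rate for real-time HD video processing.
#include <cstdio>

int main() {
    const double width = 1920, height = 1080, fps = 25;
    double required_mps = width * height * fps / 1e6;   // megapixels per second
    std::printf("real-time HD needs about %.1f MP/s\n", required_mps);
    return 0;
}
```

Any device whose measured throughput in Table 1 exceeds this figure meets the real-time requirement for that format.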
For efficient optimisation of an algorithm it is important to understand the performance penalty of each section. A detailed breakdown of the above primary colour correction algorithm was carried out; for more detail see . Some performance bottlenecks in the implementation were 'compare' and 'clamping' operations. The conversion between colour spaces was seen to have a large delay penalty, due mainly to the conversion from RGB to XYL space. In Histogram Equalisation, 'pow' was also seen to add greatly to the delay and accounts for almost 50% of it (0.00089 s/MP). The Colour Balancing function was seen to be the slowest of the three main blocks. This is due to the large number of min-terms in the calculations and there being fewer intermediate storages required in compares. The register usage, although minimal, was seen to be larger in calculations than in compare operations. In this case the register usage was not a limiting factor to the implementation, however it may be for other algorithms. The breakdown of delay for each block can be seen below.

Block               Cycles  R regs  H regs  Instructions  Throughput (MP/s)  Delay (s/MP)
Input Correction      16      3       1         35             350.00          0.00286
Histogram Correct     12      2       1         25             466.67          0.00214
Colour Balancing      23      3       1         56             243.47          0.00411

Table 2: Effect on Performance of Each Block of the Primary Colour Correction Algorithm

6 Plan of Work Leading to Transfer

The next step in the modelling of interconnects is to consider a general bus structure; this can also consist of multiple masters and slaves, varying methods of arbitration, clock speeds, shared or individual read-write lines, etcetera. This requires a more abstract implementation, which includes many of each of these, and which is allowed for in the SystemC library. A model of cross-bar switches and a network on chip structure are other possibilities for future work on interconnect modelling.

The next stage on the question of graphics hardware is to implement the primary colour correction algorithm on a Pentium processor and on an FPGA. An optimised implementation in MATLAB completed the computation, with a 512x512 image, in 2.3 seconds on a Pentium 4. This equates to 0.1 MP/s, which is much slower than the graphics card. An implementation in C / C++ is expected to perform better, but still to be 1-2 orders of magnitude worse.
The FPGA implementation is expected to out-perform both, if a large enough device is used. When limited to a device of equivalent cost to a graphics card, the FPGA is expected to perform worse than the graphics card but better than the CPU.

The algorithms will be tried on the graphics hardware and any limitations of the interconnect, either on or off board, noted. This relates to my above aims. Implementations may also be prototyped on an FPGA device and a Pentium 4 processor for further comparison of computational capabilities. A comparison of the visual differences of filtering and interpolation will be performed, along with the computation time required by each, particularly in hardware. The literature survey will also be updated to include documents relating to interpolation and filtering algorithms. An updated Gantt chart for my work intentions, up to transfer, can be found in Appendix 1 at the rear of this document.

7 Conclusion

A literature survey of related work to my chosen research area has been presented, highlighting possibilities for work in the areas of interconnects and the utilisation of graphics hardware in a mixed core system. My three main research questions - investigating interconnects and their modelling, the use of graphics hardware for video processing, and the comparison of FIR filtering and interpolation - were then explained. The work covered to date on interconnect modelling and the primary colour correction implementation on a graphics card was summarised, followed by a plan of my future work including a Gantt chart.
”General calculations using graphics hardware.www.org. pp.A Plug in Architecture for Video Processing”. 2000  Wim Melis. P. pp. pp.07. FPGA. IEEE Transactions on VLSI Systems Vol 11. G. Peter Cheung and Wayne Luk. pp. ”A system for interactive modelling of physical curved surface objects”. and Siganos. F. 2002  Robert Strzodka and Christoph Garbe. Dupont-De-Dinechin. Derbyshire. “The FFT on a GPU. FPGA. 1148-1151. 2004  Wayne Luk.IEEE Symposium on FPGAs for Custom Computing Machines.A reconﬁgurable image processing architecture”. Constantinides and Wayne Luk. ”Real-Time Motion Estimation and Visualisation on Graphics Cards”. 2003  Simon Haynes. ”A Reconﬁgurable Engine for Real-Time Video Processing”.336-340  Chris Trendall and A. 1999  Pete Sedcole. England.com Issue 10. ”Virtual Hardware for Graphics Applications Using FPGAs”. A. Lecture Notes in Computer Science. Rice. Computer.” in The Eurographics Association. ”A Reconﬁgurable Platform for Real-Time Embedded Video Image Processing”.N. 2000  Kenneth Moreland and Edward Angel. ”SONIC . D. ”Why PCI Express Architectures for Graphics”. 1998  Satnam Singh and Pierre Bellec. 1999  Intel Developers Network for PCI Esxpress Architecture. J. FPL. ”Video Image Processing with the Sonic Architecture”. 2003. Peter Cheung and Wayne Luk. Andreou. 2003  Jeffery M.express-lane. pp. ”Image Registration of Real-Time Broadcast Video Using the UltraSONIC Reconﬁgurable Computer”. 106-108.21-30. www. Peter Cheung. Entertainment Computing.A. ”Power-Efﬁcient Flexible Processor Architecture for Embedded Applications”. FCCM 1994  Simon Haynes. John Stone. Shirazi. with applications to interactive caustics”. James Stewart. 2004 21 . pp.wired. 112-136  Micheal Macedonia. SIGGRAPH 78 1978. ”Nvidia”. ”Sonic . 2003  Fredrick Vermeulen and Francky Catthoor. Peter Cheung and Wayne Luk. pp. ”The GPU Enters Computing’s Mainstream”. University of Duisburg. 2002  Simon Haynes. N. 50-57. O’Brian. 376-385. John Stone. Poster .References  J.
Ran Ginosar.Y. ”A case study in Networks-on-Chip Design for Embedded Video”. ”Autonomous Memory Block for Reconﬁgurable Computing”. Shashi Kumar. Johnny Oberg. ”SONICmole: A Debugging Environment for the UltraSONIC Reconﬁgurable Computer”. ”Fast Asynchronous Bit-Serial Interconnects for Network-on-Chip”. 1999  Jiang Xu. Stephen Glanville. AMBA SPECIFICATION (Rev 2. ”A Reconﬁgurable Multiplier Array for Video Image Processing Tasks. Mikael Millberg and Dan Lindvist. ”Networks on Chips: A New SOC Paradigm”. Fernando and M. Kurt Akeley and Mark J. wayne Wold. 2003  William R. 896-907. R. ACM Transactions on Graphics. pp. ”Efﬁcient Implementation of Primary Colour Correction on Graphics Hardware”. Proceedings of the IEEE NorChip Conference. avaliable from author. ”Network on Chip: Architecture for billion transistor era. pp. Computer. 2003  R. ”Breakdown of performance for Primary Colour Correction”. Cheung. 2001  Axel Jantsch. pp.T.K. Wiangtong. 581-584. Adam Postula. 2002  Rostislav Dobkin. Ewe and P. George Constantinides and Peter Cheung. Srimat Chakradhar and Tiehan Lv. Suitable for Embedded In An FPGA Structure. IEEE Symposium on Circuits and Systems. Not Wires: On-Chip Interconnection Networks”. 2005 22 . 1999  Nalin Sidahao.”. C. 2000  Luca Benini and Giovanni De Micheli. 804-807. Isral Cidon. pp. 2004  Simon Haynes and Peter Cheung.J. Antonio Ferrari and Peter Cheung. 2005  Ben Cope. Proceedings of Custom Integrated Circuit Conference. ”The Cg Tutorial: The deﬁnative guide to programming real-time graphics. ”Flexible Reconﬁgurable Multiplier Blocks Suitable for Enhancing the Architecture of FPGAs”. ”Route Packets. DAC. ”Networks on Chip”. 2004  T. ARM. avaliable from author. Joerg Henkel. ”Architectures for Function Evaluation on FPGAs”. 70-78. Addison Wesley. IEEE Symposium on Field-Programmable Custom Computing. pp. ”Cg: A system for programming graphics hardware in a C-like language. Peter Cheung and Wayne Luk.808-811. Avinaom Kolodny and Arkadiy Morgenshtein. Axel Jantsch. Kilgard. Automation and Test European Conference. 2003  Wim Melis. Dally and Brian Towles. ISCAS. ISCAS.0). 2003  Ben Cope. Mark. Kilgard. 1998  Simon Haynes. 2002  Ahmed Hemani. 2004  William J.