
Algorithm / Hardware Co-Design using MATLAB

Robert K. Anderson
Xilinx DSP Technical Marketing

High-level ESL design methodologies now enable the simultaneous development of both the DSP algorithms and the hardware architectures, resulting in systems with a much higher level of optimization. Traditionally, FPGAs have been designed by engineers with limited knowledge of the application domain and, similarly, algorithms have been developed by researchers without insight into FPGA design. This significant gap results in inefficiencies in both the final hardware and the algorithmic system response. These inefficiencies can be addressed when the algorithmic methods and the hardware architectures are designed in a unified manner. Performing this type of algorithm and hardware co-design provides a new degree of freedom that allows the algorithmic methods to more closely match the capabilities and capacity of the targeted hardware platform.

MATLAB is an abstract language where the algorithmic specification does not necessarily imply any particular hardware implementation. As such, it is used extensively by researchers for algorithm development. These abstractions make it an obvious choice for architectural co-exploration and, combined with an ESL design methodology, provide the foundation for algorithm / hardware co-design.

This paper is the second part of a 3-part series discussing the considerations and possibilities for transitioning from high-level MATLAB into FPGA implementations. It builds on the concepts introduced in our first paper, “Using MATLAB to Create an Executable Specification for Hardware”, which discusses how MATLAB can be used as an executable specification for hardware development and verification. Here we explore how MATLAB code, supported by an automated ESL design flow, enables an algorithm developer to tailor the algorithmic methods to match available resources on an FPGA, and in doing so produce a superior overall design.
These tools are designed with the algorithm developer in mind, and are intended to bridge the gap between high-level algorithmic descriptions and low-level hardware implementation details.

Introducing the AccelDSP Synthesis Tool
The AccelDSP synthesis tool is a high-level algorithmic synthesis tool that takes high-level floating-point MATLAB algorithms as input and performs a series of specific transformations to generate an optimized, fixed-point design. These transformations remove the requirement that engineers manually re-code floating-point MATLAB into fixed-point MATLAB using “fi” datatypes, quantizer functions or fixed-point C. Removing this requirement eliminates the design gap that could otherwise create a divergence between the original floating-point MATLAB source and the hardware implementation. It also eliminates the potential that a design is forced into any specific hardware implementation too early in the overall design process.
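To give a feel for what a floating- to fixed-point transformation involves, the following minimal sketch emulates signed fixed-point quantization using only base MATLAB functions. The word length `wl`, fraction length `fl`, the helper name `quantize` and the sample values are assumptions chosen for illustration. The sketch does not represent how AccelDSP performs the conversion internally.

```matlab
% Illustrative only: emulate signed fixed-point quantization in base MATLAB.
% wl (word length) and fl (fraction length) are hypothetical example values.
wl   = 12;                      % total bits, including the sign bit
fl   = 8;                       % fractional bits
lsb  = 2^(-fl);                 % weight of one least-significant bit
qmax = (2^(wl-1) - 1) * lsb;    % largest representable value
qmin = -2^(wl-1) * lsb;         % smallest representable value

% Round to the nearest LSB, then saturate to the representable range.
quantize = @(x) min(max(round(x / lsb) * lsb, qmin), qmax);

x   = [0.1, -0.7, 3.14159, 9.5];  % floating-point samples (9.5 is out of range)
xq  = quantize(x);                % fixed-point approximation of x
err = abs(x - xq);                % rounding error, or saturation error for 9.5
disp([x; xq; err]);
```

For in-range inputs the error is bounded by half an LSB, while the out-of-range sample saturates at `qmax`. Changing `wl` and `fl` shows how precision trades against data-path width.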

Figure 1 – AccelDSP Synthesis Tool

AccelDSP is designed to be intuitive to use for algorithm developers who are familiar with the MATLAB programming language but not RTL design methodologies. This is partly achieved by automating or encapsulating the FPGA synthesis, verification and implementation steps. Key to AccelDSP’s ease of use is a unique way of breaking out each step of the design flow using a step-by-step flowbar that guides users through the process. The tool supports a variety of “architectural exploration” capabilities that allow the user to quickly modify an implementation using various numerical implementations and micro-architectures, and subsequently generate an RTL implementation optimized for targeting Xilinx devices. This will be discussed in more detail later in the article. Once a suitable fixed-point implementation is achieved the user can use these design exploration techniques to improve the overall area or performance of the design.

MATLAB to Embedded C Design Flow
In a recent article from The Mathworks on their MATLAB to Embedded C design flow (http://www.dspdesignline.com/showArticle.jhtml?articleID=207800773), a design example was provided that helps illustrate the steps of this development process. This same code example can also be used to illustrate a similar set of steps in the MATLAB to FPGA development process. The MATLAB function “adaptive_stats.m” extracts successively larger regions around a given element of a matrix to compute statistics (minimum, maximum, and median) over these regions. The function outputs a filtered center element of the matrix based on these statistics.

    function Out = adaptive_stats_roi(In)
    %In  : square matrix with odd number of rows/columns
    %Out : Center element, kept or replaced based on statistics
    smax=size(In,1);
    index=ceil(smax/2);
    center=In(index,index);
    Out=center;
    for s = 3:2:smax
        % Compute region of interest statistics
        [rmin,rmax,rmed]=roi_states(In,smax,s);
        % Conditionally remove noise
        if rmed > rmin && rmed < rmax
            if center <= rmin || center >= rmax
                Out = rmed;
            end
            break;
        end
    end

    function [rmin,rmax,rmed] = roi_states(in,smax,s)
    first = ceil(smax/2)-floor(s/2);
    last = ceil(smax/2)+floor(s/2);
    %initialize large array for sorting
    v = ones(smax*smax,1);
    count = 1;
    for i = first:last
        for j = first:last
            v(count) = in(i,j);
            count = count+1;
        end
    end
    %Sort the sub-matrix only
    n = count-1;
    v = mysort(v,n);
    %Compute statistics on sub-matrix
    rmed = v(ceil(n/s));
    rmin = v(1);
    rmax = v(n);

AccelDSP Design Example
Let’s take the same example used above by The Mathworks in their MATLAB to Embedded C design flow example and show how it can be targeted for implementation in an FPGA.

    function Outdata = adaptive_stats_roi(Indata1, Indata2, Indata3, Indata4, Indata5, Indata6, Indata7, Indata8, Indata9)
    %%%%%%%% - 1 - Matrix Indata is constructed in AccelDSP by concatenating multiple input vectors
    Indata = [Indata1 Indata2 Indata3; Indata4 Indata5 Indata6; Indata7 Indata8 Indata9 ];
    %Indata  : square matrix with odd number of rows/columns
    %Outdata : Center element, kept or replaced based on statistics
    smax=size(Indata,1);
    index=ceil(smax/2);
    center=Indata(2,2);
    Outdata=center;
    for s = 3:2:smax
        % Compute region of interest statistics
        [rmin,rmax,rmed]=roi_states(Indata,smax,s);
        % Conditionally remove noise
        if rmed > rmin && rmed < rmax
            if center <= rmin || center >= rmax
                Outdata = rmed;
            end
        end
    end

    function [rmin,rmax,rmed] = roi_states(Indata,smax,s)
    first = ceil(smax/2)-floor(s/2);
    last = ceil(smax/2)+floor(s/2);
    %initialize large array for sorting
    v = ones(smax*smax,1);
    count = 1;
    for i = first:last
        for j = first:last
            v(count) = Indata(i,j);
            count = count+1;
        end
    end
    %Sort the sub-matrix only
    n = count-1;
    v_in = v;
    %%%%%%%% - 2 - Substitute in an AccelDSP sorting function
    v = selectsort(v_in);
    %Compute statistics on sub-matrix
    rmed = v(ceil(n/s));
    rmin = v(1);
    rmax = v(n);

MATLAB Coding Considerations
Since MATLAB is an abstract language where the algorithmic specification does not necessarily imply any particular hardware architecture, various algorithmic methods may have a large number of numerically equivalent hardware architectures, based on the MATLAB coding style used. To illustrate some of these items we will list changes made in the MATLAB source to prepare it for FPGA implementation.

Construction of internal MATLAB Matrices
The first coding instance, “%%%%%%%% - 1 - Matrix Indata is constructed in AccelDSP by concatenating multiple input vectors”, allows internal matrices to be constructed by passing in vectors and concatenating them together. AccelDSP then takes these vectors and aggregates them into internal arrays.

The Sort Function
Next, we used a custom sort function, “%%%%%%%% - 2 - Substitute in an AccelDSP sorting function”. Sorting functions can be found at a variety of locations, as there are a diverse set of implementations. The definition for the sort function we are using follows:

    % Selection sort: returns the elements of v_in in ascending order
    function v = selectsort(v_in)
    v = v_in;
    n = length(v);
    for k=1:n-1,
        % Find the smallest remaining element
        index = k;
        for j=k+1:n,
            if v(j) < v(index),
                index = j;
            end
        end
        % Swap it into position k
        v([k,index]) = v([index,k]);
    end

Architectural Co-Exploration
As mentioned before, a single algorithm can have a large number of different implementations which are, for the most part, numerically equivalent. For example, there are many different ways to implement division. Each of these methods is characterized by a set of trade-offs associated with numerical precision, resource utilization, implementation performance, design throughput, and area constraints. Furthermore, designing, testing, and implementing these building blocks can take a significant amount of time and effort.

AccelDSP addresses these issues by providing the user with a library of pre-designed building blocks that can be easily swapped into and out of a design: square root, trigonometric functions, linear algebra, transformations (FFT, DCT, DWT, etc), and others. These building blocks are parameterized so the user can further tailor the final implementation for precision, numerical methods, micro-architectures, and specific FPGA resources such as RAMs, flip-flops and shift registers.

For example, in the function roi_states(), a division function is used immediately following selectsort(). This division is used to compute statistics on a sub-matrix in the design.

    %Compute statistics on sub-matrix
    rmed = v(ceil(n/s));

AccelDSP, through design exploration, gives the user the ability to pick a specific numerical architecture for this function. In this instance, AccelDSP used some basic heuristics to automatically select a Bipartite Table implementation. A number of alternative implementations, such as CORDIC, Newton-Raphson and LUT, also exist and could have been selected by the user using a design directive.
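As an illustration of how one of these alternative division methods works numerically, the following minimal sketch computes a quotient with Newton-Raphson iteration on the reciprocal of the divisor. It is plain floating-point MATLAB written for this discussion, not AccelDSP's building block. The operand values, the variable names and the iteration count `num_iter` are assumptions.

```matlab
% Illustrative sketch of Newton-Raphson division (not AccelDSP's implementation).
% Computes q = a / d by iterating on the reciprocal of d:
%     y(k+1) = y(k) * (2 - d_norm * y(k))
a = 9;                          % dividend (example value)
d = 3;                          % divisor (example value), assumed positive

% Normalize d into [0.5, 1) so that a fixed linear initial guess is accurate.
[d_norm, e] = log2(d);          % d = d_norm * 2^e, with 0.5 <= d_norm < 1
y = 48/17 - (32/17) * d_norm;   % common linear initial estimate of 1/d_norm

num_iter = 3;                   % hypothetical count: more passes cost more
                                % hardware or latency but give more precision
for k = 1:num_iter
    y = y * (2 - d_norm * y);   % each pass roughly doubles the correct bits
end

q = a * y * 2^(-e);             % undo the normalization: a/d = a*(1/d_norm)*2^-e
fprintf('a/d = %.12f (built-in: %.12f)\n', q, a/d);
```

The iteration count is the kind of parameter that trades precision against resources and throughput: each extra pass adds two multiplications and a subtraction, while a table-based method spends memory instead.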

The ability to quickly modify micro-architectures for the key building blocks of a design allows an algorithm developer to evaluate a much larger hardware solution space quickly and with minimal hardware design experience.

Floating- to Fixed-Point Conversion and Co-Verification
Once the basic hardware architecture for a design has been established, further refinements to the design are possible that help improve overall area, performance and system response. One of these refinements is quantization. FPGAs are not limited to fixed 8, 16 or 32 bit data widths as are fixed-point DSP processors. This additional freedom gives FPGA developers the ability to tailor the quantization of a data-path to an exact set of requirements. This is an iterative process that requires constant feedback.

AccelDSP gives the user the option to generate fixed-point ‘C’, fixed-point MATLAB, or perform hardware co-simulation of a MATLAB algorithm. Fixed-point ‘C’ is the default language generated for fixed-point verification because it is significantly faster than running fixed-point MATLAB. When algorithms become sophisticated, and vector sets grow very large, hardware co-simulation becomes a viable option and can increase simulation performance by up to 1000x. Hardware co-simulation establishes a simulation connection between a MATLAB system model and a design running on an actual FPGA that is part of a supported hardware platform such as the “XtremeDSP Starter Platform – Spartan-3A DSP Edition” or the “XtremeDSP Development Platform – Virtex-5 SXT Edition.” The table below shows a comparison of simulations performed using just MATLAB or Simulink vs. the improved simulation run times when HW co-simulation is employed.

Simulation Time (sec)

Design             Non-accelerated   Accelerated (through HW co-sim)   Increase
Beamformer                     113                              2.5         45X
OFDM BER Test                  742                              .75        989X
DUC CFR                        731                               23         32X
Color Space Conv               277                                4         69X
Video Scaler                 10422                               92        113X

Conclusion
The abstract nature of MATLAB combined with the powerful and easy-to-use hardware design explorations of AccelDSP provide an efficient environment from which to perform algorithm / hardware architecture co-design. This methodology enables an algorithm developer, not familiar with FPGA design techniques, to quickly assess the feasibility of the algorithms on an FPGA and to refine the algorithmic methods to match the available hardware resources. Enabling this insight early in the design process helps refine the overall system to produce an overall superior design.

For more information, please visit www.xilinx.com/dsp and www.xilinx.com/acceldsp