
A Pipelined Fast 2D-DCT Accelerator for FPGA-based SoCs

Antonino Tumeo, Matteo Monchiero, Gianluca Palermo, Fabrizio Ferrandi, Donatella Sciuto
Politecnico di Milano, Dipartimento di Elettronica e Informazione
Via Ponzio 34/5
20133 Milano, Italy
E-mails: {tumeo, monchier, gpalermo, ferrandi, sciuto}@elet.polimi.it

IEEE Computer Society Annual Symposium on VLSI (ISVLSI'07)
0-7695-2896-1/07 $20.00 © 2007

Abstract

Multimedia applications, and in particular the encoding and decoding of standard image and video formats, are a typical target for Systems-on-Chip (SoC). The bi-dimensional Discrete Cosine Transform (2D-DCT) is a commonly used frequency transformation in graphic compression algorithms. Many hardware implementations, adopting disparate algorithms, have been proposed for Field Programmable Gate Arrays (FPGA). These designs focus either on performance or on area, and often do not succeed in balancing the two aspects.

In this paper, we present the design of a fast 2D-DCT hardware accelerator for an FPGA-based SoC. This accelerator makes use of a single seven-stage 1D-DCT pipeline able to alternate computation for the even and odd coefficients in every cycle. In addition, it uses special memories to perform the transpose operations. Our hardware takes 80 clock cycles at 107 MHz to generate a complete 8x8 2D-DCT, from the writing of the first input sample to the reading of the last result (including the overhead of the interface logic). We show that this architecture provides an optimal performance/area ratio with respect to several alternative designs.

1 Introduction

Reconfigurable platforms have recently emerged as an important alternative to ASIC design, featuring significant flexibility and time-to-market improvements with respect to the conventional digital design flow [1]. In this context, several toolchains for the design and prototyping of Systems-on-Chip (SoC) have been presented [2, 3]. These tools make it possible to rapidly create systems composed of hard and soft core processors and a set of standard IP cores to interface with internal and external peripherals. In addition, the system can be tailored to the target application by including ad hoc coprocessors to accelerate the critical kernels.

This paper presents a novel hardware architecture for a fast 2D Discrete Cosine Transform accelerator. The basic idea is to exploit the symmetries of the algorithm to save area while still ensuring high performance. The architecture is targeted to work as a hardware accelerator for the Xilinx MicroBlaze soft core processor, and builds on the specifications of the connection with the processor to further optimize its operations.

This design is a component of a complete HW/SW implementation of the JPEG encoding algorithm. The 2D-DCT is one of the most computationally intensive phases of the encoding process, and its acceleration noticeably reduces the overall execution time of the application.

The structure of this paper is the following. Section 2 discusses related work. The 2D-DCT and the Fast DCT algorithm are briefly discussed in Section 3. The proposed architecture is described in Section 4. Results are discussed in Section 5. Finally, Section 6 concludes the paper.

2 Related Work

Several works proposing the architecture and high-level design of 2D-DCT cores have appeared. Xilinx [4] and Altera [5] offer, in their libraries, specific cores optimized for their programmable devices in terms of occupation. Nevertheless, they feature relatively low performance and, furthermore, they are not easy to integrate in System-on-Chip designs realized with their own toolchains.

Many custom designs for FPGA have also been presented. Among them, Trainor et al. [6] propose an architecture with distributed arithmetic that exploits parallelism and pipelining.

Agostini et al. [7] propose a 2D-DCT architecture based on the previous work of Kovac et al. [8]. The authors decompose the transform into two 1D-DCT calculations with a transpose buffer, thanks to the separability property. This design is based on the Fast DCT algorithm. It uses a six-stage Wallace tree multiplier that decomposes the multiplication into shift and add operations. Nevertheless, since multipliers are nowadays embedded in FPGAs, this approach is no longer effective in reducing occupation. The 2D-DCT global latency is 160 clock cycles and a complete 8x8 matrix is processed in 64 clock cycles. Our proposal is loosely inspired by this work. Nevertheless, we propose several optimizations that achieve important advantages in terms of area and performance. In addition, Agostini's design is conceived for a fully HW implementation of the JPEG encoder. On the other hand, our work targets a mixed HW/SW design, stressing the role of the interfaces to/from the processor.

Yusof et al. [9] present a similar DCT architecture, integrated in a complex SoC targeted at image encoding. Finally, Bukhari et al. [10] present an architecture that implements a modified Loeffler algorithm (resulting in a faster but significantly larger implementation w.r.t. our proposal). In addition, the authors show how the occupation of the accelerators can greatly vary when implemented on FPGAs from different vendors.

3 2D-DCT Overview

The DCT is a frequency transformation commonly adopted for compression algorithms, which concentrates most of the information in a few low-frequency coefficients. Slightly different definitions of the transform exist. Nevertheless, the bi-dimensional version, in its most commonly used form for an 8x8 block of input samples, is shown in Figure 1. This equation has a high computational complexity. For instance, an 8x8 block requires 4096 multiplications and 4096 additions. Many optimizations have been proposed and, among them, in the field of image compression algorithms, the Fast DCT has been widely adopted.

\[
F(u,v) = \frac{\Lambda(u)\,\Lambda(v)}{4} \sum_{i=0}^{7} \sum_{j=0}^{7} \cos\left[\frac{(2i+1)u\pi}{16}\right] \cos\left[\frac{(2j+1)v\pi}{16}\right] f(i,j) \qquad (1)
\]

\[
\Lambda(k) = \begin{cases} \dfrac{1}{\sqrt{2}} & \text{if } k = 0 \\ 1 & \text{else} \end{cases} \qquad (2)
\]

Figure 1. Equations for the 2D-DCT

According to the Fast DCT algorithm, since the cosines depend only on the position of the samples in the 8x8 block, their values can be precomputed and the transform can be rewritten as a matrix multiplication, where the last matrix is the transpose of the first: T = C x C', where C is the matrix of the values of the cosines, C' is its transpose and x is the 8x8 block of input samples.

In addition, since the 2D-DCT is a separable operation, it can be computed by applying a 1D-DCT in one dimension (row-wise) and then by applying another 1D-DCT to the results in the other dimension (column-wise). This decomposition reduces the complexity of the calculation by a factor of four.

Applying both the 1D decomposition and the Fast DCT algorithm, only 80 multiplications and 464 additions are needed to compute the 2D-DCT of an 8x8 block, where each of the 16 1D-DCTs on a vector of 8 elements requires 29 sums or subtractions and 5 multiplications. It is important to stress that the result of the Fast DCT algorithm is scaled; in the JPEG algorithm, for example, the scaling gets corrected in the quantization phase, where the correction can be performed in one step with the quantization itself.
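The separability argument of Section 3 can be checked numerically. The following is a minimal sketch, not the authors' implementation: it evaluates Equation (1) literally and compares it with the row-column form T = C x C', using a plain matrix-vector 1D-DCT rather than the Fast DCT. The function names and the test block are hypothetical and chosen only for illustration. The operation counts at the end restate the figures given in the text (counting one multiplication per term of the sum).

```python
import math

N = 8  # block size used throughout the paper


def lam(k):
    """Normalisation factor Lambda(k) of Equation (2)."""
    return 1.0 / math.sqrt(2.0) if k == 0 else 1.0


def dct2d_direct(f):
    """Evaluate Equation (1) literally: 64 outputs, 64 terms each."""
    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            acc = 0.0
            for i in range(N):
                for j in range(N):
                    acc += (math.cos((2 * i + 1) * u * math.pi / 16)
                            * math.cos((2 * j + 1) * v * math.pi / 16)
                            * f[i][j])
            out[u][v] = lam(u) * lam(v) / 4.0 * acc
    return out


# Precomputed cosine matrix: C[u][i] = Lambda(u)/2 * cos((2i+1)u*pi/16).
C = [[lam(u) / 2.0 * math.cos((2 * i + 1) * u * math.pi / 16)
      for i in range(N)] for u in range(N)]


def dct1d(vec):
    """Plain matrix-vector 1D-DCT (not the Fast DCT): 64 multiplications."""
    return [sum(C[u][i] * vec[i] for i in range(N)) for u in range(N)]


def transpose(m):
    return [list(col) for col in zip(*m)]


def dct2d_separable(f):
    """First 1D pass, transpose, second 1D pass, transpose back."""
    first = [dct1d(row) for row in f]
    second = [dct1d(row) for row in transpose(first)]
    return transpose(second)


def max_abs_diff(a, b):
    return max(abs(a[u][v] - b[u][v]) for u in range(N) for v in range(N))


# Hypothetical test block (not from the paper): a simple ramp of samples.
block = [[(i * N + j) - 32 for j in range(N)] for i in range(N)]

direct_mults = (N * N) * (N * N)   # 64 outputs x 64 terms
separable_mults = 2 * N * (N * N)  # 16 plain 1D-DCTs, 64 multiplications each
fast_mults = 2 * N * 5             # 16 Fast 1D-DCTs, 5 multiplications each
fast_adds = 2 * N * 29             # 16 Fast 1D-DCTs, 29 sums/subtractions each
```

Calling `max_abs_diff(dct2d_direct(block), dct2d_separable(block))` should give a value at the level of floating-point rounding, and `direct_mults // separable_mults` reproduces the factor of four quoted above.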

4 Architecture

The decomposition into two 1D computations leads to an architecture composed of two 1D pipelines and an intermediate buffer for the transposition, as proposed in [7]. Nevertheless, this solution is not area efficient, since each 1D pipeline performs exactly the same operations.

In addition, to allow the use of a global 2D-DCT pipeline, a special transpose buffer must be designed, since the first DCT produces row results and the second DCT needs column values as input. This memory should have ping pong (1) features, so that the first 1D architecture can write new values while the second 1D architecture reads the ones written before. This leads to even more space occupation on the FPGA.

In particular, if latency is critical, these memories cannot be implemented with internal BRAMs and must be implemented as registers, which take a lot of logic cells. The solution proposed in [7], which uses BRAMs, takes a latency of 64 cycles to generate a full transposed matrix. BRAMs can also become a limiting factor, in particular if the 2D-DCT architecture needs to be integrated in a System-on-Chip with soft core processors, which need the BRAMs as fast data and instruction memories.

Our architecture has been designed considering that the resulting accelerator should be connected to a soft core processor, the MicroBlaze [11] from Xilinx. Our DCT module should be part of a complete System-on-Chip to perform image encoding. The MicroBlaze, thanks to the Fast Simplex Links (FSL) [12], makes it possible to connect application-specific hardware accelerators using a point-to-point communication protocol via master and slave ports. Each communication primitive can transmit 32 bits from the register file of the MicroBlaze to the accelerator and vice versa. Since the values of the input samples in image compression are constrained to a range covered by 8 bits, a single FSL command can transmit up to 4 values per cycle. The next section provides more details on the implementation of the architecture.

(1) We call 'ping pong' memory a memory interposed between two blocks (A and B) that can alternately be written by A and read by B, or be written by B and read by A.

4.1 Implementation

We decided to implement an architecture that uses a single 1D-DCT pipeline, fed by a master FSL port, and a transpose memory that, as soon as the first mono-dimensional transformation has been completed, feeds back the transposed results to the same pipeline. By removing the option of a global 2D-DCT pipeline (as in [7]), we could implement this memory as a simple memory that is written in rows and read from its columns. Then, the second 1D-DCT is performed, and the final results are stored in a secondary buffer before being transposed again and output to the slave FSL. Figure 2 shows an overview of the architecture.

Figure 2. The 2D-DCT architecture with a single 1D-DCT component

As explained before, a single pipeline would require the execution of 29 sums/subtractions and 5 multiplications. Observing that the odd and even coefficients of the resulting 8-sample transformed vector require different types of computations, we organized the pipeline in seven stages. In this way, we reduced the number of adders/subtractors to 19 and the number of multipliers to 4. This means that the pipeline alternates the needed values, each cycle, to compute the odd and the even coefficients of the resulting vector.

The organization of our seven-stage pipeline is shown in Figure 3. The FSL connection can feed four 8-bit values per cycle, and all the input samples (8 values) are needed for both the odd and the even output samples. For these reasons, we implemented a pseudo-ping pong buffer (now at the input) partitioned in two parts of four values, in order to maintain the same values for two consecutive clock cycles.

Figure 3. The seven stages of the 1D-DCT pipeline, with 19 adders/subtractors and 4 multipliers. Notice that the latches between stages are not drawn, in order to show how the different functional units are connected

It is also important to stress that the DCT extends the range of the output values. Thus, the initial 8-bit values become, at the end of a 1D-DCT, values that are valid on 16 bits. However, in order not to lose precision when doing the multiple passes of a 2D-DCT, it is important to represent the intermediate results between the first and the second 1D-DCT in a fixed point format with at least 24 bits (8 bits for the fractional part).

Our 1D-DCT pipeline accounts for this. Each computation is performed at 24-bit precision, and the transpose memory can store 24-bit values. The final results buffer stores, instead, only the integer part of the numbers in 16-bit format. Therefore, effectively, the output rate of the complete 2D-DCT is two 16-bit values per clock cycle.

4.2 Interfaces

The input logic starts receiving data from the processor master port, and feeds the ping pong buffer and the pipeline as soon as the first group of four samples is available. The output logic waits until the full 8x8 block has completed the two 1D phases and the result has been stored in the memory. Then, it starts sending the results to the processor, grouped as two 16-bit values each. The MicroBlaze, which after sending the input samples is waiting to receive data (MicroBlaze blocking read), finally starts reading the results.

Starting from the loading of the first group of four input samples, to the reading of the last group of two results, the IP core takes 80 cycles. 48 cycles are used to manage the interfaces and the ping pong buffer, while 32 cycles are used for effective computation.

5 Evaluation

In Table 1, we show the occupation of our 2D-DCT accelerator on a Xilinx XC2VP30 device. With Xilinx ISE 8.2 our IP core is synthesized at 107 MHz.

Resource            Used    Available    Utilization
Slices              2823    13696        20%
Slice Flip Flops    3431    27392        12%
4-input LUTs        2618    27392        9%

Table 1. Resource utilization of the Optimized Fast 2D-DCT hardware accelerator on the Xilinx XC2VP30 FPGA

Compared to the Xilinx [4] solution, our core has an occupation around 2.5 times higher, but the Xilinx IP core does not include input and output logic for a standard bus, and it is much slower, since it has an initial latency of 92 cycles and then produces just one sample every cycle. This is due to the fact that it is realized by combining 8 FIR filters to produce a single sample. Also, the area values refer to a standard, non-customized core, and so they are relative to an 8-bit input and 9-bit output range, clearly not ready for JPEG encoding.
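The headline numbers of this evaluation are simple ratios, so they can be re-derived from figures the paper itself reports: the 80-cycle block latency at 107 MHz, and the DCT and total cycle counts listed in Table 2. The sketch below only restates that arithmetic; the variable names are hypothetical, and no value is used that does not appear in the paper.

```python
# Figures quoted in the paper (Sections 4.2 and 5, Table 2).
CLOCK_MHZ = 107            # synthesis frequency of the IP core
TOTAL_CYCLES = 80          # first input word written to last result read
INTERFACE_CYCLES = 48      # interfaces and ping pong buffer
COMPUTE_CYCLES = 32        # effective computation

DCT_CYCLES_SW = 585_084_357       # DCT phase, full software (Table 2)
DCT_CYCLES_HW = 4_227_699         # DCT phase, with the accelerator (Table 2)
TOTAL_CYCLES_SW = 3_112_057_809   # whole JPEG encoding, full software
TOTAL_CYCLES_HW = 2_535_664_559   # whole JPEG encoding, HW/SW

# Time needed for one 8x8 block: cycles divided by cycles per microsecond.
block_latency_us = TOTAL_CYCLES / CLOCK_MHZ

# Speed-up of the DCT phase alone and of the whole application.
dct_speedup = DCT_CYCLES_SW / DCT_CYCLES_HW
global_speedup = TOTAL_CYCLES_SW / TOTAL_CYCLES_HW

# Share of the software-only run spent in the DCT phase.
dct_share_sw = DCT_CYCLES_SW / TOTAL_CYCLES_SW

print(f"block latency : {block_latency_us:.3f} us")
print(f"DCT speed-up  : {dct_speedup:.1f}")
print(f"global speed-up: {global_speedup:.2f}")
print(f"DCT share (SW): {dct_share_sw:.1%}")
```

The two speed-ups round to the 138.4 and 1.2 quoted in the text, and the DCT share of the software-only run is just under 19%, which matches the "almost 20%" statement.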
Compared to Agostini's [7] architecture, which uses full Fast 1D-DCT components, our solution uses fewer multipliers and adders/subtractors while adding just a single pipeline stage (seven stages compared to six). In addition, they adopt a solution with two 1D-DCT elements, while our IP core has one that gets reused. They try to use less area by implementing the multiplications with a Wallace tree, but since new FPGAs have embedded multipliers this is no longer an interesting solution. In addition, this can lead to more occupation. Moreover, each stage of the pipeline needs eight clock cycles to be completed, so the initial latency is 48 cycles for a single 1D-DCT. The transpose memory requires 64 more cycles to complete the transpose operation, which leads to a global latency of 160 cycles. After filling the pipeline, however, one 8x8 block comes out every 64 cycles.

Finally, the IP core of Bukhari [10] uses fewer adders/subtractors but many more multipliers (11) than our solution for a single 1D-DCT element, due to the adoption of the Loeffler algorithm. A single 1D-DCT is computed for 8 input samples in a single clock cycle, so the full 2D-DCT needs 16 cycles to be completed. However, the complexity of each stage of the core does not allow more than 54 MHz in synthesis, and the area occupied, without the logic to interface to a standard processor bus, is already higher.

Figure 4 shows the area/delay scatter plot for the four solutions, normalized with respect to the standard Xilinx IP core. It can be seen that the Xilinx solution, our Optimized Fast 2D-DCT architecture, and Bukhari's solution are Pareto-optimal, lying on the same constant area/delay curve. Nevertheless, our proposal balances area and delay well, unlike the Xilinx and Bukhari solutions. Agostini's architecture, which uses an organization similar to ours, features larger delay and area. Our work effectively optimizes this architecture for both area and delay.

Figure 4. Area/Delay comparison of the Four IP Cores

Table 2 reports the results obtained by executing the full JPEG encoding algorithm (including the reading of the input file and the saving of the output) on two different architectures for a 160x120 pixel image. The first solution executes the encoding completely in software, and it is easy to see that the DCT calculation, performed with a Fast DCT software implementation, accounts for almost 20% of the application. The second architecture uses instead our Optimized 2D-DCT core to execute the transform.

Phase                Full SW          HW/SW
File reading         133,375,241      137,566,414
RGB to YUV           1,575,687,380    1,586,965,423
Exp & Downsample     2,013,185        2,013,435
Set quant. table     74,711           98,242
DCT                  585,084,357      4,227,699
Quantization         354,084,692      339,500,870
Entropic coding      461,738,243      465,292,474
Total                3,112,057,809    2,535,664,559

Table 2. Comparison, in clock cycles, of the JPEG algorithm executed on a MicroBlaze architecture with and without the Optimized Fast 2D-DCT hardware accelerator

The numbers show that the 2D-DCT hardware accelerator is two orders of magnitude faster than the software implementation, giving a speed up of 138.4. It is also interesting to note that, with the MicroBlaze architecture and the JPEG implementation adopted, the DCT phase is the second most computationally intensive phase of the algorithm. Since this work focuses only on the implementation of the 2D-DCT hardware accelerator, we did not optimize the RGB to YUV phase. The inclusion of the IP core nullifies the weight of the DCT phase in the application, giving a global speed up of 1.2.

6 Conclusions

In this paper we presented a novel architecture for the Fast 2D-DCT algorithm. The proposed solution is optimized from the area/performance point of view. It uses the symmetries of the algorithm to minimize the number of functional units. Furthermore, the core has been designed to act as an Application Specific IP for the MicroBlaze soft core processor and, taking into account the features and the limitations of its communication system, the architecture has been optimized even further. Our Fast 2D-DCT hardware accelerator adopts a single 1D-DCT element with a seven-stage pipeline that encompasses 19 adders/subtractors and 4 multipliers. Compared to other designs in the literature, it satisfies the requirements of low occupation without sacrificing performance. When introduced in a complete System-on-Chip architecture, it executes two orders of magnitude faster than a software implementation. Overall, it can make the execution of the full JPEG encoding algorithm 20% faster on a standard MicroBlaze system with reduced impact on occupation.

References

[1] Frank Vahid. The softening of hardware. Computer, 36(4):27–34, 2003.

[2] Altera System-on-a-Programmable-Chip (SOPC) Builder. Altera Corporation.

[3] Xilinx Embedded Developer Kit (EDK). Xilinx Corporation.

[4] Xilinx XAPP610, Video compression using DCT, application note. Xilinx Corporation, available at http://www.xilinx.com.

[5] Altera MegaCore Digital Library. Altera Corporation.

[6] D.W. Trainor, J.P. Heron, and R.F. Woods. Implementation of the 2D DCT using a Xilinx XC6264 FPGA. In Signal Processing Systems, 1997. SIPS 97 - Design and Implementation, 1997 IEEE Workshop on, pages 541–550, Leicester, UK, November 1997.

[7] L.V. Agostini, I.S. Silva, and S. Bampi. Pipelined fast 2D DCT architecture for JPEG image compression. In Integrated Circuits and Systems Design, 2001, 14th Symposium on, pages 226–231, Pirenopolis, Brazil, 2001.

[8] M. Kovac and N. Ranganathan. JAGUAR: a fully pipelined VLSI architecture for JPEG image compression standard. Proceedings of the IEEE, 83(2):247–258, February 1995.

[9] Z.M. Yusof, Z. Aspar, and I. Suleiman. Field programmable gate array (FPGA) based baseline JPEG decoder. In TENCON 2000. Proceedings, volume 3, pages 218–220, Kuala Lumpur, Malaysia, 2000.

[10] K.Z. Bukhari, G.K. Kuzmanov, and S. Vassiliadis. DCT and IDCT implementations on different FPGA technologies. In Proceedings of ProRISC 2002, pages 232–235, November 2002.

[11] MicroBlaze Processor Reference Guide. Xilinx Corporation.

[12] Fast Simplex Link (FSL) Bus (v2.00a). Reference Guide. Xilinx Corporation.
