Antonino Tumeo, Matteo Monchiero, Gianluca Palermo, Fabrizio Ferrandi, Donatella Sciuto
Politecnico di Milano, Dipartimento di Elettronica e Informazione
Via Ponzio 34/5
20133 Milano, Italy
E-mails: {tumeo, monchier, gpalermo, ferrandi, sciuto}@elet.polimi.it
\[
\Lambda(k) =
\begin{cases}
\dfrac{1}{\sqrt{2}} & \text{if } k = 0 \\
1 & \text{else}
\end{cases}
\tag{2}
\]
Xilinx [4] and Altera [5] offer, in their libraries, specific cores optimized for their programmable devices in terms of occupation. Nevertheless, these cores feature relatively low performance and, furthermore, they are not easy to integrate into System-on-Chip designs realized with the vendors' own toolchains.

Many custom designs for FPGA have also been presented. Among them, Trainor et al. [6] propose an architecture with distributed arithmetic that exploits parallelism and pipelining.

Agostini et al. [7] propose a 2D-DCT architecture based on the previous work of Kovac et al. [8]. The authors decompose the transform into two 1D-DCT calculations with a transpose buffer, thanks to the separability property. This design is based on the Fast DCT algorithm. It uses a six-stage Wallace tree multiplier that decomposes the multiplication into shift-and-add operations. Nevertheless, since multipliers are nowadays embedded in FPGAs, this approach is no longer effective at reducing occupation. The global latency of the 2D-DCT is 160 clock cycles, and a complete 8x8 matrix is processed in 64 clock cycles. Our proposal is loosely inspired by this work. Nevertheless, we propose several optimizations that achieve important advantages in terms of area and performance. In addition, Agostini's design is conceived for a fully HW implementation of the JPEG encoder. Our work, on the other hand, targets a mixed HW/SW design, stressing the role of the interfaces to/from the processor.

Yusof et al. [9] present a similar DCT architecture, integrated in a complex SoC targeted at image encoding. Finally, Bukhari et al. [10] present an architecture that implements a modified Loeffler algorithm (resulting in a faster but significantly larger implementation w.r.t. our proposal). In addition, the authors show how the occupation of the accelerators can vary greatly when implemented on FPGAs from different vendors.

3 2D-DCT Overview

The DCT is a frequency transformation commonly adopted in compression algorithms, which concentrates most of the information in a few low-frequency coefficients. Slightly different definitions of the transform exist. Nevertheless, the bi-dimensional version, in its most commonly used form for an 8x8 block of input samples, is shown in Figure 1. This equation has a high computational complexity. For instance, an 8x8 block requires 4096 multiplications and 4096 additions. Many optimizations have been proposed and, among them, the Fast DCT has been widely adopted in the field of image compression algorithms.

According to the Fast DCT algorithm, since the cosines depend only on the position of the samples in the 8x8 block, their values can be precomputed and the transform can be rewritten as a matrix multiplication in which the last matrix is the transpose of the first: T = C x C', where x is the 8x8 block of input samples, C is the matrix of the cosine values, and C' is its transpose.

In addition, since the 2D-DCT is a separable operation, it can be computed by applying a 1D-DCT in one dimension (row-wise) and then applying another 1D-DCT to the results in the other dimension (column-wise). This decomposition reduces the complexity of the calculation by a factor of four.

Applying both the 1D decomposition and the Fast DCT algorithm, only 80 multiplications and 464 additions are needed to compute the 2D-DCT of an 8x8 block, since each 1D-DCT on a vector of 8 elements requires 29 sums or subtractions and 5 multiplications, and 16 such 1D-DCTs are performed (8 on the rows and 8 on the columns). It is important to stress that the result of the Fast DCT algorithm is scaled. In the JPEG algorithm, for example, the scaling is corrected in the quantization phase, where the correction can be performed in one step with the quantization itself.
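The separability argument above can be checked numerically with a short sketch. Since Figure 1 is not reproduced here, the code assumes the standard JPEG-style 8x8 definition, with the normalization factor Lambda(k) of equation (2) folded into a precomputed cosine matrix C; the function names are illustrative, and the sketch uses the plain matrix form rather than the scaled Fast DCT used by the accelerator.

```python
import math

N = 8  # block size used throughout the paper


def lam(k):
    """Normalization factor Lambda(k) from equation (2)."""
    return 1.0 / math.sqrt(2.0) if k == 0 else 1.0


# Precomputed cosine matrix (assumed standard form):
# C[u][i] = 1/2 * Lambda(u) * cos((2i + 1) * u * pi / 16)
C = [[0.5 * lam(u) * math.cos((2 * i + 1) * u * math.pi / (2 * N))
      for i in range(N)] for u in range(N)]


def dct2d_direct(x):
    """Direct 2D-DCT: each of the 64 outputs sums over all 64 input samples."""
    return [[sum(C[u][i] * C[v][j] * x[i][j]
                 for i in range(N) for j in range(N))
             for v in range(N)] for u in range(N)]


def dct1d(vec):
    """1D-DCT of an 8-element vector as a matrix-vector product with C."""
    return [sum(C[u][i] * vec[i] for i in range(N)) for u in range(N)]


def transpose(m):
    """Transpose a square matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def dct2d_separable(x):
    """Row-column 2D-DCT: 8 row-wise 1D-DCTs, then 8 column-wise 1D-DCTs."""
    rows = [dct1d(r) for r in x]                 # first pass, row-wise
    cols = [dct1d(c) for c in transpose(rows)]   # second pass, column-wise
    return transpose(cols)                       # back to [u][v] ordering
```

Both functions compute the same coefficients, but the direct form touches 64 x 64 = 4096 sample terms, while the row-column form performs 16 matrix-vector products of 64 products each, that is 1024, which is the factor of four mentioned in the text. A constant block of ones yields a DC coefficient of 8 with all other coefficients numerically zero.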
are valid on 16 bits. However, in order not to lose precision when performing the multiple passes of a 2D-DCT, it is important to represent the intermediate results between the first and the second 1D-DCT in a fixed-point format with at least 24 bits (8 bits for the fractional part).

Our 1D-DCT pipeline accounts for this. Each computation is performed at 24-bit precision, and the transpose memory stores 24-bit values. The final results buffer, instead, saves only the integer part of the numbers in 16-bit format. Therefore, effectively, the output rate of the complete 2D-DCT is two 16-bit values per clock cycle.

4.2 Interfaces

The input logic starts receiving data from the processor master port and feeds the ping-pong buffer and the pipeline as soon as the first group of four samples is available. The output logic waits until the full 8x8 block has completed the two 1D phases and the result has been stored in the memory. Then, it starts sending results to the processor, grouped as two 16-bit values each. The MicroBlaze, which has been waiting for a block to receive (MicroBlaze block read) since sending the input samples, finally starts reading the results.

Starting from the loading of the first group of four input samples to the reading of the last group of two results, the IP core takes 80 cycles: 48 cycles are used to manage the interfaces and the ping-pong buffer, while 32 cycles are used for effective computation.

5 Evaluation

In Table 1, we show the occupation of our 2D-DCT accelerator on a Xilinx XC2VP30 device. With Xilinx ISE 8.2, our IP core is synthesized at 107 MHz.

Resource          Used    Available   Utilization
Slices            2823    13696       20%
Slice Flip Flop   3431    27392       12%
4 input LUTs      2618    27392       9%

Table 1. Resource utilization of the Optimized Fast 2D-DCT hardware accelerator on the Xilinx XC2VP30 FPGA

Compared to the Xilinx [4] solution, our core has an occupation around 2.5 times higher, but the Xilinx IP core does not include input and output logic for a standard bus, and it is much slower, since it has an initial latency of 92 cycles and then produces just one
Table 2. Comparison, in clock cycles, of the JPEG algorithm executed on a MicroBlaze architecture
with and without the Optimized Fast 2D-DCT hardware accelerator
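The precision argument of the 1D-DCT pipeline description (24-bit intermediate values with 8 fractional bits between the two 1D passes) can be illustrated with a minimal sketch. It is not the accelerator's datapath: arithmetic is done in floating point with the plain cosine-matrix DCT, and only the storage format of the values held between the two passes is modelled. The sample block and all function names are hypothetical.

```python
import math

N = 8
FRAC_BITS = 8    # fractional bits of the intermediate format (from the text)
WORD_BITS = 24   # total width of the intermediate format (from the text)


def lam(k):
    """Normalization factor Lambda(k) from equation (2)."""
    return 1.0 / math.sqrt(2.0) if k == 0 else 1.0


# Assumed standard cosine matrix, not the scaled Fast DCT of the accelerator.
C = [[0.5 * lam(u) * math.cos((2 * i + 1) * u * math.pi / (2 * N))
      for i in range(N)] for u in range(N)]


def dct1d(vec):
    """1D-DCT of an 8-element vector as a matrix-vector product with C."""
    return [sum(C[u][i] * vec[i] for i in range(N)) for u in range(N)]


def transpose(m):
    """Transpose a square matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def quantize(value, frac_bits):
    """Round to signed fixed point with `frac_bits` fractional bits,
    check that the raw value fits in WORD_BITS bits, return it as a float."""
    raw = round(value * (1 << frac_bits))
    assert -(1 << (WORD_BITS - 1)) <= raw < (1 << (WORD_BITS - 1))
    return raw / (1 << frac_bits)


def dct2d(x, frac_bits=None):
    """Row-column 2D-DCT; if frac_bits is given, the values stored between
    the two passes are rounded to that many fractional bits."""
    rows = [dct1d(r) for r in x]
    if frac_bits is not None:
        rows = [[quantize(v, frac_bits) for v in r] for r in rows]
    cols = [dct1d(c) for c in transpose(rows)]
    return transpose(cols)


def max_error(a, b):
    """Largest absolute difference between two 8x8 coefficient blocks."""
    return max(abs(a[u][v] - b[u][v]) for u in range(N) for v in range(N))


# Hypothetical level-shifted sample block with values in [-128, 127].
block = [[((i * N + j) * 37) % 256 - 128 for j in range(N)] for i in range(N)]

reference = dct2d(block)                                   # no rounding
err_integer = max_error(reference, dct2d(block, frac_bits=0))
err_q8 = max_error(reference, dct2d(block, frac_bits=FRAC_BITS))
```

Comparing `err_integer` with `err_q8` shows how much of the final error comes from dropping the fractional part between the two passes: with 8 fractional bits, each stored value is off by at most 1/512, whereas integer storage can be off by up to 1/2 before the second pass even starts.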
optimized. Our Fast 2D-DCT hardware accelerator adopts a single 1D-DCT element with a seven-stage pipeline that encompasses 19 adders/subtractors and 4 multipliers. Compared to other designs in the literature, it satisfies the requirement of low occupation without sacrificing performance. When introduced in a complete System-on-Chip architecture, it executes two orders of magnitude faster than a software implementation. Overall, it can make the execution of the full JPEG encoding algorithm 20% faster on a standard MicroBlaze system, with reduced impact on occupation.

Workshop on, pages 541–550, Leicester, UK, November 1997.

[7] L.V. Agostini, I.S. Silva, and S. Bampi. Pipelined fast 2D DCT architecture for JPEG image compression. In Integrated Circuits and Systems Design, 2001, 14th Symposium on, pages 226–231, Pirenopolis, Brazil, 2001.

[8] M. Kovac and N. Ranganathan. JAGUAR: a fully pipelined VLSI architecture for JPEG image compression standard. Proceedings of the IEEE, 83(2):247–258, February 1995.