
J. Parallel Distrib. Comput.

109 (2017) 178–192

Contents lists available at ScienceDirect



journal homepage: www.elsevier.com/locate/jpdc

Self-reconfigurable architectures for HEVC Forward and Inverse Transform
Daniel Llamocca
Electrical and Computer Engineering Department, Oakland University, Rochester, MI, USA

Highlights

- The self-reconfigurable system can trade off resources and performance.
- The parameterized hardware code allows for large design space exploration.
- A different coding efficiency requirement can trigger reconfiguration.
- A case is made for run-time reconfigurable technology for video encoders and decoders.

Article info

Article history:
Received 30 June 2016
Received in revised form 26 April 2017
Accepted 24 May 2017
Available online 15 June 2017

Keywords:
HEVC transform
Reconfigurable hardware
Embedded systems
Run-time partial reconfiguration

Abstract

This work introduces a run-time reconfigurable system for HEVC Forward and Inverse Transforms that can adapt to time-varying requirements on resources, throughput, and video coding efficiency. Three scalable designs are presented: fully parallel, semi parallel, and iterative. Performance scalability is achieved by combining folded/unfolded 1D Transform architectures and one/two transposition buffers. Resource usage is optimized by utilizing both the recursive even–odd decomposition and distributed arithmetic techniques. The architecture design supports video sequences in the 8K Ultra High Definition format (7680 × 4320) with up to 70 frames per second when using 64 × 64 Coding Tree Blocks with variable transform sizes. The self-reconfigurable embedded system is implemented and tested on a Xilinx® Zynq-7000 All-Programmable System-on-Chip (SoC). Results are presented in terms of performance (frames per second), resource utilization, and run-time hardware adaptation for a variety of hardware design parameters, video resolutions, and self-reconfigurability scenarios. The presented system illustrates the advantages of run-time reconfiguration technology on PSoCs or FPGAs for video compression.

© 2017 Elsevier Inc. All rights reserved.

E-mail address: llamocca@oakland.edu.

http://dx.doi.org/10.1016/j.jpdc.2017.05.017
0743-7315/© 2017 Elsevier Inc. All rights reserved.

1. Introduction

The video compression standard known as High Efficiency Video Coding (HEVC) is the latest video compression standard. It enhances video compression capability at the expense of computational complexity [16]. Video compression standards have found their way into a wide variety of products increasingly prevalent in our daily lives. The increasing diversity of services, the emergence of beyond high-definition video formats, and the increasing network traffic caused by video applications have given rise to the HEVC standard. Hardware implementations of the HEVC encoder are highly desired due to their high performance and reduced energy consumption when compared to software realizations [12]. Of particular interest are run-time reconfigurable systems that allow hardware components to be altered while the rest of the system is still operating [5]. This allows video communication systems to meet real-time constraints: video can be adaptively compressed at different bitrates so as to allow communication within the available bandwidth; also, if energy constraints are tight (e.g. mobile devices), energy consumption can be reduced at the expense of reduced performance.

This work focuses on run-time hardware adaptation of the HEVC Forward and Inverse Transforms. There has been extensive work on dedicated hardware designs. Most of them use two 1D Transforms (or a single shared 1D transform) and one or two transposition buffers: multiply-and-accumulate architectures for Forward [11] and Inverse Transforms [10], multiplier-less Inverse Transform with separate row and column 1D Transforms [9], multiplier-less shared (row/column) 1D Inverse Transform designs where transposition is processed using registers [3,17] or single-port SRAMs [13,14], and resource-intensive architectures that generate 32 output pixels per cycle irrespective of transform size [8]. Unified (Forward/Inverse) architectures are introduced in [2] and an FPGA implementation is presented in [4].

We present multiplier-less architectures for the HEVC Forward and Inverse N × N Transforms that can trade off resources for throughput. Resource usage is optimized by the combined use of recursive even–odd decomposition and distributed arithmetic [18], which is allowed by the constant HEVC matrix coefficients [8]. The presented N × N HEVC Transform units allow for smaller transforms to be computed by the same hardware. The architecture styles (for both Forward and Inverse Transforms) are: (i) fully parallel: input columns are fed every clock cycle, even between blocks; this requires the most resources, (ii) semi parallel: input columns are fed every clock cycle within a block, but need to wait between blocks, (iii) iterative: input columns within a block and between blocks need to wait a number of cycles; this requires the least resources.

Fig. 1. HEVC encoder. Input Image X: B bits per sample. Residual image U: B + 1 bits per sample. $S_{T1} = 2^{-(B+M-9)}$, $S_{T2} = 2^{-(M+6)}$, $S_{IT1} = 2^{-6}$, $S_{IT2} = 2^{-(21-B)}$.

Fig. 2. 1D Transform core with constant coefficients D(N). TP: fully pipelined or iterative. The circuit computes the product of D(N) (or $D^T(N)$) by a column of U, resulting in an N-pixel column of Z.
To demonstrate run-time adaptability, a self-reconfigurable embedded system is presented where we can alter (at run-time) the design parameters of the N × N HEVC Forward and Inverse Transforms. Moreover, the HEVC coding efficiency can be altered by varying the quadtree structure and/or size of a Coding Tree Block (CTB) [16]; if the largest block size varies, then we need to modify (at run-time) the size of the Transform hardware. Thus, these scalable architectures adapt to time-varying resource, throughput, and coding efficiency requirements. Hardware is validated by embedded implementation and testing of real-life scenarios with high-definition frame sizes and different CTB quadtree structures. The main contributions of the present work can be summarized as follows:

- Performance-scalable multiplier-less architectures: We provide the fastest hardware realizations based on the available resources.
- Fully customizable architectures validated on an All-Programmable System-on-Chip (SoC): The fully parameterized VHDL-coded hardware allows for exploration of trade-offs among resources, performance, and design parameters. The code, not tied to any particular hardware, can be implemented on any existing technology (e.g. PSoC, FPGA, ASIC).
- Self-reconfigurable embedded implementations: The run-time reconfigurable software/hardware system can swap hardware configurations to adapt to resource, throughput, and coding efficiency requirements.

The rest of the paper is organized as follows: Section 2 details the HEVC Transform algorithm. Section 3 presents the three architecture designs. Section 4 provides embedded implementation details. Section 5 presents results in terms of resources, performance, and run-time management. Conclusions are provided in Section 6.

2. HEVC Transform and Scaling

Fig. 1 depicts an HEVC encoder with its main components. The input image has B bits per sample while the residual image has B + 1 bits per sample [2]. The Forward Transform output requires NO = 16 bits while the Inverse Transform output requires B + 1 bits. The Forward/Inverse HEVC Transforms are 2D DCT-like transforms made of two 1D Transforms (column and row) and Scaling Units. In HEVC, the coefficients are approximations of the scaled DCT coefficients and are represented with 8 integer bits. The 4 × 4 Transform can also use an alternative 4 × 4 DST-based Transform [16]. We refer to the equations in [8] and the scaling factor in [2] as the reference algorithm in the remainder of this paper.

In HEVC, the size N of an N × N block can be 4, 8, 16, or 32. In the HEVC Forward Transform, the residual image block U is processed block-wise by $Y = D\,U\,D^T$ (this includes scaling): the column transform applies $D\,U$, then scaling (bit pruning) to guarantee 16-bit output samples ($S_{T1} = 2^{-(B+M-9)}$). After that, the row transform applies $(D\,U)\,D^T$, then scaling to ensure 16-bit output samples ($S_{T2} = 2^{-(M+6)}$). In the HEVC Inverse Transform, the dequantized block $Y_q$ is processed by $U_q = D^T\,Y_q\,D$ (including scaling): the column transform applies $D^T\,Y_q$, then scaling to guarantee 16-bit output samples ($S_{IT1} = 2^{-6}$). After that, the row transform applies $(D^T\,Y_q)\,D$, then scaling to guarantee output samples of B + 1 bits ($S_{IT2} = 2^{-(21-B)}$).

Table 1 specifies how to process data: if we feed an input block ($U$, $Y_q$) we get a transposed output block ($Y^T$, $U_q^T$). If we feed a transposed input block ($U^T$, $Y_q^T$), we get an output block ($Y$, $U_q$).
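To make the data flow of the Forward Transform concrete, the following Python sketch models one 4 × 4 block with B = 8: a column transform, bit pruning by $S_{T1}$, a row transform, and bit pruning by $S_{T2}$. It is a behavioral illustration only, not the VHDL core. It assumes M = log2(N), uses the standard HEVC 4 × 4 core coefficients for D, and models bit pruning as an arithmetic right shift; all function names are hypothetical.

```python
# Behavioral sketch of the HEVC Forward Transform for one 4 x 4 block.
# Assumptions (not taken from the paper's VHDL): M = log2(N), and each
# scaling stage is modeled as an arithmetic right shift (bit pruning).

D4 = [
    [64, 64, 64, 64],
    [83, 36, -36, -83],
    [64, -64, -64, 64],
    [36, -83, 83, -36],
]  # standard HEVC 4 x 4 core transform coefficients


def matmul(a, b):
    """Integer matrix product a * b for matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(row) for row in zip(*a)]


def prune(a, shift):
    """Scaling stage: drop `shift` LSBs of every sample."""
    return [[v >> shift for v in row] for row in a]


def forward_transform(u, d=D4, bits=8):
    """Return Y = D U D^T with the S_T1 and S_T2 scaling stages applied."""
    n = len(d)
    m = n.bit_length() - 1                          # M = log2(N)
    z = prune(matmul(d, u), bits + m - 9)           # column transform, S_T1
    return prune(matmul(z, transpose(d)), m + 6)    # row transform, S_T2


if __name__ == "__main__":
    flat = [[1] * 4 for _ in range(4)]   # constant residual block
    for row in forward_transform(flat):
        print(row)
```

For a constant residual block, only the top-left (DC) output sample is non-zero, which is a quick sanity check of both the coefficients and the shift amounts.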

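Table 1
HEVC Forward/Inverse Transform: Processing column and row 1D transforms.

2D Transform | In | Column transform | Row transform | Out
Forward, $Y = D U D^T$ | $U$ | $Z = D U$ | $D Z^T = D U^T D^T$ | $Y^T$
Forward, $Y = D U D^T$ | $U^T$ | $Z = D U^T$ | $D Z^T = D U D^T$ | $Y$
Inverse, $U_q = D^T Y_q D$ | $Y_q$ | $Z = D^T Y_q$ | $D^T Z^T = D^T Y_q^T D$ | $U_q^T$
Inverse, $U_q = D^T Y_q D$ | $Y_q^T$ | $Z = D^T Y_q^T$ | $D^T Z^T = D^T Y_q D$ | $U_q$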

Fig. 3. (a) Fully parallel 4 × 4 1D Transform. (b) Architecture of the fully pipelined, distributed arithmetic (DA) N × N Inner Product. Here, L is the LUT input size and it is set to 4; this requires N to be a multiple of 4. It includes a Distributed Arithmetic rearrangement unit, filter blocks (see [7] for details) and an adder tree. NH, the number of bits per coefficient, is 8 according to HEVC.
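The distributed arithmetic (DA) inner product of Fig. 3(b) can be illustrated in software. The sketch below is a minimal model, assuming L = 4 coefficients per LUT and BI-bit two's complement inputs: each LUT stores all partial sums of its four constant coefficients, one bit plane of the inputs addresses the LUTs, and the bit-plane results are weighted by powers of two. The function names are hypothetical, and the loop structure is not meant to mirror the pipelining of the actual hardware.

```python
# Minimal software model of a distributed-arithmetic (DA) inner product.
# Assumptions: L = 4 inputs per LUT, BI-bit two's complement samples.
# This is an illustration, not the paper's VHDL core.

L = 4  # LUT input size


def build_lut(coeffs):
    """LUT for one group of constant coefficients: entry `a` holds the sum
    of the coefficients whose corresponding address bit is set in `a`."""
    return [sum(c for k, c in enumerate(coeffs) if (a >> k) & 1)
            for a in range(1 << len(coeffs))]


def da_inner_product(coeffs, samples, bi):
    """Compute sum(coeffs[k] * samples[k]) without multiplying by samples."""
    assert len(coeffs) == len(samples) and len(coeffs) % L == 0
    luts = [build_lut(coeffs[g:g + L]) for g in range(0, len(coeffs), L)]
    acc = 0
    for b in range(bi):                       # one bit plane per step
        plane = 0
        for g, lut in enumerate(luts):
            addr = 0
            for k in range(L):
                addr |= ((samples[g * L + k] >> b) & 1) << k
            plane += lut[addr]                # sum across the N/L groups
        # The sign bit of a two's complement number weighs -2^(BI-1).
        weight = -(1 << b) if b == bi - 1 else (1 << b)
        acc += weight * plane
    return acc


if __name__ == "__main__":
    row = [83, 36, -36, -83]          # one row of the HEVC 4 x 4 matrix
    column = [5, -7, 100, -128]       # 9-bit residual samples (B = 8)
    print(da_inner_product(row, column, bi=9))
```

Because the coefficients are constants, the LUT contents are fixed at design time, which is what makes a multiplier-less realization possible.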

The 1D Forward Transform (column and row) can then be defined as a matrix product $D_N X$ and the 1D Inverse Transform (column and row) as a matrix product $D_N^T X$.

The matrix $D_N$ can be recursively decomposed: first into $D_{N/2}$ and $M_{N/2}$, then $D_{N/2}$ is decomposed into $D_{N/4}$ and $M_{N/4}$, and so on [8]. The 1D Forward column and row Transforms then require proper permutations of the input and output sequences as well as N pre-additions/pre-subtractions. Likewise, the matrix $D_N^T$ can be recursively decomposed. The 1D Inverse column and row Transforms require proper permutations of the input and output sequences as well as N post-additions/post-subtractions [4]. This method, known as recursive even–odd decomposition, optimizes resource usage [8].

3. HEVC Transform: Hardware implementation

We describe the block-wise hardware implementation of the shaded blocks of Fig. 1. We start by detailing the 1D Transform core that implements the column/row transforms (both Forward and Inverse cases); this unit implements a generic matrix product $Z = D\,U$. We then explain the HEVC 2D Transform hardware: as we need $Z^T$, a transposition stage is required between column and row transforms. The Scaling stages amount to bit pruning.

We have coded the architecture in VHDL derived from the reference algorithm mentioned in Section 2. Our architectures are designed to process N pixels (a column or a row) at a time. If we feed U column-wise, we are implementing $Z = D\,U$ for the column transform, and the output is the transposed matrix $Y^T$, i.e., we get Y row-wise. If we feed U row-wise, we are implementing $Z = D\,U^T$ for the column transform, and the output is the matrix Y, i.e., we get Y column-wise. For clarity of presentation, in our hardware descriptions, the input block is fed column-wise to both the Forward and Inverse Transforms.

3.1. Hardware implementation of the 1D DCT-like transform

We compute the matrix multiplication $D\,U$, where D is an N × N constant matrix, and U an input block. The coefficients D(N) and $D^T(N)$ allow for recursive even–odd decomposition and distributed arithmetic [18] to optimize resource usage [6]. The circuit generates one output column at a time, with U provided column-wise. Two versions are provided: fully pipelined (unfolded) and iterative (folded).

Fig. 2 shows the parameterized VHDL-coded core with its parameters and I/O ports. The design parameters are N (transform size), BI (bits per input sample), NY (bits per output sample), FWD/INV (Forward or Inverse), coefficients (provided as a text file; Forward: D(N), Inverse: $D^T(N)$), and TP (version: fully pipelined, iterative). The case N = 4 is not decomposed.

The column/row transform (Forward and Inverse) units are generated by proper parameter selection. For the HEVC Forward case, the column transform requires BI = B + 1, NY = B + M + 7, D(N), while the row transform requires BI = NO = 16, NY = 15 + M + 7, D(N). For the HEVC Inverse case, both the column and row transforms require BI = NO = 16, NY = 22, $D^T(N)$.

(1) Fully pipelined implementation: Fig. 3(a) depicts the 4 × 4 Transform comprised of four inner products (each of 4 inputs by 4 constant coefficients). Fig. 3(b) depicts a fully pipelined inner product (N inputs by N coefficients) implemented with distributed arithmetic (DA) whose latency is $\lceil \log_2 BI \rceil + \log_2(N/L) + 1$ cycles (N multiple of L, the LUT input size [7]). The 32 × 32 recursive implementations of the Forward and Inverse Transforms are depicted in Figs. 4 and 5 respectively (16 × 16, 8 × 8, and 4 × 4 units are also shown). To avoid large combinational delays, registers are inserted after the adder/subtractors; the extra register delays are balanced by the latency difference between inner products of different size. Both the Forward and Inverse Transform architectures

Fig. 4. Fully Pipelined 32 × 32 1D Forward Transform. The 16 × 16, 8 × 8, and 4 × 4 Transforms are depicted as well, as they are components of the 32 × 32 Transform. The I/O delay (latency) is indicated for every Inner Product (DA) hardware. Inner products with N/2 inputs have one more cycle of delay than those with N/4 inputs. This is balanced by the extra register levels added on the inputs.
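The nesting of the smaller transforms inside the larger one follows from the recursive even–odd decomposition described in Section 2. As a minimal illustration for the Forward case, the Python sketch below uses the standard HEVC 8 × 8 coefficients: the even-indexed outputs come from the 4 × 4 matrix $D_4$ applied to the pre-additions, and the odd-indexed outputs come from $M_4$ applied to the pre-subtractions. Extracting $D_{N/2}$ and $M_{N/2}$ from the left half of $D_N$ relies on the symmetry of its even rows and the antisymmetry of its odd rows; function names are hypothetical, and the N = 4 case is computed directly, as in the text.

```python
# Sketch of the recursive even-odd decomposition for the 1D Forward Transform.
# Illustration only: it models the arithmetic, not the pipelined hardware.

D8 = [
    [64, 64, 64, 64, 64, 64, 64, 64],
    [89, 75, 50, 18, -18, -50, -75, -89],
    [83, 36, -36, -83, -83, -36, 36, 83],
    [75, -18, -89, -50, 50, 89, 18, -75],
    [64, -64, -64, 64, 64, -64, -64, 64],
    [50, -89, 18, 75, -75, -18, 89, -50],
    [36, -83, 83, -36, -36, 83, -83, 36],
    [18, -50, 75, -89, 89, -75, 50, -18],
]  # standard HEVC 8 x 8 core transform coefficients


def matvec(m, x):
    """Direct matrix-vector product (one inner product per output)."""
    return [sum(c * v for c, v in zip(row, x)) for row in m]


def even_odd_forward(d, x):
    """Compute d * x using N pre-additions/pre-subtractions per level."""
    n = len(x)
    if n == 4:                       # the case N = 4 is not decomposed
        return matvec(d, x)
    h = n // 2
    a = [x[i] + x[n - 1 - i] for i in range(h)]    # pre-additions
    b = [x[i] - x[n - 1 - i] for i in range(h)]    # pre-subtractions
    d_half = [d[2 * k][:h] for k in range(h)]      # D_{N/2}: even rows, left half
    m_half = [d[2 * k + 1][:h] for k in range(h)]  # M_{N/2}: odd rows, left half
    out = [0] * n
    out[0::2] = even_odd_forward(d_half, a)        # even-indexed outputs
    out[1::2] = matvec(m_half, b)                  # odd-indexed outputs
    return out


if __name__ == "__main__":
    column = [10, -3, 7, 0, 255, -256, 1, 42]
    print(even_odd_forward(D8, column) == matvec(D8, column))
```

For N = 8 the direct product needs 64 constant multiplications, whereas the decomposed form needs 16 for $D_4$ and 16 for $M_4$, which is the source of the resource savings.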

Fig. 5. Fully Pipelined 32 × 32 1D Inverse Transform. The 16 × 16, 8 × 8, and 4 × 4 Transforms are depicted as well, as they are components of the 32 × 32 Transform. The I/O delay (latency) is indicated for every Inner Product (DA) hardware. Inner products with N/2 inputs have one more cycle of delay than those with N/4 inputs. This is balanced by the extra register levels added on the outputs.

can process an N-pixel column per cycle, and their latencies are given by (L = 4):

$RLP = \lceil \log_2(BI + \log_2(N/L)) \rceil + 2 + \log_2(N/L)$   (1)

$IRLP = \lceil \log_2(BI) \rceil + \log_2(N/L) + 2$.   (2)

(2) Iterative implementation: The 4 × 4 Transform hardware is depicted in Fig. 6(a): it has a DA rearrangement unit (see Fig. 3(b)), a vector shift register, 4 iterative inner products, and an FSM. Fig. 6(b) depicts the N × N iterative inner product, whose latency is $\lceil \log_2(N/L) \rceil + 2$ cycles. Fig. 7 depicts the 32 × 32 Forward Transform

Fig. 6. (a) Iterative 4 × 4 Transform (N = 4) architecture. The DA rearrangement unit converts the N BI-bit vectors into BI N-bit vectors, which are shifted out one by one by the vector shift register. (b) Architecture details of the Iterative Inner Product (N inputs by N coefficients) that uses Distributed Arithmetic.

recursive architecture. Unlike the Fully Pipelined case, the blocks after the DA rearrangement units have identical latencies (smaller inner product delays are balanced by extra input groups to the vector shift registers). So, in addition to the registers placed after the adder/subtractors, synchronizing registers are needed for data to arrive simultaneously on all DA rearrangement units. Thus, an N × N Forward Transform can process an input column every $BI + \log_2(N/L)$ cycles (a non-decomposed transform processes an input column every BI cycles at the expense of excessive hardware overhead). The Inverse Transform is similar to Fig. 5 without input registers and with Iterative Inner Products (along with DA rearrangement units and vector shift registers). Here, the number of input groups to all vector shift registers is the same; the latency difference of inner products of different size is balanced by the registers placed after the adder/subtractors on the outputs. Thus, the Inverse Transform can process an input column every BI cycles. The latencies (RLI: Forward, IRLI: Inverse) are given by:

$RLI = BI + 2 \log_2(N/L) + 2$   (3)

$IRLI = BI + \log_2(N/L) + 2$.   (4)

3.2. Computing smaller 1D transforms

In HEVC, the block sizes inside a Coding Tree Block (CTB) vary continuously. By multiplexing the inputs (or outputs for the Inverse Transform) we can use only the hardware portion that computes the smaller transform and keep the latencies given in Eqs. (1)–(4). This is problematic when switching sizes, as it requires in some cases that we delay feeding data. Instead, our solution is to control the synchronous clear (an(2..0) in Figs. 4, 5 and 7) where we feed zeros to guarantee proper results when we require smaller transforms than N.

The 32 × 32 Transform cores of Figs. 4 and 7 can process any of the 4 × 4, 8 × 8, 16 × 16, and 32 × 32 Transforms via the size signal. The VHDL-coded core allows for different N (size of maximum allowed transform); for example, the case 16 × 16 is useful in applications that only require the 4 × 4, 8 × 8, and 16 × 16 transforms. The latency of an N × N transform core does not vary when smaller transforms are processed, as data needs to propagate through the extra register levels. Thus, switching block sizes (N1 × N1 to N2 × N2, where N1, N2 ≤ N) does not incur extra delay cycles.

To summarize, a fully pipelined 1D Transform can process a column every cycle, while the Iterative Forward Transform can process a column every $BI + \log_2(N_1/L)$ cycles (or every BI cycles for the Inverse core). This is also true between columns of different blocks (which might not be of the same size).

3.3. HEVC Transform (2D) implementation

Fig. 8 shows the HEVC Forward and Inverse Transform (2D) VHDL-coded core with its I/O ports and parameters: B, N, D(N) (including the smaller matrices: D(N/2), M(N/2), ...), IMP (implementation style: fully parallel, semi parallel, iterative), FWD/INV (Forward or Inverse). The design parameters allow us to finely control execution time and resource usage. The HEVC Transform needs row and column 1D Transforms, Scaling, and a transposition stage. The alternative DST-based transform can be used for the 4 × 4 unit [2]. The input is provided column-wise and the outputs are generated row-wise ($Y^T$, $U_q^T$).

The Forward Transform uses D(N), $S_{T1}$, $S_{T2}$, BID = B + 1 bits per input sample and NOD = 16 bits per output sample. The Inverse Transform (shown in purple in Fig. 7) uses $D^T(N)$, $S_{IT1}$, $S_{IT2}$, BII = NOD = 16 bits per input sample, and NOI = B + 1 bits per output sample. Scaling ($S_{T1}$, $S_{T2}$, $S_{IT1}$) amounts to bit pruning of the LSBs to fit into 16 bits, while $S_{IT2}$ scaling amounts to bit pruning of the LSBs to fit into B + 1 bits.

We need a transposition stage between the column and row transforms. Some works use a shared column/row transform with only one transpose buffer [3,14], while others use separate column and row transforms with one transpose buffer [8,9] or two transpose buffers [6,11] in a ping-pong mode. As an example of how we can handle variable-sized blocks, note that if we want to transpose a 4 × 4 matrix in a 32 × 32 buffer, we just load and transpose the upper left corner. Fig. 9 depicts a transposition of a 2 × 2 matrix in a 4 × 4 buffer: we only use the upper left corner and perform the operation in 2 cycles.

Three architecture designs are provided based on the 1D Transform implementation and the number of transpose buffers:

(1) Fully parallel: Both the row and column 1D Transforms are Fully Pipelined. In addition, two transpose buffers are used in ping-pong mode [6]. This configuration allows us to continuously feed a column per cycle with no need to stall the pipeline between same-sized blocks. When switching blocks of different size (N1 × N1 to N2 × N2), we need to wait N1 − N2 extra cycles (only if N1 > N2) so that the buffer processing the N2 × N2 block avoids producing output data at the same time as the buffer processing the N1 × N1 block.

(2) Semi parallel: Both the row and column 1D Transforms are Fully Pipelined. Only one transpose buffer is used, which requires us to always wait N1 − 1 cycles between blocks. Within a block, we can input one column per cycle.

(3) Iterative: The row and column 1D Transforms are Iterative. Within a block we can feed a column every $BID + \lceil \log_2(N/L) \rceil$ cycles (Forward Transform) or BII cycles (Inverse Transform).

Fig. 7. Iterative Implementation of the 32 × 32 1D Forward Transform. The 16 × 16, 8 × 8, and 4 × 4 Transforms are depicted as well, as they are components of the 32 × 32 Transform (the enable chain and control logic for these transforms are not shown here).

One transpose buffer is used. Between blocks, we need to wait $(NOD + \lceil \log_2(N/L) \rceil)(N - 1) + 1$ cycles (Forward Transform) or $NOI\,(N - 1) + 1$ cycles (Inverse Transform).

The waiting cycles between blocks imposed by the transpose buffers are longer than those imposed by the 1D Transforms for N1, N2 = 4, 8, 16, 32, and thus they prevail. Table 2 summarizes the number of required cycles between input columns for both the HEVC Forward and Inverse Transforms.

Execution time is defined from the clock edge the first input column is captured to the first clock edge where the last output

Fig. 8. HEVC Forward and Inverse Transform. B = 8, NO = 16. Input: N N-pixel columns. Output: N N-pixel rows. The Scaling stages ($S_{T1}$, $S_{IT1}$, $S_{T2}$, $S_{IT2}$) perform bit pruning of the LSBs. Transposition buffers: we can select one or two.

Table 2
HEVC Transform. Cycles between input columns. N1: current block size, N2: next block size, BID = B + 1, NOD = 16, L = 4.

Implementation | Between blocks | Within a block
Fully parallel | $1 + (N_1 - N_2)$ if $N_1 > N_2$; 1 if $N_1 \le N_2$ | 1
Semi parallel | $N_1 - 1$ | 1
Iterative (Forward) | $(NOD + \lceil \log_2(N_1/L) \rceil)(N_1 - 1) + 1$ | $BID + \lceil \log_2(N_1/L) \rceil$
Iterative (Inverse) | $NOI\,(N_1 - 1) + 1$ | BII
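The entries of Table 2 can be evaluated with a few lines of Python. The sketch below is only a convenience for exploring the design space, assuming B = 8, L = 4, NOD = BII = 16, and NOI = B + 1 as stated in Section 3.3; the function name and the style strings are hypothetical.

```python
# Sketch: cycles between input columns, following Table 2.
# Assumed parameters: L = 4, NOD = BII = 16, BID = NOI = B + 1.

L = 4  # LUT input size


def column_gaps(style, n1, n2, forward=True, b=8):
    """Return (cycles between blocks, cycles within a block).

    n1 is the current block size and n2 the next block size."""
    bid, nod = b + 1, 16              # Forward: input / output sample bits
    bii, noi = 16, b + 1              # Inverse: input / output sample bits
    g = (n1 // L).bit_length() - 1    # log2(N1 / L) for N1 = 4, 8, 16, 32
    if style == "fully_parallel":
        between = 1 + (n1 - n2) if n1 > n2 else 1
        return between, 1
    if style == "semi_parallel":
        return n1 - 1, 1
    if style == "iterative":
        if forward:
            return (nod + g) * (n1 - 1) + 1, bid + g
        return noi * (n1 - 1) + 1, bii
    raise ValueError(f"unknown implementation style: {style}")


if __name__ == "__main__":
    for style in ("fully_parallel", "semi_parallel", "iterative"):
        print(style, column_gaps(style, 32, 8))
```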

Table 3
Execution time for P consecutive N × N blocks. $BIL = BID + \lceil \log_2(N/L) \rceil$, $NOL = NOD + \lceil \log_2(N/L) \rceil$, L = 4. $RLP1 = \lceil \log_2 BIL \rceil + 2 + \log_2(N/L)$, $RLP2 = \lceil \log_2 NOL \rceil + 2 + \log_2(N/L)$. $IRLP1 = IRLP2 = \lceil \log_2 NOI \rceil + \log_2(N/L) + 2$.

Transform | Implementation | Execution time (number of cycles)
Forward | Fully parallel | $PN + RLP1 + RLP2 + N - 1$
Forward | Semi parallel | $PN + (P - 1)(N - 1) + RLP1 + RLP2 + N - 1$
Forward | Iterative | $((BIL + NOL)(N - 1) + 1)\,P + RLI1 + RLI2$, with $RLI1 = BID + 2\log_2(N/L) + 2$, $RLI2 = NOD + 2\log_2(N/L) + 2$
Inverse | Fully parallel | $PN + IRLP1 + IRLP2 + N - 1$
Inverse | Semi parallel | $PN + (P - 1)(N - 1) + IRLP1 + IRLP2 + N - 1$
Inverse | Iterative | $(2\,NOI\,(N - 1) + 1)\,P + IRLI1 + IRLI2$, with $IRLI1 = IRLI2 = NOI + \log_2(N/L) + 2$
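As a worked illustration of the Forward rows of Table 3, the following Python sketch evaluates the execution time for P consecutive N × N blocks. It assumes B = 8, NOD = 16, L = 4, and that the logarithms of BIL and NOL are rounded up to the next integer (the latencies must be whole cycles); the function names are hypothetical.

```python
# Sketch: execution time (in cycles) of the HEVC Forward Transform for
# P consecutive N x N blocks, following the Forward rows of Table 3.
# Assumption: log2 terms with non-power-of-two arguments are rounded up.

L = 4  # LUT input size


def clog2(x):
    """Ceiling of log2(x) for integer x >= 1."""
    return (x - 1).bit_length()


def forward_execution_cycles(style, n, p, b=8, nod=16):
    """Cycles from the first captured input column to the last output column."""
    g = (n // L).bit_length() - 1          # log2(N / L) for N = 4, 8, 16, 32
    bid = b + 1
    bil, nol = bid + g, nod + g
    if style in ("fully_parallel", "semi_parallel"):
        rlp1 = clog2(bil) + 2 + g          # column 1D transform latency
        rlp2 = clog2(nol) + 2 + g          # row 1D transform latency
        cycles = p * n + rlp1 + rlp2 + n - 1
        if style == "semi_parallel":
            cycles += (p - 1) * (n - 1)    # waits imposed by the single buffer
        return cycles
    if style == "iterative":
        rli1 = bid + 2 * g + 2
        rli2 = nod + 2 * g + 2
        return ((bil + nol) * (n - 1) + 1) * p + rli1 + rli2
    raise ValueError(f"unknown implementation style: {style}")


if __name__ == "__main__":
    for style in ("fully_parallel", "semi_parallel", "iterative"):
        print(style, forward_execution_cycles(style, n=32, p=4))
```

Comparing the three styles for the same N and P shows the resource/performance trade-off directly: the iterative design pays roughly BIL + NOL cycles per column where the pipelined designs pay one.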

column appears. For P consecutive N × N blocks, we refer to Fig. 10 for the execution time of the Forward HEVC Transform (Fully Parallel, Semi Parallel, Iterative). RLP1, RLP2 refer to the latencies of the column and row Forward Fully Pipelined 1D Transform respectively. RLI1, RLI2 refer to the latencies of the column and row Forward Iterative 1D Transform respectively. Similarly, for the Inverse case we have: IRLP1, IRLP2, IRLI1, IRLI2. Table 3 summarizes these execution times for both HEVC Forward and Inverse Transforms.

3.4. HEVC transform in the encoder

The HEVC encoder, depicted in Fig. 1, includes the Forward Transform, the quantizer, and the reconstruction loop (inverse quantizer, Inverse Transform, Intra-Prediction, in-loop filtering, and Inter-Prediction blocks). The latencies of the Forward and Inverse Transform designs are specified in Table 3; we can select an optimal design based on the latency of the other components of the encoder. The quantizer and inverse quantizer operations can be readily pipelined [2]. For Intra-Prediction, we refer to the work in [12] that presents a high-performance, scalable hardware design along with comparisons with other approaches. Similarly, fast designs of in-loop filtering and Inter-Prediction are presented in [14]. The setup of Fig. 8 (input blocks are fed column-wise to both the Forward and Inverse Transform) is suitable for clarity of presentation. But when simultaneously using the Forward and Inverse Transforms in the encoder, this setup would require transposition stages between the Forward and Inverse Transform architectures (Y to $Y_q$, and U to $U_q$). To avoid these transposition stages and to reduce latency, the input block to the Forward Transform must be entered column-wise and the input block to the Inverse Transform must be entered row-wise (see Table 1).

4. Self-reconfigurable embedded implementation

4.1. Self-reconfigurable embedded system

The HEVC Forward/Inverse Transform hardware was integrated in the embedded system depicted in Fig. 11 and implemented on a Xilinx® Zynq-7000 All-Programmable SoC (ARM® + FPGA). This device has two main portions: the Programmable Logic (PL) and the Processing System (PS) [1]. A software routine inside the PS writes/reads data to/from the HEVC Transform hardware (located inside the PL) via an AXI4-Full Interface [1]. Run-time partial reconfiguration can be carried out via the Internal Configuration Access Port (ICAP), with the ICAP peripheral inside the PL. Alternatively, in Zynq-7000 devices, the Device Configuration Interface (DevC) inside the PS can reconfigure (at run-time) PL regions (known as Reconfigurable Partitions) via the Processor Configuration Access Port (PCAP) by writing partial bitstream files on it [15]. This is a more efficient method as it does not require PL resources and as the PS has a dedicated DMA controller to transfer partial bitstreams from memory to PCAP. Input data (input video, partial bitstreams)

Fig. 9. Transposition buffer 4 × 4. The shaded upper left corner can perform a transposition operation of a 2 × 2 matrix in 2 cycles.

Fig. 11. Self-reconfigurable embedded system. The Reconfigurable Partition (RP) includes the HEVC Transform core and the input/output interface to the FIFOs.

is stored on the SD card and then loaded into memory. Output data is written back to the SD card.

The HEVC Transform hardware is part of an embedded peripheral, called the AXI HEVC Transform, depicted in detail in Fig. 12: it consists of the HEVC Transform, an output buffer, input/output interfaces to FIFOs, and a 32-bit AXI4-Full Slave Interface logic (2 FIFOs and control). The output buffer is required since the HEVC Transform can output N pixels per cycle, which require more than 32 bits.

The input interface consists of registers that wait until an input column is ready before feeding it to the HEVC Transform. For the Forward Transform, we need to subtract samples coming from intra-prediction/inter-prediction blocks (see Fig. 1). For convenience, in our tests we set those inputs to zero (this is in fact part of the HEVC operation). For the Inverse Transform, no subtractors are needed. The output interface is a multiplexer that captures data from the output buffer and outputs 32 bits at a time.

The iFIFO and oFIFO isolate the S_AXI_ACLK and CLKFX clock regions. The FSM @ S_AXI_ACLK controls the AXI signals and the associated FIFOs' signals. The FSM @ CLKFX controls the glue logic between the HEVC Transform and the FIFOs as well as the associated FIFOs' signals. This configuration allows for fast data processing via the 32-bit AXI4-Full Interface. In our experiment, we keep CLKFX = S_AXI_ACLK, though CLKFX can be user-modifiable.

4.2. I/O considerations

The AXI4-Full Interface limits the throughput of the system: we can only read/write 32 bits per cycle from iFIFO/oFIFO, and we can only transmit (via DMA) up to 16 32-bit words per burst. The interface of Fig. 12 allows for fast data processing under these constraints, but we still fall short of taking full advantage of the Fully Parallel architecture. Here, the proposed self-reconfigurable embedded system serves as a generic testbed for hardware validation and run-time hardware adaptation. It also shows that the HEVC Transform cores can fit in medium-sized FPGAs or PSoCs. Approaches to optimize performance will require that the HEVC transform hardware directly reads/writes data from/to an external memory, where the AXI4-Full Interface is only used to load video frames onto the external memory.

4.3. Run-time considerations

If we want to modify the HEVC Transform parameters (size, implementation type, Forward/Inverse), we also need to modify

Fig. 10. HEVC Forward Transform: Processing cycles for P consecutive N × N blocks. (a) Fully parallel: a new block is processed once N pixels are captured. (b) Semi parallel: a new block is processed N − 1 cycles after loading N pixels. (c) Iterative: a new block can be processed once N pixels are loaded and once we have waited a certain delay between blocks. For the Inverse Transform, use IRLP1, IRLP2, IRLI1, IRLI2 instead of RLP1, RLP2, RLI1, RLI2. For the Iterative case (Inverse), use NOI instead of BIL and NOI instead of NOL.

Fig. 12. HEVC Forward and Inverse Transform AXI4-Full peripheral. When design parameters vary, the interface needs to accommodate to allow for seamless communication
with the AXI bus. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

the input/output interfaces, the output buffers, and the FSM @ CLKFX. As a result, we grouped all these components along with the HEVC Transform into a Reconfigurable Partition (RP) (shaded in light green in Fig. 12); this is the PL region that will be reconfigured at run time via the DevC Interface.

The RP outputs toggle during partial reconfiguration, and the registers inside the RP are not reset after partial reconfiguration [5]. Zynq-7000 devices allow us to set the "reset after reconfiguration" property for the RP to clear all registers, though this setting can impose stringent constraints on the RP shape. Moreover, the toggling of RP outputs can affect the FIFOs' behavior during partial reconfiguration; thus, we need to force a FIFO reset after partial reconfiguration. A better solution is to instead generate a PR_reset pulse via software (a special word written into a specific address) that resets the registers in the RP as well as the FIFOs. Fig. 13 shows the FSM @ S_AXI_ACLK: as soon as the PR_reset is issued, the system issues the reset for the RP and the FIFOs.
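The software-triggered reset can be summarized with a small behavioral model. The Python sketch below is an illustration, not the actual FSM: it takes the control word (0xAA995577), the address offset (11), and the 16-cycle pulse length from the caption of Fig. 13, and everything else (class name, method names, one call per clock cycle) is a hypothetical modeling choice.

```python
# Behavioral sketch of the PR_reset mechanism described for Fig. 13.
# Constants come from the figure caption; the class itself is hypothetical.

PR_RESET_WORD = 0xAA995577   # control word that triggers PR_reset
PR_RESET_OFFSET = 11         # offset from the peripheral base address
RESET_CYCLES = 16            # length of the rst pulse in clock cycles


class ResetFsm:
    """Each call to clock() models one S_AXI_ACLK cycle."""

    def __init__(self):
        self.remaining = 0   # cycles of rst still to be driven

    def write(self, offset, word):
        """Model an AXI write seen by the slave interface."""
        if offset == PR_RESET_OFFSET and word == PR_RESET_WORD:
            self.remaining = RESET_CYCLES    # PR_reset pulse issued

    def clock(self):
        """Advance one cycle; return True while rst is asserted."""
        rst = self.remaining > 0
        if rst:
            self.remaining -= 1
        return rst


if __name__ == "__main__":
    fsm = ResetFsm()
    fsm.write(PR_RESET_OFFSET, PR_RESET_WORD)   # done right after reconfiguration
    print(sum(fsm.clock() for _ in range(32)))  # number of cycles with rst high
```

In the embedded system, the corresponding software step would be a single write to the peripheral once the partial bitstream has been loaded, so that both the RP registers and the FIFOs start from a known state.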

5. Results and analysis

Here, we test both the Forward and Inverse HEVC Transform architectures for luma TBs, with B = 8 and NO = 16. By streaming video frames and retrieving data from the embedded system, the design has been verified against a software model based on the reference algorithm mentioned in Section 2. To report results on resources and execution time, we use block sizes N = 4, 8, 16, 32, and the iterative, semi parallel, and fully parallel implementations. The embedded system was implemented on the ZC706 Evaluation Kit, which houses an XC7Z045 Zynq-7000 All-Programmable SoC.

5.1. HEVC transform IP resource utilization

Table 4 reports resource usage for both Forward and Inverse Transforms for different sizes and implementation types. All circuits (except N = 4) use recursive even–odd decomposition (recall that smaller sizes are always supported down to 4 × 4).

Fig. 13. FSM @ S_AXI_ACLK. If the proper control word (0xAA995577) is received on base address with offset 11, a PR_reset pulse is issued. This in turn issues a 16-cycle reset pulse (rst) for the Reconfigurable Partition and the FIFO.

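Table 4
Forward/Inverse HEVC Transform Resource Utilization. Device: XC7Z045 SoC (54650 Slices: 218600 LUTs, 437200 FFs).

Transform | Type | N | LUTs | (%) | FFs | (%) | Slices | (%)
Forward | Iterative | 4 | 371 | 0.2 | 580 | 0.1 | 93 | 0.2
Forward | Iterative | 8 | 1244 | 0.6 | 1947 | 0.4 | 311 | 0.6
Forward | Iterative | 16 | 4181 | 1.9 | 6748 | 1.5 | 1046 | 1.9
Forward | Iterative | 32 | 14025 | 6.4 | 24037 | 5.5 | 3507 | 6.4
Forward | Semi parallel | 4 | 1077 | 0.5 | 1615 | 0.4 | 270 | 0.5
Forward | Semi parallel | 8 | 3506 | 1.6 | 4935 | 1.1 | 877 | 1.6
Forward | Semi parallel | 16 | 12737 | 5.8 | 17935 | 4.1 | 3185 | 5.8
Forward | Semi parallel | 32 | 48859 | 22.4 | 67884 | 15.5 | 12215 | 22.4
Forward | Fully parallel | 4 | 1372 | 0.6 | 1885 | 0.4 | 343 | 0.6
Forward | Fully parallel | 8 | 4846 | 2.2 | 5939 | 1.4 | 1212 | 2.2
Forward | Fully parallel | 16 | 18389 | 8.4 | 22037 | 5.0 | 4598 | 8.4
Forward | Fully parallel | 32 | 72432 | 33.1 | 84278 | 19.3 | 18108 | 33.1
Inverse | Iterative | 4 | 458 | 0.2 | 641 | 0.1 | 115 | 0.2
Inverse | Iterative | 8 | 1407 | 0.6 | 2012 | 0.5 | 352 | 0.6
Inverse | Iterative | 16 | 4475 | 2.0 | 6741 | 1.5 | 1119 | 2.0
Inverse | Iterative | 32 | 14490 | 6.6 | 23364 | 5.3 | 3623 | 6.6
Inverse | Semi parallel | 4 | 1466 | 0.7 | 2219 | 0.5 | 367 | 0.7
Inverse | Semi parallel | 8 | 4046 | 1.9 | 5890 | 1.3 | 1012 | 1.9
Inverse | Semi parallel | 16 | 13999 | 6.4 | 19924 | 4.6 | 3500 | 6.4
Inverse | Semi parallel | 32 | 51689 | 23.6 | 72963 | 16.7 | 12923 | 23.6
Inverse | Fully parallel | 4 | 1782 | 0.8 | 2485 | 0.6 | 446 | 0.8
Inverse | Fully parallel | 8 | 5418 | 2.5 | 6942 | 1.6 | 1355 | 2.5
Inverse | Fully parallel | 16 | 19651 | 9.0 | 24027 | 5.5 | 4913 | 9.0
Inverse | Fully parallel | 32 | 75262 | 34.4 | 89357 | 20.4 | 18816 | 34.4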

Fig. 14. CTB quadtree structures (64 × 64 and 32 × 32) used in the experimental setup.

5.2. Execution time

In HEVC, each frame (luma component) is partitioned into Coding Tree Blocks (CTBs). Each CTB is recursively partitioned into Coding Blocks (CBs) and then into Transform Blocks (TBs); this partitioning is known as a quadtree. We refer to the Block Size as the TB size. For reporting execution time, we consider two different approaches:

• Uniform TB size: CTB partitioning with same-sized TBs (N = 4, 8, 16, 32). This allows for a rapid assessment of the HEVC Transform speed.
• Quadtree: We consider the two CTB quadtrees shown in Fig. 14: (i) 64 × 64 CTB: it uses 4 × 4 to 32 × 32 TBs (N = TBmax = 32), and (ii) 32 × 32 CTB: it uses 4 × 4 to 16 × 16 TBs (N = TBmax = 16). Execution time thus depends on the CTB size and quadtree structure: variable TB sizes within a CTB and between CTBs introduce extra cycles.

Based on the Table 2 formulas (which apply to consecutive same-sized TBs), we computed execution cycles per frame for the uniform TB size and quadtree approaches (Forward and Inverse Transforms). These performance bounds assume we can feed and retrieve TB columns as fast as the hardware allows (realistic for the quadtree structure as there are extra cycles between TBs, especially for the Iterative case). We consider absolute times to be more useful. For example, the Iterative Hardware (Forward Transform) requires 3,878,081 cycles for a 1920 × 1080 frame partitioned in 64 × 64 CTBs (the quadtree is that of Fig. 14(a)). This amounts to 19.39 ms (or 51 fps) at 200 MHz.

Table 5 reports execution time per frame (ms), frame rate (fps), and throughput (average computed pixels per cycle) for the two approaches: Uniform TB size and quadtree (variable TB size). Results are reported for these frame sizes: 1920 × 1080 (Full HD), 3840 × 2160 (4K UHD), 4096 × 3072 (HXGA), and 7680 × 4320 (8K UHD). Note that the frame size has a negligible effect on the average computed pixels per cycle. We report results at 200 MHz. The maximum frequency in a Zynq-7000 device depends on the implementation: it ranges from 200 to 300 MHz.
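
The conversion from a cycle count to the absolute figures reported in this section is simple arithmetic, sketched below for the Iterative Forward Transform example. The function name `frame_metrics` is hypothetical; the cycle count, frame size, and 200 MHz clock are the values quoted in the text.

```python
def frame_metrics(cycles_per_frame, width, height, clock_hz=200e6):
    """Return (time per frame in ms, frames per second, average pixels per cycle)."""
    seconds = cycles_per_frame / clock_hz
    pixels_per_cycle = (width * height) / cycles_per_frame
    return seconds * 1e3, 1.0 / seconds, pixels_per_cycle


# Iterative hardware, Forward Transform, 1920 x 1080 frame, 64 x 64 CTBs.
ms, fps, ppc = frame_metrics(3_878_081, 1920, 1080)
print(f"{ms:.2f} ms per frame, {fps:.2f} fps, {ppc:.2f} pixels/cycle")
```

The result is about 19.39 ms per frame and slightly more than 51 fps, with an average throughput of roughly 0.53 pixels per cycle, in line with the corresponding Table 5 entry.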

Table 5
Forward/Inverse HEVC Transform Hardware Results for Luma TB: Execution Time (ms), Frame Rate (fps), pixels/cycle. These results are Performance Bounds. Uniform TB size (TB size fixed) approach: N = TB = 4, 8, 16, 32. Quadtree structure approach (Variable TB size, see Fig. 14): CTB 32 × 32: N = 16 (we can process the TB sizes 4, 8, 16); CTB 64 × 64: N = 32 (we can process the TB sizes 4, 8, 16, 32).

Transform  Type            TB size   N    1920 × 1080        3840 × 2160        4096 × 3072        7680 × 4320        Pixels/cycle
                                          ms       fps       ms       fps       ms       fps       ms       fps
Forward    Iterative       Fixed     4    49.24    20.3      197.00   5.0       298.84   3.3       787.96   1.2       0.21
Forward    Iterative       Fixed     8    30.78    32.4      123.12   8.1       186.77   5.3       492.48   2.0       0.33
Forward    Iterative       Fixed     16   17.78    56.2      70.63    14.1      107.15   9.3       282.52   3.5       0.58
Forward    Iterative       Fixed     32   9.81     101.9     39.24    25.4      59.10    16.9      155.84   6.4       1.06
Forward    Iterative       Variable  16   32.35    30.9      129.41   7.7       194.88   5.1       513.86   1.9       0.32
Forward    Iterative       Variable  32   19.39    51.5      77.56    12.9      116.79   8.5       310.24   3.2       0.53
Forward    Semi parallel   Fixed     4    4.53     220.4     18.14    55.1      27.52    36.3      72.57    13.7      2.28
Forward    Semi parallel   Fixed     8    2.43     411.5     9.72     102.8     14.74    67.8      38.88    25.7      4.26
Forward    Semi parallel   Fixed     16   1.26     790.5     5.02     199.1     7.61     131.2     20.08    49.7      8.26
Forward    Semi parallel   Fixed     32   0.64     1555.9    2.57     389.0     3.87     258.3     10.20    97.9      16.25
Forward    Semi parallel   Variable  16   2.10     475.9     8.40     118.9     12.65    79.0      33.37    30.0      4.97
Forward    Semi parallel   Variable  32   1.07     931.4     4.29     232.8     6.46     154.6     17.17    58.2      9.73
Forward    Fully parallel  Fixed     4    2.59     385.7     10.36    96.4      15.72    63.5      41.47    24.1      4.00
Forward    Fully parallel  Fixed     8    1.29     771.5     5.18     192.9     7.86     127.1     20.73    48.2      8.00
Forward    Fully parallel  Fixed     16   0.65     1531.5    2.59     385.7     3.93     254.3     10.36    96.4      16.00
Forward    Fully parallel  Fixed     32   0.32     3061.3    1.30     765.7     1.96     508.5     5.18     192.9     32.00
Forward    Fully parallel  Variable  16   1.71     583.5     6.85     145.9     10.32    96.8      27.21    36.7      6.09
Forward    Fully parallel  Variable  32   0.88     1126.7    3.55     281.7     5.34     187.0     14.19    70.4      11.77
Inverse    Iterative       Fixed     4    62.85    15.9      251.42   3.9       381.42   2.6       1005.7   1.0       0.16
Inverse    Iterative       Fixed     8    36.45    27.4      154.87   6.8       234.94   4.5       583.20   1.7       0.28
Inverse    Iterative       Fixed     16   19.62    50.9      87.64    12.3      132.95   8.4       311.68   3.2       0.53
Inverse    Iterative       Fixed     32   10.12    98.7      48.10    24.6      72.43    16.4      160.86   6.2       1.03
Inverse    Iterative       Variable  16   39.09    25.5      156.38   6.4       235.50   4.2       620.94   1.6       0.26
Inverse    Iterative       Variable  32   21.94    45.5      87.76    11.4      132.15   7.5       351.04   2.8       0.47
Inverse    Semi parallel   Fixed     4    4.53     220.4     18.14    55.1      27.52    36.3      72.57    13.7      2.28
Inverse    Semi parallel   Fixed     8    2.43     411.5     9.72     102.8     14.74    64.8      38.88    25.7      4.26
Inverse    Semi parallel   Fixed     16   1.26     790.5     5.02     199.1     7.61     131.2     20.08    49.7      8.25
Inverse    Semi parallel   Fixed     32   0.64     1555.9    2.57     389.0     3.87     258.3     10.20    97.9      16.25
Inverse    Semi parallel   Variable  16   2.10     475.9     8.40     118.9     12.65    79.0      33.37    30.0      4.97
Inverse    Semi parallel   Variable  32   1.07     931.4     4.29     232.8     6.46     154.6     17.17    58.2      9.73
Inverse    Fully parallel  Fixed     4    2.59     385.7     10.36    96.4      15.72    63.5      41.47    24.1      4.00
Inverse    Fully parallel  Fixed     8    1.29     771.5     5.18     192.9     7.86     127.1     20.73    48.2      8.00
Inverse    Fully parallel  Fixed     16   0.65     1531.5    2.59     385.7     3.93     254.3     10.36    96.4      16.00
Inverse    Fully parallel  Fixed     32   0.32     3061.3    1.30     765.7     1.96     508.5     5.18     192.9     32.00
Inverse    Fully parallel  Variable  16   1.71     583.5     6.85     145.9     10.32    96.8      27.21    36.7      6.09
Inverse    Fully parallel  Variable  32   0.88     1126.7    3.55     281.7     5.34     187.0     14.19    70.4      11.77

We observe that the frame rate decreases for smaller TB sizes. Also, with the same hardware, the approach with variable TB size features lower frame rates than the fixed TB size. In general, we note that the Iterative hardware has very low frame rates (below 30 fps) except for the 1920 × 1080 case; thus it is useful only for low frame rates and small frame sizes (1920 × 1080).

5.3. Dynamic management of resources and performance

The self-reconfigurable embedded system of Fig. 11 allows for run-time hardware reconfiguration to satisfy time-varying constraints based on input data, output data, and user input [5]. Dynamic management is implemented via a software routine in C inside the PS that swaps HEVC Transform cores in response to: (i) direct constraints on resources and performance, and (ii) different coding efficiency: varying the CTB size and quadtree can change the largest TB size, as when switching from the quadtree of Fig. 14(a) (TBmax = 32) to the one in Fig. 14(b) (TBmax = 16); here run-time reconfiguration is required.
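
The kind of decision such a routine has to make can be sketched as follows. This is not the author's C routine: it is a minimal Python illustration in which the function `select`, its parameters, and the objective names are hypothetical. The slice counts are those of Table 4 and the frame rates are those of Table 5 (Forward Transform, 1920 × 1080, variable TB size).

```python
# (implementation type, largest TB size) -> (slices, frames per second).
# Slices come from Table 4; frame rates come from Table 5 (Forward Transform,
# 1920 x 1080, variable TB size).
CONFIGS = {
    ("iterative", 16): (1046, 30.9),
    ("semi parallel", 16): (3185, 475.9),
    ("fully parallel", 16): (4598, 583.5),
    ("iterative", 32): (3507, 51.5),
    ("semi parallel", 32): (12215, 931.4),
    ("fully parallel", 32): (18108, 1126.7),
}


def select(tb_max, objective, max_slices=None, min_fps=None):
    """Pick an implementation type for the given largest TB size.

    objective is "min_slices" or "max_fps" (hypothetical names). Returns the
    implementation type, or None when no configuration meets the constraints.
    """
    feasible = [
        (kind, slices, fps)
        for (kind, n), (slices, fps) in CONFIGS.items()
        if n == tb_max
        and (max_slices is None or slices < max_slices)
        and (min_fps is None or fps >= min_fps)
    ]
    if not feasible:
        return None
    if objective == "min_slices":
        return min(feasible, key=lambda c: c[1])[0]
    if objective == "max_fps":
        return max(feasible, key=lambda c: c[2])[0]
    raise ValueError(f"unknown objective: {objective}")
```

For instance, `select(32, "min_slices", min_fps=60)` skips the Iterative core (51.5 fps) and returns the semi parallel one, while `select(32, "max_fps")` returns the fully parallel one.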
Current applications with large frame sizes feature quadtree structures with TB sizes up to 32 × 32 or 16 × 16 (the hardware can process smaller TBs). Cases with TB sizes up to 8 × 8 or 4 × 4 are rare (except for very small frames) and are not considered. With this in mind, we consider three run-time reconfiguration strategies (for the Forward and Inverse Transforms):

• Scenario 1 (32 × 32/16 × 16): For each implementation type (fully parallel, semi parallel, iterative), we allow for 32 × 32 and 16 × 16 block (TB) sizes: six configurations.

Fig. 15. Frame rate (frames per second) vs. Reconfiguration rate. Scenario 1 (Forward Transform), 64 × 64 CTB, 1920 × 1080 video resolution.

Fig. 16. Resources-performance design space. The frame rate considers the effect of reconfiguration time to be negligible (K > 100). (a) Forward Transform (Scenario 1): Dynamic management based on time-varying constraints on resources and frame rate (fps) as well as compression efficiency (CTB quadtree structure). Note how the system meets the constraints (1)-(5). (b) Inverse Transform (Scenario 2): Resources (Slices) vs. frame rate (fps) for various frame sizes.

Table 6
Memory Overhead for Forward and Inverse HEVC Transforms: Reconfigurable Partition (RP) size, bitstream size, and required memory.

Scenario (# of partial bitstreams)  Transform  RP size (Slices)  Bitstream size  Reconfig. time  Required memory
1 (6)                               Forward    200 × 98          4948 KB         38.65 ms        29688 KB
1 (6)                               Inverse    200 × 104         5251 KB         41.02 ms        31506 KB
2 (3)                               Forward    150 × 32          1212 KB         9.47 ms         3636 KB
2 (3)                               Inverse    150 × 35          1326 KB         10.35 ms        3978 KB
3 (2)                               Forward    120 × 32          970 KB          7.57 ms         1940 KB
3 (2)                               Inverse    120 × 34          1031 KB         8.05 ms         2062 KB
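
The entries of Table 6 can be cross-checked with a short calculation. The sketch below assumes that the reconfiguration time is the bitstream size divided by the 128 MB/s PCAP throughput quoted in this section (taking 1 MB = 1000 KB), that the required memory is the bitstream size times the number of partial bitstreams, and that reconfiguring once every K frames simply adds one reconfiguration time to the processing time of K frames. These are modeling assumptions made for illustration, and the function names are hypothetical.

```python
PCAP_KB_PER_MS = 128.0  # 128 MB/s expressed in KB per ms (assumes 1 MB = 1000 KB)


def reconfig_time_ms(bitstream_kb):
    """Time to load one partial bitstream through the PCAP."""
    return bitstream_kb / PCAP_KB_PER_MS


def required_memory_kb(bitstream_kb, num_bitstreams):
    """Memory needed to hold every partial bitstream of a scenario."""
    return bitstream_kb * num_bitstreams


def effective_fps(frame_ms, reconfig_ms, k):
    """Frame rate when the core is reconfigured once every k frames."""
    return 1000.0 * k / (k * frame_ms + reconfig_ms)


# Scenario 1, Forward Transform: 4948 KB bitstream, 6 partial bitstreams.
print(reconfig_time_ms(4948), required_memory_kb(4948, 6))
# Fully parallel core (0.88 ms per 1920 x 1080 frame), reconfigured every frame.
print(effective_fps(0.88, 38.65, 1))
```

Under these assumptions the first line gives about 38.66 ms and 29688 KB, and the second about 25.3 fps, close to the 25.28 fps reported in the text for reconfiguration at every frame (K = 1); the small difference suggests the measured figures include effects this simple model ignores.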

• Scenario 2 (only 16 × 16): For each implementation type (fully parallel, semi parallel, iterative), we only allow for the 16 × 16 block (TB) size: three hardware configurations. The Reconfigurable Partition (RP) size is reduced.
• Scenario 3 (Low Cost): We only allow for the iterative implementation for 32 × 32 and 16 × 16 block (TB) sizes: two configurations. This scenario has the smallest RP.

Table 6 reports the required memory (along with bitstream and RP sizes) for the three scenarios (for Forward and Inverse Transforms). While the embedded system was implemented in the XC7Z045 SoC (in the ZC706 Evaluation Kit), we note that Scenarios 2 and 3 also fit in the smaller XC7Z020 SoC (in the ZED Development Board).

The average reconfiguration speed using the PCAP is 128 MB/s; the reconfiguration time is indicated in Table 6. Depending on the reconfiguration rate, the overhead can be significant. Fig. 15 illustrates how the reconfiguration rate affects the frame rate (fps) for Scenario 1 (Forward Transform) with a 64 × 64 CTB and 1920 × 1080 frame size. We reconfigure every K frames (K = 1, 5, 10, 20, 50, 80; NR denotes no reconfiguration). We see that the higher the reconfiguration rate (lower K), the lower the frame rate. The Fully Parallel and Semi Parallel implementations are the most affected (the processing time of the Iterative case is comparable to the reconfiguration time). At the maximum reconfiguration rate (every frame, K = 1), we have 25.28, 25.17, and 17.22 fps for the Fully Parallel, Semi Parallel, and Iterative cases, respectively. Most applications do not require such high reconfiguration rates, so in general we say that the effect of the reconfiguration time overhead is negligible. Note that Scenario 1 was selected to illustrate the worst case since it has the largest Reconfigurable Partition; for Scenarios 2 and 3, the drop is less pronounced.

Fig. 16(a) shows the Resource (Zynq-7000 Slices) and Performance (fps) design space for Scenario 1 (Forward Transform) for the 1920 × 1080 frame size. Frame rates are extracted from Table 5 using the quadtrees of Fig. 14.

To demonstrate dynamic management on the system of Fig. 16(a), we consider a time-varying sequence of quadtree structure specifications and constraints on resources and performance:
(1) CTB 32 × 32: minimize resource usage.
(2) CTB 32 × 32: require fewer than 6000 slices subject to frame rate ≥ 120 fps.
(3) CTB 64 × 64: maximize frame rate.
(4) CTB 64 × 64: minimize resource usage.
(5) CTB 64 × 64: minimize resource usage subject to frame rate ≥ 60 fps.

Fig. 16(a) shows how the system dynamically selects hardware realizations (labeled 1-5) that meet the requirements. Fig. 16(b) depicts the resources-performance design space for Scenario 2 (Inverse Transform) for various frame sizes and for the 64 × 64 CTB. Note that the Iterative case is very slow for large frame sizes. Thus, real-time applications with large frame sizes should use the Semi Parallel and Fully Parallel implementations.

5.4. Comparisons with other implementations

Table 7 provides a comparison with related HEVC Transform architectures, where the proposed approach (3 types) uses a 32 × 32 Forward HEVC Transform with fixed size (TB = 32). Most designs are multiplier-less and take advantage of the specific HEVC coefficient values (which makes them harder to adapt to other values like those of the alternative 4 × 4 DST). Some approaches optimize transposition buffers by size reduction [11] or SRAM-based techniques [14]. The work in [8] presents 1D Transforms that generate 32 pixels per cycle regardless of transform size (at the expense of recursively duplicating the amount of resources); it also uses transposition buffers with gated clocks (not recommended for FPGA designs) that introduce a pipeline latency of N cycles. Our approach relies on Distributed
Table 7
Comparison of different approaches (32 × 32 HEVC Forward Transform). For proper comparisons, reconfiguration time overhead is not considered in the achieved frame rate, and resources do not include the embedded interface. In the proposed architecture, as well as in [11] and [17], the pixels per cycle are reported assuming a uniform TB size of 32 × 32.

Design                                1D Transform units      Transpose buffers  Technology                  Gate count              Pixels per cycle  Supported video format  Frequency
Proposed (run-time reconfigurable)    2 unfolded units        2                  Zynq-7000 Programmable SoC  1.7M                    32*               7680 × 4320 @ 193 fps   200 MHz
Proposed (run-time reconfigurable)    2 unfolded units        1                  Zynq-7000 Programmable SoC  1.1M                    16.25*            7680 × 4320 @ 98 fps    200 MHz
Proposed (run-time reconfigurable)    2 folded units          1                  Zynq-7000 Programmable SoC  332K                    1.06*             3840 × 2160 @ 25 fps    200 MHz
Meher et al. [8]                      1 shared unfolded unit  1                  TSMC 90 nm                  208K                    16                7680 × 4320 @ 60 fps    187 MHz
Meher et al. [8]                      2 unfolded units        1                  TSMC 90 nm                  347K                    32                7680 × 4320 @ 60 fps    187 MHz
Chen et al. [17]                      1 shared unfolded unit  2                  180 nm CMOS                 79K                     2                 3840 × 2160 @ 30 fps    125 MHz
Tikekar et al. [14]                   1 shared folded unit    1 SRAM-based       40 nm                       98.1K, 16.4 Kbit SRAM   2                 4096 × 3072 @ 30 fps    200 MHz
Pastuszak [11]                        2 folded units          2                  TSMC 90 nm                  328K                    32                7680 × 4320 @ 60 fps    400 MHz

Notes:
* Proposed: average pixels/cycle for a 1920 × 1080 video frame.
Meher et al. [8]: the 1D transform always produces 32 pixels per cycle.
Chen et al. [17]: Inverse Transform.
Tikekar et al. [14]: Inverse Transform.
Pastuszak [11]: multipliers used.

Arithmetic and the synthesis tools to optimize resources; it takes care of embedded interfacing, and it is run-time reconfigurable. Our design exhibits high frame rates, except for the Iterative case. In general, our architecture with 2 unfolded (fully pipelined) 1D transforms and 2 transpose buffers outperforms other approaches in frame rates, though with a high gate count, which is a rough approximation (due to inefficient mapping) based on the FPGA logic cells. We note that the reported pixels per cycle in [8,11,14,17] are with respect to one transform block. We report the average pixels per cycle for a 1920 × 1080 frame (there is minimal variation for larger frame sizes); this considers idle cycles between blocks.

6. Conclusions

This work presented scalable HEVC Forward and Inverse Transform implementations that can dynamically adapt resources based on performance and coding efficiency requirements while supporting beyond high-definition video formats. For common applications, the reconfiguration rate has a negligible effect on performance. The results suggest that we can incorporate both Forward and Inverse HEVC Transform inside a medium-sized Zynq-7000 PSoC. This work illustrates the benefits of run-time reconfigurable FPGA or PSoC technology for HEVC encoder/decoder implementations.

Acknowledgment

This material is based upon work supported by the National Science Foundation under Grant No. NSF AWD CNS-1422031.

References

[1] UG585: Zynq-7000 All Programmable SoC Technical Reference Manual, Xilinx, Inc., Nov. 2014.
[2] M. Budagavi, A. Fuldseth, V. Sze, M. Sadafale, Core transform design in the High Efficiency Video Coding (HEVC) standard, IEEE J. Sel. Top. Sign. Proces. 7 (6) (2013) 1029-1041.
[3] P.T. Chiang, T.S. Chang, A reconfigurable inverse transform architecture design for HEVC decoder, in: Proceedings of the IEEE International Symposium on Circuits and Systems, ISCAS2013, pp. 1006-1009, May 2013.
[4] M. Jridi, P.K. Meher, A scalable approximate DCT architectures for efficient HEVC compliant video coding, IEEE Trans. Circuits Syst. Video Technol. (2017) in press.
[5] D. Llamocca, M.S. Pattichis, Dynamic energy, performance, and accuracy optimization and management using automatically generated constraints for separable 2D FIR filtering for digital video processing, ACM Trans. Reconfigurable Technol. Syst. 7 (4) (2014) 4.
[6] D. Llamocca, M. Pattichis, C. Carranza, A framework for self-reconfigurable DCTs based on multiobjective optimization of the Energy-Performance-Accuracy space, in: Proceedings of the 7th International Workshop on Reconfigurable Communication-centric Systems-on-Chip, RECOSOC2012, pp. 579-582, July 2012.
[7] D. Llamocca, M. Pattichis, A. Vera, Partial reconfigurable FIR filtering system using distributed arithmetic, Int. J. Reconfigurable Comput. 2010 (357978) (2010) 1-14.
[8] P.K. Meher, S.Y. Park, B.K. Mohanty, K.S. Lim, C. Yeo, Efficient integer DCT architectures for HEVC, IEEE Trans. Circuits Syst. Video Technol. 24 (1) (2014) 168-178.
[9] J.S. Park, W.J. Nam, S.M. Han, S. Lee, 2-D Large Inverse Transform (16 × 16, 32 × 32) for HEVC (High Efficiency Video Coding), J. Semicond. Technol. Sci. 12 (2) (2012) 203-211.
[10] G. Pastuszak, Flexible architecture design for H.265/HEVC inverse transform, Circuits Systems Signal Process. 34 (6) (2014) 1931-1945.
[11] G. Pastuszak, Hardware architectures for the H.265/HEVC discrete cosine transform, IET Image Process. 9 (6) (2015) 468-477.
[12] G. Pastuszak, A. Abramowski, Algorithm and architecture design of the H.265/HEVC intra encoder, IEEE Trans. Circuits Syst. Video Technol. 26 (1) (2016) 210-222.
[13] M. Tikekar, C.-T. Huang, C. Juvekar, V. Sze, A. Chandrakasan, A 249-Mpixel/s HEVC video-decoder chip for 4K ultra-HD applications, IEEE J. Solid-State Circuits 49 (1) (2014) 61-72.
[14] M. Tikekar, C.-T. Huang, V. Sze, A. Chandrakasan, Energy and area-efficient hardware implementation of HEVC Inverse Transform and Dequantization, in: Proceedings of the IEEE International Conference on Image Processing, ICIP2014, pp. 2100-2104, Oct. 2014.
[15] K. Vipin, S.A. Fahmy, ZyCAP: Efficient partial reconfiguration management on the Xilinx Zynq, IEEE Embedded Systems Letters 6 (3) (2014) 41-44.
[16] V. Sze, M. Budagavi, G. Sullivan, High Efficiency Video Coding (HEVC), in: Integrated Circuit and Systems, Algorithm and Architectures, Springer, 2014.
[17] Y.-H. Chen, C.-Y. Liu, Area efficient video transform for HEVC applications, Electron. Lett. 51 (14) (2015) 1065-1067.
[18] S. Yu, E. Swartzlander, DCT implementation with distributed arithmetic, IEEE Trans. Comput. 50 (9) (2001) 985-991.

Daniel Llamocca received the B.Sc. degree in electrical engineering from Pontificia Universidad Católica del Perú, in 2002, and the M.Sc. degree in electrical engineering and the Ph.D. degree in computer engineering from the University of New Mexico at Albuquerque, in 2008 and 2012, respectively. He is currently an Assistant Professor with Oakland University. His research deals with run-time automatic adaptation of hardware resources to time-varying constraints with the purpose of delivering the best hardware solution at any time. His current research interests include reconfigurable computer architectures for signal, image, and video processing, high-performance architectures for computer arithmetic, communication, and embedded interfaces, embedded system design, and run-time partial reconfiguration techniques on field-programmable gate arrays.