
journal homepage: www.elsevier.com/locate/jpdc

Transform

Daniel Llamocca

Electrical and Computer Engineering Department, Oakland University, Rochester, MI, USA

highlights

The parameterized hardware code allows for large design space exploration.

A different coding efficiency requirement can trigger reconfiguration.

A case is made for run-time reconfigurable technology for video encoders and decoders.

article info

Article history:
Received 30 June 2016
Received in revised form 26 April 2017
Accepted 24 May 2017
Available online 15 June 2017

Keywords:
HEVC transform
Reconfigurable hardware
Embedded systems
Run-time partial reconfiguration

abstract

This work introduces a run-time reconfigurable system for the HEVC Forward and Inverse Transforms that can adapt to time-varying requirements on resources, throughput, and video coding efficiency. Three scalable designs are presented: fully parallel, semi parallel, and iterative. Performance scalability is achieved by combining folded/unfolded 1D Transform architectures and one/two transposition buffers. Resource usage is optimized by utilizing both the recursive even-odd decomposition and distributed arithmetic techniques. The architecture design supports video sequences in the 8K Ultra High Definition format (7680 × 4320) with up to 70 frames per second when using 64 × 64 Coding Tree Blocks with variable transform sizes. The self-reconfigurable embedded system is implemented and tested on a Xilinx® Zynq-7000 All-Programmable System-on-Chip (SoC). Results are presented in terms of performance (frames per second), resource utilization, and run-time hardware adaptation for a variety of hardware design parameters, video resolutions, and self-reconfigurability scenarios. The presented system illustrates the advantages of run-time reconfiguration technology on PSoCs or FPGAs for video compression.

© 2017 Elsevier Inc. All rights reserved.

The video compression standard known as High Efficiency Video Coding (HEVC) is the latest video compression standard. It enhances video compression capability at the expense of computational complexity [16]. Video compression standards have found their way into a wide variety of products increasingly prevalent in our daily lives. The increasing diversity of services, the emergence of beyond-high-definition video formats, and the increasing network traffic caused by video applications have given rise to the HEVC standard. Hardware implementations of the HEVC encoder are highly desired due to their high-performance nature and reduced energy consumption when compared to software realizations [12]. Of particular interest are run-time reconfigurable systems that allow hardware components to be altered while the rest of the system is still operating [5]. This allows video communication systems to meet real-time constraints: video can be transmitted at a lower quality to sustain communication within the available bandwidth; also, if energy constraints are tight (e.g. mobile devices), energy consumption can be reduced at the expense of reduced performance.

This work focuses on run-time hardware adaptation of the HEVC Forward and Inverse Transforms. There has been extensive work on dedicated hardware designs. Most of them use two 1D Transforms (or a single shared 1D Transform) and one or two transposition buffers: multiply-and-accumulate architectures for the Forward [11] and Inverse Transforms [10], a multiplier-less Inverse Transform with separate row and column 1D Transforms [9], multiplier-less shared (row/column) 1D Inverse Transform designs where transposition is processed using registers [3,17] or single-port SRAMs [13,14], and resource-intensive architectures that generate 32 output pixels per cycle irrespective of transform size [8]. Unified (Forward/Inverse) architectures are introduced in [2] and an FPGA implementation is presented in [4].

We present multiplier-less architectures for the HEVC Forward and Inverse N × N Transforms that can trade off resources for performance.

E-mail address: llamocca@oakland.edu.

http://dx.doi.org/10.1016/j.jpdc.2017.05.017
0743-7315/© 2017 Elsevier Inc. All rights reserved.

D. Llamocca / J. Parallel Distrib. Comput. 109 (2017) 178–192

Fig. 1. HEVC encoder. Input image X: B bits per sample. Residual image U: B + 1 bits per sample. ST1 = 2^-(B+M-9), ST2 = 2^-(M+6), SIT1 = 2^-6, SIT2 = 2^-(21-B).

Resource usage is optimized by the use of recursive even-odd decomposition and distributed arithmetic [18], which is allowed by the constant HEVC matrix coefficients [8]. The presented N × N HEVC Transform units allow smaller transforms to be computed by the same hardware. The architecture styles (for both Forward and Inverse Transforms) are: (i) fully parallel: input columns are fed every clock cycle, even between blocks; this requires the most resources, (ii) semi parallel: input columns are fed every clock cycle within a block, with a wait between blocks, (iii) iterative: input columns within a block and between blocks need to wait a number of cycles; this requires the least resources.

Fig. 2. 1D Transform core with constant coefficients D(N). TP: fully pipelined or iterative. The circuit computes the product of D(N) (or D^T(N)) by a column of U, resulting in an N-pixel column of Z.
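The even-odd decomposition mentioned above can be illustrated in software. The sketch below (plain Python, not the paper's VHDL) computes a 4-point forward transform both directly and via pre-additions/pre-subtractions feeding two half-size matrices; the coefficient values 64, 83, and 36 are the standard HEVC 4-point core transform coefficients.

```python
# Even-odd decomposition of the 4-point HEVC core transform (illustrative sketch).
# Rows 0 and 2 of D4 are symmetric (even part); rows 1 and 3 are antisymmetric (odd part).

D4 = [[64,  64,  64,  64],
      [83,  36, -36, -83],
      [64, -64, -64,  64],
      [36, -83,  83, -36]]

def transform_direct(x):
    """Plain matrix-vector product y = D4 * x."""
    return [sum(c * xi for c, xi in zip(row, x)) for row in D4]

def transform_even_odd(x):
    """Same result using pre-additions/subtractions and two 2x2 products."""
    even_in = [x[0] + x[3], x[1] + x[2]]   # pre-additions
    odd_in  = [x[0] - x[3], x[1] - x[2]]   # pre-subtractions
    E = [[64,  64], [64, -64]]             # even half: rows 0, 2 of D4
    O = [[83,  36], [36, -83]]             # odd half:  rows 1, 3 of D4
    even = [sum(c * v for c, v in zip(row, even_in)) for row in E]
    odd  = [sum(c * v for c, v in zip(row, odd_in))  for row in O]
    return [even[0], odd[0], even[1], odd[1]]  # interleave even/odd outputs

if __name__ == "__main__":
    x = [10, -3, 7, 2]
    assert transform_direct(x) == transform_even_odd(x)
```

Each halving step applies the same idea recursively (D(N/2) plays the role of E), which is how a 32-point unit can reuse 16-, 8-, and 4-point hardware.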

To demonstrate run-time adaptability, a self-reconfigurable embedded system is presented where we can alter (at run-time) the design parameters of the N × N HEVC Forward and Inverse Transforms. Moreover, the HEVC coding efficiency can be altered by varying the quadtree structure and/or size of a Coding Tree Block (CTB) [16]; if the largest block size varies, then we need to modify (at run-time) the size of the Transform hardware. Thus, these scalable architectures adapt to time-varying resource, throughput, and coding efficiency requirements. Hardware is validated by embedded implementation and testing of real-life scenarios with high-definition frame sizes and different CTB quadtree structures. The main contributions of the present work can be summarized as follows:

Performance-scalable multiplier-less architectures: We provide the fastest hardware realizations based on the available resources.

Fully customizable architectures validated on an All-Programmable System-on-Chip (SoC): The fully parameterized VHDL-coded hardware allows for exploration of trade-offs among resources, performance, and design parameters. The code, not tied to any particular hardware, can be implemented on any existing technology (e.g. PSoC, FPGA, ASIC).

Self-reconfigurable embedded implementations: The run-time reconfigurable software/hardware system can swap hardware configurations to adapt to resource, throughput, and coding efficiency requirements.

The rest of the paper is organized as follows: Section 2 details the HEVC Transform algorithm. Section 3 presents the three architecture designs. Section 4 provides embedded implementation details. Section 5 presents results in terms of resources, performance, and run-time management. Conclusions are provided in Section 6.

2. HEVC Transform and Scaling

Fig. 1 depicts an HEVC encoder with its main components. The input image has B bits per sample while the residual image has B + 1 bits per sample [2]. The Forward Transform output requires NO = 16 bits while the Inverse Transform output requires B + 1 bits. The Forward/Inverse HEVC Transforms are 2D DCT-like transforms made of two 1D Transforms (column and row) and Scaling Units. In HEVC, the coefficients are approximations of the scaled DCT coefficients and are represented with 8 integer bits. The 4 × 4 Transform can also use an alternative 4 × 4 DST-based Transform [16]. We refer to the equations in [8] and the scaling factor in [2] as the reference algorithm in the remainder of this paper.

In HEVC, the size of an N × N block can be 4, 8, 16, 32. In the HEVC Forward Transform, the residual image block U is processed block-wise by Y = D·U·D^T (this includes scaling): the column transform applies D·U, then scaling (bit pruning) to guarantee 16-bit output samples (ST1 = 2^-(B+M-9)). After that, the row transform applies (D·U)·D^T, then scaling to ensure 16-bit output samples (ST2 = 2^-(M+6)). In the HEVC Inverse Transform, the dequantized block Yq is processed by Uq = D^T·Yq·D (including scaling): the column transform applies D^T·Yq, then scaling to guarantee 16-bit output samples (SIT1 = 2^-6). After that, the row transform applies (D^T·Yq)·D, then scaling to guarantee output samples of B + 1 bits (SIT2 = 2^-(21-B)).

Table 1 specifies how to process data: if we feed an input block (U, Yq), we get a transposed output block (Y^T, Uq^T). If we feed a transposed input block (U^T, Yq^T), we get an output block (Y, Uq).


Table 1
HEVC Forward/Inverse Transform: Processing column and row 1D transforms.

2D Transform              In     Column transform   Row transform             Out
Forward: Y = D·U·D^T      U      Z = D·U            D·Z^T = D·U^T·D^T         Y^T
                          U^T    Z = D·U^T          D·Z^T = D·U·D^T           Y
Inverse: Uq = D^T·Yq·D    Yq     Z = D^T·Yq         D^T·Z^T = D^T·Yq^T·D      Uq^T
                          Yq^T   Z = D^T·Yq^T       D^T·Z^T = D^T·Yq·D        Uq
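The rows of Table 1 are plain transpose identities and can be checked numerically. The sketch below (plain Python, using the HEVC 4-point matrix as D) verifies that feeding U column-wise and applying the row transform to Z^T yields Y^T, while feeding U^T yields Y.

```python
# Check Table 1 (Forward case): column transform Z = D*U, row transform D*Z^T,
# where the reference result is Y = D*U*D^T. Plain-Python matrix helpers.

D = [[64,  64,  64,  64],
     [83,  36, -36, -83],
     [64, -64, -64,  64],
     [36, -83,  83, -36]]   # HEVC 4-point core transform matrix

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def T(A):
    return [list(col) for col in zip(*A)]

U = [[1, 2, 3, 4], [5, -6, 7, 8], [9, 10, -11, 12], [13, 14, 15, -16]]

Y = matmul(matmul(D, U), T(D))   # reference: Y = D U D^T

Z = matmul(D, U)                 # column transform (U fed column-wise)
out = matmul(D, T(Z))            # row transform: D Z^T
assert out == T(Y)               # output appears transposed (Y^T), as in Table 1

Zt = matmul(D, T(U))             # feeding U^T instead...
assert matmul(D, T(Zt)) == Y     # ...yields Y directly
```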

Fig. 3. (a) Fully parallel 4 × 4 1D Transform. (b) Architecture of the fully pipelined, distributed arithmetic (DA) N × N Inner Product. Here, L is the LUT input size and is set to 4; this requires N to be a multiple of 4. It includes a Distributed Arithmetic rearrangement unit, filter blocks (see [7] for details), and an adder tree. NH, the number of bits per coefficient, is 8 according to HEVC.
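The distributed-arithmetic inner product of Fig. 3(b) can be modeled behaviorally: the constant coefficients are grouped L = 4 at a time into lookup tables of partial sums, the input words are processed one bit plane per step, and a shift-accumulate (with a negated weight for the two's-complement sign plane) rebuilds the exact dot product. A software sketch (plain Python, not the paper's VHDL):

```python
def da_inner_product(coeffs, xs, bi, L=4):
    """Bit-serial distributed arithmetic: dot(coeffs, xs) for BI-bit signed xs."""
    assert len(coeffs) % L == 0 and len(coeffs) == len(xs)
    groups = [coeffs[i:i + L] for i in range(0, len(coeffs), L)]
    # One 2^L-entry LUT of coefficient partial sums per group of L coefficients.
    luts = [[sum(c for k, c in enumerate(g) if (m >> k) & 1) for m in range(1 << L)]
            for g in groups]
    mask_bi = (1 << bi) - 1
    acc = 0
    for t in range(bi):                                   # one bit plane per step
        weight = -(1 << t) if t == bi - 1 else (1 << t)   # sign plane is negative
        plane_sum = 0
        for gi in range(len(groups)):
            addr = 0
            for k in range(L):        # gather bit t of each input in the group
                addr |= (((xs[gi * L + k] & mask_bi) >> t) & 1) << k
            plane_sum += luts[gi][addr]
        acc += weight * plane_sum     # shift-accumulate
    return acc

if __name__ == "__main__":
    coeffs = [64, 83, 36, -36, -83, -64, 64, 36]          # arbitrary 8-tap example
    xs = [3, -2, 5, -7, 1, 0, -4, 6]
    ref = sum(c * x for c, x in zip(coeffs, xs))
    assert da_inner_product(coeffs, xs, bi=8) == ref
```

In the hardware, the per-bit-plane loop is pipelined, giving the latency of ⌈log2 BI⌉ + log2(N/L) + 1 cycles quoted in Section 3.1.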

The 1D Forward Transform (column and row) can then be defined as a matrix product D_N·X, and the 1D Inverse Transform (column and row) as a matrix product D_N^T·X.

The matrix D_N can be recursively decomposed: first into D_{N/2} and M_{N/2}; then D_{N/2} is decomposed into D_{N/4} and M_{N/4}, and so on [8]. The 1D Forward column and row Transforms then require proper permutations of the input and output sequences as well as N pre-additions/pre-subtractions. Likewise, the matrix D_N^T can be recursively decomposed. The 1D Inverse column and row Transforms require proper permutations of the input and output sequences as well as N post-additions/post-subtractions [4]. This method, known as recursive even-odd decomposition, optimizes resource usage [8].

3. HEVC Transform: Hardware implementation

We describe the block-wise hardware implementation of the shaded blocks of Fig. 1. We start by detailing the 1D Transform core that implements the column/row transforms (both Forward and Inverse cases); this unit implements a generic matrix product Z = D·U. We then explain the HEVC 2D Transform hardware: as we need Z^T, a transposition stage is required between the column and row transforms. The Scaling stages amount to bit pruning.

We have coded the architecture in VHDL, derived from the reference algorithm mentioned in Section 2. Our architectures are designed to process N pixels (a column or a row) at a time. If we feed U column-wise, we are implementing Z = D·U for the column transform, and the output is the transposed matrix Y^T, i.e., we get Y row-wise. If we feed U row-wise, we are implementing Z = D·U^T for the column transform, and the output is the matrix Y, i.e., we get Y column-wise. For clearness of presentation, in our hardware descriptions, the input block is fed column-wise to both the Forward and Inverse Transforms.

3.1. Hardware implementation of the 1D DCT-like transform

We compute the matrix multiplication D·U, where D is an N × N constant matrix and U an input block. The coefficients D(N) and D^T(N) allow for recursive even-odd decomposition and distributed arithmetic [18] to optimize resource usage [6]. The circuit generates one output column at a time, with U provided column-wise. Two versions are provided: fully pipelined (unfolded) and iterative (folded).

Fig. 2 shows the parameterized VHDL-coded core with its parameters and I/O ports. The design parameters are N (transform size), BI (bits per input sample), NY (bits per output sample), FWD/INV (Forward or Inverse), the coefficients (provided as a text file; Forward: D(N), Inverse: D^T(N)), and TP (version: fully pipelined or iterative). The case N = 4 is not decomposed.

The column/row transform (Forward and Inverse) units are generated by proper parameter selection. For the HEVC Forward case, the column transform requires BI = B + 1, NY = B + M + 7, D(N), while the row transform requires BI = NO = 16, NY = 15 + M + 7, D(N). For the HEVC Inverse case, both the column and row transforms require BI = NO = 16, NY = 22, D^T(N).

(1) Fully pipelined implementation: Fig. 3(a) depicts the 4 × 4 Transform comprised of four inner products (each of 4 inputs by 4 constant coefficients). Fig. 3(b) depicts a fully pipelined inner product (N inputs by N coefficients) implemented with distributed arithmetic (DA), whose latency is ⌈log2 BI⌉ + log2(N/L) + 1 cycles (N a multiple of L, the LUT input size [7]). The 32 × 32 recursive implementations of the Forward and Inverse Transforms are depicted in Figs. 4 and 5 respectively (the 16 × 16, 8 × 8, and 4 × 4 units are also shown). To avoid large combinational delays, registers are inserted after the adder/subtractors; the extra register delays are balanced by the latency difference between inner products of different size. Both the Forward and Inverse Transform architectures


Fig. 4. Fully Pipelined 32 × 32 1D Forward Transform. The 16 × 16, 8 × 8, and 4 × 4 Transforms are depicted as well, as they are components of the 32 × 32 Transform. The I/O delay (latency) is indicated for every Inner Product (DA) hardware. Inner products with N/2 inputs have an extra delay compared to those with N/4 inputs; this is balanced by the extra register levels added on the inputs.


Fig. 5. Fully Pipelined 32 × 32 1D Inverse Transform. The 16 × 16, 8 × 8, and 4 × 4 Transforms are depicted as well, as they are components of the 32 × 32 Transform. The I/O delay (latency) is indicated for every Inner Product (DA) hardware. Inner products with N/2 inputs have an extra delay compared to those with N/4 inputs; this is balanced by the extra register levels added on the outputs.

can process an N-pixel column per cycle, and their latencies are given by (L = 4):

RLP = ⌈log2(BI + log2(N/L))⌉ + 2 + log2(N/L)    (1)
IRLP = ⌈log2(BI)⌉ + log2(N/L) + 2.    (2)

(2) Iterative implementation: The 4 × 4 Transform hardware is depicted in Fig. 6(a): it has a DA rearrangement unit (see Fig. 3(b)), a vector shift register, 4 iterative inner products, and an FSM. Fig. 6(b) depicts the N × N iterative inner product, whose latency is log2(N/L) + 2 cycles. Fig. 7 depicts the 32 × 32 Forward Transform


Fig. 6. (a) Iterative 4 × 4 Transform (N = 4) architecture. The DA rearrangement unit converts the N BI-bit vectors into BI N-bit vectors, which are shifted out one by one by the vector shift register. (b) Architecture details of the Iterative Inner Product (N inputs by N coefficients) that uses Distributed Arithmetic.
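The DA rearrangement unit of Fig. 6(a) is essentially a bit-plane transpose: N words of BI bits become BI vectors of N bits, shifted out one per cycle. A behavioral sketch (plain Python, two's-complement encoding assumed):

```python
def da_rearrange(xs, bi):
    """Convert N BI-bit words into BI N-bit bit-plane vectors (LSB plane first)."""
    mask = (1 << bi) - 1
    planes = []
    for t in range(bi):                       # one plane per future clock cycle
        plane = 0
        for k, x in enumerate(xs):
            plane |= (((x & mask) >> t) & 1) << k
        planes.append(plane)
    return planes

if __name__ == "__main__":
    planes = da_rearrange([1, 2, 3, 0], bi=4)
    # Plane 0 holds bit 0 of each word: words 1 and 3 have bit 0 set -> 0b0101.
    assert planes[0] == 0b0101
    assert planes[1] == 0b0110   # bit 1 is set in words 2 and 3
```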

recursive architecture. Unlike the Fully Pipelined case, the blocks after the DA rearrangement units have identical latencies (smaller inner products' delays are balanced by extra input groups to the vector shift registers). So, in addition to the registers placed after the adder/subtractors, synchronizing registers are needed for data to arrive simultaneously at all DA rearrangement units. Thus, an N × N Forward Transform can process an input column every BI + log2(N/L) cycles (a non-decomposed transform processes an input column every BI cycles at the expense of excessive hardware overhead). The Inverse Transform is similar to Fig. 5, without input registers and with Iterative Inner Products (along with DA rearrangement units and vector shift registers). Here, the number of input groups to all vector shift registers is the same; the latency difference of inner products of different size is balanced by the registers placed after the adder/subtractors on the outputs. Thus, the Inverse Transform can process an input column every BI cycles. The latencies (RLI: Forward, IRLI: Inverse) are given by:

RLI = BI + 2·log2(N/L) + 2    (3)
IRLI = BI + log2(N/L) + 2.    (4)

3.2. Computing smaller 1D transforms

In HEVC, the block sizes inside a Coding Tree Block (CTB) vary continuously. By multiplexing the inputs (or outputs for the Inverse Transform), we can use only the hardware portion that computes the smaller transform and keep the latencies given in Eqs. (1)-(4). This is problematic when switching sizes, as it requires in some cases that we delay feeding data. Instead, our solution is to control the synchronous clear (an(2..0) in Figs. 4, 5 and 7), where we feed zeros to guarantee proper results when we require transforms smaller than N.

The 32 × 32 Transform cores of Figs. 4 and 7 can process any of the 4 × 4, 8 × 8, 16 × 16, and 32 × 32 Transforms via the size signal. The VHDL-coded core allows for different N (the size of the maximum allowed transform); for example, the case 16 × 16 is useful in applications that only require the 4 × 4, 8 × 8, and 16 × 16 transforms. The latency of an N × N transform core does not vary when smaller transforms are processed, as data needs to propagate through the extra register levels. Thus, switching block sizes (N1 × N1 to N2 × N2, where N1, N2 ≤ N) does not incur extra delay cycles.

To summarize, a fully pipelined 1D Transform can process a column every cycle, while the Iterative Forward Transform can process a column every BI + log2(N1/L) cycles (or every BI cycles for the Inverse core). This is also true between columns of different blocks (which might not be of the same size).

3.3. HEVC Transform (2D) implementation

Fig. 8 shows the HEVC Forward and Inverse Transform (2D) VHDL-coded core with its I/O ports and parameters: B, N, D(N) (including the smaller matrices: D(N/2), M(N/2), ...), IMP (implementation style: fully parallel, semi parallel, iterative), and FWD/INV (Forward or Inverse). The design parameters allow us to finely control execution time and resource usage. The HEVC Transform needs row and column 1D Transforms, Scaling, and a transposition stage. The alternative DST-based transform can be used for the 4 × 4 unit [2]. The input is provided column-wise and the outputs are generated row-wise (Y^T, Uq^T).

The Forward Transform uses D(N), ST1, ST2, BID = B + 1 bits per input sample, and NOD = 16 bits per output sample. The Inverse Transform (shown in purple in Fig. 7) uses D^T(N), SIT1, SIT2, BII = NOD = 16 bits per input sample, and NOI = B + 1 bits per output sample. Scaling (ST1, ST2, SIT1) amounts to bit pruning of the LSBs to fit into 16 bits, while SIT2 scaling amounts to bit pruning of the LSBs to fit into B + 1 bits.

We need a transposition stage between the column and row transforms. Some works use a shared column/row transform with only one transpose buffer [3,14], while others use separate column and row transforms with one transpose buffer [8,9] or two transpose buffers [6,11] in ping-pong mode. As an example of how we can handle variable-sized blocks, note that if we want to transpose a 4 × 4 matrix in a 32 × 32 buffer, we just load and transpose the upper left corner. Fig. 9 depicts the transposition of a 2 × 2 matrix in a 4 × 4 buffer: we only use the upper left corner and perform the operation in 2 cycles.

Three architecture designs are provided based on the 1D Transform implementation and the number of transpose buffers:

(1) Fully parallel: Both the row and column 1D Transforms are Fully Pipelined. In addition, two transpose buffers are used in ping-pong mode [6]. This configuration allows us to continuously feed a column per cycle with no need to stall the pipeline between same-sized blocks. When switching blocks of different size (N1 × N1 to N2 × N2), we need to wait N1 - N2 extra cycles (only if N1 > N2) so that the buffer processing the N2 × N2 block avoids producing output data at the same time as the buffer processing the N1 × N1 block.

(2) Semi parallel: Both the row and column 1D Transforms are Fully Pipelined. Only one transpose buffer is used, which requires us to always wait N1 - 1 cycles between blocks. Within a block, we can input one column per cycle.

(3) Iterative: The row and column 1D Transforms are Iterative. Within a block we can feed a column every BID + log2(N/L) cycles (Forward Transform) or BII cycles (Inverse Transform).


Fig. 7. Iterative implementation of the 32 × 32 1D Forward Transform. The 16 × 16, 8 × 8, and 4 × 4 Transforms are depicted as well, as they are components of the 32 × 32 Transform (the enable chain and control logic for these transforms are not shown here).

One transpose buffer is used. Between blocks, we need to wait (NOD + log2(N/L))·(N - 1) + 1 cycles (Forward Transform) or NOI·(N - 1) + 1 cycles (Inverse Transform).

The waiting cycles between blocks imposed by the transpose buffers are longer than those imposed by the 1D Transforms for N1, N2 = 4, 8, 16, 32, and thus they prevail. Table 2 summarizes the number of required cycles between input columns for both the HEVC Forward and Inverse Transforms.

Execution time is defined from the clock edge where the first input column is captured to the first clock edge where the last output


Fig. 8. HEVC Forward and Inverse Transform. B = 8, NO = 16. Input: N N-pixel columns. Output: N N-pixel rows. The Scaling stages (ST1, SIT1, ST2, SIT2) perform bit pruning of the LSBs. Transposition buffers: we can select one or two.

Table 2
HEVC Transform: Cycles between input columns. N1: current block size, N2: next block size, BID = B + 1, NOD = 16, L = 4.

                        Between blocks                              Within a block
Fully parallel          1 + (N1 - N2) if N1 > N2; 1 if N1 ≤ N2      1
Semi-parallel           N1 - 1                                      1
Iterative (Forward)     (NOD + log2(N1/L))·(N1 - 1) + 1             BID + log2(N1/L)
Iterative (Inverse)     NOI·(N1 - 1) + 1                            BII
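The Table 2 entries can be wrapped in a small helper to compare configurations. The sketch below (plain Python; log2 terms are exact since N is a power of two) returns the cycles between input columns for the Forward Transform.

```python
from math import log2

def fwd_cycles_between_columns(style, N1, N2=None, B=8, L=4):
    """Cycles between input columns (HEVC Forward Transform), per Table 2."""
    BID = B + 1                      # bits per input sample (residual)
    NOD = 16                         # bits per output sample
    if style == "fully_parallel":
        between = 1 + (N1 - N2) if (N2 is not None and N1 > N2) else 1
        within = 1
    elif style == "semi_parallel":
        between, within = N1 - 1, 1
    elif style == "iterative":
        between = (NOD + int(log2(N1 / L))) * (N1 - 1) + 1
        within = BID + int(log2(N1 / L))
    else:
        raise ValueError(style)
    return between, within

if __name__ == "__main__":
    assert fwd_cycles_between_columns("fully_parallel", 32, 8) == (25, 1)
    assert fwd_cycles_between_columns("semi_parallel", 32) == (31, 1)
    assert fwd_cycles_between_columns("iterative", 32) == (590, 12)
```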

Table 3
Execution time for P consecutive N × N blocks. BIL = BID + log2(N/L), NOL = NOD + log2(N/L), L = 4. RLP1 = ⌈log2 BIL⌉ + 2 + log2(N/L), RLP2 = ⌈log2 NOL⌉ + 2 + log2(N/L). IRLP1 = IRLP2 = ⌈log2 NOI⌉ + log2(N/L) + 2.

                             Execution time (number of cycles)
Forward:
  Fully parallel             P·N + RLP1 + RLP2 + N - 1
  Semi parallel              P·N + (P - 1)·(N - 1) + RLP1 + RLP2 + N - 1
  Iterative                  ((BIL + NOL)·(N - 1) + 1)·P + RLI1 + RLI2,
                             with RLI1 = BID + 2·log2(N/L) + 2, RLI2 = NOD + 2·log2(N/L) + 2
Inverse:
  Fully parallel             P·N + IRLP1 + IRLP2 + N - 1
  Semi parallel              P·N + (P - 1)·(N - 1) + IRLP1 + IRLP2 + N - 1
  Iterative                  (2·NOI·(N - 1) + 1)·P + IRLI1 + IRLI2,
                             with IRLI1 = IRLI2 = NOI + log2(N/L) + 2
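As a worked example of Table 3, the sketch below (plain Python) evaluates the Forward execution-time formulas for P consecutive N × N blocks; log2 terms are exact since N is a power of two, and the ceilings follow the latency expressions in the table header.

```python
from math import ceil, log2

def fwd_exec_cycles(style, P, N, B=8, L=4):
    """Execution time in cycles for P consecutive N x N blocks (Forward, Table 3)."""
    BID, NOD = B + 1, 16
    g = int(log2(N / L))
    BIL, NOL = BID + g, NOD + g
    RLP1 = ceil(log2(BIL)) + 2 + g
    RLP2 = ceil(log2(NOL)) + 2 + g
    if style == "fully_parallel":
        return P * N + RLP1 + RLP2 + N - 1
    if style == "semi_parallel":
        return P * N + (P - 1) * (N - 1) + RLP1 + RLP2 + N - 1
    if style == "iterative":
        RLI1 = BID + 2 * g + 2
        RLI2 = NOD + 2 * g + 2
        return ((BIL + NOL) * (N - 1) + 1) * P + RLI1 + RLI2
    raise ValueError(style)

if __name__ == "__main__":
    # One 32x32 block, B = 8: 82 cycles fully parallel vs. 1003 cycles iterative.
    assert fwd_exec_cycles("fully_parallel", 1, 32) == 82
    assert fwd_exec_cycles("iterative", 1, 32) == 1003
```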

column appears. For P consecutive N × N blocks, we refer to Fig. 10 for the execution time of the Forward HEVC Transform (Fully Parallel, Semi Parallel, Iterative). RLP1 and RLP2 refer to the latencies of the column and row Forward Fully Pipelined 1D Transforms, respectively. RLI1 and RLI2 refer to the latencies of the column and row Forward Iterative 1D Transforms, respectively. Similarly, for the Inverse case we have IRLP1, IRLP2, IRLI1, and IRLI2. Table 3 summarizes these execution times for both the HEVC Forward and Inverse Transforms.

3.4. HEVC transform in the encoder

The HEVC encoder, depicted in Fig. 1, includes the Forward Transform, the quantizer, and the reconstruction loop (inverse quantizer, Inverse Transform, Intra-Prediction, in-loop filtering, and Inter-Prediction blocks). The latencies of the Forward and Inverse Transform designs are specified in Table 3; we can select an optimal design based on the latency of the other components of the encoder. The quantizer and inverse quantizer operations can be readily pipelined [2]. For Intra-Prediction, we refer to the work in [12], which presents a high-performance, scalable hardware design along with comparisons with other approaches. Similarly, fast designs of in-loop filtering and Inter-Prediction are presented in [14].

The setup of Fig. 8 (input blocks fed column-wise to both the Forward and Inverse Transforms) is suitable for clearness of presentation. But when simultaneously using the Forward and Inverse Transforms in the encoder, this setup would require transposition stages between the Forward and Inverse Transform architectures (Y to Yq, and U to Uq). To avoid these transposition stages and to reduce latency, the input block to the Forward Transform must be entered column-wise and the input block to the Inverse Transform must be entered row-wise (see Table 1).

4. Self-reconfigurable embedded implementation

4.1. Self-reconfigurable embedded system

The HEVC Forward/Inverse Transform hardware was integrated into the embedded system depicted in Fig. 11 and implemented on a Xilinx® Zynq-7000 All-Programmable SoC (ARM® + FPGA). This device has two main portions: the Programmable Logic (PL) and the Processing System (PS) [1]. A software routine inside the PS writes/reads data to/from the HEVC Transform hardware (located inside the PL) via an AXI4-Full Interface [1]. Run-time partial reconfiguration can be carried out via the Internal Configuration Access Port (ICAP), with the ICAP peripheral inside the PL. Alternatively, in Zynq-7000 devices, the Device Configuration Interface (DevC) inside the PS can reconfigure (at run-time) PL regions (known as Reconfigurable Partitions) via the Processor Configuration Access Port (PCAP) by writing partial bitstream files to it [15]. This is a more efficient method, as it does not require PL resources and the PS has a dedicated DMA controller to transfer partial bitstreams from memory to the PCAP. Input data (input video, partial bitstreams)


Fig. 9. Transposition buffer 4 × 4. The shaded upper left corner can perform a transposition operation of a 2 × 2 matrix in 2 cycles.

Fig. 11. Self-reconfigurable embedded system. The Reconfigurable Partition (RP) includes the HEVC Transform core and the input/output interface to the FIFOs.
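The corner trick of Fig. 9 can be modeled with a buffer that accepts one column per cycle into its upper-left corner and then reads rows back out. A behavioral sketch (plain Python; the 4 × 4 buffer size matches the figure's example):

```python
class TransposeBuffer:
    """N x N transpose buffer; an N1 x N1 block uses only the upper-left corner."""
    def __init__(self, nmax):
        self.nmax = nmax
        self.mem = [[0] * nmax for _ in range(nmax)]

    def load_column(self, j, column):
        """One write cycle: store an N1-pixel column into column j of the corner."""
        for i, v in enumerate(column):
            self.mem[i][j] = v

    def read_row(self, i, n1):
        """Read row i of the corner: a column of the transposed block."""
        return self.mem[i][:n1]

if __name__ == "__main__":
    buf = TransposeBuffer(4)
    cols_in = [[1, 3], [2, 4]]             # 2x2 block streamed column-wise
    for j, col in enumerate(cols_in):      # 2 write cycles, as in Fig. 9
        buf.load_column(j, col)
    rows_out = [buf.read_row(i, 2) for i in range(2)]
    # The row-wise output is the transpose of the column-wise input stream.
    assert rows_out == [list(t) for t in zip(*cols_in)]
```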

is stored on the SD card and then loaded into memory. Output data is written back to the SD card.

The HEVC Transform hardware is part of an embedded peripheral, called the AXI HEVC Transform, depicted in detail in Fig. 12: it consists of the HEVC Transform, an output buffer, input/output interfaces to FIFOs, and 32-bit AXI4-Full Slave Interface logic (2 FIFOs and control). The output buffer is required since the HEVC Transform can output N pixels per cycle, which require more than 32 bits.

The input interface consists of registers that wait until an input column is ready before feeding it to the HEVC Transform. For the Forward Transform, we need to subtract samples coming from the intra-prediction/inter-prediction blocks (see Fig. 1). For convenience, in our tests we set those inputs to zero (this is in fact part of the HEVC operation). For the Inverse Transform, no subtractors are needed. The output interface is a multiplexor that captures data from the output buffer and outputs 32 bits at a time.

The iFIFO and oFIFO isolate the S_AXI_ACLK and CLKFX clock regions. The FSM @ S_AXI_ACLK controls the AXI signals and the associated FIFOs' signals. The FSM @ CLKFX controls the glue logic between the HEVC Transform and the FIFOs as well as the associated FIFOs' signals. This configuration allows for fast data processing via the 32-bit AXI4-Full Interface. In our experiment, we keep CLKFX = S_AXI_ACLK, though CLKFX can be user-modifiable.

4.2. I/O considerations

The AXI4-Full Interface limits the throughput of the system: we can only read/write 32 bits per cycle from the iFIFO/oFIFO, and we can only transmit (via DMA) up to 16 32-bit words per burst. The interface of Fig. 12 allows for fast data processing under these constraints, but we still fall short of taking full advantage of the Fully Parallel architecture. Here, the proposed self-reconfigurable embedded system serves as a generic testbed for hardware validation and run-time hardware adaptation. It also shows that the HEVC Transform cores can fit in medium-sized FPGAs or PSoCs. Approaches to optimize performance will require that the HEVC Transform hardware directly reads/writes data from/to an external memory, where the AXI4-Full Interface is only used to load video frames onto the external memory.

4.3. Run-time considerations

If we want to modify the HEVC Transform parameters (size, implementation type, Forward/Inverse), we also need to modify

Fig. 10. HEVC Forward Transform: Processing cycles for P consecutive N × N blocks. (a) Fully parallel: a new block is processed once N pixels are captured. (b) Semi parallel: a new block is processed N - 1 cycles after loading N pixels. (c) Iterative: a new block can be processed once N pixels are loaded and after a certain delay between blocks. For the Inverse Transform, use IRLP1, IRLP2, IRLI1, IRLI2 instead of RLP1, RLP2, RLI1, RLI2. For the Iterative case (Inverse), use NOI instead of BIL and NOI instead of NOL.


Fig. 12. HEVC Forward and Inverse Transform AXI4-Full peripheral. When design parameters vary, the interface adjusts to allow for seamless communication with the AXI bus. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

CLKFX. As a result, we grouped all these components, along with the HEVC Transform, into a Reconfigurable Partition (RP) (shaded in light green in Fig. 12); this is the PL region that will be reconfigured at run time via the DevC Interface.

The RP outputs toggle during partial reconfiguration, and the registers inside the RP are not reset after partial reconfiguration [5]. Zynq-7000 devices allow us to set the reset-after-reconfiguration property for the RP to clear all registers, though this setting can impose stringent constraints on the RP shape. Moreover, the toggling of the RP outputs can affect the FIFOs' behavior during partial reconfiguration; thus, we need to force a FIFO reset after partial reconfiguration. A better solution is to instead generate a PR_reset pulse via software (a special word written into a specific address) that resets the registers in the RP as well as the FIFOs. Fig. 13 shows the FSM @ S_AXI_ACLK: as soon as the PR_reset is issued, the system issues the reset for the RP and the FIFOs.
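The PR_reset mechanism can be modeled as a tiny state machine: a write of the control word 0xAA995577 to the register at base-address offset 11 (both taken from Fig. 13) triggers a 16-cycle reset pulse for the RP and the FIFOs. A behavioral sketch (plain Python; all bus-level details elided):

```python
RESET_WORD = 0xAA995577   # control word from Fig. 13
RESET_OFFSET = 11         # word offset from the peripheral's base address

def pr_reset_pulse(offset, word, pulse_cycles=16):
    """Return the rst waveform (one element per cycle) produced by a register write."""
    if offset == RESET_OFFSET and word == RESET_WORD:
        return [1] * pulse_cycles     # 16-cycle reset for the RP and the FIFOs
    return []                         # any other write: no reset issued

if __name__ == "__main__":
    assert len(pr_reset_pulse(11, 0xAA995577)) == 16
    assert pr_reset_pulse(11, 0xDEADBEEF) == []
```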

We implemented the Forward and Inverse Transform architectures for luma TBs, with B = 8 and NO = 16. By streaming video frames and retrieving data from the embedded system, the design has been verified to work against a software model based on the reference algorithm mentioned in Section 2. To report results on resources and execution time, we use block sizes N = 4, 8, 16, 32 and the iterative, semi parallel, and fully parallel implementations. The embedded system was implemented on the ZC706 Evaluation Kit, which houses a XC7Z045 Zynq-7000 All-Programmable SoC.

Fig. 13. FSM @ S_AXI_ACLK. If the proper control word (0xAA995577) is received at the base address with offset 11, a PR_reset pulse is issued. This in turn issues a 16-cycle reset pulse (rst) for the Reconfigurable Partition and the FIFOs.

Table 4 reports the resource utilization of the Forward/Inverse HEVC Transforms for different sizes and implementation types. All circuits (except N = 4) use recursive even-odd decomposition (recall that smaller sizes are always supported down to 4 × 4).

D. Llamocca / J. Parallel Distrib. Comput. 109 (2017) 178–192

Table 4
Forward/Inverse HEVC Transform resource utilization. Device: XC7Z045 SoC (54,650 Slices; 218,600 LUTs; 437,200 FFs).

| Transform | Type | N | LUTs (%) | FFs (%) | Slices (%) |
|---|---|---|---|---|---|
| Forward | Iterative | 4 | 371 (0.2) | 580 (0.1) | 93 (0.2) |
| | | 8 | 1244 (0.6) | 1947 (0.4) | 311 (0.6) |
| | | 16 | 4181 (1.9) | 6748 (1.5) | 1046 (1.9) |
| | | 32 | 14025 (6.4) | 24037 (5.5) | 3507 (6.4) |
| | Semi parallel | 4 | 1077 (0.5) | 1615 (0.4) | 270 (0.5) |
| | | 8 | 3506 (1.6) | 4935 (1.1) | 877 (1.6) |
| | | 16 | 12737 (5.8) | 17935 (4.1) | 3185 (5.8) |
| | | 32 | 48859 (22.4) | 67884 (15.5) | 12215 (22.4) |
| | Fully parallel | 4 | 1372 (0.6) | 1885 (0.4) | 343 (0.6) |
| | | 8 | 4846 (2.2) | 5939 (1.4) | 1212 (2.2) |
| | | 16 | 18389 (8.4) | 22037 (5.0) | 4598 (8.4) |
| | | 32 | 72432 (33.1) | 84278 (19.3) | 18108 (33.1) |
| Inverse | Iterative | 4 | 458 (0.2) | 641 (0.1) | 115 (0.2) |
| | | 8 | 1407 (0.6) | 2012 (0.5) | 352 (0.6) |
| | | 16 | 4475 (2.0) | 6741 (1.5) | 1119 (2.0) |
| | | 32 | 14490 (6.6) | 23364 (5.3) | 3623 (6.6) |
| | Semi parallel | 4 | 1466 (0.7) | 2219 (0.5) | 367 (0.7) |
| | | 8 | 4046 (1.9) | 5890 (1.3) | 1012 (1.9) |
| | | 16 | 13999 (6.4) | 19924 (4.6) | 3500 (6.4) |
| | | 32 | 51689 (23.6) | 72963 (16.7) | 12923 (23.6) |
| | Fully parallel | 4 | 1782 (0.8) | 2485 (0.6) | 446 (0.8) |
| | | 8 | 5418 (2.5) | 6942 (1.6) | 1355 (2.5) |
| | | 16 | 19651 (9.0) | 24027 (5.5) | 4913 (9.0) |
| | | 32 | 75262 (34.4) | 89357 (20.4) | 18816 (34.4) |

Fig. 14. CTB quadtree structures (64 × 64 and 32 × 32) used in the experimental setup.

5.2. Execution time

In HEVC, each frame (luma component) is partitioned into Coding Tree Blocks (CTBs). Each CTB is recursively partitioned into Coding Blocks (CBs) and then into Transform Blocks (TBs); this partitioning is known as a quadtree. We refer to the Block Size as the TB size. For reporting execution time, we consider two different approaches:

- Uniform TB size: CTB partitioning with same-sized TBs (N = 4, 8, 16, 32). This allows for a rapid assessment of the HEVC Transform speed.
- Quadtree: We consider the two CTB quadtrees shown in Fig. 14: (i) 64 × 64 CTB: it uses 4 × 4 to 32 × 32 TBs (N = TBmax = 32), and (ii) 32 × 32 CTB: it uses 4 × 4 to 16 × 16 TBs (N = TBmax = 16). Execution time thus depends on the CTB size and quadtree structure: variable TB sizes within a CTB and between CTBs introduce extra cycles.

Based on the Table 2 formulas (which apply to consecutive same-sized TBs), we computed the execution cycles per frame for the uniform TB size and quadtree approaches (Forward and Inverse Transforms). These performance bounds assume we can feed and retrieve TB columns as fast as the hardware allows (realistic for the quadtree structure as there are extra cycles between TBs, especially for the Iterative case). We consider absolute times to be more useful. For example, the Iterative hardware (Forward Transform) requires 3,878,081 cycles for a 1920 × 1080 frame partitioned into 64 × 64 CTBs (the quadtree is that of Fig. 14(a)). This amounts to 19.39 ms (or 51 fps) at 200 MHz.

Table 5 reports the execution time per frame (ms), frame rate (fps), and throughput (average computed pixels per cycle) for the two approaches: uniform TB size and quadtree (variable TB size). Results are reported for these frame sizes: 1920 × 1080 (Full HD), 3840 × 2160 (4K UHD), 4096 × 3072 (HXGA), and 7680 × 4320 (8K UHD). Note that the frame size has a negligible effect on the average computed pixels per cycle. We report results at 200 MHz; the maximum frequency in a Zynq-7000 device depends on the implementation and ranges from 200 to 300 MHz.
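The cycle-count-to-frame-rate conversion behind these numbers is straightforward. The helpers below (our names, not the paper's) reproduce the worked example of 3,878,081 cycles at 200 MHz:

```c
/* Execution time per frame in milliseconds: cycles divided by the
   clock rate expressed in cycles per millisecond. */
static double frame_time_ms(unsigned long cycles, double clk_mhz)
{
    return (double)cycles / (clk_mhz * 1e3);
}

/* Achievable frame rate in frames per second at the given clock. */
static double frame_rate_fps(unsigned long cycles, double clk_mhz)
{
    return (clk_mhz * 1e6) / (double)cycles;
}
```

For the Iterative Forward Transform example, `frame_time_ms(3878081, 200.0)` gives 19.39 ms and `frame_rate_fps(3878081, 200.0)` about 51.6 fps, matching the 51 fps quoted above.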


Table 5
Forward/Inverse HEVC Transform hardware results for luma TB: execution time (ms), frame rate (fps), and pixels/cycle. These results are performance bounds. Uniform TB size (fixed) approach: N = TB = 4, 8, 16, 32. Quadtree structure approach (variable TB size, see Fig. 14): 32 × 32 CTB: N = 16 (we can process TB sizes 4, 8, 16); 64 × 64 CTB: N = 32 (we can process TB sizes 4, 8, 16, 32).

| Transform | Type | TB size | N | 1920 × 1080 | 3840 × 2160 | 4096 × 3072 | 7680 × 4320 | Pixels/cycle |
|---|---|---|---|---|---|---|---|---|
| Forward | Iterative | Fixed | 4 | 49.24 ms, 20.3 fps | 197.00 ms, 5.0 fps | 298.84 ms, 3.3 fps | 787.96 ms, 1.2 fps | 0.21 |
| | | | 8 | 30.78 ms, 32.4 fps | 123.12 ms, 8.1 fps | 186.77 ms, 5.3 fps | 492.48 ms, 2.0 fps | 0.33 |
| | | | 16 | 17.78 ms, 56.2 fps | 70.63 ms, 14.1 fps | 107.15 ms, 9.3 fps | 282.52 ms, 3.5 fps | 0.58 |
| | | | 32 | 9.81 ms, 101.9 fps | 39.24 ms, 25.4 fps | 59.10 ms, 16.9 fps | 155.84 ms, 6.4 fps | 1.06 |
| | | Variable | 16 | 32.35 ms, 30.9 fps | 129.41 ms, 7.7 fps | 194.88 ms, 5.1 fps | 513.86 ms, 1.9 fps | 0.32 |
| | | | 32 | 19.39 ms, 51.5 fps | 77.56 ms, 12.9 fps | 116.79 ms, 8.5 fps | 310.24 ms, 3.2 fps | 0.53 |
| | Semi parallel | Fixed | 4 | 4.53 ms, 220.4 fps | 18.14 ms, 55.1 fps | 27.52 ms, 36.3 fps | 72.57 ms, 13.7 fps | 2.28 |
| | | | 8 | 2.43 ms, 411.5 fps | 9.72 ms, 102.8 fps | 14.74 ms, 67.8 fps | 38.88 ms, 25.7 fps | 4.26 |
| | | | 16 | 1.26 ms, 790.5 fps | 5.02 ms, 199.1 fps | 7.61 ms, 131.2 fps | 20.08 ms, 49.7 fps | 8.26 |
| | | | 32 | 0.64 ms, 1555.9 fps | 2.57 ms, 389.0 fps | 3.87 ms, 258.3 fps | 10.20 ms, 97.9 fps | 16.25 |
| | | Variable | 16 | 2.10 ms, 475.9 fps | 8.40 ms, 118.9 fps | 12.65 ms, 79.0 fps | 33.37 ms, 30.0 fps | 4.97 |
| | | | 32 | 1.07 ms, 931.4 fps | 4.29 ms, 232.8 fps | 6.46 ms, 154.6 fps | 17.17 ms, 58.2 fps | 9.73 |
| | Fully parallel | Fixed | 4 | 2.59 ms, 385.7 fps | 10.36 ms, 96.4 fps | 15.72 ms, 63.5 fps | 41.47 ms, 24.1 fps | 4.00 |
| | | | 8 | 1.29 ms, 771.5 fps | 5.18 ms, 192.9 fps | 7.86 ms, 127.1 fps | 20.73 ms, 48.2 fps | 8.00 |
| | | | 16 | 0.65 ms, 1531.5 fps | 2.59 ms, 385.7 fps | 3.93 ms, 254.3 fps | 10.36 ms, 96.4 fps | 16.00 |
| | | | 32 | 0.32 ms, 3061.3 fps | 1.30 ms, 765.7 fps | 1.96 ms, 508.5 fps | 5.18 ms, 192.9 fps | 32.00 |
| | | Variable | 16 | 1.71 ms, 583.5 fps | 6.85 ms, 145.9 fps | 10.32 ms, 96.8 fps | 27.21 ms, 36.7 fps | 6.09 |
| | | | 32 | 0.88 ms, 1126.7 fps | 3.55 ms, 281.7 fps | 5.34 ms, 187.0 fps | 14.19 ms, 70.4 fps | 11.77 |
| Inverse | Iterative | Fixed | 4 | 62.85 ms, 15.9 fps | 251.42 ms, 3.9 fps | 381.42 ms, 2.6 fps | 1005.7 ms, 1.0 fps | 0.16 |
| | | | 8 | 36.45 ms, 27.4 fps | 154.87 ms, 6.8 fps | 234.94 ms, 4.5 fps | 583.20 ms, 1.7 fps | 0.28 |
| | | | 16 | 19.62 ms, 50.9 fps | 87.64 ms, 12.3 fps | 132.95 ms, 8.4 fps | 311.68 ms, 3.2 fps | 0.53 |
| | | | 32 | 10.12 ms, 98.7 fps | 48.10 ms, 24.6 fps | 72.43 ms, 16.4 fps | 160.86 ms, 6.2 fps | 1.03 |
| | | Variable | 16 | 39.09 ms, 25.5 fps | 156.38 ms, 6.4 fps | 235.50 ms, 4.2 fps | 620.94 ms, 1.6 fps | 0.26 |
| | | | 32 | 21.94 ms, 45.5 fps | 87.76 ms, 11.4 fps | 132.15 ms, 7.5 fps | 351.04 ms, 2.8 fps | 0.47 |
| | Semi parallel | Fixed | 4 | 4.53 ms, 220.4 fps | 18.14 ms, 55.1 fps | 27.52 ms, 36.3 fps | 72.57 ms, 13.7 fps | 2.28 |
| | | | 8 | 2.43 ms, 411.5 fps | 9.72 ms, 102.8 fps | 14.74 ms, 64.8 fps | 38.88 ms, 25.7 fps | 4.26 |
| | | | 16 | 1.26 ms, 790.5 fps | 5.02 ms, 199.1 fps | 7.61 ms, 131.2 fps | 20.08 ms, 49.7 fps | 8.25 |
| | | | 32 | 0.64 ms, 1555.9 fps | 2.57 ms, 389.0 fps | 3.87 ms, 258.3 fps | 10.20 ms, 97.9 fps | 16.25 |
| | | Variable | 16 | 2.10 ms, 475.9 fps | 8.40 ms, 118.9 fps | 12.65 ms, 79.0 fps | 33.37 ms, 30.0 fps | 4.97 |
| | | | 32 | 1.07 ms, 931.4 fps | 4.29 ms, 232.8 fps | 6.46 ms, 154.6 fps | 17.17 ms, 58.2 fps | 9.73 |
| | Fully parallel | Fixed | 4 | 2.59 ms, 385.7 fps | 10.36 ms, 96.4 fps | 15.72 ms, 63.5 fps | 41.47 ms, 24.1 fps | 4.00 |
| | | | 8 | 1.29 ms, 771.5 fps | 5.18 ms, 192.9 fps | 7.86 ms, 127.1 fps | 20.73 ms, 48.2 fps | 8.00 |
| | | | 16 | 0.65 ms, 1531.5 fps | 2.59 ms, 385.7 fps | 3.93 ms, 254.3 fps | 10.36 ms, 96.4 fps | 16.00 |
| | | | 32 | 0.32 ms, 3061.3 fps | 1.30 ms, 765.7 fps | 1.96 ms, 508.5 fps | 5.18 ms, 192.9 fps | 32.00 |
| | | Variable | 16 | 1.71 ms, 583.5 fps | 6.85 ms, 145.9 fps | 10.32 ms, 96.8 fps | 27.21 ms, 36.7 fps | 6.09 |
| | | | 32 | 0.88 ms, 1126.7 fps | 3.55 ms, 281.7 fps | 5.34 ms, 187.0 fps | 14.19 ms, 70.4 fps | 11.77 |

Also, with the same hardware, the variable TB size approach features lower frame rates than the fixed TB size approach. In general, we note that the Iterative hardware has very low frame rates (< 30 fps) except for the 1920 × 1080 case; thus it is useful only for low frame rates and small frame sizes (1920 × 1080).

Fig. 15. Frame rate (frames per second) vs. reconfiguration rate. Scenario 1 (Forward Transform), 64 × 64 CTB, 1920 × 1080 video resolution.

The embedded system allows for run-time hardware reconfiguration to satisfy time-varying constraints based on input data, output data, and user input [5]. Dynamic management is implemented via a software routine in C inside the PS that swaps HEVC Transform cores in response to: (i) direct constraints on resources and performance, and (ii) different coding efficiency: varying the CTB size and quadtree can change the largest TB size, e.g., switching from the quadtree of Fig. 14(a) (TBmax = 32) to the one in Fig. 14(b) (TBmax = 16); here run-time reconfiguration is required.

Current applications with large frame sizes feature quadtree structures with TB sizes up to 32 × 32 or 16 × 16 (the hardware can process smaller TBs). Cases with TB sizes up to 8 × 8 or 4 × 4 are rare (except for very small frames) and are not considered. With this in mind, we consider three run-time reconfiguration strategies (for the Forward and Inverse Transforms):

- Scenario 1 (32 × 32/16 × 16): For each implementation type (fully parallel, semi parallel, iterative), we allow for the 32 × 32 and 16 × 16 block (TB) sizes: six configurations.


Fig. 16. Resources-performance design space. The frame rate considers the effect of reconfiguration time to be negligible (K > 100). (a) Forward Transform (Scenario 1): dynamic management based on time-varying constraints on resources and frame rate (fps) as well as compression efficiency (CTB quadtree structure). Note how the system meets constraints (1)-(5). (b) Inverse Transform (Scenario 2): resources (Slices) vs. frame rate (fps) for various frame sizes.
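The constraint-driven selection illustrated in Fig. 16(a) amounts to a search over the available hardware configurations. The sketch below is our illustration, not the paper's actual management routine; the sample entries in the usage note take the Inverse Transform, N = 16, fixed TB size, 1920 × 1080 figures from Tables 4 and 5:

```c
#include <stddef.h>

typedef struct {
    const char *name;   /* implementation type */
    unsigned    slices; /* resource usage (as in Table 4) */
    double      fps;    /* frame rate at the target resolution (as in Table 5) */
} config_t;

/* Return the configuration with the fewest slices whose frame rate
   meets min_fps; NULL if no configuration qualifies. */
static const config_t *pick_config(const config_t *c, size_t n, double min_fps)
{
    const config_t *best = NULL;
    for (size_t i = 0; i < n; i++)
        if (c[i].fps >= min_fps && (best == NULL || c[i].slices < best->slices))
            best = &c[i];
    return best;
}
```

For example, with {iterative: 1119 slices, 50.9 fps}, {semi parallel: 3500 slices, 475.9 fps}, and {fully parallel: 4913 slices, 583.5 fps}, requesting at least 120 fps selects the semi-parallel core; whenever the selection changes, the corresponding partial bitstream is loaded.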

Table 6
Memory overhead for Forward and Inverse HEVC Transforms: Reconfigurable Partition (RP) size, bitstream size, and required memory.

| Scenario (# of partial bitstreams) | Transform | RP size (Slices) | Bitstream size | Reconfig. time | Required memory |
|---|---|---|---|---|---|
| 1 (6) | Forward | 200 × 98 | 4948 KB | 38.65 ms | 29688 KB |
| | Inverse | 200 × 104 | 5251 KB | 41.02 ms | 31506 KB |
| 2 (3) | Forward | 150 × 32 | 1212 KB | 9.47 ms | 3636 KB |
| | Inverse | 150 × 35 | 1326 KB | 10.35 ms | 3978 KB |
| 3 (2) | Forward | 120 × 32 | 970 KB | 7.57 ms | 1940 KB |
| | Inverse | 120 × 34 | 1031 KB | 8.05 ms | 2062 KB |
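The frame-rate drop plotted in Fig. 15 follows directly from the reconfiguration times of Table 6. A minimal model (our assumption: one reconfiguration charged per K frames, with no overlap between reconfiguration and processing) is:

```c
/* Effective frame rate when the RP is reconfigured every k frames.
   t_frame  : processing time per frame, in seconds (as in Table 5)
   t_reconf : partial reconfiguration time, in seconds (as in Table 6) */
static double effective_fps(double t_frame, double t_reconf, unsigned k)
{
    return (double)k / ((double)k * t_frame + t_reconf);
}
```

With t_reconf = 38.65 ms (Scenario 1, Forward) and the 64 × 64 CTB quadtree frame times of Table 5 (0.88 ms fully parallel, 1.07 ms semi parallel, 19.39 ms iterative), K = 1 yields about 25.3, 25.2, and 17.2 fps, consistent with the 25.28, 25.17, and 17.22 fps reported for that case.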

- Scenario 2 (only 16 × 16): For each implementation type (fully parallel, semi parallel, iterative), we only allow for the 16 × 16 block (TB) size: three hardware configurations. The Reconfigurable Partition (RP) size is reduced.
- Scenario 3 (Low Cost): We only allow the iterative implementation for the 32 × 32 and 16 × 16 block (TB) sizes: two configurations. This scenario has the smallest RP.

Table 6 reports the required memory (along with bitstream and RP sizes) for the three scenarios (for the Forward and Inverse Transforms). While the embedded system was implemented on the XC7Z045 SoC (in the ZC706 Evaluation Kit), we note that Scenarios 2 and 3 also fit in the smaller XC7Z020 SoC (in the ZED Development Board).

The average reconfiguration speed using the PCAP is 128 MB/s; the reconfiguration time is indicated in Table 6. Depending on the reconfiguration rate, the overhead can be significant. Fig. 15 illustrates how the reconfiguration rate affects the frame rate (fps) for Scenario 1 (Forward Transform) with a 64 × 64 CTB and a 1920 × 1080 frame size. We reconfigure every K frames (K = 1, 5, 10, 20, 50, 80; NR: no reconfiguration). We see that the higher the reconfiguration rate (the lower K), the lower the frame rate. The Fully Parallel and Semi Parallel implementations are the most affected (the processing time of the Iterative case is comparable to the reconfiguration time). At the maximum reconfiguration rate (every frame, K = 1), we obtain 25.28, 25.17, and 17.22 fps for the Fully Parallel, Semi Parallel, and Iterative cases, respectively. Most applications do not require such high reconfiguration rates, so in general the effect of the reconfiguration time overhead is negligible. Note that Scenario 1 was selected to illustrate the worst case since it has the largest Reconfigurable Partition; for Scenarios 2 and 3, the drop is less pronounced.

Fig. 16(a) shows the resource (Zynq-7000 Slices) and performance (fps) design space for Scenario 1 (Forward Transform) for the 1920 × 1080 frame size. Frame rates are extracted from Table 5 using the quadtrees of Fig. 14. To demonstrate dynamic management on this system, we consider a time-varying sequence of quadtree structure specifications and constraints on resources and performance:

(1) CTB 32 × 32: minimize resource usage.
(2) CTB 32 × 32: require fewer than 6000 slices subject to a frame rate of at least 120 fps.
(3) CTB 64 × 64: maximize frame rate.
(4) CTB 64 × 64: minimize resource usage.
(5) CTB 64 × 64: minimize resource usage subject to a frame rate of at least 60 fps.

Fig. 16(a) shows how the system dynamically selects hardware realizations that meet the requirements. Fig. 16(b) depicts the resources-performance design space for Scenario 2 (Inverse Transform) for various frame sizes and for the 64 × 64 CTB. Note that the Iterative case is very slow for large frame sizes. Thus, real-time applications with large frame sizes should use the Semi Parallel and Fully Parallel implementations.

5.4. Comparisons with other implementations

Table 7 provides a comparison with related HEVC Transform architectures, where the proposed approach (3 types) uses a 32 × 32 Forward HEVC Transform with fixed size (TB = 32). Most designs are multiplier-less and take advantage of the specific HEVC coefficient values (harder to adapt to other values such as those of the alternative 4 × 4 DST). Some approaches optimize the transposition buffers by size reduction [11] or SRAM-based techniques [14]. The work in [8] presents 1D Transforms that generate 32 pixels per cycle regardless of transform size (at the expense of recursively duplicating the amount of resources); it also uses transposition buffers with gated clocks (not recommended for FPGA designs) that introduce a pipeline latency of N cycles. Our approach relies on Distributed

Table 7
Comparison of different approaches (32 × 32 HEVC Forward Transform). For proper comparison, the reconfiguration time overhead is not considered in the achieved frame rate, and resources do not include the embedded interface. In the proposed architecture, as well as in [11] and [17], the pixels per cycle are reported assuming a uniform TB size of 32 × 32.

| | Proposed (fully parallel) | Proposed (semi parallel) | Proposed (iterative) | Meher et al. [8] (a) | Meher et al. [8] (b) | Chen et al. [17] | Tikekar et al. [14] | Pastuszak [11] |
|---|---|---|---|---|---|---|---|---|
| 1D Transforms | 2 unfolded units | 2 unfolded units | 2 folded units | 1 shared unfolded unit | 2 unfolded units | 1 shared unfolded unit | 1 shared folded unit | 2 folded units |
| Transpose buffers | 2 | 1 | 1 | 1 | 1 | 2 | 1 (SRAM-based) | 2 |
| Technology | Zynq-7000 SoC | Zynq-7000 SoC | Zynq-7000 SoC | TSMC 90 nm | TSMC 90 nm | 180 nm CMOS | 40 nm | TSMC 90 nm |
| Gate count | 1.7M | 1.1M | 332K | 208K | 347K | 79K | 98.1K + 16.4 Kbit SRAM | 328K |
| Pixels per cycle | 32* | 16.25* | 1.06* | 16 | 32 | 2 | 2 | 32 |
| Supported video format | 7680 × 4320 @ 193 fps | 7680 × 4320 @ 98 fps | 3840 × 2160 @ 25 fps | 7680 × 4320 @ 60 fps | 7680 × 4320 @ 60 fps | 3840 × 2160 @ 30 fps | 4096 × 3072 @ 30 fps | 7680 × 4320 @ 60 fps |
| Frequency | 200 MHz | 200 MHz | 200 MHz | 187 MHz | 187 MHz | 125 MHz | 200 MHz | 400 MHz |
| Notes | * | * | * | 1D transform always outputs 32 pixels per cycle | 1D transform always outputs 32 pixels per cycle | Inverse | Inverse | Multipliers used |

* Average pixels/cycle for a 1920 × 1080 frame.


Arithmetic and the synthesis tools to optimize resources; it takes care of embedded interfacing and it is run-time reconfigurable. Our design exhibits high frame rates, except for the Iterative case. In general, our architecture with 2 unfolded (fully pipelined) 1D transforms and 2 transpose buffers outperforms the other approaches in frame rate, though with a high gate count; note that our gate count is a rough approximation (due to inefficient mapping) derived from the FPGA logic cells. We note that the pixels per cycle reported in [8,11,14,17] are with respect to one transform block. We report the average pixels per cycle for a 1920 × 1080 frame (there is minimal variation for larger frame sizes); this considers idle cycles between blocks.

6. Conclusions

This work presented scalable HEVC Forward and Inverse Transform implementations that can dynamically adapt resources based on performance and coding efficiency requirements while supporting beyond-high-definition video formats. For common applications, the reconfiguration rate has a negligible effect on performance. The results suggest that we can incorporate both the Forward and Inverse HEVC Transforms inside a medium-sized Zynq-7000 PSoC. This work illustrates the benefits of run-time reconfigurable FPGA or PSoC technology for HEVC encoder/decoder implementations.

Acknowledgment

This material is based upon work supported by the National Science Foundation under Grant No. NSF AWD CNS-1422031.

References

[1] UG585: Zynq-7000 All Programmable SoC Technical Reference Manual, Xilinx, Inc., Nov. 2014.
[2] M. Budagavi, A. Fuldseth, V. Sze, M. Sadafale, Core transform design in the High Efficiency Video Coding (HEVC) standard, IEEE J. Sel. Top. Sign. Proces. 7 (6) (2013) 1029–1041.
[3] P.T. Chiang, T.S. Chang, A reconfigurable inverse transform architecture design for HEVC decoder, in: Proceedings of the IEEE International Symposium on Circuits and Systems, ISCAS 2013, pp. 1006–1009, May 2013.
[4] M. Jridi, P.K. Meher, Scalable approximate DCT architectures for efficient HEVC-compliant video coding, IEEE Trans. Circuits Syst. Video Technol. (2017), in press.
[5] D. Llamocca, M.S. Pattichis, Dynamic energy, performance, and accuracy optimization and management using automatically generated constraints for separable 2D FIR filtering for digital video processing, ACM Trans. Reconfigurable Technol. Syst. 7 (4) (2014) Article 4.
[6] D. Llamocca, M. Pattichis, C. Carranza, A framework for self-reconfigurable DCTs based on multiobjective optimization of the energy-performance-accuracy space, in: Proceedings of the 7th International Workshop on Reconfigurable Communication-centric Systems-on-Chip, ReCoSoC 2012, pp. 579–582, July 2012.
[7] D. Llamocca, M. Pattichis, A. Vera, Partial reconfigurable FIR filtering system using distributed arithmetic, Int. J. Reconfigurable Comput. 2010 (2010) Article ID 357978, 14 pages.
[8] P.K. Meher, S.Y. Park, B.K. Mohanty, K.S. Lim, C. Yeo, Efficient integer DCT architectures for HEVC, IEEE Trans. Circuits Syst. Video Technol. 24 (1) (2014) 168–178.
[9] J.S. Park, W.J. Nam, S.M. Han, S. Lee, 2-D large inverse transform (16 × 16, 32 × 32) for HEVC (High Efficiency Video Coding), J. Semicond. Technol. Sci. 12 (2) (2012) 203–211.
[10] G. Pastuszak, Flexible architecture design for H.265/HEVC inverse transform, Circuits Systems Signal Process. 34 (6) (2014) 1931–1945.
[11] G. Pastuszak, Hardware architectures for the H.265/HEVC discrete cosine transform, IET Image Process. 9 (6) (2015) 468–477.
[12] G. Pastuszak, A. Abramowski, Algorithm and architecture design of the H.265/HEVC intra encoder, IEEE Trans. Circuits Syst. Video Technol. 26 (1) (2016) 210–222.
[13] M. Tikekar, C.-T. Huang, C. Juvekar, V. Sze, A. Chandrakasan, A 249-Mpixel/s HEVC video-decoder chip for 4K Ultra-HD applications, IEEE J. Solid-State Circuits 49 (1) (2014) 61–72.
[14] M. Tikekar, C.-T. Huang, V. Sze, A. Chandrakasan, Energy and area-efficient hardware implementation of HEVC inverse transform and dequantization, in: Proceedings of the IEEE International Conference on Image Processing, ICIP 2014, pp. 2100–2104, Oct. 2014.
[15] K. Vipin, S.A. Fahmy, ZyCAP: Efficient partial reconfiguration management on the Xilinx Zynq, IEEE Embedded Syst. Lett. 6 (3) (2014) 41–44.
[16] V. Sze, M. Budagavi, G.J. Sullivan (Eds.), High Efficiency Video Coding (HEVC): Algorithms and Architectures, in: Integrated Circuits and Systems, Springer, 2014.
[17] Y.-H. Chen, C.-Y. Liu, Area efficient video transform for HEVC applications, Electron. Lett. 51 (14) (2015) 1065–1067.
[18] S. Yu, E.E. Swartzlander, DCT implementation with distributed arithmetic, IEEE Trans. Comput. 50 (9) (2001) 985–991.

Daniel Llamocca received the B.Sc. degree in electrical engineering from Pontificia Universidad Católica del Perú in 2002, and the M.Sc. degree in electrical engineering and the Ph.D. degree in computer engineering from the University of New Mexico at Albuquerque in 2008 and 2012, respectively. He is currently an Assistant Professor with Oakland University. His research deals with run-time automatic adaptation of hardware resources to time-varying constraints with the purpose of delivering the best hardware solution at any time. His current research interests include reconfigurable computer architectures for signal, image, and video processing; high-performance architectures for computer arithmetic, communication, and embedded interfaces; embedded system design; and run-time partial reconfiguration techniques on field-programmable gate arrays.
