L. Zhang and W. Gao: Reusable Architecture and Complexity-Controllable Algorithm for the Integer/Fractional Motion Estimation of H.264 749
Contributed Paper. Manuscript received April 10, 2007. 0098 3063/07/$20.00 © 2007 IEEE
Reusable Architecture and Complexity-Controllable Algorithm for the Integer/Fractional Motion Estimation of H.264
Li Zhang and Wen Gao, Member, IEEE
Abstract: Motion estimation is the most computationally intensive part of H.264 video coding. The motion estimation of H.264 includes integer motion estimation and fractional motion estimation. In this paper, a reusable architecture for both of them is proposed. This architecture can support fractional motion estimation and fast integer motion estimation algorithms, including diamond search and cross search. The complexity of the motion estimation algorithm implemented on the proposed architecture can be flexibly controlled to meet varying system requirements. Cycle allocation is used in the algorithm design so that the cycle budget can be fully utilized. Experimental results show that this architecture-algorithm co-design can support real-time encoding of HD (1280x720) video at 30 frames per second with 238K logic gates at an operating frequency of 117 MHz.¹
Index Terms: motion estimation, H.264, reusable architecture, complexity-controllable algorithm.
I. INTRODUCTION
The video coding standard H.264 [1] provides superior coding efficiency compared to the previous coding standards, at a much higher computational cost. Motion estimation (ME) is the most computationally intensive part of an H.264 encoder. The ME of H.264 includes two parts: first, integer motion estimation (IME) is used to find the integer-pixel motion vector (MV); then, fractional motion estimation (FME) is used to refine the MV to quarter-pixel accuracy.

In previous work, IME and FME are implemented independently by different hardware modules. For example, [2]-[5] propose dedicated architectures only for IME, and [6] proposes a dedicated architecture only for FME. In a system-level design such as [7], IME and FME are implemented as two independent stages of a pipeline. The previous architectures for IME are designed for the full search pattern, which has huge complexity, and FME also has substantial computational complexity, so the two have to be processed independently. There have been many fast search patterns for IME, such as New three step search (NTSS) [8], Block-based gradient descent search (BBGDS) [9], Four step search (FSS) [10], Hexagon-based search (HEXBS) [11], Diamond search (DS) [12], Enhanced predictive zonal search (EPZS) [13] and UMHexagons [14]. When these fast search patterns are used for IME, the complexity of IME can be greatly reduced, even below that of FME. Furthermore, the data flow of fast IME is very similar to that of FME, so the hardware module for FME can be reused for fast IME. When IME and FME are processed by the same module, hardware cost is saved.

This paper proposes a reusable architecture for both IME and FME. Based on this architecture, the complexity of the motion estimation algorithm can be flexibly controlled to meet varying system requirements. Cycle allocation is used to fully utilize the limited cycle budget and improve the coding efficiency. As an example of the flexible complexity control, an algorithm is proposed for HD 720P (1280x720) video coding. This algorithm can support HD 720P video coding at 30 frames/s when it runs on the proposed architecture.

This paper is organized as follows. In Section II, the ME of H.264 is reviewed. Section III proposes the reusable architecture. The complexity-controllable algorithm is presented in Section IV. Section V presents the experimental results, and Section VI concludes the paper.

¹ This work is done under the support of Spreadtrum Communication Inc. Li Zhang is with the Institute of Computing Technology, Chinese Academy of Sciences and the Graduate School of the Chinese Academy of Sciences, Beijing, China (email: zhangli@jdl.ac.cn). Wen Gao is with the Institute of Computing Technology, Chinese Academy of Sciences (email: wgao@jdl.ac.cn).
II. ME OF H.264
A. Integer Motion Estimation
Fig. 1. DS Search Pattern and CS Search Pattern: (a) DS, with the LDSP and SDSP; (b) CS
Since the full search of IME brings a great computational burden, many fast search patterns have been proposed for IME. Most of them are local searches that use a compact search pattern (such as a diamond). Fig. 1(a) shows the diamond search (DS) pattern in [12]: the large diamond search pattern (LDSP) consists of nine search positions, and the small diamond search pattern (SDSP) consists of five search positions. When the best position is at the center of the LDSP, the search pattern changes to the SDSP, and the best position of this SDSP is the final best MV. Local search patterns such as DS may be trapped in a local minimum when the motion is large. To solve this problem, a global search is used, which has more search points and can cover the whole search window. The cross search (CS) shown in Fig. 1(b) is one kind of global search pattern: the search positions are arranged in a cross that can span the whole search window, and the center of the cross is the initial search point of ME. The cross search extends the search range so that the search process is less likely to be trapped in a local minimum.

IEEE Transactions on Consumer Electronics, Vol. 53, No. 2, MAY 2007 750
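The DS procedure described in this subsection can be sketched in a few lines of Python. This is only an illustrative software model under stated assumptions: the `cost` callback, the `search_range` clamp and the rule that ties favor the current center are hypothetical choices that the text does not specify.

```python
# Sketch of the diamond search (DS). The cost callback and the search-range
# clamp are illustrative assumptions, not taken from the paper.

LDSP = [(0, 0), (0, -2), (0, 2), (-2, 0), (2, 0),
        (-1, -1), (-1, 1), (1, -1), (1, 1)]          # nine positions
SDSP = [(0, 0), (0, -1), (0, 1), (-1, 0), (1, 0)]    # five positions


def diamond_search(cost, start=(0, 0), search_range=16):
    """Return the integer MV with the lowest cost found by DS.

    cost(mv) is a hypothetical callback returning the matching cost of a
    candidate motion vector mv = (dx, dy).
    """
    def in_range(mv):
        return abs(mv[0]) <= search_range and abs(mv[1]) <= search_range

    def best_of(center, pattern):
        candidates = [(center[0] + dx, center[1] + dy) for dx, dy in pattern]
        candidates = [mv for mv in candidates if in_range(mv)]
        # min() keeps the first minimum, so the center wins any tie.
        return min(candidates, key=cost)

    center = start
    while True:
        best = best_of(center, LDSP)
        if best == center:        # best position is the LDSP center
            break
        center = best             # re-center the LDSP and search again
    return best_of(center, SDSP)  # final refinement with the SDSP
```

Because a move is made only when a strictly lower cost is found, the loop terminates for any cost function on the bounded search window.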
B. Fractional Motion Estimation
Fig. 2. The FME of H.264 (legend: integer pixel, half pixel, quarter pixel)
Fig. 2 illustrates the FME process of H.264. The reference pixels are obtained by interpolation. The horizontal half pixel such as gg is obtained with a 6-tap horizontal FIR:

$$\mathit{gg}' = E - 5F + 20G + 20H - 5I + J \tag{1}$$

$$\mathit{gg} = (\mathit{gg}' + 16) \gg 5 \tag{2}$$

The vertical half pixel such as kk is obtained with a 6-tap vertical FIR:

$$\mathit{kk}' = A - 5C + 20G + 20M - 5Q + S \tag{3}$$

$$\mathit{kk} = (\mathit{kk}' + 16) \gg 5 \tag{4}$$

The horizontal-vertical half pixel such as ll is obtained with a 6-tap vertical FIR:

$$\mathit{ll}' = \mathit{aa} - 5\,\mathit{bb} + 20\,\mathit{gg}' + 20\,\mathit{pp} - 5\,\mathit{qq} + \mathit{rr} \tag{5}$$

$$\mathit{ll} = (\mathit{ll}' + 512) \gg 10 \tag{6}$$

The aa, bb, pp, qq and rr are obtained in the same manner as gg'. The quarter-pixels are obtained with bilinear filters, for example:

$$a = (\mathit{gg} + \mathit{kk} + 1) \gg 1 \tag{7}$$

$$b = (\mathit{gg} + \mathit{ll} + 1) \gg 1 \tag{8}$$
Detailed information on the interpolation of fractional pixels can be found in [1].

The search pattern of FME is also shown in Fig. 2. Suppose the best integer MV points to G. The eight half-pixel positions (cc, dd, ee, ff, gg, jj, kk and ll) around G are then searched. Suppose the best of these eight half-pixel positions is ll. The eight quarter-pixel positions (a, b, c, d, e, f, g and h) around ll are then searched, and the best quarter-pixel position is the final best MV.
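The interpolation in (1)-(8) can be modeled with a short Python sketch. It is a sketch under assumptions rather than the authors' implementation: the helper names are hypothetical, and the clipping of results to the 8-bit range follows the H.264 standard [1] and is not shown in the equations above.

```python
# Sketch of the half- and quarter-pixel interpolation in (1)-(8).
# Assumption: results are clipped to the 8-bit range [0, 255], as in the
# H.264 standard; the equations in the text do not show the clipping step.

TAPS = (1, -5, 20, 20, -5, 1)


def fir6(samples):
    """Unrounded 6-tap FIR output, e.g. gg' in (1) or kk' in (3)."""
    assert len(samples) == 6
    return sum(t * s for t, s in zip(TAPS, samples))


def clip8(value):
    """Clamp a value to the 8-bit pixel range."""
    return max(0, min(255, value))


def half_pixel(samples):
    """Horizontal or vertical half pixel from six integer pixels, (2)/(4)."""
    return clip8((fir6(samples) + 16) >> 5)


def center_half_pixel(intermediates):
    """Horizontal-vertical half pixel from six unrounded values, (5)-(6)."""
    return clip8((fir6(intermediates) + 512) >> 10)


def quarter_pixel(p, q):
    """Bilinear average of two neighbouring pixels, as in (7) and (8)."""
    return (p + q + 1) >> 1
```

In a flat region the filters reproduce the input value: six pixels of value 100 give an unrounded output of 3200 and a half pixel of 100.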
C. Analysis
It can be observed that the search patterns of DS for IME and of FME are similar: first, 9 candidate positions (including the center point) are checked, and then the best position is used as the center position of the next search pass. This similarity provides an opportunity to design a reusable architecture.

In both IME and FME, the MV that has the minimal Motion_Cost is selected as the best MV:

$$\mathit{Motion\_Cost} = \mathit{SAD} + \lambda \cdot \mathit{MVD\_cost} \tag{9}$$

In (9), λ is a constant parameter, MVD_cost is the number of bits used to encode the MVD (motion vector difference), and the SAD (sum of absolute differences) measures the difference between the current block and the reference block:

$$\mathit{SAD} = \sum_{j=1}^{N} \sum_{i=1}^{M} \left| O_{i,j} - R_{i,j} \right| \tag{10}$$

O_{i,j} is one pixel of the original block, and R_{i,j} is the corresponding pixel in the reference block. For IME, R_{i,j} is directly obtained from the reference frame; for FME, R_{i,j} is obtained by interpolation. After the fractional pixels have been interpolated, the computation of SAD is the same for IME and FME, so the hardware unit for calculating SAD can be reused for both IME and FME.
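The cost in (9) and (10) can be illustrated with a small Python sketch. The names `predicted_mv` and `se_golomb_bits` are hypothetical, and estimating MVD_cost with signed Exp-Golomb code lengths is an assumption made for illustration; the text only states that MVD_cost is the number of bits used to encode the MVD.

```python
# Sketch of the cost in (9) and (10). The Exp-Golomb bit count used for
# MVD_cost is an assumption for illustration; the paper only states that
# MVD_cost is the number of bits used to encode the MVD.

def sad(original, reference):
    """Sum of absolute differences between two equally sized blocks (10)."""
    return sum(abs(o - r)
               for o_row, r_row in zip(original, reference)
               for o, r in zip(o_row, r_row))


def se_golomb_bits(value):
    """Length in bits of the signed Exp-Golomb code of value (assumed coder)."""
    code_num = 2 * value - 1 if value > 0 else -2 * value
    return 2 * (code_num + 1).bit_length() - 1


def motion_cost(original, reference, mv, predicted_mv, lam):
    """Motion_Cost = SAD + lambda * MVD_cost, as in (9)."""
    mvd = (mv[0] - predicted_mv[0], mv[1] - predicted_mv[1])
    mvd_cost = se_golomb_bits(mvd[0]) + se_golomb_bits(mvd[1])
    return sad(original, reference) + lam * mvd_cost
```

The same `sad` function serves both IME and FME in this model; only the way the reference block is produced differs, which mirrors the reuse argument made above.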
III. REUSABLE ARCHITECTURE
A. Proposed Architecture
Fig. 3. The block diagram of the proposed architecture (blocks: fetch engine, data array, cost adders 0 to 8, MV comparator; signals: memory fetch request, integer reference pixel, original MB pixel, MVD_Cost, best MV)
The proposed architecture is shown in Fig. 3. It includes four parts: the fetch engine, the data array, the cost adders and the MV comparator. In the proposed architecture, 9 search positions can be processed in parallel. The fetch engine fetches the needed integer reference pixels and original pixels from the memory. The data array provides the needed integer/fractional reference pixels for the 9 search positions. At each cycle, 9 rows of reference pixels are generated by the data array. Since the largest width of one block in H.264 is 16, each row contains 16 reference pixels. Each row is input into one cost adder.
There are 9 cost adders in the proposed architecture, and each cost adder generates the Motion_cost in (9) for one search position. One cost adder has 16 processing elements (PEs) that perform 16 AD operations in parallel (an AD operation computes the absolute difference between one original pixel and its corresponding reference pixel), so 9*16=144 AD operations in total are processed in parallel. The MV comparator compares the Motion_cost of the nine positions and selects the position with the smallest one as the best MV.
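The behavior of the cost adders and the MV comparator can be modeled in software as follows. This is a behavioral sketch only: the function names and the use of running accumulators are assumptions, and the loop over positions stands in for hardware that works in parallel.

```python
# Sketch of one clock cycle of the nine cost adders and of the MV comparator.
# Function and variable names are illustrative assumptions.

NUM_POSITIONS = 9   # search positions processed in parallel
NUM_PES = 16        # processing elements (AD operations) per cost adder


def cost_adder_cycle(accumulators, original_row, reference_rows):
    """Add one row of absolute differences to each position's running SAD."""
    assert len(original_row) == NUM_PES
    assert len(reference_rows) == len(accumulators) == NUM_POSITIONS
    for pos, ref_row in enumerate(reference_rows):
        # 16 AD operations, one per PE, then summed into the accumulator.
        accumulators[pos] += sum(abs(o - r)
                                 for o, r in zip(original_row, ref_row))
    return accumulators


def mv_comparator(sads, mvd_costs, lam):
    """Return the index of the position with the smallest Motion_cost."""
    costs = [s + lam * m for s, m in zip(sads, mvd_costs)]
    return min(range(len(costs)), key=costs.__getitem__)
```

Calling `cost_adder_cycle` once per block row (16 times for a 16x16 block) leaves the full SAD of every position in `accumulators`, after which `mv_comparator` picks the winner.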
Fig. 4. The architecture of the data array (labels: Type_0, Type_1, Type_2 and Type_3 registers; horizontal FIR x 17; vertical FIR x 35)
The architecture of the data array is shown in Fig. 4. It is configured according to the requirements of FME, since generating a reference pixel for FME is much more complex than generating a reference pixel for IME. There are 35*8 registers in the data array, which can be divided into 4 types: a type_0 register contains an integer pixel, a type_1 register contains a horizontal half pixel, a type_2 register contains a vertical half pixel, and a type_3 register contains a horizontal-vertical half pixel (such as pixel ll in Fig. 2). At each cycle, each register shifts its data to the register of the same type below it. The data array has 18*6 type_0 registers, 17*6 type_1 registers, 18*2 type_2 registers and 17*2 type_3 registers.

At each cycle, one row containing (3+16+3) integer reference pixels is fetched from the memory. The horizontal 6-tap FIR filters generate the 17 needed horizontal half pixels, which are shifted into the type_1 registers at the top of the data array; at the same time, the 18 needed integer pixels are shifted into the type_0 registers. The pixels in the type_2 and type_3 registers are obtained by applying vertical 6-tap FIR filters to the type_0 and type_1 registers in the same column.

When the data array is performing half-pixel ME, 9 positions are processed in parallel. For a 16x16 block, 22x22 integer pixels in total are needed to generate all required reference pixels. Every half-pixel search position has its first reference row once seven rows of integer pixels have been input. When the data array is performing quarter-pixel ME, the data flow is similar to that of half-pixel ME. The quarter-pixels are generated by applying bilinear filters to the relevant registers. The dashed rectangle encloses 35*3 registers; these registers contain the reference half-pixels or the half-pixels needed by the bilinear filters to generate the quarter-pixels.

When performing integer-pixel ME, the data array can be reused for different search patterns. The type_0 and type_1 registers are treated as a single type that contains integer pixels, and neither the filters nor the type_2 and type_3 registers are used. Only a subset of the type_0 and type_1 registers is selected to contain the integer reference pixels. The integer reference pixels are input row by row directly into the data array, and different registers are selected as the inputs of the cost adders for different search patterns.
Fig. 5. The configuration of data array for LDSP (input row width: 20 pixels)
Fig. 5 shows the configuration of the data array for the LDSP. At each cycle, one row containing 16+4 integer reference pixels is input into the data array. Once five reference rows have been input into the data array, every search position of the LDSP has its first reference row. The registers enclosed in the dashed rectangle contain the reference pixels needed by the 9 cost adders. The process for the SDSP is similar, but only five search positions are involved, so fewer registers are selected and only five cost adders are used.
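The row width and the start-up latency quoted for the LDSP follow from the extent of the search pattern, as the following Python sketch shows. The offset convention `(dx, dy)` and the function name are assumptions made for illustration.

```python
# Sketch relating a search pattern to the data array configuration.
# Assumption: offsets are (dx, dy) in integer pixels and the block is 16 wide.

LDSP = [(0, 0), (0, -2), (0, 2), (-2, 0), (2, 0),
        (-1, -1), (-1, 1), (1, -1), (1, 1)]
SDSP = [(0, 0), (0, -1), (0, 1), (-1, 0), (1, 0)]


def array_config(pattern, block_width=16):
    """Return (pixels per input row, rows needed before all positions start)."""
    xs = [dx for dx, _ in pattern]
    ys = [dy for _, dy in pattern]
    row_width = block_width + (max(xs) - min(xs))
    rows_before_start = (max(ys) - min(ys)) + 1
    return row_width, rows_before_start
```

For the LDSP this gives a row of 20 pixels and 5 rows, matching the text. The same reasoning gives 18 pixels and 3 rows for the SDSP, although the paper does not state these values.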
Fig. 6. The configuration of data array for horizontal direction of CS (input row width: 24 pixels)
The data array can also support the cross search pattern. For the horizontal direction, nine positions along the horizontal arm are processed in parallel, so when the length of the horizontal arm is larger than nine, the search is divided into several passes. For example, if the length of the horizontal arm is 27, 3 passes are needed. At each cycle, one row containing 16+8 integer reference pixels is input into the data array. Fig. 6 shows the configuration of the data array for the horizontal-direction search of CS; the dashed rectangle encloses the selected registers needed by the nine cost adders.
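The division of the horizontal arm into passes can be sketched as follows. The function name and the convention of measuring offsets from the left end of the arm are assumptions; the text only specifies nine positions per pass and a 16+8 pixel input row.

```python
import math


def horizontal_cs_passes(arm_length, positions_per_pass=9, block_width=16):
    """Split the horizontal arm of the cross search into passes.

    Returns a list of (first_offset, last_offset, row_width) tuples, with
    offsets measured from the left end of the arm (an assumed convention).
    """
    passes = []
    for p in range(math.ceil(arm_length / positions_per_pass)):
        first = p * positions_per_pass
        last = min(first + positions_per_pass, arm_length) - 1
        # Nine adjacent positions of a 16-wide block span 16 + 8 pixels.
        row_width = block_width + (last - first)
        passes.append((first, last, row_width))
    return passes
```

For an arm of length 27, this yields three passes of nine positions each, and every pass needs an input row of 24 pixels.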