1997 IEEE International Symposium on Circuits and Systems, June 9-12,1997, Hong Kong
A NOVEL SCALABLE ARCHITECTURE WITH MEMORY INTERLEAVING ORGANIZATION FOR FULL SEARCH BLOCK-MATCHING ALGORITHM
Yeong-Kang Lai, Liang-Gee Chen, Tsung-Hun. Tsai, and Po-Cheng Wu

Department of Electrical Engineering
National Taiwan University
Taipei, Taiwan, R.O.C. China
ABSTRACT

This paper describes a high-throughput scalable architecture for the full-search block-matching algorithm (FSBMA). The number of processing elements (PEs) is scalable according to the variable algorithm parameters and the performance required for different applications. By use of efficient PE-rings and an intelligent memory-interleaving organization, the efficiency of the architecture can be increased. Techniques for reducing interconnections and external memory accesses are also presented. Our results demonstrate that the scalable PE-ringed architecture is a flexible and high-performance solution for FSBMA.

1. INTRODUCTION

Among various video compression techniques, motion-compensated hybrid coding is the most popular one and is adopted by several standards and proposals [1]-[3]. The block-matching algorithm for motion estimation is nowadays used in a wide variety of applications. It removes the temporal redundancy within frame sequences and thus provides these coding systems with significant bit-rate reduction. However, it also requires a large amount of computation and a heavy memory bandwidth.

Different real-time video applications have different performance requirements. These requirements can be evaluated by some essential parameters such as the block size, the search area size, the frame size, and the frame rate. To meet a real-time video application, the number of PEs depends on the performance requirements. For numerous application domains, optimal block-matching parameters have not been specified yet. Furthermore, flexibility in parameters is explicitly favored for emulation of algorithms in an early phase of system development. Hence, there exists a substantial need for a flexible motion estimator, allowing the user to select his own parameters and to check the influence of parameter variations.

Several dedicated hardware implementations have been realized for full-search (FS) or full-search-based block matching algorithms [4]-[11]. Principally, these realizations are systolic arrays, laid out for a specific set of parameter values. Hence they usually offer only a limited flexibility, or even no flexibility at all. Moreover, a parallel architecture with multiple PEs can efficiently increase throughput. However, the number of input pins, the difficulties of data addressing, and the interconnection complexity between memory modules and PEs make it hard to implement. We propose a novel scalable PE-ringed architecture that effectively solves all these problems.

In this paper, a scalable, parametrizable and yet efficient full search block-matching architecture is described. The proposed approach offers various data flows for different parameters and thus achieves an efficiency close to 100 percent. It also utilizes data reuse to reduce the number of input pins. Section 2 shows the overall scalable architecture. In Section 3, we present some techniques for memory bandwidth reduction, data arrangement, and interconnection simplification. In Section 4, the performance of the proposed VLSI architecture along with various conventional full search block-matching architectures is analyzed. Finally, Section 5 gives a conclusion.

Figure 1. Template block and search area for 256 possible displacements (N = 16, p = 8).
[Figure labels: 16x16 template block from current frame, with pixels a(0,0), a(0,1), a(0,2), ...; 31x31 search area from previous frame; candidate blocks B0 to B255.]

2. THE PROPOSED VLSI ARCHITECTURE

The procedure of a block-matching algorithm is to find the best matched displaced block from the previous frame F_{t-1}, within a search range, for each N x N block in the present frame F_t. A straightforward method, the full search, exhaustively matches all possible candidates to find the displacement (called the motion vector) with minimal distortion. As a criterion of distortion, the mean absolute difference (MAD) is calculated for each candidate location (u, v):
0-7803-3583-X/97/$10.00 ©1997 IEEE

\[ \mathrm{MAD}(u, v) = \sum_{l=0}^{N-1} \sum_{m=0}^{N-1} \left| F_t(x+l,\, y+m) - F_{t-1}(x+l+u,\, y+m+v) \right| \tag{1} \]

where (x, y) is the coordinate of the upper-left pixel of the current block in F_t, and the values of u and v are limited to the range -p to p-1. Fig. 1 illustrates the procedure of FSBMA. The candidate blocks are labeled B0 to B255.

Fig. 2 shows the scalable PE-ringed architecture to perform the FSBMA. To exploit the parallelism in FSBMA, the proposed architecture consists of H x V identical on-chip Memory Modules (MMs) and H x V identical processing elements (PEs). The PEs are connected in a ringed fashion. Each of the PEs is composed of an absolute difference unit, an accumulator, and a final-result latch in a pipelined fashion. The detailed hardware architecture of a PE is shown in Fig. 3. According to the performance requirements of different real-time video applications, the number of the MMs and the PEs, i.e., H x V, is determined by the following evaluation equation. Assume that the cycle time is T_p for the N x N current block with the search range of -p to p-1.

\[ T = \frac{\frac{W \times L}{N^2} \times (2p)^2 \times N^2 \times T_p}{H \times V} \le \frac{1}{f_r} \tag{2} \]

where T denotes the total block-matching time of all the blocks in a frame (frame size: W x L). It should be smaller than the frame period, 1/f_r. The search area pixels to be computed are first stored in the MMs. The current block pixels are sequentially input and broadcast to all PEs in a raster-scan order. The upper-left H x V of all the candidate blocks are computed first. After 256 clock cycles, the MAD of each candidate block is produced in each PE and transferred to the PE's final-result latch. These H x V latched MADs can then be sent to the minimum extractor to get a minimum MAD. During the comparisons, the PEs continue to perform the block matching of the next H x V candidate blocks. It does not consume extra cycles to fill up the motion vector pipeline operations.

Figure 2. The scalable PE-ringed architecture.
[Figure labels: H MM columns and V MM rows; H PE columns and V PE rows; current block pixels; time-sharing common bus.]

Figure 3. Architecture of PE.
[Figure labels: inputs from the above PE, from the left PE, from memory module, and current block; outputs to the right PE, to the below PE, and to time-sharing common bus.]

3. TECHNIQUES FOR DATA MANAGEMENT

This architecture is based on some data management techniques: (1) On-chip memory configuration for data-reuse, (2) Memory interleaving organization for parallel data accesses, and (3) Propagation of the accumulated partial-results for eliminating the interconnection overheads between the PEs and the MMs. These techniques are described in the following subsections. Fig. 4 shows a simple example to illustrate the operation of the proposed architecture. In terms of (2), we assume that 4 x 4 MMs and 4 x 4 PEs are needed to meet a certain real-time application.

3.1. On-chip Memory Configuration for Data-Reuse

The memory bandwidth is the main bottleneck in supporting high performance motion estimators. However, pixels are repeatedly used several times to evaluate different candidate blocks. To avoid the extremely high bandwidth requirements for chip I/O and memory systems, we utilize the overlap between the search areas of adjacent current blocks (see Fig. 5) and propose the scheme of three half-search-area (HSA) segments. Each of the memory modules is partitioned into three HSA segments, as shown in Fig. 6(a). One HSA block (31 x 16 pixels) is interleaved to reside in the same labeled segments in the 4 x 4 MMs. Fig. 6(b) illustrates the operations of the three HSA segments. When executing task 0 (matching current block 0), the PEs access the search area pixels from segments 0 and 1. While performing the FS of current block 0, HSA segment 2 in all memory modules is being filled by the right half of the search area for task 1 (HSA C). When executing task 1 (matching current block 1), the search area pixels from the
segments 1 and 2 of the 4 x 4 MMs are accessed by the PEs, and the new data (HSA D) are input to segment 0. In this cyclic manner, the 31 x 16 new pixels can be easily accessed from system memory during the block matching of the current block through only one input port.

Figure 4. A simple architecture example for N = 4, p = 2, H = 4, and V = 4. (Some interconnections are omitted.)

Figure 5. Overlapped search areas of adjacent current blocks.
[Figure labels: HSA blocks (31 x 16) A, B, C, D, E, F, G, H, ...; current blocks (16 x 16); search area of current block 0 (task 0); search area of current block 1 (task 1); search area of current block 2 (task 2).]

Figure 6. (a) The three partitioned HSA segments. (b) Contents of the three HSA segments (write operations are marked by small frames, shown here as brackets).

Time             T0    T1    T2    T3    T4    T5    T6
HSA segment 0    A     [D]   D     D     [G]   G     G
HSA segment 1    B     B     [E]   E     E     [H]   H
HSA segment 2    [C]   C     C     [F]   F     F     [...]
Ti: task i (full search for current block i)

3.2. Memory Interleaving Organization

The on-chip memory is used to reduce the load on chip I/O and the memory system. The next problem is how to provide all required data for the H x V PEs simultaneously. Our approach is to divide the memory into H x V memory modules, and to interleave the input pixels across these H x V modules. An example is shown in Fig. 7. The label (0 to 15) for each pixel indicates the module that stores the pixel. This memory interleaving organization provides a solution for parallel data accesses, but it stipulates that each of the PEs is able to access the corresponding memory module in parallel.

3.3. Propagation of the accumulated partial-results

In most multiple-PE designs, a fully connected interconnection is demanded between the multiple PEs and the multiple MMs. However, this interconnection adds extra delay and consumes more routing space due to the large numbers of buses, multiplexers and tri-state buffers. To overcome this drawback, the adjacent PEs are connected in horizontal rings and vertical rings, and we arrange the data flow from the memory modules to the PEs and redistribute the PE operations. This method allows the operations belonging to a candidate block to be performed by several PEs. Then, for every clock cycle, a PE propagates the accumulated partial-result to the adjacent PE to form the next partial-result of the block candidate. After 256 clock cycles, the final result of each candidate block is produced in each PE. The data flow is shown in Table 1. By use of the propagation of the accumulated partial-results, each of the PEs is only connected to one memory module. This eliminates the complicated interconnection and the switching circuitry between the PEs and the memory modules.

4. PERFORMANCE ANALYSIS

In this section, we analyze the performance of the proposed VLSI architecture for the full search block-matching algorithm. Table 2 presents a comparison between the proposed and previous architectures. The 2-D array provides high-speed motion estimators at very high cost. The 1-D array is an efficient low-cost design for some low-speed applications. However, the proposed architecture gives a satisfactory solution taking into account scalability, input ports and computation speed.

5. CONCLUSION

A scalable PE-ringed architecture for FSBMA has been described. The number of processing elements (PEs) is scalable according to the variable algorithm parameters and the
Table 2. Comparison between the proposed and previous architectures.

Architecture             No. of PEs   No. of input data pins   Clock cycles per block matching
Proposed architecture    128          16                       512
                         64           16                       1024
                         32           16                       2048
                         16           16                       4096
2-D array [4]            128          24                       512
                         64           24                       1024
                         32           24                       2048
                         16           24                       4096
1-D array [5]            128          136                      512
                         64           72                       1024
                         32           40                       2048
                         16           24                       4096
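
The clock-cycle entries of Table 2 and the bound in Eq. (2) can be checked with a short calculation. The Python sketch below is illustrative only. The function names are hypothetical, and the CIF frame size (352 x 288), the 30 frames/s rate and the 50 MHz clock are assumptions chosen for the example rather than values from the paper. It also assumes that each PE performs one absolute-difference accumulation per clock cycle and that a frame holds (W/N) x (L/N) blocks.

```python
def cycles_per_block(n, p, num_pes):
    """Clock cycles to match one N x N block against (2p)^2 candidates.

    Assumes one absolute-difference accumulation per PE per cycle and
    full (100 percent) PE utilization.
    """
    return (2 * p) ** 2 * n ** 2 // num_pes


def min_pes(width, height, n, p, frame_rate_hz, clock_hz):
    """Smallest H*V for which T <= 1/f_r in Eq. (2), with T_p = 1/clock_hz."""
    blocks = (width // n) * (height // n)          # blocks per frame (assumed)
    total_ops = blocks * (2 * p) ** 2 * n ** 2     # accumulations per frame
    # Integer ceiling of total_ops * frame_rate / clock, avoiding floats.
    return -(-total_ops * frame_rate_hz // clock_hz)


# N = 16, p = 8, as in Table 2.
table2_cycles = [cycles_per_block(16, 8, k) for k in (128, 64, 32, 16)]

# Hypothetical operating point: CIF frames, 30 frames/s, 50 MHz clock.
required = min_pes(352, 288, 16, 8, 30, 50_000_000)
```

With N = 16 and p = 8, `table2_cycles` evaluates to 512, 1024, 2048 and 4096, which agrees with Table 2. Under the assumed operating point, `required` is 16, i.e. a 4 x 4 arrangement of PEs would suffice.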
Table 1. Data flow for FSBMA (N = 16, p = 8)

Time | Data Sequence | PE0 | PE1 | ... | PE14 | PE15
OX 1640 I
... ... ...
... ... ...
O X 16+14 .(0,14) B2
OX16+16 .(0,16) B1
1X 16+0 .(1,0) B48 B49
1X 1 6 + 1 .(l,l) BKl
... ... ... ... ... ...
... ... ... ... ...
1x16114 a(1.14) BKO
1 X 16+16 &(1,16) 849 B60
... ... ...
... ... ...
... ... ...
... ... ... ... ... ...
14X 16+0 r(14,O) 832
14X 16+1 414.1) 836 B31
I ... I ... I . . . I .. . I . I ... I
14X16+14 a(14,14) 834 B36 . B16 B17
14x16+16 .(14,16) B33 834 . B19 816
16X 16+0 a(16.0) B16 Blt . B1 B3
16X 16+1 ~ ( 1 6 ~ 1 ) B19 B16 . B1 B2
... ... ... . . . . . . . ...
... ... ... . . . . . . . ...
16116+14 a(lK,l4) B11 B19 . BO B1
16X16+16 a( 1 6 , l S ) El? Bl8 . B3 BO
I ... I ... I ... l . . . I . I ... I ...
I I 1 . 1 ... 1
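
As a functional reference for what the data flow in Table 1 computes, the following Python sketch evaluates Eq. (1) for every candidate displacement and keeps the minimum. It is a plain software model, not the PE-ringed schedule itself. The function names, the `frame[row][column]` indexing convention and the synthetic test frames are assumptions made for illustration.

```python
def mad(cur, prev, x, y, u, v, n):
    """Distortion of Eq. (1) for displacement (u, v).

    Frames are indexed as frame[row][column], so F(x, y) is frame[y][x].
    """
    total = 0
    for m in range(n):
        for l in range(n):
            total += abs(cur[y + m][x + l] - prev[y + m + v][x + l + u])
    return total


def full_search(cur, prev, x, y, n, p):
    """Exhaustively test all (2p)^2 displacements with -p <= u, v <= p - 1."""
    best_cost, best_mv = None, None
    for v in range(-p, p):
        for u in range(-p, p):
            cost = mad(cur, prev, x, y, u, v, n)
            if best_cost is None or cost < best_cost:
                best_cost, best_mv = cost, (u, v)
    return best_mv, best_cost


# Synthetic frames: the current frame is the previous frame displaced by
# (u, v) = (1, -2), so the search should recover exactly that vector.
SIZE = 12
prev = [[(7 * r + 13 * c) % 31 for c in range(SIZE)] for r in range(SIZE)]
cur = [[prev[(r - 2) % SIZE][(c + 1) % SIZE] for c in range(SIZE)]
       for r in range(SIZE)]

mv, cost = full_search(cur, prev, x=4, y=4, n=4, p=2)
```

For these synthetic frames the search returns the motion vector (1, -2) with zero distortion. The parameters N = 4 and p = 2 match the small example of Fig. 4; with N = 16 and p = 8 the same loops visit the 256 candidates B0 to B255.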
Figure 7. An example for the distribution of search area pixels to 4 x 4 memory modules.
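
One way to picture the interleaving of Fig. 7 is a modular mapping from pixel coordinates to module labels. The Python sketch below assumes the label is (row mod V) x H + (column mod H). This formula and the function names are assumptions for illustration, since the exact labeling of Fig. 7 is not recoverable here. The sketch checks the property that Section 3.2 requires: any H x V window of the search area touches each of the H x V modules exactly once, so the PEs can read in parallel without conflicts.

```python
def module_of(row, col, h=4, v=4):
    """Label (0 .. h*v - 1) of the memory module assumed to hold pixel (row, col)."""
    return (row % v) * h + (col % h)


def window_modules(top, left, h=4, v=4):
    """Sorted module labels touched by the v-row by h-column window at (top, left)."""
    return sorted(
        module_of(top + r, left + c, h, v)
        for r in range(v)
        for c in range(h)
    )


# Every 4 x 4 window inside a 31 x 31 search area should hit all 16 modules once.
conflict_free = all(
    window_modules(top, left) == list(range(16))
    for top in range(31 - 4 + 1)
    for left in range(31 - 4 + 1)
)
```

Under this assumed mapping `conflict_free` is true, because any four consecutive rows (or columns) cover every residue modulo 4 exactly once.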
performance required for different applications. A configuration of random-access on-chip memory modules solves the problems of chip I/O and memory bandwidth requirement. Input data are arranged in the memory modules by memory interleaving organization. Combined with a technique of the propagation of accumulated partial results, the interleaved memory module provides every PE with its required data simultaneously without introducing complicated interconnections and switching circuitry. In summary, the proposed architectures have the following desirable features: (1) low hardware cost, (2) high throughput rate, (3) low latency delay, (4) low I/O and memory bandwidth requirements, and (5) 100 percent computation efficiency.

REFERENCES

[1] CCITT SGXV Working Party XV/1 Specialists Group on Coding for Visual Telephony, Document #584, Nov. 1989.
[2] ISO/IEC/JTC1/SC29/WG11/Draft/CD 13818-2, "Recommendation H.262," November 1993.
[3] ISO/IEC/JTC1/SC29/WG11/Draft/CD 13818-1, "MPEG-2 Systems," November 1993.
[4] Luc De Vos and Michael Stegherr, "Parameterisable VLSI Architectures for the Full-Search Block-Matching Algorithm," IEEE Transactions on Circuits and Systems, vol. 36, no. 10, pp. 1309-1316, Oct. 1989.
[5] K. M. Yang, M. T. Sun, and L. Wu, "A family of VLSI designs for the motion compensation block-matching algorithm," IEEE Transactions on Circuits and Systems, vol. 36, no. 10, pp. 1317-1325, Oct. 1989.
[6] Y. S. Jehng, L. G. Chen, and T. D. Chiueh, "An efficient and simple VLSI tree architecture for motion estimation algorithm," IEEE Transactions on Signal Processing, vol. 41, no. 2, Feb. 1993.
[7] H. M. Jong, L. G. Chen, and T. D. Chiueh, "Parallel Architecture for 3-Step Hierarchical Search Block-Matching Algorithm," IEEE Transactions on Circuits and Systems for Video Technology, vol. 4, no. 4, Aug. 1994.
[8] Shifan Chang, Juin-Haur Hwang, and Chein-Wei Jen, "Scalable Array Architecture for Full Search Block Matching," IEEE Transactions on Circuits and Systems for Video Technology, vol. 5, no. 4, pp. 332-353, Aug. 1995.
[9] Luc De Vos and Matthias Schobinger, "VLSI Architecture for a Flexible Block Matching Processor," IEEE Transactions on Circuits and Systems for Video Technology, vol. 5, no. 5, pp. 417-428, Oct. 1995.
[10] Santanu Dutta and Wayne Wolf, "A Flexible Parallel Architecture Adapted to Block-Matching Motion-Estimation Algorithms," IEEE Transactions on Circuits and Systems for Video Technology, vol. 6, no. 1, pp. 74-86, Feb. 1996.
[11] Gangan Gupta and Chaitali Chakrabarti, "Architectures for Hierarchical and Other Block Matching Algorithms," IEEE Transactions on Circuits and Systems for Video Technology, vol. 5, no. 6, pp. 477-489, Dec. 1995.