Professional Documents
Culture Documents
.I-.
Taipei, Taiwan, R. 0. C. China
ABSTRACT O B I B 2 B i 04 05 0 6 0 7 B8 BY B I O B l i 812813 B l 4 B i
16017HIX . ~ BlilBll
12 013 1174 O M 047
This paper describes a high-throughput scalalble architec- 48 B4Y H i l l BA2 B h i
64 865 1166 B7Y BXI
ture for full-search block-matching algorithm (FSBMA).
T h e number of processing elements (PES) is scalable ac-
cording to the variable algorithm parameters and the per-
a(0.0)a(0,l)a(o.Z). , , , .
formance required for different applications. By use of the .q7,o)a(l,l)a(l,z). . , , , .
efficient PE-rings and the intelligent memory-interleaving a(2,0)a(2.l)a(2.2).
1. INTRODUCTION
Among various video compression techniques, the motion-
compensated hybrid coding is the most popular one and is
adopted by several standards and proposals [I]-[3]. Block
matching algorithm for motion estimation is nowadays used 31x31 search area
' X
current
block
v pixels
MM
lOWS
to
from .the
the right
left PE
PE
current
v
block - PE
rows
DIXBIS to time-sharing common bus to the below PE
I segment2 1
contents of segments
TO T1 T2 T3 T4 T5 T6
* time
HSA segment 0 A I n / D D m G G
HSA segment 1 B B m E E M H
HSA segrnent 2 m C C I F / F F m
--
flow from the memory modules to the P E S and redistribute
P I
A
I
R
1
C
I
D
I
E
I
F G H - - . the PES operations. This method allows the operations be-
+_ I
ssarch area of c u m n l block 0 (task 0)
longing i o a candidate block t o be performed by the several
search area of current block 1 (task 1)
- 1 search area of current black 2 (lark 2) PES. Then, for every clock cycle, P E propagates the accu-
mulated partial-result to the adjacent PE to form the next
F i g u r e 5 . Overlapped search areas of a d j a c e n t cur- partial-result of the block candidate. After 256 clock cycles,
rent blocks. the final result of each candidate block is produced in each
P E . The data flow is shown in Table 1. By use of the prop-
agation of the accumulated partial-results, each of the P E S
segments 1 and 2 of the 4 x 4 MMs are accessed by the PES, is only connected to one memory module. This eliminates
and the new data (HSA D) are input to segment 0. In this the complicated interconnection and the switching circuitry
cyclic manner, the 31 x 16 new pixels can be easily accessed between the P E S and the memory modules.
from system memory during performing the block-matching
of the current block by only one input port. 4. P E R F O R M A N C E ANALYSIS
In this section, we analyze the performance of the proposed
3.2. Memory Interleaving Organization VLSI architecture for the full search block-matching algo-
rithm. Table 2 presents a comparison between the pro-
T h e on-chip memory is used t o reduce the 1oa.d for chip posed and previous architectures. T h e 2-D array provides
1/0 and memory system. The next problem is: how to high-speed motion estimators with very high costs. T h e 1-
provide all required data for the H x V P E S simultaneously. D array is an efficient low-cost design for some low-speed
Our approach is t o divide the memory into H x V memory applications. However, the proposed architecture gives a
modules, and to interleave the input pixels to these H x satisfactory solution taking into account scalability, input
V modules. An example are shown in Fig. 7. The label ports and computation speed.
(0 N 15) for each pixel indicates the module that stores
the pixel. This memory interleaving organization provides 5. C O N C L U S I O N
a solution for parallel data accesses, but it stipulates that A scalable PE-ringed architecture for F S B M A has been de-
each of P E S is able to access the corresponding memory scribed. The number of processing elements (PES) is scal-
module in parallel. able according to the variable algorithm parameters and the
Architecture proposed architecture 2-D array [4] 1-D array [5]
No. of the PES 128 I 64 1 32 [ 16 128 1 64 1 32 1 16 128 1 64 1 32 I 16
No. ofinDut d a t a Dins 16 I 16 I 16 I 16 24 I 24 I 24 1 24 136 I 72 I 40 I 24
Clock cycles per
block matching 512 1024 2048 4096 512 1024 2048 4096 512 1024 2048 4096
...
NBXT
...
...
...
...
1232