
NEW 2^n DCT ALGORITHMS SUITABLE FOR VLSI IMPLEMENTATION

Pierre DUHAMEL and Hedi H'MIDA

CNET/PAB/RPE, 38-40, rue du Général Leclerc, 92131 Issy-les-Moulineaux (FRANCE)

ABSTRACT
Small length Discrete Cosine Transforms (DCT's) are used for image data compression. In that case, length 8 or 16 DCT's need to be performed at video rate. We propose two new implementations of DCT's which have several interesting features, as far as VLSI implementation is concerned. A first one, using modulo-arithmetic, needs only one multiplication per input point, so that a single multiplier is needed on-chip. A second one, based on a decomposition of the DCT into polynomial products, and evaluation of these polynomial products by distributed arithmetic, results in a very small chip, with a great regularity and testability. Furthermore, the same structure can be used for FFT computation by changing only the ROM-part of the chip. Both new architectures are mainly based on a new formulation of a length-$2^n$ DCT as a cyclic convolution, which is explained in the first section of the paper.

I. INTRODUCTION
In recent years, many fast DCT algorithms were proposed, among which three are of major interest: the CHEN-FRALICK algorithm [1], the B.G. LEE algorithm [3], and the VETTERLI-NUSSBAUMER algorithm [4].

The first one, having been proposed a long time ago, has been considered for VLSI implementation several times, although it does not meet the minimum arithmetic complexity [2]. The other ones meet the minimum known number of both multiplications and additions to implement a length $2^n$ DCT algorithm. Furthermore, it has been shown that, if these algorithms could be improved, the same approach would also improve a whole class of algorithms (i.e. 1-D and 2-D FFT's, DST, ...) [6]. From a practical point of view, the algorithm by LEE has a greater regularity than the VETTERLI-NUSSBAUMER algorithm, but has poor roundoff noise performances, due to the 1/cos coefficients. Both of them have been implemented in hardware (or silicon) [5].

With those considerations in mind, one can see that there is still some need for DCT algorithms meeting the following three characteristics altogether:

- minimum arithmetic complexity (or low hardware cost),
- great regularity of the graph (the availability of a length N/2 DCT inside a length N DCT is often required),
- good noise performances.

While it is possible to obtain "classical" algorithms meeting these three points (the paper describing them is in the process of being written), we propose in this paper two completely new approaches that have several interesting features, as far as VLSI implementation is concerned.

Furthermore, these approaches are described from considerations that have theoretical importance, since, for the first time, length $2^n$ DCT's are stated in terms of cyclic convolution, which allows one to obtain quickly their multiplicative complexity (we thus obtain a second derivation of a result by M.T. HEIDEMAN).

We will give in this paper only sketches of proofs for the derivations of the algorithms, since our aim is to show that the understanding of the underlying mathematical structure of the DCT can lead to new efficient algorithms.

II. THE LENGTH 2^n DCT AS POLYNOMIAL PRODUCTS
The DCT is defined as follows:

$$X_k = \sum_{i=0}^{N-1} x_i \cos\frac{2\pi (2i+1) k}{4N}, \qquad N = 2^n \qquad (1)$$
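As a point of reference, eq. (1) can be evaluated literally in $O(N^2)$ operations. The following is only a minimal sketch, assuming the unnormalized definition above (no scaling factor on any output); the function name `dct_direct` is ours and does not come from the paper.

```python
import math


def dct_direct(x):
    """Evaluate eq. (1) term by term:
    X_k = sum_i x_i * cos(2*pi*(2i+1)*k / (4N)), with N = len(x)."""
    n = len(x)
    return [
        sum(x[i] * math.cos(2 * math.pi * (2 * i + 1) * k / (4 * n))
            for i in range(n))
        for k in range(n)
    ]


if __name__ == "__main__":
    # A constant input only excites the k = 0 output.
    print(dct_direct([1.0, 1.0, 1.0, 1.0]))
```

Such a direct evaluation is of no practical interest for a chip, but it is a convenient reference against which any fast algorithm can be checked.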

The equivalence between the above DCT and a cyclic convolution is obtained through two permutations of the input variables.

The first one, already given in [4], is used to change the terms $(2i+1)$ in (1) into $(4i+1)$. Let us define:

$$x'_i = x_{2i}, \qquad x'_{N-1-i} = x_{2i+1}, \qquad i = 0, \dots, N/2-1$$

Eq. (1) can then be written as:

$$X_k = \sum_{i=0}^{N-1} x'_i \cos\frac{2\pi (4i+1) k}{4N} \qquad (2)$$

The second one will allow us to change a product of indices $(4i+1)(4k+1)$ into a sum of indices $u_i + v_k$. This result is obtained through the use of a one-to-one correspondence between the set of integers of the form $(4i+1)$, $i = 0, \dots, 2^n-1$, and the successive powers of 5 modulo $2^{n+2}$. It is always possible to write:

$$4i+1 = \langle 5^{u_i} \rangle_{2^{n+2}}, \qquad i = 0, \dots, 2^n-1 \qquad (3)$$

where $\langle a \rangle_M$ denotes $a$ modulo $M$. The permutation (4) is then always feasible:

$$x''_i = x'_{(\langle 5^i \rangle_{4N} - 1)/4} \qquad (4)$$

and eq. (2) becomes as follows:

$$X_k = \sum_{i=0}^{N-1} x''_i \cos\frac{2\pi \langle 5^i \rangle_{4N}\, k}{4N} \qquad (5)$$

Let us now consider separately the even terms $\{X_{2k}\}$ and the odd terms $\{X_{2k+1}\}$ of the DCT.

It is well known, and fairly obvious from eq. (1), that $X_{2k}$ is the output of a DCT of length N/2. Hence, the following decomposition, on the odd terms, will apply recursively on the DCT's of reduced lengths.

When considering only the odd terms $\{X_{2k+1}\}$, eq. (1) is now symmetrical in i and k, and the two permutations described above are now feasible in k, with the only difference that there are N/2 terms $X_{2k+1}$ and N input terms, thus resulting in the $(x''_i - x''_{i+N/2})$ term of eq. (7) (see [8] for more details). Hence, we have as a result:

$$X''_k = \sum_{i=0}^{N/2-1} x'''_i \cos\frac{2\pi \langle 5^{i+k} \rangle_{4N}}{4N}, \qquad k = 0, \dots, N/2-1 \qquad (6)$$

where $X''_k$ denotes the permuted odd terms, and:

$$x'''_i = x''_i - x''_{i+N/2} \qquad (7)$$

Let us now define the following polynomials:

$$X(z) = \sum_{k=0}^{N-1} x''_k z^k \qquad (8)$$

$$Y(z) = \sum_{k=0}^{N/2-1} X''_k z^{-k} \qquad (9)$$

$$V(z) = \sum_{i=0}^{N/2-1} \cos\frac{2\pi \langle 5^i \rangle_{4N}}{4N}\, z^{-i} \qquad (10)$$

Eq. (6) can now be reformulated as:

$$Y(z) = X(z) \cdot V(z) \bmod (z^{N/2} + 1) \qquad (11)$$

i.e. the odd terms of the DCT can be stated as a polynomial product of length N/2.

This can be applied recursively to the length N/2 DCT arising from the computation of the even terms, and so on, thus resulting in a complete formulation of the DCT as polynomial products.

When considering these polynomial products, it is easily recognized that the polynomials involving the input of the DCT are all reductions of $X(z)$ modulo the cyclotomic factors of $z^N - 1$ ($N = 2^n$). Knowing this, one can see that the whole set of polynomial products is equivalent to a cyclic convolution (i.e. a polynomial product modulo $z^N - 1$) followed by a reduction modulo the cyclotomic factors of $z^N - 1$. The sequence to be cyclically convolved with the input data remains to be found. But, since we know, by successive applications of eq. (10) to the DCT's of decreasing length N, N/2, N/4, ..., the expression of the unknown polynomial modulo the cyclotomic factors, it is easy to reconstruct the initial one, given in eq. (12):

$$V(z) = \sum_{i=0}^{N-1} \dots \qquad (12)$$

We have now established that the DCT of length $N = 2^n$ can be obtained as shown in fig. (1).

III. THEORETICAL CONSEQUENCES
As a side result, it is now very easy to get an upper bound on the multiplicative complexity of the length $2^n$ DCT.

It has been shown by WINOGRAD [9] that the multiplicative complexity of a cyclic convolution of length $2^n$ is given by:

$$\mu(\text{conv. } 2^n) = 2^{n+1} - n - 1 \qquad (13)$$

Furthermore, one of the multiplications involved in the DCT computed as a convolution, as shown in fig. (1), is trivial ($V(z) \bmod (z - 1) = 1$). We then obtain, as an upper bound:

$$\mu(\text{DCT } 2^n) = 2^{n+1} - n - 2 \qquad (14)$$

It is possible (but more intricate) to show, by using some other results of WINOGRAD, that this upper bound is also the lower bound. This result was already obtained by M.T. HEIDEMAN [11].

Consequences of practical importance can be obtained by observation of table 1, containing the comparison between this lower bound and the practical algorithms for short lengths:

N      lower bound    CHEN    LEE    VETTERLI
4           4           6       4        4
8          11          16      12       12
16         26          44      32       32

Table 1: comparison of the lower bound and the practical algorithms (number of multiplications)

In fact, observation of table 1 tells us that the computation of a DCT of length $2^n$ needs more than one multiplication per point for the lengths of interest (N = 8 or 16), whatever the algorithm will be. As a consequence, if a DCT chip is needed to work in real time at video rates (which is the case in many DCT applications), implementation of a DCT will need more than one multiplier on-chip.
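The decomposition of section II (even terms as a DCT of length N/2, odd terms as a polynomial product modulo $z^{N/2}+1$) can be checked numerically. The sketch below is only an illustration under our own assumptions: the function names, the explicit index folding with signs (based on the cosine symmetries $X_{4N-m} = X_m$ and $X_{2N \pm m} = -X_m$), and the floating-point evaluation are ours, not taken from the paper.

```python
import math


def dct_direct(x):
    """Reference: literal evaluation of eq. (1)."""
    n = len(x)
    return [sum(x[i] * math.cos(2 * math.pi * (2 * i + 1) * k / (4 * n))
                for i in range(n)) for k in range(n)]


def fold_index(m, n):
    """Map an odd index m (mod 4n) to (sign, k), 0 <= k < n, using
    X_{4n-m} = X_m and X_{2n-m} = X_{2n+m} = -X_m."""
    m %= 4 * n
    if m < n:
        return 1, m
    if m < 2 * n:
        return -1, 2 * n - m
    if m < 3 * n:
        return -1, m - 2 * n
    return 1, 4 * n - m


def odd_terms(x):
    """Odd DCT outputs as a product modulo z^(N/2) + 1 (negacyclic)."""
    n = len(x)
    half, mod = n // 2, 4 * n
    xp = [0.0] * n                      # first permutation: (2i+1) -> (4i+1)
    for i in range(half):
        xp[i] = x[2 * i]
        xp[n - 1 - i] = x[2 * i + 1]
    powers = [pow(5, u, mod) for u in range(n)]    # successive powers of 5 mod 4N
    xpp = [xp[(p - 1) // 4] for p in powers]       # second permutation
    a = [xpp[u] - xpp[u + half] for u in range(half)]
    h = [math.cos(2 * math.pi * powers[w] / mod) for w in range(half)]
    result = {}
    for k in range(half):
        acc = 0.0
        for u in range(half):
            w = u + k                   # the cosine sequence changes sign after N/2 steps
            acc += a[u] * (h[w] if w < half else -h[w - half])
        sign, idx = fold_index(powers[k], n)
        result[idx] = sign * acc
    return result


def dct_by_polynomial_products(x):
    """Recursive DCT: even terms from a half-length DCT, odd terms from odd_terms."""
    n = len(x)
    if n == 1:
        return [x[0]]
    out = [0.0] * n
    out[0::2] = dct_by_polynomial_products(
        [x[i] + x[n - 1 - i] for i in range(n // 2)])
    for idx, value in odd_terms(x).items():
        out[idx] = value
    return out
```

For a length that is a power of two, `dct_by_polynomial_products(x)` should agree with `dct_direct(x)` up to rounding. Each recursion level contributes one product modulo $z^{N/2^j}+1$, which mirrors the cyclotomic factors of $z^N - 1$ mentioned in the text.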

IV. THE DCT COMPUTED BY NTT
Nevertheless, there is still a way of obtaining a DCT algorithm with one multiplication per point (hence one multiplier on-chip):

Since we have now established the DCT as a cyclic convolution, we can use Number Theoretic Transforms (NTT) [12] to compute the convolution, and the scheme of fig. (1) now becomes as shown in fig. (2).

Furthermore, since the computation of the result modulo the cyclotomic factors of $z^N - 1$ is obtained as intermediate variables inside the inverse NTT, the butterflies shown in fig. (2) can be simplified with the last operations involved in the computation of the NTT$^{-1}$ box. This results in the diagram shown in fig. 3 for N = 8.

It should be noted that this corresponds to a favorable case for NTT's to be used:

- Since NTT's are generally performed on short-length sequences (N = 16 seems to be a maximum), we avoid the usual problem arising in NTT that, due to the relationship between $\alpha$, the N-th root of unity, N, the length of the transform, and M, the modulus ($\alpha^N \equiv 1 \bmod M$), it is often impossible to use 2 as a root of unity (thus avoiding multiplications in the NTT) for even moderate lengths.
- What is needed to compute the DCT is really a cyclic convolution, and there is no need of the overlap-add or overlap-save algorithms to obtain a linear convolution, as is needed in FIR filtering.
- The modulo arithmetic is not such a problem, since, with the given constraints, we can work modulo a Fermat number, or a pseudo Fermat number [13], which gives one of the simplest known modulo-arithmetics [14]. In this case, $\alpha$ in fig. (3) represents only a shift, and can be implemented by a rotation of the input word at a bit level.
- Furthermore, since a great precision on the $X_k$ is often needed (use of DCT in adaptive feedback loops), the need of greater wordlengths when using NTT's, compared to the usual case, is not such a waste.

V. DCT BY DISTRIBUTED ARITHMETIC
Another possibility is to use the decomposition of the DCT into polynomial products, as explained in section II, and then to compute these polynomial products by distributed arithmetic.

Let us briefly recall the computation of inner products using the distributed arithmetic. Let

$$y = \sum_{i=0}^{L-1} a_i x_i \qquad (15)$$

be the inner product to be computed, and

$$x_i = -x_{i0} + \sum_{j=1}^{B-1} x_{ij}\, 2^{-j} \qquad (16)$$

the expression of $x_i$ in terms of its binary representation (2's complement). Let us now write $x_i$ at a bit level in (15) and reverse the two resulting sums. We get:

$$y = -\sum_{i=0}^{L-1} a_i x_{i0} + \sum_{j=1}^{B-1} \left( \sum_{i=0}^{L-1} a_i x_{ij} \right) 2^{-j} \qquad (17)$$

In this equation, the double sum is a successive shift and add of elementary terms (between brackets), each term being an inner product between the $a_i$ and a vector of bits ($x_{ij}$, $i = 0, \dots, L-1$). Let f be this function. f depends on L binary variables, hence can take $2^L$ different values. If these values are stored in a ROM at the address corresponding to the binary configuration of the input bits, an implementation of the inner product by distributed arithmetic is as shown in fig. 4.

When used in a DCT algorithm, the distributed arithmetic implementation of the polynomial product will require one inner product computation per coefficient of the resulting polynomial, and some butterflies to decompose the initial DCT into polynomial products (see fig. 5).

A number of remarks are of interest:

- Since the ROM is addressed by the bits of same weight of the outputs of the butterflies, these butterflies can be implemented in serial arithmetic.
- The speed of a circuit implementing this architecture will be limited only by the output accumulator. If the required speed is lower, it is possible to reduce the size of the circuit by using the relationships between the different inner products involved [8], in a manner very similar to that explained in [15] for the computation of convolution.
- All the combinations of the input data are performed in serial arithmetic. Hence, the resulting architecture is very regular and easily implemented.
- Since the structure of the decomposition of the DCT into polynomial products is the same as for other transforms, the same structure can also be used for the computation of Fourier transforms by changing only the ROM part of the chip.
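The ROM-plus-accumulator principle recalled in this section can be illustrated in software. This is only a minimal sketch under our own assumptions: inputs are B-bit two's complement integers (i.e. the fractional words of the text scaled by $2^{B-1}$), bits are processed from the least significant one with weights $2^j$, and the names `build_rom` and `da_inner_product` are hypothetical.

```python
def build_rom(coeffs):
    """ROM content: the word at address `addr` is the sum of the coefficients
    a_i whose address bit i is set (2**L words for L coefficients)."""
    size = 1 << len(coeffs)
    return [sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
            for addr in range(size)]


def da_inner_product(rom, xs, bits):
    """Compute sum_i a_i * x_i by distributed arithmetic.

    xs are integers in [-2**(bits-1), 2**(bits-1) - 1]; the coefficients a_i
    only appear through the precomputed ROM.
    """
    mask = (1 << bits) - 1
    words = [x & mask for x in xs]      # two's complement bit patterns
    acc = 0
    for j in range(bits):
        # Gather the bits of weight 2**j of all inputs into one ROM address.
        addr = 0
        for i, w in enumerate(words):
            addr |= ((w >> j) & 1) << i
        term = rom[addr] * (1 << j)
        # The sign bit carries a negative weight in two's complement.
        acc += -term if j == bits - 1 else term
    return acc


if __name__ == "__main__":
    coeffs = [3, -1, 4, 2]
    rom = build_rom(coeffs)
    print(da_inner_product(rom, [5, -7, 0, -8], bits=4))
```

No multiplication by a data-dependent operand occurs: the products by $2^j$ are shifts, which is why the speed is limited by the accumulator only. Changing the transform amounts to rebuilding `rom` from another coefficient set, which is the software counterpart of changing the ROM part of the chip.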

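Coming back to the NTT scheme of section IV, the way a cyclic convolution is obtained with only one general multiplication per point can also be sketched in software. The following is an illustration under stated assumptions, not the architecture of the paper: the modulus is the Fermat number $F_4 = 2^{16}+1$, the root of unity is a power of 2 (so that its powers are shifts in hardware), the transform is written naively in $O(N^2)$, and both sequences are assumed to be already quantized to integers. All names are ours.

```python
FERMAT_MODULUS = (1 << 16) + 1   # F_4 = 65537, a prime; 2 has order 32 modulo F_4


def ntt(seq, root):
    """Naive number theoretic transform of `seq` modulo F_4."""
    n = len(seq)
    return [sum(seq[i] * pow(root, i * k, FERMAT_MODULUS) for i in range(n))
            % FERMAT_MODULUS for k in range(n)]


def cyclic_convolution_ntt(x, h):
    """Cyclic convolution of two integer sequences of equal length n,
    n dividing 32; exact if every result lies in (-F_4/2, F_4/2)."""
    n = len(x)
    assert len(h) == n and 32 % n == 0
    root = pow(2, 32 // n, FERMAT_MODULUS)        # n-th root of unity, a power of 2
    inv_root = pow(root, n - 1, FERMAT_MODULUS)   # root**(-1), since root**n = 1
    inv_n = pow(n, FERMAT_MODULUS - 2, FERMAT_MODULUS)  # Fermat's little theorem
    xt, ht = ntt(x, root), ntt(h, root)
    # The only general multiplications: one per transform point.
    yt = [(a * b) % FERMAT_MODULUS for a, b in zip(xt, ht)]
    y = [(v * inv_n) % FERMAT_MODULUS for v in ntt(yt, inv_root)]
    # Bring the residues back to signed integers.
    return [v - FERMAT_MODULUS if v > FERMAT_MODULUS // 2 else v for v in y]


def cyclic_convolution_direct(x, h):
    """Reference: y_k = sum_i x_i * h_{(k - i) mod n}."""
    n = len(x)
    return [sum(x[i] * h[(k - i) % n] for i in range(n)) for k in range(n)]
```

For N = 8 the root is $2^4$ and for N = 16 it is $2^2$, so that every twiddle factor is a power of 2, in line with the remark that $\alpha$ represents only a shift. In a DCT application the fixed sequence would be a quantized version of the cosine sequence, whose transform can be precomputed.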
VI. CONCLUSION

We have first explained the equivalence between DCT and cyclic convolution.

We then used this relationship to obtain new DCT algorithms with some characteristics suitable for VLSI implementation. Other algorithms can also be obtained with such an approach. Further work will be reported.


REFERENCES
[1] W.H. CHEN, C. HARRISON, S.C. FRALICK: "A fast computational algorithm for the Discrete Cosine Transform". IEEE Trans. on Comm., Vol. COM-25, n°9, September 1977, pp. 1004-1011.
[2] A. JALALI, K.R. RAO: "A high-speed FDCT processor for real-time processing of NTSC color TV signal". IEEE Trans. on Elec. Comp., Vol. EMC-24, n°2, May 1982, pp. 278-286.
[3] BYEONG G. LEE: "A new algorithm to compute the Discrete Cosine Transform". IEEE Trans. on ASSP, Vol. ASSP-32, n°6, December 1984, pp. 1243-1245.
[4] M. VETTERLI, H. NUSSBAUMER: "Simple FFT and DCT algorithms with reduced number of operations". Signal Processing, Vol. 6, n°4, August 1984, pp. 267-278.
[5] M. VETTERLI, A. LIGTENBERG: "A discrete cosine transform chip". IEEE Journal of Select. Areas in Com., Vol. SAC-4, n°1, January 1986, pp. 49-65.
[6] P. DUHAMEL: "Implementation of "split-radix" FFT algorithms for complex, real, and real-symmetric data". IEEE Trans. on ASSP, Vol. 34, n°2, pp. 285-295, April 1986.
[7] P. DUHAMEL: "Dispositif de transformée en cosinus d'un signal numérique échantillonné". French patent n°9601629, February 1986.
[8] P. DUHAMEL: "Dispositif de détermination de la transformée numérique d'un signal". French patent n°8612431, September 1986.
[9] S. WINOGRAD: "Some bilinear forms whose multiplicative complexity depends on the field of constants". Math. Syst. Theory, Vol. 10, pp. 169-180, 1977.
[10] L. AUSLANDER, S. WINOGRAD: "The multiplicative complexity of certain semi-linear systems defined by polynomials". Adv. in Applied Mathematics, Vol. 1, n°3, pp. 257-299, 1980.
[11] M.T. HEIDEMAN: Private communication.
[12] H.J. NUSSBAUMER: "Fast Fourier Transform and Convolution Algorithms". Springer-Verlag, 1981.
[13] R.C. AGARWAL, C.S. BURRUS: "Fast convolutions using Fermat number transforms with application to digital filtering". IEEE Trans. on ASSP, Vol. 22, pp. 87-97, 1974.
[14] L.M. LEIBOWITZ: "A simplified binary arithmetic for the Fermat Number Transform". IEEE Trans. on ASSP, Vol. 24, pp. 356-359, 1976.
[15] S. CHU, C.S. BURRUS: "A prime factor FFT algorithm using distributed arithmetic". IEEE Trans. on ASSP, Vol. 30, n°2, pp. 217-226, April 1982.
Fig. 1: The computation of a length 2^n DCT based on a cyclic convolution

Fig. 2: General scheme of the DCT computed by NTT

Fig. 4: Implementation of an inner product by distributed arithmetic

Fig. 5: The DCT of length 8 by distributed arithmetic