Pierre DUHAMEL and Hedi H'MIDA
CNET/PAB/RPE, 38-40, rue du Général Leclerc, 92131 Issy-les-Moulineaux
(FRANCE)
ABSTRACT
Small length Discrete Cosine Transforms (DCT's) are used for image data compression. In that case, length 8 or 16 DCT's need to be performed at video rate. We propose two new implementations of DCT's which have several interesting features, as far as VLSI implementation is concerned. A first one, using modulo arithmetic, needs only one multiplication per input point, so that a single multiplier is needed on-chip. A second one, based on a decomposition of the DCT into polynomial products, and evaluation of these polynomial products by distributed arithmetic, results in a very small chip, with a great regularity and testability. Furthermore, the same structure can be used for FFT computation by changing only the ROM part of the chip. Both new architectures are mainly based on a new formulation of a length $2^n$ DCT as a cyclic convolution, which is explained in the first section of the paper.
I. INTRODUCTION

In the recent years, many fast DCT algorithms were proposed, among which three are of major interest: the CHEN-FRALICK [1] algorithm, the B.G. LEE [3] algorithm, and the VETTERLI-NUSSBAUMER [4] algorithm.

The first one, having been proposed a long time ago, has been considered for VLSI implementation several times, although it does not meet the minimum arithmetic complexity [2]. The other ones meet the minimum known number of both multiplications and additions to implement a length $2^n$ DCT algorithm. Furthermore, it has been shown that, if these algorithms could be improved, the same approach would also improve a whole class of algorithms (i.e. 1-D and 2-D FFT's, DST, ...) [6]. From a practical point of view, the algorithm by LEE has a greater regularity than the VETTERLI-NUSSBAUMER algorithm, but has poor roundoff noise performances, due to the 1/cos coefficients. Both of them have been implemented in hardware (or silicon) [5].

With those considerations in mind, one can see that there is still some need for DCT algorithms meeting the following three characteristics altogether:
- minimum arithmetic complexity (or low hardware cost),
- great regularity of the graph (the availability of a length N/2 DCT inside a length N DCT is often required),
- good noise performances.

While it is possible to obtain "classical" algorithms meeting these three points (the paper describing them is under the process of being written), we propose in this paper two completely new approaches that have several interesting features, as far as VLSI implementation is concerned.

Furthermore, these approaches are described from considerations that have theoretical importance, since, for the first time, length $2^n$ DCT's are stated in terms of cyclic convolution, which allows one to obtain quickly its multiplicative complexity (we thus obtain a second derivation of a result by M.T. HEIDEMAN).

We will give in this paper only sketches of proofs for the derivations of the algorithms, since our aim is to show that the understanding of the underlying mathematical structure of the DCT can lead to new efficient algorithms.

II. THE LENGTH $2^n$ DCT AS POLYNOMIAL PRODUCTS

The DCT is defined as follows:

$$X_k = \sum_{i=0}^{N-1} x_i \cos\frac{2\pi (2i+1) k}{4N} \tag{1}$$
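As a reference point for the derivations that follow, eq. (1) can be evaluated literally. The sketch below is only a direct $O(N^2)$ transcription of the definition (no normalisation factor, as in eq. (1)); the function name `dct_direct` is ours and is not taken from the paper.

```python
import math

def dct_direct(x):
    """Evaluate eq. (1) literally: X_k = sum_i x_i cos(2*pi*(2i+1)*k / (4N))."""
    N = len(x)
    return [
        sum(x[i] * math.cos(2 * math.pi * (2 * i + 1) * k / (4 * N))
            for i in range(N))
        for k in range(N)
    ]

# A ramp input: X_0 is simply the sum of the samples.
X_ramp = dct_direct([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])

# A constant input only excites the k = 0 term.
X_const = dct_direct([1.0] * 8)
```

Such a slow reference is mainly useful to check the faster formulations against it.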
The equivalence between the above DCT, for $N = 2^n$, and a cyclic convolution is obtained through two permutations of the input variables.

The first one, already given in [4], is used to change the terms $(2i+1)$ in (1) into $(4i+1)$. Let us define:

$$x'_i = x_{2i}, \qquad x'_{N-1-i} = x_{2i+1}, \qquad i = 0, \ldots, N/2 - 1 \tag{2}$$

eq. (1) can then be written as:

$$X_k = \sum_{i=0}^{N-1} x'_i \cos\frac{2\pi (4i+1) k}{4N} \tag{3}$$
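The effect of this first permutation can be checked numerically. The following sketch assumes the even/odd reordering attributed to [4] (even-indexed samples in natural order, odd-indexed samples in reverse order) and compares the $(4i+1)$ form with the original definition; the function names are illustrative only.

```python
import math

def dct_direct(x):
    """Eq. (1): kernel cos(2*pi*(2i+1)*k / (4N))."""
    N = len(x)
    return [sum(x[i] * math.cos(2 * math.pi * (2 * i + 1) * k / (4 * N))
                for i in range(N)) for k in range(N)]

def first_permutation(x):
    """x'_i = x_{2i} and x'_{N-1-i} = x_{2i+1}, for i = 0 .. N/2 - 1."""
    N = len(x)
    xp = [0.0] * N
    for i in range(N // 2):
        xp[i] = x[2 * i]
        xp[N - 1 - i] = x[2 * i + 1]
    return xp

def dct_permuted(xp):
    """Same transform written on the permuted input: kernel cos(2*pi*(4i+1)*k / (4N))."""
    N = len(xp)
    return [sum(xp[i] * math.cos(2 * math.pi * (4 * i + 1) * k / (4 * N))
                for i in range(N)) for k in range(N)]

x = [3.0, -1.0, 4.0, 1.0, -5.0, 9.0, 2.0, -6.0]
max_err = max(abs(a - b)
              for a, b in zip(dct_direct(x), dct_permuted(first_permutation(x))))
```

The two evaluations agree up to floating-point rounding, since $\cos(2\pi(4(N-1-i)+1)k/4N) = \cos(2\pi(2(2i+1)+1)k/4N)$.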
The second one will allow us to change a product of indices $(4i+1)(4k+1)$ into a sum of indices $u_i + v_k$. This result is obtained through the use of a one-to-one correspondence between the set of integers of the form $(4i+1)$, $i = 0, \ldots, 2^n - 1$, and the successive powers of 5 modulo $2^{n+2}$: it is always possible to write

$$4i+1 = \langle 5^{u_i} \rangle_{2^{n+2}}, \qquad i = 0, \ldots, 2^n - 1$$

The permutation (4) is then always feasible:

$$x''_i = x'_{(\langle 5^i \rangle_{4N} - 1)/4} \tag{4}$$

and eq. (3) becomes as follows:

$$X_k = \sum_{i=0}^{N-1} x''_i \cos\frac{2\pi \langle 5^i \rangle_{4N}\, k}{4N}$$

Let us now consider separately the even terms $\{X_{2k}\}$ and the odd terms $\{X_{2k+1}\}$ of the DCT. It is well known, and fairly obvious from eq. (1), that $X_{2k}$ is the output of a DCT of length N/2. Hence, the following decomposition, on the odd terms, will apply recursively on the DCT's of reduced lengths.

When considering only the odd terms $\{X_{2k+1}\}$, eq. (1) is now symmetrical in i and k, and the two permutations described above are now feasible in k (with the only difference that there are N/2 terms $X_{2k+1}$ and N terms $x_{2i+1}$, thus resulting in the minus term of eq. (7); see [8] for more details). Hence, the odd terms are obtained as a sum in which the product of indices is replaced by the sum of indices $u_i + v_k$.

Let us now define the following polynomials:

$$X(z) = \sum_{k=0}^{N-1} x''_k z^k$$

$$V(z) = \sum_{i=0}^{N-1} \cos\frac{2\pi \langle 5^i \rangle_{4N}}{4N}\, z^i \tag{10}$$

and let $Y(z)$ be the polynomial whose coefficients are the odd terms of the DCT, in permuted order. eq. (6) can now be reformulated as:

$$Y(z) = X(z) \cdot V(z) \bmod z^{N/2} + 1 \tag{11}$$

i.e.: the odd terms of the DCT can be stated as a polynomial product of length N/2.

This can be applied recursively to the length N/2 DCT arising from the computation of the even terms, and so on, thus resulting in a complete formulation of the DCT as polynomial products.

When considering these polynomial products, it is easily recognized that the polynomials involving the input of the DCT are all reductions of $X(z)$ modulo the cyclotomic factors of $z^N - 1$ ($N = 2^n$). Knowing this, one can see that the whole set of polynomial products is equivalent to a cyclic convolution (i.e. a polynomial product modulo $z^N - 1$) followed by a reduction modulo the cyclotomic factors of $z^N - 1$. The sequence to be cyclically convolved with the input data remains to be found. But, since we know, by successive applications of eq. (10) to the DCT's of decreasing length N, N/2, N/4, ..., the expression of the unknown polynomial modulo the cyclotomic factors, it is easy to reconstruct the initial one, given in eq. (12).

We have now established that the DCT of length $N = 2^n$ can be obtained as shown in fig. (1).

III. THEORETICAL CONSEQUENCES

As a side result, it is now very easy to get an upper bound on the multiplicative complexity of the length $2^n$ DCT.

It has been shown by WINOGRAD [9] that the multiplicative complexity of a cyclic convolution of length $2^n$ is given by:

$$\mu(\mathrm{conv.}\ 2^n) = 2^{n+1} - n - 1 \tag{13}$$

Furthermore, one of the multiplications involved in the DCT computed as a convolution, as shown in fig. (1), is trivial ($V(z) \bmod z - 1 = 1$). We then obtain, as an upper bound:

$$\mu(\mathrm{DCT}\ 2^n) = 2^{n+1} - n - 2 \tag{14}$$

It is possible (but more intricate) to show, by using some other results of WINOGRAD, that this upper bound is also the lower bound. This result was already obtained by M.T. HEIDEMAN [11].

Consequences of practical importance can be obtained by observation of table 1, containing the comparison between this lower bound and the practical algorithms for short lengths:

N      lower bound    LEE, VETTERLI    CHEN
4      4              4                6
8      11             12               16
16     26             32               44

Table 1: comparison of the lower bound and the practical algorithms (number of multiplications)

In fact, observation of table 1 tells us that the computation of a DCT of length $2^n$ needs more than one multiplication per point, whatever the algorithm will be. As a consequence, if a DCT chip is needed to work in real time at video rates (which is the case in many DCT applications), implementation of a DCT will need more than one multiplier on-chip.
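The index manipulations of section II (powers of 5 modulo $4N$, then a product modulo $z^{N/2} + 1$) can be checked numerically. The sketch below is one possible arrangement, not necessarily the indexing used by the authors: it assumes the even/odd input reordering of [4], writes the product as a negacyclic correlation, and folds each output index $\langle 5^v \rangle_{4N}$ back into the range $1 \ldots N-1$ using the symmetries of the cosine kernel. All function and variable names are ours.

```python
import math

def dct_direct(x):
    """Eq. (1), evaluated literally, used as the reference."""
    N = len(x)
    return [sum(x[i] * math.cos(2 * math.pi * (2 * i + 1) * k / (4 * N))
                for i in range(N)) for k in range(N)]

def odd_terms_by_negacyclic_product(x):
    """Odd DCT outputs of a length N = 2**n input, via powers of 5 modulo 4N."""
    N = len(x)
    M, half = 4 * N, N // 2
    # First permutation: even samples in order, odd samples reversed.
    xp = [0.0] * N
    for i in range(half):
        xp[i], xp[N - 1 - i] = x[2 * i], x[2 * i + 1]
    # Second permutation: x''_u = x'_{(<5^u>_{4N} - 1) / 4}.
    pw = [pow(5, u, M) for u in range(N)]
    assert sorted(pw) == list(range(1, M, 4))  # one-to-one with the integers 4i+1
    xpp = [xp[(p - 1) // 4] for p in pw]
    # Kernel c_m = cos(2*pi*<5^m>/(4N)); it satisfies c_{m+N/2} = -c_m.
    c = [math.cos(2 * math.pi * p / M) for p in pw]
    # Reduction of the input modulo z^(N/2) + 1.
    w = [xpp[u] - xpp[u + half] for u in range(half)]
    out = {}
    for v in range(half):
        y = 0.0
        for u in range(half):
            s = u + v
            y += w[u] * c[s] if s < half else -w[u] * c[s - half]
        # y is eq. (1) evaluated at k = <5^v>_{4N}; fold k back into 1 .. N-1.
        m = pw[v]
        if m < N:
            k, sign = m, 1.0
        elif m < 2 * N:
            k, sign = 2 * N - m, -1.0
        elif m < 3 * N:
            k, sign = m - 2 * N, -1.0
        else:
            k, sign = M - m, 1.0
        out[k] = sign * y
    return out

x = [3.0, -1.0, 4.0, 1.0, -5.0, 9.0, 2.0, -6.0]
odd = odd_terms_by_negacyclic_product(x)
ref = dct_direct(x)
```

For $N = 8$ the four products land on the output indices 1, 5, 7 and 3, i.e. every odd term is produced exactly once, using $N/2 \times N/2$ kernel values instead of $N/2 \times N$.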
IV. THE DCT COMPUTED BY NTT

Nevertheless, there is still a way of obtaining a DCT algorithm with one multiplication per point (hence one multiplier on-chip):

Since we have now established the DCT as a cyclic convolution, we can use Number Theoretic Transforms (NTT) [12] to compute the convolution, and the scheme of fig. (1) now becomes as shown in fig. (2).

Furthermore, since the computation of the result modulo the cyclotomic factors of $z^N - 1$ is obtained as intermediate variables inside the inverse NTT, the butterflies shown in fig. (2) can be simplified with the last operations involved in the computation of the $\mathrm{NTT}^{-1}$ box. This results in the diagram shown in fig. 3 for N = 8.

It should be noted that this corresponds to a favorable case for NTT's to be used:
- Since NTT's are generally performed on short length sequences (N = 16 seems to be a maximum), we avoid the usual problem arising in NTT that, due to the relationship between $\alpha$, the N-th root of unity, N, the length of the transform, and M, the modulus ($\alpha^N \equiv 1 \bmod M$), it is often impossible to use 2 as a root of unity (thus avoiding multiplications in the NTT) for even moderate lengths.
- What is needed to compute the DCT is really a cyclic convolution, and there is no need of the overlap-add or overlap-save algorithms to obtain a linear convolution, as is needed in FIR filtering.
- The modulo arithmetic is not such a problem, since, with the given constraints, we can work modulo a Fermat number, or a pseudo-Fermat number [13], which gives one of the simplest known modulo arithmetics [14]. In this case, a multiplication by a power of the root of unity in fig. (3) represents only a shift, and can be implemented by a rotation of the input word at a bit level.
- Furthermore, since a great precision on the $X_k$ is often needed (use of DCT in adaptive feedback loops), the need of greater wordlengths when using NTT's, compared to the usual case, is not such a waste.

V. DCT BY DISTRIBUTED ARITHMETIC

Another possibility is to use the decomposition of the DCT into polynomial products, as explained in section II, and then to compute these polynomial products by distributed arithmetic.

Let us briefly recall the computation of inner products using the distributed arithmetic. Let

$$y = \sum_{i=0}^{N-1} a_i x_i \tag{15}$$

be the inner product to be computed, and

$$x_i = -x_{i0} + \sum_{j=1}^{B-1} x_{ij}\, 2^{-j} \tag{16}$$

the expression of $x_i$ in terms of its binary representation (B bits, 2's complement). Let us now write $x_i$ at a bit level in (15) and reverse the two resulting sums. We get:

$$y = -\sum_{i=0}^{N-1} a_i x_{i0} + \sum_{j=1}^{B-1} \left( \sum_{i=0}^{N-1} a_i x_{ij} \right) 2^{-j} \tag{17}$$

In this equation, the double sum is a successive shift and add of elementary terms (between brackets), each term being an inner product between the $a_i$ and a vector of bits ($x_{ij}$, $i = 0, \ldots, N-1$). Let f be this function. f depends on N binary variables, hence can take $2^N$ different values. If these values are stored in a ROM at the address corresponding to the binary configuration of the input bits, an implementation of the inner product by distributed arithmetic is as shown in fig. 4.
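The ROM-plus-accumulator scheme just described can be sketched in software. The example below is a minimal model under our own assumptions: the inputs are signed integers on `bits` bits rather than the fractional 2's complement words of eq. (16), so the weights are $2^{j}$ instead of $2^{-j}$ and the sign bit is the most significant one; `build_rom` and `da_inner_product` are illustrative names.

```python
def build_rom(a):
    """ROM content: for each N-bit address, the sum of the a_i whose bit is set."""
    N = len(a)
    return [sum(a[i] for i in range(N) if (addr >> i) & 1)
            for addr in range(1 << N)]

def da_inner_product(rom, x, bits):
    """Inner product sum_i a_i * x_i using only ROM look-ups, shifts and adds.

    x holds signed integers representable in 2's complement on `bits` bits.
    """
    acc = 0
    for j in range(bits):
        # Address = the bits of weight 2**j of all the inputs (one bit per input).
        addr = 0
        for i, xi in enumerate(x):
            addr |= ((xi >> j) & 1) << i
        term = rom[addr] * (1 << j)
        # The most significant bit carries a negative weight in 2's complement.
        acc += -term if j == bits - 1 else term
    return acc

a = [3, -1, 4, 2]
x = [5, -3, 7, -8]          # all fit in 4-bit 2's complement
y = da_inner_product(build_rom(a), x, bits=4)
```

The ROM has $2^N$ entries whatever the word length, and one look-up is needed per input bit, which is why the inputs can be fed serially.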
When used in a DCT algorithm, the distributed arithmetic implementation of the polynomial product will require one inner product computation per coefficient of the resulting polynomial, and some butterflies to decompose the initial DCT into polynomial products (see fig. 5).
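To see why one inner product per output coefficient is enough, a polynomial product modulo $z^M + 1$ with a fixed polynomial can be written as M inner products with fixed coefficient vectors, each of which could then be stored in its own ROM. The sketch below only illustrates this rewriting on small integers; the helper names are ours.

```python
def negacyclic_rows(v):
    """Row k holds the fixed coefficients multiplying x_0 .. x_{M-1} in output k."""
    M = len(v)
    return [[v[k - i] if i <= k else -v[k + M - i] for i in range(M)]
            for k in range(M)]

def product_by_inner_products(x, v):
    """Coefficients of X(z) * V(z) mod z^M + 1, one inner product per coefficient."""
    return [sum(xi * r for xi, r in zip(x, row)) for row in negacyclic_rows(v)]

def product_reference(x, v):
    """Same product by full multiplication followed by reduction (z^M = -1)."""
    M = len(x)
    full = [0] * (2 * M - 1)
    for i, xi in enumerate(x):
        for j, vj in enumerate(v):
            full[i + j] += xi * vj
    return [full[k] - (full[k + M] if k + M < 2 * M - 1 else 0)
            for k in range(M)]

y = product_by_inner_products([1, 2, 3, 4], [5, 6, 7, 8])
```

Since the rows depend only on the fixed polynomial, they play the role of the $a_i$ in eq. (15).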
A number of remarks are of interest:
- Since the ROM is addressed by the bits of same weight of the outputs of the butterflies, these butterflies can be implemented in serial arithmetic.
- The speed of a circuit implementing this architecture will be limited only by the output accumulator. If the required speed is lower, it is possible to reduce the size of the circuit by using the relationships between the different inner products involved [8], in a manner very similar to that explained in [15] for the computation of convolution.
- All the combinations of the input data are performed in serial arithmetic. Hence, the resulting architecture is very regular and easily implemented.
- Since the structure of the decomposition of the DCT into polynomial products is the same as for other transforms, the same structure can also be used for the computation of Fourier transforms by changing only the ROM part of the chip.
VI. CONCLUSION

We have first explained the equivalence between DCT and cyclic convolution. We then used this relationship to obtain new DCT algorithms with some characteristics suitable for VLSI implementation. Other algorithms can also be obtained with such an approach. Further work will be reported.
REFERENCES

[1] W.H. CHEN, C. HARRISON, S.C. FRALICK: "A fast computational algorithm for the Discrete Cosine Transform". IEEE Trans. on Comm., Vol. COM-25, n°9, September 1977, pp. 1004-1011.
[2] A. JALALI, K.R. RAO: "A high-speed FDCT processor for real-time processing of NTSC color TV signal". IEEE Trans. on Elec. Comp., Vol. EMC-24, n°2, May 1982, pp. 278-286.
[3] BYEONG LEE: "A new algorithm to compute the Discrete Cosine Transform". IEEE Trans. on ASSP, Vol. ASSP-32, n°6, December 1984, pp. 1243-1245.
[4] M. VETTERLI, H. NUSSBAUMER: "A simple FFT and DCT algorithms with reduced number of operation". Signal Processing, Vol. 6, n°4, August 1984, pp. 267-278.
[5] M. VETTERLI, A. LIGTENBERG: "A discrete cosine transform chip". IEEE Journal of Select. Areas in Com., Vol. SAC-4, n°1, January 1986, pp. 49-65.
[6] P. DUHAMEL: "Implementation of "split-radix" FFT algorithms for complex, real, and real-symmetric data". IEEE Trans. on ASSP, Vol. 34, n°2, pp. 285-295, April 1986.
[7] P. DUHAMEL: "Dispositif de transformée en cosinus d'un signal numérique échantillonné". French patent, n°9601629, February 1986.
[8] P. DUHAMEL: "Dispositif de détermination de la transformée numérique d'un signal". French patent n°8612431, September 1986.
[9] S. WINOGRAD: "Some bilinear forms whose multiplicative complexity depends on the field of constants". Math. Syst. Theory, 1977, Vol. 10, pp. 169-180.
[10] L. AUSLANDER, S. WINOGRAD: "The multiplicative complexity of certain semilinear systems defined by polynomials". Adv. in Applied Mathematics, Vol. 1, n°3, pp. 257-299, 1980.
[11] M.T. HEIDEMAN: Private communication.
[12] H.J. NUSSBAUMER: "Fast Fourier Transform and Convolution algorithms". Springer-Verlag, 1981.
[13] R.C. AGARWAL, C.S. BURRUS: "Fast convolutions using Fermat number transforms with application to digital filtering". IEEE Trans. on ASSP, Vol. 22, pp. 87-97, 1974.
[14] L.M. LEIBOWITZ: "A simplified binary arithmetic for the Fermat Number Transform". IEEE Trans. on ASSP, Vol. 24, pp. 356-359, 1976.
[15] S. CHU, C.S. BURRUS: "A prime factor FFT algorithm using distributed arithmetic". IEEE Trans. on ASSP, Vol. 30, n°2, pp. 217-226, April 1982.
Fig. 1: The computation of a length $2^n$ DCT based on a cyclic convolution
Fig. 2: General scheme of the DCT computed by NTT
Fig. 4: Implementation of an inner product by distributed arithmetic
Fig. 5: The DCT of length 8 by distributed arithmetic