Professional Documents
Culture Documents
Ryszard M. Stasiński
Institute of Multimedia Telecommunications
Poznan University of Technology, Poznań, Poland
e-mail: ryszard.stasinski@put.poznan.pl
March 7, 2023
Abstract
In the paper it is shown that there exist infinite classes of fast DFT
algorithms having multiplicative complexity lower than O(N log N ), i.e.
smaller than their arithmetical complexity. The derivation starts with
nesting of Discrete Fourier Transform (DFT) of size N = q1 · q2 · . . . qr ,
where qi are powers of prime numbers: DFT is mapped into multidimen-
sional one, Rader convolutions of qi -point DFTs extracted, and combined
into multidimensional convolutions processing data in parallel. Crucial to
further optimization is the observation that multiplicative complexity of
such algorithm is upper bounded by 0(N log Mmax ), where Mmax is the
size of the greatest structure containing multiplications. Then the size
of the structures is diminished: Firstly, computation of a circular convo-
lution can be done as in Rader-Winograd algorithms. Secondly, multi-
dimensional convolutions can be computed using polynomial transforms.
It is shown that careful choice of qi values leads to important reduction
of Mmax value: Multiplicative complexity of the new DFT algorithms
is O(N logc log N ) for c ≤ 1, while for more addition-orietnted ones it is
O(N log1/m N ), m is a natural number denoting class of qi values. Smaller
values of c, 1/m are obtained for algorithms requiring more additions, part
of algorithms for c = 1, m = 2 have arithmetical complexity smaller than
that for the radix-2 FFT for any comparable DFT size, and even lower
than that of split-radix FFT for N ≤ 65520. The approach can be used
for finding theoretical lower limit on the DFT multiplicative complexity.
1
1 Introduction
The re-discovery of Fast Fourier Transform (FFT) in 1965 [1] resulted in a wave
of related publications, e.g. there were two special issues on this subject of
IEEE Transactions on Audio Electroacoustics in 1967, and 1969 [2]. It also
raised some fundamental questions, is the computational complexity of N -point
DFT O(N log N ), or lower [3]? Or at least, what is the lower bound on DFT
multiplicative complexity in the field of complex numbers? The answer to the
second question is provided by Winograd in [4], it is O(N ), but the result is ob-
tained for excessive number of additions, surpassing even O(N 2 ), the complexity
of DFT definition formula. More influental was his other work [5], which trig-
gerd the second wave of publications, including those on the Winograd-Fourier
transform algorithm (WFTA), which realization was presented in [6], and with
some improvements in [7]. At the same time important results were obtained
for multidimensional DFT computation by Nussbaumer [8], [9], who devised
polynomial transforms, finally split-radix FFT was introduced [10]. Winograd’s
works raised the question: to what extent it is reasonable to reduce the number
of multiplications needed for computation of DFT, while keeping its computa-
tional cost low?
Simple close form formulae on the number of operations (27) cause that
split-radix FFT [10] is a convenient yardstick for eveluating performance of DFT
algorithms, but this is not the best FFT. Firstly, in [11] its version with reduced
number of additions was presented. Recently it was shown in [12] that there are
several algorithms having smaller arithmetical, or at least multiplicative com-
plexity than split-radix FFT. The smallest arithmetical complexity has radix-6
FFT from [13], the used in section 5.3 radix-6 FFT from [14] has exceptionally
small number of multiplications.
Recently the problem of DFT computational complexity is attacked from an-
other angle, additions and multiplications are not counted separately, only the
N -point DFT computation is transformed into integer multiplication problem
of size bN [15], N is the sequence length, b is the integer representation length.
Then in the next paper [16] it is proved that it exists an integer multiplica-
tion algorithm requiring O(b log b) bit operations, which means that DFT bit
complexity is O(bN log bN ). The transformation is based on paper [17], where
DFT computation is done by calculation of a convolution. Note that if bit com-
plexity of an addition is O(b), then bit complexity of O(N log N ) additions is
O(bN log bN ), see (1), (2).
This paper is going the way pointed out by Winograd [4], this time trans-
formation of DFT computation problem into that of convolution calculation is
based on a generalization of Rader idea [18]. There is no assumption concerning
the way in which real multiplications and additions are done. It is shown that
there are DFT algorithm classes, for which
M (N )
−→ 0 for N → ∞ (1)
A(N )
2
where M (N ), A(N ) are multiplication and addition numbers, and
A(N ) = O(N log N ). (2)
The result is obtained using the technique from [19]. Algorithm construction
starts with nesting of Rader-Winograd DFTs, section 2, that can be applied to
any DFT of size:
N = pk11 · pk22 · pk33 · . . . (3)
pi are prime numbers. Whenever at least two numbers pi − 1, pj − 1 have
common factors, polynomial transforms can be used for reducing multiplicative
complexity of the algorithm, section 4. Before [19] polynomial transforms were
used only for optimization of 63 = 9 · 7 and 80 = 16 · 5 DFT modules [20]. For
full development of the algorithm Winograd’s findings on DFT computation are
necessary, section 3, as well as for calculation of lower bound on the number of
multiplications for such N , section 5.1, Theorem 1.
To meet condition (1) majority of primes in (3) should have possibly large
common factors for pi − 1. Condition (2) implies the use of FFT for polynomial
transforms of possibly small radices, section 5.2. Lack of this restriction in
[19] caused that the number of additions grew faster than O(N log N ). As a
result primes are searched in specific families of numbers (16). The assumption
concerning density of these primes, hence, the validity of proven in appendix
A Theorems 2, and 3 seems to be supported by experimental data, section
5.3. Operation numbers for the best algorithms for N ≤ 65520 are provided in
appendix B.
2 Nesting
According to Rader [18] DFT computation when its size N is a prime number
can be done by an (N − 1)-point circular convolution:
X(0) = x0 (0)
X(k) ⇐= PN −2 a (4)
X(ak ) = n=0 x0 (an )WNk−n
where: PN −1
x0 (n) = m=0 x(m) for n = 0
(5)
x(N − n) − x(0) for n 6= 0
and an are cyclic group elements modulo N , the group size is N − 1. Rader
algorithms exist also when N is a power of a prime number [21]. Their structure
is more complicated, hence, a good idea is to add such DFTs to the structure
of the developed here algorithms later, when Winograds improvements to the
Rader algorithm are implemented, section 3.
The Rader algorithm can be represented in matrix form as follows:
T=E·H·D (6)
where here matrix D represents operation from (5), multiplication by matrix
H is equivalent to computations in (4), while matrix E defines final inverse
3
permutations. Note that multipliers are present only in two blocks of matrix
H: N − 1-point one realizing convolution, and another one of size 1 needed to
transfer sample x0 (0) to the algorithm output (4).
When N is a product of prime numbers qi , i = 1, 2, . . . r, the qi − 1-point
circular convolutions can be nested. The technique is applicable for any N =
N1 · N2 · . . . Nr when each Ni is mutually prime with any other Nj . DFTs of
such size can be mapped into r-dimensional ones of size N1 × N2 × . . . Nr by
the following permutation of input sequences:
N N N
x[( n1 + n2 + . . . + nr ) mod N ] =⇒ x(n1 , n2 , . . . , nD ). (7)
N1 N2 Nr
In this way prime factor FFT algorithms (PFA FFT) are obtained [2]. The
mapping is reversible, r-dimensional DFT for such dimensions can be mapped
into one-dimensional one. In the paper both mappings are extensively used. In
matrix form r-dimensional DFT can be represented as follows:
T = Tr ⊗ . . . ⊗ T2 ⊗ T1 , (8)
4
3 Winograds improvements to Rader algorithm
Coefficients of a product of polynomials Y (Z) = H(Z)X(Z) can be obtained by
convolving coefficients of polynomials H(Z), and X(Z). If the result is reduced
modulo Z M −1, the M -point circular convolution of coefficients is obtained. Let
us introduce based on the Chinese Remainder Theorem approach to derivation
of polynomial product algorithms in rings of residues [20]:
Algorithm 1:
Task: Compute a polynomial product Y (Z) = H(Z) · X(Z) modulo
Q(Z). Polynomial Q(Z) can be factored as follows:
Y
Q(Z) = Qi (Z)
i
5
4 Optimization of multidimensional polynomial
products
4.1 Basic algorithms [20]
The general structure of the most frequently used class of polynomial product
algorithms is provided in Algorithm 1. Let us consider it’s special case for
(divisors of) Z N − 1:
Algorithm 1F:
Task: Compute the polynomial product Y (Z) = H(Z) · X(Z) mod-
ulo Z N − 1, or its divisor. Such polynomial can be factored as
follows: Y
Q(Z) = (Z − WNk ), 0 ≤ k < N,
k
As can be seen, for PN (Z) the residues are computed using a ”cropped” DFT
formula, DFTs for k mutually prime with N are called reduced N -point DFTs
[20]. Effective algorithms for them can be easily extracted from ”full” ones.
Step 3. in Algorithm 1F is done using the inverse (reduced) DFT. Note that
actual size of reduced N -point DFT is M , the rank of polynomial PN (Z).
If a reduced DFT algorithm is not sufficiently effective, a polynomial product
can be computed by a ”brute force” approach:
Algorithm 2:
Task: Compute the polynomial product Y (Z) = H(Z) · X(Z) mod-
ulo PN (Z).
1. Compute fast convolution algorithm of coefficients of H(Z) and
X(Z).
2. Form a polynomial from the result, and then reduce it modulo
PN (Z).
Comment: An obvious choice in step 1 is Algorithm 1F using FFT
of size not smaller than 2M − 1.
6
4.2 Multidimensional polynomial products
Let us exploit circular shift property for reducing multiplicative complexity of
the N × N -point DFT:
PN −1 PN −1 k2 ·n2
n2 =0 [ n1 =0 x(n1 , (n2 − k1 n1 )N )]WN =
PN −1 PN −1
= n1 =0 n2 =0 x(n1 , (n2 − k1 n1 )N )WNk2 ·n2 =
PN −1 PN −1 (10)
= n1 =0 WNk2 k1 n1 n2 =0 x(n1 , n2 )WNk2 ·n2 =
= X((k2 k1 )N , k2 )
(.)N denotes reduction modulo N . (k2 k1 )N takes on all k1 values only when k2
is mutually prime witn N , hence, in (10) some DFT samples X(k1 , k2 ) are not
computed, while some other more than once. Taking into account multiplierless
additions in the inner formula of (10) a DFT in the ring of residue polynomials
modulo Z N − 1 can be introduced:
N
X −1
Xk1 (Z) = xn1 (Z)Z k1 n1 mod(Z N − 1) (11)
n1 =0
where coefficients of residue polynomials xn1 (Z) are x(n1 , n2 ). Indeed, polyno-
mial Z is here the primitive root of unity of rank N :
Note that residue polynomials modulo PN (Z) form a field. To conclude (10),
calculate reduced N -point DFTs:
(Z1N1 − 1) × PN2 (Z2 ) [20]. To exploit this do not decompose N1 -point circular convolution
into polynomial products, Algorithm 1.
7
by reduced N2 -point DFTs in other dimension (13). Polynomial transform is
defined in a field, hence, its inverse exists, and as it is DFT, its kernel is
−N2 /N1
WN−1
1
= Z2 modPN1 (Z2 )
which means that it is calculated without multiplications, too. Then the follow-
ing algorithm is formulated:
Algorithm 3F:
Task: Multiplication of two-dimensional residue polynomials:
Y (Z1 , Z2 ) = X(Z1 , Z2 ) · H(Z1 , Z2 )modPN1 (Z1 ) × PN2 (Z2 )
when N1 is a divisor of N2 .
1. Compute N1 -point reduced polynomial transforms of X(Z1 , Z2 )
and H(Z1 , Z2 ):
N /N
Xk1 (Z2 ) = X(Z1 , Z2 ) mod(Z1 − Z2 2 1 ) modPN2 (Z2 ),
N /N
Hk1 (Z2 ) = H(Z1 , Z2 ) mod(Z1 − Z2 2 1 ) modPN2 (Z2 ).
8
4.3 Practical considerations
For many DFT sizes implementation of Algorithm 3 is not clear. Let us consider
computation of N = 56 653-point DFT, N = 181 · 313, neither 180 is a divisor
of 312, nor 312 a divisor of 180. However, gcd{180, 312} = 12 > 1, and this
implies that Algorithm 3 can be implemented. To reveal this let us consider
computation of 180- and 312-point convolutions by Algorithm 1F using PFA
FFT. The algorithm is based on mapping (7), hence, instead of 180- and 312-
point DFTs we are dealing with 4 × 9 × 5- and 8 × 3 × 13-point ones. Then,
the use of Algorithm 3 becomes clear: it should be applied to computation
of polynomial products modulo P4 (Z1 ) × P8 (Z4 ) and P9 (Z2 ) × P3 (Z5 ). The
remaining irreducible greatest polynomial product is P9 (Z2 )×P5 (Z3 )×p8 (Z4 )×
P13 (Z6 ) having size Mmax = 6 · 4 · 4 · 12 = 1152. Note that the same Mmax
is obtained when the DFT size is multiplied by 61, i.e. for N = 3455833 =
181 · 313 · 61. Namely, 60 = 4 · 3 · 5, and the multipliers are divisors of 8, 9, and
5, respectively.
Let us find upper bounds on the multiplicative complexity of the above algo-
rithms. In Algorithm 1F there are two FFTs, forward and inverse one, composed
of small-size, hence, multiplicative-efficient DFT modules. Then, the number
of multiplications of the 1152-point polynomial product can be approximated
by 2 · Mmax log2 Mmax , and as a consequence, the number of multiplication for
the whole DFT is upper bounded by 2N log2 Mmax ≈ 20 · N . At the same
time for FFTs of similar size, i.e. for N = 216 and N = 222 it is smaller than
N log2 N = 16 · N and 22 · N (27). It is an open question if N = 3455833-
point DFT algorithm indeed has less multiplications per input sample than the
N = 222 -point one, but this is definitely not a particularly efficient algorithm.
In Appendix B an N = 65520-point algorithm is presented, for which the num-
ber of multiplications is 4.01 · N , while the number of arithmetical operations
is slightly smaller than for N = 216 -point split-radix FFT.
5 Arithmetical complexity
5.1 Lower bound on DFT multiplicative complexity
Optimizations of DFT structure (9) done in the previus sections cause that
matrix H represents multiplication by irreducible polynomial products. At the
same time its block-diagonal form implies that the products are done in parallel.
Then, the following theorem is valid:
Theorem 1: If polynomial products represented by blocks of matrix
H in (9) are irreducible, then the lower bound on the multiplication
number of DFT algorithms for N given by formula (3) can be ob-
tained by summing up numbers 2Ki −1, where Ki are sizes of matrix
H blocks.
Proof: The reasoning is the same as in the paper [4] Q.E.D.
9
5.2 Additions: proposed algorithm sizes
Appearance of polynomial transforms in stages preceding multiplications, and
their inverses in the following stages implies that in many cases arithmetical
complexity of the algorithm shifts from the computation of polynomial products
to that of polynomial transforms. Arithmetical complexity of a polynomial
transform of size p being a prime number is O(p2 ) [20], which means that for
an ill-considered choice of DFT sizes the algorithm complexity may appear to
be greater than O(N log N ), i.e. the algorithm is not fast [19]. The solution is
to use small-radix FFTs for polynomial transforms. In [19] it has been shown
that for a radix-p polynomial transform, p is prime:
4(p2 − 2) p
a∞ = = O( ). (14)
p log2 p log p
where asymptotic parameter a∞ is:
OA (N )
for N → ∞: −→ a∞ , (15)
N log2 N
10
Figure 1: Map showing properties of numbers from which class q23 of primes is
selected for u ≤ 10. On the crossing of i-th column and j-th row there is an
answer if the number 2i 3j + 1 is prime (Y - yes, n - no), i ≥ 1, j ≥ 0. Answers
for potential new primes for u = 10 are on the right and bottom sides of the
map, there are 4 primes: 211 · 32 + 1, 211 · 36 + 1, 211 · 310 + 1, and 23 · 310 + 1.
In the top row letters Y point at Fermat primes: 3, 5, 17, and 257.
important, it is possible to bound number s2 in (16) for q23 . It seems that even
for s2 = O(1) there is enough primes to construct algorithms of sizes N → ∞.
Their multiplicative complexity might be o(N log N ) [23], while a∞ = 4, the
same as for the split-radix FFT.
5.3 Multiplications
Done in Appendix A analysis of growing disproportion between N , DFT size,
and Mmax , the size of the greatest irreducible polynomial product of the algo-
rithm for primes (16) leads to the following Theorem:
Theorem 2: There exist fast DFT algorithms having multiplicative
complexity
O(N logc log N ), c ≤ 1
i.e. lower than their arithmetical complexity O(N log N ).
Note: For primes q23 c = 1. If every Fermat number were prime,
multiplicative complexity for q2 primes would be O(N log0.5 N ) [23].
In this section evaluations on which the Theorem 2 is based are illustrated
by results for real data:
Let us introduce ordering of algorithms according to rank u. They are con-
structed using two types of small DFT modules: for sizes being prime numbers
(16):
q2p2 p3 ... = 1 + 2s1 p2 s2 . . . pd sd for s1 ≤ u + 1, si ≤ u, i = 2, 3, . . . (17)
where p2 = 3, p3 = 5, . . ., and for sizes being powers of primes up to: 2u+3 , and
pi u+1 [21], section 5.2. The only prime for u = 0 is 3 = 21 + 1, for u = 1 the
11
Figure 2: Values log Nmax as a function of u for primes q23 versus function 4 · u2
(dotted-dashed line).
12
Figure 3: Values log Nmax as a functions of u for primes q235 and q2357 compared
to functions 6 · u3 , and 10 · u4 , respectively (dotted-dashed lines).
13
products can be computed also by Algorithm 2. We need here a forward and
inverse DFT algorithm of size 2 · Mmax , or little more, that is one of rank u = 6,
see Fig.2, which in the worst case have polynomial products of maximum size
27 · 35 = 31104. They can be computed in the same way by the 65520-point
algorithm. Number of multiplications for the last algorithm is 262820, Table I,
hence, µ(65520) = 262820/65520 ≈ 4.01. Then, the worst case upper bound on
65520
µ(N ) for a rank u = 6 DFT algorithm is µ2 = 31104 · (2 · µ(65520) + 3) ≈ 23.2.
Finally, assume that the size of rank u = 6 DFT algorithm is (2 + ) · Mmax ,
which leads to upper bound on µ(Nmax ) = (2 + )(2 · µ2 + 3), somewhat above
100, compare with log2 log2 Nmax = 14.02, Theorem 2.
It is interesting how Algorithm 2 compares to Algorithm 1F for exact data,
i.e. what is the cost of computing 31104-point polynomial product by Algorithm
1F. Formulae from [14] show that reduced 93312-point FFT requires 406350
multiplications (FFT part for DFT indices mutually prime with N ). This leads
to µ1F = 2 · 406350
31104 + 3 ≈ 29.1 > µ2 . This is as expected, nevertheless, Algorithm
1F requires less additions. Here the aM coefficient is aM ≈ 0.875, it is used in
evaluation of µ(Nmax ) in the previous paragraph. Taking into account that for
”parental” 93312-point FFT aM ≈ 0.73 that it does not increase with N [12],
and that the difference in aM values between full FFT, and its reduced part
diminish with growing N , actual bound is probably even smaller.
6 Conclusion
One of open questions posed in [3] is, if the arithmetical complexity of DFT
and hence polynomial algebra algorithms is O(N log N ), or it is smaller. This
paper provides an interesting comment to this question — the DFT can be
computed using O(N log N ) arithmetical operations, among which all but few
are additions and subtractions. Namely, the number of multiplications needed
is O(N logc log N ) for c ≤ 1. The algorithms are constructed using nest-
ing technique guaranteeing that arithmetical complexity of the algorithms is
O(N log N ). The method is based on extensive use of polynomial transforms,
which implies special choice of component DFT modules. The limits on algo-
rithm multiplicative complexity are obtained for an assumption that the density
of primes being sizes of small DFT modules in the set of potential DFT sizes can
be approximated by the same relations as for primes in general. Experimental
results confirm that this premise is reasonable. What is important, for practical
range of sizes the new algorithms have approximately the same arithmetical
complexities as the best known FFTs, while require much less multiplications.
References
[1] J.W. Cooley, J.W. Tukey, ”An algorithm for machine computation of com-
plex Fourier series,” Math. Comput., vol. 19, pp. 297–301, 1965.
14
[2] —, ”Special issue on fast Fourier transform,” IEEE Trans. Audio Electroa-
coust., vol. AU-15, and vol. AU-17, June 1967, and June 1969.
[3] A.V. Aho, J.E. Hopcroft, J.D. Ullman, “The design and analysis of com-
puter algorithms”, Addison-Wesley, 1974.
[12] R. Stasinski, ”Split multiple radix FFT”, Proc. EUSIPCO 2022, pp. 2251-
2255, 2022.
[13] J.-B. Martens, ”Recursive cyclotomic factorization - a new algorithm for
calculating the Discrete Fourier Transform”, IEEE Trans. Acoust., Speech,
Signal Proces., vol. ASSP-32, pp. 750–761, 1984.
[16] D. Harvey, J. van der Hoeven. “Integer multiplication in time O(n log n)”,
hal-02070778, 2019.
[17] L. Bluestein, ”A linear filtering approach to the computation of discrete
fourier transform”, Audio and Electroacoustics, IEEE Transactions on, vol.
18, no. 4, pp. 451–455, 1970.
15
[18] C.M. Rader, “Discrete Fourier transform when the number of data samples
is prime”, Proc. IEEE, Vol. 56, pp. 1107–1108, 1968.
[19] R. Stasinski, “Prime factor DFT algorithms for new small-N DFT mod-
ules”, IEE Proc., Pt. G, Vol. 134, pp. 117–126, 1987.
[24] P. Ribenboim, ”The little book of big primes,” Springer-Verlag, 1991, 1996.
[25] K. Ireland, M. Rosen, “A classical introduction to modern number theory”,
Springer-Verlag, 1982.
A Proofs
Let us begin with evaluation of lower bound on ”density” of prime numbers
from the class q23 (16).
Corollary 1: For a natural number smax there exist O(smax ) nat-
ural numbers s2 < smax for which exists at least one prime number
of the form q23 (16).
Discussion instead of proof: In book [24] discussion on existence of primes k2n +1
for each k and some n is reported, k, n are natural numbers. The conclusion is
that k, for which k2n + 1 numbers are composite for all natural n, are extremely
rare. The only known ones are Sierpiński numbers. They meet some congruences
linked with divisors of fifth Fermat number, hence, for them usually k 6= 3s2 .
It is then highly improbable that other than Sierpiński numbers never being
primes are for almost all k of the form k = 3s2 , of course, if such numbers exist
at all. Q.E.D.
Theorem 4: With possible exception of Fermat prime numbers the
number of primes of the form (16) is infinite.
Proof: With smax → ∞, O(smax ) → ∞, too. Number classes q235... contain q23
one Q.E.D.
We are now (almost) sure that asymptotic considerations have sense for this
algorithm class.
16
In the following derivations it is assumed without proof that numbers (16)
are not peculiar, i.e. that density of primes among them can be evaluated using
the same approximations as for randomly chosen natural numbers. We need a
condition stronger than Corollary 1:
Corollary 2: For natural numbers s2 ≤ smax , and s1 ≤ s0max ,
s0max ≈ smax there exist O(smax ) prime numbers from the class q23
(16).
Proof: Let us consider average amounts of prime numbers in the following
series, see Figure 1, new rank u primes, section 5.3:
p3i = 2u+1 3i + 1, i = 0, 1, . . . , u;
p2i = 2i 3u + 1, i = 1, . . . , u + 1;
The average distance between a prime p and the following one is O(log p) [24],
[25], hence, we can evaluate the amount of primes as follows:
X 1
pcount ≈ , j = 2, 3. (19)
i
log pji
The denominator in (19) varies in small range of values, e.g. for p3i its range is
from (u + 1) log 2 to (u + 1) log 2 + c2 (u + 1) log 3, c2 is a constant, hence, (19)
can be bounded by, v = u + 1:
c2 v c2 c2 v c2
= > pcount > = = O(1), (20)
v log 2 log 2 v log 2 + c2 v log 3 log(2 · 3c2 )
similar evaluation is true for series p2i . Then, taking all primes for u = 1, 2, . . . , smax
we obtain O(smax ) primes Q.E.D.
Proof: Let us consider primes from the set q23 (16), i is the prime index:
For such primes Mmax is, see first footnote in section 5.3:
1 u+1 u
Mmax ≤ 2 3 = 3c3 u (22)
3
c3 is a constant. On the other hand, the maximum size of the DFT is4 (3), (21):
Y
Nmax = qi
i
4 Inclusion of Rader-Winograd DFTs for 2k1 , 3 k2 does not change the final evaluation.
17
for all i. Then:
Y Y 2
Nmax ≥ (2s1i 3i + 1) > 2s1 3i ≥ 22c4 u 3c4 u
i i
2
Nmax > 3c4 u , (23)
in the derivation it is assumed that for each power of 3 only one prime5 is taken
into consideration (i.e. s2i = i), i takes almost all values between 0 and u, i.e.
there are not more than 2c4 u primes (values of i), and that s1 ≥ 1 but is taken
small enough to fulfill the inequality in the center, c4 is a constant.
A polynomial product can be computed using Algorithm 2, based on the
DFT algorithm derived here, but of much smaller size, for µ(N ) see (18) :
µ(M -point PP) ≈ c1 + 2µ(2M -point DFT) ≈ 2µ(2M -point DFT). (24)
Q.E.D.
Assume now that polynomial products of size (22) are computed using Al-
gorithm 1F requiring O(Mmax log Mmax ) multiplications, i.e.:
for growing Nmax , the increase in number of additions is negligible, and condi-
tions (1), (2) are met also for the algorithm from Corollary 3.
5 Actually there are usually 4 primes in the series, section 5.3 .
18
We are now ready to prove Theorems 2 and 3:
Sketch of Proof: Let us consider series of numbers from class q235 (16):
19
B Few real algorithms
In Table I data for some interesting algorithms for smallest primes from classes
q2 and q23 can be found, u = 0, 1. Trivial multiplications are not counted. The
latter family is constructed using widely known DFT modules [20], plus the 13-
point one. Numbers of arithmetical operations are compared to those obtained
from formulae (rounded up):
mfft (N ) = N log2 N − 3N + 4,
(27)
afft (N ) = 2N log2 N + mfft (N ),
which for N being power of 2 describe multiplication and addition counts for
the split-radix FFT [10] (3 multiplications and 3 additions per complex multi-
plication). As can be seen, majority of algorithms in Table B.1 have smaller
arithmetical complexities than that for the split-radix FFT formula. The differ-
ence in percents of the number of arithmetical operations is provided in column
”saldo”. It should be noted that improvement in arithmetical complexity given
by algorithm from [11] being −5.56% is asymptotic, up to N = 16 numbers
of operations are identical to those for the split-radix FFT, then the numbers
gradually diverge. Of course, numbers of multiplications for algorithms from
Table B.1 are much smaller than for any FFT. Many of the algorithms are also
better than any presented in the bibliography for the same DFT size [20], even
if some WFTA improvements are done, see e.g. [7].
Note that grouped below a line Fermat primes based algorithms require
more multiplications, but less additions than those for q23 , nevertheless, if the
sum of arithmetical operations is considered, they are very interesting, too. In
particular, impressive is the performance of the 8160-point algorithm (without
FFTs, first line), especially that it is constructed using such ”ineffective” Rader-
Winograd DFT modules as the 17-point and 32-point ones6 .
In Fermat prime based algorithms with ”FFT modules” for N = 8160, and
16320 (second lines) some of 32- or 64-point DFT modules are FFTs, not Rader-
Winograd algorithms. Multipliers of these FFTs are nested (9), prescription how
it is done can be found in [19].
The computed on the basis of Theorem 1 theoretical minimum on the number
of multiplications for 65520-point DFT is 217556 (108778 non-trivial real or
imaginary multipliers, 2N = 131040). This is only 17.2% less than the number
from Table B.1. The greatest DFT algorithm in Table B.1 that meets the
theoretical lower limit on the number of multiplications is for N = 504.
6 The polynomial product modulo P (Z) = Z 8 + 1 in 17- and 64-point DFTs is computed
16
using 27 multiplications and 57 additions [20].
20
Table 1: Some real almost multiplierless DFT algorithms [19].
21