
Dual-Field Multiplier Architecture for Cryptographic Applications

E. Savaş (Sabanci University, Istanbul, Turkey TR-34956)
A. F. Tenca and Ç. K. Koç (Oregon State University, Corvallis, OR 97331)

Abstract

The multiplication operation in finite fields $GF(p)$ and $GF(2^n)$ is the most frequently used and most time-consuming operation in the hardware and software realizations of public-key cryptographic systems, particularly elliptic curve cryptography. In this paper, we propose a new hardware architecture for fast and efficient execution of the multiplication operation. The proposed architecture is scalable, i.e., it can handle operands of any size, limited only by the input/output and scratch space size and not by the computational unit. It can also be configured to fit the available chip area for the desired performance. The proposed architecture computes multiplication faster in $GF(2^n)$ than in $GF(p)$, which conforms with the premise of $GF(2^n)$ for hardware realizations.

I. Introduction

One of the motivations for fast and area-efficient hardware solutions for multiplication in finite fields $GF(p)$ and $GF(2^n)$ comes from the fact that they are the most time-consuming operations in cryptographic applications such as the decipherment operation of the RSA algorithm [1], the Diffie-Hellman key exchange algorithm [2], the Government Digital Signature Standard [3], and, more recently, elliptic curve cryptography [4].

In this paper, following the design principles introduced in [7], we present a novel scalable multiplier architecture that is unified in the sense that multiplication for both $GF(p)$ and $GF(2^n)$ is performed in the same datapath. Furthermore, the novelty presented here is that the multiplier works with radix-4 for $GF(2^n)$ and radix-2 for $GF(p)$. Therefore, the architecture is referred to as dual-radix. We will discuss the effects and advantages of these techniques on the chip area, the signal propagation time, and the clock cycle count to complete a multiplication operation.

A. Previous Work

The multiplier architecture presented here is based on the Montgomery multiplication algorithm, which was originally proposed as an efficient method for multiplication in $GF(p)$ [5]. The algorithm replaces division with simple shifts, which are particularly suitable for implementation in hardware as well as in software on general-purpose computers. Therefore, the Montgomery multiplication algorithm generally allows the design of a hardware unit with shorter signal propagation time (higher maximum clock frequency), besides taking advantage of certain design optimizations such as systolic array [6] and pipeline organizations [7].

In [8], it is also shown that Montgomery multiplication can be very efficient in $GF(2^n)$ when a polynomial basis is used. Since the steps of the Montgomery multiplication algorithm for both fields are almost identical, it is possible to design a unified architecture. The feasibility and advantages of designing such a unified multiplier architecture for elliptic curve cryptography have been extensively discussed in [7], [9].

Various hardware implementations of the Montgomery multiplication algorithm for limited-precision operands were proposed in [10]. Implementations utilizing high-radix modular multipliers have also been proposed in [11]. Aspects of using high-radix representation have been discussed in [12]. Even though very high-radix designs have certain complications in hardware, moderate radix values offer faster alternatives to simple radix-2 multiplier designs.

The original unified multiplier in [7] uses a radix-2 design and offers equal performance for both $GF(p)$ and $GF(2^n)$ at the same precision. For this very reason, however, the original design is not optimized, since it does not take advantage of using $GF(2^n)$, which is, in general, more efficient than $GF(p)$ in hardware implementations. Our first observation is that this situation can be remedied by putting to use the part of the circuitry which is underutilized in $GF(2^n)$ mode. This allows us to run the multiplier module in higher radix values for $GF(2^n)$ than those for $GF(p)$ without significantly increasing the design complexity.

B. Montgomery Multiplication Algorithm

In [5], Montgomery described a modular multiplication method which proved to be very efficient in both hardware and software implementations. An obvious advantage of the method is that it replaces division operations with simple shift operations. The method adds multiples of the modulus rather than subtracting it from the partial result. Refer to [5] for a detailed explanation of the algorithm.

Given two integers $a$ and $b$, and a prime modulus $p$, the Montgomery multiplication algorithm computes $\bar{c} = MonMult(a, b) = a \cdot b \cdot R^{-1} \pmod{p}$, where $R = 2^n$, $a, b < p < R$, and $p$ is an $n$-bit prime number. The Montgomery multiplication does not directly compute $c = a \cdot b \pmod{p}$; therefore certain transformation operations must be applied to the operands $a$ and $b$ before the multiplication and to the intermediate result $\bar{c}$ in order to obtain the final result $c$.

The Montgomery multiplication algorithm with radix-$2^k$ for $GF(p)$ can be given as follows:

Algorithm A
Input: $a, b \in [1, p-1]$, $p$, and $m$
Output: $c \in [1, p-1]$
1: $c := 0$
2: for $i = 0$ to $m - 1$
3:   $q := (c_0 + a_i \cdot b_0) \cdot p'_0 \pmod{2^k}$
4:   $c := (c + a_i \cdot b + q \cdot p)/2^k$

where $p'_0 = 2^k - p_0^{-1} \pmod{2^k}$. In the algorithm, the multiplier $a$ is written in base (radix) $2^k$ with digits $a_i$ so that $a = \sum_{i=0}^{m-1} a_i \cdot 2^{k \cdot i}$, where $m$ is the number of digits in $a$ and $m = n/k$. In Step 4, the multiplicand $b$, the modulus $p$, and the partial result $c$ enter the computations as full-precision integers. However, in our implementation we will treat $b$, $p$, and $c$ as multi-word integers in order to design a scalable multiplier, and in each clock cycle one word of these values will be processed. One may also consider this representation as writing the multiplicand, the modulus, and the partial result with digits $b^{(j)}$, $p^{(j)}$, and $c^{(j)}$ of $w$ bits, so that $b = \sum_{j=0}^{e-1} b^{(j)} \cdot 2^{w \cdot j}$, $p = \sum_{j=0}^{e-1} p^{(j)} \cdot 2^{w \cdot j}$, and $c = \sum_{j=0}^{e-1} c^{(j)} \cdot 2^{w \cdot j}$, where $e = n/w$. Note that the base $2^w$ used to represent $b$, $p$, and $c$ in Step 4 is different from the radix $2^k$ used to represent the multiplier $a$ in Step 3. Note also that $q$, $c_0$, $b_0$, and $p'_0$ are all $k$-bit integers.

In order to avoid possible confusion due to the usage of two different bases, we elect to refer to the digits of $b$, $p$, and $c$ as words when implementing Step 4, and use the term digit exclusively for the multiplier $a$, and for $b_0$, $p'_0$, and $c_0$ in Step 3 when they are in the same equation with the digits of $a$. Digits can be easily distinguished by their subscript notation (e.g., $a_i$ or $b_0$) from the superscript notation of words (e.g., $b^{(j)}$). We will also use the notation $x_{i,j}$ to denote the $j$th bit in the $i$th digit of $x$.

In addition, the radix of the multiplier architecture is determined by the base used to represent the multiplier $a$.

The Montgomery multiplication algorithm for $GF(2^n)$ is given below:

Algorithm B
Input: $a(x)$, $b(x)$, $p(x)$, and $m$
Output: $c(x)$
1: $c(x) := 0$
2: for $i = 0$ to $m - 1$
3:   $q(x) := (c_0(x) + a_i(x) \cdot b_0(x)) \cdot p'_0(x) \pmod{x^k}$
4:   $c(x) := (c(x) + a_i(x) \cdot b(x) + q(x) \cdot p(x))/x^k$

where $p'_0(x) = p_0^{-1}(x) \pmod{x^k}$. The two algorithms are almost identical, except that the addition operation in $GF(p)$ becomes a bitwise modulo-2 addition in $GF(2^n)$. In Algorithm A, there must be an extra reduction step at the end to reduce the result into the desired range if it is greater than the modulus. On the other hand, this step is not an essential part of the algorithm, and there are simple conditions that can be added to the algorithm in order to eliminate it [13]; hence we intentionally exclude it from the algorithm definitions.

From this point on, we will only use the notation introduced in Algorithm A for both $GF(p)$ and $GF(2^n)$ and leave polynomial notation completely out of our representation of field elements in $GF(2^n)$. Operations will be deduced from the mode ($GF(p)$ or $GF(2^n)$) in which the module is operated. The elements of both fields are represented identically in digital systems.

C. Precomputation in Montgomery Multiplication Algorithm

The unified multiplier architecture introduced in the next section utilizes a precomputation technique in order to decrease the critical path delay of the original unified multiplier in [7]. Note that Step 4 of Algorithm A computes

$$c := (c + a_i \cdot b + q \cdot p)/2^k.$$

Depending on the radix value chosen, the LSDs of the operands, $a_i$, $b_0$, and $c_0$, will determine which one of the values in $\{0, b, p, b + p, 2p, 2b, 2b + 2p, \ldots\}$ is added to the partial result $c$. If one precomputes and stores the value of $b + p$, the calculations in Step 4 can be significantly simplified.

There are three implications of the precomputation technique. First, the fact that an adder must be available to perform the precomputation potentially leads to an increase in the chip area. However, we show that such an adder is already an integral part of our design, and the precomputation will be done without any extra overhead in this sense. Second, the precomputed value must be stored, which implies an increase in the register space. Finally, there must be a so-called selection logic to select which multiples of $b$ and $p$ must participate in the addition in Step 4. The selection logic will naturally be on the critical path and can potentially result in an increase in both the chip area and the critical path delay. On the other hand, the precomputation technique also simplifies the design, since Step 4 can be performed with only one addition once the selection logic generates its output. We will provide implementation results to expose the effects of the precomputation technique on the multiplier design.

II. Radix-(2,4) Multiplier Architecture

In this section, we present a unified and scalable multiplier architecture which operates in radix-2 in $GF(p)$ mode and in radix-4 in $GF(2^n)$ mode; the architecture is therefore called radix-(2,4).

A. Processing Unit

In this section, we explain the design details of the processing unit (PU), which is basically responsible for performing Step 3 and Step 4 of Algorithm A.

Since the multiplier uses radix-2 for $GF(p)$, the least-significant bits (LSB) of the operand digits, $a_i$, $b_0$, and $c_0$, will determine which one of the values in $\{0, b, p, b + p\}$ is added to the partial result $c$. In the case of $GF(2^n)$, multiplication is performed in radix-4. Therefore, the LSDs (least significant digits) of $b$, $p$, and $c$ and the current digit of $a$ are needed in order to determine $q$. Since the LSB of $p$ is always 1, only $p_{0,1}$, the second least significant bit of the modulus, is included in the computations. Consequently, $a_{i,1}$, $a_{i,0}$, $b_{0,1}$, $b_{0,0}$, $c_{0,1}$, $c_{0,0}$, and $p_{0,1}$ determine one of the following values to be added to the partial result: $\{0, b, p, b + p, x \cdot b, x \cdot p, x \cdot (b + p)\}$.

[Figure 1: block diagram of the PU. Word inputs $b^{(j)}$, $p^{(j)}$, $(b+p)^{(j)}$, $cc^{(j)}$, and $cs^{(j)}$ feed MUX-0 and MUX-1, which are driven by $m_{00}$, $m_{01}$, $m_{10}$, and $m_{11}$ from the local control logic (inputs $a_i$, $b_0$, $cs_0$, $cc_0$, $p_0$); the MUX outputs enter a (3,2) adder array with a field select input, followed by a shifter and a latch to the next stage.]

Fig. 1. Processing unit of dual-radix architecture with radix-2 for $GF(p)$ and radix-4 for $GF(2^n)$

In Figure 1, the architecture of the processing unit (PU) used in the dual-radix multiplier is illustrated. The local control logic in Figure 1 contains the selection logic which generates the signals $m_{00}$, $m_{01}$, $m_{10}$, and $m_{11}$ to determine which multiples of $b$ and $p$ will be in the calculations. $cc_0$ and $cs_0$ in Figure 1 are the least significant digits of the carry and sum parts of the partial result $c$.
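The addend sets listed above can be illustrated with a small software model. The following is a minimal sketch, not the hardware design itself: it assumes field elements are held as Python integers (bit vectors in the $GF(2^n)$ case), that $n$ is even for radix-4, and that the function and variable names (`mont_gfp_radix2`, `mont_gf2n_radix4`, `clmul`, `polymod`, `mux0`, `mux1`) are hypothetical names chosen only for this illustration. It shows Algorithm A with $k = 1$ and Algorithm B with $k = 2$, each using the precomputed value $b + p$.

```python
def mont_gfp_radix2(a, b, p, n):
    """Algorithm A with k = 1: returns a*b*2^(-n) mod p, possibly plus p
    (the final reduction step is omitted, as in the algorithm definition)."""
    addend = (0, b, p, b + p)            # b + p is precomputed once
    c = 0
    for i in range(n):
        a_i = (a >> i) & 1
        q = (c ^ (a_i & b)) & 1          # q makes c + a_i*b + q*p even (p is odd)
        c = (c + addend[(q << 1) | a_i]) >> 1
    return c


def mont_gf2n_radix4(a, b, p, n):
    """Algorithm B with k = 2: returns a(x)*b(x)*x^(-n) mod p(x).
    Polynomials over GF(2) are stored as bit vectors; addition is XOR."""
    bp = b ^ p                           # polynomial b + p, precomputed once
    mux0 = (0, b, p, bp)                 # multiples weighted by x^0
    mux1 = (0, b << 1, p << 1, bp << 1)  # multiples weighted by x^1
    p01 = (p >> 1) & 1                   # second least significant bit of p
    b00, b01 = b & 1, (b >> 1) & 1
    c = 0
    for i in range(n // 2):
        ai0 = (a >> (2 * i)) & 1
        ai1 = (a >> (2 * i + 1)) & 1
        q0 = (c & 1) ^ (ai0 & b00)
        q1 = ((c >> 1) & 1) ^ (ai0 & b01) ^ (ai1 & b00) ^ (q0 & p01)
        # The two low bits of the sum are zero, so the shift by 2 is exact.
        c = (c ^ mux0[(q0 << 1) | ai0] ^ mux1[(q1 << 1) | ai1]) >> 2
    return c


def clmul(x, y):
    """Carry-less (polynomial) product over GF(2); used only for checking."""
    r = 0
    while y:
        if y & 1:
            r ^= x
        x <<= 1
        y >>= 1
    return r


def polymod(x, p):
    """Remainder of polynomial x modulo p over GF(2); used only for checking."""
    dp = p.bit_length()
    while x.bit_length() >= dp:
        x ^= p << (x.bit_length() - dp)
    return x


# Usage: both results satisfy c * R = a * b in the respective field.
c_int = mont_gfp_radix2(123, 200, 239, 8)
c_poly = mont_gf2n_radix4(0x57, 0x83, 0x11B, 8)
```

In this model the radix-4 loop runs half as many iterations as the radix-2 loop for the same $n$, which is the intuition behind the faster $GF(2^n)$ mode; the actual cycle counts of the pipelined hardware depend on the organization described in [7].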
The dual-radix architecture consists of one or more processing units (PU), identical to the one shown in Figure 1, organized in a pipeline. Each PU takes a digit ($k$ bits) from the multiplier $a$, the size of which depends on the radix and the mode (finite field), and operates on the words of $b$, $b + p$, $c$, and $p$ successively, starting from the least significant word. Starting from the second cycle, it generates one word of the partial result each cycle, which is communicated to the next PU. After $e + 1$ clock cycles, where $e$ is the number of words in the modulus (i.e., $e = n/w$), a PU finishes its portion of work and becomes free for further computation. One can refer to [7] for more information about the pipeline organization.

A redundant representation (Carry-Save) is used for the partial result in the architecture. Thus, for the partial result we can write $c = cc + cs$, where $cc$ and $cs$ stand for the carry and sum parts of the partial result. The redundant format necessitates an extra addition operation to transform the final result into nonredundant format at the end of the calculations. The transformation is simply performed by a carry-propagate adder (e.g., a carry look-ahead adder), which is also capable of modulo-2 addition in $GF(2^n)$-mode. The existence of an adder is also useful for performing the precomputation of $b + p$, which is used during multiplication.

B. (3, 2) Adder Array

The $n$-bit (3, 2) adder array shown in Figure 1 consists of two parts: single-bit dual-field adders (DFA) and a shift-and-alignment layer, as demonstrated in Figure 2. When used in $GF(p)$-mode, the DFA simply becomes a Carry-Save adder. A DFA cell is basically a full adder capable of doing addition with or without carry. It has an input called $FSEL$ that enables this functionality. Our implementation results demonstrated that this additional functionality is obtained almost without any cost.

[Figure 2: a row of $w$ dual-field adder cells with inputs $x$, $y$, $z$ and a shared $FSEL$ input, fed by MUX1 and MUX2 selectors and the sum words $cs^{(j)}$, producing $C^{(j)}$ and $S^{(j)}$ outputs that pass through a shift-and-alignment layer to give $cc^{(j)}$ and $cs^{(j)}$.]

Fig. 2. Dual Field Adder Array for radix-(2,4) Unified Multiplier

C. Selection Circuitry

As stated previously, the selection logic for the radix-(2,4) multiplier, which is shown in Figure 3, determines which of the inputs of MUX-0 and MUX-1 in Figure 1 are to be added in the (3, 2) adder array, which in turn calculates $c := c + a_i \cdot b + q \cdot p$.

In $GF(p)$-mode the multiplier uses radix-2; hence $m_{00}$ and $m_{01}$ must be calculated, while $m_{10}$ and $m_{11}$ are forced to 0 since input 0 of MUX-1 is always selected in this mode. We can use the following formulae to express the control inputs of MUX-0:

$$m_{00} = a_{i,0}$$

$$m_{01} = q_0 = (cs_{0,0} \oplus cc_{0,0} \oplus a_{i,0} \cdot b_{0,0})$$

where $\oplus$ stands for modulo-2 addition, $a_{i,j}$ denotes the $j$th bit of the digit $a_i$, $q_j$ is the $j$th bit of $q$, and $cs_{i,j}$ and $cc_{i,j}$ are the sum and carry bits of the partial result, respectively.

[Figure 3: gate-level selection logic producing $m_{00}$, $m_{01}$, $m_{10}$, and $m_{11}$ from $a_{i,0}$, $a_{i,1}$, $b_{0,0}$, $b_{0,1}$, $cs_{0,0}$, $cs_{0,1}$, $cc_{0,0}$, $p_{0,1}$, and $FSEL$, with a latch on the outputs.]

Fig. 3. Selection logic for radix-(2,4) multiplier

On the other hand, the multiplier computes with radix-4 in $GF(2^n)$-mode. Thus, the select inputs of MUX-1 must also be calculated. For this, we use the formulae

$$m_{10} = a_{i,1} \cdot FSEL$$

$$m_{11} = q_1 = [(cs_{0,1} \oplus a_{i,0} \cdot b_{0,1} \oplus a_{i,1} \cdot b_{0,0}) \oplus (cs_{0,0} \oplus a_{i,0} \cdot b_{0,0}) \cdot p_{0,1}] \cdot FSEL$$

Note that the first input of MUX-1, $cc$, is always zero in this mode, since the redundant form is also used for the partial result and its carry part is forced to zero.

III. Implementation Results

We implemented processing units of two different multiplier architectures: (A1) the original unified multiplier in [7], and (A2) the radix-(2,4) multiplier. We used VHDL to implement the two architectures and synthesized the resulting code using Mentor Graphics tools for an ASIC technology of 0.5 µm AMI CMOS (ADK library [14]).

Figure 4 shows the area and time delay of the two PU designs for different word sizes. Area consumption is always given in terms of 2-input NAND gates. Due to the highly modular nature of the design, the critical path of a PU determines the maximum clock frequency that can be applied to the whole multiplier.

[Figure 4: three panels for A1 and A2 plotted against word length ($w$): area (number of NAND gates), time (critical path delay in ns), and time x area.]

Fig. 4. Implementation results: Critical path delay and area

As can easily be observed from Figure 4, there is an increase in the area of the new architecture. There are two basic reasons for this increase: (1) an extra interstage register for passing the precomputed value, $b + p$, to the next stage, and (2) the selection logic. The selection logic becomes more complicated due to what may appropriately be called a look-ahead technique, which processes the least-significant bits of the operands. The fact that the two least significant bits of some operands are needed in the look-ahead technique partially explains the further increase in the area. A more complicated shift-and-alignment layer is another reason for the larger area usage in the dual-radix design. Note also that the relative increase in area becomes less significant as the word size increases. This can be explained by the fact that the area of the selection logic is independent of the word size. When $w = 32$, the area consumed by the selection logic becomes less significant. For example, the increase in area in the dual-field multiplier (A2) is 45% when $w = 32$. The use of the precomputation technique in architecture A2 improves the critical path delay by 18% to 23%.

The performance of the two multipliers in terms of the clock cycle count to perform a multiplication is determined, to a large extent, by the number of PUs ($t$) and the word size ($w$), which are subject to the limitations on the silicon area available. Therefore, the relative increase in the area of a PU may be misleading in evaluating the overall performance of the new architecture. Both architectures utilize many PUs organized in a pipeline. To provide more insight into the overall effect of the new architecture on area and time, we investigated the time to compute a multiplication for a precision range of cryptographic interest given a limited area. Figure 5 shows the results for multiplier configurations in $GF(p)$-mode with approximately 30,000 gates. We basically designed the multipliers for each architecture by putting in as many PUs as possible.

[Figure 5: three panels ($w = 8$, $w = 16$, $w = 32$) plotting multiplication time against precision (0 to 1000 bits) for A1 and A2.]

Fig. 5. Multiplication timings (in µs) for an area of 30,000 gates with $w$ = 8, 16, and 32 in $GF(p)$-mode

In this configuration, the new architecture, A2, offers a significant speedup in time performance over the original architecture A1 for the range of [160, ∼500]. Beyond the precision of 500 bits, the higher area requirements of the new architecture will have a negative impact on the performance. For the same area, the new architecture A2 is faster by 13% to 35%. Note that the maximum speedup of the new architecture exceeds the maximum speedup provided by a single PU. This is due to the fact that having more PUs does not always improve the performance and may therefore result in a slight degradation for some bit lengths. The dual-radix architecture offers a significant speedup over A1 in $GF(2^n)$-mode. It outperforms A1 by 56% to 67% in this mode.

IV. Summary and Conclusions

Using the design methodology proposed in [7], we presented a new unified multiplier architecture, called the dual-radix architecture, for binary extension and prime fields. The architecture utilizes a precomputation technique and improves the critical path delay significantly. The cost of implementing the precomputation technique in hardware in terms of area is studied, and it has been concluded that the overall impact is insignificant for a large range of precision. The dual-radix architecture also facilitates faster computation of multiplication in $GF(2^n)$-mode than in $GF(p)$-mode. The area and speed characteristics of the dual-radix architecture are also extensively investigated, and its performance in terms of area and time is compared against the single-radix unified multiplier architecture. At the expense of using extra resources, which proved to have a very limited impact on the silicon area under certain circumstances, it provides a significant improvement in critical path delay compared to the original unified design in both $GF(p)$ and $GF(2^n)$-modes. Furthermore, it provides superior performance in $GF(2^n)$-mode.

References

[1] J.-J. Quisquater and C. Couvreur. Fast decipherment algorithm for RSA public-key cryptosystem. Electronics Letters, 18(21):905–907, October 1982.
[2] W. Diffie and M. E. Hellman. New directions in cryptography. IEEE Transactions on Information Theory, 22:644–654, November 1976.
[3] National Institute for Standards and Technology. Digital Signature Standard (DSS). FIPS PUB 186-2, January 2000.
[4] N. Koblitz. Elliptic curve cryptosystems. Mathematics of Computation, 48(177):203–209, January 1987.
[5] P. L. Montgomery. Modular multiplication without trial division. Mathematics of Computation, 44(170):519–521, April 1985.
[6] Colin D. Walter. An improved linear systolic array for fast modular exponentiation. IEE Proceedings - Computers and Digital Techniques, 147(5):323–328, Sept. 2000.
[7] E. Savaş, A. F. Tenca, and Ç. K. Koç. A scalable and unified multiplier architecture for finite fields GF(p) and GF(2^m). In Ç. K. Koç and C. Paar, editors, Cryptographic Hardware and Embedded Systems - CHES 2000, Lecture Notes in Computer Science No. 1965, pages 281–296. Springer, Berlin, Germany, 2000.
[8] Ç. K. Koç and T. Acar. Montgomery multiplication in GF(2^k). Designs, Codes and Cryptography, 14(1):57–69, April 1998.
[9] Johann Grossschadl. A bit-serial multiplier architecture for finite fields GF(p) and GF(2^k). In Cryptographic Hardware and Embedded Systems, Lecture Notes in Computer Science, No. 2162, pages 202–219. Springer-Verlag, Berlin, 2001.
[10] A. Bernal and A. Guyot. Design of a modular multiplier based on Montgomery's algorithm. In 13th Conference on Design of Circuits and Integrated Systems, pages 680–685, Madrid, Spain, November 17–20 1998.
[11] P. Kornerup. High-radix modular multiplication for cryptosystems. In E. Swartzlander, Jr., M. J. Irwin, and G. Jullien, editors, Proceedings, 11th Symposium on Computer Arithmetic, pages 277–283, Windsor, Ontario, June 29 – July 2 1993. IEEE Computer Society Press, Los Alamitos, CA.
[12] C. D. Walter. Space/Time trade-offs for higher radix modular multiplication using repeated addition. IEEE Transactions on Computers, 46(2):139–141, February 1997.
[13] Colin D. Walter. Montgomery exponentiation needs no final subtractions. Electronics Letters, 35(21):1831–1832, October 1999.
[14] ASIC design kit. Mentor Graphics Co.
