
This article has been accepted for inclusion in a future issue of this journal.

Content is final as presented, with the exception of pagination.

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS 1

Low-Complexity Digit-Level Systolic Gaussian Normal Basis Multiplier
Qiliang Shao, Zhenji Hu, Shaobo Chen, Pingxiuqi Chen, and Jiafeng Xie, Member, IEEE

Abstract— Normal basis multiplication over $GF(2^m)$ is widely used in various applications such as elliptic curve cryptography. As a special class of normal basis with low complexity, the Gaussian normal basis (GNB) has received considerable attention recently. In this paper, we propose a novel decomposition algorithm to develop a digit-level (DL) low-complexity systolic structure for GNB multiplication over $GF(2^m)$. First, we propose two algorithms separately to achieve a systolic GNB multiplier with low critical path delay and low register complexity. Next, we present the corresponding structure according to the proposed algorithm (a combination of the previous two proposed algorithms). Compared with the existing systolic DL GNB multipliers (through both theoretical and application-specific integrated circuit comparison), the proposed multiplier achieves significantly less area-delay product (ADP), e.g., for a systolic structure of digit size 8 for $GF(2^{409})$, the proposed structure has 12.3% less ADP than the best of the existing designs for the same digit size.

Index Terms— Digit-level (DL), Gaussian normal basis (GNB), low critical path delay (CPD), low register complexity, systolic structure.

Manuscript received December 1, 2016; revised March 26, 2017 and June 11, 2017; accepted June 22, 2017. This work was supported by Ohio RAPIDS grant. (Corresponding author: Jiafeng Xie.)
Q. Shao, S. Chen, P. Chen, and J. Xie are with the Department of Electrical Engineering, Wright State University, Dayton, OH 45435 USA (e-mail: shao.10@wright.edu; chen.181@wright.edu; chen.148@wright.edu; jiafeng.xie@wright.edu).
Z. Hu is with the School of Law, Shanghai University of Finance and Economics, Shanghai, China (e-mail: jenneyhu1986@163.com).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TVLSI.2017.2720190
1063-8210 © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

I. INTRODUCTION

LOW-COMPLEXITY implementations of finite field multipliers over $GF(2^m)$ have drawn substantial attention recently due to their widespread applications in various environments. Considerable effort has been devoted to obtaining low-complexity multipliers for various high-performance uses, including elliptic curve cryptography [1]–[8].

In general, three bases can be selected to represent a finite field, i.e., polynomial basis (PB), normal basis, and dual basis [9]–[24]. The Gaussian normal basis (GNB), as a special class of normal basis over $GF(2^m)$ [9]–[13] (where $m > 1$ and $m$ is not divisible by eight), has received considerable attention in the literature due to its low-complexity implementation (compared with PB and dual basis, GNB is much more efficient in hardware designs involving many squaring operations, since GNB squaring has almost no hardware cost). GNB has also been included in a number of standards such as the IEEE [25] and NIST [26] for elliptic curve digital signature algorithm.

A number of designs have been reported for efficient implementation of field multiplication based on GNB [3], [22], [23], [27], [29]–[39], including: 1) bit-level structures such as parallel-in serial-out (PISO) [27], serial-in parallel-out (SIPO) [29], [30], and parallel-in parallel-out (PIPO) [31], [32]; 2) digit-level (DL) designs such as PISO [33], PIPO [22], [23], and SIPO [24]; and 3) bit-parallel ones such as the multiplier of [3]. The DL structures have gained substantial attention recently due to their efficient tradeoff between area and time complexities [24], [34].

For large field sizes in $GF(2^m)$, the multiplications can be realized by using systolic arrays to achieve high-speed and regular implementations [14]. Systolic structures are widely used in applications with high-performance requirements, as the processing elements (PEs) in the structure employ registers for pipelining. In [15], Kwon presented an efficient digit-serial systolic multiplier based on optimal normal basis. In [16], another systolic multiplier is proposed for high-performance implementation. Other efficient systolic multipliers over $GF(2^m)$ have been proposed in [14] and [17]. Besides that, efficient digit-serial systolic multipliers are introduced in [18]–[20]. Very recently, an efficient DL systolic GNB multiplier was introduced in [24]. Overall, systolic realizations of DL GNB multipliers are not abundant in the literature. Furthermore, although these GNB multipliers have been optimized to achieve low complexity through DL implementation, their area–time complexities are still relatively high and need to be improved, e.g., the register complexity and critical path of [24] are still large and can be improved further.

Based on the above consideration, we propose, in this paper, a novel decomposition algorithm to develop DL systolic GNB multipliers over $GF(2^m)$ in order to achieve lower critical path, high-speed, and low-complexity implementations. We briefly introduce the motivation of the proposed work (based on the concepts of novel cut-set retiming and novel input signal broadcasting) after the review of the work of [24]. Then, we introduce novel multiplication algorithms to reduce the critical path delay (CPD) and register complexity, respectively. Furthermore, a new structure of the proposed systolic GNB multiplier is proposed. The proposed multiplier can achieve low CPD and low register complexity compared with the best existing GNB systolic multiplier of [24]. Finally, we also compare the hardware and time complexities of the proposed architectures with the existing ones through

application-specific integrated circuit (ASIC) synthesis to benchmark the higher efficiency of the presented design.

The organization of this paper is as follows. In Section II, we briefly review the preliminaries of GNB multiplication over $GF(2^m)$ (including the existing DL systolic GNB multiplier) and introduce the cut-set retiming and signal broadcasting techniques (the motivation of the proposed strategy). In Section III, a novel DL systolic multiplication algorithm is proposed. The proposed low-CPD, low-register-complexity DL systolic GNB multiplier is described in Section IV. In Section V, we benchmark the hardware and time complexities of the proposed design along with the corresponding existing ones. Conclusions are given in Section VI.

Algorithm 1 Existing DL Systolic Multiplication [24]

II. REVIEW OF THE EXISTING DL SYSTOLIC GNB MULTIPLIER [24] AND MOTIVATION OF THE PROPOSED WORK

A. Existing DL Systolic GNB Multiplier [24]


A normal basis $N = \{\beta, \beta^{2}, \beta^{2^2}, \ldots, \beta^{2^{m-1}}\}$ exists in the finite field $GF(2^m)$ over $GF(2)$ for any positive integer $m$, where $\beta$ is called the normal element. Each field element in $GF(2^m)$, taking $A = (a_0, a_1, \ldots, a_{m-1})$ as an example, can be represented as a linear combination of the elements in $N$, i.e., $A = \sum_{i=0}^{m-1} a_i \beta^{2^i} = a_0\beta + a_1\beta^{2} + \cdots + a_{m-1}\beta^{2^{m-1}}$, where $a_i \in GF(2)$, $0 \le i \le m-1$. Assume that $m > 1$ and $T > 1$ are two integers. Let $p = mT + 1$ be a prime number and $\gcd(mT/k, m) = 1$, where $k$ is the multiplicative order of 2 modulo $p$. Then, the normal basis $N = \{\beta, \beta^{2}, \beta^{2^2}, \ldots, \beta^{2^{m-1}}\}$ of $GF(2^m)$ over $GF(2)$ is called the GNB of type $T$ [21].

The multiplication over GNB can be performed based on a multiplication matrix $M_{m \times m}$ [25]. Let $A$ and $B$ be two field elements over $GF(2^m)$

\[
A = \sum_{i=0}^{m-1} a_i \beta^{2^i}, \qquad B = \sum_{j=0}^{m-1} b_j \beta^{2^j} \tag{1}
\]

then we can have their product $C$ as

\[
C = (c_0, c_1, \ldots, c_{m-1}) = AB = \sum_{i=0}^{m-1} \sum_{j=0}^{m-1} a_i b_j \beta^{2^i + 2^j}. \tag{2}
\]

Let us define $\mu_{i,j} = \beta^{2^i + 2^j} \in GF(2^m)$ as a field element, where $0 \le i, j \le m-1$. Then, with respect to $N$, one can have

\[
\mu_{i,j} = \sum_{l=0}^{m-1} \mu_{i,j}^{(l)} \beta^{2^l}. \tag{3}
\]

Substituting (3) into (2), one can have

\[
C = \sum_{i=0}^{m-1} \sum_{j=0}^{m-1} a_i b_j \sum_{l=0}^{m-1} \mu_{i,j}^{(l)} \beta^{2^l}
  = \sum_{i=0}^{m-1} \sum_{j=0}^{m-1} \sum_{l=0}^{m-1} a_i b_j \mu_{i,j}^{(l)} \beta^{2^l}. \tag{4}
\]

Let us define the $l$th coordinate of $C$ as

\[
c_l = \sum_{i=0}^{m-1} \sum_{j=0}^{m-1} a_i b_j \mu_{i,j}^{(l)} \tag{5}
\]

which can also be represented in a matrix form as

\[
c_l = \mathbf{a} \cdot \mathbf{M}^{(l)} \cdot \mathbf{b}^{tr}, \qquad 0 \le l \le m-1 \tag{6}
\]

where $\mathbf{a} = [a_0, a_1, \ldots, a_{m-1}]$ denotes the row vector corresponding to the field element $A$, and $\mathbf{b}^{tr}$ represents the matrix transpose of the row vector $\mathbf{b} = [b_0, b_1, \ldots, b_{m-1}]$, which corresponds to the field element $B$. $\mathbf{M}^{(l)}$ can be obtained from the $l$-fold right and down cyclic shifting of the multiplication matrix $\mathbf{M} = \mathbf{M}^{(0)}$. Then, we can write the product $C$ as

\[
C = \sum_{l=0}^{m-1} c_l \beta^{2^l}. \tag{7}
\]

Recently, a low-latency systolic GNB multiplier has been proposed in [24]. Algorithm 1 describes the existing systolic GNB multiplication. Let us define $q = \lceil m/d \rceil$, where $d$ is the digit size and $1 \le d \le m$. Then, the product $C = AB$ can be performed as $C = \sum_{i=0}^{q-1} L^{2^{id}}(V \gg id, B \gg id)$, where $L(V, B) = \sum_{j=0}^{d-1} J^{2^{j}}(V \gg j, B \gg j)$ (note that this version is directly obtained from [20, Fig. 2]; for detailed information, the readers can always refer to [22, eqs. (36)–(42), (46)], especially concerning the matter of the structure's last cycle operation). Let us define $n$ and $k$ as two integers that satisfy $q = kn$; then, we can get the partial product $C_i$ by $C_i = \sum_{j=0}^{k-1} L^{2^{jd}}(V_i \gg jd, B_i \gg jd)$. Thus, one can decompose the product $C$ into $n$-term partial products, which is

\[
C = C_0 + C_1^{2^{kd}} + \cdots + C_{n-1}^{2^{(n-1)kd}}
  = \bigl(\bigl((C_{n-1})^{2^{kd}} + C_{n-2}\bigr)^{2^{kd}} + \cdots\bigr)^{2^{kd}} + C_0 .
\]

Each partial product can be written as

\[
C_i = \bigl(\bigl((L(V_i, B_i))^{2^{d}} + L(V_i \gg d, B_i \gg d)\bigr)^{2^{d}} + \cdots\bigr)^{2^{d}} + L(V_i \gg (k-1)d, B_i \gg (k-1)d).
\]

Based on Algorithm 1, Fig. 1(a) depicts the existing DL systolic GNB multiplier over $GF(2^m)$ [24]. We can see that

SHAO et al.: LOW-COMPLEXITY DL SYSTOLIC GNB MULTIPLIER 3

the existing multiplier contains $k$ PEs and one accumulating module (AM). Each PE carries out Steps 8 and 9 of Algorithm 1, and the AM carries out Step 11. Fig. 1(b) presents the detailed internal structure of the PEs. According to Algorithm 1, we need to define the output $B$ of PE$_j$ as $B_{i,j}$ to compute the partial product $C_i$. Two elements $V_i$ and $B_i$, $1 \le i \le n-1$, are fed to the multiplier from the left cyclically to compute the partial product $C_i$ recursively. The latency of the multiplier is $(k + n)$ cycles, i.e., it takes $(k + n)$ cycles to get the final product $C = AB$. The CPD of the existing systolic GNB multiplier is $T_A + (\lceil \log_2 T \rceil + \lceil \log_2 (d + 1) \rceil) T_X$, where $T_A$ and $T_X$ are the delays of an AND gate and an XOR gate, respectively. The structure of Fig. 1 has $(m/d)^{1/2} dm$ AND gates, $\{\le (m/d)^{1/2} d(m-1)(T-1)/2 + (1 + (m/d)^{1/2} d)m\}$ XOR gates, and $(1 + 3(m/d)^{1/2})m$ bit-registers.

Fig. 1. (a) Existing DL systolic GNB multiplier over $GF(2^m)$ [24]. (b) Detailed structure of PE.

B. Motivation of the Proposed Work

The work of [24] achieved high-performance operation, yet it can be improved further, i.e., the critical path of the design in [24] is still large and the register complexity of each PE is still high (each PE has $3m$ registers).

Since the cut-set retiming technique can adjust the critical path of a design according to the specific application requirement, we propose an important design innovation to appropriately cut-set the DL systolic multiplier to minimize the clock period, along with the appropriate modification of the multiplication algorithm (covered in Section III-A). Meanwhile, it is noted that a systolic structure can have global input broadcasting to provide inputs to all PEs [40]. We here propose novel algorithms to facilitate a modified signal broadcasting for low-register implementation.

III. PROPOSED ALGORITHMS

In this section, we present our proposed algorithms separately to reduce the critical path and register complexity.

A. Example

Let us follow the example presented in [22], i.e., type 4 GNB over $GF(2^7)$. Note that the proposed design in this paper does not focus on computation complexity; therefore, for detailed information, one needs to refer to [22, eqs. (29)–(47)]. Assume $d = 2$ and $k = 2$; then, we have $q = 4$ and $n = 2$. The multiplication matrix $\mathbf{M}$ is

\[
\mathbf{M} =
\begin{bmatrix}
0 & 1 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 1 & 0 & 0 & 1 & 1 \\
0 & 1 & 0 & 1 & 1 & 1 & 0 \\
0 & 0 & 1 & 0 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 1 \\
0 & 1 & 1 & 1 & 0 & 0 & 1 \\
0 & 1 & 0 & 0 & 1 & 1 & 1
\end{bmatrix}. \tag{8}
\]

Then, following Algorithm 1 of [24] and the example of [22], we can have the detailed internal structure of the PE for the structure in Fig. 1; the detailed computation steps are in [22, eqs. (29)–(47)].

Fig. 2. Example of the internal structure of the existing DL systolic GNB multiplier over $GF(2^m)$ [24], where type 4 GNB over $GF(2^7)$ is taken ($d = 2$, $k = 2$, and the shift symbols denote circular shifting) and the black box denotes registers.

B. Low Critical Path Delay

From the structure of Fig. 2 and Algorithm 1, we find that inside the PE, the operation $L(V_{in}, B_{in})$ consists of three main suboperations, i.e., one bit-addition, one operand multiplication (MA), and one operand addition. The first addition consists of a recombined addition operation (RAO), which performs the addition of a series of reconstructed operand $B$ itself. All the bits of operand $B$ are positionally recombined and some of them are added together to form the new bits according to Algorithm 1. Then, the multiplication is performed between the operand $B$ after the recombined addition and the operand $V$, as well as shifted addition, as shown by the example in Fig. 2 (MA). The result of MA is added with the shifted input from the previous PE and then produces the result to the next PE on its right. The CPD of the structure of Fig. 1 is thus $T_A + (\lceil \log_2 T \rceil + \lceil \log_2 (d + 1) \rceil) T_X$.

Although the structure in Fig. 1 is efficient in implementation, it can still be improved further, e.g., the CPD in Fig. 2 needs to be shortened for practical high-performance applications. Here, we introduce a novel algorithm (based on appropriate cut set) in which the CPD is shortened by performing the RAO and MA in advance.

Fig. 3. Proposed cut-set strategy of designing low CPD DL systolic GNB multiplier over $GF(2^m)$, where S box refers to shift operation, RAO denotes recombined addition operation, ⊗ denotes MA operation, and ⊕ denotes operand addition operation. (a) Proposed DL systolic GNB multiplier. (b) Detailed structure of PEs, where black boxes denote registers.

Fig. 3 depicts the proposed cut-set strategy of achieving low-CPD implementation based on the existing DL systolic multiplier over $GF(2^m)$. From Fig. 3(a), one can see that the structure consists of $k$ PEs and one accumulation cell (AC) after the proposed novel cut-set retiming. The detailed internal operational structure of these components is presented in Fig. 3(b). First of all, PE-1 performs the first pair of RAO and MA operations on operands $B$ and $V$. The result of MA from PE-1 is then passed to PE-2 to perform the operand addition with shifted $C$. Meanwhile, we still have the RAO and MA operations in PE-2, which passes its result to the next PE on its right. The "S box" performs the shifting of the bits of both operands $V$ and $B$. The last operand addition operation is performed in the AC, and we get the first partial product after $k$ clock cycles. All the remaining partial products $C_i$ are recursively accumulated in the AC to produce the final result $C$ after $n$ clock cycles. The CPD of this proposed architecture is thus $T_A + (\lceil \log_2 T \rceil + \lceil \log_2 d \rceil) T_X$ (for NIST recommended GNB, $T_A + (\lceil \log_2 T \rceil + \lceil \log_2 d \rceil) T_X$ is larger than $2T_X$, the time duration required for the AC), which is shorter than that of [24].

According to the proposed cut-set retiming strategy shown in Fig. 3, we derive here the modified low critical path DL systolic multiplication algorithm as presented in the proposed Algorithm 2.

Algorithm 2 Proposed Low CPD DL Systolic Multiplication

where MA denotes the operand multiplication operation inside each PE (the detail of this step can be seen in [22, eqs. (29)–(47)] and the example of Fig. 2). According to Algorithm 2, we perform the RAO and MA operations of the operands $B$ and $V$ in advance, which corresponds to Steps 8 and 9. Each PE (from PE-2 to PE-$k$) then executes the computation of Steps 10 and 11 of Algorithm 2 later, while the AC performs Step 13. By computing the RAO and MA operations in advance in each PE, we have shortened the critical path of the structure of [24].

C. Low Register Complexity

Systolic structures sometimes suffer from large register complexity, as all the PEs in the array are uniform and fully pipelined (there are a lot of registers in the PEs). Noting that a systolic structure can have global input broadcasting to provide inputs to all PEs [40], we propose here a novel signal broadcasting strategy to reduce the corresponding registers among PEs.

As seen in Fig. 1(b), there are generally two types of registers in one PE: one type for operand pipelining (after bit shifting, the top one for operand $B$ and the bottom one for operand $V$), and another for pipelining of computation [the registers used to pipeline the data after the $L(B_{in}, V_{in})$ operation]. The registers used to pipeline the computational data are critical to the correctness of the final output, while the registers for pipelining the shifted operands (the top and bottom ones) are relatively less important.

Based on the above consideration, we propose here a novel strategy to reduce the registers related to the pipelining of the shifted operands (the top and bottom ones). Let us first consider the data pipelining of shifted operand $B$ among PEs in the existing design of [24]. It is seen in Fig. 4 that the shifted operand's subscript (the subscript denotes the degree of shifting, according to Fig. 1) increases by one per cycle for a single PE (for neighboring PEs, within the same cycle, the subscript increases with the numbering of the PE). The pipelining of the bits of shifted operand $V$ is similar to that of the shifted operand $B$, as shown in Fig. 5.
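The two CPD expressions quoted in Section III-B, $T_A + (\lceil \log_2 T \rceil + \lceil \log_2 (d+1) \rceil) T_X$ for the existing structure and $T_A + (\lceil \log_2 T \rceil + \lceil \log_2 d \rceil) T_X$ after the proposed cut-set retiming, can be compared with a short Python sketch. It counts only the number of XOR levels multiplying $T_X$ (no gate delay values are assumed); the helper names are hypothetical, and $T = 4$ is taken from the type 4 example used in this paper.

```python
def ceil_log2(x):
    """Exact ceil(log2(x)) for an integer x >= 1, using integer arithmetic."""
    return (x - 1).bit_length()


def xor_levels_existing(T, d):
    """XOR levels in the CPD of the existing structure: ceil(log2 T) + ceil(log2(d + 1))."""
    return ceil_log2(T) + ceil_log2(d + 1)


def xor_levels_retimed(T, d):
    """XOR levels in the CPD after the cut-set retiming: ceil(log2 T) + ceil(log2 d)."""
    return ceil_log2(T) + ceil_log2(d)


T = 4  # GNB type, as in the GF(2^7) example
for d in (2, 3, 4, 7, 8, 14):
    old = xor_levels_existing(T, d)
    new = xor_levels_retimed(T, d)
    print(f"d={d:2d}: existing T_A + {old} T_X, retimed T_A + {new} T_X, saved {old - new} T_X")
```

Because $\lceil \log_2 (d+1) \rceil$ exceeds $\lceil \log_2 d \rceil$ only when $d$ is a power of two, the sketch reports one XOR level saved for $d = 2, 4, 8$ and none for the other digit sizes.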

Fig. 4. Data pipelining of operand $B$ among PEs for the structure of Fig. 1, where the diagonal line represents data flow between PEs, and the vertical line represents data flow in one PE.

Fig. 5. Data pipelining of operand $V$ among PEs for the structure of Fig. 1, where the diagonal line represents data flow between PEs, and the vertical line represents data flow in one PE.

Fig. 6. Data pipelining of operand $B$ among PEs with added operands for the structure of Fig. 1, where the gray area represents all added operands, and the green area represents one specific operand cluster fed to all PEs.

Fig. 7. Data pipelining of operand $V$ among PEs with added operands for the structure of Fig. 1, where the gray area represents all added operands, and the green area represents one specific operand cluster fed to all PEs.

To give the detailed register reduction strategy, we can first add extra operands with the corresponding subscript to fill the data flow table (highlighted gray areas), such that the operand's subscript increases by one per cycle for a single PE and by one for neighboring PEs, as shown in Figs. 6 and 7. For simplicity of discussion, we define all the shifted operands for one specific clock cycle as one operand cluster, as shown in Figs. 6 and 7. Thus, after the required clock cycles, we can still get the same output as that in Fig. 1, since the proper operand will still be fed to the corresponding PE during each cycle period (as illustrated by the example in Fig. 8).

According to the rearranged data pipelining scheme, we then define the operand vectors $\mathbf{B}_t$ and $\mathbf{V}_t$ (for $k$ PEs), where $0 \le t \le n + k - 1$, to represent each operand cluster, whose initial value is

\[
\begin{aligned}
\mathbf{B}_t &= [B_{(n-1-t)},\; B_{(n-t)} \gg d,\; \ldots,\; B_{(n+k-2-t)} \gg (k-1)d] \\
\mathbf{V}_t &= [V_{(n-1-t)},\; V_{(n-t)} \gg d,\; \ldots,\; V_{(n+k-2-t)} \gg (k-1)d]
\end{aligned}
\tag{9}
\]

which can also be represented as

\[
\begin{aligned}
\mathbf{B}_t &= [B_t^{(0)}, B_t^{(1)}, \ldots, B_t^{(k-2)}, B_t^{(k-1)}] \\
\mathbf{V}_t &= [V_t^{(0)}, V_t^{(1)}, \ldots, V_t^{(k-2)}, V_t^{(k-1)}]
\end{aligned}
\tag{10}
\]

where $B_t^{(0)} = B_{(n-1-t)}$, $B_t^{(1)} = B_{(n-t)} \gg d$, …, $B_t^{(k-1)} = B_{(n+k-2-t)} \gg (k-1)d$, and likewise for $\mathbf{V}_t$. The remaining computation steps, such as RAO and MA, are still the same as those in [24].

Then, we have the modified low register complexity DL systolic multiplication algorithm as proposed in Algorithm 3.

Algorithm 3 Low Register Complexity DL Systolic Multiplication

where Steps 5 and 6 denote the initialization of the value of the operand vector and their cyclic shifting. Through this arrangement, the registers used for pipelining shifted operands within PEs can be removed.

D. Proposed DL Systolic GNB Multiplication Algorithm

Based on the discussion in Sections III-B and III-C, we then combine the two modified algorithms together to propose our

novel multiplication algorithm. The proposed multiplication algorithm for the DL systolic GNB multiplier over $GF(2^m)$ is described in Algorithm 4.

Algorithm 4 Proposed DL Systolic Multiplication

As one can see in Algorithm 4, Steps 5 and 6 perform the initialization of the value of the operand cluster as well as the cyclic shifting. RAO is performed by Steps 7 and 10, and AC is computed by Step 12. By computing the first RAO in advance and rearranging the data broadcasting, we have successfully shortened the CPD and reduced the register complexity.

Fig. 8. Example of rearranging data pipelining by using operand cluster $\mathbf{B}_i$, where the operand cluster is shifted connected with each PE according to (9) and (10).

Fig. 9. Proposed low-complexity DL systolic GNB multiplier over $GF(2^m)$. (a) Proposed structure of DL systolic GNB multiplier. (b) Detailed internal structures of PEs, where black boxes denote registers.

IV. PROPOSED LOW CRITICAL PATH DELAY AND LOW REGISTER COMPLEXITY DL SYSTOLIC GNB MULTIPLIER

Based on the proposed Algorithm 4, we present here the proposed DL systolic GNB multiplier over $GF(2^m)$, which can achieve both low CPD and low register complexity.

A. Proposed Structure

The proposed DL systolic structure of the GNB multiplier over $GF(2^m)$ based on the proposed Algorithm 4 is depicted in Fig. 9. As shown in Fig. 9(a), it consists of one AC, $k$ PEs, and two shift registers for operands $B$ and $V$, respectively. The detailed internal structures of the AC and PEs are presented in Fig. 9(b). The shift registers rearrange all the bits of operands $B$ and $V$, so that only one operand cluster is fed to the $k$ PEs in one clock cycle period. The PE-1 yields the output to PE-2 after performing the RAO
The PE-1 yields the output to PE-2 after performing the RAO

and MA operations. The internal structures of PEs from PE-2 to PE-$k$ are the same, where each PE contains one RAO, one MA operation, and one operand addition operation. The RAO performs the reconstructed addition operation, whose result is passed to the MA operation in the same PE. The result of the MA operation is then passed to the next PE to be added with the shifted output of the addition operation from the previous PE. PE-$k$ performs the last operand addition and accumulation operations. After the AC receives its first input from the left, the final result $C$ becomes available in $(n + k)$ clock cycles.

Fig. 10. Example of the proposed DL systolic structure. (a) Proposed structure. (b) Detailed example of internal structure of PE, where the black box denotes the registers.

B. Example

Let us follow the example presented in Fig. 2, i.e., type 4 GNB over $GF(2^7)$. For detailed information, one can always refer to [22].

When $d = 2$ and $k = 2$, we have $q = 4$ and $n = 2$. According to Algorithm 4, we can have a structure similar to Fig. 9 with two PEs and one AC, shown in Fig. 10.

Let the initial bits of $B$ and $V$ loaded in the shift registers be $(b_0, b_1, b_2, \ldots, b_5, b_6)$ and $(a_0, a_1, a_2, \ldots, a_5, a_6)$, respectively (following the similar example in [22]). For the first cycle, according to (8) and (9), we can have

\[
\mathbf{B}_0 = [B_0^{(0)}, B_0^{(1)}], \qquad \mathbf{V}_0 = [V_0^{(0)}, V_0^{(1)}] \tag{11}
\]

where

\[
B_0^{(0)} = (b_3, b_4, b_5, \ldots, b_1, b_2), \qquad B_0^{(1)} = (b_4, b_5, b_6, \ldots, b_2, b_3) \tag{12}
\]

and

\[
V_0^{(0)} = (a_3, a_4, a_5, \ldots, a_1, a_2), \qquad V_0^{(1)} = (a_4, a_5, a_6, \ldots, a_2, a_3). \tag{13}
\]

Note that different structures (with respect to different GNB field sizes) may have different expressions of $\mathbf{B}_i$ and $\mathbf{V}_i$. Then, following the steps in Algorithm 4, we can finally get the accumulation of partial products $C_i$, which is also the final product $C$ in this case.

C. Extra Modification

Note that the proposed structure works well for the case of $d = 2^u$ (where $u$ can be any positive integer), as the structure has a lower critical path than that of [24]. However, for those cases of $d \neq 2^u$, the structure of Fig. 9 has the same time complexity as [24] (for this case, $\lceil \log_2 (d + 1) \rceil = \lceil \log_2 d \rceil$). Thus, in this section, we suggest an alternate structure as shown in Fig. 11.

Fig. 11. Alternate structure when $d \neq 2^u$ (where $u$ can be any positive integer). (a) Proposed structure. (b) Detailed internal structures, where the black box denotes the registers.

Note that the functions of all internal units inside the PEs of the structure in Fig. 11 are the same as those of Fig. 9. Here, we just apply another cut-set retiming so that the register complexity can be reduced further, i.e., move the operand addition (Fig. 9) back to the original PE. The structure of Fig. 11 has the same critical path as [24], as well as the same latency. However, the register complexity is significantly reduced compared with that of [24], i.e., only $\approx ((m/d)^{1/2} m)$ registers are used in the PEs.

V. AREA–TIME COMPLEXITIES

A. Theoretical Comparison

The area–time complexities of the proposed design and the existing ones of [18]–[20], [22]–[24], [34], [35], [39] in terms of logic gate count, register count, CPD, and latency are shown in Table I. Note that the recent reports of [34]–[39] (two

of them are based on KA and TMVP block recombination approaches [37], [38]) are nonsystolic structures, which may not be fair to compare with our proposed systolic structure (as well as [24]), as each has specific advantages over the other. Nevertheless, we list the area–time complexities of [34], [35], and [39] in Table I for comparison (the work of [36] focused on novelty at the circuit level, not at the algorithm level, and its complexity is the same as that of [22]; thus, we do not list it in Table I). As for the designs of [37] and [38], since they have employed fast algorithms like TMVP to reduce the area complexity, it may not be fair to compare them with all existing works, and thus we do not list them here (the works of [37] and [38] also have specific requirements on the normal basis, e.g., [37] is only for type-II optimal normal basis).

We also noticed that the designs of [34]–[39] have the same or even larger area–time complexity than other existing designs, e.g., [39] has a larger number of XOR gates than [22] and [23]; [35] has the same complexity as [22] and [23]; the two structures of [34] have larger time complexity than the ones of [22] and [23] (the major advantage of [34] is on the hybrid double and triple multipliers). Therefore, here we mainly focus on the comparison with the existing systolic structures, especially the one of [24].

TABLE I
COMPARISON OF THE AREA AND TIME COMPLEXITIES FOR VARIOUS DL MULTIPLIERS OVER $GF(2^m)$

TABLE II
COMPARISON OF LATENCY OVER $GF(2^{409})$

TABLE III
COMPARISON OF CPD WITH DIFFERENT DIGIT SIZE FOR VARIOUS DL MULTIPLIERS OVER $GF(2^{409})$

According to Table I, the latency of the proposed DL systolic multipliers is $2\sqrt{m/d}$ (we follow the way of [24] and get the maximum latency here), while [18] requires $2m/d$ and a number of existing GNB multipliers require the latency of $m/d$. In addition, the multiplier in [19] has latency of $(d + mT/d(mT/d + 1))$ clock cycles, which is more than the proposed one. We have also chosen the field of $GF(2^{409})$ to have a detailed comparison of the latencies of various DL GNB multipliers. As shown in Table II, the latency of our proposed structure is the same as the one in [24]. Besides that, the latency of the proposed one is much lower than the other existing multipliers. As the digit size increases from 2 to 14, our proposed multiplier has nearly 2.5–13.7 times less latency compared with those in [18] and [20].

In terms of register complexity and CPD, one can see in Table I that the DL-PIPO multiplier in [24] requires $(1 + 3\sqrt{m/d})m$ registers and its CPD is $T_{W1} = T_A + (\lceil \log_2 T \rceil + \lceil \log_2 (d + 1) \rceil) T_X$, while the proposed architecture (Fig. 9) requires $2\sqrt{m/d}\,m$ registers, and the CPD is $T_{W5} = T_A + (\lceil \log_2 T \rceil + \lceil \log_2 d \rceil) T_X$, which is significantly less than

the existing one of [24]. We have also chosen the field size of $GF(2^{409})$ to have a detailed comparison of ours and the existing one of [24] (Table III). It is seen that the CPD of our proposed structure is more efficient than the existing one.

B. ASIC Implementation

We have also synthesized our proposed and the existing designs to obtain the area–time complexity. We have used Synopsys Design Compiler based on the Taiwan Semiconductor Manufacturing Company 65-nm standard-cell library. The results in terms of area, CPD, and latency cycles (including latency time) of our proposed systolic structure (also the one of [24]) are shown in Table IV with different field sizes ($m$ = 163, 283, and 409) and digit sizes (to have a fair comparison, we use the same digit sizes suggested in [24]). Note that we have checked the functionality of the proposed structure through simulation and register transfer level tool provided by the Design Compiler and found it correct.

TABLE IV
ASIC SYNTHESIS RESULTS OF THE PROPOSED SYSTOLIC MULTIPLIER AND THE ONE OF [24]

TABLE V
ASIC SYNTHESIS RESULTS FOR THE EXISTING AND THE PROPOSED DL MULTIPLIERS OVER $GF(2^{409})$

As shown in Table IV, our proposed design performs better when the digit size becomes smaller. When comparing with the ones of [24], our proposed design (Fig. 9) has less critical path than the one of [24] (the structure of Fig. 11 has slightly larger critical path than [24]; this is due to the employment of the signal broadcasting technique). Besides that, due to the significant reduction of register complexity, the proposed design still achieves better area–time complexity than the one of [24].

We have also compared our systolic multiplier with the existing DL multipliers in terms of different digit sizes with the same field size $m = 409$, as shown in Table V. One can see that our proposed multiplier has smaller area–time complexity compared with the other multipliers (the ADP of the proposed one is the smallest among all the multipliers in Table V),
especially with the competing systolic design of [24]. For a digit size of d = 8, the ADP of the proposed structure is 12.3% less than that of [24]. The proposed design with d = 7 has the least ADP among all designs, e.g., at least 18.3%, 35.4%, and 53.9% less than the existing designs of [24], [22] and [23], and [18], respectively.

C. Discussion
From the comparison results shown in Tables IV and V, one can see that the proposed design outperforms the existing ones, especially the recent report of systolic structure in [24]. Besides that, one can always choose a suitable structure for a specific environment, as the proposed design offers either a low critical path (Fig. 9) or low register complexity (Fig. 11).

VI. CONCLUSION
A low-complexity DL systolic GNB multiplier over $GF(2^m)$ has been proposed in this paper. We have proposed a novel multiplication algorithm to reduce the CPD and the register complexity. Moreover, both theoretical and ASIC implementation results are presented for comparison. Based on our presented results, our proposed design has smaller CPD and lower register complexity when compared with the existing DL systolic multipliers. The proposed DL multiplier, thus, can be extended and employed in sensitive usage models including high-performance cryptographic applications.

REFERENCES
[1] I. Blake, G. Seroussi, and N. Smart, Elliptic Curves in Cryptography (London Mathematical Society Lecture Note Series). Cambridge, U.K.: Cambridge Univ. Press, 1999.
[2] R. R. Farashahi and M. Joye, “Efficient arithmetic on Hessian curves,” in Proc. Int. Conf. Pract. Theory Public Key Cryptogr., 2010, pp. 243–260.
[3] B. Sunar and Ç. K. Koç, “An efficient optimal normal basis type II multiplier,” IEEE Trans. Comput., vol. 50, no. 1, pp. 83–87, Jan. 2001.
[4] C.-Y. Lee, J.-S. Horng, I.-C. Jou, and E.-H. Lu, “Low-complexity bit-parallel systolic Montgomery multipliers for special classes of $GF(2^m)$,” IEEE Trans. Comput., vol. 54, no. 9, pp. 1061–1070, Sep. 2005.
[5] J. Xie, J. J. He, and P. K. Meher, “Low latency systolic Montgomery multiplier for finite field $GF(2^m)$ based on pentanomials,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 21, no. 2, pp. 385–389, Feb. 2013.
[6] C.-Y. Lee, E.-H. Lu, and J.-Y. Lee, “Bit-parallel systolic multipliers for $GF(2^m)$ fields defined by all-one and equally spaced polynomials,” IEEE Trans. Comput., vol. 50, no. 5, pp. 385–393, May 2001.
[7] Y.-R. Ting, E.-H. Lu, and Y.-C. Lu, “Ringed bit-parallel systolic multipliers over a class of fields $GF(2^m)$,” Integr., VLSI J., vol. 38, no. 4, pp. 571–578, 2005.
[8] J. Xie, P. K. Meher, and Z.-H. Mao, “High-throughput digit-level systolic multiplier over $GF(2^m)$ based on irreducible trinomials,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 62, no. 5, pp. 481–485, May 2015.
[9] J. Adikari, V. S. Dimitrov, and R. J. Cintra, “A new algorithm for double scalar multiplication over Koblitz curves,” in Proc. IEEE Int. Symp. Circuits Syst., May 2011, pp. 709–712.
[10] K. Järvinen and J. Skyttä, “On parallelization of high-speed processors for elliptic curve cryptography,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 16, no. 9, pp. 1162–1175, Sep. 2008.
[11] C.-Y. Lee, P. K. Meher, and J. C. Patra, “Concurrent error detection in bit-serial normal basis multiplication over $GF(2^m)$ using multiple parity prediction schemes,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 18, no. 8, pp. 1234–1238, Aug. 2010.
[12] W. Geiselmann and D. Gollmann, “Symmetry and duality in normal basis multiplication,” in Proc. 6th Symp. Appl. Algebra, Algebraic Algorithms Error-Correcting Codes (AAECC), 1989, pp. 230–238.
[13] R. Azarderakhsh and A. Reyhani-Masoleh, “Efficient FPGA implementations of point multiplication on binary Edwards and generalized Hessian curves using Gaussian normal basis,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 20, no. 8, pp. 1453–1466, Aug. 2012.
[14] P. K. Meher, “Systolic and non-systolic scalable modular designs of finite field multipliers for Reed–Solomon codec,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 17, no. 6, pp. 747–757, Jun. 2009.
[15] S. Kwon, “A low complexity and a low latency bit parallel systolic multiplier over $GF(2^m)$ using an optimal normal basis of type II,” in Proc. 16th IEEE Symp. Comput. Arithmetic, Jun. 2003, pp. 196–202.
[16] J. Fan, D. V. Bailey, L. Batina, T. Guneysu, C. Paar, and I. Verbauwhede, “Breaking elliptic curve cryptosystems using reconfigurable hardware,” in Proc. Int. Conf. Field Program. Logic Appl., 2010, pp. 133–138.
[17] Z. Wang and S. Fan, “Efficient Montgomery-based semi-systolic multiplier for even-type GNB of $GF(2^m)$,” IEEE Trans. Comput., vol. 61, no. 3, pp. 415–419, Mar. 2012.
[18] S. Talapatra, H. Rahaman, and J. Mathew, “Low complexity digit serial systolic Montgomery multipliers for special class of $GF(2^m)$,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 18, no. 5, pp. 847–852, May 2010.
[19] C.-Y. Lee and C. W. Chiou, “Scalable Gaussian normal basis multipliers over $GF(2^m)$ using Hankel matrix-vector representation,” J. Signal Process. Syst., vol. 69, no. 2, pp. 197–211, 2012.
[20] R. Azarderakhsh and A. Reyhani-Masoleh, “A modified low complexity digit-level Gaussian normal basis multiplier,” in Proc. 3rd Int. Workshop Arithmetic Finite Fields, 2010, pp. 25–40.
[21] Digital Signature Standard, Nat. Inst. Standards Technol., Gaithersburg, MD, USA, Jan. 2000.
[22] A. Reyhani-Masoleh, “Efficient algorithms and architectures for field multiplication using Gaussian normal bases,” IEEE Trans. Comput., vol. 55, no. 1, pp. 34–47, Jan. 2006.
[23] S. Kwon, K. Gaj, C. H. Kim, and C. P. Hong, “Efficient linear array for multiplication in $GF(2^m)$ using a normal basis for elliptic curve cryptography,” in Proc. 6th Int. Workshop Cryptogr. Hardw. Embedded Syst., 2014, pp. 76–91.
[24] R. Azarderakhsh, M. M. Kermani, S. Bayat-Sarmadi, and C.-Y. Lee, “Systolic Gaussian normal basis multiplier architectures suitable for high-performance applications,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 23, no. 9, pp. 1969–1972, Sep. 2015.
[25] IEEE Standard Specifications for Public-Key Cryptography, IEEE Standard 1363-2000, Jan. 2010.
[26] National Institute of Standards and Technology, Digital Signature Standard, FIPS Publications 186-2, U.S. Dept. Commerce, Washington, DC, USA, Jan. 2000.
[27] J. K. Omura and J. L. Massey, “Computational method and apparatus for finite field arithmetic,” U.S. Patent 4 587 627, May 6, 1986.
[28] P. Chen, S. N. Basha, M. Mozaffari-Kermani, R. Azarderakhsh, and J. Xie, “FPGA realization of low register systolic all-one-polynomial multipliers over $GF(2^m)$ and their applications in trinomial multipliers,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 25, no. 2, pp. 725–734, Feb. 2017.
[29] T. Beth and D. Gollman, “Algorithm engineering for public key algorithms,” IEEE J. Sel. Areas Commun., vol. 7, no. 4, pp. 458–466, May 1989.
[30] G. B. Agnew, R. C. Mullin, I. M. Onyszchuk, and S. A. Vanstone, “An implementation for a fast public-key cryptosystem,” J. Cryptol., vol. 3, no. 2, pp. 63–79, 1991.
[31] A. Reyhani-Masoleh and M. A. Hasan, “Efficient digit-serial normal basis multipliers over binary extension fields,” ACM Trans. Embedded Comput. Syst., vol. 3, no. 3, pp. 575–592, 2004.
[32] A. H. Namin, H. Wu, and M. Ahmadi, “A word-level finite field multiplier using normal basis,” IEEE Trans. Comput., vol. 60, no. 6, pp. 890–895, Jun. 2006.
[33] C.-Y. Lee and P.-L. Chang, “Digit-serial Gaussian normal basis multiplier over $GF(2^m)$ using Toeplitz matrix-approach,” in Proc. Int. Conf. Comput. Intell. Softw. Eng. (CiSE), 2009, pp. 1–4.
[34] H. El-Razouk and A. Reyhani-Masoleh, “New architectures for digit-level single, hybrid-double, hybrid-triple field multiplications and exponentiation using Gaussian normal bases,” IEEE Trans. Comput., vol. 65, no. 8, pp. 2495–2509, Aug. 2016.
[35] V. Trujillo-Olaya and J. Velasco-Medina, “Half-matrix normal basis multiplier over $GF(p^m)$,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 64, no. 4, pp. 879–891, Apr. 2017.
[36] B. Rashidi, S. M. Sayedi, and R. R. Farashahi, “An efficient and high-speed VLSI implementation of optimal normal basis multiplication over $GF(2^m)$,” Integr., VLSI J., vol. 55, pp. 138–154, Sep. 2016.
[37] C.-Y. Lee and P. K. Meher, “Area-efficient subquadratic space-complexity digit-serial multiplier for type-II optimal normal basis of $GF(2^m)$ using symmetric TMVP and block recombination techniques,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 62, no. 12, pp. 2846–2855, Dec. 2015.
[38] J.-S. Pan, C.-Y. Lee, and Y. Li, “Subquadratic space complexity Gaussian normal basis multipliers over $GF(2^m)$ based on Dickson–Karatsuba decomposition,” IET Circuits, Devices Syst., vol. 9, no. 5, pp. 336–342, 2015.
[39] B. Rashidi, S. M. Sayedi, and R. R. Farashahi, “Efficient and low-complexity hardware architecture of Gaussian normal basis multiplication over $GF(2^m)$ for elliptic curve cryptosystems,” IET Circuits, Devices Syst., vol. 11, no. 2, pp. 103–112, 2017.
[40] H. T. Kung, “Why systolic architectures?” Computer, vol. 15, no. 1, pp. 37–46, Jan. 1982.

Qiliang Shao received the B.S. degree in polymer material science and engineering from Donghua University, Shanghai, China, in 2014. He is currently pursuing the M.S. degree with the Department of Electrical Engineering, Wright State University, Dayton, OH, USA.
His current research interests include VLSI cryptographic circuits design and finite field arithmetic design.

Zhenji Hu is currently pursuing the Ph.D. degree with the School of Law, Shanghai University of Finance and Economics, Shanghai.
Her current research interests include finance and security issues related to finance.

Shaobo Chen received the B.S. degree in communication engineering from Xi’dian University, Xi’an, China, and the M.S. degree in electrical engineering from the University of Pittsburgh, Pittsburgh, PA, USA, in 2012 and 2015, respectively. He is currently pursuing the Ph.D. degree with the Department of Electrical Engineering, Wright State University, Dayton, OH, USA.
His current research interests include VLSI cryptographic circuits design and VLSI signal processing systems.

Pingxiuqi Chen received the B.S. degree in physics from Hainan Normal University, Haikou, China, in 2014. She is currently pursuing the Ph.D. degree with the Department of Electrical Engineering, Wright State University, Dayton, OH, USA.
Her current research interests include VLSI cryptographic circuits design and finite field arithmetic design.

Jiafeng Xie (M’15) received the B.E. degree in measurement and control technology and instrumentation from Yanshan University, Qinhuangdao, China, in 2006, the M.E. degree in control science and engineering from Central South University, Changsha, China, in 2010, and the Ph.D. degree in electrical engineering from the University of Pittsburgh, Pittsburgh, PA, USA, in 2014.
He is currently an Assistant Professor with the Department of Electrical Engineering, Wright State University, Dayton, OH, USA. His current research interests include VLSI cryptographic circuits design, hardware security, postquantum cryptography, DNA cryptography, intelligent system fault detection, and VLSI signal/image processing systems.
Dr. Xie is currently serving on the editorial board of Microelectronics Journal (Elsevier).
