Abstract— Normal basis multiplication over GF(2^m) is widely used in various applications such as elliptic curve cryptography. As a special class of normal basis with low complexity, Gaussian normal basis (GNB) has received considerable attention recently. In this paper, we propose a novel decomposition algorithm to develop a digit-level (DL) low-complexity systolic structure for GNB multiplication over GF(2^m). First, we propose two algorithms separately to achieve a systolic GNB multiplier with low critical path delay and low register complexity. Next, we present the corresponding structure according to the proposed algorithm (a combination of the previous two proposed algorithms). Compared with the existing systolic DL GNB multipliers (through both theoretical and application-specific integrated circuit comparison), the proposed multiplier achieves significantly less area-delay product (ADP), e.g., for a systolic structure with a digit size of 8 for GF(2^409), the proposed structure has 12.3% less ADP than the best of the existing designs for the same digit size.

Index Terms— Digit-level (DL), Gaussian normal basis (GNB), low critical path delay (CPD), low register complexity, systolic structure.

Manuscript received December 1, 2016; revised March 26, 2017 and June 11, 2017; accepted June 22, 2017. This work was supported by Ohio RAPIDS grant. (Corresponding author: Jiafeng Xie.)
Q. Shao, S. Chen, P. Chen, and J. Xie are with the Department of Electrical Engineering, Wright State University, Dayton, OH 45435 USA (e-mail: shao.10@wright.edu; chen.181@wright.edu; chen.148@wright.edu; jiafeng.xie@wright.edu).
Z. Hu is with the School of Law, Shanghai University of Finance and Economics, Shanghai, China (e-mail: jenneyhu1986@163.com).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TVLSI.2017.2720190
1063-8210 © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

I. INTRODUCTION

LOW-COMPLEXITY implementations of finite field multipliers over GF(2^m) have drawn substantial attention recently due to their widespread applications in various environments. Considerable effort has been devoted to obtaining low-complexity multipliers for various high-performance usages including elliptic curve cryptography [1]–[8].

In general, three bases can be selected to represent a finite field, i.e., polynomial basis (PB), normal basis, and dual basis [9]–[24]. Gaussian normal basis (GNB), as a special class of normal basis over GF(2^m) [9]–[13] (where m > 1 and m is not divisible by eight), has received considerable attention in the literature due to its low-complexity implementation (compared with PB and dual basis, GNB is much more efficient in hardware designs involving many squaring operations, since GNB squaring has almost no hardware usage). GNB has also been included in a number of standards such as the IEEE [25] and NIST [26] for the elliptic curve digital signature algorithm.

A number of designs have been released for efficient implementation of field multiplication based on GNB [3], [22], [23], [27], [29]–[39], including: 1) the bit-level structures such as parallel-in serial-out (PISO) [27], serial-in parallel-out (SIPO) [29], [30], and parallel-in parallel-out (PIPO) [31], [32]; 2) the digit-level (DL) designs such as PISO [33], PIPO [22], [23], and SIPO [24]; and 3) the bit-parallel ones such as the multiplier of [3]. The DL structures have gained substantial attention recently due to their efficient tradeoff between area and time complexities [24], [34].

For large field sizes in GF(2^m), the multiplications can be realized by using a systolic array to achieve high-speed and regular implementations [14]. Systolic structures are widely used in applications with high-performance requirements, as the processing elements (PEs) in the structure employ registers for pipelining. In [15], Kwon has presented an efficient digit-serial systolic multiplier based on optimal normal basis. In [16], another systolic multiplier is proposed for high-performance implementation. Other efficient systolic multipliers over GF(2^m) have been proposed in [14] and [17]. Besides that, efficient digit-serial systolic multipliers are introduced in [18]–[20]. Very recently, an efficient DL systolic GNB multiplier was introduced in [24]. Overall, systolic realizations of DL GNB multipliers are not abundant in the literature. Furthermore, although these GNB multipliers have been optimized to achieve low complexity through DL implementation, their area–time complexities are still relatively high and need to be improved, e.g., the register complexity and critical path of [24] are still large and can be improved further.

Based on the above consideration, we propose, in this paper, a novel decomposition algorithm to develop DL systolic GNB multipliers over GF(2^m) in order to achieve a lower critical path and high-speed, low-complexity implementations. We briefly introduce the motivation of the proposed work (based on the concept of novel cut-set retiming and novel input signal broadcasting) after the review of the work of [24]. Then, we introduce novel multiplication algorithms to reduce the critical path delay (CPD) and register complexity, respectively. Furthermore, a new structure of the proposed systolic GNB multiplier is proposed. The proposed multiplier can achieve low CPD and low register complexity compared with the best of the existing GNB systolic multipliers, that of [24]. Finally, we also compare the hardware and time complexities of the proposed architectures with the existing ones through application-specific integrated circuit (ASIC) synthesis to benchmark the higher efficiency of the presented design.

The organization of this paper is as follows. In Section II, we briefly review the preliminaries of the GNB multiplication over GF(2^m) (including the existing DL systolic GNB multiplier) as well as the introduction of the cut-set retiming and signal broadcasting techniques (motivation of the proposed strategy). In Section III, a novel DL systolic multiplication algorithm is proposed. The proposed low-CPD, low-register-complexity DL systolic GNB multiplier is described in Section IV. In Section V, we benchmark the hardware and time complexities of the proposed design along with the corresponding existing ones. Conclusions are given in Section VI.

Algorithm 1 Existing DL Systolic Multiplication [24]
then we can have their product C as

\[ C = (c_0, c_1, \ldots, c_{m-1}) = AB = \sum_{i=0}^{m-1} \sum_{j=0}^{m-1} a_i b_j \beta^{2^i + 2^j}. \tag{2} \]

Let us define $\mu_{i,j} = \beta^{2^i + 2^j} \in GF(2^m)$ as a field element, where $0 \le i, j \le m-1$. Then, with respect to N, one can have

\[ \mu_{i,j} = \sum_{l=0}^{m-1} \mu_{i,j}^{(l)} \beta^{2^l}. \tag{3} \]

Substituting (3) into (2), one can have

\[ C = \sum_{i=0}^{m-1} \sum_{j=0}^{m-1} a_i b_j \sum_{l=0}^{m-1} \mu_{i,j}^{(l)} \beta^{2^l} = \sum_{i=0}^{m-1} \sum_{j=0}^{m-1} \sum_{l=0}^{m-1} a_i b_j \mu_{i,j}^{(l)} \beta^{2^l}. \tag{4} \]

Recently, a low-latency systolic GNB multiplier has been proposed in [24]. Algorithm 1 describes the existing systolic GNB multiplication. Let us define $q = \lceil m/d \rceil$, where d is the digit size and $1 \le d \le m$. Then, the product C = AB can be performed as

\[ C = \sum_{i=0}^{q-1} L^{2^{id}}(V \ll id, B \ll id), \qquad \text{where } L(V, B) = \sum_{j=0}^{d-1} J^{2^{j}}(V \ll j, B \ll j) \]

(note that this version is directly obtained from [20, Fig. 2]; for detailed information, the readers can always refer to [22, eqs. (36)–(42), (46)], especially concerning the matter of the structure's last cycle operation). Let us define n and k as two integers that satisfy q = kn; then, we can get the partial product $C_i$ by

\[ C_i = \sum_{j=0}^{k-1} L^{2^{jd}}(V_i \ll jd, B_i \ll jd). \]

Thus, one can decompose the product C into n-term partial products, which is

\[ C = C_0 + C_1^{2^{kd}} + \cdots + C_{n-1}^{2^{(n-1)kd}} = \Big(\big(C_{n-1}^{2^{kd}} + C_{n-2}\big)^{2^{kd}} + \cdots\Big)^{2^{kd}} + C_0. \]

Each partial product can be written as

\[ C_i = \Big(\big((L(V_i, B_i))^{2^d} + L(V_i \ll d, B_i \ll d)\big)^{2^d} + \cdots\Big)^{2^d} + L(V_i \ll (k-1)d, B_i \ll (k-1)d). \]

Based on Algorithm 1, Fig. 1(a) depicts the existing DL systolic GNB multiplier over GF(2^m) [24]. We can see that
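As an illustration of the partial-product decomposition above, the following Python sketch checks that the direct sum C_0 + C_1^(2^(kd)) + ... and its nested form give the same result. It relies only on the standard normal-basis property that squaring is a cyclic shift of the coordinate vector. The function names, the list-of-bits representation, the assumed coordinate ordering, and the random partial products are all hypothetical choices made for this illustration; they are not taken from Algorithm 1 or from [24].

```python
import random


def frobenius(x, s):
    """Return x**(2**s) for a normal-basis coordinate list x.

    Assumed ordering: coordinate i multiplies beta**(2**i), so one squaring
    moves bit i to position i + 1 (a cyclic shift by one position).
    """
    m = len(x)
    s %= m
    return x[m - s:] + x[:m - s]


def add(x, y):
    """Field addition in GF(2^m): coordinate-wise XOR."""
    return [a ^ b for a, b in zip(x, y)]


def accumulate_direct(parts, k, d):
    """Direct form: sum over i of parts[i]**(2**(i*k*d))."""
    acc = [0] * len(parts[0])
    for i, c in enumerate(parts):
        acc = add(acc, frobenius(c, i * k * d))
    return acc


def accumulate_nested(parts, k, d):
    """Nested form: ((C_{n-1}**(2**(kd)) + C_{n-2})**(2**(kd)) + ...) + C_0."""
    acc = list(parts[-1])
    for c in reversed(parts[:-1]):
        acc = add(frobenius(acc, k * d), c)
    return acc


def split_parameters(m, d, k):
    """Return (q, n) with q = ceil(m / d) and q = k * n."""
    q = -(-m // d)
    if q % k:
        raise ValueError("k must divide q")
    return q, q // k


if __name__ == "__main__":
    m, d, k = 7, 2, 2
    q, n = split_parameters(m, d, k)
    rng = random.Random(0)
    # Random stand-ins for the partial products C_0 .. C_{n-1}.
    parts = [[rng.randint(0, 1) for _ in range(m)] for _ in range(n)]
    print(q, n, accumulate_direct(parts, k, d) == accumulate_nested(parts, k, d))
```

The two forms agree because the Frobenius map is linear over GF(2), so the nested form needs only one shift by kd positions per accumulation step instead of a different shift amount for every partial product.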
Fig. 3. Proposed cut-set strategy of designing low CPD DL systolic GNB multiplier over GF(2^m), where S box refers to shift operation, RAO denotes recombined addition operation, ⊗ denotes MA operation, and ⊕ denotes operand addition operation. (a) Proposed DL systolic GNB multiplier. (b) Detailed structure of PEs, where black boxes denote registers.
Fig. 3 depicts the proposed cut-set strategy of achieving a low-CPD implementation based on the existing DL systolic multiplier over GF(2^m). From Fig. 3(a), one can see that the structure consists of k PEs and one accumulation cell (AC) after the proposed novel cut-set retiming. The detailed internal operational structure of these components is presented in Fig. 3(b). First of all, PE-1 performs the first pair of RAO and MA operations on operands B and V. The result of MA from PE-1 is then passed to PE-2 to perform the operand addition with the shifted C. Meanwhile, PE-2 still performs its own RAO and MA operations and passes its result to the next PE on its right. The "S box" performs the shifting of the bits of both operands V and B. The last operand addition operation is performed in AC, and we get the first partial product after k clock cycles. All the remaining partial products $C_i$ are recursively accumulated in AC to produce the final result C after n clock cycles. The CPD of this proposed architecture is thus $T_A + (\lceil \log_2 T \rceil + \lceil \log_2 d \rceil) T_X$ (for NIST recommended GNB, $T_A + (\lceil \log_2 T \rceil + \lceil \log_2 d \rceil) T_X$ is larger than $2T_X$, the time duration required for AC), which is shorter than that of [24].

According to the proposed cut-set retiming strategy shown in Fig. 3, we derive here the modified low critical path DL systolic multiplication algorithm as presented in the proposed Algorithm 2.

Algorithm 2 Proposed Low CPD DL Systolic Multiplication

where MA denotes the operand multiplication operation inside each PE (the details of this step can be seen in [22, eqs. (29)–(47)] and the example of Fig. 2). According to Algorithm 2, we perform the RAO and MA operations of the operands B and V in advance, which corresponds to Steps 8 and 9. Each PE (from PE-2 to PE-k) then executes the computation of Steps 10 and 11 of Algorithm 2 later, while AC is performing Step 13. By computing the RAO and MA operations in advance in each PE, we have shortened the critical path of the structure of [24].

C. Low Register Complexity

A systolic structure sometimes suffers from large register complexity, as all the PEs in the array are uniform and fully pipelined (there are a lot of registers in the PEs). Noting that a systolic structure can have global input broadcasting to provide inputs to all PEs [40], we propose here a novel signal broadcasting strategy to reduce the corresponding registers among PEs.

As seen in Fig. 1(b), there are generally two types of registers equipped for one PE: the first type for operand pipelining (after bit shifting, the top one for operand B and the bottom one for operand V); the second type for pipelining of computation [the registers used to pipeline the data after the L(B_in, V_in) operation]. The registers used to pipeline the computational data are critical to the correctness of the final output, while the registers for pipelining the shifted operands (the top and bottom ones) are relatively less important.

Based on the above consideration, we propose here a novel strategy to reduce the registers related to the pipelining of the shifted operands (the top and bottom ones). Let us first consider the data pipelining of shifted operand B among PEs in the existing design of [24]. It is seen in Fig. 4 that the shifted operand's subscript (the subscript denotes the degree of shifting, according to Fig. 1) increases by one per cycle for a single PE (for neighboring PEs, within the same cycle, the subscript increases with the numbering of the PE). The pipelining of the bits of shifted operand V is similar to that of the shifted operand B, as shown in Fig. 5.
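To make the notion of a shifted operand's subscript concrete, the following Python sketch treats the subscript as the degree of a cyclic shift applied to the operand's coordinate vector. The helper name `cyclic_shift_left`, the list representation, and the choice of a left shift are assumptions made for illustration (the direction is chosen to agree with the operand vectors listed in the example of Section IV); the sketch does not model the PE array itself.

```python
def cyclic_shift_left(bits, s):
    """Cyclically shift a coordinate list left by s positions."""
    m = len(bits)
    s %= m
    return bits[s:] + bits[:s]


if __name__ == "__main__":
    # Symbolic operand for m = 7, so the effect of each shift is visible.
    B = [f"b{i}" for i in range(7)]
    for subscript in (3, 4):
        print(subscript, cyclic_shift_left(B, subscript))
    # Advancing the subscript by one is the same as one further shift.
    print(cyclic_shift_left(cyclic_shift_left(B, 3), 1) == cyclic_shift_left(B, 4))
```

Because a shift of degree s + 1 is obtained from a shift of degree s by one further position, an operand whose subscript grows by one per cycle can be produced by a single shift register rather than by a separate pipeline register in every PE, which is the observation the broadcasting strategy builds on.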
Fig. 4. Data pipelining of operand B among PEs for the structure of Fig. 1, where the diagonal line represents data flow between PEs, and the vertical line represents data flow in one PE.
Fig. 6. Data pipelining of operand B among PEs with added operands for the structure of Fig. 1, where the gray area represents all added operands, and the green area represents one specific operand cluster fed to all PEs.
Fig. 7. Data pipelining of operand V among PEs with added operands for the structure of Fig. 1, where the gray area represents all added operands, and the green area represents one specific operand cluster fed to all PEs.
Fig. 9. Proposed low-complexity DL systolic GNB multiplier over GF(2^m). (a) Proposed structure of DL systolic GNB multiplier. (b) Detailed internal structures of PEs, where black boxes denote registers.

novel multiplication algorithm. The proposed multiplication algorithm for DL systolic GNB multiplier over GF(2^m) is described in Algorithm 4.

Algorithm 4 Proposed DL Systolic Multiplication

As one can see in Algorithm 4, Steps 5 and 6 perform the operations of initialization of the value of the operand cluster as well as the cyclic shifting. RAO is performed by Steps 7 and 10, and AC is computed by Step 12. By computing the first RAO in advance and rearranging the data broadcasting, we have successfully shortened the CPD and reduced the register complexity.

A. Proposed Structure

The proposed DL systolic structure of GNB multiplier over GF(2^m) based on the proposed Algorithm 4 is depicted in Fig. 9. As shown in Fig. 9(a), it consists of one AC, k number of PEs, and two shift-registers for operands B and V, respectively. The detailed internal structures of AC and PEs are presented in Fig. 9(b). The shift registers rearrange all the bits of operand B and V, so that there will be only one operand cluster to be fed to k number of PEs in one clock cycle period. The PE-1 yields the output to PE-2 after performing the RAO
and MA operations. The internal structures of PEs from PE-2 to PE-k are the same, where each PE contains one RAO, one MA operation, and one operand addition operation. The RAO performs the reconstructed addition operation, whose result is passed to the MA operation in the same PE. The result of the MA operation is then passed to the next PE to be added to the shifted output of the addition operation from the previous PE. PE-k performs the last operand addition and accumulation operations. After the AC receives its first input from the left, the final result C becomes available in (n + k) clock cycles.

Fig. 10. Example of the proposed DL systolic structure. (a) Proposed structure. (b) Detailed example of internal structure of PE, where the black box denotes the registers.

B. Example

Let us follow the example presented in Fig. 2, i.e., type 4 GNB over GF(2^7). For detailed information, one can always refer to [22].

When d = 2 and k = 2, we have q = 4 and n = 2. According to Algorithm 4, we can have a structure similar to Fig. 9 with two PEs and one AC, shown in Fig. 10.

Suppose the initial bits of B and V loaded in the shift registers are $(b_0, b_1, b_2, \ldots, b_5, b_6)$ and $(a_0, a_1, a_2, \ldots, a_5, a_6)$, respectively (following the similar example in [22]). For the first cycle, according to (8) and (9), we can have

\[ B_0 = B_0^{(0)}, B_0^{(1)} \qquad V_0 = V_0^{(0)}, V_0^{(1)} \tag{11} \]

where

\[ B_0^{(0)} = (b_3, b_4, b_5, \ldots, b_1, b_2), \qquad B_0^{(1)} = (b_4, b_5, b_6, \ldots, b_2, b_3) \tag{12} \]

and

\[ V_0^{(0)} = (a_3, a_4, a_5, \ldots, a_1, a_2), \qquad V_0^{(1)} = (a_4, a_5, a_6, \ldots, a_2, a_3) \tag{13} \]

note that a different structure (with respect to a different GNB field size) may have different expressions of $B_i$ and $V_i$.

Then, following the steps in Algorithm 4, we can finally get the accumulation of partial product $C_i$, which is also the final product C in this case.

C. Extra Modification

Note that the proposed structure works well for the case of $d = 2^u$ (where u can be any positive integer), as the structure has a lower critical path than that of [24]. However, for those cases of $d \ne 2^u$, the structure of Fig. 9 has the same time complexity as [24] (for this case $\lceil \log_2 (d + 1) \rceil = \lceil \log_2 d \rceil$). Thus, in this section, we suggest an alternate structure as shown in Fig. 11.

Note that the functions of all internal units inside the PEs of the structure in Fig. 11 are the same as those of Fig. 9. Here, we just apply another cut-set retiming so that the register complexity can be reduced further, i.e., move the operand addition (Fig. 9) back to the original PE. The structure of Fig. 11 has the same critical path as [24], as well as the same latency. However, the register complexity is significantly reduced compared with the one of [24], i.e., only $\approx ((m/d)^{1/2} m)$ registers are being used in the PEs.

V. AREA–TIME COMPLEXITIES

A. Theoretical Comparison

The area–time complexities of the proposed and the existing ones of [18]–[20], [22]–[24], [34], [35], [39] in terms of logic gate count, register count, CPD, and latency are shown in Table I. Note that the recent reports of [34]–[39] (two
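The dependence of the critical path on the digit size discussed under "Extra Modification" can be checked numerically. The Python sketch below counts XOR levels only (the T_X multiplier in the CPD expression), reading each logarithm as rounded up to an integer tree depth. The function names are hypothetical, and the expression used for the design of [24] is an assumption inferred from the remark that the two designs coincide when log2(d + 1) and log2(d) round up to the same value; it is not quoted from [24].

```python
def ceil_log2(x):
    """Smallest integer t with 2**t >= x, for integer x >= 1."""
    return (x - 1).bit_length()


def xor_levels_fig9(gnb_type, d):
    """XOR levels of the low-CPD structure: ceil(log2 T) + ceil(log2 d)."""
    return ceil_log2(gnb_type) + ceil_log2(d)


def xor_levels_ref24(gnb_type, d):
    """Assumed XOR levels of the design of [24]: ceil(log2 T) + ceil(log2 (d + 1))."""
    return ceil_log2(gnb_type) + ceil_log2(d + 1)


if __name__ == "__main__":
    gnb_type = 4  # illustrative GNB type T
    for d in (4, 7, 8, 11, 16):
        saved = xor_levels_ref24(gnb_type, d) - xor_levels_fig9(gnb_type, d)
        print(d, xor_levels_fig9(gnb_type, d), xor_levels_ref24(gnb_type, d), saved)
```

Under these assumptions one XOR level is saved exactly when d is a power of two, while for other digit sizes the two counts coincide, which is the situation the alternate structure of Fig. 11 is meant to address.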
TABLE I
COMPARISON OF THE AREA AND TIME COMPLEXITIES FOR VARIOUS DL MULTIPLIERS OVER GF(2^m)
TABLE IV
ASIC SYNTHESIS RESULTS OF THE PROPOSED SYSTOLIC MULTIPLIER AND THE ONE OF [24]
TABLE V
ASIC SYNTHESIS RESULTS FOR THE EXISTING AND THE PROPOSED DL MULTIPLIERS OVER GF(2^409)
the existing one of [24]. We have also chosen the field size of GF(2^409) to have a detailed comparison of ours and the existing one of [24] (Table III). It is seen that the CPD of our proposed structure is more efficient than the existing one.

B. ASIC Implementation

We have also synthesized our proposed and the existing designs to obtain the area–time complexity. We have used Synopsys Design Compiler based on the Taiwan Semiconductor Manufacturing Company 65-nm standard-cell library. The results in terms of area, CPD, and latency cycles (including latency time) of our proposed systolic structure (also the one of [24]) are shown in Table IV with different field sizes (m = 163, 283, and 409) and digit sizes (to have a fair comparison, we use the same digit sizes suggested in [24]). Note that we have checked the functionality of the proposed structure through simulation and the register transfer level tool provided by the Design Compiler and found it correct.

As shown in Table IV, our proposed design performs better when the digit size becomes smaller. Compared with the design of [24], our proposed design (Fig. 9) has a shorter critical path (the structure of Fig. 11 has a slightly larger critical path than [24]; this is due to the use of the signal broadcasting technique). Besides that, due to the significant reduction of register complexity, the proposed design still achieves better area–time complexity than the one of [24].

We have also compared our systolic multiplier with the existing DL multipliers in terms of different digit sizes with the same field size m = 409, as shown in Table V. One can see that our proposed multiplier has smaller area–time complexity compared with the other multipliers (the ADP of the proposed one is the smallest among all the multipliers in Table V),
especially with the competing systolic design of [24]. For a digit size of d = 8, the ADP of the proposed structure is 12.3% less than that of [24]. The proposed design with d = 7 has the least ADP among all designs, e.g., at least 18.3%, 35.4%, and 53.9% less than the existing designs of [24], [22] and [23], and [18], respectively.

C. Discussion

From the comparison results shown in Tables IV and V, one can see that the proposed design outperforms the existing ones, especially the recent report of the systolic structure in [24]. Besides that, one can always choose a suitable structure for a specific environment, as the proposed design offers either low critical path (Fig. 9) or low register complexity (Fig. 11).

VI. CONCLUSION

A low-complexity DL systolic GNB multiplier over GF(2^m) has been proposed in this paper. We have proposed a novel multiplication algorithm to reduce the CPD and the register complexity. Moreover, both theoretical and ASIC implementation results are presented for comparison. Based on our presented results, our proposed design has smaller CPD and lower register complexity when compared with the existing DL systolic multipliers. The proposed DL multiplier, thus, can be extended and employed in sensitive usage models including high-performance cryptographic applications.

REFERENCES

[1] I. Blake, G. Seroussi, and N. Smart, Elliptic Curves in Cryptography (London Mathematical Society Lecture Note Series). Cambridge, U.K.: Cambridge Univ. Press, 1999.
[2] R. R. Farashahi and M. Joye, “Efficient arithmetic on Hessian curves,” in Proc. Int. Conf. Pract. Theory Public Key Cryptogr., 2010, pp. 243–260.
[3] B. Sunar and Ç. K. Koç, “An efficient optimal normal basis type II multiplier,” IEEE Trans. Comput., vol. 50, no. 1, pp. 83–87, Jan. 2001.
[4] C.-Y. Lee, J.-S. Horng, I.-C. Jou, and E.-H. Lu, “Low-complexity bit-parallel systolic Montgomery multipliers for special classes of GF(2^m),” IEEE Trans. Comput., vol. 54, no. 9, pp. 1061–1070, Sep. 2005.
[5] J. Xie, J. J. He, and P. K. Meher, “Low latency systolic Montgomery multiplier for finite field GF(2^m) based on pentanomials,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 21, no. 2, pp. 385–389, Feb. 2013.
[6] C.-Y. Lee, E.-H. Lu, and J.-Y. Lee, “Bit-parallel systolic multipliers for GF(2^m) fields defined by all-one and equally spaced polynomials,” IEEE Trans. Comput., vol. 50, no. 5, pp. 385–393, May 2001.
[7] Y.-R. Ting, E.-H. Lu, and Y.-C. Lu, “Ringed bit-parallel systolic multipliers over a class of fields GF(2^m),” Integr., VLSI J., vol. 38, no. 4, pp. 571–578, 2005.
[8] J. Xie, P. K. Meher, and Z.-H. Mao, “High-throughput digit-level systolic multiplier over GF(2^m) based on irreducible trinomials,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 62, no. 5, pp. 481–485, May 2015.
[9] J. Adikari, V. S. Dimitrov, and R. J. Cintra, “A new algorithm for double scalar multiplication over Koblitz curves,” in Proc. IEEE Int. Symp. Circuits Syst., May 2011, pp. 709–712.
[10] K. Järvinen and J. Skyttä, “On parallelization of high-speed processors for elliptic curve cryptography,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 16, no. 9, pp. 1162–1175, Sep. 2008.
[11] C.-Y. Lee, P. K. Meher, and J. C. Patra, “Concurrent error detection in bit-serial normal basis multiplication over GF(2^m) using multiple parity prediction schemes,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 18, no. 8, pp. 1234–1238, Aug. 2010.
[12] W. Geiselmann and D. Gollmann, “Symmetry and duality in normal basis multiplication,” in Proc. 6th Symp. Appl. Algebra, Algebraic Algorithms Error-Correcting Codes (AAECC), 1989, pp. 230–238.
[13] R. Azarderakhsh and A. Reyhani-Masoleh, “Efficient FPGA implementations of point multiplication on binary Edwards and generalized Hessian curves using Gaussian normal basis,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 20, no. 8, pp. 1453–1466, Aug. 2012.
[14] P. K. Meher, “Systolic and non-systolic scalable modular designs of finite field multipliers for Reed–Solomon codec,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 17, no. 6, pp. 747–757, Jun. 2009.
[15] S. Kwon, “A low complexity and a low latency bit parallel systolic multiplier over GF(2^m) using an optimal normal basis of type II,” in Proc. 16th IEEE Symp. Comput. Arithmetic, Jun. 2003, pp. 196–202.
[16] J. Fan, D. V. Bailey, L. Batina, T. Guneysu, C. Paar, and I. Verbauwhede, “Breaking elliptic curve cryptosystems using reconfigurable hardware,” in Proc. Int. Conf. Field Program. Logic Appl., 2010, pp. 133–138.
[17] Z. Wang and S. Fan, “Efficient Montgomery-based semi-systolic multiplier for even-type GNB of GF(2^m),” IEEE Trans. Comput., vol. 61, no. 3, pp. 415–419, Mar. 2012.
[18] S. Talapatra, H. Rahaman, and J. Mathew, “Low complexity digit serial systolic Montgomery multipliers for special class of GF(2^m),” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 18, no. 5, pp. 847–852, May 2010.
[19] C.-Y. Lee and C. W. Chiou, “Scalable Gaussian normal basis multipliers over GF(2^m) using Hankel matrix-vector representation,” J. Signal Process. Syst., vol. 69, no. 2, pp. 197–211, 2012.
[20] R. Azarderakhsh and A. Reyhani-Masoleh, “A modified low complexity digit-level Gaussian normal basis multiplier,” in Proc. 3rd Int. Workshop Arithmetic Finite Fields, 2010, pp. 25–40.
[21] Digital Signature Standard, Nat. Inst. Standards Technol., Gaithersburg, MD, USA, Jan. 2000.
[22] A. Reyhani-Masoleh, “Efficient algorithms and architectures for field multiplication using Gaussian normal bases,” IEEE Trans. Comput., vol. 55, no. 1, pp. 34–47, Jan. 2006.
[23] S. Kwon, K. Gaj, C. H. Kim, and C. P. Hong, “Efficient linear array for multiplication in GF(2^m) using a normal basis for elliptic curve cryptography,” in Proc. 6th Int. Workshop Cryptogr. Hardw. Embedded Syst., 2014, pp. 76–91.
[24] R. Azarderakhsh, M. M. Kermani, S. Bayat-Sarmadi, and C.-Y. Lee, “Systolic Gaussian normal basis multiplier architectures suitable for high-performance applications,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 23, no. 9, pp. 1969–1972, Sep. 2015.
[25] IEEE Standard Specifications for Public-Key Cryptography, IEEE Standard 1363-2000, Jan. 2010.
[26] National Institute of Standards and Technology, Digital Signature Standard, FIPS Publications 186-2, U.S. Dept. Commerce, Washington, DC, USA, Jan. 2000.
[27] J. K. Omura and J. L. Massey, “Computational method and apparatus for finite field arithmetic,” U.S. Patent 4 587 627, May 6, 1986.
[28] P. Chen, S. N. Basha, M. Mozaffari-Kermani, R. Azarderakhsh, and J. Xie, “FPGA realization of low register systolic all-one-polynomial multipliers over GF(2^m) and their applications in trinomial multipliers,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 25, no. 2, pp. 725–734, Feb. 2017.
[29] T. Beth and D. Gollman, “Algorithm engineering for public key algorithms,” IEEE J. Sel. Areas Commun., vol. 7, no. 4, pp. 458–466, May 1989.
[30] G. B. Agnew, R. C. Mullin, I. M. Onyszchuk, and S. A. Vanstone, “An implementation for a fast public-key cryptosystem,” J. Cryptol., vol. 3, no. 2, pp. 63–79, 1991.
[31] A. Reyhani-Masoleh and M. A. Hasan, “Efficient digit-serial normal basis multipliers over binary extension fields,” ACM Trans. Embedded Comput. Syst., vol. 3, no. 3, pp. 575–592, 2004.
[32] A. H. Namin, H. Wu, and M. Ahmadi, “A word-level finite field multiplier using normal basis,” IEEE Trans. Comput., vol. 60, no. 6, pp. 890–895, Jun. 2006.
[33] C.-Y. Lee and P.-L. Chang, “Digit-serial Gaussian normal basis multiplier over GF(2^m) using Toeplitz matrix-approach,” in Proc. Int. Conf. Comput. Intell. Softw. Eng. (CiSE), 2009, pp. 1–4.
[34] H. El-Razouk and A. Reyhani-Masoleh, “New architectures for digit-level single, hybrid-double, hybrid-triple field multiplications and exponentiation using Gaussian normal bases,” IEEE Trans. Comput., vol. 65, no. 8, pp. 2495–2509, Aug. 2016.
[35] V. Trujillo-Olaya and J. Velasco-Medina, “Half-matrix normal basis multiplier over GF(p^m),” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 64, no. 4, pp. 879–891, Apr. 2017.
[36] B. Rashidi, S. M. Sayedi, and R. R. Farashahi, “An efficient and high-speed VLSI implementation of optimal normal basis multiplication over GF(2^m),” Integr., VLSI J., vol. 55, pp. 138–154, Sep. 2016.
[37] C.-Y. Lee and P. K. Meher, “Area-efficient subquadratic space-complexity digit-serial multiplier for type-II optimal normal basis of GF(2^m) using symmetric TMVP and block recombination techniques,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 62, no. 12, pp. 2846–2855, Dec. 2015.
[38] J.-S. Pan, C.-Y. Lee, and Y. Li, “Subquadratic space complexity Gaussian normal basis multipliers over GF(2^m) based on Dickson–Karatsuba decomposition,” IET Circuits, Devices Syst., vol. 9, no. 5, pp. 336–342, 2015.
[39] B. Rashidi, S. M. Sayedi, and R. R. Farashahi, “Efficient and low-complexity hardware architecture of Gaussian normal basis multiplication over GF(2^m) for elliptic curve cryptosystems,” IET Circuits, Devices Syst., vol. 11, no. 2, pp. 103–112, 2017.
[40] H. T. Kung, “Why systolic architectures?” Computer, vol. 15, no. 1, pp. 37–46, Jan. 1982.

Qiliang Shao received the B.S. degree in polymer material science and engineering from Donghua University, Shanghai, China, in 2014. He is currently pursuing the M.S. degree with the Department of Electrical Engineering, Wright State University, Dayton, OH, USA.
His current research interests include VLSI cryptographic circuits design and finite field arithmetic design.

Zhenji Hu is currently pursuing the Ph.D. degree with the School of Law, Shanghai University of Finance and Economics, Shanghai.
Her current research interests include finance and security issues related to the finance.

Shaobo Chen received the B.S degree in communication engineering from Xi’dian University, Xi’an, China, and the M.S. degree in electrical engineering from the University of Pittsburgh, Pittsburgh, PA, USA, in 2012 and 2015, respectively. He is currently pursuing the Ph.D. degree with the Department of Electrical Engineering, Wright State University, Dayton, OH, USA.
His current research interests include VLSI cryptographic circuits design and VLSI signal processing systems.

Pingxiuqi Chen received the B.S. degree in physics from Hainan Normal University, Haikou, China, in 2014. She is currently pursuing the Ph.D. degree with the Department of Electrical Engineering, Wright State University, Dayton, OH, USA.
Her current research interests include VLSI cryptographic circuits design and finite field arithmetic design.

Jiafeng Xie (M’15) received the B.E. degree in measurement and control technology and instrumentation from Yanshan University, Qinhuangdao, China, in 2006, the M.E. degree in control science and engineering from Central South University, Changsha, China, in 2010, and the Ph.D. degree in electrical engineering from the University of Pittsburgh, Pittsburgh, PA, USA, in 2014.
He is currently an Assistant Professor with the Department of Electrical Engineering, Wright State University, Dayton, OH, USA. His current research interests include VLSI cryptographic circuits design, hardware security, postquantum cryptography, DNA cryptography, intelligent system fault detection, and VLSI signal/image processing systems.
Dr. Xie is currently serving in the editorial board of Microelectronics Journal (Elsevier).