
Microprocessors and Microsystems 29 (2005) 1–7
www.elsevier.com/locate/micpro

An efficient VLSI implementation of IDEA encryption algorithm using VHDL


M. Thaduri, S.-M. Yoo*, R. Gaede
Electrical and Computer Engineering Department, The University of Alabama in Huntsville, 301 Sparkman Dr, Huntsville, AL 35899, USA
Received 15 November 2003; revised 29 May 2004; accepted 5 June 2004
Available online 8 July 2004

Abstract
Data security is an important issue in computer networks, and cryptographic algorithms are essential parts of network security. So far, the International Data Encryption Algorithm (IDEA) has been considered very secure. In this paper, we present a VLSI implementation of the IDEA block cipher, written in VHDL and built from AMI 0.5 µm process technology standard cells. We have optimized the modulus multiplier and exploited the temporal parallelism available in the IDEA algorithm. In our implementation, the subkeys are generated internally once the original key is fetched. This key is retained unless a new key is used for encryption. The implementation does not employ an additional RAM to store the subkeys. Our chip contains eight identical units, and each unit can execute one round of the algorithm. Using a pipelined design, eight rounds of the algorithm are executed in parallel on one chip. Our implementation operating at 10 MHz achieves a throughput of greater than 700 Mbps, which is several times higher than previous implementations.
© 2004 Elsevier B.V. All rights reserved.
Keywords: Data encryption; Modulus multiplier; Temporal parallelism; VLSI implementation

* Corresponding author. Tel.: +1-256-824-6858; fax: +1-256-824-6803. E-mail address: yoos@ece.uah.edu (S.-M. Yoo).
0141-9331/$ - see front matter © 2004 Elsevier B.V. All rights reserved. doi:10.1016/j.micpro.2004.06.002

1. Introduction
Recently, the number of individuals and organizations using wide computer networks for personal and professional activities has increased considerably. Several of their applications are highly sensitive to data security, such as commercial exchange on the Internet and smart cards [1,2]. A cryptographic algorithm is an essential part of network security. A well-known cryptographic algorithm is the Data Encryption Standard (DES) [3,4], which is widely adopted in security products. However, serious concerns about its long-term security arise from the relatively short key length of only 56 bits and, more recently, from highly successful cryptanalysis attacks [4]. Another cryptographic algorithm is the International Data Encryption Algorithm (IDEA) [4,5], which is considered one of the most important post-DES cryptographic algorithms due to its high immunity to attacks [6,7]. The IDEA algorithm overcomes the problems of the DES algorithm. IDEA is highly secure. Even with one billion computers each testing one billion combinations per second, it would take 10,000 billion years to crack the code ($2^{128}$ variants), longer than the universe has existed [4]. It can be widely used for audio and video data in cable TV, pay TV, and video conferencing, for sensitive financial and commercial data, for e-mail sent over public networks, for transmission lines via modem, router or ATM link, and for smart cards.

Many researchers have implemented the IDEA algorithm. Curiger et al. [8] have implemented a chip that is the first silicon block encryption device applicable to on-line encryption in high-speed networking protocols like ATM. With a system clock frequency of 25 MHz, this device permits a data conversion rate of more than 177 Mbps. Salomao et al. [9] have implemented a single round of IDEA on one chip; it operates at a worst-case clock frequency of 30 MHz, producing a throughput of 424 Mbps. Qin et al. [10] have also implemented a chip whose reported maximum clock time period and throughput are 8 and 133 Mbps, respectively.

In this paper, we present a VLSI implementation of the IDEA block cipher, written in VHDL and built from AMI 0.5 µm process technology [11] standard cells. We have optimized the modulus multiplier and exploited the temporal parallelism available in the IDEA algorithm. In our implementation, the subkeys are generated internally once the original key is


fetched. This key is retained unless a new key is used for encryption. The implementation does not employ additional RAM to store the subkeys, which is a significant improvement in area. Our chip contains eight identical units, and each unit can execute one round of the algorithm. Using a pipelined design, eight rounds of the algorithm are executed in parallel on one chip. Our implementation operating at 10 MHz achieves a throughput of greater than 700 Mbps, which is several times higher than previous implementations.

The rest of the paper is organized as follows. Section 2 describes the IDEA cryptographic algorithm. Section 3 describes the architecture of the cryptographic chip, explaining the details of the different building blocks. Section 4 compares the performance of our implementation to earlier ones. Finally, Section 5 concludes the paper.

2. The IDEA cryptographic algorithm
In this section, we briefly introduce the IDEA algorithm. IDEA is a symmetric, block-oriented cryptographic algorithm. It operates on 64-bit plaintext blocks and uses 128-bit keys, which makes it practically immune to brute-force attacks. IDEA is based upon a basic function, which is iterated eight times. The first iteration operates on the input 64-bit plaintext block, and each successive iteration operates on the 64-bit block from the previous iteration. After the last iteration, a final transform step produces the 64-bit cipher block. The algorithm structure has been chosen such that, with the exception that different key sub-blocks are used, the encryption process is identical to the decryption process.

IDEA uses both confusion and diffusion to encrypt the data. The design philosophy behind the algorithm is one of mixing operations from different algebraic groups. Three algebraic groups, XOR, addition modulo $2^{16}$, and multiplication modulo $2^{16}+1$, are mixed, and they are all easily implemented in both hardware and software. All these operations operate on 16-bit sub-blocks.

Fig. 1 shows an overview of IDEA. The 64-bit input data is divided into four 16-bit sub-blocks X1, X2, X3, and X4. These four sub-blocks become the input to the first round of the algorithm. There are eight rounds in total. In each round, the four sub-blocks are XORed, added, and multiplied with one another and with six 16-bit subkeys. Between the rounds, the second and the third sub-blocks are swapped. Finally, the four sub-blocks after the eighth round are collected and combined with four subkeys in an output transformation.

In each round, the sequence of events is as follows:
1. Multiply X1 by the first subkey.
2. Add X2 and the second subkey.
3. Add X3 and the third subkey.
4. Multiply X4 by the fourth subkey.
5. XOR the results of Steps 1 and 3.
6. XOR the results of Steps 2 and 4.
7. Multiply the result of Step 5 by the fifth subkey.
8. Add the results of Steps 6 and 7.
9. Multiply the result of Step 8 by the sixth subkey.
10. Add the results of Steps 7 and 9.
11. XOR the results of Steps 1 and 9.
12. XOR the results of Steps 3 and 9.
13. XOR the results of Steps 2 and 10.
14. XOR the results of Steps 4 and 10.

Fig. 1. Basic structure of the IDEA algorithm.

The output of the round is the four sub-blocks that are the results of Steps 11, 12, 13, and 14. Swap the two inner blocks (except in the last round), and the result is the input to the next round. After the eighth round, there is a final transformation stage, shown in Fig. 1. Finally, the four sub-blocks are concatenated to give the final encrypted result. Each stage uses only the three basic functional units mentioned earlier.
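The fourteen steps and the output transformation can be sketched as a small software reference model. This is only an illustration in Python, not the VHDL of the chip. The function names (`mul`, `add`, `idea_round`, `encrypt_block`) are hypothetical, and two points are assumptions taken from the widely used reference formulation of IDEA rather than from the text above: in the modular multiplication an all-zero 16-bit word stands for $2^{16}$, and the tuple of Steps 11 to 14 is passed straight to the next round, with the two inner blocks exchanged once before the output transformation.

```python
MASK16 = 0xFFFF
MOD = 0x10001  # 2**16 + 1

def mul(a, b):
    """Multiply modulo 2**16 + 1; the word 0 stands for 2**16 (assumed convention)."""
    a = a or 0x10000
    b = b or 0x10000
    return ((a * b) % MOD) & MASK16

def add(a, b):
    """Add modulo 2**16."""
    return (a + b) & MASK16

def idea_round(x, k):
    """One round: x holds four 16-bit sub-blocks, k holds six 16-bit subkeys."""
    x1, x2, x3, x4 = x
    s1 = mul(x1, k[0])   # Step 1
    s2 = add(x2, k[1])   # Step 2
    s3 = add(x3, k[2])   # Step 3
    s4 = mul(x4, k[3])   # Step 4
    s5 = s1 ^ s3         # Step 5
    s6 = s2 ^ s4         # Step 6
    s7 = mul(s5, k[4])   # Step 7
    s8 = add(s6, s7)     # Step 8
    s9 = mul(s8, k[5])   # Step 9
    s10 = add(s7, s9)    # Step 10
    # Steps 11, 12, 13, 14 in the order listed in the text.
    return (s1 ^ s9, s3 ^ s9, s2 ^ s10, s4 ^ s10)

def encrypt_block(x, subkeys):
    """Eight rounds plus the output transformation; subkeys holds 52 words."""
    for r in range(8):
        x = idea_round(x, subkeys[6 * r:6 * r + 6])
    x1, x2, x3, x4 = x
    k = subkeys[48:52]
    # The inner blocks are exchanged here, so the last round has no net swap.
    return (mul(x1, k[0]), add(x3, k[1]), add(x2, k[2]), mul(x4, k[3]))
```

For example, `idea_round((0, 1, 2, 3), (1, 2, 3, 4, 5, 6))` evaluates to `(0x00F0, 0x00F5, 0x010A, 0x0105)`, which can be checked by hand against the fourteen steps.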

3. The VHDL implementation of IDEA
The goal of this implementation is to achieve the highest possible throughput. After a careful evaluation of the basic building blocks, we have established an eight-stage pipelined data path containing one round in each stage, each operating on a different data set. We have used the bottom-up design approach, implementing the elementary operations first before designing the final data path.
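The behaviour of such a pipelined data path, with every stage working on a different data set in the same clock, can be illustrated with a short clock-by-clock model. This is a sketch under stated assumptions, not the chip's VHDL: `run_pipeline` and its arguments are hypothetical names, each stage is an arbitrary Python function standing in for one round, and `None` marks an empty pipeline register.

```python
def run_pipeline(blocks, stage_fns):
    """Clock blocks through len(stage_fns) registered stages.

    Returns (outputs, clocks), where clocks counts clock edges until the
    last block has left the final stage.
    """
    depth = len(stage_fns)
    regs = [None] * depth      # regs[s] is the register after stage s
    outputs = []
    fed = 0
    clocks = 0
    while len(outputs) < len(blocks):
        nxt = [None] * depth
        # In one clock, every stage processes the value latched by its predecessor.
        for s in range(1, depth):
            if regs[s - 1] is not None:
                nxt[s] = stage_fns[s](regs[s - 1])
        # A new block enters the first stage on every clock while input remains.
        if fed < len(blocks):
            nxt[0] = stage_fns[0](blocks[fed])
            fed += 1
        regs = nxt
        clocks += 1
        if regs[-1] is not None:
            outputs.append(regs[-1])
    return outputs, clocks
```

With eight stages, the first result appears after eight clocks and one further result follows on every later clock, so N blocks need N + 7 clocks in this model.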


Fig. 2. Three blocks of our multiplier.

The design of the multiplier is crucial for the optimal performance of the chip. The time required for multiplication modulo $(2^{16}+1)$ has a significant influence on the system clock frequency. Various modulo multiplication architectures have been proposed in Ref. [12] for cryptographic applications. The area and throughput for different implementations are also shown in Ref. [12]. The authors of Ref. [12] chose multiplication by mod $(p-1)$ additions using bit-pair recoding in standard cells. The difficulty in designing the multiplier lies in taking 16-bit input data, initially producing a 32-bit result, and then taking the modulus of that result with 65537. Different

implementations of the multiplier have been tested, but due to clock cycle problems in those designs the result was only obtained after eight clock cycles. With such latency, the multiplier unit could not be used in the pipeline stages, because the result must be available within one clock cycle. As mentioned above, for a parallel, synchronous implementation of the IDEA algorithm in which each round is implemented in a single clock cycle, it is important that the multiplier design itself be asynchronous, rather than a pipelined synchronous design as in the case of Booth's algorithm. Also, it is reported that the use of Wallace tree compressors can achieve, in the worst case, the same reduction of

Fig. 3. Final layout of multiplier in AMI 0.5 µm process.


the number of partial products as Booth's algorithm in less time [15]. Thus, Booth's algorithm is not used in our design, since one of our design goals is high throughput. The multiplier in our design is composed of three blocks, as shown in Fig. 2. The first is a parallel partial product generator. The second is the Wallace tree section, which adds all of the partial products simultaneously to produce two numbers. The third is the carry look-ahead adder, which adds the two numbers obtained from the Wallace tree section. We have implemented a parallel multiplier unit using the following new scheme for generating the initial partial products, in which the delay associated with the generation of the partial products comes down to the delay associated with a single AND gate.

$$Z = XY \bmod (2^n + 1)$$

[Equation expansion not recoverable from the source. What survives: the product is written as sums over bit indices from 1 to $n-1$, regrouped into index pairs $(1, 2)$, $(3, 4)$, ..., $(n-2, n-1)$, with the whole expression taken mod $(2^n + 1)$.]

Fig. 4. Basic block diagram of the complete chip.
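The three-block multiplier structure described above (AND-gate partial products, a compressor tree that leaves two numbers, and a final adder) can be modelled in software as follows. This is a hedged sketch, not the authors' circuit: all function names are hypothetical, the tree uses plain 3:2 carry-save compressors as a stand-in for whatever compressor cells the chip uses, and the final `fold_mod` step relies on the standard identity $2^{16} \equiv -1 \pmod{2^{16}+1}$ to reduce a 32-bit product by subtracting its high half from its low half.

```python
def partial_products(x, y, n=16):
    """One row per bit of y: x gated by that bit (an AND), shifted into place."""
    return [(x if (y >> i) & 1 else 0) << i for i in range(n)]

def compress_3_2(a, b, c):
    """Bitwise carry-save compressor: three rows in, two rows out, same sum."""
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return s, carry

def reduce_to_two(rows):
    """Tree-style reduction: compress groups of three rows until two remain."""
    rows = list(rows)
    while len(rows) > 2:
        nxt = []
        for i in range(0, len(rows) - 2, 3):
            nxt.extend(compress_3_2(rows[i], rows[i + 1], rows[i + 2]))
        # Rows that do not fill a group of three pass through unchanged.
        nxt.extend(rows[len(rows) - len(rows) % 3:])
        rows = nxt
    return rows

def multiply(x, y, n=16):
    """Partial products -> compressor tree -> one final carry-propagate add."""
    return sum(reduce_to_two(partial_products(x, y, n)))

def fold_mod(product):
    """Reduce a product below 2**32 modulo 2**16 + 1 via low half minus high half."""
    lo = product & 0xFFFF
    hi = product >> 16
    return lo - hi if lo >= hi else 0x10001 + lo - hi
```

For instance, `fold_mod(multiply(40, 6))` gives 240, since the product is already smaller than the modulus. Only the last addition propagates carries, which is why the compressor tree is attractive when the whole multiplication must settle within one clock cycle.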

Fig. 5. Subkey generation unit.


Fig. 6. Block diagram of our chip.

After generating the partial products in this way, we use Wallace 4:2 compressors to quickly generate the final two vectors, which are added using a carry select adder. The final modulus of the 32-bit result is found using the pseudo code,

    IF RESULT(15 DOWNTO 0) >= RESULT(31 DOWNTO 16) THEN
        O/P <= RESULT(15 DOWNTO 0) - RESULT(31 DOWNTO 16);
    ELSE
        O/P <= (65537 + RESULT(15 DOWNTO 0)) - RESULT(31 DOWNTO 16);
    END IF;

The multiplier was successfully simulated for basic functionality, then synthesized into an Altera FPGA using the Leonardo synthesis tool and back-annotated to obtain the actual delay of the circuit. Also, the multiplier unit was

synthesized and targeted to an ASIC using the Leonardo synthesis tool. The design was optimized for delay. Then, the final layout of the multiplier was generated. It occupied an area of 1,240,000 µm². We used the EPF10K70RC240 chip as the final target device because, when we tried to use a smaller chip and optimize the device with constraints on area, we got an unpredicted result. This might be because the multiplier unit was bigger than the target chip, so the tool mixed up some nets while optimizing. We believe that the chunks at the beginning are the full adders used to generate the partial products. Then, all the partial products are added using the Wallace tree to get the final sum, and then the modulus is generated. This output is obtained by running the Altera place and route tools and doing the back annotation to get the real delay of the circuit. Glitches are visible in the output. Fig. 3 shows the final layout of the multiplier in the AMI 0.5 µm process.

The adder unit was also synthesized using the Leonardo synthesis tool, but into a smaller target device, the EPF10K20TC144. The XOR gate was easily implemented and synthesized.

Now, the three basic functional units are used to implement the whole chip. The basic functionality is split up into eight pipeline stages, as shown in Fig. 4. Each stage receives six subkeys generated by the key generation unit, and the intermediate results are passed to the next stage on detecting a clock signal. In the implementation of the algorithm, we have used a strategy to generate the subkeys on the fly without the latency of the subkey generation appearing in the critical path. The 128-bit key is latched into the chip on the clock signal when the key latch signal is asserted. Then, for the next six consecutive clock triggers, eight new subkeys are generated on each clock trigger. These 52 subkeys are then used until a new key is fed in using the key latch signal. There is a latency of three

Fig. 7. Final product of IDEA cryptographic chip.

Table 1
Performance comparisons without scaling

Ref.    Clock frequency (MHz)    Data conversion rate (Mbps) per chip    Area (mm²)
[8]     25                       177                                     107.8
[9]     30                       424                                     29
[10]    8                        133                                     4.58
Ours    10                       700                                     1.95

down to 31%. Similarly, there will be a 94% increase in delay in Ref. [9], a 500% increase in delay in Ref. [10], and a 245% increase in delay in our implementation. The clock frequency is modified in each of these designs to reflect these delays, and the throughput is recalculated. The area is also scaled to 1 µm. During the calculation of the area, the area due to interconnection is not taken into account. Table 2 shows the scaled comparison. It shows that our implementation considerably improves throughput and area.

Table 2
Performance comparisons with scaling

Ref.    Clock frequency (MHz)    Data conversion rate (Mbps) per chip    Area (mm²)
[8]     35                       250                                     74
[9]     17                       218                                     59
[10]    1.6                      26                                      73
Ours    4                        522                                     7.8

5. Conclusion
In this paper, we have presented a VLSI implementation of the IDEA block cipher, written in VHDL and built from AMI 0.5 µm process technology standard cells. We have optimized the modulus multiplier and exploited the temporal parallelism available in the IDEA algorithm. The subkeys are generated internally once the original key is fetched. The implementation does not employ additional RAM to store the subkeys, which is a significant improvement in area. Our chip contains eight identical units, and each unit can execute one round of the algorithm. Using a pipelined design, eight rounds of the algorithm are executed in parallel on one chip. Consequently, our implementation achieves a high throughput compared to others.

Recently, the US National Institute of Standards and Technology (NIST) selected the Rijndael algorithm [13] as the Advanced Encryption Standard to replace the DES encryption algorithm. Rijndael is expected to be the next-generation cryptographic standard. Currently, we are investigating the possibility of implementing the Rijndael encryption algorithm using a pipelined design.

clock cycles in the pipeline if the encryption key is changed. This three-clock-cycle latency is present because the subkeys are generated faster than the round pipeline scheduling requires. Fig. 5 shows the block diagram of the subkey generation part, and Fig. 6 shows the block diagram of our chip. The total delay of the complete chip is equal to the delay of a single pipeline stage. Thus, the total round frequency depends on the delay of two multiplier units. Fig. 7 shows the final output of our IDEA cryptographic chip.
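The on-the-fly subkey generation described in this section (latch the 128-bit key, then produce eight subkeys per clock) can be sketched in software as a generator that yields one batch per clock. This is an illustration only. The names `rotl128`, `subkey_batches` and `expand_key` are hypothetical, and the 25-bit left rotation between batches is taken from the standard IDEA key schedule, which the text above does not spell out.

```python
KEY_MASK = (1 << 128) - 1

def rotl128(k, r):
    """Rotate a 128-bit value left by r bits."""
    return ((k << r) | (k >> (128 - r))) & KEY_MASK

def subkey_batches(key128):
    """Yield seven batches of eight 16-bit subkeys, one batch per clock.

    The first batch comes straight from the latched key; each of the six
    following batches comes from the key register rotated left by 25 bits
    (assumed: standard IDEA key schedule).
    """
    k = key128 & KEY_MASK
    for _ in range(7):
        yield [(k >> (112 - 16 * i)) & 0xFFFF for i in range(8)]
        k = rotl128(k, 25)

def expand_key(key128):
    """Collect the batches and keep the 52 subkeys the eight rounds and
    the output transformation need (6 * 8 + 4)."""
    subkeys = [s for batch in subkey_batches(key128) for s in batch]
    return subkeys[:52]
```

For the key `0x00010002000300040005000600070008`, the first batch is simply the eight key words 1 to 8, and the second batch is `0x0400, 0x0600, 0x0800, 0x0A00, 0x0C00, 0x0E00, 0x1000, 0x0200`. Because each batch depends only on a fixed rotation of the key register, no RAM is needed to hold the schedule, which matches the design goal stated in the text.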

4. Performance comparison
We have optimized the modulus multiplier and exploited the temporal parallelism available in the IDEA algorithm. The performance of our implementation is compared with earlier ones in Table 1. Taking clock frequency into account, our data conversion rate is several times higher than that of other implementations, and the area of our chip is much smaller.

Next, we considered the effect of feature size scaling on the delay of the multiplier, in order to compare our implementation with previous ones, since the design of the modulus multiplier has a significant effect on the overall throughput of the chip. In Ref. [8], the chip was laid out in 1.2 µm technology. In Refs. [9] and [10], the feature size was 0.7 µm and 0.25 µm, respectively. We have used AMI 0.5 µm technology to implement the chip. To make an equal-scale comparison, we refer to [14], which considers the effect of scaling on multipliers. We scale each reference design to 1.0 µm for a fair comparison among all implementations. We assume the designs to use procedural tree multipliers. Using 24 bits (single significand length) and a non-Booth implementation, the delay in Ref. [8] will scale

References
[1] N. Asokan, P. Janson, M. Steiner, M. Waidner, The state of the art in electronic payment systems, IEEE Comput. Magazine (1997) 28–35.
[2] D. Naccache, D. M'Raihi, Cryptographic smart cards, IEEE Micro. (1996) 14–25.
[3] N.F.P. National Bureau of Standards, US Department of Commerce, 1977.
[4] B. Schneier, Applied Cryptography, Wiley, New York, 1996.
[5] X. Lai, J. Massey, A proposal for a new block encryption standard, EUROCRYPT Conf. (1990) 389–404.
[6] J. Wilson, Data security hits home, IEEE Micro. (1995) 88.
[7] http://www.mediacrypt.com/press/idea_ps_052001.pdf
[8] A. Curiger, H. Bonnenberg, R. Zimmerman, N. Felber, H. Kaeslin, W. Fichtner, VINCI: VLSI implementation of the new block cipher IDEA, IEEE Custom Integrated Circuits Conf. (1993) 15.5.1–15.5.4.
[9] S. Salomao, V. Alves, E. Filho, HiPCrypto: a high-performance VLSI cryptographic chip, IEEE Int. ASIC Conf. (1998) 7–11.
[10] Y. Qin, J.C. Oh, B. Kim, CMOS implementation of the IDEA encryption algorithm, IEEE Midwest Symp. Circuits Syst. (2000) 272–275.
[11] http://www.mentor.com/partners/hep/AsicDesignKit/ASICindex.html
[12] A. Curiger, H. Bonnenberg, H. Kaeslin, Regular VLSI architectures for multiplication modulo (2^n + 1), IEEE J. Solid-State Circuits (1991) 990–994.
[13] Joan Daemen, Vincent Rijmen, AES Proposal: Rijndael. J. Daemen: Proton World Int'l, Zweefvliegtuigstraat 10, B-1130 Brussel, Belgium; V. Rijmen: Katholieke Universiteit Leuven, ESAT-COSIC, K. Mercierlaan 94, B-3001 Heverlee, Belgium.
[14] H. Al-Twaijry, M. Flynn, Technology scaling effects on multipliers, IEEE Trans. Comput. 47 (11) (1998) 1201–1215.
[15] P. Bonatto, V.G. Oklobdzija, Evaluation of Booth's algorithm for implementation in parallel multipliers, IEEE 29th Asilomar Conf. Signals, Syst. Comput. (1996) 608–610.

Seong-Moo Yoo is an associate professor in the electrical and computer engineering department at the University of Alabama in Huntsville. His research interests include computer security, wireless networks, and parallel computer architecture. He has a PhD degree in computer science from the University of Texas at Arlington. He is a senior member of IEEE (computer society and communication society) and a member of ACM. Contact him at yoos@ece.uah.edu.

Madan Mohan Thaduri is working as an analog design engineer at Princeton Microwave Technologies in NJ, USA. His research interests include VLSI architectures, DSP systems and embedded programming. He has a Master's degree in electrical engineering from the University of Alabama in Huntsville. He can be contacted at thadurm@ece.uah.edu.

Rhonda Kay Gaede is an associate professor of electrical and computer engineering at the University of Alabama in Huntsville. Her research interests include computer architecture, VLSI design, and reconfigurable computing. She has a PhD degree in electrical engineering from the University of Texas at Austin. She is a member of IEEE (computer society), ASEE and ACM. Contact her at gaede@ece.uah.edu.
