An FPGA Implementation of 30Gbps Security Module for GPON Systems

Truong Quang Vinh1, Ju-Hyun Park1, Young-Chul Kim1, Kwang-Ok Kim2
1 Department of Electronics and Computer Engineering, Chonnam National University, 300 Yongbong-dong, Buk-gu, Gwangju 500-757, Korea, tqvinh@soc.chonnam.ac.kr
2 Electronics and Telecommunication Research Institute, 161 Gajeong-dong, Yuseong-gu, Daejeon, Korea, kwangok@etri.re.kr

Abstract

GPON systems require data encryption at gigabit throughput for security and privacy. This paper presents an implementation of a very high speed security module for GPON on a Virtex-4 FPGA. The security module supports payload encryption with constant delay by using the counter mode AES algorithm. Our AES design has three advanced features: composite field arithmetic SubByte, an efficient MixColumn transformation, and on-the-fly key scheduling. A full-pipelined architecture is employed for the AES core in order to achieve high performance for the security module. The experiment shows that the proposed architecture achieves a throughput of 30 Gbits/s on a Xilinx Virtex-4 VLX100-12 device. This performance is well suited to encryption applications in GPON systems.

978-1-4244-2358-3/08/$20.00 © 2008 IEEE. CIT 2008.

1. Introduction

GPONs (Gigabit-capable Passive Optical Networks) have recently become attractive for cost-effective delivery of high-bandwidth data directly to the building, the curb, and the home. This creates a strong requirement for the access network to be trustworthy, secure, and reliable. An encryption module is therefore an essential part of a GPON system: it protects broadcast data from eavesdropping, which the multicast nature of GPONs makes possible. The ITU-T G.984 document [1] recommends the Advanced Encryption Standard (AES) for payload encryption in GPONs. The National Institute of Standards and Technology (NIST) defined five modes of operation for AES [2]. However, only AES in counter mode (CTR-AES) can be used for GPON payload encryption. In this paper, we present a GPON security module using the CTR-AES algorithm, implemented with a full-pipelined architecture for area and performance optimization.

A hardware implementation of the security module faces two critical constraints: performance and the amount of resource needed. To achieve high throughput for the gigabit links in GPONs, we apply pipelined architectures to all process blocks of the security module, especially the AES core. A pipelined AES architecture improves throughput, but it uses much area because the hardware is duplicated for the initial key addition and the 10 rounds. Researchers have therefore proposed several speed-area trade-offs for AES architectures. To reduce the resources of an AES implementation, they focus on improving individual blocks of the cipher. In [8]-[10], efficient implementations of the S-box are proposed to minimize area and delay. The proposed S-box architecture combines the SubBytes and Inverse SubBytes transformations instead of using look-up tables, which require much memory. To improve the key schedule, some authors propose on-the-fly key expansion, which generates the round keys concurrently with the encryption or decryption procedure and needs no extra memory to store the round keys [8], [9].

This paper explores efficient schemes for designing the security module in order to achieve the target performance of GPON systems. Our design employs a composite field arithmetic architecture for the SubByte transformation. Moreover, we sub-pipeline this function block to improve the throughput of the AES algorithm. The key expander is also improved. We propose an area-efficient key expander that computes round keys on the fly. In addition, we use a sub-pipelined architecture for the key expansion block and an optimized set of registers to store the round keys. Our key expander suits a pipelined AES architecture because it can start at the same time as data encryption.

The paper is organized as follows. Section 2 presents the architecture of the GPON security module. Section 3 describes the hardware implementation of the AES algorithm. The advanced features of the AES hardware implementation are presented in Section 4. Section 5 presents the implementation results and performance comparisons with different architectures. Section 6 gives the conclusion.

2. The architecture of the GPON security module

The GPON security module is implemented to guarantee secure communication on the Tx/Rx link of a GPON. The module ensures the confidentiality, integrity, and origin authenticity of each frame sent and received by the OLT (Optical Line Termination) / ONT (Optical Network Termination) [1]. The top structure of the GPON security module is shown in Fig. 1.

Fig. 1. The top structure of the GPON security module.

a. Port-ID Table: is implemented as 4K 12-bit registers that store the port identifiers. Only frames with the appropriate Port-ID are encrypted by the CTR-AES core.

b. Security Decoder: generates the crypto counter with the format (Inter Frame Count[19:0] & Intra Frame Count[15:0]) & (Inter Frame Count[29:0] & Intra Frame Count[15:0]) & (Inter Frame Count[29:0] & Intra Frame Count[15:0]), which is 36 + 46 + 46 = 128 bits wide. It also registers the 128-bit GTC Payload for the Payload Bypass.

c. Key Expander: stores the initial key and generates the round keys for CTR-AES from the 128-bit key input. The total number of round-key bits is 1408 = 128*(10+1). The shadow key is used if the OLT requires a key exchange. The ONT responds by generating, storing, and sending a new key. When the new key has been transferred successfully to the OLT, both the OLT and the ONU (Optical Network Unit) begin using the new key at precisely the same frame boundary.

d. CTR-AES Core: performs the same process as the AES algorithm, except that its input is the crypto counter. The crypto counter increases at every 128-bit data block. The 128-bit input blocks are transformed into 128-bit pseudorandom cipher blocks.

e. Payload Bypass: delivers the insecure payload without authentication or encryption. It has the same delay as the encryption time, so that it stays synchronized with the cipher GEM payload at the output.

f. Security Encoder: multiplexes the cipher GEM (G-PON Encapsulation Method) payload from the Bypass GEM Payload and the Encrypted GEM Payload, depending on whether the security function is enabled. For authentic frames, the encoder XORs the 128-bit pseudorandom cipher block with the delayed GEM payload to generate the cipher GEM payload.

The AES algorithm in the GPON security module uses counter mode to encrypt data [2]. In counter mode encryption, the forward cipher function is invoked on each counter block, and the resulting output blocks are exclusive-ORed with the corresponding plaintext blocks to produce the ciphertext blocks. The forward cipher function is used in both CTR encryption and CTR decryption. Therefore, a single hardware implementation serves both encryption and decryption. The XOR operation is executed in the security encoder block.

3. AES core implementation

3.1 AES general architecture

The AES algorithm is a symmetric-key cipher, in which the sender and the receiver use a single key for encryption and decryption. In AES encryption, each round except the final round consists of four steps: SubByte, ShiftRow, MixColumn, and AddRoundKey. SubByte is a nonlinear transformation that substitutes each byte of the round data according to a substitution table called the SBox. The ShiftRow step is a circular shift of the bytes in each row of the round data. The MixColumn transformation operates on the State column by column, treating each column as a four-term polynomial. AddRoundKey is performed simply by applying an exclusive OR between the round key and the data block. The round keys are different in every round and are generated by Key Expansion.

3.2 The full-pipelined architecture for AES algorithm

In order to achieve very high throughput, we apply pipelining to both the outer rounds and the inner rounds of the AES architecture. For outer round pipelining, pipeline registers are placed between the data path instances of each round. For inner round pipelining, we decompose the four processes SubByte, ShiftRow, MixColumn, and AddRoundKey into sub-pipelined stages with equivalent delay. Fig. 2 shows the full-pipelined architecture of the AES algorithm.

Fig. 2. Full-pipelined architecture for AES algorithm.

Among the round processes of the AES algorithm, the SubByte phase has the largest delay. Therefore, this block has more sub-stages than the other phases. We implemented two full-pipelined architectures, with a 3-stage sub-pipeline and a 5-stage sub-pipeline for each round process. The SubByte block is accordingly decomposed into 2 stages and 3 stages, respectively. The 5-stage sub-pipelined AES architecture gives the higher throughput.

4. Advanced features for AES hardware implementation

This section presents the innovative features of our AES hardware implementation. Each sub-block in the encryption process is optimized for area and delay. Our improvements to the AES architecture focus on the SubByte, MixColumn, and Key Expander blocks. The detailed hardware implementations of these blocks are described as follows.

4.1. SubByte transformation

In the SubByte transformation (SBox), the input is considered as an element of GF(2^8). First, the multiplicative inverse in GF(2^8) is calculated. Then, an affine transformation over GF(2) is applied. An SBox can be implemented as a look-up table, but this consumes much resource. Alternatively, an SBox can be implemented using Galois Field operations [10]. Field arithmetic in GF(2^4) is used instead of GF(2^8) to optimize area. In this architecture, the input value is mapped to two elements of GF(2^4). Then, the multiplicative inverse is calculated using GF(2^4) operations. Next, the two GF(2^4) elements are inverse mapped to one element in GF(2^8). Last, the affine transformation is performed.

Although the composite field implementation of the SBox is very efficient in area, it suffers from a long critical path. To overcome this drawback, further pipelining can be used. With the 2-stage pipelined architecture using three 8-bit registers (Fig. 3), the critical path is broken in half. To reduce the path delay further, the 3-stage pipelined architecture can also be applied (Fig. 4).

Fig. 3. 2-stage pipelined SBox using GF operations.

Fig. 4. 3-stage pipelined SBox using GF operations.

4.2. MixColumn

In the MixColumn transformation, the columns of the State are considered as polynomials over GF(2^8) and multiplied modulo x^4 + 1 with the fixed polynomial c(x) = ‘03’ x^3 + ‘01’ x^2 + ‘01’ x + ‘02’. In direct form, the MixColumn transformation can be expressed as

\[
\begin{cases}
s'_{0,c} = (\{02\} \cdot s_{0,c}) \oplus (\{03\} \cdot s_{1,c}) \oplus s_{2,c} \oplus s_{3,c} \\
s'_{1,c} = s_{0,c} \oplus (\{02\} \cdot s_{1,c}) \oplus (\{03\} \cdot s_{2,c}) \oplus s_{3,c} \\
s'_{2,c} = s_{0,c} \oplus s_{1,c} \oplus (\{02\} \cdot s_{2,c}) \oplus (\{03\} \cdot s_{3,c}) \\
s'_{3,c} = (\{03\} \cdot s_{0,c}) \oplus s_{1,c} \oplus s_{2,c} \oplus (\{02\} \cdot s_{3,c})
\end{cases}
\qquad (1)
\]

Several architectures have been proposed for the implementation of the MixColumn transformation. A substructure-shared architecture is applied in [4], [6], [7], [9]. In our implementation, we also use substructure sharing to implement efficient hardware for the MixColumn transformation. To apply this technique, equation (1) is rewritten as

\[
\begin{cases}
s'_{0,c} = \{02\} \cdot (s_{0,c} \oplus s_{1,c}) \oplus s_{1,c} \oplus (s_{2,c} \oplus s_{3,c}) \\
s'_{1,c} = \{02\} \cdot (s_{1,c} \oplus s_{2,c}) \oplus s_{0,c} \oplus (s_{2,c} \oplus s_{3,c}) \\
s'_{2,c} = \{02\} \cdot (s_{2,c} \oplus s_{3,c}) \oplus s_{3,c} \oplus (s_{0,c} \oplus s_{1,c}) \\
s'_{3,c} = \{02\} \cdot (s_{3,c} \oplus s_{0,c}) \oplus s_{2,c} \oplus (s_{0,c} \oplus s_{1,c})
\end{cases}
\qquad (2)
\]

The equation for the MixColumn transformation is now more symmetrical, and substructure sharing can be applied to optimize the area of the hardware implementation. The {02} constant multiplication is computed by the function denoted a = xtime(b). The xtime() function can be implemented at the byte level as a left shift and a subsequent conditional bitwise XOR with {1b} if the most significant bit of the input byte is one (b7 = 1). The xtime() block can be implemented with three 2-input XOR gates. By using this efficient xtime() architecture and applying XOR sharing, the MixColumn transformation can be implemented as shown in Fig. 5.

Fig. 5. (a) The efficient architecture of the MixColumn. (b) The implementation of xtime() function.

The total gate count of the MixColumn transformation is 324, which comprises 108 2-input XOR gates (each XOR gate contains 3 gates).

4.3. Key-Expander

The Key Expansion routine generates a total of 11 round keys from the initial key in the 128-bit AES algorithm. In a pipelined AES architecture, all round keys must be available at the same time. Therefore, some researchers implemented a key expansion routine that computes one round key and duplicated this hardware 10 times for the 10 rounds [4], [5]. These architectures can calculate all round keys at the same time, but they consume much area. Other researchers have proposed methods to reduce this cost. Xinmiao Zhang [9] proposed a key expander that operates on the fly, so that data encryption and key expansion can start simultaneously. Building on that architecture, we implement an area-efficient key expander that also computes round keys on the fly. In order to operate synchronously with the sub-pipelined round process, the key expander is divided into r sub-stages. We use 11 registers to store the 11 round keys. This differs from the key expansion architecture in [9], in which the author used r sets of registers to hold all round keys and the temporary values of the sub-pipelined stages. With this scheme, our design uses less area than the previous architecture. The sub-pipelined architecture of the on-the-fly key expander with 3 sub-stages (r=3) is shown in Fig. 6.

Fig. 6. The architecture of on-the-fly key expander.

Since the round keys are generated on the fly, the number of sub-pipelined stages for key expansion must equal the number of encryption sub-stages. A new round key is generated every r clock cycles, so all the round keys are available after (r×Nr) + 1 clock cycles.

5. Performance results and comparisons

We implemented the GPON security module with the full-pipelined architecture of 128-bit CTR-AES on a Virtex-4 VLX100-12. Xilinx ISE 8.2i was used to synthesize the design and to provide post-placement timing results. For simulation, we used ModelSim 5.8c to verify the encrypt/decrypt operations. We evaluated the hardware cost in terms of BRAMs, slices, maximum frequency, and throughput.

We implemented two full-pipelined architectures of the AES core, with 3 sub-pipelined stages (r=3) and 5 sub-pipelined stages (r=5). The 3-stage sub-pipelined design has 31 stages in total (r×10 + 1). Thus, after an initial latency of 31 clock cycles, one cipher text block appears every clock cycle. With this architecture, we achieve a throughput of 26.7 Gbits/s. The 5-stage sub-pipelined design has higher performance than the 3-stage sub-pipelined design. However, it consumes more area for pipeline registers and takes more clock cycles for the round processes.

Table 1 compares existing AES implementations with our implementation. Since the previous architectures were implemented on VirtexE devices, we also target a Xilinx VirtexE-family device, besides the Virtex-4, in order to compare the results fairly. According to the experimental results, the designs in [3]-[5] have lower performance because they use only an outer-pipeline architecture. In the implementations of [7] and [9], the authors improve the throughput by applying sub-pipelined architectures. Nevertheless, these designs require more slices for the extra hardware. In terms of throughput per slice, our implementation is more efficient than the published approaches (on VirtexE, 1.958 Mbps/slice for r=5 against 1.956 Mbps/slice for the r=7 design of [9]). The synthesis report shows that our design with the 5-stage sub-pipelined architecture achieves a throughput of 31.6 Gbits/s.

Table 1. Comparison of FPGA implementation of the AES algorithm

Design                      Device         Frequency (MHz)  Throughput (Mbps)  Slices  BRAMs  Mbps/slice
Shuenn-Shyang [3]           XCV1000e-8     125.38           1604               1857    0      0.867
Jae-Gon Lee [4]             XCV3200e-8     40               5120               8009    104    0.639
Saqib, N.A. [5]             XCV812e-8      20.192           2584               2744    0      0.942
Jarvinen [7]                XCV1000e-8     129.2            16500              11719   0      1.408
Xinmiao Zhang (r=3) [9]     XCV812e-8      93.5             11965              9406    0      1.272
Xinmiao Zhang (r=7) [9]     XCV1000e-8     168.4            21556              11022   0      1.956
Our AES core design (r=3)   XCV1000e-8     91.1             11661              8914    0      1.308
Our AES core design (r=5)   XCV1000e-8     150.25           19232              9820    0      1.958
Our AES core design (r=3)   XC4VLX100-12   208.49           26686              9478    0      2.816
Our AES core design (r=5)   XC4VLX100-12   247.19           31640              9904    0      3.195

The whole architecture of the GPON security module, including the AES core, is synthesized on a Xilinx Virtex-4 VLX100-12. Some extra resource is needed for the security decoder, security encoder, and payload bypass. Therefore, the total areas of the security module with the AES core (r=3) and the AES core (r=5) are 11958 slices and 13384 slices, respectively.

6. Conclusions

We presented an FPGA implementation of a high speed GPON security module using the counter mode AES algorithm. Our design has three main efficiency features: composite field arithmetic SubByte, area-efficient MixColumn, and an on-the-fly sub-pipelined Key-Expander. With these improvements, our design combines a small area with high throughput. For the full-pipelined architecture with 51 stages, we achieve a throughput of 30 Gbits/s on a Virtex-4 VLX100 device. Our implementation is well suited to encryption applications in GPON systems.

Acknowledgement

This research was financially supported by the Electronics and Telecommunication Research Institute (ETRI) in Korea. The CAD tools for design in this work were supported by IDEC.

References

[1] “Gigabit-capable Passive Optical Networks (G-PON): Transmission convergence layer specification”, ITU-T G.984.3 Amendment 1, July 2005.
[2] Morris Dworkin, “Recommendation for Block Cipher Modes of Operation”, NIST Special Publication, http://csrc.nist.gov/CryptoToolkit/modes/, 2001.
[3] Shuenn-Shyang Wang, Wan-Sheng Ni, “An efficient FPGA implementation of advanced encryption standard algorithm”, Proceedings of the International Symposium on Circuits and Systems, vol. 2, pp. 597-600, May 2004.
[4] Jae-Gon Lee, Woong Hwangbo, Seonpil Kim, Chong-Min Kyung, “Top-down implementation of pipelined AES cipher and its verification with FPGA-based simulation accelerator”, Proceedings of 6th International Conference on ASIC, pp. 68-72, Oct. 2005.
[5] Saqib, N.A., Rodriguez-Henriquez, F., Diaz-Perez, A., “AES algorithm implementation - an efficient approach for sequential and pipeline architectures”, Proceedings of the Fourth Mexican International Conference on Computer Science, pp. 126-130, Sept. 2003.
[6] Nedjah, N., de Macedo Mourelle, L., Cardoso, M.P., “A Compact Pipelined Hardware Implementation of the AES-128 Cipher”, Third International Conference on Information Technology: New Generations, pp. 216-221, April 2006.
[7] Yongzhi Fu, Lin Hao, Xuejie Zhang, Rujin Yang, “Design of an extremely high performance counter mode AES reconfigurable processor”, Second International Conference on Embedded Software and Systems, Dec. 2005.
[8] Hodjat, A., Verbauwhede, I., “Area-throughput trade-offs for fully pipelined 30 to 70 Gbits/s AES processors”, IEEE Transactions on Computers, vol. 55, no. 4, pp. 366-372, April 2006.
[9] Xinmiao Zhang, Parhi, K.K., “High-speed VLSI architectures for the AES algorithm”, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 12, no. 9, pp. 957-967, Sept. 2004.
[10] J. Wolkerstorfer, E. Oswald, and M. Lamberger, “An ASIC Implementation of the AES Sboxes”, Proceeding of RSA Conference, pp. 29-52, Feb. 2002.
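The counter-mode data path of Section 2 (crypto counter from the security decoder, forward cipher, XOR in the security encoder) can be illustrated with a short Python sketch. It is not the authors' hardware description. The function names are hypothetical, the forward cipher is replaced by a SHA-256 based placeholder because the Python standard library has no AES, and stepping the intra-frame count by one per 128-bit block is a simplifying assumption. The sketch only shows how the 128-bit counter block is assembled from three copies of the crypto counter and why one forward function serves both encryption and decryption.

```python
import hashlib


def crypto_counter_block(inter_frame: int, intra_frame: int) -> int:
    """Assemble the 128-bit counter block from three copies of the crypto counter.

    The first copy keeps only Inter Frame Count[19:0], so the widths are
    36 + 46 + 46 = 128 bits.
    """
    inter30 = inter_frame & ((1 << 30) - 1)
    intra16 = intra_frame & 0xFFFF
    full46 = (inter30 << 16) | intra16
    top36 = ((inter30 & ((1 << 20) - 1)) << 16) | intra16
    return (top36 << 92) | (full46 << 46) | full46


def stand_in_cipher(key: bytes, block: int) -> int:
    """Placeholder for the AES forward cipher (NOT AES): SHA-256 cut to 128 bits."""
    digest = hashlib.sha256(key + block.to_bytes(16, "big")).digest()
    return int.from_bytes(digest[:16], "big")


def ctr_process(key: bytes, inter_frame: int, first_intra: int, blocks):
    """XOR each 128-bit block with the cipher output of its counter block.

    Assumption: the intra-frame count advances by one per 128-bit block.
    The same call encrypts plaintext and decrypts ciphertext.
    """
    out = []
    for i, blk in enumerate(blocks):
        counter = crypto_counter_block(inter_frame, first_intra + i)
        out.append(blk ^ stand_in_cipher(key, counter))
    return out


if __name__ == "__main__":
    payload = [0x0123456789ABCDEF0123456789ABCDEF, 0x1, 0x2]
    cipher = ctr_process(b"demo-key", 5, 0, payload)
    # Running the same function again restores the payload.
    print(ctr_process(b"demo-key", 5, 0, cipher) == payload)
```

Because decryption reuses the forward function, no inverse cipher is needed, which matches the paper's remark that a single hardware implementation serves both directions.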
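The two defining steps of the SubByte transformation in Section 4.1, a multiplicative inverse in GF(2^8) followed by an affine transformation over GF(2), can be checked with the reference sketch below. It deliberately works directly in GF(2^8) and does not reproduce the composite-field GF(2^4) mapping used in the paper's hardware. The helper names are illustrative, and the inverse is computed by naive exponentiation for clarity rather than speed.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def gf_inv(a: int) -> int:
    """Multiplicative inverse computed as a^254; zero maps to zero."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


def rotl8(x: int, n: int) -> int:
    """Rotate a byte left by n bit positions."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


def sbox(x: int) -> int:
    """SubByte of one byte: GF(2^8) inverse, then the AES affine transformation."""
    inv = gf_inv(x)
    return inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63


if __name__ == "__main__":
    print(hex(sbox(0x53)))
```

A composite-field implementation would replace `gf_inv` with the map, GF(2^4) inversion, and inverse-map steps while producing the same table, so a reference like this is mainly useful as a test oracle for such a design.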
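The xtime() function and the two MixColumn forms of Section 4.2 can be compared in software. The sketch below assumes the substructure-shared form uses the four shared sums s0⊕s1, s1⊕s2, s2⊕s3, and s3⊕s0; the function names are illustrative and the code says nothing about the gate-level structure of Fig. 5. It only confirms that the shared form computes the same column as the direct form.

```python
def xtime(b: int) -> int:
    """Multiply a byte by {02}: left shift, then XOR with {1b} if b7 was set."""
    b <<= 1
    if b & 0x100:
        b ^= 0x11B
    return b & 0xFF


def mix_column_direct(col):
    """Direct form: each output byte uses one {02} and one {03} product."""
    s0, s1, s2, s3 = col

    def mul3(v):
        return xtime(v) ^ v

    return [
        xtime(s0) ^ mul3(s1) ^ s2 ^ s3,
        s0 ^ xtime(s1) ^ mul3(s2) ^ s3,
        s0 ^ s1 ^ xtime(s2) ^ mul3(s3),
        mul3(s0) ^ s1 ^ s2 ^ xtime(s3),
    ]


def mix_column_shared(col):
    """Substructure-shared form: four pairwise sums are computed once and reused."""
    s0, s1, s2, s3 = col
    t01, t12, t23, t30 = s0 ^ s1, s1 ^ s2, s2 ^ s3, s3 ^ s0
    return [
        xtime(t01) ^ s1 ^ t23,
        xtime(t12) ^ s0 ^ t23,
        xtime(t23) ^ s3 ^ t01,
        xtime(t30) ^ s2 ^ t01,
    ]


if __name__ == "__main__":
    column = [0xDB, 0x13, 0x53, 0x45]
    print([hex(v) for v in mix_column_shared(column)])
    print(mix_column_shared(column) == mix_column_direct(column))
```

In the shared form every output needs a single xtime() call and no separate {03} multiplier, which is where the area saving comes from.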
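The idea behind the on-the-fly key expander of Section 4.3, where each round key is derived from the previous one just as it is needed, can be sketched with a Python generator. This follows the standard AES-128 key schedule and is only a behavioural model: it does not represent the r sub-stages or the register arrangement of Fig. 6, and all names are illustrative.

```python
def _gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo 0x11B."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def _build_sbox():
    """Build the S-box from its definition (inverse as x^254, then affine map)."""
    table = []
    for x in range(256):
        inv = 1
        for _ in range(254):
            inv = _gf_mul(inv, x)
        y = inv
        for n in range(1, 5):
            y ^= ((inv << n) | (inv >> (8 - n))) & 0xFF
        table.append(y ^ 0x63)
    return table


SBOX = _build_sbox()


def next_round_key(prev, rcon: int):
    """Derive round key i+1 (four 32-bit words) from round key i."""
    rot = ((prev[3] << 8) | (prev[3] >> 24)) & 0xFFFFFFFF  # RotWord
    sub = 0
    for shift in (24, 16, 8, 0):  # SubWord, byte by byte
        sub |= SBOX[(rot >> shift) & 0xFF] << shift
    w0 = prev[0] ^ sub ^ (rcon << 24)
    w1 = prev[1] ^ w0
    w2 = prev[2] ^ w1
    w3 = prev[3] ^ w2
    return [w0, w1, w2, w3]


def round_keys(key_words):
    """Yield the 11 round keys of AES-128 one at a time, starting with the key."""
    key, rcon = list(key_words), 0x01
    yield key
    for _ in range(10):
        key = next_round_key(key, rcon)
        rcon = _gf_mul(rcon, 0x02)
        yield key


if __name__ == "__main__":
    cipher_key = [0x2B7E1516, 0x28AED2A6, 0xABF71588, 0x09CF4F3C]
    for index, rk in enumerate(round_keys(cipher_key)):
        print(index, " ".join(f"{w:08x}" for w in rk))
```

Only the previous round key is needed to produce the next one, which is why an on-the-fly expander can run alongside the encryption pipeline instead of precomputing all 1408 key bits.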
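The figures reported in Section 5 follow from simple arithmetic: a full pipeline with r sub-stages per round has r×10 + 1 stages, and once it is full it delivers one 128-bit block per clock cycle. The helper names below are illustrative, and the sketch merely recomputes values already listed in Table 1.

```python
BLOCK_BITS = 128  # AES block size in bits
ROUNDS = 10       # rounds of AES-128


def pipeline_stages(r: int) -> int:
    """Total pipeline depth for r sub-stages per round (r x 10 + 1)."""
    return r * ROUNDS + 1


def throughput_mbps(freq_mhz: float) -> float:
    """One block per clock cycle: throughput in Mbps is frequency (MHz) x 128."""
    return freq_mhz * BLOCK_BITS


def mbps_per_slice(throughput: float, slices: int) -> float:
    """Area efficiency metric used in Table 1."""
    return throughput / slices


if __name__ == "__main__":
    # (r, frequency in MHz, slices) for the two Virtex-4 rows of Table 1.
    for r, freq, slices in [(3, 208.49, 9478), (5, 247.19, 9904)]:
        tp = throughput_mbps(freq)
        print(r, pipeline_stages(r), int(tp), round(mbps_per_slice(int(tp), slices), 3))
```

The pipeline depth affects only the initial latency, not the steady-state rate, so the deeper r=5 design gains throughput purely from its higher clock frequency.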