11-1 CH04. Hardware Implementation for Fast Encryption (4p)

Ch 04. Hardware Implementation for Fast Encryption

이훈재 (李焄宰) Hoon-Jae Lee
CNSL, Cryptography and Network Security Lab.
hjlee@dongseo.ac.kr
http://kowon.dongseo.ac.kr/~hjlee
http://crypto.dongseo.ac.kr

Agenda
I. Overview of hardware implementation
II. Hardware Design Approaches
    2.1 Hardware design approaches
    2.2 ASIC/FPGA
    2.3 Pipeline/Parallel-structures
III. High-speed Block cipher
    3.1 High-speed components
    3.2 Comparisons for implemented AES/SEED/ARIA/IDEA
IV. High-speed Stream cipher
    4.1 Parallel-Structured/Shifting LFSR
    4.2 Word-Based Stream Cipher

2011-04-03 CNSL-Internet-DongseoUniv.
High-Speed Implementation
- Memory Access Time (ns)
    Samsung DDR SRAM speed (2004.7)
- Adder & Shifter Speed (ns)
    64-bit adder speed [0.13um ASIC, 2004]: 180/326 ps
      Neve et al. (Belgium), IEEE Trans. VLSI Systems, 2004
        720 ps @ 1.1 V, 0.18um
        326 ps @ 1.1 V, 0.13um
        180 ps @ 2.5 V, 0.13um
    32-bit adder speed [SAMSUNG 0.13um ASIC, 2002]
      Samsung ASIC speed (2002)
- Multiplier Speed (ns)
    64-bit multiplier speed [0.35um ASIC, 2003]: 12.8 ns
      SEDA (Europe), 2003, M.S. thesis "Design and Realization of a high-speed 64 x 64 multiplier for low power applications"
        12.8 ns @ 0.35um
High-Speed Implementation - Core Operation
Operation | Max. speed (ASIC)
XOR | several 10 ps
Mod 2^32 add | 180 ps
Mod 2^32 sub | 180 ps
Fixed shift | several 10 ps
Variable rotate | several 10 ps
Mod 2^32 mul | 12.8 ns
GF(2^8) mul (LUT) | several ns
S-box (LUT) | several ns
Notations: >>>, <<<
Operations used by each algorithm (one ● per operation used):
MARS ● ● ● ● ● ● ●
RC6 ● ● ● ● ●
Rijndael ● ● ● ●
Serpent ● ● ●
Twofish ● ● ● ● ●

1.1 AES evaluation - Initial (1997, NIST)
1.1 AES evaluation - Initial (1997, NIST)
1.1 AES evaluation - Final (2000, NIST)
1.1 AES evaluation - hardware efficiency
  ASIC
    Government: NSA
    IBM, Inc.
    Mitsubishi, Inc.
  FPGA
    Universities: UC Berkeley, USC (University of Southern California), WPI (Worcester Polytechnic Institute), GMU (George Mason University)
    Micronic, Inc.
1.2 Considerations for design approaches
Crypto/Non-Crypto Module
  Crypto module: refer to FIPS 140-1, 140-2, 140-3 (CMVP)
  Non-crypto module: power module, interface module, etc.
Security device approaches
  End-to-end (ETE) device: mainly software-oriented design
  Link encryption (LE) device: mainly hardware-oriented design
  Hybrid device: ETE & LE
Noiseless or noisy channel approaches
  Block cipher (PKC): wired communication, computer security
    Not suited to noisy channels such as satellite or wireless
  Stream cipher: bit-oriented, suited to satellite or wireless channels
    A block cipher needs a mode change to OFB to be used this way
  Word-based stream cipher: combines block cipher & stream cipher
    - Europe ECRYPT-eSTREAM (http://www.ecrypt.eu.org/stream/)

1.2 Considerations for design approaches
Implementation Platforms

Which Platform? Platform Characteristics
The choice of the implementation platform is driven by a
multitude of factors, which include,
Performance needed
Development and per-unit costs
Power consumption (Important in case of wireless devices!)
Flexibility
Physical Security
Choice heavily depends on application requirements, but…
Reconfigurable Hardware and Cryptography
Why Hardware?
  Software implementations are too slow for time-critical applications
  Hardware implementations are intrinsically more secure
Why Reconfigurable? …

Reconfigurable Hardware and Cryptography
Advantages of reconfigurable platforms
  Algorithm agility
  Algorithm upgradability
  Architecture efficiency
  Resource efficiency
  Algorithm modification
  Throughput (relative to software)
  Cost efficiency (relative to ASICs)
Hardware Implementation Methodologies
HDL design is done using an FSM model, termed a CFSM (Cryptographic FSM)
Four basic components of a CFSM
  State Register
  Key Register
  Updating Logic
  Control & Load Logic
Modification of the generic CFSM for implementing hash functions…
Hardware Implementation Methodologies
CFSM for Hash Functions

Module approaches
  S/W module: designed/implemented at client PC
  u-P module: crypto-microprocessor
    low/medium speed
  H/W module: FPGA or ASIC
    high speed or ultra-high speed (higher costs)
Hardware Design Approaches
  FPGA (Field-Programmable Gate Array):
    - ALTERA, FPGA
    - XILINX, CPLD
  ASIC (Application-Specific Integrated Circuits):
    - CAD company supported
Examples
Project 25 OTAR Documents: TIA/EIA APCO Project 25
  TIA/EIA Telecommunications Systems Bulletin, APCO Project 25, TSB102.AACA, "Over-The-Air-Rekeying (OTAR) Protocol New Technology Standards Project Digital Radio Technical Standards," Jan. 1996.
  TIA/EIA Telecommunications Systems Bulletin, APCO Project 25, TSB102.AACA-1, "Over-The-Air-Rekeying (OTAR) Protocol Addendum 1," Dec. 2000.
  TIA/EIA Telecommunications Systems Bulletin, TSB 102.AACB, "Over-The-Air-Rekeying (OTAR) Operational Description," Jan. 1997.
  TIA, IS 102.AAAA, "DES Encryption Protocol"
  TIA, TSB-102.BAAA, "Recommended Common Air Interface", Addendum BAAA-1
  TIA, TSB-102.BAAD, "Common Air Interface Operational Description for Conventional Channels"

Examples
[Figure: crypto module (Crypto M) surrounded by non-crypto modules (Non-Crypto M)]
II. Hardware Design Approaches
Design Tools
  FPGA (Field Programmable Gate Array)
  ASIC (Application Specific Integrated Circuits)
  NT (nano-technology)
Design approaches
  Round-common: implementation of a one-round function
    fewer gates, lower costs, but low speed
  Pipeline-structured: implementation of the full-round sequential functions
    more gates, higher costs, but high speed
  Parallel: parallel operation of round-common or pipeline-structured units
  Parallel & Pipeline: combinations of parallel and pipeline-structured

2.1 Higher speed approaches in block cipher
Approaches | Descriptions | Properties
1) HW design tools
   FPGA (Field Programmable Gate Array), e.g. Xilinx Virtex/Altera FLEX | Low cost, easy design at a lab or research institute; slower than ASIC for the same layout
   ASIC (application specific IC), e.g. Samsung/Hynix/Lucent Tech. | High cost, difficult design done at a design house; about 5~10 times faster than FPGA for the same layout
2) Semiconductor layout
   0.5µm → 0.35µm → 0.22µm → 0.18µm → 0.13µm (→ 90nm → 70nm → 60nm → 50nm) | Decreasing layout → high speed, low power, downsizing
3) Pipeline approaches
   Round-common →◘→ | Single-round chip (1)
   Pipeline-structured →◘◘◘◘◘→ (partial pipeline or full pipeline) | Multiple k-round chip (k times in speed), with k = #round, #round/divisor, or #round x positive integer
   Parallel-structured →◘→ →◘→ | Single-round chip x n parallel (n times)
   Parallel pipeline →◘◘◘◘◘→ →◘◘◘◘◘→ | Multiple k-round chip x n parallel (n x k times)
4) Component approaches
   Selection of high-speed components, e.g. XOR, Mod 2^32 adder, MUL, SFT, S-P | High-speed components (security needs come first)
   Combination or optimization of high-speed components, e.g. # rounds, # steps in F-function | Optimized combination of components
     • speed is inversely proportional to #round
     • speed is inversely proportional to #step in a round
     • the faster each step, the better
2.2 ASIC vs. FPGA
Items | ASIC | FPGA
Throughput | High (0.13 µm/0.5 µm) | Medium (0.22um, 0.35um)
Power consumption | Light-weight | Medium
Costs for implementation | High costs (50,000$-100,000$/item) | Low costs (design tools & PC)
Costs for sales | Low cost (large items), high cost (small items) | Medium cost
Design approaches | Expert company (customized) | Field-programmable

2.2 FPGA Device - example (Target)
FPGA - Altera chip
  FPGA device is "Cyclone II" by Altera
  Specification of "Cyclone II"
    EP2C35F672C8
    350,000 gates
    33,216 logic elements
    672 pins
    8 speed (4 is fastest speed)
2.2 FPGA Device - example (Target)
2.2 ASIC - company in Korea
Foundry - Manufacturing company

Samsung LSI (http://www.samsungelectronics.com/semiconductors/asic/ASIC.htm)
Hynix Semiconductor (http://www.hynix.com/)
ANAM Semiconductor (http://www.aaww.com/)
DongBu Electronics (http://www.dsemi.com/)
Design House
Samsung (http://www.samsungelectronics.com/semiconductors/asic/ASIC.htm)
Hynix (http://www.hynix.com/)
CNC technology (http://www.cnstec.com/)
ADC (http://www.adc.co.kr/)
TLI (http://www.tli.co.kr/)
ECT (http://www.ect.co.kr/)
INC technology (http://www.inctech.co.kr/)
ARALION (http://www.aralion.com)
2.3 Pipeline-structured
Input - (unit 1 … unit n) - Output
Throughput (s: # of input data, n: # of pipeline steps):
  $$S_p = \frac{T_1}{T_n} = \frac{s \times n}{n + (s - 1)}$$
Efficiency:
  $$E = \frac{S_p}{n} \times 100$$
Max. throughput:
  $$\lim_{s \to \infty} S_p = \lim_{s \to \infty} \frac{n}{\frac{n}{s} + \left(1 - \frac{1}{s}\right)} = n$$
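The pipeline formulas can be checked numerically. The following is a minimal sketch, assuming s input blocks and n pipeline steps as defined on the slide; the function names are illustrative and not part of the lecture material.

```python
from fractions import Fraction

def pipeline_speedup(s, n):
    """S_p = T_1 / T_n = (s * n) / (n + (s - 1)) for s inputs and n steps."""
    return Fraction(s * n, n + (s - 1))

def pipeline_efficiency(s, n):
    """E = S_p / n * 100, in percent."""
    return pipeline_speedup(s, n) / n * 100

if __name__ == "__main__":
    n = 10  # e.g. a 10-step pipeline (illustrative value)
    for s in (1, 10, 100, 10_000):
        sp = pipeline_speedup(s, n)
        print(f"s={s:6d}  S_p={float(sp):7.3f}  E={float(pipeline_efficiency(s, n)):6.2f}%")
```

With a single input block the pipeline gives no gain (S_p = 1); as s grows, S_p approaches n, which is the limit stated on the slide.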
2.3 pipeline vs. round-common
[Figure: output by round configuration. x-axis: number of rounds executed; y-axis: total output bits; curves: round-common scheme, pipeline scheme]
[Figure: gate requirement by round scheme. x-axis: number of rounds required; y-axis: total number of gates required (units of 10,000); curves: round-common scheme, pipeline scheme]
• round-common (blue): # of gates is round-independent
• pipeline (red): # of gates increases with the number of rounds

III. High-speed Block cipher
Round-Function for high-speed
  S-box memory speed (RAM/ROM)
  Bit-wise adder/multiplier: XOR/AND
  Word-wise modulo adder/multiplier (Mod 2^32): (+) mod 2^32, (⊙) mod 2^32
  S-P/MA/SSMA LUT (Look-Up Table)
# of Round vs. Throughput or Security
Pipeline vs. Throughput or Area
Block Ciphers: Key Elements

Block Cipher: Core Operations
  Bitwise XOR, AND, OR.
  Addition or subtraction modulo 2^n.
  Shift or rotation by a constant number of bits.
  Data-dependent rotation by a variable number of bits.
  Multiplication modulo the table entry value.
  Multiplication in the Galois field specified by the table entry value.
  Inversion modulo the table entry value.
  Look-up-table substitution.
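The core operations listed above can be sketched in software for reference. This is a minimal illustration, assuming 32-bit words and, for the Galois-field case, GF(2^8) with the AES reduction polynomial 0x11B; the function names and the toy look-up table are assumptions for illustration, not taken from the slides.

```python
MASK32 = 0xFFFFFFFF

def add32(a, b):
    """Addition modulo 2^32."""
    return (a + b) & MASK32

def sub32(a, b):
    """Subtraction modulo 2^32."""
    return (a - b) & MASK32

def mul32(a, b):
    """Multiplication modulo 2^32."""
    return (a * b) & MASK32

def rotl32(x, r):
    """Rotate a 32-bit word left by r bits (fixed rotation when r is a constant)."""
    r %= 32
    return ((x << r) | (x >> (32 - r))) & MASK32

def rotl32_data_dependent(x, y):
    """Data-dependent rotation: the amount comes from the low 5 bits of y."""
    return rotl32(x, y & 31)

def gf256_mul(a, b, poly=0x11B):
    """Multiplication in GF(2^8), shift-and-add with reduction by poly."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return result

def lut_substitute(byte, table):
    """Look-up-table substitution (S-box): one memory read per byte."""
    return table[byte]

if __name__ == "__main__":
    toy_table = [(i * 7 + 3) & 0xFF for i in range(256)]  # toy table, not a real S-box
    print(hex(add32(0xFFFFFFFF, 1)), hex(rotl32(0x80000001, 1)))
    print(hex(gf256_mul(0x57, 0x83)), hex(lut_substitute(0x10, toy_table)))
```

In hardware the XOR, shift, and rotate operations reduce to wiring or a single gate level, while the modular multiplier and the table read dominate the critical path, which matches the delay figures quoted in section 3.1.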
3.1 High-speed core components
- Memory Access Time (ns)
    Samsung Electronics DDR SRAM speed (July 2004)
- Adder & Shifter Speed (ns)
    64-bit adder speed [0.13um ASIC]: 180/372 ps
      Neve et al. (Belgium), IEEE Trans. VLSI Systems, 2004
        720 ps @ 1.1 V, 0.18um
        326 ps @ 1.1 V, 0.13um
        180 ps @ 2.5 V, 0.13um
      64-bit adder [2006]
        372 ps @ 0.18-um CMOS
        Kim JooYoung, et al., IEEE ISCAS 2006
    32-bit adder speed [SAMSUNG 0.13um ASIC]: 387 ps
      Samsung ASIC speed
- Multiplier Speed (ns)
    64-bit multiplier speed [0.35um ASIC]: 10 ns
      SEDA et al. (Europe), 2003, M.S. thesis "Design and Realization of a high-speed 64 x 64 multiplier for low power applications"
        12.8 ns @ 0.35um
      10 ns [2003-11, Joseph Gebis, UCB]
3.1 High-speed core components - Core Operation
Operation | Max. delay (ASIC)
XOR | ~10 ps
Mod 2^32 add | 180 ps
Mod 2^32 sub | 180 ps
Fixed shift | ~10 ps
Variable rotate | ~10 ps
Mod 2^64 mul | 10.0 ns
GF(2^8) mul (LUT) | ~1 ns
S-box (LUT) | ~1 ns
Notation: >>>, <<<
Operations used by each algorithm (one ● per operation used):
MARS ● ● ● ● ● ● ●
RC6 ● ● ● ● ●
Rijndael ● ● ● ●
Serpent ● ● ●
Twofish ● ● ● ● ●

3.2 Implementation examples - AES (Final, 2000)
Designed by | ASIC or FPGA | Performance [Mbps] | # Integrated [Gates] | Year | Remarks
GMU (George Mason Univ.) | FPGA 0.22um | 414 | 2,507 CLB slices | 2000 |
USC | FPGA 0.22um | 353 | 4312 | 2000 |
WPI (Worcester Polytechnic Inst.) | FPGA 0.22um | 294 | 3528 | 2000 |
MICRONIC | FPGA 0.22um | 179 | - | 2000 |
NSA | ASIC 0.5um | 606 | - | 2000 | 2-Gen. Previous
3.2 Implementation Examples in Block cipher
- AES: George Mason Univ'2000 [figures]

3.2 Implementation Examples in Block cipher
- AES pipeline [2002-2006]
Designed by | ASIC or FPGA | Performance [Gbps] | # Integrated [Gates] | Published to | Remarks
Lee, YoonKyung et al. | ASIC 0.25um | 2.56 | | CISC'2002 |
Goo, BonSeog et al. | FPGA | 20 | | WISC'2002 | 40-stage(10) pipeline
McLoone, McCanny | FPGA | 12.02 | | IEEE Syp2000 |
Hodjat et al. /UCLA | FPGA | 21.54 | 5177 CLB slices | FCCM 2004 |
Saggese et al. | FPGA | 20.3 | 5810 CLB slices | FPL 2003 |
Goo, BonSeog et al. | FPGA 0.25um | Under Gbps | 3,992 | KIISC Journal 2006.10 | RFID for low-power

- SEED
Designed by | ASIC or FPGA | Performance [Gbps] | # Integrated [Gates] | Published to | Remarks
Seo, YongHo et al. | FPGA | 2.62 | 16,770 | APASIC'2000 |
Choi, ByungYoon et al. | ASIC 0.25um | 237 | 14,110 | KICS, 2000 |
Jeon, ShinWoo et al. | ASIC 0.5um | 258.9 | 17,610 | KIISC, 2001 | Non-pipeline
Choi, HeongMug et al. /Samsung | FPGA | 35.34 | 10,610 | KIISC, 2004.10 | Smartcard, etc.
Um, SungYong et al. | FPGA Xilinx Virtex-II | 6,400 | (54,803 LUT + 2,048 buf+gates) | KISS, 2003.3 | pipeline
3.2 Implementation Examples in Block cipher
- ARIA
Designed by | ASIC or FPGA | Performance [Gbps] | # Integrated [Gates] | Published to | Remarks
NSRI Team | ASIC Hynix 0.25um | 1,871 ([AES]* 1,839) | 8,935 ([AES]* 9,088) | WISC 2003 | Non-pipeline; high-speed, low-integrated
Park, JinSeob et al. | FPGA 0.22um | 1,142 | 29,930 gates, 1,599 slices | WISC 2004 | Non-pipeline
Yoo, YeongGab et al. | Xilinx VirtexE-1600 FPGA | 437 | 1,490 slices | IEEK, 2005.4 | Non-pipeline
Jang, HwanSug et al. | 64-bit microprocessor | - | - | KIISC Journal 2006.6 | -

3.2 Implementation steps - AES = 4 steps
[Figure: c) round function]
3.2 Implementation steps - IDEA = 7 steps
3.2 Implementation steps - SEED = 8 steps
[Figures: a) block diagram, b) round functions]
IDEA structure:
- 8-round transformation (6 x 8 = 48 subkeys)
- output transformation (4 subkeys)
- key generation (16-bit, 52 subkeys)
- Multiplication-Addition (MA) structures
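The Multiplication-Addition (MA) structure and the subkey count of IDEA can be illustrated with a short sketch. It follows the published IDEA definition (multiplication modulo 2^16 + 1 with the all-zero word standing for 2^16, addition modulo 2^16); the function and variable names are illustrative and not taken from the slides.

```python
def mul_mod_65537(a, b):
    """IDEA multiplication modulo 2^16 + 1; the 16-bit value 0 represents 2^16."""
    a = a or 0x10000
    b = b or 0x10000
    return (a * b % 0x10001) & 0xFFFF  # a result of 2^16 maps back to 0

def add_mod_65536(a, b):
    """IDEA addition modulo 2^16."""
    return (a + b) & 0xFFFF

def ma_structure(p, q, k5, k6):
    """MA structure: two multiplications and two additions on 16-bit words.

    p, q are the two 16-bit inputs; k5, k6 are the 5th and 6th round subkeys.
    Returns the two 16-bit outputs that are XORed back into the data words.
    """
    t0 = mul_mod_65537(p, k5)
    t1 = add_mod_65536(q, t0)
    t2 = mul_mod_65537(t1, k6)
    t3 = add_mod_65536(t0, t2)
    return t2, t3

ROUNDS = 8
SUBKEYS_PER_ROUND = 6
OUTPUT_SUBKEYS = 4
TOTAL_SUBKEYS = ROUNDS * SUBKEYS_PER_ROUND + OUTPUT_SUBKEYS  # 48 + 4

if __name__ == "__main__":
    print(TOTAL_SUBKEYS, ma_structure(0x1234, 0x5678, 0x9ABC, 0xDEF0))
```

The two modular multiplications are chained, so in hardware the MA structure tends to set the critical path of a round; this is one reason the number of steps per round matters for speed.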
3.2 Implementation steps - ARIA = 3 steps
3.2 Implementation steps - comparisons
Parameters | DES | T-DES | IDEA | ARIA | SEED | AES
Block size | 64 | 64 | 64 | 128 | 128 | 128
Key size | 56 | 112 | 128 | 128/192/256 | 128 | 128/192/256
# rounds | 16 | 16 x 3 | 8 | 12/14/16 | 16 | 10/12/14
Roundkey size | 48x16 | 48x16, 3 | 16x52 | 128x13/128x15/128x17 | 32x2x16 | 128x11/128x13/128x15
S-BOX | 6x4, 8 | 6x4, 8 | 16-bit MA | 8x8, 4; 8x32, 4 | 8x8, 2 | 8x8
Key space | 2^56 | 2^112 | 2^128 | 2^128~256 | 2^128 | 2^128~256
Proper CPU (word size) | 8-bit | 8-bit | 16-bit | 32-bit | 32-bit | 32-bit
Year | 1975-77 | 1979 | 1990-92 | 2003 | 1997 | 2000-01
Country | USA | USA | Switzerland | KOREA | KOREA | USA
Standard | FIPS-46 | - | - | - | TTAS.KO-12.0004 | FIPS-197
3.2 Block cipher implementation analysis
3.2 Implementation steps - comparisons
- Block cipher mode vs. Performance (by KISA)
Cipher | Clock period (ns) | # of rounds (# of clocks) | # steps in 1 round | Speed (Mbps)
MARS | 5 | 32 (114) | - | 224.6
RC6 | 5 | 20 (122) | - | 209.8
RIJNDAEL | 9.7 | 10 (10) | 4 steps | 1,319.6
SERPENT | 6.3 | 32 (32) | - | 634.9
TWOFISH | 5 | 16 (128) | - | 200
SEED | 10.3 | 16 (48) | 8 steps | 258.9
ARIA | - | 12 () | 3 steps | 1,781.0
[summary]
(1) The # of steps in one round is an important factor for a high-speed version
    (SEED=8 > IDEA=7 > DES=5 > AES=4 > ARIA=3)
(2) A more important factor is the critical time of each step
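The speed column of the KISA table appears to follow from the other columns as block size / (clock period x number of clocks). The sketch below recomputes it under that assumption, with a 128-bit block for every listed cipher; the function name and the dictionary layout are illustrative.

```python
BLOCK_BITS = 128  # all ciphers in the table use a 128-bit block

def throughput_mbps(clock_period_ns, clocks_per_block, block_bits=BLOCK_BITS):
    """Throughput in Mbps: bits per nanosecond is Gbps, times 1000 gives Mbps."""
    return block_bits / (clock_period_ns * clocks_per_block) * 1000.0

# (clock period in ns, clocks per block), copied from the table
TABLE = {
    "MARS": (5.0, 114),
    "RC6": (5.0, 122),
    "RIJNDAEL": (9.7, 10),
    "SERPENT": (6.3, 32),
    "TWOFISH": (5.0, 128),
    "SEED": (10.3, 48),
}

if __name__ == "__main__":
    for name, (period, clocks) in TABLE.items():
        print(f"{name:9s} {throughput_mbps(period, clocks):8.1f} Mbps")
```

The recomputed values agree with the table to one decimal place, which supports the summary: fewer clocks per block and a shorter critical time per step both raise throughput. ARIA is omitted because the table gives no clock period for it.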
Conclusion (On Designing Block cipher)
Approaches | Descriptions | Properties
1) HW design tools
   FPGA (Field Programmable Gate Array), e.g. Xilinx Virtex/Altera FLEX | Low cost, easy design at a lab or research institute; slower than ASIC for the same layout
   ASIC (application specific IC), e.g. Samsung/Hynix/Lucent Tech. | High cost, difficult design done at a design house; about 5~10 times faster than FPGA for the same layout
2) Semiconductor layout
   0.5µm → 0.35µm → 0.22µm → 0.18µm → 0.13µm (→ 90nm → 70nm → 60nm → 50nm) | Decreasing layout → high speed, low power, downsizing
3) Pipeline approaches
   Round-common →◘→ | Single-round chip (1)
   Pipeline-structured →◘◘◘◘◘→ (partial pipeline or full pipeline) | Multiple k-round chip (k times in speed), with k = #round, #round/divisor, or #round x positive integer
   Parallel-structured →◘→ →◘→ | Single-round chip x n parallel (n times)
   Parallel pipeline →◘◘◘◘◘→ →◘◘◘◘◘→ | Multiple k-round chip x n parallel (n x k times)
4) Component approaches
   Selection of high-speed components, e.g. XOR, Mod 2^32 adder, MUL, SFT, S-P | High-speed components (security needs come first)
   Combination or optimization of high-speed components, e.g. # rounds, # steps in F-function | Optimized combination of components
     • speed is inversely proportional to #round
     • speed is inversely proportional to #step in a round
     • the faster each step, the better

IV. High performance in stream cipher
High speed components
  Higher speed LFSR
  Higher speed Nonlinear Combiner
  Higher speed Filter Function
  Higher speed Irregular Clocked Device
Parallel-Stream Cipher
Word-Based Stream Cipher
4.1 Higher speed of LFSR
- LFSR-based stream cipher
Basic components in stream cipher
  LFSR, NFSR, Nonlinear Function, etc.
LFSR (Linear Feedback Shift Register)
  Primitive polynomial → maximal period = 2^n - 1
  Ext. clock → speed control (1-bit output/clock)
  Linear output → output bits are predictable
[Figure: n-stage LFSR driven by the system clock, 1-bit output]

4.1 Higher speed of LFSR
- LFSR-based stream cipher
- Primitive polynomial: P(x) = x^5 + x^3 + x^2 + x + 1
a) Block diagram of the 5-stage LFSR (stages 0 to 4, driven by the system clock)
b) Implementation of the 5-stage LFSR (D flip-flops with Q/~Q outputs on a common system clock)
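The maximal-period claim can be checked in software for the slide's polynomial. This is a minimal sketch, assuming the Fibonacci form of the recurrence implied by P(x) = x^5 + x^3 + x^2 + x + 1, i.e. s(t+5) = s(t+3) ^ s(t+2) ^ s(t+1) ^ s(t); the function names and the seed are illustrative.

```python
def lfsr_step(state, taps):
    """One clock of a bit-serial LFSR.

    state holds s(t) .. s(t+n-1) as a tuple of bits. Returns the output bit
    s(t) and the next state, whose newest bit is the XOR of the tapped bits.
    """
    new_bit = 0
    for i in taps:
        new_bit ^= state[i]
    return state[0], state[1:] + (new_bit,)

def lfsr_period(seed, taps):
    """Number of clocks until the register returns to its seed state."""
    state = seed
    steps = 0
    while True:
        _, state = lfsr_step(state, taps)
        steps += 1
        if state == seed:
            return steps

TAPS_P5 = (0, 1, 2, 3)  # s(t+5) = s(t) ^ s(t+1) ^ s(t+2) ^ s(t+3)

if __name__ == "__main__":
    seed = (1, 0, 0, 0, 0)  # any nonzero seed works
    print(lfsr_period(seed, TAPS_P5))
```

Because the polynomial is primitive, every nonzero seed runs through all 2^5 - 1 nonzero states before repeating. The register emits one bit per clock, which is the speed limit the parallel-shifting structure addresses.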
- LFSR-based stream cipher
Parallel Stream Cipher
Basic components in parallel stream cipher
  PS-LFSR, Serial/Parallel Converter
  m-Parallel Nonlinear Functions
Parallel-Shifting LFSR (PS-LFSR)
  Crypto-degree → similar to LFSR
  Ext. clock → m-bit parallel output per clock, m times faster (2 ≤ m ≤ n)
[Ref] H.J. Lee, S.J. Moon, "Parallel Stream Cipher for Secure High-Speed Communications," Signal Processing, Vol. 82, No. 2, Feb. 2002.

- LFSR-based stream cipher
(n, m) PS-LFSR
a) n-stage LFSR (stages 0 to n-1, 1-bit output per system clock)
  Feedback connection: the original connection
b) (n, m) PS-LFSR: (m-1)-stage LBUF (stages -(m-1) to -1) followed by the n-stage PS-LFSR (stages 0 to n-1), m-bit output per system clock
  Feedback connection 1: the original combination
  Feedback connection 2: 1-bit left-shifting of the original combination
  ...
  Feedback connection m: (m-1)-bit left-shifting of the original combination
(n=40, m=8) PS-LFSR - example
a) n=40 stage LFSR (stages 0 to 39, 1-bit output per system clock)
b) (n,m)=(40,8) PS-LFSR: 8-stage LBUF + 40-stage PS-LFSR, m=8 bits output per system clock
  Feedback connection 1: s(40+t) = s(35+t) ^ s(2+t) ^ s(1+t) ^ s(t)
  Feedback connection 2: s(39+t) = s(34+t) ^ s(1+t) ^ s(t) ^ s(-1+t)
  ...
  Feedback connection 8: s(33+t) = s(28+t) ^ s(-5+t) ^ s(-6+t) ^ s(-7+t)

(n=39, m=8) PS-LFSR
  p(x) = x^39 + x^37 + x^25 + x^24 + x^22 + x^8 + x^6 + x^4 + 1
b) (n,m)=(39,8) PS-LFSR: 7-stage LBUF + 39-stage PS-LFSR, feedback 1 to feedback 8, m=8 bits output per system clock
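The idea behind the (n, m) PS-LFSR, m keystream bits per clock instead of one, can be checked with a behavioural model. The sketch below uses the 40-stage recurrence from the example, s(40+t) = s(35+t) ^ s(2+t) ^ s(1+t) ^ s(t), and compares one 8-bit parallel step with eight bit-serial clocks. It is only a software model of the input/output behaviour: it does not reproduce the LBUF wiring or the shifted feedback connections of the slide, and all names are illustrative.

```python
import random

N = 40
TAPS = (0, 1, 2, 35)  # s(t+40) = s(t) ^ s(t+1) ^ s(t+2) ^ s(t+35)

def serial_clock(state):
    """Bit-serial LFSR: one output bit and one new bit per clock."""
    new_bit = 0
    for i in TAPS:
        new_bit ^= state[i]
    return state[0], state[1:] + [new_bit]

def parallel_clock(state, m=8):
    """Parallel model: m output bits and m new bits per clock."""
    window = list(state)
    for k in range(m):
        new_bit = 0
        for i in TAPS:
            new_bit ^= window[k + i]  # same recurrence, advanced by k positions
        window.append(new_bit)
    return window[:m], window[m:]

if __name__ == "__main__":
    rng = random.Random(1)
    seed = [rng.randint(0, 1) for _ in range(N)]

    serial_state, serial_bits = list(seed), []
    for _ in range(40):                 # 40 serial clocks
        bit, serial_state = serial_clock(serial_state)
        serial_bits.append(bit)

    parallel_state, parallel_bits = list(seed), []
    for _ in range(5):                  # 5 parallel clocks of 8 bits each
        word, parallel_state = parallel_clock(parallel_state)
        parallel_bits.extend(word)

    print(serial_bits == parallel_bits, serial_state == parallel_state)
```

Both runs produce the same 40 keystream bits and end in the same state, so at an equal clock rate the parallel form delivers m = 8 times the bit rate. In this naive model the later new bits depend on earlier new bits of the same step; avoiding that dependency chain is what the dedicated feedback connections of the hardware design are for.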
m-Parallel Stream Cipher
[Figure: transmitter, channel, receiver. On each side, N LFSRs (l1-LFSR1 ... lN-LFSRN, outputs x1 ... xN) feed the nonlinear combiners f1, f2, ..., fm (memoryless or not); the plaintext passes through a serial/parallel converter, is combined with the m-bit keystream, and leaves through a parallel/serial converter as ciphertext, with the inverse path at the receiver.]

m-parallel summation generator
[Figure: m PS-LFSRs (l1-PS-LFSR1 ... lm-PS-LFSRm), each with m-bit output, feed the summation generators SUM-BSG1 ... SUM-BSGm with carries c11, c12, ..., c1M through cm1, cm2, ..., cmM, producing outputs y1 ... ym.]
Note: N = m, M1 = ... = Mm = M
Higher speed clock-controlled LFSR (compensate for down-speed)
[Figure, left: a CLOCK-CONTROL section (39-stage LFSR, function fac(t)) and a DATA GENERATION section (89-stage 4-bit parallel LFSR, stages d0 to d88, with a 3-stage LBUF (d-3, d-2, d-1)); feedback connections 1 to 4 are selected by the clock control, and the filter function fd over d0 ... d88 produces the KEYSTREAM. Signals: KEYDATA, KEYLDEN, SEL, SYSTEM CLOCK.]
  Feedback taps shown in the figure:
    d[88+t] from d[t], d[6+t], d[9+t], d[34+t], d[36+t], d[47+t], d[50+t]
    d[85+t] from d[-3+t], d[3+t], d[6+t], d[31+t], d[33+t], d[44+t], d[47+t]

Higher speed clock-controlled LFSR (compensate for down-speed)
[Figure, right: 127-LFSR and 129-LFSR outputs enter a full adder (FA: sum = S0, carry = S1); S0 and S1 select one of feedback 1 to feedback 4 at each stage multiplexer (d-3, d0, ..., d84 to d88). Signals: KEY_DATA, KEY_LDEN, SYSTEM_CLOCK.]
(n=39, m=8) PS-LFSR
  p(x) = x^39 + x^37 + x^25 + x^24 + x^22 + x^8 + x^6 + x^4 + 1
b) (n,m)=(39,8) PS-LFSR: 7-stage LBUF + 39-stage PS-LFSR, feedback 1 to feedback 8, m=8 bits output per system clock
(n=39, m=8) PS-LFSR: Graphic Design [figure]
(n=39, m=8) PS-LFSR: Simulation [figure]
(n=39, m=8) PS-LFSR: Time Delay [figure]
4.1 Higher speed of LFSR
LFSR vs. PS-LFSR
Items | 39-LFSR | (39,8) PS-LFSR
Period | 2^39 - 1 | 2^39 - 1
Processing rate @ 500 MHz system clock (max. delay = 1.73 ns) | 500 Mbps | 500 x 8 = 4 Gbps
Hardware complexity [gates] | 219 gates | 401 gates (1.83 : 1)
  D F/Fs | 39 | 46
  (2-1) MUX | 1 | 1
  XORs | 7 | 56

4.2 Word Based Stream Ciphers
- Basic components in WBSC
  Word-based Linear Feedback Shift Register (LFSR)
  Dynamic Tables
  Pseudorandom Functions
  Generic Finite State Machine (FSM)
4.2 Word Based Stream Ciphers
- Word Based LFSR
Word-based Stream Cipher (WBSC)
  Fast, word-operation (@Tx, @Rx)
  Ex) Snow, Sober, Turing, Dragon
[Figure: word-based LFSR of W-bit words feeding a nonlinear filter, W-bit word output; examples: Snow, Sober, Turing, Dragon]

4.2 Word Based Stream Ciphers
- Dragon
- Dragon
Dragon
  Word-based stream cipher: ICISC'2004
  Throughput: 23 Gbps (H/W), 3.8 Gbps (S/W)
  Cooperative results with Australia QUT-ISRC (Dr. Dawson)
  Submitted to the international standard ECRYPT-eSTREAM (phase 3)
[Figure: NLFSR, FPDS selection, F function, memory M, feedback, keystream]

- Dragon
H/W performance analysis (32-bit word)
  Step 1: 0.387 ns
  Step 2: 1.0 ns
  Step 3: 1.0 ns
  Step 4: 0.387 ns
  Total: 2.774 ns
  Performance = 64 bit / 2.774 ns = 23 Gb/s (maximal)
              = 64 bit / 4.27 ns = 15 Gb/s (normal)
  * In SAMSUNG 0.13um ASIC, 32-bit adder delay time = 0.387 ns
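The Dragon throughput figures follow directly from the step delays. A minimal sketch of the arithmetic, assuming 64 keystream bits per iteration as on the slide; the variable and function names are illustrative.

```python
STEP_DELAYS_NS = [0.387, 1.0, 1.0, 0.387]  # steps 1 to 4 from the slide
OUTPUT_BITS = 64                           # keystream bits per iteration
NORMAL_DELAY_NS = 4.27                     # "normal" case delay from the slide

def throughput_gbps(bits, delay_ns):
    """Bits per nanosecond equals Gb/s."""
    return bits / delay_ns

total_delay_ns = sum(STEP_DELAYS_NS)

if __name__ == "__main__":
    print(f"total delay   = {total_delay_ns:.3f} ns")
    print(f"maximal speed = {throughput_gbps(OUTPUT_BITS, total_delay_ns):.2f} Gb/s")
    print(f"normal speed  = {throughput_gbps(OUTPUT_BITS, NORMAL_DELAY_NS):.2f} Gb/s")
```

The two outer steps are bounded by the 32-bit adder delay (0.387 ns), so the two 1.0 ns steps account for most of the 2.774 ns iteration time.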
- Dragon
Parallel-Structured Word-based LFSR (PS-WLFSR)
  Word-based FSR (WFSR)
    Similar to the bit-based LFSR
    Feedback is shifted in units of the word size (W)

- Dragon
Implementing S-box in F function
  Fast-access SRAM memory (compiled SRAM)
  Each S-box uses a 256 x 32-bit (1 KB) SRAM
  24 S-boxes in total, so 24 KB of SRAM is required
- Dragon
(New) Dragon in Parallel-Structured (m=8) [figure]
(New) Dragon in Parallel-Structured (m=16) [figure]

4.2 Word Based Stream Ciphers
- Dragon
Performance Analysis
Items | Worst case | Typical case | Best case
Area (gate size), memory | 287,600 | 287,600 | 287,544
Area (gate size), comb. | 8,126 | 8,068 | 8,219
Area (gate size), total | 295,726 | 295,688 | 295,763
Critical path delay (ns) | 14.36 | 10.26 | 6.72
Throughput | 4.4 Gbps | 6.2 Gbps | 9.5 Gbps
Parallel throughput (m=8) | 35.2 Gbps | 49.6 Gbps | 76 Gbps
Parallel throughput (max. m=16) | 70.4 Gbps | 99.2 Gbps | 152 Gbps
※ [Note] 1) comb.: combinational logic
  2) Best/Typical/Worst case: by synthesis library environment
  3) Throughput [bps] ≒ (output bits) × speed

Conclusion
Higher speed in block cipher
  ASIC latest tech. (layout): 0.13um, 90-40nm
  Pipeline/Parallel pipeline-structured
  The fewer the steps in a round, the faster
  Selection of the faster components in each step (security is the first requirement)
Higher speed in stream cipher
  Parallel-Structured LFSR: m times faster
  Clock-controlled (compensate): 1 time
  Word-Based Stream Cipher: W times faster
