
ECE 545 Digital System Design with VHDL

Course web page:
ECE web page -> Courses -> Course web pages -> ECE 545
http://ece.gmu.edu/coursewebpages/ECE/ECE545/F10/

Kris Gaj

Research and teaching interests:
  reconfigurable computing
  computer arithmetic
  cryptography
  network security

Contact:
  The Engineering Building, room 3225
  kgaj@gmu.edu
  Office hours: Monday, 7:30-8:30 PM, Wednesday, 6:00-7:00 PM, and by appointment

ECE 545 Part of:

MS in Computer Engineering
  One of five core courses (must be passed with B or better)
  Strongly suggested for two concentration areas:
    Digital Systems Design
    Microprocessor and Embedded Systems
  Elective course in the remaining concentration areas

MS in Electrical Engineering
  Elective

Design levels: algorithmic, register-transfer, gate, transistor, layout, devices

Courses:
  ECE 545 Digital System Design with VHDL
  ECE 645 Computer Arithmetic
  ECE 681 VLSI Design for ASICs
  ECE 682 VLSI Test Concepts
  ECE 586 Digital Integrated Circuits
  ECE 680 Physical VLSI Design
  ECE 584 Semiconductor Device Fundamentals
  ECE 684 MOS Device Electronics

DIGITAL SYSTEMS DESIGN

Concentration advisors: Kris Gaj, Jens-Peter Kaps, Ken Hintz

1. ECE 545 Digital System Design with VHDL
   K. Gaj, project, FPGA design with VHDL, Aldec/Mentor Graphics, Xilinx/Altera
2. ECE 645 Computer Arithmetic
   K. Gaj, project, FPGA design with VHDL or Verilog, Aldec/Mentor Graphics, Xilinx/Altera
3. ECE 681 VLSI Design for ASICs
   N. Klimavicz, project/lab, back-end ASIC design with Synopsys tools
4. ECE 586 Digital Integrated Circuits
   D. Ioannou, R. Mulpuri
5. ECE 682 VLSI Test Concepts
   T. Storey

Grading Scheme
  Homework       10%
  Project        40%
  Midterm Exam   20%
  Final Exam     30%

Midterm exam
  2 hours 30 minutes
  in class
  design-oriented
  open-books, open-notes
  practice exams will be available on the web
  Tentative date: Monday, November 1st

Final exam
  2 hours 45 minutes
  in class
  design-oriented
  open-books, open-notes
  practice exams will be available on the web
  Date: Monday, December 20, 7:30-10:15pm

Project
  individual
  semester-long
  related to the research project conducted by Cryptographic Engineering Research Group (CERG) at GMU
  supporting NIST (National Institute of Standards and Technology) in the evaluation of candidates for a new cryptographic standard

Hash Function

m (message, arbitrary length) -> hash function h -> h(m) (hash value, fixed length)

Background
It is computationally infeasible to find two distinct messages m and m' such that h(m) = h(m')
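The arbitrary-length-in, fixed-length-out property can be observed with any standard hash function. The sketch below uses SHA-256 from Python's standard `hashlib` module as a stand-in; the helper name `digest_bits` and the sample messages are illustrative and do not come from the slides.

```python
import hashlib

def digest_bits(message: bytes) -> int:
    """Return the length, in bits, of the SHA-256 hash value of `message`."""
    return len(hashlib.sha256(message).digest()) * 8

short_message = b"abc"
long_message = b"a" * 1_000_000   # one million bytes

# Both hash values have the same fixed length, regardless of message length.
print(digest_bits(short_message), digest_bits(long_message))

# The hash value itself, printed in hexadecimal.
print(hashlib.sha256(short_message).hexdigest())
```

Finding two different messages with the same hash value would amount to a collision, which is the event the Background statement describes as computationally infeasible.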

Main Application: Digital Signature

Signature
  HANDWRITTEN
  DIGITAL: A6E3891F2939E38C745B 25289896CA345BEF5349 245CBA653448E349EA47

Main Goals:
  unique identification
  proof of agreement to the contents of the document

Typical Digital Signature Scheme

Alice:
  Message -> Hash function -> Hash value -> Public key cipher (Alice's private key) -> Signature
  Alice sends Message and Signature to Bob
Bob:
  Message -> Hash function -> Hash value 1
  Signature -> Public key cipher (Alice's public key) -> Hash value 2
  Hash value 1 = Hash value 2? yes / no
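The hash-then-sign flow can be sketched in a few lines. This is a toy illustration only: it assumes textbook RSA as the public key cipher, with a hypothetical key built from two tiny primes, and it reduces the hash value modulo that tiny modulus. None of these parameters come from the slides, and nothing here is secure.

```python
import hashlib

# Hypothetical toy RSA key, far too small for real use.
P, Q = 61, 53
N = P * Q                            # public modulus
E = 17                               # Alice's public exponent
D = pow(E, -1, (P - 1) * (Q - 1))    # Alice's private exponent (Python 3.8+)

def hash_value(message: bytes) -> int:
    """Hash the message, then reduce the value so it fits the toy modulus."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % N

def sign(message: bytes) -> int:
    """Alice: apply the public key cipher with her private key to the hash value."""
    return pow(hash_value(message), D, N)

def verify(message: bytes, signature: int) -> bool:
    """Bob: compare the recomputed hash value with the one recovered from the signature."""
    hash_value_1 = hash_value(message)
    hash_value_2 = pow(signature, E, N)
    return hash_value_1 == hash_value_2

message = b"I agree to the contents of this document"
signature = sign(message)
print(verify(message, signature))             # matching hash values
print(verify(message, (signature + 1) % N))   # altered signature
```

Because the signature is computed from the hash value, it is a function of the whole document, which is the property contrasted with handwritten signatures in the slides.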

Handwritten and Digital Signatures

Common Features (handwritten signature and digital signature)
  1. Unique
  2. Impossible to be forged
  3. Impossible to be denied by the author
  4. Easy to verify by an independent judge
  5. Easy to generate

Handwritten and Digital Signatures

Differences
  Handwritten signature
    6. Associated physically with the document
    7. Almost identical for all documents
    8. Usually at the last page
  Digital signature
    6. Can be stored and transmitted independently of the document
    7. Function of the document
    8. Covers the entire document

Hash function algorithms

Customized (dedicated)
  MD2: Rivest 1988
  MD4: Rivest 1990
  MD5: Rivest 1990
  SHA-0: NSA, 1992
  SHA-1: NSA, 1995
  RIPEMD: European RACE Integrity Primitives Evaluation Project, 1992
  RIPEMD-160
  SHA-256, SHA-384, SHA-512: NSA, 2000

Based on block ciphers
  MDC-2, MDC-4: IBM, Brachtl, Meyer, Schilling, 1988

Based on modular arithmetic
  MASH-1: 1988-1996

Attacks against dedicated hash functions known by 2004
  MD2: partially broken
  MD4: broken, H. Dobbertin, 1995 (one hour on PC, 20 free bytes at the start of the message)
  MD5: partially broken, collisions for the compression function, Dobbertin, 1996 (10 hours on PC)
  SHA-0: weakness discovered, 1995 NSA, 1998 France
  SHA-1
  RIPEMD: reduced round version broken, Dobbertin 1995
  RIPEMD-160
  SHA-256, SHA-384, SHA-512

What was discovered in 2004-2005?
  MD4: broken; Wang, Feng, Lai, Yu, Crypto 2004 (manually, without using a computer)
  MD5: broken; Wang, Feng, Lai, Yu, Crypto 2004 (1 hr on a PC)
  SHA-0: attack with 2^40 operations, Crypto 2004
  SHA-1: attack with 2^63 operations, Wang, Yin, Yu, Aug 2005
  RIPEMD: broken; Wang, Feng, Lai, Yu, Crypto 2004 (manually, without using a computer)
  RIPEMD-160
  SHA-256, SHA-384, SHA-512

2^63 operations (Schneier, 2005)
  In hardware: Machine similar to the one used to break DES:
    Cost = $50,000-$70,000, Time: 18 days
    or
    Cost = $0.9-$1.26M, Time: 24 hours
  In software: Computer network similar to distributed.net used to break DES (~331,252 computers):
    Cost = ~$0, Time: 7 months

Cryptographic Standards

National Security Agency (also known as No Such Agency or Never Say Anything)
  Created in 1952 by President Truman
  Goals:
    designing strong ciphers (to protect U.S. communications)
    breaking ciphers (to listen to non-U.S. communications)
  Budget and number of employees kept secret
  Largest employer of mathematicians in the world
  Largest purchaser of computer hardware

So how have cryptographic standards been created so far?

NSA-developed Cryptographic Standards
  Block Ciphers: DES (Data Encryption Standard), 1977; Triple DES, 1999
  Hash Functions: SHA-0, 1993; SHA-1 (Secure Hash Algorithm), 1995; SHA-2, 2003
  [Timeline figure, 1970 to 2010; a further mark at 2005 is not labeled in the extracted text]

Cryptographic Standard Contests
  AES: IX.1997 - X.2000, 15 block ciphers, 1 winner
  NESSIE: I.2000 - XII.2002
  CRYPTREC
  eSTREAM: XI.2004 - V.2008, 34 stream ciphers, 4 SW + 4 HW winners
  SHA-3: X.2007 - XII.2012, 51 hash functions, 1 winner
  [Timeline figure, 1996 to 2012]

SHA-3 Contest - NIST Evaluation Criteria
  Security
  Software Efficiency
  Hardware Efficiency (ASICs, FPGAs)
  Flexibility
  Simplicity
  Licensing

Software or hardware?
  SOFTWARE
    security of data during transmission
    low cost
    flexibility (new cryptoalgorithms, protection against new attacks)
  HARDWARE
    speed
    random key generation
    access control to keys
    resistance to side-channel attacks
    tamper resistance

Primary efficiency indicators
  Software: Speed, Memory
  Hardware: Speed, Area
  Power consumption

Efficiency parameters
  Latency: time to encrypt/decrypt a single block of data
    Mi -> Encryption/decryption -> Ci
  Throughput = Speed: number of bits encrypted/decrypted in a unit of time
    Mi+2, Mi+1, Mi -> Encryption/decryption -> Ci+2, Ci+1, Ci

$$\text{Throughput} = \frac{\text{Block\_size} \cdot \text{Number\_of\_blocks\_processed\_simultaneously}}{\text{Latency}}$$
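As a quick numeric check of the throughput formula, the sketch below evaluates it for one made-up configuration. The function name, the choice of nanoseconds for latency, and the example numbers are assumptions for illustration, not values from the slides.

```python
def throughput_mbit_per_s(block_size_bits: int,
                          blocks_processed_simultaneously: int,
                          latency_ns: float) -> float:
    """Throughput = Block_size * Number_of_blocks_processed_simultaneously / Latency.

    bits per nanosecond equals Gbit/s, so multiply by 1000 to report Mbit/s.
    """
    return block_size_bits * blocks_processed_simultaneously / latency_ns * 1000.0

# Hypothetical example: one 128-bit block at a time, 320 ns per block.
print(throughput_mbit_per_s(128, 1, 320.0))

# Processing two blocks simultaneously (e.g., in a pipeline) with the same latency
# doubles the throughput.
print(throughput_mbit_per_s(128, 2, 320.0))
```

The second call shows why latency and throughput are separate parameters: the time to process a single block is unchanged, yet the number of bits processed per unit of time doubles.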

Advanced Encryption Standard (AES) Contest 1997-2001
  June 1998: 15 Candidates from USA, Canada, Belgium, France, Germany, Norway, UK, Israel, Korea, Japan, Australia, Costa Rica
  Round 1: Security, Software efficiency, Flexibility
  August 1999: 5 final candidates: Mars, RC6, Rijndael, Serpent, Twofish
  Round 2: Security, Hardware efficiency
  October 2000: 1 winner: Rijndael (Belgium)

Speed of the final AES candidates in Xilinx FPGAs
K. Gaj, P. Chodowiec, AES3, April 2000
[Bar chart: Speed [Mbit/s], scale 0 to 500, for Serpent, Rijndael, Twofish, RC6, Mars]

Survey filled by 167 participants of the Third AES Conference, April 2000
[Bar chart: # votes, scale 0 to 100, for Rijndael, Serpent, Twofish, RC6, Mars]

Results of the NSA group, ASICs
AES3, April 2000
Speed [Mbit/s]
              NSA ASIC   GMU FPGA
  Rijndael       606        414
  Serpent        202        431
  Twofish        105        177
  RC6            103        143
  Mars            57         61

Efficiency in software: NIST-specified platform
200 MHz Pentium Pro, Borland C++
[Bar chart: Speed [Mbits/s], scale 0 to 30, for Rijndael, RC6, Twofish, Mars, Serpent with 128-bit, 192-bit, and 256-bit keys]

NIST Report: Security
AES Final Report, October 2000
[Figure: Security (Adequate to High) vs. Complexity (Simple to Complex); Serpent, MARS, and Twofish are plotted toward High, Rijndael and RC6 toward Adequate]

NIST SHA-3 Contest - Timeline
  Oct. 2008: 51 candidates
  Round 1
  July 2009: 14
  Round 2
  End of 2010: 5-6
  Round 3
  Mid 2012: 1-2

GMU Team Goals
  Fair and comprehensive methodology for evaluation of hardware performance in FPGAs
  High-speed fully autonomous implementations of all 14 SHA-3 candidates & SHA-2
    256-bit & 512-bit variants
    optimized for the maximum throughput to area ratio
  Open-source benchmarking tool supporting optimization of tool options and efficient generation of results for multiple FPGA families

Primary Designers of GMU Codes
  Ekawat Homsirikamol a.k.a. "Ice"
  Marcin Rogawski

Methodology
  Developed optimized VHDL implementations of 14 Round 2 SHA-3 candidates + SHA-2 in two variants each (256 & 512-bit output), for some functions using several alternative architectures

Comprehensive Evaluation
  two major vendors: Altera and Xilinx (~90% of the market)
  multiple high-performance and low-cost families

                Altera                          Xilinx
  Technology    Low-cost      High-performance  Low-cost    High-performance
  90 nm         Cyclone II    Stratix II        Spartan 3   Virtex 4
  65 nm         Cyclone III   Stratix III                   Virtex 5

Uniform Evaluation
  Language: VHDL
  Tools: FPGA vendor tools
  Interface
  Performance Metrics
  Design Methodology
  Benchmarking

Why Interface Matters?
  Pin limit
    Total number of i/o ports <= Total number of an FPGA i/o pins
  Support for the maximum throughput
    Time to load the next message block <= Time to process previous block

Interface: Two possible solutions
  [Figure: input stream of msg_bitlen, message, and zero_word entering the SHA core, and an alternative with an end_of_msg port]
  Length of the message communicated at the beginning
    + easy to implement passive source circuit
    - area overhead for the counter of message bits
  Dedicated end of message port
    - more intelligent source circuit required
    + no need for internal message bit counter
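The two interface conditions (pin limit and support for the maximum throughput) can be checked with simple arithmetic. The sketch below assumes that one bus word is loaded per I/O clock cycle and that times are counted in core clock cycles. The function names, the `io_clk_ratio` parameter, and all example numbers are hypothetical.

```python
import math

def load_cycles(block_bits: int, bus_width_bits: int) -> int:
    """I/O clock cycles needed to load one message block over a w-bit bus."""
    return math.ceil(block_bits / bus_width_bits)

def fits_pin_limit(io_ports: int, fpga_io_pins: int) -> bool:
    """Pin limit: total number of i/o ports must not exceed the FPGA i/o pins."""
    return io_ports <= fpga_io_pins

def supports_max_throughput(block_bits: int, bus_width_bits: int,
                            process_cycles: int, io_clk_ratio: float = 1.0) -> bool:
    """Time to load the next block must not exceed the time to process the previous one.

    io_clk_ratio is the I/O clock frequency divided by the core clock frequency.
    """
    load_time = load_cycles(block_bits, bus_width_bits) / io_clk_ratio
    return load_time <= process_cycles

# Hypothetical core: 512-bit block, 64-bit bus, 65 cycles per block.
print(supports_max_throughput(512, 64, 65))
# Hypothetical core: 1024-bit block, 32-bit bus, 24 cycles per block.
print(supports_max_throughput(1024, 32, 24))
# Same core with an I/O clock twice as fast as the core clock.
print(supports_max_throughput(1024, 32, 24, io_clk_ratio=2.0))
```

The last two calls illustrate why some functions may need a faster input/output clock: a narrow bus combined with a short processing time leaves the core waiting for data unless the input is loaded at a higher rate.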

SHA Core: Interface & Typical Configuration
[Figure: Input FIFO -> SHA core -> Output FIFO, all driven by clk and rst; data buses are w bits wide.
  Input FIFO: din, full, write on the external side (ext_idata, fifoin_full, fifoin_write); dout, empty, read on the core side (idata, fifoin_empty, fifoin_read)
  SHA core: din, src_ready, src_read on the input side; dout, dst_ready, dst_write on the output side
  Output FIFO: din, full, write on the core side (odata, fifoout_full, fifoout_write); dout, empty, read on the external side (ext_odata, fifoout_empty, fifoout_read)]

  SHA core is an active component; surrounding FIFOs are passive and widely available
  Input interface is separate from an output interface
  Processing a current block, reading the next block, and storing a result for the previous message can be all done in parallel

SHA Core: Interface & Typical Configuration
[Figure: the same configuration, with the external sides of the FIFOs driven by a separate io_clk]

  Some functions may require a faster input/output clock in order to load input data at a faster rate

Performance Metrics

Primary
  1. Throughput (single long message)
  2. Area
  3. Throughput / Area

Secondary
  4. Hash Time for Short Messages (up to 1000 bits)

Performance Metrics - Area


We force these vectors to look as follows through the synthesis and implementation options:

  (Area, 0, 0)

that is, only the first component of each resource utilization vector is nonzero.

Choice of Optimization Target

Primary Optimization Target: Throughput to Area Ratio
Features:
  practical: good balance between speed and cost
  very reliable guide through the entire design process, facilitating the choice of
    high-level architecture
    implementation of basic components
    choice of tool options
  leads to high-speed, close-to-maximum-throughput designs

Our Design Flow
  Specification, Interface, Controller Template, Library of Basic Components
  -> Datapath: Block diagram; Controller: ASM Chart
  -> VHDL Code; Formulas for Throughput & Hash time
  -> Max. Clock Freq., Resource Utilization
  -> Throughput, Area, Throughput/Area, Hash Time for Short Messages

Basic Operations of 14 SHA-3 Candidates
[Table not recovered]
Legend: NTT - Number Theoretic Transform, GF MUL - Galois Field multiplication, MUL - integer multiplication, mADDn - multioperand addition with n operands

ATHENa - Automated Tool for Hardware EvaluatioN
http://cryptography.gmu.edu/athena

Benchmarking open-source tool, written in Perl, aimed at an AUTOMATED generation of OPTIMIZED results for MULTIPLE FPGA platforms
Under development at George Mason University.

Basic Dataflow of ATHENa
[Figure: numbered dataflow (steps 0 to 6) among Designer, User, ATHENa Server, and FPGA Synthesis and Implementation. Labels: Interfaces + Testbenches; HDL + scripts + configuration files; HDL + FPGA Tools; Download scripts and configuration files; Result Summary + Database Entries; Database Entries; Database query; Ranking of designs]

[Figure: ATHENa inputs and outputs.
  Inputs: synthesizable source files, configuration files, constraint files, testbench
  Outputs: result summary (user-friendly), database entries (machine-friendly)]

ATHENa Major Features (1)
  synthesis, implementation, and timing analysis in batch mode
  support for devices and tools of multiple FPGA vendors
  generation of results for multiple families of FPGAs of a given vendor
  automated choice of a best-matching device within a given family

ATHENa Major Features (2)
  automated verification of designs through simulation in batch mode
  support for multi-core processing
  automated extraction and tabulation of results
  several optimization strategies aimed at finding
    optimum options of tools
    best target clock frequency
    best starting point of placement

Generation of Results Facilitated by ATHENa
  batch mode of FPGA tools
  ease of extraction and tabulation of results
    Excel, CSV (available), LaTeX (coming soon)
  optimized choice of tool options

Relative Improvement of Results from Using ATHENa
Virtex 5, 256-bit Variants of Hash Functions
[Bar chart: Area, Thr, and Thr/Area ratios, scale 0 to 2.5, for Groestl, SHAvite-3, Luffa, Keccak, Hamsi, ECHO, Skein, Fugue, SHA-2, BMW, CubeHash, BLAKE, Shabal, SIMD, JH; values not recoverable]
Ratios of results obtained using ATHENa suggested options vs. default options of FPGA tools

Results

Throughput [Mbit/s]
Virtex 5, 256-bit variants of algorithms
[Bar chart, scale 0 to 16000 Mbit/s; values not recoverable]

Throughput [Mbit/s]
Virtex 5, 512-bit variants of algorithms
[Bar chart, scale 0 to 14000 Mbit/s; values not recoverable]

Normalization & Compression of Results

Absolute result
  e.g., throughput in Mbits/s, area in CLB slices

Normalized result

$$\text{normalized\_result} = \frac{\text{result\_for\_SHA-3\_candidate}}{\text{result\_for\_SHA-2}}$$

Overall normalized result
  Geometric mean of normalized results for all investigated FPGA families

Normalized Throughput & Overall Normalized Throughput
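The normalization and compression steps can be written out directly. The sketch below divides each candidate result by the SHA-2 result for the same FPGA family and then takes the geometric mean across families. The function names and the sample throughput figures are made up for illustration.

```python
import math

def normalized_result(result_for_candidate: float, result_for_sha2: float) -> float:
    """Normalized result = result for SHA-3 candidate / result for SHA-2."""
    return result_for_candidate / result_for_sha2

def overall_normalized_result(candidate_results, sha2_results) -> float:
    """Geometric mean of the normalized results over all FPGA families."""
    ratios = [normalized_result(c, s) for c, s in zip(candidate_results, sha2_results)]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Hypothetical throughputs [Mbit/s] of one candidate and of SHA-2 on three families.
candidate = [4000.0, 6000.0, 9000.0]
sha2 = [1000.0, 1500.0, 1000.0]
print(overall_normalized_result(candidate, sha2))
```

A geometric mean suits ratios because a family where the candidate is twice as fast and a family where it is half as fast cancel out to 1.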

Overall Normalized Throughput: 256-bit variants of algorithms
Normalized to SHA-256, Averaged over 7 FPGA families
[Bar chart, scale 0 to 8; values not recoverable]

Overall Normalized Throughput: 512-bit variants of algorithms
Normalized to SHA-512, Averaged over 7 FPGA families
[Bar chart, scale 0 to 4; values not recoverable]

Area [CLB slices]
Virtex 5, 256-bit variants of algorithms
[Bar chart, scale 0 to 10000 CLB slices; values not recoverable]

Area [CLB slices]
Virtex 5, 512-bit variants of algorithms
[Bar chart, scale 0 to 18000 CLB slices; values not recoverable]

Overall Normalized Area: 256-bit variants of algorithms
Normalized to SHA-256, Averaged over 7 FPGA families
[Bar chart, scale 0 to 30; values not recoverable]

Overall Normalized Area: 512-bit variants of algorithms
Normalized to SHA-512, Averaged over 7 FPGA families
[Bar chart, scale 0 to 30; values not recoverable]

Overall Normalized Throughput/Area: 256-bit variants
Normalized to SHA-256, Averaged over 7 FPGA families
[Bar chart, scale 0 to 2; values not recoverable]

Overall Normalized Throughput/Area: 512-bit variants
Normalized to SHA-512, Averaged over 7 FPGA families
[Bar chart, scale 0 to 1.4; values not recoverable]

Throughput vs. Area Normalized to Results for SHA-256 and Averaged over 7 FPGA Families
256-bit variants
[Scatter plot with "best" and "worst" corners marked; values not recoverable]

Throughput vs. Area Normalized to Results for SHA-512 and Averaged over 7 FPGA Families
512-bit variants
[Scatter plot with "best" and "worst" corners marked; values not recoverable]

Execution Time for Short Messages up to 1000 bits
Virtex 5, 256-bit variants of algorithms
[Chart not recovered]

Execution Time for Short Messages up to 1000 bits
Virtex 5, 512-bit variants of algorithms
[Chart not recovered]

Summary of Results
[Table: for the 256-bit and 512-bit variants, columns Thr/Area, Thr, Area, Short msg.; rows BLAKE, BMW, CubeHash, ECHO, Fugue, Groestl, Hamsi, JH, Keccak, Luffa, Shabal, SHAvite-3, SIMD, Skein; cell contents not recoverable]
  Throughput/Area & Throughput most crucial for high-speed implementations
  Area cannot be easily traded for Throughput
  Best performers so far:
    1-2. Keccak & Luffa
    3. Groestl
  Worst performers so far:
    14. SIMD
    13. ECHO
    12. BMW

More About our Designs & Tools
  Cryptology e-Print Archive - 2010/445 (100+ pages)
    Detailed hierarchical block diagrams
    Corresponding formulas for execution time and throughput
  FPL 2010 paper
    ATHENa features
    Case studies
  ATHENa web site
    Most recent results
    Comparisons with results from other groups
    Optimum options of tools

Comparison with Other Groups

Comparison with Best Results Reported by Other Groups
Virtex 5, 256-bit variants of algorithms
(Area in CLB slices, Thr in Mbit/s)

                     OTHER GROUPS                                GMU
                     Area    Thr     Thr/Area  Source            Area    Thr     Thr/Area
  BLAKE              1660    2676    1.61      Kobayashi et al.  1871    2854    1.53
  CubeHash           590     2960    5.02      Kobayashi et al.  707     3445    4.87
  ECHO               9333    14860   1.59      Lu et al.         5445    13875   2.55
  Groestl            1722    10276   5.97      Gauravaram et al. 1884    8677    4.61
  Hamsi              718     1680    2.34      Kobayashi et al.  946     2646    2.80
  Keccak             1412    6900    4.89      Bertoni et al.    1229    10807   8.79
  Luffa              1048    6343    6.05      Kobayashi et al.  1154    8008    6.94
  Shabal             153     2051    13.41     Detrey et al.     1266    2624    2.07
  Skein (estimated)  1632    3535    2.17      Tillich           1463    2812    1.92

Best Overall Reported Results as of Aug. 6, 2010
Virtex 5, 256-bit variants of algorithms
(Area in CLB slices, Thr in Mbit/s)

  BEST REPORTED RESULTS
               Area    Thr     Thr/Area  Source
  BLAKE        1660    2676    1.61      Kobayashi et al.
  BMW          4400    5577    1.27      GMU
  CubeHash     590     2960    5.02      Kobayashi et al.
  ECHO         5445    13875   2.55      GMU
  Fugue        956     3151    3.30      GMU
  Groestl      1722    10276   5.97      Gauravaram et al.
  Hamsi        946     2646    2.80      GMU
  JH           1108    3955    3.57      GMU
  Keccak       1229    10807   8.79      GMU
  Luffa        1154    8008    6.94      GMU
  Shabal       153     2051    13.41     Detrey et al.
  SHAvite-3    1130    2887    2.55      GMU
  SIMD         9288    2326    0.25      GMU
  Skein        1632    3535    2.17      Tillich et al.
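The Thr/Area column is simply throughput divided by area, so it can be recomputed and used to rank designs. The sketch below does this for five rows of the best reported results (area in CLB slices, throughput in Mbit/s). The dictionary layout and function name are illustrative choices.

```python
# (area in CLB slices, throughput in Mbit/s), copied from the best reported results.
BEST_REPORTED = {
    "Groestl": (1722, 10276),
    "Keccak": (1229, 10807),
    "Luffa": (1154, 8008),
    "Shabal": (153, 2051),
    "SIMD": (9288, 2326),
}

def throughput_to_area(area: int, throughput: int) -> float:
    """Throughput to area ratio, rounded to two decimals as in the table."""
    return round(throughput / area, 2)

ratios = {name: throughput_to_area(area, thr)
          for name, (area, thr) in BEST_REPORTED.items()}

# Rank from the highest to the lowest throughput to area ratio.
ranking = sorted(ratios, key=ratios.get, reverse=True)
for name in ranking:
    print(f"{name:8s} {ratios[name]:6.2f}")
```

Ranking by this ratio rewards small designs as much as fast ones, which is why a compact design can outrank a design with several times its throughput.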

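Throughput vs. Area: Best reported results
Virtex 5, 256-bit variants of algorithms
[Scatter plot with "best" and "worst" corners marked; values not recoverable]

Your Project

Analysis of Alternative Architectures - Unrolled
[Figure: the Basic architecture iterates r times; the unrolled architecture iterates r/2 times]

Analysis of Alternative Architectures - Folded
[Figure: the Basic architecture iterates r times; the Folded Vertically-2x (fv2) and Folded Horizontally-2x (fh2) architectures each iterate 2r times]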
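The iteration counts for the basic, unrolled, and folded architectures translate into a first-order throughput estimate. The sketch below assumes one iteration per clock cycle and ignores any I/O or finalization cycles. The function names, the example round count, block size, and clock frequencies are hypothetical. The achievable clock frequency normally differs between architectures, so it is supplied per design.

```python
import math

def cycles_per_block(rounds: int, kind: str = "basic", factor: int = 1) -> int:
    """Clock cycles to process one block, assuming one iteration per cycle."""
    if kind == "basic":
        return rounds                          # r times
    if kind == "unrolled":
        return math.ceil(rounds / factor)      # r/2 times for factor 2
    if kind == "folded":
        return rounds * factor                 # 2r times for factor 2
    raise ValueError(f"unknown architecture kind: {kind}")

def throughput_mbit_s(block_bits: int, clk_mhz: float, rounds: int,
                      kind: str = "basic", factor: int = 1) -> float:
    """bits * MHz / cycles gives Mbit/s."""
    return block_bits * clk_mhz / cycles_per_block(rounds, kind, factor)

# Hypothetical function: 512-bit block, r = 10 rounds.
print(throughput_mbit_s(512, 200.0, 10))                 # basic
print(throughput_mbit_s(512, 110.0, 10, "unrolled", 2))  # longer critical path
print(throughput_mbit_s(512, 300.0, 10, "folded", 2))    # shorter critical path
```

Whether unrolling or folding pays off depends on how clock frequency and area change together, which is why the comparison is made on throughput to area ratio rather than on cycle counts alone.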

Preliminary results for CubeHash, Groestl, Keccak & Luffa in Virtex 5
[Scatter plot: Normalized Throughput vs. Normalized Area, both on a scale of 0 to 5, for architectures labeled x1, x2, x4, fv2, fv3, fv4 of CubeHash, Groestl, Keccak, and Luffa; point positions not recoverable]

Your Project
14 SHA-3 candidates left in the contest
Given:
  specification of the function
  reference implementation in C
  interface
  testbench and test vectors
  GMU implementation of the basic version including
    block diagrams
    ASM charts
    short description
    formulas for execution time & throughput
    source codes
    results for Xilinx and Altera FPGAs

Your Project
Develop:
  Block diagram
  ASM chart
  Formulas for execution time & throughput
  Synthesizable code in VHDL
  Results for multiple families of FPGAs from Xilinx and Altera
for at least one architecture from each of the following three classes of architectures:
  Unrolled architecture
  Folded architecture
  Architecture based on the use of embedded FPGA resources (BRAMs, multipliers, DSP units, etc.)
[256 bit only, 512-bit only, or both]

What is an FPGA?
  Configurable Logic Blocks
  I/O Blocks
  Block RAMs & Embedded Multipliers
[Figure: FPGA layout with columns of Block RAMs and MULs]

RAM Blocks and Multipliers in Xilinx FPGAs
[Figure from The Design Warrior's Guide to FPGAs: Devices, Tools, and Flows. ISBN 0750676043. Copyright 2004 Mentor Graphics Corp. (www.mentor.com)]

Using Embedded FPGA Resources
  Basic design: (1536, 0, 0)
  Your design:  (768, 2, 4)

  Basic design: (3010, 0, 0)
  Your design:  (1505, 32 kbit, 4)

Block RAM
Spartan-3 Dual-Port Block RAM
[Figure: Block RAM with Port A and Port B]
  Most efficient memory implementation
    Dedicated blocks of memory
  Ideal for most memory requirements
    4 to 104 memory blocks
    18 kbits = 18,432 bits per block (16 k without parity bits)
    Use multiple blocks for larger memories
  Builds both single and true dual-port RAMs
  Synchronous write and read (different from distributed RAM)

Block RAM can have various configurations (port aspect ratios)
  16k x 1          addresses 0 to 16,383
  8k x 2           addresses 0 to 8,191
  4k x 4           addresses 0 to 4,095
  2k x (8+1)       addresses 0 to 2,047
  1024 x (16+2)    addresses 0 to 1,023

Dual-Port Bus Flexibility
  Port A: WEA, ENA, RSTA, CLKA, ADDRA[9:0], DIA[17:0], DOA[17:0]
    Port A In: 1K-Bit Depth; Port A Out: 18-Bit Width
  Port B: WEB, ENB, RSTB, CLKB, ADDRB[9:0], DIB[17:0], DOB[17:0]
    Port B In: 1k-Bit Depth; Port B Out: 18-Bit Width

Embedded Multipliers in Spartan 3
  18x18 bit signed multipliers with optional input/output registers
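The listed port aspect ratios all follow from the same 16 k data bits per block, with parity bits available only in the wider configurations. The short sketch below recomputes depth, address width, and total bits for each configuration. The names are illustrative.

```python
DATA_BITS_PER_BLOCK = 16 * 1024   # 16 k without parity bits

# (data width, parity width) for each port aspect ratio listed above.
CONFIGURATIONS = [(1, 0), (2, 0), (4, 0), (8, 1), (16, 2)]

def describe(data_width: int, parity_width: int):
    """Return (depth, address bits, total bits) for one aspect ratio."""
    depth = DATA_BITS_PER_BLOCK // data_width
    address_bits = depth.bit_length() - 1          # depth is a power of two
    total_bits = depth * (data_width + parity_width)
    return depth, address_bits, total_bits

for data_width, parity_width in CONFIGURATIONS:
    depth, address_bits, total_bits = describe(data_width, parity_width)
    print(f"{depth:6d} x ({data_width}+{parity_width})  "
          f"address bits: {address_bits:2d}  total bits: {total_bits}")
```

For the 1024 x (16+2) configuration this gives 10 address bits and an 18-bit data port, matching ADDRA[9:0] and DIA[17:0] in the dual-port figure, and a total of 18,432 bits.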

Multiplier-Accumulator - MAC
[Figure from The Design Warrior's Guide to FPGAs: Devices, Tools, and Flows. ISBN 0750676043. Copyright 2004 Mentor Graphics Corp. (www.mentor.com)]

Xilinx XtremeDSP
  Starting with Virtex 4 family, Xilinx introduced DSP48 block for high-speed DSP on FPGAs
  Essentially a multiply-accumulate core with many other features
  Now also Spartan-3A and Virtex 5 have DSP blocks

DSP48 Slice: Virtex 4
[Figure not recovered]

Simplified Form of DSP48
[Figure not recovered]
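A multiply-accumulate operation can be modeled behaviorally in a few lines. The sketch below assumes 18-bit signed operands and a 48-bit two's complement accumulator that wraps on overflow. These widths are taken as an assumption about a DSP48-style block, and the model ignores pipelining and every other feature of the real primitive.

```python
def to_signed(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def mac(acc: int, a: int, b: int, in_bits: int = 18, acc_bits: int = 48) -> int:
    """One multiply-accumulate step: acc + a * b, wrapped to acc_bits."""
    low, high = -(1 << (in_bits - 1)), (1 << (in_bits - 1)) - 1
    if not (low <= a <= high and low <= b <= high):
        raise ValueError("operand does not fit the signed input width")
    return to_signed(acc + a * b, acc_bits)

# Dot product of two short vectors, one MAC step per element pair.
acc = 0
for a, b in zip([1, 2, 3], [4, 5, 6]):
    acc = mac(acc, a, b)
print(acc)
```

A model like this is mainly useful for generating expected values for a testbench, where the wrap-around behavior of the accumulator has to match the hardware.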

Xilinx FPGA Devices
  Technology: 120/150 nm, 90 nm, 65 nm, 45 nm, 40 nm
  Low-cost: Spartan 3 (90 nm), Spartan 6 (45 nm)
  High-performance: Virtex 2, 2 Pro (120/150 nm), Virtex 4 (90 nm), Virtex 5 (65 nm), Virtex 6 (40 nm)

Altera FPGA Devices
  Technology: 130 nm, 90 nm, 65 nm, 40 nm
  Low-cost: Cyclone, Cyclone II, Cyclone III, Cyclone IV
  Mid-range: Arria I, Arria II
  High-performance: Stratix, Stratix II, Stratix III, Stratix IV

All Projects - Organization
  Projects divided into phases
  Deliverables for each phase submitted through Blackboard at selected checkpoints and evaluated by the instructor and/or TA
  Feedback provided to students on a best effort basis
  Final report and codes submitted using Blackboard at the end of the semester

Honor Code Rules
  All students are expected to write and debug their codes individually
  Students are encouraged to help and support each other in all problems related to the
    - operation of the CAD tools,
    - basic understanding of the problem.

Course Objectives
At the end of this course you should be able to:
  Code in VHDL for synthesis
  Decompose a digital system into a controller (FSM) and datapath, and code accordingly
  Write VHDL testbenches
  Synthesize and implement digital systems on FPGAs
  Effectively code digital systems for cryptography, signal processing, and microprocessor applications

This knowledge will come about through homework, exams, and an extensive project
The project in particular will help you know VHDL and the FPGA design flow from beginning to end

Additional Skills Learned in the Project
  Reading & understanding specification of a complex algorithm
  Design of new hardware architectures based on existing architectures (datapath & controller)
  Reading, understanding, and modifying existing VHDL code
  Using embedded resources of modern FPGAs
  Characterizing performance of your codes for multiple FPGA families

Project Task 1
Read the following chapters from the GMU technical report published at http://eprint.iacr.org/2010/445
  Chapter 1 Introduction & Motivation
  Chapter 2 Methodology
  Chapter 3 Comprehensive Designs of SHA-3 Candidates
    3.1, 3.2 + subsection concerning your algorithm
  Chapter 4 Design Summary and Results
Download and get familiar with the package of a hash function assigned to you
  http://csrc.nist.gov/groups/ST/hash/sha-3/Round2/submissions_rnd2.html
Read carefully the specification of your algorithm

Project Task 1 cont.
  In one week: Meeting with the instructor devoted to fully understanding the GMU report, specification, block diagrams, interface, and timing formulas.
  In two weeks: Draft block diagrams of the
    - selected unrolled architecture
    - selected folded architecture.
  Corresponding timing formulas for execution time & throughput.
