NAVAL POSTGRADUATE SCHOOL
Monterey, California
ADA245 791
!$ 1IJilll 1i~lllI[([l/l! 00
ITHESIS
DISCRETE COSINE TRANSFORM IMPLEMENTATION IN VHDL by TaHsiang Hu December 1990 Thesis Advisor: Thesis CoAdvisor: ChinHwa Lee Chyan Yang
Approved for public release; distribution is unlimited.
9203291
Unclassified
Security Classification of this page
REPORT DOCUMENTATION PAGE
I a Report Security Classification Unclassified 2a Security Classification Authority 2b Declassification/Downgrading Schedule 4 Performing Organization Report Number(s)
6a Name of Performing Organization 6b Office Symbol
lb Restrictive Markings
3 Distribution Availability of Report
Approved for public release; distribution is unlimited.
5 Monitoring Organization Report Number(s)
7a Name of Monitoring Organization
Naval Postgaduate School
6c Address (city. state, and ZIP code)
62
Naval Postgraduate School
7b Address (city, state, and ZIP code)
Monterey, CA 939435000
Sa Name of Funding/Sponsoring Orgamzation 8b Office Symbol
Monterey, CA 939435000
9 Procurement Instrument Identification Number
I(If Applicable) Sc Address (city, state, and ZIP code)
__________________________________________I________
10 Source of Funding Numbers
Elemntn Nwrnbe& Projet No
ITask
Na
Work Unit Accession No
I1 Title (Include Security Classification)
1 2 Personal Author(s)
DISCRETE COSINE TRANSFORM IMPLEMENTATION IN VHDL
14 Date of Report (year, monthday)
15 Page Count
TaHsiang Hu
13b Time Covered
13a Typc of Report
Master's Thesis IFrom To December 199 166 16 Supplementary Notation The views expressed in this thesis are those of the author and do not reflect the official
policy or position of the Department of Defense or the U.S. Government.
17 Cosati Codes Ficlid Group [Subgroup 18 Subject Terms (continue on reverse if necessary and identify by block number)
FFT SYSTEM, DCT SYSTEM IMPLEMENTATION
t) Abstract (continue on reverse if necessary and identify by block number
Several different hardware structures for Fast Fourier Transform(FFI) are discussed in this thesis. VHDL was used in providing a simulation. Various costs and performance comparisons of different FFT structures are revealed. The FFT system leads to a design of Discrete Cosine Transform(DCT). VHDL allows the hierarchical description of a system in structural and behavioral description. In the structural description, a component is described in terms of an interconnection of more primitive components. However, in the behavioral description, a component is described by defining its input/output response in terms of a procedure. In this thesis, the lowest hierarchy level ischiplevel. In modeling of the floating point unit AMD29325 behavior, several basic functions or procedures are involved. A number of AMD29325 chips were used in the different structures of the FFT
butterfly. The full pipline structure of the FFT butterfly, controller, and address sequence generator are simulated in VHDL. Finally, two methods of implementation of the DCT system are discussed.
20
Distribution/Availability of Abstract unclassified/unlimitd a e
,
[
DT
cusers
21
Abstract Security Classification 22c Office Symbol
Unclassified
22b Telephone (Include Area code)
'a;i Name of Responsible Individual
Chinlwa Lee
)) FORM 1473,84 MAR
(408) 6550242
83 APR edition may be used until exhausted All other editions are obsolete
EC / Le
security classification of this pagc
Unclassified
i
Approved for public release; distribution is unlimited.
Discrete Cosine Transform Implementation In VHDL by TaHsiang Hu Captain, Republic of China Army B.S., ChungCheng Institute Of Technology, 1984
Submitted in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE IN ELECTRICAL ENGINEERING from the NAVAL POSTGRADUATE SCHOOL December 1990
Author:
___ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __
Taqisiang Hu
Approved by:
Hwaee, Thesis Advisor
Chyan Yang, CoAdvisor
Michael A. Morgan, Chairman Department of Electrical and Computer Engineering
ii
ty C . A number of AMD29325 chips were used in the different structures of the FFT butterfly.
terms
interconnection of more primitive components. In modeling of the floating point unit AMD29325 behavior.
Accession For NTIS GRA&I DTIC TAB
Unannounced
Just If ict.response in terms of a procedure. Finally. VHDL allows the and a hierarchical behavioral component description of In in a system in
structural
description. two methods of implementation of the DCT system are discussed. a component is described by defining its input/output. controller. By
on
AvaIlabl.A Dist jAvli nJ/or_ OOA
. The FFT system leads to a design of Discrete Cosine Transform(DCT). and address sequence generator are simulated in VHDL. Various costs and performance
comparisons of different FFT structures are revealed. In this thesis. several basic functions or procedures are involved. in the behavioral domain.ABSTRACT Several different hardware structures for Fast Fourier Transform(FFT) are discussed in this thesis. the lowest hierarchy level is chiplevel. is described
the
structural of an
description. VHDL was used in providing a simulation. The full pipeline structure of the FFT butterfly. However.
.... 1........... .
14
a............................ c...... ..
21
iv
.... OVERVIEW OF THE FAST FOURIER TRANSFORM . Subtraction Operation Function
. .
.. ............... A.
THE ELEMENT FUNCTIONS ASSOCIATED WITH THE ARITHMETICAL OPERATION OF AMD29325 ......
BASIC MODELING FUNCTIONS OF AMD29325 ..
INTRODUCTION .TABLE OF CONTENTS
I....... ........ 15 16 21
21
Multiplication Operation Function Division Operation Function . 5
B. A.
INTRODUCTION TO FLOATING POINT UNIT CHIP AMD29325 .
BEHAVIORAL DESCRIPTION OF THE AMD29325 CHIP
III.
.
Addition Operation Function ...
..............
..
1..... .. B. ....
.. ..
FLOATING POINT UNIT ....
1 1 2 5
VHDL HARDWARE DESCRIPTION LANGUAGE . ..
OVERVIEW OF THE IEEE FLOATING POINT STANDARD FORMAT ...... 3....
14 15 15 .... 9 10
C... 10
2...
.
.. .
II. d.. THE DATA FLOW DESIGN OF THE FAST FOURIFA TRANSFORM
A. OVERVIEW OF THE THESIS ............. b..................
DECIMATION IN TIME(DIT) ..
THE TOP FUNCTIONS ASSOCIATED WITH THE
ARITHMETICAL OPERATIONS OF AMD29325 .....
. RAM ....
3... IV...... 6.. ................ . STRUCTURE 5 OF DIF BUTTERFLY .....
5.. ......... TO IMPLEMENT THREE ADDITIONAL PRECISION FORMATS TO IMPROVE THE ARITHMETIC ACCURACY........ 4. .
IMPROVEMENTS AND FUTURE RESEARCH ...........2.............
........................ STRUCTURE 2 OF DIF BUTTERFLY .
STRUCTURE 4 OF DIF BUTTERFLY ........ .... A..
DECIMATION IN FREQUENCY(DIF) ........... .. STRUCTURE 3 OF DIF BUTTERFLY 4.. 68 68
INTRODUCTION TO DISCRETE COSINE TRANSFORM(DCT) THE DISCRETE COSINE TRANSFORM SYSTEM IMPLEMENTATION ............ ..
CONCLUSION ..... 1. 2....... 2...
70 76 76 77
V.. ..... SOME VHDL BEHAVIORAL MODELS ....
.... B.
FULL PIPELINE DIF BUTTERFLY STRUCTURE
CONTROLLER FOR THE BUTTERFLY STRUCTURE . ..... 1..........
22
COMPARISON OF SEVERAL DATA FLOW CONFIGURATIONS OF THE FAST FOURIER TRANSFORM . B... . 3. 24 26 32 33 40 43 46 50 50 51 52 59
..... C. STRUCTURE 6 OF DIF BUTTERFLY . B...
....... ADDRESS SEQUENCE GENERATOR .... ... ..
1...
CONCLUSION A......
SIMULATION OF THE DATA FLOW DESIGN OF FFT
60
THE DATA FLOW DESIGN OF THE DISCRETE COSINE TRANSFORM ....
. D........ STRJCTURE 1 OF DIF BUTTERFLY . . 78
v
......... .
...
..
.............
APPENDIX C: THE SOURCE FILE OF THE FPU CHIP AMD29325 APPENDIX D: THE SIMPLIFIED I/O PORT OF THE FPU CHIP AMD29325 ............ ... TO BUILD THE FAST FOURIER TRANSFORM USING SPECIAL "COMPLEX VECTOR PROCESSOR (CVP)" CHIP .
o. B...........
o....82 92 92 103 105 81
APPENDIX A: THE ELEMENT FUNCTIONS OF THE FPU
APPENDIX B: THE TOP FUNCTIONS AND BEHAVIOR OF THE FPU A... ....... TO ADD SEVERAL OTHER FUNCTIONS ASSOCIATED WITH THE AMD29325 OPERATION ..
107
APPENDIX E: THE PIPELINE STRUCTURE OF THE FFT BUTTERFLY .......... B..
150
vi
.......... TO PERFORM THE RADIX 4 FAST FOURIER TRANSFORM IN DIT OR DIF ALGORITHMS .... THE BEHAVIOR FUNCTIONS OF THE FPU . ..2...... 130
APPENDIX G: THE BEHAVIOR OF RAM ...... o 148
THE SOURCE FILE ASSOCIATED WITH DATA READ THE SOURCE FILE OF THE CONVERSION BETWEEN FP NUMBER AND IEEE FORMAT
...... 4. .................. .........
.. . THE TOP FUNCTIONS OF THE FPU .
APPENDIX H:
THE SOURCE FILE OF THE FFT SYSTEM
APPENDIX I: THE ACCESSORY FILES A............ 78
3.. 109
APPENDIX F: THE ADDRESS SEQUENCE GENERATOR AND CONTROLLER .... 78
TO IMPROVE THE ADDRESSING SEQUENCE GENERATOR TO REDUCE FETCHING IDENTICAL WEIGHT FACTORS....... ...
116 126
.. .......... . 81
5............ ..148 ....
..152 INITIAL DISTRIBUTION LIST.153
vii
...LIST OF REFERENCES................................
LIST OF TABLES
TABLE 3.1 TABLE 3.2 TABLE 3.3 TABLE 3.4 TABLE 3.5 TABLE 3.6 TABLE 3.7 TABLE 3.8
Time space diagram of DIF structure 1. Time space diagram of DIF structure 2. Time space diagram of DIF structure 3. Time space diagram of DIF structure 4. Time space diagram of DIF structure 5. Time space diagram of DIF structure 6. Comparison of 6 DIF butterfly structures.
. .
30 35 39 42 45 48 49
. . . . . . . .
Comparison of the FFT result using the MATLAB ...... 65
function and this simulated FFT system. TABLE 5.1
The comparison of total number of arithmetic . ... .. 80
operations needed in Radix 2 and Radix 4
viii
LIST OF FIGURES
FIGURE 1.1 FIGURE 2.1
The designed tree of this thesis ... The IEEE single precision floating point
....
4
format ........... FIGURE 2.2
......................
6
Format parameter for the IEEE 754 floating .................. 7
point standard ......... FIGURE 2.3
AMD29325 block diagram (adapted from AMD data ...................... 11
book) ........ FIGURE 2.4
AMD29325 pin diagram (adapted from the AMD data ...................... . 12
book) ........ FIGURE 2.5
AMD29325 operation select (adapted from AMD data ....................... .. ....... 13 16
book) ......... FIGURE 2.6 FIGURE 2.7
The entity of a FULLADDER
Three constructs in VHDL language (adopted from ..................... . 17
[Ref. 4]) ....... FIGURE 2.8
AMD29325 pin description (adapted from the AMD .................... 18
data book) ....... FIGURE 3.1
Signal flow graph and the shorthand .......... . 22 23
representation of DIT butterfly ... FIGURE 3.2 FIGURE 3.3
8 point FFT using DIT butterfly ....... .. Signal flow graph and shorthand representation ................. ..
in DIF butterfly ...... FIGURE 3.4
24 25
8 point FFT using DIF butterfly ....... ..
ix
FIGURE 3.5
8 point FFT w~th DIF butterfly in nonbit................. .. 26
reversal algorithm ...... FIGURE 3.6
Two different basic butterflies and their ............... .. 27
arithmetic operations ..... FIGURE 3.7
Butterfly implementation in pipeline .....................
.
structure ........ FIGURE 3.8 FIGURE 3.9 FIGURE 3.10 FIGURE 3.11 FIGURE 3.12 FIGURE 3.13 FIGURE 3.14
.
29 34 38 41 44 47
Butterfly implementation in structure 2 Butterfly implementation in structure 3 Butterfly implementation in structure 4 Butterfly implementation in structure 4 Butterfly implementation in structure 6
Controller flow chart and its logical symbol 53 The block diagram of address sequence ............. .
. .
generator and controller ...
FIGURE 3.15
55
56
Address sequence generator flow chart
FIGUP.E 3.16
Timing of read cycle and write cycle
.
. . .
(adopted from National CMOS RAM data book) FIGURE 3.17 FIGURE 3.18
.
61 62 66
The original data flow system of FFT The revised data flow system of FFT
. .
. . .
FIGURE 3.19
The flow chart of the universal .................... 67
controller ....... FIGURE 4.1
Full pipeline structure to implement the DCT
system, the input data come from the FFT system output. FIGURE 4.2 ....... ...................... 73
Block diagram of the universal controller ................. x .. 74
and the FFT system ......
.. bottom is the DIF algorithm ...... top is the DIT ......... 79
algorithm. 75
controller ...1
Butterfly in Radix 4.......FIGURE 4..3
Modified flow chart of the universal .... FIGURE 5......
xi
...
Chyan Yang who provided a lot of help and enthusiasm when it was greatly needed. I would like to thank my Thesis CoAdvisor.ACKNOWLEDGMENTS
I wish to express my thanks to my Thesis Advisor. and guidance.
xii
. ChinHwa Lee. for his understanding. Finally. I would like to extend deep appreciation to my family and friends who gave their total support throughout the entire project. Prof. infinite patience. Prof.
A.
language
address a number of recurrent problems in the design cycles. 1]. register. Consequently. up to the desired system level. Also it does not force a design methodology on a designer" [Ref. 2].
Many existing hardware description languages can operate at the logic and gate level. and documentation of digital hardware. "It
is
a
new
hardware by and the
description U. For example.
language of
developed Defense
and for
standardized documentation design"
Department of CAD was
specification "The
microelectronics develoned to
[Ref.S.
and In
behavioral the
digital a
structural
domain. chip. 3]. it can be extended beyond this to higher behavioral levels. in the behavioral domain. While VHDL is perfectly suited to this level of description. it can extend from the level of gate. exchange of design information. a component is
1
.
INTRODUCTION
VHDL HARDWARE DESCRIPTION LANGUAGE VHDL stands for VHSIC Hardware Description Language. designers is
structural [Ref.I. they are lowlevel logic design simulators. VHDL allows hierarchy implementation domains. However. VHDL is technology independent and is not tied to a particular simulator or logic value set. by in two domains.
component
described in terms of an interconnection of more primitive components.
and a Discrete Cosine Transform system. Chapter I gives a general introduction. VHDL was used to model at the chip level. and integrated models of the FFT system.e. Then. In this thesis. In other words. Different structures were studied here to compare system performance and costs. Furthermore. a Discrete Fourier Transform system. controller. four basic operations of the floating point unit AMD29325. VHDL RAM models. chips. six different kinds of data flow configurations. Modeling the behavior at the chiplevel is the first task. in Chapter IV a Discrete Cosine Transform(DCT) is implemented based on the extension of the universal controller
2
. the lowest hierarchy level is at the chiplevel. Chapter III includes the designs of the butterfly of a Fast Fourier Transform(FFT) in DIF algorithm. The structural
modeling and behavioral modeling in VHDL are the main subjects in this thesis. various structures of FFT system are
designed using these primitives. i. and a
simplified version A29325 are created in Chapter II. address
sequence generator. In this thesis. a floating point unit. OVERVIEW OF THE THESIS This thesis is divided into five chapters. In order to model these chips accurately Timedelay and holduptime as VHDL generic are introduced. Several element functions.described by defining the input/output response in terms of a procedure. B. VHDL is the main language tool to allow for capturing and verifying all the design details.
Various nodes in the tree will be explained in detail in the following chapters. The hierarchy of the design units created in this thesis can be summarized in a tree shown in Figure 1. and end at the top. Finally. The efforts start at the bottom of the tree. Chapter V gives the conclusions and suggestions of possible future research.of the FFT system.1.
3
.
DCT

Egstern
FFT

Syste m1
Address Sequence Generator
A 2 932
ontroller
AMD29325
FFp

Addition
F~Sbstraction
Fp

Mlultiplication
Fp

Division
Element Functions
FIGURE 1.
4
.1
The design tree of this thesis.
a sign bit. HONEYWELL 8200. IBM 3303 might use different floating point formats. In this situation.1)
5
. Usually. there may also be a need to represent NAN( not a number ) or infinite number. There are several formats for representing floating point numbers. VAXII. a floating point number is used. how rounding is carried out. In these situations. Different computer systems such as CDC 7600. Any floating point format usually includes three parts. Fixed point number representation is not sufficient to support these needs. The variations occur in the number of bits allocated for the exponent and mantissa patterns. and the actions taken while underflow and overflow occur.
FLOATING POINT UNIT
OVERVIEW OF THE IEEE FLOATING POINT STANDARD FORMAT Sometimes applications require numbers with large
numerical range that can not be stored as integers. A.II. and a mantissa bit
pattern. the value of a floating point format is
(sign)Mantissa *
2
exponent
(2. there is a need for a standard floating point format to allow the interchange of floating point data easily. DEC. Therefore. an exponential bit pattern.
Storage Locationl (recister or rmerncr/)
LExponentMnis
Sign Bite Exponent Bits
L
Hidden Sit (if used)
L Mantissa
Bits'_
FIGURE 2.
6
. 1
The IEEE single precision f loating point f ormat.
2 Format parameter floating point standard. The other
precision formats are shown in Figure 2.2 [Ref. There is an important fact that 1 bit is hidden in the mantissa. and 23 for the mantissa. 5]. 4).
Single p (bits of precision) Emu Era.1
[Ref.0
(2. 8 for the exponent. The IEEE single precision floating format contains 32 bits: 1 for the sign.0 < actual fraction < 2.In
Figure
2. Exponent bias 24 127 126 127
Single extended ? 32 2 1023 s 1022
Double 53 1023 1022 1023
Double extended a 64 16383 516382
FIGURE 2. the actual size of fraction is 24.
for
the
IEEE
754
7
.1. Consequently. exponent bits and mantissa bits. added by 1. from the 22th bit down to the zero bit in Figure 2.2)
The IEEE floating point format supports not only single precision but also other precision formats. In this case the actual value of the fraction is
1. the actual number of bits of the fraction is that of the mantissa value. In other words.
the
IEEE
single
precision
floating point format is shown with the sign bit.
The second case is when the magnitude is less than 2126 i.
only
single
precision is used. and s is the sign of bit.999999910 * 2127
(2.
8
. the floating point number is represented as
(1) s *
f
*
2
e'exponent bias
(2.
special The
cases
can
occur
from
arithmetic
first case
is called
"overflow" when the
magnitude is greater than the upper limit of the equation (2. and. if e is the value of the exponent. f is the value of the fraction.4)
Several operations. if single precision with exponent bias of 127 is adapted. the negative number has a sign bit of 1. In a single precision system.
1 2 9 10. Accordingly. The last row in Figure 2.2 shows the concept of exponent bias. a floating point value with exponent bits "100000012".
would be (129127)2 = 22.e.3)
The
sign bit
s indicates the
sign of the
floating point
number.In
the
simulation
programs
of
this
thesis.4). This indicates the implied range of the exponent of floating number is no longer strictly positive. For example. the magnitude range is
0 < magnitude < 1. The positive number has a sign bit of 0.
overflow arithmetic range and underflow occurred when the result of an
operation
is beyond or below the in the AMD29325
representable chip only the
[Ref. In the IEEE standard format. irrespective of the mantissa value. In
this thesis.
the infinity is 7FA00000 1 6 . INTRODUCTION TO FLOATING POINT UNIT CHIP AMD29325
The AMD29325 chip is a high speed floating point processor unit.
However. The NAN is
defined as a number with the exponent being 255. The DEC single precision floating point format is also
9
.5)
and this is called "underflow". On the other hand. 6].
circuit. the zero is defined as a number with the exponent minimum value and the mantissa zero. it is the representation of infinity. NAN (not a number). The NAN in the AMD29325 is 7FA11111
16. if all bits except the sign bit are set to 1. The third situation is how to represent zero. for reasons of convenience. it would be the representation of underflow. If the single precision is adopted. and infinity.
zero format is the same as that of the IEEE standard.0 < magnitude < 2 "126
(2.
substraction. It performs 32 bits single precision floating point multiplication operations in VLSI
addition. It can use the IEEE floating point standard format.
B. and the mantissa is not equal to zero. if all exponent bits are 0. this represents a number 010* If all bits of a floating point number become 0.
5. several basic functions had been created before modeling the behavior of the AMD29325. only four
arithmetic operations necessary for simulation program had been created. C.4. Input and output registers can be made transparent independently. pin I0. In this thesis. and floating point division. There which monitor the status of operations: inexact result. floating point multiplication. All buses are registered with a clock enable.5. THE ELEMENT FUNCTIONS ASSOCIATED WITH THE ARITHMETICAL OPERATION OF AMD29325 In order to simulate the features of AMD29325. All selected functions are listed in Figure 2. floating point format.. zero. BASIC MODELING FUNCTIONS OF AMD29325 i. Figure 2. and IEEE floating point format and DEC floating point format. and
notanumber(NAN). and 12 can choose eight different functions. I.. Its pin diagram is shown in Figure 2.3 shows the block diagram of the AMD29325.
Selection to perform an arithmetic operation on chip AMD29325 is via the 3 pins I0. It includes operations of convers:
i among 32bit
integer format.supported.
underflow. I. The AMD29325 chip has three buses in 32bit architecture. floating point addition. and 12. re six flags
invalid operation. floating point
subtraction. Although the division function is not used in the
10
. overflow. two input buses and one output bus. In Figure 2.
The following is a brief description of those element functions associated with the modeling of AMD29325.
11
. Usually.3 book). it still included in the model of the AMD29325.Y
~
Y
ITER
I CVK SELECT ANO ENASLE C> D to FLOATI. These element functions are listed in Appendix A. • INT TO BITSARRAY: to transfer an integer value into its corresponding bits pattern.As [ATr
uNP S
• EGISTES
RSCISTR *L FZTER
.POINT ALU OTF
STATUS FLG G N. • BITSARRAY TO FP: to convert the mantissa bits pattern into its corresponding floating point value. • FP TO BITSARRAY: to do the inverse conversion from floating point value into its corresponding mantissa bits pattern.G. it is used when the exponent value is converted to its corresponding IEEE exponent format.
AMD29325 block diagram (adopted form AMD data
actual simulation of the AMD29325._>UMOERFLOW
C
ZERO
FIGURE 2.
FLAG USTATUS
WEXIACT INVAUO
OVERFLOW .
This is a situation of multiplication by zero. " ISZERO: to test the bit pattern of an input parameter to see whether it is a zero or not. input
" IS UNDERFLOW: to check the bit pattern of an input parameter to see whether it is underflowed or not. and the most significant bit is assigned as 0.
12
. • IS NAN: to check the bit pattern of an input parameter to see whether it is a NAN expression or not. " ISOVERFLOW: to test the bit pattern of an parameter to see whether it is overflowed or not. " BECOME ZERO: to set the result to zero before the actual arithmetic operation occurs.
in
the IEEE
* SHIFL TO R: to shift the bit pattern from left to right.4 data book).[R0R31F 32 SCS31 CLK
3 0 F 1
INEXACT INVALID NAN
ENS EN "UNDERFLOW 4 FTFT
OVERFLOW
ZERO
IEEEfDEC
ONEBUS
RND 0 .RND 1
FIGURE 2.
AMD29325 pin diagram
(adopted from the AMD
" UNHIDDEN BIT: to recover the hidden bit standard format.
R (OtC forrnat)
FIGURE 2.
13
.pont conversion (INTTOFP)
F . output bit
pattern is "000111" when the input pattern is "000110" to do the inverse as the previous .
AMD29325 operation select
(adapted from ?'I'D
• BECOMENAN! to set the result of an operation to be infinity before the actual operation occurs. DECREMENT: function.2S F (floatingpoint) 
R (integer)
1
0
1
Fioatingpoint.S F R S
Output Equation
Floatingpoint additon (R PLUS S) Floatingpont subtacton (" MINUS S) Fioatingpoint multoiiction (R TIMES S) Wicatingpornt constant subtraction (2 MINUS S) Integer. Otherwise. * INCREMENT: to generate a bit pattern which is greater than input bit pattern by one.R (floatingDoint)
1
1
0
F (DEC format) 
R (IEEE
format)
1
1
F (IEEE format) . For example. element
. This is a situation of division by zero.R+ S FR. * SETFLAG: to verify that the input parameter is located in the representation range which is between the upper limit and lower limit. give it some proper flag if it is not.tointeger conversion (FPTOINT) IEEETODEC format conversion (IEE ETOOEC) DECTOIEEE format conversion (CErTO.12 0 0 0 0 1 0 0 1 1 0
10 0 1 0 1 0
1
Operation F . BACK TO BITARRAY: to convert a given floating point number into the corresponding IEEE standard bit format.IEEE)
P (nteger) .5
data book).otlotaing.
conversion of the floating point value back into the IEEE standard format will be done. the key steps of floating point addition operation are described. and division function.
14
. (2) (3) shift s I by d places to t~e right. Addition Operation Function Since the operands are in the IEEE standard format. This means d is equal to e2 minus el. A. and si be used as the exponent and mantissa value of a floating point a i. These arithmetic functions will call those element functions mentioned previously. Let e. Immediately after the result of this ac Ation operation is generated. These are the addition function. now it become sI' find the sum of s2 and si'. before the addition operation can occur. find the distance d between el and e 2 .2. conversion from IEEE standard bit pattern into a floating point value is necessary.ll of the VHDL source programs of the arithmetic functions are attached in Appendix B. The algorithms o. (1) if el is less than or equal to e 2 . THE TOP FUNCTIONS ASSOCIATED WITH THE ARITHMETICAL OPERATIONS OF AMD29325 Four important features of the AMD29325 are created in this thesis. In the following. a. these arithmetic functions are described below. subtraction function. The basic procedure for adding two floating point number al and a2 is very straight forward and involves four steps. multiplication function.
and adjust it.(4)
determine the sign from a2. the operands are in the IEEE standard format. the substraction operation function can be performed by calling the addition operation function after the sign of the minuend has been changed to its inverse. it would be converted back into the IEEE standard format. (2) find the product of p. Multiplication Operation Function As mentioned previously. and adjust it to the range shown in equation (2. Let pi and e. When the quotient is generated. The method for multiplication of two floating numbers al and a2 is similar to integer number multiplication. Therefor. In the following steps the product of two floating numbers is calculated. Once the result of this multiplication operation is obtained. before this operation function can occur. Subtraction Operation Function Similar to the addition operation function as
mentioned above. If single precision is adopted in the system. and p . they are converted into floating point value.2) and modify the adjusted sum of the exponent at the same time. (1) find the sum of el and e2 . the conversion of the IEEE standard format into floating point format is necessary. it is converted back into the IEEE standard format. since the absolute value
of a2 is greater than al. d. Division Operation Function As mentioned previously. c. the normalized action is the subtraction of 127 from the exponent value.
respectively. b. be the value of mantissa and exponent of a. In the following steps the floating
15
.
In the VHDL language.6
The entity of a FULLADDER. Generic provides a channel to pass a parameter of constant timing to a component from list which is its an
environment.6.point number division operation is described. del 2 TIME := 20 ns ) . Cin : in BIT Sum.
FIGURE 2. end entity FULLADDER . Let p. there are three levels of abstraction possible to
entity FULLADDER is generic( del_1 : TIME := 10 ns .
3. Assume that the dividend and divisor are al and a2 respectively. and port supports a signal interface to its environment. As previous examples. BEHAVIORAL DESCRIPTION OF THE AMD29325 CHIP
As shown in Figure 2. (1) find out the distance d between el and e2 and then denormalize it. Y. (2) find out the quotient of the division operation.
16
. an entity of a full adder with port and generic is declared.2). the action of denormalizing means that the distance d is added to 127. port( X. and at same time modify the quotient. Cout : out BIT ) . 'In' and
'Out' are used to
indicate the direction of the signal data flow. Then adjust it into the proper range in equation (2. and e i be the mantissa and exponent of a i .
if X = 'I' then N := N+l. 0: out bit ). Three constructs in VHDL language (adopted from
17
. 12:in bit.c. constant sum vector : bitvector( 0 to 3):="0101". end component .c :bit . 12:in bit. Y.
Cout <= (X and Y) or (S and Cin) after del_2. end if if Y = 'I' then N := N+l.Behavioral Constructs
architecture hehav'oral view of full adder is
begin process variable N: integer .
end process . Cout <= carry_vector after del_2
. Cin .
N := 0 . end dataflow view. .7 [Ref. end structureview . FIGURE 2. Structural Constructs architecture structureview of fulladder is component half adder generic( delay : time := 0 ns ) . begin S <= X xor Y after del_1 .Sum ).Y. port map( a.
Data Flow Constructs
architecture dataflowview of full adder is signal S: bit . U2: halfadder generic( delay => del_l ). port map( X.
Sum <= S xor Cin after del_1 .b. constant carry_vector: bitvector( 0 to 3):="0011".b ). S: out bit ). port map( b. end behavioralview. begin wait on X. begin Ul: halfadder generic( delay => del_1 ). C.Cin. end Sum <= sum vector after del 1 . port(ll. port(ll. signal a. if .
. end if if Cin = 'I' then N := N+l .Cout ) . 4]). end component . U3: orgate generic( delay => del_2 ).c. component or_gate generic( delay : time := 0 ns ) .a.
and mostsignifcant portions of the 32bit inout and output words being placed on the buses during the HIGH and LOW portions ot CLK. due to rounding. See Table I for a list of operations and the corresponding codes. is the leastsignificant F0 . F 0 . IEEE mode is selected. Active LOW) When U is LOW. FT O Input Register Feedthrough Control (Input. Active HIGH) When FT 0 is HIGH.g. bus operation. Output Register Feedtfiough Control (Input. registers and S are transparent. IEEE/DEC IEEE/DEC Mode Select (Input) '.F31 F Operand Bus F 0 is the leass eadcant (Input) bit. seiects R0 P31 as the inrut to register R. Active HIGH) A HIGH indicates that the last operation produced a final result that overflowed the floatingpoint format. register F retains the previous contents. RND ..:e..In 16btmode.S3I S. Active LOW) When E is LOW. In 32bLt mode. Active HIGH) A HIGH indicates thal the last operation performed was invalid. The output in such cases is either an IEEE NotaNumber (NAN) or a DECreserved operand. Register F Clock Enable (Input. When IEEE/DE. register R is clocked on the LOWtoHIGH transition of CLK. e.
18
. JNDERFLOW Underflow Flag (Output.
Clock (Input) CLK For tne internai registers.8 AMD29325 pin description (adopted from the AMD data book).. Active HIGH) A HIGH indicates that the last operation produced a final result of zero.F3 1 assume a highimoedance state.Vhen IECEic. ZERO Zero Flag (Output.en it HIGH.
FT 1
10. When ENT is HIGH. S Operand Bus So . 14 Register R inout Select (Input) A LOW on 1.is HIGH. register F and the status flag register are transparent. Active HIGH) A HIGH indicates that the last operation produced a rounded result that underftowed the floatingpoint format. Active HIGH) A HIGH indicates that the nfal result prcdiuced by the last operation s not to be interpreted as a number. U' Output Enable (Input. A HIGH selects 'he ALU F port as the input to re. A HIGH on ONEBUS configures the input bus circuitry for singleinput bus operation.ide. with the least. Input and outout buses are 32 bits . Active HIGH) A HIGH indicates that the final result of the last operation was not infinitely precise.
FIGURE 2. INVALID Invalid Operation Flag (Output.or 322Bit I/O Mode Select (Input) S16/32 A LOW on S16/32 selects the 32b0 IO moce: a HIGH selects the 16cit I/0 mode. OVERFLOW Overflow Flag (Output. Active HIGH) When FT1 is HIGH. NAN NotaNumber Flag (Output. register S retains the previous contents. PROJ/AProjective/Atfine Mode Select (Input) Choice of prolective or aMine mode determines the way in which infinities are handled in IEEE mode A LOW on mode: a HIGH selects projective PROJ/AT selects affine mode.is LOW.s!er R.12 Operation Select Lines (input) Used to select the operation to be performed by the ALU. register R s retains the previous contents ENS Register S Clock Enable (Input. (Output) bt. 16. inoutand cutout buses are 16 bits w. Input Bus Configuration Control (Input) ONEBUS A LOW on ONEBUS configures the input bus circuitry for twoinput. Active LOW) ENT When is LOW. V. " times 0. respeciveiy. Active LOW) When r is LOW. ENR Register R Clock Enable (Input. register S is clocked on the LOWtoHIGH transition of CLK. the contents of register F are placed on F0 . register F is clocked on the LOWtoHIGH transition of CLK. 13 ALU S Port Input Select (Input) A LOW on 13 selects register S as the input to the ALU S port A HIGH on 13selects register F as the input to the ALU S port.F3 1 When M is HIGH. RND 1 Rounding Mode Selects (Input) O NDO and RNO1 select one of tour rounding modes. See 0 Table 5 fora list of rounding modes and the corresponding control codes. DEC mode is selected.PIN DESCRIPTION
R0 R 3 1 R Operand Bus (Input) F is the leas:sgniicant b.
INEXACT Inexact Result Flag (Output. When E is HIGH.
those pins are only listed in the port declaration of the AMD29325. In order to better understand the usage of the chip pins. In the program attached in the Appendix. operation functions.8. zero. underf low.
In
Figure
2. floating point multiplication. there
four
arithmetic
functions implemented. which is
attached in the Appendix D. there is a mixed situation where more than one level of abstraction is used in the
simulation model. 19
. and floating point division. In
program. A simplified entity A29325 is created. you can find mixed constructs there. The final way is the structural level description which instantiates several
components to build the adder circuit. 7]. and overflow. which uses the signal assignment statement to express the relationship between input and output. the AMD29325 pin
description is listed in Figure 2. Usually. Four flags are checked: not a
number(NAN). The second way is the data flow level description.
floating point substraction.
examples use three different levels to depict the same full adder as shown. The attached VHDL in simulation program this of the chip AMD29325 are is
Appendix C. Since many functions of this chip are not required in the simulation for this thesis. only those pins of input and output signals. Generally speaking. There are differences among these three levels. which uses a conditional branch structure in the process. clock. floating point addition.7.describe
specific
circuits
[r'ef. The first way is the behavioral level
description.
undesired output data signals may appear on the output data bus. the
desired output data will be preempted and not shown on the data bus. and the total behavior of the AMD29325 have been introduced in this chapter. this is the situation when the period of the clock is less than the constant delay of the selected operation.
20
. it is necessary to be sure that the period of the clock is greater than the constant delay of the chip.and chip enable necessary for simulation are included in the port declaration of the AMD29325. the two input signal buses must be driven and the chip enable signal must be active low. When the model is called by the other top level
environment. arithmetic functions. When the floating point unit AMD29325 is employed in a system design. In the next chapter. the floating point unit is triggered to execute the selected operation function. When the clock comes with the positive rising edge.
Otherwise. All element functions. the subject will focus on the system configuration. Since the constant time delay is the VHDL inertial delay. Data on the output bus will change after a constant time delay.
1.. One with odd sequence number. Note that in this figure the input
21
.
N1
(3. Assume that the total number of input data is N. 1). A. the butterfly can be Through a well operation for known derivation DIT in fast of
steps.
N1
X(k)=
n0
x(n)e j 22n/
for k=0 . For Discrete Fourier Transform(DFT). They are the methods of decimation in time and decimation in frequency.1
transform
represented
graphically
Figure
[Ref. which is an integer of power of 2. the Discrete Fourier transform formula is. DECIMATION IN TIME(DIT) In this method.1)
In the following a brief description of two data flow designs
of Fast Fourier Transform are presented. For a limited sequence x(n). The complete signal flow of an 8point FFT is shown in Figure 3. it is possible to divide x(n) into two half series.III.
THE DATA FLOW DESIGN OF THE FAST FOURIER TRANSFORM
OVERVIEW OF THE FAST FOURIER TRANSFORM The Fourier Transform is usually used to change time
domain data into frequency domain data for spectral analysis.
the
fourier 3.2 [Ref. and the other with even sequence number. 8]. the operations are performed on a sequence of data. For some problems the analysis in the frequency domain is
simpler than that in the time domain.
The signal data flow of the butterfly is shown in Figure 3.3 [Ref. the time sequence was partitioned into two subsequences having even and odd indices. This arrangement has the property that the output will turn out to be in the natural order. except that bit reversal ordering
22
.A
1
1
C
A
C
r or
\< B Dk B Wk
D
C = A+B
D = (AB)Wk
FIGURE 3. This idea is similar to the idea of the decimation in time. 1). In DIT. data is arranged in bit reversal order according to the needs of decimation. 2.2. 1].1 Signal flow graph and the shorthand
representation of DIT butterfly.4 (Ref. And the completed signal data flow of an 8point FFT in DIF algorithm is shown in Figure 3.4 is similar to Figure 3. DECIMATION IN FREQUENCY(DIF) Another way to decompose the calculation of the
Discrete Fourier Transform(DFT) is known as the decimation in frequency. An alternative is to partition the time sequence x(n) into first and second halves. Figure 3.
the noninplace algorithm is adopted. Figure 3. Another arrangement is to have both the input and the output data in the normal order. in order to keep normal order for both the input and the output data.3 are called inplace algorithm. the FFT shown in Figure 3.
occurred in the output.2 and 3.2
8 points FFT using DIT butterfly.4. Notice that this is no longer an inplace algorithm. two data values are used as a pair inputs to a butterfly
calcUlation. In both Figure 3.
23
. The output can be put back into the same storage locations that hold the initial input values because they are no longer needed for any subsequent computations. As a
consequence of this characteristic.00
4
6 3
26 77
FIGURE 3. In this thesis.2 and Figure 3.5 shows a nonbitreversal algorithm.
A1
C
A
C
S1
D
B
r or Wk
D
C = A + BWk D= A BWk FIGURE 3. and one complex multiplication.
subtraction. Figure 3.3 Signal flow graph and shorthand representation in DIF butterfly. There are two inputs. and two writes for C and D.6 shows the total
24
. W. complex numbers A and B. to form two outputs C and D.
Figure 3. They are combined together with a complex weight factor.
COMPARISON OF SEVERAL DATA FLOW CONFIGURATIONS OF THE FAST
FOURIER TRANSFORM The objective is to consider several data flow structures to find an optimum implementation of the Fast Fourier
Transform.6 shows the basic butterfly structures of
both the DIT and the DIF Fast Fourier Transform. five complex memory access are required.
B. three reads for A.B and Wk. Inspection calculation of the formula one shows complex that a single butterfly one complex
requires
addition. Additionally.
From the above analysis it is known that if all operations take equal time. and coefficient read required. Secondly. they may be accessed concurrently. the real and the imaginary parts of the input complex data are accessed simultaneously. two ways were adopted.
number of floating point operations. it is noted that the multiplications are performed between the data and a coefficient.4
8 points FFT using DIF butterfly.0\0 1 2 3 4 2 6
5
7
FIGURE 3. Several different structures associated with a noninplace algorithm of the butterfly in the DIF are discussed below. data read. the throughput is limited by the memory
access requirement. If the coefficients are stored in a separate memory. In order to ease this bottleneck. data write. 25
. Firstly.
There are three layers of arithmetic processors shown.5 reversal algorithm.7. for a total number of 10 arithmetic operations. 1. There is one layer for data read. In order to reduce the execution steps.7. Therefore. In this full pipeline structure shown in Figure 3.7. The data flow configuration is shown in Figure 3. each arithmetic operation uses a processor. STRUCTURE 1 OF DIF BUTTERFLY It is known that the total number of required
arithmetic operations for real data is 10. which includes four multiplications and six adfitions/subtractions.2
2 3
4 5 6
4 5 6 7
8 points FFT with DIF butterfly in nonbitFIGURE 3. and one layer for data write not shown in Figure 3. The time space diagram for this
26
. a full pipeline structure can be adopted. it needs 10
processors.
8W k = 4* I+ 100 4* 3+
1. (AB)Wk = 4* I+

2.
27
.
3.6
Two
different
basic
butterflies
and
their
arithmetic operations. A+B=2+ 2.
A+BWk
=2 + 3= 2
ABW k
4 Data Reads 4 Data Writes 2 Coeff Reads
FIGURE
3.A
C
A
C r or
Wk
r or w k D B D
C = A + BWk D= ABWk
C =
A+B
D = (AB)Wk
1. AB =23.
Because input and output buses are always busy. At the Nth time step the input 'ata A. Four steps of data flow execution can be overlapped with the execution of the previous data. the bus
utility of this structure is 100% as shown in Table 3. For data sample A(n). B(n). the average efficiency of processors is 100%.1. since all 10
processes are busy in one row of the time space table. Therefore. This structure requires input and output buses concurrently. and Wk are fetched. the output data C and D are generated. the total size of the input data and output data buses are 192 and 128 respectively which are shown in Figure 3. Since in this thesis single precision IEEE floating point format (32 bits) is used. C and D are stored back to memory. the
28
. time multiplexed buses by input and output are not usable in this structure. In the next 3 time steps. and Wk(n) the complete butterfly operation needs 5 time steps. For example. Every processor in this structure is always busy. B.structure is listed in Table 3.7.1. therefore. The average of
efficiency of processors
is defined as the percentage
processors used in one completec cycle of the arithmetic time space table. At the time step N+4. in structure 1. These steps are shown in shaded boxes in Table 3.1.
Cx
C.C
FIUE37Btefyipeetaini
ieiesrcue
U9
.
1
Time space diagram of DIF structure 1. 30
.
2nd row ALU's oper. for +" & ''oper. for "*"
3rd row LU's
for
t +"&
t B B
U S
U
S
B(n) C (n4) W(n) A(n) N D(n4)
Cr(n1) R2r(n2) =Dr(n3) = Ar(n1)+Br(n1) Rlr(n2)*Wr(n2) R2r(n3)R3r(n3)
Rlr(n1)
=R2i(n2)
=
r(n1)Br(n1) Rlr(n2)*Wi(n2) Di(n3) = R2i(n3) + Ci(n1) R3r(n2) = R3i(n3) Ai(n1)+Bi(n1) Rli(n2)*Wi(n2) R3i(n2) = Ai(n1)Bi(n1) Rli(n2)*Wr(n2)
Rli(n1) =
B(n+1) Cr(n) = C(n3) W(n+1) Ar(n)+Br(n) A(n+l) N D(n3) Rlr(n) =R2i(n1) + Ar(n)Br(n) 1 Ci(n) =R3r(n1) Ai(n)+Bi(n) 11i(n) = Ai(n)Bi(n)
R2r(n1) = Dr(n2) = Rlr(n1)*Wr(n1) R2r(n2) R~r(n2)
=
Rlr(n1)*Wi(n1) Di(n2) = R2i(n2) +
=R3i(n2)
Rli(ni) *Wi (ni) R3i(n1) = Rli(n1)*Wr(n1)
=
B(n+2) Cr(n+1) = 2r(n) =Dr(n1) C(n2) W(n+2) Ar(n+1)+Br(n+1) Rlr(n)*Wr(n) N A(n+2) + D(n2) Rlr(n+1) = 2i(n) = 2 r(n+1)Bz n+1) Rlr(n)*Wi(n) Ci(n+1) = R3r(n) =R3i(n1) Ai(n+1)+Bi(n+1) Rli(n)*Wi(n) li(n+i)
______~~ ___i(n+1)
R2r(n1)R3r(n1) Di(n1) = R2i(n1) +
=
R3i(n) =
Rli (n) *Wr (n)
_____
Bi (n+1)
TABLE 3.S 0 t U
e t P p
U
1 n
p u
t
1st row ALU's oper.
C(nl) N + D(nl) 3
B(n+3) Cr(n+2) = R2r(n+l) =(n+3) Ar(n+2)+Br(n+2) Rlr (n+1) *Wr (n+1) (n+3)
Rlr(n+2)
=R2i(n+l)
Dr (n) = R2r(n) R3r (n) Di(n) = R2i(n) + 3i(n)
=
r(n+2)Br(n+2) Rlr(n+)*Wi(nI) Ci(n+2) = 3r(n+l) = Ai(n+2)+Bi(n+2) Rli(n+1)*Wi(n+l) R3i(n+l) = Ai(n+2) Bi (n+2) Rli(n+l) *Wr(n+l)
Rli(n+2) =
C(n) N + D(n) 4
B(n+4) Cr(n+3) =R2r(n+2) Dr(n+1) = W(n+4) Ar(n+3)+Br(n+3) Rlr(n+2)*Wr(n+2) R2r(n+1)A(n+4) R3r(n+l)
Rlr(n+3)
=R2i(n+2)
= =
Ar(n+3)Br(n+3) Rlr(n+2)*Wi(n+2) Di(n+l) R2i(n+1) Ci(n+3) =R3i(n+1) i(n+3)+Bi(n+3) Rli(rl+2)*Wi(n+2)
R3r(n+2)
+
R3i(n+2) = Ai(n+3)Bi(n+3) Rli(n+2)*Wr(n+2)
Rli(n+3) =
B(n+5) Cr(n+4) =R2r(n+3) =Dr(n+2)= C(n+l) W(n+5) Ar(n+4) +Br (n+4) Rlr(n+3)*Wr(n+3) R2r(n+2)N A(n+5) R3r(n+2) + D(n+l) lr(n+4) =R2i(n+3) = 5 Ar(n+4)Br(n+4) Rlr(n+3)*Wi(n+3) Di(n+2)= R2i(n+2) + Ci(n+4) = 3r(n+3) =R3i(n+2) Ai(n+4)+Bi(n+4) Rli(n+3)*Wi(n+3) li(n+4) = R3i(n+3) = i(n+4)Bi(n+4) Rli(n+3)*Wr(n+3) input bus size = 192 bits. Time space diagram of DIP structure 1(continued). output bus size
=
=128
bits
# of execution steps per data sample
5
=
# of overlapped steps in two adjacent data samples
average efficiency of processors = 100%
4
bus utility = 100 % TABLE 3.1.
31
.
8. In Figure 3.extra registers are used to stored the previous input data A(n).average efficiency of the processors in this structure is 100%.2. a second pair of registers is used here as a buffer to save the previous input data A. R3i. In Table 3. in
structure 2 the number of I/O data lines required is reduced. which means that
32
. Here. R2i. The number of time steps for a data sample is 6. STRUCTURE 2 OF DIF BUTTERFLY For structure 1. In Figure 3.8. the
processors need to get the previous input data A(n). Therefore.2. the disadvantage is that the number of input and output data buses is too large. the sizes of the input and output buses are decreased to 128. When the current data A(n+l) is read. An overlap time space diagram is listed in Table 3. 2 for substraction or addition and 4 for
multiplication. Due to the time multiplexing. B(n) and W(n) for the arithmetic operations concurrently. 2. the data flow sequence controller of this structure will be more complicated than that of structure 1. . The number of overlap time steps for two adjacent data samples is 3.
Therefore. R2r and R3r are fed back to the first row processors through the selectors controlled by the selection signal S1.8. 6 processors are used to implement a butterfly structure in Figure 3. while in structure 1 only 5 were required. the number of rows for one cycle of
arithmetic operation in the time space is 3.
the average efficiency of processors is 56%. it is not allowed to use time multiplexed buses for both input and
output. The multiplication is performed in 1 of every 3 steps. The operations for the
multiplier in the box is 4.all of the arithmetic operations will be repeated at every 3 time steps. This results from the fact that from step N to N+2 the time space boxes associated with data buses are 6. there are 6 times space boxes and only 4 boxes are used by processors. The performance of this structure is better than that of the structure 2. In structure 3. the average efficiency of processors was 56% which is lower than that of structure 1. There are four processors arranged to perform different arithmetic operations at different times in
structure 3. The total number of operations in those 6 boxes should be 18. Although the number of data bus lines is reduced. but only 10 operations are
executed. more selectors than that of structure 2 are used. Therefore. Here. the data bus utility. because the input bus is always busy. From step N to N+2. The input data is fed at the proper time to the floating point unit(FPU) by selection
33
. which is 83%. and only 5 boxes were used to convey data. 3. is decreased by 17% compared with that of structure 1. In Figure 3. STRUCTURE 3 OF DIF BUTTERFLY In structure 2.9. the emphasis is to increase the processor operation
efficiency.
8
Butrl
ipeettini
trcue2
0
4
.K ~
CC
rIUR
3.
for
_ _
2nd row
+"&"Multipliers
_ _ _ _ _ _ _ _
N
Cr(ni)= Ar(nl)+Br(n1) Ci(n1)= Ai(nl)+Bi(n1)
R2r(n1) Rlr(nl)*Wr(n1) R2i(n1) = Rlr(nl)*Wi(n1) R3r(nl) = Rli(ni) *Wi (ni) R3i=
*Wr(nl)
________________________Rli(nl)
N
+
C(nl)
B(n)
Dr(nl)= R2 r(n1) R3 r(n.signals Si and S2. However.
35
.2
Time space diagram of DIP structure 2. the method for generating the S t e
p
Output Bus
__ _ _ __
Input Bus A(n)
1st row ALU's oper.R3r(n) Di(n)
= + R3i(n)
_________
4
______ _____R2i(n)
TABLE 3.1) Di(nl)
(ni) +R3i (ni)
1
________R2i
N
+
D(nl)
W(n)
Rlr(n) = Ar(n)Br(n) Rli(n)
Ai(n)Bi(n)
2
____ ___ ____ ___ _ _ _ _ _ _ _ _ _
N
+
A(n+l)
Cr(n) Ci(n)
= =
Ar(n)+Br(n) Ai(n)+Bi(n)
R2r(n)= Rlr(n) R2i(n)Rlr(n) R3r(n)= Rli(n)
*Wr(n)
3
*Wi(n)
*Wi(n)
*
___ ___ ___ ___ ___ ___ ___ ___
R3i(n)Rli(n)
*
Wr(n)
N
+
C(n)
B(n+i)
Dr(n) = R2r(n) .
N
+
D(n)
W 1) (n+
Rlr(n+l) Ar(n+l)
=

Br(n+l)
5 Rli(n+l) = Ai(n+l) .
36
.Bi(n+l) N
+
A(n+2)
Cr(n+1) = Ar(n+l)+Br(n+l) Ci(n+l) Ai(n+l)+Bi(n+l)
R2i(n+l)=
R2r(n+l)= Rlr(fl+l)*Wr(fl+l) Rlr(n+l) *Wi(n+l) R3r(n+1) = Rli (n+1) *Wi (n+1)
6
_______
~
=64
~
____
~
_____Ri(n+l1) ____
~
~
R3i(n+l)=
R
*Wr (n+l1)
input bus size output bus size
bits bits
=6
=64
# of execution steps per data sample
# of overlap steps for two adjacent data samples = 3 average efficiency of processors = 56% bus utility
=
83 %
TABLE 3.2
Time space diagram of DIP structure 2(continued).
As a matter of fact.selection signals and which functional signals F1 through F5 should be generated in this structure are important issues. The number of arithmetic operations in 3 rows should be 12. and W(n) to be manipulated are shadowed in this table. but the number of actual operations is 10. it always keep these processors busy. In Table 3. the size of data bus lines. and bus utility are the same as those of structure 2.
37
. execution steps. It still has the same 2 empty time space boxes as structure 2 in row N to N+2 as shown in Table 3.3.3 and 3. However. It is higher than that of structure 2. From Table 3.2. The complete cycle of
butterfly operations is 3. Hence. but is still lower than that of structure 1. the input data samples A(n). although fewer
processors are used than the previous structure. B(n). The processor average efficiency of this structure is higher than that of structure 2. the number of operations associated with the boxes in this structure is 1.3. The functional signals Fl through F5 are used to change the processors to the correct arithmetic function at the right time. it is clear that the environmental support to processors in structure 3 is about the same as that of structure 2. the average of efficiency of the processors is 83%. except that a different number of processors are used. Therefore.
CD.
38
.
C
I
M
l
0
0
FIGURE 3.9
Butterfly implementation in structure 3.
TABLE 3.
# of execution steps per data sample
# of overlapped steps in two adjacent data samples
average efficiency of processors = 83%.
39
.Ar(n+l)+
W(n+l)
=Ci(n+l)
=Rlr(n+1)=
Rli(n+l1)

Ai(n+l)+ Bi(n+l)
Ar(n+l)Br(n+1)
5 C(n+1) N
+
Ai(n+i)Bi (n+1)
A(n+2)
R2r(n+l)= Rlr(n+l)* Wr(n+l)
R2i(n+l)= Rlr(n+l)= Rli(n+1) Rlr(n+l)* Rli(n+l)*= Wi(n+l) Wi(n+l) Rli(n+l)* output bus size = 64 bits
=6
input bus size = 64 bits.3
=
3
=
bus utility
83%
Time space diagram of DIP structure 3.S Output t Bus e C(nl) N
Input Bus
Processor #1
Processor Processor Processor # 2 #3 #4
A(n)
R2r(nl)= Rlr(nl)* Wr(nl)
_____ _
R2i(nl)= Rlr(nl)= Rli(n1) Rlr(nl)* Rli(nl)* = Wi(nl) Wi(nl) Rli(n1)
___ _____*Wr
(ni)
B(n) N
+
Dr(nl)= R2r(nl)Rlr(nl)
lCr(n)
=
Di(n)= Rli(nl)+ R2i(nl) Ci(n)= Ai(n)+ Bi(n) Rlr(n)= Ar(n)Br(n) Rli(n) = Ai(n)Bi(n)
D(nl) N
+
21_____
W(n)
Ar(n) + Br(n)
_____
C(n) N
+
A(n+1)
R2r(n) Rlr(n) Wr(n)
=
*
R2i(n) =Rlr(n) =Rli(n) = Rlr(n)* Rli(n)* Rli(n)* Wi(n) Wi(n) Wr(n) Di(n)= Rli(n)+ R2i(n)
B(n+l) N
+
41_____
Dr(n) = R2r(n) Rlr(ra) Cr(n+1) Br(n+l)

D(n) N
+
.
the average efficiency of processors is 100%. and the number of overlapped steps is 3 for two adjacpnt data samples. In other words. The sizes of the input and output data ouses are still 64. The time space diagram is shown in Table 3. This situation can be improved using the time multiplexed bus for input and output to achieve a higher bus utility. the same as that of structure 1.4. not every processor is busy all the times. STRUCTURE 4 OF DIF BUTTERFLY In the previous structure. Only two
processors are used as shown in Figure 3. 8 steps are needed for completing one butterfly operation. In Table 3. If it is desired to keep the processors busy as in structure 1. two processors are always busy.4. A special device "1:4 DMUX" are i.sed to route the output of the ALU to
different buffers.4. In this situation.10. It is noted that in this
structure the bus time space usage repeats every 5 time steps. the controller and address sequence generator for this structure would be more complicated than that of the former structures. and to use fewer processors In structure than in of
structure
1.
40
. what can be done?
4. There is only about 50% usages from step N+3 to step N+7.
Z5
III I
IVI I I
I
I
I____
FIGURE 3.10
Butterfly implementation in structure 4
41
.
42
.4
=
100%
Time space diagram of DIF structure 4.Step N
Output
Bus
Input
Bus__
Processor #1
_ _ _ _ _ _ _
Processor W2
__ _ _ _ _ _ _ _ _
A(n)
____ ___
R3r(n1)= Rli(n1)*
Wi (ni)
R3i(n1) Rli(n1)*Wr(n1)
_ _ _ _ _ _ _ _ _ _
N+l
B(n)
___ ___
Dr(n1)= R2r(n1)R3r (ni)_
Di(n1)R2i(n1)+R3i(n1)
_ _ _ _ _ _ _ _
N+2 N+3 N+4
D(n1)
____Ar(n)+Br(n)
Cr(n)= W(n) Rlr(n)Ar(n)Br(n) R2r(n)= R3r(n)= Dr(n)= Cr(n+l)= W(n+l) IRlr(n+1)=
Ci(n)
Ai(n)+Bi(n)
C
C(n)
Rli(n)= Ai(n)Bi(n) R2i(n)=
Rlr(n)*Wi(n)
____ _____
_____Rlr(n)*Wr(n)
N+5 N+6
____R2r(n)R3r(n)
A(n+l)
____ ____________Rli(n)*Wi(n)
R3i(n)=
Rli(n) *Wr(n)
B(n+l) ')(n)
____ ___Ar(n+l)+Br(n+l)
Di(n)=
R2i(n)+R3i(n)
N+7 N+8
Ci(n+1)=Ai(n+l)+Bi(n+1)
C(nl1)
____
______Ar(n+l)Br(n+l)
Rli(n+l):= Ai (n+l)Bi(n+l)
input bus size output bus size
=64
bits bits
=
=64
W of execution steps per data sample
8
=3
# of overlapped steps in two adjacent data samples
average efficiency of processors = 100% bus utility Table 3.
the total number of time step boxes is 20. the emphasis is on using a single processor in the butterfly structure. STRUCTURE 5 OF DIF BUTTERFLY If only a single processor is allowed in the butterfly structure. The bus activities cycles every 10 steps.5 the total number of arithmetic operation is 10. An alternative configuration is shown in Figure 3. the total number of steps needed for one butterfly cycle is 13. In Table 3.5. it still needs a data size of 64 in both input and output buses. 6 for additions or subtractions. it is necessary to use a time multiplexed bus.
Additionally. However.5.11 where input data is selected for the FPU. the imaginary part of the input data must wait until the real part manipulation has been completed. One of the
disadvantages in this structure is that the real part and the imaginary part of the data can not be manipulated in a single processor simultaneously. and the output data from FPU is stored to registers selected by the control signals Sl thruogh S6.5. but only 5 boxes are used. The selection signals depend on activities shadowed in Table 3. In DIF Figure 3. the input data must be fetched and the output data must be stored. In order to increase the bus utility. From step N+4 to N+13. it is true that the bus
utility of 25% is lower than any of the previous structures. 4 for multiplication. Therefore.
43
. what would happen? In the following. From step N to step N+12.
I
II
A.
C''
o
C.
44
.: =C
FIGURE 3.r
)
M.11
Butterfly implementation in structure 5.
TABLE 3.Step
N N+1
output bus
______
input bus
A(n) B(n)
processor
Dr(n1)= R2r(n1)R3r(n1) Di(n1)= R2i(nl)+R3i(n1)
_____
N+2
N+3
______Ci(n)=
Cr(n)= Ar(n)+Br(n)
Ai(n)+Bi(n)
N+4
N+5 N+6 N+7
N+8
_____ ______
C(n)
W(n)
Rlr(n)= Ar(n)Br(n)
Rli (n) = Ai (n) Bi (n) Rlr(n) *Wr(n)
______R3r(n)=
______R2r(n)=
Rli(n)*Wi(n)
Rlr(n)*Wi(n)
______R21(n)=
N+9 N+10 N+11 N+12
______R3i(n)=
Rli(n)*Wr(n) A(n+1) B(n+1) Dr(n)= R2r(n)R3r(n) Di(n)= R2i(n)+R3i(n) Ar(n+l)+Br(n+1)
_____
______
D(n)
______Cr(n+1)=
N+1 3 N+14
N+15 N+16 N+17 N+18 N+19
_____ ______
Ci(n+1)= Ai(n+1)+Bi(n+1) C x1)
W(n+1)
____
Rlr(n+1)= Ar(n+1)Br(n+1)
Rli(n+1)= Ai (n+1) Bi (n+1) Rlr(n+1) *Wr(n+l) Rli(n+1)*Wi(n+1)
__R2i(n+1)=
_____
__R2r(n+1)=
______R3r(n+1)=
____
Rlr(n+l)*Wi(n+1) Rli(n+1)*Wr(n+1)_
______R3i(n+1)=
N+20
N+21
____ __
A(n+2)
B(n+2)
Dr(n+1)= R2r(n+1)R3r(n+l)
Di(n+1)= R2i(n+1)+R3i(n+ll)
input bus size = 64 bits.5
3
bus utility = 25%
Time space diagram of DIP structure 5.
45
.
output bus size = 64 bits
=
# of execution steps per data sample
13
=
# of overlapped steps in two adjacent data samples
average efficiency of processors = 100%.
12.6. it is obvious that the size to 32 of the input and The bus output buses are utility of this
decreased
respectively. In Table 3. . which was 100%. and the
comparison is listed in Table 3.
structure is still much lower than that of the structure 1. In the following section. The controller must know whether the
current data on bus is input data or output data.6. STRUCTURE 6 OF DIF BUTTERFLY This structure is a modified version of structure 5 shown in Figure 3. The bus utility calculation is similar to the previous approach.
46
. with only 9 boxes used for every 20 boxes of
input/output data. Increase of the buses utility by time multiplexing is achieved at the expense of more complicated controller and address sequence generator. The time space diagram is shown in Table 3. The bus utility is 45% in this structure. In this thesis. The bus utility of structure 2 and 3 were 83%. which is higher than the previous one.7. a design of a
controller and addressing sequencer of structure 1 will be presented.6. All 6 structures have been introduced. only the address sequence generator and controller of structure 1 are implemented.
'Ia
I
*'z
Iii
0
~ C.
Ell
FIUE31jutefyipemnaii
C
srcue6
==

47
.
Step
N N+1
Output Bus
______
Input Bus
Ar(n) Br(n)
processor
Dr(nl)= R2r(n1)R3r(nl) Di(nl)= R2i(n,l)+R3i(n1)
______
N+2 N+3
N+4
Ai(n) Cr(n)
___________Ci(n)=
Cr(nl)= Ar(nl)+Br(n1) Rlr(n1)= Ar(nl)Br(nl)
Ai(n)+Bi(n)
Bi(n)
N+5
N+6
N+7
______
Ci(n)
Wr(n)
Wi(n)
______R3r(n)
Rli(n)= Ai(n)Bi(n)
R2r(n)= Rlr(n)*Wr(n)
Rli(n)*Wi(n)
______
NIB______
N+9
___________R3i(n)=
R2i(n)= Rli(n)*Wr(n)
Rlr(n)*Wi(n)
N+10
_____
Ar(n+l)
Dr(ri)= R2r(n)R3r(n)
N+11
N+12
Dr(n)
Di(n)
Br(n+1)
______Cr(n+l)=
Di(n)= R2i(n)+R3i(n)
Ar(n+1)+Br(n+1)
N+ 13 N+14 N+15 N+16 FN+17
N+l8 N+l9 N+20
_________
Cr(n+1)
Ai(ns) Bi(n~l)
Rlr(n+1)= Ar(n+1)Br(n+1) Ci(n+l)= Ai(n+l)+Bi(n+1) Rli(n+1)= Ai(n+l)Bi(n+1) R2r(n+1)= Rlr(n+1)*Wr(n+1) R3r(n+l)= Rli(n+l)*Wi(n+l)
__R2i(n+1)=
Ci(n+1)
Wr(n+1) Wi(n+l)
Rli(n+1)*Wr(n+1) Rlr(n+l)*Wi(n+1)
_________
__R3i(n+l)=
____
__
Ar(n+2)
Dr(n+l)= R2r(n+l)R3r(nl)
input bus size = 32 bits;
output bus size = 32 bit;
=
# of execution steps per data sample
13
=
# of overlapped steps in two adjacent data samples
average efficiency of processors = 100%; TABLE 3.6
3
bus utility = 45%
Overlap time space diagram of DII' structure 6.
48
DIFi1 # of FPU chips (AMD29325)
needed
DIF 2 6
DIF 3 4
____
DIF 4 2
DIF 5 1
___
DIF 6 1
___
10
data bus size
(bits)
320
_ _ _
128 6 56% 3 83% 1539 *10 =15390
I
_ _ _
128
_ _ _
128 8 100% 3 50% 26530
128
_ _ _
64
___
*
# of executed ~steps_______ average efficiency # of overlap steps bus utility total # of executed steps for 1024 real data
points
I
5 100% 4 100% 516 *10 =5160
_ __
6 83% 3 83% 15390
13 100% 3 25% 51230
13 100% 3 45% 51240
I
_
_
_
I
_
__
I
_
_
_
_
I_
TABLE 3.7
Comparison of 6 DIP butterfly structures
49
C.
SOME VHDL BEHAVIORAL MODELS
The
objective of
this
section
is
to describe
a VHDL
modeling effort to verify an FFT system design and show the benefit of VHDL simulation at the data flow level. Only
structure 1 mentioned previously is used. 1. FULL PIPELINE DIF BUTTERFLY STRUCTURE The structure 1 mentioned in the previous section is a full pipeline structure. Figure 3.7 shows 10 processors and several internal registers holding previous partial results. There are some and other output registers data used by to hold weight
coefficients
produced
this
butterfly
structure. There are no multiplexed buses for input and output data. In order to reduce the response time of this butterfly structure, two different triggers are employed. Floating point processors are positive edge triggered. The registers are
negative edge triggered. In this way, only three and a half clock periods are needed to complete one butterfly operation. Otherwise, it would require 7 clock periods if either positive or negative edge data is employed alone. To avoid undesirable into this butterfly structure, and
signal
entering
undesirable output data generated out of it, enable signals, IE, OE, and ENABLE are needed. In this structure butterfly, the signal IE is used for input register enable, the signal OE for output register enable, and the signal ENABLE for
50
The flow chart shows the activities as below: • Initially. OUTA is an output signal used to show output data available on the output bus. 2. Both signals INR and OUTA are generated by the controller. which were mentioned in previous section. In this thesis. INR is an output signal used for input data request. INE is an input signal showing that the input data fetched has been completed. OUTE is an input signal showing that the output data has been stored. are generated by this controller which was needed to manipulate the butterfly
structure. Figure 3. and ENABLE. The
controller communicate with its environment via seven signals. How to generate those enable signals IE. Signals IE.processor enable. it is triggered by IN_E and OUTE generated by the address sequence generator.
51
. while signal INE and OUTE are produced by the address sequence generator. and ENABLE with appropriate timing is discussed in the
following VHDL model. while "clear a signal" means that a signal is set to 0. CONTROLLER FOR THE BUTTERFLY STRUCTURE This controller is designed to produce not only the
enable signal for the butterfly but also requests for input to FFT butterfly and output to be stored. CNT is an internal counter in this controller. 2 for input and 5 for output.13 shows the flow chart of the controller and its logical symbol. OE. OE. the action of "set a signal" means that a signal is set 1.
and then clear OUTA. this controller ask its environment to store the output data by setting OUTA. " When INE is set meaning that the input data is fetched to the end of the input data set. there is a need to obtain data from memory and feed it to butterfly to achieve the
calculation of an eight point FFT. The internal counter. • Finally. is used to detect the 5th cloc' period after the controller initiated the butterfly and the first input data was fed into the butterfly. clears OUTA. the main functions
52
.
3. However. the controller would automatically set the OUTA to indicate that the output data on output bus is
available.
* At the proper time. It also sets INR. When data becomes available at the output. The input data is going to be fed into the butterfly by setting the INR. the controller would stop input data fetching by clearing INR. ADDRESS SEQUENCE GENERATOR According to Figure 3. and asks for data fed from RAM into butterfly. when OUTE is set due to finishing the data set. Table 3.1 shows that 5 steps occur between fetching the data from RAM to producing output on the data bus. Hence. the output of the butterfly would be available by setting OE. in the above description it did not mention clearly when the output port of the butterfly structure would be open.5. the controller would close the output port of the butterfly immediately by clearing OE. When the number in CNT is 5. CNT."
It will initiate IE and ENABLE to activate the butterfly. and close the imports of the butterfly by clearing IE.
s=.E 'C [ ' 
. set CUT__A 2.ha inoee
N
NE
into
N. trun off OE & ENABLE
FIGURE 3.
53
. set OUTA 2. 2. turn on OE
1.: IN _R 2..
Y
1.
2. clear CNT 3.increment CNT 3. turn cn IE&ENALEo.
0
t1
2. load Cata into Rea 3. increment CNT A..13
Controller flow chart and its logical symbol..T RC LL
1. clear OUT__A .turn off 1E
1.CNT
lata 1~2 .~~rt
N'
Cjc:T.sz. Increment.
R2/W2.
associated with the Figure 3. ZN is the number of data samples of the FFT. State 5 and state 6 of Figure 3. ADD1. and output signals STAGE_CNT. R2/W2. The connection of RAM and butterfly is shown in Figure 3.14. RI/WI.5. ADD2.17. In
butterfly 3. and memory chip enable signals.of
this
generator
are
to
produce
the
input
and
output
addresses for memory access. OSTO and FFTCMP. and ISTO. real/write signals. The interface includes input signals CHE. Memory enable signals contain chips enable OEl. Another function of this generator is to cooperate with the controller. CHE 54
. and R3/W3 are also required. Since it is necessary to fetch input data A. three RAM modules are used. ADD2. OE2. and ADDR2 and signals OEl. ADDR3 are used to fetch the weight coefficient Wk from RAM. R3/W3. OUTE. The address sequence generator also cooperates with the universal controller at a higher level of hierarchy. and OE3. algorithm is In this thesis.15 occur when the predetermined 2N value has been reached. Signals OE2. Wk concurrently. INR. B. and ADDR1 are used to access memory RAM 0 and RAM 1 respectively. and ADD3 are shown with bold signal lines representing a bus. They cooperate via four signals INE. Figure 3. and ADD3.
are generated these signals
according for data
Figure
addresses include ADD1. LEN. the The input and nonbitreversal output addresses to bus
implemented. The signal OE3. and OUT_A which were mentioned in the previous section. Memory read/write
signals Rl/Wl.15 is the flow chart of the address sequence
generator.
the signal ISTO would be set to 1._
OZT7 L{ I C 4 .A
A
A. in Figure 3. STAGECNT represents stage counter number in the FFT algorithm.
LK
FIGURE 3.:SS SECUENCER CE1. It uses the signal ISTO to indicate which of the two RAM.~ K. LEN represents the input data length. The universal controller uses signal CHE to start the address sequence generator. RAM 0 or RAM 1. the input data is stored. ISTO represents a pointer signal of the initial input data in the RAM. The signal STAGECNT keeps a number to tell the external universal controller which 55
. and sets N on the signal LEN.14 The block diagram of address sequence generator and controller.7RATOR O0u T_
OL. For example. OSTO represents a pointer signal of the output data in the RAM. Before the beginning of the FFT data flow. the universal controller loads N number of pairs of input data. FFTCMP represents the FFT completion.L EP
I
. represents chip enable.18 if the input data is stored in RAM 1.
1.writeoutputdate
I. increment WCNT
wCNT= 2N
%.
[~et FFT_CMP
8. generate next WADR 6.
1.read inputdata
2.15
Address sequence generator flow chart. set OUTE T1. generate next RADPR 2. 1 increment STAGECNT
7. clear INE & OUTE
I..
56
.E

5.read inputdata
y 2. clear FFTCNP & ST _ C NT 7
1.1
a Ir
0.increment RCNT
4..increment WCNT 3.
~
~
A
\U 40 CuA
OUTAz I
CU.
FIGURE 3.
TY
set I
( L....CNTz2N

N
& :iNE~ I OUTE
I. set 1N...
1.CNT=2rN
5.
.
•::'AGECNT =LoG(2NTF:.~0CUlA1 4.R
0
: OU
0NR
OLJ'0
. increment RCNT 3. 2.generate next WADF. 6. 1
...ciearRCNT & WCNT 2.s
.I..
1. yY 1. O tne 0 1a pair N 2. generate next RADR 3. write outoutdata 5.
which are initially stored in RAM 1 shown in Figure 3. Before the universal controller starts the FFT.
ORI/OWI. It will
indicate where the input data is stored by setting ISTO to 1.17 and 3. OCH2. OR3/OW3. For example.15. the universal controller would set N to 4 on signal LEN and use selection signals Cl. and output enable BE. 57
. and one group of memory access signals OCH 1. The signal OSTO is used to indicate where the output data is available from the two RAMs. and CADD3.18. once the signal STAGECNT reaches 3. it would store the input data set into one of the two RAMs using those memory access signals drawn at the bottom of Figure 3. which is an 8point FFT shown in Figure 3. This represents the FFT completion.stage of the FFT is executed currently in the butterfly. CADDI.17 and 3. the signal FFTCMP would be set. The number in the STAGECNT would count from 0 to 2. shown in Figure 3.18 include the signals of memory access OCHI. Selection signals S1 and S2 are used to control the "3 to 1" selector. which results from the iog 2 (8). C2. and ORl/OWl. if the number of pairs of data to the FFT is 4. CADD 1. selection signal Cl. Those signals provide a way that the universal controller can use to fetch input data and store the results of the FFT. Those signals drawn at the bottom of Figure 3. for a complete 8point FFT. OR2/OW2. the total executable stage is 3.18.18. There is another way for memory access to provide data to the universal ccntroller. C2.5. CADD2. As shown in Figure 3. OCH3. For example.
the controller is ready. The number stored in RCNT is incremented by 1. check the status of INR and OUTA generated by the controller in the following.
internal counters of read and write operations. in this case being 0 at the end of the FFT. the activities of the address sequence generator can be summarized. " First. the address sequence generator responds to the universal controller by setting signal
FFTCMP. According to the pointer OSTO. When both IN R and OUTR are clear. INE and OUTE.Then it activates the address sequence generator by setting CHE. and the butterfly needs to be fed with data. 2. Load N with predetermined number of pairs of data to be transformed. OCH2 and BE. so the address sequence generator would wait until INR is set. In the following. When the INR is set. and stores the output data of the FFT butterfly back to RAM. clear FFT CMP and STAGECNT. the universal controller would fetch the results of the FFT from RAM 0 via CADD2. the address sequence generator selects the input data from RAM. the controller is not ready. Using
signals. clear RCNT. The source program of tPie address sequence generator and the controller are attached in Appendix E. WCNT. * Third. would indicate where the results of the FFT are stored. When the FFT is done. Signal OSTO. In the execution process of the FFT. the signal STAGECNT would tell the universal controller which stage of FFT is S1 and S2 and two groups of memory access
active. Let RCNT and WCNT be the 2/OW2. 1. " Second.
58
.
check the number of R CNT and W CNT. only static RAM. The number stored in RCNT and WCNT are incremented by 1.
4. because input is a 32bit floating point number. Once the IN E and OUTE are set. As mention above. The address sequence generator would increase the STAGE CNT and compare it with the total stage number required. the INE would be set. if the predetermined number is reached for each counter.
59
. and the output data coming from it are available on the output data bus. The number stored in W_CNT is incremented by 1. for example. the execution stages should be 3. If the number in the STAGECNT has counted to this execution stage number.16. having a separate input and output data bus was implemented. the stop signals IN E and OUT E would be transmitted to the controller. When OUT A is set. the controller had opened the output portof the butterfly. If they are not yet the same do the next stage again. the butterfly needs to be fed with data. date set up time and access time associated with the read cycle and the write cycle are shown in Figure 3. when the data read is complete. * Finally. Several parameters.
* Fourth. For example. In order to reduce the complexity of the signal timing in RAM and simplify the model of the RAM. The size of the RAM is 256 by 32. 4. When both IN R and OUTA are set.3. the address sequence generator would set the signal FFTCMP. and the data on the output data bus is available. For example if the total number of pairs of data is 4. The RAM VHDL model is attached in Appendix F. a random access memory model is necessary for the VHDL simulation. indicating that the FFT operation is completed. RAN
Since
there
is
memory
storage
required
in
this
structure.
In order to reduce the total size of the FFT design. D. Their behavior is described in the data flow design of the FFT for simulation. Three 2 to 1 selectors are used to decide where input data is to be fetched from and where output data is to be stored. several VHDL models which are associated with the data flow of the FFT system were built. registers. SIMULATION OF THE DATA FLOW DESIGN OF FFT Right now. he can change the size of the local variable DATAMATRIX to increase the storage of the RAM. and buffers were not modeled at the chip level.17 is an original description. are left and have The a 2 faster to 1
simulation.
several
elements
out. the universal controller uses
signal C1 and memory access signals of RAM 1 or RAM 2 to select data on the input bus and store it into RAM 1 or RAM 2 respectively.
selectors. where 6 pairs of RAMs with 256 by 32 bits are required to read and write data. Shown in Figure 3.only a few timings are concerned in this model program. If someone needs a larger sized RAM. each RAM module contains
three blocks of RAM for storing A. The universal controller also uses CHE
60
. B. Assuming that the initial input data is stored in RAM 1. In this situation. the universal controller would load the length of the input data pairs on signal LEN. In Figure 3. It then indicates where the input data is by setting signal ISTO. and coefficient Wk.17.
READ CYCLE TIMING
VIH
tc(rd)
ADDRESS A
VIL
ADDRESS VALID
VIH CHIPr SELECT S
.
61
.
VILI
VIH
tds
S(S)
OUTPUT DATA
?V
Q
t
'
WRITE CYCLE TIMING
VIH
ADDRESS A
VIL
ADDRESS VALID
" i1 "t isu( A)
t.
INPUT DATA
VL
D V
A OO T C
Esu
D VALID
VIH OUTPUT Q
VL
HIZ
FIGURE 3.16 Timing of read cycle and write cycle (adopted from National CMOS RAM data book)..h( A
tw(
WRITE ENABLE
VIH
s (A
W
VIL VIH
CHIP SELECT
L
t ( s )"
. I .
1
_
=
A
A
liii

z2
p
"j l
I

: :
..22
.
I __C L
d C
t1
I
e

Ci
~
0
FIGURE 3.17
The original data flow system of FFT.
62
..IA"'L.
and so on and so forth.5.
The
address
sequence generator would generate access signals OEl. the signal STAGECNT always reveals to the executed. the output data of the first stage would then be of the input data of the second stage. the revised version of the design is created in Figure 3. In the manipulation of the data flow. In Figure 3. The size of the internal bus lines was reduced from 128 to 64. the total number of execution stages is 3. At universal controller which end of the FFT stage the is being address
the
operation.18.to
trigger
the
address
sequence
generator. The completion flag is then set on the signal FFTCMP.5. 63
. all the data flow operations are similar to what was mentioned earlier with the exception of the number of selectors.
sequence generator would indicate to the universal controller about where the final output data is stored via the pointer signal OSTO. and internal data buses used are reduced. Since the original FFT design in Figure 3. As shown in
Figure 3. as shown in Figure 3. The output data of the second stage FFT would again be stored back to RAM 1. and ADD1 to fetch the first input data to the FFT butterfly after the controller has been initiated by the signal INE and OUTA. RAM size. it will store output data from the butterfly of the first stage to RAM 2 via the selector enable S1. Since the universal controller stores the input data in RAM 1. If the input data number is 8.18. Ri/WI.5 operating system.17 is too large to be accommodated in the VAX VMS 4.
using of the created FFT system for a Discrete Cosine Transform is discussed. a successful example of the simulation result of the revised FFT system is shown. therefore. The complex data.
64
. This design was successfully simulated on the VMS 4. because the output data of RAM with size 64 can not convey two complex numbers. The flow chart of the universal controller is shown in Figure 3. which requires a size of 128. It is split into two separable data buses of size 64 and multiplexed into RAM.In Figure 3.17 are triggered at different edges of the clock.
the output data bus of the FFT butterfly
contains C and D outputs. In Table 3.18. needs to be multiplexed onto the two registers. In the next chapter.8. The two registers A and B shown in Figure 3. This is a full pipeline structure that requires several VHDL models.5 operating system.19. In this chapter the data flow models of a FFT system was discussed.
0 + l.Oj 6.Oj 2.8 comparison of the FYT result of using the MATLAB function and this simulated FFT system.0

8.Oj .0 .5.
4.0

2.
65
.2426407 + 14.Oj .0 3.0 1.Oj + 0.0
+
0.Oj
10.0710678j
1.6568542j + 2.6568542j 5.1.Oj
2.2426407 0.Oj
2.0 5.0 10.0

4.0
*10.0710677j

10.Oj
output data using HATLAB function
9.0

0.Oj .Oj 2.Oj
.2426407 + 14.6568542j
output data using simulated program
9.0 2.input data have 8 complex number 2.0
4. 1 .10.2426407
5. .Oj 2.
1.Oj

6.Oj 3.6568532j
TA&BLE 3.0 + l.Oj 3.0710678j
10.0

2.0

8.0 + 2.0 + 2.0 .0710602j
5.
=
zo....
.
fFT
66
...... ': ..L LS

•
"II t*..
IFIGURE318 he re ie dat flwsse

.... .. .
19
The flow chart of the universal controller.
67
.E 3.r
FTCH output data
FIGUR.1.ITIA. LOAD in Pt d a~ &
2.'TE those signals to active the FFT system
3.
IN".
necessary to know the difference between the it is of
formula
Discrete Cosine Transform. and the formula of Fast Fourier Transform.. Before the structure of DCT system is designed. In this chapter. and storage. of the DCT include image data compression.3)
From the equation (4.IV. the Fast Fourier Transform
implementation was discussed..
68
.2)
a(K) = V2
for K= 1I.1). the relationship between DCT and FFT is derived as.1)
a(0) = V/7/N
for K =0
(4. N1
(4. the discussion is focused on the DCT using the system designeL for FFT.
Applications
coding.
THE DATA FLOW DESIGN OF THF DISCRETE COSINE TRANSFORM INTRODUCTION TO DISCRETE COSINE TRANSFORM(DCT) In the previous chapter. A. The onedimensional (u(n). 0<=n<=Nl) is define as
N1
DCT for a limited
sequence
V(K) = a (K)
n0
u(n) cos (
(2n+l) k/2N))
(4.
the data from the FFT calculation and scale weight factor Hk/2 (K) must be stored in RAM.
69
.4)
conversion of the FFT to the DCT can be done in 3 steps.4)
U(K)
=
u(n) e
n0
j 2 z
/N
(4. a complex multiplication.5)
The total number of input sequence N must be an integer number of power of 2 [Ref. The scale factor a(K) and the FFT weight factor Wk/2 can be merged. From the equation (4. which can be written as
Hk/ 2 (k) = a(K) *WK
2
(4. one addition.V(K)
= Re[ a (K) ejk
N1
2
N*U(K)
(4. and taking the real part of the data. Prior to calculating the DCT. and one scale multiplication when floating point operations are counted. Then. This requires two real multiplication. 9].6)
In
this
way. a scale multiplication. two real data multiplications and one addition will yield the result.
it
is
possible
to
reduce
the
number
of
multiplications from 3 to 2.
3. CHE. One is to use the full pipeline structure. CADD3 and BE. OR2/OW2. In Figure 4.18. this requires the memory address
sequence generator to access data stored in RAM. The third group of signals.
include LEN. OCH2. additional circuitry is used to perform two
multiplications and one addition to obtain the Discrete Cosine Transform.18.2 shows the block diagram of the FFT and the external universal controller. are the status
signals from the FFT system. once the output data from the FFT system is stored in memory. CADDI. are associated with memory access in the FFT system. OCH1. The universal controller discussed in the previous
70
. a full pipeline structure uses 3 additional processors. C2. Figure 4. and STAGECNT. In addition.
THE DISCRETE COSINE TRANSFORM SYSTEM IMPLEMENTATION
Two methods to implement a DCT system are discussed here. Cl. CADD2. A second method of implementing the DCT is shown in Figure 4. shown at the lower hand corner in Figure 3. the other is to modify the universal controller of the FFT system discussed in the previous chapter. FFT_CMP. OCH3. The second group of signals. The first group of signals shown at the bottom of the Figure 3. In other words. The interface signals include three groups signals. ORI/OWl. and ISTO which are used to initiate the address sequence generator in the FFT system. 2 for multiplication and 1 for addition.1. OSTO.B. OR3/OW3.
j a(K)*e ik/ZN.3. Therefore. After the complete output data is generated from the FFT butterfly. This approach will complicate the universal controller. If it is undesirable to build any additional circuitry.8). only the real part of D is kept. In the Figure 3. and Wk are input data. After the complete output data of FFT is generated. whereas C and D are output data. there is a trade off between these two methods.
and B be 0.
For Discrete Cosine
Transform. method two can be adopted.8)
A.7) and equation (4.
let Wk be same
A be U(K). the result of DCT is needed to go through the butterfly for one more cycle. The real part of the output data is the result of the Discrete Cosine
Transform. with 3 processors and a local memory access sequence generator. B.7)
D = (A .chapter is modified to complete the Discrete Cosine Transform of the input data. 71
. If the first method is used.
C
=
A +B
(4. It is straight forward to modify the flow chart of the universal controller of Figure 3. one more
cycle through the butterfly is needed if we want to do DCT for original input data. Based on equation (4. it is necessary to build additional circuitry. the butterfly structure of DIF nonbitreversal algorithm was shown where the input and output have the following relationship.19. In this way the
butterfly can yield another output D.B)*Wk
(4.
the improvement and future research of this thesis will be discussed
72
. In the next chapter.The idea of how to get a Discrete Cosine Transform result using an FFT structure is discussed here.
Hr
ENABLE
C.K
R/w
CH
ADCR
[A flp CY. Hr: the real Part of Scale weicht factors Hi: th e imaoi nary part of scale weignt factors P/W.:
R/w
H ADOR ENABLE
sequencer
CL K Ar: the real part orl data coming from FFT system output Ai: th e imaginary part off dEta coming fromn FPT system output * *ADCRP.
73
. the input data come from the FFT system output.1 Full pipeline structure to implement the DCT system.J are the mecory accession sicnal [ DOR is the in itinal inzutL cat@ ECddress ClB the c hic ena: le cf SECUEncer ) FIGURE 4. &CH.
74
. LEN CHE ISTO OSTO PFTCMP 7AGECNT
(iricuo.C2 OP I /OWI
0 R2/ '.g x::ve~y.2 Block diagram of the universal controller and the FFT system.
WK
~
otol
CLK
CLK
FIGURE 4.v2
OCH2
FT SYSTEH
UNIBVERSAL
CON'TROLLER
OCH3
CAD Delerc ~ I~er CADD2 CADO 3.
LOAD input data &
2.C NP
3.
INITIA TE those signals to activate the FFTsystern
FFT.3
Modified flow char~t of the universal controller.
INITIATE those signal to
5.
75
.FETCH the real part of output data
FIGURE 4. LOAD the scale weight factors
4.
A "trial and error" approach was often taken. In this thesis.5. One problem is related to the source programs created under VHDL version 1. Many signal processing algorithms require sumofproduct operations that are well suited to designs
discussed in this thesis.0. The result is shown in the Table 3. CONCLUSION
CONCLUSION
Although this thesis modeled the floating point arithmetic processor "AMD29325". A. the data flow design of FFT in the full pipeline butterfly structure has been built and the model has been verified. Due to limitation of time the data flow design of DCT is not fully simulated. This problem developed due to the software version change.V. the methodology can be applied to other digital signal processing systems. In the Intermatrix VHDL version 1.8. A few problems were easy to solve such as the syntax errors. There are still unresolved problems. For example.5 that can not run under VHDL version 2. but many problems were difficult to overcome. and the DCT
system. Many problems had been encountered in the study. It can not generate a triggered pulse
waveform in the interactive simulation mode. data flow FFT systems. it can not print a negative value in the report file. there are several internal
problems. When the "BLOCK"
76
.
However.
77
.8 there are still some errors in rows !.18.is used in the VHDL source program. In this thesis truncation was used to deal with the large values generated when the length of mantissa size exceeded 23 bit of the IEEE mantissa size pattern. The very important experience here is how to deal with system design in topdown design methodology and how to use VHDL simulation to analyze systems to get an optimum design. IMPROVEMENTS AND FUTURE RESEARCH
The data flow designs of a Radix 2 FFT in DIF algorithm and the data implemented thesis can in be flow designs of a DCT had been discussed and this thesis. several areas in this improved. Several directions are
listed in the following for future research. It is replaced by the revised program which is shown in Figure 3. it would generate some unexpected sice effects. In Table 3. For further improvement a rounding method should be used. These errors were caused by truncation. 6. in Chapter III the
original FFT design does not run on the VMS 4. and 8 of the output data from the
FFT system simulation. For example.5 operating system because of the size and complexity of the design used in the source program. Hierarchical design is an important approach that allows step by step solution to circuit design.
B.
double precision. formats: single extended
precision. 10]. TO "ERFORM THE RADIX 4 FAST FOURIER TRANSFORM IN DIT OR DIF ALGORITHMS It is possible to further reduce the number of
calculations required to perform the FFT by using a radix 4 algorithm provided that the number of input data is an integer power of 4.1.
78
. the advantage of the radix 4 algorithm is to reduce the number of multiplications by 25% [Ref. 3. floatingpoint to integer conversion. and double extended precision. integer to floatingpoint conversion. These formats are shown in Figure 2. TO ADD SEVERAL OTHER FUNCTIONS ASSOCIATED WITH THE AMD29325 OPERATION In this thesis. As shown in Table 5. Two basic signal data flows in DIT and DIF
algorithm for radix 4 are shown in Figure 5. There are other functions shown in Figure 2.1.1.2. only four floating point arithmetic operations are implemented. and DEC to IEEE format conversion. TO IMPLEMENT THREE ADDITIONAL PRECISION FORMATS TO IMPROVE THE ARITHMETIC ACCURACY Only single precision is There are three other precision employed in this thesis.5 associated with the AMD29325 operation including the floatingpoint constant substraction. 2. IEEE to DEC format conversion.
Jx 2
3
W3k
FIGURE 5.
Wk+ X2 •W2k + X3 Wk .X .x Vj2k + jx
w3k
X1
K j 2K KX X
X.Jx1 )X =x 2
2
3
* W3k
X2
0
W2k .x. top is bottom is the DIP algorithm.X3 .xo
xI
=
= X0O+ Xl1 + X2 + X<3

x x .2 + x3
Ix I

)
w
X1
(xO
X'+ JX) *W 3
X2 X3 
2K
X2
= (X

X1 + X2 xX 3 ) • W2k 
X3 = (x 0 + IxI .1 Butterfly in Radix 4.XO
xo

.
79
. W2k.x3) ° W3k
X0 C

X0
=
Xo+ X. = Xo .
K
XO
X . * Wk+ X2 *
+
w3k
X3
3
X0
iX1 " Wk .
the DIT algorithm.x 2 .
1 operations needed in Radix 2 and Radix 4.Radix 2 N 64 256 1024 (N) 192 1024 5120 (+) 384 2048 10240
Radix 4 (*) 144 768 3840 (+) 384 2048 10240
The comparison of total number of arithmetic TABLE 5.
80
.
5. In addition it provides four 40 bit programmable complex
accumulators to facilitate operations in radix2 and radix4 algorithms. If the address sequence generator is modified to recogniz. the total number of weight factors
needed for an 8point fast fourier transform is 12. 1/2. one (Ref.4.
81
. only 4 weight factors are different. k = 0.e. 11] can special be chip for FFT operation called The CVP implements a "CVP" 32
used. the memory needed to stored weight factors can be reduced. 5.
full
bit complex multiplication on chip in a single clock cycle. The number of fetches for the weight factor is also 12. TO IMPROVE THE ADDRESSING SEQUENCE GENERATOR TO REDUCE FETCHING IDENTICAL WEIGHT FACTORS In Figure 3. TO BUILD THE FAST FOURIER TRANSFORM USING A SPECIAL "COMPLEX VECTOR PROCESSOR (CVP)" CHIP In order to increase the speed of the FFT simulation program. In fact. 1/4. i. and 3/4. the identical weight factors.
'X'. package refer is type BITARRAY is array( integer range<> ) of BIT. constant sprecision: integer := 32. type LOGICMATRIX is array( integer range<> ) of LOGIC ARRAY( 31 downto 0) constant dprecision: integer := 64. function BITSARRAYTO INT( bits: BITARRAY) return NATURAL. function ISOVERFLOW( expbits: BITARRAY. precision:INTEGER) 82
. zero bit:BIT. type LOGICARRAY is array( integer range<> ) of LOGIC LEVEL . funbtion BITSARRAYTOFP( bits: BITARRAY) return REAL .length: NATURAL) return BITARRAY. function UNHIDDEN BIT( bits: BITARRAY) return BITARRAY. function FP TO BITSARRAY( fp: RLAL. function SHIFL TOR( bits: BITARRAY . type FLAG is record ovf bit:BIT. use std. type BITMATRIX is array( integer range<> ) of BITARRAY(31 downto 0) . function INTTOBITSARRAY( int.all.'0'.'Z). type LOGICLEVEL is ('1'. length: NATURAL) return BITARRAY . unf bit:BIT. times :integer) return BIT_ARRAY.standard. end record.APPENDIX A: THE ELEMENT FUNCTIONS OF THE FPU
these element functions associated with FPU(floating point unit) library std . nan bit:BIT.
precision: INTEGER) return BOOLEAN.5.LTARRAY.5. function ADD(signa:BIT. end refer . precision:INTEGER) return BITARRAY . .0. bits a: B. precision:INTEGER) return BITARRAY. function DECREASEMENT(bits:BITARRAY. function BACKTOBITSARRAY(expbits:BITARRAY. end if . bitsb: BITARRAY) return REAL. variable index :REAL := 0. begin for i in bits'range loop if bits(i) = 'I' then result := result + index . index := index*0. function ISZERO( bits: BITARRAY) return BOOLEAN. package body refer is function BITSARRAYTOFP( bits:BITARRAY) return REAL is variable result :REAL := 0. function INCREASEMENT(bits:BITARRAY.exp_bits: BIT ARRAY. precision:INTEGER) return BITARRAY. precision: INTEGER) return FUAG. fp:REAL. sign_b:BIT. function ISNAN( exp bits: BITARRAY) return BOOLEAN. function BECOMENAN( bits: BITARRAY) return BIT_ARRAY. function SET_FLAG( bits. function BECOME ZERO( bits: BITARRAY) return BITARRAY.
function ISUN ERFLOW( exp_bits: BITARRAY.5 = 2**(l)
83
.return BOOLEAN.
end loop; return result; end BITSARRAYTOFP; function FP TO BITSARRAY( fp: REAL; length: NATURAL) return BITARRAY is variable local: REAL; variable result: BITARRAY( lengthi downto 0); begin local := fp ; for i in result'range loop local := local*2.0 ;
if local >= 1.0 then
local := locall.0; result(i) := '1'; else result(i) := '0'; end if end loop ; return result ; end FPTOBITSARRAY function INT TO BITSARRAY( int,length: NATURAL) return BITARRAY is variable digit:NATURAL := 2**(length1); variable local:NATURAL ; variable result:BITARRAY(lengthl downto 0); begin local := int ; for i in result'range loop
if local/digit >= 1 then
result(i) := '1';
local := local  digit;
else result(i) := '0'; end if; digit := digit/2; end loop; return result; end INTTOBITSARRAY; function BITSARRAYTOINT( bits: BITARRAY) return NATURAL is variable result :NATURAL := 0; begin for i in bits'range loop result := result*2; if bits(i) = 'I' then 84
result := result + 1; end if; end loop ; return result ; end BITSARRAYTOINT; function UNHIDDEN BIT( bits: BITARRAY) return BITARRAY is variable result : BITARRAY(bits'length downto 0); begin for i in bits'range loop result(i) := bits(i); end loop; result(bits'length) := '1'; IEEE format return result; end UNHIDDENBIT;
function SHIFL TO R( bits: BITARRAY; times :integer) return BITARRAY is variable number:integer := times; variable result : BITARRAY(bits'length1 downto 0); begin for i in bits'range loop result(i) := '0'; end loop; while number <= bits'lengthi loop result(numbertimes) : bits(number); number := number+l ; end loop; return result; end SHIFLTOR; function ISOVERFLOW( expbits: BITARRAY; precision: INTEGER) return BOOLEAN is variable result: BOOLEAN ; begin case precision is when 32 =>  single precision if exp bits =B"11111111" then result := TRUE; else result := FALSE; end if;  double precision when others => if exp bits =B"l1111111111" then result := TRUE; else result := FALSE; 85
end if; end case; return result; end ISOVERFLOW; function ISUNDERFLOW( exp_bits: BITARRAY; precision: INTEGER)
return BOOLEAN is variable result: BOOLEAN ; begin case precision is
when 32 => single precision
if expbits =B"00000000" then result := TRUE; else result := FALSE; end if; double precision when others => if exp bits =B"00000000000" then result := TRUE; else result := FALSE; end if; end case; return result; end ISUNDERFLOW; function ISZERO( bits: BITARRAY) return BOOLEAN is variable result: BOOLEAN ; begin for i in bits'range loop if bits(i) /= '0' then result := FALSE; return result ; end if; end loop ;
result := TRUE
return result; end ISZERO;
mf
function ISNAN( expbits: BITARRAY ) return BOOLEAN is variable result: BOOLEAN ; begin for i in expbits'range loop if expbits(i) /= 'I' then result := FALSE; return result ; 86
ovf bit := '1'. end loop . return result . if ISOVERFLOW( expbits. end if. end loop . function BECOME NAN( bits: BITARRAY) return BIT ARRAY is variable result: BITARRAY(bits'left downto bits'right). begin for i in bits'range loop result(i) := '0'. bitsa: BITARRAY.exp_bits: BITARRAY .zero bit := '1'. result. precision) then result. elsif ISUNDERFLOW( expbits. if ISZERO( bits) then result.unf bit := '0'.
result := TRUE
return result.
function SETFLAG( bits. end BECOME ZERO. result. precision) then result. signb:BIT. result. function BECOMEZERO( bits: BITARRAY) return BIT ARRAY is variable result: BITARRAY(bits'left downto bits'right).ovf bit :='o'.zero bit := '0'. return result. function ADD(signa:BIT.nan bit := '1'. begin for i in bits'range loop result(i) := '1'.unf bit := '1'. end if. end SETFLAG.nan bit := '0'. precision: INTEGER) return FLAG is variable result: FLAG . 87
. end ISNAN . return result. end BECOMENAN.end if end loop . result. begin result.
0.precision ) then result := bits . end if. frab := BITSARRAYTOFP(bitsb). end ADD. when "10" => siga := 1.
88
. fraa := BITSARRAYTOFP(bitsa).
end case. variable fra b: REAL.
result := abs(siga*fraa + sigb*frab)
return result.0.
when 101"1 =>
siga := 1. variable length : INTEGER := bits'length .initial condition C(0)=1 variable bit num :integer := 0. sigb :=1. begin xbuff := signa&signb. variable xbuff: BITARRAY( 0 to 1). return result. variable sig_b: REAL. variable carry : BIT := '1'. when "I1" => siga :=1. case xbuff is when "00" =>
siga := 1. precision: INTEGER) return BITARRAY is variable result : BIT ARRAY( bits'lengthi downto 0 ).
while bit num <= lengthl loop
buf := bits(bitnum) & carry . sigb := 1. variable siga: REAL. variable buf : BITARRAY( 0 to 1 ). . when "01" => carry := '0'.0. sigb := 1. result(bit_num) :='0'.0.0.0.0.0. function INCREASEMENT(bits: BITARRAY. variable fra a: REAL. begin if ISOVERFLOW( bits. case buf is when "00" => carry := '0'. sigb :=1.bitsb : BITARRAY) return REAL is variable result: REAL.
return result. result(bitnum) := '1'. result(bitnum) when 'll" => carry := '1'. variable buf : BIT ARRAY( 0 to 1 ). :.
1. initial condition C(O) = 1 variable bit num :integer := 0. fp:REAL.precision ) then result := bits .
function DECREASEMENT(bits:BITARRAY. bit num := bit num + 1. end INCREASEMENT
:= '1'. when "01" => borrow := '1'. when "I1" => borrow := '0'.result(bit_num) whn "10" => carry := '0'. precision:INTEGER) return BIT ARRAY is variable result : BIT ARRAY( bits'lengthi downto 0 ). :. result(bit_num) :='l'. return result.it. end DECREASEMENT function BACKTOBITSARRAY(expbits:BITARRAY. precision:INTEGER) return BITARRAY is 89
. variable length : INTEGER := bits'length . begin if ISUNDERFLOW( bits. when "10" => borrow := '0'. result(bitnum) := '0'. resul4(bitnum) end case. while bit num <= lengthl loop buf := bits(bitnum) & borrow. end if. end loop. bit num := bit num + end loop. case buf is when "00" => borrow := '0'. variable borrow:BIT := '1'. result(bit_num) :='O'. end case.101. return result.
0. if fra value = 1. return result.5 ) > 0. end if . result := expbits_buf & bitsbuf . variable result: BIT ARRAY(lengthl downto 0) variable bitsbuf: BITARRAY(lengthlexpbits'length downto 0) variable fra value: REAL.0 then fp_buf := fp_buf / 2.0 then fpbuf := fpbuf * 2.precision). return result .precision). end if .1. if underflow condition occurred if ISUNDERFLOW( expbits buf. end loop.0 then 90
. set the fra bits result := exp_bits_but & bits buf.0 and ISOVERFLOW( exp_bits . bitsbuf := BECOMEZERO( bits buf). else while abs( fpbufl. be careful input prarmeter must be positive real value begin if fp = 0.variable length:INTEGER := precisioni.0 and fpbuf >= 1.precision) then bits buf := FP TOBITSARRAY( fpbuf.0 then result := BECOME_ZERO( result ). end if end if. . elsif fpbuf < 1. end if. become 0. if( fp>l. precision)) then result := BECOMENAN( result ) .it produces value over between 1 an 2 fra value := fp_buf .0. return result.precision)) then result := BECOME_ZERO( result ). variable expbits buf :BIT ARRAY( expbits'length1 downto 0) := exp_bits. return result.bits buf'length ). if ( fp<l. if ISOVERFLOW( expbits_bufprecision) then exit when( fp_buf <= 2. variable fp_buf :REAL := fp. expbitsbuf := INCREASEMENT( expbits buf. expbitsbuf :DECREASEMENT( expbitsbuf.0.5 loop if fp_buf > 2. return result.0).0 and ISUNDERFLOW( exp bits.
end refer. end if. else bits buf : .
91
. end if. then elsif fra value =0.O_BITSARRAY( fravalue. return result. else exp bitsbuf : INCREASEMENT( exp bits buf. bits buf :=BECOME_ZER~O( bitsbuf). end BACKTOBITSARRAY. result := exp bits buf & bitsbuf.precision) then bits buf := BECOME_ZEROC bitsbuf).bits_buf'length ) FPT end if.0 bitsbuf :=BECOME_ZERO( bitsbuf).*
if IS_OVERFLOW( exp bits buf.precision).
when "I1" =>
siga := 1. expdiff: INTEGER) return REAL.0.0.APPENDIX B: THE TOP FUNCTIONS AND BEHAVIOR OF THE FPU A.
sigb := 1. package body FPADDER is function ADDER(sign_a:BIT.
case xbuff is
when "00" =>
sig_a. 1.0.= sig b = when "01" => siga := sigb := when "10" =>
1. package FPADDER is function ADDER(signa:BIT. bits a: BIT ARRAY. variable frab: REAL. bitsb : BITARRAY . end FPADDER .0.precision: INTEGER ) return BITARRAY . exp_length.0.0. expdiff :INTEGER) return REAL is variable result: REAL. signb:BIT. THE TOP FUNCTIONS OF THE FPU Floating Point Addition
library fpu. variable sigb: REAL.mantissalength. variable siga: REAL. bits b: BITARRAY. signb:BIT.all. variable xbuff: BITARRAY( 0 to 1).
92
.refer.0. 1. 1. begin
xbuff := signa&signb. bits a: BIT ARRAY. use fpu. variable fra a: REAL.
siga :=1. function ADD2( bits a: BITARRAY . bits b : BITARRAY .
sigb := 1.0;
end case; if expdiff >=0 then fra a := BITSARRAY TO FP(bitsa); BITSARRAYTOFP(SHIFLTOR(bits_b,expdiff)); fra b else fra a BITSARRAYTOFP (SHIFLTO R(bits_a,abs(exp_diff))); fra b := BITSARRAYTOFP(bitsb); end if result := abs(siga*fraa + sig_b*frab) ; return result; end ADDER; function ADD2( bits a: BIT ARRAY ; bits b: BITARRAY; explength,mantissalength,precision: INTEGER ) return BITARRAY is variable a is nan :BOOLEAN; variable b is nan :BOOLEAN; variable a is zero :BOOLEAN; variable b is zero :BOOLEAN; variable a is underflow :BOOLEAN; variable b is underflow :BOOLEAN; variable expa :INTEGER; variable expb :INTEGER; variable expdiff :INTEGER ; variable bitslength :INTEGER := bits a'length; variable sign bita : BIT := bits_a(bitsa'left); variable exp bitsa : BITARRAY(bits_a'leftl downto bits a'leftexplength); variable mantissaa : BITARRAY(mantissalength downto bits a'right); variable sign bitb : BIT := bits b(bits b'left); variable expbitsb : BITARRAY(bitsb'leftl downto bits b'leftexplength); variable mantissab : BITARRAY(mantissalength downto bits b'right); variable bitsc: BITARRAY(bits_a'left downto bits a'right); variable signbitc : BIT ; variable expbitsc:BIT ARRAY(bitsa'lefti downto bits a'leftexplength); variable buf bits c :BIT ARRAY( bitsa'leftl downto bits a'right); variable fra c : REAL ; begin expbits_a := bitsa(bits_a'leftl downto bits_a'leftexplength); exp bits_b bitsb(bits b'leftl downto bitsb'leftexplength); 93
a is nan := ISOVERFLOW( expbits_a, precision) ; b is nan := ISOVERFLOW( expbitsb, precision) ; a is underflow:= ISUNDERFLOW( expbitsa, precision) ; b is underflow := IS_UNDERFLOW( expbitsb, precision) ; a is zero := ISZERO( bitsa ); b is zero := ISZERO( bitsb ); if a is zero then bits c := bitsb; return bits c; elsif b is zero then bits c := bitsa; return bits c ; end if ; case ( aisnan or bisnan ) is
when TRUE =>
if ( a is nan and a is nan ) then bits c := bits a; elsif b is nan then bits c := bits b; else bits c bitsa; end if; when FALSE => expa := BITSARRAY TO INT(expbitsa); expb := BITSARRAY TO INT(expbits b);
expdiff := exp_a  exp b if expdiff >= 24 then
bits c := bitsa; return bits c ; elsif abs(exp diff) >= 24 then bits c := bitsb; return bitsc; end if ;
if expdiff > 0 then expbitsc := exp bits_a ; signbit c := signbit_a
elsif( expdiff < 0 ) then expbitsc := expbits_b ; signbit c signbit_b ; end if; if ( a is underflow or b is underflow ) then if a is underflow then in the underformat there is not unhidden bit exitent mantissa a := '0' & bits a( mantissa lengthi downto bitsa'right); elsif b is underflow then mantissab :='0' & bitsb( mantissa_lengthi downto bits b'right); end if; 94
else mantissaa :=UNHIDDENBIT(bits_a( mantissalengthi downto bits_a'right)); mantissab :=UNHIDDENBIT(bitsb( mantissa_lengthI downto bits b'right)); end if ; if( expdiff = 0 and ( mantissa a >= mantissab )) then expbitsc := exp bitsa ; sign bit c := sign bita ; elsif( expdiff = 0 and ( mantissab > mantissa_a )) then expbits_c := exp bitsb ; signbitc := signbitb ; end if fra c :=2.0 * ADDER( sign_bit_a, mantissa a, signbit_b, mantissa_b,expdiff); if fra c = 0.0 then bits c := BECOMEZERO( bits_a else bufbitsc := BACKTOBITSARRAY( exp_bits_c, frac,precision ); bits c := sign bit c & buf bitsc ; end if end case; return bits c end ADD2; end FPADDER ;
I
Floating Point Subtractionlibrary fpu; use fpu.refer.all; use fpu.fpadder.all; package FPSUBER is function SUB2( bits a: BIT ARRAY ; bits b: BIT ARRAY; explength, mantissa_length, precision: INTEGER ) return BITARRAY ; end FPSUBER
95
variable a is underflow :BOOLEAN. end if.refer. explength. mantissalength .
variable bitsc : BITARRAY(bits b'left downto beginbitsbright).all. begin if bit. else bufbitsb( bits_bleft' :='Il'. explength. explength. bits c := ADD2(bits_a. end FPMULTIER .mantissa lengthprecision: INTEGER return BITARRAY . use fpu. variable bisunderflow :BOOL£AN.mantissalength. variable a is nan :BOOLEAN. buf bitsb. variable b is nan :BOOLEAN. end SUB2 .
96
. precision ). bits b: BITARRAY. mantissalength. bits b: BITARRAY. explength. return bits c . variable expa :INTEGER. variable expb :INTEGER.package body FPSUBER is function SUB2( bits a: BITARRAY . variable b is zero :BOOLEAN.precision: INTEGER ) return BIT ARRAY is variable a is zero :BOOLEAN. precision: INTEGER ) return BITARRAY is variable bufbitsb : BITARRAY(bits b'left downto bitsb'right)
:= bits b . bits b: BITARRAY. package body FPMULTIER is function MULTI2( bitsa: BITARRAY . end FPSUBER

Floating Point Multoplication
library fpu. package FPMULTIER is fufiction MULTI2( bitsa: BITARRAY ._b(bitsb'left) = 'I' then butbitsb( bits_b'left) :='0'.
mantissab : BITARRAY(mantissalength downto bits b'right). expbits a := bitsa(bits_a'leftl downto bits_a'leftexplength). fra c : REAL . bitslength :INTEGER := bits a'length. precision) . ISZERO( expbitsb). bitsc( bits_c'lengthl ):= sign bit_c . expbitsc:BIT ARRAY(bitsa'lefti downto bitsa'leftexp length). bitsc( bits_c'lengthl ):= sign bit_c . when FALSE => expa := BITSARRAYTOINT(expbits_a). b is zero if (a is zero or b iszero) then bits c := BECOMEZERO( bits_c ). a is underflow:= ISUNDERFLOW( expbits_a. a is zero := IS ZERO( exp bits a ). b is nan := ISOVERFLOW( expbitsb. case ( a is nan or b is nan ) is when TRUE => if a is nan then bitsc := BECOME NAN( bits_a ). signbit c : BIT . end if.variablq variable variable variable variable variable variable variable variable variable variable variable variable
expsum :INTEGER . else bitsc := BECOME NAN( bits_b ).
begin signbitc := sign bit a xor signbit b . signbit b : BIT := bits b(bits b'left). 97
. mantissa a : BITARRAY(mantissa_length downto bits a'right). b is underflow:= ISUNDERFLOW( expbitsb. expbitsb : BIT ARRAY(bits_b'leftl downto bits b'leftexplength). sign bit a : BIT := bits a(bitsa'left). expbits_b := bits b(bits b'lefti downto bitsb'leftexplength). else a_isnan := IS_OVERFLOW( expbitsa. bitsc( bits_c'lengthl ):= signbit_c . bufbits c :BITARRAY( bits_a'leftl downto bits a'right). expbitsa : BITARRAY(bits_a'leftl downto bitsa'leftexplength).precision).precision). bitsc: BITARRAY(bits_a'left downto bitsa'right). precision) .
if precision =32 thensingle precision exp sum exp_sum .127. else 98
. end if else exp bits c :=INTTOBITSARRAY( exp sum . end if.0 exp :bits_c :=B1000000001.0) then bits c := BECOMEZERO( bits_c).b. end if.exp_b := BITSARRAYTOINT(exp bits_b). frac :=4. end if. return bitsc . elsif exp sum < 0 then if (exp sum < 1) or ( exp sum = 1 and bits c < 2. underf low bits_c( bits_c'lengthl ):= sign bitc. elsif b is underfiow then mantissab :='O' & bitsb( mantissa_lengthi downto bitsb'right).0 * BITSARRAYTOFP( mantissa a)* BITSARRAYTOFP( mantissab expsum
:= expa + exp. elsif ( exp sum = 1 and frac >= 2. if( a is Iunderfiow or b is underfiow )then if aisunderfiow then
in
underf low formate there is not unhidden bit existing mantissa a :=10O & bits a (mantissa lengthi downto bitsa'right). mantissab :=UNHIDDENBIT(bitsb( mantissa lengthi downto bitsb'right)).exp length).0) then fra c := fra c/2. else mantissaa :=UNHIDDENBIT(bitsa( mantissa_lengthl downto bits_a'right)). IEEE EXP FORMAT if exp sum >= 255 then bits c :=BECOMENAN( bitsc) overflow bitsc( bits c'lengthl ):= sign bitc.
elsif ( exp sum = 1 and frac >= 2. the other case is 64(double precision).expsum := exp_sum . explength.
if exp sum >= 2047 then
bits c := BECOMENAN( bitsc ) .all. fra_c.refer. end if . package FPDIVIDER is function DIVIDE2( bitsa: BIT ARRAY . 99
. end MULTI2.0) then
bits c := BECOMEZERO( bitsc ) . precision ).0 )
then fra c := fra c/2. end FPMULTIER Floating Point Dividerlibrary fpu.0 expIbitsc := B"00000000000" . bits b: BITARRAY.mantissa_length. end if. bitsc := signbitc & buf bits c . end case.exp length) end if. bufbitsc := BACKTOBITSARRAY( expbits_c.precision: INTEGER ) return BITARRAY . underflow bitsc( bits_c'lengthl ):= signbitc return bits c .1023. overflow bitsc( bits_c'lengthl ):= signbitc elsif expsum < 0 then
if (expsum < 1) or ( expsum = 1 and frac < 2. return bits c . end if else exp bits c := INTTOBITSARRAY( expsum . use fpu.
variable expbitsbuf : BIT ARRAY( bitsa'lefti downto bits_a'leftexp length ) .precision )) or ( ISOVERFLOW( exp_bitsb. expbitsbvalue := BITSARRAYTOINT( expbits_b ).bits b : BIT ARRAY .precision return BITARRAY . variable fra bits_b_value : REAL .bitsa( bits a'left1 downto bits_a'leftexplength ) .GER . explength . variable sign bits a :BIT := bitsa( bits a'left ). begin sign bits c := sign bits a xor sign bitsb. package body FPDIVIDER is function DIV( bitsa. variable expbits b value : INTEGER .function DIV( bitsa.
100
. variable frabits c value : REAL .precision : INTEGER) return BITARRAY is variable length : INTEGER := bits_a'length variable diff_expvalue : INTEGER . expbits_a_value := BITSARRAYTO_INT( expbitsa ). variable bits value : REAL . variable bits c : BITARRAY( bits a'left downto bits a'right ) . if ( ISUNDERFLOW( expbitsa. variable sign bitsb :BIT : bitsb( bits b'left ).bits b : MITARRAY .explength INTE. variable buf bits c : BITARRAY( bits a'left 1 downto bits a'right) . variable sign bitsc :BIT . variable expbitsa : BIT ARRAY( bits_a'lefti downto bitsa'leftexplength ) :. variable expbitsb : BITARRAY( bits b'leftl downto bits_b'leftexplength ) := bitsb( bits b'lefti downto bits b'leftexplength ) .precision )) then buf bitsc := BECOME_ZERO( buf bits_c ) bits c := signbits c & buf bits_c . variable frabits a value : REAL . return bits c . end FPDIVIDER . variable expbits a value : INTEGER .
return bits c. else exp bits buf:= INTTOBITSARRAY( diff exp value. return bitsc.precision ))then bufbitsc := BECOME_NAN( buf bitsc). if (diff exp value > 255 or (diff_exp value = 255 and fra bits c value >= 1. fra bitsbvalue :=BITSARRAYTOFP( UNHIDDEN BIT(bits b( bits b'left lengthi downto bitsb'right))
exp
end if. double precision
if (diff exp value > 2147
or
101
. bits c := sign bitsc & buf bitsc. single precision if precision = 32 then diffexpvalue := exp_bits avalue exp bits_bvalue + 127. else fra bitsIa value :=BITSARRAYTO_FP( UNHI6DDENBIT(bitsa( bits a'left exp lengthi downto bits_a'right)). exp length).0)) then buf bitsc := BECOMEZERO( buf bitsc ) bits c := sign bitsc & bufbitsc. elsif( diff expvalue < 0 or diff exp value = 0 and frabitscvalue <= 1. else diffexpvalue := exp bits a value exp_bitsbvalue + 1023.elsif ( IS_OVERFLOW( exp bits a.precision ) or ( ISUNDERFLOW( exp bits b. fra bitscvalue := frabitsavalue/ frabitsbvalue. end if. return bits c.0)) then buf bitsc := BECOMENAN( buf bitsc) bits c := sign bitsc & bufbits c.
variable b is nan :BOOLEAN. variable bitsc: BITARRAY(bits_a'left downto variable sign bitc: BIT . v riable exp bitsb:BITARRAY(bits b'leftl downto bitsb' leftexp length) :=bitsb(bitsb'leftl downto bitsb'leftexp length). return bitsc. elsif( diff exp_value < 0 or (diff exp_value =0 and fra bits c value <= 1. return bitsc.precision: INTEGER) return BITARRAY is variable a is zero :BOOLEAN. variable exp bitsa:BITARRAY(bits a'leftl downto bits a'leftexp length) :=bits a(bitsa'leftl downto bitsa'leftexp length).mantissa lengtii.(diff exp value = 2047 and fra bitscvalue >= 1. end if buf bits c :=BACKTOBITSARRAY( exp bitsbuf. end DIV *
function DIVIDE2( bits a: BIT ARRAY . frabitscvalue. exp length). variable inv_b. variable a is nan :BOOLEAN. end if.0)) then buf bitsc := BECOMENAN( buf bitsc) bits c := sign bitsc &buf bits c. return bits c. bits b: BITARRAY. else exp bits buf:= INTTOBITSARRAY( diffexp_value.itsb: BITARRAY(bits_b'left downto bits b'right). explength. variable b is zero :BOOLEAN.precision ) bitsc := signbitsc & buf bits c. begin a is zero ISZERO( exp bits a ) biszero :=ISZERO( expbitsb 102
.0)) then buf bitsc := BECOME_ZERO( buf bits_c bits c := sign bitsc & bufbitsc.
precision. fpu. fpu.bits b: BIT ARRAY. variable buf c :BITARRAY( bitsa'left downto bits a'right ).refer.all.
fpu. precision) .
function FPUNIT( bits_a.choice :INTEGER ) return BITARRAY end utilityl . end if. else bits c := bits a .fpmultier. package utilityl is
fpu. precision. b is nan := ISOVERFLOW( expbitsb. variable mantissalength : INTEGER .all. explength. end case.if a is zero then bits c := BECOMEZERO( bits a ). end FPDIVIDER .fpsuber. use fpu. begin if precision = 32 then explength := 8.fpdivider. else a is nan := IS OVERFLOW( expbitsa. B.all. elsif ( not( a_is_zero) and b is zero ) then bitsc := BECOMENAN( bitsa ).bits b: BIT ARRAY. case ( aisnan or b is nan ) is when TRUE => if b is nan then bitsc := BECOMEZERO( bitsa ).choice :INTEGER) return BITARRAY is variable explength : INTEGER .all.
103
. precision) . precision).all.fp_adder. return bits c . bitsb. end DIVIDE2. package body utilityl is function FPUNIT( bits_a. when FALSE => bitsc := DIV( bitsa. THE BEHAVIOR FUNCTIONS OF THE FPU
library fpu. end if.
mantissalength := 52. exp length. end FPUNIT. exp length. exp length. end if. bits b . end utilityl
104
. bitsb . return buf c. else double precision explength := 11.
mantissa_length. precision).
when 3 => bufc := MULTI2( bits a .
mantissalength.
mantissalength. end case . precision). case choice is
when 1 => buf c := ADD2( bitsa .
when others => bufrc := DIVIDE2( bits a .mantissa length := 23 .
mantissalength.
when 2 => buf c := SUB2( bitsa . bitsb . precision). precision). bits_b . exp length.
FTO. inet : out BIT
'0' ) . ENR. fpu.
OE : in BOOLEAN := false . constant ADD : INTEGER := 1. 10_12 : in BITARRAY( 2 downto 0) := B"000" .ONEBUS.
105
.FT1.all.APPENDIX C: THE SOURCE FILE OF THE FPU CHIP AMD29325
library fpu.all. use fpu.utilityl. constant SUB : INTEGER := 2.ENS.OE) variable precision : INTEGER := R'length .refer.all. fpu. 13_14 : in BITARRAY( 1 downto 0) := B"00" IEEEORDEC : in BIT '1' . zero. invd.refer.S : in BITARRAY( 31 downto 0) B"00000000000000000000000000000000". variable choice : INTEGER . . constant DIV : INTEGER := 4.
end AM29325 library fpu.write file. constant MULTI : INTEGER := 3. architecture behavioral of AM29325 is begin process(CLK. S16_OR S32.it is designed with single precision and only 4 arithmetic operations built in AMD29325 entity AM29325 is generic( D_FPUT : time := ll0ns ). port( R.all. unf. PROJ OR AFF : in BIT
:=
'00 0
RNDORNDI: in BITARRAY( 1 downto 0) := B"00" . use fpu. nan.CLK : in BIT
10'. fpu.ENY. variable BUFF : BITARRAY( 31 downto 0) variable BUFF FLAG : FLAG .utilityl.all. F : out BITARRAY( 31 downto 0) := B"00000000000000000000000000000000" ovf.
BUF F :=FP_UNIT(R.
106
.unf bit after DFPUT.precision). ovf <= BUFFFLAG.ovf bit after DFPUT unf <= BUFFFL&AG.S. end behavioral.begin Ill' ) then if ( OE and (CLK'EVENT and CLK case 10_12 is when Bi"000"1 => choice := ADD => when B"O001" SUBSUB choice when B11O1O" => choice := MULTI. when others => choice := DIV.zero bit after DFPUT. zero<= BUFFFLAG. end if. end process. nan <= BUFFFLAG. BUFF(30 downto 23) .precision.choice) F <= BUF Fafter DFPU T. end case .nanbit after DFPUT. BUFFFLAG := SET FLAG(BUF F.
fft.
107
.AM29325 . entity A29325 is generic ( D FPU T : TIME := 110 ns ). IEEE OR DEC : in BIT .THE APPENDIX D: THE SIMPLIFIED I/O PORT OF THE FPU CHIP
AMD29325
library fpu.all.S : in BITARRAY( 31 downto 0) := B"00000000000000000000000000000000".AM29325 entity. fft. port( R.refer.fft.this program is created for simplifing .in2 :in BITARRAY( 31 downto 0)
inl. in INTEGER 1 . RNDORND1: in BIT_ARRAY( 1 downto 0) := B"00" .FT1. fft use fpu.
OE : in BOOLEAN := false .am29325 architecture simple of A29325 is component AM29325 generic( D_FPUT : time := l0ns ). 10_12 : in BITARRAY( 2 downto 0) := B"000" .T S16_ORS32. in BOOLEAN FALSE out BIT ARRAY( 31 downto 0) B"0000000000000000000000000000000" =
output of fft
end A29325 library fpu . port( inl. 13_14 : in BIT ARRAY( 1 downto 0) := B"00" . in2 input signal
clock option enable outl
B'00000000000000000000000000000000". in BIT := 'I' .ENS. PROJORAFF : in BIT '0 .all.ENY. use fpu.ONEBUS.refer.CLK : in BIT
• 10'. ENR.FTO.
met).
end simple
108
. func. unf. ovf.FT1. nan. nan. ENR.BITARRAY( 31 downto 0) :BlOOOOOOOOO000000000000000000000000 ovf.CLK BIT :='0'.ONEBUS. ENS. ONEBUS.
Fl: AM29325
generic map( D_FPU_T
=> li0ns) port map( inl. end component F :out for Fl signal signal signal signal signal signal signal AM29325 use entity fft.ENY. OUTi. 13_14 : BITARRAY( 1 downto 0) B11001. mnet BIT 0 func :BITARRAY( 2 DOWNTO 0) := 000" . enable. zero. 13_I4.FTO. zero.ENS. mnet : out BIT 0) '0' .AM29325( behavioral). zero. IEEE_OR_DEC : BIT e1l . elsif( option = 2) then func <= "1001"1 elsif( option = 3) then func <= 11010"1 elsif( option = 4) then func <= "101111 end if. in2. S16_or_S32. ENY. S16_ORS32. ENDO_RND1. invd. RNDO_RNDl: BITA RRAkY( 1 downto 0) B"100" ovf. nan. IEEE_OR_DEC. PROJORAFF :BIT '0' . PROJ_ORAFF. unf. invd. invdi. ENR. FT1. clock. ehd process. unf.
begin process( option) begin if ( option =1) then func <= "1000"1 . FTO.
use fpu. in BOOLEAN := false .bimg w_real.

a is the input signal.refer. d is the

output signal. fft.basic. clock in BIT := '1' .
library fpu. rising edge trigger option in INTEGER .
is the input signal.basic. 109
.c is the output signal.w img clock enable TIME := 110 ns ).refer.
chip enable for am29325
ie : in BOOLEAN :

FALSE . fft. is the weight signal.aimg b_real. fft. . fft.
d_real.in2 :in BITARRAY( 31 downto 0) .
in LOGICARRAY( 31 downto 0).all.
end FFTCELL
.c_img : out LOGICARRAY( 31 downto 0) . enable : in BOOLEAN := FALSE .all.b

in LOGICARRAY( 31 downto 0).all architecture structural of FFTCELL is component A29325 generic ( DFPU T : TIME := 110 ns ).inl.fft.APPENDIX E: THE PIPELINE STRUCTURE OF THE FFT BUTTERFLY
library fpu. port( inl. use fpu.
w
in BIT :='1' . fft.A29325.
input enable for final stage output

oe : in BOOLEAN :

FALSE .

output enable for first stage input
c_real.all it designed for single precision entity FFT_CELL is generic ( DFPU T port( a_real.
. in LOGICARRAY( 31 downto 0).A29325. in2 is the input signal B'00000000000000000000000000000000".dimg : out LOGIC ARRALY( 31 downto 0)).
:BITIARRAY( 31 DOWNTO 0) *B"00&000000000000000000000000000000"1 :BITIARRAY( 31 DOWNTO 0)
:B"0O0000000000000O000O000000000"
110
.B"OOOO000000000000000000000000000" signal reg_3img :BITARRAY( 31 DOWNTO 0) *B'060000000000000000000000000000000"I signal reg c'L real :BITARRAY( 31 DOWNTO 0)
:B"'O000000000000000000000000000000"1
signal regclimg
:BIT_ARRAY( 31 DOWNTO 0)
:B"0000000000000000000000000000000011
signal regc2_real :BIT_'ARRAY( 31 DOWNTO 0)
:B"00000000000000000000000000000000"1
signal regc2_img signal regc3 real signal regc3_img signal regc4 real signal regc4_img
:BITIARRAY( 31 DOWNTO 0)
:B"00000000000000000000000000000000"1
:BIT_ARRAY( 31 DOWNTO 0)
:B"00000000000000000000000000000000"I
:BITIARRAY( 31 DOWNTO 0)
:B"00000000000000000000000000000000"
.outi end component .
chip enable for am29325 out BITARRAY( 31 downto 0) ). signal bufareal :BIT_ARRAY( 31 DOWNTO 0)
:B"lIO00OOO000000O0O0000000"1
signal bufbreal :BITARRAY( 31 DOWNTO 0)
:BlOO000000000000000000000000000000"1
signal bufwreal :BITARRAY( 31 DOWNTO 0)
:Bl'OOOOOOOO0000000000000000000"1
signal bufa_img signal bufb_img signal bufwimg
:BITARRAY( 31 DOWNTO 0) *Bl"OOOOOOOO0000000000000000000000000 :BITARRAY( 31 DOWNTO 0) *BOOOOOOOOOOOO000000O00000000000000"1 :BIT_ARRAY( 31 DOWNTO 0) *B"000O00000000000000000000000000000"1
signal reglreal :BITARRAY' 31 DOWNTO 0)
:
r"czOOOOO~ooooooooooooooooooooooooo"
signal reglimg
BIT_'ARRAY( 31 DOWNTO 0) *B"0000000000000000000000000000000011 signal reg2 real :BI1_eRPAY' '? DOWNTO 0) *B"000000000000000000O00000000000000"1 signal reg_2img :BIT_ARRAY( 31 DOWNTO 0)
:B"0000000000000000000000000000000011
sigjnal reg_3 real
:BITIARRAY( 31 DOWNTO 0) :. output of f ft
for ALL : A29325 use entity fft.A29325( simple).
constant DELTi constant DELT2 begin
begin at stage 1 simply discribe DFF behavior

process( clock.
111
. := 110 ns .subtraction 1 . . . buf_b_real <= LOGICTO BIT( b_real ) after DEL_Ti.multiplication 2 . ie begin if ( clock'event and ( clock ='0' )and ( ie = true)) then buf_a_real <= LOGIC TO BIT( a_real ) after DEL_Ti. .
. addition 10 ns .signal regwlreal : BITARRAY( 31 DOWNTO 0)
: B"0000000000000000000000000000000"
signal regwl img : BITARRAY( 31 DOWNTO 0)
:= B"OOOOO0000000000000000000000000"
signal regw2 real : BIT ARRAY( 31 DOWNTO 0)
= B"0000000000000000000000000000000"
signal regw2_img : BIT ARRAY( 31 DOWNTO 0)
: B"00000000000000000000000000000000"
signal xlreal : BITARRAY( 31 DOWNTO 0) := B"00000000000000000000000000000000" signal xlimg : BITARRAY( 31 DOWNTO 0) := B"00000000000000000000000000000000" signal x2_real : BIT ARRAY( 31 DOWNTO 0) := B"00000000000000000000000000000000" signal x2_img : BITARRAY( 31 DOWNTO 0) := B"00000000000000000000000000000000" signal x3_real : BIT ARRAY( 31 DOWNTO 0) := B"00000000000000000000000000000000" signal x3_img : BIT ARRAY( 31 DOWNTO 0) := B"00000000000000000000000000000000" signal x4_real : BITARRAY( 31 DOWNTO 0) := B"00000000000000000000000000000000" signal x4_img : BITARRAY( 31 DOWNTO 0) := B"00000000000000000000000000000000" signal xclreal : BIT ARRAY( 31 DOWNTO 0) := B"00000000000000000000000000000000" signal xclimg : BITARRAY( 31 DOWNTO 0) := B"00000000000000000000000000000000" signal signal signal signal div mult sub add : INTEGER := 4 : INTEGER := : INTEGER := : INTEGER := : time : time .division 3 . buf a img <= LOGIC TO BIT( aimg) after DEL_Ti.
buf_b_real. xcl_img ). . end
<= LOGIC_TOBIT( w~img
)
after DELTI. add.delay time at input weight factor process( clock ) begin if ( clock'event and ( clock ='1' )) then reg_wl_real <= buf_w_real after DEL_T2. sub. enable. xl img ). enable. clock. end process . reg_wl_img <= bufrw_img after DELT2. buf_bimg. add. end process. clock.
A2 : A29325 generic map ( D_FPUT =>110 ns) port map ( buf_a_img.
A3 : A29325
generic map ( D_FPUT =>110 ns) port map ( buf_a_real.
bufw img end if .buf b img <= LOGIC_TOBIT( bimg ) buf w real <= LOGICTOBIT( wreal )
after DEL_Ti.
of
stage 1
begin
Al : A29325
at
stage 2
generic map ( DFPU T =>110 ns) port map ( bufa_real.
sub. enable. buf_breal. clock. end of stage 2
begin
at
stage 3
simply discribe DFF behavior
112
. xl_real ). A4 : A29325 generic map ( DFPUT =>110 ns) port map ( bufa_img. buf_bimg.
clock. after DEL_Ti. xcl_real ). end if . enable.
.
. if ( clock'event and ( clock =11' )) then reg_c2_real <= reg ci real after DELT2. enable. regwy2_real. end process. enable. regylimg after DELTi. x2_real ) B2 : A29325 generic map ( DFPUT =>110 ns) port map ( reglimg. clock.. mult... mult. enable. clock. muit.. . x3_real ) B4 : A29325 generic map ( DFPUT =>110 ns) port map ( reg_1_real. enable.. reg_w2_real.of .
end of stage 3
begin at
stage 4

B1 : A29325 generic map ( DFPUT =>110 ns) port map ( regilreal. muit. x2_img ) B3 :A29325 generic map ( DFPUT =>110 ns) port map ( regilimg.process( clock) begin if ( clock'event reg_1_real <= regilimg <= reg_ci_real <= regclimg <= regw2_real <= regy2_img <= end if. clock. xci 1mg after DEL_Ti. reg_c2_1mg <= reg cji mg after DELT2. x3_img ) delay time at input weight factor process( clock) beg.end stage 4113 . end process
and ( clock =101 )) then x1 real after DELTi. end if. clock... xci real after DELTi. xl 1mg after DELTi. regwv2_img. reg Wi real after DELTi. regy2_img.
begin at stage 5 simply discribe DFF behavior process( clock) begin if ( clocklevent and ( clock =101 )) then reg2_real <= x2_real after DELTi.. regc3_img after DELT2. sub.. clock.end input weight factor and ( clock ='11 )) then reg c3_real after DELT2. add. reg_3img <= x3 img after DELTi.of . regc3_img <= regc2_img after DELTi. delay time at process( clock) begin if ( clocklevent regc4_real <regc4_img <= end if..
stage 6
. end process . clock. reg_2img <= x2img after DELT1. end if.. x4_real )
C2 :A29325 generic map ( DFPUT =>110 ns) port map ( reg_2img. reg_3_real. regc3_real <= reg_ c2_real after DELTi. end process. enable.
114
. reg_3img.. end of stage 5
begin at ClI
stage6

A29325 generic map ( DFPUT =>110 ns) port map ( reg_2_real. enable. reg_3_real <= x3_real after DELTi.. x4_img )...
)after dimg <= BITTOLOGIC( X4_img end if .


end
of
stage 7
end structural.*
begin at stage 7 simply discribe DFF behavior process( clock. creal <= BITTOLOGIC( regc4_real )after DELT1. )after dreal <= BITTOLOGIC( x4_real DELTI. ce begin if ( ciock'event and ( clock =101 ) and (oe = true)) then DELTi. cimg <= BITTOLOGIC( regc4_img )after DELTi. end process.
115
.
all .APPENDIX F: THE ADDRESS SEQUENCE GENERATOR AND CONTROLLER library fpu. fft.ram_256. architecture simple of SEQ_CONT is
fft.basic. variable testl : BOOLEAN . if( testl and test2 ) then for i in bits l'range loop result(i):= 'X'
end loop .
function RESOLVE( bits_1. function TABLE1( bits: BITARRAY) return INTEGER is variable result :integer := 0 . end if .all. end RESOLVE .
elsif( testi ) then result := bits_2 . fft. bits 2: LOGICARRAY) return LOGIC ARRAY is variable result :LOGIC ARRAY( bits_1'left downto bits_l'right) . return result .all .convert.
else
assert( testl and test2 report " bus can not resolve any one input signal severity error .
use fpu.
from 1 to 6 
generic( testnumber : positive := 2 )
library fpu. use fpu. elsif( test2) then result := bits_1 . fft. fft.basic. entity SEQ_CONT is end
fft.refer. fft.
.all. begin result := 2**( BITSARRAYTOINT( bits)+ 1) return result . variable test2 : BOOLEAN .refer.
fft.ram_256. begin testl := IS_HiZORX( bits_1 ) test2 := ISHiZOR X( bits_2 ) . end TABLE1 116
.all.convert.all.
end TABLE2
constant constant signal signal signal signal signal signal signal signal signal signal signal signal signal signal signal signal signal signal
chssetupt : TIME := 200 ns . := '0' . : BIT := '1' : BIT := '1' : LOGIC ARRAY( 7 downto 0) "ZZZZZZZZ".
117
.function TABLE2( N: INTEGER) return INTEGER is variable result :integer := 0 . begin while 2**( result) < N loop result := result + 1 .
return result . .
signal
signal signal signal signal signal signal
CHS WC
RWWC ADDR_1 CHS_1 RW_1 TRIGRD_0 TRIGWR_0
: BIT := '1' : BIT := '1' : BIT := '0' : INTEGER := 1 . : BIT := '1' : BIT := '1' : BIT : BIT := '0' .
. : BIT := '1' : BIT := '1' : LOGICARRAY( 7 downto 0) := "ZZZZZZZZ". end loop . wrtsetupt : TIME := 200 ns . LEN ISTO CHE IN R OUT_A INE OUTE FFTCMP STAGECNT OSTO TRIG EN SO Sl ADDR_0 CHS_0 RW 0 ADDRWC : BITARRAY( 2 DOWNTO 0 ) : : : : BIT BIT BIT BIT := := : : '1' '1' '0' '0' "000" . : BIT := '1' : BIT := '0' : BIT := 'I' : BIT := '0' : BIT := '0' : LOGICARRAY( 7 downto 0) := "ZZZZZZZZ".
.
. OE <= FALSE after 500 ns elsif( (CNT >=4) and ( INE ='11) INR <= '0' IE <= FALSE. OUTE ) variable CNT :INTEGER := 0. elsif( CNT = 5 ) then OUTA <= I1'
end if
elsif( (CNT >=4) and (OUTE = 11) then OUTA <= '0' ENABLE <= FALSE. if( CNT = 4 ) then OE <= TRUE . OUTA <='O' IE <= TRUE. and (CLOCK'event))
) then
118
.signal signal signal signal signal signal
TRIGRD_1 TRIGWR_1 RDADDR_0 RDADDR_1 WRADDR_0 WRADDR_1
01 := : BIT 0' := : BIT : LOGICARRAY( 7 DOWNTO 0) := ZZZZZZZZ"I . LOGICARRAY( 7 DOWNTO 0) := ZZZiZZZZ" . end if end process.
signal signal signal signal
IE OE ENABLE STATE
:BOOLEAN :BOOLEAN :BOOLEAN :INTEGER
:=FALSE :=FALSE :=FALSE :=0. begin if ( (IN E='O' and INE'event ) and ( OUTE='O'and OUTE'event ))then CNT := 0. :LOGICARRAY( 7 DOWNTO 0)
:"ZZZZZZZZ"
LOGICARRAY( 7 DOWNTO 0)
:= ZZZZZZZZ"I . . ENABLE <= TRUE 0')) then elsif(( CLOCK'event and CLODCK = CNT := CNT + 1. IN_E.
begin
FFT controllerprocess( CLOCK. INR <= '1'.
OUTA variable RCNT :INTEGER :=0 variable WCNT :INTEGER :=0 variable N :INTEGER :=0. variable F :INTEGER := 0 begin if ((CHE = '0' )then if ((STATE = 0) and ( CLOCK'event and
CLOCK =
11)
)then
.
else STATE <= 7 end if. IN R. LEN.1. F TABLE2( TABLE1( LEN)
. next addr
STATE <= 3.do state 0STAGECNT <= 0 . COEBUF := "100000000"1 PTR := ISTO. FFT CMP <= Ill STATE <.
elsif ( (STATE CLOCK
=1) =
and ( CLOCKlevent and 1') )then
do state 1 which is initization state INE <= '0' OUTE <= '0' RCNT :=0 WCNT :=0
EN <=
'0'1 =
if( (INR
1') ) then

gen. 119
. 0' ::BIT PTR variable variable COEBUF : LOGICARRAY( 7 downto 0) := 00000000" . STATE. CHE.out actural length find N :=TABLE1( LEN).address sequencergenerate step by step signal process( CLOCK.
'0' after 1 ns I1l after chssetup~t

RWWC <= Ill' COEBUF := INC( COE_BUF elsif ( CLOCK = '1')then ADDR_0 <= RDADDRl 1 TRIGRD_1 <= not( TRIGRD_1 ) generate next addr end if CHS_0 <= 11. or 4 read if( (INR
I1' and RCNT < 2*N CLOCKlevent )) then
= =
and
if( PTR when RAM_0 is read 
'0' )then
if (( CLOCK = '0')) then ADDR_0 <= TRIGRD_0 <= generate rext addr ADDR WC <= CHSWC <= RDADDR 0 not( TRIG_RD_0 )
COEBUF '1'. 3.elsif( (2 then
<=
STATE) and (STATE
<=
4))
do state 2.120
generate
. '0' after 1 ns I1' after chssetup~t RW_0 <= Il1' elsif( PTR RAM_1 is read=
I1' )then
when
if (( CLOCK = '0')) then ADDR_1 <= RDADDR_0 TRIG RD 0 <= not( TRIG RD 0 ) next addrADDRWC <= COEBUF CHSWfC <= I1'.
end if. elsif( CELOCK = I1' ) then ADDR 1 <= WR ADDR 1.'0' after 1 ns I1l after chssetupt. TRIG_WR_1 <= not( TRIGWR_1). elsif( RCNT = 2*N ) then INE <= '1' EN <= I1' end if

writingif( ( OUT_A = I1' and W_CNT < 2*N and CLOCK'event) or (OUTA'event and OUT A = 11)) then if( PTR = '0' ) then if( CLOCK = '0') then ADDR1 <= WRADDRO0 TRIGWR_0 <= not( TRIGWR_0). CHS_1 <= 11'. RW_1 <= '1'
end if RCNT RCNT + 1 R STATE <= 3 .
*
. Ill after chs setupt. RWWC <= Il1' COEBUF := INC( COE_BUF elsif (CLOCK = '1')then ADDR1 <= RDADDR_1 TRIGRD_1 <= not( TRIGRD_1 )

generate next addr end if CHS_1 <=
1'. '0' after i ns. '0' after 30 ns I1' after chs_setup~t RW_1 <= 121 '11.TRIG <= not(TRIG) after delt.
*N ) then OUT E <= I1' . W_CNT :=WCNT + 1 STATE <= 2. elsif( WCNT =2. if( CLOCK = '0') then Si <= '0' SO <= 1' elsif( CLOCK = 1') then Si <= '1. if((W_CNT = 2*N) and (RCNT STATE <= 7
=
2*N)) then
end if. end if . '0' after 30 ns. SO <= '0' after 500 ns.
RW_0 <= Ill'. end if. Si <= '0' after 500 ns. Il' after wrtsetup~t
elsif( PTR = 1') then if( CLOCK = 0') then ADDRO0 <= WR ADDRO0 TRIGWR_0O <= not( TRIGWR_0) elsif( CLOCK = Ill ) then ADDRO <= WRADDR1l TRIGWR_1 <= not( TRIGWR1
)
end if
CHS_0 <= '11. increment stage_counter do elsif (STATE =7 )then if (IN E Ill1and OUTE = 11') STAGECNT <= STAGE_CNT + 1 122
then
.
state 7 . SO <= '0' end if .'01 after 30 ns. Ill after wrtsetupt. '0' after 30 ns '1' after chs setupt.
TRIG_WR_0). OSTO <= PTR STATE <= 1 elsif( STAGECNT <(F+1)) then STATE <=1 end if end if. TRIG_RD_0 TRIG_ RD _1 TRIGWR_0 TRIG_WR_1 end if state 8 which is finaldo elsif if
<= <= <= <=
then
not( not( not( not(
TRIG_RD_0). TRIG_RD_1). CHS_1 <= '1' RW_1 <= '31'. end if. ADDR_0 <= "ZZZZZZZZ". STATE <= 0. ADDR_1 <= "ZZZZZZZZ".PTR := NOT( PTR STATE <= 8. else if( INE = '1'l STATE <= 2. else STATE <= 3. ADDRWC<= "ZZZZZZZZ". TRIGWR_1).
123
.
CSTATE = 8 )then (STAGECNT =(F+l)
)then
FFTCMP <= '0' after 500 ns. CHSWC <= Il1' RWWVC <= 1'. end if. CHS_ <= '1' RW_ <= '11'. end process. elsif( CHE = I1'l INE <= '1' OUTE <= I1' SO <= 0' then
OSTO <= '0'.
variable il : INTEGER : 0 . 8 ) 124
). variable jl : INTEGER : 0 . STAGE_CNT) variable jum. variable K2 : INTEGER : 0 . end if .8)) . variable L : INTEGER := 0 . variable i2 : INTEGER :. end if .process(TRIGRD_0. if( ( (il+1) mod addrdis )= 0 ) then ji j + 1 end if il :=il + 1. if( STAGECNT >= 0 and TRIGRD l'event) then
RDADDR_1 <=
BITTOLOGIC( INTTOBITSARRAY(((i2 mod addrdis ) + addrdis + j2*jumpdis ) . if( STAGE CNT >= 0 and TRIG WR 0'event) then WRADDR_0<=BIT TOLOGIC( INT_TO_BITSARRAY( ki. TRIGWR_1. variable kl : INTEGER : 0 .
. variable j2 : INTEGER : 0 . jumpdis := TABLE1(LEN)*2 / 2**( STAGE_CNT) . _dis : INTEGER : 0 . TRIGRDI. variable addr dis : INTEGER : 1 .8)) if( ( (i2+1) mod addrdis )= 0 ) then j2 j2 + 1 end if i2 := i2 + 1. TRIG WR 0.0 . begin if( STAGE CNT'event and STAGE CNT >= 0 ) then addr dis := TABLE1(LEN) / 2**( STAGECNT ) . il 0
i2 :=0
jl
j2
=0
:=0
kl k2 L else
0 0 TABLEI(LEN)
if( STAGECNT >= 0 and TRIGRDO'event) then
RDADDR_0 <=
BITTOLOGIC( INTTOBITSARRAY(((il mod addrdis) + jl*jumpdis) .
k2 k2 + 1 end if end if end process.
125
.8)).ki end if
ki + 1
if( STAGE_CNT >= 0 and TRIGWR 1'event) then WRADDR_1 <= BITTOLOGIC( INTTOBITSARRAY((k2 + L) .
the size of ram is 256 by 32 entity RAM_256 is generic ( read cycle t : TIME := 300 ns .read cycle time writecycle t : TIME := 300 ns .it is chip select signal rw en signal i_datalines : in LOGIC ARRAY( 31 downto 0 ).all.APPENDIX G: THE BEHAVIOR OF RAM
library fpu. fft. .all architecture behavioral of RAM_256 is signnl addr buf : LOGICARRAY( addrlines'left addrlines'right ). it is read/write enable
library fft.basic. fpu. :in BIT . fft.chip set up time
wrt pulse widtht : TIME := 150 ns .. .write pulse width
chsaccesst :TIME := 50 ns). use fft.
.all.basic.refer.refer.access time from chip select port( addr lines : in LOGICARRAY( 7 downto 0 ). odatalines : out LOGIC_ARRAY( 31 downto 0 )).fpu.
.
.write cycle time
datasetupt chssetupt
: TIME := 150 ns . end RAM_256 . downto
126
.all.
. chs : in BIT . use fpu.data setup time
: TIME := 150 ns .
idatalinesbuf <= idata lines . end process .data lines'left ARRAY( signal idatalinesbuf : LOGIC downto iidata lines'right ). . signal rwenbuf signal chs_buf : BIT begin
addrbuf <= addrlines . end if .
when chip is enable .check for write pules width violation process(rw_en. . chs) begin if ( rwen = '0' and chs ='0' ) then 127
.
.: BIT .check for read cycle timing violation process(rw_en. end if .check for write cycle time violation process(rw_en. end process . chs) begin if ( rwen = '0' and chs '0' ) then assert addrbuf'delayed( writecyclet )'stable
report " write cycle time error "
severity error . rw en buf <= rw en chs buf <= chs . chs) begin if ( (rwen = '1') and (chs ='O') ) then assert addr buf'delayed( read cyclet )'stable
report " read cycle time error "
severity error .
.
end process .
. end if . chs) begin if ( rw en = '0' and chs ='0' ) then assert chs buf'delayed( chs_setupt )'stable report " chip select setup time error severity error . chs) variable cell num :INTEGER := 0 . variable cellmatrix : LOGICMATRIX( 0 to (2** addrlines'length . variable databuf : LOGICARRAY( idata lines'left downto idataiines'right ).
. end process .1 ) ) begin cellnum := BITSARRAYTOINT( LOGIC write mode 128
T
OBIT( addrbuf))
. end process .check for chip select setup time violation process(rwen.assert rw en buf'delayed( wrtpulsewidtht )'stable
report " read/write time error "
severity error . end if .
process(rw en. " end if .
.check for data setup time violation process(rwen. chs) begin if ( rw en = '0' and chs ='0' ) then assert idata linesbuf'delayed( data_setupt )'stable
report " data setup time error " severity error .
* chip disableelse o data lines <= "ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ" end if.
129
.. end process. cell_matrix( cellnum ) after chsaccess_t. cellmatrix( cellnum ) := databuf. end behavioral. read mode elsif((rwen = '11) and ( chs'event and chs = '0' 1 then odatalines <= "ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ".if( (rwen = '0') and ( chs'event and chs = '0')) then data buf := i datalinesbuf.
fpu. fft.all.all. fpu. return result .basic.refer. fft.all. use fpu. if( testl and test2 ) then for i in bits_l'range loop result(i):= 'X' end loop . fft.all architecture simple of sys2 is
fft. variable testl : BOOLEAN . function TABLE1( bits: BIT ARRAY) return INTEGER is variable result :integer := 0 .form 1 to 6 end .all. variable test2 : BOOLEAN . elsif( test2) then result := bits_1 .convert. end RESOLVE . use fpu. library fpu. end if . begin testl := IS HiZ ORX( bits_1 ) .all. elsif( testl ) then result := bits_2 . . bits_2: LOGICARRAY) return LOGIC ARRAY is variable result: LOGICARRAY( bits_1'left downto bits l'right). fft.all. begin result := 2**( BITSARRAYTOINT( bits)+ 1) . return result . use fft.APPENDIX H: THE SOURCE FILE OF THE FFT SYSTEM
library fpu. test2 := ISHiZOR X( bits 2 ) .convert.all entity sys2 is generic( testnumber : POSITIVE := 2 ) .ram_256. use fft.readl file.
function RESOLVE( bits_1.refer.readl file.ram 256.
130
.basic. else assert( testl and test2 report " bus can not resolve any one input signal severity error . fft.
begin while 2**( result) < N loop result := result + 1 . end loop . active low write/read enable signal LOGICARRAY( 31 downto 0 ).data setup time
: TIME := 150 ns .end TABLE1 .read cycle time
: TIME := 300 ns .
component RAM_256 generic( readcyclet writecyclet datasetupt chssetupt
: TIME := 300 ns . end inputvector .
function TABLE2( N: INTEGER) return INTEGER is variable result :integer := 0 . active low chip select signal BIT .

chip set up time write pulse width
wrt_pulse widtht : TIME := 150 ns.write cycle time
: TIME := 150 ns . 131
.access time from chip select
LOGIC ARRAY( 7 downto 0 ).
.
.
"I100".
"011". return result .
"100" ) .
"001"
11010". function inputvector return vectorset is begin
return( "000".
. chsaccesst port( addrlines : in chs : in rw en : in i datalines : in : TIME := 50 ns).
. end TABLE2 type vectorset is array( positive range <> ) of BITARRAY(2 downto 0) . BIT .
c is the output signal.FFTCELL( structural ).
. d is the output signal. constant constant constant signal signal signal signal signal signal signal signal signal signal signal signal signal signal del t : TIME := 100 ns .bimg : in LOGICARRAY( 31 downto 0).
for all :RAM_256 use entity fft. chssetupt : TIME := 200 ns .
clock : in BIT := "i' .
'I' 'I' '0' '0'
BIT := 'I' BIT := 'I' BIT := '0' INTEGER := 1 .
d_real. enable : in BOOLEAN := false . LEN ISTO CHE IN R OUTA IN E OUTE FFT CMP STAGECNT OSTO TRIG EN SC S1 : BITARRAY( 2 DOWNTO 0 ) : : : : : : : : : : : : : : BIT BIT BIT BIT
:= := := :=
"000" .ai mg : in LOGICARRAY( 31 downto 0).chip enable for am29325
ie oe
: in BOOLEAN := FALSE .c_img : out LOGICARRAY( 31 downto 0) .
.
breal. for Fl:FFTCELL use entity fft. end component . port( a_rea.RAM_256( behavioral ).o_datalines : out LOGICARRAY( 31 downto 0 )).
w_real.input enable for final stage output
: in BOOLEAN := FALSE . end component . wrt_setupt : TIME := 200 ns . component FFT CELL generic ( D FPU T : TIME := 110 ns ).wimg : in LOGICARRAY( 31 downto 0).
.w is the weight signal.
.a is the input signal.dimg : out LOGICARRAY( 31 downto 0)). BIT := 'I' BIT := '0' BIT := 'I' BIT := '0' BIT := '0' 132
.
.
b is the input signal.output enable for first stage input

c_real.
LOGICARRAY( 7 DOWNTO := ZZZiZZZZ" .
0) 0) 0) 0)
signal signal signal signal signal signal signal signal signal signal signal signal signal signal signal signal signal signal
CLOCK times IE OE ENABLE STATE EADDR_0 ECHS_0 ERW_0 EADDRWC ECHSWC ERWWC EADDR_1 ECHS_1 ERW_1 S2 CH_0 CH_1
. .
:BIT Ill'1 :BIT I=ll1 :LOGIC_ARRAY( 7 dcwnto 0) := ZZziZZZZ".ARRAY( 7 DOWNTO := ZZZiZZZZ". LOGIC. :BIT Ill'1 : BIT Ill'1 : BIT := 0' : BIT : BIT 133 Ill1' Il'1
.
:BIT BIT : : : : :
Il'1 Ill'1
BIT := 0' . .ARRAY( 7 downto 0)
:"ZZZiZZZZ". :BIT := '1l BIT := '1' :LOGICARRAY( 7 downto 0) :BIT Ill'1 :BIT Ill'1 :LOGICARRAY ( 7 downto 0)
:="ZZZiZZZZ".signal signal signal signal signal signal signal signal signal signal signal *signal signal signal signal signal signal
ADDR_0 :ZHS_0 RW_0 ADDRWC CHSWC RWWC ADDR_1 CHS_1 RW_1 TRIGRD_0 TRIGWR_0 TRIGRD_1 TRIGWR_1 RDADDR_0 RDADDR_1 WRADDR_0 WRADDR_1
:LOGICARRAY( 7 downto 0) := ZZZiZZZZ".
:BIT Ill'1 :BIT Ill'1 LOGIC_. BIT := 0' . :BOOLEAN :BOOLEAN :BOOLEAN :INTEGER
:=FALSE :=FALSE :=FALSE :=0. BIT := 0' . :BIT := Ill :integer :=0.ARRAY( 7 DOWNTO := ZZZZZZZ" :LOGICARRAY( 7 DOWNTO "IZZZZZZZZ"I LODGIC_. BIT := 0' .
:LOGICARRAY( 7 downto 0)
:"ZZZZZZZZ".
:LOGICARRAY(31 downto 0) :LOGICARRAY(3l downto 0) :LOGICARRAY(31 downto 0) :LOGICARRAY(31 downto 0)
:=lzzzzzzzzzzzzzzzzl
It VVVVVYVVYYYYVYVVVVYYYVVVVVVVVVVY II.
ADDRLINES_0 :LOGIC_. Rlimg :LOGICARRAY(31 downto 0)
signal signal signal signal
Wreal :LOGICARRAY(31 downto 0) :. 'tvvYXXvvvvXvvvvYXvvvvXvvXvvXvvvvvY."XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"I."XXXXXXXXXXXXkXXXXXXXXXXXXXXXXXXXX".ARRAY( 7 downto 0) := ~ZZiZZZZZZ" . Ill'1 .ARRAY(31 downto 0) :LOGICARRAY(31 downto 0) :LOGIC_. Ill'1 ."XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"l. outlreal :LOGICARRAY(31 downto 0) :=" XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"I.ARRAY(31 downto 0)
signal
Rlreal :LOGIC_ARRAY(31 downto 0) :..
signal signal signal signal signal

inlreal inlimg in2_real in2_img
:LOGIC ARRAY(31 downto 0) :LOGIC . Ill'1 .ARRAY( 7 downto 0) := ~ZZiZZZZZZ" . ADDRLINESW :LOGICARRAY( 7 downto 0)
:Rinreal Rin img ROreal RO img
"zzzzzzzff
."XXXXXXXXXXXX)XXXXXXXXXXXXXXXXXXXX". Win real :LOGICARRAY(31 downto 0) :. ADDRLINES_1 :LOGIC_.signal signal signal signal signal signal signal signal signal signal signal
CH_W RWEN_0 RWEN_1 RWENW
: : : :
BIT BIT BIT BIT
Ill1 . outl_img :LOGICARRAY(31 downto 0) :. Wimg :LOGICARRAY(31 downto 0) :."XXXXXXXXXXXXkxxxxxxxxxxxxxxxxxxx". out2_img :LOGICARRAY(31 downto 0) 134
signal signal signal
.. Win img :LOGICARRAY(31 downto 0) :" xxxxxxxxXXXXXXXXXXXXXXXXXXXXXXXXX.
active low RW ENl1<= RWland ERW1 .active low RWENW <= RW_WC and ERWWC .
DONE
:BOOLEAN
:=false. ex real :LOGICARRAY(31 downto 0)
* t"yyyVXyyxyyyyyyyyvyyVxyxyyVxyyyxyxt
. EADDRO 0 ADDRLINES 1 <= RESOLVE( ADDR_1 .active low RW EN0<= RW 0and ERWOJ . EADDR71) 135
. FFTreal :LOGICARRAY(31 downto 0)
*="yyyyyyyyyyyyyyvvvvvyyyvyyvvvvvvv"
.
begin CLOCK <= NOT(
CLOCK
)after
500 ns.
exWreal exW img
=Xxx
:LOGICARRAY(31 downto 0) :LOGIC_ARRAY(31 downto 0)
xxxxx7xXX7XXX p 7XXI.
:INTEGER :INTEGER 0 0= :BITARRAY( 2 downto 0) : GO' .signal signal signal signal signal
D
*="XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"I.active low CHW <= CHSWC and ECHS WC .
signal signal
signal signal signal signal F N L
FFT img :LOGIC_ARRAY(31 downto 0) *="IxxxxxxkxXXXXXXXXXXXXXXXXXXXXXXXXX.
times <= times + 1 after 1000 ns.
:=t"vvvvXvvvvvvvvXvvvvvvvvvvvvvvvvvvI.active low ADDRLINES_0 <= RESOLVE( ADDR_0 .

good"
reovdsgaCHO0<= CHS 0and ECHSO 0active low CH_1 <= CHS_1 and ECHS_1 . assert not( DONE) report "this is enough severity error. out2_real :LOGICARRAY(31 downto 0) eximg :LOGICARRAY(31 downto 0) *="XXXXXXXXXXXXkXXXXXXXXXXXXXXXXXXXX".
read real( "wreal. ISTO W= '0' 136
. exW img <= "ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ". data_i). variable i : NTEGER := 1. EADDR WC<= "00000000". i i +1 end if end process

generate addressing signal by universal controller process (times. variable data_wi: REALMATRIX( 1 to 1000 ) . CLOCK) begin if( times = 1 and (CLOCK'event and CLOCK='l') ) then S2 <. end if.dat". begin if( times =0) then read real( "rldat". datawr ). exw~img <= BIT_TOLOGIC(convertl(datawi(i))).import input data by universal controller process (times. read real( "img. variable data_i : REAL_MATRIX( 1 to 1000 ) . F <= TABLE2( TABLE1( L .
EADDRWC).


initialization
L <= input vector( test_number) N <= TABLE( L ) . else if( times <= N and (CLOCKlevent) ) then ex real <= BITTOLOGIC(convertl(datar(i))) . exW real <= BI!TTOLOGIC(convertl(data_wr(i))) . ex img <= BITToLOGIC(convertl(datai(i))). CLOCK) variable datar : PEAL_MATRIX( 1 to 1000) .ADDRLINESW <= RESOLVE( ADDRWC
. exW img <= BITTO_LOGIC(convertl(datawi(i))). ex img <= "IZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ".dat". ex real <= "ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ". datar). data wi). elsif( times =( N+1 ) and (CLOCKlevent) ) then exW real <= BIT_TO_LOGIC(convertl(datawr(i))). variable datawr: REAL_MATRIX( 1 to 1000 ) .dat". elsif( times = (N*(F+1)/2+l) ) then exW real <= "ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ".Il1' EADDR_0 <= "00000000". readreal( "wimg.
end if* if ( times < (N*(F+1)/2+1) ) then CHE <= '1l' elsif( times = (N*(F+1)/2+1) ) then CHE <= 1'1.
<=
'0' after 1 ns Ill after wrt_setupt.en EADDRWC <= INC( EADDRWC).
elsif ( times <= (N*(F+1)/2) and ( CLOCK'event ) then ECHSWC ERWWC else ECHSWC <= 1'.
<=
'0' after 1 ns I1' after wrt_setupt. EADDRWC <= "IZZZZZZZZ". '0' after 1 ns. '1' after chs_setupt. after wrtsetupt . elsif( times <= (N*(F+1)/2) and (CLOCK'evert)) '1. after 1 ns.
ISTO <= '0'. I1' after chs_setupt
I1'. if( times <= N and ( CLOCK'event ))then ECHS_0 ERW_0
<=
11' '0' after i ns.
<=
'1'. elsif( times = (N*(F+1)/2+1) ) then EADDR 0 <= "IZZZZZZZZ". '0' after 10 ns.
ECHSWC <=
11'
ERWWC
<=
'0' '1' 11'. after chs_setupt .
Il1'. end if. '0' '1'
after 1 ns. ERWWiC <= '11'.
LEN <= L. S2 <= '0' end if.elsif ( times <= N and (CLOCK'event)) then EADDR_0 <= INC( EADDRO0) EADDRWC <= INC( EADDRWC). nd of program e 137
.
elsif( (CNT >=4) and ( INE ='11') ) then IN R <= '0' IE <= FALSE. end if. STATE. variable RCNT variable W_CNT PTfR variable variable COEBUF

CHE. end process
.
INR <= 11'. OUT_E ) variable CNT :INTEGER :=0. OUTA) : INTEGER 0 : INTEGER :=0 0' :: BIT :LOGIC_ARRAY( 7 downto 0) :00000000" . LEN. OUTA <=10O IE <= TRUE.FFT controllerprocess( CLOCK.
address sequencergenerate step by step signal
process( CLOCK. *!1sif( (CNT >=4) and (OUTE = 11) and (CLOCK'event)) then OUTA <= '0' ENABLE <= FALSE. IN_R. IN_E. begin if ( (IN E='0' and INE'event ) and
( OUTE='0'and OUTE'event
))then
CNT :=0.
138
. elsif( CNT = 5 ) then OUTA <= I1l end if.
end if
end process. OE <= FALSE after 500 ns. ENABLE <= TRUE elsif(( CLOCK'event and CLOCK = 0')) then CNT := CNT + 1. if( CNT = 4 ) then OE <= TRUE .if ( (FFT CMP = '0' and FFTCMP'event) and (times >= 1) ) then CHE <= Il1' DONE <= TRUE.
next addr
3. FFTCMP <= '1' STATE <=1.state 2. or do read if( (INR
'll and RCNT < 2*N CLOCK'event )) then
= =
and
if( PTR
'0' )thien when RAM_0 is read
139
.
elsif ( (STATE = 1) and CLOCK'event and CLOCK= 11) )then
do state 1 which is initization state INE <= '0' OUTE <= '01 RCNT :=0 WCNT :=0 EN <= '0'1 if( (INR STATE else STATE end if.
<= <=
11')

) then
gen. 7.begin if ((CHE = '0' ) )then = 0) and ( CLOCK'event if ((STATE )then and CLOCK = 11) find out actural length 
do state 0 STAGECNT <= 0 .
elsif( (2 then
<=
STATE) and (STATE
4
<=
4)
. COEBUF := "100000000"1 PTR := ISTO. 3.
CHS_0 <= 11.'01 after 1 ns. CHS_1 <= '11. I1' after chssetup~t RW_0 <= 11 elsif( PTR = I1l )then when RAM_1 is read if (( CLOCK = '0')) then
ADDR_1 <= RDADDR 0 TRIGRD_0 <= niot( TRIGRD_0 ) generate next addr ADDRWC <= COEBUF CHSWFC <= '1'. '0' after i ns. RW1l <= 11 end if 140
. '0' after 1 ns '1' after chssetupt. RW_WC <= Il1' COE BUF := INC( COE_BUF elsif (CLOCK = '1')then ADDR_0 <= RDADDR_1 TRIGRD_1 <= not( TRIG RD1 ) generat next addr end if.if ((
CLOCK =
'0'))
then
ADDR_0 <= RD ADDR_0 TRIGRD_0 <= not( TRIGRDO ) generate next addr ADDR WC <= COEBUF CHSWiC <= '1'. I1l after chs setupt. RWWC <= '1' COE BUF := INC( COEBUF elsif ( CLOCK = '1')then ADDR 1 <= RD ADDR1 1 TRIGRD_1 <= not( TRIGRD_1 ) .generate next addr end if._ '0' after 1 ns '1' after chs_setupt.
elsif( PTR = 1') then if( CLOCK = '0') then ADDRO0 <= WRADDRO0 TRIGWR_0 <= not( TRIGWRO) elsif( CLOCK = I1' ) then ADDRO0 <= WRADDR1l TRIGWR_1 <= not( TRIGWR1 ) end if . '1' after wrtsetupt. '0' after 30 ns. elsif( RZCNT =2*N INE < 1 EN4 <='1 end if
) then
. '0' after 30 ns. end if.R CNT := RCNT+ 1 STATE <= 3 . CHS_0 <= 11'. RW_1 <= 1'. CHS_1 <= 11' '0' after 30 ns '1' after chssetupt. '1' after chssetupt RW_0 <= '11.writing if( ( OUT A Il'1and WCNT < 2*N and CLOCKlevent )or (OUTA'event and OUT A then
11'))
if( PTR = '0' ) then if( CLO0CK = '0') then ADDR1 <= WRADDROTRIGIWR 10 <= not( TRIGWRO 0 elsif( CLOCK = I1' ) then ADDR1 <= WRADDRl1 TRIG_WR_11 <= not( TRIGWRi ) end if. Ill after wrtsetupt. 141
.TRIG <= not(TRIG) after delt. '0' after 30 ns.
increment stagecounter elsif ( STATE = 7 ) then if ( IN E = '1' and OUTE =
STAGECNT <= STAGE_CNT 1i) then + 1 .
do state 7 . .
:= W CNT + 1 W CNT STATE <= 2 . if((WCNT = 2*N) and ( RCNT = 2*N)) then STATE <= 7 . end if .
TRIGRD_0 TRIGRD_1 TRIGWR_0 TRIGWR_1 <= <= <= <= not( not( not( not( TRIG_RD_0) TRIGRDI) TRIGWR_0) TRIGWRI) . end if . S1 <= '0' after 500 ns . SO <= '0' after 500 ns .
PTR := NOT( PTR
STATE <= 8 .
elsif( WCNT = 2*N ) then OUT E <= 'I' .
end if . .
else if( IN E = '1' ) then STATE <= 2 .if( CLOCK = '0') then Si <= '0' SO <= '1' elsif( CLOCK = '1') then SI <= '1'
SO <= '0'
end if . . do state 8 which is final elsif ( STATE = 8 ) then 142
. end if . else STATE <= 3 .
SO <= '0' SI. elsif( STAGE CNT <(F+1)) then STATE <=1. variable i2 :INTEGER :=0. variable jl :INTEGER :0. ADDR_1 <= IIZZZZZZZZ". CHS_ <= '1' RW_ <= '' STATE <= 0. end process. TRIGRD 1. ADDR_160 <= "IZZZZZZZZ"I. begin if( STAGEICNT'event and STAGECNT >= 0 )then addrdis :=TABLE1(LEN) / 2**( STAGECNT ) . <= '0' OSTO <= '0'. TRIGWRi. variable L :INTEGER :=0. osTo <= PTR STATE <= 1. CHS_ <= I1l RW_ <= '11'. end if end if. il 0 i2 :=0
143
.
process(TRIGRD_0. CHSW <= I1l RWW <= '1'. variable j2 :INTEGER :=0. jump dis :=TABLE1(LEN)*2 I2**( STAGE_CNT). TRIGWR_0.if ( STAGECNT = (F+l) )then FFT CMP <= '0' after 500 ns. variable ki INTEGER :=0. variable K2 :INTEGER :0. ADDRWC<= I"ZZZZZZZZ". elsif( CHE = Ill then INE <= I1l OUTE <= '1'. variable addr dis : INTEGER :1 variable il INTEGER :=0. end if.1 STAGE_CNT) variable jump_dis : INTEGER: 0.
ovt2_img. SO. if( STAGECNT >= 0 and TRIGWRO'event) then WRADDR_0 <= BITTOLOGIC(INTTOBITSARRAY( k1. end process. simply depict the behavioral of 4 to 1 switch process( out1 real. end if . exreal. end if if( STAGECNT >= 0 and TRIGRD_l'event) then RD ADDR 1 <= BIT TOLOGIC( INTTOBITSARRAY(((i2 mod addr dis ) .8)). 8)). out2_real. S2 ) variable test : BITARRAY( 2 downto 0 ) : "000" begin test := SO&Sl&S2 . k1 := kl + 1 end if if( STAGECNT >= 0 and TRIGWR 1'event) then WRADDR_ 1<= BIT TO LOGIC(INTTOBITSARRAY((k2+N).8)).ji "= 0 j2 0 . outl_img. S1. ki := 0. eximg. if( ( (il+l) mod addrdis )= 0 ) then ji := ji + 1 end if ii := ii + 1. + addrdis + j2*jumpdis ) . 144

.8)) if( ( (i2+1) mod addr dis )= 0 ) then j2 := j2 + 1 end if i2 := i2 + 1. k2 := 0 else if( STAGECNT >= 0 and TRIGRDO'event) then RD ADDR 0 <= BITTOLOGIC( INTTOBITSARRAY(((il mod addr_dis) + jl*jumpdis) . k2 := k2 + 1 end if end if .
ENABLE. W in iing <= W 1MG.
simple depict D FFT behavioral_real. Ri_1MG) W in real <= WREAL. RiREAL). outlimg. inlhimg. RiREAL). elsif( TRIG = '0' and TRIG'EVENT ) then in2_real <= RESOLVE( RO_REAL. W _img. out2_img. out2_real. inlimg <= RESOLVE( ROING. R in img <= "ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ". RliMG. in2_img.
Fl:FFTCELL generic map( DFPUT =>110 ns) port map( inl_real. end process. in2_real. end process. end case. clock. TRIG. W process ( RO_REAL. out2_img) 145
. exreal . RiREAL. <= outlimg.case test is when "'100"' => R in real Rinimg when "010" 0 Rin real R in img when "001" 0 R in real
R~in1img
outireal . EN) begin if( EN = '0' ) then if ( TRIG = '1' and TRIG'EVENT )then inlreal <= RESOLVE( ROREAL. outi_real. RO_1MG. OE. eximg
when others => Rinreal <= "ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ". W_in_img. in2_img <= RESOLVE( ROIMG.
<= <= <= <= <=
out2_real . W_in_real. IE. RlI1MG end if end if.
write cycle t => 150 ns .RO_real). generic map( read cycle_t => 300 ns.
RO i:RAM_256
=> 300 ns generic map( read cyclet => 300 ns. CHW. inreal. writecyclet => 150 ns. chssetupt wrtpulse widtht => 150 ns. write cyclet => 150 ns. ns)
Wi:RAM 256
generic map( readcyclet 146
=>
300 ns
. ns) => 50 cbsaccesst port map( ADDRLINES_1. CH_1. ns . CH_0. cbssetupt wrtpulse widtht => 150 ns._inreal. Ri_img )
Wr:RAM256
=> 300 ns generic map( read cyclet => 300 write cycle t => 150 datasetupt => 150 chssetupt wrtpulse widtht => 150 => 50 cbsaccesst port map( ADDR_LINES_W. datasetupt => 150 ns. Ri r:RAM_256 generic map( read cyclet write cycle t datasetupt
=>
Rii:RAM_256
=> 300 ns.R. RW_ENO.
ns . ns) => 50 chs access t port map(ADDRLINES_0. exWreal. ROr:RAM_256 generic map( read cycle_t => 300 ns. RW. CH_0. datasetupt => 150 ns. R_inimg. RW _EN_1. chssetupt wrtpulse widtht => 150 ns._ENW. ns. RWEN_0. Wreal ).=> 300 ns. RWEN_1. CHi1. ns. Ri_real). RO_img )
300 ns => 300 ns => 150 ns => 150 ns chs_setupt wrtypulse widtht => 150 ns ns) => 50 chsaccesst port map( ADDRLINES_1. R. Rinimg. datasetupt => 150 ns. ns) => 50 chs accesst port map( ADDRLINES_0.
ns. ns)
end simple. ns . ns. RW _ENW. W_img )
ns. e t => 150 dataset. CHW.
147
. exWimg. _t => 150 chssetupt wrtpulse width t => 150 => 50 chsaccesst port map( ADDRLINES_W.=> 300 write cy.
use STD. read(temptempdata). variable temp: LINE. temp) . . procedure readreal(Fname:in STRING .
148
. begin

extract the real dataarray from data file. data_array: out
REALMATRIX) is . THE SOURCE FILE ASSOCIATED WITH DATA READ
library fpu. variable L flag: BOOLEAN := true. end loop. library fpu.this procedure is design for input real data file F: text is in Fname.TEXTIO.APPENDIX I: THE ACCESSORY FILES A. package READIFILE is type REALMATRIX is array( integer range <> ) of real .variable tempdata:real.
while ( not endfile(F)) loop readline(F.all. dataarray:out REALMATRIX). count := count + 1. end READ1FILE . variable count : INTEGER := 1. end readreal.TEXTIO. data_array(count):= tempdata . use STD. end READ1 FILE.all. package body READIFILE is procedure readreal(F_name:in string.
variable IOtemp: BITARRAY(1 to 32). variable tempchar:CHARACTER.all. variable L flag: BOOLEAN := true.library fpu. while L flag loop read(temptempchar). use STD.all. begin
cut out the unwanted space or portion. end bittype .refer. variable count : INTEGER := 1. dataarray:out BITMATRIX) is . package READ FILE is function bittype ( char : CHARACTER ) return BIT .all. variable temp: LINE.fpu. elsif ( char = '0') then
b := '0'.this procedure is design for input data length 32 bits file F: text is in F name. package body READFILE is function bittype( char : CHARACTER) return BIT is variable b: BIT .TEXTIO.TEXTIO.fpu.all.refer. variable i :integer := 2. use STD. dataarray:out BITMATRIX). 149
. if(tempchar = 'I' or tempchar = '0') then L_flag := false .
end if return b. end if. library fpu. procedure readdata(Fname:in STRING .
while
not endfile(F) loop L_flag :' true . end READFILE . begin if ( char = '1') then b := '1'. procedure read data(F name:in string. i := 2 readline(Ftemp).
convert fpnumber into IEEE standard format procession = 32
function CONVERT1( value : REAL ) return BITARRAY is variable result : BIT ARRAY( 31 downto 0 ) := "OO000000000000000000000000000000"
. elsif( endfile(F) ) then assert not (tempchar /= 'I' and tempchar /= '0')
report " reach down to the end of data file.refer.end loop

extract the bits array from data file. THE SOURCE FILE OF THE CONVERSION BETWEEN FPNUMBER AND IEEE FORMAT library fpu . if( tempchar = '1' or tempchar = '0') then 10_temp(i) := BIT TYPE(tempchar) .
end if
i := i + 1. end loop.
end loop.
10_temp(l) := BIT TYPE(temp char) . use fpu. dataarray(count):= IOtemp .
B. count := count + 1 . package CONVERT is function CONVERT1( value : REAL ) return BITARRAY . package body CONVERT is
. end CONVERT . ". end readdata.all.
150
.
while (i <= 32) loop
read(temptempchar). end READFILE.
0) then sign := '0' . elsif( local < 1. quot := quot + 1 . local := abs(value) .0 ) then return result .0 ) then local := local * 0. variable exp_bits : BITARRAY( 7 downto 0 ) . variable local : REAL := 0.1 end if end loop . expbits := INTTOBITSARRAY( (quot+127).
151
.0) or ( local < 1.0) then sign := 'I' .mantissabits'length). end if . variable sign : BIT .0) loop if ( local >= 2. while (local >= 2. variable quot : INTEGER := 0 . end CONVERT1 . elsif( value = 0. mantissa bits FPTO_BITSARRAY( (locall. expbits'length) result := sign & expbits & mantissabits .0 . end CONVERT . elsif( value < 0.0 ) then local := local * 2. quot := quot . return result .0).0 . begin if( value > 0.variable mantissa bits : BITARRAY( 22 downto 0 ) .5 .
76. Pollard. 2. Heanesy & D. 5. Cutis. Schaefer. A. 2ned. 1989. "A Fast 32bits Complex Vector Processing Engine".
10. L.. Lipsett. L. Stevenson. Vol 11 part 8. Kluwer Academic Publishers. VHDL MANUAL. J. pages 466 . J. Armstrong.
1988. J. Introduction To HDLBased Design Using VHDL. R. Proceeding Of The Institute Of Acoustics. H.522. Kern and T. Addison wesley. Jain. IEEE 1. 1989. R..
A. page A14. VHDL: Hardware Description And Design. CHIPLEVEL MODELING WITH VHDL. Prentice Hall. 7. 1990.LIST OF REFERENCES 1. Carlson. Computer Design And Architecture. Computer Architecture & Quantitative Approach. Morgan Kaufmamn. pages 150 . Array Processing And Digital Signal Processing Hand Book. "A Proposed Standard For Binary Floating point Arithmetic". IEEE Computer. First Principles Of Discrete Sjy:tems And Digital Signal Processing. and Ussery.151.
3. E. 8. S. Kirk. D. Strum and D. Inc. 1989. March 1981.
A
9. D. page 49. R.
152
.F.. C. 6.
Fundamentals Of Digital
Image Processing. 11.A. page 5. Patterson.
Prentice Hall. 4. E. 1989. 1989.. pages 1820.. Synopsys. Prentice Hall. K. C.
VA 223046145 Library. CA 939435100 Y.
3
153
. 0. CA 939435100 Professor ChinHwa Lee.
5
5. Code 52
Naval Postgraduate School
2
2. TAIWAN 40124 R. CA 939435100 Professor Chyan Yang.
1
6.S. CA 939435002 3. Wu. Department Chairman.
1
7. TaHsiang 26 LANE415. Code EC/Le Department of Electrical and Computer Engineering Naval Postgraduate School Monterey. Defense Technical Information Center Cameron Station Alexandria. 20375 Hu. Copies 1. Code 8120 Naval Research Laboratory 'Washington.
2
Monterey. LEIN WU RD. TAICHUNG. DC. 1
4.INITIAL DISTRIBUTION LIST
No. C. Code EC Department of Electrical and Computer Engineering Naval Postgraduate School Monterey. Code EC/Ya Department of Electrical and Computer Engineering Naval Postgraduate School Monterey.