
Design and Implementation of a SHARC Digital Signal Processor Core in Verilog HDL


N. Mozaffar and N. Z. Azeemi*
Department of Hardware, Streaming Networks Pvt. Ltd., Islamabad, Pakistan.
*Asst. Professor, Department of Computer Engineering, VLSI Lab, CIIT, Islamabad, Pakistan
Emails: naveed_33@yahoo.com, nazeemi@comsats.edu.pk

Abstract-This paper describes the design and implementation of an 8-bit fixed point Digital Signal Processor core in Verilog HDL. The architecture exploits the principles of pipelining and parallelism in order to obtain high speed and throughput. The modules of the design fit on a XILINX XC4010XL FPGA with 130K gates running at a clock frequency of 32.31 MHz.
The proposed architecture follows the Analog Devices SHARC DSP standard. This DSP architecture balances a high performance processor core with high performance buses: Program Memory (PM) and Data Memory (DM). In the core, every instruction can execute in a single cycle. The buses and instruction cache provide rapid, unimpeded data flow to the core to maintain the execution rate.

Keywords: VLSI, Verilog HDL, FPGA, SHARC (Super Harvard Architecture), FFT (Fast Fourier Transform), CLB (Configurable Logic Blocks).

1. INTRODUCTION

A Digital Signal Processor (DSP) is a type of microprocessor, one that is incredibly fast and powerful. A DSP is unique because it processes data in real time. These real-time processors make up the fastest growing segment of the semiconductor market and are particularly well suited to handle the demands of processing information, whether as the engine of communication applications or by providing the processing platform for the convergence of the Internet and wireless applications.
One of the best ways of implementing a Digital Signal Processor is the FPGA. FPGAs can be reconfigured within a system, which can be a big advantage in applications that need multiple trial versions during development, and they offer reasonably fast time to market.
This motivates the design and implementation of a Digital Signal Processor in hardware presented in this article.
The design is implemented using Verilog HDL. Each of the nine modules is coded individually. Functional and timing simulations are performed, followed by synthesis. All these sub-modules are interconnected in a top module, which is also verified and synthesized.
Analog Devices' SHARC® DSP is based on a 32-bit Super Harvard architecture that includes a unique memory architecture comprised of two large on-chip, dual-ported SRAM blocks coupled with a sophisticated I/O processor, which gives SHARC the bandwidth for sustained high-speed computations. The block diagram of the SHARC architecture is illustrated in Figure 1.

Figure 1. Super Harvard Architecture

The paper is organized as follows. An outline of the Digital Signal Processor is discussed in Section 2 and is followed by architecture details in Section 3. The synthesis results are provided in Section 8 with the conclusion in Section 9.

2. ARCHITECTURE OUTLINE

The system architecture for the implementation is shown in Figure 2. In order to achieve high throughput, a Super Harvard architecture is implemented. The architecture consists of a numeric execution unit, memory modules, and the main control unit.

Figure 2. DSP System Architecture Block Diagram

Proceedings IEEE INMIC 2003 247


The proposed design uses a register-register based data path prototype, where operands are fetched from the register file by the functional units and the results are then written back to the register file.

The DSP can be operated in either of two modes:

Real Time Mode:
In this mode the DSP accepts real time data from the field and processes it.

Off line Mode:
In this mode the DSP can operate on the data stored in the Data Memory.

The design is verified for Off line mode only.

3. ARCHITECTURE DETAILS

The design consists of three main numeric execution units for data processing as described below:

3.1 Arithmetic Logic Unit (ALU)

An Arithmetic Logic Unit is the core of a central processing unit. It consists of purely combinational logic and performs a set of arithmetic and logic micro operations on two input buses. It has n encoded inputs for selecting which operation to perform. The select lines are decoded within the ALU to provide up to 2^n different operations. The ALU implements a wide range of arithmetic and logical functions. The ALU transfers the result to a destination accumulator after an operation is performed. Instructions that perform memory-to-memory operations are exceptions.
The proposed design of the ALU is capable of performing 8 different micro operations.

Figure 3. ALU System Architecture

3.2 Barrel Shifter

The barrel shifter circulates data bits in a synchronous manner. The Barrel Shifter is used for scaling operations such as:

- Prescaling an input data memory operand or the accumulator value before an ALU operation.
- Performing a logical shift or arithmetic shift of the accumulator value.
- Normalizing the accumulator.
- Post scaling the accumulator before storing the accumulator value in a data register.

In Figure 4 the top register shows the pattern before the shift, and the bottom register shows the pattern that results from the shift.

Data movement in an eight bit barrel shifter (bit positions 7 to 0)

Figure 4. Right Shift operation

3.3 Multiply and Accumulate Unit (MAC)

A hardware Multiplier and Multiplier-Accumulator (MAC) are standard components in all off-the-shelf DSPs. Multipliers are extensively used in signal processing and communication systems.
The Multiplier/adder unit provides multiply and accumulate (MAC) capability. The overall function of the Multiply Accumulator (MAC) is given by equation (1).

\[ Q = \sum_{n=0}^{\mathrm{Count}-1} \mathrm{Multiplicand}(n) \cdot \mathrm{Multiplier}(n) \tag{1} \]

'Q' is the primary data output. 'A' and 'B' are multiplied together and the product is added to the current result. The Count value in Equation (1) is set to a fixed value by a parameter.
The MAC uses an Array Multiplier for multiplication and a high-speed hardware-implemented adder for addition. Figure 5 shows the proposed MAC architecture for implementation in the DSP.

Figure 5. The MAC System Architecture

As evident from Figure 5, this architecture is based on an Array Multiplier, an Adder, a Mux, and an Output hold register.
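The accumulation of Equation (1) can be sketched as a small behavioural model. This is only an illustration in Python, not the Verilog of the design: the function name `mac` and its arguments are hypothetical, and because the text does not state the accumulator width, no overflow or wraparound is modelled.

```python
def mac(multiplicands, multipliers, count):
    """Sum of products over n = 0 .. count-1, as in Equation (1).

    'count' plays the role of the fixed Count parameter. The accumulator
    is an unbounded Python integer (an assumption: the accumulator width
    is not given in the text).
    """
    q = 0
    for n in range(count):
        # Multiply the n-th operand pair and add the product to the running result.
        q += multiplicands[n] * multipliers[n]
    return q


if __name__ == "__main__":
    # Three products: 1*4 + 2*5 + 3*6
    print(mac([1, 2, 3], [4, 5, 6], count=3))  # 32
```

In hardware the running value `q` would correspond to the Output hold register that feeds the adder on each cycle.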

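The right shift of Section 3.2 (Figure 4) can be illustrated with a short behavioural sketch. It assumes an 8-bit word and models only the logical and arithmetic right shifts named in the text; the function name `shift_right` and its parameters are hypothetical and do not come from the design.

```python
def shift_right(value, amount, arithmetic=False, width=8):
    """Shift a width-bit word right by 'amount' positions.

    Logical shift fills the vacated upper bits with zeros; arithmetic
    shift fills them with copies of the original sign bit (bit width-1).
    """
    mask = (1 << width) - 1
    value &= mask
    amount = min(amount, width)
    result = value >> amount
    sign_set = (value >> (width - 1)) & 1
    if arithmetic and sign_set:
        # Ones in the top 'amount' bit positions, zeros elsewhere.
        fill = (mask << (width - amount)) & mask
        result |= fill
    return result


if __name__ == "__main__":
    word = 0b10110100
    print(format(shift_right(word, 2), "08b"))                   # 00101101
    print(format(shift_right(word, 2, arithmetic=True), "08b"))  # 11101101
```

A barrel shifter performs the whole shift in one step regardless of `amount`, which the single expression `value >> amount` mirrors here.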


There are two modes of MAC operation, controlled by the input pin MAC/MUL:

Mode 1:
MAC/MUL = 0
The Array Multiplier performs only multiplication on the input data.

Mode 2:
MAC/MUL = 1
In this mode it performs multiplication and accumulation of the input data, i.e. it is in MAC mode.

3.3.1 Array Multiplier

Due to today's VLSI technology, the array (or parallel) multiplier has become increasingly economical and popular. Some multiplier basics, notations and conventions are described here.
Generation of partial products is the first step of the multiplication operation. These partial products need to be added together to generate the final product. Figure 6 shows the process of partial product generation in the multiplication of two unsigned numbers a and b. Adding all partial products generates the final product. So the process of multiplication consists of generating partial products and adding them together.

Figure 6. Multiplication of two 4-bit numbers

An N x M array multiplier architecture consists of three well-defined major sections performing different operations:

- The generation of N partial product layers.
- The reduction of these partial products into 2 layers.
- The addition of the two layers into a final product.

Figure 7 visually demonstrates the three major sections for performing array multiplication.

Figure 7. Steps required in multiplication

For partial product generation, the multiplicand x is multiplied by the ith bit of the multiplier weighted by 2^i, for all i. ANDing all bits of the multiplicand with the ith bit of the multiplier generates pp_i (the partial product for the ith bit), and pp_i is then shifted left by i positions. Because all partial products are generated concurrently, only one level of delay is introduced, independent of the multiplication word length. Hence this area needs no fast algorithms for generation of partial products.

Figure 8. Generation of Partial Products.

The term reduction is best explained by the visual method of reduction in Figure 9.

Figure 9. Reduction of Partial Products.

In Figure 9, each dot symbolizes a partial product bit. A column of three dots is reduced by an FA (Full Adder) to two bits, where one has the weight 2^0 (sum) and the other 2^1 (carry). This type of reduction is known as 3 to 2 reduction or carry save addition. A column of two dots is reduced to 2 using an HA (Half Adder). It can be seen that this stage (Level n+1) does not yield any reduction. The rightmost column has 1 dot, which is carried down without any action.
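The three sections of the array multiplier (partial product generation, 3 to 2 reduction, final addition) can be sketched as follows. This is a behavioural illustration under stated assumptions, not the structure of the design: it uses unsigned operands, a default width of 4 bits as in Figure 6, and hypothetical function names, and it reduces whole layers at once with a bitwise full adder rather than column by column.

```python
from typing import List, Tuple


def partial_products(x: int, y: int, width: int = 4) -> List[int]:
    """One shifted partial product per multiplier bit.

    Bit i of y is ANDed with every bit of x (giving x or 0) and the
    result is shifted left by i so that it carries the weight 2**i.
    """
    mask = (1 << width) - 1
    x &= mask
    y &= mask
    return [(x if (y >> i) & 1 else 0) << i for i in range(width)]


def carry_save_add(a: int, b: int, c: int) -> Tuple[int, int]:
    """3 to 2 reduction: a full adder applied to every bit position.

    Returns (sum_layer, carry_layer) with a + b + c == sum_layer + carry_layer.
    """
    sum_layer = a ^ b ^ c
    carry_layer = ((a & b) | (a & c) | (b & c)) << 1
    return sum_layer, carry_layer


def array_multiply(x: int, y: int, width: int = 4) -> int:
    """Multiply two unsigned width-bit numbers via the three sections."""
    layers = partial_products(x, y, width)
    # Reduce three layers to two until only two layers remain.
    while len(layers) > 2:
        s, c = carry_save_add(layers[0], layers[1], layers[2])
        layers = [s, c] + layers[3:]
    # Final addition of the remaining layers (a fast adder in hardware).
    return sum(layers)


if __name__ == "__main__":
    print(partial_products(0b1011, 0b0101))  # [11, 0, 44, 0]
    print(array_multiply(11, 5))             # 55
```

The carry-save step never propagates a carry along the word, which is why the reduction delay does not grow with the word length; only the final two-layer addition needs a carry-propagating adder.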



3.3.2 Adder

The speed of a signal processing or communication system ASIC depends heavily on these functional units. Adders are in the critical path of many other arithmetic operations like multiplication, scaling, add-compare-select, and division.
Two basic adder architectures are studied for implementation:

- Ripple Carry Adder.
- Carry Lookahead Adder.

A ripple adder is not normally used in high-speed arithmetic. However, in situations in which a minimum amount of hardware is required and speed is not critical, a ripple adder can prove advantageous. The proposed design implements a Carry Look-ahead Adder scheme for fast arithmetic operations.

3.4 CONTROLLER MODULES

This unit of the design directs all the signals flowing through the architecture, routing each signal to its proper destination. For some signals it makes sure that they arrive at their destined place, to ensure the efficient operation of the architecture. The Controller Unit includes three sub modules:

- Program Counter (PC)
- Instruction Decoder (ID)
- Control Unit (CU)

3.4.1 Program Counter

The Program Counter is a 6-bit counter whose output is connected to the Program Memory. It generates a 6-bit count value, which is used to address the Program Memory, which stores the instructions. It can select 64 memory locations. The Program Counter usually holds the address of the instruction that follows the one currently executing.
It addresses a 64x11 bit Program Memory.

3.4.2 Instruction Decoder

The instruction decoder decodes the 11-bit instruction. It also generates control signals for the memory. The instructions contain the following information:

- Opcode.
- Address.
- Read control.
- Write control.
- CM address.
- CM read control.

3.4.3 Control Unit

The 5-bit opcode is further decoded by the control unit. This module separates the 5 bits of the opcode and assigns them to the control register of the appropriate functional unit. These control registers are connected to the function select pins of the three functional units.
This opcode contains the following information:

- Defines the function to be performed.
- Selects a specific functional unit for that operation.
- Specifies the source of data to that functional unit.

3.5 MEMORY MODULES

The proposed design uses three memories for maximum throughput of the DSP. The design has separate memories for storing data, filter coefficients and instructions as specified below.

- Data Memory (DM)
- Coefficient Memory (CM)
- Program Memory (PM)

3.5.1 Data Memory

Data Memory is used to store only data coming from either an external source or the functional units' output registers. It can store 64 words of 8 bits each. The address bus is 6 bits wide, so it can address 64 memory locations. The memory is designed with separate read and write address buses, enabling a read and a write at different locations of the memory within one clock cycle. The write instructions must be executed first to write the data into the Data Memory.

3.5.2 Coefficient Memory

Coefficient Memory is used to store only filter coefficients, and twiddle factors in the case of an FFT. It can store 16 words of 8 bits each. The address bus is 4 bits wide, so it can address 16 memory locations. The memory is designed with only a single address bus and a read control signal. Before simulation the memory is initialized with a memory initialization file which contains the coefficients needed for a particular operation.

3.5.3 Program Memory

Program Memory is used to store only instructions, which are first given as input by the user during simulation. It can store 64 words of 11 bits each. The address bus is 6 bits wide, so it can address 64 memory locations. The write instructions must be executed first to write the data into the Data Memory.
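The Data Memory of Section 3.5.1 (64 words of 8 bits, separate read and write address buses) can be modelled behaviourally as below. The class name `DataMemory` and the method `clock` are hypothetical, and the choice that a read returns the value stored before the same cycle's write is an assumption; the text does not say how a simultaneous read and write to the same address behaves.

```python
class DataMemory:
    """Behavioural model of a 64 x 8-bit memory with separate
    read and write address buses."""

    DEPTH = 64   # 6-bit address bus
    WIDTH = 8    # 8-bit data words

    def __init__(self):
        self.cells = [0] * self.DEPTH

    def clock(self, read_addr, write_addr=None, write_data=0):
        """Model one clock cycle: read one location and, optionally,
        write another location in the same cycle.

        Assumption: the read sees the contents from before this
        cycle's write.
        """
        addr_mask = self.DEPTH - 1
        data_mask = (1 << self.WIDTH) - 1
        read_value = self.cells[read_addr & addr_mask]
        if write_addr is not None:
            self.cells[write_addr & addr_mask] = write_data & data_mask
        return read_value


if __name__ == "__main__":
    dm = DataMemory()
    dm.clock(read_addr=0, write_addr=5, write_data=0xF6)
    # Read address 5 while writing address 6 in the same cycle.
    print(hex(dm.clock(read_addr=5, write_addr=6, write_data=7)))  # 0xf6
```

The Coefficient Memory and Program Memory differ only in depth, word width and the absence of a second address bus, so the same pattern would apply with those constants changed.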

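The carry look-ahead scheme chosen in Section 3.3.2 can be illustrated with the textbook generate/propagate formulation. This sketch is not taken from the design: the 8-bit default width, the function name `carry_lookahead_add` and the flat (single-level) lookahead are all assumptions. Each carry is computed from the generate bits, propagate bits and the carry-in only, never from another carry, which is what distinguishes it from a ripple adder.

```python
def carry_lookahead_add(a, b, carry_in=0, width=8):
    """Add two unsigned width-bit numbers; return (sum, carry_out)."""
    g = [(a >> i) & (b >> i) & 1 for i in range(width)]    # generate: a_i AND b_i
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]  # propagate: a_i XOR b_i

    def carry_into(i):
        # c_i = g_(i-1) | p_(i-1) g_(i-2) | ... | p_(i-1) ... p_0 c_0
        carry = carry_in & 1
        for k in range(i):
            carry &= p[k]            # carry-in survives only if every bit propagates
        for j in range(i):
            term = g[j]
            for k in range(j + 1, i):
                term &= p[k]         # a carry generated at bit j must pass bits j+1 .. i-1
            carry |= term
        return carry

    total = 0
    for i in range(width):
        total |= (p[i] ^ carry_into(i)) << i   # sum bit: p_i XOR c_i
    return total, carry_into(width)


if __name__ == "__main__":
    print(carry_lookahead_add(0xFF, 0x01))  # (0, 1)
    print(carry_lookahead_add(100, 27))     # (127, 0)
```

In hardware every `carry_into(i)` expression is a two-level AND-OR network evaluated in parallel, so the carry delay does not grow linearly with the word length as it does in a ripple adder.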


4. IMPLEMENTING FFT ALGORITHM

The proposed architecture is tested with an FFT implementation, a digital signal processing algorithm used for calculating the DFT. A 4-point radix-2 algorithm is implemented for this architecture as shown in Figure 10.

Figure 10. Flow graph for a 4-point radix-2 algorithm.

W_4^0 and W_4^1 are the twiddle factors and are stored in the CM prior to the processing. For ease of understanding, we have assumed the values 1 and 2 respectively for these, and stored them in the CM. The algorithm uses four butterflies, each requiring 11 instructions, so the 4-point DFT needs 44 instructions to be computed. Four more are added to output the four results from the DM, making a total of 48 instructions. These 48 instructions need 53 clock cycles to execute, in accordance with the pipeline specifications.
The four input data samples come from the external environment (in our case given as input by the simulation waveform) and are stored in the memory. An efficient program (a set of instructions of EEDSP001, the proposed DSP architecture) is written as shown in Table A, so that data samples are fetched in bit-reversed order. Since each write instruction takes 2 cycles to execute, 8 operations take 11 cycles, as three writes are included.

Table A. Instructions for the computation of one butterfly. (Table body not recoverable.)

Three more similar sets of instructions are required with different operands and addresses to compute the 4-point DFT.

4.1 Simulation Waveform

The simulation waveform is generated by XILINX ISE Ver. 4.2 showing the computation of the 4-point DFT. The input data samples are taken as x(n) = (2, 4, 6, 1). The computation is completed in 53 cycles and the final output is X(k) = (D, 2, 3, F6) in hexadecimal. These values are displayed in the ALU-OUT register.

5. PIPELINING

The proposed DSP design is a fully pipelined architecture that has five stages. The pipelining in the design can be shown as follows:

Figure 10. Five stages of pipelining in DSP Architecture.
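The 4-point example of Section 4 can be checked with a small behavioural sketch. Because the flow graph of the FFT figure is not reproduced here, the sketch assumes the standard radix-2 decimation-in-time structure with bit-reversed inputs; the function names `to_byte`, `butterfly` and `fft4` are hypothetical. It uses the illustrative twiddle values 1 and 2 from the text and wraps every result to 8 bits, as an 8-bit data path would.

```python
def to_byte(value):
    """Wrap an integer to 8 bits (two's complement), like the 8-bit data path."""
    return value & 0xFF


def butterfly(a, b, w):
    """Radix-2 butterfly: returns (a + w*b, a - w*b), each wrapped to 8 bits."""
    product = w * b
    return to_byte(a + product), to_byte(a - product)


def fft4(x, w0=1, w1=2):
    """4-point radix-2 flow with the illustrative twiddle values w0 and w1."""
    # First stage: samples paired in bit-reversed order, x(0) with x(2)
    # and x(1) with x(3), both using the first twiddle factor.
    a, b = butterfly(x[0], x[2], w0)
    c, d = butterfly(x[1], x[3], w0)
    # Second stage: one butterfly per twiddle factor.
    x0, x2 = butterfly(a, c, w0)
    x1, x3 = butterfly(b, d, w1)
    return [x0, x1, x2, x3]


if __name__ == "__main__":
    result = fft4([2, 4, 6, 1])
    print([format(v, "X") for v in result])  # ['D', '2', '3', 'F6']
```

With x(n) = (2, 4, 6, 1) this reproduces the reported output X(k) = (D, 2, 3, F6): the value F6 is the 8-bit two's complement form of -10. Since the twiddle values 1 and 2 are placeholders, the result is not the true DFT of the input.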

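The cycle counts quoted for the five-stage pipeline can be reproduced with a few lines of arithmetic. The sketch below uses the paper's own count of N + 5 cycles for N instructions and N x 5 for serial execution; the function names are hypothetical, and reading the 11 instructions per butterfly as 8 operations plus one extra cycle for each of the three writes is an interpretation of the text rather than something it states outright.

```python
STAGES = 5  # pipeline depth stated in Section 5


def issue_slots(operations, writes):
    """Cycles occupied by a group of operations when each write
    takes two cycles (one extra cycle per write)."""
    return operations + writes


def pipelined_cycles(n_instructions, stages=STAGES):
    """Cycle count used in the paper for the pipeline: N + 5."""
    return n_instructions + stages


def serial_cycles(n_instructions, stages=STAGES):
    """Cycle count for serial processing: N x 5."""
    return n_instructions * stages


if __name__ == "__main__":
    per_butterfly = issue_slots(operations=8, writes=3)
    program_length = 4 * per_butterfly + 4   # four butterflies plus four outputs
    print(per_butterfly)                     # 11
    print(program_length)                    # 48
    print(pipelined_cycles(program_length))  # 53
    print(serial_cycles(program_length))     # 240
```

For the 48-instruction FFT program the ratio is 240 / 53, about 4.5; it approaches 5 only as the number of instructions grows large relative to the five-cycle latency.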


In this pipelined architecture the first instruction takes five cycles to execute and thereafter each subsequent instruction takes only one cycle. Therefore, the throughput is increased. This initial delay of five cycles is called the latency of the architecture. Hence N instructions take N + 5 cycles to execute, compared to serial processing where N instructions require N x 5 cycles, an increase in throughput that approaches five times for large N.
One of the problems encountered in the pipeline is that of conflicting instructions. In our design it occurs in the write instruction. Therefore, in the pipelined architecture, the control flow of the pipeline is changed slightly by delaying the conflicting instruction. This method is called interlocking. Thus the write instruction takes two cycles to execute.

6. INTEGRATION OF COMPONENTS

To achieve the design of the Digital Signal Processor, the modules were implemented according to the plan set. The Digital Signal Processor was divided into several components (top-down approach). The functionality that each module must provide to communicate with other modules was defined at that stage with great caution. This helped us in the bottom-up integration of the modules to get the final Digital Signal Processor circuitry. The individual modules were combined to form the Digital Signal Processor core.

Figure 11. Integration of Components

7. DESIGN FLOW USING EDA TOOL

The EDA tool used for the synthesis and simulation of the DSP processor is XILINX ISE Ver. 4.2. First the architecture of the DSP is studied, then the design is made using the traditional paper-and-pencil approach, and then the design is translated to an HDL, Verilog HDL. After the design process is completed, the Verilog description of the design is written and then imported into the EDA tool.

8. SYNTHESIS RESULTS

The architecture presented is a highly modularized one, which makes it very suitable for VLSI implementation. It has a high input data rate of one sample per cycle and a high throughput rate. Due to smooth data flow, control circuits are designed compactly as counter-based logic with a pause function to freeze all operations. The architecture is verified by Verilog simulation based on register transfer level descriptions. The entire architecture is fitted on a XILINX XC4010XL FPGA running at a clock frequency of 32.31 MHz. The implementation/synthesis results are shown in Table B.

Table B. Synthesis results on XILINX XC4010XL FPGA

Critical Path Timing: 10.91 ns
Clock Frequency: 32.31 MHz
Number of CLBs / Additional Gate Count for IOBs: 7765 (row assignment unclear in the source)

9. CONCLUSION

DSPs are an answer to the intense need for high-speed, intensive processing technologies that are both cheap and easy to use. DSPs are finding favor not only in computer systems, but also in consumer electronics products such as cellular phones.
The CPU and the data path were designed in order to understand the details of DSP architecture. The authors developed the logic, architecture and interface.

