
Outline

- What is a "Soft" Processor?
- What is the NIOS II?
- Architecture of the NIOS II and its implications
  - TigerSHARC vs. NIOS II
  - Pipeline issues
  - Issues related to FIR
- Hardware acceleration using FPGA logic
What is a "Soft" Processor?
- Processor implemented in VHDL, Verilog,
  etc., and downloaded onto FPGA hardware
- Can implement many parallel processors
  on one FPGA
- Can use additional FPGA resources on the
  same chip that are not part of the processor
  core
- NIOS II is a "Soft" Processor


Why a "Soft" Processor?
- Higher level of design reuse
- Reduced obsolescence risk
- Simplified design update or change
- Increased design implementation
  options
- Lower latency between processor and
  FPGA components
What is NIOS II?
- Software-defined processor
- The processor core is loaded onto
  the FPGA
- Programmed using "normal"
  programming tools (C, asm), not
  hardware description languages
- Can use the rest of the FPGA hardware
  for accelerating parts of the code
How Is NIOS II Implemented
- The custom FPGA logic that interacts
  with the processor is implemented in
  Altera Quartus II
- The Avalon Interface bus (common
  instruction/data bus) is implemented in
  Quartus II
- The architecture is generated in Quartus
  II and used for programming in the Eclipse
  IDE
NIOS II IDE

- Coding is done in Eclipse rather than VisualDSP.
The Different NIOS II Cores
- There are 3 cores available from Altera
  - NIOS II/e: Economical Core
  - NIOS II/s: Standard Core
  - NIOS II/f: Fast Core
What's the Difference between the Cores?

An LE is equivalent to an 8-1 NAND gate + 1 D flip-flop

An ALM is equivalent to 2 LEs
Comparison of TigerSHARC and NIOS II Architecture
TigerSHARC Architecture
NIOS II Architecture

- Thirty-two 32-bit general-purpose registers, six 32-bit control registers
- Variable cache size, depending on how much FPGA space is available
- ALU: 32-bit, two inputs to one output; does shifts, logic and arithmetic. The shifter is
  not a separate unit as it is in TigerSHARC
Avalon Interface

- Separate address, data and control lines
- Up to 1024-bit data width transfer; can be set to any width (need not be a power of 2)
- One transfer per clock cycle
NIOS II/f pipeline
- Six stages
- One instruction can be dispatched and/or
  retired per cycle
- Dynamic branch prediction: 2-bit branch
  history table (no BTB like in TigerSHARC)
NIOS II/f pipeline
The pipeline stalls for:

  - Multi-cycle instructions
  - Cache misses
  - Data dependencies (2 cycles between
    calculating and using result)

Mispredicted branch penalty: 3 cycles
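
To make the branch-prediction cost concrete, here is a minimal sketch of how one 2-bit saturating counter would behave on a loop-closing branch. The counter encoding, the initial "strongly not taken" state, and the function names are assumptions of this illustration, not details taken from the NIOS II documentation. Under those assumptions the branch is mispredicted on its first two passes and on the loop exit.

```c
#include <assert.h>

/* Hypothetical model of a single 2-bit saturating counter:
   states 0 and 1 predict "not taken", states 2 and 3 predict "taken". */
int count_loop_mispredictions(int iterations)
{
    assert(iterations > 0);
    int counter = 0;            /* assumed initial state: strongly not taken */
    int mispredictions = 0;

    for (int n = 0; n < iterations; n++) {
        /* The loop-closing branch is taken on every pass except the last. */
        int taken = (n < iterations - 1);
        int predicted_taken = (counter >= 2);

        if (predicted_taken != taken)
            mispredictions++;

        /* Saturating update toward the actual outcome. */
        if (taken && counter < 3)
            counter++;
        else if (!taken && counter > 0)
            counter--;
    }
    return mispredictions;
}

/* Cycles lost to mispredictions, using the 3-cycle penalty quoted above. */
int misprediction_penalty_cycles(int iterations)
{
    return 3 * count_loop_mispredictions(iterations);
}
```

With this model a long loop loses the same fixed number of cycles to mispredictions regardless of its length, which is why the penalty matters little for large tap counts.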


Hardware Multiply
- Different multiplier options can be chosen
  (at the processor design stage)
  - No h/w multiply (saves FPGA gates)
    - Speed depends on algorithm
  - Use embedded multipliers (if the FPGA has
    them)
    - 1-5 cycles (depends on FPGA)
  - Implement multipliers in FPGA gates
    - 11 cycles
  - Division: 4-66 cycles in hardware
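
When no hardware multiplier is configured, multiplication has to be done in software, and its speed depends on the algorithm chosen. The following is a sketch of one common choice, shift-and-add; it is an illustration only, not the routine the NIOS II toolchain actually emits, and the names `soft_mul` and `soft_mul_self_check` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Shift-and-add multiply: one conditional add per bit of the multiplier.
   The loop runs once per significant bit of b, so the run time depends
   on the operand value. The result wraps modulo 2^32. */
uint32_t soft_mul(uint32_t a, uint32_t b)
{
    uint32_t product = 0;
    while (b != 0) {
        if (b & 1u)
            product += a;   /* add the shifted multiplicand for this set bit */
        a <<= 1;
        b >>= 1;
    }
    return product;
}

/* Small self-check against the built-in operator. */
void soft_mul_self_check(void)
{
    assert(soft_mul(6u, 7u) == 6u * 7u);
    assert(soft_mul(0u, 123u) == 0u);
}
```

A loop like this costs several instructions per multiplier bit, which is the trade-off against the FPGA gates saved by leaving the multiplier out.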
Compared to TigerSHARC
- No support for parallel instructions
- No support for SIMD operations
- Multi-cycle instructions stall the pipeline

All the above limitations can be overcome
by using FPGA space unoccupied by the
processor itself
Comparison of NIOS II and
TigerSHARC on an FIR Algorithm
Integer FIR Algorithm
int coeff[] = {1, 2, 3, 4, 5, 6, 7, 8};
int data1[] = {1, 0, 0, 0, 0, 0, 0, 0};
int output[8];
int i = 0, j = 0, k = 0;

/* Clear the output buffer. */
for (k = 0; k < 8; k++) output[k] = 0;

/* Accumulate 8 multiply-adds into each output sample. */
for (j = 0; j < 8; j++)
{
    for (i = 0; i < 8; i++)
    {
        output[j] += data1[i] * coeff[7 - i];
    }
}
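
The loop above accumulates the same dot product into every output sample. For comparison, here is a hedged sketch of a conventional direct-form FIR in which the window slides along the input. The function names `fir_int` and `fir_impulse_check`, and the choice to treat samples before the start of the input as zero, are assumptions of this sketch rather than part of the original lab code.

```c
#include <assert.h>
#include <stddef.h>

/* Direct-form FIR: y[n] = sum over k of h[k] * x[n - k].
   Samples before x[0] are treated as zero (an assumption of this sketch). */
void fir_int(const int *x, const int *h, int *y,
             size_t n_samples, size_t n_taps)
{
    for (size_t n = 0; n < n_samples; n++) {
        int acc = 0;
        for (size_t k = 0; k < n_taps && k <= n; k++)
            acc += h[k] * x[n - k];
        y[n] = acc;
    }
}

/* Usage example: a unit impulse should reproduce the coefficients. */
void fir_impulse_check(void)
{
    int coeff[] = {1, 2, 3, 4, 5, 6, 7, 8};
    int data1[] = {1, 0, 0, 0, 0, 0, 0, 0};
    int output[8];

    fir_int(data1, coeff, output, 8, 8);
    for (size_t n = 0; n < 8; n++)
        assert(output[n] == coeff[n]);
}
```

The inner loop is still one multiply-accumulate per tap, so the per-iteration cycle analysis applies to this form as well.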
Speed Analysis
[Figure: numbered assembly listing of the FIR loop]
Speed Analysis
- 9 cycles per iteration, except the first two
  (branch predicted not taken) and the last
  (branch predicted taken): those take
  9+3=12 cycles
- 1 data stall: can be removed by moving the
  instruction from line 4 to line 7, which brings
  an iteration down to 8 cycles
- Speed: 8 cycles * (N-3) + 11 cycles * 3 =
  8*(N-3)+33 cycles
- For 1024-tap FIR: 8201 cycles
- Clock cycle is 3 times longer than on the
  TigerSHARC (200 MHz vs 600 MHz)
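
The arithmetic above can be checked with a few lines of C. This is only a sketch of the slide's estimate; the function names are hypothetical, and the formula assumes exactly three mispredicted iterations and that the data stall has been removed.

```c
#include <assert.h>

/* Cycle estimate from the analysis above: 8 cycles for each correctly
   predicted iteration, 8 + 3 = 11 for each of the 3 mispredicted ones. */
long nios2_fir_cycles(long taps)
{
    assert(taps >= 3);
    return 8 * (taps - 3) + 11 * 3;
}

/* Rescale NIOS II cycles to TigerSHARC clock cycles:
   600 MHz / 200 MHz = 3 TigerSHARC cycles per NIOS II cycle. */
long tigersharc_equivalent_cycles(long nios2_cycles)
{
    return nios2_cycles * (600 / 200);
}
```

For a 1024-tap filter, `nios2_fir_cycles(1024)` gives 8201, and rescaling that to the faster clock gives 24603 TigerSHARC-equivalent cycles.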
Speed Comparison
- 8201 NIOS II cycles are equivalent to 24603
  TigerSHARC cycles
- Lab3 timing:
  - 56000 cycles in Debug mode
  - 13000 cycles, unoptimized ASM
  - 4000 cycles, optimized ASM

Worse than unoptimized assembly, but no
hardware acceleration was used, so this is not
that bad
Hardware Acceleration
- The profiling tool in Eclipse can show how
  long each function takes
- If a function takes too long, it can be sped
  up by
  - Custom instructions
  - Hardware acceleration
- Hardware acceleration means taking the
  function and transforming it into FPGA
  circuitry
Hardware Acceleration
- Can be done using the C2H compiler from Altera
- Trades off logic size for speed-up
Conclusion
- "Soft" processors such as the NIOS II
  offer another alternative in the
  embedded systems scene.
- The NIOS II offers the advantage of
  added configurability and customization
  that blurs the line between FPGAs and
  DSPs.