

Ú What is a "Soft" Processor?

Ú What is the NIOS II?
Ú Architecture for NIOS II, what are the
‡ Pipeline Issues
‡ Issues related to FIR
Ú Hardware acceleration, using FPGA
What is a "Soft" Processor?
Ú Processor implemented in VHDL, Verilog,
etc., and downloaded onto FPGA hardware
Ú Can implement many parallel processors
on one FPGA
Ú Can use additional FPGA resources on the
same chip that are not part of the processor

Ú NIOS II is a "Soft" Processor

Why a "Soft" Processor?
Ú Higher level of design reuse
Ú Reduced obsolescence risk
Ú Simplified design update or change
Ú Increased design implementation
Ú Lower latency between processor and
FPGA components
What is NIOS II?
Ú Software-defined processor
Ú The processor core is loaded onto the FPGA
Ú Programmed using "normal"
programming tools (C, asm), not
hardware description languages
Ú Can use the rest of the FPGA hardware
for accelerating parts of the code
How Is NIOS II Implemented?
Ú The custom FPGA logic that interacts
with the processor is implemented in
Altera Quartus II
Ú The Avalon Interface bus (common
instruction/data bus) is implemented in
Quartus II
Ú The architecture is generated in Quartus
II and used for programming in Eclipse

Ú Coding is implemented in Eclipse rather than

The Different NIOS II Cores
Ú There are 3 cores available from Altera
a NIOS II/e: Economical Core
a NIOS II/s: Standard Core
a NIOS II/f: Fast Core
What's the Difference between
the Cores?

An LE is equivalent to an 8-1 NAND gate + 1 D flip-flop

An ALM is equivalent to 2 LEs
Comparison of TigerSHARC and NIOS
II Architecture
TigerSHARC Architecture
NIOS II Architecture

-thirty-two 32-bit general-purpose registers, six 32-bit control registers

-variable cache based on how much FPGA space you have
-ALU: 32-bit, two inputs to one output, does shifts, logic and arithmetic. The shifter is
not separate as in TigerSHARC
Avalon Interface

-separate address, data and control lines

-up to 1024-bit data width per transfer, can be set to any width (need not be a power of 2)
-one transfer per clock cycle.
NIOS II/f pipeline
Ú Six stages
Ú One instruction can be dispatched and/or
retired per cycle
Ú Dynamic branch prediction: 2-bit branch
history table (no BTB like in TigerSHARC)
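The 2-bit scheme can be illustrated with a saturating counter. The sketch below is a generic 2-bit predictor for a single loop-closing branch, not a model of the actual NIOS II/f table; the names `predictor2`, `predict_taken`, `train` and `loop_mispredictions`, and the initial state of 0 (strongly not taken), are assumptions made for illustration.

```c
#include <stdbool.h>

/* Generic 2-bit saturating counter.
   States 0 and 1 predict "not taken"; states 2 and 3 predict "taken". */
typedef struct { unsigned state; } predictor2;

bool predict_taken(const predictor2 *p) { return p->state >= 2; }

/* Move one step toward the observed outcome, saturating at 0 and 3. */
void train(predictor2 *p, bool taken) {
    if (taken) { if (p->state < 3) p->state++; }
    else       { if (p->state > 0) p->state--; }
}

/* Count mispredictions for a loop-closing branch that is taken on every
   iteration except the last. Starting in state 0 is an assumption. */
int loop_mispredictions(int iterations) {
    predictor2 p = { 0 };
    int misses = 0;
    for (int i = 0; i < iterations; i++) {
        bool taken = (i != iterations - 1);
        if (predict_taken(&p) != taken) misses++;
        train(&p, taken);
    }
    return misses;
}
```

Under these assumptions a long loop mispredicts on its first two iterations (the counter still says "not taken") and on its last one (the counter says "taken").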
NIOS II/f pipeline
The pipeline stalls for:

‡ Multi-cycle instructions
‡ Cache misses
‡ Data dependencies (2 cycles between
calculating and using result)

Mispredicted branch penalty: 3 cycles

Hardware multiply
Ú Can use different options for multiplier
(at the processor design stage)
a No h/w multiply (saves FPGA gates)
ż Speed depends on algorithm
a Use embedded multipliers (if FPGA has them)
ż 1-5 cycles (depends on FPGA)
a Implement multipliers on FPGA gates
ż 11 cycles
a Division 4-66 cycles on hardware
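To see why the speed of a software multiply "depends on algorithm", here is a minimal shift-and-add sketch. `soft_mul32` is a hypothetical helper written for illustration; it is not the routine the NIOS II toolchain emits when no hardware multiplier is configured.

```c
#include <stdint.h>

/* Shift-and-add multiply: one conditional add per remaining bit of b.
   Hypothetical illustration, not the toolchain's library routine. */
uint32_t soft_mul32(uint32_t a, uint32_t b) {
    uint32_t product = 0;
    while (b != 0) {
        if (b & 1u) product += a;  /* add the shifted multiplicand */
        a <<= 1;                   /* next bit weighs twice as much */
        b >>= 1;
    }
    return product;                /* low 32 bits; wraps modulo 2^32 */
}
```

The loop runs once per significant bit of `b`, so the cost varies with the operand, unlike a fixed-latency hardware multiplier.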
Compare to TigerSHARC
Ú No support for parallel instructions
Ú No support for SIMD operations
Ú Multicycle instructions stall the pipeline

All the above limitations can be overcome
by using FPGA space unoccupied by the
processor itself
Comparison of NIOS II and
TigerSHARC on an FIR Algorithm
Integer FIR Algorithm
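int coeff[] = {1, 2, 3, 4, 5, 6, 7, 8};
int data1[] = {1, 0, 0, 0, 0, 0, 0, 0};
int output[8];
int i = 0, j = 0, k = 0;

/* clear the output buffer */
for (k = 0; k < 8; k++) output[k] = 0;

/* multiply-accumulate over 8 taps for each output sample */
for (j = 0; j < 8; j++)
    for (i = 0; i < 8; i++)
        output[j] += data1[i] * coeff[7 - i];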
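As written, the inner loop reads the same eight samples for every `j`, so every output gets the same value. For contrast, here is a sketch of a sliding-window direct-form FIR. The function name `fir_int` and the treatment of samples before the start of the input as zero are assumptions, not part of the original slides.

```c
#include <stddef.h>

/* Direct-form FIR: y[n] = sum over k of h[k] * x[n - k].
   Samples before x[0] are treated as zero (assumption). */
void fir_int(const int *x, const int *h, int *y,
             size_t n_samples, size_t n_taps) {
    for (size_t n = 0; n < n_samples; n++) {
        int acc = 0;
        /* stop at k == n so that x[n - k] never indexes before x[0] */
        for (size_t k = 0; k < n_taps && k <= n; k++)
            acc += h[k] * x[n - k];
        y[n] = acc;
    }
}
```

Feeding it the impulse `{1, 0, 0, 0, 0, 0, 0, 0}` with the coefficients `{1, ..., 8}` returns the coefficients themselves, a quick sanity check for any FIR implementation.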
Speed Analysis
[Assembly listing of the FIR loop; not recoverable from the source]
Speed Analysis
Ú 9 cycles per iteration except the first two
(branch predicted not taken) and the last
(branch predicted taken); those will be
9+3=12 cycles
Ú 1 data stall; can remove by moving
instruction from line 4 to 7
Ú Speed: 8 cycles * (N-3) + 11 cycles * 3 =
8*(N-3)+33 cycles
Ú For 1024-tap FIR: 8201 cycles
Ú Clock cycle is 3 times longer (200 MHz vs 600 MHz)
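The cycle estimate above can be checked with a few lines of C. The function names `fir_cycles_nios2` and `to_tigersharc_cycles` are hypothetical; the constants (8 cycles per iteration, 3 mispredicted iterations at 8+3 = 11 cycles, a clock cycle 3 times longer) are taken from the bullets above.

```c
/* Cycle estimate: N-3 iterations at 8 cycles plus 3 mispredicted
   iterations at 8+3 = 11 cycles. Only meaningful for n_taps >= 3. */
unsigned long fir_cycles_nios2(unsigned long n_taps) {
    return 8UL * (n_taps - 3UL) + 11UL * 3UL;
}

/* Express NIOS II cycles in TigerSHARC cycles, using the stated
   3x difference in clock cycle length. */
unsigned long to_tigersharc_cycles(unsigned long nios2_cycles) {
    return 3UL * nios2_cycles;
}
```

For a 1024-tap FIR this gives 8 * 1021 + 33 = 8201 NIOS II cycles, which is 24603 TigerSHARC cycles of wall-clock time.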
Speed comparison
‡ 8201 NIOS II cycles equivalent to 24603
TigerSHARC cycles
‡ Lab3 timing:
- 56000 cycles Debug mode
- 13000 unoptimized ASM
- 4000 Optimized ASM

Worse than unoptimized assembly, but no
hardware acceleration used, so this is not
that bad
Hardware Acceleration
Ú Profiling tool in Eclipse can show how
long each function takes
Ú If function takes too long, it can be sped
up by
a Custom instructions
a Hardware Acceleration
Ú Hardware acceleration takes the
function and transforms it into FPGA logic
Hardware Acceleration
Ú Can be done using C2H compiler from Altera
Ú Trades off logic size for speed-up.



Ú "Soft" processors such as the NIOS II
offer another alternative in the
embedded systems scene.
Ú The NIOS II offers the advantage of
added configurability and customization
that blur the line between FPGAs and