
(EN 412) VLSI Design

System-on-Chip (SoC)
&
Network-on-Chip (NoC)

By Prof. Dr. Yaseer A. Durrani


Dept. of Electronics Engineering
University of Engineering & Technology, Taxila

Outline
❑ What is SoC?
❑ SoC Challenges & Applications
❑ SoC Design Flow
❑ What is NoC?
❑ NoC Topologies
❑ NoC Design Flow

Modern Car

[Figure: electronic control units in a modern car, grouped into four domains: Powertrain (engine control, transmission control, battery management, hybrid control, engine cooling), Body & Convenience (xenon lights, seat position, climate control, dashboard, central locking, mirrors, doors), Safety (airbag, ABS/ESP brakes, night vision, blind-spot detection, park distance, TPMS, adaptive cruise control), and Chassis (power steering, active suspension, brakes).]

Introduction
❑ A System on Chip (SoC) is essentially a complete electronic system embedded on a small, coin-sized chip & integrated with a microcontroller or microprocessor. An SoC design usually includes a CPU, memory, I/O ports, secondary storage interfaces, & peripheral interfaces such as I2C, SPI, UART, CAN, timers, etc., depending on the requirements, such as a digital or analog signal-processing system or a floating-point unit
❑ An SoC connects the CPU, hard disk, RAM, ROM, USB connectivity, & other secondary storage devices through circuitry embedded on the chip, whereas a motherboard does this using expansion cards

Types of SoC
❑ A few types of SoC design:
– SoCs built around microprocessors
– SoCs built around microcontrollers
– SoCs designed for a special dedicated application, which cannot be used for any other task
– Programmable SoCs, in which some (but not all) components are programmable
❑ Advantages of SoC:
– Reduction in overall system size compared to motherboard-based designs
– Compact & small chips, even down to fingertip size
– Better efficiency & performance
– Less power consumption
– Less time to market

Introduction
❑ An SoC is an IC that implements most or all functions of a complete electronic system
❑ An SoC is an IC that integrates software & hardware Intellectual Property (IP) using more than one design methodology for the purpose of defining the functionality & behavior of the proposed system
❑ SoC technology puts the maximum amount of technology into the smallest possible space
❑ SoC bridges the gap b/w software & hardware
❑ Four vital areas of SoC design:
– Higher levels of abstraction
– IP and platform re-use
– IP creation: ASIPs, interconnect and algorithms
– Earlier software development and integration

Three forms of SoC Design
❑ ASIC Vendor Design:
– Design in which all components of the chip are designed as well as fabricated by the ASIC vendor
❑ Integrated Design:
– Design by an ASIC vendor in which not all components are designed by that vendor. It implies the use of cores obtained from some other source, such as a core/IP vendor or a foundry
❑ Desktop Design:
– Design by a fabless company that uses cores which are, for the most part, obtained from other sources such as IP companies, EDA companies, design-services companies, or a foundry

SoC Design Challenges


❑ SoC design takes longer compared to an ASIC
❑ For an ASIC, the following factors influence Turn-Around Time (TAT):
– Frequency of the design
– Number of clock domains
– Number of gates
– Density
– Number of blocks & sub-blocks
❑ The main factor that influences TAT for an SoC is system integration (integrating different silicon IPs on the same IC)

[Figure: leveraging internal vs external bandwidth: two chips (ASIC & DRAM) joined by a 32-bit off-chip bus through I/O drivers, versus an SoC integrating the ASIC & DRAM on a 128/512-bit on-chip bus.]
SoC Design Challenges
[Figure: SoC integration challenges: HW/SW co-verification, verification definition, library issues & IP integration from different sources, illustrated on a hierarchical platform combining an ARM9 microcontroller, 32-bit memory, an arbiter/bridge/DMA/address decoder, SRAM, application logic with DSP pre-processing, an analog sensor, communication interfaces (USB, Ethernet, smartcard) & standard peripherals (watchdog, timers, UART). The platform is reused to design future chips; "your system just grew pins!"]

Key Parameters of Scaling


❑ Process: 0.18 µm to 0.13 µm to 90 nm…
❑ Performance: 300 MHz to multi-GHz…
❑ Density: metal-layer tradeoffs, library selection, optimization
❑ Complexity: 2M to 5M to 10M gates…
❑ Power: active, leakage
❑ Productivity: gates/designer-day, design reuse
❑ Quality / Reliability: DPM, FIT, SER, hot-carrier effect

Migration from ASICs to SoCs
❑ Application-Specific Integrated Circuits (ASICs) are logic chips designed by end customers to perform a specific function for a desired application
❑ ASIC vendors supply libraries for each technology they provide. These libraries contain pre-designed & pre-verified logic circuits
❑ ASIC technologies are:
– Full custom, semi-custom, platform-based, gate array, standard cell
❑ In the mid-1990s, ASIC technology evolved from a chip-set philosophy to an embedded-cores-based System-on-Chip (SoC) concept
❑ An SoC is an IC designed by stitching together multiple stand-alone VLSI designs to provide full functionality for an application
❑ SoCs are composed of pre-designed models of complex functions known as cores (also called intellectual property blocks, virtual components, or macros) that serve a variety of applications


SoC vs. ASIC


❑ SoC is not just a large ASIC
❑ Architectural approach involving significant design reuse
❑ Addresses the cost & time-to-market problems
❑ SoC methodology is an incremental step over ASIC methodology
❑ SoC design is significantly more complex
❑ Need cross-domain optimizations
❑ IP reuse & Platform-based design increase productivity, but not enough
❑ Even with extensive IP reuse, many ASIC design problems remain
❑ Productivity increase far from closing design gap

[Figure: much higher levels of integration (<0.25 µm) + common architectural blocks + high-performance circuits & tools + shorter design cycles, yet productivity still lags.]

SoC Benefits
❑ Typical approach:
– Define requirements
– Design with off-the-shelf chips
– At the 0.5-year mark: first prototypes
– 1 year: ship with low margins/loss
– Start ASIC integration
– 2 years: ASIC-based prototypes
– 2.5 years: ship, make profits
❑ With SoC:
– Define requirements
– Design with off-the-shelf cores
– At the 0.5-year mark: first prototypes
– 1 year: ship with high margin and market share
❑ An SoC consists of: Digital (control, µP, DSP, interfaces), Analog, RF, Power, MEMS, IP, & Memory (SRAM, DRAM, FLASH)
[Figure: up to now, a collection of chips (CPU, DSP, memories, hubs; typical cost $70); now, a collection of cores on one chip (typical cost $10).]

Typical SoC Applications


❑ Typical applications of SoC: consumer devices, networking, communications, & other segments of the electronics industry
– Microprocessors, media processors, GPS controllers, cellular phones, GSM phones, smart pager ASICs, digital television, video games, PC-on-a-chip
❑ Common problems in designing a complex chip:
– Time-to-market pressures demand rapid development
– Quality of results (performance, area, power)
– Increasing chip complexity makes verification more difficult
– Deep-submicron issues make timing closure more difficult
– The development team has different levels and areas of expertise, and is often scattered throughout the world
– Designers may have worked on similar designs in the past, but cannot reuse these designs because the design flow, tools & guidelines have changed
– SoC designs include embedded processor cores, hence a significant software component, which leads to additional methodology, process & organizational challenges


SoC Architecture
❑ The CPU, Direct Memory Access (DMA) controller & DSP engine all share the same bus
❑ Lots of control wires b/w blocks & peripheral buses b/w subsystems mean there is interdependency b/w blocks & a lot of wires on the chip
❑ Intelligent system integration uses an on-chip interconnect that unifies all traffic into a single entity
❑ Example: Sonics' SMART Interconnect SiliconBackplane MicroNetwork
❑ Compared to a traditional CPU bus, an on-chip interconnect such as the Sonics SiliconBackplane has the following advantages:
– Higher efficiency, flexible configuration, guaranteed bandwidth & latency, integrated arbitration
❑ A MicroNetwork is a heterogeneous, integrated network that unifies, decouples & manages all of the communication b/w processors, memories & I/O devices


SoC Architecture
❑ Example: the basic WiseNET sensor SoC
❑ The architecture includes: an ultralow-power dual-band radio transceiver (Tx & Rx)
❑ A sensor interface with a signal conditioner & two A/D converters (ANA_FE)
❑ A digital control unit based on a CoolRISC microcontroller (µC) with on-chip low-leakage memory, several time bases & digital interfaces
❑ A power management block (POW)


SoC Architecture
❑ Example: a typical gateway VoIP (Voice over Internet Protocol) system-on-chip
❑ A gateway VoIP SoC is a device used for functions such as vocoders, echo cancellation, data/fax modems, and VoIP protocols


SoC Example
❑ BeagleBone Black development board
– This board contains an AM335x microprocessor-based SoC. The block diagram shows all the internal components of the AM335x SoC. This SoC contains all peripherals inside a single chip along with a Cortex-A8 processor


SoC Architecture
❑ Architecture means the operational structure of the user's view of the system; over time it evolves from a functional specification into a hardware implementation
❑ System architecture defines the system-level building blocks, such as processors and memories, & the interconnection between them
❑ Processor architecture determines the processor's instruction set (IS) & associated programming model; its detailed implementation includes hidden registers, branch-prediction circuits and specification details concerning the ALU
❑ The implementation of a processor is called its microarchitecture


SoC Hardware & Software Implementations


❑ If all elements cannot be contained on a single chip, the implementation can be a system on a board, but it is often still called an SoC
❑ A fundamental decision in SoC design is choosing which components of the system are implemented in hardware & which in software
❑ Software is usually executed on a general-purpose processor (GPP), which interprets instructions at run time
❑ This offers flexibility, adaptability and resource sharing among applications, but is generally slower & more power hungry than dedicated hardware
❑ SoC designers aim to combine the benefits of h/w & s/w


SoC Processor Execution Sequence
❑ SoC processors directly implement the sequential execution model: the next instruction is not processed until all execution for the current instruction is complete and its results have been committed:
❑ 1. Fetch the instruction into the instruction register (IF)
❑ 2. Decode the opcode of the instruction (ID)
❑ 3. Generate the address in memory of any data item residing there (AG)
❑ 4. Fetch the data operand into an executable register (DF)
❑ 5. Execute the specified operation (EX)
❑ 6. Write back the result to the register file (WB)

[Figure: instruction timing in a pipelined processor.]
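❑ A minimal, hypothetical Verilog sketch (assumed names, not from the slides) of the sequencing above: a valid bit marks which of the six phases holds an operation, and each operation advances one phase per cycle:

  // Valid-bit tracking for the six phases above (IF, ID, AG, DF, EX, WB).
  // At most one new instruction enters IF per cycle.
  module pipe_valid (
    input        clk, rst,
    input        new_instr,     // an instruction enters IF this cycle
    output [5:0] stage_valid    // bit 0 = IF ... bit 5 = WB
  );
    reg [5:0] v;
    always @(posedge clk) begin
      if (rst) v <= 6'b0;
      else     v <= {v[4:0], new_instr};  // each op advances one phase/cycle
    end
    assign stage_valid = v;
  endmodule

❑ In a strictly sequential (non-pipelined) machine, only one of these bits would ever be set at a time; pipelining lets all six phases be busy at once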

SoC Pipelined Soft Processor Execution


❑ A simple pipeline has only one operation in each phase at any given time
❑ A rigid or static pipeline requires the processor to go through all stages or phases of the pipeline, whether required by a particular instruction or not
❑ A dynamic pipeline allows bypassing of one or more pipeline stages, depending on the instruction
❑ More complex dynamic pipelines allow instructions to complete out of order, or even to initiate out of order


SoC Superscalar processor execution
❑ Independent instructions can be executed in parallel
❑ Dynamically pipelined processors remain limited to executing a single operation per cycle by virtue of their scalar nature. This limitation can be avoided with the addition of multiple functional units and a dynamic scheduler to process more than one instruction per cycle

[Figures: superscalar instruction timing in a pipelined ILP processor; superscalar processor model.]

SoC VLIW Processors


❑ Very long instruction word (VLIW) processors rely on static analysis in the compiler, instead of hardware, to schedule operations
❑ A VLIW processor executes multiple independent operations, yet its control complexity is not significantly greater than that of a scalar processor
❑ VLIW processors are used for high-performance applications

[Figure: VLIW processor model.]

SoC SIMD Processors
❑ The Single Instruction Stream, Multiple Data Stream (SIMD) class of processor includes both array and vector processors
❑ The SIMD processor is a natural response to the use of certain regular data structures such as vectors & matrices
❑ Assembly-level programming of a SIMD architecture is similar to programming a scalar processor, except that some operations perform computations on aggregate data
❑ An array processor consists of many interconnected processor elements, each having its own local memory space
❑ A vector processor consists of a single processor that references a global memory space and has special function units that operate on vectors

[Figures: array processor model; vector processor model.]

SoC Multiprocessors
❑ Multiple processors can cooperatively execute to solve a single problem by using some form of interconnection for sharing results
❑ Each processor executes completely independently; however, most applications require some form of synchronization during execution to pass information and data b/w processors
❑ Multiprocessors share memory and execute separate program tasks (MIMD: multiple instruction stream, multiple data stream); their proper implementation is significantly more complex than that of an array processor


SoC Memory & Addressing
❑ SoC applications vary in their memory requirements. Memory can be simple, with only ROM & RAM, or it can support an elaborate operating system requiring large off-chip memory with a memory management unit and a cache hierarchy

[Figures: processors with memory off-die; an SoC with processors & memory on-die.]

SoC Memory Examples


❑ SoC designers must consider whether to put RAM & ROM on-die or off-die


Memory for SoC Operating System
❑ A critical decision in SoC design is the selection of the operating system & of memory-management functionality such as virtual memory
❑ If the system is restricted to a few MB of real memory, it can be implemented as a true SoC (all memory on-die)
❑ Virtual memory is often slower and more expensive, as it requires a complex memory management unit
❑ With real memory only, users have limited ways of creating new processes & expanding the application base of the system

[Figures: operating systems for SoC designs; virtual-to-real address mapping.]

Virtual Memory
❑ Virtual memory, or virtual storage, provides an "idealized abstraction of the storage resources that are actually available on a given machine", which "creates the illusion to users of a very large (main) memory"
❑ The operating system, using a combination of hardware & software, maps the memory addresses used by a program, called virtual addresses, into physical addresses in computer memory
❑ The operating system manages the virtual address spaces & the assignment of real memory to virtual memory. Address-translation hardware in the CPU, often referred to as the memory management unit (MMU), automatically translates virtual addresses to physical addresses

[Figure: virtual memory combines active RAM & inactive memory (on disk) to form a large range of addresses.]

SoC Memory Management Unit
❑ A memory management unit (MMU), or paged memory management unit (PMMU), has all memory references passed through itself, primarily performing the translation of virtual memory addresses to physical addresses
❑ The MMU effectively performs virtual memory management, handling at the same time memory protection, cache control, bus arbitration and, in simpler computer architectures (especially 8-bit systems), bank switching
❑ Modern MMUs typically divide the virtual address space (the range of addresses used by the processor) into pages, each having a size which is a power of 2, usually a few KB, but they may be much larger. The bottom bits of an address (the offset within a page) are left unchanged; the upper address bits are the virtual page number (see the sketch below)
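❑ A small, hypothetical Verilog sketch of the page/offset split described above (assuming a 32-bit address & 4 KB pages; a real MMU walks page tables in memory & caches entries in a TLB):

  // 4 KB pages => 20-bit virtual page number (VPN) + 12-bit page offset.
  module mmu_translate (
    input         clk, wr_en,
    input  [9:0]  wr_vpn,        // toy load port for page-table entries
    input  [19:0] wr_pfn,        // physical frame number to install
    input  [31:0] vaddr,
    output [31:0] paddr
  );
    reg [19:0] page_table [0:1023];     // toy table: only 1024 entries
    wire [9:0]  vpn    = vaddr[21:12];  // low VPN bits index the toy table
    wire [11:0] offset = vaddr[11:0];   // bottom bits pass through unchanged
    always @(posedge clk)
      if (wr_en) page_table[wr_vpn] <= wr_pfn;
    assign paddr = {page_table[vpn], offset};  // physical frame ++ offset
  endmodule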


Intellectual Property (IP)


❑ Utilizing pre-designed modules enables designers:
– To avoid reinventing the wheel for every new product
– To accelerate the development of new products
– To assemble the various blocks of a large ASIC/SoC quite rapidly
– To reduce the possibility of failure from designing & verifying a block for the first time
❑ These pre-designed modules are commonly called Intellectual Property (IP) cores or Virtual Components (VCs)


Core-based Design
❑ Cores are pre-designed, verified but untested functional blocks
❑ Core is the intellectual property of vendor
– Internal details not available to user
❑ Core-vendor supplied tests must be applied to embedded cores


Types of Intellectual Property


❑ Soft IP cores: (synthesizable RTL)
– Technology-independent, synthesizable RTL (VHDL/Verilog) code that provides a functional description of the IP
– Layout is completely flexible
– Performance & area depend on the cell library used
– Synthesis, test, timing analysis, & verification are required
– Offers max. flexibility & re-configurability to match the requirements (see the sketch below)
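❑ An illustrative (assumed, not from the slides) example of what makes a core "soft": parameterizable, technology-independent RTL that any integrator can configure & re-synthesize to a new cell library:

  // Generic up-counter: the integrator sets WIDTH and re-synthesizes;
  // performance & area then depend on the chosen library, as noted above.
  module up_counter #(
    parameter WIDTH = 8
  ) (
    input                  clk, rst_n, en,
    output reg [WIDTH-1:0] count
  );
    always @(posedge clk or negedge rst_n) begin
      if (!rst_n)  count <= {WIDTH{1'b0}};  // async, active-low reset
      else if (en) count <= count + 1'b1;   // count only when enabled
    end
  endmodule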

❑ Hard IP cores: (Non-Modifiable Layout)


– Consist of hard layout & timing information using particular physical design
libraries that cannot be modified
– Highly optimized for area & performance, synthesized to specific technology
– Includes behavioral model for simulation
– May be encrypted
– Test sets (test stimuli and test responses) are given
– Gives min. flexibility to the core integrator, but saves design time and effort
– Cannot be re-synthesized


Types of Intellectual Property
❑ Firm IP cores: (gate-level netlist)
– Technology-dependent gate-level netlist that meets timing constraints
– Layout is flexible
– Best of both: balances the high performance & optimization of hard IPs with the flexibility of soft IPs
– Performance & area are more predictable
– May be encrypted to protect the IP
– Delivered as netlists targeted to specific physical libraries after going through synthesis, without performing the physical layout
– The internal implementation of the core cannot be modified
– The user can parameterize I/O to remove unwanted functionality


Trade-offs among Soft, Firm & Hard Cores


[Figure: trade-offs among soft, firm & hard cores: soft cores maximize reusability, portability & flexibility; hard cores maximize predictability, performance & time-to-market; firm cores sit in between.]

IP Format | Representation | Optimization | Technology             | Reusability
Hard      | GDSII          | Very high    | Technology dependent   | Low
Soft      | RTL            | Low          | Technology independent | Very high
Firm      | Target netlist | High         | Technology generic     | High


Design for Reuse
❑ Reusing macros/cores/IP that have already been designed & verified helps to address all of the problems above
❑ To overcome the design gap, design reuse (the use of pre-designed & pre-verified cores, or the reuse of existing designs) becomes a vital concept in design methodology
❑ An effective block-based design methodology requires an extensive library of reusable blocks, or macros, based on the following principles:
– The macro must be extremely easy to integrate into the overall chip design
– The macro must be so robust that the integrator has to perform essentially no functional verification of the internals of the macro

The challenge for designers is not whether to adopt reuse, but how to employ it effectively


Design for Reuse


❑ To be fully reusable, a hardware macro must be:
❑ Designed to solve a general problem:
– Easily configurable to fit different applications
❑ Designed for use in multiple technologies:
– For soft macros, the synthesis scripts must produce satisfactory quality of results with a variety of libraries
– For hard macros, an effective porting strategy for mapping the macro onto new technologies is needed
❑ Designed for simulation with a variety of simulators:
– Good design-reuse practice requires that both Verilog & VHDL versions & a verification test-bench be available & work with all major commercial simulators
❑ Designed with standards-based interfaces:
– Unique or custom interfaces should be used only if no standards-based interface exists


Design for Reuse
❑ Verified independently of the chip in which it will be used:
– Often, macros are designed & only partially tested before being integrated into a chip for verification. Reusable designs must have full, stand-alone test benches & verification suites that afford high levels of test coverage
❑ Verified to a high level of confidence:
– Requires very rigorous verification as well as building a physical prototype that is tested in an actual system running real software
❑ Fully documented in terms of appropriate applications & restrictions:
– In particular, valid configurations & parameter values must be documented. Any restrictions on configurations or parameter values must be clearly stated. Interfacing requirements & restrictions on how the macro can be used must be documented


IP Reuse based SoC Design


❑ Controller, custom logic, AMS blocks
❑ Short time-to-market (pre-designed)
❑ Less expensive (reuse)
❑ Faster performance (optimized algorithms & implementations)
❑ Smaller area (optimized algorithms & implementations)


Virtual Components (VC) for Reuse


What is MPSoC?
❑ A Multiprocessor System-on-Chip (MPSoC) is an SoC that uses multiple instruction-set processors (CPUs), usually targeted at embedded applications
❑ The term is used for platforms that contain multiple, usually heterogeneous, processing elements with specific functionalities reflecting the needs of the expected application domain, a memory hierarchy & I/O components
❑ MPSoCs often require large amounts of memory


Homo/Heterogeneous SoC
[Figure: a homogeneous MPSoC (identical CPU + memory nodes on an interconnection network, e.g. a bus) versus a heterogeneous MPSoC (CPUs, DSP, embedded FPGA, dedicated IP, I/O & memories on an interconnection network, e.g. bus or crossbar).]

Typical Low Power SoC


❑ Processors: uC, uP or DSP cores
❑ Standard interface core: Industry standards UART, PCI, USB, FireWire,
Ethernet, I2C, SPI
❑ Memory cores: D-cache, ROM, RAM, Flash, EEPROM
❑ Peripherals: cache controllers, counter/timers, DMA controllers, real-time timers,
power-on-reset generators
❑ Timing sources: oscillators, PLL
❑ Digital cores: JPEG compressor, MPEG decoder, VGA
❑ Analog cores: clock recovery, RF components, ADC, DAC
❑ Voltage regulators & power management circuits


Hardware-Software Codesign
❑ The SoC design process is hardware-software codesign, in which design productivity is achieved through design reuse:
– Design exploration at the behavioral level (C, C++, etc.) by system architects
– Creation of an architecture specification
– RTL implementation (Verilog/VHDL) by hardware designers
❑ Drawbacks:
– Specification errors are susceptible to late detection
– Correlating validations at the behavioral & RTL levels is difficult
– The common interface b/w system & hardware designers is based on natural language

[Figure: HW/SW codesign flows. SW: generic C/C++/Java applications go through a µC-specific compiler to assembly, then an assembler to opcodes/HEX code, with a board-support package, DSP drivers & register/I/O setup on the µC. HW: VHDL mapped onto an ASIC, FPGA, CPLD or PAL/PLA.]

Design Methodologies


Time Driven Design (TDD) Methodology
❑ Best for moderately sized & complex ASICs that:
– Consist primarily of new logic on DSM processes
– Don't make significant use of hierarchical design
❑ Problems:
– Looping between synthesis & placement
– Long turnaround times for each loop
– Unanticipated chip-size growth late in the design process
– Repeated area, power & timing optimizations
– Late creation of adequate manufacturing test vectors


Block-based Design (BBD) Methodology


❑ Ideally, BBD is behaviorally modeled at the system level, where HW/SW tradeoffs & co-verification are performed
❑ New design components are partitioned & mapped onto specified RTL blocks
❑ RTL blocks are then designed to budgeted timing, power & area constraints
❑ Opportunistically reused functions are poorly characterized, subject to modification & require re-verification
❑ Hard, firm, or soft programmable processor cores
❑ Test-bench data is extracted from system-level simulation to support functional verification
– Shift from "ASIC-out" to "system-in" verification
❑ Processor-dominated or custom bus architecture
❑ A predominately flat manufacturing test architecture is used
❑ Timing analysis in both a hierarchical and a flat context
– Top-down planning to create block budgets for hierarchical timing analysis
– Flat, detailed timing analysis for higher accuracy
– Management of guard bands becomes critical in DSM design to ensure design convergence
❑ Handoff b/w the design team & the ASIC vendor often occurs at a lower level than in TDD
• A fully placed netlist is normal; it is not uncommon to go all the way to GDSII
• RTL handoff is attractive, but impractical for all but the least aggressive designs

Platform-based Design (PBD) Methodology
❑ TDD + BBD + extensive & planned design reuse
❑ PBD separates design into two areas of focus:
1. Block authoring: blocks are generated so they interface easily with multiple target designs. Two new design concepts must be established:
– Interface standardization:
• Block authoring may be distributed among different design teams if done under the same interface specifications & design-methodology guidelines
– Virtual system design:
• Establishes the system constraints necessary for block design, for example: power profile, test options, hard/firm/soft, …
2. System-chip integration: focuses on designing & verifying the system architecture & the interfaces b/w blocks. Starts with partitioning the system around preexisting block-level functions & identifying the new or differentiating functions needed
• System-level partitioning along with performance analysis, HW/SW design tradeoffs & functional verification


Platform-based Design (PBD) Methodology


❑ A PBD design is either a derivative design with added functionality, or a convergence design where previously separate functions are integrated
❑ Hierarchical, heterogeneous test architecture
– Manufacturing test design is incorporated into standard interfaces to support each block's specific test methodology
– BIST, scan BIST, full/partial scan, JTAG, …
❑ Hierarchical timing analysis
– Based on pre-characterized block timing
❑ Hierarchical routing
– Interblock routing is to and from area-based block connectors
– Constraint-driven to meet SI and interface timing requirements
❑ Physical design is a key stage in the flow
– Usually 0.25 µm and smaller process technology
– DSM interconnect-dominated delays
❑ Uses primarily firm or hard VCs
– Predictable, and pre-verified


Platform-based Design (PBD) Methodology


Waterfall Design
❑ The project transitions from phase to phase in a step function, never returning to the activities of a previous phase
❑ Worked well up to 100k gates & down to 0.5 µm
❑ H/W & S/W developments are serialized
❑ High investment at each level
❑ Requirements must be well defined first
❑ The domain & the solution to the problem are extremely stable
❑ The client must wait until the end to see any product
❑ Suited to data administration, because data persists
❑ Product failure signals process failure, promoting fear
❑ Upstream investments save on downstream costs
❑ Little or no prototyping


Spiral Design
❑ A process for developing a system in steps; throughout the various steps, feedback is obtained & incorporated back into the process
❑ As complexity increases, geometries shrink & time-to-market pressures continue to escalate, chip designers are moving from the waterfall to the spiral model
❑ The design team works on multiple aspects of the design simultaneously, incrementally improving each area as the design converges on completion
❑ Parallel, concurrent H/W & S/W development
❑ Parallel verification & synthesis
❑ Develop modules only if not available
❑ Planned iteration throughout
❑ Smaller investment at each level & iteration
❑ Requirements & problem can shift
❑ Success criteria determined for each iteration
❑ The client may see the product sooner, though the product is likely lower quality initially
❑ Inherent to the real world, because the world is influenced by the design
❑ Product failure is just part of the design process
❑ Costs similar across the board
❑ Prototype dependent


Spiral / Waterfall Design Processes

[Figure: spiral vs waterfall design processes.]

Top-Down Vs Bottom-Up Design
❑ Top-Down Design:
– Choice of algorithm (optimization)
– Choice of architecture (optimization)
– Definition of functional modules
– Definition of design hierarchy
– Split up into small boxes or units (modules)
– Define required units (adders, state machines, etc.)
– Map into the chosen technology (synthesis, schematic, layout); change algorithms or architecture if speed or chip-size problems arise
– Behavioral simulation tools
❑ Bottom-Up Design:
– Build gates in a given technology
– Build basic units using gates
– Build generic modules of use
– Put modules together
– Hope that you arrive at some reasonable architecture
– Gate-level simulation tools
Old-fashioned design methodology à la discrete logic


SoC Requirements & Specifications


❑ It is important to carefully analyze the requirements, the available time & the market situation to determine all factors prior to design:
– Compatibility with previous designs
– Reuse of previous designs
– Customer requests/complaints
– Sales reports
– Cost analysis
– Competitive equipment analysis
– Trouble (reliability) reports on previous products


SoC Design Flow
[Figure: system design process: concept definition → system specification → system simulation → digital spec (complexity, performance & power) → modeling / system simulation (II) → HDL design → HDL simulation → FPGA synthesis → FPGA prototype; feedback loops feed design improvements back from simulation & design measurements back from the prototype.]

SoC Flow
[Figure: SoC flow, part 1: electronic-system-level design with generic HW & SW IP blocks → specify → architecture platform → HW/SW partition → integrate application-specific HW IP blocks into the architecture platform, & integrate application-specific SW IP modules with the operating system → HW/SW co-simulation & functional simulation → HW design flow & low-level SW design flow → HW & SW (low level) on an FPGA platform.]

SoC Flow
[Figure: SoC flow, part 2: HW & SW (low level) on the FPGA platform → physical design & application SW development → HW/SW verification on an application prototype or development board, with SW test → prototype IC fabrication → IC fabrication → ship ICs & SW to clients.]

Design Methodologies

[Figures: front-end ASIC design flow; back-end design flow (generic physical flow).]

SoC Design Flow


Four major areas of SoC design


❑ Block (IP) authoring
❑ VC delivery
❑ Chip integration
❑ Software development (design environment)


EDA/CAD Tools


SoC Processor Area


❑ SoCs have die sizes of 10–15 mm on a side. Dies are produced in bulk from a large wafer, around 30 cm in diameter (about 12 inches)
❑ Defects occur randomly over the wafer surface
❑ With N_G good dies, N_D defective dies, die area A & defect density ρ_D, a Poisson distribution of defects gives the yield:

  \[ Y = \frac{N_G}{N_G + N_D} = e^{-\rho_D A} \]

SoC Processor Area

❑ rbe: register bit equivalent, a technology-independent unit of layout area equal to the area of a one-bit register cell; areas of other components are expressed as multiples of one rbe

SoC Baseline Area Model


❑ Example: a manufacturing process with defect density ρ_D = 0.2 defects/cm² & die area A = 25 mm² gives a yield of Y ≈ 0.95
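❑ Checking that with the Poisson yield formula above (A = 25 mm² = 0.25 cm²):

  \[ Y = e^{-\rho_D A} = e^{-0.2 \times 0.25} = e^{-0.05} \approx 0.951 \]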

❑ The smaller the feature size, the more logic can be accommodated
❑ Latches take about 10%, & buses, routing, clocking & overall control about 40% of the space
❑ Total system area: processor + storage = 2462 A, leaving 5200 − 2462 = 2738 A for the cache


Baseline die floor plan
❑ Summary of the area design rules:
1. Compute the target chip size from the target yield & defect density
2. Compute the die cost
3. Compute the net available area; allow 10–20% for pins, guard ring, power, etc.
4. Determine the rbe size from the minimum feature size
5. Allocate area based on a trial system architecture until the base size is calculated
6. Allocate area for cache & storage


SoC Power Management

[Figure: battery capacity & duty cycle.]

SoC Future
❑ Have 100s of hardware blocks
❑ Have billions of transistors
❑ Have multiple processors
❑ Have large wire-to-gate delay ratios
❑ Handle large amounts of high-speed data
❑ Need to support “plug-and-play” IP blocks
❑ Future NoC needs to be ready for these SoCs...


Networks-on-Chip


SoC Nightmare
[Figure: the "board-on-a-chip" approach: DMA, CPU & DSP sharing a system bus; a memory controller & bridge; MPEG & I²C blocks on a peripheral bus; plus ad-hoc control wires. The architecture is tightly coupled.]

Evolution or Paradigm Shift?


❑ Architectural paradigm shift
– Replace wire spaghetti by an intelligent network infrastructure
❑ Design paradigm shift
– Busses & signals replaced by packets
❑ Organizational paradigm shift
– Create a new discipline, a new infrastructure responsibility
❑ Communication bandwidth: the rate of information transfer b/w a module & the surrounding environment in which it operates, usually measured in bytes/sec
[Figure: a computing module attached to a bus, versus modules attached to network routers over network links.]

Bus based On-Chip Communication
❑ Buses are collections of wires to which one or more components are connected
❑ Master (or initiator): the IP component that initiates a read or write data transfer
❑ Slave (or target): an IP component that does not initiate transfers & only responds to incoming transfer requests
❑ Arbiter: controls access to the shared bus, using an arbitration scheme to select the master to grant access to (a minimal sketch follows below)
❑ Decoder: determines which component a transfer is intended for
❑ Bridge: connects two busses & acts as a slave on one side & a master on the other
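❑ A minimal, hypothetical Verilog sketch (assumed, not from the slides) of the arbiter's job, using a fixed-priority scheme for four masters; real SoC buses typically layer round-robin or TDMA policies on the same request/grant idea:

  // Fixed-priority arbiter: req[0] wins over req[1], and so on.
  module bus_arbiter (
    input  [3:0] req,    // one request line per master
    output [3:0] grant   // one-hot grant back to the winning master
  );
    assign grant[0] = req[0];
    assign grant[1] = req[1] & ~req[0];
    assign grant[2] = req[2] & ~(|req[1:0]);
    assign grant[3] = req[3] & ~(|req[2:0]);
  endmodule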


Types of Buses
❑ A bus typically consists of three types of signal lines:
❑ Address
– Carry the address of the destination for which the transfer is initiated
– Can be shared or separate for read & write
❑ Data
– Carry information b/w source & destination components
– Can be shared or separate for read & write data
– The choice of data width is critical for application performance
❑ Control
– Requests and acknowledgements
– Specify more information about the type of data transfer
• Byte enable, burst size, cacheable/bufferable, write-back/through, …
[Figure: a bus drawn as separate address, data & control lines.]


Bus Standards
❑ Bus standards: microprocessors have adopted a number of bus standards such as the VME bus, Intel Multibus-II, ISA bus, PCI bus, etc. to connect ICs on PCBs in system-on-board implementations
❑ Arbitration & protocols: the bus master is the unit that initiates communication on a computer bus or I/O path. In an SoC, the bus master is a processor, and the slaves are I/O devices & memory components
❑ The bus master controls the bus paths using the slave address, control signals & the flow of data signals b/w master & slaves; this process is called arbitration
❑ A bus protocol is an agreed set of rules for transmitting information b/w two or more devices over a bus. Protocols determine:
❑ 1. The type & order of data being sent
❑ 2. How the sending device indicates that it has finished sending information
❑ 3. The data compression method used
❑ 4. How the receiving device acknowledges reception of information
❑ 5. How arbitration is performed to resolve contention on the bus, in what priority, & what type of error checking is to be used


Bus Standards
❑ Bus bridge: a module that connects together two buses, which are not necessarily of the same type
❑ Three functions:
– 1. If the two buses use different protocols, the bus bridge provides the necessary format & standard conversion
– 2. A bridge is inserted b/w two buses to segment them & keep traffic contained within the segments. This improves concurrency: both buses can operate at the same time
– 3. A bridge often contains memory buffers & associated control circuits that allow write posting
❑ Physical bus structure: depends on the number of wires, paths, cycle time, etc. & the protocol
❑ Bus varieties: buses may be unified or split. On a unified bus, an address is initially transmitted in a bus cycle, followed by one or more data cycles; a split bus has separate buses for each of these functions


SoC AMBA Buses
❑ Two commonly used SoC bus standards are the Advanced Microcontroller Bus Architecture (AMBA) bus developed by ARM & the CoreConnect bus by IBM. AMBA: two main buses are defined in the AMBA specification:
❑ The Advanced High-performance Bus (AHB) is designed to connect embedded processors, peripherals, Direct Memory Access (DMA) controllers, on-chip memory & interfaces. It is a high-speed, high-bandwidth bus architecture, with 32-bit data operation extendable to 1024 bits & concurrent multiple master/slave operation
❑ The Advanced Peripheral Bus (APB) has lower performance than the AHB & targets slower peripheral modules. It is designed for low power & reduced interface complexity (a sketch of an APB-style slave follows below)
❑ A third bus, the Advanced System Bus (ASB), is designed for lower-performance 16/32-bit µCs
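❑ As a hedged illustration of the APB's reduced interface complexity, a minimal sketch of an APB-style slave holding one read/write register; the signal names follow the AMBA APB convention, but the one-register map & omitted address decode are assumptions for illustration:

  module apb_reg_slave (
    input         pclk, presetn,
    input         psel, penable, pwrite,
    input  [31:0] paddr,          // address decode omitted in this sketch
    input  [31:0] pwdata,
    output [31:0] prdata
  );
    reg [31:0] ctrl;
    // Write completes in the second (ENABLE) cycle of a selected transfer
    always @(posedge pclk or negedge presetn) begin
      if (!presetn)                       ctrl <= 32'h0;
      else if (psel && penable && pwrite) ctrl <= pwdata;
    end
    assign prdata = ctrl;  // single-register read-back
  endmodule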


SoC CoreConnect Bus


❑ IBM CoreConnect bus: designed for a specific processor core, the PowerPC, but also adaptable to other processors. The CoreConnect bus & the AMBA bus share many common features
❑ The CoreConnect architecture provides three buses for interconnecting cores, library macros & custom logic:
– Processor Local Bus (PLB)
– On-chip Peripheral Bus (OPB)
– Device Control Register (DCR) bus
[Figure: a CoreConnect-based SoC.]


SoC PLB/Split Transaction Buses
❑ PLB bus: used for high-bandwidth, high-performance & low-latency interconnection b/w processors, memory & DMA controllers
❑ It is a fully synchronous, split-transaction bus with separate address, read & write data buses, allowing two simultaneous transfers per clock cycle
❑ All masters/slaves have their own address, read/write data & control signals, also called transfer-qualifier signals
❑ Split-transaction bus: the PLB address & read/write data buses are decoupled, allowing address cycles to be overlapped with read or write data cycles, and read data cycles to be overlapped with write data cycles
[Figure: PLB transfer protocol.]


SoC OPB Bus


❑ The On-chip Peripheral Bus (OPB) is designed for easy connection of on-chip peripheral devices
❑ It provides a common design point for various on-chip peripherals
❑ It is a fully synchronous bus that functions independently at a separate level of the bus hierarchy. It is not intended to connect directly to the processor core; the processor core can access slave peripherals on this bus through the PLB-to-OPB bridge unit, which is a separate core


Bus Topologies
❑ A hierarchical shared bus reduces the impact of capacitance across the two segments
❑ Reduces contention & energy
❑ Improves system throughput
– Multiple ongoing transfers on different buses
[Figure: shared bus vs hierarchical shared bus.]

Types of Bus Topologies

❑ Full crossbar/matrix bus (point-to-point)
❑ Partial crossbar/matrix bus
❑ Tri-state buffer based bidirectional signals
❑ Ring bus
[Figure: the four bus topologies above.]

Bus Physical Structure

[Figure: two physical bus structures: MUX-based signal selection & AND-OR based signal merging.]
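❑ A minimal sketch (assumed widths & names) of the multiplexer-based structure, which avoids tri-state drivers by selecting one source onto the shared line; the comment shows the equivalent AND-OR form with a one-hot grant:

  module mux_bus (
    input      [1:0]  sel,             // select code from arbiter/decoder
    input      [31:0] d0, d1, d2, d3,  // data driven by each master
    output reg [31:0] bus_out
  );
    always @* begin
      case (sel)
        2'd0: bus_out = d0;
        2'd1: bus_out = d1;
        2'd2: bus_out = d2;
        default: bus_out = d3;
      endcase
    end
    // AND-OR equivalent with a one-hot grant g[3:0]:
    //   assign bus_out = ({32{g[0]}} & d0) | ({32{g[1]}} & d1)
    //                  | ({32{g[2]}} & d2) | ({32{g[3]}} & d3);
  endmodule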

Electrical Properties & System Integration


1. Efficient interconnect: delay, power, noise, scalability, reliability
2. Increased system integration productivity
3. Enables multiprocessors for SoCs
[Figure: scalability of area & power in NoCs: wire area & power for a NoC grow roughly linearly, O(n), in the number of modules n, whereas simple buses, segmented buses & point-to-point wiring grow super-linearly.]

On-Chip Communication
❑ Bus structure: highly connected multi-core systems consume a lot of power & increase the capacitive load
❑ Point-to-point structure: optimal in bandwidth, availability, latency & power usage, and simple to design, but occupies a large hardware area & causes routing problems
❑ NoC structure: lower complexity & cost, reusable components, efficient & high-performance interconnect, scalable; but it may add latency, & software needs clean synchronization on the µP
[Figure: bus, point-to-point & NoC network structures.]

Bus Vs NoCs

❑ Bus-based interconnect (bus-based & irregular architectures):
– Low cost
– Easier to implement
– Flexible
– But: bandwidth limited, speed goes down as units are added, no concurrency, central arbitration, no layers of abstraction
❑ Networks on Chip (regular architectures), a layered approach:
– Better electrical properties
– Higher bandwidth
– Energy efficiency
– Scalable
– Concurrency & reuse
– Built-in pipelining
– Separate abstraction layers
– But: no performance guarantee, extra delay in routers, area & power still increase

Typical NoC Router
[Figure: a typical NoC router: per-port input buffers feeding a crossbar switch, controlled by routing & arbitration logic.]
▪ This example uses a centralized arbiter for all I/O ports
▪ Distributed arbitration can also be used

Layered Approach
[Figure: separation of concerns: protocol layers (software, transport, network, wiring) matched with their modeling disciplines (traffic modeling, queuing theory, architectures, networking), applied to a regular NoC of processing elements (PEs) connected through routers.]

Networks-on-Chip (NoC)
❑ A network-on-chip (NoC) is a communication subsystem on an IC b/w the IP cores in an SoC
❑ A NoC can span synchronous & asynchronous clock domains or use unclocked asynchronous logic
❑ NoC technology applies networking theory & methods to on-chip communication & brings notable improvements over conventional bus & crossbar interconnections
❑ A NoC improves the scalability of an SoC & the power efficiency of complex SoCs compared to other designs
❑ Even significant design effort on bus-style interconnect is not going to be sufficient for large SoCs:
– The physical implementation does not scale: bus fanout, loading & arbitration depth all reduce the operating frequency
– The available bandwidth does not scale: the single bus must be shared by all masters & slaves


NoC Characteristics
❑ Communication latency: the time delay b/w a module requesting data & receiving a response to the request
❑ Master & slave: a unit can initiate or react to communication requests. A master, such as a processor, controls transactions b/w itself & other modules; a slave, such as a memory, responds to requests from the master
❑ Concurrency requirement: the number of independent simultaneous communication channels operating in parallel
❑ Packet or bus transaction: the size and definition of the information transmitted in a single transaction. For a bus: an address with control bits (R/W) and data. A packet consists of a header (address & control) plus data (see the sketch below)
❑ Interconnect Control Unit (ICU): manages the interconnect protocol & physical transaction. If an IP core requires protocol translation to access the bus, the unit is called a bus wrapper
❑ Multiple clock domains: different IP modules may operate at different clock and data rates
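❑ A tiny, hypothetical Verilog sketch of the packet framing just described: a header (destination address & control bits) prepended to the data payload; all field widths are assumptions:

  module packet_pack (
    input  [7:0]  dst,    // header: destination node address
    input  [3:0]  ctrl,   // header: control bits (e.g., R/W, burst length)
    input  [31:0] data,   // payload
    output [43:0] pkt     // header-first packet, 44 bits total
  );
    assign pkt = {dst, ctrl, data};
  endmodule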


NoC Properties
❑ A NoC is more than a single shared bus
❑ A NoC provides point-to-point connections b/w any two hosts attached to the network, either by crossbar switches or through node-based switches
❑ A NoC provides high aggregate bandwidth through parallel links
❑ In a NoC, communication is separate from computation
❑ A NoC uses a layered approach to communications, although with few network layers due to complexity and expense
❑ NoCs support pipelining and provide intermediate data buffering b/w sender & receiver


NoC Properties
❑ NoC advantages:
– Regular geometry that is scalable
– Flexible QoS guarantees
– Higher bandwidth
– Reusable components: buffers, arbiters, routers, protocol stack
– No long global wires (or global clock tree), hence no problematic global synchronization
– GALS: globally asynchronous, locally synchronous design
– Reliable & predictable electrical & physical properties
❑ Bus advantages:
– The silicon cost of a bus is small
– Almost directly compatible with most available IPs, including software running on CPUs
– Concepts are simple & well understood
❑ Problems:
– NoC: internal network contention causes (often unpredictable) latency; the network takes significant silicon area; bus-oriented IPs need smart wrappers; software needs clean synchronization in multiprocessor systems; system designers need re-education for the new concepts
– Bus: every unit attached adds parasitic capacitance, so electrical performance degrades with growth; bus timing is difficult in a deep-submicron process; bus arbiter delay grows with the number of masters; bandwidth is limited & shared by all units attached
[Figure: the on-chip distance reachable within one clock period shrinks from year 2005 (1 ns, 1 GHz) to year 2010 (0.1 ns, 10 GHz).]

Routing Algorithms
❑ NoC routing algorithms should be simple
– Complex routing schemes consume more device area (complex routing/arbitration logic)
– Additional latency for channel setup/release
– Deadlocks must be avoided
❑ Deadlock occurs when it is impossible for any message to move (without discarding one)
– Buffer deadlock occurs when all buffers are full in a store-and-forward network. This leads to a circular wait condition, each node waiting for space to receive the next message
– Channel deadlock is similar, but results when all channels around a circular path in a wormhole-based network are busy (recall that each "node" has a single buffer used for both input and output)
❑ Some additional features are highly desirable
– QoS, fault-tolerance


2D-Mesh NoC–XY Routing


❑ X-Y routes are determined completely by the source & destination addresses
❑ In X-Y routing, a message travels "horizontally" (in the X dimension) from the source node to the "column" containing the destination, where the message then travels vertically
– The X direction is resolved first, then the Y direction
❑ Four possible direction pairs: east-north, east-south, west-north, & west-south
❑ Advantages of X-Y routing (see the sketch below):
– Very simple to implement
– Deterministic
– Deadlock-free
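❑ A minimal, hypothetical Verilog sketch of the X-then-Y decision made at each router (coordinate widths & port encoding are assumptions, not from the slides):

  module xy_route (
    input  [3:0] cur_x, cur_y,   // this router's mesh coordinates
    input  [3:0] dst_x, dst_y,   // destination from the packet header
    output [2:0] port            // 0:local 1:east 2:west 3:north 4:south
  );
    assign port = (dst_x > cur_x) ? 3'd1 :  // correct X first...
                  (dst_x < cur_x) ? 3'd2 :
                  (dst_y > cur_y) ? 3'd3 :  // ...then correct Y
                  (dst_y < cur_y) ? 3'd4 :
                                    3'd0;   // arrived: eject to local PE
  endmodule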


Optimizing a NoC for a Particular Application
❑ The NoC architecture has to be flexible & parametric
• Parameters allow customization
• Parameters: buffer depth, number of virtual channels, NoC size, etc.
❑ Application-specific optimization:
– Buffers
– Routing
– Topology
– Mapping to the topology
– Implementation and reuse
❑ Architecture optimization:
– Quality of Service (QoS) support
– Topology
❑ Fault tolerance:
– Gossiping architectures
[Figure: communication-centric design loop: an architecture/application model is built from the application & an architecture library, then NoC optimisation iterates configure → evaluate → analyse/profile → refine until good, followed by synthesis of the optimized NoC.]

NoC Design Flow


❑ Optimize capacity for a performance/power tradeoff
❑ Capacity allocation is a traditional WAN optimization problem
[Figure: NoC design flow: extract inter-module traffic → place modules → allocate link capacities → verify QoS and cost, illustrated on a mesh of modules & routers (R).]

NoC Structure
❑ A NoC is a packet-switched on-chip communication network designed using a layered methodology
– "Routes packets, not wires"
❑ NoCs use packets to route data from the source to the destination PE via a network fabric that consists of:
• Switches (routers)
• Interconnection links (wires)
❑ NoCs are an attempt to scale down the concepts of large-scale networks & apply them to the embedded SoC domain
❑ They follow the ISO/OSI network protocol stack model
[Figure: processing elements (PEs) attached to a grid of routers.]

NoC Topologies
❑ Well-known NoC topologies include:
a) SPIN
b) CLICHÉ
c) Torus
d) Folded torus
e) Octagon
f) BFT (butterfly fat tree)

NoC Topologies
❑ Direct topologies:
– Each node has direct point-to-point links to a subset of other nodes in the system, called neighboring nodes
– Nodes consist of computational blocks and/or memories, as well as a network interface (NI) block that acts as a router
– As the number of nodes in the system increases, the total available communication bandwidth also increases
– The fundamental trade-off is b/w connectivity & cost
– Most direct network topologies have an orthogonal implementation, where nodes can be arranged in an n-dimensional orthogonal space
– Routing for such networks is fairly simple
• e.g. n-dimensional mesh, torus, folded torus, hypercube, & octagon
❑ 2D mesh topology:
– All links have the same length
• Eases physical design
– Area grows linearly with the number of nodes
– Must be designed to avoid traffic accumulating in the center of the mesh
[Figures: a direct topology; a 2D mesh topology.]

NoC Topologies
❑ Torus (k-ary n-cube):
– An n-dimensional grid with k nodes in each dimension
– A k-ary 1-cube (1-D torus) is essentially a ring network with k nodes
• Limited scalability, as performance decreases when more nodes are added
– The k-ary 2-cube (i.e., 2-D torus) topology is similar to a regular mesh
• Except that nodes at the edges are connected to switches at the opposite edge via wrap-around channels
• Long end-around connections can, however, lead to excessive delays

NoC Topology
❑ Folded torus topology:
– Overcomes the long-link limitation of a 2-D torus
– Links have the same length
– Meshes and tori can be extended by adding bypass links to increase performance at the cost of higher area
❑ Octagon topology:
– Another example of a direct network
– Messages being sent between any 2 nodes require at most two hops
– More octagons can be tiled together to accommodate larger designs
• by using one of the nodes as a bridge node
[Figures: folded torus topology; octagon topology.]

NoC Topology
❑ Indirect topologies:
– Each node is connected to an external switch, and switches have point-to-point links to other switches
– Switches do not perform any information processing, and correspondingly nodes do not perform any packet switching
– e.g. SPIN, crossbar topologies
❑ Fat tree topology:
– Nodes are connected only to the leaves of the tree
– More links near the root, where bandwidth requirements are higher
❑ k-ary n-fly butterfly network:
– Blocking multi-stage network: packets may be temporarily blocked or dropped in the network if contention occurs
– k^n nodes, and n stages of k^(n-1) k×k crossbars (see the check below)
– e.g. a 2-ary 3-fly butterfly network
[Figures: fat tree topology; k-ary n-fly butterfly network.]
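❑ Sanity-checking those parameters for the 2-ary 3-fly example:

  \[ N = k^{n} = 2^{3} = 8 \ \text{nodes}, \qquad n = 3 \ \text{stages}, \qquad k^{\,n-1} = 2^{2} = 4 \ \text{crossbars of size } 2 \times 2 \ \text{per stage} \]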

NoC Topology
❑ (m, n, r) symmetric Clos network:
– A three-stage network in which each stage is made up of a number of crossbar switches
– m is the number of middle-stage switches
– n is the number of input/output nodes on each input/output switch
– r is the number of input and output switches
– e.g. a (3, 3, 4) Clos network
– Non-blocking network (see the conditions below)
– Expensive (several full crossbars)
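❑ For context, the classic Clos conditions (a standard result, not on the original slide) show why the (3, 3, 4) example above, with m = n, is non-blocking in the rearrangeable sense:

  \[ m \ge 2n - 1 \ \text{(strict-sense non-blocking)}, \qquad m \ge n \ \text{(rearrangeably non-blocking)} \]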
❑ Benes network:
– A rearrangeable network in which paths may have to be rearranged to provide a connection, requiring an appropriate controller
– A Clos topology composed of 2 x 2 switches
– e.g. a (2, 2, 4) re-arrangeable Clos network constructed using two (2, 2, 2) Clos networks with 4 x 4 middle switches
[Figures: an (m, n, r) symmetric Clos network; a Benes network.]

NoC Topology
❑ Irregular or ad hoc network topologies:
– Customized for an application
– Usually a mix of shared-bus, direct & indirect network topologies
– e.g. reduced mesh, cluster-based hybrid topology


NoC Example
[Figure: example NoC: a 3×3 grid of routing nodes, each attached to a processor master, with a global memory slave & two global I/O slaves attached at the edges of the grid.]

NoC Based FPGA Architecture


Functional FR
FR
CR CR ETH
unit CPU
I/F
CNI CNI CNI CNI CNI CNI NoC for inter-
R R R R R R routing

FR
CR CR
SERDES
CNI CNI CNI CNI CNI CNI
Routers R R R R R R

FR FR
FR FR
PCI CR D/A
DSP CPU
A/D
CNI CNI CNI CNI CNI CNI Configurable
R R R R R R
region – User
logic

CR CR CR CR
Configurable
network CNI CNI CNI CNI CNI CNI
interface R R R R R R

FR
FR
CR CR ETH
DRAM
I/F
CNI CNI CNI CNI CNI CNI
R R R R R R

106

106

Networks Processor


Ideal NoC Network


❑ What would the ideal network look like?
– Low area overhead
– Simple implementation
– High-speed operation
– Low-latency
– High-bandwidth
– Operate at a constant frequency even with additional blocks
– Increase available bandwidth as blocks are added
– Provide performance guarantees
– Have a “universal” interface

These are competing requirements: Design a network that is the “best” fit


SoC Interconnect Switches
❑ For the units on a chip, link bandwidth is 16–128 wires
❑ Node fanout is the number of channels connecting a node to its neighbors
❑ Nodes can be altered both to establish connectivity & to improve n/w bandwidth
[Figure: a static network (the links b/w units are fixed).]

Static & Dynamic Network


❑ A static n/w consists of 2D switches that connect the SoC modules
❑ A dynamic n/w consists of a centralized crossbar switch; these switches avoid traffic congestion at different clock frequencies
❑ A crossbar-based interconnect connects local blocks on the same chip
[Figures: a dynamic network; a switch-based interconnect scheme.]
