
Network-on-Chip Introduction

Ingo Sander
ingo@imit.kth.se

Network-on-Chips
- Today buses are the dominant interconnect technology for systems-on-chip.
- However, buses have severe limitations that become evident when the number of components in a system is large:
  - The bus is a communication bottleneck; its bandwidth is limited.
  - Buses are scalable only to a certain extent.
- Networks-on-chip are intended to overcome the limitations of buses, since they provide a much larger amount of communication resources and are scalable.
2B1448 SoC Architectures 2

December 14, 2005

A Network-on-Chip
[Figure: a network-on-chip drawn as a grid of switches (S) and terminal nodes (T) connected by channels]

A terminal node can be any kind of component, such as:
- Processor
- Memory
- Hardware component
- Bus-based system with several components, e.g. processor and memory

Network-on-Chip
[Figure: grid of switches (S) and terminal nodes (T)]
- Information in the form of packets is routed via channels and switches from one terminal node to another.

Network Interface
[Figure: a terminal node (resource) attached to a network switch through a network interface]
- Different terminals with different interfaces are to be connected to the network.
- The network uses a specific protocol, and all traffic on the network has to comply with the format of this protocol.

Network Interface
[Figure: terminal node (resource), resource network interface, network interface, network switch]
- To allow different resources to connect to the network, the network interface can be divided into:
  - a resource-independent part (Network Interface)
  - a resource-dependent part (Resource Network Interface)
- This is also the solution for the Nostrum NoC (developed at KTH).

Network abstractions
- The International Organization for Standardization (ISO) developed the Open Systems Interconnection (OSI) model to describe networks:
  - 7-layer model
  - Provides a standard way to classify network components and operations
- Networks-on-chip use a similar protocol stack corresponding to the 4 lowest layers of the OSI model.

OSI model
Layer          Role
application    end-user interface
presentation   data format
session        application dialog control
transport      connections
network        end-to-end service
data link      reliable data transport
physical       mechanical, electrical

OSI layers
- Physical: connectors, bit formats, electrical properties
- Data link: error detection and control across a single link (single hop)
- Network: end-to-end multi-hop data communication
- Transport: connection-oriented services over multiple links, e.g. ordering of packets, error-free connection

OSI layers, contd.
- Session: services for end-user applications: data grouping, checkpointing, etc.
- Presentation: data formats, transformation services
- Application: interface between network and end-user programs

Internet Protocol (not an on-chip protocol!)
[Figure: node A and node B each implement all seven layers (application, presentation, session, transport, network, data link, physical); the IP router between them implements only the network, data link and physical layers]

Units of Resource Allocation
- A message is a contiguous group of bits that is delivered from a source terminal to a destination terminal. A message consists of packets.
- A packet is the basic unit for routing and sequencing. Packets may be divided into flits.
- A flit (flow control digit) is the basic unit of bandwidth and storage allocation. Flits do not carry any routing or sequence information and have to follow the route of the whole packet.
- A phit (physical transfer digit) is the unit that is transferred across a channel in a single clock cycle.

Units of Resource Allocation
[Figure: a message is split into packets; a packet has a header with routing information (RI) and a sequence number (SN) and is divided into a head flit, body flits and a tail flit; a flit carries a type field and a virtual channel (VC) field and is transferred as phits]
- Messages, packets, flits and phits are handled in different layers of the network protocol.
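The hierarchy of message, packet, flit and phit can be illustrated with a small sketch. The class names, field names and the sizes used below are assumptions made for illustration only; the slides do not prescribe any particular format. A packet that fits into a single flit is simply marked as a head flit here.

```python
from dataclasses import dataclass


@dataclass
class Flit:
    kind: str        # "head", "body" or "tail" (hypothetical type field)
    payload: bytes


@dataclass
class Packet:
    route_info: int  # RI: routing information, here just a destination id
    seq_no: int      # SN: sequence number of the packet within the message
    flits: list


def chunks(data: bytes, size: int) -> list:
    """Split data into consecutive pieces of at most `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def packetize(message: bytes, dest: int, packet_bytes: int, flit_bytes: int) -> list:
    """Split a message into packets and each packet into flits."""
    packets = []
    for seq_no, body in enumerate(chunks(message, packet_bytes)):
        pieces = chunks(body, flit_bytes)
        flits = []
        for i, piece in enumerate(pieces):
            if i == 0:
                kind = "head"
            elif i == len(pieces) - 1:
                kind = "tail"
            else:
                kind = "body"
            flits.append(Flit(kind, piece))
        packets.append(Packet(dest, seq_no, flits))
    return packets


def phits_per_flit(flit_bytes: int, channel_width_bits: int) -> int:
    """Number of clock cycles (phits) needed to move one flit over a channel."""
    bits = flit_bytes * 8
    return (bits + channel_width_bits - 1) // channel_width_bits


packets = packetize(bytes(40), dest=5, packet_bytes=16, flit_bytes=4)
```

Only the packet carries routing and sequence information; the flits inherit the route of their packet, which matches the definitions above.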

Network-on-Chip
[Figure: grid of switches (S) and terminal nodes (T)]
Factors that influence the performance of a network-on-chip are:
- Topology (static arrangement of channels and nodes)
- Routing techniques (selection of a path through the network)
- Flow control (how network resources are allocated as packets traverse the network)
- Router architecture (buffers and switches)
- Traffic pattern

Network-on-Chip Topologies
Ingo Sander
ingo@imit.kth.se
Dally: Ch 3, (4), 5

Network Topology
- The network topology refers to the static arrangement of channels and nodes in the network.
- A good topology fulfills the requirements of the traffic at reasonable cost.
- A network topology can be compared with a network of roads.

Topology Examples
[Figure: 4-node ring (nodes 0 to 3); butterfly with 8 nodes (terminals 0 to 7, three stages of four switch nodes); 4x4 torus (nodes 00 to 33)]

[Figure legend: a combined node consists of a terminal node and a switch node]

Nomenclature Network-on-Chip
- The topology of an interconnection network is specified by a set of nodes $N^*$ connected by a set of channels $C$.
- Messages originate and terminate in a set of terminal nodes $N$, where $N \subseteq N^*$.
- Here, for the 4x4 torus with bidirectional channels:
  - $|N| = |N^*| = 16$
  - $|C| = 16 \cdot 4 = 64$
[Figure: 4x4 torus (channels are bidirectional)]

Nomenclature Network-on-Chip
- Each channel $c = (x, y) \in C$ connects a source node $x$ to a destination node $y$, where $x, y \in N^*$.
- A channel is characterized by its width $w_c$ or $w_{xy}$, which is the number of parallel signals it contains.
- The source node of a channel is denoted $s_c$ and the destination node $d_c$.
[Figure: 4x4 torus (channels are bidirectional)]

Nomenclature Network-on-Chip
- Its frequency $f_c$ or $f_{xy}$ is the rate at which bits are transported on each signal.
- Its latency $t_c$ or $t_{xy}$ is the time required for a bit to travel from $x$ to $y$.
- Usually the latency is directly related to the physical length of the channel, $l_c = v \, t_c$, by a propagation velocity $v$.
- The bandwidth of the channel is $b_c = w_c f_c$.
[Figure: 4x4 torus (channels are bidirectional)]

Nomenclature Network-on-Chip
- Each switch node $x$ has a channel set $C_x = C_{Ix} \cup C_{Ox}$, where
  - $C_{Ix} = \{c \in C \mid d_c = x\}$ is the input channel set
  - $C_{Ox} = \{c \in C \mid s_c = x\}$ is the output channel set
- The degree of $x$ is $\delta_x = |C_x|$, which is the sum of the in degree and the out degree.
[Figure: 4x4 torus (channels are bidirectional)]

Direct and Indirect Networks
- Direct network: every node in the network is both a terminal and a switch.
- Indirect network: nodes are either switches or terminals.
[Figure: a direct network and an indirect network]

Bisection of a network
- A bisection of a network is a cut that partitions the entire network nearly in half.
- The channel bisection of a network is the minimum channel count over all bisections of the network:
  $B_C = \min_{\text{bisections}} |C(N_1, N_2)|$
- The bisection bandwidth of a network is the minimum bandwidth over all bisections of the network:
  $B_B = \min_{\text{bisections}} B(N_1, N_2)$

Bisection
Example: 4-node ring
- Channel bisection: $B_C = 4$ (2 bidirectional channels go through the bisection)
- Bandwidth bisection: $B_B = 4b$ ($b$ is the bandwidth of each channel)
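The definition of the channel bisection can be checked by brute force on small networks. The sketch below is an illustration under stated assumptions: every bidirectional link is modelled as two unidirectional channels, only exactly balanced partitions are enumerated, and the function names are made up for this example. The approach is exponential in the number of nodes and is only meant for toy sizes.

```python
from itertools import combinations


def ring_channels(k):
    """Unidirectional channels of a k-node ring with bidirectional links."""
    chans = set()
    for x in range(k):
        chans.add((x, (x + 1) % k))
        chans.add((x, (x - 1) % k))
    return chans


def torus_channels(k):
    """Unidirectional channels of a k x k torus with bidirectional links."""
    chans = set()
    for r in range(k):
        for c in range(k):
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                chans.add(((r, c), ((r + dr) % k, (c + dc) % k)))
    return chans


def channel_bisection(channels):
    """Minimum number of channels crossing any balanced two-way partition."""
    nodes = sorted({node for chan in channels for node in chan})
    half = len(nodes) // 2
    best = None
    for part in combinations(nodes, half):
        n1 = set(part)
        # A channel crosses the cut if exactly one endpoint lies in n1.
        cut = sum(1 for (s, d) in channels if (s in n1) != (d in n1))
        if best is None or cut < best:
            best = cut
    return best


ring_bc = channel_bisection(ring_channels(4))
torus_bc = channel_bisection(torus_channels(4))
```

With a channel bandwidth $b$, the bisection bandwidth follows as $B_B = B_C \cdot b$ when all channels have the same bandwidth.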

Paths
- A path is an ordered set of channels $P = \{c_1, c_2, \ldots, c_n\}$, where $d_{c_i} = s_{c_{i+1}}$ for $i = 1 \ldots (n - 1)$.
- The length or hop count of a path is $|P|$.
- A minimal path from node $x$ to node $y$ is a path with the smallest hop count.
[Figure: 4x4 torus (channels are bidirectional) showing a minimal path ($|P| = 3$) and a non-minimal path ($|P| = 5$)]

Paths
- The set of all minimal paths between $x$ and $y$ is denoted $R_{xy}$.
[Figure: 4x4 torus (channels are bidirectional) showing several minimal paths ($|P| = 3$)]

Paths
- The diameter $H_{max}$ is the largest minimal hop count over all pairs of terminal nodes.
- The average minimum hop count $H_{min}$ is defined as the average hop count over all sources and destinations:
  $H_{min} = \frac{1}{N^2} \sum_{x, y \in N} H(x, y)$
- Here: $H_{min} = 2$

Distance in hops from node 00 in the 4x4 torus (channels are bidirectional); the largest minimal hop count is the diameter $H_{max} = 4$:

Node  Hops   Node  Hops   Node  Hops   Node  Hops
00    0      01    1      02    2      03    1
10    1      11    2      12    3      13    2
20    2      21    3      22    4      23    3
30    1      31    2      32    3      33    2
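The diameter and the average minimum hop count of a small torus can be computed directly from their definitions. The following is a minimal sketch, assuming nodes are addressed by coordinate tuples and that the minimal hop count in each dimension is the shorter way around the ring; the function names are illustrative.

```python
from itertools import product


def torus_hops(x, y, k):
    """Minimal hop count between nodes x and y of a k-ary n-cube (torus)."""
    return sum(min((a - b) % k, (b - a) % k) for a, b in zip(x, y))


def hop_statistics(k, n):
    """Return (H_max, H_min) over all source/destination pairs."""
    nodes = list(product(range(k), repeat=n))
    hops = [torus_hops(x, y, k) for x in nodes for y in nodes]
    h_max = max(hops)                     # diameter
    h_min = sum(hops) / len(nodes) ** 2   # average over all N^2 pairs
    return h_max, h_min


h_max, h_min = hop_statistics(4, 2)
```

For the 4x4 torus this reproduces the values on the slide, $H_{max} = 4$ and $H_{min} = 2$. Note that the average includes the pairs with $x = y$, as in the formula above.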

Paths
- A specific implementation may choose to incorporate some non-minimal paths.
- Then the actual average hop count $H_{avg}$ is defined over the paths used by the network:
  $H_{avg} \geq H_{min}$
[Figure: 4x4 torus (channels are bidirectional) showing a non-minimal path ($|P| = 5$)]

Paths
- The physical distance of the path is
  $D(P) = \sum_{c \in P} l_c$
- The delay of the path is
  $t(P) = D(P) / v$
[Figure: 4x4 torus (channels are bidirectional) showing a non-minimal path ($|P| = 5$)]

Traffic Patterns
- The traffic pattern is a very important factor for the performance of a network.
- In uniform random traffic each source is equally likely to send to each destination.
- Uniform random traffic is the most commonly used traffic pattern; however, it implies a balanced load and therefore often does not stress the network.

Throughput
- The throughput of a network is the data rate in bits per second that the network accepts per input port.
- The topology of a network has a significant impact on the throughput (besides flow control and routing).
- The ideal throughput is defined as the throughput assuming perfect routing and flow control:
  - Load is balanced over alternate paths
  - No idle cycles on bottleneck channels

Throughput
- Maximum throughput occurs when some channel of the network becomes saturated.
- The channel load $\gamma_c$ of a channel is
  - the ratio of the bandwidth demanded from the channel to the bandwidth of the input ports
  - (in other words) the amount of traffic that must cross the channel if each input injects one unit of traffic according to the given traffic pattern
- The channel that carries the largest fraction of the traffic determines the maximum channel load $\gamma_{max}$.

Throughput
- The ideal throughput $\Theta_{ideal}$ is the input bandwidth that saturates the bottleneck channel:
  $\Theta_{ideal} = b / \gamma_{max}$
- In general it is difficult to determine the maximum channel load $\gamma_{max}$, but for uniform traffic the task is much simpler.

Ideal Throughput in a Torus
- Assuming uniform traffic, 50% of the packets cross the bisection channels.
- The best throughput is achieved if the packets are evenly distributed over the bisection channels.
- The load on these channels is then $\gamma_B = N / (2 B_C)$.
- Thus $\gamma_{max} \geq N / (2 B_C)$.
- And the ideal throughput is $\Theta_{ideal} = b / \gamma_{max} \leq 2 b B_C / N$, which is an upper bound.
[Figure: 4x4 torus (channels are bidirectional)]

Another useful lower bound on channel load
- A packet needs on average $H_{min}$ hops to be delivered.
- There are $C$ channels in the network.
- We have $N$ nodes sending packets.
- With equal load on all channels, we get a lower bound for the maximum channel load:
  $\gamma_{c,LB} = \gamma_{max,LB} = \frac{N H_{min}}{C}$
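Both bounds are easy to evaluate numerically. The sketch below plugs in the values for the 4x4 torus used throughout these slides ($N = 16$, $C = 64$, $H_{min} = 2$) and assumes a channel bisection of $B_C = 16$; the function names and the normalised channel bandwidth $b = 1$ are illustrative choices, not part of the original material.

```python
def bisection_load_bound(n_nodes, channel_bisection):
    """Lower bound on the maximum channel load: N / (2 * B_C)."""
    return n_nodes / (2 * channel_bisection)


def hop_count_load_bound(n_nodes, h_min, n_channels):
    """Lower bound on the maximum channel load: N * H_min / C."""
    return n_nodes * h_min / n_channels


def ideal_throughput_bound(channel_bandwidth, n_nodes, channel_bisection):
    """Upper bound on the ideal throughput: 2 * b * B_C / N."""
    return 2 * channel_bandwidth * channel_bisection / n_nodes


# 4x4 torus, channel bandwidth normalised to b = 1 (assumption)
gamma_bisection = bisection_load_bound(16, 16)
gamma_hops = hop_count_load_bound(16, 2, 64)
theta_bound = ideal_throughput_bound(1.0, 16, 16)
```

For this network the two lower bounds on the channel load coincide at 0.5, so under uniform traffic each node could ideally inject up to twice the bandwidth of a single channel.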

Latency
- The latency of the network is the time required for a message to traverse the network, from the time the head arrives at the input port to the time the tail of the message departs the output port.
- Latency depends not only on topology, but also on routing, flow control and the design of the router.
- Here the focus lies on topology.

Latency
There are two latency components:
- Head latency $T_h$: time required for the head of the message to traverse the network
- Serialization latency $T_s = L / b$: time required for the tail to catch up (time for a message of length $L$ to cross a channel with bandwidth $b$)

Head Latency
- Head latency depends on two topology factors: router delay $T_r$ (time spent in the routers) and time of flight $T_w$ (time spent on wires).
  - $T_r = H_{min} t_r$
  - $T_w = D_{min} / v$ (average distance $D_{min}$, propagation velocity $v$)

Latency
- Together this gives the average latency:
  $T_0 = H_{min} t_r + D_{min} / v + L / b$ (no congestion)
- Clearly $H_{min}$, $D_{min}$ and $b$ are to a large extent determined by the topology.
- If there is congestion in the network there is a fourth term $T_C$.

Latency
[Figure: timing diagram of the head and the tail of a message: arrival at node x, router delay $t_r$, leave x, channel latency $t_{xy}$, arrival at node y, router delay $t_r$, leave y, channel latency, arrival at switch z; the tail follows the head by $L / b$]

Example from Dally
- The network has $N = 64$ nodes
- $H_{min} = 4$
- Channel width $w_c = 16$
- Channel frequency $f_c = 1$ GHz
- Channel latency $t_c = 5$ ns
- Router delay $t_r = 8$ ns
- Packet length $L = 64$ bytes

- $T_r = H_{min} t_r = 4 \cdot 8$ ns $= 32$ ns
- $T_w = H_{min} t_c = 4 \cdot 5$ ns $= 20$ ns
- $T_s = L / b = L / (f_c w_c) = 64 \cdot 8 / (1\,\text{GHz} \cdot 16) = 512 / 16$ ns $= 32$ ns
- $T_0 = 32$ ns $+ 20$ ns $+ 32$ ns $= 84$ ns
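The zero-load latency calculation from the example can be written as a small helper. This is a sketch under the same simplification the example uses, namely that the time of flight is taken as $H_{min} t_c$ instead of $D_{min} / v$; the function and parameter names are illustrative. Times are in nanoseconds and the frequency in GHz, so the channel bandwidth comes out in bits per nanosecond.

```python
def zero_load_latency_ns(h_min, t_r_ns, t_c_ns, length_bytes, width_bits, freq_ghz):
    """T0 = H_min * t_r + H_min * t_c + L / b, without any congestion term."""
    t_router = h_min * t_r_ns                 # T_r: time spent in routers
    t_wire = h_min * t_c_ns                   # T_w: time spent on wires
    bandwidth = width_bits * freq_ghz         # b = w_c * f_c in bits per ns
    t_serial = length_bytes * 8 / bandwidth   # T_s = L / b
    return t_router + t_wire + t_serial


# Example from Dally: H_min = 4, t_r = 8 ns, t_c = 5 ns, L = 64 bytes, w_c = 16, f_c = 1 GHz
t0_dally = zero_load_latency_ns(4, 8, 5, 64, 16, 1)

# Toy example: H_min = 2 * 8 / 3, t_r = t_c = 1 ns, L = 16 bytes, w_c = 32, f_c = 1 GHz
t0_toy = zero_load_latency_ns(2 * 8 / 3, 1, 1, 16, 32, 1)
```

The first call gives 84 ns; the second gives roughly 14.67 ns.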

To get a feeling about NoC (Toy example)
- The network has $N = 64$ nodes
- $H_{min} = 2 \cdot 8 / 3 = 5.33$
- Channel width $w_c = 32$
- Channel frequency $f_c = 1$ GHz
- Channel latency $t_c = 1$ ns
- Router delay $t_r = 1$ ns
- Packet length $L = 16$ bytes

- $T_r = H_{min} t_r = 5.33 \cdot 1$ ns $= 5.33$ ns
- $T_w = H_{min} t_c = 5.33$ ns
- $T_s = L / b = L / (f_c w_c) = 16 \cdot 8 / (1\,\text{GHz} \cdot 32) = 128 / 32$ ns $= 4$ ns
- $T_0 = 5.33$ ns $+ 5.33$ ns $+ 4$ ns $\approx 14.67$ ns

Path Diversity
- A network with multiple minimal paths between most pairs of nodes is more robust than a network that has only a single route between the nodes.

Path Diversity
Random Traffic
- Each node is equally likely to send a message to any other node.
- 50% of the packets pass the bisection.
- $\gamma_{max} = 1$
[Figure: butterfly with 8 nodes (terminals 0 to 7, switch nodes 00 to 23)]

Traffic Patterns
- The performance of a network depends strongly on the traffic pattern.
- The table below shows a number of different traffic patterns that can be used to analyze the performance of the network.
[Table of traffic patterns not recovered]

Path Diversity
Bit Rotation Traffic
- The node with address $\{b_2, b_1, b_0\}$ sends to $\{b_1, b_0, b_2\}$.
- Thus we get the permutation $\{0, 2, 4, 6, 1, 3, 5, 7\}$.
- Thus packets from nodes $\{0, 1, 4, 5\}$ will all have to pass switch node 10.
- $\gamma_{max} = 2$ (since, for instance, channel (00, 10) is used by two connections)
[Figure: butterfly with 8 nodes (terminals 0 to 7, switch nodes 00 to 23)]
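The bit rotation permutation can be generated mechanically: moving $b_2$ from the most significant to the least significant position is a rotate-left by one bit. The sketch below uses illustrative function names and computes only the permutation, not the resulting channel loads in the butterfly.

```python
def rotate_left(addr, bits):
    """Rotate an address of `bits` bits left by one: {b2, b1, b0} -> {b1, b0, b2}."""
    msb = (addr >> (bits - 1)) & 1
    return ((addr << 1) & ((1 << bits) - 1)) | msb


def bit_rotation_permutation(bits):
    """Destination of every source node under bit rotation traffic."""
    return [rotate_left(addr, bits) for addr in range(1 << bits)]


permutation = bit_rotation_permutation(3)
```

For 3-bit addresses the list of destinations, indexed by source node, matches the permutation given above.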

Path Diversity
- In the torus there are several minimal paths to choose from if the source and destination are not adjacent.
- A non-minimal route can also be taken; this is not possible in a butterfly network.
[Figure: 4x4 torus]

Torus and Mesh Networks
- Torus and mesh networks, k-ary n-cubes, pack $N = k^n$ nodes in a regular n-dimensional grid with $k$ nodes in each dimension and channels between nearest neighbors.
- Advantages
  - Regular structure allows efficient packaging
  - For local communication latency is low
  - Good path diversity
- Disadvantage
  - Comparably larger hop count

Rings, Tori and Meshes
[Figure: 4-node ring (4-ary 1-cube); 4x4 mesh (4-ary 2-mesh); 4x4 torus (4-ary 2-cube)]

Properties of Tori and Meshes

Property                                                     Torus                      Mesh
Channel bisection                                            $B_{C,T} = 4N / k$         $B_{C,M} = 2N / k$
Channel load, uniform traffic (50% crosses the bisection)    $\gamma_{T,U} = k / 8$     $\gamma_{M,U} = k / 4$
Channel load, worst traffic (100% crosses the bisection)     $\gamma_{T,W} = k / 4$     $\gamma_{M,W} = k / 2$
Average minimum hop count (k even)                           $H_{min,T} = nk / 4$       $H_{min,M} = nk / 3$
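The closed-form properties can be compared side by side for concrete values of $k$ and $n$. The sketch below simply encodes the formulas as stated on the slide, for even $k$; the function names and dictionary keys are illustrative.

```python
def torus_properties(k, n):
    """Properties of a k-ary n-cube (torus), formulas as stated on the slide."""
    nodes = k ** n
    return {
        "N": nodes,
        "B_C": 4 * nodes // k,     # channel bisection
        "load_uniform": k / 8,     # 50% of the traffic crosses the bisection
        "load_worst": k / 4,       # 100% of the traffic crosses the bisection
        "H_min": n * k / 4,        # average minimum hop count
    }


def mesh_properties(k, n):
    """Properties of a k-ary n-mesh, formulas as stated on the slide."""
    nodes = k ** n
    return {
        "N": nodes,
        "B_C": 2 * nodes // k,
        "load_uniform": k / 4,
        "load_worst": k / 2,
        "H_min": n * k / 3,
    }


torus_4x4 = torus_properties(4, 2)
mesh_8x8 = mesh_properties(8, 2)
```

For the 4x4 torus this gives $B_C = 16$ and $H_{min} = 2$. For an 8x8 mesh it gives $H_{min} = 16/3 \approx 5.33$, the value used in the toy latency example.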

Physical implementation of Mesh and Tori
- To implement a network on a chip, the abstract nodes of the network must be mapped to real positions in physical space.
- A goal is to have the same latency for all channels.
- Folding the network shortens the longest channel.
[Figure: folded 4-ary 2-cube]

Summary
- The topology is an important factor for the performance of the network.
- Meshes and tori offer a large amount of bandwidth and path diversity.
- Performance depends on the traffic pattern.

Non-Blocking Networks, Concentrators and Distributors
Ingo Sander
ingo@imit.kth.se
Dally: 6.1-6.4, 7.1

Definitions on Non-Blocking
- A network is non-blocking if it can handle all circuit requests that are a permutation of the inputs and outputs. Otherwise it is blocking.
- A network is strictly non-blocking if any permutation can be set up incrementally, without the need to rearrange existing connections.
- A network is rearrangeably non-blocking if it can route arbitrary permutations, but incremental construction may require a rearrangement of existing connections.

Unicast and Multicast
- Unicast traffic means that each input is connected to at most one output.
- Multicast traffic means that an input can be connected to several outputs.
- Networks may be non-blocking for unicast only, or for both unicast and multicast.

Non-Interfering Networks
- While non-blocking is very important for circuit-switching applications, it is far less important in packet-switching networks.
- A packet-switching network shall be non-interfering, which means:
  1. There must be adequate channel bandwidth to support all of the traffic sharing the channel.
  2. No single flow is denied service because of other flows for more than a short, predetermined period of time.

Crossbar Networks
- An $n \times m$ crossbar or crosspoint switch directly connects $n$ inputs to $m$ outputs with no intermediate stages.
- It is strictly non-blocking for both unicast and multicast.
- Problem: the cost of an $n \times n$ network increases as $n^2$.
- It can only be used for small networks.
[Figure: crossbar symbol]

Clos Networks
- A Clos network is a three-stage network in which each stage is composed of a number of crossbar switches.
- A symmetric Clos network is characterized by a triple $(m, n, r)$ denoting
  - the number of middle-stage switches ($m$)
  - the number of input and output ports on each input or output switch ($n$)
  - the number of input and output switches ($r$)
[Figure: (3, 3, 4) Clos network with $N = rn = 12$ terminals, $H = 4$, $|R_{ab}| = m$]

Non-Blocking Clos for Unicast
- A Clos network is strictly non-blocking iff $m \geq 2n - 1$ (unicast).

Non-Blocking Clos for Unicast
- A Clos network is rearrangeable iff $m \geq n$ (unicast).

Benes Network
- A Clos network built from 2 x 2 switches is also called a Benes network.
- These networks require a minimum number of crosspoints to connect $N = 2^i$ ports.
- It has $2i - 1$ stages with $2^{i-1}$ switches each.
- Thus the total number of crosspoints is $(2i - 1) \, 2^{i+1}$.
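The Clos conditions and the Benes crosspoint count are simple enough to check numerically. The sketch below encodes the unicast conditions and the stage and crosspoint formulas stated above; the function names are illustrative, and each 2 x 2 switch is counted as 4 crosspoints.

```python
def clos_terminals(n, r):
    """Number of terminals N = r * n of a symmetric (m, n, r) Clos network."""
    return r * n


def clos_strictly_nonblocking(m, n):
    """Unicast condition for strict non-blocking: m >= 2n - 1."""
    return m >= 2 * n - 1


def clos_rearrangeable(m, n):
    """Unicast condition for rearrangeable non-blocking: m >= n."""
    return m >= n


def benes_stages(i):
    """Number of stages of a Benes network with N = 2**i ports."""
    return 2 * i - 1


def benes_crosspoints(i):
    """(2i - 1) stages, 2**(i-1) switches per stage, 4 crosspoints per switch."""
    return benes_stages(i) * 2 ** (i - 1) * 4


# The (3, 3, 4) Clos network: m = 3, n = 3, r = 4
n_terminals = clos_terminals(3, 4)
strict = clos_strictly_nonblocking(3, 3)
rearrangeable = clos_rearrangeable(3, 3)
```

The (3, 3, 4) network therefore connects 12 terminals and is rearrangeable but not strictly non-blocking, since strict non-blocking would need $m \geq 5$. A Benes network with 8 ports ($i = 3$) has 5 stages and 80 crosspoints, compared with 64 for an 8 x 8 crossbar, so the advantage only appears for larger port counts.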

Concentrators
- Concentrators combine the traffic of several terminal nodes into a single network channel.
- This is useful when
  - traffic from a single terminal is too small to justify its own network port
  - traffic from several bursty channels can be combined

Concentrator
- $M$ terminals are combined into a single network channel that can have a lower bandwidth than the sum of the terminal channels.
- The concentration factor is $k_C = M b_T / b_N$.

Application of Concentrator
[Figure: application of a concentrator]

Distributors
- A distributor is the opposite of a concentrator.
- A distributor can distribute packets to network channels in an arbitrary manner:
  - Random
  - Round-robin
  - Etc.

Distributors
- Distributors are useful if a high-speed terminal accesses a low-speed network.
- Disadvantages
  - Increases the serialization latency by $b_T / b_N$
  - Difficult to balance the load

Summary
- Crossbar switches are only cost-efficient if their size is small.
- Alternatives are
  - Strictly non-blocking Clos switches
  - Rearrangeable Clos switches
  - Benes switches
- Concentrators can be used to combine several terminals into one network channel.
- Distributors can be used to connect a high-bandwidth terminal to a low-bandwidth network.
