
AMBA3 (AXI) Protocol and

AXI-based Bus Design

July 25th, 2011


Sungjoo Yoo
Embedded System Architecture Lab.
POSTECH
Agenda
• AMBA3 (AXI3 and AXI4) protocol
– Specification
– Focus on ordering
• PL301: a crossbar bus
– Arbitration, QoS, cyclic dependency schemes
– Crossbar-based bus design flow
On-Chip Network (OCN) Design Flow
• Architecture exploration with performance evaluation: IPs, buses, and memories are connected in a candidate topology (e.g., IP1~IP4 on Bus 1 and Bus 2 with two memories)
• Bus RTL generation: the bus RTL code is generated
• Pre-layout bus RTL optimization with early P&R engagement and floorplanning:
– Register slice generation (latching/pipelining) when a bus is too big
– I/O delay buffering when a wire is too long
• Bus RTL verification (Specman-based)
[Source: AXI Spec]

AMBA3 AXI Protocol


• Advanced eXtensible Interface (AXI)
– Variable-length bursts
• From 1 to 16 data transfers (called beats) per burst
– Bursts with a transfer size of 8-1024 bits
– Wrapping, incrementing, and non-incrementing bursts
– Atomic operations (exclusive or locked accesses)
– System-level caching and buffering control
– Secure and privileged access
• Five channels
– Read: Address read (AR) and read data (R) channels
– Write: Address write (AW), write data (W), and write
response (B) channels
[Source: AXI Spec]

Interconnect, Interface & Channel


Separate Read / Write Channels
• AMBA AXI allows for independent read and
write transactions.
• Channels between AXI Master and AXI Slave, each with its own handshake:
– Write address/control (AWREADY)
– Write data (WREADY)
– Write response (BREADY)
– Read address/control (ARREADY)
– Read data (RREADY)
Split Transaction
• Address, data, and response are handled
separately, each on its own channel (AW, W, B, AR, R).
Split Transaction: Write (1/3)
• The master issues the address on the write
address/control channel (AWREADY handshake).
Split Transaction: Write (2/3)
• The master supplies the data on the write data
channel (WREADY handshake).
Split Transaction: Write (3/3)
• The slave acknowledges on the write response
channel (BREADY handshake).
Split Transaction: Read (1/2)
• The master issues the address on the read
address/control channel (ARREADY handshake).
Split Transaction: Read (2/2)
• The slave returns the data on the read data
channel (RREADY handshake).
Wire Counts
• Address 32b, data 32b bus case: 184~204 wires
– AW: 52~56, W: 39~43, B: 4~8, AR: 52~56, R: 37~41

Write Address/Control channel (52~56 wires):
– AWID[3:0] write address ID (0~4 bits)
– AWADDR[31:0] write address
– AWLEN[3:0] burst length
– AWSIZE[2:0] burst size
– AWBURST[1:0] burst type
– AWLOCK[1:0] lock info
– AWCACHE[3:0] cache type
– AWPROT[2:0] protection type
(the above are payload signals)
– AWVALID write address valid
– AWREADY write address ready
(handshake signals)

Write data channel (39~43 wires):
– WID[3:0] write ID tag, AWID = WID (0~4 bits)
– WDATA[31:0] write data
– WSTRB[3:0] write strobes
– WLAST write last
– WVALID write valid
– WREADY write ready
Write Address Channel & Signals

Signal | Source | Description
AWID[3:0] | Master | Transaction id
AWADDR[31:0] | Master | Start address
AWLEN[3:0] | Master | Burst length (1~16 beats)
AWSIZE[2:0] | Master | Transfer size (8~1024 bits)
AWBURST[1:0] | Master | Burst type (FIXED, INCR, WRAP)
AWLOCK[1:0] / AWCACHE[3:0] | Master / Master | Lock / cache type
AWPROT[2:0] | Master | Protection type
AWVALID / AWREADY | Master / Slave | Handshake control
Write Data & Response Channels

Signal (Write Data Ch.) | Source | Description
WID[3:0] | Master | Transaction id
WDATA[31:0] | Master | Write data
WSTRB[3:0] | Master | Write strobe per byte
WLAST | Master | Write last
WVALID / WREADY | Master / Slave | Handshake control

Signal (Write Response Ch.) | Source | Description
BID[3:0] | Slave | Transaction id
BRESP[1:0] | Slave | Write response
BVALID / BREADY | Slave / Master | Handshake control
Read Channels & Signals

• AR signals are analogous to AW signals
• R signals are analogous to W signals
– RLAST is driven by the slave
One Address for Burst

ADDRESS: A11 ......... A21 ....... A31
DATA:    D11 D12 D13 D14  D21 D22 D23  D31

• Separation of address and data channels
– The master provides only the start address of the burst
– The slave needs to generate the remaining addresses based
on the burst type (FIXED, INCR, WRAP)
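The slave-side address generation described above can be sketched as follows. This is an illustrative sketch, not ARM code; `burst_addresses` is my own helper name, and the AXI alignment restrictions on WRAP bursts (aligned start, power-of-two length) are only noted in a comment.

```python
def burst_addresses(start, burst_type, length, size_bytes):
    """Return the address of each beat in an AXI burst.

    burst_type: "FIXED", "INCR" or "WRAP"
    length:     number of beats (1..16 in AXI3)
    size_bytes: bytes per beat (decoded AxSIZE)
    """
    if burst_type == "FIXED":
        # Same address every beat, e.g., for FIFO access.
        return [start] * length
    if burst_type == "INCR":
        return [start + i * size_bytes for i in range(length)]
    if burst_type == "WRAP":
        # AXI requires an aligned start address and a power-of-two length.
        # The address wraps at a boundary of length * size_bytes bytes.
        wrap = length * size_bytes
        lower = (start // wrap) * wrap
        return [lower + (start - lower + i * size_bytes) % wrap
                for i in range(length)]
    raise ValueError(burst_type)
```

For example, a 4-beat WRAP burst of 4-byte beats starting at 0x38 wraps at the 16-byte boundary: 0x38, 0x3C, 0x30, 0x34.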
Burst Length, Size and Type
Data Bus
• Write strobe
• Narrow transfer (incrementing burst case)

Data Bus (Cont’d)
• Narrow transfer (incrementing burst case)
• An easy way to handle an unaligned write: use write
strobes
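The write-strobe idea can be sketched as follows; this is my own illustration (the helper name `wstrb` is not from the spec), assuming a 32-bit (4-byte) data bus where WSTRB[3:0] carries one bit per byte lane.

```python
def wstrb(addr, nbytes, bus_bytes=4):
    """Strobe mask for a write of nbytes starting at addr, within one beat."""
    lane = addr % bus_bytes                  # first valid byte lane
    assert lane + nbytes <= bus_bytes, "write must fit in one beat"
    return ((1 << nbytes) - 1) << lane       # one set bit per valid byte lane
```

An unaligned 3-byte write starting at address 1 thus asserts strobes 0b1110: lanes 1, 2, and 3 carry valid data while lane 0 is ignored.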
Unaligned Transfer
Unaligned Transfer (Cont’d)
Unaligned Transfer (Cont’d)
• Wrapping burst case
[AXI4]

Burst Length (AXI4)
• AR(W)LEN[7:0] allows bursts of up to 256 beats
• Bursts in the AXI3 protocol:
– Early termination of bursts is not supported.
– A burst must not cross a 4-kbyte boundary. This ensures that a
burst is only destined for a single slave.
• AXI4 protocol longer-burst support:
– Bursts longer than 16 beats are only supported for the INCR
burst type. Both WRAP and FIXED burst types remain
constrained to a maximum burst length of 16 beats.
– Exclusive accesses are not permitted to use a burst length
greater than 16.
• What about AXI4-to-AXI3 transfers? A conversion wrapper is needed!
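The burst-length rules above can be condensed into a small legality check. A minimal sketch under the stated rules only (`burst_legal` is my own name; it does not cover every spec restriction, e.g., WRAP alignment):

```python
def burst_legal(start, length, size_bytes, burst_type, axi4=True):
    """Check the burst rules above: 4 KB boundary, and max length per type."""
    total = length * size_bytes
    # A burst must not cross a 4-kbyte boundary (single-slave guarantee).
    if start // 4096 != (start + total - 1) // 4096:
        return False
    if axi4:
        # Only INCR may exceed 16 beats in AXI4 (up to 256).
        max_len = 256 if burst_type == "INCR" else 16
    else:
        max_len = 16                        # AXI3: all types capped at 16
    return 1 <= length <= max_len
```

So a 4-beat INCR burst of 4-byte beats starting at address 4090 is illegal (it crosses the 4096 boundary), while a 256-beat INCR burst is legal in AXI4 but a 256-beat WRAP burst is not.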
Benefit of Split Transaction:
Multiple Outstanding Requests

ADDRESS A11 A21 A31

DATA D11 D12 D13 D14 D21 D22 D23 D31

*Reduces the delay of D21, D22, D23

• Parameters for multiple outstanding requests
– Master I/F: Issuing capability → the number of outstanding
requests the master can generate
– Slave I/F: Acceptance capability → the number of outstanding
requests the slave can accept
Effect of Multiple Outstanding
Requests
• Setup
– Single outstanding request by each master
– 6 masters → register slices (RS) → PL300 crossbar → PL340 memory
controller → 32b SDR SDRAM with 4 banks
– 5000 clock cycles simulation
• Result: memory utilization vs. number of masters (1, 2, 4, 6),
read case (RS 0/2); utilization rises with the number of masters
• Analysis
• A single master w/o outstanding requests can achieve only about 30%
utilization.
• The effects of bank interleaving by different masters are significant, up to 74%.
• 6 master case may suffer from bank conflicts.
Effect of Multiple Outstanding
Requests
• Setup
– Multiple outstanding requests by each master
– Same configuration: 6 masters → register slices (RS) → PL300 →
PL340 → 32b SDR SDRAM with 4 banks
• Result: memory utilization vs. number of masters (1, 2, 4, 6),
read case (RS 0/2)
• Analysis
• A single master w/ multiple outstanding requests can achieve >50% utilization.
• The effects of bank interleaving by different masters are significant, up to 74%.
• Register slice does not degrade the overall performance, i.e. utilization since
multiple outstanding hides its latency.
Read Burst Operation
• Timeline: read request is initiated → read request is accepted →
data read is ready → 1st data is transferred → the last data is
transferred
• Note: a data transfer occurs only when VALID = READY = 1
Overlapping Read Bursts
• Read request A is accepted; read request B is accepted via the
AR channel while data A(0) is transferred via the R channel
Write Burst Operation
• Write request A is accepted; the response completes the
write operation
Out-of-Order Transaction
ADDRESS: A11 A21 A31
RDATA:   D21 D22 D23  D31  D11 D12 D13 D14

• Transaction ID is used to identify data transfers on all
channels
– ARID <-> RID and AWID, WID <-> BID
– Up to four bits
• Ordering by transaction ID
– Data transfers with the same transaction ID must finish in the
order of request issue.
– The slave can handle data transfers with different transaction IDs
out of order.
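The ID ordering rule above can be expressed as a small checker: one FIFO of issue order per transaction ID, so same-ID responses must pop in order while different IDs are free to interleave. An illustrative sketch (the class name and API are mine, not from the spec):

```python
from collections import defaultdict, deque

class OrderingChecker:
    """Check that responses with the same transaction ID arrive in issue order."""

    def __init__(self):
        self.pending = defaultdict(deque)   # transaction ID -> tags in issue order

    def issue(self, txn_id, tag):
        self.pending[txn_id].append(tag)

    def complete(self, txn_id, tag):
        # Only the oldest outstanding tag of this ID may complete next.
        expected = self.pending[txn_id].popleft()
        if expected != tag:
            raise AssertionError(f"ID {txn_id}: got {tag}, expected {expected}")
```

With this model, completing A21 (ID 2) before A11/A12 (ID 1) is fine, but completing A12 before A11 on the same ID is flagged.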
Ordering #1: Request and
Response
• Write request and data
– The write data can appear at an interface
before the write address that relates to it
• Two relationships that must be maintained
are:
– Read data must always follow the address to
which the data relates
– A write response must always follow the last
write transfer in the write transaction to which
the write response relates
Dependencies between Channel
Handshake Signals (AXI3)
• To prevent a deadlock situation, you must
observe the dependencies that exist between
the handshake signals
• In any transaction:
– The VALID signal of one AXI component must
not be dependent on the READY signal of the
other component in the transaction
– The READY signal can wait for assertion of the VALID
signal
[AXI4]

Dependencies between Channel


Handshake Signals (AXI4)
• The AXI3 protocol requires that the write response for all
transactions must not be given until the clock cycle after the
acceptance of the last write data transfer (WLAST)
• In addition, the AXI4 protocol requires that the write response for
all transactions must not be given until the clock cycle after address
acceptance
Ordering #2: Multiple read
requests in slave and interconnect

(Figure: a master IP connects through the interconnect to slave IP 1
and slave IP 2; multiple read requests are queued in the slaves and
the interconnect)
• Read data reordering depth: the number of pending read
requests that a slave (or the interconnect) can reorder
Ordering #3: Write
• Write data with different AWIDs follow their
address order
• Responses to multiple writes with different IDs
can be out-of-order from address order
Write Interleaving
ADDRESS: A11 A21 A31
DATA:    D11 D21 D22 D12 D23 D31 D13 D14

• Interleaving rules
– Data with different IDs can be interleaved.
– The order within a single burst is maintained.
– The order of the first data beats must match the order of
the requests.
– WriteInterleaveCapability → the maximum number of
transactions that a master can interleave
[AXI4]

Write Interleaving in AXI4


• No support of write interleaving in AXI4
• Removal of WID in AXI4, why?
– Write data with different AWIDs follow their
address order + no write interleaving → no
need of WID!
– Responses to multiple writes with different
IDs can be out-of-order from address order →
BID remains!
Transaction ID Implementation
• Real implementation
– Transaction ID = <master ID, channel ID>
– Channel ID = original AXI transaction id
– Master ID is needed to identify the initiating master among
all the masters

(Figure: seven masters — CPU, Video Decoder, 3D Graphics, LCD
Control, Video Process, Mixer, DMA — each with its own AXI ID width
of 0~4 bits, connect through a crossbar to the memory controller)

Extended ID at the memory controller: 4 + ceil(log2 7) = 4 + 3 = 7 bits

Memory
Controller
Controller
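The extended-ID width computation above can be written out directly. A minimal sketch (the helper name `extended_id_bits` is my own):

```python
import math

def extended_id_bits(channel_id_bits, num_masters):
    """ID width after the interconnect prepends a master ID.

    The master ID needs ceil(log2(num_masters)) bits to distinguish
    all initiating masters; it is prepended to the original AXI ID.
    """
    return channel_id_bits + math.ceil(math.log2(num_masters))

# 4-bit AXI ID + 7 masters -> 4 + 3 = 7 bits, matching the slide.
```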
Cache Support (AXI3)
ARCACHE[3:0] / AWCACHE[3:0]
• Bufferable bit (B): AWCACHE[0]
– The write can be delayed for an arbitrary number of cycles
• Cacheable bit (C): AR(W)CACHE[1]
– Read: prefetch or read cache is possible
– Write: write merging is possible
• Read Allocate bit (RA): ARCACHE[2]
– If read miss, fetch the data to cache
– If C=low, RA=low
• Write Allocate bit (WA): AWCACHE[3]
– If write miss, fetch the data to cache, and then write
to the cache (and through the memory)
– If C=low, WA=low
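The four AXI3 cache bits above decode mechanically from AR(W)CACHE[3:0]. An illustrative sketch (my own helper; the assertion encodes the slide's rule that RA/WA must be low when C is low):

```python
def decode_awcache(awcache):
    """Decode AXI3 AWCACHE[3:0] into the B / C / RA / WA attribute bits."""
    b  = bool(awcache & 0b0001)   # Bufferable
    c  = bool(awcache & 0b0010)   # Cacheable
    ra = bool(awcache & 0b0100)   # Read Allocate
    wa = bool(awcache & 0b1000)   # Write Allocate
    # Per the slide: if C=low, RA and WA must also be low.
    assert c or not (ra or wa), "RA/WA require C=1"
    return {"B": b, "C": c, "RA": ra, "WA": wa}
```

For example, 0b0011 decodes to bufferable + cacheable with no allocation hints, while 0b0100 (allocate without cacheable) is rejected.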
[AXI4]

Cache Support 1 (AXI4)


[AXI4]

Cache Support 2 (AXI4)


[Source: K. Asanovic, 2008]

A Typical Memory Hierarchy c.2008

• Multiported register file (RF, part of the CPU)
• Split instruction & data primary (L1) caches (on-chip SRAM)
• Large unified secondary (L2) cache (on-chip SRAM)
• Multiple interleaved memory banks (off-chip DRAM)
Read / Write Allocate
• On a read or write miss, first fetch the cache
line and then access the location for the read /
write
Write Response from
Intermediate Point
• Basically, the memory gives the write response
• Intermediate points, e.g., caches, can also give a write
response; they then become responsible for delivering
the data to memory
Write Through and Write Back
• Write through
– If L1 is updated, all the corresponding data in L2 and
memory are updated
• Write back
– The data update is delayed until the data is evicted
[Source: K. Asanovic, 2008]

Write Buffer
• The write buffer sits between the data cache and the unified
L2 cache and holds:
– Evicted dirty lines, for a write-back cache, OR
– All writes, for a write-through cache
• On a read miss, first check the write buffer contents:
– if there is a hit in the write buffer, read the hit data
– else, send the miss request to the memory (or L2/L3 cache)
[Source: J. Kubiatowicz, 2000]

Write Merging (Coalescing)
• On a write miss or write back, the write goes into the write
buffer between the data cache and the unified L2 cache
• If the buffer contains modified blocks, the addresses can be
checked to see if the address of the new data matches the address of
a valid write buffer entry
• If so, the new data are combined with that entry
• The Sun T1 (Niagara) processor, among many others, uses
write merging
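The merging check described above can be sketched as follows. This is my own illustration, not the T1 design: entries are keyed by block address, and a write whose block matches a valid entry is combined into it instead of allocating a new one.

```python
class WriteBuffer:
    """Toy coalescing write buffer: merge writes that hit the same block."""

    def __init__(self, block_bytes=4):
        self.block = block_bytes
        self.entries = {}                  # block base address -> {offset: byte}

    def write(self, addr, data_bytes):
        base = addr - addr % self.block    # containing block's base address
        # setdefault merges into an existing entry when the base matches.
        entry = self.entries.setdefault(base, {})
        for i, byte in enumerate(data_bytes):
            entry[addr - base + i] = byte
```

Three separate byte writes into the same 4-byte block thus occupy a single buffer entry rather than three.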
[AXI4]

Memory Type
[AXI4]

Device Non-Bufferable and


Bufferable
• Device Non-Bufferable (0000)
– Read/write (response) only from/to the
destination (DRAM, peripherals, …)
• Device Bufferable (0001)
– Write response can be obtained from an
intermediate point (bufferable or allocate)
– Read from the destination
[AXI4]

Normal Non-cacheable
Non-bufferable / Bufferable
• Normal Non-cacheable Non-bufferable:
0010
– Write merging is possible (modifiable)
• Normal Non-cacheable Bufferable: 0011
– Write response from an intermediate point is
possible (bufferable or allocate)
– Write merging (modifiable)
– Read data from the destination or from a
write transaction that is going to the
destination (modifiable & bufferable in read)
[AXI4]

Write Through 1
• Write Through No Allocate: 1010/0110
– Why write through → not bufferable
– Why no allocate → only the "other allocate" bit is set
– Write response from an intermediate point is possible
(bufferable or (other) allocate)
– Write merging is possible (modifiable)
– Read data can come from an intermediate cached copy
(bufferable or (other) allocate)
– Read pre-fetch is possible (modifiable)
– Allocation of either reads or writes is not
recommended for performance reasons, but not
prohibited
[AXI4]

Write Through 2
• Write Through Read Allocate: 1110(0110)/ 0110
– Basically, Write Through No Allocate except
– Allocation of reads is recommended for performance reasons
– Allocation of writes is not recommended for performance
reasons
• Write Through Write Allocate: 1010/1110 (1010)
– Reverse of the above
– Allocation of writes is recommended for performance reasons
while that of reads is not recommended
• Write Through Read and Write Allocate: 1110/1110
– Basically, Write Through No Allocate except
– Both allocations of reads and writes are recommended
[AXI4]

Write Back 1
• Write Back No Allocate: 1011/0111
– Why write back → bufferable
– Writes are not required to reach the destination (bufferable)
– Write response from an intermediate point is possible (bufferable
or (other) allocate)
– Write merging is possible (modifiable)
– Read data can come from an intermediate cached copy (bufferable and
(other) allocate)
– Read pre-fetch is possible (modifiable)
– Allocation of either reads or writes is not recommended for
performance reasons, but not prohibited
[AXI4]

Write Back 2
• Write Back Read Allocate: 1111(0111)/ 0111
– Basically, Write Back No Allocate except
– Allocation of reads is recommended for performance reasons
– Allocation of writes is not recommended for performance
reasons
• Write Back Write Allocate: 1011/1111 (1011)
– Reverse of the above
– Allocation of writes is recommended for performance reasons
while that of reads is not recommended
• Write Back Read and Write Allocate: 1111/1111
– Basically, Write Back No Allocate except
– Both allocations of reads and writes are recommended
[AXI4]

Transaction Buffering
• When does the write reach the destination?
• The write must reach the destination in a
timely manner:
– Bufferable (Device and Normal Non-
Cacheable) and Write Through cases
• Not required to reach the destination (but
the data must not be lost in any case):
– All Write Back cases
[AXI4]

Ordering 4: Ordering on the


Same (or Overlapping) Address
• Keep the ordering of requests/responses
to the same (or overlapping) address

Device memory
Atomic Access
• Normal access, AR(W)LOCK[1:0] = b00
• Exclusive access, b01
– Exclusive read (load-linked) … exclusive write (store-
conditional)
– If there is no intervening write to the address region, the
response is EXOKAY; otherwise, OKAY.
– Usually used for read-modify-write
– Example: Master 1 exclusive-reads 0x100; Master 2 then writes
0x100; Master 1's exclusive write to 0x100 receives only OKAY
from Slave 1 (the exclusive write fails)
• Locked access, b10
– Start with b10, and end with b00
– During the period, only the lock-initiating master can
access the address region (the region is locked, not the whole slave!)
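The exclusive-access semantics above can be modeled with a simple monitor: an exclusive write returns EXOKAY only if no other master has written the monitored address since the exclusive read. An illustrative sketch under simplified assumptions (one monitored address per master, exact address match; the class name and API are mine):

```python
class ExclusiveMonitor:
    """Toy load-linked / store-conditional monitor for AXI exclusive access."""

    def __init__(self):
        self.tags = {}                       # master id -> monitored address

    def exclusive_read(self, master, addr):
        self.tags[master] = addr             # start monitoring this address

    def write(self, master, addr, exclusive=False):
        # Any write clears other masters' monitors on that address.
        for m, a in list(self.tags.items()):
            if m != master and a == addr:
                del self.tags[m]
        if not exclusive:
            return "OKAY"
        ok = self.tags.pop(master, None) == addr
        return "EXOKAY" if ok else "OKAY"    # plain OKAY: the store failed
```

Replaying the slide's scenario: M1 exclusive-reads 0x100, M2 writes 0x100, so M1's exclusive write gets OKAY; without the intervening write it would get EXOKAY.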
[AXI4]

Atomic Access in AXI4


• No support of locked access

• All locked accesses from AXI3 masters


need to be converted to normal accesses
Protection Support
• Normal or privileged: AR(W)PROT[0]
– High → privileged
• Secure or non-secure: AR(W)PROT[1]
– Low → secure
• Instruction or data: AR(W)PROT[2]
– High / low → instruction / data
Low Power Interface (C channel)
PL301
Crossbar Bus
AXI Crossbar Bus
ARM PrimeCell PL301
• PL301: PrimeCell High Performance Matrix
– Masters attach to Slave Interfaces (SIs); slaves attach to
Master Interfaces (MIs)
PL301 Internal Structure
• Adjusts interface differences
– Data width, frequency, and protocol
• Structurally configured fabric matrix core with routing
control, parameterised by, e.g., the address map
• Bridges at the edges adapt each port:
– Synchronous / asynchronous bridges between clock domains
(e.g., AXI at 50 MHz, 166 MHz, 200 MHz)
– Register slices and downsizers
– AXI-AHB (e.g., AHB 32-bit, 70 MHz) and AXI-APB
(e.g., APB 40 MHz) converters

Adjusting Data Width
• ExpanderAxi (e.g., 32b master on a 64b bus)
– Converts a narrow master for use on a wide bus
– Data replication on the write data path and muxing on the read data path
• FunnelAxi (e.g., 32b slave on a 64b bus)
– Converts a narrow slave for use on a wide bus
– Mux on the write data path and replication on the read data path
– Assumes that transfers are suitably sized for the slave (i.e., no wider
than the narrow bus)
• DownSizerAxi (e.g., 64b bus to 32b bus)
– Converts a wide bus to a narrow bus
– Transactions no wider than the narrow bus pass through
– Transactions wider than the narrow bus are broken down into smaller
transactions
[Source: PL301 TS]

PL301 Features
• Configurable number of SIs and MIs
• Sparse connection options to reduce gate count and improve
security
• Configurable AXI address/data widths
• Decoded address register that you can configure for each SI
• Flexible register stages to aid timing closure
• An arbitration mechanism that you can configure for each MI,
implementing:
– a fixed Round-Robin (RR) scheme
– a programmable RR scheme
– a programmable scheme that provides prioritized groups of Least
Recently Granted (LRG) arbitration
• A programmable Quality of Service (QoS) scheme
• Support for multiple clock domains: synchronous and asynchronous.
• Configurable cyclic dependency schemes to enable a master to have
outstanding transactions to more than one slave
Arbitration Scheme
Fixed Priority
• Each slave interface of the PL301 is assigned a fixed priority
in the control registers: SlaveInterface0 (M0) has priority 0
(highest) down to SlaveInterface3 (M3) with priority 3 (lowest)
[Source: PL301 TS]

Arbitration Scheme
Round Robin
Arbitration Scheme
Hybrid
• Combination of round robin and fixed
priority
– Fixed mode: SlaveInterface0 (M0) at priority 0 (highest),
SlaveInterface1 (M1) at priority 1
– Round-robin mode: SlaveInterface2 (M2) and SlaveInterface3
(M3) share priority 2 (lowest), with a "next" pointer rotating
between them
– Priorities are set in the PL301 control registers

Arbitration Scheme
• LRG (least recently granted) scheme
Arbitration Scheme
Fixed Round Robin
• A weighted round robin in a fixed order
– Example: masters (e.g., ARM, VIDEO, 3D, LCD mapped to m0~m5)
are granted slave S0 in a fixed repeating slot sequence, with
more slots assigned to masters that need more bandwidth
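A fixed (weighted) round robin like the one above can be sketched as a slot wheel: the slot owner is granted if it is requesting, otherwise the arbiter falls through to the next requester in wheel order. This is my own illustration, not the PL301 implementation; the slot sequence below is a made-up example where m0 owns two of four slots.

```python
class FixedRoundRobin:
    """Weighted round robin over a fixed repeating slot sequence."""

    def __init__(self, slots):
        self.slots = slots                 # e.g. ["m0", "m1", "m0", "m2"]
        self.idx = 0                       # current slot

    def grant(self, requests):
        """Grant the slot owner if requesting, else the next requester."""
        order = self.slots[self.idx:] + self.slots[:self.idx]
        self.idx = (self.idx + 1) % len(self.slots)   # advance the wheel
        for m in order:
            if m in requests:
                return m
        return None                        # no master is requesting
```

With slots ["m0", "m1", "m0", "m2"], m0 receives half of all contested grants, which is the bandwidth-weighting effect the slide describes.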
An Example of Crossbar Bus Design
• Seven masters (CPU, Video Decoder, 3D Graphics, LCD Control,
Video Process, Mixer, DMA) connect through a crossbar to the
memory controller
• One master alone accounts for 40% of the total bandwidth; some
IPs need a privileged usage of memory bandwidth
[Source: PL301 TS]

Programmable QoS

• Tidemark: the maximum number of outstanding requests
allowed for best-effort traffic
• Example: assume Tidemark = 4 and ID match = M0.
If there are already 4 outstanding requests from M1,
only requests from M0 are accepted by S0 until one of
M1's requests is served
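The tidemark rule above reduces to a one-line acceptance test. An illustrative sketch (the function and parameter names are mine, not PL301 register names):

```python
def accept(req_id, outstanding_best_effort, tidemark=4, id_match="M0"):
    """Tidemark QoS: below the mark accept anyone, at the mark only id_match."""
    if outstanding_best_effort < tidemark:
        return True                        # best-effort budget not yet used up
    return req_id == id_match              # saturated: privileged ID only
```

So with 4 of M1's requests outstanding, a new M1 request is stalled while an M0 request still gets through, matching the example above.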
Bus Arbitration: A Generic Arbiter
• Assumptions
– Each request has its own performance requirement (e.g., bandwidth budget
and/or latency)
• E.g., low latency access from CPU
• E.g., bandwidth guarantee for LCD / Camera controllers
– In order to avoid starvation, a global time out is applied
• Priority order: time-out > bandwidth > best effort
– Time-out (TO) request: TO = 0, and BW budget > 0
– Bandwidth (BW) request: TO > 0, and BW budget > 0
– Best effort (BE) request: BW budget = 0
• Priority promotion/demotion
– Demotion: if BW budget is exhausted, demotion to BE
– Promotion: if BW budget becomes positive, promotion to BW or TO request
• Time-out counters
– One type of counter for QoS access w/ time out
– The other for old request
• When a normal request (w/ unspecified TO) arrives at the bus, a timer is
assigned and starts to be decremented each cycle.
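The three priority classes above (TO > BW > BE) follow directly from a request's time-out counter and remaining bandwidth budget. A minimal sketch of the classification and selection (my own helper names; ties inside a class are broken here by smallest time-out, one reasonable choice among several):

```python
def classify(timeout, bw_budget):
    """Map a request to its priority class per the rules above."""
    if bw_budget <= 0:
        return "BE"                        # budget exhausted: best effort
    if timeout == 0:
        return "TO"                        # timed out: highest priority
    return "BW"                            # bandwidth request

def pick(requests):
    """requests: list of (name, timeout, bw_budget); serve highest class first."""
    rank = {"TO": 0, "BW": 1, "BE": 2}
    return min(requests,
               key=lambda r: (rank[classify(r[1], r[2])], r[1]))[0]
```

For example, among a CPU request with budget left, an LCD request whose timer has expired, and a DMA request with no budget, the timed-out LCD request wins.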
Bus Arbitration: A Generic Arbiter
• If there is any TO request, it is served first (highest priority)
– QoS request w/ time out
– Old request
• Otherwise, if there is any BW request
– Give the bus to the BW requests based on BW budget
– Based on fixed time slot allocation or statistical slot allocation
• Otherwise, apply the same priority order to the BE requests
as to BW requests

Case of fixed time slot allocation:
at a time slot (instant or period),
if the master allocated to the current slot has a request,
then it is served;
else the other pending requests are served (e.g., round robin)

Note: this arbitration scheme is similar to the one used in memory
access scheduling
A Deadlock Problem in Accessing
Multiple Slaves
• Figure: over the on-chip bus, Master 1 and Master 2 each access
Memory Controller 1 (Memory 1) and Memory Controller 2 (Memory 2);
color = transaction id
• Requests with the same transaction id need to be finished in the
order of request issue: request D is blocked at master 1 and
request B is blocked at master 2
• Optimization in the memory controller: memory controllers can
serve independent requests out-of-order to increase memory
utilization or to lower memory access latency

Master 1: C D
Master 2: A B        (time →)
Cyclic Dependency Schemes
• Outstanding requests are permitted only for a
single slave per transaction id
• The deadlock problem is resolved while limiting
parallel (memory) accesses

Memory 1: A D
Memory 2: C B

Master 1: C C D D
Master 2: A A B B
(clock cycles 1~17)
Single Slave Scheme
• Allow multiple outstanding transactions
only to the same slave
Unique ID Scheme
• Accept only requests that may complete out-of-order,
i.e., requests with different transaction IDs
Single Slave per ID
• Combination of both single slave and
unique ID schemes
• Allow multiple outstanding requests to a
single slave per transaction ID
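The single-slave-per-ID rule above can be modeled as an acceptance check: a new request stalls if its transaction ID already has requests outstanding to a different slave. An illustrative sketch (class and method names are mine):

```python
class SingleSlavePerID:
    """Cyclic-dependency scheme: one destination slave per transaction ID."""

    def __init__(self):
        self.dest = {}                     # txn id -> (slave, outstanding count)

    def can_issue(self, txn_id, slave):
        s, n = self.dest.get(txn_id, (slave, 0))
        return n == 0 or s == slave        # same slave, or nothing outstanding

    def issue(self, txn_id, slave):
        assert self.can_issue(txn_id, slave)
        s, n = self.dest.get(txn_id, (slave, 0))
        self.dest[txn_id] = (slave, n + 1)

    def complete(self, txn_id):
        s, n = self.dest[txn_id]
        self.dest[txn_id] = (s, n - 1)     # keep the slave, drop the count
```

This blocks exactly the interleavings that cause the deadlock in the earlier slide, while still permitting parallel accesses on different transaction IDs.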
[Source: J. Yoo, 2008]

Cascaded Crossbar (XB) Bus


Case A: a single big 14x4 crossbar (wire segments of 3~5 ns)
Case B: a cascaded 9x4 + 7x2 crossbar (wire segments of 3~4 ns)

Benefits of cascading:
• Higher clock frequency
• Possibly, lower area/power cost
On-Chip Network (OCN) Design Flow
(design-flow figure repeated from the beginning of the deck)
AXI Bus Architecture Design Space
• Interconnection level
– PL300 decomposition: cascading level, e.g., a 2-level cascaded
architecture (3x2 and 2x2 PL300s) to localize traffic
– Register slices on timing-critical paths
• Bus component level
– Common parameters: bit width, etc.
– PL300: WriteIssueCap, etc.
– PL340: arbiter queue, data FIFO size, etc.
– Register slice: full/forward/bypass per channel
– Closely related with the PL30x / PL340 implementation
• IP interface level
– # of AXI ports, internal buffer usage/size, address queue size,
burst size, etc.
– Advanced AXI interface with large WriteIssuingCapability on
high-latency paths
(Figure: processor, DSP, and IPs connect through cascaded PL300s
with register slices to PL340 memory controllers and SRAM)
Performance Evaluation of Bus
Architecture
• Three methods depending on how to
model the traffic of bus masters
– ViP (virtual platform)-based methods
• Random bus traffic model
• Sequence-based bus traffic model
• Cycle-accurate Carbonized model
– FPGA-based model
• Bus masters are emulated on FPGA
– RTL simulation
• Golden RTL is built and simulated
Bus RTL Generation:
AMBA Designer
• Generated IP’s
– Interconnect, DMA, memory
controller, etc.
• Parameters
– data width (32b ~ 128b),
protocol (AXI, AHB, APB), etc.
– Timing (register slice, address
decode), security (trustzone),
etc.
– QoS (programmable QoS),
arbitration schemes, etc.
Bus RTL Verification
• Bus component verification
– Verification of each component (PL30x,
PL34x)
– Method & criterion: Specman eVC & protocol
coverage
• Bus architecture verification
– Verification of the entire bus architecture
– Method & criterion: Specman eVC + system
scenarios & functional correctness + protocol
coverage
On-Chip Network (OCN) Design
Flow
Bus RTL Verification (Specman-based)
IP1 IP2 IP3 IP4

Bus 1 Bus 2
Generated
Bus
RTL code
Mem Mem

Pre-layout Bus RTL


Register slice generation
Architecture exploration w/
optimization
performance evaluation Early engagement P&R
w/ floorplanning

IP1 IP2 IP3 IP4 IP3 IP3


IP4 IP4
latching
Bus 1 pipelining Bus 1 Bus 1
too big
Bus RTL I/O delay
IP1 IP2 IP1 bufferingIP2
generation
Mem Mem too long
wire
Register Slice for Timing Isolation
• A register slice can be used on any channel
independently
Register Slice for Timing Isolation
• A register slice incurs one cycle of latency per
insertion
[Source: PL301 TS]

Two Modes: Full and Forward

• Area benefit: forward mode gives smaller
area
• The same latency penalty

(Figures: AXI register slice in full mode and in forward mode)

On-Chip Network (OCN) Design Flow
(design-flow figure repeated from the beginning of the deck)
Topology/Floorplan/Pipeline
Co-design of Cascaded XB Bus

An Academic Approach
[Source: J. Yoo, 2008]

Floorplanned Cascaded XB Bus

• Case A floorplan: 14 masters (M1~M14) and 4 slaves (S1~S4)
placed around a single 14x4 crossbar
• Case B floorplan: the same masters and slaves placed around
cascaded 9x4 and 7x2 crossbars

Cascaded crossbar bus design needs to consider floorplan and pipeline
stages as well as topology
[Source: J. Yoo, 2008]

Topology/Floorplan/Pipeline
Co-design of Cascaded XB Bus
• Design input: communication graph & floorplan information
• Technology input: pre-characterized library of bus components
• Genetic-algorithm loop:
1. Initial population
2. Topology & floorplan generation
3. Pipeline stage insertion with timing analysis
4. Cost evaluation & tournament selection
5. If not done, repeat from step 2; otherwise return results
(Example inputs in the figure: an application SoC with proc, codec,
graphics, imaging, storage, dma, display, lcd, peri, security, conn,
and cfc blocks plus ddr0/ddr1/nand0/nand1 memories, each annotated
with size and bandwidth figures; and a DSP-array SoC with DSP00~DSP33,
SRAM0~SRAM3, SDRAM0/SDRAM1, CONTROL, and FLASH)
[Source: J. Yoo, 2008]

Topology/Floorplan/Pipeline
Co-design of Cascaded XB Bus
The same flow, illustrated on a small example:
• Blocks A~F around crossbar XB1 are encoded as a pair of ordered
sequences, e.g., (B C A XB1 D E F, C B F XB1 E D A)
• Adding a second crossbar XB2 between X1 and the D/E/F side yields
(B C A XB1 XB2 D E F, C B F XB1 XB2 E D A)
• A later generation produces the variant
(B C A XB1 XB2 D E F, C B XB1 F XB2 E D A)
[Source: J. Yoo, 2008]

Topology/Floorplan/Pipeline
Co-design of Cascaded XB Bus
Pipeline stage insertion example within the same flow:
(a) Masters M0~M3 reach slaves S0/S1 through crossbars X0, X1, X2;
each wire segment is annotated with its delay (0.5~4.0 ns), and
timing is analyzed per tree (Tree #1 to S1, Tree #2 to S0)
(b)-(c) Pipeline stages (register slices) are inserted along the
timing trees, splitting long segments (e.g., a 4.0 ns segment into
2.0 + 2.0 ns) during timing analysis
[Source: J. Yoo, 2008]

Experiments

• Two bar charts compare area (square mm) and power (mW) for three
methods (aspdac07, aspdac07ga, proposed) on two cases (case 1, case 2)
• Each bar is broken down into bridge, wire, crossbar, and pipeline
contributions
• The benchmark is the application SoC communication graph (proc,
codec, graphics, imaging, storage, dma, display, lcd, peri, security,
conn, cfc with ddr0/ddr1/nand0/nand1 memories) mapped onto crossbars
(XBAR) and a bridge

Summary

Topology/floorplan/pipeline co-design gives
lower area cost in cascaded crossbar bus design
Summary
• AMBA3 (AXI3 and AXI4) protocol
– Specification
– Focus on ordering
• PL301: a crossbar bus
– Arbitration, QoS, cyclic dependency schemes
– Crossbar-based bus design flow
