
AMBA3 (AXI) Protocol and

AXI-based Bus Design

July 25th, 2011


Sungjoo Yoo
Embedded System Architecture Lab.
POSTECH
Agenda
• AMBA3 (AXI3 and AXI4) protocol
– Specification
– Focus on ordering
• PL301: a crossbar bus
– Arbitration, QoS, cyclic dependency schemes
– Crossbar-based bus design flow
On-Chip Network (OCN) Design Flow
• Architecture exploration with performance evaluation: IPs, buses, and memories are connected in a candidate topology (e.g., IP1~IP4 on Bus 1 and Bus 2 with two memories)
• Bus RTL generation: the bus RTL code is generated
• Pre-layout bus RTL optimization with early P&R engagement and floorplanning:
– Register slice generation (latching/pipelining) when a bus is too big
– I/O delay buffering when a wire is too long
• Bus RTL verification (Specman-based)
[Source: AXI Spec]

AMBA3 AXI Protocol


• Advanced eXtensible Interface (AXI)
– Variable-length bursts
• From 1 to 16 data transfers (called beats) per burst
– Bursts with a transfer size of 8-1024 bits
– Wrapping, incrementing, and non-incrementing bursts
– Atomic operations (exclusive or locked accesses)
– System-level caching and buffering control
– Secure and privileged access
• Five channels
– Read: Address read (AR) and read data (R) channels
– Write: Address write (AW), write data (W), and write
response (B) channels
[Source: AXI Spec]

Interconnect, Interface & Channel


Separate Read / Write Channels
• AMBA AXI allows for independent read and
write transactions.
• Channels between AXI Master and AXI Slave, each with its own handshake:
– Write address/control (AWREADY)
– Write data (WREADY)
– Write response (BREADY)
– Read address/control (ARREADY)
– Read data (RREADY)
Split Transaction
• Address, data, and response are handled
separately, each on its own channel (AW, W, B, AR, R).
Split Transaction: Write (1/3)
• The master issues the address on the write
address/control channel (AWREADY handshake).
Split Transaction: Write (2/3)
• The master supplies the data on the write data
channel (WREADY handshake).
Split Transaction: Write (3/3)
• The slave acknowledges on the write response
channel (BREADY handshake).
Split Transaction: Read (1/2)
• The master issues the address on the read
address/control channel (ARREADY handshake).
Split Transaction: Read (2/2)
• The slave returns the data on the read data
channel (RREADY handshake).
Wire Counts
• Address 32b, data 32b bus case: 184~204 wires
– AW: 52~56, W: 39~43, B: 4~8, AR: 52~56, R: 37~41

Write Address/Control channel (52~56 wires):
– AWID[3:0] write address ID (0~4 bits)
– AWADDR[31:0] write address
– AWLEN[3:0] burst length
– AWSIZE[2:0] burst size
– AWBURST[1:0] burst type
– AWLOCK[1:0] lock info
– AWCACHE[3:0] cache type
– AWPROT[2:0] protection type
(the above are payload signals)
– AWVALID write address valid
– AWREADY write address ready
(handshake signals)

Write data channel (39~43 wires):
– WID[3:0] write ID tag, AWID = WID (0~4 bits)
– WDATA[31:0] write data
– WSTRB[3:0] write strobes
– WLAST write last
– WVALID write valid
– WREADY write ready
Write Address Channel & Signals

Signal | Source | Description
AWID[3:0] | Master | Transaction id
AWADDR[31:0] | Master | Start address
AWLEN[3:0] | Master | Burst length (1~16 beats)
AWSIZE[2:0] | Master | Transfer size (8~1024 bits)
AWBURST[1:0] | Master | Burst type (FIXED, INCR, WRAP)
AWLOCK[1:0] / AWCACHE[3:0] | Master / Master | Lock / cache type
AWPROT[2:0] | Master | Protection type
AWVALID / AWREADY | Master / Slave | Handshake control
Write Data & Response Channels

Signal (Write Data Ch.) | Source | Description
WID[3:0] | Master | Transaction id
WDATA[31:0] | Master | Write data
WSTRB[3:0] | Master | Write strobe per byte
WLAST | Master | Write last
WVALID / WREADY | Master / Slave | Handshake control

Signal (Write Response Ch.) | Source | Description
BID[3:0] | Slave | Transaction id
BRESP[1:0] | Slave | Write response
BVALID / BREADY | Slave / Master | Handshake control
Read Channels & Signals

• AR signals are analogous to AW signals
• R signals are analogous to W signals
– RLAST is driven by the slave
One Address for Burst

ADDRESS: A11 ......... A21 ....... A31
DATA:    D11 D12 D13 D14  D21 D22 D23  D31

• Separation of address and data channels
– The master provides only the start address of the burst
– The slave needs to generate the remaining addresses based
on the burst type (FIXED, INCR, WRAP)
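The slave-side address generation described above can be sketched as follows. This is an illustrative sketch, not ARM code; `burst_addresses` is my own helper name, and the AXI alignment restrictions on WRAP bursts (aligned start, power-of-two length) are only noted in a comment.

```python
def burst_addresses(start, burst_type, length, size_bytes):
    """Return the address of each beat in an AXI burst.

    burst_type: "FIXED", "INCR" or "WRAP"
    length:     number of beats (1..16 in AXI3)
    size_bytes: bytes per beat (decoded AxSIZE)
    """
    if burst_type == "FIXED":
        # Same address every beat, e.g., for FIFO access.
        return [start] * length
    if burst_type == "INCR":
        return [start + i * size_bytes for i in range(length)]
    if burst_type == "WRAP":
        # AXI requires an aligned start address and a power-of-two length.
        # The address wraps at a boundary of length * size_bytes bytes.
        wrap = length * size_bytes
        lower = (start // wrap) * wrap
        return [lower + (start - lower + i * size_bytes) % wrap
                for i in range(length)]
    raise ValueError(burst_type)
```

For example, a 4-beat WRAP burst of 4-byte beats starting at 0x38 wraps at the 16-byte boundary: 0x38, 0x3C, 0x30, 0x34.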
Burst Length, Size and Type
Data Bus
• Write strobe
• Narrow transfer (incrementing burst case)

Data Bus (Cont’d)
• Narrow transfer (incrementing burst case)
• An easy way to handle an unaligned write: use write
strobes
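The write-strobe idea can be sketched as follows; this is my own illustration (the helper name `wstrb` is not from the spec), assuming a 32-bit (4-byte) data bus where WSTRB[3:0] carries one bit per byte lane.

```python
def wstrb(addr, nbytes, bus_bytes=4):
    """Strobe mask for a write of nbytes starting at addr, within one beat."""
    lane = addr % bus_bytes                  # first valid byte lane
    assert lane + nbytes <= bus_bytes, "write must fit in one beat"
    return ((1 << nbytes) - 1) << lane       # one set bit per valid byte lane
```

An unaligned 3-byte write starting at address 1 thus asserts strobes 0b1110: lanes 1, 2, and 3 carry valid data while lane 0 is ignored.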
Unaligned Transfer
Unaligned Transfer (Cont’d)
Unaligned Transfer (Cont’d)
• Wrapping burst case
[AXI4]

Burst Length (AXI4)
• AR(W)LEN[7:0] allows bursts of up to 256 beats
• Bursts in the AXI3 protocol:
– Early termination of bursts is not supported.
– A burst must not cross a 4-kbyte boundary. This ensures that a
burst is only destined for a single slave.
• AXI4 protocol longer-burst support:
– Bursts longer than 16 beats are only supported for the INCR
burst type. Both WRAP and FIXED burst types remain
constrained to a maximum burst length of 16 beats.
– Exclusive accesses are not permitted to use a burst length
greater than 16.
• What about AXI4-to-AXI3 transfers? A conversion wrapper is needed!
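The burst-length rules above can be condensed into a small legality check. A minimal sketch under the stated rules only (`burst_legal` is my own name; it does not cover every spec restriction, e.g., WRAP alignment):

```python
def burst_legal(start, length, size_bytes, burst_type, axi4=True):
    """Check the burst rules above: 4 KB boundary, and max length per type."""
    total = length * size_bytes
    # A burst must not cross a 4-kbyte boundary (single-slave guarantee).
    if start // 4096 != (start + total - 1) // 4096:
        return False
    if axi4:
        # Only INCR may exceed 16 beats in AXI4 (up to 256).
        max_len = 256 if burst_type == "INCR" else 16
    else:
        max_len = 16                        # AXI3: all types capped at 16
    return 1 <= length <= max_len
```

So a 4-beat INCR burst of 4-byte beats starting at address 4090 is illegal (it crosses the 4096 boundary), while a 256-beat INCR burst is legal in AXI4 but a 256-beat WRAP burst is not.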
Benefit of Split Transaction:
Multiple Outstanding Requests

ADDRESS A11 A21 A31

DATA D11 D12 D13 D14 D21 D22 D23 D31

*Reduces the delay of D21, D22, D23

• Parameters for multiple outstanding requests
– Master I/F: Issuing capability → the number of outstanding
requests the master can generate
– Slave I/F: Acceptance capability → the number of outstanding
requests the slave can accept
Effect of Multiple Outstanding
Requests
• Setup
– Single outstanding request by each master
– 6 masters → register slices (RS) → PL300 crossbar → PL340 memory
controller → 32b SDR SDRAM with 4 banks
– 5000 clock cycles simulation
• Result: memory utilization vs. number of masters (1, 2, 4, 6),
read case (RS 0/2); utilization rises with the number of masters
• Analysis
• A single master w/o outstanding requests can achieve only about 30%
utilization.
• The effects of bank interleaving by different masters are significant, up to 74%.
• 6 master case may suffer from bank conflicts.
Effect of Multiple Outstanding
Requests
• Setup
– Multiple outstanding requests by each master
– Same configuration: 6 masters → register slices (RS) → PL300 →
PL340 → 32b SDR SDRAM with 4 banks
• Result: memory utilization vs. number of masters (1, 2, 4, 6),
read case (RS 0/2)
• Analysis
• A single master w/ multiple outstanding requests can achieve >50% utilization.
• The effects of bank interleaving by different masters are significant, up to 74%.
• Register slice does not degrade the overall performance, i.e. utilization since
multiple outstanding hides its latency.
Read Burst Operation
• Timeline: read request is initiated → read request is accepted →
data read is ready → 1st data is transferred → the last data is
transferred
• Note: a data transfer occurs only when VALID = READY = 1
Overlapping Read Bursts
• Read request A is accepted; read request B is accepted via the
AR channel while data A(0) is transferred via the R channel
Write Burst Operation
• Write request A is accepted; the response completes the
write operation
Out-of-Order Transaction
ADDRESS: A11 A21 A31
RDATA:   D21 D22 D23  D31  D11 D12 D13 D14

• Transaction ID is used to identify data transfers on all
channels
– ARID <-> RID and AWID, WID <-> BID
– Up to four bits
• Ordering by transaction ID
– Data transfers with the same transaction ID must finish in the
order of request issue.
– The slave can handle data transfers with different transaction IDs
out of order.
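The ID ordering rule above can be expressed as a small checker: one FIFO of issue order per transaction ID, so same-ID responses must pop in order while different IDs are free to interleave. An illustrative sketch (the class name and API are mine, not from the spec):

```python
from collections import defaultdict, deque

class OrderingChecker:
    """Check that responses with the same transaction ID arrive in issue order."""

    def __init__(self):
        self.pending = defaultdict(deque)   # transaction ID -> tags in issue order

    def issue(self, txn_id, tag):
        self.pending[txn_id].append(tag)

    def complete(self, txn_id, tag):
        # Only the oldest outstanding tag of this ID may complete next.
        expected = self.pending[txn_id].popleft()
        if expected != tag:
            raise AssertionError(f"ID {txn_id}: got {tag}, expected {expected}")
```

With this model, completing A21 (ID 2) before A11/A12 (ID 1) is fine, but completing A12 before A11 on the same ID is flagged.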
Ordering #1: Request and
Response
• Write request and data
– The write data can appear at an interface
before the write address that relates to it
• Two relationships that must be maintained
are:
– Read data must always follow the address to
which the data relates
– A write response must always follow the last
write transfer in the write transaction to which
the write response relates
Dependencies between Channel
Handshake Signals (AXI3)
• To prevent a deadlock situation, you must
observe the dependencies that exist between
the handshake signals
• In any transaction:
– The VALID signal of one AXI component must
not be dependent on the READY signal of the
other component in the transaction
– The READY signal can wait for assertion of the VALID
signal
[AXI4]

Dependencies between Channel


Handshake Signals (AXI4)
• The AXI3 protocol requires that the write response for all
transactions must not be given until the clock cycle after the
acceptance of the last write data transfer (WLAST)
• In addition, the AXI4 protocol requires that the write response for
all transactions must not be given until the clock cycle after address
acceptance
Ordering #2: Multiple read
requests in slave and interconnect

(Figure: a master IP connects through the interconnect to slave IP 1
and slave IP 2; multiple read requests are queued in the slaves and
the interconnect)
• Read data reordering depth: the number of pending read
requests that a slave (or the interconnect) can reorder
Ordering #3: Write
• Write data with different AWIDs follow their
address order
• Responses to multiple writes with different IDs
can be out-of-order from address order
Write Interleaving
ADDRESS: A11 A21 A31
DATA:    D11 D21 D22 D12 D23 D31 D13 D14

• Interleaving rules
– Data with different IDs can be interleaved.
– The order within a single burst is maintained.
– The order of the first data beats must match the order of
the requests.
– WriteInterleaveCapability → the maximum number of
transactions that a master can interleave
[AXI4]

Write Interleaving in AXI4


• No support of write interleaving in AXI4
• Removal of WID in AXI4, why?
– Write data with different AWIDs follow their
address order + no write interleaving → no
need of WID!
– Responses to multiple writes with different
IDs can be out-of-order from address order →
BID remains!
Transaction ID Implementation
• Real implementation
– Transaction ID = <master ID, channel ID>
– Channel ID = original AXI transaction id
– Master ID is needed to identify the initiating master among
all the masters

(Figure: seven masters — CPU, Video Decoder, 3D Graphics, LCD
Control, Video Process, Mixer, DMA — each with its own AXI ID width
of 0~4 bits, connect through a crossbar to the memory controller)

Extended ID at the memory controller: 4 + ceil(log2 7) = 4 + 3 = 7 bits

Memory
Controller
Controller
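The extended-ID width computation above can be written out directly. A minimal sketch (the helper name `extended_id_bits` is my own):

```python
import math

def extended_id_bits(channel_id_bits, num_masters):
    """ID width after the interconnect prepends a master ID.

    The master ID needs ceil(log2(num_masters)) bits to distinguish
    all initiating masters; it is prepended to the original AXI ID.
    """
    return channel_id_bits + math.ceil(math.log2(num_masters))

# 4-bit AXI ID + 7 masters -> 4 + 3 = 7 bits, matching the slide.
```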
Cache Support (AXI3)
ARCACHE[3:0] / AWCACHE[3:0]
• Bufferable bit (B): AWCACHE[0]
– The write can be delayed for an arbitrary number of cycles
• Cacheable bit (C): AR(W)CACHE[1]
– Read: prefetch or read cache is possible
– Write: write merging is possible
• Read Allocate bit (RA): ARCACHE[2]
– If read miss, fetch the data to cache
– If C=low, RA=low
• Write Allocate bit (WA): AWCACHE[3]
– If write miss, fetch the data to cache, and then write
to the cache (and through the memory)
– If C=low, WA=low
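The four AXI3 cache bits above decode mechanically from AR(W)CACHE[3:0]. An illustrative sketch (my own helper; the assertion encodes the slide's rule that RA/WA must be low when C is low):

```python
def decode_awcache(awcache):
    """Decode AXI3 AWCACHE[3:0] into the B / C / RA / WA attribute bits."""
    b  = bool(awcache & 0b0001)   # Bufferable
    c  = bool(awcache & 0b0010)   # Cacheable
    ra = bool(awcache & 0b0100)   # Read Allocate
    wa = bool(awcache & 0b1000)   # Write Allocate
    # Per the slide: if C=low, RA and WA must also be low.
    assert c or not (ra or wa), "RA/WA require C=1"
    return {"B": b, "C": c, "RA": ra, "WA": wa}
```

For example, 0b0011 decodes to bufferable + cacheable with no allocation hints, while 0b0100 (allocate without cacheable) is rejected.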
[AXI4]

Cache Support 1 (AXI4)


[AXI4]

Cache Support 2 (AXI4)


[Source: K. Asanovic, 2008]

A Typical Memory Hierarchy c.2008

• Multiported register file (RF, part of the CPU)
• Split instruction & data primary (L1) caches (on-chip SRAM)
• Large unified secondary (L2) cache (on-chip SRAM)
• Multiple interleaved memory banks (off-chip DRAM)
Read / Write Allocate
• On a read or write miss, first fetch the cache
line and then access the location for the read /
write
Write Response from
Intermediate Point
• Basically, the memory gives the write response
• Intermediate points, e.g., caches, can also give a write
response; they then become responsible for delivering
the data to memory
Write Through and Write Back
• Write through
– If L1 is updated, all the corresponding data in L2 and
memory are updated
• Write back
– The data update is delayed until the data is evicted
[Source: K. Asanovic, 2008]

Write Buffer
• The write buffer sits between the data cache and the unified
L2 cache and holds:
– Evicted dirty lines, for a write-back cache, OR
– All writes, for a write-through cache
• On a read miss, first check the write buffer contents:
– if there is a hit in the write buffer, read the hit data
– else, send the miss request to the memory (or L2/L3 cache)
[Source: J. Kubiatowicz, 2000]

Write Merging (Coalescing)
• On a write miss or write back, the write goes into the write
buffer between the data cache and the unified L2 cache
• If the buffer contains modified blocks, the addresses can be
checked to see if the address of the new data matches the address of
a valid write buffer entry
• If so, the new data are combined with that entry
• The Sun T1 (Niagara) processor, among many others, uses
write merging
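The merging check described above can be sketched as follows. This is my own illustration, not the T1 design: entries are keyed by block address, and a write whose block matches a valid entry is combined into it instead of allocating a new one.

```python
class WriteBuffer:
    """Toy coalescing write buffer: merge writes that hit the same block."""

    def __init__(self, block_bytes=4):
        self.block = block_bytes
        self.entries = {}                  # block base address -> {offset: byte}

    def write(self, addr, data_bytes):
        base = addr - addr % self.block    # containing block's base address
        # setdefault merges into an existing entry when the base matches.
        entry = self.entries.setdefault(base, {})
        for i, byte in enumerate(data_bytes):
            entry[addr - base + i] = byte
```

Three separate byte writes into the same 4-byte block thus occupy a single buffer entry rather than three.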
[AXI4]

Memory Type
[AXI4]

Device Non-Bufferable and


Bufferable
• Device Non-Bufferable (0000)
– Read/write (response) only from/to the
destination (DRAM, peripherals, …)
• Device Bufferable (0001)
– Write response can be obtained from an
intermediate point (bufferable or allocate)
– Read from the destination
[AXI4]

Normal Non-cacheable
Non-bufferable / Bufferable
• Normal Non-cacheable Non-bufferable:
0010
– Write merging is possible (modifiable)
• Normal Non-cacheable Bufferable: 0011
– Write response from an intermediate point is
possible (bufferable or allocate)
– Write merging (modifiable)
– Read data from the destination or from a
write transaction that is going to the
destination (modifiable & bufferable in read)
[AXI4]

Write Through 1
• Write Through No Allocate: 1010/0110
– Why write through → not bufferable
– Why no allocate → only the "other allocate" bit is set
– Write response from an intermediate point is possible
(bufferable or (other) allocate)
– Write merging is possible (modifiable)
– Read data can come from an intermediate cached copy
(bufferable or (other) allocate)
– Read pre-fetch is possible (modifiable)
– Allocation of either reads or writes is not
recommended for performance reasons, but not
prohibited
[AXI4]

Write Through 2
• Write Through Read Allocate: 1110(0110)/ 0110
– Basically, Write Through No Allocate except
– Allocation of reads is recommended for performance reasons
– Allocation of writes is not recommended for performance
reasons
• Write Through Write Allocate: 1010/1110 (1010)
– Reverse of the above
– Allocation of writes is recommended for performance reasons
while that of reads is not recommended
• Write Through Read and Write Allocate: 1110/1110
– Basically, Write Through No Allocate except
– Both allocations of reads and writes are recommended
[AXI4]

Write Back 1
• Write Back No Allocate: 1011/0111
– Why write back → bufferable
– Writes are not required to reach the destination (bufferable)
– Write response from an intermediate point is possible (bufferable
or (other) allocate)
– Write merging is possible (modifiable)
– Read data can come from an intermediate cached copy (bufferable and
(other) allocate)
– Read pre-fetch is possible (modifiable)
– Allocation of either reads or writes is not recommended for
performance reasons, but not prohibited
[AXI4]

Write Back 2
• Write Back Read Allocate: 1111(0111)/ 0111
– Basically, Write Back No Allocate except
– Allocation of reads is recommended for performance reasons
– Allocation of writes is not recommended for performance
reasons
• Write Back Write Allocate: 1011/1111 (1011)
– Reverse of the above
– Allocation of writes is recommended for performance reasons
while that of reads is not recommended
• Write Back Read and Write Allocate: 1111/1111
– Basically, Write Back No Allocate except
– Both allocations of reads and writes are recommended
[AXI4]

Transaction Buffering
• When does the write reach the destination?
• The write must reach the destination in a
timely manner:
– Bufferable (Device and Normal Non-
Cacheable) and Write Through cases
• Not required to reach the destination (but
the data must not be lost in any case):
– All Write Back cases
[AXI4]

Ordering 4: Ordering on the


Same (or Overlapping) Address
• Keep the ordering of requests/responses
to the same (or overlapping) address

Device memory
Atomic Access
• Normal access, AR(W)LOCK[1:0] = b00
• Exclusive access, b01
– Exclusive read (load-linked) … exclusive write (store-
conditional)
– If there is no intervening write to the address region, the
response is EXOKAY; otherwise, OKAY.
– Usually used for read-modify-write
– Example: Master 1 exclusive-reads 0x100; Master 2 then writes
0x100; Master 1's exclusive write to 0x100 receives only OKAY
from Slave 1 (the exclusive write fails)
• Locked access, b10
– Start with b10, and end with b00
– During the period, only the lock-initiating master can
access the address region (the region is locked, not the whole slave!)
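The exclusive-access semantics above can be modeled with a simple monitor: an exclusive write returns EXOKAY only if no other master has written the monitored address since the exclusive read. An illustrative sketch under simplified assumptions (one monitored address per master, exact address match; the class name and API are mine):

```python
class ExclusiveMonitor:
    """Toy load-linked / store-conditional monitor for AXI exclusive access."""

    def __init__(self):
        self.tags = {}                       # master id -> monitored address

    def exclusive_read(self, master, addr):
        self.tags[master] = addr             # start monitoring this address

    def write(self, master, addr, exclusive=False):
        # Any write clears other masters' monitors on that address.
        for m, a in list(self.tags.items()):
            if m != master and a == addr:
                del self.tags[m]
        if not exclusive:
            return "OKAY"
        ok = self.tags.pop(master, None) == addr
        return "EXOKAY" if ok else "OKAY"    # plain OKAY: the store failed
```

Replaying the slide's scenario: M1 exclusive-reads 0x100, M2 writes 0x100, so M1's exclusive write gets OKAY; without the intervening write it would get EXOKAY.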
[AXI4]

Atomic Access in AXI4


• No support of locked access

• All locked accesses from AXI3 masters


need to be converted to normal accesses
Protection Support
• Normal or privileged: AR(W)PROT[0]
– High → privileged
• Secure or non-secure: AR(W)PROT[1]
– Low → secure
• Instruction or data: AR(W)PROT[2]
– High / low → instruction / data
Low Power Interface (C channel)
PL301
Crossbar Bus
AXI Crossbar Bus
ARM PrimeCell PL301
• PL301: PrimeCell High Performance Matrix
– Masters attach to Slave Interfaces (SIs); slaves attach to
Master Interfaces (MIs)
PL301 Internal Structure
• Adjusts interface differences
– Data width, frequency, and protocol
• Structurally configured fabric matrix core with routing
control, parameterised by, e.g., the address map
• Bridges at the edges adapt each port:
– Synchronous / asynchronous bridges between clock domains
(e.g., AXI at 50 MHz, 166 MHz, 200 MHz)
– Register slices and downsizers
– AXI-AHB (e.g., AHB 32-bit, 70 MHz) and AXI-APB
(e.g., APB 40 MHz) converters

Adjusting Data Width
• ExpanderAxi (e.g., 32b master on a 64b bus)
– Converts a narrow master for use on a wide bus
– Data replication on the write data path and muxing on the read data path
• FunnelAxi (e.g., 32b slave on a 64b bus)
– Converts a narrow slave for use on a wide bus
– Mux on the write data path and replication on the read data path
– Assumes that transfers are suitably sized for the slave (i.e., no wider
than the narrow bus)
• DownSizerAxi (e.g., 64b bus to 32b bus)
– Converts a wide bus to a narrow bus
– Transactions no wider than the narrow bus pass through
– Transactions wider than the narrow bus are broken down into smaller
transactions
[Source: PL301 TS]

PL301 Features
• Configurable number of SIs and MIs
• Sparse connection options to reduce gate count and improve
security
• Configurable AXI address/data widths
• Decoded address register that you can configure for each SI
• Flexible register stages to aid timing closure
• An arbitration mechanism that you can configure for each MI,
implementing:
– a fixed Round-Robin (RR) scheme
– a programmable RR scheme
– a programmable scheme that provides prioritized groups of Least
Recently Granted (LRG) arbitration
• A programmable Quality of Service (QoS) scheme
• Support for multiple clock domains: synchronous and asynchronous.
• Configurable cyclic dependency schemes to enable a master to have
outstanding transactions to more than one slave
Arbitration Scheme
Fixed Priority
• Each slave interface of the PL301 is assigned a fixed priority
in the control registers: SlaveInterface0 (M0) has priority 0
(highest) down to SlaveInterface3 (M3) with priority 3 (lowest)
[Source: PL301 TS]

Arbitration Scheme
Round Robin
Arbitration Scheme
Hybrid
• Combination of round robin and fixed
priority
– Fixed mode: SlaveInterface0 (M0) at priority 0 (highest),
SlaveInterface1 (M1) at priority 1
– Round-robin mode: SlaveInterface2 (M2) and SlaveInterface3
(M3) share priority 2 (lowest), with a "next" pointer rotating
between them
– Priorities are set in the PL301 control registers

Arbitration Scheme
• LRG (least recently granted) scheme
Arbitration Scheme
Fixed Round Robin
• A weighted round robin in a fixed order
– Example: masters (e.g., ARM, VIDEO, 3D, LCD mapped to m0~m5)
are granted slave S0 in a fixed repeating slot sequence, with
more slots assigned to masters that need more bandwidth
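A fixed (weighted) round robin like the one above can be sketched as a slot wheel: the slot owner is granted if it is requesting, otherwise the arbiter falls through to the next requester in wheel order. This is my own illustration, not the PL301 implementation; the slot sequence below is a made-up example where m0 owns two of four slots.

```python
class FixedRoundRobin:
    """Weighted round robin over a fixed repeating slot sequence."""

    def __init__(self, slots):
        self.slots = slots                 # e.g. ["m0", "m1", "m0", "m2"]
        self.idx = 0                       # current slot

    def grant(self, requests):
        """Grant the slot owner if requesting, else the next requester."""
        order = self.slots[self.idx:] + self.slots[:self.idx]
        self.idx = (self.idx + 1) % len(self.slots)   # advance the wheel
        for m in order:
            if m in requests:
                return m
        return None                        # no master is requesting
```

With slots ["m0", "m1", "m0", "m2"], m0 receives half of all contested grants, which is the bandwidth-weighting effect the slide describes.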
An Example of Crossbar Bus Design
• Seven masters (CPU, Video Decoder, 3D Graphics, LCD Control,
Video Process, Mixer, DMA) connect through a crossbar to the
memory controller
• One master alone accounts for 40% of the total bandwidth; some
IPs need a privileged usage of memory bandwidth
[Source: PL301 TS]

Programmable QoS

• Tidemark: the maximum number of outstanding requests
allowed for best-effort traffic
• Example: assume Tidemark = 4 and ID match = M0.
If there are already 4 outstanding requests from M1,
only requests from M0 are accepted by S0 until one of
M1's requests is served
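The tidemark rule above reduces to a one-line acceptance test. An illustrative sketch (the function and parameter names are mine, not PL301 register names):

```python
def accept(req_id, outstanding_best_effort, tidemark=4, id_match="M0"):
    """Tidemark QoS: below the mark accept anyone, at the mark only id_match."""
    if outstanding_best_effort < tidemark:
        return True                        # best-effort budget not yet used up
    return req_id == id_match              # saturated: privileged ID only
```

So with 4 of M1's requests outstanding, a new M1 request is stalled while an M0 request still gets through, matching the example above.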
Bus Arbitration: A Generic Arbiter
• Assumptions
– Each request has its own performance requirement (e.g., bandwidth budget
and/or latency)
• E.g., low latency access from CPU
• E.g., bandwidth guarantee for LCD / Camera controllers
– In order to avoid starvation, a global time out is applied
• Priority order: time-out > bandwidth > best effort
– Time-out (TO) request: TO = 0, and BW budget > 0
– Bandwidth (BW) request: TO > 0, and BW budget > 0
– Best effort (BE) request: BW budget = 0
• Priority promotion/demotion
– Demotion: if BW budget is exhausted, demotion to BE
– Promotion: if BW budget becomes positive, promotion to BW or TO request
• Time-out counters
– One type of counter for QoS access w/ time out
– The other for old request
• When a normal request (w/ unspecified TO) arrives at the bus, a timer is
assigned and starts to be decremented each cycle.
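The three priority classes above (TO > BW > BE) follow directly from a request's time-out counter and remaining bandwidth budget. A minimal sketch of the classification and selection (my own helper names; ties inside a class are broken here by smallest time-out, one reasonable choice among several):

```python
def classify(timeout, bw_budget):
    """Map a request to its priority class per the rules above."""
    if bw_budget <= 0:
        return "BE"                        # budget exhausted: best effort
    if timeout == 0:
        return "TO"                        # timed out: highest priority
    return "BW"                            # bandwidth request

def pick(requests):
    """requests: list of (name, timeout, bw_budget); serve highest class first."""
    rank = {"TO": 0, "BW": 1, "BE": 2}
    return min(requests,
               key=lambda r: (rank[classify(r[1], r[2])], r[1]))[0]
```

For example, among a CPU request with budget left, an LCD request whose timer has expired, and a DMA request with no budget, the timed-out LCD request wins.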
Bus Arbitration: A Generic Arbiter
• If there is any TO request, it is served first (highest priority)
– QoS request w/ time out
– Old request
• Otherwise, if there is any BW request
– Give the bus to the BW requests based on BW budget
– Based on fixed time slot allocation or statistical slot allocation
• Otherwise, apply the same priority order to the BE requests
as to BW requests

Case of fixed time slot allocation:
at a time slot (instant or period),
if the master allocated to the current slot has a request,
then it is served;
else the other pending requests are served (e.g., round robin)

Note: this arbitration scheme is similar to the one used in memory
access scheduling
A Deadlock Problem in Accessing
Multiple Slaves
• Figure: over the on-chip bus, Master 1 and Master 2 each access
Memory Controller 1 (Memory 1) and Memory Controller 2 (Memory 2);
color = transaction id
• Requests with the same transaction id need to be finished in the
order of request issue: request D is blocked at master 1 and
request B is blocked at master 2
• Optimization in the memory controller: memory controllers can
serve independent requests out-of-order to increase memory
utilization or to lower memory access latency

Master 1: C D
Master 2: A B        (time →)
Cyclic Dependency Schemes
• Outstanding requests are permitted only for a
single slave per transaction id
• The deadlock problem is resolved while limiting
parallel (memory) accesses

Memory 1: A D
Memory 2: C B

Master 1: C C D D
Master 2: A A B B
(clock cycles 1~17)
Single Slave Scheme
• Allow multiple outstanding transactions
only to the same slave
Unique ID Scheme
• Accept only requests that may complete out-of-order,
i.e., requests with different transaction IDs
Single Slave per ID
• Combination of both single slave and
unique ID schemes
• Allow multiple outstanding requests to a
single slave per transaction ID
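The single-slave-per-ID rule above can be modeled as an acceptance check: a new request stalls if its transaction ID already has requests outstanding to a different slave. An illustrative sketch (class and method names are mine):

```python
class SingleSlavePerID:
    """Cyclic-dependency scheme: one destination slave per transaction ID."""

    def __init__(self):
        self.dest = {}                     # txn id -> (slave, outstanding count)

    def can_issue(self, txn_id, slave):
        s, n = self.dest.get(txn_id, (slave, 0))
        return n == 0 or s == slave        # same slave, or nothing outstanding

    def issue(self, txn_id, slave):
        assert self.can_issue(txn_id, slave)
        s, n = self.dest.get(txn_id, (slave, 0))
        self.dest[txn_id] = (slave, n + 1)

    def complete(self, txn_id):
        s, n = self.dest[txn_id]
        self.dest[txn_id] = (s, n - 1)     # keep the slave, drop the count
```

This blocks exactly the interleavings that cause the deadlock in the earlier slide, while still permitting parallel accesses on different transaction IDs.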
[Source: J. Yoo, 2008]

Cascaded Crossbar (XB) Bus


Case A: a single big 14x4 crossbar (wire segments of 3~5 ns)
Case B: a cascaded 9x4 + 7x2 crossbar (wire segments of 3~4 ns)

Benefits of cascading:
• Higher clock frequency
• Possibly, lower area/power cost
On-Chip Network (OCN) Design Flow
(design-flow figure repeated from the beginning of the deck)
AXI Bus Architecture Design Space
• Interconnection level
– PL300 decomposition: cascading level, e.g., a 2-level cascaded
architecture (3x2 and 2x2 PL300s) to localize traffic
– Register slices on timing-critical paths
• Bus component level
– Common parameters: bit width, etc.
– PL300: WriteIssueCap, etc.
– PL340: arbiter queue, data FIFO size, etc.
– Register slice: full/forward/bypass per channel
– Closely related with the PL30x / PL340 implementation
• IP interface level
– # of AXI ports, internal buffer usage/size, address queue size,
burst size, etc.
– Advanced AXI interface with large WriteIssuingCapability on
high-latency paths
(Figure: processor, DSP, and IPs connect through cascaded PL300s
with register slices to PL340 memory controllers and SRAM)
Performance Evaluation of Bus
Architecture
• Three methods depending on how to
model the traffic of bus masters
– ViP (virtual platform)-based methods
• Random bus traffic model
• Sequence-based bus traffic model
• Cycle-accurate Carbonized model
– FPGA-based model
• Bus masters are emulated on FPGA
– RTL simulation
• Golden RTL is built and simulated
Bus RTL Generation:
AMBA Designer
• Generated IP’s
– Interconnect, DMA, memory
controller, etc.
• Parameters
– data width (32b ~ 128b),
protocol (AXI, AHB, APB), etc.
– Timing (register slice, address
decode), security (trustzone),
etc.
– QoS (programmable QoS),
arbitration schemes, etc.
Bus RTL Verification
• Bus component verification
– Verification of each component (PL30x,
PL34x)
– Method & criterion: Specman eVC & protocol
coverage
• Bus architecture verification
– Verification of the entire bus architecture
– Method & criterion: Specman eVC + system
scenarios & functional correctness + protocol
coverage
On-Chip Network (OCN) Design
Flow
Bus RTL Verification (Specman-based)
IP1 IP2 IP3 IP4

Bus 1 Bus 2
Generated
Bus
RTL code
Mem Mem

Pre-layout Bus RTL


Register slice generation
Architecture exploration w/
optimization
performance evaluation Early engagement P&R
w/ floorplanning

IP1 IP2 IP3 IP4 IP3 IP3


IP4 IP4
latching
Bus 1 pipelining Bus 1 Bus 1
too big
Bus RTL I/O delay
IP1 IP2 IP1 bufferingIP2
generation
Mem Mem too long
wire
Register Slice for Timing Isolation
• A register slice can be used on any channel
independently
Register Slice for Timing Isolation
• A register slice incurs one cycle of latency per
insertion
[Source: PL301 TS]

Two Modes: Full and Forward

• Area benefit: forward mode gives smaller
area
• The same latency penalty

(Figures: AXI register slice in full mode and in forward mode)

On-Chip Network (OCN) Design Flow
(design-flow figure repeated from the beginning of the deck)
Topology/Floorplan/Pipeline
Co-design of Cascaded XB Bus

An Academic Approach
[Source: J. Yoo, 2008]

Floorplanned Cascaded XB Bus

• Case A floorplan: 14 masters (M1~M14) and 4 slaves (S1~S4)
placed around a single 14x4 crossbar
• Case B floorplan: the same masters and slaves placed around
cascaded 9x4 and 7x2 crossbars

Cascaded crossbar bus design needs to consider floorplan and pipeline
stages as well as topology
[Source: J. Yoo, 2008]

Topology/Floorplan/Pipeline
Co-design of Cascaded XB Bus
• Design input: communication graph & floorplan information
• Technology input: pre-characterized library of bus components
• Genetic-algorithm loop:
1. Initial population
2. Topology & floorplan generation
3. Pipeline stage insertion with timing analysis
4. Cost evaluation & tournament selection
5. If not done, repeat from step 2; otherwise return results
(Example inputs in the figure: an application SoC with proc, codec,
graphics, imaging, storage, dma, display, lcd, peri, security, conn,
and cfc blocks plus ddr0/ddr1/nand0/nand1 memories, each annotated
with size and bandwidth figures; and a DSP-array SoC with DSP00~DSP33,
SRAM0~SRAM3, SDRAM0/SDRAM1, CONTROL, and FLASH)
[Source: J. Yoo, 2008]

Topology/Floorplan/Pipeline
Co-design of Cascaded XB Bus
The same flow, illustrated on a small example:
• Blocks A~F around crossbar XB1 are encoded as a pair of ordered
sequences, e.g., (B C A XB1 D E F, C B F XB1 E D A)
• Adding a second crossbar XB2 between X1 and the D/E/F side yields
(B C A XB1 XB2 D E F, C B F XB1 XB2 E D A)
• A later generation produces the variant
(B C A XB1 XB2 D E F, C B XB1 F XB2 E D A)
[Source: J. Yoo, 2008]

Topology/Floorplan/Pipeline
Co-design of Cascaded XB Bus
Pipeline stage insertion example within the same flow:
(a) Masters M0~M3 reach slaves S0/S1 through crossbars X0, X1, X2;
each wire segment is annotated with its delay (0.5~4.0 ns), and
timing is analyzed per tree (Tree #1 to S1, Tree #2 to S0)
(b)-(c) Pipeline stages (register slices) are inserted along the
timing trees, splitting long segments (e.g., a 4.0 ns segment into
2.0 + 2.0 ns) during timing analysis
[Source: J. Yoo, 2008]

Experiments

• Two bar charts compare area (square mm) and power (mW) for three
methods (aspdac07, aspdac07ga, proposed) on two cases (case 1, case 2)
• Each bar is broken down into bridge, wire, crossbar, and pipeline
contributions
• The benchmark is the application SoC communication graph (proc,
codec, graphics, imaging, storage, dma, display, lcd, peri, security,
conn, cfc with ddr0/ddr1/nand0/nand1 memories) mapped onto crossbars
(XBAR) and a bridge

Summary

Topology/floorplan/pipeline co-design gives
lower area cost in cascaded crossbar bus design
Summary
• AMBA3 (AXI3 and AXI4) protocol
– Specification
– Focus on ordering
• PL301: a crossbar bus
– Arbitration, QoS, cyclic dependency schemes
– Crossbar-based bus design flow
