
A Beyond-1 GHz AMBA High-Speed Bus for SoC DSP Platforms

Alexandre Landry, Yvon Savaria*, Mohamed Nekili — Concordia University, *École Polytechnique de Montréal; {a-landry, mnekili}@ece.concordia.ca; *savaria@vlsi.polymtl.ca

Abstract
This paper presents an on-chip interconnection infrastructure based on the AMBA AHB standard to obtain a bus working beyond one gigahertz. All major design blocks necessary to implement reliable interconnect infrastructures for DSP platforms are presented. This interconnect infrastructure is implemented as a hard IP module to get the maximum performance out of TSMC's 0.18 µm CMOS technology. As a result, a bus operating at 1.4 GHz, capable of transferring 2.8 giga data items per second, was successfully designed.

1. Introduction

The next generation of digital signal processing (DSP) applications requires a computing performance that far exceeds the capabilities of state-of-the-art processors [1]. A remedy to this everlasting quest for higher data crunching rates is to engineer a system-on-a-chip (SoC) DSP platform, but a SoC cannot be efficient without a proper on-chip communication infrastructure. A requirement commonly found with DSP platforms is to support multiple data streams in real time. To meet this strong requirement, a deterministic, yet simple, communication infrastructure with sufficient bandwidth must be developed to provide a bottleneck-free integration for inter-module communication. Furthermore, it is often useful to reconfigure the flow of data on a SoC to execute another task [2]. To accommodate these requirements, a symmetric multi-processor (SMP) interconnect architecture is appropriate. The SMP style is attractive in parallel processing environments such as a DSP platform for several reasons: processing cores are equally far from the shared memory, and access to all shared data can be done through ordinary load and store operations, together with the automatic movement and replication of shared data in the local memory [3]. The method used in this research to implement the SMP communication infrastructure leverages the AHB (AMBA High-speed Bus) standard [4] to provide data streaming through a shared memory.

AMBA standards are commonly used in systems based on ARM processors. The AMBA AHB is meant to be synthesizable, as specified by the standard. This is indeed very practical, since most SoCs are developed using an ASIC design flow [5]. However, an important speed limitation results from (1) the logic synthesis process and (2) the sparseness of the layout obtained by automatic placement and routing [6]. This leads to a maximum bandwidth on the order of 500 MHz with a 0.18 µm technology [7]. Throughout this research project, a full-custom design flow is used in order to get the maximum attainable bandwidth out of the technology in use. Hence, with a 0.18 µm process from TSMC (Taiwan Semiconductor Manufacturing Company), the shared bus discussed in this paper operates at 1.4 GHz. A high-throughput interleaved memory system is developed using a memory compiler from Virage [8] to sustain the bandwidth of the shared bus.

AHB is known to hinder performance in processor-centric systems by exhibiting a significant latency [7]. However, it is possible to fool the soft-core processors by providing the illusion that they are directly connected to a multi-port shared memory. Hence, the latency can be predicted at the time of compiling an arbitration schedule.

The remainder of this paper is organized as follows. We present our bus architecture in Section 2. The high-throughput interleaved memory is presented in Section 3. Section 4 outlines the arbitration mechanism of the communication infrastructure and details the zones of exclusion to respect in order to achieve determinism. The bridges responsible for interconnecting the processors with the 1.4 GHz bus are presented in Section 5, and Section 6 concludes.

2. Architectural Overview

Given the environment of ASIC soft intellectual property (IP) cores in which the 1.4 GHz bus is intended to evolve, a cautious architecture is the key to a successful design. Several modifications need to be implemented with respect to the AHB standard [4]. They range from protocol leveraging to interconnect reconfiguration and increased pipelining.

0-7803-8656-6/04/$20.00 ©2004 IEEE


The proposed solution guarantees high-performance access to all modules sharing the high-speed AHB. An AHB interface supporting all the basic features of the standard is provided to facilitate module integration with the high-performance communication infrastructure, as shown in Figure 1. In this example, the system is partitioned into two clock domains. The computational elements sit in the low-frequency domain, whereas the high-performance bus with the interleaved memory resides in the fast clock domain. With a bus operating at 1.4 GHz, it is possible to guarantee a bandwidth of 87.5 MHz with a 1-cycle latency delivery time to up to 16 processors. The corresponding mathematical analysis is detailed in Section 4.
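The 87.5 MHz figure follows directly from time-multiplexing the 1.4 GHz bus among 16 masters. A quick check of the arithmetic (the helper name is ours, not from the paper):

```python
# Guaranteed per-master access rate on a time-multiplexed shared bus.
# Under a fixed schedule, each of the 16 masters gets every 16th bus
# cycle, so its worst-case rate is bus_freq / n_masters.

def guaranteed_rate_hz(bus_freq_hz: float, n_masters: int) -> float:
    """Worst-case access rate per master under a fixed rotating schedule."""
    return bus_freq_hz / n_masters

bus_freq = 1.4e9  # 1.4 GHz high-speed AHB
per_pe = guaranteed_rate_hz(bus_freq, 16)
print(per_pe / 1e6)  # 87.5 (MHz), matching the figure quoted in the text
```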


3. Interleaved Memory

The availability of a high-throughput memory is a key enabling factor for implementing our high-speed bus. Indeed, the development of a memory operating at 1.4 GHz is extremely difficult. However, exploitation of pipelining is possible by using synchronous static random access memories (SSRAM) in an interleaved manner. This research project makes use of 4 dual-port random-access memories (Fig. 2) from Virage to create a memory that can sustain 1.4 giga data accesses per second on both the Write and Read buses. This memory compiler can provide memories running up to 500 MHz when targeting 0.18 µm CMOS [8]. This is significantly faster than the speed target of 350 MHz we need to obtain 1.4 GHz of throughput. Shifting the clock phases by 90 degrees for each 350 MHz memory with respect to one another emulates a memory operating at 1.4 GHz. As a matter of fact, new addresses are accepted and new data items are produced at a rate of 1.4 GHz on each memory port.
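The phase-shifted interleaving can be sketched behaviorally: each bank needs four fast (1.4 GHz) clock cycles per access, and the banks are started one fast cycle apart, so collectively one access starts and one completes on every fast cycle. This is a behavioral model of the scheme, not the actual RTL:

```python
# Behavioral model of the 4-way interleaved memory: each 350 MHz bank
# takes 4 fast (1.4 GHz) cycles per access; banks are phase-shifted by
# one fast cycle, so a new access can be issued every fast cycle.
# Bank selection for back-to-back accesses is simply cycle mod 4.

N_BANKS = 4   # 350 MHz SSRAM banks
LATENCY = 4   # fast clock cycles per bank access (1.4 GHz / 350 MHz)

def schedule_reads(n_accesses):
    """Return (issue_cycle, completion_cycle, bank) for back-to-back reads."""
    plan = []
    for cycle in range(n_accesses):
        bank = cycle % N_BANKS          # phase-shifted bank rotation
        plan.append((cycle, cycle + LATENCY, bank))
    return plan

plan = schedule_reads(8)
# One completion per fast cycle from cycle 4 onward: sustained 1.4 GHz.
assert [done for _, done, _ in plan] == [4, 5, 6, 7, 8, 9, 10, 11]
# Each bank gets a new address only every N_BANKS cycles, i.e. at 350 MHz.
for issue, _, bank in plan:
    assert issue % N_BANKS == bank
```

The model shows why the roll-back works: by the time the rotation returns to bank 0 (cycle 4), bank 0's first data item is ready.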
Figure 2. Block diagram of the high-throughput interleaved memory.



Figure 1. Overall structure of the proposed communication infrastructure.

In addition, for the purpose of performance improvement, two specialized buses are developed: the first is devoted to Write operations only, whereas the second handles only Read operations. This specialization of buses is not forbidden by the specification of AHB, and it results in an aggregate bandwidth of 2.8 GHz. The heart of the high-speed bus is derived from the AHB standard. However, to get the fastest possible speed out of this interconnect medium, it is necessary to leverage all superfluous signals to the shared memory to simplify the logical circuits. Finally, the number of pipelining levels of a standard AHB is increased from 2 to 8 levels to allow more time to the high-throughput interleaved memory for data delivery.

Figure 3. Interleaved memory timing diagram.

The motivation behind increasing the depth of the pipeline of the AHB becomes evident from the memory operation. On the high-speed bus, a new address phase is allowed to be initiated at every clock cycle. However, when a read operation is attempted, the 350 MHz memories are slow to provide data with respect to the high-speed bus. This is where an increased pipeline depth becomes necessary. When a memory accepts a new address, it takes precisely four 1.4 GHz clock cycles to produce the requested data. Instead of wasting 4 cycles on the high-speed bus, a new access is initiated on each memory from the memory bank until we roll back to the first one, with the first data item being ready. This principle is illustrated in Figure 3.

There exists a possibility that the Read and Write buses attempt an access to the same memory location simultaneously. If this situation occurs, the data being read may be corrupted. An easy way to handle address contention is to raise an error and then retry the memory access. However, this strategy may severely hinder the determinism of the system. A better way to resolve this type of conflict is to work out a bypass mechanism that duplicates the data being written by the Write bus onto the Read bus upon conflict detection (Figure 2).
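The bypass described above can be captured in a small behavioral sketch: when the Read and Write buses target the same address in the same memory phase, the write data is forwarded onto the Read bus rather than read from the array. The function and signal names are illustrative, not from the design:

```python
# Behavioral sketch of the write-to-read bypass: on a same-address
# conflict between the Write and Read buses, forward the write data
# to the Read bus instead of reading the array, which could return
# corrupted data. The write always completes normally.

def memory_cycle(mem, waddr, wdata, raddr):
    """One dual-port cycle: perform the write, resolve the read."""
    if raddr == waddr:
        rdata = wdata              # conflict detected: bypass the array
    else:
        rdata = mem.get(raddr, 0)  # normal read from the array
    mem[waddr] = wdata             # the write proceeds unconditionally
    return rdata

mem = {}
assert memory_cycle(mem, waddr=0x10, wdata=0xAB, raddr=0x10) == 0xAB  # bypass hit
assert memory_cycle(mem, waddr=0x20, wdata=0xCD, raddr=0x10) == 0xAB  # normal read
```

Because the bypass resolves the conflict in the same cycle, no error/retry sequence is needed and the access schedule stays deterministic, which is the property the arbiter of Section 4 relies on.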

4. Bus Arbitration

In an ideal world, communication through a shared memory is best achieved via a multi-port memory where each processing element owns a private port. However, in practice, the area of the memory grows in proportion to the number of ports. For instance, the 1 Kb dual-port memory used in this research project is 2.11 times larger than a single-port memory of the same size.

As mentioned in Section 3, the high-throughput dual-port interleaved memory can support a bandwidth of 2.8 GHz. This allows emulating a multi-port memory, provided that an arbitration-aware mechanism is developed to manage the traffic on the shared bus. As a matter of fact, the arbiter must be aware of which memory bank a processing core writes its shared data into so that it can be read by the destination processor. This adds, on the arbiter, the extra burden of granting the shared bus on a specific memory phase determined by the physical location to be accessed.

A fast and elegant solution to this arbitration puzzle is to design the arbiter as a very long instruction word (VLIW) controller [2], [9]. Using a VLIW architecture allows an arbitration schedule to be computed at compile time. This strategy is particularly attractive with predictable applications, such as the video processing platform we target in this research, where multiple outstanding data streams are present simultaneously [2]. Hence, the VLIW arbiter creates a channel between a source and a destination processor by forcing the communicating entities to access the specialized AHB upon the same memory phase.

Figure 4 reveals the block diagram of the micro-programmed arbiter. Three main blocks compose the VLIW arbiter. First, a memory stores 64-bit words, each holding a sequence of 16 consecutive bus accesses. Then, the controller programs the shift register by transferring the proper 64-bit word into it. A feedback path is provided from the shift register to the controller in order to follow the access sequence on the bus. This allows reprogramming the arbitration sequence at run time, a feature that is interesting for SoCs that may need to be reconfigured online [2]. Finally, the shift register circulates the 64-bit word, which is logically divided into 16 4-bit words. Each 4-bit word is decoded to correspond to a specific bus master.

Figure 4. Block diagram of the VLIW micro-programmed arbiter.
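The circulating 64-bit schedule word can be sketched as follows: sixteen 4-bit fields, each naming the master granted on that bus slot, cycled repeatedly. The field ordering (slot 0 in the low nibble) is our assumption for illustration:

```python
# Sketch of the VLIW arbiter's schedule word: a 64-bit word holds 16
# consecutive grants, one 4-bit master ID per bus slot. Decoding a
# slot is a shift-and-mask; the schedule repeats every 16 slots, like
# a circulating shift register. Slot 0 in the low nibble is assumed.

def decode_schedule(word64):
    """Split a 64-bit schedule word into 16 4-bit master IDs."""
    return [(word64 >> (4 * slot)) & 0xF for slot in range(16)]

def grant_for_cycle(word64, cycle):
    """Master granted on a given bus cycle; the word circulates endlessly."""
    return decode_schedule(word64)[cycle % 16]

# Example schedule: grant masters 0..15 in round-robin order.
word = 0
for slot, master in enumerate(range(16)):
    word |= master << (4 * slot)

assert decode_schedule(word) == list(range(16))
assert grant_for_cycle(word, 17) == 1   # wraps around after 16 slots
```

Reprogramming the schedule then amounts to writing a new 64-bit word into the arbiter's memory, which is what makes the platform reconfigurable online.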

4.1. Traffic Analysis


As previously mentioned, there are several parameters to consider to provide quality-of-service (QoS) under real-time constraints. The high-speed bus outperforms the processing elements (PEs) by completing multiple memory accesses while the PEs are attempting a single access. Hence, a careful traffic analysis must be conducted in order not to clog the communication infrastructure. The response time attributed to the interleaved memory makes some aspects of the timing rather critical. In fact, even if the memory supports a bandwidth of 1.4 GHz for each bus, it responds at the speed of a memory of 350 MHz. There are other parameters to take into account, such as the time required by the PEs to make the data available, the setup and hold times of the PEs and the memory, and the time required by the high-speed bus to perform a transaction. Another factor to consider is the tolerable latency of a PE. The next equation models the time window available for a burst of transactions from the PEs:

(V + N + W + M) tHCLKF ≤ (2 + L) tHCLK    (1)

where V is the number of fast clock cycles (tHCLKF) required by a PE to make its data available after an HCLK pulse, N is the total number of masters that can attempt a read operation simultaneously, W models the number of fast clock cycles wasted to obtain a response from the memory, M is the number of fast clock cycles required to memorize the data from the memory, and L is the maximum tolerable latency in slow clock cycles (tHCLK).

Thus, the left-hand side of equation (1) models the time required by the specialized AHB to obtain a response from the pipelined memories, while the right-hand side models the upper bound that should not be crossed to provide data to the PEs before their real-time deadline.
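Equation (1) is easy to check numerically. In the sketch below, the clock periods follow the paper (1.4 GHz fast clock, 87.5 MHz slow clock, a 16:1 ratio), but the parameter values in the assertions are illustrative assumptions, not figures taken from the paper:

```python
# Numeric check of equation (1): (V + N + W + M) * tHCLKF <= (2 + L) * tHCLK.
# LHS: time the high-speed bus needs to service a burst of transactions.
# RHS: the PEs' real-time budget. Parameter values in the examples below
# are illustrative assumptions, not values reported in the paper.

T_HCLKF = 1 / 1.4e9   # fast clock period (1.4 GHz high-speed bus)
T_HCLK = 1 / 87.5e6   # slow clock period (16x slower, per Section 2)

def burst_fits(V, N, W, M, L):
    """True if a burst of transactions meets the PEs' deadline."""
    return (V + N + W + M) * T_HCLKF <= (2 + L) * T_HCLK

# 16 masters, 4 fast cycles wasted on the memory response, 1 cycle each
# to present and to capture data, 1 slow cycle of tolerable latency:
assert burst_fits(V=1, N=16, W=4, M=1, L=1)      # 22 fast cycles <= 48
assert not burst_fits(V=1, N=60, W=4, M=1, L=1)  # 66 fast cycles > 48
```

With L = 1, the budget is 3 slow cycles, i.e. 48 fast cycles, so the 16-master configuration fits with room to spare, consistent with the guarantee stated in Section 2.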

5. Master Bridges

To interface the new high-speed bus with the SoC cores, two bridges are needed: one on the Read bus and the other on the Write bus. The tasks accomplished by these bridges are numerous. The bridges provide a standard AHB slave interface for the low-speed SoC cores, while a specialized AHB bus master initiates the transaction on the high-speed bus. For the time being, synchronization between the two clock domains is performed by a phase-locked loop (PLL). Finally, protocol filtering is performed by the bridge, since a variety of unused signals and features of AHB were screened out. Hence, the only transfer modes kept for the high-speed bus are IDLE and SIMPLE transactions, without any possibility of inserting wait states or signaling errors. Dropping the burst transfer mode may look like a severe drawback, but it is not: the PEs interconnected with the high-speed bus can still perform transfers in burst mode. The internal operation of the high-speed bus is hidden from the processors sitting on the SoC. Therefore, the PEs can transfer a burst of data while the bus arbiter multiplexes simple transfers from all the PEs on the high-speed bus. This mechanism allows the interleaved memory to process multiple concurrent bursts.

Figure 5 shows the internal organization of the master bridge. Its architecture is divided into three parts, in a way similar to that of a typical microprocessor [9]: the controller, the datapath, and the input/output interfaces.

Figure 5. Block diagram of the master bridge.

The master bridge is controlled by 3 finite-state machines (FSMs). The first FSM operates as a slave at the speed of the slow standard AHB; it is this entity that receives a request from the PEs and launches one of the fast FSMs. The fast FSMs are the masters of the high-speed bus. They are alternately triggered by the slow FSM; this is required to reflect the pipelined architecture of AHB. Thus, at the moment when a fast FSM launches an address phase, the other completes its data phase. This process keeps repeating endlessly. This strategy works correctly provided that arbiters are planned to allocate the shared resources.

6. Conclusion

In this paper, we have presented a high-performance AHB-compliant communication infrastructure for high-end DSP platforms. It uses a multiplexed high-speed bus to interconnect modules operating at different frequencies. Means to create a high-throughput interleaved memory serving the high-speed bus were discussed. In addition, a simple, yet powerful, arbitration mechanism was designed to provide good quality-of-service under real-time constraints. The SoC platform is reconfigurable by altering the content of the arbiter's memory. Support for multiple outstanding data streams is possible. Finally, our goal of pushing the clock rate above one GHz was successfully achieved, since our communication infrastructure operates at 1.4 GHz. Several protocol alterations were adopted to facilitate speeding up the circuits.

7. Acknowledgements

This research is financially supported by Concordia University and a Canada Research Chair to Professor Savaria.

8. References

[1] B. Ackland et al., "A Single-Chip, 1.6-Billion, 16-b MAC/s Multiprocessor DSP," IEEE Journal of Solid-State Circuits (JSSC), March 2000, pp. 412-424.
[2] M.T.J. Strik et al., "Heterogeneous Multiprocessor for the Management of Real-Time Video and Graphics Streams," IEEE Journal of Solid-State Circuits (JSSC), November 2000, pp. 1722-1731.
[3] D. E. Culler, J. P. Singh, and A. Gupta, Parallel Computer Architecture: A Hardware/Software Approach, Chapter 5, Morgan Kaufmann, San Francisco, 1999.
[4] ARM, Technical Specification: AMBA Specification, Doc. No. ARM IHI 0011A, May 2001.
[5] M.J.S. Smith, Application-Specific Integrated Circuits, Addison-Wesley, 1997.
[6] J.M. Rabaey, Digital Integrated Circuits, Prentice Hall, 1996.
[7] J. P. Bissou, M. Dubois, and Y. Savaria, "High-Speed System Bus for a SoC Network Processing Platform," IEEE International Conference on Microelectronics (ICM), December 2003, pp. 194-197.
[8] Virage Logic, Embed-It Integrator / Custom-Touch Memory Compiler, Software User Guide, Release 3.4.4, August 2003.
[9] J.L. Hennessy and D.A. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann Publishers, 2003.

