
General-Purpose Systolic Arrays
Kurtis T. Johnson and A.R. Hurson, Pennsylvania State University

Behrooz Shirazi, University of Texas, Arlington

Systolic arrays effectively exploit massive parallelism in computationally intensive applications. With advances in VLSI, WSI, and FPGA technologies, they have progressed from fixed-function to general-purpose architectures.

When Sun Microsystems introduced its first workstation, the company
could not have imagined how quickly workstations would revolutionize computing. The idea of a community of engineers, scientists, or
researchers time-sharing on a single mainframe computer could hardly have
become ancient any more quickly. The almost instant wide acceptance of workstations and desktop computers indicates that they were quickly recognized as giving
the best and most flexible performance for the dollar.
Desktop computers proliferated for three reasons. First, very large scale integration (VLSI) and wafer scale integration (WSI), despite some problems, increased
the gate density of chips while dramatically lowering their production cost.
Moreover, increased gate density permits a more complicated processor, which in
turn promotes parallelism.
Second, desktop computers distribute processing power to the user in an easily
customized open architecture. Real-time applications that require intensive I/O
and computation need not consume all the resources of a supercomputer. Also,
desktop computers support high-definition screens with color and motion far
exceeding those available with any multiple-user, shared-resource mainframe.
Third, economical, high-bandwidth networks allow desktop computers to share
data, thus retaining the most appealing aspect of centralized computing, resource
sharing. Moreover, networks allow the computers to share data with dissimilar
computing machines. That is perhaps the most important reason for the acceptance of desktop computers, since all the performance in the world is worth little
if the machine is isolated.
Today's workstations have redefined the way the computing community distributes processing resources, and tomorrow's machines will continue this trend with
higher bandwidth networks and higher computational performance. One way to
obtain higher computational performance is to use special parallel coprocessors to
perform functions such as motion and color support of high-definition screens.
Future computationally intensive applications suited for desktop computing machines include real-time text, speech, and image processing. These applications
require massive parallelism.


COMPUTER, November 1993

Many computational tasks are by their very nature sequential; for other tasks the degree of parallelism varies. Therefore, a massively parallel computational architecture must maintain sufficient application flexibility and computational efficiency. It must be reconfigurable to exploit application-dependent parallelisms, high-level-language programmable for task control and flexibility, scalable for easy extension to many applications, and capable of supporting single-instruction stream, multiple-data stream (SIMD) organizations for vector operations and multiple-instruction stream, multiple-data stream (MIMD) organizations to exploit nonhomogeneous parallelism requirements.

Systolic arrays are ideally qualified for computationally intensive applications. Whether functioning as a dedicated fixed-function graphics processor or a more complicated and flexible coprocessor shared across a network, a systolic array effectively exploits massive parallelism. Falling into an area between vector computers and massively parallel computers, systolic arrays typically combine intensive local communication and computation with decentralized parallelism in a compact package. They capitalize on regular, modular, rhythmic, synchronous, concurrent processes that require intensive, repetitive computation. While systolic arrays originally were used for fixed or special-purpose architectures, the systolic concept has been extended to general-purpose SIMD and MIMD architectures.

Why systolic arrays?

Ever since Kung proposed the systolic model, its elegant solutions to demanding problems and its potential performance have attracted great attention. In physiology, the term systolic describes the contraction (systole) of the heart, which regularly sends blood to all cells of the body through the arteries, veins, and capillaries. Analogously, systolic computer processes perform operations in a rhythmic, incremental, cellular, and repetitive manner. The systolic computational rate is restricted by the array's I/O operations, much as the heart controls blood flow to the cells since it is the source and destination for all blood.

Although there is no widely accepted standard definition of systolic arrays and systolic cells, the following description serves as a working definition in this article (also see "Systolic array summary" below).

Systolic arrays have balanced, uniform, gridlike architectures (Figure 1) in which each line indicates a communication path and each intersection represents a cell or a systolic element. However, systolic arrays are more than processor arrays that execute systolic algorithms. Systolic arrays consist of elements that take one of the following forms: a special-purpose cell with hardwired functions, a vector-computer-like cell with an instruction decoding unit and a processing unit, or a processor complete with a control unit and a processing unit. In all cases, the systolic elements or cells are customized for intensive local communications and decentralized parallelism. Because an array consists of cells of only one or, at most, a few kinds, it has regular and simple characteristics. The array usually is extensible with minimal difficulty.

Figure 1. General systolic organization.

Three factors have contributed to the systolic array's evolution into a leading approach for handling computationally intensive applications: technology advances, concurrency processing, and demanding scientific applications.

Systolic array summary

Systolic array: A gridlike structure of special processing elements that processes data much like an n-dimensional pipeline. Unlike a pipeline, however, the input data as well as partial results flow through the array. In addition, data can flow in a systolic organization at multiple speeds in multiple directions. Systolic arrays usually have a very high rate of I/O and are well suited for intensive parallel operations.

Applications: Matrix arithmetic, signal processing, image processing, language recognition, relational database operations, data structure manipulation, and character string manipulation.

Special-purpose systolic array: An array of hardwired systolic processing elements tailored for a specific application. Typically, many tens or hundreds of cells fit on a single chip.

General-purpose systolic array: An array of systolic processing elements that can be adapted to a variety of applications via programming or reconfiguration.

Programmable systolic array: An array of programmable systolic elements that operates either in SIMD or MIMD fashion. Either the arrays interconnect or each processing unit is programmable and a program controls dataflow through the elements.

Reconfigurable systolic array: An array of systolic elements that can be programmed at the lowest level. FPGA (field-programmable gate array) technology allows the array to emulate hardwired systolic elements at a very low level for each unique application.
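The summary's point that input data and partial results both flow through the array, possibly at different speeds, can be made concrete with a small simulation. The sketch below is not taken from the article: it assumes a textbook-style linear array for a finite impulse response filter in which each cell keeps one weight, input samples advance through two registers per cell, and partial sums advance through one register per cell. The function name `systolic_fir` and the register names are hypothetical.

```python
def systolic_fir(weights, samples):
    """Simulate a linear systolic array computing y[n] = sum_k w[k] * x[n - k].

    Assumed design (illustrative only): cell j holds weights[j]; samples move
    right at half speed (two registers per cell), partial sums move right at
    full speed (one register per cell).
    """
    k = len(weights)
    x_r1 = [0] * k   # first sample register in each cell
    x_r2 = [0] * k   # second sample register in each cell
    y_reg = [0] * k  # partial-sum register in each cell
    outputs = []
    for t in range(len(samples) + k):
        x_in = samples[t] if t < len(samples) else 0
        # Values on each cell's input wires this tick (old register contents).
        x_wire = [x_in] + x_r2[:-1]
        y_wire = [0] + y_reg[:-1]
        # The last cell's register holds a finished result after k ticks.
        if t >= k:
            outputs.append(y_reg[-1])
        # All cells update their registers simultaneously (one clock tick).
        y_reg = [y_wire[j] + weights[j] * x_wire[j] for j in range(k)]
        x_r2 = x_r1
        x_r1 = x_wire
    return outputs


if __name__ == "__main__":
    print(systolic_fir([1, 2], [1, 2, 3]))  # prints [1, 4, 7]
```

Each result leaves the last cell one tick after the previous one, so once the array is full it delivers one output per clock regardless of the number of cells.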

Technology advances. Advances in VLSI/WSI technology complement the systolic array's qualifications in the following ways: Smaller and faster gates allow a higher rate of on-chip communication because data has a shorter distance to travel. Higher gate densities permit more complicated cells with higher individual and group performance. Granularity increases as word length increases, and concurrency increases with more complicated cells. Economical design and fabrication processes produce less expensive systolic chips, even in small quantities. Better design tools allow arrays to be designed more efficiently. With advances in simulation techniques, fully tested, unique cells can now be quickly copied and arranged in regular, modular arrays. A systolic cell can be fully simulated before fabrication, reducing the chances that it will fail to work as designed.

As VLSI/WSI designs become more complicated, "systolicizing" them provides an efficient way to ensure fault tolerance: any fault tolerance precautions built into one cell are extensible to all cells. Relatively new field-programmable gate array (FPGA) technology permits a reconfigurable architecture, as opposed to a reprogrammable architecture.

Concurrency processing. Past efforts to add concurrency to the conventional von Neumann computer architecture have yielded coprocessors, multiple processing units, data pipelining, and multiple homogeneous processors. Systolic arrays combine features from all of these architectures in a massively parallel architecture that can be integrated into existing platforms without a complete redesign. A systolic array can act as a coprocessor, can contain multiple processing units and/or processors, and can act as an n-dimensional pipeline. Although data pipelining reduces I/O requirements by allowing adjacent cells to reuse the input data, the systolic array's real novelty is its incremental instruction processing or computational pipelining. Each cell computes an incremental result, and the computer derives the complete result by interpreting the incremental results from the entire array in a prespecified algorithmic format.

Demanding scientific applications. The technology growth of the last three decades has produced computing environments that make it feasible to attack demanding scientific applications on a larger scale. Large-matrix multiplication, feature extraction, cluster analysis, and radar signal processing are only a few examples. As recent history shows, when many computer users work on a wide variety of applications, they develop new applications requiring increased computational performance. Examples of these innovative applications include interactive language recognition, relational database operations, text recognition, and virtual reality. These applications require massive repetitive and rhythmic parallel processing, as well as intensive I/O operation. Hence, systolic computing.

Implementation issues

A number of implementation issues determine a systolic array's performance efficiency. Designers should understand the following performance trade-offs at the design stage.

Algorithms and mapping. Designers must be intimately familiar with the algorithms they are implementing on systolic arrays. Designing a systolic array heuristically from an algorithm is slow and error-prone, requiring simulation for verification and often producing a less-than-optimum algorithm. Thus, automatic array synthesis is an important research area.⁶ At present, however, most array designs are based on heuristics.

Integration into existing systems. Generally, a systolic array is integrated into an existing host as a back-end processor. The array's high I/O requirements often make system integration a significant problem. Because the existing I/O channel rarely satisfies the array's bandwidth requirement, a memory subsystem often must be added between the host and the systolic array to support data access and data multiplexing and demultiplexing. The memory subsystem can range from the complicated support and cluster processors in the Warp array to the simpler staging memory in the Splash array. (These systems will be discussed in greater detail later.)

Cell granularity. The level of cell granularity directly affects the array's throughput and flexibility and determines the set of algorithms that it can efficiently execute. Each cell's basic operation can range from a logical or bitwise operation, to a word-level multiplication or addition, to a complete program. Granularity is subject to technology capabilities and limitations as well as design goals. For example, integration-substrate families have different performance and density characteristics. Packaging also introduces I/O pin restrictions.

Extensibility. Because systolic arrays are built of cellular building blocks, the cell design should be sufficiently flexible for use in a wide variety of topologies implemented in a wide variety of substrate technologies.

Clock synchronization. Clock lines of different lengths within integrated chips, as well as external to the chips, can introduce skews. The risk of clock skew is greater when dataflow in the systolic array is bidirectional. Wavefront arrays reduce the clock skew problem by introducing more complicated, asynchronous, intercellular communications.

Reliability. As integrated circuits grow larger, designers must build in greater fault tolerance to maintain reliability, and diagnostics to verify proper operation.

Systolic array taxonomy

The term systolic array originally referred to special-purpose or fixed-function architectures designed as hardware implementations of a given algorithm. In mass quantities, the production of these arrays was manageable and economical; thus, they were well suited for common applications. But these designs were bound to the specific application at hand and were not flexible or versatile. Every time a systolic array was to be used on a new application, the manufacturer had to undertake the long, costly, and potentially risky process of designing, testing, and fabricating an application-specific integrated chip.

Although the cost and risks of developing ASICs have decreased in recent years, budget constraints have motivated a trend away from unique hardware development. Consequently, general-purpose systolic architectures have become a logical alternative. In addition to serving in a wide variety of applications, they also provide test beds for developing, verifying, and debugging new systolic algorithms. Table 1 shows a taxonomy of general-purpose and special-purpose systolic arrays.

Table 1. Systolic array taxonomy.
Class | Type | Organization | Topology | Interconnections | Dimensions
General-purpose | Programmable, reconfigurable, or hybrid | SIMD or MIMD; VFIMD | Programmable or fixed | Static, dynamic, or fixed | n-dimensional (n > 2 is rare due to complexity)
Special-purpose | Hardwired | VFIMD | Fixed | Fixed | n-dimensional
SIMD: single-instruction stream, multiple-data stream; MIMD: multiple-instruction stream, multiple-data stream; VFIMD: very-few-instruction stream, multiple-data stream.

Special-purpose architectures. Special-purpose systolic architectures are custom designed for each application. Few problems resist attack from systolic arrays, but some problems may require elegant algorithms. Generally speaking, the systolic design requires a performance algorithm that can be efficiently implemented with today's VLSI technology.

One area that easily utilizes systolic algorithms is matrix operations. Figure 2 illustrates the algorithm for the sum of a scalar product, computed in a single systolic element. After the cell is initialized, the a's and the b's are synchronously shifted through the processing element. The accumulator stores the sum of the a·b products. All the a and b data synchronously exits the processing element unmodified to be available for the next element. At the end of processing, the sum of the products is shifted out of the accumulator.

Figure 2. A systolic processing element that computes the sum of a scalar product.

This principle easily extends to a matrix product, as shown in Figure 3. The only difference between single-element processing and array processing is that the latter delays each additional column and row by one cycle so that the columns and rows line up for a matrix multiply. The product matrix is shifted out after completion of processing.

Figure 3. The systolic product of two 3 x 3 matrices.

An obvious problem with this approach is that matrix products involving a matrix larger than the systolic array must be divided into a set of smaller matrix products. Proper algorithm development compensates for the problem, but performance decreases nevertheless. This resource and implementation problem affects all systolic arrays.
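The scalar-product element and its extension to a matrix product can be sketched as a cycle-by-cycle simulation. This is a minimal illustration under stated assumptions, not the article's hardware: each element accumulates one entry of the product, a values move right and b values move down, and row i of the first matrix and column j of the second are delayed by i and j cycles so operands meet in the right element. The names `systolic_matmul` and `tiled_matmul`, and the `size` parameter standing for the array's edge length, are hypothetical.

```python
def systolic_matmul(a, b):
    """Simulate an n x p grid of accumulate-and-forward elements computing a @ b."""
    n, m, p = len(a), len(b), len(b[0])      # a is n x m, b is m x p
    acc = [[0] * p for _ in range(n)]        # one accumulator per element
    a_reg = [[0] * p for _ in range(n)]      # a operand seen last cycle
    b_reg = [[0] * p for _ in range(n)]      # b operand seen last cycle
    for t in range(m + n + p - 2):           # cycles until the last operands meet
        new_a = [[0] * p for _ in range(n)]
        new_b = [[0] * p for _ in range(n)]
        for i in range(n):
            for j in range(p):
                if j == 0:                   # left edge: row i enters i cycles late
                    k = t - i
                    a_in = a[i][k] if 0 <= k < m else 0
                else:                        # otherwise take the left neighbor's operand
                    a_in = a_reg[i][j - 1]
                if i == 0:                   # top edge: column j enters j cycles late
                    k = t - j
                    b_in = b[k][j] if 0 <= k < m else 0
                else:                        # otherwise take the upper neighbor's operand
                    b_in = b_reg[i - 1][j]
                acc[i][j] += a_in * b_in     # sum of a*b products
                new_a[i][j], new_b[i][j] = a_in, b_in  # pass operands on unmodified
        a_reg, b_reg = new_a, new_b
    return acc


def tiled_matmul(a, b, size):
    """Split a product too large for a size x size array into smaller block products."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i0 in range(0, n, size):
        for j0 in range(0, p, size):
            for k0 in range(0, m, size):
                a_blk = [row[k0:k0 + size] for row in a[i0:i0 + size]]
                b_blk = [row[j0:j0 + size] for row in b[k0:k0 + size]]
                part = systolic_matmul(a_blk, b_blk)
                for i, row in enumerate(part):
                    for j, value in enumerate(row):
                        c[i0 + i][j0 + j] += value
    return c


if __name__ == "__main__":
    print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

The tiled version shows where the performance loss comes from: every block product pays the fill and drain cycles of the array again, and the partial block results must be added outside the array.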

The matrix product example also demonstrates another problem with special-purpose systolic arrays and hardware in general. The more specialized the hardware, the higher the performance; but cost per application also rises and flexibility decreases. Therein lies the attractiveness of general-purpose systolic architectures.

General-purpose architectures. The two basic types of general-purpose systolic arrays are the programmable model and the reconfigurable model. Recently, hybrid models have also been proposed.

In the programmable model, cell architectures and array architectures remain the same from application to application. However, a program controls data operations in the cells and data routing through the array. All communication paths and functional units are fixed, and the program determines when and which paths are utilized. One of the first programmable architectures was Carnegie Mellon's programmable systolic chip.

In the reconfigurable model, cell architectures as well as array architectures change from one application to another. The architecture for each application appears as a special-purpose array. The primary means of implementing the reconfigurable model is FPGA technology, which allows the user to configure a low-level logic circuit for each cell. Splash was one of the early FPGA-based reconfigurable systolic arrays.

Hybrid models make use of both VLSI and FPGA technology. They usually consist of VLSI circuits embedded in an FPGA-reconfigurable interconnection network.

Systolic topologies. Array topologies can be either programmable or reconfigurable. Likewise, array cells are either programmable or reconfigurable. A programmable systolic architecture is a collection of interconnected, general-purpose systolic cells, each of which is either programmable or reconfigurable. Programmable systolic cells are flexible processing elements specially designed to meet the computational and I/O requirements of systolic arrays.

Programmable systolic architectures can be classified according to their cell interconnection topologies: fixed or programmable. Fixed cell interconnections limit a given topology to some subset of all possible algorithms. That topology can emulate other topologies by means of the proper mapping transformation, but reduced performance is often a consequence. Programmable cell interconnection topologies typically consist of programmable cells embedded in a switch lattice that allows the array to assume many different topologies.

Programmable topologies are either static or dynamic. Static topologies can be altered between applications, and dynamic topologies can be altered within an application. Static programmable topologies can be implemented with much less complexity than dynamic programmable topologies. There has been little research in dynamic programmable topologies because a highly complex interconnection network could undermine the regular and simple principles of systolic architectures.

Reconfigurable arrays also have either fixed or reconfigurable cell interconnections. The user reconfigures an array's topology by means of a switch lattice. Any general-purpose array that is not conventionally programmable is usually considered reconfigurable. All FPGA reconfiguring is static due to technology limitations.

Array dimensions. We can further classify general-purpose and special-purpose systolic architectures by their array dimensions. The two most common structures are the linear array and the two-dimensional array. Linear systolic arrays are by default statically reconfigurable in one-dimensional space. Two-dimensional arrays allow more efficient execution of complicated algorithms. Due to I/O limitations, general-purpose systolic arrays of dimensions greater than two are not common.

Programmable array organization. Programmable systolic arrays are programmable either at a high level or a low level. At either level, programmable arrays can be categorized as either SIMD or MIMD machines. High-level programmable arrays usually are programmed in high-level languages and are word oriented. Low-level arrays are programmed in assembly language and are bit oriented.

SIMD. SIMD systolic machines (Figure 4) operate similarly to a vector processor. They are typically back-end processors with an additional buffer memory to handle the high systolic I/O rates. The host workstation preloads a controller and a memory, which are external to the array, with the instructions and data for the application. The systolic cells store no programs or instructions. Each cell's instruction-processing functional unit serves as a large decoder. Once the workstation enables execution, the controller sequences through the external memory, thus delivering instructions and data to the systolic array. Within the array, instructions are broadcast and all cells perform the same operation on different data. Adjacent cells may share memory, but generally no memory is shared by the entire array. After exiting the array, data is collected in the external buffer memory. SIMD-programmable systolic cells take less space on the VLSI wafer than their MIMD counterparts because they are simple instruction-processing elements requiring no program memory. From tens to hundreds of such cells fit on one integrated chip.

Figure 4. General organization of SIMD programmable linear systolic arrays.

MIMD. MIMD systolic machines (Figure 5) operate similarly to homogeneous von Neumann multiprocessor machines. The workstation downloads a program to each MIMD systolic cell. Each cell may be loaded with a different program, or all the cells in the array may be loaded with the same program. Each cell's architecture is somewhat similar to the conventional von Neumann architecture: it contains a control unit, an ALU, and local memory. MIMD systolic cells have more local memory than their SIMD counterparts to support the von Neumann-style organization. Some may have a small amount of global memory, but generally no memory is shared by all the cells. Whenever data is to be shared by processors, it must be passed to the next cell. Thus, data availability becomes a very important issue. High-level MIMD systolic cells are very complicated, and usually only one fits on a single integrated chip. For example, the iWarp is a Warp cell without memory on a single chip. Local memory for each iWarp cell must be supplied by additional chips.

Figure 5. General organization of MIMD programmable linear systolic arrays.

Reconfigurable array organization. Recent gate-density advances in FPGA technology have produced a low-level, reconfigurable systolic array architecture that bridges the gap between special-purpose arrays and the more versatile, programmable, general-purpose arrays. The FPGA architecture is unusual because a single hardware platform can be logically reconfigured as an exact duplicate of a special-purpose systolic array. Figure 6 shows the general organization of a reconfigurable array. The designer logically draws cell architectures on the workstation with a schematic editor (such as Mentor), converts the design to FPGA code with another utility, and then downloads the code in a few seconds to configure the FPGA architecture.

Figure 6. General organization of reconfigurable arrays.

Reconfigurable systolic arrays do not fall into the SIMD or MIMD categories for the same reason that special-purpose systolic arrays do not. Instructions are implicit in the configuration of each cell; therefore, there is no need to download them from the workstation. Reconfigurable arrays, like special-purpose arrays, are generally limited to VFIMD (very-few-instruction streams, multiple-data streams) organization due to FPGA gate density. Since special-purpose systolic architectures consist of a very few unique cells repeated throughout the array, the entire array also tends to be VFIMD.

Purely reconfigurable architectures are fine-grain, low-level devices best suited for logical or bit manipulations. They typically lack the gate density to support high-level functions such as multiplication.

Tables 2, 3, and 4 list most of the recent programmable and reconfigurable, general-purpose systolic arrays reported in the literature. When information is unclear in the literature, the corresponding space in the table is left blank. The tables indicate three stages of product development:

Research: A functional system that has not yet been implemented.
Prototype: At least a partial system has been implemented.
Commercial: The system has been implemented, and an integrated chip with multiple cells, a complete system, or both are available for purchase.

Table 2. SIMD-organized programmable systolic architectures.
System name, developer | Development stage | Topology | Key features
Brown Systolic Array¹, Brown University | Prototype | Linear, 470 cells | 8-bit ALU only; ISA, SSR architectures; very small VLSI footprint; 100s of cells per chip; 157 MOPS
Micmacs², IRISA, Campus de Beaulieu, France | Prototype | Linear, 18 cells | 16-bit fixed-point math; 90 MOPS
Geometric Arithmetic Parallel Processor³, NCR | Commercial | 48 x 48 array, 2,304 cells | Bit-slice cellular architecture; 900 MOPS
Saxpy-1M⁴, Saxpy Computer Corp. | Commercial | Linear, 32 cells | 32-bit floating-point capability; broadcast and global data with block processing; 1,000 MFLOPS
Systolic/Cellular Architecture⁵, Hughes Research Laboratories | Prototype | 16 x 16 array, 256 cells | 32-bit fixed-function units
Cylindrical Banyan Multicomputer⁶, University of Texas at Austin | Research | | Packet-switched programmable topology with programmable cells
Programmable Systolic Device⁷, Australian National University | Research | | Instruction decoding occurs once per chip; each chip has many cells

1. R. Hughey and D. Lopresti, "Architecture of a Programmable Systolic Array," Proc. Int'l Conf. Systolic Arrays, IEEE CS Press, Los Alamitos, Calif., Order No. 860, 1988, pp. 41-49.
2. P. Frison et al., "Micmacs: A VLSI Programmable Systolic Architecture," Systolic Array Processors, Prentice-Hall, Englewood Cliffs, N.J., 1989, pp. 145-155.
3. P. Greussay, "Programmation des Mega-Processeurs: du GAPP a la Connection Machine," Course Notes DEA Artificial Intelligence, Paris Univ. VIII, Vincennes, France, 1985.
4. D.E. Foulser and R. Schreiber, "The Saxpy Matrix-1: A General-Purpose Systolic Computer," Computer, Vol. 20, No. 7, July 1987, pp. 35-43.
5. K.W. Przytula and J.G. Nash, "Implementation of Synthetic Aperture Radar Algorithms on a Systolic/Cellular Architecture," Proc. Int'l Conf. Systolic Arrays, IEEE CS Press, Los Alamitos, Calif., Order No. 860, 1988, pp. 21-27.
6. M. Malek and E. Opper, "The Cylindrical Banyan Multicomputer: A Reconfigurable Systolic Architecture," Parallel Computing, Vol. 10, No. 3, May 1989, pp. 319-326.
7. P. Lenders and H. Schroder, "A Programmable Systolic Device for Image Processing Based on Morphology," Parallel Computing, Vol. 13, No. 3, Mar. 1990, pp. 337-344.

Architectural issues of programmable systolic arrays

To be effective, a programmable systolic architecture must adhere to the general principles of systolic arrays: regularity, simplicity, concurrency, and rhythmic communications. Replicating and interconnecting a basic programmable systolic cell to form an array carries out the regularity and simplicity principles. However, programmable cell-based architectures must introduce special features to address the systolic properties of concurrency and rhythmic communications, both implicit to hardwired designs. In addition, the introduction of a programmable systolic cell with significant local memory has resulted in a new mode of operations: block processing. Block processing combines periods of intensive systolic I/O and periods of sequential von Neumann-style processing to form what is known as a pseudosystolic model.

Concurrency. Two levels of concurrency are possible in a programmable systolic architecture: concurrency across cells and concurrency within cells.

Intercell concurrency. Coordination of concurrency control has always been inherent in a hardwired systolic design because it has been addressed at the algorithm development stage. The advent of the programmable systolic array required that innovative features be incorporated into designs to satisfy a programmable coordination requirement. To make concurrency feasible in programmable systolic architectures, the designer must add special mechanisms that facilitate processing control, provided an appropriate algorithm and program are utilized. Methods of program loading, program switching, memory initialization, broadcast data, global data, and local memory access at the cell level had to be incorporated into a systolic environment.

Table 3. MIMD-organized programmable systolic architectures.
System name, developer | Development stage | Topology | Key features
Warp¹, Carnegie Mellon University | Commercial | Linear, 10 cells | 32-bit floating-point multiplication; block processing
iWarp³, Carnegie Mellon University | | |
Computer for Experimental Synthetic Aperture Radar⁴, Norwegian Defense Res. Estab. | | |
Cellular Array Processor⁵, Fujitsu Laboratories, Japan | Commercial | |
Configurable Highly Parallel Computer⁶, Purdue University | Research | |
Associative String Processor⁷, Brunel University, UK | Research | |
PICAP3⁸, University of Paris | Prototype | |
PSDP⁹, Univ. of South Florida | | |
Other entries in the table: Honeywell Research; programmable cells embedded in switch lattice for programmable topology; 8 x 8 array, 64 cells; four 8 x 16 arrays, 512 cells; 16-bit word length; 4-bit word length; 32-bit floating-point multipliers in each cell; image oriented; bit-serial cellular I/O; I/O queuing; wafer-scale design; 100 MFLOPS; 320 MFLOPS.

1. M. Annaratone et al., "Architecture of Warp," Proc. Compcon, IEEE CS Press, Los Alamitos, Calif., 1987, pp. 264-267.
2. A.L. Fisher, H.T. Kung, and K. Sarocky, "Experience with the CMU Programmable Systolic Chip," Proc. Soc. of Photo-Optical Instrumentation Engineers, Real-Time Signal Processing VII, SPIE, San Diego, Vol. 495, 1984.
3. S. Borkar et al., "iWarp: An Integrated Solution to High-Speed Parallel Computing," Proc. Supercomputing, IEEE CS Press, Los Alamitos, Calif., 1988, pp. 330-339.
4. E. Toverud and V. Anderson, "CESAR: A Programmable High-Performance Systolic Array Processor," Proc. Int'l Conf. Computer Design, IEEE CS Press, Los Alamitos, Calif., 1988, pp. 414-417.
5. M. Ishii et al., "Cellular Array Processor CAP and Application," Proc. Int'l Conf. Systolic Arrays, IEEE CS Press, Los Alamitos, Calif., Order No. 860, 1988, pp. 535-544.
6. L. Snyder, "Introduction to the Configurable Highly Parallel Computer," Computer, Vol. 15, No. 1, Jan. 1982, pp. 47-56.
7. R.M. Lea, "The ASP, a Fault-Tolerant VLSI/ULSI/WSI Associative String Processor for Cost-Effective Processing," Proc. Int'l Conf. Systolic Arrays, IEEE CS Press, Los Alamitos, Calif., Order No. 860, 1988, pp. 515-524.
8. B. Lindskog and P.E. Danielsson, "PICAP3: A Parallel Processor Tuned for 3D Image Operations," Proc. Eighth Int'l Conf. Pattern Recognition, IEEE CS Press, Los Alamitos, Calif., 1986, pp. 1248-1250.
9. D. Landis et al., "A Wafer-Scale Programmable Systolic Data Processor," Proc. Ninth Biennial Univ./Gov't/Ind. Microelectronics Symp., IEEE, Piscataway, N.J., 1991, pp. 252-256.

The following mechanisms facilitate programmable concurrency throughout the array:

Broadcast data permits all cells of an array to be reset simultaneously with initial conditions and constants at the start of a processing block during nonsystolic modes of operation. The larger the array, the more efficiency it will gain from a broadcast data capability.

Broadcast instructions permit operations similar to those of a vector computer by making each systolic cell analogous to the vector computer's processing unit. However, unlike most vector computers, systolic cells can support a high-bandwidth communication channel with adjacent cells.

Broadcast instruction addresses allow fast and efficient switching among programs in an MIMD cell's program memory. When all MIMD cells have programs stored in the same places in their program memories, a jump to the broadcast instruction address carries out simultaneous program switching throughout the array. If the programs in all the MIMD cells happen to be identical, the array is performing SIMD operations.
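The broadcast mechanisms can be illustrated with a toy model. The sketch below is an assumption-laden illustration rather than any listed machine's design: `Cell`, `broadcast_data`, `broadcast_address`, and the addresses `ADD` and `MAX` are hypothetical names, and each "program" is reduced to a single Python function stored at a fixed address in the cell's program memory.

```python
class Cell:
    """Toy MIMD systolic cell: private program memory, program counter, accumulator."""

    def __init__(self, program_memory):
        self.program_memory = program_memory  # address -> function(acc, value)
        self.pc = 0
        self.acc = 0

    def step(self, value):
        # Run the program selected by the current program counter.
        self.acc = self.program_memory[self.pc](self.acc, value)
        return self.acc


def broadcast_data(cells, initial):
    """Reset every cell with the same initial value in one operation."""
    for cell in cells:
        cell.acc = initial


def broadcast_address(cells, address):
    """Switch every cell to the program stored at the same address."""
    for cell in cells:
        cell.pc = address


def run(cells, data):
    """Give each cell its own data item and collect the results."""
    return [cell.step(x) for cell, x in zip(cells, data)]


ADD, MAX = 0, 1  # hypothetical program addresses shared by all cells


def make_cells():
    # At ADD each cell holds a different program (MIMD behavior);
    # at MAX every cell holds the same program (SIMD behavior).
    return [
        Cell({ADD: (lambda acc, v, s=scale: acc + s * v),
              MAX: (lambda acc, v: max(acc, v))})
        for scale in (1, 10, 100)
    ]


if __name__ == "__main__":
    cells = make_cells()
    broadcast_data(cells, 0)
    broadcast_address(cells, ADD)
    print(run(cells, [1, 2, 3]))   # [1, 20, 300]
    broadcast_address(cells, MAX)
    print(run(cells, [5, 5, 5]))   # [5, 20, 300]
```

A single jump switches the whole array because every cell stores its programs at the same addresses; when the program at an address is identical in all cells, as at `MAX` here, the array behaves like a SIMD machine.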

Table 4.
(Columns: system name, developer; development stage; topology; key features.)
Splash Systolic Engine¹, Supercomputing Research Center; prototype; linear, 32 cells; reconfigurable cell architecture based on FPGA.
Cellular Array Logic², Edinburgh University, UK; prototype; 16 x 16 array, 256 cells; reconfigurable cell architecture based on custom chip.
Hybrid Architecture³, University of Illinois; research; reconfigurable cell architecture integrating FPGA capability and 32-bit floating-point multiplication.
Programmable Adaptive Computing Engine⁴, University of Wales, Bangor, UK; prototype; programmable topology; functional units embedded in each cell, reconfigurable connections in cell as well as programmable topology.
Configurable Functional Array⁵, Tsinghua University, Beijing; research; configurable array topology and cell architecture.

1. M. Gokhale et al., "Splash: A Reconfigurable Linear Logic Array," Proc. Int'l Conf. Parallel Processing, IEEE CS Press, Los Alamitos, Calif., Order No. 2101, 1990, pp. 1526-1531.
2. T. Kean and J. Gray, "Configurable Hardware: Two Case Studies of Micrograin Computation," in Systolic Array Processors, Prentice-Hall, Englewood Cliffs, N.J., 1989, pp. 310-319.
3. S. Smith and G. Sobelman, "Simulation-Based Design of Programmable Systolic Arrays," Computer-Aided Design, No. 10, Dec. 1991, pp. 669-675.
4. S. Jones, A. Spray, and N. Ling, "A Flexible Building Block for the Construction of Processor Arrays," in Systolic Array Processors, Prentice-Hall, Englewood Cliffs, N.J., 1989, pp. 459-466.
5. Wenyang, Yanda, and Yue, "Systolic Realization for 2D Convolution Using Configurable Functional Method in VLSI Parallel Array Designs," Proc. IEE, Computers and Digital Technology, Vol. 138, No. 5, Sept. 1991, pp. 361-370.

Instruction systolic arrays¹¹ provide a precise low-level method of coordinating processing from cell to cell. In an ISA, instructions travel through the array with the data. The processed data passes with the original instruction from each cell to the next. An appropriately wide ISA instruction also has built-in microcodable parallelism. ISA algorithms can be lengthy and more difficult to develop than algorithms for other SIMD and MIMD arrays. But they provide a way to uniquely specify concurrency across cells without an overly complicated circuit.

Programmability creates a more flexible systolic architecture, but the penalties are complexity and possible slowing of operations.

Intracell concurrency. To minimize the performance degradation of programmable systolic designs, programmable cells require the ability to perform multiple similar or dissimilar operations simultaneously, high-speed arithmetic capabilities, and the ability to transfer or store the processed data. Mechanisms incorporated into programmable systolic cells to support concurrency are all proven techniques that originated in conventional von Neumann processors: microcodable processing elements, sufficiently wide instruction words, multiple data paths, duplication of functional units, and pipelining of functional units.

Rhythmic communications. The principle of rhythmic communications clearly separates systolic arrays from other architectures. When systolic cells are programmable, the issue of data availability arises: each cell's processing element must have data available when it is required. Thus, each cell must have sufficient memory for data storage as well as program storage to facilitate intercell communications, at the same time promoting rhythmic cellular communication. The following architectural features support rhythmic communications:

• Queued I/O streamlines cellular communications by allowing the source cell to send data to the destination cell when the data is available, not when the destination cell is ready to accept it. Data exits the queue on a FIFO (first in, first out) basis that allows program computation to proceed irrespective of communications status. Queued I/O helps considerably when all cells are computing a single program that is skewed in time across the cells. A disadvantage is that this mechanism uses memory space that is expensive on VLSI wafers.

• Direct local memory access or global memory. Status flags can be stored either in each cell's local memory or in a global memory. In the case of local memory, the controller must be able to directly access each cell's local memory to determine flag status; otherwise, the controller must issue a request and wait for it to propagate through the array. When global memory is available, the controller obtains status information much more easily. The Saxpy-1M is one of the first systolic arrays with global memory. Direct local memory access or global memory also facilitates initialization within a cell.

• Serial I/O reduces the bandwidth requirement on the systolic array's workstation platform. The resulting programmable systolic array has bit-serial cellular communications and word-wide operations in each cell. Each cell stores partial words until it receives the full word and then performs functional operations. Using more cells can partially offset the disadvantage of reduced throughput.

• Multiple data paths provide the cell with fast, parallel internal and external communications. One or more data paths are dedicated to computational needs, another to cellular input operations, another to cellular output operations, and yet another to the program memory.

• Bidirectional and wraparound dataflow can reduce the I/O bandwidth requirements of the external buffer memory by recirculating data within the array. More flexible dataflow also allows arrays to execute a wide variety of algorithms more efficiently.

• Broadcast data eases systolic I/O requirements in certain instances by allowing all functional units and local memory of a cell to be initialized or set to a common variable at once.

• Global data eases systolic I/O requirements and the cell's storage requirements by keeping only one copy of the same data and allowing all cells to share it. Because only one copy exists, global memory coherency is preserved: any modifications to global data are instantly available to other cells.

Communication and concurrency

Block processing. If each programmable systolic cell has significant local memory, block processing is possible in the systolic environment. Applications are divided into parallel tasks that utilize local data, and each cell executes a series of instructions on local data to generate an intermediate result. During block processing, very little systolic I/O occurs. Block processing pauses while systolic I/O takes place and new data is written to each cell's local memory. Therefore, block processing reduces the array's systolic I/O requirements by reducing the amount of cellular I/O. Block processing can be performed on either SIMD or MIMD programmable systolic machines.

Block processing on programmable systolic architectures can result in a more efficient, pseudosystolic operation in some applications for two reasons. First, the merger of von Neumann programmable cell architectures into a systolic array environment causes many of the same problems found in homogeneous multiprocessor systems. The serial nature of von Neumann machines interferes with the rhythmic systolic I/O of the array, causing an unacceptable amount of time wasted in waiting for data to become available. Second, for any systolic array, the bandwidth and size of the external memory are always the limiting factors on throughput and performance. Block processing minimizes this wasted time in applications that can be divided into equal segments. Requirements for efficient block processing include an appropriate algorithm and significant local memory in each cell. Important work has been done on characterizing block processing in a general-purpose systolic array.¹⁰

Systolic shared registers. Systolic shared registers (SSRs) provide several benefits specifically for programmable systolic architectures. In programmable arrays, dataflow must be explicitly specified in the program, whereas in hardwired designs dataflow is implicit. SSR architectures (Figure 7) provide a program-implicit means of streamlined dataflow in the array. Each two adjacent processing elements share a small register memory. Data flows concurrently with computation as each processing element receives new data from the register memory upstream, operates on it, and directs it to the next register memory downstream. The input register and the output register are specified by the instruction word. Once the cells are programmed, dataflow through the array becomes a result of processing. This architecture eliminates the requirement that data movement be explicitly specified.

An SSR architecture is advantageous over queued I/O because communications do not occur on a FIFO basis. Therefore, bidirectional dataflow is a natural result. Other types of array communication such as broadcast data and global data are also desirable. A disadvantage is that the register bank must be kept small to minimize access time and thus maximize systolic bandwidth. The small register memory makes programming difficult for some of the more complicated algorithms.

Figure 7. An SIMD systolic shared-register architecture.

Architectural issues of reconfigurable systolic arrays

The primary architectural issues in designing a reconfigurable systolic array are the hardware platform of FPGAs and the class of algorithms to be targeted to that platform. The number of FPGAs, their topology, and their gate density will determine the set of systolic architectures that can be synthesized and the set of systolic algorithms that can be implemented on that hardware platform.

FPGA architecture programming differs from conventional programming in that one programs the circuit's logical function, instead of programming a model of operations in a high-level language. Reconfiguring an FPGA is the same as logically drawing a new circuit. Reconfigurable systolic architectures are very interesting because the cell's architecture is actually programmed. Consequently, the architecture programmer has complete control over what architectural features are incorporated in each FPGA. An FPGA platform can assume any architecture that FPGA gate density and package pinout permit. Determining the cell architecture can be an iterative process that continuously refines the architecture until it satisfies application requirements.
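The systolic shared-register dataflow described earlier (each cell reads the register memory upstream, operates on the data, and writes the register memory downstream) can be sketched in software. The sketch below is a simplified model, not the architecture of Figure 7: the function names `ssr_step` and `run_ssr` are hypothetical, each shared register memory is reduced to a single slot, and the read and write registers are fixed rather than selected by an instruction word.

```python
from typing import Callable, List, Optional

Op = Callable[[int], int]


def ssr_step(registers: List[Optional[int]],
             cell_ops: List[Op],
             new_input: Optional[int]) -> List[Optional[int]]:
    """One synchronous beat: cell i reads register i, writes register i + 1."""
    nxt: List[Optional[int]] = [new_input] + [None] * len(cell_ops)
    for i, op in enumerate(cell_ops):
        upstream = registers[i]  # value left by the upstream neighbor
        nxt[i + 1] = None if upstream is None else op(upstream)
    return nxt


def run_ssr(cell_ops: List[Op], stream: List[int]) -> List[int]:
    """Feed a stream through a linear array of cells joined by shared registers."""
    registers: List[Optional[int]] = [None] * (len(cell_ops) + 1)
    outputs: List[int] = []
    # Extra empty beats flush the values still in flight at the end.
    for item in list(stream) + [None] * (len(cell_ops) + 1):
        registers = ssr_step(registers, cell_ops, item)
        if registers[-1] is not None:
            outputs.append(registers[-1])
    return outputs


# Two cells: the first adds 1, the second doubles.
print(run_ssr([lambda x: x + 1, lambda x: 2 * x], [1, 2, 3]))  # [4, 6, 8]
```

No cell issues an explicit send or receive; data movement falls out of each cell's register reads and writes, which is the program-implicit dataflow the text attributes to SSR designs.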

Packaging and I/O pins also influence the design. Current technology limits I/O to approximately 200 pins. In addition, current technology limits the circuit size to an order of magnitude approaching 25,000 gates. An FPGA chip can be configured for one medium-size systolic element or for several simple systolic cells. In either case, the configuration must be evaluated on a case-by-case basis at the time of configuration.

Reconfigurable architectures are also highly qualified for fault-tolerant computing. Cell architectures can be configured around VLSI defects. Reconfiguration can simply eliminate a bad cell from the array. Almost without exception, reconfigurable topologies and cells are highly fault-tolerant. Their fault tolerance is a configuration issue, not a design and fabrication issue.

Reconfigurable designs have proven to be unmatched for low- to moderate-granularity requirements but are not yet mature enough for high-granularity applications. Current gate-density technology does not make MIMD arrays feasible due to cell memory and control requirements, but FPGA platforms are well suited for logical or bit-level applications.

Architectural issues of hybrid arrays

High-level programmable arrays require extensive efforts to map algorithms to high-level languages after algorithm development. When mapping is completed, the resulting system often is not as efficient as a system implemented in a fixed-function ASIC. On the other hand, low FPGA gate density makes it unlikely that large-grain tasks will ever be completely implemented in FPGAs or that reconfigurable arrays will replace conventional SIMD and MIMD architectures in the near future. Consequently, a natural step is to merge the two approaches, keeping their desirable aspects and discarding their undesirable aspects. Hybrids make use of the most attractive features of programmable and reconfigurable methods while adding greater flexibility than either method.

The hybrid SIMD architecture¹² (Figure 8) is best utilized for intensive floating-point computational applications but does not degrade in performance as much as high-level programmable arrays when significant logical or control operations are included. The hybrid design combines a commercial floating-point multiplier chip and an FPGA controller to form a systolic cell. The commercial multiplier is used for its economy, speed, and package density; the FPGA closely binds the cell to a specific application. Until FPGA chip density progresses to the point where a very large FPGA can achieve high-level granularity, hybrid architectures present perhaps the most practical means to a reconfigurable high-level systolic array.

Figure 8. A hybrid SIMD programmable systolic architecture.

A recently developed simulator allows simulation and testing of the cell and the array design for a hybrid architecture.¹² Unlike most commercially available computer-aided design utilities, which verify designs at the chip level, it verifies the performance of algorithms and architectures through simulation of the complete array. The simulator maintains the iterative nature of purely reconfigurable array design by facilitating tailoring of hybrid designs for specific applications. Another project is attempting to integrate a hybrid general-purpose systolic array cell on a single chip.

Existing architectures

A review of projects initiated during the last decade shows that the trend was to develop large systolic array processors that require elaborate, customized host support. For example, Warp, the Cellular Array Processor, the Computer for Experimental Synthetic Aperture Radar, and Saxpy-1M all require expensive and complicated I/O support for the intensive instruction I/O as well as the data I/O. These machines are high-performance supercomputers (200 MFLOPS, 320 MFLOPS, and 1,000 MFLOPS, respectively) with centralized concurrency, constructed to fill an existing performance gap.

The introduction of workstations into the workplace has changed the way a significant portion of the computing community views computers. Users demand improved desktop computing performance as applications continue to increase in size and complexity. Workstation architecture can always be improved, especially for emerging applications such as text and speech processing and gene matching. The more recent general-purpose systolic array projects (Brown, Micmacs, Splash) show that back-end systolic processors are effective in boosting a workstation's performance for these applications. These arrays are small enough that the host's open architecture with limited I/O bandwidth does not severely impact the array's performance for low- to moderate-level granularity.

Small back-end systolic processors are also economically sound. Splash, for example, is a linear reconfigurable array appropriate for bit-level applications. Splash consists of a two-board add-on set for a Sun workstation. One board supports the linear array, and the other supports a buffer memory. The set costs from $13,000 to $35,000.

As we have said, current research emphasizes reconfigurable cells, reconfigurable arrays, and hybrids of functional units embedded in reconfigurable FPGA arrays.

No discussion of general-purpose systolic arrays is complete without addressing the issues of programming and configuration. High-level-language programming is desirable for promoting widespread use of programmable systolic back-end processors. Of the projects we surveyed, only the larger systems had mature programming environments. Currently, the smaller programmable arrays have implemented only assembly programming. As mentioned earlier, FPGA-configurable arrays are configured with the use of a schematic editor. One drawback of this approach is that it typically takes more effort than programming a programmable cell.

The systolic array is a formidable approach to exploiting concurrencies in a computationally rhythmic and intensive environment. Arrays ranging from low-level to high-level granularity have been applied to problems from bit-oriented pixel mapping to 32-bit floating-point-based scientific computing. General-purpose systolic arrays provide an economical way to enhance computational performance by emphasizing concurrency and parallelism. Systolic arrays hold great promise to be a pervasive form of concurrency processing. As a solution to the intensive computational performance requirements of tomorrow's applications, general-purpose systolic arrays cannot be overlooked.

References

1. J.A.B. Fortes and B.W. Wah, "Systolic Arrays: From Concept to Implementation," Computer (special issue on systolic arrays), Vol. 20, No. 7, July 1987, pp. 12-17.
2. W.K. Fuchs and E.E. Swartzlander Jr., "Wafer-Scale Integration: Architectures and Algorithms," Computer, Vol. 25, No. 4, Apr. 1992, pp. 6-8.
3. H.T. Kung, "Why Systolic Architectures?" Computer, Vol. 15, No. 1, Jan. 1982, pp. 37-46.
4. P. Quinton and Y. Robert, Systolic Algorithms and Architectures, Prentice-Hall, Englewood Cliffs, N.J., 1991.
5. C.K. Ko and O. Wing, "Mapping Strategy for Automated Design of Systolic Arrays," Proc. Int'l Conf. Systolic Arrays, IEEE CS Press, Los Alamitos, Calif., Order No. 860, 1988, pp. 285-294.
6. A. Krikelis and R.M. Lea, "Architectural Constructs for Cost-Effective Parallel Computers," in Systolic Array Processors, Prentice-Hall, Englewood Cliffs, N.J., 1989, pp. 287-300.
7. R. Hughey and D. Lopresti, "B-SYS: A 470-Processor Programmable Systolic Array," Proc. Int'l Conf. Parallel Processing, IEEE CS Press, Los Alamitos, Calif., Order No. 2355-22, 1991, pp. 1580-1583.
8. M. Gokhale et al., "Building and Using a Highly Parallel Programmable Logic Array," Computer, Vol. 24, No. 1, Jan. 1991, pp. 81-89.
9. S. Jones, A. Spray, and N. Ling, "A Flexible Building Block for the Construction of Processor Arrays," in Systolic Array Processors, Prentice-Hall, Englewood Cliffs, N.J., 1989, pp. 459-466.
10. B. Friedlander, "Block Processing on a Programmable Systolic Array," Proc. Int'l Conf. Parallel Processing, IEEE CS Press, Los Alamitos, Calif., Order No. 889, 1987, pp. 184-187.
11. H. Schroder and P. Strazdins, "Program Compression on the ISA," Parallel Computing, Vol. 17, No. 2-3, June 1991, pp. 207-215.
12. S. Smith and G. Sobelman, "Simulation-Based Design of Programmable Systolic Arrays," Computer-Aided Design, Dec. 1991, pp. 669-675.

Kurtis T. Johnson is an advanced engineer at HRB Systems, an E-Systems subsidiary. His interests include parallel computer architectures, parallel algorithms, and digital signal processing. He received his BS in 1986 and his MS in 1992, both in electrical engineering, from Pennsylvania State University.

A.R. Hurson is an associate professor of computer engineering at Pennsylvania State University. His research is directed toward the design and analysis of general-purpose and special-purpose computer architectures. He has published more than 110 papers on topics including computer architecture, parallel processing, database systems and machines, dataflow architectures, and VLSI algorithms. He coauthored the IEEE tutorial books Parallel Architectures for Database Systems and Multidatabase Systems: An Advanced Solution for Global Information Sharing Process. He is a cofounder of the IEEE Symposium on Parallel and Distributed Processing. Hurson is a member of the IEEE Computer Society Press Editorial Board and a member of the IEEE Distinguished Visitors Program.

Behrooz Shirazi is an associate professor of computer science engineering at the University of Texas at Arlington. Previously, he was an assistant professor at Southern Methodist University. His research interests include task scheduling, heterogeneous computing, dataflow computation, and parallel and distributed processing. He has published widely on these topics. He cofounded the IEEE Symposium on Parallel and Distributed Processing. He has served as chair of the Computer Society section in the Dallas chapter of the IEEE and as chair of the IEEE Region V Area Activities Board. Shirazi received his MS and PhD degrees in computer science from the University of Oklahoma, in 1980 and 1985, respectively.

Laxmi Bhuyan, Computer's system architecture area editor, coordinated and recommended this article for publication. His email address is bhuyan@cs.tamu.edu.

Send correspondence about this article to A.R. Hurson, Dept. of Computer Science and Engineering, Pennsylvania State University, University Park, PA 16802; or A2H@ecl.psu.edu.