

A High-Performance Embedded Massively Parallel Processing System

Lars Bengtsson (a), Kenneth Nilsson (a), and Bertil Svensson (a,b)

(a) Halmstad University, P.O. Box 823, S-301 18 Halmstad, Sweden
(b) Chalmers University of Technology, Dept. of Computer Engineering, S-412 96 Gothenburg, Sweden

ABSTRACT

A need to apply the massively parallel computing paradigm in embedded real-time systems is foreseen. Such applications put new demands on massively parallel systems, different from those of general purpose computing. For example, time determinism is more important than maximal throughput; physical distribution is often required; size, power, and I/O are important; and interactive development tools are needed.

The paper describes an architecture for high-performance, embedded, massively parallel processing, featuring a large number of nodes physically distributed over a large area. A typical node has thousands of processing elements (PEs) organized in SIMD mode and has the size of the palm of a hand. Intermodule communication over a scalable optical network is described. A combination of wavelength division multiplexing (WDM) and time division multiplexing (TDM) is used.

1: INTRODUCTION

In contrast to the mainstream of massively parallel processing, targeted for general purpose, high performance computing, this paper discusses architectures of special-purpose computer systems to be embedded into real-time applications. In many cases embedding implies a need for distributed solutions, which in turn requires modularized systems. Communication needs between modules can be critical. Processing is frequently required to be close to sensors, and massively parallel I/O becomes an important issue.

The demand for real-time performance implies that for each task there must always be enough computational and I/O resources guaranteed. The correct time of an output is as important as a correct value.

Typical application areas for embedded massively parallel computing include radar and sonar processing, advanced instrumentation systems for large experiments, and action-oriented systems [1]. An emerging technology in especially the latter case is the application of artificial neural network (ANN) principles. A way to cope with the complexities involved with advanced systems functioning in natural environments is to use a multitude of cooperating ANNs. Support for this must then be given in the architecture.

The demand for embedded implementations implies a need for miniaturization and low power consumption. At the same time cheap solutions must be found. It must be possible to separate those computing resources that are needed only in the design phase from the execution system. At the same time, in the kind of applications we discuss, development must typically be done through interaction. This interaction exists both between the system designer and the system and between the system and the environment [2]. Developing the system on-line, using the real sensors and actuators, must be supported. Thus, the design and execution phases are not clearly separable, and there is a need for an easy interconnect between development system and target system.

To achieve time determinism, in order to guarantee real-time response, the approach taken in this paper is the principle of resource adequacy and cyclic execution [3], [4]. Resource adequacy means that enough resources, both for computation and communication, are allocated statically in the design phase. (However, as described above, the design and execution phases may alternate.) Cyclic execution implies guaranteed timely response to stimuli. The program cycle is tuned to the requirements given by the dynamics of the environment.

Finally, the module architecture that we discuss in this paper is based on the SIMD principle. This organisation is well suited to typical computations in the intended application areas, e.g. signal and image processing and ANN computations. We describe system solutions for the order of one hundred such modules, allowing a physical distribution of modules over typically one hundred meters.

The paper is organised as follows: In the next section the overall system architecture is introduced and typical figures for the size of modules and the communication bandwidth demands are given. Section 3 discusses the architecture and implementation of miniaturized processing modules with, typically, one thousand processors. Section 4 discusses scalable optical solutions to the intermodule communication problem, identifies key components and describes the interface between the processing modules and the communication medium. Section 5 concludes the paper.
2: SYSTEM ARCHITECTURE

The system comprises distributed modules connected to a shared communication medium via ports. One of the modules, the gateway module, is an interconnect to the development system.

A block diagram of a module is shown in Fig. 1. It has two parts: a Processing Unit and a Communication Unit.

[Figure 1. A Module. The Communication Unit contains its own control unit, P/S and E/O conversion on the transmit (T) port, O/E and S/P conversion on the receive (R) port, synchronization and a buffer; the Processing Unit contains a control unit, local memory, an interconnection network (ICN) and the PE array.]

The module processes data in SIMD style. Typically, the size of the PE-array is 1024 bit-serial PEs and the clock frequency is 100 MHz. Data are sent in logical channels between the modules over the shared medium in the same format as for processing, i.e. as parallel arrays. A logical channel is a direct path between the local memories of the modules. It is important to achieve a balance between the processing speed and the communication speed for a module. The processing speed is at most 100 Gb/s (1024 x 100 MHz). However, not all produced or consumed data will be communicated over the medium. This may reduce the demands on the communication speed by a factor of, say, 10-100, which gives a communication speed of 1-10 Gb/s per port.
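As a back-of-the-envelope check, the figures above (and the 100-module aggregate discussed next) follow directly from the quoted parameters; the 10-100x reduction factor is the paper's assumption about how much of the produced or consumed data actually crosses the shared medium. A minimal sketch in Python:

```python
# Bandwidth budget check using the parameters quoted in the text.
PES_PER_MODULE = 1024        # bit-serial PEs per module
CLOCK_HZ = 100e6             # 100 MHz clock
N_MODULES = 100              # system size discussed below

processing_bps = PES_PER_MODULE * CLOCK_HZ        # ~100 Gb/s per module
for reduction in (10, 100):                       # paper's assumed 10-100x
    port_bps = processing_bps / reduction         # 1-10 Gb/s per port
    aggregate_bps = N_MODULES * port_bps          # 100-1000 Gb/s on the medium
    print(f"reduction {reduction:>3}x: port {port_bps/1e9:5.2f} Gb/s, "
          f"aggregate {aggregate_bps/1e9:7.1f} Gb/s")
```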
For a system with 100 modules the aggregated communication speed on the shared medium is about 100-1000 Gb/s. This high figure, and the fact that modules can be at a considerable distance from each other, point to the use of optical fiber as the communication medium. A single-mode fiber has a bandwidth of several tens of THz [5]. A technique to take advantage of this bandwidth is Wavelength Division Multiplexing (WDM). Using an all-optical network as the transmission medium also results in a robust network, with a bit-error rate as low as 10^-15 [5].

To avoid an electronic bottleneck in the system, the electronics in the communication interface should only work at its own speed (1-10 Gb/s) and not at the aggregated speed of the modules.

3: MODULE DESIGN

3.1: Multi-Chip Module Technology

The overall module size requirement (the palm of a hand) makes Multi-Chip Module (MCM) technology a natural choice because of the ability to attach many naked (unpackaged) chips on a small board area. Also, MCM offers a low-noise environment and reduces the chip-to-chip interconnect length. Furthermore, by using the "flip-chip" bonding technique, speed performance is enhanced due to much lower R, L, C values compared to traditional wire bonding [6]. In this technique "solder bumps" on the active surface of the chip are used to make the bond connections between the chip (mounted upside-down) and the metal wires running in (several) layers in the substrate. The "flip-chip" solution also meets the high I/O pin count per chip in a massively parallel (fine-grain) processing module. Several thousand solder bumps can be used, whereas in a traditional (e.g. PGA) package the limit is usually only a few hundred pins. Because the chips are attached to the substrate "upside-down", resting only on the solder bumps, the dissipated heat should be removed from the top (backside) of the chip, possibly utilizing a heat-sink [7].

[Figure 2. The Multi-Chip Module (schematic view): two 512-PE chips connected by a broadcast interconnect (icn), four synchronous DRAMs with 256-bit data paths, a combined Communication and Control Unit (CU) with an optical link to other modules, and a parallel I/O section, tied together by address and control lines.]

The multi-chip solution using several chips offers a great advantage over a monolithic (single chip) approach, because mixed technologies can be used for the different chips (Fig. 2). The memories can use a high-density 1-transistor DRAM process, the PE chips and the Control Unit a high-speed BiCMOS process, and GaAs in the I/O section enables fast circuitry, useful in the optical laser / pin-diode communication interface.

Other alternatives, like TAB bonding and Wafer-Scale Integration, have also been considered but were ruled out because of the superior packaging efficiency in the flip-chip case [7].
3.2: Memory implementation

For neural network computations the amount of memory needed in each PE is large. Using 16-bit weights, a four-layered neural net with 1024 PEs can require up to 128 kbit per PE [8], giving a total of 128 Mbit for the complete module. Clearly this, and the overall module size, indicates that a high-density DRAM technology is needed. A module like this would contain two PE chips with 512 PEs each and four DRAM chips organized as 128k * 256. This could be implemented with conventional byte-wide (or word-wide) DRAMs, employing a 3D packaging technique where several 2D DRAM dies are stacked to form the complete memory module. Examples of such modules will soon be seen on the market, e.g. from Texas Instruments an 8 to 10 bit x 1 Mbit DRAM module reaching a packaging density of 83 Mbit/in3 (compared with 2 Mbit/in3 for conventional surface mount technology) [9].

The main disadvantage with DRAM is the longer access time compared to SRAM. Commercial DRAMs with roughly 35 ns access time (page mode) are available. However, a newer generation of "synchronous DRAMs" (SDRAM), offering 100 MHz operation, has reached the market. In these, bursts of data within one "page" can be delivered at 100 MHz [10].
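The memory budget above is internally consistent, as a quick check shows (all sizes taken straight from the text):

```python
# Consistency check of the module memory budget.
PES = 1024
BITS_PER_PE = 128 * 1024               # up to 128 kbit of weights per PE [8]
total_bits = PES * BITS_PER_PE         # 128 Mbit for the complete module
print(f"total module memory: {total_bits / 2**20:.0f} Mbit")

# Proposed implementation: four DRAM chips organized as 128k x 256.
dram_bits = 4 * (128 * 1024) * 256
print(f"four 128k x 256 DRAMs: {dram_bits / 2**20:.0f} Mbit")
assert dram_bits == total_bits         # exactly covers the requirement
```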
3.3: Implementation of the Processing Elements

To achieve high computing performance it is desirable to have a large (massive) number of PEs, and to clock these as fast as possible. The number of PEs on a 15 by 15 mm chip area is limited by the maximum density of I/O solder bumps (several thousands) and the maximum wire density on the MCM substrate (today 800 cm/cm2 [20]).

Speed is constrained by the critical path delay in each PE and also by the clock skew tolerances in the global clocking net. Also, because the complete design is synchronous with control signals from the CU, the critical path really comprises the entire CU-PE/Memory loop. To minimize this loop delay, the CU is implemented in a high-speed BiCMOS process with the critical parts done in Bipolar technology. Also, the signals in the MCM substrate need to be handled as transmission lines, with proper electrical interfaces (possibly low-voltage swings and differential transmission) and proper impedance matching. Introducing pipelining in both the data and control paths enables the system frequency to be as fast as the memory cycle, 100 MHz.

Now, the real limitation on the computing performance of each module will be the power dissipation (or rather the heat dissipation). This is especially true when using the "flip-chip" technique, where the heat must be removed from the top of each chip. Setting the dissipation limit to 25 W per PE chip, this can be expressed as:

Pdyn + Pi/o <= 25 W    (1)

where Pdyn is the dynamic power dissipated when the parasitic capacitances in the CMOS circuit are charged and discharged (the short-circuit dissipation is neglected in this calculation), and Pi/o is the power dissipated in the output pad buffer drivers. Pdyn can be expressed as [11]:

Pdyn = CP * VDD^2 * f    (2)

where CP is the capacitive parasitic load in each node, VDD is the supply voltage and f is the switching frequency. Pi/o is given by the same formula but multiplied by the number of I/O pads and using another C value (CL).

A first-order approximation of the maximum number of transistors allowed to switch per time unit (A) can now be calculated. Using the above formula and typical parameters for a 512-PE chip running at 100 MHz (CP = 100 fF, CL = 1 pF, VDD = 5 V) we obtain A < 380 transistors (a more accurate estimate could of course be obtained by using a switch-level simulation tool, as in [12]).

According to (2) the power dissipation is proportional to the square of the supply voltage. Thus, it would be advantageous to switch to a 3.3 V supply. Furthermore, if more effort is put into cooling the chips, the dissipation could well exceed the 25 W assumed here. ECL-technology CPU chips dissipating over 100 W have been seen; one such research chip utilizes a "cooling tower" glued on top of the chip surface to get rid of the heat [13].

Using a 3.3 V supply, 100 MHz system speed and 25 W dissipation leads to a maximum transistor switching count of roughly 900.
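The paper does not show the intermediate arithmetic behind these counts. The sketch below reproduces them under assumptions of ours: the conventional factor 1/2 in the switching-energy formula, the full 25 W budget charged to the 512 PEs (Pi/o neglected), and A interpreted as switching transistors per PE and clock cycle. It lands close to the quoted A < 380 at 5 V and roughly 900 at 3.3 V; the small gap at 5 V is presumably the neglected Pi/o term.

```python
# Rough reconstruction of the switching-count estimate (assumptions ours).
C_P = 100e-15        # parasitic load per node, 100 fF
F = 100e6            # 100 MHz clock
PES_PER_CHIP = 512
P_BUDGET = 25.0      # Watt per PE chip, Eq. (1)

def max_switching_per_pe(vdd):
    p_per_node = 0.5 * C_P * vdd**2 * F            # W per switching transistor
    return (P_BUDGET / PES_PER_CHIP) / p_per_node  # transistors per PE and cycle

print(f"5.0 V: A ~ {max_switching_per_pe(5.0):.0f} transistors/PE")  # ~391, cf. A < 380
print(f"3.3 V: A ~ {max_switching_per_pe(3.3):.0f} transistors/PE")  # ~897, cf. ~900
```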
3.3.1: Register level PE design

Bit-serial PEs on the register level have been designed in the REMAP-β sub-project [14]. They have been implemented and tested in FPGA circuits. One example, suited for neural network execution, is shown in Fig. 3. It contains a 16-bit serial multiplier which enables the PE to perform multiplication as fast as addition, an important feature in neural network algorithms.

[Figure 3. A PE suited for neural network execution: an ALU with R, C and T registers and control logic, a bit-serial multiplier, and a shift-register file feeding the multiplier.]
Operations may be locally disabled by using "tagged" operations. A "tagged" memory write operation is performed by using a T-register holding a "1" in those PEs that should update their memory position. A "0" in other PEs selects the original memory contents instead, which are written back to the same memory address.
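In other words, a tagged write is a per-PE select between the new value and the old memory contents, so every PE performs the same write cycle in lockstep. A minimal sketch of this semantics, modelling the PE array as NumPy arrays (names and sizes are ours, purely for illustration):

```python
import numpy as np

def tagged_write(memory, address, value, t_register):
    """SIMD 'tagged' memory write: PEs whose T-register holds 1 store the
    new value; all other PEs rewrite their original memory contents."""
    old = memory[:, address]
    memory[:, address] = np.where(t_register == 1, value, old)

# Toy sizes: 8 PEs with 4 memory words each.
mem = np.zeros((8, 4), dtype=np.int32)
tags = np.array([1, 0, 1, 0, 1, 0, 1, 0])
tagged_write(mem, address=2, value=7, t_register=tags)
print(mem[:, 2])   # [7 0 7 0 7 0 7 0]
```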
To take full advantage of the synchronous DRAMs described earlier, the sequence of addresses should be sequential, meaning that the principle of locality in the spatial domain is constrained even further, from random referencing within one cache page to strictly sequential access. Now, in neural net execution, where the algorithms consist mainly of matrix-vector multiplications, the addresses do not follow this rule when "jumping" between the matrix and the vector. To overcome this, a file of four shift registers has been included in the PE. This allows the memory (SDRAM) referencing to sweep through the matrix field.

The file has a 16-bit parallel output connected to the multiplication unit. It works as a path for the multiplicand during the bit-serial multiplication. The file is only capable of shifting one register at a time (in and out). This restriction was put on it because of the limit on the dynamic power consumption that was previously discussed. Doing so means that only one fourth of the transistors in the file are switching simultaneously (i.e. per time unit).
Performance when running the neural net "backprop" algorithm reaches a maximum of 3 Giga Connections Per Second using 16-bit weights.
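The paper does not state the per-connection cycle count behind this figure, but simply inverting the quoted numbers gives a plausibility check: 3 GCPS corresponds to each PE completing one 16-bit connection roughly every 34 clock cycles, on the order of two bit-serial passes over a 16-bit operand.

```python
# Plausibility check of the 3 GCPS figure (cycle count inferred, not stated).
PES = 1024
F = 100e6                      # 100 MHz
GCPS = 3e9                     # quoted backprop peak
cycles_per_connection = PES * F / GCPS
print(f"~{cycles_per_connection:.0f} clock cycles per 16-bit connection per PE")
```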
3.3.2: Transistor level design

The total number of transistors in each PE sums up to 1326 using a dynamic logic style (a fully static PE would use 2016). To keep within the 25 Watt limit, the PE is implemented such that not all nodes in it are allowed to switch in each clock cycle; only one register in the register file can be shifted at a time. Doing so means that the worst-case number of transistors switching simultaneously (per time unit) is 942, which is accepted as close enough to the estimated maximum count of 900.
4: INTER-MODULE COMMUNICATION

4.1: The Communication Network

A passive optical star is generally preferred to the optical ring because the star distributes the optical power more efficiently among the modules [15].

In a multi-processor system an important feature is many-to-many communication. The star topology implements many-to-many communication if Wavelength Division Multiplexing (WDM) is used.

In WDM-based communication it is an advantage if the wavelengths in the system can be reused; it is therefore preferable to optically "isolate" different subnetworks from each other. This can be done with special dual-port modules in each subnetwork (Fig. 4).

[Figure 4. Partitioning of the communication network into subnetworks. A dual-port module, with a control unit (CU) and local memory (LM) per port, bridges two subnetworks.]

After partitioning there are several communication cycles working in parallel in the system, one cycle for each subnetwork, where overlapping wavelength bands can be used.
4.2: Medium-Access Control

We use WDM to create several logical channels on the all-optical star, and we further divide each logical channel into time slots (Time Division Multiplexing, TDM). In a time slot a communication packet is transmitted. We call this way of communication, where data are sent in time slots on each channel and modules have to switch between the different wavelengths to send or receive data in predetermined time slots, time-slot switching.

In our system the scheduling of medium access is made at design time, in the development tool, and not on the running system itself. At design time an analysis is made of the demands put on the communication by the application. The result of this analysis is a static allocation of time slots on the shared medium.

In each module the communication is executed in a cyclic manner. We divide each communication cycle into K instruction slots, each equal in time to a time slot on the shared medium. Each module has two communication ports, one for transmitting data and one for receiving data, each with its own communication cycle.

A send instruction specifies the wavelength to transmit on and when in time to transmit (the index of the send slot in the cycle); a receive instruction specifies the wavelength to receive on and when to receive (the index of the receive slot in the cycle).
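A module's communication program is thus a pair of fixed K-slot tables, emitted by the development tool before the system runs. The toy sketch below illustrates the kind of table this static allocation could produce; the representation, slot indices and channel numbers are invented for illustration, not taken from the paper.

```python
# Toy design-time medium-access schedule for one module, K = 4 slots.
K = 4
tx_cycle = {1: ("send", 2)}        # slot 1: transmit on wavelength channel 2
rx_cycle = {3: ("receive", 5)}     # slot 3: listen on wavelength channel 5

for slot in range(1, K + 1):       # both ports step through their cycles in lockstep
    tx = tx_cycle.get(slot, ("idle", None))
    rx = rx_cycle.get(slot, ("idle", None))
    print(f"slot {slot}: T-port {tx}, R-port {rx}")
```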

To get a simpler allocation scheme, we "skew" the transmitting communication cycle in each module so that every send slot with index m (m = 1 to K) corresponds to time-slot index m at the merging point of the star. The receiving communication cycle in each module is also skewed, in such a way that if send-slot index p is allocated for transmitting data, all the modules receive this data in receive-slot index p.

[Figure 5. The communication cycles "skewed". Transmit cycles of modules 1..N are offset by (dmax - di) and receive cycles by (dmax + di), so the module with d1 = dmax transmits without skew and receives after 2 x dmax.]

To skew the communication cycles we must measure the delay times, di, between each module i and the merging point of the star. The module with the largest delay, di = dmax, is used as a reference when skewing the other communication cycles in the modules (Figure 5).
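Reading the offsets from Figure 5, a module with delay di starts its transmit cycle (dmax - di) after the common cycle origin, so all slots align at the star's merging point, and starts its receive cycle (dmax + di) after it, when the merged slot has propagated back. A small sketch under that reading (the delay values are invented):

```python
# Skew offsets per module, as read from Figure 5.
def skew_offsets(delays):
    d_max = max(delays)
    return [(d_max - d, d_max + d) for d in delays]   # (TX offset, RX offset)

delays_ns = [500, 320, 410]    # measured module-to-star delays (example values)
for i, (tx_off, rx_off) in enumerate(skew_offsets(delays_ns), start=1):
    print(f"module {i}: start TX at +{tx_off} ns, start RX at +{rx_off} ns")
# The reference module (largest delay) gets TX offset 0 and RX offset 2*dmax.
```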
The presented inter-module communication concept is scalable by allocating more time slots, adding more wavelengths, and partitioning into subnetworks.

A problem with WDM is that different wavelengths travel with different velocities. This is due to dispersion in the optical medium. Therefore, two time slots transmitted at the same time on two different wavelengths will reach a receiver at two different times [16]. The difference in arrival time, T, is equal to

T = D * dw * L    (3)

where D is the dispersion in ps/(nm x km), dw is the difference between the two wavelengths in nm, and L is the distance travelled in km. In our system L is at most 200 m (to and back from the star), which, with D = 16 ps/(nm x km) and dw = 200 nm, gives T = 640 ps. This is about 1 bit at a 1 Gb/s transmission rate. This "dispersion bit" can be added at the start and end of the time slots so that no data is lost when a module has to switch between two different wavelengths.
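A direct evaluation of Eq. (3) with the paper's numbers confirms the "about 1 bit" guard interval:

```python
# Worked instance of Eq. (3).
D = 16.0      # dispersion, ps/(nm x km)
dw = 200.0    # wavelength separation, nm
L = 0.2       # path length, km (100 m to the star and back)

T_ps = D * dw * L                  # 640 ps arrival skew
bit_time_ps = 1e12 / 1e9           # one bit at 1 Gb/s = 1000 ps
print(f"arrival skew: {T_ps:.0f} ps ~ {T_ps / bit_time_ps:.2f} bit at 1 Gb/s")
```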
4.3: Key Components for WDM

Key components for WDM-based, time-slot switching networks are fast-tuned (in ns) laser diodes and filters. The ability to tune over a broad wavelength range and to have a high selectivity in wavelength are important characteristics of these components.

In [17] several concepts of new sets of multi-channel wavelength-switched transmitters and receivers are presented. The multi-channel receiver uses a diffraction grating to implement a wavelength-space demultiplexer and a detector array for the optical-to-electrical (O/E) conversion. To select one of the channels (tuning the filter) an electronic switch is used.

A monolithic multiple wavelength laser array (VCSEL) is described in [18]. Each physical position on the 2D array represents a unique wavelength. The VCSEL used as the light source, plus a component that directs the light into a single-mode fiber, implements an addressable wavelength-switched transmitter with a large "tuning" range. An efficient light coupling between the 2D laser array and the single-mode fiber can be achieved by a holographic method [19].

Direct modulation of the addressed laser or an external broadband modulator can be used.

4.4: The Communication Interface

The communication interface has two ports to the shared medium. The transmission port (T) performs parallel-to-serial (P/S) and electrical-to-optical (E/O) conversion, and the receiving port (R) performs optical-to-electrical (O/E) and serial-to-parallel (S/P) conversion (Figure 6).

[Figure 6. The communication interface: the PE array feeds a FIFO, MUX, P/S and E/O chain on the transmit (T) port; the receive (R) port chains O/E, S/P, DE-MUX and a FIFO; a synchronization block (VCO, clock dividers and multiplier) and control units (cu) tie the ports to the Processing Unit.]

In a time-slot switching system the clock must be immediately recovered (minimum hysteresis in the clock-recovery circuits).

The use of an optical fiber with a bandwidth of several THz gives the possibility to use a "pseudo-channel" to transmit data. A pseudo-channel includes not only the data itself but also the clock, coded in separate wavelengths.
5: CONCLUSION

Two issues have been identified as important with regard to embedded massively parallel computers: (1) miniaturized modules with reasonable heat dissipation, and (2) efficient communication between modules.

Knowing, from detailed studies of the application of SIMD architectures to artificial neural network computations, image and signal processing, etc., that these architectures are efficient for such problems, in this paper we have investigated the possibilities for implementing a distributed multiple-SIMD system for embedded real-time applications.

The analysis shows that it is clearly feasible to implement a 1024-PE module with small dimensions (a few cubic inches). Key techniques are flip-chip MCM technology and synchronous DRAM circuits. Further, we have described an all-optical intermodule communication network that is scalable, with the modules distributed up to 100 meters apart. The total bandwidth of the network is roughly one Tbit/s.

In summary, the paper shows the potential of distributed massively parallel computing for demanding real-time tasks like action-oriented systems and advanced sensor systems.
6: REFERENCES

[1] Davis E.W. et al. Issues and applications driving research in non-conforming massively parallel processors. In Isaac Scherson (ed.), The New Frontiers, IEEE Computer Society Press, 1993, pp. 68-78.
[2] Svensson B. and P.A. Wiberg. Autonomous systems demand new computer system architectures and new development strategies. IECON '93: 19th Annual Conference of the IEEE Industrial Electronics Society, Maui, Hawaii, USA, 1993, pp. 27-31.
[3] Lawson H.W., with contributions by B. Svensson and L. Wanhammar. Parallel Processing in Industrial Real-Time Applications. Prentice-Hall, Englewood Cliffs, NJ, USA.
[4] Lawson H.W. and B. Svensson. An architecture for time-critical distributed/parallel processing. Euromicro Workshop on Parallel and Distributed Processing, Gran Canaria, Spain, 1993.
[5] Green P.E. The future of fiber-optic computer networks. Computer, Vol. 24, No. 9.
[6] Parker J.W. Optical interconnection for advanced processor systems: a review of the ESPRIT II OLIVES programme. Journal of Lightwave Technology, Vol. 9, No. 12, December 1991, pp. 1764-1773.
[7] Tummala R. Multichip packaging - a tutorial. Proceedings of the IEEE, Vol. 80, No. 12, December 1992, pp. 1924-1941.
[8] Nordström T. and B. Svensson. Using and designing massively parallel computers for artificial neural networks. Journal of Parallel and Distributed Computing, Vol. 14, No. 3, pp. 260-285.
[9] Courtois B. "MCM and 3D packaging techniques", CAD and Testing of ICs and Systems: Where Are We Going?, INPG, 46 avenue Felix Viallet, 38031 Grenoble Cedex, France.
[10] Fast DRAMs can be swapped for SRAM caches. Electronic Design, July 22, 1993, pp. 55-67.
[11] Veendrick H. Short-circuit dissipation of static CMOS circuits and its impact on the design of buffer circuits. IEEE Journal of Solid-State Circuits, Vol. SC-19, No. 4, August 1984, pp. 468-473.
[12] Burch R. et al. A Monte Carlo approach for power estimation. IEEE Transactions on VLSI Systems, Vol. 1, No. 1, March 1993.
[13] ISSCC - Digital technology. Electronic Design, March 4, 1993, pp. 48-57.
[14] Bengtsson L. et al. The REMAP massively parallel computer platform for neural computations. Proceedings of the Third International Conference on Microelectronics for Neural Networks, Edinburgh, Scotland, April 1993, pp. 47-62.
[15] Ramaswami R. Multiwavelength lightwave networks for computer communication. IEEE Communications Magazine, February 1993, pp. 78-88.
[16] Semaan G. and P. Humblet. Timing and dispersion in WDM optical star networks. Infocom '93, San Francisco, CA, March 28 - April 1, 1993, pp. 573-577.
[17] Kirby P.A. Multichannel wavelength-switched transmitters and receivers - new component concepts for broad-band networks and distributed switching systems. Journal of Lightwave Technology, Vol. 8, No. 2, February 1990, pp. 202-211.
[18] Chang-Hasnain J.C. et al. Monolithic multiple wavelength surface emitting laser arrays. Journal of Lightwave Technology, Vol. 9, No. 12, December 1991, pp. 1665-1673.
[19] Paek E.G. et al. All-optical image transmission through a single mode fiber. Optics Letters, Vol. 17, No. 8, April 15, 1992, pp. 613-615.
[20] Schaper L.W. Design of multichip modules. Proceedings of the IEEE, Vol. 80, No. 12, December 1992, pp. 1955-1964.
