
A Comparison of Broadcast-Based and Switch-Based Networks of Workstations

Constantine Katsinis Electrical and Computer Engineering, Drexel University, Philadelphia, PA 19104 katsinis@ece.drexel.edu

Keywords: distributed architectures, distributed shared memory, fault tolerance.

Abstract
Networks of Workstations have mostly been designed using switch-based architectures and programmed with message passing. This paper describes a network of workstations based on the Simultaneous Optical Multiprocessor Exchange Bus (SOME-Bus), a low-latency, high-bandwidth interconnection network that directly links arbitrary pairs of processor nodes without contention and can efficiently interconnect several hundred nodes. Each node has a dedicated output channel and an array of receivers, with one receiver dedicated to every other node's output channel. The SOME-Bus eliminates the need for global arbitration and provides bandwidth that scales directly with the number of nodes in the system. Under the Distributed Shared Memory (DSM) paradigm, the SOME-Bus allows strong integration of the transmitter, receiver and cache controller hardware to produce a highly integrated system-wide cache coherence mechanism. Backward Error Recovery fault-tolerance techniques can exploit DSM data replication and SOME-Bus broadcasts with little additional network traffic and correspondingly little performance degradation. This paper examines switch-based networks that maintain high performance under varying degrees of application locality and compares them to the SOME-Bus in terms of latency and processor utilization. In addition, the effect of fault-tolerant DSM on all networks is examined.

1. INTRODUCTION
Networks of workstations (NOWs) have become a popular, cost-effective alternative to high-performance parallel computers. They are usually built around a switch-based interconnection network whose switch ports connect to other switch ports or to processors. A typical installation contains several dozen or even hundreds of processing nodes, each with one or more high-performance microprocessors organized as a small symmetric multiprocessor. The resulting computing capacity is equivalent to that of traditional multicomputers and multiprocessors.

Present interconnection networks for NOWs mostly rely on a network interface card attached to the I/O bus of the processing node, with point-to-point links connecting interface cards to switches. This organization is tailored to support infrequent transfers of large volumes of data through message passing. Typically, the operating system intervenes at each send or receive operation. Such interventions increase the latency of the operation to the extent that the capabilities (or lack thereof) of the interconnection network become secondary. As a result, there is a tendency to use large messages to amortize the high latencies. Consequently, scientific and data-processing applications which require frequent, small messages or which are sensitive to latency do not perform well, resulting in low processor utilization and little scalability.

Distributed shared memory (DSM) provides a simpler programming model, in which communication is performed by writing and reading shared memory locations and barriers provide synchronization, but it requires more architectural complexity than message passing. Software-based DSM systems have been developed [3,16], but they tend to rely on the operating system to manage the replication of relatively large memory pages and encounter problems similar to those of message-passing implementations. The addition of hardware support, however, makes the benefits of DSM attainable while requiring relatively little support from the operating system and resulting in smaller latencies. Initial versions of such shared-memory hardware [4] still rely on interfaces attached to the I/O bus, but later versions would be connected directly to the memory bus [14] to support shared memory and implement the DSM protocol in hardware. In [6], the issue of using DSM on NOWs with arbitrary topology is raised, and the performance of routing algorithms is examined.

Network topology has been addressed in the past [8,15,17], usually under the assumption that traffic rates between pairs of nodes are known. Such analyses usually result in irregular topologies dictated by a specific traffic pattern. Irregular topologies are mostly useful when they are targeted to applications with considerable locality. Since networks of arbitrary topology risk developing hot spots as individual links saturate when message traffic increases, large networks capable of accommodating arbitrary traffic patterns need large bisection bandwidths and some regular topology. For example, a switch-based Myrinet architecture with 128 processing nodes can be constructed using 24 16-port switches in a Clos network topology [18]. In [13], a smaller system (with 26 dual-processor nodes) is compared to a traditional multiprocessor (SGI Origin 2000).

This paper presents the SOME-Bus-NOW, a network of workstations which requires no switches and is capable of supporting multiple simultaneous multicasts. It eliminates congestion within the network and minimizes latency, allowing the system to achieve high performance. The SOME-Bus-NOW is compared to common switch-based networks of workstations capable of supporting applications with little locality, in terms of performance and network interface complexity, using the distributed-shared-memory (DSM) paradigm with extensions supporting fault tolerance. Performance is measured using processor utilization and message latency.

The rest of the paper is organized as follows: section 2 presents the SOME-Bus network of workstations, section 3 presents switch-based networks with sufficient bandwidth to allow meaningful comparisons with the SOME-Bus, section 4 reviews the DSM operation, section 5 presents the operations that achieve fault tolerance in combination with DSM, and section 6 presents simulation results and compares the performance of the SOME-Bus and the switch-based networks.

2. ARCHITECTURE OF THE SOME-BUS NOW
An alternative to switch-based networks relies on one-to-all broadcast, where each processor can directly communicate with any other processor. From the point of view of any processor, all other processors appear the same. Such a network allows data and operations to be structured in the application code to better reflect the inherent parallelism and reduce message latencies. The most useful properties of such a network of workstations are high bandwidth (scaling directly with the number of workstations), low latency, no arbitration delay, and non-blocking communication. Such a network can be constructed using optoelectronic devices, relying on sources, modulators and arrays of detectors, all coupled to local electronic processors. One implementation of such an architecture is the Simultaneous Optical Multiprocessor Exchange Bus (SOME-Bus) [9]. One of its key features is that each node has a dedicated broadcast channel which can operate at several GBytes/sec, depending on the configuration. Figure 1 shows the network organization.

In general, the SOME-Bus with N nodes contains K fibers, each carrying M wavelengths organized in M/W channels, where each channel is composed of W wavelengths. The total number of fibers is K = NW/M. A simple configuration with 128 nodes (N = 128 channels) and W = 1 wavelength per channel would require K = 32 fibers with M = 4 wavelengths per fiber. Each of the N nodes also has an input channel interface based on an array of N receivers (each with W detectors) which simultaneously monitors all N channels.
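To make the channel arithmetic concrete, the following short sketch computes the fiber count from the relation K = NW/M; the parameter names are ours, not the paper's:

```python
# Sketch of the SOME-Bus channel/fiber arithmetic described above.
def somebus_fibers(n_nodes: int, wl_per_channel: int, wl_per_fiber: int) -> int:
    """Fibers needed: K = N*W/M, for N channels of W wavelengths each,
    packed onto fibers that carry M wavelengths apiece."""
    total_wavelengths = n_nodes * wl_per_channel        # N*W
    assert total_wavelengths % wl_per_fiber == 0, "channels must fill fibers exactly"
    return total_wavelengths // wl_per_fiber            # K

# Configuration quoted in the text: N = 128, W = 1, M = 4 -> K = 32 fibers.
print(somebus_fibers(n_nodes=128, wl_per_channel=1, wl_per_fiber=4))  # 32
```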
Figure 1: Parallel receiver array

Each node uses separate micro-mirrors [11] and laser sources to insert each wavelength of its channel. Slant Bragg gratings [5] written directly into the fiber core are used as narrow-band, inexpensive output couplers. This coupling of the evanescent field allows the traffic to continue and eliminates the need for regeneration. Photodetectors are created through a layer of amorphous silicon (a-Si) on the surface of electronic processing devices. Due to the low conductivity of the a-Si layer, no subsequent patterning is required, and therefore the yield and cost of the receiver is determined by the yield and cost of the CMOS device itself. Since the receiver array does not need to perform any routing, its hardware complexity (including detector, logic, and packet memory storage) is small.

This organization eliminates the need for global arbitration and provides bandwidth that scales directly with the number of nodes in the system. No node is ever blocked from transmitting by another transmitter or by contention for shared switching logic. The ability to support multiple simultaneous broadcasts is a unique feature of the SOME-Bus; it efficiently supports high-speed, distributed barrier synchronization mechanisms and cache consistency protocols, and allows process group partitioning within the receiver array.
Figure 2: Optical interface

The receiver array contains an optical interface (Figure 2) which performs address filtering, barrier processing, length monitoring and type decoding. If a valid address is detected in the message header, the message is placed in a queue; otherwise the message is ignored. The address filter recognizes multicast group addresses and broadcast addresses in addition to the address of the host node. The receiver array also contains a set of queues, one associated with each input channel, allowing messages from any number of nodes to arrive and be buffered simultaneously. Arbitration may be required only locally, in the receiver array, when multiple input queues contain messages.
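The acceptance test applied by each receiver can be expressed in a few lines. The sketch below is our condensation (the paper implements this in receiver hardware) and assumes a header carrying an explicit destination set:

```python
# Illustrative receiver-side address filter; names and the broadcast
# sentinel are our assumptions, not the paper's hardware encoding.
from collections import deque

BROADCAST = -1  # assumed sentinel value for the broadcast address

def accept_message(header_dests: set, host_addr: int, group_addrs: set) -> bool:
    """Accept if the header names this host, one of its multicast groups,
    or the broadcast address; otherwise the message is ignored."""
    return (host_addr in header_dests
            or BROADCAST in header_dests
            or bool(group_addrs & header_dests))

# One FIFO per input channel (N = 128 here), so accepted messages from
# any number of senders can be buffered simultaneously; arbitration is
# needed only locally, when several queues hold messages.
input_queues = [deque() for _ in range(128)]
```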
Scalability of the network of workstations, whether based on switches or on the SOME-Bus, is a very important issue. Since achieving "supercomputing" performance requires a large number of workstations, and since the topology of the network becomes critical as the network size increases, it is important to examine scalability in the sense of doubling the number of processing nodes, rather than incrementally increasing the network size. The SOME-Bus system relies on a set of N receivers integrated on a single chip. Since most of the network cost is in the fiber-optic ribbon, a system with fewer than N nodes may be constructed by using a SOME-Bus segment that accommodates the number of needed nodes (leaving a portion of the receiver chip unused). Such a system may scale up to N nodes by optically joining additional SOME-Bus segments. Once N is reached, however, it is not possible to add just one more node. System expansion is then achieved by incorporating a second N-receiver chip in each node (and using additional SOME-Bus networks for the necessary channels). Figure 3 shows the expansion of such a system from N to 2N nodes, accomplished by using four SOME-Bus segments to create twice the number of channels (each channel being twice as long to accommodate the additional nodes). Since information flows in only one direction on each SOME-Bus between the two halves of the network, amplifiers may easily be placed between SOME-Bus segments, as Figure 3 shows.

Figure 3: SOME-Bus expansion from N to 2N nodes

3. SWITCH-BASED ARCHITECTURES
In this section we summarize two switch-based architectures that can serve as a basis for performance comparison with the SOME-Bus when applications exhibit some or little locality. Locality is expected to be present in varying degrees (and possibly not at all), given that current and future applications exhibit dynamically changing behavior during execution. If communication traffic is distributed relatively equally among all pairs of processing nodes, then regular networks with small diameter have better potential to achieve high performance. Architectures with a large number of nodes are built on networks using switches with 8 or 16 ports and on topologies that include Clos, star, and torus. Under the assumption of uniform destination selection in a network with N nodes, if the node network interface bandwidth is less than the switch (and link) bandwidth, then the switch-to-switch links may be able to supply the required bisection bandwidth. However, current processing nodes contain one or several high-performance processors, and both the network interface and the switch are made of the same technology, with similar bandwidth capabilities which can be fully utilized. To maintain the needed bisection bandwidth under little application locality, a network may have to be constructed with multiple parallel links between pairs of switches. Clos networks and fat trees are used in several high-performance architectures [18].

For example, section 6 shows that processor utilization becomes low as the miss rate increases on a 2-D torus with 4 switches per dimension, 7 processors per switch, and 11 links per switch, for a total of 16 switches and 112 processors; when the application exhibits strong locality, utilization increases significantly. Similar behavior is observed on a star with one central switch and 10 ports per switch (the central switch has 13 ports), for a total of 14 switches and 117 processors. If there is little or no locality in the application, then any reasonable miss rate will cause a significant amount of traffic through the links that connect processing nodes to switches. To maintain the necessary large bisection
bandwidth, the networks must be augmented with several more links. The star network can be augmented with several central switches, resembling a tree with multiple roots (similar to a Clos network), as shown in Figure 4. The resulting network has 16 switches with 13 ports. Similarly, in the torus network, each switch-to-switch link is replaced by three links, resulting in 16 switches with 19 ports, as shown in Figure 5 (switch links shown open in the figure wrap around in the proper direction).

While it may be easy to incrementally expand a switch-based network by adding switches and nodes, the resulting hot spots and bottlenecks cause a dramatic loss of performance. For example, under the assumption of little locality, and to achieve full bisection bandwidth, a network of 112 nodes can be organized as a 4 x 4 torus using switches with P = 19 ports, seven of which are connected to processing nodes. To increase the system size to 175 nodes and maintain a torus with full bisection bandwidth (and seven nodes per switch), the number of switches must be increased to 5 x 5, and each switch must be expanded to 23 ports. Although smaller configurations with reduced bisection bandwidth are possible, section 6 shows that their performance approaches that of the full configuration only when locality is very strong, whereby most of the communication occurs between nodes attached to the same switch.
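The port counts quoted above follow from straightforward accounting; the sketch below reproduces them (the bookkeeping is ours, with the per-neighbor link counts taken from the text):

```python
# Port accounting for the torus configurations discussed above.
def torus_switch_ports(procs_per_switch: int, links_per_neighbor: int) -> int:
    """A 2-D torus switch has 4 neighbors; each neighbor connection is
    replicated to preserve bisection bandwidth as the torus grows."""
    return procs_per_switch + 4 * links_per_neighbor

print(torus_switch_ports(7, 3))  # 19 ports: 4x4 torus, 16*7 = 112 nodes
print(torus_switch_ports(7, 4))  # 23 ports: 5x5 torus, 25*7 = 175 nodes
```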

Traffic on the interconnection network consists of request and data/acknowledge messages, generated when a cache miss on one (local) node is directed to the memory of another (remote) node, together with additional messages which allow the caches to maintain data consistency. In all systems we assume strong integration of the transmitter, receiver and cache controller hardware to produce a highly integrated system-wide cache coherence mechanism.
Figure 4: Star network

In our simulations, we use the two switch-based networks described above, a 112-node 2-D torus and a 117-node star. In both architectures, the switches use input buffering and cut-through routing. To eliminate the head-of-line problem, each input port contains separate logical queues for each output. The switches contain high-speed routing tables, so that 20 clock cycles are needed for the head of a message to be routed and pass through the switch. The 2-D torus uses two virtual channels to avoid deadlock. In all architectures, one clock cycle is needed to transfer one byte over one link.

4. DSM PARADIGM
All systems examined in this paper operate under the distributed-shared-memory paradigm. They are cache-coherent non-uniform memory access (CC-NUMA) systems in which the shared virtual address space is distributed across local memories which can be accessed both by the local processor and by processors from remote nodes, with different access latencies.
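Each block of the shared address space therefore has a fixed home node holding its memory and directory entry. The paper does not specify the placement function; a common block-interleaved choice, shown here only for illustration, is:

```python
# Hypothetical block-interleaved placement of the shared address space.
BLOCK_SIZE = 64   # assumed cache-block size in bytes
N_NODES = 128     # system size used in the examples above

def home_node(addr: int) -> int:
    """Node holding the memory and directory entry for this block."""
    return (addr // BLOCK_SIZE) % N_NODES

def is_local_miss(addr: int, node: int) -> bool:
    """Local misses are serviced by the local memory; remote (global)
    misses require request/response messages over the network."""
    return home_node(addr) == node
```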

Figure 5: Torus network

A sequentially-consistent, multithreaded execution model is assumed: each processor executes a program which contains a set of M parallel threads. The node on which a group of M threads executes is called the host or owner node. A thread continues execution until it encounters a cache miss. If the miss can be serviced locally, the thread waits for the memory access and then continues running. If it is a global cache miss which requires data or permission from a remote node, the thread is blocked and another thread is chosen from the pool of threads ready to run. The blocked thread remains in this state until the required action is completed (data transferred from a remote memory or permission received), at which time it becomes ready for execution and eventually resumes.

In addition to the processor and the cache, each node contains a directory which maintains coherence information on the section of the distributed memory implemented in that node. The directory supports the typical MSI write-invalidate protocol. The node that contains the portion of global memory and the corresponding directory entry for a particular data block is called the home node for that block. A global cache miss may be due to a read or a write miss at the local cache. Accordingly, a data-request or ownership-request message is enqueued for transmission on the output channel.
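The directory actions described here can be condensed into a short sketch of the MSI write-invalidate service step; this is our simplification (a real controller pipelines these steps and gathers acknowledgments through the waiting-message queue described below):

```python
# Condensed sketch of a home directory servicing a global miss (MSI,
# write-invalidate). Message names follow the text; the code is ours.
from dataclasses import dataclass, field

@dataclass
class DirEntry:
    state: str = "I"                            # "I", "S" (shared), "M" (modified)
    copyset: set = field(default_factory=set)   # sharers, or the single owner

def service(entry: DirEntry, req: str, requester: int, send) -> None:
    if req == "data_request":          # global read miss
        if entry.state == "M":         # owner must downgrade and write back
            (owner,) = entry.copyset
            send(owner, "downgrade_request")
        entry.state = "S"
        entry.copyset.add(requester)
        send(requester, "data")        # large acknowledge carrying the block
    elif req == "ownership_request":   # global write miss
        for node in entry.copyset - {requester}:
            send(node, "invalidation") # acks collected before ownership is granted
        entry.state = "M"
        entry.copyset = {requester}
        send(requester, "ownership_ack")
```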

After transmission, the message is enqueued at the destination (remote) node, is serviced by the directory there, and another message is sent back to the originating node with data or an acknowledgment. The remote directory controller performs the necessary memory accesses, creates the response message and enqueues it for transmission on the output channel of the remote node. As part of servicing a message, a directory may send messages to other nodes and receive data or acknowledgments: when a request is received, the required data block may be in shared state or in modified state, so the home directory may send a data message back to the requestor (in the case of reading a shared block) or it may send downgrade or invalidation messages and collect invalidation acknowledge messages. A waiting-message queue stores requests waiting for acknowledge messages, so the directory can service multiple requests simultaneously. Cache blocks chosen for replacement result in a victim message to the home directory. If the block was in exclusive state, the victim message contains the writeback data; otherwise, the victim message is sent so the directory controller can remove the node from the copyset.

In the SOME-Bus, the directory controller multicasts invalidation and downgrade request messages to the relevant nodes. Every receiver can also monitor its input channels for invalidation messages and signal the cache controller to take appropriate action when locally cached data is affected. Messages are broadcast over the sending node's output channel, and the decision to accept or reject an input message is made at the receiver input rather than at the cache controller of each remote node. In the switch-based star architecture, messages are sent individually to all necessary destinations. In the switch-based torus architecture, a spanning tree is created from the source to all destination nodes and multidestination worms deliver the message.

5. FAULT-TOLERANCE
Backward Error Recovery (BER) is a popular fault-tolerance technique which periodically saves system information in recovery points (checkpoints) and allows the application to restart from an earlier, error-free state after a fault is encountered. Saving checkpoint data in the memories of other nodes [10] provides the advantages of higher speed (by avoiding the use of slower disks) and tolerance to multiple node failures, as long as all copies of a checkpoint do not reside on the failed nodes. In the basic BER version, each node maintains its own recovery data and a copy of the recovery data of another node. To create a new checkpoint, the nodes are synchronized and the cache controllers write back all exclusive blocks. Subsequently, each node creates an update message with all data blocks that were written since the last checkpoint was taken, and sends it to its backup node.
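The basic checkpoint step can be written out as follows; this is a sketch under our own naming, with synchronization and failure handling omitted:

```python
# Sketch of basic BER checkpointing: after the nodes synchronize, flush
# all exclusive cache blocks, then ship every block modified since the
# last checkpoint to the backup node in one update message.
def take_checkpoint(cache, memory, dirty_since_last: set,
                    backup_node: int, send) -> None:
    for block in cache.exclusive_blocks():    # 1) simultaneous writeback
        memory.writeback(block)
    update = {addr: memory.read(addr)         # 2) compile the update message
              for addr in dirty_since_last}
    send(backup_node, ("checkpoint_update", update))
    dirty_since_last.clear()                  # next interval starts clean
```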

Although this approach imposes no overhead during normal execution, maintaining the recovery memory at each checkpoint adds the overhead of 1) the simultaneous writeback of all cache blocks in the exclusive state and 2) the compilation and exchange of updates to the recovery memory in the backup nodes. Depending on the memory access pattern, as the interval between consecutive checkpoints grows, the number of modified blocks may also grow, resulting in more time spent creating each checkpoint.

An improved BER version [7] reduces the amount of data that must be transferred at a checkpoint by allowing the backup node to receive a copy of each cache writeback destined for the primary (home) node as it occurs. This ensures that the backup node has all the information necessary to update the backup recovery data locally rather than by receiving the updates remotely. To avoid the data transfer at checkpoint time, each node keeps a consistent backup copy of another node's local memory as well as its recovery memory: if Node X contains a copy of Node Y's local memory and recovery data, we refer to Node Y as the "home" location and Node X as the "home2" location for Node Y's local memory. When the home node sends an invalidation message, the home2 node changes the state of the memory block from shared to invalid and returns an acknowledge message. When a node sends either a downgrade writeback or a victim writeback message to the home node, it also sends a copy of the writeback to the home2 node.

Between checkpoints, writeback messages destined for one node are thus also delivered to its backup node. In addition, several other messages are delivered to the backup node so that it can determine the proper order of changes to each modified block and maintain a consistent backup copy. The delivery of messages to backup nodes places some additional load on the directory and cache controllers and, depending on the architecture, may cause additional traffic on the network. The improved BER approach requires that every request and acknowledge message be delivered both to its regular destination node and to the associated backup node. In the SOME-Bus network, this is accomplished simply by including an additional destination in the destination list in the header of every message and multicasting the message. In a switch-based network, multicast operations are more complicated and require additional resources inside the switches [1,2,12]. When no fault tolerance is implemented, the additional complexity of full multicast capability at the switch may not provide sufficient benefit under the DSM paradigm, where only invalidation requests are normally multicast. When fault tolerance is implemented, however, message delivery to the backup node does not require full multicast capability if the backup node is chosen to be another node attached to the same switch as the primary node.
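Under the improved scheme, each writeback simply travels to two destinations; a minimal sketch, with the home2 placement being an assumed convention (the paper fixes no particular mapping):

```python
# Improved BER: every downgrade or victim writeback goes to the home
# node and to the node holding the backup copy ("home2" in the text).
N_NODES = 128

def home2(home: int) -> int:
    return (home + 1) % N_NODES    # assumed backup placement convention

def send_writeback(block_addr: int, data: bytes, home: int, send) -> None:
    send(home, ("writeback", block_addr, data))
    send(home2(home), ("writeback", block_addr, data))
    # The backup copy stays current between checkpoints, so no bulk
    # transfer of modified blocks is needed when a checkpoint is taken.
```

On the SOME-Bus this duplication costs one extra entry in the message's destination list; on a switch-based network it becomes the scheduling constraint described next.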

In that case, the scheduling of the internal crossbar of the switch is slightly enhanced so that a message at an input queue starts transmission on an output link only when that output is free and the output leading to the backup node is also free. Once both links are available, the message is delivered simultaneously to both nodes.

6. PERFORMANCE COMPARISON
In this section we present simulation results which compare the performance of the SOME-Bus to the two switch-based architectures under the DSM paradigm, using message delivery to a backup node to support fault tolerance. Various degrees of locality are considered. In all architectures, the channel controller receives messages from the cache or directory controllers and places them in the output queue. Message transfer time depends on the message size as well as on overheads, routing delays and contention. With respect to size, messages fall into three types: request messages, large acknowledge messages carrying a cache block, and small acknowledge messages with no data. One clock cycle is the time required to transfer one byte over one link.

The primary parameter of the simulations is the mean thread run time R. Given the mean message size in bytes (S), the channel bandwidth (C), the miss rate (m), and the number of instructions per second performed by a processor (F), and assuming that one in four instructions is a memory reference, R = 4*T*C/(m*F*S), where T is the mean message transfer time. In currently available parallel architectures, the processor's number of instructions per second is typically about equal to the network link bandwidth in bytes per second, so that the ratio C/F is in the range of 1 to 2. A range of R between T and 20*T then corresponds to a miss rate in the range of approximately 0.25% to 5%. This range is sufficient to capture the system behavior under most common configurations and cache behavior.

In the simulations described in this paper, the sizes of the three message types are 32, 80 and 32 bytes, respectively. We assume that the probability that a message is a data request (due to a global read miss) is 0.9 and the probability that it is an ownership request (due to a global write miss) is 0.1. This assumption is consistent with commonly observed memory reference patterns, where write cycles constitute about 10% to 20% of all memory cycles. The probability that a block is found in shared state is 0.9.

To measure and compare the performance of the architectures we use processor utilization and average response time. Processor utilization is the fraction of time that the processor is busy executing its own threads. Response time is the interval from the instant when a cache miss causes a message to be enqueued on the output channel to the instant when the corresponding data (or acknowledgment) message arrives at the input queue of the originating node. In the simulations discussed here, we assume that each processing node has three threads and that the average number of invalidations per transaction (when necessary) is four. The thread run time is geometrically distributed with mean R. All times are measured in clock cycles.
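The relation above is easy to check numerically; the sketch below uses the quoted C/F range and an illustrative mean message size:

```python
# Numerical check of R = 4*T*C/(m*F*S). With C/F between 1 and 2 and one
# memory reference per 4 instructions, R between T and 20*T spans miss
# rates of roughly 0.25% to 5%. Values below are illustrative only.
def thread_run_time(T: float, C: float, m: float, F: float, S: float) -> float:
    """Mean thread run time R, in the same time units as T."""
    return 4.0 * T * C / (m * F * S)

T, S = 100.0, 80.0          # mean transfer time (cycles), mean size (bytes)
C_over_F = 1.0              # link bytes/s ~ processor instructions/s
for m in (0.0025, 0.05):    # miss rates of 0.25% and 5%
    R = thread_run_time(T, C_over_F, m, 1.0, S)
    print(f"m = {m:.2%}: R = {R:.0f} cycles = {R / T:.0f}*T")
```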
Figure 6: Processor utilization comparison, no fault tolerance (STAR-4 = star with four central switches; TOR-3 = torus with three links between switches)

Figure 6 shows the processor utilization obtained on the SOME-Bus and the two switch-based architectures described above as the miss rate is varied between 0.25% and 4%. For comparison, the figure also shows performance results for a single switch (one crossbar) connected to 112 processors. When little locality is present, the processor utilization in the star architecture with one central switch drops very quickly as network traffic increases with the miss rate. Even for very small miss-rate values, congestion in the network causes the response time to become very large and the processor utilization becomes negligible. A similar drop is observed in the torus when only one link is used between each pair of switches. In the crossbar, as the miss rate increases, several messages are sent to the same destination, with a resulting increase in input queue sizes and a corresponding increase in response times, as shown in Figure 8.

Figure 6 also compares the SOME-Bus to the augmented star network (with four central switches), the augmented torus network (with three links between pairs of switches) and the full crossbar, without application locality and without additional traffic due to fault tolerance. The crossbar performance is also the performance that can be expected from the two regular networks as application locality increases. Application locality has a strong effect on the performance of the switch-based architectures, as it reduces the traffic on the links between switches. In the simulations described here, locality is modeled by making the destination selection probability depend on path length. Figure 6 shows the resulting performance for several cases of locality. For the torus network, the figure shows three cases where any destination within a distance of 5, 4 or 2 links may be chosen; a distance of 5 corresponds to no locality, and a distance of 2 corresponds to full locality, where every node selects another node attached to the same switch. For the star network, the figure shows the two cases of no locality and full locality.

It is clear from Figures 6 and 8 that as locality increases, the traffic between switches drops and the response time is reduced, approaching the response time of the crossbar. Consequently, if the applications of interest exhibit significant locality, then networks such as the ones described here show reasonable performance and become viable alternatives to the crossbar (or to other networks with a larger number of switches, such as Clos networks). As Figure 8 shows, the SOME-Bus response time remains smaller than the crossbar response time over the whole range of miss-rate values. This reduction, together with the fact that several nodes can multicast simultaneously, results in a strong improvement in processor utilization. The SOME-Bus is insensitive to locality, while the other two networks achieve high performance only in the presence of significant locality.
Figure 8: Response time comparison, no fault tolerance (STAR-4 = star with four central switches; TOR-3 = torus with three links between switches)

Figure 7: Processor utilization comparison, with fault tolerance

Figure 7 shows the performance of all architectures with the additional traffic due to fault tolerance. Multicasting of messages has no effect on the SOME-Bus performance. The additional messages delivered to the backup node are processed by the directory (causing the status of blocks to change, or blocks to be written to memory). This activity causes the directory utilization to increase slightly but does not appreciably change the message response time on the SOME-Bus, as Figure 9 shows. The additional traffic has a direct effect on the performance of the crossbar and the switch-based architectures. As the number of messages increases, the output utilization also increases, with a resulting increase in the waiting time at the input queues. The increased response time causes the processor utilization to drop.

The addition of fault tolerance on the SOME-Bus architecture causes no loss of performance, and brings the additional benefit of allowing a local process to have direct access to more memory (the regular memory of the local node and the backup memory of another node). Although the SOME-Bus architecture is not normally affected by the presence or absence of locality, in this particular case an application that can take advantage of access to the backup memory will experience even better performance.

7. CONCLUSION
Small switch-based networks of workstations (with 10 to 20 nodes) show good performance even when applications do not exhibit a significant degree of locality. Larger networks with hundreds of nodes require a large number of switches and usually have a topology with a relatively large diameter. Under little application locality, the resulting network traffic causes increased latencies due to delays within the switches. Topologies with an even larger number of switches and multiple parallel links between switches are necessary to maintain the necessary bisection bandwidth, resulting in highly complex switches and in networks that require more resources to scale up.

In contrast, the SOME-Bus network of workstations provides non-blocking communication and high bandwidth, scaling directly with the number of workstations. Consequently, messages experience very low and predictable latency, a property that can be exploited by the operating system to perform extensive thread placement and migration dynamically, and to successfully manage the level of parallelism present in large applications. Simulation results show that, when compared to large torus and star topologies with sufficiently high bisection bandwidth, the SOME-Bus achieves much higher processor utilization and smaller response time. The difference in performance is even more pronounced when the network carries additional traffic to support a fault-tolerant DSM protocol that allows backward error recovery.
Figure 9: Response time comparison, with fault tolerance

8. REFERENCES
1. Ajmone Marsan, M., Bianco, A., Giaccone, P., Leonardi, E., Neri, F., "On the throughput of input-queued cell-based switches with multicast traffic", IEEE INFOCOM 2001, Conference on Computer Communications, 2001, pp. 1664-1672, vol. 3.
2. Ajmone Marsan, M., Bianco, A., Giaccone, P., Leonardi, E., Neri, F., "Optimal multicast scheduling in input-queued switches", ICC 2001, IEEE International Conference on Communications, 2001, pp. 2021-2027, vol. 7.
3. Amza, C., Cox, A.L., Dwarkadas, S., Keleher, P., Lu, H., Rajamony, R., Yu, W., Zwaenepoel, W., "TreadMarks: Shared Memory Computing on Networks of Workstations", IEEE Computer, Vol. 29, No. 2, pp. 18-28, February 1996.
4. Dolphin SCI Interconnect, http://www.dolphinics.com.
5. Dong, L., Ortega, B., Reekie, L., "Coupling characteristics of cladding modes in tilted optical fiber gratings", Applied Optics, Vol. 37, No. 22, pp. 5099-5105, August 1998.
6. Flich, J., Malumbres, M.P., Lopez, P., Duato, J., "Performance evaluation of networks of workstations with hardware shared memory model using execution-driven simulation", 1999 International Conference on Parallel Processing, IEEE, 1999, pp. 146-153.
7. Hecht, D., Katsinis, C., "Protocols for Fault-Tolerant Distributed Shared Memory on the SOME-Bus Multiprocessor Architecture", IEEE Workshop on Fault-Tolerant Parallel and Distributed Systems, Fort Lauderdale, Florida, April 2002.
8. Kaminsky, V., Kacnelson, L., Gerlovin, I., Zanis, A., "On optimal design of network topology", Proceedings of the 23rd Annual Conference on Local Computer Networks (LCN '98), 1998, pp. 297-304.
9. Katsinis, C., "Performance Analysis of the Simultaneous Optical Multiprocessor Exchange Bus", Parallel Computing, Vol. 27, No. 8, pp. 1079-1115, July 2001.
10. Kermarrec, A.M., Morin, C., Banatre, M., "Design, implementation and evaluation of ICARE: an efficient recoverable DSM", Software - Practice and Experience, Vol. 28, No. 9, pp. 981-1010, July 1998.
11. Li, Y., Wang, T., Fasanella, K., "Cost-Effective Side-Coupling Polymer Fiber Optics for Optical Interconnections", Journal of Lightwave Technology, Vol. 16, No. 5, pp. 892-901, May 1998.
12. Yang, M., Ni, L.M., "Design of scalable and multicast-capable cut-through switches for high-speed LANs", 1997 International Conference on Parallel Processing, IEEE, 1997, pp. 324-332.
13. Norton, C.D., Cwik, T.A., "Early experiences with the Myricom 2000 switch on an SMP Beowulf-class cluster for unstructured adaptive meshing", International Conference on Cluster Computing, 2001, pp. 7-14.
14. Nowatzyk, A.G., et al., "S-Connect: From networks of workstations to supercomputer performance", Proceedings of the 22nd International Symposium on Computer Architecture, pp. 71-82, June 1995.
15. Ravindran, G., Stumm, M., "On topology and bisection bandwidth of hierarchical-ring networks for shared-memory multiprocessors", Fifth International Conference on High Performance Computing, IEEE, 1998, pp. 262-269.
16. Speight, E., Bennett, J.K., "Brazos: A third-generation DSM system", Proceedings of the 1997 USENIX Windows/NT Workshop, August 1997.
17. Youssef, H., Sait, S.M., Khan, S.A., "An evolutionary algorithm for network topology design", Proceedings of the International Joint Conference on Neural Networks (IJCNN '01), 2001, pp. 744-749, vol. 1.
18. Myrinet, www.myrinet.com.
