previous approaches. Some of them use a sharing code smaller than N bits by storing an in-excess representation of the nodes that hold a line, and, unlike some earlier schemes, do not always fall back on broadcast. We will refer to these proposals as compressed sharing codes (also known as multicast protocols [16] and limited broadcast protocols [1]), as opposed to exact ones (1) (such as limited pointers [4] or Full-Map). The goal of a compressed sharing code is twofold: to minimize both the sharing code size and the number of unnecessary coherence messages. Note that, as stated in [16], unnecessary coherence messages can have several potential negative effects, such as increased contention in the network and wasted cycles at the nodes that must process them.

(1) This is not exact in our evaluation environment because replacement hints are not sent for lines in shared state. Thus, some invalidation messages may be unnecessary.

The rest of the paper is organized as follows. The related work is presented in Section 2. Section 3 introduces some of the compressed sharing codes proposed in the literature. It also presents the multilayer clustering approach and three new sharing codes derived from it. A new directory organization is proposed and justified in Section 4. Section 5 shows a detailed performance evaluation of our novel proposals. Finally, Section 6 concludes the paper.

2. Related Work

Directory-based cache coherence protocols are accepted as the common technique in large-scale shared-memory multiprocessors because they are more scalable than snooping protocols. Although directory protocols have been extensively studied in the past, memory overhead and long remote accesses remain the major hurdles to scalability.

Memory overhead is usually managed from two orthogonal points of view: reducing directory width and reducing directory height. Some authors proposed to reduce the width of directory entries by using compressed sharing codes: Coarse Vector [9], which is currently employed in the SGI Origin 2000 multiprocessor [15], Tristate [1], Gray-Tristate [16] and Home [16].

Other proposals reduce directory width by having a limited number of pointers per entry to keep track of sharers [1][4][22]. Differences between them are mainly found in the way they handle overflow situations [6]. A comparison with such directory schemes is out of the scope of this paper.

A third alternative way of keeping track of sharers is the chained directory protocol, such as the IEEE Standard Scalable Coherent Interface (SCI) [10]. It relies on distributing the sharing code among the nodes that cache the line. Each directory entry has a pointer to the first sharer in the list, which in turn has a pointer to the second sharer, and so on. All nodes holding a copy of the line are obtained by performing a list traversal. Optimizations to this proposal can be found in [5][12][18]. All these schemes introduce significant overhead, drastically increasing the latency of coherence transactions. We have not considered these organizations because they represent a different approach from the implementation point of view.

Instead of decreasing directory width, other schemes propose reducing directory height (the number of directory entries) by combining several directory entries into a single entry [21] or by organizing the directory as a cache [9][19].

Recently, some proposals have appeared focusing on reducing the penalty introduced by remote accesses. In [13], [14] and [17], coherence messages are predicted. Bilir et al. [2] try to predict which nodes must receive each coherence transaction. If the prediction hits, the protocol approximates the snooping behavior (although the directory must be accessed in order to verify the prediction).

3. Multilayer Clustering Concept

This section presents several new sharing code organizations based on the multilayer clustering approach. The aim of these proposals is to reduce the directory width without significantly increasing the percentage of unnecessary coherence messages. We refer to these schemes as compressed sharing codes since the full information is actually compressed in order to be represented using a smaller number of bits, introducing a loss of precision. This means that, when this information is reconstructed, more sharers than necessary may appear.

Two of the already proposed compressed sharing codes are Gray-Tristate and Coarse Vector. Tristate uses a log2(N)-digit code requiring 2 bits per digit. The j-th digit of the code is 0 if the j-th bit of all sharers is 0; the digit is 1 if all sharers have that bit equal to 1; and the digit is "both" otherwise. Gray-Tristate differs from Tristate in that it uses Gray code to number the nodes. Coarse Vector uses N/K bits, where a bit is set if any of the processors in a K-processor group cached the line. These sharing codes were shown to achieve very good performance results in terms of precision, but execution times were not reported [16].

We propose a new family of compressed sharing codes based on the multilayer clustering concept. Nodes are recursively grouped into clusters of equal size until all the nodes are grouped into a single cluster. Compression is achieved by specifying the smallest cluster containing all the sharers (instead of indicating all the sharers). Compression can be increased even more by indicating only the level of the cluster in the hierarchy. In this case, it is assumed that the cluster is the one containing the home node for the memory line. This approach is valid for any network topology.

Although clusters can be formed by grouping any integer number of clusters in the immediately lower layer of the hierarchy, we analyze the case of using a value equal to two. That is, each cluster contains two clusters of the immediately lower level. By doing so, we simplify the binary representation and obtain better granularity to specify the set of sharers. This recursive grouping into layer clusters leads to a binary tree with the nodes located at the leaves.

As an example of the application of this approach, we propose three new compressed sharing codes. The new sharing codes can be graphically shown by considering the distinction between the logical and the physical organizations. For example, we can have a 16-node system with a mesh interconnection network as shown in Figure 1(a), and we can imagine the same system as a binary tree (multilayer system), as shown in Figure 1(b). In this representation, each subtree is a cluster. It can be observed that this binary tree is composed of 5 layers or levels (log2(N) + 1 in general, where N is a power of 2).

From this, we derive three new sharing codes:

Binary Tree (BT): Since nodes are located at the leaves of a tree, the set of nodes (sharers) holding a copy of a particular memory line can be expressed as the level of the root of the minimal subtree that includes the home node and all the sharers (which can be expressed using just log2(log2(N) + 1) bits, rounded up). Intuitively, the set of sharers is obtained from the home node identifier by changing the value of some of its least significant bits to don't care. The number of such bits is equal to the level of the above-mentioned subtree. It constitutes a very compact sharing code (observe that, for a 128-node system, only 3 bits per directory entry are needed), but its precision may be low, especially when few sharers,
distant in the tree, are found. As an example, consider a 16-node system such as the one shown in Figure 1(a), and assume that nodes 1, 4 and 5 hold a copy of a certain memory line whose home node is 0. In this case, node 0 would store 3 as the tree level value, which is the one covering all the sharers (see Figure 1(b)).

Binary Tree with Symmetric Nodes (BT-SN): We introduce the concept of the symmetric nodes of a particular home node. Assuming that 3 additional symmetric nodes will be assigned to each home node, they will be codified by different combinations of the two most-significant bits of the home node identifier. In the example of Figure 1, nodes 4, 8 and 12 are the symmetric nodes of node 0. The tree level can now be computed from node 0 or from any of its symmetric nodes. In this way, the one which encodes the smallest number of nodes is selected. In this particular example, the tree level 3 must be used to cover all the sharers.

Binary Tree with Subtrees (BT-SuT): This scheme represents our most elaborate proposal. It solves the common case of a single sharer by directly encoding the identifier of that sharer. Thus, the sharing code size is, at least, log2(N) bits. When several nodes are caching the same memory line, an alternative representation is chosen. Instead of using a single subtree to include all the sharers, two subtrees are employed. One of them is computed from the home node; for the other one, a symmetric node is employed. Using both subtrees, the whole set of sharers must be covered while minimizing the number of included nodes. Now, each directory entry will have two fields to codify the levels of these subtrees and an additional field to represent the symmetric node used. An additional bit is needed to indicate the representation used (single sharer or subtrees). Note that, in order to optimize the number of bits required for this representation, we take into account the maximum size of the subtrees, which depends on the number of symmetric nodes used. For the previous example, the symmetric nodes do not change (i.e., nodes 4, 8 and 12). Node 0 would notice that the sharing code value implying fewer nodes is obtained by selecting node 4 as the symmetric node. Then, it encodes its own tree level as 1 (covering node 1) and the tree level for the symmetric node as 1 (covering nodes 4 and 5).

Figure 1. Multilayer Clustering Example. (a) Physical system: a 16-node mesh, nodes 0 to 15. (b) Logical system: the same nodes at the leaves (level 0) of a binary tree whose root is at level 4; each subtree is a cluster.

Table 1. Size (in bits) of each sharing code for an N-node system

  Full-Map        N
  None            0
  Gray-Tristate   2 log2(N)
  Coarse Vector   N/K (N/4 for K = 4)
  Binary Tree     log2(log2(N) + 1), rounded up
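To make the encoding concrete, here is a small Python sketch of the BT scheme described above. The function names are ours, not from the paper, and N is assumed to be a power of two:

```python
# Sketch of the Binary Tree (BT) compressed sharing code.
# A directory entry stores only the level of the minimal subtree
# (in the logical binary tree of Figure 1(b)) that covers the home
# node and every sharer; decoding yields an in-excess superset.
import math

def bt_encode(home: int, sharers: set[int], n_nodes: int) -> int:
    """Return the level of the minimal subtree covering home and all sharers."""
    max_level = int(math.log2(n_nodes))
    for level in range(max_level + 1):
        # The level-`level` subtree containing `home` spans the nodes whose
        # identifiers match home's except for the `level` least significant bits.
        base = (home >> level) << level
        covered = set(range(base, base + (1 << level)))
        if sharers <= covered:
            return level
    return max_level  # the root covers every node

def bt_decode(home: int, level: int) -> set[int]:
    """Reconstruct the (possibly in-excess) set of sharers from the stored level."""
    base = (home >> level) << level
    return set(range(base, base + (1 << level)))

# The 16-node example from Figure 1: home node 0, sharers {1, 4, 5}.
level = bt_encode(0, {1, 4, 5}, 16)   # -> 3, as in the text
superset = bt_decode(0, level)        # -> nodes 0..7, in excess of {1, 4, 5}
```

Running it on the Figure 1 example reproduces the behavior described above: the entry stores level 3, and decoding returns nodes 0 through 7, a superset of the real sharers. For a 128-node system the stored level fits in 3 bits, matching the text.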
Finally, all three schemes proposed in this work achieve a much lower memory overhead than the previously proposed ones.

4. Two-Level Directory

In addition to reducing directory entry width, an orthogonal way to diminish directory memory overhead is to reduce the total number of directory entries. That is, the total number of entries of the directory is less than the total number of memory lines. Therefore, the directory can be organized as a cache [9][19]. The observation that motivates the utilization of fewer directory entries is that only a small fraction of the memory lines will be in use at a given time. Such a directory organization is known as a sparse directory. The main drawback of this directory cache concerns how replacements are managed. One approach [6] invalidates all shared copies of a memory line whose associated directory entry is evicted. This may drastically affect performance due to unnecessary invalidations.

In this work, we propose a solution to the scalability problem that combines the benefits of both previous solutions. Our two-level directory architecture merges the properties of sparse directories and compressed sharing codes. In this way, we can increase the scalability of cc-NUMAs without degrading performance.

4.1. Two-Level Directory Architecture

In a two-level directory organization we distinguish two clearly decoupled structures:

1. First-level Directory (or Uncompressed Structure): It consists of a small set of directory entries, each one containing a precise sharing code (for instance, Full-Map or a limited set of pointers).

2. Second-level Directory (or Compressed Structure): In this level, a directory entry is assigned to each memory line. We will use the compressed sharing codes proposed in this work (BT, BT-SN and BT-SuT), since their very low memory overhead makes them much more scalable for this level than other schemes.

While the compressed structure has an entry for each memory line assigned to a particular node, the uncompressed structure has just a few entries, used by only a small subset of memory lines. Thus, for a certain memory line, in-excess information is always available in the second-level directory, but a precise sharing code will only occasionally be present in the first-level directory, depending on the temporal locality exhibited by this line. Note that, in this organization, if the hit rate of the first-level directory is high, the final performance of the system should approximate the one obtained with a Full-Map directory. This hit ratio depends on several factors: the size of the uncompressed structure, the replacement policy, and the temporal locality exhibited by the application.

Figure 2 shows the architecture of our two-level directory. State bits are contained only in the compressed structure. Tag information must also be stored in the uncompressed structure in order to determine whether there is a hit. Both directory levels are accessed in parallel, and a Choose & Convert logic selects the precise information, if it is present, or the compressed information otherwise. In the latter case, the compressed sharing code is converted into its Full-Map representation.

Figure 2. Two-level directory architecture.

4.2. Implementation Issues

In sparse directories, when an entry is replaced, invalidation messages are sent to all the nodes encoded in the evicted entry [6]. This affects the cache miss rate (and therefore the final performance) of processors holding a remote copy of the line, since they receive the invalidation not because of a remote write but because of a replacement in the remote directory cache. We refer to these invalidations as premature invalidations. These misses would not occur if
a directory with correct information per memory line were used.

On the other hand, our two-level directory organization will never send premature invalidations, since correct information per memory line is always kept in the main (compressed) directory. In our organization, a miss in the first-level directory simply causes the second level to supply the sharing information. As opposed to premature invalidations, unnecessary coherence messages will never increase the cache miss rate with respect to a Full-Map directory implementation.

In this paper we assume the organization of the first-level directory to be fully associative, with an LRU replacement policy. Entries are managed as follows:

a) When a request for a certain line arrives at the home node, an entry is allocated in the first-level directory if the line is in the uncached state, or if an exclusive request was received.

b) Since this uncompressed structure is quite small, capacity misses can degrade performance. In order to reduce such misses, an entry in the first-level directory is freed when a write-back message for a memory line in exclusive state is received. This means that this line is no longer cached in any of the system nodes, so its corresponding entry is available for other lines.

In addition, entries in the uncompressed structure are not allocated as long as there exists a single node holding a copy of the line and its identifier can be precisely encoded with the sharing code of the second-level directory. Since the compressed sharing code provides precise information in this case, allocating an entry in the first level would be unnecessary (in these cases, the allocation can be done when a non-exclusive request from another node arrives). Finally, replacements in the first-level directory are not allowed for entries associated with memory lines with pending coherence transactions.

While more elaborate management algorithms are possible and will be considered in the near future, we concentrate on this immediate implementation to demonstrate the viability of our proposal.

5. Performance Results

In this section, we present a detailed performance evaluation of our novel proposals based on execution-driven simulations.

5.1. Simulation Environment

We have used a modified version of the Rice Simulator for ILP Multiprocessors (RSIM), a detailed execution-driven simulator [20]. RSIM models an out-of-order superscalar processor pipeline, a two-level cache hierarchy, a split-transaction bus on each processor node, and an aggressive memory and multiprocessor interconnection network subsystem, including contention at all resources. The modeled system implements a Full-Map invalidation-based four-state MESI directory cache-coherent protocol and sequential consistency. Table 2 summarizes the parameters of the simulated system. These values have been chosen to be similar to the parameters of current multiprocessors. However, caches are smaller due to the limitations of the benchmarks utilized for evaluation (SPLASH-2). These sizes have been chosen following the working set characterizations found in [25]. With all these parameters, the resulting no-contention round-trip latency of read requests satisfied at various levels of the memory hierarchy is shown in Table 3.

Table 2. Base system parameters

  ILP Processor:
    Processor speed                          1 GHz
    Max. fetch/retire rate                   4
    Instruction window                       64
    Functional units                         2 integer arithmetic, 2 floating point, 2 address generation
    Memory queue size                        32 entries
  Cache Parameters:
    Cache line size                          64 bytes
    L1 cache (on-chip, WT)                   direct mapped, 16 KB
    L1 request ports                         2
    L1 hit time                              2 cycles
    L2 cache (off-chip, WB)                  4-way associative, 64 KB
    L2 request ports                         1
    L2 hit time                              15 cycles, pipelined
    MSHRs                                    8 per cache
  Memory Parameters:
    Memory access time                       60 cycles (60 ns)
    Memory interleaving                      4-way
    First coherence message creation time    40 cycles
    Next coherence messages creation time    20 cycles
  Internal Bus Parameters:
    Bus speed                                1 GHz
    Bus width                                8 bytes

Table 3. No-contention round-trip latency of reads

  Round-trip access latency (cycles):
    Secondary cache   19
    Local             97
    Remote (1-hop)    137

The original coherence protocol was designed around precise sharing codes, particularly a Full-Map sharing code. This protocol was extended to support compressed sharing codes and, in particular, to handle the presence of unnecessary coherence messages.

We have selected several numerical applications to investigate the potential performance benefits of our proposals. The application programs used in our evaluations are Water from the SPLASH benchmark suite [23] and FFT, Radix, Barnes-Hut and LU from the SPLASH-2 benchmark suite [25]. The input data sizes are shown in Table 4. All experimental results reported in this paper are for the parallel phase of these applications. All the applications include code to
distribute the data among the physically distributed memories in a cc-NUMA system, based on the given recommendations.

Table 4. Applications and input sizes

  FFT          64k
  Water        512 molecules
  Radix        2M keys, 1024 radix
  Barnes-Hut   4K bodies
  LU           512 by 512 matrix, block 16

Table 5. Execution times, number of coherence events and messages per coherence event when the Full-Map sharing code is used

  Application   Cycles   Coherence events   Messages per coherence event
  FFT           3.6      178.66             1.00
  Water         17.68    825.84             1.09
  Radix         23.73    91.58              1.19
  Barnes-Hut    16.58    106.98             2.58
  LU            77.4     35.6               1.06
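Before turning to the results, the two-level organization evaluated in the following subsections (Section 4) can be sketched as a minimal Python model. BT is assumed for the second level, the class and method names are ours, and the first level's LRU replacement and state bits are omitted for brevity:

```python
# Minimal model of the two-level directory lookup from Section 4
# (illustrative names; not the paper's implementation).

def bt_decode(home: int, level: int) -> set[int]:
    """Expand a BT level into its (possibly in-excess) set of nodes."""
    base = (home >> level) << level
    return set(range(base, base + (1 << level)))

class TwoLevelDirectory:
    def __init__(self, n_nodes: int):
        self.n_nodes = n_nodes
        self.first = {}   # line address -> precise (Full-Map-like) sharer set
        self.second = {}  # line address -> compressed BT level (one per line)

    def lookup(self, line: int, home: int) -> set[int]:
        # Both levels are probed in parallel in hardware; the Choose &
        # Convert logic prefers the precise first-level entry on a tag hit,
        if line in self.first:
            return self.first[line]
        # and otherwise converts the compressed code into its Full-Map
        # representation, which is always available but may be in excess.
        return bt_decode(home, self.second.get(line, 0))

d = TwoLevelDirectory(16)
d.second[0x40] = 3                 # only compressed info: covers nodes 0..7
miss_view = d.lookup(0x40, home=0) # in-excess set {0, 1, ..., 7}
d.first[0x40] = {1, 4, 5}          # precise entry allocated in the first level
hit_view = d.lookup(0x40, home=0)  # exact set {1, 4, 5}
```

As the last four lines illustrate, a first-level miss supplies the in-excess superset from the compressed structure, while a first-level hit returns the exact sharer set, which is why a high first-level hit rate approximates Full-Map behavior.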
We chose 64 processors for our modeled system in order to evaluate the impact of compressed sharing codes for a reasonable number of nodes. Finally, for the Coarse Vector sharing code scheme, we assume a value of K equal to 4.

5.2. Simulation Results and Analysis

5.2.1. Compressed Sharing Codes Evaluation

In Section 3, three novel compressed sharing codes were introduced: Binary Tree (BT), Binary Tree with Symmetric Nodes (BT-SN) and Binary Tree with Subtrees (BT-SuT). These schemes have been shown to obtain good scalability and the best results in terms of memory overhead. However, performance results in terms of execution time and number of unnecessary coherence messages are needed to evaluate the feasibility of such schemes.

The Full-Map sharing code provides the best execution times, since unnecessary coherence messages are completely eliminated. Table 5 shows the execution time (in processor cycles) when the Full-Map sharing code is used (second column), the total number of coherence events (third column), and the average number of messages sent per coherence event (fourth column). The last column shows that, in most cases, few messages are sent per coherence event, which makes compressed sharing codes not degrade the performance significantly when compared with a Full-Map directory.

As expected, Water performance is significantly degraded when no sharing code exists (a slowdown of 10). On the other hand, the performance of Radix is also degraded, but not so significantly (a 75% slowdown). Finally, slowdowns of 3.82 and of more than 4 are observed for FFT and Barnes-Hut, respectively. This means that eliminating the sharing code does not constitute an effective solution to the scalability problem. As far as our schemes are concerned, BT obtains the worst results (as expected): it causes a performance degradation which ranges from 1.31 for Radix to 6.40 for Water. This degradation is strongly related to the amount of coherence transactions in the applications. However, the execution time for FFT (which has a considerable number of such transactions) is only increased by a factor of 1.52.

BT-SuT uses half of the storage required by Gray-Tristate. Thus, we can conclude that it achieves a better trade-off between memory overhead and performance.

5.2.2. Two-Level Directory Evaluation

As expected, some performance degradation occurs when compressed sharing codes are used. Such degradation depends on both the compressed sharing code used and the sharing patterns of the applications. This degradation can be negligible (when BT-SuT is used, the slowdown for Radix is only 0.3%) or significant (using BT-SuT, 49% for Barnes-Hut). In order to recover as much of the lost performance as possible, we organize the directory as a multilevel architecture. In this section, we evaluate the two-level directory organization proposed in Section 4.

Figure 5 shows the normalized execution time (with respect to a Full-Map directory) obtained with a two-level directory when no sharing code (None) is kept in the second level. We evaluated two different sizes for the Full-Map first-level directory: 256 and 512 entries. (2)

(2) Since the main memory is four-way interleaved, each memory module has a first-level directory of 64 and 128 entries, respectively.

Figure 5. Normalized execution times for None and first-level directory (FM).

Figure 6 presents the same results when the BT scheme is utilized for the second-level directory. The overhead introduced by such an aggressive compressed sharing code is almost hidden by the first-level directory (mainly for FFT, Water and Barnes-Hut), although 512 entries are needed. These results are very promising, since the scalability of multiprocessors can be significantly increased using such a scalable main directory while performance is kept almost intact due to the presence of the first-level directory. Nevertheless, Radix still presents a performance degradation, which may indicate that BT could be too aggressive for the second level.

Figure 6. Normalized execution times for BT and first-level directory (FM).

Figure 7 shows the performance of the two-level directory when the BT-SN sharing code is considered for the compressed structure. In this case, 512 entries for the first-level directory are enough to almost eliminate the performance penalty introduced by BT-SN in Radix. Note that this would be a good compromise between scalability and performance, since BT-SN only needs two additional bits per directory entry with respect to BT.

Figure 7. Normalized execution times for BT-SN and first-level directory (FM).

Finally, Figure 8 depicts the same results when BT-SuT is considered for the second-level directory. Observe that using just 256 entries in the first level (which constitutes 3.12% of the L2 size) practically eliminates all the performance degradation introduced by BT-SuT. This means that a two-level architecture composed of a very small Full-Map first-level directory and a compressed second-level directory may constitute an effective solution to the problem of directory scalability, causing negligible performance degradation.

Figure 8. Normalized execution times for BT-SuT and first-level directory (FM).

6. Conclusions

The major objective of this work has been to overcome the scalability limitations that directory memory overhead imposes on current shared-memory multiprocessors. First, the multilayer clustering concept is introduced and, from it, three new compressed sharing codes are derived. Binary Tree, Binary Tree with Symmetric Nodes and Binary Tree with Subtrees are proposed as new compressed sharing codes with lower memory requirements than existing ones. Compressed sharing codes reduce the directory entry width associated with a memory line by keeping an in-excess representation of the nodes holding a copy of this line. Unnecessary coherence messages, which degrade the performance of directory protocols, appear as a result of this inexact way of keeping track of sharers. A comparison between our three proposals and the Full-Map sharing code is carried out in order to evaluate such degradation. Also, a comparison with two of the most relevant existing compressed sharing codes, Coarse Vector and Gray-Tristate, is performed. Results show that compressed directories slow down the applications as a consequence of unnecessary coherence messages. Despite this degradation, our proposals exhibit a lower performance penalty (up to 49% in the worst case) and lower memory overhead than previously proposed compressed sharing codes.

In order to alleviate the performance penalty introduced by compressed sharing codes, a novel directory architecture has also been proposed. Two-level directory architectures combine a very small uncompressed first-level structure (Full-Map) with a compressed second-level structure. Results for this directory organization show that, for a 256-entry first-level directory and a BT-SuT second-level directory, the performance achieved is very similar to that obtained by systems using big and non-scalable Full-Map directories. Therefore, the directory architecture proposed in this paper drastically reduces directory memory overhead while achieving similar performance. Thus, directory scalability is significantly improved.

Further research in this field involves the application of this directory organization not only to improve scalability but also to improve performance. This can be accomplished by moving the directory controller and the first-level directory into the processor chip. In this way, the time required by coherence transactions in the critical path would be significantly reduced. Note that some current processors already include an on-chip memory controller and network interface [11].
Besides, medium-scale multiprocessors use snooping protocols over a scalable interconnection network, since buses are unfeasible for more than a few processors (an example of this is the Sun E10000). It will be interesting to evaluate whether the use of broadcast is better than the use of our two-level directory organization with an on-chip first-level directory. The cost of broadcast coherence transactions in snooping protocols and the resulting network saturation may be overcome with fast processor-to-processor directory access and multicast coherence messages.

7. Acknowledgments

This research has been carried out using the resources of the Centre de Computació i Comunicacions de Catalunya (CESCA-CEPBA). This work has been supported in part by the Spanish CICYT under grant TIC97-0897-C04.

References

[1] A. Agarwal, R. Simoni, M. Horowitz and J. Hennessy. An Evaluation of Directory Schemes for Cache Coherence. Proc. of the 15th Int'l Symposium on Computer Architecture, pp. 280-289, 1988.

[2] E.E. Bilir, R.M. Dickson, Y. Hu, M. Plakal, D.J. Sorin, M.D. Hill and D.A. Wood. Multicast Snooping: A New Coherence Method Using a Multicast Address Network. Proc. of the 26th Int'l Symposium on Computer Architecture, May 1999.

[3] L.M. Censier and P. Feautrier. A New Solution to Coherence Problems in Multicache Systems. IEEE Transactions on Computers, 27(12), pp. 1112-1118, December 1978.

[4] D. Chaiken, J. Kubiatowicz and A. Agarwal. LimitLESS Directories: A Scalable Cache Coherence Scheme. Proc. of the Int'l Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS IV), pp. 224-234, April 1991.

[5] Y. Chang and L.N. Bhuyan. An Efficient Hybrid Cache Coherence Protocol for Shared Memory Multiprocessors. IEEE Transactions on Computers, March 1999.

[6] D.E. Culler, J.P. Singh and A. Gupta. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann Publishers, Inc., 1999.

[7] J. Duato, S. Yalamanchili and L.M. Ni. Interconnection Networks: An Engineering Approach. IEEE Computer Society, Los Alamitos, 1997.

[8] J. Goodman. Using Cache Memories to Reduce Processor-Memory Traffic. Proc. of the Int'l Symposium on Computer Architecture, June 1983.

[9] A. Gupta, W.-D. Weber and T. Mowry. Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes. Proc. of the Int'l Conference on Parallel Processing, pp. I:312-321, August 1990.

[10] D. Gustavson. The Scalable Coherent Interface and Related Standards Projects. IEEE Micro, 12(1), pp. 10-22, 1992.

[11] L. Gwennap. Alpha 21364 to Ease Memory Bottleneck. Microprocessor Report, pp. 12-15, October 1998.

[12] R.E. Johnson. Extending the Scalable Coherent Interface for Large-Scale Shared-Memory Multiprocessors. PhD Thesis, University of Wisconsin-Madison, 1993.

[13] S. Kaxiras and C. Young. Coherence Communication Prediction in Shared-Memory Multiprocessors. Proc. of the 6th Int'l Symposium on High-Performance Computer Architecture, January 2000.

[14] A. Lai and B. Falsafi. Memory Sharing Predictor: The Key to a Speculative DSM. Proc. of the 26th Int'l Symposium on Computer Architecture, May 1999.

[15] J. Laudon and D. Lenoski. The SGI Origin: A ccNUMA Highly Scalable Server. Proc. of the 24th Int'l Symposium on Computer Architecture, 1997.

[16] S.S. Mukherjee and M.D. Hill. An Evaluation of Directory Protocols for Medium-Scale Shared-Memory Multiprocessors. Proc. of the 8th ACM Int'l Conference on Supercomputing, July 1994.

[17] S.S. Mukherjee and M.D. Hill. Using Prediction to Accelerate Coherence Protocols. Proc. of the Int'l Symposium on Computer Architecture, July 1998.

[18] H. Nilsson and P. Stenström. The Scalable Tree Protocol – A Cache Coherence Approach for Large-Scale Multiprocessors. Proc. of the 4th IEEE Symposium on Parallel and Distributed Processing, pp. 498-506, December 1992.

[19] B. O'Krafka and A. Newton. An Empirical Evaluation of Two Memory-Efficient Directory Methods. Proc. of the 17th Int'l Symposium on Computer Architecture, pp. 138-147, May 1990.

[20] V.S. Pai, P. Ranganathan and S.V. Adve. RSIM Reference Manual, version 1.0. Technical Report 9705, Department of Electrical and Computer Engineering, Rice University, August 1997.

[21] R. Simoni. Cache Coherence Directories for Scalable Multiprocessors. PhD Thesis, Stanford University, 1992.

[22] R. Simoni and M. Horowitz. Dynamic Pointer Allocation for Scalable Cache Coherence Directories. Proc. of the Int'l Symposium on Shared Memory Multiprocessing, pp. 72-81, April 1991.

[23] J.P. Singh, W.-D. Weber and A. Gupta. SPLASH: Stanford Parallel Applications for Shared-Memory. Computer Architecture News, vol. 20, pp. 5-44, March 1992.

[24] C. Tang. Cache Design in a Tightly Coupled Multiprocessor System. Proc. of the AFIPS Conference, pp. 749-753, June 1976.

[25] S.C. Woo, M. Ohara, E. Torrie, J.P. Singh and A. Gupta. The SPLASH-2 Programs: Characterization and Methodological Considerations. Proc. of the 22nd Int'l Symposium on Computer Architecture, pp. 24-36, June 1995.