
Efficient Distributed Shared Memory on a Single System Image Operating System

Jesús De Oliveira, Jesús Federico, Rafael Chacón, David Zaragoza, Yudith Cardinale
Universidad Simón Bolívar,
Departamento de Computación y Tecnología de la Información,
Apartado 89000, Caracas 1080-A, Venezuela

{jdeoliveira,jfederico,rchacon,dzaragoza,yudith}@ldc.usb.ve

Abstract

In this paper we describe KRC, a kernel-level implementation of the DSM paradigm using the Release Consistency (RC) Memory Model on the Kerrighed Single System Image operating system. KRC offers a write-update coherence policy and a Multiple Reader/Multiple Writer (MR/MW) data access algorithm. The main contribution of this work is an adaptive update mechanism which efficiently implements the write-update protocol and the MR/MW data access policy. It consists of the local replication of shared pages (twin copies) and the dynamic encoding of the changes made on the original versions; hence, only the differences (diffs) are sent to the processes sharing data, instead of the full modified pages. We present an experimental evaluation of KRC compared with Kerrighed in terms of total execution time, which shows that the execution of parallel applications on KRC, in particular those with false sharing, achieves a significant performance improvement over the original Kerrighed platform. We also show that KRC provides linear scalability.

1 Introduction

Clusters are considered today a standard platform for high performance computing, and have supported parallel application processing for many years, primarily through two main parallel computing models: the Message Passing paradigm, targeted at providing a standard and homogeneous distributed communication environment to coordinate processes running across the nodes of the computing cluster; and the Distributed Shared Memory (DSM) approach, oriented at providing a single, virtualized, parallel environment on top of a distributed clustered architecture.

Message passing systems typically use explicit data distribution, exchange, and synchronization. These systems are typically easier to construct but more difficult to program; the inverse is true for their shared memory counterparts. Programming parallel applications with a shared memory model is considered to be more natural and easier than programming with a message-passing model.

DSM supports the convenient shared-memory programming model on distributed-memory hardware. Implementing a Single System Image (SSI) is an elegant approach to simplify the management of global resources in a cluster [4, 5, 10, 13, 21]. The SSI concept hides the heterogeneous and distributed nature of the available resources, in order to present a global and uniform view to users and applications. Algorithms for implementing DSM deal with two basic problems: distributing data to minimize latency and maintaining a coherent view of data without incurring high overhead.

To deal with the problem of distributing shared data across the system, there are two frequently used policies: Single Writer (SW) and Multiple Writer (MW) [2, 19]. The SW approach means that only one process has the right to modify a shared data item. This policy can be implemented in the form of a Single Writer/Single Reader (SW/SR) algorithm and, more commonly, in the form of a Single Writer/Multiple Reader (SW/MR) algorithm. At any given time, data may be either read in read-only mode by one or more processes or written by a single process. On the other hand, the MW approach means that several processes have the right to update a shared data item simultaneously. This policy is commonly implemented in the form of a MW/MR algorithm.

For keeping data coherent, two possibilities exist: write-update or write-invalidate protocols. In the update-based protocol, modifications made by a process are executed locally and immediately forwarded to all other
processes holding a copy of the modified data item, effectively updating it. Hence, processes read their local copies without the need for communication. In the invalidate-based protocol, updates made by a process are executed locally and an invalidation message is sent to all other processors holding a copy of the data item, which immediately invalidate the data read by local processes. Then, when local processes access the invalidated data item, a read miss occurs and a valid (i.e., the most recent) copy of the data item is fetched from the writing processor.
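As a rough illustration of the contrast between the two coherence policies, the following sketch shows how a DSM node might react to a local write under each of them. The names (struct dsm_page, send_update, send_invalidate) are hypothetical and only mirror the behaviour described above; this is not Kerrighed or KRC code.

/* Illustrative sketch of write-update vs. write-invalidate (hypothetical names). */
#include <stdio.h>

#define MAX_NODES 8

struct dsm_page {
    int copyset[MAX_NODES];   /* other nodes currently holding a replica */
    int ncopies;
};

static void send_update(int node, int page_id) {
    printf("update     -> node %d (page %d carries the new data)\n", node, page_id);
}

static void send_invalidate(int node, int page_id) {
    printf("invalidate -> node %d (page %d must be re-fetched on next read)\n", node, page_id);
}

/* Write-update: propagate the modification; replicas stay readable. */
static void local_write_update(struct dsm_page *p, int page_id) {
    for (int i = 0; i < p->ncopies; i++)
        send_update(p->copyset[i], page_id);
}

/* Write-invalidate: replicas are discarded; a later read misses and
 * fetches the most recent copy from the writer. */
static void local_write_invalidate(struct dsm_page *p, int page_id) {
    for (int i = 0; i < p->ncopies; i++)
        send_invalidate(p->copyset[i], page_id);
    p->ncopies = 0;           /* only the writer keeps a valid copy */
}

int main(void) {
    struct dsm_page p = { .copyset = { 1, 2, 3 }, .ncopies = 3 };
    local_write_update(&p, 42);
    local_write_invalidate(&p, 42);
    return 0;
}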
A Memory Consistency Model defines how applications use the shared address space (i.e., the legal ordering of memory references issued by a processor), whereas the degree of relaxation of a consistency protocol determines the efficiency of the implementation [11, 16, 17, 18, 20]. Stronger forms of the consistency model (i.e., the strict and sequential consistency models) typically increase memory access latency and bandwidth requirements, while simplifying programming. Relaxed models (i.e., the weak, release, and lazy release consistency models) are based on looser constraints. In order to reduce the communication overhead caused by write propagation, these models allow multiple copies of shared data to exist on different nodes at the same time, some of which can be temporarily inconsistent. They distinguish ordinary memory accesses from synchronization accesses executed as acquire-release operation pairs. Memory consistency is then required only on these synchronization accesses.

In this paper we describe a kernel-level implementation of the Release Consistency (RC) Memory Model, with a MW/MR data access policy and a write-update protocol. Our implementation extends the Kerrighed SSI operating system [13] for System-V shared memory IPC mechanisms. We call the extended version of Kerrighed KRC (Kerrighed-Release Consistency). The RC Memory Model implemented in KRC is efficient, adaptive, and transparent. It efficiently implements the write-update protocol and the MW/MR data access policy, providing an adaptive update mechanism to transfer the changes among the processes which share data. This adaptive update mechanism is our main contribution and consists of locally replicating the shared pages (twin copy generation) and dynamically encoding the changes made on the original pages employing several byte-encoding schemes. Hence, only the differences (diffs) are sent to processes sharing data, instead of whole modified pages. KRC also provides transparent execution to applications, i.e., no instrumentation or modification is needed for shared memory parallel applications to be executed on cluster platforms. We present an experimental evaluation of KRC, compared with Kerrighed in terms of total execution time, using confidence intervals for the means. The results show that the execution of parallel applications on KRC, especially those with false sharing, achieves a significant performance improvement over the original Kerrighed platform. We also show that KRC presents linear scalability.

2 Overview of Kerrighed

Kerrighed [13] is an SSI operating system based on Linux, on which every node of a cluster has a consolidated view of available system resources and constructs. Kerrighed is based on a global, kernel-level memory management service called the kernel Distributed Data Manager (kDDM). The kDDM allows sharing collections of objects among the cluster nodes; objects are stored on one node and can be shared and accessed from any other node in a transparent fashion. This way, the kDDM enables transparent cluster-wide storing and sharing of memory segments (e.g., memory pages) through the standard System-V SHM mechanisms, treated abstractly as objects.

The data in the kDDM is hierarchically organized into objects, sets, and namespaces. The objects are managed by the kDDM without any assumptions on their contents or semantics: in Kerrighed terms, objects are a set of bytes defined by the programmer. The objects can represent basic types (integers, characters, etc.) as well as memory pages, files, devices, and custom data structures, allowing developers to transparently share any type of data between nodes. An object must always be associated with a set, which stores a collection of similar objects. On the top level of the hierarchy, a namespace acts as a collection of sets. In general, every kDDM object belongs to a set, and sets in turn belong to a namespace.

The objects are handled through manipulation functions that ensure cluster-wide data replication and guarantee consistency. These functions can be compared with standard read/write locks. They are primarily used to define critical sections inside the kDDM, allowing secure and consistent access to objects. As objects can be replicated among cluster nodes, each object in the kDDM is associated with a state, the most significant of which are READ_COPY and INV_COPY. The READ_COPY state indicates that a replica of the object can be read locally, while the INV_COPY state means that the object has been updated on another cluster node and the local copy is outdated. The object manipulation interface mainly consists of these operations:

get: Equivalent to the acquisition of a read-lock upon a shared object. This operation replicates the object in the local memory region of the requesting node, and marks it to ensure that no write operations can be performed on the rest of the nodes; read operations, however, are permitted.

grab: Equivalent to the acquisition of a write-lock upon a shared object. It works as the get operation, but also
ensures that read operations are prohibited on the rest of the nodes. When a grab is performed, every replica of the object present on other nodes is invalidated.

put: Equivalent to releasing the locks associated with a shared object. It notifies the other nodes that, from now on, the object is no longer in use at the local node.

From the above, we can see that the kDDM provides a strict consistency model for shared objects, with an invalidate-based coherence policy and a MR/SW data access algorithm.
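To make the semantics of these operations concrete, the sketch below walks through a read-side and a write-side critical section. The identifiers (kddm_get, kddm_grab, kddm_put, struct shared_counter) are hypothetical stand-ins that mirror the behaviour just described; they are not the actual kDDM symbols.

/* Illustrative sketch only; hypothetical wrappers for the get/grab/put semantics. */
#include <stdio.h>

struct shared_counter { long value; };

static void *kddm_get(void *obj)  { printf("get: read-lock, replicate locally\n");      return obj; }
static void *kddm_grab(void *obj) { printf("grab: write-lock, invalidate other replicas\n"); return obj; }
static void  kddm_put(void *obj)  { (void)obj; printf("put: release, no longer in use here\n"); }

int main(void) {
    struct shared_counter c = { 0 };

    /* Read-side critical section: other nodes may still read their replicas. */
    struct shared_counter *r = kddm_get(&c);
    printf("observed value: %ld\n", r->value);
    kddm_put(r);

    /* Write-side critical section: every other replica is invalidated. */
    struct shared_counter *w = kddm_grab(&c);
    w->value++;
    kddm_put(w);
    return 0;
}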
ing inside a critical section. This set is referred as Criti-
cal Process Container (CPC). As any other Kerrighed ob-
3 KRC: Implementation of Release Consis- ject, this set is shared among cluster nodes, and its consis-
tency on Kerrighed tency is preserved under Kerrighed standard strict consis-
tency model.
The RC model implemented in KRC is divided in
According to extensive studies and experimental re- three phases related to the natural sequence of operations
sults [1, 11, 16, 17, 18, 20] relaxed consistency models have performed by a process during the execution of a critical
proven increased efficiency compared to strict consistency section. The first phase corresponds to the actions carried
models, because of the remarkable savings of communica- out when an acquire operation is performed to enter to
tion by reducing the number of messages and the amount a critical section. The second phase describes how the
of data exchanged for remote memory access. Hence, we protocol manages data structures to keep memory in a
decided to implement Release Consistency (RC), a relaxed consistent state. Finally, the third phase corresponds with
consistency model and a MW/MR algorithm. the actions carried out when a release operation is executed,
The RC model establishes that memory is consistent only signalling the exit of the critical section. These phases are
at specific synchronization points. It is based on the fact illustrated in Figure 1.
that programmers use synchronization points to separate ac-
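The following user-level sketch shows the kind of unmodified System-V/POSIX IPC program that KRC targets: the semaphore wait and post act as the acquire and release synchronization points, and the data being modified lives in a shared-memory page. The key and semaphore name are arbitrary example values, not anything mandated by KRC.

/* Sketch of the programming pattern KRC targets; runs unmodified on KRC. */
#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <semaphore.h>
#include <fcntl.h>

int main(void) {
    /* Create or attach a shared segment of one page (example key). */
    int shmid = shmget(0x4B5243, 4096, IPC_CREAT | 0600);
    if (shmid < 0) { perror("shmget"); return 1; }
    long *shared = shmat(shmid, NULL, 0);
    if (shared == (void *)-1) { perror("shmat"); return 1; }

    /* Named semaphore used as the acquire/release synchronization point. */
    sem_t *lock = sem_open("/krc_example_lock", O_CREAT, 0600, 1);
    if (lock == SEM_FAILED) { perror("sem_open"); return 1; }

    sem_wait(lock);            /* acquire: enter the critical section   */
    shared[0] += 1;            /* modify data living in the shared page */
    sem_post(lock);            /* release: updates become visible       */

    printf("shared counter is now %ld\n", shared[0]);

    sem_close(lock);
    shmdt(shared);
    return 0;
}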
Through the kDDM, when the cluster is initialized, a set is instantiated to store information about the processes executing inside a critical section. This set is referred to as the Critical Process Container (CPC). Like any other Kerrighed object, this set is shared among the cluster nodes, and its consistency is preserved under Kerrighed's standard strict consistency model.

The RC model implemented in KRC is divided into three phases, related to the natural sequence of operations performed by a process during the execution of a critical section. The first phase corresponds to the actions carried out when an acquire operation is performed to enter a critical section. The second phase describes how the protocol manages data structures to keep memory in a consistent state. Finally, the third phase corresponds to the actions carried out when a release operation is executed, signalling the exit of the critical section. These phases are illustrated in Figure 1.

Figure 1. 3-phases RC in KRC.

Acquire: When a process acquires a semaphore, it is registered in the CPC, indicating that it has entered a critical section. At this time a new set structure, called the Modified Pages Set (MPS), which will contain pointers to the SHM pages modified by the current process, is associated with the process. Figure 2 shows the CPC before and after a process k executes an acquire.

Critical Section: When a process triggers a write fault inside a critical section, a new object is inserted in the corresponding MPS. Before giving the process write access to the page (by allocating a local replica), a page twin is created. This twin will be used to determine the changes made to the page in the critical section and to notify the set of nodes that hold replicas (the copy set) about those changes.
This notification occurs at release time.

Figure 2. Acquire example in KRC.

Release: When a process reaches a release operation, a notification must be sent to the copy set along with the changes made to the shared pages. In order to identify the changes, each page in the MPS is compared with its corresponding twin to obtain the differences (diffs). Hence, only the diffs are sent to the copy set, instead of whole pages. This phase implements the adaptive update mechanism, which is presented in Section 3.2.
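Putting the three phases together, the following sketch (hypothetical names, not the KRC kernel code) summarizes the life cycle just described: acquire attaches an empty MPS to the process, the first write fault on a page creates its twin and records the page in the MPS, and release diffs each recorded page against its twin and propagates only the differences.

/* Minimal sketch of the acquire / write-fault / release life cycle (hypothetical names). */
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

#define PAGE_SIZE 4096
#define MAX_PAGES 64

struct page_entry { unsigned char *page; unsigned char twin[PAGE_SIZE]; };
struct mps { struct page_entry entries[MAX_PAGES]; int count; };

/* Stand-in for the diff transmission of Section 3.2: here we only count
 * and report how many bytes actually changed. */
static void send_diffs_to_copyset(const unsigned char *page, const unsigned char *twin) {
    int changed = 0;
    for (int i = 0; i < PAGE_SIZE; i++)
        if (page[i] != twin[i]) changed++;
    printf("release: sending diffs covering %d modified byte(s)\n", changed);
}

/* Phase 1: acquire. The process is registered in the CPC and gets an empty MPS. */
static struct mps *on_acquire(void) { return calloc(1, sizeof(struct mps)); }

/* Phase 2: write fault inside the critical section. Twin the page and record it. */
static void on_write_fault(struct mps *m, unsigned char *page) {
    struct page_entry *e = &m->entries[m->count++];
    e->page = page;
    memcpy(e->twin, page, PAGE_SIZE);
}

/* Phase 3: release. Diff each recorded page against its twin and propagate. */
static void on_release(struct mps *m) {
    for (int i = 0; i < m->count; i++)
        send_diffs_to_copyset(m->entries[i].page, m->entries[i].twin);
    free(m);
}

int main(void) {
    static unsigned char page[PAGE_SIZE];   /* stand-in for one SHM page */
    struct mps *m = on_acquire();
    on_write_fault(m, page);                /* first write triggers twinning     */
    page[0] = 1; page[100] = 2;             /* modifications inside the section  */
    on_release(m);                          /* only 2 bytes reported as diffs    */
    return 0;
}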
3.2 The KRC adaptive update mechanism

The adaptive update mechanism is our main contribution and consists of locally replicating the shared pages (twin copies) and dynamically encoding the changes made to the original ones, using several byte-encoding schemes.

To encode the diffs, the protocol divides a memory page into blocks of 4, 8, 16, 32, or 64 bytes (i.e., a 4 KB page is divided into 1024, 512, 256, 128, or 64 contiguous blocks, respectively). Then, for each block it is determined which bytes were changed, and with those the update message is built, which includes the following information: i) a block size identifier, composed of 3 bits; ii) for each modified block, a block number identifier, composed of 6 to 10 bits according to the block size used; iii) the new bytes of the block.

The block size yielding the smallest update message is determined dynamically so as to accommodate all the updated bytes of the memory page. To this end, the calculation is performed by generating the 5 update messages simultaneously, each using a different block size. The smallest one is then selected to be sent to the copy set. The creation and maintenance cost of the twins is compensated by reduced network traffic and a smaller notification message size. Our adaptive diff encoding mechanism, shown in Figure 3, guarantees a minimal-sized update message.

Figure 3. Adaptive Update Mechanism in KRC.
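As a worked illustration of the size trade-off (not the kernel implementation): a 4 KB page split into 4-byte blocks has 1024 blocks, so a block number needs 10 bits, whereas 64-byte blocks give 64 blocks and need only 6 bits. The sketch below computes the encoded message size for each candidate block size over a page and its twin, then selects the smallest, as the protocol does.

/* Sketch of the adaptive size calculation over the five candidate block sizes. */
#include <stdio.h>

#define PAGE_SIZE 4096

static long encoded_bits(const unsigned char *page, const unsigned char *twin,
                         int block_size) {
    int nblocks = PAGE_SIZE / block_size;
    int index_bits = 0;
    while ((1 << index_bits) < nblocks) index_bits++;   /* 6..10 bits */

    long bits = 3;                                      /* block size identifier */
    for (int b = 0; b < nblocks; b++) {
        int dirty = 0;
        for (int i = 0; i < block_size; i++)
            if (page[b * block_size + i] != twin[b * block_size + i]) { dirty = 1; break; }
        if (dirty)
            bits += index_bits + 8L * block_size;       /* block number + new bytes */
    }
    return bits;
}

int main(void) {
    static unsigned char twin[PAGE_SIZE], page[PAGE_SIZE];
    page[10] = 0xAA;                      /* two scattered one-byte changes */
    page[3000] = 0xBB;

    const int sizes[] = { 4, 8, 16, 32, 64 };
    int best = -1;
    long best_bits = -1;
    for (int s = 0; s < 5; s++) {
        long bits = encoded_bits(page, twin, sizes[s]);
        printf("block size %2d B -> %ld bits\n", sizes[s], bits);
        if (best < 0 || bits < best_bits) { best = sizes[s]; best_bits = bits; }
    }
    printf("selected block size: %d bytes (%ld bits)\n", best, best_bits);
    return 0;
}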
In contrast to the original Kerrighed implementation, where the owner of an object is the last node that executed a grab operation on it, in KRC the owner of an object is the node that created it. The owner has the responsibility of informing the copy set each time a new node becomes a member of this set. In this way, every node keeps the copy set up to date. In turn, when a node first requests a copy of an object through a grab or get operation, the request is handled by the owner.

In order to identify nodes that hold a writable replica of an object, we introduced the WRITE_COPY state. Nodes that require write access to locally replicated pages can go directly to the WRITE_COPY state without notice. This is possible because of the write-update protocol: since an object owner is responsible for managing the object's copy set, an object in this state can be written on any node under the premise that it is being accessed in a critical section and will not be written simultaneously by more than one process. Hence, at release time, it is ensured that the corresponding node will notify every replica holder about the modifications made by the current process, following the steps described in Section 3.2. Figure 4 shows a simplified state diagram including the new WRITE_COPY state. Note that a release operation triggers implicit transitions from WRITE to READ states. This way, after a release operation, subsequent writes will trigger a page write fault and will cause the proper registration of that page in the MPS. This avoids dirty pages and lost updates.

Figure 4. Diagram of the state transitions to keep data coherence in KRC.
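A compact way to read the state diagram is as a small transition function. The sketch below is a simplification using the state names discussed above; it is not the actual kernel code.

/* Simplified sketch of the replica state transitions discussed above. */
#include <stdio.h>

enum obj_state { READ_COPY, INV_COPY, WRITE_COPY };

/* A node holding a readable replica may start writing inside a critical
 * section without notifying anyone: READ_COPY -> WRITE_COPY.
 * An INV_COPY replica would first have to be re-fetched (not modelled here). */
static enum obj_state on_local_write(enum obj_state s) {
    return (s == READ_COPY || s == WRITE_COPY) ? WRITE_COPY : s;
}

/* At release time the diffs are propagated and the replica drops back to a
 * readable state: WRITE_COPY -> READ_COPY. A later write faults again and
 * re-registers the page in the MPS. */
static enum obj_state on_release_transition(enum obj_state s) {
    return (s == WRITE_COPY) ? READ_COPY : s;
}

int main(void) {
    enum obj_state s = READ_COPY;
    s = on_local_write(s);          /* WRITE_COPY */
    s = on_release_transition(s);   /* READ_COPY  */
    printf("final state: %d (0 = READ_COPY)\n", s);
    return 0;
}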
4 Experimental results

In order to evaluate the performance of KRC, we conducted two experimental scenarios with three synthetic applications: i) a comparative analysis, in which we executed the test applications with fixed problem sizes and a fixed number of nodes on both Kerrighed and KRC; ii) a KRC scalability analysis, in which we ran the test applications with fixed problem sizes (larger than those used in the first scenario), varying the number of nodes.
Our applications exploit different levels of data locality. The first application is a simple Monte-Carlo simulation to approximate the value of π, which involves randomly selecting N points in the unit square and determining which of these points fall within the inscribed circle. In our version, all processes select N points accessing the same unit square (this data is located in one page, shared by the processes). Node P0 is responsible for collecting the partial results and reporting the final approximate value of π.

The second application is an implementation of Riemann sum integration, which distributes, in an equitable manner, the limits of the integral among the number of running processes. A coordinator process is responsible for adding up the results of the rest of the processes. The integral used in our experiments was ∫₀¹² (x³ + 3x² + ½x + 2) dx.

Finally, the third application is a traditional parallel matrix multiplication, which distributes the calculation among all running processes. Each process computes a sub-part of the multiplication.

Applications were implemented using the System-V SHM inter-process communication mechanism. We performed 10 executions for each test case, running one process per computing node and measuring the wall clock execution time.

4.1 Discussion of the Comparative Analysis

The comparative analysis experiment was executed on a dedicated 12-node cluster running Debian GNU/Linux 4.0rc5, with dual-core 3.4 GHz Pentium IV CPUs and 1 GB of RAM per node, interconnected through a 100 Mbps Ethernet LAN. Table 1 shows the average wall clock execution time of each experiment.

Application                Problem size          KRC        Kerrighed    Relative difference
Matrix multiplication      128x128 matrix        0.78 s     0.75 s       -3.85%
                           256x256 matrix        1.11 s     1.39 s       20.14%
                           512x512 matrix        1.84 s     1.82 s       -1.09%
                           1024x1024 matrix      8.18 s     10.47 s      21.87%
                           2048x2048 matrix      63.58 s    65.11 s      2.35%
Monte-Carlo π estimation   1 x 10^4 points       0.81 s     0.78 s       -3.7%
                           1 x 10^5 points       0.83 s     7.42 s       88.81%
                           5 x 10^5 points       0.81 s     19.51 s      95.85%
                           5 x 10^6 points       1.59 s     212.80 s     99.25%
                           25 x 10^6 points      4.49 s     1132.07 s    99.6%
Riemann sum integration    1/10^2 precision      0.79 s     0.64 s       -18.99%
                           1/10^4 precision      0.80 s     0.75 s       -6.25%
                           1/10^5 precision      0.82 s     4.04 s       79.7%
                           1/10^6 precision      0.82 s     33.00 s      97.52%
                           1/10^7 precision      1.94 s     301.03 s     99.36%

Relative diff. average: 44.71%
Relative diff. standard deviation: 49.22%
Confidence interval (α = 0.05): [2.43, 86.98]

Table 1. Confidence interval analysis of the comparative experiment.

From the statistical analysis presented in Table 1, we can argue with 99.5% confidence that applications executed on KRC perform better than when they are executed on the original Kerrighed implementation.

The poor performance of Kerrighed compared with KRC, especially on high data locality applications such as the Monte-Carlo π estimation and the Riemann sum integration, is a consequence of the phenomenon known as false sharing. Due to the page granularity of the shared memory, large amounts of data are located on the same page. Hence, a flood of invalidation messages and unnecessary full-content page transfers with a ping-pong behaviour occur when different processors request write access to the same page, even when each of them writes to a different data item. In contrast, the KRC MW/MR access policy avoids the ping-pong effect and completely eliminates the occurrence of false sharing in high data locality applications. The matrix multiplication application does not exhibit false sharing; in this case, the results show that with KRC we obtain performance similar to that with Kerrighed.

From these results, we expect KRC to perform much better than Kerrighed on real applications with a high degree of data locality, where false sharing can appear, and similarly to Kerrighed in other cases.
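The false-sharing pattern described above can be reproduced with a few lines of ordinary System-V SHM code: two processes update different variables that happen to live on the same shared page. Under page-granularity write-invalidation the page ping-pongs between the writers, whereas under KRC's MW/MR write-update policy each writer only ships the few bytes it changed. This is an illustrative sketch, not one of the benchmark applications.

/* Minimal false-sharing illustration: two writers, two variables, one shared page. */
#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/wait.h>
#include <unistd.h>

struct counters { long a; long b; };    /* both fields live on the same 4 KB page */

int main(void) {
    int shmid = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
    struct counters *c = shmat(shmid, NULL, 0);
    c->a = c->b = 0;

    if (fork() == 0) {                   /* writer 1: touches only c->a */
        for (int i = 0; i < 100000; i++) c->a++;
        _exit(0);
    }
    if (fork() == 0) {                   /* writer 2: touches only c->b */
        for (int i = 0; i < 100000; i++) c->b++;
        _exit(0);
    }
    wait(NULL); wait(NULL);
    printf("a=%ld b=%ld\n", c->a, c->b); /* no logical conflict between the writers */
    shmdt(c);
    shmctl(shmid, IPC_RMID, NULL);
    return 0;
}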
4.2 Discussion of the KRC Scalability Analysis

The scalability analysis experiment was executed on a dedicated 32-node cluster running Debian GNU/Linux 5.0, with dual-core 3.4 GHz Pentium D CPUs and 4 GB of RAM per node, interconnected through a 1 Gbps Ethernet LAN.

Figure 5. KRC scalability.

The results presented in Figure 5 show that the kernel-level adaptations performed in Kerrighed in order to implement the new consistency model have no impact on the scalability of application performance. Incrementing the number of computing nodes used, and hence the
number of concurrent processes, yields a practically linear speed-up for applications designed with a divide-and-conquer strategy, such as the Riemann sum integration and the matrix multiplication. The Monte-Carlo π estimation, in turn, shows no performance degradation as the number of local replicas of shared memory pages, among which consistency must be guaranteed, increases with the number of processes.

Note that without the optimized communication mechanisms implemented in KRC and the MW/MR data access algorithm described in Section 3, an increase in the number of computing nodes, and hence in the number of concurrently executing processes, would cause a tremendous increase in inter-process communication activity in Kerrighed. This is due to a higher number of processes generating full-content page updates upon each write to the shared memory pages. In this case, network bandwidth becomes a bottleneck for application performance much sooner than on KRC, resulting in a hard constraint on application performance scalability, as occurs on the original Kerrighed implementation.

5 Related Work

The first implementations of DSM showed the trade-off between flexibility and performance for sharing and accessing data, by comparing conservative consistency models (such as that implemented in Ivy [14]) with relaxed consistency models (such as those implemented in Munin [3] and TreadMarks [12]). JiaJia [7] presents an intermediate relaxed consistency model located between entry consistency and release consistency, namely scope release consistency, in which writes are propagated only to processes in the same scope, defined by the critical section protection constructs. Performance results show a significant reduction in communication messages, leading to better application performance. More recent improvements over scope consistency are presented in JUMP [6], which slightly modifies the protocol and consistency model implemented in JiaJia to allow migrating memory segments among nodes, hence adapting dynamically to the application memory access pattern and reducing communication. Our proposal implements a release consistency model, equivalent to that presented in Munin [3]. All these early DSM systems provide a user-level library or API used to declare and use shared memory segments in applications, so they are not fully transparent; with KRC, in contrast, it is not required to alter application source code in order to take advantage of DSM benefits.

The combination of SSI platforms and DSM mechanisms has proven to be an attractive environment for straightforward application development and high performance computing, providing the user with a fully transparent view of a cluster of PCs as a unique, consolidated computing resource. However, few successful implementations of such systems exist. Plurix [10] creates a globally shared memory space in which all resources, the operating system kernel, drivers, application code, and data are stored. Consistency is guaranteed by a transactional consistency model, based on the optimistic transaction consistency models found in common database systems. Kerrighed [13] implements a sequential consistency model. Performance results show similar speedups for equivalent parallel applications running on Plurix and Kerrighed [8]. As we presented in the experimental results section (Section 4), Kerrighed performance with sequential consistency is surpassed by our relaxed consistency implementation. Teamster [4] provides a Global Memory Image that allows easy thread and process distribution and migration, accomplished in a fully transparent fashion. Unlike our proposal, in which applications must employ the System-V SHM mechanism to take advantage of distributed shared memory, Teamster provides a fully transparent SSI platform. Static global variables are allocated from a sequentially consistent memory segment, and the dynamic heap is allocated from a release-consistent memory area. The main difference with our proposal is that the Teamster platform is not an open source project. Jessica [21] provides a single Java virtual machine image that allows transparent distributed execution of Java applications. It employs the release consistency or lazy release consistency model to support consistency in the Global Object Space. KRC has no restrictions on the programming language of applications. vNUMA [5] provides NUMA architecture virtualization on top of a cluster of PCs. It implements the SW/MR data access policy and the sequential consistency model. Experimental results show poor performance compared to cluster-based application counterparts, but the authors argue that the overall performance of the platform will be addressed in future work.

More recently, research and experimentation efforts have focused on extending the DSM approach to cluster federations and computing grids. These proposals must deal with very high-latency/low-bandwidth and highly dynamic communication networks. Hence, reducing the communication overhead of consistency maintenance, as well as addressing fault tolerance, topology changes, and resource discovery, are the main priorities of grid-aware DSM platforms. Among the first proposals of such systems are JuxMem [9] and Teamster-G [15]. We plan to incorporate grid-awareness and automatic data distribution mechanisms that consider communication costs into KRC in the near future.

6 Conclusions

The implementation of the consistency model is the key issue when dealing with DSM systems. The SSI operating system introduced in this work, called KRC, implements
the Release Consistency Memory Model on Kerrighed. Our experimental results showed that KRC performs much better than Kerrighed, especially when the phenomenon called false sharing can appear. Besides, KRC offers a transparent DSM to applications, and thus facilitates parallel application programming, since the shared memory programming model is often more natural than the message-passing paradigm. We are currently enhancing KRC in order to support adaptive protocols according to application behaviour. We also plan to incorporate grid-awareness and automatic data distribution mechanisms into KRC.

References

[1] C. Amza, A. L. Cox, S. Dwarkadas, L.-J. Jin, K. Rajamani, and W. Zwaenepoel. Adaptive protocols for software distributed shared memory. In Proc. of the IEEE, Special Issue on Distributed Shared Memory, pages 467-475, March 1999.
[2] C. Amza, A. L. Cox, W. Zwaenepoel, and S. Dwarkadas. Software DSM protocols that adapt between single writer and multiple writer. In Proc. of the 3rd IEEE Symposium on High-Performance Computer Architecture (HPCA), 1997.
[3] J. K. Bennett, J. B. Carter, and W. Zwaenepoel. Munin: Distributed shared memory based on type-specific memory coherence. SIGPLAN Not., 25(3), 1990.
[4] J. B. Chang and C. K. Shieh. Teamster: A transparent distributed shared memory for cluster symmetric multiprocessors. In Proc. of the 1st Int. Symposium on Cluster Computing and the Grid, pages 508-513, May 2001.
[5] M. Chapman and G. Heiser. vNUMA: A virtual shared-memory multiprocessor. In Proc. of the 2009 USENIX Annual Technical Conf., pages 349-362, 2009.
[6] B. Cheung, C. Wang, and K. Hwang. JUMP-DP: A software DSM system with low-latency communication support. In Proc. of the Int. Workshop on Cluster Computing - Technologies, Environments and Applications, pages 1-7, June 2000.
[7] M. R. Eskicioglu and T. A. Marsland. Shared memory computing on SP2: JiaJia approach. In Proc. of the 1998 Conf. of the Centre for Advanced Studies on Collaborative Research, page 12, Nov. 1998.
[8] S. Frenz, R. Lottiaux, M. Schoettner, C. Morin, R. Goeckelmann, and P. Schulthess. A practical comparison of cluster operating systems implementing sequential and transactional consistency. Distributed and Parallel Comp., pages 23-33, Oct. 2005.
[9] A. Gabriel, B. Luc, and J. Mathieu. JuxMem: An adaptive supportive platform for data sharing on the grid. Scalable Computing: Practice and Experience, pages 45-55, September 2005.
[10] R. Goeckelmann, M. Schoettner, S. Frenz, and P. Schulthess. Plurix, a distributed operating system extending the single system image concept. In Proc. of the IEEE Canadian Conf. on Electrical and Computer Engineering, pages 1985-1988, 2004.
[11] L. Hammond, B. D. Carlstrom, V. Wong, B. Hertzberg, M. Chen, C. Kozyrakis, and K. Olukotun. Programming with transactional coherence and consistency (TCC). ACM SIGPLAN Notices, pages 1-13, 2004.
[12] P. Keleher, A. L. Cox, S. Dwarkadas, and W. Zwaenepoel. TreadMarks: Distributed shared memory on standard workstations and operating systems. In Proc. of the 1994 Winter Usenix Conference, pages 115-131, January 1994.
[13] A. Lèbre, R. Lottiaux, E. Focht, and C. Morin. Reducing kernel development complexity in distributed environments. LNCS, 5168:576-586, 2008. Euro-Par.
[14] K. Li. IVY: A shared virtual memory system for parallel computing. In Int. Conf. on Parallel Processing, pages 94-101. IEEE Computer Society Press, 1988.
[15] T.-Y. Liang, C.-Y. Wu, C.-K. Shieh, and J.-B. Chang. A grid-enabled software distributed shared memory system on a wide area network. Future Generation Computer Systems, 23(4):547-557, 2007.
[16] P. Schmidt, S. Frenz, S. Gerhold, and P. Schulthess. Transactional consistency in the automotive environment. In Proc. of the Third International Symposium on Industrial Embedded Systems, pages 1-4, 2008.
[17] H. Weiwu, S. Weisong, and T. Zhimin. An interaction of coherence protocols and memory consistency models in DSM systems. Journal of Computer Science and Technology, 13(2):110-124, 1998.
[18] M. Wende, M. Schoettner, R. Goeckelmann, T. Bindhammer, and P. Schulthess. Optimistic synchronization and transactional consistency. In Proc. of the 2nd IEEE/ACM Int. Symposium on Cluster Computing and the Grid, 2002.
[19] X. Xianghui and H. Chengde. Limited multiple-writer: An approach to dealing with false sharing in software DSMs. Journal of Computer Science and Technology, 15(5):453-460, 2000.
[20] Y. Zhou, L. Iftode, J. P. Singh, K. Li, B. R. Toonen, I. Schoinas, M. D. Hill, and D. A. Wood. Relaxed consistency and coherence granularity in DSM systems: A performance evaluation. In Proc. of the 6th ACM Symposium on Principles and Practice of Parallel Programming, volume 32, pages 193-205, 1997.
[21] W. Zhu, C.-L. Wang, and F. C. M. Lau. JESSICA2: A distributed Java virtual machine with transparent thread migration support. In Proc. of the IEEE Fourth International Conference on Cluster Computing, pages 381-388, 2002.