
Design Issues for Processor Allocation Algorithms

A large number of processor allocation algorithms have been proposed over
the years. In this section we will look at some of the key choices involved in
these algorithms and point out the various trade-offs. The major decisions
the designers must make can be summed up in five issues:

1. Deterministic versus heuristic algorithms.

2. Centralized versus distributed algorithms.

3. Optimal versus suboptimal algorithms.

4. Local versus global algorithms.

5. Sender-initiated versus receiver-initiated algorithms.

Other decisions also come up, but these are the main ones that have been
studied extensively in the literature. Let us look at each of these in turn.

Deterministic algorithms are appropriate when everything about process
behavior is known in advance. Imagine that you have a complete list of all
processes, their computing requirements, their file requirements, their
communication requirements, and so on. Armed with this information, it is
possible to make a perfect assignment. In theory, one could try all possible
assignments and take the best one.
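
To make this concrete, here is a minimal Python sketch of the exhaustive
approach (an illustration, not an algorithm from the text); the process
demands, machine speeds, and cost function are all hypothetical:

    import itertools

    # Hypothetical example: exhaustively search all assignments of
    # processes to machines and keep the cheapest one. Feasible only
    # when everything about process behavior is known in advance and
    # the problem is small (the search space is machines**processes).

    def best_assignment(processes, machines, cost):
        best, best_cost = None, float("inf")
        for choice in itertools.product(machines, repeat=len(processes)):
            total = sum(cost(p, m) for p, m in zip(processes, choice))
            if total < best_cost:
                best, best_cost = choice, total
        return dict(zip(processes, best)), best_cost

    # Toy cost: assumed known CPU demand divided by known machine speed.
    demand = {"p1": 4, "p2": 2, "p3": 6}
    speed = {"m1": 2, "m2": 3}
    assignment, cost = best_assignment(
        list(demand), list(speed), lambda p, m: demand[p] / speed[m])
    print(assignment, cost)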

In few, if any, systems is total knowledge available in advance, but
sometimes a reasonable approximation is obtainable. For example, in
banking, insurance, or airline reservations, today's work is just like
yesterday's. The airlines have a pretty good idea of how many people want
to fly from New York to Chicago on a Monday morning in early spring, so the
nature of the workload can be accurately characterized, at least statistically,
making it possible to consider deterministic allocation algorithms.

At the other extreme are systems where the load is completely
unpredictable. Requests for work depend on who's doing what, and can
change dramatically from hour to hour, or even from minute to minute.
Processor allocation in such systems cannot be done in a deterministic,
mathematical way, but of necessity uses ad hoc techniques called heuristics.
The second design issue is centralized versus distributed. This theme has
occurred repeatedly throughout the book. Collecting all the information in
one place allows a better decision to be made, but is less robust and can put
a heavy load on the central machine. Decentralized algorithms are usually
preferable, but some centralized algorithms have been proposed for lack of
suitable decentralized alternatives.

The third issue is related to the first two: Are we trying to find the best
allocation, or merely an acceptable one? Optimal solutions can be obtained
in both centralized and decentralized systems, but are invariably more
expensive than suboptimal ones. They involve collecting more information
and processing it more thoroughly. In practice, most actual distributed
systems settle for heuristic, distributed, suboptimal solutions because it is
hard to obtain optimal ones.
The fourth issue relates to what is often called transfer policy. When a
process is about to be created, a decision has to be made whether or not it
can be run on the machine where it is being generated. If that machine is too
busy, the new process must be transferred somewhere else. The choice here
is whether or not to base the transfer decision entirely on local information.
One school of thought advocates a simple (local) algorithm: if the machine's
load is below some threshold, keep the new process; otherwise, try to get rid
of it. Another school says that this heuristic is too crude. Better to collect
(global) information about the load elsewhere before deciding whether or not
the local machine is too busy for another process. Each school has its
points. Local algorithms are simple, but may be far from optimal, whereas
global ones may only give a slightly better result at much higher cost.
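
The two schools can be contrasted with a small, purely illustrative Python
sketch; the threshold value and load figures are hypothetical:

    THRESHOLD = 0.8  # assumed load threshold; tuning it is policy choice

    def local_transfer_decision(local_load):
        # Local (threshold) transfer policy: keep the new process if the
        # machine's load is below the threshold, otherwise try to offload.
        return "keep" if local_load < THRESHOLD else "offload"

    def global_transfer_decision(local_load, remote_loads):
        # Global variant: collect load information from elsewhere first
        # and offload only if this machine is busier than the average.
        avg = sum(remote_loads) / len(remote_loads)
        return "offload" if local_load > avg else "keep"

    print(local_transfer_decision(0.95))              # offload
    print(global_transfer_decision(0.95, [0.9, 1.0])) # keep: others busy too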

The last issue in our list deals with location policy. Once the transfer policy
has decided to get rid of a process, the location policy has to figure out
where to send it. Clearly, the location policy cannot be local. It needs
information about the load elsewhere to make an intelligent decision. This
information can be disseminated in two ways, however. In one method, the
senders start the information exchange. In another, it is the receivers that
take the initiative.

As a simple example, look at Fig. 4-16(a). Here an overloaded machine
sends out requests for help to other machines, in hopes of offloading its new
process on some other machine. The sender takes the initiative in locating
more CPU cycles in this example. In contrast, in Fig. 4-16(b), a machine that
is idle or underloaded announces to other machines that it has little to do
and is prepared to take on extra work. Its goal is to locate a machine that is
willing to give it some work to do.
Fig. 4-16. (a) A sender looking for an idle machine. (b) A receiver looking
for work to do.

For both the sender-initiated and receiver-initiated cases, various algorithms
have different strategies for whom to probe, how long to continue probing,
and what to do with the results. Nevertheless, the difference between the
two approaches should be clear by now.
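
A rough Python sketch of the two probing styles, with a made-up probe()
standing in for the network round trip, and a hypothetical threshold and
probe limit:

    import random

    PROBE_LIMIT = 3    # assumed cap on how many machines to ask
    THRESHOLD = 0.8    # assumed load threshold

    def probe(machine, loads):
        # Stand-in for one network round trip asking a machine its load.
        return loads[machine]

    def sender_initiated(loads):
        # An overloaded sender probes random machines looking for one
        # underloaded enough to accept its new process.
        for _ in range(PROBE_LIMIT):
            target = random.choice(list(loads))
            if probe(target, loads) < THRESHOLD:
                return target          # found a willing receiver
        return None                    # give up; run the process locally

    def receiver_initiated(loads):
        # An idle receiver probes random machines looking for one
        # overloaded enough to hand over some work.
        for _ in range(PROBE_LIMIT):
            source = random.choice(list(loads))
            if probe(source, loads) > THRESHOLD:
                return source          # found work to take on
        return None

    loads = {"m1": 0.95, "m2": 0.20, "m3": 0.60}
    print(sender_initiated(loads), receiver_initiated(loads))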
What is a distributed shared memory? And its advantages
A distributed shared memory (DSM) system is a collection of
nodes/computers connected through a network, each with its own
local memory. The DSM system manages the memory across all the
nodes: the nodes interconnect and process transparently, and the
DSM ensures that every node can access the shared virtual memory
independently, without interference from the other nodes. The DSM
itself has no dedicated physical memory; instead, a virtual address
space is shared among all the nodes, and data is transferred among
them through their main memories.

Let us now discuss some different types of distributed shared
memory:

1. On-Chip Memory:
   1. All the data are stored in the CPU's chip.
   2. There is a direct connection between the memory and
      the address lines.
   3. On-chip memory DSM is very costly and complicated.
2. Bus-Based Multiprocessor:
   1. The connection between the memory and the CPUs is
      established through a set of parallel wires that acts
      as a bus.
   2. All the computers follow a protocol to access the
      memory, and an algorithm is used to prevent
      simultaneous memory access by different systems.
   3. Network traffic is reduced by the cache memory.
3. Ring-Based Multiprocessor:
   1. In ring-based DSM, there is no centralized memory.
   2. All the nodes are connected through a link/network, and
      memory access is arbitrated by token passing.
   3. All the nodes/computers in the shared area access a
      single address line in this system.
Let us now see some advantages of the Distributed Shared
Memory system:

1. Easy abstraction - Since the address space is the same, data
   migration is not an issue for programmers, making DSM simpler
   to build on than RPC.
2. Easier portability - Migration from a sequential to a distributed
   system is made easy by the access protocols employed in DSM.
   Because they make use of a common programming interface,
   DSM programs are portable.
3. Locality of data - Data is fetched in large blocks, so data that
   is close to the memory address being fetched is likely to be
   needed in the near future.
4. Larger memory space - A large virtual memory space is
   provided, paging operations are decreased, and the total
   memory size is the sum of the memory sizes of all the nodes.
5. Better performance - DSM speeds up data access, which
   increases the performance of the system.
6. Flexible communication environment - There is no
   requirement for the sender and receiver to both be present, so
   nodes can join and leave the DSM system without influencing
   the others.
7. Simplified process migration - Since all nodes share the same
   address space, moving a process to a different computer is
   simple.
Some more basic advantages of the distributed shared memory
(DSM) system are listed below:

1. It is less expensive than using multiprocessing systems.
2. Data access is done smoothly.
3. It provides better scalability, as several nodes can access the
   memory.
Now let's see the disadvantages of the distributed shared
memory:

1. Accessing is faster in a non-distributed shared memory
   system than in a distributed one.
2. Simultaneous access to data has always been a topic of
   discussion; DSM must provide additional protection against it.
3. It takes away some programmer control over the actual
   message passing.
4. In order to write correct programs, programmers must learn
   the consistency model.
5. It is not as efficient as a message-passing implementation,
   since it is built on top of asynchronous message passing.
Difference between Message Passing and Distributed Shared Memory:

Distributed Shared Memory            Message Passing
-------------------------            ---------------
Variables are shared directly.       Variables are marshalled.
Lower cost of communication.         Higher cost of communication.
Errors may occur when processes      A private address space is associated
alter the shared data.               with each process, which prevents
                                     such errors.
Processes cannot run                 Processes may run at the
simultaneously.                      same time.

Difference between Shared Memory Model and Message Passing Model in IPC:

1. Shared Memory Model:

In this IPC model, a shared memory region is established which
is used by the processes for data communication. This memory
region is present in the address space of the process which
creates the shared memory segment. The processes that want
to communicate with this process must attach this memory
segment to their own address space.

2. Message Passing Model:

In this model, the processes communicate with each other by
exchanging messages. For this purpose, a communication link
must exist between the processes, and it must support at least
two operations: send(message) and receive(message). The
size of messages may be variable or fixed.
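
A minimal Python sketch contrasting the two IPC models, using
multiprocessing primitives as convenient stand-ins (Value for a shared
segment, Queue for the communication link); this is an illustration, not
an example from the text:

    from multiprocessing import Process, Queue, Value

    def shm_worker(counter):
        # Shared memory model: communicate by writing a region that
        # both processes have attached (here, a shared Value).
        with counter.get_lock():
            counter.value += 1

    def mp_worker(queue):
        # Message passing model: communicate with an explicit send.
        queue.put("hello from child")      # send(message)

    if __name__ == "__main__":
        counter = Value("i", 0)            # the shared memory segment
        queue = Queue()                    # the communication link
        for target, arg in ((shm_worker, counter), (mp_worker, queue)):
            p = Process(target=target, args=(arg,))
            p.start(); p.join()
        print(counter.value)               # 1, written via shared memory
        print(queue.get())                 # receive(message)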
NUMA stands for Non-Uniform Memory Access. NUMA is a
multiprocessor model in which each processor is connected
to its own dedicated memory. NUMA machines were intended
to avoid the memory-access bottleneck of UMA machines. The
logically shared memory is physically distributed among the
processing nodes of NUMA machines, leading to distributed
shared memory architectures.
These parallel computers are highly scalable, but they are very
sensitive to how data is allocated in local memories:
accessing a local memory segment of a node is much
quicker than accessing a remote memory segment.
The main difference is in the organization of the address
space. In multiprocessors, a global address space is used
that is uniformly visible from every processor; several
processors can access all memory areas. In multicomputers,
the address space is replicated in the local memories of the
processing elements (PEs), and no PE is allowed to directly
access the local memory of another PE.
This difference in the address space of the memory is also
reflected at the software level: distributed memory
multicomputers are programmed on the basis of the message-
passing paradigm, while NUMA machines are programmed
on the basis of the global address space principle.
This distinction has become less clear-cut in recent
parallel computers, like the Cray T3D, where both
programming paradigms are provided to the user in the form
of library packages.
A further aspect that makes the difference even smaller
comes from the fact that the actual form of accessing
remote memory modules is the same in both classes of
MIMD computers. Remote memory access is realized by
messages even in NUMA machines, just as in
message-passing multicomputers.
The problem of cache coherence does not occur in
distributed memory multicomputers, because the message-
passing paradigm explicitly manages separate copies of the
same data structure in the form of independent messages.
In the shared memory paradigm, multiple accesses to the same
global data structure are possible and can be accelerated if
local copies of the global data structure are maintained in
local caches.
Hardware-supported cache consistency schemes are not
provided in NUMA machines. These systems can
cache read-only code and data, as well as local data, but not
shared modifiable data. This is the distinguishing feature
between NUMA and CC-NUMA multiprocessors. NUMA
machines are closer to multicomputers than to other
shared-memory multiprocessors, while CC-NUMA machines
represent real shared memory systems.
In NUMA machines, as in multicomputers, the main design
issues are the organization of processor nodes, the
interconnection network, and the possible techniques to
reduce remote memory accesses. Typical NUMA machines
are the Cray T3D and the Hector multiprocessor.
Types of Consistency

There are many different models of consistency,
some of which are included in the list below:

• Strict Consistency Model: "The strict consistency model is the
strongest form of memory coherence, having the most stringent
consistency requirements. A shared-memory system is said to support
the strict consistency model if the value returned by a read operation
on a memory address is always the same as the value written by the
most recent write operation to that address, irrespective of the
locations of the processes performing the read and write operations.
That is, all writes instantaneously become visible to all processes."
(Sinha97)
• Sequential Consistency Model: "The sequential consistency model
was proposed by Lamport ... . A shared-memory system is said to
support the sequential consistency model if all processes see the
same order of all memory access operations on the shared memory.
The exact order in which the memory access operations are
interleaved does not matter. ... If one process sees one of the
orderings of ... three operations and another process sees a different
one, the memory is not a sequentially consistent memory."
(Sinha97)
• Causal Consistency Model: "The causal consistency model ...
relaxes the requirement of the sequential model for better
concurrency. Unlike the sequential consistency model, in the causal
consistency model, all processes see only those memory reference
operations in the same (correct) order that are potentially causally
related. Memory reference operations that are not potentially causally
related may be seen by different processes in different orders."
(Sinha97)
• FIFO Consistency Model: For FIFO consistency, "Writes done by a
single process are seen by all other processes in the order in which
they were issued, but writes from different processes may be seen in
a different order by different processes.
"FIFO consistency is called PRAM consistency in the case of
distributed shared memory systems."
(Tanvan02)
• Pipelined Random-Access Memory (PRAM) Consistency Model:
"The pipelined random-access memory (PRAM) consistency model ...
provides a weaker consistency semantics than the (first three)
consistency models described so far. It only ensures that all write
operations performed by a single process are seen by all other
processes in the order in which they were performed as if all the
write operations performed by a single process are in a pipeline.
Write operations performed by different processes may be seen by
different processes in different orders."
(Sinha97)
• Weak Consistency Model: "Synchronization accesses (accesses
required to perform synchronization operations) are sequentially
consistent. Before a synchronization access can be performed, all
previous regular data accesses must be completed. Before a regular
data access can be performed, all previous synchronization accesses
must be completed. This essentially leaves the problem of
consistency up to the programmer. The memory will only be
consistent immediately after a synchronization operation."
(SinShi94)
• Release Consistency Model: "Release consistency is essentially the
same as weak consistency, but synchronization accesses must only be
processor consistent with respect to each other. Synchronization
operations are broken down into acquire and release operations. All
pending acquires (e.g., a lock operation) must be done before a
release (e.g., an unlock operation) is done. Local dependencies
within the same processor must still be respected.
"Release consistency is a further relaxation of weak consistency
without a significant loss of coherence."
(SinShi94)
• Entry Consistency Model: "Like ... variants of release consistency, it
requires the programmer (or compiler) to use acquire and release at
the start and end of each critical section, respectively. However,
unlike release consistency, entry consistency requires each ordinary
shared data item to be associated with some synchronization
variable, such as a lock or barrier. If it is desired that elements of an
array be accessed independently in parallel, then different array
elements must be associated with different locks. When an acquire is
done on a synchronization variable, only those data guarded by that
synchronization variable are made consistent."
(Tanvan02)
• Processor Consistency Model: "Writes issued by a processor are
observed in the same order in which they were issued. However, the
order in which writes from two processors occur, as observed by
themselves or a third processor, need not be identical. That is, two
simultaneous reads of the same location from different processors
may yield different results."
(SinShi94)
• General Consistency Model: "A system supports general consistency
if all the copies of a memory location eventually contain the same
data when all the writes issued by every processor have completed."
(SinShi94)
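
One of these properties can be captured in a few lines of code. The
following Python sketch (illustrative only, not from the sources quoted
above) checks the PRAM/FIFO condition: every reader must see the writes
of each single writer in issue order, while the interleaving across writers
is unconstrained:

    def is_pram_consistent(issued, observed):
        # issued:   {writer: [values in the order that writer issued]}
        # observed: {reader: [all writes as (writer, value), in the
        #                     order that reader saw them]}
        # PRAM/FIFO consistency: every reader sees the writes of each
        # single writer in issue order; cross-writer interleaving is free.
        for reader, seen in observed.items():
            for writer, issue_order in issued.items():
                seen_from_writer = [v for w, v in seen if w == writer]
                if seen_from_writer != issue_order:
                    return False
        return True

    issued = {"P1": [1, 2], "P2": [9]}
    ok  = {"P3": [("P1", 1), ("P2", 9), ("P1", 2)]}   # P1 in issue order
    bad = {"P3": [("P1", 2), ("P1", 1), ("P2", 9)]}   # P1 reordered
    print(is_pram_consistent(issued, ok))   # True
    print(is_pram_consistent(issued, bad))  # False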
PAGE BASED DISTRIBUTED SHARED MEMORY

I. INTRODUCTION
The idea behind DSM is simple: try to emulate the cache of a
multiprocessor using the MMU and operating system
software.
In a DSM system, the address space is divided up into chunks,
with the chunks being spread over all the processors in the
system. When a processor references an address that is not
local, a trap occurs, and the DSM software fetches the
chunk containing the address and restarts the faulting
instruction, which now completes successfully. Distributed
Global Address Space (DGAS) is a similar term for a wide
class of software and hardware implementations in which
each node of a cluster has access to shared memory in
addition to each node's non-shared private memory.
Software DSM systems can be implemented in an operating
system (OS) or as a programming library, and can be
thought of as extensions of the underlying virtual memory
architecture. When implemented in the OS, such systems
are transparent to the developer, which means that the
underlying distributed memory is completely hidden from the
users. In contrast, software DSM systems implemented at the
library or language level are not transparent, and developers
usually have to program differently. However, these systems
offer a more portable approach to DSM system implementation.
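
A toy Python sketch of the trap-and-fetch cycle described above; the
class names, page size, and Network stand-in are all hypothetical:

    PAGE_SIZE = 4096  # assumed chunk (page) size

    class DsmNode:
        # Sketch of the trap-and-fetch cycle: local pages are cached;
        # a reference to a non-local address "traps", and the DSM
        # software fetches the page before the access is retried.
        def __init__(self, node_id, network):
            self.node_id = node_id
            self.network = network          # stand-in for remote nodes
            self.local_pages = {}           # page number -> bytearray

        def read(self, address):
            page, offset = divmod(address, PAGE_SIZE)
            if page not in self.local_pages:        # the "trap"
                self.local_pages[page] = self.network.fetch(page)
            return self.local_pages[page][offset]   # restart the access

    class Network:
        # Stand-in for the remote side; returns page contents on request.
        def __init__(self):
            self.pages = {}
        def fetch(self, page):
            return self.pages.setdefault(page, bytearray(PAGE_SIZE))

    node = DsmNode(0, Network())
    print(node.read(5000))   # faults on page 1, fetches it, returns 0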
Software DSM systems also have the flexibility to organize the
shared memory region in different ways. The page-based
approach organizes shared memory into pages of fixed size.
In contrast, the object-based approach organizes the shared
memory region as an abstract space for storing shareable
objects of variable sizes. Another commonly seen
implementation uses a tuple space, in which the unit of
sharing is a tuple. Shared memory architecture may involve
separating memory into shared parts distributed amongst
nodes and main memory, or distributing all memory
between nodes. A coherence protocol, chosen in
accordance with a consistency model, maintains memory
coherence.
II. REPLICATION
One improvement to the basic system that can improve
performance considerably is to replicate chunks that are
read only, read-only constants, or other read-only data
structures. Another possibility is to replicate not only read-
only chunks, but all chunks. As long as reads are being
done, there is effectively no difference between replicating a
read-only chunk and replicating a read-write chunk.
However, if a replicated chunk is suddenly modi ed,
inconsistent copies are in existence. The inconsistency is
prevented by using some consistency protocols.
Finding the Owner
The simplest solution for finding the owner is to do a
broadcast, asking for the owner of the specified page to
respond. An optimization is not just to ask who the owner is,
but also to tell whether the sender wants to read or write
and say whether it needs a copy of the page. The owner
can then send a single message transferring ownership and
the page as well, if needed. Broadcasting has the
disadvantage of interrupting each processor, forcing it to
inspect the request packet. For all the processors except the
owner's, handling the interrupt is essentially wasted time.
Broadcasting can also use up considerable bandwidth,
depending on the hardware. There are several other
possibilities as well. In one of these, a process is designated
as the page manager. It is the job of the manager to keep
track of who owns each page. When a process, P, wants to
read a page it does not have or wants to write a page it
does not own, it sends a message to the page manager
telling which operation it wants to perform and on which
page. The manager then sends back a message telling the
ownership, as required. A problem with this protocol is the
potentially heavy load on the page manager, which must
handle all the incoming requests. This problem can be
reduced by having multiple page managers instead of just one.
Another possible algorithm is having each process keep track of
the probable owner of each page. Requests for ownership
are sent to the probable owner, which forwards them if
ownership has changed. If ownership has changed several
times, the request message will also have to be forwarded
several times. At the start of execution and every n times
ownership changes, the location of the new owner should
be broadcast, to allow all processors to update their tables
of probable owners.
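
A small Python sketch of the probable-owner scheme (illustrative; the
tables are made up): follow each node's guess until the true owner is
reached, counting the forwarding hops:

    def find_owner(page, start, probable, owner):
        # Follow each node's guess for who owns the page until the
        # true owner is reached; each stale guess costs one forward.
        node, hops = probable[start][page], 0
        while node != owner[page]:
            node = probable[node][page]   # guess was stale; forward it
            hops += 1
        return node, hops

    # Ownership of page 7 moved 0 -> 1 -> 2; node 0's table is stale.
    probable = {0: {7: 1}, 1: {7: 2}, 2: {7: 2}}
    owner = {7: 2}
    print(find_owner(7, 0, probable, owner))   # (2, 1): one forward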
Finding the Copies
Another important detail is how all the copies are found when
they must be invalidated. Again, two possibilities present
themselves. The first is to broadcast a message giving the
page number and ask all processors holding the page to
invalidate it. This approach works only if broadcast
messages are totally reliable and can never be lost.
The second possibility is to have the owner or page manager
maintain a list or copyset telling which processors hold
which pages. When a page must be invalidated, the old
owner, new owner, or page manager sends a message to
each processor holding the page and waits for an
acknowledgment. When each message has been
acknowledged, the invalidation is complete.
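
A short, illustrative Python sketch of this second approach; send and
await_ack are stand-ins for the messaging layer:

    def invalidate(page, copyset, send, await_ack):
        # Copyset-based invalidation: message every processor holding
        # the page, then wait for each acknowledgment before the
        # invalidation is considered complete.
        pending = set(copyset[page])
        for cpu in pending:
            send(cpu, ("invalidate", page))
        while pending:
            pending.discard(await_ack())   # collect acknowledgments
        copyset[page] = set()              # no copies remain

    copyset = {3: {1, 4, 7}}               # page 3 held by CPUs 1, 4, 7
    acks = list(copyset[3])

    def send(cpu, msg):                    # stand-in for a network send
        print("->", cpu, msg)

    def await_ack():                       # stand-in for receiving an ack
        return acks.pop()

    invalidate(3, copyset, send, await_ack)
    print(copyset[3])                      # set(): invalidation complete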
Page Replacement
In a DSM system, as in any system using virtual memory, it can
happen that a page is needed but that there is no free page
frame in memory to hold it. When this situation occurs, a
page must be evicted from memory to make room for the
needed page. Two subproblems immediately arise: which
page to evict and where to put it.
To a large extent, the choice of which page to evict can be made
using traditional virtual memory algorithms, such as some
approximation to the least recently used (LRU) algorithm. As with
conventional algorithms, it is worth keeping track of which
pages are 'clean' and which are 'dirty'. In the context of
DSM, a replicated page that another process owns is
always a prime candidate to evict, because it is known that
another copy exists. Consequently, the page does not have
to be saved anywhere. If a directory scheme is being used
to keep track of copies, however, the owner or page manager
must be informed of this decision. If pages are located by
broadcasting, the page can just be discarded.
The second-best choice is a replicated page that the evicting
process owns. It is sufficient to pass ownership to one of the
other copies by informing that process, the page manager,
or both, depending on the implementation. The page itself
need not be transferred, which results in a smaller message.
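
The eviction preference just described can be sketched as a ranking
function (illustrative Python; the page records are hypothetical):

    def choose_victim(pages):
        # Rank pages by the eviction preference described above. Each
        # page is a dict with 'replicated', 'owned' (by this process),
        # and 'last_used' (for the LRU approximation among equals).
        def rank(p):
            if p["replicated"] and not p["owned"]:
                cls = 0      # best: another copy exists elsewhere
            elif p["replicated"]:
                cls = 1      # next: pass ownership, page not transferred
            else:
                cls = 2      # worst: sole copy, must be saved somewhere
            return (cls, p["last_used"])
        return min(pages, key=rank)

    pages = [
        {"id": "A", "replicated": False, "owned": True,  "last_used": 5},
        {"id": "B", "replicated": True,  "owned": True,  "last_used": 9},
        {"id": "C", "replicated": True,  "owned": False, "last_used": 7},
    ]
    print(choose_victim(pages)["id"])   # C: replicated, owned elsewhere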
III. DATA GRANULARITY
Data granularity deals with amount of data process during the
execution phase. It can be related to the amount of data
exchanged between nodes at the end of execution phase as
it is data that will be processed in the execution phase. In
the framework the amount of data exchanged between
nodes is usually a multiple of the physical page-size. The
system using paging, regardless of the amount of sharing,
the amount of data exchanged between nodes is usually a
multiple of the physical page size of the basic architecture.
Problem arises when system that consists very small data
granularity are running on system that support very large
physical pages. If the shared data is stored in contiguous
memory location then most data can be stored in few
physical pages. As a result the system performance
degrades as the common physical page thrashes between
different processors. To overcome with this dif culty the
framework partitioned the shared data structure on to
disjoint physical pages.
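
A tiny Python sketch of such partitioning (illustrative; the page size
and structure sizes are assumed): each shared structure is padded out to
its own page boundary so that two nodes never thrash over one page:

    PAGE_SIZE = 4096  # assumed physical page size in bytes

    def page_aligned_offsets(item_sizes):
        # Place each independently shared data structure at the start
        # of its own page, avoiding false sharing of a common page.
        offsets, cursor = [], 0
        for size in item_sizes:
            offsets.append(cursor)
            pages = -(-size // PAGE_SIZE)      # ceiling division
            cursor += pages * PAGE_SIZE        # pad to a page boundary
        return offsets

    # Three shared structures of 100, 5000, and 64 bytes land on pages
    # 0, 1, and 3 respectively instead of all crowding onto page 0.
    print(page_aligned_offsets([100, 5000, 64]))   # [0, 4096, 12288]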
IV. CONSISTENCY PROBLEMS IN DISTRIBUTED SHARED
MEMORY
To get acceptable performance from a Distributed Shared
Memory system, data have to be placed near the
processors that are using them. This is done by replicating and
moving data for read and write operations at a number of
processors. Since several copies of the data are stored in the
local caches, read and write accesses can be performed
efficiently. The caching technique increases the efficiency of
Distributed Shared Memory systems, but it also raises
consistency problems, which arise when a processor
writes (modifies) the replicated shared data. How and when
this change becomes visible to other processors that also have
a copy of the shared data becomes an important issue. A
memory is consistent if the value returned by a read
operation is always the same as the value written by the
most recent write operation to the same address [2]. In a
distributed shared memory system, a processor has to
access the shared virtual memory when page faults happen.
To reduce the communication cost this causes, it seems
natural to increase the page size. However, a large page
size produces contention when a number of processes try
to access the same page, and it also triggers the false
sharing problem, which, in turn, may increase the number
of messages because of aggregation [4].
V. CONCLUSION
As the framework uses the memory page as the unit of data sharing,
the entire address space (spread over all the processors)
is divided into pages. Whenever the virtual memory
manager (VMM) finds a request to an address that
is not local, it asks the DSM manager to fetch that page
from the remote machine. This kind of page fault handling
is simple and similar to what is done for local page faults.
To improve performance, the framework suggests
replicating pages so that the same page does not
have to be transferred again and again. Keeping multiple
copies of the same page raises the issue of consistency
between these copies. Consistency is maintained by a
write-update coherence protocol: all processors are updated
about each write operation. The page manager maintains a
list or copyset telling which processors hold which pages.
When a page must be invalidated, the old owner, new owner,
or page manager sends a message to each processor
holding the page and waits for an acknowledgment. When
each message has been acknowledged, the invalidation is
complete. In this DSM framework, it may happen that a page
is needed but there is no free page frame in memory to
hold it. When this situation occurs, a page must be evicted
from memory to make room for the needed page. Two
subproblems immediately arise: which page to evict and
where to put it. In the context of the framework, a replicated
page that another process owns is always a prime candidate
for eviction, because it is known that another copy exists.
