CSC423 - Lec12 - Distributed and Parallel ComputerSystems

Distributed and Parallel Computer Systems
CSC 423
Spring 2021-2022
Lecture 12
Distributed Systems' Processes-2
Instructor
Dr / Ayman Soliman
➢ Contents
1) Penalty points
2) Hierarchical Algorithm
3) Sender-Initiated Distributed Heuristic Algorithm
4) Receiver-Initiated Distributed Heuristic Algorithm
5) Bidding Algorithm
6) SCHEDULING IN DISTRIBUTED SYSTEMS
7) FAULT TOLERANCE
8) System Failures
9) Synchronous Vs Asynchronous Systems
10) Agreement in Faulty Systems
5/18/2022 Dr/ Ayman Soliman 2

❑ Penalty points
➢ When a workstation owner is running processes on other people's
machines, it accumulates penalty points, a fixed number per second.
These points are added to its usage table entry.
➢ Usage table entries can be positive, zero, or negative.

o A positive score indicates that the workstation is a net user of
system resources,
o A negative score means that it needs resources.
o A zero score is neutral.

❑ Hierarchical Algorithm
➢ Centralized algorithms, such as up-down, do not scale well to large
systems. The central node soon becomes a bottleneck, not to mention
a single point of failure.
➢ This approach organizes the machines like people in corporate,
military, academic, and other real-world hierarchies.
o Some of the machines are workers and others are managers

Sender-Initiated Distributed Heuristic Algorithm
• When a process is created, the machine on which it

originates sends probe messages to a randomly-chosen
machine, asking if its load is below some threshold
value. If so, the process is sent there.
• it should be observed that under conditions of heavy

load, all machines will constantly send probes to other
machines in a futile attempt to find one that is willing
to accept more work.
Receiver-Initiated Distributed Heuristic Algorithm
• Algorithm is one initiated by an underloaded receiver.

• whenever a process finishes, the system checks to see if
it has enough work. If not, it picks some machine at
random and asks it for work.
• An advantage of this algorithm is that it does not put
extra load on the system at critical times.
Bidding Algorithm
• The key players in the economy are the processes, which must buy
CPU time to get their work done, and processors, which auction their
cycles off to the highest bidder.
• Each processor advertises its approximate price by putting it in a

publicly readable file.
SCHEDULING IN DISTRIBUTED SYSTEMS
• Each processor does its own local scheduling (assuming that it

has multiple processes running on it), without regard to what
the other processors are doing.
• When a group of related, heavily interacting processes are all

running on different processors, independent scheduling is
not always the most efficient way.
• The basic difficulty can be illustrated by an example in which

processes A and B run on one processor and processes C and
D run on another.
SCHEDULING IN DISTRIBUTED SYSTEMS
• Several algorithms based on a concept he calls co-scheduling, which

takes interprocess communication patterns into account while
scheduling to ensure that all members of a group run at the same
time.
• The first algorithm uses a conceptual matrix in which each column is

the process table for one processor,
FAULT TOLERANCE
• A system is said to fail when it does not meet its specification.
Component Faults
• Computer systems can fail due to a fault in some component,
such as a processor, memory, I/O device, cable, or software.
FAULT TOLERANCE
• Faults are generally classified as transient, intermittent, or permanent.
• Transient faults occur once and then disappear.
• An intermittent fault occurs, then vanishes, then reappears, and so on.
• A permanent fault is one that continues to exist until the faulty component is repaired.
• The goal of designing and building fault-tolerant systems is to ensure that the
system as a whole continues to function correctly, even in the presence of faults.
System Failures
• In a critical distributed system, we are interested in making the system be
able to survive component (in particular, processor) without faults.
• Two types of processor faults can be distinguished:

1. Fail-silent faults.
Faulty processor just stops and does not respond to subsequent input or produce
further output
2. Byzantine faults.
Faulty processor continues to run, issuing wrong answers to questions,
Synchronous Vs Asynchronous Systems
• If one processor sends a message to another, it is guaranteed to get a

reply within a time T known in advance.
• Failure to get a reply means that the receiving system has crashed.
Synchronous Vs Asynchronous Systems
• System that has the property of always responding to a message within a

known finite bound if it is working is said to be synchronous.
• A system not having this property is said to be asynchronous.
• Asynchronous systems are going to be harder to deal with than

synchronous ones.
Use of Redundancy
• The general approach to fault tolerance is to use redundancy
• Three kinds are possible:
• Information redundancy,
• Extra bits are added to allow recovery from garbled bits.
• Time redundancy,
• an action is performed, and then, if need be, it is performed again.
• Time redundancy is especially helpful when the faults are transient or intermittent.
• Physical redundancy.
• extra equipment is added to make it possible for the system as a whole to tolerate the loss or
malfunctioning of some components (permanent fault )
Fault Tolerance Using Active Replication
• Active replication is a well-known technique for providing fault tolerance using physical redundancy.
• It is used in biology (mammals have two eyes, two ears, etc.),
• If all three inputs are different, the output is undefined. This kind of design is known as TMR
(Triple Modular Redundancy).
Triple modular redundancy

Fault Tolerance Using Primary Backup
• The essential idea of the primary-backup method is that at any one

instant, one server is the primary and does all the work. If the primary
fails, the backup takes over.
• Primary-backup fault tolerance has two major advantages over active

replication.
• First, it is simpler during normal operation since messages go to just one server
(the primary) and not to a whole group.
• The problems associated with ordering these messages also disappear.
• Second, in practice it requires fewer machines, because at any instant one

primary and one backup is needed
Agreement in Faulty Systems
• The general goal of distributed agreement algorithms is to have all the

non-faulty processors reach consensus on some issue, and do that
within a finite number of steps.
• Examples are electing a coordinator, deciding whether to commit a transaction

or not, dividing up tasks among workers, synchronization, and so on.
• Different cases are possible depending on system parameters, including:
1. Are messages delivered reliably all the time?
2. Can processes crash?

- if so, fail-silent or Byzantine
3. Is the system synchronous or asynchronous?

• Let us look at the "easy" case of perfect processors but communication

lines that can lose messages. There is a famous problem, known as the
two-army problem.
• two-army problem
• the sender of the last message does not know if the last message arrived.
• Even with nonfaulty processors (generals), agreement between even two

processes is not possible in the face of unreliable communication.
• Now let us assume that the communication is perfect but the processors are not.
• The classical problem occurs in a military setting and is called the Byzantine
generals problem.
• The goal of the problem is for the generals to exchange troop strengths, so that at
the end of the algorithm, each general has a vector of length n corresponding to
all the armies.
• If general / is loyal, then element / is his troop strength; otherwise, it is undefined.
• A recursive algorithm solves this problem under certain conditions.
• we illustrate the working of the algorithm for the case of n = 4 and m =1 For these
parameters, the algorithm operates in four steps.
• in step 4, each general examines the ith element of each of the

newly received vectors. If any value has a majority, that value is
put into the result vector.
• If no value has a majority, the corresponding element of the
result vector is marked unknown. From Fig. (c) we see that
generals 1, 2, and 4 all come to agreement on
(1, 2, UNKNOWN, 4)
• Lamport et al. (1982) proved that in a system with m faulty processors, agreement
can be achieved only if 2m + 1 correctly functioning processors are present

CSC423 - Lec12 - Distributed and Parallel ComputerSystems

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

CSC423 - Lec12 - Distributed and Parallel ComputerSystems

Uploaded by

Copyright:

Available Formats

Distributed and Parallel Computer Systems

Distributed Systems' Processes-2

5/18/2022 Dr/ Ayman Soliman 2

➢ Usage table entries can be positive, zero, or negative.

5/18/2022 Dr/ Ayman Soliman 3

5/18/2022 Dr/ Ayman Soliman 4

• When a process is created, the machine on which it

• it should be observed that under conditions of heavy

• Algorithm is one initiated by an underloaded receiver.

• Each processor advertises its approximate price by putting it in a

• Each processor does its own local scheduling (assuming that it

• When a group of related, heavily interacting processes are all

• The basic difficulty can be illustrated by an example in which

• Several algorithms based on a concept he calls co-scheduling, which

• The first algorithm uses a conceptual matrix in which each column is

• A system is said to fail when it does not meet its specification.

• Faults are generally classified as transient, intermittent, or permanent.

• Transient faults occur once and then disappear.

• An intermittent fault occurs, then vanishes, then reappears, and so on.

• Two types of processor faults can be distinguished:

• If one processor sends a message to another, it is guaranteed to get a

• System that has the property of always responding to a message within a

• A system not having this property is said to be asynchronous.

• Asynchronous systems are going to be harder to deal with than

• It is used in biology (mammals have two eyes, two ears, etc.),

Triple modular redundancy

• The essential idea of the primary-backup method is that at any one

• Primary-backup fault tolerance has two major advantages over active

• The problems associated with ordering these messages also disappear.

• Second, in practice it requires fewer machines, because at any instant one

• The general goal of distributed agreement algorithms is to have all the

• Examples are electing a coordinator, deciding whether to commit a transaction

• Different cases are possible depending on system parameters, including:

1. Are messages delivered reliably all the time?

2. Can processes crash?

3. Is the system synchronous or asynchronous?

• Let us look at the "easy" case of perfect processors but communication

• Even with nonfaulty processors (generals), agreement between even two

• If general / is loyal, then element / is his troop strength; otherwise, it is undefined.

• A recursive algorithm solves this problem under certain conditions.

• in step 4, each general examines the ith element of each of the

You might also like