16 Mark Questions
24. Distinguish and examine the active and passive replication models.
25. Describe in detail Cristian's and the Berkeley algorithm for synchronizing clocks.
26. Examine briefly the notion of global states.
27. Design flat transactions and nested transactions, with examples.
28. Explain in detail the two-phase commit protocol.
29. Examine the atomic commit protocol.
30. What is the goal of an election algorithm? Explain it in detail. (8)
31. Examine how mutual exclusion is handled in distributed systems. (8)
32. Summarize the internal and external synchronization of physical clocks. (8)
33. Give Chandy and Lamport's snapshot algorithm for determining the global states of distributed systems. (8)
34. Discuss the use of NTP in detail.
35. Show that Byzantine agreement can be reached for three generals, with one of them faulty, if the generals digitally sign their messages.
36. Examine a solution to reliable, totally ordered multicast in a synchronous system, using a reliable multicast and a solution to the consensus problem.
37. Illustrate an example execution of the ring-based algorithm to show that processes are not necessarily granted entry to the critical section in happened-before order.
38. Summarize in detail the Coda file system.
39. Describe distributed deadlocks.
40. Examine briefly optimistic concurrency control.
***** ANSWERS *****
Clock drift
The relative amount by which a computer clock differs from a perfect clock. Computer clocks drift from perfect time, and their drift rates differ from one another. Even if the clocks on all computers in a distributed system are set to the same time, they will eventually vary quite significantly unless corrections are applied.
Clock drift rate: the difference per unit of time from some ideal reference clock. Ordinary quartz clocks drift by about 1 second in 11-12 days (about 10⁻⁶ secs/sec). High-precision quartz clocks have a drift rate of about 10⁻⁷ to 10⁻⁸ secs/sec.
Clock skew
The difference between the times on two clocks (at any instant) is skew
Coordinated Universal Time (UTC)
The international standard for timekeeping, based on atomic time but adjusted (by occasional leap seconds) to keep in step with astronomical time; UTC signals are broadcast by radio stations and satellites (e.g. GPS).
External synchronization
– A computer's clock Ci is synchronized with an external authoritative time source S, so that:
– |S(t) − Ci(t)| < D for i = 1, 2, ..., N over an interval I of real time
– The clocks Ci are accurate to within the bound D.
Internal synchronization
– The clocks of any pair of computers are synchronized with one another, so that:
– |Ci(t) − Cj(t)| < D for i, j = 1, 2, ..., N over an interval I of real time
– The clocks Ci and Cj agree within the bound D.
Clock correctness
A hardware clock H is said to be correct if its drift rate falls within a bound x > 0 (e.g. x = 10⁻⁶ secs/sec).
Cristian [1989] suggested the use of a time server, connected to a device that receives signals from a source of UTC, to synchronize computers externally. Upon request, the server process S supplies the time according to its clock, as shown in Figure.
A process p requests the time in a message mr, and receives the time value t in a message mt.
Process p records the total round-trip time Tround. A simple estimate of the time to which p should set its clock is t + Tround/2.
Let min be the minimum message transmission time. The earliest point at which S could have placed the time in mt was min after p dispatched mr; the latest point at which it could have done this was min before mt arrived at p. The time by S's clock when the reply message arrives is therefore in the range [t + min, t + Tround − min]. The width of this range is Tround − 2min, so the accuracy is ±(Tround/2 − min).
The method achieves synchronization only if the observed round-trip times
between client and server are sufficiently short compared with the required
accuracy.
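As a concrete illustration, here is a minimal Python sketch of a Cristian-style client. The server address and its reply format (the clock value as plain text over UDP) are assumptions for illustration, not a real protocol.

import socket, time

# Cristian-style client (sketch): ask a time server for t, measure
# Tround locally, and estimate the correct time as t + Tround/2.
def cristian_time(server=("time.example.org", 3700)):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(1.0)
    start = time.monotonic()            # local send time
    sock.sendto(b"time?", server)       # request message mr
    reply, _ = sock.recvfrom(64)        # reply message mt carrying t
    t_round = time.monotonic() - start  # observed round-trip time Tround
    t = float(reply.decode())           # server's clock value t
    return t + t_round / 2              # the simple estimate above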
Three modes of synchronization (as used in NTP):
- Multicast: a server within a high-speed LAN multicasts the time to the others, which set their clocks assuming some delay (not very accurate).
- Procedure call: a server accepts requests from other computers (as in Cristian's algorithm). Higher accuracy; useful if no hardware multicast is available.
- Symmetric: pairs of servers exchange messages containing timing information. Used where very high accuracies are needed (e.g. at the higher levels of the synchronization subnet).
Lamport clocks are counters that are updated according to the happened-before
relationship between events.
A logical clock is a monotonically increasing software counter. It need not relate to a physical clock.
- Each process pi has a logical clock Li, which can be used to apply logical timestamps to events.
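A minimal Python sketch of the standard Lamport clock update rules (the class layout is illustrative):

class LamportClock:
    # Standard rules: increment before each local event; on receipt,
    # set the clock to max(own, message timestamp) and then increment.
    def __init__(self):
        self.time = 0
    def tick(self):                    # before each local event / send
        self.time += 1
        return self.time
    def send_timestamp(self):          # timestamp piggybacked on a message
        return self.tick()
    def on_receive(self, msg_time):    # merge rule for received messages
        self.time = max(self.time, msg_time)
        return self.tick()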
Vector clocks
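Vector clocks extend Lamport clocks by keeping one counter per process, so that comparing vectors captures the happened-before relation exactly. A minimal sketch of the standard rules (the class layout is illustrative):

class VectorClock:
    def __init__(self, pid, n):
        self.pid = pid                 # this process's index
        self.v = [0] * n               # one counter per process
    def tick(self):                    # increment own entry on each event
        self.v[self.pid] += 1
    def send_timestamp(self):          # whole vector piggybacked on message
        self.tick()
        return list(self.v)
    def on_receive(self, msg_v):       # element-wise max, then tick
        self.v = [max(a, b) for a, b in zip(self.v, msg_v)]
        self.tick()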
Global states
The lattice of global states represents a partial order: all consistent global states can be arranged in it, and every possible execution of the system corresponds to one path through the lattice; any single execution is only one of many such paths.
SNAPSHOT algorithm analogy: census taking
Close all gates into/out of each village (process) and count the people (record the process state) in the village; these actions need not be synchronized with other villages.
Open each outgoing gate and send out an official with a red cap (the special marker message).
Open each incoming gate and count all travellers (record the channel state = messages sent but not yet received) who arrive ahead of the official.
Tally the counts from all villages.
Algorithm SNAPSHOT
All processes are initially white; messages sent by white (red) processes are also white (red).
MSend [Marker sending rule for process P]
– Suspend all other activities until done
– Record P’s state
– Turn red
– Send one marker over each output channel of P.
MReceive [Marker receiving rule for P]
On receiving a marker over channel C:
– if P is white { record the state of channel C as empty; invoke MSend; }
– else record the state of C as the sequence of white messages received since P turned red.
– Stop when a marker has been received on each incoming channel.
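A minimal Python sketch of these two rules; the channel objects and the record_local_state stub are assumed plumbing for illustration.

class SnapshotProcess:
    def __init__(self, in_channels, out_channels):
        self.red = False                          # white until recording
        self.channel_state = {c: [] for c in in_channels}
        self.markers_seen = set()
        self.out_channels = out_channels
    def record_local_state(self):
        # Assumed: capture this process's state (application-specific).
        return None
    def m_send(self):                             # MSend rule
        self.local_state = self.record_local_state()  # record P's state
        self.red = True                           # turn red
        for c in self.out_channels:
            c.send("MARKER")                      # marker on each out-channel
    def on_message(self, channel, msg):
        if msg == "MARKER":                       # MReceive rule
            if not self.red:
                self.channel_state[channel] = []  # channel recorded empty
                self.m_send()
            self.markers_seen.add(channel)        # done when all markers in
        elif self.red and channel not in self.markers_seen:
            self.channel_state[channel].append(msg)  # white msg in transit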
Snapshots taken by SNAPSHOT algorithm
In distributed systems, computers must coordinate their actions correctly with respect to shared resources.
- No shared variables or facilities provided by a single local kernel are available to solve this; a solution based solely on message passing is required.
Essential requirements:
ME1 (safety): at most one process may execute in the critical section at a time.
ME2 (liveness): requests to enter and exit the critical section eventually succeed.
ME3 (ordering): entry to the critical section is granted in the happened-before order of the requests.
ME2 implies freedom from both deadlock and starvation. Starvation involves a fairness condition: the order in which processes enter the critical section. It is not possible to use request times to order processes, because there is no global clock, so the happened-before ordering of request messages is used instead.
Performance Evaluation
- The bandwidth consumed, proportional to the number of messages sent in each entry and exit operation.
- The client delay incurred by a process at each entry and exit operation.
- The synchronization delay between one process exiting the critical section and the next process entering it.
The Central Server Algorithm
- The simplest way to grant permission to enter the critical section is to employ a server.
- A process sends a request message to the server and awaits a reply from it.
- If no other process has the token at the time of the request, the server replies immediately with the token.
- If the token is currently held by another process, the server does not reply but queues the request.
- On exiting the critical section, the client sends a message to the server, giving it back the token.
Bandwidth: entering takes two messages (a request followed by a grant), delayed by the round-trip time; exiting takes one release message and does not delay the exiting process.
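A minimal Python sketch of the server's side of this scheme; the Client class and its granted() method are stand-ins for real message transport, assumed for illustration.

from collections import deque

class Client:
    def granted(self):
        print("token granted")        # stand-in for sending a grant message

class CentralMutexServer:
    def __init__(self):
        self.holder = None            # client currently holding the token
        self.queue = deque()          # queued, unanswered requests
    def on_request(self, client):
        if self.holder is None:
            self.holder = client      # reply immediately with the token
            client.granted()
        else:
            self.queue.append(client) # no reply yet; queue the request
    def on_release(self, client):
        assert client is self.holder  # only the holder may release
        if self.queue:
            self.holder = self.queue.popleft()
            self.holder.granted()     # pass the token to the next in line
        else:
            self.holder = None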
Ring-based Algorithm
- Each process pi has a communication channel to the next process in the ring, p(i+1) mod N.
- The unique token takes the form of a message passed from process to process in a single direction (clockwise) around the ring.
- If a process does not require entry to the critical section when it receives the token, it immediately forwards the token to its neighbour.
- A process that requires the token waits until it receives it, and then retains it.
- To exit the critical section, the process sends the token on to its neighbour.
Bandwidth: the token continuously consumes bandwidth, except when a process is inside the CS. Exit requires only one message.
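A minimal Python sketch of a ring process under this scheme; send_to_next, standing in for the ring channel, is an assumed helper.

def send_to_next(pid, msg):
    # Assumed transport: deliver msg to the neighbour of process pid.
    raise NotImplementedError

class RingMutexProcess:
    def __init__(self, pid, n):
        self.pid, self.n = pid, n
        self.wants_cs = False
        self.in_cs = False
    def on_token(self):
        if self.wants_cs:
            self.in_cs = True         # retain the token and enter the CS
        else:
            self.pass_token()         # forward immediately to the neighbour
    def exit_cs(self):
        self.in_cs = False
        self.pass_token()             # exiting sends the token onward
    def pass_token(self):
        send_to_next((self.pid + 1) % self.n, "TOKEN")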
Ricart and Agrawala's Algorithm
On initialization
state := RELEASED;
To enter the section
state := WANTED;
Multicast request to all processes; request processing deferred here
T := request’s timestamp;
Wait until (number of replies received = (N – 1));
state := HELD;
On receipt of a request <Ti, pi> at pj (i ≠ j)
if (state = HELD or (state = WANTED and (T, pj) < (Ti, pi)))
then
queue request from pi without replying;
else
reply immediately to pi;
end if
To exit the critical section
state := RELEASED;
reply to any queued requests;
Multicast synchronization
- P1 and P2 request the CS concurrently. The timestamp of P1's request is 41 and that of P2's is 34. When P3 receives their requests, it replies immediately. When P2 receives P1's request, it finds that its own request has the lower timestamp, so it does not reply, holding P1's request in its queue. P1, however, replies to P2, so P2 enters the CS. After P2 finishes, it replies to P1, and P1 enters the CS.
- Granting entry takes 2(N−1) messages: N−1 to multicast the request and N−1 replies. Bandwidth consumption is high.
- Client delay is again one round-trip time.
- Synchronization delay is one message transmission time.
Maekawa's Voting Algorithm
- It is not necessary for all of a process's peers to grant it access. A process only needs to obtain permission from a subset of its peers, as long as the subsets used by any two processes overlap.
- Think of processes as voting for one another to enter the CS. A candidate process must collect sufficient votes to enter.
- Processes in the intersection of two sets of voters ensure the safety property ME1 by casting their votes for only one candidate at a time.
- A voting set Vi is associated with each process pi.
- There is at least one common member of any two voting sets; for fairness, all voting sets are the same size:
Vi ⊆ {p1, p2, ..., pN}, such that for all i, j = 1, 2, ..., N:
  pi ∈ Vi
  Vi ∩ Vj ≠ ∅
  |Vi| = K
Each process is contained in M of the voting sets Vi.
On initialization
state := RELEASED;
voted := FALSE;
For pi to enter the critical section
state := WANTED;
Multicast request to all processes in Vi;
Wait until (number of replies received = K);
state := HELD;
On receipt of a request from pi at pj
if (state = HELD or voted = TRUE)
then
queue request from pi without replying;
else
send reply to pi;
voted := TRUE;
end if
For pi to exit the critical section
state := RELEASED;
Multicast release to all processes in Vi;
On receipt of a release from pi at pj
if (queue of requests is non-empty)
then
remove head of queue – from pk, say;
send reply to pk;
voted := TRUE;
else
voted := FALSE;
end if
- ME1 is met: if two processes could enter the CS at the same time, the processes in the intersection of their two voting sets would have had to vote for both. But the algorithm allows a process to cast at most one vote between successive receipts of a release message, so this is impossible.
- The algorithm is deadlock-prone. For example, consider p1, p2 and p3 with V1 = {p1, p2}, V2 = {p2, p3}, V3 = {p3, p1}. If the three processes concurrently request entry to the CS, then it is possible for p1 to reply to itself and hold off p2, for p2 to reply to itself and hold off p3, and for p3 to reply to itself and hold off p1. Each process has then received one of its two replies, and none can proceed.
- Client delay is the same as in Ricart and Agrawala's algorithm: one round-trip time.
Fault tolerance
- None of the algorithms described above would tolerate the loss of messages if the channels were unreliable.
- The central server algorithm can tolerate the crash failure of a client process that neither holds nor has requested the token.
Ring-based Election Algorithm
- The goal is to elect a single process, the coordinator, which is the process with the largest identifier.
1. Initially, every process is marked as a non-participant.
2. The starting process marks itself as a participant and places its identifier in an election message to its neighbour.
3. When a process receives an election message, it compares the arrived identifier with its own. If the arrived identifier is larger, it forwards the message, marking itself as a participant.
4. If the arrived identifier is smaller and the receiver is not yet a participant, it substitutes its own identifier in the message, forwards it, and marks itself as a participant; if it is already a participant, it does not forward the message.
5. If the received identifier is that of the receiver itself, then this process's identifier must be the greatest, and it becomes the coordinator.
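A minimal Python sketch of these rules; send_to_next, standing in for the ring channel, is an assumed helper.

def send_to_next(pid, msg):
    # Assumed transport: deliver msg to the neighbour of process pid.
    raise NotImplementedError

class RingElectionProcess:
    def __init__(self, pid):
        self.pid = pid
        self.participant = False
        self.coordinator = None
    def start_election(self):
        self.participant = True
        send_to_next(self.pid, ("ELECTION", self.pid))
    def on_election(self, cand):
        if cand > self.pid:                    # forward the larger identifier
            self.participant = True
            send_to_next(self.pid, ("ELECTION", cand))
        elif cand < self.pid and not self.participant:
            self.participant = True            # substitute own identifier
            send_to_next(self.pid, ("ELECTION", self.pid))
        elif cand == self.pid:                 # own id returned: greatest id
            self.participant = False
            self.coordinator = self.pid
            send_to_next(self.pid, ("ELECTED", self.pid))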
Consensus
- The problem is for processes to agree on a value after one or more of the processes has proposed what that value should be (e.g. all controlling computers should agree on whether to let a spaceship proceed or abort, after one computer proposes an action).
Transactions
ACID properties
Atomicity of transactions
Isolation
One way to achieve isolation is to perform the transactions serially, one at a time.
The aim of any server that supports transactions is to maximize concurrency.
Concurrency control ensures isolation.
Transactions are allowed to execute concurrently while having the same effect as a serial execution
– that is, they are serially equivalent or serializable.
Concurrency control
We will illustrate the ‘lost update’ and the ‘inconsistent retrievals’ problems which
can occur in the absence of appropriate concurrency control
– a lost update occurs when two transactions both read the old value of a variable
and use it to calculate a new value
– inconsistent retrievals occur when a retrieval transaction observes values that are
involved in an ongoing updating transaction
we show how serially equivalent executions of transactions can avoid these problems
we assume that the operations deposit, withdraw, getBalance and setBalance are
synchronized operations - that is, their effect on the account balance is atomic.
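As a concrete illustration of the lost update, here is a short Python sketch; the account name, starting balance, and 10% interest figure are assumptions for illustration.

balance = {"B": 200}            # shared bank account (illustrative)

def add_ten_percent_interest():
    b = balance["B"]            # read the old balance: 200
    return b + b // 10          # compute the new balance: 220

new_T = add_ten_percent_interest()  # transaction T reads 200
new_U = add_ten_percent_interest()  # transaction U also reads 200
balance["B"] = new_T                # T writes 220
balance["B"] = new_U                # U overwrites with 220: T's update lost
# Any serially equivalent execution would give 220 + 22 = 242.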
Serial equivalence
if each of a set of transactions has the correct effect when done on its own, then, if they are done one at a time in some order, the combined effect will be correct
a serially equivalent interleaving is one in which the combined effect is the same as if the transactions had been done one at a time in some order
the same effect means
– the read operations return the same values
– the instance variables of the objects have the same values at the end
Nested Transactions:
Sub-transactions may run concurrently with other sub-transactions at the same level.
– this allows additional concurrency in a transaction.
– when sub-transactions run in different servers, they can work in parallel.
– e.g. consider the branchTotal operation: it can be implemented by invoking getBalance at every account in the branch; these invocations can be done in parallel when the branches have different servers.
– Sub-transactions can commit or abort independently; this is potentially more robust, since a parent can decide on different actions according to whether a sub-transaction has aborted or not.
A transaction may commit or abort only after its child transactions have completed.
A sub-transaction decides independently to commit provisionally or to abort. Its
decision to abort is final.
When a parent aborts, all of its sub-transactions are aborted.
When a sub-transaction aborts, the parent can decide whether to abort or not.
If the top-level transaction commits, then all of the sub-transactions that have
provisionally committed can commit too, provided that none of their ancestors has
aborted.
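A minimal Python sketch of these commit/abort rules; the tree structure and status names are assumptions for illustration.

class NestedTransaction:
    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        self.status = "active"      # active / provisional / aborted / committed
        if parent is not None:
            parent.children.append(self)
    def provisional_commit(self):   # a sub-transaction's independent decision
        self.status = "provisional"
    def abort(self):                # aborting is final and aborts the subtree
        self.status = "aborted"
        for child in self.children:
            child.abort()
    def commit(self):               # only the top level commits for real
        assert self.parent is None and self.status == "active"
        self._commit_subtree()
    def _commit_subtree(self):
        self.status = "committed"
        for child in self.children: # provisionally committed children commit;
            if child.status == "provisional":   # aborted subtrees stay aborted
                child._commit_subtree()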
Replication
System model
Five phases in performing a request:
Request
The front end issues the request, which is either sent to a single replica manager or multicast to all replica managers.
Coordination
The replica managers coordinate in preparation for executing the request, i.e. they agree on whether the request is to be performed and on its ordering relative to other requests.
Execution
The replica managers execute the request, perhaps tentatively.
Agreement
The replica managers reach agreement on the effect of the request, e.g. that it will be committed.
Response
One or more replica managers respond to the front end.
Each replication manager provides concurrency control and recovery of its own data
items in the same way as it would for non-replicated data.
The effects of transactions performed by various clients on replicated data items are the same as if they had been performed one at a time on a single data item (one-copy serializability).
Replication Schemes
Primary Copy
Quorum consensus
Virtual Partition
Read-one/write-all
Available copies
Available copies replication can handle the case where some replica managers are unavailable, because they have failed or because of a communication failure.
Reads can be performed by any available replica manager, but writes must be performed by all available replica managers.
The normal case is like read-one/write-all, as long as the set of available replica managers does not change during a transaction.
Failure case
One copy serializability requires that failures and recovery be serialized with
transactions.
This is not achieved when different transactions make conflicting failure observations.
Available copies with local validation assumes no network partition - i.e. functioning
replica managers can communicate with one another.
Assume X fails just after T has performed getBalance, and N fails just after U has performed getBalance.
Assume also that X and N fail before T and U have performed their deposit operations.
T's deposit will then be performed at M and P, while U's deposit will be performed at Y.
Concurrency control on A at X does not prevent U from updating A at Y; similarly, concurrency control on B at N does not prevent T from updating B at M and P.
Coda
Features:
Bandwidth adaptation
Failure Resilience
Client side persistent caching of files, directories and attributes for high performance
Security
Communication
Interprocess communication in Coda is performed using RPCs. However, the RPC2 system for Coda is much more sophisticated than traditional RPC systems such as ONC RPC, which is used by NFS.
RPC2 allows the client and the server to set up a separate connection for transferring bulk data (such as video) to the client on time. Connection setup is done as a side effect of an RPC call to the server. For this purpose, the RPC2 runtime system provides an interface of
side-effect routines that is to be implemented by the application developer. For
example, there are routines for setting up a connection and routines for transferring
data. These routines are automatically called by the RPC2 runtime system at the client
and server, respectively, but their implementation is otherwise completely
independent of RPC2.
The problem is caused by the fact that an RPC may fail. Invalidating files in a strict
sequential order may be delayed considerably because the server cannot reach a
possibly crashed client, but will give up on that client only after a relatively long
expiration time. Meanwhile, other clients will still be reading from their local copies.
Parallel RPCs are implemented by means of the MultiRPC system, which is part of
the RPC2 package. An important aspect of MultiRPC is that the parallel invocation of
RPCs is fully transparent to the callee. In other words, the receiver of a MultiRPC call
cannot distinguish that call from a normal RPC. At the caller’s side, parallel execution
is also largely transparent. For example, the semantics of MultiRPC in the presence of
failures are much the same as that of a normal RPC. Likewise, the side-effect
mechanisms can be used in the same way as before.
Processes
Coda maintains a clear distinction between client and server processes. Clients are
represented by Venus processes; servers appear as Vice processes. Both types of processes are internally organized as a collection of concurrent threads. Threads in
Coda are non-preemptive and operate entirely in user space. To account for
continuous operation in the face of blocking I/O requests, a separate thread is used to
handle all I/O operations, which it implements using low-level asynchronous I/O
operations of the underlying operating system. This thread effectively emulates
synchronous I/O without blocking an entire process.
Naming
File Identifiers
Considering that the collection of shared files may be replicated and distributed across
multiple Vice servers, it becomes important to uniquely identify each file in such a
way that it can be tracked to its physical location, while at the same time maintaining
replication and location transparency.
Synchronization
Many distributed file systems, including Coda’s ancestor, AFS, do not provide UNIX
file-sharing semantics but instead support the weaker session semantics. Given its
goal to achieve high availability, Coda takes a different approach and makes an
attempt to support transactional semantics, albeit a weaker form than normally
supported by transactions.
The problem that Coda wants to solve is that in a large distributed file system it may
easily happen that some or all of the file servers are temporarily unavailable. Such
unavailability can be caused by a network or server failure, but may also be the result
of a mobile client deliberately disconnecting from the file service. Provided that the
disconnected client has all the relevant files cached locally, it should be possible to use
these files while disconnected and reconcile later when the connection is established
again.
Transactional Semantics
In Coda, the notion of a network partition plays a crucial role in defining transactional
semantics. A partition is a part of the network that is isolated from the rest and which
consists of a collection of clients or servers, or both. The basic idea is that a series of file operations should continue to execute in the presence of conflicting operations across different partitions. Recall that two operations are said to conflict if they both operate on the same data and at least one of them is a write operation.
Let us first examine how conflicts may occur in the presence of network partitions.
Assume that two processes A and B hold identical replicas of various shared data
items just before they are separated as the result of a partitioning in the network.
Ideally, a file system supporting transactional semantics would implement one-copy
serializability, which is the same as saying that the execution of operations by A and B,
respectively, is equivalent to a joint serial execution of those operations on non-
replicated data items shared by the two processes. The main problem in the face of
partitions is to recognize serializable executions after they have taken place within a
partition. In other words, when recovering from a network partition, the file system is
confronted with a number of transactions that have been executed in each partition
(possibly on shadow copies, i.e., copies of files that were handed out to clients to
perform tentative modifications analogous to the use of shadow blocks in the case of
transactions). It will then need to check whether the joint executions can be serialized
in order to accept them. In general, this is an intractable problem.
Caching and Replication
Client Caching
Client-side caching is crucial to the operation of Coda for two reasons. First, and in
line with the approach followed in AFS, caching is done to achieve scalability. Second,
caching provides a higher degree of fault tolerance as the client becomes less
dependent on the availability of the server. For these two reasons, clients in Coda
always cache entire files. In other words, when a file is opened for either reading or
writing, an entire copy of the file is transferred to the client, where it is subsequently
cached.
Unlike many other distributed file systems, cache coherence in Coda is maintained by
means of callbacks. For each file, the server from which a client had fetched the file
keeps track of which clients have a copy of that file cached locally. A server is said to
record a callback promise for a client. When a client updates its local copy of the file for the first time, it notifies the server, which, in turn, sends an invalidation message to the other clients. Such an invalidation message is called a callback break, because the server then discards the callback promise it held for the clients to which it sent the invalidation.
Server Replication
Fault Tolerance
Coda has been designed for high availability, which is mainly reflected by its
sophisticated support for client-side caching and its support for server replication. An
interesting aspect of Coda that needs further explanation is how a client can continue
to operate while being disconnected, even if disconnection lasts for hours or days.
Besides providing high availability, the AFS and Coda developers have also looked at
simple mechanisms that help in building fault-tolerant processes. A simple and
effective mechanism that makes recovery much easier, is Recoverable Virtual
Memory (RVM). RVM is a user-level mechanism for maintaining crucial data structures in main memory while ensuring that they can easily be recovered after a crash failure. The details of RVM are described in the literature.
The basic idea underlying RVM is relatively simple: data that should survive crash
failures are normally stored in a file that is explicitly mapped into memory when
needed. Operations on that data are logged, similar to the use of a write-ahead log in the case of transactions. In fact, the model supported by RVM is close to that of flat
transactions, except that no support is provided for concurrency control. Once a file
has been mapped into main memory, an application can perform operations on that
data that are part of a transaction. RVM is unaware of data structures. Therefore, the
data in a transaction is explicitly set by an application as a range of consecutive bytes
of the mapped-in file. All (in-memory) operations on that data are recorded in a separate write-ahead log that needs to be kept on stable storage. Note that due to the generally small size of the log, it is feasible to use a battery-backed part of main memory, which combines durability with high performance.
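The write-ahead-log idea can be sketched as follows in Python; this is an illustrative stand-in, not RVM's actual API, and the log file name is an assumption.

import os

LOG = "rvm_sketch.log"   # assumed log file on stable storage

def set_range(data: bytearray, offset: int, new_bytes: bytes):
    # Log the old and new values of the declared byte range, force the
    # record to stable storage, and only then apply the in-memory change,
    # so the change can be redone or undone after a crash.
    old = bytes(data[offset:offset + len(new_bytes)])
    with open(LOG, "ab") as log:
        log.write(b"%d %s %s\n" % (offset, old.hex().encode(),
                                   new_bytes.hex().encode()))
        log.flush()
        os.fsync(log.fileno())
    data[offset:offset + len(new_bytes)] = new_bytes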
Security
Coda inherits its security architecture from AFS, which consists of two parts. The first
part deals with setting up a secure channel between a client and a server using secure
RPC and system-level authentication. The second part deals with controlling access to
files. We will not examine each of these in turn.