Fault Tolerance and Consensus: C. Bettini - Distributed and Pervasive Systems

Fault Tolerance and Consensus
Copyright
• The slides for this course are partly adapted from those distributed by the publishers of the recommended textbooks (Distributed Systems: Concepts and Design, 5/e, Coulouris, Dollimore, Kindberg & Blair, Addison-Wesley, and Distributed Systems: Principles and Paradigms, A. S. Tanenbaum, M. Van Steen, Prentice Hall, 2007). The remaining slides are the property of the course instructor. All the material is subject to copyright and cannot be redistributed without the consent of the copyright holder. The same holds for audio and video recordings of the lectures.
Fault Tolerance Basic Concepts
• Fault tolerance is closely related to the notion of dependable systems
• Dependability implies the following:

1. Availability. The probability that the system is operating correctly at any given moment
2. Reliability. The ability to run correctly, without interruption, for a long interval of time
3. Safety. A failure to operate correctly does not lead to catastrophic consequences
4. Maintainability. The ability to “easily” repair a failed system
The CAP theorem for distributed DBMS
The CAP theorem: only 2 of these 3 properties can be guaranteed at the same time
• Consistency: Every read receives the most recent write or an error (implies all nodes see the same data)
• Availability: Every request receives a response (implies the system is operational 100% of the time)
• Partition tolerance: The system keeps operating even if an arbitrary number of messages are lost (or delayed)
Failure Models
• Different types of failures.

Failure Masking by Redundancy
• Information redundancy
  • Example: extra bits used to recover the transmitted message when noise is present
• Time redundancy
  • Example: a transaction is repeated if it aborted because one of its actions failed
• Physical redundancy
  • Extra hardware or software (processes) used to recover from the failure of some component
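The information-redundancy example can be illustrated with a minimal sketch. The slides do not name a specific code; a 3-fold repetition code is assumed here only because it is the simplest case, and `encode` and `decode` are hypothetical helper names.

```python
def encode(bits, r=3):
    """Add redundancy: repeat every bit r times (r = 3 is an assumption)."""
    return [b for b in bits for _ in range(r)]


def decode(coded, r=3):
    """Recover each bit by majority vote over its r copies."""
    return [int(2 * sum(coded[i:i + r]) > r) for i in range(0, len(coded), r)]


message = [1, 0, 1, 1]
received = encode(message)

# Simulate noise: flip one copy in the first group and one in the second.
received[1] ^= 1
received[5] ^= 1

decoded = decode(received)
assert decoded == message  # one flipped copy per group is masked
```

With r = 3 the code masks at most one flipped copy per group; two flips in the same group would be decoded to the wrong bit.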
Failure Masking by Redundancy
• How many faulty devices are tolerated?
Process resilience
• Redundancy using groups of identical processes
• Each process receives the messages for the group
• A protocol and mechanism for joining and leaving a group is necessary
• How much redundancy is needed?
How much redundancy?

• If faulty processes just stop working (crash failure), k+1 processes provide k-fault tolerance.
• If faulty processes reply with wrong values, at least 2k+1 processes are needed (the client can decide by majority voting)
• If a consensus in the group is needed (e.g., to decide transaction commitment, election, …), the problem is more complicated
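The first two cases can be sketched as follows. This is only an illustration: `processes_needed` and `vote` are hypothetical names, and the replies are assumed to be plain comparable values collected by a client.

```python
from collections import Counter


def processes_needed(k, failure_model):
    """Minimum group size for k-fault tolerance in the two cases above."""
    if failure_model == "crash":
        return k + 1       # one surviving process is enough to answer
    if failure_model == "wrong-value":
        return 2 * k + 1   # k+1 correct replies outvote k wrong ones
    raise ValueError(f"unknown failure model: {failure_model}")


def vote(replies):
    """Return the reply given by a strict majority, or None if there is none."""
    value, count = Counter(replies).most_common(1)[0]
    return value if 2 * count > len(replies) else None


# k = 1: three replicas, one of which answers with a wrong value.
decided = vote([42, 42, 7])

# With only 2k = 2 replicas, a single wrong reply leaves the client undecided.
undecided = vote([42, 7])
```

Here `decided` is the value returned by the two correct replicas, while `undecided` is `None` because neither reply has a strict majority.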
Consensus in Faulty Systems (1)
A hard problem, whose solution depends on:
1. Whether processes are synchronous or asynchronous.
2. Whether communication delay is bounded or not.
3. Whether message delivery is ordered or not.
4. Whether messages are transmitted through unicasting or multicasting.
Byzantine Agreement (1)

• The agreement (or consensus) problems were studied by Lamport et al. in 1982.
• A specific problem: interactive consistency
  • N processes, k of which exhibit Byzantine behavior
  • Each process i produces a value V[i]
  • Goal: the (N-k) non-faulty processes need to reach agreement on a result vector of values (one for each process), isolating the faulty processes
• Assumptions for Lamport’s solution:
  • Synchronous processes; unicast communication; bounded delay; order of messages preserved
Example with N=4 and k=1.
Step 1: each process sends (with reliable unicast) its value to the others. The faulty process may send different values to different processes.
Step 2: each process assembles a vector from the received values (b).
Step 3: each process sends its vector to the others. The faulty process can send different arbitrary vectors. The vectors that each process receives are shown in (c).
Step 4:
• Each process examines the i-th element of each of the received vectors.
• If any value has a majority, it is put in the i-th element of the result vector.
• Otherwise, UNKNOWN is put in the i-th element of the result vector.
In the example, consensus is achieved on the values v1, v2 and v4 among the 3 non-faulty processes.
• It has been proved that at least 3k+1 processes are needed to solve this problem with k faulty processes.

N ≥ 3k+1 processes are needed for k faulty processes
• With N=3 and k=1 a consensus cannot be achieved!
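Steps 1 to 4 can be sketched as follows, under the slides' assumptions (synchronous processes, reliable ordered unicast). The function names and the `lie` callback, which models the arbitrary values sent by a faulty process, are hypothetical and chosen for illustration; this is a simplified sketch, not Lamport's original formulation.

```python
from collections import Counter

UNKNOWN = "UNKNOWN"


def majority(values):
    """Return the value held by a strict majority of `values`, else UNKNOWN."""
    value, count = Counter(values).most_common(1)[0]
    return value if 2 * count > len(values) else UNKNOWN


def interactive_consistency(values, faulty, lie):
    """values: {pid: own value}; faulty: set of pids; lie(sender, receiver, slot)."""
    pids = sorted(values)
    good = [p for p in pids if p not in faulty]

    def sent_value(sender, receiver):
        # Step 1: a correct sender reports its true value; a faulty one may lie.
        return lie(sender, receiver, sender) if sender in faulty else values[sender]

    # Step 2: each correct process assembles a vector (it knows its own value).
    vectors = {
        p: {q: values[p] if q == p else sent_value(q, p) for q in pids}
        for p in good
    }

    results = {}
    for p in good:
        # Step 3: p receives one vector from every other process.
        received = [
            {i: lie(q, p, i) for i in pids} if q in faulty else vectors[q]
            for q in pids if q != p
        ]
        # Step 4: majority vote on the i-th element of the received vectors.
        results[p] = [majority([vec[i] for vec in received]) for i in pids]
    return results


def lie(sender, receiver, slot):
    # Hypothetical adversary: a different bogus value per receiver and slot.
    return f"x{sender}{receiver}{slot}"


four = interactive_consistency({1: "v1", 2: "v2", 3: "v3", 4: "v4"}, {3}, lie)
three = interactive_consistency({1: "v1", 2: "v2", 3: "v3"}, {3}, lie)
```

In this run with four processes, every non-faulty process ends with the same result vector, in which only the slot of the faulty process 3 is UNKNOWN. With three processes no slot reaches a majority, so every element is UNKNOWN; a single run does not prove the bound, but it is consistent with the 3k+1 requirement.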
Impossibility of agreement in asynchronous systems
• When message delays are arbitrary (unbounded), there is no guaranteed solution to Byzantine agreement (nor to totally ordered multicast), even if only a single node fails.
• There are solutions for partially synchronous systems, which can be used to model practical systems.
Other important topics

• Reliable client-server communication
  • Example: RPC in the presence of failures
• Reliable group communication
  • Example: atomic multicast
• Distributed commit
  • Generalization of atomic multicast to arbitrary operations performed by members of a group (all-or-none semantics)
References
• Textbook (Coulouris et al.), chapter 15
• The CAP theorem: S. Gilbert and N. Lynch, "Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services", ACM SIGACT News, 2002. DOI:10.1145/564585.564601
• L. Lamport et al., "The Byzantine generals problem", ACM Trans. on Programming Languages and Systems, 1982. https://people.eecs.berkeley.edu/~luca/cs174/byzantine.pdf
• FLP consensus impossibility result: https://groups.csail.mit.edu/tds/papers/Lynch/jacm85.pdf