Distributed Systems: Fault Tolerance: Fall 2013
Jussi Kangasharju
Chapter Outline
Dependability includes
• Availability
• Reliability
• Safety
• Maintainability
[Figure: fault, error, and failure between a server and a client]
Timing failure: A server's response lies outside the specified time interval
State transition failure: The server deviates from the correct flow of control
Detection
• redundant information
  - error detecting codes (parity, checksums)
  - replicas
• redundant processing
  - groupwork and comparison
• control functions
  - timers
  - acknowledgements
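The error detecting codes listed above can be illustrated with a minimal Python sketch. The helper names `frame`, `check`, and `even_parity_bit` are hypothetical, and CRC-32 is chosen here only as one common checksum; the slides do not prescribe a particular code.

```python
import zlib

def frame(payload: bytes) -> bytes:
    """Append a 4-byte big-endian CRC-32 checksum to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def check(framed: bytes):
    """Return the payload if its checksum matches, otherwise None."""
    payload, received = framed[:-4], framed[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") == received:
        return payload
    return None  # error detected; the code cannot say which bit is wrong

def even_parity_bit(byte: int) -> int:
    """Parity bit that makes the total number of 1 bits even."""
    return bin(byte).count("1") % 2
```

Both codes only detect an error. Repairing it needs either an error correcting code or a retransmission, which is where timers and acknowledgements come in.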
Recovery
• redundant information
  - error correcting codes
  - replicas
• redundant processing
  - time redundancy
    - retrial
    - recomputation (checkpoint, log)
  - physical redundancy
    - groupwork and voting
    - tightly synchronized groups
    - hierarchical group: a primary and a cold standby (checkpoint, log)
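Two of the recovery techniques above, retrial (time redundancy) and voting (physical redundancy), can be sketched as follows. This is a minimal illustration, not a complete recovery layer; the function names `retry` and `majority_vote` and the default of three attempts are assumptions.

```python
from collections import Counter

def retry(operation, attempts=3):
    """Time redundancy: repeat the operation until it succeeds."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except Exception as exc:  # treat any exception as a failed attempt
            last_error = exc
    raise last_error

def majority_vote(results):
    """Physical redundancy: accept the value a strict majority of replicas report."""
    value, count = Counter(results).most_common(1)[0]
    if count * 2 > len(results):
        return value
    raise ValueError("no majority among replica results")
```

Retrial only helps against transient faults, while voting masks a faulty replica as long as a majority of replicas remain correct.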
• Technical basis
• “group” – a single abstraction
• reliable message passing
(for each Pi in G) { send m to Pi }
[Figure: A simple solution to reliable multicasting when all receivers are known and are assumed not to fail. (a) Message transmission. (b) Reporting feedback.]
Implementation of Basic_multicast(G, m) :
1. for each pi in G: send(pi,m) (a reliable one-to-one send)
2. on receive(m) at pi : deliver(m) at pi
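The two steps of Basic_multicast can be sketched in Python as below. The `Process` class and the in-memory `send` function are hypothetical stand-ins; `send` simply calls the receiver directly and so models the reliable one-to-one send that the slide assumes.

```python
class Process:
    def __init__(self, name):
        self.name = name
        self.delivered = []  # messages handed to the application

    def receive(self, m):
        # Step 2: on receive(m) at pi, deliver(m) at pi.
        self.deliver(m)

    def deliver(self, m):
        self.delivered.append(m)

def send(p, m):
    """Stand-in for a reliable one-to-one send (assumed never to lose m)."""
    p.receive(m)

def basic_multicast(group, m):
    # Step 1: for each pi in G, send(pi, m).
    for p in group:
        send(p, m)
```

If the sender crashes partway through the loop, only some members receive m, so basic multicast on its own gives no all-or-nothing guarantee.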
Delivery of messages
- new message => hold-back queue (HBQ)
- decision making
  - delivery order
  - deliver or not to deliver?
[Figure: application, delivery, hold-back queue]
Assume
• reliable point-to-point communication
• group G={pi}: each pi : groupview
Reliable_multicast (G, m):
if a message is delivered to one in G,
then it is delivered to all in G
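One common way to obtain this property on top of basic multicast is for every process to relay a message to the whole group the first time it receives it, and only then deliver it. The slides state the property, not this algorithm, so the sketch below is an assumption; the class and attribute names are hypothetical, and messages are assumed to be unique and hashable.

```python
class Process:
    def __init__(self, pid):
        self.pid = pid
        self.group = []        # group view: all members, including self
        self.alive = True      # crashed processes ignore everything
        self.received = set()  # messages already seen
        self.delivered = []    # messages handed to the application

    def b_multicast(self, msg):
        # Basic multicast over (assumed reliable) point-to-point sends.
        for p in self.group:
            p.on_receive(msg, sender=self)

    def r_multicast(self, msg):
        self.b_multicast(msg)

    def on_receive(self, msg, sender):
        if not self.alive or msg in self.received:
            return                 # crashed, or a duplicate copy
        self.received.add(msg)
        if sender is not self:
            self.b_multicast(msg)  # relay before delivering
        self.deliver(msg)

    def deliver(self, msg):
        self.delivered.append(msg)
```

Because a process relays before it delivers, any message delivered at one correct process has already been sent on to every other member, even if the original sender crashed after reaching a single receiver. The price is a number of messages that grows quadratically with the group size.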
a) Process 4 notices that process 7 has crashed, sends a view change
b) Process 6 sends out all its unstable messages, followed by a flush message
c) Process 6 installs the new view when it has received a flush message from everyone else
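Steps (a) to (c) can be approximated by the simplified, single-threaded sketch below. It assumes a hypothetical driver function `change_view` that plays the role of the network and runs the phases in order; a real implementation is asynchronous and must also handle further crashes during the view change.

```python
class Member:
    def __init__(self, pid, view):
        self.pid = pid
        self.view = frozenset(view)  # currently installed view
        self.unstable = []           # messages not yet known to have reached everyone
        self.delivered = []
        self.flushed_by = set()      # pids from which a flush has been received

    def receive(self, m):
        if m not in self.delivered:
            self.delivered.append(m)

def change_view(members, new_view):
    """members: dict pid -> Member of the survivors; new_view: pids in the new view."""
    new_view = frozenset(new_view)
    survivors = [members[pid] for pid in sorted(new_view)]
    # (b) every survivor sends its unstable messages, then a flush, to the others
    for p in survivors:
        for q in survivors:
            if q is not p:
                for m in p.unstable:
                    q.receive(m)
                q.flushed_by.add(p.pid)
    # (c) install the new view once a flush has arrived from everyone else
    for p in survivors:
        if p.flushed_by == new_view - {p.pid}:
            p.unstable.clear()
            p.view = new_view
            p.flushed_by = set()
```

The point of the flush is that every message delivered by some survivor in the old view has reached all survivors before any of them moves on to the new view.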
Ordered Multicast
Need: all messages are delivered in the intended order
[Table, last rows only: two of the processes receive m2 and then m4]
Four processes in the same group with two different senders, and a possible delivery order of messages under FIFO-ordered multicasting
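FIFO ordering is commonly implemented with per-sender sequence numbers and a hold-back queue: a message is delivered only when every earlier message from the same sender has been delivered. The sketch below assumes that scheme; the class name `FifoReceiver` and the sequence numbers starting at 0 are illustrative choices, not taken from the slides.

```python
class FifoReceiver:
    def __init__(self):
        self.next_seq = {}   # sender -> next sequence number expected
        self.hold_back = {}  # (sender, seq) -> message waiting for its turn
        self.delivered = []  # messages handed to the application, in order

    def receive(self, sender, seq, msg):
        expected = self.next_seq.get(sender, 0)
        if seq < expected:
            return  # already delivered: ignore the duplicate
        self.hold_back[(sender, seq)] = msg
        # Deliver as many consecutive messages from this sender as possible.
        while (sender, expected) in self.hold_back:
            self.delivered.append(self.hold_back.pop((sender, expected)))
            expected += 1
        self.next_seq[sender] = expected
```

Messages from different senders may still interleave differently at different receivers, which FIFO ordering permits.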
• Fault tolerance: recovery from an error (erroneous state => error-free state)
• Two approaches
  - backward recovery: back into a previous correct state
  - forward recovery:
    - detect that the new state is erroneous
    - bring the system into a correct new state
    - challenge: the possible errors must be known in advance
• forward: continuous need for redundancy
• backward:
[Figure: processes P1, P2, and P3]
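Backward recovery for a single process can be sketched with a checkpoint taken before each update and a rollback when the resulting state turns out to be erroneous. This is a minimal sketch; the names `CheckpointedState` and `apply_with_recovery` and the `is_valid` predicate are hypothetical.

```python
import copy

class CheckpointedState:
    def __init__(self, state):
        self.state = state
        self._checkpoint = copy.deepcopy(state)

    def checkpoint(self):
        """Save a copy of the current (assumed correct) state."""
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        """Return to the most recent checkpoint."""
        self.state = copy.deepcopy(self._checkpoint)

def apply_with_recovery(store, update, is_valid):
    """Apply update in place; roll back if it fails or leaves an invalid state."""
    store.checkpoint()
    try:
        update(store.state)
    except Exception:
        store.rollback()
        return False
    if not is_valid(store.state):
        store.rollback()
        return False
    return True
```

For example, with `store = CheckpointedState({"balance": 10})`, an update that drives the balance below zero is undone when `is_valid` requires a non-negative balance. With several communicating processes, the checkpoints additionally have to form a consistent global state, which this single-process sketch does not address.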