CS 382: Network-Centric Computing
Replication, Failures, Consistency
Zartash Afzal Uzmi
Spring 2023-24
ACK: Slides use some material from Scott Shenker (UC Berkeley) and Jim Kurose (UMass)
Agenda
⚫ Replication (of?)
⚫ Failures and Correctness
⚫ Primary-backup Protocols
⚫ Asynchronous protocols
⚫ Synchronous protocols
⚫ Two-phase Commit (2PC) Protocol
⚫ Example
2
Replication
3
Replication
⚫ When we replicate an object, we create copies of the object and store
them on different servers
⚫ Each copy is called a replica
4
Why Replication?
⚫ Fault tolerance
⚫ If one replica crashes, simply switch to another replica
⚫ With k replicas of each object, the system can tolerate the failure of any (k-1) of the
servers holding those replicas
⚫ Performance
⚫ Helps when scaling for size or geographical area
⚫ Load balancing: divide the workload among multiple servers
⚫ Placing the copy of the object in proximity of the client, e.g., CDNs
5
Nature of Replicated Data
⚫ Read-only data
⚫ Easy to replicate; we just make multiple copies
⚫ Read-write data
⚫ Writes cause the replicas to diverge. Any challenge?
⚫ Replicas must be kept consistent; modifications need to be propagated to all
other copies!
⚫ What do applications need: Read-only or Read-write?
⚫ We want the distributed system with multiple replicas to appear as if there were one
copy on a single machine, i.e., we want Read-write data replication!
⚫ Challenge:
⚫ When and how to propagate write updates?
6
Depends on Application Requirements
⚫ What do applications require?
⚫ From replicated/distributed systems
⚫ Availability
⚫ The application is operational and instantly processes requests
⚫ Some server failures do not prevent surviving servers from continuing to operate
⚫ Partition Tolerance:
⚫ The application continues to operate despite message loss due to network partition
7
Network Partitions Divide Systems
8
Network Partitions Divide Systems
9
Fundamental Tradeoff
⚫ Replicas appear to be a single [consistent] machine but lose availability
during a network partition
⚫ OR
⚫ All replicas remain available during a network partition but do not
appear to be a single machine (inconsistent data!!!)
10
CAP Theorem Preview
⚫ You cannot achieve all three of:
1. Consistency
2. Availability
3. Partition Tolerance
⚫ Consistency ➙ Replicas Act Like Single Machine
⚫ Availability ➙ All Sides of Partition Continue
⚫ Partition Tolerance ➙ Partitions Can Happen
11
CAP Conjecture [Brewer 00]
⚫ From a keynote lecture by Eric Brewer (2000)
⚫ History: Eric started Inktomi, an early Internet search site based around
“commodity” clusters of computers
⚫ Popular interpretation: 2-out-of-3
⚫ Consistency
⚫ Availability
⚫ Partition Tolerance
12
CAP Theorem [Gilbert Lynch 02]
Assume that an algorithm provides all of CAP (to contradict!)
Let us start with a variable x=0 consistently stored at A and B
[Diagram: two clients; servers A and B each consistently store x=0]
13
CAP Theorem [Gilbert Lynch 02]
Assume that an algorithm provides all of CAP (to contradict!)
[Diagram: a client issues w(x=1) to A; A replies ok, so the write eventually returns (from A). A partition between A and B is possible.]
14
CAP Theorem [Gilbert Lynch 02]
Assume to contradict that an algorithm provides all of CAP
[Diagram: a client issues w(x=1) to A and gets ok, so the write eventually returns. After the write completes, another client issues r(x) to B; the read eventually returns x=0. A partition between A and B is possible.]
15
CAP Theorem [Gilbert Lynch 02]
Assume to contradict that an algorithm provides all of CAP
[Diagram, as before: w(x=1) returns ok from A; r(x) at B begins after the write completes, yet eventually returns x=0 under a possible partition]
Not consistent (C) => contradiction!
16
CAP Interpretation Part 1
⚫ Cannot “choose” no partitions
⚫ 2-out-of-3 interpretation doesn’t make sense
⚫ Instead, availability OR consistency?
⚫ i.e., a fundamental tradeoff between availability and consistency
⚫ When designing a system, you must choose one or the other; both are not possible
simultaneously
17
CAP Interpretation Part 2
⚫ It is a theorem, with a proof that you now understand!
⚫ Cannot “beat” CAP Theorem
⚫ Can engineer systems to make partitions extremely rare, and then just
take the rare hit to availability (or consistency)
18
Some Real Distributed Systems Relax
Consistency Constraints …
Memcache at Facebook
19
Questions?
20
Node Failures and Correctness
21
Failure Model: Fail-Stop
Node Fails!
Nodes fail by crashing
A machine is either working correctly or it is doing nothing
22
Failure Model: Byzantine Failures
Node Fails
Node operates arbitrarily after a failure
(this includes not sending messages at all or sending different and wrong
messages to different servers or lying about a value)
Can be caused by
• Malicious attacks
• Software errors
23
Failures in Distributed Systems (DS)
⚫ What distinguishes DS from single-machine systems:
⚫ Some nodes might still be working correctly while others are experiencing failures
⚫ DS may continue to operate even if part of it is failing
⚫ A design goal for distributed systems:
⚫ “Correctly” operate even when failures occur
24
Correctness for Strong Consistency
⚫ Replicas act like a single machine
⚫ Specifically
⚫ If one node commits an update, no other replica rejects it
⚫ If one replica rejects it, no one commits the update
25
We will assume the Fail-Stop model
⚫ Many consistency protocols assume fail-stop model
⚫ Reason: Byzantine failures are very hard to deal with and usually add
substantial performance overheads
⚫ However, Byzantine fault-tolerant solutions are important in certain
contexts
⚫ Covered in the advanced course “CS 582: Distributed Systems”
⚫ Also covered in “CS3812: Intro to Blockchain”
26
Primary-backup protocols
27
Primary-backup protocols
⚫ One special node (primary) orders requests
⚫ Route all updates through the primary node
⚫ Assigns an order to the updates
⚫ All replicas commit updates in the assigned order
⚫ Simple to implement
28
Two Types of Primary-backup Schemes
⚫ Asynchronous primary-backup protocol
⚫ Synchronous primary-backup protocol
29
Primary-backup protocol
Client
Replica Primary Replica
30
Primary-backup protocol
Client
Write-request
Replica Primary Replica
31
Primary-backup protocol
• The primary sends an ACK to the client once it has performed the write locally, OR
• It waits for all replicas first to perform the update and then sends the ACK
[Diagram: Client sends Write-request to the Primary; Replicas on either side]
32
Primary-backup protocol [asynchronous version]
Step 1: Client sends Write-request to the Primary
33
Primary-backup protocol [asynchronous version]
Step 1: Client sends Write-request to the Primary
Step 2: Primary replies to the client with “ACK write completed”
34
Primary-backup protocol [asynchronous version]
Step 1: Client sends Write-request to the Primary
Step 2: Primary replies to the client with “ACK write completed”
Step 3: Primary tells the replicas to update
35
Primary-backup protocol [asynchronous version]
Step 1: Client sends Write-request to the Primary
Step 2: Primary replies to the client with “ACK write completed”
Step 3: Primary tells the replicas to update
Step 4: Replicas send Update ACKs to the Primary
36
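As a rough illustration of the four steps above, here is a minimal Python sketch of the asynchronous protocol (all class and method names are hypothetical, not part of the slides): the primary applies the write, ACKs the client immediately, and only afterwards propagates the update to the backups in the background.

```python
import threading

class Replica:
    """Hypothetical backup replica (illustration only)."""
    def __init__(self):
        self.store = {}

    def apply(self, key, value):
        # Apply the propagated update and acknowledge it
        self.store[key] = value
        return "Update ACK"

class AsyncPrimary:
    """Hypothetical asynchronous primary (illustration only)."""
    def __init__(self, replicas):
        self.store = {}           # primary's local key-value store
        self.replicas = replicas  # backup replicas

    def handle_write(self, key, value):
        # Steps 1-2: apply the write locally and ACK the client right away
        self.store[key] = value
        # Steps 3-4 run in the background, after the client has its ACK
        threading.Thread(target=self._propagate, args=(key, value)).start()
        return "ACK: write completed"

    def _propagate(self, key, value):
        # If the primary crashes before this runs, the update never
        # reaches the backups (the correctness issue discussed below)
        for r in self.replicas:
            r.apply(key, value)

# Example: the client's ACK does not wait on the replicas
primary = AsyncPrimary([Replica(), Replica()])
print(primary.handle_write("x", 1))
```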
Analyzing the Asynchronous Version
⚫ Performance (w/o failures):
⚫ The client does not need to spend any additional time waiting for the internals of
the system to do their work
⚫ The system is also more tolerant of network latency since fluctuations in internal
latency do not cause additional waiting on the client-side
⚫ What about correctness (with failures)?
37
What could go wrong if there are failures?
⚫ If the primary fails before the updates are sent to the backups,
⚫ Then updates may be lost
⚫ Correctness can get violated under failures
⚫ If one server commits (update), no one rejects it
⚫ If one rejects it, no one commits
⚫ From the client’s perspective
⚫ There are no guarantees that you can read back what you wrote if there are any
failures in the system – no “Read Your Writes” consistency!
38
Primary-backup protocol [synchronous version]
Step 1: Client sends Write-request to the Primary
39
Primary-backup protocol [synchronous version]
Step 1: Client sends Write-request to the Primary
Step 2: Primary tells the replicas to update
40
Primary-backup protocol [synchronous version]
Step 1: Client sends Write-request to the Primary
Step 2: Primary tells the replicas to update
Step 3: Replicas send Update ACKs to the Primary
41
Primary-backup protocol [synchronous version]
Step 1: Client sends Write-request to the Primary
Step 2: Primary tells the replicas to update
Step 3: Replicas send Update ACKs to the Primary
Step 4: Primary replies to the client with “ACK write completed”
42
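For comparison, a minimal sketch of the synchronous variant (again with hypothetical names, reusing the Replica class from the earlier sketch): the primary does not ACK the client until every replica has acknowledged the update.

```python
class SyncPrimary:
    """Hypothetical synchronous primary (illustration only)."""
    def __init__(self, replicas):
        self.store = {}
        self.replicas = replicas

    def handle_write(self, key, value):
        # Step 1: the client's write-request arrives at the primary
        # Steps 2-3: tell every replica to update and wait for its Update ACK
        for r in self.replicas:
            r.apply(key, value)   # blocks until this replica acknowledges
        # Step 4: only now apply locally and ACK the client
        self.store[key] = value
        return "ACK: write completed"
```

If a replica crashes inside that loop, the client never receives the ACK even though another replica may already have applied the update, which is exactly the failure scenario illustrated in the following slides.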
Analyzing the Synchronous Version
⚫ Performance (w/o failures):
⚫ Clients have to wait for additional time – used for synchronizing replicas
⚫ Network delay fluctuations inside the system impact the client
⚫ Can this version guarantee correctness with failures?
43
What could go wrong if there are failures?
44
Failure Scenario
The client sends a Write-request to the Primary
45
Failure Scenario
The client sends a Write-request to the Primary
The Primary tells both replicas to update
46
Failure Scenario
The client sends a Write-request to the Primary
The Primary tells both replicas to update
One replica commits the write update and sends an Update ACK; the other replica crashes
47
Failure Scenario
The client sends a Write-request to the Primary
The Primary tells both replicas to update
One replica commits the write update and sends an Update ACK; the other replica crashes
The client assumes the write update failed
48
Failure Scenario
The client sends a Write-request to the Primary
The Primary tells both replicas to update
One replica commits the write update and sends an Update ACK; the other replica crashes
The client assumes the write update failed
Replicas do not agree: we need a way to roll back updates
49
Two Phase Commit (2-PC)
⚫ Consists of two distinct phases
⚫ Used in several distributed systems
50
Two Phase Commit (2-PC)
51
Key idea
⚫ Allow the system to roll back updates on failures
⚫ By using two phases
⚫ This is in contrast to single-phase primary-based protocols
⚫ Where there is no step for rolling back an operation that has failed on some nodes
and succeeded on other nodes
52
Terminology
⚫ Primary Node
⚫ Coordinator
⚫ Replicas
⚫ Cohort, worker, participant
⚫ Prepare phase (also voting phase OR commit-request phase)
⚫ Replicas become prepared
⚫ Commit phase
⚫ Transactions are committed across the system
53
2-PC
Coordinator Replica 1 … Replica N-1
54
2-PC
Coordinator Replica 1 … Replica N-1
Coordinator sends Prepare to each replica
Each replica saves the update to disk and responds with Yes or No
55
2-PC
Coordinator Replica 1 … Replica N-1
Coordinator sends Prepare to each replica
Each replica saves the update to disk and responds with Yes or No
Coordinator receives Yes from all replicas within the timeout
56
2-PC
Coordinator Replica 1 … Replica N-1
Coordinator sends Prepare to each replica
Each replica saves the update to disk and responds with Yes or No
Coordinator receives Yes from all replicas within the timeout
Coordinator sends Commit to each replica
Each replica commits the update from disk to its store and sends an ACK
57
2-PC
Coordinator Replica 1 … Replica N-1
Coordinator sends Prepare to each replica
Each replica saves the update to disk and responds with Yes or No
If any replica votes No, or the timeout expires before all votes arrive, the Coordinator sends Abort to each replica; replicas ACK the Abort
58
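Putting both phases together, a compact sketch of the coordinator's logic under the fail-stop assumption (run_2pc and the prepare/commit/abort calls are hypothetical; a matching replica side is sketched after the "Failures in 2-PC" slide below):

```python
def run_2pc(replicas, update, log, timeout=1.0):
    """Hypothetical two-phase commit coordinator (illustration only)."""
    # Phase 1 (prepare / voting): ask every replica to save the update to disk
    votes = []
    for r in replicas:
        try:
            votes.append(r.prepare(update, timeout=timeout))  # "yes" or "no"
        except TimeoutError:
            votes.append("no")   # a missing vote is treated like a No

    # Commit only if every replica voted Yes within the timeout
    decision = "commit" if all(v == "yes" for v in votes) else "abort"
    log.append(decision)         # log the decision before telling anyone

    # Phase 2 (commit / abort): broadcast the decision; replicas ACK
    for r in replicas:
        if decision == "commit":
            r.commit(update)
        else:
            r.abort(update)
    return decision
```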
Failures in 2-PC: Food for Thought
⚫ If a server votes Yes, can it commit unilaterally before receiving the commit
message?
⚫ If a server votes No, can it abort right away without waiting for an abort
message?
59
Failures in 2-PC
⚫ To deal with replica crashes
⚫ Each replica saves tentative updates into permanent storage, right before replying
Yes/No in the first phase
⚫ Retrievable after crash recovery
⚫ To deal with coordinator crashes
⚫ The coordinator logs all decisions and received/sent messages on disk
⚫ After recovery or new election ➙ new coordinator takes over
60
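On the replica side, the point of the first bullet above is that the tentative update reaches stable storage before the Yes vote is sent, so it survives a crash. A hypothetical sketch using a simple append-only log (names and log format are illustrative):

```python
import json, os

class TwoPCReplica:
    """Hypothetical 2PC participant with a write-ahead log (illustration only)."""
    def __init__(self, log_path):
        self.log_path = log_path   # append-only log on stable storage
        self.store = {}

    def _log(self, record):
        # Append and flush so the record survives a crash
        with open(self.log_path, "a") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def prepare(self, update, timeout=None):
        # Save the tentative update to disk *before* voting Yes
        self._log({"phase": "prepared", "update": update})
        return "yes"

    def commit(self, update):
        # Move the prepared update into the actual store
        self.store.update(update)
        self._log({"phase": "committed", "update": update})

    def abort(self, update):
        self._log({"phase": "aborted", "update": update})
```

On recovery, such a replica would replay its log and ask the (possibly new) coordinator about any update that is prepared but neither committed nor aborted.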
Correctness and Performance of 2-PC
⚫ Correctness: All hosts that decide reach the same decision
⚫ No commit unless everyone says “yes”
⚫ Performance: under failures, 2PC might block
⚫ Failure of any process can result in non-progress
⚫ Doesn’t tolerate failures well: must wait for repair
61
2-PC Summary
⚫ Primary-backup schemes (single-phase protocols)
⚫ Safety may be violated ➙ Can’t roll back updates
⚫ Two-Phase Commit
⚫ Allow for rolling back updates
⚫ Sensitive to coordinator failure ➙ blocking
62
Summing it up …
⚫ Replication is needed for fault tolerance and performance
⚫ But it introduces challenges
⚫ CAP theorem shows we cannot provide all three properties: strong consistency,
availability, and partition tolerance
⚫ How to design a consistency protocol for a replicated system?
⚫ Primary-backup protocols: asynchronous and synchronous
⚫ Need more than one phase to provide consistency under failures
63
Advanced Topics on Consistency
⚫ Paxos
⚫ Raft
⚫ PBFT
⚫ Blockchain
64
Questions?
66