Types (from ChatGPT):
Atomic operations: use $set, $inc —> push the update into the database, refactoring business logic to avoid reading data and then writing it back (read-modify-write races).
Message queue pattern: instead of updating the DB directly, queue update commands and apply them sequentially
Distributed locking:
Optimistic locking: store a version marker on each record (date, timestamp, …), check that it hasn't been updated by another thread, and only then perform the update.
Saga pattern: manages distributed transactions across multiple services; instead of a single transaction spanning multiple services, each service executes its own local transaction and emits an event to trigger subsequent actions in other services
Event sourcing pattern: store and process events that represent changes to the DB. Each service publishes events when it performs an operation on the DB; other services subscribe to those events to update their own data. By relying on the log of events, we can ensure updates are applied in the correct order
CQRS (Command Query Responsibility Segregation): separate read and write operations
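The optimistic-locking idea above can be sketched with a version-checked compare-and-set. This is a minimal in-memory illustration (the `Store` and `StaleVersionError` names are hypothetical, not from any library); real databases express the same check as a conditional update on a version column.

```python
import threading

class StaleVersionError(Exception):
    pass

class Store:
    def __init__(self):
        self._data = {}          # key -> (version, value)
        self._lock = threading.Lock()

    def read(self, key):
        return self._data.get(key, (0, None))

    def compare_and_set(self, key, expected_version, value):
        # Succeeds only if nobody updated the record since we read it.
        with self._lock:
            current_version, _ = self._data.get(key, (0, None))
            if current_version != expected_version:
                raise StaleVersionError(key)
            self._data[key] = (current_version + 1, value)

store = Store()
ver, _ = store.read("balance")
store.compare_and_set("balance", ver, 100)      # version 0 -> 1, accepted
try:
    store.compare_and_set("balance", ver, 200)  # stale: version moved on
except StaleVersionError:
    print("retry: record changed under us")
```

On conflict the caller re-reads and retries, rather than blocking on a lock.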
💡 Outline
Fault tolerance comes from the distributed nature of the system and replication across storage nodes
References:
Slide presentation: https://www.canva.com/design/DAFwjbmpxmk/rsjC7nztswGSepOcUOJjjw/edit?utm_content=DAFwjbmpxmk&utm_campaign=designshare&utm_medium=link2&utm_source=sharebutton
Redlock: https://redis.io/docs/manual/patterns/distributed-locks/#liveness-arguments
A distributed lock plays the same role as normal locking when multiple processes try to access one file (shared memory), except that the processes run on different machines
When multiple machines try to access a shared resource, we can summarize the requirements into three properties (safety, liveness, fault tolerance):
💡 "Lock instance" is a general term and could be anything (ZooKeeper, AWS S3); Redis is just one specific implementation
Safety: refers to tracking the state of the lock correctly and guaranteeing that only one process holds it at a time. The following situations could compromise it:
Network or node failure: if the lock service (Redis) fails or becomes unreachable, the lock cannot be released, potentially leaving multiple processes holding the lock, or leaving the lock unreleased after the process is done
Redis crash or restart: the lock is lost and another process can acquire it again —> leading to concurrent lock holders
Liveness:
Network congestion and latency: can dramatically increase the time to acquire and release the lock; this could lead to livelock, where processes constantly wait for the lock without making progress
Performance bottleneck: a single Redis instance may struggle to handle the load, resulting in increased latency and reduced liveness.
💡 A single instance has limitations: it is a single point of failure —> add replicas, but see why a failover-based implementation isn't enough
The obvious solution is to add replicas to our cluster, but this violates Rule 1, the safety property:
When the master goes down before syncing the lock to a replica, that replica is elected master without knowing about the lock —> another process could then also acquire the lock
All Redis nodes hold keys for approximately the right length of time (expiry timing can drift between nodes), assuming network delay is small compared to the expiry duration.
Safety violation
A lock held by one client must not be deleted by another client: if a client holds the lock longer than the lock's validity time, the lock is automatically released and acquired by another client; the old client, when it finishes processing, could then remove the other client's lock
—> The simplest solution is to use a unique value per client, e.g. a microsecond-precision timestamp concatenated with the lock key, and check it before deleting
A bare DEL is unsafe: a client whose lock has already expired could delete another client's lock
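The "DEL is unsafe" point can be sketched as check-then-delete with a per-client token. This uses a plain dict standing in for Redis (single-threaded, so the get+del pair is atomic here); real Redis needs a Lua script to make the pair atomic, as the Redlock docs describe.

```python
import uuid

locks = {}   # key -> owner token, a stand-in for a Redis instance

def acquire(key):
    token = str(uuid.uuid4())   # unique per client, like SET key token NX PX ttl
    if key not in locks:
        locks[key] = token
        return token
    return None

def release(key, token):
    # Delete only if we still own the lock; a bare DEL could remove
    # a lock another client acquired after ours expired.
    if locks.get(key) == token:
        del locks[key]
        return True
    return False

t1 = acquire("resource")
# Simulate expiry: our lock vanished and another client grabbed it.
locks["resource"] = "other-client-token"
assert release("resource", t1) is False   # safe: we don't delete theirs
```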
Problem:
Assume a client has acquired the lock on 3 out of 5 instances, but one of those 3 goes down and restarts with its lock lost; now 3 instances can grant the lock to other clients
Solution
With a standalone Redis instance: enable AOF persistence so the lock data survives a restart
With multiple Redis nodes: ensure data is replicated correctly between replicas, so that after leader election the lock data remains
Another simple but bad solution is to delay the restarted instance for longer than the TTL of any currently active lock, so it cannot rejoin a lock round it has forgotten. But delayed restart costs availability: if a majority of instances go down, the system becomes unavailable, meaning no resource at all can be locked.
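The 3-of-5 majority rule behind these scenarios can be sketched as follows. This is a toy, assuming each "instance" is just a dict (no networking, no per-node TTL clocks); `try_acquire` is a hypothetical name, not the Redlock API.

```python
import time
import uuid

def try_acquire(instances, key, ttl_ms):
    token = str(uuid.uuid4())
    start = time.monotonic()
    granted = 0
    for inst in instances:          # each dict stands in for one Redis node
        if key not in inst:         # like SET key token NX
            inst[key] = token
            granted += 1
    elapsed_ms = (time.monotonic() - start) * 1000
    quorum = len(instances) // 2 + 1
    if granted >= quorum and elapsed_ms < ttl_ms:
        return token                # held, with validity ttl_ms - elapsed_ms
    for inst in instances:          # no quorum: undo the grants we did get
        if inst.get(key) == token:
            del inst[key]
    return None

nodes = [{}, {}, {"job": "someone-else"}, {}, {}]   # 1 of 5 already locked
token = try_acquire(nodes, "job", ttl_ms=10_000)
print(token is not None)   # True: 4 of 5 granted, quorum is 3
```

A restarted node that lost its grant is exactly one of the dicts going empty again, which is how two clients can both reach quorum.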
Fault tolerance:
When the service holding the lock suddenly goes down, other machines trying to get the lock are blocked and fail
Or the process holding the lock runs too long, possibly stuck somewhere
Use a fencing token, a number that indicates the version of a record in storage. Bear in mind not to use a timestamp, because it is not reliable: in a fault-tolerant distributed system a machine can go down, restart, and come back with a different clock
Redis does not support fencing-token generation across several nodes, because a counter kept on multiple nodes will drift out of sync; we would need a strict consensus algorithm
—> Set up a periodic heartbeat to the service holding the lock; if there is no response, we can assume the service is dead and release the lock
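The heartbeat/TTL-extension idea can be sketched with a watchdog thread that keeps renewing a lease while the owner is alive, and simply stops when the owner dies, letting the lock expire. The in-memory `expiry` table is a stand-in; a real system would renew against the lock service (e.g. a PEXPIRE-style call) instead.

```python
import threading
import time

expiry = {"job": time.monotonic() + 1.0}   # lock initially expires in 1 s

def watchdog(key, interval, stop):
    # Renew the lease every `interval` seconds until told to stop;
    # if this process dies, renewals stop and the lock just expires.
    while not stop.is_set():
        expiry[key] = time.monotonic() + 1.0
        stop.wait(interval)

stop = threading.Event()
t = threading.Thread(target=watchdog, args=("job", 0.2, stop), daemon=True)
t.start()
time.sleep(1.5)                            # longer than the original 1 s TTL
still_held = expiry["job"] > time.monotonic()
stop.set()
t.join()
print(still_held)   # True: heartbeats kept extending the TTL
```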
💡 We can use heartbeats, at the cost of some complexity in our lock service, to extend the TTL of a lock.
The importance of a consensus algorithm: suppose we have 3 locking nodes; when a machine wants to grab a lock, it needs the grant from a majority of them (at least 2 of 3, matching the 3-of-5 example above) for the acquisition to succeed
💡 A consensus algorithm ensures a consistent view of the system despite potential failures, network partitions, or delays across distributed nodes. It can be used to coordinate and establish agreement among multiple nodes on a particular value or decision.
—> The important point in building a distributed system is that the lock is replicated in a fault-tolerant manner
Designing the distributed lock
Have an odd number of nodes, one of which is the leader. Every write is sent to the leader, and the leader forwards the write to all followers; when the leader receives OK responses from the followers, it commits the write locally and tells the other nodes the write is OK
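The leader write path above can be sketched as follows. A toy, assuming followers are plain lists that always acknowledge; it waits for all followers as these notes describe, whereas real Raft commits once a majority has acknowledged.

```python
class Leader:
    def __init__(self, followers):
        self.followers = followers
        self.log = []               # committed entries

    def write(self, entry):
        acks = 0
        for f in self.followers:
            f.append(entry)         # forward the write to the follower
            acks += 1               # follower acknowledged
        if acks == len(self.followers):
            self.log.append(entry)  # commit locally, then announce OK
            return True
        return False

followers = [[], []]
leader = Leader(followers)
leader.write("lock:job -> client-42")
```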
Leader election: a heartbeat service detects a failed leader and elects a new one
Consider scaling
Imagine that when the lock is released, a thousand processes try to grab it on a single Raft instance; that is a bad situation (a thundering herd)
A solution is to queue waiting processes in a linked list, ordered by arrival time, and hand the lock to the next waiter
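The queued hand-off can be sketched with a FIFO: instead of every waiter racing for the lock on release, the holder passes it directly to the next queued waiter (the `QueuedLock` name is hypothetical).

```python
from collections import deque

class QueuedLock:
    def __init__(self):
        self.holder = None
        self.waiters = deque()          # linked-list-backed FIFO

    def acquire(self, client):
        if self.holder is None:
            self.holder = client
            return True
        self.waiters.append(client)     # wait in arrival order
        return False

    def release(self, client):
        assert self.holder == client
        # Hand the lock to the next waiter instead of a free-for-all.
        self.holder = self.waiters.popleft() if self.waiters else None
        return self.holder

lock = QueuedLock()
lock.acquire("A")            # A holds the lock
lock.acquire("B")            # B queues
lock.acquire("C")            # C queues behind B
print(lock.release("A"))     # prints B: handed directly to the next waiter
```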
Correctness: taking a lock prevents multiple processes from modifying the same piece of data, which would lead to corruption, data loss & permanent inconsistency
Distributed lock
Requirements:
Storage must actively check token validity and reject writes carrying any token that has gone backward.
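The storage-side check can be sketched as follows: the store remembers the highest fencing token seen per key and rejects any write whose token has gone backward (the `FencedStore` name is hypothetical).

```python
class FencedStore:
    def __init__(self):
        self.data = {}
        self.max_token = {}   # highest fencing token seen per key

    def write(self, key, value, token):
        if token <= self.max_token.get(key, -1):
            return False      # stale lock holder: reject the write
        self.max_token[key] = token
        self.data[key] = value
        return True

store = FencedStore()
store.write("file", "v1", token=33)         # old holder writes first
store.write("file", "v2", token=34)         # new holder, higher token
print(store.write("file", "v3", token=33))  # prints False: fenced out
```

A holder that paused (GC, network) past its lock's validity comes back with an old token, so its late write cannot clobber the new holder's data.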
Redis cannot safely generate fencing tokens: simply keeping a counter on one node is not sufficient, since that node could fail, and with multiple nodes the counters would go out of sync. If we want to use Redlock this way, we need to implement a consensus algorithm like Raft or Paxos
Raft machine