Design locking service

Created @September 27, 2023 5:52 PM

Last Edited Time @October 18, 2023 4:59 PM


(From chatgpt)

Atomic operation: use $set, $inc -> refactor the business handling logic to avoid getting data and then
updating it in a separate step

Message queue pattern: instead of updating the DB directly, we could queue update commands

Distributed locking:

Optimistic locking: note a marker on the record (date, timestamp, …) to ensure it hasn’t been
updated by another thread, and then perform my updates.

Use updateAndReturn to mark the record's timestamp, and then update later.

Saga pattern: manages distributed transactions across multiple services. Instead of a single
transaction spanning multiple services, each service executes its own local transaction and emits
an event to trigger subsequent actions in other services

Event sourcing pattern: store and process events that represent changes to the DB. Each service
publishes events when it performs an operation on the DB, and other services subscribe to those events
to update their own data. By relying on the log of events, we can ensure updates are applied in the
correct order

CQRS (command query responsibility segregation): separate read and write operations

Idempotent operation: design microservices to support idempotent operations. An idempotent
operation can be executed multiple times without changing the result beyond the initial application,
so you can handle retries and duplicate requests without causing data inconsistencies.
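
The optimistic-locking note above can be sketched as a version check at write time. This is a minimal in-memory sketch, not a MongoDB API: `VersionedStore`, `read`, and `update_if_version` are hypothetical names, and a real store would have to perform the compare and the update as one atomic conditional write.

```python
class VersionedStore:
    """Toy in-memory store; every record carries a version number (assumption)."""

    def __init__(self):
        self._records = {}  # key -> (version, value)

    def read(self, key):
        # Returns (version, value); a missing key reads as version 0.
        return self._records.get(key, (0, None))

    def update_if_version(self, key, expected_version, new_value):
        # Compare-and-set: write only if nobody bumped the version since our read.
        current_version, _ = self.read(key)
        if current_version != expected_version:
            return False  # another writer got in first; caller re-reads and retries
        self._records[key] = (current_version + 1, new_value)
        return True


store = VersionedStore()
store.update_if_version("balance", 0, 100)

# Two writers read the same version, then both try to write.
version_a, value_a = store.read("balance")
version_b, value_b = store.read("balance")
first = store.update_if_version("balance", version_a, value_a + 10)
second = store.update_if_version("balance", version_b, value_b + 20)
```

The second writer is rejected instead of silently overwriting the first update; it would re-read and retry.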

Transaction and Distributed lock

💡 Outline

1. Motivation:
Data race condition…

Criteria when designing microservices that avoid race conditions

2. Common race condition phenomena

Isolation level in DB…

Basic ACID: DB principles that ensure data integrity

Compare with MongoDB and analyze ACID and transactions in MongoDB

3. Explain common solution & synchronization types

Locking mechanism

Lock type: pessimistic & optimistic (document version,…)

4. Distributed cache and Redis use case

Fault tolerance comes from the distributed design and from replication across storage nodes

Slide presentation:


Lock with fencing token:

locking.html#:~:text=In this context%2C a fencing,pause and the lease expires.

Distributed lock in general, and comparison:


Distributed lock:

Distributed Lock design


A distributed lock is the same as normal locking when multiple processes are trying to access one file
(shared resource)

Capacity estimate

When multiple machines (nodes) try to access shared resources, the work can be summarized in 3 operations:

Grab the lock

Do operations: reading, writing

Release the lock (fencing token)
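
The three operations can be sketched as a grab / work / release wrapper. This is a minimal single-process sketch: `InMemoryLockService` and `with_lock` are hypothetical names standing in for a real lock service, and no fencing token or expiry is modelled here.

```python
import uuid


class InMemoryLockService:
    """Hypothetical single-process stand-in for a lock service."""

    def __init__(self):
        self._owners = {}  # resource name -> owner id

    def acquire(self, resource, owner):
        # Grab the lock only if nobody holds it.
        if resource in self._owners:
            return False
        self._owners[resource] = owner
        return True

    def release(self, resource, owner):
        # Only the current owner may release.
        if self._owners.get(resource) == owner:
            del self._owners[resource]
            return True
        return False


def with_lock(service, resource, work):
    owner = str(uuid.uuid4())
    if not service.acquire(resource, owner):  # 1. grab the lock
        return None
    try:
        return work()                         # 2. do the read/write
    finally:
        service.release(resource, owner)      # 3. release the lock


service = InMemoryLockService()
result = with_lock(service, "report.csv", lambda: "written")
```

The `finally` clause releases the lock even when the work raises, which covers an exception in the holder but not a crash of the holder.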

Requirements to consider:

💡 Need to satisfy the Safety and Liveness properties

Single instance Lock (1 Redis master node)

💡 Lock instance is a general term and could be anything (ZooKeeper, AWS S3); Redis is just a
specific implementation

Where the safety and liveness properties fall short, with explanation:

Safety: the lock state must be correct and guarantee that only one process holds the lock at a time.
The following situations could compromise it:

Network or node failure: if the lock service (Redis) fails or becomes unreachable, the lock cannot be
released after the process is done, and multiple processes could end up holding the lock

Redis crash or restart: the lock is lost and another process could acquire it again -> leading to
concurrent lock holders


Network congestion and latency: the time to acquire and release the lock can increase dramatically; this
could lead to livelock, where processes constantly wait for the lock without making progress

Performance bottleneck: a single Redis instance may struggle to handle the load, resulting in
increased latency and reduced liveness.

To overcome these situations, a more robust and fault-tolerant approach is required.

💡 A single instance has limitations: single point of failure -> add replicas, but why won’t a
failover-based implementation be enough?

The solution is to add replicas to our cluster, but that violates Rule 1: the Safety property

When the master goes down before syncing the lock to its replica, the replica is elected master without
knowing about the lock -> another process could then acquire the same lock
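
The failover problem can be replayed in a few lines. This is a toy simulation, assuming replication is asynchronous and the master dies before the lock key reaches the replica; `Node` and `acquire` are hypothetical names, not a Redis client API.

```python
class Node:
    """Hypothetical stand-in for one Redis node."""

    def __init__(self):
        self.locks = {}  # key -> owner


def acquire(node, key, owner):
    # Grant the lock only if this node has no record of it.
    if key in node.locks:
        return False
    node.locks[key] = owner
    return True


master, replica = Node(), Node()
a_got_lock = acquire(master, "order:42", "client-A")

# The master crashes here, before the write is replicated.
master = replica  # failover: the replica is promoted

b_got_lock = acquire(master, "order:42", "client-B")
```

Both clients now believe they hold the lock on `order:42`, which is the safety violation described above.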

Redlock with multiple instances (N Redis master nodes)

There are some limitations with Redlock; its safety depends on several timing assumptions:

All Redis nodes hold keys for approximately the right length of time (expiry timing differs slightly
between nodes), and network delay is small compared to the expiry duration.

The process runs for much less time than the expiry duration -> a fencing token will solve this problem.
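
To make the timing assumptions concrete, here is a rough sketch of a Redlock-style acquire: the lock counts only if a majority of nodes grant it and the remaining validity (TTL minus elapsed time minus an allowance for clock drift) is still positive. `LockNode` and `acquire_quorum` are hypothetical names, the nodes are in-memory stand-ins, and time is passed in explicitly so the example is deterministic.

```python
class LockNode:
    """Hypothetical in-memory stand-in for one independent Redis master."""

    def __init__(self, up=True):
        self.up = up
        self.locks = {}  # key -> (owner, expires_at_ms)

    def set_if_absent(self, key, owner, ttl_ms, now_ms):
        if not self.up:
            return False
        held = self.locks.get(key)
        if held is not None and held[1] > now_ms:
            return False  # still held by someone else
        self.locks[key] = (owner, now_ms + ttl_ms)
        return True


def acquire_quorum(nodes, key, owner, ttl_ms, start_ms, per_node_cost_ms, drift_ms):
    """Return the remaining validity in ms, or 0 if the lock was not obtained."""
    now_ms = start_ms
    granted = 0
    for node in nodes:
        now_ms += per_node_cost_ms  # simulated network round trip
        if node.set_if_absent(key, owner, ttl_ms, now_ms):
            granted += 1
    validity_ms = ttl_ms - (now_ms - start_ms) - drift_ms
    if granted >= len(nodes) // 2 + 1 and validity_ms > 0:
        return validity_ms
    # Not enough nodes or too slow: undo the partial acquisition everywhere.
    for node in nodes:
        if node.locks.get(key, (None, 0))[0] == owner:
            del node.locks[key]
    return 0


nodes = [LockNode(), LockNode(), LockNode(up=False), LockNode(), LockNode(up=False)]
validity = acquire_quorum(nodes, "job", "client-A", ttl_ms=10_000,
                          start_ms=0, per_node_cost_ms=5, drift_ms=100)
too_slow = acquire_quorum(nodes, "other", "client-B", ttl_ms=1_000,
                          start_ms=0, per_node_cost_ms=200, drift_ms=100)
```

In the second call the round trips consume the whole TTL, so the lock is rejected even though 3 of 5 nodes granted it.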

Safety violation

Deadlock when the client holding the lock crashes or runs too long

Ensure a lock held by one client cannot be deleted by another client: when a client holds the
lock longer than its validity time, the lock is automatically released and can be acquired by another
client; the old client, once done processing, could then remove the other client’s lock

-> Simplest solution is to use a timestamp with microsecond precision concatenated with the locking key.

Master goes down before syncing the lock to the replica

DEL is unsafe: a client whose lock validity has expired could remove a lock now held by another client
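
The unsafe DEL case can be sketched by storing a unique value with the lock and deleting only when that value still matches. `FakeRedis` below is a hypothetical in-memory stand-in, and the key and value strings are made up for illustration. Against a real Redis the acquire would be `SET key value NX PX ttl`, and the compare-and-delete has to run atomically, typically as a small Lua script.

```python
class FakeRedis:
    """Hypothetical in-memory stand-in exposing only what the sketch needs."""

    def __init__(self):
        self.data = {}

    def set_if_absent(self, key, value):
        if key in self.data:
            return False
        self.data[key] = value
        return True

    def expire(self, key):
        # Simulates the TTL running out.
        self.data.pop(key, None)

    def delete(self, key):
        # Plain DEL: removes the key no matter who owns it.
        return self.data.pop(key, None) is not None

    def delete_if_equal(self, key, expected):
        # Safe release: delete only if the stored value is still ours.
        if self.data.get(key) == expected:
            del self.data[key]
            return True
        return False


redis = FakeRedis()
key = "lock:invoice"
value_a = "lock:invoice:1695812000.000001"
value_b = "lock:invoice:1695812030.000002"

redis.set_if_absent(key, value_a)
redis.expire(key)                  # client A's validity time runs out
redis.set_if_absent(key, value_b)  # client B acquires the lock

# Client A finishes late and releases with a value check: B's lock survives.
released_by_a = redis.delete_if_equal(key, value_a)
b_still_holds = redis.data.get(key) == value_b

# A plain DEL from client A would have removed B's lock.
plain_del_removed = redis.delete(key)
```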

Performance, crash recovery and fsync:


Assume a client has acquired the lock on 3 out of 5 instances, but one of these goes down and restarts
without the lock data; now there are 3 instances that could grant the lock to another client


With a standalone Redis instance: we could enable AOF persistence, so it can retain data after a restart

With multiple Redis nodes: need to ensure data is replicated correctly between replicas, so that after
the election of a leader the data remains

Another simple but bad solution is to delay a restarted instance for longer than the
TTL of the currently active locks, ensuring that it cannot rejoin a previous lock operation. But a
delayed restart has the drawback of reduced availability: for instance, if a majority of
instances go down, the system could be unavailable, meaning no resource at all could be locked
Fault tolerance:

When the service holding the lock suddenly goes down, other machines trying to get the lock are
blocked and fail

Or the process holding the lock runs too long, possibly stuck somewhere

Use a fencing token, which is a number indicating the version of the record in storage. Bear in mind not
to use a timestamp: it is not reliable in a distributed system, because when a machine goes down and
restarts, its timestamp will be different

Redis doesn’t support fencing token generation across several nodes because a counter kept on
multiple nodes will go out of sync; we need a strict consensus algorithm

A “stop-the-world GC pause” can simply be considered a long-running process

-> Simplest solution is to set an expiry time on the held lock

-> Periodically send a heartbeat to the service holding the lock; if it doesn’t respond, we
can assume that service is dead and release the lock

💡 With some added complexity in our lock service, we can use the heartbeat to extend the TTL of the lock.
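
Expiry plus heartbeat can be sketched as a lease that the holder keeps extending. This is a minimal sketch under assumptions: `LeaseLock` is a hypothetical name, the heartbeat here is sent by the holder to the lock, and time is passed in as a plain number rather than read from a clock so the behaviour is deterministic.

```python
class LeaseLock:
    """Hypothetical lock whose ownership lapses unless the holder heartbeats."""

    def __init__(self, ttl):
        self.ttl = ttl
        self.owner = None
        self.expires_at = 0.0

    def acquire(self, owner, now):
        # Refuse while another holder's lease is still valid.
        if self.owner is not None and now < self.expires_at:
            return False
        self.owner = owner
        self.expires_at = now + self.ttl
        return True

    def heartbeat(self, owner, now):
        # Extend the TTL only for the current, unexpired holder.
        if self.owner == owner and now < self.expires_at:
            self.expires_at = now + self.ttl
            return True
        return False


lock = LeaseLock(ttl=10)
a_acquired = lock.acquire("A", now=0)       # lease valid until t=10
a_extended = lock.heartbeat("A", now=8)     # lease valid until t=18
b_too_early = lock.acquire("B", now=12)     # A's lease is still valid

# A crashes and stops sending heartbeats; its lease lapses at t=18.
b_after_expiry = lock.acquire("B", now=20)
a_stale_beat = lock.heartbeat("A", now=21)  # A no longer owns the lock
```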

The importance of a consensus algorithm: suppose we have 3 nodes; when a machine wants to grab a lock, it
needs to grab it on all 3 locking nodes for the lock to be granted

💡 A consensus algorithm ensures a consistent view of the system despite potential failures, network
partitions or delays between distributed nodes. It can be used to coordinate and establish agreement
among multiple nodes on a particular value or decision.

-> The important part of building a distributed system is that the lock is replicated in a fault-tolerant manner

Design distributed

Distributed lock on a majority of nodes

Have an odd number of nodes, one of which is the leader. Every write is sent to the leader, and the
leader forwards the write to all followers. When the leader receives an OK response from all followers,
it goes ahead, commits the write locally, and tells all other nodes that the write is OK

Leader election: we have a heartbeat service, and a new leader is elected

Consider scaling

Imagine that when a lock is released, a thousand processes try to grab it on a single Raft instance;
that’s a bad situation

The solution is to use a linked list to queue the waiting processes, sorted by time
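
The queueing idea can be sketched as a lock that hands itself to the longest-waiting process on release, so only one waiter is woken instead of a thousand retrying at once. `QueuedLock` is a hypothetical name, and `collections.deque` stands in for the linked list; arrival order is assumed to be the sort order.

```python
from collections import deque


class QueuedLock:
    """Hypothetical lock that passes ownership to waiters in arrival order."""

    def __init__(self):
        self.holder = None
        self.waiters = deque()

    def acquire(self, client):
        if self.holder is None:
            self.holder = client
            return True
        self.waiters.append(client)  # wait in line instead of retrying
        return False

    def release(self, client):
        if self.holder != client:
            return None
        # Hand the lock directly to the oldest waiter, if any.
        self.holder = self.waiters.popleft() if self.waiters else None
        return self.holder  # the only client that needs to be notified


lock = QueuedLock()
got_p1 = lock.acquire("p1")
got_p2 = lock.acquire("p2")
got_p3 = lock.acquire("p3")
next_after_p1 = lock.release("p1")
next_after_p2 = lock.release("p2")
next_after_p3 = lock.release("p3")
```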

Overview problem
Efficiency: taking a lock saves you from unnecessary work (e.g. 2 processes doing the same
expensive computation)

Correctness: taking a lock prevents multiple processes from modifying the same piece of data, which
would lead to corruption, data loss & permanent inconsistency

Distributed lock

💡 Make lock safe with fencing token


The lock service must generate strictly monotonically increasing tokens

Storage takes an active role in checking token validity and rejects writes with any token that has gone
backwards
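
The two requirements above can be sketched together: a lock service that hands out an increasing token with every grant, and a storage layer that remembers the highest token it has seen and refuses older ones. `LockService` and `FencedStorage` are hypothetical names; a real deployment would need the counter to survive failures.

```python
import itertools


class LockService:
    """Hypothetical lock service issuing a strictly increasing token per grant."""

    def __init__(self):
        self._counter = itertools.count(1)

    def grant(self):
        return next(self._counter)


class FencedStorage:
    """Hypothetical storage that rejects writes carrying an outdated token."""

    def __init__(self):
        self.highest_token = 0
        self.value = None

    def write(self, token, value):
        if token < self.highest_token:
            return False  # token has gone backwards: stale lock holder
        self.highest_token = token
        self.value = value
        return True


locks = LockService()
storage = FencedStorage()

token_1 = locks.grant()  # client 1 gets the lock, then stalls (e.g. a GC pause)
token_2 = locks.grant()  # the lease expires and client 2 gets the lock

client_2_write = storage.write(token_2, "from client 2")
client_1_write = storage.write(token_1, "from client 1")  # client 1 wakes up late
```

The comparison uses `<` rather than `<=` so the current holder can write more than once with the same token.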

Fencing token with Redlock algorithm:

Redis cannot safely generate fencing tokens: simply keeping a counter on one node won’t be sufficient,
as the node could fail, and with multiple nodes the counters would go out of sync. So if we want fencing
tokens with Redlock, we need to implement a consensus algorithm like Raft or Paxos

Raft machine

