System Design Notes
But how does Mary's app know that there is a message from John on the server? HTTP is client-initiated, and no connection has been established from Mary's app to the server, so how does she find out that a message is waiting for her?
One way to solve this is polling, also called short polling.
Polling involves clients repeatedly making requests to the server at
regular intervals to check for new messages. Here, Mary polls the server
for new messages every few seconds:
Mary's app polls the server by sending periodic GET requests asking for new messages. When the server receives such a request and has a pending message from John, it returns John's message to Mary's app in the response.
If there are no new messages, the server responds with an empty array
([]). Mary continues polling periodically.
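A minimal sketch of short polling in Python, assuming a hypothetical https://chat.example.com/messages endpoint that returns a JSON array of pending messages (the URL, parameters, and response shape are illustrative, not part of these notes):

import time
import requests  # third-party HTTP client

POLL_INTERVAL_SECONDS = 10
MESSAGES_URL = "https://chat.example.com/messages"  # hypothetical endpoint

def poll_for_messages(user_id: str) -> None:
    while True:
        resp = requests.get(MESSAGES_URL, params={"user": user_id}, timeout=5)
        for msg in resp.json():                 # an empty array [] means nothing new
            print("New message:", msg)
        time.sleep(POLL_INTERVAL_SECONDS)       # wait before polling again

# poll_for_messages("mary")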
This approach has several issues:
o Network and Server Load: Frequent polling consumes bandwidth and server resources even when there are no updates. As the user base grows, this can cause stability issues on the server.
o Latency: Messages are delayed until the next polling interval. For example, if Mary's app polls the server every 10 seconds, John's message can sit on the server for up to 10 seconds before she receives it.
An improvement on this is long polling.
The client initiates a request to the server, and the server keeps the connection open until new data is available (which it then sends back in the response) or a timeout occurs.
Chat App: Mary initiates a long-poll request (/long-poll), holding the
connection open until the server responds or times out.
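A rough sketch of the client side of long polling, again with a hypothetical /long-poll endpoint; the server is assumed to hold the request open until a message arrives or it gives up:

import requests

LONG_POLL_URL = "https://chat.example.com/long-poll"  # hypothetical endpoint

def long_poll(user_id: str) -> None:
    while True:
        try:
            # The server holds this request open until data is ready or it times out.
            resp = requests.get(LONG_POLL_URL, params={"user": user_id}, timeout=30)
            if resp.ok and resp.json():
                print("New message:", resp.json())
        except requests.exceptions.Timeout:
            pass  # timing out is expected; the client simply re-issues the poll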
This approach also has issues:
o It reduces the frequency of requests to the server, but the client still has to re-issue a poll after every response or timeout.
o It is also more complex to implement on the backend.
o Handling many long-lived connections can strain server resources.
The solution for this is using WebSockets.
Flow:
The client establishes a connection with the server using the WebSocket protocol.
Both the client and the server can send and receive messages independently over the established connection; unlike HTTP, communication is not initiated only by the client.
The WebSocket connection remains open as long as needed, allowing
for continuous, real-time communication.
Ping/Pong Frames: The protocol supports ping/pong frames to keep
the connection alive and check for liveness.
Connection Closure:
o Either the client or the server can initiate the closing of the
connection.
o Close Frame: A close frame is sent to signal the intention to close the
connection.
o The connection is closed gracefully after both parties acknowledge the
close frame.
Chat App:
o Client to Server: John sends a message "Hello Mary!" to the chat
server.
o Server to Client: The server forwards the message to Mary in real-
time.
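A minimal sketch of this chat flow using the third-party Python websockets library; the server URL and message format are assumptions for illustration:

import asyncio
import websockets  # third-party WebSocket client/server library

async def chat(user: str) -> None:
    # A single long-lived connection carries traffic in both directions.
    async with websockets.connect("wss://chat.example.com/ws") as ws:  # hypothetical URL
        await ws.send(f"{user}: Hello Mary!")   # client -> server
        while True:
            incoming = await ws.recv()          # server -> client, pushed in real time
            print("Received:", incoming)

# asyncio.run(chat("john"))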
WebHooks
Webhooks are a way for applications to communicate with each other
automatically.
They allow one system to send real-time data to another whenever a
specific event occurs.
This is done by making an HTTP request to a predefined URL with a
payload of information about the event.
Webhooks typically facilitate communication between two services or
servers, not between a client (e.g., a user's browser) and a server.
Key Components of Webhooks
o Webhook Endpoint: The URL in the target system where the HTTP
requests are sent.
o Payload: The data sent to the webhook endpoint, usually in JSON
format.
o Event Types: Specific events that trigger the webhook (e.g., new user
registration, payment received).
Benefits of Webhooks
o Real-Time Updates and Efficiency: Receive data in real time as
events occur, without needing to poll for updates. Reduce the need for
constant checking for new data, saving resources and bandwidth.
o Automation: Automate workflows and integrate different systems
seamlessly.
How Webhooks Work
Setting Up Webhook Endpoint:
o In your e-commerce platform, you define a webhook endpoint where
you want to receive notifications from the payment gateway. This
endpoint is typically a URL handled by your application's backend.
Configuring the Payment Gateway:
o In the settings of the payment gateway (e.g., Stripe, PayPal), you
specify this webhook endpoint URL. You also configure which events
should trigger these notifications (e.g., successful payment,
chargeback).
Triggering the Webhook:
o When the configured event occurs (e.g., a successful payment), the
payment gateway automatically sends an HTTP POST request to the
webhook endpoint URL specified in your e-commerce platform.
Handling the Webhook in Your Application:
o Your application receives the incoming HTTP POST request at the
defined webhook endpoint.
o The payload of the request typically contains information about the
event (e.g., payment details, event type).
o Your application processes this information (e.g., updates the order
status, sends a confirmation email to the customer).
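As a sketch of the receiving side, a minimal Flask endpoint for a payment webhook; the route, event names, and payload fields are hypothetical, and a real integration would also verify the provider's signature:

from flask import Flask, request

app = Flask(__name__)

@app.route("/webhooks/payment", methods=["POST"])   # hypothetical endpoint URL
def payment_webhook():
    event = request.get_json()                       # payload sent by the gateway
    if event.get("type") == "payment.succeeded":     # event type name is illustrative
        order_id = event["data"]["order_id"]         # hypothetical payload field
        print(f"Marking order {order_id} as paid")   # stand-in for the real order update
    return "", 200                                   # acknowledge receipt quickly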
Webhooks vs WebSockets
Webhooks are suitable for event-driven, one-way notifications between
services or servers, typically used to trigger actions in another system
(e.g., payment notifications, updates, CI/CD pipelines).
WebSockets are ideal for real-time, two-way communication where low
latency and continuous data exchange are required (e.g., chat
applications, live data feeds, real-time collaboration).
Proxy, Reverse Proxy, and Forward Proxy
A proxy server serves as an intermediary between two systems, and it
abstracts out the complexities involved in direct communication.
A proxy can be implemented as software, hardware, or a combination of
both, depending on the requirements and scale of the deployment.
Most proxies are implemented as software running on general-purpose
hardware. This is the most flexible and common approach.
There are two types of proxies:
Forward Proxy
A forward proxy acts as an intermediary for clients: it sits between the clients and the internet, and it performs functions such as filtering requests, caching data, and hiding the client's IP address.
Here’s how it works:
A client (like a web browser) requests access to a resource (like a
website).
The request is sent to the forward proxy server instead of directly to the
target server.
The proxy server evaluates the request based on its rules (e.g., access
control, caching policies).
The proxy server forwards the request to the target server.
The target server sends the response back to the proxy server, which
then sends it to the client.
Use-Cases:
o Privacy and Anonymity: Hides the client's IP address from the target server. This matters for several reasons:
It protects the identity and location of users who do not want to be tracked by websites and online services.
By masking the client's IP address, users can access content that is otherwise unavailable in their region.
It also reduces the risk of direct attacks on the client.
o Policies and Access Control: Restricts access to certain websites based on organizational policies (e.g., in schools and companies) and blocks access to inappropriate or harmful content.
o Caching: Stores copies of frequently accessed resources to reduce bandwidth usage and improve response times. If the upstream connection is slow or unavailable, the forward proxy can still serve cached data to the clients.
Reverse Proxy
A reverse proxy sits in front of one or more web servers and forwards
client requests to those servers. It appears to clients as if they are
interacting directly with the web servers.
Here’s how it works:
A client sends a request for a resource.
The request goes to the reverse proxy server.
The reverse proxy decides which backend server should handle the
request.
The reverse proxy forwards the request to the selected backend server.
The backend server processes the request and sends the response back
to the reverse proxy, which then forwards it to the client.
Use-Cases
o Load Balancing: Distributes incoming traffic across multiple servers
to ensure no single server is overwhelmed.
o Routing: Routes incoming requests to the appropriate service or server. E.g. if the request path starts with /auth, it is sent to the authentication service (a small routing sketch follows this section).
o Security: Acts as a barrier, protecting backend servers from direct
exposure to the internet, and can help mitigate attacks like DDoS.
o SSL Termination: Handles SSL encryption/decryption to reduce the
processing load on backend servers.
o Caching: Stores responses from backend servers to speed up
responses to similar future requests.
o Abstraction: It hides backend elasticity (auto-scaling) and acts as the single point of contact for users. E.g. we can add or remove server instances in the backend without users knowing about it; they just communicate with the reverse proxy.
Examples of reverse proxies include load balancers, API gateways, and database proxies.
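A toy illustration of the routing decision described above (path prefix to backend service); a real deployment would use a reverse proxy such as NGINX or an API gateway rather than hand-rolled code, and the service names here are made up:

BACKENDS = {
    "/auth": "http://auth-service:8080",     # illustrative backend addresses
    "/orders": "http://order-service:8080",
}
DEFAULT_BACKEND = "http://web-service:8080"

def choose_backend(path: str) -> str:
    for prefix, backend in BACKENDS.items():
        if path.startswith(prefix):
            return backend
    return DEFAULT_BACKEND

# choose_backend("/auth/login") returns "http://auth-service:8080"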
Database Proxy
A database proxy is a type of reverse proxy and an intermediary
between database clients (web servers or applications) and the database
server.
It takes requests from clients like web applications and forwards them to
a database server using configurations that are specific to databases.
It provides various benefits such as load balancing, connection pooling,
caching, security, and high availability.
Implementing a database proxy can help optimize database performance
and scalability.
Benefits of a Database Proxy
o Load Balancing: Distributes client queries across multiple database
servers to ensure no single server is overwhelmed.
o Read/Write Splitting: Directs read queries to replicas and write queries to the master database, optimizing resource utilization (a small sketch of this follows the section).
o High Availability: Provides failover capabilities to ensure database
availability even if some instances go down.
o Connection Pooling: Manages a pool of database connections to
reduce the overhead of establishing connections.
o Query Caching: Stores frequently executed queries in memory to
speed up response times.
o Security: Acts as a firewall, filtering and controlling access to the
database.
Some of the popular DB Proxy solutions are ProxySQL and pgbouncer.
ProxySQL is a high-performance database proxy for MySQL and
MariaDB, designed to handle large traffic loads with various features like
query routing, load balancing, and caching.
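A rough sketch of the read/write splitting a database proxy performs, with plain strings standing in for a primary connection and its replicas:

import random

class ReadWriteSplitter:
    """Toy router: writes go to the primary, reads are spread across replicas."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = replicas

    def route(self, sql: str):
        if sql.lstrip().upper().startswith("SELECT"):
            return random.choice(self.replicas)   # read -> any replica
        return self.primary                       # write -> primary only

# router = ReadWriteSplitter("primary-db", ["replica-1", "replica-2"])
# router.route("SELECT * FROM users")    -> one of the replicas
# router.route("INSERT INTO users ...")  -> "primary-db"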
CAP Theorem
The CAP theorem is primarily applicable to distributed database
systems where we have multiple nodes or replicas of database
nodes. However, its principles can also extend to other types of
distributed systems, such as distributed computing systems, distributed
file systems, and any system that involves coordination among multiple
networked nodes.
The CAP theorem is a concept in computer science that explains the
trade-offs between consistency, availability, and partition tolerance in
distributed systems.
Consistency refers to the property of a system where all nodes have a
consistent view of the data. It means all clients see the same data at the
same time no matter which node they connect to.
Availability refers to the ability of a system to respond to requests from
users at all times.
Partition tolerance refers to the ability of a system to continue
operating even if there is a network partition.
But what is a network partition?
A network partition happens when nodes in a distributed system are
unable to communicate with each other due to network failures.
When there is a network partition, a system must choose between
consistency and availability.
If the system prioritizes consistency, it may become unavailable until the
partition is resolved.
If the system prioritizes availability, it may allow updates to the data.
This could result in data inconsistencies until the partition is resolved.
Example to Explain CAP Theorem
Let's say we have a tiny bank with two ATMs connected over a network.
The ATMs support three operations: deposit, withdraw, and check
balance.
No matter what happens, the balance should never go below zero.
There is no central database to keep the account balance. It is stored on
both ATMs.
When a customer uses an ATM, the balance is updated on both ATMs
over the network. This ensures that the ATMs have a consistent view of
the account balance.
If there is a network partition and the ATMs are unable to
communicate with each other, the system must choose between
consistency and availability:
o If the bank prioritizes consistency, the ATM may refuse to process
deposits or withdrawals until the partition is resolved. This ensures
that the balance remains consistent, but the system is unavailable to
customers.
o If the bank prioritizes availability, the ATM may allow deposits and
withdrawals to occur, but the balance may become inconsistent until
the partition is resolved. This allows the system to remain available to
users, but at the cost of data consistency. The preference for
availability could be costly to the bank. When there is a network
partition, the customer could withdraw the entire balance from both
ATMs. When the network comes back online, the inconsistency is
resolved and now the balance is negative. That is not good.
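A toy sketch of the consistency-first choice from this example: during a partition the ATM refuses balance-changing operations rather than risk a divergent balance (all names and behavior here are illustrative):

class ConsistencyFirstATM:
    def __init__(self, balance: int):
        self.balance = balance
        self.partitioned = False   # True when the other ATM cannot be reached

    def withdraw(self, amount: int) -> str:
        if self.partitioned:
            return "Unavailable: cannot confirm the balance with the other ATM"
        if amount > self.balance:
            return "Insufficient funds"
        self.balance -= amount
        return f"Dispensed {amount}, new balance {self.balance}"

# An availability-first ATM would dispense the cash anyway and reconcile later,
# accepting the risk of a negative balance once the partition heals.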
Another Example - Social Media Platform
During a network partition, if two users are commenting on the same
post at the same time, one user's comment may not be visible to the
other user until the partition is resolved.
Alternatively, if the platform prioritizes consistency, the commenting
feature may be unavailable to users until the partition is resolved.
For a social network, it is often acceptable to prioritize availability at the
cost of users seeing slightly different views some of the time.
CAP Theorem is simple but real world is not
The CAP theorem may sound very simple, but the real world is messy.
As with many things in software engineering, it is all about tradeoffs, and
the choices are not always so black and white.
The CAP theorem assumes 100% availability or 100% consistency.
In the real world, there are degrees of consistency and availability that
distributed system designers must carefully consider. This is where the
simplistic model of the CAP theorem could be misleading.
Back to the bank example, during a network partition, the ATM could
allow only balance inquiries to be processed, while deposits or
withdrawals are blocked.
Alternatively, the bank could implement a hybrid approach. For example,
the ATM could allow balance inquiries and small withdrawals to be
processed during a partition, but block large withdrawals or deposits
until the partition is resolved.
It is worth noting that in the real world, resolving conflicts after a
network partition could get very messy.
The bank example above is simple to fix. In real life, the data structures
involved could be complex and challenging to reconcile.
A good example of a complex data structure is Google Docs. Resolving
conflicting updates could be tricky.
So, is the CAP theorem useful?
Yes, it is a useful tool to help us think through the high-level trade-offs to
consider when there is a network partition.
But it does not provide a complete picture of the trade-offs to consider when designing a well-rounded distributed system, or of what happens when there are no network partitions.
PACELC Theorem
In distributed database systems, the PACELC theorem is an extension of the CAP theorem.
It states that in case of network partitioning (P) in a distributed computer
system, one has to choose between availability (A) and consistency (C)
(as per the CAP theorem), but else (E), even when the system is running
normally in the absence of partitions, one has to choose between latency
(L) and loss of consistency (C).
Both theorems describe how distributed databases have
limitations and tradeoffs regarding consistency, availability, and
partition tolerance.
PACELC goes further and states that an additional trade-off exists:
between latency and loss of consistency, even in absence of
partitions.
- Network Latency: Latency refers to the time it takes for a request to
travel from the client to the server and back. In distributed systems,
latency can vary significantly due to factors such as network congestion,
distance between nodes, and processing time.
- Consistency Requirements: Consistency in a distributed system
ensures that all nodes have the same view of data at the same time.
Challenges in Achieving Both:
o High Latency for Strong Consistency: Achieving strong
consistency (C) often requires waiting for acknowledgments or
synchronization across multiple nodes. This can increase latency, as
each operation may need to be coordinated or serialized to ensure
data integrity.
o Low Latency for Weak Consistency: Opting for weaker consistency
models (e.g., eventual consistency) can reduce latency, as nodes can
respond quickly without waiting for full synchronization. However, this
may lead to temporarily inconsistent states visible to different nodes.
Strong Consistency vs Eventual Consistency
Definition:
- Strong Consistency: Strong consistency is a property in distributed
systems that ensures that all nodes in the system see the same data at
the same time, regardless of which node they are accessing. In other
words, when a write operation is performed, all subsequent read
operations from any node will return the most recent write value.
- Eventual Consistency: Over time, the system converges towards
consistency, but during the transient period, users accessing different
data centers may observe different versions of the data. This is the
characteristic behavior of eventual consistency.
Data Accuracy:
- Strong Consistency: Ensures that all nodes see the same data at the
same time, guaranteeing immediate data accuracy and integrity.
- Eventual Consistency: Temporarily allows nodes to be inconsistent,
which may result in stale data being read until convergence occurs.
Performance:
- Strong Consistency: Achieving strong consistency often involves
increased coordination and communication among nodes, leading to
higher latency for read and write operations.
- Eventual Consistency: The asynchrony of write propagation and
reduced coordination overhead allows for lower latency and higher
throughput for read and write operations.
Use Cases:
- Strong Consistency: Best suited for scenarios where data integrity
and consistency are critical, such as financial systems, e-commerce
platforms, and critical business applications.
- Eventual Consistency: Well-suited for applications where real-time
consistency is not vital and where system availability and scalability are
more important, such as social media platforms, content distribution
networks, and collaborative systems.
Note: Choosing between strong and eventual consistency depends on
the specific needs of the application and its users. Some systems may
adopt a hybrid approach, selectively applying strong consistency to
certain critical data and eventual consistency to less critical or non-
critical data, striking a balance between data accuracy, performance,
and availability. The decision requires careful consideration of the
tradeoffs to meet the desired requirements and constraints of the
distributed system.
Database Scaling
Scaling databases is a crucial part of ensuring that your application can
handle increasing loads and growing datasets.
There are several strategies for scaling databases, each with its own
advantages and use cases.
Here’s an overview of the most common approaches:
1. Caching: Caching involves storing frequently accessed data in memory to
reduce database load. Common caching solutions include Redis and
Memcached. It greatly improves read performance.
2. NoSQL Databases: Switching to NoSQL databases like MongoDB,
Cassandra, or DynamoDB can be a good solution for certain use cases,
especially where scalability is a primary concern. NoSQL databases are
designed to scale out by distributing data across multiple servers. NoSQL
databases are built with high availability and fault tolerance in mind. NoSQL
databases offer flexible schema designs that can handle various data types
and structures.
3. Vertical Scaling (Scaling Up): Vertical scaling involves adding more resources (CPU, RAM, storage) to your existing database server. It is easier to implement and manage, and no changes to the application logic are required.
4. Horizontal Scaling (Scaling Out): Horizontal scaling involves adding
more database servers to handle increased load. There are different
approaches within horizontal scaling:
1) Replication: Replication involves copying data from one database
server to others. There are various types of replication, such as
master-slave, master-master, and multi-master. Increases data
availability and redundancy. It can improve read performance.
Read/Write Splitting: Read/write splitting involves using master-
slave replication, where the master handles all writes and one or more
slaves handle reads. Offloads read operations to slaves, reducing load
on the master. Slaves can serve as backups.
2) Sharding: Sharding involves splitting your database into smaller,
more manageable pieces, called shards. Each shard is a separate
database that holds a subset of the data. This method helps balance
the load and allows for easy scaling by adding more shards.
3) Load Balancing: Uses a proxy layer to distribute requests to different
database servers based on load and availability.
Replication
Data Replication is the process of storing data in more than one site or
node.
It is simply copying data from a database from one server to another
server.
Why do we need replication?
Reduced Network Latency:
o Replication helps reduce network latency by keeping data close to the
user's geographic location, improving the speed at which data is
accessed.
o This is particularly useful for global applications, where users are
spread across different regions.
Improved Availability and Fault Tolerance:
o Replication enhances system availability by ensuring that if one server
goes down, others can take over, minimizing downtime.
o This is critical for businesses that need to maintain continuous
operations and avoid financial losses and customer trust issues due to
downtime.
Increased Data Throughput:
o Replication can handle higher data throughput by distributing read
and write operations across multiple servers.
o This scalability ensures that the system can manage a large number of
transactions per second (TPS) or queries per second (QPS),
maintaining a high quality of service (QoS) even under heavy load.
Load Balancing:
o Replication helps distribute the load evenly across multiple servers,
preventing any single server from becoming a bottleneck.
o This improves overall system performance and reliability.
Disaster Recovery:
o Replicated databases provide a reliable backup in case of data
corruption or hardware failure on the primary server.
o This ensures that data can be quickly restored and operations can
continue with minimal disruption.
Maintenance and Upgrades:
o Replication allows for easier maintenance and upgrades by enabling
operations on one server while others continue to handle user
requests.
o This reduces downtime and ensures a smoother update process.
Types of algorithms for implementing Database Replication?
1. Single Leader Replication
2. Multi-Leader Replication
3. Leaderless Replication
Single Leader Replication
So, in leader-based architecture, client (application server) requests are
sent to leader DB first and after that leader sends the data changes to all
of its followers.
Whenever a client wants to read data from the database then it can
query either leader or any of the follower (Yes there is generally more
than just one follower to make the system highly available). However,
writes to the database is only accepted on the leader by the client.
Now whenever a follower dies our application will not get impacted as
there is not just a single node of data. Our application can read from
other followers as well and hence this makes our system highly Read
Scalable.
Two types of Replication Strategy:
o Synchronous replication strategy: In this strategy, followers are
guaranteed to have an up-to-date copy of data (Strong Consistency)
which is an advantage.
o Its biggest disadvantage is that it blocks the client until the leader receives an OK response from every follower. In a heavily read-scaled system like Facebook, with thousands of follower nodes, waiting for the data to be replicated to each node makes for a poor user experience.
o Asynchronous replication strategy: The changes to master are
sent to the slave databases asynchronously. This means that the
master does not wait for the slaves to acknowledge the changes
before confirming the write operation to the client.
o Disadvantage: There is a time lag (replication lag) between when a
change is made on the master and when it is applied on the slaves.
During this lag, the slaves may serve outdated data for read requests.
Handling Node outages in replication
So, there are two scenarios as mentioned below:
Follower failure: In the scenario of follower failure, we can use a strategy called catch-up recovery. The follower that got disconnected reconnects to the leader and requests all data changes that occurred while it was disconnected.
Leader failure: Now in the scenario of Leader failure, we can use a
strategy called Failover. In this strategy, one of the followers needs to be
promoted as a new leader.
Choosing the new leader is typically done with a consensus algorithm: in layman's terms, a quorum of followers votes on which follower should be promoted to leader.
Issues in Single Leader based Replication
Write Scalability: Since all write operations must go through the
leader, it can become a bottleneck, limiting the overall write throughput
of the system. As the number of write requests increases, the leader may
struggle to handle the load, leading to performance degradation.
Write Latency: In geographically distributed systems, write operations
can suffer from high latency because every write must be sent to the
leader, which may be located far from the user initiating the write. This
can slow down the perceived performance for users located far from the
leader.
Loss of Latest Changes: If the data center containing the leader fails
and the latest changes were not replicated to all the followers, those
changes may be permanently lost. This can lead to data inconsistency
and potential data loss.
Failover Complexity: Promoting a follower to become the new leader is
necessary, but it introduces complexity.
Multi-Leader Replication
In multi-Leader replication, we are going to have one leader in each of
my data centers and each data center’s leader replicates its changes to
the leaders in other data centers asynchronously.
Advantages of Multi-Leader Replication
Better performance compared to single-leader replication, since both the read and write latency of the application are reduced.
High fault tolerance, as each data center can continue operating independently if another data center goes down. This is possible because each data center has its own leader, and replication catches up once the failed data center comes back online.
Moreover, if one data center goes down in one particular geographic
area then temporarily, we can route the requests from that geographic
area to some other healthy data center in another geographic area till
that unhealthy data center becomes healthy. Yes, there is a trade-off
between Performance and High Availability here.
Disadvantages
In a multi-leader system, writes can happen on different leaders at the
same time.
This can lead to conflicts because different leaders might receive and
apply write operations in different orders.
Resolution
Last Write Wins (LWW): In this approach, the system keeps the most
recent write based on timestamps. While simple, it can lead to data loss
as it discards other writes.
Version Vectors: Each data item carries a version vector that tracks the
version history. When conflicts arise, the system can merge changes
based on the version history.
Leaderless Replication
Leaderless replication (also known as peer-to-peer replication) does
not designate a single leader for write operations. Instead, any node can
accept write operations, and data is replicated among all nodes.
Advantages:
o Fault Tolerance: High fault tolerance as there is no single point of
failure.
o Scalability: Highly scalable since any node can handle read and write
operations.
Quorums in Leaderless Replication
Definition: Quorum-based replication involves a majority voting
mechanism to decide the success of read and write operations.
The system uses a quorum to determine if an operation has been
successfully applied.
A quorum is a subset of nodes that must respond positively for an
operation to be considered successful.
Read and Write Quorums (R and W)
o Read Quorum (R): The minimum number of nodes that must
respond to a read request to consider it successful.
o Write Quorum (W): The minimum number of nodes that must
acknowledge a write request for it to be considered successful.
o The values of R and W are chosen such that R + W > N, where N is the
total number of nodes.
How Quorums Work?
Write Operations:
o A write operation is sent to multiple nodes.
o The operation is considered successful if W nodes acknowledge the
write.
o Once the write quorum is met, the system can return success to the
client, even if not all nodes have acknowledged the write immediately.
Read Operations:
o A read operation is sent to multiple nodes.
o The operation is considered successful if R nodes respond.
o The system can then return the most recent data based on the
responses from these R nodes.
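A small worked example of the quorum condition: with N = 3 nodes, choosing W = 2 and R = 2 satisfies R + W > N, so every read quorum overlaps every write quorum and sees at least one up-to-date copy:

N, W, R = 3, 2, 2   # total nodes, write quorum, read quorum

def quorum_met(acks: int, quorum: int) -> bool:
    # An operation succeeds once it has gathered `quorum` acknowledgements.
    return acks >= quorum

assert R + W > N                          # guarantees read and write quorums overlap
assert quorum_met(acks=2, quorum=W)       # a write acknowledged by 2 of 3 nodes succeeds
assert not quorum_met(acks=1, quorum=R)   # a read answered by only 1 node is not enough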
Benefits of Using Quorums
This approach improves availability, scalability, and fault tolerance while
providing strong consistency guarantees.
Amazon DynamoDB and Cassandra use a quorum-based approach.
Database Sharding
Sharding involves splitting a database into smaller, horizontally
partitioned pieces called shards, where each shard is a separate
database.
Each shard can be stored on a different server or cluster, enabling the
system to handle more load and improve performance.
Need for Sharding:
Consider a very large database that has not been sharded. For example, take the database of a college in which all student records (present and past) are maintained in a single database. It would contain a very large number of records, say 100,000.
Now, when we need to find a student in this database, up to 100,000 records may have to be examined each time, which is very costly.
Now consider the same student records divided into smaller shards based on year. Each shard holds only around 1,000 to 5,000 student records. Not only does the database become much more manageable, but the cost of each lookup also drops by a large factor. This is what sharding achieves.
Benefits of Sharding
Horizontal Scalability: Sharding allows you to distribute data across
multiple servers, which means you can handle increased load by simply
adding more servers.
Improved Performance: By distributing the data, each server handles
a smaller portion, reducing the load and improving query performance.
Reduced Impact of Failures: If one shard fails, it affects only a subset
of the data and users, not the entire database.
Easier Maintenance: Maintenance tasks can be performed on
individual shards without impacting the entire system.
Lower Latency: Shards can be placed in different geographic locations
to reduce latency for users in different regions.
Backup and Recovery: Shards can be backed up and restored
independently, making the processes faster and more efficient.
Disadvantages of Sharding
Data Distribution Logic: Implementing sharding logic requires careful
planning and additional coding in the application to handle data
distribution.
Cross-Shard Queries: Queries that span multiple shards are more
complex and can be less efficient.
Management and Monitoring: Sharding requires monitoring multiple
database instances, which increases administrative overhead.
Rebalancing: Rebalancing involves redistributing data across shards
when the load is uneven or when adding new shards. This ensures even
load distribution and optimal performance. When adding new shards or
redistributing data, rebalancing can be a resource-intensive and time-
consuming process.
When to Shard a Database
High Data Volume: Large Databases: When the size of the database
exceeds the storage capacity or performance limits of a single server,
sharding becomes necessary.
Increased Traffic: Applications with a high number of read and write
operations can benefit from sharding to distribute the load across
multiple servers.
Global User Base: Applications with users spread across different
geographic regions can use sharding to reduce latency by placing data
closer to users.
Example Scenario
When to Shard: A social media platform with millions of users and high
traffic might need to shard its user data. By sharding the database based
on user ID, the platform can distribute the load across multiple servers,
improving performance and scalability.
When Not to Shard: A small e-commerce website with a few thousand
products and moderate traffic may not need sharding. A single database
instance can handle the load efficiently, and sharding would introduce
unnecessary complexity.
Shard Key (Partition Key)
A shard key is a specific column or set of columns in a database table
that is used to determine how data is distributed across multiple shards
in a sharded database architecture.
It should ensure even data distribution, minimize cross-shard queries,
and align with the query patterns.
The shard key should ensure that data is evenly distributed across all
shards to avoid hot spots where some shards become overloaded while
others remain underutilized.
The shard key should align with the most common query patterns to
minimize the need for cross-shard queries, which can be more complex
and less efficient.
Common shard keys include user ID, geographic region, or a hash of a
key field.
Sharding Strategies Based on Shard Key
Key-Based (Hash) Sharding:
o Uses a hash function on the shard key to distribute data.
o Shard Key: User ID, Product ID, Email Address (hashed).
o Example: shard_id = hash(user_id) % number_of_shards (a code sketch follows this list).
Range-Based Sharding:
o Distributes data based on ranges of shard key values.
o Shard Key: Transaction Date, Salary Range, Alphabetical Range of
Last Names.
o Example: Shard 1 for dates 2023-01-01 to 2023-03-31, Shard 2 for
2023-04-01 to 2023-06-30.
Directory-Based Sharding:
o Directory-based sharding uses a lookup table or directory to map each
shard key to its corresponding shard. The directory service determines
which shard a piece of data belongs to.
o This directory can be updated dynamically, allowing for more fine-
grained control over data distribution. You can easily move data
between shards by updating the directory without needing to
rehash or redefine ranges.
o However, it also requires maintaining an up-to-date directory, which
adds some complexity to the system.
o Shard Key: User ID (mapped via lookup table), Product Category,
Region Code.
o Example: Directory service maps User ID 1-1000 to Shard 1, User ID
1001-2000 to Shard 2.
Geographic (Location-Based) Sharding:
o Shards data based on geographic location.
o Shard Key: Country Code, City Name, IP Address Range (geo-
located).
o Example: Users from North America -> Shard 1, Europe -> Shard 2.
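A minimal sketch of the key-based (hash) sharding strategy above, using hashlib for a hash that is stable across processes; the shard count and keys are illustrative:

import hashlib

NUM_SHARDS = 4   # illustrative shard count

def shard_for(user_id: str) -> int:
    # shard_id = hash(user_id) % number_of_shards, with a stable hash function
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# shard_for("user-42") always maps the same key to the same shard, but adding a
# shard changes the modulus and remaps most keys (see consistent hashing later).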
Partitioning
Dividing data within a single database instance.
Each partition holds a subset of the data based on specific criteria, such
as a range of values, a list of values, or a hash function.
Improved Performance: Partitioning can improve query performance
by limiting the amount of data scanned during query execution.
Partitions are implemented within a single database. The database
management system (DBMS) handles the creation, maintenance, and
querying of partitions.
Sharding vs. Partitioning
o While partitions are divisions within a single database, sharding
involves dividing the data across multiple databases or database
servers.
Using Both Sharding and Partitioning
o In some cases, both sharding and partitioning can be used together to
achieve optimal scalability and performance.
Example Scenario
o Suppose you have a social media application with user data spread
across multiple shards based on user ID. Within each shard, you
further partition the data by activity date.
Sharding by User ID:
o Users with IDs 1-1000 are stored in Shard 1.
o Users with IDs 1001-2000 are stored in Shard 2.
Partitioning within Each Shard:
o Each shard's user activity table is partitioned by month.
Shard 1: Contains users with IDs 1-1000, and their activities are
partitioned by month.
Shard 2: Contains users with IDs 1001-2000, and their activities are
partitioned by month.
Message Queue
Message queuing makes it possible for applications to communicate
asynchronously, by sending messages to each other via a queue.
A message queue provides temporary storage between the sender and
the receiver so that the sender can keep operating without interruption
when the destination program is busy or not connected.
Asynchronous processing allows a service to send message to another
service, and move on to the next task while the other service processes
the request at its own pace.
A message queue is a queue of messages sent between applications
and waiting to be handled by other applications.
A message is the data transported between the sender and the receiver
application; it’s essentially a byte array with some headers on top. An
example of a message could be an event. One application tells another
application to start processing a specific task via the queue.
Architecture
o Producer (Sender): The component that sends messages to the
queue. Producers can generate messages at any time, without
needing to wait for the consumer to be ready.
o Queue: A storage area where messages are held until they are
consumed. The queue ensures that messages are delivered in a
reliable manner, usually in the order they were sent (FIFO - First In,
First Out).
o Consumer (Receiver): The component that retrieves and processes
messages from the queue. Consumers can process messages at their
own pace, independently of the producer.
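A minimal in-process illustration of the producer, queue, and consumer roles using Python's standard queue and threading modules; a real system would use a broker such as RabbitMQ or Kafka:

import queue
import threading
import time

messages = queue.Queue()   # FIFO buffer between producer and consumer

def producer():
    for i in range(3):
        messages.put(f"task-{i}")      # the producer sends and moves on immediately
        print(f"produced task-{i}")

def consumer():
    while True:
        task = messages.get()          # blocks until a message is available
        time.sleep(0.1)                # simulate processing at the consumer's own pace
        print(f"consumed {task}")
        messages.task_done()

threading.Thread(target=consumer, daemon=True).start()
producer()
messages.join()                        # wait until every message has been processed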
Role of Message Queues in Microservices
o In a microservice architecture, various functionalities are divided
across different services that collectively form a complete software
application.
o These services often have cross-dependencies, meaning some services cannot perform their functions without interacting with others.
o Message queuing plays a crucial role in this architecture by providing
a mechanism for services to communicate asynchronously, without
getting blocked by responses.
Key Characteristics of Message Queues:
o Asynchronous Communication: The sender and receiver of the
messages do not need to interact with the queue at the same time.
The sender can post a message and continue processing, while the
receiver can retrieve and process the message at a later time.
o Decoupling: The sender and receiver do not need to know about
each other's existence. This decoupling simplifies the system
architecture and makes it easier to scale and maintain.
o Reliability: Message queues often provide guarantees on message
delivery, ensuring that messages are not lost and are delivered in the
correct order. This is essential for critical applications where data
integrity is important.
o Scalability: Message queues help in scaling applications by
distributing the workload. Multiple consumers can process messages
from the queue in parallel, improving the throughput of the system.
o Buffering: Message queues can handle bursts of messages and
buffer them, ensuring that the receiver processes them at a
manageable rate.
Caching
A cache is essentially a key-value store that is used to temporarily store
data in a fast-access storage medium.
The primary purpose of a cache is to speed up data retrieval operations
by storing copies of data that are frequently accessed or computationally
expensive to retrieve from the original source.
It takes advantage of the locality of reference principle: recently
requested data is likely to be requested again.
Benefits
o Low latency: Caching makes your system faster by reducing data
fetching time.
o Reduced Server Load: Caching reduces the load on your database
or primary servers.
o Better Customer Experience: Quick response times lead to happier users.
Cons
o Stale Data: Cache data can become outdated, leading to data
inconsistency.
o System Complexity: Implementing caching adds an extra layer of
complexity to system design.
o Cache Invalidation: Determining when to refresh or clear cache can
be challenging.
Can we store all the data in the cache?
No! We can’t store all the data in the cache because of multiple reasons.
The hardware that we use to make cache memories is much more
expensive than a normal database.
If you store a ton of data on cache the search time will increase
compared to the database.
Caching levels in CPU design
In computer architecture and CPU design, L1, L2, and L3 refer to different
levels of cache memory hierarchy that are integrated into modern
processors to improve performance by reducing the time taken to access
data.
L1 cache is the fastest but smallest, L2 cache is larger but slightly
slower, and L3 cache is the largest and slower compared to L1 and L2
but still much faster than accessing RAM.
Caches in different layers
Caching can be organized into multiple levels depending on where and
how data is stored relative to its usage and accessibility.
Each level serves a specific purpose in optimizing performance and
efficiency within a system. Here are the typical levels of caching:
Client-Side Caching
o Location: On the client device (e.g., web browser, mobile app).
o Purpose: Store frequently accessed resources locally to reduce
latency and improve responsiveness.
o Examples: Browser cache for web pages, app cache for mobile
applications.
o Advantages: Minimizes network requests and server load, enhances
user experience by speeding up access to resources.
DNS Caching: for faster resolution of domain name to ip address of
frequently accessed domain names.
Server-Side Caching
o Location: On the server hosting the application or service.
o Purpose: Cache data or computations to reduce response times and
load on backend systems.
o Examples: In-memory caches like Redis or Memcached used to
store session data, computed results, or frequently accessed objects.
o Advantages: Improves scalability and efficiency by reducing the
need to repeatedly generate or fetch data from databases or external
services.
Database Caching
o Location: Within the database management system (DBMS).
o Purpose: Cache frequently accessed data or query results to
minimize disk I/O and query execution time.
o Advantages: Speeds up data retrieval and query processing, reduces
database load during peak usage.
Content Delivery Network (CDN) Caching
o Location: Distributed globally across CDN edge servers.
o Purpose: Cache static content (e.g., images, CSS, JavaScript) closer
to users to reduce latency and improve content delivery speed.
o Examples: CDN services like Cloudflare, Akamai, serving cached
copies of web assets to users based on geographical proximity.
Cache Performance Metrics
When implementing caching, it’s important to measure the performance
of the cache to ensure that it is effective in reducing latency and
improving system performance.
Hit rate: The hit rate is the percentage of requests that are served by
the cache without accessing the original source. A high hit rate indicates
that the cache is effective in reducing the number of requests to the
original source, while a low hit rate indicates that the cache may not be
providing significant performance benefits.
Miss rate: The miss rate is the percentage of requests that are not
served by the cache and need to be fetched from the original source. A
high miss rate indicates that the cache may not be caching the right
data or that the cache size may not be large enough to store all
frequently accessed data.
Cache size: The cache size is the amount of memory or storage
allocated for the cache. The cache size can impact the hit rate and miss
rate of the cache. A larger cache size can result in a higher hit rate, but it
may also increase the cost and complexity of the caching solution.
Cache latency: The cache latency is the time it takes to access data
from the cache. A lower cache latency indicates that the cache is faster
and more effective in reducing latency and improving system
performance. The cache latency can be impacted by the caching
technology used, the cache size, and the cache replacement and
invalidation policies.
Cache Replacement Policies
When implementing caching, it’s important to have a cache replacement
policy to determine which items in the cache should be removed when
the cache becomes full. Here are some of the most common cache
replacement policies:
Least Recently Used (LRU): This policy assumes that items that have been accessed more recently are more likely to be accessed again in the future (an LRU sketch follows this list).
Least Frequently Used (LFU): This policy assumes that items that
have been accessed more frequently are more likely to be accessed
again in the future.
First In, First Out (FIFO): This policy assumes that the oldest items in
the cache are the least likely to be accessed again in the future.
Random Replacement: This policy doesn’t make any assumptions
about the likelihood of future access and can be useful when the access
pattern is unpredictable.
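A compact sketch of the LRU policy using collections.OrderedDict, which keeps keys in insertion/access order; the capacity and keys below are illustrative:

from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None                      # cache miss
        self.data.move_to_end(key)           # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)    # evict the least recently used item

# cache = LRUCache(2); cache.put("a", 1); cache.put("b", 2)
# cache.get("a"); cache.put("c", 3)   # "b" is evicted, being least recently used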
Comparison of different replacement policies
Each cache replacement policy has its advantages and disadvantages,
and the best policy to use depends on the specific use case.
LRU and LFU are generally more effective than FIFO and random
replacement since they take into account the access pattern of the
cache.
However, LRU and LFU can be more expensive to implement since they
require maintaining additional data structures to track access patterns.
FIFO and random replacement are simpler to implement but may not be
as effective in optimizing cache performance.
Overall, the cache replacement policy used should be chosen carefully to
balance the trade-off between performance and complexity.
Cache Invalidation Strategies
Cache invalidation is the process of removing data from the cache when
it is no longer valid.
Invalidating the cache is essential to ensure that the data stored in the
cache is accurate and up-to-date.
Here are some of the most common cache invalidation strategies:
o Write-Through Cache: Data is written to the cache and the backing store at the same time. Write-through minimizes the risk of data loss, but since every write operation must be completed in both places before returning success to the client, this scheme has the disadvantage of higher latency for write operations (a small sketch contrasting these schemes follows this list).
o Write-around cache: This technique is similar to write-through
cache, but data is written directly to permanent storage, bypassing
the cache. This can reduce the cache being flooded with write
operations that will not subsequently be re-read, but has the
disadvantage that a read request for recently written data will create a
“cache miss” and must be read from slower back-end storage and
experience higher latency.
o Write-Back Cache: Under this scheme, data is written to cache
alone, and completion is immediately confirmed to the client. The
write to the permanent storage is done based on certain conditions,
for example, when the cache system needs to free some space. This
results in low-latency and high-throughput for write-intensive
applications; however, this speed comes with the risk of data loss in
case of a crash or other adverse event because the only copy of the
written data is in the cache.
o Write-behind cache: It is quite similar to write-back cache. In this
scheme, data is written to the cache and acknowledged to the
application immediately, but it is not immediately written to the
permanent storage. Instead, the write operation is deferred, and the
data is eventually written to the permanent storage at a later time.
o The main difference between write-back cache and write-behind cache
is when the data is written to the permanent storage. In write-back
caching, data is only written to the permanent storage when it is
necessary for the cache to free up space, while in write-behind
caching, data is written to the permanent storage at specified
intervals.
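A simplified sketch contrasting write-through (cache and store written before acknowledging) with write-back/write-behind (acknowledge after the cache write and persist later); the cache and store here are plain dictionaries for illustration:

cache, store = {}, {}            # in-memory cache and "permanent" storage stand-ins
pending = {}                     # dirty entries not yet persisted

def write_through(key, value):
    cache[key] = value           # write the cache...
    store[key] = value           # ...and the backing store before acknowledging
    return "ok"

def write_back(key, value):
    cache[key] = value           # acknowledge after the cache write only
    pending[key] = value         # persistence is deferred (risk of loss on a crash)
    return "ok"

def flush():
    store.update(pending)        # later: on eviction, on a schedule, etc.
    pending.clear()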
Distributed Caching
The distributed cache is designed to store and manage cached data
across multiple nodes in a network.
This approach improves scalability, fault tolerance, and performance by
leveraging the combined resources of multiple machines.
Cache Nodes: Individual servers that store and manage cached data in
a distributed cache system.
Cache Management Layer: The system responsible for coordinating
data distribution, consistency, replication, and fault tolerance across
cache nodes.
Networking Layer: The communication infrastructure that enables data
exchange between cache nodes and clients, ensuring secure and
efficient data transfer.
Metadata and Configuration Store: A centralized repository that
keeps track of cache metadata (e.g., key mappings, expiration times)
and configuration settings for cache nodes.
Client Interface: APIs and libraries that allow applications to interact
with the distributed cache for data storage and retrieval.
Request Flow in Distributed Cache
Client Request: The client sends a data request to the distributed
cache via the client interface.
Routing to Cache Node: The key of the requested data is hashed to
determine which cache node is responsible for the data. The metadata
and configuration store provide the necessary information for routing.
Cache Lookup: The cache node checks its local store for the requested
data (cache hit or miss).
Handling Cache Miss: If a miss occurs, the cache node fetches data
from the primary source, coordinated by the management layer, and
stores it. The fetched data is then stored in the cache node for future
requests. The metadata and configuration store may update metadata to
reflect the new data location.
Data Retrieval and Return: The data is returned to the client.
Replication (Optional): The data may be replicated to other nodes,
coordinated by the management layer, and metadata is updated.
Cache Maintenance: The cache node may evict or expire data based
on policies, overseen by the management layer.
SQL Databases
Databases are typically controlled by database management systems
(DBMS).
SQL databases are relational databases that use Structured Query
Language (SQL) for managing and querying data.
These databases are mainly composed of tables, with each table
consisting of rows and columns.
In a relational database, each row is a record with a unique identifier called a key.
Relational databases have a predefined schema, which establishes
the relationship between tables and field types. In a relational
database, the schema must be clearly defined before any information
can be added.
Relational databases are ACID-compliant, which makes them highly
suitable for transaction-oriented systems and storing financial
data. ACID compliance ensures error-free services, even in the event of
failures, which is essential for transaction validity.
Here are some key characteristics and components of SQL databases:
Relational Structure: Data is organized into tables with rows and
columns. Tables are related to each other through defined relationships.
Schema: SQL databases have a schema that defines the structure of
the database, including tables, fields (columns), and relationships
between tables.
SQL Language: SQL (Structured Query Language) is the standard
language used to interact with SQL databases. It allows users to query
data, insert new records, update existing records, and delete records.
ACID Properties: Transactions in SQL databases adhere to the ACID
properties:
o Atomicity: Transactions are all or nothing.
o Consistency: Transactions bring the database from one valid state to
another.
o Isolation: Transactions occur independently of each other.
o Durability: Once a transaction is committed, it is permanently saved
and recoverable.
Examples: Examples of SQL databases include MySQL, PostgreSQL,
Oracle Database, SQLite, Microsoft SQL Server, and others.
NoSQL Databases:
NoSQL (Not only SQL) databases are a diverse group of non-relational
databases designed to address the limitations of traditional SQL
databases, particularly in terms of scalability, flexibility, and performance
under specific workloads.
NoSQL databases do not adhere to the relational model and typically do
not use SQL as their primary query language. Instead, they employ
various data models and query languages, depending on the specific
type of NoSQL database being used.
The key characteristics of NoSQL databases include their schema-less
design, which allows for greater flexibility in handling data; horizontal
scalability, which makes it easier to distribute data across multiple
servers; and their ability to perform well under specific workloads, such
as high write loads or large-scale data storage and retrieval.
Types of NoSQL databases and their use cases
NoSQL databases can be broadly categorized into four main types, each
with its unique data model and use cases:
Document databases: These databases store data in a semi-structured
format, such as JSON or BSON documents. Each document can contain
nested fields, arrays, and other complex data structures, providing a
high degree of flexibility in representing hierarchical and related
data. Some popular document databases include MongoDB and
CouchDB.
Key-value stores: Key-value databases store data as key-value pairs,
where the key is a unique identifier and the value is the associated data.
These databases excel in scenarios requiring high write and read
performance for simple data models, such as caching, session
management, and real-time analytics. Some widely-used key-value
stores are Redis and Amazon DynamoDB.
Column-family stores: Also known as wide-column stores, these
databases store data in columns rather than rows, making them highly
efficient for read and write operations on specific columns of data.
Column-family stores are particularly well-suited for large-scale,
distributed applications with high write loads and sparse or time-series
data, such as IoT systems, log analysis, and recommendation engines.
Examples of column-family stores include Apache Cassandra and HBase.
Graph databases: Graph databases store data as nodes and edges in a
graph, representing entities and their relationships. These databases are
optimized for traversing complex relationships and performing graph-
based queries, making them ideal for applications involving social
networks, fraud detection, knowledge graphs, and semantic search.
Some notable graph databases are Neo4j and Amazon Neptune.
SQL VS NoSQL Databases
Schema:
o SQL: SQL databases enforce a predefined schema for the data, which
ensures that the data is structured, consistent, and follows specific
rules. This structured schema can make it easier to understand and
maintain the data model, as well as optimize queries for performance.
We cannot accommodate new data types without schema modification.
o NOSQL: One of the primary advantages of NoSQL databases is their
schema-less design, which allows for greater flexibility in handling
diverse and dynamic data models. This makes it easier to adapt to
changing requirements and accommodate new data types without the
need for extensive schema modifications, as is often the case with
SQL databases.
Indexing:
o Both SQL (relational) and NoSQL databases can utilize indexing to
optimize query performance, although the specifics of indexing can
vary between them.
ACID compliance:
o SQL: SQL databases adhere to the ACID (Atomicity, Consistency,
Isolation, Durability) properties, which ensure the reliability of
transactions and the consistency of the data. These properties
guarantee that any operation on the data will either be completed in
its entirety or not at all, and that the data will always remain in a
consistent state.
o NoSQL: NoSQL databases offer different consistency models and
trade-offs depending on their specific design goals and use cases.
Some focus on achieving eventual consistency rather than strong
consistency. Some provide support for certain ACID properties in
specific scenarios, but they may not guarantee ACID compliance
universally across all operations and configurations. Some NoSQL
databases, like MongoDB, often sacrifice full ACID compliance in
favor of other benefits such as high availability, partition tolerance,
and scalability.
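To illustrate the atomicity guarantee described for SQL databases above, here is a minimal sketch using Python's built-in sqlite3 module. The accounts table and the transfer amounts are invented for the example; the point is that both UPDATEs inside the transaction either commit together or roll back together.
import sqlite3

# In-memory database with a hypothetical accounts table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()

try:
    # Using the connection as a context manager wraps a transaction:
    # it commits if the block succeeds and rolls back if it raises.
    with conn:
        conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 1")
        conn.execute("UPDATE accounts SET balance = balance + 30 WHERE id = 2")
except sqlite3.Error:
    # On failure neither UPDATE persists, so money is never half-transferred.
    pass

print(conn.execute("SELECT id, balance FROM accounts").fetchall())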
Scalability:
o SQL: SQL databases can be scaled vertically by adding more
resources (such as CPU, memory, and storage) to a single server.
However, horizontal scaling, or distributing the data across multiple
servers, can be more challenging due to the relational nature of the
data and the constraints imposed by the ACID properties. This can
lead to performance bottlenecks and difficulties in scaling for large-
scale applications with high write loads or massive amounts of data.
o NoSql: NoSQL databases are designed to scale horizontally, enabling
the distribution of data across multiple servers, often with built-in
support for data replication, sharding, and partitioning. This makes
NoSQL databases well-suited for large-scale applications with high
write loads or massive amounts of data, where traditional SQL
databases may struggle to maintain performance and consistency.
Querying
o SQL: SQL is a powerful and expressive query language that allows
developers to perform complex operations on the data, such as
filtering, sorting, grouping, and joining multiple tables based on
specified conditions.
o NoSQL: While some NoSQL databases offer powerful query languages
and capabilities, they may not be as expressive or versatile as SQL
when it comes to complex data manipulation and analysis. This can be
a limiting factor in applications that require sophisticated querying,
joining, or aggregation of data.
o Additionally, developers may need to learn multiple query languages
and paradigms when working with different types of NoSQL databases.
o Because a query often does not need to join numerous tables to obtain
a response, as relational database queries frequently do, non-relational
databases can be faster than relational databases for these simpler
access patterns.
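As a rough illustration of the querying difference, the sketch below contrasts a relational join with a single-document lookup. The SQL string is standard; the MongoDB-style call is shown only as a comment in pymongo's find() style, and the table, collection, and field names are assumptions for this example.
# Relational approach (SQL): a join pulls the author's name onto each post.
sql_query = """
    SELECT posts.title, users.name
    FROM posts
    JOIN users ON users.id = posts.author_id
    WHERE users.id = 42;
"""

# Document approach: the post document already embeds what we need, so a
# single lookup by a filter replaces the join (pymongo-style call shown as a
# comment; collection and field names are hypothetical).
#   db.posts.find({"author_id": 42})
document_filter = {"author_id": 42}

print(sql_query.strip())
print("Equivalent document filter:", document_filter)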
Example GraphQL schema (the Post type is filled in here for completeness; its fields are illustrative):
type User {
  id: ID!
  name: String!
  email: String!
}
type Post {
  id: ID!
  title: String!
  author: User!
}
type Query {
  posts: [Post]
  user(id: ID!): User
}
Use Cases
REST: Best for simple CRUD operations, standard web applications, and
where human-readability of messages is essential.
GraphQL: Ideal for applications with complex querying needs, where
clients need precise control over the data they request, and for scenarios
requiring strong schema and introspection.
API Design and Best Practices
Backwards compatibility, so we don't break existing clients that are
already using the API.
Versioning
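One common way to combine versioning with backwards compatibility is to put a version in the URL, so existing clients keep hitting the old contract while new clients opt into the new one. The Flask routes below are a minimal sketch; the endpoint paths and response fields are invented for the example.
from flask import Flask, jsonify

app = Flask(__name__)

# v1 keeps the original response shape so existing clients do not break.
@app.route("/api/v1/users/<int:user_id>")
def get_user_v1(user_id):
    return jsonify({"id": user_id, "name": "Mary"})

# v2 can change the contract (e.g. split the name into parts) without touching v1.
@app.route("/api/v2/users/<int:user_id>")
def get_user_v2(user_id):
    return jsonify({"id": user_id, "first_name": "Mary", "last_name": "Smith"})

if __name__ == "__main__":
    app.run(port=8000)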
Consistent Hashing
Consistent hashing aims to minimize the amount of data that needs to
be reassigned when nodes are added or removed.
It achieves this by mapping both nodes and data items onto a virtual
hash ring.
Each node and data item is assigned a position on the ring using a hash
function.
Nodes are responsible for the data items whose positions are closest to
and follow them on the ring.
Nodes and data items are mapped onto a virtual ring using a hash
function that outputs a large numeric space.
Normal Hashing: Normal hashing involves using a hash function to
determine which node should store or handle a particular piece of data.
This can be expressed as:
node_index = hash_function(data_key) % num_nodes
Example Scenario: Consider a scenario with 3 nodes (Node 1, Node 2,
Node 3) and a set of data items (Data 1 to Data 10). Using normal
hashing:
o Data 1 hashes to a value that maps to Node 1.
o Data 2 hashes to a value that maps to Node 2.
o Data 3 hashes to a value that maps to Node 3.
o If you add a new node (Node 4) to scale the system, you would
typically need to rehash all data because the addition of a new node
changes the modulus (num_nodes), affecting which node each data
item should be assigned to.
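The sketch below shows this rehashing problem numerically: with hash(key) % num_nodes, growing the cluster from 3 to 4 nodes changes the assignment of most keys. hashlib is used so the result is stable across runs; the key names are made up.
import hashlib

def node_index(data_key: str, num_nodes: int) -> int:
    """Normal hashing: node_index = hash_function(data_key) % num_nodes."""
    digest = hashlib.md5(data_key.encode()).hexdigest()
    return int(digest, 16) % num_nodes

keys = [f"data-{i}" for i in range(1, 11)]

before = {k: node_index(k, 3) for k in keys}   # 3 nodes
after = {k: node_index(k, 4) for k in keys}    # a 4th node is added

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys change nodes when going from 3 to 4 nodes")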
Example Scenario (consistent hashing): Using 3 nodes (Node A, Node B,
Node C) and the same set of data items (Data 1 to Data 10):
o Nodes A, B, and C are placed at different points on the hash ring
based on their hash values.
o Data 1 to Data 10 are also hashed onto the ring.
o Assigning Data to Nodes:
To determine which node should handle a particular data item,
you move clockwise on the ring from the data item's hash value
until you find the first node.
That node becomes responsible for storing or processing that
data item.
This process ensures that each node is responsible for a segment
of the hash ring, and data items are evenly distributed among
nodes.
o Adding a New Node:
When a new node is added, it is placed on the ring based on its
hash value.
Only the data that was previously assigned to the next node on
the ring (in a clockwise direction) needs to be reassigned to the
new node.
This minimal reassignment reduces the overhead and disruption
in the system compared to traditional hashing methods where all
data might need to be redistributed.
If you add a new node (Node D), only a fraction of the data
needs to be remapped. For example, only the data items whose
positions fall between Node C and Node D on the ring move to
Node D.
o Removing a Node:
When a node is removed, its data is typically reassigned to the
next node that follows it on the ring.
Again, only a portion of the data needs to be reassigned,
maintaining efficiency and minimizing disruption.
Now, if Node B is removed, its data is typically reassigned to
Node C which is next in clockwise direction.
Only the data that was previously assigned to Node B needs to
be remapped.
o Determining the Next Node:
To find which node should handle the request or data item:
The system starts from the hashed value's position on the hash
ring.
It moves clockwise around the ring until it finds the first node
whose position (hash value or identifier) is greater than or equal
to the hashed value of the request.
This node becomes responsible for processing or storing the
request.
o Why Not a Centralized Load Balancer or Proxy?
The use of a hash ring allows nodes to independently determine
routing decisions based on their relative positions on the ring.
This decentralized approach scales more effectively as the
number of nodes increases, without creating a bottleneck at a
centralized load balancer.
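Tying the ring description together, here is a minimal consistent-hash-ring sketch in Python: positions come from md5, lookups walk clockwise to the first node at or after the key's position, and adding a node only remaps the keys that fall between its predecessor and itself. This is a simplified illustration (no virtual nodes or replication), and the node and key names are invented.
import bisect
import hashlib

def ring_position(value: str) -> int:
    """Map a node name or data key to a position on the ring."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, nodes=()):
        self._positions = []      # sorted node positions on the ring
        self._pos_to_node = {}    # position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        pos = ring_position(node)
        bisect.insort(self._positions, pos)
        self._pos_to_node[pos] = node

    def remove_node(self, node: str) -> None:
        pos = ring_position(node)
        self._positions.remove(pos)
        del self._pos_to_node[pos]

    def get_node(self, key: str) -> str:
        """Walk clockwise: first node position >= the key's position (with wrap-around)."""
        pos = ring_position(key)
        idx = bisect.bisect_left(self._positions, pos)
        if idx == len(self._positions):
            idx = 0  # wrap past the end of the ring back to the first node
        return self._pos_to_node[self._positions[idx]]

ring = ConsistentHashRing(["NodeA", "NodeB", "NodeC"])
keys = [f"data-{i}" for i in range(1, 11)]
before = {k: ring.get_node(k) for k in keys}

ring.add_node("NodeD")                        # only keys between NodeD's predecessor
after = {k: ring.get_node(k) for k in keys}   # and NodeD itself move to NodeD

moved = [k for k in keys if before[k] != after[k]]
print("keys remapped after adding NodeD:", moved)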
Networking Basics
IP Address: An IP address is a numerical label assigned to each device
connected to a computer network. It identifies the location of a device on
a network, allowing other devices to communicate with it and facilitating
data routing.
Port: A port is a communication endpoint in an operating system that is
used to uniquely identify a specific process or network service. Ports
allow multiple services or processes to run simultaneously on a single
device.
MAC Address: A MAC address is a unique identifier assigned to network
interfaces for communications on the physical network segment.
Virtual Private Networks (VPNs): VPNs create secure, encrypted
connections over a less secure network (e.g., the internet), enabling
remote access and secure communication.
Firewalls and Security: Firewalls enforce security policies by filtering
incoming and outgoing traffic based on predefined rules.
Routing and Switching: Routing involves directing network traffic
between different networks, while switching involves forwarding data
within the same network.
TCP (Transmission Control Protocol):
o Connection-Oriented: TCP establishes a reliable and ordered
connection between two devices before data exchange begins.
o Reliability: Provides reliable delivery of data with error-checking,
retransmission of lost packets, and in-order delivery.
o Flow Control: Manages data flow between sender and receiver to
prevent overwhelming the receiver with data.
Use Cases:
o Web browsing: TCP is used by HTTP for loading web pages.
o Email: SMTP, POP, and IMAP protocols use TCP for sending and
receiving emails.
o File Transfer: FTP and SSH use TCP for secure file transfer.
o Streaming: TCP is used for streaming media where reliability and
order are crucial.
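As a minimal sketch of the connection-oriented flow, the snippet below opens a TCP connection with Python's socket module and sends a plain HTTP request over it. example.com is used purely as an illustrative host.
import socket

# TCP: connect first, then exchange bytes over the established connection.
with socket.create_connection(("example.com", 80), timeout=5) as tcp_sock:
    request = b"GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n"
    tcp_sock.sendall(request)
    response = tcp_sock.recv(4096)   # TCP delivers these bytes reliably and in order
    print(response.decode(errors="replace")[:200])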
UDP (User Datagram Protocol):
o Connectionless: UDP does not establish a connection before sending
data and does not guarantee delivery.
o Unreliable: Does not perform error-checking or packet
retransmission, leaving that responsibility to the application layer.
o Low Overhead: Lightweight protocol with minimal processing and
transmission overhead.
Use Cases:
o Real-time applications: Used in video conferencing, online gaming,
and VoIP (Voice over IP) where low latency is critical.
o DNS: UDP is used by DNS for quick lookups.
o Streaming: UDP can be used for live video or audio streaming where
occasional packet loss is acceptable.
o IoT: Used in IoT devices for transmitting small amounts of data
quickly.
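In contrast, the UDP sketch below simply fires a datagram at an address with no connection setup and no delivery guarantee. The address and payload are made up for the example, and nothing needs to be listening for the send itself to succeed.
import socket

# UDP: no connection, no handshake; each sendto() is an independent datagram.
udp_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
udp_sock.sendto(b"player position: x=10 y=4", ("127.0.0.1", 9999))
udp_sock.close()
# The datagram may be lost or arrive out of order; the application must cope.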
Note: Application-layer protocols like HTTP, SMTP, FTP, etc., rely on
transport-layer protocols, either TCP or UDP, for transmitting their data.
Object Storage
Object storage is a data storage architecture that manages data as
objects rather than traditional file systems or block storage.
Each object typically includes data, metadata (attributes or tags), and
a unique identifier (key).
Object storage is not a database in the traditional sense, as it does not
provide structured query capabilities like relational databases (e.g., SQL
databases).
Instead, object storage is a data storage architecture optimized for
storing and managing large amounts of unstructured data as discrete
units called objects.
Primarily supports basic operations like storing, retrieving, and deleting
objects. It lacks the ability to perform complex queries or transactions on
the data.
Object Storage Model: Cloud storage services like Amazon S3, Google
Cloud Storage, or Azure Blob Storage use an object storage model.
Unlike traditional file systems that organize data in a hierarchical
structure, object storage uses a flat namespace where each object
(file) is stored as a standalone unit identified by a unique key
(often a URL or URI).
Characteristics:
o Scalability: Object storage systems are highly scalable, capable of
storing vast amounts of unstructured data across distributed
infrastructure.
o Metadata Rich: Each object can be enriched with metadata (e.g.,
timestamps, content type), allowing for efficient indexing and search
operations.
o Durability: Object storage systems often provide high durability
through data replication and distribution across multiple nodes or data
centers.
o Access Methods: Objects are typically accessed via HTTP/S using
RESTful APIs, making them suitable for cloud-native and distributed
applications.
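Because objects are addressed by bucket and key over an HTTP API, a typical interaction looks like the boto3 sketch below. The bucket name, key, and payload are placeholders, and it assumes AWS credentials and a region are already configured in the environment.
import boto3

s3 = boto3.client("s3")

# Store an object: the key is the object's unique identifier within the bucket.
s3.put_object(
    Bucket="my-example-bucket",              # placeholder bucket name
    Key="backups/2024-01-01/db.dump",        # flat namespace; "/" is just part of the key
    Body=b"...binary payload...",
    ContentType="application/octet-stream",  # stored alongside the data as metadata
)

# Retrieve it later by the same bucket + key.
obj = s3.get_object(Bucket="my-example-bucket", Key="backups/2024-01-01/db.dump")
data = obj["Body"].read()
print(len(data), "bytes retrieved")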
Use Cases:
o Ideal for storing large volumes of unstructured data such as images,
videos, documents, backups, and log files. It is commonly used in
cloud storage solutions.
o Cloud Storage: Object storage is widely used in cloud platforms
(e.g., Amazon S3, Azure Blob Storage, Google Cloud Storage) for
storing files, backups, and large-scale data sets.
o Content Delivery: Serving static content (e.g., images, videos) for
websites and applications.
BLOB: BLOB" stands for Binary Large OBject. It refers to a collection of
binary data stored as a single entity in a database management system
(DBMS) or a file system.
o Binary Data: BLOBs are used to store binary data, which can include
images, videos, audio files, documents (like PDFs), and other
multimedia files. Unlike traditional text data, which can be easily
represented using characters and strings, binary data consists of
raw bytes that may not have a specific character encoding.
o Large Size: The term "Large" in BLOB emphasizes that these objects
can be of considerable size, potentially ranging from kilobytes to
gigabytes or even larger. This makes them suitable for storing
large multimedia files and other types of data that are not text-based.