System Design Notes

Why do you need to know system design?


Components of a Distributed System


 Designing large-scale, distributed systems requires a deep
understanding of the various components that come together to create a
robust and efficient architecture.
DNS
 DNS (Domain Name System) is a hierarchical and decentralized naming system used to resolve human-friendly domain names into their corresponding IP addresses.
 It essentially acts as the phonebook of the internet, allowing users to access websites and services by typing in easily memorable domain names, such as www.designgurus.io, rather than the numerical IP addresses like “192.0.2.1” that computers use to identify each other.
 When you enter a domain name into your web browser, the DNS is
responsible for locating the associated IP address and directing your
request to the correct server.
How DNS works
 The process begins with your computer sending a query to a recursive
resolver, which then searches a series of DNS servers, starting with the
root server, followed by the Top-Level Domain (TLD) server, and finally
the authoritative name server.
 Once the IP address is found, the recursive resolver returns it to your
computer, allowing your browser to establish a connection with the
target server and access the desired content.
 Recursive Resolver: A recursive resolver (also known as a DNS
resolver) is a server that takes a domain name query from a client (such
as your computer) and then queries other DNS servers to resolve the
domain name into an IP address.
 The recursive resolver handles the entire process of searching through
the necessary DNS servers to obtain the correct IP address and then
returns this information back to the client.
 DNS Hierarchy Components
 Root DNS Servers:
o The top of the DNS hierarchy.
o There are 13 sets of root servers worldwide, known by letters A
through M.
o They don't contain the specific IP addresses for domain names but can
direct queries to the appropriate TLD servers based on the suffix of
the domain name (e.g., .com, .org, .net).
 Top-Level Domain (TLD) DNS Servers:
o Manage domains for a specific TLD, such as .com, .org, .net, etc.
o Each TLD has its own set of DNS servers.
o They provide the address of the authoritative DNS server responsible
for the specific domain name within the TLD.
 Authoritative DNS Servers:
o These servers contain the actual DNS records (like A records, MX
records) for the domain name.
o They provide the final IP address for the domain name to the recursive
resolver.
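To make the lookup flow concrete, here is a minimal Python sketch (standard library only) that hands a hostname to the operating system's stub resolver; the recursive resolver it points at then walks the root, TLD, and authoritative servers on our behalf. The hostname is just an example.

```python
import socket

def resolve(hostname: str) -> list[str]:
    # getaddrinfo asks the OS stub resolver, which forwards the query to
    # the configured recursive resolver (e.g., the ISP's or 8.8.8.8)
    infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
    # each entry is (family, type, proto, canonname, sockaddr);
    # the IP address is the first element of sockaddr
    return sorted({info[4][0] for info in infos})

print(resolve("www.example.com"))  # e.g., ['93.184.216.34']
```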
Load Balancer
 Load balancers are software or hardware components that distribute incoming network traffic across multiple servers to ensure optimal resource utilization, maximize throughput, minimize response time, and prevent overloading any single server.
 They play a crucial role in maintaining high availability and reliability by
distributing requests evenly and rerouting traffic in case of server
failures.
 There are different types of load balancing algorithms, such as Round
Robin, Least Connections, and IP Hash, each with its benefits and trade-
offs. Selecting the right load balancing strategy depends on the specific
requirements of the system.
 There are two primary approaches to load balancing: Dynamic Load
Balancing and Static Load Balancing.
 Dynamic load balancing uses algorithms that take into account the
current state of each server and distribute traffic accordingly.
 Static load balancing distributes traffic without making these
adjustments. Some static algorithms send an equal amount of traffic to
each server in a group, either in a specified order or at random.
 Load Balancing Algorithms:
o Round Robin: Distributes requests sequentially to each server in the
group. - Static
o Least Connections: Directs traffic to the server with the fewest
active connections. - Dynamic
o IP Hash: Uses the client’s IP address to determine which server will
handle the request. - Static
o Weighted: Assigns a weight to each server based on its capacity,
distributing traffic proportionally. – Dynamic
o Resource-based: Distributes load based on what resources each
server has available at the time. Specialized software (called an
"agent") running on each server measures that server's available CPU
and memory, and the load balancer queries the agent before
distributing traffic to that server. - Dynamic
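As a rough illustration, here is a minimal Python sketch (not how production load balancers are implemented) of one static and one dynamic algorithm from the list above; the server names are hypothetical placeholders.

```python
import itertools

class RoundRobin:
    """Static: hands out servers in a fixed rotation."""
    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self):
        return next(self._cycle)

class LeastConnections:
    """Dynamic: picks the server with the fewest active connections."""
    def __init__(self, servers):
        self.active = {s: 0 for s in servers}

    def pick(self):
        server = min(self.active, key=self.active.get)
        self.active[server] += 1   # a request starts on this server
        return server

    def release(self, server):
        self.active[server] -= 1   # call when the request finishes

rr = RoundRobin(["app-1", "app-2", "app-3"])
print([rr.pick() for _ in range(4)])  # ['app-1', 'app-2', 'app-3', 'app-1']
```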
Relationship Between API Gateway and Load Balancer
 Front-End Layer:
o Client Requests: Clients send requests to the Load Balancer, which
distributes these requests to multiple instances of the API
Gateway.
o API Gateway: Each instance of the API Gateway processes the
requests, handles routing, authentication, and other functions, and
then forwards the requests to the appropriate backend services.
 Backend Layer:
o API Gateway to Services: The API Gateway may use a Load
Balancer to distribute requests among multiple instances of a
particular microservice or backend server.
CDN
 A Content Delivery Network (CDN) is a distributed network of servers
that store and deliver content, such as images, videos, stylesheets, and
scripts, to users from geographically closer locations.
 The primary goal of a CDN is to improve the performance, availability,
and reliability of web services and applications by delivering content
from locations closer to the user, regardless of their location relative to
the origin server.
 Here are some key points about CDNs:
o Performance Improvement: By caching content on servers that are
closer to the user, CDNs reduce latency and load times. This is
particularly beneficial for large websites, streaming services, and
online platforms with a global audience.
o Reliability and Redundancy: CDNs provide redundancy, meaning if
one server fails, another can take over.
o Scalability: CDNs help handle large amounts of traffic and sudden
spikes by distributing the load across multiple servers. This scalability
is crucial for websites experiencing varying traffic levels.
o Cost Efficiency: By offloading traffic from the origin server, CDNs can
reduce bandwidth costs and server load. This efficiency can lead to
cost savings for the website owner.
How a CDN works
 When a user requests content from a website or application, the Domain
Name System (DNS) is queried. If the website is configured to use a CDN,
the DNS resolver returns the IP address of the nearest edge server based
on the user’s location and the request is directed to the nearest CDN
server, also known as an edge server.
 If the edge server has the requested content cached, it directly
serves the content to the user. This reduces latency and improves the
user experience, as the content travels a shorter distance.
 If the content is not cached on the edge server, the CDN retrieves it
from the origin server or another nearby CDN server. Once the content is
fetched, it is cached on the edge server and served to the user.
 To ensure the content remains up-to-date, the CDN periodically checks
the origin server for changes and updates its cache accordingly.
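A minimal sketch of the edge-server decision described above, with origin_fetch standing in (hypothetically) for an HTTP request to the origin server; real CDNs also honour TTLs and Cache-Control headers when deciding what to keep.

```python
cache: dict[str, bytes] = {}  # in-memory stand-in for the edge cache

def origin_fetch(path: str) -> bytes:
    # hypothetical stand-in for an HTTP request to the origin server
    return f"<content of {path} from origin>".encode()

def edge_serve(path: str) -> bytes:
    if path in cache:           # cache hit: served directly from the edge
        return cache[path]
    body = origin_fetch(path)   # cache miss: fetch from the origin
    cache[path] = body          # store for subsequent nearby users
    return body

edge_serve("/logo.png")  # miss: goes to the origin
edge_serve("/logo.png")  # hit: served from the edge cache
```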

 Note: When it comes to handling a signup process or any other complex request, the request typically goes directly to the origin server hosted in a data center, not the CDN.
 CDNs are optimized for caching and delivering static content and some
dynamic content that doesn't require real-time processing or
personalization.
 For a signup request, which involves data processing, validation, and
possibly interaction with a database, the CDN's role is limited.
 The CDN receives the request but does not process it because it involves
complex operations.
 The CDN forwards the request to the origin server in the data center
where the web application logic is hosted.
Vertical Scaling and Horizontal Scaling
 Vertical scaling and horizontal scaling are two strategies used to enhance
the performance and capacity of computing resources, particularly in the
context of servers, databases, and applications.
Vertical Scaling (Scaling Up)
 Vertical scaling involves adding more power (CPU, RAM, storage) to an
existing machine.
 Advantages
o Easier to implement as it often involves upgrading the current
hardware.
o Typically requires less configuration and maintenance since there is
only one machine.
 Disadvantages:
o Limits: There's a ceiling to how much you can upgrade a single
machine due to physical hardware limits.
o Downtime: Upgrading hardware may require significant downtime.
o Single Point of Failure: If the server goes down, all services are
affected.
 Use Cases:
o Best for applications that are not easily distributed or require
significant resources on a single server.
o Ideal for legacy systems where rewriting for horizontal scaling isn't
feasible.
Horizontal Scaling (Scaling Out)
 Horizontal scaling involves adding more machines to a system,
distributing the load across multiple servers.
 Advantages:
o Scalability: Easier to scale out as demand increases, theoretically
allowing for unlimited growth.
o Redundancy: Multiple machines mean that failure of one doesn't
bring down the entire system, enhancing reliability and fault
tolerance.
o Flexibility: Can handle varying loads more efficiently by distributing
tasks.
 Disadvantages:
o Complexity: More complex to implement and manage, requiring load
balancers, distributed databases, and other infrastructure.
o Consistency: Ensuring data consistency across multiple servers can
be challenging and might require sophisticated strategies (like
sharding or distributed databases).
 Use Cases
o Suitable for web applications, cloud services, and other scenarios
where workload distribution is beneficial.
o Preferred for modern, scalable architectures like microservices and
distributed databases.
Data Centers
 A data center is a facility that houses a large number of computer
servers and related equipment, including storage systems,
networking infrastructure, and cooling systems.
 They are designed to manage, store, and disseminate large amounts of
data and applications efficiently and securely.
 Components
 Servers, storage systems, networking equipment, backup power
supplies, cooling systems, and physical security measures.
 Do Data Centers Store Databases?
 Yes, data centers store databases. They host the physical infrastructure
(servers, storage systems) where databases reside. Data centers provide
the necessary environment for databases to operate efficiently, with high
availability, redundancy, and security.
 Data Centers and CDNs
 Typically centralized, though larger companies might have multiple data
centers distributed regionally or globally to ensure redundancy and
disaster recovery.
 Data centers often house CDN servers as part of their infrastructure.
 This means that within a data center, there can be specialized servers
dedicated to CDN operations.
 These CDN servers cache and serve content to users based on their
geographic location and the proximity of the data center.
 Large organizations and CDN providers (like Akamai, Cloudflare, and
Amazon CloudFront) often have their CDN servers hosted in various data
centers around the world. These data centers provide the physical
infrastructure (power, cooling, security) needed to support the CDN
operations.
HTTP
 HTTP (HyperText Transfer Protocol) is a protocol used by web browsers and servers to communicate and transfer data over the Internet.
 It defines the structure and transmission rules for web communication, specifying how requests and responses should be formatted and processed, e.g., the format of request and response messages, the meaning of status codes, HTTP methods like GET and POST, and the structure of headers and body.
 It allows for the transfer of various types of data, such as text, images,
videos, etc., between a client (usually a web browser) and a server.
 HTTP operates over a plain text communication channel. This means that
data sent using HTTP is not encrypted, making it susceptible to
interception and modification by attackers. Therefore, any data
transmitted over HTTP is insecure and could potentially be compromised.
HTTP is commonly used for websites that do not handle sensitive
information or do not require additional security measures.
 HTTP is stateless, meaning a new request knows nothing about previous requests to the server, and no state is stored between requests. We can use cookies, local storage, sessions, etc. to overcome this, but HTTP itself is stateless.
 Methods of HTTP
 HTTP methods are used to indicate the desired action to be
performed on a resource identified by a given URL. Here are the
most common HTTP methods:
o GET: Purpose: Retrieve data from a server at the specified resource.
o POST: Purpose: Submit data to be processed to a specified resource.
o PUT: Update an existing resource or create a new resource if it
doesn't exist.
o DELETE: Remove the specified resource.
 Parts of HTTP: Each HTTP message, whether a request or a response, consists of headers and an optional body.
 HTTP Request
o Request Line: Specifies the HTTP method (GET, POST, etc.), the requested URL, and the HTTP version. Exists only in HTTP requests.
o Body (optional): Contains data sent by the client to the server,
typically used with methods like POST or PUT. For form submissions,
the body includes key-value pairs of form fields and their
corresponding values.
o Headers: Provide additional information about the request, such as
the content type (Content-Type), content length (Content-Length), etc.
Some common request headers:
 Host: Specifies the host and port number of the server.
 User-Agent: Identifies the client making the request (e.g.,
browser type and version).
 Accept: Indicates the media types that are acceptable for the
response.
 Content-Type: Specifies the media type of the body of the
request.
 HTTP Response
o Status Line: Specifies the HTTP version, a status code indicating the result of the request (e.g., 200 OK, 404 Not Found), and a reason phrase. Exists only in HTTP responses.
 Status codes are three-digit numbers included in HTTP
responses to indicate the result of the request.
o Body: Contains the requested resource (HTML page, image, JSON
data, etc.) or an error message.
o Headers: Provide additional information about the response. Some
common response headers:
 Content-Type: Specifies the media type of the body of the
response (e.g., text/html, application/json).
 Content-Length: Indicates the size of the body of the response
in bytes.
 Server: Specifies information about the server software handling
the request.
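To see the request line, headers, and status line on the wire, here is a minimal Python sketch that sends a hand-written HTTP/1.1 request over a plain TCP socket and prints the raw response; it assumes example.com is reachable on port 80.

```python
import socket

request = (
    "GET / HTTP/1.1\r\n"     # request line: method, URL, HTTP version
    "Host: example.com\r\n"  # mandatory request header in HTTP/1.1
    "Connection: close\r\n"  # ask the server to close after responding
    "\r\n"                   # blank line terminates the headers
)

with socket.create_connection(("example.com", 80)) as sock:
    sock.sendall(request.encode())
    response = b""
    while chunk := sock.recv(4096):
        response += chunk

# status line and response headers come first, then a blank line, then the body
print(response.decode(errors="replace")[:400])
```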
 Common Status Codes
o 1xx Informational Responses: These status codes indicate that the
server has received the request and is processing it, but the process is
not yet complete.
o 2xx Success: These status codes indicate that the request was
received, understood, and successfully processed by the server.
 200 OK: The request has succeeded.
o 3xx Redirection: These status codes indicate that further action
needs to be taken by the client to complete the request.
o 4xx Client Error: These status codes indicate that there was an error
in the client's request, and the server cannot fulfill it.
 404 Not Found: The server cannot find the requested resource.
o 5xx Server Error: These status codes indicate that the server
encountered an error while processing the request.
 503 Service Unavailable: The server is currently unable to
handle the request due to temporary overloading or maintenance
of the server.
HTTPS
 HTTPS is an extension of HTTP that adds a layer of security
through encryption. It uses SSL/TLS protocols to encrypt data before
transmitting it.
 HTTPS ensures secure communication between a client (web browser)
and a server. It encrypts data during transmission, preventing attackers
from eavesdropping on or tampering with the communication.
 By encrypting data, HTTPS provides confidentiality of the transmitted
data. It also authenticates the server, ensuring that the client is
communicating with the intended website.
 HTTPS is essential for websites that handle sensitive information, such as
login credentials, payment details, and personal information.
 HTTPS is essentially HTTP over SSL/TLS. It uses the same basic protocols
and functions as HTTP but adds encryption capabilities.
 HTTP typically uses port 80 for communication, while HTTPS uses port
443.
 To make a web server use HTTPS instead of HTTP, you need to obtain an
SSL/TLS certificate and configure your web server to use this certificate
for secure connections.
 In practice, we obtain a certificate (purchased from a certificate authority or issued free by one such as Let’s Encrypt), install it on the server, and then add a redirect from HTTP to HTTPS in the server configuration, and we are good to go.
SSL/TLS
 SSL and its successor TLS are protocols designed to secure
communication over a network, typically between a client (such as a web
browser) and a server (web server).
 SSL/TLS protocols encrypt data transmitted between clients and servers,
protecting it from eavesdropping and tampering.
 SSL/TLS are widely used in web browsing (HTTPS), email (SMTPS, POP3S,
IMAPS), VPNs (Virtual Private Networks), and other applications where
secure communication is essential.
 SSL (Secure Sockets Layer) has historically been a widely used protocol
for securing communication over the internet. However, due to
vulnerabilities found in older versions of SSL (such as SSLv2 and SSLv3),
it has been largely deprecated in favor of its successor, TLS (Transport
Layer Security).
 How SSL/TLS works:
 SSL/TLS encryption is a collaborative process where both the client
(browser) and server play active roles in establishing a secure HTTPS
connection.
 The client initiates the connection, verifies the server’s identity using its
digital certificate, and negotiates a secure session key exchange.
 The server responds, provides its certificate, and participates in the key
exchange process.
 Once the secure session is established, all data transmitted between the
client and server is encrypted and decrypted using the agreed-upon
session key.
 This dual implementation ensures secure communication over the
internet, protecting sensitive information from interception and
tampering.
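A minimal Python sketch of that handshake from the client side, using the standard library's ssl module; wrap_socket performs the certificate verification and session-key negotiation described above before any application data flows. It assumes example.com:443 is reachable.

```python
import socket
import ssl

context = ssl.create_default_context()  # loads trusted CAs, verifies certs

with socket.create_connection(("example.com", 443)) as raw:
    with context.wrap_socket(raw, server_hostname="example.com") as tls:
        print(tls.version())                 # e.g., 'TLSv1.3'
        print(tls.getpeercert()["subject"])  # identity the certificate proves
```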
SSH
 SSH stands for Secure Shell. It is a network protocol that allows secure
access to a remote computer or server over an unsecured network.
 A shell refers to a software interface that allows users to interact with
the operating system (OS) or execute commands.
 A shell provides a command-line interface (CLI) or sometimes a graphical
user interface (GUI) where users can type commands to instruct the
operating system to perform tasks.
 SSH is primarily used for secure command-line access and remote
administration of computers or servers. It provides a secure shell
environment for executing commands and managing remote
systems.
 SSH uses its own protocol suite, including SSH-1 and SSH-2, which define
how secure communication, authentication, and encryption are handled
between a client and a server.
WebSockets
 WebSockets and HTTP are both protocols used for communication over
the internet, but they serve different purposes and have different
characteristics.
 HTTP
o HTTP is a request-response protocol, meaning a client (such as a web
browser) sends a request to a server for a resource (like a web page),
and the server responds with the requested resource.
o The connection between client and server is closed after each
request-response cycle, unless explicitly kept alive using
techniques like HTTP keep-alive.
o In the typical HTTP request-response model, the server cannot
initiate communication with a client without the client first
sending a request. This is because HTTP is inherently a client-
initiated protocol where the client (such as a web browser or a
mobile app) sends a request to the server, and the server responds
with the requested data.
 WebSockets
o WebSockets is a protocol providing full-duplex communication
channels over a single TCP connection. It enables bidirectional
communication between a client and a server in real-time.
o Both client and server can send messages to each other
simultaneously.
o Unlike HTTP, WebSockets maintain a persistent connection between
the client and server once established. This allows for low-latency
communication and real-time updates.
o WebSockets are commonly used in applications that require real-
time updates or interactive features, such as chat
applications, online gaming, collaborative editing tools, and
live streaming.
Example to explain why we need WebSockets
 Let’s take an example of a chat application involving John and Mary:
 Using HTTP
 John sends a message to Mary. In the backend, a POST request is sent to the server, and the message contains details like the chat room name, sender, receiver, and message payload.
 The server processes the request and sends a response back to John indicating success (200 OK). The server also stores the message in the database so it can be retrieved later.

 But the question now is: how does Mary’s app know that there is a message from John on the server? HTTP is client-initiated, and no connection has been established from Mary to the server, so how does she learn there is a message for her?
 One way to solve this is polling (also called short polling).
 Polling involves clients repeatedly making requests to the server at
regular intervals to check for new messages. Here, Mary polls the server
for new messages every few seconds:
 Mary polls the server for new messages by sending GET requests. When a poll arrives, the server sends John’s message to Mary’s app in the response.
 If there are no new messages, the server responds with an empty array ([]). Mary continues polling periodically.
 Now there are various issues with this approach (a sketch of the polling loop follows this list):
o Network and Server Load: Frequent polling consumes bandwidth and server resources, even when there are no updates. As the userbase grows, this can cause stability issues for the server.
o Latency: Messages are delayed until the next polling interval. E.g., if Mary’s app polls the server every 10 seconds, then for up to 10 seconds she won’t receive John’s message.
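Here is the sketch referenced above: Mary's short-polling loop in minimal Python, assuming a hypothetical GET /messages endpoint that returns a JSON array (empty when nothing is new). Only the standard library is used.

```python
import json
import time
import urllib.request

POLL_INTERVAL = 10  # seconds; a message can sit unseen for up to this long

def poll_forever(base_url: str, user: str) -> None:
    while True:
        # hypothetical endpoint returning a JSON array of new messages
        with urllib.request.urlopen(f"{base_url}/messages?to={user}") as resp:
            for msg in json.load(resp):
                print("new message:", msg)
        time.sleep(POLL_INTERVAL)  # wasted round trips when nothing is new
```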
 The next improvement on this is long polling.
 The client initiates a request to the server, which keeps the connection open until new data is available (and returns it in the response) or a timeout occurs.
 Chat App: Mary initiates a long-poll request (/long-poll), holding the connection open until the server responds or times out.
 This approach also has issues:
o It reduces the frequency of requests to the server, but the client still has to re-poll after each timeout.
o It is also more complex to implement in the backend.
o Handling long-lived connections can strain server resources.
 The solution for this is using WebSockets.
 Flow:
 The client establishes a WebSocket connection with the server (the handshake starts as an HTTP request that is upgraded to the WebSocket protocol).
 Both the client and server can send and receive messages independently over the established connection; it is not client-initiated the way HTTP is.
 The WebSocket connection remains open as long as needed, allowing
for continuous, real-time communication.
 Ping/Pong Frames: The protocol supports ping/pong frames to keep
the connection alive and check for liveness.
 Connection Closure:
o Either the client or the server can initiate the closing of the
connection.
o Close Frame: A close frame is sent to signal the intention to close the
connection.
o The connection is closed gracefully after both parties acknowledge the
close frame.
 Chat App:
o Client to Server: John sends a message "Hello Mary!" to the chat
server.
o Server to Client: The server forwards the message to Mary in real-
time.
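A minimal sketch of that chat relay, assuming the third-party websockets package (pip install websockets; handler signatures vary slightly between versions). The server pushes each incoming message to every other connected client as soon as it arrives, with no polling involved.

```python
import asyncio
import websockets

clients = set()  # all open connections (e.g., John's and Mary's)

async def chat(websocket):
    clients.add(websocket)
    try:
        async for message in websocket:        # John sends "Hello Mary!"
            for other in clients - {websocket}:
                await other.send(message)      # pushed to Mary in real time
    finally:
        clients.discard(websocket)

async def main():
    async with websockets.serve(chat, "localhost", 8765):
        await asyncio.Future()                 # run until cancelled

asyncio.run(main())
```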
WebHooks
 Webhooks are a way for applications to communicate with each other
automatically.
 They allow one system to send real-time data to another whenever a
specific event occurs.
 This is done by making an HTTP request to a predefined URL with a
payload of information about the event.
 Webhooks typically facilitate communication between two services or
servers, not between a client (e.g., a user's browser) and a server.
 Key Components of Webhooks
o Webhook Endpoint: The URL in the target system where the HTTP
requests are sent.
o Payload: The data sent to the webhook endpoint, usually in JSON
format.
o Event Types: Specific events that trigger the webhook (e.g., new user
registration, payment received).
 Benefits of Webhooks
o Real-Time Updates and Efficiency: Receive data in real time as
events occur, without needing to poll for updates. Reduce the need for
constant checking for new data, saving resources and bandwidth.
o Automation: Automate workflows and integrate different systems
seamlessly.
How Webhooks Work
 Setting Up Webhook Endpoint:
o In your e-commerce platform, you define a webhook endpoint where
you want to receive notifications from the payment gateway. This
endpoint is typically a URL handled by your application's backend.
 Configuring the Payment Gateway:
o In the settings of the payment gateway (e.g., Stripe, PayPal), you
specify this webhook endpoint URL. You also configure which events
should trigger these notifications (e.g., successful payment,
chargeback).
 Triggering the Webhook:
o When the configured event occurs (e.g., a successful payment), the
payment gateway automatically sends an HTTP POST request to the
webhook endpoint URL specified in your e-commerce platform.
 Handling the Webhook in Your Application:
o Your application receives the incoming HTTP POST request at the
defined webhook endpoint.
o The payload of the request typically contains information about the
event (e.g., payment details, event type).
o Your application processes this information (e.g., updates the order
status, sends a confirmation email to the customer).
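A minimal sketch of such a webhook endpoint, assuming Flask and a hypothetical payload shape; real gateways such as Stripe also sign each request, and that signature should be verified before the event is trusted.

```python
from flask import Flask, request

app = Flask(__name__)

@app.route("/webhooks/payments", methods=["POST"])
def payment_webhook():
    event = request.get_json(force=True)      # payload sent by the gateway
    if event.get("type") == "payment.succeeded":
        order_id = event["data"]["order_id"]  # hypothetical field names
        print(f"mark order {order_id} as paid; send confirmation email")
    return "", 204                            # acknowledge quickly
```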
Webhooks vs WebSockets
 Webhooks are suitable for event-driven, one-way notifications between
services or servers, typically used to trigger actions in another system
(e.g., payment notifications, updates, CI/CD pipelines).
 WebSockets are ideal for real-time, two-way communication where low
latency and continuous data exchange are required (e.g., chat
applications, live data feeds, real-time collaboration).
Proxy, Reverse Proxy, and Forward Proxy
 A proxy server serves as an intermediary between two systems, and it
abstracts out the complexities involved in direct communication.
 A proxy can be implemented as software, hardware, or a combination of
both, depending on the requirements and scale of the deployment.
 Most proxies are implemented as software running on general-purpose
hardware. This is the most flexible and common approach.
 There are two types of proxies:
Forward Proxy
 A forward proxy acts as an intermediary for clients: it sits between the clients and the internet, and it performs various functions such as filtering requests, caching data, and hiding the client’s IP address.
 Here’s how it works:
 A client (like a web browser) requests access to a resource (like a
website).
 The request is sent to the forward proxy server instead of directly to the
target server.
 The proxy server evaluates the request based on its rules (e.g., access
control, caching policies).
 The proxy server forwards the request to the target server.
 The target server sends the response back to the proxy server, which
then sends it to the client.
 Use-Cases:
o Privacy and Anonymity: Hides the client’s IP address from the
target server. Here are some key reasons why this might be desirable:
 This is particularly important for users who want to protect their
identity and location from being tracked by websites and online
services.
 By masking the client’s IP address, users can access content that
is otherwise unavailable in their region.
 Using a proxy server hides the client’s IP address, reducing the
risk of direct attacks.
o Policies and Access Control: Restricts access to certain websites based on organizational policies, e.g., in schools and companies. Blocks access to inappropriate or harmful content.
o Caching: Stores copies of frequently accessed resources to reduce bandwidth usage and improve response times. If the upstream connection is down, the forward proxy can still serve cached data to clients.
Reverse Proxy
 A reverse proxy sits in front of one or more web servers and forwards
client requests to those servers. It appears to clients as if they are
interacting directly with the web servers.
 Here’s how it works:
 A client sends a request for a resource.
 The request goes to the reverse proxy server.
 The reverse proxy decides which backend server should handle the
request.
 The reverse proxy forwards the request to the selected backend server.
 The backend server processes the request and sends the response back
to the reverse proxy, which then forwards it to the client.
 Use-Cases
o Load Balancing: Distributes incoming traffic across multiple servers
to ensure no single server is overwhelmed.
o Routing: Routes incoming requests to the appropriate service or server, e.g., if the request path starts with /auth, send it to the authentication service (see the sketch at the end of this section).
o Security: Acts as a barrier, protecting backend servers from direct
exposure to the internet, and can help mitigate attacks like DDoS.
o SSL Termination: Handles SSL encryption/decryption to reduce the
processing load on backend servers.
o Caching: Stores responses from backend servers to speed up
responses to similar future requests.
o Abstraction: It abstracts away elasticity (auto-scaling) and becomes the single point of contact for users. E.g., we can add or remove server instances in the backend, but users don’t know about it; they just communicate with the reverse proxy.
 Examples of reverse proxies include load balancers, API gateways, and database proxies.
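A minimal sketch of the path-based routing a reverse proxy performs (the /auth example above); the backend addresses are hypothetical, and real deployments would use something like nginx, HAProxy, or Envoy rather than hand-rolled code.

```python
ROUTES = {
    "/auth": "http://auth-service:8000",
    "/orders": "http://order-service:8001",
}
DEFAULT_BACKEND = "http://web-app:8002"

def pick_backend(path: str) -> str:
    for prefix, backend in ROUTES.items():
        if path.startswith(prefix):
            return backend        # forward the request to this service
    return DEFAULT_BACKEND

print(pick_backend("/auth/login"))  # http://auth-service:8000
```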
Database Proxy
 A database proxy is a type of reverse proxy and an intermediary
between database clients (web servers or applications) and the database
server.
 It takes requests from clients like web applications and forwards them to
a database server using configurations that are specific to databases.
 It provides various benefits such as load balancing, connection pooling,
caching, security, and high availability.
 Implementing a database proxy can help optimize database performance
and scalability.
 Benefits of a Database Proxy
o Load Balancing: Distributes client queries across multiple database
servers to ensure no single server is overwhelmed.
o Read/Write Splitting: Directs read queries to replicas and write
queries to the master database, optimizing resource utilization.
o High Availability: Provides failover capabilities to ensure database
availability even if some instances go down.
o Connection Pooling: Manages a pool of database connections to
reduce the overhead of establishing connections.
o Query Caching: Stores frequently executed queries in memory to
speed up response times.
o Security: Acts as a firewall, filtering and controlling access to the
database.
 Some of the popular DB proxy solutions are ProxySQL and PgBouncer.
 ProxySQL is a high-performance database proxy for MySQL and
MariaDB, designed to handle large traffic loads with various features like
query routing, load balancing, and caching.
CAP Theorem
 The CAP theorem is primarily applicable to distributed database
systems where we have multiple nodes or replicas of database
nodes. However, its principles can also extend to other types of
distributed systems, such as distributed computing systems, distributed
file systems, and any system that involves coordination among multiple
networked nodes.
 The CAP theorem is a concept in computer science that explains the
trade-offs between consistency, availability, and partition tolerance in
distributed systems.
 Consistency refers to the property of a system where all nodes have a
consistent view of the data. It means all clients see the same data at the
same time no matter which node they connect to.
 Availability refers to the ability of a system to respond to requests from
users at all times.
 Partition tolerance refers to the ability of a system to continue
operating even if there is a network partition.
But what is a network partition?
 A network partition happens when nodes in a distributed system are
unable to communicate with each other due to network failures.
 When there is a network partition, a system must choose between
consistency and availability.
 If the system prioritizes consistency, it may become unavailable until the
partition is resolved.
 If the system prioritizes availability, it may allow updates to the data.
This could result in data inconsistencies until the partition is resolved.
 Example to Explain CAP Theorem
 Let's say we have a tiny bank with two ATMs connected over a network.
The ATMs support three operations: deposit, withdraw, and check
balance.
 No matter what happens, the balance should never go below zero.
 There is no central database to keep the account balance. It is stored on
both ATMs.
 When a customer uses an ATM, the balance is updated on both ATMs
over the network. This ensures that the ATMs have a consistent view of
the account balance.
 If there is a network partition and the ATMs are unable to
communicate with each other, the system must choose between
consistency and availability:
o If the bank prioritizes consistency, the ATM may refuse to process
deposits or withdrawals until the partition is resolved. This ensures
that the balance remains consistent, but the system is unavailable to
customers.
o If the bank prioritizes availability, the ATM may allow deposits and
withdrawals to occur, but the balance may become inconsistent until
the partition is resolved. This allows the system to remain available to
users, but at the cost of data consistency. The preference for
availability could be costly to the bank. When there is a network
partition, the customer could withdraw the entire balance from both
ATMs. When the network comes back online, the inconsistency is
resolved and now the balance is negative. That is not good.
 Another Example - Social Media Platform
 During a network partition, if two users are commenting on the same
post at the same time, one user's comment may not be visible to the
other user until the partition is resolved.
 Alternatively, if the platform prioritizes consistency, the commenting
feature may be unavailable to users until the partition is resolved.
 For a social network, it is often acceptable to prioritize availability at the
cost of users seeing slightly different views some of the time.
 The CAP theorem is simple, but the real world is not
 The CAP theorem may sound very simple, but the real world is messy.
 As with many things in software engineering, it is all about tradeoffs, and
the choices are not always so black and white.
 The CAP theorem assumes 100% availability or 100% consistency.
 In the real world, there are degrees of consistency and availability that
distributed system designers must carefully consider. This is where the
simplistic model of the CAP theorem could be misleading.
 Back to the bank example, during a network partition, the ATM could
allow only balance inquiries to be processed, while deposits or
withdrawals are blocked.
 Alternatively, the bank could implement a hybrid approach. For example,
the ATM could allow balance inquiries and small withdrawals to be
processed during a partition, but block large withdrawals or deposits
until the partition is resolved.
 It is worth noting that in the real world, resolving conflicts after a
network partition could get very messy.
 The bank example above is simple to fix. In real life, the data structures
involved could be complex and challenging to reconcile.
 A good example of a complex data structure is Google Docs. Resolving
conflicting updates could be tricky.
 So, is the CAP theorem useful?
 Yes, it is a useful tool to help us think through the high-level trade-offs to
consider when there is a network partition.
 But it does not provide a complete picture of the trade-offs to consider when designing a well-rounded distributed system, or of what happens when there are no network partitions.
PACELC Theorem
 In distributed database system, the PACELC theorem is an extension to
the CAP theorem.
 It states that in case of network partitioning (P) in a distributed computer
system, one has to choose between availability (A) and consistency (C)
(as per the CAP theorem), but else (E), even when the system is running
normally in the absence of partitions, one has to choose between latency
(L) and loss of consistency (C).
 Both theorems describe how distributed databases have
limitations and tradeoffs regarding consistency, availability, and
partition tolerance.
 PACELC goes further and states that an additional trade-off exists:
between latency and loss of consistency, even in absence of
partitions.
 - Network Latency: Latency refers to the time it takes for a request to
travel from the client to the server and back. In distributed systems,
latency can vary significantly due to factors such as network congestion,
distance between nodes, and processing time.
 - Consistency Requirements: Consistency in a distributed system
ensures that all nodes have the same view of data at the same time.
 Challenges in Achieving Both:
o High Latency for Strong Consistency: Achieving strong
consistency (C) often requires waiting for acknowledgments or
synchronization across multiple nodes. This can increase latency, as
each operation may need to be coordinated or serialized to ensure
data integrity.
o Low Latency for Weak Consistency: Opting for weaker consistency
models (e.g., eventual consistency) can reduce latency, as nodes can
respond quickly without waiting for full synchronization. However, this
may lead to temporarily inconsistent states visible to different nodes.
Strong Consistency vs Eventual Consistency
 Definition:
 - Strong Consistency: Strong consistency is a property in distributed
systems that ensures that all nodes in the system see the same data at
the same time, regardless of which node they are accessing. In other
words, when a write operation is performed, all subsequent read
operations from any node will return the most recent write value.
 - Eventual Consistency: Over time, the system converges towards
consistency, but during the transient period, users accessing different
data centers may observe different versions of the data. This is the
characteristic behavior of eventual consistency.
 Data Accuracy:
 - Strong Consistency: Ensures that all nodes see the same data at the
same time, guaranteeing immediate data accuracy and integrity.
 - Eventual Consistency: Temporarily allows nodes to be inconsistent,
which may result in stale data being read until convergence occurs.
 Performance:
 - Strong Consistency: Achieving strong consistency often involves
increased coordination and communication among nodes, leading to
higher latency for read and write operations.
 - Eventual Consistency: The asynchrony of write propagation and
reduced coordination overhead allows for lower latency and higher
throughput for read and write operations.
 Use Cases:
 - Strong Consistency: Best suited for scenarios where data integrity
and consistency are critical, such as financial systems, e-commerce
platforms, and critical business applications.
 - Eventual Consistency: Well-suited for applications where real-time
consistency is not vital and where system availability and scalability are
more important, such as social media platforms, content distribution
networks, and collaborative systems.
 Note: Choosing between strong and eventual consistency depends on
the specific needs of the application and its users. Some systems may
adopt a hybrid approach, selectively applying strong consistency to
certain critical data and eventual consistency to less critical or non-
critical data, striking a balance between data accuracy, performance,
and availability. The decision requires careful consideration of the
tradeoffs to meet the desired requirements and constraints of the
distributed system.

Database Scaling
 Scaling databases is a crucial part of ensuring that your application can
handle increasing loads and growing datasets.
 There are several strategies for scaling databases, each with its own
advantages and use cases.
 Here’s an overview of the most common approaches:
1. Caching: Caching involves storing frequently accessed data in memory to
reduce database load. Common caching solutions include Redis and
Memcached. It greatly improves read performance.
2. NoSQL Databases: Switching to NoSQL databases like MongoDB,
Cassandra, or DynamoDB can be a good solution for certain use cases,
especially where scalability is a primary concern. NoSQL databases are
designed to scale out by distributing data across multiple servers. NoSQL
databases are built with high availability and fault tolerance in mind. NoSQL
databases offer flexible schema designs that can handle various data types
and structures.
3. Vertical Scaling (Scaling Up): Vertical scaling involves adding more
resources (CPU, RAM, storage) to your existing database server. It is easier
to implement and manage. Also, no changes to application logic required.
4. Horizontal Scaling (Scaling Out): Horizontal scaling involves adding
more database servers to handle increased load. There are different
approaches within horizontal scaling:
1) Replication: Replication involves copying data from one database
server to others. There are various types of replication, such as
master-slave, master-master, and multi-master. Increases data
availability and redundancy. It can improve read performance.
Read/Write Splitting: Read/write splitting involves using master-
slave replication, where the master handles all writes and one or more
slaves handle reads. Offloads read operations to slaves, reducing load
on the master. Slaves can serve as backups.
2) Sharding: Sharding involves splitting your database into smaller,
more manageable pieces, called shards. Each shard is a separate
database that holds a subset of the data. This method helps balance
the load and allows for easy scaling by adding more shards.
3) Load Balancing: Uses a proxy layer to distribute requests to different
database servers based on load and availability.
Replication
 Data Replication is the process of storing data in more than one site or
node.
 It is simply copying data from a database from one server to another
server.
Why do we need replication?
 Reduced Network Latency:
o Replication helps reduce network latency by keeping data close to the
user's geographic location, improving the speed at which data is
accessed.
o This is particularly useful for global applications, where users are
spread across different regions.
 Improved Availability and Fault Tolerance:
o Replication enhances system availability by ensuring that if one server
goes down, others can take over, minimizing downtime.
o This is critical for businesses that need to maintain continuous
operations and avoid financial losses and customer trust issues due to
downtime.
 Increased Data Throughput:
o Replication can handle higher data throughput by distributing read
and write operations across multiple servers.
o This scalability ensures that the system can manage a large number of
transactions per second (TPS) or queries per second (QPS),
maintaining a high quality of service (QoS) even under heavy load.
 Load Balancing:
o Replication helps distribute the load evenly across multiple servers,
preventing any single server from becoming a bottleneck.
o This improves overall system performance and reliability.
 Disaster Recovery:
o Replicated databases provide a reliable backup in case of data
corruption or hardware failure on the primary server.
o This ensures that data can be quickly restored and operations can
continue with minimal disruption.
 Maintenance and Upgrades:
o Replication allows for easier maintenance and upgrades by enabling
operations on one server while others continue to handle user
requests.
o This reduces downtime and ensures a smoother update process.
Types of algorithms for implementing Database Replication?
1. Single Leader Replication
2. Multi-Leader Replication
3. Leaderless Replication
Single Leader Replication
 So, in leader-based architecture, client (application server) write requests are sent to the leader DB first, and the leader then sends the data changes to all of its followers.
 Whenever a client wants to read data from the database, it can query either the leader or any of the followers (yes, there is generally more than one follower to make the system highly available). However, writes to the database are accepted only on the leader.
 Now, whenever a follower dies, our application is not impacted, as the data does not live on just a single node. The application can read from the other followers as well, and this makes our system highly read-scalable.
 Two types of Replication Strategy:
o Synchronous replication strategy: In this strategy, followers are
guaranteed to have an up-to-date copy of data (Strong Consistency)
which is an advantage.
o But one of its biggest disadvantages is that it blocks the client until the leader receives an OK response from all the followers. If you have a very read-heavy system like Facebook with thousands of follower nodes, waiting for data to be replicated to each node makes for a poor user experience.
o Asynchronous replication strategy: The changes to master are
sent to the slave databases asynchronously. This means that the
master does not wait for the slaves to acknowledge the changes
before confirming the write operation to the client.
o Disadvantage: There is a time lag (replication lag) between when a
change is made on the master and when it is applied on the slaves.
During this lag, the slaves may serve outdated data for read requests.
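A minimal sketch of client-side read/write splitting under single-leader replication, routing writes to the leader and reads to the followers; the host names are hypothetical, and with asynchronous replication a follower read may briefly return stale data (replication lag).

```python
import random

LEADER = "db-leader:5432"
FOLLOWERS = ["db-follower-1:5432", "db-follower-2:5432"]

def route(query: str) -> str:
    # writes must go to the leader; reads can be served by any follower
    if query.lstrip().upper().startswith(("INSERT", "UPDATE", "DELETE")):
        return LEADER
    return random.choice(FOLLOWERS)

print(route("SELECT * FROM users"))                 # one of the followers
print(route("UPDATE users SET name = 'x' WHERE id = 1"))  # always the leader
```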
Handling Node outages in replication
 So, there are two scenarios as mentioned below:
 Follower failure: In the scenario of follower failure, we can use a
strategy called Catchup recovery. In this strategy, the follower (which
got disconnected) can connect to the leader again and request all data
changes that occur when the follower was disconnected.
 Leader failure: Now in the scenario of Leader failure, we can use a
strategy called Failover. In this strategy, one of the followers needs to be
promoted as a new leader.
 Leader election typically uses a voting scheme called a consensus algorithm. In layman’s terms, a quorum of followers votes on which follower should be promoted to leader.
Issues in Single Leader based Replication
 Write Scalability: Since all write operations must go through the
leader, it can become a bottleneck, limiting the overall write throughput
of the system. As the number of write requests increases, the leader may
struggle to handle the load, leading to performance degradation.
 Write Latency: In geographically distributed systems, write operations
can suffer from high latency because every write must be sent to the
leader, which may be located far from the user initiating the write. This
can slow down the perceived performance for users located far from the
leader.
 Loss of Latest Changes: If the data center containing the leader fails
and the latest changes were not replicated to all the followers, those
changes may be permanently lost. This can lead to data inconsistency
and potential data loss.
 Failover Complexity: Promoting a follower to become the new leader is
necessary, but it introduces complexity.
Multi-Leader Replication
 In multi-leader replication, we have one leader in each data center, and each data center’s leader replicates its changes to the leaders in the other data centers asynchronously.
Advantages of Multi-Leader Replication
 Better Performance as compared to Single leader replication as we have
now reduced both Read & Write Latency of our application
 High Fault Tolerance as each data center can continue operating
independently of others if any data center goes down. This is possible
because each data center has its leader. Also, replication catches up
when the failed datacenter comes back online
 Moreover, if one data center goes down in one particular geographic
area then temporarily, we can route the requests from that geographic
area to some other healthy data center in another geographic area till
that unhealthy data center becomes healthy. Yes, there is a trade-off
between Performance and High Availability here.
Disadvantages
 In a multi-leader system, writes can happen on different leaders at the
same time.
 This can lead to conflicts because different leaders might receive and
apply write operations in different orders.
Resolution
 Last Write Wins (LWW): In this approach, the system keeps the most
recent write based on timestamps. While simple, it can lead to data loss
as it discards other writes.
 Version Vectors: Each data item carries a version vector that tracks the
version history. When conflicts arise, the system can merge changes
based on the version history.
Leaderless Replication
 Leaderless replication (also known as peer-to-peer replication) does
not designate a single leader for write operations. Instead, any node can
accept write operations, and data is replicated among all nodes.
 Advantages:
o Fault Tolerance: High fault tolerance as there is no single point of
failure.
o Scalability: Highly scalable since any node can handle read and write
operations.
Quorums in Leaderless Replication
 Definition: Quorum-based replication involves a majority voting
mechanism to decide the success of read and write operations.
 The system uses a quorum to determine if an operation has been
successfully applied.
 A quorum is a subset of nodes that must respond positively for an
operation to be considered successful.
 Read and Write Quorums (R and W)
o Read Quorum (R): The minimum number of nodes that must
respond to a read request to consider it successful.
o Write Quorum (W): The minimum number of nodes that must
acknowledge a write request for it to be considered successful.
o The values of R and W are chosen such that R + W > N, where N is the
total number of nodes.
 How Quorums Work?
 Write Operations:
o A write operation is sent to multiple nodes.
o The operation is considered successful if W nodes acknowledge the
write.
o Once the write quorum is met, the system can return success to the
client, even if not all nodes have acknowledged the write immediately.
 Read Operations:
o A read operation is sent to multiple nodes.
o The operation is considered successful if R nodes respond.
o The system can then return the most recent data based on the
responses from these R nodes.
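A tiny worked example of the R + W > N rule, assuming N = 3 replicas: with W = 2 and R = 2, every read quorum overlaps every write quorum in at least one node, so each read is guaranteed to see at least one up-to-date copy.

```python
N, W, R = 3, 2, 2
assert R + W > N  # quorum overlap is guaranteed

def write_ok(acks: int) -> bool:
    return acks >= W     # success once 2 of 3 replicas acknowledge

def read_ok(replies: int) -> bool:
    return replies >= R  # success once 2 of 3 replicas reply

print(write_ok(2), read_ok(1))  # True False
```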
 Benefits of Using Quorums
 This approach improves availability, scalability, and fault tolerance while
providing strong consistency guarantees.
 Amazon DynamoDB and Cassandra use a quorum-based approach.
Database Sharding
 Sharding involves splitting a database into smaller, horizontally
partitioned pieces called shards, where each shard is a separate
database.
 Each shard can be stored on a different server or cluster, enabling the
system to handle more load and improve performance.
Need for Sharding:
 Consider a very large database that has not been sharded. For example, take the database of a college in which all student records (present and past) for the whole college are maintained in a single database, containing, say, 100,000 records.
 Now, when we need to find a student in this database, each lookup may have to scan up to 100,000 records, which is very costly.
 Now consider the same college student records divided into smaller shards based on year. Each shard will hold only around 1,000–5,000 student records. Not only does the database become much more manageable, but the cost of each lookup is also reduced by a huge factor; this is what sharding achieves.
Benefits of Sharding
 Horizontal Scalability: Sharding allows you to distribute data across
multiple servers, which means you can handle increased load by simply
adding more servers.
 Improved Performance: By distributing the data, each server handles
a smaller portion, reducing the load and improving query performance.
 Reduced Impact of Failures: If one shard fails, it affects only a subset
of the data and users, not the entire database.
 Easier Maintenance: Maintenance tasks can be performed on
individual shards without impacting the entire system.
 Lower Latency: Shards can be placed in different geographic locations
to reduce latency for users in different regions.
 Backup and Recovery: Shards can be backed up and restored
independently, making the processes faster and more efficient.
Disadvantages of Sharding
 Data Distribution Logic: Implementing sharding logic requires careful
planning and additional coding in the application to handle data
distribution.
 Cross-Shard Queries: Queries that span multiple shards are more
complex and can be less efficient.
 Management and Monitoring: Sharding requires monitoring multiple
database instances, which increases administrative overhead.
 Rebalancing: Rebalancing involves redistributing data across shards
when the load is uneven or when adding new shards. This ensures even
load distribution and optimal performance. When adding new shards or
redistributing data, rebalancing can be a resource-intensive and time-
consuming process.
When to Shard a Database
 High Data Volume: Large Databases: When the size of the database
exceeds the storage capacity or performance limits of a single server,
sharding becomes necessary.
 Increased Traffic: Applications with a high number of read and write
operations can benefit from sharding to distribute the load across
multiple servers.
 Global User Base: Applications with users spread across different
geographic regions can use sharding to reduce latency by placing data
closer to users.
Example Scenario
 When to Shard: A social media platform with millions of users and high
traffic might need to shard its user data. By sharding the database based
on user ID, the platform can distribute the load across multiple servers,
improving performance and scalability.
 When Not to Shard: A small e-commerce website with a few thousand
products and moderate traffic may not need sharding. A single database
instance can handle the load efficiently, and sharding would introduce
unnecessary complexity.
Shard Key (Partition Key)
 A shard key is a specific column or set of columns in a database table
that is used to determine how data is distributed across multiple shards
in a sharded database architecture.
 It should ensure even data distribution, minimize cross-shard queries,
and align with the query patterns.
 The shard key should ensure that data is evenly distributed across all
shards to avoid hot spots where some shards become overloaded while
others remain underutilized.
 The shard key should align with the most common query patterns to
minimize the need for cross-shard queries, which can be more complex
and less efficient.
 Common shard keys include user ID, geographic region, or a hash of a
key field.
Sharding Strategies Based on Shard Key
 Key-Based (Hash) Sharding:
o Uses a hash function on the shard key to distribute data.
o Shard Key: User ID, Product ID, Email Address (hashed).
o Example: shard_id = hash(user_id) % number_of_shards.
 Range-Based Sharding:
o Distributes data based on ranges of shard key values.
o Shard Key: Transaction Date, Salary Range, Alphabetical Range of
Last Names.
o Example: Shard 1 for dates 2023-01-01 to 2023-03-31, Shard 2 for
2023-04-01 to 2023-06-30.
 Directory-Based Sharding:
o Directory-based sharding uses a lookup table or directory to map each
shard key to its corresponding shard. The directory service determines
which shard a piece of data belongs to.
o This directory can be updated dynamically, allowing for more fine-
grained control over data distribution. You can easily move data
between shards by updating the directory without needing to
rehash or redefine ranges.
o However, it also requires maintaining an up-to-date directory, which
adds some complexity to the system.
o Shard Key: User ID (mapped via lookup table), Product Category,
Region Code.
o Example: Directory service maps User ID 1-1000 to Shard 1, User ID
1001-2000 to Shard 2.
 Geographic (Location-Based) Sharding:
o Shards data based on geographic location.
o Shard Key: Country Code, City Name, IP Address Range (geo-
located).
o Example: Users from North America -> Shard 1, Europe -> Shard 2.
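 To make these strategies concrete, here is a minimal Python sketch of shard routing; the shard count, date cutoffs, ID ranges, and function names are illustrative assumptions, not from any particular system:

import hashlib

NUM_SHARDS = 4  # assumed shard count, for illustration only

def key_based_shard(user_id: int) -> int:
    # Key-based (hash) sharding: hash the shard key, then mod by shard count.
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def range_based_shard(transaction_date: str) -> int:
    # Range-based sharding: route by which date range the key falls into.
    return 0 if transaction_date <= "2023-03-31" else 1  # simplified to two ranges

DIRECTORY = {range(1, 1001): 0, range(1001, 2001): 1}  # directory-based lookup table

def directory_based_shard(user_id: int) -> int:
    # Directory-based sharding: consult a lookup table mapping keys to shards.
    for id_range, shard in DIRECTORY.items():
        if user_id in id_range:
            return shard
    raise KeyError(f"no shard mapped for user {user_id}")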
Partitioning
 Dividing data within a single database instance.
 Each partition holds a subset of the data based on specific criteria, such
as a range of values, a list of values, or a hash function.
 Improved Performance: Partitioning can improve query performance
by limiting the amount of data scanned during query execution.
 Partitions are implemented within a single database. The database
management system (DBMS) handles the creation, maintenance, and
querying of partitions.
 Sharding vs. Partitioning
o While partitions are divisions within a single database, sharding
involves dividing the data across multiple databases or database
servers.
 Using Both Sharding and Partitioning
o In some cases, both sharding and partitioning can be used together to
achieve optimal scalability and performance.
 Example Scenario
o Suppose you have a social media application with user data spread
across multiple shards based on user ID. Within each shard, you
further partition the data by activity date.
 Sharding by User ID:
o Users with IDs 1-1000 are stored in Shard 1.
o Users with IDs 1001-2000 are stored in Shard 2.
 Partitioning within Each Shard:
o Each shard's user activity table is partitioned by month.
 Shard 1: Contains users with IDs 1-1000, and their activities are
partitioned by month.
 Shard 2: Contains users with IDs 1001-2000, and their activities are
partitioned by month.
Message Queue
 Message queuing makes it possible for applications to communicate
asynchronously, by sending messages to each other via a queue.
 A message queue provides temporary storage between the sender and
the receiver so that the sender can keep operating without interruption
when the destination program is busy or not connected.
 Asynchronous processing allows a service to send a message to another
service and move on to the next task while the other service processes
the request at its own pace.
 A message queue is a queue of messages sent between applications
and waiting to be handled by other applications.
 A message is the data transported between the sender and the receiver
application; it’s essentially a byte array with some headers on top. An
example of a message could be an event. One application tells another
application to start processing a specific task via the queue.
 Architecture
o Producer (Sender): The component that sends messages to the
queue. Producers can generate messages at any time, without
needing to wait for the consumer to be ready.
o Queue: A storage area where messages are held until they are
consumed. The queue ensures that messages are delivered in a
reliable manner, usually in the order they were sent (FIFO - First In,
First Out).
o Consumer (Receiver): The component that retrieves and processes
messages from the queue. Consumers can process messages at their
own pace, independently of the producer.
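 As a rough illustration of the producer-queue-consumer flow, here is a minimal in-process sketch using Python's standard library; it is a stand-in for a real broker such as RabbitMQ or Kafka, and the task names are made up:

import queue
import threading
import time

q = queue.Queue()  # the queue holds messages until a consumer takes them

def producer():
    # The producer posts messages and keeps going; it never waits for the consumer.
    for i in range(5):
        q.put(f"task-{i}")

def consumer():
    # The consumer drains messages at its own pace, in FIFO order.
    while True:
        msg = q.get()
        time.sleep(0.1)  # simulate slow processing
        print(f"processed {msg}")
        q.task_done()

threading.Thread(target=consumer, daemon=True).start()
producer()
q.join()  # block until every produced message has been processed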
 Role of Message Queues in Microservices
o In a microservice architecture, various functionalities are divided
across different services that collectively form a complete software
application.
o These services often have cross-dependencies, meaning some
services can’t perform its functions without interacting with others.
o Message queuing plays a crucial role in this architecture by providing
a mechanism for services to communicate asynchronously, without
getting blocked by responses.
 Key Characteristics of Message Queues:
o Asynchronous Communication: The sender and receiver of the
messages do not need to interact with the queue at the same time.
The sender can post a message and continue processing, while the
receiver can retrieve and process the message at a later time.
o Decoupling: The sender and receiver do not need to know about
each other's existence. This decoupling simplifies the system
architecture and makes it easier to scale and maintain.
o Reliability: Message queues often provide guarantees on message
delivery, ensuring that messages are not lost and are delivered in the
correct order. This is essential for critical applications where data
integrity is important.
o Scalability: Message queues help in scaling applications by
distributing the workload. Multiple consumers can process messages
from the queue in parallel, improving the throughput of the system.
o Buffering: Message queues can handle bursts of messages and
buffer them, ensuring that the receiver processes them at a
manageable rate.
Caching
 A cache is essentially a key-value store that is used to temporarily store
data in a fast-access storage medium.
 The primary purpose of a cache is to speed up data retrieval operations
by storing copies of data that are frequently accessed or computationally
expensive to retrieve from the original source.
 It takes advantage of the locality of reference principle: recently
requested data is likely to be requested again.
 Benefits
o Low latency: Caching makes your system faster by reducing data
fetching time.
o Reduced Server Load: Caching reduces the load on your database
or primary servers.
o Better Customer Experience: Quick response times lead to happier users.
 Cons
o Stale Data: Cache data can become outdated, leading to data
inconsistency.
o System Complexity: Implementing caching adds an extra layer of
complexity to system design.
o Cache Invalidation: Determining when to refresh or clear cache can
be challenging.
Can we store all the data in the cache?
 No! We can’t store all the data in the cache, for multiple reasons.
 The hardware used for cache memory is much more expensive than that
of a normal database.
 If you store a huge amount of data in the cache, the search time will
increase compared to the database, eroding the cache’s speed advantage.
Caching levels in CPU design
 In computer architecture and CPU design, L1, L2, and L3 refer to different
levels of cache memory hierarchy that are integrated into modern
processors to improve performance by reducing the time taken to access
data.
 L1 cache is the fastest but smallest, L2 cache is larger but slightly
slower, and L3 cache is the largest and the slowest of the three, yet
still much faster than accessing RAM.
Caches in different layers
 Caching can be organized into multiple levels depending on where and
how data is stored relative to its usage and accessibility.
 Each level serves a specific purpose in optimizing performance and
efficiency within a system. Here are the typical levels of caching:
 Client-Side Caching
o Location: On the client device (e.g., web browser, mobile app).
o Purpose: Store frequently accessed resources locally to reduce
latency and improve responsiveness.
o Examples: Browser cache for web pages, app cache for mobile
applications.
o Advantages: Minimizes network requests and server load, enhances
user experience by speeding up access to resources.
 DNS Caching: stores the resolved IP addresses of frequently accessed
domain names for faster domain-name-to-IP resolution.
 Server-Side Caching
o Location: On the server hosting the application or service.
o Purpose: Cache data or computations to reduce response times and
load on backend systems.
o Examples: In-memory caches like Redis or Memcached used to
store session data, computed results, or frequently accessed objects.
o Advantages: Improves scalability and efficiency by reducing the
need to repeatedly generate or fetch data from databases or external
services.
 Database Caching
o Location: Within the database management system (DBMS).
o Purpose: Cache frequently accessed data or query results to
minimize disk I/O and query execution time.
o Advantages: Speeds up data retrieval and query processing, reduces
database load during peak usage.
 Content Delivery Network (CDN) Caching
o Location: Distributed globally across CDN edge servers.
o Purpose: Cache static content (e.g., images, CSS, JavaScript) closer
to users to reduce latency and improve content delivery speed.
o Examples: CDN services like Cloudflare, Akamai, serving cached
copies of web assets to users based on geographical proximity.
Cache Performance Metrics
 When implementing caching, it’s important to measure the performance
of the cache to ensure that it is effective in reducing latency and
improving system performance.
 Hit rate: The hit rate is the percentage of requests that are served by
the cache without accessing the original source. A high hit rate indicates
that the cache is effective in reducing the number of requests to the
original source, while a low hit rate indicates that the cache may not be
providing significant performance benefits.
 Miss rate: The miss rate is the percentage of requests that are not
served by the cache and need to be fetched from the original source. A
high miss rate indicates that the cache may not be caching the right
data or that the cache size may not be large enough to store all
frequently accessed data.
 Cache size: The cache size is the amount of memory or storage
allocated for the cache. The cache size can impact the hit rate and miss
rate of the cache. A larger cache size can result in a higher hit rate, but it
may also increase the cost and complexity of the caching solution.
 Cache latency: The cache latency is the time it takes to access data
from the cache. A lower cache latency indicates that the cache is faster
and more effective in reducing latency and improving system
performance. The cache latency can be impacted by the caching
technology used, the cache size, and the cache replacement and
invalidation policies.
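 For example, hit rate and miss rate can be computed from raw lookup counts; a trivial sketch with made-up numbers:

def cache_stats(hits: int, misses: int) -> dict:
    # Hit rate and miss rate as percentages of total cache lookups.
    total = hits + misses
    return {
        "hit_rate": 100 * hits / total,
        "miss_rate": 100 * misses / total,
    }

print(cache_stats(hits=900, misses=100))  # {'hit_rate': 90.0, 'miss_rate': 10.0}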
Cache Replacement Policies
 When implementing caching, it’s important to have a cache replacement
policy to determine which items in the cache should be removed when
the cache becomes full. Here are some of the most common cache
replacement policies:
 Least Recently Used (LRU): This policy assumes that items that have
been accessed more recently are more likely to be accessed again in the
future.
 Least Frequently Used (LFU): This policy assumes that items that
have been accessed more frequently are more likely to be accessed
again in the future.
 First In, First Out (FIFO): This policy assumes that the oldest items in
the cache are the least likely to be accessed again in the future.
 Random Replacement: This policy doesn’t make any assumptions
about the likelihood of future access and can be useful when the access
pattern is unpredictable.
 Comparison of different replacement policies
 Each cache replacement policy has its advantages and disadvantages,
and the best policy to use depends on the specific use case.
 LRU and LFU are generally more effective than FIFO and random
replacement since they take into account the access pattern of the
cache.
 However, LRU and LFU can be more expensive to implement since they
require maintaining additional data structures to track access patterns.
 FIFO and random replacement are simpler to implement but may not be
as effective in optimizing cache performance.
 Overall, the cache replacement policy used should be chosen carefully to
balance the trade-off between performance and complexity.
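 As an illustration of the extra bookkeeping LRU requires, here is a minimal sketch built on Python's OrderedDict; this is a common way to prototype LRU, though production caches use more optimized structures:

from collections import OrderedDict

class LRUCache:
    # Minimal LRU cache: evicts the least recently used key when full.
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.store = OrderedDict()

    def get(self, key):
        if key not in self.store:
            return None  # cache miss
        self.store.move_to_end(key)  # mark as most recently used
        return self.store[key]

    def put(self, key, value):
        if key in self.store:
            self.store.move_to_end(key)
        self.store[key] = value
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict the least recently used entry

cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")     # "a" becomes the most recently used
cache.put("c", 3)  # evicts "b", not "a"
assert cache.get("b") is None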
Cache Invalidation Strategies
 Cache invalidation is the process of removing data from the cache when
it is no longer valid.
 Invalidating the cache is essential to ensure that the data stored in the
cache is accurate and up-to-date.
 Here are some of the most common cache invalidation strategies:
o Write-Through Cache: Data is written to the cache and the backing
store at the same time. Write-through minimizes the risk of data loss,
but since every write operation must be done twice before returning
success to the client, this scheme has the disadvantage of higher
latency for write operations.
o Write-around cache: This technique is similar to write-through
cache, but data is written directly to permanent storage, bypassing
the cache. This can reduce the cache being flooded with write
operations that will not subsequently be re-read, but has the
disadvantage that a read request for recently written data will create a
“cache miss” and must be read from slower back-end storage and
experience higher latency.
o Write-Back Cache: Under this scheme, data is written to cache
alone, and completion is immediately confirmed to the client. The
write to the permanent storage is done based on certain conditions,
for example, when the cache system needs to free some space. This
results in low-latency and high-throughput for write-intensive
applications; however, this speed comes with the risk of data loss in
case of a crash or other adverse event because the only copy of the
written data is in the cache.
o Write-behind cache: It is quite similar to write-back cache. In this
scheme, data is written to the cache and acknowledged to the
application immediately, but it is not immediately written to the
permanent storage. Instead, the write operation is deferred, and the
data is eventually written to the permanent storage at a later time.
o The main difference between write-back cache and write-behind cache
is when the data is written to the permanent storage. In write-back
caching, data is only written to the permanent storage when it is
necessary for the cache to free up space, while in write-behind
caching, data is written to the permanent storage at specified
intervals.
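 A minimal sketch contrasting write-through and write-back behavior; the dict-based "backing store" is a stand-in for a real database, and the class names are illustrative:

class WriteThroughCache:
    # Write-through: every write goes to both the cache and the backing store.
    def __init__(self, backing_store: dict):
        self.cache = {}
        self.db = backing_store

    def write(self, key, value):
        self.cache[key] = value
        self.db[key] = value  # synchronous write: higher latency, minimal data-loss risk

class WriteBackCache:
    # Write-back: writes hit only the cache; the store is updated on eviction/flush.
    def __init__(self, backing_store: dict):
        self.cache = {}
        self.dirty = set()  # keys not yet persisted
        self.db = backing_store

    def write(self, key, value):
        self.cache[key] = value
        self.dirty.add(key)  # fast acknowledgment, but data is at risk until flushed

    def flush(self):
        for key in self.dirty:
            self.db[key] = self.cache[key]
        self.dirty.clear()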
Distributed Caching
 The distributed cache is designed to store and manage cached data
across multiple nodes in a network.
 This approach improves scalability, fault tolerance, and performance by
leveraging the combined resources of multiple machines.
 Cache Nodes: Individual servers that store and manage cached data in
a distributed cache system.
 Cache Management Layer: The system responsible for coordinating
data distribution, consistency, replication, and fault tolerance across
cache nodes.
 Networking Layer: The communication infrastructure that enables data
exchange between cache nodes and clients, ensuring secure and
efficient data transfer.
 Metadata and Configuration Store: A centralized repository that
keeps track of cache metadata (e.g., key mappings, expiration times)
and configuration settings for cache nodes.
 Client Interface: APIs and libraries that allow applications to interact
with the distributed cache for data storage and retrieval.
Request Flow in Distributed Cache
 Client Request: The client sends a data request to the distributed
cache via the client interface.
 Routing to Cache Node: The key of the requested data is hashed to
determine which cache node is responsible for the data. The metadata
and configuration store provide the necessary information for routing.
 Cache Lookup: The cache node checks its local store for the requested
data (cache hit or miss).
 Handling Cache Miss: If a miss occurs, the cache node fetches data
from the primary source, coordinated by the management layer, and
stores it. The fetched data is then stored in the cache node for future
requests. The metadata and configuration store may update metadata to
reflect the new data location.
 Data Retrieval and Return: The data is returned to the client.
 Replication (Optional): The data may be replicated to other nodes,
coordinated by the management layer, and metadata is updated.
 Cache Maintenance: The cache node may evict or expire data based
on policies, overseen by the management layer.
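 A minimal sketch of this flow, with plain dicts standing in for the cache nodes and the primary database; the routing here uses a simple hash-mod, whereas a real system would typically use consistent hashing (covered later):

import hashlib

NODES = ["cache-1", "cache-2", "cache-3"]        # hypothetical cache nodes
caches = {n: {} for n in NODES}                   # stand-ins for node-local stores
database = {"user:42": {"name": "Ada"}}           # the primary data source

def node_for(key: str) -> str:
    # Routing step: hash the key to pick the responsible cache node.
    return NODES[int(hashlib.md5(key.encode()).hexdigest(), 16) % len(NODES)]

def get(key: str):
    node = node_for(key)
    if key in caches[node]:
        return caches[node][key]   # cache hit
    value = database[key]          # cache miss: fetch from the primary source
    caches[node][key] = value      # populate the cache for future requests
    return value

print(get("user:42"))  # first call misses and fills the cache
print(get("user:42"))  # second call is served from the cache node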
SQL Databases
 Databases are typically controlled by database management systems
(DBMS).
 SQL databases are relational databases that use Structured Query
Language (SQL) for managing and querying data.
 These databases are mainly composed of tables, with each table
consisting of rows and columns.
 In a relational database, each row is a record with a unique
identifier called a key.
 Relational databases have a predefined schema, which establishes
the relationship between tables and field types. In a relational
database, the schema must be clearly defined before any information
can be added.
 Relational databases are ACID-compliant, which makes them highly
suitable for transaction-oriented systems and storing financial
data. ACID compliance ensures error-free services, even in the event of
failures, which is essential for transaction validity.
 Here are some key characteristics and components of SQL databases:
 Relational Structure: Data is organized into tables with rows and
columns. Tables are related to each other through defined relationships.
 Schema: SQL databases have a schema that defines the structure of
the database, including tables, fields (columns), and relationships
between tables.
 SQL Language: SQL (Structured Query Language) is the standard
language used to interact with SQL databases. It allows users to query
data, insert new records, update existing records, and delete records.
 ACID Properties: Transactions in SQL databases adhere to the ACID
properties:
o Atomicity: Transactions are all or nothing.
o Consistency: Transactions bring the database from one valid state to
another.
o Isolation: Transactions occur independently of each other.
o Durability: Once a transaction is committed, it is permanently saved
and recoverable.
 Examples: Examples of SQL databases include MySQL, PostgreSQL,
Oracle Database, SQLite, Microsoft SQL Server, and others.
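 To illustrate atomicity in practice, here is a minimal sketch using Python's built-in sqlite3 module; the accounts table and amounts are made up:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 100), (2, 50)")
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 1")
        conn.execute("UPDATE accounts SET balance = balance + 30 WHERE id = 2")
        # If anything above raises, neither update is applied (atomicity).
except sqlite3.Error:
    pass  # the transaction was rolled back; the database stays consistent

print(conn.execute("SELECT * FROM accounts").fetchall())  # [(1, 70), (2, 80)]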
NoSQL Databases:
 NoSQL (Not only SQL) databases are a diverse group of non-relational
databases designed to address the limitations of traditional SQL
databases, particularly in terms of scalability, flexibility, and performance
under specific workloads.
 NoSQL databases do not adhere to the relational model and typically do
not use SQL as their primary query language. Instead, they employ
various data models and query languages, depending on the specific
type of NoSQL database being used.
 The key characteristics of NoSQL databases include their schema-less
design, which allows for greater flexibility in handling data; horizontal
scalability, which makes it easier to distribute data across multiple
servers; and their ability to perform well under specific workloads, such
as high write loads or large-scale data storage and retrieval.
 Types of NoSQL databases and their use cases
 NoSQL databases can be broadly categorized into four main types, each
with its unique data model and use cases:
 Document databases: These databases store data in a semi-structured
format, such as JSON or BSON documents. Each document can contain
nested fields, arrays, and other complex data structures, providing a
high degree of flexibility in representing hierarchical and related
data. Some popular document databases include MongoDB and
CouchDB.
 Key-value stores: Key-value databases store data as key-value pairs,
where the key is a unique identifier and the value is the associated data.
These databases excel in scenarios requiring high write and read
performance for simple data models, such as caching, session
management, and real-time analytics. Some widely-used key-value
stores are Redis and Amazon DynamoDB.
 Column-family stores: Also known as wide-column stores, these
databases store data in columns rather than rows, making them highly
efficient for read and write operations on specific columns of data.
Column-family stores are particularly well-suited for large-scale,
distributed applications with high write loads and sparse or time-series
data, such as IoT systems, log analysis, and recommendation engines.
Examples of column-family stores include Apache Cassandra and HBase.
 Graph databases: Graph databases store data as nodes and edges in a
graph, representing entities and their relationships. These databases are
optimized for traversing complex relationships and performing graph-
based queries, making them ideal for applications involving social
networks, fraud detection, knowledge graphs, and semantic search.
Some notable graph databases are Neo4j and Amazon Neptune.
SQL vs. NoSQL Databases
 Schema:
o SQL: SQL databases enforce a predefined schema for the data, which
ensures that the data is structured, consistent, and follows specific
rules. This structured schema can make it easier to understand and
maintain the data model, as well as optimize queries for performance.
We cannot accommodate new data types without schema
modification.
o NOSQL: One of the primary advantages of NoSQL databases is their
schema-less design, which allows for greater flexibility in handling
diverse and dynamic data models. This makes it easier to adapt to
changing requirements and accommodate new data types without the
need for extensive schema modifications, as is often the case with
SQL databases.
 Indexing:
o Both SQL (relational) and NoSQL databases can utilize indexing to
optimize query performance, although the specifics of indexing can
vary between them.
 ACID compliance:
o SQL: SQL databases adhere to the ACID (Atomicity, Consistency,
Isolation, Durability) properties, which ensure the reliability of
transactions and the consistency of the data. These properties
guarantee that any operation on the data will either be completed in
its entirety or not at all, and that the data will always remain in a
consistent state.
o NoSQL: NoSQL databases often offer different consistency models
and trade-offs depending on their specific design goals and use cases.
Some NoSQL databases focus on achieving eventual consistency rather
than strong consistency. Some provide support for certain ACID
properties in specific scenarios, but they may not guarantee ACID
compliance universally across all operations and configurations.
Others (like MongoDB) sacrifice full ACID compliance in favor of
benefits such as high availability, partition tolerance, and
scalability.
 Scalability:
o SQL: SQL databases can be scaled vertically by adding more
resources (such as CPU, memory, and storage) to a single server.
However, horizontal scaling, or distributing the data across multiple
servers, can be more challenging due to the relational nature of the
data and the constraints imposed by the ACID properties. This can
lead to performance bottlenecks and difficulties in scaling for large-
scale applications with high write loads or massive amounts of data.
o NoSQL: NoSQL databases are designed to scale horizontally, enabling
the distribution of data across multiple servers, often with built-in
support for data replication, sharding, and partitioning. This makes
NoSQL databases well-suited for large-scale applications with high
write loads or massive amounts of data, where traditional SQL
databases may struggle to maintain performance and consistency.
 Querying
o SQL: SQL is a powerful and expressive query language that allows
developers to perform complex operations on the data, such as
filtering, sorting, grouping, and joining multiple tables based on
specified conditions.
o NoSQL: While some NoSQL databases offer powerful query languages
and capabilities, they may not be as expressive or versatile as SQL
when it comes to complex data manipulation and analysis. This can be
a limiting factor in applications that require sophisticated querying,
joining, or aggregation of data.
o Additionally, developers may need to learn multiple query languages
and paradigms when working with different types of NoSQL databases.
o Since a query doesn’t need to look at numerous tables to obtain a
response, as relational databases frequently do, non-relational
database management systems are often faster than relational
databases for such lookups.
Real-World Examples and Case Studies
SQL Databases in Action
 E-commerce platforms: SQL databases are widely used in e-commerce
platforms, where structured data and well-defined relationships are the
norm. For example, an online store’s database may have tables for
customers, products, orders, and shipping details, all with established
relationships. SQL databases enable efficient querying and data
manipulation, making it easier for e-commerce platforms to manage
inventory, customer data, and order processing.
 Financial systems: Financial applications, such as banking and trading
platforms, rely on SQL databases to maintain transactional consistency,
ensure data integrity, and support complex queries. The ACID properties
of SQL databases are crucial in this context, as they guarantee the
correct processing of transactions and safeguard against data corruption.
 Content Management Systems (CMS): Many popular CMS platforms,
such as WordPress and Joomla, use SQL databases to store content, user
data, and configuration information. The structured nature of the data
and the powerful query capabilities of SQL databases make them well-
suited for managing content and serving dynamic web pages.
NoSQL Databases in Action
 Social media platforms: NoSQL databases, particularly graph
databases, are ideal for managing complex relationships and
interconnected data found in social media platforms. For example,
Facebook uses a custom graph database called TAO to store user
profiles, friend connections, and other social graph data. This allows
Facebook to efficiently query and traverse the massive social graph,
providing features like friend recommendations and newsfeed
personalization.
 Big data analytics: NoSQL databases, such as Hadoop’s HBase and
Apache Cassandra, are commonly used for big data analytics, where
large-scale data storage and processing are required. These databases
are designed to scale horizontally, enabling them to handle vast amounts
of data and high write loads. For example, Netflix uses Apache
Cassandra to manage its customer data and viewing history, which helps
the streaming service to provide personalized content recommendations
to its users.
 Internet of Things (IoT): IoT applications generate massive volumes of
data from various devices and sensors, often with varying data
structures and formats. NoSQL databases like MongoDB and Amazon
DynamoDB are suitable for handling this diverse and dynamic data,
providing flexible data modeling and high-performance storage
capabilities. For example, Philips Hue, a smart lighting system, uses
Amazon DynamoDB to store and manage data generated by its
connected light bulbs and devices.
Hybrid Solutions
 Gaming industry: In the gaming industry, developers often use a
combination of SQL and NoSQL databases to support different aspects of
their applications. For instance, an SQL database may be employed to
manage user accounts, in-game purchases, and other transactional data,
while a NoSQL database like Redis can be used to store real-time game
state information and leaderboards.
 E-commerce with personalized recommendations: Some e-commerce
platforms combine SQL databases for transactional data and inventory
management with NoSQL databases for personalized recommendations.
This hybrid approach allows the platform to leverage the strengths of
both database types, ensuring efficient data storage, querying, and
analysis for various aspects of the application.
Avoiding Common Pitfalls in Database Selection
 One of the biggest mistakes candidates make in system design
interviews is relying too much on their personal experience or bias when
selecting a database. Remember, interviews are not about which
database you prefer, but about which one is most suited to the problem
you’re facing.
 When selecting a database, it’s important to consider the specific needs
of your application.
o What type of data will you be storing?
o How much data will you be storing?
o How frequently will you be accessing the data?
 These are all important questions to ask yourself before making a
decision.
 Another common pitfall is failing to fully understand the requirements of
the application. It’s important to take the time to thoroughly analyze the
needs of your application before selecting a database. This includes
understanding the expected traffic and usage patterns, as well as any
specific performance requirements.
 Choosing the wrong database can have serious consequences for
your application. For example, if you choose a database that can’t
handle the amount of data you need to store, you may run into
performance issues or even data loss. On the other hand, if you choose a
database that is too complex for your needs, you may end up with a
system that is difficult to maintain and scale.
 It’s also important to consider the long-term implications of your
database selection. As your application grows and evolves, your
database needs may change. Choosing a database that is flexible and
scalable can help ensure that your application can continue to meet your
needs as you grow.
CI/CD Pipelines
 CI/CD stands for Continuous Integration and Continuous Deployment (or
Continuous Delivery). A CI/CD pipeline is a series of automated processes
that allow developers to integrate their code changes frequently and
deploy them quickly and reliably. The pipeline automates the steps
involved in software delivery, from building and testing the code to
deploying it to production.
 Continuous Integration (CI): Automates the process of integrating
code changes from multiple contributors into a shared repository several
times a day. This involves automatic building and testing of the code to
detect integration issues early. The final output of this stage is an
artifact, which is stored in a repository or registry such as Docker Hub.
 Continuous Deployment (CD): takes over from where CI leaves off.
The primary function of CD is to automate the deployment process of the
build artifacts produced by CI. These artifacts are deployed to different
environments, starting typically with a staging environment for further
testing and validation.
Benefits
 Rapid Development: Automating the integration, testing, and
deployment processes significantly accelerates the software
development lifecycle.
 Frequent Releases and Faster Development: Developers can work
in smaller, manageable increments and see their features and bug fix
changes integrated and tested quickly.
 Automation: Reduces the need for manual intervention in repetitive
tasks such as building, testing, and deploying code, freeing up
developers to focus on more complex tasks.
 Automated Testing: Continuous testing ensures that code changes are
consistently validated through automated unit, integration, and end-to-
end tests, catching issues early.
 Continuous Feedback and Early Issue Detection: CI enables the
early detection of integration issues and bugs, reducing the cost and
effort required to fix them.
 Cost Savings: By reducing manual labor and minimizing downtime,
CI/CD pipelines help in controlling operational costs.
CI Workflow
 Pre-Commit Checks: Before code changes are committed, developers
often run pre-commit checks to catch issues early.
o Static Code Analysis: Tools analyze the code for potential bugs,
code smells, and adherence to coding standards.
o Local Unit Tests: Developers run unit tests on their local machines to
verify that their changes do not break existing functionality.
 Source Phase: This phase involves managing the source code and
initial integration steps.
o Commit: Developers commit their code changes to the version
control system (e.g., Git).
o Branch Protection: Policies enforce rules such as requiring code
reviews and passing status checks before changes can be merged into
the main branch.
o Linting: Automated tools check the code for stylistic and
programming errors, ensuring consistency and adherence to best
practices.
 Build Phase: The build phase compiles the code and creates build
artifacts.
o Compiling Code: The source code is compiled into executable
binaries.
o Building Image: For containerized applications, a Dockerfile is used to
create a container image.
o Unit Tests: Automated unit tests are executed to verify the
functionality of individual components.
o Code Coverage: Tools measure the extent to which the source code
is tested, ensuring critical parts of the application are covered.
o Building Container Image: A final container image is built, including
all necessary dependencies and configurations.
 Test Phase: In this phase, various tests are executed to validate the
integrated application.
o Integration Tests: These tests verify that different modules or
services of the application work together as expected.
o End-to-End (E2E) Tests: E2E tests simulate real user scenarios to
ensure the entire application functions correctly from start to finish.
o Performance Tests: These tests assess the application's
performance, such as load and stress testing, to ensure it meets
performance requirements.
 Release Stage: Once the code has passed all tests, it is prepared for
deployment.
o Shipping Image to Registry: The built container image is pushed to
a container registry (e.g., Docker Hub, Amazon ECR).
o Tagging and Versioning: The image is tagged and versioned
appropriately to track releases.
o Artifact Storage: Other build artifacts are stored in an artifact
repository for future reference or deployment.
CD Workflow
 The continuous deployment pipeline takes the container image or tested
build artifact from the CI stage and promotes it through different
environments.
 The Continuous Deployment (CD) process involves several stages that
automate the deployment of validated artifacts across various
environments. The setup includes multiple repositories, configurations,
and deployment models.
 Flow:
 Repository Setup: We create a second repository to host our
deployment configurations. The CI pipeline checks out code from the
application repo, while the deployment pipeline uses configs from the
config repo. The config repo hosts configuration files, such as Docker
Compose files or Kubernetes deployment files.
 Build and Artifact Versioning: At the end of the CI pipeline, a new
version of the artifact (e.g., myapp:v1.0, myapp:v1.1) is produced. The
new version is stored in a container registry or artifact repository.
 Deployment Models: There are two deployment models for deploying to
different environments: push and pull.
 Push Model
 Artifact Deployment:
o Pull Request/Direct Push: The CI pipeline creates a pull request or
directly pushes the new image version to the configuration repository.
o Config Update: The new image version is updated in the Kubernetes
deployment files within the config repository.
o Deployment Execution: Runs kubectl apply to deploy the new
image into the environment.
 Environment Progression:
o QA Environment: The application is first deployed to the QA
environment for initial testing. Observability tools like Grafana and
Prometheus are used to monitor metrics such as latency and error
rates, providing feedback to QA engineers.
o Staging Environment: Once validated in QA, the application is
pushed to the staging environment. This can be done automatically or
manually.
o Production Environment: After staging, the application can be
deployed to production. This step is often manual to ensure extra
caution.
 Pull Model
 Operators: In the pull model, an operator such as ArgoCD is installed
in each environment to monitor and synchronize the state of the cluster
with the repository.
 Auto-Sync (QA and Staging): ArgoCD automatically synchronizes the
QA and staging environments with the configuration repository.
 Manual Sync (Production): For production, synchronization is
manually triggered to ensure controlled deployment.
Canary Deployments
 A deployment strategy where a new version of an application is gradually
rolled out to a small subset of users initially.
 This allows for monitoring and verifying the new version's performance
and stability before a full rollout to all users.
 If issues are detected, the deployment can be halted or rolled back,
minimizing the impact on users.
 We gradually increase the percentage of users routed to the new
version, as in the sketch below.
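 A minimal sketch of percentage-based canary routing; the percentage, version labels, and bucketing scheme are illustrative assumptions:

import hashlib

CANARY_PERCENT = 10  # start small; raise gradually while metrics stay healthy

def version_for(user_id: str) -> str:
    # Deterministically bucket each user so they always hit the same version.
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    return "v2-canary" if bucket < CANARY_PERCENT else "v1-stable"

routed = [version_for(f"user-{i}") for i in range(1000)]
print(routed.count("v2-canary"))  # roughly 10% of users see the new version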
Database, Data Warehouse, Data Lake
 A database is a structured collection of data organized to facilitate
efficient retrieval, storage, and manipulation. It typically uses a
predefined schema to organize data into tables, rows, and columns.
Databases are designed for transactional operations, ensuring data
consistency, integrity, and ACID properties.
 A data warehouse is a centralized repository that stores structured and
organized historical data from various sources. It integrates data from
different operational systems and transforms it into a consistent format
suitable for analysis and reporting. Data warehouses support decision-
making processes by providing a historical perspective on business
operations.
 Example: Retail Analytics: A retail company uses a data warehouse
(e.g., Amazon Redshift, Snowflake) to consolidate sales data from
multiple stores and online channels. Analysts query the data warehouse
to analyze sales trends, inventory levels, and customer behavior over
time to optimize pricing and promotions.
 A data lake is a centralized repository that stores vast amounts of raw,
unstructured, semi-structured, and structured data in its native format.
Unlike data warehouses, data lakes accept data from diverse
sources without requiring upfront schema definition or data
transformation.
 Example: IoT Data Processing: An organization collects sensor data
from IoT devices (e.g., temperature, humidity) and stores it in a data lake
(e.g., Amazon S3, Azure Data Lake). Data scientists use the data lake to
perform predictive analytics and anomaly detection, leveraging the
flexibility to explore different data formats and sources.
Agile Methodology
 Agile methodology is a flexible approach to software development that
prioritizes collaboration, customer feedback, and continuous
improvement. It emphasizes:
o Iterative Development: Breaking projects into smaller, manageable
parts called sprints or iterations.
o Customer Collaboration: Regular interaction with customers to
refine requirements and adapt to changes.
o Adaptability: Embracing change to deliver better products quickly
and efficiently.
o Self-contained and independent teams.
 Key practices include Scrum for structured team roles and processes,
Kanban for visualizing workflow and limiting work in progress, and Lean
principles for maximizing value and minimizing waste. Agile enables
teams to deliver software in incremental stages, ensuring responsiveness
to customer needs and market changes.
Database Indexes
 An index is a data structure that improves the speed of data retrieval
operations on a database table at the cost of additional space and
decreased performance on write operations.
 In most relational database management systems (RDBMS), indexes are
stored separately from the tables for which they are created.
 Indexes are stored as separate data structures within the database.
 Each index contains keys (or pointers) and their corresponding
references to rows in the table.
 Ensuring that indexes remain synchronized with table data during
insertions, updates, and deletions is crucial for maintaining query
performance.
 Types of Indexes
o Primary Index: Index is created on unique identifier for each record
in the table, often based on the primary key.
o Secondary Index: Created on columns other than the primary key to
accelerate queries on frequently searched fields.
o Composite Index: Combines multiple columns to create a single
index, useful for queries that filter based on multiple criteria.
 Indexing Structures:
o B-tree and B+ trees (Balanced Tree): Commonly used for indexes
due to its balanced nature, which ensures efficient search operations.
 Data Retrieval with Indexes: Indexes in databases, such as B-tree
indexes, significantly improve the speed of data retrieval operations.
Here’s how:
o Quick Lookup:
 Indexes are structured as data structures (like B-trees) that allow
for rapid lookup of values.
 When you query a column that has an index, the database can
quickly navigate the index structure to find the specific rows that
match the query condition.
 This avoids the need for a full-table scan where the database
would otherwise have to examine every row in the table.
o Reduced Disk I/O:
 By using indexes, the database can minimize the amount of disk
I/O (Input/Output) required to retrieve data.
 Instead of reading the entire table, the database accesses the
index structure first to locate the rows, reducing the number of
disk reads.
 Write Operations and Index Overhead
o When you insert, update, or delete data in a table, the corresponding
indexes must be updated to reflect these changes.
o For insertions, new entries must be added to the index structure.
o For updates or deletions, the affected entries in the index must be
modified or removed.
o This overhead includes additional disk writes and CPU processing to
keep indexes synchronized with the table.
 Balancing Data Retrieval Speed and Write Performance
o Index Selection: Choosing the right columns to index is crucial.
Indexing heavily queried columns improves retrieval performance
while carefully considering the impact on write operations.
o Index Maintenance Strategies: Database systems provide
mechanisms to optimize index maintenance, such as batch
processing or deferred index updates, to mitigate write
performance degradation.
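 As a concrete illustration, here is a minimal sketch using Python's built-in sqlite3 module showing a secondary index and a composite index; the table and index names are illustrative:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, region TEXT)")

# Secondary index on a frequently searched column:
conn.execute("CREATE INDEX idx_users_email ON users(email)")

# Composite index for queries that filter on both columns together:
conn.execute("CREATE INDEX idx_users_region_email ON users(region, email)")

# The query planner now uses the index instead of a full-table scan:
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE email = ?", ("a@b.com",)
).fetchall()
print(plan)  # mentions "SEARCH ... USING INDEX idx_users_email"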
Why Use B-tree and B+tree Indexing in SQL?
 A B-tree index is a type of tree data structure used in databases to
improve search efficiency. It is called a B-tree because it is a balanced
tree, meaning that all leaf nodes are at the same level, and the
branching factor is bounded to ensure fast access to leaf nodes, so
that we don’t have to search too long linearly through each node’s keys.
 The B-tree index stores data in sorted order, which makes it very efficient
for range queries and equality checks. In MySQL, the default index type
is BTREE.
 Key Characteristics of B-trees:
 Sorted Order: Every node in a B-tree maintains keys in a sorted order.
For a node with keys K1,K2,...,Kn, the following condition holds:
K1<K2<...<Kn.
 Node Structure:
o Each node in a B-tree typically contains multiple keys and pointers to
child nodes (or data entries in the case of leaf nodes).
o Internal nodes (non-leaf nodes) serve as intermediate levels for
navigation through the tree.
o Leaf nodes contain actual data entries or pointers to data entries.
 Binary Search Property:
o The B-tree ensures that for any given node:
o All keys in the left child node are less than the node's keys.
o All keys in the right child node are greater than the node's keys.
o This property allows for efficient search operations similar to binary
search, maintaining O(log n) time complexity.
 Balanced Tree Structure:
o B-trees are balanced trees, meaning that all leaf nodes are at the
same depth.
o B-trees are self-balancing, meaning that as data is added or removed,
the tree is automatically restructured to maintain a balance between
the left and right subtrees. This ensures that search operations take a
similar amount of time for all nodes in the tree, improving the overall
performance of the database.
 Support for Range Queries:
o B+ trees, in particular, excel at range queries due to their leaf node
structure where all keys are stored sequentially.
o Range queries can efficiently scan through consecutive leaf nodes,
minimizing disk I/O and improving query performance.
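 To illustrate the search path, here is a toy sketch of B-tree lookup with binary search inside each node; it is a simplification, since real B-tree nodes are sized to disk pages and also handle insertion and node splitting:

import bisect

class Node:
    # A toy node: sorted keys plus child pointers (children[i] holds keys < keys[i]).
    def __init__(self, keys, children=None, is_leaf=True):
        self.keys = keys                 # kept in sorted order: K1 < K2 < ... < Kn
        self.children = children or []   # internal nodes have len(keys) + 1 children
        self.is_leaf = is_leaf

def btree_search(node, key):
    # Descend the tree, binary-searching within each node (O(log n) overall).
    i = bisect.bisect_left(node.keys, key)
    if i < len(node.keys) and node.keys[i] == key:
        return True   # key found in this node
    if node.is_leaf:
        return False  # reached a leaf without finding the key
    return btree_search(node.children[i], key)

leaf1 = Node([5, 12])
leaf2 = Node([25, 40])
root = Node([20], children=[leaf1, leaf2], is_leaf=False)
assert btree_search(root, 25) and not btree_search(root, 7)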
Distributed Indexing
 Indexing in distributed databases presents unique challenges and
opportunities compared to indexing in traditional, single-node databases.
 We can have something like global and local indexes:
o Global Indexes: A single index that spans all shards, enabling
queries across the entire dataset. These are more complex to maintain
due to the need for synchronization across nodes.
o Local Indexes: Separate indexes are maintained on each shard.
These are easier to manage but may require additional logic to
combine results from different shards.
N-Tier Architecture
 N-tier architecture (also known as multi-tier architecture) is a software
architecture model that organizes applications into layers or tiers, each
with a specific responsibility.
 This separation of concerns improves scalability, manageability, and
flexibility.
 Tiers are physically separated, running on separate machines. A tier can
call to another tier directly, or use asynchronous messaging.
 Physically separating the tiers improves scalability and resiliency but
adds latency from the additional network communication.
 The most common implementation is the three-tier architecture, but the
concept can be extended to more tiers as needed.
 Key Tiers in N-tier Architecture
 Presentation Tier:
o Responsibility: This layer is responsible for the user interface and
user experience. It handles all interactions with the end-user,
displaying data, and collecting user inputs.
o Examples: Web browsers, mobile apps, desktop applications.
o Technologies: HTML, CSS, JavaScript, Angular, React, Vue.js.
 Application (Logic) Tier:
o Responsibility: This layer contains the business logic and application
processing. It processes user inputs, makes logical decisions, performs
calculations, and controls the flow of data.
o Examples: Web servers, application servers.
o Technologies: Java, C#, Python, Node.js, .NET, Spring.
 Data Tier:
o Responsibility: This layer is responsible for data storage, retrieval,
and management. It interacts with databases to perform CRUD
(Create, Read, Update, Delete) operations.
o Examples: Database servers, data storage systems.
o Technologies: SQL databases (MySQL, PostgreSQL), NoSQL
databases (MongoDB, Cassandra), ORM tools (Hibernate, Entity
Framework).
 Benefits of N-tier Architecture
o Scalability: Each tier can be scaled independently based on load and
requirements.
o Manageability: Clear separation of concerns allows for easier
development, maintenance, and troubleshooting.
o Flexibility: Changes in one tier do not necessarily affect others,
facilitating updates and redesigns.
o Reusability: Business logic and data access code can be reused
across different applications or presentation layers.
o Security: Multiple levels of security can be implemented, providing a
comprehensive defense strategy.
API Paradigms
 REST: REST is an architectural style for designing networked
applications. It relies on a stateless, client-server, cacheable
communications protocol -- the HTTP protocol.
 Key Features:
o Resource-Based: Uses URIs to access resources.
o HTTP Methods: CRUD operations are mapped to HTTP methods
(GET, POST, PUT, DELETE).
o Stateless: Each request from a client to server must contain all the
information needed to understand and process the request.
o Scalable: Since each request is self-sufficient, we can add multiple
servers, and a load balancer can forward any REST API request to any
server.
o Flexibility: Can handle different types of calls and return different
data formats (JSON, XML).
 Pros:
o Wide adoption and familiarity.
o Easy to debug and test with tools like Postman.
o Excellent browser support.
o Caching mechanisms available to improve performance.
 Cons:
1. Over-fetching/Under-fetching: Clients may receive too much or too
little data.
 Example Scenario: Suppose you have a blog application with users and
posts. You want to display a list of posts along with the author’s name
and email for each post.
 REST API Endpoints:
o /posts: Returns a list of posts.
o /users/{id}: Returns user details by user ID.
 Over-fetching:
o When you request /posts, you get all post data, including fields you
might not need, such as content, timestamp, etc.
o Then, for each post, you need to make a separate request to
/users/{id} to get the author's details, which might include more
information than you need, such as address, phone number, etc.
 Under-fetching: To display the required information, you need to make
multiple requests: one to /posts and several to /users/{id}. This results in
many network requests, which can be inefficient.
2. Lack of type safety: JSON does not enforce a strict schema.
 GraphQL: GraphQL is an open-source data query and manipulation
language for APIs and a query runtime engine.
 Key Features:
o Single Endpoint: All requests are sent to a single endpoint.
o Flexible Queries: Clients can request exactly the data they need.
o Strongly Typed Schema: Enforces a schema for the API.
 Cons:
o Complexity in setup and maintenance.
o Performance can degrade if queries are not optimized.
o Overhead of learning a new query language and paradigm.
o Can be overkill for simple APIs.
 Example Scenario: Suppose you have a blog application with users and
posts. You want to display a list of posts along with the author’s name
and email for each post.
 GraphQL Query: With GraphQL, you can request exactly the data you
need in a single query.
query {
  posts {
    id
    title
    author {
      name
      email
    }
  }
}
 Another advantage is type-safety which REST doesn’t enforce.
 In REST, if a client requests a field that doesn't exist in the API response,
the behavior depends on how the API and client are implemented. Unlike
GraphQL, REST does not have a built-in mechanism for schema
validation or query validation at the API level.
 In REST, clients themselves have to validate the response against their
requirements, e.g., checking whether any field is missing or whether a
field such as email contains the expected kind of value.
 In GraphQL, the response is validated against the schema to ensure that
no violations occur. This means that both the queries made by clients
and the responses returned by the server must conform to the defined
schema.
 When a client makes a query, the GraphQL server validates the query
against the schema. If the query includes fields that are not defined in
the schema, the server will return an error before executing any part of
the query.
 After the server processes a valid query, it constructs a response that
must conform to the schema. The server ensures that the data types and
fields in the response match those defined in the schema.
 If the server tries to return a field not defined in the schema or with a
type that doesn’t match, it will result in an error, ensuring consistency
and reliability in the data contract between the server and the client.
 GraphQL Schema:
type Post {
  id: ID!
  title: String!
  content: String!
  author: User!
}

type User {
  id: ID!
  name: String!
  email: String!
}

type Query {
  posts: [Post]
  user(id: ID!): User
}
Use Cases
 REST: Best for simple CRUD operations, standard web applications, and
where human-readability of messages is essential.
 GraphQL: Ideal for applications with complex querying needs, where
clients need precise control over the data they request, and for scenarios
requiring strong schema and introspection.
API Design and Best Practices
 Backward compatibility, so we don’t break existing users who are using
the API
 Versioning
Consistent Hashing
 Consistent hashing aims to minimize the amount of data that needs to
be reassigned when nodes are added or removed.
 It achieves this by mapping both nodes and data items onto a virtual
hash ring.
 Each node and data item is assigned a position on the ring using a hash
function.
 Nodes are responsible for the data items whose positions are closest to
and follow them on the ring.
 Nodes and data items are mapped onto a virtual ring using a hash
function that outputs a large numeric space.
 Normal Hashing: Normal hashing involves using a hash function to
determine which node should store or handle a particular piece of data.
 This can be expressed as:
 node_index = hash_function(data_key) % num_nodes
 Example Scenario: Consider a scenario with 3 nodes (Node 1, Node 2,
Node 3) and a set of data items (Data 1 to Data 10). Using normal
hashing:
o Data 1 hashes to a value that maps to Node 1.
o Data 2 hashes to a value that maps to Node 2.
o Data 3 hashes to a value that maps to Node 3.
o If you add a new node (Node 4) to scale the system, you would
typically need to rehash all data because the addition of a new node
changes the modulus (num_nodes), affecting which node each data
item should be assigned to.
 Example Scenario (Consistent Hashing): Using 3 nodes (Node A, Node B,
Node C) and a set of data items (Data 1 to Data 10):
o Nodes A, B, and C are placed at different points on the hash ring
based on their hash values.
o Data 1 to Data 10 are also hashed onto the ring.
o Assigning Data to Nodes:
 To determine which node should handle a particular data item,
you move clockwise on the ring from the data item's hash value
until you find the first node.
 That node becomes responsible for storing or processing that
data item.
 This process ensures that each node is responsible for a segment
of the hash ring, and data items are evenly distributed among
nodes.
o Adding a New Node:
 When a new node is added, it is placed on the ring based on its
hash value.
 Only the data that was previously assigned to the next node on
the ring (in a clockwise direction) needs to be reassigned to the
new node.
 This minimal reassignment reduces the overhead and disruption
in the system compared to traditional hashing methods where all
data might need to be redistributed.
 If you add a new node (Node D), only a fraction of the data
needs to be remapped. For example, only the keys that fall
between Node C and Node D on the ring would move to Node D.
o Removing a Node:
 When a node is removed, its data is typically reassigned to the
next node that follows it on the ring.
 Again, only a portion of the data needs to be reassigned,
maintaining efficiency and minimizing disruption.
 Now, if Node B is removed, its data is typically reassigned to
Node C which is next in clockwise direction.
 Only the data that was previously assigned to Node B needs to
be remapped.
o Determining the Next Node:
 To find which node should handle the request or data item:
 The system starts from the hashed value's position on the hash
ring.
 It moves clockwise around the ring until it finds the first node
whose position (hash value or identifier) is greater than or equal
to the hashed value of the request.
 This node becomes responsible for processing or storing the
request.
o Why Not a Centralized Load Balancer or Proxy?
 The use of a hash ring allows nodes to independently determine
routing decisions based on their relative positions on the ring.
 This decentralized approach scales more effectively as the
number of nodes increases, without creating a bottleneck at a
centralized load balancer.
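 A minimal Python sketch of a hash ring; the node and key names are illustrative, and real implementations also place multiple virtual nodes per physical node to smooth out the distribution:

import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    # Nodes and keys share one hash space; a key belongs to the first node
    # found moving clockwise from the key's position on the ring.
    def __init__(self, nodes=()):
        self._ring = []  # sorted list of (hash, node) positions
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str):
        bisect.insort(self._ring, (_hash(node), node))

    def remove_node(self, node: str):
        self._ring.remove((_hash(node), node))

    def get_node(self, key: str) -> str:
        if not self._ring:
            raise ValueError("empty ring")
        positions = [h for h, _ in self._ring]
        i = bisect.bisect_right(positions, _hash(key))
        return self._ring[i % len(self._ring)][1]  # wrap around past the top

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
before = {f"key-{i}": ring.get_node(f"key-{i}") for i in range(10)}
ring.add_node("node-d")
after = {k: ring.get_node(k) for k in before}
moved = sum(before[k] != after[k] for k in before)
print(f"{moved}/10 keys moved")  # only a fraction of keys are reassigned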
Networking Basics
 Ip Address: An IP address is a numerical label assigned to each device
connected to a computer network. It identifies the location of a device on
a network, allowing other devices to communicate with it and facilitating
data routing.
 Port: A port is a communication endpoint in an operating system that is
used to uniquely identify a specific process or network service. Ports
allow multiple services or processes to run simultaneously on a single
device.
 MAC Address: A MAC address is a unique identifier assigned to network
interfaces for communications on the physical network segment.
 Virtual Private Networks (VPNs): VPNs create secure, encrypted
connections over a less secure network (e.g., the internet), enabling
remote access and secure communication.
 Firewalls and Security: Firewalls enforce security policies by filtering
incoming and outgoing traffic based on predefined rules.
 Routing and Switching: Routing involves directing network
traffic between different networks, while switching involves forwarding
data within the same network.
 TCP (Transmission Control Protocol):
o Connection-Oriented: TCP establishes a reliable and ordered
connection between two devices before data exchange begins.
o Reliability: Provides reliable delivery of data with error-checking,
retransmission of lost packets, and in-order delivery.
o Flow Control: Manages data flow between sender and receiver to
prevent overwhelming the receiver with data.
 Use Cases:
o Web browsing: TCP is used by HTTP for loading web pages.
o Email: SMTP, POP, and IMAP protocols use TCP for sending and
receiving emails.
o File Transfer: FTP and SSH use TCP for secure file transfer.
o Streaming: TCP is used for streaming media where reliability and
order are crucial.
 UDP (User Datagram Protocol):
o Connectionless: UDP does not establish a connection before sending
data and does not guarantee delivery.
o Unreliable: Does not retransmit lost packets or guarantee ordering
(UDP includes only a basic checksum), leaving reliability to the
application layer.
o Low Overhead: Lightweight protocol with minimal processing and
transmission overhead.
 Use Cases:
o Real-time applications: Used in video conferencing, online gaming,
and VoIP (Voice over IP) where low latency is critical.
o DNS: UDP is used by DNS for quick lookups.
o Streaming: UDP can be used for live video or audio streaming where
occasional packet loss is acceptable.
o IoT: Used in IoT devices for transmitting small amounts of data
quickly.
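For contrast, the same exchange sketched over UDP: there is no connection setup and no delivery guarantee (host and port are again placeholders):

```python
import socket

# Receiver: bind and wait for a single datagram.
def udp_receiver(host="127.0.0.1", port=9001):
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind((host, port))
        data, addr = sock.recvfrom(1024)   # one datagram, from whoever sent it
        print(data, addr)

# Sender: fire-and-forget; the datagram may be lost, duplicated, or reordered.
def udp_sender(host="127.0.0.1", port=9001):
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(b"ping", (host, port)) # no handshake, no retransmission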
 Note: Application-layer protocols like HTTP, SMTP, FTP, etc., rely on
a transport-layer protocol, either TCP or UDP, to transmit their data.
Object Storage
 Object storage is a data storage architecture that manages data as
objects rather than traditional file systems or block storage.
 Each object typically includes data, metadata (attributes or tags), and
a unique identifier (key).
 Object storage is not a database in the traditional sense, as it does not
provide structured query capabilities like relational databases (e.g., SQL
databases).
 Instead, object storage is a data storage architecture optimized for
storing and managing large amounts of unstructured data as discrete
units called objects.
 Primarily supports basic operations like storing, retrieving, and deleting
objects. It lacks the ability to perform complex queries or transactions on
the data.
 Object Storage Model: Cloud storage services like Amazon S3, Google
Cloud Storage, or Azure Blob Storage use an object storage model.
Unlike traditional file systems that organize data in a hierarchical
structure, object storage uses a flat namespace where each object
(file) is stored as a standalone unit identified by a unique key
(often a URL or URI).
 Characteristics:
o Scalability: Object storage systems are highly scalable, capable of
storing vast amounts of unstructured data across distributed
infrastructure.
o Metadata Rich: Each object can be enriched with metadata (e.g.,
timestamps, content type), allowing for efficient indexing and search
operations.
o Durability: Object storage systems often provide high durability
through data replication and distribution across multiple nodes or data
centers.
o Access Methods: Objects are typically accessed via HTTP/S using
RESTful APIs, making them suitable for cloud-native and distributed
applications.
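As an illustration of this key-based, HTTP/REST access pattern, here is a minimal sketch using boto3, the AWS SDK for Python; the bucket name, object key, and local file are hypothetical:

```python
import boto3  # assumes the AWS SDK is installed and credentials are configured

s3 = boto3.client("s3")
BUCKET = "my-example-bucket"              # hypothetical bucket name

# Store an object under a flat key, attaching user-defined metadata.
with open("cat.jpg", "rb") as f:          # hypothetical local file
    s3.put_object(
        Bucket=BUCKET,
        Key="images/cat.jpg",             # the unique key; '/' is only a naming convention
        Body=f,
        Metadata={"uploaded-by": "notes-demo"},
    )

# Retrieve the object (and its metadata) by the same key.
resp = s3.get_object(Bucket=BUCKET, Key="images/cat.jpg")
data = resp["Body"].read()
print(resp["Metadata"])                   # {'uploaded-by': 'notes-demo'}
```

Note that despite the '/'-separated key, there is no real directory hierarchy; the namespace is flat and the key is simply an opaque identifier.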
 Use Cases:
o Ideal for storing large volumes of unstructured data such as images,
videos, documents, backups, and log files. It is commonly used in
cloud storage solutions.
o Cloud Storage: Object storage is widely used in cloud platforms
(e.g., Amazon S3, Azure Blob Storage, Google Cloud Storage) for
storing files, backups, and large-scale data sets.
o Content Delivery: Serving static content (e.g., images, videos) for
websites and applications.
 BLOB: BLOB" stands for Binary Large OBject. It refers to a collection of
binary data stored as a single entity in a database management system
(DBMS) or a file system.
o Binary Data: BLOBs are used to store binary data, which can include
images, videos, audio files, documents (like PDFs), and other
multimedia files. Unlike traditional text data, which can be easily
represented using characters and strings, binary data consists of
raw bytes that may not have a specific character encoding.
o Large Size: The term "Large" in BLOB emphasizes that these objects
can be of considerable size, potentially ranging from kilobytes to
gigabytes or even larger. This makes them suitable for storing
large multimedia files and other types of data that are not text-based.
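A minimal sketch of storing and reading back a BLOB, using Python's built-in sqlite3 module with an in-memory database; the bytes below are placeholder content standing in for a real file:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (name TEXT PRIMARY KEY, data BLOB)")

# Raw bytes, e.g. the contents of an image file; no character encoding involved.
payload = b"\x89PNG\r\n\x1a\n placeholder bytes"
conn.execute("INSERT INTO files VALUES (?, ?)", ("logo.png", payload))

# Read the BLOB back by key; the bytes round-trip unchanged.
(blob,) = conn.execute(
    "SELECT data FROM files WHERE name = ?", ("logo.png",)
).fetchone()
assert blob == payload
```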