
Notes for Designing Data-Intensive Applications


# 1. Reliable, Scalable, and Maintainable Applications (20)

- Reliability

- “continuing to work correctly, even when things go wrong.”

- tolerating hardware & software faults (being `resilient`)

- human error

- Scalability

- load & performance

- latency percentiles, throughput

- Maintainability

- operability

- simplicity

- evolvability

CPU is rarely a limiting factor for `data-intensive` apps (it would be for `compute-intensive` ones).

Data Intensive apps need to:

- Store data so that they, or another application, can find it again later (databases)

- Remember the result of an expensive operation, to speed up reads (caches - e.g. `Memcached`; see the cache-aside sketch after this list)

- Allow users to search data by keyword or filter it in various ways (search indexes, full-text search - `Elasticsearch or Solr`)

- Send a message to another process, to be handled asynchronously (stream processing)

- Periodically crunch a large amount of accumulated data (batch processing - `Hadoop`)
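
A minimal cache-aside sketch for the caching point above; a plain dict stands in for Memcached, and `expensive_query` and the key format are hypothetical names for illustration:

```python
cache = {}

def expensive_query(user_id):
    # Stand-in for a slow database aggregation.
    return {"user_id": user_id, "score": user_id * 42}

def get_user_stats(user_id):
    key = f"user_stats:{user_id}"
    if key in cache:                     # fast path: serve the remembered result
        return cache[key]
    result = expensive_query(user_id)    # slow path: compute once...
    cache[key] = result                  # ...and remember it for later reads
    return result
```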

The boundaries between databases and queues are blurred, so they all fall under the `Data Systems`
category.

- datastores that are also used as message queues (Redis)

- message queues with database-like durability guarantees (Apache Kafka)

## Reliability

Even in “noncritical” applications we have a responsibility to our users, so reliability is always important.

- fault = one component of the system deviates from its spec

- failure - the whole system stops providing service to the user

- not possible to avoid faults

- so the goal is to prevent faults from causing failures

- test faults by causing them deliberately, e.g. killing a process.

- Netflix uses `Chaos Monkey` for that

- hardware fault

- aim to have your app running on several servers (multi-machine redundancy)

- extra power sources in data centres

- software fault-tolerance techniques can hide certain types of faults from the end user.

- software errors

- a service that the system depends on that slows down, becomes unresponsive, or starts
returning corrupted responses.

- cascading failures, where a small fault in one component triggers a fault in another
component, which in turn triggers further faults

- think about the assumptions the software makes and handle faults when they turn out to be wrong

- human errors

- provide fully featured non-production sandbox environments where people can explore and experiment safely, using real data, without affecting real users.

- design abstractions, APIs, and admin interfaces in a way that minimizes opportunities for error

- telemetry (detailed and clear monitoring: performance metrics, error rates)

- proper testing to handle user errors gracefully

- provide training for users

## Scalability

Scalability is the term we use to describe a system’s ability to cope with increased load, meaning
you can add processing capacity in order to remain reliable under high load.

## Describing Load

What load are we talking about for the specific app? Number of active users, number of
messages per sec.

Twitter example: tweets are posted at 4.6k requests/sec on average, over 12k requests/sec at peak.

`fan-out` = In transaction processing systems, we use it to describe the number of requests to other services that we need to make in order to serve one incoming request.

Twitter home-timeline example, with two main approaches plus a hybrid (see the sketch after this list):

1. when a user loads timeline

- we fetch from db all users the reader follows

- we fetch all their posts to assemble the timeline

- very slow approach for pulling the timeline

- quick for posting

2. we cache the user's timeline

- each time a followed person posts something, the post gets inserted into the 'mailbox' list of each follower's timeline cache

- fetching the timeline list is then very quick

- but publishing a post is a slow operation for users with a lot of followers

3. mixed approach

- they mostly use approach 2

- celebrities with millions of followers are excluded from that approach

- their tweets are pulled into the timeline at the moment when the user opens the timeline (like in approach 1)

- this gives consistently good performance
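
A minimal sketch of the two strategies and the hybrid, using hypothetical in-memory stores; all names (`timeline_cache`, `CELEBRITY_THRESHOLD`, etc.) are illustrative, not Twitter's actual implementation:

```python
from collections import defaultdict

followers = defaultdict(set)        # author -> users who follow them
follows = defaultdict(set)          # user -> authors they follow
tweets = defaultdict(list)          # author -> list of their tweets
timeline_cache = defaultdict(list)  # user -> precomputed home timeline
CELEBRITY_THRESHOLD = 1_000_000     # assumed cutoff for the hybrid approach

def post_tweet(author, text):
    tweets[author].append(text)
    # Approach 2 (fan-out on write): push the tweet into every follower's
    # cached timeline, but skip celebrities, whose tweets are merged at read time.
    if len(followers[author]) < CELEBRITY_THRESHOLD:
        for follower in followers[author]:
            timeline_cache[follower].append((author, text))

def load_timeline(user):
    # Cheap read: start from the precomputed cache (approach 2)...
    timeline = list(timeline_cache[user])
    # ...and merge in celebrity tweets on demand (approach 1), as in the hybrid.
    for author in follows[user]:
        if len(followers[author]) >= CELEBRITY_THRESHOLD:
            timeline.extend((author, t) for t in tweets[author])
    return timeline
```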

## Describing Performance

- For batch processing systems such as Hadoop, performance is `throughput` — the number of records we can process per second.

- For online systems, performance is `response time` (time between request sent and response received) - not a single number. It differs from response to response, so performance is a distribution of response time values.

- response time = network delays + service time + queueing delays

- latency = the duration a request is waiting to be handled

For response time, using the average (mean) is not good because it doesn't show how many users actually experience delays. Better to use percentiles: the median (p50), where 50% of requests are faster and 50% are slower, plus higher percentiles such as p95 and p99 for the tail (a sketch follows below).
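
A minimal sketch (plain Python, made-up sample data) of why percentiles beat the mean for response times; `percentile` here is a simple nearest-rank helper, not a library function:

```python
def percentile(samples_ms, p):
    """Nearest-rank percentile: value below which roughly p% of samples fall."""
    ordered = sorted(samples_ms)
    index = max(0, int(round(p / 100 * len(ordered))) - 1)
    return ordered[index]

# Made-up response times for ten requests, in milliseconds.
response_times_ms = [12, 15, 18, 20, 22, 25, 30, 45, 120, 900]

mean = sum(response_times_ms) / len(response_times_ms)
print(f"mean ~ {mean:.0f} ms")                            # ~121 ms, dragged up by the outlier
print(f"p50  = {percentile(response_times_ms, 50)} ms")   # 22 ms: typical user experience
print(f"p99  = {percentile(response_times_ms, 99)} ms")   # 900 ms: the slow tail
```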

- service level objectives (SLOs) - internal targets for expected performance

- service level agreements (SLAs) - contracts with users that define the expected performance and availability

> An SLA may state that the service is considered to be up if it has a median response time of less than 200 ms and a 99th percentile under 1 s (if the response time is longer, it might as well be down), and the service may be required to be up at least 99.9% of the time.
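
A tiny, hedged sketch of checking measured numbers against the SLA quoted above (nearest-rank percentile, thresholds taken from the quote; names are illustrative):

```python
def meets_sla(response_times_ms, uptime_fraction):
    ordered = sorted(response_times_ms)
    median = ordered[len(ordered) // 2]                       # upper median, fine for a sketch
    p99 = ordered[max(0, int(round(0.99 * len(ordered))) - 1)]
    # SLA from the quote: median < 200 ms, p99 < 1 s, uptime >= 99.9%.
    return median < 200 and p99 < 1000 and uptime_fraction >= 0.999

print(meets_sla([12, 15, 18, 20, 22, 25, 30, 45, 120, 900], 0.9995))  # False: p99 too slow
```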

- `head-of-line blocking` - when requests are queued at the server and a few slow-to-process requests at the front of the queue hold up the quick requests behind them. Due to this effect, it is important to measure response times on the client side.

- `tail latency amplification` - one user request results in multiple requests to other services, and all the responses are needed before the user gets an answer. Even a single slow backend call is enough to slow down the whole response to the user (a worked example follows below).
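
A back-of-the-envelope illustration of this effect with assumed numbers (1% of calls per backend are "slow", one user request fans out to 100 backend calls):

```python
# Suppose each backend call exceeds its p99 (is "slow") with probability 0.01,
# and one user request must wait for all 100 backend calls to complete.
p_slow_single = 0.01
fan_out = 100

# Probability that at least one backend call is slow - enough to slow down
# the whole user-facing response.
p_any_slow = 1 - (1 - p_slow_single) ** fan_out
print(f"{p_any_slow:.0%} of user requests are affected")  # ~63%
```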

## How to cope with Load

- `scaling up` (vertical scaling, moving to a more powerful machine)

- `scaling out` or `shared-nothing` (horizontal scaling, distributing the load across multiple smaller
machines)

- `elastic` systems can detect load increases and add resources - used when load is unpredictable (see the toy sketch below)
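
A toy sketch of an "elastic" scaling rule, with assumed thresholds; real systems would use their platform's autoscaling facilities instead:

```python
def desired_instances(current_instances, cpu_utilization, target=0.6):
    # Scale out when average CPU utilization is above the target,
    # scale in when it is well below, otherwise keep the fleet as-is.
    if cpu_utilization > target:
        return current_instances + 1
    if cpu_utilization < target / 2 and current_instances > 1:
        return current_instances - 1
    return current_instances

print(desired_instances(4, 0.85))  # 5: under pressure, add a machine
print(desired_instances(4, 0.20))  # 3: mostly idle, remove one
```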

There is no universal scalable architecture (no `magic scaling sauce`). The problems are different:

- volume of reads

- writes

- data to store

- complexity of data

- required response time, etc.

## Maintainability

- Manageable for the operations team

- provide visual metrics

- avoid dependency on individual machines

- documentation

- ability to override defaults

- predictable behavior

- Simplicity of code

- avoid:

- tight coupling of modules,

- tangled dependencies,

- inconsistent naming and terminology,

- hacks

- focus on:

- good abstractions

- Evolvability: Making Change Easy

- For example, how would you “refactor” Twitter’s architecture for assembling home timelines
(“Describing Load” on page 11) from approach 1 to approach 2?

## Summary

- `functional requirements` (what user gets) - what the system should do, such as allowing data to
be stored, retrieved, searched, and processed in various ways.

- `nonfunctional requirements` (so that it works well) - general properties like security, reliability,
compliance, scalability, compatibility, and maintainability

