
Notes for Designing Data-Intensive Applications


# 1. Reliable, Scalable, and Maintainable Applications (20)

- Reliability

- “continuing to work correctly, even when things go wrong.”

- tolerating hardware & software faults (being `resilient`)

- human error

- Scalability

- load & performance

- latency percentiles, throughput

- Maintainability

- operability

- simplicity

- evolvability

CPU is rarely a limiting factor for `data-intensive` apps (it would be for `compute-intensive` ones).

Data Intensive apps need to:

- Store data so that they, or another application, can find it again later (databases)

- Remember the result of an expensive operation, to speed up reads (caches - e.g. `Memcached`; see the cache-aside sketch after this list)

- Allow users to search data by keyword or filter it in various ways (search indexes, full-text search - `Elasticsearch or Solr`)

- Send a message to another process, to be handled asynchronously (stream processing)

- Periodically crunch a large amount of accumulated data (batch processing - `Hadoop`)
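
A minimal cache-aside sketch for the caching point above; a plain dict stands in for Memcached, and `expensive_query` and the key format are hypothetical names for illustration:

```python
cache = {}

def expensive_query(user_id):
    # Stand-in for a slow database aggregation.
    return {"user_id": user_id, "score": user_id * 42}

def get_user_stats(user_id):
    key = f"user_stats:{user_id}"
    if key in cache:                     # fast path: serve the remembered result
        return cache[key]
    result = expensive_query(user_id)    # slow path: compute once...
    cache[key] = result                  # ...and remember it for later reads
    return result
```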

The boundaries between databases and queues are blurred, so they all fall under the `Data Systems`
category.

- datastores that are also used as message queues (Redis)

- message queues with database-like durability guarantees (Apache Kafka)

## Reliability

Even in “noncritical” applications we have a responsibility to our users, so reliability is always important.

- fault = one component of the system deviates from its spec

- failure - the whole system stops providing service to the user

- not possible to avoid faults

- so the goal is to prevent faults from causing failures

- test faults by causing them deliberately, e.g. killing a process.

- Netflix uses `Chaos Monkey` for that

- hardware fault

- aim to have your app running on several servers (multi-machine redundancy)

- extra power sources in data centres

- software fault-tolerance techniques can hide certain types of faults from the end user.

- software errors

- a service that the system depends on that slows down, becomes unresponsive, or starts
returning corrupted responses.

- cascading failures, where a small fault in one component triggers a fault in another
component, which in turn triggers further faults

- think about the assumptions the software makes and handle faults when they turn out to be wrong

- human errors

- provide fully featured non-production sandbox environments where people can explore and experiment safely, using real data, without affecting real users.

- design abstractions, APIs, and admin interfaces in a way that minimizes opportunities for error

- telemetry (detailed and clear monitoring: performance metrics, error rates)

- proper testing to handle user errors gracefully

- provide training for users

## Scalability

Scalability is the term we use to describe a system’s ability to cope with increased load, meaning
you can add processing capacity in order to remain reliable under high load.

## Describing Load

What load are we talking about for the specific app? Number of active users, number of
messages per sec.

Twitter example: tweets are posted at 4.6k requests/sec on average, over 12k requests/sec at peak.

`fan-out` = In transaction processing systems, we use it to describe the number of requests to other services that we need to make in order to serve one incoming request.

Twitter home-timeline example, with two main approaches plus a hybrid (see the sketch after this list):

1. when a user loads timeline

- we fetch from db all users the reader follows

- we fetch all their posts to assemble the timeline

- very slow approach for pulling the timeline

- quick for posting

2. we cache the user's timeline

- each time a followed person posts something, the post gets inserted into the 'mailbox' list of each follower's timeline cache

- fetching the timeline list is then very quick

- but publishing a post is a slow operation for users with a lot of followers

3. mixed approach

- they mostly use approach 2

- celebrities with millions of followers are excluded from that approach

- their tweets are pulled into the timeline at the moment when the user opens the timeline (like in approach 1)

- this gives consistently good performance
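
A minimal sketch of the two strategies and the hybrid, using hypothetical in-memory stores; all names (`timeline_cache`, `CELEBRITY_THRESHOLD`, etc.) are illustrative, not Twitter's actual implementation:

```python
from collections import defaultdict

followers = defaultdict(set)        # author -> users who follow them
follows = defaultdict(set)          # user -> authors they follow
tweets = defaultdict(list)          # author -> list of their tweets
timeline_cache = defaultdict(list)  # user -> precomputed home timeline
CELEBRITY_THRESHOLD = 1_000_000     # assumed cutoff for the hybrid approach

def post_tweet(author, text):
    tweets[author].append(text)
    # Approach 2 (fan-out on write): push the tweet into every follower's
    # cached timeline, but skip celebrities, whose tweets are merged at read time.
    if len(followers[author]) < CELEBRITY_THRESHOLD:
        for follower in followers[author]:
            timeline_cache[follower].append((author, text))

def load_timeline(user):
    # Cheap read: start from the precomputed cache (approach 2)...
    timeline = list(timeline_cache[user])
    # ...and merge in celebrity tweets on demand (approach 1), as in the hybrid.
    for author in follows[user]:
        if len(followers[author]) >= CELEBRITY_THRESHOLD:
            timeline.extend((author, t) for t in tweets[author])
    return timeline
```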

## Describing Performance

- For batch processing systems such as Hadoop, performance is `throughput` — the number of records we can process per second.

- For online systems, performance is `response time` (time between request sent and response received) - not a single number. It differs from response to response, so performance is a distribution of response time values.

- response time = network delays + service time + queueing delays

- latency = the duration a request is waiting to be handled

For response time, using the average (mean) is not good because it doesn't show how many users actually experience delays. Better to use percentiles: the median (p50), where 50% of requests are faster and 50% are slower, plus higher percentiles such as p95 and p99 for the tail (a sketch follows below).
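
A minimal sketch (plain Python, made-up sample data) of why percentiles beat the mean for response times; `percentile` here is a simple nearest-rank helper, not a library function:

```python
def percentile(samples_ms, p):
    """Nearest-rank percentile: value below which roughly p% of samples fall."""
    ordered = sorted(samples_ms)
    index = max(0, int(round(p / 100 * len(ordered))) - 1)
    return ordered[index]

# Made-up response times for ten requests, in milliseconds.
response_times_ms = [12, 15, 18, 20, 22, 25, 30, 45, 120, 900]

mean = sum(response_times_ms) / len(response_times_ms)
print(f"mean ~ {mean:.0f} ms")                            # ~121 ms, dragged up by the outlier
print(f"p50  = {percentile(response_times_ms, 50)} ms")   # 22 ms: typical user experience
print(f"p99  = {percentile(response_times_ms, 99)} ms")   # 900 ms: the slow tail
```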

- service level objectives (SLOs) - internal targets for expected performance

- service level agreements (SLAs) - contracts with users that define the expected performance and availability

> An SLA may state that the service is considered to be up if it has a median response time of less than 200 ms and a 99th percentile under 1 s (if the response time is longer, it might as well be down), and the service may be required to be up at least 99.9% of the time.
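
A tiny, hedged sketch of checking measured numbers against the SLA quoted above (nearest-rank percentile, thresholds taken from the quote; names are illustrative):

```python
def meets_sla(response_times_ms, uptime_fraction):
    ordered = sorted(response_times_ms)
    median = ordered[len(ordered) // 2]                       # upper median, fine for a sketch
    p99 = ordered[max(0, int(round(0.99 * len(ordered))) - 1)]
    # SLA from the quote: median < 200 ms, p99 < 1 s, uptime >= 99.9%.
    return median < 200 and p99 < 1000 and uptime_fraction >= 0.999

print(meets_sla([12, 15, 18, 20, 22, 25, 30, 45, 120, 900], 0.9995))  # False: p99 too slow
```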

- `head-of-line blocking` - when requests are queued at the server and a few slow-to-process requests at the front of the queue hold up the quick requests behind them. Due to this effect, it is important to measure response times on the client side.

- `tail latency amplification` - one user request results in multiple requests to other services, and all the responses are needed before the user gets an answer. Even a single slow backend call is enough to slow down the whole response to the user (a worked example follows below).
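
A back-of-the-envelope illustration of this effect with assumed numbers (1% of calls per backend are "slow", one user request fans out to 100 backend calls):

```python
# Suppose each backend call exceeds its p99 (is "slow") with probability 0.01,
# and one user request must wait for all 100 backend calls to complete.
p_slow_single = 0.01
fan_out = 100

# Probability that at least one backend call is slow - enough to slow down
# the whole user-facing response.
p_any_slow = 1 - (1 - p_slow_single) ** fan_out
print(f"{p_any_slow:.0%} of user requests are affected")  # ~63%
```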

## How to cope with Load

- `scaling up` (vertical scaling, moving to a more powerful machine)

- `scaling out` or `shared-nothing` (horizontal scaling, distributing the load across multiple smaller
machines)

- `elastic` systems can detect load increases and add resources - used when load is unpredictable (see the toy sketch below)
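
A toy sketch of an "elastic" scaling rule, with assumed thresholds; real systems would use their platform's autoscaling facilities instead:

```python
def desired_instances(current_instances, cpu_utilization, target=0.6):
    # Scale out when average CPU utilization is above the target,
    # scale in when it is well below, otherwise keep the fleet as-is.
    if cpu_utilization > target:
        return current_instances + 1
    if cpu_utilization < target / 2 and current_instances > 1:
        return current_instances - 1
    return current_instances

print(desired_instances(4, 0.85))  # 5: under pressure, add a machine
print(desired_instances(4, 0.20))  # 3: mostly idle, remove one
```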

There is no universal scalable architecture (no `magic scaling sauce`). The problems are different:

- volume of reads

- writes

- data to store

- complexity of data

- required response time, etc.

## Maintainability

- Manageable for the operations team

- provide visual metrics

- avoid dependency on individual machines

- documentation

- ability to override defaults

- predictable behavior

- Simplicity of code

- avoid:

- tight coupling of modules,

- tangled dependencies,

- inconsistent naming and terminology,

- hacks

- focus on:

- good abstractions

- Evolvability: Making Change Easy

- For example, how would you “refactor” Twitter’s architecture for assembling home timelines
(“Describing Load” on page 11) from approach 1 to approach 2?

## Summary

- `functional requirements` (what user gets) - what the system should do, such as allowing data to
be stored, retrieved, searched, and processed in various ways.

- `nonfunctional requirements` (so that it works well) - general properties like security, reliability,
compliance, scalability, compatibility, and maintainability

