
Notes for Designing Data Intensive Applications



# 6. Partitioning (Sharding) (18)

Partitioning is for achieving scalability. Each partition is a small database of its own, but the database can support
operations that involve multiple partitions at the same time.

## Summary

The partitioning scheme should depend on your data. The goal is to avoid hot spots (partitions with
disproportionately high load).

Partitioning approaches:

- Key range partitioning

  - sort keys, so that range queries are efficient

  - allocate key ranges to partitions

  - there is a risk of hot spots

  - partitions are rebalanced dynamically by splitting a range into 2 subranges and adding new partitions

- Hash partitioning

  - a hash function is applied to each key

  - each partition owns a range of hashes

  - range queries are not efficient, but load is more even

  - usually the number of partitions is fixed, but dynamic partitioning can also be used

  - each node can have several partitions
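
The two approaches above can be compared with a minimal sketch. Everything here is an assumption made for illustration: the split points in `RANGE_BOUNDARIES`, the fixed `NUM_PARTITIONS`, and the choice of MD5 as the hash function. It only shows how a key is mapped to a partition number, not how a real database does it.

```python
import bisect
import hashlib

# Hypothetical split points for key range partitioning:
# partition 0 holds keys < "g", partition 1 holds "g" <= key < "n",
# partition 2 holds "n" <= key < "t", partition 3 holds keys >= "t".
RANGE_BOUNDARIES = ["g", "n", "t"]

# Assumed fixed number of partitions for hash partitioning.
NUM_PARTITIONS = 4


def range_partition(key: str) -> int:
    """Key range partitioning: binary-search the sorted split points."""
    return bisect.bisect_right(RANGE_BOUNDARIES, key)


def hash_partition(key: str) -> int:
    """Hash partitioning: each partition owns an equal slice of a 32-bit hash space."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    h = int.from_bytes(digest[:4], "big")  # value in [0, 2**32)
    return h * NUM_PARTITIONS // 2**32


if __name__ == "__main__":
    for key in ["apple", "apricot", "melon", "zebra"]:
        print(key, range_partition(key), hash_partition(key))
```

With range partitioning, neighbouring keys such as `apple` and `apricot` land in the same partition, which is what makes range queries cheap and also what creates hot spots. With hash partitioning, neighbouring keys are usually scattered, so load is more even but a range query has to ask every partition.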


# DISTRIBUTED DATA

There are various reasons why you might want to distribute a database across multiple machines:

- Scalability

  - load is too big for one machine, so spread it across multiple machines

- Fault tolerance/high availability

  - have redundant machines in case some machines, the network, or a whole datacenter go down

- Latency

  - each user can be served from a datacenter that is geographically close to them

Two common ways of distributing data among multiple nodes:

- Replication

  - copy of the same data on several nodes

- Partitioning (Sharding)

  - splitting a big database into partitions on different nodes

> assume anything that can go wrong will go wrong

In distributed systems, part of the system can break in an unpredictable way while other parts
are still working. The situation may be nondeterministic: for the same actions we might get
different results (failure/success).

## System Design and Scalability

It's not about the ultimate design, but about my thought process and COMMUNICATION. Talk out
loud about every point - assumptions, tradeoffs, estimations.

Use the interviewer's guidance.

### Design task

1. Scope

- write down the main functionality expected

2. Make reasonable assumptions

- reasonable number of users per day

- memory is not infinite

3. Draw the major components

- walk through it end-to-end

- this is simple, no scalability yet

4. Identify the key issues

- the bottlenecks of the system

- maybe some queries are used more than others, etc?

5. Redesign for key issues

- communicate the limitations of your design, so that the interviewer knows you are aware of
them
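
The "reasonable assumptions" in step 2 can be turned into rough numbers with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the function name `estimate_load` and every input value below are made up for illustration, and it assumes each request stores a fixed number of bytes.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def estimate_load(daily_users: int, requests_per_user: int, bytes_per_request: int) -> dict:
    """Rough daily load estimate from assumed (hypothetical) inputs."""
    requests_per_day = daily_users * requests_per_user
    return {
        "requests_per_day": requests_per_day,
        # Average only; real traffic has peaks well above this.
        "avg_requests_per_second": requests_per_day / SECONDS_PER_DAY,
        # Assumes every request writes bytes_per_request bytes.
        "storage_bytes_per_day": requests_per_day * bytes_per_request,
    }


if __name__ == "__main__":
    # Assumed inputs: 1M users/day, 10 requests each, 1 kB stored per request.
    print(estimate_load(1_000_000, 10, 1_000))
```

Saying these numbers out loud, including which ones are guesses, is the point of the exercise: they show where the bottlenecks in step 4 are likely to be.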

### Scale task

1. Ask questions

2. Pretend that the data can all fit on one machine and there are no memory limitations. How
would you solve the problem? The answer to this question will provide the general outline for your
solution.

3. Now think about splitting up into different machines.

- is it a problem to split data into several machines?

- how does a machine identify where to look for the data?

4. Solve the problems (might involve altering the original design)
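
One possible answer to the "where to look for the data" question in step 3 is to derive the machine from the key itself. The sketch below is an assumption for illustration, not a recommended design: `ShardedStore` is a hypothetical class, the machines are plain in-memory dicts, and routing is hash modulo the number of machines.

```python
import hashlib


class ShardedStore:
    """Toy key-value store split across several in-memory 'machines'."""

    def __init__(self, num_machines: int) -> None:
        # Each dict stands in for one machine's local storage.
        self.machines = [dict() for _ in range(num_machines)]

    def machine_for(self, key: str) -> int:
        """Every client computes the same machine index from the key alone."""
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.machines)

    def put(self, key: str, value: str) -> None:
        self.machines[self.machine_for(key)][key] = value

    def get(self, key: str):
        # Returns None when the key is not stored.
        return self.machines[self.machine_for(key)].get(key)


if __name__ == "__main__":
    store = ShardedStore(num_machines=3)
    store.put("user:1", "Ada")
    print(store.machine_for("user:1"), store.get("user:1"))
```

A limitation worth mentioning to the interviewer: with plain modulo routing, changing the number of machines moves most keys to a different machine, which is one reason to assign keys to a fixed number of partitions instead and move whole partitions between machines.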
