
capacity planning for LAMP

what happens after you’re scalable

MySQL Conf and Expo April 2007

John Allspaw

Engineering Manager (Operations) at flickr (Yahoo!)

Yay!

You’re scalable! (or not)

Now you can simply add hardware as you need capacity.

(right ?)

But:

How many servers ?

BUT, um, wait

How many databases ?

How many webservers ?

How much shared storage ?

How many network switches ?

What about caching ?

How many CPUs in all of these ?

How much RAM ?

How many drives in each ?

WHEN should we order all of these ?

some stats

- ~35M photos in squid cache (total)

- ~2M photos in squid’s RAM

- ~470M photos, 4 or 5 sizes of each

- 38k req/sec to memcached (12M objects)

- 2 PB raw storage (consumed ~1.5 TB on Sunday)

capacity doesn't mean speed

capacity is for business

Buying enough for now:

not too much, not too soon, not too late

3 main parts

- Planning (what ?/why ?/when ?)

- Deployment (install/config/manage)

- Measurement (graph the world)

boring queueing theory

Forced Flow Law:    X_i = V_i × X_0

Little's Law:       N = X × R

Service Demand Law: D_i = V_i × S_i = U_i / X_0
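An illustrative plug-in of the Service Demand Law (numbers made up here, not from the talk): if a database slave peaks at X_0 = 1500 queries/sec while its disks sit at U_i = 60% utilization, then

D_i = U_i / X_0 = 0.60 / 1500 ≈ 0.4 ms of disk time per query

so, all else being equal, the disks would saturate somewhere around 1500 / 0.60 ≈ 2500 queries/sec.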

my theory

capacity planning math is based on real things, not abstract ones.

predicting the future

consumable (e.g., storage space)

concurrent usage (e.g., peak queries or requests per second)

considerations:

social applications

- Have the ‘network effect’

- Exponential growth

considerations:

social applications

Event-related growth

(press, news event, social trends, etc.)

Examples:

London bombing, holidays, tsunamis, etc.

What do you have NOW ?

When will your current capacity be depleted or outgrown ?

finding ceilings

MySQL (disk IO ?)

SQUID (disk IO ? or CPU ?)

memcached (CPU ? or network ?)

forget benchmarks

- boring

- not representative of real load

- usually not worth the time to use in capacity planning

test in production

what do you expect ?

define what is acceptable

examples:

squid hits should take less than X milliseconds

SQL queries less than Y milliseconds, and also keep up with replication
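For the replication half of that, a minimal sketch (not from the talk) that pushes slave lag into ganglia; it assumes a local mysql client, a monitoring user allowed to run SHOW SLAVE STATUS (the user and password here are placeholders), and the gmetric tool used later in this deck:

#!/bin/sh
# push replication lag (Seconds_Behind_Master) into ganglia so it can be graphed
# note: the value is NULL when replication is stopped
LAG=`mysql -u monitor -pPLACEHOLDER -e 'SHOW SLAVE STATUS\G' | awk '/Seconds_Behind_Master/ {print $2}'`
/usr/bin/gmetric -t uint16 -n slave-lag -v $LAG -u 'sec'

If the slave can't keep up, this number climbs; that's one of the ceilings in the MySQL slides below.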

measurement

accept the observer effect

measurement is a necessity.

it’s not optional.

http://ganglia.sf.net

[diagram: nodes in each cluster (db1, db2, db3 and www1, www2, www3) share their metrics as XML over UDP multicast (239.2.11.84 for the db cluster, 239.2.11.83 for the www cluster); gmetad then pulls each cluster's aggregated state from any one node as XML over TCP]

[diagram: the same topology with one www node dead ("boom!"); the surviving nodes still carry the cluster's state, so gmetad keeps collecting]
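A quick way to peek at that "XML over TCP" stream yourself (a sketch; assumes gmond is listening on its default port 8649 and nc is installed):

# dump the cluster state that gmetad would pull from this node
nc db1 8649 | head -20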

super simple graphing

#!/bin/sh

# take two 4-second extended iostat samples of sda and keep only the last one
/usr/bin/iostat -x 4 2 sda | grep -v '^$' | tail -4 > /tmp/disk-io.tmp

# %util is field 14 in this sysstat version's extended output
UTIL=`grep sda /tmp/disk-io.tmp | awk '{print $14}'`

/usr/bin/gmetric -t uint16 -n disk-util -v $UTIL -u '%'
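The same trick works for anything you can pull a number out of. For example, a sketch (not from the talk) that graphs memcached's cumulative hit ratio, assuming memcached on localhost:11211 and a netcat whose -q flag closes the connection after the reply:

#!/bin/sh
# cumulative hit ratio since the memcached restart, pushed into ganglia
STATS=`echo stats | nc -q 1 localhost 11211`
HITS=`echo "$STATS" | awk '/STAT get_hits/ {print $3}' | tr -d '\r'`
GETS=`echo "$STATS" | awk '/STAT cmd_get/ {print $3}' | tr -d '\r'`
RATIO=`echo "$HITS $GETS" | awk '{ if ($2 > 0) printf "%d", 100 * $1 / $2; else print 0 }'`
/usr/bin/gmetric -t uint16 -n memcached-hit-ratio -v $RATIO -u '%'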

memcached

what if you have graphs but no raw data ?

GraphClick

http://www.arizona-software.ch/applications/graphclick/en/

application usage

Usage stats are just as important as server stats!

Examples:

# of user registrations

# of photos uploaded every hour

not a straight line

another not straight line

but straight relationships!

measurement examples

queries

disk I/O

What we know now

we can do at least 1500 qps (peak) without:

- slave lag

- unacceptable avg response time

- waiting on disk IO

MySQL capacity

1. find ceilings of existing h/w

2. tie app usage to server stats

3. find ceiling:usage ratio (worked example after this list)

4. do this again:

- regularly (monthly)

- when new features are released

- when new h/w is deployed
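A worked example of the ratio, with made-up numbers (not from the talk): if one slave's ceiling is ~1500 qps at peak (step 1) and peak site traffic drives ~6000 qps of that query mix across the tier (step 2), the ratio says you need at least 4 slaves just to stay under the ceiling, before any headroom for growth, failure, or maintenance. Either side of that ratio can move, which is why step 4 matters.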

caching maximums

caching ceilings: squid, memcached

working-set specific:

- tiny enough to all fit in memory ?

- some/more/all on disk ?

- watch LRU churn (see the sketch below)
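One way to watch that churn (a sketch, not from the talk; assumes squidclient is installed, the cache manager "info" page is reachable on the default host/port, and that its field labels match this squid version):

#!/bin/sh
# graph squid's LRU reference age (in days) from the cache manager info page
AGE=`squidclient mgr:info | awk -F: '/Storage LRU Expiration Age/ {print $2}' | awk '{print $1}'`
/usr/bin/gmetric -t float -n squid-lru-age -v $AGE -u 'days'

When a full cache starts churning, this age drops; watch it alongside the hit ratio.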

churning full caches

Ceilings at:

- LRU reference age short enough that it starts hurting the hit ratio

- request rate high enough to push disk IO to 100%

squid requests and hits

squid hit ratio

LRU reference age

hit response times

What we know now

we can do at least 620 req/sec (peak) without:

- LRU affecting hit ratio

- unacceptable avg response time

- waiting too much on disk IO

not full caches

(working set smaller than max size)

- request rate large enough to bring network or CPU to 100%

deployment

Automated Deploy Tools

SystemImager/SystemConfigurator - http://wiki.systemimager.org

CVSup:

- http://www.cvsup.org

Subcon:

- http://code.google.com/p/subcon/

questions ?

http://flickr.com/photos/gaspi/62165296/

http://flickr.com/photos/marksetchell/27964330/

http://flickr.com/photos/sheeshoo/72709413/

http://flickr.com/photos/jaxxon/165559708/

http://flickr.com/photos/bambooly/298632541/

http://flickr.com/photos/colloidfarl/81564759/

http://flickr.com/photos/sparktography/75499095/