what happens after you’re scalable

capacity planning for LAMP

MySQL Conf and Expo April 2007

John Allspaw
Engineering Manager (Operations) at flickr (Yahoo!)

• You’re scalable! (or not ?)
• Now you can simply add hardware as you need capacity.


BUT, um, wait....

• But:
• How many servers ?
• How many databases ?
• How many webservers ?
• How much shared storage ?
• How many network switches ?
• What about caching ?
• How many CPUs in all of these ?
• How much RAM ?
• How many drives in each ?
• WHEN should we order all of these ?


some stats

• ~35M photos in squid cache (total)
• ~2M photos in squid’s RAM
• ~470M photos, 4 or 5 sizes of each
• 38k req/sec to memcached (12M objects)
• 2 PB raw storage (consumed about 1.5 TB on Sunday)


capacity doesn’t mean speed

capacity is for business

Buying:
• too much / enough for now / not enough
• too soon / too late


3 main parts

• Planning (what ? / why ? / when ?)
• Deployment (install/config/manage)
• Measurement (graph the world)

boring queueing theory
• Forced Flow Law: $X_i = V_i \times X_0$
• Little’s Law: $N = X \times R$
• Service Demand Law: $D_i = V_i \times S_i = U_i / X_0$
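
To make the three laws concrete, here is a minimal Python sketch with made-up numbers. The slide does not define its symbols, so the usual operational-analysis meanings are assumed: X_0 is system throughput, V_i the visits per request to resource i, X_i the throughput at resource i, S_i the service time per visit, U_i the utilization, D_i the service demand, N the number of requests in the system, and R the time each one spends there.

```python
# Operational laws from the slide, written as plain functions.
# Symbol meanings are assumed (standard operational analysis);
# every number used with them below is hypothetical.

def forced_flow(visits_i, system_throughput):
    """Forced Flow Law: X_i = V_i * X_0."""
    return visits_i * system_throughput


def littles_law(throughput, residence_time):
    """Little's Law: N = X * R."""
    return throughput * residence_time


def service_demand(visits_i, service_time_i):
    """Service Demand Law, first form: D_i = V_i * S_i."""
    return visits_i * service_time_i


def utilization(service_demand_i, system_throughput):
    """Service Demand Law rearranged: U_i = D_i * X_0."""
    return service_demand_i * system_throughput


# Hypothetical site: 100 page requests/sec, 5 DB queries per page,
# 1 ms of DB service time per query, 250 ms spent per page request.
x0 = 100.0
db_qps = forced_flow(5.0, x0)            # queries/sec hitting the DB
in_flight = littles_law(x0, 0.25)        # page requests in the system
db_demand = service_demand(5.0, 0.001)   # DB seconds per page request
db_util = utilization(db_demand, x0)     # fraction of DB time in use
```

The point of the sketch is only that each law is one multiplication; the inputs still have to come from measurement.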

my theory
planning math should be based on real things, not abstract ones.

predicting the



considerations: social applications
• Have the ‘network effect’
• Exponential growth

considerations: social applications
• Event-related growth (press, news event, social trends, etc.)
• Examples: London bombing, holidays, tsunamis, etc.

What do you have NOW ?
When will your current capacity be depleted or outgrown ?

finding ceilings
• MySQL (disk IO ?)
• SQUID (disk IO ? or CPU ?)
• memcached (CPU ? or network ?)

forget benchmarks

• boring
• to use in capacity planning...not usually worth the time
• not representative of real

test in production

what do you expect ?

• define what is acceptable
• examples:
  - squid hits should take less than X milliseconds
  - SQL queries less than Y milliseconds, and also keep up with replication
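
One way to turn "what is acceptable" into something you can check against measurements is sketched below in Python. This is an illustration, not the talk's tooling: the function name `check_expectations`, the use of the average (rather than a percentile), and the `max_lag_s` replication limit are all assumptions, and X and Y stay parameters because the slide leaves them open.

```python
from statistics import mean


def check_expectations(squid_hit_ms, sql_query_ms, slave_lag_s,
                       x_ms, y_ms, max_lag_s):
    """Compare measured samples against the acceptable limits.

    squid_hit_ms, sql_query_ms: lists of response times in milliseconds.
    slave_lag_s: list of replication lag samples in seconds.
    x_ms, y_ms: the X and Y limits from the slide (you pick the values).
    max_lag_s: hypothetical limit for "keeping up with replication".
    Returns a dict mapping each expectation to True (met) or False.
    """
    return {
        "squid_hits": mean(squid_hit_ms) < x_ms,
        "sql_queries": mean(sql_query_ms) < y_ms,
        "replication": max(slave_lag_s) <= max_lag_s,
    }


# Hypothetical samples: squid averages 15 ms, SQL averages 5 ms,
# and the worst slave lag seen was 1 second.
result = check_expectations(
    squid_hit_ms=[10, 20],
    sql_query_ms=[5, 5],
    slave_lag_s=[0, 1],
    x_ms=30, y_ms=4, max_lag_s=2,
)
```

With these made-up numbers the SQL expectation fails (5 ms average against a 4 ms limit) while the other two pass.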


accept the observer effect

measurement is a necessity.
it’s not optional.



super simple graphing


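# the command name at the start of each line was lost in extraction;
# iostat, grep and Ganglia's gmetric are assumed here
iostat -x 4 2 sda | grep -v ^$ | tail -4 > /tmp/disk-io.tmp
UTIL=`grep sda /tmp/disk-io.tmp | awk '{print $14}'`
gmetric -t uint16 -n diskutil -v$UTIL -u '%'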
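
The fragment above picks disk utilization out of a fixed column (`$14`), which depends on the column layout of whatever produced the text. As a hedged alternative, the Python sketch below looks the column up by its `%util` header instead. It assumes iostat-style extended output with a header row starting with "Device"; the sample text is invented for illustration, and the function name `disk_util` is hypothetical.

```python
def disk_util(iostat_text, device="sda"):
    """Return %util for `device` from the LAST report in the text.

    The last report is used because the first iostat report covers
    the time since boot rather than the sampling interval.
    """
    header = None
    util = None
    for line in iostat_text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines between reports
        if fields[0].startswith("Device"):
            header = fields  # remember the most recent header row
        elif fields[0] == device and header and "%util" in header:
            util = float(fields[header.index("%util")])
    if util is None:
        raise ValueError("no %util value found for " + device)
    return util


# Invented sample with two reports; values are not real measurements.
SAMPLE = """\
Device:  rrqm/s wrqm/s r/s  w/s  rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda      0.10   2.00   1.0  3.0  20.0   40.0   15.0     0.02     4.0   1.5   0.60

Device:  rrqm/s wrqm/s r/s  w/s  rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda      0.00   5.00   10.0 30.0 200.0  400.0  15.0     0.90     22.0  2.1   84.00
"""

latest = disk_util(SAMPLE)
```

The returned number can then be handed to whatever graphing system you use.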



what if you have graphs but no raw data ?
• GraphClick
• http://www.arizona-software.ch/applications/graphclick/en/

application usage
• Usage stats are just as important as server stats!
• Examples:
  - # of user registrations
  - # of photos uploaded every hour

not a straight line

another not straight line

but straight relationships!

measurement examples


disk I/O


What we know now
• we can do at least 1500 qps (peak) without:
  - slave lag
  - unacceptable avg response time
  - waiting on disk IO

MySQL capacity

• find ceilings of existing h/w
• tie app usage to server stats
• find ceiling:usage ratio
• do this again:
  - regularly (monthly)
  - when new features are released
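
The steps above (tie app usage to server stats, then find the ceiling:usage ratio) can be sketched as a straight-line fit. The Python below is a minimal illustration under stated assumptions: the data points are invented, the relationship is assumed to be linear, and `fit_line` and `usage_at_ceiling` are hypothetical helper names. Only the 1500 qps ceiling comes from the slides.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def usage_at_ceiling(usage, server_stat, ceiling):
    """App usage level at which the fitted server stat hits the ceiling."""
    slope, intercept = fit_line(usage, server_stat)
    return (ceiling - intercept) / slope


# Invented monthly observations: photos uploaded per hour vs. peak qps
# seen on one database. Real numbers would come from your own graphs.
uploads_per_hour = [10000, 20000, 30000, 40000]
peak_qps = [300, 500, 700, 900]

# 1500 qps is the measured ceiling quoted earlier in the deck.
limit = usage_at_ceiling(uploads_per_hour, peak_qps, ceiling=1500)
```

With this made-up data the fit is 0.02 qps per upload plus a base of 100 qps, so the ceiling is reached at roughly 70,000 uploads per hour. Re-running the fit monthly, or after a feature launch, shows whether the ratio has shifted.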

caching maximums

caching ceilings
squid, memcache
• working-set specific:
  - tiny enough to all fit in memory ?
  - some/more/all on disk ?
  - watch LRU churn

churning full caches
• Ceilings at:
  - LRU ref age small enough to affect hit ratio too much
  - Request rate large enough to affect disk IO (to 100%)
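
The two ceilings listed above can be written down as a simple check, sketched here in Python. The threshold values are placeholders you would pick from your own graphs, the units of LRU reference age are whatever your cache reports, and the function names `hit_ratio` and `cache_ceilings_hit` are hypothetical.

```python
def hit_ratio(hits, requests):
    """Fraction of requests served from cache."""
    if requests <= 0:
        raise ValueError("requests must be positive")
    return hits / requests


def cache_ceilings_hit(lru_ref_age, disk_util_pct,
                       min_lru_ref_age, max_disk_util_pct=100.0):
    """Return the list of ceilings a full, churning cache has reached.

    min_lru_ref_age: assumed threshold below which the hit ratio is
    judged to suffer too much; choose it from your own measurements.
    """
    reasons = []
    if lru_ref_age < min_lru_ref_age:
        reasons.append("LRU ref age too small")
    if disk_util_pct >= max_disk_util_pct:
        reasons.append("disk IO saturated")
    return reasons


# Hypothetical readings for one cache box.
ratio = hit_ratio(hits=75, requests=100)
status = cache_ceilings_hit(lru_ref_age=0.5, disk_util_pct=40.0,
                            min_lru_ref_age=1.0)
```

An empty list means neither ceiling has been reached yet for that box.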


requests and hits

squid hit ratio

LRU reference age

hit response times

What we know now
• we can do at least 620 req/sec (peak) without:
  - LRU affecting hit ratio
  - unacceptable avg response time
  - waiting too much on disk IO

not full caches
• (working set smaller than max size)
• request rate large enough to bring network or CPU to 100%


Automated Deploy Tools
• SystemImager/SystemConfigurat
• CVSup: http://www.cvsup.org
• Subcon: http://code.google.com/p/subcon/


• http://flickr.com/photos/gaspi/62165296/
• http://flickr.com/photos/marksetchell/2796
• http://flickr.com/photos/sheeshoo/72709413
• http://flickr.com/photos/jaxxon/165559708/
• http://flickr.com/photos/bambooly/29863254
• http://flickr.com/photos/colloidfarl/81564
• http://flickr.com/photos/sparktography/754

questions ?