capacity planning for LAMP

what happens after you’re scalable

MySQL Conf and Expo April 2007

John Allspaw
Engineering Manager (Operations) at flickr (Yahoo!)

Yay!
• You’re scalable! (or not)
• Now you can simply add hardware as you need capacity.
• (right?)
• But: how many servers?

BUT, um, wait....
• How many databases?
• How many webservers?
• How much shared storage?
• How many network switches?
• What about caching?
• How many CPUs in all of these?
• How much RAM?
• How many drives in each?
• WHEN should we order all of these?

some stats
• ~35M photos in squid cache (total)
• ~2M photos in squid’s RAM
• ~470M photos, 4 or 5 sizes of each
• 38k req/sec to memcached (12M objects)
• 2 PB raw storage (consumed about 1.5 TB on Sunday)

capacity

capacity doesn’t mean speed

capacity is for business

Buying
• how much: too much / enough for now / not enough
• when: too soon / too late

3 main parts
• Planning (what?/why?/when?)
• Deployment (install/config/manage)
• Measurement (graph the world)

boring queueing theory
• Forced Flow Law: $X_i = V_i \times X_0$
• Little’s Law: $N = X \times R$
• Service Demand Law: $D_i = V_i \times S_i = U_i / X_0$
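
A quick worked example of two of these laws, as a minimal sketch. The function names and the input numbers are invented for illustration; they are not measurements from this talk.

```sh
#!/bin/sh
# Little's Law: N = X * R
#   X = throughput (requests/sec), R = mean response time (sec),
#   N = mean number of requests in the system.
littles_law() {
    awk -v x="$1" -v r="$2" 'BEGIN { printf "%.1f\n", x * r }'
}

# Service Demand Law: D_i = U_i / X_0
#   U_i = utilization of resource i (0..1), X_0 = system throughput (requests/sec),
#   D_i = seconds of resource i consumed per request.
service_demand() {
    awk -v u="$1" -v x0="$2" 'BEGIN { printf "%.4f\n", u / x0 }'
}

littles_law 1500 0.020       # hypothetical: 1500 qps at 20 ms mean response time
service_demand 0.60 1500     # hypothetical: disk 60% busy at 1500 qps
```

With these made-up inputs, about 30 queries are in flight on average and each query costs about 0.4 ms of disk time.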

my theory
• capacity planning math is based on
real things, not abstract ones.

predicting the future

consumable

concurrent usage

considerations: social applications
• Have the ‘network effect’
• Exponential growth
• Event-related growth (press, news event, social trends, etc.)

Examples: London bombing, holidays, tsunamis, etc.

What do you have NOW ?
• When will your current capacity be depleted or outgrown?
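
For a consumable resource such as storage, a first estimate is simple division. This is a rough sketch only: `days_left` is a hypothetical helper, the "used" figure is invented, and it assumes constant daily growth, which real usage rarely follows.

```sh
#!/bin/sh
# days_left CAPACITY USED DAILY_GROWTH   (all in the same unit, e.g. TB)
# Prints the whole number of days until CAPACITY is reached at a constant rate.
days_left() {
    awk -v cap="$1" -v used="$2" -v rate="$3" 'BEGIN {
        if (rate <= 0) { print "inf"; exit }   # not growing: never depleted
        printf "%d\n", (cap - used) / rate
    }'
}

# Hypothetical inputs: 2000 TB raw, 1700 TB already used, 1.5 TB consumed per day.
days_left 2000 1700 1.5
```

Re-run the estimate whenever the growth rate changes; the answer is only as good as the most recent measurement.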

finding ceilings
• MySQL (disk IO?)
• SQUID (disk IO? or CPU?)
• memcached (CPU? or network?)

forget benchmarks
• boring
• not usually worth the time for capacity planning
• not representative of real load

test in production

what do you expect ?
• define what is acceptable
• examples:
  - squid hits should take less than X milliseconds
  - SQL queries should take less than Y milliseconds, and also keep up with replication
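
Once "acceptable" is written down as numbers, checking it can be automated. A minimal sketch, assuming the measured values are already available; `check`, the metric names, and the limits below are all hypothetical stand-ins for X and Y.

```sh
#!/bin/sh
# check NAME VALUE LIMIT
# Reports whether a measured value is within its agreed limit.
check() {
    awk -v n="$1" -v v="$2" -v max="$3" 'BEGIN {
        status = (v + 0 <= max + 0) ? "ok" : "OVER"
        printf "%s %s (limit %s): %s\n", n, v, max, status
    }'
}

check squid_hit_ms 12 50      # hypothetical: 12 ms measured, 50 ms allowed
check slave_lag_sec 45 30     # hypothetical: 45 s behind, 30 s allowed
```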

measurement

accept the observer effect
• measurement is a necessity.
• it’s not optional.

http://ganglia.sf.net

[Diagram: gmetad polls two clusters via XML over TCP. db1, db2 and db3 share metrics via xml over UDP multicast on 239.2.11.84; www1, www2 and www3 do the same on 239.2.11.83.]

[Diagram: the same layout with www1 failed ("boom!").]

super simple graphing
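#!/bin/sh
# Take two extended iostat samples for sda, 4 seconds apart, and keep the last report.
/usr/bin/iostat -x 4 2 sda | grep -v '^$' | tail -4 > /tmp/disk-io.tmp
# %util is column 14 in this iostat version; the column number varies across sysstat releases.
UTIL=`grep sda /tmp/disk-io.tmp | awk '{print $14}'`
# Publish the value to ganglia as the "disk-util" metric.
/usr/bin/gmetric -t uint16 -n disk-util -v "$UTIL" -u '%'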

memcached

what if you have graphs but no raw data ?
• GraphClick
• http://www.arizona-software.ch/applications/graphclick/en/

application usage
• Usage stats are just as important as server stats!
• Examples:
  - # of user registrations
  - # of photos uploaded every hour
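
Counting application events per hour can be as simple as the sketch below. The log format ("YYYY-MM-DD HH:MM:SS event") and the sample lines are made up for illustration; adapt the parsing to whatever your application actually records.

```sh
#!/bin/sh
# uploads_per_hour: read "YYYY-MM-DD HH:MM:SS event" lines on stdin and
# print the number of "upload" events in each hour, oldest hour first.
uploads_per_hour() {
    awk '$3 == "upload" { count[$1 " " substr($2, 1, 2) ":00"]++ }
         END { for (h in count) print h, count[h] }' | sort
}

# Hypothetical sample input.
printf '%s\n' \
    '2007-04-22 13:05:10 upload' \
    '2007-04-22 13:40:02 upload' \
    '2007-04-22 13:41:00 login' \
    '2007-04-22 14:01:59 upload' | uploads_per_hour
```

Each hourly count could then be published with gmetric, so that usage graphs sit next to the server graphs.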

not a straight line

another not straight line

but straight relationships!

measurement examples

queries

disk I/O

What we know now
• we can do at least 1500 qps (peak) without:
  - slave lag
  - unacceptable avg response time
  - waiting on disk IO

MySQL capacity
1. find ceilings of existing h/w
2. tie app usage to server stats
3. find ceiling:usage ratio
4. do this again:
   - regularly (monthly)
   - when new features are released
   - when new h/w is deployed
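
Steps 1 to 3 can be sketched as arithmetic. This assumes queries scale linearly with application usage; the helper names and every number except the 1500 qps ceiling are hypothetical.

```sh
#!/bin/sh
# usage_ceiling CEILING_QPS OBSERVED_QPS OBSERVED_USAGE
# How much app usage (e.g. uploads/hour) one server supports at its qps ceiling,
# assuming qps grows in direct proportion to usage.
usage_ceiling() {
    awk -v c="$1" -v q="$2" -v u="$3" 'BEGIN { printf "%d\n", c * u / q }'
}

# servers_needed FORECAST_USAGE USAGE_CEILING_PER_SERVER   (rounds up)
servers_needed() {
    awk -v f="$1" -v c="$2" 'BEGIN { n = int(f / c); if (n * c < f) n++; print n }'
}

# Hypothetical: 1000 qps observed at 40000 uploads/hour, ceiling of 1500 qps.
usage_ceiling 1500 1000 40000
# Hypothetical forecast of 150000 uploads/hour against that per-server ceiling.
servers_needed 150000 60000
```

The ratio shifts when features or hardware change, which is why step 4 says to repeat the measurement.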

caching maximums

caching ceilings: squid, memcached
• working-set specific:
  - small enough to fit entirely in memory?
  - some/more/all on disk?
  - watch LRU churn

churning full caches
• Ceilings at:
  - LRU ref age small enough to affect hit ratio too much
  - Request rate large enough to affect disk IO (to 100%)
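
Hit ratio is the number to watch while the cache churns. A minimal sketch of the calculation, with `hit_ratio` as a hypothetical helper and an invented hit count:

```sh
#!/bin/sh
# hit_ratio HITS REQUESTS
# Prints the percentage of requests served from cache, to one decimal place.
hit_ratio() {
    awk -v h="$1" -v r="$2" 'BEGIN {
        if (r == 0) { print "0.0"; exit }   # no traffic: avoid dividing by zero
        printf "%.1f\n", 100 * h / r
    }'
}

hit_ratio 527 620    # hypothetical: 527 hits out of 620 requests in one second
```

Graph this next to LRU reference age and disk IO; a falling ratio at a steady request rate suggests the working set no longer fits.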

squid requests and hits

squid hit ratio

LRU reference age

hit response times

What we know now
• we can do at least 620 req/sec (peak) without:
  - LRU affecting hit ratio
  - unacceptable avg response time
  - waiting too much on disk IO

not full caches
• (working set smaller than max size)
  - request rate large enough to bring network or CPU to 100%

deployment

Automated Deploy Tools
• SystemImager/SystemConfigurator: http://wiki.systemimager.org
• CVSup: http://www.cvsup.org
• Subcon: http://code.google.com/p/subcon/

questions ?
• http://flickr.com/photos/gaspi/62165296/
• http://flickr.com/photos/marksetchell/27964330/
• http://flickr.com/photos/sheeshoo/72709413/
• http://flickr.com/photos/jaxxon/165559708/
• http://flickr.com/photos/bambooly/298632541/
• http://flickr.com/photos/colloidfarl/81564759/
• http://flickr.com/photos/sparktography/75499095/