Capacity Management

for Web Operations

John Allspaw Operations Engineering

the book I’m writing

???

Rules of Thumb
Planning/Forecasting
Stupid Capacity Tricks

(with some Flickr statistics sprinkled in)

Things that can cause downtime

• bugs (disguised as capacity problems)
• edge cases (disguised as capacity problems)
• security incidents
• real capacity problems*

* (should be the last thing you need to worry about)

Capacity != Performance

• Forget about performance for right now
• Measure what you have right NOW
• Don’t count on it getting any better

Thank You HPC Industry!

• Automated Stuff
• Scalable Metric Collection/Display

A lot of great deployment and management tricks come from them, adopted by web ops

Good Measurement Tools

• record and store
• metrics in/out
• custom metrics
• easily compare
• lightweight-ish

Clouds need planning too

• Makes deployment and procurement easy and quick
• But clouds are still resources with costs and limits, just like your own stuff
• Black-boxes: you may need to pay even more attention than before

Metrics
System Statistics

“Application” Level

Metrics
(photos processed per minute)

(average processing time per photo)

(apache requests) (concurrent busy apache procs)

Metrics
App-level meets system-level

here, total CPU % = ~1.12 * (number of busy apache procs)
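A relationship like this one can be recovered from paired samples of an app-level metric and a system-level metric. The sketch below is only an illustration: the sample points are made up (chosen to lie on the 1.12 line), and the model assumes CPU scales linearly with busy procs and passes through zero.

```python
# Hypothetical (busy apache procs, total CPU %) samples, made up for illustration.
samples = [(10, 11.2), (25, 28.0), (40, 44.8), (60, 67.2)]


def fit_slope_through_origin(points):
    """Least-squares slope k for the model: cpu_pct = k * busy_procs."""
    numerator = sum(x * y for x, y in points)
    denominator = sum(x * x for x, _ in points)
    return numerator / denominator


k = fit_slope_through_origin(samples)
print(f"total CPU % ~= {k:.2f} * busy apache procs")
```

With real data the points will scatter, so it is worth plotting them before trusting a single slope.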

2400
photos per minute being uploaded right NOW

Ceilings

the most “work” your resources will allow before degradation or failure

Forget Benchmarking

Find your ceilings

[chart annotations: “what you have left”; “The End”]

Use real live production data to find ceilings

Production: “it’s like a lab, but bigger!”

Like: database ceilings

replication lag: bad!

Ceilings

sustained disk I/O wait >40% creates too much slave lag*
*for us, YMMV

35,000

photo requests per second on a Tuesday peak

Safety Factors

Ceiling * Factor of Safety = UR LIMITZ

Safety Factors

webserver!

Safety Factors
what you have left

“safe” ceiling @85% CPU

85% total CPU = ~76 busy apache procs
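The arithmetic behind this limit can be sketched as follows. It uses only the two numbers from the slides (the ~1.12 CPU-per-busy-proc ratio and the 85% “safe” ceiling); the function name is hypothetical.

```python
CPU_PCT_PER_BUSY_PROC = 1.12  # from the slides: total CPU % ~= 1.12 * busy apache procs
SAFE_CPU_CEILING_PCT = 85.0   # from the slides: "safe" ceiling at 85% total CPU


def busy_proc_limit(safe_cpu_pct, cpu_pct_per_proc):
    """Translate a CPU ceiling into a per-server limit on busy apache procs."""
    return round(safe_cpu_pct / cpu_pct_per_proc)


limit = busy_proc_limit(SAFE_CPU_CEILING_PCT, CPU_PCT_PER_BUSY_PROC)
print(limit)  # 76
```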

Safety Factors
Yahoo Front Page link to Chinese New Year photos

(8% spike)

(photo requests/second)

Forecasting

Fictional Example: webservers

Forecasting
peak of the week

Fictional example: 15 webservers. 1 week.

Forecasting

...bigger sample, 6 weeks....isolate the peaks...

Forecasting
not too shabby

now

...“Add a Trendline” with some decent correlation...

Forecasting
[chart annotations: “ceiling”; “when is this?”; “this will tell you when it is”; “what you have left”]

15 servers @ 76 busy apache proc limit = 1140 total procs

Forecasting

(1140-726) / 42.751 = 9.68

(week #10, duh)
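The same calculation as a small function, as a sketch under the assumption that growth stays linear. The inputs are the fictional numbers from the slides: 15 servers, a 76-proc limit per server, a current peak of 726 busy procs, and a trendline slope of 42.751 procs per week. The function name is hypothetical.

```python
def weeks_remaining(ceiling, current_peak, growth_per_week):
    """Weeks until a linear trend reaches the ceiling (infinite if not growing)."""
    if growth_per_week <= 0:
        return float("inf")
    return max(0.0, (ceiling - current_peak) / growth_per_week)


servers = 15
per_server_limit = 76                  # busy apache procs at the "safe" ceiling
ceiling = servers * per_server_limit   # 1140 total procs

weeks = weeks_remaining(ceiling, current_peak=726, growth_per_week=42.751)
print(f"{weeks:.2f} weeks left")       # 9.68 weeks left
```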

Forecasting Automation

• Writing Excel macros is boring
• All we want is “days remaining”, so all we need is the curve-fit

Use http://fityk.sf.net to automate the curve-fit

Forecasting

Fictional Example: storage consumption

Forecasting Automation

this will tell you when this is

actual flickr storage consumption from early 2005, in GB

Forecasting Automation
[jallspaw:~]$ cfityk ./fit-storage.fit
1> # Fityk script. Fityk version: 0.8.2
2> @0 < '/home/jallspaw/storage-consumption.xy'
15 points. No explicit std. dev. Set as sqrt(y)
3> guess Quadratic
New function %_1 was created.
4> fit
Initial values: lambda=0.001 WSSR=464.564
#1: WSSR=0.90162 lambda=0.0001 d(WSSR)=-463.663 (99.8059%)
#2: WSSR=0.736787 lambda=1e-05 d(WSSR)=-0.164833 (18.2818%)
#3: WSSR=0.736763 lambda=1e-06 d(WSSR)=-2.45151e-05 (0.00332729%)
#4: WSSR=0.736763 lambda=1e-07 d(WSSR)=-3.84524e-11 (5.21909e-09%)
Fit converged.
Better fit found (WSSR = 0.736763, was 464.564, -99.8414%).
5> info formula in @0
# storage-consumption
14147.4+146.657*x+0.786854*x^2
6> quit
bye...

cmd line script output

Forecasting Automation
fityk gave: y = 0.786854x^2 + 146.657x + 14147.4 (R^2 = 99.84)
Excel gave: y = 0.7675x^2 + 146.96x + 14147.3 (R^2 = 99.84)

(SAME)
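Once the curve-fit exists, getting to “when do we hit the ceiling” is a matter of solving the fitted formula for x. The sketch below uses the quadratic coefficients from the fityk output; the 20,000 GB capacity figure is hypothetical, and the unit of x is whatever the input data file uses (the slides do not say).

```python
import math

# Coefficients from the fityk output: y = A*x^2 + B*x + C, with y in GB.
A, B, C = 0.786854, 146.657, 14147.4


def x_at_capacity(capacity_gb, a=A, b=B, c=C):
    """Smallest non-negative x where the fitted curve reaches capacity_gb.

    Returns None if the curve has no non-negative crossing point.
    """
    discriminant = b * b - 4 * a * (c - capacity_gb)
    if discriminant < 0:
        return None
    root = math.sqrt(discriminant)
    candidates = [r for r in ((-b - root) / (2 * a), (-b + root) / (2 * a)) if r >= 0]
    return min(candidates) if candidates else None


HYPOTHETICAL_CAPACITY_GB = 20000  # made-up ceiling, not from the slides
x_full = x_at_capacity(HYPOTHETICAL_CAPACITY_GB)
print(x_full)
```

Subtracting the current x from the result gives the “remaining” figure, in the same unit as the input data.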

Capacity Health

• 12,629 Nagios checks
• 1,314 hosts
• 6 datacenters
• 4 photo “farms”
• farm = 2 DCs (east/west)

High and Low Water Marks
alert if higher

alert if lower

Per server, squid requests per second
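A high/low water mark check is simple to express in code. This is a minimal sketch, not how any particular monitoring system implements it; the threshold values and function name are hypothetical.

```python
def check_water_marks(value, low_mark, high_mark):
    """Classify a metric sample against low and high water marks."""
    if value > high_mark:
        return "ALERT_HIGH"   # more load than the box should take
    if value < low_mark:
        return "ALERT_LOW"    # suspiciously idle, e.g. dropped out of rotation
    return "OK"


# Hypothetical per-server squid thresholds, in requests per second.
LOW_MARK, HIGH_MARK = 100, 950

for sample in (20, 500, 1200):
    print(sample, check_water_marks(sample, LOW_MARK, HIGH_MARK))
```

The low mark matters as much as the high one: a server doing far less work than its peers is often broken, not lucky.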

A good dashboard looks something like...
type       #    limit/box   ceiling units   limit (total)   current (peak)   % peak    est. days left
www        20   80          busy procs      1,600           1,000            62.50%    36
shard db   20   40          % I/O wait      800             220              27.50%    120
squid      18   950         req/sec         17,100          11,400           66.67%    48

(yes, fictional numbers)
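The derived columns of such a dashboard (total limit and % of peak) follow directly from the per-box limit, the box count, and the current peak. The sketch below recomputes them from the fictional dashboard numbers; the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    name: str
    boxes: int
    limit_per_box: float
    units: str
    current_peak: float

    @property
    def total_limit(self):
        """Cluster-wide limit: per-box ceiling times number of boxes."""
        return self.boxes * self.limit_per_box

    @property
    def pct_of_limit(self):
        """Current peak as a percentage of the cluster-wide limit."""
        return 100.0 * self.current_peak / self.total_limit


clusters = [
    Cluster("www", 20, 80, "busy procs", 1000),
    Cluster("shard db", 20, 40, "% I/O wait", 220),
    Cluster("squid", 18, 950, "req/sec", 11400),
]

for c in clusters:
    print(f"{c.name:<9} {c.total_limit:>7.0f} {c.units:<11} {c.pct_of_limit:6.2f}%")
```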

Diagonal Scaling
vertically scaling your already horizontal nodes

• Image processing machines
• Replace Dell PE860s with HP DL140 G3s

Diagonal Scaling
example: image processing
4 cores

8 cores

(about the same CPU “usage” per box)

Diagonal Scaling
example: image processing throughput

~45 images/min @ peak

~140 images/min @ peak

(same CPU usage, but ~3x more work)
“processing” means making 4 sizes from originals

Diagonal Scaling
example: image processing
went from:

23 Dell PE860s, 23U rack, 3008.4 Watts, 1035 photos/min

to:

8 HP DL140 G3s, 8U rack, 1036.8 Watts, 1120 photos/min

!!! (75% faster, even)
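The before/after comparison can be checked with a few lines of arithmetic. The fleet numbers below are the ones from the slide; the helper names are hypothetical.

```python
before = {"boxes": 23, "rack_units": 23, "watts": 3008.4, "photos_per_min": 1035}
after = {"boxes": 8, "rack_units": 8, "watts": 1036.8, "photos_per_min": 1120}


def per_box_throughput(fleet):
    """Photos per minute handled by a single box at peak."""
    return fleet["photos_per_min"] / fleet["boxes"]


def photos_per_min_per_watt(fleet):
    """Throughput delivered per watt drawn by the whole fleet."""
    return fleet["photos_per_min"] / fleet["watts"]


def pct_change(old, new):
    return 100.0 * (new - old) / old


print("per box:", per_box_throughput(before), "->", per_box_throughput(after))
print(f"power: {pct_change(before['watts'], after['watts']):+.1f}%")
print(f"rack space: {pct_change(before['rack_units'], after['rack_units']):+.1f}%")
efficiency_gain = photos_per_min_per_watt(after) / photos_per_min_per_watt(before)
print(f"photos/min per watt: {efficiency_gain:.2f}x")
```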

3.52

terabytes will be consumed today

2nd Order Effects (beware the wandering bottleneck)

running hot, so add more

2nd Order Effects (beware the wandering bottleneck)

now these run hot

running great now, so more traffic!

Stupid Capacity Tricks
quick and dirty management
DSH http://freshmeat.net/projects/dsh
[root@netmon101 ~]# cat group.of.servers
www100
www118
dbcontacts3
admin1
admin2

Stupid Capacity Tricks
quick and dirty management
[root@netmon101 ~]# dsh -N group.of.servers
dsh> date
executing 'date'
www100:         Mon Jun 23 14:14:53 UTC 2008
www118:         Mon Jun 23 14:14:53 UTC 2008
dbcontacts3:    Mon Jun 23 07:14:53 PDT 2008
admin1:         Mon Jun 23 14:14:53 UTC 2008
admin2:         Mon Jun 23 14:14:53 UTC 2008
dsh>
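Output like this is easy to post-process. The sketch below is not part of dsh; it only assumes the "host: output" line layout shown in the transcript and the default `date` format, and flags hosts whose timezone differs from the majority.

```python
from collections import Counter


def odd_timezones(lines):
    """Return {host: tz} for hosts whose timezone differs from the most common one."""
    zones = {}
    for line in lines:
        host, sep, output = line.partition(":")
        fields = output.split()
        if not sep or len(fields) < 2:
            continue  # prompt or status line, not "host: date output"
        zones[host.strip()] = fields[-2]  # date prints "... HH:MM:SS TZ YYYY"
    if not zones:
        return {}
    majority, _count = Counter(zones.values()).most_common(1)[0]
    return {host: tz for host, tz in zones.items() if tz != majority}


transcript = [
    "dsh> date",
    "executing 'date'",
    "www100:         Mon Jun 23 14:14:53 UTC 2008",
    "www118:         Mon Jun 23 14:14:53 UTC 2008",
    "dbcontacts3:    Mon Jun 23 07:14:53 PDT 2008",
    "admin1:         Mon Jun 23 14:14:53 UTC 2008",
    "admin2:         Mon Jun 23 14:14:53 UTC 2008",
]
print(odd_timezones(transcript))  # {'dbcontacts3': 'PDT'}
```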

Stupid Capacity Tricks
Turn Stuff OFF

• Disable heavy-ish features of the site (on/off switches)
• We have 195 different things to disable in case of emergency.

Stupid Capacity Tricks
Turn Stuff OFF
uploads (photo)
uploads (video)
uploads by email
various API things
various mobile things
various search things

etc., etc.
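On/off switches of this kind can be as simple as a lookup consulted at the top of each heavy code path. This is a minimal in-memory sketch with hypothetical names; a real site would presumably keep the switch state somewhere all servers can read it.

```python
class FeatureSwitches:
    """In-memory on/off switches; every feature starts enabled."""

    def __init__(self, names):
        self._enabled = {name: True for name in names}

    def turn_off(self, name):
        self._set(name, False)

    def turn_on(self, name):
        self._set(name, True)

    def is_on(self, name):
        return self._enabled[name]

    def _set(self, name, value):
        if name not in self._enabled:
            raise KeyError(f"unknown switch: {name}")
        self._enabled[name] = value


switches = FeatureSwitches(["uploads_photo", "uploads_video", "uploads_by_email"])


def handle_photo_upload(payload):
    """Refuse politely when the feature is switched off, instead of falling over."""
    if not switches.is_on("uploads_photo"):
        return "503: photo uploads are temporarily disabled"
    return f"accepted {len(payload)} bytes"


print(handle_photo_upload(b"abc"))
switches.turn_off("uploads_photo")
print(handle_photo_upload(b"abc"))
switches.turn_on("uploads_photo")
```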

Stupid Capacity Tricks
Outages Happen

• Host your outage/status/blog page in more than one datacenter.
• Tell your users WTF is going on; they’ll appreciate it.

Stupid Capacity Tricks
Hit the Pause Button

• Bake the dynamic into static
• Some Y! properties have a big red button to instantly bake (and un-bake) at will
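One way to picture “baking” is a page object that serves a static copy from disk whenever one exists and renders dynamically otherwise. This is only a sketch of the idea, not how any Y! property does it; the class, the render function, and the file name are hypothetical.

```python
import tempfile
from pathlib import Path


class BakeablePage:
    """Serve a baked static copy when one exists, else render dynamically."""

    def __init__(self, render, static_path):
        self._render = render                  # callable returning the page as a string
        self._static_path = Path(static_path)

    def bake(self):
        self._static_path.write_text(self._render(), encoding="utf-8")

    def unbake(self):
        if self._static_path.exists():
            self._static_path.unlink()

    def serve(self):
        if self._static_path.exists():
            return self._static_path.read_text(encoding="utf-8")
        return self._render()


render_calls = []


def render_home():
    """Stand-in for an expensive dynamic page render."""
    render_calls.append(1)
    return f"<h1>home, render #{len(render_calls)}</h1>"


with tempfile.TemporaryDirectory() as tmp:
    page = BakeablePage(render_home, Path(tmp) / "home.html")
    live = page.serve()        # rendered dynamically
    page.bake()                # one more render, written to disk
    baked_1 = page.serve()     # read from disk, no render
    baked_2 = page.serve()
    page.unbake()
    live_again = page.serve()  # dynamic again
```

While baked, repeated requests cost a file read instead of a render, which is the point of the pause button.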


thanks

We’re Hiring! flickr.com/jobs

Come see me!

questions?
