
Capacity Management
for Web Operations

John Allspaw
Operations Engineering
the book I’m writing
???
Rules of Thumb

Planning/Forecasting

Stupid Capacity Tricks

(with some Flickr statistics sprinkled in)


Things that can cause
downtime
bugs (disguised as capacity problems)
edge cases (disguised as capacity problems)

security incidents
real capacity problems*

* (should be the last thing you need to worry about)


Capacity != Performance

Forget about performance for right now
Measure what you have right NOW
Don’t count on it getting any better
Thank You HPC Industry!

Automated Stuff
Scalable Metric Collection/Display

a lot of great deployment and management tricks come from them, adopted by web ops
Good Measurement Tools
record and store
metrics in/out
custom metrics
easily compare
lightweight-ish

Clouds need planning too

Makes deployment and procurement easy and quick
But clouds are still resources with
costs and limits, just like your own
stuff
Black-boxes: you may need to pay
even more attention than before
Metrics
System Statistics
Metrics
“Application” Level
(photos processed per minute)

(average processing time per photo)

(apache requests)
(concurrent busy apache procs)
Metrics
App-level meets system-level

here, total CPU = ~1.12 * # busy apache procs
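A ratio like the one on this slide can be estimated from paired samples of the two metrics. This is a minimal sketch, assuming a straight line through the origin is a reasonable model; the sample values are hypothetical (chosen to follow the slide's ~1.12 ratio), not Flickr data.

```python
def slope_through_origin(xs, ys):
    """Least-squares slope k for the model y = k * x (no intercept)."""
    sxx = sum(x * x for x in xs)
    if sxx == 0:
        raise ValueError("need at least one non-zero x value")
    return sum(x * y for x, y in zip(xs, ys)) / sxx


# Hypothetical paired samples taken at the same moments.
busy_procs = [10, 25, 40, 60, 75]            # concurrent busy apache procs
total_cpu = [11.2, 28.0, 44.8, 67.2, 84.0]   # total CPU %

cpu_per_proc = slope_through_origin(busy_procs, total_cpu)
print(f"total CPU ~= {cpu_per_proc:.2f} * busy apache procs")
```

With real data the points will scatter, so plot them before trusting a single ratio.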


2400 photos per minute being uploaded right NOW (Tuesday


Ceilings
the most “work” your resources will allow before degradation or failure
Forget Benchmarking
Find your ceilings

what you have left

The End
Use real live production data
to find ceilings

Production: “it’s like a lab, but bigger!”


Like: database ceilings

replication lag: bad!


Ceilings

waiting on disk too much: sustained disk I/O wait >40% creates slave lag*
*for us, YMMV
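A ceiling like this one can be watched for in collected metrics. The sketch below treats "sustained" as a run of consecutive samples above the threshold; the run length of 3 and the sample series are assumptions for illustration, and only the 40% figure comes from the slide.

```python
def sustained_above(samples, threshold, min_run):
    """True if `samples` contains at least `min_run` consecutive values above `threshold`."""
    run = 0
    for value in samples:
        run = run + 1 if value > threshold else 0
        if run >= min_run:
            return True
    return False


IOWAIT_CEILING_PCT = 40.0  # from the slide ("for us, YMMV")
MIN_RUN = 3                # assumption: 3 samples in a row counts as sustained

spiky = [12, 55, 20, 48, 18, 22]   # hypothetical: brief spikes only
hot = [30, 42, 45, 51, 47, 33]     # hypothetical: stays above the ceiling

print(sustained_above(spiky, IOWAIT_CEILING_PCT, MIN_RUN))
print(sustained_above(hot, IOWAIT_CEILING_PCT, MIN_RUN))
```

Requiring a run rather than a single sample keeps short spikes from being mistaken for the ceiling.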
35,000 photo requests per second on a Tuesday peak
Safety Factors
Safety Factors

Ceiling * Factor of Safety = UR LIMITZ


Safety Factors

webserver!
Safety Factors
what you have left

“safe”
ceiling
@85% CPU

85% total CPU = ~76 busy apache procs
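The arithmetic behind this number can be written out as a quick check. It assumes the ~1.12 CPU-per-busy-proc ratio shown earlier in the deck still holds; the function name is illustrative.

```python
CPU_PER_BUSY_PROC = 1.12  # total CPU % per busy apache proc (from the deck)
SAFE_CPU_PCT = 85.0       # the "safe" ceiling chosen on this slide


def safe_busy_procs(safe_cpu_pct, cpu_per_proc):
    """Busy-proc count that corresponds to the safe CPU ceiling."""
    return round(safe_cpu_pct / cpu_per_proc)


per_server_limit = safe_busy_procs(SAFE_CPU_PCT, CPU_PER_BUSY_PROC)
print(per_server_limit)  # 76
```

Expressing the limit in busy procs rather than CPU % is what lets the forecasting slides count capacity per webserver.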


Safety Factors
Yahoo Front Page link to Chinese New Year photos
(8% spike)

(photo requests/second)
Forecasting
Forecasting

Fictional Example:
webservers
Forecasting

peak of the week

Fictional example: 15 webservers. 1 week.


Forecasting

...bigger sample, 6 weeks....isolate the peaks...
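Isolating the peaks amounts to keeping only the highest value seen in each week. A minimal sketch, assuming samples arrive as (week number, value) pairs; the numbers are made up for illustration.

```python
def weekly_peaks(samples):
    """Return [(week, peak_value)] sorted by week, from (week, value) samples."""
    peaks = {}
    for week, value in samples:
        if week not in peaks or value > peaks[week]:
            peaks[week] = value
    return sorted(peaks.items())


# Hypothetical samples of total busy apache procs across the fleet.
samples = [(1, 540), (1, 610), (1, 580), (2, 655), (2, 600), (3, 690), (3, 700)]
print(weekly_peaks(samples))
```

Fitting a trend to peaks rather than averages matters because the ceiling is hit at peak, not at the mean.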


Forecasting

not too shabby

now

...“Add a Trendline” with some decent correlation...


Forecasting

ceiling this will tell you when it is

when is this?
what you have left

15 servers @76 busy apache proc limit = 1140 total procs


Forecasting

(1140-726) / 42.751 = 9.68

(week #10, duh)
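The same calculation can be done outside a spreadsheet. The sketch below fits a straight line to weekly peaks and solves for the week at which the ceiling is reached. It reads the slide's 726 as the trendline intercept and 42.751 as its slope, which is an assumption about the chart; the six weekly peaks are hypothetical.

```python
import math


def linear_fit(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def x_at_ceiling(ceiling, slope, intercept):
    """x value at which the trendline reaches `ceiling`."""
    return (ceiling - intercept) / slope


# Hypothetical weekly peaks (total busy apache procs), weeks 1..6.
weeks = [1, 2, 3, 4, 5, 6]
peaks = [740, 780, 820, 860, 900, 940]
slope, intercept = linear_fit(weeks, peaks)

# The slide's own numbers: 15 servers * 76 procs = 1140 total procs.
slide_week = x_at_ceiling(1140, 42.751, 726)
print(f"{slide_week:.2f} -> week #{math.ceil(slide_week)}")
```

Rounding up is the conservative choice: the ceiling is crossed partway through week 10.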


Forecasting Automation

Writing Excel macros is boring


All we want is “days remaining”, so
all we need is the curve-fit

Use http://fityk.sf.net to
automate the curve-fit
Forecasting

Fictional Example:
storage consumption
Forecasting Automation

this will tell you when this is

actual flickr storage consumption from early 2005, in GB


(ceiling is fictional)
Forecasting Automation
[jallspaw:~]$ cfityk ./fit-storage.fit    <- cmd line script
1> # Fityk script. Fityk version: 0.8.2    <- output
2> @0 < '/home/jallspaw/storage-consumption.xy'
15 points. No explicit std. dev. Set as sqrt(y)
3> guess Quadratic
New function %_1 was created.
4> fit
Initial values: lambda=0.001 WSSR=464.564
#1: WSSR=0.90162 lambda=0.0001 d(WSSR)=-463.663 (99.8059%)
#2: WSSR=0.736787 lambda=1e-05 d(WSSR)=-0.164833 (18.2818%)
#3: WSSR=0.736763 lambda=1e-06 d(WSSR)=-2.45151e-05 (0.00332729%)
#4: WSSR=0.736763 lambda=1e-07 d(WSSR)=-3.84524e-11 (5.21909e-09%)
Fit converged.
Better fit found (WSSR = 0.736763, was 464.564, -99.8414%).
5> info formula in @0
# storage-consumption
14147.4+146.657*x+0.786854*x^2
6> quit
bye...
Forecasting Automation
fityk gave:
y = 0.786854x^2 + 146.657x + 14147.4
(R^2 = 99.84)
Excel gave:
y = 0.7675x^2 + 146.96x + 14147.3
(R^2 = 99.84)

(SAME)
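Once the curve-fit is in hand, "when do we hit the ceiling" is a quadratic equation. The sketch below uses the fityk coefficients from this slide and a hypothetical ceiling of 20,000 GB (the deck's ceiling is fictional and not stated); x is in whatever unit the data file uses. It assumes a > 0 and a ceiling at or above the curve's starting value.

```python
import math

# Coefficients from the fityk fit: y = A*x^2 + B*x + C (y in GB).
A, B, C = 0.786854, 146.657, 14147.4

HYPOTHETICAL_CEILING_GB = 20000.0  # assumption, for illustration only


def x_at_ceiling(a, b, c, ceiling):
    """Non-negative x where a*x^2 + b*x + c equals `ceiling` (a > 0, ceiling >= c)."""
    disc = b * b - 4 * a * (c - ceiling)
    if disc < 0:
        raise ValueError("curve never reaches the ceiling")
    return (-b + math.sqrt(disc)) / (2 * a)


x_hit = x_at_ceiling(A, B, C, HYPOTHETICAL_CEILING_GB)
print(f"ceiling reached at x = {x_hit:.1f}")
```

Subtracting the current x from this value gives the "days remaining" figure the automation slide asks for.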
Capacity Health

12,629 nagios checks


1314 hosts
6 datacenters
4 photo “farms”
farm = 2 DCs (east/west)
High and Low Water Marks

alert if higher

alert if lower

Per server, squid requests per second
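A water-mark check is a two-sided threshold: too high means the box is near its ceiling, too low usually means something upstream broke. A minimal sketch; the threshold values are hypothetical and would normally live in the monitoring system's configuration.

```python
def check_water_marks(value, low, high):
    """Classify a metric against low and high water marks."""
    if value > high:
        return "HIGH"
    if value < low:
        return "LOW"
    return "OK"


LOW_MARK = 200    # hypothetical: fewer requests/sec than this is suspicious
HIGH_MARK = 900   # hypothetical: more than this is too close to the ceiling

for reqs_per_sec in (150, 600, 950):
    print(reqs_per_sec, check_water_marks(reqs_per_sec, LOW_MARK, HIGH_MARK))
```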


A good dashboard looks
something like...
type      #   limit/box  ceiling units  limit (total)  current (peak)  % peak   est. days left
www       20  80         busy procs     1600           1000            62.50%   36
shard db  20  40         I/O wait %     800            220             27.50%   120
squid     18  950        req/sec        17,100         11,400          66.67%   48

(yes, fictional numbers)
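The derived columns of such a dashboard follow from three inputs per row: box count, per-box limit, and current peak. This sketch recomputes the total limit and percent-of-peak from the fictional numbers above; the "days left" column needs a growth forecast and is not computed here.

```python
# (type, number of boxes, limit per box, units, current peak); fictional numbers.
ROWS = [
    ("www", 20, 80, "busy procs", 1000),
    ("shard db", 20, 40, "I/O wait %", 220),
    ("squid", 18, 950, "req/sec", 11400),
]


def summarize(count, limit_per_box, peak):
    """Return (total limit, percent of limit used at peak)."""
    total = count * limit_per_box
    return total, round(100 * peak / total, 2)


for name, count, limit_per_box, units, peak in ROWS:
    total, pct = summarize(count, limit_per_box, peak)
    print(f"{name:<9} {total:>6} {units:<11} {pct:6.2f}%")
```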


Diagonal Scaling

vertically scaling your already horizontal nodes

Image processing machines


Replace Dell PE860s with HP DL140 G3s
Diagonal Scaling
example: image processing

4 cores

8 cores

(about the same CPU “usage” per box)


Diagonal Scaling
example: image processing throughput

~45 images/min @ peak

~140 images/min @ peak

(same CPU usage, but ~3x more work)


“processing” means making 4 sizes from originals
Diagonal Scaling
example: image processing
went from:
23 Dell PE860s     3008.4 Watts   1035 photos/min   23U rack
to:
8 HP DL140 G3s     1036.8 Watts   1120 photos/min   8U rack
!!! (75% faster, even)
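The before/after figures can be compared on power and on work per watt. A small sketch using only the numbers on this slide; the dictionary layout and variable names are illustrative.

```python
before = {"boxes": 23, "watts": 3008.4, "photos_per_min": 1035, "rack_units": 23}
after = {"boxes": 8, "watts": 1036.8, "photos_per_min": 1120, "rack_units": 8}


def photos_per_watt(fleet):
    """Throughput per watt for a fleet."""
    return fleet["photos_per_min"] / fleet["watts"]


power_saved_pct = 100 * (1 - after["watts"] / before["watts"])
efficiency_gain = photos_per_watt(after) / photos_per_watt(before)

print(f"power saved: {power_saved_pct:.1f}%")
print(f"photos/min per watt: {efficiency_gain:.2f}x")
```

Rack space drops in the same proportion as box count here, since both models are listed at 1U per box.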
3.52 terabytes will be consumed today (on a


2nd Order Effects
(beware the wandering bottleneck)

running hot,
so add more
2nd Order Effects
(beware the wandering bottleneck)

running great now, so more traffic!
now these run hot
Stupid Capacity Tricks
Stupid Capacity Tricks
quick and dirty management
DSH
http://freshmeat.net/projects/dsh
[root@netmon101 ~]# cat group.of.servers
www100
www118
dbcontacts3
admin1
admin2
Stupid Capacity Tricks
quick and dirty management

[root@netmon101 ~]# dsh -N group.of.servers

dsh> date
executing 'date'
www100: Mon Jun 23 14:14:53 UTC 2008
www118: Mon Jun 23 14:14:53 UTC 2008
dbcontacts3: Mon Jun 23 07:14:53 PDT 2008
admin1: Mon Jun 23 14:14:53 UTC 2008
admin2: Mon Jun 23 14:14:53 UTC 2008
dsh>
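In the `date` output above, dbcontacts3 reports PDT while every other host reports UTC. Spotting that kind of odd one out can be automated. This is a minimal sketch that parses `host: output` lines as shown on the slide; it assumes the default `date` format, where the timezone is the second-to-last field, and the function name is illustrative.

```python
from collections import Counter

DSH_OUTPUT = """\
www100: Mon Jun 23 14:14:53 UTC 2008
www118: Mon Jun 23 14:14:53 UTC 2008
dbcontacts3: Mon Jun 23 07:14:53 PDT 2008
admin1: Mon Jun 23 14:14:53 UTC 2008
admin2: Mon Jun 23 14:14:53 UTC 2008
"""


def odd_timezones(dsh_output):
    """Return {host: timezone} for hosts whose timezone differs from the majority."""
    zones = {}
    for line in dsh_output.strip().splitlines():
        host, _, stamp = line.partition(": ")
        fields = stamp.split()
        if len(fields) >= 2:
            zones[host] = fields[-2]  # timezone sits just before the year
    if not zones:
        return {}
    majority, _ = Counter(zones.values()).most_common(1)[0]
    return {host: tz for host, tz in zones.items() if tz != majority}


print(odd_timezones(DSH_OUTPUT))
```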
Stupid Capacity Tricks
Turn Stuff OFF

Disable heavy-ish features of the site (on/off switches)

We have 195 different things to disable in case of emergency.
Stupid Capacity Tricks
Turn Stuff OFF
uploads (photo)
uploads (video)
uploads by email
various API things
various mobile things
various search things
etc., etc.
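An on/off switch can be as simple as a named boolean that request handlers consult before doing expensive work. The sketch below is one hypothetical shape for it, not how Flickr implemented theirs; the class and feature names are invented for illustration, and a real version would keep state somewhere every webserver can read.

```python
class FeatureSwitches:
    """In-memory on/off switches; every feature starts enabled."""

    def __init__(self, names):
        self._enabled = {name: True for name in names}

    def disable(self, name):
        self._set(name, False)

    def enable(self, name):
        self._set(name, True)

    def is_enabled(self, name):
        return self._enabled[name]

    def _set(self, name, state):
        if name not in self._enabled:
            raise KeyError(f"unknown feature: {name}")
        self._enabled[name] = state


# Hypothetical feature names modeled on the list above.
switches = FeatureSwitches(["uploads_photo", "uploads_video", "uploads_email"])
switches.disable("uploads_video")  # emergency: shed the heaviest work first

if switches.is_enabled("uploads_video"):
    print("accepting video uploads")
else:
    print("video uploads are temporarily off")
```

Rejecting unknown names guards against a typo silently disabling nothing during an incident.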
Stupid Capacity Tricks
Outages Happen
Host your outage/status/blog page
in more than one datacenter.
Tell your users WTF is going on,
they’ll appreciate it.
Stupid Capacity Tricks
Hit the Pause Button

Bake the dynamic into static


Some Y! properties have a big red button to instantly bake (and un-bake) at will
thanks
http://flickr.com/photos/bondidwhat/402089763/
http://flickr.com/photos/74876632@N00/2394833962/
http://flickr.com/photos/42311564@N00/220394633/
http://flickr.com/photos/unloveable/2422483859/
http://flickr.com/photos/absolutwade/149702085/
http://flickr.com/photos/krawiec/521836276/
http://flickr.com/photos/eschipul/1560875648/
http://flickr.com/photos/library_of_congress/2179060841/
http://flickr.com/photos/jekkyl/511187885/
http://flickr.com/photos/ab8wn/368021672/
http://flickr.com/photos/jaxxon/165559708/
http://flickr.com/photos/sparktography/75499095/
We’re Hiring!
flickr.com/jobs

Come see me!


questions?
