
Capacity Management for Web Operations

John Allspaw
Operations Engineering
the book I’m writing
???
Rules of Thumb

Planning/Forecasting

Stupid Capacity Tricks

(with some Flickr statistics sprinkled in)


Things that can cause downtime
• bugs (disguised as capacity problems)
• edge cases (disguised as capacity problems)
• security incidents
• real capacity problems*
* (should be the last thing you need to worry about)
Capacity != Performance

• Forget about performance for right now
• Measure what you have right NOW
• Don’t count on it getting any better
Thank You HPC Industry!

• Automated Stuff
• Scalable Metric Collection/Display

a lot of great deployment and management tricks come from them, adopted by web ops
Good Measurement Tools
• record and store metrics
• custom metrics in/out
• easily compare
• lightweight-ish
Clouds need planning too
• Makes deployment and
procurement easy and quick
• But clouds are still resources
with costs and limits, just like
your own stuff
• Black-boxes: you may need to
pay even more attention than
before
Metrics
System Statistics
Metrics
“Application” Level
(photos processed per minute)

(average processing time per photo)

(apache requests)
(concurrent busy apache procs)
Metrics
App-level meets system-level

here, total CPU = ~1.12 * # busy apache procs


2400 photos per minute being uploaded right NOW


Ceilings
the maximum amount of “work” your resources will allow before degradation or failure
Forget Benchmarking
Find your ceilings

what you have left

The End
Use real live production data to find ceilings

Production: “it’s like a lab, but bigger!”


Like: database ceilings

replication lag: bad!


Ceilings

sustained disk I/O wait of >40% creates slave lag*

(waiting on disk too much)
*for us, YMMV
35,000 photo requests per second on a Tuesday peak
Safety Factors
Safety Factors

Ceiling * Factor of Safety = UR LIMITZ


Safety Factors

webserver!
Safety Factors
what you have left

“safe”
ceiling
@85% CPU

85% total CPU = ~76 busy apache procs
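
The arithmetic behind that "safe" ceiling can be sketched as below. This is a minimal sketch, assuming CPU grows linearly with busy Apache processes at the ~1.12 ratio quoted earlier; the function and constant names are made up for illustration.

```python
# Assumption: total CPU% is roughly linear in busy apache procs,
# at the ~1.12 ratio measured on this class of webserver.
CPU_PER_BUSY_PROC = 1.12   # percent total CPU per busy apache proc
SAFE_CPU_PERCENT = 85.0    # the "safe" ceiling chosen on the slide


def safe_busy_procs(safe_cpu_percent: float, cpu_per_proc: float) -> int:
    """Busy-proc count at which CPU reaches the safe ceiling (rounded)."""
    return round(safe_cpu_percent / cpu_per_proc)


limit = safe_busy_procs(SAFE_CPU_PERCENT, CPU_PER_BUSY_PROC)
print(f"safe ceiling: ~{limit} busy apache procs per box")
```

Your ratio and your safety factor will differ; measure both on your own boxes.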


Safety Factors
Yahoo Front Page link to Chinese New Year Photos
(8% spike)

(photo requests/second)
Forecasting
Forecasting

Fictional Example:
webservers
Forecasting

peak of the week

Fictional example: 15 webservers. 1 week.


Forecasting

...bigger sample, 6 weeks....isolate the peaks...


Forecasting

not too shabby

now

...“Add a Trendline” with some decent correlation...


Forecasting
this will tell you when it is
ceiling

when is this?
what you have left

15 servers @ 76 busy apache proc limit = 1140 total procs


Forecasting

(1140-726) / 42.751 = 9.68

(week #10, duh)
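
The same forecast, without a spreadsheet, might look like the sketch below. It reads 42.751 as the trendline slope (procs per week) and 726 as its intercept, which is an assumption about the slide's chart. The weekly peak values used for the fit are hypothetical, not Flickr data.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares line fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def week_ceiling_is_hit(ceiling, slope, intercept):
    """Solve slope * week + intercept = ceiling for week."""
    return (ceiling - intercept) / slope


# Numbers from the slides: 15 servers * 76 busy procs = 1140 total procs.
ceiling = 15 * 76
week = week_ceiling_is_hit(ceiling, slope=42.751, intercept=726)
print(f"ceiling reached at week {week:.2f}, so plan for week #{math.ceil(week)}")

# Hypothetical weekly peaks, only to show the curve-fit step itself.
weeks = [1, 2, 3, 4, 5, 6]
peaks = [770, 810, 855, 900, 940, 985]
slope, intercept = fit_line(weeks, peaks)
print(f"fitted trend: {slope:.2f} procs/week, intercept {intercept:.1f}")
```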


Forecasting Automation

• Writing excel macros is boring


• All we want is “days
remaining”, so all we need is
the curve-fit

Use http://fityk.sf.net to
automate the curve-fit
Forecasting

Fictional Example:
storage consumption
Forecasting Automation

this will tell you when this is

actual flickr storage consumption from early 2005, in GB
Forecasting Automation
cmd line script:
[jallspaw:~]$ cfityk ./fit-storage.fit
output:
1> # Fityk script. Fityk version: 0.8.2
2> @0 < '/home/jallspaw/storage-consumption.xy'
15 points. No explicit std. dev. Set as sqrt(y)
3> guess Quadratic
New function %_1 was created.
4> fit
Initial values: lambda=0.001 WSSR=464.564
#1: WSSR=0.90162 lambda=0.0001 d(WSSR)=-463.663 (99.8059%)
#2: WSSR=0.736787 lambda=1e-05 d(WSSR)=-0.164833 (18.2818%)
#3: WSSR=0.736763 lambda=1e-06 d(WSSR)=-2.45151e-05 (0.00332729%)
#4: WSSR=0.736763 lambda=1e-07 d(WSSR)=-3.84524e-11 (5.21909e-09%)
Fit converged.
Better fit found (WSSR = 0.736763, was 464.564, -99.8414%).
5> info formula in @0
# storage-consumption
14147.4+146.657*x+0.786854*x^2
6> quit
bye...
Forecasting Automation
fityk gave:
y = 0.786854x^2 + 146.657x + 14147.4
(R^2 = 99.84)
Excel gave:
y = 0.7675x^2 + 146.96x + 14147.3
(R^2 = 99.84)

(SAME)
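
Once you have the curve, "when do we run out" is just the positive root of a quadratic. A minimal sketch using the fityk coefficients above; the 20,000 GB capacity is a made-up figure, and x is in whatever time unit the data file uses.

```python
import math

# Coefficients reported by fityk: y = A*x^2 + B*x + C, with y in GB.
A, B, C = 0.786854, 146.657, 14147.4


def storage_at(x: float) -> float:
    """Forecast storage consumption (GB) at time x."""
    return A * x * x + B * x + C


def x_when_full(capacity_gb: float) -> float:
    """Positive root of A*x^2 + B*x + (C - capacity_gb) = 0."""
    disc = B * B - 4 * A * (C - capacity_gb)
    if disc < 0:
        raise ValueError("curve never reaches this capacity")
    return (-B + math.sqrt(disc)) / (2 * A)


CAPACITY_GB = 20000.0  # hypothetical installed capacity, not a Flickr number
x_full = x_when_full(CAPACITY_GB)
print(f"curve reaches {CAPACITY_GB:.0f} GB at x = {x_full:.1f}")
```

Subtract the current x from that result and you have "time remaining" for the dashboard.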
Capacity Health

• 12,629 nagios checks


• 1314 hosts
• 6 datacenters
• 4 photo “farms”
• farm = 2 DCs (east/west)
High and Low Water Marks

alert if higher

alert if lower

Per server, squid requests per second
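
A water-mark check is tiny to write. The sketch below assumes per-server squid requests per second as the metric; the thresholds, hostnames, and readings are all hypothetical.

```python
def check_water_marks(value: float, low: float, high: float) -> str:
    """Classify a reading against low and high water marks."""
    if value > high:
        return "ALERT_HIGH"   # approaching the ceiling
    if value < low:
        return "ALERT_LOW"    # suspiciously idle: out of rotation? broken?
    return "OK"


# Hypothetical water marks and readings (squid requests per second).
LOW, HIGH = 200.0, 950.0
readings = {"squid1": 610.0, "squid2": 1020.0, "squid3": 35.0}

for host, rps in readings.items():
    print(f"{host}: {rps:.0f} req/sec -> {check_water_marks(rps, LOW, HIGH)}")
```

The low mark matters as much as the high one: a server doing almost nothing at peak is usually a sign something is wrong.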


A good dashboard looks something like...

type      #   limit/box  ceiling units  limit (total)  current (peak)  % peak   est. days left
www       20  80         busy procs     1600           1000            62.50%   36
shard db  20  40         I/O wait       800            220             27.50%   120
squid     18  950        req/sec        17,100         11,400          66.67%   48

(yes, fictional numbers)
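
The derived dashboard columns are simple to compute from box counts, per-box limits, and current peaks. A minimal sketch using the same fictional numbers; the `Pool` class is a name made up for illustration, and "est. days left" is left out because it needs a growth forecast that the dashboard rows alone do not give.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    name: str
    boxes: int
    limit_per_box: float
    units: str
    current_peak: float

    @property
    def total_limit(self) -> float:
        """Ceiling for the whole pool: boxes * per-box limit."""
        return self.boxes * self.limit_per_box

    @property
    def percent_of_peak(self) -> float:
        """How much of the pool's ceiling the current peak uses."""
        return 100.0 * self.current_peak / self.total_limit


# The fictional dashboard numbers.
pools = [
    Pool("www", 20, 80, "busy procs", 1000),
    Pool("shard db", 20, 40, "I/O wait", 220),
    Pool("squid", 18, 950, "req/sec", 11400),
]

for p in pools:
    print(f"{p.name:<9} {p.total_limit:>7.0f} {p.units:<11} {p.percent_of_peak:>6.2f}%")
```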


Diagonal Scaling

vertically scaling your already horizontal nodes

• Image processing machines


• Replace Dell PE860s with HP DL140 G3s
Diagonal Scaling
example: image processing

4 cores

8 cores

(about the same CPU “usage” per box)


Diagonal Scaling
example: image processing
throughput

~45 images/min @ peak

~140 images/min @ peak

(same CPU usage, but ~3x more work)


“processing” means making 4 sizes from originals
Diagonal Scaling
example: image processing
went from:
23 Dell PE860s     3008.4 Watts   1035 photos/min   23U rack

to:
8 HP DL140 G3s     1036.8 Watts   1120 photos/min   8U rack

!!! (75% faster, even)
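
To put the before/after fleets on a common footing, you can compare power and rack space against throughput. A rough sketch using only the figures above; the `Fleet` class and the derived ratios are illustrative, not from the talk.

```python
from dataclasses import dataclass


@dataclass
class Fleet:
    label: str
    boxes: int
    watts: float
    photos_per_min: float
    rack_units: int

    @property
    def watts_per_photo_per_min(self) -> float:
        """Power spent per unit of image-processing throughput."""
        return self.watts / self.photos_per_min


before = Fleet("Dell PE860", 23, 3008.4, 1035, 23)
after = Fleet("HP DL140 G3", 8, 1036.8, 1120, 8)

power_saved = 1 - after.watts / before.watts
rack_saved = 1 - after.rack_units / before.rack_units

print(f"power saved: {power_saved:.1%}, rack space saved: {rack_saved:.1%}")
print(f"watts per photo/min: {before.watts_per_photo_per_min:.2f} -> "
      f"{after.watts_per_photo_per_min:.2f}")
```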
3.52 terabytes will be consumed today


2nd Order Effects
(beware the wandering
bottleneck)

running hot,
so add more
2nd Order Effects
(beware the wandering
bottleneck)

running great now,


so more traffic!
now
these
run hot
Stupid Capacity Tricks
Stupid Capacity Tricks
quick and dirty management
DSH
http://freshmeat.net/projects/dsh
[root@netmon101 ~]# cat group.of.servers
www100
www118
dbcontacts3
admin1
admin2
Stupid Capacity Tricks
quick and dirty management

[root@netmon101 ~]# dsh -N group.of.servers

dsh> date
executing 'date'
www100:         Mon Jun 23 14:14:53 UTC 2008
www118:         Mon Jun 23 14:14:53 UTC 2008
dbcontacts3:    Mon Jun 23 07:14:53 PDT 2008
admin1:         Mon Jun 23 14:14:53 UTC 2008
admin2:         Mon Jun 23 14:14:53 UTC 2008
dsh> 
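
With five hosts you can eyeball the odd one out (dbcontacts3 is on PDT). With hundreds, let a script do it. A minimal sketch that parses the `host: output` lines shown above and flags hosts whose timezone differs from the majority; the helper names are made up, and the parsing assumes `date` output in the format shown.

```python
from collections import Counter

# The dsh output from the session above, pasted in as test data.
DSH_OUTPUT = """\
www100:         Mon Jun 23 14:14:53 UTC 2008
www118:         Mon Jun 23 14:14:53 UTC 2008
dbcontacts3:    Mon Jun 23 07:14:53 PDT 2008
admin1:         Mon Jun 23 14:14:53 UTC 2008
admin2:         Mon Jun 23 14:14:53 UTC 2008
"""


def timezone_by_host(output: str) -> dict:
    """Map each host to the timezone field of its `date` output."""
    zones = {}
    for line in output.splitlines():
        host, _, rest = line.partition(":")
        fields = rest.split()
        if len(fields) >= 2:
            zones[host.strip()] = fields[-2]  # "UTC" in "... UTC 2008"
    return zones


def odd_hosts(zones: dict) -> list:
    """Hosts whose timezone differs from the most common one."""
    majority, _ = Counter(zones.values()).most_common(1)[0]
    return sorted(h for h, z in zones.items() if z != majority)


zones = timezone_by_host(DSH_OUTPUT)
print("hosts off the majority timezone:", odd_hosts(zones))
```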
Stupid Capacity Tricks
Turn Stuff OFF

• Disable heavy-ish features of the site (on/off switches)
• We have 195 different things to disable in case of emergency.
Stupid Capacity Tricks
Turn Stuff OFF
uploads (photo)
uploads (video)
uploads by email
various API things
various mobile things
various search things
etc., etc.
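
An on/off switch can be as simple as a lookup that every heavy code path checks first. This is a minimal in-memory sketch, not Flickr's implementation: the class, the switch names, and the responses are hypothetical, and a real version would read the switches from shared config so every server sees the same state.

```python
class FeatureSwitches:
    """In-memory on/off switches; everything starts enabled."""

    def __init__(self, names):
        self._enabled = {name: True for name in names}

    def disable(self, name: str) -> None:
        if name not in self._enabled:
            raise KeyError(name)
        self._enabled[name] = False

    def enable(self, name: str) -> None:
        if name not in self._enabled:
            raise KeyError(name)
        self._enabled[name] = True

    def is_enabled(self, name: str) -> bool:
        return self._enabled[name]


# Hypothetical switch names, loosely following the list above.
switches = FeatureSwitches(["photo_uploads", "video_uploads", "uploads_by_email"])


def handle_photo_upload(sw: FeatureSwitches) -> str:
    """Check the switch before doing any expensive work."""
    if not sw.is_enabled("photo_uploads"):
        return "503 uploads are temporarily disabled"
    return "200 upload accepted"


print(handle_photo_upload(switches))
switches.disable("photo_uploads")   # emergency: shed the heavy feature
print(handle_photo_upload(switches))
```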
Stupid Capacity Tricks
Outages Happen

• Host your outage/status/blog page in more than one datacenter.
• Tell your users WTF is going on, they’ll appreciate it.
Stupid Capacity Tricks
Hit the Pause Button

• Bake the dynamic into static


• Some Y! properties have a big
red button to instantly bake
(and un-bake) at will
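
The bake/un-bake idea, reduced to a toy: while baked, serve a frozen snapshot and skip the dynamic rendering entirely. This is an illustrative sketch only; the `BakeablePage` class is hypothetical and says nothing about how the Y! properties actually implement their button.

```python
class BakeablePage:
    """Serve a frozen copy of a page while baked, render live otherwise."""

    def __init__(self, render):
        self._render = render   # callable that builds the dynamic page
        self._baked = None      # static snapshot, or None when live

    def bake(self) -> None:
        """Render once and keep the result as the static version."""
        self._baked = self._render()

    def unbake(self) -> None:
        """Go back to rendering dynamically on every request."""
        self._baked = None

    def serve(self) -> str:
        if self._baked is not None:
            return self._baked
        return self._render()


calls = {"n": 0}  # counts how often the expensive render actually runs


def render_front_page() -> str:
    calls["n"] += 1
    return f"<html>front page, render #{calls['n']}</html>"


page = BakeablePage(render_front_page)
live = page.serve()      # rendered dynamically
page.bake()              # hit the big red button
first = page.serve()     # served from the snapshot
second = page.serve()    # still the snapshot, no render cost
page.unbake()
after_unbake = page.serve()
print(calls["n"], "renders for 4 requests")
```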
thanks
http://flickr.com/photos/bondidwhat/402089763/
http://flickr.com/photos/74876632@N00/2394833962/
http://flickr.com/photos/42311564@N00/220394633/
http://flickr.com/photos/unloveable/2422483859/
http://flickr.com/photos/absolutwade/149702085/
http://flickr.com/photos/krawiec/521836276/
http://flickr.com/photos/eschipul/1560875648/
http://flickr.com/photos/library_of_congress/2179060841/
http://flickr.com/photos/jekkyl/511187885/
http://flickr.com/photos/ab8wn/368021672/
http://flickr.com/photos/jaxxon/165559708/
http://flickr.com/photos/sparktography/75499095/
We’re Hiring!
flickr.com/jobs

Come see me!


questions?
