Professional Documents
Culture Documents
Link-State 2012 - Fred Hatfull - The Dirty Work: Scaling Out Websites With Your Own Two Hands
Who Am I?
CWRU Alumnus Software Developer at Yelp Infrastructure Engineer Availability Performance Productivity
Overview
What's Yelp? Scaling the Backend Accelerating Content Delivery Monitoring Performance
What's Yelp?
Help consumers find great local businesses Help business owners find more customers
What's Yelp?
Five Sites www.yelp.com biz.yelp.com api.yelp.com m.yelp.com admin.yelp.com
www - consumer facing website biz - business owners website for managing ads, biz page, etc api - public and private APIs (mobile apps, too) m - mobile site (web browsers on mobile devices) admin - administrative tools
What's Yelp?
Numerous Open-Source Projects: mrjob firefly testify tron many more: github.com/Yelp
mrjob - python Map/Reduce framework firefly - time-series statistics graphing testify - python testing framework tron - distributed cron
In the Beginning...
In the Beginning
16.32.64.128
No load balancer, no internal DNS, no web framework. Just us, mod_python, and MySQL.
Up and Running
16.32.64.128
web1
db1
web2
Traffic starts picking up. One box doesn't cut it any more... time to scale out horizontally. Adding webs is the low-hanging fruit.
Up and Running
16.32.64.128
web1
web2
web3
db1
web4
web5
web6
OK, we begin to hit the limits of horizontal scaling. The webapp can always benefit from having more machines (+HAProxy).
Up and Running
16.32.64.128
web1
web2
!
db1
web3
web4
web5
web6
There are a few classic options for scaling up MySQL. We could switch datastores... maybe MySQL is just slow? How about Oracle? MSSQL? etc. We could introduce an entirely new database machine with its own MySQL instance... basically just a clone of the current one. Both DBs don't know anything about each other. We could shard the database by having multiple machines where each is responsible for a certain set of keys. Or we could just replicate the current database to accommodate more traffic, and hope writes don't get overwhelming.
OK. It's 2004... NoSQL isn't around, really, and our data is pretty relational anyway. MySQL is looking like the fastest store that meets our requirements.
We could just set up an entirely new database machine and run the new database in parallel. We'd have to make two read queries instead of one now, figure out where to send writes, and keeping schemas in sync is a nightmare. Clearly not scalable.
We are actually doing a form of sharding here, but it's not quite the conventional master-master sharding that usually comes to mind. Master-master would work, but it's a huge pain to get right and requires a lot of effort to make sure keys go to and are retrieved from the shards where they belong. Our read-heavy traffic patterns make us an ideal candidate for replication to massively increase read capacity while reducing engineering overhead.
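To make the contrast concrete, here is a minimal sketch of the key-based sharding option; the shard names and the shard_for helper are illustrative assumptions, not anything Yelp actually ran:

```python
# Key-based sharding: each key hashes to exactly one shard, so the
# application must route every read and write for a key to the shard
# that owns it.
import hashlib

SHARDS = ["db1", "db2", "db3", "db4"]  # hypothetical shard hosts

def shard_for(key):
    # A stable hash keeps each key mapped to the same shard across requests.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]
```

Note the operational cost this implies: adding a shard changes the modulus and remaps most keys, which is part of why this is such a pain to get right.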
Database Replication
db1
webs
Reads/Writes
Prior to replication. Just one database, handles all reads/writes. All webs are talking to this db.
Database Replication
db1
Replicated Writes
Simple two-database replication scheme. All write traffic hits the master, `db1`. Read traffic is split between the master and a read-only slave. When writes come through the master, the master informs the slave that the write happened, so that the slave replicates the action taken on the master.
Database Replication
db1
db2
Replicated Writes
It's good practice to keep another master early in the replication stream that can be promoted to write master if the write master fails.
Database Replication
db1
db2
Replicated Writes
Replication can't always happen immediately, depending on the load on the slave db and the master and how far apart they are/network congestion.
Database Replication
Strong Consistency vs. Eventual Consistency
A shift in thinking. Life is easy in strongly-consistent systems. Although scaling can be challenging, you get a guarantee that data is always up-to-date. Eventual consistency is a big change and has lots of nasty corner cases ready to bite.
Database Replication
Replication has lots of cons: Expensive/poorly formed queries have multiplicative effect Replication delay can lead to inconsistent views Figuring out when to hit master vs. slave etc...
While horizontally-scalable read capacity is a big win, there are lots of new things to think about.
4500ms
2ms
master
webs
slaves
CWRU, 27 October 2012 - Fred Hatfull
A normal replication stream. Lots of nice, small queries floating by. The master has no problem.
4500ms
master
2ms
webs
2ms
slaves
Small writes get replicated fine, but the big, nasty table-scan suddenly locks a whole bunch of rows and waits seconds for MySQL to figure out which rows should come back. This delays all the writes in the replication stream and causes an increase in replication delay, exacerbating the inconsistency.
master
4500ms
webs
4500ms
In the case of a big INSERT/UPDATE/DELETE, that query also needs to replicate to the slaves, causing big delays on all of the slaves
Replication: Consistency
db master web
db replica
Here's an example of where replication can introduce inconsistency. Our user wants to know about the restaurant Happy Dog.
Replication: Consistency
db master web
db replica
As part of the request, the web server handling the request asks a DB replica for information about Happy Dog
Replication: Consistency
db master web
db replica
and it's returned to the user. Fine. Everything here is as it used to be.
Replication: Consistency
db master web
db replica
Since we have to write data, our web connects to the master database and issues the writes for the new review
Replication: Consistency
db master web
db replica
That's the end of our user's web request. After she POSTs the review she gets redirected back to Happy Dog's page. Like before, her web hits a replica instead of the master because it only needs to do reads. However, her request gets through the web and to the replica before her review makes it to the replica in the replication stream...
Replication: Consistency
where's my review??
db master web
db replica
As a result, our user sees the stale information, even though she just contributed content! Now the user thinks her content has disappeared.
Replication: Consistency
Writes: always master Reads: can use either master or slave majority can use slave sometimes you want to hit the master for consistency
Here's how DB access is split up based on what kind of activity you are doing. Writes (INSERTs/UPDATEs/DELETEs) always hit the master, since nothing else will accept writes. Reads (SELECTs) can use either the master or a slave, and usually only need a slave. It's up to the application to figure out if it needs to hit the master, and that can be tricky.
Replication: Consistency
When Does Consistency Matter? Consistency only matters when it's expected example: users writing reviews If the user doesn't know information is out of date... is it really out of date?
Replication: Consistency
Asking for the master: after writes, hit the master until replication catches up webs/load-balancers can remember user state but expensive, brittle instead, teach clients to ask for the master dirty session cookie
Always hit the master for writes. After writes, hang on to a cookie for X seconds. While the user has the cookie, hit the master.
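That dirty-cookie routing can be sketched as follows; the cookie name, the 10-second window, and the plain-dict cookie jar are illustrative assumptions, not Yelp's actual implementation:

```python
import time

DIRTY_COOKIE = "dirty"   # hypothetical cookie name
DIRTY_WINDOW = 10        # seconds to keep reading from the master

def mark_dirty(cookies, now=None):
    # Called after any write: record when the session stops being "dirty".
    now = time.time() if now is None else now
    cookies[DIRTY_COOKIE] = now + DIRTY_WINDOW
    return cookies

def pick_db(cookies, now=None):
    # Reads go to the master while the dirty cookie is fresh;
    # otherwise they can safely hit a replica.
    now = time.time() if now is None else now
    deadline = cookies.get(DIRTY_COOKIE, 0)
    return "master" if now < deadline else "slave"
```

The window just needs to comfortably exceed typical replication delay; once the cookie expires, reads drift back to the slaves on their own.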
Caches
Caches
Fastest thing since sliced bread Often seen as a drop-in performance enhancement Can be hard to get right Present hidden availability implications
Take advantage of precomputed/pre-retrieved results in memory. While it seems like a drop-in speed upgrade, caches can be surprisingly hard to get right.
Caches: Types
webs
dbs
load balancer
Caches: Types
HTTP caches (varnish etc)
webs
dbs
load balancer
HTTP Caches - cache full HTTP responses. Great for static sites or dynamic sites with content that changes infrequently. Ex: varnish, squid
Caches: Types
HTTP caches (varnish etc) in-memory caches
webs
dbs
load balancer
Memoization of results from... things: functions, DB queries, etc. Typically per-node (not shared between webs, for example).
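As one illustration (not Yelp's actual code), Python's functools.lru_cache gives exactly this kind of per-process memoization; the business_details function and its return value are made-up stand-ins:

```python
import functools

@functools.lru_cache(maxsize=1024)
def business_details(business_id):
    # Stand-in for an expensive lookup (DB query, computed result, ...).
    # The cache lives in this process's memory only, so each web node
    # warms and evicts its own copy independently.
    return {"id": business_id}
```

One caveat: cached mutable values (like this dict) are shared between callers, so mutating a returned value pollutes the cache for everyone in that process.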
Caches: Types
HTTP caches (varnish etc) in-memory caches
memcache
webs
dbs
load balancer
Memcache! Frequently used to store computed results for faster lookup and load reduction. Used to cache anything from raw DB rows to larger queries (joins) to gzipped blobs to serialized data structures.
Caches: Advantages
Primary cache in most places: memcache Takes advantage of fast in-memory key/value lookups Good for expensive operations Complex DB queries - 100s of ms Network roundtrip to memcache - 2-3ms Misses are cheap - only network roundtrip cost
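The usual access pattern behind those numbers is cache-aside: try memcache first, fall back to the database on a miss, then populate the cache. A runnable sketch, with a plain dict standing in for a memcache client (real clients expose a similar get/set interface, usually with an expiry argument):

```python
# In-process stand-in for a memcache client.
cache = {}

def get_business(business_id, db_lookup):
    key = "business:%d" % business_id
    value = cache.get(key)     # ~2-3 ms network roundtrip in practice
    if value is None:          # miss: pay the expensive DB query once
        value = db_lookup(business_id)
        cache[key] = value     # real clients also take an expiry time
    return value
```

Misses only add the cheap roundtrip on top of the DB query, which is why they are tolerable; the win comes from every subsequent hit skipping the 100s-of-ms query.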
Caches: Pitfalls
Cache libraries can make it easy to cache weird things: Database models memcache connections (!) ??? - anything else that can be serialized (via pickle etc) Causes problems when object definitions change Cannot enumerate cache contents easily to fix polluted caches
Especially in dynamic languages, it can be easy to say "oh, do some serialization or whatever and then cache it!" However, many times you'll have things like SQLAlchemy models (which have connections to your database!), the connection you are using to access memcache, and more. If any of those object definitions change (or you change/remove code that pickle/json expects to have to deserialize), you may end up with a polluted cache which contains entries that you can't decode. Memcache also doesn't allow you to enumerate cache entries, so programmatically invalidating certain subsets of keys is hard if not impossible.
Caches: Pitfalls
Makes exceeding failover capacity really easy If memcache cluster goes down what happens? How do you handle additional web and DB load? Solution: Build in additional capacity Be able to isolate and turn off expensive features Have an emergency maintenance mode
Memcache helps to reduce load, but it's another point of failure. Memcache outages can cause increased load proportional to what the cache offloaded for you, which can easily cause cascading failures if not handled correctly.
Datacenters
Datacenters
Geographic distribution helps you mitigate the speed of light Replication problems expand to more systems: memcache code deployments offline batch processing database slaves see non-trivial replication delay
It's like database replication for your whole system. Out-of-sync caches can be problematic, and database replication becomes a super-non-trivial problem.
Datacenters
Solutions: Each datacenter gets a write master Still only One True Master, other write masters replicate Read/reporting slaves replicate from local write master Replicate cache inserts, invalidations Take advantage of existing MySQL replication stream
Provide utilities for monitoring replication delay for services using the replication stream
Front-End: Principles
Reduce HTTP round-trips Reduce download sizes Don't do things browsers don't like
Lots of front-end performance tips/tricks/hacks. Most of them are based on these guidelines.
CDNs
Content Delivery Networks Maintains copies of your assets Probably serves your assets faster than you do Examples: Akamai Cloudfront (Amazon Web Services) Cotendo
CDNs: Why?
Huge networks of globally distributed edge nodes e.g. Akamai at > 100,000 [1] Easy to set up and drop in Transparent layer, just change hostnames to CDN Much lower bandwidth and equipment costs Asset gets uploaded to CDN once (ish)
[1] http://www.datacenterknowledge.com/archives/2009/05/14/whos-got-the-most-webservers/
Subdomain Sharding
RFC 2616 (HTTP 1.1): A single-user client SHOULD NOT maintain more than 2 connections with any server or proxy.
http://www.ietf.org/rfc/rfc2616.txt
Subdomain Sharding
Parallel Connections by Browser: Firefox 4.x: 6 Firefox 3.6.x: 6 Internet Explorer: 2-6 Chrome: 6 Opera 11.x: 8 Safari 5.x: 6
http://stackoverflow.com/questions/5751515/official-references-for-default-values-of-concurrent-http-1-1-connections-per-se
Subdomain Sharding
Distribute asset traffic across sharded subdomains Before: media.yelp.com -> 16.32.64.128 After: media[1-4].yelp.com -> media.yelp.com -> 16.32.64.128
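A sketch of how asset URLs might be spread across the sharded subdomains (the helper and the CRC choice are assumptions; any stable hash works):

```python
import zlib

def asset_url(path, n_shards=4):
    # Hash the path so each asset always lands on the same subdomain;
    # a stable mapping keeps browser and CDN caches effective while the
    # four hostnames let browsers open more parallel connections.
    shard = zlib.crc32(path.encode("utf-8")) % n_shards + 1
    return "http://media%d.yelp.com%s" % (shard, path)
```

Since all four hostnames resolve to the same place, no server-side change is needed; only the URLs emitted into pages change.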
There are alternatives... e.g. ETag and If-Modified-Since. These require HTTP round-trips to compute, though, so even though you don't end up needing to re-download the asset you still end up with more TCP connections.
Cookieless Domains
GET / HTTP/1.1 Host: www.yelp.com Connection: keep-alive Cache-Control: max-age=0 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.52 Safari/537.11 Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8 Accept-Encoding: gzip,deflate,sdch Accept-Language: en-US,en;q=0.8 Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3 Cookie: yuv=moFlBQmJAAfti607vjGInzN7FFk8_DSjAfHQ_lW4YaGwZEJqQqZmcEyNyeNTQam7Rqm2q6EOieOhxmRTZiNuMNmm_G7pet1m; __qca=P0-2101128917-1317837280654; __gads=ID=91ec4bf2b3418833:T=1321911452:S=ALNI_MbtB7iIIKDvemgnlX95Ywi4BsWEPg; bse=63c34cdcaef7b20a01cc89cc34ccfff5; fd=0; searchPrefs=%7B%22seen_pop%22%3Atrue%2C%22seen_crop_pop %22%3Atrue%2C%22prevent_scroll%22%3Afalse%2C%22maptastic_mode%22%3Atrue%2C%22mapsize%22%3A%22large %22%2C%22rpp%22%3A40%7D; fbm_97534753161=base_domain=.yelp.com; s=YGc7FduEf1Wv2m5hE1sMWU5pMolrEG8x; hl=en_US; recentlocations=New+York%2C+NY%2C+USA%3B%3B706+Mission+St%2C+San+Francisco%2C+CA%2C+USA%3B %3BLower+Pac+Heights%2C+San+Francisco%2C+CA%2C+USA%3B%3BFillmore%2C+MO%2C+USA%3B%3B1251+Waller+St%2C +San+Francisco%2C+CA%2C+USA%3B%3BAnn+Arbor%2C+MI%2C+USA%3B%3BHaight-Ashbury%2C+San+Francisco%2C+CA%2C +USA%3B%3BSOMA%2C+San+Francisco%2C+CA%2C+USA%3B%3BPittsburgh%2C+PA%2C+USA%3B%3BUnion+Square%2C+San +Francisco%2C+CA%2C+USA%3B%3BLondon%2C+UK%3B%3B706+Mission%2C+Kingsburg%2C+CA%2C+USA%3B%3BMiami%2C+FL %2C+USA%3B%3B510+Central+Ave%2C+Hot+Springs%2C+AR%2C+USA; location=%7B%22unformatted%22%3A+%22San +Francisco%2C+CA%22%2C+%22city%22%3A+%22San+Francisco%22%2C+%22state%22%3A+%22CA%22%2C+%22country %22%3A+%22US%22%7D; __utma=165223479.655521012.1316285892.1350862721.1351317047.69; __utmb=165223479.3.10.1351317047; __utmc=165223479; __utmz=165223479.1338263014.44.13.utmcsr=google| utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided); fbsr_97534753161=Le92MKZfPPnUrIyfRghoVKhIdhfzAt4wW1jpJHo3fIk.eyJhbGdvcml0aG0iOiJITUFDLVNIQTI1NiIsImNvZ 
GUiOiIzNWM5OTViNzIxYTgyYjQzMzVmZjNlNDAuMS0xNDczNjMwMTIyfDEzNTEzMTczNDl8Vm1mLU5DLXh1RkswSHRmdF9wQWRDR0NRTGNZIiwiaXNzdWVkX2F0IjoxMzUxMzE3MDQ5LCJ1c2VyX2lkIjoiMTQ3MzYzMDEyMiJ9
Cookieless Domains
2128 bytes = 2k!
Cookieless Domains
Only assign cookies to domains which need them Put static assets/other cookie-less content elsewhere *.yelp.com vs. *.yelp-cdn.com
Lots more... the web abounds with tips. These are some of the ones we use.
Monitoring
Mission-critical Needs to be simple, easy-to-understand, durable Strategies vary widely based on application requirements Drop-in products only get you so far Exposes: what is broken when what works and how well it works
Monitoring: Alerts
Nagios Off-the-shelf solution Flexible custom reporting Well-understood for monitoring systems e.g. load, memory/disk usage, hardware failures, etc Needs well-known states (CRITICAL/OK)
Monitoring: Performance
Firefly Graphical front-end to time-series data Extensible data ingestion API Open-source: github.com/Yelp/firefly Statmonster Code-name for data collection and preprocessing Upstream of Firefly Turns log lines into stats, analyzes, and stores
Monitoring: Logging
Brief Aside: Logging Incredibly important for application development Often only source of information when everything blows up Useful for pulling data out of the webapp on the fly Huge volumes of log data require special infrastructure
Monitoring: Scribe
Distributed log aggregation system Composed of leaf and aggregator nodes Leaves collect log lines Aggregators aggregate incoming lines based on channel Each line associated with a channel e.g. (yelp_timings, "Homepage render took 32.9ms") Eventually consistent
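A toy model of that leaf/aggregator split (real Scribe buffers and forwards asynchronously, which is where the eventual consistency comes from; this sketch delivers synchronously for clarity):

```python
from collections import defaultdict

class Aggregator(object):
    def __init__(self):
        # Incoming lines are grouped by channel name.
        self.channels = defaultdict(list)

    def receive(self, channel, message):
        self.channels[channel].append(message)

class Leaf(object):
    def __init__(self, aggregator):
        self.aggregator = aggregator

    def log(self, channel, message):
        # Leaves collect (channel, message) pairs from the webs and
        # forward them upstream to an aggregator.
        self.aggregator.receive(channel, message)
```

In production each leaf would buffer to disk and retry on aggregator failure, so lines arrive eventually rather than immediately.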
Monitoring: Scribe
(channel2, message)
[channel1]
(channel3, message)
webs
(channel2, message)
[channel2, channel3]
(channel1, message)
aggregators
leaves
Monitoring: Performance
scribe
RRD Data Chunks
log digestion windowing stats stat generation (performance.home, 1.2) additional statistics
Monitoring: Performance
{ "time_start": 10, "time_dispatch": 12, "time_end": 44,
  "checkpoints": { "user_details": 22, "review_collection": 37, "template_render": 43 } }

def digest(e):
    time_start = e['time_start']
    checkpoints = e['checkpoints']
    total_time = e['time_end'] - time_start                       # 44 - 10 = 34
    compute_time = e['time_end'] - e['time_dispatch']             # 44 - 12 = 32
    reviews_time = checkpoints['review_collection'] - time_start  # 37 - 10 = 27
    emit(['performance', 'total_time'], total_time)
    emit(['performance', 'compute_time'], compute_time)
    emit(['checkpoint_times', 'reviews'], reviews_time)
([performance, total_time], 34, 10) ([performance, compute_time], 32, 10) ([checkpoint_times, reviews], 27, 10)
Monitoring: Performance
([performance, total_time], 34, 10) ([performance, compute_time], 32, 10) ([checkpoint_times, reviews], 27, 10)
[performance, total_time]
(10, 23) (11, 29) (10, 35) (11, 28) (12, 39) (8, 32) (8, 40) (9, 36) (9, 33)
(10s buffer)
Monitoring: Performance
([performance, total_time], 34, 10) ([performance, compute_time], 32, 10) ([checkpoint_times, reviews], 27, 10)
[performance, total_time]
(10, 23) (11, 29) (10, 35) (10, 34) (11, 28) (12, 39) (8, 32) (8, 40) (9, 36) (9, 33)
(10s buffer)
Monitoring: Performance
(10, 23) (11, 29) (10, 35) (10, 34) (11, 28) (12, 39) (8, 32) (8, 40) (9, 36) (9, 33)
stats
50th, 75th, 95th, 99th, mean, count, etc...
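Computing those summary statistics over one flushed window can be sketched like this (nearest-rank percentiles on the buffered (time_start, value) samples; the real statmonster pipeline is more involved):

```python
def percentile(values, p):
    # Nearest-rank percentile on a sorted copy of the window's values.
    ordered = sorted(values)
    idx = min(len(ordered) - 1, int(round(p / 100.0 * (len(ordered) - 1))))
    return ordered[idx]

def window_stats(samples):
    # samples: list of (time_start, value) pairs from one 10s buffer.
    values = [v for (_, v) in samples]
    return {
        "count": len(values),
        "mean": sum(values) / float(len(values)),
        "p50": percentile(values, 50),
        "p75": percentile(values, 75),
        "p95": percentile(values, 95),
        "p99": percentile(values, 99),
    }
```

Tail percentiles (95th/99th) matter most here: the mean hides exactly the slow outliers that replication-style pipelines exist to surface.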
Questions?
We're Hiring!
Full-Time Interns Front-End and Back-End Engineering http://www.yelp.com/careers