Professional Documents
Culture Documents
Scaling Storytime
June 2007
USENIX
Brad Fitzpatrick
brad@danga.com
http://danga.com/words/
1
The plan...
BIG-IP
perlbal (httpd/proxy) Global Database
bigip1 mod_perl
bigip2 proxy1 master_a master_b
web1
proxy2 web2
proxy3 Memcached
web3 slave1 slave2 ... slave5
djabberd proxy4 mc1
web4
djabberd proxy5 ... mc2 User DB Cluster 1
djabberd
webN mc3 uc1a uc1b
mc4 User DB Cluster 2
... uc2a uc2b
gearmand
Mogile Storage Nodes gearmand1 mcN User DB Cluster 3
sto1 sto2 gearmandN uc3a uc3b
Mogile Trackers
... sto8
tracker1 tracker3 User DB Cluster N
ucNa ucNb
MogileFS Database “workers”
gearwrkN Job Queues (xN)
mog_a mog_b theschwkN jqNa jqNb
slave1 slaveN
http://danga.com/words/
3
LiveJournal Overview
http://danga.com/words/
4
Stuff we've built...
(all production, open source)
memcached djabberd
− distributed caching − the super-extensible
MogileFS everything-is-a-plugin
− distributed filesystem mod_perl/qpsmtpd/
Perlbal Eclipse of XMPP/Jabber
− HTTP load balancer, web servers
server, swiss-army knife .....
gearman OpenID
− LB/HA/coalescing low- federated identity
latency function call protocol
“router”
TheSchwartz
− reliable, async job
dispatch system
http://danga.com/words/
5
“Uh, why?”
http://danga.com/words/
6
Yes.
We build wheels.
− ... when existing suck,
− ... or don’t exist.
http://danga.com/words/
7
Yes.
We build wheels.
− ... when existing suck,
− ... or don’t exist.
http://danga.com/words/
7
Yes.
We build wheels.
− ... when existing suck,
− ... or don’t exist.
http://danga.com/words/
7
Yes.
We build wheels.
− ... when existing suck,
− ... or don’t exist.
http://danga.com/words/
7
Part I
Quick Scaling History
http://danga.com/words/
8
Quick Scaling History
1 server to hundreds...
you can do all this with just 1 server!
− then you’re ready for tons of servers, without pain
− don’t repeat our scaling mistakes
http://danga.com/words/
9
Terminology
Scaling:
− NOT: “How fast?”
− But: “When you add twice as many servers, are you
twice as fast (or have twice the capacity)?”
Fast still matters,
− 2x faster: 50 servers instead of 100...
that’s some good money
− but that’s not what scaling is.
http://danga.com/words/
10
Terminology
“Cluster”
− varying definitions... basically:
− making a bunch of computers work together for
some purpose
− what purpose?
load balancing (LB),
high availablility (HA)
Load Balancing?
High Availability?
Venn Diagram time!
− I love Venn Diagrams
http://danga.com/words/
11
LB vs. HA
http://danga.com/words/
12
LB vs. HA
Load High
Balancing Availability
LVS heartbeat,
http
round-robin DNS, cold/warm/hot spare,
reverse proxy,
data partitioning, ...
wackamole,
.... ...
http://danga.com/words/
13
Favorite Venn Diagram
http://danga.com/words/
14
One Server
Simple:
mysql
apache
http://danga.com/words/
15
Two Servers
apache
mysql
http://danga.com/words/
16
Two Servers - Problems
http://danga.com/words/
17
Four Servers
3 webs, 1 db
Now we need to load-balance!
LVS, mod_backhand, whackamole, BIG-IP,
Alteon, pound, Perlbal, etc, etc..
− ...
http://danga.com/words/
18
Four Servers - Problems
http://danga.com/words/
19
Five Servers
introducing MySQL replication
We buy a new DB
MySQL replication
Writes to DB (master)
Reads from both
http://danga.com/words/
20
More Servers
Chaos!
http://danga.com/words/
21
net.
BIG-IP
bigip1
bigip2
mod_proxy
mod_perl
proxy1
web1
proxy2
web2
proxy3 Global Database
web3
web4 master
...
web12
slave1 slave2 ... slave6
http://danga.com/words/
22
Problems with Architecture
or,
“This don't scale...”
DB master is SPOF
Adding slaves doesn't scale
well...
− only spreads reads, not writes!
500 reads/s
250 reads/s 250 reads/s
http://danga.com/words/
23
Eventually...
http://danga.com/words/
24
Spreading Writes
http://danga.com/words/
25
Partition your data!
http://danga.com/words/
26
User Clusters
http://danga.com/words/
27
User Clusters
SELECT userid,
clusterid FROM
user WHERE
user='bob'
http://danga.com/words/
27
User Clusters
SELECT userid,
clusterid FROM
user WHERE
user='bob'
userid: 839
clusterid: 2
http://danga.com/words/
27
User Clusters
userid: 839
clusterid: 2
http://danga.com/words/
27
User Clusters
OMG i like
totally hate
my parents
userid: 839 they just
clusterid: 2 dont
understand me
and i h8 the
world omg lol
rofl *! :^-
^^;
add me as a
friend!!!
http://danga.com/words/
27
Details
per-user numberspaces
− don't use AUTO_INCREMENT
− PRIMARY KEY (user_id, thing_id)
− so:
Can move/upgrade users 1-at-a-time:
− per-user “readonly” flag
− per-user “schema_ver” property
− user-moving harness
job server that coordinates, distributed long-
http://danga.com/words/
28
Shared Storage
(SAN, SCSI, DRBD...)
innodb repairs,
http://danga.com/words/
29
Shared Storage: DRBD
http://danga.com/words/
32
Caching
caching's key to performance
− store result of a computation or I/O for quicker future
access (classic space/time trade-off)
Where to cache?
− mod_perl/php internal caching
memory waste (address space per apache child)
− shared memory
limited to single machine, same with Java/C#/
Mono
− MySQL query cache
flushed per update, small max size
− HEAP tables
fixed length rows, small max size
http://danga.com/words/
33
memcached
http://www.danga.com/memcached/
http://danga.com/words/
35
Picture
http://danga.com/words/
36
Picture
http://danga.com/words/
36
Picture
http://danga.com/words/
36
Picture
0 1 2 3
http://danga.com/words/
36
Picture
0 1 2 3
Client
http://danga.com/words/
36
Picture
0 1 2 3
$val = $client->get(“foo”)
Client
http://danga.com/words/
36
Picture
0 1 2 3
$val = $client->get(“foo”)
CRC32(“foo”) % 4 = 2
Client
http://danga.com/words/
36
Picture
0 1 2 3
$val = $client->get(“foo”)
CRC32(“foo”) % 4 = 2
Client
connect to server[2] (“10.0.0.101:11211”)
http://danga.com/words/
36
Picture
0 1 2 3
GET foo
$val = $client->get(“foo”)
CRC32(“foo”) % 4 = 2
Client
connect to server[2] (“10.0.0.101:11211”)
http://danga.com/words/
36
Picture
0 1 2 3
$val = $client->get(“foo”)
CRC32(“foo”) % 4 = 2
Client
connect to server[2] (“10.0.0.101:11211”)
http://danga.com/words/
36
Client hashing onto a memcacached
node
Up to client how to pick a memcached node
Traditional way:
− CRC32(<key>) % <num_servers>
− (servers with more memory can own more slots)
− CRC32 was least common denominator for all
languages to implement, allowing cross-language
memcached sharing
− con: can’t add/remove servers without hit rate
crashing
“Consistent hashing”
− can add/remove servers with minimal <key> to
<server> map changes
http://danga.com/words/
37
memcached internals
libevent
− epoll, kqueue...
event-based, non-blocking design
− optional multithreading, thread per CPU (not per
client)
slab allocator
referenced counted objects
− slow clients can’t block other clients from altering
namespace or data
LRU
all internal operations O(1)
http://danga.com/words/
38
Perlbal
http://danga.com/words/
39
Web Load Balancing
http://danga.com/words/
40
Perlbal
Perl
parts optionally in C with plugins
http://danga.com/words/
42
Perlbal: Persistent Connections
http://danga.com/words/
42
Perlbal: Persistent Connections
PB
http://danga.com/words/
42
Perlbal: Persistent Connections
PB
Client Apache
http://danga.com/words/
42
Perlbal: Persistent Connections
reqB1, B2 PB
http://danga.com/words/
42
Perlbal: can verify new backend
connections
#include <sys/socket.h>
int listen(int sockfd, int backlog);
http://danga.com/words/
43
Perlbal: multiple queues
http://danga.com/words/
44
Perlbal: cooperative large file serving
http://danga.com/words/
45
Perlbal: cooperative large file serving
internal redirects
− mod_perl can pass off serving a big file to Perlbal
either from disk, or from other URL(s)
− client sees no HTTP redirect
− “Friends-only” images
one, clean URL
mod_perl does auth, and is done.
perlbal serves.
http://danga.com/words/
46
Internal redirect picture
http://danga.com/words/
47
And the reverse...
http://danga.com/words/
48
Palette Altering GIF/PNGs
http://danga.com/words/
49
MogileFS
http://danga.com/words/
50
oMgFileS
http://danga.com/words/
51
MogileFS
http://danga.com/words/
52
MogileFS: Why
http://danga.com/words/
53
MogileFS: Main Ideas
http://danga.com/words/
54
MogileFS components
clients
mogilefsd (does all real work)
database(s) (MySQL, .... abstract)
storage nodes
http://danga.com/words/
55
MogileFS: Clients
http://danga.com/words/
56
MogileFS: Tracker
(mogilefsd)
The Meat
event-based message bus
load balances client requests, world info
process manager
− heartbeats/watchdog, respawner, ...
Child processes:
− ~30x client interface (“query” process)
interfaces client protocol w/ db(s), etc
− ~5x replicate
− ~2x delete
− ~1x fsck, reap, monitor, ..., ...
http://danga.com/words/
57
Trackers' Database(s)
http://danga.com/words/
58
MogileFS storage nodes
(mogstored)
HTTP transport
− GET
− PUT
− DELETE
mogstored listens on 2 ports...
HTTP. --server={perlbal,lighttpd,...}
− management/status:
iostat interface, AIO control, multi-stat() (for faster
fsck)
files on filesystem, not DB
− sendfile()! future: splice()
− filesystem can be any filesystem
http://danga.com/words/
59
Large file
GET
request
http://danga.com/words/
60
Auth: complex, but quick
Large file
GET
request
http://danga.com/words/
60
Spoonfeeding:
slow, but event-
based
Large file
GET
request
http://danga.com/words/
60
Gearman
http://danga.com/words/
61
manaGer
http://danga.com/words/
62
Manager
dispatches work,
but doesn't do anything useful itself. :)
http://danga.com/words/
63
Gearman
different languages,
db connection pooling,
...
...
http://danga.com/words/
64
Gearman Pieces
gearmand
− the function call router
− event-loop (epoll, kqueue, etc)
workers.
− Gearman::Worker – perl/ruby
− register/heartbeat/grab jobs
clients
− Gearman::Client[::Async] -- perl
− also Ruby Gearman::Client
− submit jobs to gearmand
− opaque (to server) “funcname” string
− optional opaque (to server) “args” string
− opt coallescing key
http://danga.com/words/
65
Gearman Picture
http://danga.com/words/
66
Gearman Picture
http://danga.com/words/
66
Gearman Picture
Worker Worker
http://danga.com/words/
66
Gearman Picture
can_do(“funcA”)
can_do(“funcA”)
can_do(“funcB”)
Worker Worker
http://danga.com/words/
66
Gearman Picture
can_do(“funcA”)
can_do(“funcA”)
can_do(“funcB”)
http://danga.com/words/
66
Gearman Picture
call(“funcA”)
can_do(“funcA”)
can_do(“funcA”)
can_do(“funcB”)
http://danga.com/words/
66
Gearman Picture
call(“funcA”)
can_do(“funcA”)
can_do(“funcA”)
can_do(“funcB”)
http://danga.com/words/
66
Gearman Picture
call(“funcA”)
can_do(“funcA”)
call(“funcB”)
can_do(“funcA”)
can_do(“funcB”)
http://danga.com/words/
66
Gearman Protocol
http://danga.com/words/
67
Gearman Uses
http://danga.com/words/
68
Gearman Uses, cont..
http://danga.com/words/
69
Gearman Misc
Guarantees:
− none! hah! :)
please wait for your results.
goes away.
No policy/conventions in gearmand
− all policy/meaning between clients <-> workers
...
http://danga.com/words/
70
Sick Gearman Demo
$ ./dmap.pl
dmap says:
1 = sammy
2 = papag
3 = sammy
4 = papag
5 = sammy
6 = papag
7 = sammy
8 = papag
9 = sammy
10 = papag
http://danga.com/words/
71
Gearman Summary
Gearman is sexy.
− especially the coalescing
Check it out!
− it's kinda our little unadvertised secret
oh crap, did I leak the secret?
http://danga.com/words/
72
TheSchwartz
http://danga.com/words/
73
TheSchwartz
Like gearman:
− job queuing system
− opaque function name
− opaque “args” blob
− clients are either:
submitting jobs
workers
But unlike gearman:
− Reliable job queueing system
− not low latency
− fire & forget (as opposed to gearman, where you wait for
result)
currently library, not network service
http://danga.com/words/
74
TheSchwartz Primitives
insert job
“grab” job (atomic grab)
− for 'n' seconds.
mark job done
temp fail job for future
− optional notes, rescheduling details..
replace job with 1+ other jobs
− atomic.
...
http://danga.com/words/
75
TheSchwartz
backing store:
− a database
− uses Data::ObjectDriver
MySQL,
Postgres,
SQLite,
....
but HA: you tell it @dbs, and it finds one to
insert job into
− likewise, workers foreach (@dbs) to do work
http://danga.com/words/
76
TheSchwartz uses
http://danga.com/words/
77
gearmand + TheSchwartz
http://danga.com/words/
78
djabberd
http://danga.com/words/
79
djabberd
http://danga.com/words/
80
djabberd hooks
everything is a hook
− not just auth! like, everything.
− auth,
− roster,
− vcard info (avatars),
− presence,
− delivery,
− inter-node cluster delivery,
− ala mod_perl, qpsmtpd, etc.
async hooks
− hooks phases can take as long as they want before
they answer, or decline to next phase in hook chain...
− we use Gearman::Client::Async
http://danga.com/words/
81
Thank you!
Questions to:
brad@danga.com
Software:
http://danga.com/
http://code.sixapart.com/
http://danga.com/words/
82
net.
Questions?
BIG-IP
perlbal (httpd/proxy) Global Database
bigip1 mod_perl
bigip2 proxy1 master_a master_b
web1
proxy2 web2
proxy3 Memcached
web3 slave1 slave2 ... slave5
djabberd proxy4 mc1
web4
djabberd proxy5 ... mc2 User DB Cluster 1
djabberd
webN mc3 uc1a uc1b
mc4 User DB Cluster 2
... uc2a uc2b
gearmand
Mogile Storage Nodes gearmand1 mcN User DB Cluster 3
sto1 sto2 gearmandN uc3a uc3b
Mogile Trackers
... sto8
tracker1 tracker3 User DB Cluster N
ucNa ucNb
MogileFS Database “workers”
gearwrkN Job Queues (xN)
mog_a mog_b theschwkN jqNa jqNb
slave1 slaveN
http://danga.com/words/
83
Bonus Slides
if extra time
http://danga.com/words/
84
Data Integrity
http://danga.com/words/
85
Persistent Connection Woes
http://danga.com/words/
86