MIKE PANCHENKO
INFRASTRUCTURE ENGINEER
@MIHASYA
DEREK SMITH
INFRASTRUCTURE ENGINEER
@DSMITTS
PAUL LATHROP
OPERATIONS
@GREYTALYN
SIMPLEGEO
[Architecture diagram: AWS (RDS, auth/proxy, ELB); HTTP to data centers; API servers; queues; reads and writes; record storage; index storage; geocoder; reverse geocoder; GeoIP; pushpin]
Apache Cassandra
BUT WHY NOT
POSTGIS?
DATABASES
WHAT ARE THEY GOOD FOR?
DATA STORAGE
Durably persist system state
CONSTRAINT MANAGEMENT
Enforce data integrity constraints
EFFICIENT ACCESS
Organize data and implement access methods for efficient
retrieval and summarization
DATA INDEPENDENCE
Data independence shields clients from the details
of the storage system and its data structures
ATOMICITY
Either all of a transaction’s actions are visible to another transaction, or none are
CONSISTENCY
Application-specific constraints must be met for transaction to succeed
ISOLATION
Two concurrent transactions will not see one another’s updates while “in flight”
DURABILITY
The updates made to the database in a committed transaction will be visible to
future transactions
ACID HELPS
ACID is a sort-of-formal contract that makes it
easy to reason about your data, and that’s good
CONSISTENCY
Every node in the system contains the same data (e.g., replicas are
never out of date)
AVAILABILITY
Every request to a non-failing node in the system returns a response
PARTITION TOLERANCE
System properties (consistency and/or availability) hold even when
the system is partitioned and messages are lost
CAP THEOREM IN 30 SECONDS
[Diagram sequence, labels: accept, ack; denied: UNAVAILABLE!; accept: INCONSISTENT!]
ACID HURTS
Certain aspects of ACID encourage (require?)
implementors to do “bad things”
hash(alice) % 3
=> 23 % 3
=> 2
[Diagram: nodes 1, 2, 3]
CONSISTENT HASHING
With modulo hashing, a change in the number of
nodes reshuffles the entire data set
hash(alice) % 4
=> 23 % 4
=> 3
[Diagram: nodes 1, 2, 3, 4]
CONSISTENT HASHING
Instead the range of the hash function is mapped to a
ring, with each node responsible for a segment
hash(alice) => 23
[Diagram: ring with nodes at 0, 42, 84]
CONSISTENT HASHING
When nodes are added (or removed) most of the data
mappings remain the same
hash(alice) => 23
[Diagram: ring with nodes at 0, 42, 64, 84]
CONSISTENT HASHING
Rebalancing the ring requires a minimal amount of
data shuffling
hash(alice) => 23
[Diagram: ring with nodes at 0, 32, 64, 96]
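The difference between the two placement schemes can be measured directly. The sketch below is illustrative only: `stable_hash`, `Ring`, the node names, and the token positions are assumptions made for this example, not SimpleGeo's or Cassandra's code. It counts how many keys change owner when a fourth node is added.

```python
import bisect
import hashlib

SPACE = 2**32  # size of the hash ring (an arbitrary choice for this sketch)


def stable_hash(key):
    """Deterministic hash of a key into [0, SPACE)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % SPACE


def modulo_owner(key, node_count):
    """Modulo placement: node index = hash % number of nodes."""
    return stable_hash(key) % node_count


class Ring:
    """Consistent-hash ring: each node owns the segment ending at its token."""

    def __init__(self, tokens):
        self.nodes = dict(tokens)      # token -> node name
        self.tokens = sorted(tokens)   # token positions in ring order

    def owner(self, key):
        position = stable_hash(key)
        # First token at or after the key's position, wrapping past the end.
        index = bisect.bisect_left(self.tokens, position) % len(self.tokens)
        return self.nodes[self.tokens[index]]


def moved_fraction(keys, before, after):
    """Fraction of keys whose owner differs between two placement functions."""
    moved = sum(1 for key in keys if before(key) != after(key))
    return moved / len(keys)


keys = ["user-%d" % i for i in range(10000)]
three = Ring({0: "a", SPACE // 3: "b", 2 * SPACE // 3: "c"})
four = Ring({0: "a", SPACE // 3: "b", SPACE // 2: "d", 2 * SPACE // 3: "c"})

modulo_moved = moved_fraction(
    keys, lambda k: modulo_owner(k, 3), lambda k: modulo_owner(k, 4))
ring_moved = moved_fraction(keys, three.owner, four.owner)
print("modulo: %.0f%% moved, ring: %.0f%% moved"
      % (100 * modulo_moved, 100 * ring_moved))
```

With modulo placement roughly three quarters of the keys change owner; on the ring only the keys in the new node's segment move.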
GOSSIP
DISSEMINATES CLUSTER MEMBERSHIP AND
RELATED CONTROL STATE
Gossip is initiated by an interval timer
At each gossip tick a node will
• Randomly select a live node in the cluster, sending it a gossip message
• Attempt to contact cluster members that were previously marked as
down
If the gossip message is unacknowledged for some period of
time (statistically adjusted based on the inter-arrival time of
previous messages) the remote node is marked as down
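A gossip tick and the down-marking rule might be sketched as follows. This is a simplification under stated assumptions: the `Peer` class is hypothetical, and the timeout is a fixed multiple of the mean inter-arrival time, which stands in for the statistical adjustment described above rather than reproducing it.

```python
import random
from statistics import mean


class Peer:
    """What one node remembers about another node (hypothetical structure)."""

    def __init__(self):
        self.arrivals = []  # times at which we heard from this peer
        self.alive = True

    def heard_at(self, now):
        self.arrivals.append(now)
        self.alive = True

    def timeout(self, default=5.0, factor=3.0):
        # Assumption: adapt the timeout to the observed gaps between messages.
        if len(self.arrivals) < 2:
            return default
        gaps = [b - a for a, b in zip(self.arrivals, self.arrivals[1:])]
        return factor * mean(gaps)

    def check(self, now):
        """Mark the peer down if it has been silent longer than its timeout."""
        if self.arrivals and now - self.arrivals[-1] > self.timeout():
            self.alive = False
        return self.alive


def gossip_tick(peers, rng=random):
    """Pick this tick's targets: one random live peer, plus peers marked down."""
    live = [name for name, peer in peers.items() if peer.alive]
    down = [name for name, peer in peers.items() if not peer.alive]
    targets = [rng.choice(live)] if live else []
    return targets + down


peers = {"b": Peer(), "c": Peer()}
for t in (0.0, 1.0, 2.0):
    peers["b"].heard_at(t)
    peers["c"].heard_at(t)
peers["b"].heard_at(3.0)   # "c" stays silent after t=2.0

for peer in peers.values():
    peer.check(now=6.0)
targets = gossip_tick(peers)
```

Here "c" has been silent for 4.0 time units against a timeout of 3.0, so it is marked down and is retried on the next tick alongside the one live peer.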
REPLICATION
REPLICATION FACTOR Determines how many
copies of each piece of data are created in the
system
RF=3
hash(alice) => 23
[Diagram: ring with nodes at 0, 32, 64, 96]
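Replica placement on the ring can be sketched in a few lines. The assumption here, not stated on the slide, is that replicas are the next RF nodes walking clockwise from the key's position; `replicas_for` is a hypothetical helper name.

```python
import bisect


def replicas_for(position, tokens, replication_factor):
    """Return the tokens of the nodes that hold copies of a key.

    Assumption: the first replica is the node whose token is at or after
    the key's position, and the rest follow clockwise around the ring.
    """
    ordered = sorted(tokens)
    start = bisect.bisect_left(ordered, position) % len(ordered)
    count = min(replication_factor, len(ordered))
    return [ordered[(start + i) % len(ordered)] for i in range(count)]


# hash(alice) => 23 on a ring with nodes at 0, 32, 64, 96 and RF=3
alice_replicas = replicas_for(23, [0, 32, 64, 96], 3)
print(alice_replicas)
```

Under that assumption the key at position 23 is stored on the nodes at 32, 64, and 96.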
CONSISTENCY MODEL
DYNAMO INSPIRED
QUORUM-BASED CONSISTENCY
[Diagram: ring with nodes at 0, 32, 64, 96; hash(alice) => 23; write with W=2, read with R=2]
W + R > N
Consistent
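Why W + R > N gives consistent reads: any set of W replicas that acknowledged a write must share at least one member with any set of R replicas that answer a read. A brute-force check over every possible pair of sets (an illustrative sketch, with a hypothetical function name) confirms it for small N.

```python
from itertools import combinations


def quorums_always_overlap(n, w, r):
    """True if every W-sized write set intersects every R-sized read set."""
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


print(quorums_always_overlap(3, 2, 2))  # W=2, R=2, N=3
print(quorums_always_overlap(3, 1, 1))  # W=1, R=1, N=3
```

With N=3, W=2 and R=2 always overlap, while W=1 and R=1 can miss each other, which is the trade-off that tunable consistency exposes.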
TUNABLE CONSISTENCY
WRITES
[Diagram labels: write, fail]
CONSISTENCY MODEL
DYNAMO INSPIRED
READ REPAIR Asynchronously checks replicas during
reads and repairs any inconsistencies
HINTED HANDOFF
ANTI-ENTROPY
[Diagram labels: W=2, write, read + fix]
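Read repair can be sketched with plain dictionaries standing in for replicas. The assumptions are labeled: the newest timestamp wins, the repair runs inline rather than asynchronously, and the stale value "Boulder" is made up for the example.

```python
def read_with_repair(replicas, key):
    """Read a key from every replica and repair any stale or missing copy.

    replicas: list of dicts mapping key -> (value, timestamp).
    Assumption: the version with the highest timestamp is the correct one.
    """
    versions = [replica[key] for replica in replicas if key in replica]
    if not versions:
        return None
    newest = max(versions, key=lambda version: version[1])
    for replica in replicas:
        if replica.get(key) != newest:
            replica[key] = newest  # repair; a real system would do this in the background
    return newest[0]


replicas = [
    {"alice": ("St. Louis", 1287040737182)},
    {"alice": ("Boulder", 1287040000000)},   # stale copy (hypothetical value)
    {},                                      # replica that missed the write
]
city = read_with_repair(replicas, "alice")
```

After the read, all three replicas hold the newest version.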
CONSISTENCY MODEL
DYNAMO INSPIRED
READ REPAIR
HINTED HANDOFF Sends failed writes to another node
with a hint to re-replicate when the failed node returns
ANTI-ENTROPY
[Diagram labels: write, replica]
[Diagram label: repair]
CONSISTENCY MODEL
DYNAMO INSPIRED
READ REPAIR
HINTED HANDOFF
ANTI-ENTROPY Manual repair process where nodes
generate Merkle trees (hash trees) to detect and
repair data inconsistencies
[Diagram label: repair]
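The Merkle tree comparison can be sketched as below. This is a toy under stated assumptions: both replicas split the same key range into the same number of buckets, the `Node` class and the row strings are hypothetical, and SHA-256 is an arbitrary choice of hash.

```python
import hashlib


def digest(data):
    return hashlib.sha256(data).digest()


class Node:
    """Hash tree over leaves[lo:hi]; a parent's hash covers both children."""

    def __init__(self, leaves, lo=0, hi=None):
        hi = len(leaves) if hi is None else hi
        self.lo, self.hi = lo, hi
        if hi - lo == 1:
            self.left = self.right = None
            self.hash = leaves[lo]
        else:
            mid = (lo + hi) // 2
            self.left = Node(leaves, lo, mid)
            self.right = Node(leaves, mid, hi)
            self.hash = digest(self.left.hash + self.right.hash)


def differing_buckets(a, b):
    """Descend only into subtrees whose hashes differ; return leaf indexes."""
    if a.hash == b.hash:
        return []
    if a.left is None:
        return [a.lo]
    return differing_buckets(a.left, b.left) + differing_buckets(a.right, b.right)


def tree_for(rows):
    return Node([digest(row.encode("utf-8")) for row in rows])


replica_a = ["alice=St. Louis", "bob=Boulder", "carol=Denver", "dave=Austin"]
replica_b = ["alice=St. Louis", "bob=Boulder", "carol=Portland", "dave=Austin"]
out_of_sync = differing_buckets(tree_for(replica_a), tree_for(replica_b))
print(out_of_sync)
```

Matching subtrees are skipped after a single hash comparison, so only the buckets that actually differ need to be exchanged and repaired.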
DATA MODEL
BIGTABLE INSPIRED
SPARSE MATRIX it’s a hash-map (associative array):
a simple, versatile data structure
SCHEMA-FREE data model, introduces new freedom
and new responsibilities
COLUMN FAMILIES blend row-oriented and column-
oriented structure, providing a high level mechanism
for clients to manage on-disk and inter-node data
locality
DATA MODEL
TERMINOLOGY
KEYSPACE A named collection of column families
(similar to a “database” in MySQL); you only need one and
can mostly ignore it
COLUMN FAMILY A named mapping of keys to rows
ROW A named sorted map of columns or supercolumns
COLUMN A <name, value, timestamp> triple
{
  "users": {                                  <- column family
    "alice": {                                <- key
      "city": ["St. Louis", 1287040737182],   <- columns
      "name": ["Alice", 1287080340940],
    },
    ...
  },
  ...
}
[Diagram: hash table with keys bob and alice and slots s3b and 3e8]
HASH TABLE
SUPPORTED QUERIES
EXACT MATCH
RANGE
PROXIMITY
ANYTHING THAT’S NOT
EXACT MATCH
COLUMNS
SUPPORTED QUERIES
EXACT MATCH
RANGE
PROXIMITY
{
  "users": {
    "alice": {
      "city": ["St. Louis", 1287040737182],
      "friend-1": ["Bob", 1287080340940],     <- friends
      "friend-2": ["Joe", 1287080340940],     <- friends
      "friend-3": ["Meg", 1287080340940],     <- friends
      "name": ["Alice", 1287080340940],
    },
    ...
  }
}
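Because columns within a row are sorted by name, a range of column names can be read as one contiguous slice. The sketch below models the row as a plain dictionary; `column_slice` is a hypothetical helper for this example, not a Cassandra client call.

```python
import bisect

# The "alice" row from the slide: column name -> (value, timestamp)
alice = {
    "city": ("St. Louis", 1287040737182),
    "friend-1": ("Bob", 1287080340940),
    "friend-2": ("Joe", 1287080340940),
    "friend-3": ("Meg", 1287080340940),
    "name": ("Alice", 1287080340940),
}


def column_slice(row, start, finish):
    """Return (name, value) pairs for column names in [start, finish]."""
    names = sorted(row)  # a real store keeps these sorted on disk already
    lo = bisect.bisect_left(names, start)
    hi = bisect.bisect_right(names, finish)
    return [(name, row[name][0]) for name in names[lo:hi]]


# "~" sorts after digits and letters, so this covers every "friend-" column.
friends = column_slice(alice, "friend-", "friend-~")
print(friends)
```

Choosing column names with a shared prefix is what turns "all of Alice's friends" into a single range query.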
LOG-STRUCTURED MERGE
MEMTABLES are in memory data structures that
contain newly written data
EXACT MATCH
RANGE
On a single dimension
? PROXIMITY
SPATIAL DATA
IT’S INHERENTLY MULTIDIMENSIONAL
2 x 2, 2
1 2
DIMENSIONALITY REDUCTION
WITH SPACE-FILLING CURVES
1 2
3 4
Z-CURVE
SECOND ITERATION
Z-VALUE
14
x
GEOHASH
SIMPLE TO COMPUTE
Interleave the bits of decimal coordinates
(equivalent to binary encoding of pre-order
traversal!)
Base32 encode the result
AWESOME CHARACTERISTICS
Arbitrary precision
Human readable
Sorts lexicographically
01101 => e
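The two steps above (interleave, then Base32 encode) fit in a short function. This is a sketch of the standard geohash scheme, assuming the usual conventions: longitude supplies the first bit, and the alphabet omits a, i, l, and o. The function name is chosen for this example.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet


def geohash_encode(latitude, longitude, precision=8):
    """Encode a coordinate as a geohash string of `precision` characters."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    bits = []
    use_longitude = True  # interleaving starts with a longitude bit
    while len(bits) < precision * 5:
        if use_longitude:
            value, bounds = longitude, lon_range
        else:
            value, bounds = latitude, lat_range
        mid = (bounds[0] + bounds[1]) / 2
        if value >= mid:
            bits.append(1)
            bounds[0] = mid  # keep the upper half
        else:
            bits.append(0)
            bounds[1] = mid  # keep the lower half
        use_longitude = not use_longitude

    chars = []
    for i in range(0, len(bits), 5):
        index = 0
        for bit in bits[i:i + 5]:
            index = (index << 1) | bit
        chars.append(BASE32[index])
    return "".join(chars)


print(BASE32[0b01101])                             # e
print(geohash_encode(38.6554420, -90.2992910))     # 9yzgcjn0
print(geohash_encode(38.6554420, -90.2992910, 5))  # 9yzgc
```

A shorter geohash is a prefix of a longer one for the same point, which is where the arbitrary precision and lexicographic sorting come from.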
DATA MODEL
{
  "record-index": {
    "9yzgcjn0:moonrise hotel": {              <- key: <geohash>:<id>
      "": ["", 1287040737182],
    },
    ...
  },
  "records": {
    "moonrise hotel": {
      "latitude": ["38.6554420", 1287040737182],
      "longitude": ["-90.2992910", 1287040737182],
      ...
    }
  }
}
BOUNDING BOX
E.G., MULTIDIMENSIONAL RANGE
1 2
3 4
Gie 4 5
SPATIAL DATA
STILL MULTIDIMENSIONAL
DIMENSIONALITY REDUCTION ISN’T PERFECT
Clients must
• Pre-process to compose multiple queries
• Post-process to filter and merge results
Degenerate cases can be bad, particularly for nearest-neighbor
queries
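The post-processing cost is easy to see on a toy grid. The sketch below assumes a 4 x 4 grid of integer cells, a Z-value that takes the x bit first in each pair, and a single range scan per bounding box; the function names are hypothetical. It counts how many rows the scan returns versus how many survive the filter.

```python
import bisect


def z_value(x, y, bits=2):
    """Interleave the bits of x and y (x bit first) into one Z-curve position."""
    z = 0
    for i in reversed(range(bits)):
        z = (z << 1) | ((x >> i) & 1)
        z = (z << 1) | ((y >> i) & 1)
    return z


def bounding_box_query(index, x_min, y_min, x_max, y_max, bits=2):
    """Range-scan the Z-ordered index, then filter out points outside the box.

    index: list of (z, x, y) tuples sorted by z.
    Every point inside the box has a Z-value between those of the two
    corners, but the converse does not hold, hence the filter.
    """
    z_keys = [entry[0] for entry in index]
    start = bisect.bisect_left(z_keys, z_value(x_min, y_min, bits))
    stop = bisect.bisect_right(z_keys, z_value(x_max, y_max, bits))
    candidates = index[start:stop]
    hits = [(z, x, y) for z, x, y in candidates
            if x_min <= x <= x_max and y_min <= y <= y_max]
    return candidates, hits


# One point in every cell of a 4 x 4 grid.
index = sorted((z_value(x, y), x, y) for x in range(4) for y in range(4))

# A box around the center of the grid straddles all four quadrants.
candidates, hits = bounding_box_query(index, 1, 1, 2, 2)
print(len(candidates), len(hits))
```

For this box the scan reads 10 rows to find 4 matches. Splitting the box into several smaller ranges reduces the wasted reads at the cost of more queries, which is the pre-processing step listed above.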
Z-CURVE LOCALITY
[Diagram sequence: points marked x and o along the Z-curve]
THE WORLD
IS NOT BALANCED
1 2
SAN FRANCISCO
3 4
TOO MUCH LOCALITY
[Diagram sequence: nodes 1 to 4 with SAN FRANCISCO; labels: "I’m sad.", "I’m bored."]
TRAVERSAL
NEAREST NEIGHBOR
[Diagram sequence: query point x among candidate points o]
KEY CHARACTERISTICS
PERFORMANCE
Best case on the happy path (everything cached) has zero
read overhead
Worst case, with nothing cached, O(log(n)) read overhead
RE-BALANCING SEEMS UNNECESSARY!
Makes worst case more worser, but so far so good
DISTRIBUTED TREE
SUPPORTED QUERIES
EXACT MATCH
RANGE
PROXIMITY
SOMETHING ELSE I HAVEN’T
EVEN HEARD OF
MULTIPLE DIMENSIONS!
LIFE OF A REQUEST
THE BIRDS ‘N THE BEES
[Diagram sequence, one layer highlighted per slide:]
ELB (load balancing; AWS service)
gate
service
cass
worker pool
index
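# Manage the Apache config file; a change notifies the service.
file { '/etc/apache2/apache2.conf':
  ensure => file,
  mode   => '0600',
  source => '/root/learning-manifests/apache2.conf',
  notify => Service['apache2'],
}

# Keep Apache running and enabled at boot; restart when the config file changes.
service { 'apache2':
  ensure    => running,
  enable    => true,
  subscribe => File['/etc/apache2/apache2.conf'],
}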
AN EXAMPLE
TERMINAL
CONTINUOUS
INTEGRATION
GET ‘ER DONE
• Revision control
• Automate build process
• Automate testing process
• Automate deployment
Github Plugin
TYING IT ALL TOGETHER
TERMINAL
FLUME
Flume is a distributed, reliable and available
service for efficiently collecting, aggregating and
moving large amounts of log data.
syslog on steroids
DATA-FLOWS
AGENTS
Physical Host | Logical Node | Source                            | Sink
i-192df98     | tail         | tail("/var/log/nginx/access.log") | agentSink(35853)
              | hdfs_writer  | collectorSource(35853)            | collectorSink("hdfs://namenode.sg.com/bogus/", "logs")
RELIABILITY
END-TO-END
STORE ON FAILURE
BEST EFFORT
GETTIN’ JIGGY WIT IT
Custom Decorators
HOW DO WE USE IT?
PERSONAL EXPERIENCES
• #flume
• Automation was gnarly
• It’s never a good day when Eclipse is involved
• Resource hog (at first)
We’re building kick-ass tools for devops to index,
transport, consume data connected to a location
MARCH 2011