Cassandra at Twitter
Published by Chris Goffinet, Jul 12, 2011

Cassandra @ Twitter

Cassandra SF, July 11th, 2011

Team
‣ Chris Goffinet (@lennox)
‣ Stu Hood (@stuhood)
‣ Ryan King (@rk)
‣ Alan Liang (@alan)
‣ Oscar Moll
‣ Melvin Wang (@padauk9)

Measuring ourselves

#prostyle

Measuring ourselves
‣ Hardware Platform
‣ Data Storage
‣ Latency and Throughput
‣ Operational Efficiency
‣ Capacity Planning
‣ Developer Integration
‣ Testing

Hardware Platform
‣ CPU core utilization
‣ Memory bandwidth and consumption
‣ Machine cost
‣ RAID
‣ Filesystems and I/O schedulers
‣ IOPS
‣ Network bandwidth
‣ Kernel

Hardware Platform
Filesystem configurations
‣ Ext4
  ‣ Data mode = Ordered
  ‣ Data mode = Writeback
‣ XFS
‣ RAID
  ‣ 0 and 10 (far vs near copies)
  ‣ 128 vs 256 vs 512 stripe sizes

Hardware Platform
I/O Schedulers
‣ CFQ vs Noop vs Deadline vs Anticipatory
‣ Workloads
  ‣ Timeseries
  ‣ 50/50
‣ Measure
  ‣ p90
  ‣ p99
  ‣ Average
  ‣ Max

Hardware Platform
I/O Schedulers, 50/50 - Reads

Scheduler      p90    p99     Average    Max
cfq            73ms   210ms   11.72ms    4940ms
noop           47ms   167ms   9.12ms     4132ms
deadline       75ms   233ms   12.72ms    3718ms
anticipatory   76ms   214ms   12.37ms    5120ms

Hardware Platform
I/O Schedulers, 50/50 - Writes

Scheduler      p90    p99    Average    Max
cfq            2ms    2ms    2.02ms     5927ms
noop           2ms    2ms    2.06ms     3475ms
deadline       2ms    2ms    2.13ms     3718ms
anticipatory   2ms    2ms    2.03ms     5119ms
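The p90/p99/average/max columns in these tables can be reproduced from raw latency samples with a few lines; a minimal sketch, assuming a nearest-rank percentile definition (the slides do not say which quantile method was used):

```python
# Minimal sketch of the latency summary reported above (p90, p99,
# average, max). The nearest-rank percentile method is an assumption.
def latency_summary(samples_ms):
    """Summarize a list of request latencies in milliseconds."""
    ordered = sorted(samples_ms)
    n = len(ordered)

    def percentile(p):
        # Nearest-rank: smallest value with at least p% of samples
        # at or below it.
        rank = max(1, int(round(p / 100.0 * n)))
        return ordered[rank - 1]

    return {
        "p90": percentile(90),
        "p99": percentile(99),
        "average": sum(ordered) / n,
        "max": ordered[-1],
    }

# Example: 100 synthetic samples, 1..100 ms.
print(latency_summary(list(range(1, 101))))
# {'p90': 90, 'p99': 99, 'average': 50.5, 'max': 100}
```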

Measuring ourselves
‣ Hardware Platform
‣ Data Storage
‣ Latency and Throughput
‣ Operational Efficiency
‣ Capacity Planning
‣ Developer Integration
‣ Testing

Data Storage
‣ How efficient is our on-disk storage?
‣ Could we do compression?
‣ Do we have CPU to trade?
‣ How do we push for better?
‣ Is it worth it?

Data Storage

Feature                     Old   New
Easy to implement           X
Checksumming                      X
Varint encoding                   X
Delta encoding                    X
Type-specific compression         X
Fixed-size blocks                 X

Data Storage

How did we do?

Data Storage
‣ 1.5x?
‣ 2.5x?
‣ 3.5x?

Data Storage

7.03x

Data Storage
10,000 rows; 250M columns
Timeseries, LongType column names, CounterColumnType values

                   Current Format    New Format
Rows               10000             10000
Columns            250M              250M
Size on disk       16,716,432,189    2,375,027,696
Bytes per column   66.8              9.5
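The 7.03x figure on the previous slide follows directly from this table; a quick consistency check (truncating to the same precision the slides use):

```python
# Sanity-check the table against the 7.03x compression claim.
old_size = 16_716_432_189  # bytes on disk, current format
new_size = 2_375_027_696   # bytes on disk, new format
columns = 250_000_000

def trunc(x, places):
    """Truncate (not round), matching the slide's precision."""
    scale = 10 ** places
    return int(x * scale) / scale

print(trunc(old_size / new_size, 2))   # 7.03 (the ratio on the slide)
print(trunc(old_size / columns, 1))    # 66.8 bytes/column, old format
print(trunc(new_size / columns, 1))    # 9.5 bytes/column, new format
```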

Data Storage
‣ type-specific compression
‣ fine-grained corruption detection
‣ index promotion
  ‣ normalizing narrow and wide rows
  ‣ predictable performance
  ‣ no double-pass on compaction
‣ range and slice deletes

Measuring ourselves
‣ Hardware Platform
‣ Data Storage
‣ Latency and Throughput
‣ Operational Efficiency
‣ Capacity Planning
‣ Developer Integration
‣ Testing

Latency and Throughput
‣ What are our issues?
‣ Compaction performance?
‣ Caching?
‣ Too many disk seeks?
‣ Garbage collection?

Latency and Throughput

Compaction


Latency and Throughput
‣ Multithreaded compaction + throttling
  ‣ Compact each bucket in parallel
  ‣ Throttle across all buckets
  ‣ Compaction running all the time
‣ CASSANDRA-2191
‣ CASSANDRA-2156

Latency and Throughput

Measure latency
‣ p99
‣ p999
‣ No averages!
‣ Every customer has p99 and p999 targets we must hit
‣ 24x7 on-call rotation

Latency and Throughput

Caching?
‣ In-heap
‣ Off-heap
‣ Pluggable cache
‣ Memcache

Case Study: Tweet Button
‣ Growth was requiring the entire dataset in memory. Why?
‣ How big is the active dataset within 24 hours?
‣ What happens when the dataset outgrows memory?
‣ Could other storage solutions do better?
‣ What are we missing here?

Case Study: Tweet Button
‣ Key size: variable length (each one a URL)
‣ Implement hashing on keys
‣ Can we do better?
‣ But... the cache in Java isn’t very efficient...

or is it?
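The "hashing on keys" idea above replaces a variable-length URL key with a fixed-size digest, so every key costs the same number of bytes. A minimal sketch; the choice of MD5 and the function name are illustrative assumptions, not the implementation the slides describe:

```python
import hashlib

# Illustrative sketch: hash a variable-length URL key down to a
# fixed-size digest. MD5 (16 bytes) is an assumed digest choice.
def fixed_size_key(url: str) -> bytes:
    return hashlib.md5(url.encode("utf-8")).digest()

long_url = "https://example.com/some/very/long/path?with=query&strings=1"
key = fixed_size_key(long_url)
print(len(key))  # 16 bytes, regardless of URL length
```

The trade-off is that the original URL can no longer be recovered from the key, which is fine for a cache keyed only by lookups.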

Case Study: Tweet Button
‣ On-heap: requires us to scale the JVM heap with the cache
‣ Off-heap: store pointers to data allocated outside the JVM
‣ Memcache: out of process

Case Study: Tweet Button
‣ On-heap: data + CLHM overhead (87GB)
‣ Off-heap: CLHM overhead (67GB just the pointers!)
‣ Memcache: internal overhead + data (48GB!)

* CLHM = ConcurrentLinkedHashMap

Case Study: Tweet Button
‣ Co-locate memcache on each node
‣ Routing + cache replication
‣ Write-through LRU
‣ Rolling restarts do not cause degraded performance states

[Diagram: memcache co-located with Cassandra on each node]

Case Study: Tweet Button
‣ In production today
‣ Stats
  ‣ 99th percentile before: 200ms (800ms when data > memory)
  ‣ 99th percentile now: 2.5ms

Case Study: Cuckoo
‣ New observability stack
‣ Replaces Ganglia
‣ Collect metrics for graphing in real time
‣ Scale based on machine count + defined metrics
‣ Heavy write throughput requirements
‣ SLA target: all metrics written under 60 seconds

Case Study: Cuckoo
‣ 1.3 million writes/second
‣ 112 billion writes a day
‣ 3.2 gigabit/s over the network
‣ 492GB of new data per hour
‣ 140MB/s writes across cluster
‣ 70MB/s reads across cluster

Case Study: Cuckoo

36,000 writes/second persistently to disk on each node
‣ 36 nodes without RF (Replication Factor)
‣ Replication Factor = 3
‣ 30-35% CPU utilization
‣ FSync commit log every 10s
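The per-node and daily figures follow from the cluster-wide write rate on the previous slide; a quick consistency check:

```python
# Sanity-check the Cuckoo numbers: cluster rate, per-node rate,
# and daily total should agree with the slides.
cluster_writes_per_sec = 1_300_000
nodes = 36  # "36 nodes without RF"

per_node = cluster_writes_per_sec / nodes
per_day = cluster_writes_per_sec * 86_400

print(round(per_node))  # 36111, i.e. ~36,000 writes/second/node
print(per_day / 1e9)    # 112.32, i.e. ~112 billion writes/day
```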

Case Study: Cuckoo

Garbage Collection Challenge
‣ 30-60 second pauses multiple times per hour on each node
‣ Why? Heap fragmentation

Case Study: Cuckoo

[Plot: heap free_space and max_chunk (bytes) over time]

Case Study: Cuckoo
‣ Slab allocation
  ‣ Fixed-size chunks (2MB)
  ‣ Copy byte[] into slabs using CAS (Compare & Swap)
‣ Largely reduced fragmentation
‣ CASSANDRA-2252
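The slab idea above is that millions of small values become a handful of large, uniformly sized allocations, which the collector can reclaim without fragmenting the heap. A toy sketch; Cassandra's version (CASSANDRA-2252, in Java) bumps the slab offset with a CAS, and since Python has no CAS primitive a lock stands in for it here:

```python
import threading

SLAB_SIZE = 2 * 1024 * 1024  # fixed 2MB chunks, as on the slide

class SlabAllocator:
    """Toy slab allocator: copy values into fixed-size slabs."""

    def __init__(self):
        self._lock = threading.Lock()
        self._slab = bytearray(SLAB_SIZE)
        self._offset = 0
        self.slabs_allocated = 1

    def copy_in(self, value: bytes) -> memoryview:
        with self._lock:  # stand-in for the CAS bump of the offset
            if self._offset + len(value) > SLAB_SIZE:
                # Current slab is full: start a fresh one.
                self._slab = bytearray(SLAB_SIZE)
                self._offset = 0
                self.slabs_allocated += 1
            start = self._offset
            self._offset += len(value)
            self._slab[start:start + len(value)] = value
            return memoryview(self._slab)[start:start + len(value)]

alloc = SlabAllocator()
copies = [alloc.copy_in(b"x" * 100) for _ in range(30_000)]
print(alloc.slabs_allocated)  # 2: 30,000 x 100B values fit in two 2MB slabs
```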

Case Study: Cuckoo

                     No Slab         Slab
GC pause avg time    30-60 seconds   5 seconds
Frequency of pause   Every hour      Every 3 days 10 hours

Case Study: Cuckoo

Pluggable Compaction
‣ Custom strategy for retention support
‣ Used for our timeseries
  ‣ Drop SSTables after N days
‣ Make it easy to implement more interesting and intelligent compaction strategies
‣ SSTable min/max timestamp
  ‣ Read time optimization
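The retention strategy above can be sketched in a few lines: once the newest write in an SSTable predates the retention cutoff, the whole file can be dropped without rewriting anything. This is an illustration of the idea, not Cassandra's actual compaction API; the class and field names are assumptions:

```python
import time

RETENTION_DAYS = 7  # "N days"; the actual N is workload-specific

class SSTable:
    """Illustrative stand-in: just a name and a max write timestamp."""
    def __init__(self, name, max_timestamp):
        self.name = name
        self.max_timestamp = max_timestamp  # newest write in the file

def droppable(sstables, now=None, retention_days=RETENTION_DAYS):
    """SSTables whose newest write is older than the retention cutoff."""
    now = time.time() if now is None else now
    cutoff = now - retention_days * 86_400
    # Every row in such a table is expired, so the file can be
    # dropped whole; no merge or rewrite needed.
    return [t for t in sstables if t.max_timestamp < cutoff]

now = 1_000_000_000
tables = [SSTable("a", now - 10 * 86_400), SSTable("b", now - 86_400)]
print([t.name for t in droppable(tables, now=now)])  # ['a']
```

The per-SSTable min/max timestamp mentioned on the slide is what makes this check (and the read-time optimization of skipping files outside a query's time range) cheap.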

Measuring ourselves
‣ Hardware Platform
‣ Data Storage
‣ Latency and Throughput
‣ Operational Efficiency
‣ Capacity Planning
‣ Developer Integration
‣ Testing

Operational Efficiency
‣ Automated infrastructure burn-in process
‣ Rack awareness to handle switch failures
‣ Grow clusters per rack, not per node
‣ Lower server RPC timeout (1s to 200ms)
  ‣ Fail fast
‣ Split out RPC timeouts by reads & writes
‣ CASSANDRA-2819

Operational Efficiency

Fault tolerance at the disk level
‣ Eject from cluster if RAID array fails
‣ CASSANDRA-2118
‣ No swap and dedicated commit log
‣ Multiple hard drive vendors
‣ 300+ nodes in production
‣ Run on cheap commodity hardware
‣ Design for failure

Operational Efficiency
What failures do we see in production?
‣ Bad memory that causes corruption
‣ Multiple disks dying on the same hosts within hours
‣ Rack switch failures
‣ Memory allocation delays causing the JVM to encounter higher-latency GC collections (mlockall recommended)
‣ Stop-the-world pauses if traffic patterns change
‣ Network cards sometimes negotiating down to 100Mbit
‣ Machines randomly dying and never coming back
‣ Disks auto-ejecting themselves from the RAID array

Operational Efficiency
Deploy Process

[Diagram: Git -> Hudson -> driver, deploying to the Cassandra nodes]

Operational Efficiency
Deploy Process
‣ Deploy to hundreds of nodes in under 20s
‣ Roll the cluster:
  ‣ Disable Gossip on a node
  ‣ Check ring on all nodes to ensure ‘Down’ state
  ‣ Drain
  ‣ Restart
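The per-node roll steps above map onto Cassandra's real `nodetool` subcommands (`disablegossip`, `ring`, `drain`). A minimal sketch that just generates the command plan rather than executing it; the restart command is an assumption about the host environment:

```python
# Sketch of the rolling-restart steps as a per-node command plan.
# nodetool disablegossip/ring/drain are real Cassandra subcommands;
# the service restart line is an assumed host-specific command.
def roll_plan(node):
    """The per-node command sequence from the slide, in order."""
    return [
        ["nodetool", "-h", node, "disablegossip"],
        # (operators then run `nodetool ring` on the other nodes and
        #  wait until this node shows 'Down' before continuing)
        ["nodetool", "-h", node, "drain"],    # flush memtables, stop writes
        ["service", "cassandra", "restart"],  # assumed restart mechanism
    ]

for cmd in roll_plan("cass1.example.com"):
    print(" ".join(cmd))
```

Draining before the restart matters: a drained node has flushed its memtables, so it restarts without commit-log replay, and disabling gossip first stops coordinators from routing requests to it mid-restart.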

Measuring ourselves
‣ Hardware Platform
‣ Data Storage
‣ Latency and Throughput
‣ Operational Efficiency
‣ Capacity Planning
‣ Developer Integration
‣ Testing

Capacity Planning
‣ In-house capacity planning tool
‣ Collect input from sources:
  ‣ hardware platform (kernel, hw data)
  ‣ on-disk serialization overhead
  ‣ cost of read/write (seeks, index overhead)
  ‣ query cost (cpu, memory usage)
  ‣ requirements from customers

Capacity Planning
Input Example

spec = {
    'read_qps': 500,
    'write_qps': 1000,
    'replication_factor': 3,
    'dataset_hot_percent': 0.05,
    'latency_95': 350.0,
    'latency_99': 250.0,
    'read_growth_percentage': 0.1,
    'write_growth_percentage': 0.1,
    ......
}

Capacity Planning
Output Example

90 days
  datasize: 14.49T
  page cache size: 962.89G
  number of disks: 68
  disk capacity: 15.22T
  iops: 6800.00/s
  replication_factor: 3
  servers: 51
  servers (w/o replication): 17
  read_ops: 2323
  write_ops: 991066
  servers: 57
  servers (w/o replication): 19
  read_ops: 2877
  write_ops: 1143171
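One internal consistency in the output above is easy to verify: each replicated server count is the unreplicated count times the replication factor from the spec. A quick check (what distinguishes the two server blocks, e.g. two growth horizons, is not spelled out on the slide):

```python
# The replicated server counts in the output are the unreplicated
# counts scaled by the replication factor from the input spec.
replication_factor = 3

for unreplicated in (17, 19):
    print(unreplicated * replication_factor)  # 51, then 57
```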

Measuring ourselves
‣ Hardware Platform
‣ Data Storage
‣ Latency and Throughput
‣ Operational Efficiency
‣ Capacity Planning
‣ Developer Integration
‣ Testing

Developer Integration

Cassie
‣ Light-weight Cassandra client
‣ Cluster member auto-discovery
‣ Uses Finagle (http://github.com/twitter/finagle)
‣ Scala + Java support
‣ Open sourcing

Measuring ourselves
‣ Hardware Platform
‣ Data Storage
‣ Latency and Throughput
‣ Operational Efficiency
‣ Capacity Planning
‣ Developer Integration
‣ Testing

Testing
‣ Distributed testing harness
  ‣ Open sourced to community
‣ Custom internal build of YCSB
  ‣ Performance benchmarking
  ‣ Custom workloads such as timeseries

Performance Framework
‣ Custom framework that uses YCSB
‣ What we do:
  ‣ Collect as much data as possible
  ‣ Measure
  ‣ Do it again
‣ Generate reports per build

Performance Framework
‣ Read/Insert/Update combinations: 30
‣ Request targeting (per second): 8
  ‣ 500, 1000, 2000, 4000, 8000, 16000, 32000, 64000
‣ Payload sizes: 5
  ‣ 100, 500, 1000, 2000, 4000 bytes
‣ Single node vs cluster

Performance Framework

Total test combinations: 1,200
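The 1,200 total follows from the previous slide: every read/insert/update mix is run at every request rate and every payload size.

```python
# The matrix from the previous slide multiplies out to the 1,200
# total test combinations.
combinations = 30   # read/insert/update mixes
request_rates = 8   # 500 ... 64000 requests/second targets
payload_sizes = 5   # 100 ... 4000 byte payloads

print(combinations * request_rates * payload_sizes)  # 1200
```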

Summary
‣ Understand your hardware and operating system
‣ Rigorously exercise your entire stack
‣ Capacity plan with math, not guesswork
‣ Measure everything, then do it again
‣ Invest in your storage technology
‣ Automate
‣ Expect everything to fail

We’re hiring @jointheflock
