Performance is Overrated

Mark Callaghan, NEDB 2012

(Peak) Performance is overrated
▪  Focus on reducing variance rather than increasing peaks
  ▪  Capacity planning uses p95 or p99 response time
  ▪  Servers must be underutilized to tolerate variance
▪  Manageability needs more attention
  ▪  Cost of extra hardware can be predicted
  ▪  Cost of downtime cannot
  ▪  Downtime comes in many forms (server down and server too busy)

What is manageability?
▪  The rate of interrupts/server for the operations team
▪  Server count grows quickly and operations team grows slowly
▪  Quality of service must improve over time
  ▪  Does work get done?
  ▪  Does work get done on time?

[Chart: this has good average performance]

Why MySQL?
▪  It was there when we arrived
▪  We made it scale 10X
  ▪  My peers in db eng/ops are very good
  ▪  Room for new people, ideas and products
▪  I like MySQL for OLTP
  ▪  250,000 QPS on (silly) benchmarks
  ▪  InnoDB is wonderful

OLTP for the social graph
▪  Secondary indexes
▪  Index-only queries
▪  Small joins but most queries use one table
▪  Multi-row transactions
▪  Majority of workload does not need SQL/optimizer
▪  Physical and logical backup
▪  Async replication on a WAN

Does this require SQL?
▪  Most of it does not
▪  Why is the grass greener on the other side?
  ▪  Automated replacement of failed nodes
  ▪  Less downtime on schema changes or fewer schema changes
  ▪  Multi-master
  ▪  Better compression
  ▪  Write-optimized

A busy OLTP deployment circa 2010
▪  Query response time: 4 ms reads, 5 ms writes
▪  Rows read per second: 450M peak
▪  Network bytes per second: 38GB peak
▪  Rows changed per second: 3.5M peak
▪  Queries per second: 13M peak
▪  InnoDB page IO per second: 5.2M peak

Why are there so many servers?
▪  Big data X high QPS
  ▪  Per Domas we have lots of medium data (sharded MySQL)
▪  Add servers to add IOPS
▪  Flash is very interesting
▪  Write-optimized databases are very interesting

Database teams at Facebook

Operations
▪  Move fast and fix things
▪  Deploy our changes, or not
▪  Tell me what to fix

Engineering
▪  Fix bugs that stall and crash MySQL
▪  Make better bugs
▪  Market

The git log for our MySQL branch has 452 changes.

Tips on scaling: more data, more QPS
1.  Fix stalls to make use of capacity
2.  Improve efficiency to use less
3.  Repeat

Fix stalls
Don’t make MySQL faster, make it less slow
▪  Stalls from file systems
▪  Stalls from caches in MySQL
▪  Stalls from mutexes in MySQL
▪  Everything else

File system stalls
▪  Switch the IO scheduler from cfq to deadline
  ▪  Deadline is less likely to stall writes
▪  Switch from ext-3 to XFS
  ▪  XFS does not lock a per-inode mutex on writes
  ▪  XFS has less variance on write-append

Stalls from caches
Some expensive operations are deferred (a couple of knobs are sketched below)
▪  InnoDB purge removes delete-marked rows
▪  InnoDB insert buffer defers IO for secondary index maintenance
▪  Fuzzy checkpoint constraint enforcement
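A couple of these deferral mechanisms have knobs in stock MySQL; an illustrative sketch, not tuning advice:

  -- Which secondary index operations the insert (change) buffer may defer.
  SET GLOBAL innodb_change_buffering = 'all';
  -- Upper bound on dirty pages, which drives checkpoint/flushing pressure.
  SET GLOBAL innodb_max_dirty_pages_pct = 75;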

Repeat until done
Problem
▪  Arrival rate exceeds completion rate
▪  Throughput collapses when cache is full
Solution
▪  Increase completion rate

[Chart: performance drops when ibuf is full; otherwise, the insert buffer is awesome]

[Chart: Sysbench QPS at 20-second intervals with checkpoint stalls]

Stalls from mutexes
▪  Extending InnoDB files
▪  Opening InnoDB tables
▪  LOCK_open and kernel_mutex
▪  Excessive calls to fcntl
▪  Deadlock detection overhead
▪  Purge/undo lock conflicts
▪  TRUNCATE table and LOCK_open
▪  DROP table and LOCK_open
▪  innodb_thread_concurrency
▪  Group commit
▪  Admission control

Repeat until done
Problem
▪  Global mutex held while expensive operation done
▪  Requests stall
Solution
▪  Defer expensive operation until global mutex unlocked

Stalls from excessive calls to fcntl
▪  fcntl
  ▪  Some Linux kernels get the big kernel lock on fcntl calls
  ▪  MySQL called it too often
▪  Doubled peak QPS by changing MySQL to call it less
  ▪  200,000 QPS on benchmarks
▪  Problem is now fixed in official MySQL

[Chart: Sysbench read-only with fcntl fix]

Stalls from deadlock detection overhead
▪  InnoDB deadlock detection was inefficient
  ▪  O(N*N) for N threads waiting on the same row lock
▪  Fix is simple (see the sketch below)
  ▪  Disable it and rely on lock wait timeout
  ▪  Detection is now more efficient in official MySQL
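Official MySQL later exposed this tradeoff directly; a minimal sketch, assuming MySQL 5.7.15+ where innodb_deadlock_detect exists (at the time of this talk it required a patch):

  -- Skip the O(N*N) wait-for-graph scan and let timeouts break deadlocks.
  SET GLOBAL innodb_deadlock_detect = OFF;
  SET GLOBAL innodb_lock_wait_timeout = 2;  -- seconds; value is illustrative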

The cost of deadlock detection
[Chart: QPS for 1 to 1024 connections updating the same row, detection disabled vs. enabled]

Stalls from innodb_thread_concurrency
▪  Limits the maximum number of running threads (see the sketch below)
  ▪  Threads are scheduled in LIFO order
▪  With 1000+ sleeping threads it can take too long to wake one
▪  Allow some threads to run in FIFO order
  ▪  When a new thread arrives, run if other threads are slow to wake
▪  FIFO + LIFO = FLIFO
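The concurrency cap itself is a stock InnoDB setting; FLIFO scheduling was a Facebook change, not upstream. An illustrative sketch, values not recommendations:

  -- Cap on concurrently running threads inside InnoDB (0 = unlimited).
  SET GLOBAL innodb_thread_concurrency = 32;
  -- How much work a thread may do before re-entering the wait queue.
  SET GLOBAL innodb_concurrency_tickets = 500;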

[Chart: Sysbench TPS with FLIFO]

Commit stalls for MySQL
▪  This is XA when the replication log (binlog) is enabled
  ▪  InnoDB and the replication log are resource managers
▪  Commit requires 3 fsyncs, 2 can be shared (settings sketched below)
▪  HW RAID card does ~5000 fsyncs/second
  ▪  Supports ~2500 commits/second
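The fsyncs come from keeping both logs durable at commit; in stock MySQL these are the settings that force them. A sketch, not a tuning recommendation:

  -- Fsync the binlog on every commit.
  SET GLOBAL sync_binlog = 1;
  -- Fsync the InnoDB redo log on every commit.
  SET GLOBAL innodb_flush_log_at_trx_commit = 1;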

Group commit
▪  Modified MySQL to allow all fsyncs to be shared
▪  Fix was fun
  ▪  Uses a group commit timeout
  ▪  Threads only wait when other threads are about to commit (magic)
▪  Useful side effect
  ▪  Servers are better able to survive RAID battery failure

Stalls from mutex thrashing
Preserve throughput while overloaded
▪  Good – preserve the rows read rate, limit threads running
▪  Better – preserve query completion rate, limit queries running

Admission control
▪  Simple TP monitor in MySQL
▪  Limits max concurrent queries per database account (see the sketch below)
▪  Does the right thing when a query blocks on IO and lock waits
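Facebook’s admission control caps concurrent queries per account; stock MySQL only offers coarser per-account limits. A sketch with a hypothetical account name:

  -- Coarser analogue in stock MySQL (5.7+ ALTER USER syntax); the
  -- Facebook patch instead limited concurrent queries per account.
  ALTER USER 'app'@'%' WITH MAX_USER_CONNECTIONS 50;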

Stalls from the speed of light
mysql_query(“START  TRANSACTION”);   mysql_query(“INSERT  IGNORE  INTO  graph...”);   if  (mysql_affected_rows()  ==  1)    mysql_query(“INSERT  INTO  counts  ...  ON  DUPLICATE  KEY  UPDATE  c  =  c+1”)   mysql_query(“INSERT  INTO  other_table  …”)   mysql_query(“COMMIT”);  

 

The Solution – non stored procedures
One round trip for the whole transaction, sent as a single multi-statement batch:

  mysql_query(
    "START TRANSACTION;"
    "INSERT IGNORE INTO graph...;"
    "SELECT row_count() INTO @r;"
    "INSERT INTO counts ON DUPLICATE KEY UPDATE c = IF(@r = 1, c+1, c);"
    "INSERT INTO other_table ...;"
    "COMMIT"
  );
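Note: executing several statements in one mysql_query() call requires a connection opened with the CLIENT_MULTI_STATEMENTS flag, and the client must drain every result set in the batch.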

Transaction Per Second vs. Concurrency
[Chart: TPS at 0 to 200 connections for the Original, Trigger, Procedure, and Multi-Query approaches]

How did we find these problems?
▪  We know MySQL
  ▪  When does experience trump perfect software?
▪  We use PMP
  ▪  Poor Man’s Profiler
  ▪  State of the art tool for debugging stalls
  ▪  Continue to invest in making it better

This is PMP
  # gdb script: dump a stack trace from every thread, without paging
  echo "set pagination 0"    >  /tmp/pmpgdb
  echo "thread apply all bt" >> /tmp/pmpgdb
  mpid=$( pidof mysqld )
  t=$( date +'%y%m%d_%H%M%S' )
  # take one sample of all mysqld thread stacks
  gdb --command /tmp/pmpgdb --batch -p $mpid | grep -v 'New Thread' > f.$t
  # collapse each stack into one comma-separated line of function names,
  # then count and rank the distinct stacks by frequency
  cat f.$t | awk 'BEGIN { s = ""; } /Thread/ { print s; s = ""; } /^\#/ { x=index($2, "0x"); if (x == 1) { n=$4 } else { n=$2 }; if (s != "") { s = s "," n } else { s = n } } END { print s }' - | sort | uniq -c | sort -r -n -k 1,1 > h.$t

The database is slow!
▪  Paging via LIMIT x,y is O(N*N)
  ▪  Don’t allow it, or use an index to determine paging order (see the sketch below)
▪  Non index-only queries depend on a warm buffer cache
  ▪  Make them index-only
▪  Queries that examine 1M rows to return 100 rows are slow
  ▪  Define a better index
▪  Queries that might do 10,000 disk reads are slow
  ▪  Don’t do them

We repeatedly confront these problems
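A minimal sketch of the paging and index-only fixes; the table, columns, and index names are hypothetical:

  -- Hypothetical table: comments(id PRIMARY KEY, user_id, created_at, msg)

  -- Slow: the server reads and discards 100000 rows, then returns 10.
  SELECT id, msg FROM comments ORDER BY id LIMIT 100000, 10;

  -- Better: seek past the last id the client saw, then read 10 rows.
  SET @last_seen_id = 100000;
  SELECT id, msg FROM comments WHERE id > @last_seen_id ORDER BY id LIMIT 10;

  -- Index-only: a covering index answers the query without touching base rows.
  ALTER TABLE comments ADD INDEX ix_user_time (user_id, created_at, msg);
  SELECT created_at, msg FROM comments
  WHERE user_id = 42 ORDER BY created_at DESC LIMIT 10;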

Manageability: solutions
▪  Online schema change tool
▪  Dogpiled – collects data during a query pileup
  ▪  Get performance counters and the list of running queries
  ▪  Generate HTML page with interesting results
▪  Pylander – sheds load during a query pileup
  ▪  Kill duplicate queries
  ▪  Limit the number of queries from specific accounts

Schema Change
▪  Must do frequent schema changes
  ▪  Add a column, add an index, change an index
▪  ALTER TABLE can take hours on a large table
▪  ALTER TABLE can block reads and writes to the table

Our solution: Online Schema Change (OSC)
1.  Setup triggers to track changes (see the sketch below)
    §  Briefly locks the table
2.  Copy data to new table with desired schema
3.  Replay changes on new table
4.  Rename new table as the target table
    §  Briefly locks the table
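A minimal sketch of steps 1 and 2, with hypothetical names (t is the table being altered, t_new the copy, t_chg the change log); the real tool also tracks updates and deletes and copies rows in batches:

  -- Step 1: create the new-schema table and capture ongoing inserts.
  CREATE TABLE t_new LIKE t;
  ALTER TABLE t_new ADD COLUMN extra INT;  -- the desired schema change
  CREATE TABLE t_chg (id BIGINT NOT NULL PRIMARY KEY);
  CREATE TRIGGER t_osc_ins AFTER INSERT ON t
    FOR EACH ROW REPLACE INTO t_chg VALUES (NEW.id);

  -- Step 2: copy existing rows into the new schema.
  INSERT INTO t_new SELECT *, NULL FROM t;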

Manageability: work in progress
▪  Make InnoDB compression work for OLTP
▪  Faker – tool for prefetching for replication slaves
▪  Auto replacement – replace failed and unhealthy MySQL servers
▪  Auto resharding – sharding is easy, resharding is hard

Faker
▪  Replication replay is page read – modify – page write
  ▪  Bottleneck might be disk reads
  ▪  Work done by a single thread
  ▪  Transactions on master are concurrent
▪  Faker
  ▪  Multiple threads replay transactions in fake-changes mode on slaves
  ▪  Captures 70% of disk reads, work in progress to improve the rate

Manageability: open issues
▪  Why is one host slow?
▪  Why is the database tier doing a lot more work today?
▪  Where do I spend the next N dollars (memory, disk, flash)?
▪  How do I run a workload across old (slow) and new (fast) servers?
▪  How do I integrate cache and database tiers?
▪  What monitoring signals generate useful interrupts?

World has a surplus of clever ideas
▪  Getting things into production is the hard part
▪  Run a server in production before writing a new one
▪  Invest more in monitoring, debugging and tuning

Read more at facebook.com/MySQLatFacebook

