Performance is Overrated

Mark Callaghan NEDB 2012

(Peak) Performance is overrated
▪  Focus on reducing variance rather than increasing peaks
    ▪  Capacity planning uses p95 or p99 response time
    ▪  Servers must be underutilized to tolerate variance
▪  Manageability needs more attention
    ▪  Cost of extra hardware can be predicted
    ▪  Cost of downtime cannot
    ▪  Downtime comes in many forms (server down and server too busy)
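
To make the variance point concrete, here is a minimal sketch that compares the mean with p95 and p99 for two made-up sets of response times. The sample values and the nearest-rank `percentile` helper are assumptions for illustration, not measurements from the talk.

```python
import math
import statistics

def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical response times in milliseconds. Both servers have the same
# mean, but the second one stalls occasionally.
steady = [5.0] * 100
stally = [4.0] * 98 + [54.0] * 2

for name, samples in (("steady", steady), ("stally", stally)):
    print(name, statistics.mean(samples), percentile(samples, 95), percentile(samples, 99))
```

Both servers report a 5 ms mean, yet the second has a 54 ms p99, which is the number a capacity plan built on p99 would have to absorb.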

What is manageability?
▪  The rate of interrupts/server for the operations team
▪  Server count grows quickly and operations team grows slowly
▪  Quality of service must improve over time
    ▪  Does work get done?
    ▪  Does work get done on time?

This has good average performance

Why MySQL?
▪  It was there when we arrived
▪  We made it scale 10X
    ▪  My peers in db eng/ops are very good
    ▪  Room for new people, ideas and products
▪  I like MySQL for OLTP
    ▪  250,000 QPS on (silly) benchmarks
    ▪  InnoDB is wonderful

OLTP for the social graph
▪  Secondary indexes
▪  Index-only queries
▪  Small joins but most queries use one table
▪  Multi-row transactions
▪  Majority of workload does not need SQL/optimizer
▪  Physical and logical backup
▪  Async replication on a WAN

Does this require SQL?
▪  Most of it does not
▪  Why is the grass greener on the other side?
    ▪  Automated replacement of failed nodes
    ▪  Less downtime on schema changes or fewer schema changes
    ▪  Multi-master
    ▪  Better compression
    ▪  Write-optimized

A busy OLTP deployment circa 2010
▪  Query response time: 4 ms reads, 5 ms writes
▪  Rows read per second: 450M peak
▪  Network bytes per second: 38GB peak
▪  Rows changed per second: 3.5M peak
▪  Queries per second: 13M peak
▪  InnoDB page IO per second: 5.2M peak

Why are there so many servers?
▪  Big data X high QPS
    ▪  Per Domas we have lots of medium data (sharded MySQL)
▪  Add servers to add IOPS
▪  Flash is very interesting
▪  Write-optimized databases are very interesting

Database teams at Facebook
▪  Move fast and fix things
▪  Fix bugs that stall and crash
▪  Deploy our changes, or not
▪  Tell me what to fix
▪  Make MySQL better
▪  Market bugs

The git log for our MySQL branch has 452 changes.

Tips on scaling: more data, more QPS
1.  Fix stalls to make use of capacity
2.  Improve efficiency to use less
3.  Repeat

Fix stalls
Don’t make MySQL faster, make it less slow
▪  Stalls from file systems
▪  Stalls from caches in MySQL
▪  Stalls from mutexes in MySQL
▪  Everything else

File system stalls
▪  Switch the IO scheduler from cfq to deadline
    ▪  Deadline is less likely to stall writes
▪  Switch from ext-3 to XFS
    ▪  XFS does not lock a per-inode mutex on writes
    ▪  XFS has less variance on write-append
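
A small sketch for checking which IO scheduler a block device is using. It assumes the usual Linux sysfs layout, where `/sys/block/<device>/queue/scheduler` lists the available schedulers with the active one in square brackets. The helper names and the device name passed in are hypothetical.

```python
from pathlib import Path

def active_scheduler(contents):
    """Return the bracketed (active) scheduler from a sysfs scheduler line, or None."""
    for token in contents.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None

def read_scheduler(device):
    # Path follows the usual Linux sysfs layout; the device name is supplied by the caller.
    return active_scheduler(Path(f"/sys/block/{device}/queue/scheduler").read_text())

print(active_scheduler("noop deadline [cfq]"))  # cfq
```

A fleet-wide check could call `read_scheduler` for each data device and flag hosts that still report `cfq`.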

Stalls from caches
Some expensive operations are deferred
▪  InnoDB purge removes delete-marked rows
▪  InnoDB insert buffer defers IO for secondary index maintenance
▪  Fuzzy checkpoint constraint enforcement

Repeat until done
▪  Arrival rate exceeds completion rate
▪  Throughput collapses when cache is full

Increase completion rate
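
The pattern above can be sketched with a toy model of a bounded cache of deferred work. The `simulate` function and all of its numbers are assumptions chosen for illustration; they do not model any specific InnoDB structure.

```python
def simulate(arrival, completion, capacity, ticks):
    """Return foreground throughput per tick for a bounded deferred-work cache."""
    backlog = 0
    throughput = []
    for _ in range(ticks):
        backlog = max(0, backlog - completion)        # background work drains the cache
        accepted = min(arrival, capacity - backlog)   # foreground blocks when the cache is full
        backlog += accepted
        throughput.append(accepted)
    return throughput

print(simulate(arrival=10, completion=4, capacity=30, ticks=8))
print(simulate(arrival=10, completion=10, capacity=30, ticks=8))
```

In the first run throughput holds at 10 per tick while the cache fills and then drops to the completion rate of 4. In the second run, with completion raised to match arrival, it never drops.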

Performance drops when ibuf is full

Otherwise, the insert buffer is awesome

Sysbench QPS at 20 second intervals with checkpoint stalls

Stalls from mutexes
▪  Extending InnoDB files
▪  Opening InnoDB tables
▪  LOCK_open and kernel_mutex
▪  Excessive calls to fcntl
▪  Deadlock detection overhead
▪  Purge/undo lock conflicts
▪  TRUNCATE table and LOCK_open
▪  DROP table and LOCK_open
▪  innodb_thread_concurrency
▪  Group commit
▪  Admission control

Repeat until done
▪  Global mutex held while expensive operation done
▪  Requests stall

Defer expensive operation until global mutex unlocked
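
A minimal sketch of the fix pattern, using Python threads rather than MySQL internals. `registry_lock` stands in for a global mutex such as LOCK_open and `expensive_open` for the slow operation. All names are hypothetical.

```python
import threading
import time

registry_lock = threading.Lock()   # stands in for a global mutex
registry = {}

def expensive_open(name):
    time.sleep(0.01)               # stands in for slow work such as file IO
    return {"name": name}

def open_table_stalling(name):
    with registry_lock:            # every other caller waits for the slow work
        if name not in registry:
            registry[name] = expensive_open(name)
        return registry[name]

def open_table_deferred(name):
    with registry_lock:
        handle = registry.get(name)
    if handle is None:
        handle = expensive_open(name)   # slow work runs without the lock held
        with registry_lock:
            # Another thread may have won the race; keep the first entry.
            handle = registry.setdefault(name, handle)
    return handle
```

The deferred version trades a little duplicated work under contention for a lock that is only ever held for a dictionary lookup or insert.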

Stalls from excessive calls to fcntl
▪  fcntl
    ▪  Some Linux kernels get the big kernel lock on fcntl calls
    ▪  MySQL called it too often
▪  Doubled peak QPS by changing MySQL to call it less
    ▪  200,000 QPS on benchmarks
▪  Problem is now fixed in official MySQL

Sysbench read-only with fcntl fix

Stalls from deadlock detection overhead
▪  InnoDB deadlock detection was inefficient
    ▪  O(N*N) for N threads waiting on the same row lock
▪  Fix is simple
    ▪  Disable it and rely on lock wait timeout
    ▪  Detection is now more efficient in official MySQL
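
To show how quickly quadratic cost grows, here is a deliberately simplified model in which each arriving waiter examines every waiter already queued on the row. This is an assumption made for illustration, not a description of InnoDB's actual detection algorithm.

```python
def detection_steps(waiters):
    """Count lock-queue entries examined if each arriving waiter scans all earlier waiters."""
    steps = 0
    for position in range(waiters):
        steps += position     # the new waiter inspects everyone already queued
    return steps

for n in (1, 32, 1024):
    print(n, detection_steps(n))
```

Under this model 32 waiters cost 496 checks and 1024 waiters cost 523,776, which is why a hot row with many connections hurts so much more than the connection count alone suggests.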

The cost of deadlock detection
QPS for 1 to 1024 connections updating the same row
[Figure: QPS (y-axis, 0 to 3000) against number of connections (x-axis, 1 to 1024, doubling); series: Disabled, Enabled]

Stalls from innodb_thread_concurrency
▪  Limits the maximum number of running threads
    ▪  Threads are scheduled in FIFO order
▪  With 1000+ sleeping threads it can take too long to wake one
▪  Allow some threads to run in LIFO order
    ▪  When new thread arrives run if other threads are slow to wake

Sysbench TPS with FLIFO

Commit stalls for MySQL
▪  This is XA when the replication log (binlog) is enabled
    ▪  InnoDB and replication log are resource managers
▪  Commit requires 3 fsyncs, 2 can be shared
▪  HW RAID card does ~5000 fsyncs/second
    ▪  Supports ~2500 commits/second

Group commit
▪  Modified MySQL to allow all fsyncs to be shared
▪  Fix was fun
    ▪  Uses a group commit timeout
    ▪  Threads only wait when other threads are about to commit (magic)
▪  Useful side effect
    ▪  Servers are better able to survive RAID battery failure
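
A rough sketch of why sharing fsyncs helps, assuming commits that arrive within a fixed window of a group's first commit are flushed together. `count_fsyncs`, the window and the arrival times are made up. The sketch does not model the "only wait when others are about to commit" part of the real fix.

```python
def count_fsyncs(commit_times, window):
    """Count fsyncs when commits arriving within `window` of a group's first commit share one fsync."""
    fsyncs = 0
    group_deadline = None
    for t in sorted(commit_times):
        if group_deadline is None or t > group_deadline:
            fsyncs += 1                 # start a new group with its own fsync
            group_deadline = t + window
    return fsyncs

arrivals = [0.0, 0.1, 0.2, 1.0, 1.1, 5.0]   # hypothetical commit times in ms
print(count_fsyncs(arrivals, window=0.0))   # 6: every commit pays for its own fsync
print(count_fsyncs(arrivals, window=0.5))   # 3: nearby commits share an fsync
```

The fewer fsyncs each commit needs, the less commit throughput depends on how many fsyncs per second the storage can deliver.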

Stalls from mutex thrashing
Preserve throughput while overloaded
▪  Good – preserve the rows read rate, limit threads running
▪  Better – preserve query completion rate, limit queries running

Admission control
▪  Simple TP monitor in MySQL
▪  Limits max concurrent queries per database account
▪  Does the right thing when a query blocks on IO and lock waits
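
A minimal sketch of a per-account limit on concurrently running queries. The `AdmissionControl` class and its method names are hypothetical and only illustrate the idea. The server-side feature described above lives inside MySQL, not in client code.

```python
import threading
from collections import defaultdict

class AdmissionControl:
    """Caps the number of concurrently running queries per account."""

    def __init__(self, max_running):
        self._max_running = max_running
        self._running = defaultdict(int)
        self._lock = threading.Lock()

    def try_enter(self, account):
        # Admit the query only if the account is below its limit.
        with self._lock:
            if self._running[account] >= self._max_running:
                return False
            self._running[account] += 1
            return True

    def leave(self, account):
        # Called when a query finishes and frees its slot.
        with self._lock:
            self._running[account] -= 1

control = AdmissionControl(max_running=2)
print(control.try_enter("app"), control.try_enter("app"), control.try_enter("app"))
```

One reading of the last bullet is that a query gives up its slot while it waits on IO or a lock. In this sketch that would be a `leave` before the wait and a fresh `try_enter` after it.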

Stalls from the speed of light
mysql_query("START TRANSACTION");
mysql_query("INSERT IGNORE INTO graph...");
if (mysql_affected_rows() == 1)
    mysql_query("INSERT INTO counts ... ON DUPLICATE KEY UPDATE c = c+1");
mysql_query("INSERT INTO other_table …");
mysql_query("COMMIT");


The Solution – non stored procedures
mysql_query(
    "START TRANSACTION;"
    "INSERT IGNORE INTO graph...;"
    "SELECT row_count() INTO @r;"
    "INSERT INTO counts ... ON DUPLICATE KEY UPDATE c = IF(@r = 1, c+1, c);"
    "INSERT INTO other_table ...;"
    "COMMIT"
);
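
A back-of-the-envelope sketch of why batching helps on a WAN. The round-trip time and per-statement server time are invented numbers and `transaction_ms` is a hypothetical helper. Only the statement counts come from the two snippets above.

```python
def transaction_ms(statements, round_trips, rtt_ms, server_ms_per_statement):
    """Elapsed time for a transaction: network round trips plus server work."""
    return round_trips * rtt_ms + statements * server_ms_per_statement

RTT_MS = 70.0        # hypothetical cross-country round trip
SERVER_MS = 0.5      # hypothetical server time per statement

one_by_one = transaction_ms(statements=5, round_trips=5, rtt_ms=RTT_MS,
                            server_ms_per_statement=SERVER_MS)
multi_query = transaction_ms(statements=6, round_trips=1, rtt_ms=RTT_MS,
                             server_ms_per_statement=SERVER_MS)
print(one_by_one, multi_query)   # 352.5 73.0
```

The multi-query form runs one extra statement but pays for a single round trip. Row locks taken by the transaction are also held for roughly that shorter time.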

Transactions per Second vs. Concurrency
[Figure: TPS (y-axis, 0 to 2400) against concurrency (x-axis, 0 to 200); series: Original, Trigger, Procedure, Multi-Query]

How did we find these problems?
▪  We know MySQL
    ▪  When does experience trump perfect software?
▪  We use PMP
    ▪  Poor Man’s Profiler
    ▪  State of the art tool for debugging stalls
    ▪  Continue to invest in making it better

This is PMP
echo "set pagination 0" > /tmp/pmpgdb
echo "thread apply all bt" >> /tmp/pmpgdb
mpid=$( pidof mysqld )
t=$( date +'%y%m%d_%H%M%S' )
gdb --command /tmp/pmpgdb --batch -p $mpid | grep -v 'New Thread' > f.$t
cat f.$t | awk 'BEGIN { s = ""; }
  /Thread/ { print s; s = ""; }
  /^\#/ { x=index($2, "0x"); if (x == 1) { n=$4 } else { n=$2 }; if (s != "" ) { s = s "," n} else { s = n } }
  END { print s }' - | sort | uniq -c | sort -r -n -k 1,1 > h.$t
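
For readers who find the awk hard to follow, here is a rough Python equivalent of the aggregation step. It collapses each thread's backtrace into a comma-separated list of function names and counts identical stacks. The sample input is invented to resemble `thread apply all bt` output and the function name `aggregate_stacks` is hypothetical.

```python
from collections import Counter

def aggregate_stacks(gdb_output):
    """Collapse each thread's backtrace to a comma-separated call list and count duplicates."""
    counts = Counter()
    frames = []
    for line in gdb_output.splitlines():
        if "Thread" in line:               # a thread header starts a new stack
            if frames:
                counts[",".join(frames)] += 1
            frames = []
        elif line.startswith("#"):
            fields = line.split()
            if len(fields) < 2:
                continue
            # Frames look like "#0  0xADDR in func (...)" or "#1  func (...)".
            if fields[1].startswith("0x") and len(fields) > 3:
                frames.append(fields[3])
            else:
                frames.append(fields[1])
    if frames:
        counts[",".join(frames)] += 1
    return counts.most_common()

# Invented sample resembling gdb output; addresses and names are placeholders.
sample = """Thread 3 (Thread 0x7f01 (LWP 11)):
#0  0x00007f0000000001 in pthread_cond_wait () from /lib/libpthread.so.0
#1  0x0000000000400002 in os_event_wait ()
Thread 2 (Thread 0x7f02 (LWP 12)):
#0  0x00007f0000000001 in pthread_cond_wait () from /lib/libpthread.so.0
#1  0x0000000000400002 in os_event_wait ()
Thread 1 (Thread 0x7f03 (LWP 13)):
#0  0x00007f0000000003 in poll () from /lib/libc.so.6
#1  handle_connections () at mysqld.cc:1
"""

for stack, count in aggregate_stacks(sample):
    print(count, stack)
```

The most common stack appears first, so a pileup on one mutex or one wait shows up as a single line with a large count.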

The database is slow!
▪  Paging via LIMIT x,y is O(N*N)
    ▪  Don’t allow it or use an index to determine paging order
▪  Non index-only queries depend on a warm buffer cache
    ▪  Make them index-only
▪  Queries that examine 1M rows to return 100 rows are slow
    ▪  Define a better index
▪  Queries that might do 10,000 disk reads are slow
    ▪  Don’t do them

We repeatedly confront these problems
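
The paging advice can be sketched with SQLite from the Python standard library, used here only because it is self-contained. The same query shapes apply to MySQL. The table, column and function names are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO items VALUES (?, ?)",
                 [(i, f"row {i}") for i in range(1, 1001)])

def page_by_offset(offset, size):
    # The engine still walks past `offset` rows before returning anything.
    return conn.execute("SELECT id FROM items ORDER BY id LIMIT ? OFFSET ?",
                        (size, offset)).fetchall()

def page_after(last_id, size):
    # The primary key index seeks straight to the first row after `last_id`.
    return conn.execute("SELECT id FROM items WHERE id > ? ORDER BY id LIMIT ?",
                        (last_id, size)).fetchall()

def read_all(size):
    ids, last_id = [], 0
    while True:
        page = page_after(last_id, size)
        if not page:
            return ids
        ids.extend(row[0] for row in page)
        last_id = ids[-1]

def rows_walked_by_offset(total, size):
    # Page k skips k*size rows and then reads `size` more.
    return sum(offset + size for offset in range(0, total, size))

print(len(read_all(100)), rows_walked_by_offset(1000, 100))
```

Reading 1000 rows in pages of 100 touches 1000 rows with the index-driven form and 5500 with offsets, and the gap grows quadratically with the number of pages.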

Manageability: solutions
▪  Online schema change tool
▪  Dogpiled – collects data during a query pileup
    ▪  Get performance counters and the list of running queries
    ▪  Generate HTML page with interesting results
▪  Pylander – sheds load during a query pileup
    ▪  Kill duplicate queries
    ▪  Limit the number of queries from specific accounts
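
A hedged sketch of the "kill duplicate queries" idea. It is not Pylander's code. The row fields mirror the Id, Info and Time columns of `SHOW PROCESSLIST`, and the policy of keeping the longest-running copy is an assumption made for the example.

```python
from collections import defaultdict

def duplicates_to_kill(processlist):
    """Return ids of running queries whose text duplicates another running query."""
    by_text = defaultdict(list)
    for row in processlist:
        by_text[row["info"]].append(row)
    kill = []
    for rows in by_text.values():
        rows.sort(key=lambda r: r["time"], reverse=True)   # longest-running first
        kill.extend(r["id"] for r in rows[1:])             # keep one copy, shed the rest
    return sorted(kill)

# Hypothetical snapshot of running queries.
snapshot = [
    {"id": 11, "info": "SELECT * FROM graph WHERE id1 = 7", "time": 30},
    {"id": 12, "info": "SELECT * FROM graph WHERE id1 = 7", "time": 12},
    {"id": 13, "info": "SELECT * FROM graph WHERE id1 = 7", "time": 3},
    {"id": 14, "info": "SELECT c FROM counts WHERE id = 9", "time": 1},
]
print(duplicates_to_kill(snapshot))   # [12, 13]
```

A real tool would then issue `KILL QUERY` for each returned id, which is left out here.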

Schema Change
▪  Must do frequent schema changes
▪  Add a column, add an index, change an index
▪  ALTER TABLE can take hours on a large table
▪  ALTER TABLE can block reads and writes to the table

Our solution: Online Schema Change (OSC)
1.  Set up triggers to track changes (briefly locks the table)
2.  Copy data to new table with desired schema
3.  Replay changes on new table
4.  Rename new table as the target table (briefly locks the table)
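
The four steps can be sketched end to end with SQLite from the Python standard library. This illustrates the idea only and is not the OSC tool. The table `t`, the change-log table `t_changes` and the added `extra` column are all made up, and the brief table locks mentioned above are not modelled.

```python
import sqlite3

def setup_triggers(conn):
    # Step 1: record the key of every changed row while the copy runs.
    conn.executescript("""
        CREATE TABLE t_changes (seq INTEGER PRIMARY KEY AUTOINCREMENT, id INTEGER);
        CREATE TRIGGER t_ins AFTER INSERT ON t BEGIN INSERT INTO t_changes (id) VALUES (NEW.id); END;
        CREATE TRIGGER t_upd AFTER UPDATE ON t BEGIN INSERT INTO t_changes (id) VALUES (NEW.id); END;
        CREATE TRIGGER t_del AFTER DELETE ON t BEGIN INSERT INTO t_changes (id) VALUES (OLD.id); END;
    """)

def copy_rows(conn):
    # Step 2: build the table with the desired schema (here: one extra column).
    conn.executescript("""
        CREATE TABLE t_new (id INTEGER PRIMARY KEY, v TEXT, extra TEXT DEFAULT '');
        INSERT INTO t_new (id, v) SELECT id, v FROM t;
    """)

def replay_changes(conn):
    # Step 3: bring every logged row in t_new in line with its current state in t.
    ids = [row[0] for row in conn.execute("SELECT DISTINCT id FROM t_changes")]
    for row_id in ids:
        conn.execute("DELETE FROM t_new WHERE id = ?", (row_id,))
        conn.execute("INSERT INTO t_new (id, v) SELECT id, v FROM t WHERE id = ?", (row_id,))
    conn.execute("DELETE FROM t_changes")
    conn.commit()

def swap_tables(conn):
    # Step 4: drop the tracking objects and rename the new table into place.
    conn.executescript("""
        DROP TRIGGER t_ins; DROP TRIGGER t_upd; DROP TRIGGER t_del;
        DROP TABLE t_changes;
        ALTER TABLE t RENAME TO t_old;
        ALTER TABLE t_new RENAME TO t;
    """)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, v TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")])
conn.commit()

setup_triggers(conn)
copy_rows(conn)
# Writes that arrive after the copy has been taken.
conn.execute("UPDATE t SET v = 'B' WHERE id = 2")
conn.execute("DELETE FROM t WHERE id = 3")
conn.execute("INSERT INTO t VALUES (4, 'd')")
replay_changes(conn)
swap_tables(conn)
print(conn.execute("SELECT id, v, extra FROM t ORDER BY id").fetchall())
```

Logging only the key and re-reading the current row during replay keeps the replay idempotent, so the same change can be applied more than once without harm.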

Manageability: work in progress
▪  Make InnoDB compression work for OLTP
▪  Faker – tool for prefetching for replication slaves
▪  Auto replacement – replace failed and unhealthy MySQL servers
▪  Auto resharding – sharding is easy, resharding is hard

▪  Replication replay is page read – modify – page write
    ▪  Bottleneck might be disk reads
    ▪  Work done by a single thread
    ▪  Transactions on master are concurrent
▪  Faker
    ▪  Multiple threads replay transactions in fake-changes mode on slaves
    ▪  Captures 70% of disk reads, work in progress to improve the rate

Manageability: open issues
▪  Why is one host slow?
▪  Why is the database tier doing a lot more work today?
▪  Where do I spend the next N dollars (memory, disk, flash)?
▪  How do I run a workload across old (slow) and new (fast) servers?
▪  How do I integrate cache and database tiers?
▪  What monitoring signals generate useful interrupts?

World has a surplus of clever ideas
▪  Getting things into production is the hard part
▪  Run a server in production before writing a new one
▪  Invest more in monitoring, debugging and tuning


(c) 2007 Facebook, Inc. or its licensors.  "Facebook" is a registered trademark of Facebook, Inc.. All rights reserved. 1.0
