Performance is Overrated

Mark Callaghan, NEDB 2012

(Peak) Performance is overrated
▪  Focus on reducing variance rather than increasing peaks
  ▪  Capacity planning uses p95 or p99 response time
  ▪  Servers must be underutilized to tolerate variance
▪  Manageability needs more attention
  ▪  Cost of extra hardware can be predicted
  ▪  Cost of downtime cannot
  ▪  Downtime comes in many forms (server down and server too busy)

What is manageability?
▪  The rate of interrupts/server for the operations team
▪  Server count grows quickly and operations team grows slowly
▪  Quality of service must improve over time
  ▪  Does work get done?
  ▪  Does work get done on time?

[Chart: this has good average performance]

Why MySQL?
▪  It was there when we arrived
▪  We made it scale 10X
  ▪  My peers in db eng/ops are very good
  ▪  Room for new people, ideas and products
▪  I like MySQL for OLTP
  ▪  250,000 QPS on (silly) benchmarks
  ▪  InnoDB is wonderful

OLTP for the social graph
▪  Secondary indexes
▪  Index-only queries
▪  Small joins but most queries use one table
▪  Multi-row transactions
▪  Majority of workload does not need SQL/optimizer
▪  Physical and logical backup
▪  Async replication on a WAN

Does this require SQL?
▪  Most of it does not
▪  Why is the grass greener on the other side?
  ▪  Automated replacement of failed nodes
  ▪  Less downtime on schema changes or fewer schema changes
  ▪  Multi-master
  ▪  Better compression
  ▪  Write-optimized

A busy OLTP deployment circa 2010
▪  Query response time: 4 ms reads, 5 ms writes
▪  Rows read per second: 450M peak
▪  Network bytes per second: 38GB peak
▪  Rows changed per second: 3.5M peak
▪  Queries per second: 13M peak
▪  InnoDB page IO per second: 5.2M peak

Why are there so many servers?
▪  Big data X high QPS
  ▪  Per Domas we have lots of medium data (sharded MySQL)
▪  Add servers to add IOPS
▪  Flash is very interesting
▪  Write-optimized databases are very interesting

Database teams at Facebook

Operations
▪  Move fast and fix things
▪  Deploy our changes, or not
▪  Tell me what to fix

Engineering
▪  Fix bugs that stall and crash MySQL
▪  Make better bugs
▪  Market

The git log for our MySQL branch has 452 changes.

Tips on scaling: more data, more QPS
1.  Fix stalls to make use of capacity
2.  Improve efficiency to use less
3.  Repeat

Fix stalls
Don’t make MySQL faster, make it less slow
▪  Stalls from file systems
▪  Stalls from caches in MySQL
▪  Stalls from mutexes in MySQL
▪  Everything else

File system stalls
▪  Switch the IO scheduler from cfq to deadline
  ▪  Deadline is less likely to stall writes
▪  Switch from ext-3 to XFS
  ▪  XFS does not lock a per-inode mutex on writes
  ▪  XFS has less variance on write-append

Stalls from caches
Some expensive operations are deferred (a couple of knobs are sketched below)
▪  InnoDB purge removes delete-marked rows
▪  InnoDB insert buffer defers IO for secondary index maintenance
▪  Fuzzy checkpoint constraint enforcement
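A couple of these deferral mechanisms have knobs in stock MySQL; an illustrative sketch, not tuning advice:

  -- Which secondary index operations the insert (change) buffer may defer.
  SET GLOBAL innodb_change_buffering = 'all';
  -- Upper bound on dirty pages, which drives checkpoint/flushing pressure.
  SET GLOBAL innodb_max_dirty_pages_pct = 75;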

Repeat until done
Problem
▪  Arrival rate exceeds completion rate
▪  Throughput collapses when cache is full
Solution
▪  Increase completion rate

[Chart: performance drops when ibuf is full; otherwise, the insert buffer is awesome]

[Chart: Sysbench QPS at 20-second intervals with checkpoint stalls]

Stalls from mutexes
▪  Extending InnoDB files
▪  Opening InnoDB tables
▪  LOCK_open and kernel_mutex
▪  Excessive calls to fcntl
▪  Deadlock detection overhead
▪  Purge/undo lock conflicts
▪  TRUNCATE table and LOCK_open
▪  DROP table and LOCK_open
▪  innodb_thread_concurrency
▪  Group commit
▪  Admission control

Repeat until done
Problem
▪  Global mutex held while expensive operation done
▪  Requests stall
Solution
▪  Defer expensive operation until global mutex unlocked

Stalls from excessive calls to fcntl
▪  fcntl
  ▪  Some Linux kernels get the big kernel lock on fcntl calls
  ▪  MySQL called it too often
▪  Doubled peak QPS by changing MySQL to call it less
  ▪  200,000 QPS on benchmarks
▪  Problem is now fixed in official MySQL

[Chart: Sysbench read-only with fcntl fix]

Stalls from deadlock detection overhead
▪  InnoDB deadlock detection was inefficient
  ▪  O(N*N) for N threads waiting on the same row lock
▪  Fix is simple (see the sketch below)
  ▪  Disable it and rely on lock wait timeout
  ▪  Detection is now more efficient in official MySQL
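Official MySQL later exposed this tradeoff directly; a minimal sketch, assuming MySQL 5.7.15+ where innodb_deadlock_detect exists (at the time of this talk it required a patch):

  -- Skip the O(N*N) wait-for-graph scan and let timeouts break deadlocks.
  SET GLOBAL innodb_deadlock_detect = OFF;
  SET GLOBAL innodb_lock_wait_timeout = 2;  -- seconds; value is illustrative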

The cost of deadlock detection
[Chart: QPS for 1 to 1024 connections updating the same row, detection disabled vs. enabled]

Stalls from innodb_thread_concurrency
▪  Limits the maximum number of running threads (see the sketch below)
  ▪  Threads are scheduled in LIFO order
▪  With 1000+ sleeping threads it can take too long to wake one
▪  Allow some threads to run in FIFO order
  ▪  When a new thread arrives, run if other threads are slow to wake
▪  FIFO + LIFO = FLIFO
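The concurrency cap itself is a stock InnoDB setting; FLIFO scheduling was a Facebook change, not upstream. An illustrative sketch, values not recommendations:

  -- Cap on concurrently running threads inside InnoDB (0 = unlimited).
  SET GLOBAL innodb_thread_concurrency = 32;
  -- How much work a thread may do before re-entering the wait queue.
  SET GLOBAL innodb_concurrency_tickets = 500;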

[Chart: Sysbench TPS with FLIFO]

Commit stalls for MySQL
▪  This is XA when the replication log (binlog) is enabled
  ▪  InnoDB and the replication log are resource managers
▪  Commit requires 3 fsyncs, 2 can be shared (settings sketched below)
▪  HW RAID card does ~5000 fsyncs/second
  ▪  Supports ~2500 commits/second
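The fsyncs come from keeping both logs durable at commit; in stock MySQL these are the settings that force them. A sketch, not a tuning recommendation:

  -- Fsync the binlog on every commit.
  SET GLOBAL sync_binlog = 1;
  -- Fsync the InnoDB redo log on every commit.
  SET GLOBAL innodb_flush_log_at_trx_commit = 1;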

Group commit
▪  Modified MySQL to allow all fsyncs to be shared
▪  Fix was fun
  ▪  Uses a group commit timeout
  ▪  Threads only wait when other threads are about to commit (magic)
▪  Useful side effect
  ▪  Servers are better able to survive RAID battery failure

Stalls from mutex thrashing
Preserve throughput while overloaded
▪  Good – preserve the rows read rate, limit threads running
▪  Better – preserve query completion rate, limit queries running

Admission control
▪  Simple TP monitor in MySQL
▪  Limits max concurrent queries per database account (see the sketch below)
▪  Does the right thing when a query blocks on IO and lock waits
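Facebook’s admission control caps concurrent queries per account; stock MySQL only offers coarser per-account limits. A sketch with a hypothetical account name:

  -- Coarser analogue in stock MySQL (5.7+ ALTER USER syntax); the
  -- Facebook patch instead limited concurrent queries per account.
  ALTER USER 'app'@'%' WITH MAX_USER_CONNECTIONS 50;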

Stalls from the speed of light
mysql_query(“START  TRANSACTION”);   mysql_query(“INSERT  IGNORE  INTO  graph...”);   if  (mysql_affected_rows()  ==  1)    mysql_query(“INSERT  INTO  counts  ...  ON  DUPLICATE  KEY  UPDATE  c  =  c+1”)   mysql_query(“INSERT  INTO  other_table  …”)   mysql_query(“COMMIT”);  

 

The Solution – non stored procedures
One round trip for the whole transaction, sent as a single multi-statement batch:

  mysql_query(
    "START TRANSACTION;"
    "INSERT IGNORE INTO graph...;"
    "SELECT row_count() INTO @r;"
    "INSERT INTO counts ON DUPLICATE KEY UPDATE c = IF(@r = 1, c+1, c);"
    "INSERT INTO other_table ...;"
    "COMMIT"
  );
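Note: executing several statements in one mysql_query() call requires a connection opened with the CLIENT_MULTI_STATEMENTS flag, and the client must drain every result set in the batch.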

Transaction Per Second vs. Concurrency
[Chart: TPS at 0 to 200 connections for the Original, Trigger, Procedure, and Multi-Query approaches]

How did we find these problems?
▪  We know MySQL
  ▪  When does experience trump perfect software?
▪  We use PMP
  ▪  Poor Man’s Profiler
  ▪  State of the art tool for debugging stalls
  ▪  Continue to invest in making it better

This is PMP
  # gdb script: dump a stack trace from every thread, without paging
  echo "set pagination 0"    >  /tmp/pmpgdb
  echo "thread apply all bt" >> /tmp/pmpgdb
  mpid=$( pidof mysqld )
  t=$( date +'%y%m%d_%H%M%S' )
  # take one sample of all mysqld thread stacks
  gdb --command /tmp/pmpgdb --batch -p $mpid | grep -v 'New Thread' > f.$t
  # collapse each stack into one comma-separated line of function names,
  # then count and rank the distinct stacks by frequency
  cat f.$t | awk 'BEGIN { s = ""; } /Thread/ { print s; s = ""; } /^\#/ { x=index($2, "0x"); if (x == 1) { n=$4 } else { n=$2 }; if (s != "") { s = s "," n } else { s = n } } END { print s }' - | sort | uniq -c | sort -r -n -k 1,1 > h.$t

The database is slow!
▪  Paging via LIMIT x,y is O(N*N)
  ▪  Don’t allow it, or use an index to determine paging order (see the sketch below)
▪  Non index-only queries depend on a warm buffer cache
  ▪  Make them index-only
▪  Queries that examine 1M rows to return 100 rows are slow
  ▪  Define a better index
▪  Queries that might do 10,000 disk reads are slow
  ▪  Don’t do them

We repeatedly confront these problems
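A minimal sketch of the paging and index-only fixes; the table, columns, and index names are hypothetical:

  -- Hypothetical table: comments(id PRIMARY KEY, user_id, created_at, msg)

  -- Slow: the server reads and discards 100000 rows, then returns 10.
  SELECT id, msg FROM comments ORDER BY id LIMIT 100000, 10;

  -- Better: seek past the last id the client saw, then read 10 rows.
  SET @last_seen_id = 100000;
  SELECT id, msg FROM comments WHERE id > @last_seen_id ORDER BY id LIMIT 10;

  -- Index-only: a covering index answers the query without touching base rows.
  ALTER TABLE comments ADD INDEX ix_user_time (user_id, created_at, msg);
  SELECT created_at, msg FROM comments
  WHERE user_id = 42 ORDER BY created_at DESC LIMIT 10;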

Manageability: solutions
▪  Online schema change tool
▪  Dogpiled – collects data during a query pileup
  ▪  Get performance counters and the list of running queries
  ▪  Generate HTML page with interesting results
▪  Pylander – sheds load during a query pileup
  ▪  Kill duplicate queries
  ▪  Limit the number of queries from specific accounts

Schema Change
▪  Must do frequent schema changes
  ▪  Add a column, add an index, change an index
▪  ALTER TABLE can take hours on a large table
▪  ALTER TABLE can block reads and writes to the table

Our solution: Online Schema Change (OSC)
1.  Setup triggers to track changes (see the sketch below)
    §  Briefly locks the table
2.  Copy data to new table with desired schema
3.  Replay changes on new table
4.  Rename new table as the target table
    §  Briefly locks the table
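A minimal sketch of steps 1 and 2, with hypothetical names (t is the table being altered, t_new the copy, t_chg the change log); the real tool also tracks updates and deletes and copies rows in batches:

  -- Step 1: create the new-schema table and capture ongoing inserts.
  CREATE TABLE t_new LIKE t;
  ALTER TABLE t_new ADD COLUMN extra INT;  -- the desired schema change
  CREATE TABLE t_chg (id BIGINT NOT NULL PRIMARY KEY);
  CREATE TRIGGER t_osc_ins AFTER INSERT ON t
    FOR EACH ROW REPLACE INTO t_chg VALUES (NEW.id);

  -- Step 2: copy existing rows into the new schema.
  INSERT INTO t_new SELECT *, NULL FROM t;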

Manageability: work in progress
▪  Make InnoDB compression work for OLTP
▪  Faker – tool for prefetching for replication slaves
▪  Auto replacement – replace failed and unhealthy MySQL servers
▪  Auto resharding – sharding is easy, resharding is hard

Faker
▪  Replication replay is page read – modify – page write
  ▪  Bottleneck might be disk reads
  ▪  Work done by a single thread
  ▪  Transactions on master are concurrent
▪  Faker
  ▪  Multiple threads replay transactions in fake-changes mode on slaves
  ▪  Captures 70% of disk reads, work in progress to improve the rate

Manageability: open issues
▪  Why is one host slow?
▪  Why is the database tier doing a lot more work today?
▪  Where do I spend the next N dollars (memory, disk, flash)?
▪  How do I run a workload across old (slow) and new (fast) servers?
▪  How do I integrate cache and database tiers?
▪  What monitoring signals generate useful interrupts?

World has a surplus of clever ideas
▪  Getting things into production is the hard part
▪  Run a server in production before writing a new one
▪  Invest more in monitoring, debugging and tuning

Read more at facebook.com/MySQLatFacebook

