MySQL Scaling and High Availability Architectures

Jeremy Cole
jeremy@provenscaling.com

Eric Bergen
eric@provenscaling.com

Who are we?
• Proven Scaling is a consulting company founded in 2006 by Eric and Jeremy specializing in MySQL • We primarily deal with architecture and design for large scalable systems • We also do training, DBA work, custom MySQL features, etc. • Jeremy: optimization, architecture, performance • Eric: operations, administration, monitoring

Overview
• • • • • • What’s the problem? Basic Tenets of Scaling and High Availability Lifetime of a Scalable System Approaches to Scaling Approaches to High Availability Tools and Components

What’s the problem?
• Internet-age systems can grow (or be forced to choose between growth and death) very quickly • No matter what you plan for or predict, users will always surprise you • Mobs, err, valued users can be very annoying sometimes (e.g. “biggest group ever” logic) • Users may have vastly different usage patterns • Web 2.0™ (blechhhh!) sites have changed the world of scaling; it’s much harder now • Everyone (your VCs included) expects you to be Web 2.0® compliant™

Basic Tenets
• Don’t design scalable or highly available systems:
 Using components you do not control or that have loose tolerances (e.g. DNS)  Using processes with potentially ugly side effects (e.g. code changes to add a new server) [Yes, configuration files are very often “code”]

• If a user doesn’t think/notice something is down, it’s not really “down” • Eliminate (or limit) single points of failure -- if you have only one of any component, examine why • Cache everything

Lifetime of a Scalable System

Newborn
• Shared hosting • Might start worrying (a little bit) about query optimization at this point • Don’t have much control over configuration • Overall performance may be poor • Traffic picks up, and performance is bad... What do we do about it?

Toddler
• • • • A single (dedicated) server for everything MySQL and Apache etc., competing for resources MySQL needs memory for caching data Apache (and especially PHP etc.) needs lots of memory for handling requests • Memory contention will be the first major bottleneck

Child
• Separate web servers and database server • Usually go ahead and get multiple web servers now, since it’s easy • Get a single database server, since it’s hard -maybe better hardware? • Now we need to do session management across web servers… hmm we have this nice database… • Other load same as before, but now with added network overhead • Single database server becomes your biggest bottleneck

Teenager
• “Simple” division of load by moving tables or processes • Use replication to move reporting off production • Move individual tables or databases to lighten load • Use replication to move reads to slaves • Modify code to know where everything is • Still too many writes in some parts of system • Replication synchronization problems mean either annoying users or writing lots of code to work around the problem

Late teens to 20s
• The “awkward” stage • This is where many applications (and sometimes entire companies) die by making bad decisions • Death can be slow (becoming irrelevant due to speed or lack of scalability) or quick (massive meltdown losing user confidence) • Managing the move from teenager into adulthood is often the first real project requiring specs and real processes to do it right • Downtime at this point is hard to swallow due to size of userbase

Adult
• Scalable system that can grow for a long time, generally based on data partitioning • Most improvements now are incremental • System is built to allow incremental improvements without downtime • A lot has been learned from the successful transition to adulthood

Data Partitioning: The only game in town

What is partitioning?
• Distributing data on a record-by-record basis • Usually a single basis for distributing records in each data set is chosen: a “partition key” • An application may have multiple partition keys • Each node has all related tables, but only a portion of the data

Partitioning Models
• Fixed “hash key” partitioning • Dynamic “directory” partitioning • Partition by “group” • Partition by “user”

Partitioning Difficulties
• Inter-partition interactions are a lot more difficult • Example: Partitioning by user, where do we store a message sent from one user to another? How about a friend list? • Overall reporting becomes more difficult • Example: Find the average number of friends a user has by state…

Partition by …
• Partitioning by user (or equivalent) allows for the most flexibility in most applications • In many cases it may make sense to partition by groups, if most (or all) interactions between users are within that group • You could also get most of the same benefits of partitioning by group by partitioning by user with an affinity based on group

Fixed Hash Key
• • • Divide the data into B buckets Divide the B buckets over M machines Example: Define 1024 user buckets 0..1023 based on (user_id % 1024) for 4 physical servers, so each server gets 256 of the buckets by range: 0-255, 256-511, 512-767, 768-1023 Problem: Moving entire buckets means affecting 1/B of your users at a time in the best case… in simple implmentations you may have to affect 1/M or 2/M of your users Problem: The bucket-to-machine mapping must be stored somewhere (usually in code) and updated atomically Problem: You have no control over which bucket (and thus machine) a given user is assigned to

• •

Dynamic Directory
• A “directory” server maintains a database of mappings between users and partitions • A user is assigned (often randomly) to one partition and that mapping is stored • Any user may be moved later by locking the user, moving their data, and updating their mapping in the directory • Solution: Only single users are affected by any repartitioning that must be done • Solution: Partitions may be rebalanced user-byuser at any time

Custom Solutions

Custom Solutions
• It’s very easy to implement simple hash key partitioning to get data distributed • It’s much more difficult to be able to re-partition • It’s difficult to grow

Hibernate Shards

Hibernate Shards
• Sort of a merge between fixed key partitioning and directory-based partitioning • “Virtual Shards” abstract the mapping of objects to shards, but simplistically • It’s still painful to repartition • It doesn’t handle rebalancing at all currently • It doesn’t handle aggregation at all

HiveDB

HiveDB Project
• HiveDB is an Open Source project to design and implement the entire “standard” partition-by-key MySQL system in Java • Originally envisioned by Jeremy while working with several customers • Implemented by Fortress Consulting and CafePress along with help and guidance from Proven Scaling • Many companies have built somewhat similar systems but nobody has really open sourced it

Why HiveDB?
• Many solutions that exist only solve the easy part: storing and retrieving data across many machines • Nobody really touches on the hard part: being able to rebalance and move users on the fly

Server Architecture
• Hive Metadata
 Partition definition

• Directory
 Partition Key -> Partition mapping  Secondary Key -> Partition Key mapping

• Hive Queen - makes management and rebalancing decisions • Job Server (Quartz) - actually executes tasks • Aggregation Layer (future)

Client Architecture
• Client uses Hive API to request a connection for a certain partition key • Client uses those direct connections to do work • Hive API should be written in each development language as necessary

High Availability

Goals
• • • • • Avoid downtime due to failures No single point of failure Extremely fast failover No dependency on DNS changes No dependency on code changes

• Allow for painless, worry-free “casual failovers” to upgrade, change hardware, etc. • Fail-back must be just as painless

MySQL Replication

Basics
• MySQL replication is master-slave one-way asynchronous replication • “Master” keeps logs of all changes – called “binary logs” or “binlogs” • “Slave” connects to the master through the normal MySQL protocol on TCP port 3306 • Slave requests binary logs from last position • Master sends binary logs up to current time • Master keeps sending binary logs in real-time

More Basics
• Replication works with all tables types and (mostly) all features • Any “critical”, reads, ones that cannot be allowed to return stale data, must be done on the master – replication is asynchronous, so there may be a delay at any time

Typical Setup
• One Master (single source of truth) • Any number of slaves • Slaves are used for reads only • All writes go to the master • There are many other possibilities…

Replication Topologies

Master with One Slave
Master

Slave

Master with Many Slaves
Master

Slave

Slave

Slave

Slave

Slave

Master with Relay Slave
Master

Relay Slave

Slave

Master with Relay and Many Slaves
Master

Relay Slave

Slave

Slave

Slave

Slave

Slave

Master with Many Relays
Master

Relay Slave

Relay Slave

Relay Slave

Relay Slave

Relay Slave

Slave

Slave

Slave

Slave

Slave

Dual Masters

Master

Master

Dual Masters with Slaves
Slave

Master

Master

Relay Slave

Slave

Slave

Slave

Ring (Don’t Use)

Master

Master

Master

High Availability Options

Dual Master
• Two machines with independent storage configured as master and slave of each other • Optionally: Any number of slaves for reads only • Manual (scripted) or automatic (heartbeat-based) failover is possible

Dual Master Pros
• Very simple configuration • Simple to understand = simple to maintain • Very similar to basic master-slave configuration that many are familiar with • Allows easy failover in either direction without reconfiguration or rebuilding • Allows for easy and reliable failover for nonemergency situations: upgrades, schema changes, etc. • Allows for quick failover in emergency • Can work between distant sites fairly easily

Dual Master Cons
• Does not help scale writes (no, not at all) • Limited to two sites; replication does not allow multiple masters, so three or more is not possible • Replication is asynchronous, and may get behind -there is always a chance of data loss (albeit small)

SAN
• Shared storage of a single set of disks by two MySQL servers, with a single copy of the data on a FibreChannel or IP/iSCSI SAN • Automatic (heartbeat) failover by fencing and mounting the SAN on the other machine

SAN Pros
• Single copy of the data means lower storage cost for extremely large databases • No worries about replication getting behind • SAN systems can achieve very high performance for same or lower cost as two very large RAIDs

SAN Cons
• Single copy of the data means corruption is possible, and could be very damaging • For medium or small databases, cost can be prohibitive • FibreChannel requires additional infrastructure often not present in typical MySQL systems; iSCSI can be very helpful in this regard • Single copy of the data -- no schema change tricks are possible

DRBD
• Block device-level replication between two machines with their own independent storage (mirrors of the same data) • Automatic (heartbeat-based) failover by fencing and mounting local copy of filesystem is typical

DRBD Pros
• Simple hardware and infrastructure using locallyattached RAID • No expensive hardware or network

DRBD Cons
• Complex configuration and maintenance • May cause performance problems, especially if poorly configured • Failure of or problems with mirror can cause problems in production • From the software perspective, there is still a single copy of the data, which may get corrupted • Single copy of the data -- no schema change tricks are possible

Putting It All Together

Partitioning + HA
• No partitioning solutions really address HA .. They treat the “shards” or “partitions” as single MySQL servers • In reality you would implement an HA solution for each partition • There are many possibilities

HiveDB + Dual Master
• We recommend HiveDB plus Dual Master for most installations • While not technically perfect, and with a chance of data loss, administrative tasks are very simple • Additionally, LVM for volume management gives ability to take snapshot backups easily

Any questions? Discussion!

Sign up to vote on this title
UsefulNot useful