MySQL HA Using different solutions

Robert Krzykawski, DB Team Coordinator, bwin games
Anders Karlsson, Principal Sales Engineer, MySQL

• Who are we?
• HA Basics – Anders
• How we did it; success or failure – Robert
• Summary
• Questions?

Anders Karlsson
• Sales Engineer with Sun / MySQL for 5+ years
• I have been in the RDBMS business for 20+ years
• I have worked for many of the major vendors and with most of the vendor products
• I’ve been in roles as
> Sales Engineer
> Consultant
> Porting engineer
> Support engineer
> Etc.

• Outside MySQL I build websites, develop Open Source software (MyQuery, ndbtop etc.), am a keen photographer and drive sub-standard cars, among other things. Also: ! Right now!

Robert Krzykawski
• DB Team Coordinator @ bwin Games AB
• I have worked with MySQL in every way, as system admin, DBA and DBD, and am now taking a more system-architectural role.
• I have been involved in building both small and large web-based solutions using MySQL since 1998.
• My roles throughout my professional life have varied: System administrator, Technical Sales support, DBA, DBD, Programmer, Application architect and System architect.
• Off work I try to automate things with scripts and programs to offload myself when “on work”.
• I also try to find time to snowboard and play some paintball, and a recently introduced hobby is our Maine Coon kittens.

Why do you need HA?
• Something can break. It usually will, eventually
• You will need to maintain your database eventually, without shutting the whole system down
• Adding HA to an existing running system is difficult, much more so than providing HA from the start
• You want a good night’s sleep! You want failover to be automatic!

HA Concepts
• Fault tolerant architectures
> These are hardware architectures with supporting software that protect against even individual component failures

• Single Point of Failure (SPOF)
> In any fault tolerant setup, you want to avoid a SPOF, as a chain is no stronger than its weakest link

• Fail over and Fail back
> Fail over is the process of switching from a failed component to another component, which may be dormant or already active. Fail back is the process of switching back from the backup component to the original one.
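
The fail over and fail back definitions above can be sketched as a tiny state holder. This is only an illustration: the class name `ServicePair` and the node names are hypothetical and do not come from any HA framework.

```python
from dataclasses import dataclass, field


@dataclass
class ServicePair:
    """Two components providing one service; one of them is active at a time."""
    original: str
    backup: str
    active: str = field(init=False)

    def __post_init__(self) -> None:
        # The service starts out on the original component.
        self.active = self.original

    def fail_over(self) -> str:
        """Switch the service from the failed original to the backup."""
        self.active = self.backup
        return self.active

    def fail_back(self, original_healthy: bool) -> str:
        """Return to the original, but only once it is healthy again."""
        if original_healthy:
            self.active = self.original
        return self.active


pair = ServicePair(original="db1", backup="db2")
pair.fail_over()                         # service now runs on db2
pair.fail_back(original_healthy=False)   # db1 not repaired yet, stay on db2
pair.fail_back(original_healthy=True)    # service returns to db1
```

Keeping fail back as an explicit, separate step reflects the point that it is a deliberate switch to the original component, not something that happens by itself when that component reappears.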

Some HA Components
• Heartbeat
> Heartbeat is an HA component that checks that the services being failed over are alive. Heartbeat can check individual servers, software services, networking etc.

• HA Monitor
> The HA Monitor has different names in different frameworks. This is the component that allows configuration of the services, ensures proper shutdown and startup, and allows manual control

• Replication
> Replication is a common component that ensures that the data content of managed data-rich components is in sync
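
As a rough sketch of the heartbeat idea described above, a monitor can record when each node was last heard from and declare it dead after a number of missed beats. The class, the interval and the threshold are assumptions chosen for illustration; this is not the Linux HA Heartbeat API.

```python
from typing import Dict


class HeartbeatMonitor:
    """Declares a node dead after max_missed heartbeat intervals of silence."""

    def __init__(self, interval_s: float, max_missed: int) -> None:
        self.interval_s = interval_s
        self.max_missed = max_missed
        self.last_seen: Dict[str, float] = {}

    def beat(self, node: str, now: float) -> None:
        """Record that a heartbeat from the node arrived at time now."""
        self.last_seen[node] = now

    def is_alive(self, node: str, now: float) -> bool:
        """A node is alive if its last beat is recent enough."""
        last = self.last_seen.get(node)
        if last is None:
            return False  # never heard from this node
        return (now - last) <= self.interval_s * self.max_missed


monitor = HeartbeatMonitor(interval_s=1.0, max_missed=3)
monitor.beat("db1", now=10.0)
alive_soon_after = monitor.is_alive("db1", now=12.5)   # within 3 intervals
alive_much_later = monitor.is_alive("db1", now=13.5)   # more than 3 intervals
```

Time is passed in explicitly so the behaviour is easy to test; a real monitor would read a clock and would normally listen on more than one network path.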

What should I require?
• Don’t aim too high; aim for what is reasonable for your needs
• Aim to ensure that no important data is lost
> What is “important data”? You decide! Different data means different “needs”!

• Aim to ensure that the solution can be automated. You will want this eventually anyway
• Aim to ensure a solution that can easily be tested and administered
• Aim to ensure that the solution is performant and scalable

HA with MySQL – In short
• MySQL Replication
> Easy to use and set up. Low performance impact.
> Asynchronous only. Failback can be difficult. Needs additional components.

• MySQL with DRBD / ZFS / AVS
> Easy to use. Low cost, software only. Synchronous. Good HA software integration.
> Some performance impact. Limited data size and transaction rates.
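
Why "asynchronous only" matters for fail over can be shown with a toy model: the master acknowledges a commit without waiting for the slave, so anything not yet applied on the slave is missing after a switch. The class below is a simplified illustration under that assumption and does not model MySQL's actual replication protocol.

```python
from typing import List


class AsyncReplicaPair:
    """Toy model of a master with one asynchronously replicating slave."""

    def __init__(self) -> None:
        self.master_log: List[str] = []  # transactions committed on the master
        self.slave_applied = 0           # how many of them the slave has applied

    def commit(self, txn: str) -> None:
        """The master commits without waiting for the slave."""
        self.master_log.append(txn)

    def replicate(self, max_events: int) -> None:
        """The slave catches up by at most max_events transactions."""
        self.slave_applied = min(len(self.master_log),
                                 self.slave_applied + max_events)

    def lost_on_failover(self) -> List[str]:
        """Transactions that would be missing if we switched to the slave now."""
        return self.master_log[self.slave_applied:]


pair = AsyncReplicaPair()
for txn in ("t1", "t2", "t3"):
    pair.commit(txn)
pair.replicate(max_events=2)
missing = pair.lost_on_failover()  # "t3" has not reached the slave yet
```

With a synchronous scheme such as DRBD, the write is not acknowledged until the second copy has it, which is what closes this window at the cost of some performance.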

HA with MySQL – In short
• MySQL with Shared storage
> Good performance. Eases hardware management. Good integration with HA software.
> Costly. SAN itself is a SPOF.

• MySQL Cluster
> Very good performance. Self contained. Very short fail-over times. Software only solution.
> Needs several physical servers. Not optimized for all MySQL applications.

bwin games ab

Our goal at bwin
• We were faced with a requirement: establish a highly available database platform.
• We had some rules to follow from management.
> Interruptions due to hardware failure should not require hands-on work.
> Downtime should be minimized during interruptions.
> Performance of the DB platform should not decrease when operating as usual.
> Performance can decrease if a failure has occurred, but the service should not become unusable.
> Implementation should be done by the operations department. Developers should not be involved.

What solutions did we consider?
• Master/Master
• Linux HA
• HP Service Guard
• Sun Cluster
• Combination of the above
• MySQL Cluster
• We will walk through all of the above

• Master/Master with two active nodes would give us a seamless switch if we have a good load balancer.
> Will give us the ability to do schema changes “online”
> Not only higher availability when both nodes are up, but also better performance.
> Can eliminate the use of production slaves.
> One entry point for the application when using a load balancer (“LB”)
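
The "seamless switch" with two active nodes depends on something, either a load balancer or the application itself, trying the other node when one is unreachable. A minimal sketch of that logic follows; `connect_any`, `fake_connect` and the node names are hypothetical, and the connection function is injected so that no particular MySQL driver is assumed.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def connect_any(nodes: Sequence[str], connect: Callable[[str], T]) -> T:
    """Try each node in order and return the first successful connection."""
    last_error: Optional[Exception] = None
    for node in nodes:
        try:
            return connect(node)
        except ConnectionError as exc:
            last_error = exc  # remember the failure and try the next node
    raise ConnectionError(f"no node reachable: {list(nodes)}") from last_error


def fake_connect(node: str) -> str:
    """Stand-in for a real driver call; pretends master-a is down."""
    if node == "master-a":
        raise ConnectionError("master-a is down")
    return f"connection to {node}"


conn = connect_any(["master-a", "master-b"], fake_connect)
```

This only covers reaching a live node. Writing to two masters at once also needs care in the application, which is the "application awareness" issue discussed for 4.0.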

Linux HA/ServiceGuard/SunCluster
• A service IP switch will cause a glitch in service.
• Since we are running 4.0 we can’t really do a master/master setup with service IP switching.
• Slave integrity is important and we are running 4.0, with a single master copy of the data. We can’t switch to a slave and hope that everything was replicated.
• We are using a SAN, so shared storage is possible.
• One instance, two machines: one active, one standby.
• InnoDB log size will be a problem.
• Timeouts during recovery can cause problems during a switch.

MySQL Cluster
• High availability is built in, if implemented correctly
• Requires more hardware.
• More complex solution
• Requires the application to support NDB
• Not the full feature set.

• We are using MySQL 4.0 in our biggest database
• A Master/Master scenario on 4.0 requires a higher level of application awareness.
• LinuxHA/ServiceGuard/Sun Cluster will cause a small glitch when we move resources.
• MySQL Cluster would require even more application changes in our case.

Our Choice
• LinuxHA, because it is GPL/LGPL. Free and not owned by an organization.
• Fastest way to implement; did not require any support from the development department.
• All other options required changes in the application.

• Two versions



[Figure: HA Standby1]

[Figure: HA Standby2]

We do..
• Use Linux HA 2.0. Needed for setup of the “cluster”
• Use SAN. Shared storage is easier and faster, but expensive.
> DRBD can be used, but it stores the same data twice and also comes with a performance decrease.

• Heartbeat on two bonds: primary on the database interconnect network, secondary on the database service network
• We have LUNs presented to multiple hosts
• Services have rules to be run on specific hosts only.
• We fence using RiLOE
> We have plans to fence on port level in the FC switches.
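
With LUNs presented to multiple hosts, the standby must be sure the failed node can no longer write before it starts the database on the shared storage. The ordering can be sketched as below; `take_over`, `fence_peer` and `start_service` are hypothetical names, where `fence_peer` stands in for a power-off through RiLOE and none of this is Linux HA's own API.

```python
from typing import Callable, List


def take_over(fence_peer: Callable[[], bool],
              start_service: Callable[[], None]) -> bool:
    """Start the service locally only after the peer is confirmed fenced."""
    if not fence_peer():
        # Fencing failed: the peer may still be writing to the shared LUNs,
        # so starting a second instance could corrupt the data.
        return False
    start_service()
    return True


events: List[str] = []
started = take_over(
    fence_peer=lambda: True,                        # pretend power-off worked
    start_service=lambda: events.append("mysqld started"),
)
refused = take_over(
    fence_peer=lambda: False,                       # pretend power-off failed
    start_service=lambda: events.append("should not happen"),
)
```

Refusing to start when fencing fails trades availability for data safety, which is usually the right choice when two hosts can reach the same storage.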

What’s good and what’s bad..
• Easy and fast implementation
• Our config neither increases nor decreases performance.
• InnoDB log size causes long recovery times. Testing to decrease it has caused performance penalties.
• Our solution is not foolproof because of the long recovery times.
• It causes interruption of service.
• We can say it’s HA, but a true HA solution would give us 100% uptime.
• The 2nd setup is complicated. We should aim for simple, more common setups.
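
To put the "100% uptime" remark in perspective, an availability percentage can be converted into the downtime it allows per year. The figures below are plain arithmetic, not measurements from the bwin setup, and the function name is an assumption made for this example.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Hours of downtime per year allowed by a given availability level."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


for level in (99.0, 99.9, 99.99, 100.0):
    hours = downtime_hours_per_year(level)
    print(f"{level}% availability allows about {hours:.2f} hours of downtime per year")
```

Every fail over that has to wait for a long InnoDB recovery eats into this budget, which is why recovery time decides how many "nines" a standby setup can realistically claim.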

What can we do better?
• Fine-tune the config for faster recovery/startup
• Add better fencing
• Monitor failover in case recovery takes long
• Master/Master or multi-master.
> If the application can reconnect, or if we have a smart load balancer, we have no outages.
> Upgrades or schema changes can be made “online”
> No separation between writes and reads. Less complicated for developers. One entry point.

• Concepts
• Components
• Requirements
• Technologies
• Your goal
• Considerations
• Obstacles
• How we did it @ bwin games AB
• HA recommendations

The question is not, ‘What is the answer?’ The question is, ‘What is the question?’ Henri Poincaré

Thank you for your time!
• And thank you for listening so kindly.

• We can be found on:
• Robert Krzykawski –
• Anders Karlsson –