You are on page 1of 115

Galera Cluster Best Practices

Seppo Jaakola
Codership
Agenda

1) Galera Cluster Short Introduction


2) Consistent reads
3) wsrep_cluster_address
4) Multi-master Conflicts
5) State Transfers (SST IST)
6) Backups
7) Schema Upgrades
8) MySQL Replication Topologies
9) Migrating to Galera Cluster
10)Detecting Slow Nodes
11)Parallel Replication www.codership.com
2
Galera Cluster
Multi-Master Replication

read & write read & write read & write Read & write access to any node
Client can connect to any node

There can be several nodes


MySQL MySQL MySQL
a Galera Replication
Replication is synchronous

www.codership.com
5
Multi-Master Replication

read & write read & write read & write

Multi-master cluster looks


like one big database with
multiple entry points

a MySQL

www.codership.com
6
Galera Cluster

➢ Synchronous multi-master cluster


➢ For MySQL/InnoDB
➢ 3 or more nodes needed for HA
➢ Automatic node provisioning
➢ Works in LAN / WAN / Cloud

www.codership.com
7
Synchronous Replication

Transaction is
Read & write processed locally up
to commit time

MySQL MySQL MySQL

a
Galera Replication

www.codership.com
10
Synchronous Replication

Transaction is
commit replicated to whole
cluster

MySQL MySQL MySQL

a
Galera Replication

www.codership.com
11
Synchronous Replication

Client gets OK
OK status

MySQL MySQL MySQL

a
Galera Replication

www.codership.com
12
Synchronous Replication

Transaction is
applied in slaves

MySQL MySQL MySQL

a
Galera Replication

www.codership.com
13
Consistent Reads
Consistent reads

Replication is virtually synchronous...

Transaction is
commit replicated to whole
cluster

MySQL MySQL MySQL

Galera Replication

www.codership.com
15
Consistent reads

1. Insert into t1 values (1,....)


2. Select from t1 where i=1

Will the select see


the inserted row?
MySQL MySQL

Galera Replication
www.codership.com
16
Consistent Reads

● Aka read causality


● There is causal dependency between
operations on two database connections:
● One thread does a database operation
● And some other thread is expecting to see the
values of earlier write

www.codership.com
17
Consistent Reads

● Use: wsrep_causal_reads=ON
➔ Every read (select, show) will wait until slave
queue has been fully applied
● There is timeout for max causal read wait:
● replicator.causal_read_keepalive

www.codership.com
18
Consistent Reads

● Use: wsrep_causal_reads=ON
➔ Every read (select, show) will wait until slave
queue has been fully applied
● There is timeout for max causal read wait:
● replicator.causal_read_keepalive

www.codership.com
19
Consistent reads

1. Insert into t1 values (1,....) 2. Select from t1 where i=1

Wait for last WS to apply


…...
Run the select

MySQL MySQL

Galera Replication
www.codership.com
20
Consistent Reads

● wsrep_causal_reads makes the cluster to look


fully synchronous for application
● Consistent reads makes SELECTs slower
● Use wsrep_causal_reads Only when there is
real need for it

www.codership.com
21
The Thing with wsrep_cluster_address
The Thing with wsrep_cluster_address

➢ Write all nodes planned for the cluster, in


the address string:
– wsrep_cluster_address=gcomm://10.0.0.1,10.0.0.2...
➢ To bootstrap a new cluster
➢ Service mysql start –wsrep_cluster_address=gcomm://
➢ Or rewrite my.cnf: wsrep_cluster_address=gcomm://
➢ Make sure not to leave
wsrep_cluster_address=gcomm:// in
my.cnf !!
➢ Note: wsrep_urls parameter for mysqld_safe is deprecated
www.codership.com
23
The Thing with wsrep_cluster_address

➢ Xtrabackup and rsync SST cannot be used


for a running server
➢ Why?
➢ It is not possible to copy datafiles under
running InnoDB
➢ Having wsrep_address set to a list of node
addresses and starting replication on a
running node can lead to node crash
www.codership.com
24
wsrep_cluster_address

wsrep_cluster_address=gcomm://10.0.0.1,10.
0.0.2,10.0.0.3
Joiner

10.0.0.1 10.0.0.2 10.0.0.3

a Galera Replication

www.codership.com
25
wsrep_cluster_address

10.0.0.1 10.0.0.2 10.0.0.3 10.0.0.4

a Galera Replication

www.codership.com
26
wsrep_cluster_address

It “may” also happen that 10.0.0.4 will operate as bridge between


10.0.0.1 and the rest of the cluster

Replication relay

10.0.0.1 10.0.0.4 10.0.0.2 10.0.0.3

a Galera Replication

www.codership.com
27
wsrep_cluster_address

wsrep_cluster_address=gcomm://
Joiner

10.0.0.1 10.0.0.2 10.0.0.3

a Cluster 1

www.codership.com
28
wsrep_cluster_address

10.0.0.1 10.0.0.2 10.0.0.3 10.0.0.4

a Cluster 1 Cluster 2

www.codership.com
29
Dealing with Multi-Master Conflicts
Multi-master Conflicts

write write

MySQL MySQL MySQL

a
Galera Replication

www.codership.com
31
Multi-master Conflicts

write write

Conflict detected
MySQL MySQL MySQL

a
Galera Replication

www.codership.com
32
Multi-master Conflicts

Deadlock
OK write error

MySQL MySQL MySQL

a
Galera Replication

www.codership.com
33
Multi-Master Conflicts

● Galera uses optimistic concurrency control


● If two transactions modify same row on
different nodes at the same time, one of
the transactions must abort
➔ Victim transaction will get deadlock error
● Application should retry deadlocked
transactions, however not all applications
have retrying logic inbuilt

www.codership.com
34
Database Hot-Spots

● Some rows where many transactions want to


write to simultaneously
● Patterns like queue or ID allocation can be hot-
spots

www.codership.com
35
Hot-Spots

write write
write

Hot row

www.codership.com
36
Diagnosing Multi-Master Conflicts

● wsrep_log_conflicts will print info of each cluster


conflict in mysql error log
● Cert.log_conflicts to print out information of the
conflicting transaction
● Status variables to monitor:
● wsrep_local_bf_aborts
● wsrep_local_cert_failures

● by using wsrep_debug configuration, all conflicts


(...and plenty of other information) will be
logged
www.codership.com
37
wsrep_retry_autocommit

● Galera can retry autocommit transaction on


behalf of the client application, inside of the
MySQL server
● MySQL will not return deadlock error, but will
silently retry the transaction
● wsrep_retry_autocommit=n will retry the
transaction n times before giving up and
returning deadlock error
● Retrying applies only to autocommit
transactions, as retrying is not safe for multi-
statement transactions
www.codership.com
38
Retry Autocommit

1. conflict
Write detected
write
2. retrying

MySQL MySQL MySQL

a
Galera Replication

www.codership.com
39
Multi-Master Conflicts

1) Analyze the hot-spot


2) Check if application logic can be changed to
catch deadlock exception and apply retrying
logic in application
3) Try if wsrep_retry_autocommit configuration
helps
4) Limit the number of master nodes or change
completely to master-slave model
if you can filter out the access to the hot-
spot table, it is enough to treat writes
only to hot-spotwww.codership.com
table as master-slave
40
State Transfers
State Transfer

➢ Joining node needs to get the current


database state
➢ Two choices:

➢ IST: incremental state transfer

➢ SST: full state transfer

➢ If joining node had some previous state

and gcache spans to that, then IST can


be used

www.codership.com
42
State Snapshot Transfer

● To send full database state


● wsrep_sst_method to choose the method:

➢ mysqldump

➢ rsync

➢ xtrabackup

www.codership.com
43
SST Request

MySQL MySQL joiner


SST Request
wsrep_sst_method
Galera Replication

www.codership.com
44
SST Method

wsrep_sst_mysqldump

MySQL donor wsrep_sst_rsync joiner

Galera Replication

wsrep_sst_xtrabackup

www.codership.com
45
SST API

● SST is open API for shell scripts


● Anyone can write custom SST

● SST API can be used e.g. for:

● Backups

● Filtering out part of database

www.codership.com
46
wsrep_sst_mysqldump

● Logical backup
● Slowest method

● Configure authentication

➢ wsrep_sst_auth=”root:rootpass”

➢ Super privilege needed

● Make sure SST user in donor node can

take mysqldump from donor and load it


over the network to joiner node
● You can try this manually beforehand

www.codership.com
47
wsrep_sst_rsync

● Physical backup
● Fast method

● Can only be used when node is starting

➢ Rsyncing datadirectory under running

InnoDB is not possible

www.codership.com
48
wsrep_sst_xtrabackup

● Contributed by Percona
● Probably the fastest method

● Uses xtrabackup

● Least blocking on Donor side (short

readlock is still used when backup starts)

www.codership.com
49
SST Donor

● All SST methods cause some disturbance


for donor node
● By default donor accepts client

connections, although committing will be


prohibited for a while
● If wsrep_sst_donor_rejects_queries is set,

donor gives unknown command error to


clients
➔ Best practice is to dedicate a reference

node for donor and backup activities


www.codership.com
50
Incremental State Transfer

Request to join
GTID: seqno-n
Donor Joiner

seqno-n
gcache
gcache

www.codership.com
51
Incremental State Transfer

Donor Joiner

Send IST events


apply

gcache seqno-n
gcache

www.codership.com
52
Incremental State Transfer

● Very effective
● gcache.size parameter defines how big

cache will be maintained


● Gcache is mmap, available disk space is

upper limit for size allocation

www.codership.com
53
Incremental State Transfer

● Use database size and write rate to


optimize gcache:
➢ gcache < database
➢ Write rate tells how long tail will be

stored in cache

www.codership.com
54
Incremental State Transfer

● You can think that IST Is


● A short asynchronous replication session
● If communication is bad quality, node
can drop and join back fast with IST

www.codership.com
55
Backups Backups Backups
Backups

➢ All Galera nodes are constantly up to date


➢ Best practices:
➢ Dedicate a reference node for backups
➢ Assign global trx ID with the backup

➢ Possible methods:
1.Disconnecting a node for backup
2.Using SST script interface
3.xtrabackup

www.codership.com
57
Backups with global Trx ID

➢ Global transaction ID (GTID) marks a


position in the cluster transaction stream
➢ Backup with known GTID make it possible
to utilize IST when joining new nodes, eg,
when:
➢ Recovering the node
➢ Provisioning new nodes

www.codership.com
58
Backup by Disconnecting a Node

Load Balancing Isolate the backup node

MySQL MySQL MySQL

Galera Replication

www.codership.com
59
Backup by Disconnecting a Node

Load Balancing

Disconnect from group


MySQL MySQL MySQL e.g. clear wsrep_provider

Galera Replication

www.codership.com
60
Backup by Disconnecting a Node

Load Balancing

Disconnect from group


MySQL MySQL MySQL e.g. clear wsrep_provider

Galera Replication

www.codership.com
61
Backup by Disconnecting a Node

Load Balancing

Work your backup magic


MySQL MySQL MySQL

Galera Replication

backups
www.codership.com
62
Backup by Disconnecting a Node

Load Balancing

Read global transaction ID


MySQL MySQL MySQL from status and assign to
backup
Galera Replication wsrep_cluster_uuid
wsrep_last_committed

backups
www.codership.com
63
Backup by SST

● Donor mode provides isolated processing


environment
● A special SST script can be written just to
prepare backup in donor node:
wsrep_sst_backup
● Garbd can be used to trigger donor node to run
the wsrep_sst_backup

www.codership.com
64
Backup by SST API

Load Balancing Launch garbd

SST request
Garbd
node1 node2 node3
wsrep_sst_donor=node3
wsrep_sst_method=backup
Galera Replication

www.codership.com
65
Backup by SST API

Donor launches
Load Balancing
wsrep_sst_backup

wsrep_sst_backup
node1 node2 node3
.
Galera Replication .
.

www.codership.com
66
Backup by SST API

wsrep_sst_backup
Load Balancing
prepares the backup

wsrep_sst_backup
node1 node2 node3
.
Galera Replication .
.GTID

backups
www.codership.com
67
Backup by SST API

Backup node returns to


Load Balancing
cluster

node1 node2 node3

Galera Replication

www.codership.com
68
Backup by xtrabackup

● Xtrabackup is hot backup method and can be


used anytime
● Simple, efficient
● Use –galera-info option to get global transaction
ID logged into separate galera info file

www.codership.com
69
Schema Upgrades
Schema Upgrades

● DDL is non-transactional, and therefore bad for


replication
● Galera has two methods for DDL
● TOI, Total Order Isolation
● RSU, Rolling Schema Upgrade

● Use wsrep_osu_method to choose either option


● Pt-online-schema-change can give lockless
schema upgrade
● Use TOI mode
● Beware of foreign keys

www.codership.com
71
Total Order Isolation

● DDL is replicated up-front


● Each node will get the DDL statement
and must process the DDL at same slot
in transaction stream
● Galera will isolate the affected
table/database for the duration of DDL
processing

www.codership.com
72
Total Order Isolation

User issues some DDL


ALTER... statement, like ATLTER
TABLE...

node1 node2 node3

www.codership.com
73
Total Order Isolation

node1 node2 node3

ALTER will be replicated and


ALTER ALTER ALTER waits in slave queue according
to seqno

www.codership.com
74
Total Order Isolation

node1 node2 node3

ALTER gets his turn to apply,


ALTER ALTER ALTER Galera will forbid any concurrent
access to the affected table(s)

www.codership.com
75
Total Order Isolation

DDL is over and cluster is back


to normal operation

node1 node2 node3

www.codership.com
76
Rolling Schema Upgrade

● DDL is not replicated


● Galera will take the node out of replication
for the duration of DDL processing
● When DDL is done with, node will catch up
with missed transactions (like IST)
● DBA should roll RSU operation over all
nodes
● Requires backwards compatible schema
changes

www.codership.com
77
Rolling Schema Upgrade

User issues some DDL


ALTER... statement, like ALTER
TABLE...

node1 node2 node3

www.codership.com
78
Rolling Schema Upgrade

Node is de-synced from


replication,
node3 Replication stream is
buffered in gcache
node1 node2

www.codership.com
79
Rolling Schema Upgrade

ALTER...
ALTER processing

node3
node1 node2

Slave queue

www.codership.com
80
Rolling Schema Upgrade

Node re-syncs to
replication
node1 node2 node3

Slave queue

www.codership.com
81
Rolling Schema Upgrade

Catching up with missed


transactions
node1 node2 node3

Slave queue

www.codership.com
82
Rolling Schema Upgrade

Node is back to cluster


again

Q: who rolls the upgrade?


A: DBA!

node1 node2 node3

www.codership.com
83
wsrep_on=OFF

● wsrep_on is a session variable telling if this


session will be replicated or not
● I tried to hide this information to the best I can,
but somebody has leaked this out
● And so, yes, it is possible to run
“poor man's RSU” with wsrep_on set to OFF
● such session may be aborted by replication
● Use only, if you are really sure that:
● planned SQL is not conflicting
● SQL will not generate inconsistency

www.codership.com
84
Schema Upgrades

● Best practices:
➔ Plan your upgrades
➔ Try to be backwards compatible
➔ Rehearse your upgrades
➔ Find out DDL execution time
➔ Go for RSU if possible
●ALTER TABLE to create new autoinc
column will cause issues
● Every node has different autoinc increment
and offset settings
www.codership.com
85
Create autoinc Column

ALTER TABLE myTable ADD COLUMN (new INT AUTOINC...)

node1 node2

myTable myTable

Aaa Aaa
Bbb Bbb
Ccc Ccc

www.codership.com
86
Create autoinc Column

ALTER TABLE myTable ADD COLUMN (new INT AUTOINC...)

node1 node2

myTable myTable

Aaa 1 Aaa 2
Bbb 4 Bbb 5
Ccc 7 Ccc 8

www.codership.com
87
MySQL Replication Topologies
Galera as MySQL Slave

MySQL

MySQL replication Same server_id in


all nodes

MySQL Node 1
slave slave

a Node 3
Node 2

www.codership.com
89
Galera as MySQL Master
Same server_id in
all nodes
Node 1

Node 2 Node 3
master

MySQL replication

a MySQL
slave

www.codership.com
90
MySQL Replication Bottleneck

● MySQL replication is single threaded


● Synchronous replication adds latency for
transactions
Galera Cluster will not apply replication as fast as native MySQL Slave
● wsrep_mysql_replication_bundle
– Groups mysql replication events in large transactions
– Helps with the latency, you can go up to 5K trx/sec
– Experimental, bugs ahead
● MySQL 5.6 “parallel replication” will help, if you
have several schemas
www.codership.com
91
Migrating to Galera Cluster
Migrating to Galera Clustering

MySQL
master

MySQL master slave hierarchy

MySQL
slave

www.codership.com
93
Migrating to Galera Clustering

MySQL
master

Install first Galera node to


operate as MySQL slave

Galera
MySQL Node 1
slave

www.codership.com
94
Migrating to Galera Clustering

MySQL
master

promote Galera node to


operate as one node cluster

Galera
MySQL Node 1
slave

www.codership.com
95
Migrating to Galera Clustering

MySQL
master

Join more nodes in Galera


slave cluster

Galera
MySQL
Node 1
slave

a
Galera Galera
Node 2 Node 3
www.codership.com
96
Migrating to Galera Clustering

MySQL
master

Move traffic to Galera cluster

Galera
MySQL
Node 1
slave

a
Galera Galera
Node 2 Node 3
www.codership.com
97
Migrating to Galera Clustering

Drop obsolete MySQL servers

Galera
Node 1

a
Galera Galera
Node 2 Node 3
www.codership.com
98
Detecting Slow Node
Detecting Slow Node

● Best practice is to use equal hardware,


nodes should have equal capacity
● Galera implements flow control to protect
cluster from slave queue growth

wsrep_flow_control_sent = #times node has
begged for flow control
● wsrep_flow_control_recvd =#times node received
flow control stop signal
● wsrep_flow_control_paused = fraction of time the
node had to pause for flow control
● wsrep_local_recv_queue = length of slave queue
www.codership.com
100
Detecting Slow Node

● Slave queue tuning:


– Gcs.fc_limit = high water mark for the flow
control, FC stop will be sent when this is
reached
– Gcs.fc_factor = limit * factor is the low water
mark, FC continue will be sent when slave
queue returned down to this mark

www.codership.com
101
Detecting Slow Node

● Flow Control statistics should show 0 in healthy


cluster
● Monitor FC status and troubleshoot reason for
deviation. First thing is to detect the slow node.
● Note that some heavy SQL operations can also
cause FC to kick in.

www.codership.com
102
Parallel Replication
Parallel Replication

MySQL
Any number of slave applier
Slave thd Slave thd threads can be started

Slave control assigns write


sets for slave appliers

Slave control Slave control is on ROW level

Only applying is paralle,


Slave queue commit order is dictated

www.codership.com
104
Parallel Applying

● Aka parallel replication


● “true parallel applying”
● Every application will benefit of it
● Works not on database, not on table,
but on row level
● wsrep_slave_threads=n
● How many slaves makes sense:
● Monitor wsrep_cert_deps_distance
● Max ~4 * #CPUcores

www.codership.com
105
Cluster Notifications
Cluster Notifications

● Cluster can trigger notifications


● Script API for notification syntax
● wsrep_notify_cmd defines the script to
handle notifications
● Use for:
● load balancer configuration
● monitoring

www.codership.com
107
MyISAM Replication
MyISAM Replication

● On experimental level
● MyISAM is phasing out not much demand to
complete
● Replicates SQL up-front, like TOI
● Use: wsrep_replicate_myisam=ON
● Should be used in master-slave model
● No checks for non-deterministic SQL
● Insert into t (r, time) values (rand(), now());

www.codership.com
109
Replicating over SSL/TLS
SSL / TLS

● Replication over SSL is supported


● No authentication (yet), only encryption
● Whole cluster must use SSL

www.codership.com
111
SSL or VPN

● Bundling several nodes through VPN


tunnel may cause a vulnerability
● When VPN gateway breaks, a big part of
cluster will be blacked out
● Best practice is to go for SSL if VPN does
not have alternative routes

www.codership.com
112
Using UDP Multicast
UDP Multicast

● Configure with gmcast.mcast_addr


● Full cluster must be configured for
multicast or tcp sockets
● Multicast is good for scalability
● Best practice is to go for multicast if
planning for large clusters

www.codership.com
114
Galera Project
Galera Project
● Galera Cluster for MySQL
● 5 years development
● based on MySQL server community edition
● Fully open source
● Active community
● ~3 releases per year
● Latest release 2.3
● Major release 3.0 in the works
● Galera Replication also used in:
● Percona XtraDB Cluster
● MariaDB Galera www.codership.com
Cluster 116
Codership
➢ Galera development
➢ Consulting and support services for Galera
➢ Technology and support partners:
➢ Percona ➢ SkySQL
➢ FromDual ➢ Monty Program
➢ Capside ➢ Severalnines

We are hiring!!!
R&D and Q/A positions open
www.codership.com
118
Questions?

Thank you for listening!


Happy Clustering :-)

You might also like