POSTGRESQL HA RUNBOOK FOR SLB
PROD/QA/DEV ENVIRONMENTS
© Copyright 2019, ATOS PGS sp. z o.o. All rights reserved. Reproduction in whole or in part is prohibited without the prior
written consent of the copyright owner. For any questions or remarks on this document, please contact Atos Poland Global
Services, +48 22 4446500.
OWNER
PostgreSQL HA Runbook for SLB PROD/QA/DEV environments
sys_WordMark_AT_Continued
version: 0.1
Public
document number:
Contents
1 Audience and document purpose
2 Components in scope
3 PostgreSQL instances
4 PostgreSQL HA – streaming replication
4.1 PostgreSQL failover
PROD
QA & DEV
4.2 Replication state monitoring
4.3 Switchover – when you just want to switch roles between the primary and standby databases
4.4 Failover – when the primary database fails and it's not recoverable
5 Restoring database from backup
5.1 Ambari
5.2 Hive
5.3 Ranger
5.4 Rangerkms
5.5 Oozie
List of changes
version  Date        Description              Author(s)
0.1      27.11.2019  Initial version          MDrazovsky
1.0      06.12.2019  Final document overview  M.Niewiatowski
1 Audience and document purpose
This document has been prepared for the SLB HDP platform administrators and the ATOS team responsible for maintaining the PROD, QA and DEV environments. End users and business teams are neither participants in the described processes nor intended recipients of this document.
The document describes the configuration current at the time of writing, together with the processes and detailed steps for backing up, restoring and operating the HA solution for the services, whether during an HA/DR drill or a real-life incident.
The processes described in this document are based on PostgreSQL project best practices and documentation; the links to them are an integral part of the knowledge required to operate them.
While creating this runbook, the authors followed the suggestions brought together in the following articles:
https://www.postgresql.org/docs/9.6/warm-standby.html
https://wiki.postgresql.org/wiki/Streaming_Replication
https://www.postgresql.org/docs/9.6/runtime-config-replication.html
https://www.postgresql.org/docs/9.6/backup.html
2 Components in scope
This document covers the detailed procedures for PostgreSQL disaster recovery, both from the HA perspective and from the backup-restoration perspective, for each environment (PROD, QA, DEV).
3 PostgreSQL instances
Each environment has a pair of PostgreSQL nodes: normally the first hosts the primary DB instance and the second provides the standby. The primary instance can be used for all types of operations; the standby only supports reads. In case of a failure of the primary node, you should manually promote the standby database (instructions provided below).
4 PostgreSQL HA – streaming replication
4.1 PostgreSQL failover
PROD
To make the current location of the primary database transparent, a VIP (virtual IP) address is used on PROD; this solution is so far implemented only on the PROD environment. The address is active on the node which currently hosts the primary instance, and all client connections must use this VIP. In case of a DB failover/switchover the address follows the role change and is brought up on the new primary server. When clients re-establish their connections, they connect to the new primary.
As of now, moving the VIP is a manual task in our setup.
The following instructions contain commands for manual VIP management. It is preferable to manage the VIP via interface configuration files; such a configuration is more durable and survives machine restarts.
On nlxs5144 the secondary (VIP) address on the interface bond0 is controlled by /etc/sysconfig/network-scripts/ifcfg-bond0:0. You can manage the alias interface bond0:0 with the usual commands such as ifconfig or ip. If you intentionally take it down, remember to "mask" the interface file so that it is not accidentally brought up again. You can create a similar file for bond0:0 on nlxs5145 and keep it masked/disabled; this way you can easily bring the VIP up there when required.
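A minimal sketch of such an interface file, with the VIP address taken from the switchover commands in section 4.3 and the remaining values assumed, could look like:

```
# /etc/sysconfig/network-scripts/ifcfg-bond0:0  (sketch; values partly assumed)
DEVICE=bond0:0
IPADDR=199.6.212.138
PREFIX=25
ONPARENT=yes
```

On the standby node, keep the corresponding file disabled (for example ONPARENT=no, or the file masked) so the VIP is never active on both nodes at once.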
QA & DEV
On the QA and DEV environments the VIP is not implemented. To perform a manual failover on QA and DEV there is no VIP-owning interface to move; instead, modify the configuration of the Hadoop components which use PostgreSQL as a backend relational database:
Ranger
Ranger KMS
Hive
Oozie
Once these changes are done, a restart of all affected components is required.
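To verify that the reconfigured components have reconnected to the intended server, the active connections per database can be listed on either PostgreSQL node; this is a generic 9.6 query, not taken from the original procedure:

```
-- active client connections per database and source address
SELECT datname, client_addr, count(*)
FROM pg_stat_activity
GROUP BY datname, client_addr
ORDER BY datname;
```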
4.2 Replication state monitoring
Check the latest received and replayed transactions (on the standby node):
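On 9.6 this check typically combines pg_is_in_recovery() with pg_last_xlog_receive_location() and pg_last_xlog_replay_location() on the standby. The xlog locations are hex pairs of the form X/Y; the byte lag between two locations can be computed with a small helper such as this (a generic sketch, not part of the original procedure):

```shell
# Convert a PostgreSQL 9.6 xlog location "X/Y" (both parts hex) into an
# absolute byte position: X * 2^32 + Y.
lsn_to_bytes() {
  local hi lo
  IFS=/ read -r hi lo <<< "$1"
  echo $(( 16#$hi * 4294967296 + 16#$lo ))
}

# Example: byte lag between a received and a replayed location
recv=$(lsn_to_bytes "1/0")
replay=$(lsn_to_bytes "0/FF000000")
echo $(( recv - replay ))   # prints 16777216 (bytes the replay lags behind)
```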
4.3 Switchover – when you just want to switch roles between the primary and standby databases
On the current primary server, stop the primary DB (clean shutdown) and bring down the VIP (PROD only):
/etc/init.d/postgresql-9.6 stop
ip addr del 199.6.212.138/25 dev bond0
Then, on the standby server, promote it to primary by renaming recovery.conf away and restarting:
cd /var/lib/pgsql/9.6/data/
mv recovery.conf{,.ready}
/etc/init.d/postgresql-9.6 restart
On PROD, bring the VIP up on the new primary server:
ip addr add 199.6.212.138/25 dev bond0
Demote the previous master to a new standby:
su - postgres
exit
/etc/init.d/postgresql-9.6 start
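Demoting a 9.6 instance to a standby requires a recovery.conf in the data directory; a minimal sketch, with the primary_conninfo values assumed rather than taken from this runbook, is:

```
# /var/lib/pgsql/9.6/data/recovery.conf (sketch; host/port/user are assumptions)
standby_mode = 'on'
primary_conninfo = 'host=199.6.212.138 port=5432 user=replication'
recovery_target_timeline = 'latest'
```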
Verify the state of the new standby; the following query should return TRUE:
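On PostgreSQL 9.6 a query that returns TRUE on a healthy standby is:

```
-- returns 't' (TRUE) while the instance is running as a standby (in recovery)
SELECT pg_is_in_recovery();
```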
4.4 Failover – when the primary database fails and it's not recoverable
If the old primary database is somehow still running, stop it or kill it. If the old primary node is still up and the VIP is still configured on it, bring the VIP down (PROD only):
ip addr del 199.6.212.138/25 dev bond0
Then promote the standby by renaming recovery.conf away and restarting:
cd /var/lib/pgsql/9.6/data/
mv recovery.conf{,.ready}
/etc/init.d/postgresql-9.6 restart
Note: once all problems on the old primary server are resolved, you have to manually recreate the standby database, as described below.
Recreating the standby database after failover, or when the standby database falls too far behind:
All steps are to be executed on the new standby server (the previous primary).
Copy the data from the primary DB. Switch to the postgres user and move the old data directory aside:
su - postgres
mv /var/lib/pgsql/9.6/data{,.$(date '+%Y%m%d%H%M')}
exit
/etc/init.d/postgresql-9.6 start
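The base-backup command itself is not shown above; on 9.6 it is typically taken while still the postgres user, after moving the old data directory aside and before starting the instance, with something like the following, where the primary hostname and replication user are assumptions:

```
# run as postgres; -R writes a minimal recovery.conf pointing back at the source
pg_basebackup -h <primary-host> -p 5432 -U replication \
  -D /var/lib/pgsql/9.6/data -X stream -P -R
```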
Verify the state of the new standby; the following query should return TRUE:
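As in the switchover procedure, on PostgreSQL 9.6 the standby state can be confirmed with:

```
-- returns 't' (TRUE) while the instance is running as a standby (in recovery)
SELECT pg_is_in_recovery();
```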
5 Restoring database from backup
Backups cover the following databases:
ambari
hive
oozie
ranger
rangerkms
Copy the requested backup version onto the master database server's local filesystem.
su - postgres
cd /tmp
mkdir DB_restore
cd DB_restore/
-bash-4.1$ pwd
/tmp/DB_restore
scp backup@nlxs5133:/home/backup/DEVbackups/nlxs5270_postgres/archive/YYYY.MM.DD_09.57_postgres.tgz .
Once the archive is unpacked, the directory contains:
YYYY.MM.DD_09.57_postgres.tgz
YYYY.MM.DD_09.57_data_ambaridev.sql
YYYY.MM.DD_09.57_data_hive.sql
YYYY.MM.DD_09.57_data_oozie.sql
YYYY.MM.DD_09.57_data_postgres.sql
YYYY.MM.DD_09.57_data_rangerkms.sql
YYYY.MM.DD_09.57_data_ranger.sql
YYYY.MM.DD_09.57_data_rundeck.sql
YYYY.MM.DD_09.57_data_test_dr.sql
YYYY.MM.DD_09.57_data_tracker.sql
YYYY.MM.DD_09.57_dbacl.sql
YYYY.MM.DD_09.57_roles.sql
pg_hba.conf
postgresql.conf
5.1 Ambari
# ambari-server status
Using python /usr/bin/python
Ambari-server status
Ambari Server running
Interrupt existing connections to the ambaridev db and forbid new connections (PostgreSQL cannot drop a database while any connections to it remain established).
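The exact commands are environment-specific; as the postgres superuser, a generic 9.6 way to forbid new connections and terminate the existing ones (database name taken from this section) is:

```
-- forbid new connections to the database
UPDATE pg_database SET datallowconn = false WHERE datname = 'ambaridev';
-- terminate the sessions that are still connected
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE datname = 'ambaridev';
```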
The ambari database is now dropped and has disappeared from the list of existing databases – the same applies on the secondary (stand-by) server.
- ambari-server should still be running at this point, but it will not work properly; it is not possible to manage HDP services from Ambari. From the ambari-server logs:
Internal Exception: org.postgresql.util.PSQLException: FATAL: database "ambari" does not exist
- Create an empty database called ambari (the roles still exist, they were not deleted)
- Restart ambari-server
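A generic sketch of the recreate-and-import step, run as the postgres user, with the dump filename taken from the backup listing above (database and dump names vary per environment):

```
# run as postgres; dump filename from the backup listing (varies per environment)
createdb ambari
psql -d ambari -f /tmp/DB_restore/YYYY.MM.DD_09.57_data_ambaridev.sql
# re-enable connections if datallowconn was set to false earlier
psql -c "UPDATE pg_database SET datallowconn = true WHERE datname = 'ambari';"
```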
5.2 Hive
Interrupt existing connections to the hive db and forbid new connections (PostgreSQL cannot drop a database while any connections to it remain established).
The hive database is now dropped and has disappeared from the list of existing databases – the same applies on the secondary (stand-by) server.
raise ExecuteTimeoutException(err_msg)
ExecuteTimeoutException: Execution of 'ambari-sudo.sh su ambari-qa -l -s /bin/bash -c
'export PATH='"'"'/usr/sbin:/sbin:/usr/lib/ambari-server/*:/usr/sbin:/sbin:/usr/lib/
ambari-server/*:/sbin:/usr/sbin:/bin:/usr/bin:/var/lib/ambari-agent:/var/lib/ambari-
agent:/bin/:/usr/bin/:/usr/sbin/:/usr/hdp/current/hive-metastore/bin'"'"' ; export
HIVE_CONF_DIR='"'"'/usr/hdp/current/hive-metastore/conf'"'"' ; hive --hiveconf
hive.metastore.uris=thrift://slb-0.local:9083 --hiveconf
hive.metastore.client.connect.retry.delay=1 --hiveconf
hive.metastore.failure.retries=1 --hiveconf
hive.metastore.connect.retries=1 --hiveconf
hive.metastore.client.socket.timeout=14 --hiveconf
hive.execution.engine=mr -e '"'"'show databases;'"'"''' was killed due timeout after
60 seconds
)
a.) Create an empty database called hive (the roles still exist, they were not deleted)
b.) Import the dump from backup into the newly created empty database.
Check hive component logs and functionality; request the developer team to run test hive jobs.
5.3 Ranger
Interrupt existing connections to the ranger db and forbid new connections (PostgreSQL cannot drop a database while any connections to it remain established).
Drop the ranger database.
The ranger database is now dropped and has disappeared from the list of existing databases – the same applies on the secondary (stand-by) server.
- Create an empty database called ranger (the roles still exist, they were not deleted)
5.4 Rangerkms
Interrupt existing connections to the rangerkms db and forbid new connections (PostgreSQL cannot drop a database while any connections to it remain established).
d.) Check active connections [point a.)] – the count must be 0.
The rangerkms database is now dropped and has disappeared from the list of existing databases – the same applies on the secondary (stand-by) server.
- Create an empty database called rangerkms (the roles still exist, they were not deleted)
5.5 Oozie
Interrupt existing connections to the oozie db and forbid new connections (PostgreSQL cannot drop a database while any connections to it remain established).
The oozie database is now dropped and has disappeared from the list of existing databases – the same applies on the secondary (stand-by) server.
- Create an empty database called oozie (the roles still exist, they were not deleted)