POSTGRES Runbook

POSTGRESQL HA RUNBOOK FOR SLB
PROD/QA/DEV ENVIRONMENTS
© Copyright 2019, ATOS PGS sp. z o.o. All rights reserved. Reproduction in whole or in part is prohibited without the prior
written consent of the copyright owner. For any questions or remarks on this document, please contact Atos Poland Global
Services, +48 22 4446500.
AUTHOR(S) : Michal Drazovsky

DOCUMENT NUMBER :
VERSION : 1.0
STATUS : Final
SOURCE : Atos Poland Global Services
DOCUMENT DATE : 06 December 2019
NUMBER OF PAGES : 25
OWNER
HDFS HA Runbook for SLB PROD/QA/DEV environments
version: 0.1
Public
document number:
Contents
1 Audience and document purpose
2 Components in scope
3 PostgreSQL instances
4 PostgreSQL HA – streaming replication
4.1 PostgreSQL failover
PROD
QA & DEV
4.2 Replication state monitoring
4.3 Switchover – when you just want to switch roles between the primary and standby databases
4.4 Failover – when the primary database fails and it’s not recoverable
5 Restoring database from backup
5.1 Ambari
5.2 Hive
5.3 Ranger
5.4 Rangerkms
5.5 Oozie
Atos Poland Global Services

06 December 2019
2 of 26
List of changes
Version   Date         Description               Author(s)
0.1       27.11.2019   Initial version           MDrazovsky
1.0       06.12.2019   Final document overview   M.Niewiatowski

1 Audience and document purpose
This document has been prepared for the SLB HDP platform administrators and the ATOS team
responsible for maintaining the PROD, QA and DEV environments. End users and business teams
are neither participants in the process nor intended recipients of this document.
The document describes the configuration as of the document date, together with the processes
and detailed steps for backing up, restoring and bringing up the HA solution for the services
during an HA/DR drill or a real incident.
The processes described here are based on PostgreSQL project best practices and documentation.
The links below are an integral part of the knowledge required to operate them.
While creating this runbook, the authors followed the recommendations in the following articles:
• https://www.postgresql.org/docs/9.6/warm-standby.html
• https://wiki.postgresql.org/wiki/Streaming_Replication
• https://www.postgresql.org/docs/9.6/runtime-config-replication.html
• https://www.postgresql.org/docs/9.6/backup.html

2 Components in scope
This document covers the detailed procedures for PostgreSQL disaster recovery, both from the HA
perspective and from the backup restoration perspective, for each environment (PROD, QA, DEV).

3 PostgreSQL instances
The database is deployed on the two nodes in each environment:
PROD: nlxs5144, nlxs5145
QA: nlxs5146, nlxs5276
DEV: nlxs5270, nlxs5269
Normally the first node hosts the primary DB instance and the second node hosts the standby.
The primary instance supports all types of operations; the standby supports only reads. If the
primary node fails, you must promote the standby database manually (instructions are provided
below).

4 PostgreSQL HA – streaming replication
4.1 PostgreSQL failover
PROD
To make the current location of the primary database transparent to clients, PROD uses a VIP
(virtual IP) address. So far this solution is implemented only on the PROD environment. The
address is active on the node that currently hosts the primary instance, and all client
connections must use it. On a DB failover or switchover the address follows the role change and
is brought up on the new primary server. When clients re-establish their connections, they
connect to the new primary.
Moving the VIP is currently a manual task.
The VIP address used in PROD is 199.6.212.138.
The following instructions contain commands for manual VIP management. It is better to manage
the VIP through interface files, because that configuration is more durable and survives machine
restarts.
On nlxs5144 the secondary (VIP) address on the interface bond0 is controlled by
/etc/sysconfig/network-scripts/ifcfg-bond0:0. You can manage the alias interface bond0:0 with
the usual commands such as ifconfig or ip. If you intentionally take it down, remember to “mask”
(disable) the interface file so that it is not brought up accidentally. You can create a similar
file for bond0:0 on nlxs5145 and keep it masked/disabled, so that the VIP can easily be brought
up there when required.
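As a minimal sketch (not part of the original procedure), the check for whether the VIP is currently active on a node could be scripted as below. The function name vip_present is hypothetical. It only inspects text in the format printed by ip -o -4 addr show and changes nothing on the system.

```shell
# Hypothetical helper: succeeds if the given address appears in the
# output of `ip -o -4 addr show` supplied on standard input.
vip_present() {
    # $1 = VIP address. -F matches the string literally, and the
    # trailing "/" prevents a partial match on a longer address.
    grep -Fq "inet $1/"
}
```

Possible usage on a PROD node: ip -o -4 addr show dev bond0 | vip_present 199.6.212.138 && echo "VIP is active on this node"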
QA & DEV
On the QA and DEV environments the VIP is not implemented. A manual failover there does not
involve moving a VIP. Instead, update the configuration of the Hadoop components that use
PostgreSQL as their backend relational database so that they point to the new primary host:
Ranger
Configs / Ranger Admin :

- JDBC connect string for a Ranger database: jdbc:postgresql://nlxsXYZ.best-nl0114.slb.com:5432/ranger
- JDBC connect string for root user: jdbc:postgresql://nlxsXYZ.best-nl0114.slb.com:5432/postgres

Ranger KMS
Configs / Settings / Ranger KMS DB :

- JDBC connect string: jdbc:postgresql://nlxsXYZ.best-nl0114.slb.com:5432/rangerkms

Hive
Advanced / Hive Metastore / Database URL:

- jdbc:postgresql://nlxsXYZ.best-nl0114.slb.com:5432/hive

Oozie
Configs / Oozie Server / Database URL :

- jdbc:postgresql://nlxsXYZ.best-nl0114.slb.com:5432/oozie
Once these changes are made, restart all affected components.
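To reduce typing errors when updating the connect strings above, the URLs for a new primary host could be generated rather than edited by hand. This is only a sketch: the function names are hypothetical, the port and database names are taken from the list above, and the host name must be supplied by the operator.

```shell
# Hypothetical helper: print the JDBC URL for one database on a given host.
jdbc_url() {
    # $1 = fully qualified host name of the new primary, $2 = database name
    printf 'jdbc:postgresql://%s:5432/%s\n' "$1" "$2"
}

# Print the URLs for every connect string listed above
# (Ranger, Ranger root user, Ranger KMS, Hive, Oozie).
print_jdbc_urls() {
    for db in ranger postgres rangerkms hive oozie; do
        jdbc_url "$1" "$db"
    done
}
```

For example, print_jdbc_urls nlxsXYZ.best-nl0114.slb.com prints the five URLs to paste into the Ambari configuration screens, with nlxsXYZ replaced by the actual node name.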
4.2 Replication state monitoring

Check if a DB running on a node is in the recovery state (is a standby DB):
su - postgres -c "psql -c \"SELECT CASE pg_is_in_recovery() WHEN 't' THEN 'TRUE' ELSE 'FALSE' END in_recovery;\""
Check replication details (on the primary node):
su - postgres -c 'psql -x -c "select * from pg_stat_replication;"'
Check the sender and receiver processes (on both nodes):
ps -fu postgres | egrep "wal (sender|receiver) proc"
Check current transaction log locations (on the primary node):
su - postgres -c "psql -x -c 'select pg_current_xlog_location(),pg_current_xlog_insert_location()'"
Check latest received and replayed transactions (on the standby node):
su - postgres -c 'psql -x -c "select pg_last_xlog_receive_location(), pg_last_xlog_replay_location();"'
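The locations returned by these queries have the form X/Y, where both parts are hexadecimal numbers. The sketch below shows one way to turn two such values, copied from the primary and the standby, into a lag in bytes. The function names are hypothetical. PostgreSQL 9.6 also offers the server-side function pg_xlog_location_diff() for the same calculation, so this is a convenience for comparing values by hand, not a replacement.

```shell
# Hypothetical helper: convert an xlog location such as 0/3000060
# into an absolute byte position (high part * 2^32 + low part).
lsn_to_bytes() {
    hi=${1%%/*}    # hexadecimal digits before the slash
    lo=${1##*/}    # hexadecimal digits after the slash
    echo $(( 0x$hi * 4294967296 + 0x$lo ))
}

# Print how many bytes the second location is behind the first.
lsn_lag_bytes() {
    echo $(( $(lsn_to_bytes "$1") - $(lsn_to_bytes "$2") ))
}
```

For example, lsn_lag_bytes 0/3000060 0/3000000 prints 96. A result of 0 between the current location on the primary and the replay location on the standby means the standby has replayed everything.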

4.3 Switchover – when you just want to switch roles between the primary and standby databases
Stop the primary DB (clean shutdown) and bring down the VIP (the VIP step applies only on PROD):
/etc/init.d/postgresql-9.6 stop
ip addr del 199.6.212.138/25 dev bond0
Check the synchronization of the standby:
su - postgres -c 'psql -x -c "select pg_last_xlog_receive_location(),pg_last_xlog_replay_location();"'
Promote the standby to a new master:
Bring up the VIP on the new master. This step is required only on PROD. On QA and DEV the VIP is
not used, and after a manual switchover the HDP configuration changes described above are needed
instead.
ip addr add 199.6.212.138/25 dev bond0 label bond0:0
Remove (rename) the recovery.conf file:
cd /var/lib/pgsql/9.6/data/
mv recovery.conf{,.ready}
Restart the database:
/etc/init.d/postgresql-9.6 restart
The following query should return FALSE:
su - postgres -c "psql -c \"SELECT CASE pg_is_in_recovery() WHEN 't' THEN 'TRUE' ELSE 'FALSE' END in_recovery;\""

Demote the previous master to a new standby:
su - postgres
Create a new recovery.conf file (on PROD the host is the VIP; on QA and DEV, which have no VIP,
use the address of the new primary):
cat > /var/lib/pgsql/9.6/data/recovery.conf <<EOF
standby_mode = on
recovery_target_timeline = 'latest'
primary_conninfo = 'host=199.6.212.138 port=5432 user=replicator'
EOF
exit
Start the database:
/etc/init.d/postgresql-9.6 start
Verify the state of the new standby; the following query should return TRUE:
su - postgres -c "psql -c \"SELECT CASE pg_is_in_recovery() WHEN 't' THEN 'TRUE' ELSE 'FALSE' END in_recovery;\""
Check the synchronization

On the primary node:
su - postgres -c "psql -x -c 'select pg_current_xlog_location(),pg_current_xlog_insert_location()'"
On the standby node:
su - postgres -c 'psql -x -c "select pg_last_xlog_receive_location(), pg_last_xlog_replay_location();"'
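The recovery.conf content used when demoting the old master differs between environments only in the primary host. A small function could render it, as sketched below. The function name is hypothetical, and the settings are exactly the three used in this runbook.

```shell
# Hypothetical helper: print recovery.conf content for a standby.
render_recovery_conf() {
    # $1 = host or VIP of the primary, $2 = replication user
    cat <<EOF
standby_mode = on
recovery_target_timeline = 'latest'
primary_conninfo = 'host=$1 port=5432 user=$2'
EOF
}
```

As the postgres user, the output could be redirected into place, for example: render_recovery_conf 199.6.212.138 replicator > /var/lib/pgsql/9.6/data/recovery.conf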

4.4 Failover – when the primary database fails and it’s not recoverable
If the old primary database is somehow still running, stop it or kill it.
If the old primary node is still up and the VIP is still configured on it, bring the VIP down
(only on PROD):
ip addr del 199.6.212.138/25 dev bond0
Promote the standby to a new master:
Bring up the VIP on the new master (only on PROD):
ip addr add 199.6.212.138/25 dev bond0 label bond0:0
Remove (rename) the recovery.conf file:
cd /var/lib/pgsql/9.6/data/
mv recovery.conf{,.ready}
Restart the database:
/etc/init.d/postgresql-9.6 restart
The following query should return FALSE:
su - postgres -c "psql -c \"SELECT CASE pg_is_in_recovery() WHEN 't' THEN 'TRUE' ELSE 'FALSE' END in_recovery;\""
Note: once all problems on the old primary server are resolved, you have to recreate the standby
database manually, as described in the procedure below.
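The TRUE/FALSE checks used after promotion and demotion can also be scripted. The sketch below assumes the raw value of pg_is_in_recovery() has been obtained with psql -At (unaligned, tuples-only output), which prints t or f. The function name node_role is hypothetical.

```shell
# Hypothetical helper: map the raw result of
#   psql -At -c "SELECT pg_is_in_recovery();"
# to a role name. Fails on any unexpected value.
node_role() {
    case "$1" in
        t) echo standby ;;    # in recovery: the node is a standby
        f) echo primary ;;    # not in recovery: the node is the primary
        *) echo unknown; return 1 ;;
    esac
}
```

Possible usage as root on a database node: node_role "$(su - postgres -c 'psql -At -c "SELECT pg_is_in_recovery();"')"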
Recreating the standby database after failover or when the standby database falls too far behind:
All steps are to be executed on the new standby server (the previous primary).

Switch to the postgres user:
su - postgres
Make a backup of the currently available data location:
mv /var/lib/pgsql/9.6/data{,.$(date '+%Y%m%d%H%M')}
Copy data from the primary database:
pg_basebackup -h 199.6.212.138 -D /var/lib/pgsql/9.6/data/ -U replicator -P -v -x
Create a new recovery.conf file:
cat > /var/lib/pgsql/9.6/data/recovery.conf <<EOF
standby_mode = on
recovery_target_timeline = 'latest'
primary_conninfo = 'host=199.6.212.138 port=5432 user=replicator'
EOF
exit
Start the standby database:
/etc/init.d/postgresql-9.6 start
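Verify the state of the new standby; the following query should return TRUE:
su - postgres -c "psql -c \"SELECT CASE pg_is_in_recovery() WHEN 't' THEN 'TRUE' ELSE 'FALSE' END in_recovery;\""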

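pg_basebackup expects its target directory to be either absent or empty, which is why the old data directory is moved aside first in the procedure above. A pre-flight check along these lines could catch a forgotten move. The function name is hypothetical and the check is only a sketch.

```shell
# Hypothetical pre-flight check: succeeds if the target directory
# for pg_basebackup is absent or is an empty directory.
data_dir_ready() {
    # $1 = target directory, e.g. /var/lib/pgsql/9.6/data
    if [ ! -e "$1" ]; then
        return 0    # does not exist yet, pg_basebackup will create it
    fi
    # exists: it must be a directory with no entries (including dotfiles)
    [ -d "$1" ] && [ -z "$(ls -A "$1")" ]
}
```

For example: data_dir_ready /var/lib/pgsql/9.6/data || echo "move the old data directory aside first"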
5 Restoring database from backup
List of all databases related to HDP:
ambari
hive
oozie
ranger
rangerkms
Backup location for PROD:
[root@nlxs5133 archive]# pwd
/home/backup/PRODbackups/nlxs5144_postgres/archive
Backup location for QA:
/home/backup/QAbackups/nlxs5146_postgres/archive
Backup location for DEV:
/home/backup/DEVbackups/nlxs5270_postgres/archive
Copy the requested backup version to the local filesystem of the master database server.

su - postgres
cd /tmp
mkdir DB_restore
cd DB_restore/
-bash-4.1$ pwd
/tmp/DB_restore
scp backup@nlxs5133:/home/backup/DEVbackups/nlxs5270_postgres/archive/YYYY.MM.DD_09.57_postgres.tgz .
YYYY.MM.DD_09.57_postgres.tgz
100% 72MB 72.4MB/s 00:01
tar zxvf YYYY.MM.DD_09.57_postgres.tgz
YYYY.MM.DD_09.57_data_ambaridev.sql
YYYY.MM.DD_09.57_data_hive.sql
YYYY.MM.DD_09.57_data_oozie.sql
YYYY.MM.DD_09.57_data_postgres.sql
YYYY.MM.DD_09.57_data_rangerkms.sql
YYYY.MM.DD_09.57_data_ranger.sql
YYYY.MM.DD_09.57_data_rundeck.sql
YYYY.MM.DD_09.57_data_test_dr.sql
YYYY.MM.DD_09.57_data_tracker.sql
YYYY.MM.DD_09.57_dbacl.sql
YYYY.MM.DD_09.57_roles.sql
pg_hba.conf
postgresql.conf
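Before dropping anything, it is worth confirming that the extracted archive really contains a dump for each database to be restored. The sketch below assumes the file naming seen in the listing above (prefix, then _data_, then the database name, then .sql). The function name is hypothetical.

```shell
# Hypothetical helper: print the databases for which no dump file
# exists in the extracted archive directory.
missing_dumps() {
    # $1 = directory with the extracted archive
    # $2 = file name prefix, e.g. YYYY.MM.DD_09.57
    # remaining arguments = database names to check
    dir=$1
    prefix=$2
    shift 2
    for db in "$@"; do
        [ -f "$dir/${prefix}_data_${db}.sql" ] || echo "$db"
    done
}
```

For example: missing_dumps /tmp/DB_restore YYYY.MM.DD_09.57 hive oozie ranger rangerkms prints nothing when all four dumps are present.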

5.1 Ambari
• The Ambari service uses the ambari database.
• The Ambari server is still running:
# ambari-server status
Using python /usr/bin/python
Ambari-server status
Ambari Server running
• Terminate the existing connections to the ambari database and forbid new ones (PostgreSQL
cannot drop a database while any connections to it are established).
a.) List active connections:
SELECT * FROM pg_stat_activity WHERE datname = 'ambari';
b.) Forbid new connections to the database:
UPDATE pg_database SET datallowconn = 'false' WHERE datname = 'ambari';
c.) Terminate all active connections:
SELECT pg_terminate_backend (pg_stat_activity.pid) FROM pg_stat_activity WHERE pg_stat_activity.datname = 'ambari';
d.) Check active connections again [point a.)]; the result has to be 0 rows.
• Drop the ambari database
DROP DATABASE IF EXISTS ambari;
The ambari database is dropped and no longer appears in the database list; the same is true on
the secondary (standby) server.
• Check the Ambari service
- It should still be running, but it does not work properly: HDP services cannot be managed
from Ambari.
From the ambari-server logs:
Internal Exception: org.postgresql.util.PSQLException: FATAL: database "ambari" does not exist
• Restore the ambari database from backup
- Create an empty database called ambari (the roles still exist; they were not deleted):
CREATE DATABASE ambari;
GRANT ALL ON DATABASE ambari TO ambari;
- Import the dump from the backup into the newly created empty database (the DEV archive listing
above names this dump YYYY.MM.DD_09.57_data_ambaridev.sql; use the file name present in your
archive):
psql ambari < YYYY.MM.DD_09.57_data_ambari.sql
• Restart ambari-server
• Check the ambari-server logs and functionality
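The same SQL sequence is used for every database in this chapter, with only the database name changing. As a sketch, the statements could be generated from the name so that they can be reviewed before being run. The function name is hypothetical, and the statements are the ones from this runbook, assuming (as the runbook does) that the owning role has the same name as the database.

```shell
# Hypothetical helper: print the SQL used in this chapter to drop and
# recreate one database. Intended to be reviewed, then run by a
# superuser connected to a different database (for example postgres).
recreate_db_sql() {
    # $1 = database name (also used as the role name in the GRANT)
    cat <<EOF
UPDATE pg_database SET datallowconn = 'false' WHERE datname = '$1';
SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname = '$1';
DROP DATABASE IF EXISTS $1;
CREATE DATABASE $1;
GRANT ALL ON DATABASE $1 TO $1;
EOF
}
```

For example, recreate_db_sql hive prints the five statements for the hive database. The function only prints text and does not connect to PostgreSQL.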
5.2 Hive
• The Hive service uses the hive database for its metastore.
• The Hive component is still running; do not stop it from the Ambari web UI.
• Terminate the existing connections to the hive database and forbid new ones (PostgreSQL
cannot drop a database while any connections to it are established).
SELECT * FROM pg_stat_activity WHERE datname = 'hive';
UPDATE pg_database SET datallowconn = 'false' WHERE datname = 'hive';
SELECT pg_terminate_backend (pg_stat_activity.pid) FROM pg_stat_activity WHERE pg_stat_activity.datname = 'hive';
• Drop the hive database
DROP DATABASE IF EXISTS hive;
The hive database is dropped and no longer appears in the database list; the same is true on the
secondary (standby) server.
• Check the Hive service
- It should still be running, but errors start to appear in the logs.
- The Hive metastore is not accessible. After the timeout period Ambari triggers an alert.
From the hivemetastore logs:

Failed to acquire connection to jdbc:postgresql://xxx. Sleeping for 7000 ms. Attempts
left: 5
org.postgresql.util.PSQLException: FATAL: database "hive" does not exist
After the timeout period, the Hive metastore check triggered an alert in Ambari:
raise ExecuteTimeoutException(err_msg)
ExecuteTimeoutException: Execution of 'ambari-sudo.sh su ambari-qa -l -s /bin/bash -c
'export PATH='"'"'/usr/sbin:/sbin:/usr/lib/ambari-server/*:/usr/sbin:/sbin:/usr/lib/
ambari-server/*:/sbin:/usr/sbin:/bin:/usr/bin:/var/lib/ambari-agent:/var/lib/ambari-
agent:/bin/:/usr/bin/:/usr/sbin/:/usr/hdp/current/hive-metastore/bin'"'"' ; export
HIVE_CONF_DIR='"'"'/usr/hdp/current/hive-metastore/conf'"'"' ; hive --hiveconf
hive.metastore.uris=thrift://slb-0.local:9083 --hiveconf
hive.metastore.client.connect.retry.delay=1 --hiveconf
hive.metastore.failure.retries=1 --hiveconf
hive.metastore.connect.retries=1 --hiveconf
hive.metastore.client.socket.timeout=14 --hiveconf
hive.execution.engine=mr -e '"'"'show databases;'"'"''' was killed due timeout after
60 seconds
)
• Restore the hive database from backup
a.) Create an empty database called hive (the roles still exist; they were not deleted):
CREATE DATABASE hive;
GRANT ALL ON DATABASE hive TO hive;
b.) Import the dump from the backup into the newly created empty database:
psql hive < YYYY.MM.DD_09.57_data_hive.sql
• Restart the Hive component
• Check the Hive component logs and functionality, and ask the development team to run test
Hive jobs
5.3 Ranger
• The Ranger service uses the ranger database.
• The Ranger component is still running; do not stop it from the Ambari web UI.
• Terminate the existing connections to the ranger database and forbid new ones (PostgreSQL
cannot drop a database while any connections to it are established).
SELECT * FROM pg_stat_activity WHERE datname = 'ranger';
UPDATE pg_database SET datallowconn = 'false' WHERE datname = 'ranger';
SELECT pg_terminate_backend (pg_stat_activity.pid) FROM pg_stat_activity WHERE pg_stat_activity.datname = 'ranger';
• Drop the ranger database
DROP DATABASE IF EXISTS ranger;
The ranger database is dropped and no longer appears in the database list; the same is true on
the secondary (standby) server.
• Check the Ranger service
- It should still be running, but errors start to appear in the logs.
• Restore the ranger database from backup
- Create an empty database called ranger (the roles still exist; they were not deleted):
CREATE DATABASE ranger;
GRANT ALL ON DATABASE ranger TO ranger;
- Import the dump from the backup into the newly created empty database:
psql ranger < YYYY.MM.DD_09.57_data_ranger.sql
• Restart the Ranger service
• Check the Ranger component logs and functionality
5.4 Rangerkms
• The Rangerkms service uses the rangerkms database.
• The Rangerkms component is still running; do not stop it from the Ambari web UI.
• Terminate the existing connections to the rangerkms database and forbid new ones (PostgreSQL
cannot drop a database while any connections to it are established).
SELECT * FROM pg_stat_activity WHERE datname = 'rangerkms';
UPDATE pg_database SET datallowconn = 'false' WHERE datname = 'rangerkms';
SELECT pg_terminate_backend (pg_stat_activity.pid) FROM pg_stat_activity WHERE pg_stat_activity.datname = 'rangerkms';
• Drop the rangerkms database
DROP DATABASE IF EXISTS rangerkms;
The rangerkms database is dropped and no longer appears in the database list; the same is true
on the secondary (standby) server.
• Check the Rangerkms service
- It should still be running, but errors start to appear in the logs.
• Restore the rangerkms database from backup
- Create an empty database called rangerkms (the roles still exist; they were not deleted):
CREATE DATABASE rangerkms;
GRANT ALL ON DATABASE rangerkms TO rangerkms;
- Import the dump from the backup into the newly created empty database:
psql rangerkms < YYYY.MM.DD_09.57_data_rangerkms.sql
• Restart the Rangerkms service
• Check the Rangerkms component logs and functionality
5.5 Oozie
• The Oozie service uses the oozie database.
• The Oozie component is still running; do not stop it from the Ambari web UI.
• Terminate the existing connections to the oozie database and forbid new ones (PostgreSQL
cannot drop a database while any connections to it are established).
a.) List active connections:
SELECT * FROM pg_stat_activity WHERE datname = 'oozie';
b.) Forbid new connections to the database:
UPDATE pg_database SET datallowconn = 'false' WHERE datname = 'oozie';
c.) Terminate all active connections:
SELECT pg_terminate_backend (pg_stat_activity.pid) FROM pg_stat_activity WHERE pg_stat_activity.datname = 'oozie';
• Drop the oozie database
DROP DATABASE IF EXISTS oozie;
The oozie database is dropped and no longer appears in the database list; the same is true on
the secondary (standby) server.
• Check the Oozie service
- It should still be running, but errors start to appear in the logs.
• Restore the oozie database from backup
- Create an empty database called oozie (the roles still exist; they were not deleted):
CREATE DATABASE oozie;
GRANT ALL ON DATABASE oozie TO oozie;
- Import the dump from the backup into the newly created empty database:
psql oozie < YYYY.MM.DD_09.57_data_oozie.sql
• Restart the Oozie service
• Check the Oozie component logs and functionality
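By default psql continues after an SQL error, so a partially failed import can go unnoticed. Setting the psql variable ON_ERROR_STOP makes it stop at the first error and exit with a non-zero status. The sketch below only builds the command line so that it can be checked before use. The function name is hypothetical, and the dump file naming follows the archive listing in chapter 5.

```shell
# Hypothetical helper: print (not run) a psql import command that
# stops at the first SQL error instead of continuing.
restore_cmd() {
    # $1 = database name, $2 = dump file name prefix (e.g. YYYY.MM.DD_09.57)
    printf 'psql -v ON_ERROR_STOP=1 -d %s -f %s_data_%s.sql\n' "$1" "$2" "$1"
}
```

For example, restore_cmd oozie YYYY.MM.DD_09.57 prints: psql -v ON_ERROR_STOP=1 -d oozie -f YYYY.MM.DD_09.57_data_oozie.sql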
