POSTGRESQL HA RUNBOOK FOR SLB
PROD/QA/DEV ENVIRONMENTS

© Copyright 2019, ATOS PGS sp. z o.o. All rights reserved. Reproduction in whole or in part is prohibited without the prior
written consent of the copyright owner. For any questions or remarks on this document, please contact Atos Poland Global
Services, +48 22 4446500.

AUTHOR(S) : Michal Drazovsky


DOCUMENT NUMBER :
VERSION : 1.0
STATUS : Final
SOURCE : Atos Poland Global Services
DOCUMENT DATE : 06 December 2019
NUMBER OF PAGES : 25

OWNER
HDFS HA Runbook for SLB PROD/QA/DEV environments

version: 0.1

Public

document number:

Contents
1 Audience and document purpose
2 Components in scope
3 PostgreSQL instances
4 PostgreSQL HA – streaming replication
4.1 PostgreSQL failover
PROD
QA & DEV
4.2 Replication state monitoring
4.3 Switchover – when you just want to switch roles between the primary and standby databases
4.4 Failover – when the primary database fails and it’s not recoverable
5 Restoring database from backup
5.1 Ambari
5.2 Hive
5.3 Ranger
5.4 Rangerkms
5.5 Oozie

Atos Poland Global Services


06 December 2019

List of changes
version  Date        Description              Author(s)
0.1      27.11.2019  Initial version          MDrazovsky
1.0      06.12.2019  Final document overview  M.Niewiatowski


1 Audience and document purpose

This document has been prepared for the SLB HDP platform administrators and the ATOS team
responsible for maintaining the PROD, QA and DEV environments. End users and business teams are
neither participants in the process nor intended recipients of the document.
The document describes the configuration current at the date of writing, together with the
processes and detailed steps for backing up, restoring and bringing up the HA solution for the
services during an HA/DR drill or a real-life incident.
The processes described here are based on PostgreSQL project best practices and documentation.
The linked pages are an integral part of the knowledge required to operate them.
While writing this runbook, the authors followed the recommendations in the following
articles:

- https://www.postgresql.org/docs/9.6/warm-standby.html
- https://wiki.postgresql.org/wiki/Streaming_Replication
- https://www.postgresql.org/docs/9.6/runtime-config-replication.html
- https://www.postgresql.org/docs/9.6/backup.html


2 Components in scope

This document covers the detailed procedures for PostgreSQL disaster recovery, from both the HA
perspective and the backup restoration perspective, for each environment (PROD, QA, DEV).


3 PostgreSQL instances

The database is deployed on two nodes in each environment:

PROD: nlxs5144, nlxs5145

QA: nlxs5146, nlxs5276

DEV: nlxs5270, nlxs5269

Normally, the first node listed hosts the primary DB instance and the second hosts the standby.
The primary instance can be used for all types of operations; the standby supports reads only.
If the primary node fails, promote the standby database manually (instructions are provided in
section 4).
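
The node layout above can be captured in a small lookup helper, which may be convenient when scripting the later procedures. This is only a sketch: the function name pg_nodes is hypothetical, and the order reflects the normal role assignment described above, not necessarily the current state after a failover.

```shell
#!/bin/sh
# Hypothetical helper: print "<primary> <standby>" for an environment,
# using the normal (pre-failover) role assignment from this section.
pg_nodes() {
    case "$1" in
        PROD) echo "nlxs5144 nlxs5145" ;;
        QA)   echo "nlxs5146 nlxs5276" ;;
        DEV)  echo "nlxs5270 nlxs5269" ;;
        *)    echo "unknown environment: $1" >&2; return 1 ;;
    esac
}

pg_nodes PROD   # prints: nlxs5144 nlxs5145
```

Always confirm the actual roles with the checks in section 4.2 before acting on this mapping.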


4 PostgreSQL HA – streaming replication

4.1 PostgreSQL failover

PROD
To make the current location of the primary database transparent to clients, PROD uses a VIP
(virtual IP) address. So far this solution is implemented only on the PROD environment. The
address is active on the node that currently hosts the primary instance, and all client
connections must use it. During a DB failover or switchover the address follows the role change
and is brought up on the new primary server, so clients that re-establish their connections
reach the new primary.
At present, moving the VIP is a manual task.

The VIP address used in PROD is 199.6.212.138.

The following instructions contain commands for manual VIP management. Managing the VIP through
interface files is preferable, because that configuration is persistent and survives machine
restarts.
On nlxs5144 the secondary (VIP) address on the interface bond0 is controlled by
/etc/sysconfig/network-scripts/ifcfg-bond0:0. The alias interface bond0:0 can be managed with
the usual commands such as ifconfig or ip. If you take it down intentionally, remember to “mask”
(disable) the interface file so that the alias is not brought up accidentally. You can create a
similar file for bond0:0 on nlxs5145 and keep it masked/disabled, so that the VIP can be brought
up there quickly when required.
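
Before moving the VIP it helps to know which node currently holds it. The sketch below checks the one-line output of ip for the address. The helper name has_vip is hypothetical, and the function only inspects text passed on stdin, so it changes nothing on the host.

```shell
#!/bin/sh
VIP="199.6.212.138"   # PROD VIP from this runbook

# Hypothetical helper: succeed if the VIP appears in `ip -o -4 addr show`
# output supplied on stdin. The leading space and trailing slash prevent
# partial matches against longer or shorter addresses.
has_vip() {
    grep -qF " ${VIP}/"
}

# Usage on a live node (not executed here):
#   ip -o -4 addr show dev bond0 | has_vip && echo "VIP is active on this node"
```

Running the check on both PROD nodes before and after a role change can help confirm that the address is active on exactly one of them.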

QA & DEV
On the QA and DEV environments the VIP is not implemented. A manual failover there does not
involve moving a VIP. Instead, update the configuration of the Hadoop components that use
PostgreSQL as their backend relational database, replacing nlxsXYZ with the host name of the new
primary:

Ranger

Configs / Ranger Admin :

- JDBC connect string for a Ranger database: jdbc:postgresql://nlxsXYZ.best-nl0114.slb.com:5432/ranger
- JDBC connect string for root user: jdbc:postgresql://nlxsXYZ.best-nl0114.slb.com:5432/postgres

Ranger KMS

Configs / Settings / Ranger KMS DB :

- JDBC connect string: jdbc:postgresql://nlxsXYZ.best-nl0114.slb.com:5432/rangerkms

Hive

Advanced / Hive Metastore / Database URL:


- jdbc:postgresql://nlxsXYZ.best-nl0114.slb.com:5432/hive

Oozie

Configs / Oozie Server / Database URL :


- jdbc:postgresql://nlxsXYZ.best-nl0114.slb.com:5432/oozie

Once these changes are made, restart all affected components.
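
To avoid typing errors when updating the connect strings, the values can be generated for the new primary host. The following is a minimal sketch, with a hypothetical function name (jdbc_urls), that only prints the strings listed above. Entering them in Ambari remains a manual step.

```shell
#!/bin/sh
DOMAIN="best-nl0114.slb.com"   # domain as used in the connect strings above

# Hypothetical helper: print the JDBC URLs to enter in Ambari for the
# given short host name of the new primary.
jdbc_urls() {
    host="$1"
    for db in ranger postgres rangerkms hive oozie; do
        echo "jdbc:postgresql://${host}.${DOMAIN}:5432/${db}"
    done
}

# Example for QA, assuming nlxs5276 has become the primary:
jdbc_urls nlxs5276
```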

4.2 Replication state monitoring


Check if a DB running on a node is in the recovery state (is a standby DB):

su - postgres -c "psql -c \"SELECT CASE pg_is_in_recovery() WHEN 't' THEN 'TRUE' ELSE 'FALSE' END in_recovery;\""

Check replication details (on the primary node):

su - postgres -c 'psql -x -c "select * from pg_stat_replication;"'

Check the sender and receiver processes (on both nodes):

ps -fu postgres | egrep "wal (sender|receiver) proc"

Check current transaction log locations (on the primary node):

su - postgres -c "psql -x -c 'select pg_current_xlog_location(),pg_current_xlog_insert_location()'"

Check latest received and replayed transactions (on the standby node):

su - postgres -c 'psql -x -c "select pg_last_xlog_receive_location(), pg_last_xlog_replay_location();"'
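
The locations returned by the queries above have the form X/Y, two hexadecimal numbers. To judge how far the standby is behind, they can be converted to byte positions and subtracted. The sketch below does this in plain shell with hypothetical function names. PostgreSQL 9.6 also provides pg_xlog_location_diff() for the same calculation on the server side.

```shell
#!/bin/sh
# Convert an xlog location such as "16/B374D848" to a byte position.
# The part before the slash is the high 32 bits, the part after it the low 32 bits.
lsn_to_bytes() {
    hi=${1%/*}
    lo=${1#*/}
    echo $(( (0x$hi << 32) + 0x$lo ))
}

# Bytes by which the second location trails the first.
# $1 = location on the primary, $2 = location on the standby.
lsn_lag() {
    echo $(( $(lsn_to_bytes "$1") - $(lsn_to_bytes "$2") ))
}

lsn_lag 0/3000060 0/3000000   # prints 96
```

A lag of 0 after a clean shutdown of the primary indicates that the standby has received everything the primary wrote.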


4.3 Switchover – when you just want to switch roles between the primary and standby databases

Stop the primary DB (clean shutdown) and, on PROD, bring down the VIP:

/etc/init.d/postgresql-9.6 stop
ip addr del 199.6.212.138/25 dev bond0

Check the synchronization of the standby (run on the standby node):

su - postgres -c 'psql -x -c "select pg_last_xlog_receive_location(),pg_last_xlog_replay_location();"'

Promote the standby to a new master using the following steps, all run on the standby node.

Bring up the VIP on the new master. This step is required only on PROD. QA and DEV do not use a
VIP, so after a manual switchover there the HDP configuration changes described in section 4.1
are needed instead.

ip addr add 199.6.212.138/25 dev bond0 label bond0:0

Move the recovery.conf file aside (it is renamed to recovery.conf.ready):

cd /var/lib/pgsql/9.6/data/
mv recovery.conf{,.ready}

Restart the database:

/etc/init.d/postgresql-9.6 restart

The following query should return FALSE:

su - postgres -c "psql -c \"SELECT CASE pg_is_in_recovery() WHEN 't' THEN 'TRUE' ELSE 'FALSE' END in_recovery;\""

Demote the previous master to a new standby. Switch to the postgres user:

su - postgres

Create a new recovery.conf file:

cat > /var/lib/pgsql/9.6/data/recovery.conf <<EOF
standby_mode = on
recovery_target_timeline = 'latest'
primary_conninfo = 'host=199.6.212.138 port=5432 user=replicator'
EOF

exit

Start the database (it was stopped in the first step):

/etc/init.d/postgresql-9.6 start

Verify the state of the new standby. The following query should return TRUE:

su - postgres -c "psql -c \"SELECT CASE pg_is_in_recovery() WHEN 't' THEN 'TRUE' ELSE 'FALSE' END in_recovery;\""

Check the synchronization.

On the primary node:

su - postgres -c "psql -x -c 'select pg_current_xlog_location(),pg_current_xlog_insert_location()'"

On the standby node:

su - postgres -c 'psql -x -c "select pg_last_xlog_receive_location(), pg_last_xlog_replay_location();"'
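
The recovery.conf written during the demotion step differs between environments only in the host it follows. A parameterised version might look like the sketch below. The function name is hypothetical, and the assumption that QA and DEV would use the new primary's host name in place of the VIP is an inference from section 4.1, not something this runbook states.

```shell
#!/bin/sh
# Hypothetical helper: write recovery.conf into a data directory.
# $1 = data directory, $2 = primary host or VIP the standby should follow.
write_recovery_conf() {
    cat > "$1/recovery.conf" <<EOF
standby_mode = on
recovery_target_timeline = 'latest'
primary_conninfo = 'host=$2 port=5432 user=replicator'
EOF
}

# Example (run as postgres on the node being demoted, not executed here):
#   write_recovery_conf /var/lib/pgsql/9.6/data 199.6.212.138
```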


4.4 Failover – when the primary database fails and it’s not recoverable
If the old primary database is somehow still running, stop it or kill it.
If the old primary node is still up and the VIP is still configured on it, bring the VIP down
(only on PROD):

ip addr del 199.6.212.138/25 dev bond0

Promote the standby to a new master using the following steps, all run on the standby node.

Bring up the VIP on the new master (only on PROD):

ip addr add 199.6.212.138/25 dev bond0 label bond0:0

Move the recovery.conf file aside (it is renamed to recovery.conf.ready):

cd /var/lib/pgsql/9.6/data/
mv recovery.conf{,.ready}

Restart the database:

/etc/init.d/postgresql-9.6 restart

The following query should return FALSE:

su - postgres -c "psql -c \"SELECT CASE pg_is_in_recovery() WHEN 't' THEN 'TRUE' ELSE 'FALSE' END in_recovery;\""

Note: once all problems on the old primary server are resolved, you have to recreate the
standby database manually, as described in the following point.
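
Both the switchover and the failover procedure end with a role check based on pg_is_in_recovery(). When scripting that check, the raw value can be read with psql -Atc, which prints just t or f, and mapped to a role as in this sketch. The function names are hypothetical.

```shell
#!/bin/sh
# Map the output of:  psql -Atc "SELECT pg_is_in_recovery()"   (t or f)
# to a role name.
pg_role() {
    case "$1" in
        t) echo standby ;;
        f) echo primary ;;
        *) echo "unexpected value: $1" >&2; return 1 ;;
    esac
}

# Succeed only if the observed flag matches the expected role.
# $1 = expected role (primary|standby), $2 = flag returned by psql.
assert_role() {
    [ "$(pg_role "$2")" = "$1" ]
}

# Usage on a live node (not executed here):
#   flag=$(su - postgres -c 'psql -Atc "SELECT pg_is_in_recovery()"')
#   assert_role primary "$flag" || echo "promotion did not take effect"
```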

Recreating the standby database after failover or when the standby database falls behind too much:
All steps are to be executed on the new standby server (the previous primary).

Switch to the postgres user:

su - postgres

Make a backup of the current data directory by renaming it with a timestamp:

mv /var/lib/pgsql/9.6/data{,.$(date '+%Y%m%d%H%M')}

Copy the data from the primary database:

pg_basebackup -h 199.6.212.138 -D /var/lib/pgsql/9.6/data/ -U replicator -P -v -x

Create a new recovery.conf file:

cat > /var/lib/pgsql/9.6/data/recovery.conf <<EOF
standby_mode = on
recovery_target_timeline = 'latest'
primary_conninfo = 'host=199.6.212.138 port=5432 user=replicator'
EOF

exit

Start the standby database:

/etc/init.d/postgresql-9.6 start

Verify the state of the new standby. The following query should return TRUE:

su - postgres -c "psql -c \"SELECT CASE pg_is_in_recovery() WHEN 't' THEN 'TRUE' ELSE 'FALSE' END in_recovery;\""
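
Because rebuilding the standby moves the whole data directory, it can be useful to print the exact commands first and review them. The sketch below is a dry run only: rebuild_plan is a hypothetical name, and nothing is moved or copied.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print the commands for rebuilding a standby.
# $1 = primary host or VIP, $2 = data directory, $3 = timestamp suffix.
rebuild_plan() {
    echo "mv $2 $2.$3"
    echo "pg_basebackup -h $1 -D $2 -U replicator -P -v -x"
}

# Print the plan for PROD with the current time as the backup suffix.
rebuild_plan 199.6.212.138 /var/lib/pgsql/9.6/data "$(date '+%Y%m%d%H%M')"
```

The printed lines correspond to the rename and pg_basebackup steps above and are meant to be run as the postgres user after review.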


5 Restoring database from backup

List of all databases related to HDP:

ambari
hive
oozie
ranger
rangerkms

Backup location for PROD:

[root@nlxs5133 archive]# pwd
/home/backup/PRODbackups/nlxs5144_postgres/archive

Backup location for QA:

[root@nlxs5133 archive]# pwd
/home/backup/QAbackups/nlxs5146_postgres/archive

Backup location for DEV:

[root@nlxs5133 archive]# pwd
/home/backup/DEVbackups/nlxs5270_postgres/archive

Copy the requested backup version to the local filesystem of the master database server.

su - postgres

cd /tmp

mkdir DB_restore

cd DB_restore/

-bash-4.1$ pwd
/tmp/DB_restore

scp backup@nlxs5133:/home/backup/DEVbackups/nlxs5270_postgres/archive/YYYY.MM.DD_09.57_postgres.tgz .

YYYY.MM.DD_09.57_postgres.tgz    100% 72MB 72.4MB/s 00:01

tar zxvf YYYY.MM.DD_09.57_postgres.tgz

YYYY.MM.DD_09.57_data_ambaridev.sql
YYYY.MM.DD_09.57_data_hive.sql
YYYY.MM.DD_09.57_data_oozie.sql
YYYY.MM.DD_09.57_data_postgres.sql
YYYY.MM.DD_09.57_data_rangerkms.sql
YYYY.MM.DD_09.57_data_ranger.sql
YYYY.MM.DD_09.57_data_rundeck.sql
YYYY.MM.DD_09.57_data_test_dr.sql
YYYY.MM.DD_09.57_data_tracker.sql
YYYY.MM.DD_09.57_dbacl.sql
YYYY.MM.DD_09.57_roles.sql
pg_hba.conf
postgresql.conf
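
When no particular backup version is requested, the most recent archive is usually the one wanted. Since the archive names begin with a YYYY.MM.DD_HH.MM timestamp, sorting by name is the same as sorting by time. The sketch below relies on that naming and uses a hypothetical function name. It only inspects a directory passed as an argument.

```shell
#!/bin/sh
# Hypothetical helper: print the name of the newest *_postgres.tgz in a directory.
# Names start with YYYY.MM.DD_HH.MM, so lexical order equals time order.
latest_backup() {
    newest=""
    for f in "$1"/*_postgres.tgz; do
        [ -e "$f" ] || continue   # pattern did not match anything
        newest=$f                 # globs expand in sorted order, the last match wins
    done
    [ -n "$newest" ] && basename "$newest"
}

# Usage on the backup server (not executed here):
#   latest_backup /home/backup/DEVbackups/nlxs5270_postgres/archive
```

The function returns a non-zero status and prints nothing when the directory holds no matching archive.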

5.1 Ambari

- The Ambari service uses the ambari database.

- The Ambari server is still running:

# ambari-server status
Using python /usr/bin/python
Ambari-server status
Ambari Server running

- Interrupt existing connections to the ambari database (named ambaridev in the DEV backup
listing above) and forbid new connections. PostgreSQL cannot drop a database while connections
to it are open.

a.) List active connections:

SELECT * FROM pg_stat_activity WHERE datname = 'ambari';

b.) Forbid new connections to the database:

UPDATE pg_database SET datallowconn = 'false' WHERE datname = 'ambari';

c.) Terminate all active connections:

SELECT pg_terminate_backend (pg_stat_activity.pid) FROM pg_stat_activity WHERE pg_stat_activity.datname = 'ambari';

d.) Check active connections again [point a.)]; the result has to be 0 rows.

- Drop the ambari database:

DROP DATABASE IF EXISTS ambari;

The ambari database is dropped and no longer appears in the database list. The same is true on
the secondary (standby) server.

- Check the ambari service

  - It should still be running but not working properly: HDP services cannot be managed from
    Ambari.
    From the ambari-server logs:
    Internal Exception: org.postgresql.util.PSQLException: FATAL: database "ambari" does not exist

- Restore the ambari database from backup

  - Create an empty database called ambari (the roles still exist; they were not deleted):

CREATE DATABASE ambari;

GRANT ALL ON DATABASE ambari TO ambari;

  - Import the dump from the backup into the newly created empty database:

psql ambari < YYYY.MM.DD_09.57_data_ambari.sql

- Restart ambari-server

- Check the ambari-server logs and functionality
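
The SQL statements used above differ only in the database name, so they can be generated from a template and reviewed before execution. This is a sketch with a hypothetical function name. It assumes, as this runbook does, that the owning role has the same name as the database.

```shell
#!/bin/sh
# Hypothetical helper: print the SQL run before importing a dump.
# Assumes, as in this runbook, that the owning role has the database's name.
restore_sql() {
    db="$1"
    cat <<EOF
UPDATE pg_database SET datallowconn = 'false' WHERE datname = '$db';
SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname = '$db';
DROP DATABASE IF EXISTS $db;
CREATE DATABASE $db;
GRANT ALL ON DATABASE $db TO $db;
EOF
}

# Review the output first, then for example (not executed here):
#   restore_sql ambari | psql postgres
```

The statements have to be run from a session connected to a different database (for example postgres), since a database cannot be dropped from within itself.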


5.2 Hive

- The Hive service uses the hive database for its metastore.

- The Hive component is still running; do not stop it from the Ambari WebUI.

- Interrupt existing connections to the hive database and forbid new connections. PostgreSQL
cannot drop a database while connections to it are open.

a.) List active connections:

SELECT * FROM pg_stat_activity WHERE datname = 'hive';

b.) Forbid new connections to the database:

UPDATE pg_database SET datallowconn = 'false' WHERE datname = 'hive';

c.) Terminate all active connections:

SELECT pg_terminate_backend (pg_stat_activity.pid) FROM pg_stat_activity WHERE pg_stat_activity.datname = 'hive';

d.) Check active connections again [point a.)]; the result has to be 0 rows.

- Drop the hive database:

DROP DATABASE IF EXISTS hive;

The hive database is dropped and no longer appears in the database list. The same is true on the
secondary (standby) server.


- Check the hive service

  - It should still be running, but errors start to appear in the logs.
  - The Hive metastore is not accessible. After the timeout period an alert is triggered by
    Ambari.

From the hivemetastore logs:

Failed to acquire connection to jdbc:postgresql://xxx. Sleeping for 7000 ms. Attempts left: 5
org.postgresql.util.PSQLException: FATAL: database "hive" does not exist

After the timeout period the Hive metastore triggered an alert in Ambari:

raise ExecuteTimeoutException(err_msg)
ExecuteTimeoutException: Execution of 'ambari-sudo.sh su ambari-qa -l -s /bin/bash -c
'export PATH='"'"'/usr/sbin:/sbin:/usr/lib/ambari-server/*:/usr/sbin:/sbin:/usr/lib/
ambari-server/*:/sbin:/usr/sbin:/bin:/usr/bin:/var/lib/ambari-agent:/var/lib/ambari-
agent:/bin/:/usr/bin/:/usr/sbin/:/usr/hdp/current/hive-metastore/bin'"'"' ; export
HIVE_CONF_DIR='"'"'/usr/hdp/current/hive-metastore/conf'"'"' ; hive --hiveconf
hive.metastore.uris=thrift://slb-0.local:9083 --hiveconf
hive.metastore.client.connect.retry.delay=1 --hiveconf
hive.metastore.failure.retries=1 --hiveconf
hive.metastore.connect.retries=1 --hiveconf
hive.metastore.client.socket.timeout=14 --hiveconf
hive.execution.engine=mr -e '"'"'show databases;'"'"''' was killed due timeout after
60 seconds
)

- Restore the hive database from backup

a.) Create an empty database called hive (the roles still exist; they were not deleted):

CREATE DATABASE hive;

GRANT ALL ON DATABASE hive TO hive;

b.) Import the dump from the backup into the newly created empty database:

psql hive < YYYY.MM.DD_09.57_data_hive.sql

- Restart the hive component

- Check the hive component logs and functionality, and ask the developer team to run test Hive jobs
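
The log excerpt in this section shows the error that a component reports while its database is missing. When checking logs before and after a restore, a simple filter for that message can help. The sketch below reads log text from stdin. The function name is hypothetical, and the log file location is deliberately left as a placeholder.

```shell
#!/bin/sh
# Hypothetical helper: succeed if log text on stdin reports database $1 as missing.
# Matches the PostgreSQL message quoted in the log excerpt of this section.
db_missing_in_log() {
    grep -qF "FATAL: database \"$1\" does not exist"
}

# Usage (not executed here; <hivemetastore log> is a placeholder path):
#   tail -n 200 <hivemetastore log> | db_missing_in_log hive && echo "hive DB still missing"
```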

5.3 Ranger

- The Ranger service uses the ranger database.

- The Ranger component is still running; do not stop it from the Ambari WebUI.

- Interrupt existing connections to the ranger database and forbid new connections. PostgreSQL
cannot drop a database while connections to it are open.

a.) List active connections:

SELECT * FROM pg_stat_activity WHERE datname = 'ranger';

b.) Forbid new connections to the database:

UPDATE pg_database SET datallowconn = 'false' WHERE datname = 'ranger';

c.) Terminate all active connections:

SELECT pg_terminate_backend (pg_stat_activity.pid) FROM pg_stat_activity WHERE pg_stat_activity.datname = 'ranger';

d.) Check active connections again [point a.)]; the result has to be 0 rows.

- Drop the ranger database:

DROP DATABASE IF EXISTS ranger;

The ranger database is dropped and no longer appears in the database list. The same is true on
the secondary (standby) server.

- Check the ranger service

  - It should still be running, but errors start to appear in the logs.

- Restore the ranger database from backup

  - Create an empty database called ranger (the roles still exist; they were not deleted):

CREATE DATABASE ranger;

GRANT ALL ON DATABASE ranger TO ranger;

  - Import the dump from the backup into the newly created empty database:

psql ranger < YYYY.MM.DD_09.57_data_ranger.sql

- Restart the ranger service

- Check the ranger component logs and functionality


5.4 Rangerkms

- The Rangerkms service uses the rangerkms database.

- The Rangerkms component is still running; do not stop it from the Ambari WebUI.

- Interrupt existing connections to the rangerkms database and forbid new connections.
PostgreSQL cannot drop a database while connections to it are open.

a.) List active connections:

SELECT * FROM pg_stat_activity WHERE datname = 'rangerkms';

b.) Forbid new connections to the database:

UPDATE pg_database SET datallowconn = 'false' WHERE datname = 'rangerkms';

c.) Terminate all active connections:

SELECT pg_terminate_backend (pg_stat_activity.pid) FROM pg_stat_activity WHERE pg_stat_activity.datname = 'rangerkms';

d.) Check active connections again [point a.)]; the result has to be 0 rows.

- Drop the rangerkms database:

DROP DATABASE IF EXISTS rangerkms;

The rangerkms database is dropped and no longer appears in the database list. The same is true
on the secondary (standby) server.

- Check the rangerkms service

  - It should still be running, but errors start to appear in the logs.

- Restore the rangerkms database from backup

  - Create an empty database called rangerkms (the roles still exist; they were not deleted):

CREATE DATABASE rangerkms;

GRANT ALL ON DATABASE rangerkms TO rangerkms;

  - Import the dump from the backup into the newly created empty database:

psql rangerkms < YYYY.MM.DD_09.57_data_rangerkms.sql

- Restart the rangerkms service

- Check the rangerkms component logs and functionality


5.5 Oozie

- The Oozie service uses the oozie database.

- The Oozie component is still running; do not stop it from the Ambari WebUI.

- Interrupt existing connections to the oozie database and forbid new connections. PostgreSQL
cannot drop a database while connections to it are open.

a.) List active connections:

SELECT * FROM pg_stat_activity WHERE datname = 'oozie';

b.) Forbid new connections to the database:

UPDATE pg_database SET datallowconn = 'false' WHERE datname = 'oozie';

c.) Terminate all active connections:

SELECT pg_terminate_backend (pg_stat_activity.pid) FROM pg_stat_activity WHERE pg_stat_activity.datname = 'oozie';


d.) Check active connections again [point a.)]; the result has to be 0 rows.

- Drop the oozie database:

DROP DATABASE IF EXISTS oozie;

The oozie database is dropped and no longer appears in the database list. The same is true on
the secondary (standby) server.

- Check the oozie service

  - It should still be running, but errors start to appear in the logs.

- Restore the oozie database from backup

  - Create an empty database called oozie (the roles still exist; they were not deleted):

CREATE DATABASE oozie;

GRANT ALL ON DATABASE oozie TO oozie;

  - Import the dump from the backup into the newly created empty database:

psql oozie < YYYY.MM.DD_09.57_data_oozie.sql

- Restart the oozie service

- Check the oozie component logs and functionality
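
The import commands in sections 5.1 to 5.5 all follow the same file naming, so the matching dump file for each database can be derived from the archive's timestamp prefix. The sketch below uses a hypothetical function name and only prints the import commands. It assumes the dump is named after the database, which should be checked against the extracted archive (the DEV listing, for instance, shows ambaridev rather than ambari).

```shell
#!/bin/sh
# Hypothetical helper: dump file name for a database, following the naming
# seen in the extracted archive: <prefix>_data_<database>.sql
# $1 = timestamp prefix of the archive, $2 = database name.
dump_file() {
    echo "${1}_data_${2}.sql"
}

# Print (not run) the import command for each HDP database.
for db in ambari hive oozie ranger rangerkms; do
    echo "psql $db < $(dump_file YYYY.MM.DD_09.57 "$db")"
done
```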
