
USE DB2 V8.2 HIGH AVAILABILITY DISASTER RECOVERY (HADR) IN AN SAP IMPLEMENTATION

PATRICK ZENG & LIWEN YEOW
IBM-SAP INTEGRATION & SUPPORT CENTRE
IBM SOFTWARE SOLUTIONS TORONTO LAB

IBM Software Group Toronto Lab

04 August 2005

Table of Contents

1. EXECUTIVE SUMMARY
2. INTRODUCTION TO HIGH AVAILABILITY AND DISASTER RECOVERY
   2.1 HIGH AVAILABILITY
   2.2 DISASTER RECOVERY
3. INTRODUCTION TO HADR AND CLIENT REROUTE
   3.1 HIGH AVAILABILITY DISASTER RECOVERY (HADR)
   3.2 HADR RESTRICTIONS
   3.3 AUTOMATIC CLIENT REROUTE
   3.4 AUTOMATIC CLIENT REROUTE LIMITATIONS
   3.5 AUTOMATIC CLIENT REROUTE AND HADR
4. SETTING-UP HADR AND CLIENT REROUTE IN SAP ENVIRONMENT
5. RUNNING SAP WITH HADR AND CLIENT REROUTE
   5.1 NORMAL OPERATION
   5.2 SWITCHING THE ROLES OF THE PRIMARY AND STANDBY DATABASE
   5.3 FAILOVER WHEN THE PRIMARY DATABASE IS PHYSICALLY DOWN
   5.4 REINTEGRATING A DATABASE AFTER A FAILOVER
   5.5 RESTRICTIONS AND RECOMMENDATIONS
6. PERFORMANCE IMPACT OF HADR
   6.1 FAILOVER TIME
   6.2 IMPACT OF HADR SYNCHRONIZATION MODE
   6.3 IMPACT OF HADR LOG RECEIVING BUFFER SIZE
7. TAKING BACKUPS FROM HADR STANDBY IMAGE
   7.1 DATABASE BACKUP ON THE STANDBY SERVER
   7.2 RESTORING TO PRIMARY SERVER
   7.3 RESTORING TO STANDBY SERVER
8. APPENDIX: REFERENCE


Copyright 2005 IBM Corporation. All Rights Reserved.


Neither this documentation nor any part of it may be copied or reproduced in any form or by any means or translated into another language, without the prior consent of the IBM Corporation.

IBM makes no warranties or representations with respect to the content hereof and specifically disclaims any implied warranties of merchantability or fitness for any particular purpose. IBM assumes no responsibility for any errors that may appear in this document. The information contained in this document is subject to change without any notice. IBM reserves the right to make any such changes without obligation to notify any person of such revision or changes. IBM makes no commitment to keep the information contained herein up to date.

SAP and SAP Business Information Warehouse are registered trademarks of SAP AG.
Linux is a trademark of Linus Torvalds.
IBM, RS/6000, AIX, OS/390, OS/400 and DB2 Universal Database are registered trademarks of IBM Corporation.


1. Executive summary
In today's world, where businesses serve customers around the globe on a 24x7 schedule, customers expect their computing systems to be 100% reliable. DB2 UDB has always been at the forefront of databases in providing such industrial-strength reliability. DB2 UDB V8.2 introduced two new features that give customers further options: High Availability Disaster Recovery (HADR) and automatic client reroute. These features protect customers from production downtime in the event of a local hardware failure or a catastrophic site failure by duplicating the workload of the database to a separate site. These features are shipped as part of the standard packaging of DB2 UDB ESE.

In this paper, we show how these DB2 UDB features can be used with the world-leading ERP software, SAP R/3 4.7 Enterprise. We walk the reader through the steps necessary to set up DB2 UDB HADR and client reroute with SAP R/3 4.7 Enterprise. Detailed procedures and sample output are provided so that readers can follow each step and compare the results with their own.


2. Introduction to High Availability and Disaster Recovery


In today's global economy, customers expect access to information and the ability to execute transactions on a 24x7 basis. Whether in the banking, manufacturing, retail, or entertainment sector, the computer systems that support such industries need to be robust enough to withstand failures of hardware, software, networking, or the supporting infrastructure. It is well understood that achieving this level of robustness requires more elaborate systems architecture and more redundancy, which leads to a higher cost of implementation.

Most recent hardware is designed to avoid single points of failure. Multiple power supplies, redundant system clocks, redundant service processors, hot-swappable disks and hot-plug slots, multiple circuitry paths, and RAID arrays are some examples of these features. In software, and specifically in DB2, features such as data page checksums, replication, online/offline/incremental backups, split mirror support, standby database support, and clustering provide similar robustness.

The implementation of this resiliency can be classified into two categories: High Availability (HA) and Disaster Recovery (DR).

2.1 High Availability


High Availability addresses the requirement that the system and database withstand a hardware or software failure and become available to the applications again instantly, or within a very short period. This requirement generally applies within a single site. DB2 High Availability can be implemented with IBM HACMP, Sun Cluster, Veritas Cluster Server, Microsoft MSCS, HP Serviceguard, and SteelEye LifeKeeper. All these solutions fail the database server over to a different host that uses the same set of disks containing the DB2 database. The options for such a setup include an idle failover host, active standby, mutual takeover, and cascade.

Figure 1. HA Clustering options


2.2 Disaster Recovery


Disaster Recovery (DR), unlike HA, addresses the situation in which the entire primary site where the database server resides is no longer functional. Events such as flooding, earthquake, power grid failure, or fire could result in the loss of a primary data centre. In such situations, processing can continue if a backup data centre is located at a separate site some distance away. It is quite common for the two sites to be on opposite sides of a continent or in different countries. One solution that provides this capability is Libelle DBShadow for DB2 UDB. To maintain a backup DB2 server on the DR site, a common setup is to keep the database in standby mode and update it using log shipping from the primary database server.

Figure 2. Typical DR setup


3. Introduction to HADR and Client Reroute


3.1 High Availability Disaster Recovery (HADR)
DB2 Universal Database (DB2 UDB) High Availability Disaster Recovery (HADR) is a database replication feature that provides a high availability solution for both partial and complete site failures. HADR protects against data loss by replicating data changes from a source database, called the primary, to a target database, called the standby. A database that does not use HADR is referred to as a standard database. Applications can only access the current primary database. Updates to the standby database occur by rolling forward log data that is generated on the primary database and shipped to the standby database.

Figure 3. DB2 HADR setup

A partial site failure can be caused by a hardware, network, or software (DB2 or operating system) failure. Without HADR, a partial site failure requires the database management system (DBMS) server or the machine where the database resides to be rebooted. The length of time it takes to restart the database and the machine where it resides is unpredictable. It can take several minutes before the database is brought back to a consistent state and made available. With HADR, the standby database can take over in seconds. Further, you can redirect the clients that were using the original primary database to the standby database (new primary database) by using automatic client reroute, or retry logic in the application.

A complete site failure can occur when a disaster, such as a fire, causes the entire site to be destroyed. Because HADR uses TCP/IP for communication between the primary and standby databases, they can be situated in different locations. For example, your primary database might be located at your head office in one city, while your standby database is located at your sales office in another city. If a disaster occurs at the primary site, data availability is maintained by having the remote standby database take over as the primary database with full DB2 functionality.

After a takeover operation occurs, you can bring the original primary database back up and return it to its primary database status; this is known as failback. After the failed original primary server is repaired, it can rejoin the HADR pair as a standby database, provided the two copies of the database can be made consistent. After the original primary database is reintegrated into the HADR pair as the standby database, you can switch the roles of the databases so that the original primary database is once again the primary database.

With HADR, you can choose the level of protection you want from potential loss of data by specifying one of three synchronization modes: synchronous, near synchronous, or asynchronous.

3.2 HADR restrictions


The following list is a summary of some of the high availability disaster recovery (HADR) restrictions:

• HADR is supported on DB2 UDB Enterprise Server Edition (ESE) as a no-charge option and on DB2 Express and Workgroup Editions as a separately charged option. However, it is not supported when you have multiple database partitions on ESE.
• The primary and standby databases must have the same operating system version and the same version of DB2 UDB, except for a short time during a rolling upgrade.
• The DB2 UDB release on the primary and standby databases must be the same bit size (32 or 64 bit).
• Reads on the standby database are not supported. Clients cannot connect to the standby database.
• Log archiving can only be performed by the current primary database.
• Normal backup operations are not supported on the standby database.
• Non-logged operations, such as changes to database configuration parameters and to the recovery history file, are not replicated to the standby database.
• Load operations with the COPY NO option specified are not supported.
• Use of Data Links is not supported.

3.3 Automatic client reroute


The automatic client reroute feature allows client applications to recover from a loss of communication with the server so that they can continue to work with minimal interruption. After a loss of communication, the client application attempts to reconnect to the server. If this fails, the client is then rerouted to a different server. You can specify an alternate location through the command line processor (CLP), by invoking an application programming interface (API), or when adding a database using the Control Center or the advanced view of the Configuration Assistant.
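The reconnect-then-reroute behaviour described above can be illustrated with a small Python sketch. This is only a model of the idea, not how the DB2 client implements it: the function `connect_with_reroute`, the `retries` parameter, and the stand-in `fake_connect` are all hypothetical, and the host names and port are borrowed from the test setup described later in this paper.

```python
from typing import Callable, Tuple

Server = Tuple[str, int]  # (host name, port)


def connect_with_reroute(connect: Callable[[str, int], object],
                         original: Server,
                         alternate: Server,
                         retries: int = 3) -> Tuple[Server, object]:
    """Try the original server first; if it stays unreachable, try the alternate."""
    last_error = None
    for server in (original, alternate):
        for _ in range(retries):
            try:
                return server, connect(*server)
            except ConnectionError as exc:
                last_error = exc  # remember the failure and keep trying
    raise ConnectionError(f"no server reachable: {last_error}")


def fake_connect(host: str, port: int) -> str:
    """Hypothetical stand-in for a real connect call: the original host is down."""
    if host == "phillipe":
        raise ConnectionError("communication failure")
    return f"connected to {host}:{port}"


server, connection = connect_with_reroute(
    fake_connect, ("phillipe", 54700), ("dartagnan", 54700))
print(server, connection)
```

With a real driver, `fake_connect` would be replaced by the driver's own connect call. The point of the sketch is only the order of attempts: the original location first, then the alternate.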


Figure 4. Client Reroute setup

3.4 Automatic client reroute limitations


There are some limitations with use of the automatic client reroute feature:

• Automatic client reroute is only supported when the communications protocol used for connecting to the DB2 Universal Database (DB2 UDB) server, or to the DB2 Connect(TM) server, is TCP/IP. This means that if the connection is using a protocol other than TCP/IP, the automatic client reroute feature will not be enabled. Even if DB2 UDB is set up for a loopback, the TCP/IP communications protocol must be used in order to accommodate the automatic client reroute feature.
• If the connection is re-established to the alternate server location, any new connection to the same database alias will be connected to the alternate server location. If you want new connections to be established to the original location once the problem on the original location is fixed, there are a couple of options from which to choose:
  o You need to take the alternate server offline and allow the connections to fail back over to the original server. (This assumes that the original server has been catalogued using the UPDATE ALTERNATE SERVER command such that it is set to be the alternate location for the alternate server.)
  o You could catalogue a new database alias to be used by the new connections.
  o You could uncatalogue the database entry and re-catalogue it again.
• The DB2 UDB server installed on the alternate host server must be the same version (but could have a higher FixPak) as the DB2 UDB installed on the original host server.
• Regardless of whether you have authority to update the database directory at the client machine, the alternate server information is always kept in memory. In other words, if you did not have authority to update the database directory (or because it is a read-only database directory), other applications will not be able to determine and use the alternate server because the memory is not shared among applications.
• The same authentication is applied to all alternate locations. This means that the client will be unable to re-establish the database connection if the alternate location has a different authentication type than the original location.
• When there is a communication failure, all session resources, such as global temporary tables, identity, sequences, cursors, server options (SET SERVER OPTION) for federated processing, and special registers, are lost. The application is responsible for re-establishing the session resources in order to continue processing the work. You do not have to run any of the special register statements after the connection is re-established because DB2 UDB will replay the special register statements that were issued before the communication error. However, some of the special registers will not be replayed; they are:
  o SET ENCRYPTPW
  o SET EVENT MONITOR STATE
  o SET SESSION AUTHORIZATION
  o SET TRANSFORM GROUP
  Note: If the client is using CLI, JCC Type 2 or Type 4 drivers, then after the connection is re-established, the SQL statements that were prepared against the original server are implicitly re-prepared with the new server. However, embedded SQL routines (for example, SQC or SQX applications) will not be re-prepared.
• An alternate way to do automatic client reroute is to use the DNS entry to specify an alternate IP address for a DNS entry. The idea is to specify a second IP address (an alternate server location) in the DNS entry. The client would not know about an alternate server, but at connect time, DB2 UDB would alternate between the IP addresses for the DNS entry.

3.5 Automatic client reroute and HADR


The automatic client reroute feature can be used with high availability disaster recovery to allow client applications to recover from a loss of communication with the server and to continue working with minimal interruption. Rerouting is only possible when an alternate database location has been specified at the server. Automatic client reroute is only supported with TCP/IP protocol. You can use automatic client reroute with HADR to make client applications connect to the new primary database after a takeover operation. If automatic client reroute is not enabled, client applications will receive error message SQL30081, and no further attempts will be made to establish a connection with the server.


4. Setting-up HADR and Client Reroute in SAP Environment


Figure 5. SAP HADR setup

4.1 Install SAP Central Instance


Follow the SAP R/3 Installation Guide to install the central instance. If you already have a central instance installed, you can skip this step.

In our test, we installed SAP R/3 Enterprise 4.Ext 200 Central Instance on Host C (lunen). Since the database instance resides on a remote host, you need to provide the Database Host name (phillipe) and the Communication Port number (54700) during the installation.

4.2 Install SAP Database Instance on the primary database host


Follow the SAP R/3 Installation Guide to install the database instance. If you already have a database instance installed, you can skip this step.

In our test, we installed SAP R/3 Enterprise 4.Ext 200 Database Instance on Host A (phillipe). Since the central instance resides on a remote host, you need to provide the Central Instance Host name (lunen) and the instance number (00) during the installation.

In order to facilitate the remote monitoring of the database instance from the central instance, you also need to install additional components after the database instance installation is completed. During the installation of the database instance, an RFC destination entry is automatically generated. In our test, we can see this entry in transaction SM59, RFC Destinations > TCP/IP connections > SAPOSCOL_PHILLIPE, and the remote program used for this RFC connection is rfcoscol. However, rfcoscol has been deprecated since R/3 4. (See OSS Note 371023.) Therefore, no rfcoscol executable program is installed in the database instance, which causes a CPIC error when the SAP central instance routinely collects operating system data from the remote database host.

There are two ways to resolve this problem:

a. Download an rfcoscol program from the R/3 4.6 release and put it on the database host. (See OSS Note 20624, "saposcol, st06 for dedicated Database Server".)
b. Install a CCMS agent on the database instance host. (See OSS Note 209834, "CCMS agent technology (composite SAP note)", and the document "CCMS Agents: Features, Installation, and Operation" from SAP Service Marketplace.) Below is an example of setting up a CCMS agent on phillipe:


Listing 1. Install CCMS agent on Database Instance


phillipe:svtadm 62> ./sapccmsr -R
INFO: CCMS agent sapccmsr working directory is /usr/sap/tmp/sapccmsr
INFO: CCMS agent sapccmsr config file is /usr/sap/tmp/sapccmsr/csmconf

**** Welcome to SAP CCMS agent program sapccmsr ****

Please enter the name (system ID) of the CENTRAL (!) monitoring mySAP system.
R/3 system ID                      : SVT
additional CENTRAL system y/[n] ?  : n
INFO: creating ini file /usr/sap/tmp/sapccmsr/sapccmsr.ini.
INFO: Checking Distributed Statistical Records Library dsrlib.so
INFO: Distributed Statistical Records not configured, dsrlib.so not found.
INFO: CCMS version 20040229, 32 bit, multithreaded, Non-Unicode
      compiled at   Aug 29 2004
      systemid      38(Intel x86 with Linux)
      relno         6200
      patch text    patch collection 2004/4, OSS note 694057
      patchno       1622
      intno         20020600
      running on    phillipe Linux 2.4.19-64GB-SMP #1 SMP Mon Oct 21 18:48:05 UTC 2002 i686
      pid           1267
INFO: Created Shared Memory Key 1008 (size 20000000)
INFO: Connected to Monitoring Segment [CCMS Monitoring Segment for phillipe, created with version CCMS version 20040229, 32 bit multithreaded, compiled at Aug 29 2004, kernel 6200_20020600_1622, platform 38(Intel x86 with Linux)]
      segment status      WARM_UP
      segment started at  Fri Sep 24 10:39:15 2004
      segment version     20040229

****************************************************
********************** SVT **********************
****************************************************
Please enter the logon info for an admin user of the central monitoring mySAP system [SVT].
The user should have system administrator privileges
client [000]                       :
user                               : ddic
language [EN]                      :
hostname of SVT message server     : lunen
use Load Balancing n/[y] ?         : n
group [PUBLIC]                     :
[optional] route string            :
trace level [0]                    :

please enter password for [SVT:000:ddic]:
Try to connect ...
INFO: [SVT:000:DDIC] connected to SVT, host lunen, System Nr. 00, traceflag [ ]
INFO: SVT release is 620 , (kernel release 620 )

This program will act as registered RFC server lateron.
Please enter the info for a gateway of monitoring system SVT
gateway info: host: [lunen] service: [sapgw00] n/[y] ? : y

Gateway info ok

**** CCMS agent sapccmsr: RFC client functionality ****
This CCMS agent program sapccmsr is able to actively report alert data into the monitoring mySAP.com system [SVT].
To enable this feature, you have to setup the user CSMREG in [SVT].
(refer to SAP Online-help, search for 'CSMREG').
Alternatively use any user in [SVT] that has at least authorization
to call per RFC function groups SALC, SALF, SALH, SALS, SAL_CACHE_RECEIVE, SCSMBK_DATA_OUT, SCSMBK_RECONCILE, SCSM_CEN_TOOL_MAIN, SYST, RFC1
After entering the RFC logon info for the user, the password will be stored here on this machine in a Secure Storage.
client [000]                             :
user [CSMREG]                            : ddic
language [EN]                            :
hostname of SVT message server [lunen]   :
use Load Balancing n/[y] ?               : n
hostname of application server [lunen]   :
system number (00 - 98) [00]             :
[optional] route string                  :
trace level [0]                          :
please enter password for [SVT:000:ddic]:
Try to connect ...
INFO: [SVT:000:DDIC] connected to SVT, host lunen, System Nr. 00, traceflag [ ]
INFO: SVT release is 620 , (kernel release 620 ), CCMS version 20011212
INFO: RFC logon info for [SVT:000:ddic] can be updated at any time with -R option: sapccmsr -R <params>
INFO: Updated saprfc.ini in agent work directory /usr/sap/tmp/sapccmsr
INFO: Connected to SVT, CCMS version in ABAP: 20011212
INFO: successfully registered at SVT
INFO: Updated config file /usr/sap/tmp/sapccmsr/csmconf.
Start agent? n/[y] : y

INFO: Checking shared memory status of sapccmsr
INFO: CCMS agent sapccmsr working directory is /usr/sap/tmp/sapccmsr
INFO: CCMS agent sapccmsr config file is /usr/sap/tmp/sapccmsr/csmconf
INFO: Central Monitoring System is [SVT]. (found in config file)
INFO: Checking shared memory status of sapccmsr
INFO: CCMS version 20040229, 32 bit, multithreaded, Non-Unicode
      compiled at   Aug 29 2004
      systemid      38(Intel x86 with Linux)
      relno         6200
      patch text    patch collection 2004/4, OSS note 694057
      patchno       1622
      intno         20020600
      running on    phillipe Linux 2.4.19-64GB-SMP #1 SMP Mon Oct 21 18:48:05 UTC 2002 i686
      pid           1267

INFO: Reconnected to Monitoring Segment [SAP_CCMS_phillipe]

4.3 Install SAP Database Instance on the standby database host


Follow the same steps in 4.2 to install SAP Database Instance on the standby database host (dartagnan).

4.4 Set up HADR on the primary and the standby database host
Steps:

1) Enable log archiving, configure other parameters on the primary database, and make an offline backup. For example:

Listing 2. Update, configure, and back up DB on primary
db2 UPDATE DB CFG FOR svt USING indexrec ACCESS
db2 UPDATE DB CFG FOR svt USING logindexbuild ON
db2 UPDATE DB CFG FOR svt USING logarchmeth1 "DISK:/db2/SVT/log_archive"
db2 DEACTIVATE DB svt
db2 BACKUP DATABASE svt TO "/db2/backup"

2) Move the backup image to the standby database host, and restore the database to the rollforward pending state. For example:

Listing 3. Restore database on standby
db2 RESTORE DB svt FROM /db2/backup REPLACE HISTORY FILE
SQL2539W Warning! Restoring to an existing database that is the same as the backup image database. The database files will be deleted.
Do you want to continue ? (y/n) y

3) Create the HADR local and remote service names in the /etc/services file on both the primary and standby database servers. For example:

Listing 4. Update to /etc/services file
SVT_HADR_1 54711/tcp
SVT_HADR_2 54712/tcp
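Because both servers need both entries, a quick check before continuing can save a failed HADR start later. The following Python sketch parses services-style text and reports which of the two HADR service names are missing. The function names `parse_services` and `missing_hadr_services` are hypothetical helpers for illustration, and the sketch assumes the simple `name port/protocol` layout shown in Listing 4.

```python
def parse_services(text):
    """Map service name -> (port, protocol) from /etc/services-style text."""
    entries = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        if len(fields) < 2 or "/" not in fields[1]:
            continue  # not a "name port/protocol" entry
        port, protocol = fields[1].split("/", 1)
        if port.isdigit():
            entries[fields[0]] = (int(port), protocol)
    return entries


def missing_hadr_services(text, required=("SVT_HADR_1", "SVT_HADR_2")):
    """Return the required service names that are not defined in the text."""
    entries = parse_services(text)
    return [name for name in required if name not in entries]


sample = "SVT_HADR_1 54711/tcp\nSVT_HADR_2 54712/tcp\n"
print(parse_services(sample))
print(missing_hadr_services(sample))
```

To check a real host, the contents of /etc/services would be read from the file on each server and passed to `missing_hadr_services`.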

4) Set up HADR and the client reroute information for the primary database. For example:

Listing 5. Update primary server configuration for HADR and Client Reroute
-- Configure databases for client reroute - Phillipe - DB2SVT - SVT
UPDATE ALTERNATE SERVER FOR DATABASE SVT USING HOSTNAME dartagnan PORT 54700;
-- Update HADR configuration parameters on primary database - Phillipe - DB2SVT - SVT
UPDATE DB CFG FOR SVT USING HADR_LOCAL_HOST phillipe;
UPDATE DB CFG FOR SVT USING HADR_LOCAL_SVC SVT_HADR_1;
UPDATE DB CFG FOR SVT USING HADR_REMOTE_HOST dartagnan;
UPDATE DB CFG FOR SVT USING HADR_REMOTE_SVC SVT_HADR_2;
UPDATE DB CFG FOR SVT USING HADR_REMOTE_INST DB2SVT;
UPDATE DB CFG FOR SVT USING HADR_SYNCMODE SYNC;
UPDATE DB CFG FOR SVT USING HADR_TIMEOUT 120;

5) Set up HADR and the client reroute information for the standby database. For example:

Listing 6. Update standby server configuration for HADR and Client Reroute
-- Configure databases for client reroute - Dartagnan - DB2SVT - SVT
UPDATE ALTERNATE SERVER FOR DATABASE SVT USING HOSTNAME phillipe PORT 54700;
-- Update HADR configuration parameters on standby database - Dartagnan - DB2SVT - SVT
UPDATE DB CFG FOR SVT USING HADR_LOCAL_HOST dartagnan;
UPDATE DB CFG FOR SVT USING HADR_LOCAL_SVC SVT_HADR_2;
UPDATE DB CFG FOR SVT USING HADR_REMOTE_HOST phillipe;
UPDATE DB CFG FOR SVT USING HADR_REMOTE_SVC SVT_HADR_1;
UPDATE DB CFG FOR SVT USING HADR_REMOTE_INST DB2SVT;
UPDATE DB CFG FOR SVT USING HADR_SYNCMODE SYNC;
UPDATE DB CFG FOR SVT USING HADR_TIMEOUT 120;
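The primary and standby configurations mirror each other: each side's local host and service are the other side's remote host and service, and each side names the other as its alternate server. One way to avoid typing mistakes is to generate both statement sets from a single description of the pair. The Python sketch below does this for the values used in our test. The helper names `hadr_config_commands` and `hadr_pair_commands` are hypothetical, and the sketch assumes both sides share the same instance name, client port, synchronization mode, and timeout, as in the two listings above.

```python
def hadr_config_commands(db, local_host, local_svc, remote_host, remote_svc,
                         remote_inst, client_port, syncmode="SYNC", timeout=120):
    """Return the CLP statements for one side of an HADR pair."""
    settings = [
        ("HADR_LOCAL_HOST", local_host),
        ("HADR_LOCAL_SVC", local_svc),
        ("HADR_REMOTE_HOST", remote_host),
        ("HADR_REMOTE_SVC", remote_svc),
        ("HADR_REMOTE_INST", remote_inst),
        ("HADR_SYNCMODE", syncmode),
        ("HADR_TIMEOUT", timeout),
    ]
    # The alternate server for client reroute is the other host of the pair.
    commands = [
        f"UPDATE ALTERNATE SERVER FOR DATABASE {db} "
        f"USING HOSTNAME {remote_host} PORT {client_port};"
    ]
    commands += [f"UPDATE DB CFG FOR {db} USING {name} {value};"
                 for name, value in settings]
    return commands


def hadr_pair_commands(db, host_a, svc_a, host_b, svc_b, instance, client_port):
    """Build both sides so that each side's 'remote' is the other's 'local'."""
    primary = hadr_config_commands(db, host_a, svc_a, host_b, svc_b,
                                   instance, client_port)
    standby = hadr_config_commands(db, host_b, svc_b, host_a, svc_a,
                                   instance, client_port)
    return primary, standby


primary, standby = hadr_pair_commands(
    "SVT", "phillipe", "SVT_HADR_1", "dartagnan", "SVT_HADR_2", "DB2SVT", 54700)
print("\n".join(primary))
print("\n".join(standby))
```

The generated statements can be saved to a file per host and reviewed before they are run with the DB2 command line processor.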

6) Start up HADR on the standby database first. For example:

Listing 7. Start HADR on standby server first
-- Start HADR on standby database - Dartagnan - DB2SVT - SVT
DEACTIVATE DATABASE SVT;
START HADR ON DATABASE SVT AS STANDBY

7) Start up HADR on the primary database. For example:

Listing 8. Start HADR on primary server next
-- Start HADR on primary database - Phillipe - DB2SVT - SVT
DEACTIVATE DATABASE SVT;
START HADR ON DATABASE SVT AS PRIMARY

After these steps, check the db2diag.log on both the primary and the standby database instances to see whether the HADR is set up correctly. You should be able to see the following entries in the db2diag.log on the primary database:


Listing 9. db2diag.log info on primary server after HADR started


2004-10-01-16.24.17.923070-240 E4388G305 LEVEL: Event
PID : 19994 TID : 1024 PROC : db2hadrp (SVT)
INSTANCE: db2svt NODE : 000
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetHdrState, probe:10000
CHANGE : HADR state set to None (was None)

2004-10-01-16.24.17.951980-240 E4694G30 LEVEL: Event
PID : 19994 TID : 1024 PROC : db2hadrp (SVT)
INSTANCE: db2svt NODE : 000
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetHdrState, probe:10000
CHANGE : HADR state set to P-Boot (was None)

2004-10-01-16.24.18.039207-240 E5002G325 LEVEL: Event
PID : 19994 TID : 1024 PROC : db2hadrp (SVT)
INSTANCE: db2svt NODE : 000
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetHdrState, probe:10000
CHANGE : HADR state set to P-RemoteCatchupPending (was P-Boot)

2004-10-01-16.24.23.739142-240 E5328G334 LEVEL: Event
PID : 19994 TID : 1024 PROC : db2hadrp (SVT)
INSTANCE: db2svt NODE : 000
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetHdrState, probe:10000
CHANGE : HADR state set to P-RemoteCatchup (was P-RemoteCatchupPending)

2004-10-01-16.24.23.741004-240 I5663G308 LEVEL: Warning
PID : 19994 TID : 1024 PROC : db2hadrp (SVT)
INSTANCE: db2svt NODE : 000
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrEduP, probe:20445
MESSAGE : remote catchup starts at 00000006F84C000C

2004-10-01-16.24.23.779557-240 I5972G311 LEVEL: Warning
PID : 19994 TID : 1024 PROC : db2hadrp (SVT)
INSTANCE: db2svt NODE : 000
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrEduP, probe:10645
MESSAGE : near peer catchup starts at 00000006F84C000C

2004-10-01-16.24.23.879497-240 E6284G324 LEVEL: Event
PID : 19994 TID : 1024 PROC : db2hadrp (SVT)
INSTANCE: db2svt NODE : 000
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetHdrState, probe:10000
CHANGE : HADR state set to P-NearlyPeer (was P-RemoteCatchup)

2004-10-01-16.24.23.884993-240 E6609G315 LEVEL: Event
PID : 19994 TID : 1024 PROC : db2hadrp (SVT)
INSTANCE: db2svt NODE : 000
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetHdrState, probe:10000
CHANGE : HADR state set to P-Peer (was P-NearlyPeer)
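Rather than reading every entry by eye, the state changes in a listing like the one above can be pulled out with a short script. The Python sketch below extracts the "HADR state set to ... (was ...)" transitions from db2diag.log text and checks whether the last recorded state is a peer state. The function names are hypothetical, and treating any final state that ends in "-Peer" as success is our assumption, based on the P-Peer state that closes the primary listing.

```python
import re

# Matches the CHANGE lines written by hdrSetHdrState, for example:
#   CHANGE : HADR state set to P-Boot (was None)
STATE_CHANGE = re.compile(r"HADR state set to (\S+) \(was (\S+)\)")


def hadr_transitions(log_text):
    """Return (previous_state, new_state) pairs in the order they were logged."""
    return [(old, new) for new, old in STATE_CHANGE.findall(log_text)]


def reached_peer(log_text):
    """True if the last logged HADR state is a peer state such as P-Peer."""
    transitions = hadr_transitions(log_text)
    return bool(transitions) and transitions[-1][1].endswith("-Peer")


sample = (
    "CHANGE : HADR state set to P-Boot (was None)\n"
    "CHANGE : HADR state set to P-RemoteCatchupPending (was P-Boot)\n"
    "CHANGE : HADR state set to P-RemoteCatchup (was P-RemoteCatchupPending)\n"
    "CHANGE : HADR state set to P-NearlyPeer (was P-RemoteCatchup)\n"
    "CHANGE : HADR state set to P-Peer (was P-NearlyPeer)\n"
)
for old, new in hadr_transitions(sample):
    print(f"{old} -> {new}")
print("peer state reached:", reached_peer(sample))
```

Because the pattern only looks at the CHANGE text, the same function can be applied to the db2diag.log of either database.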

And the following entries in the db2diag.log on the standby database:

Listing 10. db2diag.log info on Standby server after HADR started


2004-10-01-16.24.05.567511-240 E21902G305         LEVEL: Event
PID     : 2425                 TID  : 1024        PROC : db2hadrs (SVT)
INSTANCE: db2svt               NODE : 000
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetHdrState, probe:10000
CHANGE  : HADR state set to None (was None)

2004-10-01-16.24.05.601883-240 E22208G30          LEVEL: Event
PID     : 2425                 TID  : 1024        PROC : db2hadrs (SVT)
INSTANCE: db2svt               NODE : 000
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetHdrState, probe:10000
CHANGE  : HADR state set to S-Boot (was None)

2004-10-01-16.24.05.614868-240 I22516G323         LEVEL: Warning
PID     : 19960                TID  : 1024        PROC : db2agnti (SVT)
INSTANCE: db2svt               NODE : 000         DB   : SVT
APPHDL  : 0-68
FUNCTION: DB2 UDB, recovery manager, sqlpReplayMaster, probe:300
MESSAGE : Starting Replay Master on standby.

2004-10-01-16.24.05.615199-240 E22840G31          LEVEL: Event
PID     : 2425                 TID  : 1024        PROC : db2hadrs (SVT)
INSTANCE: db2svt               NODE : 000
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetHdrState, probe:10000
CHANGE  : HADR state set to S-LocalCatchup (was S-Boot)

2004-10-01-16.24.05.631105-240 I23158G39          LEVEL: Severe
PID     : 2425                 TID  : 1024        PROC : db2hadrs (SVT)
INSTANCE: db2svt               NODE : 000
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrEduAcceptEvent, probe:20280
MESSAGE : Failed to connect to primary. rc:
DATA #1 : Hexdump, 4 bytes
0xBFFFAE3C : 1900 0F81                                  ....

2004-10-01-16.24.05.641944-240 I23556G341         LEVEL: Severe
PID     : 2425                 TID  : 1024        PROC : db2hadrs (SVT)
INSTANCE: db2svt               NODE : 000
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrEduAcceptEvent, probe:20280
RETCODE : ZRC=0x810F0019=-2129723367=SQLO_CONN_REFUSED "Connection refused"

2004-10-01-16.24.05.661145-240 E23898G339         LEVEL: Warning
PID     : 19960                TID  : 1024        PROC : db2agnti (SVT)
INSTANCE: db2svt               NODE : 000         DB   : SVT
APPHDL  : 0-68
FUNCTION: DB2 UDB, recovery manager, sqlpReplayMaster, probe:920
MESSAGE : ADM1602W Rollforward recovery has been initiated.

2004-10-01-16.24.05.661511-240 E24238G382         LEVEL: Warning
PID     : 19960                TID  : 1024        PROC : db2agnti (SVT)
INSTANCE: db2svt               NODE : 000         DB   : SVT
APPHDL  : 0-68
FUNCTION: DB2 UDB, recovery manager, sqlpReplayMaster, probe:1740
MESSAGE : ADM1603I DB2 is invoking the forward phase of the database
          rollforward recovery.

2004-10-01-16.24.05.661743-240 I24621G413         LEVEL: Warning
PID     : 19960                TID  : 1024        PROC : db2agnti (SVT)
INSTANCE: db2svt               NODE : 000         DB   : SVT
APPHDL  : 0-68
FUNCTION: DB2 UDB, recovery manager, sqlpForwardRecovery, probe:720
DATA #1 : String, 103 bytes
Invoking database rollforward forward recovery, lowtranlsn 00000006F84C000C minbufflsn 00000006F84C000C

2004-10-01-16.24.05.675999-240 I25035G353         LEVEL: Warning
PID     : 19960                TID  : 1024        PROC : db2agnti (SVT)
INSTANCE: db2svt               NODE : 000         DB   : SVT
APPHDL  : 0-68
FUNCTION: DB2 UDB, recovery manager, sqlprecm, probe:2000
MESSAGE : Using parallel recovery with 5 agents 2QSets 108 queues and 64 chunks

2004-10-01-16.24.05.766940-240 I25389G373         LEVEL: Error
PID     : 24240                TID  : 1024        PROC : db2logmgr (SVT)
INSTANCE: db2svt               NODE : 000
FUNCTION: DB2 UDB, data protection, sqlpgRetrieveLogDisk, probe:3500
RETCODE : ZRC=0x860F000A=-2045837302=SQLO_FNEX "File not found."
          DIA8411C A file "S0000239.LOG" could not be found.

2004-10-01-16.24.05.779207-240 I25763G294         LEVEL: Warning
PID     : 24258                TID  : 1024        PROC : db2shred (SVT)
INSTANCE: db2svt               NODE : 000
APPHDL  : 0-68
FUNCTION: DB2 UDB, recovery manager, sqlpshrEdu, probe:18300
MESSAGE : Maxing hdrLCUEndLsnRequested

2004-10-01-16.24.05.839260-240 E26058G333         LEVEL: Event
PID     : 2425                 TID  : 1024        PROC : db2hadrs (SVT)
INSTANCE: db2svt               NODE : 000
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetHdrState, probe:10000
CHANGE  : HADR state set to S-RemoteCatchupPending (was S-LocalCatchup)

2004-10-01-16.24.18.040052-240 E26392G341         LEVEL: Event
PID     : 2425                 TID  : 1024        PROC : db2hadrs (SVT)
INSTANCE: db2svt               NODE : 000
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetHdrState, probe:10000
CHANGE  : HADR state set to S-RemoteCatchupPending (was S-RemoteCatchupPending)

2004-10-01-16.24.18.059286-240 E26734G334         LEVEL: Event
PID     : 2425                 TID  : 1024        PROC : db2hadrs (SVT)
INSTANCE: db2svt               NODE : 000
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetHdrState, probe:10000
CHANGE  : HADR state set to S-RemoteCatchup (was S-RemoteCatchupPending)

2004-10-01-16.24.18.059445-240 I27069G30          LEVEL: Warning
PID     : 2425                 TID  : 1024        PROC : db2hadrs (SVT)
INSTANCE: db2svt               NODE : 000
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSPrepareLogWrite, probe:10260
MESSAGE : RCUStartLsn 00000006F84C000C

2004-10-01-16.24.23.803640-240 E27672G324         LEVEL: Event
PID     : 2425                 TID  : 1024        PROC : db2hadrs (SVT)
INSTANCE: db2svt               NODE : 000
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetHdrState, probe:10000
CHANGE  : HADR state set to S-NearlyPeer (was S-RemoteCatchup)

2004-10-01-16.24.23.886153-240 E27997G315         LEVEL: Event
PID     : 2425                 TID  : 1024        PROC : db2hadrs (SVT)
INSTANCE: db2svt               NODE : 000
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetHdrState, probe:10000
CHANGE  : HADR state set to S-Peer (was S-NearlyPeer)
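The state changes in Listing 10 can be picked out of a long db2diag.log with a small script. The following is a minimal sketch, not part of the tested setup: it assumes only that each change is logged with the text "HADR state set to X (was Y)", as in the listing above, and the function names are hypothetical.

```python
import re

# Matches the CHANGE line written for each HADR state change, for example
# "CHANGE : HADR state set to S-Peer (was S-NearlyPeer)".
_STATE_CHANGE = re.compile(r"HADR state set to (\S+) \(was (\S+)\)")


def hadr_state_changes(diag_text):
    """Return (new_state, old_state) pairs in the order they were logged."""
    return _STATE_CHANGE.findall(diag_text)


def reached_peer(diag_text):
    """Return True if the last logged state change ended in a Peer state."""
    changes = hadr_state_changes(diag_text)
    return bool(changes) and changes[-1][0] in ("S-Peer", "P-Peer")
```

Passing the contents of db2diag.log to reached_peer gives a quick check that the standby has worked through the catch-up states and reached S-Peer.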

You can also monitor the HADR status from the database snapshots.

Listing 11. Monitor HADR using snapshot for database on primary server

HADR Status
  Role                   = Primary
  State                  = Peer
  Synchronization mode   = Sync
  Connection status      = Connected, 10/01/2004 16:24:18.039167
  Heartbeats missed      = 0
  Local host             = phillipe
  Local service          = SVT_HADR_1
  Remote host            = dartagnan
  Remote service         = SVT_HADR_2
  Remote instance        = DB2SVT
  timeout(seconds)       = 120
  Primary log position(file, page, LSN) = S0000239.LOG, 8345, 00000006FA5593C3
  Standby log position(file, page, LSN) = S0000239.LOG, 8342, 00000006FA556E4A
  Log gap running average(bytes) = 11643

Listing 12. Snapshot for database on standby server


HADR Status
  Role                   = Standby
  State                  = Peer
  Synchronization mode   = Sync
  Connection status      = Connected, 10/01/2004 16:24:18.039968
  Heartbeats missed      = 0
  Local host             = dartagnan
  Local service          = SVT_HADR_2
  Remote host            = phillipe
  Remote service         = SVT_HADR_1
  Remote instance        = DB2SVT
  timeout(seconds)       = 120
  Primary log position(file, page, LSN) = S0000239.LOG, 8333, 00000006FA54DB45
  Standby log position(file, page, LSN) = S0000239.LOG, 8331, 00000006FA54BFC1
  Log gap running average(bytes) = 10582
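The HADR fields in these snapshots are simple "name = value" pairs, so they are easy to read from a script. The sketch below is illustrative only and uses hypothetical function names. It assumes the snapshot text has one field per line, and that the two LSN values are hexadecimal positions in the log stream, so that their difference is the instantaneous distance between primary and standby. That distance is a different quantity from the "Log gap running average" reported by DB2.

```python
def parse_hadr_status(snapshot_text):
    """Collect the 'name = value' lines of a snapshot into a dictionary."""
    fields = {}
    for line in snapshot_text.splitlines():
        name, sep, value = line.partition("=")
        if sep:
            fields[name.strip()] = value.strip()
    return fields


def lsn_gap(fields):
    """Return primary LSN minus standby LSN, reading both as hexadecimal."""
    def lsn(key):
        # The position is reported as "file, page, LSN"; the LSN is last.
        return int(fields[key].split(",")[-1].strip(), 16)

    return (lsn("Primary log position(file, page, LSN)")
            - lsn("Standby log position(file, page, LSN)"))
```

For example, parse_hadr_status(text)["State"] returns the HADR state, and lsn_gap shows how far the standby is behind at the moment the snapshot was taken.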

8) Obtain the alternate server information on the SAP Central Instance. Once the client reroute information (alternate server information) has been updated in both the primary and standby database server configurations and HADR has been started, that information is copied into the database directory cache on a client machine the first time the client connects to the primary database. In this case, the client machine is the SAP Central Instance host, so all that is required is to establish a connection to the primary database from that host.

Listing 13. Populate the DB directory cache on the client machine with alternate server info
lunen:svtadm 289> db2 list node directory

 Node Directory

 Number of entries in the directory = 1

Node 1 entry:

 Node name                      = NODESVT
 Comment                        = TCPIP Node for database SVT
 Directory entry type           = LOCAL
 Protocol                       = TCPIP
 Hostname                       = phillipe
 Service name                   = sapdb2SVT

lunen:svtadm 293> db2 list db directory

 System Database Directory

 Number of entries in the directory = 1

Database 1 entry:

 Database alias                       = SVT
 Database name                        = SVT
 Node name                            = NODESVT
 Database release level               = a.00
 Comment                              =
 Directory entry type                 = Remote
 Catalog database partition number    = -1
 Alternate server hostname            =
 Alternate server port number         =

lunen:svtadm 294> db2 connect to svt user sapsvt
Enter current password for sapsvt:

   Database Connection Information

 Database server        = DB2/LINUX 8.2.0
 SQL authorization ID   = SAPSVT
 Local database alias   = SVT

lunen:svtadm 295> db2 list db directory

 System Database Directory

 Number of entries in the directory = 1

Database 1 entry:

 Database alias                       = SVT
 Database name                        = SVT
 Node name                            = NODESVT
 Database release level               = a.00
 Comment                              =
 Directory entry type                 = Remote
 Catalog database partition number    = -1
 Alternate server hostname            = dartagnan
 Alternate server port number         = 54700

lunen:svtadm 296>

Note that the alternate server hostname and port number in the database directory were populated by the first connection made after the HADR and client reroute information was updated on the database server.
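Whether a client has already picked up the alternate server can be checked by reading the output of db2 list db directory. The following is a rough sketch with a hypothetical function name. It assumes the directory holds a single database entry, as on our SAP Central Instance host, and that the output has one "name = value" field per line.

```python
def alternate_server(directory_text):
    """Return (hostname, port) from 'db2 list db directory' output.

    Returns None while the alternate server fields are still empty.
    Assumes the directory contains a single database entry.
    """
    host = port = ""
    for line in directory_text.splitlines():
        name, sep, value = line.partition("=")
        if not sep:
            continue
        name = name.strip()
        if name == "Alternate server hostname":
            host = value.strip()
        elif name == "Alternate server port number":
            port = value.strip()
    if host and port:
        return host, int(port)
    return None
```

A result of None on an application server means that no connection to the primary database has been made from that host since client reroute was configured.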

4.5 Install additional SAP application servers


This step is optional. If you need more SAP application servers to support a heavy workload in your system, you can follow the SAP R/3 Installation Guide to install additional dialog instances. In our test, we installed two additional dialog instances, on SAP application servers rac2 and laubac respectively. You will also need to follow section 4.4, step 8 to enable client reroute on each application server.


5. Running SAP with HADR and Client Reroute


5.1 Normal operation
During the normal operation of the SAP system, the primary database continuously ships the log data to the standby database so that it can be replayed in the standby database.

5.1.1 Primary database status


On the primary database server (phillipe), you will see the SAP work processes connected to the database:

Listing 14. Connections to the database on primary server

phillipe:db2svt 725> db2 list applications

Auth Id  Application    Appl.      Application Id                 DB       # of
         Name           Handle                                    Name    Agents
-------- -------------- ---------- ------------------------------ -------- -----
SAPSVT   dw.sapSVT_DVEB 155        G91A62B0.J4D1.059615133027     SVT      1
SAPSVT   dw.sapSVT_DVEB 148        G91A62B0.IDD1.059AC5133027     SVT      1
SAPSVT   dw.sapSVT_DVEB 157        G91A62B0.J6D1.059B55133027     SVT      1
SAPSVT   dw.sapSVT_DVEB 152        G91A62B0.J1D1.059B45133027     SVT      1
SAPSVT   dw.sapSVT_DVEB 151        G91A62B0.J0D1.059B15133027     SVT      1
SAPSVT   dw.sapSVT_DVEB 153        G91A62B0.J2D1.059A95133027     SVT      1
SAPSVT   dw.sapSVT_DVEB 150        G91A62B0.IFD1.059A55133027     SVT      1
SAPSVT   dw.sapSVT_DVEB 154        G91A62B0.J3D1.059B25133027     SVT      1
SAPSVT   dw.sapSVT_DVEB 146        G91A62B0.IBD1.059A85133027     SVT      1
SAPSVT   dw.sapSVT_DVEB 156        G91A62B0.J5D1.059AF5133027     SVT      1
SAPSVT   dw.sapSVT_DVEB 158        G91A62B0.J7D1.059B65133027     SVT      1

Its database configuration shows that it is running in the HADR PRIMARY role (only the relevant configuration parameters are shown below):

Listing 15. Database configuration showing role of primary HADR server

phillipe:db2svt 724> db2 get db cfg for svt

       Database Configuration for Database svt

 Backup pending                                          = NO
 Database is consistent                                  = NO
 Rollforward pending                                     = NO
 Restore pending                                         = NO
 Log retain for recovery status                          = NO
 User exit for logging status                            = YES
 First active log file                                   = S0000277.LOG

 HADR database role                                      = PRIMARY
 HADR local host name                  (HADR_LOCAL_HOST) = phillipe
 HADR local service name                (HADR_LOCAL_SVC) = SVT_HADR_1
 HADR remote host name                (HADR_REMOTE_HOST) = dartagnan
 HADR remote service name              (HADR_REMOTE_SVC) = SVT_HADR_2
 HADR instance name of remote server  (HADR_REMOTE_INST) = DB2SVT
 HADR timeout value                       (HADR_TIMEOUT) = 120
 HADR log write synchronization mode     (HADR_SYNCMODE) = SYNC

 First log archive method                 (LOGARCHMETH1) = DISK:/db2/SVT/log_archive/
 Index re-creation time and redo index build  (INDEXREC) = ACCESS
 Log pages during index build            (LOGINDEXBUILD) = ON

Its snapshot shows (only the relevant snapshot data is shown below):

Listing 16. Database snapshot showing HADR status on primary HADR server

            Database Snapshot

Database name                              = SVT
Database path                              = /db3/db2/SVT/db2svt/NODE0000/SQL00001/
Input database alias                       = SVT
Database status                            = Active

Log to be redone for recovery (Bytes)      = 2362
Log accounted for by dirty pages (Bytes)   = 2362

File number of first active log            = 277
File number of last active log             = 296
File number of current active log          = 277
File number of log being archived          = Not applicable

HADR Status
  Role                   = Primary
  State                  = Peer
  Synchronization mode   = Sync
  Connection status      = Connected, 10/15/2004 09:09:34.377400
  Heartbeats missed      = 0
  Local host             = phillipe
  Local service          = SVT_HADR_1
  Remote host            = dartagnan
  Remote service         = SVT_HADR_2
  Remote instance        = DB2SVT
  timeout(seconds)       = 120
  Primary log position(file, page, LSN) = S0000277.LOG, 0, 0000000790428945
  Standby log position(file, page, LSN) = S0000277.LOG, 0, 00000007904286A1
  Log gap running average(bytes) = 220

5.1.2 Standby database status


On the standby database server (dartagnan), you will see that only the db2replay application is running:

Listing 17. Connections to database on standby server

dartagnan:db2svt 317> db2 list applications

Auth Id  Application    Appl.      Application Id                 DB       # of
         Name           Handle                                    Name    Agents
-------- -------------- ---------- ------------------------------ -------- -----
         db2replay      8                                         SVT      1

Its database configuration shows that it is in rollforward pending state and running in the HADR STANDBY role (only the relevant configuration parameters are shown below):

Listing 18. Database configuration showing state of database and HADR on standby server

dartagnan:db2svt 318> db2 get db cfg for svt

       Database Configuration for Database svt

 Backup pending                                          = NO
 Database is consistent                                  = NO
 Rollforward pending                                     = DATABASE
 Restore pending                                         = YES
 Log retain for recovery status                          = NO
 User exit for logging status                            = YES
 First active log file                                   = S0000274.LOG

 HADR database role                                      = STANDBY
 HADR local host name                  (HADR_LOCAL_HOST) = dartagnan
 HADR local service name                (HADR_LOCAL_SVC) = SVT_HADR_2
 HADR remote host name                (HADR_REMOTE_HOST) = phillipe
 HADR remote service name              (HADR_REMOTE_SVC) = SVT_HADR_1
 HADR instance name of remote server  (HADR_REMOTE_INST) = DB2SVT
 HADR timeout value                       (HADR_TIMEOUT) = 120
 HADR log write synchronization mode     (HADR_SYNCMODE) = SYNC

 First log archive method                 (LOGARCHMETH1) = DISK:/db2/SVT/log_archive/
 Index re-creation time and redo index build  (INDEXREC) = ACCESS
 Log pages during index build            (LOGINDEXBUILD) = ON

Its snapshot shows (only the relevant snapshot data is shown below):

Listing 19. Database snapshot showing HADR status on standby server

            Database Snapshot

Database name                              = SVT
Database path                              = /db3/db2/SVT/db2svt/NODE0000/SQL00001/
Input database alias                       = SVT
Database status                            = Rollforward

File number of first active log            = 277
File number of last active log             = 296
File number of current active log          = 277
File number of log being archived          = Not applicable

Rollforward type                           = Database
Rollforward last committed timestamp       = 10/15/2004 07:18:25
Rollforward log file being processed       = 274
Rollforward status                         = Redo

HADR Status
  Role                   = Standby
  State                  = Peer
  Synchronization mode   = Sync
  Connection status      = Connected, 10/15/2004 09:09:34.377628
  Heartbeats missed      = 0
  Local host             = dartagnan
  Local service          = SVT_HADR_2
  Remote host            = phillipe
  Remote service         = SVT_HADR_1
  Remote instance        = DB2SVT
  timeout(seconds)       = 120
  Primary log position(file, page, LSN) = S0000277.LOG, 237, 0000000790515FFB
  Standby log position(file, page, LSN) = S0000277.LOG, 237, 0000000790515FBD
  Log gap running average(bytes) = 17450

5.1.3 SAP Central Instance server status


On the SAP Central Instance server (lunen), the DB2 client instance is configured to connect to the primary database, with the alternate server pointing to the standby database:

Listing 20. Node and DB directory on SAP CI

lunen:svtadm 262> db2 list node directory

 Node Directory

 Number of entries in the directory = 1

Node 1 entry:

 Node name                      = NODESVT
 Comment                        = TCPIP Node for database SVT
 Directory entry type           = LOCAL
 Protocol                       = TCPIP
 Hostname                       = phillipe
 Service name                   = sapdb2SVT

lunen:svtadm 263> db2 list db directory

 System Database Directory

 Number of entries in the directory = 1

Database 1 entry:

 Database alias                       = SVT
 Database name                        = SVT
 Node name                            = NODESVT
 Database release level               = a.00
 Comment                              =
 Directory entry type                 = Remote
 Catalog database partition number    = -1
 Alternate server hostname            = dartagnan
 Alternate server port number         = 54700

5.1.4 SAP system monitoring


From the SAP GUI, you can monitor the database activity and the operating system activity of the remote database hosts as usual. For database activity monitoring, use transaction DB6COCKPIT (ST04). All functions should work as normal. However, you can only monitor the activity and configuration of the primary database, not the standby database.


Figure 6. SAP GUI showing primary HADR server status using db6cockpit

You can also monitor the operating system activity of the remote database hosts using transaction OS07:

Figure 7. Monitoring remote databases using OS07


In our test, we set up both the CCMS agent and rfcoscol on the primary database host (phillipe), and only rfcoscol on the standby database host (dartagnan). From transaction OS07, you can therefore choose either SAPCCMSR.PHILLIPE.99 or SAPOSCOL_PHILLIPE to monitor OS activity on phillipe, and SAPOSCOL_DARTAGNAN to monitor OS activity on dartagnan.

Figure 8. Monitoring remote systems using CCMS or rfcoscol

On the primary database host (phillipe), you should see the following SAP agents running:

Listing 21. SAP monitoring agents on primary server

phillipe:db2svt 123> ps -ef | grep -i sap
root      7626     1  0 Sep24 ?        03:10:03 saposcol -l
svtadm    8209     1  0 Sep24 ?        00:01:04 sapccmsr -DCCMS
svtadm    8211  8209  0 Sep24 ?        00:00:12 sapccmsr -DCCMS
svtadm    8212  8211  0 Sep24 ?        00:13:13 sapccmsr -DCCMS
svtadm    8213  8211  0 Sep24 ?        00:00:07 sapccmsr -DCCMS
svtadm    8214  8211  0 Sep24 ?        00:02:43 sapccmsr -DCCMS
svtadm    5599  5598  0 11:49 ?        00:00:00 csh -c rfcoscol lunen sapgw00 55323439 GWHOST=lunen GWSERV=sapgw00 CONVID=55323439 pf=/usr/sap/SVT/SYS/profile/SVT_DVEBMGS00_lunen CPIC_TRACE=0 IDX=1 SNC_MODE=0
root      5647  5599  0 11:49 ?        00:00:00 rfcoscol lunen sapgw00 55323439 GWHOST=lunen GWSERV=sapgwSE CONVID=55323439 pf=/usr/sap/SVT/SYS/profile/SVT_DVEBMGS00_lunen CPIC_TRACE=0 IDX=1 SNC_MODE=0

On the standby database host (dartagnan), you should see the following SAP agents running:

Listing 22. SAP monitoring agents on standby server

dartagnan:svtadm 37> ps -ef | grep sap
root      5036     1  0 11:49 ?        00:00:00 saposcol -l
svtadm    5057  5056  0 11:50 ?        00:00:00 csh -c rfcoscol lunen sapgw00 55407072 GWHOST=lunen GWSERV=sapgw00 CONVID=55407072 pf=/usr/sap/SVT/SYS/profile/SVT_DVEBMGS00_lunen CPIC_TRACE=0 IDX=1 SNC_MODE=0
root      5086  5057  0 11:50 ?        00:00:00 rfcoscol lunen sapgw00 55407072 GWHOST=lunen GWSERV=sapgwSE CONVID=55407072 pf=/usr/sap/SVT/SYS/profile/SVT_DVEBMGS00_lunen CPIC_TRACE=0 IDX=1 SNC_MODE=0

5.2 Switching the roles of the primary and standby databases


During normal operation, you may want to switch the roles of the primary and standby databases without interrupting normal business, for example to perform system maintenance or a software upgrade on the primary database host. You can do this with the TAKEOVER HADR command.

5.2.1 On the old standby database


Before switching the roles of the primary and standby, make sure the standby database is in S-Peer state. To switch the roles, issue the TAKEOVER HADR command on the standby:

Listing 23. Executing TAKEOVER of HADR from standby server

dartagnan:db2svt 337> db2 takeover hadr on db svt
DB20000I  The TAKEOVER HADR ON DATABASE command completed successfully.
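The precondition above (standby role, Peer state) can be checked before the command is issued. The helper below is a hypothetical sketch rather than part of our test procedure: it only builds the command string used in this paper from the Role and State values reported in the standby's snapshot, and it refuses to build a normal takeover when the standby is not in Peer state. The force flag adds the BY FORCE option described in section 5.3.

```python
def takeover_command(db_name, role, state, force=False):
    """Build the TAKEOVER HADR command line for the standby database.

    role and state are the 'Role' and 'State' values from the HADR
    section of the standby's database snapshot.
    """
    if role != "Standby":
        raise ValueError("TAKEOVER HADR must be issued on the standby database")
    if not force and state != "Peer":
        # A normal role switch expects the standby to be in Peer state.
        raise ValueError("standby is in state %r, not Peer" % state)
    command = "db2 takeover hadr on db " + db_name
    if force:
        command += " by force"
    return command
```

Keeping the check separate from the execution makes it easy to review the exact command before it is run on a production system.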

You should see the following messages in the db2diag.log. As you can tell, the database first completes the rollforward phase, then stops the Replay Master, and finally switches to the primary role.

Listing 24. db2diag.log from standby server showing phases of becoming the primary HADR server

2004-10-29-11.10.26.157832-240 I9061G410          LEVEL: Warning
PID     : 12267                TID  : 1024        PROC : db2redom (SVT)
INSTANCE: db2svt               NODE : 000
APPHDL  : 0-11
FUNCTION: DB2 UDB, recovery manager, sqlpPRecReadLog, probe:4630
MESSAGE : Last log for rollforward is incomplete, log number:
DATA #1 : Hexdump, 4 bytes
0x4D37CF50 : 3701 0000                                  7...

2004-10-29-11.10.26.368167-240 I9472G317          LEVEL: Warning
PID     : 11891                TID  : 1024        PROC : db2agnti (SVT)
INSTANCE: db2svt               NODE : 000         DB   : SVT
APPHDL  : 0-11
FUNCTION: DB2 UDB, recovery manager, sqlpForwardRecovery, probe:1990
MESSAGE : nextLsn 0000000814505B74

2004-10-29-11.10.26.376534-240 E9790G379          LEVEL: Warning
PID     : 11891                TID  : 1024        PROC : db2agnti (SVT)
INSTANCE: db2svt               NODE : 000         DB   : SVT
APPHDL  : 0-11
FUNCTION: DB2 UDB, recovery manager, sqlpReplayMaster, probe:3600
MESSAGE : ADM1605I DB2 is invoking the backward phase of database rollforward
          recovery.

2004-10-29-11.10.26.376827-240 I10170G378         LEVEL: Warning
PID     : 11891                TID  : 1024        PROC : db2agnti (SVT)
INSTANCE: db2svt               NODE : 000         DB   : SVT
APPHDL  : 0-11
FUNCTION: DB2 UDB, recovery manager, sqlpForwardRecovery, probe:2210
MESSAGE : Invoking database rollforward backward recovery, nextLsn:
          0000000814505B74

2004-10-29-11.10.26.870813-240 I10549G409         LEVEL: Error
PID     : 11891                TID  : 1024        PROC : db2agnti (SVT)
INSTANCE: db2svt               NODE : 000         DB   : SVT
APPHDL  : 0-11
FUNCTION: DB2 UDB, recovery manager, sqlpForwardRecovery, probe:670
MESSAGE : dbcb->logfhdr.firstDeleteFile:
DATA #1 : Hexdump, 4 bytes
0x30010264 : FFFF FFFF                                  ....

2004-10-29-11.10.26.985579-240 E10959G350         LEVEL: Warning
PID     : 11891                TID  : 1024        PROC : db2agnti (SVT)
INSTANCE: db2svt               NODE : 000         DB   : SVT
APPHDL  : 0-11
FUNCTION: DB2 UDB, recovery manager, sqlpReplayMaster, probe:6600
MESSAGE : ADM1611W The rollforward recovery phase has been completed.

2004-10-29-11.10.26.986105-240 I11310G324         LEVEL: Warning
PID     : 11891                TID  : 1024        PROC : db2agnti (SVT)
INSTANCE: db2svt               NODE : 000         DB   : SVT
APPHDL  : 0-11
FUNCTION: DB2 UDB, recovery manager, sqlpReplayMaster, probe:9500
MESSAGE : Stopping Replay Master on standby.

2004-10-29-11.10.34.093102-240 E11635G309         LEVEL: Event
PID     : 12262                TID  : 1024        PROC : db2hadrp (SVT)
INSTANCE: db2svt               NODE : 000
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetHdrState, probe:10000
CHANGE  : HADR state set to P-Peer (was S-Peer)

After the standby database becomes the new primary database, the SAP application server reroutes its connections from the old primary database to the new primary database:

Listing 25. Client Reroute connections established on "New" HADR server

dartagnan:db2svt 338> db2 list applications

Auth Id  Application    Appl.      Application Id                 DB       # of
         Name           Handle                                    Name    Agents
-------- -------------- ---------- ------------------------------ -------- -----
SAPSVT   dw.sapSVT_DVEB 617        G91A62B0.HDDC.059015212215     SVT      1
DB2SVT   db2recindex    616                                       SVT      1
SAPSVT   dw.sapSVT_DVEB 614        G91A62B0.HADC.059025214302     SVT      1
SAPSVT   dw.sapSVT_DVEB 615        G91A62B0.HBDC.059005221528     SVT      1


Its database snapshot shows that it has now assumed the HADR PRIMARY role (only the relevant information is shown below):

Listing 26. Database snapshot showing new HADR primary role

            Database Snapshot

Database name                              = SVT
Database path                              = /db3/db2/SVT/db2svt/NODE0000/SQL00001/
Input database alias                       = SVT
Database status                            = Active

File number of first active log            = 310
File number of last active log             = 329
File number of current active log          = 310
File number of log being archived          = Not applicable

HADR Status
  Role                   = Primary
  State                  = Peer
  Synchronization mode   = Sync
  Connection status      = Connected, 10/29/2004 10:40:15.031467
  Heartbeats missed      = 0
  Local host             = dartagnan
  Local service          = SVT_HADR_2
  Remote host            = phillipe
  Remote service         = SVT_HADR_1
  Remote instance        = DB2SVT
  timeout(seconds)       = 120
  Primary log position(file, page, LSN) = S0000310.LOG, 408, 000000081453C3C9
  Standby log position(file, page, LSN) = S0000310.LOG, 408, 000000081453C3C9
  Log gap running average(bytes) = 1000

5.2.2 On the old primary database


After the role switch, the old primary database is running in standby mode and is constantly replaying the logs shipped from the new primary database:

Listing 27. Connection to old primary server

phillipe:db2svt 728> db2 list applications

Auth Id  Application    Appl.      Application Id                 DB       # of
         Name           Handle                                    Name    Agents
-------- -------------- ---------- ------------------------------ -------- -----
         db2replay      766                                       SVT      1

The db2diag.log on the old primary server shows that the database first switches its role from primary to standby, then starts the log Replay Master, and finally starts the rollforward process and enters rollforward pending mode.

Listing 28. db2diag.log from former primary server showing phases of role switch to standby

2004-10-29-11.10.24.350207-240 I6904G355          LEVEL: Warning
PID     : 29765                TID  : 1024        PROC : db2agnti (SVT)
INSTANCE: db2svt               NODE : 000         DB   : SVT
APPHDL  : 0-304
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSwitchDbFromRuntimeToStandby, probe:50122
MESSAGE : copy_nextlsn 0000000814505B74

2004-10-29-11.10.24.367203-240 E7260G344          LEVEL: Event
PID     : 29765                TID  : 1024        PROC : db2agnti (SVT)
INSTANCE: db2svt               NODE : 000         DB   : SVT
APPHDL  : 0-304
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetHdrState, probe:10000
CHANGE  : HADR state set to S-Peer (was P-Peer)

2004-10-29-11.10.24.383493-240 I7605G298          LEVEL: Severe
PID     : 29755                TID  : 1024        PROC : db2agnti (SVT)
INSTANCE: db2svt               NODE : 000         DB   : SVT
APPHDL  : 0-305
FUNCTION: DB2 UDB, base sys utilities, sqlesrsu, probe:999
MESSAGE : free tran stuff

2004-10-29-11.10.24.384458-240 I7904G324          LEVEL: Warning
PID     : 29755                TID  : 1024        PROC : db2agnti (SVT)
INSTANCE: db2svt               NODE : 000         DB   : SVT
APPHDL  : 0-305
FUNCTION: DB2 UDB, recovery manager, sqlpReplayMaster, probe:300
MESSAGE : Starting Replay Master on standby.

2004-10-29-11.10.24.384954-240 I8229G307          LEVEL: Warning
PID     : 3824                 TID  : 1024        PROC : db2hadrs (SVT)
INSTANCE: db2svt               NODE : 000
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSPrepareLogWrite, probe:10260
MESSAGE : RCUStartLsn 0000000814505B74

2004-10-29-11.10.35.872744-240 E8537G340          LEVEL: Warning
PID     : 29755                TID  : 1024        PROC : db2agnti (SVT)
INSTANCE: db2svt               NODE : 000         DB   : SVT
APPHDL  : 0-305
FUNCTION: DB2 UDB, recovery manager, sqlpReplayMaster, probe:920
MESSAGE : ADM1602W Rollforward recovery has been initiated.

2004-10-29-11.10.35.873052-240 E8878G383          LEVEL: Warning
PID     : 29755                TID  : 1024        PROC : db2agnti (SVT)
INSTANCE: db2svt               NODE : 000         DB   : SVT
APPHDL  : 0-305
FUNCTION: DB2 UDB, recovery manager, sqlpReplayMaster, probe:1740
MESSAGE : ADM1603I DB2 is invoking the forward phase of the database
          rollforward recovery.

2004-10-29-11.10.35.873251-240 I9262G414          LEVEL: Warning
PID     : 29755                TID  : 1024        PROC : db2agnti (SVT)
INSTANCE: db2svt               NODE : 000         DB   : SVT
APPHDL  : 0-305
FUNCTION: DB2 UDB, recovery manager, sqlpForwardRecovery, probe:720
DATA #1 : String, 103 bytes
Invoking database rollforward forward recovery, lowtranlsn 0000000814505B74 minbufflsn 00000008143A400C

2004-10-29-11.10.35.884939-240 I9677G354          LEVEL: Warning
PID     : 29755                TID  : 1024        PROC : db2agnti (SVT)
INSTANCE: db2svt               NODE : 000         DB   : SVT
APPHDL  : 0-305
FUNCTION: DB2 UDB, recovery manager, sqlprecm, probe:2000
MESSAGE : Using parallel recovery with 5 agents 27 QSets 108 queues and 64 chunks

Its database snapshot shows that it is in rollforward pending state and running in the HADR STANDBY role (only the relevant information is shown below):

Listing 29. Database snapshot of former primary server

            Database Snapshot

Database name                              = SVT
Database path                              = /db3/db2/SVT/db2svt/NODE0000/SQL00001/
Input database alias                       = SVT
Database status                            = Rollforward

File number of first active log            = 310
File number of last active log             = 329
File number of current active log          = 310
File number of log being archived          = Not applicable

Rollforward type                           = Database
Rollforward last committed timestamp       = 10/29/2004 11:17:18
Rollforward log file being processed       = 310
Rollforward status                         = Redo

HADR Status
  Role                   = Standby
  State                  = Peer
  Synchronization mode   = Sync
  Connection status      = Connected, 10/29/2004 10:40:13.269653
  Heartbeats missed      = 0
  Local host             = phillipe
  Local service          = SVT_HADR_1
  Remote host            = dartagnan
  Remote service         = SVT_HADR_2
  Remote instance        = DB2SVT
  timeout(seconds)       = 120
  Primary log position(file, page, LSN) = S0000310.LOG, 408, 000000081453C455
  Standby log position(file, page, LSN) = S0000310.LOG, 408, 000000081453C455
  Log gap running average(bytes) = 885

5.2.3 On the SAP Central Instance


After the role switch, because automatic client reroute is enabled on both the original primary database and the standby database, all active SAP work processes are reconnected to the new primary database. All transactions that were in flight during the switch are rolled back, and the user or the application must resubmit them. All subsequent transactions run on the new primary database. The SAP work process log file contains the following messages, which indicate the reconnection to the alternate (new primary) database server and the rollback of the current transaction:

Listing 30. SAP work process log on SAP CI

B ***LOG BY0=> DbSlReadDB6( SQLExecute ): [IBM][CLI Driver][DB2/LINUX] SQL30108N A connection failed but has been re-established. The hostname or IP address is "dartagnan" and the service name or port number is "54700". Special registers may or may not be re-attempted (Reason code = "1"). SQLSTATE=08506 [dbdynpdb#? @ 438] [dbdynpdb0438 ]
B *** ERROR => Input values: Table = DYNPLOAD, prog = SAPMSDYP , dynpnr = 0011, langu = [dbdynpdb2.c 438]
A TH VERBOSE LEVEL FULL
M ***LOG R68=> ThIRollBack, roll back () [thxxhead.c 11245]
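To find out which server a work process was rerouted to, the SQL30108N message can be searched for in the work process log. The snippet below is a minimal sketch with a hypothetical function name. It relies only on the message wording shown in Listing 30, and it returns the host and port exactly as they are quoted in the message.

```python
import re

# Pulls the two quoted values out of an SQL30108N reroute message.
_REROUTE = re.compile(
    r'SQL30108N.*?hostname or IP address is "([^"]*)"'
    r'.*?service name or port number is "([^"]*)"',
    re.DOTALL,
)


def reroute_target(log_text):
    """Return (host, port) from the first SQL30108N message, or None."""
    match = _REROUTE.search(log_text)
    return match.groups() if match else None
```

The returned host and port can be compared with the alternate server entry in the database directory to confirm that the reconnection went to the expected standby.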


5.2.4 SAP Application and System Monitoring


The client reroute feature is supported only at the database level, not at the instance level. Therefore, only database connections are rerouted, not instance attachments. However, some SAP database administration functions (database snapshots using dmdb6rdi, for example) are performed at the instance level and will not work correctly after the role switch of the primary and standby. In our test system, the original primary database server (phillipe) is catalogued on the SAP Central Instance host (lunen) as a TCP/IP node:

Listing 31. Node Directory on SAP CI after client reroute

lunen:svtadm 289> db2 list node directory

 Node Directory

 Number of entries in the directory = 1

Node 1 entry:

 Node name                      = NODESVT
 Comment                        = TCPIP Node for database SVT
 Directory entry type           = LOCAL
 Protocol                       = TCPIP
 Hostname                       = phillipe
 Service name                   = sapdb2SVT

Even after the database on phillipe has been switched to the standby role and all database connections have been rerouted to the new primary database host dartagnan, the node directory information remains the same, and there is no alternate server for the node (instance). This is a limitation of the current DB2 client reroute implementation. Because of this limitation, you will not be able to monitor the new primary database on dartagnan in transaction DB6COCKPIT (ST04). Transaction ST04 will still display the snapshot of the database on the original primary host phillipe, which is probably not what you want. That said, functions in transaction ST04 that rely only on the database connection, not the instance attachment, will continue to work properly.

5.3 Failover when the primary database is physically down


When the current primary database becomes unavailable due to partial or complete site failure, you can perform a failover so that the current standby database becomes the new primary database. Note that this procedure might cause a loss of data in certain scenarios. Please refer to DB2 UDB Data Recovery and High Availability Guide and Reference, V8.2 for more details. On Linux, this failover can be automated by using Tivoli System Automation (TSA), a no-charge feature that is packaged with DB2 UDB. Another whitepaper has also been written to demonstrate how TSA is used in an SAP with DB2 HADR environment.


5.3.1 On the old primary database


In this case, we assume that the primary database is failing. You will need to completely disable the failed primary database. When a database encounters internal errors, normal shutdown commands may not completely shut it down, and you may need to use operating system commands to remove resources such as processes, shared memory, or network connections. In our test, we used the shutdown -Fr now command to simulate a hardware failure.

5.3.2 On the old standby database


Issue the TAKEOVER HADR command with the BY FORCE option. The BY FORCE option is required because the primary is expected to be offline.

Listing 32. FORCE takeover from standby HADR server

dartagnan:db2svt 337> db2 takeover hadr on db svt by force
DB20000I  The TAKEOVER HADR ON DATABASE command completed successfully.

You should see the following messages in the db2diag.log. The database first changes from S-Peer to S-RemoteCatchupPending because the connection to the primary was lost; the rollforward recovery then completes and the Replay Master is stopped. Finally, the database switches to the primary role and stays in P-RemoteCatchupPending state. It cannot reach P-Peer state because the remote HADR partner is down.

Listing 33. db2diag.log from standby server after a FORCE takeover
2004-10-29-12.36.28.187952-240 E22497G325         LEVEL: Event
PID     : 12262                TID  : 1024        PROC : db2hadrs (SVT)
INSTANCE: db2svt               NODE : 000
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetHdrState, probe:10000
CHANGE  : HADR state set to S-RemoteCatchupPending (was S-Peer)

2004-10-29-12.36.28.203705-240 I22823G411         LEVEL: Warning
PID     : 22895                TID  : 1024        PROC : db2redom (SVT)
INSTANCE: db2svt               NODE : 000
APPHDL  : 0-771
FUNCTION: DB2 UDB, recovery manager, sqlpPRecReadLog, probe:4630
MESSAGE : Last log for rollforward is incomplete, log number:
DATA #1 : Hexdump, 4 bytes
0x4C3A4390 : 3701 0000                                  7...

2004-10-29-12.36.28.426984-240 I23235G318         LEVEL: Warning
PID     : 12267                TID  : 1024        PROC : db2agnti (SVT)
INSTANCE: db2svt               NODE : 000         DB   : SVT
APPHDL  : 0-771
FUNCTION: DB2 UDB, recovery manager, sqlpForwardRecovery, probe:1990
MESSAGE : nextLsn 0000000814578B92

2004-10-29-12.36.28.460209-240 E23554G380         LEVEL: Warning
PID     : 12267                TID  : 1024        PROC : db2agnti (SVT)
INSTANCE: db2svt               NODE : 000         DB   : SVT
APPHDL  : 0-771
FUNCTION: DB2 UDB, recovery manager, sqlpReplayMaster, probe:3600
MESSAGE : ADM1605I DB2 is invoking the backward phase of database rollforward recovery.

2004-10-29-12.36.28.460697-240 I23935G379         LEVEL: Warning
PID     : 12267                TID  : 1024        PROC : db2agnti (SVT)
INSTANCE: db2svt               NODE : 000         DB   : SVT
APPHDL  : 0-771
FUNCTION: DB2 UDB, recovery manager, sqlpForwardRecovery, probe:2210
MESSAGE : Invoking database rollforward backward recovery, nextLsn: 0000000814578B92

2004-10-29-12.36.29.238202-240 I24315G410         LEVEL: Error
PID     : 12267                TID  : 1024        PROC : db2agnti (SVT)
INSTANCE: db2svt               NODE : 000         DB   : SVT
APPHDL  : 0-771
FUNCTION: DB2 UDB, recovery manager, sqlpForwardRecovery, probe:670
MESSAGE : dbcb->logfhdr.firstDeleteFile:
DATA #1 : Hexdump, 4 bytes

2004-10-29-12.36.29.345022-240 E24726G351         LEVEL: Warning
PID     : 12267                TID  : 1024        PROC : db2agnti (SVT)
INSTANCE: db2svt               NODE : 000         DB   : SVT
APPHDL  : 0-771
FUNCTION: DB2 UDB, recovery manager, sqlpReplayMaster, probe:6600
MESSAGE : ADM1611W The rollforward recovery phase has been completed.

2004-10-29-12.36.29.345636-240 I25078G325         LEVEL: Warning
PID     : 12267                TID  : 1024        PROC : db2agnti (SVT)
INSTANCE: db2svt               NODE : 000         DB   : SVT
APPHDL  : 0-771
FUNCTION: DB2 UDB, recovery manager, sqlpReplayMaster, probe:9500
MESSAGE : Stopping Replay Master on standby.

2004-10-29-12.36.36.168746-240 E25404G341         LEVEL: Event
PID     : 12262                TID  : 1024        PROC : db2hadrp (SVT)
INSTANCE: db2svt               NODE : 000
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetHdrState, probe:10000
CHANGE  : HADR state set to P-RemoteCatchupPending (was S-RemoteCatchupPending)
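The state changes described above can be pulled out of a db2diag.log mechanically. The following is a minimal sketch in Python, not part of the original test setup: the function name `hadr_transitions` is hypothetical, and the pattern assumes the CHANGE line wording shown in Listing 33.

```python
import re

# Pattern for the CHANGE line written by hdrSetHdrState, as it appears in
# Listing 33. The wording is taken from the listing; other DB2 levels may differ.
STATE_CHANGE = re.compile(r"HADR state set to ([\w-]+) \(was ([\w-]+)\)")


def hadr_transitions(log_text):
    """Return (old_state, new_state) pairs in the order they appear in the log."""
    return [(old, new) for new, old in STATE_CHANGE.findall(log_text)]


# Two CHANGE lines and one unrelated MESSAGE line copied from Listing 33.
sample = """\
CHANGE  : HADR state set to S-RemoteCatchupPending (was S-Peer)
MESSAGE : Stopping Replay Master on standby.
CHANGE  : HADR state set to P-RemoteCatchupPending (was S-RemoteCatchupPending)
"""

for old, new in hadr_transitions(sample):
    print(f"{old} -> {new}")
```

Because the pattern does not depend on line breaks, it should also work on a log that has been copied with its line structure collapsed.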

Its database snapshot shows:

Listing 34. Database snapshot of former standby HADR server after takeover

            Database Snapshot

Database name                              = SVT
File number of first active log            = 310
File number of last active log             = 329
File number of current active log          = 310
File number of log being archived          = Not applicable

HADR Status
  Role                   = Primary
  State                  = Disconnected
  Synchronization mode   = Sync
  Connection status      = Disconnected, 10/29/2004 12:36:28.188239
  Heartbeats missed      = 0
  Local host             = dartagnan
  Local service          = SVT_HADR_2
  Remote host            = phillipe
  Remote service         = SVT_HADR_1
  Remote instance        = DB2SVT
  timeout(seconds)       = 120
  Primary log position(file, page, LSN) = S0000310.LOG, 4072, 000000081538C855
  Standby log position(file, page, LSN) = S0000000.LOG, 0, 0000000000000000
  Log gap running average(bytes) = 0
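A monitoring script could read the HADR Status block of such a snapshot to notice that the new primary is running without a connected standby. The sketch below is illustrative only: it assumes the snapshot text has one `key = value` pair per line, and the helper names `parse_hadr_status` and `primary_without_peer` are hypothetical.

```python
def parse_hadr_status(snapshot_text):
    """Collect 'key = value' lines from snapshot output into a dictionary."""
    status = {}
    for line in snapshot_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip():
            status[key.strip()] = value.strip()
    return status


def primary_without_peer(status):
    """True when the database is a primary whose HADR state is not Peer."""
    return status.get("Role") == "Primary" and status.get("State") != "Peer"


# Values copied from Listing 34.
sample = """\
Role                   = Primary
State                  = Disconnected
Synchronization mode   = Sync
Connection status      = Disconnected, 10/29/2004 12:36:28.188239
"""

status = parse_hadr_status(sample)
print(status["Role"], status["State"], primary_without_peer(status))
```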


5.3.3 On the SAP Central Instance


Same behavior as in 4.2.3.

5.4 Reintegrating a database after a failover


After a failover caused by a failure of the original primary database, you can bring the failed database back online and use it as a standby database, or return it to its role as the primary database. Be aware that if ASYNC or NEARSYNC mode was used for HADR, there is no guarantee that the reintegration will succeed.

Reintegrating the failed primary database into the HADR pair as the standby database involves the following steps:

1) Repair the system where the original primary database resides. This could involve repairing failed hardware or rebooting the crashed operating system.

2) Restart the failed old primary database as a standby database.

Listing 35. Starting the former primary server as new standby
phillipe:db2svt 51> db2start
10/29/2004 14:27:14     0   0   SQL1063N  DB2START processing was successful.
SQL1063N  DB2START processing was successful.
phillipe:db2svt 52> db2 start hadr on db svt as standby
DB20000I  The START HADR ON DATABASE command completed successfully.

After this, the db2diag.log on this host shows messages like the following:

Listing 36. db2diag.log from former primary server after restarting
2004-10-29-14.27.14.110464-240 E31781G961         LEVEL: Event
PID     : 8479                 TID  : 1024        PROC : db2star2
INSTANCE: db2svt               NODE : 000
FUNCTION: DB2 UDB, base sys utilities, DB2StartMain, probe:911
MESSAGE : ADM7513W Database manager has started.
START   : DB2 DBM
. . . . . . . . . .

2004-10-29-14.27.27.936769-240 E32743G305         LEVEL: Event
PID     : 8972                 TID  : 1024        PROC : db2hadrs (SVT)
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetHdrState, probe:10000
CHANGE  : HADR state set to None (was None)

2004-10-29-14.27.27.970986-240 E33049G307         LEVEL: Event
PID     : 8972                 TID  : 1024        PROC : db2hadrs (SVT)
INSTANCE: db2svt               NODE : 000
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetHdrState, probe:10000
CHANGE  : HADR state set to S-Boot (was None)

2004-10-29-14.27.28.012588-240 I33357G323         LEVEL: Warning
PID     : 8652                 TID  : 1024        PROC : db2agnti (SVT)
INSTANCE: db2svt               NODE : 000         DB   : SVT
APPHDL  : 0-11
FUNCTION: DB2 UDB, recovery manager, sqlpReplayMaster, probe:300
MESSAGE : Starting Replay Master on standby.

2004-10-29-14.27.28.015452-240 I33681G401         LEVEL: Warning
PID     : 8972                 TID  : 1024        PROC : db2hadrs (SVT)
INSTANCE: db2svt               NODE : 000
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrHandleHsAck, probe:30445
MESSAGE : HADR: old primary reintegration as new standby discarding obsolete logs after hdrLCUEndLsnRequested 0000000814578B91

2004-10-29-14.27.28.015659-240 E34083G317         LEVEL: Event
PID     : 8972                 TID  : 1024        PROC : db2hadrs (SVT)
INSTANCE: db2svt               NODE : 000
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetHdrState, probe:10000
CHANGE  : HADR state set to S-LocalCatchup (was S-Boot)

2004-10-29-14.27.28.021135-240 E34401G339         LEVEL: Warning
PID     : 8652                 TID  : 1024        PROC : db2agnti (SVT)
INSTANCE: db2svt               NODE : 000         DB   : SVT
APPHDL  : 0-11
FUNCTION: DB2 UDB, recovery manager, sqlpReplayMaster, probe:920
MESSAGE : ADM1602W Rollforward recovery has been initiated.

2004-10-29-14.27.28.021465-240 E34741G382         LEVEL: Warning
PID     : 8652                 TID  : 1024        PROC : db2agnti (SVT)
INSTANCE: db2svt               NODE : 000         DB   : SVT
APPHDL  : 0-11
FUNCTION: DB2 UDB, recovery manager, sqlpReplayMaster, probe:1740
MESSAGE : ADM1603I DB2 is invoking the forward phase of the database rollforward recovery.

2004-10-29-14.27.28.025235-240 I35124G413         LEVEL: Warning
PID     : 8652                 TID  : 1024        PROC : db2agnti (SVT)
INSTANCE: db2svt               NODE : 000         DB   : SVT
APPHDL  : 0-11
FUNCTION: DB2 UDB, recovery manager, sqlpForwardRecovery, probe:720
DATA #1 : String, 103 bytes
Invoking database rollforward forward recovery, lowtranlsn 0000000814578B92 minbufflsn 0000000814560978

2004-10-29-14.27.28.075504-240 I35538G353         LEVEL: Warning
PID     : 8652                 TID  : 1024        PROC : db2agnti (SVT)
INSTANCE: db2svt               NODE : 000         DB   : SVT
APPHDL  : 0-11
FUNCTION: DB2 UDB, recovery manager, sqlprecm, probe:2000
MESSAGE : Using parallel recovery with 5 agents 27 QSets 108 queues and 64 chunks

2004-10-29-14.27.28.204285-240 E35892G333         LEVEL: Event
PID     : 8972                 TID  : 1024        PROC : db2hadrs (SVT)
INSTANCE: db2svt               NODE : 000
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetHdrState, probe:10000
CHANGE  : HADR state set to S-RemoteCatchupPending (was S-LocalCatchup)

2004-10-29-14.27.28.220506-240 I36226G364         LEVEL: Warning
PID     : 8972                 TID  : 1024        PROC : db2hadrs (SVT)
INSTANCE: db2svt               NODE : 000
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrEduS, probe:20895
MESSAGE : Pair validation passed. Primary reintegration: hdrLCUEndLsnRequested: 0000000814578B91

2004-10-29-14.27.28.220717-240 I36591G356         LEVEL: Warning
PID     : 8972                 TID  : 1024        PROC : db2hadrs (SVT)
INSTANCE: db2svt               NODE : 000
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrNukeLogTail, probe:10445
MESSAGE : Primary reintegration: hdrNukeLogTail() called at LSN: 0000000814578B91

2004-10-29-14.28.33.222329-240 E36948G334         LEVEL: Event
PID     : 8972                 TID  : 1024        PROC : db2hadrs (SVT)
INSTANCE: db2svt               NODE : 000
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetHdrState, probe:10000
CHANGE  : HADR state set to S-RemoteCatchup (was S-RemoteCatchupPending)

2004-10-29-14.28.33.222550-240 I37283G307         LEVEL: Warning
PID     : 8972                 TID  : 1024        PROC : db2hadrs (SVT)
INSTANCE: db2svt               NODE : 000
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSPrepareLogWrite, probe:10260
MESSAGE : RCUStartLsn 0000000814578B92

2004-10-29-14.28.39.343019-240 E37591G324         LEVEL: Event
PID     : 8972                 TID  : 1024        PROC : db2hadrs (SVT)
INSTANCE: db2svt               NODE : 000
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetHdrState, probe:10000
CHANGE  : HADR state set to S-NearlyPeer (was S-RemoteCatchup)

2004-10-29-14.28.39.431901-240 E37916G315         LEVEL: Event
PID     : 8972                 TID  : 1024        PROC : db2hadrs (SVT)
INSTANCE: db2svt               NODE : 000
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetHdrState, probe:10000
CHANGE  : HADR state set to S-Peer (was S-NearlyPeer)

On the current primary database, the db2diag.log shows messages like the following:

Listing 37. db2diag.log from current primary server showing former primary server re-integrating as HADR standby server
2004-10-29-14.27.29.509990-240 I29576G322         LEVEL: Warning
PID     : 12262                TID  : 1024        PROC : db2hadrp (SVT)
INSTANCE: db2svt               NODE : 000
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrEduP, probe:20482
MESSAGE : Old primary requesting rejoining HADR pair as a standby

2004-10-29-14.28.37.982717-240 E29899G334         LEVEL: Event
PID     : 12262                TID  : 1024        PROC : db2hadrp (SVT)
INSTANCE: db2svt               NODE : 000
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetHdrState, probe:10000
CHANGE  : HADR state set to P-RemoteCatchup (was P-RemoteCatchupPending)

2004-10-29-14.28.37.986492-240 I30234G308         LEVEL: Warning
PID     : 12262                TID  : 1024        PROC : db2hadrp (SVT)
INSTANCE: db2svt               NODE : 000
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrEduP, probe:20445
MESSAGE : remote catchup starts at 000000081457800C

2004-10-29-14.28.40.630513-240 I30543G325         LEVEL: Warning
PID     : 12262                TID  : 1024        PROC : db2hadrp (SVT)
INSTANCE: db2svt               NODE : 000
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrTransitionPtoNPeer, probe:10645
MESSAGE : near peer catchup starts at 000000081538DCDD

2004-10-29-14.28.40.730154-240 E30869G324         LEVEL: Event
PID     : 12262                TID  : 1024        PROC : db2hadrp (SVT)
INSTANCE: db2svt               NODE : 000
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetHdrState, probe:10000
CHANGE  : HADR state set to P-NearlyPeer (was P-RemoteCatchup)

2004-10-29-14.28.40.732209-240 E31194G315         LEVEL: Event
PID     : 12262                TID  : 1024        PROC : db2hadrp (SVT)
INSTANCE: db2svt               NODE : 000
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetHdrState, probe:10000
CHANGE  : HADR state set to P-Peer (was P-NearlyPeer)

3) After the original primary database has rejoined the HADR pair as the standby database, you can perform a failback operation that switches the roles of the two databases, so that the original primary database becomes the primary database again. To perform this failback operation, follow the steps in 5.2.

5.5 Restrictions and recommendations

5.5.1 Restrictions of HADR in SAP environment


Please read the HADR restrictions documented in Chapter 7, High availability disaster recovery (HADR), of the DB2 UDB Data Recovery and High Availability Guide and Reference, V8.2. The main restriction in an SAP environment is:

1) HADR is not supported on a multiple-partition database, in other words, SAP BW or SEM with more than one partition.

5.5.2 Restrictions of Client Reroute in SAP environment


Please read the client reroute restrictions documented in Appendix B, Using automatic client rerouting, of the DB2 UDB Administration Guide: Implementation, V8.2. The main restrictions in an SAP environment are:

1) Client reroute works at the database level, not at the instance level. Therefore, some database administration commands or tools that require an instance attachment will not work on the new primary database after failover. See 4.2.4 for details.

2) Automatic client reroute is supported only when the communications protocol is TCP/IP. Local connections therefore cannot be rerouted: database connections from an SAP application server located on the same physical host as the database server will not be rerouted if the local database fails over to a remote database server.

5.5.3 Recommendations
1) HADR Synchronization Mode

SYNC mode is recommended for the best data protection. If you find that this mode degrades the performance of your system, you can change to NEARSYNC mode. ASYNC mode is not recommended because it carries a higher probability of data loss.


6. Performance impact of HADR

A key question for most customers before implementing HADR in a production environment is the impact it will have on performance. To get a feel for this impact, we ran several tests on our test systems. Because these systems are not in a controlled environment, the performance numbers can vary from run to run.

6.1 Failover time

In a test conducted to measure failover time in a simulated production environment (see Figure 5. SAP HADR Setup), we ran a 600-user SD Benchmark (with 200 SD users on each application server) and forced a takeover (see section 5.3). The clients on the application servers took about 15 seconds to reroute their connections and continue processing. These 15 seconds break down roughly into the following three phases:

1. Tivoli System Automation (TSA) detects the primary server failure and initiates the takeover on the standby server. This takes at least 9 seconds. If TSA or other clustering software is not installed, you can initiate the takeover manually, which usually takes longer.
2. The standby server takes over, which includes replaying any logs it still has in memory, undoing any in-flight transactions, and opening the database for new transactions.
3. Client reroute makes a new connection to the new primary server.

6.2 Impact of HADR synchronization mode


A second test measured the impact of the three synchronization modes (SYNC, NEARSYNC, and ASYNC) on a given SAP workload. The workload chosen was a client copy and a client delete. On this particular system, compared to the base run, the runtime in SYNC mode is about 11% slower for Copy (2:04 versus 1:52) and about 9% slower for Delete (2:48 versus 2:34).

Table 1. Client Copy/Delete times in different HADR_SYNCMODE

HADR setup     Copy elapsed time (hh:mm)   Delete elapsed time (hh:mm)
None (base)    1:52                        2:34
SYNC           2:04                        2:48
NEARSYNC       1:59                        2:42
ASYNC          1:56                        2:38
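The relative overhead of each synchronization mode can be recomputed from the elapsed times in Table 1. The following Python sketch shows the arithmetic. The helper names are hypothetical, and the figures apply only to this test system and workload.

```python
def minutes(hhmm):
    """Convert an 'h:mm' elapsed time from Table 1 into minutes."""
    hours, mins = hhmm.split(":")
    return int(hours) * 60 + int(mins)


def slowdown_percent(base, measured):
    """Extra runtime of `measured` relative to `base`, in percent."""
    return 100.0 * (minutes(measured) - minutes(base)) / minutes(base)


# Elapsed times copied from Table 1: mode -> (copy, delete).
BASE_COPY, BASE_DELETE = "1:52", "2:34"
TABLE_1 = {
    "SYNC": ("2:04", "2:48"),
    "NEARSYNC": ("1:59", "2:42"),
    "ASYNC": ("1:56", "2:38"),
}

for mode, (copy, delete) in TABLE_1.items():
    copy_pct = slowdown_percent(BASE_COPY, copy)
    delete_pct = slowdown_percent(BASE_DELETE, delete)
    print(f"{mode}: copy +{copy_pct:.1f}%, delete +{delete_pct:.1f}%")
```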

6.3 Impact of HADR log receiving buffer size


A third test measured the impact of the log receive buffer size. By default, the log receive buffer on the standby database is two times the value specified for the LOGBUFSZ configuration parameter on the primary database. There might be times when this size is not sufficient. For example, when the HADR synchronization mode is asynchronous, the primary and standby databases are in peer state, and the primary database is experiencing a high transaction load, the log receive buffer on the standby database might fill to capacity and the log shipping operation from the primary database might stall. To manage these temporary peaks, you can increase the size of the log receive buffer on the standby database by modifying the DB2_HADR_BUF_SIZE registry variable.

The workload chosen was the SD benchmark test with 600 users equally spread across 3 application servers.

Table 2. SD Benchmark performance in different DB2_HADR_BUF_SIZE

HADR setup                    Instance                  Throughput (DS/sec)   Average response time
Base (default setting)        Central Instance lunen    11.93                 6887
  HADR_SYNCMODE=ASYNC         Dialog Instance 1         15.56                 3228
  LOGBUFSZ = 1024             Dialog Instance 2         11.92                 7387
  DB2_HADR_BUF_SIZE=2048
Increased buffer size         Central Instance lunen    10.68                 9351
  HADR_SYNCMODE=ASYNC         Dialog Instance 1         15.57                 3204
  LOGBUFSZ = 1024             Dialog Instance 2         12.67                 6346
  DB2_HADR_BUF_SIZE=4096

The changes in the performance numbers above are not consistent across the SAP application servers. Therefore, with the workload we put on these servers, they indicate neither a performance gain nor a degradation from the increased HADR log receive buffer size.
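To relate the buffer settings in Table 2 to memory sizes, the following Python sketch applies the default rule described above (two times LOGBUFSZ). It assumes that both LOGBUFSZ and DB2_HADR_BUF_SIZE are expressed in 4 KB log pages. That unit is an assumption here and should be checked against the DB2 documentation for your release. The function names are hypothetical.

```python
PAGE_BYTES = 4096  # assumption: sizes are given in 4 KB log pages


def default_receive_buffer_pages(logbufsz_pages):
    """Default standby log receive buffer: twice the primary's LOGBUFSZ."""
    return 2 * logbufsz_pages


def pages_to_mib(pages):
    """Convert a page count into mebibytes."""
    return pages * PAGE_BYTES / (1024 * 1024)


logbufsz = 1024  # LOGBUFSZ used in both runs of Table 2
default_pages = default_receive_buffer_pages(logbufsz)
increased_pages = 4096  # DB2_HADR_BUF_SIZE in the "increased buffer size" run

print(f"default:   {default_pages} pages, {pages_to_mib(default_pages):.0f} MiB")
print(f"increased: {increased_pages} pages, {pages_to_mib(increased_pages):.0f} MiB")
```

Under that assumption, the base run corresponds to the default buffer and the second run doubles it.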


7. Taking backups from HADR standby image


There are many ways to take a database backup. DB2 provides multiple backup options, such as Online, Offline, Incremental Delta, Incremental Cumulative, and Split Mirror. The first four options are performed on the database server and affect production in some form, either by taking the database offline to applications or by consuming resources (CPU and disk I/O) on the server.

Split Mirror support on certain advanced storage subsystems, for example IBM ESS, enables backups to be offloaded to another server through the disk mirroring or copy feature applied to the files of the primary database. Once these mirrors have been split and mounted on the secondary server, the database can be backed up with OS tools or through DB2. There is, however, a period of time during which the database must be suspended while the mirror is being split.

With HADR, one side benefit is the ability to back up the standby database without affecting the performance of the primary database.

7.1 Database Backup on the Standby Server


The process is as follows:

1. Deactivate the database on the standby server.

Listing 38. Deactivate standby database before the backup
F:\db2>db2 list applications

Auth Id  Application    Appl.      Application Id                 DB       # of
         Name           Handle                                    Name    Agents
-------- -------------- ---------- ------------------------------ -------- -----
         db2replay      72                                        SVT      1

F:\db2>db2 deactivate db SVT
DB20000I  The DEACTIVATE DATABASE command completed successfully.

F:\db2>db2 list applications
SQL1611W  No data was returned by Database System Monitor.  SQLSTATE=00000

2. Use the split mirror function of the storage system to separate the mirrors.

3. Use OS tools, such as dd, tar, or gzip, to back up the DB2 image, which includes the database home directory (such as /DB2/<SID>/DB2<SID>) and all the tablespace container file systems (such as /DB2/<SID>/sapdata), to a backup location.

4. Reactivate the standby database:

Listing 39. Reactivate standby database after the backup


F:\db2>db2 activate db SVT
DB20000I  The ACTIVATE DATABASE command completed successfully.

F:\db2>db2 list applications

Auth Id  Application    Appl.      Application Id                 DB       # of
         Name           Handle                                    Name    Agents
-------- -------------- ---------- ------------------------------ -------- -----
         db2replay      77                                        SVT      1

7.2 Restoring to primary server


If there is a need to restore the database from this backup to the primary server, you will need to:

1. Restore all the DB2-related files from this backup image to their original locations on the primary server. The restored database should already be in rollforward pending state.

2. Make all the log files (including the active log files and archived log files from the primary server) available for processing.

3. Update the database configuration. When the backup image was made on the standby server, the HADR role of the database was standby. The HADR configuration parameters and some other database parameters, such as LOGARCHMETH1 and LOGPATH, are also copied from the standby database. They should be updated for the primary database.

Listing 40. Update database configuration parameters before roll forwarding the log files
D:\Program Files\IBM\SQLLIB\BIN>db2 get db cfg for SVT

       Database Configuration for Database SVT

 Backup pending                                          = NO
 Database is consistent                                  = NO
 Rollforward pending                                     = DATABASE
 Restore pending                                         = YES

 HADR database role                                      = STANDBY
 HADR local host name                  (HADR_LOCAL_HOST) = dartagnan
 HADR local service name                (HADR_LOCAL_SVC) = SVT_hadr_2
 HADR remote host name                (HADR_REMOTE_HOST) = phillipe
 HADR remote service name              (HADR_REMOTE_SVC) = SVT_hadr_1
 HADR instance name of remote server  (HADR_REMOTE_INST) = DB2SVT
 HADR timeout value                       (HADR_TIMEOUT) = 120
 HADR log write synchronization mode     (HADR_SYNCMODE) = SYNC

 First log archive method                 (LOGARCHMETH1) = DISK:F:\db2\NODE0000\SQL00001\log_archive\

D:\Program Files\IBM\SQLLIB\BIN>db2 stop hadr on db SVT
DB20000I  The STOP HADR ON DATABASE command completed successfully.

D:\Program Files\IBM\SQLLIB\BIN>db2 update db cfg for SVT using hadr_local_host phillipe
DB20000I  The UPDATE DATABASE CONFIGURATION command completed successfully.

D:\Program Files\IBM\SQLLIB\BIN>db2 update db cfg for SVT using hadr_remote_host dartagnan
DB20000I  The UPDATE DATABASE CONFIGURATION command completed successfully.

D:\Program Files\IBM\SQLLIB\BIN>db2 update db cfg for SVT using hadr_remote_svc SVT_hadr_2
DB20000I  The UPDATE DATABASE CONFIGURATION command completed successfully.

D:\Program Files\IBM\SQLLIB\BIN>db2 update db cfg for SVT using hadr_local_svc SVT_hadr_1
DB20000I  The UPDATE DATABASE CONFIGURATION command completed successfully.

D:\Program Files\IBM\SQLLIB\BIN>db2 get db cfg for SVT

       Database Configuration for Database SVT

 Backup pending                                          = NO
 Database is consistent                                  = NO
 Rollforward pending                                     = DATABASE
 Restore pending                                         = YES

 HADR database role                                      = STANDARD
 HADR local host name                  (HADR_LOCAL_HOST) = phillipe
 HADR local service name                (HADR_LOCAL_SVC) = SVT_hadr_1
 HADR remote host name                (HADR_REMOTE_HOST) = dartagnan
 HADR remote service name              (HADR_REMOTE_SVC) = SVT_hadr_2
 HADR instance name of remote server  (HADR_REMOTE_INST) = DB2SVT
 HADR timeout value                       (HADR_TIMEOUT) = 120
 HADR log write synchronization mode     (HADR_SYNCMODE) = SYNC

 First log archive method                 (LOGARCHMETH1) = DISK:F:\db2\NODE0000\SQL00001\log_archive\

4. Apply the log files from the original primary database to the restored database.

Listing 41. Rollforward the log files on the restored database


D:\Program Files\IBM\SQLLIB\BIN>db2 rollforward db SVT to end of logs

                                 Rollforward Status

 Input database alias                   = SVT
 Number of nodes have returned status   = 1

 Node number                            = 0
 Rollforward status                     = DB working
 Next log file to be read               = S0000002.LOG
 Log files processed                    = S0000000.LOG - S0000001.LOG
 Last committed transaction             = 2005-06-23-15.59.29.000000

DB20000I  The ROLLFORWARD command completed successfully.

D:\Program Files\IBM\SQLLIB\BIN>db2 rollforward db SVT complete

                                 Rollforward Status

 Input database alias                   = SVT
 Number of nodes have returned status   = 1

 Node number                            = 0
 Rollforward status                     = not pending
 Next log file to be read               =
 Log files processed                    = S0000000.LOG - S0000002.LOG
 Last committed transaction             = 2005-06-23-15.59.29.000000

DB20000I  The ROLLFORWARD command completed successfully.


5. Restart HADR (assuming the standby database is still up and running).

Listing 42. Restart HADR on the newly restored database


D:\Program Files\IBM\SQLLIB\BIN>db2 start hadr on db SVT as primary
DB20000I  The START HADR ON DATABASE command completed successfully.

Please be aware that from this point on, the primary database will be using a different log chain for archiving logs.
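The configuration change in step 3 amounts to swapping the local and remote HADR host and service values inherited from the standby image. The Python sketch below generates the corresponding update commands from the standby's settings. It only builds command strings in the form used in Listing 40 and does not run them. The function name `primary_hadr_updates` is hypothetical, and other parameters mentioned in step 3, such as LOGARCHMETH1 and LOGPATH, are not handled here.

```python
def primary_hadr_updates(db, standby_cfg):
    """Build 'db2 update db cfg' commands that swap local and remote HADR values."""
    swapped = {
        "hadr_local_host": standby_cfg["hadr_remote_host"],
        "hadr_local_svc": standby_cfg["hadr_remote_svc"],
        "hadr_remote_host": standby_cfg["hadr_local_host"],
        "hadr_remote_svc": standby_cfg["hadr_local_svc"],
    }
    return [
        f"db2 update db cfg for {db} using {name} {value}"
        for name, value in swapped.items()
    ]


# HADR settings as restored from the standby image (first part of Listing 40).
standby_cfg = {
    "hadr_local_host": "dartagnan",
    "hadr_local_svc": "SVT_hadr_2",
    "hadr_remote_host": "phillipe",
    "hadr_remote_svc": "SVT_hadr_1",
}

commands = primary_hadr_updates("SVT", standby_cfg)
for command in commands:
    print(command)
```

Generating the commands from the restored values, rather than typing them by hand, is one way to reduce the chance of mixing up the two hosts. Review the output before running it against a real system.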

7.3 Restoring to standby server


If the standby database becomes unusable due to a hardware problem, the split backup can be restored to the original standby database. No special configuration is required. Once the standby database files have been restored, start HADR as standby. The standby database should then go into catch-up mode, with the primary database server retrieving the log files and shipping the log records to the standby database until it reaches peer state.


8. Appendix: Reference
DB2 UDB Data Recovery and High Availability Guide and Reference, V8.2
ftp://ftp.software.ibm.com/ps/products/db2/info/vr82/pdf/en_US/db2hae81.pdf

DB2 UDB Administration Guide: Implementation, V8.2
ftp://ftp.software.ibm.com/ps/products/db2/info/vr82/pdf/en_US/db2d2e81.pdf

Automating DB2 HADR Failover using IBM Tivoli System Automation for Multiplatforms
ftp://ftp.software.ibm.com/software/data/db2/linux/tsa_hadr.pdf

FlashCopy and Remote Volume Mirror for IBM Total Storage FAStT 900 in an SAP and DB2 Environment
http://w3.ncs.ibm.com/cspaper.nsf/HTitle/0BTOS-5ZZQFT?OpenDocument
