
High Availability and Disaster Recovery for IBM Systems Solution for SAP HANA

1. Distinguish between HA and DR
It is all about continuous business processes. Planned and unplanned downtimes need to be
covered according to the business requirements. For unplanned downtimes, SAP talks about the
datacenter readiness of SAP HANA. This covers hardware and software failures, network
malfunctions, security threats, natural or man-made disasters, failures of compliance and
operations, etc.
For this document we define High Availability (HA) as covering a single-node hardware failure,
e.g. one node in a running scale-out configuration breaks for whatever reason (CPU, memory,
storage, network, ...). IBM eX5 hardware certainly provides a degree of redundancy that covers
failures like that, but failures can still happen.
Disaster Recovery (DR) solutions (also known as Disaster Tolerance (DT)) cover scenarios in
which multiple nodes fail at the same time or a whole data center goes down due to fire, flood,
or another catastrophe, and a secondary site needs to take over the SAP HANA system.
If the customer is running a side-by-side SAP HANA scenario (e.g. CO-PA, sales planning, or
smart metering), the data will still be available in the source/backend SAP Business Suite
system. Only the fast planning or analytical tasks will run significantly slower, just as before the
existence of HANA.
The situation is more critical if SAP HANA is the primary database under, e.g., BW. Then the
"productive" data sits in SAP HANA as the database, and according to the business service level
agreements, protecting against a failure is more crucial.
Either way, the recovery point objective (RPO) and recovery time objective (RTO) requirements
need to be discussed with the customer. These requirements can differ greatly between
customer scenarios.
RPO: the maximum tolerable period in which data might be lost from an IT service due to a
major incident.
RTO: the duration of time and a service level within which a business process must be
restored after a disaster or disruption.
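As a simple illustration of how the backup schedule bounds the achievable RPO (a sketch; the backup interval and incident times are made up for the example):

```python
def data_loss_window_hours(last_backup_h: float, disaster_h: float) -> float:
    """Hours of committed data lost when recovering from the most
    recent backup: everything written after that backup is gone."""
    return disaster_h - last_backup_h

# Daily backup at 02:00, disaster at 01:00 the next day: up to
# 23 hours of data are lost, so the achievable RPO is 'hours'.
# Synchronous replication (RPO = 0) removes this window entirely.
print(data_loss_window_hours(2.0, 25.0))  # → 23.0
```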

Infolink: Availability and Disaster Recovery for IBM Systems Solution for SAP HANA

2. High Availability
With the IBM Workload Optimized Solution for SAP HANA, high availability is covered within
one system at one location via GPFS functionality. The prerequisite at this time is either two
scale-out nodes plus a dedicated quorum node, or three or more scale-out nodes. Three nodes
are always required, because one needs to act as a quorum node to determine which node is the
master in the cluster. This can either be an active system used as a worker or standby node, or a
dedicated, less expensive quorum server. GPFS writes the data to the primary location and, in a
striped fashion, also replicates it to the other nodes in the cluster.
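The three-node requirement follows from majority quorum: with only two nodes, losing one leaves no majority to decide which node is master. A minimal sketch of the rule (the function and node counts are illustrative, not a GPFS API):

```python
def has_quorum(quorum_nodes: int, reachable: int) -> bool:
    """Majority node quorum: strictly more than half of the
    designated quorum nodes must remain reachable for the
    cluster to stay up."""
    return reachable > quorum_nodes // 2

print(has_quorum(3, 2))  # → True: a 3-node cluster survives one failure
print(has_quorum(2, 1))  # → False: 1 of 2 is not a majority
```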
If one of the worker nodes dies or loses its connection to the quorum node, the standby node
recovers the data (data and log files) into its main memory and the cluster keeps working without
downtime - just with some delay for queries in flight. The response to a query is provided as
soon as the data it requires has been recovered; the remainder of the data is recovered
thereafter.
With just one standby node, the cluster will have lost its HA capability until a repaired or new
node is added to the remaining worker nodes as a new standby node. With more than one
standby node, the GPFS configuration needs to be reconfigured (re-striped); please see SAP
Note 1650046. Upon completion of such reconfiguration, and with at least one standby node left
in the cluster, HA capability is maintained.
All kinds of scenarios are possible: n workers + 1 standby, n workers + m standbys, and any other
combination. Certified configurations are in the range of 2-16 nodes; larger configurations can get
certified on request at the customer site. Note that a standby node does not require an SAP
HANA license: the license fee is calculated based on the amount of memory that resides with
the worker nodes only.
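The license metric described above can be sketched as follows (the node records and field names are illustrative, not an SAP pricing API):

```python
def licensed_memory_gb(nodes):
    """Sum the memory of worker nodes only; standby nodes do not
    count toward the SAP HANA license fee, per the rule above."""
    return sum(n["memory_gb"] for n in nodes if n["role"] == "worker")

cluster = [
    {"role": "worker", "memory_gb": 512},
    {"role": "worker", "memory_gb": 512},
    {"role": "standby", "memory_gb": 512},  # excluded from the fee
]
print(licensed_memory_gb(cluster))  # → 1024
```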
To clarify the terms "hot" and "cold" standby: from an SAP HANA perspective the standby
node is considered cold, but from an infrastructure perspective it is hot, because SLES and
GPFS are running on it.
3. Disaster Recovery
The basic features for disaster recovery are available from SAP today. Details are described here.
Possible solutions:

Solution | RTO | RPO | Availability
Backup / recovery over distance | | hours (depends on backup frequency) | today
Synchronous replication to two sites (IBM GPFS based) | minutes | 0 | today
System Replication ("warm standby", SAP) | minutes | 0 | today
Log shipping | | seconds | planned: 2014
Enhanced DR (asynchronous replication) | | | planned: late
Log shipping functionality needs to be provided by SAP.
If multiple nodes fail at the same time, or if a second node fails while the GPFS reconfiguration
after a first node failure has not yet completed, then the whole cluster will go down because
some primary and secondary (replicated) data is lost. This is therefore considered a disaster. If
such a disaster happens and a whole (primary) site goes down, a scenario where a secondary
datacenter takes over is called disaster tolerant.
As of today, the following procedures are feasible and recommended.
Backup/Recovery over distance
Back up your data on the primary site regularly (at least daily) to a defined staging area, which
might be an external disk on an NFS share or a directly attached SAN subsystem (e.g. DS8K or
existing storage). Transfer the backup to the remote site regularly (mirror functionality can be
used here).
On the remote site, an identical SAP HANA system (number of nodes, size, hostnames, SID, etc.)
needs to exist. This system can run, for example, a Quality Assurance (QA), Development (DEV),
Test (TST), or other second-tier system. In case the primary site goes down, this second-tier
HANA system needs to be cleared (hostname and SID potentially adapted - a fresh install of the
SAP HANA software is recommended) and the backup can be restored. Upon configuring the
application systems to use the secondary site instead of the primary one, operation can be
resumed. In case of a disaster, SAP HANA recovers from the latest backup.
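A minimal sketch of the staging step described above, assuming the backup file already exists on disk (the paths and the checksum verification are illustrative additions, not part of the SAP tooling):

```python
import hashlib
import shutil
from pathlib import Path

def stage_backup(src: Path, staging_dir: Path) -> str:
    """Copy one backup file into the staging area and verify its
    SHA-256 checksum, so a corrupted copy is caught before the
    mirror transfers it to the remote site."""
    digest = hashlib.sha256(src.read_bytes()).hexdigest()
    dest = staging_dir / src.name
    shutil.copy2(src, dest)
    if hashlib.sha256(dest.read_bytes()).hexdigest() != digest:
        raise IOError(f"checksum mismatch for {dest}")
    return digest
```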
GPFS based synchronous replication
Alternatively, a GPFS mirror can be configured from the primary site to the secondary site. This
has been validated and certified since December 18, 2012.
The failover has to be initiated manually, the application systems have to be adapted, and the
secondary cluster has to be restarted. With such a restart, the system is restored to the latest
savepoint and the available logs are recovered. This scenario implies that the remote system is
not being used for a different SAP HANA installation, but is available in standby with current
data, ready to start with RPO = 0 in case the primary site goes down.
The maximum distance between the two sites for this solution is defined by the maximum
latency between the internal switches of the appliance. SAP allows a maximum latency of
320 µs (microseconds). Under ideal conditions this translates to a distance of about 64 km or 40
miles. SAP reserves the right to validate the DR solution at the customer site. If there is
competing traffic or the latency is too high, the customer might be asked to optimize the network.
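The 64 km figure can be checked with a back-of-the-envelope calculation: light in optical fiber travels at roughly 200,000 km/s (about two thirds of c), so the one-way latency budget translates directly into distance. This ignores switching and protocol overhead, hence "ideal conditions":

```python
SPEED_IN_FIBER_KM_PER_S = 200_000  # approximate signal speed in fiber

def max_distance_km(latency_budget_s: float) -> float:
    """Upper bound on site separation for a given one-way latency
    budget, ignoring switching and protocol overhead."""
    return latency_budget_s * SPEED_IN_FIBER_KM_PER_S

print(max_distance_km(320e-6))  # → 64.0 km under ideal conditions
```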
We have also created a solution where the remote site can run non-productive systems at the
same time. For this, additional direct-attached disk space in expansion units needs to be added.
This is a unique feature of the GPFS based DR solution.
Disaster recovery solutions based on storage replication for SAP HANA need to be validated by
SAP. Validated solutions are documented in SAP Note 1755396 -
"Released DT solutions for SAP HANA with disk replication"
SAP System Replication
Since SPS5, SAP can deliver a so-called "warm standby" solution, now called "System
Replication". With this, SAP HANA itself is able to write synchronously to a remote site.
This requires an identical system on the remote site. This system will be idle (not available for,
say, non-productive SAP HANA instances). It will be "warm", that is, the database is already
loaded in memory, which ensures a short switchover time (RTO < 5 min) if the primary site goes
down. This also applies to single-node installations on both sides. The documentation of the
SPS5 solution is here. With SPS 6, SAP delivers the same functionality asynchronously; the
documentation can be found here, with the parameter "mode=sync" set to "mode=async" instead.
The most desired solution is certainly a hot standby (using log shipping) on the secondary site.
SAP and IBM are working closely together on such a solution; at this point in time, it is not
yet available.
For more info: https://w3-!/wiki/Waef4c0eb0f35_427f_a25e_670e392682b1/p