
IBM FlashSystems HyperSwap Configuration, Architecture Guidelines and Operational Best Practices

Authors:
Anil Narigapalli
Gavin O’Reilly

Contents
IBM FlashSystem HyperSwap Overview and Configuration 04
Overview 05
IP Quorum 06
HS Configuration requirements and guidelines 07
IP Quorum Deployment 10
HyperSwap Topology Configuration Steps 13
HyperSwap Tuning Parameters 16

IBM FlashSystems HS Design Considerations and Best Practices 17


Design Considerations 18
SAN Considerations, Guidelines and Best Practices 20
HS and Remote Mirror, Standard Volumes Coexistence Guidance 21

IBM FlashSystems HS IP Quorum Design Considerations and Best Practices 23


HyperSwap IP Quorum Connectivity Topology 24
Failure of Primary IPsec VPN 25
Failure of Inter-site cluster links 26
Rolling Disaster with IPsec VPN and Storage Failures 27
Rolling Disaster with IP Quorum and Storage Failures 28
Recommendations 29

IBM FlashSystem HyperSwap Test Cases 30


HyperSwap Test Cases Approach 31
HyperSwap Test Cases Summary 32
Test Case 1: IBM FS9200 I/O Group Failure in Site 1 34
Test Case 2: IBM FS9200 I/O Group Failure in Site 2 35



Test Case 3: IP Quorum Failure 36
Test Case 4: Inter Site Communication Failure 37
Test Case 5: Host/App Node Failure in Site 1 38
Test Case 6: Host/App Node Failure in Site 2 39
Test Case 7: Complete Site 1 Failure 40
Test Case 8: Complete Site 2 Failure 41
Test Environment Configuration Sample Guidance 42

Fibre Channel Host Attachment and DB/App Settings 43

Linux 44
Oracle RAC on Linux 45
AIX 46
Windows 47
VMware 48
FC Hosts Queue Depth 49

Data Protection Guidance with HS 50

SPS, SPP and CDM 51
HyperSwap Volume Expansion 52

GTS Standard Building Block FC Port Layout and Zoning Guidance 53

IBM FS9200 FC Port Layout and Function Designation 54
IBM FS7200 FC Port Layout and Function Designation 55
IBM FS9200 and FS7200 Zoning Guidance 56

GTS Standard Building Capacity Guidance 57

IBM FS9200 and FS7200 Capacity Sizing Guidance 58
IBM FS9200 and FS7200 Capacity Sizing Guidance - Explained 59
HyperSwap Change Volumes Capacity Considerations 60



IBM FlashSystem HyperSwap Overview and Configuration

Overview
• HyperSwap (HS) is the high availability (HA) solution for IBM FlashSystems and previous Storwize systems such as FS5100, FS7200, FS9100, FS9200, V7000 etc. that provides continuous data availability in case of hardware failure, power failure, connectivity failure, site failure, or disasters.
• In MPC IaaS for Storage environments, certain accounts warrant the need for NAT’ing the node canister IPs to communicate with the IP quorum servers deployed in the ISPW cloud, which is treated as a third failure domain in the HS topology. A generic topology of the IBM FlashSystem (e.g. FS9200) HS configuration with IP quorum is depicted in the HyperSwap IP Quorum Connectivity Topology section later in this document.
• HS uses intra-cluster synchronous remote copy (Metro Mirror)
capabilities. HS essentially makes a host’s volumes accessible
across two IBM FlashSystem/Storwize system I/O groups in a
clustered system by making the primary and secondary volumes of
the Metro Mirror relationship, running under the covers, look like
one volume to the host.
• This section details the HS considerations, requirements, logical
configuration of the HS functionality with IP Quorum in IBM
FlashSystems and Storwize storage systems.
• This section also details the process for the systems which use
Network Address Translation (NAT) IPs for the IBM FlashSystem,
Storwize system node canister service IPs.
• Host configurations are not included in this section. Follow the host
configuration guidelines for HS deployment when using HS
topology.
IBM FlashSystem HyperSwap Overview and Configuration

IP Quorum
• A quorum device is used to break a tie when a SAN fault occurs, when exactly half of the nodes that were previously a member of the system are present.
A quorum device is also used to store a backup copy of important system configuration data. An IP quorum application is used in Ethernet networks to
resolve failure scenarios where half the nodes or enclosures on the system become unavailable.
• HyperSwap implementations require FC storage or an IP Quorum on a third site to cope with tie-breaker situations if the inter-site link fails, and when connectivity between sites 1 and 2 is lost. In MPC IaaS for Storage deployments, wherever feasible the ISPW location will be used as a third site to host the IP quorum application. This site is treated as an independent and a third failure domain. IP quorum is a Java-based application and is specific to a cluster.
• Key considerations while using IP Quorum are as follows:
 Never deploy IP quorum in primary and/or secondary sites.
 Deploy a minimum of 2 IP quorum instances in third failure domain i.e. site 3.
 Do not deploy the IP quorum application on a host that depends on storage that is presented by the IBM HS FlashSystem (e.g. FS9200).
 When certain aspects of the system configuration change, an IP quorum application must be reconfigured and redeployed to hosts. These aspects
include:
– Adding or removing a node from the system.
– When service IP addresses are changed on a node.
– Changing the system certificate or experiencing an Ethernet connectivity issue.
– If an IP application is offline, the IP quorum application must be reconfigured because the system configuration changed.
– When a firmware upgrade happens.
• For MPC IaaS for Storage deployments, when the quorum application is hosted in ISPW location, it is preferred to configure the IP quorum application
without a quorum disk for metadata due to bandwidth constraints.



IBM FlashSystem HyperSwap Overview and Configuration

HS Configuration requirements and guidelines


IBM FlashSystem HS configuration requirements are outlined below:
• Host independent FlashSystem (e.g. FS9200) storage systems at the primary and secondary sites i.e. each I/O group to be hosted in different
site. Each site to be an independent failure domain.
• A third site must be configured to host the IP quorum and is treated as a third failure domain.
• Third site hosting the IP quorum must be the active site with site1 and site2 also having quorum drive/disk.
• Configure one SAN with two separate fabrics dedicated to node-to-node communication. This SAN is referred to as the Private SAN.
• Configure one SAN with two separate fabrics dedicated for host attachment referred to as Public SAN.
• Minimum guaranteed bandwidth requirements for node-to-node communications and host communications:
– Private SAN - 1x the primary host writes + 2x the secondary host writes or the secondary host reads, whichever is greater, for HyperSwap.
– Public SAN - The estimated host read/write rate should the nodes at one site fail, for HyperSwap.
• On NPIV-enabled configurations, use the physical WWPN for the intra-cluster zoning.
• Use consistency groups to manage the volumes that belong to a specific application.
• To avoid the possibility of losing all the quorum devices with a single failure, run IP quorum applications on multiple servers. Recommended
to deploy 2 IP quorum applications.



IBM FlashSystem HyperSwap Overview and Configuration

HS Configuration requirements and guidelines


• IP Quorum mode can be set to either “preferred” or “winner” in HS topology. Considerations for using preferred/winner are as below:
 Preferred Site:
 When the link between the sites breaks triggering a tie-break scenario, the preferred site will request allegiance first.
 The non-preferred site will delay its allegiance request by three seconds. This gives an advantage to the preferred site.
 The nodes at the losing site are removed from the cluster.
 Only active when topology is non-standard i.e. HS or Stretched Cluster.
 Tie-break only occurs when both sites have the same number of nodes.
 Preferred site only applies when the active quorum device is an IP Quorum application.

 Winner Site:
 Use when there’s no third site for quorum.
 No active quorum device since all nodes need direct connection to a quorum device that can be used to perform tie-break.
 Without winner site configured, in a tie-break scenario, the node with the lowest ID will continue the cluster.
 This is not always the best site to continue the cluster.
 When the link between the sites breaks the nodes at the winner site will continue the cluster. The nodes at the non-winner site will be
removed from the cluster.



IBM FlashSystem HyperSwap Overview and Configuration

HS Configuration requirements and guidelines


• IP Quorum requirements:
 Connectivity from the servers that are running an IP quorum application to the service IP addresses of all node canisters.
 On each server that runs an IP quorum application, ensure that only authorized users can access the directory that contains the IP quorum
application. Metadata is stored in the directory in a readable format, so ensure access to the IP quorum application and the metadata is
restricted to authorized users only.
 Port 1260 is used by the IP quorum application to communicate from the quorum hosts to all node canisters.
 The maximum round-trip delay must not exceed 80 milliseconds (ms), which means 40 ms each direction.
 If you are configuring the IP quorum application without a quorum disk for metadata, a minimum bandwidth of 2 megabytes per second is
guaranteed for traffic between the system and the quorum application.
 If your system is using an IP quorum application with quorum disk for metadata, a minimum bandwidth of 64 megabytes per second is
guaranteed for traffic between the system and the quorum application.
 Ensure that the directory that stores an IP quorum application with metadata contains at least 250 megabytes of available free space.
 Ensure the IPQW process is configured as an automatically restartable process/service so that it comes back up after a quorum server reboot (see the sketch after this list).
 Supported OS: Red Hat Enterprise Linux 6.x, 7.x, SUSE Linux Enterprise Server 11.x, 12.x. For detailed and current requirements, please refer to the IBM Knowledge Center links.
 IBM Java 7.1/8.
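A minimal readiness-check and restart-on-reboot sketch from the IP quorum server is shown below. The service IPs, directory and use of cron are illustrative assumptions only, not IBM requirements.

   # Hypothetical node canister service IPs and quorum application directory
   for ip in 10.0.1.11 10.0.1.12 10.0.2.11 10.0.2.12; do
       nc -vz -w 5 "$ip" 1260        # the quorum application talks to every canister on TCP port 1260
       ping -c 5 "$ip"               # round-trip delay must stay below 80 ms
   done
   df -m /opt/ipquorum               # keep at least 250 MB free when metadata is stored

   # One simple way to restart the quorum application after a server reboot (root crontab entry):
   # @reboot cd /opt/ipquorum && nohup java -jar ip_quorum.jar >> ip_quorum.out 2>&1 &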

• Note: Follow the physical planning, SAN connectivity and zoning guidelines as outlined in the IBM Knowledge Center. Details on physical planning, SAN connectivity and zoning for the HS topology are not covered in this section.



IBM FlashSystem HyperSwap Overview and Configuration

IP Quorum Deployment
• This section details the process to deploy an IP quorum in the third site i.e. the ISPW site, which is treated as a third failure domain independent of the primary and secondary sites. This section covers the deployment details both for environments where NAT’ed IPs are used for the IBM FlashSystems/Storwize system (e.g. FS9200) node canister service IPs and for environments with no NAT in place.
• Certain MPC IaaS for Storage environment deployments use NAT’ed IPs for FS9200 node canister service IPs. As the quorum application generated on an IBM FlashSystems/Storwize (e.g. FS9200) cluster embeds the service IPs into the application, it is essential to ensure that the NAT’ed IPs are fed into the quorum application while it is generated on the IBM FlashSystem/Storwize (e.g. FS9200) clustered system. Otherwise the communication from the quorum server to the IBM FlashSystems/Storwize system (e.g. FS9200) service IPs will not happen. It is not allowed to change the node service IPs in the quorum application manually.
• The deployment steps for an IBM FlashSystem/Storwize system (e.g. FS9200, applicable to other IBM FlashSystems/Storwize systems such as FS5100, FS7200, FS9200, V7000 etc.) are as follows:
1. Login to the FS9200 cluster GUI.
2. Traverse to System -> Settings -> Network -> Service IPs.
3. (Applicable only if NAT’ed IPs are used) Modify the service IPs of all the node canisters to reflect the corresponding NAT’ed IP of the source service IP.



IBM FlashSystem HyperSwap Overview and Configuration

IP Quorum Deployment
4. Generate and download the IP Quorum application.
   a. Traverse to Settings -> System -> IP Quorum.
   b. Click on “Quorum Setting” and change the “mode” and “site” settings.
   c. The above step will set the quorum mode to “preferred” and the preferred site as “site_mo1”, which means that in case of a tiebreaker scenario, the preferred site will be allowed to resume the I/O operations. “SITE_MO1” is mentioned for illustration purposes only. Change it according to the environment.
   d. Download the IP quorum application by clicking the option “Download IPv4 Application”.
   e. Select the option “Download the IP quorum application without recovery metadata”.



IBM FlashSystem HyperSwap Overview and Configuration

IP Quorum Deployment
   f. The above option is selected due to the network bandwidth limitations from the ISPW site to the production sites, as the quorum drives hosted on the primary and secondary FS9200 systems host the recovery metadata. A quorum disk designated by a managed drive in the quorum configuration contains a reserved area that is used exclusively for system management. In this case, the recovery metadata will be stored on these drives and can be used in system recovery scenarios.
   g. As soon as the quorum application is created, the download is initiated to the local workstation. Save a copy of the IP Quorum application to a specific folder on the local workstation.
5. (Applicable only if NAT’ed IPs are used for node canister service IPs) Modify the service IPs of all the FS9200 node canisters back to the original IPs by replacing the corresponding NAT’ed IPs with the original service IPs.
6. Copy the IP Quorum application downloaded to the local workstation to the IP quorum server.
7. Login to the IP quorum server and change to root.
8. Change the permissions for the IP quorum application. The application file will generally be named “ip_quorum.jar”.
9. Execute the IP quorum jar file to start the IP quorum application.
   a. nohup java -jar ip_quorum.jar &
   b. Replace ip_quorum.jar with the correct IP quorum application file name.
10. Ensure the log and lock files (.lck) are generated in the directory from which the jar file is executed.
11. Verify the java processes are running on the IP quorum server.
   a. ps -ef | grep -i java
12. Login to the CLI of the FS9200 cluster.
13. Verify the quorum status using the command “lsquorum”.
14. The output must list the “IP quorum” application and mark it as “active”.
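The host-side portion of steps 7–14 can be condensed into the sketch below; the directory and file names are assumptions for illustration only.

   cd /opt/ipquorum                    # hypothetical directory holding the downloaded jar
   chmod 755 ip_quorum.jar             # step 8: set permissions
   nohup java -jar ip_quorum.jar &     # step 9: start the quorum application
   ls -l *.log *.lck                   # step 10: log and lock files should appear here
   ps -ef | grep -i java               # step 11: confirm the java process is running
   # then, on the FS9200 cluster CLI:
   # lsquorum                          # steps 12-14: the IP quorum entry should show as active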



IBM FlashSystem HyperSwap Overview and Configuration

HyperSwap Topology Configuration Steps


This section outlines the steps to be followed to perform the logical configuration of the HS topology for IBM FlashSystems/Storwize (e.g. FS9200) storage systems. It applies to greenfield deployments only. Additional considerations will be applicable for brownfield configurations, which are outside of the scope of this document.

1. Reset cluster configuration (applicable only if the IBM FlashSystems/Storwize (e.g. FS9200) systems in site 1 and site 2 are configured as independent clusters).
   A. Login to the secondary FS9200 cluster.
   B. Collect the node canister IDs from the FS9200 system using the command “lsnodecanister”.
   C. Remove the node canisters from the cluster. Remove the non-config node canister first, followed by the config node.
      a. rmnodecanister <non config node id>
      b. rmnodecanister <config node id>
   D. Login to the service assistant GUI of the secondary FS9200 cluster and collect the panel names of the node canisters.
   E. Exit the service state for both the node canisters of the secondary FS9200 cluster.
      a. satask stopservice <panel_name_non_config_node_canister>
      b. satask stopservice <panel_name_config_node_canister>
   F. Reset the cluster id of the secondary FS9200 cluster.
      a. satask chenclosurevpd -resetclusterid
   G. Now the secondary FS9200 control enclosure will be visible as a candidate in the source FS9200 storage system.



IBM FlashSystem HyperSwap Overview and Configuration

HyperSwap Topology Configuration Steps


2. Activate the Encryption licenses on the secondary IBM FlashSystem/Storwize system (e.g. FS9200).
   A. Login to the CLI on the source FS9200 system.
   B. Apply and activate the encryption license for the secondary FS9200 system.
      a. svctask activatefeature -licensekey <XXXX-XXXX-XXXX-XXXX>
      b. Replace <XXXX-XXXX-XXXX-XXXX> with the actual license key of the secondary FS9200 system.
   C. Verify the encryption license is activated for the secondary FS9200 system.

3. Name the Sites
   A. Login to the CLI on the source FS9200 system.
   B. List the sites using the “lssite” command.
   C. Change the site names to meaningful names according to their location, e.g. primary site as site1, secondary site as site2, third site where the IP quorum is hosted as quorumsite.
      a. chsite -name <New_Site_Name> <Site_ID>

4. Add the secondary enclosure in the primary IBM FlashSystem/Storwize system (e.g. FS9200) cluster.
   A. Login to the CLI on the source FS9200 system.
   B. Execute the below command to add the secondary FS9200 control enclosure to the source FS9200 cluster. This will add the secondary FS9200 system node canisters as a second I/O group in the existing FS9200 cluster.
      a. addcontrolenclosure -iogrp 1 -sernum <Sec_FS9200_Serial> -site <sec_site_name>
      b. Replace <Sec_FS9200_Serial> with the secondary FS9200 serial number.
      c. Replace <sec_site_name> with the secondary site name.



IBM FlashSystem HyperSwap Overview and Configuration

HyperSwap Topology Configuration Steps


5. Assign Node Canisters to Sites
   A. Login to the CLI on the source FS9200 system.
   B. Assign the primary FS9200 node canisters to the primary site i.e. site1
      a. chnodecanister -site <primary_site_name> <primary_FS9200_nodecanister_1>
      b. chnodecanister -site <primary_site_name> <primary_FS9200_nodecanister_2>
   C. Assign the secondary FS9200 node canisters to the secondary site i.e. site2
      a. chnodecanister -site <secondary_site_name> <secondary_FS9200_nodecanister_1>
      b. chnodecanister -site <secondary_site_name> <secondary_FS9200_nodecanister_2>
   D. Validate the node canister to site assignments using the command:
      a. lsnodecanister

6. Change the System Topology
   A. Login to the CLI on the source FS9200 system.
   B. Execute the below command to change the system topology to “HyperSwap”
      a. chsystem -topology hyperswap
   C. Validate the topology with the command “lssystem”.
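A quick post-configuration verification sketch is shown below; output fields are abbreviated and object names are environment specific.

   lsnodecanister      # each canister should report its assigned site_name
   lssystem            # the topology field should show hyperswap
   lsquorum            # the IP quorum application should be listed and active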



IBM FlashSystem HyperSwap Overview and Configuration

HyperSwap Tuning (Adjustable) Parameters


• When a HyperSwap cluster is deployed, the system creates a local partnership.
• The local partnership definition created when HS topology is configured will have default settings for “linkbandwidthmbits”, “backgroundcopyrate” and “relationship bandwidth limit”.
• These parameters control the rate at which the synchronization and resynchronization happens between controllers.
• The default settings are very low and can impact the sync/re-sync operation performance and increase the time required to complete those operations.
• Background copy comes into play in HS deployments in the below scenarios:
 When an existing non-HS volume is converted to a HS volume.
 When the HS active-active mirror is suspended due to a failure and reactivated at a later point in time.
• linkbandwidthmbits and backgroundcopyrate will affect the initial sync and re-sync process (background copy) for any remote copy relationships (including HyperSwap).
• It is recommended to tune the parameters listed below in line with the inter-node link (private SAN) bandwidth provisioned in the HS cluster:
– linkbandwidthmbits
– backgroundcopyrate
– relationshipbandwidthlimit
• Recommended settings:
– linkbandwidthmbits - Set to 80% of the private SAN (inter-node) bandwidth.
– backgroundcopyrate - Set to 35% of linkbandwidthmbits.
– relationship bandwidth limit can be set if needed to reduce or increase the bandwidth available for a relationship or relationships; however, this is a global setting: the value applies equally to all relationships.
• Considerations for the above-mentioned settings:
– linkbandwidthmbits should not exceed the throughput of the inter-site link.
– The backgroundcopyrate percentage should be set so that the remaining bandwidth is enough to service the expected host I/O workload, to avoid possible impact to host I/O performance.
• Caution: Setting the backgroundcopyrate to greater than 35% of the overall linkbandwidthmbits can have an impact on the node-node communications (foreground IO) when a sync/resync happens and contends for the private SAN bandwidth. If needed, this can be tuned up to 50% of linkbandwidthmbits.
• There are no specific recommended values for the above parameters. Consider the overall bandwidth of the private SAN (purely intended for node-node communications) and the intended HS workloads, and adjust the parameters accordingly.
• Currently, these settings are configurable only from the CLI with Spectrum Virtualize 8.3.x and lower versions. An illustrative CLI sketch follows the references below.
Reference(s):
• https://www.ibm.com/support/knowledgecenter/STSLR9_8.3.1/com.ibm.fs9100_831.doc/svc_modifyremoteclusterpartnership_21kdga.html
• https://www.ibm.com/support/knowledgecenter/STSLR9_8.3.1/com.ibm.fs9200_831.doc/svc_chsystem.html
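An illustrative CLI sketch for these parameters is shown below. The values assume a private SAN of roughly 10,000 Mbit/s and are placeholders, not a sizing recommendation.

   lspartnership                                                                    # identify the local (intra-cluster) partnership id
   chpartnership -linkbandwidthmbits 8000 -backgroundcopyrate 35 <partnership_id>   # ~80% of the link; 35% of that for background copy
   chsystem -relationshipbandwidthlimit 25                                          # optional global per-relationship limit in MBps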



IBM FlashSystems HyperSwap Design Considerations and Best Practices

Design Considerations
• HyperSwap configurations must span three independent failure domains. Two failure domains contain storage systems e.g. FS9200s. The third failure domain i.e. the third site contains the IP quorum instances. The IP quorum should never be in the same failure domain as the storage systems.
• The distance between the two failure domains containing storage should not exceed 100 KM and a round trip latency of 3 ms on the fibre channel SAN extension.
• A minimum of one FS9200 control enclosure is required in each of the two failure domains containing storage.
• Deploy identical storage systems in both the sites i.e. same model, same hardware specs and same capacity.
• FlashCopy can be used to take point-in-time copies of HyperSwap volumes to use for testing, cloning and backup. However, FlashCopy mapping copy direction cannot be reversed for restoring to a volume that is in a HyperSwap relationship.
• The use of VMware vSphere Virtual Volumes (VVols) on a system that is configured for HyperSwap is not currently supported.
• Host multipath drivers must be configured with the ALUA multipath policy.
• Requirement to use GTS LCM Code guidance and ensure the FS9200s are at the target level 8.3.x or higher code level.
• Standalone non-HyperSwap configured FS9150/FS9200/Storwize has a max usable capacity of 875 TB as per GTS sizing guidelines.
• HyperSwap configured FS9200 / Storwize maximum cluster-wide (2 I/O groups) capacity is 1.75 PB.
 When migrating off large capacity environments (i.e. SVC with 2+ PB) you can deploy multiple FS9200 HyperSwap pairs.
 The maximum number of HyperSwap enabled volumes per FS9200 HyperSwap pair is 1250.
 For a two I/O group HyperSwap system, the maximum allowed is 875 TB of host data.
 Make sure to consider the 1250 limitation when defining which applications / hosts will use HyperSwap volumes. You cannot split an application and/or its hosts within a Consistency Group between different HyperSwap pairs.
 Usage of all 1250 HyperSwap relationships is dependent on no other FlashCopy (FC) mapping use cases. If you are using FlashCopy (FC) mappings for other use cases, the 1250 maximum will be reduced by a quantity of 1 for every quantity of 4 FlashCopy mappings consumed.
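As a worked illustration of the reduction rule above (the figures are assumptions): if 1,000 FlashCopy mappings are consumed by other use cases, the HyperSwap volume maximum for that pair drops by 1,000 / 4 = 250, leaving 1250 - 250 = 1000 HyperSwap volumes.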



IBM FlashSystems HyperSwap Design Considerations and Best Practices

Design Considerations
• Bitmap space considerations (an illustrative CLI sketch appears at the end of this slide):
 The sum of all bitmap memory allocation for all functions (which include Remote Copy (Metro Mirror, Global Mirror, and active-active relationships), RAID, Volume Mirroring) except FlashCopy must not exceed 552 MiB per I/O group.
 For the 2 I/O group IBM FlashSystems/Storwize systems completely dedicated to HS requirements, the overall storage volume capacity must be considered carefully.
 1.75 PB virtual capacity for a 2 I/O group IBM FlashSystems/Storwize system is ideal. This implies 875 TB per I/O group.
 For 875 TB of HyperSwap volumes, each of the two I/O groups needs:
  438 MB of remote copy bitmap space.
  Around 70 MB of RAID bitmap space.
  20 MB of volume mirroring bitmap space, which can accommodate 40 TB of mirrored volumes.
  This ideal volume capacity will leave around 24 MB of buffer for any operations that require bitmap space.
• Rolling Disaster Scenario - Important Note(s):
 Access loss between failure domains 1 and 2 (which contain the FS9200 I/O groups) because of a rolling disaster. One site is down, and the other is still working. Later, the working site also goes down because of the rolling disaster.
 The system continues to operate the failure domain with the IBM FlashSystems/Storwize system I/O Group which wins the quorum race. The cluster continues with operation, while the I/O Group node canister in the other failure domain stops. Later, the “operational” IBM FlashSystems/Storwize I/O Group goes down too because of the rolling disaster. All IBM FlashSystems/Storwize system I/O Groups are down.
 In this scenario, the frozen surviving site can be restarted by manually triggering the “overridequorum” command, which performs the manual quorum override. However, note that “overridequorum” will modify the revived part of the cluster (the one that lost quorum in a disaster scenario) so that all offline parts of the cluster will not be able to join it again. Therefore, all HyperSwap relationships will have to be reconfigured and rebuilt after the disaster recovery operations. Hence it is strongly recommended to not use the “overridequorum” command without product support guidance and recommendations.
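Returning to the bitmap space considerations above: if the per-I/O-group allocations need to be adjusted, this can be done from the CLI. The sizes below simply mirror the example figures above and are illustrative, not a directive.

   lsiogrp 0                                    # review the current memory allocations for the I/O group
   chiogrp -feature remote -size 438 io_grp0    # remote copy (includes active-active) bitmap space, in MB
   chiogrp -feature mirror -size 20 io_grp0     # volume mirroring bitmap space, in MB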



IBM FlashSystems HyperSwap Design Considerations and Best Practices

SAN Considerations, Guidelines and Best Practices


• A HyperSwap cluster must have a dedicated means of communication between the nodes in the cluster.
• Redundant fabrics with each fabric divided into public and private dedicated SANs.
• Dedicated inter-site links (ISL) for the public and private SANs.
• Implement each redundant fabric on a separate provider rather than having both fabrics on both providers.
• Size the inter-site links for the private SAN appropriately, as all of the writes for HyperSwap volumes traverse these links.
• All links between sites on the public and private SANs should be trunked, rather than using separate ISLs.
• For Cisco SAN, ensure to have a separate/dedicated ISL or port-channel and allow only private VSAN traffic.
• For Brocade SAN, do not allow the private SAN to traverse over shared XISLs. Deploy separate ISLs in the virtual fabrics used for Private and Public SANs.
• Private fabrics must be completely private – including implementing separate ISLs from public fabrics. Do not use the Private SAN for any other workloads.
• Refer to the references below for more details on SAN best practices.
• Zoning Considerations (an illustrative zoning sketch follows the references):
 Enable the NPIV feature on the cluster, and zone all hosts to the virtual WWPNs on each node port.
 Use the physical WWPNs on the node ports for inter-node communication. Do not zone any hosts or controllers to these ports.
 Use the physical WWPNs on the node ports that are used for inter-cluster (replication) communication. Do not zone any hosts or controllers to these ports.
 Use the physical WWPNs on the node ports for controller (storage) communication.
 The private fabrics should only have a single zone each that includes all of the cluster node ports attached to those fabrics.

References:
http://www.redbooks.ibm.com/redpapers/pdfs/redp5597.pdf
https://www.ibm.com/support/pages/node/6248725
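A minimal private-fabric zoning sketch for one Brocade fabric is shown below. The zone/configuration names and WWPNs are placeholders; the point is the single zone containing only the cluster node physical ports attached to that fabric.

   zonecreate "z_hs_private_cluster", "50:05:07:68:xx:xx:xx:01; 50:05:07:68:xx:xx:xx:02; 50:05:07:68:xx:xx:xx:03; 50:05:07:68:xx:xx:xx:04"
   cfgcreate "cfg_private_fabA", "z_hs_private_cluster"
   cfgenable "cfg_private_fabA"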



IBM FlashSystems HyperSwap Design Considerations and Best Practices

HyperSwap and Remote Mirror, Standard Volumes Coexistence in a HS Cluster

• This section applies to the flash systems running Spectrum Virtualize (SV) 8.2.x and 8.3.x versions.
• This section provides guidance and best practice recommendations on volume creation in a HyperSwap (HS) cluster and the coexistence of HS and Remote Mirror i.e. Metro/Global Mirror (MM/GM) on IBM Storage systems running SV 8.2.x and 8.3.x.
• This guidance is aimed at educating the teams on the various considerations that help the accounts prevent the occurrence of Client Impacting Events (CIEs).
• SV 8.2.x and 8.3.x technically allows coexistence of HyperSwap and Remote Mirror i.e. Metro/Global Mirror between two clusters. In an enterprise storage environment, there can be a need for HS and MM/GM coexistence in certain scenarios. The guidance in this section helps the teams decide “when to” and “when not to” use both functionalities.
• Existence of a remote copy partnership between the two HS clusters can have serious impact on HS performance, whether or not there are any active remote copy relationships. Utilizing remote copy relationships along with HS creates severe performance issues on the storage systems; they must not be used together. This is primarily due to the design of the communication within a cluster i.e. intra-cluster and between clusters i.e. inter-cluster. Technical details are outlined below.
 When a node wants to communicate either state or data to another node or system, it needs resources on the remote system available to immediately process that request when it arrives.
 In order to do this the system uses credits. Each credit will correspond to an amount of resource on the remote system; each node will have a specific amount of credit for every other node/canister it can communicate with.
 This means each node can send a certain amount of data/information to another node based on the credit it has available to communicate with that node. Once the credit has been used, new messages will queue.
 Obviously, this credit also uses resource on the local node/canister. With HS systems, in order to get sufficient performance, the distribution of the credits is such that each node has ‘a lot’ of credit for communicating with nodes in the same cluster.



IBM FlashSystems HyperSwap Design Considerations and Best Practices

HyperSwap and Remote Mirror Coexistence and Standard Volumes in a HS Cluster

 Now if the user then creates a remote copy relationship, such as MM, this credit distribution is altered to allow the MM to function, but it means local credit resource is taken away from the HS volumes. This causes a performance drop for these HS volumes.
• In certain storage refresh projects/activities, depending on the migration approach, there can be a need for utilizing storage functions to migrate the data from old storage systems to new HS storage clusters.
 Coexistence of replication partnerships and HS relationships must be avoided. If at all possible, avoid storage level migrations that require a remote copy partnership from an old storage system to a new HS cluster. Use host-based migrations to migrate/copy the data directly onto new HS volumes.
 If a storage level migration needs to be used, one alternative approach would be to use MM/GM to copy the data to non-HS volumes with a single copy, remove the remote copy partnership, and then convert the new target volume into a HS volume to enable high availability.
• “DO NOT” configure/provision standard (non-HS) volumes in a HyperSwap cluster for production workloads. Standard i.e. non-HS volumes in a HS cluster will pose availability issues. There is a potential risk of losing access to standard volumes (i.e. volumes on the losing site) when a link failure occurs between the two halves of the HS cluster. In such an event, only one half of the cluster (i.e. IO group(s)) remains online and the other half of the cluster (i.e. IO group(s)) will go offline. Hence, any standard i.e. non-HS volumes residing on the losing site will go offline, translating into a major client impacting event.
 Exception scenario – It is allowed to provision standard volumes in a HS cluster when migrating the storage from an old storage system to a HS cluster. In this scenario, standard volumes can only remain on the system for the period of the migration. Post cutover of the storage to the HS cluster, the standard volumes must be converted to HS volumes and then the IO must be resumed on the volumes.
 Note that a HS volume cannot be either source or target in a remote mirror (i.e. MM, GM, GMCV) relationship.



IBM FlashSystems HyperSwap IP Quorum Design Considerations and Best Practices

HyperSwap IP Quorum Connectivity Topology


IP Quorum Connectivity Flow using IPsec VPN:
The Site-3 external network connectivity is established between the client sites hosting FS9000 Storage and an external cloud hosted server for the IP Quorum. The IPsec VPN configuration typically used in this type of design is a Primary (Active) with Standby (Inactive) solution.

IP quorum traffic between the storage sites and the quorum site routes via the Primary IPsec VPN to Site-1 and traverses inter-site IP network links to Site-2 under normal operations. The Standby route only becomes active if the Primary route is unavailable.

Primary IPsec VPN:
1. The traffic from the IP Quorum-A server hosted in Site-3 to the FS9200 in Site-1 happens as depicted in flows 1, 2.
2. The traffic from the IP Quorum-A server hosted in Site-3 to the FS9200 in Site-2 will happen as depicted in flows 1, 3, 4 using Inter DC IP connectivity.

Standby IPsec VPN:
3. The traffic from the IP Quorum-A server hosted in Site-3 to the FS9200 in Site-1 will happen as depicted in flows 5, 7, 8 using Inter DC IP connectivity.
4. The traffic from the IP Quorum-A server hosted in Site-3 to the FS9200 in Site-2 happens as depicted in flows 5, 6 using the Standby IPsec VPN.

[Topology diagram: Cloud Site-3 hosts IP Quorum A; Client Site-1 (IBM FS9200 - A) and Client Site-2 (IBM FS9200 - B) are linked by Inter DC IP connectivity and FC connectivity; Primary and Standby IPsec VPNs connect Site-3 to Site-1 and Site-2; numbered flows 1-8 mark the traffic paths.]
IBM FlashSystem HyperSwap IP Quorum Design Considerations and Best Practices
Failure of Primary IPsec VPN (No Impact to HyperSwap Cluster Availability)

Primary IPsec VPN Failure:
• When the Primary IPsec VPN fails, the connectivity between the IP Quorum-A server and the FS9200s in Site-1 and Site-2 will be lost.
• In this scenario, the FS9200 systems will continue to function.
• The Site-3 Quorum will become active after the Standby IPsec VPN route is enabled.
• The availability of the Standby IP route should be enabled by an automated recovery mechanism. If this is not possible and manual recovery is required, a robust monitoring solution must be in place.
• The traffic from the IP Quorum-A server hosted in Site-3 to the FS9200 in Site-1 will happen as depicted in flows 5, 7, 8 using the Standby IPsec VPN and Inter DC IP connectivity.
• The traffic from the IP Quorum-A server hosted in Site-3 to the FS9200 in Site-2 happens as depicted in flows 5, 6 using the Standby IPsec VPN.

[Topology diagram: same layout as the previous slide, with the Primary IPsec VPN failed and quorum traffic rerouted over the Standby IPsec VPN and Inter DC IP connectivity.]



IBM FlashSystem HyperSwap IP Quorum Design Considerations and Best Practices
Failure of Inter-site cluster links (I/O is served from the Preferred Site FS9200 system)

HyperSwap Inter-site Links go down – Site-1 to Site-2 links dead (Split-Brain):
• The cluster here suffers a split-brain scenario; although both Site-1 and Site-2 are online, the cluster has been split and a tie break is required to elect a winning site.
• We have two online halves, so the active IP quorum is used as a tie break device.
• Whichever half talks to the IP quorum first, and locks it, wins the quorum race, now has 3 votes and continues as the running site.
• As the GTS Best Practice is a Preferred Site configured IP Quorum, there is a 3 second head start for that site.
• The non-preferred site contacts the IP quorum, sees that it is locked and halts any I/O through its nodes.
• In this scenario, the FS9200 systems will continue to function only from the preferred site.

[Topology diagram: same layout as the previous slides, with the inter-site cluster links down; both halves race to the IP quorum and the Site-2 FS9200 I/O group goes offline.]



IBM FlashSystem HyperSwap IP Quorum Design Considerations and Best Practices

Rolling Disaster with IPsec VPN and Storage Failures


Impact: Loss of access to quorum. No impact to the cluster. However, a second failure of either site could result in loss of access.
Mitigation: Automatic switching of the primary VPN to standby in less than 5 seconds, enabling the surviving site to resolve the tie break in the event of a rolling disaster and continue operation.
• When the Primary IPsec VPN fails, the connectivity between the IP Quorum-A server and the FS9200s in both sites will be lost until the Standby route becomes active.
• Without an external quorum device the cluster will resolve a tie-break between an equal number of nodes based on node ids. Only the IO group containing the lowest node ID is permitted to continue. However, as node ids cannot be controlled, this mechanism cannot be relied upon to influence a “preferred” site.
• If site A experiences a failure while IP quorum A is unavailable, the “worst case” is nodes at site B may lease expire and stop processing I/O.
• If site B experiences a failure while IP quorum is unavailable, the “worst case” is nodes at site A may lease expire and stop processing I/O.
• If access to the IP quorum device is restored (either by recovering the Primary IPsec VPN connection or enabling the standby connection) then the nodes at the surviving site can come online again, but there will have been an I/O outage. A manual override would only be required if the IP quorum couldn't be restored successfully.

[Topology diagram: 1st failure - Primary IPsec VPN down with the Standby not yet active; Site-1 FS9200 subsequently down; worst case - the HyperSwap cluster lease expires and stops functioning on both sites (Site-2 FS9200 may lease expire).]



IBM FlashSystem HyperSwap IP Quorum Design Considerations and Best Practices

Rolling Disaster with IP Quorum and Storage Failures


Impact: Loss of access to quorum. No impact to the cluster. However, a second failure of either site could result in loss of access.
Mitigation: A redundant IP quorum application, enabling the cluster to activate the standby quorum app, enabling either site to survive in the event of a rolling disaster and continue operations.
• IP Quorum failure scenarios: when the IPQW process gets stopped, or the VM hosting the IPQW app experiences an OS failure, or the underlying hardware on which the VM is hosted experiences failures. When IP Quorum A fails, the communication between the IP Quorum-A server and the FS9200s in Site-1 and Site-2 will be lost.
• Without an external quorum device the cluster will resolve a tie-break between an equal number of nodes based on node ids. Only the IO group containing the lowest node ID is permitted to continue. However, as node ids cannot be controlled, this mechanism cannot be relied upon to influence a “preferred” site.
• If site A experiences a failure while IP quorum A is unavailable, the “worst case” is nodes at site B may lease expire and stop processing I/O.
• If site B experiences a failure while IP quorum is unavailable, the “worst case” is nodes at site A may lease expire and stop processing I/O.
• If access to the IP quorum device is restored then the nodes at the surviving site can come online again, but there will have been an I/O outage. A manual override would only be required if the IP quorum couldn't be restored successfully.

[Topology diagram: 1st failure - IP Quorum A down at Cloud Site-3; Site-1 FS9200 subsequently down; worst case - the HyperSwap cluster lease expires and stops functioning on both sites (Site-2 FS9200 may lease expire).]



IBM FlashSystem HyperSwap IP Quorum Design Considerations and Best Practices

Recommendations
• Redundant IP quorum applications
 Deploy two (2) IP quorum application instances on two different VMs.
 Ensure physical hardware redundancy for the VMs.
 The system will automatically activate the standby instance if the first instance fails.
• Redundant network links to the quorum site
 Have two IPsec VPN connections - one to the primary site and one to the secondary site.
 Have inter-site IP connectivity between the primary and secondary sites for cross-site FS9200 to IPQW communication.
 Enable auto switching of the traffic from the primary VPN to the secondary VPN and vice-versa in the instance of either of the IPsec VPN failures.
 The maximum round-trip delay must not exceed 80 milliseconds (ms), which means 40 ms each direction.

[Topology diagram: Cloud Site-3 hosting IP Quorum A and IP Quorum B, connected to Client Site-1 (IBM FS9200 - A) and Client Site-2 (IBM FS9200 - B) through the Primary and Standby IPsec VPNs, Inter DC IP connectivity and FC connectivity.]



IBM FlashSystem HyperSwap Test Cases

HyperSwap Test Cases Approach


• Testing of a HA Storage solution needs to be performed with a two-phase approach. The goal of the testing is to confirm the functionality of the HA storage solution with the host operating system and application stack as an end-to-end test.

• The testing is aimed at confirming that the HyperSwap functionality works in specific failure scenarios. As a best practice, a testing period should be allocated in the project implementation plan. Whether the solution is being deployed in a new greenfield or an existing brownfield environment will determine what level of testing can be achieved. Ensure to follow the guidance in the following slides closely when approaching testing in a brownfield deployment, as the testing method used to simulate each test differs.

• Phase-1 Testing is performed by the Storage support team to prove the HyperSwap Storage infrastructure implementation has been installed
and configured as per best practice and is working as designed. Scope of components tests to include Storage unit, SAN Fabric and Quorum
testing only.

• Phase-2 Testing is performed by the Storage support team in partnership with all the Operating System and application support teams configured to use the HyperSwap feature. This testing phase is to prove end-to-end functionality. Scope of component tests to include Storage unit, SAN Fabric and Quorum testing, plus Host OS and Application failover compatibility.



IBM FlashSystem HyperSwap Test Cases

HyperSwap Test Cases Summary

Phase-1: Pre-Go-Live Test Cases – compulsorily to be performed before placing in production:
• Test Case 1: IBM FS9200 I/O group failure in site 1.
• Test Case 2: IBM FS9200 I/O group failure in site 2.
• Test Case 3: IP Quorum Failure.
• Test Case 4: Inter site communication failure.

Phase-2: Operating system and Application End to End HyperSwap Test Cases:
• Test Case 5: Host/App Node Failure in site 1.
• Test Case 6: Host/App Node Failure in site 2.
• Test Case 7: Complete Site 1 Failure.
• Test Case 8: Complete Site 2 Failure.

Caution: In a brownfield deployment, no testing that could cause an outage to a pre-existing production environment can be performed. Test cases 4, 7 and 8 are disruptive testing scenarios and will only be performed in a greenfield deployment.



IBM FlashSystem HyperSwap Test Cases

HyperSwap Test Cases Summary


Post-Production Test Cases (Brownfield – HyperSwap deployed in Production Environments):
• Test Cases 1, 2, 3, 5 and 6.
Considerations/Guidelines for Post-Production Test Cases:
• Follow a proper change management process.
• Perform/Simulate the test cases in a controlled change window.
• Perform/Simulate the test cases ONLY in Dev/Test environment. Do not attempt on PROD systems/apps.
• Identify/Define the test case systems and boundaries clearly. Ensure the test impact to the environment is kept to a bare minimum.
• Use ONLY Brownfield simulation method outlined for a specific test case for the testing.
• Validate all the pre-test HyperSwap related parameters on the host/app systems participating in the test.
• Validate the health of the storage systems and IP QW prior to the test case simulation.



IBM FlashSystem HyperSwap Test Cases

Test Case 1: IBM FS9200 I/O Group Failure in Site 1


• Objective: To simulate the failure of FS9200 control enclosure in site 1 (preferred
site).
• Estimated Behavior/Impact: Little to no impact to host/app storage access, whether or not a host OS or application cluster is configured. The host/app I/O will continue using the paths to the site 2 FS9200 storage system.
• Simulation Method:
 Greenfield: Power down FS9200 storage system in site 1 (or) place the site 1 IO
group nodes in service mode.
 Brownfield: Change the site id to site 2 for a host using the “chhost” command (see the sketch at the end of this test case).
• HyperSwap State Change and I/O Behavior:
 Replication reverses and suspends. Active-Active mirror between primary and
aux volumes is disabled.
 Host I/O continues from host at Site1.
 Front-end volume I/O switches from IOG0 at Site1 to IOG1 at Site2.
• Multipath Behavior:
 Paths fail over from preferred paths on IOG0 to non-preferred paths on IOG1.
• I/O pause Time Length:
 Expected 35 seconds with possible 60 seconds. IO Pause is considered as the
time difference between the host IO rate drop to when it starts to increase again.
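For the brownfield simulation above, a test host's site affinity can be changed and later reverted from the CLI; a sketch is shown below (the host name is a placeholder).

   lshost testhost01                 # note the current site_name of the test host
   chhost -site site2 testhost01     # move the host's site affinity to site 2 for the test
   chhost -site site1 testhost01     # revert the affinity after the test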



IBM FlashSystem HyperSwap Test Cases

Test Case 2: IBM FS9200 I/O Group Failure in Site 2


• Objective: To simulate the failure of the FS9200 control enclosure in site 2 (non-preferred site).
• Estimated Behavior/Impact: Little to no impact to host/app storage access, whether or not a host OS or application cluster is configured. The host/app I/O will continue using the paths to the site 1 FS9200 storage system.
• Simulation Method:
 Greenfield: Power down FS9200 storage system in site 2 (or) place the site 2 IO
group nodes in service mode.
 Brownfield: Change the site id to site 1 for a host using “chhost” command.
• HyperSwap and I/O Behavior:
 Replication reverses and suspends. Active-Active mirror between primary and aux
volumes is disabled.
 Host I/O continues from host at Site2.
 Front-end volume I/O switches from IOG1 at Site2 to IOG0 at Site1.
• Multipath Behavior:
 Paths fail over from preferred paths on IOG1 to non-preferred paths on IOG0.
• I/O pause Time Length:
 Expected 35 seconds with possible 60 seconds. IO Pause is considered as the time
difference between the host IO rate drop to when it starts to increase again.



IBM FlashSystem HyperSwap Test Cases

Test Case 3: IP Quorum Failure


• Objective: To simulate the failure of IP Quorum Witness in site 3.
• Estimated Behavior/Impact: Loss of access to IP QW is completely transparent. No
interruption to I/O is expected.
• Simulation Method: Stop the IP QW process or shutdown the VM hosting the
active IP QW app.
• HyperSwap and I/O Behavior:
 No changes observed.
 Replication direction remains as-is.
 Standby IP quorum is activated instantaneously.
 No impact to HS topology and functionality.
• Multipath Behavior:
 No changes observed.
• I/O pause Time Length:
 No IO pause observed.
 No fluctuation in IO rates observed.



IBM FlashSystem HyperSwap Test Cases

Test Case 4: Inter Site Communication Failure


• Objective: To simulate the failure of inter site communication failure.
• Estimated Behavior/Impact: The system continues to operate the failure domain with the FS9200 I/O
Group, which wins the quorum race. The cluster continues with operation, while the I/O Group node
canister in the other failure domain stops.
• Simulation Method:
 Greenfield: Disable the inter site private SAN ISL ports on both fabrics (A and B) used for inter-site
node to node communication on the SAN switches.
 Brownfield: Do not execute this test case.
• HyperSwap and I/O Behavior:
 When the inter-node communication is stalled, a race to the IP quorum happens.
 With preferred site configured, site 1 is given priority to contact the IP quorum.
 Site 1 (preferred site - IO group 0) wins the quorum race and assumes control of the cluster. IO group 1 goes offline. Standalone hosts in site 2 will experience downtime.
 Replication suspends. Resyncs from master to aux when HS is restored.
 Host I/O continues from host at Site1 (preferred site).
 Front-end volume I/O continues from IO Group 0 at Site 1 as-is.
• Multipath Behavior:
 No failover of paths.
• I/O pause Time Length:
 I/O pause length = Short lease time + Quorum race + Hold time + Pending IO resumption.
 Expected I/O pause lengths of less than 60 secs.
IBM FlashSystem HyperSwap Test Cases

Test Case 5: Host/App Node Failure in Site 1


• Objective: To simulate the failure of Host, App, DB in site 1.
• Estimated Behavior/Impact: Host server that is part of a host OS cluster or application cluster
fails in the primary site - Host OS cluster or application cluster performs automatic fail-over to
continue in site 2 for storage access. Potential here for tuning of Host OS, MPIO, application
cluster time-out values while cluster site swap happens.
• Simulation Method:
 Greenfield: Halt/Shutdown/failover the host/app/DB in site 1.
 Brownfield: Halt/Shutdown/failover the host/app/DB in site 1. Do this only for TEST hosts
with a controlled change window.
• HyperSwap and I/O Behavior:
 Storage IO availability is not impacted; IO will continue to be serviced from the site-2 storage to which the failed-over host's site affinity is set. Replication reverses and continues from site-2 to site-1.
• Multipath Behavior:
 Based on the host's site affinity, multipathing selects the paths to site-2 storage, and the host in Site-2 performs IO over its preferred paths to the site-2 storage.
• I/O pause Time Length:
 No pause in IO from storage as no failure in the Storage. IO pause depends on the time taken to
activate the failover host at the compute layer.



IBM FlashSystem HyperSwap Test Cases

Test Case 6: Host/App Node Failure in Site 2


• Objective: To simulate the failure of Host, App, DB in site 2.
• Estimated Behavior/Impact: Host server that is part of a host OS cluster or application cluster fails in
the primary site - Host OS cluster or application cluster performs automatic fail-over to continue in site
1 for storage access. Potential here for tuning of Host OS, MPIO, application cluster time-out values
while cluster site swap happens.
• Simulation Method:
 Greenfield: Halt/Shutdown/failover the host/app/DB in site 2.
 Brownfield: Halt/Shutdown/failover the host/app/DB in site 2. Do this only for TEST hosts with a
controlled change window.
• HyperSwap and I/O Behavior:
 Storage IO availability is not impacted; IO will continue to be serviced from the site-1 storage to which the failed-over host's site affinity is set. Replication reverses and continues from site-1 to site-2.
• Multipath Behavior:
 Based on host’s site affinity, multipathing selects the paths to site-1 storage and host in Site-1
performs IO with preferred path to site-1 storage.
• I/O pause Time Length:
 No pause in IO from storage as no failure in the Storage. IO pause depends on the time taken to
activate the failover host at the compute layer.



IBM FlashSystem HyperSwap Test Cases

Test Case 7: Complete Site 1 Failure


• Objective: To simulate the failure of the entire site 1.
• Estimated Behavior/Impact: All storage and hosts are down in site 1. A quorum vote elects site 2 as the new primary site. The host OS or application cluster needs to automate the recovery. The hosts, apps and DBs may experience an I/O interruption if not tuned in line with the FS9200 HS failover time.
• Simulation Method:
 Greenfield: Shutdown FS9200 and hosts in site-1.
 Brownfield: NOT APPLICABLE. DO NOT ATTEMPT THIS TEST CASE.
• HyperSwap and I/O Behavior:
 Replication reverses and suspends. Active-Active mirror between primary and aux volumes is
disabled.
 Host I/O continues from host at Site2 (for hosts that are configured to have HA).
 Front-end volume I/O switches from IOG0 at Site1 to IOG1 at Site2.
• Multipath Behavior:
 Hosts failed over to site-2 detect the paths to the site-2 FS9200 and continue to perform IO locally.
• I/O pause Time Length:
 Expected 35 seconds with possible 60 seconds. IO Pause is considered as the time difference
between the host IO rate drop to when it starts to increase again.



IBM FlashSystem HyperSwap Test Cases

Test Case 8: Complete Site 2 Failure


• Objective: To simulate the failure of the entire site 2.
• Estimated Behavior/Impact: All storage and hosts are down in site 2. A quorum vote elects site 1 as the new primary site. The host OS or application cluster needs to automate the recovery. The hosts, apps and DBs may experience an I/O interruption if not tuned in line with the FS9200 HS failover time.
• Simulation Method:
 Greenfield: Shutdown FS9200 and hosts in site-2.
 Brownfield: NOT APPLICABLE. DO NOT ATTEMPT THIS TEST CASE.
• HyperSwap and I/O Behavior:
 Replication reverses and suspends. Active-Active mirror between primary and aux volumes is disabled.
 Host I/O continues from host at Site1 (for hosts that are configured to have HA).
 Front-end volume I/O switches from IOG1 at Site2 to IOG0 at Site1.
• Multipath Behavior:
 Hosts failed over to site-1 detect the paths to the site-1 FS9200 and continue to perform IO locally.
• I/O pause Time Length:
 Expected 35 seconds with possible 60 seconds. IO Pause is considered as the time difference between the
host IO rate drop to when it starts to increase again.



IBM FlashSystem HyperSwap Test Cases

Test Environment Configuration Sample Guidance


• Site-1 hosting one FS9200 control enclosure is the primary
production site considered as failure domain 1.
• Site-2 hosting one FS9200 control enclosure is the
secondary production site considered as failure domain 2.
• Site-3 hosting the IP Quorum (2 instances) considered as
failure domain 3.
• The system is configured in a HyperSwap (HS) topology
Active/Active between Site-1 and Site-2
• Site-1 would be preferred Primary site for all app I/O
(VMware, AIX, Windows, RHEL , Oracle, SAP etc.). I/O
will be performed to site 1 FS9200 system under normal
conditions.
• I/O will be failed over to site 2 in response to typical failure
scenarios as outlined in the subsequent slides.
• Host, app clustering will be in place to address host/app
node failover test case.



Fibre Channel Host Attachment and DB/App Settings

Linux
• In HA and other standard deployment scenarios, it is essential to tune the SCSI-related timeouts to ensure that device path failures and issues are handled effectively.
• All RHEL6, RHEL7, and SLES12 systems require that you set the scsi_mod.inq_timeout parameter to 70 seconds. Otherwise, RHEL6, RHEL7, and SLES12 hosts cannot regain previously failed paths, such as during a system update or when a node is manually rebooted.
 To resolve this issue, add scsi_mod.inq_timeout=70 to the kernel boot command line through the grub configuration. Adding the scsi_mod.inq_timeout=70 parameter makes the change persistent across server reboots. Linux hosts can then also regain system node paths when they are lost.
• Set the udev rules for the SCSI command timeout.
 The Linux SCSI layer sets a timer on each command. When this timer expires, the SCSI layer will quiesce the host bus adapter (HBA) and wait for all outstanding commands to either time out or complete.
 Afterwards, the SCSI layer will activate the driver's error handler.
 Set the SCSI command timeout to 120 seconds. This is the recommended setting for all versions of Linux.
• Set dev_loss_tmo to 120 seconds.
 This parameter is Fibre Channel specific and deals with the loss of connection to remote Fibre Channel ports. The timeout countdown most often begins when the Fibre Channel fabric informs the HBA that a port is no longer connected to the fabric and so is no longer reachable. (It can also begin for other events, such as an HBA reset or a reset of the HBA's link, which will at least temporarily cause a loss of connections to ports through the HBA.) The lost port has until dev_loss_tmo is reached to be rediscovered, or any LUNs behind the lost port will be considered gone and any waiting commands will be failed. It is the number of seconds to wait before marking a link as "bad". Once a link is marked bad, I/O running on its corresponding path (along with any new I/O on that path) will be failed.
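For illustration only, the following Python sketch (an assumption on our part, not part of the referenced IBM documentation) reads the kernel command line and the relevant sysfs values on a Linux host and flags settings that differ from the guidance above.

```python
#!/usr/bin/env python3
"""Report Linux SCSI/FC timeout settings against the recommended values."""
import glob
from pathlib import Path

RECOMMENDED_SCSI_TIMEOUT = 120   # /sys/block/sd*/device/timeout (set via udev rule)
RECOMMENDED_DEV_LOSS_TMO = 120   # /sys/class/fc_remote_ports/*/dev_loss_tmo

def check_inq_timeout():
    cmdline = Path("/proc/cmdline").read_text()
    ok = "scsi_mod.inq_timeout=70" in cmdline
    print(f"scsi_mod.inq_timeout=70 on kernel cmdline: {'OK' if ok else 'MISSING'}")

def check_scsi_timeout():
    for dev in glob.glob("/sys/block/sd*/device/timeout"):
        value = int(Path(dev).read_text())
        status = "OK" if value >= RECOMMENDED_SCSI_TIMEOUT else "TOO LOW"
        print(f"{dev}: {value}s ({status})")

def check_dev_loss_tmo():
    for port in glob.glob("/sys/class/fc_remote_ports/*/dev_loss_tmo"):
        value = Path(port).read_text().strip()
        status = "OK" if value.isdigit() and int(value) >= RECOMMENDED_DEV_LOSS_TMO else "CHECK"
        print(f"{port}: {value} ({status})")

if __name__ == "__main__":
    check_inq_timeout()
    check_scsi_timeout()
    check_dev_loss_tmo()
```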

References:
• https://www.ibm.com/support/knowledgecenter/STSLR9_8.3.1/com.ibm.fs9200_831.doc/svc_linrequiremnts_21u99v.html
• https://www.ibm.com/support/knowledgecenter/STSLR9_8.3.1/com.ibm.fs9200_831.doc/svc_linux_settings.html
44 © 2018 IBM Corporation February 16, 2024 IBM Services
Fibre Channel Host Attachment and DB/App Settings

Oracle RAC on Linux


• Increase "css misscount" to 90.
 css misscount - the maximum time, in seconds, that a cluster heartbeat (messages sent between nodes over the network interconnect or through the voting disk; the prime indicator of connectivity) can be missed before entering into a cluster reconfiguration to evict the node.
 The CSS misscount represents the maximum amount of time, in seconds, for heartbeat communication between Oracle RAC nodes. The CSS misscount is specific to the heartbeat communication timeout between nodes, while the CSS disktimeout and _asm_hbeatiowait are specific to the timeouts associated with I/O on voting disks and ASM devices.
 It is suggested to tune the CSS misscount from its default of ~30 seconds (for Oracle 11g and later; 60 seconds for Oracle 10g) to 90, because even without any underlying network communication issues a node may be unable to communicate with the other nodes in the cluster simply because its load average is very high. Due to a steep increase in system load average, the CSS and other Oracle RAC processes may not be scheduled for execution for a long period, and heartbeat communication with other nodes may not complete before the CSS timeout is reached.
 Even though there are no underlying network issues in the above scenario, the increase in system load average would still adversely affect the cluster processes and could exhaust the CSS misscount. Tuning the CSS misscount to 90 seconds gives the CSS and other cluster processes a little more time before the node is declared failed and a fence action is initiated against it, even if the node is experiencing a very high load average.
• Set _asm_hbeatiowait to 120 for Oracle versions lower than 12.1.0.2, as described in Oracle support document 1581684.1.
 The default of 15 seconds is too short for path failover to complete; with Oracle 12.1.0.2 the default was changed to 120.
• The following multipath/SCSI settings modifications are recommended to limit path failure detection and switch-over time:
 Refer to the Linux OS guidelines when the database is hosted on Linux.
• Applicable Environment:
 Red Hat Enterprise Linux 6
 Red Hat Enterprise Linux 7
 SAN storage accessed using device-mapper-multipath
 Oracle RAC 11 / 12

References:
• https://access.redhat.com/solutions/3136041
• https://www.ibm.com/support/pages/node/885883
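As a hedged illustration, the sketch below wraps the standard Clusterware command crsctl get css misscount to compare the current value with the recommended 90 seconds. It assumes it is run on a RAC node where crsctl is on the PATH; the exact output wording may vary by release, and making the change itself requires appropriate privileges and change control.

```python
#!/usr/bin/env python3
"""Check the Oracle Clusterware CSS misscount against the recommended 90 seconds."""
import shutil
import subprocess

RECOMMENDED_MISSCOUNT = 90

def current_misscount() -> int:
    # 'crsctl get css misscount' prints a line that includes the current value,
    # for example: "... Successful get misscount 30 for Cluster Synchronization Services."
    out = subprocess.run(
        ["crsctl", "get", "css", "misscount"],
        capture_output=True, text=True, check=True
    ).stdout
    digits = [tok for tok in out.split() if tok.isdigit()]
    return int(digits[0]) if digits else -1

if __name__ == "__main__":
    if shutil.which("crsctl") is None:
        raise SystemExit("crsctl not found - run on a RAC node as the Grid Infrastructure owner")
    value = current_misscount()
    print(f"Current CSS misscount: {value}s")
    if value < RECOMMENDED_MISSCOUNT:
        # Review and apply through normal change control:
        print(f"Suggested: crsctl set css misscount {RECOMMENDED_MISSCOUNT}")
```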

45 © 2018 IBM Corporation February 16, 2024 IBM Services


Fibre Channel Host Attachment and DB/App Settings

AIX
Disk Related Parameters for AIX running MPIO:
• rw_timeout - Preferred setting for rw_timeout is 120
• Algorithm - Preferred setting is algorithm=shortest_queue
• Path health check mode (hcheck_mode) – Preferred setting is hcheck_mode=nonactive
• hcheck_interval - Preferred setting is hcheck_interval=240
• timeout_policy - Preferred setting is timeout_policy=fail_path (if available on the device)
• It is recommended to increase database timeout values so that they can withstand 3 * rw_timeout before any drastic action is taken. If the database vendor cannot allow 3 * rw_timeout, then allow at least 2 * rw_timeout, so that the lower-level device drivers get more than one chance to redrive the I/O.
• Enable the fast fail and dynamic tracking attributes for host systems that run an AIX® 5.2 or later operating system.
• Refer to the links mentioned in the references for more details and to the GTS best practice guidance for AIX MPIO configuration.
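A minimal sketch, assuming Python and the standard AIX lsattr command are available on the LPAR; the hdisk name is a placeholder. It only reports differences from the preferred settings above and does not change anything (the chdev example in the comment is shown for reference only).

```python
#!/usr/bin/env python3
"""Compare an AIX MPIO disk's attributes with the preferred HyperSwap settings."""
import subprocess
import sys

PREFERRED = {
    "rw_timeout": "120",
    "algorithm": "shortest_queue",
    "hcheck_mode": "nonactive",
    "hcheck_interval": "240",
    "timeout_policy": "fail_path",   # if available on the device
}

def disk_attributes(hdisk: str) -> dict:
    # 'lsattr -El hdiskN' lists one attribute per line: name, value, description, ...
    out = subprocess.run(["lsattr", "-El", hdisk],
                         capture_output=True, text=True, check=True).stdout
    attrs = {}
    for line in out.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            attrs[fields[0]] = fields[1]
    return attrs

if __name__ == "__main__":
    hdisk = sys.argv[1] if len(sys.argv) > 1 else "hdisk0"   # placeholder device name
    attrs = disk_attributes(hdisk)
    for name, wanted in PREFERRED.items():
        actual = attrs.get(name, "<not present>")
        status = "OK" if actual == wanted else "REVIEW"
        print(f"{hdisk} {name}: {actual} (preferred {wanted}) {status}")
        # To change an attribute (applied at next reboot with -P), for example:
        #   chdev -l hdisk0 -a rw_timeout=120 -P
```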

References:
• https://www.ibm.com/support/pages/node/630015
• https://w3-connections.ibm.com/wikis/home?lang=en-us#!/wiki/Global%20Server%20Management%20Distributed%20SL/page/Recommendations%20for%20Multipath%20Device%20Driver%20%26%20Settings
• https://www.ibm.com/support/pages/node/697363
• https://www.ibm.com/support/knowledgecenter/STSLR9_8.3.1/com.ibm.fs9200_831.doc/svc_aixconfigovrw_21uyy3.html

46 © 2018 IBM Corporation February 16, 2024 IBM Services


Fibre Channel Host Attachment and DB/App Settings

Windows
• Ensure that dual physical HBAs are installed with redundant physical connections to both public SAN fabric switches.
• Ensure that you install the required software on your host, including the following items:
 Operating system service packs and patches
 Host bus adapters (HBAs)
 HBA device drivers
 Multipathing drivers
 Clustered-system software
• The recommendation is to use Microsoft native MPIO with the Microsoft DSM.
• The default value for DiskTimeoutValue is 120 seconds. There is no requirement to reduce this value. If you do need to reduce it, ensure that a minimum value of 60 seconds is configured.
• A timeout value of 60 seconds is sufficient for a quorum race, where I/O pause length = short lease time + quorum race + hold time + pending I/O resumption.
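A minimal, hypothetical check using Python's standard winreg module: it reads the disk class driver timeout (the TimeOutValue value under HKLM\SYSTEM\CurrentControlSet\Services\Disk) and compares it with the 60-second minimum above. Run it on the Windows host itself; it is a sketch, not part of the referenced documentation.

```python
#!/usr/bin/env python3
"""Read the Windows disk I/O timeout and compare it with the recommended minimum."""
import winreg

MIN_TIMEOUT_SECONDS = 60   # minimum from the guidance above

# The disk class driver reads its timeout (in seconds) from this registry value.
KEY_PATH = r"SYSTEM\CurrentControlSet\Services\Disk"
VALUE_NAME = "TimeOutValue"

with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY_PATH) as key:
    try:
        timeout, _ = winreg.QueryValueEx(key, VALUE_NAME)
    except FileNotFoundError:
        timeout = None   # value not set explicitly; the OS default applies

if timeout is None:
    print("TimeOutValue is not set explicitly; the operating system default applies.")
elif timeout >= MIN_TIMEOUT_SECONDS:
    print(f"TimeOutValue = {timeout}s (meets the {MIN_TIMEOUT_SECONDS}s minimum)")
else:
    print(f"TimeOutValue = {timeout}s is below the recommended minimum of {MIN_TIMEOUT_SECONDS}s")
```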
References:
• https://www.ibm.com/support/knowledgecenter/STSLR9_8.3.1/com.ibm.fs9200_831.doc/svc_FChostswindows_cover.html
• https://docs.microsoft.com/en-us/powershell/module/mpio/set-mpiosetting?view=win10-ps

47 © 2018 IBM Corporation February 16, 2024 IBM Services


Fibre Channel Host Attachment and DB/App Settings

VMware
• The Storage Array Type Plug-in (SATP) for IBM® volumes is set to VMW_SATP_ALUA.
• The path selection policy is set to RoundRobin.
• Set the RoundRobin IOPS limit to 1 (not 1000) to evenly distribute I/Os across as many ports on the system as possible.
• Ensure that the queue depth is no deeper than 64.
• The number of paths per volume should not exceed 8. Ensure a minimum of 4 paths per volume.
• Do not use datastores greater than 4 TB.
• Leave the APD timeout value at its default of 140 seconds.
• Disable automatic removal of paths to a disk that is in PDL. Set “Disk.AutoremoveOnPDL” to 0.
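For illustration only (a sketch, not an official procedure): assuming it is run in the ESXi shell where both python and esxcli are available, the snippet below applies the RoundRobin IOPS=1 setting for one device and disables Disk.AutoremoveOnPDL. The device identifier is a placeholder; validate the esxcli syntax against the VMware documentation referenced below before use.

```python
#!/usr/bin/env python
"""Apply the RoundRobin IOPS=1 and Disk.AutoremoveOnPDL=0 settings via esxcli."""
import subprocess

DEVICE = "naa.600507680000000000000000000000aa"   # placeholder device identifier

# Switch the path-change trigger for this device to every 1 I/O (instead of 1000).
subprocess.run(
    ["esxcli", "storage", "nmp", "psp", "roundrobin", "deviceconfig", "set",
     "--device", DEVICE, "--type", "iops", "--iops", "1"],
    check=True,
)

# Disable automatic removal of devices that enter Permanent Device Loss (PDL).
subprocess.run(
    ["esxcli", "system", "settings", "advanced", "set",
     "-o", "/Disk/AutoremoveOnPDL", "-i", "0"],
    check=True,
)

print("Applied RoundRobin IOPS=1 for", DEVICE, "and set Disk.AutoremoveOnPDL=0")
```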

References:
• https://www.ibm.com/support/knowledgecenter/STSLR9_8.3.1/com.ibm.fs9200_831.doc/svc_vmwconfigovrw_21lbjv.html
• https://docs.vmware.com/en/VMware-vSphere/5.5/com.vmware.vsphere.storage.doc/GUID-75214ECC-DC3B-4C06-95D4-221A8C03B94A.html

48 © 2018 IBM Corporation February 16, 2024 IBM Services


Fibre Channel Host Attachment and DB/App Settings

FC Hosts Queue Depth


• In Fibre Channel networks, the queue depth is the number of I/O operations (SCSI commands) that can be run in parallel on a device.
• Usually a higher queue depth can lead to better performance. However, when too many concurrent I/Os are sent to a storage device, the device responds with a "queue full" I/O failure message. This message is intended to cause the host to retry the I/O a short time later. However, not all operating systems handle the queue-full failure correctly, which can cause unnecessary I/O failures or delays to applications.
• To avoid these delays or failures, configure your hosts so that they send only a finite number of I/Os to the storage system to avoid
exhausting the resources of the storage system.
• In configurations with multiple hosts, configure each host with a similar queue depth to maintain fairness between the hosts.
 For most hosts, set the HBA queue depth to 32.
 For hosts that are significantly busier or where not many hosts are configured on the storage system, set the HBA queue depth to 128.

• Ensure that the queue depth total for all hosts that are connected to a single physical Fibre Channel port on the storage system does not
exceed 2048.
• If host I/O queue full failures regularly occur, consider reducing the queue depths or distributing the host zoning across more physical Fibre
Channel ports on the storage system.
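The fairness rule above is simple arithmetic; the following sketch (with hypothetical host counts) checks that the aggregate queue depth of the hosts zoned to one storage Fibre Channel port stays within the 2048 limit.

```python
#!/usr/bin/env python3
"""Check aggregate host queue depth against the 2048-per-storage-port limit."""

PORT_QUEUE_LIMIT = 2048      # maximum total queue depth per storage FC port
DEFAULT_HBA_QDEPTH = 32      # typical per-host HBA queue depth
BUSY_HBA_QDEPTH = 128        # for a small number of significantly busier hosts

# Hypothetical zoning: number of hosts of each type zoned to one physical storage port.
hosts_on_port = {"standard_hosts": 50, "busy_hosts": 4}

total = (hosts_on_port["standard_hosts"] * DEFAULT_HBA_QDEPTH
         + hosts_on_port["busy_hosts"] * BUSY_HBA_QDEPTH)

print(f"Aggregate queue depth on this port: {total}")
if total > PORT_QUEUE_LIMIT:
    # 50*32 + 4*128 = 2112 in this example, so the check flags it.
    print("Over the 2048 limit - reduce queue depths or spread hosts across more ports")
else:
    print("Within the 2048 limit")
```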

Reference:
https://www.ibm.com/support/knowledgecenter/STSLR9_8.3.1/com.ibm.fs9200_831.doc/svc_FCqueuedepth.html

49 © 2018 IBM Corporation February 16, 2024 IBM Services


Data Protection Guidance with HS
Data Protection Guidance with HS

SPS, SPP and CDM


• Spectrum Protect Snapshot (SPS)
 Currently SPS does not support HS volumes. If SPS is in use and there are compelling use cases for SPS, do not provision HS
volumes to the hosts.
 SPS support for HyperSwap on FS9200 is planned for AIX and Linux x86_64 and is targeted for November/December 2020. Check the support matrix at that time if a requirement arises.

• Spectrum Protect Plus (SPP)


 SPP is storage agnostic and supports HS volumes as both vSnap repository and the primary storage that is intended to be
protected by SPP.

• Copy Data Management (CDM)


 CDM currently supports only SQL and VMware workloads on HyperSwap volumes.
 Evaluate the workloads and associated CDM use cases prior to provisioning HS volumes for workloads other than SQL and
VMware. If CDM is a mandatory requirement, do not provision HS volumes for such workloads.

51 © 2018 IBM Corporation February 16, 2024 IBM Services


Data Protection Guidance with HS

HyperSwap Volume Expansion


• HyperSwap volume expansion is available with Spectrum Virtualize 8.3.1.
• HS volumes can be expanded online, and the new capacity will be made available immediately.
• You can expand the size of a HyperSwap volume provided:
 All copies of the volume are synchronized.
 All copies of the volume are thin or compressed.
 There are no mirrored copies.
 The volume is not in a remote copy consistency group. To expand the volume, you must remove the active-active
relationship for the volume from the remote copy consistency group. The active-active relationship can be added back to
the consistency group after the volume is expanded.
 The volume is not in any user-owned FlashCopy maps.
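As a hedged illustration of the consistency-group steps above (a sketch, not an official procedure), the snippet below drives the Spectrum Virtualize CLI over SSH: it removes the active-active relationship from its consistency group, expands the volume, and adds the relationship back. All cluster, volume, relationship, and group names are placeholders; verify the command syntax against the Knowledge Center link below before use.

```python
#!/usr/bin/env python3
"""Sketch: expand a HyperSwap volume whose relationship sits in a consistency group."""
import subprocess

CLUSTER = "superuser@fs9200-cluster"          # placeholder management address / user
VOLUME = "hs_vol_01"                          # placeholder HyperSwap volume name
RELATIONSHIP = "rcrel_hs_vol_01"              # placeholder active-active relationship
CONSIST_GROUP = "hs_cg_prod"                  # placeholder consistency group
EXPAND_BY_GB = 100

def svc(*args: str) -> None:
    """Run a Spectrum Virtualize CLI command over SSH."""
    subprocess.run(["ssh", CLUSTER, *args], check=True)

# 1. Take the active-active relationship out of its consistency group.
svc("chrcrelationship", "-noconsistgrp", RELATIONSHIP)

# 2. Expand the HyperSwap volume (copies must be synchronized and thin/compressed).
svc("expandvdisksize", "-size", str(EXPAND_BY_GB), "-unit", "gb", VOLUME)

# 3. Return the relationship to its consistency group.
svc("chrcrelationship", "-consistgrp", CONSIST_GROUP, RELATIONSHIP)

print(f"Expanded {VOLUME} by {EXPAND_BY_GB} GB and restored it to {CONSIST_GROUP}")
```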

Reference(s):

https://www.ibm.com/support/knowledgecenter/STSLR9_8.3.1/com.ibm.fs9200_831.doc/svc_expandvolume.html

52 © 2018 IBM Corporation February 16, 2024 IBM Services


GTS Standard Building Block FC Port Layout
and Zoning Guidance
GTS Standard Building Block FC Port Layout and Function Designation

IBM FS9200 FC Port Layout and Function Designation


GTS Standard building block preferred 3-Pair Adapter configuration
Layout to use when all 3 FC adapters are installed on a node canister.

Port function designation (identical on Node Canister 1 and Node Canister 2 of the FS9200 control enclosure):

FC Adapter 1 (CPU2):      Port 1 = Host | Port 2 = Free | Port 3 = Host | Port 4 = Replication/HyperSwap
FC Adapter 2 (CPU1/CPU2): Port 1 = Host | Port 2 = HyperSwap | Port 3 = Host | Port 4 = HyperSwap
FC Adapter 3 (CPU1):      Port 1 = Host | Port 2 = Free | Port 3 = Host | Port 4 = Replication/HyperSwap

54 © 2018 IBM Corporation February 16, 2024 IBM Services


GTS Standard Building Block FC Port Layout and Function Designation

IBM FS7200 FC Port Layout and Function Designation


GTS Standard building block preferred 3-Pair Adapter configuration
Layout to use when all 3 FC adapters are installed on a node canister.

Port function designation (identical on Node Canister 1 and Node Canister 2 of the FS7200 control enclosure):

FC Adapter 1 (CPU2):      Port 1 = Host | Port 2 = Free | Port 3 = Host | Port 4 = Replication/HyperSwap
FC Adapter 2 (CPU1/CPU2): Port 1 = Host | Port 2 = HyperSwap | Port 3 = Host | Port 4 = HyperSwap
FC Adapter 3 (CPU1):      Port 1 = Host | Port 2 = Free | Port 3 = Host | Port 4 = Replication/HyperSwap

55 © 2018 IBM Corporation February 16, 2024 IBM Services


GTS Standard Building Block FC Port Layout and Function Designation

IBM FS9200 and FS7200 Zoning Guidance


GTS Standard building block preferred 3-Pair Adapter configuration
• The FC zoning layout is aligned to the port function designations outlined in the previous slides.
• This layout ensures equal distribution between two fabrics for resiliency and eliminates SPOF.
Columns (left to right): Source Device Name | Node Canister (NC) | Slot No. | Port No. | Destination Connected Device | Dest. Slot No. | Dest. Port No. | Fabric | Port Type | Function | Cable Type | Remarks
FS9200/FS7200 NC 1 - Upper PCIe Slot 1 1 SAN_Switch_A TBD TBD A FC Host OM4 Storage to SAN switch Connectivity
FS9200/FS7200 NC 1 - Upper PCIe Slot 1 2 SAN_Switch_B TBD TBD B FC Free OM4 Storage to SAN switch Connectivity
FS9200/FS7200 NC 1 - Upper PCIe Slot 1 3 SAN_Switch_B TBD TBD B FC Host OM4 Storage to SAN switch Connectivity
FS9200/FS7200 NC 1 - Upper PCIe Slot 1 4 SAN_Switch_A TBD TBD A FC Replication/HyperSwap OM4 Storage to SAN switch Connectivity
FS9200/FS7200 NC 1 - Upper PCIe Slot 2 1 SAN_Switch_A TBD TBD A FC Host OM4 Storage to SAN switch Connectivity
FS9200/FS7200 NC 1 - Upper PCIe Slot 2 2 SAN_Switch_A TBD TBD A FC HyperSwap OM4 Storage to SAN switch Connectivity
FS9200/FS7200 NC 1 - Upper PCIe Slot 2 3 SAN_Switch_B TBD TBD B FC Host OM4 Storage to SAN switch Connectivity
FS9200/FS7200 NC 1 - Upper PCIe Slot 2 4 SAN_Switch_B TBD TBD B FC HyperSwap OM4 Storage to SAN switch Connectivity
FS9200/FS7200 NC 1 - Upper PCIe Slot 3 1 SAN_Switch_A TBD TBD A FC Host OM4 Storage to SAN switch Connectivity
FS9200/FS7200 NC 1 - Upper PCIe Slot 3 2 SAN_Switch_A TBD TBD A FC Free OM4 Storage to SAN switch Connectivity
FS9200/FS7200 NC 1 - Upper PCIe Slot 3 3 SAN_Switch_B TBD TBD B FC Host OM4 Storage to SAN switch Connectivity
FS9200/FS7200 NC 1 - Upper PCIe Slot 3 4 SAN_Switch_B TBD TBD B FC Replication/HyperSwap OM4 Storage to SAN switch Connectivity
FS9200/FS7200 NC 2 - Lower PCIe Slot 1 1 SAN_Switch_A TBD TBD A FC Host OM4 Storage to SAN switch Connectivity
FS9200/FS7200 NC 2 - Lower PCIe Slot 1 2 SAN_Switch_B TBD TBD B FC Free OM4 Storage to SAN switch Connectivity
FS9200/FS7200 NC 2 - Lower PCIe Slot 1 3 SAN_Switch_B TBD TBD B FC Host OM4 Storage to SAN switch Connectivity
FS9200/FS7200 NC 2 - Lower PCIe Slot 1 4 SAN_Switch_A TBD TBD A FC Replication/HyperSwap OM4 Storage to SAN switch Connectivity
FS9200/FS7200 NC 2 - Lower PCIe Slot 2 1 SAN_Switch_A TBD TBD A FC Host OM4 Storage to SAN switch Connectivity
FS9200/FS7200 NC 2 - Lower PCIe Slot 2 2 SAN_Switch_A TBD TBD A FC HyperSwap OM4 Storage to SAN switch Connectivity
FS9200/FS7200 NC 2 - Lower PCIe Slot 2 3 SAN_Switch_B TBD TBD B FC Host OM4 Storage to SAN switch Connectivity
FS9200/FS7200 NC 2 - Lower PCIe Slot 2 4 SAN_Switch_B TBD TBD B FC HyperSwap OM4 Storage to SAN switch Connectivity
FS9200/FS7200 NC 2 - Lower PCIe Slot 3 1 SAN_Switch_A TBD TBD A FC Host OM4 Storage to SAN switch Connectivity
FS9200/FS7200 NC 2 - Lower PCIe Slot 3 2 SAN_Switch_A TBD TBD A FC Free OM4 Storage to SAN switch Connectivity
FS9200/FS7200 NC 2 - Lower PCIe Slot 3 3 SAN_Switch_B TBD TBD B FC Host OM4 Storage to SAN switch Connectivity
FS9200/FS7200 NC 2 - Lower PCIe Slot 3 4 SAN_Switch_B TBD TBD B FC Replication/HyperSwap OM4 Storage to SAN switch Connectivity

56 © 2018 IBM Corporation February 16, 2024 IBM Services


GTS Standard Building Block Capacity
Guidance
GTS Standard Building Block Capacity Guidance

FS9200 and FS7200 Capacity Sizing Guidance


• This slide provides guidance on capacity sizing for FS9200 and FS7200 standard building blocks.
• The sizing is based on 2:1 FCM compression and 1.5:1 (150%) thin provisioning.
• The max allocation is capped at 80% considering 2:1 compression and 1.5:1 thin.
• 20% is reserved to accommodate incremental growth, airbag volumes (1 TB) and for HS specific considerations (outlined in subsequent slides).

Based on 80% utilization

Block storage systems: IBM FS9200, FS7200
Module sizes: Small = 4.8 TB (12 & 24 modules); Medium = 9.6 TB (12, 18 & 24 modules); Large = 19.2 TB (18 & 24 modules)

Module Configuration | Thin Allocated (not usable) Max Install @ 80% (TB)
12x 4.8 TB           | 103.7
12x 9.6 TB           | 207.4
24x 4.8 TB           | 218.9
18x 9.6 TB           | 322.6
24x 9.6 TB           | 437.8
18x 19.2 TB          | 645.1
24x 19.2 TB          | 875.5

Sample Calculations for reference

Flash Module (TB) | Count | RAID Type | Stripe Width | Data Drives | Parity | Spare | Raw Physical Cap (TB) | Real Physical Usable Cap (TB) | Compression 2:1 (TB) | Thin Prov 1.5:1 (TB) | Effective Cap @80% (TB)
4.8               | 24    | DRAID6    | 10+P+Q       | 19          | 4      | 1     | 115.2                 | 91.2                          | 182.4                | 273.6                | 218.88
9.6               | 24    | DRAID6    | 10+P+Q       | 19          | 4      | 1     | 230.4                 | 182.4                         | 364.8                | 510.72               | 408.58
19.2              | 24    | DRAID6    | 10+P+Q       | 19          | 4      | 1     | 460.8                 | 364.8                         | 729.6                | 1094.4               | 875.52
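To make the sample calculation explicit, the sketch below reproduces the 4.8 TB row of the table above, applying the data-drive ratio implied by the table (19 data drives out of 24), the 2:1 compression, the 1.5:1 thin-provisioning factor, and the 80% allocation cap.

```python
#!/usr/bin/env python3
"""Reproduce the FS9200/FS7200 sample capacity calculation (4.8 TB FCM row)."""

COMPRESSION = 2.0        # FCM hardware compression ratio assumed in this guidance
THIN_PROVISIONING = 1.5  # thin-provisioning (over-allocation) ratio
MAX_ALLOCATION = 0.80    # cap allocation at 80% of the thin-allocated capacity

module_tb = 4.8
module_count = 24
raw_physical_tb = module_tb * module_count                 # 115.2 TB
usable_tb = raw_physical_tb * (19 / 24)                    # DRAID6 overhead -> 91.2 TB
compressed_tb = usable_tb * COMPRESSION                    # 182.4 TB
thin_allocated_tb = compressed_tb * THIN_PROVISIONING      # 273.6 TB
effective_at_80_tb = thin_allocated_tb * MAX_ALLOCATION    # 218.88 TB

print(f"Raw physical:        {raw_physical_tb:.1f} TB")
print(f"Usable (DRAID6):     {usable_tb:.1f} TB")
print(f"Compression 2:1:     {compressed_tb:.1f} TB")
print(f"Thin 1.5:1:          {thin_allocated_tb:.1f} TB")
print(f"Effective cap @ 80%: {effective_at_80_tb:.2f} TB")
```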

58 © 2018 IBM Corporation February 16, 2024 IBM Services


GTS Standard Building Block Capacity Guidance

FS9200 and FS7200 Capacity Guidance - Explained


Why do FCMs have a maximum effective capacity?
FCM drives contain a fixed amount of space for metadata. The maximum effective capacity is the amount of data it takes to fill the metadata space.

What are the maximum effective capacities for each FCM?
For the 4.8 TB module the maximum is 21.99 TB, which effectively limits the compression ratio to 4.5:1; for 9.6 TB it is 21.99 TB, or 2.3:1; for 19.2 TB it is 43.98 TB, or 2.3:1.

What happens if I write a highly compressible workload to an FCM?
Even if you write a 10:1 compressible workload to an FCM, it will still be full when it reaches the maximum effective capacity. Any spare data space remaining at this point will be used to improve the performance of the module and extend the wear.

Storage efficiency calculations
The GTS recommended best practice will cover the majority of typical mixed workloads, and a 2:1 compression ratio is expected to be achieved. Certain applications can affect the compression ratio both positively and negatively; if you are aware of applications in your environment that will be impacted, plan capacity sizing accordingly. Examples of applications that do not benefit from 2:1 compression are SAP HANA or database applications that perform their own compression. Any host-side encryption will also prevent the 2:1 compression saving from being applied, as the data cannot be accessed to compress.

Example FS9200 HW Configuration
• 24x 9.6 TB FCM, DRAID-6 10+P+Q with 1 Spare Area
• 2:1 HW Compression
• 1.5:1 Thin Provisioning

Capacity tiers for this configuration (summarized from the slide graphic):
• GTS potential allocation, 1.5:1 thin (80%): 437 TB (397 TiB) - over-provisioned capacity that cannot all be written from this single storage pool.
• GTS effective capacity, 2:1 compression: 364 TB (331 TiB) - HS change volumes (CVs) and airbag volumes are reserved within this capacity.
• Usable capacity: 182 TB (165 TiB).
• 70% warning alert: threshold triggered.
• 80% full: critical - action required to stop new allocations.
• Threshold monitoring must be configured on the effective pool capacity. Only a single threshold alert is raised, so always take action on it.

Action required at 70% threshold utilization!
GTS guidance for FS9200/FS7200 systems does not allow adding expansion enclosures. Set the monitoring threshold at 70% so that action is taken to order another storage unit. At 80% you need to stop any new allocations and work towards moving storage off the storage unit. As GTS allows thin overprovisioning, there is no option to increase the storage pool. Do not wait to place hardware orders.

59 © 2018 IBM Corporation February 16, 2024 IBM Services


GTS Standard Building Block Capacity Guidance

HyperSwap Change Volumes Capacity Considerations


• Each HyperSwap volume uses 4 vdisks: one primary, one auxiliary, and two change volumes (CVs).
• CVs come into play when a resync is needed between primary and auxiliary vdisks.
• The change volume protects the consistent copy on the out-of-sync site while the resync is happening, so if
there is a disaster during the resync, the CV can be used to rollback to the last good consistent copy.
• The amount of data that gets written to CV is the amount of data that has been written/modified on the other
site while disconnected.
• Discount the CV volume allocation size from the overall allocation metric.
• It is recommended to reserve at least 15% of the data volume capacity for CVs. Tune this value according to the
actual requirements.
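As a simple worked example (the capacity figure is illustrative, not from this guidance), the sketch below applies the 15% CV reserve to a planned HyperSwap allocation and shows the total capacity to budget in the pool.

```python
#!/usr/bin/env python3
"""Apply the 15% change-volume (CV) reserve to a planned HyperSwap allocation."""

CV_RESERVE_RATIO = 0.15          # reserve at least 15% of data volume capacity for CVs

planned_data_volume_tb = 200.0   # hypothetical HyperSwap data volume capacity
cv_reserve_tb = planned_data_volume_tb * CV_RESERVE_RATIO

print(f"Planned HyperSwap data volumes: {planned_data_volume_tb:.1f} TB")
print(f"CV reserve (15%):               {cv_reserve_tb:.1f} TB")
print(f"Capacity to budget in the pool: {planned_data_volume_tb + cv_reserve_tb:.1f} TB")
```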

60 © 2018 IBM Corporation February 16, 2024 IBM Services

