
Module 8

Implementing failover clustering


Module Overview

• Planning a failover cluster
• Creating and configuring a new failover cluster
• Maintaining a failover cluster
• Troubleshooting a failover cluster
• Implementing site high availability with stretch
clustering
Lesson 1: Planning a failover cluster

• Preparing to implement failover clustering
• Failover-cluster storage
• Hardware requirements for a failover-cluster
implementation
• Network requirements for a failover-cluster implementation
• Demonstration: Verifying a network adapter's RSS and
RDMA compatibility on an SMB server
• Infrastructure and software requirements for a failover
cluster
• Security considerations
• Quorum in Windows Server 2016
• Planning for migrating and upgrading failover clusters
Preparing to implement failover clustering

When you prepare to implement failover clustering:

• Identify the services and applications that you want
to make highly available
• Remember that failover clustering does not suit all
applications equally
• Only IP-based applications are supported
• Plan resource utilization adequately
• Use hardware that has similar capacity for all nodes
in a cluster
• Identify single points of failure and remove them,
for example with NIC Teaming or MPIO
Failover Cluster: Components

• Node - A Windows Server 2016 computer that is part of a failover
cluster and has the failover clustering feature installed. Nodes can
be virtual or physical servers.
• Service or application - A service that you can move between
cluster nodes (for example, the Hyper-V role, or a clustered file
server that can run on either node).
• Shared storage - External storage (SAN, iSCSI, Storage Spaces, or
Storage Spaces Direct) that is accessible to all cluster nodes.
(Mandatory)
• Quorum - The number of elements that must be online for a
cluster to continue to run. The quorum is determined when
cluster nodes vote.
• Witness - A resource (disk, file share, or cloud witness) that
participates in cluster voting when the number of nodes is even.
Failover Cluster: Components

• Failover - The process of moving cluster resources from the first
node to the second node, as a result of node failure or an
administrator’s action.
• Switchover - A failover that an administrator plans and initiates.
• Failback - The process of moving cluster resources back from the
second node to the first node, as a result of the first node coming
back online or an administrator’s action.
• Clients - Computers that connect to the failover cluster and are
not aware of which node the service is running on.
• Failover and failback example - If the service or application fails
over from Node1 to Node2, it fails back to Node1 when Node1 is
available again, provided that failback is allowed for the role.
Hardware requirements for a failover-cluster
implementation

The hardware requirements for a failover-cluster
implementation include:
• You must use server hardware that is certified for
Windows Server
• Server nodes should all have the same
configuration and contain the same or similar
components
• All servers must pass the tests in the Validate a
Configuration Wizard
• Ensure that each node runs the same processor
architecture.
Network requirements for a failover-cluster
implementation
The network requirements for a failover
implementation include:
• Your server should connect to multiple networks
to ensure communication redundancy, or it
should connect to a single network with
redundant hardware, to remove single points
of failure
• You should ensure that network adapters are
identical and that they have the same IP protocol
versions, speed, duplex, and flow-control
capabilities
• Your network adapters should be compatible with RSS, which
distributes network-receive processing across processor cores,
and with RDMA
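The adapter capabilities listed above can be checked from Windows PowerShell. This is a minimal sketch, assuming it runs locally on a prospective cluster node that has the built-in NetAdapter and SmbShare modules:

```powershell
# List adapters and whether Receive Side Scaling (RSS) is enabled
Get-NetAdapterRss | Format-Table Name, Enabled

# List RDMA-capable adapters and whether RDMA is enabled
Get-NetAdapterRdma | Format-Table Name, Enabled

# Show the interfaces that the SMB server sees, including RSS and RDMA capability
Get-SmbServerNetworkInterface
```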
Infrastructure and software requirements for a
failover cluster

• The infrastructure requirements for a failover-cluster
implementation include:
• Active Directory domain controllers should run
Windows Server 2008 or newer
• Domain-functional level and forest-functional level
should run Windows Server 2008 or newer
• The application must support Windows Server 2016
high availability
• The software best practices for a failover cluster
implementation require that:
• All nodes have the same edition of Windows Server
2016 and the same service pack and updates
Security considerations
• Security considerations for failover clustering include that
you must:
• Provide a method for authentication and authorization
• Ensure that unauthorized users do not have physical access to
failover cluster nodes
• Ensure that you use antimalware software
• Ensure that your intra-cluster communication authenticates with
Kerberos version 5
• If you use an Active Directory-detached cluster:
• AD DS objects for network names are not created
• The cluster network name is registered in DNS, so you do not need
to create new objects in AD DS
• We do not recommend this for any scenario that requires Kerberos
authentication
• You must run Windows Server 2012 R2 or newer on all cluster nodes
Security considerations

Windows Server 2016 introduces several cluster types, and which one
you use depends on your domain-membership scenario:
• Single-domain clusters
• Cluster nodes are members of the same domain
• Workgroup clusters
• Cluster nodes are not joined to a domain
• You can configure these only with Windows PowerShell
• Multi-domain clusters
• Cluster nodes are members of different domains
• Workgroup and domain clusters
• Cluster nodes include both domain members and servers that are
not joined to a domain (workgroup servers)
Quorum in Windows Server 2016
Quorum mode                   What has the vote?                               When is quorum maintained?
Node majority                 Only nodes in the cluster                        When more than half of the nodes are online
Node and disk majority        Nodes in the cluster and a disk witness          When more than half of the votes are online
Node and file share majority  Nodes in the cluster and a file share witness    When more than half of the votes are online
No majority: disk only        Only the quorum-shared disk                      When the shared disk is online
Dynamic quorum                Votes are dynamically assigned to always be odd  When more than half of the current votes are online
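The "more than half" rule in the table can be worked through as simple arithmetic. The helper below is a hypothetical illustration, not a built-in cmdlet, and it models static vote counting only (it ignores dynamic quorum adjustments):

```powershell
# Hypothetical helper: smallest number of votes that is a strict majority
function Get-VotesNeeded {
    param([int]$TotalVotes)
    [math]::Floor($TotalVotes / 2) + 1
}

Get-VotesNeeded -TotalVotes 4   # four nodes, no witness
Get-VotesNeeded -TotalVotes 5   # four nodes plus one witness
```

Both calls return 3. With four votes the cluster tolerates the loss of one vote, and with five votes it tolerates the loss of two, which is why a witness is useful when the number of nodes is even.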
Quorum in Windows Server 2016

• Dynamic quorum (Windows Server 2012)
• Dynamic witness (Windows Server 2012 R2)
• Disk witness
• File share witness
• Azure Cloud Witness (Windows Server 2016)

• Administrators should configure a witness for all clusters
• We recommend that you use dynamic quorum,
which is the default configuration
• You should use all other forms of quorum in
specific use cases only
Dynamic quorum / Dynamic witness
Planning for migrating and upgrading failover clusters

The upgrade steps for each node in the cluster include:
• Pause the cluster node and drain all cluster resources
• Migrate cluster resources to another node in the cluster
• Replace the cluster node operating system with Windows
Server 2016 and add the node back to the cluster
• Upgrade all nodes to Windows Server 2016
• Run the Update-ClusterFunctionalLevel cmdlet
• New failover clustering features in Windows Server 2016 include
site-aware failover clusters, workgroup and multi-domain
clusters, Virtual Machine Node Fairness, Virtual Machine Start
Order, simplified SMB Multichannel, and multi-NIC cluster
networks
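The per-node upgrade steps above can be sketched with the FailoverClusters cmdlets. This is an outline under stated assumptions, not a complete procedure: the node name LON-SVR2 is hypothetical, and the operating system reinstallation happens outside the script.

```powershell
# Pause the node and drain its roles to the remaining nodes
Suspend-ClusterNode -Name "LON-SVR2" -Drain

# Evict the node before replacing its operating system
Remove-ClusterNode -Name "LON-SVR2"

# ... install Windows Server 2016 and the Failover Clustering feature on the node ...

# Add the upgraded node back to the cluster
Add-ClusterNode -Name "LON-SVR2"

# After every node runs Windows Server 2016, check and raise the functional level
Get-Cluster | Select-Object Name, ClusterFunctionalLevel
Update-ClusterFunctionalLevel
```

Raising the functional level is a one-way operation, so it is usually left until all nodes have been upgraded and verified.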
Virtual Machine Node Fairness

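Balancing thresholds:
• Low: rebalance when virtual machines consume 80% of the node’s capacity
• Medium: rebalance when virtual machines consume 70% of the node’s capacity
• High: rebalance when virtual machines consume 60% of the node’s capacity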
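Node fairness is controlled through two cluster common properties. A minimal sketch, assuming it runs on a cluster node with the FailoverClusters module loaded:

```powershell
# 0 = disabled, 1 = balance when a node joins, 2 = on join and periodically (default)
(Get-Cluster).AutoBalancerMode = 2

# 1 = Low, 2 = Medium, 3 = High
(Get-Cluster).AutoBalancerLevel = 2

# Review the current settings
Get-Cluster | Format-List AutoBalancer*
```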
Lesson 2: Creating and configuring a new
failover cluster

• The validation wizard and the cluster support-policy requirements
• The process for creating a failover cluster
• Demonstration: Creating a failover cluster
• Demonstration: Reviewing the validation wizard
• Configuring roles
• Demonstration: Creating a general file-server failover cluster
• Managing failover clusters
• Configuring cluster properties
• Configuring failover and failback
• Configuring storage
• Configuring networking
The validation wizard and the cluster
support-policy requirements

• The validation wizard performs multiple types of
tests, such as:
• Cluster
• Inventory
• Network
• Storage
• System

• You can perform validation from the Validate a
Configuration Wizard or with the Test-Cluster
Windows PowerShell cmdlet
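A short sketch of validation from Windows PowerShell follows. The node names are assumptions based on the lab virtual machines:

```powershell
# Run all validation tests against the prospective nodes
Test-Cluster -Node "LON-SVR2", "LON-SVR3"

# Run only one test category, for example after changing the network configuration
Test-Cluster -Node "LON-SVR2", "LON-SVR3" -Include "Network"
```

Test-Cluster writes a validation report file and returns its location, so the results can be reviewed in a browser afterward.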
The process for creating a failover cluster

1. Install the failover clustering feature
2. Verify the configuration, and create a cluster
3. Install the role on all cluster nodes by using
Server Manager
4. Create a clustered application by using the
Failover Clustering Management snap-in
5. Configure the application
6. Test failover
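The six steps above can also be scripted. The sketch below assumes that it runs on one of the nodes with PowerShell remoting available; the node names, cluster name, role name, disk name, and IP addresses are all hypothetical placeholders.

```powershell
# Assumed names and addresses; replace with your own
$nodes = "LON-SVR2", "LON-SVR3"

# 1. Install the failover clustering feature on every node
Invoke-Command -ComputerName $nodes -ScriptBlock {
    Install-WindowsFeature -Name Failover-Clustering -IncludeManagementTools
}

# 2. Verify the configuration, then create the cluster
Test-Cluster -Node $nodes
New-Cluster -Name "Cluster1" -Node $nodes -StaticAddress "172.16.0.125"

# 3. Install the role (here, File Server) on all cluster nodes
Invoke-Command -ComputerName $nodes -ScriptBlock {
    Install-WindowsFeature -Name FS-FileServer
}

# 4 and 5. Create and configure the clustered application
Add-ClusterFileServerRole -Name "AdatumFS" -Storage "Cluster Disk 2" -StaticAddress "172.16.0.130"

# 6. Test failover by moving the role to the other node
Move-ClusterGroup -Name "AdatumFS" -Node "LON-SVR3"
```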
Configuring roles

• Configuring a cluster role includes:
• Choosing a clustering role, for example File Server or
Virtual Machine
• Installing the role
• Verifying the status (Running) on all cluster nodes

• You can configure a cluster role by using:
• The Failover Cluster Manager console
• The role-specific Windows PowerShell cmdlets, such as
Add-ClusterFileServerRole (New-Cluster creates the cluster itself)
Demonstration: Creating a general file-server
failover cluster

In this demonstration, you will learn how to cluster
a file server role
Managing failover clusters

The most common management tasks include:
• Managing nodes
• Managing networks
• Managing permissions
• Configuring cluster-quorum settings
• Migrating services and applications to a cluster
• Configuring new services and applications
• Removing the cluster
Configuring cluster properties

The three aspects of managing cluster nodes include:
• Adding nodes after you create a cluster
• Pausing nodes, which prevents resources from
running on that node
• Evicting nodes from a cluster, which removes the
node from the cluster configuration
Configuration tasks are available in:
• The Actions pane of the Failover Cluster
Management console
• Windows PowerShell
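The node-management tasks above map to cmdlets in the FailoverClusters module. A minimal sketch, with LON-SVR4 as a hypothetical node name:

```powershell
# Add a node after the cluster has been created
Add-ClusterNode -Name "LON-SVR4"

# Pause the node and move its roles elsewhere (for example, before maintenance)
Suspend-ClusterNode -Name "LON-SVR4" -Drain

# Resume the node and bring back the roles that were drained from it
Resume-ClusterNode -Name "LON-SVR4" -Failback Immediate

# Evict the node, which removes it from the cluster configuration
Remove-ClusterNode -Name "LON-SVR4"
```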
Configuring failover and failback

• During failover, the clustered instance and all
associated resources move from one node to
another
• Failover occurs when:
• The node that hosts the instance becomes inactive for
some reason
• One of the resources within the instance fails
• An administrator performs a failover

• The Cluster service can fail back after the offline
node becomes active again
• Failover can be planned or unplanned
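Failover and failback behavior is stored as properties of each clustered role (cluster group). The sketch below assumes a role named AdatumFS and two hypothetical node names:

```powershell
$group = Get-ClusterGroup -Name "AdatumFS"   # hypothetical role name

# 0 = prevent failback, 1 = allow failback
$group.AutoFailbackType = 1

# Restrict failback to a quiet period: between 01:00 and 05:00
$group.FailbackWindowStart = 1
$group.FailbackWindowEnd = 5

# Preferred owners, in order; failback targets the most preferred available node
Set-ClusterOwnerNode -Group "AdatumFS" -Owners "LON-SVR2", "LON-SVR3"

# Review the failover and failback settings
$group | Format-List Name, AutoFailbackType, Failback*, FailoverThreshold, FailoverPeriod
```

Limiting failback to a window avoids a second interruption during business hours when the original node returns.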
Configuring storage

Storage configuration tasks in Failover Clustering include:
• Adding storage spaces
• Adding a disk to available storage and to the CSV
• Taking a disk offline
• Bringing the disk back online
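The disk-related tasks above can be sketched as follows. The disk resource names are assumptions; actual names depend on the cluster.

```powershell
# Add every disk that is visible to all nodes to Available Storage
Get-ClusterAvailableDisk | Add-ClusterDisk

# Convert a disk in Available Storage to a Cluster Shared Volume
Add-ClusterSharedVolume -Name "Cluster Disk 1"

# Take a disk offline, then bring it back online
Stop-ClusterResource -Name "Cluster Disk 2"
Start-ClusterResource -Name "Cluster Disk 2"
```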
Configuring networking

Network                     Description
Public network              Clients use this network to connect to the clustered service
Private network             Nodes use this network to communicate with each other (heartbeats use UDP port 3343)
Public-and-private network  Required to communicate with external storage systems

• One network can support both client and node
communications
• Multiple network adapters are recommended for
enhanced performance and redundancy
• iSCSI storage should have a dedicated network
Configuring quorum options

Quorum configuration options available in the
Configure Cluster Quorum Wizard and Windows
PowerShell include:
• Use typical settings
• Add or change the quorum witness
• Advanced quorum configuration and witness selection
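The same options are available from Windows PowerShell. A minimal sketch, assuming a hypothetical file share \\LON-DC1\FSW and node LON-SVR4:

```powershell
# Show the current quorum configuration and each node's vote
Get-ClusterQuorum
Get-ClusterNode | Format-Table Name, State, NodeWeight, DynamicWeight

# Add or change the quorum witness (here, a file share witness)
Set-ClusterQuorum -FileShareWitness "\\LON-DC1\FSW"

# Advanced configuration: remove the vote from one node
(Get-ClusterNode -Name "LON-SVR4").NodeWeight = 0
```

NodeWeight is the vote that an administrator assigns, while DynamicWeight is the vote that dynamic quorum currently gives the node.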
Dynamic quorum and quorum-configuration
considerations
• Dynamic quorum management:
• Failover cluster dynamically manages the vote assignment to nodes
• Allows for a cluster to run on the last surviving cluster node
• Cannot survive a simultaneous failure of a majority of voting nodes
• If you explicitly remove a vote from a node, the cluster cannot
dynamically add or remove that vote
• Quorum configuration considerations include:
• Validating the quorum configuration by using the Validate a
Configuration Wizard or the Test-Cluster Windows PowerShell
cmdlet
• Changing the quorum configuration only in specific scenarios:
• Adding or evicting nodes
• A node or witness has failed and cannot be recovered quickly
• Recovering a cluster in a multisite disaster recovery scenario
Demonstration: Configuring the quorum

In this demonstration, you will learn how to
configure a quorum
Lab A: Implementing failover clustering

• Exercise 1: Creating a failover cluster
• Exercise 2: Verifying quorum settings and adding a
node
Logon Information
Virtual machines: 20740C-LON-DC1
20740C-LON-SVR1
20740C-LON-SVR2
20740C-LON-SVR3
20740C-LON-SVR4
20740C-LON-CL1
User name: Adatum\Administrator
Password: Pa55w.rd

Estimated Time: 45 minutes


Lab Scenario

Adatum Corporation is looking to ensure that its
critical services, such as file services, have better
uptime and availability. You decide to implement
a failover cluster with file services to provide
better uptime and availability.
Lab Review

• What information do you need for planning a
failover-cluster implementation?
• After running Validate a Configuration Wizard,
how can you resolve the network communication’s
single point of failure?
• In which situations might it be important to
enable failback of a clustered application during a
specific time?
Lesson 3: Maintaining a failover cluster

• Monitoring failover clusters
• Backing up and restoring failover-cluster
configuration
• Maintaining failover clusters
• Managing cluster-network heartbeat traffic
• What is Cluster-Aware Updating?
• Demonstration: Configuring CAU
Monitoring failover clusters

Tools you can use to monitor clusters include:
• Event Viewer
• Tracerpt.exe
• MHTML-formatted cluster configuration reports
• Performance and Reliability Monitor snap-in
Backing up and restoring failover-cluster configuration

• When backing up failover clusters, remember that:
• Windows Server Backup is a Windows Server 2016 feature
• Non-Microsoft tools are available to perform backups and restores
• You must perform system-state backups
• A nonauthoritative restore completely restores a single
node in the cluster
• An authoritative restore restores the entire cluster
configuration to a point in time
Maintaining failover clusters

Failover cluster troubleshooting techniques include:
• Using the Validate a Configuration Wizard
• Reviewing events in logs (cluster, hardware, storage)
• Defining a process for troubleshooting failover clusters
• Reviewing storage configuration
• Checking for group and resource failures
Managing cluster-network heartbeat traffic

• Types of network monitoring:
• Aggressive
• Relaxed

• Network-monitoring parameter settings:
• Delay
• Threshold

• Windows PowerShell cmdlet examples:
Get-Cluster | fl *subnet*
(Get-Cluster).SameSubnetThreshold=10
What is Cluster-Aware Updating?

• Automated feature in Windows Server 2016
• Updates nodes in a cluster with minimal or no
downtime
• Benefits:
• Updating is automatic
• Can be scheduled
• No downtime
How CAU works

CAU works in two modes:
• Remote updating mode:
• Configure a separate computer (for example, one that runs
Windows 8.1) as an orchestrator
• Install the failover-clustering administrative tools
• Ensure that the orchestrator computer is not a cluster
member
• Self-updating mode:
• Configure the CAU clustered role as a workload
• Ensure that there is no dedicated orchestrator computer
• Remember that the cluster updates itself
• View an Updating Run in progress with the Get-CauRun
cmdlet
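Both modes can be driven from the ClusterAwareUpdating module. The sketch below uses a hypothetical cluster name and an example schedule:

```powershell
# Self-updating mode: add the CAU clustered role with a monthly schedule
Add-CauClusterRole -ClusterName "Cluster1" -DaysOfWeek Tuesday -WeeksOfMonth 2 -EnableFirewallRules -Force

# Remote updating mode: start an Updating Run from an orchestrator computer
Invoke-CauRun -ClusterName "Cluster1" -Force

# Check the status of an Updating Run in progress
Get-CauRun -ClusterName "Cluster1"
```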
Demonstration: Configuring CAU

In this demonstration, you will learn how to
configure CAU
Lesson 4: Troubleshooting a failover cluster

• Communication issues
• Repairing the cluster name object in AD DS
• Starting a cluster with no quorum
• Demonstration: Reviewing the Cluster.log file
• Monitoring performance with failover clustering
• Using Event Viewer with failover clustering
• Windows PowerShell troubleshooting cmdlets
Communication issues

• The following might cause communications issues
in failover clustering:
• Network latency
• Network failures
• Network-adapter driver issues
• Firewall rules (permit heartbeats on UDP port 3343)
• Security software

• You can use the Get-ClusterLog cmdlet to
generate the Cluster.log file for troubleshooting.
You can find this file in
C:\Windows\Cluster\Reports
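A minimal sketch of generating the log follows. The destination folder is an assumption and must already exist:

```powershell
# Generate Cluster.log on every node in the default location
Get-ClusterLog

# Collect the last 30 minutes from all nodes into one folder, using local time stamps
Get-ClusterLog -Destination "C:\Temp" -TimeSpan 30 -UseLocalTime
```

Limiting the time span keeps the files small enough to read around the moment of the failure.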
Repairing the cluster name object in AD DS

• The CNO repair process:
• Use the Repair Active Directory Object option in
Failover Cluster Manager
• You must have Reset Password permissions on the CNO
computer object
• The VCO repair process:
• Use the AD Recycle Bin feature to recover deleted
computer objects, and use the Repair function as the
last recovery action
• The CNO will reset the password and heal itself
automatically
• The CNO must have Create Computer Objects
permissions on the VCO’s OU
Starting a cluster with no quorum

• Cluster nodes must retain the quorum for the cluster
to work
• If the quorum is lost, try to reestablish the quorum
• If you cannot reestablish the quorum during an
extended period, start the cluster in ForceQuorum mode
by using Start-ClusterNode with the -FQ switch
• After you start the cluster in ForceQuorum mode,
other nodes can rejoin the cluster
• Once the quorum is reestablished, cluster mode
changes from ForceQuorum to normal automatically
• When joining nodes to a cluster that runs in ForceQuorum
mode, start the other nodes with the setting that prevents
quorum, by using Start-ClusterNode with the -PQ switch
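The recovery sequence above might look like this from Windows PowerShell. The node names are hypothetical, and forcing quorum should only follow a failed attempt to restore quorum normally:

```powershell
# On the surviving node: start the cluster without quorum
Start-ClusterNode -Name "LON-SVR2" -FQ

# On each node that rejoins: start the service but prevent it from forming its own quorum
Start-ClusterNode -Name "LON-SVR3" -PQ

# Confirm node state and votes once the nodes are back
Get-ClusterNode | Format-Table Name, State, NodeWeight, DynamicWeight
```

The node started with -FQ treats its copy of the cluster configuration as authoritative, so choose the node with the most recent configuration.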
Demonstration: Reviewing the Cluster.log file

In this demonstration, you will learn how to review
the Cluster.log file
Monitoring performance with failover clustering

Some of the failover clustering performance
counters include:
• Cluster Network Messages
• Cluster Network Reconnections
• Global Update Manager
• Database
• Resource Control
• API
• Cluster Shared Volumes
• Cluster Shared Volumes is a storage architecture that is optimized
for Hyper-V virtual machines. Its counters include IO Read
Bytes, IO Reads, IO Write Bytes, and IO Writes.
Using Event Viewer with failover clustering

Events that are displayed in Event Viewer and require you
to troubleshoot clusters include:
• Cluster resource in clustered service or application failed
• Cluster network interface for cluster node on network
failed
• File share witness resource failed to arbitrate for the file
share
• Cluster node was removed from the active failover cluster
membership
• The Cluster service failed to bring clustered service or
application completely online or offline
• Cluster network name resource failed registration of one
or more associated DNS names
• Cluster network name resource cannot be brought online
Windows PowerShell troubleshooting cmdlets

Common cmdlets for troubleshooting failover
clustering include:
• Get-Cluster
• Get-ClusterAccess
• Get-ClusterDiagnostics
• Get-ClusterGroup
• Get-ClusterLog
• Get-ClusterNetwork
• Get-ClusterResourceDependencyReport
• Get-ClusterVMMonitoredItem
• Test-Cluster
• Test-ClusterResourceFailure
Lesson 5: Implementing site high availability with
stretch clustering

• What is a stretch cluster?
• Prerequisites for implementing a stretch cluster
• Synchronous and asynchronous replication
• Overview of the Storage Replica feature
• Demonstration: Implementing server-to-server
storage replica
• Selecting a quorum mode for a stretch cluster
• Configuring a stretch cluster
• Challenges for deploying a stretch cluster
• Multisite failover and failback considerations
What is a stretch cluster?
A stretch cluster is a cluster that has been extended so that
different nodes in the same cluster reside in separate
physical locations. It provides highly available services in more
than one location.
A stretch cluster is a configuration that has one cluster with
nodes in two locations and storage in both locations.

[Figure: Site A and Site B, each with its own SAN. The sites cannot share a disk.]
Prerequisites for implementing a stretch cluster

To implement a stretch-failover cluster:
• Plan for additional hardware to support enough nodes on
each site
• Ensure that the same operating systems and service
packs are installed on each node
• Include at least one low-latency and reliable network
connection between sites
• Configure a storage replication mechanism
• Failover clustering does not provide a storage-replication
mechanism.
• Configure storage infrastructure services on each site
• Ensure that AD DS and DNS are available on the second site
Synchronous and asynchronous replication
• In synchronous replication, the host receives a write complete
response from the primary storage after the data is written
successfully to both storage locations
• In asynchronous replication, the host receives a write complete
response from the primary storage after the data is written
successfully on the primary storage
[Figure: a write request reaches primary storage at Site A, the data is replicated to secondary storage at Site B, and a write complete response returns to the host.]
The Storage Replica feature provides synchronous or asynchronous
block-level replication that is independent of whatever vendor
storage is at each location
Overview of the Storage Replica feature

• Use for disaster recovery or preparedness
• Configure via Failover Cluster Manager or
Windows PowerShell
• The three replication scenarios are:
• Stretch cluster
• Server-to-server
• Cluster-to-cluster

• Replicates synchronously or asynchronously
• Requires Windows Server 2016 or Windows Server 2019
Datacenter edition for unrestricted replication; Windows
Server 2019 Standard edition supports volumes of up to 2 TB
• Requires GPT-initialized disks
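A server-to-server partnership can be sketched with the StorageReplica cmdlets. Everything named here is an assumption: two servers (LON-SVR1 and LON-SVR4), each with a data volume M: and a log volume N:, and an existing C:\Temp folder for the report.

```powershell
# Check that both servers meet the Storage Replica requirements
Test-SRTopology -SourceComputerName "LON-SVR1" -SourceVolumeName "M:" -SourceLogVolumeName "N:" `
    -DestinationComputerName "LON-SVR4" -DestinationVolumeName "M:" -DestinationLogVolumeName "N:" `
    -DurationInMinutes 5 -ResultPath "C:\Temp"

# Create the replication partnership (synchronous in this sketch)
New-SRPartnership -SourceComputerName "LON-SVR1" -SourceRGName "RG01" `
    -SourceVolumeName "M:" -SourceLogVolumeName "N:" `
    -DestinationComputerName "LON-SVR4" -DestinationRGName "RG02" `
    -DestinationVolumeName "M:" -DestinationLogVolumeName "N:" `
    -ReplicationMode Synchronous

# Review the replication groups and their status
Get-SRGroup
```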
Storage Replica

• Synchronous replication
• Asynchronous replication

Replication modes by scenario:
• Hyper-V stretch cluster supports synchronous
replication only
• Server-to-server supports both synchronous and
asynchronous replication
• Cluster-to-cluster supports synchronous
replication only
Demonstration: Implementing server-to-server
storage replica

In this demonstration, you will learn how to
configure storage replica
Selecting a quorum mode for a stretch cluster

• File-share witness:
• Requires three or more datacenter locations
• Is available in Windows Server 2012 R2 and
Windows Server 2016
• Azure Cloud Witness:
• Requires two datacenter locations
• Requires Internet connection for all nodes
• Is available in Windows Server 2016 only

• No witness:
• Is not recommended
• Manual failover (disaster-recovery site)
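Configuring an Azure Cloud Witness takes a single cmdlet call. The storage account name and access key below are placeholders that you would replace with values from your own Azure storage account:

```powershell
# Placeholders: supply your own storage account name and access key
Set-ClusterQuorum -CloudWitness -AccountName "<StorageAccountName>" -AccessKey "<StorageAccountAccessKey>"

# Confirm the witness that the cluster now uses
Get-ClusterQuorum
```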
Configuring a stretch cluster

Site-aware failover-cluster services provide:
• Failover affinity
• Fail over to a node on the same site
• Cross-site heartbeating
• Determine the heartbeat setting for the same site nodes
• Preferred site configuration
• (Get-Cluster).PreferredSite = 1
• This allows you to identify what nodes the roles should attempt to
bring online first
To set site-aware clustering:
(Get-ClusterNode Node1).Site=1
(Get-ClusterNode Node2).Site=1
(Get-ClusterNode Node3).Site=2
(Get-ClusterNode Node4).Site=2
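The Site node property shown above appears to come from pre-release builds. On released Windows Server 2016 builds, site membership is usually expressed through cluster fault domains instead. The following is a sketch under that assumption, with hypothetical site names:

```powershell
# Define two sites as fault domains (site names are hypothetical)
New-ClusterFaultDomain -Name "SiteA" -Type Site
New-ClusterFaultDomain -Name "SiteB" -Type Site

# Assign each node to its site
Set-ClusterFaultDomain -Name "Node1" -Parent "SiteA"
Set-ClusterFaultDomain -Name "Node2" -Parent "SiteA"
Set-ClusterFaultDomain -Name "Node3" -Parent "SiteB"
Set-ClusterFaultDomain -Name "Node4" -Parent "SiteB"

# Prefer SiteA when roles are brought online
(Get-Cluster).PreferredSite = "SiteA"

# Review the resulting site layout
Get-ClusterFaultDomain
```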
Challenges for deploying a stretch cluster

When deploying stretch clusters:
• Ensure that the business requirements are met
• Use storage replication between sites:
• Hardware vendor (Windows Server 2012 R2 or earlier)
• Storage Replica (Windows Server 2016)
• Choose the correct quorum witness to properly
maintain functionality in the event of failures
• Choose a storage-replication solution that meets
your replication requirements
Multisite failover and failback considerations
When implementing stretch clusters in disaster-recovery scenarios, consider:
• Failover time
• Decide how long you should wait before you declare a disaster
• Services for failover
• Clearly define which critical services should fail over to another site
• Quorum maintenance
• Choose a quorum model so that each site has enough votes to maintain
cluster functionality
• Storage connection
• Storage available at each site
• Published services and name resolution
• Procedure for changing internal and external DNS records
• Client connectivity to another site
• Failback procedure
Lab B: Managing a failover cluster

• Exercise 1: Evicting a node and verifying quorum settings
• Exercise 2: Changing the quorum from disk witness to
file-share witness and defining node voting
• Exercise 3: Verifying high availability
Logon Information
Virtual machines: 20740C-LON-DC1
20740C-LON-SVR1
20740C-LON-SVR2
20740C-LON-SVR3
20740C-LON-SVR4
20740C-LON-CL1
User name: Adatum\Administrator
Password: Pa55w.rd
Estimated Time: 45 min
Lab Scenario

Adatum Corporation recently implemented
failover clustering for better uptime and
availability. The implementation is new and your
boss has asked you to go through some failover-cluster
management tasks so that you are
prepared to manage it moving forward.
Lab Review

• Why would you evict a cluster node from a failover
cluster?
• Do you perform failure-scenario testing for your
highly available applications based on Windows
Server failover clustering?
Module Review and Takeaways

• Review Questions
• Real-world Issues and Scenarios
• Tools
• Best Practices
• Common Issues and Troubleshooting Tips
