vCenter Server Heartbeat

vCenter Server Heartbeat, from here on shortened to vCSHB, is a Windows application
that protects vCenter Server and its associated components and applications with a true
high availability solution. There are many technologies that offer some type of protection
for vCenter Server, but today, vCSHB is unique in its level of protection. The focus of
this chapter will be to understand at a high level how vCSHB works to protect vCenter
Server. Before we can talk about how vCSHB is built and how it works, though, we
should understand why we need it in the first place. We’ll do that by first exploring the
effects of an unavailable vCenter Server, what options exist to protect vCenter Server,
their shortcomings, and how vCSHB provides high availability in a way they can’t.
We’ll also look at how vCSHB is commonly deployed and take a look at the interfaces
used to manage it. These topics are outlined in the following sections.

• Finding a use case for vCSHB
• Options for vCenter high availability
• vCSHB architecture and functional overview
    o LAN and WAN deployments
    o Identities and roles
    o vCSHB networks
    o Fail over scenarios
    o Application and service protection
• Common installation options
• vCSHB client and console walkthroughs

The next section talks about why an environment may need vCenter Server to be highly
available. In other words, it helps define a use case for vCSHB because not every
vCenter Server needs to be highly available. It also discusses the implications of vCenter
Server being down.

Finding a use case for vCSHB
For some companies, the availability of their vCenter Server has become as much a
priority as their traditional tier-1 applications, such as databases, messaging systems, and
line of business applications. This isn’t the case for everyone, though. There are small
and medium-sized businesses running vSphere who may only have one person notice if
vCenter Server was down – perhaps their lone IT administrator – and the rest of the
company would literally continue business as usual. It doesn’t have to be a small or
medium business that doesn’t have a use case for vCSHB. I’ve seen $10 billion banks
that don’t require vCenter Server to be highly available. A good use case for vCSHB,
though, may be the multinational conglomerate that employs dozens of VMware
Administrators around the world. There will likely be several vCenter Servers managing
hundreds or thousands of ESXi hosts and many thousands of virtual machines (VMs),
plenty of which are considered Tier-1 applications. If a particular vCenter Server is
down, perhaps a critical service like vSphere Auto Deploy now doesn’t work and
operations are hindered. The use cases for vCenter Server high availability don’t
necessarily revolve around how many employees a company has or how much revenue
they bring in; rather they revolve around the impact that not having vCenter Server
available will have on the business and how much a company can tolerate that impact. In
business, this is spoken about in terms of service levels and recovery objectives,
specifically Service Level Agreements (SLA), Recovery Point Objectives (RPO), and
Recovery Time Objectives (RTO).
An SLA is an agreement between a service provider and a service consumer that defines
the availability and responsiveness of a particular resource. An example of this might be
that the IT department has agreed to ensure that vCenter Server will be available
99.999% of the time. This might be important to the developers’ group because they use
vCloud Director to provision virtual machines for testing nightly builds. Since virtual
machines can’t be provisioned in vCloud Director without vCenter Server, if it’s not
available, nightly builds cannot be completed and ultimately, project timelines are pushed
back.
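To put an availability figure like 99.999% in perspective, it helps to convert it into a downtime budget. The short Python sketch below does that arithmetic. The function name and the 365-day year are my own assumptions for illustration; a real SLA would define its own measurement window.

```python
# Illustrative only: converts an availability target into a yearly downtime budget.
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year, ignoring leap years


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99, 99.999):
    print(f"{target}% -> {allowed_downtime_minutes(target):.2f} minutes per year")
```

Under these assumptions, "five nines" leaves a little over five minutes of downtime per year, which is why such a target rules out any protection scheme that relies on rebooting a server.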
An RPO defines how much data loss is acceptable, usually measured in minutes or hours,
in the event a disaster occurs. This could be a little “d” disaster, like tripping over the
power cords and accidentally unplugging a blade chassis, or it could be a big “D” disaster
that wipes out an entire data center, like a hurricane or flood or earthquake. Backup,
snapshot, and replication intervals are closely tied to RPOs. For vCenter Server, perhaps
a particular RPO of one hour is acceptable to a business because they know they’d mostly
be losing performance data and backups are on a reliable, rotating schedule and are
copied offsite.
Finally, an RTO defines how long it takes to restore all or individual services. If vCenter
does not need to be highly available, perhaps an RTO of 4 or 8 hours is sufficient because
a complete rebuild is planned if vCenter Server is lost. For others, loss of vCenter
services would mean millions of dollars in lost revenue so it must not incur any
downtime, whatsoever. An RTO of seconds, then, is required, and even then, those
seconds are used to determine if the service is available. If the service is not available,
vCenter Server may fail over to another instance.
Once a company has identified their SLA, RPO, and RTO requirements, it can determine
how best to achieve them by tying them together into a use case. While this book focuses
on vCSHB, it will help to understand the different use cases for other vCenter Server
availability solutions in addition to vCSHB. To begin, let’s look at some of the services
vCenter Server offers and how they’re affected when vCenter Server is not available.
First and foremost, vCenter Server offers the main management interface into a VMware
vSphere environment. It can manage hundreds of ESXi hosts and thousands of virtual
machines that otherwise would have to be managed on an individual ESXi host basis.
Right away, we see the benefit of the centralized management vCenter Server affords.
vCenter Server is the default, main collector of performance metrics pulled from the
environment. It also provides the default alarm system for performance, capacity, and
configurations. If vCenter Server is not available, it’s not gathering performance data
and it’s not evaluating thresholds or sending emails or SNMP traps for alarms.
There are several high-profile functions that don’t work if vCenter Server is offline.
These include the Distributed Resource Scheduler (DRS), Storage Distributed
Resource Scheduler (SDRS), vMotion, and Storage vMotion. DRS and SDRS
calculations that detect imbalances in host memory and CPU utilization, datastore
capacity, and datastore I/O latency are performed by vCenter Server based on input from
the hosts. If vCenter Server is down, these functions are unavailable.
vMotion and Storage vMotion perform live migrations of virtual machines between hosts
and datastores, respectively. DRS and SDRS rely on these functions in order to complete
their tasks. Without vCenter Server, none of these functions work. Together, these
functions are ultimately responsible for load-balancing workloads to avoid performance
problems or to aid in maintenance. Without vCenter Server, workloads cannot be
balanced and performance may suffer. If virtual machines can’t be moved around,
maintenance may become more difficult.
Many vSphere administrators know that vSphere High Availability (HA) still functions
without vCenter Server because, once configured, it’s only communication between ESXi
hosts that determines which virtual machines are restarted and where. But one HA feature
that is managed by vCenter Server is the Admission Control Policy. This policy ensures
enough resources are reserved in a vSphere Cluster to boot virtual machines in the case of
an HA event. If vCenter Server is down, this policy cannot be enforced.
Finally, there are many additional servers from VMware and third parties that rely on
vCenter Server being operational in order to perform their tasks. These include vCenter
Operations Manager, VMware Horizon View, vCloud Director, vCloud Automation
Center, vCenter Orchestrator, and third-parties such as Veeam Backup and Replication
for business continuity and disaster recovery or Quest vFoglight for performance and
capacity management.

Options for vCenter high availability
There are many ways to protect vCenter Server and they each have their advantages and
disadvantages. They each certainly have differing levels of high availability. In order to
understand where vCSHB fits into the mix, let’s look at some of the options, how they
might be useful, and where they fall short of true high availability.
• vSphere High Availability
The most obvious choice for an availability option is usually vSphere HA if vCenter
Server is running in a virtual machine. If vCenter Server is installed in a physical
machine, HA is of no help to you. HA is the feature that restarts virtual machines on a
surviving ESXi host if a particular host experiences a failure. This could be a hardware
failure, loss of network or storage connectivity, or a purple diagnostic screen, colloquially
known as the Purple Screen of Death (PSOD).
What distinguishes vSphere HA is that the entire VM is restarted if an ESXi host fails.
Some would argue that the name HA is not warranted because of that – the virtual
machine is restarted and is unavailable while it boots on another host. So the downtime
incurred by any virtual machines that were running on the failed host is the time it takes
them to boot and become fully operational on surviving hosts. This is perhaps not the
high availability you’re looking for.
This kind of downtime is usually measured in minutes. Even so, HA is almost always enabled in a
vSphere cluster. If there were no HA, it might become a monumental task to find which
virtual machines failed and to power them all back on if vCenter Server was one of them.
vSphere HA is enabled so often because it’s so useful; it works to minimize downtime
even when you’re off the clock. I’ve had plenty of calls after business hours to check on
an environment after a host failure. Usually, by the time I logged into the environment,
HA had already restarted the virtual machines and I was in diagnostic mode, trying to
determine the cause of the host failure.


One further downside to relying on HA to keep vCenter Server available is its lack of
restart priorities. If HA restarts vCenter Server and its database server when they’re
installed on separate VMs, it’s only coincidence if they come back online in the correct
order: database first, then vCenter. For environments that can’t withstand this amount of
downtime or risk to their vCenter Server, HA is of little help.
• VM and Application Monitoring
VM and Application Monitoring are closely tied to vSphere HA and operate in a similar
fashion, relying on heartbeats to determine if a service is available and restarting it if it
fails. VM Monitoring is disabled by default in a vSphere HA cluster. When enabled, the
HA agent on the ESXi host will receive heartbeats from the VMware Tools installed on
the virtual machine. If the heartbeat stops, the HA agent checks to see if disk or network
activity has occurred from the virtual machine in the last two minutes. If it has not,
the virtual machine is restarted. If I/O has occurred and VMware Tools is not
responding, the virtual machine is considered active and is not restarted. VM Monitoring
can protect vCenter Server in this way but again, the virtual machine is restarted and
downtime is measured in minutes. The drawback is VM Monitoring offers no application
awareness. VM Monitoring offers protection from operating system faults only. So if
vCenter Server suffered from application or service issues, VM Monitoring can’t help.
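The VM Monitoring decision described above boils down to two checks. The following Python sketch models that logic only; it is not VMware's implementation, and the class and function names are hypothetical.

```python
from dataclasses import dataclass

IO_WINDOW_SECONDS = 120  # the two-minute I/O check described above


@dataclass
class VmState:
    heartbeat_received: bool       # VMware Tools heartbeat seen by the HA agent
    seconds_since_last_io: float   # time since the last disk or network activity


def should_restart(vm: VmState) -> bool:
    """Restart only when heartbeats have stopped AND there was no recent I/O."""
    if vm.heartbeat_received:
        return False
    return vm.seconds_since_last_io > IO_WINDOW_SECONDS


print(should_restart(VmState(heartbeat_received=False, seconds_since_last_io=30)))
print(should_restart(VmState(heartbeat_received=False, seconds_since_last_io=300)))
```

Notice that nothing in this logic looks at the state of the vCenter Server services themselves, which is exactly the application-awareness gap mentioned above.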
Application monitoring fills the gap in VM Monitoring by allowing third parties like
Veritas/Symantec to write applications that can monitor vCenter Server and database
services. If services become unresponsive to the application monitor, those services can
be restarted first before restarting the virtual machine. Here again, downtime to vCenter
Server can still be measured in minutes if a virtual machine has to be restarted, possibly
less so if only services are restarted. Even here, though, if the vCenter management
services or database services are restarted, vCenter is not available during that time.
As part of vSphere 5.5, VMware released its own application monitoring tool called App
HA. In its initial release, it can only protect the vCenter Server database services, not
vCenter Server itself.
• VMware Fault Tolerance
I’ll say very little about the possibility of Fault Tolerance (FT) protecting vCenter Server
because as of vSphere 5.5, it only supports single-vCPU virtual machines. You rarely
find vCenter Servers with only one vCPU. They’re usually in test labs or small
deployments that don’t require high availability in the first place.
Lack of support for multiple vCPUs is likely to change in an upcoming release,
though (VMware shared a technical preview of Symmetric Multi-Processor
Fault Tolerance (SMP-FT) at VMworld 2012).


Again, if vCenter Server is not virtualized, FT is of no use to you.
FT uses two virtual machines, a primary and a secondary. A VMware proprietary
technology called vLockstep synchronizes the vCPUs of each virtual machine by sending
the secondary the exact same CPU instructions the primary executes. FT protects against
ESXi host failures only and can seamlessly failover from primary to secondary with no
data loss. While the technology has huge potential, it’s not used very much today
because of the single-vCPU limit. The other drawback is that it doesn’t support VMware
Snapshots, which matters because of the old adage garbage in, garbage out. If the primary
virtual machine has a virus, or if its operating system crashes, the exact same result will
occur on the secondary virtual machine. With no Snapshots on either the primary or
secondary, the capability to roll back to known-good states is degraded. Look for SMP-FT to
increase adoption of this technology, though, especially if Snapshots are supported.
• Database clustering
Prior to vSphere 5.5, VMware did not officially support clustering solutions like
Microsoft Cluster Services (MSCS), Windows Server Failover Clustering (WSFC),
AlwaysOn Availability Groups, or Veritas Cluster Services (VCS), to protect the
vCenter Server database. It would, however, offer best-effort support, which meant that
it would help troubleshoot issues related to vCenter Server availability. If it was
determined that the likely cause of the issue was due to the clustering solution, they
would recommend contacting the clustering vendor. With vSphere 5.5, VMware now
fully supports the legacy MSCS/WSFC. Of course, this means the database server is
protected and redundant, but the vCenter Server is not. If a SQL Server went down that
was part of the cluster, fail over would be fairly quick as services were restarted on a
surviving cluster member, but if the vCenter Server itself went down, you would lose all
access to it.
The go-to article for supported Microsoft clustering is VMware KB 1037959:
Microsoft Clustering on VMware vSphere: Guidelines for supported

Cluster failover times can be measured in seconds to minutes. The major drawback to
this, though, is how complex it is to set up and maintain.
For an excellent blog series on configuring WSFC in SQL Server 2012 on Windows
Server 2012, which can be used to protect a vCenter Server database, see Derek
Seaman’s Blog


• SQL Server Log Shipping

While not a database clustering solution, SQL Server Log Shipping can be used to
replicate vCenter Server database changes to a stand-by database server. If the active
database server fails, via a manual process, vCenter could be pointed to the new database
server. The downtime associated with this is easily measured in 30-minute increments
because of the manual processes involved. Like other database-only protection schemes,
the vCenter Server is not protected by Log Shipping.
• SQL Server Mirroring
SQL Mirroring allows for quicker, automatic failover and much less downtime than Log
Shipping. Mirroring has the advantage over clustering solutions that require shared
storage because virtual machines participating in SQL Server Mirroring can be
vMotioned and DRS can move them around a vSphere cluster automatically. The
drawback with Mirroring is that the vCenter Server itself is still not protected.
• Backups and clones
The final way to protect vCenter Server is to perform regular backups. This type of
protection should already be in place in most environments but is probably the least
effective way to provide high availability. In the worst case, restore operations can be
measured in hours or days. The key component to back up is, of course, the vCenter
Server database. This is easily accomplished through native SQL Maintenance Plans but
there are third-party tools available. Note that this discussion is from the perspective of
the vSphere Administrator, not a Database Administrator. If the reader wears both hats,
Maintenance Plans will likely suffice. If a Database Administrator is responsible for
backing up the vCenter Server database, they’ll likely perform some tasks differently. The
important thing is to ensure the backups are performed and verified.
For an introduction to SQL Maintenance Plans for the vSphere Administrator,
see my blog post on the subject

As with other SQL-centric approaches to protection, the vCenter Server itself is not
protected. That’s where cloning and snapshots come in. A clone or snapshot can be
taken on a schedule and reverted to in case a restore is needed. Cloning works with
physical vCenter Servers, as well. Clones can be created of a physical vCenter Server by
using VMware vCenter Converter. This process will non-disruptively clone the physical
server into a virtual machine. Once in virtual machine format, it can be backed up or
moved like any other virtual machine. If the physical vCenter Server ever fails, the clone
virtual machine can be powered on and immediately take the place of the original.
If vCenter Server is already a virtual machine, VMware Converter can clone it, as well.
But an easier way to protect a virtual vCenter Server is to use the built-in cloning
function of vSphere itself. Just right-click the vCenter Server virtual machine and choose
Clone. This can serve as a backup in case the original is lost. The cloning process for
both VMware Converter and vSphere cloning is substantial enough, in terms of
resources required, that it might only be reasonable to clone the server once or twice a day.
Of course, this means that any changes made after the last clone operation and before the
next clone will be lost if there’s a failure.
The final way to back up a vCenter Server is to take snapshots. There are two kinds of
snapshots, VMware Snapshots and storage array snapshots. In the case of VMware
Snapshots alone, these aren’t true backups. They should be used for short periods of time
only and deleted when no longer needed. Leaving VMware Snapshots for long periods
of time can cause the Snapshots to grow very large and cause problems during their
deletion or when committing them. Performance problems can also arise when running
off large Snapshot delta files. As a long term protection strategy, VMware Snapshots
should not be used. They’re really only useful when changes are made to the system, like
before Microsoft updates are applied or before an upgrade is performed.
For an overview of best practices for VMware Snapshots, see VMware KB
article 1025279

The second type of snapshot is taken at the storage array level. These snapshots are
managed by the storage array itself so the vSphere layer is usually unaware of them.
Storage array vendor tools can help automate the work required to take advantage of
these snapshots in a restore situation. Otherwise, the datastores in which the virtual
machine snapshots exist can be mounted on an ESXi host and the snapshot restored.
These snapshots are often replicated to long term secondary storage for archival and
disaster recovery purposes. Unlike VMware Snapshots, storage snapshots can, indeed, be
a part of a backup strategy. VMware Snapshots can even be included in storage array
snapshots so virtual machines have file system consistent backups.

vCSHB architecture and functional overview
vCSHB is different from the above technologies because it offers a live fail over host for
both vCenter Server and its database. It also offers protection at several different levels,
including hardware, operating system, application, network, and even performance. In
addition, it protects some services and their databases that are routinely installed on the
vCenter Server, like VMware View Composer and Update Manager. In this section, we
will describe the architecture of vCSHB and explain how it protects these applications
and services from all these different failure domains.

LAN and WAN deployments
At a high level, vCSHB protects vCenter Server in two scenarios: High Availability in a
Local Area Network (LAN) or Disaster Recovery (DR) over a Wide Area Network
(WAN). A LAN deployment requires the protected server to share a subnet with its
cloned instance. When vCSHB is deployed in a LAN, each instance will share a Public
IP address to which clients connect. Clients, in this case, could mean the vSphere Client
or Web Client if the protected server is the vCenter Server, or SQL connections if the
protected server is a SQL Server. Protection in a LAN deployment is focused on the loss
of the server itself, perhaps a hardware failure or operating system crash. A stretched
layer 2 or stretched VLAN configuration is also considered a LAN deployment with
vCSHB because subnets are shared between vCSHB instances.
A WAN deployment requires each server to be in a different subnet. Each instance, then,
will have its own Public IP address, only one of which clients will use to connect.
Protection in a WAN deployment still protects against everything a LAN deployment
would, but it adds site resiliency in the case of a complete site failure. How this works,
exactly, will be discussed later.
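The LAN versus WAN distinction comes down to whether the two servers share a subnet. As a rough sketch of that rule, the Python snippet below compares two interface addresses using the standard `ipaddress` module. The addresses and the function name are made up for illustration and are not part of vCSHB.

```python
import ipaddress


def deployment_type(primary_cidr: str, secondary_cidr: str) -> str:
    """Return "LAN" if both interfaces sit in the same subnet, otherwise "WAN"."""
    primary = ipaddress.ip_interface(primary_cidr)
    secondary = ipaddress.ip_interface(secondary_cidr)
    return "LAN" if primary.network == secondary.network else "WAN"


# Hypothetical Management IP addresses for the two servers.
print(deployment_type("192.168.10.11/24", "192.168.10.12/24"))
print(deployment_type("192.168.10.11/24", "10.20.30.12/24"))
```

A stretched VLAN would fall into the first case even if the servers are in different buildings or cities, because the subnet is still shared.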

Identities and roles
To begin, vCSHB starts with two identical machines, or nearly identical in the case of a
physical server. Once vCSHB is installed on these machines, one is considered the
Primary and the clone is considered the Secondary. These identities now stay with the
servers throughout their lifecycle. The first is always the Primary and the second is
always the Secondary. Whichever server is actively servicing clients is considered the
Active node and whichever server is not servicing clients is the Passive node. The
Active and Passive roles move between the Primary and Secondary. During normal
operations, the Primary server holds the Active role and Secondary server holds the
Passive role. When a fail over or switch over occurs, the Primary is demoted to the
Passive server and the Secondary is promoted to the Active server. This is demonstrated
in the figure below.
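The difference between a fixed identity and a moving role can also be sketched in a few lines of Python. This is only a model of the concept with hypothetical names; it says nothing about how vCSHB stores this state internally.

```python
from dataclasses import dataclass


@dataclass
class Node:
    identity: str  # "Primary" or "Secondary"; fixed for the life of the server
    role: str      # "Active" or "Passive"; moves between the two servers


def swap_roles(a: Node, b: Node) -> None:
    """Exchange the Active and Passive roles; identities are never touched."""
    a.role, b.role = b.role, a.role


primary = Node("Primary", "Active")
secondary = Node("Secondary", "Passive")
swap_roles(primary, secondary)  # a fail over or switch over
print(primary)
print(secondary)
```

After the swap, the Primary server is still the Primary, but it now holds the Passive role.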


vCSHB networks
Part of the magic behind how vCSHB provides high availability with live fail over hosts
is how it utilizes the networking between them. This section will explain what those
networks are and how they’re implemented.

VMware Channel network
The Primary and Secondary servers are kept in a synchronized state by replicating data
over a dedicated network called the VMware Channel. The VMware Channel should
exist on a dedicated subnet that only allows vCSHB replication and heartbeat traffic.
Each node receives a unique IP address on this channel that it uses to send and receive
replicated data. In a stretched layer 2 or stretched VLAN configuration, the subnet is
shared between sites, as well. In a WAN deployment, though, a unique subnet will exist
at each site and traffic will be routed between them. In this case, a static route must be
added to each server because a default gateway is not specified on this interface.
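As an illustration of the static route requirement, the sketch below builds a persistent Windows `route` command for the remote site's VMware Channel subnet. The subnets, the gateway address, and the helper name are assumptions of mine, and the vCSHB documentation may prescribe a different procedure, so treat this as a sketch rather than a recipe.

```python
import ipaddress


def channel_route_command(remote_subnet: str, local_gateway: str) -> str:
    """Build a persistent Windows route command for the remote channel subnet."""
    network = ipaddress.ip_network(remote_subnet)   # validates the subnet
    gateway = ipaddress.ip_address(local_gateway)   # validates the gateway
    return f"route -p add {network.network_address} mask {network.netmask} {gateway}"


# Hypothetical values: the remote site's channel subnet and the local router.
print(channel_route_command("10.2.0.0/24", "10.1.0.1"))
```

Each server would need the mirror image of this route, pointing at the other site's channel subnet through its own local router.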

Management network
Another network is used for day-to-day management of the nodes themselves, for
instance, to apply patches, connect via the Remote Desktop Protocol (RDP), or perform
other maintenance. This is called the Management Network and each node will have a
unique Management IP address. This is usually the standard network used for the
server being protected.
Finally, the Public IP address is also assigned to the Management Network. This is a
single IP address that is shared between nodes. Only the Active node, though, is
accessible via the Public IP address. The Passive node is not visible on the network using
this IP address. vCSHB accomplishes this through the use of a packet filter installed on
each node. Both nodes are actually assigned the same Public IP address but the packet
filter on the Passive node will block all traffic that uses the Public IP address. When a
fail over or switch over occurs, the following process takes place.

• The Active node will become Passive.
• The packet filter on what was the Active node will block traffic using the Public IP address.
• The former Passive node will now become Active.
• The new Active node’s packet filter will now allow traffic destined for the Public IP address to pass.
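The list above can be modeled with a small Python sketch. It treats the packet filter as a simple rule on each node; the real filter is a network driver, and the IP addresses and names here are invented for illustration.

```python
from dataclasses import dataclass

PUBLIC_IP = "192.168.10.10"  # made-up Public IP address shared by both nodes


@dataclass
class Node:
    name: str
    management_ip: str
    active: bool

    def accepts(self, destination_ip: str) -> bool:
        """Model the packet filter: Public IP traffic passes only on the Active node."""
        if destination_ip == PUBLIC_IP:
            return self.active
        return destination_ip == self.management_ip


def switch_over(active: Node, passive: Node) -> None:
    """Apply the list above: block on the old Active node, allow on the new one."""
    active.active = False   # node becomes Passive; its filter blocks the Public IP
    passive.active = True   # node becomes Active; its filter allows the Public IP


primary = Node("primary", "192.168.10.11", active=True)
secondary = Node("secondary", "192.168.10.12", active=False)
switch_over(primary, secondary)
print(primary.accepts(PUBLIC_IP), secondary.accepts(PUBLIC_IP))
```

Management traffic is unaffected by the swap in this model, since each node always answers on its own Management IP address.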
The network architecture of vCSHB in a LAN deployment is shown below. In most
deployments, each server will utilize two network interface cards (NICs). One NIC will
be dedicated for VMware Channel replication traffic and the other will service client and
administrative traffic through the Public and Management IP addresses, respectively.

As shown in the figure above, the VMware Channel IP addresses share a subnet as do the
Public and Management IP addresses. This configuration is specific to a LAN
deployment. If deployed across a WAN, each server would have its own subnets for VMware
Channel and management traffic. Looking at the figure, client traffic will only access
services through the Public IP address and so, client traffic is directed to the server on the
left. The packet filter installed on the server on the right actively blocks traffic associated
with the Public IP address, essentially hiding the server from clients on the network.
Administrative traffic, however, does not use the Public IP address; it uses the
Management IP addresses and so can flow to each server simultaneously. This is the
steady state of a vCSHB installation. The process involved when a fail over or switch
over occurs is discussed next.

Fail over scenarios
A fail over has two distinct cases, one for LAN deployments and another for WAN
deployments. The differences in the two are mainly in how clients are directed to the
restored services on the fail over host. This is done either through the vCSHB packet
filter or through DNS changes.

LAN fail over
In any fail over scenario, the Active node becomes Passive and the previously Passive
node becomes Active. With fail over in a LAN deployment, though, the Public IP
address moves from the formerly Active node to the newly Active node. This movement
is actually performed by the vCSHB instances and the packet filters on each server. The
packet filter on the failed node starts blocking all traffic directed to or trying to leave the
interface with the Public IP address while the newly active node activates the Public IP
address and starts accepting traffic directed to it. This is shown in the figure below. In
this case, DNS records do not have to be updated because the nodes themselves are
responsible for blocking and activating their Public IP address interfaces. The newly
Active node is also responsible for sending a gratuitous ARP broadcast to let upstream
switches know it is now servicing all requests for the Public IP address.

A stretched VLAN deployment operates in exactly the same way. The only difference is
that the nodes are geographically dispersed.

WAN fail over
A WAN fail over scenario is very similar to a LAN failover. The important difference is
that DNS must be updated to reflect the change in name resolution from one Public IP
address to another. Although each server in a WAN deployment has a unique Public IP
address, they share the Fully Qualified Domain Name (FQDN) for the server. In WAN
deployments, it is essential that services connecting to vCenter Server use the FQDN to
connect or else they risk losing connectivity after a vCSHB fail over or switch over. For
example, let’s assume vCenter Log Insight is configured to point to a vCenter Server
configured in a vCSHB WAN deployment. Log Insight is pointing to vCenter Server’s
IP address rather than its FQDN. When a fail over occurs, that IP address will no longer be
responding on the network and the FQDN of the vCenter Server is now resolving to the
Public IP address of the other node, so Log Insight loses its connection. This is diagrammed below.
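The reason the FQDN matters can be shown with a toy model in Python. The name, the addresses (taken from the documentation-only ranges), and the helper functions below are all hypothetical; real name resolution involves DNS servers, caching, and TTLs that this sketch ignores.

```python
FQDN = "vcenter.example.com"           # hypothetical server name
PRIMARY_PUBLIC_IP = "192.0.2.10"       # made-up Public IP at the first site
SECONDARY_PUBLIC_IP = "198.51.100.10"  # made-up Public IP at the second site

dns = {FQDN: PRIMARY_PUBLIC_IP}
reachable = {PRIMARY_PUBLIC_IP}        # only the Active node answers clients


def can_connect(target: str) -> bool:
    """Resolve the target if it is a known name, then check that the address answers."""
    address = dns.get(target, target)
    return address in reachable


def wan_fail_over() -> None:
    """Point the FQDN at the Secondary site and move reachability there."""
    dns[FQDN] = SECONDARY_PUBLIC_IP
    reachable.clear()
    reachable.add(SECONDARY_PUBLIC_IP)


wan_fail_over()
print(can_connect(FQDN))               # client configured with the FQDN
print(can_connect(PRIMARY_PUBLIC_IP))  # client configured with the old IP address
```

In this model the client that uses the name follows the service to its new address, while the client that was configured with the original IP address is left behind.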


Application and service protection mechanisms
All of the protection mechanisms reviewed above have shortcomings. Let’s review them.

• vSphere HA cannot help if vCenter Server or associated services are installed on physical machines.
• vSphere HA has no application visibility; it doesn’t know when an application or service has failed, only when an ESXi host has failed.
• vSphere HA also has no restart priorities so if both vCenter and its database server restart (assuming the database server is a separate virtual machine), it’s only coincidence if they come up in the right order.
• VM Monitoring only has operating system awareness, not application awareness, and also cannot protect physical servers.
• Third-party Application Monitoring like Veritas Cluster Services can at best restart services on or reboot the original machine; it cannot fail over to a standby server.
• The original release of vSphere App HA in vSphere 5.5 can only protect vCenter’s database services, not vCenter Server itself, and also only applies to virtual machines.
• As of vSphere 5.5, VMware FT can only protect virtual machines with a single vCPU.
• VMware FT is also susceptible to garbage in, garbage out; if a BSOD occurs on the primary instance, for example, a BSOD will occur on the secondary.
• All the database-centric protection schemes, including clustering, log shipping, and mirroring, suffer from only being able to protect the database. While important, these don’t provide true high availability to vCenter Server.
• Backups and clones, while being important in the overall protection scheme, are far from being able to provide high availability mainly due to the time required to recover.
vCSHB addresses every one of these to provide a true high availability solution. It
protects vCenter Server, its database, and certain other services from hardware
failures, operating system failures, application and service failures, network failures, and
even performance problems that might cause these services to become unavailable. How
vCSHB accomplishes this is described below.
vCSHB operates across all hardware on the VMware Hardware Compatibility List
(HCL). If you run vCenter or its database on physical servers, this is advantageous
because vCSHB can still offer protection. Being a Windows application, it can only
protect a vCenter Server installed on Windows, though. It cannot protect the vCenter
Server Virtual Appliance (vCSA). The best way to protect the vCSA today is to use
vSphere HA.

Fail overs and switch overs
Fail overs and switch overs are very similar because the end result is the same. How each
goes about reaching that end is different, though. During a fail over, the Active node is
considered to have failed and vCSHB ensures the Passive node immediately takes over
operations. This is unlike how a switch over works. A switch over is a controlled switch
between the Active and Passive nodes. This routinely occurs during planned
maintenance activities. When a switch over occurs, the Active node gracefully shuts
down its services and then the Passive node starts the services on its machine.

Server failure protection
Server failure protection is offered at the operating system and server hardware level. If
the operating system crashes or becomes unresponsive, vCSHB will initiate a fail over. If
there’s a catastrophic hardware failure, or perhaps a power or network failure on the
Active server, vCSHB will initiate a fail over. The mechanism used to determine when
to fail over is a heartbeat that’s sent over the VMware Channel between nodes every 10
seconds by default. This heartbeat consists of messages and acknowledgements sent
back and forth between the Active and Passive servers. If acknowledgements are not
received within a user-defined interval, an automatic fail over occurs. The heartbeat interval
can be changed depending on the reliability of the links between nodes. For higher
latency WAN links, for instance, the heartbeat interval can be increased. For a more
reliable LAN, the default interval may work well.
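The timeout logic described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how vCSHB is implemented: the class name, the `max_missed` setting, and the use of plain timestamps are all assumptions made for the example. Only the 10 second default interval comes from the text.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatMonitor:
    """Tracks acknowledgements received from the peer node (illustrative only)."""
    interval_s: float = 10.0   # heartbeat interval; 10 seconds is the default in the text
    max_missed: int = 3        # hypothetical: silent intervals tolerated before fail over
    last_ack_s: float = 0.0    # timestamp of the most recent acknowledgement

    def record_ack(self, now_s: float) -> None:
        # Called whenever an acknowledgement arrives over the VMware Channel.
        self.last_ack_s = now_s

    def should_fail_over(self, now_s: float) -> bool:
        # Fail over once the silence is longer than the allowed window.
        allowed_silence_s = self.interval_s * self.max_missed
        return (now_s - self.last_ack_s) > allowed_silence_s
```

Raising `interval_s` widens the window before a fail over is triggered, which is the effect the text describes for higher latency WAN links.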

Network failure protection
By default, vCSHB on the Active node will ping three IP addresses every 10 seconds to
determine if it still has network connectivity. The three IP addresses are the Management
Network default gateway, the primary DNS server, and the Global Catalog Server. If all
three addresses fail to respond to pings, the Active node can perform an automatic switch
over to the Passive node. In order to do this, however, the nodes must be able to
communicate over the VMware Channel. This is one reason why the VMware Channel
benefits from dedicated hardware switches, separate from those used by the Management
Network: a failure in one network then doesn’t necessarily imply a failure in the other.
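As a rough sketch of the decision just described, the Python below treats the Active node as isolated only when every monitored address fails to respond, and allows a switch over only while the VMware Channel is still up. The function names and the injected `ping` callable are assumptions for illustration; they are not part of vCSHB.

```python
from typing import Callable, Iterable


def network_isolated(targets: Iterable[str], ping: Callable[[str], bool]) -> bool:
    """Return True only when every monitored address fails to answer a ping."""
    targets = list(targets)
    # With no targets configured there is nothing to judge isolation by.
    return bool(targets) and not any(ping(target) for target in targets)


def should_switch_over(targets: Iterable[str],
                       ping: Callable[[str], bool],
                       channel_up: bool) -> bool:
    # The nodes must still be able to talk over the VMware Channel
    # for the switch over to the Passive node to take place.
    return channel_up and network_isolated(targets, ping)
```

A single responding address is enough to keep the Active node in place, which matches the "all three addresses fail" condition in the text.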

Application and service failure protection
The protection mechanisms discussed thus far, whether provided by vCSHB or by other
tools, are ultimately meant to protect applications. Applications are made up of services,
so service protection becomes important. vCSHB is aware of applications running on servers
through the use of application plug-ins for vCenter Server and SQL Server. The plug-ins
monitor the state of specific services and will attempt to restart individual services if they
fail. If they don’t come back online, vCSHB will then switch over to the Passive node.
Again, a switch over includes a graceful shut down of services on the Active node. These
services are then started on the Passive node, which then becomes the Active node.
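The restart-first, switch-over-second behavior can be expressed as a small Python sketch. The `restart` callable and the `max_restarts` limit are hypothetical; the text does not say how many restart attempts the plug-ins make.

```python
from typing import Callable


def handle_service_failure(service: str,
                           restart: Callable[[str], bool],
                           max_restarts: int = 2) -> str:
    """Try to restart a failed service; fall back to a switch over.

    Returns "restarted" if any restart attempt succeeds, otherwise
    "switch_over". max_restarts is an assumed value for illustration.
    """
    for _ in range(max_restarts):
        if restart(service):
            return "restarted"
    return "switch_over"
```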

Protecting against performance problems leading to service
Each plug-in is configured to monitor certain application performance or health
attributes. If these attributes fall outside of default or user-configured thresholds,
vSphere Administrators can be alerted before the issue becomes severe enough to
degrade services or cause a switch over. The attributes monitored by each plug-in
are listed in the tables below.
vCenter Server plug-in
• Health of Tomcat Server
• vCenter License Check
• Connection to vCenter
SQL Server plug-in
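To make the threshold idea concrete, here is a minimal Python sketch of checking readings against configured ranges. The attribute names and threshold values used with it are invented for the example and are not the attributes the plug-ins actually expose.

```python
from typing import Dict, List, Tuple


def out_of_range(readings: Dict[str, float],
                 thresholds: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return the names of attributes whose reading falls outside (low, high)."""
    alerts = []
    for name, value in readings.items():
        low, high = thresholds[name]
        if not (low <= value <= high):
            alerts.append(name)
    return alerts
```

An administrator would be alerted for each name returned, ideally well before the condition is severe enough to cause a switch over.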


Application data protection
Finally, vCSHB protects vCenter Server, its database and associated services by
replicating application and user-defined data. It does this through the use of file filters.
File filters simply tell vCSHB which files, folders, and registry settings to replicate to the
Passive node. Default file filters are created when vCSHB is installed, based on which
applications are present at that time. Applications installed after vCSHB can still be
protected by manually configuring vCSHB to protect them.
For example, let’s assume Single Sign-On, the Inventory service, Web Client, and
vCenter services are installed on a single machine. When vCSHB is installed, it
discovers these services and knows which file and folder paths as well as registry settings
these services use. It installs service-specific file filters which capture all data associated
with each service in order to replicate it. Now let’s assume the Update Manager is
installed on this machine after vCSHB is installed. vCSHB will not recognize the Update
Manager installation automatically and will not replicate any data associated with it. A
vCSHB administrator must install the Update Manager file filter manually. This will tell
vCSHB to protect the Update Manager application by replicating its data.
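The Update Manager example can be mimicked with simple wildcard matching in Python. This is a sketch of the concept only: the paths are made up, and real vCSHB file filters also cover registry settings, which this example ignores.

```python
from fnmatch import fnmatchcase
from typing import Iterable


def is_protected(path: str, filters: Iterable[str]) -> bool:
    """Return True if the path matches any file filter pattern.

    Matching is case-insensitive, as Windows paths are. The patterns are
    hypothetical wildcard strings, not vCSHB's own filter syntax.
    """
    lowered = path.lower()
    return any(fnmatchcase(lowered, pattern.lower()) for pattern in filters)


# Filters created at install time (hypothetical path).
filters = [r"C:\Apps\vCenter\*"]

# Update Manager was installed later, so its data is not yet replicated.
update_manager_file = r"C:\Apps\UpdateManager\patches\patch.bin"
before = is_protected(update_manager_file, filters)

# An administrator adds the missing filter manually.
filters.append(r"C:\Apps\UpdateManager\*")
after = is_protected(update_manager_file, filters)
```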
The process of replicating data is captured in the steps below.

1. Data is written to disk on the Active node by the protected application
2. The data is intercepted by vCSHB file filters and placed in a Send Queue
3. The Send Queue is sent over the VMware Channel to the Passive Node
4. The Passive node places the data in its Receive Queue
5. Once logged to the Receive Queue on the Passive Node, an
acknowledgement is sent to the Active Node that the data was received
6. The Active node receives the acknowledgement and deletes the sent data
from its Send Queue
7. The Passive node writes the data in its Receive Queue to disk
8. At this point, duplicate data sets exist at both the Active and Passive nodes
The Send and Receive Queues act as buffers for replicated data. There is a small chance
that the Active node will fail before the Send Queue can completely replicate to the
Receive Queue. This is more likely to be seen in a low-bandwidth WAN deployment. In
this case, it’s possible that a small amount of data can be lost. LAN deployments usually
don’t suffer from this because of better network resiliency and higher bandwidth. In a
LAN deployment, the Send Queue is usually at or near zero at all times.
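The replication steps and the data-loss window can be modeled with a toy Python class. It is a simplification under stated assumptions: both queues live in one process, the acknowledgement always arrives, and the class and method names are invented for the example.

```python
from collections import deque


class ReplicationPair:
    """Toy model of the Send Queue and Receive Queue hand-off between nodes."""

    def __init__(self) -> None:
        self.send_queue = deque()      # held on the Active node
        self.receive_queue = deque()   # held on the Passive node
        self.passive_disk = []         # data committed on the Passive node

    def write(self, data: str) -> None:
        # Steps 1-2: the write is intercepted and placed in the Send Queue.
        self.send_queue.append(data)

    def transmit_one(self) -> bool:
        # Steps 3-6: send the oldest item, log it in the Receive Queue, then
        # remove it from the Send Queue (the acknowledgement is assumed here).
        if not self.send_queue:
            return False
        self.receive_queue.append(self.send_queue[0])
        self.send_queue.popleft()
        return True

    def apply_on_passive(self) -> None:
        # Step 7: the Passive node writes its queued data to disk.
        while self.receive_queue:
            self.passive_disk.append(self.receive_queue.popleft())

    def at_risk(self) -> list:
        # Anything still in the Send Queue would be lost if the Active node
        # failed at this moment.
        return list(self.send_queue)
```

On a slow link `transmit_one` runs less often than `write`, so `at_risk()` grows. On a LAN the Send Queue drains almost immediately and `at_risk()` stays at or near empty.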

Common installation options
One of the features of vCSHB is that it’s very flexible in its deployment models.
VMware clearly understood that vCenter Server and its associated services were
deployed in various configurations and vCSHB needed to support them. The benefit of
this is that users aren’t restricted to certain models. They don’t have to reinstall their
virtual infrastructure just to benefit from vCSHB. We’ll discuss the deployment options
of each item listed below.

• LAN, WAN, or stretched VLAN
• Physical to Physical, Physical to Virtual, or Virtual to Virtual
• Various cloning techniques
• One, two, or more NICs
• Local or distributed applications

vCSHB for High Availability or Disaster Recovery
When vCSHB is deployed in a LAN, it’s considered an HA deployment. This protects
against hardware, operating system, network, application and performance failures. The
key when deploying vCSHB in a LAN is that the Public IP address is shared between
nodes, which means each node has to belong to the same subnet. The VMware Channel
in a LAN deployment is also a single Layer 2 domain, which means that each node will
have a VMware Channel IP address assigned in the same subnet. As mentioned earlier,
stretched VLANs are considered LAN deployments for vCSHB because DNS doesn’t
need to be updated even though the servers are at different sites.
When deployed across a WAN, it’s considered a DR deployment because, in addition to
protecting against the same LAN failures, it also protects against site failures. So if the
entire site in which the Active node operates fails, vCenter and its services can be failed
over to a DR site and continue to serve clients. The important difference between the two
is that in a LAN deployment, the Public IP address simply moves from the Active node to
the Passive node and the existing DNS record continues resolving to the same Public IP.
But in a WAN deployment, subnets are not shared across sites, so the DNS record must
be updated to resolve to the Public IP address of the newly Active server at the DR site.
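Whether a pair of servers can share a Public IP address comes down to whether they sit in the same subnet. The Python sketch below uses the standard `ipaddress` module to make that check. The function names are made up, and the addresses used with it come from the reserved documentation ranges rather than from any real deployment.

```python
import ipaddress


def shares_subnet(ip_a: str, ip_b: str, prefix_len: int) -> bool:
    """Return True if both addresses fall in the same subnet."""
    net_a = ipaddress.ip_interface(f"{ip_a}/{prefix_len}").network
    net_b = ipaddress.ip_interface(f"{ip_b}/{prefix_len}").network
    return net_a == net_b


def deployment_type(primary_ip: str, secondary_ip: str, prefix_len: int) -> str:
    # Same subnet: the Public IP can simply move between nodes (LAN-style,
    # which also covers stretched VLANs). Different subnets: DNS must be
    # updated on fail over (WAN-style).
    return "LAN" if shares_subnet(primary_ip, secondary_ip, prefix_len) else "WAN"
```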
When deciding whether a WAN deployment is a good fit or not, consider the available
bandwidth and latency over the WAN. Because vCSHB replicates data, bandwidth can
be a concern. VMware recommends dedicating at least 1 Mbps of WAN bandwidth to
vCSHB. vCSHB includes a compression feature that reduces the amount of data sent
across the WAN, at the cost of CPU overhead on the sending node. When bandwidth is
limited, the Send Queue can become relatively large if there is more data to transfer
than the link can carry.
Another concern with WAN deployments is the latency of the WAN connection. In a
similar way to decreased bandwidth, high latency can affect how quickly the Send Queue
can empty and therefore how far out-of-sync the Active and Passive nodes become.
VMware recommends keeping latency below that of a standard T1 connection.
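A quick back-of-the-envelope calculation shows why bandwidth matters. The Python function below estimates how long a Send Queue of a given size takes to drain over a link. It is a simplified sketch that ignores compression, latency, and protocol overhead, and the function and parameter names are assumptions for the example.

```python
def drain_seconds(queue_bytes: float,
                  link_mbps: float,
                  incoming_bytes_per_s: float = 0.0) -> float:
    """Estimate the time needed to empty the Send Queue.

    link_mbps is the WAN bandwidth available to vCSHB in megabits per second;
    incoming_bytes_per_s is the rate at which new data is still being queued.
    """
    link_bytes_per_s = link_mbps * 1_000_000 / 8
    net_rate = link_bytes_per_s - incoming_bytes_per_s
    if net_rate <= 0:
        # Data arrives at least as fast as it can be sent: the queue never empties.
        return float("inf")
    return queue_bytes / net_rate
```

For example, at the recommended minimum of 1 Mbps the link carries 125,000 bytes per second, so a 1,250,000 byte queue with no new writes takes about 10 seconds to drain.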

Server architecture
vCSHB supports three different server deployment architectures: physical to physical
(P-P), physical to virtual (P-V), and virtual to virtual (V-V). Most deployments will be V-V
because most vCenters are virtualized. This is beneficial because working with virtual
machines is generally much easier than working with physical machines. In the case of
vCSHB, deployments that involve physical servers are different in a number of ways.

The first is the way in which a clone is made of the Primary server. There are only three
cloning options supported by VMware to do this. The options are VMware vCenter
Converter, VMware vCenter cloning through the vSphere Client or Web Client, and
vCSHB native cloning which is bundled with the vCSHB media. Which one is used
depends on the architecture chosen.

• If the Primary server is physical and the Secondary is virtual, the clone
should be made using VMware vCenter Converter.
• If the architecture is virtual to virtual, the clone should be made using
vCenter’s built-in cloning option.
• If the Secondary server is physical, the clone should be made using vCSHB’s
built-in cloning utility.
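The cloning rules above amount to a small lookup, sketched here in Python. The function name and the string labels are invented for the example. A virtual Primary with a physical Secondary is rejected only because it is not one of the three architectures listed in this chapter.

```python
def cloning_method(primary: str, secondary: str) -> str:
    """Pick the cloning option for a Primary/Secondary pair.

    Each argument is either "physical" or "virtual".
    """
    pair = (primary, secondary)
    if pair == ("physical", "physical"):
        return "vCSHB native cloning"
    if pair == ("physical", "virtual"):
        return "VMware vCenter Converter"
    if pair == ("virtual", "virtual"):
        return "vCenter cloning"
    raise ValueError(f"architecture not covered in this chapter: {primary} to {secondary}")
```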
The over-arching theme, regardless of which architecture is used, is that both servers must
be nearly identical in both hardware and software. That means matching the amount of
CPU and memory on each server, as well as running the same operating system and patches.
Keeping the servers nearly identical in these respects ensures that performance is retained
between fail over pairs and that the same file and data structures are available on each
server so software can run reliably and predictably.
When a server in a vCSHB fail over pair is a virtual machine, VMware recommends the
use of resource pools for that server. This recommendation comes from the idea that if
the Active node experiences performance degradation due to resource contention,
automatic fail overs can occur. While this is certainly within the capabilities of vCSHB
to overcome, unnecessary fail overs should be avoided. VMware stops short of
recommending resource reservations, but be aware of the dangers of oversubscribing
resources and putting vCSHB servers in resource contention scenarios.

Number of NICs
Another important option when choosing an architecture is whether to use one, two, or
more NICs. Whichever is chosen, it must be the same on both nodes. For redundancy, two
NICs are recommended. Two NICs allow for a dedicated Public NIC and a VMware
Channel NIC. The Public NIC would host both the Management and Public IP addresses and
the VMware Channel NIC would host the Channel IP address. If a single NIC is used, all
IP addresses will be hosted by the same NIC and in this case, can all share the same
subnet. Single NIC deployments should be avoided because they are a single point of
failure.
Teaming Intel NICs is supported with the workaround in VMware KB article


When vCSHB protected servers are virtualized, each virtual NIC should be placed on a
separate vSwitch. This prevents all vCSHB traffic from traversing a single physical
server NIC whose failure would take down all vCSHB traffic. Ensure redundancy is
employed on whichever vSwitches vCSHB uses to avoid single points of failure.
The most common use of more than two NICs is when two NICs are dedicated to the
VMware Channel in addition to the one used for Public traffic. These are not teamed
NICs but two, independent NICs with unique IP addresses. This is usually done when
multiple, distinct paths, including different physical links and upstream switches and
routers, can be used for each VMware Channel network. This increases the network
redundancy between vCSHB nodes. Such redundancy is usually seen across a WAN
when two different providers allow for two separate paths between sites. This
redundancy is illustrated below.

For WAN deployments, a static route is required on the VMware Channel network
because the NIC used for this traffic is not configured for a default gateway. An example
of adding a static route using PowerShell in Windows Server 2012 is shown below.
New-NetRoute -DestinationPrefix -InterfaceAlias Channel


New-NetRoute is the PowerShell cmdlet. The destination prefix represents the
destination subnet the source is trying to reach, and the interface alias is the name given
to the network adapter. In this case, the name is Channel, but it could be VMware Channel,
in which case it would be surrounded by double quotes, like “VMware Channel.”
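If the route needs to be generated for several servers, a small helper can assemble the command text and apply the quoting rule just described. The Python below is a sketch only: the helper name is invented, the subnet used with it comes from a reserved documentation range, and it emits only the two parameters discussed in the text. A real route would normally also name a gateway (New-NetRoute accepts `-NextHop` for this), which is not covered here.

```python
import ipaddress


def build_route_command(destination_prefix: str, interface_alias: str) -> str:
    """Build a New-NetRoute command line as a string (nothing is executed).

    destination_prefix is a subnet in CIDR form; it is validated so that a
    host address such as 192.0.2.5/24 is rejected.
    """
    network = ipaddress.ip_network(destination_prefix, strict=True)
    # An alias containing spaces must be wrapped in double quotes.
    if " " in interface_alias:
        alias = f'"{interface_alias}"'
    else:
        alias = interface_alias
    return f"New-NetRoute -DestinationPrefix {network} -InterfaceAlias {alias}"
```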

Local or distributed applications
The final common installation option is determined by how protected applications are
installed across servers or virtual machines. If all protected services are installed on a
single machine, then the installation can be referred to as local. If services to be
protected are installed on different servers, like vCenter on one server and its SQL Server
on another, the installation can be referred to as distributed. vCSHB supports both of
these types of deployments. There’s also no need to reinstall services to create a local
deployment or create a distributed deployment as vCSHB was created with each type in
mind.
vCSHB 6.6 protects over 30 VMware services across vCenter versions from 5.0 to 5.5
and they’re not always installed on a single machine. In addition to the various SQL
services that need to be protected by vCSHB, the following vCenter services are also
protected. These are listed generically because the actual Windows service names have
changed through vCenter versions.
For a full list of protected vCenter services, see the vCenter Server Heartbeat
Documentation page.
Protected vCenter services
vCenter Server Service
vCenter Inventory Service
vCenter Management Webservices
vCenter Single Sign-On
vSphere Web Client
vCenter Update Manager
vCenter Update Manager Download Service
vSphere Client
vCenter Orchestrator
vSphere ESXi Dump Collector
vSphere Syslog Collector
vSphere Auto Deploy
vCenter Host Agent Pre-Upgrade Checker
VMware Log Browser
vSphere Profile-Driven Storage Service

If protected applications are installed in a distributed manner, vCSHB must be installed
on each node that runs those protected services. Common local and distributed
architectures are diagrammed below.
For instance, all services including the database may be installed on the same machine as
shown below.

Or services may be distributed among many machines. In this case, each machine can
fail over independently and the surviving nodes will reconnect once the fail over is
complete. If a node is running several protected services, it’s important to note that all
services fail over at once. It’s not possible to fail over some services on a single machine
and leave others running on that same machine. It’s all or nothing, so to speak.


Finally, vCSHB is capable of being deployed in environments that use features such as
Linked Mode, Site Recovery Manager, and multi-site SSO. In large deployments that
include DR provided by SRM, vCSHB can be deployed in an architecture similar to that
shown in the diagram below.


Each site has distributed services which it protects local to the site only using vCSHB.
Linked Mode is used between sites to benefit from single pane of glass management and
shared vCenter inventory and licenses. Site protection is provided by SRM, but recall that
SRM itself, including its database, is not supported by vCSHB.
One other interesting note is that only two vCSHB licenses are used in this deployment,
one at each site. Even though six servers are protected by vCSHB, they belong to only two
vCenter instances, and recall that vCSHB is licensed per vCenter Server instance.
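The licensing arithmetic can be checked with a one-line count of distinct vCenter instances. In this Python sketch the server and instance names are placeholders chosen for the example, not names from the deployment described above.

```python
from typing import Dict


def licenses_needed(protected_servers: Dict[str, str]) -> int:
    """Count vCSHB licenses, one per distinct vCenter Server instance.

    protected_servers maps each protected server to the vCenter instance
    it belongs to.
    """
    return len(set(protected_servers.values()))


# Six protected servers spread over two vCenter instances (names are hypothetical).
deployment = {
    "site-a-vcenter": "vcenter-a",
    "site-a-sso": "vcenter-a",
    "site-a-sql": "vcenter-a",
    "site-b-vcenter": "vcenter-b",
    "site-b-sso": "vcenter-b",
    "site-b-sql": "vcenter-b",
}
license_count = licenses_needed(deployment)
```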

vCSHB client and console walkthrough
In this section, let’s take a moment to view the user interfaces available with vCSHB.
First, there’s the vCSHB standalone client. This is a Windows application that can be
installed on protected servers or an administrator’s workstation. A lighter version of this
client is installed as a vCenter plug-in so it’s available through both the legacy vSphere
Client and the vSphere Web Client.
Below is an example of the standalone client.


Here, we can note several items. The first is the overall status of the vCSHB pair. Green
check marks mean everything is healthy. If something were wrong, yellow triangles
would appear and messages would be seen in the lower pane.
Notice the tabs along the top of the page. They include the various levels at which
vCSHB offers protection: server, network, application, and data replication. Choosing
each tab will offer sub-menus for those categories. On the first tab, under Server
> Summary, we have the option to perform a manual switch over by choosing the Make
Active… button in the top pane. This will start a manual switch over whose progress we
can watch in the client.


During the switch over, we can view a progress bar as well as the numbered steps being
executed. Details of the switch over can be viewed from the Logs tab, which can also be
useful in troubleshooting.
Staying on the Server tab, when the Passive server is selected, we can see the status of the
Send and Receive Queues by viewing the Recovery Point in the bottom pane. Most of
the time, you’ll want to see 0 milliseconds, which tells you data is being quickly
replicated and committed on the passive node. If this number grows, it’s an indication
that there’s a problem with the VMware Channel network, perhaps bandwidth contention,
latency, or connectivity problems.
Finally, both the vSphere Client and Web Client have vCSHB plug-ins installed when
vCSHB itself is installed. The plug-ins do not have nearly the features the standalone
client does, but for the purposes of executing common workflows, the plug-ins work
well.


Common workflows that can be executed include those listed in the Actions pane in the
upper right. Similar to the standalone client, we can see the overall status of the
replication relationship here as well as perform manual switch overs.

In this chapter, we introduced vCSHB, looked at why vCenter Server may need to be
highly available and what services are affected when it’s down. Recall that vCenter
allows for centralized management of your vSphere infrastructure, in addition to DRS,
vMotion, deploying from templates, and more. Also, important services like Auto
Deploy, Update Manager, vCloud Director and vCAC, and Horizon View environments,
among others, rely on an operational vCenter Server. By identifying the services you
lose in concert with defined SLAs, RPOs, and RTOs, a use case can be built for vCSHB.
The architecture of vCSHB was discussed next starting with the two types of vCSHB
deployments, LAN and WAN. vCSHB identities, Primary and Secondary, stay with the
server throughout its lifecycle, while the roles, Active and Passive, move between them.
A key part of the vCSHB architecture is the VMware Channel network over which
heartbeats and data replication take place. The Management Network is the normal
network over which day-to-day management takes place. Both servers’ Management IP
addresses and the shared, client-facing Public IP address are on the Management Network.

LAN and WAN fail overs are much the same with the biggest difference being a change
in the Public IP address DNS record during a WAN fail over because the Public IP
address will be different across the WAN.
We also looked at each vCSHB protection mechanism in turn to see how it provides a
spectrum of protection for vCenter Server and its database. Recall the following failures
against which vCSHB protects:

• Server hardware and operating system
• Network
• Application and service
• Performance degradation
• Application data
Finally, the vCSHB management interfaces were introduced along with their major
features.
With an understanding of what vCSHB is, how it works, and how to interface with it, the
next chapter will explore in detail how to install and, more importantly, how to configure
vCSHB to protect vCenter Server and its SQL Server when they’re installed on the same