
VMware® vSAN™ Network

Design
First Published On: 04-19-2017
Last Updated On: 03-15-2019

Copyright © 2019 VMware, Inc. All rights reserved.


VMware® vSAN™ Network Design

Table Of Contents

1. Introduction
1.1 Introduction
1.2 Introduction to vSAN networking
1.3 About this document
1.4 Physical NICs Support Statements
1.5 Stretched Cluster vs. Fault Domains Statements
1.6 Layer-2 and Layer-3 Support Statements
1.7 Multicast
1.8 Unicast
1.9 Jumbo Frames
1.10 IPv6 support
1.11 TCP/IP Stacks
1.12 Static Routes
1.13 NSX
1.14 Flow Control
1.15 vSAN Network Port Requirements
2. NIC Teaming on the vSAN network
2.1 NIC Teaming on the vSAN network
2.2 Basic NIC Teaming
2.3 Advanced NIC Teaming
2.4 Dynamic LACP (Multiple physical uplinks, 1 vmknic)
2.5 Static LACP with Route based on IP Hash
3. NIC Teaming Configuration Examples
3.1 Configuration 1
3.2 Configuration 2
3.3 Configuration 3
3.4 Configuration 4
4. Network I/O Control
4.1 Network I/O Control
4.2 Enabling NIOC
4.3 Reservation, Shares and Limits
4.4 Network Resource Pools
4.5 NIOC Configuration Example
5. vSAN Network Topologies
5.1 vSAN Network Topologies
5.2 Standard Deployments
5.3 Stretched Cluster Deployments
5.4 2 Node vSAN Deployments
5.5 2 Node vSAN Deployments – Common Config Questions
5.6 Config of network from data sites to witness host
5.7 Corner Case deployments
6. vSAN iSCSI Target - VIT
6.1 vSAN iSCSI Target - VIT
6.2 VIT Internals
6.3 iSCSI Setup Steps
6.4 MPIO considerations
6.5 iSCSI on vSAN Limitations and Considerations
7. Appendix A
7.1 Migrating from standard to distributed vSwitch
7.2 Step A.1 Create Distributed Switch
7.3 Step A.2 Create port groups
7.4 Step A.3 Migrate Management Network
7.5 Step A.4 Migrate vMotion
7.6 Step A.5 Migrate vSAN Network
7.7 Step A.6 Migrate VM Network
8. Appendix B
8.1 Appendix B. Troubleshooting the vSAN Network
8.2 Network Health Checks
8.3 All hosts have a vSAN vmknic configured
8.4 All hosts have matching multicast settings
8.5 All hosts have matching subnets
8.6 Hosts disconnected from VC
8.7 Hosts with connectivity issues
8.8 Multicast assessment based on other checks
8.9 Network Latency Check
8.10 vMotion: Basic (unicast) connectivity check
8.11 vMotion: MTU checks (ping with large packet size)
8.12 vSAN cluster partition
8.13 vSAN: Basic (unicast) connectivity check
8.14 vSAN: MTU check (ping with large packet size)
8.15 Checking the vSAN network is operational
8.16 Checking multicast communications
8.17 Checking performance of vSAN network
8.18 Checking vSAN network limits
8.19 Physical network switch feature interoperability
9. Appendix C
9.1 Appendix C: Checklist summary for vSAN networking
10. Appendix D
10.1 Appendix D: vSAN 6.6 versioning change
11. Appendix E
11.1 vCenter Recovery Example with Unicast vSAN
11.2 vSAN Node preparation
11.3 vCenter Preparation
11.4 Adding vSAN nodes to vCenter
12. Appendix F
12.1 Boot Strapping a vSAN 6.6 unicast cluster


1. Introduction

1.1 Introduction

vSAN is a hypervisor-converged, software-defined storage solution for the software-defined data
center. It is the first policy-driven storage product designed for VMware vSphere® environments that
simplifies and streamlines storage provisioning and management.

vSAN is a distributed, shared storage solution that enables the rapid provisioning of storage within
VMware vCenter Server™ as part of virtual machine creation and deployment operations.

vSAN requires a correctly configured network for virtual machine I/O as well as communication among
cluster nodes. Since the majority of virtual machine I/O travels the network due to the distributed
storage architecture, a highly performant and highly available network configuration is critical to a
successful vSAN deployment.

This paper gives a technology overview of vSAN network requirements and provides vSAN network
design and configuration best practices for deploying a highly available and scalable vSAN solution.

1.2 Introduction to vSAN networking

There are a number of distinct parts to vSAN networking. First there is the communication that takes
place between all of the ESXi hosts in the vSAN cluster, indicating that they are still actively
participating in vSAN. This has traditionally been done via multicast traffic, and a heartbeat is sent
from the master to all hosts once every second to ensure they are still active. However, since the
release of vSAN 6.6, this communication is now done via unicast traffic. This is a significant change
compared to previous versions of vSAN, and should make vSAN configuration much easier from a
networking perspective.

There is also communication between the master, the backup and agent nodes. When an ESXi host is
part of a vSAN cluster, there are three roles that it can take: master, agent and backup. Roles are
applied during cluster discovery, when all nodes participating in vSAN elect a master. A vSphere
administrator has no control over roles.

There is one master node that is responsible for getting CMMDS (Clustering Monitoring, Membership,
and Directory Services) updates from all nodes, and distributing these updates to agents. This traffic
was traditionally multicast, but is now unicast in vSAN 6.6. A backup node, which also maintains a
backup directory, will assume the master role should the master fail. This will avoid a complete
rediscovery of every node, object and component in the cluster, as the backup role will already have a
full copy of the directory contents, and can seamlessly assume the role of master, speeding up
recovery in the event of a master failure.

This volume of traffic between master, agent and backup is relatively light in steady state.

Lastly, there is virtual machine disk I/O. This makes up the majority of the traffic on the vSAN network.
Because VMs on the vSAN datastore are made up of a set of objects, these objects may be made up of
one or more components. For example, a number of RAID-0 stripes or a number of RAID-1 mirrors.
Invariably, a VM’s compute and a VM’s storage will be located on different ESXi hosts in the cluster. It
may also transpire that if a VM has been configured to tolerate one or more failures, the compute may
be on host 1, the first RAID-1 mirror may be on host 2, and the second RAID-1 mirror could be on host
3. In this case, disk reads and writes for this virtual machine will have to traverse the vSAN network.
This is unicast traffic, and forms the bulk of the vSAN network traffic.
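As a quick orientation, the VMkernel interfaces tagged for vSAN traffic on a host can be listed with esxcli. The sketch below is illustrative: the interface name vmk2 is an assumption, and on pre-6.6 hosts the first command's output also includes the agent and master multicast group addresses.

```shell
# List the VMkernel interfaces carrying vSAN traffic on this host.
# On pre-6.6 hosts the output also shows the multicast group addresses
# (Agent Group Multicast Address / Master Group Multicast Address).
esxcli vsan network list

# Show the IP configuration of the vSAN vmknic reported above
# (vmk2 is an illustrative name; substitute your own).
esxcli network ip interface ipv4 get -i vmk2
```

Running this on every cluster member is a simple way to confirm that each host has exactly one properly tagged vSAN vmknic.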

1.3 About this document

This document is not a replacement for the existing vSphere networking documentation; rather, it
summarizes the existing collateral available at https://www.vmware.com/support/pubs/vsphere-esxi-vcenter-server-pubs.html


vSAN ready nodes (http://www.vmware.com/resources/compatibility/vsan_profile.html) typically
standardize on two 10Gb server-side network uplinks. We will base many of the examples in this guide
on a 2 x 10Gb network interface design.

For those of you using the DELL-EMC VxRAIL appliance, there is a separate network guide for the
appliance available here: https://www.emc.com/collateral/guide/h15300-vxrail-network-guide.pdf

1.4 Physical NICs Support Statements

Before delving into the technical details, we would like to address common supportability questions
related to vSAN network configuration. There are often questions around which NICs can be used with
which vSAN configurations. This table highlights the minimum NIC speeds required for vSAN
deployments. VMware expects to see higher speeds, such as 25Gb, 40Gb and even 100Gb NICs
become commonplace very soon.

| Configuration | 10Gb support | 1Gb support | Comments |
|---|---|---|---|
| Hybrid Cluster | Y | Y | 10Gb recommended, but 1Gb supported. <1ms RTT |
| All-Flash Cluster | Y | N | All-Flash requires 10Gb; 1Gb not supported. <1ms RTT |
| Stretched Cluster Data to Data | Y | N | <5ms RTT; 10Gb required between data sites |
| Stretched Cluster Witness to Data | Y | Y | <200ms RTT; 100Mbps connectivity required from data sites to witness |
| 2-node Data to Data* | Y | Y | 10Gb required for All-Flash; 1Gb supported for hybrid, but 10Gb recommended |
| 2-node Witness to Data | Y | Y | <500ms RTT; 1.5Mbps bandwidth required |

Table 1. Minimum NIC requirements

*With vSAN 6.5, 2-node vSAN supports direct connect for the data network, i.e. hosts can be
connected directly to one another without the need for a network switch. Prior to vSAN 6.5, 2-node
configurations required a 1Gb (hybrid) or 10Gb (all-flash) switch for connectivity.

More details on bandwidth requirements can be found in the vSAN Stretched Cluster Guide -
https://storagehub.vmware.com/#!/vmware-vsan/vsan-stretched-cluster-bandwidth-sizing/general-guidelines/1

1.5 Stretched Cluster vs. Fault Domains Statements

There has been some confusion regarding which licenses are needed to implement Fault Domains
(rack awareness) on racks of vSAN servers placed on different sites, and how that compares to
stretched clustering with vSAN. The following table identifies the different configurations/
deployments and highlights which license edition is needed for each implementation type.

| Configuration | Maximum number of nodes | Distance | Performance Considerations | Licensing | Witness Considerations | Sites |
|---|---|---|---|---|---|---|
| Stretched Cluster | 15+15+W | N/A | <5ms RTT, 10Gb | Enterprise | Dedicated witness host/appliance | 3** |
| Stretched Cluster | 1+1+W* | N/A | <5ms RTT, 10Gb | Standard or higher | Dedicated witness host/appliance | 3** |
| All-Flash Fault Domains | 64 | N/A | Performs at LAN levels (<1ms RTT), 10Gb bandwidth | Standard or higher | No dedicated witness. Witness objects distributed across all fault domains | >1** |
| Hybrid Fault Domains | 64 | N/A | Performs at LAN levels (<1ms RTT), 1Gb bandwidth supported but 10Gb recommended | Standard or higher | No dedicated witness. Witness distributed across all fault domains | >1** |

Table 2. Stretched Cluster vs. Fault Domains

* Standard Licensing allows customers to create a 1+1+W stretched cluster deployment. However, if
you attempt to grow the number of nodes at any of the data sites to more than 1 node, the
configuration will display an error, stating that you are not entitled to such a configuration unless you
have vSAN Enterprise licenses.

**While most vSAN deployments are within the same room in the same data center, VMware supports
having vSAN nodes in different rooms and different sites without stretched cluster and without a
dedicated witness. However, this is only supported when the latency is at LAN levels, i.e. <1ms. In
these configurations, one would also need to use Fault Domains to implement availability across rooms
or across sites. Please refer to the appropriate vSAN documentation for steps on how to do this.

Readers should also be aware that vSphere DRS is needed for stretched clusters, so a vSphere license
edition that supports vSphere DRS should also be considered.

1.6 Layer-2 and Layer-3 Support Statements

This section looks at whether vSAN deployments are supported over Layer 2 and Layer 3 networking
configurations. While Layer 2 configurations are strongly recommended (all hosts share the same
subnet), VMware also supports vSAN deployments using Layer 3 (vSAN network routed between
hosts). Layer 2 implementations are recommended to reduce complexity, especially in vSAN prior to
v6.6, as those versions also require multicast traffic to be routed.

The Layer 2 network topology is responsible for forwarding packets through intermediate Layer 2
devices such as hosts, bridges, or switches. It is required that all of the hosts participating in a vSAN
cluster are able to establish communication through the VMkernel interface connected to a common
Layer 2 network segment. Prior to vSAN 6.6, all vSAN cluster members sent IGMP join requests over
the VMkernel network interfaces that are used for the vSAN traffic service. With vSAN 6.6, which uses
unicast exclusively, IGMP is no longer necessary.
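Whether the hosts share a Layer 2 segment or are routed at Layer 3, basic unicast reachability between vSAN VMkernel interfaces can be verified from any host with vmkping. A sketch, assuming the vSAN vmknic is vmk2 and a peer host's vSAN IP address is 192.168.100.12 (both illustrative):

```shell
# Basic unicast reachability over the vSAN VMkernel interface.
vmkping -I vmk2 192.168.100.12

# If jumbo frames are configured end to end (MTU 9000), also test with a
# large payload and the don't-fragment bit set: 8972 = 9000 minus 28 bytes
# of IP/ICMP headers.
vmkping -I vmk2 -d -s 8972 192.168.100.12
```

If the small ping succeeds but the large one fails, a device in the path is not configured for the larger MTU.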

The Layer 3 network topology is responsible for routing packets through intermediate Layer 3 capable
devices such as routers and Layer 3 capable switches. As per a Layer 2 topology, all vSAN cluster
members (prior to vSAN 6.6) are required to join the cluster’s Multicast Group by sending IGMP join
requests over the VMkernel network interfaces that are being used for the vSAN traffic service. In
Layer 3 network topologies (again, prior to vSAN 6.6), VMware recommends the use and configuration
of IGMP Snooping in all the switches configured in the Layer 2 domains where hosts participating in
the vSAN cluster will be present.

| Configuration | L2 Supported | L3 Supported | Considerations |
|---|---|---|---|
| Hybrid Cluster | Y | Y | L2 recommended, L3 supported also |
| All-Flash Cluster | Y | Y | L2 recommended, L3 supported also |
| Stretched Cluster Data | Y | Y | Both L2 and L3 between data sites supported |
| Stretched Cluster Witness | N | Y | L3 supported/recommended. L2 between Data and Witness is typically not recommended in a traditional configuration*. When used in a single site, such as a Stretched Cluster across datacenter floors, L2 can be used. When L2 is used, the requirement for independent routing from the Witness to each site is not changed. |

Table 3. Supported Networking Topologies

*Prior to vSAN 6.6, VMware did not support L2 between the Data sites and the Witness nodes. This is
to avoid any communication traffic between the data sites occurring via the witness network in the
event of a split brain of the cluster. With control over multicast routing in the switch (removal of
multicast), this was relatively easy to configure in earlier vSAN versions. With the release of vSAN 6.6
and the introduction of unicast, customers are advised that additional steps may need to be taken on
the switch to prevent site to site communication over the witness network with L3.

1.7 Multicast

Note: vSAN no longer uses multicast traffic as of vSAN version 6.6

With vSAN release version 6.6, the requirement for IP multicast is removed from vSAN. This is
discussed in detail next. However, since there are many vSAN deployments prior to vSAN version 6.6,
we are including details around multicast configurations in this document. Throughout this document,
we will call out where multicast is a consideration in vSAN versions prior to the 6.6 release.

As mentioned, vSAN versions prior to version 6.6 relied on IP multicast. IP multicast sends source
packets to multiple receivers as a group transmission, and provides an efficient delivery of data to a
number of destinations with minimum network bandwidth consumption. Pre-v6.6 vSAN used multicast
to deliver metadata traffic among cluster nodes for efficiency and bandwidth conservation. Multicast
is required for VMkernel ports utilized by pre-vSAN 6.6 configurations.

Copyright © 2019 VMware, Inc. All rights reserved.


VMware® vSAN™ Network Design

Figure 1. Multicast

An IP Multicast address is called a Multicast Group (MG). IP Multicast relies on communication
protocols used by hosts, clients, and network devices to participate in multicast-based
communications. Communication protocols such as Internet Group Management Protocol (IGMP) and
Protocol Independent Multicast (PIM) are integral components and dependencies for the use of IP
multicast communications. PIM enables multicast traffic to be routed.

IP Multicast is a fundamental requirement of vSAN prior to v6.6. Earlier vSAN versions depended on IP
multicast communication for the process of joining and leaving cluster groups, as well as for other intra-
cluster communication services. IP multicast must be enabled and configured in the IP network
segments that will carry the vSAN traffic service.

A default multicast address is assigned to each vSAN cluster at the time of creation. This is true even in
vSAN 6.6. The vSAN traffic service will automatically assign the default multicast address settings to
each host, which then makes them eligible to send frames to a default Multicast Group and Multicast
Group Agent. The vSAN Default Multicast Group address is 224.1.2.3 and the vSAN Default Multicast
Group Agent address is 224.2.3.4.

When multiple vSAN clusters reside on the same Layer 2 network, VMware recommends changing the
default multicast address within the additional vSAN clusters. This prevents multiple clusters from
receiving all multicast streams, and means hosts in different clusters do not have to continually deny
requests intended for other clusters. VMware KB 2075451 can be consulted for the detailed procedure
of changing the default vSAN multicast address. Isolating each cluster's traffic to its own VLAN will
remove the possibility of conflict.

The configuration procedures for IP Multicast vary between vendors and their respective network
devices. Consult the network device vendor documentation for in-depth details and specific advanced
procedures that go beyond the scope of this document.
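On the host side, the address change described in VMware KB 2075451 is made per vmknic with esxcli. The sketch below is illustrative only: the interface name and addresses are examples, the flag meanings are assumptions to be verified against the KB for your ESXi build, and the change must be applied consistently across the cluster.

```shell
# Inspect the current vSAN multicast configuration for each vmknic.
esxcli vsan network list

# Assign non-default multicast group addresses to the vSAN vmknic of the
# second cluster (vmk2 and the addresses below are examples).
# Assumption: -u sets the master group address, -d the agent group address,
# per KB 2075451 -- confirm against the KB for your build.
esxcli vsan network ipv4 set -i vmk2 -u 224.1.2.4 -d 224.2.3.5

# Repeat on every host in that cluster so all members agree.
```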


Internet Group Management Protocol (IGMP)

Typically, all VMkernel ports on the vSAN network (prior to vSAN 6.6) would subscribe to a multicast
group using Internet Group Management Protocol (IGMP) to avoid multicast “flooding” all network
ports.

IGMP is a communication protocol used to dynamically add receivers to an IP Multicast group
membership. IGMP operations are restricted to individual Layer 2 domains. IGMP allows receivers to
send requests to the Multicast Groups they would like to join. Becoming a member of a Multicast
Group lets routers know to forward traffic destined for that Multicast Group onto the Layer 3 segment
(switch port) where the receiver is connected. This allows the switch to keep a table of the individual
receivers that need a copy of the Multicast Group traffic.

IGMP snooping allows physical network devices to forward Multicast frames to only the interfaces
where IGMP Join requests are being observed.

Figure 2. Multicast with IGMP Snooping

IGMP snooping, configured with an IGMP snooping querier, can be used to limit the physical switch
ports participating in the multicast group to only vSAN VMkernel port uplinks. The need to configure
an IGMP snooping querier to support IGMP snooping varies by switch vendor. Consult your specific
switch vendor/model best practices for IGMP snooping configuration.

vSAN will always try IGMP v3 first, then revert to IGMP v2 if necessary. In our lab testing, we also
found that IGMP v3 is much more stable than IGMP v2 in these configurations. If your switch supports
both versions of IGMP, VMware strongly recommends using IGMP v3 for the vSAN connections.

When the vSAN deployment is being performed across Layer 3 network segments, a Layer 3 capable
device (router or switch) with a connection and access to the same Layer 3 network segments can be
configured as the IGMP Querier.
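When validating an IGMP snooping/querier setup, it can help to confirm that a host is actually emitting IGMP membership reports on its vSAN vmknic. One way to observe this on ESXi is a packet capture with the tcpdump-uds utility shipped on the host; the sketch below assumes the vSAN interface is vmk2 and that the cluster still uses the default multicast group addresses.

```shell
# Capture IGMP membership reports/queries seen on the vSAN vmknic.
# Stop with Ctrl-C; add "-w /tmp/igmp.pcap" to save for offline analysis.
tcpdump-uds -i vmk2 igmp

# Alternatively, watch for multicast traffic to the vSAN default groups.
tcpdump-uds -i vmk2 host 224.1.2.3 or host 224.2.3.4
```

Seeing periodic membership reports from each host, and queries from the elected querier, is a good sign the snooping configuration is working.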


Multicast without IGMP Snooping

There are situations where IGMP Snooping would not need to be enabled. IGMP Snooping could be
disabled as long as vSAN is on a non-routed/trunked VLAN that is only extended to the vSAN ports of
all the hosts in the cluster.

If the querier has sub-par performance, this limited “flooding” configuration could actually perform
faster as you essentially turn the vSAN traffic into broadcast traffic and lower the demands placed on
the physical switch(es).

However, for the purposes of this networking guide, we will be recommending IGMP Snooping in all
cases since most modern network devices can handle this task well.

Protocol-Independent Multicast (PIM)

Protocol-Independent Multicast (PIM) is a family of Layer 3 multicast routing protocols that provide
different communication techniques for IP Multicast traffic to reach receivers that are in different
Layer 3 segments from the Multicast Groups sources.

If deploying a pre-vSAN 6.6 cluster across multiple subnets, Protocol Independent Multicast (PIM) is
typically a required configuration step to enable multicast traffic to flow across different subnets. Of
course, PIM is not necessary with vSAN 6.6, since multicast is no longer used.

Note: Some customers who do not wish to use PIM for routing multicast traffic may consider
encapsulating the multicast traffic in a VXLAN, or some other fabric overlay. However, this is outside
the scope of this document, and we will focus on PIM implementations, as this is the most common
way of routing multicast traffic. For detailed implementation of PIM, please consult with your network
vendor.

1.8 Unicast

vSAN 6.6 takes a major step in simplifying the design and deployment by removing the need for
multicast network traffic. This can provide a noticeably simpler deployment effort for not only single
site environments, but for vSAN stretched clusters as well. As mentioned, prior to the vSAN 6.6
release, vSAN CMMDS (Clustering Monitoring, Membership, and Directory Services) required IP
multicast to be enabled. Since vSAN 5.5, CMMDS used multicast as a discovery
protocol to find all other nodes trying to join a vSAN cluster with the same sub-cluster UUID. vSAN 6.6
now communicates using unicast for CMMDS updates.

A Unicast transmission/stream sends IP packets to a single recipient on a network.


Figure 3. Multicast vs. Unicast

Member Coordination with Unicast

The vSAN management layer (vSAN health) now maintains a list of vSAN cluster members and pushes
the list of nodes to the CMMDS layer. For hosts managed by vCenter Server, the vSAN cluster IP
address list is maintained centrally and is pushed down to the vSAN nodes; i.e., each vSAN node's host
configuration file (esx.conf) will contain a vSAN cluster UUID, vSAN node UUID, IP address and unicast
port. Entries will also be populated if a node is a witness node, or if the node does not support unicast.
When a vSAN node participating in a vSAN 6.6 enabled cluster is rebooted, the esx.conf file is read.
This contains the list of participating vSAN nodes.

The following changes will trigger an update from vCenter:

• A vSAN cluster is formed
• A new vSAN node is added to a vSAN enabled cluster
• An IP address change or vSAN UUID change on an existing node
• A vSAN node is removed from a cluster
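The unicast member list that vCenter pushes down after these events can be inspected directly on any host. A sketch (the exact output columns vary slightly between builds):

```shell
# Show the unicast agent entries this node holds for its cluster peers:
# peer node UUID, whether the peer is a witness, its IP address, and the
# unicast port used for CMMDS communication.
esxcli vsan cluster unicastagent list
```

Every host in the cluster should list all of its peers; a host with a stale or empty list is a likely cause of a cluster partition.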

The Cluster summary now shows if a vSAN cluster is operating in Unicast or Multicast mode:

Figure 4. Networking mode

The Networking Mode is not configurable.


Upgrade / Mixed Cluster Considerations

Upgrading to vSAN 6.6 requires the hypervisor software to be upgraded or patched. A new vSAN on-disk
format (version 5) is included with vSAN 6.6. As a result, during a vSAN cluster upgrade, vSAN nodes
will be in mixed mode from a software versioning perspective, i.e. some nodes can be running releases
prior to vSAN 6.6, while others run vSAN 6.6. As a result, the following behavior is noted:

• vSAN 6.6 clusters, where at least one node has a v5.0 on-disk format will only ever
communicate in unicast. A non-vSAN 6.6 node added to this cluster will not be able to
communicate with the vSAN 6.6 nodes (it will be partitioned).
• A uniform vSAN 6.6 Cluster will communicate using unicast with legacy disk-groups
present. This cluster will switch to multicast mode when non-vSAN 6.6 node(s) are
added.
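During a rolling upgrade, the CMMDS mode each host is actually operating in can be checked from the command line. A sketch (the exact field names in the output vary between builds, so the grep pattern is illustrative):

```shell
# Show this node's cluster membership details, including whether the
# CMMDS layer is communicating in unicast or multicast mode.
esxcli vsan cluster get

# Narrow the output to the mode-related fields (pattern is illustrative).
esxcli vsan cluster get | grep -i -e mode -e unicast
```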

| Cluster Software Configuration | vSAN Disk Format Version(s) | CMMDS Mode | Comments |
|---|---|---|---|
| 6.6 only nodes | Version 5 | Unicast | Permanently operates in unicast. Cannot switch to multicast. |
| 6.6 only nodes | Version 3 or below | Unicast | 6.6 nodes operate in unicast mode. Switches to multicast when a < vSAN 6.6 node is added. |
| Mixed 6.6 and vSAN pre-6.6 nodes | Version 5 (Version 3 or below) | Unicast | 6.6 nodes with v5 disks operate in unicast mode. Pre-6.6 nodes with v3 disks will operate in multicast mode. This will cause a cluster partition! |
| Mixed 6.6 and vSAN pre-6.6 nodes | Version 3 or below | Multicast | Cluster operates in multicast mode. All vSAN nodes must be upgraded to 6.6 to switch to unicast mode. Disk format upgrade to v5 will make unicast mode permanent. |
| Mixed 6.6 and vSAN 5.x nodes | Version 1 | Multicast | Operates in multicast mode. All vSAN nodes must be upgraded to 6.6 to switch to unicast mode. Disk format upgrade to v5 will make unicast mode permanent. |
| Mixed 6.6 and vSAN 5.x nodes | Version 5 (Version 1) | Unicast | 6.6 nodes operate in unicast mode. 5.x nodes with v1 disks operate in multicast mode. This will cause a cluster partition! |

Table 4. Unicast and Multicast modes


A vSAN cluster will continue to operate in multicast mode until all participating cluster nodes are
upgraded to vSAN 6.6.

In the scenario below, the vCenter Server is at a version that supports vSAN 6.6, i.e. 6.5 Express Patch 2,
while the hosts are on the ESXi 6.5 GA release, so the cluster nodes are still communicating via multicast.
During the upgrade, there will be a mix of vSAN cluster node versions, i.e. vSAN 6.5 and vSAN 6.6 nodes.

Figure 8. Still in multicast mode

Once all nodes in a vSAN cluster are upgraded to vSAN 6.6 software, then, and only then, will the
cluster switch to unicast, even if the vSAN disk groups are below version 5.0.

Figure 9. Switched to unicast mode

Considerations with vSAN 6.6 & pre-v5 Disk Groups

As highlighted, a vSAN 6.6 cluster will automatically revert to multicast communication under the
following circumstances:

1. All vSAN cluster nodes are running vSAN version 6.6
2. All vSAN disk groups are at on-disk version 3 or earlier
3. A non-vSAN 6.6 node (e.g. vSAN 6.2 or vSAN 6.5) is introduced to the vSAN cluster

For example, if a vSAN 6.5 node (or earlier version) is added to an existing vSAN 6.6 cluster, the vSAN
cluster will revert to using multicast and include the vSAN 6.5 node as a valid node.

To avoid this behavior, both the vSAN nodes and the on-disk format need to be at the latest versions.
To ensure a vSAN cluster will not revert to multicast, it is recommended to upgrade the vSAN disk
groups on the vSAN 6.6 nodes to on-disk version 5.0 as soon as possible.

In summary, the presence of a single v5 disk group in a vSAN 6.6 cluster will trigger the cluster to
permanently communicate in unicast mode.


Considerations with vSAN 6.6 nodes and v5 Disk Groups

In this example, a vSAN 6.6 cluster is already using on-disk version 5. A non-vSAN 6.6 node, e.g. a vSAN
6.5 node, is added to the cluster. The vSAN 6.5 node can be added to a vSAN 6.6 cluster in vCenter;
nothing will prevent you from doing this.

However, because the vSAN 6.6 cluster is permanently communicating in unicast mode, the following
will occur.

• The vSAN 6.5 node will form its own Network Partition.
• The vSAN 6.5 node will continue to communicate in multicast mode and will be unable to
establish communications with vSAN 6.6 nodes since they use unicast.

Figure 10. Partition between unicast and multicast

The vSAN Health tests will fail various checks, e.g. the network, cluster, disk format, software version,
and advanced parameter checks, as shown here:

Figure 11. Health check failures when non vSAN 6.6 host added to cluster

A Cluster Summary warning will also appear related to the on-disk format. This will highlight that one
node is at a version lower than 6.6 from an ESXi perspective, and that it needs to be upgraded before
the disk format upgrade can complete. You cannot upgrade disk format versions when a cluster is in
mixed software mode.


Figure 12. Cluster Summary warnings

Considerations with vCenter unavailability

As mentioned, vCenter is regarded as the “source of truth” for vSAN cluster membership. vSAN Nodes
are automatically updated with the latest host membership list coming from vCenter. However, if
vCenter is unavailable for a period, there is a possibility that the list of vSAN nodes maintained by
vCenter and the actual number of vSAN nodes may differ due to nodes being added or removed while
vCenter was unavailable.

Examples of vCenter outages may include:

• Restored from a previous backup / configuration
• Replaced by a new instance

Caution: To avoid data unavailability and disruptions to storage I/O when vCenter is impacted by an
outage, an advanced parameter needs to be enabled on each unicast-mode vSAN 6.6 node that was
managed by the affected vCenter instance. This advanced parameter is called
/vSAN/IgnoreClusterMemberListUpdates. It must be changed from its default value of 0 to 1 on each
vSAN 6.6 node in each vSAN cluster that is managed by the compromised vCenter.

Figure 13. /vSAN/IgnoreClusterMemberListUpdates

To view the existing value, issue:

esxcli system settings advanced list \
    --option=/vSAN/IgnoreClusterMemberListUpdates

To enable the option (set it to 1), issue the following esxcli command:

esxcli system settings advanced set --int-value=1 \
    --option=/vSAN/IgnoreClusterMemberListUpdates

or

esxcfg-advcfg -s 1 /vSAN/IgnoreClusterMemberListUpdates

This advanced setting, when set to 1, is effective immediately and is persisted across reboots, as it is
stored in the host's configuration file (esx.conf). Changes to this advanced setting will also be logged
in the VMkernel logs.

To change the value back to the default, set it to 0:


esxcli system settings advanced set --int-value=0 \
    --option=/vSAN/IgnoreClusterMemberListUpdates

or

esxcli system settings advanced set --default \
    --option=/vSAN/IgnoreClusterMemberListUpdates

or

esxcfg-advcfg -s 0 /vSAN/IgnoreClusterMemberListUpdates

Note: this advanced parameter is not visible from the vSphere Web Client or the host client user
interfaces. However, it is exposed in the vSphere API, and can be modified from PowerCLI, for
example. We can connect directly to a vSAN node using PowerCLI and issue the following to query the
setting:

Get-AdvancedSetting -Entity hostname -Name vSAN.IgnoreClusterMemberListUpdates | Format-Table -AutoSize

Figure 14. PowerCLI snippet

To modify the advanced setting, issue the following PowerCLI command:

Get-AdvancedSetting -Entity hostname -Name vSAN.IgnoreClusterMemberListUpdates | Set-AdvancedSetting -Value "1"

Figure 15. PowerCLI snippet

Considerations with vSAN 6.6 Unicast and DHCP

If vCenter Server is deployed on a vSAN 6.6 cluster, and the unicast IP addresses of that cluster were
obtained from a DHCP server without reservations, then this configuration is not supported.
The reason is explained in the following scenario:

• The vCenter VM is offline
• vSAN VMkernel ports obtain new IP address leases

When vCenter server recovers, vCenter (vSAN health) will attempt to reconcile its current list of unicast
addresses with the vSAN cluster, and will push down stale unicast addresses to vSAN nodes. This may
trigger a vSAN cluster partition and vCenter may no longer be accessible (since it runs on that vSAN
cluster). DHCP with reservations (i.e. assigned IP addresses that are bound to the MAC addresses of
the vSAN VMkernel ports) is supported, as is DHCP without reservations provided the managing
vCenter is hosted outside of the vSAN cluster.

Considerations for vSAN 6.6 with IPv6

IPv6 is supported for unicast communications in vSAN 6.6. With IPv6, a link-local address is an IPv6
unicast address that can be automatically configured on any interface using the link-local prefix. By
default, vSAN does not add a node's link-local address to other cluster nodes (as a neighbor). As a
consequence, IPv6 link-local addresses are not supported for unicast communications in vSAN 6.6.

Considerations for vCenter and vSAN Maintenance

Another operational concern is vSAN VMkernel network address configuration changes. For example,
there may be scenarios where a vSAN cluster must be taken completely offline to facilitate a major
networking change, such as assigning the vSAN cluster a new VLAN or a new network subnet. In this
case, a potential workflow would be to shut down existing workloads, disable vSAN, implement the
network change (e.g. assign new IP addresses), and then re-enable vSAN.

This operation is safe to perform if, and only if, vCenter remains active and orchestrates the
networking changes, i.e. the unicast configurations remain in sync. If vCenter is down, however, this
will trigger unwanted behavior. It is best practice during vSAN network maintenance to ensure vCenter
remains active.

Note: Disabling and Re-enabling vSAN does not remove your data. The data and VMs will still be
present when vSAN is re-enabled.

Query Unicast with esxcli

There are a number of ESXCLI commands that can be helpful when determining the unicast configuration.
The output of esxcli vsan cluster get has changed to reflect communication modes. Note that the
vSAN cluster node now displays which CMMDS mode it is operating in: unicast or multicast.

Figure 5. esxcli vsan cluster get

To interrogate which vSAN cluster nodes are operating in unicast mode, issue esxcli vsan cluster
unicastagent list:

Figure 6. esxcli vsan cluster unicastagent list

This command displays the view of the cluster from this node's perspective. The above output shows
the other three nodes of a four-node vSAN cluster; it does not show the host from where the command
is run. The output includes the vSAN NodeUUID, IPv4 address, the UDP port it is communicating on, and
whether the node is a data node (0) or a witness node (1). This list is maintained by vCenter and
persisted in a vSAN node's host configuration file (i.e. esx.conf).


The command esxcli vsan network list displays the VMkernel interface used by vSAN for
communications, the unicast port (12321), and the traffic type (vsan or witness) associated with the
vSAN interface.

Figure 7. esxcli vsan network list

While multicast group information is still present in the above output, it is not relevant when a vSAN
node is communicating in unicast mode. In future releases, multicast information may be removed from
this command's output.

Inter Cluster CMMDs Traffic Considerations

In multicast mode, the vSAN master node sends a single message to the switch, which the switch then
forwards to all the cluster nodes. In unicast mode, the master itself addresses all the cluster nodes,
as it must send the same message "n" number of times (where n = number of vSAN nodes). Therefore,
there is a slight increase in vSAN CMMDS traffic when operating in unicast mode, although it is not
really noticeable during normal, steady-state operations.

Inter Cluster CMMDs Traffic - Single Rack, Single Switch Topology

If all the nodes of a vSAN cluster are connected to the same top-of-rack switch, then the total
increase in traffic is only between the master node and the switch.


Figure 16. Sample Unicast topology

If, however, a vSAN cluster spans more than one top-of-rack (TOR) switch, traffic between the switches
will grow significantly in unicast mode compared to multicast mode. A common setup with multiple TORs
is Fault Domains (rack awareness), where the cluster may span many racks.

In unicast mode, 'n' messages are sent to each of the other racks or Fault Domains, where 'n' is the
number of nodes in each Fault Domain (FD).

Consider an example with three fault domains, FD1, FD2, and FD3, each containing 3 nodes, where each
node has 10MB of data in CMMDS:

• The vSAN master node is currently located in FD1
• FD2 must replicate 30MB (3 nodes x 10MB) to FD1 via its TOR
• FD3 must also replicate 30MB to FD1 via its TOR
• FD1 must in turn communicate 60MB back to FD2 and FD3
• The remaining nodes in FD1 send 20MB to the master node
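The arithmetic in this example can be sketched as a quick shell calculation. The fault-domain sizes and the 10MB-per-node CMMDS figure are the example's own assumptions, not fixed vSAN values:

```shell
#!/bin/sh
# Illustrative estimate of unicast CMMDS traffic between fault domains.
# Values mirror the example above: 3 FDs, 3 nodes each, 10MB CMMDS per node.
NODES_PER_FD=3
CMMDS_MB_PER_NODE=10
REMOTE_FDS=2   # FD2 and FD3; the master lives in FD1

# Each remote FD replicates all of its nodes' CMMDS data to FD1 via its TOR.
PER_REMOTE_FD_MB=$((NODES_PER_FD * CMMDS_MB_PER_NODE))

# FD1 must in turn push updates back to both remote FDs.
RETURN_MB=$((REMOTE_FDS * PER_REMOTE_FD_MB))

# The non-master nodes inside FD1 send their data to the master locally.
INTRA_FD1_MB=$(((NODES_PER_FD - 1) * CMMDS_MB_PER_NODE))

echo "Per remote FD to FD1: ${PER_REMOTE_FD_MB} MB"
echo "FD1 back to remote FDs: ${RETURN_MB} MB"
echo "Within FD1 to master: ${INTRA_FD1_MB} MB"
```

Scaling the node counts up shows why inter-switch traffic grows so much faster in unicast mode as fault domains get larger.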

Inter Cluster CMMDs Traffic - Stretched Cluster

Figure 17. Sample stretched cluster topology

The master, during normal operation, is located at the Preferred Site. As per the Fault Domain
example, CMMDS data must be communicated from the Secondary Site to the Preferred Site, i.e.
(number of nodes in the secondary site) x (CMMDS node size in MB).

Note: The witness site traffic requirements do not change with unicast in the vSAN 6.6 release

Inter Cluster CMMDS Traffic - Cluster Recovery Operations

In steady state, CMMDS update traffic is minimal between cluster nodes. However, when a cluster is
being reformed after an outage (or, in the case of a stretched cluster, a site outage), CMMDS
information must be exchanged between the master, backup, and agent nodes to rebuild the vSAN cluster
state.

As a result, CMMDS traffic can temporarily increase as part of recovery operations. This can last
anywhere between 10-15 seconds. For example, take a fully loaded 16-node vSAN 6.6 stretched cluster
(8+8+1). Assume there is an outage on the secondary site. During recovery, CMMDS traffic may reach up
to 640 MB over a 10-15 second period, for a throughput of roughly 42.6 MBps to 64 MBps.
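A quick sanity check of that arithmetic (note that 640 MB over 10 seconds works out to 64 MBps; the 640 MB figure is taken from the example, and integer division truncates the lower bound):

```shell
#!/bin/sh
# Sanity-check the CMMDS recovery throughput range quoted above.
TOTAL_MB=640        # CMMDS data exchanged during the recovery window
SHORT_WINDOW_S=10   # fastest observed recovery, seconds
LONG_WINDOW_S=15    # slowest observed recovery, seconds

MAX_MBPS=$((TOTAL_MB / SHORT_WINDOW_S))   # upper bound: 64 MBps
MIN_MBPS=$((TOTAL_MB / LONG_WINDOW_S))    # lower bound: ~42 MBps (truncated)

echo "Recovery throughput: ${MIN_MBPS}-${MAX_MBPS} MBps"
```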

1.9 Jumbo Frames

vSAN fully supports Jumbo Frames on the vSAN network. Jumbo frames are Ethernet frames with
more than 1500 bytes of payload. Conventionally, jumbo frames can carry up to 9000 bytes of
payload, but variations exist.


As per the vSAN Design and Sizing Guide (https://storagehub.vmware.com/t/vmware-vsan/vmware-r-vsan-tm-design-and-sizing-guide-2/), VMware testing has found that using jumbo frames can reduce
CPU utilization and improve throughput.

However, as general guidance, customers need to decide whether these gains outweigh the operational
overhead of implementing jumbo frames end-to-end. In data centers where jumbo frames are already
enabled in the network infrastructure, jumbo frames are recommended for the vSAN network. Otherwise,
jumbo frames are not recommended, as the operational cost of configuring jumbo frames throughout the
network infrastructure could outweigh the limited CPU and performance benefits.

1.10 IPv6 support

vSAN supports the following modes:

• IPv4-only
• IPv6-only (since vSAN 6.2)
• Mixed IPv4/IPv6 (since vSAN 6.2)

Prior to vSAN version 6.2, the only supported mode was IPv4. The mixed mode addresses
requirements for customers wishing to migrate their vSAN cluster from IPv4 to IPv6.

IPv6 multicast is also supported. However, there are some restrictions with IPv6 and IGMP snooping
on Cisco ACI to be aware of. For this reason, it is not recommended to implement IPv6 for vSAN using
Cisco ACI.

For detailed implementation of IPv6, please consult with your network vendor.

1.11 TCP/IP Stacks

Note: vSAN does not have its own TCP/IP stack.

Static routes are needed to route vSAN traffic in L3 networks.

Currently, vSphere does not include a dedicated TCP/IP stack for the vSAN traffic service, nor does it
support the creation of a custom vSAN TCP/IP stack. To ensure vSAN traffic in Layer 3 network
topologies leaves the host over the vSAN VMkernel network interface, administrators need to add the
vSAN VMkernel network interface to the default TCP/IP stack and define static routes for all of the
vSAN cluster members.

For those interested in learning more about TCP/IP stacks, and how they can be used for other,
non-vSAN traffic types, this section is provided for your information. vSphere 6.0 introduced a new
TCP/IP stack architecture where multiple TCP/IP stacks can be utilized to manage different VMkernel
network interfaces and their associated traffic. As a result, the new architecture provides the
ability to configure traffic services such as vMotion, Management, Fault Tolerance, etc. on completely
isolated TCP/IP stacks, with the ability to use multiple default gateways.

For network traffic isolation and security requirements, VMware recommends deploying the different
traffic services onto different network segments (VLANs) to prevent the different traffic services from
traversing through the same default gateway.


Figure 18. TCP/IP Stacks in vSphere

To configure the traffic services onto separate TCP/IP stacks, each traffic service type needs to be
deployed on its own network segment. The network segments are accessed through a physical network
adapter with VLAN segmentation and individually mapped to distinct VMkernel network interfaces with
the respective traffic services (vSAN, vMotion, Management, etc.) enabled.

Built-in TCP/IP stacks available in vSphere

• Default TCP/IP Stack – a multi-purpose stack that can be used to manage any of the host-related
traffic services. It shares a single default gateway between all configured network services.
• vMotion TCP/IP Stack – used to isolate vMotion traffic onto its own stack. The use of this stack
completely removes vMotion traffic from the default TCP/IP stack.
• Provisioning TCP/IP Stack – used to isolate some virtual machine-related operations, such as cold
migrations, cloning, snapshots, and NFC-related traffic.

The option to select a different TCP/IP stack is visible during the creation of a VMkernel interface:


Figure 19. TCP/IP Stacks – Port Properties

It is assumed that environments with isolated network requirements for the vSphere traffic services
will not be able to use the same default gateway to direct traffic. The use of different TCP/IP stacks
facilitates traffic isolation, with the ability to use different default gateways, thus avoiding the
unnecessary overhead of implementing static routes. Unfortunately, static routes remain a necessary
additional step for vSAN at this time, whenever vSAN traffic needs to be routed to a network that is
not accessible via the default gateway.

1.12 Static Routes

If best practices are adhered to, most customers will have their vSAN network on a completely
different network from the management network, and thus the vSAN network will not have a default
gateway. Therefore, in an L3 deployment, hosts that are on different subnets or different L2 segments
will not be able to reach each other over the default gateway (which is typically associated with the
management network).

This implies that customers will have to implement static routes to allow the vSAN network interfaces
from hosts on one subnet to reach the vSAN networks on hosts on the other network, and vice versa.
Static routes allow administrators to instruct a host how to reach a particular network via a particular
interface, rather than using the default gateway.

The following is an example of how to add an IPv4 static route to an ESXi host. Simply specify the
gateway (-g) and the network (-n) you wish to reach through that gateway:

esxcli network ip route ipv4 add -g 172.16.10.253 -n 192.168.10.0/24

Once the static routes have been added, vSAN traffic connectivity should be available across all
networks, assuming the physical infrastructure has been configured to allow it. Use the vmkping
command to test and confirm communication between the different networks by pinging the IP address or
the default gateway of the remote network, and vice versa. Additional options can be provided to check
different packet sizes (-s) and prevent fragmentation (-d) of the packet.

vmkping -I vmk3 192.168.10.253

• If MTU = 1500: vmkping -I vmkX -d -s 1472
• If MTU = 9000: vmkping -I vmkX -d -s 8972
• Use "-d" to set the don't-fragment bit so the full packet length is verified. If not set, the
packet can be fragmented.
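The payload sizes above come from subtracting the 20-byte IPv4 header and the 8-byte ICMP header from the MTU. A small helper function makes this explicit (a sketch; the resulting vmkping command must be run on the ESXi host itself):

```shell
#!/bin/sh
# Derive the vmkping -s payload size for a given MTU:
# MTU minus 20 bytes (IPv4 header) minus 8 bytes (ICMP header).
payload_for_mtu() {
    echo $(($1 - 20 - 8))
}

echo "MTU 1500 payload: $(payload_for_mtu 1500)"
echo "MTU 9000 payload: $(payload_for_mtu 9000)"
```

For example, on an ESXi host one could then run vmkping -I vmk3 -d -s $(payload_for_mtu 9000) 192.168.10.253, where the interface and address are placeholders taken from the example above.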

1.13 NSX

The number one question about vSAN and NSX is "are they compatible?". The quick and short answer is
yes. It is important to state that the two products can be deployed and co-exist on the same vSphere
infrastructure without any issues. Neither vSAN nor VMware NSX depends on the other to deliver its
functionality, resources, and/or services.

However, very often the question of compatibility is asked in the context of being able to place the
vSAN network traffic on an NSX-managed VxLAN/Geneve overlay. In this case, the answer is no: NSX does
not support carrying the vSAN data network traffic over an NSX-managed VxLAN/Geneve overlay.

This is not unique to vSAN. The same restriction applies to any statically defined VMkernel interface
traffic such as vMotion, iSCSI, NFS, FCoE, Management, etc.

Part of the reason for not supporting VMkernel traffic over the NSX-managed VxLAN overlay is to avoid
any circular dependency, i.e. having the VMkernel infrastructure networks depend on the VxLAN overlay
that they support. The logical networks delivered in conjunction with the NSX-managed VxLAN overlay
are designed to be used by virtual machines, which require network mobility and flexibility.

There is one other issue to note with NSX, and this is when it comes to implementing LACP/LAG. The
biggest issue with LAG arises when it is used in a Nexus environment, which defines the LAGs as vPCs.
Having a vPC implies you cannot run a dynamic routing protocol from edge devices to the physical Cisco
switches, as Cisco does not support this.

See table 2, column 3, in the following document: http://www.cisco.com/c/en/us/support/docs/ip/ip-routing/118997-technote-nexus-00.html

1.14 Flow Control

Flow control is a mechanism used to help manage the rate of data transfer between two devices. Flow
control is negotiated when auto-negotiation is performed by two physically connected devices.


Figure 20. Flow Control Detail

Pause Frames

An "overwhelmed" network node can send a pause frame to halt the transmission of the sender for a
specified period. Normally, a frame with a multicast destination address sent to a switch is forwarded
out all other ports of the switch. Pause frames, however, have a special multicast destination address
that distinguishes them from other multicast traffic. A compliant switch will not forward a pause
frame; frames sent to this address range are understood to be acted upon only within the switch. Pause
frames have a limited duration: they automatically "expire" after a certain amount of time. Two
computers connected via a switch will never send pause frames to each other, but they can send pause
frames to the switch.

One original motivation for the pause frame was to handle network interface controllers (NICs) that did
not have enough buffering to handle full-speed reception. This problem is not as common with
advances in bus speeds and memory sizes.

Congestion Control

Congestion control is the mechanism that controls the traffic on the network. Congestion control is
mainly applied to packet-switching networks. Network congestion within a switch can be caused by
overloaded ISL trunks (the inter-switch links). If the ISL load exceeds the capability of the physical
layer, the switch introduces pause frames to protect itself.

Priority Flow Control (PFC)

Priority-based flow control (IEEE 802.1Qbb) is intended to eliminate frame loss due to congestion.
This is achieved by a mechanism similar to pause frames, but one that operates on individual
priorities. PFC is also referred to as Class-based Flow Control (CBFC) or Per Priority Pause (PPP).

Flow Control vs. Congestion Control

Flow control is an end-to-end mechanism that controls the traffic between a sender and a receiver.
Flow control is handled by the data link layer and the transport layer.

Congestion control is a mechanism used by a network to control congestion within the network. This
problem is not as common in modern networks, with advances in bus speeds and memory sizes; a more
likely scenario is network congestion within a switch. Congestion control is handled by the network
layer and the transport layer.


Flow Control recommendation in vSAN environments

By default, flow control is enabled on all network interfaces in ESXi.

Flow control configuration on a NIC is done by the driver. When a NIC is being overwhelmed by network
traffic, the NIC makes the decision to send the PAUSE frames.

If flow control mechanisms such as pause frames are present in a network segment, this may increase
overall VM guest I/O latency due to increased latency at the vSAN network layer. Some network drivers
provide module options that configure flow control functionality within the driver, while others allow
you to modify the configuration using the ethtool command-line utility on the ESXi console. Whether to
use module options or ethtool depends on the implementation details of a given driver.

Please see VMware KB 1013413 on how to configure flow control on ESXi.

VMware recommends in most deployments leaving flow control enabled on ESXi network interfaces (the
default). If pause frames are identified as an issue, disabling flow control should be carefully
planned in conjunction with hardware vendor support and/or VMware Global Support Services.

In the troubleshooting section of this guide, we will show how it may be possible to recognize the
presence of pause frames being sent from a receiver to an ESXi host. A large number of pause frames in
an environment is usually indicative of an underlying network or transport issue that should be
investigated.

1.15 vSAN Network Port Requirements

In this section, the different network ports used by vSAN are highlighted.


Figure 21. vSAN network ports

Storage Providers (VASA)

vSphere Storage APIs for Storage Awareness (VASA) is a set of application program interfaces (APIs)
that enables vSphere vCenter to recognize the capabilities of storage arrays. When vSAN is enabled,
each vSAN node registers a VASA provider to vCenter Server via TCP port 8080.

RDT - Reliable Datagram Transport

RDT is a proprietary vSAN service for storage I/O. RDT uses TCP at the vSAN transport layer. RDT is
built on top of the vSAN Clustering Service and uses TCP port 2233.


CMMDS - Cluster Monitoring, Membership, and Directory Services

CMMDS is responsible for the discovery and maintenance of a cluster of networked node members. All
nodes communicate via UDP ports 12345 and 23451.

Witness Host

TCP port 2233 and UDP port 12321 need to be open for witness traffic between the witness host and the
vSAN cluster data nodes.

vSAN Observer

TCP port 8010 needs to be open to view the vSAN Observer graphs. This port is customizable from the
"vsan.observer" command, which is run from RVC.

Firewall considerations

On enablement of vSAN on a given cluster, all required ports are added to the ESXi firewall rules and
enabled/disabled automatically. There is no need for an administrator to open any firewall ports or
enable any firewall services manually.

Open ports for incoming and outgoing connections can be viewed from the UI. Select the ESXi host >
Configure > Security Profile:

Figure 22. ESXi Firewall Connections

vSAN Network Ports


Port | Protocol | Direction | Service
8080 | TCP | Incoming and outgoing | vsanvp: Used by the Storage Management Service (SMS) that is part of vSphere vCenter Server. If disabled, vSAN Storage Profile Based Management (SPBM) does not work.
2233 | TCP | Incoming and outgoing | vSAN Transport: Used for storage I/O. If disabled, vSAN does not work.
12345, 23451, 12321 | UDP | Incoming and outgoing | vSAN Clustering Service. If disabled, vSAN does not work.
3260 | TCP | Incoming and outgoing | Default iSCSI port for the vSAN iSCSI target service.
5001 | UDP | Incoming and outgoing | vsanhealth-multicasttest: vSAN Health Proactive Network test. This port is enabled on demand when the Proactive Network Test is running.
8010 | TCP | Incoming | vSAN Observer default port number for live statistics. A custom port number can also be specified for vSAN Observer.
80 | TCP | Incoming and outgoing | vSAN Performance Service metrics.

Table 5. vSAN Network Ports

Firewall considerations for unicast

Other than removal of multicast communication for vSAN CMMDS communications, there are no
changes in UDP/TCP port requirements in vSAN 6.6.


vSAN Encryption and Firewall considerations

When you enable encryption (new in vSAN 6.6), vSAN encrypts everything on the vSAN datastore. vSAN
encryption requires an external Key Management Server (KMS). vCenter Server obtains the key IDs from
the KMS and distributes them to the ESXi hosts; KMS servers and ESXi hosts communicate directly with
each other. As KMS servers may use different port numbers, a new firewall rule called vsanEncryption
is created to simplify communication between a vSAN node and the KMS server. This allows a vSAN node
to communicate directly with any port on a KMS server (TCP ports 0 to 65535).

Figure 23. vsanEncryption Firewall Connections

When an ESXi node needs to establish communication with a KMS server:

• The KMS server IP is added to the vsanEncryption rule, and the firewall rule is enabled.
• Communication between the vSAN node and the KMS server is established for the duration of the
exchange.
• Once communication between the vSAN node and the KMS server has ended, the IP address is removed
from the vsanEncryption rule, and the firewall rule is disabled once again.

vSAN nodes can potentially talk to multiple KMS hosts using the same rule.


2. NIC Teaming on the vSAN network


NIC Teaming on the vSAN network


2.1 NIC Teaming on the vSAN network

Many customers deploying vSAN require some level of network redundancy. NIC teaming can be used to
achieve network redundancy. NIC teaming can be defined as two or more network adapters (NICs) that are
set up as a "team" for high availability and/or load balancing. In this section, we cover the basics
of the various methods of NIC teaming offered by the vSphere networking stack, and how these
techniques may apply to vSAN design and architecture.

There are a number of NIC teaming options available. Some NIC teaming policies require physical switch
configuration and have a dependency on switchgear quality. Some policies also require an understanding
of networking concepts (such as Link Aggregation). Unless you are really comfortable and experienced
with network switch configuration, VMware recommends avoiding these policies. You will increase your
odds of implementing a successful vSAN network with a basic, simple, and reliable setup.

If you have any doubts about which one to choose, a basic NIC team in an active/standby configuration
with explicit failover is a good place to start.

2.2 Basic NIC Teaming


Figure 24. Single vmknic for the vSAN network

vSphere NIC teaming occurs when multiple uplink adapters, “vmnics”, are associated with a single
virtual switch to form a team. This is the basic option available through all editions of vSphere and can
be implemented using standard (vSS) or distributed vSwitches (vDS).

Failover and Redundancy

Teaming can be configured on a vSwitch to have multiple Active uplinks or an Active/Standby uplink
configuration. No special configuration such as Ether-channel or Link Aggregation is required at the
physical switch layer for basic NIC teaming. vSAN uses basic NIC teaming and failover policy provided
by vSphere.


Note: vSAN does not use NIC teaming for load balancing

Here is an example of a typical NIC team configuration, and its associated settings. The settings are
discussed in more detail next. When working on distributed switches, edit the settings of the
distributed port group used for vSAN traffic. This view is available under Teaming and failover:

Figure 25. Basic NIC teaming settings

Settings: Load balancing

In this section, the different load balancing techniques available to NIC teaming are discussed. Pros
and Cons of each load balancing techniques are highlighted.

Load balancing 1: Route based on originating virtual port

In active/active or active/passive configurations, Route based on originating virtual port is the
policy normally used for basic NIC teaming. When this policy is in effect, only one physical NIC is
used per VMkernel port.

Pros

• Simplest teaming mode, with very minimal physical switch configuration.
• Troubleshooting is simpler, as there is a single port for the vSAN traffic.

Cons

• A single VMkernel interface cannot use more than a single physical NIC's bandwidth. Since typical
vSAN environments use one VMkernel adapter, only one physical NIC in the team will be used.

Load balancing 2: Route Based on Physical NIC Load (Load Based Teaming)

Route Based on Physical NIC Load is based on Route Based on Originating Virtual Port, where the
virtual switch checks the actual load of the uplinks and takes steps to reduce the load on overloaded
uplinks. This load-balancing algorithm is available only with a vSphere Distributed Switch; it is not
available on standard vSwitches (vSS).

The distributed switch calculates an uplink for a VMkernel port from the port ID and the number of
uplinks in the NIC team. The distributed switch tests the uplinks every 30 seconds, and if an uplink's
load exceeds 75 percent of usage, the port ID of the VMkernel port with the highest I/O is moved to a
different uplink.

Pros

• No physical switch configuration, such as LACP, is required for this algorithm to work.
• Although vSAN has one VMkernel port, this policy can be helpful and effective when the same uplinks
are shared by other VMkernel ports/network services. vSAN could benefit by using a different uplink
from other contending services (e.g. vMotion, Management).

Cons

• Since vSAN typically has only one VMkernel port configured, the effectiveness of this algorithm in
achieving load balancing in a vSAN environment is limited.
• The ESXi VMkernel re-evaluates the load after each time window, which creates a minor overhead.
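The rebalancing rule described above can be summarized in shell form. This is a conceptual sketch of the 30-second/75% check, with hypothetical utilization figures, not the actual ESXi implementation:

```shell
#!/bin/sh
# Conceptual sketch of load-based teaming: every 30 seconds, if an uplink
# exceeds 75% utilization, the busiest port ID is moved to a less-loaded uplink.
THRESHOLD_PCT=75
UPLINK1_UTIL_PCT=82   # hypothetical measured utilization of uplink1
UPLINK2_UTIL_PCT=20   # hypothetical measured utilization of uplink2

if [ "$UPLINK1_UTIL_PCT" -gt "$THRESHOLD_PCT" ] && \
   [ "$UPLINK2_UTIL_PCT" -lt "$THRESHOLD_PCT" ]; then
    ACTION="move busiest port ID from uplink1 to uplink2"
else
    ACTION="no change"
fi
echo "$ACTION"
```

Because vSAN usually presents a single VMkernel port, this rebalancing mostly benefits hosts where other services (vMotion, Management) share the same uplinks.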

Settings: Network failure detection

Leave this at the default of Link status only. The only other option available here is beacon probing.
The beacon probing feature should not be configured for link failure detection: beacon probing
requires at least 3 physical NICs to avoid split-brain scenarios. More detail can be found in VMware
KB 1005577.

Settings: Notify Switches

Leave this at the default of Yes. Physical Switches have MAC address forwarding tables. The tables are
used to associate each MAC address to a physical switch port. When a frame comes in, the switch will
look up the destination MAC address in the table and decide which physical port to send the frame to.

If a NIC failover occurs, the ESXi host needs to notify the network switches that something has
changed, or the physical switch may continue to look up the old information from the MAC address
tables and send the frames to the old port.

With Notify Switches set to Yes, and when one physical NIC fails and traffic is rerouted to a different
physical NIC in the team, the Virtual Switch sends notifications over the network to update the lookup
tables on physical switches.

Notify switches does not catch VLAN misconfigurations or uplink loss further upstream. However, these
issues will be detected by the vSAN health check (network partitions), and perhaps the vDS health
check.

Settings: Failback

This option determines how a physical adapter is returned to active duty after recovering from a
failure. A failover event will trigger the network traffic to move from one NIC to another. When a link
up state is detected on the originating NIC, traffic will automatically revert to the original network
adapter if Failback is set to Yes. When Failback is set to No, a manual fail back will be required.

Setting Failback to No can be useful in some scenarios. For example, after a physical switch port
recovers from a failure, the port may be active (lights on) but may take several seconds to begin
forwarding traffic. Automatic failback has been known to cause issues in certain environments that use
the Spanning Tree Protocol. For more information regarding Spanning Tree Protocol (STP), see VMware KB
1003804.

Settings: Failover Order

Failover order determines which links are active during normal operations, and which links are active in
the event of a failover. Different supported configurations are possible for the vSAN network.

Failover Order: Active / Standby Uplinks

Having one uplink designated as Active and one designated as Standby defines an Active/Standby setup.
In the event of a failure, the NIC driver notifies vSphere that there is a link-down event on Uplink
1. The standby uplink is then promoted to active, and traffic resumes on the newly promoted link
(Uplink 2).

Figure 26. Active and Standby Uplinks

Failover Order: Active / Active Uplinks

Note that Active/Active doesn't allow the virtual port used by vSAN traffic to make use of both
uplinks at the same time.

If the teaming is configured with both Uplink 1 and Uplink 2 designated as Active, the failure
handling is slightly different, as there is no need for a standby uplink to be promoted.


Figure 27. Active Uplinks only

Note: When using active/active setups, as per VMware KB 2072928 guidance, ensure Failback is
set to No.

2.3 Advanced NIC Teaming


Figure 28. Multiple vmknics for the vSAN network

Multiple VMkernel adapters may be used when configuring vSAN networks. This configuration is typically
used when customers wish to implement an "air gap" in their vSAN networking. An air gap means that a
failure on one network path does not impact the other network path: any part of one network path can
fail, and the remaining network path can carry the traffic. This configuration is achieved by
configuring multiple vSAN-enabled VMkernel NICs on different subnets, such as another VLAN or a
separate physical network fabric.

vSAN, and indeed vSphere, does not support multiple VMkernel adapters (vmknics) on the same subnet. For more details, please see VMware KB 2010877 on multi-homing in ESXi.

Air-Gap Overview

This vSAN network design methodology comes from the concept of the “air-gap storage fabric”. This requirement states that if two storage networks are used to create a redundant storage network fabric topology, the storage networks should be completely (physically and logically) isolated from one another, or, in other words, separated by an “air-gap”.

This is the general guidance on how to “air-gap” vSAN networking in a vSphere environment: configure multiple vSAN-enabled VMkernel ports per vSAN node (host), and associate each VMkernel port with dedicated physical uplinks, using either a single vSwitch or multiple virtual switches (vSS or vDS).
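As a sketch of the host-side commands involved (assuming the VMkernel ports vmk1 and vmk2 already exist on their respective port groups; the interface names are examples only), each vmknic is tagged for vSAN traffic and the result verified with esxcli:

```shell
# Tag both VMkernel ports for vSAN traffic (one per isolated network path)
esxcli vsan network ip add -i vmk1
esxcli vsan network ip add -i vmk2

# Verify that both interfaces now carry vSAN traffic
esxcli vsan network list
```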

Figure 29. Virtual Switch configuration for multiple vSAN vmknics

Depending on the level of physical or logical separation required, each uplink would be required to be
connected to fully redundant physical infrastructure.

However, VMware does not recommend this topology. One of the main reasons is that failure of components, such as NICs on different hosts residing on the same networks, can lead to interruption of storage I/O. To avoid this, one would need to implement physical NIC redundancy on all hosts and all network segments. Configuration example 2, which follows shortly, discusses this concern in detail.

These scenarios are applicable to both L2 and L3 topologies, with both unicast and multicast
configurations (unicast being new to vSAN 6.6 of course).

Pros and Cons of Air-Gap network configurations with vSAN

Pros

• Physical and/or Logical Separation of vSAN traffic

Cons

• vSAN does not support multiple VMkernel adapters (vmknics) on the same subnet, as per
VMware KB 2010877.
• Setup is complex and error prone, thus troubleshooting becomes more complex.
• Network availability is not guaranteed with multiple vmknics in some asymmetric failures, e.g. one NIC failure on one host and another NIC failure on another host.
• No guarantee of load-balanced vSAN traffic across physical NICs.
• Cost increases from a vSAN node perspective, as you may need multiple VMkernel adapters (vmknics) to protect multiple physical NICs (vmnics), e.g. 2 x 2 vmnics may be required to provide redundancy for 2 vSAN vmknics.
• Logical Resources required are doubled, such as VMkernel Ports, IP addresses and VLANs.


• Unlike iSCSI port-binding and multi-NIC vMotion, vSAN does not implement port binding.
This means that techniques such as multi-pathing are not available.
• Layer 3 topologies are not suitable for vSAN enabled traffic with multiple vmknics and
unlikely to function as expected.
• Command line host configuration may be required to change vSAN multicast addresses in
some configurations.

2.4 Dynamic LACP (Multiple physical uplinks, 1 vmknic)

Figure 30. Link Aggregation configuration

Link aggregation combines (aggregates) multiple network connections in parallel to increase throughput and provide redundancy. When NIC teaming is configured with LACP, load balancing of the vSAN network across multiple uplinks occurs. However, this happens at the network layer and is not done by vSAN.


Note: Other terms sometimes used to describe this method include port trunking, link bundling, Ethernet/network/NIC bonding, and EtherChannel.

While link aggregation is a very loose term, for the purposes of this document the focus will be the Link Aggregation Control Protocol, or LACP. While IEEE has its own 802.3ad LACP standard, some vendors have developed proprietary types of LACP. For example, PAgP (Port Aggregation Protocol) is similar to LACP but is Cisco proprietary. Therefore, vendor guidance is crucial, and vendor best practices should be adhered to at all times.

Earlier vSphere native vSwitches had no capability to support LACP. The LACP support introduced in vSphere Distributed Switch v5.1 supported IP-hash load balancing only. vSphere Distributed Switch v5.5 and later fully support LACP.

The main takeaway when implementing LACP is that it uses an industry standard and is implemented using a port-channel. There can be many hashing algorithms; the vSwitch port group policy and the port-channel configuration (hash) must agree and match.

Link Aggregation Group (LAG) overview

LAG is short for Link Aggregation Group. A LAG is defined by the IEEE 802.1AX-2008 standard, which
states, “Link Aggregation allows one or more links to be aggregated together to form a Link
Aggregation Group”. By using the LACP protocol, a network device can negotiate an automatic
bundling of links by sending LACP packets to a peer.

LAG can be configured as either static (manual) or dynamic by using LACP to negotiate the LAG
formation. LACP can be configured in one of two modes:

• Active mode – Devices immediately send LACP messages when the port comes up. End
devices with LACP enabled (for example, ESXi hosts and physical switches) send/receive
frames called LACP messages to each other to negotiate the creation of a LAG.
• Passive mode – Places a port into a passive negotiating state, in which the port only responds to received LACP messages but does not initiate negotiation.

Note: If the host and switch are both in passive mode, LAG won't initialize since there is no active
part to trigger the linking. At least one must be Active.
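The negotiation rule above can be expressed as a small sketch (an illustration of the rule only, not any real LACP implementation):

```python
def lag_initializes(host_mode: str, switch_mode: str) -> bool:
    """A LAG only forms when at least one side is in Active mode,
    because only an Active side sends the initial LACP messages."""
    assert host_mode in ("active", "passive")
    assert switch_mode in ("active", "passive")
    return "active" in (host_mode, switch_mode)

lag_initializes("active", "passive")   # the active side initiates, LAG forms
lag_initializes("passive", "passive")  # nobody sends the first LACP message
```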

From a vSphere perspective, vSphere 5.5 and later calls this functionality “Enhanced LACP”. This
functionality is only supported on vSphere Distributed Switch version 5.5 or later.

Further information on LACP Support on a vSphere Distributed Switch can be found in the official
vSphere 6 Networking Documentation.

Settings: Load balancing

The number of LAGs that you can use depends on the capabilities of the underlying physical
environment and the topology of the virtual network.

Please consult vendor best practices when configuring LACP.

The following VMware KB 2051826 discusses the different load balancing options in more detail.


Pros and Cons of dynamic Link Aggregation/LACP

Pros

• Improves performance and bandwidth: One vSAN node or VMkernel port can communicate
with many other vSAN nodes using many different load balancing options
• Network adapter redundancy: If a NIC fails and the link-state goes down, the remaining
NICs in the team continue to pass traffic.
• Rebalancing of traffic after failures is fast and automatic

Cons

• Physical switch configuration: Less flexible and requires that physical switch ports be
configured in a port-channel configuration.
• Complex: Introducing full physical redundancy configuration gets very complex when
multiple switches are used. Implementations can become quite vendor specific.

2.5 Static LACP with Route based on IP Hash


Figure G.1. Route based on IP Hash

In our previous section on LACP, we looked at using dynamic LACP with “Source and destination IP
addresses, TCP/UDP port and VLAN” as the policy. In this section, we are going to demonstrate how to
create a vSAN 6.6 Cluster using static LACP with an IP-Hash policy. For simplicity, we will focus on
vSphere Standard Switches, but this configuration may also be implemented with distributed switches.

Static vs Dynamic

In the earlier section on Link Aggregation, LACP is used to combine and aggregate multiple network
connections. When LACP is in active or dynamic mode, a physical switch will send frames called LACP
messages to network devices (e.g. ESXi hosts) to negotiate the creation of a LAG.

However, standard switches (vSS) and pre-vSphere 5.5 Distributed Switches did not handle or understand LACP advertisements and notifications. To configure link aggregation for vSphere hosts using vSS (and pre-5.5 vDS) switches, a static channel-group must be configured on the physical switch. Cisco, for example, calls this a static EtherChannel.


Static LACP with Route Based on IP Hash overview

IP hash, when configured, takes the last octet of both the source and destination IP addresses in the packet, puts them through an XOR operation, and then runs the result through another calculation based on the number of uplinks in the NIC team. More detail on IP hash can be found in the official vSphere documentation.
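As a rough illustration of this calculation, the following sketch (a deliberate simplification, not ESXi's actual implementation) derives an uplink index from the two last octets:

```python
def ip_hash_uplink(src_ip: str, dst_ip: str, num_uplinks: int) -> int:
    """Simplified sketch of IP-hash uplink selection: XOR the last octet
    of the source and destination IPs, then reduce the result modulo the
    number of uplinks in the team to pick an uplink index."""
    src_last = int(src_ip.split(".")[-1])
    dst_last = int(dst_ip.split(".")[-1])
    return (src_last ^ dst_last) % num_uplinks

# With only a few IP addresses, many flows can land on the same uplink:
ip_hash_uplink("172.40.0.11", "172.40.0.12", 2)  # (11 ^ 12) % 2 = 1
ip_hash_uplink("172.40.0.11", "172.40.0.14", 2)  # (11 ^ 14) % 2 = 1
```

Note how both example flows hash to the same uplink index; this is why environments with a small number of IP addresses, such as small vSAN clusters, may see traffic concentrate on one uplink in the team.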

From a vSphere perspective, the Route based on IP Hash load balancing policy must be selected at the vSwitch or port group level. All uplinks assigned to the static channel group should be in the Active uplink position in the Teaming and Failover policies at the virtual switch or port group level.

A static port-channel can use any of the available load distribution policies; however, since IP hash is configured on the vSphere port group, this policy will be used. The number of ports in the EtherChannel must be the same as the number of uplinks in the team.
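On a vSphere Standard Switch, these settings can also be applied from the command line. A sketch, assuming a vSwitch named vSwitch0, a port group named vSAN, and uplinks vmnic0/vmnic1 (names are examples only):

```shell
# Set IP-hash load balancing at the vSwitch level
esxcli network vswitch standard policy failover set -v vSwitch0 -l iphash

# Apply the same policy at the vSAN port group level, with both uplinks Active
esxcli network vswitch standard portgroup policy failover set \
    -p vSAN -l iphash -a vmnic0,vmnic1
```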

Pros and Cons of static LACP with IP Hash

Pros

• Improves performance and bandwidth: One vSAN node or VMkernel port can communicate
with many other vSAN nodes using the IP-Hash algorithm.
• Network adapter redundancy: If a NIC fails and the link-state goes down, the remaining
NICs in the team continue to pass traffic.
• Can be used with both vSphere Standard Switches and vSphere Distributed Switches.

Cons

• Physical switch configuration is less flexible and requires that physical switch ports be
configured in a static port-channel configuration.
• Static port-channel will form without any verification on either end (unlike LACP - dynamic
port-channel) so there is a greater chance of misconfiguration.
• Complex: Introducing full physical redundancy configuration gets very complex when
multiple switches are used. Implementations can become quite vendor specific.
• Load balancing: If your environment has a small number of IP addresses, the virtual switch
might consistently pass the traffic through one uplink in the team. This can be especially
true for small vSAN clusters.


3. NIC Teaming Configuration Examples


3.1 Configuration 1

The following procedure outlines the steps that may be followed to configure basic NIC teaming (Active/Active) with the Route based on Physical NIC Load policy for vSAN hosts using a vSphere Distributed Switch (vDS). It is assumed that two uplinks from each host have already been added to the vDS. A distributed port group is designated for vSAN traffic and isolated to a specific VLAN. Jumbo frames are already enabled on the vDS with an MTU value of 9000. The distributed port group for vSAN traffic should have the following settings for Teaming and Failover:

• Load balancing policy set to Route Based on Physical Nic Load


• Notify Switches set to Yes
• Failback set to No (it could also be Yes, but it is set to No for the purposes of this demonstration)
• Ensure both uplinks are in the Active uplinks Position
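For reference, most of these Teaming and Failover settings have command-line equivalents on a standard switch (the load balancing policy itself does not: Route Based on Physical NIC Load is available only on a vDS and is set through the vSphere Client). A hedged sketch, assuming a port group named vSAN and uplinks vmnic0/vmnic1:

```shell
esxcli network vswitch standard portgroup policy failover set \
    --portgroup-name=vSAN \
    --active-uplinks=vmnic0,vmnic1 \
    --notify-switches=true \
    --failback=false
```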

Figure 31. Settings for configuration example

This is an example of how the distributed vSwitch would appear, including the vmknic and uplinks,
using the above policy settings:


Figure 32. Single vmknic - Virtual Switch configuration

Configuration 1: Network uplink redundancy lost

From a failover perspective, as soon as the link-down state is detected, the workload switches from one uplink to the other. There is no noticeable impact to the vSAN cluster or VM workload.

Figure 33. Network uplink redundancy lost

In the performance charts, the workload can be seen moving to vmnic0 from vmnic1:

Figure 34. Network failover to standby (now active) uplink


Configuration 1: Recovery/Failback considerations

Since Failback value is set to No, traffic will not be promoted back to vmnic1. If set to Yes, traffic would
be promoted back to vmnic1 on recovery.

Configuration 1: Load balancing considerations

Since this is a single VMkernel NIC, from a performance perspective there is no benefit to using “Route based on physical NIC load”, as has been highlighted previously in this paper.

From the vSphere performance charts, it is apparent that only one physical NIC is in use at any time. In this case, the physical NIC vmnic1 only has one client (vSAN VMkernel port vmk1) active on this port group. The other physical NIC, vmnic0, is idle.

Figure 35. Network load balancing across uplinks – there isn’t any

The same behavior of only one physical NIC active is visible via esxtop on the ESXi host:

Figure 36. Network uplinks viewed via esxtop


3.2 Configuration 2

Configuration 2: Multiple vmknics, Route based on originating port ID

In the next example configuration, the design requires two non-routable VLANs. Both are logically and physically separated. This is the air-gap approach mentioned previously. The approach taken is to use two 10Gb physical NICs and logically separate them at the vSphere networking layer. This example makes use of a vSphere Distributed Switch for easy and consistent configuration (vSphere Standard Switches may also be used, but are beyond the scope of this example).

We need to create a distributed port group for each vSAN VMkernel vmknic. Each port group will have a separate VLAN tagged. For the vSAN VMkernel configuration, we require an IP address on each of the two VLANs for vSAN traffic.

Note: In actual practical implementations, four physical uplinks would be used for full redundancy.

For each port group the Teaming and Failover policy will use the default settings.

• Load balancing set to “Route based on originating port ID”


• Network failure detection set to “Link Status Only”
• Notify Switches has default value of “Yes”
• Failback has default value of “Yes”
• The uplink configuration will comprise one uplink in the “Active” position and one uplink in the “Unused” position

The topology would look like the graphic representation below, with the “orange” network completely isolated from the “blue” network:

Figure 37. multi-vmknics - Virtual Switch configuration


Configuration 2: vSAN Port Group 1

This is the configuration of a distributed port group called vSAN-portgroup-1. We have tagged VLAN 40 to this port group, with the Teaming and Failover policy below:

• We have tagged traffic on the port group with VLAN 40


• Load balancing set to “Route based on originating port ID”
• Network failure detection set to “Link Status Only”
• Notify Switches remains with default value of “Yes”
• Failback remains default value of “Yes”
• The uplink configuration will comprise Uplink 1 in the “Active” position and Uplink 2 in the “Unused” position

Figure 38. distributed port group settings (a)

Configuration 2: vSAN Port Group 2

To complement vSAN port group 1 we are going to configure a second Distributed port group called
vSAN-portgroup-2, almost identical to the above but with two different properties:

• We have tagged traffic on the port group with VLAN 60


• The uplink configuration will comprise Uplink 2 in the “Active” position and Uplink 1 in the “Unused” position


Figure 39. distributed port group settings (b)

Configuration 2: vSAN VMkernel Port Configuration

We now need to create two vSAN-enabled VMkernel interfaces, one on each port group; we will call these vmk1 and vmk2.

• vmk1 will be associated with VLAN 40 (172.40.0.xx), and as a result port group vSAN-
portgroup-1

Figure 40. vmknic1 configuration

• vmk2 will be associated with VLAN 60 (192.60.0.XX), and as a result port group vSAN-
portgroup-2


Figure 41. vmknic2 configuration

Configuration 2: Multicast considerations with multiple vmknics

If implementing multiple vmknics in releases prior to vSAN 6.6, multicast communication will be used. It is best practice to change the default multicast address of one of the vSAN-enabled VMkernel ports, to avoid unnecessary multicast activity. This is necessary when:

• Traffic is going to be routed between both vmknics, i.e. the design may not require a completely air-gapped configuration.
• There is more than one vSAN Cluster network domain or segment

Please follow VMware KB 2075451, which gives instructions on how to change the multicast addresses on vSAN. This procedure needs to be repeated for all hosts in the cluster. This is a command-line driven change using ESXCLI, as changing multicast addresses is not exposed through the vSphere API.

In our scenario, we are going to change the default multicast agent group and multicast master addresses on vmk2 from 224.2.3.4/224.1.2.3 to 224.2.3.6/224.2.3.5:

esxcli vsan network ipv4 set -i vmk2 -d 224.2.3.6 -u 224.2.3.5

To verify, issue: esxcli vsan network list


Figure 42. multicast addresses

Note: Please check with your networking administrator to see if there are existing restrictions to
using specific multicast addresses in your network.

Configuration 2: Load balancing considerations revisited

As already mentioned, vSAN has no load balancing mechanism to differentiate between multiple vmknics. As such, the vSAN I/O path chosen is not deterministic across physical NICs. From the vSphere performance charts, it is apparent that one physical NIC is often more utilized than the other. In a simple I/O test performed in our labs, using 120 VMs with a 70:30 read/write ratio and a 64K block size on a four-node all-flash vSAN cluster, this is what was observed:


Figure 43. vSAN does not load balance I/O addresses

This unbalanced load across NICs can also be observed from the vSphere Performance Graphs:

Figure 44. vSAN node performance graph – vSAN does not load balance I/O addresses

Configuration 2: Network uplink redundancy lost

A network failure is now introduced in this configuration. This time vmnic1 is disabled on a given vSAN node. As a result, VMkernel port vmk2 is impacted. As seen already, failing a NIC generates both network connectivity and redundancy alarms:


Figure 45. Network failure alarms

From a vSAN perspective this failover process will trigger approximately 10 seconds after CMMDS
(Cluster Monitoring, Membership, and Directory Services) detects a failure. During failover and
recovery, vSAN will abort any active connections on the failed network and attempt to re-establish
connections on the remaining functional network.

Since there are two separate vSAN VMkernel ports communicating on isolated VLANs, there are also vSAN Health test failures. This is expected, as vmk2 can no longer talk to its peers on VLAN 60.

Figure 46. Network health checks failures – normal ping test

From the performance charts, the affected workload has restarted on vmnic0 since vmnic1 has a
failure (disabled for the purposes of this test). This is an important distinction between vSphere NIC
teaming and this topology. In this case, vSAN attempts to re-establish or restart connections on the
remaining network.


Figure 47. Observing connections switch to other uplink

However, in some failure scenarios, recovering the impacted connections may take up to approximately 90 seconds to complete. The 90-second value is due to the ESXi TCP connection timeout. Subsequent connection attempts might fail, but connection attempts time out after 5 seconds, and the attempts rotate through all possible IP addresses. This behavior may have a knock-on effect on virtual machine guest I/O. As a result, application and VM I/O may have to be retried.
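The described retry behavior can be sketched as follows (a simplified illustration only; try_connect is a hypothetical callback, and the timeout values simply mirror the ones quoted above):

```python
import itertools
import time

def reconnect_with_rotation(peer_addresses, try_connect,
                            attempt_timeout=5.0, total_timeout=90.0):
    """Rotate short (5 s) connection attempts through all candidate peer
    addresses until one succeeds or the overall (~90 s) timeout expires."""
    deadline = time.monotonic() + total_timeout
    for addr in itertools.cycle(peer_addresses):
        if time.monotonic() >= deadline:
            return None  # overall timeout expired: guest I/O must be retried
        if try_connect(addr, attempt_timeout):
            return addr  # connection re-established on this address

# Example: the first address is on the failed network, the second succeeds.
reconnect_with_rotation(["172.40.0.12", "192.60.0.12"],
                        lambda addr, timeout: addr == "192.60.0.12")
```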

For example, on Windows 2012 VMs, event IDs 153 (device reset) and 129 (retry events) may be logged during the failover and recovery process. In the example above, event ID 129 was logged for approximately 90 seconds until the I/O recovered.

Some guest OSes may require their disk timeout settings to be modified to ensure they are not severely impacted. Disk timeout values can vary depending on whether VMware Tools is installed and on the specific guest OS type and version. VMware KB 1009465 discusses how to change guest OS disk timeouts.
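As an illustrative example of such a change (assumption: a Linux guest exposing its disk as sda through the SCSI disk driver; Windows guests instead use the TimeoutValue registry setting, per the KB):

```shell
# Inspect the current per-device I/O timeout (commonly 30 or 180 seconds)
cat /sys/block/sda/device/timeout

# Raise it to 180 seconds to ride out the ~90 second recovery window
echo 180 > /sys/block/sda/device/timeout
```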

Configuration 2: Recovery/Failback considerations

When the network is repaired, the workload will not be automatically rebalanced unless another failure forces it to move. As soon as the impacted network is recovered, it will be available for new TCP connections.

3.3 Configuration 3

Configuration 3: Dynamic LACP - Source/Destination IP, TCP/UDP port & VLAN

In this example, we are going to configure a 2 port LACP port-channel on a switch and a 2 uplink LAG
group on a distributed vSwitch. We chose a LAG group with 2 Uplinks to remain consistent with this
document (and considering most vSAN Ready Nodes ship with this configuration). We will use 10Gb
networking, two physical uplinks per server.

The process is the same in most cases:


Configuration 3: Switch Side Setup Overview

• Identify the ports where the vSAN node will connect.
• Create a port-channel.
• If using VLANs, trunk the correct VLAN to the port-channel.
• Configure the desired distribution or load-balancing option (hash).
• Set the LACP mode to active/dynamic.
• Verify the MTU is configured properly.

Configuration 3: vSphere Side Setup Overview

• Configure the vDS with the correct MTU
• Add hosts to the vDS
• Create a LAG with the correct number of uplinks and attributes matching the port-channel
• Assign physical uplinks to the LAG group
• Create a distributed port group for vSAN traffic and assign the correct VLAN
• Configure VMkernel ports for vSAN with the correct MTU
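Once both sides are configured, end-to-end MTU can be sanity-checked from each host. A sketch, assuming vmk1 carries vSAN traffic and a peer host at 172.40.0.12 (with an MTU of 9000, the largest unfragmented ICMP payload is 9000 - 28 = 8972 bytes):

```shell
# -d sets "do not fragment"; -s sets the ICMP payload size
vmkping -I vmk1 -d -s 8972 172.40.0.12
```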

Configuration 3: Physical Switch Setup Detailed

This setup followed Dell’s guidance here: http://www.dell.com/Support/Article/us/en/19/HOW10364.

In our example, we are going to configure a 2 uplink LAG as follows:

• Switch ports 36 and 18.


• We are using VLAN trunking, so port-channel will be in VLAN trunk mode, with the
appropriate VLANs trunked (VLAN 40).
• We have decided on “Source and destination IP addresses, TCP/UDP port and VLAN” as the
method of load-balancing or load distribution.
• We have verified the LACP mode will be “active” (aka dynamic).

On our DELL switch, these were the actual steps taken to configure an individual port-channel:

• Step 1: create a port-channel, “1” in this case:

#interface port-channel 1

• Step 2: set port-channel to VLAN trunk mode:

#switchport mode trunk

• Step 3: allow appropriate VLANs:

#switchport trunk allowed vlan 40

• Step 4: configure the load balancing option:

#hashing-mode 6

• Step 5: assign the correct ports to the port-channel and set mode to active

#interface range Te1/0/36, Te1/0/18


#channel-group 1 mode active

Full set of steps:

#interface port-channel 1
#switchport mode trunk
#switchport trunk allowed vlan 40
#hashing-mode 6
#exit


#interface range Te1/0/36, Te1/0/18


#channel-group 1 mode active

Verify that the port-channel is configured correctly:

#show interfaces port-channel 1


Channel  Ports                         Ch-Type  Hash Type  Min-links  Local Prf
-------  ----------------------------  -------  ---------  ---------  ---------
Po1      Active: Te1/0/36, Te1/0/18    Dynamic  6          1          Disabled

Hash Algorithm Type
1 - Source MAC, VLAN, EtherType, source module and port Id
2 - Destination MAC, VLAN, EtherType, source module and port Id
3 - Source IP and source TCP/UDP port
4 - Destination IP and destination TCP/UDP port
5 - Source/Destination MAC, VLAN, EtherType, source MODID/port
6 - Source/Destination IP and source/destination TCP/UDP port
7 - Enhanced hashing mode

Note: This procedure must be repeated on all participating switch ports that are connected to your vSAN nodes.

Configuration 3: vSphere Distributed Switch Setup

Before you begin, make sure that the vDS is upgraded to a version that supports LACP. To verify, right-click on the vDS and check if an upgrade option is available. Depending on the original version of the Distributed Switch, you may have to upgrade the vDS to a minimum version to take advantage of LACP.

Figure 49. Verify that the vDS supports LACP


Configuration 3: Create LAG Group on VDS

To create a LAG Group on a distributed switch, select the vDS, go to Configure Tab, and select LACP.
Click on the green + symbol to create a new LAG:

Figure 50. Add a New Link Aggregation Group

The following properties need to be set on the New Link Aggregation Group Wizard:

• LAG name, in this case we will use the name lag1


• Number of ports should be set to 2 to match port-channel on switch
• Mode should be set to Active, as this is what we have configured on the physical switch
• Load balancing mode should match physical switch hashing algorithm and therefore we are
going to set this to “Source and destination IP addresses, TCP/UDP port and VLAN”


Figure 51. LAG Settings

Configuration 3: Add physical Uplinks to LAG group

Since we have already added our vSAN nodes to the vDS, the next step is to assign the individual vmnics to the appropriate LAG ports.

• Select and right click on the appropriate vDS, Select Add and Manage Hosts…
• Select Manage Host Networking, and add your attached hosts you wish to configure.
• On the select network adapter tasks, select Manage Physical Adapters

At this point, we are going to select the appropriate adapters and assign them to the LAG ports.


Figure 52. Assigning uplinks to the LAG (a)

In this scenario, we are re-assigning vmnic0 from Uplink 1 position to port 0 on lag1:

Figure 53. Assigning uplinks to the LAG (b)

Now we must repeat the procedure for vmnic1 to the second lag port position, i.e. lag1-1. The
configuration should now look like this:


Figure 54. Assigning uplinks to the LAG (c)

This procedure must be repeated on all participating vSAN nodes. The LAG configuration can also be interrogated from the command line using esxcli:

esxcli network vswitch dvs vmware lacp config get

Figure 55. Querying LAG configuration from ESXCLI (a)

esxcli network vswitch dvs vmware lacp status get


Figure 56. Querying LAG configuration from ESXCLI (b)

Note: The most important flag is the "SYN" flag (Port state). Without it, LAG won’t form.

Configuration 3: Distributed port group Teaming and Failover policy

We now need to assign the LAG group as an “Active uplink” in the distributed port group’s Teaming and Failover policy. Select or create the designated distributed port group for vSAN traffic. In our case, we already have a vSAN port group called “vSAN” with VLAN ID 40 tagged. Edit the port group and configure the Teaming and Failover policy to reflect the new LAG configuration.

Ensure the LAG group “lag1” is in the active uplinks position and ensure the remaining uplinks are in
the Unused position.

Note: When a link aggregation group (LAG) is selected as the only active uplink, the load balancing
mode of the LAG overrides the load balancing mode of the port group. Therefore, the “Route based on
originating virtual port” load balancing policy plays no role here.


Figure 57. Choosing a LAG for the active “uplink”

Configuration 3: Create the VMkernel interfaces

The final step is to create the VMkernel interfaces to use the new distributed port group, ensuring that
they are tagged for vSAN traffic. This is an example topology of a 4-node vSAN cluster using LACP.
We can observe that each vSAN vmknic can communicate over vmnic0 and vmnic1 on a LAG group to
provide load balancing and failover:


Figure 58. Reviewing LACP/LAG configuration

Configuration 3: Load balancing considerations

From a load balancing perspective, while we do not see a consistent balance of traffic across all hosts
on all vmnics in this LAG setup, we do see more consistency compared to Route based on physical NIC
load in configuration 1 and the air-gapped/multiple vmknics in configuration 2.


As before, if we look at the individual hosts’ vSphere Performance Graph, we see some better load
balancing:

Figure 60. LACP Load Balancing, ESXi host view


Configuration 3: Network uplink redundancy lost

Now let’s take a look at network failures. In this example, vmnic1 is disabled on a given vSAN node.
From an alarm perspective, we see that a Network redundancy alarm has triggered:

Figure 61. Uplink failure with LACP configured

From the performance charts, we see the workload moved to vmnic0 from vmnic1:

Figure 62. Workload moving between uplinks with LACP configured

We do not observe any vSAN related health alarms, and the impact to Guest I/O is minimal compared
to the air-gapped/multi-vmknics configuration. This is because we do not have to abort any TCP
sessions with LACP configured, unlike previous examples.

Configuration 3: Recovery/Failback considerations

In a failback scenario, we see distinct behavior differences between Load Based Teaming, multi-
vmknics and LACP in a vSAN environment. After vmnic1 is recovered, traffic is automatically
(re)balanced across both active uplinks:


Note: This behavior can be quite advantageous for vSAN traffic.

Configuration 3: Failback set to Yes or No?

We have already discussed the fact that a LAG load balancing policy overrides a vSphere Distributed port group’s Teaming and Failover policy. What we also need to consider is the guidance on the Failback value. In our lab tests, we have verified there are no discernible behavior differences between Failback set to Yes or No with LACP. LAG/LACP takes priority over the port group settings, as is the case with port group load balancing policies.


Figure 64. Failback setting for LAG policies

Note: Network failure detection values remain as “link status only” as beacon probing is not
supported with LACP. See VMware KB Understanding IP Hash load balancing (2006129)

3.4 Configuration 4

Configuration 4: Static LACP – Route based on IP Hash


In this example, we are going to configure a two-port static port-channel on the switch (often loosely
called "static LACP", although no LACP negotiation actually takes place) and two active uplinks on a
vSphere Standard Switch. We will use 10Gb networking with two physical uplinks per server, and only a
single VMkernel interface (vmknic) for vSAN on each host.

More information on host requirements and configuration examples can be found in the following
VMware Knowledgebase articles:

• Host requirements for link aggregation for ESXi and ESX (1001938)
• Sample configuration of EtherChannel / Link Aggregation Control Protocol (LACP) with
ESXi/ESX and Cisco/HP switches (KB 1004048)

Configuration 4: Physical Switch Configuration

This setup used a Dell PowerConnect N4064 switch, and we followed Dell's guidance at:

http://www.dell.com/Support/Article/us/en/19/HOW10364.
In our example, we are going to configure a 2-uplink static port-channel as follows:

• Switch ports 43 and 44.


• VLAN trunking, so port-channel will be in VLAN trunk mode, with the appropriate VLANs
trunked (VLAN 200).
• We do not have to specify the load-balancing policy on the port-channel group.


On our DELL switch, these were the actual steps taken to configure an individual port-channel:

• Step 1: create a port-channel, “13” in this case:

#interface port-channel 13

• Step 2: set port-channel to VLAN trunk mode:

#switchport mode trunk

• Step 3: allow appropriate VLANs:

#switchport trunk allowed vlan 200

• Step 4: assign the correct ports to the port-channel and set the channel-group mode to "on" (static):

#interface range Te1/0/43, Te1/0/44

#channel-group 13 mode on

Verify that the port-channel is configured as a static port-channel:

#show interfaces port-channel 13


Channel Ports Ch-Type Hash Type Min-links Local Prf
------- ----------------------------- -------- --------- --------- --
Po13 Active: Te1/0/43, Te1/0/44 Static 7 1 Disabled
Hash Algorithm Type
1 - Source MAC, VLAN, EtherType, source module and port Id
2 - Destination MAC, VLAN, EtherType, source module and port Id
3 - Source IP and source TCP/UDP port
4 - Destination IP and destination TCP/UDP port
5 - Source/Destination MAC, VLAN, EtherType, source MODID/port
6 - Source/Destination IP and source/destination TCP/UDP port
7 - Enhanced hashing mode

Configuration 4: vSphere Standard Switch (vSS) Configuration

For the purposes of this walkthrough, we will assume that the reader is competent in the configuration
and creation of vSphere Standard Switches.

In this example, we are using:

• Four identical vSAN nodes


• Uplinks named vmnic0 and vmnic1
• VLAN 200 is trunked to the switch ports and port-channel
• Jumbo frames are configured

These configuration steps must be replicated on each vSAN node!

On each node, we created a vSwitch1 with MTU set to 9000, and vmnic0 and vmnic1 were added to
the vSwitch. On the Teaming and Failover Policy we set both Adapters to the Active position. The Load
Balancing Policy was set to “Route Based on IP Hash”:


Figure G.2. Route based on IP Hash

Network failure detection, Notify Switches and Failback remained at default. All port groups inherit the
Teaming and Failover policy, as it is set at the vSwitch level. Although individual port-group teaming
and failover policies can be overridden to differ from the parent vSwitch, IP hash load balancing
should be set for all port groups using the same set of uplinks.
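The per-host steps above can also be scripted with esxcli. The following is a sketch only, assuming the names used in this example (vSwitch1, vmnic0/vmnic1, VLAN 200) and a hypothetical port group named "vSAN-PG"; substitute your own names and run it on every vSAN node:

```shell
# Sketch only -- vSwitch1, vmnic0/vmnic1 and the "vSAN-PG" port group
# name are examples. Run on each ESXi host in the cluster.
esxcli network vswitch standard add -v vSwitch1
esxcli network vswitch standard set -v vSwitch1 -m 9000            # jumbo frames
esxcli network vswitch standard uplink add -v vSwitch1 -u vmnic0
esxcli network vswitch standard uplink add -v vSwitch1 -u vmnic1
# Both uplinks active, load balancing set to Route based on IP hash:
esxcli network vswitch standard policy failover set -v vSwitch1 \
  -a vmnic0,vmnic1 -l iphash
esxcli network vswitch standard portgroup add -v vSwitch1 -p vSAN-PG
esxcli network vswitch standard portgroup set -p vSAN-PG --vlan-id 200
```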

Configuration 4: Load Balancing Considerations

From a load balancing perspective, we do not see a consistent balance of traffic across all physical
vmnics, but both physical uplinks are utilized. However, the only active traffic in the figures below was
vSAN traffic, which was essentially four vmknics or IP addresses. This behavior can therefore be
explained by the low number of IP addresses and possible hashes. In some situations, the virtual
switch might consistently pass the traffic through one uplink in the team. For further details on the IP
hash algorithm, see the official vSphere documentation on "Route Based on IP Hash".


Figure G.4. Load balancing/vmnic usage with Route based on IP Hash

Figure G.5. Load balancing/vmnic usage with Route based on IP Hash


Configuration 4: Network Redundancy Considerations

In this example, the switch port that vmnic1 is connected to is disabled, so that we can look at failure
and redundancy behavior. From an alarm perspective, we see that a network uplink redundancy alarm
has triggered:

Figure G.6. Network uplink redundancy lost

We do not observe any vSAN-related health alarms, cluster and VM components are not affected, and
Guest storage I/O is not interrupted by this failure.

Figure G.7. All traffic flows via a single uplink on failure

Configuration 4: Recovery/Failback considerations

As per LACP/LAG considerations covered previously in this document, once vmnic1 recovers, traffic is
automatically (re)balanced across both active uplinks:


Figure G.8. Traffic flows via both uplinks on recovery


4. Network I/O Control




4.1 Network I/O Control

vSphere Network I/O Control (NIOC), a feature available with vSphere Distributed Switches, is a
mechanism to implement Quality of Service (QoS) on network traffic. This can be extremely useful for
vSAN when vSAN traffic does not have its own dedicated network interface card (NIC) and has to
share the physical NIC with other traffic types, e.g. vMotion, management, and virtual machines.

4.2 Enabling NIOC

NIOC is very simple to set up, and can be enabled in the configuration properties of the VDS, as shown
below. Note that NIOC is not available on standard vSwitches, only on distributed switches.

Figure 65. Enabling NIOC

NIOC can be used to reserve bandwidth for network traffic based on the capacity of the physical
adapters on a host. For example, if vSAN traffic uses 10-GbE physical network adapters, and those
adapters are shared with other system traffic types, you can use vSphere Network I/O Control to
guarantee a certain amount of bandwidth for vSAN.

This can be useful when traffic such as vSphere vMotion, vSphere HA, virtual machine traffic, and so
on, share the same physical NIC as the vSAN network.

4.3 Reservation, Shares and Limits

Setting a reservation means that Network I/O Control guarantees that a minimum amount of
bandwidth is available on the physical adapter for vSAN. This can be useful when bursty traffic, such
as vMotion during full host evacuations, could otherwise impact vSAN traffic. Reservations are ONLY
invoked if there is contention for network bandwidth. One disadvantage of reservations in NIOC is that
while unused reservation bandwidth can be made available to other types of system traffic, it cannot
be allocated to virtual machine traffic. The total bandwidth reserved among all system traffic types
cannot exceed 75 percent of the bandwidth of the physical network adapter with the lowest capacity.


vSAN recommendation on reservations: Since traffic reserved for vSAN cannot be allocated to virtual
machine traffic, the recommendation is to avoid using NIOC reservations in vSAN environments.

Setting shares makes a certain bandwidth available to vSAN when the physical adapter assigned for
vSAN becomes saturated. This prevents vSAN from consuming the entire capacity of the physical
adapter during rebuild and synchronization operations. For example, the physical adapter might
become saturated when another physical adapter in the team fails and all traffic in the port group is
transferred to the remaining adapter(s) in the team. The shares mechanism ensures that no other
traffic impacts the vSAN network, and vice versa.

vSAN recommendation on shares: Since this is the fairest of the bandwidth allocation techniques in
NIOC, this is the technique preferred and recommended for use in vSAN environments.

Setting limits defines the maximum bandwidth that a certain traffic type can consume on an adapter.
However, even if no other traffic type is using the available bandwidth, the traffic type with the limit
cannot consume more than its cap.

vSAN recommendation on limits: Since traffic types with limits cannot consume additional bandwidth,
the recommendation is to avoid using NIOC limits in vSAN environments.

4.4 Network Resource Pools

Here is a default view of all of the system traffic types that can be controlled via NIOC. If you have
multiple virtual machine networks, the virtual machine traffic can be assigned a certain bandwidth,
and network resource pools can be used to consume parts of that bandwidth on a per virtual machine
port group basis.

Figure 66. Network Resource Pools in NIOC

4.5 NIOC Configuration Example

Let’s take the example of a vSAN cluster with a single 10-GbE physical adapter, for simplicity. This NIC
handles traffic for vSAN, vSphere vMotion, virtual machines, and iSCSI/NFS. To change the shares
value for a traffic type, simply select that traffic type from the VDS > Configure > Resource Allocation > System
Traffic view, and click on the “edit” icon. Below, the shares value for vSAN traffic has been changed
from the default of Normal/50 to High/100:

Figure 67. Network Resource Settings for vSAN in NIOC

After editing the other traffic types, assume that we now have the following share values:

Traffic Type         Shares    Value

vSAN                 High      100

vSphere vMotion      Low       25

Virtual machine      Normal    50

iSCSI/NFS            Low       25

Table 6. Sample NIOC settings

If the 10-GbE adapter becomes saturated, Network I/O Control allocates bandwidth in proportion to
the shares: with 200 total shares, vSAN receives 5 Gbps (100/200 of 10 Gbps), virtual machine traffic
2.5 Gbps (50/200), and vMotion and iSCSI/NFS 1.25 Gbps each (25/200). The above values may be
used as a starting point for any NIOC configuration on vSAN. vSAN should always have the highest
priority compared to any other protocol.
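To see how shares translate into bandwidth, the arithmetic can be sketched in a few lines of shell. The share values are the hypothetical ones from Table 6; each traffic type's share of the link is its shares divided by the total shares, multiplied by the link speed:

```shell
# Hypothetical share values from Table 6; adjust shares and link speed
# to match your environment.
total_shares=$((100 + 25 + 50 + 25))   # vSAN + vMotion + VM + iSCSI/NFS = 200
link_gbps=10

# Bandwidth a traffic type receives under saturation = link * (shares / total)
for entry in vSAN:100 vMotion:25 VM:50 iSCSI:25; do
  name=${entry%%:*}
  shares=${entry##*:}
  awk -v n="$name" -v s="$shares" -v t="$total_shares" -v l="$link_gbps" \
      'BEGIN { printf "%s: %.2f Gbps\n", n, l * s / t }'
done
```

When the 10-GbE adapter is saturated, this yields 5 Gbps for vSAN, 2.5 Gbps for virtual machine traffic, and 1.25 Gbps each for vMotion and iSCSI/NFS.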

For more details on the various parameters for bandwidth allocation, check out the official vSphere
networking guide: https://docs.vmware.com/en/VMware-vSphere/6.5/vsphere-esxi-vcenter-
server-65-networking-guide.pdf

VMware provides a VDS with each of the vSphere editions sold for vSAN, which means NIOC can be
configured with any vSAN edition.


5. vSAN Network Topologies




5.1 vSAN Network Topologies

This section covers the different network topologies supported with vSAN, and discusses the impact
each topology has on the overall deployment and management of vSAN.

Of course, the introduction of unicast support in vSAN 6.6 provides a significant simplification of
network design. In most topologies, the differences between vSAN 6.6 and vSAN 6.5 (and earlier) are
discussed.

5.2 Standard Deployments

Standard Deployments
Layer-2, Single Site, Single Rack

A Layer 2 network topology forwards packets through intermediate Layer 2 devices such as bridges
and switches, and offers the least complex implementation and management of vSAN. VMware
recommends the use and configuration of IGMP snooping to avoid sending unnecessary multicast
traffic on the network. In this first example, we are looking at a single site, and perhaps even a single
rack of servers, using vSAN 6.5 or earlier. In this case, multicast is still required, and IGMP snooping
should be enabled. Since everything is on the same L2 segment, there is no need to worry about
routing the multicast traffic.

Figure 68. L2 Topology, single site, pre-vSAN 6.6 (multicast)
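As an illustration, on a Cisco IOS-style switch, enabling IGMP snooping for the vSAN VLAN might look like the following sketch. VLAN 200 is used as an example, and the exact syntax varies by vendor, so consult your switch documentation:

```
! Enable IGMP snooping globally; it is then enabled per VLAN by default.
ip igmp snooping
ip igmp snooping vlan 200
! Verify snooping state and learned multicast groups for the vSAN VLAN:
show ip igmp snooping vlan 200
```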

Layer-2 implementations are simplified even further with vSAN 6.6, which introduces unicast support.
With such a deployment, IGMP snooping is not required.


Figure 69. L2 Topology, single site, vSAN 6.6 (unicast)

Layer-2, Single Site, Multiple Racks

Let’s now consider a layer-2 implementation, but where there are multiple racks, and multiple top-of-
rack switches or TORs connected to a core switch. We will begin with vSAN 6.5 (or earlier) which uses
multicast traffic.

In figures 70-72, the blue dotted line between the TORs is simply to show that the vSAN network is
available and accessible to all the hosts in the vSAN cluster. The TORs are not physically connected to
each other; hosts in different racks reach each other via the core switch, but remain on the same Layer
2 segment.

VMware recommends that all TORs are configured for IGMP snooping to prevent unnecessary
multicast traffic on the network. Since the traffic is not routed, there is no need to configure PIM to
route the multicast traffic.


Figure 70. L2 Topology, single site, multiple racks/TORs, pre-vSAN 6.6 (multicast)

Once again, this implementation is easier in vSAN 6.6 where vSAN traffic is unicast, and not multicast.
With unicast traffic, there is no need to configure IGMP snooping on the switches.

Figure 71. L2 Topology, single site, multiple racks/TORs, vSAN 6.6 (unicast)

Layer-3, Single Site, Multiple Racks

Let’s now consider vSAN deployments where L3 is implemented to route vSAN traffic.

This may be the simplest Layer-3 network topology when deploying vSAN. This is when you have
multiple racks in the same datacenter, each with their own TOR switch. It is then necessary to route the
vSAN network between the different racks over L3, to allow all the hosts in the vSAN cluster to
communicate.

This network topology is responsible for routing packets through intermediate Layer 3 capable devices
such as routers and Layer 3 capable switches. Whenever hosts are deployed across different Layer 3
network segments, the result is a routed network topology.

With vSAN 6.5 and earlier, VMware recommends the use and configuration of IGMP Snooping as
these deployments require multicast. PIM will also need to be configured/enabled on the physical
switches to facilitate the routing of the multicast traffic.

Figure 72. L3 Topology, single site, multiple racks/TORs, pre-vSAN 6.6 (multicast)

Once more, vSAN 6.6 simplifies such an implementation. Since there is no multicast traffic in vSAN
6.6, there is once again no need to configure IGMP snooping. Similarly, there is no need to configure
PIM to route multicast traffic.

Here is an overview of what a vSAN 6.6 deployment over L3 might look like. As you can see, there is no
requirement for IGMP snooping or PIM, as there is no multicast traffic to worry about:


Figure 73. L3 Topology, single site, multiple racks/TORs, vSAN 6.6 (unicast)

All of the configurations examined so far have been single-site deployments. In the next section, a
more complex deployment is examined, namely the vSAN stretched cluster. This is where vSAN is
deployed across multiple sites, providing highly available virtual machine workloads. Should one site
fail, the remaining site will run all of the virtual machine workloads.

5.3 Stretched Cluster Deployments

In vSAN 6.5 and earlier, vSAN traffic between data sites is multicast (for metadata and state) and
unicast (for I/O). In vSAN 6.6, all traffic is unicast. In all versions of vSAN, the witness traffic between a
data site and the witness site has always been unicast.

L2 everywhere

VMware strongly advises to not use a stretched L2 network across all sites.

Consider a design where the vSAN stretched cluster is configured in one large L2 design. Data Site 1
and 2 are where the virtual machines are deployed. Site 3 contains the witness host. To demonstrate
L2 everywhere as simply as possible, we use switches (and not routers) in the topologies.


Figure 74. Stretched Cluster Topology, L2 everywhere

L2 networks cannot have any loops (multiple paths), so features like Spanning Tree Protocol (STP) are
needed to block one of the paths between site 1 and site 2: either the direct S2->S3 link or the indirect
S2->S1->S3 path. Now consider a situation where the link between S2 and S3 (the link between site 1
and site 2) is broken. Spanning Tree would then unblock the previously blocked path, discovering that
a path between S2 and S3 exists via switch S1. Network traffic is now switched from site 1 to site 2 via
the witness site. Since VMware supports a much lower bandwidth and higher latency for the witness
host, customers will see a significant decrease in performance if data network traffic passes through a
lower-specification witness site.

However, if there are situations where switching traffic between data sites through the witness site
does not impact the latency of applications, and bandwidth is acceptable, a stretched L2 configuration
between sites may be supported. However, in most cases, VMware feels that such a configuration is
not feasible for the majority of customers and adds complexity to the networking requirements. We
strongly advise customers not to implement their stretched cluster in this way.

The other consideration with an “L2 everywhere” deployment is whether this is vSAN 6.5 or earlier, or
whether it is vSAN 6.6. With vSAN 6.5 or earlier, which uses multicast traffic, IGMP snooping would
once again need to be configured on the switches. This is not necessary with vSAN 6.6. PIM is not
necessary in this case as there is no routing of multicast traffic.

Supported (and recommended) Stretched Cluster Configurations

To avoid the situation outlined above, and to ensure that data traffic is not switched through the
witness site, VMware supports the following network topology:

• Between Site 1 and Site 2, VMware supports implementing a stretched L2 (switched)


configuration or a L3 (routed) configuration. Both configurations are supported.
• Between Site 1 and Witness Site, VMware supports an L3 (routed) configuration – route A in
diagram below.
• Between Site 2 and Witness Site, VMware supports implementing a L3 (routed)
configuration – route B in diagram below.

In the event of a failure on either data site's network, this configuration will prevent any traffic from
Site 1 being routed to Site 2 via the Witness Site, and thus avoid any performance degradation.


What follows are a series of diagrams highlighting each of these configurations. Both configurations
(L2+L3, and L3 everywhere) are shown with consideration given to multicast in vSAN 6.5 and earlier,
and to unicast only, which is available in vSAN 6.6. Once again, multicast traffic introduces additional
configuration steps around IGMP snooping, and PIM for routing multicast traffic. The first
configuration shows a stretched L2 network between the data sites and an L3, routed network to the
witness site. To demonstrate a combination of L2 and L3 as simply as possible, we use a combination
of switches and routers in the topologies.

Stretched L2 between data sites, L3 to witness

Figure 75. Stretched Cluster Topology, L2 between data sites, L3 to witness

The only traffic that is routed in this case is the witness traffic. With vSAN 6.5 and earlier, which uses
multicast, IGMP snooping should still be implemented for the multicast traffic on the stretched L2
vSAN between data sites. But since the witness traffic is unicast, there is no need to implement PIM on
the L3 segments.

With vSAN 6.6, there is no requirement to consider IGMP snooping or PIM since vSAN 6.6 uses
unicast.

L3 everywhere

This is a configuration of vSAN Stretched Cluster where the traffic is routed between the data site and
the witness site, but data is also routed between the data sites. To demonstrate that this is L3
everywhere as simply as possible, we use routers (and not switches) in the topologies.

In this first example, the environment is vSAN 6.5 or earlier, which uses multicast traffic. In this case,
we need to configure IGMP snooping on the data site switches to manage the amount of multicast
traffic on the network. This is unnecessary at the witness site since there should be no multicast traffic
sent to this site; witness traffic is unicast. And since this multicast traffic is being routed between the
data sites, PIM would also need to be configured to allow this multicast routing.
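On Cisco IOS-style Layer-3 switches, routing the multicast traffic with PIM sparse mode might look like the following sketch. The SVI names and RP address are examples only, and vendor syntax varies:

```
! Enable multicast routing, then PIM sparse mode on each vSAN-facing interface.
ip multicast-routing
interface Vlan200
 ip pim sparse-mode
interface Vlan201
 ip pim sparse-mode
! Sparse mode also needs a rendezvous point (RP), for example:
ip pim rp-address 192.168.1.1
```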


Figure 76. Stretched Cluster Topology, L3 everywhere

With vSAN 6.6, neither IGMP snooping nor PIM are needed since all the routed traffic is unicast.

Separating Witness Traffic on vSAN Stretched Clusters

A new feature of vSAN 6.5 was the ability to separate witness traffic from vSAN traffic in 2-node
configurations. This means that the two vSAN nodes could be connected back-to-back without
needing a 10Gb switch.

As of vSAN 6.7, separating the witness traffic on vSAN stretched clusters is also supported. In
addition, as of vSAN 6.7 U1, mixed MTU sizes for the witness and vSAN data traffic VMkernel port
types are supported. For more, see here.

Using Stretched Cluster to achieve Rack-Awareness

One interesting use-case with stretched cluster is the ability to provide rack awareness in a single site.
This is very useful for customers who may only have 2 racks of vSAN servers, but wish to be able to
run their vSAN deployment even after a complete rack failure. In this case, availability of the VM
workloads would be provided by the remaining rack and a remote witness appliance.

Note: For this configuration to be supported, the witness appliance must not be placed anywhere
within the 2 racks of vSAN servers.

Here is a sample topology of such a solution.


Figure 77. Using Stretched Cluster for rack awareness/fault domains (L2+L3)

In this example, if rack 1 failed, rack 2 and the witness (which is not located in either of the racks)
would provide VM availability. The above configuration is a pre-vSAN 6.6 environment, so it will still
need multicast configured on the network. The witness would also need to be on the vSAN network, as
splitting witness traffic is not supported in stretched cluster environments; the witness traffic itself
would be unicast, however. In vSAN 6.6, all traffic is unicast.

Such a topology would also be supported over L3, as shown below:

Figure 78. Using Stretched Cluster for rack awareness/fault domains (L3)


Once again, we are showing a pre-6.6 configuration, where multicast is required. Since this is routing
the vSAN connections, PIM is required to route the multicast traffic. If this configuration used vSAN
6.6, no multicast traffic would be present, so no IGMP or PIM configuration would be required.

This supported topology would allow customers with 2 racks to achieve rack awareness/fault domains
with vSAN stretched cluster. Note that this is utilizing a witness appliance. However, the enterprise
edition of vSAN would still be required for anything more than a 1+1+W stretched cluster topology.
This 1+1+W topology, which is available with standard licensing, is discussed in the 2-node topology
section next.

5.4 2 Node vSAN Deployments

2 Node vSAN deployments are very popular for remote offices/branch offices where there are only a
small number of workloads, but where availability is required. vSAN achieves 2 Node deployments
through the use of a third vSAN Witness Host, which can be located remotely from the branch office.
Very often the vSAN Witness is maintained back in the main HQ, along with the management
components, e.g. vCenter server.

2 Node vSAN deployments – pre vSAN 6.5

When 2 Node vSAN was originally launched, there was a requirement to include a physical 10Gb
switch at the remote site. If the only servers at this remote site were the vSAN nodes, then this was a
costly solution.

Figure 79. 2- Node configuration, prior to vSAN 6.5 (switch between data nodes needed)

With this deployment, if there are no other devices using the 10Gb switch, then no consideration
needs to be given to IGMP snooping. If other devices at the remote site share the 10Gb switch, then
once more IGMP snooping would be a good idea to prevent excessive and unnecessary multicast
traffic.

PIM would not be needed, as the only routed traffic is witness traffic, which is unicast.

2- Node vSAN deployments – vSAN 6.5 and later

With vSAN version 6.5 and later, this 2 Node vSAN implementation is much, much simpler. vSAN 6.5
(and later) allows the 2 Nodes at the data site to be directly connected. To enable this functionality, the
witness traffic is separated completely from the vSAN data traffic. The vSAN data traffic can now flow
between the two hosts on the direct connect, whilst the witness traffic can be routed to the witness
site over the management network.

Figure 80. 2 Node configuration, vSAN 6.5 and later (back-to-back supported)

The vSAN Witness Host could be located in multiple places, remote from the branch office. For
example, the vSAN Witness Host could be running back in the main DC, alongside the management
infrastructure (vCenter Server, vROps, Log Insight, etc.). Another supported place where the witness
can reside remotely from the branch office is in vCloud Air or other VCAN providers.

In this configuration, there is no switch at the remote site, so there is no need to worry about multicast
traffic on the vSAN back-to-back network. There is also no need to consider multicast on the
management network, as the witness traffic is all unicast. If this implementation uses vSAN 6.6, there
are no multicast considerations anyway. And of course, multiple remote office/branch office 2 Node
deployments are supported, so long as each has its own unique witness.
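Witness traffic separation is enabled per data host by tagging a VMkernel interface for witness traffic with esxcli. A sketch, assuming vmk0 is the routable management vmknic (the interface name is an example):

```shell
# Tag vmk0 (example) to carry witness traffic; vSAN data traffic remains on
# the directly connected vSAN vmknic.
esxcli vsan network ip add -i vmk0 -T=witness
# Confirm which traffic type each vmknic carries:
esxcli vsan network list
```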


Figure 81. Multiple 2 Node configuration, vSAN 6.6 (back-to-back supported, no multicast)

5.5 2 Node vSAN Deployments – Common Config Questions

In this section, other topologies relating to 2 Node vSAN deployments are discussed. This section will
address some of the common configurations seen by the authors from both our customers, and the
vSAN field personnel.

For standard deployment considerations, please refer to the following official documentation:

• Reference Architecture for VMware vSAN 6.2 for Remote Office and Branch Office
• VMware vSAN 6.2 Stretched Cluster & 2 Node Guide

For further information on 2 Node configurations and detailed deployment considerations outside of
networking, please refer to the official documentation.

Running witness on another 2 Node cluster, and vice-versa

Figure 82. 2 Node configuration, cross witness hosting (not recommended – requires RPQ)

This configuration is not recommended. There are several constraints with this deployment, and the
customer needs to fully understand and agree with those for VMware to approve the RPQ. The RPQ
process requires VMware vSAN engineering to analyze the customer deployment, which can be very
time-consuming; this is another reason the solution is not recommended.

Please note that the RPQ support is ONLY for the 2 Node use case. Running the vSAN Witness
Appliance on another stretched cluster, and vice-versa, is not supported. Contact your VMware
representative for more information.


vSAN Witness Appliance running on another standard vSAN deployment

Figure 83. 2 Node configuration, vSAN Witness hosted on another vSAN (supported)

This configuration is supported. Again, referring to the statement that “We support any vSphere to
run the vSAN Witness that has independent failure properties”, this configuration is fine. Any failure on
the 2 Node vSAN at the remote site does not impact the availability of the standard vSAN
environment at the main DC.

vSAN Witness running in vCloud Air

Figure 84. 2 Node configuration, witness hosted on another vCloud Air (supported)

This configuration is also supported. Refer to the official documentation on running a witness in
vCloud Air: Running VMware vSAN Witness Appliance in VMware vCloud Air.

Multiple witness sharing the same VLAN


For customers who have implemented multiple remote-office/branch office 2 Node vSAN
deployments, a common question is whether the Witness traffic from each of the remote sites requires
its own VLAN. The answer is no. Multiple remote-office/branch office 2 Node vSAN deployments can
send their witness traffic on the same shared VLAN. Here is a simplified diagram to highlight how this
might look:


Figure 84a. 2 Node configurations, witness traffic all on the same shared VLAN

Tagging Witness and Management traffic on the same interface

Another common question relates to the fact that some customers may have only a single routed
subnet to the remote sites. The question is whether the management traffic and the witness traffic can
be tagged on the same VMkernel interface and sent over that single routed VLAN between the remote
site and the main DC at HQ. The answer is yes, we support such a configuration. A single VMkernel
interface on both the physical nodes at the remote site and the witness appliance at the main DC can
be tagged for both traffic types. Simplifying the configuration once more, this might look something
like this:

Figure 84b. 2 Node configurations, witness traffic and management traffic on same interface


This would also make the configuration of the witness appliance simpler, as a single interface on the
appliance can be tagged for management and vSAN traffic, and the spare VMkernel interface on the
appliance can be removed.

5.6 Config of network from data sites to witness host

The next obvious question is how to implement such a configuration? How can the interfaces on the
hosts in the data sites, which communicate to each other over the vSAN network, communicate to the
witness host?

Option 1: Physical ESXi witness connected over L3 with static routes

In this first configuration, the data sites are connected over a stretched L2 network. This is true for the
data sites’ management network, vSAN network, vMotion network and virtual machine network. The
physical network router in this infrastructure does not automatically route traffic from the hosts in the
data sites (sites 1 and 2) to the host in the witness site (site 3). In order for the vSAN stretched cluster
to be successfully configured, all hosts in the cluster must be able to communicate. The question
therefore arises: how can we deploy a stretched cluster in this environment?

The solution is to use static routes configured on the ESXi hosts so that the vSAN traffic from sites 1
& 2 is able to reach the witness host in site 3, and vice versa. On the ESXi hosts at the data sites, a
static route must be added for the vSAN interface which redirects traffic to the witness host on site 3
via a specified gateway for that network. On the witness host, the vSAN interface must have a static
route added which redirects vSAN traffic destined for the data sites' hosts. Static routes are added
using the esxcli network ip route command on the ESXi hosts, and examples can be found elsewhere
in this guide. This setup will have to be repeated on all ESXi hosts in the stretched cluster.

Note that we have not mentioned the ESXi management network here. The vCenter server must be
able to manage the ESXi hosts at both the data sites and the witness. As long as there is direct
connectivity from the witness host to vCenter, there should be no additional concerns regarding the
management network.

Also note that there is no need to configure a vMotion network or a VM network, or to add any static
routes for these networks, in the context of a vSAN stretched cluster. This is because there will never
be a migration or deployment of virtual machines to the vSAN witness. Its purpose is to maintain
witness objects only, and it does not require either of these networks for this task.

Option 2: Virtual ESXi witness (appliance) connected over L3 with static routes

Since the witness is a virtual machine that will be deployed on a physical ESXi host (which is not part of
the vSAN cluster), that physical ESXi host will need to have a minimum of one VM network pre-
configured. This VM network will need to reach both the management network and the vSAN network
shared by the ESXi hosts on the data sites.

Note : This does not need to be a dedicated host. It can be used for many other VM workloads, whilst
simultaneously hosting the witness. Many customers choose their management infrastructure for
hosting the witness appliance, e.g. the host(s) or cluster where vCenter Server, vRealize Operations,
Log Insight, etc. are running.

An alternative option that might be simpler to implement is to have two preconfigured VM networks
on the underlying physical ESXi host, one for the management network and one for the vSAN network.
When the virtual ESXi witness is deployed on this physical ESXi host, the network will need to be
attached/configured accordingly.


Figure 87. Witness network configuration

Once the virtual ESXi witness has been successfully deployed, the static routes must be configured.
As before, assume that the data sites are connected over a stretched L2 network. This is also true for
the data sites' management network, vSAN network, vMotion network and virtual machine network.
Once again, vSAN traffic is not routed from the hosts in the data sites (1 and 2) to the host in the
witness site (site 3) via the default gateway. In order for the vSAN stretched cluster to be successfully
configured, all hosts in the cluster require static routes so that the vSAN traffic from sites 1 & 2 is
able to reach the witness host in site 3, and vice versa. Once again, the static routes are added using
the esxcli network ip route command on the ESXi hosts. This will have to be repeated on all ESXi hosts
in the cluster, both on the data sites and on the witness host.
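Once the routes are in place on every host, connectivity can be sanity-checked from each data-site host with vmkping over the vSAN VMkernel interface (the interface name and address below are examples):

```shell
# Ping the witness host's vSAN IP using the vSAN vmknic as the source interface:
vmkping -I vmk1 192.168.110.10

# Review the routing table to confirm the static route took effect:
esxcli network ip route ipv4 list
```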

5.7 Corner Case deployments

Some customers have deployed vSAN in what might be considered unusual, or corner-case,
configurations. This section describes these other topologies and highlights considerations for each
deployment.

3 locations, no stretched cluster, distributed witnesses

A few customers have requested that they be able to deploy vSAN across multiple rooms, buildings or
sites, rather than deploy a stretched cluster configuration. This is supported under our Standard
License agreement. The one requirement is that the latency between the sites be at the same level as
the latency expected for a normal vSAN deployment in the same data center. This is expected to be
<1ms between all hosts. Therefore, if you plan to roll out vSAN in this way, you must ensure that the
latency between the different rooms, buildings or sites where vSAN nodes reside is less than 1ms. If
latency is greater than this value, then you should consider a stretched cluster, which tolerates latency
of up to 5ms but requires an Enterprise-level license. The configuration below is shown for vSAN 6.6.
With vSAN 6.5 or earlier, additional considerations for multicast would have to be addressed.
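A simple way to spot-check inter-site latency is to ping between vSAN VMkernel interfaces and inspect the round-trip times (vmk1 and the address below are examples); the averages should remain comfortably below 1ms:

```shell
# Send 100 pings from this host's vSAN vmknic to a node in another room/site
# and review the min/avg/max round-trip times in the summary:
vmkping -I vmk1 -c 100 172.16.10.12
```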


Figure 85. 3 sites, no stretched cluster, distributed witnesses

VMware strongly recommends maintaining a uniform configuration across all sites in such a topology.
And of course, to maintain availability of VMs, fault domains should be configured, where the hosts in
each room, building or site are placed in the same fault domain. Utmost care must be taken with this
deployment to avoid asymmetric partitioning of the cluster, e.g. node A cannot communicate with
node B, but node B can communicate with node A. This is not something that vSAN handles elegantly
today, so if you decide to deploy this topology, make sure that the network topology across rooms,
buildings or sites cannot lead to such a situation.

2-node deployed as 1+1+W Stretched Cluster

The vSAN Standard Licensing allows a 2-node configuration with a witness to be deployed as a
stretched cluster configuration, placing each host in different rooms, buildings or sites.


Figure 86. 1+1+W stretched cluster configuration with standard licenses

Any attempt to increase the number of nodes at each site beyond one will fail with a licensing error.
For any cluster larger than 2 nodes that uses the dedicated witness appliance/node feature
(N+N+Witness, where N>1), vSAN Enterprise licensing is required, since such a configuration is
considered a vSAN Stretched Cluster.

4 sites, 2 stretched clusters, each stretched cluster supporting the other cluster's witness appliance
This configuration has only recently become supported. It is a configuration where there are 4 distinct
sites: 2 of the sites are used for deploying vSAN stretched cluster A, and 2 of the sites are used for
deploying vSAN stretched cluster B. After testing the scenario internally, VMware can now support
hosting each witness on the other cluster, i.e. stretched cluster A can host witness appliance B and
stretched cluster B can host witness appliance A.


Note the requirement for 4 distinct sites. This configuration cannot be supported with only 2 sites, as
a site failure would result in a complete failure of both stretched clusters. With 4 sites, a single site
failure will only impact a single vSAN stretched cluster. Further detail may be found in this blog post.


6. vSAN iSCSI Target - VIT




6.1 vSAN iSCSI Target - VIT

vSAN 6.5 extends workload support to physical servers with the introduction of an iSCSI target service.
vSAN allows physical servers to access vSAN storage using the iSCSI protocol. iSCSI targets on vSAN
are managed the same as other objects with Storage Policy Based Management (SPBM). This implies
that the iSCSI LUNs can be configured as RAID-1, RAID-5 or RAID-6, and have other characteristics,
such as IOPS limits. vSAN functionality such as deduplication, compression and encryption (available
in vSAN 6.6) can also be utilized with the iSCSI target service to provide space savings and security for
the iSCSI LUNs. In addition, CHAP and Mutual CHAP authentication are also supported for greater
security.

vSAN 6.7 introduces support for WSFC (Windows Server Failover Clustering) in the iSCSI target
service, allowing users to migrate their existing legacy apps to vSAN without needing RDMs. More
information on WSFC support can be found here.

6.2 VIT Internals

The “Enabling iSCSI service” procedure creates a “namespace” object which is used as an iSCSI
management object. This is similar to a VM Home namespace object, but it has no relationship to a
virtual machine. This iSCSI namespace object, called /vmfs/volumes/vsanDatastore/.iSCSI-CONFIG,
stores the iSCSI configuration for this vSAN cluster. A separate directory is also created and
maintained in the home object for each target and its own chain of LUNs (VMDK objects). Residing in
this home namespace is an etc folder, which contains the vit.conf file, and another folder called
targets. As new iSCSI targets are created, a new target namespace object is created using a unique
UUID, and a symbolic link to the target namespace object is added in the targets folder. In each of
the UUID sub-folders in targets, you will find the list of iSCSI LUNs (VMDK objects) associated with
each target. It can be visualized as follows:

Figure 88. vSAN iSCSI internals

The vit.conf configuration file is accessible by all hosts in the cluster, and is used to determine which
LUNs are associated with which targets, for example.

Enabling the iSCSI service also starts the appropriate iSCSI services on the ESXi hosts that are part of
the vSAN cluster. Many of the control operations are handled by the vitd daemon, which stands for
vSAN iSCSI Target (VIT) daemon. vitd replies to the iSCSI discovery protocol and listens for
connections. It also handles CHAP and connection establishment. Management operations, such as
creating targets and LUNs, are handled by vsanmgmtd or hostd, depending on whether the
management operation comes from the vSphere UI or esxcli.
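For the curious, this layout can be inspected from an ESXi shell on any host in the cluster; a session might look like the following (the datastore name shown is the default, and yours may differ):

```shell
# The hidden iSCSI management namespace lives in the vSAN datastore root:
ls /vmfs/volumes/vsanDatastore/.iSCSI-CONFIG/
# expect: etc/  targets/

# vit.conf holds the shared iSCSI configuration:
cat /vmfs/volumes/vsanDatastore/.iSCSI-CONFIG/etc/vit.conf

# targets/ contains one symbolic link per target namespace object (UUID-named):
ls -l /vmfs/volumes/vsanDatastore/.iSCSI-CONFIG/targets/
```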


During the initial discovery process, all hosts present all targets to the initiator, which allows the
initiator to connect to any host in the cluster.

On accepting a connection, vitd works with another daemon, vitsafehd, to open and close VMDK files.
These VMDK files, in the case of the iSCSI service on vSAN, are the iSCSI LUNs accessed by external
initiators.

6.3 iSCSI Setup Steps

Enable the iSCSI service

By default, the iSCSI service is disabled and needs to be manually enabled by the administrator. Every
host in the cluster has to participate in the iSCSI target service. All hosts in the cluster also need to
use the same VMkernel adapter. VMware recommends creating a new VMkernel adapter for the vSAN
iSCSI traffic. The VMkernel adapter must have the same interface number on all hosts in the cluster.

When creating the iSCSI target on vSAN, the configuration wizard only displays VMkernel interfaces
that are available on all hosts in the cluster. At this point administrators should choose the correct
adapter from the dropdown list of VMkernel adapters.

Figure 89. Enabling the vSAN iSCSI service

At this point, an optional authentication method can be chosen. CHAP (Challenge Handshake
Authentication Protocol) verifies identity using a hashed transmission. A user and a secret key must be
added to the initiator and target. With CHAP, the target initiates the challenge. However, with Mutual
CHAP, both the initiator and target initiate challenges.

All iSCSI initiators and targets must support CHAP, but use is optional. If the iSCSI network is on its
own isolated VLAN, then CHAP is not necessary. If however, the iSCSI network is on a segment with
other traffic types and users, CHAP might be a useful option to authenticate.
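In addition to the UI, the state of the service can be inspected from the command line on any host. The exact sub-commands vary between releases, so treat the following as a sketch and verify against `esxcli vsan iscsi` on your build:

```shell
# Check whether the vSAN iSCSI target service is enabled:
esxcli vsan iscsi status get

# List the targets currently defined in the cluster:
esxcli vsan iscsi target list
```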

Create a target

The next step is to create an iSCSI target. This creates an IQN (iSCSI Qualified Name) so that initiators
(e.g. physical hosts that wish to consume iSCSI storage) can communicate with the target. The IQN is
generated automatically when the target is created.
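The generated name follows the standard IQN layout, iqn.&lt;yyyy-mm&gt;.&lt;reversed-domain&gt;[:&lt;unique-suffix&gt;]. As a quick illustration (the sample name below is made up, not one VIT actually generated), the basic shape can be checked with grep:

```shell
# Minimal sanity check of an IQN's general layout using an extended regex:
is_valid_iqn() {
  printf '%s\n' "$1" | grep -Eq '^iqn\.[0-9]{4}-[0-9]{2}\.[a-z0-9.-]+(:.+)?$'
}

# A made-up, well-formed example:
is_valid_iqn "iqn.1998-01.com.vmware:vsan-target01" && echo "well-formed"
```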


Figure 90. Create a vSAN iSCSI target

During target creation, there is also an option to create a LUN. In this example, we are not creating a
LUN. The steps to do that will be shown next.

Note: Multiple targets per cluster are supported.

Create LUNs

Now that the service has been enabled and the target is configured, we can begin to build LUNs that
can be consumed by initiators. LUNs are built on a per target basis, i.e. multiple targets with multiple
LUNs may exist on one vSAN cluster. LUN IDs between 0 and 255 are supported, indicating that we
support 256 LUNs per target.

Under the covers, a LUN is essentially a VMDK object on the vSAN datastore. These LUNs may have
their own individual policy, such as RAID-1, RAID-5 or RAID-6. They can also have an IOPS limit to
avoid “noisy-neighbor” type situations. Lastly, if deduplication, compression or encryption are enabled
on the cluster, then the LUNs will also benefit from these features.

The only details that need to be provided are the LUN ID, a human readable alias and the size of the
LUN. Below, a LUN of 100GB and a LUN ID of 0 is being created:


Figure 91. Create a vSAN iSCSI LUN

Note: There is no way to snapshot the LUN via the UI.

iSCSI initiators

The final step is to map this target to an initiator, e.g. an external physical host. An initiator group
called ANY_INITIATOR is present by default, and it allows connections from any initiator.

If you do not want this behavior, i.e. any host being able to connect to the target, then you should
build your own initiator groups and add only those initiators that should have access to that group.

VMware strongly recommends creating bespoke initiator groups, as it will avoid the incorrect
initiators from accessing the incorrect iSCSI targets/LUNs.

In the vSAN 6.5 release of VIT, there is no limit on the number of initiators in an initiator group.
Initiators can connect to any host in the cluster. In vSAN 6.6, limit checks for the number of initiators
and initiator groups were added:

• Maximum number of initiator groups allowed per cluster: 256
• Maximum number of initiators allowed per cluster: 8192
• Maximum number of iSCSI sessions per cluster: 1024
• Maximum number of iSCSI sessions per host: 128

To create new initiator groups, navigate to iSCSI Initiator Groups and click on the + sign to create a
new one. Note that initially the only initiator group present is ANY_INITIATOR.


Figure 92. Create a vSAN iSCSI Initiator Group

The IQNs of the initiators, e.g. the physical hosts' iSCSI adapters, should be added to the group, and
then the target that you wish these initiators to access should also be added. On Windows, the IQN
can be found in the iSCSI Initiator Properties, under the Configuration tab.

iSCSI Health Checks

There are several health checks built in to check the integrity of the iSCSI service and target. The first
screenshot verifies the integrity of the iSCSI home object, which stores all aspects of the iSCSI
configuration.

Figure 93. iSCSI health checks on vSAN

This second screenshot verifies the iSCSI service runtime status and ensures that the necessary
daemons, vitd and vitsafehd are running on the hosts that are participating in the vSAN cluster.


Figure 94. “Service runtime status” iSCSI health checks on vSAN

6.4 MPIO considerations

Since an initiator can connect to an iSCSI target via any of the vSAN host interfaces, one should
consider redundancy when connecting to the target. On a Windows host, MPIO can be used. This
allows you to add multiple vSAN interfaces/discovery addresses to the initiator, and if one fails to
connect, MPIO can try another. Microsoft describes how to configure MPIO in the following article:

https://technet.microsoft.com/en-us/library/ee619752(v=ws.10).aspx

Here is a simple 3 node example. All 3 IP addresses of the VMkernel ports where iSCSI is configured
have been added to the MPIO settings of the initiator. On the first connection attempt, the connection
is made to the first host (we will ignore the fact that this connection might be redirected to the host
where the target owner resides for the moment).


Figure 95. MPIO and iSCSI on vSAN

Now let’s examine the scenario where the first host in the cluster has an issue, and can no longer
accept iSCSI connections. This is where MPIO on the initiator can be very useful, as the connection can
be made to one of the other vSAN hosts in the list.

Figure 96. MPIO failover to alternate iSCSI IP address on vSAN

Now the connection to the vSAN iSCSI targets can be made through the second host. Once more, if
the owner of the iSCSI target is not on the second host, then the connection is automatically
redirected to the other host(s) in the cluster.

Of course, any target owners that were on the first host will now be moved to the remaining hosts in
the cluster as well.

There is no auto-discovery mechanism when using MPIO with vSAN iSCSI, so for a vSAN cluster,
multiple vSAN iSCSI IP addresses from some of the vSAN nodes must be manually added to the
initiator configuration. It is not necessary to configure all the IP addresses in the cluster. VMware
suggests selecting 2 or 3 hosts from the cluster (or perhaps a few more if there are 64 nodes in the
cluster). If FTT=1, administrators may configure two portal IP addresses. If FTT=2, administrators may
configure three portal IP addresses. There is no value in configuring all of the IP addresses from a
scalability perspective, but there is value in using multiple IP addresses from an availability
perspective.

In fact, there may even be some disadvantages to configuring too many portal IP addresses. Each
portal results in a new iSCSI session to the target owner, and there is a limit on the number of iSCSI
sessions per host: the hard limit is 1024 sessions, while the publicly supported limit is only 128
sessions per host.

Administrators should try to balance the targets across hosts, for example directing targets 1 and 2
to host X, targets 3 and 4 to host Y, and targets 5 and 6 to host Z.


Recommended MPIO Settings for Windows initiator hosts

The following are the MPIO settings recommended by VMware for Windows initiator hosts that wish to
consume iSCSI LUNs from vSAN.

1. Configure Retry Count and Timers

C:\> Set-MPIOSetting -CustomPathRecovery Enabled -NewPathRecoveryInterval 20 `
     -NewRetryCount 60 -NewPDORemovePeriod 60 -NewPathVerificationPeriod 30

C:\> Get-MPIOSetting
PathVerificationState : Disabled
PathVerificationPeriod : 30
PDORemovePeriod : 60
RetryCount : 60
RetryInterval : 1
UseCustomPathRecoveryTime : Enabled
CustomPathRecoveryTime : 20
DiskTimeoutValue : 60

2. Failover settings for vSAN iSCSI Target

Since VIT only supports one active I/O path per target/lun connection, i.e. active-passive,
VMware recommends configuring the load balance policy as failover only. If you are running
clustering software on the initiator which leverages persistent reservations, the load balance
policy must be set to failover only.

C:\> mpclaim -l -t "VMware vSAN " 1

C:\> mpclaim.exe -s -t
"Target H/W Identifier " LB Policy
-------------------------------------------------------------------------------
"VMware vSAN " FOO

VMware has validated these settings against Windows 2012R2 and Windows 2016. Additional
information can be found in the Microsoft Multipath I/O Step-by-Step Guide on the Microsoft
Technet site.

Recommended Multipath Settings for Linux initiator hosts

The following are the multipath settings recommended by VMware for Linux initiator hosts that wish
to consume iSCSI LUNs from vSAN, but also fail over to an alternate path in the event of a failure.

1. Set path policy to failover

The failover settings are found in /etc/multipath.conf

defaults {
    user_friendly_names yes
}
devices {
    device {
        vendor               "VMware "
        product              "vSAN "
        path_grouping_policy failover
        no_path_retry        100
    }
}

2. Change replacement_time to 15 seconds


VMware recommends changing the replacement_timeout to 15 seconds in /etc/iscsi/iscsid.conf:
# To specify the length of time to wait for session re-establishment
# before failing SCSI commands back to the application when running
# the Linux SCSI Layer error handler, edit the line.
# The value is in seconds and the default is 120 seconds.
# Special values:
# - If the value is 0, IO will be failed immediately.
# - If the value is less than 0, IO will remain queued until the session
# is logged back in, or until the user runs the logout command.
node.session.timeo.replacement_timeout = 15

3. Verify the settings

If there are 2 paths to the iSCSI target on vSAN, the multipath output should look similar to the
following:

# multipath -l
mpathe (1VMware_VITDEVID679bcb58351873774a7702003056e5fe) dm-4 VMware ,vSAN
size=954M features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=0 status=active
| `- 112:0:0:0 sde 8:64 active undef running
`-+- policy='service-time 0' prio=0 status=enabled
`- 110:0:0:0 sdc 8:32 active undef running

VMware has validated these settings against RedHat 6 & 7, Suse 11SP3 & 13, and Oracle Linux.
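For completeness, here is how a Linux initiator would typically discover and log in to the vSAN targets once the settings above are in place (the portal address is an example; add two or three vSAN host addresses, as discussed in the MPIO section):

```shell
# Discover targets through one of the vSAN hosts' iSCSI VMkernel addresses:
iscsiadm -m discovery -t sendtargets -p 172.16.10.11:3260

# Log in to all discovered portals; dm-multipath then groups the paths:
iscsiadm -m node --login
```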

6.5 iSCSI on vSAN Limitations and Considerations

The following is a list of limitations and considerations when using iSCSI on vSAN.

Use cases

There are two supported use cases: iSCSI LUNs can be presented to a physical host or iSCSI LUNs can
be presented to guest VMs that participate in a WSFC cluster to allow for shared disk application
architectures.

One common use case for physical environments is the presentation of a vSAN VMDK as an iSCSI
LUN to physical Oracle RAC, as per VMware KB 2148216. Please note that virtual Oracle RAC is also
supported on vSAN natively through the use of the multi-writer flag.

Failover

As mentioned, any initiator can connect to any host in the vSAN cluster to access the iSCSI target.
However, each target has an owner in vSAN. When the initiator requests access to a target, the vSAN
host it connects to informs it of where the target owner resides. The initiator then makes a connection
to the appropriate owner/target.


Figure 97. Auto-redirect to DOM owner/iSCSI target owner on failover

As shown previously, if the host on which the owner resides has a failure, the initiator can use MPIO on
Windows or multipathing on Linux to connect to any of the remaining vSAN hosts, and request access
to the target. At this point, the owner will have been moved to a new host, and this is where the
initiator is directed to connect to the target.

Load balancing

At this time, there is no load-balancing available in the vSAN iSCSI service. Connections to the service
are done in an active-passive configuration, where only a single connection is active, and if there is a
failure, the connection is moved to another host.

Another concern is that there is no load-balancing mechanism specific to iSCSI at the back end, so
multiple target owners may end up on one host while other hosts in the cluster have none. This may
be a concern as new hosts are added to the vSAN cluster. There is little that can be done at this point
in time to address this.

iSCSI Home Namespace object not removed when iSCSI disabled

Note that the iSCSI Home Namespace object is not removed if you decide to disable iSCSI on vSAN.
This object will remain behind, and will be re-used if/when iSCSI is used once more sometime in the
future. The vSAN iSCSI Target Service can be disabled from the General tab.

Maintenance Mode

Note that when iSCSI is enabled, placing a host into maintenance mode does not redirect the iSCSI
connections to different hosts. The DOM owner/iSCSI target owner remains on the host and continues
to serve connections. If the host is subsequently rebooted, then the DOM owner/iSCSI target owner is
restarted elsewhere in the cluster.


If you need to stop iSCSI connections to a host that has been placed into maintenance mode, one
solution is to partition the node from the vSAN cluster by disabling the vSAN service on that host.
This moves the DOM owner/iSCSI target owner to another host in the cluster. This is a known
limitation, and a solution will appear in a future vSAN release.

As mentioned in the load-balancing section previously, after DOM owners/iSCSI target owners have
been moved from a host, there is no mechanism to rebalance DOM owners/iSCSI target owners when
the rebooted/restored host rejoins the cluster.

Stretched Cluster

Please note that vSAN iSCSI Target is unsupported on vSAN Stretched Clusters.

iSCSI routing

VMware fully supports initiators making routed connections to vSAN iSCSI targets over an L3 network.

IPv6

VMware fully supports IPv6 on the vSAN iSCSI network.

IPSec

VMware fully supports IPSec on the vSAN iSCSI network for increased security. Note that this
statement is qualified by the fact that there is no data available at this time to verify what impact
IPSec has on vSAN iSCSI performance. Note also that ESXi hosts support IPsec using IPv6 only, as per
VMware KB 1021769.

Jumbo Frames

VMware fully supports Jumbo Frames on the vSAN iSCSI network.

NIC Teaming

VMware fully supports all NIC teaming configurations on the vSAN iSCSI network. Please note that the
vSAN iSCSI implementation does not support multiple connections per session (MCS) and is unlikely to
see throughput benefits from LACP/LAG configurations.


7. Appendix A
Migrating from standard to distributed vSwitch


7.1 Migrating from standard to distributed vSwitch

Warning: Please ensure that you have console access to the ESXi hosts during this exercise. If all goes
well, you will not need it; however, should something go wrong, you may need to access the console
of the ESXi hosts.

The main reason for migrating from VSS (standard vSwitches) to a vDS (Distributed Switch) is to make
use of the Network I/O Control feature, which is only available with a vDS. This then allows you to
apply QoS (Quality of Service) to the various traffic types, such as vSAN traffic.

Administrators should make a note of the original VSS setup. In particular, the load-balancing and NIC
teaming settings should be noted on the source, and the destination should be configured the same
way; otherwise the migration will not be successful.
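One way to record the existing VSS settings is from the command line on each host; for example, assuming the default vSwitch0:

```shell
# Capture the current teaming/failover policy of the standard vSwitch:
esxcli network vswitch standard policy failover get -v vSwitch0

# List the port groups and their VLANs so the vDS port groups can mirror them:
esxcli network vswitch standard portgroup list
```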

7.2 Step A.1 Create Distributed Switch

To begin with, create the distributed switch. This is a relatively straightforward exercise.

Figure A.1: Create a new distributed switch

Provide it with a name:


Figure A.2: Provide a name for the new distributed switch

Select the version of the vDS. In this example, version 6.0.0 is used for the migration:

Figure A.3: Select the distributed switch version

At this point, we get to add the settings. First, you will need to determine how many uplinks you are
currently using for networking. In this example, there are six; one for management, one for vMotion,
one for virtual machines and three for vSAN (a LAG configuration). Therefore, when prompted for the
number of uplinks, “6” is selected. This may differ in your environment but you can always edit it later
on.


Figure A.4: Select the number of uplinks

Another point to note here is that a default port group can be created. You can certainly create a port
group at this point, but there will be additional port groups that need to be created shortly. At this
point, the distributed switch can be completed:

Figure A.5: Complete the creation of the DVS

As alluded to earlier, let's now configure and create the additional port groups.

7.3 Step A.2 Create port groups

In the previous exercise, a single default port group was created for the management network. There
was little in the way of configuration that could be done at that time. It is now important to edit this
port group to make sure it has all the characteristics of the management port group on the VSS, such
as VLAN and NIC teaming and failover settings. Select the distributed port group, and click on the Edit
button shown below:


Figure A.6: Edit the distributed port group

For some port groups it may be necessary to change the VLAN. Since the management VLAN in this
POC is on 51, we need to tag the distributed port group accordingly:

Figure A.7: Tag the distributed port group with a VLAN

That is the management distributed port group taken care of. You will also need to create distributed
port groups for vMotion, virtual machine networking and of course vSAN networking. In the “Getting
Started” tab of the distributed switch, there is a basic task link called “Create a new port group”:


Figure A.8: Create a new distributed port group

In this exercise, we shall create a port group for the vMotion network.


Figure A.9: Provide a name for the new distributed port group

Figure A.10: Configure distributed port group settings, such as VLAN


Figure A.11: Finish creating the new distributed port group

Once all the distributed port groups are created on the distributed switch, the uplinks, VMkernel
networking and virtual machine networking can be migrated to the distributed switch and associated
distributed port groups.

Warning: While the migration wizard allows many uplinks and many networks to be migrated
concurrently, we recommend migrating the uplinks and networks step-by-step to proceed smoothly
and with caution. For that reason, this is the approach we use here.

7.4 Step A.3 Migrate Management Network

To begin, let's migrate just the management network (vmk0) and its associated uplink, in this case
vmnic0, from the VSS to the vDS. Select "Add and manage hosts" from the basic tasks in the Getting
started tab of the vDS:


Figure A.12: Add and manage hosts

The first step is to add hosts to the vDS.

Figure A.13: Add hosts

Click on the green + and add all four hosts from the cluster:


Figure A.14: Select all hosts in the cluster

The next step is to manage both the physical adapters and VMkernel adapters. To repeat, what we
wish to do here is migrate both vmnic0 and vmk0 to the vDS.

Figure A.15: Select physical adapters and VMkernel adapters


Next, select an appropriate uplink on the vDS for physical adapter vmnic0. In this example we chose
Uplink1.

Figure A.16: Assign uplink (uplink1) to physical adapter vmnic0

With the physical adapter selected and an uplink chosen, the next step is to migrate the management
network on vmk0 from the VSS to the vDS. We are going to leave vmk1 and vmk2 for the moment and
just migrate vmk0. Select vmk0, and then click on the “Assign port group” as shown below. The port
group assigned should be the newly created distributed port group created for the management
network earlier. Remember to do this for each host:


Figure A.17: Assign port group for vmk0

Click through the analyze impact screen; it only checks the iSCSI network, and is not relevant to vSAN.

Figure A.18: Impact on iSCSI

At the finish screen, you can examine the changes. We are adding 4 hosts, 4 uplinks (vmnic0 from
each host) and 4 VMkernel adapters (vmk0 from each host):


Figure A.19: Ready to complete

When the networking configuration of each host is now examined, you should observe the new DVS,
with one uplink (vmnic0) and the vmk0 management port on each host:

Figure A.20: Management network migration to DVS complete

You will now need to repeat this for the other networks.

7.5 Step A.4 Migrate vMotion

Migrating the vMotion network takes the exact same steps as the management network. Before you
begin, ensure that the distributed port group for the vMotion network has all the same attributes as
the port group on the standard (VSS) switch. Then it is just a matter of migrating the uplink used for
vMotion (in this case vmnic1) along with the VMkernel adapter (vmk1). As mentioned already, this
takes the same steps as the management network.

When the migration completes, the individual host network configuration should look similar to the
following:

Figure A.21: vMotion network migration to DVS complete

7.6 Step A.5 Migrate vSAN Network

If you are using a single uplink for the vSAN network, then the process becomes the same as before.
However, if you are using more than one uplink, then there are additional steps to be taken. If the
vSAN network is using a feature such as Link Aggregation (LACP), or it is on a different VLAN to the
other VMkernel networks, then you will need to place some of the uplinks into an unused state for
certain VMkernel adapters.

For example, in this scenario, VMkernel adapter vmk2 is used for vSAN. However, uplinks vmnic3, 4
and 5 are used for vSAN and they are in a LACP configuration. Therefore, for vmk2, all other vmnics (0,
1 and 2) must be placed in an unused state. Similarly, for the management adapter (vmk0) and
vMotion adapter (vmk1), the vSAN uplinks/vmnics should be placed in an unused state.

This is done by modifying the settings of the distributed port group and changing the teaming and
failover policy appropriately. In the manage physical network adapters step, the steps are similar to
before, except that you are now doing this for multiple adapters.
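The uplink rule described in this scenario can be sketched as a small helper that, given the uplinks dedicated to each VMkernel adapter, derives which uplinks must be marked unused in that adapter’s distributed port group. This is an illustration only; the adapter-to-uplink mapping below is the hypothetical example from this scenario, not output queried from any host:

```python
# Uplinks dedicated to each VMkernel adapter in this scenario (hypothetical
# mapping taken from the example in the text, not queried from any host).
ALL_UPLINKS = ["vmnic0", "vmnic1", "vmnic2", "vmnic3", "vmnic4", "vmnic5"]
DEDICATED = {
    "vmk0": {"vmnic0"},                      # management
    "vmk1": {"vmnic1"},                      # vMotion
    "vmk2": {"vmnic3", "vmnic4", "vmnic5"},  # vSAN (LACP)
}

def unused_uplinks(vmknic):
    """Every uplink not dedicated to this adapter must be marked unused
    in the adapter's distributed port group."""
    return [u for u in ALL_UPLINKS if u not in DEDICATED[vmknic]]

print(unused_uplinks("vmk2"))  # ['vmnic0', 'vmnic1', 'vmnic2']
```

In the vSphere client, this corresponds to moving those uplinks to the “Unused uplinks” list in the port group’s teaming and failover settings.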


Figure A.22: Multiple uplinks used by the vSAN network

As before, vmk2 (the vSAN VMkernel adapter) should be assigned to the distributed port group for
vSAN:

Figure A.23: Assign distributed port group for vSAN networking


Note: If you are only now migrating the uplinks for the vSAN network, you may not be able to change
the distributed port group settings until after the migration. During this time, vSAN may have
communication issues. After the migration, move to the distributed port group settings and make any
policy changes and mark any uplinks that should be unused. vSAN networking should then return to
normal when this task is completed. Use the Health Check plugin to verify that everything is functional
once the migration is completed.

Figure A.24: Change distributed port group settings

Figure A.25: Showing load balancing and unused uplinks


That completes the VMkernel adapter migrations. The final step is to move the virtual machine networking.

7.7 Step A.6 Migrate VM Network

This is the final step of migrating the network from a standard vSwitch (VSS) to a distributed switch
(DVS). Once again, we use the “Add and manage hosts”, the same link used for migrating the VMkernel
adapters. The task is to manage host networking:

Figure A.26: Manage host networking

Select all the hosts in the cluster, as all hosts will have their virtual machine networking migrated to the
distributed switch.


Figure A.27: Select all hosts

On this occasion, we do not need to move any uplinks. However, if the VM networking on your hosts
used a different uplink, then this of course would also need to be migrated from the VSS. In this
example, the uplink has already been migrated.

Figure A.28: Migrate virtual machine networking


Select the VMs that you wish to migrate from a virtual machine network on the VSS to the new
virtual machine distributed port group on the vDS. Click on the “Assign port group” option, as we have
done many times before, and select the distributed port group, named VM-DPG here:

Figure A.29: Assign port groups for the VMs

Review the final screen. In this case, we are only moving two VMs. Note that any templates using the
original VSS virtual machine network will need to be converted to virtual machines and edited so that
the new distributed port group for virtual machines is selected as their network. This step cannot
be achieved through the migration wizard.


Figure A.30: Finish

The VSS should no longer have any uplinks or port groups and can be safely removed.

Figure A.31: VSS no longer in use

This completes the migration from a standard vSwitch (VSS) to a distributed vSwitch (vDS).


8. Appendix B
Troubleshooting the vSAN Network


8.1 Appendix B. Troubleshooting the vSAN Network

In this section, the types of issues that can arise from a misconfigured vSAN network are examined.
This chapter shows how to troubleshoot these issues.

vSAN is entirely dependent on the network: its configuration, reliability, performance, and so on. One of
the most frequent causes of support requests is an incorrect network configuration, or the
network not performing as expected.

VMware recommends using the vSAN Health Services to do initial triage of network issues. The vSAN
Health Services carry out a range of network health checks, and direct administrators to an
appropriate knowledge base article depending on the results of the health check. The knowledge base
article will provide administrators with step-by-step instruction to solve the network problem at hand.

8.2 Network Health Checks

vSAN comes with a built-in health check mechanism. A whole section of the health check is dedicated
to networking. Here is a screenshot of the different checks that are tested.

Figure B.1: Network Health Checks

Each check has an “Ask VMware” button associated with it. Should a health check fail, administrators
should click on this button and read the associated VMware Knowledge Base article for further details,
and guidance on how to address the issue at hand.

Figure B.2: Ask VMware

Let’s now examine some of these health checks in detail.


8.3 All hosts have a vSAN vmknic configured

In order to participate in a vSAN cluster, and form a single partition of fully connected hosts, each ESXi
host in a vSAN cluster must have a vmknic (VMkernel NIC or VMkernel adapter) configured for vSAN
traffic.

This check ensures each ESXi host in the vSAN cluster has a VMkernel NIC configured for vSAN
traffic. Note that even if an ESXi host is part of the vSAN cluster but is not contributing storage, it must
still have a VMkernel NIC configured for vSAN traffic.

Note that this check just ensures that one vmknic is configured. While multiple vmknics are supported,
this test does not check for a consistent network configuration across hosts, i.e. some hosts may have
2 vmknics while other hosts have only 1.

If this test fails, it means that at least one of the hosts in the cluster does not have a VMkernel NIC
configured for vSAN traffic.

Ensure that each ESXi host participating in the vSAN cluster has a VMkernel NIC enabled for vSAN
traffic. This can be done from the vSphere web client, where each ESXi host’s networking
configuration can easily be checked: navigate to Hosts and Clusters -> host -> Manage -> Networking
-> VMkernel Adapters, check the vSAN Traffic column, and ensure that at least 1 vmknic is
“Enabled” for this traffic type.

It can also be checked from the CLI using “esxcli vsan network list”:

[root@cs-ie-h01:~] esxcli vsan network list
Interface
   VmkNic Name: vmk2
   IP Protocol: IPv4
   Interface UUID: 264ed254-5aa5-0647-9cc7-001f29595f9f
   Agent Group Multicast Address: 224.2.3.4
   Agent Group Multicast Port: 23451
   Master Group Multicast Address: 224.1.2.3
   Master Group Multicast Port: 12345
   Multicast TTL: 5
[root@cs-ie-h01:~]

In the above output, the VMkernel NIC vmk2 is used for vSAN traffic.
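As a rough illustration, the text output of “esxcli vsan network list” can be parsed to confirm that at least one vmknic is enabled for vSAN traffic. The parser below is a sketch that assumes the “VmkNic Name:” field layout shown above:

```python
def vsan_vmknics(esxcli_output):
    """Extract vSAN-enabled VMkernel NIC names from the text output of
    'esxcli vsan network list'. An empty result corresponds to a failed
    'All hosts have a vSAN vmknic configured' check on that host."""
    nics = []
    for line in esxcli_output.splitlines():
        line = line.strip()
        if line.startswith("VmkNic Name:"):
            nics.append(line.split(":", 1)[1].strip())
    return nics

# Abbreviated sample of the output shown above.
sample = """Interface
   VmkNic Name: vmk2
   IP Protocol: IPv4"""
print(vsan_vmknics(sample))  # ['vmk2']
```

Running this per host and flagging any host with an empty list mirrors the behavior of the health check described above.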

It can also be checked from the RVC using the “vsan.cluster_info” command. This will display which
VMkernel adapter, if any, is being used on each host for vSAN traffic.

<<truncated>>
Host: cs-ie-h01.ie.local
Product: VMware ESXi 6.0.0 build-2391873
vSAN enabled: yes
Cluster info:
Cluster role: agent
Cluster UUID: 529ccbe4-81d2-89bc-7a70-a9c69bd23a19
Node UUID: 545ca9af-ff4b-fc84-dcee-001f29595f9f
Member UUIDs: ["5460b129-4084-7550-46e1-0010185def78", "54196e13-7f5f-
cba8-5bac-001517a69c72", "54188e3a-84fd-9a38-23ba-001b21168828", "545ca9af-ff4b-
fc84-dcee-001f29595f9f"] (4)
Node evacuated: no
Storage info:
Auto claim: no
Checksum enforced: no
Disk Mappings:
SSD: HP Serial Attached SCSI Disk (naa.xxx) - 186 GB, v2
MD: HP Serial Attached SCSI Disk (naa.xxx) - 136 GB, v2
MD: HP Serial Attached SCSI Disk (naa.xxx) - 136 GB, v2
MD: HP Serial Attached SCSI Disk (naa.xxx) - 136 GB, v2
MD: HP Serial Attached SCSI Disk (naa.xxx) - 136 GB, v2
MD: HP Serial Attached SCSI Disk (naa.xxx) - 136 GB, v2
MD: HP Serial Attached SCSI Disk (naa.xxx) - 136 GB, v2
MD: HP Serial Attached SCSI Disk (naa.xxx) - 136 GB, v2
FaultDomainInfo:
Not configured
NetworkInfo:
Adapter: vmk2 (172.32.0.1)
<<truncated>>

This provides both the VMkernel NIC used for vSAN Traffic as well as the IP address of the interface.

8.4 All hosts have matching multicast settings

In order to participate in a vSAN cluster, and form a single partition of fully connected hosts, each host
in a vSAN cluster must use the same IP multicast address range.

It is very rare for users to have to change the IP multicast address range for vSAN. However, this might
be a necessary step if there are multiple vSAN clusters on the same network. The procedure to change
multicast addresses is described in VMware KB Article 2075451.

If an administrator does change the multicast addresses, using esxcli or API, then it is important that
they are consistently configured across the cluster. This health check ensures that is the case.

Please note that this health check doesn’t check ports or TTL inconsistencies. It is only checking the
multicast IP addresses.

If this check fails, it means that at least one host has misconfigured multicast addresses. In
addition to the error detected by the health check, this type of issue may also result in a partitioned
cluster. This will be visible in the health check, but also in the vSphere web client UI under
vSAN Cluster -> Disk Management. In the Group column, different values will be shown for the
hosts that are in different network partitions (i.e. isolated).

The IP multicast addresses that vSAN uses can be changed and checked from the CLI. The ESXCLI
command “esxcli vsan network list“ will display the multicast addresses used by each host.

[root@cs-ie-h01:~] esxcli vsan network list
Interface
   VmkNic Name: vmk2
   IP Protocol: IPv4
   Interface UUID: 264ed254-5aa5-0647-9cc7-001f29595f9f
   Agent Group Multicast Address: 224.2.3.4
   Agent Group Multicast Port: 23451
   Master Group Multicast Address: 224.1.2.3
   Master Group Multicast Port: 12345
   Multicast TTL: 5

8.5 All hosts have matching subnets

In order to participate in a vSAN cluster, and form a single partition of fully connected hosts, each host
in a vSAN cluster must be able to talk to every other host in the cluster. The most common network
configuration is for vSAN hosts to share a single layer-2 non-routable network, i.e. a single IP subnet
and a single VLAN. However, since vSAN 6.0, there is support for layer-3 networks, i.e. routed
connections, but this is not a very common deployment configuration of vSAN outside of stretched
cluster implementations.

This check tests that all ESXi hosts in a vSAN cluster have been configured so that all vSAN VMkernel
NICs are on the same IP subnet.


As mentioned earlier, in vSAN 6.0, VMware introduced support for L3/routing on the vSAN network. In
cases where vSAN is deployed over an L3 network, this subnet health check will report an error and
may be safely ignored. This will almost always be the case for vSAN stretched cluster implementations.

If, however, the vSAN network is deployed on an L2 network configuration, then this health check will
identify ESXi hosts that are not on the same IP subnet. The check will also show an issue when
multiple vmknics have been configured inconsistently across the cluster. For example, if one
host has 2 vmknics and another host has only 1, this check will also flag that.

Administrators should ensure that all ESXi hosts that share the same L2 vSAN network have matching
subnets. This can be done from the vSphere web client, where each host’s networking configuration
can easily be checked. It can also be checked from the ESXCLI via “esxcli network ip interface ipv4 get
-i vmkX” where vmkX is the VMkernel adapter.

[root@cs-ie-h01:~] esxcli network ip interface ipv4 get -i vmk2
Name  IPv4 Address  IPv4 Netmask   IPv4 Broadcast  Address Type  DHCP DNS
----  ------------  -------------  --------------  ------------  --------
vmk2  172.32.0.1    255.255.255.0  172.32.0.255    STATIC        false
[root@cs-ie-h01:~]
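The subnet comparison this health check performs can be sketched with Python’s ipaddress module. The host names and addresses below are illustrative; the check passes only when all address/netmask pairs normalise to a single subnet:

```python
import ipaddress

def vsan_subnets(interfaces):
    """interfaces maps hostname -> (ipv4_address, netmask). Returns the
    set of distinct subnets; a healthy L2 vSAN network yields exactly one."""
    return {
        str(ipaddress.ip_network(f"{ip}/{mask}", strict=False))
        for ip, mask in interfaces.values()
    }

# Hypothetical three-host cluster with one host on the wrong subnet.
cluster = {
    "esxi-01": ("172.32.0.1", "255.255.255.0"),
    "esxi-02": ("172.32.0.2", "255.255.255.0"),
    "esxi-03": ("172.32.1.3", "255.255.255.0"),  # wrong subnet
}
subnets = vsan_subnets(cluster)
print(sorted(subnets))    # ['172.32.0.0/24', '172.32.1.0/24']
print(len(subnets) == 1)  # False -> this health check would report an error
```

Note that, as stated above, an L3-routed vSAN deployment legitimately spans multiple subnets, in which case this particular comparison can be safely ignored.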

One thing missing from the above output is the VLAN ID associated with the vSAN network. This is
important because the vmknic for the vSAN network may be tagged with a VLAN ID on some hosts
and not on others. This again leads to network misconfiguration and cluster partitioning.

To check which VLAN ID, if any, is associated with the vSAN network, the easiest way is to use the
vSphere web client, and navigate to the VMkernel adapter on each host, select the vSAN network
adapter, and check its properties as shown below.

Figure B.3: vmknic used for vSAN traffic

In this case, the vSAN network is not using VLANs, which is why the VLAN type in the lower part of the
screen is set to “None”. If VLANs were being used, this information would be populated. This would
then have to be checked on all hosts in the vSAN cluster.

It may also be checked from RVC via “vsan.cluster_info”.

<<truncated>>
Host: cs-ie-h01.ie.local
Product: VMware ESXi 6.0.0 build-2391873
vSAN enabled: yes
Cluster info:
Cluster role: agent
Cluster UUID: 529ccbe4-81d2-89bc-7a70-a9c69bd23a19
Node UUID: 545ca9af-ff4b-fc84-dcee-001f29595f9f
Member UUIDs: ["5460b129-4084-7550-46e1-0010185def78", "54196e13-7f5f-
cba8-5bac-001517a69c72", "54188e3a-84fd-9a38-23ba-001b21168828", "545ca9af-ff4b-
fc84-dcee-001f29595f9f"] (4)
Node evacuated: no
Storage info:
Auto claim: no
Checksum enforced: no
Disk Mappings:
SSD: HP Serial Attached SCSI Disk (naa.xxx) - 186 GB, v2
MD: HP Serial Attached SCSI Disk (naa.xxx) - 136 GB, v2
MD: HP Serial Attached SCSI Disk (naa.xxx) - 136 GB, v2
MD: HP Serial Attached SCSI Disk (naa.xxx) - 136 GB, v2
MD: HP Serial Attached SCSI Disk (naa.xxx) - 136 GB, v2
MD: HP Serial Attached SCSI Disk (naa.xxx) - 136 GB, v2
MD: HP Serial Attached SCSI Disk (naa.xxx) - 136 GB, v2
MD: HP Serial Attached SCSI Disk (naa.xxx) - 136 GB, v2
FaultDomainInfo:
Not configured
NetworkInfo:
Adapter: vmk2 (172.32.0.1)
<<truncated>>

8.6 Hosts disconnected from VC

If an ESXi host that is part of a vSAN cluster is disconnected from vCenter (or is otherwise not
responding), it could cause operational issues. This could be due to a power outage or some other
event. vSAN still considers the host as a member of the cluster.

This checks whether vCenter Server has an active connection to all ESXi hosts in the vSphere cluster.

It may mean that vSAN is unable to use the capacity or resources available on this ESXi host, and it
may also imply that components residing on the disks on this ESXi host are now in an ABSENT state,
placing virtual machines at risk should another failure occur in the cluster. However, because it is
disconnected from vCenter, the overall state of this ESXi host is not known.

An administrator should immediately check why an ESXi host that is part of the vSAN cluster is no
longer connected to vCenter. One option is to manually try to reconnect the host to vCenter Server via
the vSphere web client UI. Right-click on the disconnected ESXi host in question, select “Connection”
from the drop-down menu and then select “Connect”. Provide the appropriate responses to the
connection wizard where required.

If the host fails to reconnect manually, an administrator could try to connect to the host via SSH, if it is
available, to assess its status. Another option would be to connect to the server’s console (iLO for HP,
DRAC for Dell, etc.) and ascertain whether there is some underlying problem with the server in question.
VMware KB Article 1003409 provides additional information on how to troubleshoot disconnected
ESXi hosts.

8.7 Hosts with connectivity issues

This check refers to situations where vCenter lists the host as connected, but API calls from vCenter to
the host are failing. This situation should be extremely rare, but in case it happens it leads to similar
issues as the "Host disconnected from VC" situation and it could cause operational issues.

If this health check highlights that an ESXi host has connectivity issues, vCenter server does not know
its state. The host may be up, and may be participating in the vSAN cluster, serving data, and playing a
critical role in the storage functions of the vSAN cluster.


However, it could also mean that the host may be down and unavailable. vCenter server, and hence the
vSAN Health check, cannot fully assess the situation as long the host is disconnected.

An administrator should immediately check why an ESXi host that is part of the vSAN cluster is no
longer connected to vCenter.

One option is to manually try to reconnect the host to vCenter server via the vSphere web client UI.
Right click on the disconnected ESXi host in question, select “Connection” from the drop-down menu
and then select “Connect”. Provide the appropriate responses to the connection wizard where
required.

If the machine cannot be connected to vCenter but can be connected to via the vSphere web client,
then the issue is likely related to network connectivity or communication problems rather than
management-agent problems.

If the host fails to reconnect manually, an administrator could try to connect to the host via SSH, if it is
available, to assess its status. Another option would be to connect to the server’s console (iLO for HP,
DRAC for Dell, etc.) and ascertain whether there is some underlying problem with the server in question.

VMware KB Article 1003409 provides additional information on how to troubleshoot disconnected
ESXi hosts.

8.8 Multicast assessment based on other checks

This health check is simply a rollup of previous network health checks.

Basically, if vSAN is correctly configured and the ping tests are succeeding, but there is still a vSAN
network partition, this check will report multicast as the most likely cause of the network partition
issue.

Note, however, that although multicast is the likely cause, it doesn't have to be. Other causes could
include performance issues: problems like excessive dropped packets or excessive pause frames could
also lead to this health check failing. Refer to the section on flow control earlier in this guide.

If this health check reports that multicast may be an issue, a proactive "host debug multicast” check is
performed. The "host debug multicast” check only runs if this health check triggers it. This check will
add ~10 seconds to the run time of the health check.

If this check fails, it indicates that multicast is most likely the root cause of a network partition.

There is a section later in this appendix with further tips on how to troubleshoot multicast
misconfiguration issues. There is also a section elsewhere in this guide on how to check for excessive
dropped packets and excessive pause frames. Finally, VMware KB 2075451 describes changing
multicast settings when multiple vSAN clusters are present, which should be referenced when running
multiple vSAN clusters on the same network.

There have been situations where misbehaving network switches have led to vSAN network outages.
Symptoms include hosts unable to communicate, and exceedingly high latency reported for virtual
machine I/O in vSAN Observer. However, when latency is examined at the vSAN Disks layer, there is no
latency, which immediately points to latency being incurred at the network layer.

In one case, it was observed that the physical network switch in question was sending excessive
amounts of Pause frames. Pause frames are a flow control mechanism that is designed to stop or
reduce the transmission of packets for an interval of time. This behavior negatively impacted the vSAN
network performance.

ethtool

There is a command on the ESXi host called ethtool to check for flow control. Here is an example
output:


~ # ethtool -a vmnic4
Pause parameters for vmnic4:
Autonegotiate: on
RX: off
TX: off

This output shows that auto-negotiate is set to on, which is recommended for ESXi host NICs, but that
there is no flow control enabled on the switch (RX and TX are both off).
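As an illustrative sketch, the fields of “ethtool -a” output can be parsed to determine whether flow control is active in either direction. This is not a VMware tool, just a minimal parser assuming the field layout shown above:

```python
def flow_control_enabled(ethtool_output):
    """Return True if pause-frame flow control is on in either direction,
    based on the RX/TX fields of 'ethtool -a' output."""
    settings = {}
    for line in ethtool_output.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            settings[key.strip()] = value.strip()
    return settings.get("RX") == "on" or settings.get("TX") == "on"

# The sample output shown above.
sample = """Pause parameters for vmnic4:
Autonegotiate: on
RX: off
TX: off"""
print(flow_control_enabled(sample))  # False: no flow control on this NIC
```

A result of True would indicate that pause frames are in play on that uplink, which is where the counters discussed below become relevant.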

In the example outage discussed earlier, there were excessive amounts of pause frames in the RX field,
with values in the millions. In this case, one troubleshooting step might be to disable the flow control
mechanisms on the switch while further investigation into the root cause takes place.

If you experience difficulty locating the root cause of the network partition, and need further guidance
on items such as pause frames, please contact VMware Global Support Services for assistance.
However, as stated earlier in the guide, the guidance around pause frames should be to use the best
practices of the switch vendor (which is typically to leave them enabled).

8.9 Network Latency Check

This check performs a network latency check of the vSAN hosts. If the latency exceeds 100 ms, a
warning is displayed; if it exceeds 200 ms, an error is raised.

If this health check fails, then there is poor network latency to one or more vSAN hosts in the vSAN
cluster.

The check can be used to show all latency check results, as well as only the failed network latency
checks. One should be able to identify which hosts are affected, and troubleshoot the issue further.
vmkping can be used at the ESXi CLI to provide further isolation tests.
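The thresholds described above can be captured in a trivial sketch (the 100 ms and 200 ms values are the ones stated in this section):

```python
def latency_status(latency_ms):
    """Classify a host's vSAN network latency per the thresholds above:
    warning above 100 ms, error above 200 ms."""
    if latency_ms > 200:
        return "error"
    if latency_ms > 100:
        return "warning"
    return "ok"

for value in (5, 150, 250):
    print(value, "->", latency_status(value))  # ok, warning, error
```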

Note that in vSAN versions prior to vSphere 6.0 P03, this check did not make allowances for the
witness appliance in 2-node and stretched cluster solutions, where higher latencies are expected.
Therefore, this check would have failed on many such setups. This issue is addressed in 6.0 P03
(Release notes are available here).

8.10 vMotion: Basic (unicast) connectivity check

This test ensures that IP connectivity exists among all ESXi hosts in the vSAN cluster that have
vMotion configured, by simply pinging each ESXi host on the vMotion network from every other
ESXi host.

The “vMotion: Basic (unicast) connectivity check” health check automates the pinging of each ESXi
host from each of the other hosts in the vSAN cluster, and ensures that there is connectivity between
all the hosts on the vMotion network. In this test all nodes ping all other nodes in the cluster.

If the small ping tests fail, it indicates that the network is misconfigured. This could be any number of
things, and the issue may lie in the virtual network (vmknic, virtual switch) or the physical network
(cable, physical NIC, physical switch). The other network health check results should be examined to
narrow down the root cause of the misconfiguration. If all the other health checks indicate a good ESXi
side configuration, the issue may reside in the physical network.

This ping test is performed using very small packets, so it ensures basic connectivity. The other health
checks are designed to assess MTU misconfiguration and multicast aspects of connectivity.

Various vmkping tests may be run to identify the misconfiguration. However, this test may be used
with other health checks to focus the network misconfiguration investigation.

8.11 vMotion: MTU checks (ping with large packet size)


This health check complements the basic vMotion ping connectivity check. The Maximum
Transmission Unit (MTU) size is often increased to improve network performance. Incorrectly
configured MTUs will frequently NOT show up as a network configuration issue, but instead cause
performance problems.

While the basic check used small packets, this check uses large packets (9000 bytes). These are often
referred to as jumbo frames. Assuming the small ping test succeeds, the large ping test should also
succeed when the MTU size is consistently configured across all VMkernel adapters (vmknics), virtual
switches and any physical switches.

Note that if the source vmknic has an MTU of 1500, it will fragment the 9000-byte packet, and then
those fragments will travel perfectly fine along the network to the other host where they are
reassembled. As long as all network devices along the path use a higher or equal MTU, then this test
will pass.

One possible cause is if the vmknic has an MTU of 9000 and then the physical switch enforces an MTU
of 1500. This is because then the source doesn't fragment the packet and the physical switch will drop
the packet.

If however there is an MTU of 1500 on the vmknic and an MTU 9000 on the physical switch (e.g.
because there is also iSCSI running which is using 9000) then there is no issue and the test will pass.

vSAN supports different MTU sizes. It does not care if it is set to 1500 or 9000, as long as it is
consistently configured across the cluster.
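The fragmentation behavior described above reduces to a simple rule: because the source fragments packets down to its own MTU, the large ping succeeds as long as every device on the path supports an MTU at least as large as the source vmknic’s. A minimal sketch of that rule:

```python
def jumbo_ping_passes(vmknic_mtu, path_mtus):
    """The source fragments to its own MTU, so the test passes when every
    device along the path supports at least the vmknic's MTU."""
    return min(path_mtus) >= vmknic_mtu

print(jumbo_ping_passes(1500, [9000, 9000]))  # True: higher switch MTU is fine
print(jumbo_ping_passes(9000, [1500, 9000]))  # False: switch drops unfragmented jumbo
print(jumbo_ping_passes(9000, [9000, 9000]))  # True: consistent jumbo frames
```

The three cases match the scenarios in the paragraphs above: vmknic 1500 with a 9000-byte switch passes, vmknic 9000 with a 1500-byte switch fails, and consistent configuration passes.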

Various vmkping tests may be run to identify the misconfiguration, including tests with larger packets
(for example, vmkping -d -s 8972, where -d disables fragmentation and 8972 allows for the 28 bytes of
ICMP/IP headers). However, this test may be used with other health checks to focus the
network misconfiguration investigation.

8.12 vSAN cluster partition

In order to function properly, all vSAN hosts should be able to talk to each other: over both multicast
and unicast prior to vSAN 6.6, and over unicast only in vSAN 6.6 and later.

If it is the case that all the nodes in the cluster cannot communicate, a vSAN cluster will split into
multiple network "partitions", i.e. sub-groups of ESXi hosts that can talk to each other, but not to other
sub-groups.

When that happens, vSAN objects may become unavailable until the network misconfiguration is
resolved. For smooth operations of production vSAN clusters it is very important to have a stable
network with no extra network partitions (i.e. only one partition).

This health check examines the cluster to see how many partitions exist. It displays an error if there is
more than a single partition in the vSAN cluster. Note that this check determines whether there is a
network issue, but doesn’t attempt to find a root cause; the other network health checks need to be
used for that.

This health check is said to be OK when only one single partition is found. As soon as multiple
partitions are found, the cluster is unhealthy.
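The partition count this check reports corresponds to the number of connected components in the “hosts can communicate with each other” graph. A minimal sketch, with hypothetical host names:

```python
from collections import defaultdict

def partitions(hosts, links):
    """Group hosts into network partitions: connected components of the
    'can communicate' relation. More than one partition is unhealthy."""
    adjacency = defaultdict(set)
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, groups = set(), []
    for host in hosts:
        if host in seen:
            continue
        group, stack = set(), [host]
        while stack:
            current = stack.pop()
            if current in group:
                continue
            group.add(current)
            stack.extend(adjacency[current] - group)
        seen |= group
        groups.append(group)
    return groups

# Hypothetical four-host cluster where h1/h2 cannot reach h3/h4.
hosts = ["h1", "h2", "h3", "h4"]
links = [("h1", "h2"), ("h3", "h4")]
print(len(partitions(hosts, links)))  # 2 partitions -> health check error
```

A fully connected cluster yields a single group, which is the healthy state this check looks for.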

There are likely to be other warnings displayed in the vSphere web client when a partition occurs.

Another interesting view is the vSAN Disk Management view. This contains a column detailing which
network partition group a host is part of. To see how many cluster partitions there are, examine this
column. If each host is in its own network partition group, then there is a cluster-wide issue. If only one
host is in its own network partition group and all other hosts are in a different network partition group,
then only that host has the issue. This may help to isolate the issue at hand and focus the investigation
effort. Note that the health UI will display the same information in the details section of this check.

The network configuration issue needs to be located and resolved. Additional health checks on the
network are designed to assist an administrator in finding the root cause of the network partition. The
reasons can range from misconfigured subnets (All hosts have matching subnets), misconfigured
vSAN traffic VMkernel adapters (All hosts have a vSAN vmknic configured), and misconfigured VLANs,
through general network communication issues, to specific multicast issues (All hosts have matching
multicast settings). The additional network health checks are designed to isolate which of those issues
may be the root cause and should be viewed in parallel with this health check.

Aside from misconfigurations, it is also possible to have partitions when the network is overloaded,
leading to a substantial number of dropped packets. vSAN can tolerate a small amount of packet loss,
but beyond a moderate amount of dropped packets, performance issues may ensue.

If none of the misconfiguration checks indicate an issue, it is advisable to watch for dropped packet
counters, as well as perform pro-active network tests.

To examine the dropped packet counters on an ESXi host, use esxtop network view (n) and examine
the field %DRPRX for excessive dropped packets. You may also need to watch the switch and switch
ports, as they may also drop packets. Another metric to check is an excess of pause frames, which can
slow down the network and impact performance. This is discussed elsewhere in this guide.
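
As a command-line complement to esxtop, the per-NIC drop counters can also be scanned with a short script. The sketch below is hypothetical: it runs the parsing logic against captured sample text, since esxcli itself is only available on an ESXi host, and the counter values shown are made up.

```shell
# Sketch only: on a live ESXi host, pipe in real output instead, e.g.
#   esxcli network nic stats get -n vmnic0
# The sample text below is illustrative data, not from a real host.
sample='Packets received: 1520331
Receive packets dropped: 0
Packets sent: 998211
Transmit packets dropped: 12'

# Flag any non-zero "packets dropped" counter.
echo "$sample" | awk -F': ' '/packets dropped/ {
  if ($2 + 0 > 0) printf "WARN %s = %s\n", $1, $2
}'
```

Any WARN line is a prompt to dig further with esxtop and the physical switch counters.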

8.13 vSAN: Basic (unicast) connectivity check

While most other network related vSAN health checks assess various aspects of the network
configuration, this health check takes a more active approach. As vSAN is not able to check the
configuration of the physical network, one way to ensure that IP connectivity exists among all ESXi
hosts in the vSAN cluster is to simply ping each ESXi host on the vSAN network from each other ESXi
host.

The “Hosts small ping test (connectivity check)” health check automates the pinging of each ESXi host
from each of the other hosts in the vSAN cluster, and ensures that there is connectivity between all the
hosts on the vSAN network. In this test all nodes ping all other nodes in the cluster.

If the small ping tests fail, it indicates that the network is misconfigured. This could be any number of
things, and the issue may lie in the virtual network (vmknic, virtual switch) or the physical network
(cable, physical NIC, physical switch). The other network health check results should be examined to
narrow down the root cause of the misconfiguration. If all the other health checks indicate a good ESXi
side configuration, the issue may reside in the physical network.

This ping test is performed using very small packets, so it ensures basic connectivity. The other health
checks are designed to assess MTU misconfiguration and multicast aspects of connectivity.

Various vmkping tests, along with a set of other commands, may be run to identify the root cause of
the ping test failure. However, this test may be used with other health checks to focus the network
misconfiguration investigation.
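
Such a mesh of pings can also be scripted by hand. The loop below is a hypothetical sketch, not part of the health check itself: the vmknic name (vmk1) and the peer IP addresses are example values to substitute for your environment, and on a machine without vmkping every peer will simply report FAIL.

```shell
# Sketch: ping each vSAN peer over the vSAN vmknic (vmk1 is an assumption).
# vmkping exists only on ESXi; elsewhere each check falls through to FAIL.
PEERS="172.40.0.10 172.40.0.11 172.40.0.12"
for ip in $PEERS; do
  if vmkping -I vmk1 -c 3 "$ip" > /dev/null 2>&1; then
    echo "OK   $ip"
  else
    echo "FAIL $ip"
  fi
done
```

Run from each host in turn; any FAIL line points at a host pairing worth investigating with the other health checks.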

8.14 vSAN: MTU check (ping with large packet size)

This health check complements the basic ping connectivity check. MTUs, the Maximum Transmission
Unit size, are increased to improve network performance. Incorrectly configured MTUs will frequently
NOT show up as a vSAN network partition, but instead cause performance issues or I/O errors in
individual objects. It can also lead to virtual machine deployment failures on vSAN. For stability of
vSAN clusters, it is critically important for the large ping test check to succeed.

While the basic check used small packets, this check uses large packets (9000 bytes). These are often
referred to as jumbo frames. Assuming the small ping test succeeds, the large ping test should also
succeed when the MTU size is consistently configured across all VMkernel adapters (vmknics), virtual
switches and any physical switches.

Note that if the source vmknic has an MTU of 1500, it will fragment the 9000-byte packet, and then
those fragments will travel perfectly fine along the network to the other host where they are
reassembled. As long as all network devices along the path use a higher or equal MTU, then this test
will pass.

What will cause a failure is the vmknic having an MTU of 9000 while the physical switch enforces an
MTU of 1500. In that case the source does not fragment the packet, and the physical switch drops the
oversized packet.

If, however, there is an MTU of 1500 on the vmknic and an MTU of 9000 on the physical switch (e.g.
because iSCSI is also running at 9000), then there is no issue and the test will pass.

vSAN supports different MTU sizes. It does not care if it is set to 1500 or 9000, as long as it is
consistently configured across the cluster.
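
One simple way to verify that consistency is to gather the MTU of the vSAN vmknic from every host and count the distinct values. This is a hypothetical sketch that parses made-up sample data; on real hosts the per-host values would come from esxcli network ip interface list.

```shell
# Sketch: host / vmknic / MTU triples as they might be collected from each
# host in the cluster. The values below are illustrative only.
sample='esxi-01 vmk1 9000
esxi-02 vmk1 9000
esxi-03 vmk1 1500'

# Count distinct MTU values: 1 means consistent, more means a mismatch.
echo "$sample" | awk '{ print $3 }' | sort -u | wc -l
```

A count of 1 means every host agrees; anything higher (2 for this sample, since one host is at 1500) means at least one host is mismatched.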

Various ping tests may be run to identify the misconfiguration, including tests with larger packets
(vmkping -s 9000), as well as a set of other commands to identify the root cause of the ping test
failure. However, this test may be used with other health checks to focus the network misconfiguration
investigation.

Symptoms of MTU misconfiguration: Cannot complete file creation

This is one of the possible symptoms of having mismatched MTU sizes in your environment. The
cluster formed successfully and reported that the network status was normal. However on attempting
to deploy a virtual machine on the vSAN datastore, an error reported that it “cannot complete file
creation operation”.

This is the error that popped up when the VM was being provisioned on the vSAN datastore:

Figure B.4: Cannot complete file creation operation

Note also the ‘Failed to connect to component host” message.

In this case, the customer wanted to use jumbo frames with vSAN in their environment. An MTU of
9000 (jumbo frames) was configured on the physical switch. However, in this setup, an MTU of 9000
on the physical switch (DELL PowerConnect) was not large enough to match the MTU of 9000
required on the ESXi configuration, due to additional overhead requirements on the switch. The switch
actually required an MTU of 9216 (9 * 1024) to allow successful communication using jumbo frames
on the vSAN network.

Once this change was made on the physical switch ports used by vSAN, virtual machines could be
successfully provisioned on the vSAN datastore.

Again, with jumbo frames, ensure that all nodes in the cluster can be pinged with the larger packet size
of 9000 as seen earlier.

8.15 Checking the vSAN network is operational

When the vSAN network has been configured, these commands will check its state. Using the
following ESXCLI commands, an administrator can check which VMkernel Adapter (vmknic) is used for
vSAN, and what attributes it contains.

First, various ESXCLI and RVC commands verify that the network is indeed fully functional. The same
tools then demonstrate to an administrator how to troubleshoot any network-related issues with vSAN.

This will involve verifying that the vmknic used for the vSAN network is uniformly configured correctly
across all hosts, checking that multicast is functional and verifying that each host participating in the
vSAN cluster can successfully communicate to one another.

esxcli vsan network list

This is a very useful command as it tells you which VMkernel interface is being used for the vSAN
network. In the output below (the command and output are identical in ESXi 5.5 and 6.0), we see that
the vSAN network is using vmk1. This command continues to work even if vSAN has been disabled on
the cluster and the hosts no longer participate in vSAN.

There are some additional useful pieces of information such as Agent Group Multicast and Master
Group Multicast.

[root@esxi-dell-m:~] esxcli vsan network list


Interface
VmkNic Name: vmk1
IP Protocol: IP
Interface UUID: 32efc758-9ca0-57b9-c7e3-246e962c24d0
Agent Group Multicast Address: 224.2.3.4
Agent Group IPv6 Multicast Address: ff19::2:3:4
Agent Group Multicast Port: 23451
Master Group Multicast Address: 224.1.2.3
Master Group IPv6 Multicast Address: ff19::1:2:3
Master Group Multicast Port: 12345
Host Unicast Channel Bound Port: 12321
Multicast TTL: 5
Traffic Type: vsan

This provides useful information such as which VMkernel interface is being used for vSAN traffic. In
this case it is vmk1. However, also shown are the multicast addresses. Note that even in vSAN 6.6, this
information is displayed even when the cluster is running in unicast mode. There is the group
multicast address and port. Port 23451 is used for the heartbeat, sent every second by the master, and
should be visible on every other host in the cluster. Port 12345 is used for the CMMDS updates
between the master and backup. Once we know which VMkernel port vSAN is using for network
traffic, we can use some additional commands to check on its status.

esxcli network ip interface list

Now that we know the VMkernel adapter, we can use this command to check items like which vSwitch
or distributed switch it is attached to, as well as the MTU size, which can be useful if jumbo frames
have been configured in the environment. In this case, the MTU is 9000.

[root@esxi-dell-m:~] esxcli network ip interface list


vmk0
Name: vmk0
<<truncated>>
vmk1
Name: vmk1
MAC Address: 00:50:56:69:96:f0
Enabled: true
Portset: DvsPortset-0
Portgroup: N/A
Netstack Instance: defaultTcpipStack
VDS Name: vDS
VDS UUID: 50 1e 5b ad e3 b4 af 25-18 f3 1c 4c fa 98 3d bb
VDS Port: 16
VDS Connection: 1123658315
Opaque Network ID: N/A
Opaque Network Type: N/A
External ID: N/A
MTU: 9000
TSO MSS: 65535
Port ID: 50331814

The Maximum Transmission Unit size is shown as 9000, so this VMkernel port is configured for jumbo
frames, which require an MTU somewhere in the region of 9,000. VMware does not make any
recommendation around the use of jumbo frames. However, jumbo frames are supported for use with
vSAN should there be a requirement to use them.

esxcli network ip interface ipv4 get -i vmk1

This is another useful command as it displays information such as IP address and netmask of the
VMkernel interface used for vSAN. With this information, an administrator can now begin to use other
commands available at the command line to check that the vSAN network is working correctly.

[root@esxi-dell-m:~] esxcli network ip interface ipv4 get -i vmk1


Name IPv4 Address IPv4 Netmask IPv4 Broadcast Address Type Gateway DHCP DNS
---- ------------ ------------- -------------- ------------ ------- --------
vmk1 172.40.0.9 255.255.255.0 172.40.0.255 STATIC 0.0.0.0 false

vmkping

vmkping is a simple command that verifies whether all of the other ESXi hosts on the network are
responding to your ping requests.

~ # vmkping -I vmk2 172.32.0.3


PING 172.32.0.3 (172.32.0.3): 56 data bytes
64 bytes from 172.32.0.3: icmp_seq=0 ttl=64 time=0.186 ms
64 bytes from 172.32.0.3: icmp_seq=1 ttl=64 time=2.690 ms
64 bytes from 172.32.0.3: icmp_seq=2 ttl=64 time=0.139 ms

--- 172.32.0.3 ping statistics ---


3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.139/1.005/2.690 ms

While it does not verify multicast functionality, it can help with isolating a rogue ESXi host that has
network issues. You can also examine the response times to see if there is any abnormal latency on the
vSAN network.

One thing to note: if jumbo frames are configured, this command will not detect a jumbo frame MTU
misconfiguration, because by default it sends small packets that fit comfortably within a standard
1500-byte MTU. If there is a need to verify that jumbo frames are working successfully end-to-end, use
vmkping with the larger packet size (-s) option as follows:

~ # vmkping -I vmk2 172.32.0.3 -s 9000


PING 172.32.0.3 (172.32.0.3): 9000 data bytes
9008 bytes from 172.32.0.3: icmp_seq=0 ttl=64 time=0.554 ms
9008 bytes from 172.32.0.3: icmp_seq=1 ttl=64 time=0.638 ms
9008 bytes from 172.32.0.3: icmp_seq=2 ttl=64 time=0.533 ms

--- 172.32.0.3 ping statistics ---


3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.533/0.575/0.638 ms
~ #

Consider adding -d to the vmkping command to test if packets can be sent without fragmentation.
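
When using -d (don't fragment), remember that the value passed to -s is the ICMP payload size, not the on-wire frame size. A small sketch of the arithmetic, with vmk1 and the peer address as placeholder values:

```shell
# Sketch: the largest unfragmented ICMP payload on a 9000-byte MTU path is
# the MTU minus the 20-byte IP header and the 8-byte ICMP header.
MTU=9000
PAYLOAD=$((MTU - 20 - 8))
echo "vmkping -I vmk1 -d -s $PAYLOAD <peer-ip>"
```

Passing -s 9000 together with -d would exceed a 9000-byte MTU and fail even on a correctly configured network.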

esxcli network ip neighbor list

This next command is a very quick way of checking to see if all vSAN hosts are actually on the same
network segment. In this configuration, we have a 4-node cluster, and this command returns the ARP
(Address Resolution Protocol) entries of the other 3 nodes, including their IP addresses and their
vmknic (vSAN is configured to use vmk1 on all hosts in this cluster).

[root@esxi-dell-m:~] esxcli network ip neighbor list -i vmk1


Neighbor Mac Address Vmknic Expiry State Type
----------- ----------------- ------ ------- ----- -------
172.40.0.12 00:50:56:61:ce:22 vmk1 164 sec Unknown
172.40.0.10 00:50:56:67:1d:b2 vmk1 338 sec Unknown
172.40.0.11 00:50:56:6c:fe:c5 vmk1 162 sec Unknown
[root@esxi-dell-m:~]

esxcli network diag ping

To get even more detail regarding the vSAN network connectivity between the various hosts, ESXCLI
provides a powerful network diagnostic command. This command checks for duplicates on the
network, as well as round trip times. Here is an example of one such output, where the VMkernel
interface is on vmk1 and the remote vSAN network IP of another host on the network is 172.40.0.10:

[root@esxi-dell-m:~] esxcli network diag ping -I vmk1 -H 172.40.0.10


Trace:
Received Bytes: 64
Host: 172.40.0.10
ICMP Seq: 0
TTL: 64
Round-trip Time: 1864 us
Dup: false
Detail:

Received Bytes: 64
Host: 172.40.0.10
ICMP Seq: 1
TTL: 64
Round-trip Time: 1834 us
Dup: false
Detail:

Received Bytes: 64
Host: 172.40.0.10
ICMP Seq: 2
TTL: 64
Round-trip Time: 1824 us
Dup: false
Detail:
Summary:
Host Addr: 172.40.0.10
Transmitted: 3
Recieved: 3
Duplicated: 0
Packet Lost: 0
Round-trip Min: 1824 us
Round-trip Avg: 1840 us
Round-trip Max: 1864 us


[root@esxi-dell-m:~]

vsan.lldpnetmap

If there are non-Cisco switches with Link Layer Discovery Protocol (LLDP) enabled in the environment,
there is an RVC (Ruby vSphere Console) command to display uplink <-> switch <-> switch port
information. RVC is available on all vCenter Servers since vSAN launched. For more information on
RVC, please reference the RVC Command Guide.

This is extremely useful for determining which hosts are attached to which switches when the vSAN
Cluster is spanning multiple switches. It may help to isolate a problem to a particular switch when only
a subset of the hosts in the cluster is impacted.

> vsan.lldpnetmap 0
2013-08-15 19:34:18 -0700: This operation will take 30-60 seconds ...
+---------------+---------------------------+
| Host | LLDP info |
+---------------+---------------------------+
| 10.143.188.54 | w2r13-vsan-x650-2: vmnic7 |
| | w2r13-vsan-x650-1: vmnic5 |
+---------------+---------------------------+

This is only available with switches that support LLDP. Cisco switches support LLDP, but it is not
enabled by default. To configure it, logon to the switch and run the following:

switch# config t
Switch(Config)# feature lldp

To verify that LLDP is enabled:

switch(config)#do show running-config lldp

Note: LLDP will operate in both send AND receive mode by default once enabled. Check the settings
of your vDS properties if the physical switch information is not being discovered. By default, vDS is
created with discovery protocol set to CDP, Cisco Discovery Protocol. Administrators should set the
discovery protocol to LLDP and operation to “both” on the vDS to resolve this.

8.16 Checking multicast communications

Prior to vSAN 6.6, multicast configurations have been one of the most problematic issues for initial
vSAN deployment. One of the simplest ways to verify if multicast is working correctly in your vSAN
environment is by using the tcpdump-uw command. This command is available from the command
line of the ESXi hosts.

tcpdump-uw -i vmk2 multicast

This tcpdump command will show if the master is correctly sending multicast packets (port and IP info)
and if all other hosts in the cluster are receiving them.

On the master, this command will show the packets being sent out to the multicast address. On all
other hosts, the same exact packets should be seen (from the master to the multicast address). If they
are not seen, multicast is not working correctly. Run the tcpdump-uw command shown here on any
host in the cluster and the heartbeats should be seen coming from the master, which in this case is on
IP address 172.32.0.2. The “-v” for verbosity is optional.

[root@esxi-hp-02:~] tcpdump-uw -i vmk2 multicast -v


tcpdump-uw: listening on vmk2, link-type EN10MB (Ethernet), capture size 96 bytes
11:04:21.800575 IP truncated-
ip - 146 bytes missing! (tos 0x0, ttl 5, id 34917, offset 0, flags [none], proto UDP (17), lengt
172.32.0.4.44824 > 224.1.2.3.12345: UDP, length 200
11:04:22.252369 IP truncated-
ip - 234 bytes missing! (tos 0x0, ttl 5, id 15011, offset 0, flags [none], proto UDP (17), lengt
172.32.0.2.38170 > 224.2.3.4.23451: UDP, length 288
11:04:22.262099 IP truncated-
ip - 146 bytes missing! (tos 0x0, ttl 5, id 3359, offset 0, flags [none], proto UDP (17), length
172.32.0.3.41220 > 224.2.3.4.23451: UDP, length 200
11:04:22.324496 IP truncated-
ip - 146 bytes missing! (tos 0x0, ttl 5, id 20914, offset 0, flags [none], proto UDP (17), lengt
172.32.0.5.60460 > 224.1.2.3.12345: UDP, length 200
11:04:22.800782 IP truncated-
ip - 146 bytes missing! (tos 0x0, ttl 5, id 35010, offset 0, flags [none], proto UDP (17), lengt
172.32.0.4.44824 > 224.1.2.3.12345: UDP, length 200
11:04:23.252390 IP truncated-
ip - 234 bytes missing! (tos 0x0, ttl 5, id 15083, offset 0, flags [none], proto UDP (17), lengt
172.32.0.2.38170 > 224.2.3.4.23451: UDP, length 288
11:04:23.262141 IP truncated-
ip - 146 bytes missing! (tos 0x0, ttl 5, id 3442, offset 0, flags [none], proto UDP (17), length
172.32.0.3.41220 > 224.2.3.4.23451: UDP, length 200

While this output might seem a little confusing, suffice it to say that the output shown here indicates
that the 4 hosts in the cluster are getting a heartbeat from the master. This tcpdump-uw command
would have to be run on every host to verify that they are all receiving the heartbeat. This will verify
that the master is sending the heartbeats, and every other host in the cluster is receiving them,
indicating multicast is working.
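
When scanning a longer capture, it can help to reduce the output to the set of distinct senders and compare it against the expected cluster membership. A hypothetical sketch, run here against sample capture lines rather than a live tcpdump session:

```shell
# Sketch: extract the distinct source IPs sending to the vSAN multicast
# groups. The capture lines below are illustrative samples.
sample='172.32.0.4.44824 > 224.1.2.3.12345: UDP, length 200
172.32.0.2.38170 > 224.2.3.4.23451: UDP, length 288
172.32.0.3.41220 > 224.2.3.4.23451: UDP, length 200
172.32.0.4.44824 > 224.1.2.3.12345: UDP, length 200'

# The first field is ip.port; strip the port and de-duplicate.
echo "$sample" | awk '{ split($1, a, "."); print a[1]"."a[2]"."a[3]"."a[4] }' | sort -u
```

Each node's vSAN IP should appear in the resulting list; a missing address suggests that host's heartbeats are not arriving.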

If some of the vSAN hosts are not able to pick up the one-second heartbeats from the master, the
network admin needs to check the multicast configuration of their switches.

To avoid the annoying “truncated-ip – 146 bytes missing!” messages, one can use the -s0 option with
the same command to stop the truncating of packets:

[root@esxi-hp-02:~] tcpdump-uw -i vmk2 multicast -v -s0


tcpdump-uw: listening on vmk2, link-
type EN10MB (Ethernet), capture size 65535 bytes
11:18:29.823622 IP (tos 0x0, ttl 5, id 56621, offset 0, flags [none], proto UDP (17), length 228
172.32.0.4.44824 > 224.1.2.3.12345: UDP, length 200
11:18:30.251078 IP (tos 0x0, ttl 5, id 52095, offset 0, flags [none], proto UDP (17), length 228
172.32.0.3.41220 > 224.2.3.4.23451: UDP, length 200
11:18:30.267177 IP (tos 0x0, ttl 5, id 8228, offset 0, flags [none], proto UDP (17), length 316)
172.32.0.2.38170 > 224.2.3.4.23451: UDP, length 288
11:18:30.336480 IP (tos 0x0, ttl 5, id 28606, offset 0, flags [none], proto UDP (17), length 228
172.32.0.5.60460 > 224.1.2.3.12345: UDP, length 200
11:18:30.823669 IP (tos 0x0, ttl 5, id 56679, offset 0, flags [none], proto UDP (17), length 228
172.32.0.4.44824 > 224.1.2.3.12345: UDP, length 200

The next multicast checking command is related to IGMP (Internet Group Management Protocol)
membership. Hosts (and network devices) use IGMP to establish multicast group membership.

tcpdump-uw -i vmk2 igmp

Each ESXi node in the vSAN cluster will send out regular IGMP “membership reports” (aka ‘joins’).
This tcpdump command will show the IGMP membership reports from a host:

[root@esxi-dell-m:~] tcpdump-uw -i vmk1 igmp


tcpdump-uw: verbose output suppressed, use -v or -vv for full protocol decode
listening on vmk1, link-type EN10MB (Ethernet), capture size 262144 bytes
15:49:23.134458 IP 172.40.0.9 > igmp.mcast.net: igmp v3 report, 1 group record(s)
15:50:22.994461 IP 172.40.0.9 > igmp.mcast.net: igmp v3 report, 1 group record(s)

The output shows “igmp v3 reports” are taking place, indicating that the ESXi host is regularly
updating its membership. If a network administrator has any doubts whether or not vSAN ESXi nodes
are doing IGMP correctly, running this command on each ESXi host in the cluster and showing this
trace can be used to verify that this is indeed the case.

VMware strongly recommends IGMP v3, which is much more stable than previous versions of IGMP.

In fact, the following command can be used to look at multicast and IGMP traffic at the same time:

[root@esxi-hp-02:~] tcpdump-uw -i vmk2 multicast or igmp -v -s0

A common issue is that the vSAN Cluster is configured across multiple physical switches, and while
multicast has been enabled on one switch, it has not been enabled across switches. In this case, the
cluster forms with two ESXi hosts in one partition, and another ESXi host (connected to the other
switch) is unable to join this cluster. Instead it forms its own vSAN cluster in another partition.
Remember that the vsan.lldpnetmap command seen earlier can assist in determining network
configuration, and which hosts are attached to which switch.

Figure B.5: Multicast misconfiguration

Symptoms of a multicast misconfiguration issue

Other than the vSAN Cluster status displaying a network misconfiguration issue, there are some other
telltale signs, when trying to form a vSAN Cluster, that multicast may be the issue. Assume at this
point that the checklist for subnet, VLAN and MTU has been followed, and that each host in the cluster
can vmkping every other host in the cluster.

If there is a multicast issue when the cluster is created, the next symptom you will observe is that each
ESXi host forms its own vSAN cluster, with itself as the master. It will also have a unique network
partition id. This symptom suggests that there is no multicast between any of the hosts.

However, if there is a situation where a subset of the ESXi hosts form a cluster, and another subset
form another cluster, and each have unique partitions with their own master, backup and perhaps even
agent nodes, this may very well be a situation where multicast is enabled in-switch, but not across
switches. In situations like this, vSAN shows hosts on the first physical switch forming their own
cluster partition, and hosts on the second physical switch forming their own cluster partition too, each
with its own “master”. If you can verify which switches the hosts in the cluster connect to, and that the
hosts in each partition are connected to the same switch, then this may well be the issue.

8.17 Checking performance of vSAN network

One of the most important aspects of networking is making sure that there is sufficient bandwidth
between your ESXi hosts. This tool may assist you in testing that your vSAN network is performing
optimally.

iperf

To check the performance of the vSAN network, a commonly used tool is iperf, which measures
maximum TCP bandwidth and latency. As of vSAN 6.0, a version of iperf is available on the ESXi 6.0
hosts. It can be found in /usr/lib/vmware/vsan/bin/iperf. Run it with the --help option to
see how to use the various options. Once again, this tool can be used to check network bandwidth and
latency between ESXi hosts participating in a vSAN cluster.

VMware KB 2001003 referenced previously can assist with setup and testing.

Needless to say, this is most useful when a vSAN cluster is being commissioned. Running iperf tests
on the vSAN network when the cluster is already in production may impact the performance of the
virtual machines running on the cluster.
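
A typical arrangement is to run the iperf server on one host, bound to its vSAN IP, and the client on another. The sketch below simply prints example invocations rather than executing them; the IP address, duration and reporting interval are placeholders, and the binary path is the one noted above.

```shell
# Sketch: print a server/client iperf pairing for a point-to-point bandwidth
# test on the vSAN network. All values are examples to adapt.
IPERF=/usr/lib/vmware/vsan/bin/iperf
echo "on the receiving host: $IPERF -s -B 172.40.0.10"
echo "on the sending host:   $IPERF -c 172.40.0.10 -t 30 -i 3"
```

Binding the server with -B ensures the test runs over the vSAN vmknic rather than the management network.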

8.18 Checking vSAN network limits

vsan.check_limits

This command verifies that none of the in-built vSAN thresholds are being breached.

> ls
0 /
1 vcsa-04.rainpole.com/
> cd 1
/vcsa-04.rainpole.com> ls
0 Datacenter (datacenter)
/vcsa-04.rainpole.com> cd 0
/vcsa-04.rainpole.com/Datacenter> ls
0 storage/
1 computers [host]/
2 networks [network]/
3 datastores [datastore]/
4 vms [vm]/
/vcsa-04.rainpole.com/Datacenter> cd 1
/vcsa-04.rainpole.com/Datacenter/computers> ls
0 Cluster (cluster): cpu 155 GHz, memory 400 GB
1 esxi-dell-e.rainpole.com (standalone): cpu 38 GHz, memory 123 GB
2 esxi-dell-f.rainpole.com (standalone): cpu 38 GHz, memory 123 GB
3 esxi-dell-g.rainpole.com (standalone): cpu 38 GHz, memory 123 GB
4 esxi-dell-h.rainpole.com (standalone): cpu 38 GHz, memory 123 GB
/vcsa-04.rainpole.com/Datacenter/computers> vsan.check_limits 0
2017-03-14 16:09:32 +0000: Querying limit stats from all hosts ...
2017-03-14 16:09:34 +0000: Fetching vSAN disk info from esxi-dell-
m.rainpole.com (may take a moment) ...
2017-03-14 16:09:34 +0000: Fetching vSAN disk info from esxi-dell-
n.rainpole.com (may take a moment) ...
2017-03-14 16:09:34 +0000: Fetching vSAN disk info from esxi-dell-
o.rainpole.com (may take a moment) ...
2017-03-14 16:09:34 +0000: Fetching vSAN disk info from esxi-dell-
p.rainpole.com (may take a moment) ...
2017-03-14 16:09:39 +0000: Done fetching vSAN disk infos
+--------------------------+--------------------
+-----------------------------------------------------------------+
| Host | RDT | Disks
+--------------------------+--------------------
+-----------------------------------------------------------------+

| esxi-dell-
m.rainpole.com | Assocs: 1309/45000 | Components: 485/9000
| | Sockets: 89/10000 | naa.500a075113019b33: 0% Components: 0/0
| | Clients: 136 | naa.500a075113019b37: 40% Components: 81/47661
| | Owners: 138 | t10.ATA_____Micron_P420m2DMTFDGAR1T4MAX_____ 0
| | | naa.500a075113019b41: 37% Components: 80/47661
| | | naa.500a07511301a1eb: 38% Components: 81/47661
| | | naa.500a075113019b39: 39% Components: 79/47661
| | | naa.500a07511301a1ec: 41% Components: 79/47661
<<truncated>>

From a network perspective, it is the RDT associations (“Assocs”) and sockets count that most interest
us. There are 45,000 associations per host in vSAN 6.0 and later. An RDT association is used to track
peer-to-peer network state within vSAN. vSAN is sized such that it should never run out of RDT
associations. vSAN also limits how many TCP sockets it is allowed to use, and is sized such that it
should never run out of its allocation of TCP sockets. As can be seen, there is a limit of 10,000 sockets
per host.
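
The used/limit pairs in this output are easier to judge as percentages. A small sketch, using the Assocs value from the sample output above:

```shell
# Sketch: express an "Assocs: 1309/45000" style counter as a percentage.
usage="1309/45000"
awk -v u="$usage" 'BEGIN { split(u, a, "/"); printf "%.1f%%\n", 100 * a[1] / a[2] }'
```

Values approaching 100% would indicate that a host is nearing one of these in-built limits.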

A vSAN client represents object access in the vSAN cluster. More often than not, the client will
represent a virtual machine running on a host. Note that the client and the object may not be on the
same host. There is no hard-defined limit, but this metric is shown to help understand how “clients”
balance across hosts.

There is always one and only one vSAN owner for a given vSAN object, typically co-located with the
vSAN client accessing this object. vSAN owners coordinate all access to the vSAN object and
implement functionality like mirroring and striping. There is no hard defined limit, but this metric is
once again shown to help understand how “owners” balance across hosts.

8.19 Physical network switch feature interoperability

There have been situations where certain features, when enabled on a physical network switch, did not
interoperate correctly. In one example, a customer attempted to use multicast with jumbo frames, and
because of the inability of the network switch to handle both these features, it impacted the whole of
the vSAN network. Note that many other physical switches handled this perfectly; this was an issue
with one switch vendor only.

Pay due diligence to whether or not the physical network switch has the ability to support multiple
network features enabled concurrently.

9. Appendix C
Checklist summary for vSAN networking

9.1 Appendix C: Checklist summary for vSAN networking

1. Shared 10Gb NIC or dedicated 1Gb NIC?
2. Redundant NIC teaming connections set up?
3. VMkernel port for vSAN network traffic configured on each host?
4. Identical VLAN, MTU and subnet across all interfaces?
5. Run vmkping successfully between all hosts? Health check will verify this.
6. If jumbo frames in use, run vmkping successfully with 9000 packet size between all hosts?
Health check will verify this.
7. Network layout check – single physical switch or multiple physical switches?
8. vSAN < v6.6? Multicast allowed on the network?
9. vSAN < v6.6? & multiple vSAN clusters on same network? Modify multicast configurations
so that unique multicast addresses are being used.
10. If vSAN < v6.6 and vSAN spans multiple switches, is multicast configured across switches?
11. If vSAN < v6.6 and routed, is PIM configured to allow multicast to be routed?
12. Multicast issue? Verify that ESXi hosts are reporting IGMP membership if IGMP used?
13. Is flow control enabled on the ESXi host NICs?
14. Test vSAN network performance with iperf - meets expectations?
15. Network performance issues? Any excessive dropped packets or pause frames?
16. Check network limits – within acceptable margins?
17. Ensure that the physical switch can meet vSAN requirements (multicast, flow control,
feature interop)

10. Appendix D
vSAN 6.6 versioning change

10.1 Appendix D: vSAN 6.6 versioning change

As of vSAN 6.6, the versioning of vSAN releases depends on the version number of the vSAN on-disk
format. Going forward, every vSAN software release will map to a single vSAN on-disk format version.
As of vSAN 6.6, the disk format version is 5.0.

Versioning becomes very important during “rolling upgrades”.

As discussed in the existing vSAN administration and upgrade guides, vSAN is upgraded in two
phases: first the hypervisor software, vSphere (vCenter and ESXi), and then, depending on the source
version of vSAN, the vSAN on-disk format. The table below describes the existing versioning up to
and including vSAN 6.6.

Disk Format Version    vSphere Release           vSAN Version
Version 1              vSphere 5.5 U1, U2, U3    5.5
Version 2              vSphere 6.0               6.0
Version 2              vSphere 6.0 U1            6.1
Version 3              vSphere 6.0 U2, U3        6.2
Version 3              vSphere 6.5               6.5
Version 5              vSphere 6.5 EP2           6.6

Table 7. Versioning

To illustrate this, when upgrading from an earlier vSAN release to vSAN 6.6, the vSAN node software
version is updated first. At that point the cluster is in a mixed mode: software version 6.6 on the hosts,
but a lower disk format version (version 3, for example). This does not yet constitute vSAN version 6.6.

While there was a loose association between versions of the vSphere software and disk group on-disk
formats in earlier versions of vSAN, the disk group version is now used as the source of vSAN
versioning information.

To complete a vSAN 6.6 upgrade under the new versioning scheme, the on-disk format of the vSAN
disk groups must be upgraded to version 5.0.
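
During a rolling upgrade, it can be useful to confirm which on-disk format versions are currently present. The sketch below parses captured sample text; on a live host the input would come from esxcli vsan storage list, which reports a format version per claimed disk (treat the exact field name and values here as assumptions to verify against your build).

```shell
# Sketch: list the distinct on-disk format versions present in the cluster.
# The sample lines are illustrative, not from a real host.
sample='   Format Version: 3
   Format Version: 5
   Format Version: 5'

echo "$sample" | awk -F': ' '/Format Version/ { print $2 }' | sort -u
```

More than one distinct value means the on-disk format upgrade has not yet completed on every disk group.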

11. Appendix E
vCenter Recovery Example with Unicast vSAN

11.1 vCenter Recovery Example with Unicast vSAN

In this example, we are simulating a scenario where vCenter is no longer available and a new vCenter
has been deployed. We highlight the steps needed to avoid any interruption in storage I/O on a
functioning vSAN Cluster.

VMware strongly recommends having a good backup strategy for your VCSA. This will make the
recovery process much simpler.

However, let’s assume that it is not possible to restore the original vCenter server. Therefore, we have a
situation where:

• vSAN Cluster is still operational
• vSAN Datastore is intact
• vSAN nodes are communicating successfully
• But there is no vCenter server

Before deploying and attaching to a new vCenter instance, it is advisable to verify vSAN health and cluster membership, specifically node unicast membership. In the absence of vCenter, vSAN cluster state can be verified via a member node's HTML5 client. vSAN health features are now available at the local node level, a capability introduced in vSAN 6.6. For example, by logging directly into a vSAN node we can see vSAN health from that node's view of the cluster.

Figure E.1. vSAN Health from HTML5 client

This may help highlight any issues that may be of concern while vCenter is unavailable.

We can also interrogate a vSAN 6.6 node's health and cluster membership via esxcli. This again offers a command-line-driven vSAN health view from a given vSAN node's perspective. Here is the output from esxcli vsan health cluster list:


Figure E.2. vSAN Health from ESXCLI

The command esxcli vsan cluster get displays the current cluster membership and status:

Figure E.3. vSAN cluster membership and status from ESXCLI

The command esxcli vsan cluster unicastagent list displays the unicast agent member list from a given vSAN node's point of view:

Figure E.4. vSAN network mode


11.2 vSAN Node preparation

Note: Before adding hosts to vCenter, ensure that on each vSAN node the advanced setting /VSAN/IgnoreClusterMemberListUpdates is set to "1".

This is the MOST important step to take!
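This can be set from the ESXi shell on each node. The esxcfg-advcfg form is used elsewhere in this guide; the esxcli form is shown as an equivalent:

```
# Set on each vSAN node before adding it to the new vCenter:
esxcfg-advcfg -s 1 /VSAN/IgnoreClusterMemberListUpdates

# Equivalent esxcli form:
esxcli system settings advanced set -o /VSAN/IgnoreClusterMemberListUpdates -i 1

# Verify the current value:
esxcli system settings advanced list -o /VSAN/IgnoreClusterMemberListUpdates
```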

11.3 vCenter Preparation

1. Deploy a new vCenter instance, using a vCenter Server version that supports vSAN 6.6 (vCenter Server 6.5.0d or later).
2. Create a new vSphere cluster and replicate the previous configuration, such as HA and DRS settings, where possible.
3. From a vSAN clustering perspective, ensure the vSAN checkbox is selected, as an existing vSAN cluster will be added to the new cluster object.

Figure E.5. Enable vSAN

11.4 Adding vSAN nodes to vCenter

Add vSAN nodes by right-clicking the vSphere cluster object and selecting the Add Host action.

Repeat the Add Host procedure until all vSAN nodes are added. Alternatively, use command-line tools such as PowerCLI to add hosts to the recovery cluster.
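A minimal PowerCLI sketch, assuming a connection to the new vCenter has already been made with Connect-VIServer and that the cluster is named Recovery; the hostnames and credentials below are placeholders:

```
# Placeholder hostnames and credentials; substitute your own values
$hostnames = "esxi-j.lab.local", "esxi-k.lab.local", "esxi-l.lab.local"
foreach ($name in $hostnames) {
    Add-VMHost -Name $name -Location (Get-Cluster -Name Recovery) `
        -User root -Password 'VMware1!' -Force
}
```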

Once all vSAN nodes have been added, and you have verified that they are communicating in unicast mode, the next step is to set the advanced setting back to its default. On each vSAN node, change the advanced host setting /VSAN/IgnoreClusterMemberListUpdates back to the default value of "0". PowerCLI can also be used to apply the setting to all nodes in a vSAN cluster:

$vmhosts = Get-Cluster -Name Recovery | Get-VMHost
foreach ($vmhost in $vmhosts) {
    Get-AdvancedSetting -Entity $vmhost -Name VSAN.IgnoreClusterMemberListUpdates |
        Set-AdvancedSetting -Value "0" -Confirm:$false
}

Note: If features such as deduplication and compression, encryption, or stretched cluster were previously enabled, they should be re-enabled. Step-by-step guidance for those services is out of scope for this networking guide, as is another necessary step, the recovery of VM storage policies.


12. Appendix F
Bootstrapping a vSAN 6.6 Unicast Cluster


12.1 Bootstrapping a vSAN 6.6 unicast cluster

In this section, we explore how a vSAN 6.6 cluster using unicast network communications can be created without the presence of vCenter. In the scenario below, we have three vSAN 6.6 nodes: esxi-j, esxi-k, and esxi-l.

Figure F.1. vSAN Cluster

Each node has:

• Two physical 10GbE uplinks, with jumbo frames enabled


• One VLAN (200) designated for vSAN trunked to each interface
• 2 x SSD disks per vSAN node (All Flash)
• Management Network is already configured

We are going to approach the problem as follows:

• Node 1 configuration steps
  • Create appropriate vSS switches and VMkernel interfaces
  • Create vSAN interface
  • Create vSAN disk group (a v5.0 disk group ensures vSAN defaults to unicast)
  • Create vSAN cluster
• Node 2 configuration steps
  • Create appropriate vSS switches and VMkernel interfaces
  • Create vSAN interface
  • Join existing vSAN cluster
  • Establish vSAN communications to Node 1
  • Create vSAN disk group
• Node 3 configuration steps
  • Create appropriate vSS switches and VMkernel interfaces
  • Create vSAN interface
  • Join existing vSAN cluster
  • Establish vSAN communications to Node 2 and Node 1
  • Create vSAN disk group
• Complete the unicast manual configuration and establish the remaining bi-directional communications to Node 3

Node 1 Configuration Steps

Here we go through all the steps needed to configure a single-node vSAN cluster.

1. Create the vSwitch, set the MTU to 9000, and add an uplink:

#esxcli network vswitch standard add -v vSwitch1


#esxcli network vswitch standard set -m 9000 -v vSwitch1
#esxcli network vswitch standard uplink add -u vmnic0 -v vSwitch1

2. Create the vSAN VMkernel interface (port group with VLAN and static IPv4 vmknic):

#esxcli network vswitch standard portgroup add -p vSAN -v vSwitch1


#esxcli network vswitch standard portgroup set -p vSAN -v 200
#esxcli network ip interface add -i vmk1 -m 9000 -p vSAN
#esxcli network ip interface ipv4 set -i vmk1 -I 172.200.0.122 -N 255.255.255.0 -t static
#esxcli vsan network ip add -i vmk1 -T vsan

3. Create one disk group (mark one SSD as capacity flash):

#esxcli vsan storage tag add -d naa.500a07510f86d684 -t capacityFlash


#esxcli vsan storage add -d naa.500a07510f86d684 -s naa.55cd2e404c31f8f0

4. Enable vSAN clustering by creating a new vSAN cluster

#esxcli vsan cluster new


#esxcli vsan cluster get

Cluster Information

Enabled: true
Current Local Time: 2017-03-27T11:25:07Z
Local Node UUID: 58d8ef50-0927-a55a-3678-246e962f48f8
Local Node Type: NORMAL
Local Node State: MASTER
Local Node Health State: HEALTHY
Sub-Cluster Master UUID: 58d8ef50-0927-a55a-3678-246e962f48f8
Sub-Cluster Backup UUID:
Sub-Cluster UUID: 52258811-6e7b-43f6-7171-459d5cd6a303
Sub-Cluster Membership Entry Revision: 0
Sub-Cluster Member Count: 1
Sub-Cluster Member UUIDs: 58d8ef50-0927-a55a-3678-246e962f48f8
Sub-Cluster Membership UUID: d1f2d858-700a-7202-3461-246e962f48f8
Unicast Mode Enabled: true
Maintenance Mode State: OFF


Document Node 1's Local Node UUID, Sub-Cluster UUID, and vSAN VMkernel IP address; they are needed in the remaining steps.
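These values can be captured quickly from the ESXi shell on Node 1; the grep patterns below simply filter the output shown above:

```
# Local Node UUID and Sub-Cluster UUID:
esxcli vsan cluster get | grep -E "Local Node UUID|Sub-Cluster UUID"

# vSAN VMkernel IPv4 address:
esxcli network ip interface ipv4 get -i vmk1
```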

Node 2 Configuration Steps

1. Repeat the VMkernel networking and disk group creation on Node 2:

#esxcli network vswitch standard add -v vSwitch1


#esxcli network vswitch standard set -m 9000 -v vSwitch1
#esxcli network vswitch standard uplink add -u vmnic0 -v vSwitch1
#esxcli network vswitch standard portgroup add -p vSAN -v vSwitch1
#esxcli network vswitch standard portgroup set -p vSAN -v 200
#esxcli network ip interface add -i vmk1 -m 9000 -p vSAN
#esxcli network ip interface ipv4 set -i vmk1 -I 172.200.0.123 -N 255.255.255.0 -t static
#esxcli vsan network ip add -i vmk1 -T vsan
#esxcli vsan storage tag add -d naa.500a07510f86d69c -t capacityFlash
#esxcli vsan storage add -d naa.500a07510f86d69c -s naa.55cd2e404c31e2c7

2. Join Node 2 to vSAN Cluster created on Node 1 (use the Sub Cluster UUID entry):

#esxcli vsan cluster join -u 52258811-6e7b-43f6-7171-459d5cd6a303


#esxcli vsan cluster get

Cluster Information

Enabled: true
Current Local Time: 2017-03-27T11:31:30Z
Local Node UUID: 58d8ef12-bda6-e864-9400-246e962c23f0
Local Node Type: NORMAL
Local Node State: MASTER
Local Node Health State: HEALTHY
Sub-Cluster Master UUID: 58d8ef12-bda6-e864-9400-246e962c23f0
Sub-Cluster Backup UUID:
Sub-Cluster UUID: 52258811-6e7b-43f6-7171-459d5cd6a303
Sub-Cluster Membership Entry Revision: 0
Sub-Cluster Member Count: 1
Sub-Cluster Member UUIDs: 58d8ef12-bda6-e864-9400-246e962c23f0
Sub-Cluster Membership UUID: a5f7d858-7836-8c82-bfaa-246e962c23f0
Unicast Mode Enabled: true

3. Configure unicast on Node 2 to establish CMMDS communication to Node 1. The required parameters from Node 1 are its vSAN VMkernel IP address, its Local Node UUID, and the UDP port (default 12321):

#esxcli vsan cluster unicastagent add -a 172.200.0.122 -i vmk1 -p 12321 \
  -U true -t node -u 58d8ef50-0927-a55a-3678-246e962f48f8

At this point we need to let Node 1 know that a new agent is available, so that two-way CMMDS communications can be established. As a result, we must now add a unicast entry on Node 1 to tell it Node 2 is available:

#esxcli vsan cluster unicastagent add -a 172.200.0.123 -i vmk1 -p 12321 \
  -U true -t node -u 58d8ef12-bda6-e864-9400-246e962c23f0

We should now have a 2 node vSAN Cluster:

#esxcli vsan cluster get


Node 1 should now have one unicast agent entry (Node 2)

Node 2 should now have one unicast agent entry (Node 1)
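These memberships can be confirmed from the ESXi shell on each node; at this stage the expected member count is 2:

```
# Should list exactly one entry (the other node):
esxcli vsan cluster unicastagent list

# Should report: Sub-Cluster Member Count: 2
esxcli vsan cluster get | grep "Sub-Cluster Member Count"
```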

Node 3 Configuration Steps

1. Repeat VMkernel Networking and Disk group creation on Node 3 as per previous steps:

# esxcli network vswitch standard add -v vSwitch1


# esxcli network vswitch standard set -m 9000 -v vSwitch1
# esxcli network vswitch standard uplink add -u vmnic0 -v vSwitch1
# esxcli network vswitch standard portgroup add -p vSAN -v vSwitch1
# esxcli network vswitch standard portgroup set -p vSAN -v 200
# esxcli network ip interface add -i vmk1 -m 9000 -p vSAN
# esxcli network ip interface ipv4 set -i vmk1 -I 172.200.0.124 -N 255.255.255.0 -t static
# esxcli vsan network ip add -i vmk1 -T vsan
# esxcli vsan storage tag add -d naa.500a07510f86d6cf -t capacityFlash
# esxcli vsan storage add -d naa.500a07510f86d6cf -s naa.55cd2e404c31f9a9

2. Join Node 3 to vSAN Cluster:

# esxcli vsan cluster join -u 52258811-6e7b-43f6-7171-459d5cd6a303


# esxcli vsan cluster get

Cluster Information

Enabled: true
Current Local Time: 2017-03-27T12:43:41Z
Local Node UUID: 58d8ef61-a37d-4590-db1d-246e962f4978
Local Node Type: NORMAL
Local Node State: MASTER
Local Node Health State: HEALTHY
Sub-Cluster Master UUID: 58d8ef61-a37d-4590-db1d-246e962f4978
Sub-Cluster Backup UUID:
Sub-Cluster UUID: 52258811-6e7b-43f6-7171-459d5cd6a303
Sub-Cluster Membership Entry Revision: 0
Sub-Cluster Member Count: 1
Sub-Cluster Member UUIDs: 58d8ef61-a37d-4590-db1d-246e962f4978


Sub-Cluster Membership UUID: 8608d958-ff62-e122-ae3a-246e962f4978


Unicast Mode Enabled: true
Maintenance Mode State: OFF

3. Configure Unicast on Node 3 to establish CMMDS communication to Node 1 and Node 2

Node 3 -> Node 1

# esxcli vsan cluster unicastagent add -a 172.200.0.122 -i vmk1 -p 12321 \
  -U true -t node -u 58d8ef50-0927-a55a-3678-246e962f48f8

Node 3 -> Node 2

# esxcli vsan cluster unicastagent add -a 172.200.0.123 -i vmk1 -p 12321 \
  -U true -t node -u 58d8ef12-bda6-e864-9400-246e962c23f0

Complete Unicast Manual Configuration

Now we must ensure all Nodes are configured to establish communications:

Node 1 -> Node 3

#esxcli vsan cluster unicastagent add -a 172.200.0.124 -i vmk1 -p 12321 \
  -U true -t node -u 58d8ef61-a37d-4590-db1d-246e962f4978

Node 2 -> Node 3

#esxcli vsan cluster unicastagent add -a 172.200.0.124 -i vmk1 -p 12321 \
  -U true -t node -u 58d8ef61-a37d-4590-db1d-246e962f4978

Each node's view can now be confirmed with esxcli vsan cluster unicastagent list: Node 1 should list Nodes 2 and 3, Node 2 should list Nodes 1 and 3, and Node 3 should list Nodes 1 and 2.

All Nodes should now form a cluster:


Verification and Next Steps

From each vSAN node's perspective, the vSAN cluster should now report the following vSAN health results for the network and cluster partition checks. The output below shows the expected result: overall network health is red because the cluster nodes are not connected to a managing vCenter Server:

# esxcli vsan health cluster list


Health Test Name Status
-------------------------------------------------- ----------
Overall health red (Network misconfiguration)
Network red
Hosts disconnected from VC green
Hosts with connectivity issues red
vSAN cluster partition green
All hosts have a vSAN vmknic configured green
All hosts have matching subnets green
vSAN: Basic (unicast) connectivity check green
vSAN: MTU check (ping with large packet size) green
vMotion: Basic (unicast) connectivity check green
vMotion: MTU check (ping with large packet size) green
Network latency check green

Until a vCenter Server is managing the cluster, a number of health tests are expected to fail. Before deploying vCenter Server, the following advanced parameter MUST be set on all nodes:

# esxcfg-advcfg -s 1 /VSAN/IgnoreClusterMemberListUpdates

For guidance on deploying vCenter on an unmanaged vSAN Cluster, please review Appendix E.
