
Enterprise Cloud

Administration (ECA) 5.15


Course Guide
Copyright

COPYRIGHT
Copyright 2020 Nutanix, Inc.

Nutanix, Inc.
1740 Technology Drive, Suite 150
San Jose, CA 95110

All rights reserved. This product is protected by U.S. and international copyright and intellectual
property laws. Nutanix and the Nutanix logo are registered trademarks of Nutanix, Inc. in the
United States and/or other jurisdictions. All other brand and product names mentioned herein
are for identification purposes only and may be trademarks of their respective holders.

License
The provision of this software to you does not grant any licenses or other rights under any
Microsoft patents with respect to anything other than the file server implementation portion of
the binaries for this software, including no licenses or any other rights in any hardware or any
devices or software that are used to communicate with or in connection with this software.

Conventions
Convention            Description
variable_value        The action depends on a value that is unique to your environment.
ncli> command         The commands are executed in the Nutanix nCLI.
user@host$ command    The commands are executed as a non-privileged user (such as nutanix) in the system shell.
root@host# command    The commands are executed as the root user in the vSphere or Acropolis host shell.
> command             The commands are executed in the Hyper-V host shell.
output                The information is displayed as output from a command or in a log file.
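
For example (illustrative only; the exact commands depend on your environment and software versions), the conventions above correspond to commands such as:

ncli> cluster info                 # run from within the Nutanix nCLI shell
nutanix@cvm$ cluster status        # run as the nutanix user on a Controller VM
root@host# ovs-vsctl show          # run as the root user on an AHV host
> Get-VM                           # run in the Hyper-V host (PowerShell) shell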

Version A.1
Last modified: March 19, 2020

Contents

Copyright...................................................................................................................2
License.................................................................................................................................................................. 2
Conventions........................................................................................................................................................ 2
Version.................................................................................................................................................................. 2

Module 1:  Introduction......................................................................................... 9


Overview..............................................................................................................................................................9
Nutanix Concepts............................................................................................................................................ 9
Traditional Three-Tier Architecture............................................................................................. 9
The Nutanix Enterprise Cloud...................................................................................................... 10
Nodes, Blocks, and Clusters........................................................................................................... 11
Node and Block Example................................................................................................................ 11
Cluster Example.................................................................................................................................. 12
Nutanix Cluster Components........................................................................................................ 13
Controller VM (CVM)........................................................................................................................ 15
Creating a Cluster with Different Products............................................................................. 15
Storage Concepts...........................................................................................................................................16
Enterprise Cloud Storage Components....................................................................................16
Storage Pool.........................................................................................................................................16
Storage Container.............................................................................................................................. 17
vDisk.........................................................................................................................................................17
The Nutanix Enterprise Cloud: Acropolis and Prism....................................................................... 18
Acropolis................................................................................................................................................ 18
AOS.......................................................................................................................................................... 19
AHV.......................................................................................................................................................... 19
Prism.......................................................................................................................................................20
Resources...........................................................................................................................................................21
The Support Portal............................................................................................................................ 21
Other Resources................................................................................................................................ 22
Nutanix Training and Certification......................................................................................................... 22
Nutanix Training.................................................................................................................................22
Nutanix Certification........................................................................................................................ 23
Knowledge Check......................................................................................................................................... 24

Module 2: Managing the Nutanix Cluster...................................................25


Overview............................................................................................................................................................25
Prism Overview.............................................................................................................................................. 25
Prism Licensing.............................................................................................................................................. 25
Prism Features................................................................................................................................................28
Infrastructure Management...........................................................................................................28
Performance Monitoring.................................................................................................................29
Operational Insight (Prism Pro).................................................................................................. 29
Capacity Planning (Prism Pro).................................................................................................... 30
Management Interfaces.............................................................................................................................. 30
Prism Element Initial Configuration.......................................................................................................30
Accessing Prism Element.............................................................................................................. 30
Enabling Pulse..................................................................................................................................... 31

Pulse Settings..................................................................................................................................... 32
What Gets Shared by Pulse?....................................................................................................... 32
Command Line Interfaces..........................................................................................................................33
Nutanix Command Line Interface (nCLI)................................................................................ 33
Acropolis Command Line Interface (aCLI).............................................................................35
PowerShell Cmdlets......................................................................................................................... 36
REST API...............................................................................................................................................38
Labs..................................................................................................................................................................... 38

Module 3: Securing the Nutanix Cluster.................................................... 39


Overview........................................................................................................................................................... 39
Security Overview......................................................................................................................................... 39
Nutanix Security Development Life Cycle..............................................................................39
Security in the Enterprise Cloud............................................................................................................ 40
Two-Factor Authentication.......................................................................................................... 40
Cluster Lockdown.............................................................................................................................40
Key Management and Administration..................................................................................... 40
Security Technical Implementation Guides (STIGs)............................................................ 41
Data at Rest Encryption................................................................................................................ 42
Configuring Authentication.......................................................................................................................43
Changing Passwords....................................................................................................................... 43
Role-Based Access Control...................................................................................................................... 45
Gathering Requirements to Create Custom Roles..............................................................46
Built-in Roles.......................................................................................................................................46
Custom Roles......................................................................................................................................46
Configuring Role Mapping............................................................................................................ 47
Working with SSL Certificates.................................................................................................... 48
Labs.....................................................................................................................................................................48

Module 4:  Networking.......................................................................................49


Overview........................................................................................................................................................... 49
Default Network Configuration............................................................................................................... 49
Default Network Configuration (cont.)................................................................................................ 50
Open vSwitch (OVS)................................................................................................................................... 50
Bridges................................................................................................................................................................ 51
Ports..................................................................................................................................................................... 51
Bonds................................................................................................................................................................... 51
Bond Modes.........................................................................................................................................52
Virtual Local Area Networks (VLANs)..................................................................................................55
IP Address Management (IPAM).............................................................................................................56
Network Segmentation............................................................................................................................... 57
Configuring Network Segmentation for an Existing RDMA Cluster............................. 58
Network Segmentation During Cluster Expansion..............................................................58
Network Segmentation During an AOS Upgrade................................................................58
Reconfiguring the Backplane Network.................................................................................... 58
Disabling Network Segmentation...............................................................................................59
Unsupported Network Segmentation Configurations........................................................59
AHV Host Networking.................................................................................................................................59
Recommended Network Configuration................................................................................... 59
AHV Networking Terminology Comparison...........................................................................62
Labs..................................................................................................................................................................... 62

Module 5: Virtual Machine Management....................................................63


Overview........................................................................................................................................................... 63
Understanding Image Configuration..................................................................................................... 63
Overview...............................................................................................................................................64
Supported Disk Formats................................................................................................................64
Uploading Images............................................................................................................................. 65
Creating and Managing Virtual Machines in AHV............................................................................65
Creating a VM in AHV.................................................................................................................... 66
Creating a VM using Prism Self-Service (PSS)..................................................................... 67
Managing a VM.................................................................................................................................. 67
Supported Guest VM Types for AHV .................................................................................................. 68
Nutanix VirtIO................................................................................................................................................. 68
Nutanix Guest Tools.................................................................................................................................... 70
Overview...............................................................................................................................................70
NGT Requirements and Limitations............................................................................................71
Requirements and Limitations by Operating System......................................................... 71
Customizing a VM......................................................................................................................................... 72
Cloud-Init...............................................................................................................................................72
Sysprep.................................................................................................................................................. 73
Customizing a VM.............................................................................................................................73
Guest VM Data Management................................................................................................................... 74
Guest VM Data: Standard Behavior.......................................................................................... 74
Live Migration..................................................................................................................................... 75
High Availability................................................................................................................................. 75
Data Path Redundancy...................................................................................................................76
Labs..................................................................................................................................................................... 76

Module 6: Health Monitoring and Alerts.................................................... 77


Overview............................................................................................................................................................77
Health Monitoring.......................................................................................................................................... 77
Health Dashboard..........................................................................................................................................77
Configuring Health Checks............................................................................................................78
Setting NCC Frequency..................................................................................................................79
Collecting Logs.................................................................................................................................. 79
Analysis Dashboard......................................................................................................................................80
Understanding Metric and Entity Charts..................................................................................81
Alerts Dashboard............................................................................................................................................81
Alerts View............................................................................................................................................81
Events View.........................................................................................................................................84
Labs.....................................................................................................................................................................86

Module 7: Distributed Storage Fabric.........................................................87


Overview........................................................................................................................................................... 87
Understanding the Distributed Storage Fabric.................................................................................87
Data Storage Representation...................................................................................................................89
Storage Components.......................................................................................................................89
Understanding Snapshots and Clones................................................................................................. 90
Snapshots.............................................................................................................................................90
Clones vs Shadow Clones.............................................................................................................. 91
How Snapshots and Clones Impact Performance............................................................... 92
Capacity Optimization - Deduplication................................................................................................92
Deduplication Process.....................................................................................................................93
Deduplication Techniques............................................................................................................. 94
Capacity Optimization - Compression................................................................................................. 94
Compression Process...................................................................................................................... 95

Compression Technique Comparison....................................................................................... 95
Workloads and Dedup/Compression....................................................................................... 96
Deduplication and Compression Best Practices.................................................................. 96
Capacity Optimization - Erasure Coding............................................................................................ 96
EC-X Compared to Traditional RAID........................................................................................97
EC-X Process...................................................................................................................................... 98
Erasure Coding in Operation....................................................................................................... 99
Data Block Restore After a Block Failure.............................................................................100
Erasure Coding Best Practices..................................................................................................100
Viewing Overall Capacity Optimization.............................................................................................100
Hypervisor Integration................................................................................................................................101
Overview.............................................................................................................................................. 101
AHV........................................................................................................................................................ 101
vSphere................................................................................................................................................ 103
Hyper-V................................................................................................................................................105
Labs................................................................................................................................................................... 106

Module 8: Migrating Workloads to AHV.................................................. 107


Objectives....................................................................................................................................................... 107
Nutanix Move.................................................................................................................................................107
Nutanix Move Operations............................................................................................................ 108
Compatibility Matrix.......................................................................................................................109
Unsupported Features.................................................................................................................. 109
Configuring Nutanix Move...........................................................................................................109
Nutanix Move Migration............................................................................................................... 109
Downloading Nutanix Move........................................................................................................109
Labs.................................................................................................................................................................... 110

Module 9: Acropolis Services..........................................................................111


Overview............................................................................................................................................................ 111
Nutanix Volumes............................................................................................................................................ 111
Nutanix Volumes Use Cases.........................................................................................................112
iSCSI Qualified Name (IQN)......................................................................................................... 112
Challenge-Handshake Authentication Protocol (CHAP) ..................................................113
Attaching Initiators to Targets....................................................................................................113
Configuring a Volume Group for Shared Access................................................................ 114
Volume Group Connectivity Options....................................................................................... 114
Labs.....................................................................................................................................................................115
Nutanix Files....................................................................................................................................................115
Nutanix Files Architecture............................................................................................................ 116
Load Balancing and Scaling.........................................................................................................117
High Availability................................................................................................................................ 118

Module 10: Data Resiliency............................................................................ 122


Overview.......................................................................................................................................................... 122
Scenarios..........................................................................................................................................................122
CVM Unavailability...........................................................................................................................122
Node Unavailability......................................................................................................................... 124
Drive Unavailability......................................................................................................................... 124
Boot Drive (DOM) Unavailability...............................................................................................125
Network Link Unavailability.........................................................................................................127
Redundancy Factor 3.................................................................................................................................127
Block Fault Tolerant Data Placement................................................................................................. 129

Rack Fault Tolerance................................................................................................................................. 130
VM High Availability in Acropolis......................................................................................................... 130
Flash Mode.......................................................................................................................................... 131
Affinity and Anti-Affinity Rules for AHV............................................................................................133
Limitations of Affinity Rules....................................................................................................... 134
Labs................................................................................................................................................................... 134

Module 11: Data Protection.............................................................................136


Overview..........................................................................................................................................................136
VM-centric Data Protection Terminology..........................................................................................136
RPO and RTO Considerations................................................................................................................ 138
Time Stream...................................................................................................................................................138
Protection Domains.................................................................................................................................... 140
Concepts............................................................................................................................................. 140
Terminology........................................................................................................................................ 141
Protection Domain States............................................................................................................142
Protection Domain Failover and Failback............................................................................. 143
Leap Availability Zone...............................................................................................................................143
Availability Zone.............................................................................................................................. 144
License Requirements................................................................................................................... 144
Nutanix Software Requirements............................................................................................... 144
Networking Requirements........................................................................................................... 145
Labs................................................................................................................................................................... 146

Module 12: Prism Central................................................................................ 147


Overview..........................................................................................................................................................147
Prism Central Overview............................................................................................................................ 147
Prism Starter vs Prism Pro...................................................................................................................... 148
Deploying a New Instance of Prism Central.................................................................................... 149
Registering a Cluster to Prism Central...............................................................................................149
Unregistering a Cluster from Prism Central..................................................................................... 150
Prism Pro Features  .................................................................................................................................... 150
Customizable Dashboards.............................................................................................................151
Scheduled Reporting.......................................................................................................................151
Dynamic Monitoring........................................................................................................................152
Capacity Runway............................................................................................................................. 152
Multiple Cluster Upgrades............................................................................................................156
Labs....................................................................................................................................................................157

Module 13: Monitoring the Nutanix Cluster............................................. 158


Overview.......................................................................................................................................................... 158
Support Resources...................................................................................................................................... 158
Pulse...................................................................................................................................................... 159
Log File Analysis..............................................................................................................................159
FATAL Logs.......................................................................................................................................160
Command Line Tools.....................................................................................................................160
Linux Tools.......................................................................................................................................... 161
Nutanix Support Tools.................................................................................................................. 162
Labs....................................................................................................................................................................162

Module 14: Cluster Management and Expansion...................................163


Overview..........................................................................................................................................................163

Starting and Stopping a Cluster or Node......................................................................................... 163
Understanding Controller VM Access..................................................................................... 163
Cluster Shutdown Procedures....................................................................................................163
Shutting Down a Node................................................................................................................. 164
Starting a Node................................................................................................................................164
Stopping a Cluster.......................................................................................................................... 165
Starting a Cluster.............................................................................................................................165
Removing a Node from a Cluster.........................................................................................................168
Before You Begin............................................................................................................................ 168
Removing or Reconfiguring Cluster Hardware................................................................... 168
Expanding a Cluster................................................................................................................................... 169
Managing Licenses........................................................................................................................................171
Cluster Licensing Considerations............................................................................................... 171
Understanding AOS Prism and Add on Licenses............................................................... 172
Managing Your Licenses............................................................................................................... 173
Managing Licenses in a Dark Site.............................................................................................176
Reclaiming Your Licenses.........................................................................................................................177
Reclaiming Licenses with a Portal Connection....................................................................177
Reclaiming Licenses Without Portal Connection................................................................178
Upgrading Software and Firmware......................................................................................................179
Understanding Long Term Support and Short Term Support Releases................... 180
Before You Upgrade....................................................................................................................... 181
Lifecycle Manager (LCM) Upgrade Process...........................................................................181
Upgrading the Hypervisor and AOS on Each Cluster.......................................................182
Working with Life Cycle Manager............................................................................................ 183
Upgrading Recommended Firmware...................................................................................... 184
Labs................................................................................................................................................................... 186

Module 15: ROBO Deployments...................................................................187


Overview.......................................................................................................................................................... 187
Remote Office Branch Office..................................................................................................................187
Cluster Considerations............................................................................................................................... 187
Three-Node Clusters.......................................................................................................................188
Two-Node Clusters......................................................................................................................... 188
One-Node Clusters..........................................................................................................................188
Cluster Storage Considerations............................................................................................................. 189
Software Considerations...........................................................................................................................189
Hypervisor...........................................................................................................................................189
Witness VM Requirements...................................................................................................................... 190
Failure and Recovery Scenarios for Two-Node Clusters..............................................................191
Node Failure....................................................................................................................................... 191
Network Failure Between Nodes.............................................................................................. 192
Network Failure Between Node and Witness VM..............................................................192
Witness VM Failure......................................................................................................................... 192
Complete Network Failure...........................................................................................................193
Seeding............................................................................................................................................................ 194

Module 1
INTRODUCTION

Overview
After completing this module, you will be able to:

• Describe the Nutanix hyperconverged infrastructure solution.


• Describe the components of the Nutanix Enterprise Cloud: Acropolis and Prism.

• Explain the relationship between node, block, and cluster components.

• Identify where to find Nutanix resources.

• Recognize Nutanix training and certification options.

Nutanix Concepts
Traditional Three-Tier Architecture

Legacy infrastructure—with separate storage, storage networks, and servers—is not well
suited to meet the growing demands of enterprise applications or the fast pace of modern
business. The silos created by traditional infrastructure have become a barrier to change and
progress, adding complexity to every step from ordering to deployment to management. New
business initiatives require buy-in from multiple teams, and your organization must predict its
IT infrastructure needs 3 to 5 years in advance. As most IT teams know, this is almost impossible to
get right. In addition, vendor lock-in and increasing licensing costs are stretching budgets to the
breaking point.


The Nutanix Enterprise Cloud


Hyperconverged infrastructure combines x86-based compute and storage resources with
intelligent software to create flexible building blocks that replace legacy infrastructure
consisting of separate servers, storage networks, and storage arrays.

The Nutanix Enterprise Cloud is a converged, scale-out compute and storage system that is
purpose-built to host and store virtual machines.

The foundational unit for the cluster is a Nutanix node. Each node in the cluster runs a standard
hypervisor and contains processors, memory, and local storage (SSDs and hard disks).

A Nutanix Controller VM runs on each node, enabling the pooling of local storage from all nodes
in the cluster.

All nodes in a Nutanix cluster converge to deliver a unified pool of tiered storage and present
resources to VMs for seamless access. A global data system architecture integrates each
new node into the cluster, allowing you to scale the solution to meet the needs of your
infrastructure.

Nutanix HCI (hyperconverged infrastructure) uses off-the-shelf x86 servers with local flash
drives (SSD) and spinning hard disks (HDD) to create a cluster of compute and storage
resources.

• Easily scales out compute and storage resources with the addition of nodes.

• Tolerates one or two node failures with built-in resiliency.

• Restores resiliency after a node failure by replicating nonredundant data to other nodes.

• Provides a set of REST API calls that you can use for automation (see the example below).
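
As a hedged sketch of that last point, the following curl command queries the Prism Element v2.0 REST API for basic cluster information; the cluster IP and credentials are placeholders for your environment, and -k (which skips certificate verification) is appropriate only for lab use:

# Returns a JSON description of the cluster (prompts for the admin password)
curl -k -u admin https://<cluster-ip>:9440/PrismGateway/services/rest/v2.0/cluster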

The platform is 100% software-defined, hardware agnostic, and is supported on an increasing
number of vendor hardware platforms, including:

• Nutanix NX (Supermicro)


• Dell XC

• Lenovo HX

• Cisco UCS C-series and B-series

• HPE ProLiant

Nodes, Blocks, and Clusters

A node is an x86 server with compute and storage resources. A single Nutanix cluster can have
an unlimited number of nodes. Different hardware platforms are available to address varying
workload needs for compute and storage.

A block is a chassis that holds one to four nodes, and contains power, cooling, and the
backplane for the nodes. The number of nodes and drives depends on the hardware chosen for
the solution.

Node and Block Example


This graphic shows a 4-node block in which each node takes up one node position identified as
A, B, C, or D. In this block chassis example, the node ports are accessible from the rear of the
chassis and the storage is at the front.


A Nutanix block is a rack-mountable enclosure that contains one to four Nutanix nodes.
A Nutanix cluster can handle the failure of a single node when specific cluster conditions are
met. In the case where multiple nodes in a block fail, guest VMs can continue to run because
cluster configuration data has been replicated on other blocks.

Cluster Example

• A Nutanix cluster is a logical grouping of physical and logical components.


• The nodes in a block can belong to the same or different clusters.

• Joining multiple nodes in a cluster allows for the pooling of resources.

• Acropolis presents storage as a single pool via the Controller VM (CVM).

• As part of the cluster creation process, all storage hardware (SSDs, HDDs, and NVMe) is
presented as a single storage pool.


Nutanix Cluster Components

The Nutanix cluster has a distributed architecture, which means that each node in the cluster
shares in the management of cluster resources and responsibilities. Within each node, there are
software components (also known as AOS services) that perform specific tasks during cluster operation.

All components run on multiple nodes in the cluster and depend on connectivity between their
peers that also run the component. Most components also depend on other components for
information.
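
One quick, illustrative way to see these services on a running cluster is to run the cluster status command from any Controller VM; it lists each AOS service (Zeus, Medusa, Cassandra, Stargate, Curator, and others) and whether it is running on every CVM:

nutanix@cvm$ cluster status        # shows the state of AOS services on each CVM in the cluster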

Zookeeper

Zookeeper stores information about physical components, including their IP addresses,


capacities, and data replication rules, in the cluster configuration.

Zookeeper runs on either three or five nodes, depending on the redundancy factor (number of
data block copies) applied to the cluster. Zookeeper uses multiple nodes to prevent stale data
from being returned to other components. An odd number provides a method for breaking ties
if two nodes have different information.

Of these nodes, Zookeeper elects one as the leader. The leader receives all requests for
information and confers with the follower nodes. If the leader stops responding, a new
leader is elected automatically.

Zookeeper has no dependencies, meaning that it can start without any other cluster
components running.

Zeus

Zeus is an interface to access the information stored within Zookeeper and is the Nutanix
library that all other components use to access the cluster configuration.

A key element of a distributed system is a method for all nodes to store and update the
cluster's configuration. This configuration includes details about the physical components in the
cluster, such as hosts and disks, and logical components, like storage containers. 

Medusa

Distributed systems that store data for other systems (for example, a hypervisor that hosts
virtual machines) must have a way to keep track of where that data is. In the case of a Nutanix
cluster, it is also important to track where the replicas of that data are stored.


Medusa is a Nutanix abstraction layer that sits in front of the database that holds metadata.
The database is distributed across all nodes in the cluster, using a modified form of Apache
Cassandra.

Cassandra

Cassandra is a distributed, high-performance, scalable database that stores all metadata about
the guest VM data stored in a Nutanix datastore. 

Cassandra runs on all nodes of the cluster. The Cassandra monitor (Level-2) periodically sends a
heartbeat to the Cassandra daemon that includes information about the load, schema, and health of all the
nodes in the ring. The Cassandra monitor depends on Zeus/Zookeeper for this information.

Stargate

A distributed system that presents storage to other systems (such as a hypervisor) needs a
unified component for receiving and processing the data that is sent to it. The Nutanix cluster has a
software component called Stargate that manages this responsibility.

All read and write requests are sent across an internal vSwitch to the Stargate process running
on that node.

Stargate depends on Medusa to gather metadata and Zeus to gather cluster configuration data.

From the perspective of the hypervisor, Stargate is the main point of contact for the Nutanix
cluster.

Note: If Stargate cannot reach Medusa, the log files include an HTTP timeout. Zeus
communication issues can include a Zookeeper timeout.

Curator

A Curator master node periodically scans the metadata database and identifies cleanup and
optimization tasks that Stargate should perform. Curator shares analyzed metadata across
other Curator nodes.

Curator depends on Zeus to learn which nodes are available, and Medusa to gather metadata.
Based on that analysis, it sends commands to Stargate.


Controller VM (CVM)

What makes a node “Nutanix” is the Controller VM (CVM).

• There is one CVM per node. 

- CVMs linked together across multiple nodes form a cluster.

- The CVM has direct access to the local SSDs and HDDs of the node.

- A CVM communicates with all other cluster CVMs across a network to pool storage
resources from all nodes. This is the Distributed Storage Fabric (DSF).

• The CVM provides the user interface known as the Prism web console.

• The CVM allows for cluster-wide operations of VM-centric software-defined services:
snapshots, clones, High Availability, Disaster Recovery, deduplication, compression, erasure
coding, storage optimization, and so on.

• Hypervisors (AHV, ESXi, Hyper-V, XenServer) communicate with DSF using the industry-
standard protocols NFS, iSCSI, and SMB3.

Creating a Cluster with Different Products


Note: Mixing nodes in a cluster from different vendors may cause problems. Always
check the Nutanix Hardware Compatibility List.

Due to the diversity of hardware platforms, there are several product mixing restrictions:

• Nodes with different Intel processor families can be part of the same cluster but cannot be
located on the same block.

• Hardware from different vendors cannot be part of the same cluster.

For more product mixing restrictions, please check the compatibility matrix in the Nutanix
Support Portal.

Vendors (OEM partners) include, but are not limited to:

• Lenovo Converged HX-Series


• Dell XC-Series

• Nutanix NX-Series by Supermicro

Note: Nutanix and OEM partner platforms use the same Acropolis software.

Models

You can have a mix of nodes with self-encrypting drives (SED) and standard (non-SED) disks;
however, you cannot use the Data-at-Rest Encryption (DARE) hardware encryption feature in that case. You can
mix models (nodes) in the same cluster, but not in the same block (physical chassis).

Additional Platforms

The Nutanix Enterprise Cloud is also available as a software option through local resellers on
Cisco, IBM, HPE x86 servers and on specialized rugged x86 platforms from Crystal and Klas
Telecom.

Storage Concepts
Enterprise Cloud Storage Components
Storage in a Nutanix Enterprise Cloud has both physical and logical components.

Storage Pool

A storage pool is a group of physical disks from all tiers and is a logical storage representation
of the physical drives from all nodes in the cluster. We apply a logical layer called a container to
the storage pool, which appears as a JBOD (just a bunch of disks) to the hypervisor.

A storage device (SSD or HDD) can only belong to one storage pool at a time. This can
provide physical storage separation between VMs if the VMs use different storage pools.

Nutanix recommends creating a single storage pool to hold all disks within the cluster.


Storage Container

A storage container is a logical segmentation of the storage pool. Storage containers are thin
provisioned and can be configured with compression, deduplication, replication factor and so
on.

• Contains configuration options (example: compression, deduplication, RF, and so on)

• Contains the virtual disks (vDisks) used by virtual machines. Selecting a storage pool for a
new container defines the physical storage where the vDisks are stored.

A container is the equivalent of a datastore for vSphere ESXi and a share for Hyper-V.
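
As a minimal sketch of how these storage objects appear from the nCLI (entity and parameter names may vary slightly between AOS versions, and the container and pool names shown are placeholders):

ncli> storagepool list                                 # show the storage pool(s) in the cluster
ncli> container list                                   # show the existing storage containers
ncli> container create name=ctr01 sp-name=default-sp   # create a container on a given storage pool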

vDisk

A vDisk is a logical component and is any file over 512 KB on DSF, including .vmdk files and VM hard
disks.

A vDisk is created on a storage container and is composed of extents that are grouped
and stored as an extent group. Extents consist of n number of contiguous blocks and are


dynamically distributed among extent groups to provide data striping across nodes/disks to
improve performance.

NTNX DSF/Stargate does not impose artificial limits on the vDisk size.

A VM virtual disk (such as a VM-flat.vmdk) and a VM swap file (VM.vswp) are also vDisks. 

The Nutanix Enterprise Cloud: Acropolis and Prism


Nutanix Enterprise Cloud is a model for IT infrastructure and platform services that delivers
the agility, simplicity, and pay-as-you-grow economics of a public cloud without sacrificing the
security, control, and performance of a private infrastructure. 

AOS provides all of the core services (storage, upgrades, replication, and so on), Prism provides the
control plane and management console, and AHV provides a free virtualization platform.

Acropolis

Acropolis is the foundation for a platform that starts with hyperconverged infrastructure
then adds built-in virtualization, storage services, virtual networking, and cross-hypervisor
application mobility.

For the complete list of features, see the Software Options page on the Nutanix website.

Nutanix delivers a hyperconverged infrastructure solution purpose-built for virtualization and
cloud environments. This solution brings the performance and economic benefits of web-scale
architecture to the enterprise through the Enterprise Cloud Platform, which includes two
product families: Nutanix Acropolis and Nutanix Prism.

Nutanix Acropolis includes three foundational components:

• Distributed Storage Fabric (DSF)

• App Mobility Fabric

• AHV

AHV is the hypervisor while DSF and App Mobility Fabric are functional layers in the Controller
VM (CVM).

Note: Acropolis also refers to the base software running on each node in the
cluster.


AOS
AOS is the base operating system that runs on each CVM.

AHV
Nutanix AHV is a comprehensive enterprise virtualization solution tightly integrated into
Acropolis and is provided with no additional license cost.

AHV delivers the features required to run enterprise applications, for example: 

• Combined VM operations and performance monitoring via Nutanix Prism

• Backup, disaster recovery, host and VM high availability

• Dynamic scheduling (intelligent placement and resource contention avoidance)

• Broad ecosystem support (certified Citrix ready, Microsoft validated via SVVP)

You manage AHV through the Prism web console (GUI), command line interface (nCLI/aCLI),
and REST APIs.

VM features

• Intelligent placement

• Live migration

• Converged Backup/DR

• Image management

• VM operations

• Analytics

• Data path optimization


Prism
Prism is the management plane: a unified management interface that generates actionable
insights for optimizing virtualization and supports infrastructure management and everyday
operations.

Prism gives Nutanix administrators an easy way to manage and operate their end-to-end virtual
environments. Prism includes two software components: Prism Element and Prism Central.

Prism Element

Prism Element provides a graphical user interface to manage most activities in a Nutanix
cluster.

Some of the major tasks you can perform using Prism Element include:

• View or modify cluster parameters.

• Create a storage container.

• Add nodes to the cluster.

• Upgrade the cluster to newer Acropolis versions.

• Update disk firmware and other upgradeable components.

• Add, update, and delete user accounts.

• Specify alert policies.

Prism Central

Prism Central provides multicluster management through a single web console and runs as a separate VM.

Note: We will cover both Prism Element and Prism Central in separate lessons
within this course.


Prism Interface

Prism is an end-to-end management solution for any virtualized datacenter, with additional
functionality for AHV clusters, and streamlines common hypervisor and VM tasks.

The information in Prism focuses on common operational tasks grouped into four areas:

• Infrastructure management

• Operational insight

• Capacity planning

• Performance monitoring

Prism provides one-click infrastructure management for virtual environments and is hypervisor
agnostic. With AHV installed, Prism and aCLI (Acropolis Command Line Interface) provide more
VM and networking options and functionality.

Resources
The Support Portal
Nutanix provides a variety of support services and materials through its support portal.


Other Resources

https://www.nutanix.com/resources

Nutanix also maintains a list of resources including whitepapers, solution briefs, ebooks, and
other support material.

https://next.nutanix.com

Nutanix also has a strong community of peers and professionals, the .NEXT community. Access
the community via the direct link shown here or from the Documentation menu in the Support
Portal. The community is a great place to get answers, learn about the latest topics, and lend
your expertise to your peers.

https://www.nutanix.com/support-services/training-certification/

An excellent place to learn and grow your expertise is with Nutanix training and certification.
Learn about other classes and get certified with us.

http://www.nutanixbible.com

The Nutanix Bible has become a valuable reference for those who want to learn about
hyperconvergence and web-scale principles or dig deep into Nutanix and hypervisor
architectures. The book explains these technologies in a way that is understandable to IT
generalists without compromising technical veracity.

Nutanix Training and Certification


Nutanix training and certification programs are designed to build your knowledge and skills for
the Enterprise Cloud era.

Nutanix Training
There are three modes of training: 

• Instructor-Led: The ideal choice when you need comprehensive coverage of Nutanix
administration, performance, and optimization for yourself or your team. Our expert
instructors provide hands-on classes with presentations mixed with dedicated time on our
hosted lab environment to give you a robust learning experience.

• Online: No matter what stage of your Enterprise Cloud journey you're on, you'll learn
something new and useful from our free online training. Designed with you in mind, these
online options offer a fresh, engaging, and interactive way of teaching you the fundamentals
you need to succeed.

• Online Plus: Combines the convenience of online training with the hands-on labs of an
instructor-led course. This flexibility makes it the ideal choice when you or your team need
to get up to speed quickly and efficiently. You’ll have 2 weeks to complete the online course
at your own pace, followed by a scheduled day of hands-on labs guided by an expert
instructor.

Nutanix Enterprise Cloud Administration (ECA)

This course teaches admins (system, network, and storage) how to successfully deploy
Nutanix in the datacenter. The course covers tasks Nutanix administrators perform, including
configuring and maintaining a Nutanix environment. It also introduces basic Nutanix
troubleshooting tools, offers tips for solving common problems and provides guidelines for
escalating problems to Nutanix Support.

Advanced Administration and Performance Management (AAPM)

This course features comprehensive coverage of performance management for Nutanix
clusters and details how to optimize performance, fix issues that slow down your system, and
improve datacenter performance. You'll learn through hands-on labs how to monitor system
performance and tuning, while also touching on advanced networking and storage to help
optimize datacenter administration.

Nutanix Certification

Nutanix technical certifications are designed to recognize the skills and knowledge you've
acquired to successfully deploy, manage, optimize, and scale your Enterprise Cloud. Earning
these certifications validates your proven abilities and your readiness to guide your
organization along the next phase of its Enterprise Cloud journey.

Nutanix Certified Professional (NCP)

NCP certification validates your skills and abilities in deploying, administering, and
troubleshooting Nutanix AOS 5.10 in the datacenter.

Nutanix Certified Advanced Professional (NCAP)

NCAP certification measures your ability to perform complex administrative tasks on a Nutanix
cluster, as well as optimize both virtualized workloads and infrastructure components in an AOS
5.10 deployment.

Nutanix Platform Expert (NPX)

Earning the elite NPX certification validates that you have demonstrated the ability to design
and deliver enterprise-class solutions on the Nutanix platform, using multiple hypervisors and
vendor software stacks.


Knowledge Check
1. Knowledge check

Module 2: MANAGING THE NUTANIX CLUSTER

Overview
Within this module, you will learn how to manage your Nutanix cluster using various tools. After
completing this module, you will know how to:
• Use Prism to monitor a cluster.

• Configure a cluster using Prism and CLI.

• Differentiate between Pulse and Alert.

• Download various tools like Prism Central, cmdlets, and REST API.

• Use the REST API Explorer to retrieve info or make changes to the cluster.

• Describe how to install and run Nutanix-specific PowerShell cmdlets.

Note: This module discusses various ways to manage your Nutanix cluster. First,
we’ll start with the Prism Element GUI, then move to the command line interfaces.
Finally, we’ll provide an overview of common tools such as PowerShell cmdlets and
the REST API.

Prism Overview
Nutanix Prism is an end-to-end management solution for an AHV virtualized datacenter.

Nutanix Prism provides central access to configure, monitor, and manage virtual environments
in a simple and elegant way. Prism offers simplicity by combining several aspects of datacenter
management into a single, easy-to-use solution. Using innovative machine learning technology,
Prism can mine large volumes of system data easily and quickly and generate actionable
insights for optimizing all aspects of virtual infrastructure management.

Prism Licensing
Note: In this module, we’ll describe and define the features of both Prism Starter
and Pro. The Prism Central module covers Prism Central and Pro administration in
more detail.

Prism Element

Prism Element is a service built into the platform for every Nutanix cluster deployed. Prism
Element provides the ability to fully configure, manage, and monitor a single Nutanix
cluster running any hypervisor.


Prism Central

Because Prism Element only manages the cluster it is part of, each Nutanix cluster in a
deployment has a unique Prism Element instance for management. Prism Central allows you
to manage different clusters across separate physical locations on one screen and offers an
organizational view into a distributed Nutanix environment.

Prism Central is an application you can deploy in a VM (Prism Central VM) or in a scale out
cluster of VMs (Prism Central instance). You can deploy Prism Central:

• Manually

• Import a VM template


• Using one-click from Prism Element

You can run the Prism Central VM at any of the available sizes; the only difference is the amount of CPU
and memory available to Prism Central for VM management. You can deploy a Prism
Central instance initially as a scale-out cluster or, if you are running it as a single VM, easily
scale it out with one click using Prism Element. The design decisions involved in using this
architecture are dramatically simpler than legacy solutions. You only need to answer two
questions before deploying:

• How many VMs do you need to manage?

• Do you want High Availability?

This extensible architecture allows you to enable value-added features and products, such as
Prism Pro, Calm, and Flow networking within Prism Central. These additional features operate
within a single Prism Central VM or clustered Prism Central instance and do not require you to
design or deploy separate products.

Note: Both Prism Element and Prism Central are collectively referred to as Prism
Starter. Prism Central for a single cluster is free of charge; you must purchase a
license to manage multiple clusters.

Prism Pro

Every edition of Acropolis includes Prism Starter for single (Prism Element) and multiple site
(Prism Central) management.

Prism Pro is a set of features providing advanced analytics and intelligent insights into
managing a Nutanix environment. These features include performance anomaly detection,
capacity planning, custom dashboards, reporting, and advanced search capabilities. You can
license the Prism Pro feature set to unlock it within Prism Central.


Prism Features
Infrastructure Management

• Streamline common hypervisor and VM tasks.

• Deploy, configure, and manage clusters for storage and virtualization.

• Deploy, configure, migrate, and manage virtual machines.

• Create datastores, manage storage policies, and administer DR.


Performance Monitoring

• Provides real-time performance behavior of VMs and workloads.

• Utilizes predictive monitoring based on behavioral analysis to detect anomalies.

• Detects bottlenecks and provides guidance for VM resource allocation.

Operational Insight (Prism Pro)

• Advanced machine learning technology


• Built-in heuristics and business intelligence

• Customizable dashboards

• Built-in and custom reporting

• Single-click query

Capacity Planning (Prism Pro)

• Predictive analytics based on capacity usage and workload behavior

• Capacity optimization advisor

• Capacity expansion forecast

Management Interfaces
There are several methods to manage a Nutanix implementation.

• Graphical UI – Prism Element and Prism Central. This is the preferred method for
management because you can manage the entire environment (when using Prism Central).

• Command line interfaces

- nCLI – Get status and configure entities within a cluster.

- aCLI – Manage the Acropolis portion of the Nutanix environment.


• Nutanix PowerShell cmdlets – Use with Windows PowerShell.

• REST API – Exposes all GUI components for orchestration and automation.

Prism Element Initial Configuration


Accessing Prism Element
Knowledge base article KB 1661 contains details of the initial logon procedure.

Nutanix supports current browser editions as well as the previous two major versions for:


• Firefox

• Chrome

• Safari

• Internet Explorer (10 and 11)

• Microsoft Edge

Enabling Pulse

Pulse is enabled by default and monitors cluster health and proactively notifies customer
support if a problem is detected.

• Collects cluster data automatically and unobtrusively with no performance impact

• Sends diagnostic data via e-mail to both Nutanix Support and the user (if configured) once
per day, per node

• Proactive monitoring (different from alerts)

Controller VMs communicate with ESXi hosts and IPMI interfaces throughout the cluster to
gather health information.


Warnings and errors are also displayed in Prism Element, where administrators can analyze the
data and create reports.

Pulse Settings

Select the gear icon to configure Pulse.

• Enable/disable

• Set email recipients

• Configure verbosity

- Basic: Collect basic statistics only

- Basic Coredump: Collect basic statistics plus core dump data

Disabling Pulse is not recommended, since Nutanix Support will not be notified if you have an
issue. Some Pulse data can trigger an automated case creation.

Pulse sends alerts to Nutanix Support by default, but administrators can define additional
recipients.

When configuring verbosity:

Basic statistics include Zeus, Stargate, Cassandra, and Curator subsystem information;
Controller VM information; hypervisor and VM information; cluster configuration; and
performance information.

The core dump data is a summary of information extracted from the core dump files including
the time stamp, the file name, and the fatal message.

What Gets Shared by Pulse?


Information collected and shared

• System alerts


• Current Nutanix software version

• Nutanix processes and Controller VM information

• Hypervisor details such as type and version

• System-level statistics

• Configuration information

Information not shared

• Guest VMs

• User data

• Metadata

• Administrator credentials

• Identification data

• Private information

Command Line Interfaces


You can run system administration commands against a Nutanix cluster from a local machine or
any CVM in the cluster.

There are two commonly used command line interfaces (CLIs):

• nCLI – Get status and configure entities within a cluster.

• aCLI – Manage the Acropolis portion of the Nutanix environment: hosts, networks, snapshots,
and VMs.

The Acropolis 5.10 Command Reference on the Support Portal contains nCLI, aCLI and CVM
commands.

Nutanix Command Line Interface (nCLI)


 ncli> entity action parameter1=value parameter2=value ...


From Prism Element, download the nCLI installer to a local machine. This requires Java Runtime
Environment (JRE) version 5.0 or higher.

The PATH environment variable should point to the nCLI folder as well as the JRE bin folder.

Once downloaded and installed, go to a bash shell or command prompt and point nCLI to the
cluster virtual IP address or to any CVM in the cluster.

Enter ncli -s management_ip_addr -u 'username' -p 'user_password'

Command Format

nCLI commands must match the following format:


ncli> entity action parameter1=value parameter2=value ...

• You can replace entity with any Nutanix entity, such as cluster or disk.

• You can replace action with any valid action for the preceding entity. Each entity has a
unique set of actions, but a common action across all entities is list. For example, you can
type the following command to request a list of all storage pools in the cluster.
ncli> storagepool list

Some actions require parameters at the end of the command. For example, when creating
an NFS datastore, you need to provide both the name of the datastore as it appears to the
hypervisor and the name of the source storage container.
ncli> datastore create name="NTNX-NFS" ctr-name="nfs-ctr"


You can list parameter-value pairs in any order, as long as they are preceded by a valid entity
and action.

Note: To avoid syntax errors, surround all string values with double quotes,
as demonstrated in the preceding example. This is particularly important when
specifying parameters that accept a list of values.

nCLI Embedded Help

The nCLI provides assistance on all entities and actions. By typing help at the command line,
you can request additional information at one of three levels of detail.

• help provides a list of entities and their corresponding actions

• <entity> help provides a list of all actions and parameters associated with the entity, as well
as which parameters are required, and which are optional

• <entity> action help provides a list of all parameters associated with the action, as well as a
description of each parameter

The nCLI provides additional details at each level. To control the scope of the nCLI help output,
add the detailed parameter, which can be set to either true or false.

For example, type the following command to request a detailed list of all actions and
parameters for the cluster entity.
ncli> cluster help detailed=true

You can also type the following command if you prefer to see a list of parameters for the
cluster edit-params action without descriptions.
ncli> cluster edit-params help detailed=false

nCLI Entities and Parameters

Each entity has unique actions, but a common action for all entities is list, for example: ncli> storagepool list.

Some actions require parameters, for example: ncli> datastore create name="NTNX-NFS" ctr-name="nfs-ctr".

You can list parameter-value pairs in any order. You should surround string values with double quotes.

Note: This is critical when specifying a list of values.

Acropolis Command Line Interface (aCLI)


Acropolis provides a command-line interface for managing hosts, networks, snapshots, and
VMs.


To access the aCLI:

• Log on to a Controller VM in the cluster with SSH

• Type acli at the shell prompt

To exit the Acropolis CLI and return to the shell, type exit at the <acropolis> prompt.

aCLI Example

• Create a new virtual network for VMs: net.create vlan.100 ip_config=10.1.1.1/24

• Add a DHCP pool to a managed network: net.add_dhcp_pool vlan.100 start=10.1.1.100 end=10.1.1.200

• Clone a VM: vm.clone testClone clone_from_vm=Edu01VM
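
A few more day-to-day examples, shown as a sketch (the VM name Edu01VM is reused from the
clone example above and is a placeholder):

• List all VMs in the cluster: vm.list

• Power on a VM: vm.on Edu01VM

• Show the full configuration of a VM: vm.get Edu01VM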

Note: Use extreme caution when executing allssh commands. The allssh command
executes an SSH command on every CVM in the cluster.

PowerShell Cmdlets

Overview

Windows PowerShell is an intuitive and interactive scripting language built on the .NET
framework. A cmdlet is a lightweight command that is used in the Windows PowerShell
environment. Nutanix provides a set of PowerShell cmdlets to perform system administration
tasks using PowerShell. 

For more information, see The Nutanix Developer Community website. 

Nutanix PowerShell cmdlets utilize a getter/setter methodology. The typical syntax is
<Verb>-NTNX<Noun>.

Examples:
move-NTNXVirtualMachine
get-NTNXAlert


PowerShell Cmdlets (Partial List)

Acropolis Task Administration


Get-NTNXTask
Poll-NTNXTask
Acropolis VM Administration

Add-NTNXVMDisk
Get-NTNXVMDisk
Remove-NTNXVMDisk
Set-NTNXVMDisk
Stop-NTNXVMDisk
Stop-NTNXVMMove
Add-NTNXVMNIC
Get-NTNXVMNIC
Remove-NTNXVMNIC

Acropolis Network Administration


Get-NTNXNetwork
New-NTNXNetwork
Remove-NTNXNetwork
Set-NTNXNetwork
Get-NTNXNetworkAddressTable
Reserve-NTNXNetworkIP
UnReserve-NTNXNetworkIP

Acropolis Snapshot Administration


Clone-NTNXSnapshot
Get-NTNXSnapshot
New-NTNXSnapshot
Remove-NTNXSnapshot

PowerShell Cmdlets Examples

• Connect to a cluster: Connect-NutanixCluster -Server <Cluster IP> -UserName <Prism User> -Password <Password>

• Get information about the cluster you are connected to: Get-NutanixCluster

• Get information about ALL of the clusters you are connected to by specifying a CVM IP for
each cluster: Get-NutanixCluster -Server cvm_ip_addr

• Get help while in the PowerShell interface: Get-Help
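
The following sketch ties these together: it loads the snap-in, connects, and pulls recent alerts
and tasks. It assumes the cmdlets have been installed from the Prism download link; the IP
address and credentials are placeholders for your environment.

Add-PSSnapin NutanixCmdletsPSSnapin
Connect-NutanixCluster -Server 10.0.0.10 -UserName admin -Password 'password' -AcceptInvalidSSLCerts
Get-NTNXAlert | Select-Object -First 5
Get-NTNXTask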


REST API

Overview

The REST API allows an external system to interrogate a cluster using a script that makes REST
API calls. It uses HTTP requests (GET, POST, PUT, and DELETE) to retrieve information or to
make changes to the cluster.

• Responses are encoded in JSON format.

• Prism Element includes a REST API Explorer that:

- Displays a list of cluster objects that can be managed by the API

- Lets you make sample API calls to see the output

There are three versions of the Nutanix REST API: v1, v2, and v3. We encourage users of the v1
API to migrate to v2. You can use the REST API Explorer to view the v1 and v2 APIs.

See the REST API Reference on the Nutanix Support portal and the Nutanix Developer
Community website for more information.
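
As a quick illustration, a v2 API call can be made with any HTTP client. The sketch below uses
curl with a placeholder cluster IP and user; -k accepts the default self-signed certificate:

curl -k -u admin 'https://cluster_ip:9440/PrismGateway/services/rest/v2.0/cluster/'
curl -k -u admin 'https://cluster_ip:9440/PrismGateway/services/rest/v2.0/vms/'

Both calls return JSON, which your orchestration or automation tooling can then parse.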

Labs
1. Conducting Prism Element Initial Setup

2. Configuring an NTP Server

3. Using Nutanix Interfaces

4. Exploring Prism Views

5. Exploring nCLI

Module 3: SECURING THE NUTANIX CLUSTER

Overview
After completing this module, you will be able to:

• Describe Nutanix cluster security methodologies


• Use Data at Rest Encryption (DARE) to encrypt data

• Understand how key-based SSH access to a cluster works

• Configure user authentication

• Install an SSL certificate

Security Overview

Nutanix Security Development Life Cycle


The Nutanix Enterprise Cloud follows the Nutanix Security Development Life Cycle (SecDL).

• Mitigates risk through repeated assessment and testing.

• Performs fully automated testing during development, and times all security-related code
modifications during minor releases to minimize risk.

• Assesses and mitigates customer risk from code changes by using threat modeling.

Nutanix has been tested against multiple industry standards.

• Passes DoD and Federal security audits

• Certified in HIPAA environments

• Certified in payment and financial environments

• Implements and conforms to Security Technical Implementation Guides (STIG)


Note: Download the Information Security with Nutanix Tech Note for more
information on this topic.

Security in the Enterprise Cloud


Nutanix security features include:

Two-Factor Authentication
Logons require a combination of a client certificate and username and password.
Administrators can use local accounts or use AD.

• One-way: Authenticate to the server

• Two-way: Server also authenticates the client

• Two-factor: Username/Password and a valid certificate

Cluster Lockdown
You can restrict access to a Nutanix cluster. SSH sessions can be restricted through
nonrepudiated keys.

• Each node employs a public/private key-pair

• Cluster secured by distributing these keys

You can disable remote logon with a password. You can completely lock down SSH access
by disabling remote logon and deleting all keys except for the interCVM and CVM to host
communication keys.

Key Management and Administration


Nutanix nodes are authenticated by a key management server (KMS). SEDs generate new
encryption keys, which are uploaded to the KMS. In the event of power failure or a reboot, keys
are retrieved from the KMS and used to unlock the SEDs. You can instantly reprogram security
keys. Crypto Erase can be used to instantly erase all data on an SED while generating a new
key.


Security Technical Implementation Guides (STIGs)


Once deployed, STIGs lock down IT environments and reduce security vulnerabilities in
infrastructure.

Traditionally, using STIGs to secure an environment is a manual process that is highly time-
consuming and prone to operator error. Because of this, only the most security-conscious IT
shops follow the required process.

Nutanix has created custom STIGs that are based on the guidelines outlined by DISA to keep
the Enterprise Cloud Platform within compliance and reduce attack surfaces.
Nutanix includes STIGs that collectively check over 800 security entities covering storage,
virtualization, and management:

• AHV

• AOS

• Prism

• Web server

• Prism reverse proxy

Working with STIGs

To make the STIGs usable by all organizations, Nutanix provides the STIGs in machine-readable
XCCDF.xml format and PDF. This enables organizations to use tools that can read STIGs and
automatically validate the security baseline of a deployment, reducing the accreditation time
required to stay within compliance from months to days.

Nutanix leverages SaltStack and SCMA to self-heal any deviation from the security baseline
configuration of the operating system and hypervisor to remain in compliance. If any
component is found as non-compliant, then the component is set back to the supported
security settings without any intervention. To achieve this objective, Nutanix Controller VM
conforms to RHEL 7 (Linux 7) STIG as published by DISA. Additionally, Nutanix maintains its
own STIG for the Acropolis Hypervisor (AHV).

STIG SCMA Monitoring

Security Configuration Management Automation (SCMA)

• Monitors over 800 security entities covering storage, virtualization, and management

• Detects unknown or unauthorized changes and can self-heal to maintain compliance

• Logs SCMA output/actions to syslog

The SCMA framework ensures that services are constantly inspected for variance to the
security policy.

Nutanix has implemented security configuration management automation (SCMA) to check
multiple security entities for both Nutanix storage and AHV. Nutanix automatically reports log
inconsistencies and reverts them to the baseline.

With SCMA, you can schedule the STIG to run hourly, daily, weekly, or monthly. STIG has the
lowest system priority within the virtual storage controller, ensuring that security checks do not
interfere with platform performance.
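
The STIG/SCMA schedule can also be inspected and adjusted from the nCLI. The following is a
sketch only; action and parameter names vary by AOS version, so check the embedded help
(ncli> cluster help) before running them:

ncli> cluster get-cvm-security-config
ncli> cluster edit-cvm-security-params schedule=daily
ncli> cluster edit-hypervisor-security-params schedule=daily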


Data at Rest Encryption


Data at Rest Encryption (DARE) secures data while at rest using built-in key-based access
management.

• Data is encrypted on all drives at all times.

• Data is inaccessible in the event of drive or node theft.

• Data on a drive can be securely destroyed.

• Key authorization enables password rotation at arbitrary times.


• Protection can be enabled or disabled at any time.

• No performance penalty is incurred despite encrypting all data.

Enterprise key management (with KMS): A consolidated, central key management server (KMS)
which provides service to multiple cryptographic clusters.

Nutanix provides a software-only option for data-at-rest security with the Ultimate license. This
does not require the use of self-encrypting drives. 

DARE Implementation
1. Install SEDs for all data drives in a cluster. The drives are FIPS 140-2 Level 2 validated and
use FIPS 140-2 validated cryptographic modules.

2. When you enable data protection for the cluster, the Controller VM must provide the proper
key to access data on a SED.

3. Keys are stored in a key management server that is outside the cluster, and the Controller
VM communicates with the key management server using the Key Management
Interoperability Protocol (KMIP) to upload and retrieve drive keys. 

4. When a node experiences a full power off and power on (and cluster protection is enabled),
the Controller VM retrieves the drive keys from the key management server and uses them
to unlock the drives.

Use Prism to manage key management device and certificate authorities.

Each Nutanix node automatically:

1. Generates an authentication certificate and adds it to the key management device

2. Auto-generates and sets PINs on its respective FIPS-validated SED.

The Nutanix controller in each node then adds the PINs (aka KEK, key encryption key) to the
key management device.

Once the PIN is set on an SED, you need the PIN to unlock the device (lose the PIN, lose data).
You can reset the PIN using the SecureErase primitive to “unsecure” the disk/partition, but all
existing data is lost in this case.

This is an important detail if you move drives between clusters or nodes.

The ESXi and NTNX boot partitions remain unencrypted. SEDs support selectively encrypting
individual disk partitions using the "BAND" feature (a range of blocks).


Configuring Authentication

Changing Passwords
You can change four different sets of passwords in a Nutanix cluster: user, CVM, IPMI, and the
hypervisor host.

When you change these passwords depends on your company’s IT security policies. Most
companies enforce password changes on a schedule via security guidelines, but the intervals
are usually company specific.

Nutanix enables administrators with password complexity features such as forcing the use of
upper/lower case letters, symbols, numbers, change frequency, and password length. After you
have successfully changed a password, the new password is synchronized across all Controller
VMs and interfaces (Prism web console, nCLI, and SSH).

By default, the admin user password does not expire and can be changed at any time. If you
do change the admin password, you will also need to update any applications and scripts that
use the admin credentials for authentication. For authentication purposes, Nutanix recommends
that you create a user with an admin role, instead of using the admin account.

Note: For more information on this topic, please see the Nutanix Support Portal >
Common Criteria Guidance Reference > User Identity and Authentication.

Changing User Passwords

You can change user passwords, including for the default admin user, in the web console or
nCLI. Changing the password through either interface changes it for both.
To change a user password, do one of the following:

Using the web console: Log on to the web console as the user whose password is to be
changed and select Change Password from the user icon pull-down list of the main menu.

Note: For more information about changing properties of the current users, see the
Web Console Guide.

Using nCLI: Specify the username and passwords.


$ ncli -u 'username' -p 'old_pw' user change-password current-password="curr_pw" new-password="new_pw"

Remember to:

• Replace username with the name of the user whose password is to be changed.


• Replace curr_pw with the current password.

• Replace new_pw with the new password.

Note: If you change the password of the admin user from the default, you must
specify the password every time you start an nCLI session from a remote system.
A password is not required if you are starting an nCLI session from a Controller VM
where you are already logged on.

Changing the CVM Password

For a Regular User Account

Perform these steps on any one Controller VM in the cluster to change the password of
the nutanix user. After you have successfully changed the password, the new password is
synchronized across all Controller VMs in the cluster. During the sync, you will see a task
appear in the Recent Tasks section of Prism and will be notified when the password sync task is
complete.

1. Log on to the Controller VM with SSH as the nutanix user.

2. Change the nutanix user password.


nutanix@cvm$ passwd

3. Respond to the prompts, providing the current and new nutanix user password.
Changing password for nutanix.
Old Password:
New password:
Retype new password:
Password changed.

Changing the IPMI Password

This procedure helps prevent the BMC password from being retrievable on port 49152.

Although it is not required for the administrative user to have the same password on all hosts,
doing so makes cluster management much easier. If you do select a different password for one
or more hosts, make sure to note the password for each host.

Note: The maximum allowed length of the IPMI password is 19 characters, except
on ESXi hosts, where the maximum length is 15 characters.

Note: Do not use the following special characters in the IPMI password: & ; ` ' \ " |
* ? ~ < > ^ ( ) [ ] { } $ \n \r

Change the administrative user password of all IPMI hosts.

Perform these steps on every IPMI host in the cluster.

1. Sign in to the IPMI web interface as the administrative user.

2. Navigate to the administrative user configuration and modify the user.

3. Update the password.


Note: It is also possible to change the IPMI password for ESXi, Hyper-V, and AHV
if you do not know the current password but have root access to the host. For
instructions on how to do this, please see the relevant section of the NX and SX
Series Hardware Administration Guide on the Nutanix support portal.
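
As a rough sketch of that host-side approach on AHV, ipmitool can list the BMC users and set a
new password. The administrative account is often user ID 2, but confirm the ID from the list
output before making any change:

root@ahv# ipmitool user list 1
root@ahv# ipmitool user set password 2 'new_password'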

Changing the Acropolis Host Password

Perform these steps on every Acropolis host in the cluster.

1. Log on to the AHV host with SSH.

2. Change the root password.


root@ahv# passwd root

3. Respond to the prompts, providing the current and new root password.
Changing password for root.
New password:
Retype new password:
Password changed.

Role-Based Access Control

Prism Central supports role-based access control (RBAC) that you can configure to provide
customized access permissions for users based on their assigned roles. The roles dashboard
allows you to view information about all defined roles and the users and groups assigned to
those roles.

• Prism Central includes a set of predefined roles.

• You can also define additional custom roles.

• Configuring authentication confers default user permissions that vary depending on the
type of authentication (full permissions from a directory service or no permissions from an
identity provider). You can configure role maps to customize these user permissions.

• You can refine access permissions even further by assigning roles to individual users or
groups that apply to a specified set of entities.

Note: Defining custom roles and assigning roles are supported on AHV only.


Gathering Requirements to Create Custom Roles


When securely delegating access to applications and infrastructure components, the ideal way
to assign permissions is to follow the rule of least privilege: a person with elevated access
should have only the permissions necessary to do their day-to-day work, and no more.

Nutanix RBAC enables this by providing fine grained controls when creating custom roles in
Prism Central. As an example, it is possible to create a VM Admin role with the ability to view
VMs, limited permission to modify CPU, memory, and power state, and no other administrative
privileges.
When creating custom roles for your organization, remember to:

• Clearly understand the specific set of tasks a user will need to perform their job

• Identify permissions that map to those tasks and assign them accordingly

• Document and verify your custom roles to ensure that the correct privileges have been
assigned

Built-in Roles
The following built-in roles are defined by default. You can see a more detailed list of
permissions for any of the built-in roles through the details view for that role. The Project
Admin, Developer, Consumer, and Operator roles are available when assigning roles in a project.

Role: Privileges

• Super Admin: Full administrator privileges.

• Prism Admin: Full administrator privileges except for creating or modifying user accounts.

• Prism Viewer: View-only privileges.

• Self-Service Admin: Manages all cloud-oriented resources and services. This is the only cloud
administration role available.

• Project Admin: Manages cloud objects (roles, VMs, Apps, Marketplace) belonging to a project.
You can specify a role for a user when you assign a user to a project, so individual users or
groups can have different roles in the same project.

• Developer: Develops, troubleshoots, and tests applications in a project.

• Consumer: Accesses the applications and blueprints in a project.

Custom Roles
If the built-in roles are not sufficient for your needs, you can create one or more custom roles.
After creation, these roles can also be modified if necessary.


Note: Custom role creation is only possible in AHV.

Creating a Custom Role

You can create a custom role from the Roles dashboard, with the following parameters:

• Name

• Description
• Permissions for VMs, blueprints, apps, marketplace items, and reports management

Modifying or Deleting a Custom Role

A custom role can also be modified or deleted from the Roles dashboard. When updating a
role, you will be able to modify the same parameters that are available when creating a custom
role. To delete a role, select the Delete option from the Actions menu and provide confirmation
when prompted.

Configuring Role Mapping


When user authentication is enabled, the following permissions are applied:

• Directory-service-authorized users are assigned full administrator permissions by default.

• SAML-authorized users are not assigned any permissions by default; they must be explicitly
assigned.

Note: To configure user authentication, please see the Prism Web Console Guide >
Security Management > Configuring Authentication section.

You can refine the authentication process by assigning a role (with associated permissions) to
users or groups. To assign roles:

1. Navigate to the Role Mapping section of the Settings page.

2. Create a role mapping and provide information for the directory or provider, role, entities
that should be assigned to the role, and then save. Repeat this process for each role that you
want to create.

You can edit a role map entry, which will present you with the same field available when
creating a role map. Make your desired changes and save to update the entry.

You can also delete a role map entry, by clicking the delete icon and then providing
confirmation when prompted.


Working with SSL Certificates

Nutanix supports SSL certificate-based authentication for console access. AOS includes a self-
signed SSL certificate by default to enable secure communication with a cluster, and allows
you to replace the default certificate through the Prism web console.

For more information, see the Nutanix Controller VM Security Operations Guide and
the Certificate Authority sections of the Common Criteria Guidance Reference on the Support
Portal.

Labs
1. Adding a user

2. Verifying the new user account

3. Updating default passwords 

Module 4: NETWORKING

Overview
After completing this module, you will be able to:

• Explain managed and unmanaged Acropolis networks.


• Describe the use of Open vSwitch (OVS) in Acropolis.
• Determine Acropolis network configuration settings.
• Display network details using Prism.
• Differentiate supported OVS bond modes.
• Discuss default network configuration.

Default Network Configuration


The following diagram illustrates the networking configuration of a single host. The best
practice is to use only the 10 GbE NICs and to disconnect the 1 GbE NICs if you do not need
them, or to put them in a separate bond for use on noncritical networks.

Connections from the server to the physical switch use 10 GbE or higher interfaces. You can
establish connections between the switches with 40 GbE or faster direct links, or through a
leaf-spine network topology (not shown). The IPMI management interface of the Nutanix node
also connects to the out-of-band management network, which may connect to the production
network, but it is not mandatory. Each node always has a single connection to the management
network, but we have omitted this element from further images in this document for clarity and
simplicity.

Review the Leaf Spine section of the Physical Networking Guide for more information on leaf-
spine topology.


Default Network Configuration (cont.)

Open vSwitch (OVS)


Open vSwitch (OVS) is an open-source software switch implemented in the Linux kernel and
designed to work in a multiserver virtualization environment. By default, OVS behaves like a
layer-2 switch that maintains a MAC address table. The hypervisor host and VMs connect to
virtual ports on the switch. OVS supports many popular switch features, such as VLAN tagging,
load balancing, and Link Aggregation Control Protocol (LACP).

Each AHV server maintains an OVS instance, and all OVS instances combine to form a single
logical switch. Constructs called bridges manage the switch instances residing on the AHV
hosts. Use the following commands to configure OVS with bridges, bonds, and VLAN tags. For
example:

• ovs-vsctl (on the AHV hosts)

• ovs-appctl (on the AHV hosts)

• manage_ovs (on the CVMs)

See the Open vSwitch website for more information. 

Bridges
Bridges act as virtual switches to manage traffic between physical and virtual network
interfaces. The default AHV configuration includes an OVS bridge called br0 and a native Linux
bridge called virbr0 (the names could vary between AHV/AOS versions and depending on
what configuration changes were done on the nodes, but in this training we will use br0 and
virbr0 by default). The virbr0 Linux bridge carries management and storage communication
between the CVM and AHV host. All other storage, host, and VM network traffic flows through
the br0 OVS bridge. The AHV host, VMs, and physical interfaces use "ports" for connectivity to
the bridge.

Ports
Ports are logical constructs created in a bridge that represent connectivity to the virtual switch.
Nutanix uses several port types, including internal, tap, VXLAN, and bond.

• An internal port with the same name as the default bridge (br0) provides access for the AHV
host.

• Tap ports connect virtual NICs presented to VMs.

• Use VXLAN ports for IP address management functionality provided by Acropolis.

• Bonded ports provide NIC teaming for the physical interfaces of the AHV host.

Bonds
Bonded ports aggregate the physical interfaces on the AHV host. By default, the system
creates a bond named br0-up in bridge br0 containing all physical interfaces. Changes to
the default bond br0-up using manage_ovs commands can rename it to bond0. Remember,
bond names on your system might differ from the diagram below. Nutanix recommends using
the name br0-up to quickly identify this interface as the bridge br0 uplink. Using this naming
scheme, you can also easily distinguish uplinks for additional bridges from each other.

OVS bonds allow for several load-balancing modes, including active-backup, balance-slb, and
balance-tcp. Active-backup mode is enabled by default. Nutanix recommends this mode for
ease of use.

The following diagram illustrates the networking configuration of a single host immediately
after imaging. The best practice is to use only the 10 GbE NICs and to disconnect the 1 GbE
NICs if you do not need them.

Only utilize NICs of the same speed within the same bond.
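
Before changing any bond or bridge settings, it helps to review the current layout. A minimal
sketch from a CVM (bridge and bond names on your system may differ):

nutanix@CVM$ manage_ovs show_uplinks
nutanix@CVM$ ssh root@192.168.5.1 "ovs-vsctl show"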


Bond Modes

There are three load-balancing/failover modes that can be applied to bonds:

• active-backup (default)

• balance-slb

• LACP with balance-tcp

Active-Backup

With the active-backup bond mode, one interface in the bond carries traffic and other
interfaces in the bond are used only when the active link fails. Active-backup is the simplest
bond mode, easily allowing connections to multiple upstream switches without any additional
switch configuration. The active-backup bond mode requires no special hardware and you can
use different physical switches for redundancy.

The tradeoff is that traffic from all VMs uses only a single active link within the bond at
one time. All backup links remain unused until the active link fails. In a system with dual
10 GB adapters, the maximum throughput of all VMs running on a Nutanix node with this
configuration is 10 Gbps or the speed of a single link.

This mode offers only failover (no traffic load balancing). If the active link goes down, a backup
or passive link activates to provide continued connectivity. AHV transmits all traffic, including
CVM and VM traffic, across the active link, so all traffic shares 10 Gbps of network bandwidth.
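
If a bond was previously switched to another mode, the following sketch returns it to the
default active-backup mode (assuming the bond is named br0-up, as in the examples later in
this module):

nutanix@CVM$ ssh root@192.168.5.1 "ovs-vsctl set port br0-up bond_mode=active-backup"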

Balance-SLB
To take advantage of the bandwidth provided by multiple upstream switch links, you can use
the balance-slb bond mode. The balance-slb bond mode in OVS takes advantage of all links in
a bond and uses measured traffic load to rebalance VM traffic from highly used to less used
interfaces. When the configurable bond-rebalance interval expires, OVS uses the measured
load for each interface and the load for each source MAC hash to spread traffic evenly among
links in the bond. Traffic from some source MAC hashes may move to a less active link to more
evenly balance bond member utilization. 

Perfectly even balancing may not always be possible, depending on the number of source
MAC hashes and their stream sizes. Each individual VM NIC uses only a single bond member
interface at a time, but a hashing algorithm distributes multiple VM NICs’ multiple source MAC
addresses across bond member interfaces. As a result, it is possible for a Nutanix AHV node
with two 10 GB interfaces to use up to 20 Gbps of network throughput. Individual VM NICs have
a maximum throughput of 10 Gbps, the speed of a single physical interface. A VM with multiple
NICs could still have more bandwidth than the speed of a single physical interface, but there is
no guarantee that the different VM NICs will land on different physical interfaces.

The default rebalance interval is 10 seconds, but Nutanix recommends setting this interval to
30 seconds to avoid excessive movement of source MAC address hashes between upstream
switches. Nutanix has tested this configuration using two separate upstream switches with
AHV. If the upstream switches are interconnected physically or virtually, and both uplinks allow
the same VLANs, no additional configuration, such as link aggregation is necessary.
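
A sketch of how balance-slb and the recommended 30-second rebalance interval might be
applied, assuming the bond is named br0-up (OVS expects the interval in milliseconds):

nutanix@CVM$ ssh root@192.168.5.1 "ovs-vsctl set port br0-up bond_mode=balance-slb"
nutanix@CVM$ ssh root@192.168.5.1 "ovs-vsctl set port br0-up other_config:bond-rebalance-interval=30000"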

Note: Do not use link aggregation technologies such as LACP with balance-slb.
The balance-slb algorithm assumes that upstream switch links are independent L2
interfaces. It handles broadcast, unicast, and multicast (BUM) traffic, selectively
listening for this traffic on only a single active adapter in the bond.

Note: Do not use IGMP snooping on physical switches connected to Nutanix
servers using balance-slb. Balance-slb forwards inbound multicast traffic on only a
single active adapter and discards multicast traffic from other adapters. Switches
with IGMP snooping may discard traffic to the active adapter and only send it
to the backup adapters. This mismatch leads to unpredictable multicast traffic
behavior. Disable IGMP snooping or configure static IGMP groups for all switch
ports connected to Nutanix servers using balance-slb. IGMP snooping is often
enabled by default on physical switches.

Note: Neither active-backup nor balance-slb requires configuration on the
switch side.

LACP with Balance-TCP

Taking full advantage of bandwidth provided by multiple links to upstream switches, from
a single VM, requires dynamically negotiated link aggregation and load balancing using
balance-tcp. Nutanix recommends dynamic link aggregation with LACP instead of static link
aggregation due to improved failure detection and recovery.

Note: Ensure that you have appropriately configured the upstream switches before
enabling LACP. On the switch, link aggregation is commonly referred to as port
channel or LAG, depending on the switch vendor. Using multiple upstream switches
may require additional configuration such as MLAG or vPC. Configure switches to
fall back to active-backup mode in case LACP negotiation fails (sometimes called
fallback or no suspend-individual). This setting assists with node imaging and initial
configuration where LACP may not yet be available.

Note: Review the following documents for more information on MLAG and vPC
best practices.

With link aggregation negotiated by LACP, multiple links to separate physical switches appear
as a single layer-2 (L2) link. A traffic-hashing algorithm such as balance-tcp can split traffic
between multiple links in an active-active fashion. Because the uplinks appear as a single L2
link, the algorithm can balance traffic among bond members without any regard for switch
MAC address tables. Nutanix recommends using balance-tcp when using LACP and link
aggregation, because each TCP stream from a single VM can potentially use a different uplink in
this configuration. 

With link aggregation, LACP, and balance-tcp, a single guest VM with multiple TCP streams
could use up to 20 Gbps of bandwidth in an AHV node with two 10 GB adapters.


Configuring Link Aggregation

Configure link aggregation with LACP and balance-tcp using the commands below on all
Nutanix CVMs in the cluster.

Note: You must configure upstream switches for link aggregation with LACP
before configuring the AHV host from the CVM. Upstream LACP settings, such as
timers, should match the AHV host settings for configuration consistency. See KB
3263 for more information on LACP configuration.

If upstream LACP negotiation fails, the default AHV host configuration disables the bond, thus
blocking all traffic. The following command allows fallback to active-backup bond mode in the
AHV host in the event of LACP negotiation failure:
nutanix@CVM$ ssh root@192.168.5.1 "ovs-vsctl set port br0-up other_config:lacp-fallback-ab=true"

In the AHV host and on most switches, the default OVS LACP timer configuration is slow,
or 30 seconds. This value — which is independent of the switch timer setting — determines
how frequently the AHV host requests LACPDUs from the connected physical switch. The
fast setting (1 second) requests LACPDUs from the connected physical switch every second,
helping to detect interface failures more quickly. Failure to receive three LACPDUs — in other
words, after 3 seconds with the fast setting — shuts down the link within the bond. Nutanix
recommends setting lacp-time to decrease the time it takes to detect link failure from 90
seconds to 3 seconds. Only use the slower lacp-time setting if the physical switch requires it for
interoperability.
nutanix@CVM$ ssh root@192.168.5.1 "ovs-vsctl set port br0-up other_config:lacp-time=fast"

Next, enable LACP negotiation and set the hash algorithm to balance-tcp.
nutanix@CVM$ ssh root@192.168.5.1 "ovs-vsctl set port br0-up lacp=active"
nutanix@CVM$ ssh root@192.168.5.1 "ovs-vsctl set port br0-up bond_mode=balance-tcp"

Confirm the LACP negotiation with the upstream switch or switches using ovs-appctl, looking
for the word "negotiated" in the status lines.
nutanix@CVM$ ssh root@192.168.5.1 "ovs-appctl bond/show br0-up"
nutanix@CVM$ ssh root@192.168.5.1 "ovs-appctl lacp/show br0-up"


Virtual Local Area Networks (VLANs)

AHV supports two different ways to provide VM connectivity: managed and unmanaged
networks.

With unmanaged networks, VMs get a direct connection to their VLAN of choice. Each virtual
network in AHV maps to a single VLAN and bridge. All VLANs allowed on the physical switch
port to the AHV host are available to the CVM and guest VMs. You can create and manage
virtual networks, without any additional AHV host configuration, using:

• Prism Element

• Acropolis CLI (aCLI)


• REST API

Acropolis binds each virtual network it creates to a single VLAN. During VM creation, you can
create a virtual NIC and associate it with a network and VLAN. Or, you can provision multiple
virtual NICs each with a single VLAN or network.


IP Address Management (IPAM)

A managed network is a VLAN plus IP Address Management (IPAM). IPAM is the cluster
capability to function like a DHCP server, to assign an IP address to a VM that sits on the
managed network.

Administrators can configure each virtual network with a specific IP subnet, associated domain
settings, and group of IP address pools available for assignment.

• The Acropolis Master acts as an internal DHCP server for all managed networks.

• The OVS is responsible for encapsulating DHCP requests from the VMs in VXLAN and
forwarding them to the Acropolis Master.

• VMs receive their IP addresses from the Acropolis Master’s responses.


• The IP address assigned to a VM is persistent until you delete the VNIC or destroy the VM.

The Acropolis Master runs the CVM administrative process that tracks device IP addresses. This
process creates associations between the interface's MAC address, its assigned IP address, and
the defined pool of IP addresses for the AOS DHCP server.
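As a brief sketch of how this looks from the Acropolis CLI, a managed network is created by
supplying an IP configuration and then a DHCP pool; the network name, VLAN ID, and addresses
below are examples only:
<acropolis> net.create Managed30 vlan=30 ip_config=10.10.30.1/24
<acropolis> net.add_dhcp_pool Managed30 start=10.10.30.50 end=10.10.30.200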


Network Segmentation

Network segmentation separates backplane (storage and CVM) traffic from routable
management traffic for security purposes and creates separate virtual networks for each traffic
type.

You can segment the network on a Nutanix cluster in the following ways:

• On an existing cluster by using the Prism Web Console

• When creating a cluster by using Nutanix Foundation 3.11.2 or higher versions.

View this Tech TopX video to learn more about network segmentation. You can also read
the Securing Traffic Through Network Segmentation section of the Nutanix Security Guide on
the Support Portal for more information on securing traffic through network segmentation.

Configuring Network Segmentation on an Existing Cluster

For more information about segmenting the network when creating a cluster, see the Field
Installation Guide on the Support Portal.

You can segment the network on an existing cluster by using the Prism web console. The
network segmentation process:
• Creates a separate network for backplane communications on the existing default virtual
switch.

• Configures the eth2 interfaces that AHV creates on the CVMs during upgrade.

• Places the host interfaces on the newly created network. 

From the specified subnet, AHV assigns IP addresses to each new interface. Each node requires
two IP addresses. For new backplane networks, you must specify a nonroutable subnet. The
interfaces on the backplane network are automatically assigned IP addresses from this subnet,
so reserve the entire subnet for the backplane network alone.

If you plan to specify a VLAN for the backplane network, configure the VLAN on the physical
switch ports to which the nodes connect. If you specify the optional VLAN ID, AHV places the
newly created interfaces on the VLAN. Nutanix highly recommends a separate VLAN for the
backplane network to achieve true segmentation.

Configuring Network Segmentation for an Existing RDMA Cluster


Segment the network on an existing RDMA cluster by using the Prism web console.

The network segmentation process:

• Creates a separate network for RDMA communications on the existing default virtual switch.

• Configures the rdma0 interface that AHV creates on the CVMs during upgrade.


• Places the host interfaces on the newly created network.

From the specified subnet, AHV assigns IP addresses (two per node) to each new interface. For
new RDMA networks, you must specify a nonroutable subnet. AHV automatically assigns the
interfaces on the backplane network IP addresses from this subnet, so reserve the entire subnet
for the backplane network alone.

If you plan to specify a VLAN for the RDMA network, configure the VLAN on the physical
switch ports to which the nodes connect. If you specify the optional VLAN ID, AHV places the
newly created interfaces on the VLAN. Nutanix highly recommends a separate VLAN for the
RDMA network to achieve true segmentation.

Network Segmentation During Cluster Expansion


When you expand a cluster, AHV extends network segmentation to the added nodes. For each
node you add to the cluster, AHV allocates two IP addresses from the specified nonroutable
network address space. If IP addresses are not available in the specified network, Prism displays
a message on the tasks page. In this case, you must reconfigure the network before you retry
cluster expansion.

When you change the subnet, any IP addresses assigned to the interfaces on the backplane
network change, and the procedure therefore involves stopping the cluster. For information
about how to reconfigure the network, see the Reconfiguring the Backplane Network section
of the Nutanix Security Guide on the Support Portal.

Network Segmentation During an AOS Upgrade


If the new AOS release supports network segmentation, AHV automatically creates the eth2
interface on each CVM. However, the network remains unsegmented and the cluster services on
the CVM continue to use eth0 until you configure network segmentation.

Note: Do not delete the eth2 interface that AHV creates on the CVMs, even if you
are not using the network segmentation feature.

Reconfiguring the Backplane Network


Backplane network reconfiguration is a CLI-driven procedure that you perform on any one of
the CVMs in the cluster. AHV propagates the change to the remaining CVMs.

Note: At the end of this procedure, the cluster stops and restarts, even if only
changing the VLAN ID, and therefore involves cluster downtime. Shut down all user
VMs and CVMs before reconfiguring the network backplane.


Disabling Network Segmentation


Disabling network segmentation is a CLI-driven procedure that you perform on any one of the
CVMs in the cluster. AHV propagates the change to the remaining CVMs.

At the end of this procedure, the cluster stops and restarts and therefore involves cluster
downtime. Shut down all user VMs and CVMs before disabling network segmentation.

Unsupported Network Segmentation Configurations


Network segmentation is not supported in the following configurations:

• Clusters on which the CVMs have a manually created eth2 interface.

• Clusters on which the eth2 interface on one or more CVMs has a manually assigned IP
address.
• In ESXi clusters where the CVM connects to a VMware distributed virtual switch.

• Clusters that have two (or more) vSwitches or bridges for CVM traffic isolation. The CVM
management network (eth0) and the CVM backplane network (eth2) must reside on a single
vSwitch or bridge. Do not create these CVM networks on separate vSwitches or bridges.

AHV Host Networking


Network management in an Acropolis cluster consists of the following tasks:

• Configuring L2 switching (bridges, bonds, and VLANs).

• Optionally changing the IP address, netmask, and default gateway that AHV specified for the
hosts during the imaging process.

This Tech TopX video walks through AHV networking concepts, including both CLI and Prism
examples.

Recommended Network Configuration


Review the default network configuration and change it where necessary to match the
recommendations below.


Network Component  Best Practice

OVS: Do not modify the OpenFlow tables that are associated with the default OVS bridge br0.

VLANs: Add the Controller VM and the AHV host to the same VLAN. By default, AHV assigns the
Controller VM and the hypervisor to VLAN 0, which effectively places them on the native VLAN
configured on the upstream physical switch.

Do not add other devices, such as guest VMs, to the same VLAN as the CVM and hypervisor.
Isolate guest VMs on one or more separate VLANs.

Virtual bridges: Do not delete or rename OVS bridge br0. Do not modify the native Linux bridge
virbr0.

OVS bonded port (bond0): Aggregate the host 10 GbE interfaces into an OVS bond on br0 and
trunk these interfaces on the physical switch. By default, the 10 GbE interfaces in the OVS bond
operate in the recommended active-backup mode (see the example command after this table).

Note: Nutanix does not recommend or support mixing bond modes across AHV hosts in the
same cluster.

LACP configurations might work but might have limited support.

1 GbE and 10 GbE interfaces (physical host): If you want to use the 10 GbE interfaces for guest
VM traffic, make sure that the guest VMs do not use the VLAN over which the Controller VM and
hypervisor communicate.

If you want to use the 1 GbE interfaces for guest VM connectivity, follow the hypervisor
manufacturer's switch port and networking configuration guidelines.

Note: Do not include the 1 GbE interfaces in the same bond as the 10 GbE interfaces. Also, to
avoid loops, do not add the 1 GbE interfaces to bridge br0, either individually or in a second
bond. Use them on other bridges.

IPMI port on the hypervisor host: Do not trunk switch ports that connect to the IPMI interface.
Configure the switch ports as access ports for management simplicity.

Upstream physical switch: Nutanix does not recommend the use of Fabric Extenders (FEX) or
similar technologies for production use cases. While initial, low-load implementations might run
smoothly with such technologies, poor performance, VM lockups, and other issues might occur
as implementations scale upward.

Nutanix recommends the use of 10 Gbps, line-rate, nonblocking switches with larger buffers for
production workloads.

Use an 802.3-2012 standards-compliant switch that has a low-latency, cut-through design and
provides predictable, consistent traffic latency regardless of packet size, traffic pattern, or the
features enabled on the 10 GbE interfaces.

Port-to-port latency should be no higher than 2 microseconds.

Use fast-convergence technologies (such as Cisco PortFast) on switch ports connected to the
hypervisor host.

Avoid using shared buffers for the 10 GbE ports. Use a dedicated buffer for each port.

Physical Network Layout: Use redundant top-of-rack switches in a traditional leaf-spine
architecture. The flat network design is well suited for a highly distributed, shared-nothing
compute and storage architecture.

Add all the nodes that belong to a given cluster to the same Layer-2 network segment.

Nutanix supports other network layouts as long as you follow all other Nutanix
recommendations.

Controller VM: Do not remove the Controller VM from either the OVS bridge br0 or the native
Linux bridge virbr0.
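For reference, and following the same ovs-vsctl command pattern shown earlier in this module,
the recommended active-backup mode can be set (or confirmed) on the br0 bond as follows;
the CVM-to-host address and bond name match the conventions used above:
nutanix@CVM$ ssh root@192.168.5.1 "ovs-vsctl set port br0-up bond_mode=active-backup"
nutanix@CVM$ ssh root@192.168.5.1 "ovs-appctl bond/show br0-up"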


AHV Networking Terminology Comparison

AHV Term | VMware Term | Microsoft Hyper-V or SCVMM Term

Bridge | vSwitch, distributed virtual switch | Virtual switch, logical switch

Bond | NIC team | Team or uplink port profile

Port | Port | N/A

Network | Port group | VLAN tag or logical network

Uplink | Physical NIC or VMNIC | Physical NIC or pNIC

VM NIC | VNIC | VM NIC

Internal port | VMkernel port | Virtual NIC

Active-backup | Active/standby | Active/standby

Balance-slb | Route based on source MAC hash combined with route based on physical NIC load | Switch-independent/dynamic

LACP with balance-tcp | LACP and route based on IP hash | Switch-dependent (LACP)/address hash

Labs
1. Creating an Unmanaged Network

2. Creating a Managed Network

3. Managing Open vSwitch (OVS)

4. Creating a New OVS

Module 5: Virtual Machine Management

Overview
After completing this module, you will be able to:

• Use Image Configuration to upload and manage images


• Perform a self-service restore from a Nutanix data protection snapshot

• Manage a VM in AHV

• Define High Availability

• Define Data Path Redundancy

Understanding Image Configuration


The Prism Web Console allows you to import and configure operating system ISO and disk
image files. This image service allows you to assemble a repository of image files in different
formats (raw, vhd, vhdx, vmdk, vdi, iso, and qcow2) that you can later use when creating virtual
machines. How image creation, updates, and deletions work depends on whether or not Prism
Element is registered with Prism Central.

Images that are imported to Prism Element reside in and can be managed from Prism
Element. If connected to Prism Central, you can migrate your images over to Prism Central for
centralized management. This will not remove your images from Prism Element, but will allow
management only in Prism Central. So, for example, if you want to update a migrated image, it
can only be done from Prism Central, not from Prism Element.

Registration with Prism Central is also useful if you have multiple Prism Element clusters
managed by a single instance of Prism Central. In this scenario, if you upload an image to a local
Prism Element instance, for example, this is what happens:

• The image is available locally on that Prism Element instance. (Assuming it has not been
migrated to Prism Central.)

• The image is flagged as ‘active’ on that Prism Element cluster.

• The image is flagged as ‘inactive’ on other Prism Element clusters.

• When you create a VM using that image, the image is copied to other Prism Element
clusters, is made active, and is then available for use on all Prism Element clusters managed
by that instance of Prism Central.


Supported Disk Formats


Supported disk formats include:

• RAW
• VHD (virtual hard disk)

• VMDK (virtual machine disk)

• VDI (Oracle VirtualBox)

• ISO (disc image)

• QCOW2 (QEMU copy on write)

The QCOW2 format decouples the physical storage layer from the virtual layer by adding a
mapping between the logical and physical blocks.

Post-Import Actions

After you import an image you can perform several actions.

• Clone a VM from the image

• Leave the image in the service for future deployments

• Delete the imported image

After you import an image, you can clone a VM from the image that you have imported and
then delete the imported image if it is no longer needed.

For more information on how to create a VM from the imported image, see the Prism Web
Console Guide on the Support Portal.


Uploading Images
There are two ways to upload an image to the Image Service:

• Via Prism

• Using the command line

Using the Prism Interface

1. From the Settings menu in Prism, select Image Configuration.


2. Upload an image and specify the required parameters, including the name, the image type,
the container on which it will be stored, and the image source for upload.

After Prism completes the upload, the image will appear in a list of available images for use
during VM creation.

Using the Command Line

To create an image (testimage) from an image located at http://example.com/image_iso, you
can use the following command:
<acropolis> image.create testimage source_url=http://example.com/image_iso container=default image_type=kIsoImage

To create an image (testimage) from an image located on an NFS server, you can use the
following command:
<acropolis> image.create testimage source_url=nfs://nfs_server_path/path_to_image

To create an image (image_template) by cloning vmdisk 0b4fc60b-cc56-41c6-911e-67cc8406d096
(the UUID of the source vmdisk), you can use the following command:
<acropolis> image.create image_template clone_from_vmdisk=0b4fc60b-cc56-41c6-911e-67cc8406d096 image_type=kDiskImage

Creating and Managing Virtual Machines in AHV

You can use the Prism web console to create virtual machines (VMs) on a Nutanix cluster. If
you have administrative access to a cluster, you can create a VM with Prism by completing a
form that requires a name, compute, storage, and network specifications. If you have already
uploaded the required image files to the image service, you can create either Windows or Linux
VMs during the VM creation process.


Prism also has self-service capabilities that enable administrators or project members with the
required permissions to create VMs. In this scenario, users will select from a list of pre-defined
templates for VMs and disk images to create their VM.

Finally, VMs can be updated after creation, cloned, or deleted as required. When updating a VM,
you can change compute details (vCPUs, cores per vCPU, memory), storage details (disk types
and capacity), as well as other parameters that were specified during the VM creation process.

Creating a VM in AHV
1. In Prism Central, navigate to VM dashboard, click the List tab, and click Create VM.
2. In the Cluster Selection window, select the target cluster for your VM and click OK.

3. In the Create VM dialog box, update the following information as required for your VM:

• Name

• Description (optional)

• Timezone

• vCPUs
• Number of cores per vCPU

• Memory

• GPU and GPU mode

• Disks (CD-ROM or disk drives)

• Network interface

- NIC

- VLAN name, ID, and UUID

- Network connection state

- Network address/prefix

- IP address (for NICs on managed networks only)

• VM host affinity

4. After all fields have been updated and verified, click Save to create the VM.

When creating a VM, you can also provide a user data file for Linux VMs, or an answer file for
Windows VMs, for unattended provisioning. There are 3 ways to do this:

• If the file has been uploaded to a storage container on a cluster, click ADSF path and enter
the path to the file.

• If the file is available on your local computer, click Upload a File, click Choose File, and then
upload the file.

• If you want to create the file or paste the contents directly, click Type or paste script and
then use the text box that is provided

You can also copy or move files to a location on the VM for Linux VMs, or to a location in the
ISO file for Windows VMs, during initialization. To do this, you need to specify the source file
ADSF path and the destination path in the VM. To add other files or directories, repeat this
process as necessary.


Creating a VM using Prism Self-Service (PSS)


This process is slightly different from creating a VM with administrative permissions. This is
because self-service VMs are based on a source file stored in the Prism Central catalog. To
create a VM using Prism Self-Service:

1. In Prism Central, navigate to VM dashboard, click the List tab, and click Create VM.

2. Select source images for the VM, including the VM template and disk images.

3. In the Deploy VM tab, provide the following information:


• VM name

• Target project

• Disks

• Network

• Advanced Settings (vCPUs and memory)

4. After all the fields have been updated and verified, click Save to create the VM.

Managing a VM
Whether you have created a VM with administrative permissions or as a self-service
administrator, three options are available to you when managing VMs:

To modify a VM’s configuration

1. Select the VM and click Update.

2. The Update VM dialog box includes the same fields as the Create VM dialog box. Make the
required changes and click Save.

To delete a VM

1. Select the VM and click Delete.

2. A confirmation prompt will appear; click OK to delete the VM.

To clone a VM
1. Select the VM and click Clone.

2. The Clone VM dialog box includes the same fields as the Create VM dialog box. However,
all fields will be populated with information based on the VM that you are cloning. You can
either:

• Enter a name for the cloned VM and click Save, or

• Change the information in some of the fields as desired, and then click Save.

Other operations that are possible for a VM via one-click operations in Prism Central are:

• Launch console

• Power on/off

• Pause/Suspend

• Resume

• Take Snapshot


• Migrate (to move the VM to another host)

• Assign a category value

• Quarantine/Unquarantine

• Enable/disable Nutanix Guest Tools

• Configure host affinity

• Add snapshot to self-service portal template (Prism Central Administrator only)

• Manage VM ownership (for self-service VMs)

Note: For more information on each of these topics, please see the Prism
Central Guide > Virtual Infrastructure (Cluster) Administration > VM Management
documents on the Nutanix Support Portal.

Supported Guest VM Types for AHV 


OS Types with SCSI Bus Types:

• Windows 7, 8.x, 10
• Windows Server 2008/R2, 2012/R2, 2016
• RHEL 6.4-6.9, 7.0-7.4
• CentOS 6.4-6.8, 7.0-7.3
• Ubuntu 12.04.5, 14.04.x, 16.04.x, 16.10
• FreeBSD 9.3, 10.0-10.3, 11
• SLES 11 SP3/SP4, 12
• Oracle Linux 6.x, 7.x

OS Types with PCI Bus Types:

• RHEL 5.10, 5.11, 6.3
• CentOS 5.10, 5.11, 6.3
• Ubuntu 12.04
• SLES 12

See the AHV Guest OS Compatibility Matrix on the Support Portal for the current list of
supported guest VMs in AHV.

Nutanix VirtIO
Nutanix VirtIO is a collection of drivers for paravirtual devices that enhance the stability and
performance of virtual machines on AHV.

Nutanix VirtIO drivers:

• Enable Windows 64-bit VMs to recognize AHV virtual hardware


• Contain Network, Storage and a Balloon driver (which is used to gather stats from Windows
guest VMs)


• If the VirtIO drivers are not added as an ISO (CD-ROM) during Windows installation, the VM
may not boot

• Most modern Linux distributions already include these drivers

• Available on the Support Portal under Downloads > Tools and Firmware


Nutanix Guest Tools


Overview

Nutanix Guest Tools (NGT) is an in-guest agent framework that enables advanced VM
management functionality through the Nutanix Platform.

The NGT bundle consists of the following components:

• Nutanix Guest Agent (NGA) service. Communicates with the Nutanix Controller VM.

• File Level Restore CLI. Performs self-service file-level recovery from the VM snapshots. For
more information about self-service restore, see the Acropolis Advanced Setup Guide.

• Nutanix VM mobility drivers. Provides drivers for VM migration between ESXi and AHV,
in-place hypervisor conversion, and cross-hypervisor disaster recovery (CH-DR) features.
For more information about cross-hypervisor disaster recovery, see the Cross-Hypervisor
Disaster Recovery section of the Data Protection and Disaster Recovery guide on the
Support Portal.

• VSS requestor and hardware provider for Windows VMs. Enables application-consistent
snapshots of AHV or ESXi Windows VMs. For more information about Nutanix VSS-based
snapshots for the Windows VMs, see the Application-Consistent Snapshot Guidelines on
the Support Portal.


NGT Requirements and Limitations

General requirements and limitations

• You must configure the cluster virtual IP address on the Nutanix cluster. If the virtual IP
address of the cluster changes, it will impact all the NGT instances that are running in your
cluster. For more information, see the Impact of Changing Virtual IP Address of the Cluster
section of the Prism Web Console Guide on the Support Portal.

• VMs must have at least one empty IDE CD-ROM slot to attach the ISO.
• Port 2074 should be open to communicate with the NGT-Controller VM service.

• The hypervisor should be ESXi 5.1 or later, or AHV 20160215 or later version.

• You should connect the VMs to a network that you can access by using the virtual IP
address of the cluster.

Supported operating systems

• For Windows Server Edition VMs, ensure that Microsoft VSS service is enabled before
starting the NGT installation.
• When you connect a VM to a volume group (VG), NGT captures the IQN of the VM and stores
the information. If you change the VM IQN before the NGT refresh cycle occurs and you take
a snapshot of the VM, NGT will not be able to provide the auto restore capability because the
snapshot operation will not be able to capture the VM-VG connection. As a workaround, you
can manually restart the Nutanix Guest Agent service to update NGT, either by running the
sudo service ngt_guest_agent restart command on a Linux VM or from the Services tab on a
Windows VM.

Note: See the supported operating system information for the specific NGT
features to verify if an operating system is supported for a specific NGT feature.

Requirements and Limitations by Operating System

Windows

Versions: Windows 2008 or later, Windows 7 or later

• Only the 64-bit operating system is supported.

• You must install the SHA-2 code signing support update before installing NGT.

Apply the security update in KB3033929 to enable SHA-2 code signing support on the
Windows OS. If the installation of the security update in KB3033929 fails, apply one of the
following rollups:

- KB3185330 (October 2016 Security Rollup)

- KB3197867 (November 2016 Security Only Rollup)

- KB3197868 (November 2016 Quality Rollup)

• For Windows Server Edition VMs, ensure that Microsoft VSS Services is enabled before
starting the NGT installation.


Linux

Versions: CentOS 6.5 and 7.0, Red Hat Enterprise Linux (RHEL) 6.5 and 7.0, Oracle Linux 6.5
and 7.0, SUSE Linux Enterprise Server (SLES) 11 SP4 and 12, Ubuntu 14.04 or later

• The self-service restore feature is not available on Linux VMs.

• The SLES operating system is only supported for the application consistent snapshots with
VSS feature. The SLES operating system is not supported for the cross-hypervisor disaster
recovery feature.

Customizing a VM
Cloud-Init

On non-Windows VMs, Cloud-config files, special scripts designed to be run by the Cloud-Init
process, are generally used for initial configuration on the very first boot of a server. The cloud-
config format implements a declarative syntax for many common configuration items and
also allows you to specify arbitrary commands for anything that falls outside of the predefined
declarative capabilities. This lets the file act like a configuration file for common tasks, while
maintaining the flexibility of a script for more complex functionality.

You must pre-install Cloud-Init in the operating system image used to create VMs.

Cloud-Init runs early in the boot process and configures the operating system on the basis of
data that you provide. You can use Cloud-Init to automate tasks such as:

• Setting a host name and locale

• Creating users and groups

• Generating and adding SSH keys so that users can log on

• Installing packages

• Copying files

• Bootstrapping other configuration management tools such as Chef, Puppet, and Salt
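The following minimal cloud-config sketch illustrates a few of the tasks above; the hostname,
user name, SSH key placeholder, and package are purely illustrative:
#cloud-config
hostname: web01
users:
  - name: demo-admin
    ssh_authorized_keys:
      - ssh-rsa AAAA-example-public-key
packages:
  - nginx
runcmd:
  - systemctl enable --now nginx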


Sysprep

Sysprep is a utility that prepares a Windows installation for duplication (imaging) across
multiple systems. Sysprep is most often used to generalize a Windows installation.

During generalization, Sysprep removes system-specific information and settings such as the
security identifier (SID) and leaves installed applications untouched.

You can capture an image of the generalized installation and use the image with an answer
file to customize the installation of Windows on other systems. The answer file contains the
information that Sysprep needs to complete an unattended installation.

Sysprep customization requires a reference image:

1. Log into the Web Console and browse to the VM dashboard.

2. Select a VM to clone, click Launch Console, and log in with Administrator credentials.

3. Configure Sysprep with a system cleanup. Specify whether or not to generalize the
installation, then choose to shut down the VM.

Note: Do not power on the VM after this step!
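If you prefer to generalize from the command line instead of the System Preparation Tool
dialog, the standard Windows Sysprep executable supports the same options; this is a hedged
example, so adjust the switches to your environment:
C:\Windows\System32\Sysprep\sysprep.exe /generalize /oobe /shutdown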

Customizing a VM
When cloning a VM, mark the Custom Script checkbox.


You must create a reference image before customizing Sysprep. For more information, please
see the Creating a Reference Image section of the Prism Web Console Guide on the Support
Portal.

Guest VM Data Management


Guest VM Data: Standard Behavior

Hosts read and write data in shared Nutanix datastores as if they were connected to a SAN.
From the perspective of a hypervisor host, the only difference is the improved performance
that results from data not traveling across a traditional SAN. VM data is stored locally and
replicated on other nodes for protection against hardware failure.

When a guest VM submits a write request through the hypervisor, that request is sent to
the Controller VM on the host. To provide a rapid response to the guest VM, this data is first
stored on the metadata drive within a subset of storage called the oplog. This cache is rapidly
distributed across the 10 GbE network to other metadata drives in the cluster.

Oplog data is periodically transferred to persistent storage within the cluster. Data is written
locally for performance and replicated on multiple nodes for high availability.

When the guest VM sends a read request through the hypervisor, the Controller VM reads
from the local copy first. If the host does not contain a local copy, then the Controller VM reads
across the network from a host that does contain a copy. As remote data is accessed, the
remote data is migrated to storage devices on the current host so that future read requests can
be local.

Live Migration

The Nutanix Enterprise Cloud Computing Platform fully supports live migration of VMs, whether
initiated manually or through an automatic process. All hosts within the cluster have visibility
into shared Nutanix datastores through the Controller VMs. Guest VM data is written locally and
is also replicated on other nodes for high availability.

If you migrate a VM to another host, future read requests are sent to a local copy of the data
(if it exists). Otherwise, the request is sent across the network to a host that does contain the
requested data. As remote data is accessed, the remote data is migrated to storage devices on
the current host, so that future read requests can be local.

High Availability

The built-in data redundancy in a Nutanix cluster supports high availability provided by the
hypervisor. If a node fails, all HA-protected VMs can be automatically restarted on other
nodes in the cluster. Virtualization management VM high availability may implement admission
control to ensure that, in case of node failure, the rest of the cluster has enough resources to
accommodate the VMs. The hypervisor management system selects a new host for the VMs
that may or may not contain a copy of the VM data.

If the data is stored on a node other than the VM’s new host, then read requests are sent across
the network. As remote data is accessed, the remote data is migrated to storage devices on the
current host so that future read requests can be local. Write requests are sent to local storage
and replicated on a different host. During this interaction, the Nutanix software also creates new
copies of preexisting data to protect against future node or disk failures.

Data Path Redundancy

The Nutanix cluster automatically selects the optimal path between a hypervisor host and its
guest VM data. The Controller VM has multiple redundant paths available, which makes the
cluster more resilient to failures.

Labs
1. Uploading an image

2. Creating a Windows virtual machine

3. Creating a Linux virtual machine

4. Using dynamic VM resource management 

Module 6: Health Monitoring and Alerts

Overview
After completing this module, you will be able to:

• Monitor cluster health


• Use the Health dashboard and other components

• Configure a health check

• Monitor alerts and events

• Configure alert email settings for a cluster

Health Monitoring
Nutanix provides a range of status checks to monitor the health of a cluster.

• Summary health status information for VMs, hosts, and disks displays on
the Home dashboard.

• In-depth health status information for VMs, hosts, and disks is available through
the Health dashboard.

You can:

• Customize the frequency of scheduled health checks.

• Run Nutanix Cluster Check (NCC) health checks directly from Prism.

• Collect logs for all the nodes and components.

Note: If the Cluster Health service status is down for more than 15 minutes, an
alert email is sent by the AOS cluster to configured addresses and to Nutanix
support (if selected). In this case, no alert is generated in the Prism web
console. The email is sent once every 24 hours. You can run the NCC check
cluster_services_down_check to see the service status.
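For example, the check named in the note above can be run directly from any CVM. The path
below assumes the check is grouped under system_checks, which is where it appears in current
NCC releases:
nutanix@CVM$ ncc health_checks system_checks cluster_services_down_check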

Health Dashboard
The Health dashboard displays dynamically updated health information about VMs, hosts, and
disks in the cluster. To view the Health dashboard, select Health from the drop down menu on
the left of the main menu.


The Health dashboard is divided into three columns:

The left column displays tabs for each entity type (VMs, hosts, disks, storage pools, storage
containers, cluster services, and [when configured] protection domains and remote sites). Each
tab displays the entity total for the cluster (such as the total number of disks) and the number
in each health state. Clicking a tab expands the displayed information (see the following
section).

The middle column displays more detailed information about whatever is selected in the left
column.

The right column displays a summary of all the health checks. You also have the option to view
individual checks from the Checks button (success, warning, failure, or disabled).

• The Summary tab provides a summarized view of all the health checks according to check
status and check type.
• The Checks tab provides information about individual checks. Hovering the cursor over an
entry displays more information about that health check. You can filter checks by clicking
the appropriate field type and clicking Apply.

• The Actions tab provides you with options to manage checks, run checks, and collect logs.

Configuring Health Checks


A set of automated health checks are run regularly. They provide a range of cluster health
indicators. You can specify which checks to run and configure the schedules for the checks and
other parameters.

Cluster health checks cover a range of entities including AOS, hypervisor, and hardware
components. A set of checks are enabled by default, but you can run, disable, or reconfigure
any of the checks at any time to suit your specific needs.

To configure health checks, from the Actions menu on the Health dashboard, click Manage
Checks.


The displayed screen lists all checks that can be run on the cluster, divided into categories
including CVM, Cluster, Data Protection, File Server, Host, and so on. Sub-categories include
CPU, disk, and hardware for CVMs; Network, Protection Domains, and Remote Sites for
Clusters; CPU and disk for hosts; and so on.

Selecting a check from the left pane will allow you to:

• View a history of all entities evaluated by this check, displayed in the middle of the screen.

• Run the check.

• Turn the check off.

• View causes and resolutions, as well as supporting reference articles on the Nutanix
Knowledge Base.

Setting NCC Frequency


Nutanix Cluster Check (NCC) is a framework of scripts that can help diagnose cluster health.
You can run individual or multiple simultaneous health checks from either the Prism Web
Console or the command line. When run from the CVM command line, NCC generates a log file
with the output of diagnostic commands selected by the user. A similar log file is generated
when the web console is used, but it is less easy to read than the one generated by the
command line.

NCC allows administrators to run a multitude of health checks, identify misconfigurations,
collect logs, and monitor checks via email. It's a good practice to run NCC before or after
performing major activities on a cluster. NCC should be run:

• After a new install

• Before and after activities such as adding, removing, reconfiguring, or upgrading nodes

• As a first step before troubleshooting an issue
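As a quick reference, the full set of health checks can be run from any CVM with the standard
NCC command below; on large clusters the run can take several minutes:
nutanix@CVM$ ncc health_checks run_all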

Set NCC Frequency allows you to configure the run schedule for Nutanix Cluster Checks and
view e-mail notification settings.

You can run cluster checks:

• Every 4 hours

• Every day

• Every week

You can set the day of the week and the start time for the checks where appropriate.

A report is sent to all e-mail recipients shown. You can configure e-mail notifications using
the Alert Email Configuration menu option.

Collecting Logs
Logs for your Nutanix cluster and its various components can be collected directly from the
Prism web console. Logs can be collected for Controller VMs, file server, hardware, alerts,
hypervisor, and for the system. The most common scenarios in which you will need to collect
logs are when troubleshooting an issue, or when you need to provide information for a Nutanix
Support case.

1. On the Health dashboard, click Actions on the right pane and select Log Collector.


2. Select the period for which you want to collect logs, either by choosing a duration in hours
or by setting a custom date range.

3. Run Log Collector.

After you run Log Collector and the task completes, the bundle will be available to download.

Analysis Dashboard
The Analysis dashboard allows you to create charts that can dynamically monitor a variety of
performance measures.

 The Analysis dashboard includes three sections.

 Chart definitions

The pane on the left lists the charts that can be run. No charts are provided by default, but you
can create any number of charts. A chart defines the metrics to monitor.

 Chart monitors

When a chart definition is checked, the monitor appears in the middle pane. An Alerts monitor
always displays first. The remaining displayed monitors are determined by which charts are
checked in the left pane. You can customize the display by selecting a time interval from the
Range drop-down (above the charts) and then refining the monitored period by moving the
time interval end points to the desired length.

Alerts

Any alerts that occur during the interval specified by the timeline in the middle pane display in
the pane on the right.


Understanding Metric and Entity Charts


A metric chart monitors the performance of a single metric on one or more entities. For
example, you can create a single chart that monitors the content cache hits for multiple hosts
within a cluster.

An entity chart monitors the performance of one or more metrics for a single entity. For
example, you can create a single entity chart that monitors a particular host, for metrics such
as Disk I/O Bandwidth for Reads, Disk I/O Bandwidth for Writes, and Disk IOPS.

To create either a metric or an entity chart:


1. On the Analysis dashboard, click New and select either a Metric chart or an Entity chart.

• For Metric charts, select the metric you want to monitor, the entity type, and then a list of
entities.

• For Entity charts, select the entity type, then the specific entity and all the metrics you want
to monitor on that entity.

Alerts Dashboard
The Alerts dashboard displays alert and event messages.

Alerts View

Two viewing modes are available: Alerts, and Events. The Alerts view, shown above, lists
all alerts messages and can be sorted by source entity, impact type, severity, resolution,
acknowledgement, and time of creation.

These are the fields in the Alerts view:


(selection box): Click this box to select the message for acknowledgement or resolution.
Values: n/a

Configure (button): Allows you to configure Alert Policies and email notification settings for
your cluster. Values: Alert Policies, Email Configuration

Title: Displays the alert message. Values: (message text)

Source Entity: Displays the name of the entity to which this alert applies, for example host or
cluster. Values: (entity name)

Severity: Displays the severity level of this condition. There are three levels. Values: Critical,
Warning, Informational

Critical: A "critical" alert is one that requires immediate attention, such as a failed Controller
VM.

Warning: A "warning" alert is one that might need attention soon, such as an issue that could
lead to a performance problem.

Informational: An "informational" alert highlights a condition to be aware of, for example, a
reminder that the support tunnel is enabled.

Resolved: Indicates whether a user has set the alert as resolved. Resolving an error means you
set that error as fixed. (The alert may return if the condition is scanned again at a future point.)
If you do not want to be notified about the condition again, turn off the alert for this condition.
Values: (user and time), No

Acknowledged: Indicates whether the alert has been acknowledged. Acknowledging an alert
means you recognize the error exists (no more reminders for this condition), but the alert
status remains. Values: (user and time), No

Create Time: Displays the date and time when the alert occurred. Values: (time and date)


Documentation: Displays Cause and Resolution links that pop up an explanation of the alert
cause and resolution when you hover the cursor over the link. Values: (text description of cause
or resolution)

(pencil icon): Clicking the pencil icon opens the Update Alert Policy window at that message.
Values: n/a

Configuring Alert Email Settings

Alert email notifications are enabled by default. This feature sends alert messages automatically
to Nutanix customer support through customer-opened ports 80 or 8443. To automatically
receive email notification alerts, ensure that nos-alerts and nos-asup recipients are added to the
accepted domain of your SMTP server. To customize who should receive the alert e-mails (or to
disable e-mail notification), do the following:

On the Alerts dashboard, click Configure and select Email Configuration.

The Email Configuration page allows you to customize:

• Your alert email settings

• The rules that govern when and to whom emails will be sent

• The template that will be used to send emails

Alert Email Settings


Alert Email Rules

Alert Email Templates

Events View
The Event messages view displays a list of event messages. Event messages describe cluster
actions such as adding a storage pool or taking a snapshot. This view is read-only and you do
not need to take any action like acknowledging or resolving generated events.


To filter the list, click the filter icon on the right of the screen. This displays a pane (on the
right) for selecting filter values. Check the box for each value to include in the filter. You can
include multiple values. The values are for event type (Behavioral Anomaly, System Action, User
Action) and time range (Last 1 hour, Last 24 hours, Last week, From XXX to XXX). You can also
specify a cluster. The selected values appear in the filter field above the events list. You can do
the following in the current filters field:

• Remove a filter by clicking the X for that filter.

• Remove all filters by clicking Clear (on the right).

• Save the filter list by clicking the star icon. You can save a maximum of 20 filter lists per
entity type.

• Use a saved filter list by selecting from the drop down list.

These are the fields in the Events view:

Title: Displays the event message. Values: (message text)

Source Entities: Displays the entity (such as a cluster, host, or VM name) to which the event
applies. A comma separated list appears if it applies to multiple entities. If there is an
associated details page, the entity is a live link; clicking the link displays the details page.
Values: (entity name)


Event Type: Displays the event category, for example, a user action like logging out, node
added, and so on. Values: Availability, Capacity, Configuration, Performance, System Indicator,
Behavioral Abnormality, DR

Create Time: Displays the date and time when the event occurred. Values: (time and date)

Labs
1. Creating a performance chart

2. Generating Write I/O

3. Managing alerts 

Module 7: Distributed Storage Fabric

Overview
After completing this module, you will be able to:

• Create a storage container


• Determine the correct capacity optimization method based on workload

• Configure deduplication, compression, and erasure coding on Nutanix containers

• Explain how Nutanix capacity optimization features work

• Understand how hypervisors on a Nutanix cluster integrate with products from other
vendors

Understanding the Distributed Storage Fabric


DSF is a distributed storage architecture that replaces traditional SAN/NAS solutions.

The Distributed Storage Fabric (DSF) is a scalable distributed storage system which exposes
NFS/SMB file storage as well as iSCSI block storage with no single points of failure. The
distributed storage fabric stores user data (VM disk/files) across storage tiers (SSDs, Hard
Disks, Cloud) on multiple nodes. The DSF also supports instant snapshots, clones of VM disks
and other advanced features such as deduplication, compression and erasure coding.

The DSF logically divides user VM data into extents which are 1MB in size. These extents may
be compressed, erasure coded, deduplicated, snapshotted or left untransformed. Extents can
also move around; new or recently accessed extents stay on faster storage (SSD) while colder
extents move to HDD. The DSF utilizes a “least recently used” algorithm to determine what data
can be declared “cold” and migrated to HDD. Additionally, the DSF attempts to maintain data
locality for VM data – so that one copy of each vDisk’s data is available locally from the CVM on
the host where the VM is running.


DSF presents SSDs and HDDs as a storage pool and provides cluster-wide storage services:

• Snapshots

• Clones

• HA/DR

• Deduplication

• Compression

• Erasure coding

The Controller VMs (CVMs) running on each node combine to form an interconnected
network within the cluster, where every node in the cluster has access to data from shared
SSD, HDD, and cloud resources. The CVMs allow for cluster-wide operations on VM-centric
software-defined services: snapshots, clones, high availability, disaster recovery, deduplication,
compression, erasure coding, storage optimization, and so on.

Hypervisors (AHV, ESXi, Hyper-V) and the DSF communicate using the industry-standard
protocols NFS, iSCSI, and SMB3.

The Extent Store

The extent store is the persistent bulk storage of DSF. It spans SSD and HDD and is extensible
to facilitate additional devices/tiers. Data entering the extent store is either drained from the
OpLog or is sequential in nature and has bypassed the OpLog directly.

Nutanix ILM will determine tier placement dynamically based upon I/O patterns and will move
data between tiers.

The OpLog

The OpLog is similar to a filesystem journal and is used to service bursts of random write
operations, coalesce them, and then sequentially drain that data to the extent store. For each
write OP, the data is written to disk locally and synchronously replicated to another n number
of CVM’s OpLog before the write is acknowledged for data availability purposes (where “n” is
the RF of the container, 2 or 3).


All CVMs participate in OpLog replication. Individual replica location is dynamically chosen based
upon load. The OpLog is stored on the SSD tier on the CVM to provide extremely fast write I/O
performance. OpLog storage is distributed across the SSD devices attached to each CVM.

For sequential workloads, the OpLog is bypassed and the writes go directly to the extent store. 

If data is currently sitting in the OpLog and has not been drained, all read requests will be
directly fulfilled from the OpLog until they have been drained, where they would then be served
by the extent store/unified cache.

For containers where fingerprinting (aka Dedupe) has been enabled, all write I/Os will be
fingerprinted using a hashing scheme allowing them to be deduplicated based upon fingerprint
in the unified cache.

Guest VM Write Request

Going through the hypervisor, DSF sends write operations to the CVM on the local host, where
they are written to either the OpLog or Extent Store. In addition to the local copy, an additional
write operation is then distributed across the 10 GbE network to other nodes in the cluster.

Guest VM Read Request

Going through the hypervisor, read operations are sent to the local CVM which returns data
from a local copy. If no local copy is present, the local CVM retrieves the data from a remote
CVM that contains a copy.

The file system automatically tiers data across different types of storage devices using
intelligent data placement algorithms. These algorithms make sure that the most frequently
used data is available in memory or in flash for the fastest possible performance.

Data Storage Representation


Storage Components
• Storage Pool

A storage pool is a group of physical storage devices for the cluster including PCIe
SSD, SSD, and HDD devices. The storage pool spans multiple nodes and scales as the
cluster expands. A storage device can only be a member of a single storage pool. Nutanix
recommends creating a single storage pool containing all disks within the cluster.

ncli sp ls displays existing storage pools.

• Storage Container

A storage container is a subset of available storage within a storage pool. Storage containers
enable an administrator to apply rules or transformations such as compression to a data set.
They hold the virtual disks (vDisks) used by virtual machines. Selecting a storage pool for a
new storage container defines the physical disks where the vDisks are stored.

ncli ctr ls displays existing containers. (See the example command after this list.)

• Volume Group

A volume group is a collection of logically related virtual disks or volumes. It is attached to
one or more execution contexts (VMs or other iSCSI initiators) that share the disks in the
volume group. You can manage volume groups as a single unit.


Each volume group contains a UUID, a name, and iSCSI target name. Each disk in the volume
group also has a UUID and a LUN number that specifies ordering within the volume group.
You can include volume groups in protection domains configured for asynchronous data
replication (Async DR) either exclusively or with VMs.

Volume groups cannot be included in a protection domain configured for Metro Availability,
in a protected VStore, or in a consistency group for which application consistent
snapshotting is enabled.

• vDisk

A vDisk is a subset of available storage within a storage container that provides storage
to virtual machines. A vDisk is any file over 512 KB on DSF, including VMDKs and VM disks.
vDisks are broken up into extents, which are grouped and stored on physical disk as an
extent group.

• Datastore

A datastore is a hypervisor construct that provides a logical container for files necessary for VM
operations. In the context of the DSF, each container on a cluster is a datastore.
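As a brief example that ties these objects together, a storage container can be created against
an existing storage pool from the nCLI and then listed; the container and storage pool names
below are illustrative:
ncli> ctr create name=ctr-demo sp-name=default-storage-pool
ncli> ctr ls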

Understanding Snapshots and Clones


DSF provides native support for offloaded snapshots and clones which can be leveraged via
VAAI, ODX, ncli, REST, Prism, etc. Both the snapshots and clones leverage the redirect-on-write
algorithm which is the most effective and efficient.

Snapshots

Snapshots for a VM are crash consistent, which means that the VMDK on-disk images are
consistent with a single point in time. That is, the snapshot represents the on-disk data as if the
VM crashed. The snapshots are not, however, application consistent, meaning that application
data is not quiesced at the time the snapshot is taken. 

In order to take application-consistent snapshots, select the option to do so when configuring a
protection domain. Nutanix Guest Tools (NGT) should be installed on any VM that requires
application-consistent snapshots.

For a breakdown of the differences in snapshots for different hypervisors and operating
systems, with different statuses of NGT, see the following table.


ESXi and AHV snapshot behavior by NGT status:

Microsoft Windows Server Edition
• NGT installed and active, with pre_freeze and post_thaw scripts present: Nutanix script-based
VSS snapshots on both ESXi and AHV.
• NGT installed and active: Nutanix VSS-enabled snapshots on both ESXi and AHV.
• NGT not enabled: Hypervisor-based application-consistent or crash-consistent snapshots on
ESXi; crash-consistent snapshots on AHV.

Microsoft Windows Client Edition
• NGT installed and active, with pre_freeze and post_thaw scripts present: Nutanix script-based
VSS snapshots on both ESXi and AHV.
• NGT not enabled: Hypervisor-based snapshots or crash-consistent snapshots on ESXi;
crash-consistent snapshots on AHV.

Linux VMs
• NGT installed and active, with pre_freeze and post_thaw scripts present: Nutanix script-based
VSS snapshots on both ESXi and AHV.
• NGT not enabled: Hypervisor-based snapshots or crash-consistent snapshots on ESXi;
crash-consistent snapshots on AHV.

Clones vs Shadow Clones


A clone is a duplicate of a vDisk, which can then be modified.

A shadow clone, on the other hand, is a cache of a vDisk on all the nodes in the cluster. When a
vDisk is read by multiple VMs (such as the base image for a VDI clone pool), the cluster creates
shadow clones of the vDisk. Shadow clones are enabled by default.
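If you need to confirm or change the shadow clone setting, it is exposed as a cluster parameter
in the nCLI; the following is a hedged sketch based on the standard cluster edit-params syntax:
ncli> cluster edit-params enable-shadow-clones=true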


How Snapshots and Clones Impact Performance


As mentioned in the introduction to this module, a vDisk is composed of extents, which
are logically contiguous chunks of data. Extents are stored within extent groups, which are
physically contiguous data stored as files on the storage devices. When a snapshot or clone is
taken, the base vDisk is marked immutable and another vDisk is created where new data will be
written.

At creation, both vDisks have the same block map, which is a metadata mapping of the vDisk to
its corresponding extents. Unlike traditional approaches which require traversal of the snapshot
chain to locate vDisk data (which can add read latency), each vDisk has its own block map.
This eliminates any of the overhead normally seen by large snapshot chain depths and allows
multiple snapshots to be taken without any performance impact.

Capacity Optimization - Deduplication


Deduplication is a process that eliminates redundant data and reduces storage overhead;
conceptually, like incremental backup, it stores only data that is not already present.
Deduplication works with compression and erasure coding to optimize capacity efficiency.

• Ensures that only one unique instance of data is retained


• Replaces redundant data blocks with pointers to copies

• Supports both inline and post-process deduplication

Deduplication Process

The Elastic Deduplication Engine is a software-based feature of DSF that allows for data
deduplication in the capacity (Extent Store) and performance (Unified Cache) tiers. Incoming
data is fingerprinted during ingest using a SHA-1 hash at a 16 K granularity. This fingerprint is
then stored persistently as part of the written block’s metadata.

Contrary to traditional approaches, which utilize background scans requiring the data to be
reread, Nutanix creates the fingerprint inline on ingest. For data being deduplicated in the
capacity tier, the data does not need to be scanned or reread – matching fingerprints are
detected and duplicate copies can be removed.

Block-level deduplication looks within a file and saves unique iterations of each block. All the
blocks are broken into chunks. Each chunk of data is processed using an SHA-1 hash algorithm.
This process generates a unique number for each piece: a fingerprint.

The fingerprint is then compared with the index of existing fingerprints. If it is already in
the index, the piece of data is considered a duplicate and does not need to be stored again.
Otherwise, the new hash number is added to the index and the new data is stored.

If you update a file, only the changed data is saved, even if only a few bytes of the document
or presentation have changed. The changes do not constitute an entirely new file. This behavior
makes block deduplication (compared with file deduplication) far more efficient.
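
The fingerprint-and-index flow described above can be sketched in a few lines of Python. This is
an illustration only: the 16 KB chunk size matches the granularity mentioned in the text, but the
dictionaries standing in for the fingerprint index and chunk store are assumptions made for the
example.

# Minimal sketch of block-level deduplication as described above: fingerprint
# fixed-size chunks with SHA-1, store each unique chunk once, and keep reference
# counts per fingerprint.

import hashlib

CHUNK_SIZE = 16 * 1024

def dedup_ingest(data: bytes, store: dict, refcounts: dict) -> list:
    """Return the list of fingerprints that reconstruct `data`."""
    fingerprints = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        fp = hashlib.sha1(chunk).hexdigest()     # fingerprint computed inline on ingest
        if fp not in store:
            store[fp] = chunk                    # new unique chunk: store it
        refcounts[fp] = refcounts.get(fp, 0) + 1 # duplicate: just bump the refcount
        fingerprints.append(fp)
    return fingerprints

store, refcounts = {}, {}
payload = b"A" * CHUNK_SIZE * 3 + b"B" * CHUNK_SIZE   # three identical chunks + one unique
dedup_ingest(payload, store, refcounts)
print(len(store), "unique chunks stored for", len(payload) // CHUNK_SIZE, "logical chunks")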


However, block deduplication takes more processing power and uses a much larger index to
track the individual pieces.

To reduce metadata overhead, fingerprint reference counts (refcounts) are monitored during
the deduplication process. Fingerprints with low refcounts will be discarded. Full extents are
preferred for capacity tier deduplication in order to minimize fragmentation.

When used in the appropriate situation, deduplication makes the effective size of the
performance tier larger so that more active data can be stored.

• Deduplication allows the sharing of guest VM data on Nutanix storage tiers.

• Performance of guest VMs suffers when active data can no longer fit in the performance
tiers.

Deduplication Techniques
Inline deduplication is useful for applications with large common working sets.

• Removes redundant data in performance tier

• Allows more active data, can improve performance to VMs

• Software-driven; leverages hardware-assist capabilities


Post-process deduplication is useful for virtual desktops (VDI) with full clones.

• Reduces redundant data in capacity tier, increasing effective storage capacity of a cluster

• Distributed across all nodes in a cluster (global)

Capacity Optimization - Compression


Nutanix recommends using inline compression (compression delay = 0), because it compresses
only larger/sequential writes and does not affect random write performance. This also increases
the usable size of the SSD tier, increasing effective performance and enabling more data to sit
in the SSD tier.

For sequential data that is written and compressed inline, the RF copy of the data is
compressed before transmission, further increasing performance since it is sending less data
across the network.

Inline compression also pairs well with erasure coding. Compression itself can work in several
ways: an algorithm may represent a string of bits with a smaller string by using a dictionary for
the conversion between them, or it may insert a reference or pointer to a string of bits that the
program has already seen.

Text compression can be as simple as removing all unneeded characters, inserting a single
repeat character to indicate a string of repeated characters, and substituting a smaller bit
string for a frequently occurring bit string. Data compression can typically reduce a text file to
50 percent of its original size, and often significantly less.


Compression Process

Inline compression condenses sequential streams of data or large I/O sizes (>64K) as they are
written to the Extent Store (SSD + HDD). This includes data draining from the oplog as well as
sequential data that skips the oplog.

Offline compression initially writes the data in an uncompressed state and then leverages the
Curator framework to compress the data cluster-wide. When inline compression is enabled
but the I/O operations are random in nature, the data is written uncompressed in the oplog,
coalesced, and then compressed in memory before being written to the Extent Store.

Nutanix leverages LZ4 for initial data compression, which provides a very good blend between
compression and performance. For cold data, Nutanix uses LZ4HC to provide an improved
compression ratio.
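
The following Python sketch illustrates the inline vs. post-process decision described above. It
uses zlib only so that the example is self-contained; the platform itself uses LZ4 for the initial
pass and LZ4HC for cold data, and the threshold and function names here are assumptions.

# Simplified sketch of the inline vs. post-process decision. zlib is a stand-in codec
# purely so the example runs anywhere; thresholds and function names are illustrative.

import zlib

LARGE_IO_BYTES = 64 * 1024   # the sequential/large-write threshold mentioned in the text

def write_to_extent_store(data: bytes, is_sequential: bool, inline_enabled: bool) -> bytes:
    if inline_enabled and (is_sequential or len(data) > LARGE_IO_BYTES):
        # Inline: compress as the data is written to the Extent Store.
        return zlib.compress(data, level=1)          # fast codec, playing the LZ4 role
    # Random/small I/O: land uncompressed (via the oplog) and compress later.
    return data

def curator_post_process(uncompressed: bytes) -> bytes:
    # Offline pass: recompress cold data with a stronger setting (the LZ4HC role).
    return zlib.compress(uncompressed, level=9)

hot = write_to_extent_store(b"x" * 128 * 1024, is_sequential=True, inline_enabled=True)
cold = curator_post_process(b"x" * 128 * 1024)
print(len(hot), len(cold))   # both far smaller than the 131072-byte input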

Compression Technique Comparison


Inline compression
• Data compressed as it’s written

• LZ4, an extremely fast compression algorithm

Post-process (offline) compression

• Data is compressed after a configured delay

• Utilizes LZ4 compression initially

• Cold data recompressed with LZ4HC, a high-compression version of LZ4 algorithm

• No impact on normal I/O path

• Ideal for random-access batch workloads


Workloads and Dedup/Compression


Although both dedup and compression optimize the use of storage capacity, it is important to
understand which use cases and workloads benefit most from each.

Environments suitable for compression

Data compression tends to be more effective than deduplication in reducing the size of unique
information, such as images, audio, videos, databases, and executable files. 

Environments less suitable for compression

Workloads that frequently update data (for example, virtualized applications for power users,
such as CAD) are not good candidates for compression.

Environments suitable for deduplication

Deduplication is most effective in environments that have a high degree of redundant data,
such as virtual desktop infrastructure or storage backup systems.

Environments less suitable for deduplication


View Composer API for Array Integration (VCAI) snapshots, linked clones: By using Nutanix
VCAI snapshots, linked clones, or similar approaches, storage requirements for end-user VMs
are already at a minimum. In this case, the overhead of deduplication outweighs the benefits.

Deduplication and Compression Best Practices

Use Case                   Example                           Recommendation

User data                  File server, user data, vDisk     Post-process compression with 4-6 hour delay

VDI                        VMware View, Citrix XenDesktop    VCAI snapshots, linked clones, full clones with
                                                             inline dedup (not container compression)

Data processing            Hadoop, data analytics,           Inline compression
                           data warehousing

Transactional              Exchange, Active Directory,       Native application compression where available,
applications               SQL Server, Oracle                otherwise inline compression

Archive or backup          Handy Backup, SyncBack            Inline compression unless data is already
                                                             compressed

Note: Nutanix does not recommend turning on deduplication for VAAI (vStorage
APIs for Array Integration) clone or linked clone environments.

Capacity Optimization - Erasure Coding


The Nutanix platform leverages a replication factor (RF) for data protection and availability.


This method provides the highest degree of availability because it does not require reading
from more than one storage location or data recomputation on failure. However, this does
come at the cost of storage resources because it requires full copies.

To provide a balance between availability while reducing the amount of storage required, DSF
provides the ability to encode data using erasure coding (EC-X). Similar to the concept of
RAID-5 and RAID-6 where parity is calculated, EC-X encodes a strip of data blocks on different
nodes and calculates parity. In the event of a host or disk failure, the parity can be leveraged
to calculate any missing data blocks (decoding). In the case of DSF, the data block is an extent
group and each data block must be on a different node and belong to a different vDisk.

The number of data and parity blocks in a strip is configurable based upon the desired failures
to tolerate. The configuration is calculated using the number of <data blocks>/<number of
parity blocks>. For example, “RF2-like” availability (for example, N+1) could consist of three or
four data blocks and one parity block in a strip (for example, 3/1 or 4/1). “RF3-like” availability
(such as N+2) could consist of three or four data blocks and two parity blocks in a strip (such
as 3/2 or 4/2).
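
To see why these strip sizes matter, the short calculation below compares the usable fraction of
raw capacity for plain replication and for example EC-X strips, using the data-blocks/parity-blocks
notation above. This is illustrative arithmetic only, not a sizing tool.

# Quick comparison of the capacity cost of plain replication vs. an EC-X strip.

def usable_fraction_rf(rf: int) -> float:
    return 1.0 / rf                      # RF2 -> 50% usable, RF3 -> ~33% usable

def usable_fraction_ecx(data_blocks: int, parity_blocks: int) -> float:
    return data_blocks / (data_blocks + parity_blocks)

print(f"RF2        : {usable_fraction_rf(2):.0%} usable")
print(f"EC-X 4/1   : {usable_fraction_ecx(4, 1):.0%} usable  (RF2-like, N+1)")
print(f"RF3        : {usable_fraction_rf(3):.0%} usable")
print(f"EC-X 4/2   : {usable_fraction_ecx(4, 2):.0%} usable  (RF3-like, N+2)")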

EC-X is complementary to both compression and deduplication, so you will get even more data
reduction.

EC-X works on all-flash (SSD-only) as well as hybrid (HDD + SSD) configurations. Because EC-X
looks at data access patterns rather than where the data lives, it also applies to data in the SSD
tier. This is a significant benefit for getting a more usable SSD tier: EC-X works on write-cold
data, while the SSD tier is for keeping read-hot data. Data that is write-cold but read-hot should
stay where fast access is possible (the SSD tier), and EC-X lets that data take up less space
there.

EC-X Compared to Traditional RAID

Traditional RAID

• Bottlenecked by a single disk

• Slow rebuilds

• Hardware-defined

• Hot spares waste space

Erasure Coding

• Keeps resiliency unchanged

• Optimizes availability (fast rebuilds)


• Uses resources of entire cluster

• Increases usable storage capacity

EC-X increases the effective or usable capacity of a cluster. The savings after enabling EC-X are
in addition to the savings from deduplication and compression.

EC-X Process

Erasure coding is performed post-process and leverages the Curator MapReduce framework
for task distribution.

Since this is a post-process framework, the traditional write I/O path is unaffected. In this
scenario, primary copies of both RF2 and RF3 data are local and replicas are distributed on the
remaining cluster nodes.

When Curator runs a full scan, it finds eligible extent groups for encoding. Eligible extent
groups must be "write-cold", meaning they have not been overwritten for a defined amount of
time. For regular vDisks, this time period is 7 days. For snapshot vDisks, it is 1 day.

After erasure coding finds the eligible candidates, Chronos will distribute and throttle the
encoding tasks.

EC-X Pros and Cons

Pros

• Increases usable capacity of RAW storage.

• Potentially increases amount of data stored in SSD tier.

Cons

• Higher impact (read) in case of drive/node failure.

• Degrades performance for I/O patterns with high percentage of overwrites.

• Increases computational overhead


EC-X Workloads

Recommended workloads for erasure coding (workloads not requiring high I/O):

• Write Once, Read Many (WORM) workloads 

- Backups

- Archives

- File servers

- Log servers

- Email (depending on usage)

Workloads not ideal for erasure coding:

• Anything write/overwrite-intensive that increases the overhead on software-defined storage.


For example: VDI, which is typically very write-intensive.

• VDI is not capacity-intensive thanks to intelligent cloning (so EC-X advantages are minimal).

Erasure Coding in Operation

Once the data becomes cold, the erasure code engine computes double parity for the data
copies by taking the data blocks ('d') and performing exclusive OR (XOR) operations to create
the parity blocks. With the two parity blocks in place, the second and third copies of the data
are removed.

For example, with four data blocks protected at RF3, you start with 12 blocks (the original three
copies), add 2 parity blocks, and remove the 8 blocks that made up the second and third copies:
12 + 2 - 8 = 6 blocks, a storage savings of 50%.
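
The sketch below illustrates the encode/decode idea with simple XOR parity and then repeats
the capacity arithmetic from the paragraph above. It shows a single parity block for clarity; a
real double-parity implementation derives the second parity block with a different code so that
any two losses can be recovered. Block contents are toy values.

# Illustrative XOR-based encode/decode for an EC-X strip, plus the savings arithmetic.

from functools import reduce

def xor_blocks(blocks):
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# A 4/1 strip: four write-cold data blocks on different nodes plus one parity block.
data = [bytes([i]) * 8 for i in range(1, 5)]          # blocks a, b, c, d
parity = xor_blocks(data)

# Node holding block "a" fails: rebuild it from the surviving members (b, c, d, P).
rebuilt_a = xor_blocks(data[1:] + [parity])
assert rebuilt_a == data[0]

# Capacity math from the paragraph above (four data blocks, RF3, 4/2 strip):
before = 4 * 3        # 12 blocks: original three copies of four data blocks
after = 4 + 2         # 6 blocks: one copy of the data plus two parity blocks
print(f"{before} -> {after} blocks, savings {(1 - after / before):.0%}")   # 50%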


Data Block Restore After a Block Failure

A cluster should have one more node than combined strip size. This allows for rebuilding strips
in the event of a node failure.

Example: a 4/1 strip should have a minimum of six nodes.

Block “a” is rebuilt by decoding it from the remaining members of the EC-X strip (b+c+d+P) and
placing the rebuilt block on a node that does not contain other blocks of the same strip.

Erasure Coding Best Practices


• A cluster must have at least four nodes in order to enable erasure coding.

• Do not use erasure coding on datasets with many overwrites.

- Optimal for snapshots, file server archives, backups, and other “cold” data.

• Read performance may be degraded during failure scenarios.

• Erasure coding is a backend job; achieving savings might take time.

Viewing Overall Capacity Optimization

You can create, migrate, and manage VMs within Nutanix datastores as you would with any
other storage solution.


Hypervisor Integration

Overview
Nutanix Enterprise Cloud is designed for virtualization and includes a fully integrated,
enterprise-grade hypervisor (AHV). Several third-party hypervisors are also supported: 
• VMware ESXi

• Microsoft Hyper-V

• XenServer 
When you run a third-party hypervisor on Nutanix datastores, you can create, migrate, and
manage VMs identically to the way you would with any other storage solution.

AHV

AHV Networking

AHV leverages Open vSwitch (OVS) for all VM networking. Prism/aCLI is used to configure VM
networking and each VMnic is connected into an OVS tap interface. AHV supports the following
VM network interface types:

• Access (default)
• Trunked

By default, VM NICs are created as access interfaces, similar to what you'd see with a VM NIC
on a port group. However, it is possible to expose a trunked interface up to the VM's OS.

AHV Storage

Prism or aCLI is used to configure all AHV storage.


vSphere

vSphere Controller VM - vSwitch0

Each ESXi host has an internal vSwitch that is used for intra-host communication between the
Nutanix CVM and host. For external communication and VM network connectivity, a standard
vSwitch (default) or dvSwitch is leveraged.

The local vSwitch (vSwitchNutanix) is for communication between the Nutanix CVM and ESXi
host. The host has a VMkernel interface on this vSwitch (vmk1 - 192.168.5.1) and the CVM has an
interface bound to a port group on this internal switch (svm-iscsi-pg - 192.168.5.2). This is the
primary storage communication path.

The external vSwitch can be a standard vSwitch or a dvSwitch. This hosts the external
interfaces for the ESXi host and CVM as well as the port groups leveraged by VMs on the host.
Nutanix leverages the external VMkernel interface for host management (vMotion, and so on).
Nutanix uses the external CVM interface for communication to other Nutanix CVMs. You can
create as many port groups as required, assuming you enable the VLANs on the trunk.


vSwitchNutanix

Shared storage read/write requests go through the ESXi host.

• Host communicates with CVM through vSwitchNutanix virtual switch.

• Membership in port group is limited to CVM and its host.

Note: Do not modify settings of vSwitchNutanix or its port groups.

vSphere Storage

This naming convention enables scalability by preventing datastore name conflicts when
creating a cluster of two or more blocks.

The local datastore is stored in host memory and you should not use it for additional storage.

The Controller VM boots from an .iso file stored in the local datastore; the boot image scans all
local storage devices before locating the Controller VM boot files on the SATA-SSD.

All other storage devices, including the SSD-PCIe device, are attached directly to the Controller
VM as pass-through devices.


Nodes in the cluster can mount a Nutanix container as an NFS datastore to provide shared
storage for VM files.

The name of the NFS volume matches the name of the container.

Hyper-V

Hyper-V Controller VM

Each Hyper-V host has an internal virtual switch that is used for intra-host communication
between the Nutanix CVM and host. For external communication and VMs, an external virtual
switch (default) or logical switch is leveraged.

The internal switch (InternalSwitch) is for local communication between the Nutanix CVM
and Hyper-V host. The host has a virtual Ethernet interface (vEth) on this internal switch
(192.168.5.1) and the CVM has a vEth on this internal switch (192.168.5.2). This is the primary
storage communication path.


Hyper-V Networking

In SCVMM, the networking is displayed as a single external switch. The dual 10 GbE and 1 GbE
adapters are teamed together as a single entity.

The external vSwitch can be a standard virtual switch or a logical switch. This hosts the external
interfaces for the Hyper-V host and CVM as well as the logical and VM networks leveraged by
VMs on the host. The external vEth interface is leveraged for host management, live migration,
etc. The external CVM interface is used for communication to other Nutanix CVMs. You can
create as many logical and VM networks as required, assuming the VLANs are enabled on the
trunk.

Labs
1. Creating a container with compression enabled
2. Creating a container without compression

3. Comparing data in a compressed vs uncompressed container 

Module 8: Migrating Workloads to AHV

Objectives
When you have completed this module, you will be able to describe how to migrate workloads
using Nutanix Move.

Nutanix Move
Nutanix Move is a freely distributed application to support migration from a non-Nutanix source
to a Nutanix target with minimal downtime.

Nutanix Move supports three types of sources for migration.

• Migration of VMs running on an ESXi hypervisor managed by vCenter Server.

• Migration of Amazon Elastic Block Store (EBS) backed EC2 instances running on AWS.

• Migration of VMs running on a Hyper-V hypervisor.

The distributed architecture of Nutanix Move has the following components.

• Nutanix Move: a VM running on the Nutanix cluster to orchestrate the migration.

• NTNX-MOVE-AGENT: an agent running on AWS as an EC2 instance of type t2.micro. NTNX-
MOVE-AGENT interfaces with the source VM to facilitate migration, works with AWS APIs
to take snapshots, and transfers data from source to target. Move deploys the NTNX-MOVE-
AGENT in every region with the AWS account of the IAM user. When Move deletes the last
migration plan of the region, Move stops the NTNX-MOVE-AGENT instance. When Move
removes the source, the NTNX-MOVE-AGENT instance is terminated.


- When you are migrating from ESXi to AHV, Nutanix Move directly communicates with
vCenter through the Management Server and the Source Agent. The Source Agent
collects information about the VM being migrated (guest VM) from the VMware library.

- Hyper-V to AHV migration requires installation of the Move agent on each source
Hyper-V host. The agent is installed as a Windows service and must be running in order
to allow VM discovery and migration. Currently, both automatic and manual methods are
supported for Hyper-V Move agent deployment.

Note: Adding a single AWS account as a source with multiple IAM users is not
supported.

• Changed Block Tracking (CBT) driver: a driver running on the source VMs to be migrated to
facilitate efficient transfer of data from the source to the target. Move deploys the driver as
part of the source VM preparation and removes it during post migration cleanup.

In case of migration from AWS to AHV, NTNX-MOVE-AGENT runs on AWS as an EC2 instance
to establish connection between AWS and Nutanix Move. Nutanix Move takes snapshots of
the EBS volumes of the VMs for the actual transfer of data for the VM being migrated (guest
VM). The CBT driver computes the list of blocks that have changed to optimally transfer only
changed blocks of data on the disk. The data path connection between NTNX-MOVE-AGENT
and Nutanix Move is used to transfer data from AWS to the target Nutanix Cluster.

After the migration of the VM from the source to the target, Nutanix Move deletes all EBS
volume snapshots taken by it.

Note: Nutanix Move does not store other copies of the data.

Nutanix Move Operations


You can perform the following operations with Nutanix Move.

• Migrate powered on or powered off VMs.

Note: For AWS, the migration takes place in powered on state. For ESXi, the
power state is retained.

• Pause and resume migration.

• Schedule migration.

• Schedule data-seeding for the virtual machines in advance and cut over to a new AHV
cluster.
• Manage VM migrations between multiple clusters from a single management interface.

• Sort and group VMs for easy migration.

• Monitor details of migration plan execution, even at the individual VM level.

• Cancel in-progress migration for individual VMs.

• Migrate all AHV certified OSs (see the Supported Guest VM Types for AHV section of the
AHV Admin Guide on the Support Portal).


Compatibility Matrix

Software               Version Number

ESXi host version      5.1, 5.5, 6.0, 6.5, 6.7

vCenter                5.5, 6.0, 6.5, 6.7

Hyper-V                Windows Server 2012 with Hyper-V role (Standalone and Cluster)
                       Windows Server 2012 R2 with Hyper-V role (Standalone and Cluster)
                       Windows Server 2016 with Hyper-V role (Standalone and Cluster)
                       Microsoft Hyper-V Server 2012 (Standalone and Cluster)
                       Microsoft Hyper-V Server 2012 R2 (Standalone and Cluster)
                       Microsoft Hyper-V Server 2016 (Standalone and Cluster)

Unsupported Features
• IPv6

• VM names with non-English characters.

• VM names with single and double quotes.

• Windows VMs installed with any antivirus software. Antivirus software prevents the
installation of the VirtIO drivers.

Configuring Nutanix Move

Nutanix Move Migration

Downloading Nutanix Move


Download the Nutanix Move bundle from the Nutanix Move tab of the Nutanix Support Portal.


To get started with Nutanix Move, first download the Nutanix Move appliance and deploy it on
a target cluster. If you are migrating to multiple AHV clusters, you can deploy Nutanix Move on
any of the target clusters. Once the installation has completed, continue by configuring the
Move environment and building a Migration Plan using the Move interface.

Labs
1. Deploying a Move VM

2. Configuring Move

3. Configuring a Migration Plan

Module 9: Acropolis Services

Overview
After completing this module, you will be able to:

• Describe and configure Nutanix Volumes


• Describe Nutanix Files

Nutanix Volumes
Nutanix Volumes is a native scale-out block storage solution that enables enterprise
applications running on external servers to leverage the benefits of the hyperconverged
Nutanix architecture, accessing the Nutanix DSF via the iSCSI protocol.

Nutanix Volumes offers a solution for workloads that may not be a fit for running on virtual
infrastructure but still need highly available and scalable storage. For example, workloads
requiring locally installed peripheral adaptors, high socket quantity compute demands, or
licensing constraints.

Nutanix Volumes enables you to create a shared infrastructure providing block-level iSCSI
storage for physical servers without compromising availability, scalability, or performance. In
addition, you can leverage efficient backup and recovery techniques, dynamic load-balancing,
LUN resizing, and simplified cloning of production databases. You can use Nutanix Volumes to
export Nutanix storage for use with applications like Oracle databases including Oracle RAC,
Microsoft SQL Server, and IBM Db2 running outside of the Nutanix cluster.

Every CVM in a Nutanix cluster can participate in presenting storage, allowing individual
applications to scale out for high performance. You can dynamically add or remove Nutanix
nodes, and by extension CVMs, from a Nutanix cluster.


Upgrading a Nutanix cluster using Volumes is seamless and nondisruptive to applications.


Storage is always highly available with robust failure handling.

Nutanix manages storage allocation and assignment for Volumes through a construct called
a volume group (VG). A VG is a collection of “volumes,” more commonly referred to as virtual
disks (vDisks). Volumes presents these vDisks to both VMs and physical servers, which we refer
to as “hosts” unless otherwise specified.

vDisks represent logical “slices” of the ADSF’s container, which are then presented to the hosts
via the iSCSI protocol. vDisks inherit the properties (replication factor, compression, erasure
coding, and so on) of the container on which you create them. By default, these vDisks are
thinly provisioned. Because Nutanix uses iSCSI as the protocol for presenting VG storage, hosts
obtain access based on their iSCSI Qualified Name (IQN). The system uses IQNs as a whitelist
and attaches them to a VG to permit access by a given host. You can use IP addresses as an
alternative to IQNs for VG attachment. Once a host has access to a VG, Volumes discovers the
VG as one or more iSCSI targets. Upon connecting to the iSCSI targets, the host discovers the
vDisks as SCSI disk devices. The figure above shows these relationships.
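
The relationships described above can be modeled with a few lines of Python. The class and
attribute names below are illustrative assumptions (this is not a Nutanix API); the point is simply
that a volume group ties together a set of vDisks, a whitelist of IQNs or IP addresses, and a
sharing flag.

# Conceptual sketch of a volume group: vDisks plus a whitelist of client identifiers.

class VolumeGroup:
    def __init__(self, name, shared=False):
        self.name = name
        self.vdisks = []          # logical slices of a storage container
        self.whitelist = set()    # client IQNs or IP addresses allowed to attach
        self.shared = shared      # must be True for more than one initiator

    def add_vdisk(self, size_gb):
        self.vdisks.append({"index": len(self.vdisks), "size_gb": size_gb})

    def allow(self, client_id):
        self.whitelist.add(client_id)

    def attach(self, client_id):
        if client_id not in self.whitelist:
            raise PermissionError(f"{client_id} is not whitelisted for {self.name}")
        if len(self.whitelist) > 1 and not self.shared:
            raise RuntimeError("multiple initiators require the VG to be marked shared")
        # The client would now discover the vDisks as iSCSI targets / SCSI disks.
        return self.vdisks

vg = VolumeGroup("vg-sql", shared=False)
vg.add_vdisk(500)
vg.allow("iqn.1991-05.com.microsoft:sql01")
print(vg.attach("iqn.1991-05.com.microsoft:sql01"))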

Nutanix Volumes Use Cases

• Shared disks (Oracle RAC, Microsoft failover clustering).

• Disks as first-class entities - execution contexts are ephemeral and data is critical.

• Guest-initiated iSCSI supports bare-metal consumers and Microsoft Exchange on vSphere.

See Converting Volume Groups and Updating Clients to use Volumes for more information. 

iSCSI Qualified Name (IQN)

iSCSI Qualified Name (IQN) is one of the naming conventions used by iSCSI to identify initiators
and targets. IQN is documented in RFC 3720 and can be up to 255 characters long. An IQN takes
the form iqn.yyyy-mm.naming-authority:unique-name, where:

• yyyy-mm: The year and month the naming authority was established.

• naming-authority: Usually a reverse syntax of the Internet domain name of the naming
authority.

• unique-name: Any name you want to use. For example:

iqn.1998-01.com.nutanix.iscsi:name999
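
A short sketch can make the IQN structure concrete. The regular expression below is a
simplified illustration of the format, not a complete RFC 3720 validator.

# Parse the IQN fields described above (simplified, illustrative pattern).

import re

IQN_RE = re.compile(
    r"^iqn\."
    r"(?P<year>\d{4})-(?P<month>\d{2})\."   # yyyy-mm the naming authority was established
    r"(?P<authority>[a-z0-9.-]+)"           # reversed domain of the naming authority
    r"(?::(?P<unique>.+))?$"                # optional unique name after the colon
)

def parse_iqn(iqn: str) -> dict:
    if len(iqn) > 255:
        raise ValueError("IQN may be at most 255 characters")
    match = IQN_RE.match(iqn)
    if not match:
        raise ValueError(f"not a well-formed IQN: {iqn}")
    return match.groupdict()

print(parse_iqn("iqn.1998-01.com.nutanix.iscsi:name999"))
# {'year': '1998', 'month': '01', 'authority': 'com.nutanix.iscsi', 'unique': 'name999'}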


Challenge-Handshake Authentication Protocol (CHAP) 


Challenge-Handshake Authentication Protocol (CHAP) authentication uses a shared secret
known to both the authenticator and the peer. CHAP provides protection against replay attacks
through the use of an incrementally changing identifier and a variable challenge value. CHAP
requires that both the client and server know the plaintext of the secret, although it is never
sent over the network.

With mutual CHAP authentication, the target and the initiator authenticate each other. CHAP
sets a separate secret for each target and for each initiator.
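
The exchange can be sketched as follows, using the standard RFC 1994 response calculation
(an MD5 digest over the identifier, the shared secret, and the challenge). The values and
function names are illustrative; in mutual CHAP the initiator performs the same verification of
the target, using a separate secret in each direction.

# Sketch of the CHAP challenge-response exchange described above.

import hashlib
import os

def chap_response(identifier: int, secret: bytes, challenge: bytes) -> bytes:
    return hashlib.md5(bytes([identifier]) + secret + challenge).digest()

# Target (authenticator) side: issue a fresh challenge for each login attempt.
identifier, challenge = 7, os.urandom(16)
secret = b"shared-secret-known-to-both-ends"       # never transmitted on the wire

# Initiator side: prove knowledge of the secret without transmitting it.
response = chap_response(identifier, secret, challenge)

# Target side: recompute and compare; a replayed old response fails on a new challenge.
assert response == chap_response(identifier, secret, challenge)
print("CHAP response verified")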

Attaching Initiators to Targets


The administrator has created two volume groups, volume group A and volume group B.

• Volume group A has three vDisks and volume group B has two vDisks.

• The hosts HostA and HostB have their iSCSI initiators configured to communicate with the
iSCSI target (data services IP).

• Volumes presents the vDisks to the initiators as LUNs.

Before we get to configuration, we need to configure the data services IP that will act as our
central discovery/login portal.

Volume groups (VGs) work with ESXi, Hyper-V, and AHV for iSCSI connectivity. AHV also
supports attaching VGs directly to VMs. In this case, the VM discovers the vDisks associated
with the VG over the virtual SCSI controller.

You can use VGs with traditional hypervisor vDisks. For example, some VMs in a Nutanix cluster
may leverage .vmdk or .vhdx based storage on Network File System (NFS) or Server Message
Block (SMB), while other hosts leverage VGs as their primary storage.

VMs utilizing VGs typically have at least their boot and operating system drives presented as
hypervisor vDisks. You can manage VGs from Prism or from a preferred CLI such as aCLI, nCLI,
or PowerShell. Within Prism, the Storage page lets you create and monitor VGs.

Nutanix Volumes presents a volume group and its vDisks as iSCSI targets and assigns IQNs.
Initiators or hosts have their IQNs attached to a volume group to gain access.


Configuring a Volume Group for Shared Access

Multiple hosts can share the vDisks associated with a VG for the purposes of shared storage
clustering. A common scenario for using shared storage is in Windows Server failover
clustering. You must explicitly mark the VG for sharing to allow more than one external initiator
or VM to attach.

In some cases, Volumes needs to present a volume group to multiple VMs or bare metal servers
for features like clustering. The graphic shows how an administrator can present the same
volume group to multiple servers.

Note: Allowing multiple systems to concurrently access this volume group can
cause serious problems.

Volume Group Connectivity Options


Volumes uses iSCSI redirection to control target path management for vDisk load balancing
and path resiliency.

Instead of configuring host iSCSI client sessions to connect directly to CVMs, Volumes uses
an external data services IP address. This data services IP acts as a discovery portal and initial
connection point. The data services address is owned by one CVM at a time. If the owner goes
offline, the address moves between CVMs, thus ensuring that it’s always available.


For failback, the default interval is 120 seconds. Once the affined Stargate has been healthy for
two or more minutes, the system quiesces and closes the session, forcing another logon back to
the affined Stargate.
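
The redirection and failback behavior can be modeled with a small sketch. The class below is
illustrative only (names and structures are assumptions), but it follows the rules described
above: log in through the data services IP, get redirected to a healthy CVM, and fail back once
the affined CVM has been healthy for the failback interval.

# Simplified model of iSCSI login redirection and failback through the data services IP.

FAILBACK_INTERVAL_S = 120

class DataServicesPortal:
    def __init__(self, affined_cvm, other_cvms):
        self.affined_cvm = affined_cvm
        self.other_cvms = other_cvms
        self.healthy = {cvm: True for cvm in [affined_cvm] + other_cvms}
        self.affined_healthy_for_s = 0

    def login(self):
        # Redirect to the affined CVM when healthy, otherwise to any healthy CVM.
        if self.healthy[self.affined_cvm]:
            return self.affined_cvm
        return next(c for c in self.other_cvms if self.healthy[c])

    def tick(self, seconds, current_session_cvm):
        """Advance time; return True if the session should be closed to force failback."""
        if self.healthy[self.affined_cvm]:
            self.affined_healthy_for_s += seconds
        else:
            self.affined_healthy_for_s = 0
        return (current_session_cvm != self.affined_cvm
                and self.affined_healthy_for_s >= FAILBACK_INTERVAL_S)

portal = DataServicesPortal("cvm-a", ["cvm-b", "cvm-c"])
portal.healthy["cvm-a"] = False
session = portal.login()                 # redirected to cvm-b while cvm-a is down
portal.healthy["cvm-a"] = True
if portal.tick(120, session):            # affined CVM stable again: force re-login
    session = portal.login()
print(session)                           # cvm-a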

Guest VM data management

Hosts read and write data in shared Nutanix datastores as if they were connected to a SAN.
Therefore, from the perspective of a hypervisor host, the only difference is the improved
performance that results from data not traveling across a network.

When a guest VM submits a write request through the hypervisor, the hypervisor sends that
request to Stargate on the local Controller VM. To provide a rapid response to the guest VM,
Volumes first stores this data on the metadata drive, within a subset of storage called the oplog.
This cache is rapidly distributed across the 10 GbE network to other metadata drives in the
cluster. Volumes periodically transfers oplog data to persistent storage within the cluster.

Volumes writes data locally for performance and replicates it on multiple nodes for high
availability.

When the guest VM sends a read request through the hypervisor, the Controller VM reads from
the local copy first, if present. If the host does not contain a local copy, then the Controller VM
reads across the network from a host that does contain a copy. As Volumes accesses remote
data, the remote data is migrated to storage devices on the current host so that future read
requests can be local.

Labs
1. Deploying Windows and Linux VMs

2. Creating a Volume Group for Windows

3. Configuring the Windows VM as an iSCSI Initiator

4. Configuring the Windows VM For Access to a Volume Group

5. Creating a Volume Group for Linux

6. Configuring the Linux VM as an iSCSI Initiator


7. Preparing the new disks for Linux

Nutanix Files
 Nutanix Files allows users to leverage the Nutanix platform as a highly available file server. 


Files is a software-defined, scale-out file storage solution that provides a repository for
unstructured data, such as

• home directories

• user profiles

• departmental shares

• application logs

• backups

• archives

Flexible and responsive to workload requirements, Files is a fully integrated, core component of
the Nutanix Enterprise Cloud.

Unlike standalone NAS appliances, Files consolidates VM and file storage, eliminating the need
to create an infrastructure silo. Administrators can manage Files with Nutanix Prism, just like
VM services, unifying and simplifying management. Integration with Active Directory enables
support for quotas and access-based enumeration, as well as self-service restores with the
Windows previous versions feature. All administration of share permissions, users, and groups
is done using the traditional Windows MMC for file management. Nutanix Files also supports
file server cloning, which lets you back up Files off-site and run antivirus scans and machine
learning without affecting production.

Files is fully integrated into Microsoft Active Directory (AD) and DNS. This allows all the secure
and established authentication and authorization capabilities of AD to be leveraged. 

Files is a scale-out approach that provides SMB and NFS file services to clients. Nutanix Files
server instances contain a set of VMs (called FSVMs). Files requires at least three FSVMs
running on three nodes to satisfy a quorum for High Availability.

Files is compatible with:

• Hypervisors: AHV, ESXi, Hyper-V

• File protocols: CIFS 2.1

• Compatible features: Async-DR

Nutanix Files Architecture

Nutanix Files consists of the following constructs just like any file server.


• File server: High level namespace. Each file server has a set of file services VMs (FSVM)
deployed.

• Share: A file share is a folder that can be accessed by machines over a network. Access
to these shares is controlled by special Windows permissions called NTACLs, which are
typically set by the administrator. By default, domain administrators have full access and
domain users have read-only access to the home share. General purpose shares give full
access to both domain administrators and domain users.

• Folder: Folders for storing files. Nutanix Files distributes folders across FSVMs.

Load Balancing and Scaling

The graphic above shows a high-level overview of File Services Virtual Machine (FSVM)
storage. Each FSVM leverages the Acropolis Volumes API for data storage. Files accesses the
API using in-guest iSCSI. This allows any FSVM to connect to any iSCSI target in the event of an
FSVM failure.

Load balancing occurs on two levels. First, a client can connect to any one of the FSVMs and
users can add FSVMs as needed. Second, on the storage side, Nutanix Files can redistribute
volume groups to different FSVMs for better load balancing across nodes. The following
situations prompt load balancing:

1. When removing an FSVM from the cluster, Files automatically load balances all its volume
groups across the remaining FSVMs.

2. During normal operation, the distribution of top-level directories becomes poorly balanced
due to changing client usage patterns or suboptimal initial placement.

3. When increased user demand necessitates adding a new FSVM, its volume groups are
initially empty and may require rebalancing.

Features

• Security descriptors

• Alternate data streams

• OpLocks

• Shared-mode locks (AHV, ESXi)

• Many-to-one replication

Networking

Nutanix Files uses an external (client-side) network and a storage network. The IP addresses for
each network come from user-defined VLANs and IP address ranges.

• Storage network: The storage network enables communication between the file server VMs
and the Controller VM.

• Client-side network: The external network enables communication between the SMB clients
to the FSVMs. This allows Windows clients to access the Nutanix Files shares. Files also uses
the external network to communicate to the Active Directory and domain name servers.

High Availability
Nutanix Files provides two levels of High Availability:

• Stargate path failures through Nutanix Volumes

• FSVM failures, by having another FSVM assume the failed FSVM's resources

To provide for path availability, Files leverages DM-MPIO within the FSVM, which has the active
path set to the local CVM by default.


CVM Failure

If a CVM goes offline because of failure or planned maintenance, Files disconnects any active
sessions against that CVM, triggering the iSCSI client to log on again. The new logon occurs
through the external data services IP, which redirects the session to a healthy CVM. When the
failed CVM returns to operation, the iSCSI session fails back: the FSVM session is logged off and
redirected back to the appropriate CVM.

Node and FSVM Failure

1.  Stop SMB and NFS services.

2.  Disconnect the volume group.

3.  Release the IP address and share and export locks.

4.  Register the volume group with FSVM-1.

5.  Present new shares and exports to FSVM-1 with eth1.


Node Failure

When a physical node fails completely, Files uses leadership elections and the local CVM to
recover. The FSVM sends heartbeats to its local CVM once per second, indicating its state and
that it is alive. The CVM keeps track of this information and can act during a failover. During a
node failure, an FSVM on that host can migrate to another host. Any loss of service of that FSVM
then follows the FSVM failure scenario below until the FSVM is restored on a new host.

FSVM Failure

When an FSVM goes down, the CVM unlocks the files from the downed FSVM and releases the
external address from eth1. The downed FSVM’s resources then appear on a running FSVM. The
internal Zookeeper instances store this information so that they can send it to other FSVMs if
necessary.

When an FSVM is unavailable, the remaining FSVMs volunteer for ownership of the shares
and exports that were associated with the failed FSVM. The FSVM that takes ownership of the
volume group informs the CVM that the volume group reservation has changed. If the FSVM
that attempts to take control of the volume group is already the leader for a different volume
group that it has volunteered for, it relinquishes leadership of the new volume group
immediately. This arrangement ensures distribution of volume groups, even if multiple FSVMs
fail.

The Nutanix Files Zookeeper instance tracks the original FSVM’s ownership using the storage
IP address (eth0), which does not float from node to node. Because FSVM-1’s client IP address
from eth1 is now on FSVM-2, client connections persist. The volume group and its shares and
exports are reregistered and locked to FSVM-2 until FSVM-1 can recover and a grace period has
elapsed.


When FSVM-1 comes back up and finds that its shares and exports are locked, it assumes that
an HA event has occurred. After the grace period expires, FSVM-1 regains control of the volume
group through the CVM.


Module 10: Data Resiliency

Overview
Data resiliency describes the number and types of failures a cluster can withstand. It is
determined by features such as redundancy factor and block or rack awareness.
After completing this module, you will be able to:

• Describe how component failures within the Nutanix cluster impact guest VM operations.
• Explain the recovery procedure for a given failure.
• Describe replication factor and its impact.
• Describe block and rack awareness.

Scenarios
 Component unavailability is an inevitable part of any datacenter lifecycle. The Nutanix
architecture was designed to address failures using various forms of hardware and software
redundancy.

A cluster can tolerate single failures of a variety of components while still running guest VMs
and responding to commands via the management console—all typically without a performance
penalty.

CVM Unavailability
A Nutanix node is a physical host with a Controller VM (CVM). Either component can fail
without impacting the rest of the cluster.


The Nutanix cluster monitors the status of CVMs in the cluster. If any Stargate process fails to
respond two or more times in a 30-second period, another CVM redirects hypervisor I/O on the
related host to another CVM. Read and write operations occur over the 10 GbE network until
the failed Stargate comes back online.

To prevent constant switching between Stargates, the data path is not restored until the
original Stargate has been stable for 30 seconds.
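
As a way to visualize the timing rules above, the following toy model applies them directly: two
or more missed responses within a 30-second window trigger redirection, and the local path is
restored only after 30 seconds of stability. Only the window sizes come from the text; the
implementation details are illustrative assumptions.

# Toy model of Stargate failure detection and data path restoration.

WINDOW_S = 30
FAILURES_TO_REDIRECT = 2
STABLE_S_TO_RESTORE = 30

class StargateHealth:
    def __init__(self):
        self.failure_times = []       # timestamps of missed responses
        self.redirected = False
        self.stable_since = None

    def report(self, now, responded):
        if not responded:
            self.failure_times = [t for t in self.failure_times if now - t <= WINDOW_S]
            self.failure_times.append(now)
            self.stable_since = None
            if len(self.failure_times) >= FAILURES_TO_REDIRECT:
                self.redirected = True          # forward I/O to another CVM
        else:
            self.stable_since = self.stable_since or now
            if self.redirected and now - self.stable_since >= STABLE_S_TO_RESTORE:
                self.redirected = False         # restore the local data path
                self.failure_times.clear()

sg = StargateHealth()
sg.report(0, False); sg.report(10, False)
print(sg.redirected)      # True: I/O redirected after two misses within 30 seconds
sg.report(20, True); sg.report(55, True)
print(sg.redirected)      # False: local Stargate stable for 30+ seconds, path restored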

What will users notice?

During the switching process, the host with a failed CVM may report that the shared storage
is unavailable. Guest VM IO may pause until the storage path is restored. Although the primary
copy of the guest VM data is unavailable because it is stored on disks mapped to the failed
CVM, the replicas of that data are still accessible.

As soon as the redirection takes place, VMs resume read and write I/O. Performance may
decrease slightly because the I/O is traveling across the network rather than across an internal
bus. Because all traffic goes across the 10 GbE network, most workloads do not diminish in a
way that is perceivable to users.

What happens if another one fails?

A second CVM failure has the same impact on the VMs on the other host, which means there
will be two hosts sending I/O requests across the network. More important is the additional risk
to guest VM data. With two CVMs unavailable, there are now two sets of physical disks that are
inaccessible. In a cluster with replication factor 2, there is now a chance that some VM data
extents are missing completely, at least until one of the failed CVMs resumes operation.

VM impact

• HA event: None

• Failed I/O operations: None

• Latency: Potentially, higher given I/O operations over the network

In the event of a CVM failure, the I/O operation is forwarded to another CVM in the cluster.

ESXi and Hyper-V handle this via a process called CVM autopathing, which leverages a Python
program called HA.py (like “happy”). HA.py modifies the routing table on the host to forward
traffic that is going to the internal CVM address (192.168.5.2) to the external IP of another CVM.
This enables the datastore to remain online - just the CVM responsible for serving the I/O
operations is remote. Once the local CVM comes back up and is stable the route is removed,
and the local CVM takes over all new I/O operations.


AHV leverages iSCSI multipathing, where the primary path is the local CVM and the two other
paths are remote. In the event where the primary path fails, one of the other paths becomes
active. Similar to autopathing with ESXi and Hyper-V, when the local CVM comes back online it
takes over as the primary path.

In the event that the node remains down for a prolonged period (for example, 30 minutes), the
CVM is removed from the metadata ring. It is joined back into the ring after it has been up and
stable for a period of time.

Node Unavailability
The built-in data redundancy in a Nutanix cluster supports High Availability (HA) provided by
the hypervisor. If a node fails, all HA-protected VMs can be automatically restarted on other
nodes in the cluster.

Curator and Stargate respond to two issues that arise from the host failure:

• When the guest VM begins reading across the network, Stargate begins migrating those
extents to the new host. This improves performance for the guest VM.

• Curator responds to the host and CVM being down by instructing Stargate to create new
replicas of the missing vDisk data.

What will users notice?

Users who are accessing HA-protected VMs will notice that their VM is unavailable while it is
restarting on the new host. Without HA, the VM needs to be manually restarted.

What if another host fails?

Depending on the cluster workload, a second host failure could leave the remaining hosts with
insufficient processing power to restart the VMs from the second host. Even in lightly loaded
clusters, the larger concern is additional risk to guest VM data. For example, if a second host/
CVM fails before the cluster heals and its physical disks are inaccessible, some VM data will be
unavailable. 

Remember, with replication factor 2 (RF2, set at the storage container level) there are two copies
of all data. If two nodes go offline simultaneously, it is possible to lose both the primary and the
replica data. If this is unacceptable, implement replication factor 3 at the storage container level,
or redundancy factor 3, which applies to the full cluster.

Drive Unavailability
Drives in a Nutanix node store four primary types of data:


• Persistent data (hot-tier and cold-tier)

• Storage metadata

• Oplog

• CVM boot files

Cold-tier persistent data is stored on the hard-disk drives (HDDs) of the node. Storage metadata,
the oplog, hot-tier persistent data, and CVM boot files are kept on the serial AT attachment
solid-state drive (SATA-SSD) in drive bay one. SSDs in a dual-SSD system are used for storage
metadata, the oplog, and hot-tier persistent data according to the replication factor of the
system. CVM boot and operating system files are stored on the first two SSD devices in a RAID-1
(mirrored) configuration. In all-flash nodes, data of all types is stored on the SATA-SSDs.

Note: On hardware platforms that contain peripheral component interconnect
express SSD (PCIe-SSD) drives, the SATA-SSD holds only the CVM boot files.
Storage metadata, oplog, and hot-tier persistent data reside on the PCIe-SSD.

Boot Drive (DOM) Unavailability

When a boot DOM (SATA DOM for NX hardware) fails, the node will continue to operate
normally as long as the hypervisor or CVM does not reboot. After a DOM failure, the hypervisor
or CVM on that node will no longer be able to boot as  their boot files reside on the DOM.

Note: The CVM restarts if a boot drive fails or if you remove a boot drive without
marking the drive for removal and the data has not successfully migrated.


Metadata drive failure

Cassandra uses up to 4 SSDs to store the database providing read and write access for cluster
metadata. 

Depending on the cluster RF (2 or 3), either three or five copies of each piece of metadata are
stored on the CVMs in the cluster.

When a metadata drive fails, the local Cassandra process will no longer be able to access its
share of the database and will begin a persistent cycle of restarts until its data is available. If
Cassandra cannot restart, the Stargate process on that CVM will crash as well. Failure of both
processes results in automatic IO redirection.

During the switching process, the host with the failed SSD may report that the shared storage is
unavailable. Guest VM IO on this host will pause until the storage path is restored.

After redirection occurs, VMs can resume read and write I/O. Performance may decrease
slightly, because the I/O is traveling across the network rather than across the internal network.
Because all traffic goes across the 10 GbE network, most workloads will not diminish in a way
that is perceivable to users.

Multiple drive failures in a single selected domain (node, block, or rack) are also tolerated.

Note: The Controller VM restarts if a metadata drive fails, or if you remove a
metadata drive without marking the drive for removal and the data has not
successfully migrated.

If Cassandra remains in a failed state for more than thirty minutes, the surviving Cassandra
nodes detach the failed node from the Cassandra database so that the unavailable metadata
can be replicated to the remaining cluster nodes. The process of healing the database takes
about 30-40 minutes. 

If the Cassandra process restarts and remains running for five minutes, the procedure to
detach the node is canceled. If the process resumes and is stable after the healing procedure
is complete, the node will be automatically added back to the ring. A node can be manually
added to the database using the nCLI command:
ncli> host enable-metadata-store id=cvm_id

Data drive failure

Each node contributes its local storage devices to the cluster storage pool. Cold-tier data is
stored in HDDs, while hot-tier data is stored in SSDs for faster performance. Data is replicated
across the cluster, so a single data drive failure does not result in data loss. Nodes containing
only SSD drives only have a hot tier.

When a data drive (HDD/SSD) fails, the cluster receives an alert from the host and immediately
begins working to create a second replica of any guest VM data that was stored on the drive.

What happens if another drive fails?

In a cluster with replication factor 2, losing a second drive in a different domain (node, block, or
rack) before the cluster heals can result in the loss of both replicas of some VM data. Although a
single drive failure does not have the same impact as a host failure, it is important to replace the
failed drive as soon as possible.


Network Link Unavailability

The physical network adapters on each host are grouped together on the external network.
Unavailability of a network link is tolerated with no impact to users if multiple ports are
connected to the network. 

The Nutanix platform does not leverage any backplane for internode communication. It relies on
a standard 10 GbE network.

All storage I/O for VMs running on a Nutanix node is handled by the hypervisor on a dedicated
private network. The I/O request is handled by the hypervisor, which then forwards the request
to the private IP on the local CVM. The CVM then performs the remote replication with other
Nutanix nodes using its external IP over the public 10 GbE network.

In most cases, read requests are serviced by the local node and are not routed to the 10 GbE
network. This means that the only traffic routed to the public 10 GbE network is data replication
traffic and VM network I/O. Cluster-wide tasks, disk balancing for example, generate I/O on the
10 GbE network.

What will users notice?

Each Nutanix node is configured at the factory to use one 10 GbE port as the primary pathway
for vSwitch0. Other 10 GbE ports are configured in standby mode. Guest VM performance does
not decrease in this configuration. If a 10 GbE port is not configured as the failover path, then
traffic fails over to a 1 GbE port. This failover reduces the throughput of storage traffic and
decreases the write performance for guest VMs on the host with the failed link. Other hosts may
experience a slight decrease as well, but only on writes to extents that are stored on the host
with the link failure. Nutanix networking best practices recommend removing 1 GbE ports from
each host’s network configuration.

What happens if there is another failure?

If both 10 GbE links are down, then the host will fail over to a 1 GbE port if it is configured as
a standby interface. This failover reduces the throughput of storage traffic and decreases the
write performance for guest VMs on the host with the failed link. Other hosts may experience
a slight decrease as well, but only on writes to extents that are stored on the host with the link
failure.

Redundancy Factor 3
By default, Nutanix clusters have an RF of 2, which tolerates the failure of a single node or drive.
The larger the cluster, the more likely it is to experience simultaneous failures. Multiple failures
can result in cluster unavailability until the failures are repaired.


Redundancy factor 3 (RF3) is a configurable option that allows a Nutanix cluster to withstand
the failure of two nodes or drives in different blocks.

Note: If a cluster is set to RF2, it can be converted to RF3 if sufficient nodes are
present. With RF3, 66% of the cluster's raw storage is consumed by data protection,
vs. 50% for RF2.
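
The capacity trade-off in the note can be checked with simple arithmetic, as in the sketch
below. It is illustrative only; real sizing should also reserve capacity for rebuilds and account for
other overheads.

# Usable capacity vs. replica overhead for RF2 and RF3 (illustrative arithmetic).

def capacity_breakdown(raw_tib: float, rf: int) -> tuple:
    usable = raw_tib / rf
    replica_overhead = raw_tib - usable
    return usable, replica_overhead

RAW_TIB = 100.0
for rf in (2, 3):
    usable, overhead = capacity_breakdown(RAW_TIB, rf)
    print(f"RF{rf}: {usable:.1f} TiB usable, {overhead:.1f} TiB ({overhead / RAW_TIB:.0%}) for replicas")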

RF3 features

• At least one copy of all guest VM data plus the oplog is available if two nodes fail.

• Under-replicated VM data is copied to other nodes.

• The cluster maintains five copies of metadata and five copies of configuration data.

- If two nodes fail at least three copies are available.

RF3 requirements

• A cluster must have at least five nodes for RF3 to be enabled.

• For guest VMs to tolerate the simultaneous failure of two nodes or drives in different blocks,
the data must be stored on storage containers with RF3.

• The CVM must be configured with enough memory to support RF3.


Block Fault Tolerant Data Placement

Block-aware placement of guest VM data.

Block-aware placement of guest VM data with block failure.

Stargate is responsible for placing data across blocks, and Curator makes data placement
requests to Stargate to maintain block fault tolerance.

New and existing clusters can reach a block fault tolerant state. New clusters can be block fault
tolerant immediately after being created if the configuration supports it. Existing clusters that
were not previously block fault tolerant can be made tolerant by reconfiguring the cluster in a
manner that supports block fault tolerance.

New data in a block fault tolerant cluster is placed to maintain block fault tolerance. Existing
data that was not in a block fault tolerant state is scanned by Curator and moved into a block
fault tolerant state.

Depending on the volume of data that needs to be relocated, it might take Curator several
scans over a period of hours to distribute data across the blocks.

Block fault tolerant data placement is on a best effort basis but is not guaranteed. Conditions
such as high disk usage between blocks may prevent the cluster from placing guest VM
redundant copy data on other blocks.

Redundant copies of guest VM data are written to nodes in blocks other than the block that
contains the node where the VM is running. The cluster keeps two copies of each write stored
in the oplog.
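
A simplified sketch of this placement rule is shown below. The node metadata and the
least-full selection policy are assumptions for illustration; the actual placement logic in Stargate
and Curator is more involved and, as noted above, best effort.

# Illustrative replica placement: prefer a node in a different block (or rack).

def pick_replica_node(nodes, primary, domain="block"):
    """nodes: list of dicts like {"name": ..., "block": ..., "rack": ..., "free_pct": ...}"""
    candidates = [n for n in nodes if n[domain] != primary[domain]]
    if not candidates:
        # Best effort: fall back to any other node rather than failing the write.
        candidates = [n for n in nodes if n["name"] != primary["name"]]
    return max(candidates, key=lambda n: n["free_pct"])   # prefer the least-full node

nodes = [
    {"name": "A1", "block": "blk-1", "rack": "rack-1", "free_pct": 40},
    {"name": "A2", "block": "blk-1", "rack": "rack-1", "free_pct": 70},
    {"name": "B1", "block": "blk-2", "rack": "rack-1", "free_pct": 55},
    {"name": "C1", "block": "blk-3", "rack": "rack-2", "free_pct": 65},
]
primary = nodes[0]
print(pick_replica_node(nodes, primary, domain="block")["name"])   # C1 (different block)
print(pick_replica_node(nodes, primary, domain="rack")["name"])    # C1 (different rack)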

The Nutanix Medusa component uses Cassandra to store metadata. Cassandra uses a ring-
like structure where data is copied to peers within the ring to ensure data consistency and
availability. The cluster keeps at least three redundant copies of the metadata, at least half of
which must be available to ensure consistency.

With block fault tolerance, the Cassandra peers are distributed among the blocks to ensure that
no two peers are on the same block. In the event of a block failure, at least two copies of the
metadata is present in the cluster.

Rack Fault Tolerance


Rack fault tolerance is the ability to provide a rack-level availability domain. With rack fault
tolerance, redundant copies of data are made and placed on nodes that are not in the same
rack.

Rack failure can occur in the following situations: 

• All power supplies fail within a rack


• Top-of-rack (TOR) switch fails

• Network partition, where one of the racks becomes inaccessible from the other racks

When rack fault tolerance is enabled, guest VMs can continue to run after the failure of one
rack (RF2) or two racks (RF3). The redundant copies of guest VM data and metadata exist on
other racks when one rack fails.

Fault Domain | Replication Factor | Minimum Number of Nodes | Minimum Number of Blocks | Minimum Number of Racks | Data Resiliency
Rack | 2 | 3 | 3 | 3* | 1 node or 1 block or 1 rack or 1 disk
Rack | 3 | 5 | 5 | 5* | 2 nodes or 2 blocks or 2 racks or 2 disks

* Erasure Coding with Rack Awareness: Erasure coding is supported on a rack-aware cluster.
You can enable erasure coding on new containers in rack-aware clusters provided the
minimums shown in the table above are met.

The table shows the level of data resiliency (simultaneous failure) provided for the following
combinations of replication factor, minimum number of nodes, minimum number of blocks, and
minimum number of racks.

Note: Rack Fault Tolerance is supported for AHV and ESXi only.

VM High Availability in Acropolis


In Acropolis-managed clusters, you can enable High Availability (HA) for the cluster to ensure
that VMs can be migrated and restarted on another host in case of failure.


HA can ensure sufficient cluster resources are available to accommodate the migration of VMs
in case of node failure. 

The Acropolis Master tracks node health by monitoring connections on all cluster nodes. When
a node becomes unavailable, Acropolis Master restarts all the VMs that were running on that
node on another node in the same cluster.

The Acropolis Master detects failures due to VM network isolation, which is signaled by a failure
to respond to heartbeats. 

HA Configuration Options

There are three AHV cluster HA configuration options:

• Reserved segments. On each node, some memory is reserved for failover of virtual machines
from a failed node. The Acropolis service calculates the amount of memory to reserve across
the cluster based on the virtual machine memory configuration. All nodes remain schedulable,
and their remaining resources are available for running VMs.

• Best effort (not recommended). No node or memory reservations are made in the cluster. If
a failure occurs, virtual machines are moved to other nodes based on the resources and
memory available on those nodes. This is not the preferred method: if insufficient resources
are available on the cluster or node, some of the virtual machines may not be powered on.

• Reserved host (only available via aCLI and not recommended). A full node is reserved for
HA of VMs in case of a node failure in the cluster. Virtual machines cannot be run, powered
on, or migrated to the reserved node during normal operation of the cluster. This mode only
works if all the nodes in the cluster have the same amount of memory.
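
The HA reservation mode can also be viewed and changed from the aCLI. The following is a sketch only: the reservation type value is an assumption recalled from Nutanix documentation and may differ between AOS versions, so check the output of acli ha.get and the aCLI reference for your release before changing anything:

# Show the current HA configuration
acli ha.get
# Switch to reserved segments (reservation type name is an assumption; verify before use)
acli ha.update reservation_type=kAcropolisHAReserveSegments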

Flash Mode
Flash Mode for VMs and volume groups (VG) is a feature that allows an administrator to set the
storage tier preference for a VM or VG that may be running latency-sensitive, mission critical
applications. 

For example, in a cluster running mission-critical applications (such as a SQL database)
alongside other workloads, the application's working set may be too large to fit into the SSD
tier and could potentially migrate to the HDD tier. For extremely latency-sensitive workloads,
migration to the HDD tier could seriously affect read and write performance.


Flash Mode is available for VMs and VGs on AHV and ESXi; ESXi hosts must be managed by
vCenter Server, and Hyper-V supports Flash Mode for VGs only. You activate the feature at the
per-VM or per-VG level, and all virtual disks belonging to the activated VM or VG are also
enabled for Flash Mode. You can then disable Flash Mode on individual virtual disks of a VM or
VG that has the feature enabled.

You can configure Flash Mode for new or existing VMs, VGs, and their associated virtual disks.
When you add a new virtual disk to a Flash Mode–enabled VM or VG, you also pin that virtual
disk to the SSD.

Enabling Flash Mode on a VM or VG does not automatically migrate its data from the cold tier
to the hot tier. The information lifecycle management (ILM) mechanism manages hot tier data
migration. However, once Flash Mode is enabled for a VM in a hybrid system, data for business-
critical and latency-sensitive applications does not migrate from SSD to HDD.

Flash Mode provides all-flash performance for specific workloads on your existing hybrid
infrastructure without requiring you to purchase an all-flash system.

By default, you can use 25% of the SSD tier of the entire cluster as Flash Mode for VMs or VGs.

If the amount of VM data pinned to SSD exceeds 25% of the cluster’s SSD capacity, the system
may migrate data for pinned vDisks to HDD, depending on hot tier usage. Prism will alert when
this threshold is exceeded. When this occurs, you should evaluate the hot tier usage of the
cluster. Either reduce the amount of data pinned to SSD or add additional hot tier capacity to
the cluster.

If this feature is enabled for a VM, all the virtual disks that are attached to the VM are
automatically pinned to the SSD tier. Also, any virtual disks added to this VM after Flash Mode is
configured are automatically pinned. However, a VM’s configuration can be modified to remove
Flash Mode from any virtual disks. “VM pinning” is a feature available for both virtual machines
and volume groups.

Note: Before using Flash Mode, make the maximum use of capacity optimization
features such as deduplication, compression, and erasure coding (EC-X).

VM Flash Mode Considerations

Note: VM Flash Mode is recommended only for high-performance latency-sensitive


applications. VM Flash Mode reduces the ability of the DSF to manage workloads in
a dynamic manner. Use it as a last resort only.

Use VM Flash Mode for latency-sensitive applications

• Use VM Flash Mode for workloads that run on a schedule that initiates data migration to the
cold tier between jobs.


• Activate the entire virtual disk if possible.

• When using VM Flash Mode to place a portion of the virtual disk in the hot tier, data that you
want to stay hot may migrate to cold tier. This occurs when other noncritical data is written
to the virtual disk and uses hot tier space.

• Activate for read-heavy workloads.

Flash Mode Configuration

Figure: Verification of Flash Mode settings using Prism.

Flash mode is configured when you update the VM configuration. In addition to modifying the
configuration, you can attach a volume group to the VM and enable flash mode on the VM. If
you attach a volume group to a VM that is part of a protection domain, the VM is not protected
automatically. Add the VM to the same consistency group manually.

To enable flash mode on the VM, select the Enable Flash Mode check box.
After you enable this feature on the VM, the status is updated in the VM table view. To view the
status of individual virtual disks (disks that are pinned to the SSD tier), go to the Virtual Disks
tab in the VM table view.

You can disable the flash mode feature for individual virtual disks. To update the flash mode
for individual virtual disks, click the update disk icon in the Disks pane and deselect the Enable
Flash Mode check box.

Affinity and Anti-Affinity Rules for AHV


You can specify scheduling policies for virtual machines on an AHV cluster. By defining these
policies, you can control placement of the virtual machines on the hosts within a cluster.

You can define two types of affinity policies:

• VM-VM anti-affinity policy


• VM-host affinity policy

VM-VM Anti-Affinity Policy

This policy prevents virtual machines from running on the same node. The policy forces VMs to
run on separate nodes so that application availability is not affected by a node failure. This policy
does not prevent the Acropolis Dynamic Scheduler (ADS) from taking necessary action in case
of resource constraints.

Note: Currently, you can only define VM-VM anti-affinity policy by using aCLI. For
more information, see Configuring VM-VM Anti-Affinity Policy.

Note: The anti-affinity policy is applied during the initial placement of VMs (when a VM
is powered on). The policy can be overridden by manually migrating a VM to the same
host as its opposing VM, when a host is put in maintenance mode, or during an HA
event. ADS will attempt to resolve any anti-affinity violations when they are detected.

Note: VM-VM affinity policy is not supported.

VM-Host Affinity policy

The VM-host affinity policy controls the placement of VMs. Use this policy to specify that
a selected VM can only run on the members of the affinity host list. This policy checks and
enforces where a VM can be hosted when you power on or migrate the VM.

Note: If you choose to apply a VM-host affinity policy, it limits Acropolis HA and
Acropolis Dynamic Scheduling (ADS): a virtual machine cannot be powered on or
migrated to a host that does not conform to the requirements of the affinity policy,
because this policy is mandatorily enforced.

Note: The VM-host anti-affinity policy is not supported.

Note: Select at least two hosts when creating a host affinity list to protect against
downtime in the case of a node failure. This configuration is always enforced; VMs
will not be moved from the hosts specified here, even in the case of an HA event.

Watch the following video to learn more about Nutanix affinity rules: https://youtu.be/rfHR93RFuuU

Limitations of Affinity Rules


• Even if a host is removed from a cluster, the host UUID is not removed from the host-affinity
list for a VM. Review and update any host affinity rules any time a node is removed from a
cluster.
• The VM-host affinity cannot be configured on a cluster that has HA configured using
reserved host method.

• You cannot remove the VM-host affinity for a powered on VM from Prism. You can use
the vm.affinity_unset vm_list aCLI command to perform this operation.
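
The following aCLI sketch illustrates the operations described above. The group, VM, and host names are hypothetical, and command syntax can vary between AOS versions, so verify each command against the aCLI reference for your release:

# Create a VM group and apply a VM-VM anti-affinity policy (group and VM names are placeholders)
acli vm_group.create app-anti-affinity
acli vm_group.add_vms app-anti-affinity vm_list=vm1,vm2
acli vm_group.antiaffinity_set app-anti-affinity
# Pin a VM to specific hosts, then remove the host affinity (host names are placeholders)
acli vm.affinity_set vm1 host_list=host-1,host-2
acli vm.affinity_unset vm1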

Labs
1. Failing a Node - VM High Availability

2. Configuring High Availability

3. Configuring Virtual Machine Affinity


4. Configuring Virtual Machine Anti-Affinity


Module 11: Data Protection

Overview
After completing this module, you will be able to:

• Understand asynchronous and synchronous replication options
• Understand Protection Domains, Consistency Groups, and migrate/activate procedures

• Understand Leap Availability Zones

VM-centric Data Protection Terminology

Disaster Recovery (DR)

Disaster Recovery (DR) is an area of failover planning that aims to protect an organization
from the effects of significant negative events. DR allows an organization to maintain or quickly
resume mission-critical functions following a disaster.

Recovery Point Objective (RPO)

• RPO is the maximum tolerated interval of time after a disruption during which data can be
lost without exceeding the allowable threshold.
• RPO designates the variable amount of data that will be lost or will have to be re-entered
during network downtime.


Example: If the snapshot interval and the RPO are 180 minutes and the outage lasts only
2 hours, you are still within the parameters that allow for recovery, and business processes can
proceed given the volume of data lost during the disruption.

Recovery Time Objective (RTO)

How much time does it take to recover after notification of business process disruption?

• RTO is therefore the duration of time and a service level within which a business process
must be restored after a disaster in order to avoid unacceptable consequences associated
with a break in continuity.
• RTO designates the amount of “real time” that can pass before the disruption begins to
seriously and unacceptably impede the flow of normal business operations.

Native (on-site) and Remote Data Replication Capabilities

• Data replication can be local or remote 


• Choose from backup or disaster recovery.

Local Replication

• This is also known as Time Stream, a set of snapshots.


• Snapshots are placed locally on the same cluster as the source VM.

Remote Replication

• Snapshots are replicated to one or more other clusters. 


• The remote cluster can be a physical cluster or the cloud.
• Synchronous [Metro] 
• Asynchronous


RPO and RTO Considerations

• Time Stream and Cloud: High RPO and RTO (hours) should be used for minor incidents.
• Synchronous and asynchronous: (near)-zero RPO and RTO should be used for major
incidents.

Time Stream
A time stream is a set of snapshots that are stored on the same cluster as the source VM or
volume group. Time stream is configured as an async protection domain without a remote site.
The Time Stream feature in Nutanix Acropolis gives you the ability to:

• Schedule and store VM-level snapshots on the primary cluster

• Configure retention policies for these snapshots


When a snapshot of a VM is initially taken on the Nutanix Enterprise Cloud Platform, the system
creates a read-only, zero-space clone of the metadata (the index to the data) and makes the
underlying VM data immutable, or read-only. No VM data or virtual disks are copied or moved.
The system creates a read-only copy of the VM that can be accessed like its active counterpart.

Nutanix snapshots take only a few seconds to create, eliminating application and VM backup
windows.

Nutanix Guest Tools (NGT) is a software bundle that can be installed on a guest virtual machine
(Microsoft Windows or Linux). It is a software based in-guest agent framework which enables
advanced VM management functionality through the Nutanix Platform.

The solution is composed of the NGT installer which is installed on the VMs and the Guest Tools
Framework which is used for coordination between the agent and Nutanix platform.

The NGT installer contains the following components:

• Nutanix Guest Agent (NGA) service. Communicates with the Nutanix Controller VM.

• File Level Restore CLI. Performs self-service file-level recovery from the VM snapshots.

• Nutanix VM Mobility Drivers. Provides drivers that support VM migration between ESXi
and AHV, in-place hypervisor conversion, and cross-hypervisor disaster recovery (CHDR)
features.

• VSS requestor and hardware provider for Windows VMs. Enables application-consistent
snapshots of AHV or ESXi Windows VMs.

• Application-consistent snapshots for Linux VMs. Supports application-consistent snapshots
for Linux VMs by running specific scripts on VM quiesce.

The Guest Tools Framework is composed of a few high-level components:


• Guest Tools Service: The gateway between Acropolis and Nutanix services and the Guest
Agent. It is distributed across the CVMs within the cluster, with an elected NGT Master that
runs on the current Prism Leader (the CVM hosting the cluster virtual IP).

• Guest Agent: Agent and associated services deployed in the VM's OS as part of the NGT
installation process. Handles any local functions (e.g. VSS, Self-service Restore (SSR), etc.)
and interacts with the Guest Tools Service.

The Guest Agent Service communicates with the Guest Tools Service via the Nutanix cluster IP
using SSL. For deployments where the Nutanix cluster components and UVMs are on different
networks (as is recommended), ensure that one of the following is possible:

• Ensure routed communication from UVM network(s) to Cluster IP, or

• Create a firewall rule (and associated NAT) from UVM network(s) allowing communication
with the Cluster IP on port 2074 (preferred)
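
As an illustrative sketch of the firewall rule option, a Linux-based gateway between the UVM network and the cluster could allow TCP port 2074 as follows. The subnet and cluster IP address are hypothetical placeholders, and any required NAT configuration is environment specific:

# Permit the (hypothetical) UVM network 10.10.20.0/24 to reach the cluster virtual IP 10.10.10.50 on TCP 2074
iptables -A FORWARD -s 10.10.20.0/24 -d 10.10.10.50 -p tcp --dport 2074 -j ACCEPT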

The Guest Tools Service acts as a Certificate Authority (CA) and is responsible for generating
certificate pairs for each NGT enabled UVM. This certificate is embedded into the ISO which is
configured for the UVM and used as part of the NGT deployment process. These certificates are
installed inside the UVM as part of the installation process.

Protection Domains
Concepts

Replication is a fundamental component of any enterprise data protection solution, ensuring


that critical data and applications can be reliably and efficiently replicated to a different site or a
separate infrastructure.


Terminology

Protection Domain

Protection Domain (PD) is a defined group of entities (VMs, files and Volume Groups) that are
always backed up locally and optionally replicated to one or more remote sites.

An async DR protection domain supports backup snapshots for VMs and volume groups. A
metro availability protection domain operates at the storage container level.

A protection domain can use one of two replication engines depending on the replication
frequency that is defined when the protection domain is created. For 1 to 15 minute RPO,
NearSync will be used for replication. For 60 minutes and above, async DR will be used. 

Metro Availability Protection Domain


An active local storage container linked to a standby container at a remote site. The local and
remote containers have the same name. Containers defined in a Metro Availability protection
domain are synchronously replicated to a remote container of the same name.

Consistency Group
An optional subset of entities in a protection domain; a default consistency group (CG) is
created with each PD. A CG can contain one or more virtual machines and/or volume groups.
With async DR, you can specify a consistency group; with Metro Availability, this process is
automated.

Schedule

A schedule is a PD property that specifies snapshot intervals and snapshot retention. Retention
can be set differently for local and remote snapshots.

Snapshot

Read-only copy of the data and state of a VM, file or Volume Group at a specific point in time.
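
To make these terms concrete, the following hedged nCLI sketch creates an async DR protection domain and protects two hypothetical VMs in a consistency group. The command and parameter names are recalled from Nutanix documentation and may differ by AOS version, so verify them with the nCLI help before use:

# Create a protection domain and protect two VMs in one consistency group (names are placeholders)
ncli> protection-domain create name="pd-sql"
ncli> protection-domain protect name="pd-sql" vm-names="sql-01,sql-02" cg-name="cg-sql"
ncli> protection-domain list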


Considerations for Async DR

• No more than 200 entities (VMs, files, and volume groups)

• Because restoring a VM does not allow for VMX editing, VM characteristics such as MAC
addresses may be in conflict with other VMs in the cluster

• VMs must be entirely on Nutanix datastore (no external storage)

• Data replication between sites relies on the connection (for example, a site-to-site VPN) for encryption

• You cannot make snapshots of entire file systems (beyond the scope of a VM) or containers

• The shortest possible snapshot frequency is once per hour

• Consistency groups cannot define boot ordering

• You cannot include Volume Groups (Nutanix Volumes) in a protection domain configured for
Metro Availability

• Keep consistency groups as small as possible, typically at the application level. Note that
when using application consistent snapshots, it is not possible to include more than one VM
in a consistency group.

• Always specify retention time when you create one-time snapshots

• Do not deactivate and then delete a protection domain that contains VMs

• If you want to enable deduplication on a container with protected VMs that are replicated to
a remote site, wait to enable deduplication until:

- Both sites are upgraded to a version that supports capacity tier deduplication.

- No scheduled replications are in progress.

If either of these conditions is false, replication fails.

Protection Domain States


A protection domain on a cluster can be in one of two modes:

• Active: Manages volume groups and live VMs. Makes, replicates, and expires snapshots.

• Inactive: Receives snapshots from a remote cluster.

Note: For a list of guidelines when configuring async DR protection domains, please
see the Async DR Protection Domain Configuration section of the Prism Web
Console Guide on the Nutanix Support Portal.


Protection Domain Failover and Failback

After a protection domain is replicated to at least one remote site, you can carry out a planned
migration of the contained entities by failing over the protection domain. You can also trigger
failover in the event of a site disaster.
Failover and failback events re-create the VMs and volume groups at the other site, but the
volume groups are detached from the iSCSI initiators to which they were attached before the
event. After the failover or failback event, you must manually reattach the volume groups to
iSCSI initiators and rediscover the iSCSI targets from the VMs.
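
For a Linux guest using the standard open-iscsi tools, rediscovering targets after a failover could look like the sketch below. The data services IP address is a hypothetical placeholder, and the volume groups must first be reattached to the VM's initiator (IQN) from Prism or the CLI:

# Rediscover iSCSI targets exposed by the cluster data services IP (placeholder address)
iscsiadm -m discovery -t sendtargets -p 10.10.10.60:3260
# Log in to the discovered targets
iscsiadm -m node -l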

Leap Availability Zone

Disaster recovery configurations which are created with Prism Element use protection domains
and optional third-party integrations to protect VMs, and they replicate data between on-
premises Nutanix clusters. Protection domains provide limited flexibility in terms of supporting
operations such as VM boot order and require you to perform manual tasks to protect new VMs
as an application scales up.

Leap uses an entity-centric approach and runbook-like automation to recover applications,
and the lowest RPO it supports is 60 minutes. It uses categories to group the entities to be
protected and applies policies to automate the protection of new entities as the application
scales. Application recovery is more flexible with network mappings, configurable stages
to enforce a boot order, and optional inter-stage delays. Application recovery can also be
validated and tested without affecting production workloads. All the configuration information
that an application requires upon failover is synchronized to the recovery location.

You can use Leap between two physical data centers or between a physical data center and Xi
Cloud Services. Leap works with pairs of physically isolated locations called availability zones.
One availability zone serves as the primary location for an application while a paired availability
zone serves as the recovery location. While the primary availability zone is an on-premises
Prism Central instance, the recovery availability zone can be either on-premises or in Xi Cloud
Services.

Configuration tasks and disaster recovery workflows are largely the same regardless of whether
you choose Xi Cloud Services or an on-premises deployment for recovery.

Availability Zone
An availability zone is a location to which you can replicate the data that you want to protect.
It is represented by a Prism Central instance to which a Nutanix cluster is registered. To ensure
availability, availability zones must be physically isolated from each other.
An availability zone can be in either of the following locations:

• Xi Cloud Services. If you choose to replicate data to Xi Cloud Services, the on-premises
Prism Central instance is paired with a Xi Cloud Services account, and data is replicated to Xi
Cloud Services.

• Physical Datacenter. If you choose to back up data to a physical datacenter, you must
provide the details of a Prism Central instance running in a datacenter that you own and that
is physically isolated from the primary availability zone.

Availability zones in Xi Cloud Services are physically isolated from each other to ensure that a
disaster at one location does not affect another location. If you choose to pair with a physical
datacenter, the responsibility of ensuring that the paired locations are physically isolated lies
with you.

Primary Availability Zone

The availability zone that is primarily meant to host the VMs you want to protect.

Recovery Availability Zone

The availability zone that is paired with the primary availability zone, for recovery purposes.
This can be a physical datacenter or Xi Cloud Services.

License Requirements
For disaster recovery between on-premises clusters and Xi Cloud Services, it is sufficient to use
the AOS Starter license on the on-premises clusters.

For disaster recovery between on-premises clusters, the license requirement depends on the
Leap features that you want to use. For information about the features that are available with
an AOS license, see Software Options.

Nutanix Software Requirements


• You cannot use Leap without Prism Central. Each datacenter must have a Prism Central
instance with Leap enabled on it.


• On-premises Nutanix clusters and the Prism Central instance with which they are registered
must be running AOS 5.10 or later.

• The on-premises clusters must be running the version of AHV that is bundled with the
supported version of AOS.

• On-Premises clusters registered with the Prism Central instance must have an external IP
address.

• The cluster on which the Prism Central instance is hosted must meet the following
requirements:

- The cluster must be registered to the Prism Central instance

- The cluster must have an iSCSI data services IP address configured on it.

- The cluster must also have sufficient memory to support a hot add of memory to all
Prism Central nodes when you enable Leap. A small Prism Central instance (4 vCPUs, 16
GB memory) requires a hot add of 4 GB and a large Prism Central VM (8 vCPUs, 32 GB
memory) requires a hot add of 8 GB. If you have enabled Nutanix Flow, an additional 1 GB
must be hot-added to each Prism Central instance.

• A single-node Prism Central instance must have a minimum of 8 vCPUs and 32 GB memory.

• Each node in a scaled-out Prism Central instance must have a minimum of 4 vCPUs and 16
GB memory.

• The Prism Central VM must not be on the same network as the protected user VMs. If
present on the user VM network, the Prism Central VM becomes inaccessible when the route
to the network is removed following failover.

• Do not uninstall the Nutanix VM Mobility drivers from the VMs; VMs become unusable after
migration if the mobility drivers are uninstalled.

Networking Requirements

Requirements for Static IP Address Preservation After Failover

Static IP address preservation refers to maintaining the same IP address in the destination. The
considerations to achieve this are as follows:

• The VMs must have Nutanix Guest Tools (NGT) installed on them.

• VMs must have at least one empty CD-ROM slot.

• For an unplanned failover, if the snapshot used for restoration does not have an empty CD-
ROM slot, the static IP address is not configured on that VM.

• For a planned failover, if the latest state of the VM does not have an empty CD-ROM slot, the
static IP address is not configured on that VM after the failover.

• Linux VMs must have the NetworkManager command-line tool (nmcli) installed on them. The
version of nmcli must be greater than or equal to 0.9.10.0.

• Additionally, the network on Linux VMs must be managed by NetworkManager. To enable
NetworkManager on a Linux VM, set the value of the NM_CONTROLLED field to yes in the
interface configuration file (for example, in CentOS, the file is /etc/sysconfig/network-scripts/ifcfg-eth0).
After setting the field, restart the network service on the VM (see the sketch after this list).

• If you select a non-IPAM network in a VPC in Xi Cloud Services, the gateway IP address and
prefix fields are not auto-populated, and you must manually specify these values.
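
A minimal sketch of the NetworkManager change described in the list above, assuming a CentOS guest and the eth0 interface used in the example; file paths and service names differ on other distributions:

# /etc/sysconfig/network-scripts/ifcfg-eth0 (excerpt)
NM_CONTROLLED=yes

# Restart the network service so NetworkManager manages the interface (command varies by distribution)
systemctl restart network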


Requirements for Static IP Address Mapping between Source and Target Virtual Networks

If you want to map static IP addresses between the source and target virtual networks, make sure that the following requirements are met:

• Install NGT on the VMs.

• Make sure that a free CD-ROM is available on each VM. The CD-ROM is required for
mounting NGT at the remote site after failover.

• Assign static IP addresses to the VMs.

• Make sure that the guest VMs can reach the Controller VM from both availability zones.

• Configure a VM-level IP address mapping in the recovery plan.

Virtual Network Design Requirements

You must design the virtual subnets that you plan to use for disaster recovery at the recovery
availability zone so that they can accommodate the VMs.

• Make sure that any virtual network intended for use as a recovery virtual network meets the
following requirements:

- The network prefix is the same as that of the source virtual network. For example, if the
source network address is 192.0.2.0/24, the network prefix of the recovery virtual network
must also be 24.

- The gateway IP address offset is the same as that in the source network. For example,
if the gateway IP address in the source virtual network 192.0.2.0/24 is 192.0.2.10, the last
octet of the gateway IP address in the recovery virtual network must also be 10.

• If you want to specify a single cluster as a target for recovering VMs from multiple source
clusters, make sure that the number of virtual networks on the target cluster is equal to the
sum of the number of virtual networks on the individual source clusters. For example, if there
are two source clusters, with one cluster having m networks and the other cluster having n
networks, make sure that the target cluster has m + n networks. Such a design ensures that
all migrated VMs can be attached to a network.

• It is possible to test failover and failback between physical clusters. To perform test runs
without affecting production, prepare test networks at both the source and destination sites.
Then, when testing, attach your test VMs to these networks.

• After you migrate VMs to Xi Cloud Services, make sure that the router in your data center
stops advertising the subnet in which the VMs were hosted.

Labs
1. Creating protection domains and local VM restore

2. Creating containers for replication

3. Configuring remote sites

4. Creating protection domains

5. Performing VM migration

6. Migrating back to primary 

Module 12: Prism Central

Overview
In the Managing a Nutanix Cluster module, you learned how to use Prism Element to configure a
cluster and set up Pulse and alerts. In this module you'll learn how to:
• Describe Prism Central

• Deploy a new instance of Prism Central

• Register and unregister clusters to Prism Central

• Recognize the additional features of Prism Pro

Prism Central Overview

Prism Central allows you to monitor and manage all Nutanix clusters from a single GUI:


• Single sign-on for all registered clusters

• Summary dashboard across clusters

• Central dashboard for clusters, VMs, hosts, disks, and storage with drill-down for detailed
information.

• Multi-Cluster analytics

• Multi-Cluster alerts summary with drill-down for possible causes and corrective actions.

• Centrally configure individual clusters.

Prism Starter vs Prism Pro


Prism Element and Prism Central are collectively referred to as Prism Starter; both are included
with every edition of Acropolis for single-site and multisite management.
Prism Pro is available as an add-on subscription.

A comparison table of Prism Starter and Prism Pro features can be found on the Nutanix website.


Deploying a New Instance of Prism Central


 For more information on this topic, see the Prism Central Guide on the Support Portal.

First, you must deploy an instance of Prism Central into your environment.

Once you have Prism Central deployed, you need to connect all of your clusters to Prism
Central.

You can deploy a Prism Central VM using the "1-click" method. This method employs the Prism
web console from a cluster of your choice and creates the Prism Central VM in that cluster.

The "1-click" method is the easiest method to deploy Prism Central in most cases. However, you
cannot use this method when:

• The target cluster runs Hyper-V or Citrix Hypervisor (or mixed hypervisors)

• You do not want to deploy the Prism Central VM in a Nutanix cluster

• You do not have access to a Nutanix cluster

In any of these cases, use the manual method of installation.

Deployment Methods

There are three methods to deploy Prism Central:

• Deploying from an AOS 5.10 cluster.

• Deploying from a cluster with Internet access.

• Deploying from a cluster that does not have Internet access (also known as a dark site).

Registering a Cluster to Prism Central


• Ensure that you have logged on to the Prism cluster as an admin


• Do not enable client authentication in combination with ECDSA certificates

• Open ports 9440 and 80 in both directions

• You cannot register a cluster to multiple Prism Central instances

If you have never logged into Prism Central as the user admin, you need to log on and change
the password before attempting to register a cluster with Prism Central.

Do not enable client authentication in combination with ECDSA certificates on a registered
cluster, since it causes interference when communicating with Prism Central.

Open ports 9440 and 80 (both directions) between the Prism Central VM, all Controller VMs,
and the cluster virtual IP address in each registered cluster.

A cluster can register with just one Prism Central instance at a time. To register with a different
Prism Central instance, first unregister the cluster.

Unregistering a Cluster from Prism Central


Unregistering a cluster through the Prism GUI is no longer available. Removal of this option
reduces the risk of accidentally unregistering a cluster, because several features such as role-
based access control, application management, microsegmentation policies, and self-service
capability require Prism Central. If a cluster is unregistered from Prism Central, these features
may no longer be available and their configuration may be erased.

Note: See KB 4944 for additional details if you have enabled Prism Self Service,
Calm, or other special features in Prism Central.

Prism Pro Features 


Customizable Dashboards

The custom dashboard feature allows you to build a dashboard based on a collection of fixed
and customizable widgets. You can arrange the widgets on the screen to create exactly the
view into the environment that works best for you. A dashboard’s contents can range from a
single widget to a screen full of widgets. 

Prism Pro comes with a default dashboard offering a view of capacity, health, performance, and
alerts that should be ideal for most users and a good starting point for others. The customizable
widgets allow you to display top lists, alerts, and analytics.

Note: Prism Pro allows you to create dashboards using fixed and customizable
widgets.

•  Fixed widgets = capacity, health, performance, and alerts. 


•  Customizable widgets = top lists, alerts, and analytics. 

Scheduled Reporting
Reports can provide information to the organization that is useful at all levels, from operations
to leadership. A few common good use cases include:

• Environmental summary: Provides a summary of cluster inventory entities and resource


utilization
• Cluster efficiency: Details possible capacity savings at the VM or cluster level

• Inventory: Produces a list of physical clusters, nodes, VMs, or other entities within an
environment


The reporting feature within Prism Pro allows you to create both scheduled and as-needed
reports. Prism Pro includes a set of customizable predefined reports, or you can create new
reports using a built-in WYSIWYG (what you see is what you get) editor. In the editor, simply
select data points and arrange them in the desired layout to create your report. The ability to
group within reports can help you get a global view of a given data point or allow you to look at
entities by cluster. Once you have created reports, they can be run either on an as-needed basis
or by setting them to run on a schedule. Configure each report to retain a certain number of
copies before the system deletes the oldest versions. To access reports, choose the report, then
select the version you wish to view. You can either view the report within Prism or via email, if
you have configured the report to send copies to a recipient list.

Dynamic Monitoring
The system learns the behavior of each VM and establishes a dynamic threshold as a
performance baseline for each resource assigned to that VM.

Dynamic monitoring uses VM behavioral learning powered by the Nutanix Machine Learning
Engine (X-Fit) technology to build on VM-level resource monitoring. Each resource chart
represents the baseline as a blue shaded range. If a given data point for a VM strays outside the
baseline range (higher or lower), the system detects an anomaly and generates an alert. The
anomaly appears on the performance charts for easy reference and follow-up.

If the data point’s anomalous results persist over time, the system learns the new VM behavior
and adjusts the baseline for that resource. With behavioral learning, performance reporting
helps you better understand your workloads and have early knowledge of issues that traditional
static threshold monitoring would not otherwise discover.

Dynamic monitoring is available for both VMs and physical hosts and encompasses multiple
data points within CPU, memory, storage, and networking.

Capacity Runway
Capacity planning focuses on the consumption of three resource categories within a Nutanix
cluster: storage capacity, CPU, and memory.


Capacity results appear as a chart that shows the historical consumption for the metric along
with the estimated capacity runway. The capacity runway is the number of days remaining
before the resource item is fully consumed. The Nutanix X-Fit algorithms perform capacity
calculations based on historical data. Prism Pro initially uses 90 days of historical data from
each Prism Element instance, then continues to collect additional data to use in calculations.
Prism Pro retains capacity data points longer than Prism Element, allowing organizations to
study a larger data sample.

The X-Fit method considers resources consumed and the rate at which the system consumes
additional amounts in the calculations for runway days remaining. Storage calculations factor
the amounts of live usage, system usage, reserved capacity, and snapshot capacity into runway
calculations. Storage capacity runway is aware of containers, so it can calculate capacity when
multiple containers that are growing at different rates consume a single storage pool. Container
awareness allows X-Fit to create more accurate runway estimates.

Note: The Capacity Runway tab allows you to view a summary of the resource runway
information for the registered clusters and access detailed runway information
about each cluster. Capacity runway calculations include data from live usage,
system usage, reserved capacity, and snapshot capacity.

Creating a Scenario

Anticipating future resource needs can be a challenging task. To address this task, Prism
Central provides an option to create "what if" scenarios that assess the resource requirements
for possible future workloads. This allows you to evaluate questions like

• How many new VMs can the current cluster support?

• If I need a new database server in a month, does the cluster have sufficient resources to
handle that increased load?

• If I create a new cluster for a given set of workloads, what kind of cluster do I need?

• If I remove a set of VMs or nodes, how will my cluster look?

You can create various "what if" scenarios to answer these and other questions. The answers
are derived by applying industry standard consumption patterns to the hypothetical workloads
and current consumption patterns for existing workloads.


Finding Waste and Right-Sizing VMs

The VM efficiency features in Prism Pro recommend VMs within the environment that are
candidates for reclaiming unused resources that you can then return to the cluster.

Candidate types:

• Overprovisioned

• Inactive

• Constrained

• Bully

Within a virtualized environment, resources can become constrained globally or on a per-VM
basis. Administrators can address global capacity constraints by scaling out resources,
either by adding capacity or by reclaiming existing resources. Individual VMs can also become
constrained when they do not have enough resources to meet their demands.
Prism Pro presents the VMs it has identified as candidates for VM efficiency in a widget,
breaking the efficiency data into four different categories for easy identification:
overprovisioned, inactive, constrained, and bully. 

• Overprovisioned: An overprovisioned VM is the opposite of a constrained VM, meaning it
is oversized and wasting resources that are not needed. A VM is considered overprovisioned
when it exhibits one or more of the following baseline values, based on the past 30 days:
CPU usage < 50% (moderate) or < 20% (severe) with CPU ready time < 5%; memory usage
< 50% (moderate) or < 20% (severe); memory swap rate = 0 Kbps.

• Inactive: A VM is inactive in either of the following states: a VM is considered dead when it
has been powered off for at least 30 days; a VM is considered a zombie when it is powered
on but performs fewer than 30 read or write I/Os (total) and receives or transfers fewer than
1,000 bytes per day for the past 30 days.

• Constrained: A constrained VM is one that does not have enough resources for its demand,
which can lead to performance bottlenecks. A VM is considered constrained when it exhibits
one or more of the following baseline values, based on the past 30 days: CPU usage > 90%
(moderate) or > 95% (severe); CPU ready time > 5% (moderate) or > 10% (severe); memory
usage > 90% (moderate) or > 95% (severe); memory swap rate > 0 Kbps (no moderate value).

• Bully: A bully VM is one that consumes too many resources and causes other VMs to starve.
A VM is considered a bully when it exhibits one or more of the following conditions for over
an hour: CPU ready time > 5%, memory swap rate > 0 Kbps, host I/O Stargate CPU usage >
85%.

The lists of candidates show the total amount of CPU and memory configured versus peak
amounts of CPU and memory used for each VM. The overprovisioned and inactive categories
provide a high-level summary of potential resources that can be reclaimed from each VM.

Capacity Planning and Just-in-Time Forecasting

Prism Pro calculates the number, type, and configuration of nodes recommended for scaling to
provide the days of capacity requested.

You can model adding new workloads to a cluster and how those new workloads may affect
your capacity.

Capacity Planning

The Capacity Runway tab can help you understand how many days of resources you have left.
For example, determining how expanding an existing workload or adding new workloads to a
cluster may affect resources.

When you can’t reclaim enough resources, or when organizations need to scale the overall
environment, the capacity planning function can make node-based recommendations. These
node recommendations use the X-Fit data to account for consumption rates and growth and
meet the target runway period. Setting the runway period to 180 days causes Prism Pro to
calculate the number, type, and configuration of nodes recommended to provide the 180 days
of capacity requested.

Just in Time Forecasting

As part of the capacity planning portion of Prism Pro, you can model adding new workloads to
a cluster and how those new workloads may affect your capacity. The Nutanix Enterprise Cloud
uses data from X-Fit and workload models that have been carefully curated over time through
our Sizer application to inform capacity planning. The add workload function allows you to add
various applications for capacity planning.

The available workload planning options are:


• SQL Server: Size database workload based on different workload sizes and database types

• VMs: Enables growth modeling specifying a generic VM size to model or selecting existing
VMs on a cluster to model.

- This is helpful when planning to scale a specific application already running on the cluster

• VDI: Provides options to select broker technology, provisioning method, user type, and
number of users

• Splunk: Size based on daily index size, hot and cold retention times, and number of search
users

• XenApp: Similar to VDI; size server-based computing with data points for broker types,
server OS, provisioning type, and concurrent user numbers

• Percentage: Allows modeling that increases or decreases capacity demand for the cluster 

- Example: Plan for 20 percent growth of cluster resources on a specified date

The figure below captures an example of this part of the modeling process.

Multiple Cluster Upgrades


Prism Pro offers the ability to upgrade multiple clusters from one Prism Central instance. With
this functionality, you can select multiple clusters, choose an available software version, and
push the upgrade to these clusters. If the multiple clusters you’re selecting are all within one
upgrade group, you can decide whether to perform the process on them sequentially or in
parallel.

This centralized upgrade approach provides a single point from which you can monitor status
and alerts as well as initiate upgrades. Currently, multiple cluster upgrades are only available
for AOS software. One-click upgrades of the hypervisor and firmware are still conducted at the
cluster level.

Labs
1. Deploying Prism Central

2. Registering a Cluster with Prism Central

3. Using Prism Central Basic Features

4. Creating a Custom Dashboard

5. Creating a Custom Report

6. Creating a “What-If?” Scenario


7. Unregistering a Cluster from Prism Central


Module 13: Monitoring the Nutanix Cluster

Overview
After completing this module, you will be able to:

• Understand available log files


• Access the Nutanix support portal and online help

Support Resources

Nutanix provides support services in several ways

• Nutanix Technical Support can monitor clusters and provide assistance when problems
occur.

• The Nutanix Support Portal is available for support assistance, software downloads, and
documentation.

• Nutanix supports a REST API, which allows you to request information or run administration
scripts for a Nutanix cluster.
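
As a hedged illustration of the REST API, basic cluster information can be requested with any HTTP client. The endpoint path and port below are assumptions based on the Prism v2.0 API; verify the exact path, version, and authentication in the REST API Explorer on your cluster before relying on it:

# Query cluster details (cluster-virtual-ip is a placeholder; curl prompts for the admin password)
curl -k -u admin https://cluster-virtual-ip:9440/PrismGateway/services/rest/v2.0/cluster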


Pulse

Pulse provides diagnostic system data to the Nutanix Support team to deliver proactive,
context-aware support for Nutanix solutions.

The Nutanix cluster automatically and unobtrusively collects this information with no effect on
system performance.

Pulse shares only basic system-level information necessary for monitoring the health and status
of a Nutanix cluster. Information includes:

• System alerts

• Current Nutanix software version

• Nutanix processes and Controller VM information

• Hypervisor details such as type and version

When Pulse is enabled, it sends a message once every 24 hours to a Nutanix Support server by
default.

Pulse also collects the most important system-level statistics and configuration information
more frequently to automatically detect issues and help make troubleshooting easier. With this
information, Nutanix Support can apply advanced analytics to optimize your implementation
and to address potential problems.

Note: Pulse sends messages through ports 80/8443/443. If this is not allowed,
Pulse sends messages through your mail server. The Zeus leader IP address must
also be open in the firewall.

Pulse is enabled by default. You can enable or disable Pulse at any time.

Log File Analysis


The Nutanix CVMs keep log files documenting events that occur over the life cycle of the
cluster. These files are stored in the /home/nutanix/data/logs directory.


FATAL Logs
A FATAL log is generated as a result of a component failure in a cluster. Multiple failures will
result in a new .FATAL log file being created for each one. The naming convention followed for
this file is:

component-name.cvm-name.log.FATAL.date-timestamp

• component-name identifies the component that failed

• cvm-name identifies the CVM that created the log


• date-timestamp identifies the date and time when the first failure of that component
occurred

Entries within a FATAL log use the following format:

[IWEF] mmdd hh:mm:ss.uuuuuu threadid file:line] msg

• [IWEF] identifies whether the log entry is information, a warning, an error, or fatal

• mmdd identifies the month and date of the entry

• hh:mm:ss.uuuuuu identifies the time at which the entry was made


• threadid file:line identifies the thread ID and the source file name and line number that generated the entry

Note: The cluster also creates .INFO, .ERROR, and .WARNING log files for each
component.

You can also generate a FATAL log on a process for testing. To do this, run the following
command in the CVM:
curl http://<svm ip>:<component port>/h/exit?abort=1

For practice, you can use this FATAL log to understand how to correlate it with an INFO file to
get more information. There are two ways to correlate a FATAL log with an INFO log:

• Search for the timestamp of the FATAL event in the corresponding INFO files.

1. Determine the timestamp of the FATAL event.

2. Search for the timestamp in the corresponding INFO files.

3. Open the INFO file with vi and go to the bottom of the file (Shift+G).

4. Analyze the log entries immediately before the FATAL event, especially any errors or
warnings.

• If a process is repeatedly failing, it might be faster to do a long listing of the INFO files and
select the one immediately preceding the current one. The current one would be the one
referenced by the symbolic link.
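
For illustration, such a correlation might look like the hedged example below. The timestamps, thread ID, source file, and messages are hypothetical and simply follow the entry format described earlier:

$ cat stargate.NUTANIX-CVM03.nutanix.log.FATAL.20120510-152823
F0510 15:28:23.123456  3135 some_component.cc:412] Check failed: ...
$ grep "15:28:2" stargate.NUTANIX-CVM03.nutanix.log.INFO.20120510-152823
W0510 15:28:21.004512  3135 some_component.cc:398] Retrying operation ...
E0510 15:28:22.998877  3135 some_component.cc:405] Operation failed ...
F0510 15:28:23.123456  3135 some_component.cc:412] Check failed: ...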

Command Line Tools


$ cd /home/nutanix/data/logs

$ ls *stargate*FATAL*

$ tail stargate.NUTANIX-CVM03.nutanix.log.FATAL.20120510-152823

$ grep F0820 stargate.NUTANIX-CVM03.nutanix.log.INFO.20120510-152823

Example listing of Stargate log files in /home/nutanix/data/logs:

stargate.ERROR
stargate.INFO
stargate.ntnx-16sm32070038-b-cvm.nutanix.log.ERROR.20190505-142229.18195
stargate.ntnx-16sm32070038-b-cvm.nutanix.log.INFO.20190927-204653.18195.gz
stargate.ntnx-16sm32070038-b-cvm.nutanix.log.WARNING.20190505-142229.18195
stargate.out
stargate.out.20190505-142228
stargate.WARNING
vip_service_stargate.out
vip_service_stargate.out.20190505-142302

Linux Tools

ls

This command returns a list of all files in the current directory, which is useful when you want to
see how many log files exist.

Include a subset of the filename that you are looking for to narrow the search. For example: $ ls
*stargate*FATAL*

cat

This command reads data from files and outputs their content. It is the simplest way to display
the contents of a file at the command line.

tail

This command returns the last 10 lines that were written to the file, which is useful when
investigating issues that have happened recently or are still happening.

To change the number of lines, add the -n flag. For example: $ tail -n 20 stargate.NUTANIX-CVM03.nutanix.log.FATAL.20120510-152823.3135

grep

This command returns lines in the file that match a search string, which is useful if you
are looking for a failure that occurred on a particular day. For example: $ grep F0820
stargate.NUTANIX-CVM03.nutanix.log.FATAL.20120510-152823.3135


Nutanix Support Tools

Nutanix provides a variety of support services and materials through the Support portal. To
access the Nutanix support portal from Prism Central:

1. Select Support Portal from the user icon pull-down list of the main menu. The login screen
for the Nutanix support portal appears in a new tab or window.

2. Enter your support account user name and password. The Nutanix support portal home
page appears.

3. Select the desired service from the screen options. The options available to you are:

• Select an option from one of the main menu pull-down lists

• Search for a topic at the top of the screen

• Click one of the icons (Documentation, Open Case, View Cases, Downloads) in the middle

• View one of the selections at the bottom such as an announcement or KB article.

Note: Some options have restricted access and are not available to all users.

Labs
1. Collecting logs for support

Module 14: Cluster Management and Expansion

Overview
After completing this module, you will be able to:

• Stop and start a cluster


• Shut down a node in a cluster and start a node in a cluster

• Expand a cluster

• Remove nodes from a cluster

• Explain license management

• Update AOS and firmware

Starting and Stopping a Cluster or Node


Understanding Controller VM Access
Most administrative functions of a Nutanix cluster can be performed through the web console
(Prism), however, there are some management tasks that require access to the Controller
VM (CVM) over SSH. Nutanix recommends restricting CVM SSH access with password or key
authentication.

Exercise caution whenever connecting directly to a CVM as the risk of causing cluster issues is
increased. This is because if you make an error when entering a container name or VM name,
you are not typically prompted to confirm your action – the command simply executes. In
addition, commands are executed with elevated privileges, similar to root, requiring attention
when making such changes.

Cluster Shutdown Procedures


While Nutanix cluster upgrades are non-disruptive and allow the cluster to run while nodes
upgrade in the background, there are situations in which some downtime may be necessary.
Certain maintenance operations and tasks, such as hardware relocation, require a cluster
shutdown.

Before shutting down a node, shut down all the guest VMs running on the node or move
them to the other nodes in the cluster, and verify the data resiliency status of the cluster. The
recommendation for any RF level is to shut down only one node at a time, even if the cluster is
RF3. If a cluster needs to have more than one node shut down, shut down the entire cluster. The
command cluster status, executed from the CLI on a Controller VM, shows the current status of all
cluster processes.
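
As a minimal pre-shutdown check, the following commands can be run as the nutanix user on any CVM. The first lists any cluster services that are not UP; the second is one way to review node fault tolerance, and its exact name and output may vary by AOS version, so treat it as an illustrative sketch rather than the definitive check:

nutanix@cvm$ cluster status | grep -v UP
nutanix@cvm$ ncli cluster get-domain-fault-tolerance-status type=node

If either command reports a problem, resolve it (or contact Nutanix Support) before shutting anything down.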

Note: This topic shows the process for AHV. Consult the appropriate admin manual
for other hypervisors.

Follow the tasks listed below if a cluster shutdown is needed:

1. Before the scheduled shutdown, SSH to a Controller VM using its local IP and not the cluster
VIP and run ncc health_checks run_all. If there are any errors or failures, contact Nutanix
Support.

2. Shut down all user VMs in the Nutanix cluster.


3. Stop all Nutanix Files cluster VMs, if applicable.

4. Stop the Nutanix cluster. Make sure you are connected to the static IP of any of the CVMs
rather than the cluster VIP.

5. Shut down each node in the cluster.

6. After completing maintenance or other tasks, power on the nodes and start the cluster.

Shutting Down a Node


To perform maintenance on a cluster node, open your SSH client and log on to the CVM.
Shut down the CVM and the node or place the node into maintenance mode. Remember, the
recommendation for any RF level is to only shut down one node at a time.

1. To place the node in maintenance mode, type:

$ acli host.enter_maintenance_mode host_ID [wait="{ true | false }" ]

2. To shut down the CVM, type: 

$ cvm_shutdown -P now

3. To shut down the host, type: 

$ shutdown -h now
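
Putting these steps together, a minimal sketch for taking one AHV node out of service might look like the following. The host ID (or IP) 10.1.64.11 is a hypothetical placeholder; run the first command from any CVM in the cluster, the second from the CVM running on the node being serviced, and the third from the AHV host itself:

nutanix@cvm$ acli host.enter_maintenance_mode 10.1.64.11 wait=true
nutanix@cvm$ cvm_shutdown -P now
root@host# shutdown -h now

Waiting for the maintenance-mode command to complete (wait=true) ensures user VMs have been migrated off the node before the CVM and host are shut down.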

Starting a Node
1. If the node is turned off, turn it on (otherwise, go to the next step).

2. Log on to the AHV host with SSH.

3. Find the name of the CVM by executing the following on the host: virsh list --all | grep
CVM

4. Examining the output from the previous command, if the CVM is OFF, start it from the
prompt on the host: virsh start cvm_name


Note: The cvm_name is obtained from the command run in step 3.

5. If the node is in maintenance mode, log on to the CVM over SSH and take it out of
maintenance mode: acli host.exit_maintenance_mode AHV-hypervisor-IP-address

6. Log on to another CVM in the cluster with SSH.

7. Confirm that cluster services are running on the CVM (make sure to replace cvm_ip_addr
accordingly): ncli cluster status | grep -A 15 cvm_ip_addr

a. Alternatively, you can use the following command to check if any services are down in the
cluster: cluster status | grep -v UP

8. Verify that all services are up on all CVMs.
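
As a consolidated example, the start-up sequence for an AHV node might look like the following. The CVM name and host IP shown are hypothetical placeholders; run the virsh commands on the AHV host and the remaining commands from a CVM over SSH:

root@host# virsh list --all | grep CVM
root@host# virsh start NTNX-Example-CVM
nutanix@cvm$ acli host.exit_maintenance_mode 10.1.64.11
nutanix@cvm$ cluster status | grep -v UP

Any service that is not running will remain visible in the output of the final command.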

Stopping a Cluster
Shut down all guest VMs, including vCenter if it is running on the cluster. Do not shut down
Nutanix Controller VMs.

1. Log on to a running Controller VM in the cluster with SSH.

2. Get the cluster status: cluster status

3. Stop the Nutanix cluster: cluster stop

4. Before proceeding, wait until the command output confirms that services have stopped on
every CVM in the cluster.

Note: If you are running Nutanix Files, stop Files before stopping your AOS cluster.
This procedure stops all services provided by guest virtual machines and the
Nutanix cluster.

Starting a Cluster
1. Log on to any CVM in the cluster with SSH.

2. Get the cluster status: cluster status

3. Start the Nutanix cluster: cluster start


Once the process begins, you will see a list of all the services that need to be started on each
CVM.


If the cluster starts properly, output similar to the following is displayed for each node in the
cluster at the end of the command execution:
CVM: 10.1.64.60 Up
Zeus UP [5362, 5391, 5392, 10848, 10977, 10992]
Scavenger UP [6174, 6215, 6216, 6217]
SSLTerminator UP [7705, 7742, 7743, 7744]
SecureFileSync UP [7710, 7761, 7762, 7763]
Medusa UP [8029, 8073, 8074, 8176, 8221]
DynamicRingChanger UP [8324, 8366, 8367, 8426]
Pithos UP [8328, 8399, 8400, 8418]
Hera UP [8347, 8408, 8409, 8410]
Stargate UP [8742, 8771, 8772, 9037, 9045]
InsightsDB UP [8774, 8805, 8806, 8939]
InsightsDataTransfer UP [8785, 8840, 8841, 8886, 8888, 8889, 8890]
Ergon UP [8814, 8862, 8863, 8864]
Cerebro UP [8850, 8914, 8915, 9288]
Chronos UP [8870, 8975, 8976, 9031]
Curator UP [8885, 8931, 8932, 9243]
Prism UP [3545, 3572, 3573, 3627, 4004, 4076]
CIM UP [8990, 9042, 9043, 9084]
AlertManager UP [9017, 9081, 9082, 9324]
Arithmos UP [9055, 9217, 9218, 9353]
Catalog UP [9110, 9178, 9179, 9180]
Acropolis UP [9201, 9321, 9322, 9323]
Atlas UP [9221, 9316, 9317, 9318]
Uhura UP [9390, 9447, 9448, 9449]
Snmp UP [9418, 9513, 9514, 9516]
SysStatCollector UP [9451, 9510, 9511, 9518]
Tunnel UP [9480, 9543, 9544]
ClusterHealth UP [9521, 9619, 9620, 9947, 9976, 9977, 10301]
Janus UP [9532, 9624, 9625]
NutanixGuestTools UP [9572, 9650, 9651, 9674]
MinervaCVM UP [10174, 10200, 10201, 10202, 10371]
ClusterConfig UP [10205, 10233, 10234, 10236]
APLOSEngine UP [10231, 10261, 10262, 10263]
APLOS UP [10343, 10368, 10369, 10370, 10502, 10503]
Lazan UP [10377, 10402, 10403, 10404]
Orion UP [10409, 10449, 10450, 10474]
Delphi UP [10418, 10466, 10467, 10468]


After you have verified that the cluster is up and running and there are no services down, you
can start guest VMs.
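
On AHV, guest VMs can be powered back on from Prism, or from any CVM with aCLI. A brief sketch, where the VM name is a hypothetical placeholder:

nutanix@cvm$ acli vm.on MyAppVM01

Power on infrastructure VMs (such as vCenter or domain controllers, if they run on the cluster) before general workloads.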

Removing a Node from a Cluster

Hardware components, such as nodes and disks, can be removed from a cluster or reconfigured
in other ways when conditions warrant it. However, node removal is typically a lengthy and I/O-
intensive operation. Nutanix recommends to remove a node only when it needs to be removed
permanently from a cluster. Node removal is not recommended for troubleshooting scenarios.

Before You Begin


If Data-at-Rest Encryption is enabled, then before removing a drive or node from a cluster:

1. Navigate to the Settings section of Prism and select Data at Rest Encryption.

2. Create a new configuration.

3. Enter the required credentials in the Certificate Signing Request Information section.

4. In the Key Management Server section, add a new key management server.

5. Add a new certificate authority and upload a CA certificate.

6. Return to the Key Management Server section and upload all node certificates.

7. Test the certificates again by clicking Test all nodes.

8. Ensure that testing is successful and the status is Verified.

Note: For a detailed procedure, refer to the Configuring Data-at-Rest Encryption
(SEDs) section of the Prism Web Console Guide on the Nutanix Support Portal.

Note: If an SED drive or node is not removed as recommended, the drive or node
will be locked.

Removing or Reconfiguring Cluster Hardware


When removing a host, remember that:

• You need to reclaim licenses before you remove a host from a cluster.

• Removing a host takes some time because data on that host must be migrated to other
hosts before it can be removed from the cluster. You can monitor progress through the
dashboard messages.


• Removing a host automatically removes all the disks in that host from the storage containers
and the storage pool(s).

• Only one host can be removed at a time. If you want to remove multiple hosts, you must
wait until the first host is removed completely before attempting to remove the next host.

• After a node is removed, it goes into an unconfigured state. You can add such a node back
into the cluster through the expand cluster workflow, which we will discuss in the next topic
of this chapter.

Expanding a Cluster

Nutanix supports these cluster expansion scenarios:

• Add a new node to an existing block

• Add a new block containing one or more nodes

• Add all nodes from an existing cluster to another existing cluster

The ability to dynamically scale the Acropolis cluster is core to its functionality. To scale an
Acropolis cluster, install the new nodes in the rack and power them on. After the nodes are
powered on, if the nodes contain a factory-installed image of AHV and CVM, the cluster should
discover the new nodes using the IPv6 Neighbor Discovery protocol.

Note: Nodes that are installed with AHV and CVM, but not associated with a
cluster, are also discoverable. Factory install of AHV and CVM may not be possible
for nodes shipped in some regions of the world.

Multiple nodes can be discovered and added to the cluster concurrently if AHV and the CVM
are imaged in the factory before they are shipped. Some pre-work is necessary for nodes that
do not meet these criteria. Additionally, nodes that are already part of a cluster are not listed as
options for cluster expansion.

The process for expanding a cluster depends on the hypervisor type, version of AOS, and data-
at-rest encryption status.

• Same hypervisor and AOS version: The node is added to the cluster without re-imaging it.

• AOS version is different: The node is re-imaged before it is added. If the AOS version on the node is lower than the cluster's but the hypervisor version is the same, you have the option to upgrade just AOS from the command line. To do this, log into a Controller VM in the cluster and run the following command:

nutanix@cvm$ /home/nutanix/cluster/bin/cluster -u new_node_cvm_ip_address upgrade_node

After the upgrade is complete, you can add the node to the cluster without re-imaging it. Alternately, if the AOS version on the node is higher than the cluster's, you must either upgrade the cluster to that version or re-image the node.

• AOS version is the same but hypervisor version is different: You are provided with the option to re-image the node before adding it. (Re-imaging is appropriate in many such cases, but in some cases, such as a minor version difference, it may not be necessary.) Depending on the hypervisor, installation binaries (for example, an ISO) might need to be provided.

• Data-At-Rest Encryption: If Data-At-Rest Encryption is enabled for the cluster (see Data-at-Rest Encryption), you must configure Data-At-Rest Encryption for the new nodes. The new nodes must have self-encrypting disks or AOS-based software encryption. Re-imaging is not an option when adding nodes to a cluster where Data-At-Rest Encryption is enabled; therefore, such nodes must already have the correct hypervisor and AOS version.

• Expanding a cluster when the ESXi cluster is configured with DVS (Distributed vSwitch) for CVM external communication: Ensure that you do the following.

• Expand DVS with the new node.

• Make sure both the host and the CVM are configured with DVS.

• Make sure that host-to-CVM and CVM-to-CVM communications are working.

• Follow the cluster expansion procedure.


Managing Licenses

Nutanix provides automatic and manually applied licenses to ensure access to the variety of
features available. These features enable you to administer your environment based on your
current and future needs. You can use the default feature set of AOS, upgrade to an advanced
feature set, update your license for a longer term, or reassign existing licenses to nodes or
clusters as needed.

Each Nutanix NX Series node or block is delivered with a default Starter license which does not
expire. You are not required to register this license on your Nutanix Customer Portal account.
These licenses are automatically applied when a cluster is created, even when a cluster has
been destroyed and re-created. In these cases, Starter licenses do not need to be reclaimed.

Software-only platforms qualified by Nutanix (for example, the Cisco UCS M5 C-Series Rack
Server) might require a manually applied Starter license. Depending on the license level you
purchase, you can apply it using the Prism Element or Prism Central web console.

In this section, we will discuss the fundamentals of Nutanix license management.

Cluster Licensing Considerations


• Nutanix nodes and blocks are delivered with a default Starter license that does not expire.

• Pro and Ultimate licenses have expiration dates. License notification alerts in Prism start 60
days before expiration.

• Upgrade your license type if you require continued access to Pro or Ultimate features.

• An administrator must install a license after creating a cluster for Pro and Ultimate licensing.

• Reclaim licenses before destroying a cluster.

• Ensure consistent licensing for all nodes in a cluster. Nodes with different licensing default to
the minimum feature set.

For example, if two nodes in the cluster have Pro licenses and two nodes in the same cluster have
Ultimate licenses, all nodes will effectively have Pro licenses and access to that feature set only.


Attempts to access Ultimate features in this case result in a warning in the web console. If you
are using a Prism Pro trial license, the warning shows the expiration date and the number of days
left in the trial period; the trial period is 60 days.

• You may see a "Licensing Status: In Process" alert message in the web console or log files.

• Generating a Cluster Summary File through the Prism web console, nCLI commands
(generate-cluster-info) or PowerShell commands (get-NTNXClusterLicenseInfo and get-
NTNXClusterLicenseInfoFile) initiates the cluster licensing process.

Understanding AOS, Prism, and Add-on Licenses


Nutanix offers three licensed editions of AOS, two editions of Prism, and a licensing or subscription
model for add-ons. Subscription terms range from one to five years.

AOS Licenses

Starter licenses are installed by default on each Nutanix node and block. They never expire
and do not require registration on your assigned Nutanix customer portal account.

Pro and Ultimate licenses are downloaded as a license file from the Nutanix Support Portal and
applied to your cluster using Prism.

With a Pro or Ultimate license, or after upgrading to Pro or Ultimate, adding nodes or clusters
to your environment requires you to generate a new license file for download and installation.

Note: For more information about the different features that are available with
Acropolis Starter, Pro, and Ultimate, please see: https://www.nutanix.com/
products/software-options

Prism Licenses

The Prism Starter license is installed by default on every edition of AOS.


The Prism Pro license is available on a per-node basis, with options to purchase on a 1, 2, 3, 4, or
5-year term. A trial version of Prism Pro is included with every edition of AOS.

Add-on Licenses

Individual features known as add-ons can be added to your existing Prism license feature set.
When Nutanix makes add-ons available, you can add them to your existing Starter or Pro
license. For example, you can purchase Nutanix Files for your existing Pro licenses. 

You need to purchase and apply one add-on license for each node in the cluster with a Pro
license. For example, if your current Pro-licensed cluster consists of four nodes, you need to
purchase four add-on licenses, then apply them to your cluster.

All nodes in your cluster need to be at the same license level (four Pro licenses and four add-
on licenses). You cannot buy one add-on license, apply it to one node and have three nodes
without add-on licenses.

Add-ons that are available with one to five year subscription terms are Nutanix Era, Nutanix
Flow, Nutanix Files and Nutanix Files Pro. Nutanix Calm is available in 25 VM subscription
license packs.

Managing Your Licenses

Before Licensing a Cluster

Before attempting to install an upgraded or add-on license, ensure that you have created a
cluster and logged into the web console to verify that the Starter license has been applied.


Managing Licenses Using Portal Connection

The Portal Connection feature simplifies licensing by integrating the licensing workflow into
a single interface in the web console. Once you configure this feature, you can perform most
licensing tasks from Prism without needing to explicitly log on to the Nutanix Support Portal.

Note: This feature is disabled by default. If you want to enable Portal Connection,
please see the Nutanix Licensing Guide on the Nutanix Support Portal.

Portal Connection communicates with the Nutanix Support Portal to detect changes or updates
to your cluster license status. When you open Licensing from the web console, the screen
displays 1-click action buttons to enable you to manage your licenses without leaving Prism.

The button that appears depends on what you are eligible for or want to do:

• Add: Add an add-on license. This button appears if add-on features are available for licensing.

• Downgrade: Downgrade your cluster to Pro from Ultimate, or to Starter from Pro or Ultimate. Use this button when reclaiming licenses before destroying a cluster.

• Rebalance: Ensure your available licenses are applied to each node in your cluster. For example, if you have added a node and have an available license in your account, click Rebalance. If you have removed a node, click Rebalance to reclaim the now-unused license.

• Remove: Remove an add-on license, disabling the add-on feature.

• Renew: Apply newly-purchased licenses.

• Select: Apply a license for an unlicensed cluster.

• Update: Extend the expiration date of current valid licenses.

• Upgrade: Upgrade your cluster from Starter to Pro or Ultimate, or from Pro to Ultimate.

Note: For more information on managing licenses with the Portal Connection
feature, including examples of upgrades, renewals, and removals, please see
the Nutanix Licensing Guide on the Nutanix Support Portal.

Managing Licenses Without Portal Connection

This is the default method of managing licenses since the Portal Connection feature is disabled
by default. This method is a 3-step process, in which you:

1. Generate a cluster summary file in the web console and upload it to the Nutanix support
portal.

2. Generate and download a license file from the Nutanix support portal.

3. Install the license file on a cluster connected to the internet.

Generating a Cluster Summary File

1. From an internet-connected cluster, click the gear icon in the web console and
open Licensing.

2. Click Update License.

3. Click Generate to create and save a cluster summary file to your local machine. The cluster
summary file is saved to your browser download directory or directory you specify.

Generating and Downloading a License File

Note: To begin this process, you must have first generated a cluster summary file in
the web console.

1. Upload the Cluster Summary File to the Nutanix support portal.

2. Click Support Portal, log on to the Nutanix support portal, and click My Products > Licenses.

3. Click License a New Cluster. The Manage Licenses dialog box displays.

4. Click Choose File. Browse to the Cluster Summary File you just downloaded, select it, and
click Next. The portal automatically assigns a license, based on the information contained in
the Cluster Summary File.

5. Generate and apply the downloaded license file to the cluster. Click Generate to download
the license file created for the cluster to your browser download folder or directory you
specify.

Installing the License File

Note: To begin this process, you must have first generated and downloaded a
license file from the Nutanix Support Portal.

1. In the Prism web console, click the upload link in the Manage Licenses dialog box.


2. Browse to the license file you downloaded, select it, and click Save.

Note: The 3-step process described here applies to Prism Element, Prism
Central, and add-on licenses. For specific instructions related to each of these
three license types, please see the relevant section of the Nutanix Licensing Guide on
the Nutanix Support Portal.

Managing Licenses in a Dark Site

Since a dark site cluster is not connected to the internet, the Portal Connection feature
cannot be used from the cluster itself. However, some steps in the licensing process still require
the use of a system connected to the internet. The three-step process for licensing a dark site
cluster is as follows:

Getting Cluster Summary Information Manually

1. Open Licensing from the gear icon in the web console for the connected cluster.

2. Click Update License.

3. Click Show Info and copy the cluster information needed to generate a license file. This
page displays the information that you need to enter at the support portal on an internet-
connected system. Copy this information to complete this licensing procedure.

• Cluster UUID: String indicating the unique cluster ID

• Signature: Cluster security key

• License Class: Indicates a software-only, appliance-based, or Prism Central license class

• License Version: Indicates the version of the installed license file

• Node Count: Number of available licenses for this model

• Cores Num: Number of CPU cores; used with capacity-based licensing

• Flash TiB: Number of Flash TiBs; used with capacity-based licensing

Installing a New License in a Dark Site

1. Get your cluster information from the web console. Complete the installation process on a
machine connected to the internet.


2. Navigate to the Cluster Usage section of the Nutanix Support Portal to manage your
licenses.

3. Select the option for Dark Sites and then select the required license information, including
class, license version, and AOS version.

4. If necessary, enter capacity and block details. (Ensure that there are no typing errors.)

5. Select your licenses for Acropolis and then license your add-ons individually.

6. Check the summary, make sure all details are correct, and then download the license file.

7. Apply the downloaded license file to your dark site cluster to complete the process.

Reclaiming Your Licenses

Reclaiming a license returns it to your inventory so that you can reapply it to other nodes in a
cluster. You will need to reclaim licenses when modifying license assignments, when removing
nodes from a cluster, or before you destroy a cluster.

As with license management, licenses can be reclaimed both with and without the use of the
Portal Connection feature. Both procedures are described below. For more information,
including detailed step-by-step procedures, please see the Nutanix Licensing Guide on the
Nutanix Support Portal.

Reclaiming Licenses with a Portal Connection


You must reclaim licenses (other than Starter) when you plan to destroy a cluster. First, reclaim
your licenses, then downgrade to Starter. When all nodes in the cluster are at the Starter license
level, you can then destroy the cluster. You do not need to reclaim Starter licenses; these
licenses are automatically applied whenever you create a cluster.

Reclaim licenses to return them to your inventory when you remove one or more nodes from
a cluster. If you move nodes from one cluster to another, first reclaim the licenses, move the
nodes, and then re-apply the licenses. Otherwise, if you are removing a node and not moving it to
another cluster, use the Rebalance button.

You can reclaim licenses for nodes in your clusters in cases where you want to make
modifications or downgrade licenses. For example, applying an Ultimate license to all nodes
in a cluster where some nodes are currently licensed as Pro and some nodes are licensed as
Ultimate. You might also want to transition nodes from Ultimate to Pro licensing.

Using Portal Connection to Reclaim a License

1. Open Licensing from the gear icon in the web console for the connected cluster.

2. Remove any add-ons. For example, Nutanix Files.


a. Open Licensing from the gear icon in the Prism web console for the connected cluster.

b. The Licensing window shows that you have installed the Nutanix Files add-on.

c. Click Remove File Server to remove this add-on feature. Click Yes in the confirmation
window.

Portal Connection places the cluster into standby mode to remove the feature and update
the cluster license status. After this operation is complete, license status is updated.

d. Click X to close the Licensing window.

You will need to repeat this procedure for any other add-ons that you have installed.

3. Click Downgrade to Starter after any add-ons are removed.

4. Click X to close the Licensing window.

You can now perform any additional tasks, such as destroying the cluster or re-applying
licenses.

Reclaiming Licenses Without Portal Connection


Note: This procedure applies to clusters that are not configured with Portal
Connection, as well as dark-site clusters.

There are two scenarios in which you will reclaim licenses without using Portal Connection:
first, when destroying a cluster, and second, when removing nodes from a cluster. The
procedure for both scenarios is largely the same; differences are noted in the steps
below, where applicable.

Points to Remember

• After you remove a node, if you move the node to another cluster, it requires using an
available license in your inventory.

• You must unlicense (reclaim) your cluster (other than Starter on Nutanix NX Series
platforms) when you plan to destroy a cluster. First unlicense (reclaim) the cluster, then
destroy the cluster.

Note: If you have destroyed the cluster and did not reclaim all existing licenses by
unlicensing the cluster, contact Nutanix Support to help reclaim the licenses.

• Return licenses to your inventory when you remove one or more nodes from a cluster. Also,
if you move nodes from one cluster to another, first reclaim the licenses, move the nodes,
then re-apply the licenses.

• You can reclaim licenses for nodes in your clusters in cases where you want to make
modifications or downgrade licenses. For example, applying an Ultimate license to all nodes
in a cluster where some nodes are currently licensed as Pro and some nodes are licensed as
Ultimate. You might also want to transition nodes from Ultimate to Pro licensing.

• You do not need to reclaim Starter licenses for Nutanix NX Series platforms. These licenses
are automatically applied whenever you create a cluster.

Reclaiming a License without Portal Connection

1. Generate a cluster summary file in the web console and upload it to the Nutanix Support
Portal.


2. In the Support Portal, unlicense the cluster and download the license file.

3. Apply the downloaded license file to your cluster to complete the license reclamation
process.

Upgrading Software and Firmware


Nutanix provides a mechanism to perform nonintrusive rolling upgrades through Prism. This
simplifies the job of the administrator and results in zero loss of services.

AOS

Each node in a cluster runs AOS. When upgrading a cluster, all nodes should be upgraded to
the same AOS version.
Nutanix provides a live upgrade mechanism that allows the cluster to run continuously while a
rolling upgrade of the nodes is started in the background. There is no downgrade option.

Hypervisor Software

Hypervisor upgrades are provided by vendors such as VMware and qualified by Nutanix. The
upgrade process updates one node in a cluster at a time.

NCC

Nutanix Cluster Check (NCC) health check software.


Foundation

Nutanix Foundation installation software.

BIOS and BMC Firmware

Nutanix provides updated BIOS and Base Management Controller (BMC) firmware.

Nutanix rarely includes this firmware on the Nutanix Support Portal. Nutanix recommends that
you open a case on the Support Portal to request the availability of updated firmware for your
platform.

Disk Firmware

Nutanix provides a live upgrade mechanism for disk firmware. The upgrade process updates
one disk at a time on each node for the disk group you have selected to upgrade.

Once the upgrade is complete on the first node in the cluster, the process begins on the next
node. Update happens on one disk at a time until all drives in the cluster have been updated.

Understanding Long Term Support and Short Term Support Releases

For AOS only, Nutanix offers two types of releases that cater to the needs of different customer
environments.

• Short Term Support (STS) releases have new features and provide a regular upgrade path

• Long Term Support (LTS) releases are maintained for longer periods of time and primarily
include bug fixes over that extended period

To understand whether you have an STS or LTS release or which one is right for you, refer to
the following table:


Short Term Support (STS)

• Release cadence: Quarterly

• Support cycle: 3 months of maintenance, followed by an additional 3 months of support

• Content: Major new features and hardware platforms for new features; also contains bug fixes

• Target user: Customers that are interested in adopting major new features and are able to perform upgrades multiple times a year

• Current AOS release families: 5.6.x, 5.8.x, 5.9.x, 5.11.x

• Upgrade paths: To the next supported STS release, or to the next supported LTS release

Long Term Support (LTS)

• Release cadence: Annually

• Support cycle: 12 months of maintenance after the release date of the next upgrade, followed by 6 months of support

• Content: Focused heavily on bug fixes, with minimal minor feature introduction

• Target user: Customers that are interested in a release family with an extended support cycle

• Current AOS release families: 5.5.x, 5.10.x

• Upgrade paths: To the next supported STS release, or to the next supported LTS release

Note: The upgrade path must always be to a later release. Downgrades
are not supported.

Before You Upgrade


Before you can proceed with an upgrade, you need to:

• Check the status of your cluster to ensure everything is in a proper working state.

• Check to see if your desired upgrade is a valid upgrade path.

• Check the compatibility matrix for details of hypervisor and hardware support for different
versions of AOS.
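
Using commands introduced earlier in this module, a minimal pre-upgrade health check from any CVM might look like this:

nutanix@cvm$ ncc health_checks run_all
nutanix@cvm$ cluster status | grep -v UP

Resolve any reported failures, or contact Nutanix Support, before starting the upgrade.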

Lifecycle Manager (LCM) Upgrade Process

Upgrading Nutanix clusters involves a specific sequence of tasks:


1. Upgrade AOS on each cluster

2. Perform a Lifecycle Manager (LCM) inventory

3. Update LCM

4. Upgrade any recommended firmware

The upgrade process steps are detailed below.

Upgrading the Hypervisor and AOS on Each Cluster

Overview and Requirements

1. Check the AOS release notes for late-breaking upgrade information.

2. Run the Nutanix Cluster Check (NCC) health checks from any CVM in the cluster.

3. Download the available hypervisor software from the vendor and the metadata file (JSON)
from the Nutanix Support Portal. If you are upgrading AHV, you can download the binary
bundle from the Nutanix Support Portal.

4. Upload the software and metadata through Upgrade Software.


5. Upgrading the hypervisor causes each CVM to restart.

6. Only one node is upgraded at a time. Ensure that all the hypervisors hosted in your cluster
are running the same version (all ESXi hosts running the same version, all AHV hosts running
the same version, and so on). The NCC check same_hypervisor_version_check returns a
FAIL status if the hypervisors are different.

Note: Using the Upgrade Software (1-click upgrade) feature does not complete
successfully in this case.
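
The hypervisor version check mentioned in step 6 can also be run on its own from any CVM. The module path shown below is an assumption and may differ between NCC versions; running the full check set always covers it:

nutanix@cvm$ ncc health_checks hypervisor_checks same_hypervisor_version_check
nutanix@cvm$ ncc health_checks run_all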

Upgrading AHV

To upgrade AHV through the Upgrade Software feature in the Prism web console, do the
following:

1. Ensure that you are running the latest version of NCC. Upgrade NCC if required.

2. Run NCC to ensure that there are no issues with the cluster.

3. In the web console, navigate to the Upgrade Software section of the Settings page and click
the Hypervisor tab.

4. If Available Compatible Versions shows a new version of AHV, click Upgrade, then click
Upgrade Now, and click Yes when prompted for confirmation.

Upgrading AOS

To upgrade AOS through the Upgrade Software feature in the Prism web console, do the
following:

1. Ensure that you are running the latest version of NCC. Upgrade NCC if required.

2. Run NCC to ensure that there are no issues with the cluster.

3. In the web console, navigate to the Upgrade Software section of the Settings page and
select the option to upgrade AOS.


4. Optionally, you can also run pre-upgrade installation checks before proceeding with the
upgrade process.

5. If automatic downloads are enabled on your cluster, install the downloaded package. If
automatic downloads are not enabled, download the upgrade package and install it.
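
While a rolling upgrade runs, progress can also be followed from the CVM command line. The commands below are commonly present on CVMs for AOS and hypervisor upgrades respectively, but they are listed here as an assumption since availability and output can vary by AOS version:

nutanix@cvm$ upgrade_status
nutanix@cvm$ host_upgrade_status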

Working with Life Cycle Manager


The Life Cycle Manager (LCM) tracks software and firmware versions of all entities in the
cluster. It performs two functions: taking inventory of the cluster and performing updates on
the cluster.

LCM consists of a framework and a set of modules for inventory and update. LCM
supports all Nutanix, Dell XC, Dell XC Core, and Lenovo HX platforms. LCM modules are
independent of AOS. They contain libraries and images, as well as metadata and checksums for
security. Currently, Nutanix supplies all modules.

The LCM framework is accessible through the Prism interface. It acts as a download manager
for LCM modules, validating and downloading module content. All communication between the
cluster and LCM modules goes through the LCM framework.

Accessing LCM
Whether you are accessing LCM from Prism Element or Prism Central, the steps to do so are
the same.

1. Click the gear button to open the settings page.

2. Select Life Cycle Management from the sidebar.

Note: In AOS 5.11 and later, LCM is available as a menu item from the
Prism Home page, rather than the Settings page.

Performing Inventory with LCM

You can use LCM to display software and firmware versions of entities in a cluster. Inventory
information for a node is persistent for as long as the node remains in the chassis. When you
remove a node from a chassis, LCM will not retain inventory information for that node. When
you return the node to the chassis, you must perform inventory again to restore the inventory
information.

To perform inventory:

1. Open LCM.

2. To take an inventory, click Options and select Perform Inventory. If you do not have auto-update
enabled and a new version of the LCM framework is available, LCM displays a warning.


3. Click OK. The new inventory appears on the Inventory page.

Other features in LCM that might be useful to you are:

• The Focus button, which lets you switch between a general display and a component-by-
component display.

• The Export option, which will export the inventory as a spreadsheet.

• Auto-inventory. To enable this feature, click Settings and select the Enable LCM Auto
Inventory check box in the dialog box that appears.

Upgrading Recommended Firmware


You will use LCM to upgrade firmware on your cluster. Before you begin, remember to:

• Get the current status of your cluster to ensure everything is in the proper working order.

• Update your cluster to the most recent version of Nutanix Foundation.


• Configure rules in your external firewall to allow LCM updates. For details, see the Firewall
Requirements section of the Prism Web Console Guide on the Support Portal.

The LCM Update Workflow

LCM updates the cluster one node at a time: it brings a node down (if needed), performs
updates, brings the node up, waits until it is fully functional, and then moves on to the next node.
If LCM encounters a problem during an update, it waits until the problem has been resolved
before moving on to the next node.

During an LCM update, there is never more than one node down at the same time even if the
cluster is RF3.

All LCM updates follow the same general procedure. The details are as follows:

1. If updates for the LCM framework are available, LCM auto-updates its own framework, then
continues with the operation.

2. After a self-update, LCM runs the series of pre-checks described in the Life Cycle Manager
Pre-Checks section of the Life Cycle Manager Guide on the Support Portal.

3. When the pre-checks are complete, LCM looks at the available component updates and
batches them according to dependencies. LCM batches updates in order to reduce or
eliminate the downtime of the individual nodes; when updates are batched, LCM only
performs the pre-update and post-update actions once. For example, on NX platforms, BIOS
updates depend on BMC updates, so LCM batches them so the BMC always updates before
the BIOS on each node.
4. Next, LCM chooses a node and performs any necessary pre-update actions.

5. Next, LCM performs the update. The update process and duration vary by component. 

6. LCM performs any necessary post-update actions and brings the node back up.

7. When cluster data resiliency is back to normal, LCM moves to the next node.

Performing Upgrades with LCM

With Internet Access

1. Open LCM and select either software or firmware updates.

2. Specify where LCM should look for updates, and then select the updates you want to
perform.

3. Select the NCC prechecks you want to run before updating.

4. Once the prechecks are complete, apply your updates.


At a Dark Site

By default, LCM automatically fetches updates from a pre-configured URL. If you are managing
a Nutanix cluster at a site that cannot access the provided URL, you must configure LCM to
fetch updates locally, using the procedure described in the Life Cycle Manager Guide on the
Nutanix Support Portal.

Labs
1. Performing a one-click NCC upgrade

2. Adding cluster nodes

3. Removing cluster nodes 

Module 15: ROBO Deployments

Overview
After completing this module, you will be able to:

• Understand hardware and software considerations for a ROBO site


• Understand Witness VM requirements for a ROBO site

• Understand failure and recovery scenarios for two-node clusters

• Understand the seeding process

Remote Office Branch Office


The landscape for enterprise remote and branch offices (ROBO), retail locations, regional
offices, and other edge sites has rapidly evolved in the last five years. In addition, new demands
have grown with more field-based IT infrastructures, such as oil rigs, kiosks, cruise ships,
forward-deployed military operations, and even airport security devices that need processing
power in proximity to the point of data collection. Often these needs cannot be met by the public
cloud due to latency and connection realities. These sites are often too small for traditional, legacy
approaches to IT infrastructure from capex, power, and space perspectives, as well as opex
constraints and the skills required on-site to manage and maintain them.

The Nutanix Enterprise Cloud is a powerful converged compute and storage system that offers
one-click simplicity and high availability for remote and branch offices. This makes deploying
and operating remote and branch offices as easy as deploying to the public cloud, but with
control and security on your own terms. Picking the right solution always involves trade-offs.
While a remote site is not your datacenter, uptime is nonetheless a crucial concern. Financial
constraints and physical layout also affect what counts as the best architecture for your
environment.

Cluster Considerations
The three-node and newer one- and two-node offerings (ROBO only) from Nutanix allow remote
offices to harness the power of the Nutanix Enterprise Cloud OS and simplify remote IT
infrastructure that can now be managed centrally with a single pane of glass. The Nutanix OS can
be consistently deployed across classic on-premises data centers, remote office/branch office and
disaster recovery (DR) sites, and the public clouds. This allows businesses to leverage common IT
tooling and enables application mobility across private and public clouds without being locked
into any hardware, hypervisor, or cloud.


Three-Node Clusters

Three-node (or more) clusters are the gold standard for ROBO deployments. They provide data
protection by always committing two copies of your data, keeping data safe during failures, and
automatically rebuilding data within 60 seconds of a node failure.

Nutanix recommends designing three-node clusters with enough capacity to recover from the
failure of a single node. For sites with high availability requirements or sites that are difficult to
visit, additional capacity above the n+1 node count is recommended.

Three-node clusters can scale up to eight nodes with 1 Gbps networking, and up to any scale
when using 10 Gbps and higher networking. 

Two-Node Clusters

Two-node clusters offer reliability for smaller sites while also being cost effective. A Witness
VM is required for two-node clusters only and is used only in failure scenarios, to coordinate
rebuilding data and automatic upgrades. For ROBO, you can deploy the witness offsite with up
to 500 ms of latency. Multiple two-node clusters can use the same witness. Nutanix
supports two-node clusters with ESXi and AHV only.

Two-node clusters cannot be expanded.

One-Node Clusters

One-node clusters are recommended where availability requirements are low but strong
overall management across multiple sites is still needed. Note that a one-node cluster provides
resiliency against the loss of a single hard drive. Nutanix supports one-node clusters with ESXi and AHV only.

One-node clusters cannot be expanded.


Cluster Storage Considerations


Nutanix recommends N+1 (nodes) - 5% to ensure sufficient space for rebuilding two- and three-node
clusters. For single-node clusters, reserve 55 percent of usable space to recover from the
loss of a disk.

Software Considerations

Hypervisor
The three main considerations for choosing the right hypervisor for your ROBO environment
are supportability, operations, and licensing costs.

With Nutanix Acropolis, VM placement and data placement occur automatically. Nutanix
also hardens systems by default to meet security requirements and provides the automation
necessary to maintain that security. Nutanix supplies STIGs (Security Technical Implementation
Guides) in machine-readable code for both AHV and the storage controller.

For environments that do not want to switch hypervisors in the main datacenter, Nutanix offers
cross-hypervisor disaster recovery to replicate VMs from AHV to ESXi or ESXi to AHV. In the
event of a disaster, administrators can restore their AHV VM to ESXi for quick recovery or
replicate the VM back to the remote site with easy workflows.

Centralized Management and Maintenance

Maintaining a branch site with onsite IT is an expensive and inefficient approach to ROBO
deployments. In addition, managing three separate tiers of infrastructure requires special
training. Multiplying these requirements across dozens to hundreds of branch locations is often
a non-starter. Nutanix Prism offers centralized infrastructure management, one-click simplicity,
and intelligence for everyday operations, with insights into capacity planning and forecasting. It
makes it possible to schedule upgrades for hundreds of remote sites within a few clicks. Prism
also provides network visualization, allowing you to troubleshoot basic networking issues
right from the same dashboard. With the scale-out capabilities added to the control plane, it is
possible to manage as many as 25,000 VMs or more centrally.


Prism Element

Prism Element is a service built into the platform for every Nutanix cluster deployed. Because
Prism Element manages only the cluster it is part of, each Nutanix cluster in a deployment has
a unique Prism Element instance for management. Multiple clusters are managed via Prism
Central.

Prism Central

Initial Installation and Sizing

• Small environments: For fewer than 2,500 VMs, size Prism Central to 4 vCPUs, 12 GB of
memory, and 500 GB of storage.

• Large environments: For up to 12,000 VMs, size Prism Central to 8 vCPUs, 32 GB of RAM,
and 2,500 GB of storage.

• If installing on Hyper-V, use the SCVMM library on the same cluster to enable fast copy. Fast
copy improves the deployment time.

Each node registered to and managed by Prism Pro requires you to apply a Prism Pro license
through the Prism Central web console. For example, if you have registered and are managing
10 Nutanix nodes (regardless of the individual node or cluster license level), you need to apply
10 Prism Pro licenses through the Prism Central web console.

Integrated Data Protection

Nutanix offers an integrated solution for local on-site backups and replication for central
backup and disaster recovery. The powerful Nutanix Time Stream capability allows unlimited
VM snapshots to be created on a local cluster for faster RPO and RTO, and to rapidly restore state
when required. Using Prism, administrators can schedule local snapshots and replication tasks
and control retention policies on an individual snapshot basis. An intuitive snapshot browser
allows administrators to quickly see local and remote snapshots and restore or retrieve a saved
snapshot, or a specific VM within a snapshot, with a single click. Snapshots are differential and
de-duplicated, so backup and recovery are automatically optimized, allowing DR and remote
backups to be completed efficiently for different environments.

• Backup – Provides local snapshot/restore at the ROBO site as well as remote snapshot/
restore to the main data center.

• Disaster Recovery – Provides snapshot replication to the main data center with automatic
failover in the event of an outage.

Witness VM Requirements

There are several requirements when setting up a Witness VM. The minimum requirements are:


• 2 vCPUs

• 6 GB of memory

• 25 GB of storage

The Witness VM must reside in a separate failure domain. This means the witness and all
two-node clusters must have independent power and network connections. We recommend
locating the witness VM in a third physical site with dedicated network connections to all sites
to avoid single points of failure.

Communication with the witness happens over TCP port 9440. This port must be open for the
CVMs on any two-node clusters using the witness.
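
A quick way to confirm that a CVM can reach the Witness VM on this port is to use standard Linux tools from the CVM, if they are available in your AOS image. The Witness IP address below is a hypothetical placeholder:

nutanix@cvm$ nc -zv 10.20.0.50 9440
nutanix@cvm$ curl -k -s -o /dev/null -w "%{http_code}\n" https://10.20.0.50:9440/

A successful TCP connection (or any HTTP status code from the second command) indicates the port is reachable end to end.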

Network latency between each two-node cluster and the Witness VM must be less than 500 ms
for ROBO.

The Witness VM may reside on any supported hypervisor and run on Nutanix or non-Nutanix
hardware. You can register multiple two-node clusters to a single Witness VM.

Failure and Recovery Scenarios for Two-Node Clusters


For two node recovery processes, a Witness VM is required. There are several potential failure
scenarios between the nodes and the Witness VM. Each failure generates one or more alerts
that can be reviewed in Prism. The recovery steps depend on the nature of the failure. In this
section, we will summarize the steps needed (or not needed) when a failure occurs.

Node Failure
When a node goes down, the live node sends a leadership request to the Witness VM and goes
into single-node mode. In this mode, RF2 is still retained at the disk level, meaning data is copied
to two disks. (Normally, RF2 is maintained at the node level, meaning data is copied to
each node.)

If one of the two metadata SSDs fails while in single-node mode, the cluster (node) goes into
read-only mode until a new SSD is picked for the metadata service. When the node that was down
is back up and stable again, the system automatically returns to the previous state (RF2 at the
node level). No user intervention is necessary during this transition.


Network Failure Between Nodes


When the network connection between the nodes fails, both nodes send a leadership request
to the Witness VM. Whichever node gets the leadership lock stays active and goes into single-
node mode. All operations and services on the other node are shut down, and the node goes
into a waiting state. When the connection is re-established, the same recovery process as in the
node failure scenario begins.

Network Failure Between Node and Witness VM


When the network connection between a single node (Node A in this example) and the Witness
fails, an alert is generated that Node A is not able to reach the Witness. The cluster is otherwise
unaffected, and no administrator intervention is required.

Witness VM Failure
When the Witness goes down (or the network connections to both nodes and the Witness fail),
an alert is generated but the cluster is otherwise unaffected. When connection to the Witness
is re-established, the Witness process resumes automatically. No administrator intervention is
required.

If the Witness VM goes down permanently (unrecoverable), follow the steps for configuring a
new Witness through the Configure Witness option of the Prism web console as described in
the Configuring a Witness (two-node cluster) topic on the Nutanix Support Portal.


Complete Network Failure


When a complete network failure occurs (no connections between the nodes or the Witness),
the cluster becomes unavailable. Manual intervention is needed to fix the network. While
the network is down (or when a node fails and the other node does not have access to the
Witness), you have the option to manually elect a leader and run in single-node mode. To
manually elect a leader, do the following:

1. Log in using SSH to the Controller VM for the node to be set as the leader and enter the
following command:
nutanix@cvm$ cluster set_two_node_cluster_leader

Run this command on just the node you want to elect as the leader. If both nodes are
operational, do not run it on the other node.

2. Remove (unconfigure) the current Witness and reconfigure with a new (accessible) Witness
when one is available as described in the Configuring a Witness (two-node cluster) topic on
the Nutanix Support Portal.


Seeding

When dealing with a remote site that has a limited network connection back to the main
datacenter, it may be necessary to seed data to overcome network speed deficits. You may
also need to seed data if systems were imaged with Foundation at a main site and shipped to a
remote site without data, but that data is required at a later date.

Seeding involves using a separate device to ship the data to the remote location. Instead of
replication taking weeks or months, depending on the amount of data you need to protect, you
can copy the data locally to a separate Nutanix node and then ship it to your remote site.

Nutanix checks the snapshot metadata before sending the device to prevent unnecessary
duplication. Nutanix can apply its native data protection to a seed cluster by placing VMs in a
protection domain and replicating them to the seed cluster. A protection domain is a collection
of VMs that have a similar recovery point objective (RPO). You must ensure, however, that the
seeding snapshot does not expire before you can copy the data to the final destination.

Note: For more information, please see the ROBO Deployment and Operations
Guide on the Nutanix Support Portal.
