Professional Documents
Culture Documents
Version 2.1
Jennifer Aspesi
Oliver Shorey
Copyright © 2010 - 2012 EMC Corporation. All rights reserved.
EMC believes the information in this publication is accurate as of its publication date. The information is
subject to change without notice.
THE INFORMATION IN THIS PUBLICATION IS PROVIDED “AS IS.” EMC CORPORATION MAKES NO
REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO THE INFORMATION IN THIS
PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED WARRANTIES OF MERCHANTABILITY OR
FITNESS FOR A PARTICULAR PURPOSE.
Use, copying, and distribution of any EMC software described in this publication requires an applicable
software license.
For the most up-to-date regulatory document for your product line, go to the Technical Documentation and
Advisories section on EMC Powerlink.
For the most up-to-date listing of EMC product names, see EMC Corporation Trademarks on EMC.com.
All other trademarks used herein are the property of their respective owners.
Preface
Chapter 7 Conclusion
Conclusion........................................................................................ 120
Better protection from storage-related failures ....................121
Protection from a larger array of possible failures...............121
Greater overall resource utilization........................................122
Glossary
Figures
1 Application and data mobility example ..................................................... 20
2 HA infrastructure example ........................................................................... 21
3 Distributed data collaboration example ..................................................... 22
4 VPLEX offerings ............................................................................................. 24
5 Architecture highlights.................................................................................. 26
6 VPLEX cluster example ................................................................................. 34
7 VPLEX Management Console ...................................................................... 42
8 Management Console welcome screen ....................................................... 43
9 VPLEX single engine configuration............................................................. 47
10 VPLEX dual engine configuration ............................................................... 48
11 VPLEX quad engine configuration .............................................................. 49
12 Port redundancy............................................................................................. 56
13 Director redundancy...................................................................................... 57
14 Engine redundancy ........................................................................................ 58
15 Site redundancy.............................................................................................. 59
16 High level functional sites in communication ........................................... 62
17 High level Site A failure ................................................................................ 63
18 High level Inter-site link failure ................................................................... 63
19 VPLEX active and functional between two sites ....................................... 64
20 VPLEX concept diagram with failure at Site A.......................................... 65
21 Correct resolution after volume failure at Site A....................................... 66
22 VPLEX active and functional between two sites ....................................... 67
23 Inter-site link failure and cluster partition ................................................. 68
24 Correct handling of cluster partition........................................................... 69
25 VPLEX static detach rule............................................................................... 71
26 Typical detach rule setup .............................................................................. 72
27 Non-preferred site failure ............................................................................. 73
28 Volume remains active at Cluster 1............................................................. 74
29 Typical detach rule setup before link failure ............................................. 75
30 Inter-site link failure and cluster partition ................................................. 76
Tables
1 Overview of VPLEX features and benefits .................................................. 26
2 Configurations at a glance ............................................................................. 35
3 Management server user accounts ............................................................... 40
Audience This document is part of the EMC VPLEX family documentation set,
and is intended for use by storage and system administrators.
Readers of this document are expected to be familiar with the
following topics:
◆ Storage area networks
◆ Storage virtualization technologies
◆ EMC Symmetrix, VNX series, and CLARiiON products
Authors This TechBook was authored by the following individuals from the
Enterprise Storage Division, VPLEX Business Unit based at EMC
headquarters, Hopkinton, Massachusetts.
Jennifer Aspesi has over 10 years of work experience with EMC in
Storage Area Networks (SAN), Wide Area Networks (WAN), and
Network and Storage Security technologies. Jen currently manages
the Corporate Systems Engineer team for the VPLEX Business Unit.
She earned her M.S. in Marketing and Technological Innovation from
Worcester Polytechnic Institute, Massachusetts.
Oliver Shorey has over 11 years of experience working within the
Business Continuity arena, seven of which have been with EMC
engineering, designing and documenting high-end replication and
geographically dispersed clustering technologies. He is currently a
Principal Corporate Systems Engineer in the VPLEX Business Unit.
Typographical conventions
EMC uses the following type style conventions in this document:
Normal Used in running (nonprocedural) text for:
• Names of interface elements (such as names of windows, dialog
boxes, buttons, fields, and menus)
• Names of resources, attributes, pools, Boolean expressions,
buttons, DQL statements, keywords, clauses, environment
variables, functions, utilities
• URLs, pathnames, filenames, directory names, computer
names, filenames, links, groups, service keys, file systems,
notifications
Bold Used in running (nonprocedural) text for:
• Names of commands, daemons, options, programs, processes,
services, applications, utilities, kernels, notifications, system
calls, man pages
Used in procedures for:
• Names of interface elements (such as names of windows, dialog
boxes, buttons, fields, and menus)
• What user specifically selects, clicks, presses, or types
Italic Used in all text (including procedures) for:
• Full titles of publications referenced in text
• Emphasis (for example a new term)
• Variables
Courier Used for:
• System output, such as an error message or script
• URLs, complete paths, filenames, prompts, and syntax when
shown outside of running text
Courier bold Used for:
• Specific user input (such as commands)
Courier italic Used in procedures for:
• Variables on command line
• User input variables
<> Angle brackets enclose parameter or variable values supplied by
the user
[] Square brackets enclose optional values
This chapter provides a brief summary of the main use cases for the
EMC VPLEX family and design considerations for high availability. It
also covers some of the key features of the VPLEX family system.
Topics include:
◆ Introduction ........................................................................................ 18
◆ VPLEX value overview ..................................................................... 19
◆ VPLEX product offerings ................................................................. 23
◆ Metro high availability design considerations .............................. 28
Introduction
The purpose of this TechBook is to introduce EMC® VPLEX™ high
availability and the VPLEX Witness as it is conceptually
architected, typically by customer storage administrators and EMC
Solutions Architects. The introduction of VPLEX Witness provides
customers with absolute physical and logical fabric and cache-coherent
redundancy when it is properly designed into the VPLEX Metro
environment.
This TechBook is designed to provide an overview of the features and
functionality associated with the VPLEX Metro configuration and the
importance of active/active data resiliency for today’s advanced host
applications.
VPLEX Local
VPLEX Local provides seamless, non-disruptive data mobility and the
ability to manage multiple heterogeneous arrays from a single
interface within a data center.
VPLEX Local allows increased availability, simplified management,
and improved utilization across multiple arrays.
Architecture highlights
VPLEX support is open and heterogeneous, supporting both EMC
storage and common arrays from other storage vendors, such as
HDS, HP, and IBM. VPLEX conforms to established worldwide
naming (WWN) guidelines that can be used for zoning.
VPLEX supports operating systems including both physical and
virtual server environments with VMware ESX and Microsoft
Hyper-V. VPLEX supports network fabrics from Brocade and Cisco,
including legacy McData SANs.
Note: For the latest information please refer to the ESSM (EMC
Simple Support Matrix) for supported host types as well as the
connectivity ESM for fabric and extended fabric support.
Features                         Benefits
Advanced data caching            Improve I/O performance and reduce storage array contention.
Scale-out cluster architecture   Start small and grow larger with predictable service levels.
Introduction
This section provides basic information on the following:
◆ “VPLEX I/O” on page 32
◆ “High-level VPLEX I/O flow” on page 32
◆ “Distributed coherent cache” on page 33
◆ “VPLEX family clustering architecture” on page 33
VPLEX I/O
VPLEX is built on a lightweight protocol that maintains cache
coherency for storage I/O and the VPLEX cluster provides highly
available cache, processing power, front-end, and back-end Fibre
Channel interfaces.
EMC hardware powers the VPLEX cluster design so that all devices
are always available and I/O that enters the cluster from anywhere
can be serviced by any node within the cluster.
The AccessAnywhere feature in the VPLEX Metro and VPLEX Geo
products extends the cache coherency between data centers at a
distance.
On reads from other engines, VPLEX checks the directory and tries to
pull the read I/O directly from the engine cache to avoid going to the
physical arrays to satisfy the read.
This model enables VPLEX to stretch the cluster as VPLEX distributes
the directory between clusters and sites. Due to the hierarchical
nature of the VPLEX directory, VPLEX is efficient, with minimal
overhead, and enables I/O communication over distance.
Upgrade paths
VPLEX facilitates application and storage upgrades without a service
window through its flexibility to shift production workloads
throughout the VPLEX environment.
In addition, high-availability features of the VPLEX cluster allow for
non-disruptive VPLEX hardware and software upgrades.
This flexibility means that VPLEX is always servicing I/O and never
has to be completely shut down.
Hardware upgrades
Upgrades are supported for single-engine VPLEX systems to dual- or
quad-engine systems.
A single VPLEX Local system can be reconfigured to work as a
VPLEX Metro or VPLEX Geo by adding a new remote VPLEX cluster.
Additionally, an entire VPLEX VS1 cluster (hardware) can be fully
upgraded to VS2 hardware non-disruptively.
Information for VPLEX hardware upgrades is in the Procedure
Generator that is available through EMC Powerlink.
Software upgrades
VPLEX features a robust non-disruptive upgrade (NDU) technology
to upgrade the software on VPLEX engines and VPLEX Witness
servers. Management server software must be upgraded before
running the NDU.
Due to the VPLEX distributed coherent cache, directors elsewhere in
the VPLEX installation service I/Os while the upgrade is taking
place. This alleviates the need for service windows and reduces RTO.
The NDU includes the following steps, summarized in the sketch after this list:
◆ Preparing the VPLEX system for the NDU
◆ Starting the NDU
◆ Transferring the I/O to an upgraded director
◆ Completing the NDU
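The following Python-style sketch (illustrative only, not the actual VPLEX NDU implementation; all names are hypothetical) shows the general shape of this sequence, with I/O transferred away from the director being upgraded so the system is always servicing I/O:

def run_ndu(directors):
    """Upgrade directors one at a time so the remaining directors keep servicing I/O."""
    print("Preparing the VPLEX system for the NDU")
    for director in directors:
        others = [d for d in directors if d != director]
        print(f"Transferring I/O from {director} to {others}")
        print(f"Upgrading software on {director}")
    print("Completing the NDU")

run_ndu(["director-1-1-A", "director-1-1-B", "director-1-2-A", "director-1-2-B"])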
Web-based GUI
VPLEX includes a Web-based graphical user interface (GUI) for
management. The EMC VPLEX Management Console Help provides
more information on using this interface.
To perform other VPLEX operations that are not available in the GUI,
refer to the CLI, which supports full functionality. The EMC VPLEX
CLI Guide provides a comprehensive list of VPLEX commands and
detailed instructions on using those commands.
The EMC VPLEX Management Console provides functions that
include, but are not limited to, the following:
◆ Storage array discovery and provisioning
◆ Local provisioning
◆ Distributed provisioning
◆ Mobility Central
◆ Online help
VPLEX CLI
VPlexcli is a command line interface (CLI) used to configure and operate
VPLEX systems. It also provides the EZ-Setup Wizard process to
make installation of VPLEX easier and quicker.
The CLI is divided into command contexts. Some commands are
accessible from all contexts, and are referred to as ‘global commands’.
The remaining commands are arranged in a hierarchical context tree
that can only be executed from the appropriate location in the context
tree.
Management console
The VPLEX Management Console provides a graphical user interface
(GUI) to manage the VPLEX cluster. The GUI can be used to
provision storage, as well as manage and monitor system
performance.
Figure 7 on page 42 shows the VPLEX Management Console window
with the cluster tree expanded to show the objects that are
manageable from the front-end, back-end, and the federated storage.
The VPLEX Management Console provides online help for all of its
available functions. Online help can be accessed in the following
ways:
◆ Click the Help icon in the upper right corner on the main screen
to open the online help system, or in a specific screen to open a
topic specific to the current task.
◆ Click the Help button on the task bar to display a list of links to
additional VPLEX documentation and other sources of
information.
For information about the VPlexcli, refer to the EMC VPLEX CLI
Guide.
System reporting
VPLEX system reporting software collects configuration information
from each cluster and each engine. The resulting configuration file
(XML) is zipped and stored locally on the management server or
presented to the SYR system at EMC via call home.
You can schedule a weekly job to automatically collect SYR data
(VPlexcli command scheduleSYR), or manually collect it whenever
needed (VPlexcli command syrcollect).
Director software
The director software provides:
◆ Basic Input/Output System (BIOS) — Provides low-level
hardware support to the operating system, and maintains boot
configuration.
◆ Power-On Self Test (POST) — Provides automated testing of
system hardware during power on.
◆ Linux — Provides basic operating system services to the VPlexcli
software stack running on the directors.
◆ VPLEX Power and Environmental Monitoring (ZPEM) —
Provides monitoring and reporting of system hardware status.
◆ EMC Common Object Model (ECOM) — Provides management
logic and interfaces to the internal components of the system.
◆ Log server — Collates log messages from director processes and
sends them to the SMS.
◆ EMC GeoSynchrony™ (I/O Stack) — Processes I/O from hosts,
performs all cache processing, replication, and virtualization
logic, interfaces with arrays for claiming and I/O.
Configuration overview
The VPLEX configurations are based on how many engines are in the
cabinet. The basic configurations are single, dual, and quad
(previously known as small, medium, and large).
The configuration sizes refer to the number of engines in the VPLEX
cabinet. The remainder of this section describes each configuration
size.
Dual configurations
The VPLEX dual engine configuration includes the following:
◆ Four directors
◆ Two engines
◆ Redundant engine SPSs
◆ 16 front-end Fibre Channel ports (32 for VS1 hardware)
◆ 16 back-end Fibre Channel ports (32 for VS1 hardware)
◆ One management server
◆ Redundant Fibre Channel COM switches for local COM; UPS for
each Fibre Channel switch
Figure 10 shows an example of a dual engine (medium) configuration.
Figure 10 VPLEX dual engine configuration (Engine 1, Engine 2, SPS 1, SPS 2, management server, UPS A, UPS B) [VPLX-000254]
Quad configurations
The VPLEX quad engine configuration includes the following:
◆ Eight directors
◆ Four engines
◆ Redundant engine SPSs
◆ 32 front-end Fibre Channel ports (64 for VS1 hardware)
Figure 11 VPLEX quad engine configuration (Engine 1 through Engine 4, SPS 1 through SPS 4, management server, UPS A, UPS B) [VPLX-000253]
I/O implementation
The VPLEX cluster utilizes a write-through mode when configured
for either VPLEX Local or Metro, whereby all writes are written
through the cache to the back-end storage. To maintain data integrity,
a host write is acknowledged only after the back-end arrays (in one
cluster in the case of VPLEX Local, and in two clusters in the case of
VPLEX Metro) acknowledge the write.
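A minimal sketch of this behavior, assuming a simple mirror-leg model (hypothetical names, not the actual VPLEX code):

def write_through(page, legs):
    """Acknowledge the host only after every back-end mirror leg has
    acknowledged the write. `legs` is one callable per back-end array
    (one cluster for VPLEX Local, two clusters for VPLEX Metro)."""
    for write_to_leg in legs:
        write_to_leg(page)          # write through the cache to back-end storage
    return "ACK to host"            # returned only once all legs have acknowledged

local_array, remote_array = [], []
print(write_through(b"block-0", [local_array.append, remote_array.append]))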
This section describes the VPLEX cluster caching layers, roles, and
interactions. It gives an overview of how reads and writes are
handled within the VPLEX cluster and how distributed cache
coherency works. This is important to the introduction of high
availability concepts.
Cache coherence
Cache coherence creates a consistent global view of a volume.
Distributed cache coherence is maintained using a directory. There is
one directory per virtual volume and each directory is split into
chunks (4096 directory entries within each). These chunks exist only
if they are populated. There is one directory entry per global cache
page, with responsibility for the following (see the sketch after this list):
◆ Tracking page owner(s) and remembering the last writer
◆ Locking and queuing
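A minimal sketch of this bookkeeping, using hypothetical Python structures rather than VPLEX internals, with one directory per virtual volume and chunks created only when a page within them is touched:

CHUNK_SIZE = 4096   # directory entries per chunk, as described above

class Directory:
    """One directory per virtual volume; chunks exist only once populated."""
    def __init__(self):
        self.chunks = {}   # chunk index -> {page number -> entry}

    def entry(self, page):
        chunk = self.chunks.setdefault(page // CHUNK_SIZE, {})
        return chunk.setdefault(page, {"owners": set(), "last_writer": None})

    def record_write(self, page, director):
        e = self.entry(page)
        e["owners"] = {director}       # the writer now owns the page
        e["last_writer"] = director    # remember the last writer

    def record_read(self, page, director):
        self.entry(page)["owners"].add(director)

d = Directory()
d.record_write(10, "director-1-1-A")
d.record_read(10, "director-2-1-B")
print(d.entry(10))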
Meta-directory
Directory chunks are managed by the meta-directory, which assigns
and remembers chunk ownership. These chunks can migrate using
Locality-Conscious Directory Migration (LCDM). This
meta-directory knowledge is cached across the share group (i.e., a
group of multiple directors within the cluster that are exporting a
given virtual volume) for efficiency.
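Continuing the sketch above (again hypothetical, not VPLEX internals), the meta-directory can be thought of as a map from chunk to owning director whose entries can migrate:

class MetaDirectory:
    """Tracks which director owns each directory chunk; ownership can migrate,
    loosely mirroring Locality-Conscious Directory Migration (LCDM)."""
    def __init__(self):
        self.chunk_owner = {}

    def owner(self, chunk, default_director):
        # Assign ownership on first reference, then remember it.
        return self.chunk_owner.setdefault(chunk, default_director)

    def migrate(self, chunk, new_director):
        self.chunk_owner[chunk] = new_director   # move ownership closer to the I/O

md = MetaDirectory()
print(md.owner(0, "director-1-1-A"))   # director-1-1-A
md.migrate(0, "director-2-1-A")
print(md.owner(0, "director-1-1-A"))   # director-2-1-A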
If the data is not found in local cache, VPLEX searches global cache.
Global cache includes all directors that are connected to one another
within the single VPLEX cluster for VPLEX Local, and all of the
VPLEX clusters for both VPLEX Metro and VPLEX Geo. If there is a
global read hit in the local cluster (i.e. same cluster, but different
director) then the read will be serviced from global cache in the same
cluster. The read could also be serviced by the remote global cache if
the consistency group setting “local read override” is set to false (the
default is true). Whenever the read is serviced from global cache
(same cluster or remote), a copy is also stored in the local cache of the
director from where the request originated.
If a read cannot be serviced from either local cache or global cache, it
is read directly from the back-end storage. In these cases both the
global and local cache are updated.
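The read-servicing order described above can be summarized with the following sketch; the function and cache structures are hypothetical illustrations only, and local_read_override defaults to true as noted above:

def service_read(page, local_cache, local_global_cache, remote_global_cache,
                 backend, local_read_override=True):
    """Return data for `page`, searching caches in the order described above."""
    if page in local_cache:                        # local cache hit
        return local_cache[page]
    if page in local_global_cache:                 # global hit in the same cluster
        data = local_global_cache[page]
    elif not local_read_override and page in remote_global_cache:
        data = remote_global_cache[page]           # remote global cache (override disabled)
    else:
        data = backend[page]                       # miss: read from back-end storage
        local_global_cache[page] = data            # update the global cache
    local_cache[page] = data                       # always keep a local copy
    return data

print(service_read(7, {}, {}, {}, {7: b"page-7"}))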
Overview
VPLEX clusters are capable of surviving any single hardware failure
in any subsystem within the overall storage cluster, including the
host connectivity subsystem, memory subsystem, and so on. A single failure
in any subsystem will not affect the availability or integrity of the
data. Multiple failures in a single subsystem and certain
combinations of single failures in multiple subsystems may affect the
availability or integrity of data.
High availability requires that host connections be redundant and
that hosts are supplied with multipath drivers. In the event of a
front-end port failure or a director failure, hosts without redundant
physical connectivity to a VPLEX cluster and without multipathing
software installed may be susceptible to data unavailability.
Cluster
A cluster is a collection of one, two, or four engines in a physical
cabinet. A cluster serves I/O for one storage domain and is managed
as one storage cluster.
All hardware resources (CPU cycles, I/O ports, and cache memory)
are pooled:
◆ The front-end ports on all directors provide active/active access
to the virtual volumes exported by the cluster.
◆ For maximum availability, virtual volumes can be presented
through all directors so that all directors but one can fail without
causing data loss or unavailability. To achieve this with version
5.0.1 code and below, directors must be connected to all storage.
Note: Instant failure of all directors bar one in a dual or quad engine
system would result in the last remaining director also failing, since it
would lose quorum. The statement above therefore only holds if directors
fail one at a time.
Serviceability
In addition to the redundancy fail-safe features, the VPLEX cluster
provides event logs and call home capability via EMC Secure Remote
Support (ESRS).
Note: To ensure the explanation of this subject remains at a high level, in the
following section the graphics have been broken down into major objects
(e.g., Site A, Site B, and Link). You can assume that within each site resides a
VPLEX cluster. Therefore, when a site failure is shown it will also cause a full
VPLEX cluster failure within that site. You can also assume that the link
object between sites represents the main inter-cluster data network connected
to each VPLEX cluster in either site. One further assumption is that all
components within a site share the same failure domain; a site failure will
affect all components within this failure domain, including the VPLEX cluster.
Figure 20 shows a total failure at one of the sites (in this case Site A
has failed). In this case the distributed volume would become
degraded since the hardware required at Site A to support this
particular mirror leg is no longer available. For a resolution to this
example, keep the volume active at Site B so the application can
resume there.
Note: VPLEX Metro also supports the rule set “no automatic winner”. If a
consistency group is configured with this setting, then I/O will suspend at
both VPLEX clusters if either the inter-cluster link partitions or an entire
VPLEX cluster fails. Manual intervention can then be used to resume I/O at a
remaining cluster if required. Care should be taken when setting this policy:
although it ensures that both VPLEX clusters remain identical at all times, the
trade-off is that the production environment is halted. This is useful if a
customer wishes to integrate VPLEX failover semantics with failover
behavior driven by the application (suppose the application has its own
witness, etc.). In this case, the application can provide a script that invokes
the resume CLI command on the VPLEX cluster of its choosing.
Figure 25 shows how static preference can be set for each distributed
volume (also known as a DR1 - Distributed RAID1).
This detach rule can either be set within the VPLEX GUI or via
VPLEX CLI.
Each volume can be either set to Cluster 1 detaches, Cluster 2
detaches or no automatic winner.
If the Distributed Raid 1 device (DR1) is set to Cluster 1 detaches,
then in any failure scenario the preferred cluster for that volume
would be declared as Cluster 1, but if the DR1 detach rule is set to
Cluster 2 detaches, then in any failure scenario the preferred cluster
for that volume would be declared as Cluster 2.
Note: When reading these rules, some people prefer to substitute the word
detaches with preferred or wins, which is perfectly acceptable and may make
the rules easier to understand.
Note: A caveat exists here: if the state of the back end (BE) at the preferred
cluster is out of date (due to a prior BE failure, an incomplete rebuild, or
another issue), the preferred cluster will suspend I/O regardless of preference.
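A minimal sketch of how a static detach rule maps to per-cluster I/O state after an inter-cluster partition (hypothetical helper, for illustration only; the back-end caveat above is not modeled):

def io_state_after_partition(detach_rule):
    """Per-cluster I/O state for a distributed volume after an inter-cluster
    partition, based only on the static detach rule (no VPLEX Witness)."""
    winner = {"cluster-1-detaches": "cluster-1",
              "cluster-2-detaches": "cluster-2",
              "no-automatic-winner": None}[detach_rule]
    return {cluster: ("active" if cluster == winner else "suspended")
            for cluster in ("cluster-1", "cluster-2")}

print(io_state_after_partition("cluster-1-detaches"))
# {'cluster-1': 'active', 'cluster-2': 'suspended'}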
The following diagrams show some examples of the rule set in action
for different failures, the first being a site loss at B with a single DR1
set to Cluster 1 detaches.
Figure 26 shows the initial running setup of the configuration. It can
be seen that the volume is set to Cluster 1 detaches.
If there is a problem at Site B, then the DR1 becomes degraded,
as shown in Figure 27.
The next example shows static preference working under link failure
conditions.
Figure 29 shows a configuration with a distributed volume set to
Cluster 1 detaches as per the previous configuration.
If the link were now lost, then the distributed volume would again
become degraded, as shown in Figure 30.
To ensure that split brain does not occur after this type of failure, the
static preference rule is applied and I/O is suspended at Cluster 2 in
this case, as the rule is set to Cluster 1 detaches.
As can be seen, the preferred site has now failed and the preference
rule has been used, but since the rule is “static” and cannot
distinguish between a link failure and a remote site failure, in this
example the remaining site becomes suspended. Therefore, in this
case, manual intervention will be required to bring the volume online
at Site A.
Static preference is a very powerful rule. It provides zero-RPO and
zero-RTO resolution for non-preferred cluster failure and
inter-cluster partition scenarios, and it completely avoids split brain.
However, in the presence of a preferred cluster failure it provides a
non-zero RTO. It is worth noting that this feature is available without
automation and is a valuable alternative when a VPLEX Witness
configuration (discussed in the next chapter) is unavailable or the
customer infrastructure cannot accommodate one due to the lack of a
third failure domain.
Note: If using a VPLEX Metro deployment without VPLEX Witness, and the
preferred cluster has been lost, I/O can be manually resumed via the CLI at
the remaining (non-preferred) VPLEX cluster. However, care should be taken
here to avoid a conflicting detach or split-brain scenario. (VPLEX Witness
solves this problem automatically.)
Clearly, using the example of the third floor in the building, one
would not be protected from a disaster affecting the entire building
so, depending on the requirement, careful consideration should be
given if choosing this third failure domain.
Figure 36 shows the supported versions (at the time of writing) for
VPLEX Witness.
Check the latest VPLEX ESSM (EMC Simple Support Matrix), located
at https://elabnavigator.emc.com, Simple Support Matrix tab, for
the latest information including VPLEX Witness server physical host
requirements and site qualification.
In this case the VPLEX Witness guides both clusters to follow the
pre-configured static preference rules and volume access at cluster 1
will be suspended since the rule set was configured as cluster 2
detaches.
The next example shows how VPLEX Witness can assist if you have a
site failure at the preferred site. As discussed above, this type of
failure without VPLEX Witness would cause the volumes in the
surviving site to go offline. This is where VPLEX Witness greatly
improves the outcome of this event and removes the need for manual
intervention.
As discussed in the previous section, when a site has failed, the
distributed volumes become degraded. However, unlike the
previous example, where there was a site failure at the preferred site
and the static preference rule was used, forcing volumes into a
suspend state at cluster 1, VPLEX Witness will now observe that
communication is still possible to cluster 1 (but not cluster 2).
Additionally, since cluster 1 cannot contact cluster 2, VPLEX Witness
can make an informed decision and guide cluster 1 to override the
static rule set and proceed with I/O.
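Before looking at the VPlexcli output below, the decisions described in this and the previous section can be summarized with the following sketch (a hypothetical function for illustration, not the actual VPLEX Witness logic):

def witness_guidance(witness_sees_c1, witness_sees_c2, clusters_see_each_other,
                     static_preference):
    """Return the cluster(s) that should continue I/O for a distributed volume."""
    if clusters_see_each_other:
        return {"cluster-1", "cluster-2"}        # normal operation
    if witness_sees_c1 and witness_sees_c2:
        return {static_preference}               # link partition: follow the detach rule
    if witness_sees_c1:
        return {"cluster-1"}                     # cluster 2 has failed or is isolated
    if witness_sees_c2:
        return {"cluster-2"}                     # cluster 1 has failed or is isolated
    return set()                                 # an isolated cluster suspends I/O

# Preferred-site (cluster 2) failure: the Witness overrides a "cluster 2 detaches" rule.
print(witness_guidance(True, False, False, "cluster-2"))   # {'cluster-1'}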
VPlexcli:/cluster-witness> ls
Attributes:
Name Value
------------- -------------
admin-state enabled
private-ip-address 128.221.254.3
public-ip-address 10.31.25.45
Contexts:
components
VPlexcli:/cluster-witness> ll components/
/cluster-witness/components:
VPlexcli:/cluster-witness> ll components/*
/cluster-witness/components/cluster-1:
Name Value
----------------------- ------------------------------------------------------
admin-state enabled
diagnostic INFO: Current state of cluster-1 is in-contact (last
state change: 0 days, 13056 secs ago; last message
from server: 0 days, 0 secs ago.)
id 1
management-connectivity ok
operational-state in-contact
/cluster-witness/components/cluster-2:
Name Value
----------------------- ------------------------------------------------------
admin-state enabled
/cluster-witness/components/server:
Name Value
----------------------- ------------------------------------------------------
admin-state enabled
diagnostic INFO: Current state is clusters-in-contact (last state
change: 0 days, 13056 secs ago.) (last time of
communication with cluster-2: 0 days, 0 secs ago.)
(last time of communication with cluster-1: 0 days, 0
secs ago.)
id -
management-connectivity ok
operational-state clusters-in-contact
Refer to the VPLEX CLI Guide found on Powerlink for more details
on the VPLEX Witness CLI.
VPLEX Witness cluster isolation semantics and dual failures
As discussed in the previous section, deploying a VPLEX solution
with VPLEX Witness will give continuous availability to the storage
volumes regardless of there being a site failure or inter-cluster link
failure. These types of failure are deemed single component failures,
and it has been shown that no single point of failure can induce data
unavailability when using the VPLEX Witness.
It should be noted, however, that in rare situations more than one
fault or component outage can occur, especially when considering the
inter-cluster communication links, where two simultaneous failures
would lead to VPLEX cluster isolation at a given site.
For instance, if you consider a typical VPLEX Setup with VPLEX
Witness you will automatically have three failure domains (this
example will use A, B, and C, where VPLEX cluster 1 resides at A,
VPLEX cluster 2 at B, and the VPLEX Witness server resides at C). In
this case there will be an inter-cluster link between A and B (cluster 1
and 2), plus a management IP link between A and C, as well as a
management IP link between B and C, effectively giving a
triangulated topology.
In rare situations there is a chance that if the link between A and B
failed followed by a further link failure from either A to C or B to C,
then one of the sites will be isolated (cut off).
Due to the nature of VPLEX Witness, these types of isolation can also
be dealt with effectively without manual intervention.
This is achieved because a site isolation is very similar, in terms of
technical behavior, to a full site outage; the main difference is that
the isolated site is still fully operational and powered up (but needs
to be forced into I/O suspension), unlike a site failure where the failed
site is not operational.
In these cases the failure semantics and VPLEX Witness are
effectively the same. However, two further actions are taken at the
site that becomes isolated:
◆ I/O is shut off/suspended at the isolated site.
◆ The VPLEX cluster will attempt to call home.
Figure 44 shows the three scenarios that are described above:
Note: If best practices are followed, the likelihood of these scenarios
occurring is significantly less than even the rare isolation incidents discussed
above, mainly because the faults would have to disrupt components in totally
different fault domains that could be spread over many miles.
Figure 45 Highly unlikely dual failure scenarios that require manual intervention
A point to note in the above scenarios is that, for the shown outcomes
to be correct, the failures would have to have happened in a specific
order, where the link to the VPLEX Witness (or the Witness itself) has
failed first and then either the inter-cluster link or a VPLEX cluster fails.
However, if the order of failure is reversed, then in all three cases the
outcome would be different, since one of the VPLEX clusters would
have remained online for the given distributed volume, therefore not
requiring manual intervention.
This is due to the fact that once a failure occurs, the VPLEX Witness
will give guidance to the VPLEX cluster. This guidance is “sticky”:
once the Witness has provided its guidance, it is no longer consulted
during any subsequent failure until the system has been returned to a
fully operational state (i.e., the system has fully recovered and
connectivity between both clusters and the VPLEX Witness is fully
restored).
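A minimal sketch of this “sticky” behavior, using a hypothetical latch purely for illustration:

class StickyGuidance:
    """Once guidance is handed out after a failure, it is not revisited until
    the system has fully recovered."""
    def __init__(self):
        self.current = None

    def on_failure(self, guidance):
        if self.current is None:       # only the first failure is acted upon
            self.current = guidance
        return self.current

    def on_full_recovery(self):
        self.current = None            # guidance is re-armed only after full recovery

latch = StickyGuidance()
print(latch.on_failure("cluster-1 continues I/O"))   # first failure sets the guidance
print(latch.on_failure("cluster-2 continues I/O"))   # a later failure does not change it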
Figure 46 Two further dual failure scenarios that would require manual
intervention
Note: It is always considered best practice to ensure ESRS and alerting are
fully configured when using VPLEX Witness. This way, if a VPLEX cluster
loses communication with the Witness server, the VPLEX cluster will dial
home and alert. This also ensures that, if both VPLEX clusters lose
communication with the Witness, the Witness function can be manually
disabled when the Witness communication loss or outage is expected to last
for an extended time, reducing the risk of data unavailability in the event of
an additional VPLEX cluster failure or WAN partition.
VPLEX Metro HA
Note: At the time of writing, the only qualified host cluster solutions
that can be configured with the additional VPLEX Metro HA campus (as
opposed to standard VPLEX Metro HA without the cross-cluster
connect) are vSphere versions 4.1 and 5.0, Windows 2008, and IBM
PowerHA 5.4. Be sure to check the latest VPLEX Simple Support Matrix,
located at https://elabnavigator.emc.com, Simple Support Matrix tab, for the
latest support information, or submit an RPQ.
Failure scenarios
For the following failure scenarios, this section assumes that vSphere
5.0 update 1 or above is configured in a stretched HA topology with
DRS so that all of the physical hosts (ESX servers) are within the same
HA cluster. As discussed previously, this type of configuration brings
the ability to teleport virtual machines over distance, which is
extremely useful in disaster avoidance, load balancing and cloud
infrastructure use cases. These use cases are all enabled using out of
the box features and functions; however, additional value can be
derived from deploying the VPLEX Metro HA campus solution to
ensure total availability for both planned and unplanned events.
High-level recommendations and pre-requisites for stretching
vSphere HA when used in conjunction with VPLEX Metro are as
follows:
◆ A single vCenter instance must span both locations that contain
the VPLEX Metro cluster pairs. (Note: it is recommended this is
virtualized and protected via vSphere heartbeat to ensure restart
in the event of failure)
◆ Must be used in conjunction with a stretched layer 2 network
ensuring that once a VM is moved it still resides on the same
logical network.
◆ vSphere HA can be enabled within the vSphere cluster, but
vSphere fault tolerance is not supported (at the time of writing
but is planned for late 2012).
◆ Can be used with vSphere DRS. However, careful consideration
should be given to this feature for vSphere versions prior to 5.0
update 1, since certain failure conditions where the VM is running
at the “non-preferred site” may not invoke a VM failover after
failure, due to a problem where the ESX server does not detect a
storage “Persistent Device Loss (PDL)” state. This can lead to the
VM remaining online but intermittently unresponsive (also known
as a zombie VM). Manual intervention would be required in this
scenario.
Example 1 Figure 49 shows all physical ESX hosts failing in domain A1.
Since all of the physical hosts in domain B1 are connected to the same
datastores via the VPLEX Metro distributed device, VMware HA can
restart the virtual machines on any of the physical ESX hosts in
domain B1.
Example 2 The next example describes what will happen in the unlikely event
that a VPLEX cluster were to fail in either domain A2 or B2.
Since the ESX servers are cross connected to both VPLEX clusters in
each site, ESX will simply re-route the I/O to the alternate path,
which is still available since VPLEX is configured with a VPLEX
Witness protected distributed volume. This ensures the distributed
volume remains online in domain B2: the VPLEX Witness Server
observes that it cannot communicate with the VPLEX cluster in A2
and guides the VPLEX cluster in B2 to remain online, as B2 also
cannot communicate with A2, meaning A2 is either isolated or has
failed.
Note: Similarly in the event of a full isolation at A2, the distributed volumes
would simply suspend at A2 since communication would not be possible to
either the VPLEX Witness Server or the VPLEX cluster in domain B2. In this
case, the outcome is identical from a vSphere perspective and there will be no
interruption since I/O would be re-directed across the cross connect to
domain B2 where the distributed volume would remain online and available
to service I/O.
Example 3 The following example describes what will happen in the event of a
failure of one (or all) of the back-end storage arrays in either domain
A3 or B3. Again, in this instance there would be no interruption to
any of the virtual machines.
Figure 51 shows the failure of all storage arrays that reside in domain
A3. Since a cache-coherent VPLEX Metro distributed volume is
configured between domains A2 and B2, I/O can continue to be
actively serviced from the VPLEX in A2 even though the local back-end
storage has failed. This is due to the embedded VPLEX cache
coherency, which will efficiently cache any reads into the A2 domain
whilst also propagating writes to the back-end storage in domain B3
via the remote VPLEX cluster in site B2.
Example 4 The next example describes what will happen in the event of a
VPLEX Witness server failure in domain C1.
Again, in this instance there would be no interruption to any of the
virtual machines or VPLEX clusters. Figure 52 shows a complete
failure of domain C1, where the VPLEX Witness server resides. Since
the VPLEX Witness is not within the I/O path and is only an optional
component, I/O will actively continue for any distributed volume in
domains A2 and B2, since the inter-cluster link is still available,
meaning cache coherency can be maintained between the VPLEX
cluster domains.
Although the service is uninterrupted, both VPLEX clusters will now
dial home and indicate they have lost communication with the
VPLEX Witness Server, as a further failure of either of the VPLEX
clusters in domains A2 and B2, or of the inter-cluster link, would cause
data unavailability. The risk of this is heightened should the VPLEX
Witness server be offline for an extended duration. To remove this
risk, the Witness feature may be disabled manually, therefore enabling
the VPLEX clusters to follow the static preference rules.
Example 5 The next example describes what will happen in the event of a failure
to the inter-cluster link between domains A2 and B2.
Again, in this instance there would be no interruption to any of the
virtual machines or VPLEX clusters.
Figure 53 shows the inter-cluster link has failed between domains A2
and B2. In this instance the static preference rule set, which was
defined previously, will be invoked since neither VPLEX cluster can
communicate with the other (but the VPLEX Witness Server can
communicate with both VPLEX clusters). Therefore, access to the given
distributed volume within one of the domains A2 or B2 will be
suspended. Since in this example the cross-connect network is
physically separate from the inter-cluster link, the alternate paths are
still available to the remote VPLEX cluster where the volume remains
online; therefore, ESX will simply re-route the traffic to the alternate
VPLEX cluster, meaning the virtual machine will remain online and
unaffected whichever site it was running on.
Note: At the time of writing, a larger number of host-based cluster solutions
are supported with VPLEX Metro HA when compared to VPLEX Metro HA
campus (with the cross-connect). Although this next section only discusses
vSphere with VPLEX Metro HA, other supported host clusters include
Microsoft Hyper-V, Oracle RAC, PowerHA, Serviceguard, etc. Be sure to
check the latest VPLEX Simple Support Matrix found at
https://elabnavigator.emc.com, Simple Support Matrix tab, for the latest
support information.
Failure scenarios
As with the previous section, when deploying a stretched vSphere
configuration with VPLEX Metro HA, it is also possible to enable
long-distance vMotion (virtual machine teleportation), since the ESX
datastore resides on a VPLEX Metro distributed volume and therefore
exists in two places at the same time, exactly as described in the
previous section.
Again, for these failure scenarios, this section assumes that vSphere
version 5.0 update 1 or higher is configured in a stretched HA
topology so that all of the physical hosts at either site (ESX servers)
are within the same HA cluster.
Since the majority of the failure scenarios behave identically to the
cross-connect configuration, this section will only show two failure
scenarios where the outcomes differ slightly to the previous section.
Note: For detailed setup instructions and best practice planning for a
stretched HA vSphere environment please read White Paper: Using VMware
vSphere with EMC VPLEX — Best Practices Planning, which can be found on
Powerlink (http://powerlink.emc.com) under Home > Support > Technical
Documentation and Advisories > Hardware/Platforms Documentation >
VPLEX Family > White Papers.
Example 1 The following example describes what will happen in the unlikely
event that a VPLEX cluster were to fail in domain A2. In this instance
there would be no interruption of service to any virtual machines
running in domain B1; however, any virtual machines that were
running in domain A1 would see a minor interruption as the virtual
machines are restarted at B1.
Figure 56 shows a full VPLEX cluster outage in domain A2.
As can be seen in the graphic, since the ESX servers are not
cross-connected to the remote VPLEX cluster, the ESX servers will lose
access to the storage, causing the HA host cluster (in this case
vSphere) to perform an HA restart for the virtual machines within
domain A2. It can do this since the distributed volumes will remain
active at B2: VPLEX is configured with a VPLEX Witness protected
distributed volume, and the Witness will deduce that the VPLEX in
domain A2 is unavailable (since neither the VPLEX Witness Server nor
the VPLEX cluster in B2 can communicate with the VPLEX cluster in
A2), therefore guiding the VPLEX cluster in B2 to remain online.
Example 2 The next example describes what will happen in the event of a failure
to the inter-cluster link between domains A2 and B2.
One of two outcomes will occur, as summarized in the sketch after this list:
◆ VMs running at the preferred site - If the static preference for a
given distributed volume was set to cluster 1 detaches (assuming
cluster 1 resides in domain A2) and the virtual machine was
running at the same site where the volume remains online (aka
the preferred site) then there is no interruption to service.
◆ VMs running at the non-preferred site - If the static preference
for a given distributed volume was set to cluster 1 detaches
(assuming cluster 1 resides in domain A2) and the virtual
machine was running at the remote site (Domain B1) then the
VM’s storage will be in the suspended state (PDL). In this case
the guest operating systems will fail allowing the virtual machine
to be automatically restarted in domain A1.
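A minimal sketch summarizing these two outcomes (hypothetical helper, for illustration only):

def vm_outcome(vm_site, preferred_site):
    """Outcome for a VM after an inter-cluster link failure in a VPLEX Metro HA
    deployment without cross-connect."""
    if vm_site == preferred_site:
        return "no interruption to service"
    return "storage suspended (PDL); VM restarted by HA at the preferred site"

print(vm_outcome("A1", preferred_site="A1"))   # VM at the preferred site
print(vm_outcome("B1", preferred_site="A1"))   # VM at the non-preferred site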
Figure 57 shows the link has failed between domains A2 and B2.
In this instance the static preference rule set, which was previously
defined as cluster 1 detaches, will be invoked since neither VPLEX
cluster can communicate with the other (but the VPLEX Witness
Server can communicate with both VPLEX clusters). Therefore, access
to the given distributed volume within domain B2 will be suspended
whilst it remains active at A2.
Virtual machines that were running at A1 will be uninterrupted, and
virtual machines that were running at B1 will be restarted at A1.
Note: Similar to the previous note, and though it is beyond the scope of this
TechBook, vSphere HA versions prior to 5.0 update 1 may not detect this
condition and may not restart the VM if it was running at the non-preferred
site. Therefore, to avoid any disruption when using vSphere in this type of
configuration (for versions prior to 5.0 update 1), VMware DRS host affinity
rules can be used (where supported) to ensure that virtual machines are
always running in their preferred location (i.e., the location that the storage
they rely on is biased towards). Another way to avoid this scenario is to
disable DRS altogether and use vSphere HA only, or to use a cross-connect
configuration deployed across a separate physical network as discussed in
the previous section. See Appendix A, “vSphere 5.0 Update 1 Additional
Settings,” for the vSphere HA configuration settings that are applicable for
ESX implementations post version 5.0 update 1 to avoid this problem.
The remaining failure scenarios with this solution are identical to the
previously discussed VPLEX Metro HA campus solutions. For failure
handling in domains A1, B1, A3, B3, or C, see “VPLEX Metro HA
Campus (with cross-connect)” on page 101.
Conclusion
Conclusion
As outlined in this book, using VPLEX AccessAnywhere™
technology in combination with high availability and VPLEX
Witness, storage administrators and data center managers will be
able to provide absolute physical and logical high availability for
their organizations’ mission-critical applications with less resource
overhead and dependency on manual intervention. Increasingly,
those mission-critical applications are virtualized, in most cases
using VMware vSphere or Microsoft Hyper-V “virtual machine”
technologies. It is expected that VPLEX customers will use the HA /
VPLEX Witness solution to incorporate several application-specific
clustering and virtualization technologies to provide HA benefits for
targeted mission-critical applications.
As described, the storage administrator is provided with two specific
VPLEX Metro-based high availability solutions, as outlined
specifically for VMware ESX 4.1 or higher: VPLEX Metro HA Campus
(with cross-cluster connect) and standard (non-campus) VPLEX
Metro HA. VPLEX Metro HA Campus provides a higher level of HA
than the VPLEX Metro HA deployment without cross-cluster
connectivity. However, it is limited to in-data-center use or cases
where the network latency between data centers is negligible.
Both solutions are ideal for customers who are already highly
virtualized, or are planning to become so, and who are looking for the
following:
◆ Elimination of the “night shift” storage and server administrator
positions. To accomplish this, they must be comfortable that their
applications will ride through any failures that happen during
the night.
◆ Reduction of capital expenditures by moving from an
active/passive data center replication model to a fully active
highly available data center model.
◆ Increase application availability by protecting against flood and
fire disasters that could affect their entire data center.
A
AccessAnywhere The breakthrough technology that enables VPLEX clusters to provide
access to information between clusters that are separated by distance.
active/active A cluster with no primary or standby servers, because all servers can
run applications and interchangeably act as backup for one another.
array A collection of disk drives where user data and parity data may be
stored. Devices can consist of some or all of the drives within an
array.
asynchronous Describes objects or events that are not coordinated in time. A process
operates independently of other processes, being initiated and left for
another task before being acknowledged.
For example, a host writes data to the blades and then begins other
work while the data is transferred to a local disk and across the WAN
asynchronously. See also ”synchronous.”
B
bandwidth The range of transmission frequencies a network can accommodate,
expressed as the difference between the highest and lowest
frequencies of a transmission cycle. High bandwidth allows fast or
high-volume transmissions.
bias When a cluster has the bias for a given DR1, it will remain online if
connectivity is lost to the remote cluster (in some cases this may get
overruled by VPLEX Cluster Witness). This is now known as
preference.
block The smallest amount of data that can be transferred following SCSI
standards, which is traditionally 512 bytes. Virtual volumes are
presented to users as contiguous lists of blocks.
C
cache Temporary storage for recent writes and recently accessed data. Disk
data is read through the cache so that subsequent read references are
found in the cache.
cache coherency Managing the cache so data is not lost, corrupted, or overwritten.
With multiple processors, data blocks may have several copies, one in
the main memory and one in each of the cache memories. Cache
coherency propagates the blocks of multiple users throughout the
system in a timely fashion, ensuring the data blocks do not have
inconsistent versions in the different processors’ caches.
controller A device that controls the transfer of data to and from a computer and
a peripheral device.
D
data sharing The ability to share access to the same data with multiple servers
regardless of time and location.
detach rule A rule set applied to a DR1 to declare a winning and a losing cluster
in the event of a failure.
director A CPU module that runs GeoSynchrony, the core VPLEX software.
There are two directors in each engine, and each has dedicated
resources and is capable of functioning independently.
dirty data The write-specific data stored in the cache memory that has yet to be
written to disk.
disaster recovery (DR) The ability to restart system operations after an error, preventing data
loss.
disk cache A section of RAM that provides a cache between the disk and the CPU.
RAM’s access time is significantly faster than disk access time;
therefore, a disk-caching program enables the computer to operate
faster by placing recently accessed data in the disk cache.
distributed file system (DFS) Supports the sharing of files and resources in the form of persistent
storage over a network.
Distributed RAID 1 device (DR1) A cache-coherent VPLEX Metro or Geo volume that is distributed
between two VPLEX clusters.
E
engine Enclosure that contains two directors, management modules, and
redundant power.
Ethernet A Local Area Network (LAN) protocol. Ethernet uses a bus topology,
meaning all devices are connected to a central cable, and supports
data transfer rates of between 10 megabits per second and 10 gigabits
per second. For example, 100 Base-T supports data transfer rates of
100 Mb/s.
event A log message that results from a significant action initiated by a user
or the system.
F
failover Automatically switching to a redundant or standby device, system,
or data path upon the failure or abnormal termination of the
currently active device, system, or data path.
Fibre Channel (FC) A protocol for transmitting data between computer devices. Longer
distance requires the use of optical fiber; however, FC also works
using coaxial cable and ordinary telephone twisted pair media. Fibre
channel offers point-to-point, switched, and loop interfaces. Used
within a SAN to carry SCSI traffic.
field replaceable unit (FRU) A unit or component of a system that can be replaced on site, as
opposed to returning the system to the manufacturer for repair.
firmware Software that is loaded on and runs from the flash ROM on the
VPLEX directors.
G
Geographically distributed system A system physically distributed across two or more geographically
separated sites. The degree of distribution can vary widely, from
different locations on a campus or in a city to different continents.
gigabit Ethernet The version of Ethernet that supports data transfer rates of 1 Gigabit
per second.
I
input/output (I/O) Any operation, program, or device that transfers data to or from a
computer.
internet Fibre Channel protocol (iFCP) Connects Fibre Channel storage devices to SANs or the Internet in
geographically distributed systems using TCP.
intranet A network operating like the World Wide Web but with access
restricted to a limited group of authorized users.
K
kilobit (Kb) 1,024 (2^10) bits. Often rounded to 10^3.
L
latency Amount of time it requires to fulfill an I/O request.
local area network (LAN) A group of computers and associated devices that share a common
communications line and typically share the resources of a single
processor or server within a small geographic area.
logical unit number (LUN) Used to identify SCSI devices, such as external hard drives,
connected to a computer. Each device is assigned a LUN number
which serves as the device’s unique address.
M
megabit (Mb) 1,048,576 (2^20) bits. Often rounded to 10^6.
metadata Data about data, such as data quality, content, and condition.
metavolume A storage volume used by the system that contains the metadata for
all the virtual volumes managed by the system. There is one metadata
storage volume per cluster.
mirroring The writing of data to two or more disks simultaneously. If one of the
disk drives fails, the system can instantly switch to one of the other
disks without losing data or service. RAID 1 provides mirroring.
miss An operation where the cache is searched but does not contain the
data, so the data instead must be accessed from disk.
N
namespace A set of names recognized by a file system in which all names are
unique.
network partition When one site loses contact or communication with another site.
P
parity The even or odd number of 0s and 1s in binary code.
parity checking Checking for errors in binary data. Depending on whether the byte
has an even or odd number of bits, an extra 0 or 1 bit, called a parity
bit, is added to each byte in a transmission. The sender and receiver
agree on odd parity, even parity, or no parity. If they agree on even
parity, a parity bit is added that makes each byte even. If they agree
on odd parity, a parity bit is added that makes each byte odd. If the
data is transmitted incorrectly, the change in parity will reveal the
error.
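As a small illustration of the even-parity case described above (a standard computation, shown here in Python for clarity):

def even_parity_bit(byte):
    """Return the parity bit that makes the total number of 1 bits even."""
    ones = bin(byte).count("1")
    return ones % 2          # 1 is appended when the byte has an odd number of 1 bits

print(even_parity_bit(0b10110010))   # 4 one-bits -> parity bit 0
print(even_parity_bit(0b10110011))   # 5 one-bits -> parity bit 1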
preference When a cluster has the preference for a given DR1, it will remain online if
connectivity is lost to the remote cluster (in some cases this may get
overruled by VPLEX Cluster Witness). This was previously known as bias.
R
RAID The use of two or more storage volumes to provide better
performance, error recovery, and fault tolerance.
RAID 1 Also called mirroring, this has been used longer than any other form
of RAID. It remains popular because of simplicity and a high level of
data availability. A mirrored array consists of two or more disks. Each
disk in a mirrored array holds an identical image of the user data.
RAID 1 has no striping. Read performance is improved since either
disk can be read at the same time. Write performance is lower than
single disk storage. Writes must be performed on all disks, or mirrors,
in the RAID 1. RAID 1 provides very good data reliability for
read-intensive applications.
RAID leg A copy of data, called a mirror, that is located at a user's current
location.
remote direct memory access (RDMA) Allows computers within a network to exchange data using their
main memories and without using the processor, cache, or operating
system of either computer.
Recovery Point Objective (RPO) The amount of data that can be lost before a given failure event.
Recovery Time Objective (RTO) The amount of time the service takes to fully recover after a failure
event.
S
scalability Ability to easily change a system in size or configuration to suit
changing conditions, to grow with your needs.
small computer system interface (SCSI) A set of evolving ANSI standard electronic interfaces that allow
personal computers to communicate faster and more flexibly than
previous interfaces with peripheral hardware such as disk drives,
tape drives, CD-ROM drives, printers, and scanners.
split brain Condition when a partitioned DR1 accepts writes from both clusters.
This is also known as a conflicting detach.
storage RTO The amount of time taken for the storage to be available after a failure
event (in all cases this will be a smaller time interval than the RTO,
since the storage is a prerequisite).
stripe depth The number of blocks of data stored contiguously on each storage
volume in a RAID 0 device.
striping A technique for spreading data over multiple disk drives. Disk
striping can speed up operations that retrieve data from disk storage.
Data is divided into units and distributed across the available disks.
RAID 0 provides disk striping.
T
throughput 1. The number of bits, characters, or blocks passing through a data
communication system or portion of that system.
2. The maximum capacity of a communications channel or system.
3. A measure of the amount of work performed by a system over a
period of time. For example, the number of I/Os per day.
tool command language (TCL) A scripting language often used for rapid prototypes and scripted
applications.
transmission control protocol/Internet protocol (TCP/IP) The basic communication language or protocol used for traffic on a
private network and the Internet.
U
uninterruptible power supply (UPS) A power supply that includes a battery to maintain power in the
event of a power failure.
universal unique identifier (UUID) A 64-bit number used to uniquely identify each VPLEX director. This
number is based on the hardware serial number assigned to each
director.
V
virtualization A layer of abstraction implemented in software that servers use to
divide available physical storage into storage volumes or virtual
volumes.
virtual volume A virtual volume looks like a contiguous volume, but can be
distributed over two or more storage volumes. Virtual volumes are
presented to hosts.
VPLEX Cluster Witness A new feature in VPLEX V5.x that can augment and improve upon
the failure handling semantics of static preference.
W
wide area network (WAN) A geographically dispersed telecommunications network. This term
distinguishes a broader telecommunication structure from a local
area network (LAN).
world wide name (WWN) A specific Fibre Channel Name Identifier that is unique worldwide
and represented by a 64-bit unsigned binary value.