
EMC VPLEX Metro Witness

Technology and High Availability


Version 2.1
EMC VPLEX Witness
VPLEX Metro High Availability
Metro HA Deployment Scenarios
Jennifer Aspesi
Oliver Shorey
Copyright 2010 - 2012 EMC Corporation. All rights reserved.
EMC believes the information in this publication is accurate as of its publication date. The information is
subject to change without notice.
THE INFORMATION IN THIS PUBLICATION IS PROVIDED AS IS. EMC CORPORATION MAKES NO
REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO THE INFORMATION IN THIS
PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED WARRANTIES OF MERCHANTABILITY OR
FITNESS FOR A PARTICULAR PURPOSE.
Use, copying, and distribution of any EMC software described in this publication requires an applicable
software license.
For the most up-to-date regulatory document for your product line, go to the Technical Documentation and
Advisories section on EMC Powerlink.
For the most up-to-date listing of EMC product names, see EMC Corporation Trademarks on EMC.com.
All other trademarks used herein are the property of their respective owners.
Part number H7113.2
Contents
Preface
Chapter 1 VPLEX Family and Use Case Overview
Introduction....................................................................................... 18
VPLEX value overview.................................................................... 19
VPLEX product offerings ................................................................ 23
VPLEX Local, VPLEX Metro, and VPLEX Geo ...................... 23
Architecture highlights .............................................................. 25
Metro high availability design considerations ............................. 28
Planned application mobility compared with disaster
restart ........................................................................................... 29
Chapter 2 Hardware and Software
Introduction....................................................................................... 32
VPLEX I/O.................................................................................. 32
High-level VPLEX I/O flow...................................................... 32
Distributed coherent cache........................................................ 33
VPLEX family clustering architecture .................................... 33
VPLEX single, dual, and quad engines ................................... 35
VPLEX sizing tool ....................................................................... 35
Upgrade paths............................................................................. 36
Hardware upgrades ................................................................... 36
Software upgrades...................................................................... 36
VPLEX management interfaces ...................................................... 37
Web-based GUI ........................................................................... 37
VPLEX CLI................................................................................... 37
SNMP support for performance statistics............................... 38
LDAP /AD support ................................................................... 38
VPLEX Element Manager API.................................................. 38
Simplified storage management..................................................... 39
Management server user accounts................................................. 40
Management server software.......................................................... 41
Management console ................................................................. 41
Command line interface ............................................................ 43
System reporting......................................................................... 44
Director software .............................................................................. 45
Configuration overview................................................................... 46
Single engine configurations..................................................... 46
Dual configurations.................................................................... 47
Quad configurations .................................................................. 48
I/O implementation......................................................................... 50
Cache coherence ......................................................................... 50
Meta-directory ............................................................................ 50
How a read is handled............................................................... 50
How a write is handled ............................................................. 52
Chapter 3 System and Component Integrity
Overview............................................................................................ 54
Cluster ................................................................................................ 55
Path redundancy through different ports ..................................... 56
Path redundancy through different directors............................... 57
Path redundancy through different engines................................. 58
Path redundancy through site distribution .................................. 59
Serviceability ..................................................................................... 60
Chapter 4 Foundations of VPLEX High Availability
Foundations of VPLEX High Availability .................................... 62
Failure handling without VPLEX Witness (static preference).... 70
Chapter 5 Introduction to VPLEX Witness
VPLEX Witness overview and architecture .................................. 82
VPLEX Witness target solution, rules, and best practices .......... 85
VPLEX Witness failure semantics................................................... 87
CLI example outputs........................................................................ 93
VPLEX Witness: The importance of the third failure domain ......................... 97
Chapter 6 VPLEX Metro HA
VPLEX Metro HA overview.......................................................... 100
VPLEX Metro HA Campus (with cross-connect) ...................... 101
VPLEX Metro HA (without cross-cluster connection)............... 111
Chapter 7 Conclusion
Conclusion........................................................................................ 120
Better protection from storage-related failures ....................121
Protection from a larger array of possible failures...............121
Greater overall resource utilization........................................122
Glossary
Figures
1 Application and data mobility example ..................................................... 20
2 HA infrastructure example........................................................................... 21
3 Distributed data collaboration example ..................................................... 22
4 VPLEX offerings ............................................................................................. 24
5 Architecture highlights.................................................................................. 26
6 VPLEX cluster example................................................................................. 34
7 VPLEX Management Console ...................................................................... 42
8 Management Console welcome screen....................................................... 43
9 VPLEX single engine configuration............................................................. 47
10 VPLEX dual engine configuration............................................................... 48
11 VPLEX quad engine configuration.............................................................. 49
12 Port redundancy............................................................................................. 56
13 Director redundancy...................................................................................... 57
14 Engine redundancy........................................................................................ 58
15 Site redundancy.............................................................................................. 59
16 High level functional sites in communication ........................................... 62
17 High level Site A failure ................................................................................ 63
18 High level Inter-site link failure................................................................... 63
19 VPLEX active and functional between two sites ....................................... 64
20 VPLEX concept diagram with failure at Site A.......................................... 65
21 Correct resolution after volume failure at Site A....................................... 66
22 VPLEX active and functional between two sites ....................................... 67
23 Inter-site link failure and cluster partition ................................................. 68
24 Correct handling of cluster partition........................................................... 69
25 VPLEX static detach rule............................................................................... 71
26 Typical detach rule setup.............................................................................. 72
27 Non-preferred site failure ............................................................................. 73
28 Volume remains active at Cluster 1............................................................. 74
29 Typical detach rule setup before link failure ............................................. 75
30 Inter-site link failure and cluster partition ................................................. 76
31 Suspension after inter-site link failure and cluster partition................... 77
32 Cluster 2 is preferred..................................................................................... 78
33 Preferred site failure causes full Data Unavailability............................... 79
34 High Level VPLEX Witness architecture.................................................... 83
35 High Level VPLEX Witness deployment .................................................. 84
36 Supported VPLEX versions for VPLEX Witness ....................................... 86
37 VPLEX Witness volume types and rule support....................................... 86
38 Typical VPLEX Witness configuration ....................................................... 87
39 VPLEX Witness and an inter-cluster link failure....................................... 88
40 VPLEX Witness and static preference after cluster partition................... 89
41 VPLEX Witness typical configuration for cluster 2 detaches .................. 90
42 VPLEX Witness diagram showing cluster 2 failure.................................. 91
43 VPLEX Witness with static preference override........................................ 92
44 Possible dual failure cluster isolation scenarios ........................................ 95
45 Highly unlikely dual failure scenarios that require manual
intervention ..................................................................................................... 96
46 Two further dual failure scenarios that would require manual
intervention ..................................................................................................... 97
47 High-level diagram of a Metro HA campus solution for VMware ...... 101
48 Metro HA campus diagram with failure domains.................................. 104
49 Metro HA campus diagram with disaster in zone A1............................ 105
50 Metro HA campus diagram with failure in zone A2.............................. 106
51 Metro HA campus diagram with failure in zone A3 or B3.................... 107
52 Metro HA campus diagram with failure in zone C1 .............................. 108
53 Metro HA campus diagram with intersite link failure........................... 109
54 Metro HA Standard High-level diagram................................................. 111
55 Metro HA high-level diagram with fault domains................................. 113
56 Metro HA high-level diagram with failure in domain A2..................... 114
57 Metro HA high-level diagram with intersite failure.............................. 116
Tables
1 Overview of VPLEX features and benefits .................................................. 26
2 Configurations at a glance ............................................................................. 35
3 Management server user accounts ............................................................... 40
Preface
This EMC Engineering TechBook describes how implementing VPLEX leads to a higher level of availability.
As part of an effort to improve and enhance the performance and capabilities
of its product lines, EMC periodically releases revisions of its hardware and
software. Therefore, some functions described in this document may not be
supported by all versions of the software or hardware currently in use. For
the most up-to-date information on product features, refer to your product
release notes. If a product does not function properly or does not function as
described in this document, please contact your EMC representative.
Audience   This document is part of the EMC VPLEX family documentation set, and is intended for use by storage and system administrators.
Readers of this document are expected to be familiar with the
following topics:
Storage area networks
Storage virtualization technologies
EMC Symmetrix, VNX series, and CLARiiON products
Related documentation   Refer to the EMC Powerlink website at http://powerlink.emc.com,
where the majority of the following documentation can be found
under Support > Technical Documentation and Advisories >
Hardware Platforms > VPLEX Family.
EMC VPLEX Architecture Guide
EMC VPLEX Installation and Setup Guide
EMC VPLEX Site Preparation Guide
Implementation and Planning Best Practices for EMC VPLEX
Technical Notes
Using VMware Virtualization Platforms with EMC VPLEX - Best
Practices Planning
VMware KB: Using VPLEX Metro with VMware HA
Implementing EMC VPLEX Metro with Microsoft Hyper-V, Exchange
Server 2010 with Enhanced Failover Clustering Support
White Paper: Using VMware vSphere with EMC VPLEX Best
Practices Planning
Oracle Extended RAC with EMC VPLEX Metro Best Practices Planning
White Paper: EMC VPLEX with IBM AIX Virtualization and
Clustering
White Paper: Conditions for Stretched Hosts Cluster Support on EMC
VPLEX Metro
White Paper: Implementing EMC VPLEX and Microsoft Hyper-V and
SQL Server with Enhanced Failover Clustering Support Applied
Technology
Organization of this TechBook
This document is divided into the following chapters:
Chapter 1, VPLEX Family and Use Case Overview,
summarizes the VPLEX family. It also covers some of the key
features of the VPLEX family system, architecture and use cases.
Chapter 2, Hardware and Software, summarizes hardware,
software, and network components of the VPLEX system. It also
highlights the software interfaces that can be used by an
administrator to manage all aspects of a VPLEX system.
Chapter 3, System and Component Integrity, summarizes how
VPLEX clusters are able to handle hardware failures in any
subsystem within the storage cluster.
Chapter 4, Foundations of VPLEX High Availability, summarizes the industry-wide dilemma of building absolute HA environments and how VPLEX Metro functionality addresses this historical challenge.
Chapter 5, Introduction to VPLEX Witness, explains VPLEX Witness architecture and operation.
Chapter 6, VPLEX Metro HA, explains how VPLEX functionality can provide absolute HA capability by introducing a Witness to the inter-cluster environment.
Chapter 7, Conclusion, provides a summary of benefits using
VPLEX technology as related to VPLEX Witness and High
Availability.
Appendix A, vSphere 5.0 Update 1 Additional Settings,
provides additional settings needed when using vSphere 5.0
update 1.
Authors This TechBook was authored by the following individuals from the
Enterprise Storage Division, VPLEX Business Unit based at EMC
headquarters, Hopkinton, Massachusetts.
Jennifer Aspesi has over 10 years of work experience with EMC in
Storage Area Networks (SAN), Wide Area Networks (WAN), and
Network and Storage Security technologies. Jen currently manages
the Corporate Systems Engineer team for the VPLEX Business Unit.
She earned her M.S. in Marketing and Technological Innovation from
Worcester Polytechnic Institute, Massachusetts.
Oliver Shorey has over 11 years of experience working within the
Business Continuity arena, seven of which have been with EMC
engineering, designing and documenting high-end replication and
geographically-dispersed clustering technologies. He is currently a
Principal Corporate Systems Engineer in the VPLEX Business Unit.
Additional
contributors
Additional contributors to this book include:
Colin Durocher has 8 years of experience developing software for the EMC VPLEX product and its predecessor, testing it, and helping customers implement it. He is currently working on the product management team for the VPLEX business unit. He has a
B.S. in Computer Engineering from the University of Alberta and is
currently pursuing an MBA from the John Molson School of Business.
Gene Ortenberg has more than 15 years of experience in building
fault-tolerant distributed systems and applications. For the past 8
years he has been designing and developing highly-available storage
virtualization solutions at EMC. He currently holds the position of Software Architect for the VPLEX Business Unit under the EMC
Enterprise Storage Division.
Fernanda Torres has over 10 years of Marketing experience in the
Consumer Products industry, most recently in consumer electronics.
Fernanda is the Product Marketing Manager for VPLEX under the
EMC Enterprise Storage Division. She has an undergraduate degree from the University of Notre Dame and a bilingual degree (English/Spanish) from IESE in Barcelona, Spain.
Typographical conventions
EMC uses the following type style conventions in this document:
Normal Used in running (nonprocedural) text for:
Names of interface elements (such as names of windows, dialog
boxes, buttons, fields, and menus)
Names of resources, attributes, pools, Boolean expressions,
buttons, DQL statements, keywords, clauses, environment
variables, functions, utilities
URLs, pathnames, filenames, directory names, computer
names, filenames, links, groups, service keys, file systems,
notifications
Bold Used in running (nonprocedural) text for:
Names of commands, daemons, options, programs, processes,
services, applications, utilities, kernels, notifications, system
calls, man pages
Used in procedures for:
Names of interface elements (such as names of windows, dialog
boxes, buttons, fields, and menus)
What user specifically selects, clicks, presses, or types
Italic Used in all text (including procedures) for:
Full titles of publications referenced in text
Emphasis (for example a new term)
Variables
Courier Used for:
System output, such as an error message or script
URLs, complete paths, filenames, prompts, and syntax when
shown outside of running text
Courier bold Used for:
Specific user input (such as commands)
Courier italic Used in procedures for:
Variables on command line
User input variables
< > Angle brackets enclose parameter or variable values supplied by
the user
[ ] Square brackets enclose optional values
| Vertical bar indicates alternate selections - the bar means or
{ } Braces indicate content that you must specify (that is, x or y or z)
... Ellipses indicate nonessential information omitted from the example

We'd like to hear from you!
Your feedback on our TechBooks is important to us! We want our books to be as helpful and relevant as possible, so please feel free to send us your comments, opinions and thoughts on this or any other TechBook:
TechBooks@emc.com
Chapter 1
VPLEX Family and Use Case Overview
This chapter provides a brief summary of the main use cases for the
EMC VPLEX family and design considerations for high availability. It
also covers some of the key features of the VPLEX family system.
Topics include:
Introduction ........................................................................................ 18
VPLEX value overview..................................................................... 19
VPLEX product offerings ................................................................. 23
Metro high availability design considerations .............................. 28
Introduction
The purpose of this TechBook is to introduce EMC VPLEX high availability and the VPLEX Witness as it is conceptually architected, typically by customer storage administrators and EMC Solutions Architects. The introduction of VPLEX Witness provides customers with absolute physical and logical fabric and cache-coherent redundancy if it is properly designed in the VPLEX Metro environment.
This TechBook is designed to provide an overview of the features and
functionality associated with the VPLEX Metro configuration and the
importance of active/active data resiliency for today's advanced host
applications.
VPLEX value overview
At the highest level, VPLEX has unique capabilities that storage administrators value and seek in order to enhance their existing data centers. It delivers distributed, dynamic, and smart functionality into existing or new data centers to provide storage virtualization across geographical boundaries.
VPLEX is distributed, because it is a single interface for
multi-vendor storage and it delivers dynamic data mobility,
enabling the ability to move applications and data in real-time,
with no outage required.
VPLEX is dynamic, because it provides data availability and
flexibility as well as maintaining business through failures
traditionally requiring outages or manual restore procedures.
VPLEX is smart, because its unique AccessAnywhere technology
can present and keep the same data consistent within and
between sites and enable distributed data collaboration.
Because of these capabilities, VPLEX delivers unique and
differentiated value to address three distinct requirements within our
target customers' IT environments:
The ability to dynamically move applications and data across
different compute and storage installations, be they within the
same data center, across a campus, within a geographical region
and now, with VPLEX Geo, across even greater distances.
The ability to create high-availability storage and a compute
infrastructure across these same varied geographies with
unmatched resiliency.
The ability to provide efficient real-time data collaboration over
distance for such big data applications as video, geographic/oceanographic research, and more.
EMC VPLEX technology is a scalable, distributed-storage federation
solution that provides non-disruptive, heterogeneous
data-movement and volume-management functionality.
Insert VPLEX technology between hosts and storage in a storage area
network (SAN) and data can be extended over distance within,
between, and across data centers.
The VPLEX architecture provides a highly available solution suitable
for many deployment strategies including:
Application and Data Mobility: The movement of virtual machines (VM) without downtime. An example is shown in Figure 1.
Figure 1 Application and data mobility example
Storage administrators have the ability to automatically balance
loads through VPLEX, using storage and compute resources from
either cluster's location. When combined with server
virtualization, VPLEX allows users to transparently move and
relocate Virtual Machines and their corresponding applications
and data over distance. This provides a unique capability allowing
users to relocate, share and balance infrastructure resources
between sites, which can be within a campus or between data
centers, up to 5ms apart with VPLEX Metro, or further apart
(50ms RTT) across asynchronous distances with VPLEX Geo.
Note: Please submit an RPQ if VPLEX Metro is required up to 10ms or check
the support matrix for the latest supported latencies.
HA Infrastructure: Reduces recovery time objective (RTO).
An example is shown in Figure 2.
Figure 2 HA infrastructure example
High availability is a term that several products will claim they
can deliver. Ultimately, a high availability solution is supposed to
protect against a failure and keep an application online. Storage
administrators plan around HA to provide near continuous
uptime for their critical applications, and automate the restart of
an application once a failure has occurred, with as little human
intervention as possible.
With conventional solutions, customers typically have to choose a
Recovery Point Objective and a Recovery Time Objective. But
even while some solutions offer small RTOs and RPOs, there can
still be downtime and, for most customers, any downtime can be
costly.
Distributed Data Collaboration: Increases utilization of passive disaster recovery (DR) assets and provides simultaneous access to data. An example is shown in Figure 3.
Figure 3 Distributed data collaboration example
This is when a workforce has multiple users at different sites
that need to work on the same data, and maintain consistency
in the dataset when changes are made. Use cases include
co-development of software where the development happens
across different teams from separate locations, and
collaborative workflows such as engineering, graphic arts,
videos, educational programs, designs, research reports, and
so forth.
When customers have tried to build collaboration across
distance with the traditional solutions, they normally have to
save the entire file at one location and then send it to another
site using FTP. This is slow, can incur heavy bandwidth costs
for large files, or even small files that move regularly, and
negatively impacts productivity because the other sites can sit
idle while they wait to receive the latest data from another site.
If teams decide to do their own work independent of each
other, then the dataset quickly becomes inconsistent, as
multiple people are working on it at the same time and are
unaware of each other's most recent changes. Bringing all of the
the changes together in the end is time-consuming, costly, and
grows more complicated as the data-set gets larger.
VPLEX product offerings
VPLEX first meets high-availability and data mobility requirements
and then scales up to the I/O throughput required for the front-end
applications and back-end storage.
High-availability and data mobility features are characteristics of
VPLEX Local, VPLEX Metro, and VPLEX Geo.
A VPLEX cluster consists of one, two, or four engines (each
containing two directors), and a management server. A dual-engine
or quad-engine cluster also contains a pair of Fibre Channel switches
for communication between directors.
Each engine is protected by a standby power supply (SPS), and each
Fibre Channel switch gets its power through an uninterruptible
power supply (UPS). (In a dual-engine or quad-engine cluster, the
management server also gets power from a UPS.)
The management server has a public Ethernet port, which provides
cluster management services when connected to the customer
network.
This section provides information on the following:
VPLEX Local, VPLEX Metro, and VPLEX Geo on page 23
Architecture highlights on page 25
VPLEX Local, VPLEX Metro, and VPLEX Geo
EMC offers VPLEX in three configurations to address customer needs
for high-availability and data mobility:
VPLEX Local
VPLEX Metro
VPLEX Geo
Figure 4 provides an example of each.
Figure 4 VPLEX offerings
VPLEX Local
VPLEX Local provides seamless, non-disruptive data mobility and
ability to manage multiple heterogeneous arrays from a single
interface within a data center.
VPLEX Local allows increased availability, simplified management,
and improved utilization across multiple arrays.
VPLEX Metro with AccessAnywhere
VPLEX Metro with AccessAnywhere enables active-active, block
level access to data between two sites within synchronous distances.
The distance is limited by what synchronous behavior can withstand, as well as by host application stability and MAN traffic considerations. It is recommended that, depending on the application, Metro latency be less than or equal to 5 ms RTT.1
The combination of virtual storage with VPLEX Metro and virtual
servers enables the transparent movement of virtual machines and
storage across a distance. This technology provides improved
utilization across heterogeneous arrays and multiple sites.
1. Refer to VPLEX and vendor-specific White Papers for confirmation of
latency limitations.
VPLEX Geo with AccessAnywhere
VPLEX Geo with AccessAnywhere enables active-active, block level
access to data between two sites within asynchronous distances.
VPLEX Geo enables more cost-effective use of resources and power. Geo provides the same distributed device flexibility as Metro but extends the distance up to 50 ms RTT. As with any asynchronous transport medium, bandwidth and application sharing on the link are also important considerations for optimal behavior.
Note: For the purpose of this TechBook, the focus is on the Metro configuration only. VPLEX Witness is supported with VPLEX Geo; however, Geo is beyond the scope of this TechBook.
Architecture highlights
VPLEX support is open and heterogeneous, supporting both EMC
storage and common arrays from other storage vendors, such as
HDS, HP, and IBM. VPLEX conforms to established worldwide
naming (WWN) guidelines that can be used for zoning.
VPLEX supports operating systems including both physical and
virtual server environments with VMware ESX and Microsoft
Hyper-V. VPLEX supports network fabrics from Brocade and Cisco,
including legacy McData SANs.
Note: For the latest information please refer to the ESSM (EMC
Simple Support Matrix) for supported host types as well as the
connectivity ESM for fabric and extended fabric support.
An example of the architecture is shown in Figure 5.
Figure 5 Architecture highlights
Table 1 lists an overview of VPLEX features along with the benefits.
Table 1 Overview of VPLEX features and benefits

Features                         Benefits
Mobility                         Move data and applications without impact on users.
Resiliency                       Mirror across arrays without host impact, and increase high availability for critical applications.
Distributed cache coherency      Automate sharing, balancing, and failover of I/O across the cluster and between clusters.
Advanced data caching            Improve I/O performance and reduce storage array contention.
Virtual Storage federation       Achieve transparent mobility and access in a data center and between data centers.
Scale-out cluster architecture   Start small and grow larger with predictable service levels.
For all VPLEX products, the appliance-based VPLEX technology:
Presents storage area network (SAN) volumes from back-end
arrays to VPLEX engines
Packages the SAN volumes into sets of VPLEX virtual volumes
with user-defined configuration and protection levels
Presents virtual volumes to production hosts in the SAN via the
VPLEX front-end
For VPLEX Metro and VPLEX Geo products, presents a global,
block-level directory for distributed cache and I/O between
VPLEX clusters.
Location and distance determine high-availability and data mobility
requirements. For example, if all storage arrays are in a single data
center, a VPLEX Local product federates back-end storage arrays
within the data center.
When back-end storage arrays span two data centers, the
AccessAnywhere feature in a VPLEX Metro or a VPLEX Geo product
federates storage in an active-active configuration between VPLEX
clusters. Choosing between VPLEX Metro or VPLEX Geo depends on
distance and data synchronicity requirements.
Application and back-end storage I/O throughput determine the
number of engines in each VPLEX cluster. High-availability features
within the VPLEX cluster allow for non-disruptive software upgrades
and expansion as I/O throughput increases.
Metro high availability design considerations
VPLEX Metro 5.0 (and above) introduces high availability concepts beyond what is traditionally known as physical high availability. Introducing the VPLEX Witness into a high availability environment allows the VPLEX solution to increase the overall availability of the environment by arbitrating between a pure communication failure between the two primary sites and a true site failure in a multi-site architecture. EMC VPLEX is the first product to bring to market the features and functionality provided by VPLEX Witness, which guards against such failures and arbitrates activity between clusters in a multi-site architecture.
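To make the arbitration idea concrete, the following sketch shows, in simplified form, how a cluster that loses contact with its peer can use guidance from a witness in a third failure domain to decide whether to keep serving I/O. This is an illustrative model only, written under stated assumptions; it is not the VPLEX Witness implementation or its interfaces, and the function and value names are invented for the example.

from enum import Enum
from typing import Optional


class Action(Enum):
    CONTINUE_IO = "continue serving I/O"
    SUSPEND_IO = "suspend I/O"


def arbitrate(can_reach_peer: bool, witness_guidance: Optional[str]) -> Action:
    """Decide what one cluster of a two-cluster Metro should do (illustration only).

    witness_guidance is what the witness (deployed in a third failure domain)
    tells this cluster when the peer is unreachable: "proceed", "suspend", or
    None if the witness itself cannot be reached.
    """
    if can_reach_peer:
        # Clusters and the inter-site link are healthy: business as usual.
        return Action.CONTINUE_IO
    if witness_guidance == "proceed":
        # True peer-site failure, or a pure inter-site link failure in which
        # this cluster was chosen to continue serving I/O.
        return Action.CONTINUE_IO
    # Either the witness told this cluster to stand down (the peer continues),
    # or this cluster is isolated from both peer and witness; suspend to avoid
    # a split-brain.
    return Action.SUSPEND_IO


if __name__ == "__main__":
    print(arbitrate(can_reach_peer=False, witness_guidance="proceed"))   # Action.CONTINUE_IO
    print(arbitrate(can_reach_peer=False, witness_guidance=None))        # Action.SUSPEND_IO

The point the sketch captures is that a cluster which can reach neither its peer nor the witness must suspend, because it cannot tell its own isolation apart from a remote site failure.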
Through this TechBook, administrators and customers gain an
understanding of the high availability solution that VPLEX provides
them:
Enabling of load balancing between their data centers
Active/active use of both of their data centers
Increased availability for their applications (no single points of
storage failure, auto-restart)
Fully automatic failure handling
Better resource utilization
Lower CapEx and lower OpEx as a result
Broadly speaking, when one considers legacy environments, one typically sees highly available designs implemented within a data center, and disaster recovery type functionality deployed between data centers.
One of the main reasons for this is that within data centers components generally operate in an active/active manner (or active/passive with automatic failover), whereas between data centers legacy replication technologies use active/passive techniques that require manual failover to use the passive component.
When using VPLEX Metro active/active replication technology in conjunction with new features such as the VPLEX Witness server (as described in Introduction to VPLEX Witness on page 81), the lines between local high availability and long-distance disaster recovery are somewhat blurred, since HA can be stretched beyond the data center walls. Since replication is a by-product of federated and distributed storage, disaster avoidance is also achievable within these geographically dispersed HA environments.
Planned application mobility compared with disaster restart
This section compares planned application mobility and disaster
restart.
Planned application mobility   An online planned application mobility event is defined as when an application or virtual machine can be moved fully online, without disruption, from one location to another in either the same or a remote data center. This type of movement can only be performed when all components that participate in the movement are available (e.g., the running state of the application or VM exists in volatile memory, which would not be the case if an active site has failed) and if all participating hosts have read/write access at both locations to the same block storage. Additionally, a mechanism is required to transition volatile memory data from one system/host to another. When performing planned online mobility jobs over distance, a prerequisite is the use of an active/active underlying storage replication solution (VPLEX Metro only at the time of this publication).
An example of this online application mobility is VMware vMotion, where a virtual machine needs to be fully operational before it can be moved. It may sound obvious, but if the VM were offline then the movement could not be performed online (this is important to understand and is the key difference compared with application restart).
When vMotion is executed, all live components that are required to make the VM function are copied elsewhere in the background before the VM is cut over.
Since these types of mobility tasks are totally seamless to the user, one of the associated use cases is disaster avoidance, where an application or VM can be moved ahead of a disaster (such as a hurricane or tsunami) while its running state is still available to be copied. In other cases, this capability can be used to load balance across multiple systems or even data centers.
Because the running state must be available for these types of relocations, these movements are always deemed planned activities.
Disaster restart   Disaster restart is where an application or service is restarted in another location after a failure (be it on a different server or in a different data center) and will typically interrupt the service/application during the failover.
A good example of this technology would be a VMware HA Cluster
configured over two geographically dispersed sites using VPLEX
Metro where a cluster will be formed over a number of ESX servers
and either single or multiple virtual machines can run on any of the
ESX servers within the cluster.
If for some reason an active ESX server were to fail (perhaps due to site failure), the VM can be restarted on a remaining ESX server within the cluster at the remote site, because the datastore where it was running is configured on a VPLEX Metro distributed volume and therefore spans the two locations. This would be deemed an unplanned failover, which will incur a small outage of the application: the running state of the VM was lost when the ESX server failed, so the service will be unavailable until the VM has restarted elsewhere.
Although a planned application mobility event and an unplanned disaster restart result in the same outcome (i.e., a service relocating elsewhere), there is a big difference: the planned mobility job keeps the application online during the relocation, whereas the disaster restart leaves the application offline during the relocation while the restart is conducted.
Compared to active/active technologies, the use of legacy active/passive type solutions in these restart scenarios would typically require an extra step over and above the standard application failover, since a storage failover would also be required (i.e., changing the status of the write-disabled remote copy to read/write and reversing the replication direction). This is where VPLEX can assist greatly: since it is active/active, in most cases no manual intervention at the storage layer is required, which greatly reduces the complexity of a DR failover solution. If best practices for physical high availability and redundant hardware connectivity are followed, VPLEX Witness will truly provide customers with absolute availability.
Chapter 2
Hardware and Software
This chapter provides insight into the hardware and software
interfaces that can be used by an administrator to manage all aspects
of a VPLEX system. In addition, a brief overview of the internal
system software is included. Topics include:
Introduction ........................................................................................ 32
VPLEX management interfaces........................................................ 37
Simplified storage management ...................................................... 39
Management server user accounts .................................................. 40
Management server software........................................................... 41
Director software................................................................................ 45
Configuration overview.................................................................... 46
I/O implementation .......................................................................... 50
Introduction
This section provides basic information on the following:
VPLEX I/O on page 32
High-level VPLEX I/O flow on page 32
Distributed coherent cache on page 33
VPLEX family clustering architecture on page 33
VPLEX I/O
VPLEX is built on a lightweight protocol that maintains cache
coherency for storage I/O and the VPLEX cluster provides highly
available cache, processing power, front-end, and back-end Fibre
Channel interfaces.
EMC hardware powers the VPLEX cluster design so that all devices
are always available and I/O that enters the cluster from anywhere
can be serviced by any node within the cluster.
The AccessAnywhere feature in the VPLEX Metro and VPLEX Geo
products extends the cache coherency between data centers at a
distance.
High-level VPLEX I/O flow
VPLEX abstracts a block-level ownership model into a highly
organized hierarchical directory structure that is updated for every I/O
and shared across all engines. The directory uses a small amount of
metadata and tells all other engines in the cluster, in 4k block
transmissions, which block of data is owned by which engine and at
what time.
After a write completes and ownership is reflected in the directory,
VPLEX dynamically manages read requests for the completed write
in the most efficient way possible.
When a read request arrives, VPLEX checks the directory for an
owner. After VPLEX locates the owner, the read request goes directly
to that engine.
On reads from other engines, VPLEX checks the directory and tries to
pull the read I/O directly from the engine cache to avoid going to the
physical arrays to satisfy the read.
This model enables VPLEX to stretch the cluster as VPLEX distributes
the directory between clusters and sites. Due to the hierarchical nature of the VPLEX directory, VPLEX is efficient, with minimal overhead, and enables I/O communication over distance.
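As an illustration of the read flow just described, the toy model below tracks block ownership in a dictionary standing in for the distributed directory and serves reads from the owning engine's cache before falling back to the back-end array. It is a deliberately simplified sketch, not VPLEX code; the class, method, and engine names are invented for the example.

class DirectorySketch:
    """Toy model of directory-based read handling (illustration only)."""

    def __init__(self, backend):
        self.backend = backend      # block address -> data on the back-end array
        self.owner = {}             # block address -> engine that last wrote it
        self.cache = {}             # engine -> {block address: cached data}

    def write(self, engine, block, data):
        # Record the write and update the directory so every engine can learn
        # which engine owns the most recent copy of this block.
        self.backend[block] = data
        self.cache.setdefault(engine, {})[block] = data
        self.owner[block] = engine

    def read(self, engine, block):
        owner = self.owner.get(block)
        if owner is not None and block in self.cache.get(owner, {}):
            # Directory hit: pull the data from the owning engine's cache and
            # avoid a trip to the physical array.
            return self.cache[owner][block]
        # No owner recorded (or the page has aged out of cache): read the array.
        return self.backend[block]


if __name__ == "__main__":
    sketch = DirectorySketch(backend={0x10: b"old"})
    sketch.write(engine="engine-1-1", block=0x10, data=b"new")
    print(sketch.read(engine="engine-2-1", block=0x10))   # b'new', served from cache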
Distributed coherent cache
The VPLEX engine includes two directors that each have a total of 36
GB (version 5 hardware, also known as VS2) of local cache. Cache
pages are keyed by volume and go through a lifecycle from staging,
to visible, to draining.
The global cache is a combination of all director caches that spans all
clusters. The cache page holder information is maintained in a
memory data structure called a directory.
The directory is divided into chunks and distributed among the
VPLEX directors and locality controls where ownership is
maintained.
A meta-directory identifies which director owns which directory
chunks within the global directory.
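A minimal sketch of the chunked-directory idea follows: the directory is divided into chunks spread across directors, and a meta-directory records which director owns which chunk. The hashing scheme, chunk count, and director names below are assumptions made for illustration, not the actual VPLEX layout.

import hashlib

DIRECTORS = ["director-1-1-A", "director-1-1-B", "director-2-1-A", "director-2-1-B"]
NUM_CHUNKS = 16

# Meta-directory: which director owns which directory chunk (illustrative mapping).
META_DIRECTORY = {chunk: DIRECTORS[chunk % len(DIRECTORS)] for chunk in range(NUM_CHUNKS)}


def chunk_for(volume: str, page: int) -> int:
    """Map a (volume, cache page) key to a directory chunk via illustrative hashing."""
    key = f"{volume}:{page}".encode()
    return int.from_bytes(hashlib.sha1(key).digest()[:4], "big") % NUM_CHUNKS


def directory_owner(volume: str, page: int) -> str:
    """Consult the meta-directory to find which director holds the page's directory entry."""
    return META_DIRECTORY[chunk_for(volume, page)]


if __name__ == "__main__":
    print(directory_owner("dd_vol_0001", page=42))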
VPLEX family clustering architecture
The VPLEX family uses a unique clustering architecture to help
customers break the boundaries of the data center and allow servers
at multiple data centers to have read/write access to shared block
storage devices. A VPLEX cluster, as shown in Figure 6 on page 34,
can scale up through the addition of more engines, and scale out by
connecting clusters into an EMC VPLEX Metro (two VPLEX Metro
clusters connected within Metro distances).
Figure 6 VPLEX cluster example
VPLEX Metro transparently moves and shares workloads for a
variety of applications, VMs, databases and cluster file systems.
VPLEX Metro consolidates data centers, and optimizes resource
utilization across data centers. In addition, it provides non-disruptive
data mobility, heterogeneous storage management, and improved
application availability. VPLEX Metro supports up to two clusters,
which can be in the same data center, or at two different sites within
synchronous environments. VPLEX Geo, which clusters across greater distances, is the asynchronous counterpart to Metro; it is out of the scope of this document to analyze VPLEX Geo capabilities.
VPLEX single, dual, and quad engines
The VPLEX engine provides cache and processing power with
redundant directors that each include two I/O modules per director
and one optional WAN COM I/O module for use in VPLEX Metro
and VPLEX Geo configurations.
The rackable hardware components are shipped in NEMA standard
racks or provided, as an option, as a field rackable product. Table 2
provides a list of configurations.
VPLEX sizing tool
Use the EMC VPLEX sizing tool provided by EMC Global Services
Software Development to configure the right VPLEX cluster
configuration.
The sizing tool concentrates on I/O throughput requirement for
installed applications (mail exchange, OLTP, data warehouse, video
streaming, etc.) and back-end configuration such as virtual volumes,
size and quantity of storage volumes, and initiators.
Table 2 Configurations at a glance
Components Single engine Dual engine Quad engine
Directors 2 4 8
Redundant Engine SPSs Yes Yes Yes
FE Fibre Channel ports (VS1) 16 32 64
FE Fibre Channel ports (VS2) 8 16 32
BE Fibre Channel ports (VS1) 16 32 64
BE Fibre Channel ports (VS2) 8 16 32
Cache size (VS1 Hardware) 64 GB 128 GB 256 GB
Cache size (VS2 Hardware) 72 GB 144 GB 288 GB
Management Servers 1 1 1
Internal Fibre Channel switches (Local Comm) None 2 2
Uninterruptible Power Supplies (UPSs) None 2 2
Upgrade paths
VPLEX facilitates application and storage upgrades without a service
window through its flexibility to shift production workloads
throughout the VPLEX technology.
In addition, high-availability features of the VPLEX cluster allow for
non-disruptive VPLEX hardware and software upgrades.
This flexibility means that VPLEX is always servicing I/O and never
has to be completely shut down.
Hardware upgrades
Upgrades are supported for single-engine VPLEX systems to dual- or
quad-engine systems.
A single VPLEX Local system can be reconfigured to work as a
VPLEX Metro or VPLEX Geo by adding a new remote VPLEX cluster.
Additionally, an entire VPLEX VS1 cluster (hardware) can be fully upgraded to VS2 hardware non-disruptively.
Information for VPLEX hardware upgrades is in the Procedure
Generator that is available through EMC PowerLink.
Software upgrades
VPLEX features a robust non-disruptive upgrade (NDU) technology
to upgrade the software on VPLEX engines and VPLEX Witness
servers. Management server software must be upgraded before
running the NDU.
Due to the VPLEX distributed coherent cache, directors elsewhere in
the VPLEX installation service I/Os while the upgrade is taking
place. This alleviates the need for service windows and reduces RTO.
The NDU includes the following steps:
Preparing the VPLEX system for the NDU
Starting the NDU
Transferring the I/O to an upgraded director
Completing the NDU
VPLEX management interfaces
Within the VPLEX cluster, TCP/IP-based management traffic travels
through a private network subnet to the components in one or more
clusters. In VPLEX Metro and VPLEX Geo, VPLEX establishes a VPN
tunnel between the management servers of both clusters. When
VPLEX Witness is deployed, the VPN tunnel is extended to a 3-way
tunnel including both Management Servers and VPLEX Witness.
Web-based GUI
VPLEX includes a Web-based graphical user interface (GUI) for
management. The EMC VPLEX Management Console Help provides
more information on using this interface.
To perform other VPLEX operations that are not available in the GUI,
refer to the CLI, which supports full functionality. The EMC VPLEX
CLI Guide provides a comprehensive list of VPLEX commands and
detailed instructions on using those commands.
The EMC VPLEX Management Console includes, but is not limited to, the following functions:
Supports storage array discovery and provisioning
Local provisioning
Distributed provisioning
Mobility Central
Online help
VPLEX CLI
VPlexcli is a command line interface (CLI) to configure and operate
VPLEX systems. It also provides the EZ Wizard Setup process to
make installation of VPLEX easier and quicker.
The CLI is divided into command contexts. Some commands are
accessible from all contexts, and are referred to as global commands.
The remaining commands are arranged in a hierarchical context tree
that can only be executed from the appropriate location in the context
tree.
The VPlexcli encompasses all capabilities needed to function if the management station is unavailable. It is fully functional and comprehensive, supporting full configuration, provisioning, and advanced systems management capabilities.
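The context-tree idea can be pictured with a small sketch: global commands are available everywhere, while other commands are attached to specific contexts. The context paths and command names below are hypothetical examples chosen for illustration, not a catalogue of real VPlexcli contexts.

# Toy model of a CLI whose commands live in a hierarchical context tree, with a
# handful of global commands available everywhere.
GLOBAL_COMMANDS = {"cd", "ls", "help", "exit"}

CONTEXT_TREE = {
    "/": set(),
    "/clusters": {"summary"},
    "/clusters/cluster-1/storage-elements": {"claim", "unclaim"},
    "/engines": {"status"},
}


def available_commands(context):
    """Commands usable at a context: the global set plus the context's own commands."""
    return sorted(GLOBAL_COMMANDS | CONTEXT_TREE.get(context, set()))


if __name__ == "__main__":
    print(available_commands("/clusters/cluster-1/storage-elements"))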
SNMP support for performance statistics
The VPLEX snmpv2c SNMP agent:
Supports retrieval of performance-related statistics as published
in the VPLEX-MIB.mib.
Runs on the management server and fetches performance-related data from individual directors using a firmware-specific interface.
Provides SNMP MIB data for directors for the local cluster only.
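As a hedged example, the statistics published through the agent can be pulled with a standard SNMP walk driven from Python. The community string, host name, and the choice of OID subtree below are assumptions for illustration; the actual objects available are defined in VPLEX-MIB.mib, and loading that MIB into your SNMP tooling translates the numeric OIDs into readable names.

import subprocess

cmd = [
    "snmpwalk", "-v2c", "-c", "public",   # SNMP v2c and community string (assumed values)
    "vplex-mgmt",                          # management server hostname (assumed)
    "1.3.6.1.4.1.1139",                    # EMC enterprise OID subtree; narrow further per the MIB
]

result = subprocess.run(cmd, capture_output=True, text=True, check=False)
print(result.stdout or result.stderr)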
LDAP/AD support
VPLEX offers Lightweight Directory Access Protocol (LDAP) or
Active Directory for an authentication directory service.
VPLEX Element Manager API
VPLEX Element Manager API uses the Representational State
Transfer (REST) software architecture for distributed systems such as
the World Wide Web. It allows software developers and other users to
use the API to create scripts to run VPLEX CLI commands.
The VPLEX Element Manager API supports all VPLEX CLI
commands that can be executed from the root context on a director.
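A hedged sketch of calling the Element Manager API from Python follows. The base URL, endpoint, authentication headers, and certificate handling shown are assumptions for illustration only; consult the VPLEX Element Manager API documentation for the actual request format.

import requests

MGMT_SERVER = "vplex-mgmt.example.com"     # management server address (placeholder)
BASE_URL = f"https://{MGMT_SERVER}/vplex"  # base path assumed for illustration

# Credential mechanism shown as simple headers purely for illustration; the real
# API's authentication may differ.
HEADERS = {"Username": "service", "Password": "changeme"}

# Illustrative request: list a context via the REST interface, mirroring what an
# administrator would browse in the CLI context tree.
resp = requests.get(f"{BASE_URL}/clusters", headers=HEADERS, verify=False, timeout=30)
resp.raise_for_status()
print(resp.json())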
Simplified storage management
VPLEX supports a variety of arrays from various vendors covering
both active/active and active/passive type arrays. VPLEX simplifies
storage management by allowing simple LUNs, provisioned from the
various arrays, to be managed through a centralized management
interface that is simple to use and very intuitive. In addition, a
VPLEX Metro or VPLEX Geo environment that spans data centers
allows the storage administrator to manage both locations through
the one interface from either location by logging in at the local site.
Management server user accounts
The management server requires the setup of user accounts for access
to certain tasks. Table 3 describes the types of user accounts on the
management server.
Some service and administrator tasks require OS commands that
require root privileges. The management server has been configured
to use the sudo program to provide these root privileges just for the
duration of the command. Sudo is a secure and well-established
UNIX program for allowing users to run commands with root
privileges.
VPLEX documentation will indicate which commands must be
prefixed with "sudo" in order to acquire the necessary privileges. The
sudo command will ask for the user's password when it runs for the
first time, to ensure that the user knows the password for his account.
This prevents unauthorized users from executing these privileged
commands when they find an authenticated SSH login that was left
open.
Table 3 Management server user accounts

Account type          Purpose
admin (customer)      Performs administrative actions, such as user management;
                      creates and deletes Linux CLI accounts; resets passwords for
                      all Linux CLI users; modifies the public Ethernet settings.
service (EMC service) Starts and stops necessary OS and VPLEX services; cannot
                      modify user accounts. (Customers do have access to this
                      account.)
Linux CLI accounts    Uses VPlexcli to manage federated storage.
All account types     Uses VPlexcli; modifies their own password; can SSH or VNC
                      into the management server; can SCP files off the management
                      server from directories to which they have access.
Management server software
The management server software is installed during manufacturing
and is fully field upgradeable. The software includes:
VPLEX Management Console
VPlexcli
Server Base Image Updates (when necessary)
Call-home software
Each is briefly discussed in this section.
Management console
The VPLEX Management Console provides a graphical user interface
(GUI) to manage the VPLEX cluster. The GUI can be used to
provision storage, as well as manage and monitor system
performance.
Figure 7 on page 42 shows the VPLEX Management Console window
with the cluster tree expanded to show the objects that are
manageable from the front-end, back-end, and the federated storage.
Figure 7 VPLEX Management Console
The VPLEX Management Console provides online help for all of its
available functions. Online help can be accessed in the following
ways:
Click the Help icon in the upper right corner on the main screen
to open the online help system, or in a specific screen to open a
topic specific to the current task.
Click the Help button on the task bar to display a list of links to
additional VPLEX documentation and other sources of
information.
Figure 8 is the welcome screen of the VPLEX Management Console
GUI, which utilizes a secure HTTP connection via a browser. The
interface uses Flash technology for rapid response and unique look
and feel.
Figure 8 Management Console welcome screen
Command line interface
The VPlexcli is a command line interface (CLI) for configuring and
running the VPLEX system, for setting up and monitoring the
system's hardware and intersite links (including com/tcp), and for
configuring global inter-site I/O cost and link-failure recovery. The
CLI runs as a service on the VPLEX management server and is
accessible using Secure Shell (SSH).
For information about the VPlexcli, refer to the EMC VPLEX CLI
Guide.
System reporting
VPLEX system reporting software collects configuration information
from each cluster and each engine. The resulting configuration file
(XML) is zipped and stored locally on the management server or
presented to the SYR system at EMC via call home.
You can schedule a weekly job to automatically collect SYR data
(VPlexcli command scheduleSYR), or manually collect it whenever
needed (VPlexcli command syrcollect).
Director software
The director software provides:
Basic Input/Output System (BIOS) Provides low-level
hardware support to the operating system, and maintains boot
configuration.
Power-On Self Test (POST) Provides automated testing of
system hardware during power on.
Linux Provides basic operating system services to the Vplexcli
software stack running on the directors.
VPLEX Power and Environmental Monitoring (ZPEM)
Provides monitoring and reporting of system hardware status.
EMC Common Object Model (ECOM) Provides management
logic and interfaces to the internal components of the system.
Log server Collates log messages from director processes and
sends them to the SMS.
EMC GeoSynchrony

(I/O Stack) Processes I/O from hosts,


performs all cache processing, replication, and virtualization
logic, interfaces with arrays for claiming and I/O.
Configuration overview
The VPLEX configurations are based on how many engines are in the
cabinet. The basic configurations are single, dual and quad
(previously known as small, medium and large).
The configuration sizes refer to the number of engines in the VPLEX
cabinet. The remainder of this section describes each configuration
size.
Single engine configurations
The VPLEX single engine configuration includes the following:
Two directors
One engine
Redundant engine SPSs
8 front-end Fibre Channel ports (16 for VS1 hardware)
8 back-end Fibre Channel ports (16 for VS1 hardware)
One management server
The unused space between engine 1 and the management server as
shown in Figure 9 on page 47 is intentional.
Figure 9 VPLEX single engine configuration
Dual configurations
The VPLEX dual engine configuration includes the following:
Four directors
Two engines
Redundant engine SPSs
16 front-end Fibre Channel ports (32 for VS1 hardware)
16 back-end Fibre Channel ports (32 for VS1 hardware)
One management server
Redundant Fibre Channel COM switches for local COM; UPS for
each Fibre Channel switch
Figure 10 shows an example of a dual engine configuration.
Figure 10 VPLEX dual engine configuration
Quad configurations
The VPLEX quad engine configuration includes the following:
Eight directors
Four engines
Redundant engine SPSs
32 front-end Fibre Channel ports (64 for VS1 hardware)
32 back-end Fibre Channel ports (64 for VS1 hardware)
One management server
Redundant Fibre Channel COM switches for local COM; UPS for
each Fibre Channel switch
Figure 11 shows an example of a quad configuration.
Figure 11 VPLEX quad engine configuration
I/O implementation
The VPLEX cluster utilizes a write-through mode when configured
for either VPLEX Local or Metro whereby all writes are written
through the cache to the back-end storage. To maintain data integrity,
a host write is acknowledged only after the back-end arrays (in one
cluster in case of VPLEX Local and in two clusters in case of VPLEX
Metro) acknowledge the write.
This section describes the VPLEX cluster caching layers, roles, and
interactions. It gives an overview of how reads and writes are
handled within the VPLEX cluster and how distributed cache
coherency works. This is important to the introduction of high
availability concepts.
Cache coherence
Cache coherence creates a consistent global view of a volume.
Distributed cache coherence is maintained using a directory. There is
one directory per virtual volume and each directory is split into
chunks (4096 directory entries within each). These chunks exist only
if they are populated. There is one directory entry per global cache
page, with responsibility for:
Tracking page owner(s) and remembering the last writer
Locking and queuing
Meta-directory
Directory chunks are managed by the meta-directory, which assigns
and remembers chunk ownership. These chunks can migrate using
Locality-Conscious Directory Migration (LCDM). This
meta-directory knowledge is cached across the share group (i.e., a
group of multiple directors within the cluster that are exporting a
given virtual volume) for efficiency.
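To make the directory structure above more concrete, the following Python fragment is a minimal conceptual sketch. All class and attribute names are hypothetical and are not VPLEX internals; it simply models one directory per virtual volume, chunks of 4096 entries that exist only once populated, and one entry per global cache page tracking owners, the last writer, and lock state.

# Conceptual sketch only -- hypothetical names, not actual VPLEX internals.
class DirectoryEntry:
    """One entry per global cache page."""
    def __init__(self):
        self.owners = set()       # directors holding a copy of the page
        self.last_writer = None   # director that wrote the page most recently
        self.locked = False
        self.wait_queue = []      # queued lock requests

class VolumeDirectory:
    """One directory per virtual volume, split into chunks of 4096 entries."""
    CHUNK_SIZE = 4096

    def __init__(self):
        self.chunks = {}          # chunk index -> {entry slot -> DirectoryEntry}
        self.chunk_owner = {}     # meta-directory: chunk index -> owning director

    def entry_for_page(self, page):
        chunk_id, slot = divmod(page, self.CHUNK_SIZE)
        chunk = self.chunks.setdefault(chunk_id, {})   # chunks exist only if populated
        return chunk.setdefault(slot, DirectoryEntry())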
How a read is handled
When a host makes a read request, VPLEX first searches its local
cache. If the data is found there, it is returned to the host.
If the data is not found in local cache, VPLEX searches global cache.
Global cache includes all directors that are connected to one another
within the single VPLEX cluster for VPLEX Local, and all of the
VPLEX clusters for both VPLEX Metro and VPLEX Geo. If there is a
global read hit in the local cluster (i.e. same cluster, but different
director) then the read will be serviced from global cache in the same
cluster. The read could also be serviced by the remote global cache if
the consistency group setting local read override is set to false (the
default is true). Whenever the read is serviced from global cache
(same cluster or remote), a copy is also stored in the local cache of the
director from where the request originated.
If a read cannot be serviced from either local cache or global cache, it
is read directly from the back-end storage. In these cases both the
global and local cache are updated.
I/O flow of a local read hit
1. Read request issued to virtual volume from host.
2. Look up in local cache of ingress director.
3. On hit, data returned from local cache to host.
I/O flow of a global read hit
1. Read request issued to virtual volume from host.
2. Look up in local cache of ingress director.
3. On miss, look up in global cache.
4. On hit, data is copied from owner director into local cache.
5. Data returned from local cache to host.
I/O flow of a read miss
1. Read request issued to virtual volume from host.
2. Look up in local cache of ingress director.
3. On miss, look up in global cache.
4. On miss, data read from storage volume into local cache.
5. Data returned from local cache to host.
6. The director that returned the data becomes the chunk owner.
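The three read flows above can be condensed into a single decision sequence. The Python sketch below is purely illustrative (hypothetical objects and names, not a VPLEX API), and it assumes the default behavior where a global cache hit may be serviced from either the local or the remote cluster.

# Illustrative read path only -- hypothetical objects, not a VPLEX API.
def service_read(page, local_cache, global_cache, backend):
    """Return data for a read request arriving at the ingress director."""
    data = local_cache.get(page)
    if data is not None:
        return data                          # local read hit

    data = global_cache.get(page)            # same cluster or remote, per settings
    if data is not None:
        local_cache[page] = data             # copy into ingress director's cache
        return data                          # global read hit

    data = backend.read(page)                # read miss: go to back-end storage
    local_cache[page] = data                 # update local cache
    global_cache[page] = data                # update global cache; ingress director
    return data                              # becomes the chunk owner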
How a write is handled
For both VPLEX Local and Metro, all writes are written through
cache to the back-end storage. Writes are completed to the host only
after they have been completed to the back-end arrays. In the case of
VPLEX Metro, each write is duplicated at the cluster where it was written. One of the copies is then written through to the local back-end disk, whilst the other is sent to the remote VPLEX cluster, where in turn it is written through to the remote back-end disk. Host acknowledgement is given once both writes to back-end storage have been acknowledged.
I/O flow of a write miss
1. Write request issued to virtual volume from host.
2. Look for prior data in local cache.
3. Look for prior data in global cache.
4. Transfer data to local cache.
5. Data is written through to back-end storage.
6. Write is acknowledged to host.
I/O flow of a write hit
1. Write request issued to virtual volume from host.
2. Look for prior data in local cache.
3. Look for prior data in global cache.
4. Invalidate prior data.
5. Transfer data to local cache.
6. Data is written through to back-end storage.
7. Write is acknowledged to host.
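The write flows can be sketched in the same spirit. Again, this is a conceptual illustration with hypothetical names, where synchronous calls stand in for the real protocol: prior data is invalidated on a write hit, the data is written through to every cluster's back-end storage (one array target for VPLEX Local, two for VPLEX Metro), and the host is acknowledged only after all back-end writes complete.

# Conceptual write-through sketch -- not actual VPLEX code.
def service_write(page, data, local_cache, global_cache, backends):
    """backends holds one back-end target per cluster (one for Local, two for Metro)."""
    if global_cache.get(page) is not None:
        global_cache.pop(page)               # write hit: invalidate prior data
    local_cache[page] = data                 # transfer data into local cache

    for backend in backends:                 # write through to every cluster's storage
        backend.write(page, data)            # Metro: local leg plus remote leg

    return "ACK"                             # acknowledge host only after all legs complete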
Chapter 3 System and Component Integrity
This chapter explains how VPLEX clusters are able to handle
hardware failures in any subsystem within the storage cluster. Topics
include:
Overview............................................................................................. 54
Cluster.................................................................................................. 55
Path redundancy through different ports ...................................... 56
Path redundancy through different directors ................................ 57
Path redundancy through different engines .................................. 58
Path redundancy through site distribution.................................... 59
Serviceability....................................................................................... 60
Overview
VPLEX clusters are capable of surviving any single hardware failure
in any subsystem within the overall storage cluster. These include
host connectivity subsystem, memory subsystem, etc. A single failure
in any subsystem will not affect the availability or integrity of the
data. Multiple failures in a single subsystem and certain
combinations of single failures in multiple subsystems may affect the
availability or integrity of data.
High availability requires that host connections be redundant and
that hosts are supplied with multipath drivers. In the event of a
front-end port failure or a director failure, hosts without redundant
physical connectivity to a VPLEX cluster and without multipathing
software installed may be susceptible to data unavailability.
Cluster
A cluster is a collection of one, two, or four engines in a physical
cabinet. A cluster serves I/O for one storage domain and is managed
as one storage cluster.
All hardware resources (CPU cycles, I/O ports, and cache memory)
are pooled:
The front-end ports on all directors provide active/active access
to the virtual volumes exported by the cluster.
For maximum availability, virtual volumes can be presented
through all directors so that all directors but one can fail without
causing data loss or unavailability. To achieve this with version
5.0.1 code and below, directors must be connected to all storage.
Note: Instant failure of all directors bar one in a dual or quad engine
system would result in the last remaining director also failing, since it
would lose quorum. This is, therefore, only true if directors fail one at a
time.
Path redundancy through different ports
Because all paths are duplicated, when a director port goes down for
any reason, data is seamlessly processed through a port on the other
director, as shown in Figure 12 (assuming correct multipath software
is in place).
Figure 12 Port redundancy
Multipathing software plus redundant volume presentation yields
continuous data availability in the presence of port failures.
Path redundancy through different directors
If a director were to go down, the other director can completely take
over the I/O processing from the host, as shown in Figure 13.
Figure 13 Director redundancy
Multipathing software plus volume presentation on different
directors yields continuous data availability in the presence of
director failures.
Path redundancy through different engines
In a clustered environment, if one engine goes down, another engine
completes the host I/O processing, as shown in Figure 14.
Figure 14 Engine redundancy
Multipathing software plus volume presentation on different engines
yields continuous data availability in the presence of engine failures.
Path redundancy through site distribution
Distributed site redundancy, now enabled through VPLEX Metro HA
(including VPLEX Witness), ensures that if a site goes down, or even if
the link to that site goes down, the other site can continue seamlessly
processing the host I/O, as shown in Figure 15. As illustrated, if a site
failure occurs at Site B, I/O continues unhindered at Site A.
Figure 15 Site redundancy
Serviceability
In addition to the redundancy fail-safe features, the VPLEX cluster
provides event logs and call home capability via EMC Secure Remote
Support (ESRS).
Chapter 4 Foundations of VPLEX High Availability
This chapter explains the foundations of VPLEX high availability:
Foundations of VPLEX High Availability ..................................... 62
Failure handling without VPLEX Witness (static preference) ..... 70
Foundations of VPLEX High Availability
The following section discusses, at a high level, several disruptive
scenarios for a multi-site VPLEX Metro configuration without VPLEX
Witness. The purpose of this section is to give the customer or
solutions architect the ability to understand site failure semantics
prior to the deployment of VPLEX Witness and the related solutions
outlined in this book. This section isn't designed to highlight flaws in
high availability architecture as implemented in basic VPLEX best
practices without using VPLEX Witness. All solutions deployed in a
Metro active/active state, be they VPLEX or not, will run into the
same issues when an independent observer or witness, such as
VPLEX Witness, is not deployed. The decision for an architect to
apply the VPLEX Witness capabilities or to enhance connectivity
paths across data centers using the Metro HA Cross-Cluster Connect
solution depends on their basic failover needs.
Note: To keep the explanation of this subject at a high level, the graphics in
the following section have been broken down into major objects (e.g., Site A,
Site B, and Link). You can assume that within each site resides a VPLEX
cluster; therefore, when a site failure is shown it will also cause a full VPLEX
cluster failure within that site. You can also assume that the link object
between sites represents the main inter-cluster data network connected to
each VPLEX cluster in either site. One further assumption is that all
components within a site share the same failure domain, so a site failure will
affect all components within this failure domain, including the VPLEX cluster.
Figure 16 shows normal operation, where all three components are
fully operational. (Note: green symbolizes normal operation and red
symbolizes failure.)
Figure 16 High level functional sites in communication
Figure 17 demonstrates that Site A has failed.
Figure 17 High level Site A failure
Suppose that an application or VM was running only in Site A at the
time of the incident; it would now need to be restarted at the
remaining Site B. Reading this document, you know this because you
have an external perspective and can see the entire diagram.
However, if you were looking at this purely from Site B's perspective,
all that could be deduced is that communication has been lost to Site
A. Without an external independent observer of some kind, it is
impossible to distinguish a full Site A failure from an inter-cluster
link failure.
The red arrow in Figure 18 represents an inter-cluster link failure.
Similar to the previous example, if you look at this from an overall
perspective, you can see that it is the link which is faulted. However,
if you consider this from Site A's or Site B's perspective, all that can be
deduced is that communication with the other site has been lost
(exactly like the previous example), and it cannot be determined
whether the link or the remote site is at fault.
The next section shows how different failures affect a VPLEX
distributed volume and highlights the different resolutions required
in each case, starting with the site failure scenario. Figure 19 shows, at
a high level, a VPLEX distributed volume spanning two sites:
Figure 19 VPLEX active and functional between two sites
As shown, the distributed volume is made up of a mirror leg at each
site (M1 and M2). Using the distributed cache coherency semantics
provided by VPLEX GeoSynchrony, a consistent presentation of the
logical volume is achieved across both clusters. Furthermore, cache
coherency enables active/active data access (both read and write)
from both sites. The example also shows a distributed network where
users are able to access either site, as would be the case in a fully
active/active environment.
Figure 20 shows a total failure at one of the sites (in this case Site A
has failed). The distributed volume becomes degraded, since the
hardware required at Site A to support this particular mirror leg is no
longer available. The correct resolution in this example is to keep the
volume active at Site B so the application can resume there.
Figure 20 VPLEX concept diagram with failure at Site A
Figure 21 shows the desired resolution if a failure at Site A were to
occur. As discussed previously, the correct outcome is to keep the
volume online at Site B.
Figure 21 Correct resolution after volume failure at Site A
Failure handling without VPLEX Witness (static preference) on
page 70 discusses the outcome after an inter-cluster link
partition/failure.
Figure 22 shows the configuration before the failure.
Figure 22 VPLEX active and functional between two sites
Recall from the simple Site A / Site B failure scenarios that when the
link failed, neither site knew the exact nature of the failure. With an
active/active distributed volume, a link failure would also degrade
the distributed volume, since write I/O at either site would be unable
to propagate to the remote site.
Figure 23 shows what would happen if there was no mechanism to
suspend I/O at one of the sites in this scenario.
Figure 23 Inter-site link failure and cluster partition
As shown, this would lead to a split brain (or conflicting detach, in
VPLEX terminology): since writes could be accepted at both sites,
there is the potential to end up with two divergent copies of the data.
To protect against data corruption this situation has to be avoided.
Therefore, VPLEX must act and suspend access to the distributed
volume on one of the clusters.
Figure 24 displays a valid and acceptable state in the event of a link
partition, as Site A is now suspended. This preferential behavior
(selectable for either cluster) is the default and automatic behavior of
VPLEX distributed volumes and protects against data corruption and
split brain scenarios. The following section explains in more detail
how this functions.
Figure 24 Correct handling of cluster partition
Failure handling without VPLEX Witness (static preference)
As previously demonstrated, in the presence of failures, VPLEX
active/active distributed solutions require different resolutions
depending on the type of failure. However, VPLEX version 4.0 had no
means to perform external arbitration, so no mechanism existed to
distinguish between a site failure and a link failure. To overcome this,
a feature called static preference (previously known as static bias) is
used to guard against split brain scenarios occurring.
The premise of static preference is to set a detach rule, ahead of
failure, for each distributed volume (or group of distributed volumes)
that spans two VPLEX clusters. The detach rule defines which cluster
will be declared the preferred cluster and maintain access to the
volume, and which cluster will be declared non-preferred and
therefore suspend access, should the two VPLEX clusters lose
communication with each other (this concept covers both site and
link failure). The detach rule means that one cluster can unilaterally
detach the other cluster and assume that the detached cluster is either
dead or that it will stay suspended if it is alive.
Note: VPLEX Metro also supports the rule set no automatic winner. If a
consistency group is configured with this setting, then I/O will suspend at
both VPLEX clusters if either the link partitions or an entire VPLEX cluster
fails. Manual intervention can then be used to resume I/O at a remaining
cluster if required. Care should be taken when setting this policy: although it
will always ensure that both VPLEX clusters remain identical at all times, the
trade-off is that the production environment would be halted. This is useful if
a customer wishes to integrate VPLEX failover semantics with failover
behavior driven by the application (suppose the application has its own
witness, etc.). In this case, the application can provide a script that invokes the
resume CLI command on the VPLEX cluster of its choosing.
Figure 25 shows how static preference can be set for each distributed
volume (also known as a DR1 - Distributed RAID1).
Figure 25 VPLEX static detach rule
This detach rule can either be set within the VPLEX GUI or via
VPLEX CLI.
Each volume can be either set to Cluster 1 detaches, Cluster 2
detaches or no automatic winner.
If the Distributed Raid 1 device (DR1) is set to Cluster 1 detaches,
then in any failure scenario the preferred cluster for that volume
would be declared as Cluster 1, but if the DR1 detach rule is set to
Cluster 2 detaches, then in any failure scenario the preferred cluster
for that volume would be declared as Cluster 2.
Note: Some people prefer to substitute the word detaches with preferred or
wins, which is perfectly acceptable and may make the behavior easier to
understand.
Setting the rule on a volume to Cluster 1 detaches means that
Cluster 1 is the preferred cluster for the given volume. (The
terminology that Cluster 1 has the bias for the given volume is also
appropriate.)
Once this rule is set, then regardless of the failure (be it link or site),
the rule will always be invoked.
Note: A caveat exists here: if the state of the back-end (BE) storage at the
preferred cluster is out of date (due to a prior BE failure, an incomplete
rebuild, or another issue), the preferred cluster will suspend I/O regardless of
preference.
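The static preference behavior described above can be summarized in a short conceptual sketch. The Python fragment below uses hypothetical names (it is not VPlexcli syntax) and simply restates the rule: whichever cluster the detach rule names stays online when the clusters lose contact, regardless of whether the cause is a link partition or a remote site failure; the other cluster suspends; the no automatic winner rule suspends both; and a preferred cluster with out-of-date back-end storage suspends as well.

# Illustrative only -- hypothetical names, not VPlexcli syntax.
def resolve_on_communication_loss(detach_rule, cluster, backend_up_to_date=True):
    """Decide whether 'cluster' (1 or 2) keeps a DR1 online after losing
    contact with the peer cluster, using static preference alone."""
    if detach_rule == "no automatic winner":
        return "suspend"                     # both clusters suspend; manual resume needed
    preferred = 1 if detach_rule == "cluster-1 detaches" else 2
    if cluster == preferred and backend_up_to_date:
        return "continue I/O"                # preferred cluster keeps the volume online
    return "suspend"                         # non-preferred (or stale) cluster suspends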
The following diagrams show some examples of the rule set in action
for different failures, the first being a site loss at B with a single DR1
set to Cluster 1 detaches.
Figure 26 shows the initial running setup of the configuration. It can
be seen that the volume is set to Cluster 1 detaches.
Figure 26 Typical detach rule setup
If there were a problem at Site B, the DR1 would become degraded, as
shown in Figure 27.
Figure 27 Non-preferred site failure
As the preference rule was set to Cluster 1 detaches, the distributed
volume will remain active at Site A. This is shown in Figure 28.
Figure 28 Volume remains active at Cluster 1
Therefore in this scenario, if the service, application, or VM was
running only at Site A (the preferred site) then it would continue
uninterrupted without needing to restart. However, if the application
was running only at Site B on the given distributed volume then it
will need to be restarted at Site A, but since VPLEX is an active/active
solution no manual intervention at the storage layer will be required
in this case.
The next example shows static preference working under link failure
conditions.
Figure 29 shows a configuration with a distributed volume set to
Cluster 1 detaches as per the previous configuration.
Figure 29 Typical detach rule setup before link failure
If the link were now lost, the distributed volume would again become
degraded, as shown in Figure 30.
Figure 30 Inter-site link failure and cluster partition
To ensure that split brain does not occur after this type of failure, the
static preference rule is applied and I/O is suspended at Cluster 2 in
this case, as the rule is set to Cluster 1 detaches.
This is shown in Figure 31.
Figure 31 Suspension after inter-site link failure and cluster partition
Therefore, in this scenario, if the service, application, or VM was
running only at Site A, then it would continue uninterrupted without
needing to restart. However, if the application was running only at
Site B, then it will need to be restarted at Site A, since the preference
rule set will suspend access for the given distributed volumes on
Cluster 2. Again, no manual intervention will be required in this case
at the storage level, as the volume at Cluster 1 automatically remained
available.
In summary, static preference is a very effective method of preventing
split brain. However, there is a particular scenario that will result in
manual intervention if the static preference feature is used alone. This
can happen if there is a VPLEX cluster or site failure at the preferred
cluster (i.e., the pre-defined preferred cluster for the given
distributed volume).
This is shown in Figure 32, where there is a distributed volume which
has Cluster 2 detaches set on the DR1.
Figure 32 Cluster 2 is preferred
If Site B had a total failure in this example, disruption would now
also occur at Site A as shown in Figure 33.
Figure 33 Preferred site failure causes full Data Unavailability
As can be seen, the preferred site has now failed and the preference
rule has been invoked, but since the rule is static and cannot
distinguish between a link failure and a remote site failure, in this
example the remaining site becomes suspended. Therefore, in this
case, manual intervention will be required to bring the volume online
at Site A.
Static preference is a very powerful rule. It provides zero RPO and
zero RTO resolution for non-preferred cluster failure and
inter-cluster partition scenarios, and it completely avoids split brain.
However, in the presence of a preferred cluster failure it provides
non-zero RTO. It is good to note that this feature is available without
automation and is a valuable alternative when a VPLEX Witness
configuration (discussed in the next chapter) is unavailable or the
customer infrastructure cannot accommodate one due to the lack of a
third failure domain.
VPLEX Witness has been designed to overcome this particular
non-zero RTO scenario, since it can override the static preference and
leave what was the non-preferred site active, while guaranteeing that
split brain scenarios are always avoided.
Note: If using a VPLEX Metro deployment without VPLEX Witness, and the
preferred cluster has been lost, I/O can be manually resumed via the CLI at
the remaining (non-preferred) VPLEX cluster. However, care should be taken
here to avoid a conflicting detach or split brain scenario. (VPLEX Witness
solves this problem automatically.)
Chapter 5 Introduction to VPLEX Witness
This chapter introduces VPLEX Witness:
VPLEX Witness overview and architecture ................................... 82
VPLEX Witness target solution, rules, and best practices............ 85
VPLEX Witness failure semantics.................................................... 87
CLI example outputs ......................................................................... 93
VPLEX Witness overview and architecture
VPLEX Metro v5.0 (and above) systems can now rely on a new
component called VPLEX Witness. VPLEX Witness is an optional
component designed to be deployed in customer environments
where the regular preference rule sets are insufficient to provide
seamless zero or near-zero RTO storage availability in the presence of
site disasters and VPLEX cluster and inter-cluster failures.
As described in the previous section, without VPLEX Witness, all
distributed volumes rely on configured rule sets to identify the
preferred cluster in the presence of cluster partition or cluster/site
failure. However, if the preferred cluster happens to fail (as the result
of a disaster event, etc.), VPLEX is unable to automatically allow the
surviving cluster to continue I/O to the affected distributed volumes.
VPLEX Witness has been designed specifically to overcome this case.
An external VPLEX Witness Server is installed as a virtual machine
running on a customer supplied VMware ESX host deployed in a
failure domain separate from either of the VPLEX clusters (to
eliminate the possibility of a single fault affecting both the cluster and
the VPLEX Witness). VPLEX Witness connects to both VPLEX
clusters over the management IP network. By reconciling its own
observations with the information reported periodically by the
clusters, the VPLEX Witness enables the cluster(s) to distinguish
between inter-cluster network partition failures and cluster failures
and automatically resume I/O in these situations.
Figure 34 on page 83 shows a high level deployment of VPLEX
Witness and how it can augment an existing static preference
solution. The VPLEX Witness server resides in a fault domain separate
from VPLEX cluster 1 and cluster 2.
Figure 34 High Level VPLEX Witness architecture
Since the VPLEX Witness server is external to both of the production
locations, it gains more perspective on the nature of a particular
failure and can ensure the correct action is taken. As mentioned
previously, it is this perspective that is vital in order to distinguish
between a site outage and a link outage, since each of these scenarios
requires a different action to be taken.
Figure 35 shows a high-level circuit diagram of how the VPLEX
Witness Server should be connected.
Figure 35 High Level VPLEX Witness deployment
The VPLEX Witness server is connected via the VPLEX management
IP network in a third failure domain.
Depending on the scenario that is to be protected against, this third
fault domain could reside on a different floor within the same
building as VPLEX cluster 1 and cluster 2. It could also be located in a
completely geographically dispersed data center, which could even
be in a different country.
Note: VPLEX Witness Server supports up to 1 second of network latency over
the management IP network.
Clearly, using the example of the third floor in the building, one
would not be protected from a disaster affecting the entire building,
so, depending on the requirements, careful consideration should be
given when choosing this third failure domain.
VPLEX Witness target solution, rules, and best practices
VPLEX Witness is architecturally designed for VPLEX Metro clusters.
Customers who wish to use VPLEX Local will not require VPLEX
Witness functionality.
Furthermore VPLEX Witness is only suitable for customers who have
a third failure domain connected via two physical networks from
each of the data centers where the VPLEX clusters reside into each
VPLEX management station Ethernet port.
VPLEX Witness failure handling semantics only apply to Distributed
volumes in all synchronous (i.e., Metro) consistency groups on a pair
of VPLEX v5.x clusters if VPLEX Witness is enabled.
VPLEX Witness failure handling semantics do not apply to:
Local volumes
Distributed volumes outside of a consistency group
Distributed volumes within a consistency group if the VPLEX
Witness is disabled
Distributed volumes within a consistency group if the preference
rule is set to no automatic winner.
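A minimal sketch of these applicability rules, using hypothetical attribute names, is shown below; it simply restates the list above as a predicate.

# Hypothetical attribute names; summarizes the applicability rules above.
def witness_semantics_apply(volume, witness_enabled):
    return (volume.is_distributed
            and volume.consistency_group is not None
            and witness_enabled
            and volume.consistency_group.detach_rule != "no automatic winner")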
At the time of writing, only one VPLEX Witness Server can be
configured for a given Metro system, and when it is configured and
enabled, its failure semantics apply to all configured consistency
groups.
Additionally a single VPLEX Witness Server (virtual machine) can
only support a single VPLEX Metro system (however, more than one
VPLEX Witness Server can be configured onto a single physical ESX
host).
Figure 36 shows the supported versions (at the time of writing) for
VPLEX Witness.
Figure 36 Supported VPLEX versions for VPLEX Witness
As mentioned in Figure 36, depending on the solution, VPLEX static
preference alone, without VPLEX Witness, may still be relevant in
some cases. Figure 37 shows the volume types and rules which can be
supported with VPLEX Witness.
Figure 37 VPLEX Witness volume types and rule support
Check the latest VPLEX ESSM (EMC Simple Support Matrix), located
at https://elabnavigator.emc.com, Simple Support Matrix tab, for
the latest information including VPLEX Witness server physical host
requirements and site qualification.
VPLEX Witness failure semantics
As seen in the previous section, VPLEX Witness operates at the
consistency group level for a group of distributed devices and
functions in conjunction with the detach rule set within the
consistency group.
Starting with the inter-cluster link partition, the next few pages
discuss the failure scenarios (both site and link) which were raised in
previous sections and show how the failure semantics differ when
using VPLEX Witness compared to using static preference alone.
Figure 38 shows a typical setup for VPLEX 5.x with a single
distributed volume configured in a consistency group which has a
rule set configured for cluster 2 detaches (i.e., cluster 2 is
preferred). Additionally, it shows the VPLEX Witness server
connected via the management network in a third failure domain.
Figure 38 Typical VPLEX Witness configuration
If the inter-cluster link were to fail in this scenario, VPLEX Witness
would still be able to communicate with both VPLEX clusters, since
the management network that connects the VPLEX Witness server to
both of the VPLEX clusters is still operational. By communicating
with both VPLEX clusters, the VPLEX Witness will deduce that the
inter-cluster link has failed, since both VPLEX clusters report to the
VPLEX Witness server that connectivity with the remote VPLEX
cluster has been lost (i.e., cluster 1 reports that cluster 2 is
unavailable, and vice versa). This is shown in Figure 39.
Figure 39 VPLEX Witness and an inter-cluster link failure
In this case the VPLEX Witness guides both clusters to follow the
pre-configured static preference rules and volume access at cluster 1
will be suspended since the rule set was configured as cluster 2
detaches.
Figure 40 shows the final state after this failure.
Figure 40 VPLEX Witness and static preference after cluster partition
The next example shows how VPLEX Witness can assist if you have a
site failure at the preferred site. As discussed above, this type of
failure without VPLEX Witness would cause the volumes in the
surviving site to go offline. This is where VPLEX Witness greatly
improves the outcome of this event and removes the need for manual
intervention.
Figure 41 shows a typical setup for VPLEX v5.x with a distributed
volume configured in a consistency group with a rule set configured
for Cluster 2 detaches (i.e., Cluster 2 wins).
Figure 41 VPLEX Witness typical configuration for cluster 2 detaches
Figure 42 shows that Site B has now failed.
Figure 42 VPLEX Witness diagram showing cluster 2 failure
As discussed in the previous section, when a site has failed, the
distributed volumes become degraded. However, unlike our
previous example, where there was a site failure at the preferred site
and the static preference rule was used, forcing volumes into a
suspend state at cluster 1, VPLEX Witness will now observe that
communication is still possible to cluster 1 (but not cluster 2).
Additionally, since cluster 1 reports that it cannot contact cluster 2,
VPLEX Witness can make an informed decision and guide cluster 1 to
override the static rule set and proceed with I/O.
Figure 43 shows the outcome.
Figure 43 VPLEX Witness with static preference override
Clearly, this is a big improvement on the scenario where this
happened with just the static preference rule set and no VPLEX
Witness. In that case, volumes had to be suspended at cluster 1,
because there was no way to tell the difference between a site failure
and a link failure.
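The failure semantics walked through in this section reduce to a simple decision made from the VPLEX Witness server's perspective. The Python sketch below is a conceptual model only, with hypothetical names and no relation to the actual implementation: if the Witness can still reach both clusters while they report losing each other, the failure is a link partition and the static preference is followed; if the Witness can reach only one cluster, the surviving cluster is guided to proceed with I/O even when it is the non-preferred cluster.

# Conceptual model of Witness guidance -- not the actual implementation.
def witness_guidance(can_reach_c1, can_reach_c2, detach_rule):
    """Return guidance for (cluster-1, cluster-2) when the clusters report
    that they have lost contact with each other."""
    preferred = 1 if detach_rule == "cluster-1 detaches" else 2
    if can_reach_c1 and can_reach_c2:
        # Inter-cluster link partition: follow the static preference rule.
        return ("continue" if preferred == 1 else "suspend",
                "continue" if preferred == 2 else "suspend")
    if can_reach_c1:
        return ("continue", "unreachable")   # cluster-2 failed or isolated: override preference
    if can_reach_c2:
        return ("unreachable", "continue")   # cluster-1 failed or isolated: override preference
    return ("unknown", "unknown")            # Witness cannot help; see the dual failure discussion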
CLI example outputs
On systems where VPLEX Witness is deployed and configured, the
VPLEX Witness CLI context appears under the root context as
"cluster-witness." By default, this context is hidden and will not be
visible until VPLEX Witness has been deployed by running the
cluster-witness configure command. Once the user deploys VPLEX
Witness, the VPLEX Witness CLI context becomes visible.
VPlexcli:/> cd cluster-witness/
VPlexcli:/cluster-witness> ls
Attributes:
Name Value
------------- -------------
admin-state enabled
private-ip-address 128.221.254.3
public-ip-address 10.31.25.45
Contexts:
components
VPlexcli:/cluster-witness> ll components/
/cluster-Witness/components:
Name ID Admin State Operational State Mgmt Connectivity
---------- -- ----------- ------------------- -----------------
cluster-1 1 enabled in-contact ok
cluster-2 2 enabled in-contact ok
server - enabled clusters-in-contact ok
VPlexcli:/cluster-Witness> ll components/*
/cluster-Witness/components/cluster-1:
Name Value
----------------------- ------------------------------------------------------
admin-state enabled
diagnostic INFO: Current state of cluster-1 is in-contact (last
state change: 0 days, 13056 secs ago; last message
from server: 0 days, 0 secs ago.)
id 1
management-connectivity ok
operational-state in-contact
/cluster-witness/components/cluster-2:
Name Value
----------------------- ------------------------------------------------------
admin-state enabled
diagnostic INFO: Current state of cluster-2 is in-contact (last
state change: 0 days, 13056 secs ago; last message
from server: 0 days, 0 secs ago.)
id 2
management-connectivity ok
operational-state in-contact
/cluster-Witness/components/server:
Name Value
----------------------- ------------------------------------------------------
admin-state enabled
diagnostic INFO: Current state is clusters-in-contact (last state
change: 0 days, 13056 secs ago.) (last time of
communication with cluster-2: 0 days, 0 secs ago.)
(last time of communication with cluster-1: 0 days, 0
secs ago.)
id -
management-connectivity ok
operational-state clusters-in-contact
Refer to the VPLEX CLI Guide, found on Powerlink, for more details
about the VPLEX Witness CLI.
VPLEX Witness cluster isolation semantics and dual failures
As discussed in the previous section, deploying a VPLEX solution
with VPLEX Witness gives continuous availability to the storage
volumes regardless of whether there is a site failure or an inter-cluster
link failure. These types of failure are deemed single component
failures, and the scenarios above have shown that no single point of
failure can induce data unavailability when using VPLEX Witness.
It should be noted, however, that in rare situations more than one
fault or component outage can occur, especially when considering the
inter-cluster communication links; if two of these links failed at once,
the result would be VPLEX cluster isolation at a given site.
For instance, if you consider a typical VPLEX setup with VPLEX
Witness, you will automatically have three failure domains (this
example will use A, B, and C, where VPLEX cluster 1 resides at A,
VPLEX cluster 2 at B, and the VPLEX Witness server resides at C). In
this case there will be an inter-cluster link between A and B (cluster 1
and 2), plus a management IP link between A and C, as well as a
management IP link between B and C, effectively giving a
triangulated topology.
In rare situations there is a chance that if the link between A and B
failed, followed by a further link failure from either A to C or B to C,
then one of the sites would be isolated (cut off).
Due to the nature of VPLEX Witness, these types of isolation can also
be dealt with effectively without manual intervention.
This is achieved because a site isolation is very similar, in terms of
technical behavior, to a full site outage; the main difference is that the
isolated site is still fully operational and powered up (but needs to be
forced into I/O suspension), unlike a site failure, where the failed site
is not operational.
In these cases the failure semantics with VPLEX Witness are
effectively the same. However, two further actions are taken at the
site that becomes isolated:
I/O is shut off/suspended at the isolated site.
The VPLEX cluster will attempt to call home.
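From the point of view of a single VPLEX cluster, the isolation behavior can be sketched as follows. This is illustrative pseudologic with hypothetical names, not product code: a cluster that can reach neither its peer nor the VPLEX Witness suspends I/O and attempts to call home.

# Illustrative cluster-side isolation behavior -- hypothetical names.
def handle_connectivity_loss(can_reach_peer, can_reach_witness, call_home):
    if not can_reach_peer and not can_reach_witness:
        # Cluster isolation: behave like the failed site in a site outage.
        call_home("isolated: lost peer cluster and VPLEX Witness")
        return "suspend I/O"
    return "await Witness guidance"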
Figure 44 shows the three scenarios that are described above:
Figure 44 Possible dual failure cluster isolation scenarios
As discussed previously, it is extremely rare to experience a double
failure, and Figure 44 showed how VPLEX can automatically ride
through isolation scenarios. However, there are also some other
possible situations where a dual failure could occur and require
manual intervention at one of the VPLEX clusters, as VPLEX Witness
will not be able to distinguish the actual failure.
Note: If best practices are followed, then the likelihood of these scenarios
occurring is significantly less than even the rare isolation incidents discussed
above, mainly because the faults would have to disrupt components in totally
different fault domains that may be spread over many miles.
Figure 45 shows three scenarios where a double failure would require
manual intervention to bring the remaining component online since
VPLEX Witness would not be able to determine the gravity of the
failure.
Figure 45 Highly unlikely dual failure scenarios that require manual intervention
A point to note in the above scenarios is that, for the shown outcomes
to be correct, the failures would have to have happened in a specific
order, where the link to the VPLEX Witness (or the Witness itself) has
failed first, and then either the inter-cluster link or a VPLEX cluster
fails. However, if the order of failure is reversed, then in all three
cases the outcome would be different, since one of the VPLEX clusters
would have remained online for the given distributed volume,
therefore not requiring manual intervention.
This is because once a failure occurs, the VPLEX Witness gives
guidance to the VPLEX clusters. This guidance is sticky: once the
guidance has been provided, the VPLEX Witness is no longer
consulted during any subsequent failure until the system has been
returned to a fully operational state (i.e., it has fully recovered and
connectivity between both clusters and the VPLEX Witness is fully
restored).
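A small sketch of this sticky behavior, again with hypothetical names, may help: once guidance has been handed out it is latched, and only a full recovery of connectivity between both clusters and the Witness clears it.

# Hypothetical sketch of sticky guidance.
class GuidanceLatch:
    def __init__(self):
        self.guidance = None

    def provide(self, new_guidance):
        if self.guidance is None:            # consulted only while no guidance is latched
            self.guidance = new_guidance
        return self.guidance                 # subsequent failures reuse the latched guidance

    def system_fully_recovered(self):
        self.guidance = None                 # cleared only after full connectivity is restored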
VPLEX Witness: the importance of the third failure domain
As discussed in the previous section, dual failures can occur but are
highly unlikely. As also mentioned many times within this TechBook,
it is imperative that if VPLEX Witness is to be deployed then the
VPLEX Witness server component must be installed into a different
failure domain than either of the two VPLEX clusters.
Figure 46 shows two further dual failure scenarios where both a
VPLEX cluster and the VPLEX Witness server have failed.
Figure 46 Two further dual failure scenarios that would require manual
intervention
Again, if best practice is followed and each component resides within
its own fault domain then these two situations are just as unlikely as
the previous three scenarios that required manual intervention.
However, now consider what could happen if the VPLEX Witness
server was not deployed within a third failure domain, but rather in
the same domain as one of the VPLEX clusters.
This situation would mean that a single domain failure would
potentially induce a dual failure as two components may have been
residing in the same failure domain. This effectively turns a highly
unlikely scenario into a more probable single failure scenario and
should be avoided.
By deploying the VPLEX Witness server into a third failure domain,
the dual failure risk is substantially lowered, and manual intervention
is very unlikely to ever be required, since a fault would have to
disable more than one dissimilar component, potentially hundreds of
miles apart, spread over different fault domains.
Note: It is always considered best practice to ensure ESRS and alerting are
fully configured when using VPLEX Witness. This way, if a VPLEX cluster
loses communication with the Witness server, the VPLEX cluster will dial
home and alert. It also ensures that if both VPLEX clusters lose
communication with the Witness, the Witness function can be manually
disabled should the Witness communication outage be expected to last for an
extended time, reducing the risk of data unavailability in the event of an
additional VPLEX cluster failure or WAN partition.
Chapter 6 VPLEX Metro HA
This chapter explains VPLEX Metro HA:
VPLEX Metro HA overview........................................................... 100
VPLEX Metro HA Campus (with cross-connect) ........................ 101
VPLEX Metro HA (without cross-cluster connection) ................ 111
VPLEX Metro HA overview
From a technical perspective, VPLEX Metro HA solutions are
effectively two new flavors of reference architecture that utilize the
new VPLEX Witness feature in VPLEX v5.0. They greatly enhance the
overall solution's ability to tolerate component failures, causing less
disruption than legacy solutions, with little or no human
intervention, over either campus or metro distances.
The two main architecture types enabled by VPLEX Witness are:
VPLEX Metro HA Campus: This is defined for clusters that are
within campus distance (typically < 1ms round trip time, or RTT).
This solution utilizes a cross-connected front-end host path
configuration, giving each host an alternate path to the VPLEX
Metro distributed volume via the remote VPLEX cluster.
VPLEX Metro HA: This is defined for distances larger than
campus but still within synchronous distance (typically higher
than 1ms RTT, but not more than 5ms RTT), where a VPLEX
Metro distributed volume is deployed between two VPLEX
clusters using a VPLEX Witness, but without a cross-connected
host path configuration.
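A trivial sketch of this distinction, using the round-trip-time thresholds quoted above (the helper name is hypothetical):

# Thresholds taken from the definitions above; helper name is hypothetical.
def metro_ha_flavor(rtt_ms):
    if rtt_ms <= 1:
        return "VPLEX Metro HA Campus (cross-connect supported)"
    if rtt_ms <= 5:
        return "VPLEX Metro HA (no cross-cluster connect)"
    return "beyond supported synchronous distance"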
This section looks at each of these solutions in turn and shows how
system uptime can be maximized, by stepping through different
failure scenarios and showing, from both a VPLEX and a host HA
cluster perspective, how the technologies interact with each failure. In
all of the scenarios shown, VPLEX is able to continue servicing I/O
automatically at at least one of the VPLEX clusters with zero data
loss, ensuring that the application or service within the host HA
cluster either simply remains online, fully uninterrupted, or is
restarted elsewhere automatically by the host cluster.
VPLEX Metro HA Campus (with cross-connect)
VPLEX Metro HA campus connect can be deployed when two sites
are within campus distance of each other (up to 1ms round trip
latency). A VPLEX Metro distributed volume can then be deployed
across the two sites using a cross-connected front end configuration
and a VPLEX Witness server installed within a different fault domain.
Figure 47 shows a high level schematic of a Metro HA campus
solution for VMware.
Figure 47 High-level diagram of a Metro HA campus solution for VMware
As can be seen, a single VPLEX cluster is deployed at each site,
connected via an inter-cluster link.
A VPLEX distributed volume has been created across both of the
locations and a vSphere HA cluster instance has been stretched across
both locations using the underlying VPLEX distributed volume.
Also shown in Figure 47 on page 101 are the physical ESX hosts,
which are not only connected to the local VPLEX cluster where they
physically reside, but also have an alternate path to the remote
VPLEX cluster via the additional cross-connect network, which is
physically separate from the VPLEX inter-cluster link connecting the
two VPLEX clusters.
The key benefit of this solution is its ability to minimize or even
eliminate recovery time if components were to fail (including an
entire VPLEX cluster, which would be unlikely since there are no
single points of failure within a VPLEX engine), because the physical
host now has an alternate path to the same storage, actively served
up by the remote VPLEX cluster, which will automatically remain
online due to the VPLEX Witness, regardless of rule set.
The high-level deployment best practices for a cross-connect
configuration are as follows:
At the time of writing, inter-cluster network latency is not to
exceed 1ms round trip time between VPLEX clusters.
VPLEX Witness must be deployed (in a third failure domain)
when using a cross-connect campus configuration.
All remote VPLEX connections should be zoned to the local host as
per EMC best practice, and local host initiators must be registered
to the remote VPLEX. The distributed volume is then exposed
from both VPLEX clusters to the same host. The host should have
a local path preference set, ensuring the remote path will only be
used if the primary one fails, so that no additional latency is
incurred during normal operation (a conceptual sketch of this
path-selection policy follows the note below).
Note: At the time of writing, the only qualified host cluster solutions
that can be configured with the additional VPLEX Metro HA Campus
cross-connect (as opposed to standard VPLEX Metro HA without the
cross-cluster connect) are vSphere versions 4.1 and 5.0, Windows 2008,
and IBM PowerHA 5.4. Be sure to check the latest VPLEX Simple Support
Matrix, located at https://elabnavigator.emc.com, Simple Support Matrix
tab, for the latest support information or submit an RPQ.
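The local-path-preference best practice noted above can be modeled conceptually as follows. This is not the configuration syntax of any particular multipathing product; it is an illustrative sketch, with hypothetical names, of preferring local VPLEX paths and falling back to the cross-connected remote paths only when no local path is alive.

# Conceptual path-selection model -- not actual multipathing software syntax.
def choose_path(paths):
    """paths: list of dicts like {"target": "local" or "remote", "alive": bool}."""
    local_alive = [p for p in paths if p["target"] == "local" and p["alive"]]
    if local_alive:
        return local_alive[0]                # prefer the local VPLEX cluster: no added latency
    remote_alive = [p for p in paths if p["target"] == "remote" and p["alive"]]
    if remote_alive:
        return remote_alive[0]               # cross-connect fallback to the remote cluster
    raise RuntimeError("no paths available to the distributed volume")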
Failure scenarios
For the following failure scenarios, this section assumes that vSphere
5.0 update 1 or above is configured in a stretched HA topology with
DRS, so that all of the physical hosts (ESX servers) are within the same
HA cluster. As discussed previously, this type of configuration brings
the ability to move virtual machines over distance, which is
extremely useful in disaster avoidance, load balancing, and cloud
infrastructure use cases. These use cases are all enabled using
out-of-the-box features and functions; however, additional value can
be derived from deploying the VPLEX Metro HA Campus solution to
ensure total availability for both planned and unplanned events.
High-level recommendations and pre-requisites for stretching
vSphere HA when used in conjunction with VPLEX Metro are as
follows:
A single vCenter instance must span both locations that contain
the VPLEX Metro cluster pairs. (Note: it is recommended this is
virtualized and protected via vSphere heartbeat to ensure restart
in the event of failure)
Must be used in conjunction with a stretched layer 2 network
ensuring that once a VM is moved it still resides on the same
logical network.
vSphere HA can be enabled within the vSphere cluster, but
vSphere Fault Tolerance is not supported (at the time of writing,
though support is planned for late 2012).
Can be used with vSphere DRS. However, careful consideration
should be given to this feature for vSphere versions prior to 5.0
update 1, since certain failure conditions where the VM is running
at the non-preferred site may not invoke a VM failover after
failure, due to a problem where the ESX server does not detect a
storage Persistent Device Loss (PDL) state. This can lead to the
VM remaining online but intermittently unresponsive (also known
as a zombie VM). Manual intervention would be required in this
scenario.
Note: This can be avoided by using a VPLEX HA Campus solution with a
cross-cluster connect on a separate physical network to the VPLEX
inter-cluster link. This will ensure that an active path is always present to
the storage no matter where the VM is running. Another way to avoid
this would be to use host affinity groups (where supported). It is
recommended to upgrade to the latest ESX and vSphere versions (5.0
update 1 and above) to avoid these conditions.
For detailed setup instructions and best practice planning for a
stretched HA vSphere environment, refer to White Paper: Using
VMware vSphere with EMC VPLEX Best Practices Planning which
can be found at http://powerlink.emc.com under Support >
Technical Documentation and Advisories > Hardware Platforms >
VPLEX Family > White Papers.
Figure 48 shows the topology of a Metro HA campus environment
divided up into logical fault domains.
Figure 48 Metro HA campus diagram with failure domains
The following sections will demonstrate the recovery automation for
a single failure within any of these domains and show how no single
fault in any domain can take down the system as a whole, and in
most cases without an interruption of service.
If a physical host failure were to occur in either domain A1 or B1 the
VMware HA cluster would restart the affected virtual machines on
the remaining ESX servers.
Example 1: Figure 49 shows all physical ESX hosts failing in domain A1.
Figure 49 Metro HA campus diagram with disaster in zone A1
Since all of the physical hosts in domain B1 are connected to the same
datastores via the VPLEX Metro distributed device, VMware HA can
restart the virtual machines on any of the physical ESX hosts in
domain B1.
Example 2: The next example describes what will happen in the unlikely event
that a VPLEX cluster was to fail in either domain A2 or B2.
Note: This failure condition is considered unlikely since it would constitute a
dual failure as a VPLEX cluster has no single points of failure.
In this instance there would be no interruption of service to any of the
virtual machines.
Figure 50 shows a full VPLEX cluster outage in domain A2.
Figure 50 Metro HA campus diagram with failure in zone A2
Since the ESX servers are cross-connected to both VPLEX clusters in
each site, ESX will simply re-route the I/O to the alternate path,
which is still available since VPLEX is configured with a VPLEX
Witness protected distributed volume. This ensures the distributed
volume remains online in domain B2: the VPLEX Witness Server
observes that it cannot communicate with the VPLEX cluster in A2
and guides the VPLEX cluster in B2 to remain online, as B2 also
cannot communicate with A2, meaning A2 has either become isolated
or failed.
Note: Similarly, in the event of a full isolation of A2, the distributed volumes
would simply suspend at A2 since communication would not be possible to
either the VPLEX Witness Server or the VPLEX cluster in domain B2. In this
case, the outcome is identical from a vSphere perspective and there will be no
interruption, since I/O would be re-directed across the cross-connect to
domain B2, where the distributed volume would remain online and available
to service I/O.
Example 3 The following example describes what will happen in the event of a
failure of one (or all) of the back-end storage arrays in either domain
A3 or B3. Again, in this instance there would be no interruption to
any of the virtual machines.
Figure 51 shows the failure of all storage arrays that reside in domain
A3. Since a cache coherent VPLEX Metro distributed volume is
configured between domains A2 and B2, I/O can continue to be
actively serviced from the VPLEX in A2 even though the local
back-end storage has failed. This is due to the embedded VPLEX cache
coherency, which will efficiently cache any reads into the A2 domain
whilst also propagating writes to the back-end storage in domain B3
via the remote VPLEX cluster in site B2.
Figure 51 Metro HA campus diagram with failure in zone A3 or B3
Example 4 The next example describes what will happen in the event of a
VPLEX Witness server failure in domain C1.
Again, in this instance there would be no interruption to any of the
virtual machines or VPLEX clusters. Figure 52 shows a complete
failure of domain C1, where the VPLEX Witness server resides. Since
the VPLEX Witness is not within the I/O path and is only an optional
component, I/O will actively continue for any distributed volume in
domains A2 and B2 because the inter-cluster link is still available,
meaning cache coherency can be maintained between the VPLEX
cluster domains.
Although the service is uninterrupted, both VPLEX clusters will now
dial home and indicate that they have lost communication with the
VPLEX Witness Server, because a further failure of either of the
VPLEX clusters in domains A2 and B2, or of the inter-cluster link,
would now cause data unavailability. This risk is heightened should
the VPLEX Witness server be offline for an extended duration. To
remove this risk, the Witness feature may be disabled manually,
allowing the VPLEX clusters to follow the static preference rules (a
sketch of this procedure follows Figure 52).
Figure 52 Metro HA campus diagram with failure in zone C1
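The following is a minimal VPlexcli sketch of that procedure, assuming a
GeoSynchrony 5.x command set. The context path and command names are
indicative only and should be confirmed against the CLI guide for the release
in use before acting on a live system.

VPlexcli:/> ll /cluster-witness/**      # inspect the Witness admin and operational state
VPlexcli:/> cluster-witness disable     # suspend Witness guidance during the extended outage;
                                        # the clusters now follow their static preference rules
(... later, once the Witness server in domain C1 has been restored ...)
VPlexcli:/> cluster-witness enable      # resume Witness protection
VPlexcli:/> ll /cluster-witness/**      # confirm the Witness is back in contact with both clusters

Re-enabling the Witness as soon as its server is reachable again is important,
because while it is disabled an inter-cluster partition will be resolved purely
by the static preference rules.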
Example 5 The next example describes what will happen in the event of a failure
of the inter-cluster link between domains A2 and B2.
Again, in this instance there would be no interruption to any of the
virtual machines or VPLEX clusters.
Figure 53 shows the inter-cluster link failed between domains A2
and B2. In this instance the static preference rule set which was
defined previously will be invoked, since neither VPLEX cluster can
communicate with the other VPLEX cluster (but the VPLEX Witness
Server can communicate with both VPLEX clusters). Therefore, access
to the given distributed volume within one of the domains A2 or B2
will be suspended. Since in this example the cross-connect network is
physically separate from the inter-cluster link, the alternate paths are
still available to the remote VPLEX cluster where the volume remains
online. ESX will therefore simply re-route the traffic to the alternate
VPLEX cluster, meaning the virtual machine will remain online and
unaffected whichever site it was running on. (A CLI sketch of how
this preference is assigned follows Figure 53.)
Figure 53 Metro HA campus diagram with intersite link failure
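As a hedged illustration, the static preference (rule set) for a distributed
device can be inspected and changed from the VPlexcli. The device name
below is hypothetical, and the attribute and rule-set names follow the
GeoSynchrony 5.x conventions (cluster-1-detaches / cluster-2-detaches);
verify the exact syntax against the CLI guide for the release in use.

VPlexcli:/> ll /distributed-storage/distributed-devices/dd_vmfs_01
            # the rule-set-name attribute shows which cluster is preferred
            # (wins) for this device on an inter-cluster link failure
VPlexcli:/> set /distributed-storage/distributed-devices/dd_vmfs_01::rule-set-name cluster-1-detaches
            # make cluster-1 the preferred cluster for this distributed device

With VPLEX Witness deployed, this preference only comes into play when the
two clusters lose contact with each other while both can still reach the
Witness, exactly as in this example.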
Note: It is plausible in this example that the alternate path is physically
routed across the same ISL that has failed. In this instance there could be a
small interruption if a virtual machine was running in A1, as it will be
restarted in B1 (by the host cluster) since the alternate path is also dead.
However, it is also possible with vSphere versions prior to 5.0 update 1 that
the guest OS will simply hang and vSphere HA will not be prompted to
restart it. Although this is beyond the scope of the TechBook, to avoid any
disruption at all for any host cluster environment, EMC suggests that the
network used for the cross-cluster connect be physically separate from the
VPLEX inter-cluster link, avoiding this potential problem altogether. Refer to
the VPLEX Metro HA (non-campus, no cross-connect) scenarios in the next
section for more details on a full cluster partition, as well as Appendix A,
vSphere 5.0 Update 1 Additional Settings, for the vSphere HA configuration
settings that are applicable to ESX implementations at version 5.0 update 1
and later to avoid this problem.
VPLEX Metro HA (without cross-cluster connection)
A VPLEX Metro HA (without cross-cluster connection) deployment is
very similar to the Metro HA campus deployment described in the
previous section; however, this solution is designed to cover
distances beyond the campus range and into the metropolitan range,
where round-trip latency would be beyond 1 ms and up to 5 ms. A
VPLEX Metro distributed volume is deployed across the two sites,
and a VPLEX Witness server is deployed within a separate, third
failure/fault domain.
Figure 54 shows a high-level schematic of a Metro HA solution for
vSphere without the cross-cluster deployment.
Figure 54 Metro HA Standard High-level diagram
As can be seen, a single VPLEX cluster is deployed at each site,
connected via an inter-cluster link.
A VPLEX distributed volume has been created across both locations,
and a vSphere HA cluster instance has been stretched across both
locations using the underlying VPLEX distributed volume.
Also shown in Figure 54 on page 111 are the physical ESX hosts,
which are now connected only to the local VPLEX cluster at their site
since the cross-connect does not exist.
The key benefit of this solution is its ability to minimize, and in many
cases eliminate, recovery time if components fail, because the host
cluster is connected to a VPLEX distributed volume which is actively
serving the same block data via both VPLEX clusters and, thanks to
the VPLEX Witness, will remain online via at least one VPLEX cluster
under any single failure event regardless of rule set.
Note: At the time of writing, a larger number of host-based cluster solutions
are supported with VPLEX Metro HA than with VPLEX Metro HA campus
(with the cross-connect). Although this next section only discusses vSphere
with VPLEX Metro HA, other supported host clusters include Microsoft
Hyper-V, Oracle RAC, PowerHA, Serviceguard, and so on. Be sure to check
the latest VPLEX simple support matrix found at
https://elabnavigator.emc.com, Simple Support Matrix tab, for the latest
support information.
Failure scenarios
As with the previous section, when deploying a stretched vSphere
configuration with VPLEX Metro HA it is also possible to enable
long-distance vMotion (virtual machine teleportation), since the ESX
datastore resides on a VPLEX Metro distributed volume and therefore
exists in two places at the same time, exactly as described in the
previous section.
Again, for these failure scenarios, this section assumes that vSphere
version 5.0 update 1 or higher is configured in a stretched HA
topology so that all of the physical hosts (ESX servers) at both sites
are within the same HA cluster.
Since the majority of the failure scenarios behave identically to the
cross-connect configuration, this section only shows the two failure
scenarios where the outcomes differ slightly from the previous
section.
Note: For detailed setup instructions and best practice planning for a
stretched HA vSphere environment, please read the white paper Using
VMware vSphere with EMC VPLEX Best Practices Planning, which can be
found on Powerlink (http://powerlink.emc.com) under Home > Support >
Technical Documentation and Advisories > Hardware/Platforms
Documentation > VPLEX Family > White Papers.
Figure 55 shows the topology of a Metro HA environment divided
into logical fault domains. The next sections demonstrate the
recovery automation for single failures within any of these domains.
Figure 55 Metro HA high-level diagram with fault domains
Example 1 The following example describes what will happen in the unlikely
event that a VPLEX cluster were to fail in domain A2. In this instance
there would be no interruption of service to any virtual machines
running in domain B1; however, any virtual machines that were
running in domain A1 would see a minor interruption as they are
restarted at B1.
Figure 56 shows a full VPLEX cluster outage in domain A2.
Figure 56 Metro HA high-level diagram with failure in domain A2
Note: This failure condition is considered unlikely since it would constitute a
dual failure, as a VPLEX cluster has no single point of failure.
As can be seen in the graphic, since the ESX servers are not
cross-connected to the remote VPLEX cluster, the ESX servers in
domain A1 will lose access to the storage, causing the HA host cluster
(in this case vSphere) to perform an HA restart of the virtual machines
that were running in domain A1. It can do this because the
distributed volumes will remain active at B2: VPLEX is configured
with a VPLEX Witness protected distributed volume, and the Witness
will deduce that the VPLEX cluster in domain A2 is unavailable
(since neither the VPLEX Witness Server nor the VPLEX cluster in B2
can communicate with the VPLEX cluster in A2), so VPLEX Witness
will guide the VPLEX cluster in B2 to remain online.
Note: It is important to understand that this failure is deemed a dual failure,
and that at the time of writing it is possible with vSphere (all versions,
including 5.0 update 1) that the guest VMs in Site A will simply hang in this
situation (this is known as a zombie VM) and VMware HA will not be
prompted to restart them (all other supported HA clusters would detect this
failure and perform a restart under these conditions). Although this is
beyond the scope of the TechBook, manual intervention may be required here
to resume the VMs at the remote VPLEX cluster, which automatically
remains online regardless of rule sets due to the VPLEX Witness. The reason
for this is that if the VPLEX cluster is totally disconnected from the hosts,
the hosts are unable to receive the PDL (Persistent Device Loss) status issued
via the VPLEX; vSphere therefore only sees this as an APD (All Paths Down)
state and in most cases will wait for the device to be brought back online
without failing the VM.
Example 2 The next example describes what will happen in the event of a failure
of the inter-cluster link between domains A2 and B2.
One of two outcomes of this scenario will happen:
VMs running at the preferred site - If the static preference for a
given distributed volume was set to cluster 1 detaches (assuming
cluster 1 resides in domain A2) and the virtual machine was
running at the same site where the volume remains online (that is,
the preferred site), then there is no interruption to service.
VMs running at the non-preferred site - If the static preference
for a given distributed volume was set to cluster 1 detaches
(assuming cluster 1 resides in domain A2) and the virtual
machine was running at the remote site (domain B1), then the
VM's storage will be put into the suspended state (PDL). In this
case the guest operating system will fail, allowing the virtual
machine to be automatically restarted in domain A1.
Figure 57 shows the link has failed between domains A2 and B2.
Figure 57 Metro HA high-level diagram with intersite failure
In this instance the static preference rule set, which was previously
defined as cluster 1 detaches, will be invoked since neither VPLEX
cluster can communicate with the other VPLEX cluster (but the
VPLEX Witness Server can communicate with both VPLEX clusters).
Access to the given distributed volume will therefore be suspended
in domain B2 whilst remaining active at A2.
Virtual machines that were running at A1 will be uninterrupted and
virtual machines that were running at B1 will be restarted at A1.
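For reference, a quick way to confirm from an affected ESXi 5.x host that the
suspended volume is being reported as PDL (rather than APD) is to look for
the PDL sense data in the vmkernel log. This is a hedged example only; the
device identifier shown is a placeholder and the exact log format can vary by
ESXi build.

~ # grep "0x5 0x25 0x0" /var/log/vmkernel.log
    # sense data 0x5/0x25/0x0 (ILLEGAL REQUEST / LOGICAL UNIT NOT
    # SUPPORTED) is the signature vSphere uses to declare a PDL condition
~ # esxcli storage core device list -d naa.6000144xxxxxxxxxxxxxxxxxxxxxxxxx
    # lists the operational state of the (placeholder) VPLEX distributed
    # volume so the PDL condition can be confirmed for that device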
Note: Similar to the previous note, and though it is beyond the scope of this
TechBook, vSphere HA versions prior to 5.0 update 1 may not detect this
condition and may not restart the VM if it was running at the non-preferred
site. To avoid any disruption when using vSphere in this type of
configuration (for versions prior to 5.0 update 1), VMware DRS host affinity
rules can be used (where supported) to ensure that virtual machines are
always running in their preferred location (that is, the location that the
storage they rely on is biased towards). Another way to avoid this scenario is
to disable DRS altogether and use vSphere HA only, or to use a cross-connect
configuration deployed across a separate physical network as discussed in
the previous section. See Appendix A, vSphere 5.0 Update 1 Additional
Settings, for the vSphere HA configuration settings that are applicable to
ESX implementations at version 5.0 update 1 and later to avoid this problem.
The remaining failure scenarios with this solution are identical to the
previously discussed VPLEX Metro HA campus solutions. For failure
handling in domains A1, B1, A3, B3, or C1, see VPLEX Metro HA
Campus (with cross-connect) on page 101.
Chapter 7 Conclusion
This chapter provides a conclusion to the VPLEX solutions outlined
in this TechBook.
Conclusion
As outlined in this book, using VPLEX AccessAnywhere technology
in combination with high availability and VPLEX Witness, storage
administrators and data center managers are able to provide
absolute physical and logical high availability for their organizations'
mission-critical applications with less resource overhead and less
dependency on manual intervention. Increasingly, those mission
critical applications are virtualized, in most cases using VMware
vSphere or Microsoft Hyper-V virtual machine technologies. It is
expected that VPLEX customers will use the HA / VPLEX Witness
solution alongside several application-specific clustering and
virtualization technologies to provide HA benefits for targeted
mission-critical applications.
As described, the storage administrator is provided with two specific
VPLEX Metro-based high availability solutions, outlined specifically
for VMware ESX 4.1 or higher: VPLEX Metro HA Campus (with
cross-cluster connect) and standard (non-campus) Metro HA
environments. VPLEX Metro HA Campus provides a higher level of
HA than the VPLEX Metro HA deployment without cross-cluster
connectivity; however, it is limited to in-data-center use or cases
where the network latency between data centers is negligible.
Both solutions are ideal for customers who are highly virtualized, or
planning to become so, and who are looking for the following:
Elimination of the night shift storage and server administrator
positions. To accomplish this, they must be comfortable that their
applications will ride through any failures that happen during
the night.
Reduction of capital expenditures by moving from an
active/passive data center replication model to a fully active,
highly available data center model.
Increased application availability by protecting against flood and
fire disasters that could affect an entire data center.
Taking a holistic view of both types of solution and what they
provide the storage administrator, the benefits are largely common to
both, with some variances. EMC VPLEX technology with VPLEX
Witness provides consumers with the following, each discussed briefly:
Better protection from storage-related failures on page 121
Protection from a larger array of possible failures on page 121
Greater overall resource utilization on page 122
Better protection from storage-related failures
Within a data center, applications are typically protected against
storage-related failures through the use of multipathing software
such as EMC PowerPath. This allows applications to ride through
HBA failures, switch failures, cable failures, or storage array
controller failures by routing I/O around the location of the failure.
The VPLEX Metro HA cross-cluster connect solution extends this
protection to the rack and/or data center level by multipathing
between VPLEX clusters in independent failure domains. The VPLEX
Metro HA solution adds to this the ability to restart the application in
the other data center in case no alternative route for the I/O exists in
its current data center. As an example, if a fire were to affect an entire
VPLEX rack, the application could be restarted in the backup data
center automatically. This provides customers with a much higher
level of availability and a lower level of risk.
Protection from a larger array of possible failures
To highlight the advantages of VPLEX Witness functionality, let's
recall how VMware HA operates.
VMware HA and other offerings provide automatic restart of virtual
machines (applications) in the event of virtual machine failure for any
reason (server failure, failed connection to storage, etc.). This restart
involves a complete boot-up of the virtual machine's guest operating
system and applications. While a VM failure leads to an outage, the
recovery from that failure is usually automatic.
When combined with VPLEX in the Metro HA configuration, the
same level of protection is provided for data-center-scale disaster
scenarios.
Greater overall resource utilization
Turning from recovery capabilities to utilization, VMware DRS
(Distributed Resource Scheduler) can automatically move
applications between servers in order to balance their computational
and memory load over all the available servers. Within a data center,
this has increased server utilization because administrators no longer
need to size individual servers to the applications that will run on
them. Instead, they can size the entire data center to the suite of
applications that will run within it.
By adding HA configuration (Metro and Campus), the available pool
of server resources now covers both the primary and backup data
centers. Both can actively be used and excess compute capacity in one
data center can be used to satisfy new demands in the other.
Alternative vendor solutions include Microsoft Hyper-V Server
2008 R2 with Performance and Resource Optimization (PRO).
Overall, as data centers continue their expected growth and storage
administrators struggle to expand capacity and consolidate at the
same time, introducing EMC VPLEX can reduce several areas of
concern. To recap, these areas are:
Hardware and component failures impacting data consistency
System integrity
High availability without manual intervention
Witness to protect the entire highly available system
In reality, by reducing inter-site overhead and dependencies on
disaster recovery, administrators can depend on VPLEX to keep
their data available at any time while the beepers and cell phones
stay silent.
Appendix A vSphere 5.0 Update 1 Additional Settings
This appendix contains additional settings needed for vSphere 5.0
update 1.
vSphere 5.0 update 1
As discussed in previous sections, vSphere HA does not
automatically recognize that a SCSI PDL (Persistent Device Loss)
state should cause an HA failover of the affected VM.
Clearly, this may not be desirable when using vSphere HA with
VPLEX in a stretched cluster configuration. Therefore, it is important
to configure vSphere so that if the VPLEX WAN is partitioned and a
VM happens to be running at the non-preferred site (that is, the
storage device is put into a PDL state), this condition is recognized
and the steps required to perform an HA failover are invoked.
ESX and vSphere versions prior to 5.0 update 1 have no ability to act
on a SCSI PDL status; the affected VM will therefore typically hang
(that is, continue to be alive but in an unresponsive state).
However, vSphere 5.0 update 1 and later can act on the SCSI PDL
state by powering off the VM, which in turn will invoke an HA
failover.
To ensure that the VM behaves in this way, additional settings within
the vSphere cluster are required. At the time of this writing, the
settings are:
1. Using vSphere Client, select the cluster, right-click and select
Edit Settings. From the pop-up menu select vSphere HA, then
click Advanced Options. Define and save the option:
das.maskCleanShutdownEnabled=true
2. On every ESXi server, edit /etc/vmware/settings (for example,
with vi) so that it contains the content below, then reboot the
ESXi server.
The following output shows the correct setting applied in the file:
~ # cat /etc/vmware/settings
disk.terminateVMOnPDLDefault=TRUE
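As a convenience, the change in step 2 can also be made non-interactively
from the ESXi shell. This is a minimal sketch assuming SSH or local Tech
Support Mode access is available; back up the file first and confirm the
procedure against the ESX documentation for your build.

~ # cp /etc/vmware/settings /etc/vmware/settings.bak   # keep a copy of the original file
~ # echo "disk.terminateVMOnPDLDefault=TRUE" >> /etc/vmware/settings
~ # cat /etc/vmware/settings                           # verify the entry is present
disk.terminateVMOnPDLDefault=TRUE
~ # reboot                                             # the setting takes effect after the reboot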
Refer to the ESX documentation for further details.
Glossary
This glossary contains terms related to VPLEX federated storage
systems. Many of these terms are used in this manual.
A
AccessAnywhere The breakthrough technology that enables VPLEX clusters to provide
access to information between clusters that are separated by distance.
active/active A cluster with no primary or standby servers, because all servers can
run applications and interchangeably act as backup for one another.
active/passive A powered component that is ready to operate upon the failure of a
primary component.
array A collection of disk drives where user data and parity data may be
stored. Devices can consist of some or all of the drives within an
array.
asynchronous Describes objects or events that are not coordinated in time. A process
operates independently of other processes, being initiated and left for
another task before being acknowledged.
For example, a host writes data to the blades and then begins other
work while the data is transferred to a local disk and across the WAN
asynchronously. See also synchronous.
B
bandwidth The range of transmission frequencies a network can accommodate,
expressed as the difference between the highest and lowest
frequencies of a transmission cycle. High bandwidth allows fast or
high-volume transmissions.
bias When a cluster has the bias for a given DR1, it will remain online if
connectivity is lost to the remote cluster (in some cases this may be
overruled by VPLEX Cluster Witness). This is now known as
preference.
bit A unit of information that has a binary digit value of either 0 or 1.
block The smallest amount of data that can be transferred following SCSI
standards, which is traditionally 512 bytes. Virtual volumes are
presented to users as contiguous lists of blocks.
block size The actual size of a block on a device.
byte Memory space used to store eight bits of data.
C
cache Temporary storage for recent writes and recently accessed data. Disk
data is read through the cache so that subsequent read references are
found in the cache.
cache coherency Managing the cache so data is not lost, corrupted, or overwritten.
With multiple processors, data blocks may have several copies, one in
the main memory and one in each of the cache memories. Cache
coherency propagates the blocks of multiple users throughout the
system in a timely fashion, ensuring the data blocks do not have
inconsistent versions in the different processors' caches.
cluster Two or more VPLEX directors forming a single fault-tolerant cluster,
deployed as one to four engines.
cluster ID The identifier for each cluster in a multi-cluster deployment. The ID
is assigned during installation.
cluster deployment ID A numerical cluster identifier, unique within a VPLEX cluster. By
default, VPLEX clusters have a cluster deployment ID of 1. For
multi-cluster deployments, all but one cluster must be reconfigured
to have different cluster deployment IDs.
clustering Using two or more computers to function together as a single entity.
Benefits include fault tolerance and load balancing, which increases
reliability and up time.
COM The intra-cluster communication (Fibre Channel). The
communication used for cache coherency and replication traffic.
command line
interface (CLI)
A way to interact with a computer operating system or software by
typing commands to perform specific tasks.
continuity of
operations (COOP)
The goal of establishing policies and procedures to be used during an
emergency, including the ability to process, store, and transmit data
before and after.
controller A device that controls the transfer of data to and from a computer and
a peripheral device.
D
data sharing The ability to share access to the same data with multiple servers
regardless of time and location.
detach rule A rule set applied to a DR1 to declare a winning and a losing cluster
in the event of a failure.
device A combination of one or more extents to which you add specific
RAID properties. Devices use storage from one cluster only;
distributed devices use storage from both clusters in a multi-cluster
plex. See also distributed device.
director A CPU module that runs GeoSynchrony, the core VPLEX software.
There are two directors in each engine, and each has dedicated
resources and is capable of functioning independently.
dirty data The write-specific data stored in the cache memory that has yet to be
written to disk.
disaster recovery (DR) The ability to restart system operations after an error, preventing data
loss.
disk cache A section of RAM that provides cache between the disk and the CPU.
RAM's access time is significantly faster than disk access time;
therefore, a disk-caching program enables the computer to operate
faster by placing recently accessed data in the disk cache.
distributed device A RAID 1 device whose mirrors are in Geographically separate
locations.
distributed file system
(DFS)
Supports the sharing of files and resources in the form of persistent
storage over a network.
Distributed RAID1
device (DR1)
A cache coherent VPLEX Metro or Geo volume that is distributed
between two VPLEX Clusters
E
engine Enclosure that contains two directors, management modules, and
redundant power.
Ethernet A Local Area Network (LAN) protocol. Ethernet uses a bus topology,
meaning all devices are connected to a central cable, and supports
data transfer rates of between 10 megabits per second and 10 gigabits
per second. For example, 100 Base-T supports data transfer rates of
100 Mb/s.
event A log message that results from a significant action initiated by a user
or the system.
extent A slice (range of blocks) of a storage volume.
F
failover Automatically switching to a redundant or standby device, system,
or data path upon the failure or abnormal termination of the
currently active device, system, or data path.
fault domain A concept where each component of an HA solution is separated by a
logical or physical boundary so that if a fault happens in one domain it
will not transfer to the other. The boundary can represent any item
which could fail (for example, with separate power domains, power
would remain in the second domain if it failed in the first
domain).
fault tolerance Ability of a system to keep working in the event of hardware or
software failure, usually achieved by duplicating key system
components.
Fibre Channel (FC) A protocol for transmitting data between computer devices. Longer
distance requires the use of optical fiber; however, FC also works
using coaxial cable and ordinary telephone twisted pair media. Fibre
channel offers point-to-point, switched, and loop interfaces. Used
within a SAN to carry SCSI traffic.
field replaceable unit
(FRU)
A unit or component of a system that can be replaced on site as
opposed to returning the system to the manufacturer for repair.
firmware Software that is loaded on and runs from the flash ROM on the
VPLEX directors.
G
Geographically
distributed system
A system physically distributed across two or more Geographically
separated sites. The degree of distribution can vary widely, from
different locations on a campus or in a city to different continents.
Geoplex A DR1 device configured for VPLEX Geo
gigabit (Gb or Gbit) 1,073,741,824 (2^30) bits. Often rounded to 10^9.
gigabit Ethernet The version of Ethernet that supports data transfer rates of 1 Gigabit
per second.
gigabyte (GB) 1,073,741,824 (2^30) bytes. Often rounded to 10^9.
global file system
(GFS)
A shared-storage cluster or distributed file system.
H
host bus adapter
(HBA)
An I/O adapter that manages the transfer of information between the
host computer's bus and memory system. The adapter performs many
low-level interface functions automatically or with minimal processor
involvement to minimize the impact on the host processor's
performance.
I
input/output (I/O) Any operation, program, or device that transfers data to or from a
computer.
internet Fibre Channel
protocol (iFCP)
Connects Fibre Channel storage devices to SANs or the Internet in
Geographically distributed systems using TCP.
intranet A network operating like the World Wide Web but with access
restricted to a limited group of authorized users.
internet small
computer system
interface (iSCSI)
A protocol that allows commands to travel through IP networks,
which carries data from storage units to servers anywhere in a
computer network.
I/O (input/output) The transfer of data to or from a computer.
K
kilobit (Kb) 1,024 (2^10) bits. Often rounded to 10^3.
kilobyte (K or KB) 1,024 (2^10) bytes. Often rounded to 10^3.
L
latency The amount of time required to fulfill an I/O request.
load balancing Distributing the processing and communications activity evenly
across a system or network so no single device is overwhelmed. Load
balancing is especially important when the number of I/O requests
issued is unpredictable.
local area network
(LAN)
A group of computers and associated devices that share a common
communications line and typically share the resources of a single
processor or server within a small Geographic area.
logical unit number
(LUN)
Used to identify SCSI devices, such as external hard drives,
connected to a computer. Each device is assigned a LUN number
which serves as the device's unique address.
M
megabit (Mb) 1,048,576 (2^20) bits. Often rounded to 10^6.
megabyte (MB) 1,048,576 (2^20) bytes. Often rounded to 10^6.
metadata Data about data, such as data quality, content, and condition.
metavolume A storage volume used by the system that contains the metadata for
all the virtual volumes managed by the system. There is one metadata
storage volume per cluster.
Metro-Plex Two VPLEX Metro clusters connected within metro (synchronous)
distances, approximately 60 miles or 100 kilometers.
metroplex A DR1 device configured for VPLEX Metro
mirroring The writing of data to two or more disks simultaneously. If one of the
disk drives fails, the system can instantly switch to one of the other
disks without losing data or service. RAID 1 provides mirroring.
miss An operation where the cache is searched but does not contain the
data, so the data instead must be accessed from disk.
N
namespace A set of names recognized by a file system in which all names are
unique.
network System of computers, terminals, and databases connected by
communication lines.
network architecture Design of a network, including hardware, software, method of
connection, and the protocol used.
network-attached
storage (NAS)
Storage elements connected directly to a network.
network partition When one site loses contact or communication with another site.
P
parity The even or odd number of 0s and 1s in binary code.
parity checking Checking for errors in binary data. Depending on whether the byte
has an even or odd number of bits, an extra 0 or 1 bit, called a parity
bit, is added to each byte in a transmission. The sender and receiver
agree on odd parity, even parity, or no parity. If they agree on even
parity, a parity bit is added that makes each byte even. If they agree
on odd parity, a parity bit is added that makes each byte odd. If the
data is transmitted incorrectly, the change in parity will reveal the
error.
partition A subdivision of a physical or virtual disk, which is a logical entity
only visible to the end user, not any of the devices.
plex A single VPLEX cluster.
preference When a cluster has the preference for a given DR1, it will remain
online if connectivity is lost to the remote cluster (in some cases this
may be overruled by VPLEX Cluster Witness). This was previously
known as bias.
R
RAID The use of two or more storage volumes to provide better
performance, error recovery, and fault tolerance.
RAID 0 A performance-orientated striped or dispersed data mapping
technique. Uniformly sized blocks of storage are assigned in regular
sequence to all of the array's disks. Provides high I/O performance at
low inherent cost. No additional disks are required. The advantages
of RAID 0 are a very simple design and an ease of implementation.
RAID 1 Also called mirroring, this has been used longer than any other form
of RAID. It remains popular because of simplicity and a high level of
data availability. A mirrored array consists of two or more disks. Each
disk in a mirrored array holds an identical image of the user data.
RAID 1 has no striping. Read performance is improved since either
disk can be read at the same time. Write performance is lower than
single disk storage. Writes must be performed on all disks, or mirrors,
in the RAID 1. RAID 1 provides very good data reliability for
read-intensive applications.
RAID leg A copy of data, called a mirror, that is located at a user's current
location.
rebuild The process of reconstructing data onto a spare or replacement drive
after a drive failure. Data is reconstructed from the data on the
surviving disks, assuming mirroring has been employed.
redundancy The duplication of hardware and software components. In a
redundant system, if a component fails then a redundant component
takes over, allowing operations to continue without interruption.
reliability The ability of a system to recover lost data.
remote direct
memory access
(RDMA)
Allows computers within a network to exchange data using their
main memories and without using the processor, cache, or operating
system of either computer.
Recovery Point
Objective (RPO)
The amount of data that can be lost before a given failure event.
Recovery Time
Objective (RTO)
The amount of time the service takes to fully recover after a failure
event.
S
scalability Ability to easily change a system in size or configuration to suit
changing conditions, to grow with your needs.
simple network
management
protocol (SNMP)
Monitors systems and devices in a network.
site ID The identifier for each cluster in a multi-cluster plex. By default, in a
non-Geographically distributed system the ID is 0. In a
Geographically distributed system, one cluster's ID is 1, the next is 2,
and so on, each number identifying a physically separate cluster.
These identifiers are assigned during installation.
small computer
system interface
(SCSI)
A set of evolving ANSI standard electronic interfaces that allow
personal computers to communicate faster and more flexibly than
previous interfaces with peripheral hardware such as disk drives,
tape drives, CD-ROM drives, printers, and scanners.
split brain Condition when a partitioned DR1 accepts writes from both clusters.
This is also known as a conflicting detach.
storage RTO The amount of time taken for the storage to be available after a failure
event (in all cases this will be a smaller time interval than the RTO,
since the storage is a prerequisite).
stripe depth The number of blocks of data stored contiguously on each storage
volume in a RAID 0 device.
striping A technique for spreading data over multiple disk drives. Disk
striping can speed up operations that retrieve data from disk storage.
Data is divided into units and distributed across the available disks.
RAID 0 provides disk striping.
storage area network
(SAN)
A high-speed special purpose network or subnetwork that
interconnects different kinds of data storage devices with associated
data servers on behalf of a larger network of users.
storage view A combination of registered initiators (hosts), front-end ports, and
virtual volumes, used to control a host's access to storage.
storage volume A LUN exported from an array.
synchronous Describes objects or events that are coordinated in time. A process is
initiated and must be completed before another task is allowed to
begin.
For example, in banking two withdrawals from a checking account
that are started at the same time must not overlap; therefore, they are
processed synchronously. See also asynchronous.
T
throughput 1. The number of bits, characters, or blocks passing through a data
communication system or portion of that system.
2. The maximum capacity of a communications channel or system.
3. A measure of the amount of work performed by a system over a
period of time. For example, the number of I/Os per day.
tool command
language (TCL)
A scripting language often used for rapid prototypes and scripted
applications.
transmission control
protocol/Internet
protocol (TCP/IP)
The basic communication language or protocol used for traffic on a
private network and the Internet.
U
uninterruptible power
supply (UPS)
A power supply that includes a battery to maintain power in the
event of a power failure.
universal unique
identifier (UUID)
A 64-bit number used to uniquely identify each VPLEX director. This
number is based on the hardware serial number assigned to each
director.
V
virtualization A layer of abstraction implemented in software that servers use to
divide available physical storage into storage volumes or virtual
volumes.
virtual volume A virtual volume looks like a contiguous volume, but can be
distributed over two or more storage volumes. Virtual volumes are
presented to hosts.
VPLEX Cluster Witness A new feature in VPLEX V5.x that can augment and improve upon
the failure handling semantics of static preference (detach rules).
W
wide area network
(WAN)
A Geographically dispersed telecommunications network. This term
distinguishes a broader telecommunication structure from a local
area network (LAN).
world wide name
(WWN)
A specific Fibre Channel Name Identifier that is unique worldwide
and represented by a 64-bit unsigned binary value.
write-through mode A caching technique in which the completion of a write request is
communicated only after data is written to disk. This is almost
equivalent to non-cached systems, but with data protection.